SEML: A Semi-Supervised Multi-Task Learning Framework for Aspect-Based Sentiment Analysis

Aspect-Based Sentiment Analysis (ABSA) involves two sub-tasks, namely Aspect Mining (AM) and Aspect Sentiment Classification (ASC), which aim to extract the words describing aspects of a reviewed entity (e.g., a product or service) and to analyze the sentiments expressed on those aspects. As AM and ASC can each be formulated as a sequence labeling problem that predicts the aspect or sentiment label of each word in a review, supervised deep sequence learning models have recently achieved the best performance. However, these supervised models require a large number of labeled reviews, which are very costly or unavailable, and they usually perform only one of the two sub-tasks, which limits their practical use. To this end, this paper proposes a SEmi-supervised Multi-task Learning framework (called SEML) for ABSA. SEML has three key features. (1) SEML applies Cross-View Training (CVT) to enable semi-supervised sequence learning over a small set of labeled reviews and a large set of unlabeled reviews from the same domain in a unified end-to-end architecture. (2) SEML solves the two sub-tasks simultaneously by employing three stacked bidirectional recurrent neural layers to learn the representations of reviews, in which the representations learned from different layers are fed into CVT, AM and ASC, respectively. (3) SEML develops a Moving-window Attentive Gated Recurrent Unit (MAGRU) for the three recurrent neural layers to enhance representation learning and prediction accuracy, as nearby contexts within a moving window in a review can provide important semantic information for the prediction task in ABSA. Finally, we conduct extensive experiments on ABSA over four review datasets from the SemEval workshops. Experimental results show that SEML significantly outperforms the state-of-the-art models.


I. INTRODUCTION
Product and service reviews posted by users have drawn a lot of attention from both industry and academia. Document-level or sentence-level sentiment analysis gives an overall opinion about a review or sentence, whereas Aspect-Based Sentiment Analysis (ABSA) provides more fine-grained information by mining aspects and analyzing aspect-level opinions for a discussed entity [1], [2]. For instance, a user posts a review on a laptop: ''I love the operating system but not the preloaded software'', which contains two aspects, i.e., ''operating system'' with a positive sentiment and ''preloaded software'' with a negative sentiment.
The associate editor coordinating the review of this manuscript and approving it for publication was Firooz B. Saghezchi.
Generally, ABSA can be divided into two sub-tasks, namely Aspect Mining (AM) and Aspect Sentiment Classification (ASC) [1]. The AM sub-task extracts the aspect words from each sentence of a review and has been extensively studied using unsupervised models [3]-[5], supervised models [6]-[13], and semi-supervised techniques [14]-[19]. The ASC sub-task, which aims to predict the sentiment polarities on these aspects, has also been increasingly discussed recently [20]-[25]. However, these works [3]-[25] focus on only one of the sub-tasks. As a result, two different models must be trained and pipelined together for ABSA. Nonetheless, the literature [26], [27] shows that the pipeline method is usually not the best solution for highly related tasks in Natural Language Processing (NLP), and that an integrated method that jointly trains different but related tasks is more effective. Thus, increasing attention has been paid to this integration direction [28]-[30]. As aspect words and sentiment words often co-occur and can help find each other, AM and ASC are strongly related sub-tasks, so a jointly trained method for the two sub-tasks in ABSA is promising. Moreover, most existing works [6]-[13], [21]-[25] adopt supervised learning for the AM or ASC sub-task and require a large number of labeled reviews. Manually labeling training data is very costly, especially for domain-dependent aspects, i.e., different domains may have different aspect spaces. Researchers are therefore motivated to develop more effective semi-supervised learning models for ABSA [31]. Thus, our two main concerns are: (1) whether we can fully use both labeled and unlabeled reviews and (2) whether we can perform both the AM and ASC sub-tasks in an end-to-end architecture at the same time. Our previous work [19] addressed the first concern, but only for the AM sub-task: the proposed model leverages both labeled and unlabeled reviews in a unified framework.
In this paper, we propose a new SEmi-supervised Multi-task Learning framework (called SEML) to enhance ABSA on user reviews. SEML follows the method in our previous work [19] and alternately learns a model on a mini-batch of labeled reviews and a mini-batch of unlabeled reviews from the same domain, based on Cross-View Training (CVT) [32], to enable semi-supervised learning. In CVT, one primary prediction module for either AM or ASC is trained with standard supervised learning on labeled reviews, and four auxiliary prediction modules with different views of unlabeled reviews are trained to agree with the AM or ASC primary prediction module. CVT switches between training on labeled and unlabeled reviews to improve both the review representations and the prediction modules.
Meanwhile, as AM and ASC are highly coupled, SEML applies multi-task learning by sharing the representation learning across layers so that AM and ASC are performed in the same framework. More specifically, three stacked bidirectional recurrent neural layers are employed to learn representations of reviews: the representations from the first layer are fed into the four auxiliary prediction modules of CVT to leverage unlabeled reviews, the representations from the second layer are fed into the primary prediction module for AM, and the representations from the third layer are fed into the primary prediction module for ASC. Moreover, each upper layer uses the representations from the lower layer as inputs, so SEML enables multi-task learning and interaction between the sub-tasks to improve the aspect and sentiment prediction.
Further, SEML builds on a significant observation: the nearby contexts of a word in a sentence provide important semantic information for a prediction task in ABSA. For instance, the past nearby aspect words (e.g., ''operating system'') should be more significant than other words in guiding the extraction of subsequent aspects (e.g., ''preloaded software''), and a closer sentiment word is more likely to be the corresponding opinion for the aspect (e.g., ''love'' for ''operating system'' and ''not'' for ''preloaded software''). Therefore, SEML devises a Moving-window Attentive Gated Recurrent Unit (MAGRU) as the neural unit in the three bidirectional recurrent neural layers; MAGRU extends the Gated Recurrent Unit (GRU) [33] with an attention mechanism to encode the information within a moving window.
In general, the contributions of this paper can be summarized as follows.
• We propose the first semi-supervised deep multi-task learning framework for both AM and ASC sub-tasks in ABSA, which introduces CVT to use unlabeled reviews to improve the representation learning within a unified end-to-end architecture.
• We enable multi-task learning to perform AM and ASC sub-tasks in the same framework with three stacked bidirectional recurrent neural layers and corresponding prediction modules.
• We develop a moving-window attention mechanism within the GRU, i.e., MAGRU, to capture significant past nearby information for the aspect and sentiment prediction.
• We conduct extensive experiments to evaluate the performance of SEML for AM, ASC and complete ABSA on the four review datasets from the SemEval workshops. Experimental results show that SEML is significantly better than the state-of-the-art models.
The remainder of this paper is organized as follows. Section II discusses related works. Then, we present our SEML framework in Section III. Section IV shows the experimental results. Finally, Section V concludes this paper.

II. RELATED WORKS

A. ABSA AS SEQUENCE LABELING
Sequence labeling is a very common problem in NLP (e.g., part-of-speech tagging and named-entity recognition) and aims to assign a label to each element of a sequential input. Both AM and ASC can be formulated as sequence labeling problems, in which a label is given to each word in the review. Formally, AM predicts a label sequence $\{y^A_1, \ldots, y^A_T\}$ for a given sentence with $T$ words $\{x_1, \ldots, x_T\}$, where $y^A_t \in$ {ASPECT, NONASPECT}, and the label space changes to $y^S_t \in$ {SENTIMENT POLARITIES} for ASC. For instance, the reference [6] defines a set of labels to distinguish feature aspects, component aspects and function aspects, and trains hidden Markov models to label each word in a review. Further, the researchers in [7] simplify these labels and apply the {B, I, O} scheme, where B, I and O identify the beginning of an aspect, the continuation of the aspect, and other words, respectively. The {B, I, O} scheme handles aspects expressed as phrases well and has been applied to AM [9], [18] and aspect-opinion term co-extraction [11], [12]. In the ASC sub-task, as aspects are assumed to be known, the prediction model only needs to assign a sentiment polarity to each aspect with the {POS, NEG, NEU, O} scheme [24], [25], where POS, NEG and NEU denote positive, negative and neutral sentiment, respectively, and O marks other words. Recently, a collapsed labeling scheme has been applied to perform ABSA as a single sequence labeling task [29], in which aspect and sentiment labels are combined into the {B-{POS, NEG, NEU}, I-{POS, NEG, NEU}, O} scheme. We do not follow this collapsed scheme in SEML, as we consider that the interaction between the two sub-tasks can improve the aspect and sentiment prediction. Thus, SEML uses the {B, I, O} and {POS, NEG, NEU, O} labeling schemes.
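To make the labeling schemes concrete, the short Python sketch below tags the example sentence from the introduction with the {B, I, O} and {POS, NEG, NEU, O} schemes and derives the collapsed labels from them. The label sequences are our own reading of that example, and the helper name `collapse` is hypothetical; it is an illustration, not part of any published implementation.

```python
# Illustrative only: the tags below are derived by hand from the example
# sentence in the introduction; `collapse` is a hypothetical helper name.
words = ["I", "love", "the", "operating", "system",
         "but", "not", "the", "preloaded", "software"]
aspect_tags = ["O", "O", "O", "B", "I", "O", "O", "O", "B", "I"]
sentiment_tags = ["O", "O", "O", "POS", "POS", "O", "O", "O", "NEG", "NEG"]


def collapse(aspect_seq, sentiment_seq):
    """Merge the two label sequences into the collapsed B-POS / I-NEG / O scheme."""
    merged = []
    for aspect, sentiment in zip(aspect_seq, sentiment_seq):
        merged.append("O" if aspect == "O" else f"{aspect}-{sentiment}")
    return merged


collapsed_tags = collapse(aspect_tags, sentiment_tags)

for word, a, s, c in zip(words, aspect_tags, sentiment_tags, collapsed_tags):
    print(f"{word:10s} {a:2s} {s:4s} {c}")
```

Keeping the two sequences separate, as SEML does, lets each sub-task have its own prediction module while still allowing the collapsed labels to be recovered when needed.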

B. SEMI-SUPERVISED APPROACHES
Most existing semi-supervised methods for ABSA are proposed only for AM. One direction is to use prior domain knowledge to guide an unsupervised topic model (e.g., Latent Dirichlet Allocation). For instance, some methods manually choose domain-specific seed words [14]-[17] for topic modeling. However, methods of this kind often need manually defined domain knowledge and do not fully use labeled reviews. Another direction takes full advantage of unlabeled reviews in the same domain to improve supervised models. The idea of pre-training has been applied in the AM model [18] to learn domain-specific word embeddings from unlabeled reviews in advance, which are then fed into ordinary supervised models. In contrast, instead of pre-training, our previous work [19] learns both task- and domain-specific representations of reviews in a unified framework, which improves the AM sub-task. For the sentiment classification problem, some researchers [20], [28], [31], [34], [35] prefer to use external linguistic resources to capture the affective information of words, which can be considered a special case of semi-supervised approaches. For example, commonsense knowledge networks have been proposed to perform concept-level sentiment analysis [20], [31], [34], [35], and the authors of [28] encode commonsense knowledge into their attentive neural network for ASC. The work in [36] uses data augmentation to generate more labeled training data to achieve semi-supervised learning for ABSA. Nonetheless, supervised deep learning models have so far achieved great success when applied to AM [9]-[13], ASC [21]-[25] and complete ABSA [29]. To the best of our knowledge, we are the first to propose an end-to-end semi-supervised deep learning framework that can leverage labeled and unlabeled reviews for both the AM and ASC sub-tasks.

C. CROSS-VIEW TRAINING
Normally, a deep learning model works best when it is trained on a large amount of data with reliable labels. However, for domain-dependent aspects, manual labeling can be a huge investment. One solution is to apply effective semi-supervised learning to leverage plenty of unlabeled reviews. Current semi-supervised learning models [18] separate the training process into two phases: pre-training and supervised learning. A key disadvantage of such models is that the first phase, representation learning, does not benefit from any labeled reviews. In a more sophisticated approach, CVT [32] implements semi-supervised learning by alternately switching the training process between labeled data and unlabeled data, which is the meaning of Cross-View. Note that the term Cross-View may refer to multi-view learning in some works [37], [38], in which the models learn from multi-view data (e.g., images taken from different viewpoints) instead of switching between labeled and unlabeled data. Our previous work [19] has shown that CVT can leverage both labeled and unlabeled reviews well. Thus, the SEML framework is also based on CVT, but it has a more task-specific architecture that combines CVT with multi-task learning to jointly train multiple models at the same time.

D. MULTI-TASK LEARNING
Extensive works [26], [27] show that models jointly trained for closely related tasks can outperform models trained for a single task. For ABSA, a multi-task supervised model with two coupled GRU layers is proposed to co-extract aspects and sentiment words [12]. The authors of [9] employ a neural network with three Long Short-Term Memory (LSTM) layers to perform multi-task learning for AM. Further, they add an attention mechanism to the AM model [13]. Multi-task learning is also applied to ASC. For example, the Sentic LSTM (a two-step attentive LSTM) [28] is proposed to perform aspect categorization and ASC; the authors of [29] propose the E2E-TBSA model for ASC based on two bidirectional LSTM layers and the collapsed labeling scheme; and an interactive multi-task learning network (IMN) [30] is proposed to learn the model from the token-level AM and ASC sub-tasks as well as document-level tasks. However, all the aforementioned studies are based on supervised learning, which relies on large amounts of labeled reviews to guarantee good performance. In contrast, our SEML framework can leverage unlabeled reviews to enhance supervised models and alleviate the costly demand for data labeling.

III. THE FRAMEWORK SEML
In this section, we present our semi-supervised deep learning framework SEML for ABSA. First, we formulate the two sub-tasks AM and ASC as sequence labeling problems. Then, we present the technical details of the key components in SEML.
[...] and $D_u$ to predict the sentiment polarities for the aspects. These two sub-tasks can be formulated as different sequence labeling problems by using different tagging schemes. Specifically, we use the {B, I, O} scheme for AM, where B, I, and O indicate the beginning of, the continuation of, and the outside of an aspect, respectively (refer to Section II-A). For the ASC sub-task, the {POS, NEG, NEU, O} scheme is applied, where POS, NEG, and NEU express the positive, negative, and neutral sentiment, respectively, and O means the NULL sentiment for a word. Then, each word $x_t$ in the review sentence $X = \{x_1, \ldots, x_T\}$ is assigned one label $Y^A \in$ {B, I, O} and one label $Y^S \in$ {POS, NEG, NEU, O} (see TABLE 1 for an example).

VOLUME 8, 2020

TABLE 1. An example of AM and ASC as sequence labeling problems. $Y^A$ shows the aspect labels for each word, and $Y^S$ shows the sentiment polarities for the aspects.

FIGURE 1. The architecture of our SEML framework. Refined word embeddings and char-features are fed into three stacked BiMAGRU layers. The first BiMAGRU layer is shared with CVT to leverage unlabeled reviews, in which four auxiliary prediction modules are trained to agree with both AM and ASC (i.e., the primary prediction modules). The second BiMAGRU layer trains the AM model to extract aspects, and the third layer trains the ASC model to predict sentiment polarities.

B. FRAMEWORK ARCHITECTURE
Our motivation is to fully use the labeled and unlabeled reviews and to perform both AM and ASC simultaneously within an end-to-end framework. As shown in FIGURE 1, our SEML framework consists of four components: representation learning, AM, ASC and CVT. Since recurrent neural networks (RNNs) can naturally represent sequential information, our framework employs deep RNNs as the basic architecture to build the shared contextualized representation learning component for both the AM and ASC sub-tasks. Specifically, three stacked bidirectional recurrent neural layers with MAGRU are employed to build the shared memory; MAGRU extends GRU with a moving-window attention mechanism to encode nearby semantic significances. We give a detailed design of MAGRU in Section III-C.
Moreover, each stacked Bidirectional MAGRU (BiMAGRU) layer is designed to learn representations for a different task. Specifically, the first layer is shared with CVT to train four auxiliary prediction modules for AM or ASC by leveraging unlabeled reviews; the second layer uses the representations from the first layer as inputs and trains one primary prediction module for AM; and the third layer takes the representations from the second layer as inputs and trains the other primary prediction module for ASC. As each upper layer uses the outputs of the lower layer as inputs, SEML enables not only multi-task learning but also interaction between the sub-tasks to improve the aspect extraction and sentiment prediction. The detailed representation learning process is presented in Section III-D.
To enable semi-supervised learning, our SEML framework trains on both labeled and unlabeled reviews for the two sub-tasks (AM and ASC) in ABSA by applying CVT. In CVT, the primary prediction modules for AM and ASC are trained with standard supervised learning on labeled reviews; on unlabeled reviews, four auxiliary prediction modules (namely $p_{past}$, $p_{fwd}$, $p_{bwd}$, and $p_{future}$) with different views of the input data are trained to agree with the primary prediction modules. We discuss the specific multi-task CVT in Sections III-E and III-F.

C. MOVING-WINDOW ATTENTIVE GRU
As introduced above, our framework employs deep RNNs to build the shared representation learning component. However, in ABSA, the information from past nearby steps provides useful clues for a prediction, e.g., the aspect label ''I'' cannot follow ''O'', and the previous aspects can guide the extraction of subsequent aspects. Although RNNs with Long Short-Term Memory (LSTM) [39] or GRU [33] units can encode long spans of sequential information well, they have difficulty attending to exactly the useful nearby contexts at each time step. To this end, our framework extends GRU with a Moving-window Attention mechanism (called MAGRU) that can capture past nearby significances.
We extend GRU because it has a simpler structure and fewer parameters than LSTM but shows competitive performance in many NLP tasks [40]. Specifically, as shown in FIGURE 2, MAGRU has three gates, namely the reset gate $r$, the update gate $z$, and the attention gate $a$. The update gate $z_t$ at time step $t$ is computed from the previous hidden state $h_{t-1}$ and the input of the current step $x_t$, where $U_z$ and $W_z$ are the gate parameters and $\sigma$ is the sigmoid activation function. At the same time, the reset gate $r_t$ is computed analogously. The new candidate hidden state $\tilde{h}_t$ for the current time step, without any attention, is then obtained using the tanh activation function. The update and reset gates are the same as in GRU. However, we add a new attention gate to encode past nearby significances. Specifically, the moving-window attention considers the most recent $N$ hidden states, where $N$ is the moving-window size. At step $t$, we calculate the normalized significance score $s^t_i$ of each cached past state $h_i$ ($i \in [t-N, t-1]$) using the tanh activation function and the attention parameters $U_a$, $W^1_a$, and $W^2_a$. The attention gate $a_t$ is then given by the weighted sum of the cached previous $N$ hidden states $h_i$ with the score weights $s^t_i$, followed by the ReLU activation function.
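For reference, the gates described in this paragraph can be written out as in the block below. This is a sketch under stated assumptions rather than the exact published formulas: the update gate, reset gate and candidate state use the standard GRU form; the assignment of $U$ to the input and $W$ to the hidden state is assumed; the parameter names $U_r$, $W_r$, $U_h$, $W_h$ and the unnormalized score $e^t_i$ are introduced here for illustration; and the argument of the score function, together with the element-wise softmax normalization over the window, is only one form consistent with the three attention parameters named in the text.

```latex
\begin{align}
z_t &= \sigma\left(U_z x_t + W_z h_{t-1}\right) \\
r_t &= \sigma\left(U_r x_t + W_r h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(U_h x_t + W_h (r_t \odot h_{t-1})\right) \\
e^t_i &= \tanh\left(U_a x_t + W^1_a h_{t-1} + W^2_a h_i\right),
  \qquad i \in [t-N,\, t-1] \\
s^t_i &= \frac{\exp(e^t_i)}{\sum_{j=t-N}^{t-1} \exp(e^t_j)} \\
a_t &= \mathrm{ReLU}\left(\sum_{i=t-N}^{t-1} s^t_i \odot h_i\right)
\end{align}
```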
Finally, to calculate the current moving-window attentive hidden state $h_t$ at step $t$, our framework combines all three gates. The candidate state $\tilde{h}_t$ is determined by the reset gate, which combines the new input with the previous hidden state; the update gate defines how much of the previous information to keep; and the attention gate contributes the past nearby significance.
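The following NumPy sketch shows one way a single MAGRU step could be organized. It is a minimal illustration under assumptions, not the authors' implementation: the GRU part uses the standard formulation, the attention score is an element-wise softmax over the cached window, and the final combination (a GRU interpolation plus the attention gate added on top) is our guess at how the three gates are merged. All function names and all parameter names other than `U_z`, `W_z`, `U_a`, `W1_a` and `W2_a` are hypothetical.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def init_params(rng, input_dim, hidden):
    """Small random parameters (names other than U_z, W_z, U_a, W1_a, W2_a are hypothetical)."""
    params = {}
    for name in ("U_z", "U_r", "U_h", "U_a"):
        params[name] = rng.normal(scale=0.1, size=(hidden, input_dim))
    for name in ("W_z", "W_r", "W_h", "W1_a", "W2_a"):
        params[name] = rng.normal(scale=0.1, size=(hidden, hidden))
    return params


def magru_step(x_t, history, params, window=5):
    """One step; `history` lists the past hidden states, oldest first."""
    hidden = params["W_z"].shape[0]
    h_prev = history[-1] if history else np.zeros(hidden)

    # Standard GRU gates and candidate state.
    z = sigmoid(params["U_z"] @ x_t + params["W_z"] @ h_prev)
    r = sigmoid(params["U_r"] @ x_t + params["W_r"] @ h_prev)
    h_cand = np.tanh(params["U_h"] @ x_t + params["W_h"] @ (r * h_prev))

    # Moving-window attention over at most `window` cached states.
    cached = history[-window:]
    if cached:
        H = np.stack(cached)  # shape (n, hidden)
        e = np.tanh(params["U_a"] @ x_t + params["W1_a"] @ h_prev
                    + H @ params["W2_a"].T)  # ASSUMED score form, shape (n, hidden)
        scores = np.exp(e) / np.exp(e).sum(axis=0, keepdims=True)
        a = np.maximum(0.0, (scores * H).sum(axis=0))  # ReLU of the weighted sum
    else:
        scores = np.zeros((0, hidden))
        a = np.zeros(hidden)

    # ASSUMED combination of the three gates.
    h_t = z * h_prev + (1.0 - z) * h_cand + a
    return h_t, scores


rng = np.random.default_rng(0)
params = init_params(rng, input_dim=4, hidden=3)
history = []
for x_t in rng.normal(size=(8, 4)):
    h_t, scores = magru_step(x_t, history, params, window=5)
    history.append(h_t)
```

At the start of a sentence fewer than `window` states are cached, so the sketch simply attends over whatever is available; how the original model handles this boundary is not stated in the text.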

D. REPRESENTATION LEARNING
Pre-trained general word embeddings (e.g., GloVe [41]) have been widely used in recent NLP models and have become an essential component for converting textual information into vectors for later computation. However, the researchers in [42] discover that these general word embeddings often represent words of opposite sentiment (e.g., good and bad) with similar vectors, which affects the final sentiment classification. Thus, they propose an adjustment method that uses a sentiment lexicon to refine the embeddings of sentiment words so that they are closer to sentimentally similar words and farther away from sentimentally different ones, which improves the classification performance on many sentiment-related tasks. SEML applies the same method [42]. Moreover, because combining general embeddings with char-features can help handle misspelled words [43], SEML represents each word in the input sequence as the concatenation of the refined embedding vector and char-features from a character-level Convolutional Neural Network (CNN) [43].
Then the concatenated vectors are fed into the deep bidirectional RNN.
The RNN employs three stacked BiMAGRU layers to build the shared memory for both the AM and ASC sub-tasks, in which each upper layer uses the hidden states from the lower layer as inputs. Specifically, we feed the inputs to MAGRUs in both the forward and backward directions and combine the two as one BiMAGRU layer, because both forward and backward information is important for the prediction at the current position.
At each time step $t \in [1, T]$, the forward and backward hidden states of a layer are joined by the concatenation operation $\oplus$; $h^1_t$ denotes the hidden representation from the first BiMAGRU layer at time step $t$, $h^2_t$ that from the second layer, and $h^3_t$ that from the third layer.
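The stacking described above can be sketched in a few lines of NumPy. To keep the example short, a plain tanh recurrent cell stands in for MAGRU, so this shows only the data flow (forward pass, backward pass, concatenation, and feeding each layer's output to the next), not the actual unit. All function names, dimensions and weights are hypothetical.

```python
import numpy as np


def run_direction(inputs, w_in, w_rec, reverse=False):
    """Run a plain tanh recurrent cell (a stand-in for MAGRU) in one direction."""
    num_steps, hidden = inputs.shape[0], w_rec.shape[0]
    state = np.zeros(hidden)
    outputs = np.zeros((num_steps, hidden))
    order = range(num_steps - 1, -1, -1) if reverse else range(num_steps)
    for t in order:
        state = np.tanh(w_in @ inputs[t] + w_rec @ state)
        outputs[t] = state
    return outputs


def init_layer(rng, input_dim, hidden):
    """Small random weights for the forward and backward cells of one layer."""
    return {
        "fwd": (rng.normal(scale=0.1, size=(hidden, input_dim)),
                rng.normal(scale=0.1, size=(hidden, hidden))),
        "bwd": (rng.normal(scale=0.1, size=(hidden, input_dim)),
                rng.normal(scale=0.1, size=(hidden, hidden))),
    }


def bidirectional_layer(inputs, layer):
    """Concatenate the forward and backward states at every position."""
    forward = run_direction(inputs, *layer["fwd"])
    backward = run_direction(inputs, *layer["bwd"], reverse=True)
    return np.concatenate([forward, backward], axis=1)


rng = np.random.default_rng(0)
seq_len, input_dim, hidden = 10, 8, 4
x = rng.normal(size=(seq_len, input_dim))  # stand-in for word + char features

layers = [init_layer(rng, input_dim, hidden),
          init_layer(rng, 2 * hidden, hidden),
          init_layer(rng, 2 * hidden, hidden)]

h1 = bidirectional_layer(x, layers[0])   # shared with the auxiliary modules
h2 = bidirectional_layer(h1, layers[1])  # used for AM
h3 = bidirectional_layer(h2, layers[2])  # used for ASC
```

Note that only the first layer keeps a restricted view in each direction: once `h1` concatenates both directions, every position of `h2` and `h3` has seen the whole sentence.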

E. PREDICTION MODULES
SEML trains models on both labeled reviews and unlabeled reviews for the two sub-tasks (AM and ASC) in ABSA. For each sub-task (AM or ASC), SEML learns one primary prediction module from labeled reviews and four auxiliary prediction modules from unlabeled reviews with restricted views of the inputs. Suppose $y^A_t$ is the aspect label for the word $x_t \in X$. The primary prediction module for AM determines the probability distribution $p(y^A_t|x_t)$ over the aspect labels {B, I, O} from the representations ($h^1_t$ and $h^2_t$) of the first and second BiMAGRU layers with a simple one-hidden-layer neural network (Equation (10)), in which $U^A_p$, $W^A_p$ and $b^A$ are the model parameters. Further, since ASC relies on the positions of aspects, the aspect boundary information from the primary module for AM is fed into the third BiMAGRU layer for ASC. Therefore, the moving-window attention in the third layer can help the primary module for ASC focus on the corresponding sentiment words and maintain the consistency of sentiment labels assigned to multi-word aspects. The primary prediction module for ASC adopts an architecture similar to that for AM (Equation (11)), where $y^S_t \in$ {POS, NEG, NEU, O}.
As mentioned above, SEML shares the first BiMAGRU layer with the auxiliary prediction modules, which have restricted views of unlabeled reviews. There are four different auxiliary prediction modules ($p_{past}$, $p_{fwd}$, $p_{bwd}$, and $p_{future}$) in the framework for each sub-task (AM or ASC). For the prediction of the current word, $p_{past}$ only has a view of the past words to the left of the current word in the sentence; $p_{fwd}$ has a view of the past (left) words and the current word; $p_{bwd}$ observes the current word and the future words (to the right); and $p_{future}$ only observes the future words to the right, as shown in FIGURE 1.
BiMAGRU can easily provide these restricted views without additional computation: for each sub-task $k \in \{A, S\}$, the auxiliary modules apply the neural networks $nn_{past}$, $nn_{fwd}$, $nn_{bwd}$, and $nn_{future}$, each with the structure given in Equation (10) or (11), to the directional hidden states of the first layer. Since the second and third BiMAGRU layers have already seen all words, we feed only the hidden representations $\overrightarrow{h}^1$ and $\overleftarrow{h}^1$ from the first BiMAGRU layer to the auxiliary prediction modules in order to restrict their view of the input sequence.
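The four restricted views can be obtained from the forward and backward states of the first layer by simple shifting, as in the sketch below. It follows the verbal description of the four modules (past excludes the current word, fwd includes it, and symmetrically for bwd and future); the zero padding at the sentence boundaries and the function name `restricted_views` are assumptions made for illustration.

```python
import numpy as np


def restricted_views(fwd, bwd):
    """Build the inputs of the four auxiliary modules from first-layer states.

    fwd[t] has read words 0..t (left to right);
    bwd[t] has read words t..T-1 (right to left).
    """
    _, hidden = fwd.shape
    pad = np.zeros((1, hidden))  # ASSUMED: zero state where no context exists
    return {
        "past": np.concatenate([pad, fwd[:-1]], axis=0),    # words strictly left of t
        "fwd": fwd,                                          # left words and word t
        "bwd": bwd,                                          # word t and right words
        "future": np.concatenate([bwd[1:], pad], axis=0),   # words strictly right of t
    }


rng = np.random.default_rng(0)
fwd_states = rng.normal(size=(6, 4))  # stand-in for the forward first-layer states
bwd_states = rng.normal(size=(6, 4))  # stand-in for the backward first-layer states
views = restricted_views(fwd_states, bwd_states)
```

Each view would then be passed through its own small prediction network, so the four modules share the recurrent layer but not their output parameters.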

F. MULTI-TASK CROSS-VIEW TRAINING
The key idea of CVT is to use unlabeled reviews from the same domain as the labeled reviews to enhance the representation learning, alternately learning the primary and auxiliary prediction modules on a mini-batch of labeled reviews or unlabeled reviews. In order to perform multi-task learning, i.e., to train one primary module and four auxiliary modules for AM or ASC, we first randomly choose a sub-task (AM or ASC) together with its labeled reviews. Then the Cross-Entropy (CE) loss, denoted $L^A_{SUP}$ or $L^S_{SUP}$, is used to train the corresponding primary prediction module $p(y^A_t|x_t)$ or $p(y^S_t|x_t)$. For the unlabeled reviews $D_u$, the framework first infers $p(y^A_i|x_i)$ as well as $p(y^S_i|x_i)$ ($x_i \in D_u$) with the primary modules for AM and ASC, and then trains the auxiliary prediction modules to match the two primary prediction modules, using the Kullback-Leibler (KL) divergence as the loss $L_{CVT}$, where $j \in$ {past, fwd, bwd, future} indexes the auxiliary modules, $k \in \{A, S\}$ indexes the sub-tasks, and the parameters of the primary modules are fixed during training. The auxiliary prediction modules can learn to enhance the shared representations, because new words that are not in the labeled reviews may have been encoded into the model and be useful for making predictions on aspects and sentiments. Reviews labeled for both tasks are useful for multi-task models, but most publicly available labeled reviews are for only one particular task (e.g., either AM or ASC). SEML utilizes unlabeled reviews for both sub-tasks and in effect constructs all-tasks-labeled examples from unlabeled reviews. Finally, we combine the supervised and CVT losses and minimize the total loss $L$ with stochastic gradient descent. In particular, we randomly choose a sub-task and alternately minimize $L^A_{SUP}$ or $L^S_{SUP}$ over a mini-batch of the corresponding labeled reviews and $L_{CVT}$ over a mini-batch of unlabeled reviews.
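The two loss terms and the alternating schedule can be illustrated with the NumPy sketch below. It is a simplified illustration under assumptions: the losses are averaged over positions, the CVT loss is summed without weighting over sub-tasks and auxiliary modules, probabilities are assumed to be strictly positive, and no gradients are computed. The function names and the toy probabilities are hypothetical.

```python
import random

import numpy as np


def cross_entropy(primary_probs, gold_ids):
    """Mean CE between primary predictions (T, C) and gold label ids (T,)."""
    positions = np.arange(len(gold_ids))
    return float(-np.log(primary_probs[positions, gold_ids]).mean())


def kl_divergence(target_probs, aux_probs):
    """Mean KL(target || aux) over positions; the target is treated as fixed."""
    per_position = (target_probs * (np.log(target_probs) - np.log(aux_probs))).sum(axis=1)
    return float(per_position.mean())


def cvt_loss(primary_by_task, aux_by_task):
    """Sum the KL terms over sub-tasks k and auxiliary modules j (ASSUMED unweighted)."""
    total = 0.0
    for task, target in primary_by_task.items():
        for aux_probs in aux_by_task[task].values():
            total += kl_divergence(target, aux_probs)
    return total


def schedule(num_rounds, seed=0):
    """Alternate a supervised step on a random sub-task with a CVT step."""
    rng = random.Random(seed)
    steps = []
    for _ in range(num_rounds):
        steps.append(("supervised", rng.choice(["AM", "ASC"])))
        steps.append(("cvt", "AM+ASC"))
    return steps


# Toy primary predictions for two words over the labels {B, I, O}.
primary_am = np.array([[0.5, 0.25, 0.25],
                       [0.1, 0.8, 0.1]])
gold_am = np.array([0, 1])
sup_loss = cross_entropy(primary_am, gold_am)

# Toy auxiliary predictions: one module agrees exactly, one is uniform.
aux_am = {"fwd": primary_am.copy(), "past": np.full((2, 3), 1.0 / 3.0)}
unsup_loss = cvt_loss({"A": primary_am}, {"A": aux_am})
steps = schedule(num_rounds=3)
```

An auxiliary module that already matches the primary module contributes zero to the CVT loss, so only disagreeing views drive updates of the shared first layer.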

IV. EXPERIMENTS
In this section, we evaluate the performance of our proposed SEML framework and compare it with the state-of-the-art approaches for both the AM and ASC sub-tasks in ABSA. Moreover, we test SEML on complete ABSA and compare it with pipeline and unified approaches.

A. EXPERIMENTAL SETTINGS

1) DATASETS
We conduct experiments over four benchmark datasets from the SemEval workshops [44], [45]. TABLE 2 shows their statistics. $D^{AM}_{laptop}$ and $D^{AM}_{rest}$ contain reviews from the laptop and restaurant domains for the AM sub-task, while $D^{ASC}_{laptop}$ and $D^{ASC}_{rest}$ are for the ASC sub-task. In the AM datasets, the sentiment polarities for aspects are not given. In the ASC datasets, both aspects and their sentiment polarities are known. As some testing sentences in one sub-task may appear in the other sub-task's training dataset, we simply remove those sentences from the training dataset for fair comparison. Moreover, SEML needs unlabeled reviews for CVT (semi-supervised learning). We collect unlabeled reviews corresponding to the two domains (laptop and restaurant) of the labeled datasets to train the model, which include laptop reviews from the Amazon Review Dataset (230,373 sentences) [46] and restaurant reviews from the Yelp Review Dataset (2,677,025 sentences) [47]. For comparison, we also train the model on a general unlabeled dataset (One Billion Word Language Model Benchmark) [48] to see whether performing CVT on general texts can improve the supervised model for AM and ASC. As some sentences in the testing datasets may also appear in the unlabeled reviews, we remove these sentences from the unlabeled reviews to make the comparison fair.

2) COMPARED MODELS
We first compare SEML with the state-of-the-art models for the AM sub-task, including:
• CMLA [12] applies a multi-layer architecture with coupled attentions to locate aspect words.
• MIN [9] consists of three LSTM layers for multi-task learning, in which a sentiment lexicon (to find opinion words) and dependency rules are used to extract corresponding aspects.
• DE-CNN [18] is based on CNNs and utilizes both general word embeddings and domain-specific embeddings learned from unlabeled reviews.
• EMOVA [19] uses CVT and a moving-window attention mechanism to leverage both labeled and unlabeled reviews.
Then, we compare SEML with the state-of-the-art models for the ASC sub-task, including:
• RAM [49] employs multiple attentions on multi-layer RNNs to combine hidden word features in each layer.
• TNet [24] utilizes a CNN layer instead of an attention layer to extract the salient features from the representations learned by deep RNNs.
• MGAN [25] applies transfer learning to leverage knowledge learned from a rich-resource source domain to improve the learning in a low-resource target domain.
In addition, since BERT [50] is one of the key innovations in the recent progress of language modeling and achieves state-of-the-art performance on many NLP tasks, we fine-tune the pre-trained BERT model on the datasets for both AM and ASC as a baseline:
• BERT [50] learns better representations by training a deep language model on large amounts of text. We apply $\text{BERT}_{\text{BASE}}$ on the datasets as the baseline to perform AM and ASC as well as complete ABSA.
We also investigate the performance of important variants of SEML:
• SEML-SUP is our supervised model without CVT on unlabeled reviews, so it is a purely supervised multi-task learning model.
• SEML-GNL is the full framework but performs CVT only on the general unlabeled text (One Billion Word Language Model Benchmark) [48], which is not specific to the laptop or restaurant domain.
• SEML-AM is the single-task model for AM with CVT on unlabeled reviews.
• SEML-ASC is the single-task model for ASC with CVT on unlabeled reviews.
Finally, our goal is to perform complete ABSA within an end-to-end framework, but the baselines above are for either the AM or the ASC sub-task. For ASC, the testing datasets in $D^{ASC}_{laptop}$ and $D^{ASC}_{rest}$ provide the gold aspects. To evaluate complete ABSA, these aspect labels are removed from the testing datasets, which are then denoted as $D^{ABSA}_{laptop}$ and $D^{ABSA}_{rest}$, respectively. We compare SEML with the following baselines on the new testing datasets:
• DE-CNN-MGAN is the pipeline method that combines two state-of-the-art methods, DE-CNN for AM and MGAN for ASC.
• LM-LSTM-CRF [51] is a competitive model on some sequence labeling tasks in NLP. We train the model for complete ABSA with a collapsed labeling scheme.
• E2E-TBSA [29] is the state-of-the-art supervised model for performing complete ABSA in a unified framework with a collapsed labeling scheme.

3) TRAINING SETTINGS
We use pre-trained GloVe 840B 300-dimension vectors [41] and refine the sentiment vectors [42] to initialize the word embeddings, and the char-feature size is 50. All of the weight matrices except those in BiMAGRU are initialized from the uniform distribution $U(-0.2, 0.2)$. For the initialization of the matrices in BiMAGRU, we adopt the Glorot Uniform strategy [52]. We apply dropout with rates of 0.5 for labeled reviews and 0.8 for unlabeled reviews. The hidden state size is set to 1,024, and the learning rate is 0.05. We set the mini-batch size to 30 sentences, and the moving-window size $N$ (i.e., the number of cached past nearby hidden states in MAGRU) to 5.
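For convenience, the settings listed above can be collected in a single configuration object, as in the sketch below. The class name `SEMLConfig` and all field names are hypothetical; only the values come from the text. Whether the dropout rates denote drop or keep probabilities is not stated, so the fields simply record the reported numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SEMLConfig:
    """Reported training settings (class and field names are hypothetical)."""
    word_embedding_dim: int = 300      # GloVe 840B vectors, sentiment-refined
    char_feature_dim: int = 50
    hidden_size: int = 1024
    learning_rate: float = 0.05
    batch_size: int = 30               # sentences per mini-batch
    window_size: int = 5               # cached past hidden states in MAGRU (N)
    dropout_labeled: float = 0.5       # reported rate for labeled reviews
    dropout_unlabeled: float = 0.8     # reported rate for unlabeled reviews
    uniform_init_range: float = 0.2    # U(-0.2, 0.2) for non-BiMAGRU weights


config = SEMLConfig()
```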

B. EXPERIMENTAL RESULTS
We report the results of CMLA, MIN, DE-CNN, EMOVA, RAM, TNet and MGAN from their original works, since we use exactly the same datasets. For the other models, we average the evaluation results of five runs. We follow the standard evaluation metrics of the SemEval workshops and report the F1 score for AM, and the accuracy and Macro-F1 (MF1) score for ASC.
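The metrics can be computed as in the following sketch. It assumes that the AM F1 score is measured on exactly matched aspect spans decoded from the {B, I, O} tags and that Macro-F1 for ASC is the unweighted mean of the per-class F1 scores over POS, NEG and NEU; the function names and toy inputs are hypothetical, and the official SemEval scorers may differ in details.

```python
def extract_spans(tags):
    """Return the set of (start, end) aspect spans in a B/I/O tag list (end exclusive)."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if tag != "I" and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def span_f1(gold_seqs, pred_seqs):
    """Exact-match F1 over aspect spans, pooled over all sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def accuracy_and_macro_f1(gold, pred, classes=("POS", "NEG", "NEU")):
    """Accuracy and unweighted mean of per-class F1 over aspect polarities."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    f1_scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy AM example: the second predicted aspect misses its last word.
gold_tags = [["O", "O", "O", "B", "I", "O", "O", "O", "B", "I"]]
pred_tags = [["O", "O", "O", "B", "I", "O", "O", "O", "B", "O"]]
am_f1 = span_f1(gold_tags, pred_tags)

# Toy ASC example over four aspects.
asc_acc, asc_mf1 = accuracy_and_macro_f1(["POS", "NEG", "NEU", "POS"],
                                         ["POS", "NEG", "POS", "POS"])
```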

1) MAIN RESULTS
Results on AM. TABLE 3 depicts the results of all the evaluated models for AM, in which SEML performs the best. For example, compared to the competitive models CMLA, MIN, DE-CNN and EMOVA, SEML achieves absolute gains of 5.57%, 5.79%, 1.78% and 1.65% on $D^{AM}_{laptop}$, and 5.47%, 4.80%, 3.87% and 3.06% on $D^{AM}_{rest}$, respectively. Even our purely supervised SEML-SUP (without CVT) performs better than CMLA and MIN. The main reason is the effectiveness of MAGRU, which captures the significant information in the nearby contexts of the aspects. Moreover, SEML-GNL with general unlabeled texts improves over SEML-SUP, which verifies the advantage of semi-supervised learning. Compared to the two-phase semi-supervised approaches, including DE-CNN and BERT, SEML shows clear superiority: two-phase training (i.e., pre-training followed by supervised learning) cannot take advantage of labeled reviews for learning representations in the pre-training step, whereas SEML learns domain- and task-specific representations alternately over labeled and unlabeled reviews within a unified end-to-end framework. Finally, EMOVA also employs CVT but only performs the single AM task, so SEML records better results than EMOVA by enabling multi-task learning.
Results on ASC. TABLE 4 depicts the results of all the evaluated models for ASC, where SEML also achieves the best accuracy and MF1. More specifically, SEML-ASC, i.e., the variant of SEML for the single ASC task, already outperforms all the supervised models including RAM, TNet and MGAN, which shows that semi-supervised learning can improve the prediction performance by taking full advantage of unlabeled reviews. Interestingly, BERT gives a slightly better accuracy (by 0.03%) than SEML-ASC on $D^{ASC}_{rest}$. Our explanation is that BERT learns representations by training on much more domain-free text than SEML-ASC, and the ASC sub-task is more domain-independent than the AM sub-task, i.e., aspect words depend more on the domain than sentiment words do. Fortunately, when performing multi-task learning, the shared representations in SEML are significantly improved, which in turn enhances the final prediction results.
Results on ABSA. FIGURE 3 reports the F1 score for ABSA based on exact match, i.e., a joint labeling result is considered correct only if it matches both the aspect and the sentiment labels. SEML obtains consistent improvement over the pipeline model (DE-CNN-MGAN) and the unified models (LM-LSTM-CRF and E2E-TBSA). The reason is that SEML integrates highly coupled tasks (e.g., AM and ASC) more tightly through multi-task learning than the pipeline model does. Further, compared to the unified models with a collapsed labeling scheme, SEML also shows the effectiveness of a joint model that considers the interaction between the two related sub-tasks in ABSA.
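The exact-match criterion used for the complete ABSA evaluation in FIGURE 3 can be sketched as follows: a predicted aspect counts as correct only if both its span and its polarity agree with a gold annotation. Representing each aspect as a `(start, end, polarity)` tuple and the function name `exact_match_f1` are assumptions for illustration, not the evaluation code used in the experiments.

```python
def exact_match_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over (start, end, polarity) triples.

    A prediction is a true positive only if the aspect span AND the
    sentiment polarity both match a gold triple of the same sentence.
    """
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_gold = sum(len(g) for g in gold_sets)
    n_pred = sum(len(p) for p in pred_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# One sentence, two gold aspects; the second prediction has the right
# span but the wrong polarity, so it does not count as a match.
gold = [{(3, 4, "POS"), (8, 9, "NEG")}]
pred = [{(3, 4, "POS"), (8, 9, "POS")}]
score = exact_match_f1(gold, pred)
print(f"exact-match F1 = {score:.2f}")
```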

2) ABLATION STUDY
The key components of SEML include char-features, refined word embeddings and auxiliary prediction modules, as shown in FIGURE 1. To show the significance of each key component, we disable each of them in turn and evaluate the F1 score for AM and MF1 for ASC, as depicted in TABLE 5. First, we disable the char-features, and the result shows only a slight effect (row w/o char-features). Then, we do not refine the word embeddings with the sentiment lexicon before training; the result drops slightly for AM but drops more for ASC (row w/o refining), which shows that word embedding refining is essential for the sentiment-related task. To explore which auxiliary prediction modules are more important, we enable only two of them at a time ($p_{fwd}$ and $p_{bwd}$, or $p_{left}$ and $p_{right}$). We find that SEML w/o fwd & bwd (the modules that do not see the current word) is better than SEML w/o left & right, which may be caused by the more restricted view of the unlabeled input.

3) VISUALIZATION OF MOVING-WINDOW ATTENTION
We use an example to visualize the significance score in Equation (4) for the moving-window attention with the window size N = 5. FIGURE 4 shows the visualization results in the second BiMAGRU layer for AM and the third BiMAGRU layer for ASC. SEML pays more attention to ''software'' and ''system'' to identify the aspect label of ''preloaded'', and attends strongly to ''not'' and ''slow'' to predict the sentiment polarity of ''preloaded''.
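To give a rough intuition for attention over a moving window, the sketch below scores only the last N cached hidden states against a query vector and normalizes the scores with a softmax. It is a generic illustration and does not reproduce Equation (4): the dot-product scoring function, the function name `moving_window_attention`, and the toy vectors are all assumptions.

```python
import math


def moving_window_attention(hidden_states, query, window_size=5):
    """Attend over the last `window_size` hidden states only.

    Assumption: significance scores are plain dot products followed by a
    softmax; the scoring function in the paper may differ.
    """
    window = hidden_states[-window_size:]
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in window]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # Context vector: attention-weighted sum of the windowed states.
    context = [sum(w * h[d] for w, h in zip(weights, window))
               for d in range(len(query))]
    return weights, context


# Eight toy 2-dimensional hidden states; only the last five are attended.
states = [[float(t), 1.0] for t in range(8)]
weights, context = moving_window_attention(states, query=[0.1, 0.0], window_size=5)
print([round(w, 3) for w in weights])
```

Restricting the attention to N cached states keeps the cost per time step constant in the sentence length, which is consistent with choosing a small N to limit computation.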

4) EFFECTS OF MOVING-WINDOW SIZE
We also evaluate the effects of the moving-window size in the MAGRU of our SEML framework; the results are shown in FIGURE 5. Simply increasing the moving-window size hardly improves the overall performance, i.e., SEML achieves better AM and ASC accuracy by focusing attention on a limited number of nearby words. To reduce the computation cost, the moving-window size N is set to 5 in our experiments.

5) EFFECTS OF MODEL SIZE
Most supervised models for ABSA use RNNs (e.g., LSTM and GRU) with small hidden state sizes of around 300 [12], [13], [29], as a larger hidden state size does not necessarily improve the performance of a supervised model [53]. We examine the effects of the hidden state size on our semi-supervised SEML and supervised SEML-SUP. FIGURE 6 shows that SEML-SUP without CVT also does not gain much from a larger model size. However, as SEML can learn from unlabeled reviews by using CVT, its performance benefits from an increase in the model size. Consequently, SEML enables the development of larger and more accurate models for domains with limited amounts of labeled reviews but large numbers of unlabeled reviews, by using a large model size, e.g., 1,024 in our previous experiments.

6) LESS LABELED TRAINING DATA
A very common situation in aspect mining is that some domains (or products) do not have large volumes of labeled data. To this end, we explore how SEML scales with less data by feeding only a subset (25%, 50%, and 75%) of the labeled training datasets, as presented in FIGURE 7. SEML with half of the training data performs as well as SEML-SUP without CVT that sees all the training data. Thus, SEML is particularly useful when only a small set of labeled reviews is available, which greatly reduces the cost of manual labeling.
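The reduced-data setting can be reproduced in spirit by drawing fixed-size subsets of the labeled training set, as in the minimal sketch below. How the subsets were actually selected is not stated here, so uniform random sampling with a fixed seed, the helper `subsample_labeled`, and the placeholder review identifiers are assumptions.

```python
import random


def subsample_labeled(reviews, fraction, seed=0):
    """Return a random subset containing `fraction` of the labeled reviews.

    Assumption: uniform sampling without replacement; a fixed seed keeps
    the subsets reproducible across runs.
    """
    rng = random.Random(seed)
    k = int(len(reviews) * fraction)
    return rng.sample(reviews, k)


# Placeholder identifiers standing in for labeled training reviews.
labeled = [f"review_{i}" for i in range(100)]
subsets = {f: subsample_labeled(labeled, f) for f in (0.25, 0.50, 0.75)}
for fraction, subset in subsets.items():
    print(f"{int(fraction * 100)}% -> {len(subset)} labeled reviews")
```

In this setting only the labeled set shrinks; the unlabeled reviews used by CVT remain fully available to SEML.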

V. CONCLUSION AND FUTURE WORK
In this paper, we have proposed the first end-to-end SEmi-supervised Multi-task Learning framework (SEML) for ABSA on customer reviews. The two related sub-tasks in ABSA, namely AM and ASC, are jointly learned in an end-to-end fashion. Moreover, SEML derives the shared representations of reviews based on three stacked bidirectional neural layers with Moving-window Attentive Gated Recurrent Units (MAGRU); MAGRU extends GRU with the moving-window attention mechanism to capture significant nearby semantic contexts. Further, SEML employs CVT to train auxiliary prediction modules on unlabeled reviews to improve the representation learning in a unified end-to-end architecture. Finally, we have conducted experiments for the AM and ASC sub-tasks as well as complete ABSA over four datasets from the SemEval workshops, and the experimental results show that SEML significantly outperforms the state-of-the-art models, even on much smaller labeled training datasets.
We consider two future research directions. First, as SEML directly passes hidden representations between sub-tasks, the AM and ASC results may be inconsistent (e.g., the ASC predictor may assign sentiment polarities to non-aspect words); we will therefore design additional constraints to enforce stronger consistency between the two sub-tasks. Second, in addition to labeled and unlabeled reviews, we will try to encode external knowledge (e.g., commonsense knowledge bases) into the framework to improve the performance.

A. PROBLEM STATEMENT
Suppose there is one set ($D_u$) of unlabeled reviews from a domain (or an entity) and two sets ($D^{AM}_l$ and $D^{ASC}_l$) of labeled reviews from the same domain, which are annotated for AM and ASC, respectively. The AM sub-task is to learn a classifier from the reviews in $D^{AM}_l$ and $D_u$ to extract a set of aspects, while the ASC sub-task is to train a model from the reviews in $D^{ASC}_l$

Formally, let $V = \{v_1, \ldots, v_T\}$ be the concatenation vectors of refined word embeddings and char-features. The hidden representations for each layer are derived by concatenating the outputs of both the forward $\overrightarrow{\mathrm{MAGRU}}$ and the backward $\overleftarrow{\mathrm{MAGRU}}$ as follows:

FIGURE 3. Comparison results on F1 for ABSA. Note that the testing datasets $D^{ABSA}_{laptop}$ and $D^{ABSA}_{rest}$ are the same as $D^{ASC}_{laptop}$ and $D^{ASC}_{rest}$ but without the aspect labels.

FIGURE 5. Effects of the moving-window size N.

FIGURE 6. Effects of the model (hidden state) size.

FIGURE 7. Performance vs. percent of the labeled training set.

TABLE 2. Statistics of labeled datasets.

TABLE 4. Comparison results on Accuracy and MF1 for ASC.

TABLE 5. Ablation study on the key components of SEML.