Aspect-Level Sentiment Analysis Using CNN Over BERT-GCN

Context-based GCNs have achieved relatively good effectiveness in sentiment analysis, especially aspect-level sentiment analysis (ALSA). However, previous context-based GCN methods for ALSA have two limitations: (i) they restrict the GCN to a few layers (two or three) because of the vanishing gradient, which limits their performance; and (ii) they do not consider helpful information about the hidden context between words. To address these limitations, this paper proposes a novel CNN over BERT-GCN model for ALSA. The contributions of the proposed method are summarized as follows: (i) it handles the disadvantage of limiting the GCN to a few layers by adding convolutional layers of a convolutional neural network (CNN) after the GCN layers; and (ii) it considers further helpful information about the hidden context between words by combining the Bidirectional Encoder Representations from Transformers (BERT) and Bidirectional Long Short-Term Memory (BiLSTM) models. The proposed model includes the following steps. First, the words in a sentence are converted into vectors using BERT. Second, contextualized word representations are created by a BiLSTM over the word vectors. Third, significant features are extracted and represented using a GCN with multiple convolutional layers over the contextualized word representations. Finally, aspect-level sentiments are classified using a CNN over the feature vectors. Experiments on three benchmark datasets show that the proposed model improves on the performance of previous context-based GCN methods for ALSA.


I. INTRODUCTION
Since 2004, the number of users signing up on social networks has grown exponentially, and this growth has not ceased. Social network statistics show that 3.78 billion users worldwide used social media in 2021 (source: https://www.statista.com/), and this number is expected to keep growing over the next few years. Opinions on social networks are significant sources of news and information for traders, governments, and researchers [27]. By extracting, evaluating, and analyzing the sentiments hidden in the opinions of these data sources, traders, governments, and researchers can gain insight into trade, policies, and proposals and make better decisions. The sentiment analysis (SA) of opinions is considered an exciting trend for artificial intelligence on social media. Today, as the economy is gradually transformed into a digital economy, SA is a significant step in several automated systems, such as stance detection, recommendation systems, decision-making, and fake news detection on social media [15]. SA can be divided into three main levels: document, sentence, and aspect. Document-level SA aims to determine the sentiment polarity of the entire text without dividing it into sentences, and sentence-level SA seeks to determine the sentiment polarity of separate sentences. In comparison, aspect-level SA targets the sentiment polarity of the aspects of entities appearing in the text. Aspect-level SA is the main objective of this study: classifying the sentiments of aspects in a sentence as positive, negative, or neutral. For instance, consider "Phone color is not nice, but its style is so modern." Here, the review focuses on "color" and "style", which are two aspects of the entity "Phone"; the sentiment toward "color" is negative, while the sentiment toward "style" is positive.

The associate editor coordinating the review of this manuscript and approving it for publication was Nazar Zaki.
ALSA is increasingly used in the construction and development of numerous real-life applications. For example, producers can learn which parts or aspects of their products attract consumers, so that they can retain those aspects and improve the ones that do not.
In [13], the authors stated that "Semantic information is compulsory to associate words with sentences according to their contextual information, which could be helpful in the extraction of explicit and implicit aspects." Therefore, based on contextual information, relationships between different data objects could also be mapped to improve sentiment classification accuracy. For example, in "I went to an Italian restaurant, and they provided me with eight choices of spaghetti", the phrase "eight choices of spaghetti" should be considered a positive sentiment but is normally taken as neutral. This suggests that the context surrounding the aspect has a significant impact on the efficacy of ALSA. Various approaches have been proposed for context-based ALSA, such as knowledge-based, machine learning-based, hybrid-based, and, most recently, GCN-based approaches. Context-based GCNs have achieved relatively good effectiveness in sentiment analysis tasks, especially ALSA; examples include AHGCN-WIN [22], Hier-GCN [2], and GCNSA [4]. On this basis, we have reason to believe that a context-based GCN can improve on the performance of previous methods for ALSA. However, the previous context-based GCN methods for ALSA use GCNs with the following limitations:
• They use GCNs limited to a few layers (two or three) due to the vanishing gradient, limiting their performance.
• They have not considered helpful information about the hidden context between the words.
These two limitations motivated us to propose a novel aspect-level SA method based on CNN over BERT-GCN to capture the context surrounding aspects and their sentiments more effectively. Why do we choose to combine these models? BERT is an embedding model built on the Transformer encoder, which uses attention to establish relationships between the words of a sentence. Unlike embedding models that take one word at a time as input, BERT takes the entire sentence as input. Therefore, BERT can learn the contexts between words in the sentence well [5]. Unlike basic grammar models that mainly depend on statistical characteristics, BiLSTM learns context information by encoding the sentence in both directions. Therefore, BiLSTM can capture the real meaning hidden between words from their context [3]. GCNs can effectively learn graph representations and have obtained satisfactory results in various applications and tasks, particularly classification tasks [25]. Meanwhile, CNNs have achieved superior performance in different tasks and applications, including sentence-level SA and document classification [14]. Therefore, we have good reason to believe that a combination of these models can enhance the performance of ALSA. The proposed method includes the following steps. First, the words in a sentence are converted into vectors using BERT [5]. Second, contextualized word representations are created by a BiLSTM [7] over the word vectors. Third, significant features are extracted and represented using a GCN with convolutional layers over the contextualized word representations. Finally, aspect-level sentiments are classified using a CNN over the feature vectors. Experiments on three benchmark datasets show that our proposed model improves on the performance of previous context-based GCN methods.
The contributions of our proposed method are summarized as follows:
• We handle the disadvantage of limiting the GCN to a few layers by adding convolutional layers of the CNN model after the GCN layers.
• We consider further helpful information about the hidden context between the words by combining the BERT and BiLSTM models.
The remaining sections of this paper are organized as follows. Section 2 reviews the state-of-the-art works from which we have inherited techniques and mechanisms. Section 3 elaborates on the research problem, including the problem definition and research questions. Section 4 presents a mathematical model to answer the research questions. Section 5 offers a brief introduction to the datasets and presents the experimental results of the proposed approach versus certain well-known methods. The conclusions and future directions are presented in Section 6.

II. RELATED WORKS
Various deep learning-based ALSA methods capture the context information of aspects and their sentiments. TD-LSTM [19] constructs aspect-specific representations from the left and right contexts of an aspect and then uses two LSTMs to model them for ALSA, with a highest accuracy and F1 score of 78.00% and 68.43%, respectively, on the Restaurant dataset. ATAE-LSTM [21] creates an attention vector by integrating the aspect embedding into a hidden state vector. This attention vector is then appended to each word vector to better capture the aspect information. This model obtained a highest accuracy of 77.20% on the Restaurant dataset. IAN [12] applies attention to aspects and contexts by creating aspect and context representations separately. These representations are then concatenated to predict the sentiment polarity of the aspects. This model obtained a highest accuracy of 78.60% on the Restaurant dataset. AOA [8] focuses on learning aspect and sentence representations to automatically capture the crucial parts of sentences. This model obtained a highest accuracy and F1 score of 79.06% for the Restaurant dataset and 70.20% for the Twitter dataset.
Recently, graph neural networks, especially GCNs, have been widely applied to ALSA owing to their representation power, for example in [22], [27], and [15]. In addition, to the best of our knowledge, the best accuracy and F1 scores of previous ALSA models are all obtained by GCN-based methods. GCN [11] is a graph-structured deep learning method and one category of graph neural network models. Consider a graph $G = \{V, E, A\}$, where $V$ and $E$ are the sets of nodes and edges, respectively, and $A$ is the adjacency matrix. GCNs learn node representations using convolutional layers that integrate information from neighbors on the graph [23]. A GCN with one convolutional layer can represent a node using only the information of its immediate neighbors. In contrast, a GCN with multiple layers can represent nodes using information from a wider neighborhood. GCN layers can be made deeper by extending them with the convolutional layers of the CNN model, as presented by Bijari et al. in [1]. However, this CNN-GCN applies the CNN directly over the GCN without contextualized word representations and has not been applied to ALSA. Various GCN-based ALSA methods also consider context information. AHGCN-WIN [22] is a model that focuses on explicit aspects based on contextual representations of graph nodes. A BiLSTM is first used to capture the context of adjacent words. A GCN with multiple layers is then used to capture the sentiment features of the aspect words and the remaining words in the opinion. Finally, a mask layer is applied to capture aspect-specific parts effectively. AHGCN-WIN obtained satisfactory performance, with a highest accuracy and F1 score of 82.02% for the Restaurant dataset and 73% for the Laptop dataset. However, this method has the two limitations presented in Section 1. Hier-GCN [2] introduces a model with two parallel graph convolutional layers that can encode both intra-relations among aspects and inter-relations between aspects and sentiments. Hier-GCN obtained a highest F1 score of 74.55% on the Restaurant-16 dataset. However, this method has the first limitation presented in Section 1 and has not completely solved the second. GCNSA [4] combines GCN and LSTM models: GCN convolutional layers are applied to the text graph to obtain hidden representations of the full text, while the LSTM is extended with an attention mechanism to capture information from specific regions. GCNSA obtained a highest F1 score of 78.12% on the Restaurant dataset. However, this method has both limitations presented in Section 1.
Based on the above observations, we have both the basis and the motivation to improve the performance of GCN-based ALSA methods. We ask whether using CNN over BERT-GCN improves the performance of ALSA. This is the hypothesis that we test in this paper.

III. RESEARCH PROBLEM
A. PROBLEM DEFINITION
Given a finite set of $n$ opinions $O = \{o_1, o_2, \ldots, o_n\}$ regarding a specific entity, for a specific opinion $o_i$, let $a_{ij}$ be the $j$-th aspect of the given entity, $c_{ij}$ be the significant features related to aspect $a_{ij}$, and $sen_{ij}$ be the sentiment of aspect $a_{ij}$. The objective of this proposal is to construct a CNN over BERT-GCN model for ALSA. This objective can be formalized by finding a mapping function $F$ as follows:

The main aim of this study is to propose a CNN over BERT-GCN method for aspect-level SA. Therefore, we attempt to answer the following research questions:
• How can words be converted into contextualized word representations by combining the BERT and BiLSTM models?
• How can we extract and represent significant features using a GCN over contextualized word representations?
• How can the CNN over BERT-GCN model be built by applying the CNN to the significant features?
• How can the CNN be used over the BERT-GCN model to analyze the sentiment of aspects?

IV. PROPOSED METHOD
In this section, we describe the concept and flow of our proposed CNN over the BERT-GCN model. The proposed method is illustrated in Figure 1.
The CNN over the BERT-GCN model includes the following steps:
• Convert the words of the input sentence into word vectors using the BERT model.
• Extract contextualized word representations using the BiLSTM model over word vectors.
• Extract significant features using the GCN model over contextualized word representations.
• Construct the CNN over the BERT-GCN model using the CNN over significant features.

Each word $w_i$ of the input sentence is converted into a word vector as follows:

$$x_i = \mathrm{BERT}(w_i), \quad x_i \in \mathbb{R}^{d_w}$$

where $\mathrm{BERT}(w_i)$ is the word vector extracted using pretrained BERT [20], and $d_w$ is the dimension of the word vector. That is, from the input sequence $s$, we obtain the sequence representation $X = \{x_1, x_2, \ldots, x_m\}$.

B. CREATING CONTEXTUALIZED WORD REPRESENTATIONS
Contextualized word representations convert each word in a sentence into a vector that aggregates information from the vectors of the entire input sentence. In this study, we employ the BiLSTM model [9] over BERT embeddings to create contextualized word representations. BiLSTM can learn the context information and latent meaning of words by reading the input sentence in two directions [3]. Contextualized representations are created according to the following steps:
Input layer: The sequence representation $X = \{x_1, x_2, \ldots, x_m\}$ is the input to the BiLSTM model.
BiLSTM layer: This layer integrates contextual information from the remaining words in the sentence in two directions for the target word [15]. BiLSTM includes two LSTMs, forward ($\overrightarrow{\mathrm{lstm}}$) and backward ($\overleftarrow{\mathrm{lstm}}$), to encode the sentence from left to right ($\overrightarrow{h}_i$, $i = 1, \ldots, m$) and from right to left ($\overleftarrow{h}_i$, $i = m, \ldots, 1$) as follows:
where $W$ is the weight matrix; for example, W x

C. EXTRACTING AND REPRESENTING SIGNIFICANT FEATURES
We build a sentence graph convolutional network through the following steps:
Building the sentence graph: A sentence graph $G = (V, E, A)$ includes a set $V$ of nodes corresponding to the $m$ words in the sentence, a set $E$ of edges indicating the dependencies of adjacent node pairs in the syntactic dependency tree, and an adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$ showing the relations between nodes, which is defined as follows:

In addition, graph $G$ has a node feature matrix $Q = [H] \in \mathbb{R}^{|V| \times d_h}$, where each row $H_i$ represents the contextualized representation of word node $v_i \in V$.
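As an illustration of the sentence-graph step, the following minimal Python sketch builds an adjacency matrix from dependency edges. It assumes that $A_{ij} = 1$ when words $i$ and $j$ are linked in the dependency tree or when $i = j$ (self-loops), and $0$ otherwise; this is a common convention in dependency-based GCNs and may differ from the exact definition used in this paper. The function name `build_adjacency` and the hand-written edges are hypothetical and are not the output of a real parser.

```python
from typing import List, Tuple


def build_adjacency(num_words: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Return a symmetric 0/1 adjacency matrix with self-loops (assumed convention)."""
    adj = [[0] * num_words for _ in range(num_words)]
    for i in range(num_words):
        adj[i][i] = 1  # self-loop so that each node keeps its own features
    for head, dependent in edges:
        adj[head][dependent] = 1
        adj[dependent][head] = 1  # treat the dependency as undirected
    return adj


# Hand-written (head, dependent) pairs for illustration only.
words = ["Phone", "color", "is", "not", "nice"]
edges = [(1, 0), (4, 1), (4, 2), (4, 3)]
A = build_adjacency(len(words), edges)

for word, row in zip(words, A):
    print(f"{word:>6}: {row}")
```

Each row of `A` marks the words that a node exchanges information with in one GCN layer, so "color" is connected to both "Phone" and "nice" in this toy example.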
Creating node embeddings: This step consists of the following sub-steps:
Node representation: Matrices $K \in \mathbb{R}^{|V| \times d_h}$ and $A \in \mathbb{R}^{|V| \times |V|}$ are fed into a conventional GCN proposed by Kipf et al. [11] to create node representations as follows:

where $l$ is the number of GCN layers, $K \in \mathbb{R}^{|V| \times d_k}$, and $H^{(0)} = Q$; $\alpha$ is a non-linear activation function such as ReLU; $W^{(l)} \in \mathbb{R}^{d_k \times |V|}$ are the transformation matrices in the $l$-th layer; and $b^{(l)}$ are the biases of the GCN layers.
$D$ is the normalized symmetric matrix of $A$, and $M$ is the degree matrix of $A$, where:

Position-aware transformation: This step reduces noise and bias when processing the GCN. In this study, we use a positional attention mechanism to capture the essential parts of the sentence with respect to the aspect as follows:

where $W_A \in \mathbb{R}^{(d_h + d_p) \times (d_h + d_p)}$, $b_A \in \mathbb{R}^{d_h \times d_p}$, and $u_A \in \mathbb{R}^{d_h \times d_p}$ are learnable matrices; $\oplus$ is the concatenation operator; and $p_i \in \mathbb{R}^{d_p}$ is the position weight of the $i$-th word, calculated as

where start and end are the positions of the first and last words of the aspect in the sentence, respectively.
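The node-representation step can be sketched in plain Python as follows. The sketch assumes the propagation rule of Kipf and Welling, $H^{(l)} = \mathrm{ReLU}(D H^{(l-1)} W^{(l)} + b^{(l)})$ with $D = M^{-1/2} A M^{-1/2}$ and $M_{ii} = \sum_j A_{ij}$, which matches the symbols described above but is not confirmed to be the exact form used in this paper. The helper names (`matmul`, `normalize_adjacency`, `gcn_layer`) and the toy values are hypothetical, and the position-aware transformation is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Symmetric normalization M^{-1/2} A M^{-1/2}, with M the degree matrix of A."""
    degrees = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(degrees[i] * degrees[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def gcn_layer(norm_adj: Matrix, h: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """One GCN layer: aggregate neighbor features, project them, add bias, apply ReLU."""
    aggregated = matmul(norm_adj, h)
    projected = matmul(aggregated, weight)
    return [[max(0.0, value + b) for value, b in zip(row, bias)] for row in projected]


# Toy 3-node path graph with self-loops and 2-dimensional node features.
adjacency = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the aggregation stays visible
bias = [0.0, 0.0]

norm_adj = normalize_adjacency(adjacency)
hidden = gcn_layer(norm_adj, features, weight, bias)
print(hidden)
```

Stacking several calls to `gcn_layer` lets a node receive information from neighbors that are several edges away, which is the effect of adding GCN layers discussed in Section 2.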

D. BUILDING THE CLASSIFIER MODEL
1) CONVOLUTIONAL LAYER
The convolutional layer decreases the dimension of the node embeddings by creating feature vectors $z_i$, which are determined by sliding a filter $F \in \mathbb{R}^{f \times d_h}$ of length $f$ from $i$ to $i + f - 1$ and extracting important information [15] as follows:

where $i = [1, |V|]$ is the order of the node representations $E$, the filter is applied with the convolution operator, ReLU is the activation function, and $b$ is a bias term. Therefore, the feature vectors are created from node representations as follows:

2) MAX-POOLING LAYER
The max-pooling layer creates feature vectors of the same size by selecting the maximum value from each vector $z_i$. This is needed because the size of the feature vectors $z_i$ depends on the dimensions of both matrices $E$ and $F$; the dimensions of the vectors $z_i \in z$ therefore differ when the sentence length and filter size differ. The new feature vectors $\hat{z}$ are defined as follows:

where $\hat{z}_i = \mathrm{Max}(z_i)$.
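The convolutional and max-pooling layers described above can be sketched together in Python. The sketch assumes a convolution without padding, so that a filter of length $f$ over $|V|$ node embeddings yields $|V| - f + 1$ values, and it assumes that one maximum is kept per filter. The function names and the toy embeddings and filters are hypothetical.

```python
from typing import List


def conv1d_relu(embeddings: List[List[float]], kernel: List[List[float]],
                bias: float) -> List[float]:
    """Slide a filter of shape (f x d_h) over the node embeddings and apply ReLU."""
    f = len(kernel)
    features = []
    for i in range(len(embeddings) - f + 1):
        total = bias
        for r in range(f):
            # Dot product between the (i + r)-th embedding and the r-th filter row.
            total += sum(e * k for e, k in zip(embeddings[i + r], kernel[r]))
        features.append(max(0.0, total))
    return features


def max_pool(features: List[float]) -> float:
    """Keep only the largest activation, whatever the length of the feature vector."""
    return max(features)


# Four node embeddings of dimension 2 and two filters of lengths 1 and 3.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
filters = [
    [[1.0, -1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
]

z = [conv1d_relu(E, kernel, bias=0.0) for kernel in filters]
z_hat = [max_pool(z_i) for z_i in z]
print([len(z_i) for z_i in z])  # feature vectors have different lengths
print(z_hat)                    # one value per filter after max-pooling
```

The two filters produce feature vectors of lengths 4 and 2, and max-pooling reduces each of them to a single value, which is why the pooled output has a fixed size regardless of sentence length and filter size.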

3) FULLY CONNECTED LAYER
The fully connected layer fine-tunes the characteristics of the previous layers to determine the aspect sentiment as follows:

where $W_E \in \mathbb{R}^{l \times |V|}$ and $b \in \mathbb{R}^{l}$ are the weight matrix and the bias of this layer, and $l$ is the number of sentiment classes.
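A minimal Python sketch of this layer is given below. It assumes that the class scores are turned into a distribution with a softmax, $\hat{y} = \mathrm{softmax}(W_E \hat{z} + b)$, which is consistent with the cross-entropy training objective but is not stated explicitly here. The weights, the input features, and the label order are hypothetical.

```python
import math
from typing import List


def fully_connected(features: List[float], weights: List[List[float]],
                    bias: List[float]) -> List[float]:
    """Compute one score per sentiment class: W_E * z_hat + b."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the class scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["negative", "neutral", "positive"]   # hypothetical class order
z_hat = [2.0, 3.0]                             # pooled features (toy values)
W_E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one row per sentiment class
b = [0.0, 0.0, 0.0]

probs = softmax(fully_connected(z_hat, W_E, b))
predicted = labels[probs.index(max(probs))]
print(probs, predicted)
```

The predicted sentiment is the class with the largest probability; the toy weights here only serve to make the computation easy to follow.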

E. TRAINING MODEL
The CNN over BERT-GCN model is trained by minimizing the cross-entropy loss function as follows:

where $l$ is the number of sentiment classes in the training set, $y_i$ is the distribution of the true sentiment of the $i$-th class, and $\hat{y}_i$ is the distribution of the predicted sentiment of the $i$-th class; $\lambda$ is the $L_2$ regularization coefficient; and $\theta$ is the parameter set from the previous layers. The steps to train the CNN over BERT-GCN model are illustrated in Figure 2.
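The training objective described above can be sketched in Python for a single example. The sketch assumes the usual form $\mathcal{L} = -\sum_{i=1}^{l} y_i \log \hat{y}_i + \lambda \lVert \theta \rVert_2^2$, built from the symbols defined in this section; the exact equation is not reproduced here, so this form is an assumption. The function name, the small constant `eps`, and the toy values are hypothetical, and this block is not a rendering of Figure 2.

```python
import math
from typing import List


def cross_entropy_l2(y_true: List[float], y_pred: List[float],
                     params: List[float], lam: float, eps: float = 1e-12) -> float:
    """Cross-entropy over the sentiment classes plus an L2 penalty on the parameters."""
    cross_entropy = -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))
    l2_penalty = sum(w * w for w in params)
    return cross_entropy + lam * l2_penalty


y_true = [0.0, 1.0, 0.0]   # one-hot true sentiment (toy example)
y_pred = [0.2, 0.7, 0.1]   # predicted distribution (toy example)
theta = [0.5, -0.5]        # flattened model parameters (toy example)

loss = cross_entropy_l2(y_true, y_pred, theta, lam=1e-5)
print(round(loss, 6))
```

With a one-hot target, only the predicted probability of the true class contributes to the cross-entropy term, while the $L_2$ term grows with the magnitude of the parameters.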

V. EXPERIMENTAL EVALUATION
A. DATASET AND EXPERIMENTAL SETUP
In this study, to demonstrate the efficacy of the proposed model and ensure a fair comparison with other methods, we used three benchmark datasets: Laptop, Restaurant [16], and Twitter [6]. The Restaurant and Laptop datasets contain opinions on restaurants and laptops. These opinions were divided into separate sentences that include at least one aspect and its sentiment. The Twitter dataset consists of opinions regarding celebrities, products, and companies. These opinions were also divided into separate sentences, each containing only one aspect and its sentiment. Detailed information on the datasets is shown in Table 1.
We implemented the proposed model using pre-trained BERT with a dimension of 300 and a learning rate of 5e-05. We initialized all model weights with a uniform distribution. The dimension of the hidden vectors was set to 204. The learning rate of the Adam optimizer [10] was 0.001. The coefficient of the L2 regularization was 1e-05, and the batch size was 32. Moreover, the number of CNN and GCN layers was set between 1 and 3, which gave the best-performing depths for the proposed method. The filter sizes used in the CNN were set to (5, 3, 1).
The experimental results were obtained by averaging five runs with random initialization, with F1 and accuracy [17] as the evaluation metrics. We also performed a paired test on F1 and accuracy to verify whether the improvements achieved by our model over the baselines are significant.
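The two evaluation metrics can be computed as in the following Python sketch. It assumes that F1 is macro-averaged over the sentiment classes, which is common for these benchmarks but is not specified here; the function names and the toy labels are hypothetical and do not correspond to any reported result.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of aspects whose predicted sentiment matches the true sentiment."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy labels for six aspects, for illustration only.
gold = ["pos", "neg", "neu", "pos", "neg", "neu"]
pred = ["pos", "neg", "pos", "pos", "neu", "neu"]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(round(acc, 4), round(f1, 4))
```

Averaging these two values over the five runs gives the kind of figures reported in the result tables.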

B. BASELINE METHODS
To show that our model performs better than other models, we compared our proposal with the following baseline methods.
• TD-LSTM [19] builds aspect-specific representations from both the left-side and right-side contexts of the aspect and then uses two LSTMs to model them for ALSA.
• ASGCN-DT and ASGCN-DG [24] are aspect-level SA methods that build a GCN over the dependency tree of a sentence, treated as a directed tree (DT) or an undirected graph (DG), to exploit syntactic information and word dependencies.
• ASP-BiLSTM and ASP-GCN [18] are aspect-level SA methods that construct a convolution over a dependency tree model, with a BiLSTM to take advantage of sentence feature representations and a GCN that operates directly on the dependency tree of the sentence to further enhance the role of the embeddings.
• SDGCN-A, SDGCN-G, and SDGCN-BERT [26] are aspect-level SA methods that model aspect-specific representations from their context words with a bidirectional attention mechanism and position encoding, and capture the sentiment dependencies between different aspects in one sentence with a GCN over the attention mechanism.

C. RESULTS AND DISCUSSION
The performance of the sentiment analysis methods over the testing sets of the given datasets is presented in Tables 2, 3, and 4.
From Table 2, we can see that our proposed method achieves its best performance, in both accuracy and F1 score, on the restaurant dataset and its lowest performance on the Twitter dataset. Why is there a difference in performance when using the same model on different datasets? In general, the results obtained when training a model are better if the training dataset is larger. This holds for the restaurant and laptop datasets. However, in the case of the restaurant and Twitter datasets, although the number of opinions used for training in the Twitter dataset is almost twice that of the restaurant dataset, the results obtained on the restaurant dataset are better. The main reason for this is the quality of the samples in the two datasets. This result could be improved by carefully preprocessing a dataset before using it.
A performance comparison of the models is presented in Table 3.
It can be seen that our model outperforms all the compared models on the restaurant and Twitter datasets and achieves promising results on the laptop dataset compared with the baselines, with the exception of SDGCN-BERT. The results demonstrate the effectiveness of our model and of using CNN layers after the GCN layers to capture the essential features, together with a combination of BiLSTM and BERT to capture the contextualized representations. Why can the proposed method enhance the accuracy and F1 score of the baseline methods? In this study, the BERT model captured the semantics of the text well, and the BiLSTM model extracted the context of the sentence. Moreover, the GCN model represents contextual text well and is currently one of the algorithms that achieve good accuracy for sentiment analysis. The results indicate that using a GCN for contextual text representation has a significant impact on the accuracy of sentiment analysis methods. Why does the proposed method not match the performance of SDGCN-BERT on the laptop dataset? One main reason for this may be that the laptop dataset is less sensitive to context information and more responsive to syntactic information.
The impact of the number of GCN and CNN layers is presented in Table 4. The number of GCN and CNN layers is a significant parameter that needs to be set in our model. To determine reasonable values, we experimented with 1 to 5 GCN layers and 1 to 5 CNN layers for the proposed model. In the experimental process, we observed that overfitting occurs if the number of CNN layers or the number of GCN layers is greater than three. To our knowledge, the main reason for this decrease in performance may be that the model becomes more challenging to train as the number of parameters increases. Therefore, we only report the performance for a maximum of 3 CNN layers and 3 GCN layers. It can be seen that the proposed method obtained the best accuracy and F1 on the restaurant and laptop datasets when the numbers of CNN and GCN layers were both 1. However, it achieved the best performance on the Twitter dataset when the numbers of CNN and GCN layers were 1 and 3, respectively. Why did the difference in performance occur between the datasets for the same number of GCN and CNN layers? The main reason is that opinions in the laptop and restaurant datasets are written in a more syntactically regular way than those in the Twitter dataset. Therefore, the proposed model can capture the context information of these two datasets well with only one GCN layer.

VI. CONCLUSION AND FUTURE WORKS
This study introduced a method to improve the performance of GCN-based aspect-level sentiment analysis by using CNN over BERT-GCN to capture contextualized information more effectively. The experimental results show that the proposed method improves the performance of aspect-level sentiment analysis on the three benchmark datasets. However, this proposal does not consider all contextual factors, semantic relations, and emotional knowledge simultaneously when building text-representation graphs. In the future, we will focus on building graphs that simultaneously represent contextual factors, semantic relations, and sentimental knowledge to enhance the performance of sentiment analysis methods.
NGOC THANH NGUYEN (Senior Member, IEEE) is currently a Full Professor with the Wroclaw University of Science and Technology, and the Head of the Information Systems Department, Faculty of Computer Science and Management. He is the author or coauthor of five monographs and more than 350 journal and conference papers. He has given 22 plenary and keynote speeches for international conferences, and more than 40 invited lectures in many countries. His research interests include collective intelligence, knowledge integration methods, inconsistent knowledge processing, and multi-agent systems. He served as a member of the Council of Scientific Excellence of Poland, a member of the Committee on Informatics of the Polish Academy of Sciences, and an expert for the National Center of Research and Development and the European Commission in evaluating research projects in several programs like Marie Sklodowska-Curie Individual Fellowships, FET, and EUREKA. He was a General Chair or a Program Chair of more than 40 international conferences. He also served as the Chair for the IEEE SMC Technical Committee on Computational Collective Intelligence. He is also the Honorary Chair of the Scientific Board with Nguyen Tat Thanh University. He has edited more than 30 special issues in international journals, 52 books, and 35 conference proceedings. He serves as an Editor-in-Chief for International Journal of Information and Telecommunication (Taylor & Francis), Transactions on Computational Collective Intelligence (Springer), and Vietnam Journal of Computer Science (World Scientific). He is also an Associate Editor-in-Chief for several prestigious international journals, among others, Journal of Intelligent and Fuzzy Systems, and Applied Intelligence. In 2009, he was granted the title Distinguished Scientist of ACM. He was also a Distinguished Visitor of IEEE and a Distinguished Speaker of ACM.
DOSAM HWANG received the Ph.D. degree from Kyoto University, Kyoto, Japan. He is a Full Professor with the Department of Computer Engineering, Yeungnam University, South Korea. He served as the Head of the Department of Computer Engineering, Yeungnam University, for five years, from 2005 to 2009. He has also held a position as a Principal Researcher at the Korea Institute of Science and Technology (KIST) and has been a Visiting Professor at KAIST. He has more than 50 publications. His research interests include natural language processing, ontology, knowledge engineering, information retrieval, and machine translation. He has been a steering committee member of the ICCCI, ACIIDS, and MISSI international conferences. He was awarded the Prize for Good Conduct from Kyunghee High School, in 1973. He has been the Co-Chair of several international conferences. He has been the