ALBERTC-CNN Based Aspect Level Sentiment Analysis

Most aspect level sentiment analysis networks cannot extract the global and local information of the context at the same time. To solve this problem, this study proposes an aspect level sentiment analysis model named A Lite Bidirectional Encoder Representations from TransConvs combined with ConvNets (ALBERTC-CNN). First, the global sentence information and local emotion information in a text are extracted by the improved ALBERTC network, and the input aspect level text is represented by a word vector. Then, the feature vector is mapped to the emotion classification number by a linear function and a softmax function. Finally, the aspect level sentiment analysis results are obtained. The proposed model is tested on two datasets of the SemEval-2014 open task, the laptop and restaurant datasets, and compared with traditional networks. The results show that, compared with the traditional networks, the classification accuracy of the proposed model is improved by approximately 4% and 5% on the two datasets, whereas the F1 value is improved by approximately 4% and 8%. Additionally, compared with the original ALBERT network, the accuracy is improved by approximately 2%, and the F1 value by approximately 1%.


I. INTRODUCTION
Sentiment analysis is one of the important research tasks in the field of natural language processing, and it has long been a hot research direction. However, most of the current sentiment analysis methods operate at the document and sentence levels. These two coarse-grained types of sentiment analysis methods have solved many problems to a certain extent, but they still cannot meet the needs of most current applications. Therefore, the development of fine-grained aspect level sentiment analysis methods has attracted wide research attention. Aspect level sentiment analysis refers to the sentiment tendency analysis of one or several aspects of a given sentence, such as ''The menu is limited, but almost all of the dishes are excellent.'' The emotional tendency of this sentence toward the aspect item ''menu'' is negative, but its emotional tendency toward the aspect item ''dishes'' is positive; thus, aspect level emotional analysis is the task of analyzing the different emotional information of different aspects in a sentence.
The associate editor coordinating the review of this manuscript and approving it for publication was Arif Ur Rahman .
By analyzing the emotional tendency of different aspects of comment texts in different fields and applying the analysis results, reasonable help and guidance can be provided to consumer platforms and consumers, such as monitoring the service attitude and consulting experience in an artificial service system, analyzing and controlling the current trend of public opinion, and recommending reasonable and personalized goods or experience items.
Aspect level sentiment analysis methods can be roughly divided into four categories. The first category includes methods based on rules and dictionaries. Mahalakshmi et al. [1] improved the analysis effect by generating context-aware sentiment dictionaries. Alshari et al. [2] obtained more effective classification results by expanding the non-opinion words in sentiment vocabulary resources. These methods are easy to understand and use, but their accuracy is not high, and they suffer from dimension explosion and gradient disappearance, which limit their wide application in practice. The second category includes machine learning-based methods, such as those proposed by Álvarez-López et al. [3], Mubarok et al. [4], and Rafeek et al. [5], which use the support vector machine [6], naive Bayes [7], and logistic regression [8] methods, respectively, to analyze each sentence part. These methods can improve the accuracy of emotion analysis to a certain extent but need to be combined with complex feature engineering for feature annotation and extraction, which requires considerable manpower and material and financial resources; moreover, the methods are not easily transferable, thus requiring heavy retraining for different tasks. All these problems limit the further development of aspect level emotion analysis based on machine learning. The third category includes deep learning-based methods. With the development of convolutional neural networks (CNNs) [9] and long short-term memory (LSTM) networks [10], neural networks have achieved excellent results in most tasks in the field of natural language processing. Chen [11] and Xue [12] applied these two network types to aspect level emotion analysis and achieved good results. These methods can effectively solve the problems of gradient explosion and gradient disappearance, but their accuracy needs further improvement.
With the extensive use of pre-training models in the field of image processing, many pre-training models have also been developed in natural language processing, including GPT (Generative Pre-Training) [13], ELMo (Embeddings from Language Models) [14], BERT (Bidirectional Encoder Representations from Transformers) [15], RoBERTa (A Robustly Optimized BERT Pretraining Approach) [16], ALBERT (A Lite BERT) [17], and other pre-training language models. The performance of pre-training language models in various tasks has been surprisingly good, so the fourth method category, which includes methods based on a pre-training language model, was developed and has been widely studied and applied in many fields. Hoang et al. [18] used BERT to analyze aspect level emotion and achieved good results. Although this method can accurately extract the global emotion information in an aspect level text, it ignores the important local emotion information in the text. Therefore, this study explores using a pre-training model to extract global features and a convolution network to extract local features. The proposed model can make full use of the information in sentences and improves the accuracy of emotion analysis.
The main contributions of this paper can be summarized as follows.
(1) The transformer feature extractor in the ALBERT network is improved by convolution networks, and the ALBERTC model is obtained. This model not only can accurately extract the global semantic information and location information from a text but also can effectively extract the local aspect information and emotional information so as to obtain a more comprehensive aspect level text information representation.
(2) The output layer of the ALBERTC model is combined with convolution networks, and the obtained model is named the ALBERTC-CNN. This model can effectively improve the classification accuracy of aspect level emotion.

The rest of the study is organized as follows. The proposed network model is introduced in Section II. The experimental verification of the proposed model and the experimental results analysis are presented in Section III. In Section IV, the main conclusions are drawn, and future work directions are given.

II. PROPOSED NETWORK
The main problem to be solved in this study is the analysis of aspect level emotional information. For given sentence information L_a = {l_a1, l_a2, l_a3, ..., l_aN} of length N and aspect information L_b = {l_b1, l_b2, l_b3, ..., l_bM} of length M, the emotional tendency of an aspect item L_b in a sentence L_a can be effectively analyzed using the proposed model.
This study proposes an improved aspect level emotion analysis model based on the ALBERT network. As shown in Fig. 1, the proposed model is mainly composed of the input layer, TransConvs encoding layer, convolution layer, fully connected layer and softmax layer. The input layer mainly represents the aspect level text as a word vector, so that the later feature extractor can better extract the information from the text. The TransConvs feature extractor mainly extracts the global sentence feature information and position feature information, as well as the local aspect information and emotion information from the text vector. The output layer's convolution layer mainly strengthens the feature extractor output. The main function of the fully connected and softmax layers is to map the extracted feature information with the corresponding sentiment categories of the text and obtain the sentiment tendency of all aspects.

A. INPUT LAYER
For the aspect level sentiment analysis task, the input sentence needs to be represented as a vector form before the next step of coding representation learning.
The input layer plays the role of converting the sentence into a vector, as shown in Fig. 2. This layer first represents the input sentence as [CLS] aspect words [SEP] sentence words [SEP]. Then, the token semantic information, token location information, and sentence segment information of the input are represented in word vector form by the embedding layer.
After the input sentence passes through the input layer, the final output is given by equation (1):

E_i = E_token + E_segment + E_position, (1)

where E_i represents the vector information representation of the i-th word of the input sentence, E_token represents the token semantic coding information of each word, E_segment represents the segmentation information between the aspect item and the sentence, and E_position represents the position information of each word.
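As an illustration, the input construction described above can be sketched in plain Python (a minimal, hypothetical example; the real model uses learned embedding tables, whereas the toy vectors here only illustrate the element-wise sum in equation (1)):

```python
def build_albert_input(aspect_tokens, sentence_tokens):
    # [CLS] aspect words [SEP] sentence words [SEP], as described in Fig. 2
    tokens = ["[CLS]"] + aspect_tokens + ["[SEP]"] + sentence_tokens + ["[SEP]"]
    # segment ids separate the aspect part (0) from the sentence part (1)
    segments = [0] * (len(aspect_tokens) + 2) + [1] * (len(sentence_tokens) + 1)
    # each token also carries its position index
    positions = list(range(len(tokens)))
    return tokens, segments, positions

def input_embedding(e_token, e_segment, e_position):
    # E_i = E_token + E_segment + E_position (element-wise sum of three vectors)
    return [t + s + p for t, s, p in zip(e_token, e_segment, e_position)]

tokens, segments, positions = build_albert_input(
    ["menu"], ["the", "menu", "is", "limited"])
```

In the real model, `e_token`, `e_segment`, and `e_position` would be rows of learned embedding tables indexed by `tokens`, `segments`, and `positions`.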
B. TRANSCONVS ENCODING LAYER
The ALBERT model is a simplified BERT model. Compared with the BERT model, the ALBERT model has fewer training parameters and can achieve better performance because it uses factorized embedding parameterization and cross-layer parameter sharing, which decrease the number of training parameters, reduce the hardware requirements of training, and shorten the training time. In addition, in the ALBERT model, the simple NSP (Next Sentence Prediction) task is replaced with the more complex SOP (Sentence-Order Prediction) task, which effectively prevents overfitting and increases the accuracy. The main structural part of the basic ALBERT model is the encoder part of the Transformer feature extractor, which performs well in global information extraction and can extract the important position information, sentence information, and token semantic information of the whole sentence. The Transformer is mainly composed of a feed-forward module and a multi-head self-attention module [32]. As mentioned in [32], the self-attention mechanism can capture long-distance dependencies by using key, query, and value triples, so as to effectively extract the global information of the context; however, this inevitably leads to the loss of part of the local information. As shown in [33], a CNN captures local information through local receptive fields, shared weights, and similar mechanisms, so it has a strong ability to capture local information. Therefore, this study uses a CNN to improve the Transformer and obtain the TransConvs feature extractor. TransConvs combines the excellent ability of the CNN to capture local features with the excellent ability of the Transformer to extract global features, so as to improve the accuracy and F1 value of aspect level sentiment analysis.
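The parameter saving of the factorized embedding parameterization mentioned above can be illustrated with a quick count (a sketch with illustrative sizes; the vocabulary size 30000, hidden size 768, and embedding size 128 are typical base-model values, not figures taken from this paper's experiments):

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Parameter count of the input embedding.

    embed_size=None models the BERT-style V x H embedding; a value E
    models ALBERT's factorized V x E + E x H decomposition.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size

bert_style = embedding_params(30000, 768)         # V x H
albert_style = embedding_params(30000, 768, 128)  # V x E + E x H
```

With these sizes, the factorized embedding needs roughly one sixth of the parameters of the unfactorized one.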
In general, the TransConvs module is more suitable for the text aspect level emotional analysis task than the Transformer. It is mainly composed of an MHSAM (Multi-Headed Self-Attention Module), an FFNM (Feed-Forward Network Module), and a CNNM (Convolution Network Module). As shown in Fig. 3, the first part of the TransConvs module is the multi-head self-attention module, which is composed of a multi-head attention module, a linear layer, an inter-layer normalization function, and a dropout function. The structure of the multi-head attention module is shown in Fig. 4. When the word vector features are input into this module, the module first extracts the relationship between every two words in a sentence using the multi-head attention mechanism; then, the score matrix is randomly inactivated by the dropout function to obtain the output u, and the output u is connected with the initial input by a residual connection to obtain the summed output. Finally, the final output y is obtained by the LayerNorm.
Assuming that the input of the module is denoted as x_i and the output processed by the module is denoted as y_i, the calculation formulas of the module are given by (2) and (3), where subscript i refers to the sentence feature representation of the i-th input sentence.
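For reference, the scaled dot-product attention at the heart of the multi-head module can be sketched in plain Python (a single head without the learned Q/K/V projections, multi-head splitting, dropout, or LayerNorm that the full module adds):

```python
import math

def softmax(row):
    # numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * V[j][d] for j, w in enumerate(w_row))
             for d in range(len(V[0]))] for w_row in weights]
```

Because the attention weights for every query position span all key positions, each output mixes information from the whole sentence, which is why self-attention captures global context well.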
The feed-forward module is connected to the multi-head attention module. It is composed of two layers consisting of a linear transformation function and a nonlinear activation function. After the input y passes through the two linear layers with random deactivation, the output obtained by the GELU activation function is connected with the input y by a residual connection, as shown in Fig. 5. Finally, after the residual is added, the final output z of the module is obtained by inter-layer standardization.
Assuming that the initial input is denoted as y_i and the output processed by the module is denoted as z_i, the calculation formulas of the module are given by (4) and (5), where subscript i refers to the text feature representation of the i-th input sentence.
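A minimal sketch of the feed-forward module follows (two linear layers, the GELU activation, and the residual connection; the dropout and inter-layer standardization described above are omitted for brevity, and the weight layout is a hypothetical column-per-unit convention):

```python
import math

def gelu(x):
    # tanh approximation of GELU, commonly used by BERT-family models
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def feed_forward(y, W1, b1, W2, b2):
    # first linear layer followed by the GELU activation
    h = [gelu(sum(w * v for w, v in zip(col, y)) + b)
         for col, b in zip(W1, b1)]
    # second linear layer back to the model dimension
    out = [sum(w * v for w, v in zip(col, h)) + b
           for col, b in zip(W2, b2)]
    # residual connection with the module input y
    return [o + v for o, v in zip(out, y)]
```

With all weights zero the inner layers contribute nothing and the residual path passes y through unchanged, which shows how the residual connection preserves the input signal.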
C. CONVOLUTION LAYER
Inspired by [19], the convolution module in this study uses a pointwise convolution as the basic network layer. Pointwise convolution means that the size of the convolution kernel is one, so the same transformation is performed for each input word.
In this study, a three-layer convolution network is connected in series. The numbers of convolution kernels in the first, second, and third layers are set to 64, 64, and 768, respectively. These parameter settings are used because they provide an effective combination of the ALBERT network and the proposed multi-convolution module and can achieve satisfactory emotion analysis results. An activation function layer is placed after each convolution layer in the network; its activation function is the GELU activation function proposed in [20], which introduces the idea of stochastic regularization to the proposed model, so better results can be achieved than with the ReLU and ELU functions. The last layer of the multi-convolution module is connected with a dropout layer for random deactivation, which generates the final output of the convolution module. The structure of the convolution module is shown in Fig. 6.
Assuming that the initial input is denoted as z_i and the output processed by the module is denoted as v_i, the calculation formulas of the module are given by (6) and (7), where PWConv represents the pointwise convolution operation, δ is the activation function operation, Dropout denotes the random deactivation operation, and subscript i refers to the text feature representation of the i-th input sentence.
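The pointwise (kernel-size-1) convolution used here reduces to one shared linear map applied independently at every sequence position, which can be sketched as follows (toy dimensions; the paper's module stacks three such layers with 64, 64, and 768 kernels):

```python
def pointwise_conv(seq, W, b, activation=lambda x: x):
    # kernel size 1: the same (W, b) linear map is applied at every
    # time step, so the sequence length is always preserved
    return [[activation(sum(w * x for w, x in zip(col, vec)) + bias)
             for col, bias in zip(W, b)]
            for vec in seq]

# toy example: 2-dimensional inputs mapped to 3 output channels
seq = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one column per output channel
b = [0.0, 0.0, 0.0]
out = pointwise_conv(seq, W, b)
```

Chaining three `pointwise_conv` calls with suitable widths mirrors the 64 → 64 → 768 stack described above.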

D. FULLY CONNECTED AND SOFTMAX LAYER
The full connection layer and the Softmax layer are connected in the last layer of the model, whose main role is to map the feature information extracted by the feature extractor to the text emotional tendency information. In this paper, the emotions of the text mainly include positive emotional tendencies, negative emotional tendencies and neutral emotional tendencies. Therefore, as shown in Fig. 7, the full connection layer and Softmax layer map the vector information represented by aspect items and sentence features into three-dimensional vector information.
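This final mapping can be sketched as a linear layer followed by a softmax over the three polarity classes (a minimal sketch with toy weights; the real layer maps the 768-dimensional feature vector):

```python
import math

def classify(features, W, b):
    # linear map from the feature vector to the three polarity classes
    logits = [sum(w * f for w, f in zip(col, features)) + bias
              for col, bias in zip(W, b)]
    # softmax turns the logits into a probability distribution
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # label codes used in this paper: 0 = negative, 1 = positive, 2 = neutral
    return probs, probs.index(max(probs))
```

The predicted polarity is simply the class with the highest softmax probability.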

III. EXPERIMENTAL RESULTS AND ANALYSIS
The experimental environment used in this study included the Windows 10 64-bit operating system, an NVIDIA GeForce GTX 1060Ti graphics card with 4 GB of video memory, and an Intel i7-9750H@2.60GHz processor.

A. DATASETS
To verify the performance of the proposed model, two public comment datasets, laptop and restaurant, were used in the test. The laptop dataset referred to consumers' sentiment toward notebook computers, and it included a total of 2328 training data samples and 638 test data samples. The restaurant dataset referred to the catering industry, and it included a total of 3608 training samples and 1120 test samples. The specific data distribution is shown in Table 1. The data were in the XML format. To analyze the emotion from more aspects, the data were transformed to the TXT format, where zero indicated that the emotion tendency of the aspect item was negative, one indicated that the emotion tendency of the aspect item was positive, and two indicated that the emotion tendency of the aspect item was neutral. The processed data are shown in Figs. 8 and 9.
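The XML-to-TXT conversion described above can be sketched as follows, using the label codes defined in the text (the element and attribute names follow the SemEval-2014 format, shown here on a toy snippet rather than the actual dataset files):

```python
import xml.etree.ElementTree as ET

# label codes defined in the text: 0 negative, 1 positive, 2 neutral
LABEL_MAP = {"negative": 0, "positive": 1, "neutral": 2}

SAMPLE = """<sentences><sentence>
  <text>The menu is limited, but almost all of the dishes are excellent.</text>
  <aspectTerms>
    <aspectTerm term="menu" polarity="negative"/>
    <aspectTerm term="dishes" polarity="positive"/>
  </aspectTerms>
</sentence></sentences>"""

def to_txt_rows(xml_string):
    # one (sentence, aspect term, label) row per annotated aspect
    rows = []
    for sent in ET.fromstring(xml_string).iter("sentence"):
        text = sent.find("text").text
        for term in sent.iter("aspectTerm"):
            rows.append((text, term.get("term"),
                         LABEL_MAP[term.get("polarity")]))
    return rows
```

Emitting one row per aspect term is what allows a single sentence to contribute several training samples with different polarities.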

B. EXPERIMENTAL PARAMETERS AND EVALUATION INDEXES
In the experiment, the evaluation indexes commonly used in natural language processing were adopted: accuracy, macro average recall rate (R_Macro), and macro average F1 value (F1_Macro). In their calculation, T denotes the number of correctly classified samples, F denotes the number of incorrectly classified samples, P denotes the number of correct predictions, N denotes the number of wrong predictions, and P_Macro denotes the macro average precision rate. In this study, the parameter setting of the experiment included two parts: the parameter setting of the pre-training language models and the parameter setting of the deep learning models. In the pre-training language comparison models, the BERT and ALBERT models, the parameters of the official base models were used. The maximum sentence length was 128, the learning rate was set to 2 × 10^-5, the dimension of the word vector was 128, and the dimension of the hidden layer vector was 768. In the other comparison models, the initial word vector dimension was 300, and the xavier_uniform_ initializer was used to initialize the text vectors. The hidden word vector dimension was also set to 300, the learning rate was set to 1 × 10^-3, and the random deactivation rate was set to 0.1. For all the models, the PyTorch 1.6.0 learning framework was used to build and train the models. The Adam optimizer was used to optimize the model parameters, and the cross-entropy loss function was used as the loss function. The specific parameter settings are shown in Table 2.
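For reference, the accuracy, macro average recall, and macro average F1 described above can be computed per class and averaged, as in this standard sketch (consistent with the usual definitions, since the paper's equations are not reproduced here):

```python
def macro_scores(y_true, y_pred, classes=(0, 1, 2)):
    recalls, f1s = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # macro averaging weights every class equally, regardless of its size
    return accuracy, sum(recalls) / len(recalls), sum(f1s) / len(f1s)
```

Macro averaging is the appropriate choice here because the three polarity classes are imbalanced in both datasets.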

C. MODEL COMPARISON EXPERIMENT
To compare the performance of the proposed model, the following networks were selected for comparison. The details of the networks are given in the following, where * indicates that the experimental results are taken from the original paper. (1) SVM * (Kiritchenko et al., 2014): This method is a traditional machine learning-based method. In [21], surface features, lexical features, and syntactic analysis features were combined, and the support vector machine was used to analyze the aspect level text emotion. The experiments were performed using two SemEval-2014 datasets, and the obtained results were competitive for that time.
(2) LSTM: This study uses the standard bidirectional LSTM model to train the corpus; the input is the same as that of the model in this paper. In the IAN network, two LSTM networks are used to model the aspect items and the text information at the aspect level. It uses the hidden information in sentences to generate the attention vector of aspect items and also uses the aspect information to generate the attention vector in sentences. Finally, the outputs of the two LSTM networks are cascaded and combined, and the result is used for aspect level text sentiment analysis. The CABASC model has two attention enhancement mechanisms: a sentence-level content attention mechanism and a context attention mechanism. The former is mainly responsible for giving important information from a global perspective, while the latter is mainly responsible for considering the order of words and their relationships. The results presented in the original study also show that the two attention mechanisms play a key role.
(9) TNet-LF : In [27], two aspect dependent long-term and short-term memory networks were constructed, and the aspect information in the target sentence was actively considered.
(10) AOA : This model mainly models aspects and sentences in a joint way. It can learn aspect information and sentence information at the same time and can automatically focus on important parts of sentences.
(11) MGAN (Fan et al., 2018): The multi-granularity attention network (MGAN) model was proposed in [29], and this model can capture the character-level interaction information between aspect items and context using a fine-grained attention mechanism.
(12) ASGCN (Zhang et al., 2019): In [30], the graph neural network was used to capture syntactic information and long-distance word dependence information to solve the problem that most models cannot effectively use syntactic constraints and long-distance word dependence so as to improve the accuracy of aspect level sentiment analysis.
(13) ASEGCAN (Xiao et al., 2020): In [31], the syntactic edge-enhanced graph convolutional network is adopted to realize sentiment analysis at the aspect level of sentences. With full consideration of different types of domains and boundary constraints, the network can efficiently learn better representations of aspect information and subjective words, thus improving the accuracy of the sentiment analysis at the aspect level to some extent.
(14) BERT: This study adopts the BERT-base-uncased model released by Google, which follows the basic settings of the official website; the learning rate is set to 2 × 10^-5.
(15) ALBERT: The ALBERT-BASE-V2 model is also released by Google, and the parameter settings of this model are the same as that of the Google official settings.
As shown in Table 3, the accuracy, macro average F1 value, and macro average recall rate of the proposed model were better than those of the other models on both datasets. On the laptop dataset, the proposed model achieved the accuracy of 80.56%, the macro average F1 value of 76.97%, and the macro average recall rate of 79.13%. The performances of the proposed model on the restaurant dataset were as follows: the accuracy was 87.23%, the average F1 value of macro was 81.01%, and the average recall rate of macro was 80.44%. The experimental results of the proposed model were better than those of the traditional machine learning network and the deep learning network. Compared with them, the accuracy of the proposed model was improved by 4%-10% on the laptop dataset and by 4%-7% on the restaurant dataset. The presented experimental results demonstrate the superiority and expressiveness of the proposed ALBERTC-CNN model over the other models in aspect level emotion analysis task.

D. MODEL ABLATION EXPERIMENT
Compared to the original ALBERT model, the proposed model has three main improvements. The first is that the Transformer feature extractor is improved using the ConvNets, and the ALBERTC model is obtained. The second is that by adding the ConvNets to the output layer of the ALBERT, the ALBERT model is improved, and the ALBERT-CNN model is obtained. The third is that both the Transformer feature extractor and the output layer of the ALBERT are improved by adding the ConvNets, and the ALBERTC-CNN model is obtained. These improvements were tested experimentally to determine their effects on the model performance. The experimental results are shown in Table 4.
As shown in Table 4, after adding the ConvNets, the model performance was improved compared to the original ALBERT model, and both the accuracy and the macro average recall were improved on both datasets. The experimental performance of the ALBERT-CNN model, which was obtained by improving the output layer of the ALBERT with the ConvNets, was better than that of the original ALBERT model. Compared with the original ALBERT model, the accuracy of the ALBERTC model was improved by 0.78% and 0.63% on the laptop and restaurant datasets, respectively, which indicated that the multi-convolution networks could improve the model performance. When the Transformer feature extractor and the output layer of the ALBERT model were both improved by adding the multi-convolution network (the ALBERTC-CNN model), the best performance was achieved. The results showed that the accuracy, macro average F1 value, and macro average recall rate of the proposed model were improved by approximately 1.5% and 1% on the laptop and restaurant datasets, respectively. The experimental results effectively illustrate the superiority of the proposed model and the necessity of improving the ALBERT model by introducing convolution networks. Compared with the original ALBERT model, the proposed model performs better mainly because it combines the ability of the Transformer to extract long-dependency information [32] with the ability of the CNN to extract fine-grained information [33], which enables the model to extract more comprehensive text information in the feature extraction stage, so that the accuracy and F1 value are effectively improved.

IV. CONCLUSION
Aspect level sentiment analysis can provide improvements in many fields, so it is of great significance to study it. The global semantic information and local emotional information cannot be extracted simultaneously by most networks, so this study proposes an improved aspect level emotion analysis model based on the ALBERT. The proposed model is verified by comparison experiments, and the obtained results show that the improved feature extractor of the ALBERT, TransConvs, can effectively extract the global semantic information, location information, local aspect information, and emotional information from a text at the same time. Introducing the convolution network in the output layer of the ALBERT can enhance the aspect and emotional information representation ability of the network and its text aspect level emotional analysis ability. Compared with the traditional networks and the baseline ALBERT network, the proposed model achieves higher accuracy and F1 value and better convergence, which shows that, owing to its advantages and competitiveness, it can be used for emotion analysis in various aspect level application fields.
In future work, we will further consider the applicability of the ALBERTC-CNN and further improve the structure of the network. In addition, we will try to increase the feature extraction ability of the proposed model and apply it to other natural language processing tasks, such as sentence-level sentiment analysis, named entity recognition, machine translation, and opinion extraction, for the purpose of further evaluation.