Capturing Contextual Factors in Sentiment Classification: An Ensemble Approach

Sentiment classification is a crucial task in sentiment analysis, and has received significant attention from researchers. Previous studies have focused on using several techniques to solve this problem. However, to the best of our knowledge, none of these works has fully investigated the exploitation and the manipulation of contextual information in the text, or taken advantage of the combined power of state-of-the-art models. In this paper, we propose an effective ensemble learning model for the sentiment classification problem. In our system, the contextual information in the text is fully captured by integrating rule-based methods and other state-of-the-art deep learning models. We found that the combination of word embedding representation and the attention mechanism, along with pre-defined rules and specific-domain sentiment dictionaries are helpful in dealing with numerous valence-shifting cases. Although the computational cost of the proposed system is higher than those of certain other algorithms, this system obtains better results than other approaches when tested on three different datasets.


I. INTRODUCTION
Sentiment analysis has been one of the most exciting areas of natural language processing since 2002 [1]. The primary task of sentiment analysis is to automatically determine the value or semantic orientation of a document [2], which refers to a measure of subjective opinion that indicates a polarity (positive, negative, or neutral) and the strength of words, phrases, sentences, or documents. Sentiment analysis has been studied through two approaches [1]: (i) machine learning methods that construct text classifiers by selecting features and algorithms that match the labeled data, and (ii) semantic-oriented approaches that calculate the overall polarity through the semantic orientation of words or phrases in the text. Numerous tools and applications have been developed to exploit user-generated content on the web. However, the performance of these systems is not high due to the complexity of natural languages. Many studies have shown that sentiment analysis is a more complex problem than mere subjective text classification [3]. These approaches remain ineffective in dealing with some linguistic phenomena, especially contextual valence shifters.
Of the two solutions to the sentiment analysis problem mentioned above, the machine learning methodology is dominant because of increasingly abundant training data as well as good-quality data sources that belong to a specified application domain within the problem's goals. For example, to classify sentiments from hotel reviews, training data is typically sourced from tourism-hotel reviews, which can lead to the highest level of accuracy. However, machine learning methods often have difficulty in scenarios where the context causes valence shifting. In contrast, based on rules for calculating the sentiment values of words, phrases, sentences, and paragraphs, semantic orientation approaches can solve many cases with contextual valence shifters. Unfortunately, defining rules that cover all situations of linguistic data is not feasible. This challenge is especially true for the Vietnamese language, as data sources for sentiment analysis are not yet fully available, for example, an ontology such as WordNet [4] or a dictionary such as SentiWordNet [5] for Vietnamese.
In this paper, we propose a novel sentiment classification methodology that adopts ensembles of algorithms to learn on various sub-datasets while focusing on contextual features. We mine the contextual features in text as follows.
(1) We select a word embedding to model the text and the relationship between the context and the target word. Text representation is key to obtaining better performance for classifiers, including sentiment classifiers. To feed words to machine learning algorithms, text data must be converted to vector representations. Although one-hot representation is simple to implement, it loses the contextual information of the text and results in a high-dimensional space.
(2) With attention mechanisms showing promising results in natural language processing tasks [6]-[8], including sentiment analysis [9], [10], our model applies this technique to capture all words in the context.
(3) Domain-specific dictionaries are used to assign contextual scores to each sentiment word in the dataset, which are necessary for the context-based learners in our ensemble model.
The main contributions of our paper are as follows:
• We propose a novel ensemble model for sentiment classification by integrating many sub-models learned on various sub-datasets.
• Contextual factors in linguistics are fully exploited. A rule-based method captures contextual valence-shifting, and deep learning models improve the adaptive ability of the feature extraction tasks. We also adopt word embedding representation and attention mechanisms along with domain-specific sentiment dictionaries.
• Experiments on multiple datasets using our proposed model suggest improved results compared to the state-of-the-art models.
The remainder of the paper is organized as follows. Section II presents related preliminaries and previous research. Section III presents our proposed model, and Section IV describes our experiments and evaluations. Finally, Section V presents our conclusion and introduces directions for future research.

II. PRELIMINARIES AND RELATED WORK
A. CONTEXTUAL VALENCE SHIFTERS
In addition to describing events objectively, texts often convey information about the attitudes of writers or participants in a described event. This emotional attitude is expressed through the choice and arrangement of words. Although some words always show positive or negative polarity, others are likely to undergo contextual valence changes due to the influence of nearby words as well as the organization of the words in the text. Valence-shifting occurs when the semantic value of a word changes in a specific context [11]. Machine learning methods with bag-of-words and n-grams do not consider the effects of negation structures and other sentiment valence-shifter structures, as in the two sentences ''I like this hotel but the price is quite high.'' and ''I like this hotel but the price is too high.'' Under a bag-of-words model, the sentiment value of the two sentences is likely the same because they include the same emotional words, ''like'' and ''high.'' However, the bag-of-words model does not consider the words ''quite'' and ''too,'' which make the sentiment values of the two sentences different. The work in [12] uses sequence mining to extract patterns of valence-shifting, such as negation, contrasting, intensifying, and diminishing of polarization. With rule-based methods and a combination of techniques, SO-CAL [13] is a pioneering system that handles valence-shifting with rules and sets of words with emotional annotations. The authors in [14] and [15] apply a dependency tree containing rich syntactic structure information to formulate syntactic rules for determining the impact of negation and other valence-shifting structures on the emotional value of sentences or the entire document.
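The bag-of-words blindness to intensifiers can be made concrete with a toy sketch (our illustration in Python, not part of the cited systems):

```python
from collections import Counter

# The two example sentences above, represented as bags of words.
s1 = "i like this hotel but the price is quite high"
s2 = "i like this hotel but the price is too high"
bag1, bag2 = Counter(s1.split()), Counter(s2.split())

# A lexicon restricted to the sentiment words sees identical evidence...
lexicon = {"like", "high"}
assert {w for w in bag1 if w in lexicon} == {w for w in bag2 if w in lexicon}

# ...even though the bags differ, and differ only in the intensifiers.
diff = (bag1 - bag2) + (bag2 - bag1)
assert dict(diff) == {"quite": 1, "too": 1}
```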
We have reviewed simple but effective rule-based workarounds for capturing contextual valence-shifting situations. With the development of deep learning models based on neural networks, such as the long short-term memory (LSTM) network [16], the attention mechanism has proved to be a powerful method for modelling context, as detailed in Section II.B.

B. ATTENTION MECHANISM
Deep learning is a research direction within machine learning that has achieved results of superior efficiency, creating a revolution in artificial intelligence technology and applications. Deep learning has helped break through many limitations of artificial intelligence systems across fields of study, from speech processing and image processing to text analysis. Moreover, deep learning has generated a great deal of interest in the automatic learning of feature representations, such as Word2Vec [17] and GloVe [18]. These embedded features make a variety of data types, from text to image and video formats, easier to handle, and they can be packaged and applied to many different problems with a variety of machine learning techniques.
The attention mechanism is an influential idea from the deep learning community and was first proposed by Bahdanau et al. in 2014 [19] for machine translation tasks. Before this approach, translation models relied on reading a complete sentence and encoding all of its information into a fixed-length vector, so a sentence with hundreds of words had to be compressed into a single vector, leading to information loss and inadequate translations. The attention mechanism overcomes this limitation by allowing the machine translator to scan all information in the source sentence and create the appropriate word according to the current word on which it operates as well as the context. If words were inherently important or unimportant, then models without the attention mechanism might work well because the model would automatically assign low weights to irrelevant words. However, the importance of words is highly dependent on the context. For example, the word ''good'' might appear in a review with the lowest rating because users are only satisfied with part of the product or service or because they use it within a negation, such as ''not good''. The attention mechanism has been used successfully in parsing [20], natural language question answering [21], and image question answering [22]. Its application to sentiment analysis has received attention in the research community, such as at the document level by Yin and Song [23]. Also, Zhou et al. [24] used feature representations with word embeddings and the LSTM model [16] combined with a hierarchical attention mechanism. At the aspect level, the authors of [25] proposed an attention-based method for the sentiment classification task. Their system uses two types of attention, global and local, to classify sentence text as positive, negative, or neutral, and achieves higher scores than other models.

C. HYBRID APPROACH
Studies have shown that both semantic and machine learning approaches for sentiment analysis have been applied in many different fields, each providing advantages and limitations. Recently, hybrid methods have been proposed to improve the performance of these systems. For example, Khan et al. [26] proposed a hybrid method to predict the sentiment trends of movie reviews using SentiWordNet and machine learning. They built an emotional dictionary based on mutual information and applied a supervised learning method to detect sentiment polarization. In their research, only the scores of the adjectives in the document are considered, and the proposed algorithm results in a higher average accuracy compared to SentiWordNet. Table 1 lists notable studies that used hybrid approaches.

D. ENSEMBLE LEARNING
Combining models with different capabilities in an appropriate way can create a stronger combined model than using individual models alone. Ensemble learning follows this idea as well as Condorcet's jury theorem [31], which states that if each independent voter is correct with probability p > 1/2, then adding voters increases the accuracy of the majority decision. The probability that the majority vote is correct approaches 1 as the number of voters increases.
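The effect of the theorem is easy to check numerically; the following sketch (our illustration, not taken from [31]) computes the probability that a majority of n independent voters, each correct with probability p, reaches the correct decision:

```python
from math import comb

def majority_correct(n: int, p: float) -> float:
    """Probability that more than half of n independent voters,
    each correct with probability p, vote correctly (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.6 > 1/2, majority accuracy grows with the number of voters:
# n = 1 gives 0.6, and larger (odd) n pushes the probability toward 1.
```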
Three methods are commonly used to combine prediction models:
• Bagging [32]: Building multiple models that operate in parallel on different sub-training sets taken from the training data set. This approach can reduce overfitting in complex models.
• Boosting [33]: Building multiple models in sequential operation, such that each model learns how to correct the mistakes of the previous model in the series. The drawback of this strategy is that the training data set must be very large.
• Stacking [34]: Building multiple base-learners and a meta-learner that learns how best to combine the base-learners' predictions.
This paper applies a stacking strategy for building our classification model, as illustrated in Figure 1. By using multiple learners, the combined ability of the ensemble can be greater than that of a single learner [35]. Ensemble learning has been applied in many fields, such as bioinformatics [36], finance [37], and healthcare [38]. Applied to sentiment analysis, ensemble learning has demonstrated advantages for many classification problems, as shown by recent studies [39], [40].

III. PROPOSED METHODOLOGY
A. DATASETS
In this study, we use three review datasets as listed in Table 2. The first dataset includes hotel reviews in Vietnam (HOTEL-Reviews) with comments posted on mytour.vn from August 2010 to July 2017. This set consists of 3,728 reviews with an average length of 47 words and is separated into positive and negative groups at a ratio of 50:50 for the training and testing sets.
The second set includes student comments about a university (UIT-VSFC) [41], and consists of 10,280 reviews with an average length of 15 words. The reviews are separated into two positive and negative groups at a ratio of 50:50 for the training and testing sets.
The third set includes food reviews in Vietnam (FOODY-Reviews) with comments on restaurants, cafes, and food posted by users on foody.vn from August 2016 to July 2019. This set consists of 40,000 reviews with an average length of 102 words and is separated into positive and negative groups at a ratio of 50:50 for the training and testing sets.
All comment data is pre-processed by correcting misspelled words and normalizing abbreviations, social-network slang, and symbols.

B. PROPOSED MODEL
We adopt an ensemble learning model composed of various base-learners that work on multiple datasets. The input training datasets are labeled as positive or negative, and the texts are standardized (correcting spelling errors, standardizing abbreviations, discarding stop-words). Each text is classified by extracting feature components into polarity shifters, including negation, contrast, inconsistency, and un-shifting. Then, TF-IDF is used for text representation to create sub-datasets that are separately processed and trained with the base-learners. Other machine learning and deep learning base-learners are applied to all datasets with word embeddings used for text representation. Finally, the base-learners' results are merged through a meta-learner to provide a final prediction of positive or negative. Figure 2 describes the entire ensemble learning process.
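The pipeline can be illustrated with a deliberately small stacking sketch (toy reviews and hypothetical keyword lexicons; the real system uses TF-IDF, word embeddings, and the base-learners described in the following subsections): two rule-like base-learners score a text, and a logistic-regression meta-learner learns how to combine their scores.

```python
import math

# Hypothetical lexicons and texts for illustration only.
POS_WORDS = {"good", "nice", "great", "like"}
NEG_WORDS = {"bad", "poor", "dislike", "expensive"}

def base_lexicon(text):
    """Base-learner 1: raw lexicon polarity count."""
    toks = text.lower().split()
    return sum(t in POS_WORDS for t in toks) - sum(t in NEG_WORDS for t in toks)

def base_negation(text):
    """Base-learner 2: like base_lexicon, but flips the word after a negation."""
    score, flip = 0.0, 1
    for t in text.lower().split():
        if t in ("not", "don't"):
            flip = -1
            continue
        if t in POS_WORDS:
            score += flip
        elif t in NEG_WORDS:
            score -= flip
        flip = 1
    return score

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, z))))

def train_meta(features, labels, lr=0.5, epochs=200):
    """Logistic-regression meta-learner over the base-learner outputs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(features, labels):
            g = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
            w[0] -= lr * g * x1
            w[1] -= lr * g * x2
            b -= lr * g
    return w, b

train = [("the room is good and nice", 1), ("the staff is not good", 0),
         ("bad and expensive hotel", 0), ("i like this great place", 1)]
feats = [(base_lexicon(t), base_negation(t)) for t, _ in train]
w, b = train_meta(feats, [y for _, y in train])

def predict(text):
    z = w[0] * base_lexicon(text) + w[1] * base_negation(text) + b
    return int(sigmoid(z) >= 0.5)
```

The meta-learner learns to weight the negation-aware scorer more heavily, which is what lets the ensemble classify ''not good'' correctly even though the plain lexicon scorer calls it positive; this is the essence of the stacking strategy.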

1) CONTEXTUAL VALENCE-SHIFTING BASE-LEARNERS
Our classifiers are designed based on contextual valence-shifting as reviewed in Section II.A. This study also incorporates sentiment dictionaries for determining the sentiment scores of words, which we empirically determined increases the accuracy of the model.

a: SENTIMENT DICTIONARY
Building the sentiment dictionary is performed by the following steps.
Step 1: Use the online dictionary Vdict.com to translate the English SentiWordNet into Vietnamese. We translate each synset in SentiWordNet into Vietnamese, denote the resulting word by w, and store its score, denoted by SWN(w).
Step 2: Equate the sentiment score of the Vietnamese word from Step 1 to the score of the corresponding English word in SentiWordNet. If a word has multiple sentiment values, then select the value with the smallest difference from the word's score obtained through a training process using logistic regression. The process for determining the sentiment scores of the vocabulary set with logistic regression is as follows.
• Collect and pre-process documents: We collected 14,618 comments from agoda.vn and foody.vn and preprocessed the documents.
• Text representation: Use a bag-of-words model with TF-IDF representation, which is considered the most standard model in text representation [42].
• Manually label each text: Two experts label each review as positive or negative, resulting in 10,420 positive and 4,198 negative comments.
• Perform training with a Logistic Regression model: In statistics, Logistic Regression is used to model the probability of a particular class or event, such as pass or fail, or positive or negative. This technique worked well on the Vietnamese datasets for adjusting the sentiment scores of words. Logistic Regression estimates the coefficients that fit the training data during the creation of the classification model; the resulting sentiment score of each word w is denoted LR(w).
The complete algorithm is illustrated as follows.

Algorithm VNSD [43]
Input: LR(w), the score of lexical item w extracted from the Logistic Regression (LR) training task; SWN(w), the set of scores w_i of the i-th sense of w extracted from the SentiWordNet (SWN) translation task.
Output: the final score of each lexical item w.
The final score of the lexical item w is determined by the function:
function finalScore(w)
    w_LR ← LR(w)
    return the w_i in SWN(w) that minimizes |w_i − w_LR|
Algorithm VNSD results in a set of sentiment words with corresponding scores, separated into lists of positive and negative words sorted in descending order by the absolute values of their scores.
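The selection rule of Algorithm VNSD can be sketched in a few lines (the scores below are illustrative; real sense scores come from the translated SentiWordNet):

```python
def final_score(lr_score, swn_scores):
    """Pick the SentiWordNet sense score closest to the
    logistic-regression score LR(w), per Algorithm VNSD."""
    return min(swn_scores, key=lambda s: abs(s - lr_score))

# If LR gives 0.4 for a word whose translated synsets score
# 0.125, 0.5, and -0.25, the closest sense score is kept:
assert final_score(0.4, [0.125, 0.5, -0.25]) == 0.5
```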

b: CONTEXTUAL VALENCE-SHIFTING BASE-LEARNERS
We adopted Logistic Regression (LR) with n-gram representation and bag-of-words TF-IDF features. The training process for the entire dataset and the sub-datasets of negation, contrast, inconsistency, and un-shifting is as follows.
Negation Sub-Dataset: Negation structure is the most common form of valence-shifting. Determining the negation structure is performed by checking for the appearance of negation words, such as ''no'', ''not'', ''none'', ''don't'', and ''never'' in a sentence after removing exceptions, such as ''not only . . . but also . . .'', which cannot be considered as a negation.
These reviews are added to the negation dataset. After the location of the negation word is determined, it is removed, and the first sentiment word behind it is replaced with a word of opposite polarity according to the score in the sentiment dictionary. The sentiment words behind this first word are also replaced if they are identified to have the same sentiment polarity as this word. For example: (1) ''I don't like this hotel!'' is changed to ''I dislike this hotel!'' (2) ''The staff is not friendly, enthusiastic.'' is changed to ''The staff is unsympathetic, unenthusiastic.''
Contrast Sub-Dataset: Contrast structures are also common for valence-shifting. These words are separated into two groups: Fore-Contrast, such as ''but'' and ''however'', and Post-Contrast, such as ''though'' and ''although''. If the sentence contains a Fore-Contrast word, then the valence shift occurs in the clause immediately preceding this word. If the sentence contains a Post-Contrast word, then the valence shift occurs within the clause in which the Post-Contrast word appears. These contrast sentences are then added to the contrast dataset. For example, ''The hotel is very nice, convenient location but the price is a bit expensive'' results in a valence shift of ''The hotel is very nice, convenient location''.
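The negation rewrite rule can be sketched as follows (a simplified illustration; the tiny antonym map is a hypothetical stand-in for the sentiment dictionary, and real negation detection also handles exceptions such as ''not only ... but also''):

```python
# Hypothetical stand-ins for the negation list and sentiment dictionary.
NEGATIONS = {"not", "no", "never", "don't"}
ANTONYM = {"like": "dislike", "friendly": "unsympathetic",
           "enthusiastic": "unenthusiastic", "good": "bad"}

def rewrite_negation(tokens):
    """Drop the negation word and replace the following run of
    sentiment words with their opposite-polarity counterparts."""
    out, negate = [], False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True              # mark scope, drop the negation word
        elif negate and tok in ANTONYM:
            out.append(ANTONYM[tok])   # swap in the opposite-polarity word
        else:
            negate = False             # scope ends at a non-sentiment word
            out.append(tok)
    return out

assert rewrite_negation("i don't like this hotel".split()) == \
       ["i", "dislike", "this", "hotel"]
```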
Inconsistency Sub-Dataset: Inconsistency sentences show no grammatical sign of a shift, but their semantics convey a sentiment opposite to that expressed by the entire text. This conflict is due to the complexity of human language. Sentiment inconsistency can be identified by scoring each word in the text; a valence shift in a review is estimated with

h(r_i) = (y / |r_i|) Σ_{t_i ∈ r_i} w(t_i)

where y ∈ {+1, −1} is the positive or negative label of the review, r_i is the i-th review, |r_i| is the number of words in the i-th review, and w(t_i) is the sentiment score of the word t_i. Based on the estimated value, one of the following decisions is made:
• If h(r_i) < 0, then the review is added to the inconsistency set containing the inconsistency sentiment reviews.
• If h(r_i) ≥ 0, then the review is added to the un-shifting set containing the un-shifted sentiment reviews.
Examples of sentiment inconsistencies are: (1) ''More specifically, he has spent a lot of effort.'' (2) ''The teacher works so horribly every day he stays up until 2 am.''
Un-Shifting Sub-Dataset: Datasets containing sentences without valence-shifting are considered un-shifted. In other words, the un-shifted dataset includes the text that remains after removing sentences that contain a valence-shifting case of negation, contrast, or inconsistency.
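Under our reading of the formula above, the routing decision can be sketched as follows (illustrative word scores; the label y is encoded as +1 or −1):

```python
def h(review_tokens, y, lexicon):
    """Label-signed average dictionary score of a review:
    h = y * (1/|r|) * sum of w(t) over tokens t. A negative value
    means the lexical evidence contradicts the label."""
    return y * sum(lexicon.get(t, 0.0) for t in review_tokens) / len(review_tokens)

lexicon = {"horribly": -0.8, "good": 0.6}   # illustrative scores

def route(review_tokens, y, lexicon):
    """Send the review to the inconsistency or un-shifting sub-dataset."""
    return "inconsistency" if h(review_tokens, y, lexicon) < 0 else "un-shifting"

# A positively labelled review dominated by a negative word is flagged:
assert route("works so horribly every day".split(), +1, lexicon) == "inconsistency"
```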

2) ATTENTION BASE-LEARNER
We apply a bidirectional long short-term memory (Bi-LSTM) network [16] with an attention mechanism to design an additional base-learner. Similar to Bi-GRU [44], this model leverages deep learning to capture the contextual factors in a text. In our experiments, the Word2Vec word embeddings for all datasets are initialized with 300-dimensional word vectors pre-trained by [45].
The architecture of the attention mechanism model is illustrated in Figure 3. First, the model maps words into low-dimensional word vectors using pre-trained word embeddings from the input layer. Given S = {w_1, w_2, w_3, ..., w_n}, each word in S is mapped into a k-dimensional vector e_i ∈ R^k through a lookup in a pre-trained word embedding matrix E ∈ R^{k×|V|}, such as GloVe [18] or Word2Vec, where k is the dimension of the word vector and |V| is the vocabulary size. Next, the Bi-LSTM operates on these word vectors to preserve the sequential information in a memory modelling layer. The forward LSTM learns the forward weights of the hidden state to obtain the t-th forward context vector →c_t: given the hidden state →c_{t−1} ∈ R^d and the word embedding e_t, the hidden state →c_t at time step t is computed by the LSTM transition. The backward LSTM likewise learns the backward weights of the hidden state to obtain the t-th backward context vector ←c_t. Then, a concatenation of →c_t and ←c_t provides the t-th context vector c_t ∈ R^{2d} of each word:

c_t = [→c_t ; ←c_t]

where [;] stands for the concatenation operation, and →c_t and ←c_t are the outputs of the forward and backward LSTM at time step t, respectively. We next use a word-level context vector u_w, randomly initialized, to measure the importance of each word as its similarity to u_t. The context vector u_w is a high-level representation of an informative word, and its value is updated during the training process. We obtain the attention score α_t normalized by the softmax function.
After this process, we compute the document vector v as a weighted sum of the word context vectors:

v = Σ_t α_t c_t

Finally, a softmax layer predicts the sentiment polarity distribution, i.e., the probability p_k that the text belongs to category k (positive or negative):

p_k = softmax(W_s v + b_s)

where W_s is a weight matrix and b_s is a bias vector.
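The attention pooling step can be sketched in plain Python (illustrative 2-dimensional vectors; for brevity we score each context vector c_t directly against u_w, omitting the intermediate projection of c_t used in the full model):

```python
import math

def softmax(xs):
    m = max(xs)                              # stabilize the exponentials
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(context_vecs, u_w):
    """Score each Bi-LSTM context vector against the word-level context
    vector u_w, normalize to attention weights alpha_t, and return the
    weighted sum as the document vector."""
    scores = [sum(c * u for c, u in zip(c_t, u_w)) for c_t in context_vecs]
    alphas = softmax(scores)
    dim = len(context_vecs[0])
    doc = [sum(a * c_t[i] for a, c_t in zip(alphas, context_vecs))
           for i in range(dim)]
    return alphas, doc

# The vector most similar to u_w receives the largest attention weight:
alphas, doc = attention_pool([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], [1.0, 0.0])
assert alphas[2] == max(alphas) and abs(sum(alphas) - 1.0) < 1e-9
```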

3) META-LEARNER
A meta-learner is trained on the combined predictions of the base-learners. These base predictions are input into a sequential layer and combined to form a new set of predictions. In this study, the base-learners are fit independently, and the meta-learner is trained on top of them to predict an output from their outputs. Different algorithms can be selected for this part of the sentiment classification; we apply logistic regression for our meta-learner as it produced the best results in our experiments.

IV. EMPIRICAL EXPERIMENTS
A. MODEL COMPARISONS
To verify the effectiveness of the proposed model, we tested a series of variations of the model. First, five base-learners were used, ignoring the deep learning base-learner. Tests were conducted with and without the use of sentiment dictionaries. Next, we estimated the model using six base-learners and included the attention-based model with and without sentiment dictionaries. Finally, we compared our method with other baseline models: SVM, LSTM, BiLSTM [46], attention-based, BERT-based [47], and the state-of-the-art model CEM in [48].
• SVM-based model: A traditional machine learning method using SVM over TF-IDF scores of unigrams.
• LSTM-based model: The Long Short-Term Memory (LSTM) network proposed by Hochreiter and Schmidhuber [16] as enhanced by Gers [49]. Numerous deep learning models developed from LSTM have been proposed, but the original LSTM model remains valuable as a strong baseline [50]. We use one LSTM network to model the context, which includes 2 hidden layers, 64 units, and Word2Vec feature representation with a one-hot vector dimension of 65,000, reduced to 300 after applying word embedding. The average value of all hidden states is treated as the final context representation.
• BiLSTM-based model: We use a bidirectional LSTM network to model the context, which includes 128 units and Word2Vec feature representation with a one-hot vector dimension of 65,000, reduced to 300 after applying word embedding.
• Attention-based model: A vanilla global attention mechanism applied to the output of a BiLSTM. To mine deeper feature information from the text, a stacked hierarchical training mechanism is used to train the model, and multilayer feature learning obtains a more representative feature representation.
• BERT-based model: Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language model that produces contextualized representations [47]. Within this model, the Masked Language Modeling (MLM) task embeds each word based on its surrounding words, while the Next Sentence Prediction (NSP) task learns semantic coherence between sentences. The BERT model achieves positive results in multiple natural language processing tasks, including sentiment analysis [51]. We use the fine-tuned BERT-based model proposed by Nguyen that outperforms the winning model of the AIViVN sentiment classification contest.
The proposed model and its variants include the following.
• CEM(6C-ATT-WLLR) model: This model includes six base-learners like the CEM(6C-LSTM-WLLR) model in [48], but the LSTM model is replaced with the attention mechanism. The base-learners include a contrast learner, an inconsistency learner, a negation learner, an un-shifting learner, a logistic regression learner, and an attention learner; the meta-learner uses Logistic Regression.
• CEM(6C-ATT-VNSD) model: This model includes six base-learners like the CEM(6C-LSTM-WLLR) model in [48], but the LSTM model is replaced with the attention mechanism, and the WLLR (weighted log-likelihood ratio) statistical method in [48] is replaced with the sentiment dictionary (VNSD) for sentiment score ranking. The base-learners include a contrast learner, an inconsistency learner, a negation learner, an un-shifting learner, a logistic regression learner, and an attention learner; the meta-learner uses Logistic Regression.

B. EVALUATION RESULTS
The experimental results are evaluated with the accuracy ratio, as shown in Tables 3, 4, and 5.

1) EVALUATION RESULTS FOR HOTEL-REVIEWS
HOTEL-Reviews is a small dataset containing a lot of social-network slang. The deep learning models that need rich training data, such as the BERT-based model, LSTM, BiLSTM, and the attention mechanism, give lower accuracy than traditional machine learning methods like SVM. The models proposed in this article all perform well: CEM(6C-ATT-WLLR) showed the best performance, and CEM(6C-ATT-VNSD) was second-best. Based on the experimentation in this study, the proposed models are better than the best model in [48], CEM(6C-LSTM-WLLR), by 0.3%-1.7%.

2) EVALUATION RESULTS FOR UIT-VSFC
UIT-VSFC is a medium-sized dataset with well-written comments. Through experiments, we learn that the deep learning models, such as BERT-based, BiLSTM, LSTM, and the attention mechanism, give superior results in comparison to traditional machine learning methods such as SVM. The ensemble learning model was superior with sufficiently large training data, as CEM(6C-ATT-WLLR) and CEM(6C-ATT-VNSD) performed well. The proposed models are better than the best model in [48], CEM(6C-LSTM-WLLR), by up to 1.65%.

3) EVALUATION RESULTS FOR FOODY-REVIEWS
FOODY-Reviews is a medium-large dataset with comments written on social networks. Through these experiments, we learn that traditional machine learning methods, such as SVM, offer more accurate results than deep learning models, such as LSTM, BiLSTM, and the attention mechanism, except for the BERT-based model, which works well with relatively sufficient data. The ensemble learning model was superior when sufficient training data was available, with good performance from CEM(6C-LSTM-WLLR), CEM(6C-ATT-WLLR), and CEM(6C-ATT-VNSD). The proposed models outperform the baselines, which demonstrates that the automatic feature learning of deep learning contributes to the improvement of system performance.
• The use of the attention mechanism in CEM(6C-ATT-WLLR), instead of LSTM in CEM(6C-LSTM-WLLR), helps ensemble learning systems obtain superior classification results; however, the attention mechanism itself (Attention-based) does not appear to be superior to LSTM in sentiment classification, as shown in Tables 3, 4, and 5.
• The use of the sentiment dictionary VNSD in CEM(6C-ATT-VNSD), instead of WLLR in CEM(6C-LSTM-WLLR) [48], helps ensemble-learning systems achieve better classification results, as shown in Tables 3, 4, and 5. Sentiment dictionaries can obtain more accurate emotional scores than the WLLR statistical method, which enhances the performance of the system.
• The BERT-based model also outperforms the other deep learning models when the larger amounts of training data it requires are available, as listed in Tables 4 and 5. Conversely, it does not perform well on small data (HOTEL-Reviews, Table 3). When sufficient training data is provided, as shown in Table 5, the BERT-based model's results are nearly equivalent to those of the proposed model (CEM).
• With comments written on social networks, the traditional machine learning method SVM gives more accurate results than deep learning models such as LSTM, BiLSTM, and the attention-based model, as shown in Tables 3 and 5.
• In Table 5, the ensemble learning model using the sentiment dictionary, CEM(6C-ATT-VNSD), was superior when sufficient training data was available and the reviews were quite long. Conversely, CEM(6C-ATT-WLLR) achieves better results than CEM(6C-ATT-VNSD), as shown in Tables 3 and 4.
In this section, we also carefully analysed both the correctly and incorrectly predicted samples in the three datasets. Most errors occur when opinions are expressed implicitly. For example, review (1) ''Tôi nâng_cấp phần_ăn của mình lên 50k.'' (English: ''I upgraded my meal to 50k.'') contains facts without any opinion words; the larger number of positive samples in the training data may bias this neutral sentence toward positive. In this case, the sentiment score cannot be determined by detecting explicit opinion words but requires a deep understanding of the entire sentence. Gaining a better understanding of implicit opinions remains challenging for deep learning methods. The system predicted reviews (2) and (3) correctly (as negative). In review (2) ''Bằng nửa giá học phí so với Huflit thì tốt.'' (English: ''It would be great if the tuition fee is half as much as Huflit's.''), the opinion is expressed in a subjunctive style. Review (3) ''Đợi cả tiếng mới được nhận cái phòng tốt.'' (English: ''Have to wait for an hour to get a good room.'') can only be understood using common sense. The visualizations of the attention weights are shown in Figure 4.

V. CONCLUSIONS
This paper introduced an effective ensemble learning model for sentiment classification. Our system can adaptively capture contextual information in text reviews by combining rule-based methods and other state-of-the-art deep learning models. We adopted word embedding representation and the attention mechanism, along with pre-defined rules and domain-specific sentiment dictionaries, to identify numerous valence-shifting cases. Although the computational cost of the proposed system is higher than that of the compared algorithms, the model has multiple distinctive characteristics and gives better results than other approaches.
In the future, the following related tasks will be considered. First, we will attempt to improve the performance of this model using other techniques such as BERT-based and ELMo-based [52] approaches. Second, an online application will be introduced to reuse these results.
THIEN KHAI TRAN received the master's degree (with a gold medal) in computer science from the University of Information Technology VNU-HCM. He is currently pursuing the Ph.D. degree in computer science with the Ho Chi Minh City University of Technology (VNU-HCM). He is currently a Lecturer and a Researcher with the Ho Chi Minh City University of Foreign Languages and Information Technology. His research interests include information retrieval and natural language processing. He has served as a Reviewer and has published numerous articles in prestigious journals such as the Journal of Intelligent and Fuzzy Systems (SCIE), Applied Sciences (SCIE), and IEEE ACCESS (SCIE).
TUOI THI PHAN received the Ph.D. degree in computer science from Charles University, Czech Republic, in 1985. She was a Full Professor with the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (VNU-HCM). She is the Principal with the Ho Chi Minh City University of Technology and the chief investigator of national key projects. Her research interests include compilers, information retrieval, and natural language processing. She has served as a Reviewer and has published numerous articles in prestigious journals such as the Aslib Journal of Information Management (SCIE), the Journal of Intelligent and Fuzzy Systems (SCIE), and Applied Sciences (SCIE). VOLUME 8, 2020