Sentiment and Context-Aware Hybrid DNN With Attention for Text Sentiment Classification

A tremendous amount of unstructured data, such as comments, opinions, and other sorts of data is generated in real-time with the growth of web 2.0. Due to the unstructured nature of the data, building an accurate predictive model for sentiment analysis remains challenging. While various DNN architectures have been applied to sentiment analysis with encouraging results, they treat various features equally and suffer from high-dimensional feature space. Moreover, state-of-the-art methods cannot properly utilize semantic and sentiment knowledge to extract meaningful relevant contextual sentiment features. This paper proposes a sentiment and context-aware hybrid DNN model with an attention mechanism that intelligently learns and highlights salient features of relevant sentiment context in the text. We first use integrated wide coverage sentiment lexicons to identify text sentiment features then leverage bidirectional encoder representation from transformers to produce sentiment-enhanced word embeddings for text semantic extraction. After that, the proposed approach adapts the BiLSTM to capture both word order/contextual text semantic information and the long-dependency relation in the word sequence. Our model also employs an attention mechanism to assign weights to features and give greater significance to salient features in the word sequence. Finally, CNN is utilized to reduce the dimensionality of feature space and extract the local key features for sentiment analysis. The effectiveness of the proposed model is evaluated on real-world benchmark datasets demonstrating that the proposed model significantly improves the accuracy of existing text sentiment classification.


I. INTRODUCTION
Sentiment analysis, also called opinion mining, is an influential research area that evaluates people's opinions and sentiments from text [1], [2]. The purpose of sentiment analysis is to classify a user's reviews or comments into positive and negative classes. In recent years, it has become a primary way for firms, marketers, and political observers to analyze massive volumes of unstructured data from social networks. However, due to the unstructured nature of the data, building an accurate predictive model for sentiment analysis remains a challenging task.
The associate editor coordinating the review of this manuscript and approving it for publication was Adnan Abid.
There are two primary sentiment analysis approaches: (i) lexicon-based and (ii) learning-based. Lexicon-based approaches [3], [4] use a list of known sentiment words, called a sentiment lexicon or dictionary, for sentiment classification. The lexicon-based approach assigns a numerical score to each sentiment word in a review text in order to determine the overall sentiment orientation of the text. Lexicon-based approaches are widely employed because they are interpretable and explainable and do not require labeled data for training. Different sentiment lexicons with predefined sentiment orientations have been developed for text sentiment classification in the literature. Some of the most popular and widely used general sentiment lexicons for sentiment analysis are AFINN [5], OL [4], SO-CAL [6], WordNet-Affect [7], GI [6], SentiSense [8], MPQA Subjectivity Lexicon [9], NRC Hashtag Sentiment Lexicon [10], SenticNet5 [11], and SentiWordNet [12]. Hu et al. [13] built a manually generated sentiment lexicon for sentiment analysis. They used WordNet to identify the sentiment polarity of adjectives by finding their synonyms and antonyms. Ding et al. [14] proposed a holistic lexicon-based method for sentiment analysis employing external evidence and linguistic knowledge. Although widely used for gauging sentiment in social media contexts, current sentiment lexicons have limited word coverage and may overlook significant sentiment terms that are unique to a given domain. Besides, creating a sentiment lexicon manually takes a lot of time and effort. To enhance current sentiment lexicons, we combine them, incorporating linguistic and semantic knowledge to expand their vocabulary.
Unlike lexicon-based methods, learning-based approaches [15], [16] utilize machine learning (ML) to train a classifier on many labeled examples to perform sentiment classification. Although ML approaches can classify text sentiment, the classification model requires selecting relevant sentiment words/features for text sentiment classification [15]. Many traditional feature-based ML techniques have been commonly utilized for sentiment analysis [15], [17], [18]. Pang et al. [17] applied three ML algorithms (NB, ME, and SVM) for sentiment analysis and obtained a high accuracy of 82.9% using unigrams as features. Saleh et al. [18] employed SVM and an n-gram-based approach along with weighting strategies (TFIDF, BO, TO) to assess performance on datasets from different domains. Using unigram features and the TFIDF scheme, they achieved an accuracy of 91.5%. Khan et al. [15] proposed a novel part-of-speech (POS) and n-gram based ensemble learning method called EnSWF for review document sentiment classification. The EnSWF reduced the high-dimensional feature space and improved sentiment classification accuracy. Existing proposals for sentiment analysis acknowledge the importance of both lexicon-based and learning-based approaches [2], [3], [15].
Deep neural networks (DNNs) have shown remarkable performance in the field of textual sentiment analysis in recent years [19], [20], [21], [22]. Deep learning is a part of the broader family of ML methods that use artificial neural networks to learn tasks using a network of multiple layers [23]. DNN-based methods outperform traditional feature-based ML methods in terms of classification accuracy. Recent DNN-based learning approaches investigate dual-channel CNN [24], LSTM and BiLSTM [25], [26], [27], and their combinations [28], [29], [30], [31], [32] to improve text sentiment classification. In literature, the attention mechanism along with deep learning technique [33], [34], [35] is also proposed to significantly raise the standard of learning sentiment representation. Additionally, many studies of deep learning models have concentrated on training neural language models to train word embeddings, also known as word vectors, and then performing classification on these word embeddings [21], [28], [36]. Bidirectional Encoder Representations from Transformers (BERT) [37] is one of them that has revealed further space for advancement. However, existing proposals based on deep learning models do not fully exploit the rich set of sentiment features that come from the combination of DNNs (i.e., CNN, LSTM, and BERT) with attention mechanism as well as sentiment knowledge (i.e., wide coverage domain sentiment lexicons, linguistic rules, POS tagging, and sentiment clue) that may affect classification performance.
This paper presents SCA-HDNN, a Sentiment and Context-aware Attention-based Hybrid DNN model that leverages both sentiment knowledge (i.e., wide coverage domain sentiment lexicons, linguistic semantic rules, POS tagging with sentiment clues) and a combination of DNN models (i.e., CNN, LSTM, and BERT) with an attention mechanism for sentiment analysis. First, we parse and tokenize the review text and apply linguistic rules to identify mixed opinion sentences (composed of positive and negative clauses or small sentences). Then, the Stanford POS tagger [38] assigns part-of-speech (POS) tags to the specific words/features that bear sentiment clues, such as adjectives, adverbs, verbs, and nouns. Next, we leverage the integrated wide coverage sentiment lexicon (WCSL) as semantic and sentiment knowledge to identify and extract sentiment words/features (i.e., adjectives, adverbs, verbs, and nouns). Furthermore, we use pretrained BERT for sentiment-enhanced text embedding/word vector representation. After that, BiLSTM along with an attention mechanism is used to capture contextual semantic sentiment features and to assign weights to salient features, respectively. Finally, the CNN (convolution and pooling layers) is utilized to lower the dimensionality of the feature space and extract local key features for sentiment classification.
The key contributions of this paper are summarized as follows.
• We propose a novel sentiment and context-aware hybrid deep neural network model for accurate sentiment analysis, named SCA-HDNN. The proposed model efficiently utilizes contextual semantics and linguistic and sentiment knowledge with standard DNNs to identify and extract meaningful contextual sentiment features in the review text. Specifically, the inclusion of linguistic semantics and sentiment information with standard deep neural networks (BERT, BiLSTM, attention mechanism, and CNN) had a statistically significant impact on the proposed hybrid sentiment analysis model.
• We leverage the integrated wide coverage sentiment lexicon to determine the sentiment words in the review text for sentiment-enhanced word embedding. Sentiment embedding can differentiate words with similar contexts but different sentiments. We find that modeling the integrated wide coverage sentiment lexicon can enhance the effect of sentiment classification.
• We utilize the Stanford POS Tagger to assign POS tags to the words (adjectives, adverbs, verbs, and nouns) in the review text for sentiment orientation identification. To determine sentiment orientation, these words are looked up in the wide coverage sentiment lexicons to obtain their sentiment orientations.
• We employ linguistic semantic rules to determine the correct class of a sentence/review text that contains mixed opinions (both positive and negative). Linguistic semantic rules have a positive impact on mixed opinion classification. They help model the sentiment analysis (SA) problem more rigorously.
• The predictive performance of the proposed approach has been evaluated on six well-known sentiment classification benchmark datasets. The proposed approach enhances the performance of sentiment analysis efficiently and effectively and outperforms many baseline models.
VOLUME 11, 2023
The remaining sections of the paper are arranged as follows. Related work is shown in Section II. The detailed methodology and architecture along with technical details are discussed in Section III. Experimental setup and results are presented in Section IV. Section V concludes the paper along with future work.

II. RELATED WORK
This section discusses the state-of-the-art methods for text sentiment analysis. Text sentiment analysis methods are mainly divided into sentiment lexicons, ML methods, and deep learning methods.

A. SENTIMENT LEXICON
Several sentiment lexicons with pre-determined polarity have recently been developed for text sentiment analysis. A sentiment lexicon, also known as a sentiment dictionary, is a collection of words or phrases that indicate sentiments. In some cases, a numerical score is assigned to each word in the sentiment lexicon to measure the strength of the sentiment associated with the word. Sentiment lexicons are essential for analyzing the sentiment of text [55], [56], [57]. The lexicon-based method looks at each word in the text and assigns it a sentiment score based on a sentiment dictionary. The overall sentiment polarity of the text is determined by combining the sentiment scores of the individual words [4]. Some of the most commonly used sentiment lexicons are AFINN, OL, SO-CAL, MPQA Subjectivity Lexicon, and NRC Hashtag Sentiment Lexicon, among others. According to the current literature, general sentiment lexicons have been developed via manual, semi-automated, or automated methods [58]. The manual creation of sentiment lexicons is a laborious task that requires considerable expertise and effort. The development of sentiment lexicons is often a combination of manual and automatic processes. Hu et al. [13] constructed a manually compiled sentiment lexicon for text sentiment analysis. Khan et al. [59] created a general-purpose sentiment lexicon for text sentiment classification employing a semi-supervised method.
Han et al. [60] proposed a new method for generating domain-specific sentiment lexicons, which uses mutual information to assign terms to POS tags and chooses training data from an unlabeled corpus based on sentiment scores evaluated by a SentiWordNet based sentiment classifier. Wu et al. [61] developed a new method for automatically constructing a target-specific sentiment dictionary, in which each term is made up of a pair of opinion words and an opinion target. Despite the fact that sentiment lexicons are frequently used in social media for sentiment analysis, most of them are inadequate because of the limited number of words they include, which could cause them to miss out on important words that express sentiment in the subject domain. We amalgamate different sentiment lexicons employing linguistic and semantic knowledge to enlarge the sentiment lexicons for better sentiment analysis.

B. ML-BASED SENTIMENT ANALYSIS
The majority of classical sentiment analysis studies relied on supervised ML approaches. These methods often used the bag-of-words (BOW) model, n-gram features, and POS patterns for review text classification. Pang et al. [17] utilized a supervised ML approach to classify the sentiment of movie reviews, employing three different classifiers: Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machine (SVM). They used various n-gram feature sets, including unigrams, bigrams, unigrams and bigrams combined, and unigrams combined with part-of-speech (POS) tags, to classify sentiment. They attained an accuracy of 81% for the NB classifier, 80.4% for the ME classifier, and 82.9% for the SVM classifier. They found that both the NB and SVM classifiers using unigram features achieved good results.
Similarly, Go et al. [62] applied a distant supervision approach to carry out sentiment analysis on tweets. They successfully employed emoticons as noisy labels for the training data in distant supervised learning. They evaluated the performance of NB, SVM, and maximum entropy, and attained accuracies of 81.3%, 82.2%, and 80.5%, respectively. Zhang et al. [63] employed NB and SVM to classify the sentiment of restaurant reviews written in Cantonese. They looked into how the representation of features and the size of the feature set impact classification accuracy. The NB classifier had the best performance, with an accuracy of 95.67%. Tripathy et al. [64] employed four supervised ML classifiers, namely NB, ME, SVM, and Stochastic Gradient Descent, for the task of categorizing movie reviews. They showed how to utilize n-gram techniques such as unigram, bigram, and trigram, as well as combinations of these, to generate sentiment-related features. The SVM classifier achieved the best results when combining unigrams and bigrams, and when combining unigrams, bigrams, and trigrams.
Saleh et al. [18] explored various feature representation schemes (TFIDF, BO, TO) and n-gram techniques (unigrams, bigrams, and trigrams). They evaluated the performance of SVM using a range of different feature sets on various datasets. Their results showed that the trigram model outperformed both the unigram and bigram models. Bai et al. [65] utilized ME, NB, SVM, and Markov Blanket Model algorithms for sentiment classification. They evaluated their models using datasets from movie reviews and news articles. They presented a heuristic search-enhanced Markov blanket model that captures word dependencies and provides the vocabulary for sentiment extraction. The results of their proposed model were found to be better than those of other classifiers. Kalaivani et al. [66] compared SVM, NB, and KNN for movie review sentiment classification. Their experimental results showed that the SVM approach achieved better results than both the NB and KNN approaches. The reported accuracy of SVM was greater than 80%. Bilal et al. [67] conducted a study to classify opinions in Urdu and English blogs using NB, Decision Tree, and KNN. According to their results, NB was more effective than Decision Tree and KNN.
Osman et al. [68] proposed a wrapper-type feature selection approach for sentiment analysis, which is based on the Iterated Greedy meta-algorithm. They employed Multinomial NB as a classifier and Gain Ratio (GR) filter scores as heuristic information for greedy selection. Their experiments demonstrate that their proposed algorithm is more effective than existing filter-based feature selection approaches and the Genetic Algorithm-based feature selection technique. Kalaivani et al. [69] proposed an ML-based feature selection method utilizing Information Gain (IG) and a Genetic Algorithm. They applied NB, logistic regression, SVM, and ensemble techniques on multi-domain datasets and movie review datasets for evaluation. According to their experimental results, IG and the genetic algorithm with the ensemble technique performed better. Khan et al. [15] proposed a novel POS and n-gram based ensemble ML method for sentiment analysis, called EnSWF, which considers semantics, sentiment clues, and the order between words. In this method, ensemble techniques were used to identify and select suitable features for sentiment analysis by examining POS patterns and n-gram patterns. However, the majority of traditional ML methods adopt the bag-of-words paradigm, which ignores the semantic relations between words and treats each word in the text as an independent unit.

C. DNN-BASED SENTIMENT ANALYSIS
Recent research has concentrated on deep neural networks and has made great progress in sentiment classification. Kim [40] constructed a CNN model with various filters and used max-pooling to extract the important features; the features were then fed into the fully connected layer for classification. Rezaeinia et al. [41] proposed a CNN-based model for document sentiment analysis that took advantage of improved word embedding [18]. They used lexical, positional, and syntactical features to improve word embedding in their model. Then, to select key features from the text, three distinct CNN modules were applied sequentially. Yang et al. [24] proposed a dual-channel DNN utilizing the pre-trained Word2vec model for text feature extraction and intent classification. They achieved promising results in multi-intent text classification tasks. However, CNN overlooks the text's sequence information while focusing on the local features of a text.
LSTM and its variants are useful for general sequential modeling tasks, as well as for capturing long-range dependency information between words in a sentence [25], [28], [48], [70], [71]. Wang et al. [26] used the word2vec model for word embedding and proposed a sentiment classification method based on LSTM for short text in social media. Yang et al. [27] proposed an attention-based bidirectional LSTM technique to improve target-dependent sentiment classification. However, the LSTM structure loses the ability to learn local text features, which is a key quality of the CNN. It is therefore necessary to investigate a technique that combines the CNN and LSTM networks with an attention mechanism, leveraging linguistic and sentiment knowledge, to improve text sentiment classification.
The development of hybrid deep neural networks is a promising research area in sentiment analysis. Recent developments in sentiment analysis research suggest that hybrid deep neural network architectures integrating LSTM and CNN can produce promising prediction results [28], [30], [32], [51], [72], [73], [74]. Kowsari et al. [30] proposed hierarchical deep learning for text classification (HDLTex), which combines CNN, RNN, and MLP to examine the semantic information of a document at each hierarchical level. Li et al. [31] used the CBOW model for word embedding with an LSTM-CNN hybrid model for Chinese news text classification and achieved good results. Zhang et al. [32] proposed a hybrid model of LSTM and CNN for movie review sentiment classification. Additionally, hybrid models that combine DNNs with an attention mechanism have been proposed in the literature to learn text features for sentiment analysis [33], [35], [75]. Liu et al. [35] presented a hybrid DNN model named AC-BiLSTM for sentiment analysis and question answering, which combines bidirectional LSTM and CNN networks with an attention mechanism. They utilized BiLSTM to access contextual information and then employed the attention mechanism to concentrate on key portions of the text. Basiri et al. [33] proposed a hybrid sentiment analysis model named ABCDM, which combines DNN models such as BiLSTM, BiGRU, and CNN in conjunction with the attention mechanism. It was tested on five English comment datasets and three Twitter datasets, and obtained the best results.
Recently, deep learning-based models using BERT [37] have revealed further space for advancement [76], [77], [78], [79], [80], [81]. BERT is a Google-developed language model that was trained on a large corpus and released as open source in 2018. Unlike Word2Vec, which generates a static word vector, BERT learns the text representation from both the left-to-right and right-to-left directions. BERT adjusts the word vector representation dynamically based on the context in which the word is found. Deep neural networks have made great progress, but the majority of techniques do not fully utilize linguistic semantics and sentiment knowledge that may affect classification performance. These methods also suffer from a high-dimensional feature space and treat different features equally.
Our model for text sentiment analysis is different from state-of-the-art models in some aspects. In SCA-HDNN, we employ linguistic semantic and sentiment knowledge (i.e., wide coverage domain sentiment lexicons, linguistic rules, sentiment clue) to extract the meaningful context-rich sentiment-bearing words and identify the true class of opinion sentences. In this way, the model can completely utilize the hidden significant information in the sentence and learn sentiment feature information. Next, we leverage BERT to generate sentiment-enhanced word embedding. After that, we employ BiLSTM along with an attention mechanism to extract context-rich salient sentiment features. Finally, CNN is applied to reduce the dimensionality and extract the local key features for sentiment classification.

III. METHODOLOGY
In this section, we present a hybrid sentiment analysis model SCA-HDNN. The proposed model incorporates the standard DNN-based techniques with linguistic semantics and sentiment knowledge to identify and extract contextual local key sentiment features for text sentiment analysis. Figure 1 shows the structure of SCA-HDNN model. This model consists of five layers: Feature engineering layer, BERT Embedding layer, BiLSTM layer, CNN layer, and output layer.

A. FEATURE ENGINEERING LAYER
The feature engineering layer is composed of two components: text pre-processing and sentiment feature extraction. Text pre-processing is used to create suitable feature vectors for better learning performance in sentiment classification. The text pre-processing component contains six types of modules. These modules are sentence parsing, linguistic rules, tokenization, lowercase, noise removal, and spell correction. For sentiment features extraction, we employ Stanford POS Tagger to assign the POS tags to the specific words (Adjective, Adverb, Verb, and Noun) that bear sentiment clues leveraging wide coverage sentiment lexicons.
In this layer, the review data is initially loaded and then processed by a sentence parser and tokenizer. The noise removal module is then employed to eliminate noisy text like stop words, URLs, numeric symbols, etc. After that, the abbreviated words are converted to their full words. Further, the case transformer and spell checker module is utilized to convert the text to lowercase and correct typos respectively. Next, the POS tagger is used to assign POS tags to the likely words such as adjectives, adverbs, verbs, and nouns. The Penn Treebank annotation scheme [82] is employed for POS tagging as shown in Table 2. Finally, the polarity of these words is searched in the integrated WCSL for sentiment features detection/extraction.
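The pre-processing steps above can be illustrated with a minimal sketch. It assumes a tiny hypothetical stop-word list and abbreviation map (`STOP_WORDS`, `ABBREVIATIONS`), applies the steps in a simplified order, and omits spell correction and POS tagging; it is not the authors' implementation.

```python
import re
import string

# Hypothetical resources, for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to"}
ABBREVIATIONS = {"gr8": "great", "u": "you"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def preprocess(review):
    """Lowercase, drop URLs, expand abbreviations, remove numeric and stop-word tokens."""
    text = URL_RE.sub(" ", review.lower())
    tokens = []
    for raw in text.split():
        # Strip punctuation attached to the token, then expand abbreviations.
        word = raw.strip(string.punctuation)
        word = ABBREVIATIONS.get(word, word)
        # Keep purely alphabetic tokens that are not stop words.
        if word.isalpha() and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

print(preprocess("The movie is GR8, see https://example.com 10/10!"))
```

Note that the `isalpha` filter in this sketch also discards hyphenated words and contractions; a fuller pipeline would treat those separately.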
We also utilize linguistic rules to predict the correct class of review sentences composed of mixed opinions. Such rules rely on certain words that can alter the polarity of a statement, such as 'but', 'despite', 'while', 'unless', and so on. For example, in the statement "the filmmaker is well-known, but the film is dull", the linguistic rules take into account only the clause after "but", whereas the clause preceding "but" is not considered. Many review texts contain a range of mixed viewpoints, and the adoption of linguistic rules could have a positive impact on mixed-opinion reviews [83], [84]. Consistent with earlier research, we assume that each review sentence/text has a single polarity. Following the relevant work [83], [84], we employ the same linguistic rules as shown in Table 3.
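As a minimal sketch of the "but" rule from the example above: the function name `dominant_clause` is hypothetical, and only the "but" connective is handled, since the handling of the other connectives (Table 3) is not spelled out in the text.

```python
import re

# Matches the connective "but" as a whole word, case-insensitively.
BUT_RE = re.compile(r"\bbut\b", flags=re.IGNORECASE)

def dominant_clause(sentence):
    """Return the clause after the last 'but'; return the whole sentence if there is none."""
    parts = BUT_RE.split(sentence)
    return parts[-1].strip(" ,.;")

print(dominant_clause("the filmmaker is well-known, but the film is dull"))
```

Only the returned clause would then be passed on for sentiment feature extraction, so the sentence is scored on the opinion that the rule treats as decisive.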

B. WIDE COVERAGE SENTIMENT LEXICONS
A sentiment lexicon is a list of words and phrases associated with either positive or negative feelings [1]. Sentiment lexicons are essential for identifying sentiment-bearing words. In the literature, various sentiment lexicons of different sizes have been built, such as SentiWordNet [12], WordNet-Affect [7], SenticNet5 [11], NRC Hashtag Sentiment lexicon [10], MPQA Subjectivity Lexicon [9], SentiSense [8], GI [6], SO-CAL [4], OL [13], and AFINN [5]. Many existing sentiment lexicons contain too few words to accurately determine the sentiment orientation of a domain-specific sentiment word. We integrate many sentiment lexicons to enlarge the word coverage of the sentiment lexicon. Moreover, there is no single optimal general sentiment lexicon for sentiment analysis, and the formats of the sentiment lexicons differ from one another.
We standardize the aforementioned ten state-of-the-art sentiment lexicons to a uniform format and integrate them into a larger sentiment lexicon with more sentiment words. Table 4 shows the size and format of the state-of-the-art sentiment lexicons for word coverage. First, we assign scores of +1, −1, and 0 to positive, negative, and neutral words, respectively, in order to standardize the lexicons. Then, for each word shared by multiple sentiment lexicons, the sentiment score is taken as the average of its scores across those lexicons, generating a wide coverage sentiment lexicon (WCSL).
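The standardize-then-average step can be sketched as follows. The sketch assumes each lexicon has already been loaded into a dictionary mapping a word to a signed raw score; the function names `standardize` and `build_wcsl` and the toy lexicons are hypothetical.

```python
def standardize(raw_score):
    """Map a signed raw score to +1 (positive), -1 (negative), or 0 (neutral)."""
    return (raw_score > 0) - (raw_score < 0)

def build_wcsl(lexicons):
    """Average each word's standardized score over the lexicons that contain it."""
    totals, counts = {}, {}
    for lexicon in lexicons:
        for word, raw_score in lexicon.items():
            totals[word] = totals.get(word, 0) + standardize(raw_score)
            counts[word] = counts.get(word, 0) + 1
    return {word: totals[word] / counts[word] for word in totals}

# Toy lexicons with different raw score formats.
lexicon_a = {"good": 3, "dull": -2}
lexicon_b = {"good": 0.8, "dull": 0.0, "fine": 0.4}
print(build_wcsl([lexicon_a, lexicon_b]))
```

A word that appears in only one lexicon simply keeps its standardized score, which is how the merged lexicon gains coverage beyond any single source.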
The main difficulty with sentiment lexicons is their limited vocabulary coverage. If a sentiment word is not present in the existing sentiment lexicons, it is typically omitted, which can have an impact on the sentiment of the review text. To address this issue, we utilize semantic knowledge from WordNet to find synonyms of the word in order to determine its sentiment orientation. For example, given the word $W$, we determine its synonym words $(w_1, w_2, \ldots, w_m)$ in WordNet. We search for every synonym word $w_i$ in the WCSL; if the word is present, its sentiment orientation is obtained. We then take the average sentiment score of all synonym words. If the average sentiment score is greater than zero, the sentiment orientation of the given word $W$ is positive; otherwise, it is negative.
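The synonym fallback described above can be sketched as below. The `synonyms` dictionary is a hypothetical stand-in for a WordNet lookup, and returning `None` when no synonym is found in the WCSL is an assumption, since the text does not say how that case is handled.

```python
def synonym_orientation(word, wcsl, synonyms):
    """Infer the orientation of an out-of-lexicon word from its synonyms' WCSL scores."""
    scores = [wcsl[s] for s in synonyms.get(word, []) if s in wcsl]
    if not scores:
        return None  # assumption: no evidence, so no orientation is assigned
    average = sum(scores) / len(scores)
    # As in the text: positive if the average is above zero, otherwise negative.
    return "positive" if average > 0 else "negative"

wcsl = {"dull": -1.0, "boring": -1.0, "great": 1.0}
synonyms = {"tedious": ["dull", "boring", "wearisome"], "superb": ["great"]}
print(synonym_orientation("tedious", wcsl, synonyms))
print(synonym_orientation("superb", wcsl, synonyms))
```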
The integrated WCSL is employed to identify and extract the sentiment words in the review text for sentiment-enhanced word embedding and classification. Given an input review text/sentence $S = \{w_1, w_2, w_3, \ldots, w_n\}$ that contains $n$ words, the task is to apply the WCSL to identify and retrieve the sentiment words from $S$ and form the sentiment-enhanced word embedding vector.

C. BERT EMBEDDINGS LAYER
BERT is an attention-based language model that learns textual information via a stack of transformer encoders. BERT is more versatile and advantageous than the classic language model Word2Vec, because each word in Word2Vec has a fixed representation independent of context. The word vector representation in BERT is dynamically adjusted depending on the context in which the word appears. BERT learns the text representation from both the left-to-right and right-to-left directions. BERT is pre-trained with two objectives, the Masked Language Model and Next Sentence Prediction, which represent the input sequence at the word and sentence levels, with the aim of minimizing the combined loss of the two objectives.
The Masked Language Model trains a deep bidirectional transformer representation by randomly masking words from each sentence and then predicting them based on the context of the nearby non-masked words. Before a word sequence is fed to BERT, 15% of its words are replaced with a [MASK] token. The model takes the context of the other non-masked words in the sequence and tries to predict the original value of each masked word. The goal of Next Sentence Prediction is to learn how two sentences relate to one another, which enables BERT to adapt to diverse downstream tasks more effectively. In this study, for each review text/sentence in the dataset, we use the BERT language model to construct sentiment-enhanced word embeddings. The sentiment-enhanced embedding layer takes sentiment word tokens as input and converts each one into a word vector. These word vectors are then processed by the downstream DNN models (BiLSTM with attention mechanism and CNN) as high-quality salient sentiment features for classification.
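The masking step of the Masked Language Model can be illustrated with a simplified sketch. The function name `mask_tokens`, the fixed seed, and the rounding of the number of masked positions are assumptions. The original BERT procedure also leaves some selected tokens unchanged or swaps them for random tokens, which is omitted here.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace about mask_rate of the tokens with [MASK]; return the masked sequence and targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    # The model is trained to recover these original tokens at the masked positions.
    targets = {i: tokens[i] for i in positions}
    return masked, targets

tokens = "the film was slow but the ending was truly moving and well acted overall".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```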

D. BiLSTM LAYER
Long short-term memory network (LSTM) [85] is a kind of recurrent neural network (RNN) architecture. LSTM takes into account the relationship between word sequences and effectively addresses the long-distance dependency and vanishing gradient problems of RNNs. The standard LSTM unit includes a memory cell and three different types of gates: an input gate, an output gate, and a forget gate. These gates are intended to control how information enters and exits the memory cell. The gates are computed at each timestep $t$ from the current input and the previous hidden state. Here, $x_t$ is the input at time $t$ and $h_{t-1}$ is the hidden state at timestep $t-1$; $b$ stands for the bias vector; $W$ and $U$ represent the weight matrices of each gate; and $\sigma$ and $\tanh$ stand for the sigmoid function and hyperbolic tangent, respectively. The input gate ($i_t$) and output gate ($o_t$) regulate the input and output of the memory unit, respectively. The forget gate ($f_t$) resets the memory; $c_t$ is the memory cell state at time $t$, which combines past and present information. The inclusion of $c_t$ allows the LSTM to learn longer-range dependencies by reducing the vanishing gradient issue. Finally, $h_t$ stands for the output vector of each LSTM layer.
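For reference, a standard LSTM formulation written with the symbols defined above is sketched here. The gate subscripts on $W$, $U$, and $b$, the candidate state $\tilde{c}_t$, the element-wise product $\odot$, and the numbering (1)-(6) are notational assumptions rather than details taken from the text.

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{1} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{2} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{3} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{4} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5} \\
h_t &= o_t \odot \tanh(c_t) \tag{6}
\end{align}
```

In this form, the forget gate $f_t$ scales the previous cell state $c_{t-1}$ and the input gate $i_t$ scales the new candidate, so $c_t$ carries information forward additively, which is what eases the vanishing gradient problem.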
A standard LSTM network processes the sequence in only one direction, so only the words on one side of the current word influence its representation. The BiLSTM is an extension of the LSTM model in which the input vector $V_n$ with a sequence of $W_n$ tokens is processed by two distinct LSTMs to produce an output $h$, as shown in Figure 1. One LSTM processes the input text sequence from left to right, known as the forward pass ($\overrightarrow{h_t}$), and another LSTM processes the input text sequence from right to left, known as the reverse/backward pass ($\overleftarrow{h_t}$). The forward and backward pass outputs are combined using the component-wise sum operation, $h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$, which gives the final hidden state at time step $t$. The output of the BiLSTM layer is $\{h_1, h_2, h_3, \ldots, h_n\}$, where $n$ is the length of the sentence; this output contains the entire sequence of context-rich features. In this study, the sentiment-enhanced word embedding vectors generated by the BERT model are fed to the BiLSTM to capture both the order information and the long-dependency relation in the word sequence. The BiLSTM works well and gives the algorithm more context, since it takes into account the context on both sides of each word in the sequence.
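The forward and backward passes and their component-wise sum can be sketched in plain Python. This is an illustrative toy with random weights and list-based vectors, using a standard LSTM cell; the helper names (`lstm_step`, `run_lstm`, `bilstm`, `random_params`) are hypothetical and this is not the trained model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(gate, x, h):
    """Compute W x + U h + b for one gate, with list-based matrices and vectors."""
    W, U, b = gate
    return [sum(w * xi for w, xi in zip(W[r], x))
            + sum(u * hi for u, hi in zip(U[r], h)) + b[r]
            for r in range(len(b))]

def lstm_step(params, x, h_prev, c_prev):
    """One standard LSTM step: gates, cell state update, hidden output."""
    i = [sigmoid(z) for z in affine(params["i"], x, h_prev)]
    f = [sigmoid(z) for z in affine(params["f"], x, h_prev)]
    o = [sigmoid(z) for z in affine(params["o"], x, h_prev)]
    g = [math.tanh(z) for z in affine(params["c"], x, h_prev)]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def run_lstm(params, inputs, hidden):
    """Run an LSTM over the inputs and collect the hidden state at every step."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs

def bilstm(fwd, bwd, inputs, hidden):
    """Combine forward and backward hidden states by component-wise sum."""
    forward = run_lstm(fwd, inputs, hidden)
    # Run on the reversed sequence, then re-align outputs with the original order.
    backward = run_lstm(bwd, inputs[::-1], hidden)[::-1]
    return [[a + b for a, b in zip(hf, hb)] for hf, hb in zip(forward, backward)]

def random_params(input_dim, hidden, rng):
    """Random (W, U, b) for the input, forget, output, and candidate gates."""
    def gate():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden)]
        U = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
        return W, U, [0.0] * hidden
    return {name: gate() for name in "ifoc"}

rng = random.Random(0)
fwd, bwd = random_params(3, 2, rng), random_params(3, 2, rng)
embeddings = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
print(bilstm(fwd, bwd, embeddings, hidden=2))
```

Because the two directions are summed rather than concatenated, each $h_t$ keeps the same dimensionality as a single LSTM's hidden state.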

E. ATTENTION MECHANISM
A plain LSTM for sentiment classification cannot identify which words matter most: some words carry a noticeable sentiment clue, while others carry little or none. Words therefore contribute differently to the sentiment polarity of the full text. We thus adapt the attention mechanism to give distinct sentiment weights to distinct words and to pick out salient, informative words. Attention uses a softmax function to give every word a weight $a_t$. Words with stronger sentiment receive more attention, i.e., a higher weight that increases their importance. This attention weight is expected to concentrate on the context-rich features that are crucial to the text's meaning for sentiment prediction. As depicted in Figure 1, the attention weight is computed from the hidden state output $H = \{h_1, h_2, h_3, \ldots, h_n\}$ of the BiLSTM. Here, $W_h$ is a learned parameter and $h_t$ is created by joining the forward and backward LSTM representations. The attention weight $a_t$ assigned to each word defines that word's importance to the text. $D$ is a representation of the whole input text vector, which includes word-by-word sentiment information.
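A minimal sketch of softmax-based attention over the BiLSTM outputs is given below. The scoring function (a tanh of a dot product with a vector `w_h`) is an assumed, commonly used form and not necessarily the exact formulation of Eqs. (7) to (9); the function and variable names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, w_h):
    """hidden_states: n vectors h_t; w_h: assumed learned scoring vector."""
    # One scalar score per word, then normalised weights a_t.
    scores = [math.tanh(sum(w * h for w, h in zip(w_h, h_t))) for h_t in hidden_states]
    weights = softmax(scores)
    # Re-weighted word vectors (passed on to the CNN) and their sum D.
    weighted = [[a * h for h in h_t] for a, h_t in zip(weights, hidden_states)]
    summary = [sum(col) for col in zip(*weighted)]
    return weights, weighted, summary

H = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights, weighted, D = attend(H, w_h=[1.0, 1.0])
```

In this toy input the third vector scores highest and therefore receives the largest weight, mirroring how stronger sentiment words are emphasised.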
Algorithm 1 Pseudo-code of SCA-HDNN
1) Through sentiment knowledge and BERT-based embedding, convert the word sequence $w_1, w_2, \ldots, w_n$ into the corresponding sentiment word vectors $W_sV_1, W_sV_2, \ldots, W_sV_n$.
2) Use BiLSTM to model sentences and learn the hidden vectors for each word $(h_1, h_2, \ldots, h_n)$.
3) Utilize the attention mechanism to produce improved word vectors using Eqs. (7), (8), and (9).
4) Apply the convolution layer to extract the local word/feature vectors.
5) Use the maximum-pooling strategy to filter the highly informative features from the local features.
6) Concatenate the output of the pooling layer with a flatten layer and send it into the fully connected layer.
7) Apply the sigmoid function to the comprehensive context representations to get the corresponding class label.
8) Use the loss function with the Adam method to update the model's parameters.

F. CNN LAYER
After acquiring the context-rich salient sentiment features, we employ a CNN to further extract the local, informative, salient context sentiment features of the text. The CNN used in this study performs convolution with convolutional windows (sliding windows) of three different lengths. The convolution is applied to the complete sequence of features obtained from the attention layer. Each word/feature in the text sequence $S_e$ is represented by $W_{a_i}$, the word vector obtained after computing the attention weight. Filters with varying window sizes are applied to the input sequence, yielding a set of feature maps. Assume a convolution kernel $K_c \in \mathbb{R}^{h \times l}$, where $c$ indicates the number of convolution kernels, $l$ the length of the convolution kernel, and $h$ the width of the convolution kernel. For the input text sequence, the feature map $P = \{p_1, p_2, \ldots, p_{n-h+1}\} \in \mathbb{R}^{n-h+1}$ is created by repeatedly applying the convolution kernel $K_c$ across the sequence. The ReLU activation is applied to the convolution output. Once the convolution is completed, we perform pooling. The purpose of the pooling layer is to condense the convolutional feature vector, lowering the vector dimension and the computing cost. The max-pooling layer is applied to each feature map and takes the maximum value $\hat{c} = \max\{c\}$ [86]. Max-pooling is employed in this research to retain the most informative features [39]. These features are then concatenated by a flatten layer and sent to the fully connected layer.
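The convolution and max-pooling steps can be sketched as follows for a single kernel. This is a minimal illustration with hypothetical function names and toy inputs, assuming a "valid" convolution (no padding), so that a sequence of $n$ vectors and a window of width $h$ yield $n-h+1$ feature values.

```python
def conv1d_valid(seq, kernel, bias=0.0):
    """seq: n vectors of dimension l; kernel: h vectors of dimension l.
    Returns the n-h+1 ReLU-activated entries of one feature map."""
    n, h = len(seq), len(kernel)
    feature_map = []
    for i in range(n - h + 1):
        s = bias
        for j in range(h):
            s += sum(k * x for k, x in zip(kernel[j], seq[i + j]))
        feature_map.append(max(0.0, s))  # ReLU
    return feature_map

def max_pool(feature_map):
    """Keep only the strongest response of one feature map."""
    return max(feature_map)

seq = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 0], [0, 2]]  # n = 6, l = 2
kernel = [[1, 1], [1, 1], [1, 1]]                       # h = 3
fmap = conv1d_valid(seq, kernel)   # 6 - 3 + 1 = 4 values
pooled = max_pool(fmap)
```

With several kernels per window size, one pooled value is kept per kernel and the pooled values are concatenated into the feature vector passed to the fully connected layer.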
The fully connected layer is used to predict sentiment polarity. Its primary purpose is to take the output of the preceding layer, process it, and assign it to a specific category. A sigmoid function, as given in equation (11), is used to obtain the final output for the binary class. The output is the probability distribution over the label; the function maps the input data to the [0, 1] range. A value of y near 0 denotes a sentiment class approaching negative, while a value of y near 1 denotes a sentiment class approaching positive.
In equation (11), x is the strong sentiment state, w is the parameter matrix, and b is the bias element learned during training. The major stages of the proposed sentiment analysis framework are briefly summarised in Algorithm 1.
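The sigmoid output step can be sketched as below, assuming the usual form y = sigmoid(w·x + b) with a single output unit. The 0.5 decision threshold and the function names are assumptions made for illustration.

```python
import math

def predict_proba(x, w, b):
    """Probability of the positive class: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, w, b, threshold=0.5):
    """Map the probability to a class label (threshold is an assumed default)."""
    return "positive" if predict_proba(x, w, b) >= threshold else "negative"

y = predict_proba([2.0, 1.0], w=[1.0, 1.0], b=0.0)  # close to 1, i.e. positive
```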

IV. EXPERIMENTAL SETUP AND RESULTS
In this section, we first present the experimental setup including the datasets used in the empirical analysis and baseline methods. Then we analyse the outcomes of a number of proposed models and compare the final model SCA-HDNN with SOTA approaches.

A. DATASETS AND SENTIMENT RESOURCES
1) DATASETS
In this study, we evaluated the predictive performance of our models on the following six text sentiment classification datasets.
• RT-2k: This standard dataset contains 2,000 full-length positive and negative movie reviews [88].
• SST-2: This is the binary-labeled version of the Stanford Sentiment Treebank, with all neutral reviews removed. It contains 9,613 positive and negative samples [89].
• CR: The customer review (CR) dataset contains reviews of five electronics products taken from Amazon and CNET [13].
• SemEval 2013 and SemEval 2014: The SemEval datasets [90] consist of comments taken from Twitter on a variety of subjects. Following [53], we used the training and development sets of SemEval-2013 to train and tune the classifier, respectively, and tested the performance on the test sets of both SemEval-2013 and SemEval-2014, because SemEval-2014 contains only test samples. As in [53], we take into account only binary targets (positive and negative) and filter out neutral comments.
Only the SST-2 dataset offers distinct training, development, and test sets for evaluation. For the other datasets, we randomly split them with ratios of 0.8, 0.1, and 0.1 for training, development, and testing, following prior work [44], [54]. The statistical details of these datasets are shown in Table 5.
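A random 0.8/0.1/0.1 split of the kind described above can be sketched as follows. The function name, the fixed seed, and the use of integer IDs as samples are illustrative assumptions; the text does not specify a seed or whether the split is stratified.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle once, then cut into train/dev/test by the given ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumed)
    n_train = int(round(len(items) * ratios[0]))
    n_dev = int(round(len(items) * ratios[1]))
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # remainder goes to the test set
    return train, dev, test

# For a 2,000-review dataset this yields 1,600 / 200 / 200 samples.
train, dev, test = split_dataset(range(2000))
```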

2) SENTIMENT RESOURCES
We integrated 10 state-of-the-art sentiment lexicons, leveraging semantic and sentiment knowledge, to generate the WCSL used for sentiment information extraction from review texts. For words that appear in more than one lexicon, we averaged their sentiment scores and then standardized them by assigning scores of +1, −1, and 0 to positive, negative, and neutral words, respectively.
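The averaging and standardization step can be sketched as below. It assumes that each lexicon is a word-to-score mapping on a comparable signed scale and that polarity is taken from the sign of the averaged score; the toy lexicons and the function name are hypothetical.

```python
from collections import defaultdict

def merge_lexicons(lexicons):
    """lexicons: list of dicts mapping word -> raw sentiment score.
    Returns word -> +1 (positive), -1 (negative), or 0 (neutral)."""
    collected = defaultdict(list)
    for lexicon in lexicons:
        for word, score in lexicon.items():
            collected[word].append(score)
    merged = {}
    for word, scores in collected.items():
        average = sum(scores) / len(scores)  # average over overlapping entries
        merged[word] = 1 if average > 0 else -1 if average < 0 else 0
    return merged

toy_lexicons = [
    {"good": 0.8, "bad": -0.6, "table": 0.0},
    {"good": 1.0, "bad": -1.0, "fine": 0.2},
    {"fine": -0.2},
]
wcsl = merge_lexicons(toy_lexicons)
```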

B. EXPERIMENTAL SETTINGS
We used the Keras library with TensorFlow and RapidMiner Studio (a visual workflow designer) for implementation. For word vectorization, we used the uncased BERT-BASE model, with 12 network layers, a hidden layer dimension of 768, 12 attention heads, and a total of more than 110 million parameters, with an Adam learning rate of 2e-5. We also considered three conventional pre-trained embedding models for word vectorization, namely Word2vec, GloVe, and fastText, each with 300 dimensions. The number of hidden neurons for each layer in the BiLSTM and CNN is set to 128. The activation function employed in each layer of the network is the Rectified Linear Unit (ReLU), and the dropout rate is set to 0.5. The filter windows (h) of the convolutions are 3, 4, and 5, with 100 feature maps each. The sigmoid is used to produce the class-label probability in the fully connected layer and output layer of the CNN. The total number of training epochs was set to 10.
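For convenience, the reported settings can be collected into a single configuration object, as in the sketch below. The values are those stated above; the dictionary layout and key names are hypothetical, and `bert-base-uncased` is an assumed identifier for the uncased BERT-BASE checkpoint.

```python
# Hyperparameters as reported in the experimental settings; key names are hypothetical.
CONFIG = {
    "bert": {
        "variant": "bert-base-uncased",  # assumed checkpoint identifier
        "layers": 12,
        "hidden_size": 768,
        "attention_heads": 12,
        "optimizer": "adam",
        "learning_rate": 2e-5,
    },
    "static_embeddings": {"models": ("word2vec", "glove", "fasttext"), "dim": 300},
    "bilstm": {"units": 128},
    "cnn": {"units": 128, "filter_windows": (3, 4, 5), "feature_maps": 100},
    "activation": "relu",
    "dropout": 0.5,
    "output_activation": "sigmoid",
    "epochs": 10,
}

# One pooled value per feature map and window size after max-pooling.
pooled_features = len(CONFIG["cnn"]["filter_windows"]) * CONFIG["cnn"]["feature_maps"]
```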

C. EVALUATION MEASURES
We evaluated the predictive performance of our proposed hybrid method using Accuracy (ACC), Precision (PRE), Recall (REC), and F-measure. The rate of correctly classified instances (TP + TN) out of the total number of instances (TN + TP + FN + FP) determined by the classification algorithm is referred to as classification accuracy.
Precision is the proportion of true positives out of the total number of true positives and false positives.
Recall is the ratio of true positives against the total number of true positives and false negatives.
The F-measure is the harmonic mean of precision and recall.
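The four measures defined above follow directly from the confusion-matrix counts, as in this minimal sketch. The function name and the example counts are illustrative, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f_measure = 0.0
    return {"ACC": accuracy, "PRE": precision, "REC": recall, "F": f_measure}

# Illustrative counts only, not results from the experiments.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```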
To assess whether the performance differences of the proposed solution are statistically significant, we employed a paired t-test (with a significance level of p<0.05).
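The test statistic of a paired t-test can be sketched with the standard library alone, as below. What is paired is an assumption here (for example, the scores of two models on the same datasets or runs); the input values are made up for illustration. Obtaining the p-value additionally requires the t distribution, for instance via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # sample std. dev. of the differences
    return mean(diffs) / standard_error, n - 1

# Made-up paired scores, for illustration only.
t_value, dof = paired_t_statistic([3, 5, 7, 9], [1, 2, 3, 4])
```

The resulting statistic is compared against the critical value of the t distribution with `dof` degrees of freedom at the chosen significance level.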

D. BASELINE MODELS
In this research, we evaluated the predictive performance of the SCA-HDNN model against the following state-of-the-art DNN-based baseline models for textual sentiment classification. These are effective and popular methods that have produced good results.
• CNN-non-static: Kim [40] proposed 1D-CNN using two sets of pre-trained word embeddings for text classification.
• IWV: Rezaeinia et al. [41] proposed improved word vectors in conjunction with a Convolutional Neural Network (CNN)-based model that comprises three convolution layers, a max pooling layer, and a fully connected layer for sentiment classification.
• SAAN: Lei et al. [47] proposed a sentiment-aware multi-head attention CNN-based model for text sentiment classification.
• SAT: Huang et al. [46] proposed a transformer-based BERT model with a two-stage training strategy for textual sentiment analysis.
• BiLSTM-CRF: Chen et al. [48] proposed a neural network-based sequence model, BiLSTM-CRF, to extract target expressions in opinionated sentences. This model then divides opinionated sentences into three categories based on the number of targets. Further, they trained various 1D-CNNs for sentiment classification using these three types of sentences.
• ATTPooling: Usama et al. [50] proposed a model architecture based on RNN with CNN-based attention for textual sentiment analysis.
• RCNNGWE: Onan et al. [51] proposed a bidirectional convolutional recurrent neural network architecture with a group-wise enhancement mechanism for textual sentiment classification.
• CoSE: Wang et al. [53] proposed a contextual sentiment embeddings model via a two-layer GRU language model for textual sentiment analysis.
• MVA: Zhang et al. [54] proposed a multiview attention model to learn a sentence representation from multiple perspectives for text classification.

E. MODEL VARIATIONS
In this study, we evaluate the effectiveness of the proposed method using four model variations. The following four model variations are tested in the experiment:
• SCA-HDNN-1: The model employs the pre-trained BERT embedding scheme in conjunction with BiLSTM and CNN, without the attention mechanism or sentiment knowledge.
• SCA-HDNN-2: The model uses the pre-trained BERT scheme in conjunction with BiLSTM, the attention mechanism, and CNN, without sentiment knowledge.
• SCA-HDNN-3: The model utilizes sentiment knowledge and employs the pre-trained BERT scheme in conjunction with BiLSTM and CNN, without the attention mechanism.
• SCA-HDNN (the proposed scheme): The model leverages sentiment knowledge and employs the pre-trained BERT scheme in conjunction with BiLSTM, the attention mechanism, and CNN.

1) OVERALL COMPARISON
In this section, the effectiveness of the proposed model (namely, SCA-HDNN) in terms of classification accuracy on six multi-domain datasets is compared with previously proposed baseline models on the binary sentiment classification problem. The classification accuracy of the different DNN-based models is presented in Table 6. The bold text denotes the best results. According to Table 6, in comparison with three CNN-based methods (DCNN, CNN-non-static, and IWV), SCA-HDNN gives better results on the two datasets. Compared to two LSTM methods (Tree-LSTM and LR-LSTM), SCA-HDNN performs better on the three datasets. Similarly, compared to two capsule-based methods (Capsule-B and RNN-Capsule), SCA-HDNN delivers better performance on three datasets.
At the same time, compared with similar hybrid DNN-based methods (SAAN, BiLSTM-CRF, SAMF-BiLSTM, AC-BiLSTM, ATTPooling, RCNNGWE, SAWE, CoSE, and MVA), which model language knowledge and sentiment resources, the SCA-HDNN method achieves better classification performance on five different datasets. The results show that SCA-HDNN is a more sentiment- and context-aware attentive model: it leverages linguistic semantics and sentiment knowledge (wide coverage domain sentiment lexicons, linguistic rules, POS tagging, and sentiment clues) and employs BERT embedding and the attention mechanism in conjunction with hybrid DNNs for sentiment analysis. These results validate the effectiveness of SCA-HDNN in terms of classification accuracy for textual sentiment analysis.

2) THE IMPACT OF SENTIMENT KNOWLEDGE AND ATTENTION MECHANISM ON SCA-HDNN
In this subsection, we evaluate the effects of sentiment knowledge and the attention mechanism on the proposed SCA-HDNN and the other model variations (SCA-HDNN-1, SCA-HDNN-2, and SCA-HDNN-3) described in section E. Table 7 shows that sentiment knowledge and the attention mechanism have a significant impact on the performance of SCA-HDNN and its variants. Every component of SCA-HDNN contributes to the final results, and among all of the aforementioned models SCA-HDNN yields the best results. Following [35], [48], we use the relative improvement ratio and the classification accuracy as the evaluation metrics. We calculate the relative improvement ratio as follows:
$$\text{Relative improvement} = \frac{ACC_{\text{SCA-HDNN}} - ACC_{\text{var}}}{ACC_{\text{var}}} \times 100 \quad (16)$$
where $ACC_{\text{SCA-HDNN}}$ is the classification accuracy of our approach and $ACC_{\text{var}}$ is the classification accuracy of each SCA-HDNN variant. Compared with SCA-HDNN-1, SCA-HDNN gives relative improvements ranging from 5.02% to 10%, as shown in Table 8. The performance of SCA-HDNN thus deteriorates drastically when both the sentiment knowledge and the attention mechanism are eliminated; these components significantly raise the classification accuracy of SCA-HDNN. Compared with SCA-HDNN-2 and SCA-HDNN-3, SCA-HDNN gives relative improvements from 1.54% to 6.85%. The classification performance of SCA-HDNN-2 (using only the attention mechanism) and SCA-HDNN-3 (using only sentiment knowledge) is significantly inferior to that of SCA-HDNN (using both sentiment knowledge and the attention mechanism). The ablation results of each variant model (SCA-HDNN-1, SCA-HDNN-2, SCA-HDNN-3) and the proposed model SCA-HDNN on the multi-domain datasets are shown in Figure 2. As shown in Figure 2, SCA-HDNN obtained the best performance on the six multi-domain datasets.
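The relative improvement ratio of Eq. (16) is a one-line computation, sketched below. The function name is hypothetical and the accuracy values are invented for illustration; they are not taken from Tables 7 or 8.

```python
def relative_improvement(acc_full, acc_variant):
    """Relative improvement (in percent) of the full model over a variant, as in Eq. (16)."""
    return (acc_full - acc_variant) / acc_variant * 100

# Invented accuracies, for illustration only: 88.0 vs. 80.0 gives about 10%.
gain = relative_improvement(88.0, 80.0)
```

Because the ratio is normalised by the variant's accuracy, the same absolute gain counts for more when the variant's accuracy is lower.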
The sentiment knowledge input has a significant effect on the performance of the proposed method for sentiment classification. The sentiment knowledge we utilize includes MW (meaningful words obtained by word conversion and typo correction), LR (linguistic rules with POS tagging), and WCSL (wide coverage sentiment lexicon) (see Figure 1). To examine the effect of the sentiment knowledge input on the proposed model, we conducted sentiment knowledge adaptation experiments with SCA-HDNN on the six datasets. Table 9 shows that performance varies somewhat as individual sentiment knowledge components are added. However, when the complete sentiment knowledge is included, the model's overall performance improves, which shows that sentiment knowledge enhances SCA-HDNN performance.

3) THE IMPACT OF PRE-TRAINED WORD EMBEDDING SCHEME
Generally, different word embedding generation techniques have different effects on classification performance. We therefore investigate the impact of different pre-trained word embeddings on classification performance. We consider Word2vec, GloVe, fastText, and BERT, the four most widely used pre-trained word embedding models for vectorization and sentiment classification. In this section, a series of experiments examines the impact of these word embedding schemes on SCA-HDNN performance. As shown in Table 10, the pre-trained BERT vectorization obtained better classification performance than Word2vec, GloVe, and fastText. The word vectors generated by BERT are of high quality and are more versatile and advantageous than those of the conventional models (i.e., Word2vec, GloVe, and fastText). The fastText word embedding approach produced good classification performance compared to the pre-trained Word2vec and GloVe embeddings. The primary drawback of Word2vec and GloVe is that they assign a random vector to a word that is not in the dataset [91]; fastText manages to overcome this drawback. The pre-trained Word2vec embedding scheme results in the lowest classification performance.

G. ANALYSIS OF THE REASONS FOR THE HIGHER ACCURACY OF THE PROPOSED SCA-HDNN MODEL
There are several reasons for the higher accuracy of the proposed model compared to SOTA DNN-based baseline models. The first reason is that noisy and unnecessary features are eliminated from the review text and typos are corrected during text pre-processing. The second reason is the extraction and selection of contextually relevant sentiment features. The third reason is the classification of mixed opinions using linguistic semantic rules. The fourth reason is the incorporation of WCSL for sentiment feature identification. The fifth reason is the ability to represent sentiment features as dense real-valued vectors and to capture both contextual text semantic information and the long-dependency relation in the word sequence. The sixth reason is the optimized hyper-parameter setting in the proposed semantic and context-aware hybrid sentiment analysis model. The seventh reason is the effective combination of BERT, BiLSTM, Attention Mechanism, and CNN with sentiment knowledge in the proposed approach, which extracts highly informative sentiment features for accurate sentiment classification.

H. DISCUSSIONS
The empirical results obtained by our proposed approach are more robust than those of various baseline models. According to the empirical results, the semantic and sentiment knowledge, along with the attention mechanism, significantly affects the performance of our proposed model. We leveraged linguistic semantic rules with POS tagging and an integrated wide coverage sentiment lexicon to capture both semantic and sentiment clues in the text. Next, we used the pre-trained BERT model to generate word embeddings for text semantic and sentiment representation. The BiLSTM is then applied to recognize both the preceding and the following contextual sentiment information in the text sequence. We also utilized an attention mechanism to assign weights to different words and identify salient features in the word sequence. In addition, CNN is applied to extract local key features. The combination of these procedures enhances the overall classification performance of SCA-HDNN. In this regard, we have evaluated the impact of sentiment knowledge and the attention mechanism on SCA-HDNN. The empirical analysis shows that removing the sentiment knowledge and the attention mechanism significantly reduces the classification performance of SCA-HDNN; both components considerably improve its accuracy.
The methods used to create word embedding vectors can affect how well text is classified. In terms of the predictive performance of pre-trained word embedding models for text classification, the BERT model surpasses Word2vec, GloVe, and fastText. BERT creates contextual word embeddings that can capture complex linguistic information, which explains how well the BERT representation works.
A comparison of the proposed method with seventeen SOTA deep neural network models for sentiment analysis on six datasets was conducted to confirm its efficacy. SCA-HDNN attained the best predictive performance on five datasets (MR, RT-2K, CR, SemEval 2013, and SemEval 2014) and results comparable to many previous relevant methods on the SST-2 dataset.
Regarding computational cost, we ran all model variations for up to 10 epochs to compare running time. Figure 3 shows the execution time of the proposed model and the other model variations on the large Movie Reviews (MR) dataset. The proposed model SCA-HDNN is slightly slower than the other models due to the inclusion of both sentiment knowledge and the Attention Mechanism.

V. CONCLUSION
The vast amount of unstructured user-generated data in the form of text comments and reviews available online provides a wealth of valuable insights that can be used for e-commerce and business analysis. Sentiment analysis is a significant research area that aims to categorize online unstructured user-generated data into positive and negative categories. Building an accurate sentiment analysis model is difficult because of the enormous amount and unstructured nature of textual data. This paper rethinks the well-known sentiment analysis problem in the context of an attention-based hybrid DNN model (BERT, BiLSTM, CNN) in conjunction with linguistic semantic and sentiment knowledge. We presented SCA-HDNN, a sentiment and context-aware attention-based hybrid DNN model for sentiment classification that leverages wide coverage domain sentiment lexicons, linguistic rules, and POS tagging with sentiment clues. Linguistic semantic and sentiment knowledge helps to capture sentiment and context-aware features for model learning. The combination of BERT, BiLSTM, Attention Mechanism, and CNN in the proposed approach effectively extracts highly informative features for sentiment classification.
We compared our proposed model SCA-HDNN against SOTA models. Our in-depth evaluation demonstrated that the proposed model significantly improved the accuracy of existing text sentiment classification.
Additionally, compared to context-based pre-trained methods using Word2vec, GloVe, and fastText models, the proposed approach for learning sentiment-enhanced word embeddings with BERT has produced substantially better results.
For future work, we plan to investigate alternative, more effective sequence learning models and pre-trained language models for target text semantic and sentiment detection [53], [80]. We will utilize dependency parsing vectors to learn the sentiment clues from different perspectives to improve the sentiment classification performance [76]. We may employ the PCA statistical feature selection technique to reduce the high-dimensional feature space and select optimum sentiment features. We intend to extend the proposed system to other tasks/domains, making it a more versatile algorithm framework. Additionally, we plan to work on aspect-level sentiment analysis in order to better understand user attitudes by connecting them to particular features or aspects. For many businesses, this is quite important because it enables them to get in-depth user feedback and determine which aspects of their goods or services need to be improved.