Intelligent Hybrid Feature Selection for Textual Sentiment Classification

Sentiment Analysis (SA) aims to extract useful information from online Unstructured User-Generated Content (UUGC) and classify it into positive and negative classes. State-of-the-art techniques for SA suffer from a high-dimensional feature space caused by noisy and irrelevant features in the UUGC. Researchers have proposed feature extraction and selection techniques to reduce this high-dimensional feature space, but they fall short of extracting and selecting the most effective sentiment features for sentiment model learning. Effective feature extraction and selection are significant for SA because they can boost the learning algorithm's predictive performance while reducing the high-dimensional feature space. To address these concerns, we propose an Intelligent Hybrid Feature Selection for Sentiment Analysis (IHFSSA) based on ensemble learning methods. IHFSSA first identifies sentiment features in the review text using the Penn Treebank part-of-speech tagset and integrated Wide Coverage Sentiment Lexicons (WCSL). A subset of sentiment features is then selected with a fast and simple rank-based ensemble of multiple filters feature selection method. The selected sentiment features are further refined by applying a wrapper-based backward feature selection method. Finally, for textual sentiment classification, the well-known classification algorithms Support Vector Machine (SVM), Naive Bayes (NB), and Generalized Linear Model (GLM) are trained in an ensemble model on the refined sentiment feature set. An in-depth evaluation on heterogeneous-domain benchmark datasets demonstrates that IHFSSA outperforms existing SA techniques.


I. INTRODUCTION
Blogs, discussion forums, shared knowledge-seeking networks, social network platforms, and product and movie review portals [1]-[5] are only a handful of the social media platforms that have emerged with Web 2.0 [6], [7]. On these social platforms and review websites, users express their thoughts, opinions, and perceptions about a range of topics, activities, people, organizations, goods, and services. These expressions are often beneficial to both businesses and customers.
Such user-generated content must be analyzed to become usable information for business strategy and product purchase decisions. However, due to its unstructured nature and huge volume, user-generated content incurs significant cognitive overload when searched and evaluated manually. Accurately extracting valuable knowledge from the massive volume of UUGC is the major hurdle for product manufacturers and data scientists.
SA, also called opinion mining, is the process of analyzing, processing, and extracting useful knowledge from UUGC and classifying it into positive and negative classes using text mining and machine learning. Textual SA has been carried out at three levels: document, sentence, and aspect [8]. In this paper, we focus on document-level SA, processing each sentence in the document and converting it to words using a sentence parser and tokenizer, respectively.
In textual SA, text pre-processing/transformation and feature extraction/selection are vital processes [9]. The former is used to create a feature vector for machine understanding, while the latter is used to extract and select the most important features for SA. Specifically, feature extraction focuses on identifying and extracting sentiment words/phrases, while feature selection chooses the subset of the most important and relevant features. Feature extraction and selection are essential for building a scalable, generalized model that improves performance and minimizes model complexity and storage requirements [10].
Proposals for sentiment classification acknowledge the importance of feature extraction and selection [9], [11]-[14]. State-of-the-art techniques for sentiment classification mostly depend on conventional Bag-of-Words (BOW) approaches [4], [15], [16]. These approaches can be useful due to their simplicity, but they ignore semantics and word order and suffer from high dimensionality. Researchers have also proposed semantic-based SA methods utilizing higher-order n-grams, Part-of-Speech (POS) patterns, and dependency relation features that consider word order [9], [11], [12], [17]-[19]. However, not every word or phrase in higher-order n-grams, POS patterns, and dependency relation features expresses sentiment. Therefore, extracting and selecting the subset of the most relevant features by considering semantics, word order, and sentiment information is an important task for robust and intelligent model development and high classification performance. Khan et al. [20] focused on exploring POS patterns and POS n-gram patterns for SA while considering sentiment clues and employing ensemble techniques. Unlike this existing work [20], however, IHFSSA uses a hybrid of filter and wrapper methods to select and refine optimum sentiment features, addressing the issues of SA and improving sentiment classification performance. Moreover, it also tackles the challenge of context-based SA for text that contains mixed opinions, and it standardizes WCSL to identify the sentiment features.
This paper addresses these challenges by integrating the two main types of feature selection, namely filter and wrapper methods. A filter-based method employs ranking criteria to assign a score (weight) to each feature according to the algorithm's implementation; the top-ranked feature subset is then selected according to the filter criterion. Examples are Document Frequency (DF), Information Gain (IG), Chi-square (CHI), and Mutual Information (MI) [21]. The filter method is simple and fast but neglects the interconnection with the classifier. The wrapper approach evaluates features and selects the most important ones, i.e., those with the greatest effect on classification performance. Forward selection, backward selection, and evolutionary search are some examples of wrapper methods [22], [23]. The wrapper method is computationally more costly than the filter-based method.
In this paper, we propose an intelligent model for textual SA based on hybrid feature selection with ensemble learning methods. The Ensemble of Multiple Filters Feature Selection (EMFFS) is integrated with a wrapper-based Backward Feature Selection (BFS) method. The goal of integrating filter and wrapper methods is to exploit the benefits of both to find the most effective/optimal sentiment features while reducing the feature space and enhancing sentiment classification performance. In this study, the EMFFS method, based on the well-known base filters IG, CHI, Gain Ratio (GR), Standard Deviation (SD), and Gini Index (GI), is employed first. Each base filter ranks the feature set of a dataset and then selects the top-k features. After that, the subset of top-k features selected in common by the individual base filters is chosen based on a Majority Voting Threshold (MVT). Next, the selected common features are further refined by the wrapper-based BFS method to obtain the most effective sentiment features. Finally, we train the classification model on the refined, most effective sentiment features for SA. We conducted experiments on different benchmark review datasets that are widely used for SA. The experimental results show that the proposed intelligent hybrid approach accurately predicts the true sentiment class, improves the performance of SA efficiently, and outperforms baseline methods. This paper makes the following contributions:
• We propose IHFSSA, which integrates filter-based EMFFS with wrapper-based BFS to select optimum sentiment features with a reduced feature space and improved sentiment classification performance.
• We utilize the integrated WCSL to identify the sentiment features in the review text.
• We employ linguistic rules to determine the correct class of review text that contains both positive and negative sentences/clauses.
• We apply POS unigrams and bigrams to extract semantic and context-aware sentiment features.
• We refine and reduce the feature space to obtain the subset of the most effective sentiment features for sentiment classification.
• We evaluate IHFSSA using two well-known benchmark datasets and demonstrate the effectiveness of hybrid feature selection and the improvement in sentiment classification.
The remainder of the paper is organized as follows. Section II presents related work. The detailed methodology and architecture for feature extraction, selection, and dimensionality reduction are described in Section III. The experimental study, datasets, and evaluation results are discussed in Section IV. Section V concludes the paper along with future work.

II. RELATED WORK
SA systematically identifies, extracts, selects, and classifies sentiment information from source materials using Natural Language Processing (NLP), text mining, and Machine Learning (ML) techniques [8]. Before 2000, SA received little attention in either NLP or linguistics research. The main reason was that only a limited amount of opinionated textual data was available in digital format at the time. Since its inception around the year 2000, the field of SA has expanded to become one of the most active research fields in NLP, data mining, web mining, information retrieval, and management sciences [8]. SA techniques have been broadly used for a variety of purposes, including classifying customer reviews [24]-[26], tracing political viewpoints [27], [28], and predicting stock market trends [29], [30].
In several studies on textual SA, researchers have examined various feature representation schemes and proposed different techniques to improve sentiment classification performance. For sentiment classification, Pang et al. [13] used a supervised machine learning technique. For movie review sentiment classification, they used three machine learning classifiers (NB, ME, and SVM). They utilized various n-gram feature sets, such as unigrams, bigrams, unigrams and bigrams together, and unigrams with POS tags, to classify sentiment. They achieved 81%, 82.9%, and 80.4% accuracy for the NB, maximum entropy, and SVM classifiers, respectively, and reported that either the NB or the SVM classifier using unigram features performed well. Similarly, Go et al. [31] used the distant supervision method to conduct SA on Twitter messages. They effectively used emoticons as noisy labels to create training data for distant supervised learning. They conducted experiments employing NB, SVM, and maximum entropy, and achieved accuracies of 81.3%, 82.2%, and 80.5%, respectively. Zhang et al. [32] used NB and SVM for the sentiment classification of restaurant reviews written in Cantonese. They also examined how feature representation and feature size affect classification accuracy. The NB classifier achieved the highest accuracy of 95.67%.
For movie review classification, Tripathy et al. [11] used four supervised machine learning classifiers: NB, ME, SVM, and Stochastic Gradient Descent. They demonstrated n-gram techniques, unigram, bigram, trigram, and various combinations of these, for sentiment feature generation. They obtained the best results with the SVM classifier by merging unigrams and bigrams, as well as unigrams, bigrams, and trigrams. Using n-grams and a dependency relation scheme, Ng et al. [17] used an SVM classifier to classify textual reviews. They addressed two issues: the presence of user reviews in a text document, and the polarity of those reviews (positive or negative). On product review and movie review datasets, Singh et al. [35] analyzed and compared the performance of the Multi-Layer Perceptron (MLP), NB, and SVM. On both the movie and product review datasets, SVM worked best. The presumption in NB that attributes are independent may not hold in every case, and MLP needs more execution time than the other techniques. Bai et al. [36] utilized ME, NB, SVM, and Markov Blanket Model algorithms for sentiment classification. They evaluated their models using datasets of movie reviews and news articles. They suggested a heuristic-search-enhanced Markov blanket model to capture word dependencies and provide the vocabulary for sentiment extraction. In comparison to other classifiers, their proposed model yielded good results. Kalaivani et al. [37] compared SVM, NB, and KNN for movie review sentiment classification. The SVM approach outperformed the NB and KNN approaches, according to their experimental results, with the accuracy of SVM reported to be more than 80%. Bilal et al. [38] researched opinions in Urdu and English blogs. They employed NB, Decision Tree, and KNN to classify the opinions in these blogs. According to their results, NB outperformed Decision Tree and KNN.
Turney [14] presented an unsupervised approach for review document sentiment classification (recommended or not recommended). In Turney's work, a review document is classified by computing the average semantic orientation of its phrases containing adjectives or adverbs. Another unsupervised learning approach was presented by Xiangua et al. [39] to automatically determine the aspects addressed in Chinese reviews as well as the sentiments expressed toward different aspects. They found multi-aspect global topics using the Latent Dirichlet Allocation (LDA) model, then extracted local topics and related sentiments using a sliding window context over the review document. The local topic was determined by the trained LDA model, and the polarity of the related sentiment was classified with the HowNet lexicon. Their experimental results indicate that the method performed well in topic exploration and sentiment analysis. Mullen et al. [40] assigned semantic values to phrases and terms, then combined these values as features to construct the classification model. Mastumoto et al. [41] used the syntactic relationships between words as the basis for review text sentiment classification. They mined frequent word sub-sequences and dependency sub-trees from sentences and used them as features for sentiment classification with an SVM classifier.
Feature selection is an essential task in SA. It helps to minimize the dimensionality of the feature space by choosing the subset of the most useful and effective features without losing too much information. Feature selection methods are mainly divided into two groups, namely filters and wrappers. The filter method chooses a subset of relevant features independently of the classifier. The wrapper method uses classifier performance to select a subset of relevant features. Tan et al. [42] utilized four statistical filters for feature selection, namely Mutual Information, IG, CHI, and Document Frequency, with five ML algorithms (K-nearest neighbor, Centroid Classifier, Winnow Classifier, NB, and SVM) for Chinese-corpus sentiment classification. Their results, obtained on a Chinese corpus, show that IG and SVM perform best among the feature selection methods and ML algorithms for sentiment classification. Sharma et al. [43] employed five filter-based feature selection techniques, namely IG, CHI, Document Frequency, GR, and Relief, for SA on an online movie review dataset. Their study shows that feature selection can boost sentiment classification performance, but the gain depends on the number of features selected and the technique used to select them. Their experimental results indicate that GR outperformed all other techniques.
Alireza et al. [19] presented a hybrid of filter and wrapper feature selection methods to minimize the reliance on feature selection techniques and obtain a minimal feature subset for document sentiment classification. They presented two techniques for feature selection, namely ordinal-based integration of different feature vectors (OIFV) and frequency-based integration of different feature subsets (FIFS). They used unigrams and POS patterns for sentiment feature identification along with their proposed feature selection techniques. Their experimental results show that POS-based features are more effective than unigram-based features. Osman et al. [4] presented a wrapper feature selection algorithm for sentiment classification based on the Iterated Greedy meta-heuristic. They used Multinomial NB as the classifier and GR filter scores as heuristic information for greedy selection. Their experimental results show that their algorithm outperforms traditional filter-based feature selection techniques as well as a Genetic Algorithm-based feature selection algorithm. Jing et al. [15] proposed two feature selection methods, called modified categorical proportional difference (MCPD) and balance category feature (BCF), that select attributes equally from text reviews. Their experimental results showed that the combination of the BCF and MCPD methods can not only reduce the feature space but also improve sentiment classification performance.
Ensemble techniques have reported successful results and improved classification performance in many domains [12], [44]-[48]. Kalaivani et al. [49] proposed a machine learning-based feature selection method utilizing IG and a Genetic Algorithm. They applied NB, logistic regression, SVM, and ensemble techniques to multi-domain datasets and movie review datasets for evaluation. According to their experimental results, IG and the genetic algorithm with the ensemble technique performed better. Xia et al. [12] investigated the effectiveness of ensemble techniques on different feature sets for sentiment classification. Khan et al. [20] proposed a novel POS and n-gram based ensemble method for SA, called EnSWF, that considers semantics, sentiment clues, and the order between words. In this method, appropriate features for SA were extracted and selected based on POS patterns and POS n-gram patterns employing ensemble techniques. Onan et al. [16] proposed an ensemble approach to feature selection, in which several feature ranking lists obtained by different feature selection methods were aggregated. They used a genetic algorithm for the aggregation of ranked feature lists. They tested their proposed ensemble approach on a variety of domain datasets and found that it outperforms individual filter-based feature selection methods. Table 1 provides an overview of the different approaches proposed for textual sentiment classification.
Recently, deep learning-based algorithms such as long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM), gated recurrent unit (GRU), bidirectional gated recurrent unit (Bi-GRU), convolutional neural networks (CNN) [50], capsule networks [51], and Transformer-based language models (BERT) [52] have also been widely used for SA and challenging natural language processing applications, yielding state-of-the-art prediction results. The scope of this work, however, is traditional feature-based approaches. Such techniques are still useful for relevant feature selection and dimensionality reduction, and they have their own advantages in interpretability and time complexity. Furthermore, the main objective of this study is to select the optimum sentiment features with a reduced feature space and to improve sentiment classification performance. According to the literature review, researchers have introduced different feature extraction and/or selection strategies, as well as ensemble learning methods, for sentiment classification, and have improved sentiment classification results. However, the majority of them overlooked the effective use of hidden significant information that carries more sentiment information than other vocabulary. Likewise, they did not consider the most effective sentiment features, and the existence of noisy and irrelevant features in the text was also ignored. Besides, state-of-the-art feature extraction and/or selection techniques have their own statistical biases and require an educated guess to identify appropriate features. Existing sentiment lexicons and domain knowledge may be used to evaluate the most appropriate sentiment features in SA. The combination of a filter-based feature selection method with a wrapper-based backward feature selection method employing the most effective sentiment features for SA is the focus of this paper.
Figure 1 depicts the framework of our proposed hybrid approach for textual sentiment classification. There are four major phases in the framework: (1) feature representation, (2) sentiment feature extraction, (3) sentiment feature selection, and (4) training and applying the classification model.

III. METHODOLOGY
We first split the review text into sentences, then transform them into words and assign POS tags to recognize suitable words such as adjectives, adverbs, verbs, and nouns. After this, the identified POS words are matched against the WCSL to extract suitable sentiment words. Second, we apply the n-gram model (unigram and bigram) to the POS features and generate POS-tagged unigram and bigram features. Besides, we utilize dimensionality reduction techniques (filter- and wrapper-based feature selection) to reduce the feature set and select the most effective refined sentiment features for accurate sentiment classification. The filter method is simple, fast, and scales well with the dimensions and number of samples. The wrapper method takes care of interaction among features and increases classification performance. Finally, we utilize three distinct supervised ML algorithms, SVM, NB, and GLM, as base learners in a classifier ensemble (CE) employing the majority voting approach for sentiment classification. The main goal of this study is to select the most effective sentiment features with a reduced feature space and to improve sentiment classification performance. The technical details of the proposed methodology are elaborated in the following sub-sections.

Figure 1: Proposed hybrid sentiment analysis framework.
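The final classification stage described above can be sketched as follows, assuming scikit-learn. LinearSVC, MultinomialNB, and LogisticRegression stand in for the SVM, NB, and GLM base learners (using LogisticRegression as the GLM is our assumption, not the paper's stated implementation), combined by hard majority voting on toy documents:

```python
# Hedged sketch of the SVM/NB/GLM majority-voting ensemble over TF-IDF
# features. Data and the choice of LogisticRegression as the GLM are
# illustrative assumptions, not the paper's exact setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

docs = ["great movie loved it", "awful plot boring film",
        "wonderful great acting", "terrible dull plot"]
labels = [1, 0, 1, 0]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("svm", LinearSVC()),
                    ("nb", MultinomialNB()),
                    ("glm", LogisticRegression())],
        voting="hard"))  # hard = majority vote over predicted labels

ensemble.fit(docs, labels)
preds = ensemble.predict(["great wonderful film"])
```

Hard voting matches the majority-voting approach named in the text; soft voting would instead average class probabilities.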

A. FEATURE REPRESENTATION
The identification of suitable features from unstructured textual data is a challenging task in sentiment classification [9]. In the literature, various feature representation methods have been used for textual sentiment classification, including BOW, POS patterns, linguistic patterns, n-grams, dependency relation features, and semantic and contextual features [19], [53], [54]. In this study, we use different modules for feature representation and construction: a sentence parser, linguistic semantic rules, a tokenizer, a noise remover, a lower-case transformer, a POS tagger, and the Term Frequency-Inverse Document Frequency (TFIDF) scheme.
The product review dataset is first loaded, after which the review text is split into words using the sentence parser and tokenizer. The noise remover module then eliminates all noise from the text, and the case transformer module converts the text to lowercase. Next, the POS tagger assigns POS tags to the input words to identify and extract candidate words such as adjectives, adverbs, verbs, and nouns. After that, these words are looked up in the integrated WCSL to extract the suitable sentiment words and remove non-sentiment words. Finally, the feature vector is created using the TFIDF scheme, pruning words with an absolute frequency below 3.0 or above 30000 from the feature vector space. In this study, we employed the Penn Treebank annotation scheme [55] for POS tagging, as shown in Table 4.
The feature representation process also applies linguistic rules on the product review datasets. Linguistic rules help context-based SA for text that contains mixed opinions. A sentence may contain specific words, such as 'but', 'despite', 'while', and 'unless', that can switch its polarity. For example, in the sentence "the director is popular but the movie is boring", the linguistic rules consider only the clause after "but" and ignore the clause before it. Many review documents contain mixed opinions, and the use of linguistic rules can positively affect the classification of such reviews. Based on the assumption of previous research, each review document or sentence has a single polarity. Following the work in [56], [57], we apply the same linguistic rules, as shown in Table 2.
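The "but" rule can be sketched in a few lines; the function name and the clause-splitting heuristic below are illustrative, not taken from the paper:

```python
# Minimal sketch of the "but" linguistic rule: when a sentence contains
# "but", keep only the clause after it for sentiment scoring.
def apply_but_rule(sentence: str) -> str:
    """Illustrative helper: keep only the clause after 'but', if present."""
    lowered = sentence.lower()
    marker = " but "
    if marker in lowered:
        # slice on the last occurrence; lower() preserves string length
        return sentence[lowered.rindex(marker) + len(marker):].strip()
    return sentence

clause = apply_but_rule("The director is popular but the movie is boring")
# clause == "the movie is boring"
```

Sentences without a polarity-switching word pass through unchanged, matching the single-polarity assumption stated above.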
B. INTEGRATED WIDE COVERAGE SENTIMENT LEXICONS
We standardize and integrate the aforementioned ten state-of-the-art sentiment lexicons so that each word has one of three scores: +1, -1, or 0. We compute the average sentiment score of the words that overlap across these lexicons to obtain more sentiment words and create one large sentiment lexicon. The size and format of the state-of-the-art sentiment lexicons for word coverage are shown in Table 3. First, we standardized the lexicons by assigning the scores +1, -1, and 0 to positive, negative, and neutral words, respectively. Then we calculated the sentiment score of every word as the average score of that word across the overlapping lexicons.
In this study, sentiment words in the review documents are matched against the integrated wide coverage sentiment lexicons and then used for sentiment classification.
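The standardize-and-average integration just described can be sketched as follows; the three toy lexicons below are placeholders for the ten real lexicons of Table 3, and the function name is illustrative:

```python
# Sketch of the lexicon integration step: each lexicon is assumed to be
# already standardized to scores in {+1, -1, 0}; overlapping words receive
# the average of their scores across lexicons. Toy data, not the real WCSL.
def integrate_lexicons(lexicons):
    # collect every score assigned to each word across the lexicons
    merged = {}
    for lex in lexicons:
        for word, score in lex.items():
            merged.setdefault(word, []).append(score)
    # average the overlapping scores into a single integrated score
    return {w: sum(s) / len(s) for w, s in merged.items()}

lex_a = {"good": 1, "bad": -1, "movie": 0}
lex_b = {"good": 1, "bad": -1, "boring": -1}
lex_c = {"good": 1, "boring": -1, "movie": 0}
wcsl = integrate_lexicons([lex_a, lex_b, lex_c])
```

Words on which the lexicons agree keep their full score, while disagreements would be averaged toward zero.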

C. SENTIMENT FEATURES EXTRACTION
Feature extraction is a method of identifying and extracting text features that are useful for classification. The primary issue in sentiment classification is the extraction of appropriate features that perform better than simple features [9]. Identification of appropriate features is critical for reliable model learning. BOW, higher-order n-grams, POS, and negation handling are some of the feature extraction methods used in SA. In this step, we identify and extract suitable features that express sentiment clues. In the literature, four types of POS words, i.e., adjectives and adverbs, accompanied by verbs and nouns, are considered the sentiment features for sentiment classification [12], [19]. Based on this literature, we use the same POS words for the initial feature extraction in this study. For the initial set of sentiment features, the extracted POS words are first matched against the integrated WCSL. Then the n-gram model (unigram and bigram) is utilized to generate POS-tagged unigram and bigram sentiment features. Sentiment features are usually single words, i.e., unigrams (beautiful, nice, bad, awful, etc.) or bigrams (very nice, much better, not bad); as a result, we extracted sentiment features using unigrams and bigrams. An example list for the n-gram model is shown in Table 5.

Algorithm 1: Backward search for optimum feature selection.
Input: TF, the total feature set. Output: OF, the optimum feature subset.
1. Initialize OF_0 with the total features in the input space.
2. Use the features in OF_0 to train the classifier.
3. Calculate the classifier accuracy (AC).
4. For each feature in the current subset, tentatively remove it, retrain the classifier, and record the accuracy; permanently drop the feature whose removal decreases accuracy the least, and if performance improves, go to Step 2;
else the feature selection process is completed.
Return the optimum features.
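The unigram/bigram generation over the POS-filtered, lexicon-matched words can be sketched as:

```python
# Sketch of n-gram feature generation. Tokens are assumed to already be
# POS-filtered and matched against the WCSL; the helper name is illustrative.
def ngrams(tokens, n):
    # contiguous n-grams joined into single feature strings
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams(["very", "nice", "film"], 1)
bigrams = ngrams(["very", "nice", "film"], 2)
```

Bigrams keep local word order, which is exactly the context information the plain BOW representation loses.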

D. FEATURES SELECTION
The process of selecting the most important and applicable features in order to minimize the high-dimensional feature space and make an accurate prediction is known as feature selection. Feature selection has the potential to improve classification performance. Feature selection techniques can be mainly classified into filter and wrapper methods. The filter method assigns a score (weight) to each feature based on the relevant algorithm implementation using ranking criteria. The filter method is simple, fast, and scales well with the dimensions and the number of samples. A classifier is used in the wrapper approach to evaluate a subset of features and select the most significant features, i.e., those that have the greatest impact on classification performance. A greedy search algorithm is used for the wrapper process. Even though it is computationally costly, the interaction between features is observed.

Table 2: Linguistic rules for sentences with mixed opinions [56], [57].
R1: "The director is popular but the movie is boring." If a sentence contains "but", ignore the previous sentiment and consider only the sentiment after the "but" part.
R2: "I love this movie, despite the fact that I hate the director." If a sentence contains "despite", consider only the sentiment before the "despite" part.
R3: "Everyone likes this video unless he is a sociopath." If a sentence contains "unless" and the "unless" is followed by a negative clause, ignore the "unless" clause.
R4: "While they did their best, the team played a horrible game." If a sentence contains "while", ignore the clause immediately following the "while" and consider only the sentiment of the clause that follows it.
R5: "The film counted with good actors, however, the plot was very poor." If a sentence contains "however", ignore the clause before "however" and consider the sentiment of the clause after "however".

Table 5 (n-gram example, excerpt): the bigrams of "This is amazing film" are "This is", "is amazing", and "amazing film".

The five well-known statistical filters used in the EMFFS for sentiment feature selection in the proposed method are described as follows.
Information Gain (IG): Information gain calculates the relevance of features in the documents for predicting the given category. IG assigns weights to features; the higher a feature's weight, the more relevant it is considered to be [21]. Based on the information gain value (weight), we pick the N top-ranked, most appropriate features. IG is used to reduce the number of features by retaining the highest-ranked features and excluding those with information gain smaller than a predefined threshold. Equation 1 [67] defines the IG for the feature X, where X is the feature and Y is the class:

IG(X) = H(Y) - H(Y | X)    (1)

where H(Y) is the entropy of the class Y and H(Y | X) is the conditional entropy of Y given X.
Standard Deviation (SD): The standard deviation indicates how much divergence or dispersion occurs in the feature space from the average (the mean, or expected value). A low standard deviation means that the feature values lie close to the mean, while a high standard deviation indicates that they are distributed over a wide range of values. The formula is straightforward: the standard deviation is the square root of the variance. It is measured as follows [68]:

mean_k = (1/N) * sum_{i=1..N} x_ij    (2)

Stdev(f_j, C_k) = sqrt( (1/N) * sum_{i=1..N} (x_ij - mean_k)^2 )    (3)

where f_j is the j-th feature, N is the number of instances in a class, x_ij is the weight of the j-th feature in the i-th instance, mean_k is the mean of the j-th feature in the k-th class, and Stdev(f_j, C_k) is the standard deviation of the j-th feature with respect to the k-th class C_k.
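As a worked numeric illustration of the information gain in Equation 1, the following computes IG(X) = H(Y) - H(Y | X) for a toy binary feature and a binary class; the counts are invented for illustration:

```python
# Worked sketch of Equation 1 with toy counts (not from the paper's data).
from math import log2

def entropy(probs):
    # H(p) = -sum p * log2(p), skipping zero-probability outcomes
    return -sum(p * log2(p) for p in probs if p > 0)

# Toy corpus: 8 documents, 4 positive and 4 negative.
# The feature X is present in 3 positive and 1 negative document.
h_y = entropy([4 / 8, 4 / 8])                 # H(Y) = 1.0 (balanced classes)
p_x = 4 / 8                                   # P(X present)
h_y_given_x = p_x * entropy([3 / 4, 1 / 4]) \
    + (1 - p_x) * entropy([1 / 4, 3 / 4])     # H(Y | X)
ig = h_y - h_y_given_x                        # Equation 1
```

Because the feature co-occurs mostly with the positive class, knowing X reduces the uncertainty about Y, giving an IG of roughly 0.19 bits here.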
Chi-square (CHI): The chi-square test [21] is used to compute the degree of association between a feature f and its category c, and to choose the desired number of features with the best chi-square scores. The intuition is that if the category c is independent of the feature f, then the feature is removed. Using chi-square, the worth of a feature is measured by computing the value of the chi-square statistic with respect to the target class. The formula for the chi-square is given in Equation 4 [21]:

CHI(f, c) = N * (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D))    (4)

where f is the feature, c is its category, N is the total number of documents, A is the number of times f and c co-occur, B is the number of times f occurs without c, C is the number of times c occurs without f, and D is the number of times neither c nor f occurs.
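Equation 4 can be evaluated directly from the counts A, B, C, and D defined above; the counts below are toy values chosen for illustration:

```python
# Sketch of the chi-square score from the A/B/C/D contingency counts.
def chi_square(a, b, c, d):
    # N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), as in Equation 4
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Toy counts: f and c co-occur 30 times (A); f without c: 10 (B);
# c without f: 20 (C); neither: 40 (D).
score = chi_square(30, 10, 20, 40)
```

When the feature and category are independent (AD equals CB), the score is exactly zero and the feature would be discarded.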
Gain Ratio (GR): The Gain Ratio, also known as the Information Gain Ratio, is a modification of information gain that reduces its bias toward high-branch features. It is the ratio of the information gain to the intrinsic information [69]. The GR formula is given in Equation 5:

GR(X) = IG(X) / H(X)    (5)

where X represents the feature and H(X) denotes its intrinsic information (entropy).
Gini Index (GI): The original GI algorithm is used to assess the impurity of features with respect to classification. The feature's GI is the smallest over all judgment thresholds f in [0, 1] [48]. The lower the value, the lower the impurity and the better the feature. Conversely, when calculating the purity of a feature for categorization, a higher value reflects greater purity and a better feature. The calculation of purity has been used in several experiments on GI theory. The improved GI formula [48] is given in Equation 6:

GI(f) = sum_i P(f | C_i)^2 * P(C_i | f)^2    (6)

where P(f | C_i) is the probability of the feature f given the class C_i, and P(C_i | f) is the posterior probability of the class C_i given the feature f.

E. THE ENSEMBLE OF MULTIPLE FILTERS FEATURE SELECTION
The EMFFS uses five different state-of-the-art statistical filters, namely IG, SD, CHI, GR, and GI, as base filters for feature selection. In this method, the features are first ranked by each of the aforementioned base filters. Then the subset of top-k common ranked features is selected based on the majority-vote threshold (MVT). The MVT λ (i.e., λ ≥ t, with t = 3) is used to choose the common ranked features selected by at least three of the base filters. After that, the ranks of the common features are aggregated using the arithmetic mean. Furthermore, normalization is used to scale the feature values into a common range. Adjusting the value range is important when dealing with features of various scales: in feature ranking, each statistical filter assigns features a ranking score on its own scale. To avoid mixing scales, we re-scale the feature values assigned by the different statistical filters into the range between 0 and 1 using the min-max normalization technique. Equation 7 is used to normalize the values of the selected common features:
x' = (x - min(x)) / (max(x) - min(x))    (7)
Finally, the scaled features are sorted in descending order for further processing (refinement). In general, the aim of the EMFFS method is to produce a subset of relevant sentiment features and to reduce the risk of retaining irrelevant features that do not meet the MVT criterion. The workflow of EMFFS is shown in Figure 2.
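The steps above — per-filter ranking, the majority-vote threshold, min-max normalization, mean aggregation, and descending sort — can be sketched as follows. This is a simplified illustration with three filters and hypothetical function names, and it averages min-max-normalized scores; the paper's EMFFS uses all five filters inside Rapidminer Studio.

```python
from collections import Counter

def min_max(scores):
    """Min-max normalize a dict of feature -> score into [0, 1] (equation 7)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {f: (s - lo) / span for f, s in scores.items()}

def emffs(filter_scores, top_k, vote_threshold=3):
    """Rank-based ensemble of multiple filters.
    filter_scores: dict filter_name -> (dict feature -> score)."""
    top_sets = {name: set(sorted(s, key=s.get, reverse=True)[:top_k])
                for name, s in filter_scores.items()}
    normalized = {name: min_max(s) for name, s in filter_scores.items()}
    # keep features ranked in the top-k lists of at least `vote_threshold` filters
    votes = Counter(f for chosen in top_sets.values() for f in chosen)
    common = [f for f, v in votes.items() if v >= vote_threshold]
    # aggregate the normalized scores by arithmetic mean, sort descending
    mean_score = {f: sum(normalized[n][f] for n in filter_scores) / len(filter_scores)
                  for f in common}
    return sorted(mean_score.items(), key=lambda kv: kv[1], reverse=True)

scores = {
    "IG":  {"good": 3.0, "bad": 2.0, "the": 0.1},
    "CHI": {"good": 9.0, "bad": 8.0, "the": 1.0},
    "GI":  {"good": 0.9, "bad": 0.8, "the": 0.5},
}
print(emffs(scores, top_k=2, vote_threshold=3))
```

Here "the" fails the vote threshold (it never ranks in any filter's top 2) and is dropped, while "good" and "bad" survive with averaged normalized scores.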

F. WRAPPER-BASED BACKWARD FEATURE SELECTION
The wrapper approach generates and evaluates various subsets of features in the feature space. A heuristic search algorithm generates a subset of features, and a classifier evaluates the goodness of that subset. Wrapper approaches often yield better classification performance. BFS, also known as backward elimination, is the wrapper approach used in this analysis. We use BFS to eliminate features that have no noticeable impact on prediction performance. BFS is an iterative process. It begins with the complete set of features/attributes and, in each round, tentatively eliminates each remaining attribute of the given example set. Cross-validation is used to estimate the performance for each removed attribute, and only the attribute giving the smallest decrease in performance is actually omitted from the selection. A new round then starts with the modified selection. This process is repeated as long as performance improves. When there is no substantial improvement in performance, the BFS procedure terminates, and the remaining attributes are used to train and evaluate the final classification model. In this method, a 5% significance level is selected, meaning the P-value threshold is 0.05. If the P-value of a feature is greater than the selected significance level, the feature is removed from the feature set. This process continues until the highest P-value among the remaining features is less than the significance level, at which point the optimal feature selection process is complete. Algorithm 1 and Figure 6 include a comprehensive procedure and illustration of BFS for optimal feature selection. As shown in Figure 6, the least important attributes are discarded one by one.
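The backward-elimination loop can be sketched as below. The `evaluate` callback is a stand-in assumption for the cross-validated classifier accuracy the paper uses; the function names are our own, and the stopping rule here is the "smallest decrease wins, stop when every removal degrades performance" criterion described above rather than the P-value variant.

```python
def backward_selection(features, evaluate, tolerance=0.0):
    """Wrapper-based backward elimination (sketch).
    evaluate(subset) is assumed to return cross-validated accuracy."""
    selected = list(features)
    best = evaluate(selected)
    while len(selected) > 1:
        # score every candidate subset with one feature removed
        scored = [(evaluate([f for f in selected if f != g]), g) for g in selected]
        score, removed = max(scored)
        if score < best - tolerance:  # every removal degrades performance: stop
            break
        best = score
        selected.remove(removed)
    return selected

# Hypothetical evaluator: each informative feature adds 0.2 accuracy,
# the "noise" feature subtracts 0.1 (stand-in for cross-validated accuracy).
def evaluate(subset):
    return 0.2 * sum(1 for f in subset if f != "noise") - (0.1 if "noise" in subset else 0.0)

print(backward_selection(["a", "b", "noise"], evaluate))  # -> ['a', 'b']
```

The toy evaluator makes the cost of the wrapper visible: each round calls the classifier once per remaining feature, which is why the filter stage is applied first to shrink the search space.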

G. EFFECTIVE FEATURE SELECTION USING HYBRID APPROACH
This research proposes a hybrid feature selection approach for effective sentiment feature selection in SA that includes both the filter-based EMFFS and the wrapper-based BFS methods. Filter methods use statistical and mathematical criteria for feature selection. They are computationally simple, fast, and scalable, but they ignore interaction with the classifier. Furthermore, since the approach is univariate, it treats each feature individually and fails to capture relationships between features. Wrapper methods use the classifier itself to pick the best features. In a high-dimensional feature space, these approaches are computationally costly, although wrapper methods generally provide better classification results. A hybrid solution is therefore the intelligent choice: it makes a trade-off between the filter and wrapper methods. Algorithm 2 describes the detailed procedure of the proposed solution for effective feature selection using EMFFS and BFS. Algorithm 2 begins by examining the input dataset D, which contains n typical features. First, the EMFFS, which consists of five separate base filters (IG, SD, CHI, GR, and GI), is used to rank and pick the top-k features. Based on the MVT, these top-k features are then integrated and the common features are obtained. These common features are then fed into the BFS method for refinement. The BFS begins with all of these features and removes the least significant feature at each iteration, improving the classifier's accuracy. The feature subset that increases classification performance in each iteration is chosen as the optimal one. This procedure is repeated until no further feature elimination is detected. Once we have the final subset of optimal/effective sentiment features, the classification algorithms are used to compute accuracy using 5-fold cross-validation.

H. CLASSIFICATION ALGORITHMS AND ENSEMBLE LEARNING METHOD
In this section, we briefly discuss the classification algorithms with ensemble learning method for sentiment classification.
Classification Algorithms: Supervised learning trains models to predict the optimal outcome using a training set. Supervised ML classifiers have been commonly used in a variety of traditional text classification and sentiment classification tasks. In this study, we use three well-known ML classifiers. Support Vector Machine (SVM): SVM [70] is one of the popular supervised ML classifiers commonly utilized for classification problems. SVM has been successfully applied in text categorization [71] due to its robustness in high-dimensional spaces and has shown good performance in SA [13], [33], [34]. The aim of SVM is to find the hyperplane in the search space that best separates the instances of one class from those of another. Given a training set of n instances of the form (x_1, y_1), ..., (x_n, y_n), where x_i is a training instance labeled with class y_i, which is either 1 or -1, we want to determine the "maximum-margin hyperplane" that divides the positive and negative training instances with the maximum margin. In this study, we select the linear SVM for training and testing.
Naive Bayes classifier (NB): The NB classifier is a basic supervised ML classifier based on Bayes' theorem and naive independence assumptions between features: the NB classifier assumes that the features are conditionally independent of one another given a class. Given a problem instance represented by a vector X = (x_1, ..., x_n), where n denotes the number of features, the probability that the instance X belongs to a class C_k is calculated using Bayes' theorem, as given in equation 8:
P(C_k | X) = P(C_k) * prod_{i=1}^{n} P(x_i | C_k) / P(X)    (8)
where P(x_i | C_k) represents the conditional probability that feature x_i belongs to class C_k, and P(C_k) represents the prior probability of class C_k. Because of its strong results, the NB classifier has been widely used for text classification [72]-[74]. In this study, we employ the multinomial NB classifier [75], [76]. Multinomial NB estimates the conditional probability of a feature/word given a class from the frequency of feature f in the documents belonging to class c.
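A minimal multinomial NB with Laplace smoothing, following equation 8 in log space, can be sketched as follows. This is an illustrative from-scratch version on a toy corpus, not the Rapidminer operator used in the paper, and the class and method names are our own.

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}  # word counts per class
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
            self.vocab.update(doc)
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        V = len(self.vocab)
        def log_posterior(c):  # log P(C_k) + sum_i log P(x_i | C_k)
            lp = math.log(self.prior[c])
            for w in doc:
                lp += math.log((self.counts[c][w] + 1) / (self.total[c] + V))
            return lp
        return max(self.classes, key=log_posterior)

docs = [["amazing", "film"], ["great", "film"], ["boring", "plot"], ["awful", "film"]]
labels = ["pos", "pos", "neg", "neg"]
nb = MultinomialNB().fit(docs, labels)
print(nb.predict(["amazing", "film"]))  # -> pos
```

Working in log space avoids the numerical underflow of multiplying many small probabilities; the normalizing P(X) in equation 8 is constant over classes and can be dropped for prediction.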
Generalized Linear Model (GLM): The GLM [77], [78] is an extension of the general linear model. The GLM fits generalized linear models to the data by maximizing the log-likelihood. The GLM is a flexible, robust, and highly interpretable model that provides a useful approach to classification modeling [79].
VOLUME 4, 2016
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
The ensemble learning method: The ensemble learning approach combines the outputs of multiple base classifiers to boost generalization ability/robustness and classification accuracy over a single classifier. It also reduces the error rate in classification tasks that may occur with single classifiers. The use of ensemble methodology has been considered by researchers from various disciplines [80]. To perform sentiment classification, we use a combination of classifiers in this process: SVM, NB, and GLM are included as base classifiers. The user's review text is divided into positive and negative classes based on the majority voting of the base classifiers.
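The majority-voting combination of the three base classifiers reduces to a one-line tally per review; the sketch below uses hypothetical names and stands in for the CE operator configured in Rapidminer:

```python
from collections import Counter

def majority_vote(predictions):
    """Final sentiment = the label predicted by most base classifiers.
    predictions: dict base_classifier_name -> predicted label for one review."""
    return Counter(predictions.values()).most_common(1)[0][0]

print(majority_vote({"SVM": "positive", "NB": "positive", "GLM": "negative"}))
# -> positive
```

With three base classifiers and two classes, a strict majority always exists, so no tie-breaking rule is needed.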

IV. EXPERIMENTAL STUDY A. SENTIMENT ANALYSIS DATASETS
In this paper, we test the efficacy of our proposed method on two commonly used public SA datasets: the Cornell movie review dataset [13] and the Amazon product review datasets [81]. The Cornell movie review dataset contains 2000 reviews, 1000 positive and 1000 negative. The Amazon product review datasets cover four categories: books, DVDs, electronics, and kitchen, with 1000 positive and 1000 negative reviews in each category. Table 6 lists the datasets' specifics.

B. EVALUATION MEASURES
To assess the predictive performance of our proposed hybrid method, we used different evaluation measures: Accuracy (ACC), Precision (PRE), Recall (REC), F-measure, receiver operating characteristic (ROC) curve, area under the curve (AUC), and precision-recall curve (PR curve). The ratio of true positive (TP) and true negative (TN) instances to the total number of instances attained by the classification algorithm is known as classification accuracy.

ACC = (TP + TN) / (TP + TN + FP + FN)
Precision is the ratio of true positives to the total number of true positives and false positives:
PRE = TP / (TP + FP)
Recall is the ratio of true positives to the total number of true positives and false negatives:
REC = TP / (TP + FN)
The F-measure is the harmonic mean of precision and recall:
F-measure = (2 × PRE × REC) / (PRE + REC)
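The four threshold-based measures follow directly from the confusion-matrix counts; a minimal sketch (hypothetical function name, counts chosen for illustration):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    return acc, pre, rec, f1

acc, pre, rec, f1 = classification_metrics(tp=80, tn=70, fp=30, fn=20)
print(round(acc, 2), round(pre, 2), round(rec, 2), round(f1, 2))  # 0.75 0.73 0.8 0.76
```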
The ROC curve is a performance measure for classification problems at distinct threshold settings. The ROC curve plots the True Positive Rate (TPR, y-axis) against the False Positive Rate (FPR, x-axis). A precision-recall curve (PR curve) plots precision (y-axis) against recall (x-axis) for different probability thresholds. These metrics are widespread in machine learning and have been used in a number of studies [82]-[84]. For all of the experiments on the multi-domain datasets, we used 5-fold cross-validation, in line with state-of-the-art SA approaches. In 5-fold cross-validation, each domain dataset is randomly split into five folds; in each iteration, one fold is used for testing and the remaining folds for training. The averaged results over the five folds are used to estimate classification performance. After building the model, we also used an external validation set to assess the model's predictive performance on unseen data. To assess the significance of differences in the evaluation measures of the proposed solution, we used the paired t-test (P < 0.05).
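The 5-fold split can be sketched as an index generator (an illustrative sketch with a hypothetical function name and a fixed seed; the paper performs cross-validation inside Rapidminer Studio):

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Split n instance indices into k folds: each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # random split, reproducible via seed
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# e.g., a 2000-review dataset yields five 1600/400 train/test splits
for train, test in kfold_indices(2000, k=5):
    print(len(train), len(test))  # 1600 400
```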

C. EXPERIMENTAL SETTING
All of the experiments were carried out on a PC with an Intel(R) Core(TM) i7 processor, 16.0 GB of RAM, and the Windows 10 64-bit operating system. We used Rapidminer Studio [85], a complete environment for ML, text mining, deep learning, data processing, and predictive analytics, to implement the proposed algorithm. We utilized SVM, NB, and GLM as base classifiers in conjunction with the Classifier Ensemble (CE). The hyperparameters along with their optimal values are given in Table 7.

D. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we evaluate the results obtained by the different techniques used in the proposed approach on the multi-domain datasets. We also analyze and discuss how the proposed hybrid approach (IHFSSA) intelligently addresses the sentiment analysis problem. All features were represented using POS-tagged n-gram features, namely POS unigrams and bigrams (Adjective, Adverb, Verb, and Noun). All experimental results were obtained through 5-fold cross-validation. In the CE, three separate base classifiers (SVM, NB, and GLM) were used. IG, SD, CHI, GR, and GI are the baseline statistical filters of the EMFFS. BFS is used as the wrapper for the final feature refinement and selection. Different proportions of features are evaluated using average classification accuracies. In addition, the proposed method is compared against the baseline filters and EMFFS.

E. PERFORMANCE ANALYSIS OF STATISTICAL FILTERS WITH CLASSIFICATION ALGORITHMS
In this section, we compare the predictive performance of the various filter methods (IG, SD, CHI, GR, and GI) together with the classification algorithms (SVM, NB, GLM, and CE) used for SA. For each domain dataset, we first performed different experiments by selecting the top-k high-ranked features (300~1000) for each filter. These features are then used to train the classification algorithms. Finally, the trained algorithms' accuracy is assessed on the input feature space (the top-k high-ranked features). Table 11 shows the effects of the statistical filters with the classification algorithms for various proportions of top-k high-ranked features.
The experimental results of these methods gave us the following insights. Among the individual filter feature selection methods, IG-based feature selection achieves the highest predictive performance; GR, GI, CHI, and SD all show acceptable predictive performance. In terms of classification performance on the selected input feature space (top-k high-ranked features), SVM outperformed NB, particularly on the Movie dataset. On the other domain datasets (Book, DVD, Electronics, and Kitchen), however, NB outperforms SVM in predictive performance. Compared to SVM and NB, the performance of GLM is low. The ensemble method is also employed: the CE based on majority voting achieved the best performance for review-text sentiment classification.

F. PERFORMANCE ANALYSIS OF THE ENSEMBLE OF MULTIPLE FILTERS FEATURE SELECTION STRATEGY
The predictive performance of EMFFS for SA is examined in this section. We employed the EMFFS and obtained the subset of top-k high-rank common features in each domain. Based on the MVT, the top-k high-rank common features ranked by the base filters in the EMFFS are integrated, and the classification algorithms are trained on these integrated common features. The accuracy of the trained algorithms is then assessed. Table 8 shows the results of the EMFFS method. The feature spaces reduced by EMFFS for the Movie, Book, DVD, Electronics, and Kitchen datasets contain 945, 950, 968, 916, and 920 features, respectively. On the Movie review dataset, the SVM classifier achieved the highest predictive performance, with NB second. On the other domain datasets (Book, DVD, Electronics, and Kitchen), NB outperformed SVM in predictive performance. Compared to SVM and NB, GLM has a low performance score. The CE outperformed all of the base classifiers (SVM, NB, and GLM). The experimental results show that using ensemble learning methods (an ensemble of filters and an ensemble of classifiers) to extract and pick appropriate and suitable features has a statistically significant effect on predictive performance.

G. PERFORMANCE ANALYSIS OF HYBRID FEATURE SELECTION APPROACH
We examine the overall performance of the proposed IHFSSA approach for textual SA in this section. To obtain the most effective refined sentiment features for SA, we fused the filter-based EMFFS with the wrapper-based BFS. The EMFFS is used to obtain the common high-rank feature set, which is then fed into the BFS for refinement in the second stage of the proposed hybrid approach. All BFS parameters are set to their default values in Rapidminer Studio. The results of the proposed IHFSSA approach for textual SA are shown in Table 9. For the Movie, Book, DVD, Electronics, and Kitchen datasets, the final effective refined feature sets generated by the proposed approach contain 896, 862, 910, 775, and 824 features, respectively. SVM achieved high classification performance on the refined feature space across the multi-domain datasets, especially on the Movie dataset, with NB second. On the other domain datasets (Book, DVD, Electronics, and Kitchen), however, NB achieved the highest predictive performance, with SVM second. The performance of GLM is low compared to SVM and NB. The CE based on majority voting is also utilized and obtained the best results. Table 9 shows that the feature space has decreased substantially while classification performance has improved significantly.

H. ANALYSIS OF DIMENSIONALITY REDUCTION
Regarding dimensionality reduction, the proposed hybrid feature selection approach greatly reduced the feature space while obtaining high accuracy. The total feature space of the Movie dataset after noise removal is 35480. The proposed IHFSSA approach selected the best feature set and reduced the feature space to 896.
After noise removal, the total feature space of the Book, DVD, Electronics, and Kitchen datasets is 21724, 21255, 10499, and 9362 features, respectively. Similarly, for the Book, DVD, Electronics, and Kitchen domains, IHFSSA selected the best feature sets, reducing the feature space to 862, 910, 775, and 824 features, respectively. In terms of high-dimensional feature space reduction, IHFSSA significantly reduced the feature space and achieved the highest classification accuracy compared to state-of-the-art methods [13], [19]. In [19], two hybrid feature selection methods, namely FIFS and OIFV, were proposed for SA. The FIFS method reduced the average feature space to 10179 and 2325 on the Movie and Kitchen datasets, respectively. The OIFV method reduced the average feature space to 4201 and 3427 for the Book and Electronics datasets, respectively. In [13], the reduced feature space on the Movie domain dataset is 2633.

I. PERFORMANCE COMPARISON WITH STATE-OF-THE-ART APPROACHES
Over the five benchmark datasets, we compare the classification performance of our proposed solution IHFSSA, in terms of best accuracy, with state-of-the-art feature selection sentiment classification approaches. Table 10 and Figure 5 provide a quantitative performance analysis across the benchmark datasets; both present the accuracy rate in percentage for each method. For sentiment classification, Xia et al. [12] presented two kinds of feature sets: POS-based feature sets and word-relation-based feature sets. They used three different ensemble methods, namely fixed combination, weighted combination, and meta-classifier combination; the word-relation-based ensemble method was found best in terms of classification accuracy. Alireza et al. [19] proposed two hybrids of filter and wrapper methods for SA: frequency-based integration of different feature subsets (FIFS) and ordinal-based integration of different feature vectors (OIFV). FIFS obtained high accuracy on the Movie and Kitchen datasets, while OIFV achieved high accuracy on the Book and Electronics datasets. Osman et al. [4] proposed a wrapper feature selection algorithm based on the Iterated Greedy metaheuristic for sentiment classification. They reported high accuracy for the Electronics and Kitchen datasets and obtained comparable results for the Book and DVD datasets. On the five separate datasets, we tested our proposed solution IHFSSA against these state-of-the-art approaches. As seen in Table 10 and Figure 5, the proposed IHFSSA approach outperformed the state-of-the-art feature selection sentiment classification methods. There are several explanations for why the proposed solution IHFSSA achieves the best or comparable results. The first explanation is that during text pre-processing, noisy and meaningless features are removed from the textual data. The extraction and selection of effective sentiment features is the second explanation.
The third explanation is the use of linguistic rules and semantic information to classify mixed-opinionated texts. The inclusion of WCSL for sentiment feature identification is the fourth reason. The fifth reason is the integration of EMFFS with BFS. The sixth explanation is the prediction of the final sentiment orientation based on the CE.
We conducted a statistical test based on the accuracy results to compare the predictive performance of the two proposed feature selection methods, EMFFS and IHFSSA. We employed the paired t-test (P<0.05) to evaluate the performance difference between IHFSSA and EMFFS. Table 11 shows that the difference between IHFSSA and EMFFS is statistically significant: IHFSSA significantly outperformed EMFFS in terms of classification accuracy on the five domain datasets.

J. VALIDATION RESULTS
We used an external validation dataset (400 review documents for each domain, i.e., Movie, Book, DVD, Electronics, and Kitchen) to evaluate the predictive performance of our trained intelligent hybrid model on unseen data and to confirm its efficiency. The accuracy on the training and validation datasets is compared in Figure 6, which indicates that our model is reliable.

K. ROC CURVE AND PRECISION-RECALL CURVE ANALYSIS
The area under the ROC curve (AUC) and the PR curve are well-known metrics in machine learning for evaluating a classifier's performance. We present these curves for our intelligent hybrid model on each domain dataset.

V. CONCLUSION
A gigantic amount of unstructured user-generated data in the form of textual comments and reviews exists online. This unstructured data contains valuable information for business analysis and product purchase decisions. SA is the fastest growing research area of text mining and web mining, which aims to classify online unstructured user-generated data into positive and negative classes. Due to the huge amount and unstructured nature of textual data, state-of-the-art techniques for SA suffer from noisy and irrelevant features. Likewise, these techniques cannot efficiently identify the hidden significant information in unstructured data. Moreover, the extraction and selection of the most effective sentiment features are not handled successfully. In this respect, the extraction and selection of suitable and relevant sentiment features are significant for SA, and to deal with the high-dimensional feature space, an intelligent feature selection technique is essential for accurate sentiment classification. In this study, an intelligent hybrid feature selection approach, which we call IHFSSA, is proposed for textual SA. The proposed approach enables us to identify, refine, and select the most effective sentiment features with the goals of decreasing the high-dimensional feature space while taking into account sentiment information, semantic aspects, and word order for textual sentiment classification. The results of the experimental study show that the proposed hybrid approach outperformed state-of-the-art baseline approaches. In future work, we will incorporate deep learning and pre-trained NLP models such as BERT, GPT-3, ELMo, and XLNet for testing IHFSSA.