Emotion Processing by Applying a Fuzzy-Based Vader Lexicon and a Parallel Deep Belief Network Over Massive Data

Emotion processing has been an intensely studied area of data analysis and NLP over the past few years. Deep neural network algorithms have recently been applied to opinion mining tasks with good results, and among the various neural models applied to opinion mining, the deep belief network (DBN) has gained particular attention. In this proposal, we have developed a combined classifier based on a fuzzy Vader lexicon and a parallel deep belief network for emotion analysis. We have implemented multiple pretreatment techniques to improve the quality and soundness of the data and to eliminate noisy data. Afterward, we have performed a semi-automatic dataset labeling using a combination of two methods: Mamdani's fuzzy system and the Vader lexicon. We have also applied four feature extractors, GloVe, TFIDF (Trigram), TFIDF (Bigram), and TFIDF (Unigram), to transform each incoming tweet into a numerical vector. In addition, we have integrated three feature selectors, namely the ANOVA method, the chi-square approach, and the mutual information technique, to select the most relevant features. Further, we have implemented the DBN as the classifier that assigns each input tweet to one of three categories: neutral, positive, or negative. Finally, we have deployed our proposed approach in a parallel way using both the Hadoop and Spark frameworks in order to overcome the long runtime on massive data. Furthermore, we have carried out a comparison between our newly suggested hybrid approach and alternative hybrid models available in the literature.
The experimental findings show that our suggested fuzzy parallel approach is more powerful than the baseline models in terms of false negative rate (1.33%), recall (99.75%), runtime (32.95s), convergence, stability, F1 score (99.53%), accuracy (98.96%), error rate (1.04%), kappa statistic (99.1%), complexity, false positive rate (0.25%), precision rate (97.59%), and specificity rate (98.67%). In conclusion, our fuzzy parallel approach outperforms the baseline and deep learning models, as well as certain other approaches chosen from the literature.


I. INTRODUCTION
Opinion mining is a data extraction and automatic language processing technique. For a piece of text, it identifies the sentiment score as positive, neutral, or negative, and it offers multiple methods, resources, and performance criteria for carrying out this task [1]. The sentiment score can be determined on the basis of various threshold values and can be mapped to different categories. With the growth of consumer-posted texts on micro-blogging and social networking sites such as Facebook, Instagram, YouTube, TikTok, TripAdvisor, Twitter, WhatsApp, and Amazon, the analysis of sentiments in social media and on websites has become increasingly popular in many scientific and industrial research communities [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
Opinion mining is an important and active research field in NLP. Indeed, the past several years have witnessed an increase in the range of text-based sentiment data resources widely available on the World Wide Web: users' comments, which are increasingly centralized by forums, customer surveys carried out by the leading brands, search engines, and social networks [3]. With such a wealth of data and resources available, automating the aggregation of various sentiments is becoming essential to efficiently obtain a comprehensive view of sentiments on a particular topic. The value of this data is enormous, both for organizations that want customer feedback on their brand image or their products and services, and for individuals who want to find out about an outing, a trip, or a purchase.
Micro-blogging platforms have recently attracted the interest of researchers and users because of the ease and speed with which they allow data to be exchanged [4]. These platforms can be regarded as a giant storehouse of data holding many millions of written posts and messages, usually arranged in complicated networks in which users interact with each other at particular moments. Because of its widespread popularity, Twitter is known as the foremost micro-blogging network in the world. It provides APIs for freely collecting data that can then be used for analysis or for developing new applications, which is why we have selected it for our various experiments [5].
Many research papers have been published on opinion mining on Twitter in various fields: natural disasters, politics, marketing, etc. Twitter is nowadays one of the greatest opportunities for a company to increase its visibility and accessibility among its prospective customers [6]. Marketers are noting the numerous new opportunities that Twitter provides and are beginning to introduce novel digital social innovations at an incredible rate. As a consequence, worldwide brands have recognised Twitter as a fully integrated advertising and marketing platform and are using it in innovative ways to feed their promotional campaigns [7]. Twitter is also being used as a political campaign platform, serving as an integrated medium at the core of political communication strategies [8].
From the perspective of sentiment analysis tools, this freedom of expression is a critical challenge, as the goal is to extract the concerns of respondents from unstructured data. This explains the significant work carried out on this topic in the NLP area, adapted to solving this kind of data extraction problem [9]. In essence, the high-noise content of the data, typified by misspellings, content formatting, and syntactical mistakes, whether unintentional or intentional, poses a challenge at several points in the analysis, from data preprocessing (lemmatization, categorization, and grammatical, word, and sentence segmentation) to word/sentiment retrieval [10].
Deep learning models have the potential to capture meaningful knowledge that goes unobserved in the social network content generated every day [11]. There exist many different deep learning models that aid in the learning process, such as Restricted Boltzmann Machines (RBMs), Long Short Term Memory Networks (LSTMs), Recurrent Neural Networks (RNNs), Deep Belief Networks (DBNs), Generative Adversarial Networks (GANs), Autoencoders, Multilayer Perceptrons (MLPs), Self Organizing Maps (SOMs), Convolutional Neural Networks (CNNs), Radial Basis Function Networks (RBFNs), and FeedForward Neural Networks (FFNNs). These deep learning methods work on almost all types of data and require higher levels of computation and learning capacity to resolve challenging problems [12].
In this proposal, we have designed a combined classifier based on a fuzzy Vader lexicon and a parallel deep belief network for emotion analysis, which integrates the strengths of deep belief networks, the Vader lexicon, and Mamdani's fuzzy system to classify sentiments with high efficiency. Moreover, this proposal is implemented in a parallel way using both the Hadoop and Spark frameworks in order to overcome the long runtime on massive data. The principal contributions of our proposal can be summarised as follows:
1) Our combined classifier based on a fuzzy Vader lexicon and a parallel deep belief network classifies the collected tweets into 3 classes: neutral, negative, or positive.
2) Multiple pretreatment techniques, such as the negation procedure, stop-word removal, the lemmatization process, and the warping process, are implemented to improve the quality and soundness of the data and to eliminate noisy data.
3) A semi-automatic dataset labeling using a combination of two methods, Mamdani's fuzzy system and the Vader lexicon, is applied to the Sentiment140 dataset.
4) Four feature extractors, GloVe, TFIDF (Trigram), TFIDF (Bigram), and TFIDF (Unigram), are applied to transform each incoming tweet into a numerical vector.
5) Three feature selectors, namely the ANOVA method, the chi-square approach, and the mutual information technique, are integrated to select the most relevant features.
6) The DBN is implemented as the classifier that assigns each input tweet to one of three labels: negative, neutral, or positive.
7) Our proposed approach is implemented in a parallel way employing the Hadoop framework in order to overcome the long runtime on massive data.
8) A comparison between our newly suggested hybrid approach and alternative hybrid models available in the literature is carried out.
9) Our suggested fuzzy parallel approach is more powerful than the baseline models in terms of false negative rate, recall, runtime, convergence, stability, F1 score, accuracy, error rate, kappa statistic, complexity, false positive rate, precision rate, and specificity rate.
VOLUME 10, 2022
The remainder of this work is structured in this way: The 2nd section discusses the driving forces behind the creation of this contribution, the 3rd section introduces the earlier published research studies, the 4th section outlines all steps in this contribution, the 5th section describes the results achieved in the different performed experiments, and finally, the 6th section makes the synthesis of the proposed hybrid approach and formulates a few guidelines for further work.

II. MOTIVATION
Nowadays, online eCommerce platforms allow their clients to publish reviews or comments on the items they have purchased. The insights offered by consumer feedback help other potential consumers decide whether or not to buy a product based on the opinions and experiences of those who have already used it. Companies can also gather consumer feedback through online comments to enhance the quality of their products. Nevertheless, as the number of consumers purchasing items grows, the number of comments also rises over time, and it becomes impossible for manufacturers or users to review all the comments of previous consumers on a certain item. Furthermore, some customer feedback is very lengthy, which makes it challenging for users to identify both positive and negative product characteristics when considering whether a product is worth purchasing, or for producers to know whether the item requires enhancement. A sentiment rating process, which analyzes whether a consumer gives a negative or positive rating of a particular product, is therefore very important to potential consumers and manufacturers, because it enables them to conveniently collect valuable information about products from a diversity of feedback and assists them in decision-making based on others' opinions. Motivated by the significant impact of sentiment analysis on our daily routine, in this proposal we have developed a combined classifier based on a fuzzy Vader lexicon and a parallel deep belief network that is employed to carry out the classification of sentiments.
This contribution incorporates NLP techniques for data preprocessing, the Mamdani fuzzy system combined with the Vader lexicon for semi-automatic dataset labeling, a feature extractor for transforming each tweet into a numerical vector, a feature selector for choosing the most appropriate features and reducing the high-dimensional vector space of the extracted features, and a DBN for classifying each tweet into 3 categories (positive, neutral, or negative). Finally, both the Hadoop and Spark frameworks are used to overcome the long runtime on massive data. The initial stage of our suggested methodology is the application of data preprocessing techniques. The data preprocessing step has a significant influence on the data classification process, as discussed in [13], whose authors carried out a comparative analysis to assess the impact of data pretreatment techniques on tweet classification by measuring the accuracy. Their experimental findings indicated that applying data preprocessing techniques to the linguistic data significantly enhanced the classification efficiency. Many other papers in the literature [14] have also shown that data preprocessing techniques positively influence the data classification procedure in terms of precision, recall, and accuracy [15]. Motivated by the strong performance reported for data preprocessing techniques, we have incorporated them into this proposal.
After preprocessing, the following stage is a semi-automatic dataset labeling using a combination of two methods: Mamdani's fuzzy system and the Vader lexicon. The subsequent stage is data mapping, in which we implemented 4 feature extractors, GloVe, TFIDF (Trigram), TFIDF (Bigram), and TFIDF (Unigram), in order to map every tweet into a numerical vector. Then, we have incorporated 3 feature selectors: the ANOVA approach, the chi-square technique, and the mutual information method. In addition, we have implemented the DBN for classifying each tweet into 3 categories (negative, neutral, or positive). In the final stage, we have deployed our combined proposal in a parallel way by employing both the Hadoop and Spark frameworks.
In summary, the goal of this research is to raise the classification efficiency of sentiment analysis by integrating the strengths of the following: data pretreatment approaches, which improve the quality of tweets by removing noisy and unsuitable data; feature extraction techniques, which transform every tweet into a numerical vector and capture its most pertinent characteristics; feature selection techniques, which reduce the high-dimensional feature space obtained in the previous stage and choose the most relevant features; the DBN, which classifies every tweet and improves classification performance; and the Hadoop and Spark frameworks, which overcome the long runtime on massive datasets.

III. PREVIOUS RESEARCH
The following are examples of published works that have applied various deep learning models to tackle opinion retrieval problems in a range of languages.
Es-sabery et al. [16] suggested a new classifier based on CNNs, FFNNs and the Mamdani Fuzzy System (MFS). Firstly, they used the CNN as an efficient automatic procedure to retrieve and choose the most appropriate features. Then, they applied the FFNN to calculate the negative and positive emotional values. Finally, they employed the MFS as a classifier to categorize the outputs of the employed models (FFNNs+CNNs) into 3 categories, namely negative, neutral, and positive. The empirical findings showed that their suggested fuzzy parallel approach is more efficient than the baseline models in terms of false negative rate, recall, runtime, convergence, stability, F1 score, accuracy, error rate, kappa statistic, complexity, false positive rate, precision rate, and specificity rate.
The authors of the paper [17] proposed a hybrid deep learning model for opinion mining of Malayalam tweets. Their suggested approach combines Bi-LSTMs, CNNs, Gated Recurrent Units (GRUs), and LSTMs. They applied the CNNs for extracting and choosing the most relevant features. Then, they employed LSTMs to eliminate long-term dependency while retaining some valuable data. In addition, they used a Bi-LSTM to run an exact copy of the LSTM in the opposite direction. Finally, they applied the GRU to decrease the architectural complexity of the LSTM. Their experimental results show that CNN+GRU achieved a maximum classification rate of 87.23% and CNN+BiLSTM achieved a classification rate of 74%.
In the paper [18], the authors designed a new hybrid model of deep learning structures, namely CNNs and LSTMs, for classifying the opinions of reviews published in various fields. The authors chose deep CNNs because of their high efficiency in selecting local features, and they applied the LSTM because of its efficiency in sequentially processing a long text. Their suggested Co-LSTM approach has two primary purposes in the analysis of sentiment. Firstly, it is suited to addressing large social data while taking scalability into account, and secondly, in contrast to traditional machine learning algorithms, it is independent of any specific field. The experimental findings demonstrate that the suggested model outperforms other machine learning algorithms in terms of accuracy and other metrics.
Bodapati et al. [19] applied two models for conducting sentiment analysis. One model is built on an RNN and LSTM architecture and the other on CNNs. In the first model, the RNNs + LSTMs were employed to detect syntactic and semantic relationships among the words in a review using the word2vec word embedding method. In the second model, one-dimensional CNNs were used to learn the structure in a set of terms and the feature-specific position. They applied both models to the IMDB movie review dataset and compared the experimental findings. Both models performed extremely well.
The authors of the paper [20] developed a new hybrid model incorporating RNNs+LSTMs and logistic regression to perform sentiment classification. The purpose of their research was to familiarize themselves with the concept of sentiment analysis and the manner in which social media acts as an essential part of it. Additionally, they used YouTube and Twitter web scraping to select a standard data set to do more analysis. Experimental results have shown that the RNNs+LSTMs model is more accurate than logistic regression with an accuracy equal to 83.25%.
In the paper [21], the authors proposed a new hybrid deep learning approach that combines two bidirectional models, LSTM and GRU, to address the high dimensionality of the feature space. They used two distinct layers, a GRU layer and a bidirectional LSTM layer, to extract future and past features by attaching two opposite hidden layers at the same level of context. They also applied a group-wise enhancement technique to the set of features retrieved by the bi-LSTM layer, which splits the features into several categories, strengthening the relevant features of each category while weakening the less valuable ones. The introduced model uses both pooling and convolution layers to retrieve a set of features and to reduce the high-dimensional feature space. Experimental findings reveal that the introduced convolutional bidirectional RNN architecture with the group-wise enhancement technique performs better than the state-of-the-art results for opinion mining.
Subhashini et al. [22] introduced a new decision-making model in which negative, positive, and boundary areas are categorized through the use of fuzzy logic concepts to overcome the limitations of ML approaches in managing uncertainties in people's opinions. They then applied the CNN to additional classification of vague concepts initially attributed to the boundary area. Their proposed framework utilizes some formal concepts for representing the uncertainties, and the CNNs categorize the concepts of boundary areas into either negative or positive opinions. Experimental results show that the proposed decision-making with a 3-way scheme deals effectively with opinion uncertainties.
The authors of the paper [23] implemented and evaluated several deep learning models such as RNNs, LSTMs, gated recurrent units (GRUs), Group LSTMs, and update recurrent units (URUs). Then, they combined all the evaluated deep learning models with several feature extractors, namely Skip-grams, FastText, word2vec, and GloVe. The 5 deep learning models with the 3 feature extractors are assessed on the basis of F1 score, precision, recall, and accuracy for the unbalanced and balanced datasets. Their experimental results show that, for the balanced dataset, the LSTM model achieved a maximum accuracy of 88.39%, and for the unbalanced dataset, the GRU model combined with the FastText word embedding approach obtained the best accuracy of 93.75%.
In the paper [24], the authors proposed a new hybrid CNN-LSTM. In the first step of their approach, they applied Word2vec word embedding to transform each word-based text into a digital vector. Once the word embedding process is done, in the next step they applied the convolution layer and the max-pooling layer to extract and select the most relevant features with long-term relationships. The model they offer also utilizes drop-out technique, standardization and a rectified linear unit to improve accuracy. Experimental findings show that their newly developed hybrid CNN-LSTM model surpasses both conventional machine and deep learning approaches in terms of precision, accuracy, recall, and F1-score.

IV. MATERIALS AND METHODS
In the next subsection of this document, we discuss the reasons why we propose and develop this hybrid parallel model. The underlying architecture of the suggested hybrid parallel model is composed of six steps. The first step is to collect data, using the massive Sentiment140 dataset, in order to evaluate our newly suggested hybrid model. The 2nd step, referred to as data preprocessing, is designed to eliminate noisy and unwanted data. The 3rd step is semi-automatic dataset labeling using a combination of two methods: Mamdani's fuzzy system and the Vader lexicon. The 4th step is feature extraction, which converts the text data into numerical vectors. The 5th step is feature selection, which reduces the high dimensionality of the features retrieved in the extraction step. In the 6th step, we set up the DBN as a classifier so that every tweet is assigned to one of 3 categories (neutral, negative, or positive). Finally, we have deployed our proposed approach in a parallel way using both the Hadoop and Spark frameworks in order to overcome the long runtime on massive data.
As illustrated in Figure 1, the general structure of our combined classifier based on the fuzzy Vader lexicon and the parallel DBN is made up of the following stages: data collection, data pretreatment, data representation, feature extraction, feature selection, data classification, and data parallelization employing Hadoop.

A. DATA COLLECTION STAGE
Opinion mining systems need a corpus of user comments to train a classifier or to assess it. The most commonly used corpora have been gathered from social networking websites, since the available content is free, easy to access, and instantaneous, and people can share and discuss their thoughts in public. In this proposal, we have employed the massive Sentiment140 corpus. It includes 1,600,000 collected tweets, with all emoticons removed. Each tweet was tagged with one of 2 labels, positive or negative, wherein 4 indicates the positive feedback label and 0 denotes the negative feedback label. The Sentiment140 corpus comprises six features, outlined below:
• User: indicates the username of the account that tweeted the message.
• Date: gives the date on which the tweet was published.
• Text: contains the full text of the tweet.
• Flag: indicates the query (request) content. The string "NO_QUERY" is the value of the flag when there is no query.
• Ids: a number (for example, 542369871) that uniquely identifies every tweet.
• Target: gives the category tag of the tweet, wherein 4 denotes the positive feedback tag and 0 denotes the negative feedback tag.
In this contribution, we focus on examining opinions, that is, on the viewpoint expressed by each Twitter user in each posted tweet. As a result, the other attributes of this corpus have no bearing on the training objective. We have preserved the attributes "Text" and "Target" from the dataset while discarding the attributes "User," "Date," "Ids," and "Flag." The dataset chosen for this contribution is divided into two subsets, the training and test subsets. We used both subsets to demonstrate, in comparison with existing methods in the literature, the effectiveness of our hybrid classifier based on a fuzzy Vader lexicon and a parallel deep belief network. Figure 2 presents the proportion of tweets in the training and test subsets that are neutral, positive, and negative. The training subset contains 1,120,000 tweets and the test subset contains 480,000 tweets. As a result, the test subset accounts for 30% of the tweets in the original corpus.
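The attribute filtering and the 70/30 partition described above can be sketched as follows. This is a minimal illustration on a few made-up rows, assuming a six-field record in the order target, ids, date, flag, user, text. The helper names `keep_text_and_target` and `split_corpus`, the fixed seed, and the random shuffling are assumptions, since the text does not state how tweets were assigned to the two subsets.

```python
import random

# Hypothetical rows in an assumed field order: target, ids, date, flag, user, text.
ROWS = [
    ("4" if i % 2 else "0", str(1000 + i), "Mon Apr 06 2009",
     "NO_QUERY", f"user_{i}", f"example tweet {i}")
    for i in range(10)
]

def keep_text_and_target(rows):
    """Keep only the Text and Target attributes; drop User, Date, Ids and Flag."""
    return [(text, int(target)) for target, _ids, _date, _flag, _user, text in rows]

def split_corpus(samples, test_ratio=0.30, seed=0):
    """Shuffle a copy of the samples and return (training subset, test subset)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

samples = keep_text_and_target(ROWS)
train, test = split_corpus(samples)
print(len(train), len(test))
```

Applied to 1,600,000 tweets, the same ratio gives the 1,120,000 / 480,000 partition reported above.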

B. TWEETS PRETREATMENT STEP
Numerous applications that employ raw and unstructured data make use of the preprocessing task, which has been the subject of extensive research. The necessity for pretreatment is considerably increased when research focuses more on published posts in social networks since many posts are misspelled and improperly written. Data pre-processing methods are thus required to obtain a cleaner corpus and the next classification operation performs more effectively when the dataset has been cleaned.
In almost all of the sentiment analysis studies cited above [9], [10], [11], [12], [13], [14], [15], pretreatment methods are applied. The pretreatment methods used are easy to understand, such as the correction of simple errors, punctuation handling, and the filtering of letters and words. Lexicons are used to repair common faults like misspelled words and repeated letters, and dictionaries are used to fix mistakes. Similarly, abbreviations and acronyms are expanded into words using a comprehensive dictionary.
Another strategy that has been used is removing uninformative content. For example, stop words and punctuation are removed from tweets to reduce the diversity of words used, because they do not significantly affect the sentiment score [25]. The overuse of letters is another target of filtering: a good illustration of vowel repetition is "Weeeeeeeell", and an illustration of repeated punctuation is "well !!!!!!!!!!!!!!". These filtering techniques work by detecting more than two consecutive occurrences of the same character [26].
The most widely used method for filtering and replacing strings is the regular expression [27]. It is a useful method for locating sub-strings in a string and allows errors and abuses to be identified and removed. Regular expressions are used in the "search and replace" functionality of many programs because they are regarded as a particularly powerful matching technique for strings [28].
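The filtering of repeated letters and repeated punctuation described above can be sketched with Python's `re` module. The function name `squeeze_repeats` and the choice to shorten each run to two characters by default are illustrative assumptions, not the exact rules used in this work.

```python
import re

# One character followed by at least two more copies of itself (a run of 3+).
_REPEATS = re.compile(r"(.)\1{2,}")

def squeeze_repeats(text: str, keep: int = 2) -> str:
    """Shorten every run of three or more identical characters to `keep` characters."""
    return _REPEATS.sub(lambda match: match.group(1) * keep, text)

print(squeeze_repeats("Weeeeeeeell"))
print(squeeze_repeats("well !!!!!!!!!!!!!!"))
```

Keeping two characters rather than one leaves legitimate double letters such as "good" untouched; a dictionary lookup would still be needed to map a shortened form back to a correctly spelled word.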
Lemmatization is a useful additional technique for reducing the broad variety of word forms. It strips verb forms down to their root [29]. The reduction of "liked" to "like" is an example of lemmatization. This reduces the variety of conjugated forms that a verb can take, which in turn reduces the amount of data [30].
Most of the methods described are language-specific. The language of the text is crucial for pretreatment and categorization [31], and it is essential to have the cleanest feasible corpus. When we look at different languages, we find that each one has a unique vocabulary, syntax, and grammatical form. Therefore, it is vital to identify the language in order to select an appropriate grammatical context. This is the main area of study of language identification, which has received extensive research attention [32].
When a message's language has been determined, the model that will be trained can proceed as though all tweeted posts originate from that language. Therefore, it is safe to conclude that the tweeted posts contain just words from the recognized language [33].

1) DATA EXPLORATION
The word cloud is a traditional form of data representation in automatic word recognition [34]. The most frequently employed terms can be easily displayed in this format. The size of each term in the image is proportional to its frequency in the dataset, whereas the colors serve just as decoration (as depicted in Figure 3). In the two graphical representations, we have charted the data both with and without pretreatment. In the raw data, the structure of the tweet headers is primarily to blame for the observed noise. Once a significant amount of noise has been removed by pretreatment, the data becomes meaningful and can be studied.
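The term frequencies that determine the word sizes in such a cloud can be computed in a few lines. This sketch uses only the standard library and invented example tweets; the tokenizer (lowercasing and keeping runs of letters and apostrophes) is an assumption, and rendering the image itself would need a plotting library that is not shown here.

```python
import re
from collections import Counter

def term_frequencies(tweets):
    """Count how often each lowercase term occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"[a-z']+", tweet.lower()))
    return counts

# Invented example tweets, for illustration only.
tweets = ["Good morning twitter", "good day, good coffee", "Bad coffee today"]
freqs = term_frequencies(tweets)
print(freqs.most_common(2))
```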
The semi-automatic tagging phase is the step in this approach that comes after the preprocessing stage for tweets. That indicates that the data obtained following the completion of all of the aforementioned activities for preprocessing of tweets will be fed to the mixed method fuzzy Vader lexicon.

C. SEMI-AUTOMATIC LABELING STEP
An important phase in ML (Machine Learning) is data labeling [35]. It is essential to label the data before using it to train an AI model. After carefully examining and analyzing the dataset, we discovered that certain tweets were incorrectly tagged. As a result, we chose to combine the Mamdani fuzzy system and Vader lexicon for performing the semi-automatic data tagging.

1) VADER LEXICON
Vader, which stands for Valence Aware Dictionary and Sentiment Reasoner, is a rule-based lexicon and sentiment analysis tool created to evaluate sentiments expressed on social networks. It is open source and released under the MIT license [36]. It makes use of a range of technologies and tools. A sentiment lexicon is a collection of lexical features (such as terms) that are classified as neutral, positive, or negative based on their sentiment score [37]. It gives the intensity of a negative or positive mood in addition to the rates of positivity and negativity [38]. VADER retains the benefits of traditional lexicons such as LIWC: it is larger, yet simple to understand, quick to use, and simple to extend. VADER's sentiment lexicon is of gold-standard quality and has been validated by professionals. VADER differs from LIWC in its awareness of sentiment expressions in the context of social media and in its better generalization to other fields. In this proposal, we have used the VADER lexicon to compute the negativity and positivity rates of each tweet before feeding them into the Mamdani fuzzy system. For example, we applied VADER to the tweet described in Table 1 and obtained a positivity rate of 0.575 and a negativity rate of 0.425.

2) FUZZY LOGIC SYSTEM OF MAMDANI
Once both sentiment scores, NS (negativity score) = 0.425 and PS (positivity score) = 0.575, have been computed, the following stage is the application of Mamdani's fuzzy logic system, which is composed of three stages as illustrated in Figure 4. A first step, prior to the fuzzification operation, is to set the input and output linguistic variables and the linguistic words for every linguistic variable [39]. In our contribution, we have implemented the Mamdani fuzzy system as a classifier on the two variables PS and NS obtained in the preceding Vader lexicon stage. The input linguistic variables are NS and PS, and each is assigned three linguistic words: Lower (between 0.0 and 0.35), Medium (between 0.35 and 0.65), and Higher (between 0.65 and 1). The output linguistic variable is the class label, which takes three linguistic words: positive (between 0.65 and 1.0), negative (between 0.0 and 0.35), and neutral (between 0.35 and 0.65). Table 3 depicts the input and output linguistic variables.
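As a rough illustration of the linguistic words defined above, the sketch below maps a score to Lower, Medium, or Higher using the stated interval bounds. It uses crisp cut-offs only to show the partition of the [0, 1] range; the fuzzy system itself assigns graded membership degrees during fuzzification. The function name `linguistic_word` is hypothetical, and the way the shared boundary values 0.35 and 0.65 are assigned is an assumption, since the text does not specify it.

```python
def linguistic_word(score: float) -> str:
    """Return the linguistic word whose interval contains the score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < 0.35:
        return "Lower"
    if score <= 0.65:  # assumption: both boundary values count as Medium
        return "Medium"
    return "Higher"

PS, NS = 0.575, 0.425  # scores of the example tweet
print(linguistic_word(PS), linguistic_word(NS))
```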
Fuzzification Phase: After the linguistic variables and linguistic terms have been set, the next stage applies the fuzzification operation [40] to the crisp values of PS and NS, using one of the membership functions (MFs) given by equations (1), (2), (3), (4) and (5) to compute the membership degree µ of both sentiment scores PS and NS in the Lower, Medium, and Higher fuzzy sets.
A triangular membership function [41] is defined by a low value lv, a modal value mv and a high value hv, with lv < mv < hv. It is defined as follows:

$$\mu(x) = \begin{cases} 0, & x \le lv \\ \dfrac{x - lv}{mv - lv}, & lv < x \le mv \\ \dfrac{hv - x}{hv - mv}, & mv < x < hv \\ 0, & x \ge hv \end{cases} \tag{1}$$

A trapezoidal membership function [41] is defined by a low value lv, a high value hv and two values vp and vz that bound its kernel. Its formula is:

$$\mu(x) = \begin{cases} 0, & x \le lv \\ \dfrac{x - lv}{vp - lv}, & lv < x < vp \\ 1, & vp \le x \le vz \\ \dfrac{hv - x}{hv - vz}, & vz < x < hv \\ 0, & x \ge hv \end{cases} \tag{2}$$

A monotonically increasing membership function [42] is defined by two parameters d and p (equation (3)), and a monotonically decreasing membership function [42] is likewise defined by two parameters d and p (equation (4)).

A Gaussian membership function [41] is defined by its modal value m and by a value k > 0. It reaches 1 only at the modal value m. Its formula is:

$$\mu(x) = \exp\left(-\frac{(x - m)^2}{2k^2}\right) \tag{5}$$

For example, we apply the triangular MF (1) to measure the membership degrees of the variables PS and NS in the Lower, Medium, and Higher fuzzy sets. For the linguistic term Lower, the optimum parameters are lv = 0, mv = 0.175 and hv = 0.35. With these parameters, the membership degrees of PS and NS in the Lower fuzzy set are µ_Lower(PS) = µ_Lower(0.575) = 0 and µ_Lower(NS) = µ_Lower(0.425) = 0, since both scores lie above hv = 0.35.

The parameter values of each membership function were found empirically; we retained the values that yield the best classification results.
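The triangular fuzzification of the two example scores can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard triangular form of equation (1) and the (lv, mv, hv) triples reported in this subsection for the Lower, Medium and Higher sets; the function and variable names are illustrative and do not come from the original implementation.

```python
def triangular_mf(x, lv, mv, hv):
    """Triangular membership degree of x for low, modal and high values lv < mv < hv."""
    if x <= lv or x >= hv:
        return 0.0
    if x <= mv:
        return (x - lv) / (mv - lv)
    return (hv - x) / (hv - mv)


# (lv, mv, hv) of each fuzzy set, as reported in the text.
FUZZY_SETS = {
    "Lower": (0.0, 0.175, 0.35),
    "Medium": (0.35, 0.5, 0.65),
    "Higher": (0.65, 0.825, 1.0),
}


def fuzzify(score):
    """Return the membership degree of a crisp score in every fuzzy set."""
    return {name: triangular_mf(score, *params) for name, params in FUZZY_SETS.items()}


ps, ns = 0.575, 0.425  # VADER scores of the example tweet
print("PS:", fuzzify(ps))
print("NS:", fuzzify(ns))
```

Under these assumptions both scores belong only to the Medium set, each with a degree of about 0.5. Note that with this strict triangular form a score lying exactly on a set boundary (for example 0.35 or 1.0) receives a zero degree.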
For the linguistic term Medium, the optimum parameters are lv = 0.35, mv = 0.5 and hv = 0.65. With these parameters, the membership degrees of PS and NS in the Medium fuzzy set are µ_Medium(PS) = (0.65 − 0.575)/(0.65 − 0.5) = 0.5 and µ_Medium(NS) = (0.425 − 0.35)/(0.5 − 0.35) = 0.5.

For the linguistic term Higher, the optimum parameters are lv = 0.65, mv = 0.825 and hv = 1. With these parameters, the membership degrees of PS and NS in the Higher fuzzy set are µ_Higher(PS) = 0 and µ_Higher(NS) = 0, since both scores lie below lv = 0.65.

Base of Fuzzy Rules: After fuzzification, the next step is to establish the fuzzy rules (FR) in the form of IF-THEN statements.

Inference Mechanism: Once the IF-THEN fuzzy rules have been established, the next stage is the inference procedure [43], which derives conclusions from a given model using a defined collection of rules. Every rule contributes part of the conclusion, which is later merged with the contributions of the other rules to obtain the full conclusion. In general, the inference operation consists of three sub-stages, Application, Implication, and Aggregation, which are introduced below.
Application Sub-Stage: this step of the inference mechanism maps the fuzzy membership degrees of each rule's inputs to a firing strength for that rule [44]. The firing strength of each rule is obtained by combining the membership degrees in the antecedent block of the fuzzy rule. The logic connective ''AND'' (intersection) is expressed by the t-norm minimum, and the logic connective ''OR'' (union) is expressed by the t-conorm maximum. These two operators are defined by equations (6) and (7).

Implication Sub-Step: this step applies an implication operator to every activated IF-THEN fuzzy rule. The operator most often used is the minimum between the consequent block of the rule and the firing strength produced by the application step [45]; this is formula (8).

Aggregation Sub-Step: the last sub-stage of the inference process combines the results given by the implication sub-step. In other terms, all IF-THEN fuzzy rules having the same class label are aggregated together [46]. There are various aggregation operators, such as the maximum, the minimum, the arithmetic mean and the geometric mean. A widely used operator is the maximum, defined by formula (9).

Defuzzification Phase: this is the final step of Mamdani's fuzzy system. It is the procedure that transforms a fuzzy result into a crisp, measurable result based on the corresponding fuzzy sets and membership degrees [47]. There are several defuzzification methods.

Max-Membership Principle: This defuzzification approach is also termed the height approach [48]. It is restricted to peaked output functions and is described by the following expression:

$$\mu(y^*) \ge \mu(y) \quad \forall y \in Y \tag{10}$$
where µ(y) indicates the membership degree of the element y and µ(y*) is the membership degree of the defuzzified element y*. The MMP defuzzification method is illustrated in Figure 5.

Centroid Approach: this method is also known as the centre of mass, centre of gravity, or centre of area [49]. It is the most frequently applied defuzzification technique. Its core idea is to identify the point x* at which a vertical line would split the aggregated set into two equal masses. It is defined by the following equation:

$$x^* = \frac{\sum_{i=1}^{n} x_i \, \mu(x_i)}{\sum_{i=1}^{n} \mu(x_i)} \tag{11}$$

where x_i denotes an element of the instance, µ(x_i) is the membership degree of x_i, and n is the overall number of elements in the instance. The CM defuzzification approach is illustrated in Figure 6.

Weighted Average Approach: this is one of the simplest and most commonly applied defuzzification methods [50]. It is also referred to as the ''Sugeno defuzzification'' approach. It is formed by weighting every output function by its corresponding maximum membership degree. This approach is useful for fuzzy sets with symmetric output membership functions and gives results quite comparable to those of the CLA approach. It is defined by the following equation:

$$x^* = \frac{\sum_{i} \mu_i(x) \, x}{\sum_{i} \mu_i(x)} \tag{12}$$
where x represents an element of the instance and µ_i(x) indicates the membership degree of the element x. Figure 7 illustrates the Weighted Average defuzzification approach.

Mean-Max Membership: this approach is also referred to as the mean of maxima [51]. It is closely linked to the max-membership principle, with the exception that the positions of peak membership can be non-unique. The defuzzified output is expressed by the following equation:

$$x^* = \frac{\sum_{i=1}^{N} x_i}{N} \tag{13}$$

where the x_i are the elements at which the membership degree reaches its maximum and N is their count. Figure 8 depicts the Mean-Max Membership defuzzification approach.

Centre of Sums Approach: in this procedure, an overlapping region is counted several times, whereas the centroid procedure counts it only once [52]. It uses the algebraic sum of the individual fuzzy subsets rather than their union, and it is defined by equation (14).
where K represents the number of fuzzy linguistic terms, n indicates the overall count of the fuzzy sets, and µ_ij represents the membership degree of the jth fuzzy set. The Centre of Sums defuzzification method is depicted in Figure 9.

Centre of Largest Area: this method can be applied when the output has at least two non-overlapping convex fuzzy subsets. In this scenario, the output is skewed to one side of a membership function [53]. Whenever the fuzzy output has two or more convex regions, the centroid of the convex fuzzy subregion with the largest area is taken as the defuzzified value x*. The value is determined by equation (15).
where α = min{y; y ∈ Z}, β = max{y; y ∈ Z}, and y = y_BOA is the vertical line dividing the region bounded by y = α, y = β, v = 0 and v = µ_i(y) into two regions of equal area; µ_i(y) represents the membership degree of the variable y, and y* is the defuzzified value of y. The Centre of Largest Area defuzzification approach is depicted in Figure 10.
First of Maxima: this approach takes the aggregated output C, i.e. the union of all output fuzzy sets C_i, and finds the smallest value of the domain at which the membership degree is maximal [53]. The defuzzified value is therefore described by the following formula:

$$x^* = \min \{ x \mid \mu_C(x) = \max_{u} \mu_C(u) \} \tag{16}$$

Figure 11 depicts the First of Maxima defuzzification approach.

Last of Maxima Method: this approach takes the aggregated output C, i.e. the union of all output fuzzy sets C_i, and finds the greatest value of the domain at which the membership degree is maximal [53]. The defuzzified value is therefore given by the following formula:

$$x^* = \max \{ x \mid \mu_C(x) = \max_{u} \mu_C(u) \} \tag{17}$$

Figure 12 depicts the Last of Maxima defuzzification approach.
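The inference and defuzzification stages described above can be put together in a short Python sketch. It is an illustration under explicit assumptions: the three rules in `RULES` are hypothetical (the paper's rule base is not reproduced here), the output sets reuse the triangular shapes with the class intervals given earlier, and the output universe [0, 1] is discretized on a regular grid. All function names are illustrative.

```python
def triangular_mf(x, lv, mv, hv):
    if x <= lv or x >= hv:
        return 0.0
    return (x - lv) / (mv - lv) if x <= mv else (hv - x) / (hv - mv)


IN_SETS = {"Lower": (0.0, 0.175, 0.35), "Medium": (0.35, 0.5, 0.65), "Higher": (0.65, 0.825, 1.0)}
OUT_SETS = {"negative": (0.0, 0.175, 0.35), "neutral": (0.35, 0.5, 0.65), "positive": (0.65, 0.825, 1.0)}

# Hypothetical rule base, for illustration only: (PS term, NS term) -> class label.
RULES = {
    ("Higher", "Lower"): "positive",
    ("Lower", "Higher"): "negative",
    ("Medium", "Medium"): "neutral",
}


def mamdani(ps, ns, steps=1000):
    """Return the output grid and the aggregated output membership degrees."""
    grid = [i / steps for i in range(steps + 1)]
    aggregated = [0.0] * len(grid)
    for (ps_term, ns_term), label in RULES.items():
        # Application: AND of the antecedents -> minimum.
        strength = min(triangular_mf(ps, *IN_SETS[ps_term]),
                       triangular_mf(ns, *IN_SETS[ns_term]))
        for idx, y in enumerate(grid):
            clipped = min(strength, triangular_mf(y, *OUT_SETS[label]))  # implication
            aggregated[idx] = max(aggregated[idx], clipped)              # aggregation
    return grid, aggregated


def centroid(grid, mu):
    """Discrete centroid: sum(y * mu) / sum(mu); None if no rule fired."""
    total = sum(mu)
    return sum(y * m for y, m in zip(grid, mu)) / total if total else None


def maxima(grid, mu, tol=1e-9):
    """Grid points where the aggregated membership reaches its maximum."""
    peak = max(mu)
    return [y for y, m in zip(grid, mu) if abs(m - peak) <= tol]


grid, mu = mamdani(ps=0.575, ns=0.425)
peaks = maxima(grid, mu)
print("centroid:", centroid(grid, mu))
print("first / last / mean of maxima:", peaks[0], peaks[-1], sum(peaks) / len(peaks))
```

With the example scores only the hypothetical Medium/Medium rule fires, so the aggregated set is the neutral triangle clipped at about 0.5, and the centroid and mean of maxima both fall near 0.5, inside the neutral interval.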

D. TWEET REPRESENTATION STEP
Because attribute values must be multiplied by the model weights, a feature extractor is commonly employed in machine learning models to turn the attributes into real-valued vectors. In our contribution, we applied four feature extractors, namely GloVe, TFIDF (Trigram), TFIDF (Bigram) and TFIDF (Unigram), in order to discover which one provides the best accuracy rate.

1) N-GRAMS
An n-gram is a subsequence of n elements taken from a given sequence. The concept appears to originate in Claude Shannon's work on information theory. His idea was that, from a given sequence of letters, one can estimate the probability of occurrence of the following letter. From a training corpus, it is simple to build a likelihood function for the next letter given a history of size n. N-grams are frequently employed in natural language analysis. Their application rests on the premise that, given a sequence of k items (k ≥ n), only the n − 1 preceding elements determine the likelihood of an element appearing at position i. We thus have:

$$P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$

With n = 3 (case of the trigram), we have:

$$P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_{i-2}, w_{i-1})$$

The probability of the sequence is:

$$P(w_1, \ldots, w_k) = P(w_1) \, P(w_2 \mid w_1) \prod_{i=3}^{k} P(w_i \mid w_{i-2}, w_{i-1})$$

2) TF-IDF
TF-IDF is a purely statistical measure that relies on how often terms occur. It is frequently used in knowledge discovery, and in information retrieval in particular. It makes it possible to assess the importance of a term relative to a document within a corpus. The weight of a term rises in proportion to how frequently it appears in the document, and it also varies with the frequency of the term across the corpus. Search engines frequently employ variations of the original formula to determine whether a document is relevant to the user's search request.

a: TERM FREQUENCY (TF)
The ''raw'' frequency of a term is simply its number of occurrences in the document under consideration. We can use this frequency, normalized by the document length, to express the term frequency:

$$tf(t) = \frac{n_t}{\sum_{k=1}^{m} n_k}$$

where n_t is the number of times the term t appears in the document and \sum_{k=1}^{m} n_k is the total number of term occurrences in the document.

b: INVERSE DOCUMENT FREQUENCY (IDF)
IDF measures how widespread a word is across the whole corpus. The TF-IDF scheme is designed to give greater weight to the less frequent words, which are deemed more discriminative. It consists in calculating the logarithm (in base 10 or base 2) of the inverse of the proportion of corpus documents that contain the word w:

$$idf(w) = \log \frac{|P|}{|\{m \in P : w \in m\}|}$$

where P denotes the corpus of documents, P = {m_1, m_2, ..., m_n}, n = |P| is the number of documents, and |{m ∈ P : w ∈ m}| is the number of documents m ∈ P in which the word w appears. Therefore:

$$tfidf(w, m, P) = tf(w, m) \times idf(w, P)$$

where w denotes a word or term, m a document, and P the corpus.
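The unigram, bigram and trigram TF-IDF extractors described above can be sketched as follows. This is a minimal pure-Python illustration, assuming whitespace tokenization, a base-10 logarithm and the normalized term frequency defined above; the function names are illustrative and the sketch is not the extractor used in the experiments.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs, n=1):
    """Return one {n-gram: tf-idf weight} dictionary per document."""
    grams_per_doc = [ngrams(doc.lower().split(), n) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(set(grams))
    vectors = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = sum(counts.values())
        vectors.append({g: (c / total) * math.log10(len(docs) / doc_freq[g])
                        for g, c in counts.items()})
    return vectors


tweets = ["good movie tonight", "bad movie tonight"]
print(tf_idf(tweets, n=1))  # unigram weights
print(tf_idf(tweets, n=2))  # bigram weights
```

An n-gram that occurs in every document (here "movie") receives a zero weight, which is the intended effect of the IDF factor.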

3) GLOVE
Word embedding procedures map terms into a vector space of numerical values. A good embedding should ideally map terms so that two different terms with almost the same meaning have representations that are very close in the vector space [54]. It can also preserve additional linguistic relationships between terms. As an illustration, the operation ''King − Man + Woman'' on these vector representations yields a vector that closely resembles the representation of the word ''Queen''. A widely applied word embedding model is GloVe (Global Vectors), proposed by the NLP research group at Stanford University. This method merges the benefits of local context window methods and global matrix factorization methods. The context is a fixed-size window of lexical elements placed around the word. We try to map every term i and every term j that co-occur in the same context to vectors z_i and z_j, respectively, of size d, such that:

$$z_i^{\top} z_j + bs_i + bs_j = \log(X_{ij})$$

where X_ij indicates the frequency with which word j occurs in the context of term i, and bs_i and bs_j are the biases related to the terms i and j, respectively.
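To make the quantities in the GloVe relation concrete, the following sketch counts the co-occurrences X_ij over a fixed-size context window and evaluates how far a pair of vectors is from satisfying the relation. It is only an illustration: it uses raw counts (the reference GloVe implementation additionally down-weights distant words), and the function names and the toy vectors are assumptions.

```python
import math
from collections import defaultdict


def cooccurrence(tokens, window=2):
    """X[(i, j)]: number of times word j appears within `window` positions of word i."""
    counts = defaultdict(float)
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                counts[(word, tokens[other])] += 1.0
    return counts


def glove_residual(z_i, z_j, bs_i, bs_j, x_ij):
    """z_i . z_j + bs_i + bs_j - log(X_ij); zero when the GloVe relation holds exactly."""
    dot = sum(a * b for a, b in zip(z_i, z_j))
    return dot + bs_i + bs_j - math.log(x_ij)


tokens = "the cat sat on the mat".split()
X = cooccurrence(tokens, window=1)
print(X[("the", "cat")], X[("cat", "the")])
print(glove_residual([0.1, 0.2], [0.3, -0.1], 0.0, 0.0, X[("the", "cat")]))
```

Training GloVe amounts to adjusting the vectors and biases so that these residuals become small over all co-occurring pairs.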

E. STAGE OF FEATURE SELECTION
Finding the ''relevant'' subset of features from the starting collection is the goal of feature selection. The system's goals and criteria are always taken into consideration while determining the significance of a subset of features. In our research, we combined three ways to carry out the feature selection:

1) MUTUAL INFORMATION APPROACH
In information theory and probability theory, the mutual information of two random variables is a measure of the mutual dependence between them. More specifically, it quantifies the amount of information obtained about one random variable by observing the other. It is symmetric and can capture non-linear dependencies between the two variables. This approach is valuable for feature selection since it allows us to determine the relevance of a subset of features with respect to the output vector. Formally, mutual information is defined as follows:

$$MI(w_1, w_2) = \sum_{c} \sum_{d} p(w_1(c), w_2(d)) \, \log \frac{p(w_1(c), w_2(d))}{p(w_1(c)) \, p(w_2(d))}$$

where p(w_1(c), w_2(d)) is the joint probability mass function of the variables w_1 and w_2. MI is equal to zero when w_1 and w_2 are statistically independent, i.e. when p(w_1(c), w_2(d)) = p(w_1(c)) · p(w_2(d)).
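The definition above can be checked on small discrete samples with the following sketch. It is a minimal illustration that estimates the probabilities by relative frequencies and uses the natural logarithm; the function name is illustrative.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


feature = [0, 0, 1, 1]
print(mutual_information(feature, [0, 1, 0, 1]))  # independent of the label
print(mutual_information(feature, [0, 0, 1, 1]))  # fully determines the label
```

A feature that is independent of the class label scores zero, whereas a feature that determines the label scores the entropy of the label (log 2 for two balanced classes).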

2) CHI-SQUARE METHOD
In statistics, the chi-square test is typically used to examine the independence of two variables. It is a statistical test that measures the divergence from the distribution expected under the assumption that the occurrence of a variable is not associated with the value of the other variable. In feature selection, chi-square determines whether the appearance of a certain term and the appearance of a particular class are independent. Every term is assessed, and the terms are ranked according to their scores. A high score suggests that the appearance of the term and the class are correlated, and thus that the null hypothesis of independence must be rejected.
The feature is chosen for the classification process if the term and the class depend on one another. In general, the chi-square score is computed from the following quantities: false positives (fp), false negatives (fn), true negatives (tn), true positives (tp), the probability of negative instances P_neg and the probability of positive instances P_pos.
$$\text{chi-square-score} = t(fn, (fn + tn) \cdot P_{pos}) + t(tn, (fn + tn) \cdot P_{neg}) + t(tp, (tp + fp) \cdot P_{pos}) + t(fp, (tp + fp) \cdot P_{neg}) \tag{27}$$

where t(observed value, expected value) = (observed value − expected value)^2 / expected value. The chi-squared approach involves the following stages:
1) Define the hypothesis
2) Create an assessment plan
3) Investigate the samples of the data
4) Determine the outcomes.
Create an Assessment Plan: after the hypothesis is declared, the assessment plan specifies how to use the data from the sample to reject or accept the hypothesis.
• Significance level: researchers usually select a significance level equal to 0.01, 0.05, or 0.10, although it may be any number between 0 and 1.
• Testing approach: the chi-square test of independence is employed to determine whether there is a significant association between the two categorical variables.
Investigate the Samples of the Data: the sample of data must be examined to compute the degrees of freedom, the expected frequencies, the test value, and the P-value associated with the test value.
• Degrees of freedom: DF = (r − 1)(c − 1), where r is the number of levels of one categorical attribute and c is the number of levels of the second.
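The chi-square score of equation (27) can be computed directly from the four counts, as in the sketch below. It assumes that P_pos and P_neg are the proportions of positive and negative instances, (tp + fn)/N and (fp + tn)/N, which the text does not state explicitly; the function names are illustrative.

```python
def chi_square_score(tp, fp, fn, tn):
    """Chi-square score of a term from its 2x2 term/class contingency counts."""
    n = tp + fp + fn + tn
    p_pos = (tp + fn) / n  # assumed: share of positive instances
    p_neg = (fp + tn) / n  # assumed: share of negative instances

    def t(observed, expected):
        # Contribution of one cell; undefined when the expected count is zero.
        return (observed - expected) ** 2 / expected

    return (t(fn, (fn + tn) * p_pos) + t(tn, (fn + tn) * p_neg)
            + t(tp, (tp + fp) * p_pos) + t(fp, (tp + fp) * p_neg))


def degrees_of_freedom(r, c):
    """DF = (r - 1)(c - 1) for an r x c contingency table."""
    return (r - 1) * (c - 1)


print(chi_square_score(tp=10, fp=10, fn=10, tn=10))  # term independent of the class
print(chi_square_score(tp=20, fp=0, fn=0, tn=20))    # term perfectly tied to the class
print(degrees_of_freedom(2, 2))
```

Terms would then be ranked by decreasing score, and the score compared with the chi-square critical value for the chosen significance level and degrees of freedom.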

3) ANOVA TECHNIQUE
In this proposal, two one-way ANOVA approaches, based on P-value and F-value, are utilized to statistically pick out the relevant features.

Algorithm 1 Pseudocode of One-Way ANOVA Based on F-Values
Input: A pair (F, C), where F denotes the set of features retrieved by one of the used extractors, and C the class label of every feature. Also, %m is the percentage of the chosen features.
Output: A chosen subset of relevant features according to the F-value.
Begin
  T_Classes ← find(C) // Extract the total number of class labels.
  for each F_j ∈ (F, C) do
    n_instance_per_class ← find(C_i)
    n_total_instance ← find(F, C)
    d_1 ← T_Classes − 1 // degrees of freedom between the classes
    d_2 ← n_total_instance − T_Classes // degrees of freedom within the classes
    sum_square_all_features ← (
  end for
  Order ascending (F based on F-value)
  BV ← Choose (the biggest %m of F according to the F-value)
  return BV
In the first one-way ANOVA approach, features are chosen based on their F-values and a specified percentage (%m) of the initial number of features. Only the %m of features with the highest scores are used to train the classifiers.
The second approach relies on the P-values of the one-way ANOVA, which identify the features relevant to the classification process by comparison with the significance level. If the P-value of a feature is lower than the significance level, the feature is retained for further processing; otherwise it is rejected. The significance level (α) is usually fixed at 0.05 [55].
The feature selection methodology for the two one-way ANOVA approaches is depicted in the Algorithms 1 and 2.
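The F-value ranking can be sketched in pure Python as follows. This is a minimal illustration of standard one-way ANOVA, using T_Classes − 1 degrees of freedom between the classes and N − T_Classes within them; the function names and the toy features are assumptions, and the P-value variant is omitted because it requires the F distribution, which the standard library does not provide.

```python
def anova_f_value(values, labels):
    """One-way ANOVA F statistic of a single feature against the class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_total, n_classes = len(values), len(groups)
    grand_mean = sum(values) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    d1, d2 = n_classes - 1, n_total - n_classes
    # Undefined when the within-class variation is zero.
    return (ss_between / d1) / (ss_within / d2)


def select_top_percent(features, labels, percent):
    """Keep the `percent` % of features with the largest F-values (at least one)."""
    scores = {name: anova_f_value(column, labels) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(len(ranked) * percent / 100))
    return ranked[:keep]


labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
features = {"f1": [1, 2, 3, 4, 5, 6], "f2": [1, 5, 2, 6, 3, 4]}
print(anova_f_value(features["f1"], labels))
print(select_top_percent(features, labels, percent=50))
```

In this toy example the feature f1 separates the two classes far better than f2 and is therefore the one retained.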

Algorithm 2 Pseudocode of One-Way ANOVA Based on P-Values
Input: A pair (F, C), where F denotes the set of features retrieved by one of the used extractors, and C the class label of every feature. Also, %m is the percentage of the chosen features.
Output: A chosen subset of relevant features according to the P-value.
Begin
  T_Classes ← find(C) // Extract the total number of class labels.
  for each F_j ∈ (F, C) do
    n_instance_per_class ← find(C_i)
    n_total_instance ← find(F, C)
    d_1 ← T_Classes − 1 // degrees of freedom between the classes
    d_2 ← n_total_instance − T_Classes // degrees of freedom within the classes
    sum_square_all_features ← (

F. DEEP BELIEF NETWORK
A deep belief network (DBN) is built by stacking restricted Boltzmann machines (RBMs) [56]. This approach can tackle complex data structures and find features that are inaccessible to direct detection. Increasingly complicated structures can be represented by successively stacking RBMs: each RBM is trained on the output of the previous RBM and learns to recognize features that were implicit in it. Generally, the DBN is trained with a layer-by-layer algorithm, and one of its benefits is that it finds descriptive features capturing the relationships between the inputs of each layer [57]. The layer-by-layer learning approach makes it possible to optimize the weights within layers more effectively. Additionally, initializing the DBN weights through pre-training may improve the outcomes in comparison with randomly initialized weights. The advantages of DBN learning also include its capacity to lessen the negative impacts of underfitting and overfitting, both of which are prevalent issues in large deep learning models. These factors led to the DBN being selected as the classifier in this study.
The probabilistic energy-based model is a popular approach [58] for defining the joint distribution between the observed data ov and the n layers of hidden variables hv_1, ..., hv_n, as indicated below (with hv_0 = ov):

$$P(ov, hv_1, \ldots, hv_n) = \left( \prod_{i=0}^{n-2} P(hv_i \mid hv_{i+1}) \right) P(hv_{n-1}, hv_n)$$

where P(hv_i | hv_{i+1}) is the conditional probability distribution of layer hv_i given the layer hv_{i+1} above it in the RBM at level i of the DBN, and P(hv_{n−1}, hv_n) represents the joint hidden-hidden probability distribution in the top-layer RBM.
At each layer, the computed outcome was taken as input for the subsequent layer.

1) RESTRICTED BOLTZMANN MACHINE
The restricted Boltzmann machine is a type of Boltzmann machine without any connection among the neurons of the same layer: visible neurons are linked only to hidden neurons, and vice versa. In this model, the joint probability of a configuration (k, hv) is described as follows:

$$Jp(k, hv) = \frac{e^{-Energy(k, hv)}}{nc} \tag{31}$$

where nc = \sum_{i,j} e^{-Energy(k_i, hv_j)} is known as the normalization coefficient and k denotes the state of the visible layer. The probability of a visible configuration is obtained by summing the joint probability over all hidden configurations.
The derivative of the logarithm of this probability with respect to the weights M is described as follows:

$$\frac{\partial \log P(k)}{\partial M} = \phi^{+} - \phi^{-}$$

with ϕ+ and ϕ− being called the positive and negative stages, respectively.
The estimation of the positive stage is straightforward because there is no connection among the hidden neurons or among the visible neurons. The conditional probability of every hidden neuron is obtained by:

$$P(hv_i = 1 \mid k) = sig(b_i + M_i k)$$

Algorithm 3 Stages Carried Out by the Algorithm of Contrastive Divergence
Stage 1: Initialize n, m, N, W, a, b, and the learning ratio.
Stage 2: Assign a sample s from the training data as the initial state vv_0 of the visible layer.
Stage 3: Based on Equation (35), evaluate P(hv_0n = 1 | vv_0), and from the conditional distribution P(hv_0n = 1 | vv_0) extract hv_0n ∈ {0, 1}, where n = 1, 2, ..., k.
Stage 4: Based on Equation (36), evaluate P(vv_1m = 1 | hv_0), and from the conditional distribution P(vv_1m = 1 | hv_0) extract vv_1m ∈ {0, 1}, where m = 1, 2, ..., l.
Stage 5: Using Equation (35), evaluate P(hv_1n = 1 | vv_1).
Stage 6: Update the parameters W, a and b.
Stage 7: Assign another sample from the training data as the initial state vv_0 of the visible layer and iterate again from Stage 3 to Stage 7. Continue this process until all N training samples have been processed.
where M i is the ith row of the matrix M and sig(x) represents the sigmoid activation function. The visible neurons can be reconstructed in the same manner as the hidden neurons.
The second, negative phase must be computed for every hidden and visible neuron. One algorithm suggested to approximate the gradient of the log-likelihood is contrastive divergence (CD). This algorithm has been employed to update the learning parameters, i.e. the weights and biases, in every RBM. The benefit of this algorithm is obvious when applying parallel computing with Matlab.
Supposing that the hidden neurons are binary, the visible data are partitioned into batches according to the batch size set in the former stages. Afterwards, the hidden neurons are computed with equation (37).
Lastly, a hidden neuron is activated if its probability exceeds the threshold. To update the visible neurons, it is usual to use the probability p_i, which is calculated with equation (38). Having computed the gradient, we can update the parameters, i.e. the weights and the biases. Two major hyper-parameters, the learning ratio and the momentum, control how the updated parameters depart from the former ones. The learning ratio multiplies the update of the matrix M. If it is too large, the reconstruction error will increase, and if it is too small, the running time will be long. The best value of the learning ratio is combined with the weights averaged across multiple updates. Momentum helps to speed up learning. It is applied after each batch has been processed, retaining part of the previous update, and is therefore multiplied by M_old. All the stages of training the RBM can be summarized as follows:
1) Identify the necessary parameters:
  a) The learning ratio for stochastic gradient descent
  b) µ: momentum for the update of the parameters
  c) Hidden and visible biases initialized to zero
  d) Initial weights drawn at random from a Gaussian probability distribution
  e) The total number of hidden neurons
  f) The total number of layers
  g) Batch size for contrastive divergence sampling
2) Calculate ϕ+ = hv · k
3) Calculate \hat{hv} = P(hv_j = 1) and \hat{k} = P(k_j = 1) for all i, j by applying the contrastive divergence algorithm.
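One contrastive divergence update (CD-1) for a small binary RBM can be sketched as follows. This is a minimal pure-Python illustration, not the Matlab implementation mentioned above: the weight layout (one row per hidden neuron), the bias names `a` (visible) and `b` (hidden), the learning ratio value and the omission of momentum and batching are all assumptions made to keep the example short.

```python
import math
import random


def sig(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    """Draw binary states from Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, W, b):
    # W[j][i] links hidden neuron j to visible neuron i; b[j] is the hidden bias.
    return [sig(b[j] + sum(W[j][i] * v[i] for i in range(len(v)))) for j in range(len(W))]


def visible_probs(h, W, a):
    # a[i] is the bias of visible neuron i.
    return [sig(a[i] + sum(W[j][i] * h[j] for j in range(len(h)))) for i in range(len(a))]


def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update of W, a and b (in place) for a single training sample v0."""
    ph0 = hidden_probs(v0, W, b)               # positive phase
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, a), rng)  # reconstruction of the visible layer
    ph1 = hidden_probs(v1, W, b)               # negative phase
    for j in range(len(W)):
        for i in range(len(v0)):
            W[j][i] += lr * (ph0[j] * v0[i] - ph1[j] * v1[i])
        b[j] += lr * (ph0[j] - ph1[j])
    for i in range(len(a)):
        a[i] += lr * (v0[i] - v1[i])
    return W, a, b


rng = random.Random(0)
n_visible, n_hidden = 4, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
a, b = [0.0] * n_visible, [0.0] * n_hidden
for _ in range(50):
    for v in ([1, 1, 0, 0], [0, 0, 1, 1]):
        cd1_step(v, W, a, b, lr=0.1, rng=rng)
print(hidden_probs([1, 1, 0, 0], W, b))
```

In a DBN, the hidden probabilities produced by one trained RBM would serve as the visible input of the next RBM in the stack.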

G. PARALLELIZATION USING HADOOP
So-called ''big data'' systems rest on one core concept: distributing both the data and the processing over a set of machines that form a cluster, here using the Hadoop framework. In this framework, the storage of raw data most often relies on a distributed file system, and MapReduce is the first big data implementation of the principle of parallel processing applied to distributed files [59]. It is based on two main functions, map and reduce, which are sometimes applied multiple times. The former applies a transformation to the values of a collection of data in key/value format; the latter applies an operation to all the values sharing the same key. In this contribution, we used the Hadoop framework to resolve the long runtime encountered by our proposed hybrid approach. This framework lets us parallelize the approach across five computational nodes: one master node and four slave nodes. It uses HDFS to store the Sentiment140 dataset to be evaluated and the classification decisions. Furthermore, it uses the MapReduce programming model, which handles and evaluates our fuzzy DBN jobs in parallel through various mappers and reducers, as depicted in Figure 14. The implementation of our hybrid model on the MapReduce programming model mainly comprises three steps: the Map phase, the Combining phase and the Reduce phase, briefly outlined as follows:

1) MAP PHASE
The mapping stage is composed of four mappers. Every mapping function (mapper) takes one or more pieces of input data from HDFS in the form of key-value pairs. In a first phase, each mapper applies the semi-automatic labelling process (VADER + fuzzy logic) to every piece of data, then stores the labelling results in HDFS. In a second phase, each mapper takes one or more pieces of labelled data, applies the data preprocessing tasks to them, and turns them into numerical vectors using one of the feature extractors presented previously. The mapper then applies one of the feature selectors outlined previously to reduce the dimensionality of the extracted feature vectors. Finally, the mapper applies the deep belief classifier to the processed pieces of data. After all pieces of the input data have been processed, the results produced by our approach are converted into a set of intermediate key-value pairs and written to the local hard drive. A benefit of the Hadoop framework is its resilience to node failures: it stores redundant copies of the data on multiple computational nodes, which provides automatic data backup. The same piece of data is thus saved on different nodes, and if one node crashes, the data remain available on another. The MapReduce scheduling framework is software that provides the scalability and reliability required for handling and running distributed operations. More specifically, it automatically decomposes the computation into several parallel jobs. For instance, if a single job is unable to complete its workload, it can be restarted with no negative impact on the other running jobs.
MapReduce avoids the network bottleneck issue by placing computational jobs close to where the data are stored and by avoiding copying data across the network, thereby balancing the computational and data load. The MapReduce model also gives its adopters a simple programming model that hides the complexity of the computational jobs associated with its operation.
In order to minimize the running time and boost the effectiveness of our proposal, we used the Hadoop framework in this research. Our proposed solution requires the use of a sizable dataset (Sentiment140). In the initial stage, we employed HDFS to divide and store the massive dataset across all of the Hadoop cluster's computing nodes in parallel. The stage that follows the storage of the dataset in HDFS is the application of our approach to the stored dataset. In this second stage, we applied the MapReduce paradigm to parallelize our approach among all computational nodes in the Hadoop cluster. Each round of Hadoop MapReduce begins with a tweet that needs to be classified, and the output is a classified tweet. The classification outcome of every tweet is likewise stored in HDFS. Figure 14 provides an outline of all these phases, and Algorithm 4 presents the MapReduce technique used in our proposal for classifying tweets.
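The data flow of the Map, Combining and Reduce phases can be imitated on a single machine with plain Python functions, as in the sketch below. It is only an illustration of the key-value flow: it does not use the Hadoop API, and `classify` is a hypothetical keyword-based stand-in for the labelling and DBN pipeline, not the classifier of this paper.

```python
from collections import defaultdict


def classify(tweet):
    """Hypothetical stand-in for the VADER + fuzzy labelling and DBN classification."""
    text = tweet.lower()
    if "good" in text:
        return "positive"
    if "bad" in text:
        return "negative"
    return "neutral"


def mapper(tweet_id, tweet):
    """Map phase: emit one intermediate (class label, tweet id) pair per tweet."""
    yield classify(tweet), tweet_id


def combine_and_shuffle(pairs):
    """Combining phase: group the intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(label, tweet_ids):
    """Reduce phase: merge all values that share the same key."""
    return label, sorted(tweet_ids)


def run_job(tweets):
    intermediate = [pair for tid, text in tweets.items() for pair in mapper(tid, text)]
    return dict(reducer(key, values) for key, values in combine_and_shuffle(intermediate).items())


tweets = {1: "good day", 2: "bad service", 3: "just a tweet", 4: "Good food"}
print(run_job(tweets))
```

On a real cluster, each mapper would receive a different HDFS split of the dataset, and the grouped output would be written back to HDFS by the reducers.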

H. CROSS-VALIDATION METHOD
The cross-validation (CV) strategy is one of the most widely employed techniques for tuning hyper-parameters and is addressed in the survey [60]. As in standard k-fold CV, the 10-fold CV computes ten performance scores for each hyper-parameter setting. The average performance measure is then determined for each setting, and the setting with the highest average performance measure is retained for the model. The hyper-parameters of our DBN were tuned in this contribution over the values shown in Table 3.
The choice of hyper-parameters, such as the number of hidden units, the RBM and DBN learning rates, the number of epochs, the batch size, the number of hidden and visible layers and the depth of features, has a serious influence on the classification accuracy and the computational complexity. If the depth is poorly chosen, the DBN may be no more accurate than conventional ML methods. All possible combinations of values should be investigated in order to discover the ideal number of hidden and visible RBM layers and the ideal number of hidden neurons. For parameter estimation in this work, we applied the grid search strategy, which is an effective way to select the hyper-parameter values of an algorithm that yield the best performance on a particular problem. In this proposal, the best hyper-parameter settings for our proposed deep learning classifier were selected using 10-fold CV with a grid search algorithm. Our experimental results provide some guidance on reliable ranges for the hyper-parameters of our DBN classifier: it seems sufficient to have 6-8 RBM hidden layers with 200-225 hidden neurons per hidden layer, 50 epochs, a batch size of either 32 or 64, and learning rates of 0.0015 and 0.0001 for the RBM and the DBN, respectively. Additionally, 125 back-propagation rounds were used.
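The combination of grid search and 10-fold CV can be sketched as follows. It is a minimal illustration: `evaluate` is a hypothetical callback that would train the DBN on the training indices and return its score on the held-out indices, the `toy_evaluate` function merely stands in for it, and the grid values are taken from the ranges reported above.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=10):
    """Yield (train indices, test indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, folds[i]


def grid_search(param_grid, n_samples, evaluate, k=10):
    """Return the setting with the highest average k-fold score, and that score."""
    best_params, best_score = None, float("-inf")
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(evaluate(params, train, test)
                     for train, test in k_fold_indices(n_samples, k))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params, train_idx, test_idx):
    """Toy score, for illustration only; a real callback would train and test the DBN."""
    return -abs(params["hidden_layers"] - 7) - abs(params["hidden_units"] - 225) / 100


grid = {"hidden_layers": [6, 7, 8], "hidden_units": [200, 225], "batch_size": [32, 64]}
print(grid_search(grid, n_samples=100, evaluate=toy_evaluate))
```

The cost grows with the product of the grid sizes times the number of folds (here 3 × 2 × 2 × 10 = 120 training runs), which is one reason the search benefits from parallel execution.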

I. TRAINING AND EVALUATION DATA SET
After the semi automatic labelling stage, we split the labelled dataset into three subsets. The partitioning of the dataset is intended to ensure the representativeness of the training dataset utilized for model building. The three subsets are presented as follows:

1) LEARNING SUBSET
The training subset is a subset of examples employed for learning, which involves adjusting the parameters of a model. For instance, a training subset is employed to train the weights of our fuzzy deep belief network. Furthermore, the training set must cover most of the variability expected in future examples in order to obtain better models.

2) VALIDATION SUBSET
The validation subset is a subset of examples used to adjust the parameters or the structure of a given model. For example, a validation subset is employed to set the number of hidden layers and the number of hidden neurons in the deep belief network.

3) TESTING SUBSET
The testing subset (forecast subset) is a subset of examples used purely to evaluate the effectiveness of a fully defined model. We used 60% of the corpus to train our system, 10% to validate it and 30% to evaluate it, as described in Table 4.
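A 60/10/30 partition of this kind can be sketched as below. The shuffling step, the fixed seed and the function name `split_dataset` are assumptions made for illustration; the text does not state how the examples were assigned to the subsets.

```python
import random


def split_dataset(samples, train=0.6, val=0.1, seed=0):
    """Shuffle the samples and cut them into train/validation/test parts.

    The remaining share (here 30%) becomes the test subset.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumption: random assignment
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(1000))
```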

J. ASSESSMENT CRITERIA
It is crucial to assess a classification system's effectiveness [61]. Numerous metrics are available to evaluate the effectiveness of models, each of which has distinct qualities, and it is often necessary to combine several of them in order to get a full picture of how well a model is performing. Nine assessment metrics [62] were employed in this paper, and they are as follows:

1) PRECISION
is used to gauge the exactness of a classification system. Higher precision implies fewer false positive cases, while lower precision implies more false positive cases. This measure is computed using equation (39).

2) RECALL
is used to gauge the completeness, or sensitivity, of a classification system. Higher recall implies fewer false negative cases, while lower recall implies more false negative cases. This measure is computed using equation (40).

3) F1-SCORE
represents the weighted harmonic mean of the recall and precision. This metric is computed using equation (41).

4) CLASSIFICATION RATE
enables a symmetrical evaluation of the model's performance on both negative and positive items. It gauges the proportion of all items that are predicted correctly. This metric is determined by equation (42).

5) FALSE POSITIVE RATE
represents the ratio used to quantify the ineffectiveness of a classification system by counting the cases that are in fact negative but that the classification system has predicted as positive. The false positive rate is measured using formula (43).

6) SPECIFICITY
is used to gauge how effective a classifier is at identifying the cases whose class label is negative. This measure is calculated according to (44).

7) FALSE NEGATIVE RATE
represents the ratio used to quantify the ineffectiveness of a classification system by counting the cases that are in fact positive but that the classification system has predicted as negative. The false negative rate is measured using formula (45).

8) ERROR RATIO
is employed to gauge the miscategorization ratio, i.e., the number of incorrectly classified cases over all cases in the corpus. Its purpose is to assess how effectively the classification system limits misclassifications. The error ratio is given by formula (46).

9) KAPPA STATISTIC
is a measurement of the performance of a classifier that compares the actual accuracy with an expected accuracy. It is employed not only for evaluating a particular classifier but also for comparing classifiers against each other. The kappa statistic is obtained by calculating formula (47).
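As a compact reference, the nine metrics can be computed from the four confusion-matrix counts as sketched below. The formulas are the standard textbook definitions (with Cohen's kappa for the kappa statistic), given here for the binary positive/negative case; they are assumed to correspond to equations (39)-(47), which are not reproduced in this section. The function name and argument names are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives,
    true negatives, false negatives.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    false_positive_rate = fp / (fp + tn)
    specificity = tn / (tn + fp)
    false_negative_rate = fn / (fn + tp)
    error_rate = (fp + fn) / total
    # Cohen's kappa: observed accuracy corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "false_positive_rate": false_positive_rate,
        "specificity": specificity,
        "false_negative_rate": false_negative_rate,
        "error_rate": error_rate,
        "kappa": kappa,
    }


metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that several of these quantities are complementary: accuracy and error rate sum to 1, as do specificity and false positive rate, and recall and false negative rate.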

V. RESULTS
The findings of the numerical experiments are discussed in this section. A first experiment was conducted to evaluate the effect of each preprocessing technique on the corpus. The objective of the second experiment is to identify the most effective feature extractor. The goal of the third experiment is to find the most efficient feature selector. The final experiment presents the outcomes of the various combinations used in our strategy.

A. INFLUENCE OF DATA PRE-PROCESSING
In this experiment, we study the effect of data pre-treatment. We confirm that the use of different pre-treatment techniques yields different classification results. In addition, not all methods of data pre-treatment are required, so an error rate is computed to determine whether each data pre-treatment technique is required and which one is most efficient. In order to handle complex datasets, six principal data pre-treatment approaches can be applied: the field-delete technique, the normalized technique, the exponent change technique, the PCA technique, the global ratio change technique, and the local ratio change technique.

1) FIELD-DELETE TECHNIQUE
If more than 99 percent of the data values of a dimension are equal to 0, this feature (dimension) should be removed, because its values do not provide sufficient information for designing the classifier. This method reduces the feature size of the initial models and simplifies the design of the classifier.
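The field-delete rule can be sketched as follows, assuming the data are stored as a list of equal-length rows. The function name `field_delete` and the returned list of kept column indices are illustrative choices; only the 99% threshold comes from the text.

```python
def field_delete(rows, zero_threshold=0.99):
    """Drop every column whose share of zero values exceeds zero_threshold.

    Returns the reduced rows and the indices of the columns that were kept.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    keep = [
        j for j in range(n_cols)
        if sum(1 for row in rows if row[j] == 0) / n_rows <= zero_threshold
    ]
    reduced = [[row[j] for j in keep] for row in rows]
    return reduced, keep


# Column 1 is entirely zero and is removed; columns 0 and 2 are kept.
data = [[1.0, 0.0, 3.0], [2.0, 0.0, 0.0], [4.0, 0.0, 5.0]]
reduced, kept_columns = field_delete(data)
```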

2) NORMALIZED TECHNIQUE
If the average or the covariance of the data values in a feature is extremely large, we standardize the data values in that feature so that they are centered at 0. The basic calculation of this technique is given by formula (48):

$$y_{ij} = \frac{y_{ij} - y_{i}^{average}}{covar_{i}} \quad (48)$$

where $y_{ij}$ represents the j-th value of the i-th feature, $y_{i}^{average}$ is the average of the values of the i-th feature, and $covar_{i}$ denotes the covariance of the i-th feature data.

3) EXPONENT CHANGE TECHNIQUE
If the normalized technique fails to condense the feature data, the exponent change technique is a suitable option for reducing all feature values. The resulting data values lie in [0,1], and the j-th value of the i-th feature ($y_{ij}$) is computed by Equation (49).

4) LOCAL RATIO CHANGE TECHNIQUE
If the values are extremely large or extremely small in certain features, while in other features the data values are approximately the same, then the local ratio change approach can be a useful option. The basic calculation of this approach is described by Equation (50):

$$y_{ij} = \frac{y_{ij} - y_{i}^{min}}{y_{i}^{max} - y_{i}^{min}} \quad (50)$$

where $y_{i}^{max}$ indicates the maximum value of the i-th feature and $y_{i}^{min}$ represents the minimum value of the i-th feature.
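The local ratio change of Equation (50) amounts to rescaling each feature by its own minimum and maximum, as in the minimal sketch below. Mapping a constant feature (maximum equal to minimum) to 0.0 is an assumption added to avoid a division by zero; the text does not cover that case.

```python
def local_ratio_change(rows):
    """Rescale each column to [0, 1] using that column's own min and max."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # assumption for constant columns
            for value, lo, hi in zip(row, mins, maxs)
        ]
        for row in rows
    ]


# Two features on very different scales end up on the same [0, 1] range.
scaled = local_ratio_change([[1, 100], [2, 200], [3, 300]])
```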

5) PCA TECHNIQUE
PCA is closely related to factor analysis; it uses a feature matrix to map the models into a new space. PCA allows models to be projected from a high-dimensional space into a low-dimensional space in which they remain representative of the original models. The stages of PCA are the following:
1) Calculate the dispersion (scatter) matrix of the models.
2) Calculate the eigenvalues and eigenvectors of this matrix.
3) Order the eigenvalues from the largest to the smallest.
4) Select over 86% of the major features and then aggregate the respective eigenvectors into a projection matrix.
5) Use the projection matrix to project the models from the initial space into a new space, whose size is given by the number of eigenvectors retained in the projection matrix.

6) GLOBAL RATIO CHANGE TECHNIQUE
This approach works in the same way as the local approach. The only difference is that the local approach rescales every feature by that feature's own minimum and maximum values, whereas the global approach rescales every feature by the minimum and maximum of all the values in the dataset. Its calculation is given as follows:

$$y_{ij} = \frac{y_{ij} - y^{min}}{y^{max} - y^{min}} \quad (51)$$

where $y^{max}$ indicates the maximum value and $y^{min}$ represents the minimum value of the whole dataset.

Table 5 displays the findings in terms of error rate (ER), accuracy (AC) and runtime (RT with Hadoop) after applying each of the data pretreatments explained above to the Sentiment140 dataset.
As Table 5 shows, our first experiment is divided into several parts. The first part (PP) applies every data pretreatment separately. In this part, we notice that the global ratio change method attains a high accuracy of 75.02%, a minimal error rate of 24.98% and a low runtime of 2.13 s. Furthermore, the exponent change method outperforms the normalized method, with an accuracy of 74.07% and an error rate of 25.93%. Because the exponent change and normalized methods serve the same purpose and the former outperforms the latter in terms of error rate and accuracy, we apply only the exponent change method in the next experimental part.
In the second experimental part (SP), we notice that the combination Exponent change+Global ratio change gives a high accuracy of 89.47%, a minimal error rate of 10.53% and a low execution time of 5.63 s. Therefore, in the third experimental part (TP), we keep this combination and vary the other methods.
In the third experimental part, the combination Exponent change+Global ratio change+Field-delete outperforms all other combinations, since it reaches an accuracy of 98.56%, an error rate of 1.44% and a minimal runtime of 8.52 s. Accordingly, in the fourth experimental part (FP), we keep the combination Exponent change+Global ratio change+Field-delete and vary the remaining approaches.
In the fourth experimental part, the combination Exponent change+Global ratio change+Field-delete+PCA performs better than the combination Exponent change+Global ratio change+Field-delete+Local ratio change, since it gives a high accuracy of 98.96%, a lower error rate of 1.04% and a minimal runtime of 11.32 s. Consequently, in the rest of this contribution, we apply only the combination Exponent change+Global ratio change+Field-delete+PCA as the data preprocessing technique.

B. DATASET AFTER LABELING
As previously stated, the tweets of the Sentiment140 corpus were incorrectly labeled, since its designers assumed that every tweet with positive emoticons, like :), is positive, and that tweets using emoticons denoting sadness, like :(, are negative. As a result, we decided to re-label the corpus by combining the Vader lexicon with Mamdani's fuzzy model, which is founded on manually made rules. Our re-labelling process labels each tweet by computing the semantic orientation of its constituent words and by employing the Mamdani fuzzy system to deal with uncertain and vague tweets. The Mamdani fuzzy system consists of a fuzzification and a defuzzification process, each of which can be carried out with different approaches. Therefore, a comparative study was performed to determine the fuzzification/defuzzification combination with the highest performance in terms of classification and error rates, as depicted in Table 6. From Table 6, we notice that the Gaussian function as fuzzification method combined with the Centre of Largest Area as defuzzification approach is the best combination, with the highest classification rate (98.96%) and the lowest error rate (1.04%). Figure 15 describes the dataset before the re-labeling process and figure 16 shows the dataset after re-labeling.
From figures 15 and 16, we remark that our hybrid process re-labeled the dataset into three class labels: the Negative label represents 34%, the Positive label represents 44%, and the Neutral label represents 22% of the whole dataset, whereas the original dataset had only the two class labels Negative and Positive, each representing 50% of the whole dataset. From this re-labelling process, we deduce that the original dataset is mislabelled with an error rate of 22%.
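To make the fuzzification step concrete, the sketch below evaluates Gaussian membership functions over a polarity score in [-1, 1], such as a lexicon-based sentiment score. It covers fuzzification only: the rule base and the Centre of Largest Area defuzzification are not shown. The set centers (-1, 0, 1), the width 0.4 and all names are hypothetical values chosen for illustration, not the parameters used in this work.

```python
import math


def gaussian_membership(x, center, sigma):
    """Gaussian membership degree of x in a fuzzy set (value in (0, 1])."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


# Hypothetical fuzzy sets: label -> (center, sigma).
FUZZY_SETS = {
    "negative": (-1.0, 0.4),
    "neutral": (0.0, 0.4),
    "positive": (1.0, 0.4),
}


def fuzzify(score):
    """Map a crisp polarity score to membership degrees in each fuzzy set."""
    return {
        label: gaussian_membership(score, center, sigma)
        for label, (center, sigma) in FUZZY_SETS.items()
    }


degrees = fuzzify(0.8)
strongest = max(degrees, key=degrees.get)
```

A score near a boundary, for example 0.5, belongs to both the neutral and the positive set to a comparable degree, which is the kind of vague case the fuzzy system is meant to handle.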

C. A FEATURE EXTRACTORS EVALUATION
The aim of this step is to generate a collection of numerical vectors from the input tweets. We conducted a second experiment to identify, based on error rate and accuracy, the most effective of the feature extractors employed in this contribution: GloVe, TF-IDF Trigram, TF-IDF Bigram, and TF-IDF Unigram. The accuracy and error rate achieved by each of them are shown in figure 17. As depicted in figure 17, TF-IDF Trigram surpasses the other extractors, because it attained an accuracy of 98.96% and an error rate of 1.04%.
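As an illustration of how a tweet becomes a numerical vector, the following sketch computes TF-IDF weights over word n-grams (n=1 for unigrams, 2 for bigrams, 3 for trigrams). It uses one common weighting, term frequency times log(N/df), with whitespace tokenization; both choices, along with the function names, are assumptions, since the exact TF-IDF variant is not specified here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Return one sparse {ngram: weight} vector per document."""
    grams = [ngrams(doc.lower().split(), n) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(g for doc in grams for g in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in grams:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against documents shorter than n
        vectors.append({
            g: (c / total) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        })
    return vectors


tweets = ["good movie", "bad movie"]
unigram_vectors = tfidf(tweets, n=1)
```

An n-gram that occurs in every document, such as "movie" in this toy corpus, receives weight 0 because it does not help to distinguish between documents.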

D. FEATURE SELECTORS ANALYSIS
The feature selection process, as previously mentioned, comes after the feature extraction step. We use a variety of approaches in this phase, namely analysis of variance, mutual information and chi-square, and we then combine these three methods. The goal of this experiment is to compare the accuracy and error rate of all of these feature selectors in order to identify the optimal one. Figure 18 outlines the accuracy and error rate obtained using the various feature selection strategies. As illustrated in figure 18, the suggested hybridized selector exceeds the performance of the other selectors in terms of accuracy and error rate, because it achieved an accuracy of 98.96% with an error rate of 1.04%.
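For reference, the chi-square score of a single binary feature (for example, whether an n-gram occurs in a tweet) against a binary class can be computed from a 2x2 contingency table, as sketched below. This shows only one of the three selectors in its standard form; how the three scores are combined in the hybridized selector is not described in this section, so it is not reproduced. The argument names are illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: feature present, class positive   b: feature present, class negative
    c: feature absent,  class positive   d: feature absent,  class negative
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # assumption: a degenerate table carries no information
    return n * (a * d - b * c) ** 2 / denominator


perfectly_associated = chi_square_2x2(10, 0, 0, 10)
independent = chi_square_2x2(5, 5, 5, 5)
```

Features are then ranked by their score, and a higher score indicates a stronger dependence between the feature and the class label.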

E. THE OVERALL EFFECTS OF THE SUGGESTED MODEL
To underline the significance of the proposed model and to make clear how it affects the classification outcomes for tweets, we conducted this experiment to compute the accuracy rate, error rate, runtime, kappa statistic, specificity, precision, false positive rate, F1-score, recall and false negative rate for every hybridization (Extractor+Selector+Classifier), and then examined the results to see which model performs best.

F. COMPARATIVE STUDY BETWEEN OUR APPROACH AND BASELINE ALGORITHMS
In this subsection, we compare the outcomes of the suggested technique with those of Sentiment140, which is a Twitter project that automatically classifies sentiment. In this project, tweets were classified using seven distinct ML techniques: naive Bayes (NB), support vector machines (SVM), Maximum Entropy (MaxEnt), Iterative Dichotomiser 3 (ID3), C4.5, random forest (RF), and k-nearest neighbors (KNN). In figure 19, we report the findings of this project together with those of the suggested model according to the assessment metrics accuracy, recall, F-score and precision. From this comparative analysis, we conclude that the outcomes produced by our approach significantly outperform those of the seven algorithms. This demonstrates the benefit of employing a deep learning technique for sentiment analysis tasks of this kind, and the strong results obtained when dealing with large amounts of data as opposed to using conventional ML algorithms.

G. COMPARATIVE STUDY BETWEEN OUR APPROACH AND DEEP LEARNING ALGORITHMS
In this subsection, we discuss the experimental findings obtained with our fuzzy deep belief model and with other deep learning algorithms, namely the convolutional neural network (CNN), feedforward neural network (FFNN), recurrent neural network (RNN), and long short-term memory (LSTM). The results obtained in terms of recall, precision, F1-score, and classification rate for this experiment are illustrated in figure 20.
From figure 20, we remark that our fuzzy deep belief network outperforms all other deep learning models on the four evaluation criteria, since it achieves a recall of 99.75%, a precision of 97.59%, an F1-score of 98.65% and a classification rate of 98.96%. These results show that the fuzzy deep belief network is able to overcome the overfitting and underfitting issues.

VI. DISCUSSION
For further assessment of our newly suggested fuzzy deep belief classifier, we conducted one more experiment, which compares our classifier with other classifiers chosen from the literature: Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65]. In this instance, the assessment measures employed are false negative rate, recall, runtime, convergence, stability, F1 score, accuracy, error rate, kappa statistic, complexity, false positive rate, precision rate and specificity rate, as discussed in the subsection on assessment criteria. This comparison is carried out on the Sentiment140 dataset. Its empirical findings are displayed in figure 21.

A. COMPLEXITY, CONVERGENCE AND STABILITY
In this experiment, we evaluated the effectiveness of our developed hybrid model and of the methods of Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65], chosen from the literature, when trained on the Sentiment140 dataset, by measuring their space and time complexity. Table 10 reports the experimental results for space complexity, obtained by calculating the memory space occupied by the allocation of the parameters and by the execution of the instructions of each approach.
As depicted in Table 10, the instructions executed by our suggested hybrid model occupy a memory space of 11.59 M when training on the Sentiment140 corpus, and the memory space allocated to our approach's parameters is 7.31 M. As these empirical results show, our hybrid model occupies much less memory space than the approaches of Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65]. Table 11 displays the experimental results for time complexity, obtained by measuring the training and testing time spent by our developed hybrid approach and by the other evaluated approaches.
As shown in Table 11, our developed hybrid model consumed a training time of 11.92 s and a testing time of 3.56 s on the Sentiment140 corpus. As these results show, our hybrid model has a much lower time complexity than Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65]. This good performance in terms of time complexity is a result of using the Hadoop cluster, which comprises twelve computational nodes: eleven subordinate nodes and one supervisor node.
In the fifth experiment, we assessed the efficiency of our proposed hybrid method and of the methods of Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65] when trained on the Sentiment140 dataset by examining the convergence of each evaluated approach, that is, by determining the iteration number at which the analyzed method satisfies the condition in equation (52):

$$Error\ rate_{p} - Error\ rate_{c} \geq T_{value} \quad (52)$$

where $Error\ rate_{p}$ represents the average error rate achieved by the evaluated approach at the previous iteration of the learning process, $Error\ rate_{c}$ is the average error rate of the assessed method at the current iteration of the learning process, and $T_{value}$ represents the threshold that defines convergence. After numerous experiments, the threshold value was set at 0.0001. The average error rate of every analyzed method is estimated using the subsequent equation: where $I$ signifies the total count of instances stored in the training corpus, $D$ represents the total count of decision feature labels in the corpus, $z$ is the expected decision feature label at the output of the classification process, and $z_{label}$ represents the label actually obtained for the decision attribute at the output of the classification process.
If the condition defined in equation (52) is met, we say that the trained method has converged; the algorithm is executed until the trained method's average error rate satisfies the condition. Otherwise, we say that the trained approach fails to converge. Figure 22 presents the convergence rate of our proposed hybrid model on the Sentiment140 dataset. As displayed in figure 22, our developed hybrid model converged towards the threshold value of 0.0001 after 254 iterations on the Sentiment140 corpus. Table 12 reports the convergence round of our suggested hybrid model and of Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65].
As described in Table 12, our proposed hybrid model converges faster than the others because it has a lower misclassification rate than the other evaluated approaches.
In the final experiment, we measured the mean standard deviation (MSD) of each approach over the five cross-validations of the given corpus to examine the effectiveness of our developed hybrid model and of Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65]. The primary objective of this experiment is to find the most stable approach among all the evaluated approaches. Table 13 reports the MSD and average accuracy (AVA) obtained by the suggested hybrid model and by the approaches from the literature over the five cross-validations of the corpus used in this proposal.
As reported in Table 13, our suggested hybrid model is more stable over the five cross-validations than Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65], because it reached a higher AVA of 92.27% with a lower MSD of 0.18% on the Sentiment140 dataset.
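A stability summary of this kind can be sketched as below. The precise definition of MSD is not given in this section, so the sketch assumes it is the (population) standard deviation of the accuracies obtained across the cross-validation runs, with AVA as their mean. The function name and the sample accuracies are purely illustrative and are not results from this work.

```python
from statistics import mean, pstdev


def stability_summary(run_accuracies):
    """Return (average accuracy, standard deviation) over CV runs.

    Assumption: MSD is taken as the population standard deviation
    of the per-run accuracies.
    """
    return mean(run_accuracies), pstdev(run_accuracies)


# Illustrative accuracies (in percent) for five cross-validation runs.
ava, msd = stability_summary([90.0, 91.0, 92.0, 93.0, 94.0])
```

A lower deviation at a comparable average accuracy indicates that the classifier's performance depends less on how the corpus was partitioned.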

VII. CONCLUSION AND FUTURE SCOPE
In this proposal, we developed a novel hybrid model to categorize tweets into three categories: positive, neutral, and negative. The suggested hybrid approach is made up of six steps: (1) a data collection phase, in which we chose the Sentiment140 dataset to evaluate our contribution; (2) a data pre-treatment phase, in which all required pre-treatment operations are performed; (3) semi-automatic labelling of the corpus using two methods, the Vader lexicon and the Mamdani fuzzy system; (4) a data representation step that uses four different extraction approaches, namely GloVe, TF-IDF (Trigram), TF-IDF (Bigram), and TF-IDF (Unigram), to convert each tweet into a numerical vector; (5) a feature selection stage that uses three feature selection strategies, namely the ANOVA technique, the chi-square method and the mutual information approach; and (6) data classification using a deep belief network as the classification model to assign one of three labels (negative, neutral, or positive) to each input tweet. Finally, to solve the problem of long runtimes on large datasets, our hybrid approach is executed in a parallel configuration on the Hadoop platform.
From the simulations conducted, we deduced that our suggested hybrid model is more stable over the five cross-validations than Botchway et al. [7], Es-Sabery et al. [10], Hua et al. [63], Hassan et al. [64], and Chen et al. [65], because it reached a higher AVA of 92.27% with a lower MSD of 0.18% on the Sentiment140 dataset. In addition, our proposed hybrid model converges faster than the others because it has a lower misclassification rate than the other evaluated approaches. We also observed that the instructions executed by our hybrid model occupy a memory space of 11.59 M when training on the Sentiment140 corpus, and that the memory space allocated to its parameters is 7.31 M, which is much lower than the memory space occupied by the same approaches from the literature. Finally, our hybrid model consumed a training time of 11.92 s and a testing time of 3.56 s on the Sentiment140 corpus, giving it a much lower time complexity than these approaches. This good performance in terms of time complexity is a result of using the Hadoop cluster, which comprises twelve computational nodes: eleven subordinate nodes and one supervisor node.
The goal of this study was to investigate the sentiments that people expressed on Twitter about a particular product, brand, or topic, as collected in the Sentiment140 corpus. The study may be limited by the fact that all of the tweets used in it were written in English. Additionally, although the Vader lexicon used in this study assessed the tweets for various emotions, it does not cover certain words that express emotions.
Our planned future work includes the use of other lexicons to address this limitation of the Vader lexicon, the use of deep learning models instead of traditional feature extraction and selection methods for identifying the pertinent features, and the evaluation of further classifiers in order to compare their effectiveness with that of the deep belief network used to classify the tweets in this contribution. We also plan to use a fuzzy rule-based model, instead of the Mamdani fuzzy system used in this work, for handling uncertain and vague data.