TEmoX: Classification of Textual Emotion Using Ensemble of Transformers

Textual emotion classification (TxtEC) refers to the classification of emotions expressed by individuals in textual form. The widespread use of the Internet and numerous Web 2.0 applications has led to an expeditious growth of textual interactions. However, determining emotion from texts is challenging due to their unorganized, unstructured, and disordered forms. While research in textual emotion classification has made considerable breakthroughs for high-resource languages, it remains challenging for low-resource languages like Bengali. This work presents a transformer-based ensemble approach (called TEmoX) to categorize Bengali textual data into six integral emotions: joy, anger, disgust, fear, sadness, and surprise. This research investigates 38 classifier models developed using four machine learning (LR, RF, MNB, SVM), three deep learning (CNN, BiLSTM, CNN+BiLSTM), and five transformer-based (m-BERT, XLM-R, Bangla-BERT-1, Bangla-BERT-2, and Indic-DistilBERT) techniques with two ensemble strategies and three embedding techniques. The developed models are trained, tuned, and tested on three versions of the Bengali emotion text corpus (BEmoC-v1, BEmoC-v2, BEmoC-v3). The experimental outcomes reveal that the weighted ensemble of four transformer models (En-22: Bangla-BERT-2, XLM-R, Indic-DistilBERT, Bangla-BERT-1) outperforms the baseline models and existing methods by providing the maximum weighted $F1$-score (80.24%) on BEmoC-v3. The dataset, models, and portions of the code are available at https://github.com/avishek-018/TEmoX.


I. INTRODUCTION
Classifying textual emotion entails the automated process of attributing a text to an emotion category based on predetermined connotations. In recent years, the proliferation of the Internet and the rapid evolution of social media platforms have led to a significant surge in text-based content, greatly increasing its presence in everyday interactions. Online users communicate their concerns, opinions, or feelings via tweets, posts, and comments. Thus, much emotional text content is accessible on social media and other online platforms. The increasing volume of textual emotion content has attracted researchers' attention, as the categorization of emotions plays a vital role across numerous applications, including education, sports, e-commerce, healthcare, and entertainment. With an ever-increasing number of people on virtual platforms and rapidly produced online information, evaluating the emotions expressed in online content becomes crucial for different stakeholders, such as customers, enterprises, and online education providers. Textual emotion analysis of an enterprise's service or product can boost brand value, sales, and prestige [1]. Automatic TxtEC helps to enhance the quality of a product or service, revise sales plans, and forecast forthcoming trends. Furthermore, it can shape brand reputation, follow client reactions, capture general emotions, and track conformity. (The associate editor coordinating the review of this manuscript and approving it for publication was Junhua Li.)
Although TxtEC has made significant advancements in well-resourced languages, its development remains rudimentary for low-resource languages like Bengali. Low-resource languages are defined as languages for which statistical methods cannot be directly applied due to data scarcity. These languages are crucial as they represent vast numbers of speakers, especially in regions like Asia and Africa. Investigating huge amounts of data to unveil the underlying sentiments or emotions (particularly in Bengali) is considered a critical research problem in low-resource languages. The textual data are voluminous and unstructured. Due to their chaotic forms, it is very arduous and time-intensive to organize, store, manipulate, and extract emotional content. The difficulty arises from several constraints, including sophisticated language structures, limited resources, and substantial verb inflections [2]. Moreover, the scarcity of text processing tools and standard corpora makes textual emotion analysis more difficult in Bengali. Considering the current impediments of textual emotion classification in Bengali, this work introduces an intelligent technique called TEmoX which utilizes transformer-based learning to categorize Bengali texts into six primary emotions (i.e., joy, anger, disgust, fear, sadness, surprise). Transformer-based learning has recently demonstrated significant advancements in text classification [3], [4]. Hence, this motivates us to use transformer-based learning to classify textual emotions in Bengali. This research extends the previous work [3], which utilized three transformer models: m-BERT, XLM-R, and Bangla-BERT-1. This work utilizes two additional models: Indic-DistilBERT [6] and Bangla-BERT-2 [6]. In addition, this study incorporates the extended version of the dataset (BEmoC-v3 [7]) and employs a stratified sampling technique [8] to address the imbalanced nature of the corpus. By exploiting the performance of 26
ensemble models, 3 deep learning models, and 4 machine learning-based models, this work proposes the En-22 (XLM-R + Bangla-BERT-1 + Bangla-BERT-2 + Indic-DistilBERT) model to perform textual emotion classification for improved results. The distinctive contributions of this work are outlined as follows: • Proposed a textual emotion classification technique called TEmoX to classify Bengali text into six categories: joy, anger, disgust, fear, sadness, and surprise. TEmoX uses a weighted ensemble of four standard transformer models (XLM-R, Bangla-BERT-1, Bangla-BERT-2, and Indic-DistilBERT) with fine-tuned hyperparameters.
• Investigated 38 classification models, including 4 machine learning (ML), 3 deep learning (DL), and 5 transformer models with ensemble strategies to find a robust model for textual emotion classification tasks in Bengali.
• Analyzed the classification outcomes of 38 models with a detailed investigation of misclassification and error rate to find many exciting characteristics of the emotion classification task that might help future researchers.

II. RELATED WORK
Emotions can be perceived in various manners. Emotions are distinct states like fear, anger, and joy, each characterized by unique expressions, evaluations, preceding events, and bodily reactions [7].
Recent advancements in textual emotion classification tasks primarily concentrate on high-resource languages owing to the availability of standard datasets and text-processing tools. Unfortunately, no formal data repository, like the IMDB dataset,1 exists in resource-constrained languages, including Bengali. There is substantial advancement in textual emotion classification in English, Arabic, Chinese, French, and other high-resource languages [9]. For example, the EmoTxt toolkit was created using ML algorithms for the English language [10]. In another study, random forest (RF), decision tree (DT), and K-nearest neighbor (KNN) were used to detect multilabel, multi-target emotion text in Arabic tweets, where RF provided the highest F1-score of 82.6% [11]. Ahmad et al. [12] suggested a DL model for categorizing English poetry text into 13 emotion classes. A recent study [13] explored seven ML techniques for classifying tweets into happy or unhappy. The ensemble of logistic regression (LR) and stochastic gradient descent (LR-SGD) emerged with the highest accuracy of 79%. An automatic classification method was developed by Hasan et al. [14] for detecting emotion from tweets. They applied supervised ML algorithms and obtained 90% accuracy with a decision tree on four emotion categories: happy-inactive, happy-active, unhappy-inactive, and unhappy-active.
Several DL approaches have been studied to classify textual emotion from short sentences. For the classification of emotions in Chinese microblogs, Lai et al. [15] presented a graph convolution network architecture, and their suggested method attained an F-score of 82.32%. Using Nested Long Short-Term Memory (LSTM), Haryadi et al. [16] classified English Twitter data into seven emotion classes and yielded exceptional results, achieving the highest accuracy (99.167%). The SemEval-2019 Task 3 [17] proposed a Bi-LSTM model for categorizing emotion into four classes and gained a maximum F1-score of 79.59%. Ameer et al. [18] presented a detailed analysis of classifying short text messages (i.e., SMS) using several ML, DL, and transfer learning-based techniques. Their models were developed on a code-mixed (Urdu and English) dataset containing 12 emotion classes and achieved the maximum performance with an ML model using uni-gram features.
Kumar et al. [19] used a dual-channel method for multiclass textual emotion detection. They employed a CNN to extract textual features and a BiLSTM layer to capture word order and sequence information. Their work revealed that multiple layers could give more accurate results. However, their network becomes comparatively slower, and GloVe requires more time than BERT embeddings.
Although TxtEC in low-resource languages such as Bengali is still in its infancy, a few research activities have been embarked on utilizing ML and DL approaches. Among them, Tripto et al. [20] developed a method to detect multilabel sentiments and emotions from Romanized Bengali texts based on a YouTube comments dataset of 1006 samples. Their model (LSTM) detected three-label sentiment (with 65.97% accuracy), five-label sentiment (54.24% accuracy), and emotion (59.23% accuracy). Rahib et al. [21] developed a DL-based method using CNN and LSTM to classify emotion from social media responses to COVID-19 texts and achieved an accuracy of 84.92%. Purba et al. [22] proposed an emotion detection system employing a Multinomial Naive Bayes classifier to identify emotions in three categories (angry, sad, and happy) with an accuracy of 68.27%. Mamun et al. [23] introduced a sentiment dataset comprising 8122 text expressions categorized into negative, positive, and neutral. They showed that the ensemble technique (LR+RF+SVM) surpassed the other approaches, attaining the highest accuracy of 82%. Rayhan et al. [24] developed an emotion dataset with six emotion classes by translating an existing English emotion dataset into Bengali. They applied CNN-BiLSTM and BiGRU on the dataset and attained the highest F1-score of 67.41% using CNN-BiLSTM. Azmin et al. [25] employed three emotive classes (happy, sad, and anger). They used a dataset developed by [26] and showed that Multinomial Naïve Bayes (MNB) surpassed the others with a precision of 78.6%. A corpus named Anubhuti [27] concentrated on Bengali short stories labeled in four classes (joy, anger, sorrow, and suspense), on which LR obtained an accuracy of 73%. Analyzing emotions expressed in Bengali blog writing, Das et al. [28] utilized conditional random fields (CRF) for identifying emotional content from blogs, achieving an accuracy of 56.45%. Rupesh et al. [29] classified six basic emotions on 1200 Bengali documents from different domains using SVM and obtained 73% accuracy. Rahman et al. [26] curated a Bengali emotion dataset focused on socio-political issues and employed ML techniques. Their work acquired the highest accuracy (52.98%) and F1-score (33.24%) by utilizing SVM with a non-linear RBF kernel. Parvin et al. [30] utilized an ensemble of CNN and RNN architectures on their developed emotion dataset (containing 9000 text samples) and achieved an F1-score of 62.46%. Iqbal et al. [7] developed an emotion dataset called BEmoC-v3 containing 7000 textual samples in Bengali. The previous version of BEmoC-v3 (i.e., BEmoC-v2) consisted of 6243 samples, utilized by Das et al. [3] for classifying six emotions in Bengali. They employed a pre-trained BERT variant, XLM-RoBERTa, and gained the maximum F1-score (69.73%). Table 1 summarizes the findings of a few recent studies on Bengali TxtEC in terms of the number of classes, corpus size, models used, performance, and critical weaknesses.
Most previous works on Bengali TxtEC used limited datasets to develop ML and DL approaches. In contrast to past studies, this research proposes an ensemble of transformer-based learning that can detect six emotions, outperforming previous TxtEC methods in Bengali. The use of transformer models makes Bengali text classification tasks more robust [32], [33].

III. BEmoC-v3: BENGALI EMOTION CORPUS
The development of an intelligent method for TxtEC in resource-constrained languages presents a significant challenge due to the lack of benchmark corpora. Thus, developing a reliable corpus is the prerequisite for any intelligent text classification model based on ML or DL techniques. The previous research [7] discussed various aspects of the development of the dataset (BEmoC-v3); this work focuses on further analysis of the dataset. The Bengali Emotion Corpus ('BEmoC-v3') is freely available at https://github.com/avishek-018/TEmoX.
A. DATA COLLECTION AND PRE-PROCESSING
Five human crawlers manually accumulated Bengali text data from various online and offline sources. The primary sources include social media comments or posts (Facebook, YouTube), blog postings, textual conversations, narratives, storybooks, and news portals. A total of 7125 text documents were collected initially. Raw collected data require several steps of pre-processing before labeling. A few pre-processing steps are done automatically, and the rest are performed manually: • Automatic: Removed punctuation, digits, non-Bengali words, emoticons, and duplicate data. A module named 'BanglaProcess'2 has been developed for automatic text pre-processing.
• Manual: The texts underwent spelling correction, and texts containing fewer than three words were excluded to ensure consistent emotional content. Following successful pre-processing, the corpus comprised 7000 texts that were subsequently forwarded to human annotators for manual labeling. The initial annotation task was assigned to five postgraduate students working in the Bengali language processing field with computer science and engineering backgrounds. A majority voting [7] process was employed to decide the primary label of each text.

B. VERIFICATION
The initial labeling of texts was examined by an expert with several years of experience conducting Bangla Language Processing (BLP) research. If any initial annotation was done incorrectly, the expert updated the labeling. Through conversations and extensive deliberations with the annotators, the NLP professional finalized the labels, ensuring a reduced likelihood of bias during the annotation process [34]. Table 2 illustrates some discarded data samples and their causes.

C. QUALITY OF ANNOTATION
We used Cohen's Kappa score to determine inter-annotator agreement and ensure the quality of the labeling. The quality of the corpus is reflected by inter-coder reliability (93.1%) and Cohen's Kappa (0.91), which indicate almost perfect agreement among annotators [3]. The Jaccard index between the classes was calculated for quantitative analysis. Table 3 shows the similarity values, where the 200 most frequent words are utilized from each category. Two emotion class pairs (joy-surprise and anger-disgust) showed the highest similarity indices of 0.51 and 0.55, respectively. These results reveal that more than half of the frequently used terms are shared within these two pairs. Nevertheless, the pair (joy-fear) obtained the lowest similarity, indicating that the frequent words in this pair are more distinctive than those in other categories. Thus, it is of concern that this similarity can significantly impact the classification task.
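The two agreement measures above can be sketched in a few lines. This is a minimal illustration with toy annotator labels (not the real BEmoC-v3 annotations): Cohen's Kappa via scikit-learn, and the Jaccard index as a plain set ratio over frequent-word lists.

```python
# Illustrative agreement check; the label lists here are toy data
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["joy", "anger", "disgust", "fear", "sadness", "surprise", "joy", "anger"]
annotator_2 = ["joy", "anger", "disgust", "fear", "sadness", "surprise", "joy", "disgust"]

# Kappa corrects raw agreement for agreement expected by chance
kappa = cohen_kappa_score(annotator_1, annotator_2)

def jaccard(words_a, words_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two frequent-word sets."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)
```

In the paper's setting, `jaccard` would be applied to the 200 most frequent words of each emotion-class pair (e.g., joy vs. surprise).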

D. STATISTICS OF BEmoC-v3
Following the pre-processing and annotation procedure, BEmoC-v3 comprised 7000 text documents. To evaluate the models, the data is partitioned into three sets: training (5751 texts), validation (624 texts), and test (625 texts). As the data are imbalanced, it is preferable to distribute the corpus into training, validation, and test sets in such a fashion that the proportions of data in each class remain the same as in the original corpus [35]. Therefore, we performed a stratified sampling technique [8] while splitting the corpus. Table 4 shows the statistics of the data distribution in each category.
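A stratified split of this kind can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The corpus and split fractions below are illustrative stand-ins for the actual 5751/624/625 partition of BEmoC-v3.

```python
# Sketch of a stratified train/validation/test split (toy data, illustrative fractions)
from sklearn.model_selection import train_test_split

texts = [f"text-{i}" for i in range(100)]
labels = (["joy"] * 50) + (["anger"] * 30) + (["fear"] * 20)

# First carve out the test set, then split the remainder into train/validation,
# stratifying on the label each time so class proportions are preserved.
x_rem, x_test, y_rem, y_test = train_test_split(
    texts, labels, test_size=0.10, stratify=labels, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_rem, y_rem, test_size=0.10, stratify=y_rem, random_state=42)
```

With `stratify` set, each split keeps the 50/30/20 class ratio of the original list, which is exactly the property the corpus partition requires.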
Fig. 2 shows the distribution of the number of texts versus text length. According to the analysis, most of the data had a length between 15 and 35 words. Curiously, most of the texts in the disgust and sadness categories are between 20 and 30 words long, which suggests that disgust content takes more words to express. The joy and sadness classes appear to have nearly identical numbers of texts in the length distribution.
Fig. 3 represents the most frequent word distribution using a word cloud. The words in the center are the most common, while those on the periphery are less common.

IV. METHODOLOGY
This work exploits several ML, DL, and transformer-based learning models with ensemble techniques for performing textual emotion classification in Bengali. Fig. 4 depicts a high-level overview of textual emotion classification.
The TF-IDF and Bag of Words feature extraction techniques are used for the ML-based models (LR, RF, MNB, SVM), whereas Word2Vec, FastText, and pre-trained GloVe embeddings are used for the DL-based models (CNN, BiLSTM, CNN+BiLSTM). Furthermore, we used transformer-based models (i.e., m-BERT, XLM-R, Indic-DistilBERT, and two variants of Bangla-BERT: Bangla-BERT-1 and Bangla-BERT-2). This research also investigates the effect of transformer-based ensemble models for textual emotion classification. The same dataset (i.e., BEmoC-v3) is used to train and tune all models.

A. FEATURE EXTRACTION
Several feature extraction techniques were utilized, including TF-IDF, Word2Vec, and FastText. These techniques transform the text data into numerical representations: matrices or high-dimensional vectors. These feature extractors are shown in Fig. 4.

1) TRADITIONAL FEATURE EXTRACTOR
a: TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
The TF-IDF [36] determines the significance of a word in text content.We extracted a combination of uni-gram and bi-gram features from the most frequent 20000 words of BEmoC-v3.

b: BAG OF WORDS (BOW)
The BOW [37] describes the frequency of words in a dataset. Unlike TF-IDF, it does not weight words by how informative they are relative to the other documents in the dataset. We used the same parameters as in TF-IDF to train the BOW model.
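Both traditional extractors can be sketched with scikit-learn. The settings below mirror the ones stated in the paper (uni-gram plus bi-gram features, vocabulary capped at the 20000 most frequent terms); the three-document corpus is a toy stand-in for BEmoC-v3.

```python
# TF-IDF and BoW extractors with uni-/bi-gram features (toy romanized corpus)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

corpus = ["ami khub khushi", "ami khub rag", "bhoy lage khub"]

tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)
bow = CountVectorizer(ngram_range=(1, 2), max_features=20000)

X_tfidf = tfidf.fit_transform(corpus)  # documents x n-gram features, weighted
X_bow = bow.fit_transform(corpus)      # same layout, raw counts
```

The two matrices share the same shape; only the cell values differ (importance-weighted scores versus raw counts), which is exactly the distinction the text draws between TF-IDF and BoW.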

2) LOCAL CONTEXTUAL FEATURE EXTRACTOR
a: WORD2VEC
Word2Vec is a popular and widely used word embedding technique for detecting semantic similarities between words in a dataset's context [38]. The Word2Vec algorithm has two variants: skip-gram and continuous BOW. According to [39], skip-gram works effectively with a small training dataset and accurately represents even uncommon words or phrases. In this work, Word2Vec is trained using skip-gram with a window size of 7, an embedding dimension of 100, and a minimum word count of 4.

b: FASTTEXT
The Word2Vec algorithm cannot handle out-of-vocabulary words; thus, any word not present in the training vocabulary cannot be vectorized with a corresponding embedding value. The FastText algorithm is used to tackle this problem [40]. By leveraging sub-word information, this technique employs character n-grams to establish semantic relationships between words within a given context [41]. In this approach, when a word is absent from the training vocabulary, its vector can be synthesized from its constituent n-grams. Like Word2Vec, the FastText algorithm is available in both skip-gram and continuous-BOW variations. We trained the FastText algorithm using skip-gram with a window size of 5, a character n-gram of size 5, and an embedding dimension of 100.

c: GLOVE
GloVe is a word vector technique for learning embeddings using word co-occurrences [42]. GloVe does not rely solely on words' local context information (like Word2Vec) to yield embeddings but instead utilizes global statistics on word co-occurrence. We used the pre-trained word vectors by [43] containing 39M tokens, a vocabulary size of 0.18M, and an embedding dimension of 100.

3) ATTENTION-BASED FEATURE EXTRACTOR
BERT-Based Tokenizer: BERT-based multilingual tokenizers leverage the power of BERT's contextual embeddings to encode words and sentences in different languages, capturing their semantic meaning and context. XLM-R and m-BERT are such multilingual tokenizers. Besides these multilingual tokenizers, we also used the Bangla-BERT-1, Bangla-BERT-2, and Bn-DistilBERT tokenizers, which are pretrained on the Bangla language only.

B. FEATURE REPRESENTATION
Utilizing the traditional frequency-based feature extraction algorithms, we obtain a feature matrix. In Fig. 4, UW, S, and D denote unique words, sentences, and documents; M and N are the total number of sentences/documents and the total number of unique words, respectively, while the embedding-based extractors produce vectors of fixed embedding dimensions. In the BERT tokenizer, TOK denotes a token. The [CLS] token is a special token added at the beginning of each input sequence in BERT, representing the entire sequence for sentence-level tasks. The [SEP] token is used to separate segments or sentences when working with pairs of sequences, indicating their boundaries.

C. CLASSIFIERS
1) ML-BASED APPROACH
This work explored the four most widely used ML models to build an emotion detection system, including SVM, LR, MNB, and RF, where TF-IDF and BoW are used as text vectorizers. For the LR, we chose the 'lbfgs' solver with the 'l1' penalty and set the maximum iteration to 400 for the solver to converge. The C value is kept at 1 for both LR and SVM. The SVM utilizes the 'rbf' kernel with an 'l2' penalizer. The RF is implemented with 100 estimator trees, and we keep the minimum number of instances required to split an internal node at 2. For MNB, we set the Laplace smoothing parameter (alpha) to 1 and let it learn prior class probabilities. Table 5 shows a brief synopsis of the parameters employed for the ML models.
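The four baselines can be sketched as scikit-learn pipelines over TF-IDF features. The hyperparameters mirror the values stated above; one caveat is that scikit-learn's `lbfgs` solver supports only the `l2` penalty, so the LR below uses `l2`. The toy texts and labels are purely illustrative.

```python
# The four ML baselines as TF-IDF pipelines (toy data; LR uses l2 since
# scikit-learn's lbfgs solver does not support the l1 penalty)
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

models = {
    "LR": LogisticRegression(solver="lbfgs", penalty="l2", C=1, max_iter=400),
    "SVM": SVC(kernel="rbf", C=1),
    "RF": RandomForestClassifier(n_estimators=100, min_samples_split=2),
    "MNB": MultinomialNB(alpha=1.0),
}

texts = ["khub anondo", "khub rag", "khub bhoy", "khub dukkho"] * 3
labels = ["joy", "anger", "fear", "sadness"] * 3

predictions = {}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(["khub rag"])[0]
```

Wrapping the vectorizer and classifier in one pipeline guarantees that the test split is transformed with the vocabulary fitted on the training texts only.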

2) DL-BASED APPROACH
Various DL models (CNN, BiLSTM, CNN+BiLSTM) are applied to BEmoC-v3, and their performances are investigated. All DL models employ Word2Vec and FastText as feature embeddings. The DL algorithms' performance depends heavily on the hyperparameters, which are tuned carefully to get an optimized network [44]. In general, this task is carried out by humans, which likely leads to suboptimal results. With enough computing resources, one can apply a grid search that executes all possible combinations of hyperparameters. However, as the hyperparameters and parameter space increase, the computation becomes intractable. A past study reveals that a model developed with randomly selected hyperparameter values can show better performance at a lower computational cost than an exhaustive grid search [45]. We empirically determined the hyperparameter values of the embedding models and the classifiers based on our developed corpus. The models utilized the 'ADAM' optimizer with a learning rate of 0.001 and were trained for 35 epochs with a batch size of 16. Keras 'callbacks' were used to monitor the training process and save the model with the maximum validation accuracy in each epoch. The loss function chosen was 'sparse_categorical_crossentropy.'

a: CNN
For analyzing the performance of a CNN [46], we passed BEmoC-v3 to our scratch CNN model. For all of the convolutional and dense layers, we employed rectified linear units to introduce non-linearity, while softmax activation was used for the output layer. The single convolution block has a 1D convolution layer containing 64 filters with a size of 7.
The training weights of the Word2Vec and FastText embeddings are passed to the embedding layer, which generates a sequence matrix. This matrix is then processed by the following layer, global max pooling, to extract the maximum value from each filter. This process produces a single-dimensional vector of the same length as the number of filters used. Finally, an output layer with six nodes computes the probability distribution over the six emotion categories.
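A Keras sketch of this architecture follows, assuming an illustrative vocabulary size and the sequence length used later in the paper; in practice the embedding layer would be initialized with the trained Word2Vec or FastText weights.

```python
# Scratch CNN: one Conv1D block (64 filters, size 7), global max pooling,
# six-way softmax; VOCAB/MAXLEN are illustrative placeholders
import tensorflow as tf

VOCAB, DIM, MAXLEN, CLASSES = 20000, 100, 50, 6

cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAXLEN,)),
    tf.keras.layers.Embedding(VOCAB, DIM),  # load Word2Vec/FastText weights here
    tf.keras.layers.Conv1D(64, 7, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),   # one value per filter -> length-64 vector
    tf.keras.layers.Dense(CLASSES, activation="softmax"),
])
cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```

The compile settings match the training setup described above (Adam at 0.001 with sparse categorical cross-entropy).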

b: BiLSTM
Bidirectional Long Short-Term Memory (BiLSTM) is a kind of recurrent neural network (RNN) that can store information in both directions [47]. A basic RNN only looks at recent information while iterating over data and fails where long-term dependency is needed. We may need to look further back to get the semantic meaning of a text in the emotion detection task. BiLSTM overcomes this problem and works remarkably well for long-term dependency problems. The BiLSTM network contains an embedding layer initialized with the Word2Vec or FastText embedding weights. The model includes a BiLSTM layer with 32 hidden units and a fully-connected dense layer with 16 neurons and ReLU activation. The output layer utilizes a softmax activation function to produce probability distributions over the six emotion classes.
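The BiLSTM network described above can be sketched in Keras as follows; the vocabulary size is an illustrative placeholder, and the embedding layer would again receive the pre-trained weights.

```python
# BiLSTM: embedding -> Bidirectional LSTM(32) -> Dense(16, ReLU) -> softmax(6)
import tensorflow as tf

VOCAB, DIM, MAXLEN = 20000, 100, 50

bilstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAXLEN,)),
    tf.keras.layers.Embedding(VOCAB, DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),  # forward + backward pass
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(6, activation="softmax"),
])
```

The `Bidirectional` wrapper runs one LSTM over the sequence left-to-right and another right-to-left, concatenating both states, which is what gives the model its two-directional context.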

c: CNN+BiLSTM
A hybrid architecture combining CNN and BiLSTM has been explored to leverage the advantages of both designs. Starting with an embedding layer initialized as in the previous procedure, a 1D convolutional layer with 64 filters (size 3) is added on top. Following this, a max-pooling layer downsamples the CNN features and transmits them to two BiLSTM layers. The first layer comprises 64 LSTM units, while the second contains 32 LSTM units. Finally, the BiLSTM layer outputs are fed into a softmax-activated output layer that gives the probability distribution over the six emotion classes.
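The hybrid stack reads as a straightforward Keras Sequential model; as before, the vocabulary size is a placeholder and the embedding layer would hold the pre-trained weights.

```python
# CNN+BiLSTM hybrid: Conv1D(64, 3) -> max pooling -> BiLSTM(64) -> BiLSTM(32) -> softmax(6)
import tensorflow as tf

VOCAB, DIM, MAXLEN = 20000, 100, 50

hybrid = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAXLEN,)),
    tf.keras.layers.Embedding(VOCAB, DIM),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling1D(),  # downsample the convolutional features
    # return_sequences=True so the second BiLSTM receives a full sequence
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation="softmax"),
])
```

Note the `return_sequences=True` on the first BiLSTM: without it, the second recurrent layer would receive a single vector instead of a sequence and the stack would not build.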
Table 6 outlines the optimized hyperparameters of the various DL models. The hyperparameter values are taken from the ranges mentioned in the 'Hyperparameter Space' field.

3) TRANSFORMER-BASED APPROACH
Transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers), can capture contextualized word representations from unlabeled texts [48]. BERT makes use of the encoder of the transformer architecture first introduced in [49]. Many pre-trained transformer models are available in the Huggingface transformers library3 for text processing tasks. Recently, pre-trained transformer variants have been employed in different domains of Bengali text processing, including sentiment analysis [50] and document categorization [51], [52]. They outperformed the ML and DL models with higher accuracy.
This work implemented five transformer-based models: m-BERT, XLM-R, Indic-Transformers Bengali DistilBERT, and two variants of Bangla-BERT. All models are fine-tuned on the emotion corpora by employing Ktrain [53]. We embed specific start and end sequence tokens (SOS and EOS) at the beginning and end of each input sequence for fine-tuning. When required, we applied padding at the end of the sequence or removed any additional tokens that exceeded the predetermined sequence length. The padded tokens are excluded during training to ensure they do not affect the training process. These tokens then go through multiple self-attention layers before being input into the transformer models.

a: m-BERT
We used the 'bert-base-multilingual-cased' model on BEmoC-v3 and fine-tuned it by modifying the batch size, learning rate, and epochs. The m-BERT model [54] was pre-trained on the 104 languages with the largest Wikipedias, including Bengali. The pre-trained m-BERT contains about 110M parameters.
b: XLM-R
XLM-R [55] is trained with a multilingual masked language model. Several unique training procedures enhance its performance over BERT. These include (1) training the model for a more extended period with more data, (2) training with larger batch sizes and longer sequences, and (3) dynamically constructing the masking pattern. The XLM-R model significantly surpasses other multilingual BERT models, especially in low-resource languages. The 'xlm-roberta-base' model is implemented on BEmoC-v3 using a batch size of 12.

c: BANGLA BERT
This work uses two variants of Bangla-BERT that are dedicatedly pre-trained on the Bengali language only. The first one is 'sagorsarker/bangla-bert-base' (hereafter called Bangla-BERT-1) [56], which is trained on the Bengali corpus from OSCAR4 and the Bengali Wikipedia Dump Dataset.5 The other is 'csebuetnlp/banglabert' (hereafter called Bangla-BERT-2) [6]. Both pre-trained models are based on masked language modeling as described in the original BERT paper [48].

d: INDIC-DISTILBERT
We implemented 'indic-transformers-bn-distilbert' on BEmoC-v3 and fine-tuned it to acquire adequate performance. The Indic-DistilBERT [57] is pre-trained on three major Indian languages (Hindi, Bengali, and Telugu), of which the amount of Bengali data is around 6 GB. We fine-tuned all the models on BEmoC-v3 using the Ktrain 'autofit' technique. All models are trained for 20 epochs using a learning rate of 2e−5 with a batch size of 12. Model weights are saved at checkpoints, and the best model is chosen according to its performance on the validation set. The maximum sequence length for the texts is set at 50 words.
4 https://oscar-corpus.com/
5 https://dumps.wikimedia.org/bnwiki/latest/
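The fine-tuning recipe shared by all five transformers can be sketched with Ktrain's `text.Transformer` wrapper. This is a sketch under the hyperparameters stated above; the function is defined but not invoked here, and the train/validation arguments would come from the BEmoC-v3 splits.

```python
# Sketch of the shared Ktrain fine-tuning recipe (lr 2e-5, 20 epochs,
# batch size 12, max sequence length 50); not invoked in this snippet
CLASS_NAMES = ["joy", "anger", "disgust", "fear", "sadness", "surprise"]
CONFIG = {"lr": 2e-5, "epochs": 20, "batch_size": 12, "maxlen": 50}

def finetune(model_name, train_texts, train_labels, val_texts, val_labels):
    import ktrain
    from ktrain import text

    t = text.Transformer(model_name, maxlen=CONFIG["maxlen"],
                         class_names=CLASS_NAMES)
    trn = t.preprocess_train(train_texts, train_labels)
    val = t.preprocess_test(val_texts, val_labels)
    learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                                 val_data=val, batch_size=CONFIG["batch_size"])
    learner.autofit(CONFIG["lr"], CONFIG["epochs"])  # triangular LR schedule
    return ktrain.get_predictor(learner.model, preproc=t)
```

The same call would be repeated with each checkpoint name ('bert-base-multilingual-cased', 'xlm-roberta-base', 'sagorsarker/bangla-bert-base', 'csebuetnlp/banglabert', 'indic-transformers-bn-distilbert') to produce the five fine-tuned models.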

4) ENSEMBLE-BASED APPROACH
After deploying the pre-trained transformer models individually, we approached the transformers' ensemble. In recent years, ensembles of transformer models have proved to be more efficient than individual ones [58], [59], [60]. We performed the ensembling considering all possible combinations of classifier models using the weighted average and average ensemble techniques. In the weighted average ensemble, the prior results of the base models influence the ensemble outcome, so the best-performing model takes precedence over the others. On the other hand, the average ensemble takes the softmax probabilities of all the participating models and averages them; the output class is the one with the highest averaged probability. Prior base classifier results are not taken into account in this strategy [61], [62].
The framework of the ensemble of transformer-based Bengali TxtEC (i.e., TEmoX) is depicted in Fig. 5. The dataset is first sent to the BERT tokenizer, and the tokens are passed to the embedding layer (E) of each model. After passing through the intermediate representation layers, a contextual representation (T) is achieved. Finally, a softmax probability distribution over the emotion classes is obtained. The probabilities are passed to a combination generator to generate the ensemble sets. Eq. 1 is used to determine the total number of ensemble sets generated from the combination generator.
C(m, r) = m! / (r!(m − r)!)     (1)
Here, C(m, r) returns the number of total combinations, m represents the number of transformer models, and r is the number of models chosen for the ensemble. For our task, m = 5 and r = 2, 3, 4, 5, as we generate combinations of 2, 3, 4, and 5 models, respectively. The total number of ensemble sets is therefore C(5, 2) + C(5, 3) + C(5, 4) + C(5, 5) = 10 + 10 + 5 + 1 = 26.
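The combination count from Eq. 1 can be checked in two lines with the standard library.

```python
# Number of ensemble sets from m = 5 transformer models, choosing r = 2..5
from math import comb

m = 5
total_ensembles = sum(comb(m, r) for r in range(2, m + 1))
# C(5,2) + C(5,3) + C(5,4) + C(5,5) = 10 + 10 + 5 + 1 = 26
```

This matches the 26 ensemble models evaluated in the paper.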

Each of the m models produces a softmax probability vector (prb_1, . . ., prb_d) over the d emotion classes.
In the average ensemble technique, the average of each softmax class value provided by the m models is calculated for each instance. Finally, the maximum of the averaged probabilities is used to compute the output class using Eq. 2:

ŷ = argmax_j ( (1/m) Σ_{i=1}^{m} prb_ij )     (2)

Here, the argmax function returns the class index with the maximum probability. The 'Average Ensemble' algorithm is briefly described in Algorithm 1.
The weighted average ensemble technique utilizes an extra weight with the softmax probabilities of the models.Given the prior weighted f1-scores of 'm' models, i.e., wf 1 , wf 2 , . . .wf m , the algorithm uses Eq. 3 to compute the outputs.
Here, wf_j denotes the weighted F1-score of each model. The 'Weighted Average Ensemble' procedure is briefly described in Algorithm 2.
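The weighted variant can be sketched in the same style, assuming the per-model weighted F1-scores are known in advance (again, the function name is ours):

```python
import numpy as np

def weighted_average_ensemble(prob_list, wf):
    """Scale each model's softmax distribution by its prior weighted
    F1-score wf_j, sum, and take the argmax per instance (Eq. 3).

    prob_list: one (n_instances, n_classes) array per model.
    wf: the models' weighted F1-scores, in the same order.
    """
    weighted = sum(w * p for w, p in zip(wf, prob_list))
    return np.argmax(weighted, axis=1)

# Toy example: the stronger model (wf = 0.77) outvotes the weaker one.
p1 = np.array([[0.55, 0.45]])  # weaker model prefers class 0
p2 = np.array([[0.40, 0.60]])  # stronger model prefers class 1
print(weighted_average_ensemble([p1, p2], [0.62, 0.77]))  # [1]
```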

V. EXPERIMENTS
The entire experiment is conducted on a multicore machine equipped with an NVIDIA GeForce GTX 960M GPU with 4 GB of graphics memory, 8 GB of physical memory (RAM), and a 2.3 GHz Intel Core i5 processor. Python v3.6 is used for all experiments. The scikit-learn (0.24.2) package is used to develop the ML models. Keras (2.4.3) with a TensorFlow (2.5.0) backend is utilized as the DL framework, and the Ktrain (0.26.3) package is employed to train the transformer models. Statistical measures such as precision (Pr), recall (Re), accuracy (Acc), and F1-score are utilized to assess and compare the models' performance.

A. RESULTS
The evaluation results of the individual models on the test set are presented in Table 7, where the best models are determined by the weighted F1-score. The results of the ensemble-based approaches for various combinations of the transformer models are reported in Table 8.

2) DL-BASED APPROACH
Among the DL models, CNN+BiLSTM with GloVe outperformed the remaining DL-based models on all evaluation measures, obtaining the highest F1-score of 63.39%, which is approximately 4.67% lower than the best ML approach (i.e., LR + TF-IDF).

3) TRANSFORMER-BASED APPROACH
There is a considerable improvement in all scores with the transformer-based models. The m-BERT acquired the lowest F1-score among all transformer-based models, only 62.25%, which is even lower than the best-performing ML model. The Bangla-BERT-1, on the other hand, has a nearly 5.99% higher F1-score (i.e., 68.24%) than m-BERT. Compared to Bangla-BERT-1 and m-BERT, the XLM-R model improved significantly and obtained an F1-score of 72.54%. The Bangla-BERT-2 model was developed solely for the Bengali language with a larger corpus than Bangla-BERT-1. As expected, Bangla-BERT-2 outperformed Bangla-BERT-1 and the aforementioned transformer-based models with an F1-score of 74.77%. Among all models, the Indic-DistilBERT obtained the highest F1-score of 77.11%, beating m-BERT, XLM-R, Bangla-BERT-1, and Bangla-BERT-2 by 14.86%, 4.57%, 8.87%, and 2.34% in F1-score, respectively. The best-performing model card is available at the Hugging Face model hub.6
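The quoted gaps are absolute F1-score differences, which can be recomputed directly from the reported scores (a simple arithmetic check; the scores are taken from Table 7 as stated above):

```python
# Weighted F1-scores (%) of the transformer baselines, as reported in Table 7.
f1 = {
    "m-BERT": 62.25,
    "Bangla-BERT-1": 68.24,
    "XLM-R": 72.54,
    "Bangla-BERT-2": 74.77,
    "Indic-DistilBERT": 77.11,
}

best = f1["Indic-DistilBERT"]
gaps = {name: round(best - score, 2)
        for name, score in f1.items() if name != "Indic-DistilBERT"}
print(gaps)
```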

4) ENSEMBLE-BASED APPROACHES
After analyzing the performance of the individual (i.e., base) models, we analyzed the ensembles of the pre-trained transformers. Table 8 illustrates the evaluation scores of various ensemble sets on the test data concerning the average and weighted ensembles.
The ensembling results indicate that the ensemble set En-22, using the weighted-average ensemble approach, demonstrated the best performance with the highest precision (80.45%), recall (80.16%), and F1-score (80.24%). Thus, the results confirm that, among a total of 38 models, the weighted-average ensemble model (Bangla-BERT-2 + XLM-R + Indic-DistilBERT + Bangla-BERT-1) is the best-performing model for classifying textual emotion in Bengali. We therefore refer to it as TEmoX.

B. ERROR ANALYSIS
Table 8 demonstrates that ensemble set En-22 (Bangla-BERT-2 + XLM-R + Indic-DistilBERT + Bangla-BERT-1) is the best-performing ensemble model for classifying textual emotion in Bengali. A detailed error analysis is conducted to gain additional insight into the performance of the proposed method.
1) QUANTITATIVE ANALYSIS
Fig. 6 depicts the class-wise fractions of predicted labels in the confusion matrix.
The confusion matrix reveals that a few data instances are not classified correctly. Among the 76 anger instances, 9 were predicted as disgust. The same scenario can be noticed in the disgust class, where 11 instances were classified as anger. In the fear class, 9 out of 87 instances were incorrectly classified as sadness, whereas 7 out of 119 instances in the sadness class were incorrectly classified as fear. Furthermore, out of 72 instances in the surprise class, 11 were predicted as joy; the misclassification ratio is the highest in this class. These misclassifications can be explained by the Jaccard similarity of the corpus (Table 3): overlapping of the most frequent words can hamper the classification task. Moreover, anger can sometimes be expressed as disgust, so the sentence patterns of the two classes can be similar; the same holds for the other two class pairs (joy-surprise and sadness-fear). The error analysis reveals that the disgust class achieved the highest rate of correct classification (82.05%), while the surprise class achieved the lowest (61.84%). Table 9 presents the error rates of the various approaches, where the proposed ensemble of transformers achieved the lowest error rate of 19.84%.
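The class-wise correct-classification rates discussed above follow from the confusion matrix as diagonal-over-row-sum ratios; a minimal sketch with an illustrative (not the actual) two-class matrix:

```python
import numpy as np

def classwise_correct_rate(cm):
    """Per-class correct-classification rate from a confusion matrix
    whose rows are true classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

# Illustrative matrix echoing the anger/disgust confusion: 9 of 76
# anger instances predicted as disgust, 11 disgust instances as anger.
cm = [[67, 9],
      [11, 65]]
rates = classwise_correct_rate(cm)
print([round(r, 4) for r in rates])  # [0.8816, 0.8553]
```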
2) QUALITATIVE ANALYSIS
Table 10 presents the predicted labels of the utilized transformer models for certain instances in comparison to their actual labels. It is observed that while one model correctly predicts the label for a sample, another does not. The ensemble method (En-22) addresses these inconsistencies by averaging the weighted softmax probability distributions of the models and then predicting the class with the highest weighted-average probability. Nonetheless, classifying texts whose words are spread across different classes remains a hurdle, leading to higher misclassification rates in some models. A contextual review of such texts might pave the way for improved classification models.
A high degree of class imbalance in the corpus might be a probable cause of inaccurate predictions. Also, some words often appear across various contexts and multiple classes. The elevated Jaccard index values (Table 3) reveal certain patterns: for instance, derogatory terms might reflect both anger and disgust. The same scenario can be observed for the joy and surprise classes, as a surprising incident might result in a positive outcome. Moreover, the classification of emotions is inherently subjective and influenced by individual perspectives; a single statement could be interpreted in multiple ways based on personal inclinations.

C. COMPARISON WITH EXISTING TECHNIQUES
The analysis of the results demonstrated that the ensemble method En-22 emerged as the most effective model for categorizing textual emotions in Bengali. We further evaluated the effectiveness of the proposed model by comparing its performance to that of existing techniques. Several past techniques [3], [20], [25], [26], [27], [28], [29] were implemented and evaluated on BEmoC-v3. Table 11 shows the results of the comparison. The best-performing model (i.e., TEmoX) outperformed the previous models and achieved the highest F1-score of 80.24%. Moreover, to exhibit the generalizability of the proposed technique, we evaluated its performance on another Bengali emotion dataset [26] (Dataset 2), which consists of 6314 Facebook comments annotated with six emotion classes.
The comparative analysis (Table 11) exhibits that the suggested technique is more robust than the existing techniques for classifying textual emotion in Bengali. Although the proposed method performed relatively worse on Dataset 2 than on BEmoC-v3, it still outperformed the past techniques on that dataset. Several factors might explain the unsatisfactory performance on Dataset 2. For further investigation, we examined the Jaccard index of Dataset 2 and the confusion matrix of the proposed model. Table 12 presents the Jaccard index, which shows the overlap of the most frequent words among the classes. The analysis considers the top 200 most frequent words from each emotion class to determine the degree of overlap. It is evident that most class pairs have a similarity above 50%, which leads to a high chance of misclassification.
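The overlap measure used here is the Jaccard similarity between the top-200 most frequent words of each class pair; a minimal sketch (the helper `topk_jaccard` and the toy texts are ours, and real Bengali data would need proper tokenization):

```python
from collections import Counter

def topk_jaccard(texts_a, texts_b, k=200):
    """Jaccard similarity between the k most frequent words of two
    classes: |A & B| / |A | B| over the top-k word sets."""
    top_a = {w for w, _ in Counter(" ".join(texts_a).split()).most_common(k)}
    top_b = {w for w, _ in Counter(" ".join(texts_b).split()).most_common(k)}
    return len(top_a & top_b) / len(top_a | top_b)

# Toy English stand-in for two emotion classes sharing some vocabulary.
a = ["i am so angry at this", "this is bad"]
b = ["this disgusts me so bad and gross"]
print(round(topk_jaccard(a, b, k=5), 3))  # 0.25
```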
In the confusion matrix in Fig. 7, it can be noticed that the number of misclassifications is higher than the number of correct predictions in the disgust, fear, and surprise classes. The overlap of the most frequent words among the classes might be the reason for this performance. Moreover, the construction of the dataset is not clearly documented; therefore, the quality of Dataset 2 might also influence the performance.

D. PERFORMANCE OF THE PROPOSED MODEL ON BEMOC DATASETS
The previous work [63] utilized a Bengali emotion dataset called BEmoC-v1 containing 5200 texts. Das et al. [3] used an extended version of BEmoC-v1 (renamed BEmoC-v2) containing 6243 texts. Later, the dataset was extended again to 7000 texts, which we call BEmoC-v3. To investigate the performance, we evaluated the proposed model on the three versions of BEmoC (v1-v3). The datasets are partitioned into train, test, and validation sets, where the test set is kept identical so that the influence of increasing training data can be assessed.
Table 13 illustrates the performance of the proposed model on the different versions of BEmoC. The analysis demonstrates that the model performed relatively worse on BEmoC-v1 (F1-score = 69.45%) than on BEmoC-v2 (F1-score = 76.96%), as BEmoC-v1 has less data. It also shows that the model performed best on BEmoC-v3, achieving the highest F1-score (80.24%), due to its larger amount of text data.

VI. CONCLUSION
This research introduces a transformer-based framework called TEmoX, designed to identify textual emotions within Bengali text across six distinct categories. The effectiveness of the model is evaluated on a newly constructed dataset known as BEmoC-v3. The results demonstrate that the proposed approach, a weighted ensemble of XLM-R, Bangla-BERT-1, Bangla-BERT-2, and Indic-DistilBERT, achieved the best performance with an F1-score of 80.24%, outperforming all baseline models and existing techniques for classifying textual emotions in Bengali. Notably, the proposed approach exhibits substantial improvements of 12.18%, 16.85%, and 3.13% over the best-performing machine learning (ML), deep learning (DL), and transformer-based baseline models, respectively. Looking ahead, this research endeavors to broaden its scope by identifying additional emotion categories, such as love, hate, and stress, while also incorporating more diverse data into the BEmoC-v3 corpus. Additionally, future investigations will explore the applicability of the proposed model to emotions expressed through emoticons, code-mixed or code-switched data, and texts containing mixed emotions. We also intend to explore the model's performance in classifying multiclass or multilabel textual emotions in other low-resource languages, which holds promise for broadening its practical applications.

FIGURE 2. Corpus distribution concerning the number of texts vs length.

FIGURE 4. Abstract process of textual emotion classification in Bengali. Here, UW, S, D, and WF denote Unique Words, Sentences, Documents, and Word Features, respectively. Moreover, M, N, and D are the total sentences/documents, total unique words, and embedding dimensions, respectively. In the BERT Tokenizer, TOK denotes Token.
The 26 different combinations of the transformer models are named EN-1 to EN-26. The probabilities from each combination are passed to the 'Average Ensemble Algorithm' to get the output class. Let us assume that we have d test instances and m transformer models. Each model m_j classifies an instance d_i into one of the predefined categories from n_class. Thus, for each instance d_i, a model m_j gives a softmax probability distribution vector (prb[]) of size n_class, and the outputs form the matrix:

prb_11, prb_21, prb_31, ..., prb_d1
prb_12, prb_22, prb_32, ..., prb_d2
...
prb_1m, prb_2m, prb_3m, ..., prb_dm

FIGURE 5. Framework for the ensemble of transformer-based Bengali TxtEC. Ei denotes the input embedding for TOKi, and Ti represents the contextual representation of each TOKi. EN-1 to EN-26 are the 26 different ensemble combinations made from the 5 transformer models.

TABLE 7. Performance of ML, DL, and Transformer-based models on the test set.

TABLE 8. Performance of ensemble-based models. Here, the combinations of 2, 3, 4, and 5 models are shown separately with their average/weighted Pr, Re, and F1-scores.

FIGURE 7. Confusion matrix of the proposed ensemble of transformer models on Dataset 2.

TABLE 10. Data samples demonstrating the differing behavior of the transformer models. Here, MB, XR, BB1, BB2, and IDB refer to m-BERT, XLM-R, Bangla-BERT-1, Bangla-BERT-2, and Indic-DistilBERT, respectively. A denotes the actual label, and wrong predictions are marked in bold.

TABLE 1. A brief summary of previous works on Bengali textual emotion classification.

TABLE 2. A few samples of rejected sentences and modified labels after verification.

TABLE 5. Parameters used for ML models.
6. https://huggingface.co/avishek-018/bn-emotion-temox

TABLE 9. Error rate of various approaches on the test set.

TABLE 11. Summary of the performance comparison.

TABLE 13. Results of the proposed method on different versions of BEmoC.