Multi-Label Emotion Classification on Code-Mixed Text: Data and Methods

The multi-label emotion classification task aims to identify all emotions in a written text that best represent the author's mental state. In recent years, multi-label emotion classification has attracted the attention of researchers due to its potential applications in e-learning, health care, marketing, etc. Standard benchmark corpora are needed to develop and evaluate multi-label emotion classification methods. The majority of benchmark corpora were developed for the English language (monolingual corpora) using tweets. However, the multi-label emotion classification problem has not been explored for code-mixed text, for example, English and Roman Urdu, although code-mixed text is widely used in Facebook posts/comments, tweets, and SMS messages, particularly by the South Asian community. To fill this gap, this study presents a large benchmark corpus for the multi-label emotion classification task, which comprises 11,914 code-mixed (English and Roman Urdu) SMS messages. Each code-mixed SMS message was manually annotated using a set of 12 emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral (no emotion). As a secondary contribution, we applied and compared state-of-the-art classical machine learning (content-based methods: three word n-gram features and eight character n-gram features), deep learning (CNN, RNN, Bi-RNN, GRU, Bi-GRU, LSTM, and Bi-LSTM), and transfer learning-based methods (BERT and XLNet) on our proposed corpus. After extensive experimentation, the best results were obtained using state-of-the-art classical machine learning methods on word uni-grams (Micro Precision = 0.67, Micro Recall = 0.54, Micro F1 = 0.67) with a combination of the OVR multi-label and SVC single-label machine learning algorithms. Our proposed corpus is free and publicly available for research purposes to foster research on an under-resourced language (Roman Urdu).


I. INTRODUCTION
A single piece of text may contain one or more emotions. A single-label emotion classification task aims to predict only one emotion of a text. The main drawback of single-label emotion classification is that it captures only one emotion in a given text, making it difficult to completely understand the author's emotional state. Multi-label emotion classification overcomes this limitation by capturing all possible emotions in a given text; see examples in Table 1. Consequently, we can make a more accurate judgment about the emotional state of an author. Multi-label emotion classification has potential applications in various domains. For example, in e-learning, multi-label emotion classification can adjust the learning techniques in conformity with the learner. It can be helpful in health care to determine the feelings and comfort level of a patient towards the treatment. It can also be used, for example, in stock market monitoring or for prioritizing calls in a call center. In general, code-mixing can be characterized as the use of two or more languages at the same time. According to [1], more than 50% of Europeans use another language besides their mother language. The Internet is the most prominent source promoting a global, linguistic code-mixed culture. In the South Asian community, and particularly in Pakistan, code-mixed (English and Roman Urdu) text has become a preferred script for Facebook comments/posts [2], tweets [3]-[5], and daily communication using SMS messages [6]. It can be noted from these studies that the use of code-mixed digital text is increasing. Thus, there is a need to develop standard evaluation resources and methods for code-mixed texts for various applications, such as author profiling, sentiment analysis, emotion analysis, etc.
Standard evaluation resources are needed to develop, evaluate, and compare multi-label emotion classification methods. Previous studies developed few corpora for multi-label emotion classification using English tweets (monolingual) [7]- [10]. However, the problem of multi-label emotion classification is not explored for code-mixed (say, English and Roman Urdu) texts. To fulfill this research gap, the present study aims to develop a large benchmark code-mixed (English and Roman Urdu) corpus for the multi-label emotion classification task and evaluate it.
The two main objectives of this study are: (1) to develop a large benchmark code-mixed (English and Roman Urdu) SMS message corpus for the multi-label emotion classification task, and (2) to apply, evaluate, and compare state-of-the-art classical machine learning, deep learning, and transfer learning methods on the proposed corpus to investigate the most suitable methods for multi-label emotion classification on a code-mixed corpus. For the first objective, we developed a large benchmark multi-label emotion classification corpus, which contains 11,914 code-mixed (English and Roman Urdu) multi-label SMS messages, hereafter called the CM-MEC-21 corpus. In the CM-MEC-21 corpus, each code-mixed SMS message is manually annotated from a predefined set of 12 emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral (no emotion). For the second objective, we developed and applied state-of-the-art classical machine learning, deep learning, and transfer learning methods on our proposed CM-MEC-21 corpus.
We believe that our proposed CM-MEC-21 corpus will be helpful for: (1) promotion of research in an under-resourced language, i.e., Roman Urdu, (2) development of bi-lingual dictionaries for the English and Roman Urdu languages, (3) carrying out a detailed comparison of existing methods for the multi-label emotion classification task, and (4) development and evaluation of new methods for multi-label emotion classification on code-mixed text (in our case, English and Roman Urdu).
The rest of this paper is organized as follows: Section II describes the existing multi-label emotion classification corpora and methods. Section III presents the corpus compilation process used to create the proposed corpus. Section IV describes methods for multi-label emotion classification task. Section V presents the experimental setup (dataset, techniques, evaluation methodology, and evaluation measures). Results and their analysis are presented in Section VI. Finally, Section VII concludes the paper and discusses potential avenues for future work.

II. RELATED WORK
In the literature, efforts have been made to develop benchmark corpora for the emotion classification task. However, the majority of these efforts focused on single-label emotion classification. One of the most prominent efforts for the single-label emotion classification task is the series of international competitions organized by SemEval [7]. The main outcome of these competitions is a collection of benchmark corpora, which can be used for the development, comparison, and evaluation of single-label emotion classification methods. Other researchers also made efforts to develop corpora for the single-label emotion classification task [11]-[19].
Regarding the multi-label emotion classification task, we found that only three benchmark corpora have been developed. [7] developed large benchmark monolingual (only one language) corpora of English, Spanish, and Arabic for the SemEval-2018 multi-label emotion classification competition. The English corpus consists of 10,983 tweets (training instances = 6,838, validation instances = 886, and testing instances = 3,259) manually annotated by seven annotators with the presence/absence of 12 emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral (no emotion). The best system [20] applied a multi-layer self-attention mechanism with a bidirectional long short-term memory architecture and achieved an accuracy of 58.80%.
In another effort towards the development of the benchmark corpus for the multi-label emotion classification task, [8] created the CBET corpus, which contains 81,162 English tweets manually annotated with nine emotions, including anger, fear, disgust, joy, love, sadness, surprise, thankfulness, and guilt. The authors applied lexical and learning-based methods for multi-label emotion classification on the CBET corpus, and they achieved the best results on word uni-grams (Precision = 47.07).
A similar corpus is BMET corpus [9], which consists of 96,323 English tweets manually annotated for the multi-label emotion classification task using six emotions: anger, fear, joy, sadness, surprise, and thankfulness. The authors applied the Latent Variable Chain (LVC) Transformation model and achieved the best micro average F 1 score of 67.19%.
As can be observed, all existing multi-label emotion classification corpora were developed in a monolingual setup using tweets. Although SMS messages and tweets look similar, there are significant differences between them [6]. The first main difference is that the usage of SMS messages is considerably more widespread than the usage of tweets. Out of the 8.97B mobile connections worldwide, 5.67B are basic phones (non-smartphones) [21]. Thus, for individuals without Internet access, plain SMS messaging is the only texting option. Recently, organizations such as WhatsApp, Facebook, and banks have been using SMS message services for user profile verification, security, and confirmation [6]. The second main difference is that most tweets are public, whereas SMS messages are private. In contrast to the SMS communication medium [6], when a user broadcasts a tweet on Twitter, anyone with an Internet connection can access and interact with it by default, although followers-only settings also exist.
On the other hand, SMS messages are focused on private messaging: a small audience, one-to-one transmission, or group communication. Therefore, SMS can be called a personal, conversational medium, while tweets are more social. Texting patterns consequently differ between public and personal circles. Considering these significant differences between SMS messages and tweets, we can conclude that tweets cannot be treated as an alternative to an SMS corpus, even though a tweet corpus is easier to compile.
To summarize, existing multi-label emotion classification corpora are mainly developed for the monolingual data (English) for tweets. However, there is no benchmark multi-label emotion classification corpus for code-mixed text, for example, English and Roman Urdu. This study presents a large code-mixed (English and Roman Urdu) corpus for the multi-label emotion classification task comprising 11,914 code-mixed SMS messages, manually annotated with 12 emotion categories. To the best of our knowledge, no such corpus was developed in the past.

III. CORPUS COMPILATION PROCESS
The primary objective of this research is to develop a large standard benchmark code-mixed (English and Roman Urdu) corpus for the multi-label emotion classification task. This section describes the source data, annotation process, characteristics, and standardization of our proposed corpus.

A. SOURCE DATA
To develop a large benchmark code-mixed corpus for the multi-label emotion classification task, we manually selected the data from the existing benchmark SMS-AP-18 corpus [6]. The SMS-AP-18 corpus was developed for code-mixed (English and Roman Urdu) author profiling based on SMS messages. Each document is an author profile containing that author's collection of SMS messages. The SMS-AP-18 corpus contains 810 author profiles (a total of 84,694 code-mixed SMS messages), on average 104.56 code-mixed SMS messages per profile. The SMS-AP-18 corpus is annotated with seven different author traits: gender, age group, native city, native language, personality type, education level, and profession. The reason for selecting the SMS-AP-18 corpus for this study is that it is the only benchmark, publicly available corpus containing code-mixed (English and Roman Urdu) SMS messages written by individuals.
To develop our proposed CM-MEC-21 corpus, we manually selected a subset of 12,000 SMS messages from the first 168 author profiles of the SMS-AP-18 corpus. The reason for selecting a subset of 12,000 code-mixed SMS messages is that it would be very challenging and time-consuming to annotate the entire SMS-AP-18 corpus for multi-label emotion classification on code-mixed (English and Roman Urdu) SMS messages. We only selected messages at least five words long. We did not perform any pre-processing on the 12,000 code-mixed SMS messages because the SMS-AP-18 corpus was already pre-processed.
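The length-based selection rule above can be sketched as follows. This is an illustration, not the actual selection code used for the corpus; the function name and sample messages are hypothetical.

```python
# Sketch of the message-selection rule: keep only SMS messages of at
# least five whitespace-separated words. Sample messages are invented.

def select_messages(messages, min_words=5):
    """Return messages that are at least min_words words long."""
    return [m for m in messages if len(m.split()) >= min_words]

sample = [
    "ok",                               # 1 word  -> dropped
    "kal milte hain office me please",  # 6 words -> kept
    "I am on my way home now",          # 7 words -> kept
]
kept = select_messages(sample)
```

Applied to the 168 selected author profiles, a filter of this shape yields the 12,000-message candidate pool described above.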

B. ANNOTATION PROCESS
This section presents the annotation process used to develop our proposed CM-MEC-21 corpus, including preparing the annotation guidelines, performing the annotations, and calculating the Inter-Annotator Agreement (IAA).

1) ANNOTATION GUIDELINES
The proposed CM-MEC-21 corpus was manually annotated by three annotators (A, B, and C). All the annotators were graduates, experienced in text annotation, and native Urdu speakers with high English language proficiency. To facilitate the annotation process, we prepared a set of annotation guidelines, which were used to manually annotate the code-mixed (English and Roman Urdu) SMS messages for multi-label emotion classification. Since this research focuses on multi-label emotion classification, annotators were asked to assign multiple labels (emotions) to a code-mixed SMS message. However, the first label should be the most dominant among all the labels assigned to a message. If a code-mixed SMS message does not contain any emotion, only one label is assigned, i.e., neutral. Following the set of emotion categories used in the SemEval-2018 international competition on multi-label emotion classification of English tweets, annotators assigned one or more emotions to each code-mixed SMS message from the following twelve categories. The annotation process was performed using Excel files and completed in three months.
Definitions of the emotions are as follows:
• Anger: Anger, also known as annoyance or rage, is an extreme emotional condition. It includes an awkward and bitter response to an anticipated incitement, danger, or hurt [22].
• Anticipation: Anticipation is an emotion including delight, excitement, or nervousness/anxiety because of, or in expectation of, an event [23]; it also includes interest, hope, and prospect.
• Disgust: Disgust is a reaction to denial or refusal to something conceivably contagious [24], which also includes disinterest, loathing, and dislike.
• Fear: Fear is a feeling persuaded by an anticipated troublesome situation, or danger [25], which also includes anxiety, panic, and horror.
• Joy: Joy is a sensation of great pleasure, and happiness [26], which also includes ecstasy, pride, and delight.
• Love: Love encloses a range of positive and strong emotional states, from the magnificent virtue or good habit, the sound interpersonal affection, and the simplest pleasure [22], also includes adoration and affection.
• Optimism: Optimism is a mental attitude contemplating a trust or hope that the reaction of some particular aim, or conclusion in general, will be positive, supportive, and desirable [27], also includes confidence, certainty, and hopefulness.
• Pessimism: In general, pessimists are likely to focus on the negatives of life and have a depressed or negative mindset; pessimism also includes distrust, cynicism, and lack of confidence [28].
• Surprise: Surprise is a mental state that a person might feel if something unanticipated occurs [30], which also includes amazement and distraction.
• Trust: Usually refers to a circumstance defined by the following aspects: One group (trustor) is ready to depend on the activities of another group (trustee); the situation is directed to the future [31], also includes confidence, belief, and faith.
• Neutral: There is no emotion(s) in a sentence.

2) ANNOTATIONS
Annotations were performed by three annotators (A, B, and C). All the annotators were native speakers of Urdu and had a high level of proficiency in the English language. Annotations were carried out in two steps. In the first step, a subset of 200 code-mixed (Roman Urdu and English) SMS messages was annotated by annotators A, B, and C using the annotation guidelines. Annotators discussed the annotations of 200 code-mixed SMS messages and revised the annotation guidelines to improve the quality of annotations further. In the second step, revised annotation guidelines were used to annotate the remaining 11,800 code-mixed SMS messages.
To select the set of gold standard labels (emotions) for each code-mixed SMS message, we used the following guidelines. A code-mixed SMS message was annotated with only those labels, which were assigned by at least two annotators. If an SMS message did not have a single label that at least two annotators assigned, it was discarded. Out of 12,000 code-mixed SMS messages, 86 were discarded, and the final gold standard corpus contained 11,914 code-mixed SMS messages.
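The gold-standard labeling rule above (keep a label only if at least two of the three annotators assigned it; discard the message if no label qualifies) can be sketched as follows. The annotations shown are hypothetical.

```python
from collections import Counter

def gold_labels(annotations):
    """Given the per-annotator label sets for one message, keep labels
    assigned by at least two annotators; return None (message discarded)
    when no label reaches that threshold."""
    counts = Counter(label for ann in annotations for label in ann)
    agreed = {label for label, c in counts.items() if c >= 2}
    return agreed or None

# Hypothetical annotations by annotators A, B, and C for one message:
msg = [{"joy", "love"}, {"joy"}, {"joy", "optimism"}]
assert gold_labels(msg) == {"joy"}  # only "joy" has at least two votes

# No label shared by two annotators -> the message is discarded:
assert gold_labels([{"anger"}, {"fear"}, {"trust"}]) is None
```

Under this rule, 86 of the 12,000 messages were discarded, leaving the 11,914 messages of the gold-standard corpus.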

3) INTER-ANNOTATOR AGREEMENT
After the annotations, we computed Cohen's Kappa coefficient and the inter-rater agreement, the latter defined as the percentage of times each pair of annotators agreed. We achieved a moderate Cohen's Kappa coefficient of 0.620 and an inter-rater agreement of 0.618. It can be noted that these agreement scores are lower than those of the SemEval-2018 shared task on English (monolingual) tweets (inter-rater agreement = 83.38) [32]. According to previous studies, human annotators agreed only approximately 70-80% of the time for binary or ternary classification schemes, and the more classes there are, the harder it is for annotators to agree [33]-[36]. Our CM-MEC-21 corpus consists of 12 classes, which highlights that code-mixed multi-label emotion annotation is a task on which it is difficult for humans to agree.
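For illustration, pairwise percent agreement and Cohen's Kappa for a pair of annotators can be computed as below. The label sequences are hypothetical, and a single label per message is assumed for simplicity (the actual corpus is multi-label, so the authors' computation may differ).

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical dominant labels from two annotators on six messages:
ann_a = ["joy", "anger", "joy", "neutral", "joy", "fear"]
ann_b = ["joy", "anger", "sad", "neutral", "joy", "joy"]
```

On these toy sequences, observed agreement is 4/6 and kappa works out to 0.52, illustrating how kappa discounts agreement expected by chance.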

C. CORPUS CHARACTERISTICS AND STANDARDIZATION
The proposed CM-MEC-21 corpus contains 11,914 code-mixed SMS messages. The minimum and maximum lengths of code-mixed SMS messages in the CM-MEC-21 corpus are 5 and 99 words, respectively. The average length of a code-mixed SMS message is 12 words. The proposed corpus contains a total of 141,997 words and 15,660 word types (unique words). Table 2 shows the percentage of code-mixed (English and Roman Urdu) SMS messages annotated with each type of emotion. These percentages sum to more than 100% because a single code-mixed SMS message may be annotated with more than one label (emotion). It can be noted that the neutral, anticipation, joy, trust, and optimism labels have higher percentages, while pessimism, anger, and surprise are rare emotions. Table 3 indicates the number of labels assigned to code-mixed (English and Roman Urdu) SMS messages. We standardized the proposed CM-MEC-21 corpus in CSV format and made it publicly available for research purposes.

IV. METHODS FOR MULTI-LABEL EMOTION CLASSIFICATION
To demonstrate how our proposed CM-MEC-21 corpus can be used to develop, evaluate, and compare methods for the multi-label emotion classification task, we applied and compared three main types of popular and widely used supervised machine learning methods: (1) state-of-the-art classical machine learning methods (content-based methods), (2) state-of-the-art deep learning methods, and (3) state-of-the-art transfer learning methods. As far as we know, no previous study has made such a detailed and thorough comparison of state-of-the-art methods for the multi-label emotion classification task on code-mixed SMS messages. Below we describe these methods in detail.

A. CONTENT-BASED METHODS
Words and characters carry the contextual content of a text; their sequence and structure provide important signals for classifying texts. In earlier studies, [37] used a content-based approach for emotion classification on the responses of psychologist and non-psychologist students. Wang [38] and Ameer et al. [39] applied content-based methods for automatic emotion identification in texts.

1) N-GRAM FEATURES
The content-based methods for the emotion classification task are based on n-grams taken from the code-mixed (English and Roman Urdu) SMS messages. The term n-gram (of characters or words) refers to a series of sequential tokens in a sentence, paragraph, or document. A group of n-grams can be generated by sliding a window of n tokens over the text, one token at a time. The series can be of length 1 (uni-grams), length 2 (bi-grams), length 3 (tri-grams), etc. N-grams are very widely used in natural language processing.
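The sliding-window construction above can be sketched in a few lines, for both word and character n-grams; the example phrase is illustrative.

```python
def ngrams(seq, n):
    """All contiguous n-grams of a sequence: pass a list of words for
    word n-grams or a string for character n-grams."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

msg = "mjy b yaad rkhna"  # a short illustrative Roman Urdu phrase

word_bigrams = ngrams(msg.split(), 2)          # window over words
char_trigrams = ngrams(msg.replace(" ", ""), 3)  # window over characters
```

Here `word_bigrams` contains the three overlapping word pairs of the phrase, and `char_trigrams` contains every three-character substring after spaces are removed.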
As several examples, let us mention the following research works. Mohammad [18] used word uni-grams and word bigrams for emotion classification on a corpus of newspaper headlines. Mohammad et al. [40] applied the word uni-grams and bi-grams along with punctuation marks, elongated words, emotion lexicons, and negation features to detect the emotional state and the stimulus of the authors of tweets on a corpus of 2012 US presidential elections. Mohammad et al. [13] used word n-grams, elongated words, and features associated with emotions for the emotion detection task. Content-based methods are also used in other tasks, for example, the author profiling task [6], [41]- [44].
Our study applied character and word n-grams for the multi-label emotion classification task on code-mixed SMS messages (English and Roman Urdu). We used TF-IDF values for n-grams, which is the most common solution (Scikit-learn was used for the implementation of the models). The maximum number of features for each experiment was 1,000, i.e., we used n-grams with the highest TF-IDF values. The length of n-grams was from 1 to 3 for word n-grams and from 3 to 10 for character n-grams, which are commonly used values.
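The feature-extraction setup described above can be sketched with Scikit-learn's `TfidfVectorizer`, using toy messages in place of the real corpus. The parameter values follow the description above; this is an illustration, not the authors' exact code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative messages; the real corpus has 11,914 SMS messages.
docs = [
    "kal exam hai I am so nervous",
    "congrats yar bohat khushi hui",
    "I am very happy for you dost",
]

# Word uni-grams with TF-IDF weighting, capped at 1,000 features
# as in the experiments above (ngram_range goes up to (3, 3) for
# word tri-grams).
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1),
                           max_features=1000)
X_word = word_vec.fit_transform(docs)

# Character tri-grams; the paper explores character n from 3 to 10.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3),
                           max_features=1000)
X_char = char_vec.fit_transform(docs)
```

Each resulting matrix has one row per message and at most 1,000 TF-IDF-weighted n-gram columns, ready to feed into the classifiers of Section V.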

B. DEEP LEARNING-BASED METHODS
The second type of method is deep learning, in which different state-of-the-art neural network models were applied [45], [46]. The series of SemEval [7], [20] emotion classification competitions played an important role in the development of emotion classification. We noticed that the most widely used models were Convolutional Neural Networks, Recurrent Neural Networks, Long Short-Term Memory, and Bidirectional Long Short-Term Memory.
Rana [47] explored gated neural networks for emotion classification from noisy speech and achieved promising results. Kim [48] trained a CNN model for text emotion classification and obtained a good classification effect. We applied seven state-of-the-art deep learning models for multi-label emotion classification on code-mixed (English and Roman Urdu) SMS messages: Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM), Gated Recurrent Units (GRU), Bidirectional GRU (Bi-GRU), Recurrent Neural Networks (RNN), Bidirectional RNN (Bi-RNN), and Convolutional Neural Networks (CNN).
We used a Scikit-learn implementation of the deep learning models with the following, mostly default, parameters: hidden layers = 3, hidden units = 64, number of epochs = 10, batch size = 64, and dropout = 0.001. The parameters of the CNN model are as follows: activation function = Rectified Linear Units (ReLU), optimizer = Adam, hidden layers = 3, loss function = sigmoid, number of epochs = 10, batch size = 64, dropout = 0.001.

C. TRANSFER LEARNING-BASED METHODS

Bidirectional Encoder Representations from Transformers (BERT)
BERT [49] is one of the most popular advanced techniques for NLP problems. The BERT model provided state-of-the-art performance across various NLP tasks without any significant task-specific architecture alterations. BERT was primarily employed in aspect-based sentiment analysis, such as in [50]-[52]. Several other studies focused on emotion analysis using BERT. For example, in [53], the authors conducted a comparative analysis of multiple pre-trained transformer models, including BERT, for the text emotion recognition problem. However, our study differs from the earlier one, as we assess the emotion classification models' performance on multi-label code-mixed SMS messages, which is significantly more challenging. In our study, the pre-trained uncased version of the BERT base model was applied, i.e., the text is transformed to lowercase before the word tokenization step. The BERT base model comprises 12 transformer encoder layers, each containing a multi-head self-attention sub-layer (12 attention heads) and a feed-forward sub-layer, with a hidden size of 768.

The pre-trained XLNet base model was also applied in this work. Its architecture consists of 12 transformer layers with a hidden size of 768 and 12 attention heads. The XLNet tokenizer was used to split the sequences into tokens; the tokens were then padded, and classification was performed.
For the multi-label emotion classification task on our proposed CM-MEC-21 corpus, we added a fully connected layer and a sigmoid layer to these models. The batch size and learning rate were set to 32 and 2e-5, respectively. The models were optimized using the Adam optimizer, and the loss parameter was set to BCEWithLogitsLoss. The models were trained for ten epochs.
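The role of the sigmoid layer and the BCEWithLogitsLoss objective in multi-label prediction can be illustrated in NumPy: each of the 12 emotions receives an independent probability, every label whose probability exceeds 0.5 is predicted, and the loss is the numerically stable binary cross-entropy on raw logits. The logit values and targets below are hypothetical.

```python
import numpy as np

EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust",
            "neutral"]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy on raw logits, the same
    quantity PyTorch's BCEWithLogitsLoss averages over labels."""
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

# Hypothetical raw scores the model emits for one message...
logits = np.array([-3.0, 2.0, -4.0, -2.5, 1.5, 0.5,
                   -1.0, -2.0, -3.0, -2.0, -1.5, -4.0])
# ...and the gold annotation (anticipation, joy, love present).
targets = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Every emotion whose sigmoid probability exceeds 0.5 is predicted.
predicted = [e for e, p in zip(EMOTIONS, sigmoid(logits)) if p > 0.5]
loss = bce_with_logits(logits, targets)
```

Because each label has its own sigmoid unit, the model can output any subset of the 12 emotions for a message, which is exactly what the multi-label task requires.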

V. EXPERIMENTAL SETUP
This section describes how our proposed CM-MEC-21 corpus can be used to develop and evaluate emotion classification methods. The following sections present the dataset, techniques, evaluation methods, and evaluation measures in detail.

B. TECHNIQUES
We applied classical machine learning, deep learning, and transfer learning (see Section IV) techniques on three sub-corpora. Below we describe the four sets of experiments (Exp1, Exp2, Exp3, and Exp4), which were designed and carried out with different combinations of the training and testing data in the three sub-corpora. Exp1: for this experiment, we used all three sub-corpora. The reason for designing Exp1, Exp2, Exp3, and Exp4 was to investigate the performance of different techniques on different types of training and testing datasets.

C. EVALUATION METHODOLOGY
The multi-label emotion classification problem on code-mixed (English and Roman Urdu) SMS messages is treated as a supervised multi-label text classification problem. We applied two different multi-label classification strategies, One vs. Rest (OVR) and One vs. One (OVO), along with the base classifiers Random Forest, Logistic Regression, Naïve Bayes, Support Vector Machine, Bagging, and AdaBoost. The features extracted using the content-based methods (see Section IV-A) are used as input in the training and testing phases. Micro precision is the precision computed over the aggregated contributions of all classes (1), and micro recall is the corresponding recall (2):

Mi_P = (Σ_{e∈E} number of messages correctly assigned to emotion class e) / (Σ_{e∈E} number of messages assigned to emotion class e)   (1)

Mi_R = (Σ_{e∈E} number of messages correctly assigned to emotion class e) / (Σ_{e∈E} number of messages in emotion class e)   (2)
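A minimal sketch of the One vs. Rest setup with one of the base classifiers named above (a linear SVM) and micro-averaged scores, on toy data. This is an illustration under Scikit-learn, not the authors' exact pipeline; the texts and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-in for the corpus: message -> set of gold emotions.
train_texts = [
    "I am so happy and proud of you",
    "this is scary and I feel sad",
    "so happy, what a lovely surprise",
    "I am afraid things will go wrong",
]
train_labels = [{"joy", "optimism"}, {"fear", "sadness"},
                {"joy", "surprise"}, {"fear", "pessimism"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)        # binary indicator matrix
X = TfidfVectorizer().fit_transform(train_texts)

# One binary LinearSVC per emotion (One vs. Rest).
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
pred = clf.predict(X)

# Micro averaging aggregates TP/FP/FN over all emotion classes.
p = precision_score(Y, pred, average="micro", zero_division=0)
r = recall_score(Y, pred, average="micro", zero_division=0)
f = f1_score(Y, pred, average="micro", zero_division=0)
```

OVR trains one binary classifier per emotion, so a message can receive any subset of labels, and the micro scores correspond directly to equations (1)-(3).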
Micro F1 is the harmonic mean of Micro Precision (Mi_P) and Micro Recall (Mi_R):

Mi_F1 = (2 · Mi_P · Mi_R) / (Mi_P + Mi_R)   (3)

VI. RESULTS AND ANALYSIS
Tables 4, 5, 6, and 7 present the results of Exp1, Exp2, Exp3, and Exp4, respectively. In Exp1 (see Table 4), overall, the best results are obtained using word 2-grams (Micro Precision = 0.64, Micro Recall = 0.50, Micro F1 = 0.64). The results show that word 2-grams are the most suitable features when we combine English (monolingual) and Roman Urdu (monolingual) SMS messages for training (Train-E-CM-MEC-21 + Train-RU-CM-MEC-21) and test on the code-mixed SMS message data (Test-CM-CM-MEC-21). It can be noted that the Micro F1 score of 0.64 is not very high, highlighting the fact that multi-label emotion classification on code-mixed text is a challenging task. The best results obtained using deep learning (Micro F1 = 0.30) and transfer learning methods (Micro F1 = 0.26) are low. A possible reason is that the amount of training data is very small; deep learning and transfer learning methods normally require a huge amount of data for good training.

In Exp2 (see Table 5), overall, the best results are obtained using character 3-grams (Micro Precision = 0.66, Micro Recall = 0.55, Micro F1 = 0.66). This shows that character 3-grams are the most appropriate features when we both train (Train-CM-CM-MEC-21) and test (Test-CM-CM-MEC-21) on the code-mixed (English + Roman Urdu) SMS message dataset. It can be observed that the Micro F1 score of Exp2 (F1 = 0.66) is slightly better than that of Exp1 (F1 = 0.64). This highlights that training the model on combined data in two languages (Train-E-CM-MEC-21 + Train-RU-CM-MEC-21) and testing on code-mixed (English + Roman Urdu) data (Test-CM-CM-MEC-21) did not have a significant effect on the performance. However, the best results were obtained with different features (word 2-grams in Exp1 and character 3-grams in Exp2). Similar to Exp1, the results obtained with deep learning and transfer learning are low.
In Exp3 (see Table 6), overall, the best results are obtained using word 1-grams (Micro Precision = 0.67, Micro Recall = 0.54, Micro F1 = 0.67), when training on code-mixed data (Train-CM-CM-MEC-21) and testing on the combined code-mixed, English, and Roman Urdu test sets (Test-CM-CM-MEC-21 + Test-E-CM-MEC-21 + Test-RU-CM-MEC-21). In Exp4 (see Table 7), overall, the best results are obtained using word 1-grams (Micro Precision = 0.60, Micro Recall = 0.57, Micro F1 = 0.60). This shows that word 1-grams are the most suitable features when we combine English (monolingual), Roman Urdu (monolingual), and code-mixed (English + Roman Urdu) SMS messages for training. Comparing Exp3 and Exp4 shows that training models on code-mixed (Train-CM-CM-MEC-21) data is more efficient than training on a combination of English (monolingual), Roman Urdu (monolingual), and code-mixed (English + Roman Urdu) SMS messages in our proposed CM-MEC-21 corpus. Deep learning and transfer learning fail to produce promising results in these experiments. Table 8 shows the best results obtained in Exp1, Exp2, Exp3, and Exp4. Overall, in all four experiments, content-based methods outperform the deep learning and transfer learning methods. This indicates that content-based methods were more efficient for the multi-label emotion classification task when the training data was small. It can be observed that the results of the deep learning and transfer learning methods are not satisfactory. This performance highlights that multi-label emotion classification on code-mixed (English and Roman Urdu) SMS messages is challenging for deep learning and transfer learning methods, and that these complex models can find the multi-label emotion classification task difficult when the training data is small.
Regarding the length of n in the content-based methods, it can be observed from the summary Table 8 that the best performance was achieved with short n for both word and character n-grams. A possible reason for the lower performance with longer n is that longer n-grams are likely to capture unwanted, noisy, and irrelevant information for predicting multiple emotions. Consequently, the performance of the machine learning algorithms decreases.
Regarding the machine learning algorithms, a combination of OVR multi-label and NB single-label machine learning algorithms performed best (Micro Precision = 0.67, Micro Recall = 0.54, Micro F 1 = 0.67) as compared to other algorithms. One possible reason behind this is that OVR and NB are simple, efficient, and able to handle noisy data. Moreover, OVR and NB do not require a huge dataset to work well.
Regarding the combinations of experiments, in Exp1 and Exp2 there is not much variation in the results. A possible reason is that, in both cases, the training model has the features, characteristics, and knowledge of all three source languages. Considering Exp3 and Exp4, there is variation in the results. This indicates that code-mixed (Train-CM-CM-MEC-21) data is more suitable for model training than a combination of English (monolingual), Roman Urdu (monolingual), and code-mixed (English + Roman Urdu) SMS messages in our proposed CM-MEC-21 corpus.
The main findings of these four experiments are: (1) classical machine learning outperforms deep learning and transfer learning methods in all four experiments, (2) overall highest Micro F 1 of 0.67 is obtained (Exp3), indicating that multi-label emotion classification on code-mixed data is a complex task, (3) deep learning and transfer learning methods fail to give promising results when we have small training data, and (4) change in training data also changes the best-performing features in the majority of experiments, i.e., Exp1, Exp2, and Exp3.
To conclude, the overall best results (see Table 8) are obtained using state-of-the-art classical machine learning methods with word uni-grams (Micro F1 = 0.67) and the OVR multi-label machine learning algorithm, when training on code-mixed data (Train-CM-CM-MEC-21) and testing on the combined code-mixed, Roman Urdu, and English test sets (Test-CM-CM-MEC-21 + Test-E-CM-MEC-21 + Test-RU-CM-MEC-21).

VII. CONCLUSION
Code-mixed (English and Roman Urdu) text is widely used, especially in the South Asian community. However, it has not been explored for the multi-label emotion classification problem. As described in this paper, our novel contribution is a newly developed and publicly available benchmark code-mixed, multi-label, SMS message-based corpus for the multi-label emotion classification task. The corpus consists of 11,914 code-mixed multi-label SMS messages manually annotated for the presence/absence of the following 12 emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral (no emotion). In addition to the corpus creation, we applied state-of-the-art machine learning (content-based: three word n-gram features and eight character n-gram features), deep learning, and transfer learning-based methods to the multi-label emotion classification task on our proposed CM-MEC-21 corpus. The best results (see Table 8) were obtained using state-of-the-art machine learning methods with word uni-grams (Micro F1 = 0.67) and the OVR multi-label machine learning algorithm, when training on code-mixed data (Train-CM-CM-MEC-21) and testing on the combined code-mixed, Roman Urdu, and English (Test-CM-CM-MEC-21 + Test-E-CM-MEC-21 + Test-RU-CM-MEC-21) multi-label SMS messages.
In the future, we plan to apply other transfer learning-based models, such as RoBERTa and DistilBERT, to our proposed corpus. An ensemble of models will also be considered to increase classification performance.