Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances

Emotion is intrinsic to humans and consequently emotion understanding is a key part of human-like artificial intelligence (AI). Emotion recognition in conversation (ERC) is becoming increasingly popular as a new research frontier in natural language processing (NLP) due to its ability to mine opinions from the plethora of publicly available conversational data on platforms such as Facebook, YouTube, Reddit, Twitter, and others. Moreover, it has potential applications in health-care systems (as a tool for psychological analysis), education (understanding student frustration) and more. ERC is also extremely important for generating emotion-aware dialogues that require an understanding of the user's emotions. Catering to these needs calls for effective and scalable conversational emotion-recognition algorithms. However, it is a strenuous problem to solve because of several research challenges. In this paper, we discuss these challenges and shed light on the recent research in this field. We also describe the drawbacks of these approaches and discuss the reasons why they fail to successfully overcome the research challenges in ERC.


I. INTRODUCTION
Emotion is often defined as an individual's mental state associated with thoughts, feelings and behaviour. Stoics like Cicero organized emotions into four categories: metus (fear), aegritudo (pain), libido (lust) and laetitia (pleasure). Later, the evolutionary theory of emotion was initiated in the late 19th century by Charles Darwin [1]. He hypothesized that emotions evolved through natural selection and, hence, have cross-culturally universal counterparts. In recent times, Plutchik [2] categorized emotion into eight primary types, visualized by the wheel of emotions (Fig. 4). Further, Ekman [3] argued for a correlation between emotion and facial expression.
Natural language is often indicative of one's emotion. Hence, emotion recognition has been enjoying popularity in the field of NLP [4,5], due to its widespread applications in opinion mining, recommender systems, health-care, and so on. Strapparava and Mihalcea [6] addressed the task of emotion detection on news headlines. A number of emotion lexicons [7,8] have been developed to tackle the textual emotion recognition problem.
Only in the past few years has emotion recognition in conversation (ERC) gained attention from the NLP community [9,10,11,12], due to the increasing public availability of conversational data. ERC can be used to analyze conversations that take place on social media. It can also aid in analyzing conversations in real time, which can be instrumental in legal trials, interviews, e-health services and more.
Unlike vanilla emotion recognition of sentences/utterances, ERC ideally requires context modeling of the individual utterances. This context can be attributed to the preceding utterances, and relies on the temporal sequence of utterances. Compared to the recently published works on ERC [10,11,12], both lexicon-based [13,8,14] and modern deep learning-based [4,5] vanilla emotion recognition approaches fail to work well on ERC datasets, as these works ignore conversation-specific factors such as the presence of contextual cues, the temporality in speakers' turns, or speaker-specific information. Fig. 5a and Fig. 5b show an example where the same utterance changes its meaning depending on its preceding utterance.
Task definition: Given the transcript of a conversation along with speaker information of each constituent utterance, the ERC task aims to identify the emotion of each utterance from several pre-defined emotions. Fig. 2 illustrates one such conversation between two people, where each utterance is labeled by the underlying emotion. Formally, given an input sequence of $N$ utterances $[(u_1, p_1), (u_2, p_2), \dots, (u_N, p_N)]$, where each utterance $u_i = [u_{i,1}, u_{i,2}, \dots, u_{i,T}]$ consists of $T$ words $u_{i,j}$ and is spoken by party $p_i$, the task is to predict the emotion label $e_i$ of each utterance $u_i$.
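The task definition above can be made concrete with a small data-structure sketch. Everything named here (`Utterance`, `label_conversation`, the example texts and the four-label set) is an illustrative assumption rather than part of any dataset or published model; the stub only shows the input and output shapes of the task.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Utterance:
    """One turn (u_i, p_i): the words of utterance u_i and its speaker p_i."""
    words: Sequence[str]
    speaker: str


def label_conversation(conversation: Sequence[Utterance],
                       emotions: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for an ERC model.

    Returns one label e_i per utterance u_i. This stub always picks the
    first label; a real model would use the words, speaker and context.
    """
    return [emotions[0] for _ in conversation]


# Invented two-party example; texts and label set are illustrative only.
EMOTIONS = ("neutral", "happiness", "sadness", "anger")
conversation = [
    Utterance("I still have not found a job".split(), "A"),
    Utterance("Yeah".split(), "B"),
    Utterance("You are not even listening".split(), "A"),
]
predicted = label_conversation(conversation, EMOTIONS)
```

The point of the sketch is the contract: the output has exactly one label per input utterance, and the speaker identity travels with each utterance so that a model can use it.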
Controlling variables in conversations: Conversations are broadly categorized into two categories: task-oriented and chit-chat (also called non-task-oriented). Both kinds of conversation are governed by different factors or pragmatics [15], such as topic, interlocutors' personality, argumentation logic, viewpoint, intent [16], and so on. Fig. 1 depicts how these factors play out in a dyadic conversation. Firstly, topic ($Topic$) and interlocutor personality ($P_*$) always influence the conversation, irrespective of the time. A speaker makes up his/her mind ($S^t_*$) about the reply ($U^t_*$) based on the contextual preceding utterances ($U^{<t}_*$) from both speaker and listener, the previous utterance being the most important one since it usually makes the largest change in the joint task model (for task-oriented conversations) or the speaker's emotional state (for chit-chat). Delving deeper, the pragmatic features explained by Hovy [15], like argumentation logic, interlocutor viewpoint, inter-personal relationship and dependency, and situational awareness, are encoded in the speaker state ($S^t_*$). The intent ($I^t_*$) of the speaker is decided based on the previous intent $I^{t-2}_*$ and the speaker state $S^t_*$, as the interlocutor may change his/her intent based on the opponent's utterance and the current situation. Then, the speaker formulates an appropriate emotion $E^t_*$ for the response based on the state $S^t_*$ and intent $I^t_*$. Finally, the response $U^t_*$ is produced based on the speaker state $S^t_*$, intent $I^t_*$ and emotion $E^t_*$. We surmise that considering these factors would help in representing the argument and discourse structure of the conversation, which leads to improved conversation understanding, including emotion recognition.
Fig. 1: Interaction among different variables during a dyadic conversation between persons A and B. Grey and white circles represent hidden and observed variables, respectively. P represents personality, U represents utterance, S represents interlocutor state, I represents interlocutor intent, E represents emotion and Topic represents topic of the conversation. This can easily be extended to multi-party conversations.
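The chain of dependencies described for Fig. 1 can be summarized compactly. The functions $f_S$, $f_I$, $f_E$ and $f_U$ below are hypothetical placeholders introduced only for this summary; the text names the arguments of each step but does not specify any functional form, and placing $P_*$ and $Topic$ inside $f_S$ is one reading of "always influence the conversation".

```latex
\begin{align*}
S^{t}_{*} &= f_S\left(U^{<t}_{*},\; P_{*},\; \mathit{Topic}\right)
  && \text{speaker state from preceding utterances} \\
I^{t}_{*} &= f_I\left(I^{t-2}_{*},\; S^{t}_{*}\right)
  && \text{intent from previous intent and state} \\
E^{t}_{*} &= f_E\left(S^{t}_{*},\; I^{t}_{*}\right)
  && \text{emotion from state and intent} \\
U^{t}_{*} &= f_U\left(S^{t}_{*},\; I^{t}_{*},\; E^{t}_{*}\right)
  && \text{response from state, intent and emotion}
\end{align*}
```

The index $t-2$ in the intent update reflects the alternating turns of a dyadic conversation: the same speaker's previous turn lies two steps back.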
Early computational work on dialogue focused mostly on task-oriented cases, in which the overall conversational intent and step-by-step sub-goals played a large part [17,18]. Cohen and Levesque [19] developed a model and logic to represent intentions and their connections to utterances, whose operators explicate the treatment of beliefs about the interlocutor's beliefs and vice versa, recursively. Emotion however played no role in this line of research. In more recent work, chatbots and chit-chat dialogue have become more prominent, in part due to the use of distributed (such as embedding) representations that do not readily support logical inference.
In the conversational setting, D'Mello et al. [20] and Yang et al. [21] worked with small datasets with three and four emotion labels, respectively. This was followed by Phan et al. [22], where emotion detection on conversation transcripts was attempted. Recently, several works [23,24] have devised deep learning-based techniques for ERC. These works are crucial, as we surmise an instrumental role of ERC in emotion-aware (a.k.a. affective) dialogue generation, which falls within the topic of "text generation under pragmatics constraints" as proposed by Hovy [15]. Fig. 3 illustrates one such conversation between a human (user) and a medical chatbot (health assistant). The assistant responds with emotion based on the user's input. Depending on whether the user suffered an injury earlier or not, the health assistant responds with excitement (evoking urgency) or happiness (evoking relief).
As ERC is a new research field, outlining research challenges, available datasets, and benchmarks can potentially aid future research on ERC. In this paper, we aim to serve this purpose by discussing various factors that contribute to the emotion dynamics in a conversation. We surmise that this paper will not only help researchers to better understand the challenges and recent works on ERC but also show possible future research directions. The rest of the paper is organized as follows: Section II presents the key research challenges; Sections III and IV cover the datasets and recent progress in this field; finally, Section V concludes the paper.

II. RESEARCH CHALLENGES
Recent works on ERC, e.g., DialogueRNN [11] or ICON [23], strive to address several key research challenges that make the task of ERC difficult to solve:
a) Categorization of emotions: Emotion is defined using two types of models: categorical and dimensional. A categorical model classifies emotion into a fixed number of discrete categories. In contrast, a dimensional model describes emotion as a point in a continuous multi-dimensional space.
Most dimensional categorization models [25,26] adopt two dimensions: valence and arousal. Valence represents the degree of emotional positivity, and arousal represents the intensity of the emotion. In contrast with the categorical models, dimensional models map emotion onto a continuous spectrum rather than hard categories. This enables easy and intuitive comparison of two emotional states using vector operations, whereas comparison is non-trivial for categorical models. As there are multiple categorical and dimensional taxonomies available, it is challenging to select one particular model for annotation. Choosing a simple categorization model, e.g., Ekman's model, has a major drawback, as such models are unable to ground complex emotions. On the other hand, complex emotion models such as Plutchik's model make it very difficult for the annotators to discriminate between related emotions, e.g., discerning anger from rage. Complex emotion models also increase the risk of obtaining a lower inter-annotator agreement.
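The claim that dimensional models allow comparison "using vector operations" can be illustrated with a minimal sketch. The (valence, arousal) coordinates below are invented for illustration and are not taken from any annotated resource; `VA` and `distance` are hypothetical names.

```python
import math
from typing import Dict, Tuple

# Hypothetical (valence, arousal) coordinates in [-1, 1]; illustrative only.
VA: Dict[str, Tuple[float, float]] = {
    "anger": (-0.6, 0.8),
    "rage": (-0.8, 0.9),
    "sadness": (-0.7, -0.4),
    "happiness": (0.8, 0.5),
}


def distance(a: str, b: str) -> float:
    """Euclidean distance between two emotions in valence-arousal space."""
    return math.dist(VA[a], VA[b])


# Related emotions sit close together, unrelated ones far apart.
closer = distance("anger", "rage") < distance("anger", "happiness")
```

With categorical labels, "anger" and "rage" are simply two different symbols; in the dimensional view their similarity is a number, which is the convenience the paragraph above refers to.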
The popular ERC dataset IEMOCAP [27] adopted both categorical and dimensional models. However, newer ERC datasets like DailyDialog [28] have employed only the categorical model due to its more intuitive nature. Most of the available datasets for emotion recognition in conversation adopted simple taxonomies, which are slight variants of Ekman's model. Each emotional utterance in the EmoContext dataset is labeled with one of the following emotions: happiness, sadness and anger. The majority of the utterances in EmoContext do not elicit any of these three emotions and are annotated with an extra label: others. Naturally, the inter-annotator agreement for the EmoContext dataset is higher due to its simplistic emotion taxonomy. However, the short context length and simple emotion taxonomy make ERC on this dataset less challenging.
b) Basis of emotion annotation: Annotation with emotion labels is challenging, as the label depends on the annotator's perspective. Self-assessment by the interlocutors in a conversation is arguably the best way to annotate utterances. However, in practice it is infeasible, as real-time tagging of unscripted conversations would impact the conversation flow. Post-conversation self-annotation could be an option, but it has not been done yet.
As such, many ERC datasets [27] are scripted and annotated by a group of people uninvolved with the script and conversation. The annotators are given the context of the utterances as prior knowledge for accurate annotation. Often pre-existing transcripts are annotated for quick turn-around, as in EmotionLines [10].
The annotators also need to be aware of the interlocutors' perspective for situation-aware annotation. For example, the emotion behind the utterance "Lehman Brothers' stock is plummeting!!" depends on whether the speaker benefits from the crash. The annotators should be aware of the nature of the association between the speaker and Lehman Brothers for accurate labeling.
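Inter-annotator agreement, raised above as a cost of complex taxonomies and of perspective-dependent labels, is commonly quantified with a chance-corrected statistic. The sketch below implements Cohen's kappa for two annotators as one such measure; it is offered as an illustration only, since the datasets discussed here may report a different statistic (for example, Fleiss' kappa when there are more than two annotators). The annotation lists are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented annotations of four utterances by two annotators.
annotator_1 = ["anger", "anger", "sadness", "sadness"]
annotator_2 = ["anger", "anger", "sadness", "anger"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A finer taxonomy (for instance, splitting anger into anger and rage) gives annotators more ways to disagree, which tends to lower this kind of score.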
c) Conversational context modeling: Context is at the core of NLP research. According to several recent studies [29,30], contextual sentence and word embeddings can improve the performance of state-of-the-art NLP systems by a significant margin.
The notion of context can vary from problem to problem. For example, while calculating word representations, the surrounding words carry contextual information. Likewise, to classify a sentence in a document, other neighboring sentences are considered as its context. In Poria et al. [31], surrounding utterances are treated as context and they experimentally show that contextual evidence indeed aids in classification.
Similarly, in conversational emotion detection, to determine the emotion of an utterance at time $t$, the preceding utterances at time $< t$ can be considered as its context. However, computing this context representation often exhibits major difficulties due to emotional dynamics. Emotional dynamics of conversations consist of two important aspects: self and inter-personal dependencies [32]. Self-dependency, also known as emotional inertia, deals with the aspect of emotional influence that speakers have on themselves during conversations [33]. On the other hand, inter-personal dependencies relate to the emotional influences that the counterparts induce in a speaker. Conversely, during the course of a dialogue, speakers also tend to mirror their counterparts to build rapport [34]. This phenomenon is illustrated in Fig. 2. Here, $P_a$ is frustrated over her long-term unemployment and seeks encouragement ($u_1$, $u_3$). $P_b$, however, is pre-occupied and replies sarcastically ($u_4$). This enrages $P_a$ and elicits an angry response ($u_6$). In this dialogue, emotional inertia is evident in $P_b$, who does not deviate from his nonchalant behavior. $P_a$, however, gets emotionally influenced by $P_b$. Modeling self and inter-personal relationships and dependencies may also depend on the topic of the conversation as well as various other factors like argument structure, interlocutors' personality, intents, viewpoints on the conversation, attitude towards each other, etc. Hence, analyzing all these factors is key for true self and inter-personal dependency modeling that can lead to enriched context understanding.
The contextual information can come from both local and distant conversational history. While the importance of local context is more obvious, as stated in recent works, distant context often plays a less important role in ERC. Distant contextual information is useful mostly in the scenarios when a speaker refers to earlier utterances spoken by any of the speakers in the conversational history.
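The distinction between local and distant conversational history can be sketched as a simple split of the utterances preceding the target. The window size `k`, the function name `split_context` and the example dialogue are assumptions made for illustration; published models typically learn how much weight to give each part rather than using a hard cut.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance text)


def split_context(conversation: List[Turn], t: int,
                  k: int) -> Tuple[List[Turn], List[Turn]]:
    """Split the history of utterance t (0-based) into (local, distant).

    local   : the k utterances immediately preceding t, oldest first
    distant : everything before the local window, oldest first
    """
    if not 0 <= t < len(conversation):
        raise IndexError("t is out of range")
    if k < 0:
        raise ValueError("k must be non-negative")
    history = conversation[:t]          # only utterances at time < t
    cut = max(0, len(history) - k)
    return history[cut:], history[:cut]


# Invented dialogue; the target is the final short utterance.
dialogue = [
    ("A", "I sent out ten more applications today."),
    ("B", "Mm-hm."),
    ("A", "Nobody ever calls back."),
    ("B", "Something will turn up, I guess."),
    ("A", "Yeah"),
]
local, distant = split_context(dialogue, t=4, k=2)
```

For a short target such as "Yeah", the local window carries most of the signal, while the distant part matters mainly when the speaker refers back to something said earlier.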
The usefulness of context is more prevalent in classifying short utterances, like "yeah", "okay", "no", that can express different emotions depending on the context and discourse of the dialogue. The examples in Fig. 5a and Fig. 5b explain this phenomenon. The emotions expressed by the same utterance "Yeah" in both these examples differ from each other and can only be inferred from the context.
Finding contextualized conversational utterance representations is an active area of research. Leveraging such contextual clues is a difficult task. Memory networks, RNNs, and attention mechanisms have been used in previous works, e.g., HRLCE or DialogueRNN, to grasp information from the context.
d) Speaker-specific modeling: Individuals have their own subtle ways of expressing emotions. For instance, some individuals are more sarcastic than others. In such cases, the usage of certain words varies depending on whether they are being sarcastic. Consider this example: $P_a$: "The order has been cancelled." $P_b$: "This is great!". If $P_b$ is a sarcastic person, then his response expresses negative emotion about the order being cancelled through the word great. On the other hand, $P_b$'s response, great, could be taken literally if the cancelled order is beneficial to $P_b$ (perhaps $P_b$ cannot afford the product he ordered). Since necessary background information is often missing from the conversations, speaker profiling based on preceding utterances often yields improved results.
e) Listener-specific modeling: During a conversation, the listeners make up their mind about the speaker's utterance as it is spoken. However, there is no textual data on the listener's reaction to the speaker. A model must resort to the visual modality to model the listener's facial expression in order to capture the listener's reaction. However, according to DialogueRNN, capturing listener reaction does not yield any improvement, as the listener's subsequent utterance carries their reaction. Moreover, if the listener never speaks in a conversation, his/her reaction remains irrelevant. Nonetheless, listener modeling can be useful in scenarios where continuous emotion recognition of every moment of the conversation is necessary, like audience reaction during a political speech, as opposed to emotion recognition of each utterance.
f) Presence of emotion shift: Due to emotional inertia, participants in a conversation tend to stick to a particular emotional state, unless some external stimuli, usually the other participants, invoke a change. This is illustrated in Fig. 6, where Joey changes his emotion from neutral to anger due to the last utterance of Chandler, which was unexpected and rather shocking to Joey. This is a hard problem to solve, as the state-of-the-art ERC model, DialogueRNN, is more accurate in emotion detection for utterances without emotional shift or when the shift is to a similar emotion (e.g., from fear to sadness). It is equally interesting and challenging to track the development, shift and dynamics of the participants' opinions on the topic of discussion in the conversation.
g) Fine-grained emotion recognition: Fine-grained emotion recognition aims at recognizing emotion expressed on explicit and implicit topics. It involves a deeper understanding of the topic of the conversation and of the interlocutors' opinions and stands. For example, in Fig. 7, while both persons take a supportive stand on the government's bill, they use completely opposite emotions to express it. It is not possible for a vanilla emotion recognizer to understand the positive emotion of both interlocutors on the aspect of the government's bill. Only by interpreting Person 2's frustration about the opposition's protest against the bill can a classifier infer Person 2's support for the bill. On the other hand, even though Person 1 does not explicitly express his/her opinion on the opposition, from the discourse of the conversation it can be inferred that Person 1 holds a negative opinion of the opposition.
h) Multiparty conversation: In a multiparty conversation, more than two participants are involved. Naturally, emotion recognition in such conversations is more challenging in comparison with dyadic conversations due to the difficulty in tracking individual speaker states and handling co-references.
i) Presence of sarcasm: Sarcasm is a linguistic tool that uses irony to express contempt. An ERC system incapable of detecting sarcasm mostly fails to predict the emotion of sarcastic utterances correctly. Sarcasm detection in a conversation largely depends on the context and discourse of the conversation. For example, the utterance "The part where Obama signed it" can only be detected as sarcastic if we look at the previous utterance "What part of this would be unconstitutional?". Sarcastic nature is also person-dependent, which again warrants speaker profiling in the conversation.
j) Emotion reasoning: The ability to reason is necessary for any explainable AI system. In the context of ERC, it is often desired to understand the cause of an emotion expressed by a speaker. As an example, we can refer to Fig. 2. An ideal ERC system, with the ability of emotion reasoning, should perceive the reason for Person A's anger, expressed in $u_6$ of Fig. 2. It is evident upon observation that this anger is caused by the persistent nonchalant behavior of Person B. Readers should not conflate emotion reasoning with context modeling, which we discuss earlier in this section. Unlike context modeling, emotion reasoning not only finds the contextual utterances in the conversational history that trigger the emotion of an utterance, but also determines the function of those contextual utterances on the target utterance. In Fig. 2, it is the indifference of Person B, reflected by $u_4$ and $u_5$, that makes Person A angry. Similarly, in Fig. 6, Joey expresses anger once he ascertains Chandler's deception in the previous utterance. It is hard to define a taxonomy or tagset for emotion reasoning. At present, there is no available dataset which contains such rich annotations. Building such a dataset would enable future dialogue systems to frame meaningful argumentation logic and discourse structure, taking one step closer to human-like conversation.

III. DATASETS
In the last few years, emotion recognition in conversation has gained major research interest, mainly because of its potential application in dialogue systems to generate emotion-aware and empathetic dialogues [12]. The primary goal of the ERC task is to label each utterance in the conversation with an emotion label. In this section, we discuss the publicly available ERC datasets as well as the shortcomings of these datasets.
There are a few publicly available datasets for ERC: IEMOCAP [27], SEMAINE [35], EmotionLines [10], MELD [36], DailyDialog [28] and EmoContext [37]. A detailed comparison of these datasets is drawn in Table II. Of these six datasets, IEMOCAP, SEMAINE and MELD are multimodal (containing acoustic, visual and textual information) and the remaining three are textual. Apart from the SEMAINE dataset, the rest of the datasets contain categorical emotion labels. In contrast, each utterance of the SEMAINE dataset is annotated with four real-valued affective attributes: valence ([-1, 1]), arousal ([-1, 1]), expectancy ([-1, 1]) and power ([0, ∞)). We also show the emotion label distribution of these datasets in Table I. In the EmoContext dataset, an emotion label is assigned to only the last utterance of each dialogue. None of these datasets can be used for emotion reasoning, as they lack the annotation details required for the reasoning task. Readers should also note that none of these datasets contain fine-grained and topic-level emotion annotation.

Table I: Emotion label distribution of the datasets.
Label          | DailyDialog | MELD | EmotionLines | IEMOCAP | EmoContext
Happiness/Joy  | 12885       | 2308 | 1710         | 648     | 4669
Surprise       | 1823        | 1636 | 1658         | -       | -
Sadness        | 1150        | 1002 | 498          | 1084    | 5838
Anger          | 1022        | 1607 | 772          | 1103    | 5954
Disgust        | 353         | 361  | 338          | -       | -
Fear           | 74          | 358  | 255          |         |

IV. RECENT ADVANCES
In this section we give a brief introduction to the recent work on this topic. We also compare the approaches and report their drawbacks.
As depicted in Fig. 1, recognizing emotion of an utterance in a conversation primarily depends on these following three factors: 1) the utterance itself and its context defined by the interlocutors' preceding utterances in the conversation, as well as intent and the topic of the conversation, 2) the speaker's state comprising variables like personality and argumentation logic and, 3) emotions expressed in the preceding utterances.
Although IEMOCAP and SEMAINE were developed almost a decade ago, most of the works that used these two datasets did not consider the aforementioned factors.
a) Benchmarks and their drawbacks: Based on these factors, a number of approaches to address the ERC problem have been proposed recently. The conversational memory network (CMN), proposed by Hazarika et al. [38] for dyadic dialogues, is one of the first ERC approaches that utilizes distinct memories for each speaker for speaker-specific context modeling. Later, Hazarika et al. [23] improved upon this approach with the interactive conversational memory network (ICON), which interconnects these memories to model self and inter-speaker emotional influence. Neither of these two methods actually exploits the speaker information of the target utterance for classification. This makes the models blind to speaker-specific nuances. Recently, Yeh et al. [9] proposed an ERC method called Interaction-aware Attention Network (IANN) by leveraging inter-speaker relation modeling. Similar to ICON and CMN, IANN utilizes distinct memories for each speaker. DialogueRNN [11] aims to solve this issue by considering the speaker information of the target utterance and, further, modeling self and inter-speaker emotional influence with a hierarchical multi-stage RNN with an attention mechanism. On both the IEMOCAP and SEMAINE datasets, DialogueRNN outperformed the other two approaches (Table III and Table IV).
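The idea of "distinct memories for each speaker" can be pictured with a minimal sketch that groups the history of a target utterance by speaker. This is not the implementation of CMN, ICON, IANN or DialogueRNN: those models encode each history with neural networks, whereas this sketch keeps the raw text, and the names `speaker_histories` and `Turn` and the example dialogue are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance text)


def speaker_histories(conversation: List[Turn], t: int) -> Dict[str, List[str]]:
    """Group the utterances before position t (0-based) by speaker.

    Each list preserves the original temporal order and could serve as
    the raw input to one speaker-specific memory.
    """
    if not 0 <= t < len(conversation):
        raise IndexError("t is out of range")
    histories: Dict[str, List[str]] = defaultdict(list)
    for speaker, text in conversation[:t]:
        histories[speaker].append(text)
    return dict(histories)


# Invented dyadic example.
dialogue = [
    ("A", "The order has been cancelled."),
    ("B", "This is great!"),
    ("A", "I thought you wanted it."),
    ("B", "I could not afford it anyway."),
]
target_index = 3
memories = speaker_histories(dialogue, target_index)
target_speaker = dialogue[target_index][0]
```

Knowing `target_speaker` is what lets a model decide which memory is the "self" history and which is the "other" history for the utterance being classified, which is the speaker information the paragraph above says CMN and ICON leave unused.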
All of these models affirm that contextual history, modeling self and inter-speaker influence are beneficial to ERC (shown in Fig. 8 and Fig. 9). Further, DialogueRNN shows that the nearby utterances are generally more context rich and ERC performance improves when the future utterances, at time > t, are available. This is indicated by Fig. 9, where DialogueRNN uses both past and future utterances as context with roughly the same frequency. Also, the distant utterances are used less frequently than the nearby utterances. On the other hand, CMN and ICON do not use future utterances as context at all. However, for real-time applications, systems cannot rely on future utterances. In such cases, CMN, ICON and DialogueRNN with fixed context window would be befitting.
All these networks, namely CMN, ICON, IANN and DialogueRNN, perform poorly on utterances with emotion shift. In particular, in the cases where the emotion of the target utterance differs from that of the previous utterance, DialogueRNN could correctly predict only 47.5% of the instances. This is considerably lower than the 69.2% success rate that it achieves in regions of no emotional shift.
Among these approaches, only DialogueRNN is capable of handling multiparty conversations on a large scale. However, on the multiparty conversational dataset MELD, DialogueRNN shows only a small performance improvement over bc-LSTM (Table V), which points to multiparty ERC as a future research direction. ICON and CMN are designed to detect emotions in dyadic dialogues. Adapting ICON and CMN to the multiparty conversational dataset MELD can cause scalability issues when the number of speakers participating in a conversation in the test data exceeds that in the training data.
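The shift versus no-shift comparison reported above can be reproduced for any model with a small evaluation helper. The sketch below is an assumption-laden illustration, not the evaluation script behind the reported numbers: the function names and toy labels are invented, and whether "previous utterance" means the same speaker's previous turn or simply the preceding turn in the conversation is left as the hypothetical `same_speaker` switch.

```python
from typing import Dict, List, Optional, Sequence


def shift_flags(speakers: Sequence[str], labels: Sequence[str],
                same_speaker: bool = True) -> List[Optional[bool]]:
    """True where the gold emotion differs from the reference previous
    utterance, False where it matches, None where no reference exists."""
    flags: List[Optional[bool]] = []
    last_by_speaker: Dict[str, str] = {}
    prev_label: Optional[str] = None
    for speaker, label in zip(speakers, labels):
        ref = last_by_speaker.get(speaker) if same_speaker else prev_label
        flags.append(None if ref is None else label != ref)
        last_by_speaker[speaker] = label
        prev_label = label
    return flags


def accuracy_by_shift(speakers: Sequence[str], gold: Sequence[str],
                      pred: Sequence[str],
                      same_speaker: bool = True) -> Dict[str, float]:
    """Accuracy computed separately on shift and no-shift utterances."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    flags = shift_flags(speakers, gold, same_speaker)
    for flag, g, p in zip(flags, gold, pred):
        if flag is None:
            continue  # first turn of a speaker: no reference emotion
        totals[flag] += 1
        hits[flag] += int(g == p)
    nan = float("nan")
    return {
        "shift": hits[True] / totals[True] if totals[True] else nan,
        "no_shift": hits[False] / totals[False] if totals[False] else nan,
    }


# Toy example with invented labels and predictions.
speakers = ["A", "B", "A", "B", "A"]
gold = ["neutral", "neutral", "neutral", "anger", "anger"]
pred = ["neutral", "neutral", "neutral", "neutral", "anger"]
scores = accuracy_by_shift(speakers, gold, pred)
```

Reporting the two accuracies side by side, rather than a single aggregate, is what exposes the gap between shift and no-shift regions described above.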
Due to the sequential nature of the utterances in conversations, RNNs are used for context generation in the aforementioned models. However, there is ample room for improvement, as the RNN-based context representation methods perform poorly in grasping long-distance contextual information.

Table II: Comparison of the datasets.
Dataset       | # dialogues (train / val / test) | # utterances (train / val / test)
IEMOCAP       | 120 (train and val) / 31         | 5810 (train and val) / 1623
SEMAINE       | 63 (train and val) / 32          | 4368 (train and val) / 1430
EmotionLines  | 720 / 80 / 200                   | 10561 / 1178 / 2764
MELD          | 1039 / 114 / 280                 | 9989 / 1109 / 2610
DailyDialog   | 11,118 / 1000 / ...              | ...

... [23] as in our experiment, we disregard their contextual feature extraction and pre-processing part. More details can be found in Majumder et al. [11].

Recently, two shared tasks, EmotionX (co-located with the SocialNLP workshop) and EmoContext (co-located with SemEval 2019; https://www.humanizing-ai.com/emocontext.html), have been organized to address the ERC problem. The EmoContext shared task has garnered more than 500 participants, affirming the growing popularity of this research field. Compared to other datasets, the EmoContext dataset [37] has very short conversations consisting of only three utterances, where the goal is to label the third utterance, as shown in Fig. 10. Emotion labels of the previous utterances are not present in the EmoContext dataset. The key works [24,39,37] on this dataset have mainly leveraged context modeling using the bc-LSTM architecture [31], which encapsulates the temporal order of the utterances using an LSTM. A common trend can be noticed in these works, where traditional word embeddings, such as GloVe [40], are combined with contextualized word embeddings, such as ELMo [29], to improve the performance. In Fig. 11, we depict the HRLCE framework, proposed by Huang et al. [39], which comprises an utterance encoder and a context encoder that takes input from the utterance encoder. To represent each utterance, HRLCE utilizes ELMo [29], GloVe [40] and DeepMoji [41]. The context encoder in HRLCE adapts the bc-LSTM framework followed by a multi-head attention layer. Huang et al. [39] applied the HRLCE framework only on the EmoContext dataset. However, HRLCE can be easily adapted to other ERC datasets. It should be noted that none of the works on the EmoContext dataset utilize speaker information. In fact, in our experiments, we found that DialogueRNN, which makes use of the speaker information, performs similarly (Table VI) to Bae et al. [24], Huang et al. [39] and Chatterjee et al. [37] on the EmoContext dataset. One possible reason for this could be the very short context history in the dataset, which renders speaker information inconsequential.

V. CONCLUSION
Emotion recognition in conversation has been gaining popularity among NLP researchers. In this work, we summarize the recent advances in ERC and highlight several key research challenges associated with this research area. Further, we point out how current work has partly addressed these challenges, while also presenting some shortcomings. Overall, we surmise that an effective emotion-shift recognition model and context encoder can yield significant performance improvement on chit-chat dialogue, and even improve some aspects of task-oriented dialogue. Moreover, challenges like topic-level speaker-specific emotion recognition, ERC on multiparty conversations, and conversational sarcasm detection can form new research directions. Additionally, fine-grained speaker-specific continuous emotion recognition may become of interest for the purpose of tracking emotions during long monologues. We believe that addressing each of the challenges outlined in this paper will not only enhance AI-enabled conversation understanding, but also improve the performance of dialogue systems by catering to affective information.