Different Contextual Window Sizes Based RNNs for Multimodal Emotion Detection in Interactive Conversations

Multimodal emotion detection (MED) in interactive conversations is extremely important for improving the overall human-computer interaction experience. Existing methods in this domain do not explicitly distinguish the contexts of a test utterance in a meaningful way while classifying emotions in conversations. In this paper, we propose a model, named different contextual window sizes based recurrent neural networks (DCWS-RNNs), to differentiate the contexts. The model uses four recurrent neural networks (RNNs) with different contextual window sizes, which represent the implicit weights of the different aspects of the contexts. The four RNNs independently model the different aspects of the contexts into memories. These memories are then merged with the test utterance using attention-based multiple hops. Experiments show that DCWS-RNNs outperforms the compared methods on both the IEMOCAP and AVEC datasets. Case studies on the IEMOCAP dataset also demonstrate that our model excels at capturing the emotionally dependent utterance, i.e., the utterance that is most relevant to the test utterance and receives the highest attention score.


I. INTRODUCTION
Emotions are part of human nature and play an important role in our daily life. The development of machines that can detect and comprehend emotions has been a long-standing goal of artificial intelligence (AI). With the growth of deep learning [1] and neural networks [2], many researchers have been paying increasing attention to multimodal emotion detection (MED). MED can help in numerous fields such as counseling, public opinion mining, financial forecasting, intelligent systems, dialogue systems, and human-computer interaction. In all these applications, creating empathetic agents is an urgent need, as it can improve the human-computer interaction experience in conversations. Although significant previous research work has been carried out on MED using audio, visual, and text modalities [3], [4],
there is little work in interactive conversations. According to [5], [6], modeling contexts and analyzing emotional dynamics [7] in conversations pose complex challenges because of the intricate dependencies between the affective states of the speakers participating in the conversation [8]. In this paper, we propose a method, named different contextual window sizes based recurrent neural networks (DCWS-RNNs), to address MED at the utterance level (an utterance is, generally, a unit of speech bounded by the breaths or pauses of the speaker).
In interactive conversations, self- and inter-speaker emotional influence are two vital properties of emotional dynamics [9], [10]. Self-emotional influence, also termed emotional inertia, relates to the degree to which a person's feelings carry over from one moment to another [11]; in other words, it covers the emotional influence that speakers have on themselves during the conversation [12]. Inter-speaker emotional influence, on the other hand, is the influence that counterparts induce in a speaker; that is, speakers may follow the emotions of their counterparts during the course of a conversation [13]. Figure 1 shows a short conversation extracted from the IEMOCAP dataset that involves both self- and inter-speaker emotional influence.
Many studies in MED exploit the contextual information present in the conversation history [8], [13], [14]. However, they do not propose a meaningful way to distinguish the contexts of a test utterance in interactive conversations. In other words, they ignore the implicit weights of the different aspects of the contexts, treating them simply or equally, which makes the capture of emotional dynamics less effective. In contrast, we develop a model that distinguishes the contexts and represents the implicit weights of their different aspects using different contextual window sizes. The model is based on the following assumption: the contexts of a test utterance in an interactive conversation should be divided into four aspects, and the implicit weights of these aspects can be represented by different contextual window sizes. Each aspect is a part of the contexts of the test utterance. Although these four aspects of the contexts are not necessarily independent, modeling them separately yields superior performance in comparison with the state of the art in our experiments, which supports the effectiveness of our hypothesis.
DCWS-RNNs first represents all the utterances in a video with multimodal (text, audio, and visual) features. Next, for classifying a test utterance u_t, the four different aspects of the contexts of u_t are generated by collecting the corresponding utterances; the number of utterances of each aspect is its contextual window size. To capture the utterance on which u_t most depends emotionally, DCWS-RNNs uses four recurrent neural networks (RNNs) to process these four aspects independently.
These RNNs are based on gated recurrent units (GRUs). One RNN is assigned to each aspect, and the number of GRU steps of each RNN equals the number of utterances of its assigned aspect. Each RNN models its corresponding aspect into memory cells using GRUs. An attention mechanism is then performed on these memory cells, producing a weighted representation that summarizes the memories. The weighted representations of all four aspects are merged with u_t using an addition operation; this improves the representation of the test utterance by modeling the emotional dynamics of the different aspects of the contexts. The whole process is repeated in a multiple hop scheme. Finally, the emotion representation of u_t obtained after H hops is used to classify its emotion category.
The contributions of this paper can be summarized as follows: • We propose the DCWS-RNNs model to distinguish the contexts of the test utterance. To our knowledge, this is the first work to represent the implicit weights of the different aspects of the contexts using different contextual window sizes.
• We adopt a multimodal approach to represent the utterances. This provides comprehensive features from the text, audio, and visual modalities and helps make our model's performance robust.
• For utterance-level MED in interactive conversations, we conduct experiments on both the IEMOCAP and AVEC datasets. The results show that our DCWS-RNNs model obtains competitive performance in comparison with the state of the art [14] (see Table 4 and Table 5).

The remainder of the paper is organized as follows: Section II briefly discusses important previous work; Section III provides the problem definition; Section IV describes our method in detail; Section V provides the experimental setup and reports the results; discussion and analysis are presented in Section VI; finally, Section VII concludes the paper.

II. RELATED WORK
Emotion detection [15] lies at the intersection of signal processing [16], deep learning, edge computing, psychology [17], natural language processing [18], cognitive science [19], and other fields [20]. Ekman (1993) [21] launched research on the correlation between emotion and facial cues. Rothkrantz et al. (2008, 2011) [22], [23] employed both acoustic information and visual cues for emotion detection. Alm et al. (2005) [24] studied emotion detection with added text-based information, which was improved in the later work of Strapparava and Mihalcea (2010) [25]. In recent years, MED has become an active research area and gained impetus among researchers. MED performs emotion detection from a multimodal learning perspective [3], [26]. A large amount of current research, such as [27]-[31], relies on multimodal techniques for emotion detection. These exhibit good performance in emotion detection systems [32]-[34], thus promoting the use of multimodality. Our proposed DCWS-RNNs likewise adopts a multimodal (text, audio, and visual) fusion approach to represent all the utterances in conversations for emotion detection.
Replicating human interaction requires a profound understanding of conversations, and the significant premise is to capture the emotional dynamics. Important work has pointed out that emotional dynamics should be looked upon as an interactive phenomenon, rather than as within-person and one-directional [35], [36]. These dynamics can be modeled by observing transition properties, e.g., by investigating patterns of emotional transitions [37] or by using finite state machines to model the transitions [38]. Since a conversation is a temporal event with a natural sequential structure, the contexts of the test utterance play a critical role in emotion detection. Poria et al. [39] utilized contextual information from neighboring utterances of the same speaker to predict emotions. Hazarika et al. [13] computed the preceding contexts of both parties by employing memory networks of the same length. Different from these works, we not only use the contexts but also divide them into four different aspects.
Memory networks are a storage mechanism whose memory cells are continuous vectors; they are efficient in capturing long-term dependencies [40]. Memory networks have been beneficial in multiple research areas, such as speech detection [41], [42], question answering [43], [44], machine translation [45], commonsense reasoning [46], and others. Zadeh et al. [47] proposed a memory-based sequential learning framework for multi-view signals. Hazarika et al. [8], [13] both employed memory networks to enable inter-speaker interaction for detecting emotions in interactive conversations, with the former combining memories to encode whole utterances and the latter using two distinct memory networks. The attention mechanism and the multiple hop scheme are generally embedded in memory networks; in this paper, these mechanisms are adopted to model the different aspects of the contexts of the test utterance.
RNNs, a kind of neural network for processing sequential data, have also been instrumental in progress on the MED problem. For example, Poria et al. [39] successfully developed RNN-based deep networks for MED, which were followed by other works [48]-[50]. Majumder et al. [14] proposed an attentive RNN for emotion detection, which is the state of the art in MED. Our work is a follow-up to previous research [13]. Although our model also employs RNNs, it differs in having four RNNs with different contextual window sizes (corresponding to different numbers of time steps) as calculation modules for the contexts.

III. PROBLEM DEFINITION
Our goal is to accurately detect the emotion of the utterances in interactive conversations. Let a conversation U be an asynchronous exchange of utterances between two persons P_a and P_b. With T utterances, U = {u_1, u_2, ..., u_T} denotes a totally ordered set based on temporal occurrence, where u_t is the t-th utterance at time step t ∈ [1, T] and is spoken by either P_a or P_b. The contexts of u_t are those utterances in U other than u_t. We divide the contexts of u_t into four aspects: the own-before aspect, the other-before aspect, the own-later aspect, and the other-later aspect (marked A, B, C, and D, respectively). Each aspect is a part of the contexts of u_t. The utterances of these four aspects are separately collected into four sets (U_A, U_B, U_C, and U_D). No utterance belongs to more than one aspect.
Further, considering different contextual window sizes (W_A, W_B, W_C, and W_D), these four sets (U_A, U_B, U_C, and U_D) can be respectively constrained within their corresponding window sizes to the sets U_A^{W_A}, U_B^{W_B}, U_C^{W_C}, and U_D^{W_D}. Concretely, for aspect A of the contexts, U_A contains those utterances that occur before u_t and are uttered by the same person who utters u_t; U_A^{W_A} contains those utterances in U_A whose distance from u_t is at most W_A. In the same way, U_B, U_B^{W_B}, U_C, U_C^{W_C}, U_D, and U_D^{W_D} can be obtained by analogy with aspect A.
It should be noted that, at the beginning or the end of the conversation, these constrained sets may contain fewer utterances than their corresponding window sizes. So we have the inequality

|U_λ^{W_λ}| ≤ W_λ,

where λ ∈ {A, B, C, D}. For brevity, we describe our model using a subscript λ that can instantiate to A, B, C, or D in the remaining sections. Table 1 provides a sample conversation with different contextual window sizes, W_A = 9, W_B = 8, W_C = 6, and W_D = 4, where u_i represents the i-th utterance in U.
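The aspect construction above can be sketched in code. This is a hypothetical illustration (not the authors' implementation); the function name and the representation of a conversation as a list of speaker ids are our own assumptions.

```python
def aspect_sets(speakers, t, w_a, w_b, w_c, w_d):
    """Partition the contexts of utterance index t into the four aspect
    sets of Section III, then apply the window constraints W_lambda.
    speakers: list of speaker ids, one per utterance index 0..T-1.
    Returns the constrained index lists (U_A^{W_A}, ..., U_D^{W_D})."""
    me = speakers[t]
    u_a = [i for i in range(t) if speakers[i] == me]                     # own, before
    u_b = [i for i in range(t) if speakers[i] != me]                     # other, before
    u_c = [i for i in range(t + 1, len(speakers)) if speakers[i] == me]  # own, later
    u_d = [i for i in range(t + 1, len(speakers)) if speakers[i] != me]  # other, later
    # Keep at most W_lambda utterances closest to u_t; near conversation
    # boundaries the sets may hold fewer than W_lambda elements.
    return u_a[-w_a:], u_b[-w_b:], u_c[:w_c], u_d[:w_d]
```

Note that the four returned sets are pairwise disjoint and never contain t itself, matching the problem definition.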

IV. DCWS-RNNs MODEL
To improve classification performance, we propose the hypothesis that the contexts of a test utterance u_t should be differentiated. Further, we use different contextual window sizes to represent the implicit weights of the different aspects of the contexts. Based on this assumption, we develop the DCWS-RNNs model, whose overall working process is as follows.
Firstly, the four sets (U_A^{W_A}, U_B^{W_B}, U_C^{W_C}, and U_D^{W_D}) relevant to the test utterance u_t are separately collected and respectively fed to four RNNs to generate four initial memories, with each RNN accepting one set. These RNNs have different contextual window sizes; one RNN is assigned to each aspect, and each RNN works independently, modeling its corresponding aspect into memory cells using GRUs. These memories are then conveyed to the attention block, where an attention mechanism is performed. After the attention operation, the memories of the four aspects are accumulated and merged with u_t, producing an updated representation of the test utterance. Furthermore, a multiple hop scheme is adopted for iteration: at each hop, the output of the RNN module is used as its input at the next hop. The ultimate emotion representation of u_t, obtained after H hops, is used for emotion classification. Figure 2 illustrates the overall model.
In the following subsections, we will detail the DCWS-RNNs model.

A. SEPARATE GRUs MODULES
For the computations of the different aspects of the contexts, we employ four independent RNNs based on GRUs. Introduced by Cho et al. [52], the GRU is a simpler gating mechanism than the LSTM [51] with similar computational performance. At any time step t of a temporal sequence, the GRU controls the combination of the t-th input u_t and the previous state s_{t-1} by computing two gates, r_t (reset gate) and z_t (update gate). The computations are:

r_t = σ(W_r u_t + V_r s_{t-1} + b_r),
z_t = σ(W_z u_t + V_z s_{t-1} + b_z),
s̃_t = tanh(W_s u_t + V_s (r_t ⊗ s_{t-1}) + b_s),
s_t = (1 - z_t) ⊗ s_{t-1} + z_t ⊗ s̃_t.

Here, V, W, and b are trainable parameters and ⊗ represents element-wise multiplication. Specially, we adopt the following rules: for utterances before u_t, the sequence direction starts from the oldest one; for utterances after u_t, it starts from the farthest one.
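One GRU step, as given by the gating equations above, can be sketched in NumPy. This is a minimal illustration, not the authors' code; packaging the parameters as dicts keyed by 'r', 'z', and 's' is our own assumption.

```python
import numpy as np

def gru_step(u_t, s_prev, W, V, b):
    """One GRU step: u_t is the current input, s_prev the previous state.
    W, V, b hold the trainable parameters for the reset (r), update (z)
    and candidate-state (s) paths, matching the equations in the text."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    r = sigmoid(W['r'] @ u_t + V['r'] @ s_prev + b['r'])                # reset gate
    z = sigmoid(W['z'] @ u_t + V['z'] @ s_prev + b['z'])                # update gate
    s_tilde = np.tanh(W['s'] @ u_t + V['s'] @ (r * s_prev) + b['s'])   # candidate state
    return (1.0 - z) * s_prev + z * s_tilde                             # new state s_t
```

Running such a step over a whole aspect set, in the direction rules stated above, yields that aspect's memory cells.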
The initial memories of the four aspects of the contexts are computed using four RNNs respectively based on the separate cells GRU_A^{Init}, GRU_B^{Init}, GRU_C^{Init}, and GRU_D^{Init} (Init denotes initialization).
For the memory representation of aspect A, marked M_A^{(1)}, the utterances in the set U_A^{W_A} are framed as a sequence based on temporal occurrence (starting from the oldest one) and fed to the GRU_A^{Init}-based RNN to produce M_A^{(1)}. The memories M_B^{(1)}, M_C^{(1)}, and M_D^{(1)} are obtained analogously from U_B^{W_B}, U_C^{W_C}, and U_D^{W_D}.

B. INDEPENDENT ATTENTION MECHANISM
In this module, the test utterance u_t is first embedded into a vector q^{(1)} of dimension R^d using a projection matrix P. Next, q^{(1)} and M_λ^{(1)} are read by the independent attention module, which performs an attention mechanism on them, resulting in an attention vector p_λ^{(1)} of dimension R^{W_λ} corresponding to λ. The computation is:

p_λ^{(1)} = softmax(q^{(1)T} M_λ^{(1)}),

where softmax(x_j) = e^{x_j} / Σ_k e^{x_k}, and the attention vector p_λ^{(1)} is a probability distribution over M_λ^{(1)}. The j-th normalized score of p_λ^{(1)} represents the relevance of the j-th memory cell with respect to the test utterance. The p_λ^{(1)} and M_λ^{(1)} are then used to form a weighted representation by taking an inner product:

attM_λ^{(1)} = p_λ^{(1)} M_λ^{(1)}.

This representation attM_λ^{(1)} contains a weighted contextual summary accumulated from the memories of the aspect corresponding to λ. As λ can instantiate to A, B, C, or D, the representation of the test utterance is updated by adding all the weighted representations:

q^{(2)} = q^{(1)} + Σ_λ attM_λ^{(1)}.
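The attention operation over one aspect's memories can be sketched as follows. This is an illustrative NumPy implementation of the softmax scoring and inner-product summary described above, not the authors' code.

```python
import numpy as np

def attend(q, M):
    """Independent attention over one aspect's memories.
    q: query vector of shape (d,); M: memory matrix of shape (W_lambda, d).
    Returns (p, attM): the attention distribution over memory cells and
    the weighted contextual summary."""
    scores = M @ q                     # relevance of each memory cell to the query
    e = np.exp(scores - scores.max())  # numerically stable softmax
    p = e / e.sum()
    attM = p @ M                       # weighted sum (inner product) of memory cells
    return p, attM
```

Summing attM over the four aspects and adding the result to q implements the update of the test-utterance representation.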

C. MULTIPLE HOPS
For better performance, we perform H hops over all the initial memories (M_A^{(1)}, M_B^{(1)}, M_C^{(1)}, and M_D^{(1)}). Adopting a multiple hop scheme is inspired by several recent works [40], [44], which suggest the importance of multiple input/output iterations for performing transitive inference. The multiple hop scheme over the inputs is crucial to achieving high performance in our experiments. Its iterative nature enables the model to focus on relevant parts of the inputs and allows a form of transitive inference: each iteration provides the model with newly relevant information, so information ignored in previous iterations can be retrieved later. Multiple hops also improve the focus of the attention, which might miss essential contexts in a single hop. The scheme works as follows:

• Each memory representation M_λ^{(h)} (h is an integer greater than zero) is updated using a dynamic RNN whose cell is GRU_λ; the output at the h-th hop is used as the input of the (h+1)-th hop. This constraint of sharing parameters between adjacent hops is added to reduce the total number of parameters and ease training. The update operation is:

M_λ^{(h+1)} = GRU_λ(M_λ^{(h)}).

• At every hop, the representation of the test utterance is updated as:

q^{(h+1)} = q^{(h)} + Σ_λ attM_λ^{(h)}.

• After H hops, the representation q^{(H+1)} is sent to the final prediction.
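The hop loop can be sketched schematically. This is a hedged illustration of the control flow only; the `gru_update` and `attend` callables stand in for the GRU_λ memory update and the attention module, and the function and argument names are our own assumptions.

```python
import numpy as np

def multi_hop(q1, memories, gru_update, attend, H):
    """Run H hops of the scheme above.
    q1: initial query q^(1); memories: dict mapping each aspect label in
    {'A','B','C','D'} to its initial memory matrix M^(1)_lambda;
    gru_update(lam, M) rolls one aspect's memory forward one hop;
    attend(q, M) returns (p_lambda, attM_lambda). Returns q^(H+1)."""
    q = q1
    for h in range(H):
        # q^(h+1) = q^(h) + sum over aspects of attM^(h)_lambda
        q = q + sum(attend(q, M)[1] for M in memories.values())
        # M^(h+1)_lambda = GRU_lambda(M^(h)_lambda): hop h output feeds hop h+1
        memories = {lam: gru_update(lam, M) for lam, M in memories.items()}
    return q  # sent to the final prediction
```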

D. FINAL PREDICTION
Through an affine transformation, the final prediction vector is obtained as:

ŷ = softmax(W q^{(H+1)} + b).

For classification training, categorical cross-entropy loss along with L2 regularization is used as the cost measure:

Loss = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y_{i,j} log ŷ_{i,j} + α ||θ||_2,

where N denotes the total number of utterances across all videos and C is the number of emotion categories; y_{i,j} is the ground-truth indicator that the i-th utterance of the training set belongs to class j, and ŷ_{i,j} is its predicted probability of belonging to class j; α is the L2-regularizer weight, and θ is the set of trainable parameters.
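The classification cost can be sketched as follows; a minimal NumPy illustration of cross-entropy with an L2 penalty, assuming one-hot ground-truth rows and per-row probability predictions (the function name and small epsilon for numerical safety are our own assumptions).

```python
import numpy as np

def training_loss(y_true, y_pred, params, alpha):
    """Categorical cross-entropy averaged over N utterances, plus an
    L2 penalty alpha * sum of squared parameter entries.
    y_true, y_pred: arrays of shape (N, C); params: list of parameter arrays."""
    n = y_true.shape[0]
    ce = -np.sum(y_true * np.log(y_pred + 1e-12)) / n   # cross-entropy term
    l2 = alpha * sum(np.sum(p ** 2) for p in params)    # L2 regularization term
    return ce + l2
```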
For regression, ŷ is a scalar (without softmax normalization) whose scores are used to compute the mean squared error (MSE) cost metric. Figure 3 presents a flowchart of DCWS-RNNs.

V. EXPERIMENTS

A. DATASET DETAILS
We perform experiments on two emotion detection datasets: the interactive emotional dyadic motion capture database (IEMOCAP¹) [53] and the 2012 Audio-Visual Emotion Challenge and Workshop dataset (AVEC 2012) [54]. IEMOCAP provides rich video and audio samples for all the utterances along with transcriptions and is generally regarded as a benchmark for emotion detection. It contains 12 hours of dyadic conversational videos, grouped into five sessions and split into interactions of about 5 minutes between pairs of 10 unique speakers (5 male and 5 female). Each pair is assigned multiple diverse conversation scenarios. All the conversations are segmented into spoken utterances, and each utterance is tagged by at least 3 annotators with one of the labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and others. In our experiments, however, we consider the first six categories with majority agreement (i.e., at least two out of three annotators labeled the same emotion), to compare with the state of the art [14] and other works. Table 2 presents the class distribution in the IEMOCAP dataset.

AVEC annotates its conversations with real-valued affective attributes, including the power attribute, whose range is [0, ∞). The annotations are available for every 0.2 seconds in each video. However, to fit our problem definition, the attributes over the span of an utterance are averaged to obtain the utterance-level annotation.

¹ https://sail.usc.edu/iemocap/
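The utterance-level averaging of AVEC's frame-level annotations can be sketched as follows (an illustrative helper of our own, not the authors' preprocessing code).

```python
def utterance_annotation(frame_values, frame_times, start, end):
    """Average the 0.2-second attribute annotations falling within an
    utterance's [start, end) time span to obtain its utterance-level label.
    frame_values, frame_times: parallel lists of annotation values and
    their timestamps in seconds."""
    vals = [v for v, t in zip(frame_values, frame_times) if start <= t < end]
    return sum(vals) / len(vals)
```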

B. MULTIMODAL FEATURE DATA
For a fair comparison with the state-of-the-art method [14], we employ the identical feature data downloaded from their public website.² The extraction procedures are briefly described below.
The textual features are extracted from the transcript of each utterance using a simple CNN [56]. A single convolutional layer with three distinct convolution filters is followed by max-pooling and rectified linear unit (ReLU) activation. The resulting activations are concatenated and fed to a fully connected layer to obtain the textual utterance representation t_u.
The audio features are obtained using the open-source openSMILE toolkit. Specifically, the IS13_ComParE³ configuration provides 6373 features for each utterance. A fully connected layer then reduces the dimension and yields the audio feature vector a_u.
The visual features are captured using a 3D-CNN [57]. The outputs of the 3D convolution are subjected to max-pooling followed by a fully connected layer, and the activations of this layer form the video representation v_u.
By concatenating the above three modality features, the final representation of an utterance is u = [t_u, a_u, v_u]. This multimodal representation is adopted for all utterances in a conversation.
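The fusion step is plain feature concatenation, which can be sketched in one line (an illustration with assumed dimensionalities, not the authors' code).

```python
import numpy as np

def fuse(t_u, a_u, v_u):
    """Utterance representation u = [t_u, a_u, v_u] by concatenating the
    textual, audio, and visual feature vectors along the feature axis."""
    return np.concatenate([t_u, a_u, v_u])
```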

C. TRAINING DETAILS
A large portion of MED research is performed on datasets where the training and test splits may share speakers. However, as each individual has a unique way of expressing emotions in the real world, it is crucial to develop a model that learns generic, person-independent features. Hence, we perform person-independent experiments. We partition each dataset into training and test sets with a roughly 80/20 ratio such that the partitions do not share any speaker (e.g., on the IEMOCAP dataset, only the first eight speakers from sessions 1 to 4 belong to the training set, while the last two speakers from session 5 belong to the test set). Furthermore, 20% of the training set is used as a validation set for hyper-parameter tuning. Table 3 summarizes the splits of both datasets.
To optimize the experimental parameters on the IEMOCAP dataset, we train our network with the Adam optimizer [58], starting with an initial learning rate (LR) of 0.01. The training phase is terminated by early stopping with a patience of 10 epochs, monitoring the validation loss. The network is regularized with Dropout [59] and gradient clipping at a norm of 40. The final hyper-parameters are decided by random search [60]. Based on validation performance, the contextual window sizes W_A, W_B, W_C, and W_D are set to 31, 22, 24, and 17, respectively. The number of hops H is fixed at 3, and the embedding dimension d is set to 50.

² https://github.com/senticnet
³ http://audeering.com/technology/opensmile
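The early-stopping criterion described above can be sketched as a small helper (a minimal illustration of patience-based stopping, not the authors' training code; the class and method names are our own assumptions).

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs (patience = 10 in our setup)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training
        should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```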

D. BASELINES
• TFN [61] captures inter-modality and intra-modality dynamic interactions by using the tensor outer product.
In their model, inter-modality dynamics are modeled with a new multimodal fusion approach, named Tensor Fusion, which explicitly aggregates unimodal, bimodal and trimodal interactions. The intra-modality dynamics are modeled through three Modality Embedding Subnetworks, for language, visual and acoustic modalities, respectively.
• CATF-LSTM [62] proposes attention-based networks for improving both context learning and dynamic feature fusion. It utilizes an attention-based fusion mechanism to amplify the higher quality and informative modalities during fusion in multimodal classification, followed by a contextual attention-based LSTM network that models the contextual relationship among utterances and prioritizes the important contextual information for classification.
• CMN [13] models separate contexts from the conversational histories of the two speakers using two distinct GRUs. These contexts are stored as memories, and an attention mechanism is performed on them. The weighted memories are then merged with the test utterance using an addition operation, and this merged representation is used to classify the emotion category after a multiple hop scheme performed by repeated input and output cycles.
• ICON [8] considers the preceding utterances of both speakers falling within a context window and models their self-emotional influences using local GRUs. Furthermore, to incorporate inter-speaker influences, a global representation is generated using a global GRU, which is treated as a memory to track the overall conversational flow.
• DialogueRNN [14] is a state-of-the-art method for MED. It is based on the assumption that three major aspects are relevant to the emotion in a conversation: the speaker, the context from the preceding utterances, and the emotions of the preceding utterances. It employs a recurrent network based on three GRUs: the global GRU and the party GRU respectively update the global context and the individual speaker states during the conversation, while the emotion GRU tracks the emotional state through the conversation.

In summary, the contexts are not considered in TFN; the speakers in the conversations are not differentiated in CATF-LSTM; and in CMN, ICON, and DialogueRNN, the different aspects of the contexts are not distinguished and their weights are not represented.
For the performance comparison, we reimplement the above methods in our experiments.

E. RESULTS
On the IEMOCAP dataset, we mainly analyze the performance of our proposed DCWS-RNNs model in the static setting: for detecting the emotion of a test utterance u_t, we can use utterances that occur both before and after u_t to improve its emotion representation. We use accuracy and F1-score to evaluate classification performance and compare DCWS-RNNs with the above baselines; the results are summarized in Table 4. As shown there, DCWS-RNNs outperforms the compared models, with increases in accuracy ranging from 2.40% to 6.41% and in F1-score ranging from 0.81% to 4.89%. In particular, DCWS-RNNs surpasses the state-of-the-art DialogueRNN by 2.40% in accuracy, suggesting that the proposed model is more capable of capturing contextual information.
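The evaluation metrics used here can be computed from predicted and ground-truth label lists as sketched below (a standard illustration of accuracy and per-class F1, not tied to the authors' evaluation scripts).

```python
def accuracy_and_f1(y_true, y_pred, num_classes):
    """Overall accuracy and per-class F1 scores from label lists."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1s = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))  # true positives
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))  # false positives
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))  # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, f1s
```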
Furthermore, to explore applicability to real-time dialogue systems, we perform experiments on the AVEC dataset using only the history information (aspects A and B of the contexts). We mark this variant as DCWS-RNNs*. The results on the AVEC dataset are presented in Table 5: DCWS-RNNs* obtains competitive performance in comparison with the state-of-the-art DialogueRNN.

VI. DISCUSSION AND ANALYSIS

A. HYPER-PARAMETERS
In consideration of combinatorial explosion, we use random search to set the values of W_A, W_B, W_C, and W_D. As shown in Figure 4, when the number of hops H increases, the performance initially improves, showing the importance of the multiple hop scheme; however, with a further increase, the performance decreases. The best performance is obtained at H = 3 in our experiments.

B. MULTIMODALITY
Here we give a short multimodal analysis. Table 6 summarizes the performance of different combinations of modalities used by DCWS-RNNs on IEMOCAP and DCWS-RNNs* on AVEC. As seen in the table, the trimodal network outperforms the unimodal networks in our experiments. Moreover, the text modality performs best of the three, reaffirming its significance in multimodal systems. Although the other modalities contribute to improving the performance, their contribution is small compared to that of the textual modality, which is aligned with our expectations.

C. CASE STUDIES
To understand how DCWS-RNNs pays attention to the different aspects of the contexts of a test utterance after 3 hops, we manually explore utterances in the IEMOCAP test set and visualize the attention vector p_λ over them. Figure 5 presents four cases corresponding to the four aspects of the contexts, in each of which an utterance of that aspect receives the highest attention score. Each case has its own range of attention scores, from 0 to its highest value. The symbol u_t denotes the test utterance and u_d denotes the dependent utterance, i.e., the utterance most relevant to the test utterance, which receives the highest attention score.
Figure 5(a) displays the important role of self-emotional influences and how DCWS-RNNs pays the highest attention to an utterance in aspect A of the contexts. In this case, the model classifies the test utterance (turn 6) correctly by attending to the utterance (turn 3) in aspect A that receives the highest attention score, indicating the presence of long-term emotional dependencies and the need to consider aspect A. Figure 5(b) visualizes the important role of inter-speaker influences and how DCWS-RNNs handles dependency on utterances in aspect B. Here, the model predicts the test utterance (turn 2) correctly by attending to the utterance (turn 1) in aspect B that receives the highest attention score, indicating the need to consider aspect B. Figure 5(c) demonstrates the ability of DCWS-RNNs to model self-emotional influences through aspect C: while classifying the test utterance (turn 63), the model finds contextual information in aspect C and focuses most on the utterance (turn 65), which is most relevant to the test utterance and receives the highest attention score. This indicates the benefit of considering aspect C to capture emotional correlations across time. Figure 5(d) shows the important role of inter-speaker influences and how DCWS-RNNs differentiates the utterances in aspect D: while classifying the test utterance (turn 51), the model finds contextual information in aspect D and focuses most on the utterance (turn 65), which is most relevant to the test utterance and receives the highest attention score.
This result indicates the benefit of considering the D aspect of contexts to capture emotional correlations across time.
These cases provide evidence that the weights of the different aspects of the contexts can be represented by different contextual window sizes; in other words, the different aspects of the contexts contribute unequally to the output representation. This helps capture the information most relevant to the test utterance from the perspective of emotional contexts, making our model perform better than the state-of-the-art DialogueRNN.

D. ABLATION STUDY
The main novelty of our DCWS-RNNs model is dividing the contexts of u_t into the four aspects A, B, C, and D described in Section III. To check the impact of each aspect, we conduct an ablation study by removing one aspect at a time and evaluating the performance on the IEMOCAP dataset. Table 7 shows the results. As expected, aspect A is the most important: without it, performance falls by 6.53%. We believe that aspect A contains more correlation information relevant to the test utterance than aspects B, C, and D. Aspect B is also important, though less so than A, as its absence causes performance to fall by 3.70%. We speculate that the reason is that self-emotional influence is more important than inter-speaker emotional influence.

VII. CONCLUSION
In this paper, we have presented DCWS-RNNs, a multimodal model for emotion detection in interactive conversations. In contrast to the state-of-the-art method, DialogueRNN, our method divides the contexts of a test utterance into four aspects. Further, the implicit weights of these aspects of the contexts are represented with different contextual window sizes, which helps to improve accuracy. Our experimental results confirm the performance of DCWS-RNNs with this modeling scheme. In the future, we plan to investigate the time complexity of our model and newer RNN structures.