Can We Exploit All Datasets? Multimodal Emotion Recognition Using Cross-Modal Translation

The use of sufficiently large datasets is important for most deep learning tasks, and emotion recognition is no exception. Multimodal emotion recognition considers multiple types of modalities simultaneously to improve accuracy and robustness, typically using three modalities: visual, audio, and text. As in other deep learning tasks, large datasets are required. Various heterogeneous datasets exist, including unimodal datasets constructed for traditional unimodal recognition and bimodal or trimodal datasets built for multimodal emotion recognition. A trimodal emotion recognition model achieves high performance and robustness by comprehensively considering multiple modalities; however, such a model cannot directly exploit unimodal or bimodal datasets. In this study, we propose a novel method to improve the performance of emotion recognition based on a cross-modal translator that can translate between the three modalities. The proposed method can train a trimodal model with heterogeneous datasets of different types and does not require alignment between the visual, audio, and text modalities. By adding unimodal and bimodal datasets to the trimodal dataset, we achieved performance exceeding the baseline on CMU-MOSEI and IEMOCAP, two representative multimodal datasets.


I. INTRODUCTION
The perception of human emotions is becoming an essential part of various human-computer interaction systems, as human emotions and affect play a crucial role in our daily lives. People respond and act according to their perceptions of emotions in response to external stimuli. Intelligent systems, such as surveillance, robotics, and medical systems, benefit from the ability to understand human emotions and behaviors.
One of the most important tasks in recognizing emotions is to assemble the various types of information that express human emotions. The expression of human emotions is intrinsically multimodal: voice pitch, speaking rate, facial expression, word choice, and gestures are among the many means of expressing emotions. Intuitively, therefore, using multiple modalities can achieve higher performance and reliability than restricting the input to a single modality. One of the main challenges in multimodal emotion recognition is the difficulty of obtaining labeled data, because it takes a long time for humans to identify emotion categories in video, audio, or text. Owing to the efforts of several researchers, labeled image-based facial expression recognition datasets [1,2,3] and text-based emotion recognition datasets [4,5,6] have become publicly available. However, the amount of data labeled jointly for the video, audio, and text modalities is much smaller than that of unimodal or bimodal datasets, and building large-scale multimodal datasets covering video, audio, and text is expensive and time-consuming. The main motivation of this study is to investigate the effective utilization of datasets with different modal information through a learning strategy that trains a trimodal emotion recognizer with cross-modal translators.
To utilize datasets with different modalities, many researchers have proposed cross-modal transfer methods: target-modal data are augmented through cross-modal translation from source-modal data and used to train a target-modal recognition model. For example, to transfer visual information to audio, He et al. [7] used VAEGAN [8] as a visual-to-audio translator. The conditional generative adversarial network (GAN) [9] and cycle GAN [10] have also been used [7,11,12] to translate visual information to audio. For audio-text transfer, a consistent prediction method for real and synthetic speech has been proposed [13] to improve speech recognition performance. Yoon et al. [14] translated images of birds and plants into text to classify them accurately.
In this study, we propose a multimodal emotion recognition model that takes the three modalities of video, audio, and text as input and learns from a multimodal dataset containing all three modalities as well as from unimodal or bimodal datasets. To this end, we propose a feature-level cross-modal translation model between the three modalities. The video, audio, and text data used for emotion recognition are expressed as time series. Therefore, translating a word into an audio signal requires word-level multimodal alignment, and aligning different modalities generally requires human labor. To address this problem, we propose a novel cross-modal translation model that uses a sequence-level discriminator for unaligned multimodal datasets. Using additional heterogeneous single- or bimodal datasets, we show that the proposed method is effective in improving performance.
We trained a cross-modal translator and a multimodal emotion recognizer with an end-to-end architecture that simultaneously learns both models. We tested the performance of the proposed end-to-end cross-modal translation and emotion recognition model by applying it to two benchmark datasets, CMU-MOSEI and IEMOCAP. The contributions of this study are as follows.
1) We propose a strategy for training a multimodal emotion recognition model using multiple heterogeneous datasets with different modalities. We use cross-modal translators and an end-to-end learning strategy to achieve this goal. The cross-modal translators allow single- or bimodal datasets to be leveraged and improve the performance of the emotion recognizer through a data augmentation effect.
2) We propose a novel cross-modal translation model for unaligned trimodal datasets. By adding a sequence-level discriminator, we can train a cross-modal translator without manual word- or phoneme-level alignment. To the best of our knowledge, this is the first attempt to recognize emotion by augmenting all three modalities: visual, audio, and text.
3) Our approach is evaluated on the representative multimodal emotion recognition benchmark datasets CMU-MOSEI and IEMOCAP; it exceeds the baseline approach by 13.4% on CMU-MOSEI and 10.4% on IEMOCAP.

II. RELATED WORK
A. MULTIMODAL EMOTION RECOGNITION
Many prior studies have addressed multimodal emotion recognition. In recent years, considerable progress has been made in this area through modality fusion methods. A dynamic fusion graph-based network [15], which fuses modalities dynamically in a hierarchical manner, a tensor fusion network that combines the data representations of each modality into a single embedding [16], a capsule GCN that considers information redundancy and complementarity [17], and late- or early-fusion networks [15,18,19,20], which differ in where fusion is placed in the network, have been proposed and have shown better performance than single-modality emotion recognition systems. M3ER [21] also uses multiplicative fusion to determine the more important modality on a per-sample basis. These fusion methods generally do not require alignment between modalities, but they have difficulty modeling intermodal sequential interactions. To overcome this limitation, attention-based fusion methods [22,23,24] and transformer-based methods [25,26,27,28] are widely used. The transformer [29] uses self-attention to analyze the correlation between the items constituting a sequence. Tsai et al. [25] introduced a cross-modal transformer: a multimodal transformer that provides latent cross-modal adaptation and fuses multimodal information by directly attending to low-level features in the other modalities. To improve performance with additional datasets, self-supervised training strategies that pre-train on large-scale unlabeled datasets have been applied to emotion recognition tasks [26,27,28]. Rahman et al. [26] deployed BERT [30] and XLNet [31] with multimodal adaptation gates. Khare et al. [28] trained a transformer on a masked language-modeling task for trimodal emotion recognition and used a cross-modal transformer model to analyze the input modalities in an intermodal sequential manner.
In this study, we also use a cross-modal multimodal transformer. However, to compensate for insufficient data, we deploy a data augmentation method instead of fine-tuning pre-trained transformers, which requires large-scale computing resources.

B. CROSS-MODAL TRANSLATION
To address the problem of data shortage and imbalance between modalities, recent studies have explored data augmentation through cross-modal translation. Projecting different modalities onto a shared semantic space is a commonly used method for representing and manipulating multiple modalities. Harwath et al. [32] proposed a method for projecting audio and images onto a shared embedding space and clustering the embeddings to translate them into related text words; this method allows text searches using only image and audio information. Qi et al. [33] also used a shared semantic space for image-text translation. They treated images and texts as two different languages, trained a cross-modal translation model using reinforcement learning, and then applied the training results to a cross-modal retrieval task. To analyze sentiments, Yang et al. [34] proposed a method for visual-to-text and audio-to-text translation with a pre-trained BERT using a shared semantic space.

FIGURE 1. Architecture of the proposed multimodal emotion classifier with cross-modal translator.
While simple linear mapping functions for cross-modal translation are used in fusion-based multimodal emotion recognition methods [16,21], standalone generative models such as GANs or autoencoders can also be deployed for cross-modal translation [7,11,12,33,34,35]. Tsai et al. [35] introduced an autoencoder-based modality reconstruction method for missing modalities. To augment audio datasets for audio emotion recognition, He et al. [7] introduced a visual-to-audio translator based on VAEGAN with a cycle reconstruction loss.
Although prior studies have made progress on this task, few attempts have been made to translate between all three modalities and use the results to train an integrated multimodal classifier. Additionally, little effort has been made to use multiple heterogeneous datasets together, and there has been no attempt to train an emotion recognizer and a cross-modal translator simultaneously in an end-to-end manner, rather than augmenting and feeding the data in a pipeline. In this study, we propose a method that addresses the data shortage and imbalance problems by simultaneously training a generative model and a classification model for the visual, audio, and text trimodal setting.

III. METHODOLOGY
We denote the audio, visual, and text modalities as $x_m$, $m \in \{a, v, t\}$. We embed the modalities into a shared latent semantic space and denote the embeddings as $z_a$, $z_v$, and $z_t$. To feed missing modalities into the trimodal emotion recognition model, we translate the given modality inputs into the missing modalities with translators $T_{m_1 \to m_2}$, where $m_1, m_2 \in \{a, v, t\}$.

A. End2End cross-modal translation and emotion recognition architecture
Fig. 1 shows the architecture of the proposed multimodal emotion recognition system using the cross-modal translation method. For trimodal data, we use the feature extractor $FE_m$ to extract a feature sequence from each modality and feed it to the corresponding transformer module. The final output is produced by a feed-forward layer and softmax applied to the multimodal fusion module, which takes the weighted sum of the modality module outputs as input. Equations (1)-(3) show how the final output is derived from the modality modules and the fusion module. We set a fixed weight of 0.33 for each of the three modalities in the experiments.
One crucial factor to consider with sequential data is the ability to focus on the most informative cues. For example, if one focuses on a moment with a strong facial expression, clear emotional vocabulary, or a strong tone of voice, emotions can be recognized more easily. The transformer is a well-known neural network model that reflects this characteristic. Positional embeddings [29] were added to the input features to account for the order of sequence components. We also used a transformer for the multimodal fusion module to learn how to combine the results of each modality. When a unimodal or bimodal sequence is provided, the missing modality is augmented using a cross-modal translator and fed as input to the corresponding transformer module. In Fig. 1, the visual feature is translated into the audio feature $z'_a$ and the text feature $z'_t$ via the visual-to-audio and visual-to-text translators; these are then fed into the corresponding transformer modules. Equations (4)-(8) show how the final output is derived from the cross-modal translators $T_{v \to a}$ and $T_{v \to t}$, the modality modules, and the fusion module. A minimal sketch of this forward pass is given below.
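The following PyTorch sketch illustrates this forward pass under stated assumptions; it is not the authors' released code. The per-modality transformer modules, the fixed 0.33 weighted sum feeding the fusion transformer, and translator look-ups for missing modalities follow the description above, while the class names, the dictionary-of-translators interface, and the equal sequence lengths across modalities are illustrative assumptions (module sizes follow Section IV).

```python
import torch
import torch.nn as nn

class ModalityModule(nn.Module):
    """Transformer encoder over one modality's feature sequence."""
    def __init__(self, d_model=128, n_heads=4, d_ff=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):              # x: (batch, seq_len, d_model)
        return self.encoder(x)

class TrimodalClassifier(nn.Module):
    """Weighted sum of modality-module outputs -> fusion transformer -> FF head."""
    def __init__(self, d_model=128, n_classes=6):
        super().__init__()
        self.mods = nn.ModuleDict({m: ModalityModule(d_model) for m in ("v", "a", "t")})
        self.fusion = ModalityModule(d_model)
        self.head = nn.Linear(d_model, n_classes)
        self.weights = {"v": 0.33, "a": 0.33, "t": 0.33}   # fixed weights, per the paper

    def forward(self, feats, translators=None):
        # feats: dict modality -> (batch, seq_len, d_model); missing modalities are
        # synthesised from an available one via a cross-modal translator.
        feats = dict(feats)
        if translators is not None:
            for tgt in ("v", "a", "t"):
                if tgt not in feats:
                    src = next(iter(feats))                      # any available modality
                    feats[tgt] = translators[(src, tgt)](feats[src])
        # Assumes all modality feature sequences share the same length here.
        fused = sum(self.weights[m] * self.mods[m](z) for m, z in feats.items())
        pooled = self.fusion(fused).mean(dim=1)                  # pool over time
        return self.head(pooled)       # logits; softmax is applied in the loss
```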

B. Cross-Modal Translator for Sequential Input
He et al. [7] proposed a VAEGAN-based visual-to-audio translator and augmented audio emotion data with it. However, that work handled only a single image and audio spectrum. In this study, we propose a sequential VAEGAN for the three modalities: visual, audio, and text. Because we use three modalities and pair every two of them, the proposed translation unit has six VAEGANs, and each VAEGAN has a feature extractor FE, an encoder Enc, a decoder Dec, and a discriminator Dis for a single image frame, word, or audio segment. The translator also includes a sequence discriminator SeqDis for the entire input sequence. Fig. 2 shows the visual-to-text translator. The visual sequence is fed into the visual feature extractor, and a fake text feature sequence is generated by the visual encoder and text decoder together with the discriminators. Each Enc_m, Dec_m, and Dis_m processes one sequence element at a time rather than the whole sequence, whereas SeqDis_m handles the sequential input to classify the input features more accurately. SeqDis_m uses a (K + 1)-class objective: K classes are used for ground-truth samples, and the (K + 1)-th class is used for fake samples. We used a transformer-based sequential classifier for the discriminator, sketched below. The training objective includes four components: the VAE, GAN, sequential discriminator, and cycle losses.
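A minimal sketch of such a sequence-level discriminator is shown below. The transformer configuration follows Section IV; the mean pooling and the exact class layout are illustrative assumptions.

```python
import torch.nn as nn

class SeqDis(nn.Module):
    """Sketch of a transformer-based sequence discriminator with a (K+1)-class
    objective: classes 0..K-1 for ground-truth emotion categories, class K for
    generated (fake) feature sequences."""
    def __init__(self, d_model=128, n_heads=4, d_ff=128, num_emotions=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.cls = nn.Linear(d_model, num_emotions + 1)   # K real classes + 1 fake class

    def forward(self, seq):                # seq: (batch, seq_len, d_model)
        h = self.encoder(seq).mean(dim=1)  # pool over the sequence dimension
        return self.cls(h)                 # (K+1)-way logits
```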
$x_i$ denotes the $i$-th constituent of sequence $x$, $D_g$ denotes the generated sequences, and $D_{gt}$ denotes the ground-truth examples. Computing the paired translation loss requires aligned multimodal ground-truth data. If the given data have no paired, aligned ground truth, we update the modules using only the losses that do not require alignment, including $L_{cycle}$, and omit the paired term. A hedged sketch of this objective is given below.
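The sketch below illustrates, under stated assumptions, how the four loss terms and the alignment-dependent paired term could be combined for one translation direction. The module signatures, the choice of L1 reconstruction, and the unit loss weights are assumptions; the paper's exact equations are not reproduced here.

```python
import torch
import torch.nn.functional as F

def translator_loss(src, tgt, label, modules, paired):
    """Hypothetical composition of the VAE, GAN, sequence-discriminator, and cycle
    losses. `modules` is a dict of callables with assumed signatures; `tgt` may be
    None when no aligned target sequence exists."""
    z, mu, logvar = modules["enc"](src)            # source encoder (VAE)
    fake_tgt = modules["dec"](z)                   # target decoder

    l_vae = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL term
    dis_out = modules["dis"](fake_tgt)             # per-element discriminator
    l_gan = F.binary_cross_entropy_with_logits(dis_out, torch.ones_like(dis_out))
    l_seq = F.cross_entropy(modules["seqdis"](fake_tgt), label)        # (K+1)-class
    z_back, _, _ = modules["enc_back"](fake_tgt)   # translate back for the cycle loss
    l_cycle = F.l1_loss(modules["dec_back"](z_back), src)

    loss = l_vae + l_gan + l_seq + l_cycle
    if paired and tgt is not None:                 # paired term needs aligned data
        loss = loss + F.l1_loss(fake_tgt, tgt)
    return loss
```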

C. Training Strategy
The proposed model uses an end2end strategy that learns the cross-modal translator and the multimodal emotion classifier simultaneously. To improve learning performance through cross-modal data augmentation, fake features are generated from the real examples for every possible translation direction. Algorithm 1 describes the learning strategy in detail; a simplified sketch of one joint update is shown below.
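Building on the classifier sketched in Section III-A, one joint update could look like the following. This is a simplified sketch, not Algorithm 1 verbatim, and the object names are assumptions.

```python
def end2end_step(batch, classifier, translators, criterion, optimizer):
    """One joint update: the classifier fills missing modalities with translated
    (fake) features internally, and the classification loss is backpropagated
    through both the classifier and the translators."""
    logits = classifier(batch["features"], translators)   # fake features generated inside
    loss = criterion(logits, batch["label"])              # classification loss
    optimizer.zero_grad()
    loss.backward()               # gradients reach the translators as well (end2end)
    optimizer.step()
    return loss.item()
```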

IV. EXPERIMENTS
A. Datasets
To evaluate performance on the emotion recognition task, we applied the proposed method to the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset [38] and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [39]. CMU-MOSEI is currently the largest publicly available multimodal dataset for emotion recognition. It comprises 23,453 single-speaker video segments from 1,000 distinct speakers and 250 topics, collected from YouTube. The dataset covers six emotions: happiness, sadness, anger, surprise, fear, and disgust. In addition to the visual and audio data, human-labeled transcriptions are included for linguistic emotion analysis. Detailed statistics are presented in Table 1.
The IEMOCAP dataset was built for multimodal human emotion analysis. It was recorded from ten actors in dyadic sessions, with facial markers providing detailed information about their facial expressions during scripted and spontaneous spoken communication scenarios. It contains four labeled emotion categories: angry, happy, neutral, and sad. Detailed statistics are presented in Table 2. To improve accuracy through the cross-modal translator, we utilized single- or bimodal emotion recognition datasets: AFEW [40] (video and audio), CK+ [3] (video), RAVDESS [41] (video and audio), and SemEval 2018 E-c [4] (text). AFEW contains videos with spontaneous expressions taken from movies and TV series; its training, validation, and test sets contain 773, 383, and 653 video files, respectively. CK+ consists of 529 videos from 123 subjects, ranging from 18 to 50 years of age and of various genders. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files from 24 distinct professional actors (12 female and 12 male). SemEval 2018 E-c (SemEval), a multi-label text emotion dataset, comprises 10,983 tweets with 11 labels indicating the presence or absence of emotions.
From these datasets, we used only the samples whose emotion labels are shared with CMU-MOSEI and merged them into a single larger auxiliary dataset, as sketched below.
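The snippet below is an illustrative sketch of this filtering and merging step; the per-dataset label spellings and the iterable interface are assumptions, not the authors' preprocessing code.

```python
# Keep only samples whose labels appear in the CMU-MOSEI emotion set and remap
# them to CMU-MOSEI label indices.
MOSEI_EMOTIONS = ["happiness", "sadness", "anger", "surprise", "fear", "disgust"]
LABEL_TO_INDEX = {e: i for i, e in enumerate(MOSEI_EMOTIONS)}

def merge_auxiliary(datasets):
    """datasets: list of iterables yielding (features_dict, label_string)."""
    merged = []
    for ds in datasets:
        for features, label in ds:
            if label in LABEL_TO_INDEX:                 # drop non-shared emotions
                merged.append((features, LABEL_TO_INDEX[label]))
    return merged
```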

B. Training Details
The model was trained using the Adam optimizer with a learning rate of 0.01 and a batch size of 64. The detailed settings of the feature extractor, transformer-based emotion recognizer, and cross-modal translator are as follows.

1) Feature extraction
To extract features from the visual, audio, and text data, we deployed a different feature extractor for each modality. For the visual modality, we deployed the video-based facial expression recognizer FAN [42], which uses frame attention to automatically highlight discriminative frames in the input video. For audio feature extraction, we follow the method and settings of the audio classification algorithm PANNs [43], with the learnable audio frontend LEAF [44] in place of Mel-filterbanks. For the visual and audio feature extractors, we added a fully connected layer with 300 hidden units so that the output can be fed to the cross-modal translator and the multimodal emotion recognizer. For the text modality, GloVe word embeddings [45] were used to extract word vectors from the transcripts, following the settings of the CMU-MOSEI SDK [46]. A sketch of this projection step is given below.
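The following sketch shows how the per-modality projection could be wrapped, assuming a generic backbone callable; the exact interfaces of FAN, PANNs, and LEAF are not reproduced here, and only the 300-unit fully connected layer comes from the paper.

```python
import torch.nn as nn

class FeatureProjector(nn.Module):
    """Maps the output of a modality backbone (e.g., FAN for video, PANNs+LEAF
    for audio, GloVe lookups for text) to a 300-dimensional feature per time
    step via a fully connected layer. The backbone is a placeholder callable."""
    def __init__(self, backbone, backbone_dim, out_dim=300):
        super().__init__()
        self.backbone = backbone                      # e.g., a frozen FAN or PANNs model
        self.proj = nn.Linear(backbone_dim, out_dim)  # 300 hidden units, per the paper

    def forward(self, raw):                           # raw: modality-specific input
        feats = self.backbone(raw)                    # (batch, seq_len, backbone_dim)
        return self.proj(feats)                       # (batch, seq_len, 300)
```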

2) Multi-modal emotion recognition
The transformers for the audio, video, text, and multimodal fusion modules share the same architecture: a feed-forward layer of dimension 128, four attention heads, and 128 hidden nodes for attention.

3) Cross-modal translation module
Based on the work in [47], we added the SeqDis discriminator for sequential features. SeqDis consists of one transformer encoder with a feed-forward layer of dimension 128, four attention heads, and 128 hidden nodes for attention.
We shared the weights of the last layer of the encoders and decoders of each modality translator so that the features are embedded in the same latent semantic space, as in the sketch below.
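A minimal sketch of this weight tying: a single linear layer instance is reused as the final layer of every modality encoder, so all encoders project into the same latent space (the decoders can be tied symmetrically). Layer sizes and the two-layer encoder body are illustrative assumptions.

```python
import torch.nn as nn

# One shared instance of the final projection into the latent semantic space.
shared_latent = nn.Linear(128, 128)

encoders = nn.ModuleDict({
    m: nn.Sequential(nn.Linear(300, 128), nn.ReLU(), shared_latent)
    for m in ("v", "a", "t")     # the same module object, so its weights are tied
})
```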

C. Results
We used the weighted accuracy (WA) [48] and the F1-score for each emotion owing to the natural class imbalance across emotions (one common WA formulation is sketched below). Table 3 lists the performance of the proposed models on the CMU-MOSEI dataset. For comparison, we included the graph memory fusion network (GraphMFN) [38], which was published along with the CMU-MOSEI dataset, and Khare's cross-modal transformer-based multimodal emotion recognition method [28]. In Table 3, "Multimodal transformer" denotes the proposed transformer-based multimodal emotion recognition model without a cross-modal translation module. "+Cross-modal translation" shows the performance when the data augmentation strategy of Algorithm 1 is applied using the proposed cross-modal translation model and the multimodal transformer. With this data augmentation, the performance improved by an average of 1.6% in terms of WA and 2.7% in terms of F1. When the proposed SeqDis was used, performance improved by 2.2% in WA and 3.0% in F1. "+Auxiliary dataset" indicates the model performance when unimodal and bimodal datasets are added to CMU-MOSEI; a significant improvement of 5.4% in WA and 8.5% in F1 was observed. A similar trend was observed on IEMOCAP: with the proposed method and the additional datasets, performance improved by 10.4% in WA and 14.8% in F1, demonstrating the effectiveness of the proposed method.
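For reference, one commonly used formulation of weighted accuracy for per-emotion binary labels is sketched below; this is an assumption for illustration, and the exact definition in [48] should be consulted.

```python
import numpy as np
from sklearn.metrics import f1_score

def weighted_accuracy(y_true, y_pred):
    """One common WA formulation: the average of the positive-class and
    negative-class recall for a binary per-emotion label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    pos, neg = np.sum(y_true == 1), np.sum(y_true == 0)
    return 0.5 * (tp / max(pos, 1) + tn / max(neg, 1))

# The per-emotion F1-score can be computed directly, e.g. f1_score(y_true, y_pred).
```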

Confusion Matrix
We show the per-class performance of the proposed model with the auxiliary dataset on CMU-MOSEI and IEMOCAP using the confusion matrices in Fig. 3. With the auxiliary dataset, the proposed algorithm achieved an accuracy of over 70% for each class. Owing to data imbalance, samples tend to be misclassified into the classes with more samples; for example, the most common error is misclassification into the happy class in CMU-MOSEI and the neutral class in IEMOCAP.

Tables 5 and 6 show the results of analyzing the data augmentation effect. When the data are augmented through cross-modal translation, even using only 90% of the total data yields a better F1 score than training on 100% of the dataset without augmentation. In addition, data augmentation improves performance regardless of the data size; however, the larger the data size, the greater the gain. The augmentation effect appears to grow with the amount of training data because cross-modal translation itself requires a sufficient amount of training data to learn.

Ablation Study
To determine how much each individual modality affects the model, we conducted an ablation experiment. Tables 7 and 8 show the performance changes when the data are augmented for each modality; T, V, and A represent the text, video, and audio data, respectively. In both datasets, the greatest performance improvement was achieved when augmentation was applied to all modalities; in particular, the IEMOCAP dataset exhibited the highest performance for all classes. Table 9 compares the performance of the pipeline and end2end strategies. In the pipeline strategy, we first train the cross-modal translator and then train the multimodal transformer-based emotion classifier. In the end2end strategy, we train the cross-modal translator and the multimodal emotion classifier simultaneously; we set a ratio of one translator-only update per n emotion classifier iterations and compare the performance. The results in Table 9 show that, on average, the end2end strategy with a 1:5 or 1:7 balance outperforms the pipeline strategy, and end2end with a 1:7 balance showed the best overall performance. Whenever a fake feature is generated using the translator and classification is performed on the fake features, both the translator and the classifier are updated by backpropagation. The experiments confirmed that performance improves if the translator alone is occasionally updated; however, when the proportion of translator-only updates was relatively high (1:3), performance was worse than with the pipelined method. It appears that if the translator is frequently trained alone, the features are translated independently of the classifier. A sketch of this update schedule is given below.
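A hypothetical sketch of the 1:n schedule compared in Table 9, assuming the joint step sketched earlier and a separate translator-only step; the interleaving pattern is an assumption about how the ratio is realized.

```python
def train_epoch(loader, joint_step, translator_only_step, n=7):
    """Run one translator-only update for every n joint (classifier + translator)
    updates, i.e. the 1:7 balance that performed best in Table 9."""
    for i, batch in enumerate(loader):
        if i % (n + 1) == n:              # every (n+1)-th batch: translator only
            translator_only_step(batch)
        else:
            joint_step(batch)             # update translator and classifier together
```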

V. CONCLUSION AND FUTURE WORK
We presented a multimodal emotion recognition model that uses cross-modal translators. With the proposed method, heterogeneous datasets with different modalities can be further exploited. For inter-modal translation, we proposed novel cross-modal translators that use a sequential discriminator to handle unaligned multimodal sequence data. The proposed model learns the cross-modal translators and the multimodal emotion recognizer simultaneously, and this strategy further improves performance. The empirical results demonstrate that the proposed method efficiently handles multiple datasets with different modalities, which can significantly decrease the cost of constructing, aligning, or reorganizing datasets. In future work, we plan to apply our method to self-supervised learning with multimodal datasets. Self-supervised learning strategies are known to improve robustness and performance; we expect that unlabeled heterogeneous datasets could help train the multimodal model and that the cross-modal translators would become more robust through the self-supervised learning process.