Abstract:
In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective at non-verbal sound emotion recognition (NVER) because they better interpret and differentiate subtle emotional cues that may be ambiguous to audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. Inspired by research in speech recognition and audio deepfake detection, we also investigate whether combining selected foundation model (FM) representations can enhance NVER further. To this end, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Using MATA with the combination of two MFMs, LanguageBind and ImageBind, we obtain the best performance, with accuracies of 76.47%, 77.40%, and 75.12% and F1-scores of 70.35%, 76.19%, and 74.63% on the ASVP-ESD, JNV, and VIVAE datasets, respectively, outperforming individual FMs and baseline fusion techniques and setting a new SOTA on these benchmark datasets.
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
- IEEE Keywords
- Index Terms
- Emotion Recognition
- Optimal Transport
- Foundation Model
- Nonverbal Emotion Recognition
- Multiple Modalities
- Speech Recognition
- Benchmark Datasets
- Fusion Techniques
- Subtle Cues
- Disgust
- Attention Mechanism
- Original Features
- Understanding Of Context
- Convolutional Block
- Self-supervised Learning
- Complementary Behavior