Abstract:
Multimodal humor and sarcasm detection remains a key challenge in artificial intelligence. Despite recent advances, inconsistencies in feature extraction, evaluation methods, and experimental setups have hindered fair comparisons across approaches. To address this issue, we propose the Multimodal Humor and Sarcasm Detection Benchmark (MHSDB), the first unified evaluation platform specifically designed for these tasks. MHSDB combines four datasets in English and Hindi and standardizes feature extraction and evaluation processes to enable consistent comparisons. We systematically evaluate mainstream foundation models across audio, video, and text modalities. Unimodal representations are assessed using self-attention mechanisms, while multimodal representations are evaluated through mainstream fusion strategies, including utterance-level and sequence-level approaches. Our experimental results reveal that multimodal approaches outperform unimodal ones in capturing complex contexts and multi-layered semantics. Additionally, specific fusion strategies excel at integrating cross-modal information, achieving state-of-the-art performance and paving the way for future research on optimizing feature representation and multimodal fusion.
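To illustrate the distinction between the two fusion families the abstract mentions, the sketch below contrasts utterance-level fusion (pool each modality over time, then concatenate the pooled vectors) with sequence-level fusion (concatenate per-timestep features, preserving the temporal axis for a downstream sequence model). All names and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical per-modality features for one utterance: T timesteps,
# d dimensions per modality (values are random placeholders).
rng = np.random.default_rng(0)
T, d = 10, 16
audio = rng.standard_normal((T, d))
video = rng.standard_normal((T, d))
text = rng.standard_normal((T, d))

def utterance_level_fusion(*modalities):
    # Pool each modality over time (mean here), then concatenate the
    # pooled vectors into a single utterance-level representation.
    return np.concatenate([m.mean(axis=0) for m in modalities])  # shape (3d,)

def sequence_level_fusion(*modalities):
    # Concatenate per-timestep features along the feature axis, keeping
    # the temporal axis intact so a sequence model (e.g. self-attention)
    # can fuse cross-modal information at every step.
    return np.concatenate(modalities, axis=-1)  # shape (T, 3d)

print(utterance_level_fusion(audio, video, text).shape)  # (48,)
print(sequence_level_fusion(audio, video, text).shape)   # (10, 48)
```

Utterance-level fusion discards temporal alignment but is cheap; sequence-level fusion retains it, which matters when a sarcastic cue in one modality must be matched to a specific moment in another.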
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025