MHSDB: A Comprehensive Benchmark for Multimodal Humor and Sarcasm Detection Leveraging Foundation Models | IEEE Conference Publication | IEEE Xplore


Abstract:

Understanding multimodal humor and sarcasm detection remains a key challenge in artificial intelligence. Despite recent advances, inconsistencies in feature extraction, evaluation methods, and experimental setups have hindered fair comparisons across different approaches. To address this issue, we propose the Multimodal Humor and Sarcasm Detection Benchmark (MHSDB), the first unified evaluation platform specifically designed for these tasks. MHSDB combines four datasets in English and Hindi and standardizes feature extraction and evaluation processes to facilitate consistent comparisons. We systematically evaluate mainstream foundation models across audio, video, and text modalities. Unimodal representations are assessed using self-attention mechanisms, while multimodal representations are evaluated through mainstream fusion strategies, including utterance-level and sequence-level approaches. Our experimental results reveal that multimodal approaches outperform unimodal ones in capturing complex contexts and multi-layered semantics. Additionally, specific fusion strategies excel at integrating cross-modal information, achieving state-of-the-art performance, and paving the way for future research on optimizing feature representation and multimodal fusion.
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India

I. Introduction

As communication methods diversify, effective emotion recognition has become increasingly critical. Among various emotional expressions, humor and sarcasm stand out as particularly complex and prevalent, attracting considerable research interest [1]. Humor often involves irony or exaggeration, while sarcasm typically relies on a delicate interplay of vocabulary, gestures, and tone. Detecting humor and sarcasm from text or speech alone is more challenging than classic emotion recognition tasks [2]–[4], as these phenomena require deeper semantic understanding. Integrating multimodal signals, such as visual cues and speech patterns, is therefore vital for capturing the subtleties of these complex emotions.
