I. Introduction
As communication methods diversify, effective emotion recognition has become increasingly critical. Among emotional expressions, humor and sarcasm stand out as particularly complex and prevalent, and they have attracted considerable research interest [1]. Humor often involves irony or exaggeration, while sarcasm typically relies on a delicate interplay of vocabulary, gestures, and tone. Detecting humor and sarcasm from text or speech alone is more challenging than classic emotion recognition tasks [2]–[4], because these phenomena require deeper semantic understanding. Integrating multimodal signals, such as visual cues and speech patterns, is therefore vital for capturing the subtleties of these complex emotions.