XDGesture: An xLSTM-based Diffusion Model for Co-speech Gesture Generation


Abstract:

In multimodal human-computer interaction, generating co-speech gestures is crucial for enhancing interaction naturalness and user experience. However, producing gesture sequences that are both natural and synchronized with speech remains a significant challenge, owing to the complexity of modeling temporal dependencies across modalities. Existing methods often rely on simple concatenation, which is ill-suited to exploiting multimodal information effectively. To address this issue, we propose XDGesture, a diffusion-based framework that integrates a Cross-Modal Fusion module with xLSTM. The Cross-Modal Fusion module efficiently merges information from different modalities, providing the model with rich contextual conditions. The xLSTM, with its enhanced memory structure and exponential gating mechanism, then processes the fused multimodal data, capturing long-range dependencies between speech and gestures. This enables the generation of high-quality gesture sequences that are naturally synchronized with speech. Experimental results demonstrate that XDGesture substantially outperforms existing baselines on multiple datasets, particularly in gesture quality, naturalness, and synchronization with speech.
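The abstract describes the architecture only at a high level, and no code accompanies this page. Below is a minimal PyTorch-style sketch, under assumptions, of the pipeline the abstract outlines: a cross-modal fusion block (sketched here as cross-attention from gesture latents to audio features), a simplified sLSTM-style recurrent cell with the exponential gating of the xLSTM family (Beck et al., 2024; the log-space stabilizer of the original is omitted for brevity), and a toy diffusion denoiser that predicts noise from the fused sequence. All names (CrossModalFusion, SLSTMCell, XDGestureSketch) and hyperparameters are hypothetical illustrations, not the authors' implementation.

import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    # Hypothetical fusion block: cross-attention from gesture latents to
    # audio features, standing in for the paper's Cross-Modal Fusion module.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, gesture, audio):
        fused, _ = self.attn(query=gesture, key=audio, value=audio)
        return self.norm(gesture + fused)


class SLSTMCell(nn.Module):
    # Simplified sLSTM-style cell with exponential gating; the numerical
    # stabilization of the published xLSTM recurrence is omitted here.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, 4 * dim)

    def forward(self, x, state):
        h, c, n = state                            # hidden, cell, normalizer
        i_pre, f_pre, g, o = self.proj(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
        i = torch.exp(i_pre)                       # exponential input gate
        f = torch.exp(f_pre)                       # exponential forget gate
        c = f * c + i * torch.tanh(g)              # memory update
        n = f * n + i                              # normalizer keeps h bounded
        h = torch.sigmoid(o) * (c / n)
        return h, (h, c, n)


class XDGestureSketch(nn.Module):
    # Toy denoiser: fuse audio with noisy gesture latents, run the result
    # through the recurrent cell, and predict the diffusion noise.
    def __init__(self, dim, n_steps=1000):
        super().__init__()
        self.fusion = CrossModalFusion(dim)
        self.cell = SLSTMCell(dim)
        self.t_embed = nn.Embedding(n_steps, dim)  # diffusion timestep embedding
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_gesture, audio, t):
        x = self.fusion(noisy_gesture, audio) + self.t_embed(t)[:, None, :]
        B, T, D = x.shape
        state = (torch.zeros(B, D), torch.zeros(B, D), torch.ones(B, D))
        outs = []
        for step in range(T):
            h, state = self.cell(x[:, step], state)
            outs.append(h)
        return self.out(torch.stack(outs, dim=1))  # predicted noise


# Shape check only: audio features and gesture latents both projected to dim=64.
model = XDGestureSketch(dim=64)
eps_hat = model(torch.randn(2, 30, 64), torch.randn(2, 30, 64),
                torch.randint(0, 1000, (2,)))
print(eps_hat.shape)  # torch.Size([2, 30, 64])

The normalizer state n is what makes the exponential gates workable: dividing the cell state by it keeps the hidden state bounded even though the gate activations are unbounded, which is the key departure from the sigmoid-gated classical LSTM.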
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
