Abstract:
In multimodal human-computer interaction, generating co-speech gestures is crucial for enhancing interaction naturalness and user experience. However, achieving synchronized and natural gesture sequences remains a significant challenge due to the complexity of modeling temporal dependencies across different modalities. Existing methods often rely on simple concatenation, which limits how effectively multimodal information can be exploited. To address this issue, we propose XDGesture, a diffusion-based framework that integrates a Cross-Modal Fusion module and xLSTM. The Cross-Modal Fusion module efficiently merges information from different modalities, providing the model with rich contextual conditions. Meanwhile, xLSTM, with its enhanced memory structure and exponential gating mechanism, processes the fused multimodal data, capturing long-range dependencies between speech and gestures. This enables the generation of high-quality gesture sequences that are naturally synchronized with speech. Experimental results demonstrate that XDGesture markedly outperforms existing baselines on multiple datasets, particularly in gesture quality, naturalness, and synchronization with speech.
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
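The abstract describes the architecture only at a high level: speech-derived features are merged by a Cross-Modal Fusion module, and the fused sequence conditions an xLSTM-driven diffusion denoiser that predicts gesture poses. The paper text here includes no code, so the sketch below is a minimal, hypothetical PyTorch illustration of that conditioning pipeline. All module names, feature dimensions, the cross-attention fusion, and the use of a standard nn.LSTM as a stand-in for xLSTM are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical fusion of two speech-side modalities via cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Audio frames attend to text tokens; residual + norm keeps the fused
        # sequence aligned frame-by-frame with the gesture timeline.
        fused, _ = self.attn(query=audio, key=text, value=text)
        return self.norm(audio + fused)


class GestureDenoiser(nn.Module):
    """Predicts the noise added to a gesture sequence, given fused speech features.

    A plain nn.LSTM stands in for xLSTM here; the paper's xLSTM adds exponential
    gating and an enhanced memory structure on top of this recurrent backbone.
    """

    def __init__(self, pose_dim: int = 165, cond_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + cond_dim + 1, hidden)  # +1 for the timestep
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out_proj = nn.Linear(hidden, pose_dim)

    def forward(self, noisy_pose, cond, t):
        # Broadcast the (normalized) diffusion step over the sequence length.
        t_feat = t.view(-1, 1, 1).expand(-1, noisy_pose.size(1), 1)
        x = torch.cat([noisy_pose, cond, t_feat], dim=-1)
        h, _ = self.rnn(self.in_proj(x))
        return self.out_proj(h)


# Toy forward pass with made-up shapes: 2 clips, 120 gesture frames, 40 text tokens.
audio = torch.randn(2, 120, 256)
text = torch.randn(2, 40, 256)
noisy_pose = torch.randn(2, 120, 165)
t = torch.rand(2)  # diffusion timestep in [0, 1]

fusion = CrossModalFusion()
denoiser = GestureDenoiser()
eps_hat = denoiser(noisy_pose, fusion(audio, text), t)
print(eps_hat.shape)  # torch.Size([2, 120, 165])
```

In a full diffusion training loop, eps_hat would be regressed against the injected noise with an MSE loss at randomly sampled timesteps, and the stand-in LSTM would be replaced by xLSTM blocks to obtain the long-range speech-gesture dependencies the abstract attributes to the exponential gating and enhanced memory.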