Abstract:
Co-speech gestures are a principal component in conveying messages and enhancing interaction between humans, and a critical ingredient in human-agent interaction, including interaction with virtual agents and robots. Existing machine learning approaches have yielded only marginal success in learning speech-to-motion at the frame level. Current methods generate repetitive gesture sequences that lack appropriateness with respect to the speech context. To tackle this challenge, we take inspiration from successes in natural language processing on context and long-term dependencies, and propose a new framework that views text-to-gesture as machine translation, where gestures are words in another (non-verbal) language. We propose a vector-quantized variational autoencoder structure, together with training techniques, to learn a rigorous representation of gesture sequences. We then translate input text into a discrete sequence of associated gesture chunks in the learned gesture space. Finally, we feed the gesture tokens translated from the input text to the autoencoder's decoder to produce gesture sequences. Subjective and objective evaluations confirm the success of our approach in terms of appropriateness, human-likeness, and diversity. We also introduce new objective metrics using the quantized gesture representation.
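As an illustration of the quantization idea only, the sketch below maps continuous latent vectors to discrete tokens by nearest-neighbour lookup in a codebook, then maps tokens back to codebook vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the two-dimensional toy codebook, and the sample latents are all hypothetical, and the encoder, decoder, and text-to-token translation model are omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def lookup(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Replace each token index with its codebook vector (decoder input)."""
    return [codebook[t] for t in tokens]


# Hypothetical toy values: a 3-entry codebook and three encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

tokens = quantize(latents, codebook)     # discrete "gesture words"
quantized = lookup(tokens, codebook)     # vectors a decoder would consume
print(tokens)
```

In a pipeline of the kind the abstract describes, a translation model would predict the token sequence from text, and a trained decoder would turn the looked-up vectors into motion.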
Date of Conference: 23-27 October 2022
Date Added to IEEE Xplore: 26 December 2022