This work presents an integral system capable of generating animations with realistic dynamics, including the individualized nuances, of three-dimensional (3-D) human faces driven by speech acoustics. The system is capable of capturing short phenomena in the orofacial dynamics of a given speaker by tracking the 3-D location of various MPEG-4 facial points through stereovision. A perceptual transformation of the speech spectral envelope and prosodic cues are combined into an acoustic feature vector to predict 3-D orofacial dynamics by means of a nearest-neighbor algorithm. The Karhunen-Loève transformation is used to identify the principal components of orofacial motion, decoupling perceptually natural components from experimental noise. We also present a highly optimized MPEG-4 compliant player capable of generating audio-synchronized animations at 60 frames/s. The player is based on a pseudo-muscle model augmented with a nonpenetrable ellipsoidal structure to approximate the skull and the jaw. This structure adds a sense of volume that provides more realistic dynamics than existing simplified pseudo-muscle-based approaches, yet it is simple enough to work at the desired frame rate. Experimental results on an audiovisual database of compact TIMIT sentences are presented to illustrate the performance of the complete system.
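The nearest-neighbor prediction step mentioned above can be illustrated with a minimal sketch. It is not the system's actual implementation: the function name `nearest_neighbor_predict`, the Euclidean distance, the averaging over `k` neighbors, and the toy data are all assumptions made for illustration. Each training frame pairs an acoustic feature vector with a vector of orofacial parameters (for example, principal-component coefficients), and a new acoustic frame is mapped to the parameters of its closest training frames.

```python
import math


def nearest_neighbor_predict(train_audio, train_visual, query, k=1):
    """Predict a visual parameter vector for one acoustic frame.

    train_audio  -- list of acoustic feature vectors (hypothetical training frames)
    train_visual -- list of visual parameter vectors, aligned with train_audio
    query        -- acoustic feature vector of the frame to animate
    k            -- number of neighbors to average (an assumption, not from the source)
    """
    if not train_audio or len(train_audio) != len(train_visual):
        raise ValueError("training sets must be non-empty and aligned")
    k = min(k, len(train_audio))
    # Rank training frames by Euclidean distance in the acoustic feature space.
    order = sorted(range(len(train_audio)),
                   key=lambda i: math.dist(train_audio[i], query))
    nearest = order[:k]
    dim = len(train_visual[0])
    # Average the visual vectors of the k closest frames.
    return [sum(train_visual[i][d] for i in nearest) / k for d in range(dim)]


# Toy data: three training frames with 2-D acoustic and 1-D visual vectors.
audio = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
visual = [[0.0], [2.0], [10.0]]
prediction_k1 = nearest_neighbor_predict(audio, visual, [0.9, 1.2], k=1)
prediction_k2 = nearest_neighbor_predict(audio, visual, [0.9, 1.2], k=2)
```

With `k=1` the prediction is simply the visual vector of the single closest training frame; a larger `k` smooths the output at the cost of blurring speaker-specific detail.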