The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research has demonstrated that hidden Markov model (HMM) techniques, which have been applied successfully to speech recognition, can achieve a similar level of success in audio/visual mapping. A number of HMM-based methods have been proposed and shown to be effective by their respective designers, but it remains unclear how these techniques compare to each other on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods, namely the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the merits and demerits of the different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for speech-driven facial animation. We present a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. Finally, we present an empirical comparison on an experimental audio-visual dataset consisting of 75 TIMIT sentences. Our results show that HMMI provides the best performance on both synthetic and experimental audio-visual data.