We present a system capable of producing video-realistic animation of a speaker given only an audio signal. The audio input requires no phonetic labelling and is speaker independent. The system needs only a small training set of video to achieve convincing, realistic facial synthesis. It learns the natural mouth and face dynamics of a speaker, allowing new facial poses, unseen in the training video, to be synthesised. To achieve this we have developed a novel approach which utilises a hierarchical and nonlinear PCA model that couples speech and appearance. We show that the model is capable of synthesising video of a speaker using new audio segments from both previously heard and unheard speakers. The model is highly compact, making it suitable for a wide range of real-time multimedia and telecommunications applications on standard hardware.
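To illustrate the coupling idea at the heart of such a model, the following is a minimal linear sketch, not the paper's hierarchical nonlinear model: per-frame audio and appearance feature vectors are concatenated and modelled with a single PCA, and at synthesis time the shared latent coordinates are estimated from the audio block alone and used to reconstruct the appearance block. All data shapes and feature names here are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: per-frame audio features (e.g. filterbank
# coefficients) and appearance parameters (e.g. coordinates of mouth pixels
# in an appearance model), both driven by shared underlying dynamics.
n_frames, n_audio, n_appear = 200, 12, 20
latent = rng.normal(size=(n_frames, 5))        # shared latent trajectory
A = latent @ rng.normal(size=(5, n_audio))     # audio features per frame
V = latent @ rng.normal(size=(5, n_appear))    # appearance features per frame

# Couple the two modalities with one PCA over the concatenated vectors.
X = np.hstack([A, V])
mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 5
P = Vt[:k]                                     # joint basis, k x (n_audio + n_appear)

def audio_to_appearance(a):
    """Estimate latent coordinates from the audio block of the basis,
    then reconstruct the coupled appearance block."""
    Pa, Pv = P[:, :n_audio], P[:, n_audio:]
    b, *_ = np.linalg.lstsq(Pa.T, a - mean[:n_audio], rcond=None)
    return mean[n_audio:] + b @ Pv

# Driving the model with an audio frame recovers its appearance frame.
v_pred = audio_to_appearance(A[0])
err = np.linalg.norm(v_pred - V[0]) / np.linalg.norm(V[0])
```

Because the latent coordinates are shared across modalities, conditioning on the audio block determines the appearance block; the paper's contribution is, in effect, a hierarchical and nonlinear version of this coupling that captures real facial dynamics.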