1. Introduction
3D facial animation has been an active research topic for decades, owing to its broad applications in virtual reality, film production, and games. The high correlation between speech and facial gestures (especially lip movements) makes it possible to drive facial animation with a speech signal. Early attempts mainly built complex mapping rules between phonemes and their visual counterparts, and usually achieved limited performance [53], [63]. With the advances in deep learning, recent speech-driven facial animation techniques have significantly advanced the state of the art. However, generating human-like motions remains challenging.