We use facial animation parameters (FAPs), which the MPEG-4 standard supports for the visual representation of speech, to significantly improve automatic speech recognition (ASR). We describe a robust, automatic algorithm for extracting FAPs from visual data that requires neither hand labeling nor extensive training procedures. Multi-stream hidden Markov models (HMMs) are used to integrate audio and visual information. ASR experiments are performed under both clean and noisy audio conditions using a relatively large vocabulary (approximately 1000 words). The proposed system reduces the word error rate (WER) by 20% to 23% relative to the audio-only ASR WER at various signal-to-noise ratios (SNRs) with additive white Gaussian noise, and by 19% relative to the audio-only ASR WER under clean audio conditions.
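For readers unfamiliar with the two quantities above, the following is a minimal sketch, not the authors' implementation. It assumes the common multi-stream HMM formulation in which per-state audio and visual log-likelihoods are combined with exponent weights that sum to one; the abstract does not state the weighting scheme actually used. The function names and the numbers in the usage lines are hypothetical and serve only to illustrate the arithmetic.

```python
def multistream_log_likelihood(log_b_audio: float,
                               log_b_visual: float,
                               audio_weight: float) -> float:
    """Combine per-stream state log-likelihoods (assumed formulation).

    Assumes b(o) = b_audio(o_a)**w * b_visual(o_v)**(1 - w), which in the
    log domain is a weighted sum. The weight w is a hypothetical parameter.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    if baseline_wer <= 0.0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Illustrative values only; these are not results from the paper.
combined = multistream_log_likelihood(-10.0, -20.0, audio_weight=0.7)
reduction = relative_wer_reduction(baseline_wer=50.0, system_wer=40.0)
print(f"combined log-likelihood: {combined:.2f}")
print(f"relative WER reduction: {reduction:.1f}%")
```

A relative reduction is measured against the audio-only baseline, so a drop from a hypothetical 50% WER to 40% WER is a 20% relative reduction, even though the absolute difference is 10 percentage points.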