1. INTRODUCTION
Automatic Speech Recognition (ASR) becomes a difficult task when the speech signal is distorted by noise. As in human conversation, the visual modality sometimes becomes necessary to understand speech accurately, since it is invariant to the presence of acoustic noise. Audio-Visual Speech Recognition (AVSR) is the task of generating text transcriptions from both the audio and the auxiliary visual evidence. Although many kinds of visual sources [1] can be useful for speech recognition, lip motion [2, 3] is considered the most relevant and beneficial evidence in the literature.