We propose a real-time system that detects the speaker's frontal face for multimodal speech recognition. It is widely acknowledged that automatic speech recognizers, like humans, can improve recognition performance by adding the visual modality, i.e., the speaker's facial image, to the audio modality. The visual modality also provides inaudible information, such as the speaker's facial orientation and the location of the mouth. To acquire this information, the speaker's face must be localized in real time. Our system combines skin-color detection with spatial feature detection. Color-based detection is fast but depends on the skin and background colors, whereas spatial feature detection is more computationally expensive. We therefore apply color-based pruning to reduce the search space for the spatial feature detector. By detecting the facial orientation, the proposed method functions as a "face-to-talk" switch in place of a "push-to-talk" switch. In our experiments, color-based pruning eliminated 53-97% of the search space, and 98.9% of frontal faces were detected correctly by the subsequent spatial detector.
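To illustrate the pruning idea, the following is a minimal sketch (not the paper's implementation) of how a cheap skin-color mask can shrink the region that an expensive spatial feature detector must scan. The chromaticity thresholds and the toy frame are illustrative assumptions, not values from the paper:

```python
import numpy as np

def skin_mask(rgb):
    """Rough skin-color mask in normalized rg chromaticity.

    The thresholds below are illustrative assumptions only.
    """
    rgb = rgb.astype(np.float64)
    s = rgb.sum(axis=2) + 1e-6          # per-pixel brightness (avoid /0)
    r = rgb[..., 0] / s                 # normalized red
    g = rgb[..., 1] / s                 # normalized green
    return (r > 0.35) & (r < 0.55) & (g > 0.25) & (g < 0.40)

def pruned_region(rgb):
    """Bounding box of skin-colored pixels: the only area the
    (more expensive) spatial feature detector would need to scan."""
    ys, xs = np.nonzero(skin_mask(rgb))
    if xs.size == 0:
        return None                     # no skin pixels: skip spatial detection
    return (int(xs.min()), int(ys.min()),
            int(xs.max()) + 1, int(ys.max()) + 1)

# Toy frame: blue background with one skin-toned patch
frame = np.zeros((100, 100, 3), np.uint8)
frame[..., 2] = 200                     # blue background
frame[30:60, 40:70] = (180, 120, 90)    # skin-toned block
box = pruned_region(frame)
x0, y0, x1, y1 = box
saved = 1 - (x1 - x0) * (y1 - y0) / (frame.shape[0] * frame.shape[1])
print(box, f"{100 * saved:.0f}% of the frame pruned")
```

The spatial detector then runs only inside the returned box, which is how color-based pruning trades a cheap per-pixel test for a large reduction in the costly search.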