
Audio-Visual ASR from Multiple Views inside Smart Rooms

2 Author(s)
G. Potamianos (Dept. of Human Language Technology, IBM Thomas J. Watson Research Center, Yorktown Heights, NY); P. Lucey

Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human-computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms equipped with pan-tilt-zoom (PTZ) cameras that closely track the subject's head. However, since these cameras are fixed in space, they cannot always obtain frontal views of the speaker. Clearly, AVASR from non-frontal views is required, as well as fusion of multiple camera views, where available. In this paper, we report our very preliminary work on this subject. In particular, we concentrate on two topics: first, the design of an AVASR system that operates on profile face views and its comparison with a traditional frontal-view AVASR system, and second, the fusion of the two systems into a multi-view frontal/profile system. We describe our visual front end approach for the profile-view system, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video, recorded inside the IBM smart room as part of the CHIL project. Our experiments demonstrate that AVASR is possible from profile views; however, the visual modality benefit is reduced compared to frontal video data.
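The abstract does not spell out the fusion rule used to combine the frontal and profile systems. A common technique in multi-stream AVASR is to combine per-stream HMM state log-likelihoods with non-negative exponent weights. The sketch below, in Python with NumPy, illustrates that general idea for audio, frontal-video, and profile-video streams; the function name, array shapes, and weight values are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def fused_log_likelihood(stream_loglikes: np.ndarray,
                             weights: np.ndarray) -> np.ndarray:
        """Weighted log-likelihood fusion for a multi-stream HMM state.

        stream_loglikes: (num_streams, T) array of log p_s(o_t | state),
                         one row per stream (e.g., audio, frontal, profile).
        weights:         (num_streams,) non-negative stream exponents,
                         typically constrained to sum to 1.
        Returns the fused per-frame log-likelihood, shape (T,).
        """
        weights = np.asarray(weights, dtype=float)
        assert stream_loglikes.shape[0] == weights.shape[0]
        # Exponent weighting in the probability domain is a weighted
        # sum in the log domain: sum_s w_s * log p_s(o_t | state).
        return weights @ stream_loglikes

    # Example with made-up likelihoods and weights (audio 0.6,
    # frontal video 0.25, profile video 0.15) for 5 frames.
    rng = np.random.default_rng(0)
    loglikes = rng.normal(-10.0, 1.0, size=(3, 5))
    print(fused_log_likelihood(loglikes, np.array([0.6, 0.25, 0.15])))

In practice the stream weights are tuned on held-out data, and a lower weight on the profile-video stream would be consistent with the paper's finding that the visual benefit shrinks for non-frontal views.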

Published in:

2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems

Date of Conference:

Sept. 2006