Audio-Visual Fusion and Tracking With Multilevel Iterative Decoding: Framework and Experimental Evaluation

3 Author(s)
Shankar T. Shivappa (Department of Electrical and Computer Engineering, University of California, San Diego, USA); Bhaskar D. Rao; Mohan Manubhai Trivedi

Speech is a natural interface for human communication. However, building human-computer interfaces in unconstrained intelligent spaces remains a challenging task. Incorporating video information has been shown to improve the performance of many audio applications. Similarly, information from microphones is useful in computer vision tasks. One of the first steps in enabling natural human-computer interaction is person tracking. In this paper, we present a new approach to person tracking using both audio and visual information. We develop a multilevel framework that combines audio and visual cues to track multiple persons in a meeting room equipped with cameras and microphone arrays. We discuss in detail the multilevel iterative decoding based audio-visual person tracker (MID-AVT). An extensive experimental evaluation of the MID-AVT and a comparison with other audio-visual tracking techniques are also presented. The dataset consists of real meeting recordings with sensor configurations similar to those used in the CLEAR 2006 and CLEAR 2007 evaluation workshops. The overall accuracy of the tracker was 75%. The MID-AVT framework performed slightly better than a particle filter-based tracker when accurate camera and microphone calibration was available. However, the MID-AVT is also shown to be robust to sensor calibration errors, whereas the particle filtering framework fails. In addition to the audio-visual person tracking results, we also track the active speaker at every time instant and present the results.

Published in:

IEEE Journal of Selected Topics in Signal Processing (Volume: 4, Issue: 5)