Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings

Daniel Gatica-Perez, Guillaume Lathoud, Jean-Marc Odobez, Iain McCowan (IDIAP Research Institute, École Polytechnique Fédérale de Lausanne, Martigny)

Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a novel probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state space, which includes the explicit definition of a proximity-based interaction model. The model integrates audiovisual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, showing that our framework 1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy, 2) can deal with cases of visual clutter and occlusion, and 3) significantly outperforms a traditional sampling-based approach.
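To illustrate the inference style the abstract refers to, the following is a minimal sketch of a Markov chain Monte Carlo particle filter for a single 1-D state. It is not the paper's tracker: the Gaussian dynamics and observation likelihoods, the random-walk proposal, and all parameter values (`dyn_std`, `obs_std`, `n_samples`, `burn_in`) are hypothetical stand-ins for the paper's multiperson audiovisual models, and conditioning each chain on a single randomly drawn ancestor sample is a simplification of a full MCMC-PF.

```python
import math
import random

def mcmc_particle_filter(observations, n_samples=500, burn_in=100,
                         dyn_std=1.0, obs_std=0.5, seed=0):
    """Illustrative MCMC particle filter for a 1-D random-walk state.

    At each time step a Metropolis chain targets the filtering density:
    proposals on x_t are accepted with the ratio of
    p(z_t | x_t) * p(x_t | x_{t-1}) under Gaussian models (hypothetical
    stand-ins for the paper's audiovisual likelihoods).
    """
    rng = random.Random(seed)

    def log_gauss(x, mu, std):
        # Log density of N(mu, std^2) evaluated at x.
        return -0.5 * ((x - mu) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

    # Previous-step samples; initialize the filter at the first observation.
    prev = [float(observations[0])] * n_samples
    estimates = [float(observations[0])]

    for z in observations[1:]:
        x = prev[-1]               # start the chain at a previous sample
        x_prev = rng.choice(prev)  # condition on one ancestor (simplification)
        samples = []
        for i in range(burn_in + n_samples):
            x_cand = x + rng.gauss(0.0, dyn_std)  # random-walk proposal
            log_a = (log_gauss(z, x_cand, obs_std)
                     + log_gauss(x_cand, x_prev, dyn_std)
                     - log_gauss(z, x, obs_std)
                     - log_gauss(x, x_prev, dyn_std))
            if math.log(rng.random() + 1e-300) < log_a:
                x = x_cand             # Metropolis accept
            if i >= burn_in:
                samples.append(x)      # keep post-burn-in samples
        prev = samples
        estimates.append(sum(samples) / len(samples))  # posterior-mean estimate
    return estimates
```

Because MCMC moves explore the state space sequentially rather than resampling a weighted particle set, this style of filter tends to place samples in high-likelihood regions more efficiently than plain importance sampling, which is the sampling-efficiency advantage the abstract claims for the joint multiperson state space.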

Published in:

IEEE Transactions on Audio, Speech, and Language Processing (Volume 15, Issue 2)