Close category search window
 

Probabilistic Speaker Diarization With Bag-of-Words Representations of Speaker Angle Information

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

5 Author(s)
Ishiguro, K. ; NTT Commun. Sci. Labs., NTT Corp., Kyoto, Japan ; Yamada, T. ; Araki, S. ; Nakatani, T.
more authors

Speaker diarization determines “who spoke when” from the recorded conversations of an unknown number of people. In general, we have no a priori information about the number, the locations, or even the characteristics of the speakers. Additionally, speakers' speech utterances vary dynamically because of turn-taking during the conversations. These conditions make the speaker-clustering task extremely difficult. The problem becomes even harder if online (incremental) processing is required. In this paper, we formulate the speaker-clustering problem as the clustering of the sequential audio features generated by an unknown number of latent mixture components (speakers). We employ a probabilistic model that assumes time-sensitive speaker mixtures at every time frame, which, surprisingly, suits the diarization scenario. We combine the time-varying probabilistic model with direction of arrival (DOA) information calculated from a microphone array in a bag-of-words (BoW)-style feature representation. The proposed system effectively estimates the number and locations of the speakers in an online manner based on the standard Bayes inference scheme. Experiments confirm that the proposed model can successfully infer the number and features of speakers and yield better or comparable speaker diarization results compared with conventional methods in several datasets.

Published in:
Audio, Speech, and Language Processing, IEEE Transactions on  (Volume:20 ,  Issue: 2 )

Date of Publication: Feb. 2012

Need Help?


IEEE Advancing Technology for Humanity About IEEE Xplore | Contact | Help | Terms of Use | Nondiscrimination Policy | Site Map | Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest professional association for the advancement of technology.
© Copyright 2013 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.