A Two Microphone-Based Approach for Source Localization of Multiple Speech Sources

2 Author(s)
Wenyi Zhang; Dept. of Electr. & Comput. Eng., Univ. of California at San Diego, La Jolla, CA, USA; Rao, B.D.

This paper proposes a two-microphone source localization technique for multiple speech sources that exploits speech-specific properties and novel clustering algorithms. Voiced speech is sparse in the frequency domain and can be represented by sinusoidal tracks via sinusoidal modeling, which provides high local signal-to-noise ratio (SNR). By using the inter-channel phase differences (IPDs) between the two channels on the sinusoidal tracks, localization of the mixed speech sources is turned into a clustering problem on the IPD-versus-frequency plot. The generalized mixture decomposition algorithm (GMDA) is used to cluster the groups of points corresponding to the individual sources and thus estimate the direction of arrival (DOA) of each source. Experiments show that the proposed GMDA with the Laplacian noise model estimates the number of sources accurately and exhibits smaller DOA estimation error than the baseline histogram-based DOA estimation algorithm in various scenarios, including reverberant and additive white noise environments. Experiments also suggest that appropriate power thresholding can be a simple and good approximation to sinusoidal modeling for the purpose of selecting time-frequency points with high local SNR, with a slight loss in performance.
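
The link between IPD and DOA described in the abstract can be illustrated with a small sketch. It is not the paper's GMDA: it handles a single source only and replaces the clustering step with a plain grid search. It assumes a far-field source, a speed of sound of 343 m/s, and the common model IPD(f) = 2*pi*f*d*sin(theta)/c for microphone spacing d, none of which is stated in the abstract. The function names (wrap_phase, expected_ipd, estimate_doa) are hypothetical. The cost is the sum of absolute wrapped phase residuals, which is the kind of criterion a Laplacian noise model would lead to.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the abstract


def wrap_phase(phi):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def expected_ipd(freq_hz, doa_deg, mic_spacing_m):
    """Assumed far-field model: IPD grows linearly with frequency, then wraps."""
    delay = mic_spacing_m * math.sin(math.radians(doa_deg)) / SPEED_OF_SOUND
    return wrap_phase(2.0 * math.pi * freq_hz * delay)


def estimate_doa(points, mic_spacing_m, step_deg=0.5):
    """Grid-search the DOA in [-90, 90] degrees for a single source.

    points is a list of (frequency_hz, ipd_rad) pairs, e.g. taken at
    time-frequency points with high local SNR. The cost is the sum of
    absolute wrapped residuals (an L1 criterion).
    """
    best_doa, best_cost = None, float("inf")
    n_steps = int(round(180.0 / step_deg))
    for i in range(n_steps + 1):
        doa = -90.0 + i * step_deg
        cost = sum(
            abs(wrap_phase(ipd - expected_ipd(f, doa, mic_spacing_m)))
            for f, ipd in points
        )
        if cost < best_cost:
            best_doa, best_cost = doa, cost
    return best_doa


# Usage: synthetic, noise-free IPD points for one source (illustrative values).
spacing = 0.05  # metres; hypothetical microphone spacing
freqs = [200.0 * k for k in range(1, 21)]
points = [(f, expected_ipd(f, 30.0, spacing)) for f in freqs]
doa_estimate = estimate_doa(points, spacing)
```

With several simultaneous sources, the points on the IPD-versus-frequency plot fall along several such wrapped lines, which is why the paper treats localization as a clustering problem rather than a single fit.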

Published in:

Audio, Speech, and Language Processing, IEEE Transactions on (Volume: 18, Issue: 8)