By Topic

Phoneme Selective Speech Enhancement Using Parametric Estimators and the Mixture Maximum Model: A Unifying Approach

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Das, A. ; Indian Institute of Technology, Madras, India ; Hansen, J.H.L.

This study presents a ROVER speech enhancement algorithm that employs a series of prior enhanced utterances, each customized for a specific broad level phoneme class, to generate a single composite utterance which provides overall improved objective quality across all classes. The noisy utterance is first partitioned into speech and non-speech regions using a voice activity detector, followed by a mixture maximum (MIXMAX) model which is used to make probabilistic decisions in the speech regions to determine phoneme class weights. The prior enhanced utterances are weighted by these decisions and combined to form the final composite utterance. The enhancement system that generates the prior enhanced utterances comprises of a family of parametric gain functions whose parameters are flexible and can be varied to achieve high enhancement levels per phoneme class. These parametric gain functions are derived using 1) a weighted Euclidean distortion cost function, and 2) by modeling clean speech spectral magnitudes or discrete Fourier transform coefficients by Chi or two-sided Gamma priors, respectively. The special case estimators of these gain functions are the generalized spectral subtraction (GSS), minimum mean square error (MMSE), two-sided Gamma or joint maximum a posteriori (MAP) estimators. Performance evaluations performed over two noise types and signal-to-noise ratios (SNRs) ranging from ${-}$ 5 dB to 10 dB suggest that the proposed ROVER algorithm not only outperforms the special case estimators but also the family of parametric estimators when all phoneme classes are jointly considered.

Published in:

Audio, Speech, and Language Processing, IEEE Transactions on  (Volume:20 ,  Issue: 8 )