Speech recognition with auxiliary information

3 Author(s)
Stephenson, T.A. ; Dalle Molle Inst. for Perceptual Artificial Intelligence, Martigny, Switzerland ; Doss, M.M. ; Bourlard, H.

State-of-the-art automatic speech recognition (ASR) systems are usually based on hidden Markov models (HMMs) that emit cepstral-based features which are assumed to be piecewise stationary. These features are not very robust to noise and are also known to be very sensitive to "auxiliary" information, such as pitch, energy, rate-of-speech (ROS), etc. Attempts so far to include such auxiliary information in state-of-the-art ASR systems have often been based on simply appending these auxiliary features to the standard acoustic feature vectors. In the present paper, we investigate different approaches to incorporating this auxiliary information using dynamic Bayesian networks (DBNs) or hybrid HMM/ANNs (HMMs with artificial neural networks). These approaches are motivated by the fact that the auxiliary information is not necessarily (directly) emitted by the HMM states but, rather, carries higher-level information (e.g., speaker characteristics) that is correlated with the standard features. As implicitly done for gender modeling elsewhere, this auxiliary information then appears as a conditional variable in the emission distributions and can be hidden (except in the case of some HMM/ANNs) when its estimates become too noisy. Based on recognition experiments carried out on the OGI Numbers database (free-format numbers spoken over the telephone), we show that auxiliary information that conditions the distribution of the standard features can, in certain conditions, provide more robust recognition than using auxiliary information that is appended to the standard features; this is most evident in the case of energy as an auxiliary variable in noisy speech.
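
The distinction the abstract draws between appending auxiliary information, conditioning on it, and hiding it can be illustrated with a toy emission model. The following is only a minimal sketch under stated assumptions, not the paper's implementation: it assumes a single one-dimensional standard feature `x`, Gaussian emissions, and, for the conditioned and hidden cases, an auxiliary variable quantized into a small number of bins. The function names, the `state` dictionary layout, and all parameter values are hypothetical.

```python
import math


def gauss_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def appended_loglik(x, a, state):
    """Appended case: the state emits both x and the auxiliary value a.

    Assumes a diagonal covariance, so the joint density factorizes.
    """
    return (gauss_logpdf(x, state["x_mean"], state["x_var"])
            + gauss_logpdf(a, state["a_mean"], state["a_var"]))


def conditioned_loglik(x, a_bin, state):
    """Conditioned case: the observed auxiliary bin only selects the
    emission density of x; the auxiliary value itself is not emitted."""
    mean, var = state["cond"][a_bin]
    return gauss_logpdf(x, mean, var)


def hidden_loglik(x, state, aux_prior):
    """Hidden case: the auxiliary bin is unobserved and is marginalized
    out using a prior over bins (assumed strictly positive)."""
    terms = [math.log(p) + gauss_logpdf(x, *state["cond"][k])
             for k, p in enumerate(aux_prior)]
    return logsumexp(terms)


# Hypothetical parameters for one HMM state and a two-bin auxiliary variable.
state = {
    "x_mean": 0.0, "x_var": 1.0,        # emission of x when a is appended
    "a_mean": 1.0, "a_var": 4.0,        # emission of the appended a
    "cond": [(-0.5, 1.0), (0.5, 1.0)],  # (mean, var) of x for each bin of a
}
aux_prior = [0.5, 0.5]

print(appended_loglik(0.3, 1.0, state))
print(conditioned_loglik(0.3, 1, state))
print(hidden_loglik(0.3, state, aux_prior))
```

In the appended case a noisy auxiliary estimate enters the state likelihood directly, whereas in the hidden case it does not enter at all, which gives one intuition for the robustness result reported in the abstract.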

Published in:

Speech and Audio Processing, IEEE Transactions on (Volume: 12, Issue: 3)