State-of-the-art automatic speech recognition (ASR) systems are usually based on hidden Markov models (HMMs) that emit cepstral-based features, which are assumed to be piecewise stationary. While not particularly robust to noise, these features are also known to be very sensitive to "auxiliary" information such as pitch, energy, and rate-of-speech (ROS). Attempts to include such auxiliary information in state-of-the-art ASR systems have so far often been based on simply appending the auxiliary features to the standard acoustic feature vectors. In the present paper, we investigate different approaches to incorporating this auxiliary information using dynamic Bayesian networks (DBNs) or hybrid HMM/ANN systems (HMMs combined with artificial neural networks). These approaches are motivated by the fact that the auxiliary information is not necessarily (directly) emitted by the HMM states but, rather, carries higher-level information (e.g., speaker characteristics) that is correlated with the standard features. As implicitly done for gender modeling elsewhere, the auxiliary information then appears as a conditioning variable in the emission distributions and can be hidden (except in the case of some HMM/ANN systems) when its estimates become too noisy. Based on recognition experiments carried out on the OGI Numbers database (free-format numbers spoken over the telephone), we show that auxiliary information used to condition the distribution of the standard features can, in certain conditions, yield more robust recognition than auxiliary information appended to the standard features; this is most evident when energy is used as an auxiliary variable in noisy speech.
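The contrast between the two modeling choices above can be illustrated with a minimal sketch. This is not the paper's implementation; it is a hypothetical example assuming diagonal-covariance Gaussian emissions for a single HMM state, showing (1) the baseline, where the auxiliary value is appended as an extra emitted dimension, and (2) the conditioning approach, where a discrete auxiliary variable selects the emission parameters and is marginalized out when hidden:

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    # Log-density of a diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# (1) Baseline: append the auxiliary value (e.g., energy) to the
# acoustic vector, so it is treated as just another emitted dimension.
def loglik_appended(x, aux, mean, var):
    xa = np.append(x, aux)          # mean/var must cover the extra dimension
    return gaussian_logpdf(xa, mean, var)

# (2) Conditioning: one set of emission parameters per discrete auxiliary
# value a; when a is hidden, marginalize: p(x) = sum_a p(a) p(x | a).
def loglik_conditioned(x, aux_prior, means, variances):
    logps = [np.log(p) + gaussian_logpdf(x, m, v)
             for p, m, v in zip(aux_prior, means, variances)]
    return np.logaddexp.reduce(logps)
```

When the auxiliary variable is reliably observed, the sum collapses to the single term selected by the observation; hiding it (the marginalization above) is what makes the conditioned model usable when the auxiliary estimates are too noisy to trust.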