Analysis of MLP-Based Hierarchical Phoneme Posterior Probability Estimator

5 Author(s)
Pinto, J. (Idiap Res. Inst., Martigny, Switzerland); Garimella, S.; Magimai-Doss; Hermansky, H.

We analyze a simple hierarchical architecture consisting of two multilayer perceptron (MLP) classifiers in tandem to estimate the phonetic class conditional probabilities. In this hierarchical setup, the first MLP classifier is trained on standard acoustic features. The second MLP is trained on the posterior probabilities of phonemes estimated by the first, but with a long temporal context of around 150-230 ms. Through extensive phoneme recognition experiments, and analysis of the trained second MLP using Volterra series, we show that 1) the hierarchical system yields higher phoneme recognition accuracies (an absolute improvement of 3.5% on TIMIT and 9.3% on CTS) over the conventional single-MLP-based system, 2) there exists useful information in the temporal trajectories of the posterior feature space, spanning around 230 ms of context, 3) the second MLP learns the phonetic temporal patterns in the posterior features, which include the phonetic confusions at the output of the first MLP as well as the phonotactics of the language as observed in the training data, and 4) the second MLP classifier requires fewer parameters and can be trained with less training data.
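The hierarchical setup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the layer sizes, the 39-dimensional input features, and the +/-11-frame context window (about 230 ms at a typical 10 ms frame shift) are assumptions chosen for concreteness, and the networks are randomly initialized rather than trained. The key structural points it shows are that the second MLP consumes a stacked temporal window of the first MLP's phoneme posteriors, and that it therefore needs far fewer input-side parameters than an MLP fed wide windows of raw acoustic features.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MLP:
    """Single-hidden-layer perceptron with a softmax output layer."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.1 * rng.standard_normal((n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return softmax(h @ self.W2 + self.b2)

def stack_context(posteriors, context):
    """Concatenate each frame with its +/-context neighbors (edge-padded)."""
    T, D = posteriors.shape
    padded = np.pad(posteriors, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

# Hypothetical dimensions: 39-dim acoustic features, 40 phoneme classes,
# +/-11 frames of posterior context (~230 ms at a 10 ms frame shift).
n_feat, n_phones, context = 39, 40, 11
mlp1 = MLP(n_feat, 64, n_phones, seed=0)
mlp2 = MLP((2 * context + 1) * n_phones, 64, n_phones, seed=1)

features = np.random.default_rng(2).standard_normal((100, n_feat))  # 100 frames
post1 = mlp1.forward(features)                        # first-stage posteriors
post2 = mlp2.forward(stack_context(post1, context))   # refined posteriors
```

Note the asymmetry the abstract exploits: the first MLP maps a per-frame acoustic vector to a compact 40-dimensional posterior, so even a wide temporal window at the second stage yields a modest input dimensionality, which is one reason the second MLP can get by with fewer parameters and less training data.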

Published in:

IEEE Transactions on Audio, Speech, and Language Processing (Volume 19, Issue 2)