Skip to Main Content
In most hidden Markov model-based automatic speech recognition systems, one of the fundamental questions is to determine the intrinsic speech feature dimensionality and the number of clusters used on the Gaussian mixture model. We analyzed mel-frequency band energies using a variational Bayesian principal component analysis method to estimate the feature dimensionality as well as the number of Gaussian mixtures by learning a maximum lower bound of the evidence instead of maximizing the likelihood function as used in conventional speech recognition systems. In analyzing the Texas Instruments/Massachusetts Institute of Technology (TIMIT) speech database, our method revealed the intrinsic structures of vowels and consonants. The usefulness of this method is demonstrated in the superior classification performance for the most difficult phonemes /b/, /d/, and /g/.