We propose a novel approach that uses Cross-Validation (CV) and Speaker Clustering (SC) based data sampling to construct an ensemble of acoustic models for speech recognition. We also investigate how the existing techniques of Cross-Validation Expectation Maximization (CVEM), Discriminative Training (DT), and Multi-Layer Perceptron (MLP) features affect the quality of the proposed ensemble acoustic models (EAMs). We evaluated the proposed methods on the TIMIT phoneme recognition task as well as on a telemedicine automatic captioning task. The proposed methods yield significant improvements in recognition accuracy over conventional Hidden Markov Model (HMM) baseline systems. Moreover, integrating EAMs with CVEM, DT, and MLP also significantly improves the accuracy of single-model systems based on CVEM, DT, and MLP, with the increased inter-model diversity shown to play an important role in the performance gain.
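The CV-based data sampling described above can be illustrated with a minimal sketch: the training corpus is split into K folds, and ensemble member i is trained on all folds except fold i, so each member sees a different (largely overlapping) subsample of the data. This is a hypothetical illustration of the sampling scheme only, not the paper's implementation; the function name `cv_partitions` and the fold count are assumptions for the example.

```python
import random

def cv_partitions(utterances, num_folds=5, seed=0):
    """Hypothetical sketch of CV-based data sampling for an ensemble
    of acoustic models: split the utterance list into num_folds subsets;
    member i trains on everything except fold i."""
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)
    # Round-robin assignment of shuffled utterances to folds.
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    # Each member's training set leaves out exactly one fold, yielding
    # num_folds distinct but overlapping training subsamples.
    train_sets = [
        [utt for j, fold in enumerate(folds) if j != i for utt in fold]
        for i in range(num_folds)
    ]
    return train_sets

# Example: 100 utterance IDs, 5 folds -> 5 training sets of 80 each.
train_sets = cv_partitions(list(range(100)), num_folds=5)
```

Each of the resulting training sets would then be used to estimate one acoustic model, and the ensemble combines the members' outputs at recognition time.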