In this paper, we propose a novel approach that uses two different feature streams in a continuous speech recognition system. Conventionally, multiple feature streams are concatenated and HMMs are trained on the joint feature to build triphone/syllable models. Instead of concatenating the streams, we build separate subword HMMs for each feature stream during training, and the relevance of each feature stream to a particular sound is also evaluated at training time. During testing, hypotheses are generated by the language model, and a greedy Kullback-Leibler (KL) distance measure is used to determine the best feature stream at each instant for the given hypotheses. This approach has two important properties: (a) the feature that is most relevant for recognizing a specific sound is used, and (b) the feature dimension does not grow with the number of feature streams. To enable feature switching during recognition, a syllable-based, automatically annotated recognition framework is used: the test speech signal is first segmented into syllables, and the syllable boundaries are incorporated into the language model. Experiments are performed on three databases, (a) the Tamil DDNews database, (b) the TIMIT database, and (c) the NTIMIT database, using two features: MFCC (derived from the power spectrum of the speech signal) and MODGDF (derived from the phase spectrum of the speech signal). The results show that the word error rate (WER) is lower than that obtained with joint features by almost 1.5% for the TIMIT database, by almost 3.4% for the NTIMIT database, and by about 3.8% for the Tamil DDNews database.
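The per-segment stream selection described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes (hypothetically) that each stream's HMMs yield a posterior distribution over the current hypothesis set for a syllable segment, and that the greedy KL rule prefers the stream whose posterior diverges most from uniform, i.e. the most discriminative stream at that instant:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_stream(stream_posteriors):
    """Pick the feature stream for the current segment.

    stream_posteriors: dict mapping stream name -> posterior distribution
    over the hypothesis set for this segment (a hypothetical interface).
    Greedy rule assumed here: choose the stream whose posterior is
    farthest (in KL distance) from the uniform distribution.
    """
    n = len(next(iter(stream_posteriors.values())))
    uniform = [1.0 / n] * n
    return max(stream_posteriors,
               key=lambda s: kl_divergence(stream_posteriors[s], uniform))

# Illustrative values only: the MODGDF posterior is more peaked for this
# segment, so the phase-spectrum stream is selected.
posteriors = {
    "MFCC":   [0.4, 0.3, 0.3],
    "MODGDF": [0.8, 0.1, 0.1],
}
print(select_stream(posteriors))  # -> MODGDF
```

In an actual decoder this selection would run once per hypothesized syllable segment, so different sounds in the same utterance may be scored by different streams.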