KL divergence based feature switching in the linguistic search space for automatic speech recognition

Authors:

Kumar, J.C.; Janakiraman, R.; Murthy, H.A. — Dept. of Comput. Sci. & Eng., Indian Inst. of Technol. Madras, Chennai, India

In this paper, we propose a novel idea for using two different feature streams in a continuous speech recognition system. Conventionally, multiple feature streams are concatenated and HMMs are trained on the joint stream to build triphone/syllable models. In this paper, instead of concatenation, we build separate subword HMMs for each of the feature streams during training. Also during training, the relevance of a feature stream to a particular sound is evaluated. During testing, hypotheses are generated by the language model, and a greedy Kullback-Leibler distance measure is used to determine the best feature at a particular instant for the given hypotheses. There are two important aspects of this approach, namely, a) a feature that is relevant for recognizing a specific sound is used, and b) the dimension of the feature stream does not increase with the number of different feature streams. To enable feature switching during recognition, a syllable-based, automatically annotated recognition framework is used. In this framework, the test speech signal is first segmented into syllables, and syllable boundaries are incorporated in the language model. Experiments are performed on three databases, (a) the Tamil DDNews database, (b) the TIMIT database, and (c) the NTIMIT database, using two features: MFCC (derived from the power spectrum of the speech signal) and MODGDF (derived from the phase spectrum of the speech signal). The results show that the word error rate (WER) is lower than that of joint (concatenated) features by almost 1.5% for the TIMIT database, by almost 3.4% for the NTIMIT database, and by about 3.8% for the Tamil DDNews database.
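The switching step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each feature stream's subword model state and the local segment statistics are summarized as diagonal Gaussians, and it greedily picks the stream whose segment statistics lie closest, in KL-divergence terms, to that stream's hypothesized model. The function and variable names are hypothetical.

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL divergence D(p || q) between two diagonal Gaussians,
    given as lists of per-dimension means and variances."""
    kl = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        kl += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return kl

def switch_feature(segment_stats, model_stats):
    """Greedy feature switching (illustrative sketch).

    segment_stats: stream name -> (mean, var) of the current syllable segment
    model_stats:   stream name -> (mean, var) of that stream's hypothesized
                   subword HMM state

    Returns the name of the stream whose segment statistics are closest
    (smallest KL divergence) to its own model -- i.e. the feature deemed
    most relevant for the hypothesized sound at this instant.
    """
    return min(
        segment_stats,
        key=lambda s: kl_gaussian(*segment_stats[s], *model_stats[s]),
    )

# Toy usage: the MFCC segment statistics match the MFCC model well,
# while the MODGDF statistics are far from the MODGDF model, so the
# switcher selects "MFCC" for this segment.
segment = {"MFCC": ([0.0], [1.0]), "MODGDF": ([5.0], [1.0])}
models = {"MFCC": ([0.1], [1.0]), "MODGDF": ([0.0], [1.0])}
best = switch_feature(segment, models)
```

Because only one stream is active at any instant, the decoder always works with a single stream's dimensionality, which is the abstract's point b): the feature dimension does not grow with the number of streams.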

Published in:

2010 National Conference on Communications (NCC)

Date of Conference:

29-31 Jan. 2010