A hierarchical decision approach to large-vocabulary discrete utterance recognition

2 Author(s)
T. Kaneko ; IBM Japan Science Institute, Tokyo, Japan ; N. Dixon

Very short response time is a critical requirement for automatic discrete utterance recognition. The real-time vocabulary size of most of today's commercially available recognizers is limited to several hundred utterances, primarily because detailed acoustic matching involves considerable computation. The method presented here offers an economical solution to the real-time large-vocabulary recognition problem by carrying out recognition in two stages. In the initial stage, the incoming utterance is linearly matched against the entire vocabulary using only two features: utterance duration and either two or three average spectra for each utterance. While the number of prototypes matched is large, the time required per match is substantially reduced. During this initial stage, a preset number of best-match prototypes is determined for each unknown input. In the second stage, matching is performed on the best-match list using more detailed features (e.g., 10-ms log-power spectra) and a more elaborate matching methodology (e.g., dynamic programming). Evaluation experiments were conducted using the 2000 most frequent words in an office-correspondence corpus and three normal adult-male talkers. It was observed that first-stage best-match lists of 30-50 items included the "correct" words between 99.0 and 99.5 percent of the time. Using DP on 10-ms spectral samples for the second stage, recognition accuracy ranged from 86.5 to 94.5 percent. A match-limiter, when used with a 50-64-word, commercially available recognizer for the second stage, makes near-real-time large-vocabulary recognition feasible.
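The two-stage scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the average spectra are computed or how they are combined with duration, so the equal-length segmentation, the Euclidean frame distance, the `duration_weight` parameter, and all function names here are assumptions made for illustration. The second stage uses plain dynamic time warping as one common form of dynamic-programming matching.

```python
import math

def frame_dist(x, y):
    """Euclidean distance between two spectral frames (assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def segment_means(frames, k):
    """Average the frames over k roughly equal consecutive segments."""
    n, dim = len(frames), len(frames[0])
    means = []
    for s in range(k):
        lo = s * n // k
        hi = max((s + 1) * n // k, lo + 1)  # keep every segment non-empty
        seg = frames[lo:hi]
        means.append([sum(f[d] for f in seg) / len(seg) for d in range(dim)])
    return means

def coarse_distance(a, b, k=2, duration_weight=1.0):
    """Stage 1: cheap match on duration plus k average spectra."""
    duration_term = duration_weight * abs(len(a) - len(b))
    spectral_term = sum(
        frame_dist(x, y) for x, y in zip(segment_means(a, k), segment_means(b, k))
    )
    return duration_term + spectral_term

def dtw_distance(a, b):
    """Stage 2: dynamic time warping over the full frame sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, prototypes, shortlist_size=30, k=2):
    """Rank the whole vocabulary coarsely, then run DTW on the shortlist only."""
    ranked = sorted(
        prototypes, key=lambda w: coarse_distance(utterance, prototypes[w], k)
    )
    shortlist = ranked[:shortlist_size]
    best = min(shortlist, key=lambda w: dtw_distance(utterance, prototypes[w]))
    return best, shortlist

# Toy data: each utterance is a list of 2-dimensional "spectral" frames.
prototypes = {
    "yes": [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4,
    "no": [[0.0, 1.0]] * 8,
    "maybe": [[1.0, 1.0]] * 12,
}
utterance = [[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 4
best, shortlist = recognize(utterance, prototypes, shortlist_size=2)
```

The point of the design is that the coarse stage costs a fixed, small amount of work per prototype regardless of utterance length, while the quadratic-time DTW runs only on `shortlist_size` candidates rather than on the full vocabulary.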

Published in:

IEEE Transactions on Acoustics, Speech, and Signal Processing (Volume: 31, Issue: 5)