This paper describes a statistical decision approach to the recognition of connected digits. The method can be either speaker dependent (i.e., each new speaker must first train the system on representative digit strings before using it successfully) or speaker independent. The training sequence uses multiple repetitions of each digit, spoken in connected strings. Repetitions of the same digit are combined by linearly warping the individual reference patterns to the speaker's average length for that digit. The mean and covariance of the recognition parameters across repetitions of the same digit are computed and used in the recognition phase of the system. Once a spoken digit string has been segmented, each digit within the string is recognized using a distance measure based on an expanded form of the principle of minimum residual error. Where a great deal of coarticulation can be anticipated between adjacent digits (i.e., between digits bounded by voiced regions), a second distance metric is employed. This metric includes both the effects of the analysis estimation error and the effects of coarticulation. The analysis parameters used in this system are the coefficients of a 10-pole linear predictive coding (LPC) analysis. For stability purposes, the LPC coefficients are converted to parcor (reflection) coefficients prior to the linear warping, and the warped parcor coefficients are then converted back to LPC coefficients for recognition. The recognition system was tested on six speakers in the speaker-dependent mode, with recognition accuracies ranging from 97 to 100 percent. It was also tested with 10 new speakers in the speaker-independent mode, with a digit recognition accuracy of 95 percent.
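The warping step described above (LPC to parcor, linear warp, parcor back to LPC) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the predictor polynomial is written as A(z) = 1 + a_1 z^-1 + ... + a_p z^-p (sign conventions differ between texts), the warp is taken to be linear interpolation along the time axis, and the function names are hypothetical. The sketch shows why the detour through parcor coefficients helps: a weighted average of values that lie strictly between -1 and 1 stays in that range, so every warped frame still corresponds to a stable all-pole filter, which is not guaranteed when LPC coefficients are interpolated directly.

```python
def reflection_to_lpc(k):
    """Step-up recursion: parcor (reflection) coefficients -> a_1..a_p.

    Assumed convention: A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    """
    a = []
    for ki in k:
        # a_j(i) = a_j(i-1) + k_i * a_{i-j}(i-1); reversed(a) supplies a_{i-j}
        a = [aj + ki * ar for aj, ar in zip(a, reversed(a))] + [ki]
    return a


def lpc_to_reflection(a):
    """Step-down recursion: a_1..a_p -> parcor coefficients k_1..k_p."""
    a = list(a)
    k = []
    while a:
        ki = a.pop()  # the highest-order coefficient equals k_i
        if abs(ki) >= 1.0:
            raise ValueError("|k| >= 1: the all-pole filter is not stable")
        k.append(ki)
        # a_j(i-1) = (a_j(i) - k_i * a_{i-j}(i)) / (1 - k_i^2)
        a = [(aj - ki * ar) / (1.0 - ki * ki) for aj, ar in zip(a, reversed(a))]
    k.reverse()
    return k


def linear_warp(frames, target_len):
    """Resample a list of frames to target_len frames by linear interpolation.

    Assumption: "linear warping" is read here as uniform time scaling with
    interpolation between neighbouring frames.
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        pos = t * (n - 1) / (target_len - 1)  # fractional source frame index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * x + w * y for x, y in zip(frames[lo], frames[hi])])
    return out


def warp_lpc_frames(lpc_frames, target_len):
    """Warp LPC frames in the parcor domain, then convert back to LPC."""
    parcor = [lpc_to_reflection(f) for f in lpc_frames]
    return [reflection_to_lpc(f) for f in linear_warp(parcor, target_len)]
```

For example, `warp_lpc_frames(frames, 5)` stretches a three-frame token to five frames. In the system described above, the target length would be the speaker's average length for the digit, and the frames would hold the 10 coefficients of each analysis frame.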
Acoustics, Speech and Signal Processing, IEEE Transactions on (Volume: 24, Issue: 6)
Date of Publication: Dec 1976