By Topic

Comparison of speaker recognition methods using statistical features and dynamic features

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$33 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

1 Author(s)
S. Furui ; Nippon Telegraph and Telephone Public Corporation, Tokyo, Japan

This paper describes results of speaker recognition experiments using statistical features and dynamic features of speech spectra extracted from fixed Japanese word utterances. The speech wave is transformed into a set of time functions of log area ratios and a fundamental frequency. In the case of statistical features, a mean value and a standard deviation for each time function and a correlation matrix between these functions are calculated in the voiced portion of each word, and after a feature selection procedure, they are compared with reference features. In the case of dynamic features, the time functions are brought into time registration with reference functions. The results of the experiments show that there is only a slight difference between the recognition accuracies for statistical features and dynamic features over the long term. Since the amount of calculation necessary for recognition using statistical features is only about one-tenth of that for recognition using dynamic features, it is more efficient to use statistical features than dynamic features. When training utterances are recorded over ten months for each customer and spectral equalization is applied, 99.5 percent and 96.3 percent verification accuracies can be obtained for input utterances ten months and five years later, respectively, using statistical features extracted from two words. Combination of dynamic features with statistical features can reduce the error rate to half that obtained with either one alone.

Published in:

IEEE Transactions on Acoustics, Speech, and Signal Processing  (Volume:29 ,  Issue: 3 )