Skip to Main Content
This paper describes results of speaker recognition experiments using statistical features and dynamic features of speech spectra extracted from fixed Japanese word utterances. The speech wave is transformed into a set of time functions of log area ratios and a fundamental frequency. In the case of statistical features, a mean value and a standard deviation for each time function and a correlation matrix between these functions are calculated in the voiced portion of each word, and after a feature selection procedure, they are compared with reference features. In the case of dynamic features, the time functions are brought into time registration with reference functions. The results of the experiments show that there is only a slight difference between the recognition accuracies for statistical features and dynamic features over the long term. Since the amount of calculation necessary for recognition using statistical features is only about one-tenth of that for recognition using dynamic features, it is more efficient to use statistical features than dynamic features. When training utterances are recorded over ten months for each customer and spectral equalization is applied, 99.5 percent and 96.3 percent verification accuracies can be obtained for input utterances ten months and five years later, respectively, using statistical features extracted from two words. Combination of dynamic features with statistical features can reduce the error rate to half that obtained with either one alone.