
# Non-Linear Filtering in Reproducing Kernel Hilbert Spaces for Noise-Robust Speaker Verification

In this paper, we present a non-linear filtering approach for extracting noise-robust speech features that can be used in a speaker verification task. At the core of the proposed approach is a time-series regression using Reproducing Kernel Hilbert Space (RKHS) based methods that extracts discriminatory non-linear signatures while filtering out the non-informative noise components. A linear projection is then used to map the characteristics of the RKHS regression function into a linear-predictive vector, which is presented as the input to a back-end speaker verification engine. Experiments using the YOHO speaker verification corpus show that a recognition system trained on the proposed features yields consistent improvements over an equivalent verification system based on Mel-frequency cepstral coefficients (MFCCs) for signal-to-noise ratios ranging from 0–30 dB.

SECTION I

## INTRODUCTION

Speaker verification/identification is a popular biometric identification technique used for authenticating and monitoring human subjects using their speech signal. The method is attractive since it does not require direct contact with the individual, thus avoiding the hurdle of perceived invasiveness inherent in most biometric systems. Unfortunately, even though most existing speech-based recognition systems deliver acceptable accuracy under controlled conditions, their performance degrades significantly when subjected to the noise present in practical environments [1], [2]. For instance, a comparative study [2] demonstrated that additive broadband noise severely degrades the performance of a conventional recognition system. This degradation is primarily attributed to an unavoidable mismatch between training and recognition conditions, especially when the characteristics of all possible noise sources are not known in advance [3], [4].

Several strategies for mitigating this mismatch have been presented in the literature, and they can be broadly categorized into three groups. The first strategy is to estimate and filter the noise prior to feature extraction, which can be seen as a separate speech enhancement block [5], [6], [7]. The second strategy is to design a robust front-end feature extractor [12], [13], including features based on human auditory modeling [8], [9], [19]. The third strategy aims at making the classifier more robust through on-line model adaptation [10].

For over four decades, the speech features used in most state-of-the-art recognition systems have relied on spectral-based techniques, which inherently assume the linearity of the speech production mechanism. Examples of spectral-based features include Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP), cochlear features, and linear predictive coefficients (LPCs) [11], [12], [13]. MFCC features, for example, are calculated by applying a discrete cosine transform (DCT) to the logarithm of the Mel-smoothed power spectrum; PLP features, like MFCCs, are based on human auditory models. While these feature sets accurately extract linear information from speech signals, they do not capture the nonlinear or higher-order statistical characteristics of the signals, which have been shown to be significant [14], [15]. The failure of spectral-based features to achieve robust performance in ambient noise motivates the investigation of alternative, computationally efficient formulations that utilize the non-linear and higher-order statistical information embedded in the speech signal. Previous studies in this direction have approximated auditory time series by a low-dimensional non-linear dynamical model. In another study [15], it was demonstrated that sustained vowels from different speakers exhibit a nonlinear, non-chaotic behavior that can be embedded in a low-dimensional manifold of order less than four.

This paper proposes a novel speech feature extraction method, called kernel predictive coefficients (KPCs), that exploits the non-linear filtering properties of a functional regression procedure in a reproducing kernel Hilbert space (RKHS). RKHS regression has been extensively studied in the machine learning community, especially in the fields of regularization theory [20], support vector machines (SVMs) [21], and speech recognition [22]. In this paper, RKHS regression is combined with a linear projection step, which endows the KPC features with the following robustness properties:

• The algorithm doesn't make any prior assumption on noise statistics.

• The algorithm uses kernel methods to extract features that are nonlinear and robust to corruption by noise.

• Robust parameter estimation is ensured by imposing smoothness constraints based on regularization principles.

The paper is organized as follows. Section II introduces the notation and formulation of kernel-based predictive coding. Section III presents results from verification experiments performed with the resulting features, and Section IV provides concluding remarks, discussion, and future work.

SECTION II

## KERNEL PREDICTIVE COEFFICIENTS FEATURE EXTRACTION

This section presents the fundamentals of the KPC feature extraction algorithm, based on the formulation introduced in [22]. The system-level architecture of a KPC feature extractor is shown in Fig. 1; it consists of a kernel regression module that interfaces with a linear projection module.

Fig. 1. System level architecture of the KPC feature extractor.

### A. Kernel Regression

Given a stationary discrete-time speech signal represented by $x[n] \in \mathbb{R}$, with $n = 1, \ldots, N$ denoting time indices, the aim of nonlinear prediction is to estimate a function $f: \mathbb{R}^{P} \rightarrow \mathbb{R}$ that can predict $x[n]$ based on the previous $P$ samples $\bar{\mathbf{x}}[n] = [x[n-1], \ldots, x[n-P]]^{T}$. The prediction function $f$ is an element of a Hilbert space $\mathcal{H}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ defines an inner product between two functional elements. The prediction function $f$ is obtained by minimizing the cost function

$$\min_{f \in \mathcal{H}}\ C(f) = \lambda \Vert f \Vert^{2}_{\mathcal{H}} + \sum_{n=P+1}^{N} \left(x[n] - f(\bar{\mathbf{x}}[n])\right)^{2} \tag{1}$$

The regularizer $\Vert f \Vert^{2}_{\mathcal{H}}$ penalizes large signal excursions by constraining the functional norm, and thus avoids over-fitting to the time-series data. The parameter $\lambda$ in (1) determines the trade-off between the reconstruction error and the smoothness of the prediction function $f$. Using the properties of Hilbert spaces, the prediction function $f$ can be decomposed into a weighted sum of countable orthonormal basis functions as

$$f(\bar{\mathbf{x}}[n]) = \sum_{i=1}^{\infty} b_{i} \phi_{i}(\bar{\mathbf{x}}[n]) \tag{2}$$

with coefficients $b_{i}$. The orthonormal property of the basis functions can be expressed as $\langle \phi_{i}(\cdot), \phi_{j}(\cdot) \rangle_{\mathcal{H}} = 1/\Lambda_{i}$ for $i = j$, and equals zero otherwise. Using this property, the regularizer can be reformulated as

$$\begin{aligned}
\Vert f \Vert^{2}_{\mathcal{H}} &= \langle f, f \rangle_{\mathcal{H}} \\
&= \sum_{i,j=1}^{\infty} b_{i} b_{j} \langle \phi_{i}(\cdot), \phi_{j}(\cdot) \rangle_{\mathcal{H}} \\
&= \sum_{i=1}^{\infty} \frac{b_{i}^{2}}{\Lambda_{i}}
\end{aligned} \tag{3}$$

The optimization problem given by (1) can now be written as

$$\min_{b_{i}} \acute{C}(b_{i}) = \lambda \sum_{i=1}^{\infty} \frac{b_{i}^{2}}{\Lambda_{i}} + \sum_{n=P+1}^{N} \left( x[n] - \sum_{i=1}^{\infty} b_{i} \phi_{i}(\bar{\mathbf{x}}[n]) \right)^{2} \tag{4}$$

The solution of the optimization problem is obtained by equating the derivative of (4) with respect to the parameter $b_{i}$ to zero, which leads to

$$b_{i} = \frac{\Lambda_{i}}{\lambda} \sum_{n=P+1}^{N} \alpha[n] \phi_{i}(\bar{\mathbf{x}}[n]) \tag{5}$$

where $\alpha[n]$ is the reconstruction error for the speech sample at time instant $n = P+1, \ldots, N$, given by

$$\begin{aligned}
\alpha[n] &= x[n] - \sum_{i=1}^{\infty} b_{i} \phi_{i}(\bar{\mathbf{x}}[n]) \\
&= x[n] - f(\bar{\mathbf{x}}[n])
\end{aligned} \tag{6}$$

By substituting (5) in (2), the prediction function $f$ can be written in terms of $\alpha$ as

$$\begin{aligned}
f(\bar{\mathbf{x}}[n]) &= \sum_{m=P+1}^{N} \frac{\alpha[m]}{\lambda} \sum_{i=1}^{\infty} \Lambda_{i} \phi_{i}(\bar{\mathbf{x}}[m]) \phi_{i}(\bar{\mathbf{x}}[n]) \\
&= \sum_{m=P+1}^{N} \frac{\alpha[m]}{\lambda} K(\bar{\mathbf{x}}[m], \bar{\mathbf{x}}[n])
\end{aligned} \tag{7}$$

where $K(\bar{\mathbf{x}}[m], \bar{\mathbf{x}}[n]) = \sum_{i=1}^{\infty} \Lambda_{i} \phi_{i}(\bar{\mathbf{x}}[m]) \phi_{i}(\bar{\mathbf{x}}[n])$ is a reproducing kernel [21] over the space $\mathcal{H}$. Using the function $f$ in (7), the reconstruction error $\alpha[n]$ can be written as

$$\alpha[n] = x[n] - \sum_{m=P+1}^{N} \frac{\alpha[m]}{\lambda} K(\bar{\mathbf{x}}[m], \bar{\mathbf{x}}[n]) \tag{8}$$

with the optimum solution in matrix form given by

$$\boldsymbol{\alpha}^{\ast} = \lambda (\lambda \mathbf{I} + \mathbf{K})^{-1} \mathbf{x} \tag{9}$$

with $\boldsymbol{\alpha}^{\ast} = \{\alpha[n]\}$, $\mathbf{x} = \{x[n]\}$, and $\mathbf{K}$ representing the kernel matrix with elements $K_{mn} = K(\bar{\mathbf{x}}[m], \bar{\mathbf{x}}[n])$, for $m, n = P+1, \ldots, N$.
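
The regression step in (8) and (9) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel and its width `sigma` are assumptions (this section does not fix a kernel), and the helper names `embed`, `gaussian_kernel`, `solve`, and `kernel_regression` are hypothetical. The sketch builds the lag vectors, forms the kernel matrix, and solves $(\lambda \mathbf{I} + \mathbf{K})\boldsymbol{\alpha} = \lambda \mathbf{x}$ instead of inverting the matrix explicitly.

```python
import math

def embed(x, P):
    """Return targets x[n] and lag vectors [x[n-1], ..., x[n-P]] (0-based n = P..len(x)-1)."""
    targets = [x[n] for n in range(P, len(x))]
    lags = [[x[n - i] for i in range(1, P + 1)] for n in range(P, len(x))]
    return targets, lags

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel; an assumed choice, the text does not specify the kernel."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def kernel_regression(x, P, lam, sigma=1.0):
    """Return (alpha, f): residuals solving Eq. (9) and fitted values f = x - alpha."""
    targets, lags = embed(x, P)
    m = len(targets)
    # (lam * I + K), with K[i][j] = K(xbar[i], xbar[j])
    A = [[gaussian_kernel(lags[i], lags[j], sigma) + (lam if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    alpha = solve(A, [lam * t for t in targets])
    f = [t - a for t, a in zip(targets, alpha)]
    return alpha, f

# Toy signal standing in for one speech frame (illustrative only).
x = [math.sin(0.3 * n) for n in range(40)]
alpha, f = kernel_regression(x, P=4, lam=0.01)
```

Solving the linear system directly costs on the order of the cube of the frame length, so a practical implementation would likely rely on an optimized linear-algebra library; the pure-Python solver is used here only to keep the sketch self-contained.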

### B. Linear Projection

The regression parameters $\alpha[n]$ completely determine the RKHS function $f(\cdot)$, which captures only the information-bearing component of the speech signal. However, the function $f(\cdot)$ spans an infinite-dimensional space and therefore cannot be used directly as a feature. The next step in KPC feature extraction is to project the information present in $f$ onto a lower-dimensional space. To achieve this, we directly use the sampled values of $f$ evaluated at each of the training points $\bar{\mathbf{x}}[n]$. Based on these sampled values, the objective of the projection operation is to determine linear regression parameters $a_{i}$, $i = 1, \ldots, P$, such that

$$f(\bar{\mathbf{x}}[n]) \approx \sum_{i=1}^{P} a_{i} x[n-i], \quad n \geq P+1 \tag{10}$$

Since the system of equations (10) is overdetermined, we resort to a linear regression procedure in which the vector of KPC parameters $\mathbf{a}$ can be written as

$$\mathbf{a} = (\mathbf{X}\mathbf{X}^{T})^{-1} \mathbf{X} \mathbf{f} \tag{11}$$

where $\mathbf{f}$ denotes the vector of sampled values $f(\bar{\mathbf{x}}[n])$ and $\mathbf{X}$ represents the time-series matrix, with the row vectors of $\mathbf{X}^{T}$ given by $\bar{\mathbf{x}}[n]^{T}$, $n = P+1, \ldots, N$.
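
The least-squares fit of (10) can be illustrated with the following standard-library Python sketch. It is an illustration only: the function names `solve` and `project_to_kpc` are hypothetical, and the demo uses a synthetic vector of fitted values with a known linear structure (not output from the RKHS regression) so that the recovered coefficients can be checked by eye.

```python
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def project_to_kpc(x, f, P):
    """Least-squares fit of f[n] ~ sum_i a_i * x[n-i]; returns [a_1, ..., a_P]."""
    # Each row is a lag vector [x[n-1], ..., x[n-P]] for 0-based n = P..len(x)-1.
    rows = [[x[n - i] for i in range(1, P + 1)] for n in range(P, len(x))]
    # Normal equations: P x P Gram matrix of the lag vectors and its right-hand side.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(P)] for i in range(P)]
    rhs = [sum(r[i] * fn for r, fn in zip(rows, f)) for i in range(P)]
    return solve(gram, rhs)

# Synthetic check: f is an exact linear function of the two previous samples,
# so the fit should recover coefficients close to 0.5 and -0.2.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(60)]
P = 2
f = [0.5 * x[n - 1] - 0.2 * x[n - 2] for n in range(P, len(x))]
a = project_to_kpc(x, f, P)
```

In the feature extractor described above, `f` would instead hold the fitted values of the kernel regression for one frame, and the resulting `a` would be that frame's KPC vector.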

SECTION III

## EXPERIMENTS AND RESULTS

Fig. 2(a) and (b) show the KPC features with parameter λ = 0.0001 computed over consecutive speech frames for clean and noisy utterances of the word “march”. Fig. 2(c) and (d) show the same scenario for a higher value, λ = 0.01. The parameter λ is a smoothing parameter (it controls the bias-variance tradeoff) and plays an important role in noise robustness. The figure shows that lower values of λ make the coefficients more sensitive to noise, as is evident from comparing Fig. 2(b) and (d).

Fig. 2. KPC features for utterance march under different SNR conditions (clean and 10 dB) using different values of parameter λ (a)–(b) λ = 0.0001, (c)–(d) λ = 0.01.
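
The smoothing role of λ can be made explicit with a short worked derivation, offered here as a sketch rather than as part of the original formulation. It assumes that the kernel matrix is symmetric positive semidefinite with eigendecomposition $\mathbf{K} = \sum_{i} \mu_{i} \mathbf{u}_{i} \mathbf{u}_{i}^{T}$; the symbols $\mu_{i}$ and $\mathbf{u}_{i}$ are introduced only for this illustration. Starting from the residual vector in (9), the vector of fitted values is

$$\begin{aligned}
\mathbf{f} &= \mathbf{x} - \boldsymbol{\alpha}^{\ast} \\
&= \mathbf{x} - \lambda (\lambda \mathbf{I} + \mathbf{K})^{-1} \mathbf{x} \\
&= \mathbf{K} (\lambda \mathbf{I} + \mathbf{K})^{-1} \mathbf{x} \\
&= \sum_{i} \frac{\mu_{i}}{\mu_{i} + \lambda} \left( \mathbf{u}_{i}^{T} \mathbf{x} \right) \mathbf{u}_{i}
\end{aligned}$$

Each component of $\mathbf{x}$ along $\mathbf{u}_{i}$ is therefore scaled by a factor $\mu_{i} / (\mu_{i} + \lambda)$ between zero and one. When λ is much smaller than $\mu_{i}$ the factor is close to one and the component passes almost unchanged, noise included; a larger λ attenuates the components with small $\mu_{i}$. To the extent that noise falls in those directions, a larger λ filters it out, at the cost of also attenuating fine detail in the clean signal.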

We have used an SVM-based speaker verification system for all experiments, and the results presented in this paper are based on the YOHO database. The database contains a large set of high-quality speech recordings, sampled at 8 kHz, intended for speaker recognition research. We used eight speakers from the enrollment sessions; there are four sessions per speaker and 24 utterances per session. For training we used 25 percent of the utterances from each speaker, and another 25 percent were used for cross-validation. The remaining 50 percent of the utterances from each speaker were used for testing. Performance is reported using the equal error rate (EER), a metric widely used for quantifying the performance of a biometric system. The EER is defined as the error rate at which the false acceptance (false positive) rate equals the false rejection rate; thus, the lower the EER, the better the performance of the biometric system. Here, the EERs corresponding to each speaker were averaged to obtain an overall EER for the speaker verification system. As a baseline, the same verification system was developed using MFCC features, which were obtained from the DCT of the log-energy over 26 Mel-scale filter banks. To extract both KPC and MFCC features, a window size of 512 samples with an overlap of 256 samples was chosen. Noise clips from the NOISEX database [24] were added to clean speech from the YOHO database to generate the test data. For these experiments, it was assumed that the channel characteristics remained fixed between training and testing conditions. We considered three types of noise: white noise, speech babble noise (cafeteria noise), and car interior noise (Volvo 340 at 75 mi/h under rainy conditions).
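
Two steps of this protocol, mixing noise into clean speech at a target SNR and computing an EER from verification scores, can be sketched as follows. This is a hedged illustration, not the evaluation code used for the reported results: the function names `mix_at_snr` and `equal_error_rate` are hypothetical, the signals and scores are toy values, and the threshold sweep is one common way to estimate the EER that may differ from the procedure actually used.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals snr_db, then add it."""
    noise = noise[:len(clean)]  # assumes the noise clip is at least as long as the speech
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(clean, noise)]

def equal_error_rate(genuine, impostor):
    """Sweep thresholds over observed scores; return the rate where FAR and FRR are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine speakers rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2.0
    return best_rate

# Toy signals standing in for a speech utterance and a noise clip.
clean = [math.sin(0.2 * n) for n in range(200)]
noise = [math.cos(1.7 * n) for n in range(300)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Toy scores: one genuine trial scores below one impostor trial,
# so FAR and FRR meet at 1/4 and the EER is 0.25.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
```

With few trials the FAR and FRR curves may not cross exactly, which is why the sketch returns the average of the two rates at the threshold where they are closest.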

Table I summarizes the EERs obtained with the MFCC and KPC features at different SNR levels under the white noise condition. The table shows the effect of the smoothing parameter λ. When a very small value of λ is used, the performance under clean conditions improves, showing that the regression procedure is able to extract speaker-discriminatory information. For a higher value of λ, the verification performance under clean conditions deteriorates, but the system remains robust to corruption by noise (the increase in EER is more evident for smaller values of λ). This indicates that by appropriately choosing λ or the kernel filtering parameters, the training can focus on robust features that remain intact in noise, as opposed to acute or precise features that can be easily corrupted by noise.

TABLE I Comparison of EER(%) for an SVM-Based Speaker Verification System Using KPC Features With an Identical System Using MFCC, for Additive White Noise, at Various SNR Levels

We have also evaluated the EER of the KPC verification system for the “babble” noise and the “car” noise conditions. Figs. 3 and 4 compare the EERs obtained using an MFCC-based and a KPC-based verification system under both noise conditions and for different signal-to-noise ratios. For all of these experiments the parameter λ was set to 0.05. Based on the experimental results, the following can be inferred:

• For clean speech, the performance of the MFCC and KPC systems is comparable; both show EERs close to 0.30%, which is comparable with the reported state of the art on the YOHO corpus.

• The KPC features demonstrate a more robust performance in the presence of white, babble, and car noise where the degradation in performance (increase in EER) is lower compared to the MFCC based system.

Fig. 3. Comparison of EER for systems with MFCC and KPC features (λ = 0.05) for speech utterances corrupted with babble noise.

Fig. 4. Comparison of EER for systems with MFCC and KPC features (λ = 0.05) for speech utterances corrupted with car noise.

SECTION IV

## CONCLUSION AND EXTENSIONS

In this paper, we have proposed a novel noise-robust speech feature extraction method for speaker verification systems. The approach is primarily data driven and extracts nonlinear features of speech that are invariant to noise with different statistics. Experiments show that KPC features demonstrate consistent improvements in performance over benchmark MFCC features when applied to a speaker verification system, which demonstrates the effectiveness of the proposed features in noisy environments.

An extension of this work is to use the KPC features to extract cepstral coefficients that are robust to channel conditions. It can be shown that kernel regression in the cepstral domain is equivalent to a filtering operation, where the shape of the filter is determined by the choice of kernel function. Therefore, specific kernel functions can be used to construct a regression function that acts as a matched filter for reducing channel effects. To make the KPC features even more robust to noise, other types of projection can be used instead of the linear projection. The theory of KPC can also be extended by linking the projection block on regression functions with game-theoretic principles in machine learning. The regression function estimation and the projection block can be viewed as balancing criteria for optimizing a dual objective function [11]. In this paper, a single iteration between the criteria was used to identify the robust features. In principle, multiple iterations between the balancing criteria could be used to identify the weak features and thereby improve the performance of the system.

## Footnotes

Amin Fazel and Shantanu Chakrabartty are with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824-1226. Email: fazel@egr.msu.edu, shantanu@egr.msu.edu

## References

1. Automated speaker recognition: Current trends and future direction

D. A. Reynolds

Biometrics Colloquium, vol. 17, 2005

2. Speech recognition in noisy environments: A survey

Y. Gong

Speech Communication, vol. 16, p. 261–291, 1995

3. Speech variability in automatic speaker recognition systems for commercial and forensic purposes

J. Ortega-Garcia, J. Gonzalez-Rodriguez, S. Cruz-Llanas

IEEE Trans. Aerospace and Electronics Systems, vol. 15, issue (11), p. 27–32, 2000

4. A tutorial on text-independent speaker verification

F. Bimbot, et al.

EURASIP J. on Applied Signal Process., vol. 4, p. 430–451, 2004

5. Suppression of acoustic noise in speech using spectral subtraction

S. F. Boll

IEEE Trans. Acoust. Speech Signal Process., vol. 27, issue (2), p. 113–120, 1979

6. A signal subspace approach for speech enhancement

Y. Ephraim, H. L. Van Trees

IEEE Trans. on Speech Audio Process., vol. 3, issue (4), p. 251–266, 1995

7. Enhancement of log Mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise

L. Deng, J. Droppo, A. Acero

IEEE Trans. Speech Audio Process., vol. 12, issue (2), p. 133–143, 2004

8. Auditory nerve representation as a basis for speech processing

O. Ghitza

In S. Furui and M. M. Sondhi (Eds.), Advances in Speech Signal Process., p. 453–485, 1992

9. Perceptual linear predictive (PLP) analysis for speech

H. Hermansky

J. Acoust. Soc. Am., vol. 87, p. 1738–1752, 1990

10. Speech recognition using noise-adaptive prototypes

A. Nadas, D. Nahamoo, M. A. Picheny

IEEE Trans. Acoust. Speech Signal Process., vol. 37, issue (10), p. 1495–1503, 1989

11. Fundamentals of Speech Recognition

L. R. Rabiner, B. H. Juang

Englewood Cliffs, NJ: Prentice Hall, 1993

12. New LP-derived features for speaker identification

K. T. Assaleh, R. J. Mammone

IEEE Trans. on Speech Audio Process., vol. 2, issue (4), p. 630–638, 1994

13. Cepstral analysis techniques for automatic speaker verification

S. Furui

IEEE Trans. Acoust. Speech Signal Process., vol. 29, p. 254–272, 1981

14. Speech characterization and synthesis by nonlinear methods

M. Banbrook, S. McLaughlin, I. Mann

IEEE Trans. Speech Audio Process., vol. 7, p. 1–17, 1999

15. Evidence for nonlinear sound production mechanisms in the vocal tract

H. M. Teager, S. M. Teager

Proc. NATO ASI on Speech Production Speech Modeling, vol. 55, p. 241–261, 1990

16. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification

B. S. Atal

J. Acoust. Soc. Am., vol. 55, issue (6), p. 1304–1312, 1974

17. A new cepstrum-based channel compensation method for speaker verification

T. F. Lo, M. W. Mak, K. K. Yiu

Proc. Eurospeech, vol. 2, p. 775–778, 1999

18. Channel-robust speaker identification using modified-mean cepstral mean normalization with frequency warping

A. A. Garcia, R. J. Mammone

Proc. ICASSP, p. 325–328, 1999

19. Rasta processing of speech

H. Hermansky, N. Morgan

IEEE Trans. Speech Audio Process., vol. 2, issue (4), p. 578–589, 1994

20. Regularization theory and neural networks architectures

F. Girosi, M. Jones, T. Poggio

Neural Computation, vol. 7, p. 219–269, 1995

21. The Nature of Statistical Learning Theory

V. Vapnik

New York: Springer-Verlag, 1995

22. Robust speech feature extraction by growth transformation in reproducing kernel Hilbert space

S. Chakrabartty, Y. Deng, G. Cauwenberghs

IEEE Trans. Audio Speech Lang. Process., vol. 15, issue (6), p. 1842–1849, 2007

23. Functional Analysis

W. Rudin

New York: McGraw-Hill, 1973

24. The Noisex-92 Database

http://www.speech.cs.cmu.edu, Available online

This paper appears in: International Symposium on Circuits and Systems
Issue Date: 2009
On page(s): 113–116
Print ISBN: 978-1-4244-3827-3
INSPEC Accession Number: 10760359
Digital Object Identifier: 10.1109/ISCAS.2009.5117698
Date of Current Version: 26 Jun 2009