Speaker verification/identification is a popular biometric technique for authenticating and monitoring human subjects using their speech signal. The method is attractive because it does not require direct contact with the individual, thus avoiding the perceived invasiveness inherent in most biometric systems. Unfortunately, even though most existing speech-based recognition systems deliver acceptable accuracy under controlled conditions, their performance degrades significantly when subjected to the noise present in practical environments [1], [2]. For instance, a comparative study [2] demonstrated that additive broadband noise severely degrades the performance of a conventional recognition system. This degradation is primarily attributed to an unavoidable mismatch between training and recognition conditions, especially when the characteristics of all possible noise sources are not known in advance [3], [4]. The robust speech processing techniques proposed in the literature to mitigate this mismatch can be broadly categorized into three groups. The first strategy is to estimate and filter the noise prior to feature extraction, which can be seen as a separate speech enhancement block [5], [6], [7]. The second is to design a robust front-end feature extractor [12], [13], including features based on human auditory modeling [8], [9], [19]. The third aims at making the classifier itself more robust through on-line model adaptation [10].

For over four decades, the speech features used in most state-of-the-art recognition systems have relied on spectral techniques that inherently assume the linearity of the speech production mechanism. Examples of spectral features include Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP), cochlear features, and linear predictive coefficients (LPCs) [11], [12], [13].
MFCC features, for example, are calculated using a discrete cosine transform (DCT) of the smoothed power spectrum, and PLP features, like MFCCs, are based on human auditory models. While these feature sets accurately extract the linear information in speech signals, they do not capture the nonlinear or higher-order statistical characteristics of the signals, which have been shown to carry non-negligible information [14], [15]. The failure of spectral features to achieve robust performance in ambient noise motivates the investigation of alternative, computationally efficient formulations that exploit the nonlinear and higher-order statistical information embedded in the speech signal. Previous studies in this direction have approximated auditory time-series by low-dimensional nonlinear dynamical models. Another study [15] demonstrated that sustained vowels from different speakers exhibit nonlinear, non-chaotic behavior that can be embedded in a low-dimensional manifold of order less than four.

This paper proposes a novel speech feature extraction method called kernel predictive coefficients (KPCs) that exploits the nonlinear filtering properties of a functional regression procedure in a reproducing kernel Hilbert space (RKHS). RKHS regression has been extensively studied in the machine learning community, especially in the fields of regularization theory [20], support vector machines (SVMs) [21], and speech recognition [22]. In this paper, RKHS regression is combined with a linear projection step, which endows the KPC features with the following robustness properties:

- The algorithm makes no prior assumption on the noise statistics.
- It uses kernel methods to extract nonlinear features that are robust to corruption by noise.
- Robust parameter estimation is ensured by imposing smoothness constraints based on regularization principles.

The paper is organized as follows. Section II introduces the notation and formulation of kernel-based predictive coding. Section III presents results from verification experiments performed with the resulting features, and Section IV provides concluding remarks, discussion, and future work.

## II. KERNEL PREDICTIVE COEFFICIENTS FEATURE EXTRACTION

This section presents the fundamentals of the KPC feature extraction algorithm, based on the formulation introduced in [22]. The system-level architecture of the KPC feature extractor is shown in Fig. 1; it consists of a kernel regression module that interfaces with a linear projection module.

### A. Kernel Regression

Given a stationary discrete-time speech signal represented by $x[n]$, with *n* = 1, …, *N* denoting time indices, the aim of nonlinear prediction is to estimate a function $f : \mathbb{R}^{P} \rightarrow \mathbb{R}$ that predicts $x[n]$ based on the vector of the previous *P* samples, ${\bar{\bf x}}[n] = (x[n-1], \ldots, x[n-P])$. The prediction function *f* is an element of a Hilbert space ${\cal H}$, where $\langle \cdot, \cdot \rangle_{\cal H}$ defines an inner product between two functional elements. The prediction function *f* is obtained by minimizing the cost function
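As a concrete illustration, the delay vectors $\bar{\bf x}[n]$ and their target samples can be assembled as follows (a minimal NumPy sketch; the function name `delay_embed` and the 0-based indexing are our assumptions, not the paper's notation):

```python
import numpy as np

def delay_embed(x, P):
    """Stack the delay vectors xbar[n] = (x[n-1], ..., x[n-P]) as rows,
    paired with the target samples x[n], for n = P, ..., N-1 (0-based)."""
    N = len(x)
    X = np.stack([x[n - P:n][::-1] for n in range(P, N)])  # shape (N-P, P)
    y = x[P:]                                              # targets x[n]
    return X, y

x = np.arange(10, dtype=float)    # toy "signal" 0, 1, ..., 9
X, y = delay_embed(x, P=3)
# X[0] is [2., 1., 0.]: the three samples preceding the target y[0] == 3.0
```

Each row of `X` plays the role of one regression input $\bar{\bf x}[n]$ in the cost function below.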
$${\min_{f \in {\cal H}}}\ C(f)=\lambda \Vert f\Vert ^{2}_{\cal H}+\sum^{N}_{n=P+1}(x[n]-f({\bar{\bf x}}[n]))^{2}\eqno{\hbox{(1)}}$$The regularizer $\Vert f\Vert^{2}_{\cal H}$ penalizes large signal excursions by constraining the functional norm, and thus avoids over-fitting to the time-series data. The parameter λ in (1) determines the trade-off between the reconstruction error and the smoothness of the prediction function *f*. Using the properties of Hilbert spaces, *f* can be decomposed into a weighted sum of countably many orthogonal basis functions as
$$f({\bar{\bf x}}[n])=\sum^{\infty}_{i=1}b_{i}\phi_{i}({\bar{\bf x}}[n])\eqno{\hbox{(2)}}$$with coefficients $b_{i} \in \mathbb{R}$. The orthogonality of the basis functions can be expressed as $\langle \phi_{i}(\cdot), \phi_{j}(\cdot)\rangle_{\cal H} = 1/\Lambda_{i}$ for *i* = *j*, and equals zero otherwise. Using this property, the regularizer can be reformulated as
$$\eqalignno{\Vert f\Vert ^{2}_{\cal H} & = \langle f,f \rangle_{\cal H} \cr& = \sum^{\infty}_{i,j=1}b_{i}b_{j}\langle \phi_{i}(.), \phi_{j}(.)\rangle_{\cal H}\cr& = \sum^{\infty}_{i=1}{b^{2}_{i} \over \Lambda_{i}}& {\hbox{(3)}}}$$The optimization problem given by (1) can now be written as
$$\displaylines{\min_{b_{i}}\acute{C}(b_{i}) = \lambda \sum^{\infty}_{i=1}{b^{2}_{i} \over \Lambda_{i}}\hfill \cr\hfill \quad +\sum^{N}_{n=P+1} \left(x[n]-\sum^{\infty}_{i=1}b_{i}\phi_{i}(\bar{\bf x}[n])\right)^{2}\quad{\hbox{(4)}}}$$The solution is obtained by setting the derivative of (4) with respect to each parameter *b*_{i} to zero, which leads to
TeX Source
$$b_{i}={\Lambda_{i}\over \lambda}\sum^{N}_{n=P+1}\alpha[n]\phi_{i}(\bar{\bf x}[n])\eqno{\hbox{(5)}}$$where α[*n*] is the reconstruction error for the speech sample at time instant *n* = *P* + 1,…,*N*, given by
$$\eqalignno{\alpha[n] & = x[n]-\sum^{\infty}_{i=1}b_{i}\phi_{i}(\bar{\bf x}[n])\cr& = x[n]-f(\bar{\bf x}[n])& {\hbox{(6)}}}$$By substituting (5) into (2), the prediction function *f* can be written in terms of α as
$$\eqalignno{f(\bar{\bf x}[n]) & = \sum^{N}_{m=P+1}{\alpha[m]\over \lambda} \sum^{\infty}_{i=1} \Lambda_{i} \phi_{i}(\bar{\bf x}[m])\phi_{i}(\bar{\bf x}[n])\cr& = \sum^{N}_{m=P+1}{\alpha[m]\over \lambda}K(\bar{\bf x}[m], \bar{\bf x}[n])& {\hbox{(7)}}}$$where $K(\bar{\bf x}[m], \bar{\bf x}[n]) = \sum^{\infty}_{i=1}\Lambda_{i}\phi_{i}(\bar{\bf x}[m])\phi_{i}(\bar{\bf x}[n])$ is a reproducing kernel [21] over the space ${\cal H}$. Using *f* from (7), the reconstruction error α[*n*] can be written as
$$\alpha[n] = x[n] - \sum^{N}_{m=P+1}{\alpha[m]\over \lambda}K(\bar{\bf x}[m], \bar{\bf x}[n])\eqno {\hbox{(8)}}$$with the optimal solution in matrix form given by
$${\mbi \alpha}^{\ast}=\lambda(\lambda{\bf I}+{\bf K})^{-1}{\bf x}\eqno{\hbox{(9)}}$$with ${\mbi \alpha}^{\ast} = \{\alpha[n]\}$, ${\bf x} = \{x[n]\}$, and **K** representing the kernel matrix with elements $K_{mn} = K(\bar{\bf x}[m], \bar{\bf x}[n])$.
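The closed-form solution (9) reduces to a single linear solve. The sketch below assumes a Gaussian kernel and illustrative parameter values (`P`, `lam`, `gamma` are our assumptions; the paper does not fix a kernel at this point), and is meant only to show the computation, not a tuned feature extractor:

```python
import numpy as np

def kpc_alpha(x, P=8, lam=0.01, gamma=1.0):
    """Closed-form KPC reconstruction errors, eq. (9):
    alpha* = lam * (lam*I + K)^{-1} x.
    The Gaussian kernel and parameter values are illustrative
    assumptions, not choices fixed by the paper."""
    N = len(x)
    # delay vectors xbar[n] = (x[n-1], ..., x[n-P]) for n = P, ..., N-1
    X = np.stack([x[n - P:n][::-1] for n in range(P, N)])
    t = x[P:]  # samples to be predicted
    # kernel matrix K_{mn} = exp(-gamma * ||xbar[m] - xbar[n]||^2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * d2)
    # solve (lam*I + K) alpha = lam * x rather than forming the inverse
    return np.linalg.solve(lam * np.eye(N - P) + K, lam * t)

rng = np.random.default_rng(0)
frame = rng.standard_normal(64)   # stand-in for one speech frame
alpha = kpc_alpha(frame)          # alpha.shape == (56,)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is the numerically preferred way to evaluate $(\lambda{\bf I}+{\bf K})^{-1}{\bf x}$.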

## III. EXPERIMENTS AND RESULTS

Fig. 2(a) and (b) show KPCs with parameter λ = 0.0001 computed over consecutive speech frames for clean and noisy utterances of *march*. Fig. 2(c) and (d) show the same scenario for a higher value, λ = 0.01. The parameter λ is a smoothing parameter (controlling the bias-variance tradeoff) and plays an important role in noise robustness: lower values of λ make the coefficients more sensitive to noise, as is evident from comparing Fig. 2(b) and (d).

We have used an SVM-based speaker verification system for all experiments, and the results presented in this paper are based on the YOHO database. The database contains a large set of high-quality speech, sampled at 8 kHz, collected for speaker recognition research. We used eight speakers from the enrollment sessions; there are four sessions per speaker and 24 utterances per session. For training we used 25 percent of the utterances from each speaker, another 25 percent for cross-validation, and the remaining 50 percent for testing. We used the equal error rate (EER) metric, which is widely used for quantifying the performance of a biometric system. The EER is defined as the error rate at the operating point where the false-acceptance rate equals the false-rejection rate; thus, the lower the EER, the better the performance. The EERs for the individual speakers were averaged to obtain an overall EER for the speaker verification system. As a baseline, the same verification system was developed using MFCC features, obtained from the DCT of the log-energy over 26 Mel-scale filter banks. To extract both KPC and MFCC features, a window size of 512 samples with an overlap of 256 samples was chosen. Noise clips from the NOISEX database [24] were added to clean speech from the YOHO database to generate the test data. For these experiments, the channel characteristics were assumed to remain fixed between training and testing conditions. We considered three types of noise: white noise, speech babble (cafeteria) noise, and car interior noise (Volvo 340 at 75 mi/h under rainy conditions).
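For reference, the EER can be computed from raw verification scores as follows (a simplified sketch on a finite threshold grid, with toy scores; real evaluations typically interpolate the DET curve):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate (FAR)
    equals the false-rejection rate (FRR). On a finite score set the
    two curves are stepped, so return (FAR + FRR)/2 at the closest
    crossing over the observed thresholds."""
    thr = np.sort(np.unique(scores))
    pos = scores[labels == 1]                        # genuine-speaker trials
    neg = scores[labels == 0]                        # impostor trials
    far = np.array([np.mean(neg >= t) for t in thr])
    frr = np.array([np.mean(pos < t) for t in thr])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2])
labels = np.array([0,   0,   1,    1,   1,    0])
err = equal_error_rate(scores, labels)  # 1/3: at the crossing, one impostor
                                        # is accepted and one genuine rejected
```

Per-speaker EERs computed this way can then be averaged to obtain the overall system EER reported in the tables.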

Table I summarizes the EERs obtained with the MFCC and KPC features at different SNR levels under the white-noise condition. The table clearly shows the effect of the smoothing parameter λ. When a very small value of λ is used, performance under the clean condition improves, showing that the regression procedure is able to extract speaker-discriminatory information. For a higher value of λ, verification performance under the clean condition deteriorates, but the system remains robust to corruption by noise (the increase in EER with noise is more pronounced for smaller values of λ). This indicates that, by appropriately choosing λ or the kernel parameters, training can focus on robust features that remain intact in noise, as opposed to acute or precise features that are easily corrupted by noise.

We have also evaluated the EER of the KPC verification system under the "babble" and "car" noise conditions. Figs. 3 and 4 compare the EERs obtained using the MFCC- and KPC-based verification systems under both noise conditions and at different signal-to-noise ratios. For all these experiments the parameter λ was set to 0.05. From the experimental results, the following can be inferred:

- For clean speech, the performance of the MFCC and KPC systems is comparable, with EERs close to 0.30%, which is comparable with the state-of-the-art reported on the YOHO corpus.
- The KPC features are more robust in the presence of white, babble, and car noise: their degradation in performance (increase in EER) is lower than that of the MFCC-based system.

## IV. CONCLUSION AND EXTENSIONS

In this paper, we have proposed a novel speech feature extraction method for speaker verification systems that is robust to noise with different statistics. The approach is primarily data-driven and effectively extracts nonlinear features of speech that are invariant to such noise. Through experiments in several noisy conditions, we have shown that KPC features deliver consistent performance improvements over benchmark MFCC features when applied to speaker verification. An extension of this work is to use the KPC framework to extract cepstral coefficients that are robust to channel conditions. It can be shown that kernel regression in the cepstral domain is equivalent to a filtering operation, where the shape of the filter is determined by the choice of kernel function; specific kernel functions can therefore be used to construct a regression function that acts as a matched filter for reducing channel effects. To make the KPC features even more robust to noise, other types of projections can be used in place of the linear projection. The theory of KPC can also be extended by linking the projection block on the regression functions with game-theoretic principles in machine learning: the regression function estimation and the projection block can be viewed as balancing criteria optimizing a dual objective function [11]. In this paper, a single iteration between these criteria was used to identify the robust features; in principle, multiple iterations could be used to identify weak features and further improve the performance of the system.