SECTION I

INTRODUCTION

Mismatch between training and test conditions represents one of the most challenging problems facing speaker recognition researchers today. There can be considerable sources of mismatch present, including transmission channel differences [1], [2], handset variability [3], background noise [4], session variability due to physical stress [5], vocal effort such as whisper [11], [12], Lombard effect [13], non-stationary environments [10], and spontaneity of speech, to name but a few. Various compensation strategies have been proposed in the past to reduce unwanted variability between training and test utterances while retaining the speaker identity information. The current trend in state-of-the-art speaker recognition systems is to model the acoustic features with a GMM-UBM, use utterance dependent adapted GMM [7] mean super-vectors [14] as the features representing the speech segments, and model the super-vectors using various latent factor analysis techniques [1], [6], [15]. In [16], the aim was to identify the lower dimensional speaker and channel dependent subspaces, termed Eigenvoice [15], [17] and Eigenchannel [1], in the super-vector domain. In [1], an alternative was considered where speaker and channel variabilities were jointly modeled. The recently proposed i-vector [6] scheme utilizes a factor analysis framework [15], [18] to perform dimensionality reduction on the super-vectors while retaining important speaker discriminant information. This lower dimensional i-vector representation enables the development of full Bayesian techniques [19], [20], using a single model to represent the speaker and channel variability.

One limitation of the conventional GMM super-vector domain representation and subsequent factor analysis modeling is that it does not take into account the fact that the original acoustic features contain redundancy. In general, the speech short-time spectrum is known to be representable in a lower dimensional subspace, which motivates a separate class of speech enhancement methods known as signal subspace approaches [21], [22]. Linear correlation among the speech spectral components is quite high, which justifies the success of these methods. This phenomenon is also valid for popular acoustic features, such as Mel-frequency Cepstral Coefficients (MFCC) [23], [24], even though these features are processed through the Discrete Cosine Transform (DCT) for de-correlation before use in training or test.

A. Motivation

To motivate the proposed work, we first demonstrate that the conventional acoustic features can be constrained to reside in a lower dimensional subspace. For this purpose, we train a 1024 mixture full covariance GMM UBM using 60 dimensional MFCC features on a large background speech data set.1 For a typical mixture of this UBM, the covariance matrix and the distribution of its eigenvalues are shown in Fig. 1. From Fig. 1(a) it is clear that the full covariance matrix, which shows strong diagonal terms, has significant non-zero off-diagonal elements, indicating that the feature coefficients are not fully uncorrelated. Fig. 1(b) shows the sorted eigenvalues of the same covariance matrix, revealing that most of its energy is accounted for by the first few dimensions only. This shows that the acoustic feature space is actually lower dimensional and that the features can thus be further compacted or enhanced by using a factor analysis model. Also, it is known that the first few directions obtained by the Eigen-decomposition of acoustic feature covariance matrices are mostly speaker dependent (e.g., see Zhou and Hansen [25] for a quantitative analysis), while other directions are more phoneme dependent. In this study, considering these observations on the acoustic features, we investigate a factor analysis scheme on acoustic features for speaker recognition. We name this method acoustic factor analysis.

Figure 1
Fig. 1. Analysis of full covariance matrices of a UBM trained using 60-dimensional MFCC features $(20\ {\rm static}+\Delta+\Delta\Delta)$. (a) A 3-D surface plot of the covariance matrix showing high values on the diagonal and significant off-diagonal values, indicating correlation among different feature coefficients. (b) Sorted eigenvalues of the same covariance matrix, demonstrating that most of the energy is accounted for by the first few dimensions.

B. Limitations of Conventional Factor Analysis

Before formulating the factor analysis scheme on the front-end features, we first argue that the traditional factor analysis schemes do not take full advantage of the acoustic feature covariances. In a standard i-vector system, the dimensionality of the GMM super-vectors is reduced by a total factor analysis model, which is based on the idea that utterance super-vectors lie in a lower dimensional subspace. Let ${\bf m}_{s}$ denote a GMM super-vector extracted from an utterance $s$, and let ${\bf x}_{n}$ denote the acoustic features. For a randomly chosen utterance $s$, it is generally assumed that ${\bf m}_{s}$ is normally distributed with mean ${\bf m}_{0}$ and covariance matrix ${\bf B}$ [15]. Here, ${\bf m}_{0}$ denotes the speaker independent mean vector obtained by concatenating the UBM mean vectors ${\bf m}_{0[g]}$. Let the UBM covariance matrices be ${\boldsymbol\Sigma}_{g}$, where $g$ denotes the mixture number. The main motivation of both Eigenvoice and total variability modeling is that the super-covariance matrix ${\bf B}$ contains zero eigenvalues and thus some dimensions of ${\bf m}_{s}$ can be disregarded. For the $g$-th Gaussian mixture, the utterance dependent mean vector ${\bf m}_{s[g]}$ is estimated from the posterior mean of the acoustic features that belong to $s$, that is, ${\bf x}_{n}\in s$. This is a deterministic parameter. However, for a randomly selected utterance $s$, the sub-vectors ${\bf m}_{s[g]}$ are normally distributed random vectors having covariance matrix ${\bf B}_{[g]}$, which is the $g$-th sub-matrix of the super-covariance matrix ${\bf B}$. Clearly, the matrices ${\bf B}_{[g]}$ are not related to the feature covariance matrices ${\boldsymbol\Sigma}_{g}$, since the former represent the covariance of the mean sub-vectors ${\bf m}_{s[g]}$ obtained from different utterances, while the latter represent the covariance of the acoustic features ${\bf x}_{n}$, which is independent of the utterance.2 Thus, assuming that the matrix ${\bf B}$ contains zero eigenvalues is not equivalent to assuming the same for the ${\boldsymbol\Sigma}_{g}$ matrices. Though this reasoning is based on full covariance UBM models, similar arguments can be made for a diagonal covariance based system.

C. Feature Dimensionality Reduction

Given that the conventional acoustic features reside in a lower dimensional subspace, the natural question is how this knowledge can be used to extract utterance level features effectively. Since speaker dependent information is contained in the leading eigen-directions of the acoustic features [25], using all the feature coefficients for modeling channel degraded data will retain some nuisance components along with the speaker dependent information in the GMM super-vectors and i-vectors. Therefore, we propose a dimensionality reduction transformation of the acoustic features for each GMM mixture that emphasizes the speaker dependent information in the leading eigenvectors of the corresponding mixture covariance matrix, while suppressing some unwanted channel components. In this manner, the GMM super-vectors will be “enhanced” in the sense that they will be more speaker discriminative, and the subsequently extracted i-vectors will inherit this quality.

Dimensionality reduction of the acoustic features for de-correlation/enhancement is not a new concept. Many techniques found in the literature perform this task, including DCT, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Heteroscedastic LDA (HLDA), to name but a few [16], [26], [27]. The main goal of this process has been to model the features using diagonal covariance matrix GMM/HMMs for speech/speaker recognition. These techniques can be classified mainly into two groups by their mode of operation: 1) the signal processing domain, and 2) the model domain. In the first scenario, some transformation (supervised/unsupervised) is used at the signal/acoustic feature level in order to achieve improved energy compaction. The most common technique is the application of the DCT to the log-filterbank energies [23], popularized by the MFCC representation. PCA can also be used [26] by learning the principal directions from the Eigen-decomposition of the covariance matrix trained on the utterance data itself. In general, this class of processing depends only on the speech data under consideration and does not use any outside knowledge. In the second scenario, raw acoustic features (e.g., filter-bank energies) are initially used to train a large model, which is then used to derive the feature transformations. One such technique used in speaker recognition is HLDA [16], where a GMM-UBM is first trained on the raw acoustic features. Each mixture is then assumed to represent a separate class, and the HLDA transformation is trained so that discrimination between these classes is maximized. In a similar fashion, PCA projections can also be used in each GMM mixture as a transformation [28]. In these methods, after the initial training phase, the acoustic features are aligned to the mixture component providing the highest posterior probability and the corresponding transformation is used for dimensionality reduction.

Both the signal processing domain and the model domain feature dimensionality reduction techniques used previously have, in essence, one common property: they re-generate the acoustic features after a dimensionality reduction. This means that the subsequent procedures of the speaker recognition system must be trained again from these newly extracted features. Model domain dimensionality reduction has the extra inconvenience of mixture alignment. Speech features are known to be highly intertwined and overlapped in the vector space for different acoustic conditions and generally do not form meaningful clusters [29]. Thus, using the top posterior probability for aligning a feature vector to a single mixture may not be appropriate. To demonstrate this, we select MFCC feature vectors ${\bf x}_{n}$ from 10 development utterances that were used in the UBM training, and for each feature vector, we find the highest posterior probability among the 1024 mixtures of the UBM, $\max_{g}p(g\vert{\bf x}_{n})$. A histogram of these top mixture probabilities is shown in Fig. 2, which clearly demonstrates that only a few frames are unquestionably aligned to a specific Gaussian mixture (indicated by the high peak near $\max_{g}p(g\vert{\bf x}_{n})=1$). In actuality, a majority of the feature vectors are aligned with more than one mixture, resulting in a top mixture probability in the range of 0.3 to 0.8. Thus, using the top scoring mixture for hard alignment of feature vectors to a specific mixture can introduce inaccuracies and should be avoided if possible.
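The top-posterior analysis described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the three-component, two-dimensional mixture and the random frames are hypothetical stand-ins for the 1024-mixture UBM and the MFCC data, and the function names `log_gauss_full` and `top_posteriors` are ours.

```python
import numpy as np

def log_gauss_full(X, mean, cov):
    """Log-density of each row of X under N(mean, cov) with a full covariance."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)
    diff = np.linalg.solve(L, (X - mean).T)        # whitened residuals, shape (d, N)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(diff ** 2, axis=0))

def top_posteriors(X, weights, means, covs):
    """Return max_g p(g | x_n) for every frame x_n."""
    log_joint = np.stack([np.log(w) + log_gauss_full(X, m, S)
                          for w, m, S in zip(weights, means, covs)], axis=1)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)              # posteriors p(g | x_n)
    return post.max(axis=1)

# Toy stand-in for the UBM: three overlapping Gaussians in 2-D (hypothetical values).
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0]])
covs = np.array([np.eye(2), [[1.0, 0.3], [0.3, 0.5]], [[0.6, -0.2], [-0.2, 1.2]]])
X = rng.normal(size=(500, 2))                             # stand-in frames

top = top_posteriors(X, weights, means, covs)
hist, edges = np.histogram(top, bins=10, range=(0.0, 1.0))  # counts per probability bin
```

With strongly overlapping components, most of the mass of `hist` falls well below 1, which is the kind of behavior that argues against hard alignment.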

D. Further Implications of the Proposed Method

Figure 2
Fig. 2. Distribution of top posterior probabilities $\max_{g}p(g\vert{\bf x}_{n})$ obtained from a subset of development data.

Historically, feature extraction, dimensionality reduction, enhancement and normalization have always been thought of as processes separate from acoustic modeling. In this study, we propose a new modeling scheme for the acoustic features that bridges the gap between these two processing domains through integrated feature dimensionality reduction and modeling. We demonstrate that the proposed method not only performs dimensionality reduction, but also removes the need for hard clustering of features to a specific mixture and does not require retraining of the UBM from the new features, thereby incorporating a built-in feature normalization and enhancement scheme. All this is achieved using a single linear transformation derived from a pre-trained full covariance matrix UBM and applied in a probabilistic fashion to the mixture dependent Baum-Welch statistics.

E. Outline

This paper is organized as follows. In Section II, we formulate the proposed Acoustic Factor Analysis (AFA) scheme and derive the mixture-dependent transformation matrices. Section III describes the various properties of the AFA transformation, including normalization and enhancement. In Section IV, we describe how the proposed scheme can be integrated within an i-vector system followed by our system description in Section V. Experimental results are presented in Section VI, and finally, Section VII concludes the study.

SECTION II

ACOUSTIC FACTOR ANALYSIS

In this section, we describe the proposed factor analysis model of acoustic features, discuss its formulation and mixture-wise application for dimensionality reduction.

A. Formulation

Let ${\cal X}=\{{\bf x}_{n}\vert n=1,\ldots,N\}$ be the collection of all acoustic feature vectors from the development set, obtained from a large corpus of many speakers' recordings in diverse environment and channel conditions. Using a factor analysis model, the $d\times 1$ feature vector ${\bf x}$ can be represented by
$${\bf x}={\bf Wy}+{\boldsymbol\mu}+{\boldsymbol\epsilon}.\eqno{(1)}$$
Here, ${\bf W}$ is a $d\times q$ low rank factor loading matrix whose columns are $q<d$ bases spanning the subspace with important variability in the feature space, and ${\boldsymbol\mu}$ is the $d\times 1$ mean vector of ${\bf x}$. We refer to the $q\times 1$ latent variable vector (latent factors) ${\bf y}\sim{\cal N}({\bf 0},{\bf I})$ as the acoustic factors. We assume that the remaining noise component ${\boldsymbol\epsilon}\sim{\cal N}({\bf 0},\sigma^{2}{\bf I})$ is isotropic, and therefore the model is equivalent to probabilistic PCA (PPCA) [18]. In this model, the feature vectors are also normally distributed, such that ${\bf x}\sim{\cal N}({\boldsymbol\mu},\sigma^{2}{\bf I}+{\bf WW}^{T})$.
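As a quick sanity check of the generative model in (1), the following sketch draws samples and compares their empirical covariance with $\sigma^{2}{\bf I}+{\bf WW}^{T}$. The dimensions, the random loading matrix and the noise variance are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, N = 6, 2, 100_000            # hypothetical feature dim, factor dim, frame count
W = rng.normal(size=(d, q))        # factor loading matrix
mu = rng.normal(size=d)            # feature mean
sigma2 = 0.1                       # isotropic noise variance

y = rng.normal(size=(N, q))                        # acoustic factors, y ~ N(0, I)
eps = np.sqrt(sigma2) * rng.normal(size=(N, d))    # noise, eps ~ N(0, sigma2 * I)
x = y @ W.T + mu + eps                             # model (1), one row per frame

model_cov = sigma2 * np.eye(d) + W @ W.T           # covariance implied by the model
sample_cov = np.cov(x, rowvar=False)               # covariance measured on the samples
max_abs_err = np.max(np.abs(sample_cov - model_cov))
```

The value of `max_abs_err` shrinks as `N` grows, which is consistent with the stated marginal distribution of ${\bf x}$.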

The advantage of this model is that the acoustic factors ${\bf y}$, which define the weights of the factor loadings and which we believe are more speaker dependent [25], explain the correlation between the feature coefficients in ${\bf x}$, while the noise component ${\boldsymbol\epsilon}$ incorporates the residual variance of the data. It should be emphasized that even though we refer to the term ${\boldsymbol\epsilon}$ as “noise”, when used with cepstral features this term actually represents convolutional channel distortion [30]. A mixture of these models [18] can be used to incorporate the variations caused by different phonemes uttered by multiple speakers in distinct noisy/channel degraded conditions, given by
$$p({\bf x})=\sum_{g}w_{g}p({\bf x}\vert g)\eqno{(2)}$$
where for the $g$-th mixture,
$$p({\bf x}\vert g)={\cal N}\left({\boldsymbol\mu}_{g},\sigma_{g}^{2}{\bf I}+{\bf W}_{g}{\bf W}_{g}^{T}\right).\eqno{(3)}$$
Here, ${\boldsymbol\mu}_{g}$, $w_{g}$, ${\bf W}_{g}$ and $\sigma_{g}^{2}$ represent the mean vector, mixture weight, factor loading matrix, and noise variance for the $g$-th AFA model, respectively.

B. Mixture Dependent Transformation

One advantage of using the mixture of PPCA for acoustic factor analysis is that its parameters can be conveniently extracted from a GMM trained using the Expectation-Maximization (EM) algorithm [18]. Thus, we utilize a full covariance UBM to derive the AFA model parameters. The proposed feature transformation and dimensionality reduction procedure is presented below:

1) Universal Background Model

A full covariance UBM $\Lambda_{0}$ is trained on the development dataset ${\cal X}=\{{\bf x}_{n}\vert n=1,\ldots,N\}$, given by
$$p({\bf x}\vert\Lambda_{0})=\sum_{g=1}^{M}w_{g}{\cal N}({\boldsymbol\mu}_{g},{\boldsymbol\Sigma}_{g})\eqno{(4)}$$
where $w_{g}$ are the mixture weights, $M$ is the total number of mixtures, ${\boldsymbol\mu}_{g}$ are the mean vectors and ${\boldsymbol\Sigma}_{g}$ are the full covariance matrices. The mean and weight parameters of the UBM are identical to those of the mixture model of (2).

2) Noise Subspace Selection

We first need to set the value of $q$, which defines the number of principal axes we would like to retain. In other words, we assume that the lower $d-q$ dimensions of the features represent the noise subspace [21]. Using this value of $q$, we find the noise variance for the $g$-th mixture as
$$\sigma_{g}^{2}=\frac{1}{d-q}\sum_{i=q+1}^{d}\lambda_{g,i}\eqno{(5)}$$
where $\lambda_{g,q+1},\ldots,\lambda_{g,d}$ are the $d-q$ smallest eigenvalues of the covariance matrix ${\boldsymbol\Sigma}_{g}$. Thus, $\sigma_{g}^{2}$ is essentially the average variance lost per discarded dimension. It may be noted that the model allows the use of a different value of $q$ for each mixture. This has been investigated in [9], and we elaborate on this issue in greater detail in Section IV.

3) Compute the Factor Loading Matrix

The maximum likelihood estimate of the factor loading matrix ${\bf W}_{g}$ of the $g$-th mixture of the AFA model in (2) is given by
$${\bf W}_{g}={\bf U}_{{\bf q}_{g}}\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{1/2}{\bf R}_{g}\eqno{(6)}$$
where ${\bf U}_{{\bf q}_{g}}$ is a $d\times q$ matrix whose columns are the $q$ leading eigenvectors of ${\boldsymbol\Sigma}_{g}$, ${\boldsymbol\Lambda}_{{\bf q}_{g}}$ is a diagonal matrix containing the corresponding $q$ eigenvalues, and ${\bf R}_{g}$ is an arbitrary $q\times q$ orthogonal rotation matrix. In this work, we set ${\bf R}_{g}={\bf I}$.
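Steps 2) and 3) can be illustrated with a short NumPy sketch for a single mixture. The helper name `afa_loading` and the random covariance matrix are hypothetical and serve only to show how (5) and (6) follow from an eigen-decomposition, assuming ${\bf R}_{g}={\bf I}$.

```python
import numpy as np

def afa_loading(cov, q):
    """Return (sigma2, W, lam_q) for one mixture covariance, following (5) and (6)."""
    lam, U = np.linalg.eigh(cov)                 # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]               # reorder to descending
    sigma2 = lam[q:].mean()                      # (5): mean of the d - q smallest eigenvalues
    W = U[:, :q] * np.sqrt(lam[:q] - sigma2)     # (6) with R_g = I (scales each column)
    return sigma2, W, lam[:q]

rng = np.random.default_rng(1)
d, q = 5, 2                                      # hypothetical dimensions
B = rng.normal(size=(d, d))
cov_g = B @ B.T + 0.1 * np.eye(d)                # stand-in for a UBM covariance Sigma_g

sigma2_g, W_g, lam_q = afa_loading(cov_g, q)
model_cov = sigma2_g * np.eye(d) + W_g @ W_g.T   # covariance of the PPCA component in (3)
```

In this sketch, `model_cov` keeps the $q$ leading eigenvalues of `cov_g` and replaces the remaining ones by their average, so the total variance (trace) is preserved.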

4) Feature Transformation

The posterior mean of the acoustic factors ${\bf y}_{n}$ can be used as the transformed and dimensionality reduced version of ${\bf x}_{n}$ for the $g$-th component of the AFA model. This can be shown to be
$$E\{{\bf y}_{n}\vert{\bf x}_{n},g\}=\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle={\bf A}_{g}^{T}({\bf x}_{n}-{\boldsymbol\mu}_{g})\triangleq{\bf z}_{n,g}\eqno{(7)}$$
where
$${\bf A}_{g}={\bf W}_{g}{\bf M}_{g}^{-T}\eqno{(8)}$$
and
$${\bf M}_{g}=\sigma_{g}^{2}{\bf I}+{\bf W}_{g}^{T}{\bf W}_{g}.\eqno{(9)}$$
We term the matrix ${\bf A}_{g}$ the $g$-th AFA transform. In this operation, we are essentially replacing the original feature vectors ${\bf x}_{n}$ by the mixture dependent transformed acoustic features ${\bf z}_{n,g}$. Each feature vector ${\bf x}_{n}$ could be transformed by the ${\bf A}_{g}$ corresponding to the mixture component it is aligned with, and a new set of features could then be obtained. However, as noted earlier, we do not regenerate the acoustic features and instead use a probabilistic soft-alignment in our system. This is described in Section IV, where we discuss the integration of AFA within an i-vector system.
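A minimal sketch of the transform in (7)-(9) for one mixture is given below. The function name `afa_transform`, the dimensions and the random covariance and mean are hypothetical; the sketch simply chains (5), (6), (9) and (8) and applies the result to a few frames.

```python
import numpy as np

def afa_transform(cov, q):
    """Return the d x q AFA transform A_g of (8) for one mixture covariance."""
    lam, U = np.linalg.eigh(cov)
    lam, U = lam[::-1], U[:, ::-1]                 # descending eigenvalues
    sigma2 = lam[q:].mean()                        # noise variance, (5)
    W = U[:, :q] * np.sqrt(lam[:q] - sigma2)       # loading matrix, (6) with R_g = I
    M = sigma2 * np.eye(q) + W.T @ W               # (9)
    return W @ np.linalg.inv(M).T                  # (8): A_g = W_g M_g^{-T}

rng = np.random.default_rng(0)
d, q = 6, 3                                        # hypothetical dimensions
B = rng.normal(size=(d, d))
cov_g = B @ B.T + 0.1 * np.eye(d)                  # stand-in for Sigma_g
mu_g = rng.normal(size=d)                          # stand-in for mu_g
X = rng.multivariate_normal(mu_g, cov_g, size=4)   # four frames x_n

A_g = afa_transform(cov_g, q)
Z = (X - mu_g) @ A_g                               # rows are z_{n,g} = A_g^T (x_n - mu_g), (7)
```

Each row of `Z` has $q$ coefficients instead of $d$, which is the dimensionality reduction the text refers to.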

SECTION III

PROPERTIES OF THE AFA TRANSFORM

In this section, we discuss the general properties and advantages of the proposed acoustic feature model, the resulting transformation and the transformed features.

A. Probability Distribution of the Transformed Features

Here, we derive the probability distribution of the transformed acoustic features and show how AFA performs feature de-correlation. Let ${\bf z}_{n,g}=\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle$ denote the AFA transformed feature vector for the $g$-th mixture. The mean vector of ${\bf z}_{n,g}$ is
$$\begin{aligned}{\boldsymbol\mu}_{{\bf z}_{g}}&=E\left\{\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle\right\}\\&=E\left\{{\bf A}_{g}^{T}({\bf x}_{n}-{\boldsymbol\mu}_{g})\right\}={\bf 0}\end{aligned}\eqno{(10)}$$
and its covariance matrix is
$$\begin{aligned}{\boldsymbol\Sigma}_{{\bf z}_{g}}&=E\left\{{\bf z}_{n,g}{\bf z}_{n,g}^{T}\right\}-{\boldsymbol\mu}_{{\bf z}_{g}}{\boldsymbol\mu}_{{\bf z}_{g}}^{T}\\&={\bf A}_{g}^{T}E\left\{({\bf x}_{n}-{\boldsymbol\mu}_{g})({\bf x}_{n}-{\boldsymbol\mu}_{g})^{T}\right\}{\bf A}_{g}\\&={\bf A}_{g}^{T}{\boldsymbol\Sigma}_{g}{\bf A}_{g}.\end{aligned}\eqno{(11)}$$
For further simplification, we first substitute ${\bf W}_{g}$ from (6) into (9) and use ${\bf R}_{g}={\bf I}$ together with the orthonormality of the eigenvectors, ${\bf U}_{{\bf q}_{g}}^{T}{\bf U}_{{\bf q}_{g}}={\bf I}$, to obtain
$$\begin{aligned}{\bf M}_{g}&=\sigma_{g}^{2}{\bf I}+{\bf W}_{g}^{T}{\bf W}_{g}\\&=\sigma_{g}^{2}{\bf I}+\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T/2}{\bf U}_{{\bf q}_{g}}^{T}{\bf U}_{{\bf q}_{g}}\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{1/2}\\&={\boldsymbol\Lambda}_{{\bf q}_{g}}.\end{aligned}\eqno{(12)}$$
Next, substituting ${\bf W}_{g}$ and ${\bf M}_{g}$ from (6) and (12) into (8), we have
$${\bf A}_{g}^{T}={\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1}\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T/2}{\bf U}_{{\bf q}_{g}}^{T}.\eqno{(13)}$$
Using this expression for ${\bf A}_{g}^{T}$ in (11), we obtain
$$\begin{aligned}{\boldsymbol\Sigma}_{{\bf z}_{g}}&={\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1}\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T/2}{\boldsymbol\Lambda}_{{\bf q}_{g}}\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{1/2}{\boldsymbol\Lambda}_{{\bf q}_{g}}^{-T}\\&=\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right){\boldsymbol\Lambda}_{{\bf q}_{g}}^{-T}\\&={\bf I}-\sigma_{g}^{2}{\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1}.\end{aligned}\eqno{(14)}$$
Here, we use the identity ${\bf U}_{{\bf q}_{g}}^{T}{\boldsymbol\Sigma}_{g}{\bf U}_{{\bf q}_{g}}={\boldsymbol\Lambda}_{{\bf q}_{g}}$ and the fact that all the remaining matrices are diagonal. Thus, for a given mixture alignment $g$, the posterior means of the acoustic factors, i.e., the transformed feature vectors ${\bf z}_{n,g}$, follow a Gaussian distribution with zero mean and a diagonal covariance matrix given by ${\bf I}-\sigma_{g}^{2}{\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1}$. In other words, the AFA transformation de-correlates the mean normalized acoustic features in each mixture.
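The two closed-form results of this derivation, (12) and (14), can be checked numerically. The sketch below builds ${\bf A}_{g}$ from (5), (6), (8) and (9) for a randomly generated covariance matrix (a hypothetical stand-in for ${\boldsymbol\Sigma}_{g}$) and compares ${\bf A}_{g}^{T}{\boldsymbol\Sigma}_{g}{\bf A}_{g}$ against ${\bf I}-\sigma_{g}^{2}{\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 6, 3                                    # hypothetical dimensions
B = rng.normal(size=(d, d))
cov_g = B @ B.T + 0.1 * np.eye(d)              # stand-in for Sigma_g

lam, U = np.linalg.eigh(cov_g)
lam, U = lam[::-1], U[:, ::-1]                 # descending eigenvalues
sigma2 = lam[q:].mean()                        # (5)
W = U[:, :q] * np.sqrt(lam[:q] - sigma2)       # (6) with R_g = I
M = sigma2 * np.eye(q) + W.T @ W               # (9)
A = W @ np.linalg.inv(M).T                     # (8)

cov_z = A.T @ cov_g @ A                        # (11)
expected = np.diag(1.0 - sigma2 / lam[:q])     # I - sigma2 * Lambda_q^{-1}, (14)

m_matches = np.allclose(M, np.diag(lam[:q]))   # numerical check of (12)
cov_matches = np.allclose(cov_z, expected)     # numerical check of (14)
```

Because no sampling is involved, the agreement holds up to floating-point precision rather than only on average.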

B. Acoustic Feature Enhancement

In the $g$-th mixture, the AFA transformation matrix ${\bf A}_{g}^{T}$ given in (13) can be expressed as
$$\begin{aligned}{\bf A}_{g}^{T}&={\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1}\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T/2}{\bf U}_{{\bf q}_{g}}^{T}\\&={\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1/2}{\bf G}_{g}{\bf U}_{{\bf q}_{g}}^{T}\end{aligned}\eqno{(15)}$$
where we introduce a diagonal gain matrix given by
$${\bf G}_{g}={\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1/2}\left({\boldsymbol\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T/2}.\eqno{(16)}$$
The $i$-th diagonal entry of ${\bf G}_{g}$ is given by
$$G_{g}(i)=\sqrt{\frac{\lambda_{g,i}-\sigma_{g}^{2}}{\lambda_{g,i}}}.\eqno{(17)}$$
Setting aside the term ${\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1/2}$ in (15), we observe that the transformation performed by ${\bf A}_{g}^{T}$ in (7) first computes the inner product of the mean normalized acoustic feature with the $q$ principal eigenvectors of ${\boldsymbol\Sigma}_{g}$, and then, for the $i$-th eigenvector direction, applies the gain $G_{g}(i)$. The gain in (17) can be identified as a square-root Wiener gain function [31]. This becomes clearer if we define the a priori SNR $\xi$ of classic speech enhancement as [21], [32]
$$\xi=\frac{\lambda_{g,i}-\sigma_{g}^{2}}{\sigma_{g}^{2}}\eqno{(18)}$$
and use it to express the gain equations. The Wiener gain $G_{\rm W}$ and the square-root Wiener gain $G_{\sqrt{\rm W}}$ are given by
$$G_{\rm W}=\frac{\xi}{\xi+1}\ {\rm and}\ G_{\sqrt{\rm W}}=\left(\frac{\xi}{\xi+1}\right)^{1/2}.\eqno{(19)}$$
Substituting (18) into (19) gives $\xi/(\xi+1)=(\lambda_{g,i}-\sigma_{g}^{2})/\lambda_{g,i}$, so that $G_{\sqrt{\rm W}}$ coincides with $G_{g}(i)$ in (17). The Wiener and square-root Wiener gain functions are plotted against $\xi$ in Fig. 3. As discussed in [31] (page 179, Sec. 6.6.3), in the case of additive noise, the square-root Wiener filter is applied when the power spectrum, rather than the magnitude spectrum, of the filtered signal and the clean signal are desired to be equal. The operation performed by the AFA transformation in (15) can be interpreted as a gain function operating on a transformed space defined by the $i$-th eigenvector to obtain a clean eigenvalue $\lambda_{g,i}-\sigma_{g}^{2}$ from the noisy eigenvalue $\lambda_{g,i}$ [9]. Since the eigenvalues can be interpreted as a power spectrum obtained from the principal components [33], it is understandable why $G_{\sqrt{\rm W}}$ arises in this scenario instead of $G_{\rm W}$. Because the gain lies between 0 and 1, the square-root operation yields lower attenuation than the standard Wiener filter, as depicted in Fig. 3. It may be noted that conventional factor analysis techniques in the super-vector space can also be interpreted using similar Wiener-like gain functions, as discussed in [34].
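The relation between the AFA gain in (17) and the square-root Wiener gain in (19) can be verified directly. In the sketch below, the eigenvalues and the noise variance are hypothetical numbers, and the function name `wiener_gains` is ours.

```python
import numpy as np

def wiener_gains(xi):
    """Return the Wiener and square-root Wiener gains of (19) for a priori SNR xi."""
    g_w = xi / (xi + 1.0)
    return g_w, np.sqrt(g_w)

lam = np.array([5.0, 2.0, 1.0, 0.6])     # hypothetical leading eigenvalues lambda_{g,i}
sigma2 = 0.5                             # hypothetical noise variance sigma_g^2

xi = (lam - sigma2) / sigma2             # a priori SNR, (18)
g_w, g_sqrt_w = wiener_gains(xi)         # (19)
g_afa = np.sqrt((lam - sigma2) / lam)    # AFA gain G_g(i), (17)

gains_agree = np.allclose(g_afa, g_sqrt_w)     # (17) equals the square-root Wiener gain
less_attenuation = np.all(g_sqrt_w >= g_w)     # square-root gain attenuates less
```

Directions whose eigenvalue is close to the noise variance receive a gain near zero, while high-energy directions pass almost unchanged.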

Figure 3
Fig. 3. Input SNR [dB] $(\xi)$ vs. Wiener gains. The Wiener gain and the square-root Wiener gain are shown with a solid and a dashed line, respectively.

In the signal subspace speech enhancement method [21], a similar gain function is obtained by starting from the same model in (1), except for the standard normal assumption on the latent factors ${\bf y}$. In that work, the term ${\bf Wy}+{\boldsymbol\mu}\triangleq{\bf a}$ in (1) was interpreted as the “clean signal”, ${\bf x}$ as the noisy signal and ${\boldsymbol\epsilon}$ as the additive noise. The goal was to find an estimate $\hat{\bf a}$ of the clean signal by finding the posterior mean of ${\bf a}$ given the noisy signal ${\bf x}$ and the noise variance. However, in the AFA scheme, the goal is to estimate the posterior mean of the latent factors ${\bf y}$ as an “enhanced” and more compact version of the “noisy” (channel degraded) acoustic features ${\bf x}$ [18]. This difference between the two approaches yields two different optimization criteria and, consequently, different gain functions.

Another contrast between the speech enhancement schemes and the AFA transformation is the interpretation of noise. In conventional speech enhancement methods, the noise statistics are estimated from silence regions between speech segments [35], and thus for the signal subspace based method, the noise variance $\sigma_{g}^{2}$ is assumed to be known in the model (1). In our case, the noise we are attempting to remove or compensate for is actually an additive distortion in the cepstral domain, which will not exist in the silence regions. In addition, even if the silence segments were modeled in the UBM, it is very unlikely that the mixture components modeling the silences would be useful in determining the noise level in other components. Thus, even though the AFA dimension $q$ is related to the noise variance, we resort to setting the value of $q$ arbitrarily and computing the corresponding noise variance for each mixture using (5).

C. Acoustic Feature Variance Normalization

Going back to (15), the term ${\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1/2}$ normalizes the variance of the acoustic feature stream in the $i$-th eigen-direction, since $\lambda_{g,i}$ is the expected feature variance along this direction [36]. This means that the AFA transformation assumes that the features closely aligned with the $g$-th mixture originate from the same random process, and performs this normalization in addition to the enhancement described in the previous section. This process is interestingly similar to the cepstral variance normalization frequently performed in the front-end. However, feature domain processing considers the temporal movement of the features in performing these normalizations, assuming that the feature streams are independent, while AFA groups the features together in a mixture irrespective of their time location and performs the normalization along orthogonal axes derived from the corresponding mixture covariance matrix. It would be interesting to see how AFA systems perform if the feature domain normalizations are removed from the front-end. Recent studies [37] show that in the full-covariance UBM based i-vector scheme, a very basic scale normalization technique outperforms Cepstral Mean and Variance Normalization (CMVN) and feature Gaussianization [38]. This may be due to the assumption, inherent in these normalization schemes, that the feature coefficients are uncorrelated. We have yet to perform experiments on comparative feature normalization schemes using AFA and suggest this as future work.

SECTION IV

AFA INTEGRATED I-VECTOR SYSTEM

In this section, we describe how the proposed method can be incorporated into a conventional i-vector system [6].

A. UBM and AFA Model Training

First, a full covariance UBM $\Lambda_{0}$, given by (4), is trained on the development data vectors. Next, the AFA dimension $q$ is set, which defines the number of principal axes to retain from each mixture component. Using the value of $q$, we find the noise variance for the $g$-th mixture using (5). The factor loading matrix ${\bf W}_{g}$ and the transformation matrix ${\bf A}_{g}$ are then calculated using (6) and (8), respectively. After applying the transformation as in (7), the posterior means of the acoustic factors ${\bf z}_{n,g}=\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle$ are used as mixture dependent transformed acoustic features.

B. UBM Transformation

Following the discussion in Section III-A, and using (10) and (14), the AFA transformation requires a new transformed UBM $\hat{\Lambda}_{0}$ that models ${\bf z}_{n,g}$ instead of ${\bf x}_{n}$, such that
$$p({\bf z}\vert\hat{\Lambda}_{0})=\sum_{g=1}^{M}w_{g}{\cal N}({\bf 0},\hat{\boldsymbol\Sigma}_{g})\eqno{(20)}$$
where $\hat{\boldsymbol\Sigma}_{g}={\bf I}-\sigma_{g}^{2}{\boldsymbol\Lambda}_{{\bf q}_{g}}^{-1}={\boldsymbol\Sigma}_{{\bf z}_{g}}$. This UBM is not an actual acoustic model used to calculate the posterior probabilities or other statistics; (20) simply indicates how the UBM parameters should be modified/replaced compared to the original UBM $\Lambda_{0}$ given in (4). This transformation only affects the hyper-parameter estimation.

C. Baum-Welch Statistics Estimation

In this step, the zero and first order Baum-Welch statistics are extracted from each feature vector with respect to the UBM. Using the AFA transformed features, the statistics can be extracted as follows. The probabilistic alignment of feature ${\bf x}_{n}$ with the $g$-th mixture is given by
$$\gamma_{g}(n)=p(g\vert{\bf x}_{n})=\frac{p({\bf x}_{n}\vert g)w_{g}}{p({\bf x}_{n})}.\eqno{(21)}$$
For an utterance $s$, the zero order statistics are extracted as
$$N_{s}(g)=\sum_{n\in s}\gamma_{g}(n),\eqno{(22)}$$
which follows the standard procedure [6], [15]. Conventionally, the first order statistics are extracted as
$${\bf F}_{s}(g)=\sum_{n\in s}\gamma_{g}(n){\bf x}_{n}.$$
However, with the present AFA transform, the first order statistics $\hat{\bf F}_{s}(g)$ are extracted using the transformed features in the corresponding mixtures instead of the original features:
$$\begin{aligned}\hat{\bf F}_{s}(g)&=\sum_{n\in s}\gamma_{g}(n){\bf z}_{n,g}=\sum_{n\in s}\gamma_{g}(n){\bf A}_{g}^{T}({\bf x}_{n}-{\boldsymbol\mu}_{g})\\&={\bf A}_{g}^{T}\left[{\bf F}_{s}(g)-N_{s}(g){\boldsymbol\mu}_{g}\right]={\bf A}_{g}^{T}\bar{\bf F}_{s}(g)\end{aligned}$$
where $\bar{\bf F}_{s}(g)$ denotes the centralized first order statistics [20]. This transformation of the statistics is somewhat similar to the approach in [39], where it was done to normalize the UBM parameters to zero means and identity covariance matrices. However, in [39] the goal was to simplify the i-vector system algorithm, theoretically preserving the results of the procedure while adding computational benefits, whereas in this work we perform feature transformation and dimensionality reduction for possible improvement of the i-vector system performance.
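The statistics above can be accumulated without ever regenerating per-frame features, since the transform is applied once per mixture to the centralized statistics. The sketch below illustrates this under assumptions: the function name `afa_stats`, the array layout and the random inputs are hypothetical, and the posteriors `gamma` are taken as given rather than computed from a UBM.

```python
import numpy as np

def afa_stats(X, gamma, means, A):
    """Zero order and AFA-transformed first order statistics of one utterance.

    X: (N, d) frames, gamma: (N, M) posteriors gamma_g(n),
    means: (M, d) UBM means, A: (M, d, q) AFA transforms A_g.
    """
    N_s = gamma.sum(axis=0)                          # (22), shape (M,)
    F_s = gamma.T @ X                                # conventional first order stats, (M, d)
    F_bar = F_s - N_s[:, None] * means               # centralized first order stats
    F_hat = np.einsum('mdq,md->mq', A, F_bar)        # A_g^T F_bar_s(g), shape (M, q)
    return N_s, F_hat

rng = np.random.default_rng(0)
N, M, d, q = 50, 4, 6, 3                             # hypothetical sizes
X = rng.normal(size=(N, d))
gamma = rng.random(size=(N, M))
gamma /= gamma.sum(axis=1, keepdims=True)            # each row sums to one
means = rng.normal(size=(M, d))
A = rng.normal(size=(M, d, q))                       # stand-ins for the AFA transforms

N_s, F_hat = afa_stats(X, gamma, means, A)
```

The soft alignment enters only through `gamma`, so no frame is ever assigned to a single mixture.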

D. Hyper-Parameter Estimation

Training of the Total Variability (TV) matrix ${\bf T}$ for the i-vector system follows a procedure very similar to that discussed in [6]. In this system, the super-vector ${\bf m}_{s}$ of an utterance $s$ is expressed as

$${\bf m}_{s}={\bf m}_{0}+{\bf Tw}_{s}\tag{23}$$

where the $Md$ dimensional vector ${\bf m}_{0}$ denotes the speaker independent mean super-vector (i.e., the concatenation of the UBM means $\mu_{g}={\bf m}_{0[g]}$), ${\bf T}$ is an $Md\times R$ low rank matrix $(R<Md)$ whose columns span the total variability space, and ${\bf w}_{s}$ is a normally distributed random vector of size $R$, known as the total factors. The posterior mean of ${\bf w}_{s}$ given the data of an utterance is known as an i-vector.

1) Initialization

The size of the matrix ${\bf T}$ depends on the AFA parameter $q$. In the AFA based i-vector system, the super-vector dimension becomes $K=Mq$ instead of $Md$. Thus, the ${\bf T}$ matrix is set to size $K\times R$ and randomly initialized. We define the super-vector compression (SVC) ratio $\alpha=K/Md=q/d$, which measures the compaction obtained through the AFA transformation.
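As a small worked example of these sizes, the sketch below uses the settings reported later in the system description ($M=1024$ mixtures, $d=60$ features, $R=400$ total factors). The helper names `tv_matrix_shape` and `svc_ratio` are hypothetical.

```python
def tv_matrix_shape(M, q, R):
    """Shape (K, R) of the TV matrix when each mixture keeps q AFA coefficients."""
    K = M * q
    return (K, R)

def svc_ratio(q, d):
    """Super-vector compression ratio alpha = K / (M d) = q / d."""
    return q / d

M, d, R = 1024, 60, 400
for q in (24, 30, 36, 42, 48, 54):
    print(q, tv_matrix_shape(M, q, R), round(svc_ratio(q, d), 2))
```

For $q=42$, for instance, the super-vector shrinks from $Md=61440$ to $K=43008$ dimensions, i.e., $\alpha=0.7$.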

2) EM Iterations

For each utterance $s\in{\cal S}$, the $R\times R$ precision matrix ${\bf L}_{s}$ and the $R\times 1$ vector ${\bf B}_{s}$ are estimated as [40]:

$${\bf L}_{s}={\bf I}+\sum_{g=1}^{M}N_{s}(g){\bf T}_{[g]}^{T}\hat{\boldsymbol\Sigma}_{g}^{-1}{\bf T}_{[g]}\tag{24}$$

and

$${\bf B}_{s}=\sum_{g=1}^{M}{\bf T}_{[g]}^{T}\hat{\boldsymbol\Sigma}_{g}^{-1}\hat{\bf F}_{s}(g)\tag{25}$$

respectively, where ${\bf T}_{[g]}$ is the $g$-th sub-matrix of ${\bf T}$ of dimension $q\times R$, and $\hat{\boldsymbol\Sigma}_{g}$ is the $q\times q$ AFA transformed UBM covariance matrix. Note that the occupancy count $N_{s}(g)$ does not appear in (25), since the first order statistics $\hat{\bf F}_{s}(g)$ are already weighted by the frame posteriors. The total factors for the utterance $s$ are estimated as:

$${\bf w}_{s}={\bf L}_{s}^{-1}{\bf B}_{s}.\tag{26}$$

In each iteration, the $g$-th block of the ${\bf T}$ matrix is updated using the following equation:

$${\bf T}_{[g]}=\sum_{s\in{\cal S}}\hat{\bf F}_{s}(g){\bf w}_{s}^{T}\left[\sum_{s\in{\cal S}}\left({\bf L}_{s}^{-1}+{\bf w}_{s}{\bf w}_{s}^{T}\right)N_{s}(g)\right]^{-1}\tag{27}$$

which follows the same procedure as a conventional i-vector system [6], [40].
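One EM iteration can be sketched as follows. This is a toy illustration with random statistics rather than the authors' code: the function names `e_step` and `m_step`, the array layouts, and the toy dimensions are assumptions, and the vector ${\bf B}_{s}$ uses the standard posterior form in which the first order statistics already carry the occupancy weighting.

```python
import numpy as np

def e_step(N, F_hat, T, sigma_hat_inv):
    """Posterior precision L_s and mean w_s for one utterance.

    N: (M,), F_hat: (M, q), T: (M, q, R), sigma_hat_inv: (M, q, q).
    """
    R = T.shape[2]
    TtSi = np.einsum("gqr,gqp->grp", T, sigma_hat_inv)      # T_[g]^T Sigma_hat_g^{-1}, (M, R, q)
    L = np.eye(R) + np.einsum("g,grp,gps->rs", N, TtSi, T)  # precision matrix
    B = np.einsum("grp,gp->r", TtSi, F_hat)                 # standard form, no extra N_s(g)
    w = np.linalg.solve(L, B)                               # total factors, w = L^{-1} B
    return L, w

def m_step(stats, posteriors, M, q, R):
    """Update every block T_[g]; stats holds (N, F_hat), posteriors holds (L, w)."""
    C = np.zeros((M, q, R))      # sum_s F_hat_s(g) w_s^T
    Acc = np.zeros((M, R, R))    # sum_s N_s(g) (L_s^{-1} + w_s w_s^T)
    for (N, F_hat), (L, w) in zip(stats, posteriors):
        second = np.linalg.inv(L) + np.outer(w, w)
        C += F_hat[:, :, None] * w[None, None, :]
        Acc += N[:, None, None] * second[None, :, :]
    # T_[g] = C_g Acc_g^{-1}; solve the transposed system instead of inverting.
    return np.stack([np.linalg.solve(Acc[g].T, C[g].T).T for g in range(M)])

rng = np.random.default_rng(0)
M, q, R, S = 3, 4, 2, 20                      # toy sizes, far smaller than a real system
T = rng.standard_normal((M, q, R))            # random initialization
sigma_hat_inv = np.stack([np.eye(q) * 2.0 for _ in range(M)])
stats = [(rng.uniform(1.0, 5.0, M), rng.standard_normal((M, q))) for _ in range(S)]
for _ in range(5):
    posteriors = [e_step(N, F_hat, T, sigma_hat_inv) for N, F_hat in stats]
    T = m_step(stats, posteriors, M, q, R)
print(T.shape)
```

Solving the linear systems rather than forming explicit inverses is a numerical convenience in this sketch and does not change the update.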

SECTION V

SYSTEM DESCRIPTION

We perform our experiments on the male trials of the NIST SRE 2010 telephone and microphone conditions (core conditions 1–5, extended trials). A standard i-vector system [6] with a Gaussian Probabilistic Linear Discriminant Analysis (PLDA) [41] back-end is used for evaluation. Specific blocks of the baseline system implementation and details of the proposed scheme are described below. An overall block diagram of the proposed system is included in Fig. 4.

Fig. 4. A block diagram of the proposed AFA integrated i-vector system. The system is shown in two phases: (a) development and (b) evaluation. In the evaluation phase, only the i-vector extraction procedure is depicted, assuming an arbitrary classifier. For details on the PLDA classifier used, refer to Section V-D.

A. Feature Extraction

In order to remove the silence frames, an independent Hungarian phoneme recognizer [42] combined with an energy based voice activity detection (VAD) scheme is used. A 60-dimensional feature vector $(19\ {\rm MFCC}+{\rm Energy}+\Delta+\Delta\Delta)$ is extracted using a 25 ms analysis window with subsequent 10 ms shifts, and then Gaussianized utilizing a 3-s sliding window [38].

B. UBM Training

Gender dependent UBMs having full and diagonal-covariance matrices with 1024 mixtures are trained on telephone utterances selected from the Switchboard II Phase 2 and 3, Switchboard Cellular Part 1 and 2, and the NIST 2004, 2005, 2006 SRE enrollment data. We use the HTK toolkit for training with 15 iterations per mixture split. The UBM full covariance values were floored to $10^{-5}$ using the -v option of the HTK HERest tool [43].

C. Total Variability Modeling

For the TV matrix training, the UBM training dataset is utilized. Five iterations are used for the EM training. We use 400 total factors (i.e., our i-vector size was 400). All i-vectors are first whitened and then length normalized using radial Gaussianization [41].
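The whitening and length normalization step can be sketched as below. This is a generic illustration under stated assumptions: the whitening transform is estimated from the eigen-decomposition of the training i-vector covariance, and the function names and random stand-in data are hypothetical, not taken from [41].

```python
import numpy as np

def fit_whitener(train_ivecs):
    """Estimate the mean and a whitening matrix W such that W^T cov W = I."""
    mean = train_ivecs.mean(axis=0)
    cov = np.cov(train_ivecs - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals)     # scale each eigenvector by 1 / sqrt(eigenvalue)
    return mean, W

def whiten_and_length_normalize(ivecs, mean, W):
    """Whiten the i-vectors, then project each one onto the unit sphere."""
    white = (ivecs - mean) @ W
    return white / np.linalg.norm(white, axis=1, keepdims=True)

rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 8))  # stand-in for training i-vectors
mean, W = fit_whitener(train)
out = whiten_and_length_normalize(train[:5], mean, W)
print(np.linalg.norm(out, axis=1))
```

The whitening parameters are estimated once on development data and then reused unchanged for enrollment and test i-vectors.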

D. Session Variability Compensation and Scoring

A Gaussian probabilistic linear discriminant analysis (PLDA) model with a full-covariance noise process is used for session variability compensation and scoring [41]. In this generative model, an $R$ dimensional i-vector ${\bf w}_{s}$ extracted from a speech utterance $s$ is expressed as:

$${\bf w}_{s}={\bf w}_{0}+{\boldsymbol\Phi}{\boldsymbol\beta}+{\bf n}\tag{28}$$

where ${\bf w}_{0}$ is an $R\times 1$ speaker independent mean vector, ${\boldsymbol\Phi}$ is the $R\times N_{EV}$ rectangular matrix representing a basis for the speaker-specific subspace (the eigenvoices), ${\boldsymbol\beta}$ is an $N_{EV}\times 1$ latent vector having a standard normal distribution, and ${\bf n}$ is the $R\times 1$ random vector representing the full covariance residual noise. The only model parameter here is the number of eigenvoices $N_{EV}$, that is, the number of columns in the matrix ${\boldsymbol\Phi}$. I-vectors extracted from the UBM training dataset and additional microphone data selected from SRE 2004 and 2005 are utilized to train this PLDA model.
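The generative reading of this model can be illustrated by sampling from it. The sketch below assumes, as is usual for PLDA, that the latent speaker factor is shared across all sessions of one speaker while the residual noise is drawn per session. The function name `sample_plda` and the toy parameter values are hypothetical.

```python
import numpy as np

def sample_plda(w0, Phi, noise_cov, n_sessions, rng):
    """Draw n_sessions i-vectors of one speaker from w = w0 + Phi beta + n."""
    beta = rng.standard_normal(Phi.shape[1])   # one latent speaker factor, shared across sessions
    noise = rng.multivariate_normal(np.zeros(len(w0)), noise_cov, size=n_sessions)
    return w0 + Phi @ beta + noise             # (n_sessions, R)

rng = np.random.default_rng(0)
R, n_ev = 6, 2                                 # toy sizes; the paper uses R = 400
w0 = np.zeros(R)
Phi = rng.standard_normal((R, n_ev))           # hypothetical eigenvoice basis
noise_cov = 0.1 * np.eye(R)                    # hypothetical residual covariance
ivecs = sample_plda(w0, Phi, noise_cov, n_sessions=3, rng=rng)
print(ivecs.shape)
```

Increasing `n_ev` corresponds to increasing $N_{EV}$, i.e., allowing a higher dimensional speaker subspace.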

SECTION VI

EVALUATION RESULTS

A. Performance Evaluation of AFA Systems

In this experiment, in separate runs we retain $q=36$, 42 and 48 coefficients from the $d=60$ dimensional features using the proposed AFA method. We vary the number of eigenvoices $N_{EV}$ in the PLDA model from 50 to 400 in increments of 50. The performance metrics used are the Equal Error Rate (%EER) and the minimum Detection Cost Functions (DCF) defined in NIST SRE 2008 [44] $({\rm DCF}_{old})$ and NIST SRE 2010 [45] $({\rm DCF}_{new})$. The results are summarized in the plot shown in Fig. 5, and a subset of these results, organized by performance metric, is also shown in Table I. The proposed systems are compared against our baseline full-covariance and diagonal-covariance UBM based i-vector systems, referred to as “Baseline full-cov” and “Baseline diag-cov”, respectively.
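For readers unfamiliar with these metrics, the sketch below computes the EER and a normalized minimum DCF from lists of target and non-target scores by a simple threshold sweep. It is not the NIST scoring tool. The cost parameters shown for the two DCF variants ($C_{miss}=10$, $C_{fa}=1$, $P_{target}=0.01$ for ${\rm DCF}_{old}$; $C_{miss}=1$, $C_{fa}=1$, $P_{target}=0.001$ for ${\rm DCF}_{new}$) are the values commonly quoted for the two evaluation plans and should be checked against [44] and [45]; the function names and toy scores are hypothetical.

```python
def error_rates(target_scores, nontarget_scores):
    """Return a list of (p_miss, p_fa) pairs, one per candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    rates = []
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        rates.append((p_miss, p_fa))
    return rates

def eer(target_scores, nontarget_scores):
    """Equal error rate: operating point where miss and false-alarm rates are closest."""
    rates = error_rates(target_scores, nontarget_scores)
    p_miss, p_fa = min(rates, key=lambda r: abs(r[0] - r[1]))
    return 0.5 * (p_miss + p_fa)

def min_dcf(target_scores, nontarget_scores, c_miss, c_fa, p_target):
    """Minimum detection cost, normalized by the cost of the best trivial system."""
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    costs = [c_miss * p_target * pm + c_fa * (1.0 - p_target) * pf
             for pm, pf in error_rates(target_scores, nontarget_scores)]
    return min(costs) / norm

targets = [0.9, 0.8, 0.3]        # toy target-trial scores
nontargets = [0.5, 0.2, 0.1]     # toy non-target-trial scores
print(eer(targets, nontargets))
print(min_dcf(targets, nontargets, c_miss=10, c_fa=1, p_target=0.01))   # assumed DCF_old setting
print(min_dcf(targets, nontargets, c_miss=1, c_fa=1, p_target=0.001))   # assumed DCF_new setting
```

The sweep is quadratic in the number of trials, which is acceptable for illustration but too slow for the extended trial lists; a sorted cumulative count would be used in practice.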

Fig. 5. Performance comparison between the proposed AFA and baseline i-vector systems with respect to (a) %EER, (b) ${\rm DCF}_{old}$ and (c) ${\rm DCF}_{new}$ for different eigenvoice sizes $N_{EV}$ of the PLDA model. Evaluation is performed on NIST SRE 2010 core condition-5 using the extended trials.

TABLE I PERFORMANCE COMPARISON BETWEEN BASELINE I-VECTOR AND PROPOSED AFA SYSTEMS FOR DIFFERENT VALUES OF $N_{EV}$ AND $q$. EVALUATION PERFORMED ON NIST SRE 2010 CORE CONDITION-5 EXTENDED TRIALS

From Fig. 5(a)–(c), we observe that for $q=42$ and for almost all values of $N_{EV}$, the proposed AFA system performs better than both baseline systems with respect to all three performance metrics. For $q=48$, the AFA system is superior to the baselines in ${\rm DCF}_{new}$, but very close with respect to the other performance measures. For $q=42$ and $N_{EV}=200$, we achieve the best EER of 1.73%, which is 11.28% lower, in relative terms, than the corresponding Baseline full-cov system EER. The results in Fig. 5 and Table I indicate that the proposed AFA transformation of the acoustic features is able to reduce nuisance directions in the feature space, producing i-vectors with better speaker discriminating ability. We also note that our full-covariance baseline system and AFA based systems perform significantly better than the diagonal-covariance system.
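The relative figures quoted above can be reproduced with a one-line formula. The sketch assumes the usual definition, relative improvement = (baseline − proposed) / baseline. The baseline EER of about 1.95% used below is back-computed from the two numbers in the text and is not read from Table I; the function names are hypothetical.

```python
def relative_improvement(baseline, proposed):
    """Relative improvement in percent (assumed definition)."""
    return 100.0 * (baseline - proposed) / baseline

def implied_baseline(proposed, rel_improvement_pct):
    """Baseline value implied by a proposed value and its relative improvement."""
    return proposed / (1.0 - rel_improvement_pct / 100.0)

base = implied_baseline(1.73, 11.28)   # back-computed, not a tabulated result
print(round(base, 2), round(relative_improvement(1.95, 1.73), 2))
```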

B. Effect of Different AFA Dimension

In Fig. 6, AFA system performance is compared with the Baseline full-cov system for different values of $q$, keeping the parameter $N_{EV}$ fixed at 150. Here we use $q=24$, 30, 36, 42, 48 and 54, yielding super-vector compression (SVC) ratios of $\alpha=0.4$, 0.5, 0.6, 0.7, 0.8 and 0.9, respectively. From this figure, we observe that the system performance is quite sensitive to the $q$ parameter of the proposed AFA method, though a performance improvement over the baseline system is achieved in almost all cases. If the value of $q$ is too low, some speaker dependent information is removed by the AFA transform and system performance degrades. Values of $q$ close to the feature dimension $d$ yield performance similar to the baseline system. We observe consistent improvements in system performance when $q$ is set in the range of 42 to 48 for the AFA systems. In this region, the relative improvements in all three performance metrics are in the range of 4% to 12%. We believe the fluctuation in performance arises because a different value of $q$ is suitable for each mixture component. Thus, methods of selecting the optimal AFA dimension are a viable direction for future work, especially since the model allows a different value of $q$ for each mixture.

Fig. 6. Performance comparison of the AFA system for different values of $q$ with respect to % Relative Improvements (RI) in %EER, ${\rm DCF}_{old}$ and ${\rm DCF}_{new}$ compared to the corresponding baseline system performance metric. Evaluation is performed on NIST SRE 2010 core condition-5 using the extended trials. The figure clearly reveals that the system performance drastically degrades as the value of $q$ is reduced.

C. Effect of UBM Variance Flooring

It is known that full covariance UBM based speaker recognition systems can be very sensitive to small values in the UBM covariance matrices [20]. In [20], a variance flooring algorithm [46] was used to tackle this issue. As mentioned in Section V-B, we performed UBM variance flooring by limiting the minimum value of a covariance matrix component to $10^{-5}$ using HTK. We refer to this flooring method as “vFloor-1”. To observe the effect of an alternate variance flooring on the AFA systems, we trained the UBM as described in [20]. In each EM iteration, the full covariance matrices were processed using the flooring function described in Table II [20], [46]. We used the floor matrix ${\bf F}=f\bar{\boldsymbol\Sigma}$, where

$$\bar{\boldsymbol\Sigma}=\frac{1}{M}\sum_{g=1}^{M}{\boldsymbol\Sigma}_{g}\tag{29}$$

is the average covariance matrix, and $f=0.1$ is set as in [20]. We refer to this flooring method as “vFloor-2”. Baseline and AFA system results using these two different UBM flooring methods are summarized in Table III. In this experiment, the PLDA size $N_{EV}$ was set to 150.
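The contents of Table II are not reproduced in this text, so the following is only a sketch of the eigenvalue based flooring commonly associated with [46], under the assumption that this is what vFloor-2 refers to: each covariance is expressed relative to the floor matrix, eigenvalues below one are raised to one, and the result is mapped back. The function name `floor_covariance` and the random covariances are hypothetical.

```python
import numpy as np

def floor_covariance(S, F):
    """Floor a full covariance S so that S_floored - F is positive semi-definite."""
    Lc = np.linalg.cholesky(F)                 # F = Lc Lc^T
    Lc_inv = np.linalg.inv(Lc)
    T = Lc_inv @ S @ Lc_inv.T                  # S expressed relative to the floor
    eigvals, U = np.linalg.eigh(T)
    eigvals = np.maximum(eigvals, 1.0)         # eigenvalues below 1 fall under the floor
    return Lc @ (U * eigvals) @ U.T @ Lc.T     # map back to the original space

rng = np.random.default_rng(0)
covs = []
for _ in range(4):
    B = rng.standard_normal((5, 5))
    covs.append(B @ B.T + 1e-3 * np.eye(5))    # random stand-ins for mixture covariances
F = 0.1 * np.mean(covs, axis=0)                # floor matrix F = f * average covariance, f = 0.1
floored = [floor_covariance(S, F) for S in covs]
print(min(np.linalg.eigvalsh(Sf - F).min() for Sf in floored))
```

Unlike an element-wise floor, this operation changes the eigenvalues of the covariance matrices themselves, which is relevant for a method that depends on their eigen-structure.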

TABLE II UBM COVARIANCE MATRIX FLOORING FUNCTION (VFloor-2) [20]

TABLE III PERFORMANCE COMPARISON BETWEEN BASELINE I-VECTOR AND DIFFERENT AFA SYSTEMS USING ALTERNATE UBM FLOORING. EVALUATIONS PERFORMED ON NIST SRE 2010 CORE CONDITION-5 EXTENDED TRIALS

From the results, we observe that the variance flooring vFloor-2 [20] provides slightly improved baseline system performance compared to vFloor-1 with respect to %EER and ${\rm DCF}_{old}$, but degrades the ${\rm DCF}_{new}$ measure. The proposed AFA transformation yields a much larger improvement over the baseline system when vFloor-1 is used. With vFloor-2, AFA improves over the baseline system only for $q=54$, whereas with vFloor-1 an improvement is observed for $q=42$, 48 and 54. This deterioration of AFA system performance can be expected, since the vFloor-2 algorithm modifies the eigenvalues of the covariance matrices, on which the AFA approach directly relies. Noting that AFA with vFloor-1 provides the best overall performance and that vFloor-2 does not provide sufficient advantage over vFloor-1, we use the vFloor-1 method in all of our subsequent experiments.

D. Performance in Microphone Conditions

In this section we present evaluation results of the proposed systems on the NIST SRE 2010 core conditions 1–4 using the extended trials. In these experiments, additional microphone data from the SRE 2005 and 2006 corpora were included for UBM and TV matrix training. The PLDA model was trained using both telephone and microphone data as before. The results are given in Tables IV–VII. We compare the following systems: Baseline full-cov, and AFA with $q=36$, 42, 48 and 54. The PLDA parameter $N_{EV}$ was set to 150. We did not evaluate the diagonal UBM system in these conditions.

TABLE IV PERFORMANCE COMPARISON BETWEEN BASELINE I-VECTOR AND DIFFERENT AFA SYSTEMS. EVALUATION PERFORMED IN NIST SRE 2010 CORE CONDITION-1 EXTENDED TRIALS

TABLE V PERFORMANCE COMPARISON BETWEEN BASELINE I-VECTOR AND DIFFERENT AFA SYSTEMS. EVALUATION PERFORMED IN NIST SRE 2010 CORE CONDITION-2 EXTENDED TRIALS

TABLE VI PERFORMANCE COMPARISON BETWEEN BASELINE I-VECTOR AND DIFFERENT AFA SYSTEMS. EVALUATION PERFORMED IN NIST SRE 2010 CORE CONDITION-3 EXTENDED TRIALS

TABLE VII PERFORMANCE COMPARISON BETWEEN BASELINE I-VECTOR AND DIFFERENT AFA SYSTEMS. EVALUATION PERFORMED IN NIST SRE 2010 CORE CONDITION-4 EXTENDED TRIALS

From the results, we again observe that the proposed AFA systems consistently outperform the baseline system, especially for conditions 1–3. However, no single setting of $q$ provides the best performance across all of the performance metrics. Considering the best %EER values, the proposed systems achieved 8.14%, 6.43%, 8.67% and 12.33% relative improvements in conditions 1, 2, 3 and 4, respectively. These results demonstrate the effectiveness of the proposed scheme in the microphone mismatched conditions as well.

E. Fusion of Multiple Systems

We select three of our systems for fusion: (i) Baseline full-cov, (ii) AFA $(q=42)$ and (iii) AFA $(q=48)$. The PLDA $N_{EV}$ parameter was set to 150 for all systems. Simple equal-weight linear fusion was used, with the scores of each individual system normalized to zero mean and unit variance, i.e., (0, 1), for calibration. Results are shown for NIST SRE 2010 core condition 5 and for the pooled condition (combining all trials from conditions 1–5) in Tables VIII and IX, respectively.
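The fusion rule described above can be sketched as follows. This is a minimal illustration: it assumes the normalization statistics are computed over the same list of trial scores being fused (the text does not say on which data they are estimated), and the function names and toy scores are hypothetical.

```python
from statistics import mean, pstdev

def znorm(scores):
    """Normalize one system's scores to zero mean and unit variance."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]

def fuse_equal_weight(*systems):
    """Average the normalized scores of all systems, trial by trial."""
    normalized = [znorm(s) for s in systems]
    return [sum(trial) / len(normalized) for trial in zip(*normalized)]

a = [1.0, 2.0, 3.0]       # toy scores from one system
b = [10.0, 20.0, 30.0]    # toy scores from another system on a different scale
fused = fuse_equal_weight(a, b)
print(fused)
```

Normalizing first keeps a system with a larger raw score range from dominating the equal-weight average.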

TABLE VIII LINEAR EQUAL-WEIGHT SCORE FUSION PERFORMANCE OF BASELINE I-VECTOR AND PROPOSED SYSTEMS FOR NIST SRE 2010 CORE CONDITION-5

TABLE IX LINEAR EQUAL-WEIGHT SCORE FUSION PERFORMANCE OF BASELINE I-VECTOR AND PROPOSED SYSTEMS FOR NIST SRE 2010 CORE CONDITIONS 1–5 POOLED

Fig. 7. Performance comparison of baseline, AFA and fusion systems using DET curves. Evaluation is performed by pooling results of the core conditions 1–5 of NIST SRE 2010 extended trials. (i) Baseline i-vector system using Full Covariance UBM (Baseline full-cov), (ii) AFA i-vector system $(q=42)$, and (iii) Equal-weight linear fusion of systems (i) and (ii).

From the results, the fusion performance of systems (i) and (ii) clearly reveals that the AFA and baseline systems carry complementary information, since %EER and the DCF values improve. This is observed for both the telephone and pooled conditions. The best result is achieved by fusing systems (i)–(iii), which yields 16.52%, 14.47% and 14.09% relative improvement in %EER, ${\rm DCF}_{old}$ and ${\rm DCF}_{new}$, respectively, compared to the baseline system in condition-5. In the pooled condition, this fusion provides 13.75%, 14.0% and 11.80% relative improvement in %EER, ${\rm DCF}_{old}$ and ${\rm DCF}_{new}$, respectively. A performance comparison of systems (i), (ii) and their fusion for the pooled condition is shown in Fig. 7 using Detection Error Trade-off (DET) curves. Here, we again observe the superiority of the proposed AFA system over the baseline system, while the fusion of these systems consistently provides further improvement over the full DET range.

F. Computational Advantages

In our experiments, we observe that the TV matrix training process using the AFA transform is computationally less expensive than the conventional process. This is expected since the computational complexity of an i-vector system is proportional to the super-vector size $Md$ [39], which is reduced to $Mq$ for an AFA based system. Thus, the computational complexity of the proposed system is theoretically reduced by a factor of $1/\alpha$ $(0<\alpha<1)$ compared to the baseline system, that is, to a fraction $\alpha=q/d$ of the baseline cost.

SECTION VII

CONCLUSIONS

In this study, we have proposed an alternate modeling technique to address and compensate for transmission channel mismatch in speaker recognition. Motivated by the covariance structure of conventional acoustic features, we developed a factor analysis technique which operates within the acoustic feature domain utilizing a well trained UBM with full covariance matrices. We advocated that conventional super-vector domain factor analysis methods fail to take advantage of the observation that speech features reside in a lower dimensional manifold in the acoustic space. The proposed acoustic factor analysis scheme was utilized to develop a mixture-dependent feature transformation that performs dimensionality reduction, de-correlation, normalization and enhancement at the same time. Finally, the transformation was effectively integrated within a standard i-vector-PLDA based speaker recognition system using a probabilistic feature alignment technique. The superiority of the proposed method was demonstrated by experiments performed using the NIST SRE 2010 extended trials of five core conditions. Measurable improvements over two baseline systems were shown in terms of EER, min DCFs and DET curves.

Footnotes

This work was supported by AFRL under Contract FA8750-12-1-0188 (approved for public release, distribution unlimited), and in part by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen. The authors note that preliminary investigations on Acoustic Factor Analysis (AFA) were presented in [8] and [9]. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rodrigo Guido.

The authors are with the Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75252 USA (e-mail: taufiq.hasan@utdallas.edu; john.hansen@utdallas.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

1 More details on feature extraction and development data are given in Sections V-A and V-B, respectively.

2 Utterance dependent covariance matrices can also be extracted through MAP adaptation. However, we assume that each utterance GMM shares the common UBM covariance and weights.

Authors

Taufiq Hasan

Taufiq Hasan received his B.Sc. and M.Sc. degrees in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2006 and 2008, respectively. He was a Lecturer in the Electrical and Electronic Engineering Department at United International University, Dhaka, Bangladesh from December 2006 to June 2008. Since August 2008, he has been pursuing his Ph.D. degree as a Research Assistant in the Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas (UTD), Richardson, U.S.A. He is a member of the Center for Robust Speech Systems (CRSS) at UTD. His research interests are robust speaker recognition in noise and channel mismatch, speech enhancement and automatic video summarization.

John H. L. Hansen

John H. L. Hansen (S'81–M'82–SM'93–F'07) received the Ph.D. and M.S. degrees in Electrical Engineering from Georgia Institute of Technology, Atlanta, Georgia, in 1988 and 1983, and B.S.E.E. degree from Rutgers University, College of Engineering, New Brunswick, N.J. in 1982.

He joined University of Texas at Dallas (UTD), Erik Jonsson School of Engineering and Computer Science in the fall of 2005, where he is presently serving as Jonsson School Associate Dean for Research, as well as Professor of Electrical Engineering and also holds the Distinguished University Chair in Telecommunications Engineering. He previously served as Department Head of Electrical Engineering from 2005–12, overseeing a 5x increase in research expenditures with a 20% increase in enrollment and the addition of 18 T/TT faculty, growing UTDallas to be the 8th largest EE program from ASEE rankings. He also holds a joint appointment as Professor in the School of Behavioral and Brain Sciences (Speech & Hearing). At UTD, he established the Center for Robust Speech Systems (CRSS) which is part of the Human Language Technology Research Institute. Previously, he served as Department Chairman and Professor in the Dept. of Speech, Language and Hearing Sciences (SLHS), and Professor in the Dept. of Electrical & Computer Engineering, at Univ. of Colorado Boulder (1998–2005), where he co-founded the Center for Spoken Language Research. In 1988, he established the Robust Speech Processing Laboratory (RSPL) and continues to direct research activities in CRSS at UTD. In 2007, he was named IEEE Fellow for contributions in “Robust Speech Recognition in Stress and Noise,” and is currently serving as Member of the IEEE Signal Processing Society Speech Technical Committee (2005–08; 2010–13; elected Chair-elect in 2010), and Educational Technical Committee (2005–08; 2008–10). Previously, he has served as Technical Advisor to U.S. Delegate for NATO (IST/TG-01), IEEE Signal Processing Society Distinguished Lecturer (2005/06), Associate Editor for IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (1992–99), Associate Editor for IEEE SIGNAL PROCESSING LETTERS (1998–2000), Editorial Board Member for the IEEE Signal Processing Magazine (2001–03).
He has also served as guest editor of the Oct. 1994 special issue on Robust Speech Recognition for IEEE TRANSACTIONS ON SPEECH & AUDIO PROCESSING. He has served on the Speech Communications Technical Committee for the Acoustical Society of America (2000–03), and is serving as a member of the ISCA (Inter. Speech Communications Association) Advisory Council. In 2010, he was recognized as ISCA Fellow, for contributions on “research for speech signals under adverse conditions.” His research interests span the areas of digital speech processing, analysis and modeling of speech and speaker traits, speech enhancement, feature estimation in noise, robust speech recognition with emphasis on spoken document retrieval, and in-vehicle interactive systems for hands-free human-computer interaction. He has supervised 59 (27 PhD, 32 MS/MA) thesis candidates, was recipient of The 2005 University of Colorado Teacher Recognition Award as voted on by the student body, author/co-author of 433 journal and conference papers and 8 textbooks in the field of speech processing and language technology, coauthor of the textbook Discrete-Time Processing of Speech Signals, (IEEE Press, 2000), co-editor of DSP for In-Vehicle and Mobile Systems (Springer, 2004), Advances for In-Vehicle and Mobile Systems: Challenges for International Standards (Springer, 2006), In-Vehicle Corpus and Signal Processing for Driver Behavior (Springer, 2008), and lead author of the report “The Impact of Speech Under ‘Stress’ on Military Speech Technology,” (NATO RTO-TR-10, 2000). He also organized and served as General Chair for ICSLP/Interspeech-2002: International Conference on Spoken Language Processing, Sept. 16–20, 2002, and served as Co-Organizer and Technical Program Chair for IEEE ICASSP-2010, Dallas, TX.
