
SECTION I

MISMATCH between training and test conditions represents one of the most challenging problems facing speaker recognition researchers today. Considerable sources of mismatch can be present, including transmission channel differences [1], [2], handset variability [3], background noise [4], session variability due to physical stress [5], vocal effort such as whisper [11], [12], Lombard effect [13], non-stationary environments [10], and spontaneity of speech, to name but a few. Various compensation strategies have been proposed in the past to reduce unwanted variability between training and test utterances, while retaining the speaker identity information. The current trend in state-of-the-art speaker recognition systems is to model the acoustic features with a Gaussian Mixture Model-Universal Background Model (GMM-UBM), use utterance dependent adapted GMM [7] mean super-vectors [14] as the features representing the speech segments, and model the super-vectors using various *latent factor analysis* techniques [1], [6], [15]. In [16], the aim was to identify the lower dimensional speaker and channel dependent subspaces, termed Eigenvoice [15], [17] and Eigenchannel [1], in the super-vector domain. In [1], an alternative was considered where speaker and channel variabilities were jointly modeled. The recently proposed i-vector [6] scheme utilizes a factor analysis framework [15], [18] to perform dimensionality reduction on the super-vectors while retaining important speaker discriminant information. This lower dimensional i-vector representation enables the development of full Bayesian techniques [19], [20], using a single model to represent the speaker and channel variability.

One limitation of the conventional GMM super-vector domain representation and subsequent factor analysis modeling is that it does not take into account the fact that the original acoustic features contain redundancy. In general, the speech short-time spectrum is known to be representable in a lower dimensional subspace, which motivates a separate class of speech enhancement methods known as *signal subspace* approaches [21], [22]. Linear correlation among the speech spectral components is quite high, which justifies the success of these methods. This phenomenon is also valid for popular acoustic features, such as Mel-frequency Cepstral Coefficients (MFCC) [23], [24], even though these features are processed through the Discrete Cosine Transform (DCT) for de-correlation before use in training or test.

To motivate the proposed work, we first demonstrate that the conventional acoustic features can be constrained to reside in a lower dimensional subspace. For this purpose, we train a 1024 mixture full covariance GMM UBM using 60 dimensional MFCC features on a large background speech data set.^{1} For a typical mixture of this UBM, the covariance matrix and the distribution of its eigenvalues are shown in Fig. 1. From Fig. 1(a) it is clear that the full covariance matrix, which shows strong diagonal terms, has significant non-zero off-diagonal elements, indicating that the feature coefficients are not fully uncorrelated. Fig. 1(b) shows the sorted eigenvalues of the same covariance matrix, revealing that most of its energy is accounted for by the first few dimensions only. This shows that the acoustic feature space is actually lower dimensional and features can thus be further compacted or enhanced by using a *factor analysis* model. Also, it is known that the first few directions obtained by the Eigen-decomposition of acoustic feature covariance matrices are mostly speaker dependent (e.g. see Zhou and Hansen [25] for a quantitative analysis), while other directions are more phoneme dependent. In this study, considering these observations on the acoustic features, we aim to investigate a factor analysis scheme on acoustic features for speaker recognition. We term this method *acoustic factor analysis*.
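As a rough illustration of the energy compaction argument above, the following Python/NumPy sketch computes the sorted eigenvalues of a covariance matrix and the fraction of total variance captured by the leading directions. The covariance used here is synthetic (a strong 5-dimensional subspace plus a small isotropic floor); the function name `cumulative_eigen_energy` and all numerical values are assumptions made for illustration and are not taken from the UBM or the experiments of this study.

```python
import numpy as np

def cumulative_eigen_energy(cov):
    """Return eigenvalues in descending order and their cumulative energy fraction."""
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    return eigvals, np.cumsum(eigvals) / np.sum(eigvals)

# Synthetic stand-in for one mixture covariance (illustrative values only):
# a low-rank component of rank q_true plus a small isotropic floor.
rng = np.random.default_rng(0)
d, q_true = 60, 5
W = rng.standard_normal((d, q_true))
cov = W @ W.T + 0.01 * np.eye(d)

eigvals, energy = cumulative_eigen_energy(cov)
print(f"fraction of variance in the first {q_true} directions: {energy[q_true - 1]:.3f}")
```

For a real UBM mixture, `cov` would be replaced by the trained full covariance matrix of that mixture, and the shape of the `energy` curve would indicate how many directions carry most of the variance.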

Before proceeding with the formulation of the factor analysis scheme in the front-end features, we first defend the argument that the traditional factor analysis schemes do not take full advantage of the acoustic feature covariances. In a standard i-vector system, the GMM super-vectors are dimensionality reduced by a total factor analysis model, which is based on the idea that utterance super-vectors lie in a lower dimensional subspace. Let
${\bf m}_{s}$ denote a GMM super-vector extracted from an utterance
$s$, and
${\bf x}_{n}$ denote the acoustic feature vectors. For a randomly chosen utterance
$s$, it is generally assumed that
${\bf m}_{s}$ is normally distributed with mean
${\bf m}_{0}$ and covariance matrix
${\bf B}$ [15]. Here,
${\bf m}_{0}$ denotes the speaker independent mean vector obtained by concatenating the UBM mean vectors
${\bf m}_{0[g]}$. Let the UBM covariance matrices be
${\mmb\Sigma}_{g}$, where
$g$ denotes the mixture number. The main motivation of both Eigenvoice and total variability modeling is that the super-covariance matrix
${\bf B}$ contains zero eigenvalues and thus some dimensions of
${\bf m}_{s}$ can be disregarded. For the
$g$-th Gaussian mixture, the utterance dependent mean vector
${\bf m}_{s[g]}$ is estimated from the posterior mean of the acoustic features that belong to
$s$, that is
${\bf x}_{n}\in s$. This is a deterministic parameter. However, for a randomly selected utterance
$s$, the sub-vectors
${\bf m}_{s[g]}$ are normally distributed random vectors having covariance matrix
${\bf B}_{[g]}$, which is the
$g$-th sub-matrix of the super-covariance matrix
${\bf B}$. Clearly, the matrices
${\bf B}_{[g]}$ are not related to the feature covariance matrices
${\mmb\Sigma}_{g}$, since the former represents the covariance of the mean sub-vectors
${\bf m}_{s[g]}$ obtained from different utterances, while the latter represents the covariance of the acoustic features
${\bf x}_{n}$ which is independent of the utterance.^{2} Thus, assuming that the matrix
${\bf B}$ contains zero eigenvalues is not equivalent to assuming the same for the
${\mmb\Sigma}_{g}$ matrices. Though this reasoning is based on full covariance UBM models, similar arguments can be made for a diagonal covariance based system.

Given that the conventional acoustic features reside in a lower dimensional subspace, it is important now to ask the question how we can use this knowledge to effectively extract utterance level features. Since speaker dependent information is contained in the leading eigen-directions of the acoustic features [25], using all the feature coefficients for modeling channel degraded data will result in retaining some nuisance components along with speaker dependent information in the GMM super-vectors and i-vectors. Therefore, we propose a dimensionality reduction transformation of the acoustic features for each GMM mixture that emphasizes the speaker dependent information in the leading eigenvectors of the corresponding mixture covariance matrix, while suppressing some unwanted channel components. In this manner, the GMM super-vectors will be “enhanced” in the sense that they will be more speaker discriminative, while the subsequently extracted i-vectors will also inherit this quality.

Dimensionality reduction of the acoustic features for de-correlation/enhancement is not a new concept. Many techniques found in the literature perform this task, including DCT, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Heteroscedastic LDA (HLDA), to name but a few [16], [26], [27]. The main goal of this process has been to allow the features to be modeled using diagonal covariance matrix GMM/HMMs for speech/speaker recognition. These techniques can be classified into two main groups by their mode of operation: 1) the signal processing domain, and 2) the model domain. In the first scenario, some transformation (supervised/unsupervised) is used at the signal/acoustic feature level in order to achieve improved energy compaction. The most common technique is the application of the DCT to the log filter-bank energies [23], popularized by the MFCC representation. PCA can also be used [26] by learning the principal directions from the Eigen-decomposition of the covariance matrix trained on the utterance data itself. In general, this class of processing only depends on the speech data under consideration and does not use any outside knowledge. In the second scenario, raw acoustic features (e.g., filter-bank energies) are initially used to train a large model, which is then used to derive the feature transformations. One such technique used in speaker recognition is HLDA [16], where a GMM-UBM is first trained on the raw acoustic features. Each mixture is then assumed to represent a separate class, and an HLDA transformation is trained so that discrimination between these classes is maximized. In a similar fashion, PCA projections can also be used in each GMM mixture as a transformation [28]. In these methods, after the initial training phase, the acoustic features are aligned to the mixture component providing the highest posterior probability and the corresponding transformation is used for dimensionality reduction.

Both the signal processing domain and model domain feature dimensionality reduction techniques used previously share one common property: they re-generate the acoustic features after dimensionality reduction. This means that the subsequent stages of the speaker recognition system must be trained again, starting from these newly extracted features. Model domain dimensionality reduction has the extra inconvenience of mixture alignment. Speech features are known to be highly intertwined and overlapped in the vector space for different acoustic conditions and generally do not form meaningful clusters [29]. Thus, using the top posterior probability for aligning a feature vector to a single mixture may not be appropriate. To demonstrate this, we select MFCC feature vectors ${\bf x}_{n}$ from 10 development utterances that were used in the UBM training, and for each feature vector, we find the highest posterior probability among the 1024 mixtures of the UBM, $\max_{g}p(g\vert{\bf x}_{n})$. A histogram of these top mixture probabilities is shown in Fig. 2, which clearly demonstrates that only a few frames are unquestionably aligned to a specific Gaussian mixture (indicated by the high peak near $\max_{g}p(g\vert{\bf x}_{n})=1$). In actuality, a majority of the feature vectors are aligned with more than one mixture, resulting in a top mixture probability in the range of 0.3 to 0.8. Thus, using the top scoring mixture for hard alignment of feature vectors to a specific mixture can introduce inaccuracies and should be avoided if possible.
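The top-posterior analysis described above can be sketched as follows, assuming a small full covariance GMM in place of the 1024 mixture UBM. The three-component, two-dimensional mixture, the sample size, and the function name `gmm_posteriors` are hypothetical choices for illustration; only the quantity $\max_{g}p(g\vert{\bf x}_{n})$ and the histogram over it come from the text.

```python
import numpy as np

def gmm_posteriors(X, weights, means, covs):
    """Posterior p(g | x_n) for a full-covariance GMM; X has shape (N, d)."""
    N, d = X.shape
    log_p = np.empty((N, len(weights)))
    for g, (w, mu, S) in enumerate(zip(weights, means, covs)):
        L = np.linalg.cholesky(S)                       # S = L L^T
        sol = np.linalg.solve(L, (X - mu).T)            # shape (d, N)
        maha = np.sum(sol ** 2, axis=0)                 # Mahalanobis distances
        log_det = 2.0 * np.sum(np.log(np.diag(L)))
        log_p[:, g] = np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)
    log_p -= log_p.max(axis=1, keepdims=True)           # stabilize the exponent
    post = np.exp(log_p)
    return post / post.sum(axis=1, keepdims=True)

# Toy overlapping mixture (illustrative values, not a trained UBM).
rng = np.random.default_rng(1)
weights = np.array([0.4, 0.35, 0.25])
means = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0]])
covs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 0.3], [0.3, 0.8]],
                 [[0.7, -0.2], [-0.2, 1.2]]])

comp = rng.choice(3, size=500, p=weights)
X = np.stack([rng.multivariate_normal(means[g], covs[g]) for g in comp])

post = gmm_posteriors(X, weights, means, covs)
top = post.max(axis=1)                                  # max_g p(g | x_n)
hist, edges = np.histogram(top, bins=10, range=(0.0, 1.0))
print(hist)
```

With heavily overlapping components such as these, one would expect much of the histogram mass to lie well below 1, which is the situation in which hard alignment to a single mixture discards information.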

Historically, feature extraction, dimensionality reduction, enhancement and normalization have always been thought of as processes separate from acoustic modeling. In this study, we propose a new modeling scheme of the acoustic features that bridges the gap between these two processing domains through integrated feature dimensionality reduction and modeling. We demonstrate that the proposed method not only performs dimensionality reduction, it also removes the need for hard feature clustering to a specific mixture, and does not require retraining of the UBM from the new features, thereby incorporating a built-in feature normalization and enhancement scheme. All this is achieved using a single linear transformation derived from a pre-trained full covariance matrix UBM and applying this in a probabilistic fashion to the mixture dependent Baum-Welch statistics.

This paper is organized as follows. In Section II, we formulate the proposed Acoustic Factor Analysis (AFA) scheme and derive the mixture-dependent transformation matrices. Section III describes the various properties of the AFA transformation, including normalization and enhancement. In Section IV, we describe how the proposed scheme can be integrated within an i-vector system followed by our system description in Section V. Experimental results are presented in Section VI, and finally, Section VII concludes the study.

SECTION II

In this section, we describe the proposed factor analysis model of acoustic features, discuss its formulation and mixture-wise application for dimensionality reduction.

Let
${\cal X}=\{{\bf x}_{n}\vert n=1\cdots N\}$ be the collection of all acoustic feature vectors from the development set obtained from a large corpus of many speakers' recordings in diverse environment/channel conditions. Using a factor analysis model, the
$d\times 1$ dimensional feature vector
${\bf x}$ can be represented by,
$${\bf x}={\bf Wy}+\mu+\epsilon.\eqno{\hbox{(1)}}$$ Here,
${\bf W}$ is a
$d\times q$ low rank factor loading matrix that represents
$q<d$ bases spanning the subspace with important variability in the feature space, and
${\mmb\mu}$ is the
$d\times 1$ mean vector of
${\bf x}$. We denote the latent variable vector or latent factors
${\bf y}\sim{\cal N}({\bf 0},{\bf I})$, as *acoustic factors*, which is of dimension
$q\times 1$. We assume that the remaining noise component
$\epsilon\sim{\cal N}({\bf 0},\sigma^{2}{\bf I})$ is isotropic, and therefore the model is equivalent to probabilistic principal component analysis (PPCA) [18]. In this model, the feature vectors are also normally distributed such that,
${\bf x}\sim{\cal N}(\mu,\sigma^{2}{\bf I}+{\bf WW}^{T})$.
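As a numerical sanity check of the marginal distribution stated above, the sketch below draws samples from the generative model in (1) and compares their sample covariance with $\sigma^{2}{\bf I}+{\bf WW}^{T}$. The particular ${\bf W}$, mean vector, noise variance, and sample size are arbitrary illustrative values, and the helper name `sample_ppca` is hypothetical.

```python
import numpy as np

def sample_ppca(W, mu, sigma2, n, rng):
    """Draw n samples of x = W y + mu + eps, y ~ N(0, I), eps ~ N(0, sigma2 I)."""
    d, q = W.shape
    y = rng.standard_normal((n, q))                     # acoustic factors
    eps = np.sqrt(sigma2) * rng.standard_normal((n, d)) # isotropic noise
    return y @ W.T + mu + eps

rng = np.random.default_rng(2)
W = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 0.5],
              [1.0, -1.0]])                             # d = 4, q = 2 (toy sizes)
mu = np.array([0.0, 1.0, -1.0, 2.0])
sigma2 = 0.1

X = sample_ppca(W, mu, sigma2, 200_000, rng)
model_cov = sigma2 * np.eye(4) + W @ W.T                # covariance implied by the model
sample_cov = np.cov(X, rowvar=False)
max_abs_error = np.max(np.abs(sample_cov - model_cov))
print(f"largest covariance deviation: {max_abs_error:.4f}")
```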

The advantage of this model is that the *acoustic factors*
${\bf y}$, defining the weights of the factor loadings, explain the correlation between the feature coefficients
${\bf x}$, which we believe are more speaker dependent [25], while the noise component
$\epsilon$ incorporates the residual variance of the data. It should be emphasized that even though we denote the term
$\epsilon$ as “noise”, when used with cepstral features this term actually represents convolutional channel distortion [30]. A mixture of these models [18] can be used to incorporate the variations caused by different phonemes uttered by multiple speakers in distinct noisy/channel degraded conditions, given by,
$$p({\bf x})=\sum_{g}w_{g}p({\bf x}\vert g)\eqno{\hbox{(2)}}$$ where for the
$g$-th mixture,
$$p({\bf x}\vert g)={\cal N}\left({\mmb\mu}_{g},\sigma_{g}^{2}{\bf I}+{\bf W}_{g}{\bf W}_{g}^{T}\right).\eqno{\hbox{(3)}}$$ Here,
$\mu_{g}$,
$w_{g}$,
${\bf W}_{g}$ and
$\sigma_{g}^{2}$ represent the mean vector, mixture weight, factor loading matrix, and noise variance for the
$g$-th AFA model, respectively.

One advantage of using the mixture of PPCA for acoustic factor analysis is that its parameters can be conveniently extracted from a GMM trained using the Expectation-Maximization (EM) algorithm [18]. Thus, we utilize a full covariance UBM to derive the AFA model parameters. The proposed feature transformation and dimensionality reduction procedure is presented below:

A full covariance UBM model $\Lambda_{0}$ is trained on the development dataset ${\cal X}=\{{\bf x}_{n}\vert n=1\cdots N\}$, given by,
$$p({\bf x}\vert\Lambda_{0})=\sum_{g=1}^{M}w_{g}{\cal N}(\mu_{g},{\mmb\Sigma}_{g})\eqno{\hbox{(4)}}$$ where $w_{g}$ represents the mixture weights, $M$ is the total number of mixtures, $\mu_{g}$ are the mean vectors and ${\mmb\Sigma}_{g}$ are the full covariance matrices. The mean and weight parameters of the UBM will be identical to the mixture model of (2).

We need to set the value of $q$, which defines the number of principal axes we would like to select. In other words, we assume the lower $d-q$ dimensions of the features will actually represent the noise subspace [21]. Using this value of $q$, we find the noise variance for the $g$-th mixture as,
$$\sigma_{g}^{2}={1\over d-q}\sum_{i=q+1}^{d}\lambda_{g,i}\eqno{\hbox{(5)}}$$ where $\lambda_{g,q+1}\cdots\lambda_{g,d}$ are the smallest eigenvalues of the covariance matrix ${\mmb\Sigma}_{g}$. Thus, $\sigma_{g}^{2}$ is essentially the average variance lost per discarded dimension. It may be noted that the model allows the use of different values of $q$ for each mixture. This has been investigated in [9], and we elaborate on this issue in greater detail in Section IV.

The maximum likelihood estimate of the factor loading matrix ${\bf W}_{g}$ of the $g$-th mixture of the AFA model in (2) is given by,
$${\bf W}_{g}={\bf U}_{{\bf q}_{g}}\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{1\over 2}{\bf R}_{g}\eqno{\hbox{(6)}}$$ where ${\bf U}_{{\bf q}_{g}}$ is a $d\times q$ matrix whose columns are the $q$ leading eigenvectors of ${\mmb\Sigma}_{g}$, ${\mmb\Lambda}_{{\bf q}_{g}}$ is a diagonal matrix containing the corresponding $q$ eigenvalues, and ${\bf R}_{g}$ is a $q\times q$ arbitrary orthogonal rotation matrix. In this work, we set ${\bf R}_{g}={\bf I}$.
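The parameter extraction steps above can be sketched for a single mixture as follows, assuming its full covariance matrix is available as a NumPy array. The function name `afa_parameters`, the random test covariance, and the choice $q=3$ are hypothetical; the computations follow (5) and (6) with ${\bf R}_{g}={\bf I}$.

```python
import numpy as np

def afa_parameters(Sigma, q):
    """Noise variance (5) and factor loading matrix (6), with R_g = I, for one mixture."""
    d = Sigma.shape[0]
    if not 0 < q < d:
        raise ValueError("q must satisfy 0 < q < d")
    eigvals, eigvecs = np.linalg.eigh(Sigma)            # ascending eigenvalues
    lam, U = eigvals[::-1], eigvecs[:, ::-1]            # reorder to descending
    sigma2 = lam[q:].mean()                             # (5): mean of the d-q smallest
    U_q, lam_q = U[:, :q], lam[:q]
    # (6): column i of U_q is scaled by sqrt(lambda_i - sigma2)
    W = U_q * np.sqrt(np.maximum(lam_q - sigma2, 0.0))
    return sigma2, W, U_q, lam_q

# Random symmetric positive definite matrix standing in for Sigma_g.
rng = np.random.default_rng(4)
B = rng.standard_normal((8, 8))
Sigma = B @ B.T + 0.05 * np.eye(8)

sigma2, W, U_q, lam_q = afa_parameters(Sigma, q=3)
model_cov = W @ W.T + sigma2 * np.eye(8)                # covariance used in (3)
print(sigma2, W.shape)
```

By construction, `model_cov` preserves the $q$ leading eigenvalues of `Sigma` and replaces the remaining ones by their average, so the total variance (trace) is unchanged.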

The posterior mean of the *acoustic factors*
${\bf y}_{n}$ can be used as the transformed and dimensionality reduced version of
${\bf x}_{n}$ for the
$g$-th component of the AFA model. This can be shown to be
$$E\{{\bf y}_{n}\vert{\bf x}_{n},g\}=\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle={\bf A}_{g}^{T}({\bf x}_{n}-{\mmb\mu}_{g})\buildrel{\Delta}\over{=}{\bf z}_{n,g}\eqno{\hbox{(7)}}$$ where
$$\eqalignno{{\bf A}_{g}=&\,{\bf W}_{g}{\bf M}_{g}^{-T}\ {\rm and}&\hbox{(8)}\cr{\bf M}_{g}=&\,\sigma_{g}^{2}{\bf I}+{\bf W}_{g}^{T}{\bf W}_{g}.&\hbox{(9)}}$$ We term the matrix
${\bf A}_{g}$ as the
$g$-th *AFA transform*. In this operation, we are essentially replacing the original feature vectors
${\bf x}_{n}$ by the mixture dependent transformed acoustic feature
${\bf z}_{n,g}$. Each feature vector
${\bf x}_{n}$ can be transformed by
${\bf A}_{g}$, corresponding to the mixture component it is aligned with, and a new set of features can then be obtained. However, as noted earlier, we will not regenerate the acoustic features and instead use a probabilistic soft-alignment in our system. This is described in Section IV where we discuss the integration of AFA within an i-vector system.
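For concreteness, the AFA transform of (7)-(9) can be sketched for one mixture as below. This is only a minimal illustration of the per-mixture mapping ${\bf z}_{n,g}={\bf A}_{g}^{T}({\bf x}_{n}-{\mmb\mu}_{g})$ applied to a batch of frames; the function names, the random covariance, and the toy dimensions are assumptions, and the sketch does not perform any mixture alignment.

```python
import numpy as np

def afa_transform(Sigma, q):
    """Build the AFA transform A_g of (8) from a mixture covariance, with R_g = I."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    lam, U = eigvals[::-1], eigvecs[:, ::-1]            # descending order
    sigma2 = lam[q:].mean()                             # noise variance, (5)
    W = U[:, :q] * np.sqrt(np.maximum(lam[:q] - sigma2, 0.0))   # loading matrix, (6)
    M = sigma2 * np.eye(q) + W.T @ W                    # (9)
    A = W @ np.linalg.inv(M).T                          # (8): A = W M^{-T}
    return A, sigma2, lam[:q], U[:, :q]

def transform_features(X, mu, A):
    """Row-wise z = A^T (x - mu); X has shape (N, d), result has shape (N, q)."""
    return (X - mu) @ A

rng = np.random.default_rng(3)
d, q, N = 6, 3, 20
B = rng.standard_normal((d, d))
Sigma = B @ B.T + 0.1 * np.eye(d)
mu = rng.standard_normal(d)
X = rng.multivariate_normal(mu, Sigma, size=N)

A, sigma2, lam_q, U_q = afa_transform(Sigma, q)
Z = transform_features(X, mu, A)
print(Z.shape)
```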

SECTION III

In this section, we discuss the general properties and advantages of the proposed acoustic feature model, the resulting transformation and the transformed features.

Here, we derive the probability distribution of the transformed acoustic features and show how AFA performs feature de-correlation. Let
${\bf z}_{n,g}=\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle$ indicate the AFA transformed feature vector for the
$g$-th mixture. We have the following mean vector of
${\bf z}_{n,g}$,
$$\eqalignno{\mu_{{\bf z}_{g}}=&\,E\left\{\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle\right\}\cr=&\,E\left\{{\bf A}_{g}^{T}({\bf x}_{n}-\mu_{g})\right\}={\bf 0}&\hbox{(10)}}$$ and its corresponding covariance matrix,
$$\eqalignno{{\mmb\Sigma}_{{\bf z}_{\bf g}}=&\,E\left\{{\bf z}_{n,g}{\bf z}_{n,g}^{T}\right\}-\mu_{{\bf z}_{g}}\mu_{{\bf z}_{g}}^{T}\cr=&\,{\bf A}_{g}^{T}E\left\{({\bf x}_{n}-\mu_{g})({\bf x}_{n}-\mu_{g})^{T}\right\}{\bf A}_{g}\cr=&\,{\bf A}_{g}^{T}{\mmb\Sigma}_{g}{\bf A}_{g}.&\hbox{(11)}}$$ For further simplification, we first substitute the value of
${\bf W}_{g}$ from (6) into (9) and use
${\bf R}_{g}={\bf I}$ to obtain,
$$\eqalignno{{\bf M}_{g}=&\,\sigma_{g}^{2}{\bf I}+{\bf W}_{g}^{T}{\bf W}_{g}\cr=&\,\sigma_{g}^{2}{\bf I}+\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T\over 2}{\bf U}_{{\bf q}_{g}}^{T}{\bf U}_{{\bf q}_{g}}\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{1\over 2}\cr=&\,{\mmb\Lambda}_{{\bf q}_{g}}.&\hbox{(12)}}$$ Next, substituting the values of
${\bf W}_{g}$ and
${\bf M}_{g}$ from (6) and (12) into (8) we have,
$${\bf A}_{g}^{T}={\mmb\Lambda}_{{\bf q}_{g}}^{-1}\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T\over 2}{\bf U}_{{\bf q}_{g}}^{T}.\eqno{\hbox{(13)}}$$ Using this expression of
${\bf A}_{g}^{T}$ in (11) we obtain,
$$\eqalignno{{\mmb\Sigma}_{{\bf z}_{\bf g}}=&\,{\mmb\Lambda}_{{\bf q}_{g}}^{-1}\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T\over 2}{\mmb\Lambda}_{{\bf q}_{g}}\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{1\over 2}{\mmb\Lambda}_{{\bf q}_{g}}^{-T}\cr=&\,\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right){\mmb\Lambda}_{{\bf q}_{g}}^{-T}\cr=&\,{\bf I}-\sigma_{g}^{2}{\mmb\Lambda}_{{\bf q}_{g}}^{-1}.&\hbox{(14)}}$$ Here, we utilize the expression
${\bf U}_{{\bf q}_{g}}^{T}{\mmb\Sigma}_{g}{\bf U}_{{\bf q}_{g}}={\mmb\Lambda}_{{\bf q}_{g}}$ and take advantage of the diagonal system. Thus, we show that for a given mixture alignment
$g$, the posterior mean of the *acoustic factors*, or the transformed feature vectors
${\bf z}_{n,g}$ follow a Gaussian distribution with zero mean and a diagonal covariance matrix given by
${\bf I}-\sigma_{g}^{2}{\mmb\Lambda}_{{\bf q}_{g}}^{-1}$. Thus, the AFA transformation de-correlates the mean normalized acoustic features in each mixture.
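The de-correlation result in (14) can be checked numerically with a few lines of NumPy. The sketch below builds ${\bf A}_{g}^{T}$ directly from the closed form in (13) for a randomly generated covariance matrix (an arbitrary stand-in for ${\mmb\Sigma}_{g}$, with toy dimensions) and compares ${\bf A}_{g}^{T}{\mmb\Sigma}_{g}{\bf A}_{g}$ with ${\bf I}-\sigma_{g}^{2}{\mmb\Lambda}_{{\bf q}_{g}}^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, q = 6, 3                                             # toy dimensions
B = rng.standard_normal((d, d))
Sigma = B @ B.T + 0.1 * np.eye(d)                       # stand-in for Sigma_g

eigvals, eigvecs = np.linalg.eigh(Sigma)
lam, U = eigvals[::-1], eigvecs[:, ::-1]                # descending order
lam_q, U_q = lam[:q], U[:, :q]
sigma2 = lam[q:].mean()                                 # (5)

A_T = np.diag(np.sqrt(lam_q - sigma2) / lam_q) @ U_q.T  # closed form (13)
Sigma_z = A_T @ Sigma @ A_T.T                           # (11)
expected = np.eye(q) - sigma2 * np.diag(1.0 / lam_q)    # (14)

off_diag = Sigma_z - np.diag(np.diag(Sigma_z))
print(np.max(np.abs(Sigma_z - expected)), np.max(np.abs(off_diag)))
```

Both printed deviations should be at the level of floating point round-off, reflecting that the transformed features are uncorrelated within the mixture.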

In the
$g$-th mixture, the AFA transformation matrix
${\bf A}_{g}^{T}$ expression given in (13) can be expressed as:
$$\eqalignno{{\bf A}_{g}^{T}=&\,{\mmb\Lambda}_{{\bf q}_{g}}^{-1}\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T\over 2}{\bf U}_{{\bf q}_{g}}^{T}\cr=&\,{\mmb\Lambda}_{{\bf q}_{g}}^{-{1\over 2}}{\bf G}_{g}{\bf U}_{{\bf q}_{g}}^{T}&\hbox{(15)}}$$ where we introduced a diagonal gain matrix given by:
$${\bf G}_{g}={\mmb\Lambda}_{{\bf q}_{g}}^{-{1\over 2}}\left({\mmb\Lambda}_{{\bf q}_{g}}-\sigma_{g}^{2}{\bf I}\right)^{T\over 2}.\eqno{\hbox{(16)}}$$ The
$i$-th diagonal entry of
${\bf G}_{g}$ is given by,
$$G_{g}(i)=\sqrt{\left(\lambda_{g,i}-\sigma_{g}^{2}\right)\over\lambda_{g,i}}.\eqno{\hbox{(17)}}$$ Keeping aside the term
${\mmb\Lambda}_{{\bf q}_{g}}^{-(1/2)}$ in (15), we observe that the transformation operation performed by
${\bf A}_{g}^{T}$ in (7) first computes the inner product of the mean normalized acoustic feature with the
$q$ principal eigenvectors of
${\mmb\Sigma}_{\bf g}$, then for each
$i$-th eigenvector direction applies the gain function defined by
$G_{g}(i)$. The second term in (17) can be identified as a square-root Wiener gain function [31]. This becomes clearer if we define the classic speech enhancement terminology *a priori* SNR
$\xi$ as [21], [32],
$$\xi={\lambda_{g,i}-\sigma_{g}^{2}\over\sigma_{g}^{2}}\eqno{\hbox{(18)}}$$ and use this to express the gain equations. The Wiener gain
$G_{\rm W}$ and the square-root Wiener gain
$G_{\sqrt{\rm W}}$ are given by:
$$G_{\rm W}={\xi\over\xi+1}\ {\rm and}\ G_{\sqrt{\rm W}}=\left({\xi\over\xi+1}\right)^{1\over 2}.\eqno{\hbox{(19)}}$$ Wiener and square-root Wiener gain functions are plotted against
$\xi$ in Fig. 3. As discussed in [31] (page 179, Sec. 6.6.3), in the case of additive noise, the square-root Wiener filter is applied when the power spectra, rather than the magnitude spectra, of the filtered signal and the clean signal are desired to be equal. The operation performed by the AFA transformation in (15) can be interpreted as a gain function operating on a transformed space defined by the
$i$-th eigenvector to obtain a clean eigenvalue
$\lambda_{g,i}-\sigma_{g}^{2}$ from the noisy eigenvalue
$\lambda_{g,i}$ [9]. Since the eigenvalues can be interpreted as a power spectrum obtained from the principal components [33], it is understandable why
$G_{\sqrt{\rm W}}$ arises in this scenario instead of
$G_{\rm W}$. Because of this square-root operation on the gain function, the square-root Wiener filter shows lower attenuation than the standard Wiener filter, as depicted in Fig. 3. It may be noted that conventional factor analysis techniques in the super-vector space can also be interpreted using similar Wiener-like gain functions, as discussed in [34].
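The relation between the AFA gain in (17) and the gain functions in (19) can be verified with a short sketch. The eigenvalues and noise variance below are arbitrary illustrative numbers, and the function names are hypothetical; the formulas are those of (17)-(19).

```python
import numpy as np

def a_priori_snr(lam, sigma2):
    """A priori SNR of (18) for eigenvalue(s) lam and noise variance sigma2."""
    return (lam - sigma2) / sigma2

def wiener_gain(xi):
    """Standard Wiener gain of (19)."""
    return xi / (xi + 1.0)

def sqrt_wiener_gain(xi):
    """Square-root Wiener gain of (19)."""
    return np.sqrt(xi / (xi + 1.0))

def afa_gain(lam, sigma2):
    """Diagonal entries of the AFA gain matrix, as in (17)."""
    return np.sqrt((lam - sigma2) / lam)

lam = np.array([4.0, 2.0, 1.0])     # illustrative leading eigenvalues
sigma2 = 0.5                        # illustrative noise variance

xi = a_priori_snr(lam, sigma2)
print("xi               :", xi)
print("Wiener gain      :", wiener_gain(xi))
print("sqrt Wiener gain :", sqrt_wiener_gain(xi))
print("AFA gain (17)    :", afa_gain(lam, sigma2))
```

Since $\xi/(\xi+1)=(\lambda_{g,i}-\sigma_{g}^{2})/\lambda_{g,i}$, the last two printed rows coincide, and each square-root gain is at least as large as the corresponding Wiener gain because both lie between 0 and 1.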

In the signal subspace speech enhancement method [21], a similar gain function is obtained by starting from the same model in (1), except for the standard normal assumption on the latent factors ${\bf y}$. In that work, the term ${\bf Wy}+\mu\buildrel{\Delta}\over{=}{\bf a}$ in (1) was interpreted as the “clean signal”, ${\bf x}$ as the noisy signal and $\epsilon$ as the additive noise. The goal was to find an estimate of the clean signal $\mathhat{\bf a}$ by finding the posterior mean of ${\bf a}$ given the noisy signal ${\bf x}$ and noise variance. However, in the AFA scheme, the goal is to estimate the posterior mean of the latent factors ${\bf y}$ for an “enhanced” and more compact version of the “noisy” (channel degraded) acoustic features ${\bf x}$ [18]. This difference between the two approaches yields two different optimization criteria and their resulting gain functions.

Another contrast between the speech enhancement schemes and the AFA transformation is the interpretation of noise. In conventional speech enhancement methods, the noise statistics are estimated from silence regions between speech segments [35], and thus for the signal subspace based method, the noise variance $\sigma_{g}^{2}$ is assumed to be known in the model (1). In our case, the noise we are attempting to remove or compensate for is actually an additive distortion in the cepstral domain, which will not exist in the silence regions. In addition, even if the silence segments were modeled in the UBM, it is very unlikely that the mixture components modeling the silences would be useful in determining the noise level in other components. Thus, even though the AFA dimension $q$ is related to the noise variance, we resort to setting the value of $q$ arbitrarily and computing the corresponding noise variance for each mixture using (5).

Going back to (15), the term ${\mmb\Lambda}_{{\bf q}_{g}}^{-(1/2)}$ normalizes the variance of the acoustic feature stream in the $i$-th eigen-direction, since $\lambda_{g,i}$ is the expected feature variance along this direction [36]. This means that the AFA transformation assumes that the features closely aligned with the $g$-th mixture originate from the same random process, and performs this normalization in addition to the enhancement mentioned in the previous section. This process is interestingly similar to the cepstral variance normalization frequently performed in the front-end. However, feature domain processing considers the temporal movement of the features in performing these normalizations assuming that the feature streams are independent, while AFA groups the features together in a mixture irrespective of their time location and performs the normalization in an orthogonal axis derived from the corresponding mixture covariance matrix. It would be interesting to see how AFA systems perform if the feature domain normalizations are removed from the front-end. Recent studies [37] show that in the full-covariance UBM based i-vector scheme, a very basic scale normalization technique outperforms Cepstral Mean and Variance Normalization (CMVN) and feature Gaussianization [38]. This may be due to the assumption of uncorrelated feature coefficients that is inherent in these normalization schemes. We have yet to perform experiments on comparative feature normalization schemes using AFA and suggest this as future work.

SECTION IV

In this section, we describe how the proposed method can be incorporated into a conventional i-vector system [6].

First, a full covariance UBM model,
$\Lambda_{0}$ given by (4), is trained on the development data vectors. Next, the AFA dimension
$q$ is set, which defines the number of principal axes to retain from each mixture component. Using the value of
$q$, we find the noise variance for the
$g$-th mixture using (5). The factor loading matrix
${\bf W}_{g}$ and transformation matrix
${\bf A}_{g}$ are then calculated using (6) and (8), respectively. After applying the transformation as in (7), the posterior means of the *acoustic factors*
${\bf z}_{n,g}=\langle{\bf y}_{n}\vert{\bf x}_{n},g\rangle$ are used as mixture dependent transformed acoustic features.

Following the discussion from Section III-A, and using (10) and (14), the AFA transformation would require a new transformed UBM $\mathhat{\Lambda}_{0}$ that models ${\bf z}_{n,g}$ instead of ${\bf x}_{n}$, such that,
$$p({\bf z}\vert\mathhat{\Lambda}_{0})=\sum_{g=1}^{M}w_{g}{\cal N}({\bf 0},\mathhat{\mmb\Sigma}_{g})\eqno{\hbox{(20)}}$$ where $\mathhat{\mmb\Sigma}_{g}={\bf I}-\sigma_{g}^{2}{\mmb\Lambda}_{{\bf q}_{g}}^{-1}={\mmb\Sigma}_{{\bf z}_{\bf g}}$. This UBM is not an actual acoustic model used to calculate the posterior probabilities or other statistics. Equation (20) simply indicates how the UBM parameters should be modified/replaced compared to the original UBM $\Lambda_{0}$ given in (4). This transformation only affects the hyper-parameter estimation.

In this step, the zero and first order Baum-Welch statistics are extracted from each feature vector with respect to the UBM. Using the AFA transformed features, extraction of the statistics can be accomplished as follows. The probabilistic alignment of feature ${\bf x}_{n}$ with the $g$-th mixture is given by:
$$\gamma_{g}(n)=p(g\vert{\bf x}_{n})={p({\bf x}_{n}\vert g)w_{g}\over p({\bf x}_{n})}.\eqno{\hbox{(21)}}$$ For an utterance $s$, the zero order statistics are extracted as:
$$N_{s}(g)=\sum_{n\in s}\gamma_{g}(n),\eqno{\hbox{(22)}}$$ which follows the standard procedure [6], [15]. Conventionally, the first order statistics are extracted as:
$${\bf F}_{s}(g)=\sum_{n\in s}\gamma_{g}(n){\bf x}_{n}.$$ However, with the present AFA transform, the first order statistics $\mathhat{\bf F}_{s}(g)$ are extracted using the transformed features in the corresponding mixtures instead of the original features:
$$\eqalign{\mathhat{\bf F}_{s}(g)=&\,\sum_{n\in s}\gamma_{g}(n){\bf z}_{n,g}=\sum_{n\in s}\gamma_{g}(n){\bf A}_{g}^{T}({\bf x}_{n}-\mu_{g})\cr=&\,{\bf A}_{g}^{T}\left[{\bf F}_{s}(g)-N_{s}(g)\mu_{g}\right]={\bf A}_{g}^{T}\bar{\bf F}_{s}(g)}$$ where $\bar{\bf F}_{s}(g)$ is the centralized first order statistics [20]. This transformation of statistics is somewhat similar to the approach in [39], where it was done to normalize the UBM parameters to zero means and identity covariance matrices. However, in [39] the goal was to simplify the i-vector system algorithm, theoretically preserving the procedure results with added computational benefits; whereas in this work, we are performing feature transformation and dimensionality reduction for possible improvement of the i-vector system performance.
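The statistics extraction above can be sketched as follows, assuming the frame posteriors $\gamma_{g}(n)$ and the per-mixture AFA transforms have already been computed. All inputs in the example are random placeholders (not a trained UBM), and the function name `afa_statistics` is hypothetical. The sketch exploits the last equality of the derivation: the transform is applied once per mixture to the centralized statistics rather than once per frame.

```python
import numpy as np

def afa_statistics(X, post, means, A_list):
    """Zero order and AFA-transformed first order statistics for one utterance.

    X:      (N, d) acoustic features
    post:   (N, M) posteriors gamma_g(n), each row summing to one
    means:  (M, d) UBM mean vectors
    A_list: list of M AFA transforms, each of shape (d, q)
    """
    N_s = post.sum(axis=0)                              # (22), shape (M,)
    F_s = post.T @ X                                    # conventional first order, (M, d)
    F_bar = F_s - N_s[:, None] * means                  # centralized first order
    F_hat = np.stack([A.T @ f for A, f in zip(A_list, F_bar)])  # (M, q)
    return N_s, F_hat

# Random placeholders with toy sizes.
rng = np.random.default_rng(5)
N, d, q, M = 50, 4, 2, 3
X = rng.standard_normal((N, d))
post = rng.dirichlet(np.ones(M), size=N)
means = rng.standard_normal((M, d))
A_list = [rng.standard_normal((d, q)) for _ in range(M)]

N_s, F_hat = afa_statistics(X, post, means, A_list)
print(N_s.shape, F_hat.shape)
```

Because every frame contributes to every mixture through its posterior weight, no hard assignment of frames to mixtures is needed, and the original features are never regenerated.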

Training of the Total Variability (TV) matrix
${\bf T}$ for the i-vector system follows a procedure very similar to that discussed in [6]. In this system, the super-vector of an utterance
$s$ is expressed as,
$${\bf m}_{s}={\bf m}_{0}+{\bf Tw}_{s}\eqno{\hbox{(23)}}$$ where the
$Md$ dimensional vector
${\bf m}_{0}$ denotes the speaker independent mean super-vector (i.e., concatenation of the UBM means
$\mu_{g}={\bf m}_{0[g]}$),
${\bf T}$ is an
$Md\times R$ low rank matrix
$(R<Md)$ whose columns span the total variability space, and
${\bf w}_{s}$ is a normally distributed random vector of size
$R$, known as the *total factors*. The posterior mean vector of
${\bf w}_{s}$ given the utterance data is known as an i-vector.

Depending on the AFA parameter $q$, the size of the matrix ${\bf T}$ needs to be defined. In the AFA based i-vector system, the super-vector dimension becomes $K=Mq$ instead of $Md$. Thus, the ${\bf T}$ matrix size needs to be set to $K\times R$, and randomly initialized. We define a parameter, super-vector compression (SVC) ratio $\alpha=K/Md=q/d$, measuring compaction obtained through AFA transformation.
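As a small worked example of the bookkeeping above, the sketch below computes the AFA super-vector dimension $K$, the SVC ratio $\alpha$, and the resulting ${\bf T}$ matrix shape. The values $M=1024$, $d=60$ and $R=400$ match the configuration reported elsewhere in this paper, whereas $q=42$ is a hypothetical choice used purely for illustration, as is the function name `svc_ratio`.

```python
def svc_ratio(num_mixtures, feat_dim, q):
    """Return (K, alpha) with K = M*q and alpha = K / (M*d) = q / d."""
    if not 0 < q <= feat_dim:
        raise ValueError("q must satisfy 0 < q <= d")
    K = num_mixtures * q
    return K, K / (num_mixtures * feat_dim)

M, d, R = 1024, 60, 400     # mixtures, feature dimension, total factors
q = 42                      # hypothetical AFA dimension (assumption)

K, alpha = svc_ratio(M, d, q)
T_shape = (K, R)            # size of the total variability matrix
print(K, alpha, T_shape)
```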

For each utterance $s\in{\cal S}$, the $R\times R$ precision matrix ${\bf L}_{s}$ and the $R\times 1$ vector ${\bf B}_{s}$ are estimated as [40]:
$$\eqalignno{{\bf L}_{s}=&\,{\bf I}+\sum_{g=1}^{M}N_{s}(g){\bf T}_{[g]}^{T}\mathhat{\mmb\Sigma}_{g}^{-1}{\bf T}_{[g]}\ {\rm and}&\hbox{(24)}\cr{\bf B}_{s}=&\,\sum_{g=1}^{M}N_{s}(g){\bf T}_{[g]}^{T}\mathhat{\mmb\Sigma}_{g}^{-1}\mathhat{\bf F}_{s}(g)&\hbox{(25)}}$$ respectively, where ${\bf T}_{[g]}$ is the $g$-th sub-matrix of ${\bf T}$ of dimension $q\times R$, and $\mathhat{\mmb\Sigma}_{g}$ is the $q\times q$ AFA transformed UBM covariance matrix. The total factors for the utterance $s$ are estimated as:
$${\bf w}_{s}={\bf L}_{s}^{-1}{\bf B}_{s}.\eqno{\hbox{(26)}}$$ In each iteration, the $g$-th block of the ${\bf T}$ matrix is updated using the following equation:
$${\bf T}_{[g]}=\sum_{s\in{\cal S}}\mathhat{\bf F}_{s}(g){\bf w}_{s}^{T}\left[\sum_{s\in{\cal S}}\left({\bf L}_{s}^{-1}+{\bf w}_{s}{\bf w}_{s}^{T}\right)N_{s}(g)\right]^{-1}\eqno{\hbox{(27)}}$$ which follows the same procedure as a conventional i-vector system [6], [40].

SECTION V

We perform our experiments on the male trials of the NIST SRE 2010 telephone and microphone conditions (core conditions 1–5, extended trials). A standard i-vector system [6] with a Gaussian Probabilistic Linear Discriminant Analysis (PLDA) [41] back-end is used for evaluation. Specific blocks of the baseline system implementation and details of the proposed scheme are described below. An overall block diagram of the proposed system is included in Fig. 4.

In order to remove the silence frames, an independent Hungarian phoneme recognizer [42] combined with an energy based voice activity detection (VAD) scheme is used. A 60-dimensional feature vector $(19\ {\rm MFCC}+{\rm Energy}+\Delta+\Delta\Delta)$ is extracted using a 25 ms analysis window with subsequent 10 ms shifts, and then Gaussianized utilizing a 3-s sliding window [38].
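The sliding-window Gaussianization step can be sketched as below. This is a minimal illustration assuming a rank-based feature warping formulation, which may differ in detail from the procedure in [38]; the function name and the choice of 301 frames (roughly 3 s at a 10 ms shift, made odd so the window is centered) are assumptions.

```python
import numpy as np
from statistics import NormalDist


def gaussianize(feats, win=301):
    """Warp each feature dimension to a standard normal over a sliding window.

    feats: (n_frames, dim) array. Returns an array of the same shape.
    """
    feats = np.asarray(feats, dtype=float)
    n = feats.shape[0]
    half = win // 2
    inv_cdf = NormalDist().inv_cdf
    out = np.empty_like(feats)
    for t in range(n):
        # Window centered on frame t, truncated at the utterance edges.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = feats[lo:hi]
        # Ascending rank (1 = smallest) of the current frame per dimension.
        rank = (window < feats[t]).sum(axis=0) + 1
        # Map the rank to a probability strictly inside (0, 1).
        p = (rank - 0.5) / window.shape[0]
        out[t] = [inv_cdf(x) for x in p]
    return out
```

The per-frame loop is written for clarity rather than speed.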

Gender dependent UBMs having full and diagonal-covariance matrices with 1024 mixtures are trained on telephone utterances selected from the Switchboard II Phase 2 and 3, Switchboard Cellular Part 1 and 2, and the NIST 2004, 2005, 2006 SRE enrollment data. We use the HTK toolkit for training with 15 iterations per mixture split. The UBM full covariance values were floored to $10^{-5}$ using the $-v$ option in HTK HERest toolkit [43].

For the TV matrix training, the UBM training dataset is utilized. Five iterations are used for the EM training. We use 400 total factors (i.e., our i-vector size was 400). All i-vectors are first whitened and then length normalized using radial Gaussianization [41].
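The i-vector post-processing described above (whitening followed by length normalization) can be sketched as follows. The function names are hypothetical, and the whitening transform is assumed to be estimated from a development set with more i-vectors than dimensions, so that the sample covariance is full rank.

```python
import numpy as np


def fit_whitener(X):
    """Estimate the mean and a whitening matrix from development i-vectors.

    X: (n, R) array with n > R.
    """
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    # W = U diag(1 / sqrt(lambda)): each eigenvector column is rescaled.
    W = evecs / np.sqrt(evals)
    return mu, W


def whiten_and_length_normalize(X, mu, W):
    """Center and whiten i-vectors, then project them onto the unit sphere."""
    Y = (X - mu) @ W
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)
```

In use, `fit_whitener` would be called once on the development i-vectors and the resulting `mu` and `W` applied to all training, enrollment and test i-vectors.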

A Gaussian probabilistic linear discriminant analysis (PLDA) model with a full-covariance noise process is used for session variability compensation and scoring [41]. In this generative model, an $R$ dimensional i-vector ${\bf w}_{s}$ extracted from a speech utterance $s$ is expressed as: $${\bf w}_{s}={\bf w}_{0}+{\mmb{\Phi\beta}}+{\bf n}\eqno{\hbox{(28)}}$$ where ${\bf w}_{0}$ is an $R\times 1$ speaker independent mean vector, ${\mmb\Phi}$ is the $R\times N_{EV}$ rectangular matrix representing a basis for the speaker-specific subspace (the eigenvoices), ${\mmb\beta}$ is an $N_{EV}\times 1$ latent vector having a standard normal distribution, and ${\bf n}$ is the $R\times 1$ random vector representing the full covariance residual noise. The only model parameter to be set here is the number of eigenvoices $N_{EV}$, that is, the number of columns in the matrix ${\mmb\Phi}$. I-vectors extracted from the UBM training dataset, together with additional microphone data selected from SRE 2004 and 2005, are utilized to train this PLDA model.
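To make the role of (28) in scoring concrete, the sketch below computes a verification log-likelihood ratio for a pair of i-vectors under this model. It assumes the standard same-speaker versus different-speaker hypothesis test, in which two i-vectors from the same speaker share the latent vector ${\mmb\beta}$; the function names are hypothetical, and the direct evaluation of the joint Gaussian is chosen for readability rather than efficiency.

```python
import numpy as np


def gauss_logpdf(x, cov):
    """Log density of a zero-mean Gaussian with covariance cov at x."""
    _, logdet = np.linalg.slogdet(cov)
    maha = x @ np.linalg.solve(cov, x)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)


def plda_llr(w1, w2, w0, Phi, Sigma_n):
    """Log-likelihood ratio: same speaker versus different speakers.

    w1, w2:  (R,) i-vectors;  w0: (R,) global mean
    Phi:     (R, N_EV) eigenvoice matrix
    Sigma_n: (R, R) full covariance of the residual noise
    """
    across = Phi @ Phi.T          # covariance shared through beta
    total = across + Sigma_n      # covariance of a single i-vector
    x = np.concatenate([w1 - w0, w2 - w0])
    zeros = np.zeros_like(across)
    same = np.block([[total, across], [across, total]])
    diff = np.block([[total, zeros], [zeros, total]])
    return gauss_logpdf(x, same) - gauss_logpdf(x, diff)
```

A larger score favors the same-speaker hypothesis.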

SECTION VI

In this experiment, in four different runs we retain $q=36$, 42 and 48 coefficients from the $d=60$ dimensional features using the proposed AFA method. We vary the number of eigenvoices $N_{EV}$ in the PLDA model from 50 to 400 in steps of 50. The performance metrics used are the percent Equal Error Rate (%EER) and the minimum Detection Cost Functions (DCF) defined in NIST SRE 2008 [44] $({\rm DCF}_{old})$ and NIST SRE 2010 [45] $({\rm DCF}_{new})$. The results are summarized in the plots shown in Fig. 5, and a subset of these results, organized by performance metric, is also shown in Table I. The proposed systems are compared against our baseline full-covariance and diagonal-covariance UBM based i-vector systems, referred to as “Baseline full-cov” and “Baseline diag-cov”, respectively.
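For reference, the %EER metric can be approximated from raw trial scores as in the sketch below. The function name is hypothetical, and this simple threshold sweep (without interpolation or a convex hull) is only one of several common ways to estimate the EER; it is not necessarily the scoring tool used in these experiments.

```python
import numpy as np


def compute_eer(target_scores, nontarget_scores):
    """Approximate %EER by sweeping a threshold over all observed scores."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    non = np.sort(np.asarray(nontarget_scores, dtype=float))
    thresholds = np.unique(np.concatenate([tgt, non]))
    # Miss: target trials scoring strictly below the threshold.
    p_miss = np.searchsorted(tgt, thresholds, side='left') / tgt.size
    # False alarm: non-target trials scoring at or above the threshold.
    p_fa = 1.0 - np.searchsorted(non, thresholds, side='left') / non.size
    # Operating point where the two error rates are closest.
    i = np.argmin(np.abs(p_miss - p_fa))
    return 100.0 * 0.5 * (p_miss[i] + p_fa[i])
```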

From Fig. 5(a)–(c), we observe that for $q=42$ and for almost all values of $N_{EV}$, the proposed AFA system performs better than both baseline systems with respect to all three performance metrics. For $q=48$, the AFA system is superior to the baselines in ${\rm DCF}_{new}$, but very close with respect to the other performance measures. For $q=42$ and $N_{EV}=200$, we achieve the best EER of 1.73%, which is 11.28% lower in relative terms than the corresponding Baseline full-cov system EER. The results in Fig. 5 and Table I indicate that the proposed AFA transformation of the acoustic features is able to reduce nuisance directions in the feature space, producing i-vectors with better speaker discriminating ability. We also note that our full-covariance baseline system and the AFA based systems perform significantly better than the diagonal-covariance system.

In Fig. 6, AFA system performance is compared with the Baseline full-cov system for different values of $q$, keeping the parameter $N_{EV}$ fixed at 150. Here we use $q=24$, 30, 36, 42, 48 and 54, yielding super-vector compression (SVC) ratios of $\alpha=0.4$, 0.5, 0.6, 0.7, 0.8 and 0.9, respectively. From this figure, we observe that the system performance is quite sensitive to the $q$ parameter of the proposed AFA method, though a performance improvement over the baseline system is achieved in almost all cases. If the value of $q$ is too low, some speaker dependent information is removed by the AFA transform and system performance degrades. Values of $q$ close to the feature dimension $d$ yield performance similar to the baseline system. We observe consistent improvements in system performance when $q$ is set in the range 42 $\sim$ 48 for the AFA systems. In this region, the relative improvements in all three performance metrics are in the range of 4 $\sim$ 12%. We believe the fluctuation in performance is due to the fact that a different value of $q$ is suitable for each mixture component. Thus, methods for selecting the optimal AFA dimension are a viable direction for future work, especially since the model allows different values of $q$ for each mixture.

It is known that full covariance UBM based speaker recognition systems can be very sensitive to small values in the UBM covariance matrices [20]. In [20], a variance flooring algorithm [46] was used to address this issue. As mentioned in Section V-B, we performed UBM variance flooring by limiting the minimum value of a covariance matrix component to $10^{-5}$ using HTK. We refer to this flooring method as “vFloor-1”. To observe the effect of an alternate variance flooring on the AFA systems, we trained the UBM as described in [20]. In each EM iteration, the full covariance matrices were processed using the flooring function described in Table II [20], [46]. We used the floor matrix ${\bf F}=f\bar{\mmb\Sigma}$, where $$\bar{\mmb\Sigma}={1\over M}\sum_{g=1}^{M}{\mmb\Sigma}_{g}\eqno{\hbox{(29)}}$$ is the average covariance matrix, and $f=0.1$ is set as in [20]. We refer to this flooring method as “vFloor-2”. Baseline and AFA system results using these two different UBM flooring methods are summarized in Table III. In this experiment, the PLDA size $N_{EV}$ was set to 150.
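The vFloor-2 procedure with the floor matrix built from (29) can be sketched as follows. Since Table II is not reproduced here, the flooring function below is an assumption: it uses an eigenvalue-based full covariance flooring in which each covariance is expressed relative to ${\bf F}$ and eigenvalues below one are raised to one. The function name is hypothetical.

```python
import numpy as np


def floor_covariances(Sigmas, f=0.1):
    """Floor full covariance matrices against F = f * mean covariance.

    Sigmas: (M, d, d) float array of symmetric positive definite matrices.
    Returns an array of the same shape with floored covariances.
    """
    F = f * Sigmas.mean(axis=0)            # floor matrix from (29)
    L = np.linalg.cholesky(F)              # F = L L^T
    L_inv = np.linalg.inv(L)
    out = np.empty_like(Sigmas)
    for g, S in enumerate(Sigmas):
        # Covariance expressed in the coordinates where F is the identity.
        T = L_inv @ S @ L_inv.T
        evals, U = np.linalg.eigh(T)
        # Assumed flooring rule: no eigenvalue may fall below 1.
        evals = np.maximum(evals, 1.0)
        out[g] = L @ (U * evals) @ U.T @ L.T
    return out
```

With this rule, a covariance that already dominates ${\bf F}$ in every direction is left unchanged, while directions with very small variance are raised to the level of the floor. This is consistent with the later observation that vFloor-2 modifies the eigenvalues of the covariance matrices.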

From the results, we observe that the variance flooring vFloor-2 [20] provides slightly improved baseline system performance compared to vFloor-1 with respect to %EER and ${\rm DCF}_{old}$, but degrades the ${\rm DCF}_{new}$ measure. The proposed AFA transformation achieves a much larger improvement over the baseline system when using vFloor-1. With vFloor-2, AFA improves over the baseline system only for $q=54$, whereas with vFloor-1 an improvement is observed for $q=42$, 48 and 54. This deterioration of AFA system performance can be expected, since the vFloor-2 algorithm modifies the eigenvalues of the covariance matrices, on which the AFA approach directly relies. Noting that AFA with vFloor-1 provides the best overall performance and that vFloor-2 does not provide a sufficient advantage over vFloor-1, we use the vFloor-1 method in all of our subsequent experiments.

In this section we present evaluation results of the proposed systems on the NIST SRE 2010 core conditions 1–4 using the extended trials. In these experiments, additional microphone data from SRE 2005 and 2006 corpora was included for UBM and TV matrix training. The PLDA model was trained using both telephone and microphone data as before. The results are given in Tables IV–VII. We compare the following systems: Baseline full-cov, and AFA with $q=36$, 42, 48 and 54. The PLDA parameter $N_{EV}$ was set to 150. We did not evaluate the diagonal UBM system in these conditions.

From the results, again we observe that the proposed AFA systems consistently outperform the baseline system, especially for conditions 1–3. However, it seems a single parameter setting of $q$ does not always provide the best performance across all the performance metrics. Considering the best %EER values, the proposed systems achieved 8.14%, 6.43%, 8.67% and 12.33% relative improvements in conditions 1, 2, 3 and 4, respectively. These results demonstrate the effectiveness of the proposed scheme in the microphone mismatched conditions as well.

We select three of our systems for fusion: (i) Baseline full-cov, (ii) AFA $(q=42)$ and (iii) AFA $(q=48)$. The PLDA $N_{EV}$ parameter was set to 150 for all systems. Simple equal-weight linear fusion was used, with mean and variance normalization of the individual system scores to (0, 1) for calibration. Results are shown for NIST SRE 2010 core condition 5 and for the pooled condition (combining all trials from conditions 1–5) in Tables VIII and IX, respectively.
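The fusion rule described above can be sketched as follows. The function name is hypothetical, and the normalization statistics are assumed to be computed over the trial scores of each system being fused, which the text does not specify.

```python
import numpy as np


def fuse_scores(system_scores):
    """Equal-weight linear fusion after per-system normalization.

    system_scores: list of 1-D score arrays, one per system, aligned by trial.
    """
    S = np.vstack([np.asarray(s, dtype=float) for s in system_scores])
    # Normalize each system's scores to zero mean and unit variance.
    Z = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
    # Equal weights: average the normalized scores across systems.
    return Z.mean(axis=0)
```

Normalizing first keeps a system with a wider score range from dominating the equally weighted sum.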

From the results, the fusion performance of systems (i) and (ii) clearly reveals that the AFA and baseline systems carry complementary information, since the %EER and DCF values improve. This is observed for both the telephone and pooled conditions. The best result is achieved by fusing systems (i)–(iii), obtaining 16.52%, 14.47% and 14.09% relative improvement in %EER, ${\rm DCF}_{old}$ and ${\rm DCF}_{new}$, respectively, compared to the baseline system in condition 5. In the pooled condition, this fusion provides 13.75%, 14.0% and 11.80% relative improvement in %EER, ${\rm DCF}_{old}$ and ${\rm DCF}_{new}$, respectively. A performance comparison of systems (i), (ii) and their fusion for the pooled condition is shown in Fig. 7 using Detection Error Trade-off (DET) curves. Here, we again observe the superiority of the proposed AFA system over the baseline system, while the fusion of these systems consistently provides further improvement over the full DET range.

In our experiments, we observe that the TV matrix training process using the AFA transform is computationally less expensive compared to the conventional process. This is expected since the computational complexity of an i-vector system is proportional to the super-vector size $Md$ [39], which is reduced to $Mq$ for an AFA based system. Thus, the computational complexity of the proposed system is theoretically reduced by a factor of $1/\alpha$ $(0<\alpha<1)$ compared to the baseline system.

SECTION VII

In this study, we have proposed an alternate modeling technique to address and compensate for transmission channel mismatch in speaker recognition. Motivated by the covariance structure of conventional acoustic features, we developed a factor analysis technique which operates within the acoustic feature domain utilizing a well trained UBM with full covariance matrices. We advocated that conventional super-vector domain factor analysis methods fail to take advantage of the observation that speech features reside in a lower dimensional manifold in the acoustic space. The proposed acoustic factor analysis scheme was utilized to develop a mixture-dependent feature transformation that performs dimensionality reduction, de-correlation, normalization and enhancement at the same time. Finally, the transformation was effectively integrated within a standard i-vector-PLDA based speaker recognition system using a probabilistic feature alignment technique. The superiority of the proposed method was demonstrated by experiments performed using the NIST SRE 2010 extended trials of five core conditions. Measurable improvements over two baseline systems were shown in terms of EER, min DCFs and DET curves.

This work was supported by AFRL under Contract FA8750-12-1-0188 (approved for public release, distribution unlimited), and in part by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen. The authors note that preliminary investigations on Acoustic Factor Analysis (AFA) were presented in [8] and [9]. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rodrigo Guido.

The authors are with the Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75252 USA (e-mail: taufiq.hasan@utdallas.edu; john.hansen@utdallas.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

^{1}More details on feature extraction and development data are given in Sections V-A and V-B, respectively.

^{2}Utterance dependent covariance matrices can also be extracted through MAP adaptation. However, we assume that each utterance GMM shares the common UBM covariance and weights.
