EM-based TDOA Estimation of a Speech Source via Gaussian Mixture Models in Noisy and Anechoic Environments

The propagation delay difference of a speech signal transmitted from the source to a pair of microphones, known as the time difference of arrival (TDOA), embodies the information of the speech source position. TDOA estimation plays a vital role in diverse systems such as teleconferencing and far-field speech recognition, since the TDOA is a key parameter impacting the quality of restored speech signals. This paper is devoted to estimating the TDOA of one speech source on a frame-by-frame basis in noisy and anechoic environments. First, we propose two variants of the Gaussian mixture model to represent the speech signal received by a microphone pair, assuming Gaussianity of the signal and modeling speech sparsity by the speech presence probability (SPP). Second, after estimating the noise parameter in advance and formulating the speech parameters using the maximum likelihood principle, the proposed Gaussian mixture models depend on only two unknowns, i.e., the TDOA and the SPP. Third, following these two models, we present two distinct estimators that estimate the TDOA and the SPP iteratively based on the expectation maximization algorithm. The two proposed estimators are free from the ad hoc parameter selection that is required in many classical approaches. Simulation results show that the TDOA estimated by them can be more accurate than that of the state-of-the-art GCC variants in a wide range of frames with specific SPP values. More importantly, the automatically estimated SPP, which can serve as a soft voice activity detector, encodes the information of the TDOA estimation accuracy. In a speech frame, a large estimated SPP indicates an estimated TDOA with small error. For example, when the SPP is larger than 0.76 and 0.87 in the two proposed estimators, respectively, the TDOA estimation error can be at most 19% of that in the worst case.


I. INTRODUCTION
Speech source localization, i.e., determining the spatial position of a speech source, is a fundamental issue in ad hoc acoustic sensor networks composed of distributed microphones [1]-[3], and it finds growing interest in many applications such as teleconferencing [4], far-field speech recognition [5], surveillance [6], and so on. The mainstream localization applications focus either on time difference of arrival (TDOA) estimation [7], [8] or direction of arrival (DOA) estimation [9]. Many source localization approaches have been developed in recent decades, which can be classified into two groups, i.e., spatial spectrum and time-frequency (TF) processing; see Table 1.
The spatial spectrum approaches construct a spectrum function of the spatial parameters (i.e., TDOA or DOA). The locations of the highest peaks of the spectrum function indicate the TDOA (or DOA) candidates. The spectrum function can be constructed by methods such as the generalized cross-correlation (GCC) algorithm [10] and subspace-based methods. The GCC function is obtained by inserting a weight into the cross-correlation function between the outputs of a microphone pair. The TDOA is estimated as the lag time that maximizes the GCC function. The most famous GCC function is the GCC with phase transform (GCC-PHAT).

TABLE 1. Classification of the source localization approaches.
Approach          | Reference                 | Frame size | Speech characteristics | Source number
Spatial spectrum  | GCC variants [10]-[12]    | Single     | No                     | 1
                  | Subspace-based [13]-[16]  | Multiple   | No                     | >= 1
T-F processing    | Histogram [17], [18]      | Multiple   | Yes                    | >= 1
                  | Clustering [19]-[24]      | Multiple   | Yes                    | >= 1
                  | Deep learning [25]-[28]   | Multiple   | Yes                    | >= 1
Hybrid            | Proposed                  | Single     | Yes                    | 1
In [11], a non-linear function of the GCC is proposed in order to emphasize large GCC values, which improves the accuracy of the TDOA estimates. In [12], an empirical weight is proposed for the GCC function based on tuning two intrinsic parameters. The subspace-based methods rely on the eigen-structure of the array covariance matrix. The noise subspace or the signal subspace is used to construct the spectrum function, which is likely to exhibit a large value at the true DOA. Two representatives are MUSIC [13] and ESPRIT [14]. Their variants suitable for wideband signals are proposed in [15] and [16], respectively. Note that the DOA estimate can be converted to a TDOA estimate based on the microphone array geometry and the planar wave assumption of the speech signal. As shown in Table 1, the spatial spectrum approaches mentioned above do not make use of speech characteristics. The GCC variants can estimate the TDOA of a source using a single frame, while the subspace-based approaches can deal with the case of multiple sources using a number of frames.
The TF processing approaches associate the TF bins to each source by means of a histogram [17], [18] or clustering [19], [20], and then use the associated TF bins to estimate the DOA (or TDOA) of each source, based on the assumption that at most one source is dominant in individual TF bins. The TF processing approaches have been investigated intensively in recent years due to their ability to jointly count and localize multiple sound sources even in under-determined conditions. Different types of speech features are used to associate the TF bins to the unknown sources. Nevertheless, the TF processing approaches simultaneously analyze a large amount of measurements or snapshots (in other words, a large number of successive speech frames), and the sources are then localized in an offline manner with notable latency.
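As a concrete reference point for the GCC family discussed above, the following is a minimal NumPy sketch of a GCC-PHAT TDOA estimator. The function name `gcc_phat` and the sign convention (a positive lag means the second channel is a delayed copy of the first) are our own choices, not the paper's implementation:

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """GCC with phase transform (PHAT): whiten the cross-spectrum so only
    phase drives the correlation peak, then pick the maximizing lag."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = np.conj(X1) * X2               # cross-spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT weighting
    cc = np.fft.irfft(cross, n=n)          # generalized cross-correlation
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift
    return lag / fs                        # TDOA in seconds

# toy check: x2 is x1 delayed by 40 samples
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
delay = 40
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))
tau_hat = gcc_phat(x1, x2, fs)             # close to delay / fs
```

The other GCC variants in Table 1 differ only in the weighting applied to the cross-spectrum before the inverse transform.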
The maximum likelihood (ML) algorithm is a well-known tool for parameter estimation. Based on the assumption that the microphone outputs are Gaussian distributed, the ML algorithm is applied to estimate the TDOA [10], assuming that the parameters of the speech and noise are known a priori. The resulting function from which the TDOA is estimated resembles the GCC function. Therefore, the ML-based TDOA estimator is often simply treated as a member of the GCC family in the literature. In [21]-[24], the ML is used to iteratively estimate the DOA (or TDOA) as well as the parameters of the speech and noise based on a large number of successive speech frames. The estimation procedures follow the principle of the TF processing approaches (more precisely, clustering); see Table 1.
Recently, deep learning has been applied to TDOA estimation and has been shown to be more robust in adverse environments. In [25], deep neural networks (DNNs) are applied to predict a time-frequency mask that is combined with the GCC-PHAT to estimate the TDOA. In [26], convolutional neural networks (CNNs) are devised to jointly fulfill the tasks of denoising and dereverberation before the TDOA is estimated by the GCC. In [27], an end-to-end DNN-based acoustic source localization approach is proposed. However, when a neural network trained on one data set is applied to another data set from a different distribution, its performance is not guaranteed [28].
As shown in Table 1, none of these approaches can incorporate speech characteristics while estimating the TDOA in real time (using a single frame). We attempt to fill this gap in this paper. In order to fulfill the practical requirement of real-time updates of the TDOA estimates, we estimate the TDOA based on one frame of speech signals and update the TDOA estimate frame by frame. To incorporate speech characteristics, we use the speech presence probability (SPP) [29], [30] to model speech sparsity and study the role of the SPP in the TDOA estimation.
Three assumptions are made for the speech signals received by a microphone pair. First, the noise is a Gaussian random variable. Second, the speech is sparse; in other words, the speech can be absent at some times and frequencies. Third, when the speech is present, it can be treated as either a Gaussian random variable or a deterministic value. This results in two models for the noisy speech signals, which contain a number of unknown parameters of the speech and noise. The noise parameter (i.e., the variance) is assumed to be known; otherwise, it is estimated in advance. Following the principle of the ML algorithm, the speech parameters can be formulated in terms of the received signals, the mixing vector of the microphone pair, and the noise variance. If these formulations are plugged back into the two models of the noisy speech signals, the resulting two Gaussian mixture models (GMMs) depend only on the TDOA (contained in the mixing vector) and the SPP. Then, the TDOA and the SPP are estimated iteratively using the expectation maximization (EM) algorithm [31]. Performance improvement of the two proposed approaches over some existing GCC approaches is observed in our simulations. Moreover, in contrast to the state-of-the-art approaches, the proposed approaches automatically estimate the SPP for the audio recordings, avoiding the need for separate voice activity detection when the recordings are further processed. The estimated SPP is able to tell how accurate the estimated TDOA is. Therefore, the proposed approaches complete two tasks simultaneously: they estimate the TDOA and also give the estimation accuracy.
The proposed approaches are categorized as a hybrid approach in Table 1 since they can be elaborated from two different viewpoints. First, the TDOA is estimated using a spatial spectrum that is constructed based on the likelihood function. Second, the EM-based estimation procedure using a single frame includes a clustering step in the frequency domain. More precisely, in the E-step, the frequency bins are divided softly into two clusters, i.e., noisy speech and pure noise, before the TDOA is estimated as an unknown parameter of the noisy-speech cluster in the M-step. In contrast, the clustering using multiple frames in [21]-[24] takes place in the time-frequency domain, where all clusters correspond to speech sources and no cluster deals with pure noise.
It is noteworthy that when super-Gaussian distributions are used to model speech signals instead of Gaussian distributions, the speech enhancement performance improves [32]. Since super-Gaussian distributions are more appropriate for representing speech signals, it is reasonable to believe that the TDOA estimation accuracy could be improved if super-Gaussian distributions were used in our proposed approaches. An extension of the proposed approaches to the super-Gaussian case is left for future work.
The remainder of the paper is organized as follows. In Section II, the problem of TDOA estimation is formulated briefly. In Section III, two GMMs for the noisy speech signals are introduced. In Section IV, the EM-based TDOA estimation procedure is presented and discussed. In Section V, simulation results and performance analysis are given. Finally, conclusions are drawn in Section VI.

II. PROBLEM FORMULATION
Consider the case where a single speech signal emanating from a source in the far field is recorded by two spatially separated microphones in an anechoic and noisy environment. Let x_1(t) and x_2(t) denote the signals received at the microphone pair. Since there is only the line-of-sight (LOS) component due to the anechoic assumption, without loss of generality, the signals x_1(t) and x_2(t) can be modeled as

x_1(t) = s(t) + n_1(t),
x_2(t) = s(t - τ) + n_2(t),    (1)

where t is the discrete time, s(t) is the source speech signal, and n_1(t) and n_2(t) are the noise. In x_2(t), τ denotes the relative time delay, which is known as the TDOA of the speech signal transmitted from the source to the microphone pair. The speech signals in (1) are segmented into frames. Transforming the signals of the lth frame using the short-time Fourier transform (STFT), we derive

X_1(l, f) = S(l, f) + N_1(l, f),
X_2(l, f) = e^(-j2πfτ) S(l, f) + N_2(l, f),    (2)

where f is the frequency bin. In the sequel, the frame index l will be omitted for brevity when there is no confusion. We can rewrite (2) in a vector form as

X(f) = A(f) S(f) + N(f),    (3)

where

X(f) = [X_1(f), X_2(f)]^T, N(f) = [N_1(f), N_2(f)]^T, A(f) = [1, e^(-j2πfτ)]^T,    (4)

with [·]^T standing for matrix transpose and A(f) denoting the mixing vector.
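The model in (3)-(4) can be made concrete with a few lines of NumPy; the phase convention e^(-j2πfτ) on the second microphone and the numeric values below are illustrative assumptions:

```python
import numpy as np

# Per-bin signal model X(f) = A(f) S(f) + N(f): the TDOA tau appears only
# in the mixing vector A(f) as a frequency-dependent phase shift.
fs = 16000                                  # sampling rate (Hz)
nfft = 512                                  # frame length
tau = 2.5e-4                                # example TDOA in seconds

freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)   # bin frequencies in Hz
A = np.stack([np.ones_like(freqs),          # mixing vector A(f), shape (2, F)
              np.exp(-2j * np.pi * freqs * tau)])

rng = np.random.default_rng(1)
F = len(freqs)
S = rng.standard_normal(F) + 1j * rng.standard_normal(F)      # speech spectrum
N = 0.1 * (rng.standard_normal((2, F)) + 1j * rng.standard_normal((2, F)))
X = A * S + N                               # noisy two-channel observation
```

Note that the second row of A(f) has unit magnitude at every bin; the TDOA is encoded purely in phase.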
Our goal is to find τ given only the noisy measurements (i.e., the output of the microphone pair). In Section III, we present how the noisy speech signals are modeled by GMM, while, in Section IV, an EM-based procedure is proposed to solve the TDOA estimation problem.

III. MODEL OF THE NOISE AND SPEECH
We assume that n_1(t) and n_2(t) in (1) are Gaussian distributed and uncorrelated with each other. Under mild conditions, the Fourier coefficients of the noise, i.e., N_1(f) and N_2(f) in (2), can be treated as statistically independent complex Gaussian random processes. Then, we have the probability density function (pdf) of N(f) given by

p(N(f)) = 1 / (π² det[Σ(f)]) · exp(-N*(f) Σ^(-1)(f) N(f)),    (5)

with the covariance matrix

Σ(f) = diag[σ²_n1(f), σ²_n2(f)],    (6)

where σ²_n1(f) and σ²_n2(f) are the variances of N_1(f) and N_2(f), respectively. Note that they are estimated prior to the procedure of TDOA estimation. Herein, det[·] stands for the determinant of a matrix and [·]* stands for the complex conjugate transpose.
When the speech is absent, i.e., S(f) = 0, we have X(f) = N(f) according to (3). Then, by replacing N(f) with X(f) in (5), we derive the pdf of X(f) as follows,

g_0(X(f)) = 1 / (π² det[Σ(f)]) · exp(-X*(f) Σ^(-1)(f) X(f)),    (7)

which is a two-dimensional complex Gaussian pdf with zero mean and the covariance Σ(f) given in (6). When the speech is present, namely S(f) is non-zero, the pdf of X(f) is denoted by g_1(X(f)). To take the two cases of speech absence and presence into account, we formulate the pdf of X(f) in (3) using the Gaussian mixture model (GMM):

p(X(f)) = p_0 g_0(X(f)) + p_1 g_1(X(f)),    (8)

where p_0 and p_1 denote the a priori speech absence probability (SAP) and speech presence probability (SPP) in a single frame, respectively. It is clear that p_0 + p_1 = 1. When the speech is present, S(f) can be modeled as either a Gaussian random process or a deterministic arbitrary value, according to array signal processing theory [33]. Accordingly, we model the pdf of X(f) using the stochastic GMM (GMM-s) or the deterministic GMM (GMM-d) as shown in Subsections III-A and III-B, respectively.
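To illustrate the two-component mixture in (8), the sketch below evaluates a two-dimensional circularly symmetric complex Gaussian pdf for the speech-absent component and mixes it with a speech-present component. All numbers are illustrative and the helper name `cgauss2` is our own:

```python
import numpy as np

def cgauss2(X, Sigma, mean=None):
    """Two-dimensional circularly symmetric complex Gaussian pdf:
    1 / (pi^2 det(Sigma)) * exp(-(X - m)* Sigma^-1 (X - m))."""
    d = X if mean is None else X - mean
    quad = np.real(np.conj(d) @ np.linalg.inv(Sigma) @ d)
    return float(np.exp(-quad) / (np.pi ** 2 * np.real(np.linalg.det(Sigma))))

# speech-absent pdf g0 (noise only) and a toy speech-present pdf g1,
# mixed by the a priori SAP/SPP (p0, p1) as in (8)
Sigma_n = np.diag([0.5, 0.6]).astype(complex)       # noise covariance, cf. (6)
Sigma_x = Sigma_n + 2.0 * np.ones((2, 2))           # toy speech-plus-noise covariance
X = np.array([0.3 + 0.1j, -0.2 + 0.4j])
p1 = 0.7                                            # a priori SPP
p0 = 1.0 - p1                                       # a priori SAP
g0 = cgauss2(X, Sigma_n)
g1 = cgauss2(X, Sigma_x)
p_mix = p0 * g0 + p1 * g1                           # GMM pdf of X(f)
```

Since the mixture is a convex combination, p_mix always lies between g0 and g1.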

A. STOCHASTIC GAUSSIAN MIXTURE MODEL (GMM-S)
In (3), the noise N(f) is assumed to be a zero-mean Gaussian random process, see (5). Now, the speech signal S(f) is also treated as a Gaussian random process with zero mean. According to the model in (3), X(f) is a Gaussian random variable with the pdf

g_1s(X(f)) = 1 / (π² det[Σ_x(f)]) · exp(-X*(f) Σ_x^(-1)(f) X(f)),    (9)

which is a two-dimensional complex Gaussian pdf with zero mean and the covariance matrix

Σ_x(f) = σ²_s(f) A(f) A*(f) + Σ(f),    (10)

where A(f) (see (4)) is the mixing vector, Σ(f) (see (6)) is the covariance matrix of the noise N(f), and σ²_s(f) denotes the variance of the speech S(f). The variance σ²_s(f), which is the only unknown speech parameter in this case, can be estimated by the ML algorithm as follows [33], [34]:

σ̂²_s(f) = |B*(f) X(f)|² - [A*(f) Σ^(-1)(f) A(f)]^(-1),    (11)

with

B(f) = Σ^(-1)(f) A(f) [A*(f) Σ^(-1)(f) A(f)]^(-1).    (12)

Here, B(f) is introduced to represent a matrix consisting of A(f) and Σ(f), for ease of notation. Denote by ĝ_1s(X(f)) the pdf in (9) after σ²_s(f) is substituted by the estimate σ̂²_s(f) in (11). Since Σ(f) is estimated in advance, the pdf ĝ_1s(X(f)) possesses only one unknown parameter, that is, the TDOA (in the mixing vector A(f)).
Substituting ĝ_1s(X(f)) for g_1(X(f)) in (8), we obtain the new pdf as follows:

p_s(X(f)) = p_0 g_0(X(f)) + p_1 ĝ_1s(X(f)),    (13)

which is referred to as the GMM-s in this paper.

B. DETERMINISTIC GAUSSIAN MIXTURE MODEL (GMM-D)
If S(f) in (3) is treated as a deterministic arbitrary value, we obtain the pdf g_1d(X(f)) as follows:

g_1d(X(f)) = 1 / (π² det[Σ(f)]) · exp(-[X(f) - A(f)S(f)]* Σ^(-1)(f) [X(f) - A(f)S(f)]),    (14)

which is a two-dimensional complex Gaussian pdf with the mean A(f)S(f) and the covariance matrix Σ(f). In (14), A(f) and Σ(f) are given in (4) and (6), respectively. As the unknown parameter of the speech, S(f) can be estimated by the ML algorithm as follows [33], [35]:

Ŝ(f) = B*(f) X(f),    (15)

where B(f) is the matrix consisting of A(f) and Σ(f) given in (12). Denote by ĝ_1d(X(f)) the pdf in (14) after S(f) is substituted by the estimate Ŝ(f) in (15). The unknown parameter of ĝ_1d(X(f)) consists of only the TDOA (in A(f)), the same as that of ĝ_1s(X(f)).
Substituting ĝ_1d(X(f)) for g_1(X(f)) in (8), we obtain the new pdf as follows:

p_d(X(f)) = p_0 g_0(X(f)) + p_1 ĝ_1d(X(f)),    (16)

which is referred to as the GMM-d in this paper.
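The closed-form speech estimates used by the two models can be sketched as follows. This follows the standard single-source ML results from array signal processing: the deterministic estimate is the weighted least-squares solution, and the stochastic variance estimate subtracts the noise contribution from its squared magnitude. The function name and the exact normalization are our assumptions and may differ from the paper's (11) and (15):

```python
import numpy as np

def ml_speech_estimates(X, A, Sigma):
    """Closed-form ML estimates of the speech parameters at one bin.

    Deterministic model: S_hat = (A* Sigma^-1 A)^-1 A* Sigma^-1 X.
    Stochastic model:    var_hat = |S_hat|^2 - (A* Sigma^-1 A)^-1,
    clipped at zero so the variance estimate stays non-negative.
    """
    inv = np.linalg.inv(Sigma)
    denom = np.real(np.conj(A) @ inv @ A)        # A* Sigma^-1 A (real, positive)
    S_hat = (np.conj(A) @ inv @ X) / denom       # deterministic ML estimate
    var_hat = max(np.abs(S_hat) ** 2 - 1.0 / denom, 0.0)   # stochastic ML estimate
    return S_hat, var_hat

# toy check at one frequency bin (noise-free observation)
A = np.array([1.0, np.exp(-0.4j)])
Sigma = np.diag([0.2, 0.2]).astype(complex)
S_true = 1.5 + 0.5j
X = A * S_true
S_hat, var_hat = ml_speech_estimates(X, A, Sigma)
```

With noise-free data the deterministic estimate recovers S_true exactly, since the weighted least-squares fit is then an identity.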

IV. EM-BASED TDOA ESTIMATION
A. IMPLEMENTATION OF THE ALGORITHM
Both the GMM-s in (13) and the GMM-d in (16) possess two unknown parameters, namely the a priori SPP p_1 (with p_0 = 1 - p_1) and the TDOA τ, given that the noise variance is estimated in advance. We can estimate these unknown parameters using the EM algorithm. In the treatment of the EM algorithm, p_1 (equivalently p_0) is considered as a hidden variable, while τ is considered as an unknown variable of the model. Based on the general form of p(X(f)) in (8), the EM estimation using a speech frame contains two iterative steps:
• E-step:

p^(i)_{k|X(f)} = p^(i-1)_k g_k(X(f); τ^(i-1)) / [Σ_{k'=0,1} p^(i-1)_{k'} g_{k'}(X(f); τ^(i-1))], k = 0, 1,    (17)

• M-step:

p^(i)_k = (1/F) Σ_f p^(i)_{k|X(f)},    (18)

τ^(i) = arg max_τ Σ_f p^(i)_{1|X(f)} ln g_1(X(f); τ),    (19)

where i is the iteration index, F denotes the number of frequency bins, p^(i)_k denotes the estimate of the a priori SAP (or SPP) in the ith iteration, and τ^(i) denotes the TDOA estimate in the ith iteration. Note that the EM iteration starts with initial estimates (i.e., p^(0)_k, τ^(0)) and stops when the estimation has converged. The EM-based estimation algorithm in each frame is summarized in Table 2. More implementation details will be given in Section V-A.

TABLE 2. EM-based estimation algorithm in each frame.
1: Estimate the noise covariance Σ(f) (see (6)) using the traditional algorithms (e.g., [36]).
2: Set i = 1, initialize p^(0)_k and τ^(0).
while not converged and i ≤ no. of iterations do
3: Update g_0(X(f); τ^(i-1)) using (7) for all frequency bins.
4: Update g_1(X(f); τ^(i-1)) using (9) or (14) for all frequency bins.
5: E-step: compute p^(i)_{k|X(f)} using (17).
6: M-step: update p^(i)_k using (18) and τ^(i) using (19). Set i = i + 1.
end while

If we replace the pdf in (8) with the GMM-s in (13) or the GMM-d in (16) in the EM iteration (more precisely, substituting ĝ_1s(X(f)) or ĝ_1d(X(f)) for g_1(X(f)) in (17) and (19)), we derive two types of TDOA estimates, i.e., τ_s and τ_d, respectively, after the EM iteration stops. Also, we obtain two different a priori SPPs, that is, p_1s and p_1d.
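To make the iteration concrete, the following is a minimal NumPy sketch of the E-step (17) and M-step (18)-(19) for the GMM-d variant on one frame. The function name `em_tdoa`, the phase convention in the mixing vector, the log-domain normalization, and the initialization are our own choices for illustration, not the paper's exact implementation:

```python
import numpy as np

def em_tdoa(X, freqs, Sigma, tau_grid, n_iter=2, p1_init=0.5):
    """EM sketch of (17)-(19), GMM-d variant, on one STFT frame.

    X: (2, F) complex frame, Sigma: (F, 2, 2) noise covariances,
    tau_grid: candidate TDOAs for the grid search in the M-step.
    """
    F = X.shape[1]
    inv = np.linalg.inv(Sigma)                          # (F, 2, 2)
    dets = np.real(np.linalg.det(Sigma))                # (F,)

    def loglik_g1(tau):
        # log g1_hat(X(f); tau): ML-fitted speech removed, noise residual scored
        A = np.stack([np.ones(F), np.exp(-2j * np.pi * freqs * tau)])
        denom = np.real(np.einsum('if,fij,jf->f', np.conj(A), inv, A))
        S_hat = np.einsum('if,fij,jf->f', np.conj(A), inv, X) / denom
        R = X - A * S_hat                               # residual after ML fit
        quad = np.real(np.einsum('if,fij,jf->f', np.conj(R), inv, R))
        return -quad - np.log(np.pi ** 2 * dets)

    quad0 = np.real(np.einsum('if,fij,jf->f', np.conj(X), inv, X))
    log_g0 = -quad0 - np.log(np.pi ** 2 * dets)         # log g0(X(f)), cf. (7)

    p1, tau = p1_init, tau_grid[0]
    for _ in range(n_iter):
        log_g1 = loglik_g1(tau)
        # E-step (17): a posteriori SPP per bin, max-subtracted for stability
        m = np.maximum(log_g0, log_g1)
        w1 = p1 * np.exp(log_g1 - m)
        w0 = (1.0 - p1) * np.exp(log_g0 - m)
        post = w1 / (w0 + w1 + 1e-300)
        # M-step (18): a priori SPP is the mean a posteriori SPP
        p1 = float(np.mean(post))
        # M-step (19): grid search for the SPP-weighted ML TDOA
        scores = [np.sum(post * loglik_g1(t)) for t in tau_grid]
        tau = float(tau_grid[int(np.argmax(scores))])
    return tau, p1

# toy frame: speech present in every bin, known TDOA of 2e-4 s
fs, nfft = 16000, 512
freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
F = len(freqs)
rng = np.random.default_rng(2)
tau_true = 2e-4
A = np.stack([np.ones(F), np.exp(-2j * np.pi * freqs * tau_true)])
S = rng.standard_normal(F) + 1j * rng.standard_normal(F)
X = A * S + 0.05 * (rng.standard_normal((2, F)) + 1j * rng.standard_normal((2, F)))
Sigma = np.tile(0.005 * np.eye(2, dtype=complex), (F, 1, 1))
tau_grid = np.linspace(-5e-4, 5e-4, 41)
tau_est, p1_est = em_tdoa(X, freqs, Sigma, tau_grid)
```

The max-subtraction in the E-step avoids numerical underflow when the two log-likelihoods differ strongly, which is common at high SNR.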

B. DISCUSSION ABOUT THE ALGORITHM
The EM-based estimation procedure of the GMMs can be elaborated from the viewpoint of soft clustering in the frequency domain. In the E-step (see (17)), the a posteriori SAP (or SPP) p^(i)_{k|X(f)} with k = 0, 1 is calculated for each frequency bin of a frame, which accounts for the probability of the frequency bin belonging to the two clusters, i.e., pure noise and noisy speech. This means that all frequency bins are divided softly into two clusters. In the M-step (see (18) and (19)), the unknown parameters of the clusters are estimated, including the a priori SAP (or SPP) p^(i)_k and the TDOA. If hard clustering were applied to our problem, each frequency bin would either belong to a cluster completely or not at all. Since the noise pdf in (7) does not depend on the TDOA, the TDOA should be estimated using only the frequency bins belonging to the noisy-speech cluster. Similarly, for the TDOA estimation in soft clustering, more emphasis is placed on the frequency bins with a high probability of being in the noisy-speech cluster (i.e., large p^(i)_{1|X(f)}). Details are explained in the following.
In (19), the term to be maximized on the right side of the equation can be considered as the spatial spectrum (see Section I), which is equivalent to a weighted sum of the log-likelihood functions of all frequency bins, where the weight is the a posteriori SPP p^(i)_{1|X(f)} and the log-likelihood function is derived from the pdf of the noisy speech (see (9) or (14)). Note that the a posteriori SPP p^(i)_{1|X(f)} accounts for the probability of a frequency bin being in the noisy-speech cluster. The frequency bins with a high probability of being in the noisy-speech cluster are emphasized in the spatial spectrum. Meanwhile, the impact of the remaining frequency bins on the final spatial spectrum is marginal since their a posteriori SPPs are low.
In (18), the a priori SPP p^(i)_1 is computed as the average of the a posteriori SPPs p^(i)_{1|X(f)}. Therefore, in a frame with a relatively large p^(i)_1, the majority of frequency bins are expected to have high probabilities of being in the noisy-speech cluster. Simply speaking, p^(i)_1 can serve as a soft voice activity detector in a frame. Its value indicates frame quality, namely, the percentage of frequency bins that are more likely to be in the noisy-speech cluster. In a frame of good quality, the TDOA estimated by the proposed approaches is relatively accurate.
As mentioned above, g_0(X(f)) does not contain any unknown parameters, whereas both ĝ_1s(X(f)) and ĝ_1d(X(f)) possess the TDOA as the unknown parameter. When the TDOA is fixed, the variance of ĝ_1s(X(f)) (with zero mean) and the mean of ĝ_1d(X(f)) (with the same variance as g_0(X(f))) increase as ||X(f)|| increases. Here, ||X(f)|| is the norm of X(f). At a high SNR level, as ||X(f)|| increases (i.e., the spectral amplitude of the speech increases), the values of ĝ_1s(X(f)) and ĝ_1d(X(f)) gradually overwhelm that of g_0(X(f)). Therefore, the a posteriori SPP in (17) is apt to be overestimated. As a result, the a priori SPP in (18), which is the mean of the a posteriori SPP, is also overestimated. This will be observed in the simulation results.
The pdf ĝ_1s(X(f)) can be reformulated as a Gaussian distribution whose normalization term β is independent of X(f), and the pdf ĝ_1d(X(f)) can likewise be reformulated as a Gaussian distribution. By comparing the two Gaussian distributions, we can derive that ĝ_1s(X(f)) < ĝ_1d(X(f)) when ||X(f)|| is sufficiently large. Therefore, p_1s < p_1d, according to (17) and (18).

V. SIMULATION RESULTS
A. EXPERIMENTAL SETUP
The ISM MATLAB code [37] is used to generate anechoic acoustic impulse responses. In a room of size 3 × 4 × 2.5 m, two microphones are placed at 4 possible distances from each other, that is, 5 cm, 10 cm, 15 cm and 20 cm. For each location of the microphone pair, the speech source is placed at different positions. This results in four TDOA values, that is, 7.0691 × 10^-5 s, -1.4126 × 10^-4 s, 0 s and 6.7436 × 10^-4 s. We randomly select 8 sentences from the TIMIT database for the source speech signal. From the NOISEX-92 database, 10 samples of uncorrelated two-channel noise are constructed, including white noise, babble noise and factory noise. The noise variance is estimated in advance using well-known noise estimation algorithms, e.g., Martin's minimum tracking algorithm [36]. The sampling rate is set to 16 kHz. The speech signal is segmented into frames of length 512, and successive frames overlap by 50 %. For the STFT, a Hanning window of length 512 is used. The TDOA is estimated for each frame.
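The analysis framing described above (512-sample frames, 50 % overlap, Hanning window at 16 kHz) can be sketched in a few lines; the helper name `stft_frames` is our own:

```python
import numpy as np

# Framing/STFT settings from the experimental setup
fs = 16000
frame_len = 512
hop = frame_len // 2                       # 50 % overlap
window = np.hanning(frame_len)

def stft_frames(x):
    """Split x into windowed frames and return their one-sided spectra."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)     # shape (n_frames, frame_len // 2 + 1)

x = np.random.default_rng(4).standard_normal(16000)   # 1 s of dummy signal
X = stft_frames(x)
```

One second of 16 kHz audio yields 61 frames of 257 one-sided frequency bins each, so the TDOA is updated roughly every 16 ms.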
We select three candidate estimators: the GCC-PHAT [10], the GCC-NONLIN [11] and the ρ-PHAT-C [12], which are suitable for real-time estimation using a single frame. For the ρ-PHAT-C, two tuning parameters are empirically recommended in [12], that is, ρ = 1 and b = 0.75. EM-GMM-s and EM-GMM-d are the abbreviations for the two proposed estimators based on the GMM-s and GMM-d, respectively. The EM iteration is initialized as follows: p^(0)_1 = 0.5 and τ^(0) equals the TDOA estimate of the GCC-NONLIN. We set the maximal number of iterations to 2 since it yields satisfactory estimation results in our simulations. In addition, the maximization in (19) is done by a grid search over the TDOA candidates with a resolution of d/(180c) [21]. Here, d and c denote the distance between the microphone pair and the propagation velocity of the speech signal, respectively.

B. WHITE NOISE AND ANECHOIC ENVIRONMENTS
In this subsection, white noise and anechoic environments are considered. Based on the settings above, we have 320 samples of noisy two-channel speech signals at each of three SNR levels (i.e., 5 dB, 10 dB and 20 dB). Table 3 shows the root mean squared error (RMSE) and the mean absolute error (MAE) of the TDOAs estimated by the five estimators for all the 60680 speech frames segmented from the 320 speech samples at each SNR level. We can see that the MAE coincides with the RMSE in terms of performance comparison. The EM-GMM-s and EM-GMM-d outperform the GCC-PHAT and GCC-NONLIN. However, they are slightly outperformed by the ρ-PHAT-C. It should be noted that the ρ-PHAT-C lacks theoretical justification since its two parameters are selected empirically. In contrast, the two proposed estimators are free from any ad hoc parameter selection. In addition to the TDOA, the EM-GMM-s and EM-GMM-d also estimate the a priori SPP in (18) (i.e., p_1s and p_1d, respectively) in each frame, which contains valuable information. In what follows, we investigate the impact of the a priori SPPs on the estimated TDOAs in detail. Note that the SPP means the a priori SPP except where noted.
Figs. 1-2 show the relationship between the TDOAs and the SPPs estimated by the two proposed estimators for the frames of a noisy speech sample. In Fig. 1, the solid line represents the true TDOA value and the mark '+' represents a pair of TDOA and SPP (p_1s) estimated by the EM-GMM-s in a frame. In the frames with p_1s larger than a certain value, e.g., 0.75, the marks '+' are located much closer to the solid line, which means that the estimation error of the TDOA is significantly reduced compared with that in the frames with p_1s < 0.75. In Fig. 2, similar results can be observed when p_1d is larger than 0.87. In the frames where the speech is present with a relatively high probability (i.e., large p_1s or p_1d), the frequency bins are more likely to contain noisy speech components, which indicates relatively good frame quality and consequently leads to TDOA estimates with small error. This has been explained in Section IV-B. We rank the SPPs calculated in all the 60680 frames in descending order and partition the SPPs into different levels. As a result, the frames with different levels of quality are selected according to the SPP levels. For example, the frames with the top 10% SPPs are exactly the top 10% of frames of best quality. Figs. 3-4 show the thresholds of the SPPs when different percentages (from 99% to 10%) of the largest SPP values are selected. It is found that the threshold of p_1s ranges from 0.618 to 0.891 whereas that of p_1d ranges from 0.698 to 0.952, considering all three SNR levels. Additionally, Figs. 3-4 show how the estimated p_1s and p_1d are roughly distributed in different conditions.
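The frame-selection rule described above, keeping the frames whose SPP lies in the top x % of all frames, amounts to a percentile threshold. The sketch below uses synthetic SPP values as a stand-in for the paper's 60680 estimated p_1s values:

```python
import numpy as np

# synthetic stand-ins for per-frame SPP estimates (NOT the paper's data)
rng = np.random.default_rng(3)
spp = np.clip(rng.normal(0.75, 0.08, size=60680), 0.0, 1.0)

def spp_threshold(spp, keep_percent):
    """Threshold such that keep_percent % of the largest SPPs are selected."""
    return np.percentile(spp, 100.0 - keep_percent)

thr10 = spp_threshold(spp, 10.0)           # threshold for the top 10 % of frames
selected = spp[spp >= thr10]               # the top 10 % frames of best quality
```

Sweeping `keep_percent` from 99 down to 10 reproduces the kind of threshold curves shown in Figs. 3-4 for the estimated SPP distributions.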
The lower limits of p_1s and p_1d are overestimated since the theoretical value should be close to 0.5. Also, the estimated p_1s is smaller than the counterpart p_1d. This is due to the formulation difference between ĝ_1s(X(f)) and ĝ_1d(X(f)), which has been elaborated at the end of Section IV-B. Figs. 5-7 show the RMSE of the TDOA estimates in the frames at different SPP levels. p_1s is used to select the frames for the EM-GMM-s, the GCC-NONLIN and the ρ-PHAT-C, whereas p_1d is used for the EM-GMM-d. The RMSE of the TDOA values estimated by both the EM-GMM-s and EM-GMM-d decreases in the frames with increased SPPs. In other words, the two proposed estimators tend to estimate the TDOA more accurately in a frame with a larger SPP.
The RMSE curves of the EM-GMM-s in Figs. 5-7 descend faster than those of the EM-GMM-d. The EM-GMM-s gives smaller RMSE values than the EM-GMM-d in the frames with large SPPs. Hence, p_1s is more competent than p_1d in indicating the TDOA estimation accuracy. This means that the EM-GMM-s is more precise than the EM-GMM-d in terms of modeling noisy speech signals in the frames of relatively good quality (i.e., relatively large SPPs). Additionally, the EM-GMM-s and the ρ-PHAT-C perform similarly in the frames with relatively large SPPs. However, the former is slightly worse than the latter in the frames with small SPPs. This might be caused by the fact that the EM performance degrades when the EM iteration converges to a local optimum in a frame of bad quality (i.e., small SPPs).

C. NON-WHITE NOISE AND REVERBERANT ENVIRONMENTS
Although the proposed estimators are originally designed for white noise and anechoic environments, their performance in non-white noise and reverberant environments is also studied. Figs. 8 and 9 show the RMSE of the TDOA estimates at different SPP levels in babble noise and anechoic environments at 5 dB and 20 dB SNR, respectively. Figs. 10 and 11 show the RMSE of the TDOA estimates in factory noise and anechoic environments at 5 dB and 20 dB SNR, respectively. In both babble and factory noise, the relationship between the RMSE of the TDOA estimates and the SPPs remains the same as that in white noise, i.e., the RMSE decreases as the SPP increases. The SPP estimated by the EM-GMM-s outperforms the counterpart in the EM-GMM-d in terms of indicating the TDOA estimation accuracy, since the RMSE curves of the EM-GMM-s in Figs. 8-11 descend faster than those of the EM-GMM-d. The performance of the ρ-PHAT-C degrades severely in non-white noise since its performance is highly dependent on the two tuning parameters, which are selected without theoretical justification. Figs. 12 and 13 show the RMSE of the TDOA estimates in white noise (10 dB SNR) and reverberant environments with T60 = 0.1 s and 0.2 s, respectively. The RMSE curves of the two proposed estimators in Fig. 12 descend as the SPP rises except for the 10% of the largest SPPs. The SPP can still roughly indicate the TDOA estimation accuracy when T60 = 0.1 s. However, in Fig. 13, the RMSE of the TDOA estimates is not inversely related to the SPP. The SPP is not able to indicate the TDOA estimation accuracy when T60 = 0.2 s. This can be explained as follows. The proposed GMMs group the speech signals into two clusters, namely noisy speech and pure noise. The reverberant speech destroys the validity of the GMMs and prevents the EM algorithm from accurately identifying the two clusters. The strong part of the reverberant speech is apt to be assigned to the noisy-speech cluster.
For example, a frequency bin of a frame may contain no direct-path speech but contain the reverberant speech of previous frames. This frequency bin could be assigned to the noisy-speech cluster. Therefore, the estimated SPP becomes inaccurate, especially when the reverberation becomes severe. In order to cope with reverberant conditions, the proposed GMMs and the EM estimation procedure should take a dereverberation step into account. This is left for future study.

VI. CONCLUSIONS
In this paper, we propose an EM-based TDOA estimation procedure via stochastic and deterministic GMMs for a single speech source in noisy and anechoic environments. By carefully modeling the speech and the noise, the two GMMs are made independent of the speech parameters, and the only parameter to be pre-estimated is the noise variance. After that, the TDOA and SPP values are estimated iteratively via the EM algorithm. Simulations show that the proposed estimators outperform or are comparable to the selected state-of-the-art estimators in terms of TDOA estimation. Moreover, the a priori SPPs calculated by the proposed estimators, which can serve as soft voice activity detection, are utilized as indicators of the accuracy of the TDOA estimates. Therefore, the proposed estimators complete two tasks simultaneously: they estimate the TDOA and also give the estimation accuracy.