Fast Nonstationary Noise Tracking Based on Log-Spectral Power MMSE Estimator and Temporal Recursive Averaging

Estimation of the noise power spectral density (PSD) plays a critical role in most existing single-channel speech enhancement algorithms. In this paper, we present a novel noise PSD tracking algorithm that employs a log-spectral power minimum mean square error (MMSE) estimator. The method updates the noise PSD estimate by temporal recursive averaging of the log-spectral MMSE estimate of the current noise power, which reduces the risk of speech leakage into the noise estimate. The smoothing parameter used in the recursion is adjusted by the speech presence probability (SPP). To estimate the noise spectral power, a nonlinear spectral weighting function is derived that depends on the a priori and the a posteriori signal-to-noise ratio (SNR). An extensive performance comparison has been carried out with several state-of-the-art noise tracking algorithms, i.e., Minimum Statistics (MS), the modified minima controlled recursive averaging algorithm (MCRA-2), the MMSE-based method, and the SPP-based method. Experimental results show that the proposed algorithm exhibits better noise tracking capability under various nonstationary noise environments and SNR levels. When employed in a speech enhancement framework, it yields improved performance in terms of segmental SNR (segSNR) improvement and three objective composite metrics.


I. INTRODUCTION
Speech is one of the most important forms of human communication and plays an important role in many applications such as mobile communications, digital hearing aids and human-computer interaction. In practical scenarios, however, clean speech signals are always degraded to some extent by surrounding interference noise, and in most situations this noise is nonstationary. Nonstationary interference noise poses great challenges to speech signal processing applications. In human-computer interaction (e.g., automatic speech recognition), for instance, degraded speech leads to a significant decrease in recognition accuracy. As a consequence, noise suppression technology [1]-[7] is of great importance; its aim is to suppress the disturbing noise component in noisy speech while preserving the original quality and intelligibility of the clean speech. Single-channel noise suppression approaches [8]-[10] based on the short-time Fourier transform (STFT, a sequence of Fourier transforms of a windowed signal) are often used to achieve this.

The associate editor coordinating the review of this manuscript and approving it for publication was Bora Onat.
Noise power spectral density (PSD) is defined as the noise power per unit bandwidth. Noise PSD estimation is a crucial component in designing single-channel speech enhancement algorithms [11]-[17]. An underestimation of the noise PSD leads to an unnecessary amount of residual noise in the enhanced signal, while an overestimation introduces speech distortions, which may result in a loss of speech intelligibility. A conventional approach to noise PSD estimation is to exploit a voice activity detector (VAD) [18]-[21] to identify speech pauses and to update the noise PSD estimate during speech absence. Although this is effective for highly stationary noise, it often fails in low-SNR scenarios (where SNR is the ratio of the signal power to the noise power), especially when the noise is nonstationary. In past decades, a significant amount of work has been done to solve this problem. In general, most state-of-the-art methods for noise PSD estimation can be divided into four main groups [22], i.e., Minimum Statistics (MS) methods [23], [24], time-recursive averaging methods [25]-[27], subspace decomposition algorithms [28], [29], and other techniques based on the Bayesian estimation principle [30]-[32].

VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
In the first group of algorithms, the noise PSD is tracked via Minimum Statistics (MS) algorithms [23], [24], which rely on two assumptions: that the noise and the speech are statistically independent, and that the power of the noisy speech signal frequently decays to the power level of the noise signal (e.g., in speech pauses). The noise PSD is estimated as the tracked minimum of the smoothed noisy spectrum within a finite time window. Because the expectation of the minimum is smaller than the mean value of the spectral power, a bias compensation factor is derived to correct the bias [24]. A short time window causes speech leakage into the noise PSD estimate, so a sufficiently long window is required to reduce the amount of leakage. Unfortunately, if the window is chosen too long, fast noise level changes are tracked with a rather large delay. A trade-off is therefore necessary; a typical window size is on the order of 1 s. As the minimum value in a window is used, the noise PSD is underestimated or tracked with a large delay whenever the noise power level increases.
In the second category of algorithms, the noise PSD estimate is updated by recursively averaging the previous noise PSD estimate and the current noisy speech power spectrum, with smoothing factors controlled by the speech presence probability (SPP). Representative methods of this class include the minima controlled recursive averaging (MCRA) method [25] and its two modifications, i.e., improved MCRA (IMCRA) [26] and MCRA-2 [27]. The main distinction between MCRA, IMCRA and MCRA-2 lies in the way the SPP is calculated. In MCRA, the SPP is determined by the ratio of the smoothed noisy speech power spectrum to its local minimum obtained by the minimum statistics technique [24], which is why the method is referred to as the minima controlled recursive averaging (MCRA) algorithm. Speech is detected as present when the ratio is above a certain fixed threshold. MCRA-2 employs the continuous spectral minimum-tracking algorithm [33] to obtain the minimum and is not constrained to a search window. Moreover, unlike the fixed threshold in MCRA, frequency-dependent thresholds are used in MCRA-2 to calculate the SPP. In the IMCRA method, the SPP estimation is based on a Gaussian statistical model and obtained from the ratio of the likelihood functions of speech presence and speech absence. The derivation of IMCRA involves two iterations of smoothing and minimum tracking. The first iteration provides a simple speech-presence detector for each frequency bin, while the second iteration of smoothing excludes high-energy speech components, thus allowing for smaller windows in minima tracking. However, since these approaches build on the MS principle [24], they still show a considerable tracking delay when the noise power level increases.
In the third family of methods, the decomposed noise-only subspace is used to update the noise PSD estimate. A well-known subspace decomposition based approach, called the subspace noise tracking (SNT) algorithm, was proposed in [28]. SNT is based on eigenvalue decompositions of correlation matrices that are constructed from time series of noisy discrete Fourier transform (DFT) coefficients. An improvement of this method, called the minimum subspace noise tracking (MSNT) algorithm [29], exploits the limited-rank structure of the clean speech signal. MSNT combines the subspace structure and minimum statistics tracking to estimate the noise PSD. In comparison to, e.g., MS-based noise PSD trackers, the subspace decomposition based noise tracking algorithms allow faster noise tracking for many nonstationary noises [34]. However, the improved tracking performance of the subspace based noise trackers is accompanied by a significant increase in computational complexity.
In the fourth group of methods, the derivation of the noise spectral power estimators is based on the Bayesian estimation principle and an assumed statistical model. In [30], [31], a minimum mean square error (MMSE) estimator derived by minimizing the mean square error (MSE) of the spectral power is used to estimate the instantaneous noise power, and a first-order recursive smoothing technique is employed to update the noise PSD estimate. The two methods differ in their bias compensation: the simple bias compensation in [30] is motivated heuristically, whereas the bias compensation in [31] is derived rigorously from the assumed signal model. The SPP-based approach [32] is a further modification of the MMSE-based approach [31]. In the SPP-based method, the noise PSD estimate is obtained as the sum of the previous noise PSD estimate weighted by the conditional probability of speech presence and the periodogram of the noisy speech weighted by the conditional probability of speech absence. These MMSE algorithms [31], [32] achieve fast noise spectral power tracking and have been demonstrated to provide more robust noise estimation performance [34]. More recently, a model-based noise PSD estimation method was reported in [35] and [36], where different codebooks were trained for different noise and speech types. This model-based method [35] performs best for the noise types for which the algorithm is trained. However, since the number of models increases with the product of the codebook sizes, this might lead to an intractable computational complexity.
Although the spectral mean-square error (MSE) distortion metric is mathematically tractable and leads to good results in [30]-[32], it is not perceptually meaningful. In fact, the human ear has a logarithmic response to changes in sound intensity (whether speech or noise) [37], and it is argued that a distortion metric based on the MSE of the log-spectrum is perceptually more relevant and more appropriate for speech processing [38]. On this basis, it was proposed in [3] and [13] to estimate the speech spectral amplitude by minimizing the log-spectral MSE. Recently, an algorithm was presented in [39] to track the speech and noise in the log-power spectral domain. Motivated by these facts, we regard the noise, rather than the speech as in [3], [13], as the "target" signal, and therefore exploit this distortion metric for noise estimation and develop a noise spectral power estimator that minimizes the MSE of the log-spectral power. Moreover, the speech estimators [3], [13] focus on reconstructing the instantaneous speech spectral amplitude, while noise tracking algorithms are interested in estimating the noise PSD (the expectation of the instantaneous noise spectral power). In the proposed algorithm, the noise PSD estimate is obtained by recursively averaging the log-spectral MMSE estimate of the current noise power. The smoothing parameter is adjusted by the speech presence probability, which is determined from the smoothed a posteriori SNR. For the noise spectral power estimate, we derive a nonlinear spectral weighting function, which relies on the a priori and the a posteriori SNR. In this work, we consider the standard "decision-directed" (DD) estimator for the a priori SNR estimation. Experimental results show that, for different nonstationary noises, the proposed noise PSD tracker achieves a more accurate and rapid noise PSD estimate and a better speech enhancement performance in terms of both the segmental SNR [22], [40] and three composite measures [41].
The remainder of this paper is organized as follows. Section II explains the notation and the signal model employed to derive the noise spectral power estimator. In Section III, we propose to employ the log-spectral MMSE estimate of the noise power to recursively update the noise PSD estimate, which reduces the probability of speech leakage. Section IV gives a detailed derivation of the proposed log-spectral MMSE noise power estimator. In Section V, we evaluate the performance of the proposed algorithm and compare it with four state-of-the-art methods, MS [24], MCRA-2 [27], the MMSE-based algorithm [31], and the SPP-based algorithm [32], in terms of tracking performance and overall performance in a noise suppression framework. Conclusions are finally presented in Section VI.

II. SIGNAL MODEL AND NOTATION
Let y(n) denote a noisy speech signal, which consists of a clean speech signal x(n) contaminated by an additive noise signal d(n), i.e., y(n) = x(n) + d(n), where n is the discrete time index. The noisy signal y(n) is segmented into overlapping frames, and each frame is windowed with a square-root-Hann window and then transformed by the short-time Fourier transform (STFT). The noisy speech signal in the time-frequency domain is expressed as

$$Y(l,k) = X(l,k) + D(l,k), \tag{1}$$

where $X(l,k)$ and $D(l,k)$ represent the complex STFT coefficients of the clean speech and the additive noise, respectively. Furthermore, $l$ is the frame index and $k$ is the frequency index. It is assumed that $X(l,k)$ and $D(l,k)$ are conditionally independent across time and frequency, and obey zero-mean complex Gaussian distributions with model parameters

$$E\{|X(l,k)|^2\} = \lambda_x(l,k), \qquad E\{|D(l,k)|^2\} = \lambda_d(l,k),$$

respectively, where $E\{\cdot\}$ denotes the statistical expectation operator, and $\lambda_x(l,k)$ and $\lambda_d(l,k)$ denote the PSDs (or variances) of the speech and the noise signals, respectively. In the sequel, the indexes $l$ and $k$ are omitted for simplicity whenever possible. The STFT coefficients can be represented in terms of their amplitude and phase, denoted as $Y = R e^{j\alpha}$, $X = A e^{j\beta}$, and $D = N e^{j\theta}$. We will call $N^2$ the (instantaneous) noise spectral power.
Further, we use the a priori SNR $\xi$ and the a posteriori SNR $\gamma$, defined as

$$\xi = \frac{\lambda_x}{\lambda_d}, \qquad \gamma = \frac{R^2}{\lambda_d},$$

respectively. A hat symbol denotes an estimated quantity, e.g., $\hat{N}^2$ is an estimator of the noise spectral power $N^2$.

III. TEMPORAL RECURSIVE SMOOTHING OF NOISE LOG-SPECTRAL POWER MMSE ESTIMATION
The temporal-recursive averaging algorithms, e.g., MCRA [25] and IMCRA [26], obtain the noise PSD estimate by recursively smoothing the noisy speech spectral power $R^2$ [32], i.e.,

$$\hat{\lambda}_d(l,k) = \alpha_N(l,k)\,\hat{\lambda}_d(l-1,k) + \left(1-\alpha_N(l,k)\right) R^2(l,k).$$

As the minimum values in a long time window are used to avoid speech leakage into the noise PSD estimate, these methods respond slowly to fast increases in noise level [42]. In this paper, the noise PSD is estimated by recursively averaging an (instantaneous) noise spectral power estimate $\hat{N}^2$ instead of the noisy spectral power $R^2$, given by

$$\hat{\lambda}_d(l,k) = \alpha_N(l,k)\,\hat{\lambda}_d(l-1,k) + \left(1-\alpha_N(l,k)\right) \hat{N}^2(l,k). \tag{4}$$

Compared to recursive averaging with a fixed smoothing factor, SPP-based recursive averaging is a more general and widely used technique. Similar to MCRA and IMCRA, the time-varying smoothing parameter $\alpha_N(l,k)$ is adjusted by an estimate $\hat{p}(l,k)$ of the SPP,

$$\alpha_N(l,k) = \alpha_n + (1-\alpha_n)\,\hat{p}(l,k), \tag{5}$$

where $\alpha_n$ ($0 < \alpha_n < 1$) is a smoothing parameter which usually lies in the range [0.8, 0.95] as suggested in [22] and is empirically set to 0.8 in this work. Utilizing the noise spectral power estimate $\hat{N}^2$ instead of the noisy spectral power $R^2$ has the benefit of reducing the amount of speech leaking into the noise PSD estimate. Therefore, an extremely accurate SPP estimator is not necessary. For $\hat{N}^2$, the log-spectral MMSE estimator of the noise power $N^2$ is exploited. Different from IMCRA, this work uses a simpler estimation method for $\hat{p}(l,k)$ that allows for faster tracking.
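The SPP-controlled recursion described above can be sketched for a single time-frequency bin as follows. This is a minimal illustration, assuming the MCRA-style form in which the new estimate is alpha_N times the previous estimate plus (1 - alpha_N) times the instantaneous noise power estimate, with alpha_N = alpha_n + (1 - alpha_n) * p_hat. The function and argument names are hypothetical and not taken from the authors' implementation.

```python
def smoothing_factor(p_hat, alpha_n=0.8):
    """Time-varying smoothing factor driven by the SPP estimate p_hat in [0, 1].

    alpha_n = 0.8 is the value quoted in the text; the functional form is the
    assumed MCRA-style one.
    """
    return alpha_n + (1.0 - alpha_n) * p_hat


def update_noise_psd(prev_psd, noise_power_est, p_hat, alpha_n=0.8):
    """One recursive-averaging step for one bin.

    prev_psd        : noise PSD estimate of the previous frame
    noise_power_est : instantaneous noise power estimate of the current frame
    p_hat           : speech presence probability estimate
    """
    a = smoothing_factor(p_hat, alpha_n)
    return a * prev_psd + (1.0 - a) * noise_power_est


# Speech surely present: the previous estimate is kept unchanged.
frozen = update_noise_psd(1.0, 2.0, p_hat=1.0)
# Speech surely absent: the estimate moves toward the new noise power.
updated = update_noise_psd(1.0, 2.0, p_hat=0.0)
```

With p_hat = 1 the update is frozen, and with p_hat = 0 the recursion falls back to a fixed smoothing factor of alpha_n, which is the behaviour the paragraph describes.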

A. SPEECH PRESENCE PROBABILITY ESTIMATION
Since the noise PSD estimate is updated with the noise spectral power estimate $\hat{N}^2$, the risk of speech leakage is reduced. Accordingly, there is no need to design an extremely accurate SPP estimator. In this work, we employ a very simple SPP estimator, which depends on the smoothed a posteriori SNR.
Considering the correlation of speech presence in neighboring time-frequency points [43], we calculate the smoothed a posteriori SNR over a time-frequency region,

$$\bar{\gamma}(l,k) = \frac{1}{M} \sum_{i=k-\Delta k}^{k+\Delta k}\ \sum_{j=l-\Delta l}^{l} \gamma(j,i), \tag{6}$$

where $M = (2\Delta k + 1)\cdot(\Delta l + 1)$ is the number of neighboring time-frequency points which are averaged. $\Delta k$ and $\Delta l$ denote the number of adjacent frequency bins and successive time frames, set to 1 and 2, respectively. Then, the smoothed a posteriori SNR is compared against a threshold to decide on speech-present regions as follows:

$$I(l,k) = \begin{cases} 1, & \bar{\gamma}(l,k) > \delta(k) \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$

where $\delta(k)$ is the threshold, which controls the trade-off between the update speed of the noise PSD estimate and the amount of speech leakage. The higher the threshold, the faster the tracking, but the higher the risk of speech leakage.
The speech presence indicator $I(l,k)$ is smoothed over time using the following first-order recursion:

$$\hat{p}(l,k) = \alpha_p\,\hat{p}(l-1,k) + (1-\alpha_p)\, I(l,k), \tag{8}$$

where $\alpha_p$ ($0 < \alpha_p < 1$) is a smoothing parameter, set to 0.2 in our experiments as adopted in [25]. The smoothing parameter $\alpha_N$ is obtained by substituting (8) into (5). Here, using the averaged a posteriori SNR reduces random fluctuations in $\hat{p}(l,k)$, while a fast reaction to changing noise levels is achieved because minimum tracking is abandoned. Additionally, similar to MCRA-2, we exploit frequency-dependent thresholds $\delta(k)$ instead of the fixed threshold of the MCRA method; they are set as a function of the frequency index $k$ and of $K$, where $K$ is the window length as well as the STFT length.
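The SPP estimator of this subsection can be sketched as below: average the a posteriori SNR over a small time-frequency neighbourhood, threshold it into a hard indicator, and smooth the indicator recursively. This is only a sketch under stated assumptions: the neighbourhood is clipped at the spectrum and signal edges, the threshold value 2.0 in the usage lines is purely illustrative (the frequency-dependent thresholds of the paper are not reproduced here), and all names are hypothetical.

```python
def smoothed_post_snr(gamma, l, k, dk=1, dl=2):
    """Average the a posteriori SNR over bins k-dk..k+dk and frames l-dl..l.

    gamma is indexed as gamma[frame][bin]. Clipping the neighbourhood at the
    edges is an assumption; the paper does not state its edge handling.
    """
    n_bins = len(gamma[0])
    values = [gamma[j][i]
              for j in range(max(l - dl, 0), l + 1)
              for i in range(max(k - dk, 0), min(k + dk, n_bins - 1) + 1)]
    return sum(values) / len(values)


def update_spp(p_prev, gamma_bar, threshold, alpha_p=0.2):
    """Hard speech-presence decision followed by first-order smoothing."""
    indicator = 1.0 if gamma_bar > threshold else 0.0
    return alpha_p * p_prev + (1.0 - alpha_p) * indicator


# Toy 3-frame, 3-bin a posteriori SNR map with one strong bin.
gamma_map = [[1.0, 1.0, 1.0],
             [1.0, 4.0, 1.0],
             [1.0, 1.0, 1.0]]
gamma_bar = smoothed_post_snr(gamma_map, l=2, k=1)      # mean of all 9 points
p_hat = update_spp(p_prev=0.5, gamma_bar=gamma_bar, threshold=2.0)
```

Because alpha_p is small (0.2), the smoothed probability follows the current decision closely, which matches the stated goal of reacting quickly to changes in noise level.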

IV. LOG-SPECTRAL POWER MMSE ESTIMATOR

A. DERIVATION OF THE WEIGHTING FUNCTION
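To estimate the noise PSD, in this section we derive an estimator $\hat{N}^2$ of the noise spectral power that minimizes the MSE of the log-spectral power, given by

$$\hat{N}^2 = \exp\left(E\{\log N^2 \mid Y\}\right). \tag{9}$$

In [3], the MMSE estimator of the speech spectral magnitude in the logarithmic domain was derived by exploiting the moment generating function. Similar to [3], the moment generating function of $\log N^2$ given $Y$ is exploited to derive the noise spectral power estimator according to (9). Let $P = \log N^2$; then the moment generating function of $P$ given $Y$ takes the form

$$M_{P|Y}(\mu) = E\{\exp(\mu P) \mid Y\} = E\{N^{2\mu} \mid Y\}. \tag{10}$$

By exploiting the first derivative of $M_{P|Y}(\mu)$ at $\mu = 0$, the estimator in (9) is obtained as

$$\hat{N}^2 = \exp\left(\frac{d}{d\mu} M_{P|Y}(\mu)\Big|_{\mu=0}\right). \tag{11}$$

Therefore, we need to evaluate the moment generating function $M_{P|Y}(\mu)$ and then obtain the estimator $\hat{N}^2$ using (11). By applying Bayes' theorem, $M_{P|Y}(\mu)$ can be expressed as

$$M_{P|Y}(\mu) = \frac{\int_0^{\infty}\!\int_0^{2\pi} n^{2\mu}\, f(Y \mid n,\theta)\, f(n,\theta)\, d\theta\, dn}{\int_0^{\infty}\!\int_0^{2\pi} f(Y \mid n,\theta)\, f(n,\theta)\, d\theta\, dn}. \tag{12}$$

Under the assumed complex Gaussian distributions, $f(Y \mid n,\theta)$ and $f(n,\theta)$ are given by

$$f(Y \mid n,\theta) = \frac{1}{\pi\lambda_x} \exp\left(-\frac{|Y - n e^{j\theta}|^2}{\lambda_x}\right), \tag{13}$$

$$f(n,\theta) = \frac{n}{\pi\lambda_d} \exp\left(-\frac{n^2}{\lambda_d}\right). \tag{14}$$

By substituting (13) and (14) into (12), we obtain

$$M_{P|Y}(\mu) = \lambda^{\mu}\, \Gamma(\mu+1)\, \Phi(-\mu, 1; -\eta), \tag{15}$$

with $\lambda = \frac{\lambda_x \lambda_d}{\lambda_x + \lambda_d}$, where $\Gamma(\cdot)$ is the gamma function, $\Phi(\cdot)$ is the confluent hypergeometric function [44, Eq. 9.210.1], and $\eta$ satisfies the relation

$$\eta = \frac{\gamma}{\xi(1+\xi)}. \tag{16}$$

Since each of the three factors in (15) equals one at $\mu = 0$, the first derivative of $M_{P|Y}(\mu)$ at $\mu = 0$ in (11) is given by

$$\frac{d}{d\mu} M_{P|Y}(\mu)\Big|_{\mu=0} = \underbrace{\frac{d}{d\mu}\lambda^{\mu}\Big|_{\mu=0}}_{\text{part 1}} + \underbrace{\frac{d}{d\mu}\Gamma(\mu+1)\Big|_{\mu=0}}_{\text{part 2}} + \underbrace{\frac{d}{d\mu}\Phi(-\mu,1;-\eta)\Big|_{\mu=0}}_{\text{part 3}}. \tag{17}$$

According to the basic derivative rules, part 1 in (17) is given by

$$\frac{d}{d\mu}\lambda^{\mu}\Big|_{\mu=0} = \log\lambda. \tag{18}$$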

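For the derivation of part 2, the derivative of $\Gamma(\mu+1)$ can be obtained through the derivative of $\log\Gamma(\mu+1)$. Exploiting the series expansion of $\log\Gamma(\mu+1)$ [44, Eq. 8.342.1], we obtain

$$\frac{d}{d\mu}\Gamma(\mu+1)\Big|_{\mu=0} = -c, \tag{19}$$

where $c$ is Euler's constant. For the computation of part 3, utilizing [44, Eq. 9.210.1] and the derivative rules, it can be written as

$$\frac{d}{d\mu}\Phi(-\mu,1;-\eta)\Big|_{\mu=0} = -\sum_{r=1}^{\infty}\frac{(-\eta)^r}{r \cdot r!} = \int_{\eta}^{\infty}\frac{e^{-t}}{t}\,dt + \log\eta + c. \tag{20}$$

Now, summing the results of (18), (19) and (20), and utilizing (11) and (16), we obtain

$$\hat{N}^2 = \lambda\,\eta\,\exp\left(\int_{\eta}^{\infty}\frac{e^{-t}}{t}\,dt\right) = \left(\frac{1}{1+\xi}\right)^{2} \exp\left(\int_{\eta}^{\infty}\frac{e^{-t}}{t}\,dt\right) R^2. \tag{21}$$

The noise spectral power estimate is thus obtained from the noisy speech through a multiplicative nonlinear weighting function which depends only on the a priori and the a posteriori SNR. The weighting function is defined as

$$G(\xi,\gamma) = \left(\frac{1}{1+\xi}\right)^{2} \exp\left(\int_{\eta}^{\infty}\frac{e^{-t}}{t}\,dt\right). \tag{22}$$

After estimating the noise spectral power $\hat{N}^2$ with (21), the noise PSD estimate is updated via (4) and (5) as

$$\hat{\lambda}_d(l,k) = \alpha_N(l,k)\,\hat{\lambda}_d(l-1,k) + \left(1-\alpha_N(l,k)\right) G(\xi,\gamma)\, R^2(l,k). \tag{23}$$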
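To make the weighting function concrete, the following sketch evaluates it numerically. It assumes the closed form that follows from the derivation steps above, namely G = (1/(1+xi))^2 * exp(E1(eta)) with eta = gamma / (xi * (1 + xi)), where E1 is the exponential integral; if the originally typeset equation differs in notation, the code should be read as an illustration only. The exponential integral is implemented with a textbook power series and continued fraction so that the example has no dependencies; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant c


def exp1(x, max_iter=200, eps=1e-12):
    """Exponential integral E1(x) = int_x^inf exp(-t)/t dt for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: E1(x) = -c - ln(x) - sum_{r>=1} (-x)^r / (r * r!)
        total, term = 0.0, 1.0
        for r in range(1, max_iter):
            term *= -x / r                 # term = (-x)^r / r!
            total += term / r
            if abs(term / r) < eps * abs(total):
                break
        return -EULER_GAMMA - math.log(x) - total
    # Continued fraction evaluated with the modified Lentz method (x > 1).
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x)


def noise_gain(xi, gamma):
    """Assumed weighting function G(xi, gamma) applied to |Y|^2."""
    eta = gamma / (xi * (1.0 + xi))
    return math.exp(exp1(eta)) / (1.0 + xi) ** 2


# Very low a priori SNR: the bin is treated as noise only, gain close to 1.
gain_noise_only = noise_gain(1e-6, 1.0)
# Very high a priori SNR: gain * gamma approaches exp(-c), roughly 0.56.
gain_speech = noise_gain(1e4, 2.0)
```

The two limiting cases are a useful sanity check: when speech is absent the noisy periodogram passes through almost unchanged, and when speech dominates the estimate falls back to about exp(-c) times the prior noise PSD, so little speech energy leaks into the noise estimate.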

B. PRIORI SNR ESTIMATION FOR NOISE PSD ESTIMATION
It is observed from (22) that the weighting function takes the a priori and a posteriori SNRs as parameters. As these parameters are unknown in practice, they must be estimated. The noise PSD tracking performance depends on the particular a priori SNR estimator used. For the a priori SNR estimate, the DD approach and the ML approach were proposed in [2]. The DD approach is based on heuristic knowledge and is widely accepted in the literature. In this work, the standard DD a priori SNR estimator is exploited to estimate the a priori SNR used for the noise PSD estimate:

$$\hat{\xi}(l,k) = \max\left\{ \alpha_{NS}\,\frac{\hat{A}^2(l-1,k)}{\hat{\lambda}_d(l-1,k)} + (1-\alpha_{NS})\,\max\left(\gamma(l,k) - 1,\ 0\right),\ \xi_{\min} \right\}, \tag{24}$$

where $\xi_{\min} = -15$ dB is the minimum value allowed for the a priori SNR $\xi$, $\alpha_{NS}$ is the smoothing factor, $\hat{\lambda}_d$ is the estimated noise PSD, and $\hat{A}^2(l-1,k)$ is the speech spectral power estimate obtained in the previous frame. The smoothing factor $\alpha_{NS}$ typically lies in the range [0.9, 0.99] [45] and is set to 0.98 in this work.
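A minimal per-bin sketch of the decision-directed estimator is given below. It assumes the standard form of the DD rule and assumes that the a posteriori SNR of the current frame is computed with the noise PSD estimate of the previous frame, since the current one is not yet available; the names are hypothetical.

```python
def decision_directed(speech_power_prev, noise_psd_prev, noisy_power,
                      alpha=0.98, xi_min_db=-15.0):
    """Decision-directed a priori SNR estimate for one time-frequency bin.

    speech_power_prev : speech spectral power estimate of the previous frame
    noise_psd_prev    : noise PSD estimate of the previous frame
    noisy_power       : |Y|^2 of the current frame
    alpha, xi_min_db  : values quoted in the text (0.98 and -15 dB)
    """
    xi_min = 10.0 ** (xi_min_db / 10.0)
    gamma = noisy_power / noise_psd_prev           # a posteriori SNR (assumed)
    xi = (alpha * speech_power_prev / noise_psd_prev
          + (1.0 - alpha) * max(gamma - 1.0, 0.0))
    return max(xi, xi_min)


# No speech in the previous frame and a weak current frame: floor is reached.
xi_floor = decision_directed(0.0, 1.0, 0.5)
# Previous speech power 2, noise PSD 1, current |Y|^2 = 3: both terms agree.
xi_steady = decision_directed(2.0, 1.0, 3.0)
```

The floor of -15 dB corresponds to a linear value of about 0.0316, which keeps the argument of the weighting function finite when no speech has been detected.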

C. SAFETY NET
Moreover, as in the MMSE-based algorithm [31], an effective and simple safety net presented in [42] is adopted to ensure that the noise PSD estimator continues to work properly in the extreme situation where the noise power level changes abruptly from one level to another. The safety net requires some memory to store the previous 0.8 seconds of the smoothed periodogram $S(l,k)$ of the noisy speech $|Y(l,k)|^2$, where $S(l,k)$ is given by $S(l,k) = 0.1\, S(l-1,k) + 0.9\, |Y(l,k)|^2$. The minimum $S_{\min}(l,k)$ of $S(l,k)$ over this interval is used as a reference value. The noise PSD estimate $\hat{\lambda}_d(l,k)$ obtained with (23) is then checked against the condition $\hat{\lambda}_d(l,k)/S_{\min}(l,k) < 1.5$. If the condition is fulfilled, the final noise PSD estimate is updated by $\hat{\lambda}_d(l,k) = \max\{1.5 \cdot S_{\min}(l,k),\ \hat{N}^2(l,k)\}$.
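The safety net can be sketched for one frequency bin as follows. The buffer length of 50 frames is an assumption that corresponds to 0.8 s at a 16 ms frame shift (256-sample frames with 50% overlap at 8 kHz), and initialising the smoothed periodogram with the first frame is also an assumption; the class and method names are hypothetical.

```python
from collections import deque


class SafetyNet:
    """Per-bin safety net based on the minimum of a smoothed periodogram."""

    def __init__(self, n_frames=50, beta=0.1, factor=1.5):
        self.history = deque(maxlen=n_frames)   # last ~0.8 s of S(l, k)
        self.beta = beta
        self.factor = factor
        self.s = None

    def apply(self, noisy_power, noise_psd, noise_power_est):
        """Return the (possibly corrected) noise PSD estimate for this frame."""
        if self.s is None:
            self.s = noisy_power                # initialisation (assumption)
        else:
            self.s = self.beta * self.s + (1.0 - self.beta) * noisy_power
        self.history.append(self.s)
        s_min = min(self.history)
        if noise_psd < self.factor * s_min:
            # Estimate is implausibly low: lift it to the reference level.
            return max(self.factor * s_min, noise_power_est)
        return noise_psd


net = SafetyNet(n_frames=3)
lifted = net.apply(noisy_power=10.0, noise_psd=1.0, noise_power_est=2.0)
kept = net.apply(noisy_power=10.0, noise_psd=20.0, noise_power_est=2.0)
```

A bounded `deque` drops the oldest frame automatically, so the minimum is always taken over the most recent window without any explicit bookkeeping.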

V. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, several comparisons and experiments are carried out to evaluate the performance of the noise PSD trackers and to compare the proposed algorithm with four other state-of-the-art methods. Performance evaluations are conducted on the NOIZEUS database, which contains 30 IEEE sentences produced by three female and three male speakers [22], [46]. Clean speech signals are corrupted by five distinct types of noise sources at five input SNR levels, namely -5, 0, 5, 10, and 15 dB. The noise sources are modulated white Gaussian noise, babble noise from the NOISEX-92 database [47], passing car noise, passing train noise, and traffic noise. The modulated white Gaussian noise is obtained by modulating white Gaussian noise with a modulation function of the discrete-time index $n$, where $f_s$ is the sampling frequency and $f_{mod} = 0.2$ Hz denotes the modulation frequency. The passing car noise, passing train noise, and traffic noise are taken from the Freesound database [48]. Speech and noise signals used in our experiments are sampled at a frequency of $f_s = 8$ kHz. All noise PSD trackers employ an overlapping square-root-Hann window for spectral analysis and synthesis. The window length as well as the DFT length is $K = 256$ samples (32 ms), and the overlap between successive frames is 50%. In Section V-A, we first compare the noise estimation accuracy of all noise trackers in five different noise environments. Subsequently, in Section V-B the noise PSD estimators are integrated into a noise suppression framework and the speech enhancement performance is compared. Finally, the computational complexity is analyzed in Section V-C.
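The analysis and synthesis setup described above can be checked with a short sketch. It assumes a periodic square-root-Hann window (the text does not say whether the window is periodic or symmetric) and verifies that, with 50% overlap, the product of analysis and synthesis windows overlap-adds to one, which is the property that makes this window pair suitable for reconstruction.

```python
import math


def sqrt_hann(length):
    """Periodic square-root-Hann window (periodicity is an assumption)."""
    return [math.sqrt(0.5 - 0.5 * math.cos(2.0 * math.pi * n / length))
            for n in range(length)]


K = 256            # window and DFT length from the text
fs = 8000          # sampling frequency in Hz from the text
hop = K // 2       # 50% overlap

w = sqrt_hann(K)
frame_ms = 1000.0 * K / fs          # frame duration in milliseconds
# Analysis window times synthesis window is w^2; sum over overlapping frames.
ola = [w[n] ** 2 + w[n + hop] ** 2 for n in range(hop)]
```

Each entry of `ola` equals one up to rounding error, because the squared window is a Hann window and two Hann windows shifted by half their length sum to a constant.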

A. NOISE ESTIMATION ACCURACY
The noise estimation accuracy is measured using the averaged logarithmic spectral error distance between the estimated noise PSD $\hat{\lambda}_d(l,k)$ and the ideal reference noise PSD $\lambda_d(l,k)$. The ideal reference noise PSD $\lambda_d(l,k)$ is calculated by employing a recursive temporal smoothing of the noise periodograms [28], [34], i.e.,

$$\lambda_d(l,k) = \alpha_d\,\lambda_d(l-1,k) + (1-\alpha_d)\,|D(l,k)|^2, \tag{26}$$

with a smoothing parameter $\alpha_d = 0.9$ [28], [34]. The averaged logarithmic spectral error distance (LogErr) is defined as follows [28], [34]:

$$\mathrm{LogErr} = \frac{1}{LK}\sum_{l=0}^{L-1}\sum_{k=0}^{K-1}\left| 10\log_{10}\frac{\lambda_d(l,k)}{\hat{\lambda}_d(l,k)} \right|, \tag{27}$$

where $L$ and $K$ indicate the number of signal frames and frequency bins, respectively. The lower the LogErr value, the better the tracking capability.
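The reference PSD and the LogErr measure can be sketched as follows, assuming the reference is a first-order recursive average of the noise periodograms with factor 0.9 and that LogErr is the mean absolute log-ratio in dB over all frames and bins. Initialising the recursion with the first periodogram is an assumption, and the function names are hypothetical.

```python
import math


def reference_noise_psd(noise_periodograms, alpha_d=0.9):
    """Recursively smoothed noise periodograms, indexed as [frame][bin]."""
    reference = []
    prev = list(noise_periodograms[0])          # initialisation (assumption)
    for frame in noise_periodograms:
        prev = [alpha_d * p + (1.0 - alpha_d) * d for p, d in zip(prev, frame)]
        reference.append(prev)
    return reference


def log_err(reference_psd, estimated_psd):
    """Mean absolute log-spectral error in dB over all frames and bins."""
    total, count = 0.0, 0
    for ref_frame, est_frame in zip(reference_psd, estimated_psd):
        for ref, est in zip(ref_frame, est_frame):
            total += abs(10.0 * math.log10(ref / est))
            count += 1
    return total / count


# One frame, two bins: exact in the first bin, 10 dB off in the second.
err = log_err([[1.0, 10.0]], [[1.0, 1.0]])
ref = reference_noise_psd([[2.0], [2.0]])
```

Taking the absolute value means that overestimation and underestimation are penalised equally, so the measure alone does not reveal the sign of the error.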
To illustrate the noise tracking performance of the proposed method in comparison to the four competing trackers, we consider an example where three speech signals obtained from one female and two male speakers are concatenated and degraded by modulated white Gaussian noise at an overall SNR of 0 dB. In Fig. 1, the estimated noise PSDs are shown for the proposed method and the four competing noise estimators together with the ideal reference noise PSD. The clean and noisy speech signals are shown in Fig. 1(a) and Fig. 1(b), respectively. Fig. 1(c) exhibits the results of noise PSD estimation at frequency bin k = 36. This frequency bin index corresponds to the DFT band centered around 1125 Hz. Fig. 1(d) displays the estimated noise PSDs averaged over all frequency bins. It is observed that the proposed noise estimator tracks the increases and decreases in noise level much better than the other four approaches. As expected, MS is not capable of tracking the changes when the noise PSD increases. MCRA-2 is based on the minimum-tracking principle and therefore also shows a relatively large delay in tracking the increasing noise PSD. Compared to MS and MCRA-2, the MMSE and SPP algorithms perform better, but still have a tracking delay when the noise PSD rises.
In Fig. 2, we show a second example where the same speech signal is corrupted by passing train noise at an overall SNR of 0 dB. It is observed again that the proposed method handles both fast increases and decreases in noise level better than the four reference approaches. For a rapidly increasing noise PSD, i.e., in the time interval from 0 to 4 seconds, the proposed algorithm has the shortest tracking delay. When the noise is decreasing, e.g., in the time span from 5 to 8 seconds, the proposed method, MS, MMSE and SPP exhibit similar performance. Compared to MS, the MCRA-2 algorithm is slightly better in tracking the increasing noise level, but it tends to overestimate the noise PSD when the noise is decreasing. Fig. 3 shows another example where four different speech signals spoken by two male and two female speakers are concatenated and corrupted by modulated white noise at an SNR of 0 dB. It is evident from Fig. 3 that the proposed noise tracker shows a better tracking performance than the competing methods.
The quantitative evaluation results of the noise tracking performance of all noise PSD estimators are given in Fig. 4 in terms of the LogErr measure. It can be observed from Fig. 4 that the proposed algorithm outperforms the four competing methods in terms of LogErr for almost all noise sources and SNR levels, except for babble noise at 10 and 15 dB input SNR, where MMSE performs slightly better. As the proposed method can quickly update the noise PSD estimate, its advantage in tracking performance is most pronounced at low SNR conditions. However, as the SNR increases, the fast updating of the proposed tracker may lead to overestimation of the noise, and its LogErr increases.

B. NOISE SUPPRESSION PERFORMANCE
In order to investigate the impact of the noise PSD trackers on noise suppression performance, the estimated noise PSDs are incorporated into a DFT-domain single-channel speech enhancement system. The block diagram of the standard DFT-based single-channel speech enhancement framework is depicted in Fig. 5. For the speech estimator, this work employs an MMSE amplitude estimator, which is derived under the assumption that the speech DFT coefficients follow a generalized-Gamma distribution with distribution parameters γ = 1 and ν = 0.6 [4]. In this speech enhancement system, we estimate the a priori SNR using the decision-directed approach with a smoothing parameter α dd = 0.98.
The speech enhancement performance is evaluated in terms of the segSNR metric and three composite objective metrics. The segmental SNR (segSNR) is defined as follows [22], [40]:

$$\mathrm{segSNR} = \frac{1}{L}\sum_{l=0}^{L-1} T\left( 10\log_{10}\frac{\sum_{n=Nl}^{Nl+N-1} x^2(n)}{\sum_{n=Nl}^{Nl+N-1} \left(x(n)-\hat{x}(n)\right)^2} \right), \tag{28}$$

where $\hat{x}(n)$ is the enhanced signal, $L$ and $N$ denote the number of frames in the signal and the frame length, respectively, and $T(x) = \min\{\max(x, -10), 35\}$. For the segSNR computation, only the signal segments containing speech are taken into account. The segSNR values are limited to the range [−10 dB, 35 dB], thereby avoiding the need for a speech/silence detector. The segSNR results obtained with the different noise PSD tracking algorithms are given in Table 1. The proposed noise PSD tracker yields larger segSNR improvements than all other algorithms for every noise source except babble noise. For babble noise, MMSE obtains slightly higher segSNR values for input SNRs above 5 dB. However, the segSNR measure, which is widely used to evaluate the noise reduction performance of speech enhancement algorithms, correlates poorly with subjective measures. For this reason, three composite objective metrics are also employed to evaluate the enhancement performance.
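The segSNR measure can be sketched as below, assuming non-overlapping frames of length N and the clipping range of -10 dB to 35 dB stated in the text. Mapping a frame with zero error to the upper limit and a frame with zero signal energy to the lower limit is an assumption made to avoid taking the logarithm of zero; the function name is hypothetical.

```python
import math


def seg_snr(clean, enhanced, frame_len, lo=-10.0, hi=35.0):
    """Segmental SNR in dB, averaged over non-overlapping frames."""
    n_frames = len(clean) // frame_len
    total = 0.0
    for l in range(n_frames):
        seg = slice(l * frame_len, (l + 1) * frame_len)
        signal = sum(x * x for x in clean[seg])
        error = sum((x - y) ** 2 for x, y in zip(clean[seg], enhanced[seg]))
        if signal == 0.0:
            snr = lo                  # silent frame (assumption)
        elif error == 0.0:
            snr = hi                  # perfect reconstruction (assumption)
        else:
            snr = 10.0 * math.log10(signal / error)
        total += min(max(snr, lo), hi)    # clip to [lo, hi]
    return total / n_frames


clean = [1.0, 1.0, 1.0, 1.0]
snr_20 = seg_snr(clean, [0.9, 0.9, 0.9, 0.9], frame_len=2)   # 10% error
snr_max = seg_snr(clean, clean, frame_len=2)                 # clipped at 35 dB
```

An amplitude error of 10% in every sample gives an error energy of 1% of the signal energy, that is 20 dB per frame, which makes the first case easy to verify by hand.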
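The three composite objective metrics are $C_{sig}$, $C_{bak}$, and $C_{ovl}$, which are obtained by linearly combining existing widely used measures: segSNR, the weighted-slope spectral (WSS) distance [49], the perceptual evaluation of speech quality (PESQ) [50], the log likelihood ratio (LLR), and the Itakura-Saito (IS) distance measure [51]. The three composite metrics are given below [41]:

$$\begin{aligned} C_{sig} &= 3.093 - 1.029\,\mathrm{LLR} + 0.603\,\mathrm{PESQ} - 0.009\,\mathrm{WSS}, \\ C_{bak} &= 1.634 + 0.478\,\mathrm{PESQ} - 0.007\,\mathrm{WSS} + 0.063\,\mathrm{segSNR}, \\ C_{ovl} &= 1.594 + 0.805\,\mathrm{PESQ} - 0.512\,\mathrm{LLR} - 0.007\,\mathrm{WSS}. \end{aligned} \tag{29}$$

$C_{sig}$, $C_{bak}$ and $C_{ovl}$ are designed to provide high correlations with three subjective measures, i.e., the mean opinion score (MOS) predictor of speech distortion (SIG), the MOS predictor of background intrusiveness (BAK), and the MOS predictor of overall speech quality (OVRL).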
The scores of the three composite objective metrics obtained with all noise PSD estimation methods are shown in Figs. 6-8. Since the three composite measures correlate highly with subjective measures, and C ovl in particular has the highest correlation with real subjective tests, the evaluation results of the composite measures are more important than those of the segSNR measure. From the scores in Figs. 6-8, we observe that the proposed noise estimator is superior to the other noise tracking methods for all noise types and SNR conditions, except for babble noise at 15 dB.
Figs. 9 and 10 present the enhanced waveforms and spectrograms obtained with the different noise estimators for a speech example degraded by traffic noise at 5 dB input SNR. In this way, the enhancement performance of the speech enhancement algorithm combined with the different noise estimators can be seen more directly. Fig. 9(a)-(b) and Fig. 10(a)-(b) show the waveforms and spectrograms of the clean speech and the noisy speech, respectively. Fig. 9(c)-(f) display the enhanced speech waveforms obtained using the four competing noise trackers, and the respective spectrograms are shown in Fig. 10(c)-(f). Fig. 9(h) and Fig. 10(g) show the enhanced waveform and spectrogram obtained with the proposed algorithm, respectively. Additionally, Fig. 9(g) shows the estimated noise PSDs together with the ideal reference noise PSD. In this example, the proposed method performs better than the four competing algorithms. In general, the proposed approach shows a good trade-off between noise suppression and speech distortion, as it obtains higher segSNR values and higher scores on the three composite measures.

C. COMPUTATIONAL COMPLEXITY ANALYSIS
To investigate the computational complexity of the proposed algorithm and the four competing algorithms, we compare the execution times of Matlab implementations of these algorithms in this section [32]. The Matlab implementations run on a PC with an Intel Core i7-7700 processor. Table 2 shows the execution times of all five methods, normalized by the execution time of the proposed method. It is observed that the proposed algorithm exhibits a higher computational complexity than the other methods. The computational complexity of the proposed method is mainly determined by the computation of the nonlinear weighting function (22), as the exponential of the exponential integral function needs to be computed.
However, in a practical system, all nonlinear weighting functions can be computed offline for the relevant range of the parameters and stored in a lookup table. In this way, the noise PSD tracker can be implemented with significantly reduced execution time (normalized execution time: 0.52), and the computational complexity is then no longer an issue. In addition, the continuing growth of available computational power will further reduce the impact of this cost. Note that the numbers given in Table 2 are rough estimates, since they depend on implementation details. The number in Table 2 reflects all processing steps of the proposed algorithm.
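The lookup-table idea can be sketched as follows: evaluate a gain function once on a grid of a priori and a posteriori SNR values and replace the run-time evaluation by a nearest-neighbour table lookup. The grid range (-30 dB to 40 dB), the 1 dB resolution, the nearest-neighbour rule and the class name are all assumptions for illustration, and the gain function used in the usage lines is a simple stand-in, not the weighting function of the paper.

```python
import math


class GainTable:
    """Precomputed gain values on a uniform dB grid with nearest lookup."""

    def __init__(self, gain_fn, lo_db=-30.0, hi_db=40.0, step_db=1.0):
        self.lo_db = lo_db
        self.step_db = step_db
        self.n = int(round((hi_db - lo_db) / step_db)) + 1
        grid = [10.0 ** ((lo_db + i * step_db) / 10.0) for i in range(self.n)]
        # Offline step: evaluate the (expensive) gain function on the grid.
        self.table = [[gain_fn(xi, gamma) for gamma in grid] for xi in grid]

    def _index(self, value):
        db = 10.0 * math.log10(max(value, 1e-30))
        i = int(round((db - self.lo_db) / self.step_db))
        return min(max(i, 0), self.n - 1)       # clamp to the grid

    def __call__(self, xi, gamma):
        return self.table[self._index(xi)][self._index(gamma)]


# Stand-in gain function for demonstration only (not the paper's function).
table = GainTable(lambda xi, gamma: 1.0 / (1.0 + xi) ** 2)
g_mid = table(1.0, 1.0)        # 0 dB / 0 dB grid point
g_low = table(1e-9, 1.0)       # below the grid: clamped to the first row
```

At run time each bin then costs two logarithms and one table access instead of an exponential integral evaluation; a finer grid or interpolation would trade memory for accuracy.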

VI. CONCLUSION
A crucial component of single-channel speech enhancement algorithms is the estimation of noise PSD. This paper develops a novel algorithm for noise PSD estimation. In this method, a nonlinear weighting function of the log-spectral power MMSE estimator is derived to estimate instantaneous noise spectral power, which depends on the a priori and the a posteriori SNR. Then, the noise PSD estimation is updated by performing a temporal recursive averaging of log-spectral MMSE estimation of the current noise power. The smoothing parameter in the temporal recursive smoothing operation is adjusted by a simple estimate of speech presence probability.
Experimental results in terms of the LogErr measure demonstrate that the proposed algorithm achieves faster and more accurate noise PSD tracking. Additionally, evaluation results in terms of segSNR and the three composite measures (C sig , C bak , C ovl ) show that the enhancement performance of the proposed method is superior to that of the competing methods in the presence of various noise sources and levels. The overall performance improvements of the proposed noise tracker come with an increase in computational complexity, which is mainly determined by the computation of the nonlinear weighting function. However, in a practical system, all weighting functions can be evaluated offline and stored in a lookup table, so that the proposed method can be implemented with a significant decrease in computational complexity. As a result, the proposed method leads to a better trade-off between computational complexity and overall performance. The techniques developed in this paper are of importance for many applications, such as hearing aids, speaker identification, human-computer interaction and many others.

ACKNOWLEDGMENT
The authors would like to thank the Circuits and Systems (signal processing) Group at Delft University of Technology for providing the Matlab code of the MMSE speech estimator with generalized Gamma priors.