Pitch estimation from acoustic signals is a fundamental problem in many areas of speech research, and reliable estimation is difficult when the speech is corrupted by noise. This paper presents a study of pitch estimation in noisy speech based on a robust temporal-spectral representation and sparse reconstruction. We propose to accumulate spectral peaks over consecutive time frames. Since the harmonic structure of speech changes much more slowly than the noise spectrum, spectral peaks related to pitch harmonics stand out over the noise after accumulation. Experimental results show that the accumulated peak spectrum is indeed a robust representation of pitch harmonics. The accumulated peak spectrum is then expressed as a sparse linear combination of a large set of clean peak spectrum exemplars, and a Gaussian mixture density is used to model the noise spectrum peaks. The weights of the linear combination are estimated so as to maximize the likelihood of the accumulated peak spectrum under a sparsity constraint. Pitch is then estimated from the sparse weights and the corresponding peak spectrum exemplars. The use of the Gaussian mixture model makes the objective function for sparse weight estimation non-convex. By approximation and reformulation, two convex optimization approaches are developed to estimate the weights. Extensive experiments are carried out to evaluate the proposed pitch estimation algorithms under a wide variety of noise conditions. The results show that the proposed methods significantly and consistently outperform conventional methods, particularly at very low signal-to-noise ratios (e.g., SNR < -5 dB).
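The accumulation step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the peak rule (a bin strictly larger than both neighbours), the Hann window, the frame length, hop size, and FFT size, and the function names `peak_spectrum` and `accumulated_peak_spectrum` are all assumptions made for illustration. The abstract does not specify any of them. The sparse reconstruction and Gaussian mixture noise model are not covered here.

```python
import numpy as np

def peak_spectrum(frame, n_fft=1024):
    """Magnitude spectrum of one frame, with non-peak bins set to zero."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
    peaks = np.zeros_like(mag)
    inner = mag[1:-1]
    # Assumed peak rule: a bin strictly larger than both of its neighbours.
    is_peak = (inner > mag[:-2]) & (inner > mag[2:])
    peaks[1:-1][is_peak] = inner[is_peak]
    return peaks

def accumulated_peak_spectrum(signal, frame_len=512, hop=160, n_fft=1024):
    """Sum the per-frame peak spectra over consecutive frames."""
    acc = np.zeros(n_fft // 2 + 1)
    for start in range(0, len(signal) - frame_len + 1, hop):
        acc += peak_spectrum(signal[start:start + frame_len], n_fft)
    return acc

# Synthetic check (hypothetical data): three harmonics of 250 Hz
# buried in white noise at roughly -6 dB SNR.
fs, f0 = 8000, 250.0
t = np.arange(fs) / fs
clean = sum(a * np.sin(2 * np.pi * k * f0 * t)
            for k, a in [(1, 1.0), (2, 0.8), (3, 0.6)])
noisy = clean + 2.0 * np.random.default_rng(0).standard_normal(len(t))

acc = accumulated_peak_spectrum(noisy)
strongest_hz = np.argmax(acc) * fs / 1024
print(f"strongest accumulated peak: {strongest_hz:.1f} Hz")
```

In this toy setting the harmonic peaks fall in the same bins in every frame, while noise peaks land in different bins from frame to frame, so the harmonic bins dominate the accumulated spectrum. Real speech has a slowly varying pitch, so the accumulation window would presumably need to be short enough for the harmonics to stay roughly aligned.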