We present a new source-filter-based method for separating two speakers talking simultaneously at equal level, recorded by a single sensor. First, the relation between the spectrally whitened mixture and the speakers' excitation signals is analyzed; to this end, a factorial HMM that also captures temporal dependencies is exploited. The estimated excitation signals are then combined with the best-fitting vocal-tract information taken from a trained dictionary. We report results on the Cooke database, considering 108 speech mixtures. The average improvement of 2.9 dB in SIR over all data is lower than, but not significantly below, that of the Gaussian-mixture method, which relies on known pitch tracks. Although the performance is currently moderate, we believe this approach is a significant step towards the development of speaker-independent single-sensor speech separation.
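The source-filter model underlying the abstract splits speech into an excitation (source) signal and a vocal-tract (filter) component; whitening the mixture removes the vocal-tract envelope, and resynthesis re-applies a vocal-tract filter to an estimated excitation. The following is a minimal, hypothetical sketch of that building block using LPC inverse filtering — it is not the paper's implementation, and the toy signal, model order, and regularization constant are illustrative assumptions:

```python
# Hypothetical sketch of LPC-based source-filter decomposition (not the
# paper's method): inverse-filter a signal to obtain its whitened
# excitation, then re-apply the vocal-tract filter to resynthesize.
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """Autocorrelation-method LPC; returns [1, -a1, ..., -ap]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    # Small diagonal load (assumed value) keeps the solve well-posed.
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))

rng = np.random.default_rng(0)
# Toy "speech": white noise shaped by a fixed all-pole vocal tract.
speech = lfilter([1.0], [1.0, -0.7, 0.2], rng.standard_normal(1600))

A = lpc(speech, order=10)                 # vocal-tract estimate
excitation = lfilter(A, [1.0], speech)    # inverse filter -> whitened excitation
resynth = lfilter([1.0], A, excitation)   # excitation through vocal tract

# FIR analysis followed by its exact IIR inverse reconstructs the input.
print(np.allclose(resynth, speech, atol=1e-8))  # True
```

In the method described above, the excitation estimate would instead come from the factorial HMM applied to the whitened mixture, and the vocal-tract filter from the trained dictionary rather than from the mixture itself.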