Traditional approaches in speech-signal processing analyze short-time frames of the signal (e.g., via the short-time Fourier transform). Findings from auditory neurophysiology coupled with image processing principles, however, have motivated an alternative 2-D processing framework in which 2-D analysis is performed on the time-frequency distribution itself. This paper develops a 2-D model of speech in local time-frequency regions of narrowband spectrograms using sinusoidal-series-based modulation. Our model is shown to distribute vocal tract and onset/offset content based on source information (e.g., noise and voicing) in a transformed 2-D space, thereby explicitly representing different classes of energy modulations commonly observed in spectrograms. We demonstrate the model's ability to represent speech sounds by developing and evaluating algorithms for analysis/synthesis of spectrograms. As an example application, we demonstrate the utility of the model for co-channel speaker separation using prior pitch information of two overlapping speakers. Finally, our separation scheme based on 2-D modeling is compared against a reference (frame-based) sinusoidal separation system using both prior and estimated pitch.
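
To make the idea of 2-D analysis on a local time-frequency region concrete, the following Python sketch computes a narrowband magnitude spectrogram of a synthetic harmonic signal, extracts one local patch, and takes the 2-D Fourier transform of that patch. It is only an illustration under stated assumptions and is not the model or the algorithms developed in the paper: the function names, the 8 kHz sampling rate, the 32 ms Hamming window, the 64-by-32 patch size, and the constant-pitch test signal are all hypothetical choices made for this example.

```python
import numpy as np

def narrowband_spectrogram(x, win_len=256, hop=8, n_fft=512):
    """Magnitude STFT with a long (narrowband) Hamming window; shape (freq, time)."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T

def local_2d_spectrum(patch):
    """2-D Fourier magnitude of a mean-removed, tapered spectrogram patch."""
    taper = np.outer(np.hamming(patch.shape[0]), np.hamming(patch.shape[1]))
    centered = (patch - patch.mean()) * taper
    # fftshift moves the zero-modulation point to the centre of the array.
    return np.abs(np.fft.fftshift(np.fft.fft2(centered)))

def dominant_modulation(patch):
    """Offset (cycles per patch along frequency, along time) of the largest peak."""
    spec = local_2d_spectrum(patch)
    f_idx, t_idx = np.unravel_index(np.argmax(spec), spec.shape)
    return int(f_idx - patch.shape[0] // 2), int(t_idx - patch.shape[1] // 2)

# Assumed test signal: 15 equal-amplitude harmonics of a constant 200 Hz pitch.
fs, f0, n_fft = 8000, 200.0, 512
t = np.arange(int(0.5 * fs)) / fs
x = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, 16))

spec = narrowband_spectrogram(x, n_fft=n_fft)
patch = spec[32:96, 100:132]          # 64 frequency bins by 32 frames (500-1500 Hz)

f_off, t_off = dominant_modulation(patch)
patch_bandwidth_hz = patch.shape[0] * fs / n_fft
harmonic_spacing_hz = patch_bandwidth_hz / abs(f_off)
print(f_off, t_off, harmonic_spacing_hz)
```

For a stationary voiced sound such as this one, the harmonic lines in the patch run parallel to the time axis, so one would expect the dominant 2-D component to sit at zero temporal modulation, with its position along the other axis set by the harmonic (pitch) spacing. A changing pitch would tilt the harmonic lines and move that component off the axis, which is the kind of source-dependent placement in the transformed space that the abstract refers to.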