Handsets not seen during training (unseen handsets) are a significant source of performance degradation for speaker identification (SID) applications in telecommunication environments. This paper proposes a novel latent prosody analysis (LPA) approach that automatically extracts the most discriminative prosodic cues to assist conventional spectral feature-based SID. The LPA approach transforms the SID problem into a task resembling full-text document retrieval via 1) prosodic contour tokenization, 2) latent prosody analysis, and 3) speaker retrieval. Experimental results on the phonetically balanced, read-speech handset-TIMIT (HTIMIT) database demonstrated that fusing the LPA prosodic feature-based SID system with maximum-likelihood a priori handset knowledge interpolation (ML-AKI) spectral feature-based SID outperformed both the pitch and energy Gaussian mixture model (Pitch-GMM) and the bigram of the prosodic state (Bigram) counterparts, both when all handsets were counted and when only unseen handsets were counted.
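The three stages named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the three-symbol fall/steady/rise alphabet, the bigram tokens, the pitch threshold, the use of a truncated SVD (as in latent semantic analysis for text retrieval), cosine scoring, and all function names. It is not the authors' method.

```python
import itertools
import numpy as np

# Assumed alphabet: F = fall, S = steady, R = rise. Tokens are symbol bigrams.
SYMBOLS = "FSR"
VOCAB = ["".join(p) for p in itertools.product(SYMBOLS, repeat=2)]
INDEX = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(pitch, threshold=2.0):
    """Stage 1 (assumed scheme): quantize pitch deltas, then form bigram tokens."""
    delta = np.diff(np.asarray(pitch, dtype=float))
    symbols = ["R" if d > threshold else "F" if d < -threshold else "S"
               for d in delta]
    return [a + b for a, b in zip(symbols, symbols[1:])]


def count_vector(pitch):
    """Token counts for one contour, analogous to term counts for one document."""
    v = np.zeros(len(VOCAB))
    for tok in tokenize(pitch):
        v[INDEX[tok]] += 1
    return v


def train(contours, rank):
    """Stage 2 (assumed LSA-style): SVD of the token-by-speaker count matrix."""
    speakers = list(contours)
    counts = np.stack([count_vector(contours[s]) for s in speakers], axis=1)
    u, _, _ = np.linalg.svd(counts, full_matrices=False)
    basis = u[:, :rank]                      # latent prosody basis
    return speakers, basis, basis.T @ counts  # speakers in latent space


def retrieve(pitch, speakers, basis, latent):
    """Stage 3: project the query and return the speaker with highest cosine."""
    q = basis.T @ count_vector(pitch)
    norms = np.linalg.norm(latent, axis=0) * np.linalg.norm(q) + 1e-12
    sims = (latent.T @ q) / norms
    return speakers[int(np.argmax(sims))]


# Toy pitch contours in Hz (synthetic, for illustration only).
training = {
    "alice": [100, 105, 110, 115, 120, 120, 125, 130],  # mostly rising
    "bob":   [200, 195, 190, 185, 180, 180, 175, 170],  # mostly falling
    "carol": [150, 150, 151, 150, 149, 150, 151, 150],  # steady
}
speakers, basis, latent = train(training, rank=3)
query = [90, 96, 102, 108, 108, 114]  # a mostly rising contour
best = retrieve(query, speakers, basis, latent)
print(best)
```

In a real system the token vocabulary would be far larger than the retained rank, and the resulting prosodic score would be fused with a spectral feature-based score, as the abstract describes. The toy example keeps one latent dimension per speaker only so that the result is easy to check by hand.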
IEEE Transactions on Audio, Speech, and Language Processing (Volume: 15, Issue: 6)
Date of Publication: Aug. 2007