We propose a novel method of measuring the similarity between two or more speech utterances for speaker clustering, based on probability theory and factor analysis. The similarity function is formulated as the probability that the utterances originated from the same speaker. It uses statistical eigenvoice and eigenchannel models to incorporate physical knowledge of interspeaker and intraspeaker variabilities, which makes the similarity function trainable and robust. The similarity function can be computed efficiently from a compact set of sufficient statistics for each speech utterance, so the acoustic features can be discarded once the statistics are extracted. We first derive the function using only eigenvoices, and then show how eigenchannels can be incorporated to yield an expression of identical form but with a different set of sufficient statistics. We test the proposed model in a speaker clustering task using the CALLHOME telephone conversation corpus and show that it performs better than two other well-known similarity measures: the cross-likelihood ratio (CLR) and the generalized likelihood ratio (GLR).
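To make the idea of a same-speaker probability computed from sufficient statistics concrete, the following is a minimal sketch under strong simplifying assumptions and is not the formulation derived in the paper. It assumes a single Gaussian with diagonal variance `var` in place of a full acoustic model, a single eigenvoice direction `v`, a standard normal speaker factor, no eigenchannels, and a hypothetical prior `prior_same`. All function and variable names are illustrative. Under these assumptions, each utterance is summarized by its frame count and its centred first-order sum, and the likelihood terms that do not depend on how utterances are grouped cancel in the ratio.

```python
import math


def sufficient_stats(frames, mean):
    """Return (N, F): frame count and first-order statistic centred on `mean`."""
    n = len(frames)
    f = [sum(x[d] - mean[d] for x in frames) for d in range(len(mean))]
    return n, f


def log_evidence_term(n, f, v, var):
    """Grouping-dependent part of the log marginal likelihood of one cluster.

    Toy model (assumption): x_t = mean + v * y + noise, with y ~ N(0, 1)
    shared by all frames of one speaker and noise ~ N(0, diag(var)).
    Integrating out y leaves a term that depends only on (n, f).
    """
    precision = 1.0 + n * sum(vd * vd / s for vd, s in zip(v, var))
    b = sum(vd * fd / s for vd, fd, s in zip(v, f, var))
    return -0.5 * math.log(precision) + 0.5 * b * b / precision


def same_speaker_probability(stats_a, stats_b, v, var, prior_same=0.5):
    """Posterior probability that two utterances share one speaker factor."""
    (na, fa), (nb, fb) = stats_a, stats_b
    pooled_n = na + nb
    pooled_f = [x + y for x, y in zip(fa, fb)]
    # Log-likelihood ratio: one shared speaker versus two independent speakers.
    log_lr = (log_evidence_term(pooled_n, pooled_f, v, var)
              - log_evidence_term(na, fa, v, var)
              - log_evidence_term(nb, fb, v, var))
    logit = log_lr + math.log(prior_same / (1.0 - prior_same))
    # Numerically stable logistic function.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Hypothetical, noise-free toy data: two speakers on opposite sides of the mean.
mean = [0.0, 0.0]
v = [1.0, 0.5]
var = [1.0, 1.0]
utt_a = sufficient_stats([[2.0, 1.0]] * 50, mean)    # speaker factor y = +2
utt_b = sufficient_stats([[2.0, 1.0]] * 50, mean)    # same speaker as utt_a
utt_c = sufficient_stats([[-2.0, -1.0]] * 50, mean)  # speaker factor y = -2

p_same = same_speaker_probability(utt_a, utt_b, v, var)
p_diff = same_speaker_probability(utt_a, utt_c, v, var)
print(p_same, p_diff)
```

Once `sufficient_stats` has been called, the frames themselves are no longer needed, which mirrors the property described above. Extending the sketch to several eigenvoices would replace the scalar `precision` with a matrix and the division with a linear solve.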