Skip to Main Content
Spoken term detection (STD) is the name given to the task of searching large amounts of audio for occurrences of spoken terms, which are typically single words or short phrases. One reason that STD is a hard task is that search terms tend to contain a disproportionate number of out-of-vocabulary (OOV) words. The most common approach to STD uses subword units. This, in conjunction with some method for predicting pronunciations of OOVs from their written form, enables the detection of OOV terms but performance is considerably worse than for in-vocabulary terms. This performance differential can be largely attributed to the special properties of OOVs. One such property is the high degree of uncertainty in the pronunciation of OOVs. We present a stochastic pronunciation model (SPM) which explicitly deals with this uncertainty. The key insight is to search for all possible pronunciations when detecting an OOV term, explicitly capturing the uncertainty in pronunciation. This requires a probabilistic model of pronunciation, able to estimate a distribution over all possible pronunciations. We use a joint-multigram model (JMM) for this and compare the JMM-based SPM with the conventional soft match approach. Experiments using speech from the meetings domain demonstrate that the SPM performs better than soft match in most operating regions, especially at low false alarm probabilities. Furthermore, SPM and soft match are found to be complementary: their combination provides further performance gains.