Skip to Main Content
This paper presents our work on the retrieval of spoken information in Turkish. Traditional speech retrieval systems perform indexing and retrieval over automatic speech recognition (ASR) transcripts, which include errors either because of out-of-vocabulary (OOV) words or ASR inaccuracy. We use subword units as recognition and indexing units to reduce the OOV rate and index alternative recognition hypotheses to handle ASR errors. Performance of such methods is evaluated on our Turkish Broadcast News Corpus with two types of speech retrieval systems: a spoken term detection (STD) and a spoken document retrieval (SDR) system. To evaluate the SDR system, we also build a spoken information retrieval (IR) collection, which is the first for Turkish. Experiments showed that word segmentation algorithms are quite useful for both tasks. SDR performance is observed to be less dependent on the ASR component, whereas any performance change in ASR directly affects STD. We also present extensive analysis of retrieval performance depending on query length, and propose length-based index combination and thresholding strategies for the STD task. Finally, a new approach, which depends on the detection of stems instead of complete terms, is tried for STD and observed to give promising results. Although evaluations were performed in Turkish, we expect the proposed methods to be effective for similar languages as well.