Abstract:
Keyword search (KWS) is an important application of spoken language technology, and Large Vocabulary Continuous Speech Recognition (LVCSR) plays an important role in KWS systems. However, for a language with a large vocabulary and a relatively small text corpus, the vocabulary size grows very quickly as the amount of text increases, as we observed in Tamil. This makes it difficult to train a reliable language model, which may undermine KWS performance. Subword units have been successfully employed in KWS systems to handle the out-of-vocabulary (OOV) problem. Inspired by this, we propose a novel subword scheme, defined from the perspective of pronunciation, to alleviate the large vocabulary problem. We find that the subword-based system outperforms our best word-based system on Tamil conversational telephone speech. System combination experiments show that, when combined with the best word-based system, a single subword-based system contributes more complementary information than the other three word-based systems together.
Published in: 2014 IEEE Spoken Language Technology Workshop (SLT)
Date of Conference: 07-10 December 2014
Date Added to IEEE Xplore: 02 April 2015
Electronic ISBN: 978-1-4799-7129-9