Abstract:
Automatic speech recognition (ASR) systems can benefit from incorporating contextual information to improve recognition accuracy, especially for uncommon words or phrases. Current approaches like custom vocabularies or prompting with previous transcript segments provide limited contextual control. Compared to existing context-biasing methods, retrieval-augmented generation (RAG) promises more flexible and scalable contextual control by leveraging the broad knowledge of large language models (LLMs). To this end, we propose leveraging LLMs and RAG to enhance the contextual capabilities of ASR systems. Specifically, we propose systems based on text and audio LLMs to perform contextual error correction with context retrieved by querying a text-based retriever using the ASR module’s first-pass hypotheses and a frequency-based custom vocabulary (CV) list. Our experiments reveal that the fine-tuned system has effectively learned to extract the relevant context to perform error correction while maintaining robustness against noise.
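The sketch below illustrates the retrieve-then-correct flow the abstract describes: the first-pass ASR hypothesis plus CV terms form a query to a text retriever, and the retrieved context is handed to an LLM for error correction. All function names, the toy lexical retriever, and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of RAG-based contextual error correction for ASR.
# The retriever here is a toy word-overlap ranker; the paper's system uses
# a text-based retriever and fine-tuned text/audio LLMs.

from collections import Counter


def build_cv_list(corpus_tokens, top_k=50):
    """Frequency-based custom vocabulary (CV) list: the most frequent domain terms."""
    counts = Counter(corpus_tokens)
    return [word for word, _ in counts.most_common(top_k)]


def retrieve_context(query, documents, top_n=3):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_n]


def correct_hypothesis(first_pass_hypothesis, documents, cv_terms, llm):
    """Query the retriever with the first-pass hypothesis plus CV terms,
    then ask an LLM to rewrite the hypothesis using the retrieved context."""
    query = first_pass_hypothesis + " " + " ".join(cv_terms)
    context = retrieve_context(query, documents)
    prompt = (
        "Context:\n" + "\n".join(context) + "\n\n"
        "ASR hypothesis: " + first_pass_hypothesis + "\n"
        "Correct any recognition errors using the context; "
        "if the context is irrelevant, return the hypothesis unchanged."
    )
    # `llm` is any text-generation callable (placeholder for the fine-tuned LLM).
    return llm(prompt)
```

The "return the hypothesis unchanged" instruction in the prompt mirrors the robustness-to-noise property reported in the abstract: the model should not alter the transcript when the retrieved context is not relevant.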
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025