Abstract:
Although end-to-end (E2E) automatic speech recognition (ASR) systems excel in general tasks, they frequently struggle to accurately recognize personal rare words. Leveraging contextual information to bias the internal states of an E2E ASR model has proven to be an effective solution. However, most existing work focuses on biasing for a single domain, and it remains challenging to extend such contextualization mechanisms to many domains. To address this limitation, in this work we propose a hierarchical attention architecture that scales contextual biasing to a wide range of domains simultaneously. Given multiple catalogs of contextual information, the high-level attention determines which catalog to focus on, and the low-level attention learns to attend to the most relevant entity within the focused catalog. Experiments on diverse domains demonstrate that the proposed architecture yields 35% to 60% relative WER improvements on personal rare words and outperforms existing approaches.
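The two-level mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes plain scaled dot-product attention for both levels, random entity embeddings in place of learned ones, and a single decoder-state query vector; the paper's actual scoring functions, projections, and integration into the ASR decoder are not specified in the abstract.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    # Scaled dot-product attention of one query over a set of key vectors.
    # Returns the attention-weighted context vector and the weights.
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)
    return weights @ keys, weights

def hierarchical_bias(query, catalogs):
    # Low-level attention: summarize each catalog by attending over its entities.
    summaries = np.stack([attend(query, entities)[0] for entities in catalogs])
    # High-level attention: decide which catalog summary to focus on.
    bias, catalog_weights = attend(query, summaries)
    return bias, catalog_weights

# Hypothetical setup: a decoder-state query and three catalogs of entity
# embeddings (e.g. contacts, app names, song titles) of varying sizes.
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
catalogs = [rng.normal(size=(n, d)) for n in (5, 3, 7)]

bias, catalog_weights = hierarchical_bias(query, catalogs)
print(bias.shape)             # biasing vector, same dimension as the query
print(catalog_weights.sum())  # high-level weights form a distribution
```

The biasing vector produced this way would then be combined with the decoder's internal state; the abstract does not say how that combination is performed, so it is omitted here.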
Date of Conference: 16-20 December 2023
Date Added to IEEE Xplore: 19 January 2024