CIF-RNNT: Streaming ASR Via Acoustic Word Embeddings with Continuous Integrate-and-Fire and RNN-Transducers


Abstract:

This paper introduces CIF-RNNT, a model that incorporates Continuous Integrate-and-Fire (CIF) into RNN-Transducers (RNNTs) for streaming ASR via acoustic word embeddings (AWEs). CIF can dynamically compress long sequences into shorter ones, while RNNTs can produce multiple symbols given an input vector. We demonstrate that our model can not only segment acoustic information and produce AWEs in a streaming fashion, but also recover the represented word using a fixed set of output tokens with a shorter decoding time. Moreover, we improve CIF with new mechanisms that outperform conventional ones when evaluated on Japanese and English ASR datasets. As the first attempt at combining CIF with RNNT, this paper advances our understanding of applying CIF’s dynamic compression capabilities to obtain AWEs for streaming ASR and paves the way for speech and text integration via words instead of architecturally confined tokens.
Date of Conference: 14-19 April 2024
Date Added to IEEE Xplore: 18 March 2024
Conference Location: Seoul, Republic of Korea

1. INTRODUCTION

Past research leveraging CIF [1] has made strides in closing the gap between speech and text, as evidenced by studies such as [2], [3], [4], [5], [6]. However, these approaches are confined to token-level integration, which demands strict architectural agreement between the speech and text components on the set of output tokens. Although breaking through these architectural constraints is desirable, obtaining acoustic word embeddings (AWEs) for streaming ASR remains challenging, because existing methods are either architecturally unsuited to streaming [7] or require explicit timestamp durations for each word [8].
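
For readers unfamiliar with CIF, the compression it performs can be illustrated with a minimal sketch. This follows the basic integrate-and-fire rule as it is commonly described for CIF [1], not the improved mechanisms proposed in this paper. The function name `cif`, the `threshold` parameter, and the toy inputs are illustrative assumptions. The sketch also assumes each per-frame weight is below the threshold, so a frame fires at most once, and it simply drops any weight left over at the end of the sequence.

```python
import numpy as np

def cif(frames, alphas, threshold=1.0):
    """Basic integrate-and-fire compression (illustrative sketch).

    frames: (T, D) encoder outputs; alphas: T non-negative weights,
    each assumed to be smaller than `threshold`.
    Returns an (N, D) array of fired (integrated) vectors, N <= T.
    """
    frames = np.asarray(frames, dtype=float)
    fired = []
    acc_weight = 0.0                       # weight accumulated since last firing
    acc_state = np.zeros(frames.shape[1])  # weighted sum of frames since last firing
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Keep integrating: no boundary located at this frame.
            acc_weight += a
            acc_state += a * h
        else:
            # Boundary located: spend just enough weight to reach the
            # threshold, emit the integrated vector, and carry the rest over.
            used = threshold - acc_weight
            fired.append(acc_state + used * h)
            remainder = a - used
            acc_weight = remainder
            acc_state = remainder * h
    # Leftover weight at the end of the sequence is discarded in this sketch.
    return np.array(fired).reshape(-1, frames.shape[1])

# Four 1-dimensional frames are compressed into two integrated vectors.
out = cif([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.75, 0.75, 0.5])
print(out)
```

In a CIF-RNNT-style setup, each fired vector would play the role of an AWE from which the transducer emits the tokens of one word; how the weights are predicted and how the tail of the sequence is handled are left out of this sketch.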
