
Understanding Shared Speech-Text Representations


Abstract:

Recently, a number of approaches to training speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and speech translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First, we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation. Second, we inspect the similarities between activations of unimodal (speech or text) encoders as compared to the activations of a shared encoder. We find that the shared encoder learns a more compact and overlapping speech-text representation than the unimodal encoders. We hypothesize that this partially explains the effectiveness of the Maestro shared speech-text representations.
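The abstract does not state which similarity metric the second analysis uses. As an illustration only, a common way to compare activation matrices from two encoders on the same inputs is linear centered kernel alignment (CKA), which is invariant to rotation and isotropic scaling of the features and tolerates differing feature dimensions. A minimal sketch, with all variable names and dimensions hypothetical:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (n_examples, n_features) activations from two encoders on the
    same n_examples inputs; the feature dimensions may differ.
    Returns a similarity in [0, 1]; 1.0 means identical up to rotation
    and isotropic scaling.
    """
    # Center each feature column.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

# Hypothetical stand-ins for speech-encoder and text-encoder activations
# over the same 1000 inputs (random here, so they share no structure).
rng = np.random.default_rng(0)
speech_act = rng.normal(size=(1000, 64))
text_act = rng.normal(size=(1000, 32))

print(linear_cka(speech_act, speech_act))  # identical representations -> 1.0
print(linear_cka(speech_act, text_act))    # low for unrelated representations
```

Applied per layer, such a metric lets one ask whether a shared speech-text encoder places speech and text activations closer together than two separately trained unimodal encoders do, which is the kind of overlap the abstract describes.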
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023
Conference Location: Rhodes Island, Greece

