Performance-Efficiency Trade-Offs in Unsupervised Pre-Training for Speech Recognition


Abstract:

This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW-D (Squeezed and Efficient Wav2vec with Disentangled Attention), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW-D achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25–50% across different model sizes.
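The abstract leaves the "squeezed" design implicit. As a rough, non-authoritative sketch of the idea, one way to realize a squeezed context network is to downsample the frame sequence along time before the transformer layers and upsample it back afterward, so that the attention cost (quadratic in sequence length) is paid on a shorter sequence. The PyTorch code below is an illustration under that assumption; the module name SqueezedEncoder, the choice of average pooling, and all hyperparameters are hypothetical, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SqueezedEncoder(nn.Module):
    """Illustrative sketch of a "squeezed" context network: downsample the
    frame sequence along time before the (expensive) transformer layers and
    upsample back afterward, trading temporal resolution inside the encoder
    for a shorter sequence and cheaper attention."""

    def __init__(self, dim=512, num_layers=4, num_heads=8, squeeze_factor=2):
        super().__init__()
        # Downsample: average-pool every `squeeze_factor` frames.
        self.pool = nn.AvgPool1d(kernel_size=squeeze_factor, stride=squeeze_factor)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Upsample back to the original frame rate.
        self.unpool = nn.Upsample(scale_factor=squeeze_factor, mode="nearest")

    def forward(self, x):  # x: (batch, time, dim)
        t = x.size(1)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)    # (batch, time/s, dim)
        x = self.transformer(x)                             # attention on shorter sequence
        x = self.unpool(x.transpose(1, 2)).transpose(1, 2)  # (batch, ~time, dim)
        return x[:, :t]  # restore the original length (assumes time divisible by s)

x = torch.randn(2, 100, 512)       # 2 utterances, 100 frames, 512-dim features
print(SqueezedEncoder()(x).shape)  # torch.Size([2, 100, 512])
```

With a squeeze factor s, self-attention cost drops roughly by a factor of s², which is one plausible lever behind the reported inference speedups; SEW-D additionally uses DeBERTa-style disentangled attention in place of standard self-attention, which this sketch does not show.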
Date of Conference: 23-27 May 2022
Date Added to IEEE Xplore: 27 April 2022
Conference Location: Singapore, Singapore
