1. INTRODUCTION
Given the large imbalance between the amounts of available labeled and unlabeled data, self-supervised pre-training is key to good performance on low-resource downstream tasks. Representations of discrete input sequences, such as those in Natural Language Processing (NLP) applications, are typically learned through masked prediction [1], [2] or auto-regressive generation [3] of partially obfuscated inputs. For continuous inputs, such as those in Computer Vision (CV) applications, representations are often learned through instance classification, in which each image and its augmentations are treated as a single output class and contrasted against all other samples [4].
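As a rough illustration of the instance-classification idea, the sketch below computes an InfoNCE-style contrastive loss over a batch of paired embeddings; the names (`info_nce_loss`, the temperature value, and the random tensors standing in for encoder outputs of two augmented views) are illustrative assumptions, not part of any method described in this work.

```python
# Minimal sketch of instance classification with an InfoNCE-style loss.
# Assumes two embeddings per image, one for each augmented view; the encoder
# and augmentation pipeline producing them are hypothetical and omitted here.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrast each sample's two views against all other samples in the batch.

    z1, z2: (batch, dim) embeddings of two augmentations of the same images.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0))   # positive pair sits on the diagonal
    return F.cross_entropy(logits, targets)

# Random tensors standing in for encoder outputs of two views of 8 images.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z1, z2)
```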