
Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?


Abstract:

Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) There are multiple sound units in each input utterance, (2) With audio-only pre-training, there is no lexicon of sound units, and (3) Sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra-low-resource Libri-light 10h, 1h, and 10min supervised subsets.
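The sketch below is a minimal, simplified illustration of the objective described in the abstract: frame-level pseudo-labels come from an offline k-means teacher, and the cross-entropy loss is applied only at masked positions. It is not the paper's exact architecture or loss (the paper uses span masking and a different prediction head); the feature dimension, mask probability, and SmallEncoder module are illustrative stand-ins.

# Hedged sketch of HUBERT-style masked prediction on k-means targets (PyTorch + scikit-learn).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

FRAME_DIM, N_CLUSTERS, MASK_PROB = 39, 100, 0.08   # e.g. MFCC frames, a 100-cluster teacher

# Offline "teacher": cluster acoustic features to obtain frame-aligned pseudo-labels.
frames = torch.randn(2, 400, FRAME_DIM)             # (batch, time, feat) stand-in for real MFCCs
flat = frames.reshape(-1, FRAME_DIM).numpy()
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit(flat)
targets = torch.from_numpy(kmeans.predict(flat)).view(2, 400).long()   # one cluster id per frame

class SmallEncoder(nn.Module):
    """Illustrative stand-in for the paper's encoder; any sequence model works here."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(FRAME_DIM, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, N_CLUSTERS)
        self.mask_emb = nn.Parameter(torch.zeros(FRAME_DIM))   # learned "masked frame" vector

    def forward(self, x, mask):
        # Replace masked frames with the learned mask embedding before encoding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.encoder(self.proj(x)))

model = SmallEncoder()
mask = torch.rand(2, 400) < MASK_PROB                # which frames are masked (spans in the paper)
logits = model(frames, mask)

# Key ingredient: the predictive loss is computed over the masked positions only,
# so the model must infer the hidden units from context rather than copy the noisy teacher.
loss = F.cross_entropy(logits[mask], targets[mask])
loss.backward()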
Date of Conference: 06-11 June 2021
Date Added to IEEE Xplore: 13 May 2021
Conference Location: Toronto, ON, Canada

1. INTRODUCTION

Given the huge imbalance between available labeled and unlabeled data, self-supervised pre-training is key to good performance on low-resource downstream tasks. For discrete input sequences, as in Natural Language Processing (NLP) applications, representations are learned through either masked prediction [1], [2] or auto-regressive generation [3] of partially obfuscated input sequences. For continuous inputs, such as in Computer Vision (CV) applications, representations are often learned through instance classification, in which each image and its augmentations are treated as a single output class and contrasted against all other samples [4].
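As a concrete illustration of the instance-classification idea mentioned above (not taken from the paper), the following sketch computes an InfoNCE-style contrastive loss in which each sample's two augmented views form a positive pair and every other sample in the batch acts as a negative. The encoder outputs are replaced by random vectors, and the batch size, embedding dimension, and temperature are arbitrary placeholders.

# Hedged sketch of contrastive instance classification (InfoNCE-style loss).
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 64, 0.1
view_a = F.normalize(torch.randn(batch, dim), dim=-1)   # embeddings of one augmentation
view_b = F.normalize(torch.randn(batch, dim), dim=-1)   # embeddings of another augmentation

# Similarity of every view_a embedding to every view_b embedding in the batch.
logits = view_a @ view_b.t() / temperature
# Each image is its own class: the positive for row i is column i.
labels = torch.arange(batch)
loss = F.cross_entropy(logits, labels)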
