
Large-Scale Unsupervised Audio Pre-Training for Video-to-Speech Synthesis


Abstract:

Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Previous approaches train almost exclusively on data from audio-visual datasets, i.e., datasets in which every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets that lack a corresponding visual modality, such as audiobooks, radio podcasts, and speech recognition datasets. In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses only audio samples and requires neither labels nor corresponding samples from other modalities (visual, text). We demonstrate that this improves the reconstructed speech and that it is a previously unexplored way to improve the quality of the generator in a cross-modal task while requiring samples from only one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.
Page(s): 2255 - 2268
Date of Publication: 27 March 2024
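
The core idea in the abstract, pre-training an audio encoder-decoder on unlabeled audio and then reusing the decoder to initialize a video-to-speech model, can be sketched in a few lines. The sketch below is a hypothetical PyTorch illustration, not the authors' actual architecture: the module designs, the L1 reconstruction loss, and names such as AudioEncoder, AudioDecoder, and audio_only_loader are assumptions made for clarity.

    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        # Maps a raw 24 kHz waveform to a frame-level latent sequence (hypothetical design).
        def __init__(self, latent_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=400, stride=160),  # ~16.7 ms window, ~6.7 ms hop at 24 kHz
                nn.ReLU(),
                nn.Conv1d(64, latent_dim, kernel_size=3, padding=1),
            )

        def forward(self, wav):           # wav: (batch, 1, samples)
            return self.net(wav)          # (batch, latent_dim, frames)

    class AudioDecoder(nn.Module):
        # Reconstructs a raw waveform from the latent sequence (hypothetical design).
        def __init__(self, latent_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose1d(latent_dim, 64, kernel_size=400, stride=160),
                nn.ReLU(),
                nn.Conv1d(64, 1, kernel_size=3, padding=1),
            )

        def forward(self, z):
            return self.net(z)

    # Stage 1: unsupervised pre-training on audio-only data (no labels, no video).
    # A stand-in loader of random 1 s clips replaces the real 3,500 h corpus.
    audio_only_loader = [torch.randn(4, 1, 24000) for _ in range(2)]
    encoder, decoder = AudioEncoder(), AudioDecoder()
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
    for wav in audio_only_loader:
        recon = decoder(encoder(wav))
        n = min(recon.shape[-1], wav.shape[-1])   # strides may not divide the length exactly
        loss = nn.functional.l1_loss(recon[..., :n], wav[..., :n])
        opt.zero_grad()
        loss.backward()
        opt.step()
    torch.save(decoder.state_dict(), "pretrained_decoder.pt")

    # Stage 2: initialize the audio decoder of the video-to-speech model from stage 1.
    # A video encoder (lip-reading front-end) would then be trained jointly with this
    # decoder on paired audio-visual data.
    v2s_decoder = AudioDecoder()
    v2s_decoder.load_state_dict(torch.load("pretrained_decoder.pt"))

The paper explores both raw-waveform and mel-spectrogram targets; under the same scheme, a spectrogram decoder would be pre-trained and initialized analogously, with a vocoder producing the final waveform.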
