1. INTRODUCTION
Text-to-speech (TTS) and voice conversion (VC) have been actively studied as methods for generating desired speech. Recently developed TTS and VC systems commonly adopt a two-stage approach: the first model predicts an intermediate representation (e.g., a mel spectrogram) from the input data (e.g., text or speech), and the second model synthesizes speech from the predicted intermediate representation. The second model, called a neural vocoder, has been extensively studied through autoregressive models (e.g., WaveNet [1] and WaveRNN [2]) and non-autoregressive models, including distillation-based models (e.g., Parallel WaveNet [3] and ClariNet [4]), flow-based models building on Glow [5] (e.g., WaveGlow [6]), diffusion-based models [7], [8] (e.g., WaveGrad [9] and DiffWave [10]), and generative adversarial network (GAN)-based models [11] (e.g., [12]–[27]). This study focuses on GAN-based models because they are fast, lightweight, and capable of high-quality synthesis.