Abstract:
Although the diffusion probabilistic vocoders WaveGrad and DiffWave can realize real-time, high-fidelity speech synthesis with a simple training loss function, a single model predicts all noise components over the full range of noise levels in every iteration. This paper proposes a simple but effective noise level-limited sub-modeling framework for the diffusion probabilistic vocoders Sub-WaveGrad and Sub-DiffWave. The proposed method also provides a DiffWave variant conditioned on a continuous noise level, as in WaveGrad, and spectral enhancement post-filtering. The proposed Sub-WaveGrad and Sub-DiffWave models are each realized using 10 sub-models. These sub-models are trained separately with different noise level limits, and only the sub-models required by the noise schedule are used during inference. The results of experiments using a Japanese female speech corpus indicate that both the proposed Sub-WaveGrad and Sub-DiffWave outperform vanilla WaveGrad and DiffWave in terms of model accuracy and synthesis quality while retaining the inference speed.
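As an illustration of the inference-time selection described in the abstract, the following minimal sketch maps each noise level in a schedule to one of 10 sub-models. The abstract does not state how the noise level limits are chosen, so the equal-width partition of [0, 1], the function names, and the example schedule are all hypothetical and are not taken from the paper.

```python
import bisect

NUM_SUB_MODELS = 10  # the abstract states that 10 sub-models are used


def make_limits(num_sub_models=NUM_SUB_MODELS):
    """Return (lower, upper) noise level limits for each sub-model.

    Assumption: equal-width bins over [0, 1]. The actual limits used
    in the paper are not given in the abstract.
    """
    return [(k / num_sub_models, (k + 1) / num_sub_models)
            for k in range(num_sub_models)]


def select_sub_model(noise_level, limits):
    """Return the index of the sub-model whose range covers noise_level."""
    uppers = [upper for _, upper in limits]
    # First bin whose upper limit is >= noise_level, clamped to the last bin.
    return min(bisect.bisect_left(uppers, noise_level), len(limits) - 1)


def required_sub_models(noise_schedule, limits):
    """Return the sorted indices of the sub-models a schedule actually needs."""
    return sorted({select_sub_model(level, limits) for level in noise_schedule})


if __name__ == "__main__":
    limits = make_limits()
    schedule = [0.001, 0.01, 0.05, 0.2, 0.5, 0.95]  # hypothetical noise levels
    print(required_sub_models(schedule, limits))
```

Under these assumptions, a schedule that visits only a few noise levels needs only a few of the 10 sub-models, which is consistent with the abstract's statement that only the necessary sub-models are used during inference.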
Published in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 June 2021
Date Added to IEEE Xplore: 13 May 2021
- IEEE Keywords
- Index Terms
- Accuracy Of Model
- Range Of Levels
- Noise Components
- Inference Speed
- Training Loss Function
- Speech Synthesis
- Synthesis Quality
- Simple Noise
- Average Score
- Time Domain
- Utterances
- Mixture Model
- Weighting Factor
- Generative Adversarial Networks
- Average Loss
- Synthesis Conditions
- Total Set
- Short-time Fourier Transform
- Langevin Dynamics
- Autoregressive Structure
- Mean Opinion Score
- NVIDIA Tesla V100 GPU
- Hanning Window
- Early Iterations
- Dilated Convolution Layers
- Acoustic Model