Loading [MathJax]/extensions/TeX/cellcolor_ieee.js
Decoding Knowledge Transfer for Neural Text-to-Speech Training | IEEE Journals & Magazine | IEEE Xplore

Decoding Knowledge Transfer for Neural Text-to-Speech Training


Abstract:

Neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways. However, the exposure bias problem, that arises from the mismatch bet...Show More

Abstract:

Neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways. However, the exposure bias problem, that arises from the mismatch between the training and inference process in autoregressive models, remains an issue. It often leads to performance degradation in face of out-of-domain test data. To address this problem, we study a novel decoding knowledge transfer strategy, and propose a multi-teacher knowledge distillation (MT-KD) network for Tacotron2 TTS model. The idea is to pre-train two Tacotron2 TTS teacher models in teacher forcing and scheduled sampling modes, and transfer the pre-trained knowledge to a student model that performs free running decoding. We show that the MT-KD network provides an adequate platform for neural TTS training, where the student model learns to emulate the behaviors of the two teachers, at the same time, minimizing the mismatch between training and run-time inference. Experiments on both Chinese and English data show that MT-KD system consistently outperforms the competitive baselines in terms of naturalness, robustness and expressiveness for in-domain and out-of-domain test data. Furthermore, we show that knowledge distillation outperforms adversarial learning and data augmentation in addressing the exposure bias problem.
Page(s): 1789 - 1802
Date of Publication: 03 May 2022

ISSN Information:

Funding Agency:


References

References is not available for this document.