Factorized-VITS: Decoupling Prosody and Text in End-to-End Speech Synthesis without External or Secondary Aligner | IEEE Conference Publication | IEEE Xplore

Abstract:

We propose Factorized-VITS, an advanced end-to-end text-to-speech model that incorporates explicit text-side prosody modeling control into VITS while achieving a clean factorization of the audio prior hidden space into text and prosody subspaces. Unlike previous works that rely on external or secondary aligners, Factorized-VITS is the first work attempting to do on-the-fly alignment in the factorized text subspace without introducing extra parameters, which not only simplifies the training procedure but also enables the use of a more complex prosody prior. Our experiments demonstrate the accuracy and effectiveness of this approximation strategy. Furthermore, we implement an in-context learning joint predictor for pitch, energy, and duration, which offers a flexible streaming deployment option.
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India