Abstract:
We propose Factorized-VITS, an advanced end-to-end text-to-speech model that incorporates explicit text-side prosody control into VITS while achieving a clean factorization of the audio prior's hidden space into text and prosody subspaces. Unlike previous works that rely on external or secondary aligners, Factorized-VITS is the first to attempt on-the-fly alignment in the factorized text subspace without introducing extra parameters, which not only simplifies the training procedure but also enables the use of a more complex prosody prior. Our experiments demonstrate the accuracy and effectiveness of this approximation strategy. Furthermore, we implement an in-context-learning joint predictor for pitch, energy, and duration, which offers a flexible streaming deployment option.
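To make the factorization idea concrete, here is a minimal sketch, not the authors' code: the prior hidden states are split channel-wise into a text subspace and a prosody subspace, and the VITS-style alignment score matrix is computed only over the text-subspace channels, so alignment introduces no extra parameters. All dimension names, the subspace widths, and the Gaussian scoring below are illustrative assumptions.

import math
import torch

B, T_text, T_frame = 1, 6, 20        # batch, text tokens, audio frames (assumed)
D_text, D_pros = 24, 8               # assumed subspace widths

h = torch.randn(B, T_text, D_text + D_pros)  # factorized prior hidden states
z = torch.randn(B, T_frame, D_text)          # frame-level latents (text part)

# Clean channel-wise factorization into text and prosody subspaces.
h_text, h_pros = h[..., :D_text], h[..., D_text:]

# Gaussian log-score of each (token, frame) pair, computed in the text
# subspace only; in VITS this matrix would feed monotonic alignment search.
log_p = -0.5 * ((z.unsqueeze(1) - h_text.unsqueeze(2)) ** 2).sum(-1) \
        - 0.5 * D_text * math.log(2 * math.pi)  # shape [B, T_text, T_frame]
print(log_p.shape)  # torch.Size([1, 6, 20])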
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025