Factorized-VITS: Decoupling Prosody and Text in End-to-End Speech Synthesis without External or Secondary Aligner | IEEE Conference Publication | IEEE Xplore