Abstract:
Zero-shot voice conversion (VC) aims to alter the speaker identity of an utterance so that it resembles a target speaker, using only a short reference utterance from that speaker. While existing methods have achieved notable success in generating intelligible speech, balancing the trade-off between the quality of the converted voice and its similarity to the target remains a challenge, especially when the target reference is short. This paper proposes ExVC, a zero-shot VC model that uses mixture-of-experts (MoE) layers and Conformer modules to enhance expressiveness and overall performance. Additionally, to condition the model efficiently on the speaker embedding, we employ feature-wise linear modulation (FiLM), which scales and shifts the network's intermediate features based on the input speaker embedding, thereby improving its ability to adapt to unseen speakers. Objective and subjective evaluations demonstrate that the proposed model outperforms the baseline models in terms of naturalness and quality. Audio samples are provided at: https://tksavy.github.io/exvc/.
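The FiLM conditioning mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the ExVC implementation: the abstract does not say where FiLM is applied or what dimensions are used, so the function name `film`, the weight names, and all sizes below are hypothetical. The sketch only shows the general idea of FiLM, in which a speaker embedding is projected to a per-channel scale (gamma) and shift (beta) that are applied to every frame of a feature sequence.

```python
import numpy as np


def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation to a feature sequence.

    features:          (T, C) array, T frames with C channels each
    speaker_embedding: (D,) array
    w_gamma, w_beta:   (D, C) projection matrices (hypothetical names)
    b_gamma, b_beta:   (C,) biases
    """
    # Project the speaker embedding to one scale and one shift per channel.
    gamma = speaker_embedding @ w_gamma + b_gamma  # shape (C,)
    beta = speaker_embedding @ w_beta + b_beta     # shape (C,)
    # Broadcasting applies the same gamma and beta to every frame.
    return gamma * features + beta


# Illustrative sizes only; none of these come from the paper.
rng = np.random.default_rng(0)
T, C, D = 50, 8, 16
features = rng.standard_normal((T, C))
speaker_embedding = rng.standard_normal(D)

w_gamma = 0.1 * rng.standard_normal((D, C))
b_gamma = np.ones(C)    # bias of 1 so a zero embedding gives gamma = 1
w_beta = 0.1 * rng.standard_normal((D, C))
b_beta = np.zeros(C)    # bias of 0 so a zero embedding gives beta = 0

modulated = film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta)
identity = film(features, np.zeros(D), w_gamma, b_gamma, w_beta, b_beta)
```

With the biases chosen here, a zero speaker embedding leaves the features unchanged, so any change in the output comes from the speaker-dependent scale and shift. In a trained model the projections would be learned jointly with the rest of the network.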
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025