ExVC: Leveraging Mixture of Experts Models for Efficient Zero-shot Voice Conversion | IEEE Conference Publication | IEEE Xplore


Abstract:

Zero-shot voice conversion (VC) aims to alter the speaker identity of an utterance so that it resembles a target speaker, using only a short reference recording. While existing methods have achieved notable success in generating intelligible speech, balancing the trade-off between the quality of the converted voice and its similarity to the target remains a challenge, especially when the target reference is short. This paper proposes ExVC, a zero-shot VC model that leverages mixture-of-experts (MoE) layers and Conformer modules to enhance expressiveness and overall performance. Additionally, to condition the model efficiently on the speaker embedding, we employ feature-wise linear modulation (FiLM), which modulates the network based on the input speaker embedding, thereby improving its ability to adapt to unseen speakers. Objective and subjective evaluations demonstrate that the proposed model outperforms the baseline models in terms of naturalness and quality. Audio samples are provided at: https://tksavy.github.io/exvc/.
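
The abstract names two mechanisms, FiLM conditioning and MoE layers, without giving their form. The sketch below illustrates the generic versions of both in plain Python: FiLM scales and shifts each feature channel with values predicted from a speaker embedding, and a sparse MoE layer routes a frame to its top-scoring experts. It is not the ExVC architecture. All function names, dimensions, the linear experts, and the top-k routing rule are assumptions made for illustration.

```python
import math

def linear(weights, bias, x):
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(features, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: per-channel scale (gamma) and shift (beta), both
    predicted from the speaker embedding and shared across all frames.
    features is a list of frames; each frame is a list of channel values."""
    gamma = linear(w_gamma, b_gamma, speaker_emb)
    beta = linear(w_beta, b_beta, speaker_emb)
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)]
            for frame in features]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe(frame, gate_w, gate_b, experts, top_k=1):
    """Generic sparse MoE for one frame: a linear gate scores the experts,
    the top_k experts run, and their outputs are mixed with gate weights
    renormalized over the chosen experts. Experts here are (W, b) linear
    maps (an assumption; real experts are usually small feed-forward nets)."""
    probs = softmax(linear(gate_w, gate_b, frame))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(frame)
    for i in chosen:
        w, b = experts[i]
        y = linear(w, b, frame)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out

# Toy usage with hypothetical sizes: 2 feature channels, 2-dim speaker embedding.
zeros = [[0.0, 0.0], [0.0, 0.0]]
ident = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
experts = [(ident, [0.0, 0.0]), (double, [0.0, 0.0])]

modulated = film([[1.0, 2.0]], [0.5, -0.5], zeros, [2.0, 2.0], zeros, [1.0, 1.0])
routed = moe(modulated[0], zeros, [0.0, 5.0], experts, top_k=1)
```

In this toy setup the gate bias favors the second expert, so with `top_k=1` only that expert runs. Because only the selected experts are evaluated, a sparse MoE layer can add capacity without a proportional increase in per-frame compute.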
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
