Abstract:
In this paper, we present a voice conversion system that improves the quality of generated voice and its similarity to the target voice style significantly. Many VC syste...Show MoreMetadata
Abstract:
In this paper, we present a voice conversion system that improves the quality of generated voice and its similarity to the target voice style significantly. Many VC systems use feature-disentangle-based learning techniques to separate speakers' voices from their linguistic content in order to translate a voice into another style. This is the approach we are taking. To prevent speaker-style information from obscuring the content embedding, some previous works quantize or reduce the dimension of the embedding. However, an imperfect disentanglement would damage the quality and similarity of the sound. In this paper, to further improve quality and similarity in voice conversion, we propose a novel style transfer method within an autoencoder-based VC system that involves generative adversarial training. The conversion process was objectively evaluated using the fair third-party speaker verification system, the results shows that ASGAN-VC outperforms VQVC + and AGAINVC in terms of speaker similarity. A subjectively observing that our proposal outperformed the VQVC + and AGAINVC in terms of naturalness and speaker similarity.
Published in: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Date of Conference: 07-10 November 2022
Date Added to IEEE Xplore: 21 December 2022
ISBN Information: