Abstract:
We propose a novel cross-lingual voice conversion (VC) method using a cyclic variational auto-encoder (CycleVAE). Voice conversion transforms the voice of one speaker into that of another, while cross-lingual VC performs this conversion between speakers who speak different languages. VC methods based on parallel learning require accented speech uttered by the source or target speaker using the pronunciation system of that speaker's mother tongue. VC methods that use a non-parallel learning approach, by contrast, can use natural speech from both the source and target speakers, produced in their own native languages, but must then deal with time-alignment and language mismatches. To address these issues, we apply CycleVAE, a sophisticated non-parallel VC method, to cross-lingual VC. We also use the WaveNet vocoder in the waveform generation process of CycleVAE-VC to improve overall conversion quality. Objective and subjective experiments on cross-lingual VC from a native English speaker to a native Japanese speaker confirm that the proposed method achieves higher naturalness and speaker similarity than a conventional RNN-based parallel VC method using accented speech.
Published in: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Date of Conference: 07-10 December 2020
Date Added to IEEE Xplore: 31 December 2020
Conference Location: Auckland, New Zealand