Adversarial Speech for Voice Privacy Protection from Personalized Speech Generation


Abstract:

Rapid progress in personalized speech generation, including personalized text-to-speech (TTS) and voice conversion (VC), makes it difficult for human listeners to distinguish generated speech from real speech, creating an urgent need to protect speakers' voices from malicious misuse. To this end, we propose a speaker protection method based on adversarial attacks. The proposed method perturbs speech signals, altering the original speech minimally while rendering downstream speech generation models unable to accurately generate the voice of the target speaker. For validation, we employ the open-source pre-trained YourTTS model for speech generation and protect the target speaker's speech in the white-box scenario. Automatic speaker verification (ASV) evaluations are carried out on the generated speech to assess the voice protection capability. Our experimental results show that the gradient-based I-FGSM adversarial perturbation method successfully perturbs the speaker encoder of the YourTTS model. Furthermore, the adversarial perturbation is effective in preventing the YourTTS model from generating the speech of the target speaker. Audio samples can be found at https://voiceprivacy.github.io/Adeversarial-Speech-with-YourTTS.
Date of Conference: 14-19 April 2024
Date Added to IEEE Xplore: 18 March 2024
Conference Location: Seoul, Korea, Republic of

1. INTRODUCTION

In the era of deep learning, speech generation techniques have developed rapidly, greatly improving the quality and naturalness of synthetic speech. Among them, personalized speech generation techniques, including text-to-speech synthesis (TTS) [1] and voice conversion (VC) [2], can generate the speech of a target speaker with high speaker similarity, raising security risks concerning voice privacy. Specifically, given reference utterances of a target speaker, multi-speaker TTS and VC technologies can be used to generate utterances in that speaker's voice with fabricated content, which can potentially be used to spoof a speaker authentication system or manipulate public opinion.
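
To make the I-FGSM perturbation mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear map `W` as a stand-in for the YourTTS speaker encoder, and it assumes the objective is to lower the cosine similarity between the embedding of the protected utterance and a reference embedding `v` of the same speaker; the objective actually used in the paper is not stated in this excerpt. The names `cosine`, `cosine_grad_wrt_input`, and `ifgsm`, as well as the values of `eps`, `alpha`, and `steps`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_grad_wrt_input(W, x, v):
    """Analytic gradient of cos(W @ x, v) with respect to x.

    W is a toy linear 'speaker encoder' (an assumption for this sketch);
    a real encoder would need automatic differentiation instead.
    """
    u = W @ x
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    d_cos_du = v / (nu * nv) - (u @ v) * u / (nu ** 3 * nv)
    return W.T @ d_cos_du

def ifgsm(W, x, v, eps=0.01, alpha=0.001, steps=20):
    """Iterative FGSM: signed-gradient steps, projected onto an L-inf ball."""
    x_adv = x.copy()
    for _ in range(steps):
        g = cosine_grad_wrt_input(W, x_adv, v)
        x_adv = x_adv - alpha * np.sign(g)        # step that lowers similarity
        x_adv = np.clip(x_adv, x - eps, x + eps)  # keep perturbation within eps
        x_adv = np.clip(x_adv, -1.0, 1.0)         # keep a valid waveform range
    return x_adv

# Synthetic data standing in for two utterances of one speaker.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 400))                 # toy encoder weights
speaker = 0.1 * rng.standard_normal(400)           # shared 'speaker' component
x = np.clip(speaker + 0.02 * rng.standard_normal(400), -1.0, 1.0)
x_ref = np.clip(speaker + 0.02 * rng.standard_normal(400), -1.0, 1.0)
v = W @ x_ref                                      # reference speaker embedding

x_adv = ifgsm(W, x, v)
print("similarity before:", cosine(W @ x, v))
print("similarity after: ", cosine(W @ x_adv, v))
print("max |perturbation|:", np.max(np.abs(x_adv - x)))
```

In this sketch the bound `eps` plays the role of the "minimal alteration" constraint: every sample of the protected signal stays within `eps` of the original, while the embedding is pushed away from the speaker's reference. In the white-box setting described in the abstract, the gradient would come from the real speaker encoder rather than from a closed-form expression.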
