1. INTRODUCTION
In the era of deep learning, speech generation techniques have developed rapidly, greatly improving the quality and naturalness of synthetic speech. Among them, personalized speech generation techniques, including text-to-speech synthesis (TTS) [1] and voice conversion (VC) [2], can generate the speech of a target speaker with high speaker similarity, which raises security risks concerning voice privacy. Specifically, given reference utterances from a target speaker, multi-speaker TTS and VC systems can be used to generate utterances in that speaker's voice with fabricated content, which could in turn be used to spoof a speaker authentication system or to manipulate public opinion.