Introduction
Speaker recognition systems, which determine a speaker's identity from characteristics of the voice signal, are widely used in voice services such as identity verification for banking, judicial authentication, and personalized services on smart devices. The development of neural networks has improved the performance of speaker recognition systems, but these systems are vulnerable to adversarial examples [1]. We therefore study defense methods that can be deployed in speaker recognition systems and resist a variety of adversarial attacks.
A literature review [2] points out that most defense methods rely on prior knowledge of attacks and can become ineffective once attackers strengthen the original attacks. There is therefore an urgent need for a defense that resists attacks without degrading the original model. Our method addresses exactly this problem: it defends effectively against different attacks with minimal impact on the model's accuracy.
To explore the relationship between generative models and robustness, this paper uses generative adversarial networks to learn the distribution of input features. Our method is inspired by [3] in the image domain. Unlike ours, their generator learns only the mapping from the distribution of adversarial examples to the distribution of benign examples in order to eliminate adversarial perturbations. In a real production scenario, however, the defense inevitably receives benign data as well. Our generator therefore learns to map natural data to benign data, where natural data contains both adversarial examples and benign examples. The model in this paper is based on CycleGAN-VC2 [4] from the speech synthesis domain. To better defend against adversarial examples, we make minor modifications to the loss function of CycleGAN-VC2 and name the resulting defenses CYC-L1 and CYC-L2 according to the loss function used.
Our defense approach has the following contributions:
We propose a defense against adversarial examples in the field of speaker recognition based on the CycleGAN-VC2 model. This method eliminates the perturbations existing in natural data by mapping natural data to benign data.
In CycleGAN-VC2, the $L_{1}$ distance was used for model training, and no prior work in speaker recognition defense has used CycleGAN-VC2 with the $L_{2}$ distance. When we used the $L_{1}$ distance to defend against adversarial examples, we found it less effective than the defense using the $L_{2}$ distance. CycleGAN-VC2 is trained on data where one speaker corresponds to one speaker, whereas our defense is trained on data where multiple speakers correspond to multiple speakers; the $L_{2}$ distance encourages the model to use more features, which fits our defense scenario. We therefore chose the $L_{2}$ distance for the defense, and experimental results show that in most cases the $L_{2}$-based method achieves better defense results. Moreover, we use a method similar to decremental learning during training, which alleviates the difficulty of training generative adversarial networks and greatly reduces training time.
To test the robustness of our method, we defend against different attacks and compare with other defenses. The experimental results show that our defense method outperforms the other defense methods.
The remainder of this paper is divided into six sections. Section II focuses on the basics of speaker recognition. Then, we introduce the attack methods and defense methods used in the experiments of this paper. In Section III, we discuss the work related to defense in speaker recognition systems. In Section IV, we describe the proposed method in detail. In Section V, we describe the dataset, experimental environment, parameter settings, and model architecture, and provide a detailed description of the evaluation metrics. In Section VI, we present and analyze the experimental results. Finally, in Section VII, we give an overview of the effectiveness of the defense method proposed in this paper and present the shortcomings of the experiments and future research directions.
Background
A. Basics of Speaker Recognition
Speaker recognition, also called voiceprint recognition, is a technology that distinguishes speakers based on characteristics of their voices. The flow of a typical speaker recognition system is shown in Figure 1. To recognize a user, the system first extracts relevant features such as MFCC [5]. A model is then trained and stored in a model database. In the testing phase, features are extracted from the voice to be recognized, compared for similarity with the data in the database, scored, and used to determine the speaker of the current voice.
In addition, speaker recognition systems have three main tasks: closed-set identification (CSI), speaker verification (SV), and open-set identification (OSI). For CSI, the recognition result is the registered speaker with the highest matching score. For SV, the result depends on whether the matching score exceeds a threshold: if it does, the speaker is accepted; otherwise, it is rejected. For OSI, the result must satisfy two conditions: first, the speaker obtains the highest score in the enrolled set G; second, that score must exceed the preset threshold θ. The decision rule is:\begin{align*} D(x)=\begin{cases} \displaystyle \arg \mathop {\max }\limits _{i\in G}{[S(x)] }_{i}, & \text {if}\mathop {\max }\limits _{i\in G}{[S(x)] }_{i}\ge \theta;\\ \displaystyle \text {imposter}, & \text {otherwise}\end{cases} \tag{1}\end{align*}
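To make the OSI decision rule concrete, the following minimal Python sketch implements Eq. (1); the function name and example scores are illustrative, and the threshold 13.76 is the OSI threshold used later in our experiments.

```python
import numpy as np

def osi_decide(scores, theta):
    """Open-set identification decision (Eq. 1), a minimal sketch.

    scores: matching scores [S(x)]_i for the enrolled speakers in G.
    theta:  preset acceptance threshold.
    Returns the index of the predicted speaker, or "imposter" if no
    enrolled speaker scores above the threshold.
    """
    best = int(np.argmax(scores))
    return best if scores[best] >= theta else "imposter"

# Example: three enrolled speakers, OSI threshold 13.76
print(osi_decide(np.array([10.2, 15.3, 12.9]), 13.76))  # -> 1
print(osi_decide(np.array([10.2, 11.3, 12.9]), 13.76))  # -> "imposter"
```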
B. Adversarial Attack
An adversarial example, also referred to as an adversarial attack, is created by adding a small perturbation to a benign example so that the speaker recognition model misclassifies it. The adversarial example is formulated as:\begin{equation*} x^{\prime }=x+\delta \quad \text {such that} \quad \Vert \delta \Vert < \epsilon \tag{2}\end{equation*}
Adversarial attacks can be categorized into targeted and untargeted attacks, as well as white-box and black-box attacks. Targeted attacks aim to change the model's output to a specific incorrect class, whereas untargeted attacks only aim to cause any misclassification. White-box attackers have access to all information about the target model, while black-box attackers have no access to its internal information.
FGSM (Fast Gradient Sign Method) [6] is a simple gradient-based attack. Its basic formula is as follows:\begin{equation*} x^{\prime }=x+\varepsilon \cdot sign(\nabla _{x}L(x,y)) \tag{3}\end{equation*}
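As an illustration of Eq. (3), here is a minimal PyTorch sketch of an untargeted FGSM attack on a speaker classifier; the model interface and the choice of cross-entropy over speaker logits are assumptions, not the exact setup of [6].

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM (Eq. 3): move the input by eps in the direction
    that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```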
MI-FGSM (Momentum Iterative Fast Gradient Sign Method) [7] extends FGSM with momentum and multiple iterations. The basic formulas are as follows:\begin{align*} g_{k+1}&=\mu \cdot g_{k}+\frac {\nabla _{x}L(x_{k}^{\prime },y)}{{\parallel \nabla _{x}L(x_{k}^{\prime },y)\parallel }_{1}} \tag{4}\\ x_{k+1}^{\prime }&=x_{k}^{\prime }+\alpha \cdot sign\left ({g_{k+1} }\right) \tag{5}\end{align*}
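A corresponding sketch of Eqs. (4)-(5), again with an assumed classifier interface; the per-batch L1 normalization is a simplification of the per-example normalization in [7].

```python
import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, eps, steps=30, mu=1.0):
    """MI-FGSM (Eqs. 4-5): FGSM with a momentum term over several iterations."""
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    alpha = eps / steps                       # per-step size
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Accumulate the L1-normalized gradient into the momentum term (Eq. 4).
        g = mu * g + grad / grad.abs().sum().clamp_min(1e-12)
        # Take a signed step along the momentum direction (Eq. 5).
        x_adv = (x_adv.detach() + alpha * g.sign())
    return x_adv.detach()
```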
CW2 (Carlini & Wagner) [8] is an optimization-based attack that produces a perturbation so small that it is barely perceptible in the audio. Through a change of variables it extends the optimization range from [0, 1] to (−∞, +∞):\begin{align*} x_{i}+\delta _{i}&=\frac {1}{2}(\tanh (\omega _{i})+1) \tag{6}\\ &\mathrm {minimize}\;{\left \Vert \frac {1}{2}\left ({\tanh \left ({\omega }\right)+1 }\right)-x\right \Vert }_{2}^{2}+c\cdot f\left ({\frac {1}{2}(\tanh (\omega)+1)}\right) \tag{7}\\ f(x^{\prime})&=\max \left ({\max \{Z\left ({x^{\prime} }\right)_{i}:i\ne t\}-Z{(x^{\prime})}_{t},-\kappa }\right) \tag{8}\end{align*}
The CW2 attack optimizes two objectives: the first is that the gap between the generated adversarial example and the original example is as small as possible (the first term of Eq. (7)); the second is that the adversarial example is classified as the target class t (the function f in Eq. (8)), where κ controls the confidence of the attack.
PGD (Projected Gradient Descent) [9] is a gradient-based attack that improves on FGSM. PGD performs multiple FGSM iterations and clips the result of each iteration to the specified range. The basic formula is as follows:\begin{equation*} x_{t+1}^{\prime} ={Clip}_{x,\epsilon }\left ({x_{t}^{\prime} +\alpha \cdot sign(\nabla _{x}L(x_{t}^{\prime},y;\theta))}\right) \tag{9}\end{equation*}
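The following hedged sketch of Eq. (9) mirrors the PGD-30 configuration used later in our experiments (30 iterations); the clipping is implemented as an element-wise clamp to the epsilon-neighborhood of the original input, and the model interface is assumed.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps=30):
    """PGD (Eq. 9): iterated FGSM with projection back onto the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Clip back into the epsilon-neighborhood of the original input.
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)
    return x_adv.detach()
```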
ADA [10] is an improvement of the PGD attack. Its main idea is to dynamically adjust the perturbation strength: cosine annealing is used to decrease the step size, so the added perturbation shrinks over the iterations, yielding stealthier adversarial examples without reducing the attack success rate.
C. Adversarial Defense
Adversarial defense is mainly done by various methods to enhance the robustness and security of the model against adversarial examples.
QT (Quantization) [11] defense limits the amplitude of each sound sample to an integer multiple of the parameter q, as shown below:\begin{equation*} \lfloor \frac {audio}{q}+0.5\rfloor \ast q \tag{10}\end{equation*}
AS (Average Smoothing) [12] defense fixes a sample point and replaces it with the average of its k neighboring sample points, while MS (Median Smoothing) replaces each sample point with the median of its k neighbors.
QT, AS, and MS are defense methods based on the time domain. However, QT defense is ineffective against adversarial examples with high perturbation strength. AS defense and MS defense may degrade the audio quality and have an impact on the performance of the model.
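For reference, here are minimal sketches of the three time-domain defenses, assuming 16-bit PCM samples for QT; the values q = 512 and k = 17 match the settings reported in our experimental setup, and the SciPy filters are one possible way to realize the sliding-window averaging and median operations.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter1d

def quantize(audio, q=512):
    """QT defense (Eq. 10): round each sample to the nearest multiple of q."""
    return np.floor(audio / q + 0.5) * q

def average_smooth(audio, k=17):
    """AS defense: replace each sample by the mean of a k-sample window."""
    return uniform_filter1d(audio.astype(np.float64), size=k)

def median_smooth(audio, k=17):
    """MS defense: replace each sample by the median of a k-sample window."""
    return median_filter(audio.astype(np.float64), size=k)
```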
DS (Down Sampling) [13] defense eliminates the perturbation by reducing the sampling rate of the audio and then restoring it. Specifically, the downsampling ratio τ is defined as:\begin{equation*} \tau =\frac {new\_{}audio\_{}sample}{ori\_{}audio\_{}sample} \tag{11}\end{equation*}
LPF (Low Pass Filter) [14] defense filters out high-frequency sounds with a low-pass filter. BPF (Band Pass Filter) [15] defense sets up a filter that removes both high- and low-frequency sounds.
DS, LPF, and BPF are defense methods based on the frequency domain. The DS method requires the sampling frequency to be at least twice the highest frequency in the signal to avoid sampling distortion and aliasing. LPF cannot defend against low-frequency adversarial noise. Since different adversarial examples may have different frequency characteristics, it is challenging for BPF to choose an appropriate frequency band. Appropriate filter parameters must therefore be selected for each specific situation, which may require extensive experimentation and prevents direct application in practice.
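Below are minimal sketches of the frequency-domain defenses using SciPy; the sampling rate, cutoff frequencies, and filter order are illustrative assumptions rather than the parameters used in [13], [14], and [15].

```python
import numpy as np
from scipy.signal import resample, butter, sosfilt

def downsample_defense(audio, tau=0.5):
    """DS defense (Eq. 11): downsample by ratio tau, then restore the original length."""
    low = resample(audio, int(len(audio) * tau))
    return resample(low, len(audio))

def lowpass_defense(audio, fs=16000, cutoff=4000, order=5):
    """LPF defense: attenuate content above the cutoff frequency."""
    sos = butter(order, cutoff, btype="lowpass", fs=fs, output="sos")
    return sosfilt(sos, audio)

def bandpass_defense(audio, fs=16000, low=300, high=4000, order=5):
    """BPF defense: keep only content between the two cutoff frequencies."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, audio)
```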
OPUS [16] and SPEEX [17] are two audio codecs with audio compression; they process noise during compression. OPUS handles noise in two stages, NSA (Noise Shaping Analysis) and NSQ (Noise Shaping Quantization). The NSA stage finds the compensation gain G and the analysis and synthesis noise-shaping filters, which the NSQ stage then applies during quantization:\begin{equation*} Y(z)=G\cdot \frac {1-F_{ana}\left ({z }\right)}{1-F_{syn}(z)}\cdot X(z)+\frac {1}{1-F_{syn}(z)}\cdot Q(z) \tag{12}\end{equation*}
SPEEX is mainly used for real-time VoIP communication. SPEEX also decreases noise in the encoder to improve the quality of network calls. The treatment of noise is shown below:\begin{equation*} W(z)=\frac {A(z/\gamma _{1})}{A(z/\gamma _{2})} \tag{13}\end{equation*}
OPUS and SPEEX are defense methods based on speech compression. Both OPUS and SPEEX use compression algorithms to reduce the size of audio files, resulting in a certain degree of information loss that affects the performance of the speaker recognition system.
To address the shortcomings of the time-domain, frequency-domain, and speech-compression methods, we propose a speech synthesis-based defense that mitigates the audio distortion problem and defends against various attacks.
Related Work
Sun et al. [18] used adversarial training to enhance model robustness. They took MFCC features as input, used FGSM to generate adversarial examples for each mini-batch, and dynamically added these adversarial examples, paired with the original labels, to the model's training set to strengthen its classification ability. They also used a teacher-student (T/S) training method [19] to further improve robustness. However, they evaluated robustness only against FGSM, without testing other attack methods.
Du et al. [12] used adversarial training, audio downsampling, and average filtering for the defense of the adversarial examples. They believe that adversarial training has significant limitations and can only provide an effective defense with fixed parameters; if the attacker increases the perturbation strength or changes to another attack method, adversarial training may not achieve a good defense. In their experiments, both audio downsampling and averaging filtering were found to reduce the success rate of the attack. However, it is worth noting that according to Nyquist’s sampling theorem, downsampling the audio can cause distortion when the sampling rate is lower than twice the highest frequency of the original audio. Similarly, averaging filtering may degrade the audio quality.
Yang et al. [11] proposed a temporal-dependency approach for detecting adversarial examples. For an audio input, the first portion of a given length is recognized on its own, and the result is compared with the corresponding portion of the result obtained on the whole audio; benign audio yields consistent results, whereas adversarial audio does not, so inconsistency indicates an adversarial example.
Yuan et al. [13] employed two methods for detecting adversarial examples: noise addition detection and reduced-sampling-rate detection. In their experiments, they observed that background noise lowers the success rate of adversarial examples, so they adopted noise addition as a defense. Specifically, they add noise to the input audio and compare the recognition results before and after; because adversarial examples are far more sensitive to such noise than benign audio, a change in the result indicates an adversarial example.
Rajaratnam et al. [21] studied adversarial example defenses using band-pass filters and the audio compression methods AAC [22], MP3 [23], OPUS [16], and SPEEX [17]. They found that AAC and MP3 provided the weakest defense among the compression methods. The band-pass filter defended better than the compression methods but had a greater impact on the target model's accuracy.
Zeng et al. [24] migrated the main idea of multi-version programming (MVP) [25] to adversarial example detection. Relying on the transferability of adversarial examples [26] and the fact that different recognition systems produce the same output for a normal speech sample, they compare the pairwise similarity of the outputs of different recognition systems and treat examples whose similarity falls below a given threshold as adversarial. However, if the systems cannot reliably recognize benign samples, the detection accuracy decreases.
Esmaeilpour et al. [27] used a class-conditional generative adversarial network [28] for defense. They minimize the relative chord distance between the random noise and the adversarial examples, find the optimal input vector to put into the generator to generate the spectrogram, and then reconstruct the one-dimensional signal based on the original phase information of the adversarial example and the generated spectrogram. Thus, this reconstruction does not add any additional noise to the signal and achieves the purpose of removing the perturbation. However, their method performs poorly in defending against black-box attacks.
Proposed Defense Approach: CYC
We will present our approach in the following areas: (A) the overall workflow, (B) the associated loss functions, and (C) the specific training methods.
A. Overall Workflow
First, we preprocess the audio to obtain four main features: the fundamental frequency (F0), the spectral envelope (SP), Mel-cepstral coefficients (MCEPs), and the aperiodic parameter (AP). F0 is the lowest-frequency sine component of the sound signal and is used to determine the starting time position of the sound in each audio clip. SP describes the variation of the amplitudes at different frequencies obtained by the Fourier transform of the sound signal. AP represents external noise whose vibration has no distinct periodicity. MCEPs are low-dimensional features obtained by reducing the dimensionality of SP. The mean and standard deviation of F0 and SP are kept to reconstruct the audio, and the MCEPs are normalized to a standard normal distribution for training the generators and discriminators.
The training processes of the generators and discriminators are shown in (a) and (b) of Figure 2, and the two processes are trained alternately. In (a), the generator G_{nat→ori} maps natural data to the benign domain, and the discriminator D_{ori} judges whether its output belongs to the benign domain; in (b), the reverse generator G_{ori→nat} and the discriminator D_{nat} are trained in the same way on the opposite mapping.
Finally, three parameters, F0, SP, and AP, are needed to synthesize a benign audio. We feed the MCEPs of the natural data into the trained generator G_{nat→ori} to obtain purified MCEPs, restore them to a spectral envelope, and then synthesize the benign audio together with F0 and AP.
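The sketch below illustrates one way to realize this preprocessing and synthesis pipeline with the pyworld bindings of the WORLD vocoder; the library choice, MCEP dimensionality, and FFT size are our assumptions, since the text does not name a specific toolchain.

```python
import numpy as np
import pyworld as pw  # WORLD vocoder bindings (assumed toolchain)

def extract_features(wav, fs=16000, n_mceps=36):
    """Extract F0, spectral envelope (SP), aperiodicity (AP) and MCEPs."""
    wav = wav.astype(np.float64)
    f0, t = pw.harvest(wav, fs)                          # fundamental frequency contour
    sp = pw.cheaptrick(wav, f0, t, fs)                   # smoothed spectral envelope
    ap = pw.d4c(wav, f0, t, fs)                          # aperiodic parameter
    mceps = pw.code_spectral_envelope(sp, fs, n_mceps)   # low-dimensional MCEPs
    return f0, sp, ap, mceps

def synthesize(f0, mceps, ap, fs=16000, fft_size=1024):
    """Rebuild audio from F0, (purified) MCEPs and AP."""
    sp = pw.decode_spectral_envelope(
        np.ascontiguousarray(mceps, dtype=np.float64), fs, fft_size)
    return pw.synthesize(f0, sp, ap, fs)
```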
B. Loss Function
The objective optimization function of the GAN [29] is the key to training. For this purpose, we design the loss functions L_{G} for the generators and L_{D} for the discriminators. L_{G} is defined as:\begin{equation*} L_{G}=L_{adv1}^{G}+{\alpha L}_{cyc}+{\beta L}_{id}+L_{adv2}^{G} \tag{14}\end{equation*}
1) Adversarial Loss of Generator
The adversarial loss of the generators is divided into a first-step adversarial loss L_{adv1}^{G} and a second-step adversarial loss L_{adv2}^{G}. The first-step loss is:\begin{align*} L_{adv1}^{G}\left ({G_{nat\to ori} }\right)&=\mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ \log \left ({1-D_{ori}\left ({G_{nat\to ori}\left ({x }\right) }\right) }\right) }\right] \\ &\quad + \!\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}\left [{ \log \left ({1\!-\!D_{nat}\left ({G_{ori\to nat}\left ({y }\right) }\right) }\right) }\right] \tag{15}\end{align*}
The second-step adversarial loss is:\begin{align*} &\hspace {-.5pc}L_{adv2}^{G}(G_{nat\to ori},G_{ori\to nat}) \\ &=\mathbb {E}_{x\sim P_{X}\left ({x}\right)}[\mathrm {log}(1- D_{nat}(G_{ori\to nat}(G_{nat\to ori}(x))))] \\ &\quad +\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}[\mathrm {log} (1-D_{ori}(G_{nat\to ori}(G_{ori\to nat}(y))))] \tag{16}\end{align*}
Here, x denotes natural data drawn from the distribution P_{X} and y denotes benign data drawn from P_{Y}. G_{nat→ori} maps natural data to the benign domain and G_{ori→nat} maps benign data back to the natural domain, while D_{ori} and D_{nat} are the discriminators of the two domains. The first-step loss judges singly mapped data, and the second-step loss judges cyclically mapped data.
2) Cycle-Consistency Loss
We replace the original cycle-consistency loss, expressed with the L_{1} distance, by the L_{2} distance:\begin{align*} &\hspace {-.5pc}L_{cyc}\left ({G_{nat\to ori},G_{ori\to nat} }\right) \\ &=\mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ {\parallel G_{ori\to nat}\left ({G_{nat\to ori}\left ({x }\right) }\right)-x\parallel }_{2} }\right] \\ &\quad +\mathbb {E}_{y\sim P_{Y}(y)}[{\parallel G_{nat\to ori}(G_{ori\to nat}(y))-y\parallel }_{2}] \tag{17}\end{align*}
The CycleGAN-VC2 model differs from classical GAN models by introducing this cycle-consistency loss, which constrains a sample mapped to the other domain and back to remain close to the original, so that unpaired data can be used while the content of the input is preserved.
3) Identity-Mapping Loss
CycleGAN-VC2 also differs from classical GANs in that it incorporates the identity-mapping loss, which encourages each generator to leave inputs that already belong to its target domain unchanged.
Similarly, we use the L_{2} distance for the identity-mapping loss:\begin{align*} &\hspace {-2pc} L_{id}\left ({G_{nat\to ori},G_{ori\to nat} }\right) \\ &=\mathbb {E}_{y\sim P_{Y}(y)}[{\parallel G_{nat\to ori}(y)-y\parallel }_{2}] \\ &\quad +\mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ {\parallel G_{ori\to nat}\left ({x }\right)-x\parallel }_{2} }\right] \tag{18}\end{align*}
The input of G_{nat→ori} in Eq. (18) is benign data y and the input of G_{ori→nat} is natural data x; constraining each generator to keep such inputs unchanged prevents the defense from over-modifying examples that are already in its target domain.
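A hedged PyTorch sketch of the L2-based cycle-consistency and identity-mapping losses of Eqs. (17)-(18); the generator call signatures are assumptions, and the expectations are approximated by batch means.

```python
import torch

def l2_cycle_loss(G_no, G_on, x_nat, y_ori):
    """Cycle-consistency loss with the L2 distance (Eq. 17).

    G_no: generator natural -> benign (G_{nat->ori}); G_on: the reverse mapping.
    """
    cyc_x = torch.norm(G_on(G_no(x_nat)) - x_nat, p=2, dim=-1).mean()
    cyc_y = torch.norm(G_no(G_on(y_ori)) - y_ori, p=2, dim=-1).mean()
    return cyc_x + cyc_y

def l2_identity_loss(G_no, G_on, x_nat, y_ori):
    """Identity-mapping loss with the L2 distance (Eq. 18)."""
    id_y = torch.norm(G_no(y_ori) - y_ori, p=2, dim=-1).mean()
    id_x = torch.norm(G_on(x_nat) - x_nat, p=2, dim=-1).mean()
    return id_y + id_x
```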
4) Adversarial Loss of Discriminator
For the discriminators, we define the loss as:\begin{equation*} L_{D}=L_{adv1}^{D}+L_{adv2}^{D} \tag{19}\end{equation*}
The first-step adversarial loss of the discriminators is defined as:\begin{align*} L_{adv1}^{D}\left ({D_{ori} }\right)&=\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}\left [{ \log D_{ori}\left ({y }\right) }\right] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ \log \left ({1-D_{ori}\left ({G_{nat\to ori}\left ({x }\right) }\right) }\right) }\right] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ \log D_{nat}\left ({x }\right) }\right] \\ &\quad + \mathbb {E}_{y\sim P_{Y}\left ({y }\right)}\left [{ \log \left ({1-D_{nat}\left ({G_{ori\to nat}\left ({y }\right) }\right) }\right) }\right] \tag{20}\end{align*}
The discriminator D_{ori} distinguishes real benign data from data generated by G_{nat→ori}, and D_{nat} distinguishes real natural data from data generated by G_{ori→nat}. The second-step adversarial loss is defined as:\begin{align*} &\hspace {-0.5pc}L_{adv2}^{D}(D_{ori}) \\ &=\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}[\log D_{ori}(y)] \\ &\quad +\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}[\log \left ({1-D_{ori}\left ({G_{nat\to ori}\left ({G_{ori\to nat}\left ({y }\right) }\right) }\right) }\right)] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}[\log D_{nat}(x)] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}[\log \left ({1-D_{nat}\left ({G_{ori\to nat}\left ({G_{nat\to ori}\left ({x }\right) }\right) }\right) }\right)] \tag{21}\end{align*}
The discriminators in the second step additionally score the cyclically reconstructed data, which pushes the generators to produce reconstructions that are indistinguishable from real data and stabilizes training.
Algorithm 1 CYC-L2
Input: natural examples x ∈ X
Input: benign examples y ∈ Y
Initialize G_{nat→ori}, G_{ori→nat}, D_{ori}, D_{nat}
for epoch = 1 to E do
  for k = 1 to K mini-batches do
    Update D_{ori} and D_{nat} by descending the discriminator loss L_{D} (Eq. 19)
    Update G_{nat→ori} and G_{ori→nat} by descending the generator loss L_{G} (Eq. 14)
  end for
  if epoch has passed the halfway point then
    if G_{nat→ori} already purifies the adversarial examples well enough then
      Stop updating G_{ori→nat}, D_{ori}, and D_{nat} (decremental learning)
    end if
  end if
end for
Return G_{nat→ori}
C. Training Method
At the beginning of training, we train two generators and two discriminators: the generators are G_{nat→ori} and G_{ori→nat}, and the discriminators are D_{ori} and D_{nat}.
Halfway through the training, we test the ability of G_{nat→ori} to purify adversarial examples. Once it is sufficient, we stop updating the remaining generator and the discriminators, similar to decremental learning; this eases the difficulty of training the GAN and greatly reduces training time.
When the input of G_{nat→ori} is a benign example, its output should stay close to the input, so the defense has little impact on benign data; when the input is an adversarial example, the output should be recognized as the correct speaker.
Experimental Setup
A. Dataset and Experimental Environment
The benign data comes from Librispeech [30]. After selection, 10 speakers (5 male and 5 female) were chosen, with 110 audio samples per speaker: 10 randomly selected samples are used for speaker enrollment and the remaining 100 for testing the model. In addition, since PGD is regarded as the strongest first-order attack, a defense that works well against PGD adversarial examples should also defend against adversarial examples generated by other attacks. We therefore use adversarial examples generated by the PGD attack with 30 iterations as the adversarial data and merge the adversarial data and benign data into the natural data.
This experiment was implemented on an Intel i7-12700KF CPU at 3.6 GHz, an NVIDIA GeForce RTX 3080Ti with 12 GB of video memory, and 64 GB of RAM, running Ubuntu 20.04.
B. Model Introduction
We train the generators and discriminators of CycleGAN-VC2 to implement the defense for speaker recognition systems, and we use the x-vector model as the target model.
The x-vector [31] model is the most widely used baseline framework in current speaker recognition. Its structure is shown in Figure 3: the first five layers are TDNN layers that capture frame-level features, the sixth layer is a statistics pooling layer that aggregates the features of the fifth layer, the seventh and eighth layers are segment-level fully connected layers, and the ninth layer is a SoftMax layer. The embedding extracted from the seventh layer is used as the speaker's feature vector.
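A simplified PyTorch sketch of this architecture is given below; the layer widths, kernel sizes, and dilations follow the common x-vector recipe and are assumptions rather than the exact configuration of [31], and the SoftMax of the ninth layer is folded into the cross-entropy loss.

```python
import torch
import torch.nn as nn

class XVector(nn.Module):
    """Simplified x-vector sketch: 5 TDNN layers, statistics pooling,
    2 fully connected layers, and an output layer for the speakers."""
    def __init__(self, feat_dim=30, n_speakers=10):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.fc1 = nn.Linear(3000, 512)   # layer 7: the x-vector embedding is taken here
        self.fc2 = nn.Linear(512, 512)
        self.out = nn.Linear(512, n_speakers)

    def forward(self, feats):             # feats: (batch, feat_dim, frames)
        h = self.tdnn(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        emb = self.fc1(stats)             # speaker embedding (x-vector)
        logits = self.out(torch.relu(self.fc2(torch.relu(emb))))
        return logits, emb
```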
To alleviate the problem of too many parameters in the last layer of the discriminator, the patch-GAN [32] architecture shown in Figure 4 is used, in which the last layer of the model is a convolutional layer that maps the input to a patch of outputs, and each patch is judged as real or fake instead of the whole input.
C. Metrics
In this paper, acc_{adv} and acc_{ben} denote the accuracy of the target model on adversarial examples and benign examples, respectively, where c_{adv} and c_{ben} are the numbers of correctly recognized adversarial and benign examples and n_{adv} and n_{ben} are the corresponding totals:\begin{align*} {acc}_{adv} &=c_{adv}/n_{adv} \tag{22}\\ {acc}_{ben} &=c_{ben}/n_{ben} \tag{23}\end{align*}
ASR represents the attack success rate of adversarial examples; the higher the ASR, the lower the robustness of the model. It is defined as follows:\begin{equation*} ASR=\frac {Num(f(x^{\prime })\ne y)}{Num(f(x)=y)} \tag{24}\end{equation*}
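A minimal sketch of how these metrics can be computed from model predictions; the function and variable names are illustrative.

```python
def accuracy(preds, labels):
    """acc = c / n (Eqs. 22-23): fraction of correctly recognized examples."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(preds_adv, preds_ben, labels):
    """ASR (Eq. 24): misclassified adversarial examples over the examples
    that the model originally classified correctly."""
    correct = [i for i, (p, y) in enumerate(zip(preds_ben, labels)) if p == y]
    fooled = sum(preds_adv[i] != labels[i] for i in correct)
    return fooled / len(correct)
```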
D. Algorithm Parameter Setting
We set the perturbation strength of FGSM, MIM, PGD, and ADA to 0.002 and the confidence level of CW2 to 10, denoted CW2-10; the higher the confidence level, the stronger the CW2 attack. For MIM, PGD, and ADA we perform 30 iterations to generate adversarial examples, denoted MIM-30, PGD-30, and ADA-30. We set the threshold of open-set identification to 13.76, under which the accuracy of the x-vector model is 99.2%; its accuracy in closed-set identification is 99.9%.
We set the parameter of the QT defense to 512, the sampling ratio of DS to 0.5, and the number of reference points to 17 for the AS and MS defenses. CYC-L1 and CYC-L2 constrain the loss function using the L_{1} and L_{2} distances, respectively.
Experimental Results
A. Effectiveness Analysis of Various Attack Algorithms in Speaker Recognition
Table 1 shows the success rates of targeted and untargeted attacks in closed-set identification and open-set identification. The success rate of the FGSM attack is lower than that of the other four attacks; in particular, its hard-label success rate in OSI is only 1.01%. This is because FGSM performs only one iteration and cannot find a good gradient-update direction, whereas the other methods perform multiple iterations and reach a 100% attack success rate against the undefended x-vector model.
B. Analysis of Defense Effects in Closed-Set Identification
Most defense methods reduce the accuracy of the target model on benign examples (acc_{ben}).
Our methods CYC-L1 and CYC-L2 did not noticeably reduce the accuracy on benign examples.
Tables 3 and 4 show the effectiveness of defense against targeted attacks with simple labels and hard labels in closed-set identification. On the whole, the defense methods are more effective against hard-label attacks than against simple-label attacks; this can be seen, for example, when CYC-L2 defends against the FGSM attack in the two tables.
Our method outperforms the other methods in defending against both simple-label and hard-label attacks. For example, in Table 4, CYC-L2 shows a larger increase in accuracy on adversarial examples than the other defenses.
C. Analysis of Defense Effect in Open-Set Identification
In open-set identification, the side effects of the different defense methods are more obvious, because a threshold is set for open-set identification. If a defense significantly distorts benign audio, the audio is likely to be judged an imposter by the model, leading to a decrease in the accuracy on benign examples (acc_{ben}).
CYC-L2 outperforms the other methods in defending against the other attacks. For example, in Table 5, when CYC-L2 defends against the FGSM attack, its accuracy on adversarial examples is higher than that of all the other defenses.
Based on the analysis of the defensive effects in open-set identification, we find that CYC-L2 is far ahead of the other methods. For example, in Table 6, when defending against the MIM targeted attack with simple labels in open-set identification, CYC-L2 shows a clear advantage in accuracy over the other methods.
Table 7 displays the effectiveness of defending against targeted attacks with hard labels in open-set identification. In defending against the FGSM hard-label attack, our CYC-L2 method reduces the robustness of the model: when CYC-L2 is used against the FGSM attack, the accuracy on adversarial examples decreases compared with applying no defense.
We also plotted waveforms and spectrograms to visualize the defense against MIM, as shown in Figures 7 and 8. From the waveforms, the waveform generated by the CYC-L2 defense against MIM does not differ significantly from the original waveform; it only slightly reduces the amplitude. In contrast, the waveforms generated by the other methods differ markedly from the original waveform. Specifically, AS, MS, DS, LPF, BPF, OPUS, and SPEEX reach the maximum amplitude, indicating severe audio distortion.
The speaker with id 2: Waveforms for defense against hard targeted attacks in open-set identification.
The speaker with id 2: Spectrograms for defense against hard targeted attacks in open-set identification.
From the spectrograms, we can see that the color of the original audio’s spectrogram is blue, indicating lower energy in the frequency range. The MIM attack introduces noise, causing the dark blue parts of the original audio to become light blue. Our CYC-L2 method removes some perturbations, converting part of the light blue areas back to dark blue. On the other hand, QT, AS, MS, DS, LPF, BPF, OPUS, and SPEEX introduce high-frequency noise, resulting in the overall color of the spectrograms turning red or even yellow.
Conclusion and Future Directions
In this paper, we propose a new defense method and conduct a comprehensive study of defenses for speaker recognition systems. Multiple defenses, such as CYC, QT, and AS, were evaluated against five attack methods. The experimental results show that our CYC-L2 defense has almost no impact on benign examples and achieves a better defense effect on adversarial examples.
There are two shortcomings in this paper. First, our method is only evaluated against white-box attacks, not black-box attacks; since most black-box attacks are optimization-based, we expect the defense effect against them to be similar to that against CW2, and we will verify this conjecture in future work. Second, the method cannot defend against unknown speakers in open-set identification.