A Robust CycleGAN-L2 Defense Method for Speaker Recognition System

With the rapid development of voice technology, speaker recognition is becoming increasingly prevalent in our daily lives. However, with its increased usage, security issues have become more apparent. Adversarial attacks pose a significant security risk to speaker recognition models by making small changes to the input that cause the neural network model to produce an incorrect output. Nevertheless, there are currently few defense techniques for speaker recognition models. To this end, we propose a robust CycleGAN-L2 (CYC-L2) defense method. The method automatically adjusts the size of the dataset according to what the generative adversarial networks have learned from it, and uses an L2 loss function to constrain the generative adversarial networks for better and faster training. In this paper, we compare the effectiveness of existing defenses and the proposed defense against white-box attacks. The experimental results show that our defense method not only achieves a better defense effect than the other defense methods considered under the x-vector model but also does not reduce the accuracy on benign examples in closed-set identification.


I. INTRODUCTION
The speaker recognition system, which determines different identities based on specific features of voice signals, is widely used in voice services such as identity recognition for banking, judicial authentication, and personalized services on smart devices. The development of neural networks has improved the performance of speaker recognition systems, but these systems are vulnerable to the threat of adversarial examples [1]. Therefore, defense methods that can be applied in speaker recognition systems and resist various adversarial attacks are the focus of this study.
Literature review [2] points out that most defense methods are based on prior knowledge of attacks, which can become ineffective if attackers improve the original attacks to produce stronger ones. Therefore, there is an urgent need for a defense method that can resist attacks without impacting the robustness of the original model. Our method addresses precisely this problem by effectively defending against different attacks with minimal impact on the model's accuracy.
To explore the relationship between generative models and robustness, this paper uses generative adversarial networks to learn the distribution of input features. Our method is inspired by the article [3] in the image field. Unlike ours, their generator learns only the mapping from the distribution of adversarial examples to the distribution of benign examples in order to eliminate adversarial examples. However, in a real production scenario, a defense inevitably receives benign data as well. Therefore, our generator learns how to map natural data to benign data, where natural data contains both adversarial examples and benign examples. The model in this paper is based on CycleGAN-VC2 [4] from the speech synthesis domain. For better defense against adversarial examples, we make minor modifications to the loss function of CycleGAN-VC2, and we name the resulting defense methods CYC-L1 and CYC-L2 according to the loss function used.
Our defense approach makes the following contributions:
• We propose a defense against adversarial examples in the field of speaker recognition based on the CycleGAN-VC2 model. This method eliminates the perturbations existing in natural data by mapping natural data to benign data.
• In CycleGAN-VC2, the L1 distance was used for model training, and to our knowledge CycleGAN-VC2 has not previously been applied to speaker recognition defense. When we tried the L1 distance for defending against adversarial examples, we found it less effective than the L2 distance. CycleGAN-VC2 uses training data in which one speaker corresponds to one speaker, while our defense uses training data in which multiple speakers correspond to multiple speakers. The L2 distance encourages the model to select more features, which fits our defense scenario, so we chose the L2 distance for our defense method. Experimental results show that, in most cases, the method using the L2 distance achieves better defense results. Moreover, we use a method similar to decremental learning in the training process, which eases the difficulty of training generative adversarial networks and greatly reduces the training time.
• To test the robustness of our method, we defend against different attacks and compare the results with other defenses. The experimental results show that our defense method outperforms the other defense methods.
The remainder of this paper is divided into six sections. Section II focuses on the basics of speaker recognition and then introduces the attack methods and defense methods used in our experiments. In Section III, we discuss the work related to defense in speaker recognition systems. In Section IV, we describe the proposed method in detail. In Section V, we describe the dataset, experimental environment, parameter settings, and model architecture, and provide a detailed description of the evaluation metrics. In Section VI, we present and analyze the experimental results. Finally, in Section VII, we give an overview of the effectiveness of the proposed defense method and present the shortcomings of the experiments and future research directions.

II. BACKGROUND
A. BASICS OF SPEAKER RECOGNITION
Speaker recognition, also called voice recognition, is a technology that distinguishes speakers based on the characteristics of the speaker's voice. The flow of a typical speaker recognition system is shown in Figure 1. To recognize a user, the system first needs to extract relevant features such as MFCC [5]. Then the model is trained and stored in a model database. In the testing phase, features are extracted from the voice to be recognized and compared with the data in the database for similarity, scored, and judged to determine the speaker of the current voice.
In addition, speaker recognition systems have three main tasks: closed-set identification (CSI), speaker verification (SV), and open-set identification (OSI). For CSI, the recognition result is the registered speaker with the highest matching score. For SV, the recognition result depends on whether the matching score is higher than a threshold: if it is higher, the speaker is accepted; otherwise, rejected. For OSI, the recognition result must satisfy two conditions: first, it must be the highest score in the set; second, the score must be higher than the set threshold θ. If the highest score is below the threshold, the voice is recognized as an impostor. The decision module is shown below:

D(x) = argmax_{i ∈ G} S_i(x), if max_{i ∈ G} S_i(x) ≥ θ; impostor, otherwise

where the parameter θ is the set threshold, G is the collection of registered speakers, i is a specific registered speaker, and S_i(x) is the scoring function.
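The three decision rules can be sketched as follows, assuming the scores S_i(x) for the registered speakers are already computed; the helper names are our own, not the paper's:

```python
import numpy as np

def csi_decide(scores):
    """CSI sketch: always return the highest-scoring enrolled speaker."""
    return int(np.argmax(scores))

def sv_decide(score, theta):
    """SV sketch: accept (True) iff the single matching score clears theta."""
    return score >= theta

def osi_decide(scores, theta):
    """OSI sketch: accept the top-scoring enrolled speaker only if the score
    clears the threshold theta; otherwise reject as an impostor (-1)."""
    i_best = int(np.argmax(scores))
    return i_best if scores[i_best] >= theta else -1
```

For example, with scores [0.2, 0.9] and θ = 0.5, OSI accepts speaker 1; with scores [0.2, 0.4] it returns the impostor decision.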

B. ADVERSARIAL ATTACK
Adversarial examples are also known as adversarial attacks. The process adds small perturbations to benign examples to make the speaker recognition model misclassify. The adversarial example is constructed as follows:

x′ = x + δ

where x is the original example, δ is the added perturbation, and x′ is the generated adversarial example. Adversarial attacks can be categorized into targeted and untargeted attacks, as well as black-box and white-box attacks. Targeted attacks aim to change the model's output to a specific incorrect class, whereas untargeted attacks only need the model to misclassify. White-box attacks have access to all information about the target model; black-box attacks do not have access to such information.
FGSM (Fast Gradient Sign Method) [6] is a simple gradient-based method in the field of adversarial attacks. Its basic formula is as follows:

x′ = x + ε · sign(∇x L(x, y))

where ε is the perturbation strength, ∇x L(x, y) is the gradient of the loss function, and sign(·) represents the sign function. The MI-FGSM (Momentum Iterative Fast Gradient Sign Method) [7] attack is an upgrade of FGSM, adding momentum and multiple iterations. Its basic formulas are as follows:

g_{k+1} = µ · g_k + ∇x L(x′_k, y) / ∥∇x L(x′_k, y)∥₁
x′_{k+1} = x′_k + α · sign(g_{k+1})

where µ controls the gradient decay, g_k is the previously accumulated gradient, g_{k+1} is the current gradient after adding momentum, and α is the per-iteration perturbation size. CW2 (Carlini & Wagner) [8] is an optimization-based attack that produces a perturbation so small that one can barely perceive it in the audio; it extends the constraint range from [0, 1] to (−∞, +∞) by using ω in place of the adversarial example:

minimize ∥(tanh(ω) + 1)/2 − x∥₂² + c · f((tanh(ω) + 1)/2)

The CW2 attack optimizes two objectives: first, the gap between the generated adversarial example x′ and the original example x should be as small as possible, constrained by the L2 distance; second, the generated adversarial example x′ should make the model Z classify it into the specified target t.
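The FGSM step and the MI-FGSM momentum update described above can be sketched as follows; the loss gradient is assumed to be supplied by the target model, so it appears here only as an input array:

```python
import numpy as np

def fgsm_step(x, grad, eps):
    """FGSM sketch: one signed-gradient step, x' = x + eps * sign(grad)."""
    return x + eps * np.sign(grad)

def mifgsm_update(g_prev, grad, mu):
    """MI-FGSM momentum accumulation sketch:
    g_{k+1} = mu * g_k + grad / ||grad||_1."""
    return mu * g_prev + grad / np.sum(np.abs(grad))
```

The momentum term normalizes each new gradient by its L1 norm before accumulation, so the update direction is smoothed across iterations.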
PGD (Projected Gradient Descent) [9] is a gradient-based attack that improves on FGSM. The PGD attack performs multiple FGSM iterations and clips each iteration to the specified range. The basic formula is as follows:

x′_{k+1} = Clip_{x,ϵ}(x′_k + α · sign(∇x L(x′_k, y)))

where Clip(·) clips any perturbation greater than ϵ (the maximum perturbation strength) and α is the perturbation strength of each iteration. In addition, the PGD attack begins from a random point within the bounded range around the starting point. ADA [10] is an improvement of the PGD attack whose main idea is to dynamically adjust the perturbation strength: it uses cosine annealing to decrease the step size, so the added perturbation shrinks over the iterations, yielding a stealthier adversarial example without reducing the attack success rate.
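A minimal PGD sketch follows, with a stand-in `grad_fn` in place of the real model's loss gradient (an assumption for illustration) and `np.clip` performing the projection back into the ϵ-ball:

```python
import numpy as np

def pgd_attack(x, grad_fn, eps, alpha, steps, rng=None):
    """PGD sketch: random start inside the eps-ball, then iterated
    signed-gradient steps, each projected (clipped) back into the ball.
    grad_fn(x_adv) stands in for the target model's loss gradient."""
    rng = rng or np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = x + np.clip(x_adv - x, -eps, eps)     # project to eps-ball
    return x_adv
```

Whatever the gradient, the projection guarantees the final example stays within ϵ of the original input.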

C. ADVERSARIAL DEFENSE
Adversarial defense uses various methods to enhance the robustness and security of the model against adversarial examples.
The QT (Quantization) [11] defense limits the amplitude of the sound to an integer multiple of the parameter q. Since the amplitude of the perturbation is usually small in the input space, the QT defense can eliminate the perturbation.
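A minimal sketch of the QT defense on 16-bit samples; the parameter name `q` follows the text, and the rounding scheme is our own assumption:

```python
import numpy as np

def quantize_defense(x_int16, q=512):
    """QT sketch: snap each 16-bit amplitude to the nearest integer
    multiple of q, discarding small adversarial perturbations."""
    return (np.round(x_int16 / q) * q).astype(np.int16)
```

Any perturbation smaller than roughly q/2 in amplitude is rounded away, which is why QT weakens against high-strength perturbations.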
The AS (Average Smoothing) [12] defense fixes a sample point x, takes the k samples before and after x as reference points, and replaces the value of x with the mean of the reference points; modifying the audio in this way reduces the impact of adversarial examples. The MS (Median Smoothing) [11] defense instead replaces the mean of the reference points with the median. However, AS and MS may degrade the audio quality and thus affect the performance of the model.
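A minimal sketch of AS and MS on a 1-D signal, using k reference samples on each side; edge padding at the boundaries is our assumption, since the paper does not specify boundary handling:

```python
import numpy as np

def smooth(x, k, mode="mean"):
    """AS/MS sketch: replace each sample with the mean (AS) or median (MS)
    of a window spanning k samples on each side (edge-padded)."""
    pad = np.pad(x.astype(float), k, mode="edge")
    # One row per window offset; column j is the window around sample j.
    windows = np.stack([pad[i:i + len(x)] for i in range(2 * k + 1)])
    return windows.mean(axis=0) if mode == "mean" else np.median(windows, axis=0)
```

The median variant suppresses an isolated spike entirely, while the mean variant spreads it over the window, which illustrates why the two defenses distort audio differently.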
QT, AS, and MS are time-domain defense methods. However, the QT defense is ineffective against adversarial examples with high perturbation strength, and the AS and MS defenses may degrade the audio quality and affect the performance of the model. The DS (Down Sampling) [13] defense eliminates the perturbation by reducing the sampling rate of the audio and then recovering it. Specifically, a downsampling factor τ (τ < 1) is set, the audio is downsampled according to τ, and the downsampled audio is then recovered by upsampling.
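A minimal DS sketch, assuming decimation by keeping every (1/τ)-th sample and linear interpolation for the recovery step; the paper does not specify the exact resampling method:

```python
import numpy as np

def downsample_defense(x, tau=0.5):
    """DS sketch: keep every (1/tau)-th sample, then restore the original
    length by linear interpolation, discarding high-frequency detail
    (where much adversarial perturbation lives)."""
    step = int(round(1 / tau))
    kept_idx = np.arange(0, len(x), step)
    return np.interp(np.arange(len(x)), kept_idx, x[kept_idx].astype(float))
```

The round trip preserves slowly varying content but low-passes the signal, which is the intended defensive effect.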
The LPF (Low Pass Filter) [14] defense filters out high-frequency sounds with a low-pass filter. The BPF (Band Pass Filter) [15] defense sets up a filter that removes both high- and low-frequency sounds.
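As an idealized illustration of what LPF and BPF remove, frequency bins can be masked in the FFT domain; real filter defenses use proper filter designs rather than this brick-wall sketch, and `fft_bandpass` is a hypothetical helper:

```python
import numpy as np

def fft_bandpass(x, sr, low_hz=None, high_hz=None):
    """LPF/BPF sketch via FFT masking: zero out frequency bins above
    high_hz (low-pass) and/or below low_hz (adding a low cut gives a
    band-pass). sr is the sampling rate in Hz."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    mask = np.ones_like(freqs, dtype=bool)
    if high_hz is not None:
        mask &= freqs <= high_hz
    if low_hz is not None:
        mask &= freqs >= low_hz
    return np.fft.irfft(spec * mask, n=len(x))
```

For a signal containing 50 Hz and 300 Hz tones, a 100 Hz low-pass keeps only the 50 Hz component, illustrating why LPF cannot touch low-frequency adversarial noise.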
DS, LPF, and BPF are frequency-domain defense methods. The DS method requires the sampling frequency to be at least twice the highest frequency in the signal to avoid sampling distortion and aliasing effects. LPF cannot defend against low-frequency adversarial noise. Since different adversarial examples may have different frequency characteristics, it is challenging for BPF to choose an appropriate passband. It is therefore necessary to select the filter parameters according to the specific situation, which may involve extensive experimentation and cannot be directly applied to practical applications.
OPUS [16] and SPEEX [17] are two audio codecs with audio compression. They deal with noise during the compression of the audio. The noise processing in OPUS is divided into two stages, NSA (Noise Shaping Analysis) and NSQ (Noise Shaping Quantization). The purpose of the NSA stage is to find the compensation gain G and filter coefficients used in NSQ. In the NSQ stage, the noise is filtered by the following equation:

Y(z) = G · F_ana(z) · X(z) + F_syn(z) · Q(z) (12)

where F_ana(z) and F_syn(z) are the Analysis Noise Shaping Filter and Synthesis Noise Shaping Filter, respectively, X is the input signal, and Q is the quantization noise. The first half of the equation is the input signal to be shaped, and the second half is the noise signal to be shaped. SPEEX is mainly used for real-time VoIP communication. SPEEX also reduces noise in the encoder to improve the quality of network calls. Its treatment of noise is shown below:

W(z) = A(z/γ1) / A(z/γ2)

where A(z) is a linear prediction filter, with parameter values of 0.9 and 0.6 for γ1 and γ2, respectively. The linear prediction filter is designed to allow the sound to have varying noise levels at different frequencies. Specifically, the output W(z) has reduced noise in the lower frequency bands of the sound while introducing some noise at higher frequencies. Rajaratnam [15] argues that SPEEX's defense is capable of effectively eliminating the perturbation with minimal alteration to the audio. OPUS and SPEEX are defense methods based on speech compression. Both use compression algorithms to reduce the size of audio files, resulting in a certain degree of information loss that affects the performance of the speaker recognition system.
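The A(z/γ) terms in SPEEX's perceptual weighting filter W(z) = A(z/γ1)/A(z/γ2) amount to scaling the k-th linear-prediction coefficient by γ^k (bandwidth expansion). A minimal sketch, with function name and array layout of our own choosing:

```python
import numpy as np

def perceptual_weight_coeffs(lpc, gamma):
    """Sketch of A(z/gamma): the k-th LPC coefficient is scaled by
    gamma**k, as used in SPEEX's weighting filter W(z) = A(z/g1)/A(z/g2).
    lpc[0] is the leading coefficient (conventionally 1)."""
    k = np.arange(len(lpc))
    return lpc * gamma ** k
```

Smaller γ flattens the filter's spectral peaks, which is how the codec shapes where quantization noise lands in frequency.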
To address the shortcomings of the time-domain, frequency-domain, and speech-compression methods, we propose a speech-synthesis-based defense method, which mitigates the audio distortion problem and defends against various attacks.

III. RELATED WORK
Sun et al. [18] used an adversarial training approach to enhance the robustness of the model. They used MFCC features as input, used FGSM to generate adversarial examples for each mini-batch of data, performed adversarial training with the adversarial examples and the original labels, and dynamically added the FGSM-generated adversarial data to the model's training set to enhance its classification ability. Moreover, they used a teacher-student (T/S) training method [19] to improve the robustness of the model. However, they only used FGSM to enhance robustness, without testing other attack methods.
Du et al. [12] used adversarial training, audio downsampling, and average filtering for defense against adversarial examples. They argue that adversarial training has significant limitations and can only provide an effective defense with fixed parameters; if the attacker increases the perturbation strength or switches to another attack method, adversarial training may no longer defend well. In their experiments, both audio downsampling and average filtering reduced the success rate of the attack. However, it is worth noting that, according to Nyquist's sampling theorem, downsampling the audio causes distortion when the sampling rate is lower than twice the highest frequency of the original audio. Similarly, average filtering may degrade the audio quality.
Yang et al. [11] proposed a temporal-dependency approach for the detection of adversarial examples. The main process is as follows: for an audio example a of time length t, a segment a′ of length k (k < t) is selected from a; both a and a′ are put into the target model, and the model's outputs are compared. If a is an adversarial example, the outputs for a and a′ differ greatly because the temporal dependency is destroyed; if a is a benign example, it is unaffected by the temporal dependency and the outputs for a and a′ are consistent. However, if an attack does not modify the entire time sequence of the audio, temporal-dependency methods may fail to detect it. In addition, they experimented with quantization, smoothing, downsampling, and autoencoder methods, and found that the MagNet encoder [20], while very effective against adversarial examples in the image domain, has limited effectiveness for audio.
Yuan et al. [13] employed two methods for detecting adversarial examples: noise-addition detection and reduced-sampling-rate detection. In their experiments, they observed that background noise reduced the success rate of adversarial examples, so they adopted the defense of adding noise. Specifically, they introduced noise n into the input audio x. If the output of the target model for the noise-added audio (x + n) does not match the output for the original audio x, the input audio x is considered to contain perturbation. If too much noise is added, however, it can impact accuracy. The reduced-sampling-rate detection lowers the sampling rate of the input audio x and obtains the output y from the target model; if the model's output for the original input audio x, denoted y′, satisfies y′ ≠ y, it is concluded that the input audio x contains perturbation.
Rajaratnam et al. [21] used band pass filters and audio compression methods of AAC [22], MP3 [23], OPUS [16], and SPEEX [17] for the study of adversarial examples defense. They found that AAC and MP3 had the worst defenses among audio compression methods. The method using bandpass filters had better defenses than audio compression methods, but had a greater impact on target model accuracy than the audio compression methods.
Zeng et al. [24] migrated the main idea of multi-version programming (MVP) [25] to adversarial example detection, based on the transferability of adversarial examples [26] and the fact that different recognition systems produce the same output for a single normal speech sample. They compare the outputs of different recognition systems pairwise, and examples whose similarity falls below a given threshold are considered adversarial. However, if the systems cannot effectively recognize benign samples, the detection accuracy decreases.
Esmaeilpour et al. [27] used a class-conditional generative adversarial network [28] for defense. They minimize the relative chordal distance between random noise and the adversarial example, find the optimal input vector to feed the generator to produce a spectrogram, and then reconstruct the one-dimensional signal from the original phase information of the adversarial example and the generated spectrogram. This reconstruction does not add any additional noise to the signal and thus achieves the purpose of removing the perturbation. However, their method performs poorly in defending against black-box attacks.

IV. PROPOSED DEFENSE APPROACH: CYC
We will present our approach in the following areas: (A) the overall workflow, (B) the associated loss functions, and (C) the specific training methods.

A. OVERALL WORKFLOW
Firstly, we preprocess the audio. The preprocessing extracts four main features: fundamental frequency (F0), spectral envelope (SP), Mel-cepstral coefficients (MCEPs), and the aperiodic parameter (AP). F0 is the lowest-frequency sine wave in the sound signal and is used to determine the start time of the sound in each audio clip. SP is the variation of amplitude across frequencies, obtained by Fourier transform of the sound signal. AP represents external noise whose vibration frequency has no distinct periodicity. MCEPs are low-dimensional features obtained by dimensionality reduction of SP. The mean and standard deviation of F0 and SP are used to reconstruct the audio data, and the MCEPs are mapped into the interval of a standard normal distribution for training the generators and discriminators.
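The mean/standard-deviation F0 mapping mentioned above can be sketched with the common log-Gaussian normalized transformation used in CycleGAN-VC-style conversion; the paper does not spell out its exact statistics pipeline, so `convert_f0` and its `(mean, std)` arguments are illustrative assumptions:

```python
import numpy as np

def convert_f0(f0, src_stats, tgt_stats):
    """Log-Gaussian F0 mapping sketch: normalize log-F0 by the source
    mean/std, then rescale with the target mean/std. Unvoiced frames
    (F0 == 0) are left untouched. Stats are (mean, std) of log-F0."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = f0.astype(float).copy()
    voiced = f0 > 0
    out[voiced] = np.exp((np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t)
    return out
```

This keeps the pitch contour's shape while moving it into the target speaker's F0 range.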
The training processes of the generators and discriminators are shown in (a) and (b) of Figure 2; the two processes are trained alternately. In (a), the generator G nat→ori is trained first, and then the generator G ori→nat. In this process, the input of G nat→ori is divided into two stages: in the first stage, the input consists of natural data, including adversarial examples and benign examples; in the second stage, the input is benign data. In both stages, however, the output of G nat→ori is generated benign data. The data generated by G nat→ori is treated as the input of G ori→nat, whose output is generated adversarial and benign examples. In (b), the generator G ori→nat is trained first, followed by G nat→ori. The discriminators D nat and D ori are responsible for distinguishing generated data from real data: D ori mainly discriminates whether the benign data is generated, while D nat mainly discriminates whether the benign and adversarial data are generated. Of the two generators, we use G nat→ori as the defense tool and G ori→nat to assist in training G nat→ori. Finally, three parameters, F0, SP, and AP, are needed to synthesize benign audio. We feed the MCEPs of the natural data into G nat→ori to generate the MCEPs of benign data, and expand their dimensionality to obtain the SP of the benign data. Then we map the F0 of the natural data to obtain the F0 of the benign data. Last, we use the external-noise parameter AP of the benign data.

B. LOSS FUNCTION
The objective optimization function of the GAN [29] is the key to training. To this end, we design loss functions L G and L D for the generators and discriminators, respectively. For the generators, we define the loss function as:

L G = L G adv1 + L G adv2 + α · L cyc + β · L id

where L G adv1 and L G adv2 are the adversarial losses of the generators, L cyc is the cycle-consistency loss, L id is the identity-mapping loss, and α and β are weight parameters with initial values of 10 and 5.

1) ADVERSARIAL LOSS OF GENERATOR
The adversarial loss of the generators is divided into the first-step adversarial loss L G adv1 and the second-step adversarial loss L G adv2. The loss function of L G adv1 is as follows:

L G adv1 = E x∼X [(1 − D ori (G nat→ori (x)))²] + E y∼Y [(1 − D nat (G ori→nat (y)))²]

where x ∈ X and y ∈ Y; X represents the set of natural data ({adversarial, benign} ⊆ X) and Y represents the set of benign data ({benign} ⊆ Y), with adversarial and benign denoting the adversarial and benign examples, respectively. D ori is the discriminator that mainly distinguishes benign data from the benign data generated by G nat→ori (x), forcing the data generated by G nat→ori (x) closer to benign data. Similarly, D nat is the discriminator for the data generated by G ori→nat (y). Specifically, L G adv1 measures the difference between real and generated data; a smaller value indicates that the generator is stronger or that the discriminator has limited ability to distinguish generated data. The loss function of L G adv2 is as follows:

L G adv2 = E y∼Y [(1 − D ori (G nat→ori (G ori→nat (y))))²] + E x∼X [(1 − D nat (G ori→nat (G nat→ori (x))))²]

Here, G ori→nat generates natural data and G nat→ori generates benign data; D ori then discriminates the generated benign data while D nat discriminates the generated natural data, so L G adv2 supervises G nat→ori and G ori→nat simultaneously. The literature [3] argues that this second-step adversarial loss can mitigate the over-smoothing caused by a cycle-consistency loss L cyc expressed with the L1 distance.
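The (1 − D(·))² terms above follow the least-squares GAN formulation that Algorithm 1 also uses. A toy sketch of the generator-side term, operating directly on discriminator score arrays rather than a real network:

```python
import numpy as np

def g_adv_loss(d_scores_on_fake):
    """Least-squares generator adversarial loss sketch: the generator is
    rewarded when the discriminator scores its output close to 1 ('real'),
    i.e. mean((1 - D(G(x)))^2)."""
    d = np.asarray(d_scores_on_fake, dtype=float)
    return np.mean((1.0 - d) ** 2)
```

The loss is zero exactly when the discriminator is fully fooled (all scores equal 1) and grows quadratically as scores approach 0.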

2) CYCLE-CONSISTENCY LOSS
We replace the L1 distance in the original cycle-consistency loss with the L2 distance; the L cyc loss function is as follows:

L cyc = E x∼X [∥G ori→nat (G nat→ori (x)) − x∥₂²] + E y∼Y [∥G nat→ori (G ori→nat (y)) − y∥₂²]

The CycleGAN-VC2 model differs from classical GAN models by introducing L cyc to address the issue of mode collapse in generative adversarial networks. For the generator, the goal of G nat→ori is to generate benign data that D ori cannot distinguish. If y ∈ Y, Y has 10 labels, and G nat→ori (x) deems it good enough to deceive D ori by mapping x into any one class of Y, then G nat→ori (x) will collapse to generating a single output. L cyc prevents this generator mode collapse by supervising the input data x against the output data x′ (x′ = G ori→nat (G nat→ori (x))).

3) IDENTITY-MAPPING LOSS
CycleGAN-VC2 differs from classical GANs in that it incorporates the L id loss function. In our approach, L id is used to alleviate the impact of our defense on benign examples. We likewise use the L2 distance to express the identity-mapping loss; the L id loss function is as follows:

L id = E y∼Y [∥G nat→ori (y) − y∥₂²] + E x∼X [∥G ori→nat (x) − x∥₂²]

In the identity-mapping loss, the input of G nat→ori is benign data and the output is generated benign data, with the L2 distance constraining the input against the output. The natural dataset does not contain just one labeled class but data with multiple labels. Constraining L id and L cyc with the L1 distance tends to select only a few features, while the L2 distance selects more features, which suits our multi-class natural data.
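The feature-selection argument can be seen from the gradients of the two distances: the L1 gradient has the same unit magnitude for every nonzero residual component, while the L2 gradient scales with each component, spreading updates across many features. A minimal illustration (helper names are our own):

```python
import numpy as np

def l1_grad(residual):
    """Gradient of the L1 loss w.r.t. the residual: unit magnitude for
    every nonzero component, regardless of its size."""
    return np.sign(residual)

def l2_grad(residual):
    """Gradient of the squared-L2 loss: proportional to each component,
    so small residuals still receive proportionally scaled updates."""
    return 2.0 * residual
```

With residual [0.1, −2.0], the L1 gradient treats both components identically ([1, −1]), while the L2 gradient weights them by size ([0.2, −4.0]).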

4) ADVERSARIAL LOSS OF DISCRIMINATOR
For the discriminators, we define the loss as:

L D = L D adv1 + L D adv2

where L D adv1 is the first-step adversarial loss of the discriminators and L D adv2 is the second-step adversarial loss. The first-step adversarial loss is as follows:

L D adv1 = E y∼Y [(1 − D ori (y))²] + E x∼X [(0 − D ori (G nat→ori (x)))²] + E x∼X [(1 − D nat (x))²] + E y∼Y [(0 − D nat (G ori→nat (y)))²]

The discriminator D ori distinguishes between benign data and benign data generated by the generator, while the discriminator D nat distinguishes between natural data and natural data generated by the generator. D(·) ∈ [0, 1]: when D(·) = 1 the discriminator considers the input real, and when D(·) = 0 it considers the input fake. The second-step adversarial loss is as follows:

L D adv2 = E y∼Y [(1 − D ori (y))²] + E y∼Y [(0 − D ori (G nat→ori (G ori→nat (y))))²] + E x∼X [(1 − D nat (x))²] + E x∼X [(0 − D nat (G ori→nat (G nat→ori (x))))²]

Here D ori aims to distinguish benign data from the benign data generated by the composed generators G ori→nat and G nat→ori, and D nat aims to distinguish natural data from the natural data generated by both generators.
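In the same least-squares style, the discriminator terms (1 − D(real))² + (0 − D(fake))² that also appear in Algorithm 1 can be sketched as a toy function on raw score arrays (not the paper's implementation):

```python
import numpy as np

def d_adv_loss(d_real, d_fake):
    """Least-squares discriminator loss sketch: push scores on real data
    toward 1 and scores on generated data toward 0,
    mean((1 - D(real))^2) + mean((0 - D(fake))^2)."""
    real = np.asarray(d_real, dtype=float)
    fake = np.asarray(d_fake, dtype=float)
    return np.mean((1.0 - real) ** 2) + np.mean(fake ** 2)
```

The loss is zero only when real data is scored 1 and fake data 0; an undecided discriminator scoring everything 0.5 pays 0.25 on each term.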

Algorithm 1 CYC-L2
Input: natural example x, origin example y, adversarial example x′, number of epochs E, number of iterations K, generators {G nat→ori, G ori→nat} and discriminators {D ori, D nat}, parameters α and β, classifier C
Output: benign example y fake
1: Initialize
2: for e ← 1 to E do
3:   for k ← 1 to K do
4:     y fake ← G nat→ori (x)
5:     x cycle ← G ori→nat (y fake)
6:     y identity ← G nat→ori (y)
7:     x fake ← G ori→nat (y)
8:     y cycle ← G nat→ori (x fake)
9:     x identity ← G ori→nat (x)
10:    Update L D :
11:      ∇((0 − D ori (y fake))² + (1 − D ori (y))²)
12:      ∇((0 − D nat (x fake))² + (1 − D nat (x))²)
13:    Update L G :
14:      ∇((1 − D ori (y fake))² + (1 − D nat (x fake))² + α(∥x cycle − x∥₂² + ∥y cycle − y∥₂²) + β(∥y identity − y∥₂² + ∥x identity − x∥₂²))

At the beginning of training, we train two generators and two discriminators. The generators are G nat→ori and G ori→nat: G nat→ori maps the natural data distribution to the benign data distribution, and G ori→nat maps the benign data distribution to the natural data distribution. The discriminators are D ori and D nat: D ori distinguishes between benign data and fake benign data, and D nat distinguishes between natural data and fake natural data.
Halfway through the training, we test the ability of G nat→ori to generate benign examples. When the input is a benign example, the output of G nat→ori reaches an accuracy of 98% on the classifier. If, as training of G nat→ori continues, the output leaves the classifier's accuracy unchanged or decreases it, we remove the benign examples from the natural data and set the parameter β to 0. Training ends when, for adversarial-example inputs to G nat→ori, the output leaves the classifier's accuracy unchanged or decreases it.

V. EXPERIMENTAL SETUP

A. DATASET AND EXPERIMENTAL ENVIRONMENT
The benign data is sourced from Librispeech [30]. After selection, 10 speakers were chosen (5 male and 5 female), with 110 audio samples per speaker: 10 randomly selected samples are used for speaker enrollment, and the remaining 100 samples are used for testing the model. In addition, since the PGD attack is regarded as the strongest first-order attack, a defense method that works well against the adversarial examples generated by PGD should also defend against the adversarial examples generated by other attacks. We therefore use the adversarial examples generated by a 30-iteration PGD attack as the adversarial data, and merge the adversarial data and benign data into natural data.
This experiment was implemented on an Intel i7-12700KF with 3.6GHz CPU, NVIDIA GeForce RTX 3080Ti with 12GB of video memory, and 64GB of RAM on an Ubuntu 20.04 system.

B. MODEL INTRODUCTION
We use the generators and discriminators of CycleGAN-VC2 for training to achieve defense in speaker recognition systems, and we use the x-vector model as the target model. The x-vector [31] model is the most mainstream baseline framework in current speaker recognition. Its complete structure is shown in Figure 3: the first five layers of the x-vector are TDNN layers that capture frame-level features, the sixth layer is a statistics pooling layer that aggregates the features captured in the fifth layer, the seventh and eighth layers are segment-level fully connected layers, and the ninth layer is a SoftMax layer. The embedding a is extracted from the seventh layer of the x-vector as the speaker's feature vector.
To alleviate the problem of too many parameters in the last layer of the discriminator, the patch-GAN [32] architecture shown in Figure 4 is used, in which the last layer of the model is a convolutional layer that maps the input to an N × N matrix; this matrix is then used to evaluate the generated speech. Because 2D CNNs (convolutional neural networks) consider both temporal and frequency information, they provide a more comprehensive understanding of the sound signal, whereas 1D CNNs are better suited to capturing dynamic changes and local patterns in sequential data. Therefore, the generator uses the 2-1-2D CNN architecture shown in Figure 5.
C. EVALUATION METRICS
We measure the attack success rate as:

Num(f(x′) ≠ y) / Num(x′)

where Num(*) represents the number of examples, f is the target model, x′ is an adversarial example, and y is the true label. If it is a targeted attack, the numerator becomes Num(f(x′) = y), with y the attack's target label.
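The Num(·) counting above can be sketched as follows; `attack_success_rate` is an illustrative helper name, not the paper's code:

```python
import numpy as np

def attack_success_rate(preds, labels, targets=None):
    """Success-rate sketch: untargeted attacks count Num(f(x') != y)
    over all adversarial examples; for a targeted attack the numerator
    becomes Num(f(x') = y_target)."""
    preds = np.asarray(preds)
    if targets is None:
        hits = preds != np.asarray(labels)   # any misclassification counts
    else:
        hits = preds == np.asarray(targets)  # only hitting the target counts
    return hits.mean()
```

For predictions [1, 2, 3] against true labels [1, 1, 1], the untargeted success rate is 2/3.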

D. ALGORITHM PARAMETER SETTING
We set the perturbation strength of FGSM, MIM, PGD, and ADA to 0.002, and the confidence of CW2 to 10, denoted CW2-10. We set the parameter of the QT defense to 512, the sampling rate of DS to 0.5, and the number of reference points for the AS and MS defenses to 17. CYC-L1 and CYC-L2 constrain the loss function using the L1 and L2 distances, respectively; the experimental results of CYC with the L1 and L2 distances under PGD-30 are shown in Figure 6. The stopband edge frequency of LPF is set to 4400 Hz, and the stopband edge frequencies of BPF are set to 160 Hz and 5900 Hz. The bit rates of the OPUS and SPEEX output audio are set to 4000 and 7200, respectively.

VI. EXPERIMENTAL RESULTS

A. ANALYSIS OF ATTACK EFFECTS
Table 1 shows the success rates of targeted and untargeted attacks in closed-set identification and open-set identification. Among them, the success rate of the FGSM attack is lower than that of the other four attacks; in particular, the hard-label success rate in OSI is only 1.01%. This is because FGSM performs only one iteration and cannot find the gradient-update direction well. The other methods perform multiple iterations and reach a 100% attack success rate against the undefended x-vector model.

B. ANALYSIS OF DEFENSE EFFECTS IN CLOSED-SET IDENTIFICATION
Most defense methods reduce acc_ben to varying degrees, and this decrease reflects the side effects of a defense. According to the experimental results in Table 2, MS causes the greatest decrease in acc_ben, bringing it down to 76.6%, which indicates that MS distorts the audio most severely. QT, DS, OPUS, and SPEEX have less impact on acc_ben, with decreases of 3.5%, 2.1%, 5.3%, and 4.0%, respectively. AS, LPF, and BPF have almost no impact, with decreases of 0.8%, 0.9%, and 1.7%, respectively.
Our methods CYC-L1 and CYC-L2 do not reduce acc_ben at all. This is probably because our model is trained with an identity-mapping loss, which reinforces the mapping of benign data to benign data, so the side effects of the defense are very small or even absent. In addition, acc_adv shows the effectiveness of a defense. The success rate of the untargeted FGSM attack in closed-set identification is 36.1%, i.e., the accuracy of the x-vector model under the FGSM attack is 63.9%. According to the FGSM results in Table 2, AS and OPUS have the worst defense effects, raising acc_adv to only 65.9% and 68.2%. DS, LPF, BPF, QT, MS, and SPEEX perform slightly better, with acc_adv above 70%. CYC-L1 and CYC-L2 have the best defense effect, raising acc_adv to 94.4% and 94.7%. CYC-L1 and CYC-L2 also outperform most other methods in defending not only against FGSM but also against the other attacks. However, CYC-L2 is slightly less effective than QT in defending against ADA, with a difference of only 3.2%. This may be because QT effectively reduces the aggressiveness of ADA by limiting the amplitude to integer multiples of 512.

Tables 3 and 4 show the effectiveness of defenses against targeted attacks with simple labels and hard labels in closed-set identification. On the whole, the defense methods are more effective against hard-label attacks than against simple-label attacks. For example, when CYC-L2 defends against the FGSM attack, acc_adv increases to 94.6% under the simple-label attack but to 99.6% under the hard-label attack. This is because generating adversarial examples with hard labels is more difficult for all the attacks, which makes hard-label attacks easier to defend against.
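The QT defense with parameter 512 simply rounds each amplitude to the nearest multiple of 512, discarding the low-order bits in which small perturbations live. A minimal sketch of the idea, assuming int16 PCM input (our own implementation, not the evaluated one):

```python
import numpy as np

def qt_defense(samples, q=512):
    """Quantize int16 PCM samples to the nearest multiple of q."""
    quantized = np.round(samples.astype(np.float64) / q) * q
    # Guard against int16 overflow: round(32767 / 512) * 512 = 32768.
    return np.clip(quantized, -32768, 32767).astype(np.int16)

# A perturbation smaller than q / 2 is erased entirely.
clean = np.array([1024, -2048, 15360], dtype=np.int16)
perturbed = clean + np.array([200, -150, 90], dtype=np.int16)
print(qt_defense(perturbed))  # recovers the clean samples
```

This snapping behavior is consistent with the observation above: an attack like ADA whose perturbation rarely exceeds half the quantization step is largely neutralized by QT.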
Our method outperforms the other methods in defending against both simple-label and hard-label attacks. For example, in Table 4, CYC-L2 increases acc_adv by 3.6%, 12.7%, 4.9%, 4.4%, and 4.6% when defending against FGSM, MIM, PGD, CW2, and ADA, respectively, compared with the best-performing baseline.

C. ANALYSIS OF DEFENSE EFFECT IN OPEN-SET IDENTIFICATION
In open-set identification, the side effects of the different defense methods are more obvious, because open-set identification sets a decision threshold: if a defense significantly distorts benign audio, the model is likely to treat the speaker as an imposter, leading to a decrease in acc_ben. According to the analysis in Table 5, CYC-L2 has fewer side effects than the other defenses. After applying CYC-L2, acc_ben is 97.7%, a decrease of only 1.5%; after applying CYC-L1, acc_ben is 96.3%, a decrease of 2.9%. This further confirms that CYC-L2 performs better than CYC-L1. CYC-L2 leads QT, DS, LPF, and BPF in acc_ben by 8%, 17.3%, 14.1%, and 9.8%, respectively. Among all the defense methods, MS has the highest side effects: after applying MS, acc_ben is 15.6%, a decrease of 83.6%, indicating that the distortion introduced by MS seriously affects the x-vector. The side effects of the two audio-compression-based defenses are very similar, with acc_ben of 57.8% for OPUS and 63.1% for SPEEX.

Based on the analysis of the defense effects in open-set identification, we find that CYC-L2 is far ahead of the other methods. For example, in Table 6, when defending against the MIM targeted attack with simple labels, CYC-L2 leads QT, AS, MS, DS, LPF, BPF, OPUS, and SPEEX in acc_adv by 45.5%, 63.8%, 58.2%, 63.8%, 63.8%, 63.8%, 45.4%, and 47.6%, respectively. Table 7 displays the effectiveness of defending against targeted attacks with hard labels in open-set identification. In defending against the FGSM hard-label attack, our CYC-L2 method slightly reduces the robustness of the model.
When using CYC-L2 to defend against the FGSM attack, the acc_adv of the x-vector model decreases to 90.4%. However, compared with CYC-L1, QT, AS, MS, DS, LPF, BPF, OPUS, and SPEEX, CYC-L2 still leads in acc_adv by 0.4%, 11.4%, 32.2%, 85.9%, 44.0%, 34.2%, 40.5%, 49.5%, and 65.6%, respectively. Therefore, CYC-L2 has only a minor impact on model robustness.
We also plot waveforms and spectrograms to visualize the defense against MIM, as shown in Figures 7 and 8. The waveforms show that the waveform generated by the CYC-L2 defense does not differ significantly from the original waveform; it only slightly reduces the amplitude. In contrast, the waveforms generated by the other methods differ significantly from the original. Specifically, AS, MS, DS, LPF, BPF, OPUS, and SPEEX reach the maximum amplitude, indicating severe audio distortion.
From the spectrograms, we can see that the original audio's spectrogram is blue, indicating lower energy across the frequency range. The MIM attack introduces noise, causing the dark blue parts of the original audio to become light blue. Our CYC-L2 method removes some of the perturbation, converting part of the light blue areas back to dark blue. In contrast, QT, AS, MS, DS, LPF, BPF, OPUS, and SPEEX introduce high-frequency noise, turning the overall color of the spectrograms red or even yellow.
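The spectrograms discussed above can be reproduced with a plain framed-FFT magnitude STFT. The sketch below is our own minimal implementation with illustrative frame sizes; it yields the (frequency × time) energy matrix that is then color-mapped:

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=128):
    """Magnitude STFT via Hann-windowed frames; rows are frequency
    bins (0 .. n_fft // 2), columns are time frames."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

# One second of a 440 Hz tone at 16 kHz: energy concentrates in the
# bin nearest 440 / (16000 / 512), i.e., around bin 14.
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

High-frequency noise added by an attack or a lossy defense shows up as extra energy in the upper rows of this matrix, which is exactly the red and yellow coloring described above.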

VII. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we propose a new defense method and conduct a comprehensive study of defense methods for speaker recognition systems. Multiple defenses, such as CYC, QT, and AS, were evaluated against five attack methods. The experimental results show that our CYC-L2 defense has almost no impact on benign examples and provides a better defense effect against adversarial examples.
This paper has two shortcomings. First, our method is evaluated only against white-box attacks, not black-box attacks. However, since most black-box attacks are optimization-based, we believe the defense effect against them should be similar to the defense effect against CW2; we will verify this hypothesis in future work. Second, our method cannot defend against unknown speakers in open-set identification.