A Highly Stealthy Adaptive Decay Attack Against Speaker Recognition

Speaker recognition based on deep learning is currently the most advanced and mainstream technology in the industry. Adversarial attacks, an emerging and powerful class of attacks against neural network models, also pose serious security problems for speaker recognition. Common gradient-based attack methods such as FGSM (Fast Gradient Sign Method), PGD (Projected Gradient Descent), and MI-FGSM (Momentum Iterative FGSM) generate adversarial examples that have poor stealthiness and are easily perceived by the human ear. To improve the stealthiness of adversarial examples, this paper proposes a new attack method called the Adaptive Decay Attack (ADA), whose stealthiness is very close to that of the optimization-based CW2 (Carlini & Wagner) method while requiring much less computation time. The method uses a fixed number of iterations as the termination condition, automatically adjusts the maximum perturbation according to whether the attack currently succeeds, and continuously reduces the step size using learning-rate decay schedules such as exponential decay and cosine annealing. Experimental results show that, on the two speaker recognition models x-vector and i-vector, the proposed method improves the stealthiness metrics SNR and PESQ by at least 30% and 39%, respectively, compared with the best PGD attack for untargeted attacks on speaker identification. For targeted attacks on speaker identification, the average improvement over PGD is at least 20% and 25%. For the speaker verification task, the improvement over PGD is at least 29.5% and 33.4%. In addition, we use this attack method for adversarial training to enhance the robustness of the model. Experimental results show that ADA-based adversarial training takes 28.31% less time than PGD-based adversarial training, and the robustness it provides is generally superior to that of PGD-based adversarial training.
Specifically, the attack success rate of PGD and ADA methods decreased from 50.88% to 36.47% and 64.74% to 45.82%, respectively.


I. INTRODUCTION
Speech carries the speaker's identity, text content, language information, and more [1]. Compared with other biometric recognition technologies, speech is easy and inexpensive to collect, and the recognition process is contactless [2]. Speaker recognition, a technique for recognizing or identifying a person from speech, is widely used in daily life and work.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
Studies have shown that speaker recognition is subject to malicious spoofing attacks, ranging from long-established ones such as voice conversion [7] and speech synthesis [8] to a recently emerging type called adversarial attacks. Voice conversion aims to change the source speaker's voice to that of the target speaker while keeping the spoken content unchanged. Speech synthesis aims to convert any text into the corresponding speech. The main idea of adversarial attacks is to add a small artificial perturbation to an utterance of the original speaker to form new audio that still sounds like the original speaker, at least to humans, but that the model attributes to someone else. The powerful capabilities of neural networks have led to their widespread use in fields closely tied to people's daily lives, yet recent studies have shown that neural networks are vulnerable to adversarial attacks. Since they were first proposed, adversarial attacks, carried out with adversarial examples, have posed a significant security threat to widely used neural network techniques. Regarding the cause of adversarial examples, Goodfellow et al. [9] argued that the linear behavior of deep neural networks in high-dimensional space is responsible: because the networks are high-dimensional and largely linear, a small initial perturbation accumulates as it propagates through the network until it is sufficient to change the model's classification result.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
With the continuous development and improvement of adversarial attacks, they have successfully deceived the current neural network-based designs for autonomous driving [10], face recognition [11], speech recognition [12], malicious code detection [13], and other related tasks. In recent years, adversarial attacks in speaker recognition scenarios have not been extensively studied, and it has become significant to understand the vulnerability of speaker recognition to adversarial attacks and how to increase its robustness.
Our contributions are as follows:
• For the task of attacking speaker recognition, we consider common gradient-based and optimization-based attack methods such as FGSM, PGD, MI-FGSM, and CW2, and we propose a new attack method called ADA, which can be applied to different speaker recognition models and to all recognition tasks. The aim of the new method is to improve the stealthiness of the generated adversarial examples so that they are not easily perceived by humans. Experimental results show that the method improves stealthiness significantly compared with other gradient-based methods, that its stealthiness is very close to that of the optimization-based CW2 method, and that it requires much less computation time than CW2.
• In addition, we consider how to improve the robustness of the model. We compare ADA-based adversarial training with FGSM-based and PGD-based adversarial training. Experiments demonstrate that ADA-based adversarial training improves the overall robustness of the model more than the other two methods and requires less time for training.
The remainder of this paper is organized as follows: Section II covers a basic introduction to speaker recognition and adversarial attacks. Section III describes the research related to adversarial attacks in speaker recognition. In Section IV, we introduce some attack methods, reveal their shortcomings, propose a new attack method called the ADA, and then introduce the defense method of adversarial training. Section V contains the experimental setup and experimental environment, the models used, and the metrics measured. Section VI presents the results of the attack and defense. Finally, Section VII summarizes the overall contents of this paper and proposes future research directions.

A. BASICS OF SPEAKER RECOGNITION
Speaker recognition [14], also known as voice recognition, is a technology that identifies who is speaking from the characteristics of the voice. It is fundamentally different from speech recognition. Speech recognition converts speech signals into text and in most cases does not care who the speaker is; it aims to filter out speaker-related information from the signal and retain only the textual content. Speaker recognition, by contrast, aims to filter out the textual content and retain only the speaker's identity information, so that the speaker can be robustly identified across different speech segments.
A complete speaker recognition system is shown in Figure 1. Speaker recognition technologies are divided into two main categories according to task and application scenario: speaker verification (SV) and speaker identification (SI). Speaker verification answers the question ''Is this speech spoken by this particular person?'' The result is either accept or reject, so speaker verification can be seen as a 1-to-1, binary classification problem. At the registration stage, the system extracts features from all audio examples provided by a particular speaker and aggregates them into a model that represents that speaker's identity. In the recognition phase, unidentified audio is compared with the registered model, producing a matching score, which is then compared with a predefined threshold. If the score is greater than the threshold, the audio is accepted; otherwise, it is rejected. The higher the score, the more likely it is that the new audio was spoken by the registered speaker.
Speaker identification answers the question ''Who spoke this passage?'' The answer is restricted to one speaker in a set of N registered speakers, so the task can be seen as a 1-to-N, multi-class classification problem. Speaker identification can be subdivided into closed-set speaker identification (CSI) and open-set speaker identification (OSI). In closed-set speaker identification, the recognition result is the person with the highest matching score in the set of N speakers. In open-set speaker identification, the audio may come from an impostor (i.e., someone not in the set of speakers), so the number of possible outcomes becomes N+1. The recognition result must then satisfy two conditions: a) it has the highest score in the set; and b) that score is greater than the threshold. If the highest score is below the threshold, the audio is recognized as coming from an impostor.
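The three decision rules described above (SV, CSI, OSI) can be illustrated with a minimal sketch. The function names, the example scores, and the use of -1 to denote an impostor are assumptions made for illustration; they are not taken from the systems evaluated in this paper.

```python
import numpy as np

def verify(score: float, threshold: float) -> bool:
    """SV: accept the claimed identity when the score exceeds the threshold."""
    return bool(score > threshold)

def identify_closed_set(scores: np.ndarray) -> int:
    """CSI: return the index of the registered speaker with the highest score."""
    return int(np.argmax(scores))

def identify_open_set(scores: np.ndarray, threshold: float) -> int:
    """OSI: best-scoring speaker if the score exceeds the threshold, else -1 (impostor)."""
    best = int(np.argmax(scores))
    return best if scores[best] > threshold else -1

# Hypothetical matching scores of one utterance against N = 3 registered speakers.
scores = np.array([0.2, 1.7, 0.9])
sv_result = verify(scores[1], threshold=1.0)
csi_result = identify_closed_set(scores)
osi_accept = identify_open_set(scores, threshold=1.0)
osi_reject = identify_open_set(scores, threshold=2.0)
```

With these scores, the closed-set rule always returns a registered speaker, whereas the open-set rule can additionally reject the utterance as an impostor when the threshold is not reached.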
Speaker recognition systems can be classified not only by recognition task but also by the content being recognized, into three categories: text-dependent recognition, text-independent recognition, and text-prompted recognition. Text-dependent recognition, usually called ''fixed text'' speaker recognition, restricts the text content and duration of the speaker's speech. Text-independent recognition, by contrast, works regardless of the content and duration of the speech. Text-prompted recognition randomly selects one text from a set of texts and asks the speaker to say it for speaker recognition. The experiments conducted in this paper all use text-independent recognition.

B. BASICS OF ADVERSARIAL ATTACKS
An adversarial example is generated by adding to the original benign data a small perturbation that is imperceptible to the human ear, yet it can effectively fool the target model into giving a wrong prediction with high confidence. A satisfactory adversarial example needs to satisfy two conditions: first, it must force the model to misclassify, i.e., it has a high attack success rate; second, the added perturbation should be as small as possible so that it remains imperceptible to humans, i.e., it has high stealthiness. Figure 2 illustrates an adversarial attack on the speaker verification task: the adversarial audio is formed by artificially adding a subtle perturbation, which the human ear cannot perceive, to the original audio. The adversarial example causes the speaker verification model to give a different result from the original example, switching from rejection to acceptance, although the two audios sound the same to the human ear. If the attacker chooses to attack the speaker identification task, an impostor can be recognized as one of the speakers in the registered set, or one of the speakers in the registered set can be recognized as another person in the set. The presence of adversarial attacks may therefore expose speaker recognition systems to serious security problems.
Adversarial attacks can be untargeted or targeted. In an untargeted attack, the attacker does not need to specify an attack category when generating the adversarial example and only needs to make the target model misclassify it. A targeted attack additionally requires the target model to classify the adversarial example as a specified target category, which is more difficult than an untargeted attack. In terms of optimization, the untargeted attack maximizes the loss with respect to the original label of the example, whereas the targeted attack minimizes the loss with respect to the target label. The two objectives can be written as follows:

$$f(x + \delta) \neq y, \quad \text{s.t. } \|\delta\|_p \le \varepsilon$$

$$f(x + \delta) = t, \quad \text{s.t. } \|\delta\|_p \le \varepsilon$$

where f(·) denotes the given model, x denotes the input example, δ denotes the added perturbation, y is the true label corresponding to the input example x, t is the label set by the attacker, and ε is the maximum allowed perturbation.
Adversarial attacks can also be classified into white-box [15] and black-box attacks [16], [17] according to whether the attacker knows the details of the model. In a white-box attack, the attacker knows all the information about the target model, such as the network structure and model parameters, and even the parameters and structure of the defense, and can use it to design effective attack algorithms. In the more difficult black-box setting, the attacker cannot obtain any information about the model and can only query it iteratively and estimate the target model from the returned results. A commonly used black-box approach is to build a substitute model with decision boundaries similar to those of the target model, perform a white-box attack on the substitute, and then transfer the generated adversarial examples to the target model.
Compared with the black-box attack scheme, the white-box attack scheme has the advantage of being easy to implement. In this paper, we mainly consider the untargeted and targeted attacks under the white-box attack and study the black-box attack in the subsequent work.
Current adversarial example generation algorithms for white-box attacks are mainly of two types: 1) gradient-based methods and 2) optimization-based methods. Gradient-based methods aim to maximize the target loss: they compute the gradient of the loss and add adversarial perturbations along the gradient direction, thereby fooling the target model into producing false predictions. Algorithms of this type can usually generate adversarial examples quickly, but the perturbation is more noticeable. Most attack algorithms belong to this type, including FGSM [9], PGD [18], and MI-FGSM [19]. The attack method proposed in this paper is also gradient-based, but it is designed to keep the added perturbation small and the stealthiness of the adversarial examples high. Optimization-based methods, on the other hand, treat adversarial example generation as an optimization problem and produce the adversarial example by iteratively optimizing the target loss; the representative algorithm of this type is CW2 [20]. Adversarial examples generated this way tend to have smaller perturbations, but at the cost of very low attack efficiency.
With the continuous development of adversarial attacks, defenses against them have also received extensive attention. In the field of speech recognition, defenses mainly follow two directions: eliminating adversarial perturbations and improving the robustness of models. For eliminating adversarial perturbations, existing work largely borrows input-transformation defenses from the image domain, such as feature compression, JPEG compression, quantization, and random smoothing [21], [22], [23], and combines them with the characteristics of audio (e.g., temporality). It is not yet known whether these approaches can be applied to speaker recognition. For improving the robustness of the speaker recognition model, adversarial training [9] trains the deep learning-based model on a dataset that mixes adversarial and original examples, so that the model becomes more resistant to adversarial examples. In this paper, adversarial training is adopted as an active defense to improve the robustness of the model.

III. RELATED WORK
Jati et al. [24] attacked speaker recognition models with classical methods such as FGSM and PGD and demonstrated that speaker recognition systems are highly vulnerable to adversarial attacks. They then conducted a series of ablation experiments to find the best parameters for the attack methods and, finally, performed adversarial training with different attack methods, finding that PGD-based adversarial training is the best defense and effectively improves the robustness of the model. However, the work does not consider security under targeted attacks or open-set identification scenarios.
Kreuk et al. [25] claimed to be the first to demonstrate the vulnerability of an end-to-end DNN-based speaker verification system to FGSM attacks. The authors also evaluated attacks on the speaker verification system in cross-feature (MFCC and Mel-spectrum) and cross-dataset settings. No defense method is proposed, and only a single attack method and recognition task are considered.
Li et al. [26] showed that the traditional i-vector-based speaker verification system is vulnerable to adversarial attacks and that adversarial examples generated with FGSM are transferable, posing a threat to different recognition models such as x-vector systems under cross-model and cross-feature conditions. However, only a single attack method and recognition task are considered, and no defense method is proposed.
Chen et al. [27] performed the first black-box targeted adversarial attack on speaker recognition systems. They proposed a method that combines the attack algorithm BIM with the gradient estimation algorithm NES to generate adversarial examples against traditional speaker recognition models such as GMM-UBM and i-vector, achieving close to 100% attack success rate on both open-source and commercial voice recognition systems (Tiancong Intelligence). The examples also transfer effectively to the Microsoft Azure voice recognition system, and the evaluation covers API attacks and over-the-air physical attacks in real-world scenarios. However, attacks on DNN-based speaker recognition models are not considered.
Shamsabadi et al. [28] proposed a white-box, steganography-based adversarial attack. Instead of directly optimizing an adversarial loss, it uses a Gated Convolutional Autoencoder (GCA) operating in the DCT domain. Human perception is taken into account through the inter-frame cosine similarity between the MFCC feature vectors extracted from the original and adversarial audio files, and the autoencoder is trained with a multi-objective loss (perceptual loss + adversarial loss) to generate the adversarial perturbations and hide them in the original audio file. This approach reduces the perceptibility of the noise to some extent and achieves a high PESQ score.
Wang et al. [29], building on the psychoacoustic principle of frequency masking, used a masking threshold instead of a norm to limit the size of the perturbation, generating perturbations inaudible to the human ear. They performed a targeted white-box attack on the x-vector speaker recognition system with arbitrary target speakers and achieved a success rate of 98.5%. The attack was also applied to non-speech data such as music.
Wang et al. [30] used two types of attacks, FGSM and LDS (local distributional smoothness), to generate adversarial examples against an end-to-end speaker verification model and experimentally demonstrated the model's vulnerability to adversarial attacks. They then used both types of adversarial examples for model regularization to improve robustness.

IV. PROPOSED METHOD
A. ATTACK METHOD
In general, gradient-based untargeted attacks generate adversarial examples by solving the following optimization problem:

$$\max_{\delta} \; L\big(f(x + \delta),\, y\big) \quad \text{s.t. } \|\delta\|_p \le \varepsilon \tag{3}$$

That is, they maximize the loss L between the model's output for the adversarial example and the true label y, subject to the maximum perturbation ε under the l_p norm.
FGSM: FGSM is a fast gradient-based attack method that completes the attack in a single iteration. As a single-step attack it has the shortest generation time, but its attack success rate is limited. The method maximizes the loss with respect to the original label by perturbing the original example along the sign of the loss gradient, within the l_p norm constraint. In this paper, the experiments are mainly conducted under the l∞ norm. The untargeted adversarial example is generated as:

$$x' = x + \varepsilon \cdot \mathrm{sign}\big(\nabla_x L(f(x), y)\big)$$

where ε is the maximum perturbation allowed (a hyperparameter) and also the step size of the optimization, and ∇_x L is the gradient of the loss function with respect to the input. In the CSI task the cross-entropy loss is used, while in the OSI and SV tasks a margin loss is used because the decision involves a threshold. Under a targeted attack, the loss with respect to a designated target t is minimized instead:

$$x' = x - \varepsilon \cdot \mathrm{sign}\big(\nabla_x L(f(x), t)\big)$$

PGD: To address the linearity assumption in FGSM, PGD is proposed to solve the inner maximization problem. PGD improves on FGSM by splitting FGSM's single large step into many small steps and, after each step, projecting the perturbed example back into the prescribed range, replacing any overflow with the boundary value. Compared with FGSM, PGD finds adversarial perturbations more precisely and effectively. As a multi-step attack it consumes much more computation and time than a single-step attack, but even in the worst case its adversarial examples are comparable to those of FGSM. The update rule of the projected gradient descent method is:

$$x'_{k+1} = \mathrm{Clip}_{x,\varepsilon}\big\{ x'_k + \alpha \cdot \mathrm{sign}\big(\nabla_x L(f(x'_k), y)\big) \big\}$$

where Clip{*} crops overflow values to ensure that the adversarial example stays within the allowed range around the original example, and α is the perturbation added at each iteration (the step size).

MI-FGSM: Also known as MIM, this is a momentum-based iterative gradient method. Building on BIM, it accumulates the gradient of the loss function across iterations, so that the perturbation in each round depends not only on the current gradient but also on previously computed gradients. This stabilizes the update direction and helps avoid poor local maxima:

$$g_{k+1} = \mu \cdot g_k + \frac{\nabla_x L(f(x'_k), y)}{\|\nabla_x L(f(x'_k), y)\|_1}$$

$$x'_{k+1} = \mathrm{Clip}_{x,\varepsilon}\big\{ x'_k + \alpha \cdot \mathrm{sign}(g_{k+1}) \big\}$$

where g_k stores the accumulated gradient of the first k iterations and µ is the decay factor (a hyperparameter).

CW2: Unlike the methods above, CW2 uses the L2 norm in the optimization of Equation 3 to measure the difference between the adversarial example x' and the original example x. Furthermore, instead of optimizing δ directly, a new variable ω is introduced and optimized. In the original formulation of Carlini and Wagner [20], for inputs normalized to [0, 1], the change of variable is:

$$\delta = \tfrac{1}{2}\big(\tanh(\omega) + 1\big) - x$$

This turns the problem into an unconstrained minimization. By mapping to tanh space, the optimization variable can range over (−∞, +∞), which is beneficial for optimization. The following loss function is generally used:

$$f(x') = \max\big( \max\{ Z(x')_i : i \neq t \} - Z(x')_t,\; -k \big) \tag{11}$$

where Z(*) is the output of the logit layer and k is the preset confidence: the larger k is, the higher the confidence of the generated adversarial examples.
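The gradient-based updates described above (FGSM, PGD, and MI-FGSM) can be sketched on a toy model. This is a minimal illustration under stated assumptions: the linear softmax classifier `W`, `b` stands in for a speaker model, its gradient is computed analytically, and the hyperparameter values are placeholders. It is not the implementation used in the experiments, and clipping to the valid audio range is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))  # toy stand-in model: 3 speakers, 8 input samples
b = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(x, y):
    """Cross-entropy loss of the toy model for label y."""
    return float(-np.log(softmax(W @ x + b)[y]))

def loss_grad(x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def fgsm(x, y, eps):
    # Single step of size eps along the sign of the gradient.
    return x + eps * np.sign(loss_grad(x, y))

def pgd(x, y, eps, alpha, steps):
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the l_inf ball
    return x_adv

def mi_fgsm(x, y, eps, alpha, steps, mu=1.0):
    x_adv, g = x.copy(), np.zeros_like(x)
    for _ in range(steps):
        grad = loss_grad(x_adv, y)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)  # momentum accumulation
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv

x = 0.1 * rng.normal(size=8)
y = int(np.argmax(W @ x + b))
x_fgsm = fgsm(x, y, eps=0.002)
x_pgd = pgd(x, y, eps=0.002, alpha=0.0004, steps=10)
x_mim = mi_fgsm(x, y, eps=0.002, alpha=0.0004, steps=10)
```

All three variants keep the perturbation inside the same l∞ ball of radius ε; they differ only in how the step direction is chosen and how many steps are taken.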
ADA: Although CW2 is an attack method based on finding the minimum perturbation, its generation efficiency is extremely low and its practicality is limited. PGD and MIM are undoubtedly powerful in terms of attack performance, yet they are not guaranteed to produce sufficiently small perturbations, so the adversarial examples they generate are, to some extent, easily perceived by humans. For PGD, the attack success rate is closely related to the maximum perturbation ε and the step size α. If ε is set too large, the attack will succeed as the number of iterations grows, but the added perturbation is larger than necessary. Conversely, if ε is set too small, the added perturbation is also very small, and the generated examples may not be adversarial. In addition, an unreasonable step size α may cause the gradient optimization to fail to converge and oscillate around a local or global optimum. For MIM, to ensure the attack success rate, information from previous gradients is added to each gradient update, which leads to a larger perturbation; the larger the hyperparameter µ, the larger the perturbation.
Given the shortcomings of these two attack methods, ADA is proposed to find the minimum perturbation with which the attack still succeeds; the complete attack steps are shown in Algorithm 1. First, input the original benign example x, initialize the step size α, and specify whether the attack is untargeted or targeted. Instead of fixing the maximum perturbation as in FGSM, PGD, or MIM, ADA adapts it during the attack.
Specifically, the norm constraint is enforced by projecting the adversarial perturbation δ into the current maximum perturbation range around the original audio x. The perturbation range is then adjusted according to a binary check of whether the current example is adversarial. If the example obtained in the (k−1)-th iteration is not adversarial, the maximum perturbation for the next iteration is expanded to (1+λ)ε_{k−1}; conversely, if it is adversarial, the maximum perturbation is narrowed to (1−λ)ε_{k−1}. After each check, the step size α is reduced using a learning-rate decay schedule, either exponential decay or cosine annealing. As the iterations proceed, the maximum perturbation and the step size become smaller and smaller, and finally an example that is adversarial and whose perturbation is sufficiently small is returned. We refer to the variant that reduces α with exponential decay as ADA-E and the variant that uses cosine annealing as ADA-C. Algorithm 1 shows the exponential decay variant.
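The ADA procedure described above can be sketched as follows. This is one possible reading of the description, not the authors' code: the two-speaker linear model `W`, the decay rate `gamma`, the value of `lam`, and the choice to return the smallest-perturbation adversarial example seen so far are all assumptions made for illustration.

```python
import math
import numpy as np

# Toy two-speaker linear "model" (an assumption for illustration only).
W = np.vstack([np.ones(8), -np.ones(8)])

def predict(x):
    return int(np.argmax(W @ x))

def loss_grad(x, label):
    """Gradient of the cross-entropy loss with respect to the input x."""
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    p[label] -= 1.0
    return W.T @ p

def ada(x, label, eps=0.002, alpha=0.0004, lam=0.1, steps=30,
        targeted=False, schedule="exp", gamma=0.9):
    # Untargeted: ascend the loss of the true label; targeted: descend the target's loss.
    direction = -1.0 if targeted else 1.0
    x_adv, best, alpha_k = x.copy(), None, alpha
    for k in range(steps):
        x_adv = x_adv + direction * alpha_k * np.sign(loss_grad(x_adv, label))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the current eps ball
        pred = predict(x_adv)
        success = (pred == label) if targeted else (pred != label)
        if success:
            # Keep the adversarial example with the smallest l_inf perturbation (assumption).
            if best is None or np.abs(x_adv - x).max() < np.abs(best - x).max():
                best = x_adv.copy()
            eps *= 1.0 - lam   # adversarial: narrow the perturbation range
        else:
            eps *= 1.0 + lam   # not adversarial: expand the perturbation range
        if schedule == "exp":  # ADA-E: exponential decay of the step size
            alpha_k = alpha * gamma ** (k + 1)
        else:                  # ADA-C: cosine annealing of the step size
            alpha_k = 0.5 * alpha * (1.0 + math.cos(math.pi * (k + 1) / steps))
    return best  # None if no adversarial example was found

x = np.full(8, 1e-4)
y = predict(x)
adv_e = ada(x, y, schedule="exp")
adv_c = ada(x, y, schedule="cos")
```

In this toy setting, the first step already crosses the decision boundary, after which the shrinking ε pulls the example back toward x until the attack is about to fail. The returned perturbation therefore ends up smaller than a single fixed-size PGD step, which is the behavior the adaptive range is meant to produce.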

B. ADVERSARIAL TRAINING
As a typical active defense method, adversarial training is conceptually straightforward. The generated adversarial examples are added to the training process so that the model learns from adversarial data in advance. This can be formulated as a min-max optimization problem:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim D}\Big[ \max_{\delta \in S} L(\theta, x + \delta, y) \Big]$$

where θ is the weight parameter of the model, δ is the perturbation, S is the allowed range of the perturbation, and D is the data distribution. The inner maximization aims to find the perturbation that maximizes the loss function, i.e., the perturbation that best fools the neural network. The outer minimization optimizes the neural network: with the perturbation fixed, the model is trained to minimize the loss on the training data, which makes it robust to the perturbation. Adversarial training is more time-consuming than normal training, and the resulting model is less accurate on benign examples, yet it is still a powerful tool for defending against adversarial attacks.
Taking ADA-based adversarial training as an example, the adversarial training objective function can be expressed as:

$$\tilde{L}(\theta, x, y) = c\,L(\theta, x, y) + (1 - c)\,L\big(\theta,\, x'(\theta, x, y),\, y\big) \tag{13}$$

where x'(θ, x, y) is the adversarial example generated iteratively from the benign example x according to the ADA method, and c is used to balance the accuracy on benign and adversarial examples, i.e., it sets the ratio between the benign and adversarial terms.

Algorithm 1 ADA-E
Input: flag m controlling untargeted/targeted attack, number of iterations K, gradient information grad, benign example x, true label (untargeted) or preset label (targeted) y, loss function L_cross, model f(*), step size α, sign function sign(*), clipping function clip(*), perturbation size ε, perturbation adjustment factor λ, exponential decay function ExponentialLR(*)
Output: adversarial example x'
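The mixed objective of adversarial training can be sketched on a toy softmax-regression model. Everything here is an assumption for illustration: the model, the random data, the learning rate, and in particular `sign_attack`, a single signed-gradient step that merely stands in for the attack (ADA, PGD, or FGSM) that would generate the adversarial examples in practice.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, y):
    p = softmax(X @ W.T)
    return float(-np.log(p[np.arange(len(y)), y] + 1e-12).mean())

def grad_W(W, X, y):
    """Gradient of the mean cross-entropy with respect to the weights."""
    p = softmax(X @ W.T)
    p[np.arange(len(y)), y] -= 1.0
    return p.T @ X / len(y)

def grad_X(W, X, y):
    """Per-example gradient of the cross-entropy with respect to the inputs."""
    p = softmax(X @ W.T)
    p[np.arange(len(y)), y] -= 1.0
    return p @ W

def sign_attack(W, X, y, eps):
    # Stand-in attack: one signed-gradient step. ADA would be plugged in here.
    return X + eps * np.sign(grad_X(W, X, y))

def mixed_loss(W, X, y, c=0.5, eps=0.002):
    X_adv = sign_attack(W, X, y, eps)
    return c * cross_entropy(W, X, y) + (1.0 - c) * cross_entropy(W, X_adv, y)

def adversarial_train_step(W, X, y, c=0.5, eps=0.002, lr=0.1):
    # Inner step: craft adversarial examples; outer step: descend the mixed loss.
    X_adv = sign_attack(W, X, y, eps)
    g = c * grad_W(W, X, y) + (1.0 - c) * grad_W(W, X_adv, y)
    return W - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
y = rng.integers(0, 3, size=20)
W0 = 0.1 * rng.normal(size=(3, 8))
W1 = adversarial_train_step(W0, X, y)
```

Setting c = 1 recovers ordinary training on benign examples, while smaller values of c put more weight on the adversarial term.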

V. EXPERIMENTAL SETUP
A. DATASETS AND EXPERIMENTAL ENVIRONMENT
As in [24] and [31], the datasets are taken from Librispeech [32], a speech database that contains 1000 hours of 16 kHz recordings, cut and organized into text-annotated audio files of about 10 seconds each. We use a total of 5 datasets. The first 3 datasets serve the 3 types of recognition tasks; they are taken from ''dev-other'' and ''train-other-500'' in Librispeech and are named enroll_10, test_10, and imposter_10. enroll_10 has 10 speakers (5 men and 5 women), each contributing 10 randomly chosen utterances for speaker registration. test_10 contains the same 10 speakers as enroll_10, each contributing 100 randomly chosen utterances (disjoint from enroll_10) for testing. imposter_10 is the impostor dataset used mainly for the OSI and SV tasks; its 10 speakers are all different from those in enroll_10, and 100 utterances are randomly taken from each speaker. The remaining two datasets are used for adversarial training. They are taken from ''train-clean-100'' and are named train_251 and test_251; both contain 251 speakers (126 men and 125 women). train_251 is used for training and contains 25652 utterances, and test_251 is used for testing and contains 2887 utterances. The experiments were run on an Ubuntu 20.04 system with an Intel i7-11700KF CPU at 3.6 GHz, an NVIDIA GeForce RTX 3070 Ti with 8 GB of video memory, and 32 GB of RAM.

B. MODEL INTRODUCTION
We use two models, i-vector [33] and x-vector [34], to implement the attacks on the three types of recognition tasks, while the AudioNet [35] model is used mainly for adversarial training.
The x-vector system is a speaker recognition system based on DNN, which is the mainstream baseline model framework in the current speaker recognition field. The DNN is trained to extract the vocal features of the speaker, and the extracted speaker embedding is called the x-vector. The whole system can be divided into two modules, and the complete architecture is shown in Figure 3 [36] below: the x-vector system contains five frame-level TDNN layers, one statistical pooling layer, two sentence-level fully connected layers, and one SoftMax layer.
After the speaker model is trained, the back-end will use the extracted speaker features x-vector to train a PLDA [37] model for channel compensation to reduce the impact of channel noise on the system and use the model for similarity scoring.
Before the rise of deep learning-based speaker recognition, the i-vector, which belongs to the traditional speaker recognition models, was the most popular approach. The i-vector is a simplified version of joint factor analysis (JFA) [38]: a single total factor matrix (T) is used to describe both speaker information and channel information, and the speech is then mapped to a fixed, low-dimensional vector. The channel information in matrix T interferes with the recognition system and can seriously degrade its accuracy. Channel compensation for the i-vector is therefore required, typically using WCCN [39], Linear Discriminant Analysis (LDA) [40], and Probabilistic Linear Discriminant Analysis (PLDA). The framework of the i-vector system is shown in Figure 4.
AudioNet is a one-dimensional convolutional neural network model with a digital signal processing (DSP) front-end [24] added to the original model, which extracts the log-Mel spectrogram from the time-domain waveform of the audio as the input to the convolutional layers. The network consists of 8 convolutional layers and transforms the spectrogram into a single 32-dimensional speaker embedding vector. BatchNorm and ReLU are applied in all CNN layers, and MaxPooling1D is added only at the end of layers 1, 4, and 6. The final fully connected layer maps the speaker embedding to the class logits. The complete network architecture is shown in Table 1.

C. METRICS
In this paper, we will evaluate the attack effect of each generation algorithm on speaker recognition models using attack success rate (ASR), signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), and time for generating adversarial examples.
The attack success rate is the percentage of generated adversarial examples that are misclassified by the model. For an untargeted attack it is defined as ASR = Num(f(x') ≠ y) / Num(x') × 100%, where Num(·) denotes a count, x' is an adversarial example, and y is the true label of the corresponding benign example. For a targeted attack with target label t, the numerator is changed to Num(f(x') = t).
To measure the perceptibility of speech adversarial examples, we use two speech quality metrics: signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ). The signal-to-noise ratio is the ratio of the power of the signal to the power of the noise, measured in dB. In the experiments it quantifies the size of the added perturbation relative to the original audio, which allows the adversarial audio generated by the various algorithms to be compared. It is calculated as SNR = 10 log10(Ps / Pn), where Ps represents the power of the benign example and Pn represents the power of the perturbation. The larger the signal-to-noise ratio, the better.
The calculation of PESQ is more complicated. It extracts the differences between the two input signals in terms of time-frequency domain or transform domain feature parameters and then maps these differences through a neural network model to obtain an objective sound quality score. The PESQ score ranges from 0 to 5, and higher scores indicate better voice quality.
The time to generate the adversarial examples, measured in seconds, is used to compare the generation speed of the various attack algorithms.
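The success-rate definition can be sketched in a few lines. This is an illustrative sketch only: the function name `attack_success_rate` and its arguments are hypothetical, and labels are assumed to be plain integers produced by the model on the adversarial examples.

```python
def attack_success_rate(pred_labels, true_labels, target_labels=None):
    """Return the attack success rate in percent.

    pred_labels:   model decisions on the adversarial examples.
    true_labels:   ground-truth labels of the corresponding benign examples.
    target_labels: if given, the attack is treated as targeted and an example
                   counts as a success only when the prediction equals its target.
    """
    total = len(pred_labels)
    if total == 0:
        raise ValueError("no adversarial examples were provided")
    if target_labels is None:
        # Untargeted: success means the prediction differs from the true label.
        hits = sum(p != y for p, y in zip(pred_labels, true_labels))
    else:
        # Targeted: success means the prediction equals the chosen target label.
        hits = sum(p == t for p, t in zip(pred_labels, target_labels))
    return 100.0 * hits / total
```

The same predictions can therefore give different success rates depending on whether the attack is scored as untargeted or targeted, which is why the two settings are reported separately.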
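As a worked illustration of the SNR metric, the sketch below computes the ratio in dB from a benign waveform and its adversarial counterpart, treating their sample-wise difference as the perturbation. The function name `snr_db` and the use of plain Python lists for waveforms are assumptions made for this example.

```python
import math


def snr_db(benign, adversarial):
    """Signal-to-noise ratio in dB between a benign waveform and its adversarial version.

    Ps is the mean power of the benign samples and Pn is the mean power of the
    perturbation (adversarial minus benign). Both inputs are equal-length lists.
    """
    if len(benign) != len(adversarial) or not benign:
        raise ValueError("waveforms must be non-empty and of equal length")
    n = len(benign)
    p_signal = sum(s * s for s in benign) / n
    p_noise = sum((a - s) ** 2 for s, a in zip(benign, adversarial)) / n
    if p_noise == 0.0:
        # No perturbation at all: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(p_signal / p_noise)
```

A perturbation whose amplitude is one tenth of the signal amplitude has one hundredth of its power, which corresponds to 20 dB under this definition.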

A. ALGORITHM PARAMETER SETTING
The step size of FGSM is ε = 0.002 [24,31]. For PGD, MIM, ADA-E, and ADA-C we likewise set the maximum perturbation ε = 0.002 and the number of iterations K = {10,20,30}. The step size is α = 0.0004 for PGD and MIM, and the initial step size in ADA is also α = 0.0004. For CW2 we use 9 binary search steps to minimize the adversarial perturbation, run 60-600 iterations to converge, and set the confidence k to 0, 5, and 10. In the experimental results, PGD-T, MIM-T, ADA-E-T, and ADA-C-T denote PGD, MIM, ADA-E, and ADA-C run for T iterations, e.g., PGD-10 means 10 iterations of PGD. CW2-k denotes CW2 with the confidence set to k.
We first perform a series of ablation experiments on untargeted closed-set speaker identification with the x-vector model, using MIM and ADA-E as examples, to find the best parameters for the attack.
In the MIM experiment, the hyperparameter µ takes values in {0,0.2,0.4,0.6,0.8}. In the ADA-E experiment, the parameter that modifies the range of the perturbation size is λ = 0.2, and the decay factor γ, i.e., the base of the exponential decay, takes values in {0.75,0.8,0.85,0.9,0.95}. The experiments show that the attack success rate stays at 100% for every value of µ, so we choose the value of µ that yields the smallest perturbation, i.e., the highest SNR and PESQ, which gives µ = 0 as the most suitable value in this scenario. Similarly, the most suitable decay factor for ADA-E is γ = 0.85 under the premise of ensuring a high success rate. Figures 5 and 6 below show the tuning results for the two attack methods.
Table 2 and Table 4 show the untargeted attacks under closed-set identification and open-set identification, respectively. In terms of attack success rate, FGSM is the weakest of all attack methods regardless of the identification task, the identification model, or the test dataset. For example, its success rate is only 32.37% on x-vector for closed-set identification, while all other methods achieve a 100% attack success rate. This is because FGSM is a single-step attack that performs no iterations; for the same reason, it generates adversarial examples far faster than the other methods.
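The step-size decay used by ADA-E and ADA-C can be illustrated with the two standard learning-rate schedules the method borrows from. The sketch below is an assumption-laden illustration, not the paper's exact update rule: the function names, the zero-based iteration index `k`, the symbol `gamma` for the exponential base, and the `alpha_min` floor are all hypothetical, and the adaptive adjustment of the maximum perturbation is not modelled.

```python
import math


def exponential_decay(alpha0, gamma, k):
    """Step size at iteration k (k = 0, 1, ...) under exponential decay.

    Assumed form: alpha_k = alpha0 * gamma**k with 0 < gamma < 1.
    """
    return alpha0 * gamma ** k


def cosine_annealing(alpha0, k, total_iters, alpha_min=0.0):
    """Step size at iteration k under cosine annealing.

    Assumed form: the step size falls from alpha0 at k = 0 to alpha_min
    at k = total_iters along half a cosine period.
    """
    cosine = math.cos(math.pi * k / total_iters)
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1.0 + cosine)


# Example: print both schedules for the initial step size 0.0004 over 10 iterations.
# The base 0.85 is one of the candidate decay factors; it is used here only as an example.
if __name__ == "__main__":
    for k in range(10):
        print(k, exponential_decay(0.0004, 0.85, k), cosine_annealing(0.0004, k, 10))
```

Under either schedule the early iterations take large steps toward an adversarial example, while later iterations take progressively smaller ones, which is the intuition behind reducing the final perturbation compared with a fixed step size.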

B. SPEAKER IDENTIFICATION FOR UNTARGETED ATTACK
The other gradient-based attack methods terminate after a fixed number of iterations and therefore have similar generation times. In terms of the audio quality of the generated adversarial examples, PGD is the stealthiest of the three compared methods FGSM, PGD, and MIM. Specifically, SNR and PESQ reach their maximum in PGD-10, yet the maximum SNR does not exceed 35 dB and the PESQ score does not exceed 3 on either recognition model. For PGD and MIM, the perturbation grows with the number of iterations, so the SNR and PESQ values gradually decrease. For CW2, the generated adversarial examples have the highest SNR among all experiments, thanks to its optimization-based approach, with the attendant problem that it consumes the most computation time of all methods. The larger the confidence k, the higher the success rate of CW2 and the lower its stealthiness.
The ADA method proposed in this paper guarantees a high attack success rate with a generation time very close to that of the other gradient-based methods. The lowest SNR and PESQ values of the adversarial examples generated by ADA-E and ADA-C are 42 dB and 4, which are 30% and 39% higher than those of PGD-10, the best of the comparison methods. The improvement continues to grow with the number of iterations: the perturbation becomes smaller and harder for the human ear to detect, but the computation time also increases. Compared to CW2, ADA-C-30 achieves a higher PESQ on the test 10 dataset and has a much greater advantage in time consumption, reducing the time by at least 94%.
Table 3, Table 5, and Table 6 show the targeted attacks under closed-set identification and open-set identification, respectively. The targets are selected according to a set difficulty level. Simple means that the most likely class other than the actual label of the normal example is used as the target class; Hard means that the least likely class other than the actual label of the normal example is used as the target class. Under Simple difficulty, the success rate of FGSM in closed-set identification is still the lowest among all attack methods, but it is higher than that of the untargeted attack under the same conditions. The targeted success rates of PGD-10 and MIM-10 on x-vector are 99.88% and 99.89%, respectively, which is lower than for the untargeted attack under the same conditions, although they still reach 100% as the number of iterations increases, while ADA maintains a 100% success rate throughout.
The SNR and PESQ metrics of the adversarial examples generated by all attack methods are equal to or slightly better than those of the untargeted attacks under the same conditions. For open-set identification, the adversarial examples generated from the imposter 10 dataset deceive both models more easily than those from test 10, for both untargeted and targeted attacks. In terms of stealthiness, the PESQ of the adversarial examples generated from imposter 10 is greater than that of the examples generated from test 10, while the opposite holds for SNR. Under Hard difficulty, the attack success rate of FGSM is extremely low, e.g., only 0.99% in closed-set identification on the test 10 set with the x-vector model. The other comparison methods and ADA cannot achieve a 100% attack success rate at 10 iterations either, yet the success rate of ADA is higher than that of all comparison methods. This is because ADA sacrifices some stealthiness of the audio adversarial examples in exchange for a higher success rate. As a result, the improvement in SNR and PESQ is not as large as in the untargeted attack or under Simple difficulty, but ADA is still the attack method with the highest stealthiness, as can be easily observed in the tables.
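The Simple and Hard target-selection rules can be sketched as follows. This is a minimal illustration that assumes the model exposes one score per enrolled class for the benign example; the function name `pick_target` and the `difficulty` strings are hypothetical.

```python
def pick_target(scores, true_label, difficulty):
    """Choose the target class for a targeted attack.

    scores:     per-class scores of the benign example (higher = more likely).
    true_label: index of the actual class, which is never chosen as the target.
    difficulty: "simple" picks the most likely wrong class,
                "hard" picks the least likely wrong class.
    """
    candidates = [i for i in range(len(scores)) if i != true_label]
    if not candidates:
        raise ValueError("at least two classes are required")
    if difficulty == "simple":
        return max(candidates, key=lambda i: scores[i])
    if difficulty == "hard":
        return min(candidates, key=lambda i: scores[i])
    raise ValueError("difficulty must be 'simple' or 'hard'")
```

Under this reading, a Hard target lies farthest from the benign decision, which is consistent with the lower success rates and larger perturbations reported for that setting.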

C. SPEAKER IDENTIFICATION FOR TARGETED ATTACK
Combining the experimental results of Simple and Hard, and setting aside its success rate, the stealthiness of CW2 is slightly higher than that of the ADA method on the imposter 10 dataset and very close to that of the ADA method on the test 10 dataset. The computation time of CW2 is also the highest and increases further with the difficulty of the attack. The lowest SNR and PESQ values of the ADA attack are 43 dB and 4.01, respectively, under the Simple difficulty of the targeted attack, and 35 dB and 3.14, respectively, under the Hard difficulty. The stealthiness of the ADA method under targeted attack is the highest among all methods, and the number of iterations has a significant positive correlation with the SNR and PESQ metrics. Compared to PGD-10, ADA improves SNR and PESQ under the targeted attack by 20% and 25.4%, respectively, on average. In general, the targeted attack is more difficult than the untargeted attack.

D. ATTACK FOR SPEAKER VERIFICATION
In the speaker verification experiments, we attack each of the 10 speaker verification models enrolled from the enroll 10 dataset separately, using imposters (i.e., the imposter 10 dataset), which is the more realistic scenario, and then average the attack results to obtain Table 7. Table 7 shows that the ADA method improves SNR and PESQ by at least 29.5% and 33.4% on average compared to PGD-10, which is similar to the improvement in untargeted speaker identification and indicates that this attack method is general and can be applied to all speaker recognition tasks. To understand adversarial examples in the speaker recognition domain more intuitively, we take the speaker verification task as an example. Comparing the waveforms and spectrograms in Figures 7-10 and 11-14, one can see directly that the perturbation added by the attack method proposed in this paper is much smaller than that of the other gradient-based attack methods but slightly larger than that of CW2. This is reasonable: CW2 finds the minimum perturbation of the sample at the cost of huge computation time, yet according to the experimental results the PESQ and SNR values of the examples generated by ADA-C-30 and CW2 are very close. The perturbations added by FGSM and PGD are not easily distinguishable in the waveform, but the spectrogram comparison shows that the perturbation added by PGD is smaller than that of FGSM.

E. ANALYSIS OF DIFFERENT MODELS
Tables 2-7 show that both the neural network-based x-vector system and the GMM-UBM-based i-vector system are vulnerable to adversarial attack spoofing and cannot resist the adversarial attacks. Of the two, i-vector systems are more threatened by adversarial attacks than x-vector systems; e.g., the success rate of FGSM attacks on i-vector systems is higher than on x-vector systems in all speaker recognition tasks. In terms of stealthiness, the SNR and PESQ metrics of the adversarial examples generated on the two systems are not significantly different. The SNR of the adversarial examples generated on the i-vector system is equal to or higher than that of the examples generated on the x-vector system, while for PESQ each of the two systems has the advantage in some recognition tasks and not in others. In terms of generation time, it is more difficult and takes more time to generate adversarial examples on the i-vector system.
Table 8 presents the robustness of models trained with different adversarial training methods, evaluated by attacking each trained model with different attack methods. We selected three adversarial training methods, FGSM-based, PGD-10-based, and ADA-C-10-based adversarial training, denoted as FGSM AT, PGD-10 AT, and ADA-C-10 AT. The number of training epochs was set to 150, the training data consisted of 50% adversarial examples and 50% benign examples, and the maximum perturbation of all methods in adversarial training was ε = 0.002.
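The 50% adversarial and 50% benign composition of the training data can be illustrated with a small sketch. How the split is actually realized (per batch, per epoch, or over the whole dataset) is not specified, so the per-batch mixing below is an assumption, and `build_mixed_batch`, `attack`, and `adv_fraction` are hypothetical names; `attack` stands in for any generator such as FGSM, PGD-10, or ADA-C-10.

```python
import random


def build_mixed_batch(batch, attack, adv_fraction=0.5, rng=None):
    """Replace a fraction of the benign examples in a batch with adversarial ones.

    batch:        list of (example, label) pairs.
    attack:       callable (example, label) -> adversarial example; a placeholder
                  for whichever attack is used during adversarial training.
    adv_fraction: share of the batch to perturb (0.5 gives a 50/50 mix).
    rng:          optional random.Random instance for reproducible selection.
    """
    rng = rng or random.Random(0)
    n_adv = round(len(batch) * adv_fraction)
    # Pick which positions in the batch are turned into adversarial examples.
    adv_positions = set(rng.sample(range(len(batch)), n_adv))
    mixed = []
    for i, (x, y) in enumerate(batch):
        # Labels are kept unchanged so the model learns the correct class
        # for both benign and perturbed inputs.
        mixed.append((attack(x, y) if i in adv_positions else x, y))
    return mixed
```

Keeping benign examples in every batch is one way to limit the drop in benign accuracy that adversarial training tends to cause.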

F. ANALYSIS OF ADVERSARIAL TRAINING
The experimental results show that the AudioNet model without adversarial training is extremely vulnerable to adversarial attacks: even the weakest attack, FGSM, has an 82.61% success rate, and the other three attack methods achieve a 100% attack success rate. Comparing the adversarial training based on the three different methods, we first find that the adversarially trained models not only lose some accuracy on benign examples but also require much more training time. FGSM AT shows the smallest decrease in accuracy on benign examples, and PGD-10 AT shows the largest. Second, FGSM AT consumes the least training time of the three approaches, yet it provides the weakest improvement in model robustness: it cannot resist PGD-30, MIM-30, and ADA-C-30 attacks and only improves the resistance to FGSM attacks, reducing their success rate by about 68%.
PGD-10 AT takes the longest time, about three times longer than FGSM AT and 1.5 times longer than ADA-C-10 AT. Its defensive effect is generally stronger than that of FGSM AT yet weaker than that of ADA-C-10 AT; specifically, it reduces the attack success rates of FGSM, PGD-30, and ADA-C-30 by about 67%, 50%, and 36%, respectively.
The adversarial training method proposed in this paper takes less time than PGD-10 AT and provides the best overall improvement in model robustness among the three adversarial training methods. It is only slightly weaker than the other two methods in resisting FGSM attacks, where its defense is still effective. Specifically, it reduces the attack success rates of FGSM, PGD-30, and ADA-C-30 by about 60%, 64%, and 55%, respectively, a significant improvement in resisting PGD and ADA-C attacks.
Finally, none of the adversarial training methods improved the defense against the MIM attack, which still achieved a 100% attack success rate. A possible reason is that the adversarial examples generated by FGSM, PGD, and ADA-C are completely different from those generated by MIM, so although adversarial training increased the diversity of examples seen by the model, it was still unable to defend against the MIM attack.

VII. CONCLUSION AND FUTURE DIRECTIONS
To explore adversarial examples in the field of speaker recognition, this paper attacks two different speaker recognition models and reveals that speaker models have serious security problems. The proposed attack method compensates for the shortcomings of the traditional attack methods FGSM, PGD, MIM, and CW2: it greatly improves the stealthiness or reduces the generation time of adversarial examples and is applicable to all recognition tasks and different models. Finally, the proposed method is used for adversarial training, and the resulting model robustness is generally better than that of FGSM-based and PGD-based adversarial training.
A limitation of this paper is that the research and experiments are carried out under the white-box attack assumption. The next step will be to study speaker recognition under black-box attacks and to explore defense methods other than adversarial training.