
A Robust CycleGAN-L2 Defense Method for Speaker Recognition System


A Speaker Recognition Defense Method Based On Generative Adversarial Network.


Abstract:

With the rapid development of voice technology, speaker recognition is becoming increasingly prevalent in our daily lives. However, with its increased usage, security issues have become more apparent. The adversarial attack poses a significant security risk to the speaker recognition model by making small changes to the input and thus causing the neural network model to produce an incorrect output. Nevertheless, there are currently few defense techniques for speaker recognition models. To this end, we propose a robust CycleGAN-L2 (CYC-L2) defense method. The method automatically adjusts the size of the dataset according to how well the generative adversarial networks have learned it, and uses an L2 loss function to constrain the generative adversarial networks for better and faster training. In this paper, we compare the effectiveness of existing defenses and the proposed defense against white-box attacks. The experimental results show that our defense method not only achieves a better defense effect than the other defense methods under the x-vector model but also does not reduce the accuracy of benign examples in closed-set identification.
Published in: IEEE Access ( Volume: 11)
Page(s): 82771 - 82783
Date of Publication: 31 July 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

The speaker recognition system, which determines identity based on specific features of voice signals, is widely used in voice services such as identity recognition for banking, judicial authentication, and personalized services on smart devices. The development of neural networks has improved the performance of speaker recognition systems, but they are vulnerable to the threat of adversarial examples [1]. Therefore, we aim to develop defense methods that can be applied in speaker recognition systems and resist various adversarial attacks.

The literature review in [2] points out that most defense methods are based on prior knowledge of attacks and can become ineffective if attackers improve the original attacks to produce stronger ones. Therefore, there is an urgent need for a defense method that resists attacks without impacting the robustness of the original model. Our method addresses precisely this problem by effectively defending against different attacks with minimal impact on the model’s accuracy.

To explore the relationship between generative models and robustness, this paper uses generative adversarial networks to learn the distribution of input features. Our method is inspired by [3] in the image domain. Unlike ours, their generator learns only the mapping from the distribution of adversarial examples to the distribution of benign examples in order to eliminate adversarial examples. In a real production scenario, however, a defense inevitably receives benign data as well. Therefore, our generator learns to map natural data, which contains both adversarial and benign examples, to benign data. The model in this paper is based on CycleGAN-VC2 [4] from the speech synthesis domain. For better defense against adversarial examples, we make minor modifications to the loss function of CycleGAN-VC2, and we name the resulting defense methods CYC-L1 and CYC-L2 according to the loss function used.

Our defense approach has the following contributions:

  • We propose a defense against adversarial examples in the field of speaker recognition based on the CycleGAN-VC2 model. This method eliminates the perturbations existing in natural data by mapping natural data to benign data.

  • CycleGAN-VC2 originally uses the L_{1} distance for model training, and it has not previously been applied with this loss to defense in speaker recognition. When we tried the L_{1} distance for defending against adversarial examples, we found it less effective than a defense using the L_{2} distance. CycleGAN-VC2 is trained on data where one speaker corresponds to one speaker, whereas our defense is trained on data where multiple speakers correspond to multiple speakers. The L_{2} distance encourages the model to select more features, which fits our defense scenario, so we chose the L_{2} distance for the defense. Experimental results show that in most cases the method using the L_{2} distance achieves better defense results. Moreover, we use a method similar to decremental learning in the training process, which eases the difficulty of training generative adversarial networks and greatly reduces the training time.

  • To test the robustness of our method, we defend against different attacks and compare them with other defenses. The experimental results show that our defense method is better than other defense methods.

The remainder of this paper is divided into six sections. Section II focuses on the basics of speaker recognition. Then, we introduce the attack methods and defense methods used in the experiments of this paper. In Section III, we discuss the work related to defense in speaker recognition systems. In Section IV, we describe the proposed method in detail. In Section V, we describe the dataset, experimental environment, parameter settings, and model architecture, and provide a detailed description of the evaluation metrics. In Section VI, we present and analyze the experimental results. Finally, in Section VII, we give an overview of the effectiveness of the defense method proposed in this paper and present the shortcomings of the experiments and future research directions.

SECTION II.

Background

A. Basics of Speaker Recognition

Speaker recognition, also called voice recognition, is a technology that distinguishes speakers based on the characteristics of the speaker’s voice. The flow of a typical speaker recognition system is shown in Figure 1. To recognize a user, the system first needs to extract relevant features such as MFCC [5]. Then the model is trained and stored in a model database. In the testing phase, features are extracted from the voice to be recognized and compared with the data in the database for similarity, scored, and judged to determine the speaker of the current voice.

FIGURE 1. Flowchart of the speaker recognition system.

In addition, speaker recognition systems have three main tasks: closed-set identification (CSI), speaker verification (SV), and open-set identification (OSI). For CSI, the recognition result is the registered speaker with the highest matching score. For SV, the recognition result depends on whether the matching score is higher than a threshold: if it is higher, the speaker is accepted; otherwise, it is rejected. For OSI, the recognition result must satisfy two conditions: it must be the highest score in the set, and the score must be higher than the preset threshold \theta . If the highest score is below the threshold, the voice is recognized as an impostor. The decision module is shown below:\begin{align*} D(x)=\begin{cases} \displaystyle \mathop {\arg \max }\limits _{i\in G}{[S(x)]}_{i}, & \text {if } \mathop {\max }\limits _{i\in G}{[S(x)]}_{i}\ge \theta;\\ \displaystyle \text {imposter}, & \text {otherwise}\end{cases} \tag{1}\end{align*}

where the parameter \theta is a preset threshold, G is the set of registered speakers, i is a specific speaker, and S(x) is the scoring function.
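As a concrete illustration, the decision module in Eq. (1) can be written as a short Python routine (a minimal sketch; the scoring function and threshold come from the deployed system, and the names are illustrative):

import numpy as np

def osi_decide(scores, theta):
    # scores: the vector [S(x)]_i over the enrolled speaker set G
    best = int(np.argmax(scores))
    # accept the top-scoring speaker only if the score clears the threshold
    return best if scores[best] >= theta else "imposter"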

B. Adversarial Attack

Adversarial examples are also known as adversarial attacks. The process adds a small perturbation to a benign example to make the speaker recognition model misclassify it. The adversarial example is formulated as follows:\begin{equation*} x^{\prime }=x+\delta \quad \text {such that } \Vert \delta \Vert < \epsilon \tag{2}\end{equation*}

where x is the original example, \delta is the added perturbation, and x^{\prime} is the generated adversarial example.

Adversarial attacks can be categorized into targeted and untargeted attacks, as well as black-box and white-box attacks. Targeted attacks aim to change the model’s output to a specific incorrect class, whereas untargeted attacks only need the model to misclassify. White-box attackers have full access to the target model, while black-box attackers have no access to its internal information.

FGSM (Fast Gradient Sign Method) [6] attack is a simple gradient-based method in the field of adversarial attacks. Its basic formula is as follows:\begin{equation*} x^{\prime }=x+\varepsilon \cdot sign(\nabla _{x}L(x,y)) \tag{3}\end{equation*}

where \varepsilon is the perturbation strength, \nabla _{x}L(x,y) is the gradient of the loss function, and sign(\ast) is the sign function.
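A minimal PyTorch sketch of Eq. (3) is shown below, assuming a differentiable model and loss function (names are illustrative, not the authors' code):

import torch

def fgsm(model, loss_fn, x, y, eps):
    # one-step FGSM: move x along the sign of the input gradient
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()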

MI-FGSM (Momentum Iterative Fast Gradient Sign Method) [7] attack is an upgrade of the FGSM attack, with the addition of momentum and multiple iterations to the FGSM attack. The basic formula is as follows:\begin{align*} g_{k+1}&=\mu \cdot g_{k}+\frac {\nabla _{x}L(x_{k}^{\prime },y)}{{\parallel \nabla _{x}L(x_{k}^{\prime },y)\parallel }_{1}} \tag{4}\\ x_{k+1}^{\prime }&=x_{k}^{\prime }+\alpha \cdot sign\left ({g_{k+1} }\right) \tag{5}\end{align*}

where \mu is the decay factor that controls the gradient momentum, g_{k} is the previously accumulated gradient, g_{k+1} is the current gradient after adding momentum, and \alpha is the per-iteration perturbation size.

CW2 (Carlini & Wagner) [8] is an optimization-based attack that produces a perturbation so small that it is barely perceptible in the audio. It extends the constraint range from [0,1] to [-\infty , +\infty ] by using \omega in place of the adversarial example, formulated as follows:\begin{align*} &{x_{i}+\delta }_{i}=\frac {1}{2}(\tanh (\omega _{i})+1) \tag{6}\\ &\mathrm {minimize}~{\left \Vert \frac {1}{2}\left ({\tanh \left ({\omega }\right)+1 }\right)-x\right \Vert }_{2}^{2}+c\cdot f\left ({\frac {1}{2}(\tanh (\omega)+1)}\right) \tag{7}\\ &f(x^{\prime})=\max \left ({\mathop {\max }\limits _{i\ne t}Z\left ({x^{\prime} }\right)_{i}-Z{(x^{\prime})}_{t},-\kappa }\right) \tag{8}\end{align*}


The CW2 attack optimizes two objectives. The first objective is that the gap between the generated adversarial example x^{\prime} and the original example x is as small as possible, with the gap constrained by the L_{2} distance. The second objective is that the generated adversarial example x^{\prime} causes the model Z to classify it as the specified target t .

PGD (Project Gradient Descent) [9] is a gradient-based attack that improves on the FGSM attack. The PGD attack performs multiple FGSM iterations and clips each iteration to the specified range. The basic formula is as follows:\begin{equation*} x_{t+1}^{\prime} ={Clip}_{x,\epsilon }\left ({x_{t}^{\prime} +\alpha \cdot sign(\nabla _{x}L(x_{t}^{\prime},y;\theta)) }\right) \tag{9}\end{equation*}

where Clip(\ast) clips any perturbation greater than \epsilon (the maximum perturbation strength), and \alpha is the perturbation strength of each iteration. In addition, the PGD attack starts from a random point within the bounded range around the starting point.
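The iteration in Eq. (9), including the random start, can be sketched as follows (an illustrative PyTorch implementation, not the authors' code; x is assumed not to require gradients):

import torch

def pgd(model, loss_fn, x, y, eps, alpha, steps=30):
    # random start inside the eps-ball around x
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # gradient-sign step, then project (clip) back into the eps-ball around x
        x_adv = x + (x_adv + alpha * grad.sign() - x).clamp(-eps, eps)
    return x_adv.detach()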

ADA [10] is an improvement of the PGD attack. The main idea of the ADA attack is to dynamically adjust the perturbation strength. It uses cosine annealing to decrease the step size, so that the added perturbation shrinks over the iterations, yielding stealthier adversarial examples without affecting the success rate of the attack.

C. Adversarial Defense

Adversarial defense is mainly done by various methods to enhance the robustness and security of the model against adversarial examples.

QT (Quantization) [11] defense is to limit the amplitude of the sound to an integer multiple of the parameter q . Since the amplitude of the perturbation is usually small in the input space, the QT defense can eliminate the perturbation.\begin{equation*} \lfloor \frac {audio}{q}+0.5\rfloor \ast q \tag{10}\end{equation*}

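Eq. (10) translates directly into code; the sketch below assumes an integer-valued (e.g., 16-bit) waveform array:

import numpy as np

def quantize_defense(audio, q=512):
    # round every amplitude to the nearest integer multiple of q (Eq. 10)
    return np.floor(audio / q + 0.5) * q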

AS (Average Smoothing) [12] defense fixes a sample point x , takes the k samples before and after x as reference points, and replaces the value of x with the mean of the reference points. Modifying the audio in this way reduces the impact of adversarial examples. MS (Median Smoothing) [11] defense instead replaces the value with the median of the reference points. However, AS and MS defenses may degrade the audio quality and thus affect the performance of the model.

QT, AS, and MS are defense methods based on the time domain. However, QT defense is ineffective against adversarial examples with high perturbation strength. AS defense and MS defense may degrade the audio quality and have an impact on the performance of the model.
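For reference, AS and MS can be sketched as sliding-window filters; the window arithmetic below (2k+1 samples centered on x) is an assumption, and the experiments later set the number of reference points to 17:

import numpy as np
from scipy.ndimage import median_filter

def average_smoothing(audio, k=8):
    # AS: replace each sample with the mean of a (2k+1)-sample window around it
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    return np.convolve(audio, kernel, mode="same")

def median_smoothing(audio, k=8):
    # MS: same window, but take the median instead of the mean
    return median_filter(audio, size=2 * k + 1)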

DS (Down Sampling) [13] defense eliminates the perturbation by reducing the sampling rate of the audio and then recovering it. Specifically, the downsampling factor \tau (\tau < 1 ) is set, the audio is downsampled according to \tau , and the downsampled audio is then recovered by upsampling.\begin{equation*} \tau =\frac {new\_audio\_sample}{ori\_audio\_sample} \tag{11}\end{equation*}

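A sketch of the down-then-up resampling in Eq. (11), using librosa for illustration (any resampler would do; the 16 kHz default is an assumption):

import librosa

def downsample_defense(audio, sr=16000, tau=0.5):
    # resample down by a factor tau, then back up to the original rate (Eq. 11)
    low_sr = int(sr * tau)
    low = librosa.resample(audio, orig_sr=sr, target_sr=low_sr)
    return librosa.resample(low, orig_sr=low_sr, target_sr=sr)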

LPF (Low Pass Filter) [14] defense is to filter the high frequency sounds by setting a low pass filter. BPF (Band Pass Filter) [15] defense is to set up a filter that filters both high and low frequency sounds.

DS, LPF, and BPF are defense methods based on the frequency domain. The DS method requires the sampling frequency to be at least twice the highest frequency in the signal to avoid sampling distortion and aliasing. LPF cannot defend against low-frequency adversarial noise. Since different adversarial examples may have different frequency characteristics, choosing an appropriate passband for BPF is challenging. It is therefore necessary to select the filter parameters according to the specific situation, which may involve extensive experimentation and cannot be applied directly in practice.

OPUS [16] and SPEEX [17] are two audio codecs that compress audio and handle noise during compression. OPUS processes noise in two stages, NSA (Noise Shaping Analysis) and NSQ (Noise Shaping Quantization). The NSA stage finds the compensation gain G and the filter coefficients used in NSQ. In the NSQ stage, the noise is filtered by the following equation:\begin{equation*} Y(z)=G\cdot \frac {1-F_{ana}\left ({z }\right)}{1-F_{syn}(z)}\cdot X(z)+\frac {1}{1-F_{syn}(z)}\cdot Q(z) \tag{12}\end{equation*}

where F_{ana}\left ({z }\right) and F_{syn}\left ({z }\right) are the Analysis Noise Shaping Filter and Synthesis Noise Shaping Filter, respectively, X(z) is the input signal, and Q(z) is the quantization noise. The first term of the equation is the shaped input signal, and the second term is the shaped noise signal.

SPEEX is mainly used for real-time VoIP communication. SPEEX also decreases noise in the encoder to improve the quality of network calls. The treatment of noise is shown below:\begin{equation*} W(z)=\frac {A(z/\gamma _{1})}{A(z/\gamma _{2})} \tag{13}\end{equation*}

View SourceRight-click on figure for MathML and additional features. where A(z) is a linear prediction filter with parameter values of 0.9 and 0.6 for \gamma _{1} and \gamma _{2} , respectively. The linear prediction filter is designed to allow the sound to have varying noise levels at different frequencies. Specifically, the output W\left ({z }\right) has reduced noise in the lower frequency bands of the sound while introducing some noise at higher frequencies. Rajaratnam [15] argues that SPEEX’s defense method is capable of effectively eliminating the perturbation with minimal alteration to the audio.

OPUS and SPEEX are defense methods based on speech compression. Both OPUS and SPEEX use compression algorithms to reduce the size of audio files, resulting in a certain degree of information loss that affects the performance of the speaker recognition system.

To address the shortcomings in time domain, frequency domain and speech compression methods, we propose a speech synthesis-based defense method, which mitigates the audio distortion problem and defends against various attacks.

SECTION III.

Related Work

Sun et al. [18] used an adversarial training approach to enhance the robustness of the model. They used MFCC features as input, used FGSM on each mini-batch to generate adversarial examples, and then trained the model on the adversarial examples with the original labels, dynamically adding the FGSM-generated adversarial data to the training set to enhance the classification ability of the model. They also used a teacher-student (T/S) training method [19] to improve robustness. However, they only used FGSM to enhance the robustness of the model, without testing other attack methods.

Du et al. [12] used adversarial training, audio downsampling, and average filtering for the defense of the adversarial examples. They believe that adversarial training has significant limitations and can only provide an effective defense with fixed parameters; if the attacker increases the perturbation strength or changes to another attack method, adversarial training may not achieve a good defense. In their experiments, both audio downsampling and averaging filtering were found to reduce the success rate of the attack. However, it is worth noting that according to Nyquist’s sampling theorem, downsampling the audio can cause distortion when the sampling rate is lower than twice the highest frequency of the original audio. Similarly, averaging filtering may degrade the audio quality.

Yang et al. [11] proposed a temporal-dependency approach for detecting adversarial examples. The main idea is that, for an audio example a of time length t , the first k (k < t ) portion of a is selected as a^{\prime} ; both a and a^{\prime} are put into the target model and the outputs are compared. For an adversarial example, the outputs for a and a^{\prime} differ greatly because the temporal dependency is broken, whereas a benign example is unaffected and the outputs for a and a^{\prime} are consistent. However, if an attack does not modify the entire time sequence of the audio, temporal-dependency methods may fail to detect it. In addition, they experimented with quantization, smoothing, downsampling, and autoencoder methods. They found that the MagNet autoencoder [20] is very effective in defending against adversarial examples in the image domain but has limited effectiveness for audio.

Yuan et al. [13] employed two methods for detecting adversarial examples: noise-addition detection and reduced-sampling-rate detection. In their experiments, they observed that background noise reduced the success rate of adversarial examples, so they adopted the defense of adding noise. Specifically, they introduced noise n into the input audio x . If the output of the target model for the noise-added audio x+n does not match its output for the original audio x (i.e., f(x+n)\ne f(x) ), the input audio x is considered to contain perturbation. If too much noise is added, however, it can impact the accuracy. The reduced-sampling-rate detection lowers the sampling rate of the input audio x and obtains the output y from the target model. If the model’s output for the original input audio x is y^{\prime} and y^{\prime} \ne y , it is concluded that the input audio x contains perturbation.

Rajaratnam et al. [21] used band pass filters and audio compression methods of AAC [22], MP3 [23], OPUS [16], and SPEEX [17] for the study of adversarial examples defense. They found that AAC and MP3 had the worst defenses among audio compression methods. The method using bandpass filters had better defenses than audio compression methods, but had a greater impact on target model accuracy than the audio compression methods.

Zeng et al. [24] migrated the main idea of multi-version programming (MVP) [25] to adversarial example detection, based on the transferability of adversarial examples [26] and the fact that different recognition systems produce the same output for a single normal speech sample. They compare the outputs of different recognition systems pairwise, and examples whose similarity falls below a given threshold are considered adversarial. However, if the systems cannot effectively recognize benign samples, the detection accuracy decreases.

Esmaeilpour et al. [27] used a class-conditional generative adversarial network [28] for defense. They minimize the relative chord distance between the random noise and the adversarial examples, find the optimal input vector to put into the generator to generate the spectrogram, and then reconstruct the one-dimensional signal based on the original phase information of the adversarial example and the generated spectrogram. Thus, this reconstruction does not add any additional noise to the signal and achieves the purpose of removing the perturbation. However, their method performs poorly in defending against black-box attacks.

SECTION IV.

Proposed Defense Approach: CYC

We will present our approach in the following areas: (A) the overall workflow, (B) the associated loss functions, and (C) the specific training methods.

A. Overall Workflow

Firstly, we preprocess the audio. In preprocessing, we extract four main features: fundamental frequency (F0), spectral envelope (SP), Mel-cepstral coefficients (MCEPs), and the aperiodic parameter (AP). F0 is the lowest-frequency sine wave in the sound signal and is used to determine the starting time position of the sound in each audio clip. SP describes the variation of the amplitudes of different frequencies, obtained by Fourier transform of the sound signal. AP represents external noise whose vibration frequency has no distinct periodicity. MCEPs are low-dimensional features obtained by dimensionality reduction of SP. The mean and standard deviation of F0 and SP are used to reconstruct the audio data, and the MCEPs are mapped into the interval of a standard normal distribution for training the generators and discriminators.
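A sketch of this preprocessing step is shown below. It assumes the WORLD vocoder (via the pyworld package) as the F0/SP/AP extractor and an illustrative MCEP dimensionality; the paper does not name the exact toolchain or dimension:

import numpy as np
import pyworld as pw

def extract_features(wav, fs=16000, n_mceps=36):
    wav = wav.astype(np.float64)
    f0, t = pw.harvest(wav, fs)                         # fundamental frequency (F0)
    sp = pw.cheaptrick(wav, f0, t, fs)                  # spectral envelope (SP)
    ap = pw.d4c(wav, f0, t, fs)                         # aperiodic parameter (AP)
    mceps = pw.code_spectral_envelope(sp, fs, n_mceps)  # low-dimensional MCEPs
    return f0, sp, ap, mceps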

The training processes of the generators and discriminators are shown in (a) and (b) of Figure 2, and the two processes are trained alternately. In (a), the generator G_{nat\mathrm {\to }ori} is trained first, followed by the generator G_{ori\mathrm {\to }nat} . In this process, the input of G_{nat\mathrm {\to }ori} is divided into two stages. In the first stage, the input consists of natural data, including adversarial examples and benign examples; in the second stage, the input is benign data. In both stages, however, the output of G_{nat\mathrm {\to }ori} is generated benign examples. The data generated by G_{nat\mathrm {\to }ori} is treated as the input of G_{ori\mathrm {\to }nat} , and the output of G_{ori\mathrm {\to }nat} is generated adversarial and benign examples. In (b), the generator G_{ori\mathrm {\to }nat} is trained first, followed by the generator G_{nat\mathrm {\to }ori} . The discriminators D_{nat} and D_{ori} are responsible for distinguishing generated data from real data: D_{ori} mainly discriminates whether benign data is generated, while D_{nat} mainly discriminates whether benign and adversarial data are generated. Of the two generators, we use G_{nat\mathrm {\to }ori} as the defense tool and G_{ori\mathrm {\to }nat} to assist in training G_{nat\mathrm {\to }ori} .

FIGURE 2. Training process for generative adversarial networks.

Finally, three parameters, F0, SP, and AP, are needed to synthesize a benign audio. We feed the MCEPs of the natural data into G_{nat\mathrm {\to }ori} to generate the MCEPs of the benign data and expand their dimensionality to obtain the SP of the benign data. We then map the F0 of the natural data to the benign data to obtain the F0 of the benign data. Finally, we use the external-noise parameter AP for the benign data.
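The F0 mapping mentioned above is commonly done with a log-Gaussian normalized transform in CycleGAN-VC pipelines; the sketch below assumes that convention (the statistics are assumed to be computed over log F0 of voiced frames):

import numpy as np

def convert_f0(f0_src, mean_src, std_src, mean_tgt, std_tgt):
    voiced = f0_src > 0
    f0_out = np.zeros_like(f0_src)
    # match the log-F0 statistics of the natural data to those of the benign data
    f0_out[voiced] = np.exp(
        (np.log(f0_src[voiced]) - mean_src) / std_src * std_tgt + mean_tgt
    )
    return f0_out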

B. Loss Function

The objective optimization function of GAN [29] is the key to training. For this purpose, we designed loss functions L_{G} and L_{D} for the generators and discriminators, respectively. For the generators, we define the loss function as:\begin{equation*} L_{G}=L_{adv1}^{G}+{\alpha L}_{cyc}+{\beta L}_{id}+L_{adv2}^{G} \tag{14}\end{equation*}

where L_{adv1}^{G} and L_{adv2}^{G} are the adversarial losses of the generators, L_{cyc} is the cycle-consistency loss, L_{id} is the identity-mapping loss, and \alpha and \beta are weight parameters with initial values of 10 and 5.

1) Adversarial Loss of Generator

The adversarial loss of the generators is divided into the first step adversarial loss L_{adv1}^{G} and the second step adversarial loss L_{adv2}^{G} . The loss function of L_{adv1}^{G} is as follows:\begin{align*} L_{adv1}^{G}\left ({G_{nat\to ori} }\right)&=\mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ \log \left ({1-D_{ori}\left ({G_{nat\to ori}\left ({x }\right) }\right) }\right) }\right] \\ &\quad + \!\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}\left [{ \log \left ({1\!-\!D_{nat}\left ({G_{ori\to nat}\left ({y }\right) }\right) }\right) }\right] \tag{15}\end{align*}

where x\in X and y\in Y , X represents the set of natural data (\left \{{ adversarial, benign }\right \}\subseteq X ), Y represents the set of benign data (\left \{{ benign }\right \}\subseteq Y ), and adversarial and benign are the adversarial and benign examples, respectively. D_{ori} is the discriminator that mainly discriminates between benign data and benign data generated by G_{nat\to ori}(x) , forcing the data generated by G_{nat\to ori}(x) to be closer to benign data. Similarly, D_{nat} is the discriminator that discriminates the data generated by G_{ori\to nat}(y) . Specifically, L_{adv1}^{G} measures the difference between real and generated data, and a smaller value of L_{adv1}^{G} indicates that the generator has stronger capability or the discriminator has limited ability to distinguish whether the data is generated. The loss function of L_{adv2}^{G} is as follows:\begin{align*} &\hspace {-.5pc}L_{adv2}^{G}(G_{nat\to ori},G_{ori\to nat}) \\ &=\mathbb {E}_{x\sim P_{X}\left ({x}\right)}[\log (1- D_{nat}(G_{ori\to nat}(G_{nat\to ori}(x))))] \\ &\quad +\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}[\log (1-D_{ori}(G_{nat\to ori}(G_{ori\to nat}(y))))] \tag{16}\end{align*}

Among them, G_{ori\to nat} generates natural data, G_{nat\to ori} generates benign data, and then D_{ori} discriminates the generated benign data, while D_{nat} discriminates the generated natural data, thus L_{adv2}^{G} can supervise both G_{nat\to ori} and G_{ori\to nat} simultaneously. It is argued in the literature [3] that the second step of adversarial loss can mitigate the over-smoothing caused by the cycle-consistency loss L_{cyc} represented by the L_{1} distance.

2) Cycle-Consistency Loss

We replace the L_{1} distance in the original cycle-consistency loss with the L_{2} distance, and the L_{cyc} loss function is as follows:\begin{align*} &\hspace {-.5pc}L_{cyc}\left ({G_{nat\to ori},G_{ori\to nat} }\right) \\ &=\mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ {\parallel G_{ori\to nat}\left ({G_{nat\to ori}\left ({x }\right) }\right)-x\parallel }_{2} }\right] \\ &\quad +\mathbb {E}_{y\sim P_{Y}(y)}[{\parallel G_{nat\to ori}(G_{ori\to nat}(y))-y\parallel }_{2}] \tag{17}\end{align*}


The CycleGAN-VC2 model differs from classical GAN models by introducing L_{cyc} to address the issue of mode collapse in generative adversarial networks. For the generator, the goal of G_{nat\to ori} is to generate benign data that D_{ori} cannot distinguish from real data. If y\in Y , Y has 10 labels, and G_{nat\to ori}\left ({x }\right) considers it sufficient to deceive D_{ori} by mapping x to any one class in Y , then G_{nat\to ori}\left ({x }\right) will collapse to generating a single output. L_{cyc} prevents this mode collapse by supervising the input data x against the reconstruction x^{\prime} (x^{\prime} =G_{ori\to nat}\left ({G_{nat\to ori}\left ({x }\right) }\right)) .

3) Identity-Mapping Loss

CycleGAN-VC2 differs from classical GANs as it incorporates the L_{id} loss function. In our approach, the L_{id} loss function is utilized to alleviate the impact of our defense on benign examples.

Similarly, we use L_{2} distance to express the identity-mapping loss function, and the L_{id} loss function is as follows:\begin{align*} &\hspace {-2pc} L_{id}\left ({G_{nat\to ori},G_{ori\to nat} }\right) \\ &=\mathbb {E}_{y\sim P_{Y}(y)}[{\parallel G_{nat\mathrm {\to }ori}(y\mathrm {)-}y\parallel }_{2}] \\ &\quad +\mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ {\parallel G_{ori\to nat}\left ({x }\right)-x\parallel }_{2} }\right] \tag{18}\end{align*}


The input of G_{nat\to ori} in the identity-mapping loss is benign data and the output is generated benign data, with the L_{2} distance constraining input and output. The natural dataset does not contain just one type of labeled data but data with multiple labels. Using the L_{1} distance to constrain L_{id} and L_{cyc} tends to select only a few features, while using the L_{2} distance selects more features, which suits our multi-class natural data.
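Putting Eqs. (14)-(18) together, the generator objective can be assembled as in the PyTorch-style sketch below. The least-squares form of the adversarial terms mirrors the updates in Algorithm 1 further down, whereas the equations above are written in the log form; all names are illustrative.

import torch

def generator_loss(x, y, G_n2o, G_o2n, D_ori, D_nat, alpha=10.0, beta=5.0):
    y_fake = G_n2o(x)            # natural -> benign
    x_fake = G_o2n(y)            # benign  -> natural
    x_cycle = G_o2n(y_fake)      # natural -> benign -> natural
    y_cycle = G_n2o(x_fake)      # benign  -> natural -> benign

    # first- and second-step adversarial losses (Eqs. 15-16)
    l_adv1 = ((D_ori(y_fake) - 1) ** 2).mean() + ((D_nat(x_fake) - 1) ** 2).mean()
    l_adv2 = ((D_nat(x_cycle) - 1) ** 2).mean() + ((D_ori(y_cycle) - 1) ** 2).mean()
    # cycle-consistency and identity-mapping losses with L2 distance (Eqs. 17-18)
    l_cyc = torch.norm(x_cycle - x, p=2) + torch.norm(y_cycle - y, p=2)
    l_id = torch.norm(G_n2o(y) - y, p=2) + torch.norm(G_o2n(x) - x, p=2)
    return l_adv1 + alpha * l_cyc + beta * l_id + l_adv2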

4) Adversarial Loss of Discriminator

For the discriminators, we define the loss as:\begin{equation*} L_{D}=L_{adv1}^{D}+L_{adv2}^{D} \tag{19}\end{equation*}


L_{adv1}^{D} is the first-step adversarial loss of the discriminators and L_{adv2}^{D} is the second-step adversarial loss of the discriminators. The first-step adversarial loss is as follows:\begin{align*} L_{adv1}^{D}\left ({D_{ori},D_{nat} }\right)&=\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}\left [{ \log D_{ori}\left ({y }\right) }\right] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ \log \left ({1-D_{ori}\left ({G_{nat\to ori}\left ({x }\right) }\right) }\right) }\right] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}\left [{ \log D_{nat}\left ({x }\right) }\right] \\ &\quad + \mathbb {E}_{y\sim P_{Y}\left ({y }\right)}\left [{ \log \left ({1-D_{nat}\left ({G_{ori\to nat}\left ({y }\right) }\right) }\right) }\right] \tag{20}\end{align*}


The discriminator D_{ori} distinguishes between benign data and benign data generated by the generator, while the discriminator D_{nat} distinguishes between natural data and natural data generated by the generator. D\left ({\ast }\right)\in [{0,1}] : when D\left ({\ast }\right)=1 , the discriminator considers the input data real, and when D\left ({\ast }\right)=0 , it considers the input data fake. The second-step adversarial loss is as follows:\begin{align*} &\hspace {-0.5pc}L_{adv2}^{D}(D_{ori},D_{nat}) \\ &=\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}[\log D_{ori}(y)] \\ &\quad +\mathbb {E}_{y\sim P_{Y}\left ({y }\right)}[\log \left ({1-D_{ori}\left ({G_{nat\to ori}\left ({G_{ori\to nat}\left ({y }\right) }\right) }\right) }\right)] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}[\log D_{nat}(x)] \\ &\quad + \mathbb {E}_{x\sim P_{X}\left ({x }\right)}[\log \left ({1-D_{nat}\left ({G_{ori\to nat}\left ({G_{nat\to ori}\left ({x }\right) }\right) }\right) }\right)] \tag{21}\end{align*}


The discriminator D_{ori} aims to distinguish between benign data and benign data generated by the two generators G_{ori\to nat} and G_{nat\to ori} . The discriminator D_{nat} aims to distinguish between natural data and natural data generated by both generators G_{nat\to ori} and G_{ori\to nat} .
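The corresponding discriminator objective (Eqs. 19-21) can be sketched the same way; generator outputs are computed without gradients so that only the discriminators are updated. As with the generator sketch, the least-squares form is an assumption consistent with the updates in Algorithm 1 below.

import torch

def discriminator_loss(x, y, G_n2o, G_o2n, D_ori, D_nat):
    with torch.no_grad():
        y_fake = G_n2o(x)        # fake benign data
        x_fake = G_o2n(y)        # fake natural data
        x_cycle = G_o2n(y_fake)  # fake natural data from the cycle
        y_cycle = G_n2o(x_fake)  # fake benign data from the cycle

    # first-step adversarial loss (Eq. 20): real data vs. single-generator fakes
    l_adv1 = ((D_ori(y) - 1) ** 2).mean() + (D_ori(y_fake) ** 2).mean() \
           + ((D_nat(x) - 1) ** 2).mean() + (D_nat(x_fake) ** 2).mean()
    # second-step adversarial loss (Eq. 21): real data vs. cycled fakes
    l_adv2 = ((D_ori(y) - 1) ** 2).mean() + (D_ori(y_cycle) ** 2).mean() \
           + ((D_nat(x) - 1) ** 2).mean() + (D_nat(x_cycle) ** 2).mean()
    return l_adv1 + l_adv2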

Algorithm 1 CYC-L2

Input: natural example x , origin example y , adversarial example x^{\prime } , number of epochs E , number of iterations K , generators G_{nat\to ori},G_{ori\to nat} and discriminators D_{ori},D_{nat} , parameters \alpha and \beta , classifier C

Output: benign example y_{fake}

1: Initialize
2: for e\leftarrow 1 to E do
3:   for k\leftarrow 1 to K do
4:     y_{fake}\leftarrow G_{nat\to ori}\left ({x }\right)
5:     x_{cycle}\leftarrow G_{ori\to nat}\left ({y_{fake} }\right)
6:     y_{identity}\leftarrow G_{nat\to ori}\left ({y }\right)
7:     x_{fake}\leftarrow G_{ori\to nat}\left ({y }\right)
8:     y_{cycle}\leftarrow G_{nat\to ori}\left ({x_{fake} }\right)
9:     x_{identity}\leftarrow G_{ori\to nat}\left ({x }\right)
10:    Update L_{D} :
11:      \nabla ({(0-D_{ori}(y_{fake}))}^{2}+{(1-D_{ori}(y))}^{2}+{(0-D_{nat}(x_{cycle}))}^{2}+{(1-D_{nat}(x))}^{2})
12:      \nabla ({(0-D_{nat}(x_{fake}))}^{2}+{(1-D_{nat}(x))}^{2}+{(0-D_{ori}(y_{cycle}))}^{2}+{(1-D_{ori}(y))}^{2})
13:    Update L_{G} :
14:      \nabla ({(1-D_{ori}(y_{fake}))}^{2}+{(1-D_{nat}(x_{cycle}))}^{2}+{(1-D_{nat}(x_{fake}))}^{2}+{(1-D_{ori}(y_{cycle}))}^{2})
15:      \nabla (\alpha (\sqrt [{2}]{{(y-y_{cycle})}^{2}}+\sqrt [{2}]{{(x-x_{cycle})}^{2}})+\beta (\sqrt [{2}]{{(y-y_{identity})}^{2}}+\sqrt [{2}]{{(x-x_{identity})}^{2}}))
16:   end for
17:   if e \% 100 = 0 then
18:     if {C\left ({G_{nat\to ori}\left ({x }\right) }\right)}_{k}\ge 0.98 and {C\left ({G_{nat\to ori}\left ({x }\right) }\right)}_{k}-{C\left ({G_{nat\to ori}\left ({x }\right) }\right)}_{k-1}\le 0 then
19:       \beta \leftarrow 0,\ x\leftarrow x^{\prime }
20:     end if
21:   end if
22: end for
23: Return y_{fake}

C. Training Method

At the beginning of training, we train two generators and two discriminators. The generators are G_{nat\to ori} and G_{ori\to nat} . G_{nat\to ori} is mapping the natural data distribution to the benign data distribution, and G_{ori\to nat} is mapping the benign data distribution to the natural data distribution. The discriminators are D_{ori} and D_{nat} . D_{ori} distinguishes between benign data and fake benign data, and D_{nat} distinguishes between natural data and fake natural data.

Partway through training, we test the ability of G_{nat\to ori} to generate benign examples. When the input is a benign example, the output of G_{nat\to ori} can reach an accuracy of 98% on the classifier. If, while continuing to train G_{nat\to ori} , the classifier accuracy on its output remains unchanged or decreases, we remove the benign examples from the natural data and set the parameter \beta to 0.

Training ends when, with adversarial examples as the input of G_{nat\to ori} , the classifier accuracy on its output remains unchanged or decreases.
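The epoch-boundary check described above (steps 17-19 of Algorithm 1) can be sketched as a simple helper; the variable names and the two-entry accuracy history are illustrative:

def maybe_shrink_dataset(epoch, acc_history, natural_set, adversarial_set, beta):
    # every 100 epochs: if the classifier accuracy on G_nat->ori outputs has reached
    # 98% and stopped improving, keep only adversarial examples and set beta to 0
    if epoch % 100 == 0 and len(acc_history) >= 2:
        if acc_history[-1] >= 0.98 and acc_history[-1] <= acc_history[-2]:
            return adversarial_set, 0.0
    return natural_set, beta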

SECTION V.

Experimental Setup

A. Dataset and Experimental Environment

The benign data is sourced from Librispeech [30]. After selection, 10 individuals were chosen (5 males and 5 females). For each person, 110 audio samples were selected for the experiment, with 10 randomly selected samples used for speaker enrollment and the remaining 100 used for testing the model. In addition, since the PGD attack is the strongest first-order attack, a defense that works well against adversarial examples generated by PGD should also defend against adversarial examples generated by other attacks. We therefore use the adversarial examples generated by the PGD attack after 30 iterations as the adversarial data, and merge the adversarial data and benign data into the natural data.

This experiment was implemented on an Intel i7-12700KF with 3.6GHz CPU, NVIDIA GeForce RTX 3080Ti with 12GB of video memory, and 64GB of RAM on an Ubuntu 20.04 system.

B. Model Introduction

We use the generators and discriminators from CycleGAN-VC2 for training to achieve defense in speaker recognition systems, and we use the x-vector model as the target model.

The x-vector [31] model is currently the most widely used baseline framework in speaker recognition. Its complete structure is shown in Figure 3: the first five layers of the x-vector are TDNN layers that capture frame-level features, the sixth layer is a statistics pooling layer that aggregates the features captured by the fifth layer, the seventh and eighth layers are segment-level fully connected layers, and the ninth layer is a SoftMax layer. The embedding a is extracted from the seventh layer as the speaker’s feature vector.

FIGURE 3. x-vector model architecture.

To alleviate the problem of too many parameters in the last layer of the discriminator, the patch-GAN [32] architecture shown in Figure 4 is used, in which the last layer of the model is a convolutional layer that maps the input to an \text{N}\ast \text{N} matrix. This \text{N}\ast \text{N} matrix is then used to evaluate the generated speech. Because 2D CNNs (Convolutional Neural Networks) consider both temporal and frequency information, they provide a more comprehensive understanding of the sound signal, whereas 1D CNNs are better suited to capturing dynamic changes and local patterns in sequential data. Therefore, the generator uses the 2-1-2D CNN architecture shown in Figure 5.

FIGURE 4. Discriminator model architecture.

FIGURE 5. Generator model architecture.

C. Metrics

In this paper, {acc}_{adv} represents the accuracy of identifying the adversarial examples, and {acc}_{ben} represents the accuracy of identifying the benign examples. The definitions are as follows:\begin{align*} {acc}_{adv}&=c_{adv}/n_{adv} \tag{22}\\ {acc}_{ben}&=c_{ben}/n_{ben} \tag{23}\end{align*}


n_{adv} is the number of adversarial examples generated, and c_{adv} is the number of those examples that are correctly classified after defense processing. The larger the value of {acc}_{adv} , the better the defense converts adversarial examples into benign examples.

n_{ben} is the number of benign examples, and c_{ben} is the number of benign examples that are still correctly classified after defense processing. A larger value of {acc}_{ben} means the defense has a smaller effect on benign examples. In summary, larger values of {acc}_{adv} and {acc}_{ben} represent better defenses.

ASR represents the attack success rate of adversarial examples. The higher the ASR, the lower the robustness of the model. The definitions are as follows:\begin{equation*} ASR=\frac {Num(f(x^{\prime })\ne y)}{Num(f(x)=y)} \tag{24}\end{equation*}

where Num(\ast) represents the number. If it is a targeted attack, the numerator becomes Num(f\left ({x^{\prime } }\right)=y) .
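A sketch of Eqs. (22)-(24) follows; for the ASR, the denominator is read here as the count of examples the undefended model classifies correctly, which is one reasonable interpretation of Eq. (24):

import numpy as np

def accuracy(preds, labels):
    # acc_adv / acc_ben: fraction of examples recognized correctly after the defense
    return float(np.mean(np.asarray(preds) == np.asarray(labels)))

def attack_success_rate(preds_adv, preds_clean, labels, target=None):
    clean_ok = np.asarray(preds_clean) == np.asarray(labels)   # Num(f(x) = y)
    if target is None:                                         # untargeted attack
        fooled = np.asarray(preds_adv) != np.asarray(labels)   # Num(f(x') != y)
    else:                                                      # targeted attack
        fooled = np.asarray(preds_adv) == target               # Num(f(x') = target)
    return float(np.sum(fooled & clean_ok) / np.sum(clean_ok))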

D. Algorithm Parameter Setting

We set the perturbation strength of FGSM, MIM, PGD, and ADA to 0.002, and the confidence level of CW2 to 10, noted as CW2-10. The higher the confidence level, the higher the strength of CW2. For MIM, PGD, and ADA, we perform 30 iterations to generate adversarial examples respectively, noted as MIM-30, PGD-30, ADA-30. And we set the threshold value of open-set identification as 13.76, and the accuracy of the x-vector model is 99.2% under this threshold; the accuracy of the x-vector model in closed-set identification is 99.9%.

We set the parameter of QT defense to 512, the sampling rate factor of DS to 0.5, and the number of reference points to 17 when performing AS or MS defense. CYC-L1 and CYC-L2 constrain the loss function using the L_{1} and L_{2} distances, respectively. The experimental results of CYC using the L_{1} and L_{2} distances under PGD-30 are shown in Figure 6. The stopband edge frequency of LPF is set to 4400 Hz, and the stopband edge frequencies of BPF are set to 160 Hz and 5900 Hz. The bit rates of the OPUS and SPEEX output audio are set to 4000 and 7200, respectively.

FIGURE 6. CYC defense results in PGD-30 attacks.

SECTION VI.

Experimental Results

A. Effectiveness Analysis of Various Attack Algorithms in Speaker Recognition

Table 1 shows the success rates of targeted and untargeted attacks in closed-set identification and open-set identification. The success rate of the FGSM attack is lower than that of the other four attacks; in particular, the hard-label success rate in OSI is only 1.01%, because FGSM performs only one iteration and cannot find the gradient update direction well. The other methods perform multiple iterations and reach a 100% attack success rate against the undefended x-vector model.

TABLE 1. Experimental Results of Various Attack Algorithms in Closed-Set Identification and Open-Set Identification

B. Analysis of Defense Effects in Closed-Set Identification

Most defense methods reduce {acc}_{ben} to varying degrees, and the decrease of {acc}_{ben} reflects the side effects of a defense method. According to the experimental results in Table 2, MS defense causes the greatest decrease in {acc}_{ben} , bringing it down to 76.6%, which indicates that MS introduces greater audio distortion. QT, DS, OPUS, and SPEEX had less impact on {acc}_{ben} , with decreases of 3.5%, 2.1%, 5.3%, and 4.0%, respectively. AS, LPF, and BPF had almost no impact, with decreases of 0.8%, 0.9%, and 1.7%, respectively.

TABLE 2. Defense of Untargeted Attacks in Closed-Set Identification

Our methods CYC-L1 and CYC-L2 did not reduce the {acc}_{ben} . It is probably because our model is trained with identity-mapping loss, which reinforces the mapping of benign data to benign data, that the side effects of defense are very small or even absent.

In addition, {acc}_{adv} shows the effectiveness of the defense. The success rate of the FGSM untargeted attack in closed-set identification is 36.1%, i.e., the accuracy of the x-vector model under the FGSM attack is 63.9%. According to the FGSM attack analysis in Table 2, AS and OPUS have the worst defense effects, with {acc}_{adv} reaching only 65.9% and 68.2%. DS, LPF, BPF, QT, MS, and SPEEX have slightly better defense effects, with {acc}_{adv} higher than 70%. CYC-L1 and CYC-L2 have the best defense effect, with {acc}_{adv} reaching 94.4% and 94.7%. CYC-L1 and CYC-L2 are also better than most other methods in defending not only against FGSM attacks but also against other attacks. However, CYC-L2 was slightly less effective than QT in defending against ADA, with only a 3.2% difference. This may be because QT effectively reduces the aggressiveness of ADA by limiting the amplitude to an integer multiple of 512.

Tables 3 and 4 show the effectiveness of defense against targeted attacks with simple labels and hard labels in closed-set identification. On the whole, defense methods show better effectiveness against hard label attacks compared to simple label attacks. For example, defending against FGSM attack using CYC-L2, the {acc}_{adv} increases to 94.6% in defending against the simple label attack, while it increases to 99.6% in defending against the hard label attack. This is because generating adversarial examples with hard labels poses greater difficulty in various attacks, which results in better effectiveness in defending against hard label attacks compared to simple label attacks.

TABLE 3. Defense of Simple Targeted Attacks in Closed-Set Identification
TABLE 4. Defense of Hard Targeted Attacks in Closed-Set Identification

Our method is better than other methods in defending against both simple label and hard label attacks. For example, in Table 4, CYC-L2 shows an increase in {acc}_{adv} of 3.6%, 12.7%, 4.9%, 4.4%, and 4.6% in defending against FGSM, MIM, PGD, CW2, and ADA attacks, respectively, compared to the best defending performance of the compared methods.

C. Analysis of Defense Effect in Open-Set Identification

In open-set identification, the side effects of the different defense methods are more obvious because open-set identification uses a threshold. If the defense causes significant distortion to benign audio, the audio is likely to be considered an imposter by the model, leading to a decrease in {acc}_{ben} . According to the analysis in Table 5, CYC-L2 has fewer side effects than the other defenses. In open-set identification, after applying the CYC-L2 defense, {acc}_{ben} is 97.7%, a decrease of only 1.5%; after applying the CYC-L1 defense, {acc}_{ben} is 96.3%, a decrease of 2.9%. This further confirms that CYC-L2 performs better than CYC-L1. CYC-L2 differs in {acc}_{ben} from QT, DS, LPF, and BPF by 8%, 17.3%, 14.1%, and 9.8%, respectively. Among all the defense methods, MS has the largest side effects: after applying MS, {acc}_{ben} is 15.6%, a decrease of 83.6%, indicating that MS distorts the audio so severely that it seriously affects the x-vector. The side effects of the two audio compression-based defenses are very similar, with {acc}_{ben} of 57.8% for OPUS and 63.1% for SPEEX.

TABLE 5. Defense of Untargeted Attacks in Open-Set Identification

CYC-L2 is better than other methods in defending against other attacks. For example, in Table 5, when CYC-L2 defends against FGSM attack, the {acc}_{adv} is 88.3%, an increase of 42.3%. Moreover, the {acc}_{adv} of CYC-L2 differs from CYC-L1, QT, AS, and LPF by 1.1%, 12.3%, 40.3%, and 38.2%, respectively. Additionally, MS, DS, BPF, OPUS, and SPEEX are unable to defend against FGSM attack and even worsen the model’s robustness. For instance, after applying MS defense, the x-vector model achieves {acc}_{adv} of only 4.9% under FGSM attack.

Based on the analysis of various defensive effects in open-set identification, we find that CYC-L2 is far ahead of other methods in open-set identification. For example, in Table 6, when defending against MIM targeted attack with simple labels in open-set identification, CYC-L2 showed differences in {acc}_{adv} by 45.5%, 63.8%, 58.2%, 63.8%, 63.8%, 63.8%, 45.4%, and 47.6% compared to QT, AS, MS, DS, LPF, BPF, OPUS, and SPEEX, respectively.

TABLE 6. Defense of Simple Targeted Attacks in Open-Set Identification

Table 7 displays the effectiveness of defending against targeted attacks with hard labels in open-set identification. In defending against the FGSM hard-label attack, our CYC-L2 method reduces the robustness of the model: with CYC-L2, the {acc}_{adv} of the x-vector model decreases to 90.4%. However, compared to CYC-L1, QT, AS, MS, DS, LPF, BPF, OPUS, and SPEEX, CYC-L2 exhibits a difference in {acc}_{adv} of 0.4%, 11.4%, 32.2%, 85.9%, 44.0%, 34.2%, 40.5%, 49.5%, and 65.6%, respectively. Therefore, CYC-L2 has only a minor impact on model robustness.

TABLE 7. Defense of Hard Targeted Attacks in Open-Set Identification

We also plotted waveforms and spectrograms to visualize the defense against MIM, as shown in Figures 7 and 8. From the waveforms, it can be observed that the waveform generated by the CYC-L2 defense against MIM does not differ significantly from the original waveform; it only slightly reduces the amplitude. However, the waveforms generated by the other methods exhibit significant differences from the original waveform. Specifically, AS, MS, DS, LPF, BPF, OPUS, and SPEEX reach maximum amplitude values, indicating severe audio distortion.

FIGURE 7. The speaker with id 2: Waveforms for defense against hard targeted attacks in open-set identification.

FIGURE 8. The speaker with id 2: Spectrograms for defense against hard targeted attacks in open-set identification.

From the spectrograms, we can see that the color of the original audio’s spectrogram is blue, indicating lower energy in the frequency range. The MIM attack introduces noise, causing the dark blue parts of the original audio to become light blue. Our CYC-L2 method removes some perturbations, converting part of the light blue areas back to dark blue. On the other hand, QT, AS, MS, DS, LPF, BPF, OPUS, and SPEEX introduce high-frequency noise, resulting in the overall color of the spectrograms turning red or even yellow.

SECTION VII.

Conclusion and Future Directions

In this paper, we propose a new defense method and conduct a comprehensive study of defense methods for speaker recognition systems. Multiple defenses such as CYC, QT, and AS were evaluated against five attack methods. The experimental results show that our CYC-L2 defense has almost no impact on benign examples and has a better defense effect on adversarial examples.

There are two shortcomings in this paper. Firstly, our method is only evaluated against white-box attacks and not against black-box attacks. However, since most black-box attacks are based on optimization methods, we believe the defense effect against black-box attacks should be similar to the defense effect against CW2, and we will verify this hypothesis in the future. Secondly, our method cannot defend against unknown speakers in open-set identification.
