ADVSV: An Over-the-Air Adversarial Attack Dataset for Speaker Verification

It is known that deep neural networks are vulnerable to adversarial attacks. Although Automatic Speaker Verification (ASV) built on top of deep neural networks exhibits robust performance in controlled scenarios, many studies confirm that ASV is vulnerable to adversarial attacks. The lack of a standard dataset is a bottleneck for further research, especially reproducible research. In this study, we developed an open-source adversarial attack dataset for speaker verification research. As an initial step, we focused on the over-the-air attack. An over-the-air adversarial attack involves a perturbation generation algorithm, a loudspeaker, a microphone, and an acoustic environment. The variations in the recording configurations make it very challenging to reproduce previous research. The AdvSV dataset is constructed using the VoxCeleb1 verification test set as its foundation. It employs representative ASV models subjected to adversarial attacks, and adversarial samples are recorded to simulate over-the-air attack settings. The scope of the dataset can be easily extended to include more types of adversarial attacks. The dataset will be released to the public under the CC BY-SA 4.0 license. In addition, we provide a detection baseline for reproducible research.


INTRODUCTION
An Automatic Speaker Verification (ASV) system decides whether a claimed identity is an impostor or a genuine speaker by comparing a presented utterance against an enrolled voice. Although deep learning has significantly enhanced the performance of ASV [1], it is known that ASV is vulnerable to impersonation, replay, voice conversion, and speech synthesis, as discussed in [2]. It is also known that deep neural networks are vulnerable to adversarial attacks [3], which deceive a deep learning model by adding a carefully crafted perturbation to its input. ASV systems built on top of deep neural networks are no exception.
There are two types of adversarial attacks in the context of ASV, namely digital and over-the-air (OTA) adversarial attacks. A digital attack sends a digital copy of an adversarial sample directly to an ASV system, whereas an OTA adversarial attack plays a pre-generated adversarial sample in front of an ASV system. The latter involves a loudspeaker, a microphone, and an acoustic environment or conditions (e.g., room reverberation and background noise). To perform attacks, the projected gradient descent (PGD) [4] algorithm, initially proposed for image classification, is commonly used to generate perturbations. It has since been modified to attack ASV systems digitally. For instance, FoolHD [5] employs a multi-objective loss function to generate adversarial samples that are hard for humans to perceive. FakeBob [6] introduces a threshold estimation algorithm, combined with gradient estimation, to achieve black-box attacks. Zuo et al. [7] propose a speaker-specific utterance ensemble method to enhance the generalization of adversarial attack samples.
Several studies have focused on OTA adversarial attacks. Xie et al. [8] use room impulse responses to simulate room reverberation on the VCTK corpus, increasing the success rates of OTA adversarial attacks. O'Reilly et al. [9] transform bonafide samples into adversarial ones with adaptive filtering using the VoxCeleb2 dataset. With a combination of the Common Voice, CommanderSong [10], and LibriSpeech datasets, Zheng et al. [11] treat decision-only black-box adversarial attacks as a discontinuous large-scale global optimization problem, adaptively decomposing it into subproblems and collaboratively optimizing each one to find a solution. Both digital and OTA adversarial studies confirm security concerns. However, each study develops its dataset with a specific setting.
Various studies have addressed ASV system security concerns with countermeasures against adversarial attacks [12,13,14,15,16]. Adversarial-aware training approaches have been proposed in [17,18,19] to enhance the ASV model's resilience to attacks. An x-vector-based attack signature has been proposed in [20] to detect adversarial perturbations. A diffusion-based approach has been proposed in [15] to remove perturbations for discrimination. In these studies, the VoxCeleb1, VoxCeleb2, Speech Commands, TIMIT, ASVspoof2019, and LibriSpeech datasets are used. Each individual study constructs its own dataset for countermeasure research. Without a benchmark dataset, it is not feasible to perform benchmark comparisons, making reproducible research even more challenging.
Although there is growing concern about the threat of adversarial attacks, the lack of a benchmark dataset is a bottleneck for further research, especially reproducible research. Existing studies on adversarial attacks and countermeasures develop their own datasets, usually for specific purposes. To promote reproducible research, this work presents an open-source dataset on adversarial attacks for speaker verification (AdvSV). OTA adversarial attacks involve a loudspeaker, a microphone, and an acoustic environment or conditions (e.g., room reverberation and background noise). Due to the potential variations of OTA configurations, it is particularly challenging to conduct reproducible research, and the AdvSV dataset is designed to serve that purpose. The focus of this study is over-the-air adversarial attacks, and the scope of the dataset can be easily extended to include more types of adversarial attacks. The AdvSV dataset will be released to the public under the CC BY-SA 4.0 license.

OVER-THE-AIR ADVERSARIAL ATTACK DATASET
This section presents a framework for producing the proposed AdvSV dataset. In this study, we focus on the over-the-air targeted attack, which modifies a sample to attack a specific target speaker's verification model. Fig. 1 presents an illustration of the over-the-air (OTA) adversarial attack, which consists of two steps: perturbation generation and the OTA attack. Both steps are described in this section.

Perturbation generation
To synthesize adversarial samples, this study employs the projected gradient descent (PGD) [4] algorithm. The PGD algorithm adds a perturbation to a testing sample, aiming to manipulate the ASV decision. The adversarial sample synthesis process is formulated in Eq. 1:

x_{t+1} = x_t + α · sign(∇_{x_t} J(x_enroll, x_t, y)),  t = 0, 1, …, S − 1,  (1)

where x_enroll is an enrollment sample, x_t is the adversarial sample at the t-th iteration, y stands for a label (i.e., genuine or impostor), α is the step size, S is the number of steps, J signifies the loss function, and sign is the sign function. When we apply the PGD algorithm, we assume the algorithm knows everything about the model, including the model architecture and parameters.
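As a minimal illustration of the update rule in Eq. 1, the sketch below runs sign-gradient PGD against a toy linear embedding model with a cosine-similarity loss. The matrix `W`, the vector dimensions, and the analytic gradient are hypothetical stand-ins for illustration only, not the actual ASV models used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))  # hypothetical linear "embedding model"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def grad_cosine_wrt_x(x_enroll, x_adv):
    # Analytic gradient of cos(W x_enroll, W x_adv) with respect to x_adv
    a, b = W @ x_enroll, W @ x_adv
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    dcos_db = a / (na * nb) - (a @ b) * b / (na * nb**3)
    return W.T @ dcos_db

def pgd(x_enroll, x, alpha=0.004, steps=20):
    # Eq. 1: x_{t+1} = x_t + alpha * sign(grad J); here J is cosine similarity
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_cosine_wrt_x(x_enroll, x_adv))
    return x_adv

x_enroll = rng.standard_normal(64)  # "enrollment" sample
x_test = rng.standard_normal(64)    # impostor test sample
x_adv = pgd(x_enroll, x_test)
```

After S steps, the L∞ norm of the perturbation is bounded by α·S per dimension, while the cosine score toward the enrolled "speaker" moves upward, which is the mechanism the attack relies on.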
Algorithm 1 Ensemble PGD
1: function EnsemblePGD(x_enroll, x, y, surrogate models M_1, …, M_N)
2:   x_adv ← x
3:   repeat
4:     for i = 1, …, N do
5:       Compute adversarial sample of M_i:
6:       x_adv ← PGD(x_enroll, x_adv, y)
7:     end for
8:   until adversarial sample attacks all surrogate models
9:   return x_adv
10: end function

The PGD algorithm presented in Eq. 1 attacks a single target ASV model. In contrast, an ensemble PGD algorithm attacks several ASV systems at the same time [21]. The procedure is presented in Algorithm 1: the adversarial sample iteratively attacks each victim model until it can spoof all the victim ASV models [21].
Both the PGD and ensemble PGD algorithms require access to the victim model's parameters for gradient calculation. In practice, obtaining all the information about the target ASV model is not feasible. A practical alternative is a transfer attack: synthesize an adversarial sample using one or a few known victim models, then use that sample to attack the target ASV system.
Implementation details: The PGD attack is configured with a step size (α) of 0.004, 20 steps (S), and cosine similarity as the loss function. For the ensemble PGD attack, three ASV models are used as victim models, while the remaining one serves as the target for transfer attacks.
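The loop structure of Algorithm 1 can be sketched under the same toy linear-embedding setup: several hypothetical surrogate models are attacked in turn until the sample scores above a made-up acceptance threshold on all of them. The threshold, model shapes, and iteration counts are all illustrative assumptions, not values from this study.

```python
import numpy as np

rng = np.random.default_rng(1)
surrogates = [rng.standard_normal((16, 64)) for _ in range(3)]  # toy surrogate models
THRESHOLD = 0.25  # hypothetical ASV acceptance threshold

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def grad_step(W, x_enroll, x_adv, alpha):
    # One sign-gradient PGD step on cos(W x_enroll, W x_adv)
    a, b = W @ x_enroll, W @ x_adv
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    dcos_db = a / (na * nb) - (a @ b) * b / (na * nb**3)
    return x_adv + alpha * np.sign(W.T @ dcos_db)

def ensemble_pgd(x_enroll, x, alpha=0.004, inner_steps=10, max_rounds=1000):
    # Algorithm 1: keep attacking each surrogate until all of them accept x_adv
    x_adv = x.copy()
    for _ in range(max_rounds):
        scores = [cosine(W @ x_enroll, W @ x_adv) for W in surrogates]
        if all(s > THRESHOLD for s in scores):
            return x_adv, True
        for W in surrogates:
            for _ in range(inner_steps):
                x_adv = grad_step(W, x_enroll, x_adv, alpha)
    return x_adv, False

x_enroll = rng.standard_normal(64)
x_adv, fooled_all = ensemble_pgd(x_enroll, rng.standard_normal(64))
```

Because the stopping condition requires every surrogate to be spoofed, the resulting sample tends to transfer better to an unseen model than one tuned to a single surrogate, which is the intuition behind the ensemble attack.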

Over-the-air attack setup
An OTA adversarial attack involves a perturbation generation algorithm, a loudspeaker, a microphone, and a replaying environment. In this work, we simulated the OTA adversarial attack in a soundproof studio to reduce the impact of environmental noise and to focus the dataset on the impact of perturbation generation, loudspeakers, and microphones. These three variables already result in a significant number of combinations.
We chose three types of loudspeakers and three types of recording devices (i.e., microphones). The high-end, medium-end, and low-end loudspeakers are priced at around $300 USD, $90 USD, and $50 USD, respectively. For the recording devices, we chose mobile devices, which are common in our daily lives. The iOS, Android-high, and Android-low devices are priced at around $900 USD, $750 USD, and $310 USD, respectively.
The distance and angle between the microphone and loudspeaker are additional factors. In this study, we simplified them: the distance between the loudspeaker and microphone is set to 0.3 meters, and the angle is set to 90 degrees.

Dataset
To align with existing ASV research, we design the AdvSV dataset based on the VoxCeleb1 dataset, one of the most commonly used datasets for speaker verification. We choose the VoxCeleb1 verification set as the base set to generate adversarial samples. This set comprises 18,860 samples labeled with different speakers.
To reduce the burden of replaying and recording, 25% of the samples were retained (with the same speaker distribution). The proposed AdvSV dataset consists of a total of 314,496 samples. An audio demo is available on the project webpage.

DETECTION OF ADVERSARIAL ATTACKS
In this paper, we provide a countermeasure baseline based on the one-class classification method [22], which is currently the mainstream approach for audio spoofing detection. Its pipeline is shown in Fig. 2. During the training phase, we use the World vocoder and Opus codec to re-synthesize the input audio, and subtract the re-synthesized audio from the input audio to remove information irrelevant to spoofing detection, such as speaker and speech content information. The subtracted audio is then fed into a one-class classifier. The World vocoder is used during the inference phase. As shown in Table 1, the difference between the EERs on the full dataset and the subset is trivial. We presume the subset represents the distribution of the full dataset, and hence we use the subset to construct the AdvSV dataset.
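The subtraction step of the pipeline can be sketched as follows. This is a minimal illustration only: the `resynthesize` argument is a placeholder for World-vocoder (or codec) analysis-synthesis, which is not implemented here, and the STFT parameters are made-up values rather than those of the actual baseline.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=128):
    # Framed, Hann-windowed FFT magnitudes (illustrative parameters)
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def residual_feature(x, resynthesize):
    # Subtract the re-synthesized spectrogram from the original one;
    # the residual is what the one-class classifier consumes.
    return magnitude_spectrogram(x) - magnitude_spectrogram(resynthesize(x))

audio = np.random.default_rng(2).standard_normal(4096)
feat = residual_feature(audio, resynthesize=lambda y: y)  # identity stand-in
```

With the identity stand-in the residual is exactly zero; a real vocoder round-trip leaves artifacts in the residual, while discarding most speaker and content information, which is the property the detector relies on.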

EXPERIMENT
Detection Model: The model is trained on the bonafide or replayed bonafide samples from the VoxCeleb2 dataset, and is tested on the AdvSV dataset, assuming the detector has zero or limited knowledge of the adversarial attacks.
Evaluation metrics: We use attack success rates and equal error rates (EER) [26] as evaluation metrics for adversarial attack performance and detection performance, respectively. The success rate is defined as:

Attack Success Rate = Number of Successful Attacks / Number of Attacks
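Both metrics can be computed in a few lines. The sketch below uses a simple threshold sweep for the EER; the score values in the test are made up for illustration.

```python
def attack_success_rate(num_success, num_total):
    # Attack Success Rate = Number of Successful Attacks / Number of Attacks (in %)
    return 100.0 * num_success / num_total

def equal_error_rate(target_scores, nontarget_scores):
    # Sweep every observed score as a decision threshold and return the
    # operating point where false-acceptance and false-rejection rates
    # are closest (their average approximates the EER).
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

For perfectly separable scores the sweep finds a threshold with zero errors on both sides, so the EER is 0; overlapping score distributions push the EER up.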

Digital Adversarial Attack
We first examine the performance of digital adversarial attacks with the AdvSV dataset. The results are presented in Table 2. It is observed that if the surrogate model is the same as the victim model (i.e., a white-box attack), the success rates are always high, 100% or close to 100%. Note that a white-box attack is expected to achieve a 100% success rate given enough PGD steps; in the experiments, we keep the same number of PGD steps for a fair comparison. Ensemble attacks achieve higher transferability than the single PGD attack, where transferability measures the performance of transfer attacks. In particular, with the PGD algorithm, the success rates are 10.6%, 8.8%, and 14.1% when using RawNet to transfer-attack ECAPA, ResNet, and XVec, respectively. The ensemble PGD increases the success rates of transfer attacks to 62.3%, 62.5%, and 79.5% for ECAPA, ResNet, and XVec, respectively.

In summary, the success rates of PGD white-box attacks are as high as 100% or close to 100%, and ensemble attacks dramatically increase the success rates of transfer attacks.

Over-the-Air Adversarial Attack
We then assess the performance of over-the-air (OTA) adversarial attacks with the AdvSV dataset. Table 3 presents the success rates of OTA adversarial attacks.
For the PGD white-box attacks, the RawNet system behaves differently from the other systems: its success rates are considerably lower. This is because we set the same fixed number of PGD steps for all systems, and RawNet requires more PGD steps to achieve a higher success rate. From the loudspeaker perspective, a high-end loudspeaker produces higher success rates than a low-end one. Similarly, a high-end phone/microphone achieves higher success rates than low-end phones or microphones. Additionally, the success rates of transfer attacks are considerably lower than those of white-box attacks. However, the success rates of transfer attacks still fall within the range of 30%-60%, with RawNet being an exception.
For the ensemble attacks, the success rates of transfer attacks are considerably higher than those of PGD transfer attacks. For example, when attacking the XVec system, ensemble PGD increases the success rates from the range of 7.4%-56.9% to the range of 55.5%-71.9%. The phenomenon observed for RawNet is similar to that under PGD, which is due to the fixed number of PGD steps. As with PGD attacks, high-end loudspeakers or phones give higher success rates than low-end ones.
In summary, when facing OTA adversarial attacks, ASV systems are still vulnerable to transfer attacks, even though the ensemble PGD algorithm has no access to the target system. Different OTA settings can result in different success rates.

Detection of Over-the-Air Adversarial Attacks
Last but not least, we provide a baseline for detecting both digital and OTA adversarial attacks with the AdvSV dataset. The detection results are presented in Table 4.
In row 1, the classifier is trained on bonafide data without the OTA process, but the testing data is processed with the OTA process. The overall detection EER is 6.66%. Rows 2a and 2b present the detection of digital adversarial attack samples. In comparison with row 1, the EER increases to 66.73% and 69.83% for PGD and ensemble PGD attacks, respectively. The detection results suggest that detecting adversarial attacks or adversarial perturbations is a more challenging task than detecting whether an audio sample has gone through the OTA process.

CONCLUSIONS AND FUTURE WORK
We designed an over-the-air adversarial attack dataset for speaker verification, called the AdvSV dataset, which will be released under the CC BY-SA 4.0 license. To develop the dataset, we used three loudspeakers, three microphones, two perturbation generation algorithms, and four state-of-the-art ASV systems. In terms of adversarial attack success rates, the dataset presents a genuine problem: the success rates can be higher than 50% in black-box transfer attacks. In terms of baseline detection performance, there is still a long way to go to develop a successful countermeasure. In future work, we will continue to expand the dataset by considering more realistic product scenarios.

Fig. 1 :
Fig. 1: Illustration of an over-the-air adversarial attack, consisting of (a) perturbation generation and (b) over-the-air attack steps.

Fig. 2 :
Fig. 2: Framework of the baseline system to detect adversarial attacks. '-' means subtracting the re-synthesized spectrogram from the original spectrogram.

Table 1 :
Performance of ASV systems used as victim models in this study. The performance is measured with Equal Error Rates (EER).

Table 2 :
Attack Success Rates (%) of Various Attacks.S represents the surrogate model, and V represents the victim model.

Table 3 :
Success rates (%) of over-the-air adversarial attacks. Light gray areas represent PGD white-box attacks, and dark gray areas represent ensemble transfer attacks (i.e., black-box attacks).

Table 4 :
Detection results of over-the-air adversarial attacks (EER% ↓). The results for the four victim models are averaged to obtain the results for both PGD and ensemble PGD attacks. The training sets for rows 4a and 4b are bonafide samples that go through the OTA process, while the other rows are trained on bonafide samples. The setting of the testing set is indicated in the 2nd column. The column 'overall' pools all the testing samples together to calculate EERs.

Rows 3a and 3b present the detection results of OTA adversarial attacks. Note that the training data are bonafide samples without the OTA process. The overall EERs are 3.49% and 3.20% for PGD and ensemble attacks, respectively. The only difference between rows 3a/3b and row 1 is whether perturbations are added. The results indicate that the OTA process plays a more important role in the detection of attacks. Rows 4a and 4b use the same testing set as rows 3a and 3b; however, their training sets are bonafide samples passed through the OTA process. In comparison to rows 2a and 2b, both the training and testing sets of rows 4a and 4b go through the OTA process. Both settings have considerably high EERs (i.e., higher than 30%).