VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic has been continually increasing. However, the comparison and combination of different anonymization approaches remain challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, implemented almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which improve the quality of the evaluation and reduce their computation time by 65 to 95%, depending on the metric. Our code is fully open source.


I. INTRODUCTION
SPEAKER anonymization [1] is a task in which speech recordings are automatically modified such that the original speaker becomes unidentifiable from the audio, usually by changing the voice in the direction of an artificial target speaker. The goal of this process is to preserve the speaker's voice privacy while keeping enough information from the input to use the anonymized audio in downstream tasks (e.g., speech recognition [2]). In order to foster research in this topic, the Voice Privacy Challenge (VPC) has been held in 2020 and 2022 [1], [3]. The open-source framework accompanying the challenges, consisting of code bases, baseline and evaluation models, datasets, and techniques, has had a great influence on the field, with most approaches using at least part of the VPC framework.
The primary goal of the VPC framework is to address the challenges associated with voice privacy research. However, it has two key drawbacks: (1) the heavy reliance on the Kaldi toolkit [4] results in processes that are complicated and opaque; (2) the static structure of the framework introduces many redundant computations, leading to inefficiencies.
Given the current importance of voice privacy research in society and politics, there is demand for a framework that allows quick ideation and experimentation without the burden of infrastructure issues, in order to promote collaboration among researchers. Although alternative evaluation frameworks [5], [6] exist, they do not adhere to the standard evaluation protocol introduced by the VPC. We thus propose a robust and modular framework for speaker anonymization. The framework is implemented almost fully in Python and consists of two branches, (1) for anonymization and (2) for evaluation, as shown in Figure 1. Both branches exhibit a modular structure in which single components can be skipped or added. The methods and models in each component can be easily exchanged for alternatives. The composition of a pipeline is controlled exclusively via configuration files such that different speaker anonymization systems (SASs) or evaluation metrics can be compared with minimal effort. We extend and improve commonly used evaluation methods by employing the ESPnet [7] and SpeechBrain [8] toolkits for evaluation instead of Kaldi [4]. Furthermore, we significantly speed up the evaluation by combining data reduction and finetuning techniques for privacy and utility metrics. Overall, we believe this framework will be an essential tool for advancing research in voice privacy and facilitating participation in future Voice Privacy Challenges.

FIGURE 1: Proposed framework, consisting of separate anonymization and evaluation branches and example configurations for each branch.
Our contributions are as follows:
• We propose improvements to standard evaluation methods for speaker anonymization which make these evaluations more efficient, easier to use, and more feasible for intermediate evaluations.
• We show through experiments on two primary VPC baselines and two state-of-the-art SASs that these improvements provide stronger privacy tests while requiring 65 to 95% less time to execute.
• We release all code in a new toolkit, VoicePAT (Voice Privacy Anonymization Toolkit), which further includes pipelines for running different state-of-the-art SASs.
• Due to its modularity, extending VoicePAT with more anonymization systems and evaluation metrics is easily possible, enabling the comparison of different approaches within one framework and probing the effectiveness of the anonymization using a more diverse attack landscape.

II. BACKGROUND AND RELATED WORK
Before describing our proposed framework and evaluation, we will first give some background about the current state of the topic, existing evaluation metrics, and frameworks.

A. SPEAKER ANONYMIZATION
The goal of an SAS [1], [3] is to automatically modify the voice in an audio recording to make the original speaker unidentifiable. Following the VPC baseline models, approaches can be categorized into methods based on disentanglement and those based on digital signal processing. Disentanglement-based approaches typically comprise three steps:
(1) Disentanglement of the input speech: The speech is decomposed into separate representations of speaker identity, prosody (e.g., fundamental frequency), and linguistic content.
(2) Modification of original features: This is a crucial step to hide the original speaker's identity. Most works focus on modifying the original speaker embeddings, assuming that identity is mainly encoded in them. Typically, a selection-based speaker anonymizer [24] replaces an original speaker vector with an anonymized one: a mean speaker vector derived from a randomly chosen set of speaker vectors in an external pool. However, the averaging process on speaker vectors leads to limited speaker diversity in the generated anonymized voices and serious speaker privacy leakage when facing stronger attackers [14].
To mitigate these problems, recent works adopt DNN-based anonymizers. For example, [10], [11] introduce a Wasserstein generative adversarial network (GAN) that is trained to turn random noise into artificial speaker embeddings that follow a similar distribution as the original speaker vectors. Another approach employs an orthogonal Householder neural network (OHNN) to rotate original speaker vectors, ensuring that anonymized vectors remain in the original space and maintain speech naturalness. The parameters of the OHNN-based anonymizer are trained using classification and similarity losses, encouraging distinct speaker identities [14].
(3) Generation of anonymized speech: The anonymized speaker, prosody, and content features are then fed into a speech synthesis model [25], [26] to generate anonymized speech.
The VPC introduced two primary disentanglement-based SASs. In BL 1.a, speech is disentangled into speaker identity, represented by a pre-trained x-vector; fundamental frequency, extracted by the YAAPT algorithm; and linguistic information, extracted by a pre-trained factorized time delay neural network (TDNN-F) based ASR acoustic model [2], [20]. Then, the selection-based speaker anonymization scheme [24] modifies the source x-vector to hide the speaker information. A speech synthesis acoustic model (SS AM) generates Mel-filterbank features from the anonymized pseudo x-vector, F0, and linguistic features, followed by a neural source-filter (NSF)-based waveform generation model [25] that synthesizes the anonymized speech. BL 1.b is similar to BL 1.a but replaces the traditional speech synthesis pipeline (SS AM + NSF) with a unified HiFi-GAN [26] NSF model as the waveform generator.

B. EVALUATION
Speaker anonymization commonly has two objectives: (1) privacy: protecting the identity of a speaker, and (2) utility: keeping other attributes of the original audio needed in downstream applications (e.g., linguistic content, naturalness, prosody, speaker emotion) unchanged. The challenge is to optimize an SAS to achieve a trade-off between both objectives, whereby the weighting between them and the utility assessment metrics depend on the application.

1) SPEAKER PRIVACY PROTECTION
To evaluate the effectiveness of protecting the speaker identity against different attackers, it is most common to compute the equal error rate (EER) using ASV evaluation models. For this, an attacker compares the anonymized trial utterances processed by users against enrollment utterances under different attack conditions, using a model trained on either original (ASV_eval) or anonymized (ASV_eval^anon) data [1], [3]:
• Unprotected (OO): a baseline condition to assess the effectiveness of the ASV_eval attacker when both enrollment and trial utterances are not anonymized (i.e., original).
• Ignorant (OA): attackers are unaware of the anonymization; they use original enrollment data and ASV_eval to infer the identity of anonymized trial utterances.
• Lazy-informed (AA-lazy): attackers use anonymized enrollment speech, generated with the same SAS but inaccurate parameters, and ASV_eval to detect identities.
• Semi-informed (AA-semi): similar to the lazy-informed attacker, but employing the more powerful ASV_eval^anon model trained on anonymized speech, which reduces the mismatch between original and anonymized speech when inferring the speaker identity.
For a successful anonymization, an EER close to 50% is targeted. An alternative to the EER metric is the log-likelihood-ratio cost function C_llr, used as discrimination loss (C_llr^min) and calibration loss (C_llr − C_llr^min) [1]. Other privacy metrics include the linkability (D_↔^sys) between two utterances [27], [28], the de-identification (DeID) based on voice similarity matrices [29], and the expected (D_ECE) and worst-case (log(l)) privacy disclosure metrics of the ZEBRA framework [30].
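As a concrete illustration of the main privacy metric (a simplified sketch, not the toolkit's exact implementation), the EER can be estimated from target and non-target ASV scores by sweeping a decision threshold until the false-acceptance and false-rejection rates meet:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Find the operating point where false-acceptance (FAR) and
    false-rejection (FRR) rates are approximately equal, and return
    their average at that threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(nontarget_scores >= t)   # impostor trials accepted
        frr = np.mean(target_scores < t)       # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

Well-separated score distributions yield a low EER (the attacker succeeds), while fully overlapping distributions push the EER towards 50% (the anonymization succeeds).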

2) SPEECH UTILITY PRESERVATION
As most applications require the anonymization to retain the linguistic content of the speech, the primary evaluation of speech utility is performed with ASR and measured as word error rate (WER). The VPC introduces two models for this: ASR_eval (A-lazy) is trained on the original data, and ASR_eval^anon (A-semi) is trained on the anonymized data. ASR_eval^anon tests the best-case scenario in which the downstream ASR knows how anonymization affects the audio quality and can adapt to it, whereas ASR_eval might simulate a more realistic condition in which such information is not known. The lower the WER, the better.
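For reference, the WER is the word-level Levenshtein distance between the reference transcript and the ASR hypothesis, normalized by the reference length; a minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining ref words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hyp words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[-1][-1] / max(len(ref), 1)
```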
Another common utility metric is the gain of voice distinctiveness G_VD [28], which assesses how well the ability to distinguish different speakers is kept during anonymization. If the voice distinctiveness in the anonymized space is the same as in the original space, G_VD is close to zero. If it is improved, the score is above zero, otherwise below. The metric is closely related to the privacy metrics and is computed using ASV_eval and voice similarity matrices.
Depending on the application, other speech utility metrics could be included, e.g., prosody preservation via pitch correlation as done in the VPC 2022 [3]. A simple anonymization approach could be to transcribe speech with an ASR model and synthesize it back to speech from the transcription using a text-to-speech (TTS) system. This conceals the speaker identity but also removes other paralinguistic features like emotion and health status which are important for health applications.
For simplicity, we will focus on EER as the privacy indicator, and on WER and G_VD for measuring utility in our experiments.

C. EXISTING FRAMEWORKS AND THEIR LIMITATIONS
To support participation in the challenges, the VPC published an open-source framework with code for all baselines and evaluation metrics. However, this framework lacks the flexibility to skip single steps in its run pipeline, rerun only parts of it, or add new metrics. Most of the algorithms are written with the C++-based Kaldi toolkit, which is challenging to maintain and lacks compatibility with standard Python-based speech processing models. Furthermore, the evaluation models included in the framework can take several days for computations. Combined with the difficulty of skipping previously computed calculations, performing a full anonymization with subsequent evaluation in the VPC framework is complicated and expensive, potentially discouraging new researchers from working on SAS development. Motivated by similar concerns about the framework, [5], [6] recently presented an alternative evaluation framework written in Python and exhibiting a modular and extendable structure. However, they neither test their framework with standard SAS approaches nor compare their evaluation metrics with the ones in the VPC framework. This makes it difficult to assess the quality of their improvements. We thus find a lack of suitable anonymization and evaluation frameworks for this topic. Hence, we propose a new alternative in this paper.

III. PROPOSED EFFICIENT FRAMEWORK
The proposed framework for speaker anonymization research consists of two pipeline branches, shown in Figure 1: one for the anonymization process and one for evaluation. Both branches consist of several modules, which can be instantiated with different models. All parameters for selecting the order and type of modules, as well as all other settings for running a pipeline, are given in configuration files. Modules can thus be exchanged, extended, or skipped if the general objective allows it. In this way, the framework provides a flexible option to modify existing approaches, combine ideas from different systems, and test the effect of single components in a controlled fashion.

A. ANONYMIZATION BRANCH
The goal of the anonymization branch is to provide a platform for researchers developing an SAS to evaluate their ideas quickly. Ideally, if they only want to test a minor change like a different speaker embedding modification mechanism, they would only have to add a new model (a Python class) to the speaker embedding modification module and adjust the configuration file. If the modification is radical, they might need to add a new module and pipeline.
Generally, an SAS consists of the following components: (1) a configuration file, (2) a pipeline, (3) a collection of modules, and (4) one specific model in each module. The configuration file specifies the pipeline (e.g., the GAN-based pipeline of [11]), which then defines the obligatory and optional modules and their processing order. Each module can be instantiated with different models or approaches, e.g., different speech synthesis models. The selection of one model per module and the inclusion of optional modules are specified in the configuration file. By default, the output of each module is saved to disk. This makes it possible to skip the computation of a module if it has been computed before and its input has not changed, and thus to test minor modifications more efficiently.
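The skip-if-unchanged behaviour can be sketched as follows; the function name and cache layout here are illustrative assumptions, not VoicePAT's actual API:

```python
import hashlib
import json
from pathlib import Path

def run_module(name, fn, inputs, cache_dir="cache"):
    """Run one pipeline module, skipping it if an output for the same
    inputs has already been saved to disk (mirroring the save-and-skip
    behaviour described above)."""
    # Hash the JSON-serializable inputs to detect whether they changed.
    key = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode("utf-8")
    ).hexdigest()
    out_path = Path(cache_dir) / f"{name}_{key}.json"
    if out_path.exists():                 # input unchanged -> skip module
        return json.loads(out_path.read_text())
    result = fn(inputs)                   # otherwise compute and cache
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(result))
    return result
```

Running the same module twice with identical inputs then triggers the computation only once; changing any input value produces a new cache key and forces recomputation.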

B. EVALUATION BRANCH
Following the standards of voice privacy research, the evaluation branch is divided into two modules: privacy and utility. Each module corresponds to one evaluation aspect and can consist of one or several metrics. For example, ASV is one module of the privacy evaluation and is mainly measured by one metric, EER; however, multiple models can be used to calculate it. Similar to the anonymization branch, all settings are again given in configuration files. To further improve the efficiency of the proposed framework, we employ more powerful ASV and ASR models, explore their training strategies, and modify the computation of G_VD as described below.

1) Evaluation models
In the VPC framework, EER and G_VD are computed using a Kaldi-based x-vector speaker encoder with a PLDA distance model [22], and WERs are computed using a Kaldi-based TDNN-F model [2]. As ASV and ASR technology develops, it is important to examine the impact of advanced models on the respective evaluation results. Aiming to find the most reliable choice, we propose using evaluation models based on the state-of-the-art toolkits SpeechBrain [8] for ASV and ESPnet [7] for end-to-end ASR, as shown in Figure 2.
Both toolkits are developed using PyTorch [31] in a research context, and they meet two requirements for our purposes: (1) They provide a user-friendly approach to modifying training recipes, which cover a wide range of hyperparameters and architecture choices for the models. (2) Both toolkits are continuously developed, ensuring that they regularly incorporate the latest advancements in both ASV and end-to-end ASR techniques.
For ASV evaluation models, we present choices including the SpeechBrain-based x-vector and the cutting-edge ECAPA-TDNN, featuring both cosine and PLDA back-ends. For ASR evaluation models, we provide an ESPnet-based Transformer encoder Connectionist Temporal Classification (CTC) ASR model with an Attention Encoder-Decoder (AED) [32]. A Transformer-based language model is trained once on LibriSpeech-train-clean-360 [33] and used for decoding.

2) Modifications for G_VD
The gain of voice distinctiveness metric G_VD [29] is defined as the ratio of the diagonal dominance of two voice similarity matrices, one for the original speaker space and one for the anonymized one. In the VPC framework, those similarity scores are computed by the ASV_eval model trained on the original data. ASV_eval yields reliable scores for the original data but introduces a mismatch when applied to anonymized speech. We are therefore interested in exploring different evaluation models for computing the similarity scores: (1) using the ASV_eval model; (2) using the ASV_eval^anon model; (3) using ASV_eval for original data and ASV_eval^anon for anonymized data, to see whether this could improve the accuracy of the similarity scores for each type of data. Furthermore, different from the VPC framework, where all utterances of each speaker are considered for the similarity computation, we enhance efficiency by randomly selecting 5 utterances per speaker to compute the log-likelihood ratios between two speakers.
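A minimal sketch of the computation, assuming the common definition of G_VD as 10 log10 of the ratio of the two diagonal dominance values (the exact score type and scaling in the toolkit may differ):

```python
import numpy as np

def diagonal_dominance(m):
    """Absolute difference between the mean diagonal (same-speaker) and
    mean off-diagonal (different-speaker) similarity scores of a square
    voice similarity matrix."""
    m = np.asarray(m, dtype=float)
    n = m.shape[0]
    diag_mean = np.mean(np.diag(m))
    off_mean = (m.sum() - np.trace(m)) / (n * n - n)
    return abs(diag_mean - off_mean)

def gain_of_voice_distinctiveness(m_orig, m_anon):
    """G_VD = 10 * log10(DD(anonymized) / DD(original)): close to zero
    when distinctiveness is preserved, negative when it is reduced."""
    return 10 * np.log10(
        diagonal_dominance(m_anon) / diagonal_dominance(m_orig)
    )
```

With identical matrices the gain is zero; halving the gap between same-speaker and different-speaker similarities in the anonymized space yields roughly -3 dB.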
3) Training strategy of ASV_eval^anon and ASR_eval^anon models
Four models are required for the privacy and utility evaluation, as described in Section II.B: ASV_eval and ASR_eval, trained on the original LibriSpeech-train-clean-360 dataset, are directly provided by the VPC platform. Thus, we will assess only their evaluation times, not the training time. ASV_eval^anon and ASR_eval^anon are trained from scratch like the original models, but on the anonymized LibriSpeech-train-clean-360 processed by the same SAS that is being evaluated. This requires extra time for anonymizing the training data and conducting the training process, in addition to the evaluation time.
Since the anonymization of the entire training dataset is quite time-consuming, we explore the impact of obtaining ASV_eval^anon and ASR_eval^anon by finetuning the pre-trained ASV_eval and ASR_eval, respectively, using only a subset of the anonymized data, thereby eliminating the necessity of anonymizing the entire dataset. We consider two techniques for data reduction: (1) choosing a limited number of utterances from all speakers, and (2) selecting all utterances from a subset of speakers. With this approach, we can balance the trade-off between anonymization and training time on the one hand and the effectiveness of ASV_eval^anon and ASR_eval^anon on the other.
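The two data reduction strategies can be sketched as a simple selection helper (an illustrative function, not part of the VoicePAT API):

```python
import random

def reduce_training_data(utts_by_speaker, n_utts=None, n_spks=None, seed=0):
    """Apply the two finetuning data-reduction strategies described above:
    (1) n_utts caps the number of utterances kept per speaker;
    (2) n_spks keeps all utterances of a random subset of speakers.
    Returns a reduced {speaker: [utterance_id, ...]} mapping."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    speakers = sorted(utts_by_speaker)
    if n_spks is not None:
        speakers = rng.sample(speakers, n_spks)
    subset = {}
    for spk in speakers:
        utts = list(utts_by_speaker[spk])
        if n_utts is not None and len(utts) > n_utts:
            utts = rng.sample(utts, n_utts)
        subset[spk] = utts
    return subset
```

Only the selected subset then needs to be anonymized and used for finetuning, which is where most of the time savings reported below come from.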

IV. EXPERIMENTS
In this section, we compare the effectiveness of different evaluation models in measuring the privacy and utility performance of various SASs. The evaluation models used here are trained on either the entire original or the anonymized LibriSpeech-train-clean-360 dataset. Once the evaluation models are chosen, the focus shifts to exploring the training strategy for the semi-informed evaluation models, particularly concerning the amount of training data.
To draw more general conclusions, we choose diverse disentanglement-based SASs, including both traditional and state-of-the-art methods, to generate anonymized speech and evaluate it using our proposed VoicePAT. The specifications of these SASs are listed in Table 1. All SASs follow the three steps described in Section II.A, with different realizations of each component and different anonymization techniques applied to the speaker embeddings. All experiments are performed on the VCTK [34] and LibriSpeech [33] test sets as given by the VPC; the results report the average score over both. Further settings, e.g., training hyperparameters, can be found in our source code. All time measurements apply to experiments conducted on a single NVIDIA A100 GPU, except for ASR evaluation, which uses 4 GPUs.

A. CHOICE OF EVALUATION MODELS
1) ASV evaluation models
In order to decide which evaluation model provides the best results, we look at the OO condition, where no SAS is applied and very low EERs are expected on the original data. The model using ECAPA-TDNN and cosine distance achieves the lowest EER with 3.11% and can therefore be considered the best ASV evaluation model. This is consistent with findings in the ASV field [35], [36]. Accordingly, we choose ECAPA-TDNN + cosine as the primary proposed ASV evaluation model in all following experiments. In the central columns of Table 3, we compare the EERs of this proposed model to those of the x-vector + PLDA model of the VPC toolkit. Compared to the VPC model, the proposed one consistently achieves lower or similar EERs across all conditions and SASs. This means that the proposed ASV model is a stronger attacker, which is reasonable as it exhibits a more powerful ability to infer the speaker's identity.
Moreover, the proposed ASV model reduces the time needed to perform the evaluation (Figure 2a). Instead of the 6 hours needed to train the VPC model (which includes the x-vector encoder and the PLDA), our proposed ECAPA-TDNN + cosine model only requires 2 hours. The effect on the evaluation time is smaller, though still noticeable: it is reduced from 30 to 20 minutes.

2) ASR evaluation models
The right columns of Table 3 summarize the WERs for both original (O) and anonymized data (A-lazy, A-semi), using the VPC (TDNN-F) or proposed (Transformer-based CTC/AED) evaluation models. One common trend for all SASs is that the proposed ASR model achieves notably lower WERs in comparison to the VPC model. For the A-semi condition, the utilization of ASR_eval^anon can reduce the mismatch between original and anonymized data, further decreasing the WERs.
Another interesting observation is that for the VPC model, the WERs of A-semi decoded by ASR_eval^anon are consistently lower than those of the O condition decoded by ASR_eval. In contrast, for the proposed model, the original data yield the lowest WERs for most SASs, regardless of the training data of the ASR evaluation model. Regarding training and evaluation time for the ASR models (Figure 2b), we observe that our model increases the time for ASR evaluation from 5 hours for the VPC model to 6 hours. However, the training of the proposed model takes only 20 hours, significantly less than the 72 hours required by the VPC model. Moreover, comparing the A-lazy and A-semi results shows that training the ASR model on anonymized data, resulting in the ASR_eval^anon model, has less effect for the proposed model than for the VPC one. It can therefore be argued that training the ASR_eval^anon model for each evaluation and using the A-semi condition is not necessary with the proposed model. This is further supported by the argument that using an ASR model specifically trained on anonymized data may be unrealistic for actual applications. Thus, by using a more robust ASR model and reverting to only the A-lazy condition, we can effectively reduce the evaluation time by 92%, from 77 hours for the VPC to 6 hours.

B. GAIN OF VOICE DISTINCTIVENESS
Table 4 lists the G_VD results for various SASs computed by the VPC and proposed evaluation models. Looking at the G_VD achieved by ASV_eval, we can see: (1) Anonymized speech generated by the OHNN- and GAN-based models exhibits higher voice distinctiveness than BL 1.a and BL 1.b, with G_VD closer to zero. (2) The proposed model achieves either similar or lower G_VD values compared to the VPC model. (3) Comparing the proposed model using all utterances to that using 5 utterances per speaker reveals similar results, suggesting that 5 utterances may be sufficient for small test sets with limited voice variation (note that LibriSpeech and VCTK contain read speech segments extracted from longer recordings). At the same time, using only 5 utterances per speaker drastically reduces the time for computing G_VD, from 6 hours to only 20 minutes (Figure 2a). Thus, we can speed up the G_VD evaluation by 95% without reducing the result quality.
However, when employing ASV_eval^anon or a combination of both ASV_eval and ASV_eval^anon, the G_VD is significantly higher and often above zero, indicating a positive gain in voice distinctiveness. The difference between using the VPC models and the proposed ones is more notable, whereby the proposed models suggest almost no difference between the SASs anymore. This shows that G_VD is an unstable metric that highly depends on the model used for evaluation. In order to measure voice distinctiveness precisely, it may be necessary to consider downstream tasks like speaker diarization [37], instead of relying solely on the G_VD metric.

C. TRAINING STRATEGY FOR ASV_eval^anon AND ASR_eval^anon MODELS
Figure 3 shows the influence of the different data reduction strategies on the privacy scores and the evaluation efficiency for BL 1.a. The experiments are conducted across various ASV_eval^anon models, trained using different amounts of anonymized speech data.
It can be observed from the results in Figure 3a that the more we decrease the amount of training data, the more the EER increases. This is especially problematic for #utts per spk=10 because its scores might suggest that the SAS's privacy protection is better than it actually is. For the WER, the effect of the data reduction is negligible, as it only changes from 7.66% (all data) to 7.91% (#utts per spk=10) in the worst case.
The biggest impact of the different training strategies for ASV_eval^anon can be seen in Figure 3b. It shows that the largest share of the time needed for the privacy evaluation comes from the requirement of anonymizing the training data for each evaluation run. Decreasing the amount of training data therefore means that less time is spent on anonymizing it; thus, the time cost of an evaluation decreases linearly with the reduction of data.
Based on these results, we found that #utts per spk=50 provides the best balance between EER increase and cost; we validated this finding using other SASs as well but omit the results due to limited space. This data reduction decreases the total evaluation time for the ASV evaluation from 16 hours (14.25 hours for anonymization, 2 hours for training) to 7 hours (6.25 hours for anonymization, 49 minutes for training).

V. ANALYSIS
This section delves deeper into privacy results by examining resynthesis performance and summarizing rankings when employing various evaluation models for different SASs.

A. RESYNTHESIS
In the privacy evaluation, we obtain one privacy score (EER) for each model and attack condition. However, as all tested SASs employ a speech synthesis step after the actual anonymization method, it is not clear whether the anonymization power of an SAS actually comes from this anonymization method or from the synthesis. We therefore test the resynthesis performance of each SAS. For this, we generate a new version of the evaluation data per SAS in which the anonymization method is skipped and the original speaker vector is instead used for synthesis. We evaluate two new conditions for ASV_eval: OR and RR-lazy, with original or resynthesized enrollment data, respectively, and resynthesized trial data. For ASR_eval, we test the decoding performance on the resynthesized data (R-lazy). In both cases, we compare to the performance on the original data.
Table 5 shows the results. Except for the OHNN-based SAS, all SASs clearly exploit the synthesis to increase the privacy protection and do not rely only on their anonymization method. The OHNN-based SAS, on the other hand, exhibits almost no identification loss during resynthesis. However, the synthesis in all SASs leads to an increase in WER and thus reduced intelligibility. Using the synthesis to increase the privacy protection is not necessarily a drawback of the BL 1.a, BL 1.b, and GAN-based SASs. However, researchers might not be aware of this effect and might put too much focus on optimizing their anonymization method instead of the synthesis. It also decreases the control one has over the actual outcome of the SAS. A related issue was already observed by [38] concerning the vocoder drift of speaker vectors during anonymization.

B. EFFECT ON RANKING
In this paper, we presented new evaluation models and strategies for training the semi-informed attackers as alternatives to the evaluation framework of the VPC. We have shown that the perceived strength of an SAS depends on the models it has been evaluated with; however, it is also important to analyze how the relative performance of multiple SASs in comparison (i.e., their ranking) is influenced by the choice of evaluation models.
Comparing the scores for the different SASs in the lazy- and semi-informed attack conditions in Table 3 consistently reveals the same ranking for the level of privacy protection (with the GAN- and OHNN-based systems partly tied): (1) GAN-based, (2) OHNN-based, (3) BL 1.a, and (4) BL 1.b. This holds regardless of whether the VPC evaluation models or the proposed ones are applied, and also regardless of the training strategy. For the ASR evaluation, on the other hand, the ranking of the SASs' performances does not stay consistent but changes depending on the ASR model used for evaluation. However, the WER scores of all SASs are relatively similar to each other when using the proposed evaluation models. Thus, we argue that this change in ranking for the utility evaluation is a rather small effect.

VI. DISCUSSION
Which evaluation models should we consider? For the privacy evaluation, we proposed and evaluated various attacker models against a selection of SASs and found that the choice of attack model did not influence the ranking of the privacy protection ability of the chosen SASs, although they produced different privacy scores. However, we only tested a limited number of attackers from the same ASV family. It is possible that an attacker using a different technique, e.g., Conformer-based [39] or SSL-based ASV models [40], might result in a different trend.
Moreover, the choice of attack conditions is still heavily based on assumptions. It is unclear whether semi-informed attackers are realistic or what we can assume about the knowledge and dedication of real attackers. Hopefully, challenges like a Voice Privacy Attacker Challenge will provide new insights and perspectives.
Overall, we saw a particular trade-off between the quality of objective results and their usability in terms of the time required for computation, at least for the privacy metrics. Retraining ASV_eval^anon from scratch on the full anonymized training data seems to lead to the strongest attacker. However, it is costly; the cost can be significantly reduced by a finetuning strategy that causes only a minimal reduction in attacker performance. We propose using this alternative training strategy during SAS development to speed up voice privacy research. However, a full retraining on all data might still be the better option for a final assessment of the full privacy capabilities.
What are the drawbacks of the current evaluation metrics? According to our experiments, G_VD is a more problematic metric. We tested three model approaches (ASV_eval, ASV_eval^anon, and the combination of both), but their results differ considerably. It is unclear which approach is better suited for measuring the preservation of voice distinctiveness among anonymized speech. We conclude that we need a more robust alternative, e.g., using the anonymized dataset to perform downstream speaker verification or speaker diarization tasks.
What is missing? Currently, there are no definitions or measurable criteria for success or for the guarantee of full privacy protection through anonymization, since all existing evaluations rely on assumptions and specific attack models. It is unknown when an SAS could be considered good enough for use on data where privacy protection matters, or how the remaining privacy risk of current systems can be accurately measured.
In summary, being open source, the proposed framework serves as a platform for unifying researchers and research on this topic. Researchers can add their own SASs and evaluation metrics to the framework such that large-scale and extensive evaluations and comparisons become possible without much additional effort. In this way, we hope that this framework will help towards finding answers to the questions above, and towards the development of powerful anonymization tools.

VII. CONCLUSION
We proposed a new Python-based and modular open-source framework for speaker anonymization research. It allows combining, managing, and evaluating several anonymization approaches within one platform that is simple to apply and extend. We further presented various improvements to standard evaluation techniques for speaker anonymization. Specifically, we exchanged the previous Kaldi-based evaluation models for more powerful techniques using the ESPnet and SpeechBrain toolkits. Moreover, we showed that we can decrease the time required for evaluation by up to 95% by reducing the training and test data while keeping the quality of the evaluations at comparable levels. We anticipate that these changes to common development and evaluation procedures will significantly facilitate and support speaker anonymization research in the near future.

FIGURE 2: Comparisons of the VPC and proposed privacy and utility evaluation models. Note that although G_VD is a utility metric, its evaluation time is plotted in the privacy subplot as it is computed by ASV_eval.

FIGURE 3: Effect of different training strategies for ASV_eval^anon and ASR_eval^anon on (a) evaluation metrics for BL 1.a, and (b) data and time efficiency. The strategies involve finetuning with data reduction, either by restricting the number of utterances per speaker or the number of speakers. They are compared against using all data for training the models from scratch.

Table 2 lists the mean EERs for various SASs under all conditions computed by different evaluation models. The EERs of BL 1.a and BL 1.b are 7%-11% under the AA-semi condition, indicating severe privacy leakage when facing the stronger, semi-informed attack model ASV_eval^anon. However, the EERs of the OHNN- and GAN-based SASs are over 40% across all attack conditions, showing remarkable privacy protection capabilities.

TABLE 2: Comparison of four privacy attack models using the x-vector and ECAPA speaker encoders with PLDA and cosine as distance measures. Privacy scores for each SAS and attack condition are given as EER in %. ↑ means higher values are better, while ↓ means lower values are better.

TABLE 3: EERs and WERs obtained by the proposed and the VPC evaluation models. The proposed models are ECAPA-TDNN + cosine for ASV and the Transformer-based CTC/AED model for ASR. -l and -s stand for -lazy and -semi.

TABLE 4: Comparisons of G_VD obtained by the VPC and the proposed ECAPA-TDNN + cosine evaluation models. The evaluation models can be either ASV_eval only, ASV_eval^anon only, or a combination of both (ASV_eval for original and ASV_eval^anon for anonymized data). #utts per spk=5 means randomly selecting 5 utterances per speaker for the similarity computation.
Regarding the WER trend in Table 3, the observation that the original data yield the lowest WERs holds regardless of the ASR evaluation model, except for the GAN-based SAS. Possible reasons could be either that the ASR_eval provided by the VPC was not adequately trained or that the structure of this model is not powerful enough. In contrast, our proposed ASR model is more powerful in achieving accurate results.