On the Impact of Voice Anonymization on Speech Diagnostic Applications: A Case Study on COVID-19 Detection

With advances seen in deep learning, voice-based applications are burgeoning, ranging from personal assistants and affective computing to remote disease diagnostics. As the voice contains both linguistic and para-linguistic information (e.g., vocal pitch, intonation, speech rate, loudness), there is growing interest in voice anonymization to preserve speaker privacy and identity. Voice privacy challenges have emerged over the last few years and focus has been placed on removing speaker identity while keeping linguistic content intact. For affective computing and disease monitoring applications, however, the para-linguistic content may be more critical. Unfortunately, the effects that anonymization may have on these systems are still largely unknown. In this paper, we fill this gap and focus on one particular health monitoring application: speech-based COVID-19 diagnosis. We test three anonymization methods and their impact on five different state-of-the-art COVID-19 diagnostic systems using three public datasets. We validate the effectiveness of the anonymization methods, compare their computational complexity, and quantify the impact across different testing scenarios for both within- and across-dataset conditions. Additionally, we provide a comprehensive evaluation of the importance of different speech aspects for diagnostics and show how they are affected by different types of anonymizers. Lastly, we show the benefits of using anonymized external data as a data augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss seen with anonymization.


I. INTRODUCTION
SPEECH is one of the most powerful and easy-to-use communication interfaces between humans and machines. For example, voice assistants relying on automatic speech recognition (ASR) allow humans to control devices by providing voice commands [1]; automatic speaker verification (ASV) systems enable users to access personal properties (e.g., online bank accounts) via their voice [2]. More recently, speech has also been shown as a promising measure for in-home disease detection and monitoring, including COVID-19 [3], chronic obstructive pulmonary disease (COPD) [4], and Alzheimer's disease [5], just to name a few.
Speech-based diagnostic systems have been motivated by the fact that speech requires complex and precise coordination of the respiratory system and neuromuscular control [6]. Diseases that cause dysfunction in speech production would then lead to changes in vocal characteristics. For example, major symptoms of COVID-19, such as cough, muscle soreness, and decreased neuromuscular control [7]-[9], have been shown to relate to increased vocal hoarseness and variance in syllabic rate [10]. While human ears may not be able to capture such subtle changes, machine learning (ML) models have demonstrated the capability to detect certain abnormal patterns present in pathological speech [10]-[12].
The authors are with the Institut national de la recherche scientifique, University of Québec, Montréal, Canada. Our code and voice demos are made available at https://github.com/zhu00121/Anonymized-speech-diagnostics.
Today, the great majority of speech-based applications rely on deep neural network (DNN) architectures with models containing hundreds of millions of parameters, with this number continuously rising. Commonly, these parameters are not stored locally on mobile devices [13] and speech data are sent to and processed in the cloud; decisions are then transmitted back to the user device. As more and more cases of cyberattacks are being reported [14]-[16], this transmission of speech data over the cloud could pose serious threats to user privacy. It has been previously reported that voice assistants and many third-party applications collect users' voices without their knowledge and share them with advertising partners [17]. For example, Amazon patented a technique that recognizes health status via conversations with users and advertises the related medicines to them [18]. This could be particularly risky for speech diagnostics applications, since the user's voice could be linked with sensitive medical information, such as health status [19], disease progression [20], or mental state [21], just to name a few. As such, speech privacy-preserving methods have gained increased attention globally, especially with the release of regulations, such as the General Data Protection Regulation (GDPR) in Europe [22] and the Personal Information Protection Law (PIPL) in China [23]; the latter is particularly aimed at personal biometrics (i.e., voice, facial image, and fingerprints).
To this end, voice anonymization methods have emerged with the aim of manipulating the speech signal such that information about speaker identity is obfuscated, while the linguistic content and other para-linguistic attributes (e.g., timbre, naturalness) remain intact. Given the burgeoning interest in this domain, the Voice Privacy Challenges (VPC) were held in 2020 and 2022 to foster development in speech anonymization techniques [24], [25]. However, these challenges were aimed at developing anonymization methods for downstream automatic speech recognition tasks [24]-[27], where linguistic content was preserved, but not para-linguistic information.
As speech applications emerge beyond the realm of ASR, it is important to also gauge what impacts anonymization tools can have on other downstream tasks. Some initial attempts have been made in this realm. Nourtel et al. showed significant degradation in speech emotion recognition when anonymization was applied [28]. Dumpala et al. performed an initial exploration of the privacy-preserving features of depression speech [29]. To the best of our knowledge, gauging the impact of anonymization on speech diagnostic applications has yet to be explored; this paper aims to fill that gap.
Furthermore, in a real-world scenario, diagnostics models are usually trained on open-source datasets due to the scarcity of medical data [30], while test data may come from varying conditions (e.g., geographic locations, languages, collection devices). Hence, it is difficult to have training and test data anonymized using the exact same approach in reality. However, existing anonymization testing conditions commonly assume that downstream models either have no or full knowledge of how the training and test data are anonymized (i.e., ignorant or fully-informed, respectively). As more anonymization techniques emerge, alternate testing conditions could be implemented, such as training with data processed by other anonymization methods (i.e., in a semi-informed manner) or with both original data and data processed by conventional anonymization tools (i.e., augmented). Hence, more complex testing conditions need to be considered. Lastly, to avoid private information being sent to the cloud, the voice anonymization should be deployed locally on the user device, which could have limited computational resources. As such, it is important to evaluate the computational complexity (i.e., time and capacity needed for computation) of the anonymization methods alongside their effectiveness.
In this study, we comprehensively evaluated the impact of three voice anonymization methods on the accuracy of five leading COVID-19 detection systems. We started by quantifying the efficacy and computational complexity of the anonymization methods with COVID-19 speech recordings. We then investigated the within- and cross-dataset performance of five COVID-19 diagnostics systems in different conditions, and explored the reasons behind the impact of different anonymization methods on diagnostics. Lastly, we showed the benefits of using anonymized external data as a data augmentation tool to recover the diagnostics accuracy loss in anonymized data. The remainder of this paper is organized as follows. Section II summarizes the related works in speech-based COVID-19 diagnostics and speech anonymization. Sections III and IV describe the main components of the anonymized speech diagnostics framework and the experimental set-up. Section V describes and discusses the obtained results. Section VI presents the conclusions.

II. RELATED WORK
A. Speech-based COVID-19 Diagnostics
Speech-based diagnostic systems can be categorized into two groups: ones that rely on carefully designed hand-crafted features coupled with conventional machine learning classifiers, and ones that input raw signals directly into a deep learning model for classification. In the latter 'end-to-end' scenario, the deep learning model serves as a feature extractor and feature mapping function in one.
When it comes to feature extraction from speech, the openSMILE toolkit [31] is by far the most popular. The largest feature set of openSMILE extracts over 6,000 acoustic features, including mel-frequency cepstral coefficients (MFCC), pitch contours, voicing-related information, as well as several other low-level descriptors (LLDs). This feature set has been used together with conventional classifiers, such as support vector machines (SVM), for the detection of different diseases [32]-[35]. More recently, it has been employed as a benchmark feature set for the INTERSPEECH 2021 ComParE COVID-19 Detection Challenge [36]. For in-the-wild speech analysis, on the other hand, the modulation spectral representation (MSR) has shown benefits over openSMILE features for different applications (e.g., [37], [38]), including disease characterization (e.g., [39], [40]) and COVID-19 detection [10].
Existing end-to-end systems, in turn, have relied on variants of the spectrogram representation as input, including the mel-spectrogram or the log-mel-spectrogram, as well as convolutional or recurrent neural network architectures for classification. Han et al., for example, showed that VGGish neural networks outperformed conventional methods in classifying different COVID-19 symptoms [35]. Akman et al. developed a ResNet-like architecture for speech and cough-based COVID-19 detection [41]. The Bi-directional Long-Short-Term-Memory (BiLSTM) neural network was used in the top-performing system competing in the second Diagnosis of COVID-19 using Acoustics (DiCOVA2) Challenge [12]. Compared to conventional systems, end-to-end systems have demonstrated overall higher performance on several datasets without the need for a separate feature extraction step [12], [42], [43]. Nonetheless, recent research has shown that while end-to-end models achieve state-of-the-art accuracy on a particular dataset, those results do not transfer well to other unseen datasets, where accuracy can drop to below chance levels [44]; this was not the case with hand-crafted features and conventional classifiers.

B. Speech Anonymization
Anonymization techniques comprise two categories: speech transformation and speech conversion. The former refers to modifications applied directly to the original speech, such as pitch shifting and warping [45], [46], to remove personally identifiable information from the speech signal. The latter, in turn, converts one's voice to sound like that of another without changes in linguistic content [47]. As voice privacy concerns are on the rise, voice anonymization has gained popularity recently and, in 2020, the Voice Privacy Challenge (VPC) was created [24]. A popular method from the 2020 and 2022 VPCs employs the so-called McAdams coefficients [24], [25], where shifts in the pole positions derived from linear predictive coding (LPC) analysis of speech signals [48] are used to achieve anonymization. Another popular voice transformation method is termed voicemask [49], where certain frequency components are compressed (or stretched) to generate a lower-pitched (or higher-pitched) voice signal. Voice conversion systems, on the other hand, have usually relied on modifications to speaker embeddings, such as the x-vector [50] and the ECAPA-TDNN embeddings [51], which are assumed to only carry nonverbal information that pertains to the speaker identity alone. The modified speaker embeddings are then input together with the speech content sequence to a speech synthesis module to reconstruct a new speech waveform [26]. Several innovations have been proposed to the speech synthesis module to make the outcome sound more natural and of greater quality and intelligibility [27], [52]-[54].

III. ANONYMIZED SPEECH DIAGNOSTICS SYSTEMS
A. System Overview
Figure 1 depicts the diagram of an anonymized speech diagnostics (SD) system. Conventionally, the original voice of user X is input to a diagnostic system that will generate a positive or negative output for the tested disease and/or symptom. If an automatic speaker verification (ASV) system was trained with data from user X, the ASV system would be able to detect user X's voice. In practice, SD systems are complex and models are often stored on the cloud, thus requiring the user's voice (or features) to be uploaded to the cloud. This transmission of data could result in privacy concerns. To overcome this, voice anonymization can be employed locally and anonymized data (or features) are sent to the cloud. In this case, user X would not be identified by the ASV system and speech-based diagnostics could proceed in a more secure and private manner.

B. Speech-Based Diagnostic Systems
Based on previous experiments on COVID-19 detection (e.g., [10]), the five top-performing diagnostics systems are explored herein:
1) openSMILE+SVM: A total of 6,373 static acoustic features were first extracted using the openSMILE toolbox [31], which were then input to an SVM classifier with a linear kernel. This system was used as the benchmark in the 2021 ComParE COVID-19 Speech Sub-challenge [36].
2) openSMILE+PCA+SVM: The high dimensionality of the openSMILE features can be problematic for smaller datasets. In [55], principal component analysis (PCA) [56] was used to compress the 6,000+ features into 300 components. Here, the number of principal components was treated as a hyper-parameter and a value of 100 was found to strike a good balance in accuracy and dimensionality.
3) MSR+SVM: The MSR features have been used in [10], [44] and shown to outperform openSMILE-based systems and to provide improved generalizability across datasets. The modulation spectrum decomposes each frequency component along time into different modulation frequencies, which captures the abnormalities in respiration and articulation by focusing on long-term dynamics of speech. Each modulation spectrum comprises 23 frequency bins and 8 modulation frequency bins, and is flattened into a vector that is used as input to a linear SVM classifier. The interested reader is referred to [39], [57] for more details about the modulation spectrum.
4) MSR+PCA+SVM: For more direct comparisons with the openSMILE system, here we also explore the compression of the 184-dimensional (23 × 8) vector via PCA, resulting in a final 100-dimensional vector for classification.
5) Logmelspec+BiLSTM: The winning system in the DiCOVA2 Challenge [12] was employed as a benchmark. This system adopts the conventional log-mel-spectrogram (logmelspec) with first- and second-order deltas as input, along with a BiLSTM as the classifier. More details about the network architecture can be found in [12].

C. Speech Anonymization Methods
One voice transformation method and two voice conversion methods are explored here to gauge their differences in speech diagnostics performance. More details are provided below.
1) McAdams coefficient: This approach uses a classical signal processing technique and does not require model training. It employs the so-called McAdams coefficient method [48], [58] to shift the position of formants measured using linear predictive coding (LPC) [59]. For each short-time speech frame, the method first separates the linear prediction residuals and linear prediction (LP) coefficients. The LP coefficients are then converted to pole positions in the z-plane by polynomial root-finding, where each pole position represents the position of one formant. The phase of the poles with imaginary parts is then raised to the power of the McAdams coefficient α. The new set of poles is then converted back to LP coefficients. Together with the original residuals, a new speech frame can be synthesized.
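The pole manipulation step described above can be illustrated with a minimal sketch. It assumes the poles of one frame have already been obtained by root-finding on the LP polynomial, and it only covers the phase shift and the conversion back to polynomial coefficients. The function names `mcadams_shift_poles` and `poly_from_roots` are hypothetical and are not taken from the implementation used in this work.

```python
import cmath
import math


def mcadams_shift_poles(poles, alpha):
    """Raise the phase of every complex pole to the power alpha.

    Real poles (zero imaginary part) are left untouched. The sign of the
    phase is preserved so that conjugate pairs remain conjugate.
    """
    shifted = []
    for p in poles:
        if abs(p.imag) < 1e-12:
            shifted.append(complex(p.real, 0.0))
            continue
        radius, phase = cmath.polar(p)
        new_phase = math.copysign(abs(phase) ** alpha, phase)
        shifted.append(cmath.rect(radius, new_phase))
    return shifted


def poly_from_roots(roots):
    """Expand prod(z - r) into coefficients, highest power first."""
    coeffs = [1 + 0j]
    for r in roots:
        nxt = coeffs + [0j]
        for i, c in enumerate(coeffs):
            nxt[i + 1] -= r * c
        coeffs = nxt
    # Conjugate pole pairs yield real coefficients up to rounding error.
    return [c.real for c in coeffs]


# Illustrative poles only: one conjugate pair and one real pole.
example_poles = [cmath.rect(0.9, 0.5), cmath.rect(0.9, -0.5), complex(0.5, 0.0)]
new_lp_polynomial = poly_from_roots(mcadams_shift_poles(example_poles, alpha=0.8))
```

With α = 1 the poles are unchanged; values away from 1 move the pole angles, and hence the formant positions, while the pole radii are kept.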
2) Ling-GAN: For voice conversion, we implemented two systems based on generative adversarial networks (GAN). The overall architecture of these systems can be found in Figure 2. The first system, abbreviated as 'Ling-GAN', was an off-the-shelf anonymizer from [27], where all modules were already trained and applied to COVID-19 data without any fine-tuning. In general, it preserves the linguistic content (i.e., phoneme sequence) and uses a generator to generate fake, yet realistic speaker embeddings to substitute the original speaker embeddings. The original speech is first input to an automatic speech recognition (ASR) model to extract the phone sequence. The ASR model used here is based on the hybrid CTC (Connectionist Temporal Classification)/attention architecture [60] with a Conformer encoder [61] and a Transformer decoder. It should be emphasized that the output of the ASR is a phoneme sequence, detailing not only the phonemes uttered but also the pauses. In our exploratory analysis, we found that the removal of these pauses would change the rhythm of the generated speech and lead to degraded diagnostic performance. We hence kept all pauses in the extracted phoneme sequences. The ASR model used here supports English as the default input language, and hence may lead to erroneous transcriptions when other languages are used. Although this issue could potentially be tackled by substituting a multilingual ASR model, the compatibility of such models with the anonymization and synthesizer blocks has not been tested. Hence, we retain the architecture proposed in [27], and leave the language compatibility for future investigation.
The anonymization is divided into two stages. During the first stage, the 512-dimensional x-vector [50] and the 192-dimensional ECAPA-TDNN vector [51] are extracted using the SpeechBrain toolkit [62] and concatenated as the final speaker embeddings. At the second stage, a Wasserstein GAN with Quadratic Transport Cost (WGAN-QC) [63] is used to generate a pool of 5,000 'converted' speaker embeddings, which are saved for later use. When a new recording is input to the system, the model iteratively looks through the pool, and stops when it finds an embedding with a cosine distance above 0.3 from the original speaker embeddings. This new embedding is then used to substitute the original one for synthesis. The 0.3 threshold value of cosine distance was suggested in [27], which ensures sufficient difference in speaker traits while maintaining the naturalness. Finally, the FastSpeech 2 model [64] is used to synthesize the phone sequence into a spectrogram, followed by a HiFiGAN vocoder [52] to convert the spectrogram into a final speech waveform. The synthesizer is conditioned on the anonymized speaker embedding, hence keeping the linguistic content while obfuscating the speaker identity.
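The pool search described above can be sketched as follows. This is an illustration only: it assumes the pool is scanned sequentially (the scan order is not specified in the text), it uses plain Python lists in place of the 704-dimensional concatenated embeddings, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pick_anonymized_embedding(original, pool, threshold=0.3):
    """Return the first pool entry whose cosine distance exceeds the threshold."""
    for candidate in pool:
        if cosine_distance(original, candidate) > threshold:
            return candidate
    return None  # no candidate was far enough from the original
```

Returning `None` when no candidate qualifies is a choice made for this sketch; how the original system handles that case is not described here.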
It is important to emphasize that this off-the-shelf GAN has not seen pathological speech data during its training [65]. As a consequence, the generated speaker embeddings may not encapsulate health-related attributes, thus affecting diagnostic accuracy. The last anonymization system used overcomes this limitation, as detailed next.
3) Ling-Pros-GAN: The second GAN-based system, abbreviated as 'Ling-Pros-GAN', was modified from [65] and can be seen as a more advanced version of the Ling-GAN. While sharing a similar architecture, such as the ASR module and the synthesizer, the Ling-Pros-GAN further preserves prosody (i.e., pitch, energy, and duration) during anonymization and uses the style embeddings from [66] to represent speaker attributes. In addition, we fine-tuned the generator and discriminator using the aggregated training set data from all three COVID-19 datasets employed in this study. The goal of fine-tuning was to enable the GAN to generate COVID-like speaker embeddings.
The generator and discriminator were jointly trained for 2,000 iterations, with a batch size of 128 and a learning rate of 0.00005. Other fine-tuning hyperparameters remained the same as reported in [65], and can also be found in our code repository. Figure 3 depicts the t-distributed stochastic neighbor embedding (t-SNE) plots [67] showing a 2-dimensional representation of the speaker embeddings in the COVID-19 datasets (red dots), those produced by the generator without fine-tuning (blue), and after fine-tuning (green). As can be seen, using just the pre-trained generator is not sufficient to model the COVID-19 speaker embedding distribution. With 2,000 iterations of fine-tuning, the generator was able to generate embeddings following a distribution similar to that of the COVID-19 embeddings. Different from the original implementation in [65], where a pre-generated pool of speaker embeddings was used, we modified Ling-Pros-GAN so that it randomly generates a small set of different speaker embeddings each time it receives a new recording, then chooses which embeddings to swap by iteratively examining the cosine similarity. In other words, Ling-Pros-GAN is guaranteed to generate an unseen version of anonymized speech even with the exact same input recording. In contrast, since Ling-GAN always chooses embeddings from a pre-generated pool, there is a slight chance that two recordings may be anonymized with the same generated embeddings. This possibility becomes higher when the number of speakers increases. While this modification to Ling-Pros-GAN improves privacy, the computing time also increases due to the online generation of speaker embeddings.

A. Databases
At the time of writing, most existing COVID-19 sound datasets target cough sound, such as the COUGHVID [68], Tos COVID-19 [69], Virufy [70], and NoCoCoDa [71]. Speech sound, on the other hand, is included in fewer datasets. To maximize the variability of data distribution and avoid biased results from one single dataset, we included three publicly available COVID-19 speech datasets, namely the multilingual 2021 ComParE COVID-19 Speech Sub-challenge (CSS) dataset [11], the second DiCOVA Challenge dataset [12], and the English subset from the Cambridge COVID-19 sound database [55]. These datasets are referred to hereinafter as CSS, DiCOVA2, and Cambridge set, respectively. The demographics of the three datasets are summarized in Table I. It should be noted that though the full Cambridge database contains more speech samples, the English subset has been more carefully examined by the data holders to avoid potential confounding factors (e.g., languages, data quality, class balance) [55], and is hence considered more suitable for our analysis.
All three datasets were crowdsourced: volunteers across the globe were encouraged to upload their voice data and metadata via apps. The same speech content was required per dataset. With CSS, participants were asked to utter the sentence "I hope my data can help to manage the virus pandemic" at most three times in their mother tongue, with the majority of samples being uttered in English, Portuguese, Italian, and Spanish. The same speech content was used for the Cambridge set but in English only. With DiCOVA2, participants did number counting from 1 to 10 at a normal pace in English. For all datasets, participants were asked to self-declare whether they were COVID-negative (including healthy or having COVID-like symptoms) or COVID-positive (including symptomatic and asymptomatic cases). It can be noticed from Table I that all three sets contained 10% to 30% asymptomatic COVID-positive cases. Additionally, nearly half of the COVID-negative samples in CSS and Cambridge are symptomatic, which is three times higher than that in DiCOVA2.
The CSS and Cambridge datasets were partitioned into three separate subsets by the challenge organizers, namely training, validation, and test. For comparisons, we employed the same challenge partition in this study. It should be emphasized that in the CSS dataset, several COVID-positive recordings were originally sampled at 8 kHz while the majority of the other files were sampled at 16 kHz. As suggested in [36] and our previous exploration [10], keeping these up-sampled recordings has been shown to lead to overly-optimistic results since classifiers learned to capture the difference in sampling rates instead of the actual pathological pattern. Thus, we removed them from our analysis. The DiCOVA2 dataset, in turn, is comprised of development and evaluation subsets, with the evaluation data being accessible only to challenge participants. Hence, we performed a speaker-independent training-test split (80/20%) using the development subset only and left the evaluation set for testing.

B. Tasks
As our final goal is to not only provide accurate diagnostics decisions but also ensure the protection of privacy of speaker identity, the evaluation was divided into three tasks. In the first task, we compared the effectiveness and complexity of different anonymization techniques. In Task-2, we then quantified the impact of anonymization techniques on diagnostics accuracy in different conditions. Finally, we provided explanations for the impact seen in Task-2, and explored solutions for improving the proposed systems.
1) Task-1: Evaluating anonymization performance: As is shown in Figure 4, for each speech recording, the speaker embeddings were extracted separately from the original version, the McAdams-anonymized version, the Ling-GAN anonymized version, and the Ling-Pros-GAN anonymized version. Cosine similarity was then computed between the embeddings of each pair of signals, where higher cosine similarity values represented higher resemblance between two speech samples. Meanwhile, we employed the pre-trained ECAPA-TDNN speaker verification model from SpeechBrain [62] to detect if two recordings are from the same speaker, then evaluated the misclassification rate, where higher values suggest more successful anonymization. Since multiple evaluation scenarios were considered in this study, where training and test data were processed with different anonymization methods, the cosine similarity and the misclassification rate were computed between not only the clean and anonymized data, but also data processed by different anonymization methods. Additionally, we measured the computation time spent by the three methods per recording, and calculated the average and standard deviation for each dataset. This helps to quantify and compare the time efficiency of the three anonymization methods.
2) Task-2: Evaluating diagnostics accuracy: As mentioned in Section I, training and test data could be anonymized using different methods. To mimic a realistic setting, we explore four different scenarios, as detailed below. Table II summarizes these conditions.
Scenario-A: Unprotected: Here, both training and test data are original, thus anonymization is not performed. This encompasses the traditional diagnostic system evaluation and serves as a baseline of the maximum diagnostics accuracy that can be achieved by each model.
As data distributions vary across datasets [72], diagnostics performance obtained under within-dataset conditions may lack external validity and has been shown to be over-optimistic [41], [44], [72]. To ensure the generalizability of the tested methods, for each scenario we explore both within- and cross-dataset results. In the latter, models are trained on one dataset and tested on data from another set. As the CSS is a subset of the Cambridge set, we avoid using both datasets in the cross-dataset condition to avoid overly-optimistic results [36], [55].
3) Task-3: Anonymization for data augmentation: Data augmentation has been widely used in speech applications based on deep neural networks to improve accuracy, especially under mismatched train-test data distributions. One of the approaches to increase model generalizability is to use external data augmentation, which refers to the case where data from external datasets curated for similar tasks are pooled with the in-domain data to increase training sample size [73], [74]. In our case, we aim to improve the generalizability of diagnostic models to samples anonymized by unknown algorithms. We propose to combine anonymized external data with the original data as an augmentation approach to mitigate the degradation caused by anonymization. We focus on two cases, namely augmenting the ignorant and semi-informed scenarios, in which we observed diagnostic models performed the worst. As shown in Figure 5, we experimented with four versions of the augmentation data, including the clean version (i.e.,

C. Training and Inference Strategies
1) Training: For the systems that rely on hand-crafted features, training data normalization was achieved by removing the mean and scaling to unit variance. The fitted scaler was then applied to the validation and test data. Hyper-parameters were tuned on the held-out validation set. The optimal SVM regularization parameter was searched between 1e-5 and 1; the SVM kernel was set to linear; and the number of principal components (PCs) was varied from 100 to 300. To train the BiLSTM classifier, in turn, recordings were first zero-padded to 10-second length to ensure a fixed shape for logmelspec input; the spectrogram was then mean-variance normalized. Each mini-batch was composed of 64 samples with random shuffling, forced to contain both COVID-positive and COVID-negative samples. Unlike [12], no oversampling of the minority class or any other data augmentation techniques were used, as their effect on anonymization has yet to be quantified. The following hyper-parameters were used for training: Binary Cross Entropy (BCE) loss; Adam optimizer with an initial learning rate of 1e-4 and L2 regularization set to 1e-4. During the validation phase, an initial patience factor was set to 5 and reduced by 1 if the validation score did not increase. Training stopped whenever the patience factor reduced to 0, and the number of training epochs was saved for the inference phase.
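The patience-based stopping rule described above can be sketched as follows. Two details are assumptions made for this illustration and are not stated in the text: the validation score is compared against the best score seen so far, and the patience counter is never reset. The function name `epochs_until_stop` is hypothetical.

```python
def epochs_until_stop(val_scores, patience=5):
    """Return the number of epochs run before patience reaches zero.

    val_scores holds one validation score per epoch, in training order.
    """
    best = float("-inf")
    epochs_run = 0
    for score in val_scores:
        epochs_run += 1
        if score > best:
            best = score  # improvement: keep the current patience
        else:
            patience -= 1  # no improvement: spend one unit of patience
            if patience == 0:
                break
    return epochs_run


# Illustrative scores only; the returned count is what would be saved for inference.
saved_epochs = epochs_until_stop([0.60, 0.70, 0.65, 0.72, 0.70, 0.70, 0.70, 0.70, 0.90])
```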
2) Inference: For the first four systems, the pre-trained model with the highest validation score was then used for testing. As the BiLSTM classifier is more data-hungry, the optimal hyper-parameters found in the training phase were then used to train the classifier from scratch with the aggregated training and validation data. The number of training epochs was kept the same as that saved in the training phase.

D. Evaluation Metrics
Since all three datasets are imbalanced, the area under the receiver-operating-characteristic curve (AUC-ROC) was chosen as the primary metric to measure the diagnostics accuracy. We further calculated the 95% confidence intervals (CIs) using 1000× bootstrap with replacement on the test set. According to [75], CIs can reflect the variability of diagnostics accuracy when the model is applied to a different population.
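A minimal sketch of a bootstrap confidence interval for the AUC-ROC is given below. It is not the evaluation code used in this study: it assumes a percentile interval, computes the AUC by pairwise comparison of positive and negative scores, and redraws any resample that contains a single class. The function names and the default seed are hypothetical.

```python
import random


def auc_roc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile confidence interval from resampling the test set with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined without both classes; redraw
        aucs.append(auc_roc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((1 - level) / 2 * n_boot)]
    upper = aucs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch but slow for large test sets.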
As mentioned previously, cosine similarity and misclassification rate were used to quantify the effectiveness of the three anonymization methods. The similarity scores are averaged across samples from all three datasets per method, where 0 represents no resemblance and 1 represents a perfect match between two tested speech conditions. While the Equal Error Rate (EER) is commonly used to evaluate anonymization efficacy, ground-truth speaker identifiers are required for each recording in order to verify if samples are from the same or different speakers. However, speaker identifiers were not available for the CSS and DiCOVA2 datasets. Instead, we rely on the misclassification rate by employing a pre-trained speaker verification model from [62]. For each recording, the model outputs a binary decision (yes/no) on whether it believes a pair of anonymized and clean speech signals come from the same speaker. The misclassification rate is then calculated by dividing the number of misclassified pairs by the total number of pairs per method, which reflects the percentage of successfully anonymized recordings. For an ideal anonymization system, the misclassification rate is expected to be 100%, i.e., the model should decide that all anonymized signals do not come from the same speaker as the clean signal counterpart. Lastly, computation time was recorded for each anonymization method, including the loading and exporting of the audio files. In the case of the two GAN methods, the loading time of the model itself was not taken into account in this computation.
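The misclassification rate defined above reduces to a simple count. The sketch below assumes the verifier's yes/no answers for all (clean, anonymized) pairs of one method are available as booleans; the function name is hypothetical.

```python
def misclassification_rate(same_speaker_decisions):
    """Percentage of (clean, anonymized) pairs judged to be different speakers.

    Each entry is the verifier's boolean answer to "same speaker?".
    Every pair truly shares a speaker, so a "no" counts as a misclassification.
    """
    decisions = list(same_speaker_decisions)
    rejected = sum(1 for d in decisions if not d)
    return 100.0 * rejected / len(decisions)
```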

A. Task-1: Anonymization Results
The average cosine similarity scores between the speaker embeddings of speech files anonymized by the different methods, together with the misclassification rates, are shown in Figure 6. As can be seen, both GAN-based methods achieved near-perfect anonymization in terms of misclassification rate, with almost no similarity to either the original speech or the speech anonymized by other methods. On the other hand, nearly half of the McAdams-anonymized samples were still matched to the original speaker, suggesting that some speaker-unique information remained.
The computational complexity of the three anonymization methods is presented in Table III for all three datasets. While the GAN-based methods provide better anonymization effectiveness, they require computation times approximately 10-20 times longer than the McAdams coefficient method. The longest time was seen with Ling-Pros-GAN, since it requires extra time to extract prosody, which involves an online training loop, and to generate and find embeddings in real-time. As model loading time was not taken into account, the computational footprint of the GAN-based methods could be larger in real-world settings. Additionally, the GAN-based methods rely on several pre-trained neural networks with millions of parameters (e.g., 22.3 million for the ECAPA-TDNN embedding extractor; 10 million for the generator), which could make them challenging to deploy on mobile devices.

B. Task-2: Within-dataset Performance
The within-dataset performance of the five diagnostics systems under different anonymization scenarios is shown in Figure 7. As can be seen from the average AUC-ROC scores per scenario, the highest performance is achieved under scenario A, i.e., when anonymization is not performed. When the test data are anonymized using the McAdams coefficient (scenario B1), the average AUC-ROC score over all systems dropped by 8.9% (CSS), 5.9% (DiCOVA2), and 6.3% (Cambridge) relative to scenario A. A more substantial decrease was observed with the Ling-GAN and Ling-Pros-GAN anonymizers (scenarios B2 and B3), with average relative drops of 22.5% and 18.1%, respectively. Moreover, nearly all systems degraded to chance levels under scenario C, where models were trained with data anonymized by one method and tested with data anonymized by another, suggesting that anonymization may drastically remove COVID-19 speech information. Diagnostic performance in the fully-informed scenarios is shown to be close to scenario A. Among the three anonymizers, McAdams anonymization leads to higher diagnostic performance on average in scenario D.
Compared to the Ling-GAN, Ling-Pros-GAN shows higher performance on the English datasets (DiCOVA2 and Cambridge) and lower performance on the multilingual one (CSS). Next, we evaluate the sensitivity of different diagnostics systems to anonymization and explore the relative drop in accuracy from scenario A to scenario B. Table IV reports the average drops seen per dataset. As can be seen, the two GAN-based methods resulted in substantially higher degradation than the McAdams coefficient method, with the Ling-GAN leading to the most severe decrease. This was expected and corroborates Task-1 results, where speaker embeddings of the GAN-anonymized speech showed practically no similarity to the original speech. Meanwhile, since Ling-Pros-GAN leaves the prosody intact and generates more COVID-like embeddings, it is likely to preserve more COVID-19 attributes than the Ling-GAN, thus yielding higher diagnostic performance after anonymization. Previous studies have shown that speaker embeddings (e.g., x-vector) also contain other nonverbal information and can be used for speech para-linguistic tasks [76], [77], such as speech emotion recognition [78] and disease detection [79], [80]. While the GAN-based anonymizers substitute the original speaker embedding with a dissimilar speaker embedding, the obtained results suggest that health-related vocal characteristics are likely also discarded, thus resulting in significant drops in diagnostic accuracy.
Lastly, we use scenario A as the baseline and calculate the average drop in accuracy for scenario C, showing the impact that training models completely on anonymized data would have. For both openSMILE and MSR methods, we use the PCA-SVM pipeline to avoid the effects of differences in the number of features. The comparative results are reported in Table V. As can be seen, all three diagnostic systems show degraded performance, with the logmelspec+BiLSTM system shown to be on average more robust (21.6%) to the semi-informed anonymization scenario. Notwithstanding, it should be highlighted that the logmelspec+BiLSTM system achieved the lowest AUC-ROC in scenario A. Interestingly, with the CSS dataset, the diagnostic system based on a BiLSTM and log-mel spectrogram input showed a substantially lower degradation percentage than the two other systems, which are based on traditional engineered features and classifiers. CSS is a multilingual dataset, thus hand-crafted features (e.g., syllabic rate, speech production features) used in these models may show more sensitivity to language.
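The relative drops reported in this subsection use scenario A as the baseline. A minimal sketch of the computation is given below; the function names are hypothetical, and averaging the per-system drops (rather than computing the drop of the averaged AUC-ROC) is an assumption, since either convention is compatible with the description above.

```python
def relative_drop(auc_baseline, auc_anonymized):
    """Relative drop (%) of the AUC-ROC with respect to the scenario-A baseline."""
    if auc_baseline <= 0.0:
        raise ValueError("baseline AUC-ROC must be positive")
    return 100.0 * (auc_baseline - auc_anonymized) / auc_baseline


def mean_relative_drop(baseline_by_system, anonymized_by_system):
    """Average of the per-system relative drops (assumed averaging convention)."""
    if not baseline_by_system:
        raise ValueError("at least one diagnostic system is required")
    drops = [
        relative_drop(baseline_by_system[name], anonymized_by_system[name])
        for name in baseline_by_system
    ]
    return sum(drops) / len(drops)
```

As a purely hypothetical example, an AUC-ROC falling from 0.80 in scenario A to 0.72 after anonymization corresponds to a relative drop of 10%; a negative value would indicate an improvement.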

C. Task-2: Cross-dataset Performance
Figure 8 shows the cross-dataset performance under the thirteen different testing scenarios. In line with previous studies [44], [81], all five diagnostics systems demonstrated significantly lower performance relative to within-dataset results; the logmelspec+BiLSTM showed the greatest drop in performance. Interestingly, in a few scenarios anonymization helped systems become more generalizable relative to the unprotected setting (e.g., scenarios B2 and C3 for the CSS-DiCOVA2 cross-database experiment). Figure 9 depicts the average change in accuracy relative to scenario A for all scenarios and diagnostic systems. While on average a 6.6% drop in accuracy was seen across all five systems, an increase of 2% and 5% was achieved with the MSR+SVM and logmelspec+BiLSTM systems for scenarios C4 and C2, respectively. It is important to note that both scenarios involved GAN-based anonymized test data, and thus typically had lower cross-dataset results to start with.

D. Explaining the degradation caused by different anonymizers
While our study shows that typical anonymization systems lead to degraded diagnostic performance, it is unclear why different systems caused different levels of degradation and why some diagnostic models could still perform decently after anonymization. To answer these questions, we performed a comprehensive evaluation of the impact of different speech aspects on diagnostic performance, including the linguistic content, speaker representation, and prosody. Similar to the experimental setup of Task-1, we now compare the within-dataset performance obtained by three categories of speech features, namely (1) phoneme-level features, including the number of mispronunciations (relative to the speech script), number of pauses, and number of phonemes uttered per second; (2) the speaker representation extracted by concatenating the pre-trained x-vector and ECAPA-TDNN embeddings [51]; and (3) prosodic features, such as the low-level descriptors of the F0 contour.
A Linear Discriminant Analysis (LDA) classifier is applied on top of each of the feature sets for classification. The results achieved by these features are reported in Table VI. Among the three feature sets, speaker embeddings appear to be the most crucial features for all datasets, corroborating Task-1 results, where the GANs, which entirely substitute the original speaker embeddings, suffered the most severe degradation. This finding also suggests that speaker-unique attributes and health-related information are highly entangled in the speaker embeddings. Considering that existing anonymization systems rely heavily on these off-the-shelf speaker embeddings, it remains challenging to preserve the health information while altering only the speaker identifier.
While a group of studies reported prosody as a key biomarker to characterize speech disorders, such as dysarthria [82]-[84], our results show that phoneme-level linguistic features outperform prosodic features for COVID-19 detection. Specifically, we found the number of pauses and number of mispronunciations to be the most important phoneme-level features, with COVID-positive samples demonstrating more mispronunciations and fewer pauses. While the correlation between phoneme-level features and COVID-19 status has not been systematically studied, similar features have been examined for other diseases affecting speech production. For example, [85] shows that individuals with Parkinson's disease produced fewer pauses at syntactic boundaries; the statistics of pauses have been shown to be crucial for diagnosing neuromuscular disorders, such as dysarthria [86]. Since GAN-based systems left linguistic content intact during anonymization, these findings help explain why the diagnostic models could perform above chance level even when only the phoneme sequences were preserved during anonymization.
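To make the phoneme-level features concrete, the sketch below derives the three quantities from a time-aligned phoneme sequence. Everything about the input format is an assumption made for illustration: the `(label, start, end)` segments are taken to come from a hypothetical forced aligner, the `"sil"` pause label and the 0.2 s minimum pause duration are arbitrary choices, and the edit distance to the reference script is only a stand-in for the mispronunciation count, whose exact definition is not given here.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_level_features(segments, reference, pause_label="sil", min_pause=0.2):
    """Compute pause count, phoneme rate, and a mispronunciation proxy.

    segments:  list of (label, start_sec, end_sec) tuples in temporal order
    reference: list of phoneme labels expected from the speech script
    """
    if not segments:
        raise ValueError("at least one segment is required")
    spoken = [label for label, _, _ in segments if label != pause_label]
    n_pauses = sum(
        1 for label, start, end in segments
        if label == pause_label and (end - start) >= min_pause
    )
    duration = segments[-1][2] - segments[0][1]
    rate = len(spoken) / duration if duration > 0 else 0.0
    return {
        "n_pauses": n_pauses,
        "phonemes_per_sec": rate,
        "n_mispronunciations": edit_distance(spoken, reference),
    }
```

Under these assumptions, an utterance in which one scripted phoneme is dropped and a single long silence occurs yields one pause and one mispronunciation.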

E. Visualizing speech processed by different anonymizers
To better understand the impact of different anonymization methods on speech characteristics, we first visualize the waveforms of the speech processed by the three anonymizers (see Fig. 10) for a direct comparison. As can be seen, the waveforms processed by the McAdams anonymizer and Ling-Pros-GAN share greater similarity in envelope shape with the original signal than the one generated by Ling-GAN. The difference seen in the plot is in line with the architecture design of the different anonymizers. Among the three, Ling-GAN loses prosody and most of the speaker attributes, hence is expected to cause the largest changes in the anonymized speech. The Ling-Pros-GAN and McAdams anonymizer, in turn, leave the speech rhythm untouched (i.e., duration and energy of phonemes), hence leading to higher resemblance in the waveform envelope.
Next, t-SNE plots are used to visualize the distribution of the speech features in two dimensions. Figure 11 shows the clusters of speech anonymized with different methods (computed from the training and validation data) for the three feature modalities explored herein: openSMILE (subplot a), MSR (b), and logmelspec (c). As can be seen, for all three feature sets, the distribution of clean speech (blue) is closer to that of the McAdams anonymized speech (orange) and Ling-Pros-GAN anonymized speech (red), while the Ling-GAN anonymized speech (green) shows the least similarity with the other two, corroborating findings from Tasks 1 and 2.
Moreover, it can be seen from Figure 11a and Figure 11b that the clusters computed from openSMILE and MSR features show little overlap, while clusters of the logmelspec features show great overlap (Figure 11c). Together with Task-2 results, this shift in the feature space is likely the main cause of the larger decrease observed in the openSMILE and MSR systems under different anonymization settings. Meanwhile, since all anonymization methods keep the speech content intact and change only the nonverbal attributes, a greater shift in feature space may indicate a stronger correlation with the para-linguistic aspect and a weaker one with the linguistic aspect. This echoes previous studies which showed that openSMILE and MSR features are preferred over logmelspec features in characterizing emotional and unnatural speech [87], [88].

F. Task-3: Improving Diagnostics Performance with Data Augmentation
Lastly, we investigate the impact of using anonymized external data for data augmentation on the performance achieved in scenarios B and C. For scenario C, we chose sub-conditions C1 and C3. To quantify the relative improvement, we used the within-dataset performance achieved in scenario A as the baseline and calculated the performance increase (in percentage). The relative changes observed with the three diagnostics systems are reported in Table VII. Here, we explore augmentation with two different datasets and with four different methods: original, McAdams, Ling-GAN, and Ling-Pros-GAN. As can be seen, when test data are anonymized using the McAdams coefficient (B1), the highest improvement is generally achieved when the diagnostic system is augmented with the original data. In turn, when the test data are anonymized using the GAN-based method (B2), augmenting the set with GAN-anonymized data from another dataset leads to a higher increase. Similar results are shown in scenarios C1 and C2, where clean and GAN-anonymized augmentation data result in more significant improvements. While not the top performer, the McAdams method is shown to be a reliable augmentation strategy, especially for the openSMILE features. Overall, these findings suggest that anonymization has the potential to be used as a data augmentation approach to improve COVID-19 diagnostic accuracy when tested on anonymized data.

G. Limitations, Biases, and Future Work
The study's principal aim was to validate the effectiveness of anonymization methods within and across datasets in the context of assessing voice-based COVID-19 diagnostic accuracy. While we investigated three anonymization methods, other methods are emerging continuously (e.g., [89], [90]); thus, the findings reported herein should be validated with more recent methods. In the present study, the ASR anonymization method developed on English speech was applied to the multilingual CSS dataset. The finding that GAN-based anonymization had the lowest cross-dataset performance may suggest challenges in applying this method to multilingual datasets and non-English speaking populations. In the future, multilingual GANs should be explored to avoid unfair outcomes [91] due to certain languages or cultural settings being excluded from the training and testing datasets. Moreover, while the injection of anonymized external data proved to be a useful data augmentation strategy, the final results were still at times lower than those achieved in the classical "unprotected" setting. This suggests that health-related information is being discarded during the anonymization process; thus, future work could explore the development of diagnostic-aware anonymization methods that keep such discriminatory information intact.
Beyond tackling the limitations mentioned above, future work on voice-based diagnostics should be mindful of potential biases during data collection that could lead to confounds for both the anonymization and diagnostic steps. These confounders, if not properly dealt with, can reinforce the systemic nature of biases, for instance in relation to gender and racioethnic groups, that already exist within the healthcare system, thus transferring them to automated diagnostic systems. While [35] already showed some impact of sampling rate on diagnostic accuracy, several other potential biases may exist at the methodological level. For example, sociodemographic biases may emerge if age is not taken into consideration, as cognitive limitations (e.g., difficulty in speech planning or lexical access) associated with aging could alter speech patterns and thereby affect overall diagnostic accuracy. Recent work has shown that socioeconomic status could serve as a bias in COVID-19 detection [92]. For example, as data were collected from participants at home, those living in crowded conditions may have had higher background noise levels, which could negatively affect anonymization and diagnostic efficacy. Moreover, disadvantaged populations have been shown to have more chronic respiratory diseases [93] and higher levels of mood disorders and psychological distress [94]. As such, anonymization processes affecting para-linguistic features associated with depressed mood may disproportionately affect those with a low socioeconomic status. Addressing biases in automated voice anonymization and diagnosis systems is beyond the scope of this paper and is left for future studies.

VI. CONCLUSION
In this study, we comprehensively evaluated the impact of three voice anonymization methods on the accuracy of five leading COVID-19 detection systems, as well as their anonymization efficacy. All anonymization methods were shown to degrade diagnostic accuracy, with the most severe degradation seen with the systems that directly altered speaker embeddings. Our findings suggest that existing methods lack the capability to effectively preserve diagnostic information while obfuscating speaker identifiers. Lastly, we explored the use of anonymized external data as a data augmentation tool and obtained promising results.

Fig. 1. Block diagram of a speech-based diagnostics system with (protected) and without (unprotected) anonymization. 'SD' stands for speech-based diagnostic system and 'ASV' for automatic speaker verification.

Fig. 2. Diagram of the two GAN-based anonymizers implemented in this study. Compared to the Ling-GAN, the Ling-Pros-GAN not only preserves the original prosody, but also has the generator and discriminator fine-tuned with COVID-19 speech data, enabling it to generate more COVID-like speaker embeddings.

Fig. 4. Evaluation of the effectiveness of different voice anonymization methods, as well as their computational complexity.

Fig. 6. Cosine similarity between speech signals under different anonymization conditions averaged across three datasets. Values in the parentheses are the corresponding misclassification rates.

Fig. 7. Within-dataset performance under different anonymization scenarios. Error bars represent the 95% CIs. The line plot values correspond to the average AUC-ROC scores over the five diagnostic systems calculated per scenario.

Fig. 8. Cross-dataset performance under different anonymization scenarios. Error bars represent the 95% CIs. The line plot values correspond to the average AUC-ROC scores over the five diagnostic systems calculated per scenario.

Fig. 9. Relative changes in the AUC-ROC under different anonymization scenarios for all diagnostics systems in the cross-dataset experiment.

Fig. 10.A comparison of the waveforms processed by the three anonymizers and the original speech.

Fig. 11. t-SNE clusters of anonymized speech features for different feature sets, namely: (a) openSMILE, (b) MSR, and (c) logmelspec. Blue dots correspond to original speech; orange to McAdams coefficient anonymized speech; red to Ling-Pros-GAN anonymized speech; and green to Ling-GAN anonymized speech.

TABLE III
AVERAGE COMPUTATION TIME PER SPEECH FILE (SECOND) WITH STANDARD DEVIATIONS USING DIFFERENT ANONYMIZATION METHODS FOR THE THREE DATASETS.
Pros-GAN 26.49±20.53 24.9±11.61 19.47±11.61 23.62

TABLE IV
DROP IN WITHIN-DATASET AUC-ROC (%) FROM SCENARIO A TO SCENARIO B FOR DIFFERENT ANONYMIZATION METHODS.

TABLE V
DROP IN WITHIN-DATASET AUC-ROC (%) FROM SCENARIO A TO THE AVERAGE OF ALL SUB-CONDITIONS UNDER SCENARIO C FOR DIFFERENT DIAGNOSTICS SYSTEMS.

TABLE VII
CHANGE OF AUC-ROC SCORES ACHIEVED IN SCENARIOS B AND C AFTER DATA AUGMENTATION (GIVEN IN %). BOLD VALUES INDICATE THE HIGHEST IMPROVEMENT WITH EACH DIAGNOSTICS SYSTEM UNDER A GIVEN SCENARIO.