Speaker anonymization using orthogonal Householder neural network

Speaker anonymization aims to conceal a speaker's identity while preserving content information in speech. Current mainstream neural-network speaker anonymization systems disentangle speech into prosody-related, content, and speaker representations. The speaker representation is then anonymized by a selection-based speaker anonymizer that uses a mean vector over a set of randomly selected speaker vectors from an external pool of English speakers. However, the resulting anonymized vectors are subject to severe privacy leakage against powerful attackers, reduction in speaker diversity, and language mismatch problems for unseen-language speaker anonymization. To generate diverse, language-neutral speaker vectors, this paper proposes an anonymizer based on an orthogonal Householder neural network (OHNN). Specifically, the OHNN acts like a rotation to transform the original speaker vectors into anonymized speaker vectors, which are constrained to follow the distribution over the original speaker vector space. A basic classification loss is introduced to ensure that anonymized speaker vectors from different speakers have unique speaker identities. To further protect speaker identities, an improved classification loss and similarity loss are used to push original-anonymized sample pairs away from each other. Experiments on VoicePrivacy Challenge datasets in English and the AISHELL-3 dataset in Mandarin demonstrate the proposed anonymizer's effectiveness.


I. INTRODUCTION
Speech technology enables machines to recognize, analyze, and understand human speech, which facilitates human-machine communication and offers great convenience in our daily lives. Despite its prominent advantages, it suffers from voice privacy leakage, which allows for intrusion upon or tampering with a speaker's private information. For instance, by using advanced speaker [1], [2], dialect [3], [4], pathological condition [5], [6], or other types of speech attribute recognition systems, attributes such as a speaker's identity, geographical origin, and health status can easily be captured from speech recordings. Moreover, advanced speech synthesis techniques enable resynthesis, cloning, or conversion of a speaker's identity information to access personal voice-controlled devices [7], [8], [9]. In this paper, we are especially interested in speaker anonymization, which is a user-centric voice privacy solution to conceal a speaker's identity without degrading intelligibility and naturalness [10], [11], [12]. This task was standardized by the VoicePrivacy Challenge (VPC) committee [11], [12], [13], which held challenges in 2020 and 2022, to advance the development of voice privacy preservation techniques.
Natalia Tomashenko is with Laboratoire Informatique d'Avignon (LIA), Avignon University, France (email: natalia.tomashenko@univ-avignon.fr).
Several approaches to protect speaker privacy are based on digital signal processing (DSP) methods [11], [12], [14], [15], [16], [17], [18], which modify instantaneous speech characteristics such as the pitch, spectral envelope, and time scaling. State-of-the-art anonymization approaches have borrowed ideas from neural speech conversion and synthesis, mainly focusing on disentangled latent representation learning [10], [19], [20], [21], [22], [23], [24], [25] via two hypotheses. The first is that speech can be explicitly decomposed into content, speaker identity, and prosodic (intonation, stress, and rhythm) representations. Here, the speaker identity is a statistical time-invariant representation throughout an utterance, whereas content and prosodic information vary over time. The second hypothesis is that a speaker's identity representation carries most of his or her private information. Thus, generated speech using original content, prosodic, and anonymized speaker representations can suppress the original identity information (privacy) while maintaining intelligibility and naturalness (utility).
A general framework for disentanglement-based speaker anonymization involves the following components.
Fine-grained disentangled representation extraction from original speech: Here, extraction entails three aspects: (i) Content feature extraction. Low-dimensional phonetic bottleneck features are typically extracted from an intermediate layer of a language-specific automatic speech recognition neural acoustic model (ASR AM) [26], [27]. This type of content encoder is trained in a supervised manner using transcribed English training data. As the objective is to obtain accurate linguistic representations, the effectiveness is severely limited when applied to a different language. Content encoders based on self-supervised learning (SSL) can overcome this limitation because they are trained in a self-supervised manner using unlabeled training data. Specifically, they can provide general content representations not dependent on the language, thus enabling robust anonymization of speech data even for unseen languages. (ii) Prosody-related feature extraction to obtain the fundamental frequency, i.e., F0. (iii) Speaker embedding extraction. A speaker vector is extracted either from an automatic speaker verification (ASV) system based on a time-delay neural network (TDNN) [28], or from a more effective ASV system based on emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN) [29].
arXiv:2305.18823v2 [cs.SD] 13 Sep 2023
Speaker representation anonymization: The core idea of a speaker vector anonymizer is to hide original speaker information while preserving the diversity among different speakers. A widely used anonymizer is based on the selection and averaging of speaker vectors [30], [23]. Given a large set of speaker vectors, the anonymizer finds the $N$ candidate vectors farthest from the input original vector. It then randomly selects $N^* < N$ vectors among the $N$ farthest ones and uses their average as a pseudo-speaker vector to replace the original speaker vector. The large set of speaker vectors, called an external pool, has to be loaded by the anonymizer during anonymization.
Anonymized speech synthesis: An anonymized speaker vector with the original fundamental frequency and content features is passed to a speech waveform generation model to synthesize high-quality anonymized speech. The speech synthesis model can be either a traditional text-to-speech pipeline model, consisting of a speech synthesis acoustic model (SS AM) and a neural source-filter (NSF) based vocoder [31], or a unified HiFi-GAN [32].
Despite confirmation of this approach's effectiveness [11], [12], [33], there remains much room for improvement for different attack scenarios and unseen language anonymization. Previous works [11], [12], [33], [23] have suggested that the most significant performance bottleneck for the current mainstream approach is the selection-based speaker anonymizer, whose performance significantly depends on the distribution of the external pool and how pseudo-speakers are selected from the pool. (i) For English speaker anonymization [11], [12], [13], the performance of speaker verifiability has gradually decreased against more powerful attackers. Additionally, voice distinctiveness is significantly degraded by anonymization. (ii) For unseen-language (e.g., Mandarin) speaker anonymization, pseudo-speaker representations are generated from an external English speaker vector pool, and the resulting language mismatch increases the character error rate (CER) [33], [34].
Following this pipeline of disentanglement-based anonymization, with special consideration of the selection-based approach's problems, we propose a novel speaker anonymization system (SAS) based on an orthogonal Householder neural network (OHNN). As shown in the lower part of Fig. 1, the OHNN-based anonymizer generates distinctive anonymized speaker vectors that can protect privacy under all attack scenarios and can successfully be adapted to unseen-language speaker anonymization without severe language mismatch. Specifically, original speaker vectors are rotated to anonymized ones by an OHNN, which is a linear transformation with orthogonality. This module ensures that the anonymized speaker vectors follow the distribution over the original speaker vector space. To discourage overlap between anonymized speakers and other speakers, we use a classification loss based on an additive angular margin softmax (AAM) and cross-entropy to train the OHNN, and we assign different target class labels to the original and anonymized speaker vectors of different speakers. This encourages the anonymized vectors to not overlap with any other speakers, regardless of whether they are original or anonymized. To further push original-anonymized sample pairs away from each other, an improved classification loss called weighted AAM (w-AAM) and a cosine similarity loss are used.
The main contributions of this work are as follows:
• We propose an OHNN-based anonymizer that transforms original speaker vectors into anonymized ones with carefully designed training constraints. We show empirically that these anonymized speaker vectors are diverse and language-neutral.
• We visualize the cosine similarities between pairs of speaker vectors extracted from the generated speech of users and different attackers. The generated speech is obtained using the commonly used selection-based anonymizer and our OHNN-based anonymizer. The results show that our proposed method effectively reduces the privacy leakage against different attackers and improves the diversity of anonymized speakers.
• We conducted experiments on VPC English datasets and the AISHELL-3 Mandarin dataset. Our findings show that the proposed model can be successfully adapted to both a matched language condition (i.e., English) and a mismatched language condition where the target language (Mandarin) is not included in the training database. The proposed anonymizer achieved a competitive performance under all attack scenarios in terms of privacy and utility metrics. Under the Semi-informed condition, our proposed methods achieved better results for English speaker anonymization than all the submissions to VPC2022 [35], [36], [24], [37], [38], [25].

II. RELATED WORK
In this section, we introduce the VPC's official design, which provides the setting for this study, including definitions of specific goals, attack models, and objective evaluation metrics. We also review existing speaker anonymization approaches and their limitations.

A. The VoicePrivacy Challenges
The VPC formulates the speaker anonymization task as a game between users and attackers, as shown in Fig. 2. A user publishes anonymized data, called test trials, after applying an SAS to his or her original private speech. According to the VPC evaluation plan [13], an SAS should:
• output an anonymized speech waveform;
• conceal the speaker's identity from different attackers;
• keep content and other paralinguistic attributes unchanged to maintain intelligibility and naturalness;
• ensure all test trials from the same speaker are attributed to the same pseudo-speaker, while test trials from different speakers have different pseudo-speakers.¹
1) Attack Models and Objective Evaluation Metrics:
a) Privacy metric: To assess the ability to protect a speaker's identity in different scenarios, the ASV performance in terms of the equal error rate (EER) is computed as the primary privacy metric by using language-matched ASV evaluation models. This metric is calculated under the four attack models shown in the lower left of Fig. 2. The attackers are assumed to have access to a few original or anonymized utterances for each speaker, called enrollment utterances, and to have different levels of knowledge about the SAS:
• Unprotected: No anonymization is applied, and attackers verify the original test trials against the original enrollment data by using an ASV system trained on the original dataset, denoted $\text{ASV}_{\text{eval}}$.
• Ignorant: Attackers are unaware of the anonymization strategy used for the test trial utterances; instead, they use the original enrollment data and $\text{ASV}_{\text{eval}}$ to infer a speaker's identity.
• Lazy-informed: Attackers use a similar SAS without accurate parameters to anonymize their enrollment data, and they use $\text{ASV}_{\text{eval}}$ to detect a speaker's identity.
• Semi-informed: The only difference from Lazy-informed is that the attackers use $\text{ASV}^{\text{anon}}_{\text{eval}}$, a more powerful version trained on anonymized speech, to reduce the mismatch between the original and anonymized speech and infer the speaker's identity.
b) Primary utility metric: To assess how well speech content is preserved in anonymized speech, the ASR performance in terms of the word error rate (WER) is computed as a primary utility metric by using language-matched ASR evaluation models. As illustrated in the lower right of Fig. 2, two ASR models are trained in the same way to decode the anonymized data: $\text{ASR}_{\text{eval}}$, trained on the original data, and $\text{ASR}^{\text{anon}}_{\text{eval}}$, trained on the anonymized data. This enables exploration of whether speech content can be maintained better by simply retraining with similarly anonymized data.
c) Secondary utility metric: To assess and visualize the preservation of voice distinctiveness, the gain of voice distinctiveness metric, $G_{\text{VD}}$ [39], [40], is computed. Specifically, $M = (M(i,j))_{1 \le i \le N,\, 1 \le j \le N}$ is a voice similarity matrix for $N$ speakers, where the similarity value $M(i,j)$ for speakers $i$ and $j$ is formulated as follows:
$$M(i,j) = \mathrm{sigmoid}\left( \frac{1}{n_i n_j} \sum_{\substack{1 \le k \le n_i,\; 1 \le l \le n_j \\ k \ne l \text{ if } i = j}} \mathrm{LLR}\left(x_k^{(i)}, x_l^{(j)}\right) \right) \quad (1)$$
Here, $n_i$ and $n_j$ are the numbers of utterances for each speaker, and $\mathrm{LLR}(x_k^{(i)}, x_l^{(j)})$ is the log-likelihood ratio obtained by comparing the $k$-th utterance of the $i$-th speaker with the $l$-th utterance of the $j$-th speaker. These LLR scores are computed by probabilistic linear discriminant analysis (PLDA) [41] of the $\text{ASV}_{\text{eval}}$ model trained on the original data.
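The secondary utility metric reduces to simple arithmetic on voice similarity matrices. The following is a minimal sketch, assuming the diagonal dominance is the absolute difference between the mean diagonal and mean off-diagonal entries of a matrix and the gain is ten times the base-10 logarithm of the ratio between the anonymized and original matrices. The function names and the toy 3-speaker matrices are illustrative and are not taken from the VPC toolkit.

```python
import numpy as np


def diagonal_dominance(m: np.ndarray) -> float:
    """Absolute difference between mean diagonal and mean off-diagonal entries."""
    n = m.shape[0]
    diag_mean = np.trace(m) / n
    off_diag_mean = (m.sum() - np.trace(m)) / (n * (n - 1))
    return float(abs(diag_mean - off_diag_mean))


def gain_voice_distinctiveness(m_oo: np.ndarray, m_aa: np.ndarray) -> float:
    """Gain in dB: ratio of diagonal dominance after vs. before anonymization."""
    return float(10.0 * np.log10(diagonal_dominance(m_aa) / diagonal_dominance(m_oo)))


# Toy similarity matrices (hypothetical values): anonymization halves the
# contrast between same-speaker and different-speaker similarities.
m_oo = np.full((3, 3), 0.1) + 0.8 * np.eye(3)   # diagonal 0.9, off-diagonal 0.1
m_aa = np.full((3, 3), 0.1) + 0.4 * np.eye(3)   # diagonal 0.5, off-diagonal 0.1

g_vd = gain_voice_distinctiveness(m_oo, m_aa)
print(f"G_VD = {g_vd:.2f} dB")  # about -3.01 dB: distinctiveness is reduced
```

A gain of 0 dB would be obtained if both matrices had the same diagonal dominance.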
Three matrices are constructed from the original (o) and anonymized (a) data: $M_{oo}$ from the original data, $M_{oa}$ from the original and anonymized data, and $M_{aa}$ from the anonymized data. The diagonal dominance $D_{\text{diag}}(M)$ is computed as the absolute difference between the mean values of diagonal and off-diagonal elements:
$$D_{\text{diag}}(M) = \left| \sum_{1 \le i \le N} \frac{M(i,i)}{N} - \sum_{\substack{1 \le j, k \le N \\ j \ne k}} \frac{M(j,k)}{N(N-1)} \right| \quad (2)$$
Next, $G_{\text{VD}}$ [39] is defined as the diagonal dominance ratio of the two matrices:
$$G_{\text{VD}} = 10 \log_{10} \frac{D_{\text{diag}}(M_{aa})}{D_{\text{diag}}(M_{oo})} \quad (3)$$
Here, a gain of $G_{\text{VD}} = 0$ dB indicates that voice distinctiveness is preserved on average after anonymization, while a gain above or below 0 dB corresponds respectively to an average increase or decrease in voice distinctiveness.
An ideal anonymization system should achieve high EERs (close to 50%) in the Ignorant, Lazy-informed, and Semi-informed scenarios to protect the speaker's information. In addition, the WER should be as low as for the original speech, and $G_{\text{VD}}$ should be close to 0 dB to preserve voice distinctiveness.

B. Existing Speaker Anonymization Approaches
1) DSP-Based Methods: Methods in this category modify speaker attributes with distortion of the spectral envelope by using McAdams coefficients [42] to randomly shift the positions of formant frequencies. Widening of formant peaks [15] further distorts the spectral envelope. Data-driven formant modification can also be applied by using the formant statistics of desired speakers [16] or time-scale algorithms [18]. Phonetically controllable anonymization [17] modifies a speaker's vocal tract and voice source features, with a focus on F0 trajectories. Although these methods perceptually manipulate the speech signal, previous works have indicated that powerful attackers can effortlessly recover speaker identities [11], [12], [43].
2) Disentangled Representation Methods: A typical approach based on disentangled representation learning, called x-vector based anonymization, is used as the primary baseline in the VPC [10], [11], [12], [13]. It extracts speaker representations and linguistic features by using a pretrained TDNN-based ASV system [28] and ASR AM based on a factorized time-delay neural network (TDNN-F), respectively. Then, to hide the original speaker's information, a selection-based speaker anonymizer [30] replaces the original x-vector with the mean vector of a set of randomly selected speaker vectors from an external pool of English speakers. Specifically, given a centroid of source speaker vectors from one speaker, the cosine distance is used to find the 200 farthest centroids in an external speaker vector pool, and 100 of those are randomly selected and averaged to obtain an anonymized speaker vector [30]. Finally, an SS AM generates mel-filterbank features from the anonymized pseudo x-vector, F0, and linguistic features, and an NSF-based waveform generator synthesizes anonymized speech.
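The selection step described above can be sketched in a few lines. This is an illustrative sketch only, not the VPC baseline code: the pool is random toy data, and the function name, the 192-dimensional vectors, and the seed handling are assumptions made for the example. Only the "200 farthest by cosine distance, 100 picked at random, then averaged" logic comes from the text.

```python
import numpy as np


def select_pseudo_speaker(x, pool, n_far=200, n_pick=100, seed=0):
    """Average n_pick random vectors drawn from the n_far pool vectors
    that are farthest from x in cosine distance."""
    rng = np.random.default_rng(seed)
    x_unit = x / np.linalg.norm(x)
    pool_unit = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    cos_dist = 1.0 - pool_unit @ x_unit            # one distance per pool entry
    farthest = np.argsort(cos_dist)[-n_far:]       # indices of the n_far largest distances
    chosen = rng.choice(farthest, size=n_pick, replace=False)
    return pool[chosen].mean(axis=0)


# Toy data standing in for an external speaker-vector pool (hypothetical).
data_rng = np.random.default_rng(1)
pool = data_rng.standard_normal((1000, 192))
x = data_rng.standard_normal(192)

pseudo = select_pseudo_speaker(x, pool)
cos_sim = float(pseudo @ x / (np.linalg.norm(pseudo) * np.linalg.norm(x)))
print(pseudo.shape, cos_sim < 0.0)
```

Because every pseudo-speaker is an average over many pool vectors, different inputs tend to be mapped to similar outputs, which is consistent with the loss of voice distinctiveness discussed for selection-based anonymizers.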
Because this disentanglement-based method is more effective at protecting speaker identities than the DSP-based methods discussed in Section II-B1 [43], [12], most speaker anonymization studies have followed a similar framework. Improvements mainly come from two sources: Improved speech disentanglement: Some works [44], [45], [46] have argued that the disentangled linguistic information extracted from the language-specific ASR AM and F0 still contain speaker information. Accordingly, they modify the F0 and linguistic information to remove the residual speaker identity.
Improved speaker vector anonymization: Other researchers have modified the original x-vector in ways that increase the privacy protection ability. Perero-Codosero et al. [47] transformed an original x-vector to an anonymized one by using an autoencoder with an adversarial training strategy to suppress speaker, gender, and accent information. This requires labels for the speaker identity, gender, and nationality. Turner et al. [48] sampled anonymized x-vectors from a Gaussian mixture model in a space reduced by principal component analysis (PCA) over an external pool of speakers, which preserves the distributional properties of the original x-vectors. There have been recent attempts to generate a target pseudo-speaker for speaker anonymization in the systems submitted to the VoicePrivacy Challenge 2022. For example, Meyer et al. [24] utilized a generative adversarial network to generate artificial speaker embeddings, where the anonymization stage requires a manual search to find vectors that are dissimilar to the anonymized one. Yao et al. [25] proposed using a look-up table (LUT)-based method to generate pseudo-speaker embeddings, along with an average of randomly selected speaker embeddings from the real speakers. However, it suffers from limited variability in the anonymized voices. Chen et al. [35] proposed a method for distorting an input speech signal by adding adversarial noise designed to hide the original speaker identity.
Most of the existing approaches are limited in two aspects. First, they use an ASR-based content extractor that requires large amounts of transcribed English training data. Such an ASR-based content extractor is ineffective for speaker anonymization in unseen languages. Our previous work alleviates this issue by using an SSL-based content extractor [33]. As shown in Fig. 1, this SSL-based SAS consists of a HuBERT-based soft content encoder [49], an ECAPA-TDNN speaker encoder [29], an F0 extractor, and a HiFi-GAN decoder [32]. It does not require text transcriptions or any other language-specific resources, and it has demonstrated the ability to anonymize speech data with reasonable performance even if the data is in a language not included in the training data. Second, even this SSL-based SAS retains the limitation of selection-based anonymizers reported in previous results [11], [12], [13], [33], [34]: the distribution of the external speaker pool significantly affects anonymized speakers, and the averaging of vectors from the speaker pool reduces voice distinctiveness.

III. PROPOSED OHNN-BASED ANONYMIZER
To mitigate the problems with existing approaches, we propose the OHNN-based anonymizer shown in Fig. 3. This section formulates speaker anonymization as a constrained optimization problem, describes a general form of the proposed anonymizer, and explains the implementation details.

A. Problem Formulation
Let $x^o_i \in \mathbb{R}^d$ be an original speaker vector with the corresponding speaker label $y^o_i$. The speaker vector follows a distribution $x^o_i \sim p_{x^o}$. An anonymizer $f_\Theta(\cdot)$ with parameters $\Theta$ transforms it into an anonymized speaker vector $x^a_i = f_\Theta(x^o_i)$. Accordingly, the anonymized speaker vectors follow another distribution $x^a_i \sim p_{x^a}$ or $x^a_i \sim p_{f_\Theta(x^o)}$. An ideal speaker anonymization method should meet at least three constraints:
• Speaker privacy protection: $x^o_i$ and $x^a_i$ are dissimilar to hide the original speaker identity. More specifically, in the context of VPC, $x^o_i$ and $x^a_i$ are dissimilar to the extent that the anonymized speech generated using $x^a_i$ is recognized as being a different speaker by the attackers' ASV.
• Speaker diversity: $x^a_i$ has a unique speaker identity $y^a_i$ to maintain the diversity of anonymized speech across different speakers.
• Distribution similarity: $x^a_i \sim p_{x^a}$ follows the same distribution as $x^o_i$ to maintain the naturalness of the original speech.
The above constraints can be formulated as an optimization problem:
$$\min_{\Theta, \Psi} \; \lambda L_s(x^o, x^a) + L_c\big(g_\Psi(x^o), g_\Psi(x^a), y^o, y^a\big) \quad (5)$$
$$\text{s.t.} \quad D\big(p_{x^o}, p_{f_\Theta(x^o)}\big) \le \epsilon \quad (6)$$
where $\lambda$ is a hyperparameter to balance the multi-objective function. $L_s$ is a similarity metric to optimize $\Theta$ by minimizing the similarity of the original-anonymized pair, which ideally makes the original and anonymized speech be recognized as different speakers by the attackers' ASV. Next, $g_\Psi(\cdot)$ denotes the classifier layer, and $L_c$ is its classification loss function to optimize $\Theta$ and $\Psi$ by minimizing the discrepancy between the sets of desired outputs, $y^o$, $y^a$, and predicted outputs, $g_\Psi(x^o)$, $g_\Psi(x^a)$. The outputs may be defined for a multi-speaker classification task in which the original and corresponding anonymized speaker vectors are intentionally treated as different target speaker classes. This means that, to maintain speaker diversity, all speaker vectors after anonymization are treated as classes that differ from one another and from the classes of the original speakers.
Finally, $D\big(p_{x^o}, p_{f_\Theta(x^o)}\big)$ is the divergence between the distributions of $x$ included in a training database before and after anonymization. This term ensures similarity between the distributions of the anonymized and original speaker vectors, with some tolerance $\epsilon$. The Kullback-Leibler divergence (KLD) or other types of divergence are applicable.

B. General Form of Proposed Anonymizer
Finding a direct solution of Eqs. (5) and (6) for an arbitrarily designed DNN-based $f_\Theta$ is difficult. Here, we propose an anonymizer that, with a few assumptions, always satisfies the constraint in Eq. (6) regardless of the value of $\Theta$. In such a case, $\Theta$ and $\Psi$ can be optimized via Eq. (5) and a conventional gradient descent method.
Let $\mu_{x^o} \in \mathbb{R}^d$ and $\Sigma_{x^o} \in \mathbb{R}^{d \times d}$ be the mean and covariance matrix of $p_{x^o}$, respectively. Our proposed anonymizer $f_\Theta(\cdot)$ can be written as follows:
$$x^a = f_\Theta(x^o) = L_{x^o} W L_{x^o}^{-1} (x^o - \mu_{x^o}) + \mu_{x^o} \quad (7)$$
where $L_{x^o}$ is a whitening matrix² that satisfies $L_{x^o}^{-1} \Sigma_{x^o} L_{x^o}^{-\top} = I$, and $W \in \mathbb{R}^{d \times d}$ is an orthogonal matrix parameterized by $\Theta$.
Before introducing the parameterization and optimization of $W$, we show that the proposed anonymizer satisfies the constraint in Eq. (6).³ We first decompose Eq. (7) into three steps:
• Centering and whitening: $\tilde{x}^o = L_{x^o}^{-1} (x^o - \mu_{x^o})$
• Rotation: $\tilde{x}^a = W \tilde{x}^o$
• De-whitening and de-centering: $x^a = L_{x^o} \tilde{x}^a + \mu_{x^o}$
Assuming that $x^o$ is normally distributed, the centered and whitened speaker vector $\tilde{x}^o$ follows a normal distribution $\tilde{x}^o \sim \mathcal{N}(0, I)$. As $W$ is an orthogonal matrix, $\tilde{x}^a$ also follows a normal distribution $\mathcal{N}(W0, WW^\top) = \mathcal{N}(0, I)$. Through the affine transformation in the last step, we know that
$$x^a \sim \mathcal{N}(\mu_{x^o}, L_{x^o} L_{x^o}^\top) = \mathcal{N}(\mu_{x^o}, \Sigma_{x^o})$$
Hence, the defined anonymizer does not change the distribution, i.e., $D\big(p_{x^o}, p_{f_\Theta(x^o)}\big) = 0$. The above explanation also reveals the core idea of our proposed anonymizer: while it does not change the overall distribution, each speaker vector is rotated through an orthogonal transformation. The anonymized $x^a$ generally differs from the original $x^o$ as long as $W \ne I$. While an infinite number of orthogonal matrices can be applied for rotation, the optimal $W$ with respect to the criterion in Eq. (5) must be estimated through an optimization process.
In real applications, $\mu_{x^o}$ and $\Sigma_{x^o}$ of the test set data are unknown. They can be estimated by collecting multiple samples from the test domain if possible. Otherwise, we can either use the statistics from the training set or make some simplifications. Through preliminary experiments, we found an effective, simplified form:
$$x^a = W (x^o - \mu^{\text{train}}_{x^o}) + \mu^{\text{train}}_{x^o} \quad (8)$$
where $\mu^{\text{train}}_{x^o}$ is the mean of the speaker vectors in the training set, and $\Sigma_{x^o}$ is assumed to be an identity matrix.
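The distribution-preserving property of the whiten, rotate, and de-whiten steps can be checked numerically. The sketch below makes several assumptions of its own: a Cholesky factor is used as one possible whitening matrix, the orthogonal matrix is drawn at random via a QR decomposition rather than learned, and the mean and covariance are toy values in four dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Toy mean and positive-definite covariance standing in for the statistics of p_{x^o}.
a = rng.standard_normal((d, d))
sigma = a @ a.T + d * np.eye(d)
mu = rng.standard_normal(d)

# One valid whitening factor: Sigma = L L^T, hence L^{-1} Sigma L^{-T} = I.
l_mat = np.linalg.cholesky(sigma)

# A random orthogonal matrix (assumption: any orthogonal W works for this check).
w, _ = np.linalg.qr(rng.standard_normal((d, d)))


def anonymize(x_o: np.ndarray) -> np.ndarray:
    """Apply centering/whitening, rotation, and de-whitening to rows of x_o."""
    x_white = np.linalg.solve(l_mat, (x_o - mu).T)   # L^{-1} (x - mu), shape (d, n)
    x_rot = w @ x_white                              # rotation by W
    return (l_mat @ x_rot).T + mu                    # L x + mu, back to shape (n, d)


# Analytic check: the overall linear map A = L W L^{-1} leaves Sigma unchanged.
a_mat = l_mat @ w @ np.linalg.inv(l_mat)
cov_after = a_mat @ sigma @ a_mat.T
print(np.allclose(cov_after, sigma))

# Empirical check on samples: mean and covariance stay close to (mu, sigma).
x_o = rng.multivariate_normal(mu, sigma, size=50_000)
x_a = anonymize(x_o)
print(np.abs(x_a.mean(axis=0) - mu).max(), np.abs(np.cov(x_a.T) - sigma).max())
```

Setting `l_mat` to the identity matrix recovers the simplified form in which only the training-set mean is subtracted and added back.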

C. Rotation Matrix Using Householder Reflection
We now need a specific way to parameterize $W$ so that the $W$ learned through gradient descent is guaranteed to be orthogonal. While many methods can be used, we found that one based on a Householder reflection [53] is efficient for DNNs. Without loss of generality, assume that $W$ is a product of multiple orthogonal matrices:
$$W = \prod_{l=1}^{L} W_l \quad (9)$$
where each matrix $W_l \in \mathbb{R}^{d \times d}$ is given by
$$W_l = \prod_{q=1}^{q_l} H_q^l \quad (10)$$
Here, each sub-matrix $H_q^l$ is constructed with a Householder reflection [53] given a non-zero vector $v_q^l \in \mathbb{R}^d$ as follows:
$$H_q^l = I - 2 \frac{v_q^l (v_q^l)^\top}{\|v_q^l\|^2} \quad (11)$$
The resulting $H$ is known to be an orthogonal matrix for any non-zero vector $v_q^l$, i.e., $H^\top H = H H^\top = I$ and $H \ne I$, $\forall v_q^l \ne 0$. Accordingly, $W_l$ and $W$ are orthogonal, and they differ from the identity matrix unless the reflections cancel one another (e.g., two consecutive reflections built from parallel vectors).
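The construction of an orthogonal matrix from Householder reflections is easy to verify numerically. The following is a minimal sketch with random reflection vectors and small toy dimensions; the helper names are hypothetical, and in the proposed anonymizer the vectors would be trainable parameters or network outputs rather than random draws.

```python
import numpy as np


def householder(v: np.ndarray) -> np.ndarray:
    """Householder reflection I - 2 v v^T / ||v||^2 for a non-zero vector v."""
    return np.eye(v.size) - 2.0 * np.outer(v, v) / np.dot(v, v)


def product_of_reflections(vs: np.ndarray) -> np.ndarray:
    """Multiply the reflections built from each row of vs."""
    w = np.eye(vs.shape[1])
    for v in vs:
        w = w @ householder(v)
    return w


rng = np.random.default_rng(0)
d = 8
vs = rng.standard_normal((6, d))      # six reflection vectors (toy setting)
w = product_of_reflections(vs)

x = rng.standard_normal(d)
print(np.allclose(w @ w.T, np.eye(d)))                        # orthogonality
print(np.isclose(np.linalg.norm(w @ x), np.linalg.norm(x)))   # norms are preserved
print(np.allclose(householder(vs[0]) @ householder(vs[0]), np.eye(d)))  # a repeated reflection cancels
```

Each reflection has determinant -1, so a product of an even number of reflections is a proper rotation with determinant +1. Because the product is orthogonal for any non-zero vectors, gradient updates to the vectors never break orthogonality.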

2) Learnable orthogonal Householder (LOH) reflection:
Each $v$ is produced by a small NN given the input $x^o$. In such a case, $\Theta$ is the set of the trainable weights in a set of small NNs. Fig. 4(b) illustrates an implementation in which each DNN has a single 1D convolution layer with 192 output channels and a kernel size of 3. While both implementations ensure that the transformation matrix $W$ is orthogonal, the first approach assumes a global transformation for all the input speaker vectors. In contrast, the latter approach assumes that the transformation matrix varies according to the input.

D. Loss Functions
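Before delving into the details of the loss functions, we describe how to build batch data for an OHNN-based anonymizer. Let $N$ be the batch size and $C$ be the number of original speakers. Each mini-batch comprises $N/2$ original samples and the corresponding $N/2$ anonymized samples, where the anonymized speaker vectors are assigned class labels that differ from those of all original speakers. Therefore, the number of speaker classes is $2C$ during the training of an OHNN-based anonymizer.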
We now explain the loss functions for learning the best values of $\Theta$ and $\Psi$ as defined in Eq. (5). For the classification loss $L_c$, we first consider the widely used AAM softmax loss [54], [55]:
$$L_{\text{AAM}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\|w_{y_i}\| \cdot \|x_i\| \cdot \cos(\theta_{y_i,i} + m_1)}}{Z} \quad (12)$$
where $Z = e^{\|w_{y_i}\| \cdot \|x_i\| \cdot \cos(\theta_{y_i,i} + m_1)} + \sum_{j=1, j \ne y_i}^{2C} e^{\|w_j\| \cdot \|x_i\| \cdot \cos(\theta_{j,i})}$; $m_1$ is the additive angular margin; $w_j$ is the $j$-th column of the weight $w \in \mathbb{R}^{d \times 2C}$ in the fully-connected layer before the softmax layer; and $\theta_{y_i,i}$ is the angle between $x_i$ and the target class's weight vector $w_{y_i}$. After fixing the weight norms $\|w_j\| = 1$ by $\ell_2$-normalization and rescaling $\|x_i\|$ to $s$ to ensure that the gradient is not too small during training, we can write Eq. (12) as
$$L_{\text{AAM}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i,i} + m_1)}}{Z} \quad (13)$$
where $Z = e^{s \cos(\theta_{y_i,i} + m_1)} + \sum_{j=1, j \ne y_i}^{2C} e^{s \cos(\theta_{j,i})}$. Since the target label $y_i$ varies across the original and anonymized speakers, the classification loss $L_{\text{AAM}}$ encourages the OHNN-based anonymizer to produce anonymized vectors that are varied for different speakers and distinct from original speaker vectors.
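The normalized AAM softmax loss can be written compactly with array operations. The sketch below is a plain NumPy illustration rather than the training code: the batch, the classifier weights, and the labels are random toy data, and the function name is hypothetical. Only the margin on the target-class angle and the scale $s$ follow the description above.

```python
import numpy as np


def aam_softmax_loss(x, w, y, m1=0.2, s=30.0):
    """AAM softmax: add margin m1 to the target-class angle, scale logits by s."""
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)   # rows: speaker vectors
    w_unit = w / np.linalg.norm(w, axis=0, keepdims=True)   # columns: class weights
    cos = np.clip(x_unit @ w_unit, -1.0, 1.0)               # shape (N, 2C)
    theta = np.arccos(cos)
    idx = np.arange(x.shape[0])
    logits = s * cos
    logits[idx, y] = s * np.cos(theta[idx, y] + m1)         # margin on the target class only
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[idx, y].mean())


rng = np.random.default_rng(0)
n, d, c = 8, 16, 3                       # toy batch size, dimension, original speakers
x = rng.standard_normal((n, d))
w = rng.standard_normal((d, 2 * c))      # 2C classes: original plus anonymized
y = rng.integers(0, 2 * c, size=n)

loss_margin = aam_softmax_loss(x, w, y)            # m1 = 0.2
loss_plain = aam_softmax_loss(x, w, y, m1=0.0)     # no margin
print(loss_margin > loss_plain)
```

The margin lowers the target-class logit, so for the same inputs the loss with a margin is larger than the loss without one. This forces the network to separate classes by a wider angle before the loss becomes small.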
To further increase the discrepancy between the samples of original-anonymized (or anonymized-original) pairs, we add an extra margin penalty $m_2$ into the AAM softmax loss. The approach is called weighted additive angular margin (w-AAM) softmax. Let $i \in$ [...] In our experiments, we set $m_1 = m_2 = 0.2$ and $s = 30$, and we compared the performance with settings of $L_c = L_{\text{AAM}}$ and $L_c = L_{\text{w-AAM}}$.
For the similarity metric $L_s$, we choose the cosine similarity⁴ between the original and anonymized speaker vectors, with the margin set to $m = 0$. The cosine similarity is a reasonable choice because it is close to what most ASV systems use for scoring the similarity between speaker vectors. As the anonymizers are trained to minimize the cosine similarity between original and anonymized speaker vectors, the anonymized speech is expected to be judged as a different speaker by the attacker ASV, hence protecting the speaker's identity.
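One common way to turn a cosine similarity with a margin into a loss for pairs that should be dissimilar is a hinge on the similarity. The sketch below assumes that form, namely the mean of $\max(0, \cos(x^o_i, x^a_i) - m)$; the exact expression used for $L_s$ is not recoverable from the text here, so this is an illustration of the idea rather than the paper's definition. The function name and toy vectors are hypothetical.

```python
import numpy as np


def cosine_similarity_loss(x_o, x_a, margin=0.0):
    """Hinge on the cosine similarity of each original/anonymized pair (assumed form)."""
    dot = np.sum(x_o * x_a, axis=1)
    norms = np.linalg.norm(x_o, axis=1) * np.linalg.norm(x_a, axis=1)
    cos = dot / norms
    return float(np.maximum(0.0, cos - margin).mean())


x_o = np.array([[1.0, 0.0], [0.0, 2.0]])

print(cosine_similarity_loss(x_o, x_o))                # identical pairs: 1.0
print(cosine_similarity_loss(x_o, -x_o))               # opposite pairs: 0.0
print(cosine_similarity_loss(x_o, x_o[:, ::-1]))       # orthogonal pairs: 0.0
```

With a margin of 0, the loss of this form stops penalizing a pair once its cosine similarity is no longer positive.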

IV. EVALUATION
To evaluate the effectiveness of the SSL-based SAS using the proposed OHNN-based anonymizer under all the attack scenarios for English speaker anonymization, we followed the VPC evaluation plan [11], [12], [13]. We also conducted experiments under a language-mismatched condition, using Mandarin data as a language not included in the training database. The purpose of these experiments was to determine whether the proposed OHNN-based anonymizer, which eliminates the need for an English speaker pool, can effectively reduce the language mismatch present in anonymized speaker representations. As a result, better speech content preservation is achieved for Mandarin speaker anonymization.
Unlike the selection-based anonymizer, which relies on an additional multi-speaker English dataset (LibriTTS-train-other-500) containing data from 1,160 speakers as the external pool, the OHNN-based anonymizers reuse a multi-speaker multi-language dataset (VoxCeleb-2), which is used to train the ECAPA-TDNN of the SSL-based SAS [33]. This large-scale dataset contains over 1 million utterances by 5,994 speakers of 145 different nationalities.
English speaker anonymization was evaluated on the official VPC development and test sets [11], [12], [13]. These two sets contain English utterances by several female and male speakers from the LibriSpeech and VCTK [58] corpora. For the Ignorant and Lazy-informed conditions, we used the language-matched $\text{ASV}_{\text{eval}}$ system provided by the VPC [11], [12], [13]. It was trained on the original LibriSpeech-train-clean-360 English dataset. For the Semi-informed condition, we trained the $\text{ASV}^{\text{anon}}_{\text{eval}}$ system in the same way as $\text{ASV}_{\text{eval}}$, but with anonymized speech data. Likewise, $\text{ASR}_{\text{eval}}$ and $\text{ASR}^{\text{anon}}_{\text{eval}}$ were trained with the same original and anonymized speech data, respectively.
The same anonymization systems used for English speakers were directly adopted for Mandarin speaker anonymization without training or fine-tuning on Mandarin data. The evaluation for Mandarin was conducted on a test set sampled from [...] [61]. The ASV evaluation model under the Semi-informed condition, called $\text{ASV}^{\text{anon,mand}}_{\text{eval}}$, was fine-tuned from $\text{ASV}^{\text{mand}}_{\text{eval}}$ using anonymized utterances from 285 speakers in the interview, speech, and live broadcasting genres of CN-Celeb-1 & 2. The ASR evaluation model $\text{ASR}^{\text{mand}}_{\text{eval}}$ was a publicly available ASR Transformer [62] trained on a 150-hour Mandarin ASR dataset, AISHELL-1 [63].
2) Experimental Setup: Table I lists notations for the different speaker anonymization approaches that we examined. B1.a, B1.b, and B2 are the baseline systems from VPC 2022 [13]. S-Select denotes the SSL-based SAS using a selection-based anonymizer. S-ROH denotes a system obtained by replacing the selection-based anonymizer of S-Select with a random OH (ROH) anonymizer and keeping other components unchanged. Likewise, S-LOH indicates the use of a learnable OH (LOH) anonymizer. Note that, hereafter, S-ROH* and S-LOH* refer to models trained with the w-AAM and cosine similarity losses.
For S-Select, the YAAPT algorithm [64] is used to extract the F0. The ECAPA-TDNN with 512 channels in the convolution frame layers [29] provides 192-dimensional speaker identity representations. The HuBERT-based soft content encoder [49] takes the CNN encoder and the first and sixth transformer layers of the pretrained HuBERT base model as a backbone. It downsamples a raw audio signal into a 768-dimensional continuous representation, which is then mapped to a 200-dimensional vector by one projection layer to predict discrete speech units. These speech units are obtained by discretizing the intermediate 768-dimensional representations via k-means clustering⁶ [65], [66]. The training procedures are detailed in [33]. For the selection-based anonymizer, attackers had different random seeds from users when randomly choosing 100 speaker vectors from the 200 farthest ones; thus, the attackers had different pseudo-speaker vectors.
The OHNN-based anonymizer accepts 192-dimensional speaker representations extracted from a pretrained ECAPA-TDNN, which was the same here as the ECAPA-TDNN of the SSL-based SAS. We followed the VPC evaluation plan, in which attackers in the Lazy-informed and Semi-informed scenarios have partial knowledge of the speaker anonymizer. They are assumed to know the training dataset, structure, loss functions, and other training parameters of the user's OHNN-based anonymizer, except for the training seed to initialize the training weights. Specifically, the training seeds were 50 and 1986 for users and attackers, respectively.⁷ Using knowledge of the OHNN-based anonymizer, an attacker trains a new anonymizer to anonymize speech. All the OHNN-based anonymizers were trained with a cyclical learning rate [67], which varied between 1e-8 and 1e-3, and the Adam optimizer [68] by using the SpeechBrain [62] toolkit based on PyTorch [69]. The number of iterations of one cycle was set to 130k. We fixed $d = 192$ and $L = 12$ for both the ROH and LOH anonymizers, but we used $q_l = 192$ and $q_l = 50$ for the ROH and LOH training, respectively. The hyperparameter $\lambda$ in Eq. (5) was set to 20.⁸
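To make the learning-rate setting concrete, the sketch below implements a triangular cyclical schedule between the stated bounds. Two points are assumptions rather than facts from the text: that the schedule is triangular (linear up, then linear down), and that the 130k iterations cover one full up-and-down cycle. The function name is hypothetical and this is not the SpeechBrain scheduler.

```python
def triangular_clr(step: int,
                   base_lr: float = 1e-8,
                   max_lr: float = 1e-3,
                   cycle_len: int = 130_000) -> float:
    """Learning rate at a given iteration for a triangular cyclical schedule."""
    half = cycle_len / 2
    pos = step % cycle_len                     # position inside the current cycle
    # Fraction of the way from base_lr to max_lr: rises for the first half
    # of the cycle, falls for the second half.
    frac = pos / half if pos <= half else (cycle_len - pos) / half
    return base_lr + (max_lr - base_lr) * frac


for step in (0, 32_500, 65_000, 97_500, 130_000):
    print(step, f"{triangular_clr(step):.3e}")
```

The rate starts at the lower bound, peaks at the upper bound halfway through the cycle, and returns to the lower bound at the end of the cycle.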

B. Speaker Anonymization Experiments in English
For the English experiments, we first explored the difference between selection- and OHNN-based anonymizers by comparing the performance of S-Select and S-LOH*. Then, we investigated different configurations for the OHNN-based anonymizer, including the losses and whether to explicitly use speaker information to optimize the Householder transformation. Finally, we compared SSL-based speaker anonymization using an OHNN-based anonymizer with other approaches, including the disentanglement- and DSP-based approaches.

1) Comparison of Selection- and OHNN-Based Anonymizers:
In the first experiments, we visualized the original and anonymized speech generated by S-Select and S-LOH* in terms of speaker embeddings, the cosine similarity of the speech pairs, and voice distinctiveness.
Original and anonymized speaker embeddings: To show the difference between the S-Select and S-LOH* anonymizers, we first applied t-distributed stochastic neighbor embedding (t-SNE) [70] to visualize the original and anonymized embeddings. The results are shown in Fig. 5. The speaker embeddings were extracted from 50 speakers in the VoxCeleb-2 training set, which are shown in different colors, and 10 utterances were randomly selected from each speaker. Clearly, the anonymized speaker vectors generated by S-Select were heavily dependent on the distribution of an external pool, whereas S-LOH* generated distinctive anonymized speaker vectors that followed the distribution of the original speaker vector space.
Cosine similarity distribution on speech pairs: Fig. 6 plots the cosine similarities between pairs of speaker vectors extracted from generated speech for all the test sets of LibriSpeech and VCTK, using the speech pairs provided by [12]. Depending on the attack condition, the speech can be original or anonymized by S-Select or S-LOH*. For the Unprotected condition, shown on the left side of Fig. 6, the positive cosine similarity distributions (green) are close to 1, and the negative distributions (yellow) are close to 0, which indicates that the speaker vectors of the original speech were highly discriminative. To protect speaker privacy, an ideal SAS should push the positive score distributions toward the negative ones regardless of the attacker type.
On the right side of Fig. 6, the top part shows the score distributions for the three attacker conditions with S-Select. The overlap between the positive and negative distributions is much larger for the Ignorant condition than for the Unprotected condition, which means that S-Select achieved reasonable speaker privacy performance under the Ignorant condition. Unfortunately, the overlaps are smaller for the Lazy-informed and Semi-informed conditions, which explains the significant speaker privacy leakage under more powerful attack conditions. Moreover, most of the cosine similarity scores are very close to 1, which indicates a risk of reduced diversity among the anonymized speakers.
The bottom right of Fig. 6 shows the score distributions for the three attacker conditions with S-LOH*. The overlaps of the positive and negative distributions are considerably larger under all the attack scenarios. This verifies the effectiveness of our OHNN-based anonymizer in ensuring that the attackers cannot gain significant speaker privacy information from users. Furthermore, most of the cosine similarity scores are far from 1, indicating the diversity of the anonymized speakers.
Comparison of gain of voice distinctiveness (G_VD): Fig. 7 shows voice similarity matrices obtained for S-Select and S-LOH*. The upper-left submatrix of each matrix M is M_oo, and its distinct diagonal reflects the high voice distinctiveness within the original speech. The upper-right (or lower-left) submatrix M_oa reflects the voice similarity between the original and the anonymized speech, such that the diagonal disappears when they differ. The lower-right submatrix M_aa reflects the voice similarity within the anonymized speech, where a dominant diagonal appears if the anonymized speakers remain distinguishable [39]. There is only a very weak diagonal in M_aa for S-Select, indicating that voice distinctiveness was lost among the anonymized speakers. In contrast, the matrices for S-LOH* exhibit distinct diagonals in M_aa, indicating that voice distinctiveness was preserved after anonymization.
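The link between a dominant diagonal and G_VD can be sketched numerically. The example below assumes the usual definition from the VPC evaluation plan: the diagonal dominance of a similarity matrix is the absolute difference between its mean diagonal and mean off-diagonal entries, and G_VD is 10 log10 of the ratio between the dominance of M_aa and that of M_oo. The function names and the toy 2 × 2 matrices are hypothetical and are not taken from the experiments.

```python
import math


def diag_dominance(m):
    """Absolute gap between the mean diagonal and the mean off-diagonal entries."""
    n = len(m)
    diag_mean = sum(m[i][i] for i in range(n)) / n
    off_mean = sum(
        m[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return abs(diag_mean - off_mean)


def gain_voice_distinctiveness(m_oo, m_aa):
    """G_VD in dB (assumed definition): 0 means distinctiveness is preserved."""
    return 10.0 * math.log10(diag_dominance(m_aa) / diag_dominance(m_oo))


# Toy similarity matrices (hypothetical values, two speakers).
m_oo = [[1.0, 0.1], [0.1, 1.0]]         # original speech: clear diagonal
m_aa_weak = [[1.0, 0.91], [0.91, 1.0]]  # anonymized speakers sound alike

if __name__ == "__main__":
    print(gain_voice_distinctiveness(m_oo, m_oo))       # preserved diagonal
    print(gain_voice_distinctiveness(m_oo, m_aa_weak))  # weakened diagonal
```

With this definition, a weak diagonal in M_aa yields a strongly negative G_VD, whereas a diagonal as distinct as that of M_oo yields a value near zero.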
Overall, the above analysis and visualization show that the S-LOH* anonymizer met the three constraints described in Section III-A: good privacy protection, voice distinctiveness, and naturalness of the speaker vector space.
2) Effects of Various Components for Proposed OHNN-Based Anonymizer: The proposed OHNN-based anonymizer has two novel components: the loss functions and the Householder transformations. Table II summarizes the average EERs and WERs 9 under all attack scenarios using two OHNN-based anonymizers with different losses. In the table, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Effect of the different losses: For the proposed OHNN-based anonymizer, w-AAM+cos performed better than AAM+cos in terms of the EER under most attacker conditions. This was because the introduced margin of w-AAM expands the interclass variance of original-anonymized pairs, thus increasing the dissimilarity.
Effect of different Householder transformations: Clearly, the LOH anonymizers generally achieved better EERs than the ROH anonymizers did. This result supports the view that, whereas ROH uses a global transformation, LOH is more flexible because it learns from the speaker embeddings and thus brings more discriminative information. For the WERs, those computed by ASR_eval^anon were consistently lower than those of ASR_eval for all systems. This implies that the utility degradation due to OHNN-based anonymizers can easily be offset by training ASR evaluation models on similar anonymized data. Meanwhile, all the OHNN-based anonymizers achieved similar WERs with ASR_eval or ASR_eval^anon, which confirms that the orthogonality of ROH and LOH did not change the distributions of the original and anonymized speaker vectors.
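The role of orthogonality can be illustrated with a small sketch of a product of Householder reflections. It shows that such a transformation preserves vector norms and pairwise cosine similarities, which is why the anonymized vectors keep the geometry of the original space. Everything here is an assumption for illustration: the reflection vectors are random rather than learned, the helper names are hypothetical, and the use of 12 reflections in 192 dimensions does not reproduce the L and q_l settings of the ROH or LOH anonymizers.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def householder_apply(v, x):
    """Reflect x across the hyperplane orthogonal to v: x - 2 v (v.x) / (v.v)."""
    scale = 2.0 * dot(v, x) / dot(v, v)
    return [xi - scale * vi for xi, vi in zip(x, v)]


def orthogonal_transform(reflectors, x):
    """Apply a product of Householder reflections, which is an orthogonal map."""
    for v in reflectors:
        x = householder_apply(v, x)
    return x


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


rng = random.Random(0)
d, num_reflections = 192, 12  # illustrative sizes only
reflectors = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_reflections)]

# Two stand-in "speaker vectors" (random here, not real embeddings).
x1 = [rng.gauss(0, 1) for _ in range(d)]
x2 = [rng.gauss(0, 1) for _ in range(d)]
y1 = orthogonal_transform(reflectors, x1)
y2 = orthogonal_transform(reflectors, x2)

if __name__ == "__main__":
    print("cos(x1, x2):", cosine(x1, x2))  # similarity between originals
    print("cos(y1, y2):", cosine(y1, y2))  # same value up to rounding
    print("cos(x1, y1):", cosine(x1, y1))  # original vs. transformed vector
```

In this sketch the same transformation is applied to both vectors, so their mutual similarity is unchanged while each vector moves away from its original position; how far it moves depends on the reflection vectors, which the proposed losses are designed to shape.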
3) Comparison of Various SASs Using Different Anonymizers:
Primary privacy and utility evaluation: Table III lists the average EER and WER results for various SASs under all scenarios. To anonymize the speaker representations, B2 randomly alters the formant position; B1.a, B1.b, and S-Select use the selection-based anonymizer; and S-ROH* and S-LOH* use the OHNN-based anonymizer.
First, we examine the results with the selection-based anonymizer. Using the selection-based anonymizer, the EERs of S-Select, B1.a and B1.b decreased by around 30% under the Lazy-informed condition and 7%-9% under the Semi-informed condition, indicating severe speaker privacy leakage.
Next, we examine the results with the proposed OHNN-based anonymizer integrated into different configurations. First, S-ROH* and S-LOH* protected speaker information almost as well as the VPC baselines (B1.a and B1.b) did when facing the Ignorant attacker. Moreover, for the Lazy-informed and Semi-informed attackers, they comfortably outperformed all the baseline systems, achieving EERs over 40%. Second, among all the methods, S-ROH* and S-LOH* preserved speech content the best with ASR_eval^anon, achieving even lower WERs than for original speech on average.
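For readers less familiar with the privacy metric, the sketch below shows one simple way to estimate an EER from verification scores: sweep a threshold over the observed scores and take the point where the false acceptance and false rejection rates are closest. This is a plain threshold sweep written for illustration, not the scoring tool used in the VPC evaluation, and the function name and toy scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where false acceptance and false rejection are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for thr in thresholds:
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Well-separated scores: the attacker verifies speakers without error.
    print(equal_error_rate([0.9, 0.8, 0.95, 0.85], [0.1, 0.2, 0.05, 0.15]))
    # Interleaved scores: the attacker's decisions are no better than chance.
    print(equal_error_rate([0.6, 0.4], [0.5, 0.3]))
```

An EER near 0% means that same-speaker and different-speaker trials are cleanly separated, whereas an EER approaching 50% means that the attacker's scores carry almost no speaker information, which is the direction an anonymizer aims for.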
Another interesting observation is that, while B2, B1.a, B1.b, and S-Select are effective for protecting user privacy under the Ignorant condition, their utility performance in terms of WER and G_VD is worse than that of the OHNN-based anonymizers. This suggests that the baseline methods sacrifice utility to achieve a high privacy protection performance. Our proposed methods achieve a good balance between privacy and utility metrics under various attack scenarios.
Secondary utility evaluation: The bottom of Table III lists the results for the average gain of voice distinctiveness, G_VD. They indicate that our proposed S-ROH* and S-LOH* achieved much better preservation of voice distinctiveness than the SASs using the selection-based anonymizer. The G_VD results of the S-Select and OHNN-based anonymizers again confirm the findings described in Section IV-B1.
MOS prediction: To further analyze the effectiveness of our proposed models, we used a recently proposed mean opinion score (MOS) prediction network [71] to estimate the perceived naturalness as another utility metric. Box plots of the predicted MOS scores are shown in Fig. 9. The results demonstrate that S-Select has a higher naturalness than B2 and B1.a. After replacing the selection-based anonymizer with the OHNN-based anonymizers (S-ROH* and S-LOH*), we see a further improvement in naturalness. Note that we used predicted MOS rather than human perception-based MOS obtained through listening tests in light of time and cost limits. The predicted MOS is reasonably well aligned with human perception [71]. In Fig. 9, we can see that the ranking of the predicted MOS of the original, B1.a, and B2 is consistent with that from the listening test done by the VPC [12].
Overall performance: As there are multiple metrics for evaluating the model performance, we summarize the results using a radar chart for each system in Fig. 8.
Each radar chart covers the EER values under the Ignorant, Lazy-informed, and Semi-informed conditions, WER_o by ASR_eval, WER_a by ASR_eval^anon, and G_VD. Note that the chart shows 100 − WER, so the higher the better. Accordingly, a larger shaded area in the radar plot indicates a better overall performance. It is evident that the proposed S-ROH* and S-LOH* achieve larger shaded areas than the other systems, which performed notably worse under the challenging Semi-informed condition.
Table IV lists the EERs and CERs for the Mandarin test dataset. The first observation is that baselines B1.a and B2 obtained EERs higher than 30% under the three conditions, but their CERs were higher than 60%. These results indicate that both systems achieved a high level of speaker identity protection by heavily distorting the speech contents. In particular, the results of B1.a suggest that it was inappropriate to use the ASR AM trained on the English data to extract speech content from the Mandarin data. The second observation is that the trends for the S-Select and OHNN-based anonymizers with different losses were remarkably similar to those observed on the English test sets. The proposed OHNN-based anonymizers obtained ASV EERs higher than 30% under all evaluation conditions, and the CERs were lower than those of other systems. Compared to the baselines, the proposed systems adequately protected the speaker information without heavily sacrificing the speech contents. Compared to the selection-based system, the proposed systems achieved a lower CER while obtaining much higher ASV EERs, particularly in the most challenging Lazy-informed and Semi-informed scenarios. In particular, the CER on the anonymized speech decreased to less than 18% with the OHNN-based anonymizers, suggesting improved utility. One possible reason for the decreased CER when using OHNN-based anonymizers is that the language mismatch was mitigated by the OHNN-based anonymizers trained using VoxCeleb 2, which contains large-scale, multi-speaker, and multi-language data.

V. CONCLUSIONS
This paper has proposed a novel OHNN-based speaker anonymization approach that rotates original speaker vectors into anonymized ones with a distribution following the original speaker vector space. Towards good privacy protection and voice distinctiveness, AAM/w-AAM and cosine similarity loss functions were introduced to encourage the generation of distinctive anonymized speaker vectors. Experiments on English VPC datasets demonstrated that the proposed model protects speaker privacy while maintaining speech content: it achieved competitive performance under all attack scenarios in terms of privacy and utility metrics. Comparison of the cosine similarities between pairs of speaker vectors extracted from the generated speech with a commonly used selection-based anonymizer and the OHNN-based anonymizer further verified that our proposed method can effectively reduce privacy leakage when facing different attackers, while improving the diversity of anonymized speakers. Experiments on the Mandarin AISHELL-3 dataset demonstrated that our OHNN-based anonymizer is more robust to the language mismatch scenario than the selection-based methods and can be adopted for this unseen-language anonymization task directly.
To further improve the privacy protection performance under various attack scenarios, our future work will investigate the training loss. One potential direction is to optimize the distance between the original and anonymized speaker vectors by integrating a proxy ASV evaluation model into the training process, i.e., using an ASV to measure L_s in Eq. (5) on original and anonymized speech waveforms. Such a training scheme is closer to how attackers infringe on the speaker's identity. Additionally, we are considering extending the OHNN-based anonymizer to protect other personal attributes such as age, gender, emotion, and dialect. We previously proposed a system for concealing the gender of a speaker [72], and we feel the framework can be extended to other attributes as well. Our goal is to achieve controllable voice privacy protection that enables users to customize and control the anonymization process according to their specific privacy needs.
Natalia Tomashenko (Member, IEEE) is a researcher at the University of Avignon, France. She received the Ph.D. degree in computer science from the University of Le Mans, France. Her research interests focus on statistical machine learning for speech and language processing with application to automatic speech and speaker recognition, spoken language understanding, machine translation, and speech privacy. She is an organizer of the VoicePrivacy challenge.