When Speaker Recognition Meets Noisy Labels: Optimizations for Front-ends and Back-ends

A typical speaker recognition system often involves two modules: a feature extractor front-end and a speaker identity back-end. Despite the superior performance that deep neural networks have achieved for the front-end, their success depends on the availability of large-scale and correctly labeled datasets. Label noise is unavoidable in speaker recognition datasets, and it affects both the front-end and the back-end, which degrades speaker recognition performance. In this paper, we first conduct comprehensive experiments to improve the understanding of the effects of label noise on both the front-end and back-end. Then, we propose a simple yet effective training paradigm and loss correction method to handle label noise for the front-end. We combine our proposed method with the recently proposed Bayesian estimation of PLDA for noisy labels, and the whole system shows strong robustness to label noise. Furthermore, we show two practical applications of the improved system: one application corrects noisy labels based on an utterance's chunk-level predictions, and the other algorithmically filters out high-confidence noisy samples within a dataset. By applying the second application to the NIST SRE04-10 dataset and verifying filtered utterances by human validation, we identify that approximately 1% of the SRE04-10 dataset is made up of label errors.


I. INTRODUCTION
Speaker recognition is a typical biometric authentication technology that verifies the identities of speakers from their voices. A typical speaker recognition system often involves two modules: a feature extractor front-end and a speaker identity back-end. The front-end extracts low-dimensional discriminative speaker representations (embeddings) from variable-length utterances, whereas the back-end determines whether two embeddings are from the same speaker [1].
Conventional speaker recognition front-ends are based on Baum-Welch statistics, and the i-vector [2] is one of the typical front-ends, which is trained in an unsupervised manner. Probabilistic Linear Discriminant Analysis (PLDA) [3]-[6] is commonly used as a back-end scoring model. To satisfy the PLDA Gaussian assumptions for training data [7], extracted features generally require pre-processing, including Linear Discriminant Analysis (LDA) and length regularization [8], before being used to train a PLDA model. Both LDA and PLDA are trained in a supervised manner that requires training data with corresponding speaker labels.
Along with the increasing amount of training data and the development of neural networks, state-of-the-art performances for speaker recognition have been achieved by deep neural networks [9]. Among these networks, the x-vector [10] is perhaps the most popular deep speaker embedding architecture. The x-vector directly replaces the i-vector to extract discriminative speaker representations using time delay neural network (TDNN) layers [11] and a statistical pooling layer. Based on the x-vector architecture, multiple deep speaker embedding network variants [12], [13] have been proposed to boost recognition performance. In addition, margin-based objective functions [14], [15] have been widely used to learn more discriminative speaker representations. Although these methods have achieved remarkable success, the supervised training for deep embedding models requires large-scale datasets that are correctly labeled.
Unfortunately, erroneously labeled samples are unavoidable when speaker utterances are collected by web crawling or crowdsourcing. This phenomenon is denoted as label noise; the incorrectly labeled utterances are denoted as noisy samples, and the corresponding labels are noisy labels. For instance, the NIST SRE18 [16] development set does not provide speaker labels but instead only provides a phone number corresponding to each utterance [17]. The VoxCeleb dataset [18] is collected from YouTube, and the speaker identities are confirmed through facial recognition based on convolutional neural networks (CNN). Data collections such as these often lead to label noise. Typically, these noisy labels can be categorized into two categories: closed-set, i.e., noisy samples whose true labels are contained within the training classes; and open-set, i.e., noisy samples whose true labels are outside the training set. Both types of label noise impair both the front-end and back-end model training, thereby degrading the speaker recognition performance. A recent study [19] shows that network learning with closed-set noise is more challenging, so in this paper, we focus more on this type of noise. Although mislabeled samples can be manually eliminated by human validation, this would be extremely time-consuming and costly. Making models robust to label noise is a more practical solution.
For a training dataset with an unknown number of noisy samples, the front-end goal is to learn discriminative feature spaces where different speaker embeddings are well separated; this is the so-called learning with label noise. However, such a goal is practically challenging, as the high capacity of deep networks makes them capable of memorizing noisy labels even if the labels are completely random [20]. Nonetheless, recent studies have shown that deep neural networks first learn clean samples and general patterns from a dataset, and only then are forced to memorize noisy labels [20], [21].
Recently, several approaches for learning with label noise have been proposed in the computer vision community [22]-[27]. A comprehensive overview of recently proposed approaches is provided in [28]. Although the literature on label noise for speaker recognition is relatively small, this topic is beginning to receive attention from researchers. For instance, for the x-vector front-end, the detrimental effects of label noise on speaker recognition are confirmed in [29], where the authors proposed modifying the entropy loss to relax the constraints of the speaker identity function to avoid fitting noisy samples. Pham et al. [30] conducted extensive experiments based on VoxCeleb2 to investigate the effects of different types of label noise on the x-vector extractor. In addition, for the back-end, Borgström et al. [17] proposed a novel method for Bayesian estimation of PLDA when training labels are noisy (we refer to this method as NL-PLDA). However, the existing literature focuses on either the front-end or the back-end only, and does not comprehensively analyze the impact of label noise on the different components of speaker recognition systems.
In this paper, we simultaneously optimize the front-end and back-end for speaker recognition systems when training datasets are noisy. For the network front-end, we propose a simple yet effective training paradigm to prevent networks from fitting noisy samples. The proposed framework consists of three major components: 1) a label confidence training scheme that incorporates network-predicted pseudo-labels into the loss function; this method is similar to Bootstrapping [22], but we use a well-designed dynamically increasing confidence weight; 2) a re-scaling strategy that reduces the posterior probability of clean labels to emphasize them more in the loss function; 3) an improved AM-Softmax loss function that relaxes the intra-class constraint. For the back-end, we treat the true labels as multinomial random variables and train an NL-PLDA model to perform speaker identification scoring.
This paper extends our previous work on combating noisy labels [31]. Whereas the experiments in [31] were conducted on the VoxCeleb dataset, the experiments in this paper are conducted on the Switchboard and NIST SRE04-10 datasets. Besides, more comprehensive experiments are conducted to analyze the effects of label noise on the x-vector, LDA, PLDA, and NL-PLDA models by setting different label error rates under both the closed-set and open-set label noise scenarios. The contributions of this paper are summarized as follows:
1) In the front-end, a label confidence training paradigm with a dynamic confidence policy, a re-scaling strategy, and an improved AM-Softmax are proposed for front-end learning when label noise is present. With these components combined, the network shows consistent improvements in robustness to label noise. Furthermore, a label correction method based on chunk-level label predictions is proposed that significantly reduces the number of noisy samples in a dataset.
2) In the back-end, we show how to apply NL-PLDA to filter out noisy labels. As a further practical contribution, we provide an optimized expectation-maximization (EM) algorithm and pseudo-code for the NL-PLDA training process.
3) To verify whether there are noisy samples in the SRE04-10 dataset, we utilize the network prediction and NL-PLDA estimation to filter out high-confidence noisy samples and then verify them by human validation. As a result, we find that approximately 1% of the samples in the SRE04-10 dataset are mislabeled. After removing these samples, a cleaner spk2utt mapping is released for this dataset.
This paper is organized as follows. Section II reviews the three most popular models used in speaker recognition systems. Section III introduces our proposed method for front-end learning with label noise. A detailed EM algorithm description for NL-PLDA is presented in Section IV and Appendix A. Section V shows comprehensive experiments, results, and analyses. Section VI shows applications of the proposed method. Conclusions are given in Section VII.

II. EMBEDDING-BASED SPEAKER RECOGNITION

A. X-vector
Nowadays, deep learning-based speaker recognition is of increasing interest to researchers. The x-vector is a typical architecture that extracts discriminative low-dimensional vectors for speakers through a neural network. Benefiting from a large amount of data, the x-vector shows superior performance over the i-vector, and it is the main focus of this paper. The x-vector typically consists of frame-level layers, a pooling layer, segment-level layers, and a softmax layer. The frame-level layers process variable-length sequences of acoustic frames using TDNN. The pooling layer aggregates the variable-length frame sequence into a fixed-dimensional vector. The segment-level layers are generally composed of two fully connected layers, and the outputs of the two layers are the so-called x-vectors.
To deal with long-duration and variable-length training data, acoustic sequences are usually cut into multiple small chunks, which are then used as inputs to train a network. During training, an objective function computes the cross-entropy between the given speaker labels and the corresponding output probabilities. In addition to the standard softmax objective function, the additive margin softmax (AM-Softmax or CosFace) [14], [32] and the additive angular margin softmax (AAM-Softmax or ArcFace) [15] are two commonly used loss functions. The back-propagation-based optimization algorithm updates the parameters of a network by minimizing the loss function. However, in the presence of noisy labels, this loss function might drive a speaker network to fit incorrect speaker identities. Therefore, improving the loss function to prevent a network from fitting noisy samples is the key to learning with noisy labels.

B. LDA
LDA is a supervised method to reduce feature dimensions, which is useful for classification tasks; therefore, it is widely used in image recognition and speaker recognition. LDA maximizes the Fisher criterion [33] for subspace embeddings by projecting high-dimensional features into low-dimensional features through a projection matrix P, i.e., LDA maximizes the between-class variance and minimizes the within-class variance. LDA is trained by optimizing the following Fisher criterion function:

$$J(P) = \frac{\left| P^{T} S_b P \right|}{\left| P^{T} S_w P \right|}, \tag{1}$$

where $S_b$ and $S_w$ denote between-class and within-class variance, respectively. They are calculated as

$$S_b = \frac{1}{N} \sum_{m=1}^{M} N_m (\mu_m - \mu)(\mu_m - \mu)^{T}, \tag{2}$$

$$S_w = \frac{1}{N} \sum_{m=1}^{M} \sum_{n=1}^{N_m} (x_n^m - \mu_m)(x_n^m - \mu_m)^{T}, \tag{3}$$

where N is the total number of embeddings from M speakers, $N_m$ is the number of samples of the m-th speaker, $\mu_m$ denotes the mean of the m-th speaker, $\mu$ denotes the global mean, and $x_n^m$ represents the n-th embedding of the m-th speaker.
LDA is typically performed as a channel compensation technology for both the i-vector and x-vector [2], [7]. However, since LDA is a supervised model, we explore how label noise affects LDA in Section V-E.
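As an illustration, the between-class and within-class scatter matrices can be computed as in the following minimal sketch. It assumes that both matrices are normalized by the total number of embeddings N; the function name `scatter_matrices` and the toy two-speaker data are hypothetical and are not taken from any toolkit used in this paper.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def outer(a: Vector, b: Vector) -> Matrix:
    """Outer product a b^T."""
    return [[ai * bj for bj in b] for ai in a]


def add_scaled(acc: Matrix, m: Matrix, scale: float) -> None:
    """In-place update acc += scale * m."""
    for r in range(len(acc)):
        for c in range(len(acc[0])):
            acc[r][c] += scale * m[r][c]


def scatter_matrices(data: Dict[str, List[Vector]]) -> Tuple[Matrix, Matrix]:
    """Return (S_b, S_w) for embeddings grouped by speaker label."""
    all_vectors = [x for xs in data.values() for x in xs]
    n_total = len(all_vectors)
    dim = len(all_vectors[0])
    mu = mean(all_vectors)  # global mean
    s_b = [[0.0] * dim for _ in range(dim)]
    s_w = [[0.0] * dim for _ in range(dim)]
    for xs in data.values():
        mu_m = mean(xs)  # mean of this speaker
        diff_b = [a - b for a, b in zip(mu_m, mu)]
        add_scaled(s_b, outer(diff_b, diff_b), len(xs) / n_total)
        for x in xs:
            diff_w = [a - b for a, b in zip(x, mu_m)]
            add_scaled(s_w, outer(diff_w, diff_w), 1.0 / n_total)
    return s_b, s_w


# Toy data: two speakers with two 2-dimensional embeddings each.
toy = {"spk_a": [[0.0, 0.0], [2.0, 0.0]], "spk_b": [[4.0, 2.0], [6.0, 2.0]]}
S_b, S_w = scatter_matrices(toy)
```

Because both matrices are built from the speaker labels, a mislabeled embedding shifts the mean of the class it is wrongly assigned to and inflates that class's within-class scatter, which is one way to see why label noise can affect LDA.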

C. PLDA
PLDA is a probabilistic generative model typically used to make probabilistic inferences about the class of data. It is a probabilistic version of LDA. Compared to LDA, PLDA adds a continuous Gaussian prior to class centers, which enables it to generate new unseen class centers given even a single example. Besides the standard PLDA formulation [4], there are several variants of PLDA [34], such as the simplified variant [5], the two-covariance variant [3], [5], and heavy-tailed PLDA [6]. In this paper, we adopt two-covariance PLDA, which is assumed to generate the class center y and the observed data x in a two-stage process:

$$P(y) = \mathcal{N}\left(y \mid \mu, \Sigma_b^{-1}\right), \tag{4}$$

$$P(x \mid y) = \mathcal{N}\left(x \mid y, \Sigma_w^{-1}\right), \tag{5}$$

where $\mu$, $\Sigma_b^{-1}$, and $\Sigma_w^{-1}$ are the parameters estimated by PLDA, namely the global mean, between-speaker, and within-speaker covariance matrices, respectively.
In the hypothesis-testing stage, given a pair of embeddings, we decide whether they belong to the same speaker by computing a likelihood ratio. Although making such a decision based on cosine distance is simpler, its performance might be suboptimal under more challenging conditions (e.g., cross-channel, language mismatch, and noisy environments). Thanks to the multiple PLDA domain adaptation technologies [35]-[37] and data augmentation methods [10], [38], PLDA shows clear advantages under such conditions. Besides, PLDA has been shown to be the theoretically optimal scoring method for speaker recognition [39]. Therefore, it is currently the dominant back-end algorithm for speaker recognition. In addition, for the problem of noisy PLDA training labels, a novel Bayesian estimation method has been proposed in [17], and we detail this method in Section IV.
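To make the likelihood-ratio test concrete, the sketch below scores a pair of one-dimensional embeddings under a two-covariance model. It is a scalar illustration only: the function names and the parameter values (`mu`, `between_var`, `within_var`) are hypothetical, and real systems operate on multi-dimensional embeddings with full covariance matrices.

```python
import math


def log_gauss_pair(a: float, b: float, var: float, cov: float) -> float:
    """Log density of a zero-mean bivariate Gaussian whose covariance
    matrix is [[var, cov], [cov, var]], evaluated at (a, b)."""
    det = var * var - cov * cov
    quad = (var * a * a - 2.0 * cov * a * b + var * b * b) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def plda_llr(x1: float, x2: float, mu: float,
             between_var: float, within_var: float) -> float:
    """Log-likelihood ratio: same-speaker versus different-speaker hypothesis."""
    a, b = x1 - mu, x2 - mu
    total_var = between_var + within_var
    # Same speaker: both embeddings share one class center, so they are
    # correlated through the between-speaker variance.
    same = log_gauss_pair(a, b, total_var, between_var)
    # Different speakers: independent class centers, hence zero covariance.
    different = log_gauss_pair(a, b, total_var, 0.0)
    return same - different


# Hypothetical parameters: speakers vary much more than sessions do.
llr_close = plda_llr(1.0, 1.0, mu=0.0, between_var=1.0, within_var=0.1)
llr_far = plda_llr(1.0, -1.0, mu=0.0, between_var=1.0, within_var=0.1)
```

With these assumed parameters, `llr_close` is positive (the pair is better explained by a shared speaker) and `llr_far` is negative. A verification decision would compare the score against a threshold.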

III. FRONT-END LEARNING WITH NOISY LABELS
We propose a proper training scheme and a loss correction method to improve the noise-tolerant capacity of the front-end. As illustrated in Fig. 1, our framework consists of three major components: 1) incorporating pseudo-labels into a loss function with a well-designed dynamic confidence policy that weighs the combination of pseudo-labels and original labels; 2) re-scaling clean-label posterior probability; 3) introducing a sub-center layer into the AM-Softmax to separate noisy samples from training data.

A. Label Confidence Training
To put this formally, let us first rethink deep embedding-based speaker recognition systems from a classification perspective. The x-vector network training process is formulated as a problem of learning a model $h_\theta(u)$ from a mini-batch of training samples $\{(u_i, y_i)\}_{i=1}^{B}$, where B is the mini-batch size, $y_i \in \{0, 1\}^M$ denotes the ground-truth label corresponding to $u_i$, and M is the total number of classes. For classification with label noise, the label $y_i$ might be noisy (i.e., $u_i$ is a noisy sample). Supposing the extracted embedding of $u_i$ is $x_i$, the parameters of the network are updated by optimizing the following loss function:

$$L = -\frac{1}{B} \sum_{i=1}^{B} \log P_{i,y_i}, \tag{6}$$

where $P_{i,y_i}$ denotes the posterior probability that $x_i$ is classified as the ground-truth label $y_i$. In this paper, we term $P_{i,y_i}$ the ground-truth label posterior probability; if adopting the AM-Softmax, $P_{i,y_i}$ is formulated as:

$$P_{i,y_i} = \frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i}^{M} e^{s\cos(\theta_{j,i})}}, \tag{7}$$

where s is the scaling factor, m is the additive margin, and $\theta_{j,i}$ is the angle between $W_j$ and $x_i$. $\cos(\theta_{j,i}) = W_j^{T} x_i$ represents the similarity score, and $W_j$ is the j-th class center vector of the fully connected layer matrix W. Note that here $W \in \mathbb{R}^{M \times d}$, whereas in Section III-C, the dimension is extended to $\mathbb{R}^{M \times K \times d}$ based on a sub-center layer.
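The AM-Softmax posterior and the resulting cross-entropy can be sketched as follows. This is a minimal pure-Python illustration rather than the training code: the scale s = 30 and margin m = 0.2 are illustrative assumptions, the function names are hypothetical, and the class centers and embedding are normalized explicitly so that the inner product equals the cosine.

```python
import math
from typing import List


def unit(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_scores(weights: List[List[float]], x: List[float]) -> List[float]:
    """cos(theta_j) between every class center W_j and the embedding x."""
    x_hat = unit(x)
    return [sum(w * e for w, e in zip(unit(w_j), x_hat)) for w_j in weights]


def am_softmax_posterior(cosines: List[float], label: int,
                         s: float = 30.0, m: float = 0.2) -> float:
    """Posterior of `label`: the margin m is subtracted only from its own cosine."""
    logits = [s * (c - m) if j == label else s * c
              for j, c in enumerate(cosines)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    return exps[label] / sum(exps)


def cross_entropy(posteriors: List[float]) -> float:
    """Mean negative log posterior over a mini-batch."""
    return -sum(math.log(p) for p in posteriors) / len(posteriors)


# Toy example with two classes and one 2-dimensional embedding.
cosines = cosine_scores([[1.0, 0.0], [0.0, 1.0]], [0.6, 0.4])
p_with_margin = am_softmax_posterior(cosines, label=0, m=0.2)
p_without_margin = am_softmax_posterior(cosines, label=0, m=0.0)
```

The margin lowers the posterior of the labeled class, so the network has to separate classes by more than the margin before the loss becomes small. With a wrong label, the same mechanism pushes the embedding toward the wrong class center.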
However, a neural network trained directly on this objective function will overfit noisy samples, as the loss contains noisy labels. Nonetheless, we observe that a network maintains high generalization performance without memorizing noisy labels at the beginning of the training process; an example is shown in Fig. 2, where the prediction accuracy for clean data is higher than that for noisy data. This indicates that the network clusters noisy samples into correct classes; therefore, to leverage this ability, we incorporate a predicted label posterior probability $P_{i,\hat{y}_i}$ into the objective function to prevent fitting incorrect samples as training proceeds. The subscript $\hat{y}_i \in \{0, 1, ..., M - 1\}$ denotes the predicted label of $x_i$, i.e., the class j with the maximum activated output. Then, the loss function in Eq. (6) is extended as follows:

$$L = -\frac{1}{B} \sum_{i=1}^{B} \left[ (1 - \alpha_t) \log P_{i,y_i} + \alpha_t \log P_{i,\hat{y}_i} \right], \tag{8}$$

where $\alpha_t \in [0, 1]$ is the confidence weight between $P_{i,y_i}$ and $P_{i,\hat{y}_i}$ at the t-th training iteration, and it determines whether the loss function relies more on the ground-truth label or the predicted label. This method is similar to Bootstrap [22]. Bootstrap sets $\alpha_t$ to a fixed small value for all iterations, so the effects of noisy labels persist during the whole training process; thus, the performance is suboptimal since the noisy label correction is limited. In contrast, we adopt a dynamic weight for $\alpha_t$. Since the network parameters are initialized randomly and the early predictions are likely to be incorrect, it is not practical to set $\alpha_t$ too large at the beginning of the training process. Nor should it be set too small in an advanced stage of training; otherwise, there will be too many adverse effects from noisy labels. Thus, we let $\alpha_t$ increase with the training iteration through an exponent $\lambda$, formulated as:

$$\alpha_t = \alpha_T \left( \frac{t}{T} \right)^{\lambda}, \tag{9}$$

where $\alpha_T$ represents the confidence weight at the final iteration T, t denotes the current training iteration, and $\lambda$ is the exponent that 
controls the rate of increase. With this confidence policy, $\alpha_t$ dynamically increases from 0.0 to $\alpha_T$ as iterations increase. The basic assumption is that predictions become more and more accurate during training; thus, the loss function should accordingly place more trust in the predictions. Therefore, we refer to this method as label confidence training. However, in the late stage of optimization, there is a risk that the network simply predicts all samples as belonging to the same class to minimize the loss. To avoid this issue, inspired by [23], [40], we further incorporate a label regularization term into the objective function for backpropagation; the total loss $L_{total}$ in Eq. (10) adds this term to L with weight $\beta$, where $\beta$ is the regularization coefficient and $\bar{P}_j = \frac{1}{B} \sum_{i=1}^{B} P_{i,j}$ is the mean softmax probability for class j. The label regularization term enables the classifier to allocate each sample a probability of belonging to every class, thereby preventing all samples from being assigned to a single class.
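The confidence policy and the mixed loss can be sketched as follows. The sketch assumes that the schedule takes the power form alpha_t = alpha_T * (t / T) ** lambda and that the loss mixes the two log posteriors with weights (1 - alpha_t) and alpha_t; the function names and toy posteriors are hypothetical, and the label regularization term is omitted.

```python
import math
from typing import List


def confidence_weight(t: int, total_iters: int,
                      alpha_final: float = 1.0, lam: float = 2.0) -> float:
    """Dynamic confidence weight: 0 at t = 0, alpha_final at t = total_iters."""
    return alpha_final * (t / total_iters) ** lam


def label_confidence_loss(posteriors: List[List[float]],
                          labels: List[int], alpha_t: float) -> float:
    """Mix the log posterior of the given label with that of the predicted label."""
    total = 0.0
    for probs, y in zip(posteriors, labels):
        # Predicted label: the class with the largest posterior.
        y_hat = max(range(len(probs)), key=probs.__getitem__)
        total += (1.0 - alpha_t) * math.log(probs[y])
        total += alpha_t * math.log(probs[y_hat])
    return -total / len(labels)


# Toy mini-batch: the second sample is labeled as class 2, but the network
# confidently predicts class 1, as it would for a noisy sample.
batch_posteriors = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
batch_labels = [0, 2]

loss_early = label_confidence_loss(batch_posteriors, batch_labels,
                                   confidence_weight(0, 100))
loss_late = label_confidence_loss(batch_posteriors, batch_labels,
                                  confidence_weight(100, 100))
```

Early in training (alpha_t = 0) the loss reduces to the ordinary cross-entropy on the given labels, so the suspicious sample dominates it. Late in training (alpha_t = alpha_T = 1) the loss follows the network's own prediction, so the suspicious label no longer pulls the parameters toward the wrong class.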

B. Clean Label Probability Re-scaling
If the predicted label of a sample is the same as its ground-truth label, we can generally believe that this sample is correctly labeled, since the label confidence training criterion prevents the network from fitting noisy labels. In that case, Eq. (8) is equivalent to Eq. (6). Nonetheless, we emphasize these clean samples so that they contribute more to learning discriminative speaker embeddings. Intuitively, the posterior probability is larger for clean samples and smaller for noisy samples, while the loss function is a monotonically decreasing function of the posterior probability. Therefore, we increase the contributions of clean samples to the parameter optimization by re-scaling their probability in the loss function. Specifically, the probability for clean samples is reduced to

$$\tilde{P}_{i,y_i} = \frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i}^{M} g(u)\, e^{s\cos(\theta_{j,i})}}, \tag{11}$$

where $g(u)$ is a re-scaling function, formulated in Eq. (12) with $u \geq 0$, so that $g(u) \geq 1$ and thereby $\tilde{P}_{i,y_i} \leq P_{i,y_i}$ for clean samples. This approach follows [25]. With re-scaling, in the late stage of training, most of the updates are attributed to clean examples, thereby weakening the influence of incorrectly labeled samples.
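The effect of the re-scaling in Eq. (11) can be checked numerically with the sketch below. Since the exact form of g(u) is not reproduced here, the value of g is passed in directly as a hypothetical constant no smaller than 1; the scale s = 30, the margin m = 0.2, and the toy cosines are likewise illustrative assumptions.

```python
import math
from typing import List


def rescaled_posterior(cosines: List[float], label: int, g: float = 1.0,
                       s: float = 30.0, m: float = 0.2) -> float:
    """AM-Softmax posterior in which all non-target terms are multiplied by g.

    g = 1.0 gives the ordinary posterior; g > 1.0 gives the reduced
    posterior used for samples regarded as clean.
    """
    target = math.exp(s * (cosines[label] - m))
    others = sum(math.exp(s * c) for j, c in enumerate(cosines) if j != label)
    return target / (target + g * others)


# Toy cosines for one sample whose predicted and given labels agree (class 0).
toy_cosines = [0.8, 0.3, 0.1]
p_plain = rescaled_posterior(toy_cosines, label=0, g=1.0)
p_rescaled = rescaled_posterior(toy_cosines, label=0, g=2.0)  # hypothetical g(u) = 2

loss_plain = -math.log(p_plain)
loss_rescaled = -math.log(p_rescaled)
```

Because the loss decreases monotonically in the posterior, lowering the posterior of a clean sample raises its loss and therefore its share of the gradient.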

C. Sub-center AM-Softmax
AM-Softmax and AAM-Softmax are two angular margin-based objective functions that are commonly used in deep speaker recognition. The x-vectors learned by such functions are angularly distributed and naturally match back-end scoring based on cosine similarity [41]. Moreover, they also perform better than Softmax when adopting the PLDA back-end for scoring, since they explicitly minimize the within-class covariance [9]. However, they are susceptible to label noise, as a class may contain incorrectly labeled samples. To address this problem, sub-center ArcFace has recently been proposed for face recognition [42]; it relaxes the intra-class constraint of ArcFace [15]. For speaker recognition, AM-Softmax performs comparably to AAM-Softmax [43]-[45], and it converges faster [46]. So, in this work, we adopt AM-Softmax as the objective function and relax its intra-class constraint to further improve the robustness to label noise.
The concept of "sub-classes" has been employed in face recognition for some time. Research shows that it separates different patterns more clearly and thus improves recognition accuracy [47], [48]. Following the setting in [42], we introduce sub-classes into AM-Softmax and refer to the improved loss function as sub-center AM-Softmax (SubAM-Softmax for short). Specifically, K sub-centers are introduced for each class to relax the intra-class constraint; the intention is that one dominant sub-class contains the majority of clean samples, while the (K − 1) non-dominant sub-classes contain noisy samples. To form K sub-centers, the dimension of matrix W in SubAM-Softmax is extended to $\mathbb{R}^{M \times K \times d}$, and then the similarity score is formulated as $\cos(\theta_{j,i}) = \max_{k} \left( W_{j,k}^{T} x_i \right)$, $k \in \{1, ..., K\}$, where $\max_k$ denotes a max-pooling step. The sub-classes are able to capture the complex distribution of the training data and separate noisy samples from clean samples. Therefore, sub-classes enable the loss function to be more robust to label noise [42].
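The max-pooling step over sub-centers can be sketched as follows. This is a minimal illustration with hypothetical names and toy weights (M = 2 classes, K = 2 sub-centers, d = 2 dimensions); sub-centers and the embedding are normalized explicitly so that inner products are cosines.

```python
import math
from typing import List


def unit(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def subcenter_cosines(weights: List[List[List[float]]],
                      x: List[float]) -> List[float]:
    """For each class j, return max_k cos(W_{j,k}, x).

    `weights[j][k]` is the k-th sub-center of class j (shape M x K x d).
    """
    x_hat = unit(x)
    scores = []
    for class_centers in weights:
        sims = [sum(w * e for w, e in zip(unit(w_jk), x_hat))
                for w_jk in class_centers]
        scores.append(max(sims))  # max-pooling over the K sub-centers
    return scores


# Toy weights: class 0 has sub-centers along both axes, so an embedding
# near either axis still obtains a high class-0 score.
W = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.7, -0.7]],
]
scores = subcenter_cosines(W, [0.0, 2.0])
```

The pooled scores take the place of the per-class cosines in the AM-Softmax posterior, so each sample is only pulled toward the nearest sub-center of its labeled class rather than toward a single class center.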

D. Discussions
1) Handling Hard Samples: In this paper, we refer to hard samples as clean samples that require more time for the network to learn. In the proposed method, we re-label samples according to the network's predictions. Although the predictions are more accurate than the original noisy labels, they may also misclassify some hard samples as noisy. To reduce this mislabeling, we keep the original labels in the loss function as a part of supervised training. Moreover, in the confidence policy shown in Eq. (9), α_T controls the maximum confidence degree, and λ controls the confidence rate for the network. Thus, the two parameters can be empirically set to trade off ground-truth labels against pseudo-labels.
2) How This Method Works: The framework leverages both the generalization ability of a network and the characteristics of speech signals to learn from label noise. Specifically, for generalization, a network first learns the patterns of a dataset from the correct samples and maintains high generalization performance at the beginning of the training process. This learning characteristic enables a learned model to recognize correct patterns in noisy samples and classify them into correct classes before memorizing them. Therefore, the proposed method prevents a network from fitting incorrect samples by incorporating predicted labels to correct noisy samples on the fly. For speech samples, an utterance is split into multiple chunks during training, and the network learns chunk-level speaker representations. Since speech is a non-stationary time series signal, there are contrasts across chunks [49], so it is almost impossible for the network to assign all the chunks of an utterance to a mislabeled class. Moreover, by adopting label confidence learning, the majority of chunks of an utterance are more likely to be correctly predicted.
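One simple way to exploit chunk-level predictions is a majority vote over the chunks of an utterance, as sketched below. The voting rule, the `min_agreement` threshold, and the function name are assumptions made for illustration only; they are not necessarily the label correction procedure used in this paper.

```python
from collections import Counter
from typing import List


def correct_label_by_chunks(chunk_predictions: List[int], original_label: int,
                            min_agreement: float = 0.5) -> int:
    """Replace the utterance label when most chunks agree on another class.

    The label is changed only if the most frequent predicted class differs
    from the original label and covers more than `min_agreement` of the chunks.
    """
    counts = Counter(chunk_predictions)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(chunk_predictions)
    if top_label != original_label and agreement > min_agreement:
        return top_label
    return original_label


# Four of five chunks are predicted as speaker 3 although the label says 7.
relabeled = correct_label_by_chunks([3, 3, 3, 7, 3], original_label=7)
# No single class dominates, so the original label is kept.
kept = correct_label_by_chunks([3, 7, 5, 7, 2], original_label=7)
```

Requiring a clear majority keeps the original label for hard samples whose chunk predictions are scattered across several classes.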

IV. BACK-END PLDA ESTIMATION WITH NOISY LABELS
To address the problem of label errors in the back-end, a method for Bayesian estimation of PLDA with noisy labels is proposed in [17]. Though the theoretical analysis of NL-PLDA has been covered extensively in [17], it does not illustrate a concrete implementation of the algorithm. In this section, as a further practical contribution, we give a detailed algorithmic presentation of NL-PLDA and show how it can be used to automatically filter out high-confidence noisy samples.

Algorithm 1 Bayesian estimation of PLDA with noisy labels
Input: Training set X = {x_1, ..., x_N} and corresponding labels L = {l_1, ..., l_N}, where X contains M individuals and N samples.
repeat
    E-step:
        Update N_e based on Eq. (14);
        Update P(l_n | z_{n,m} = 1, ε) based on Eq. (27);
        for m = 1 to M do
            Update N_m, r_{x,m} based on Eq. (15), (16);
            Update Φ_m based on Eq. (20);
            Update y_m, y_m y_m^T based on Eq. (18), (19);
        end for
        Update r_y, R^o_y, R_y, and R_xy based on Eq. (21)-(24);
    M-step:
        Update z_{n,m} based on Eq. (25);
        Update ε based on Eq. (28);
        Update µ, Σ_b, and Σ_w based on Eq. (29)-(31);
until Convergence;
Output: PLDA model parameters {µ, Σ_b, Σ_w}, and …
To combat label noise, NL-PLDA treats true labels as multinomial random variables and estimates a model's parameters based on maximum-likelihood estimation in the context of Variational Bayes. Specifically, we suppose that the training samples and corresponding labels are denoted as X = {x_1, ..., x_N} and L = {l_1, ..., l_N}, respectively, where N is the number of samples from M individuals. However, since there are noisy samples, L does not always give the correct identity. To tackle this problem, the true label for each sample x_n is modeled as a latent identity $z_n \in \mathbb{R}^M$, and we let Z = {z_1, ..., z_N} denote the set of true identities corresponding to X. Each element $z_{n,m}$ (with $\sum_{m=1}^{M} z_{n,m} \equiv 1$) in Z indicates the confidence (probability) that x_n belongs to individual m.
The EM algorithm for NL-PLDA is summarized in short form in Algorithm 1, and the details are available in Appendix A. In the E-step, it estimates both the posterior of the true identity and the individual feature distribution simultaneously, whereas the M-step updates the label error rate ε, the true latent identities Z, and the parameters of NL-PLDA, respectively.
Moreover, since Z explicitly models the latent identity distribution, it can be utilized to filter out high-confidence noisy labels. Before training, no a priori information about the error rate ε is available; therefore, it is assumed that there are no label errors (ε = 0). So, the initialization for Z can be shown as the left matrix of Eq. (13), where $z_{n,m}$ is initialized as $z_{n,l_n} = 1$, implying that x_n belongs to the original corresponding individual l_n. During the EM iteration steps, Z is determined by the maximum posterior estimation.
We assume that the final updated Z is shown as the right matrix of Eq. (13). When the value of $z_{n,l_n}$ becomes relatively small, this indicates that l_n has a high probability of being mislabeled. So, a threshold (e.g., $z_{n,l_n} \leq 0.1$) is empirically set to filter out these samples.
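The filtering rule can be sketched as follows. The function name and the toy matrix Z are hypothetical; in practice Z would be the latent identity posteriors obtained after the EM iterations, and the threshold of 0.1 is the example value mentioned above.

```python
from typing import List


def flag_noisy_samples(Z: List[List[float]], labels: List[int],
                       threshold: float = 0.1) -> List[int]:
    """Return indices n for which z_{n, l_n} <= threshold.

    Z[n][m] is the estimated probability that sample n belongs to
    individual m, and labels[n] is the original (possibly noisy) label l_n.
    """
    return [n for n, (z_n, l_n) in enumerate(zip(Z, labels))
            if z_n[l_n] <= threshold]


# Toy posterior for three samples and two individuals.
Z_final = [
    [0.95, 0.05],  # label 0 is well supported
    [0.08, 0.92],  # label 0 has little support: likely mislabeled
    [0.40, 0.60],  # label 1 is uncertain but above the threshold
]
original_labels = [0, 0, 1]
suspects = flag_noisy_samples(Z_final, original_labels)
```

Only the second sample is flagged here. A low threshold favors precision, which suits the setting in which flagged utterances are subsequently checked by human validation.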
V. EXPERIMENTS

A. Datasets
1) Training Datasets: The training datasets are prepared following the SRE16 Kaldi recipe (see footnote 1), including the Switchboard Phase2-3, Cellular1-2 (SWBD), and the NIST SRE04-10 datasets. After filtering out non-speech frames by energy-based voice activity detection (VAD), the recordings shorter than four seconds and the speakers with fewer than eight recordings are discarded. Finally, the SWBD portion contains 18,407 English recordings from 1,318 speakers, and the SRE04-10 portion includes 2,682 speakers with 48,022 utterances. Most of these recordings are in English, while some are in Chinese, Russian, Arabic, etc. We use the two pooled datasets to train the front-end extractors, while only using the SRE04-10 portion for the LDA and PLDA back-end training.
2) Evaluation Datasets: The evaluation datasets consist of NIST SRE 2016 (SRE16) and NIST SRE 2018 CMN2 (SRE18).Specifically, SRE16 is composed of Cantonese and Tagalog telephone conversations; the Cantonese dataset contains 965,395 trials and the Tagalog dataset contains 1,021,332 trials.For SRE18, the CMN2 collection is mainly spoken in Tunisian Arabic, and contains 108,095 trials in the development set and 2,063,007 trials in the evaluation set.

B. Experimental Settings
1) Data Preparation: In this section, we describe how we prepare the simulated closed-set label noise. In our experiments, we first assume that the original training datasets are clean (without label errors), denoted as ε = 0%, although we later confirmed that there are indeed a few mislabeled recordings in the training datasets. To better monitor the network prediction accuracy on a clean dataset, we divide the training datasets into a training set and a validation set. Specifically, the validation set is composed of one utterance randomly selected from each speaker, and the remaining recordings are used to compose the training set. There is no overlap between the two subsets. To simulate different label error rates, we perform label disruption on the SWBD and NIST SRE04-10 training sets, respectively. Specifically, for each speaker, we randomly select a fraction ε ∈ {5%, 10%, 20%, 30%, 50%} of the speaker's utterances and then randomly relabel them as other speaker identities present in the training set. For instance, if ε = 50%, then half of each speaker's utterances are randomly relabeled as belonging to other speakers in the training set; in this way, we obtain a 50% label error rate in the SWBD and NIST SRE04-10 training sets, respectively. Meanwhile, the validation set is kept clean and is used to monitor the actual prediction accuracy. Note that the validation set is not involved in network parameter updates.
1 https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v1
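The label disruption described above can be sketched as follows. This is an illustrative assumption of how such a simulation might be written, not the script used for the experiments: the function name, the `spk2utt` dictionary layout, the rounding of the per-speaker count, and the fixed random seed are all hypothetical.

```python
import random
from typing import Dict, List


def corrupt_labels(spk2utt: Dict[str, List[str]], error_rate: float,
                   seed: int = 0) -> Dict[str, str]:
    """Return an utterance-to-label map with simulated closed-set label noise.

    For every speaker, a fraction `error_rate` of the utterances is relabeled
    as a different speaker drawn uniformly from the same training set.
    """
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    speakers = sorted(spk2utt)
    noisy_labels: Dict[str, str] = {}
    for spk in speakers:
        utts = sorted(spk2utt[spk])
        n_flip = round(error_rate * len(utts))
        flipped = set(rng.sample(utts, n_flip))
        others = [s for s in speakers if s != spk]
        for utt in utts:
            noisy_labels[utt] = rng.choice(others) if utt in flipped else spk
    return noisy_labels


# Toy training set: three speakers with ten utterances each.
toy_spk2utt = {f"spk{s}": [f"spk{s}_utt{u}" for u in range(10)] for s in range(3)}
noisy = corrupt_labels(toy_spk2utt, error_rate=0.5)
n_wrong = sum(1 for spk, utts in toy_spk2utt.items()
              for utt in utts if noisy[utt] != spk)
```

Because a relabeled utterance is always assigned to a different speaker of the same set, the noise is closed-set and the realized error rate per speaker matches the requested one up to rounding.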
All of the raw audio files are converted to 40-dimensional Mel-frequency Cepstral Coefficients (MFCCs) with a 25 ms window and a 10 ms frame shift.Cepstral Mean Normalization over a three-second sliding window is applied to the MFCCs.After removing non-speech frames by VAD, the average duration of utterances in SWBD is 171 seconds and 160 seconds in SRE04-10.For the x-vector based front-end training, speech utterances are uniformly cut into chunks without overlaps, where the chunk length is set to 400 ms.These chunks are randomly formed into mini-batches as the network inputs.
2) Front-end Configurations: We conduct experiments using x-vector based front-ends, which are implemented in ASV-Subtools [50]. For the x-vector baseline system, we apply the extended-TDNN (E-TDNN) structure [12] with AM-Softmax loss to train a 512-dimensional x-vector extractor. The source code of the E-TDNN implementation is released on GitHub (see footnote 2). Besides the x-vector baseline, we also train six other x-vector based front-ends with different configurations, as shown in Table I. These front-ends are trained by adding components progressively, which enables us to observe the contributions of individual components. Note that Front2 adopts fixed confidence weights, which is characteristic of the Bootstrap method [22]. All of these networks are trained on GeForce RTX 2080 Ti GPUs with a mini-batch size of 256. AdamW is chosen as the optimizer, the weight decay is set to 1e-1, and the learning rate is initially set to 1e-3 and gradually reduced to 1e-6. The networks are trained for 21 epochs, which amounts to about two hundred thousand (200K) iterations in total.
3) Back-end Training & Evaluation: Once training is finished, we choose the epoch that gives the highest validation set accuracy as the final model. The activations of the model's penultimate fully connected layer are extracted as speaker embeddings (x-vectors). For the evaluation process, we first project the x-vectors to a lower 256-dimensional space using LDA and then apply centering and length normalization. We train LDA, PLDA, and NL-PLDA only on the SRE04-10 dataset.
Since the evaluation datasets are non-English, and PLDA is mainly trained on English utterances, domain mismatch

C. Results for The X-vector and Label Confidence Training
In this section, we compare the effects of label noise on the x-vector baseline and the Front1-3 extractors based on label confidence training. The exponent value for label confidence training is set as λ = 2.0, and α_T is set to 1.0.

1) Training Processes & Results: Before presenting the final results, we first compare the prediction accuracy of the x-vector baseline and Front1 on the training set and validation set. Note that the prediction accuracy is computed as the fraction of chunk samples in the training set or validation set that are classified correctly with respect to the corresponding labeled classes. Fig. 2 presents representative training evolutions with label error rates of 0%, 20%, and 50%, in order from left to right. Comparing Fig. 2(a), (b), and (c), one can clearly observe that training on the clean dataset converges faster than training on the mislabeled datasets, indicating that the network takes longer to learn mislabeled samples. The network also learns reasonable representations in the first few iterations, as shown by the fact that the validation set gives higher prediction accuracy than the training set with label error rates of 20% and 50%. Unfortunately, further training iterations do not improve the model as expected; instead, they lead the model to overfit incorrect samples, which subsequently degrades the prediction accuracy on the validation set. In contrast, the label confidence training scheme suffers fewer adverse effects from mislabeled samples. As shown in Fig. 2(e) and (f), the model produces higher validation set accuracy throughout training. Moreover, a final validation accuracy of around 85% is achieved even when the label error rate increases to 50%. We would like to emphasize that the final training accuracy of the models trained by this approach is close to the expected rate of correctly labeled samples in the dataset (one minus the label error rate), indicating that this approach can separate erroneous samples from correct samples within the whole training dataset.
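To make the accuracy measure concrete, the following sketch computes chunk-level prediction accuracy against the given (possibly noisy) labels. The toy data and the idealized "model" that always predicts the true speaker are assumptions for illustration only; they show why a noise-robust model's training accuracy, measured against noisy labels, approaches one minus the label error rate.

```python
from typing import Hashable, Sequence


def chunk_accuracy(predictions: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of chunks whose predicted class matches the labeled class."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Ten chunks from two speakers; two labels are flipped (20% label error).
true_labels = ["spk1"] * 5 + ["spk2"] * 5
noisy_labels = list(true_labels)
noisy_labels[0] = "spk2"
noisy_labels[5] = "spk1"

# Idealized noise-robust model: it predicts the true speaker for every chunk.
predictions = list(true_labels)

acc_vs_noisy = chunk_accuracy(predictions, noisy_labels)  # measured against given labels
acc_vs_true = chunk_accuracy(predictions, true_labels)    # measured against true labels
print(acc_vs_noisy, acc_vs_true)
```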
The results of the x-vector baseline and Front1 extractor used in conjunction with PLDA and NL-PLDA are summarized in Table II. Let us first focus on the PLDA results, shown in the left part of Table II. One can clearly observe that the x-vector's performance degrades rapidly as the label error rate increases, while Front1 significantly outperforms the baseline in situations with label noise.
Results for NL-PLDA are shown on the right side of Table II. In general, these results are consistent with the trend of PLDA, while NL-PLDA achieves better EER and minDCF in most situations. This implies that NL-PLDA is capable of handling label noise. However, the performance of NL-PLDA degrades in the presence of strong label noise, especially for the x-vector. One explanation is that an insufficient number of correct labels poses a challenge to the NL-PLDA training. Another important reason is that the speaker embeddings learned by the x-vector contain less discriminative information, which confuses the NL-PLDA label noise estimation. This implies that a front-end that is
Another interesting observation is that the EERs of NL-PLDA on the SRE16 Tagalog and SRE18 CMN2 sets are slightly lower than those of PLDA. Since the only difference between the training of NL-PLDA and PLDA lies in how the speaker labels are treated, this phenomenon motivates us to investigate whether there are noisy samples in the original training datasets, which is explored in Section VI-B.
2) Comparisons with Fixed Confidence: In this section, we compare dynamic and fixed confidence weights. We empirically fix the confidence weight in Front2 as α_t = 0.3 during the whole training process. Since the results of NL-PLDA are consistent with the trend of PLDA, only the PLDA results are reported in Table III. Comparing the results of Front1 (in Table II) and Front2, one can clearly observe that Front1 outperforms Front2 in most situations, indicating that gradually increasing confidence weights are more effective in reducing the impact of label noise.
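The difference between a fixed and a gradually increasing confidence weight can be illustrated with a small sketch. The exact schedule is not restated in this section, so the power-law ramp α_t = α_T (t/T)^λ used here is an assumption, as is the form of the training target, which mixes the one-hot original label and the one-hot predicted label with weights (1 − α_t) and α_t. The function names are hypothetical.

```python
from typing import List


def confidence_weight(step: int, total_steps: int,
                      alpha_final: float = 1.0, exponent: float = 2.0) -> float:
    """Assumed dynamic schedule: alpha_t = alpha_T * (t / T) ** lambda."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_final * frac ** exponent


def mixed_target(label: int, predicted: int, num_classes: int, alpha: float) -> List[float]:
    """Assumed soft target: (1 - alpha) on the original label, alpha on the prediction."""
    target = [0.0] * num_classes
    target[label] += 1.0 - alpha
    target[predicted] += alpha
    return target


total = 200_000                                   # about 200K iterations, as in the setup
alpha_mid = confidence_weight(100_000, total)     # halfway through training
alpha_end = confidence_weight(total, total)       # reaches alpha_T at the end

dynamic_target = mixed_target(label=0, predicted=2, num_classes=3, alpha=alpha_mid)
fixed_target = mixed_target(label=0, predicted=2, num_classes=3, alpha=0.3)  # Front2-style
print(alpha_mid, alpha_end, dynamic_target, fixed_target)
```

Under this assumed schedule, the network's own predictions are trusted very little early in training, when they are still unreliable, and progressively more as training proceeds; a fixed weight trusts them equally throughout.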
3) Effects of Label Regularization: To examine the effects of the label regularization term, Front3, which omits label regularization, is trained as a comparison to Front1. The results of Front3 with a PLDA back-end are shown in Table IV.

D. Effectiveness of Re-scaling and SubAM-Softmax
We further examine the effectiveness of the re-scaling strategy and SubAM-Softmax for handling label noise. Table V provides the results of the front-ends that add re-scaling, SubAM-Softmax, and the two combined, respectively. Compared to the results of Front1 in Table II, one can observe from the first part of Table V that re-scaling helps boost the performance. These favorable results indicate that focusing more on clean labels is helpful when training with noisy labels. From the second part of Table V, one can observe that SubAM-Softmax is more robust than the standard AM-Softmax under heavy label noise. In addition, SubAM-Softmax performs even slightly better than AM-Softmax when trained on a clean dataset.
Finally, we obtain the best results with the Front6 extractor, which adopts label confidence training and combines re-scaling with SubAM-Softmax. The results are shown in the third part of Table V. Front6 outperforms the baseline and effectively enhances the robustness of the speaker recognition system to noisy labels. More specifically, compared to the x-vector baseline in Table II, when scoring with PLDA, the Front6 extractor trained with a 50% label error rate performs comparably to the x-vector trained with a 20% label error rate. When scoring with NL-PLDA, the performance of Front6 even approaches that of the x-vector baseline trained with a 5% label error rate. To better illustrate the performance comparisons with different front-ends and back-ends, the DET curves on the SRE16 Cantonese evaluation set are presented in Fig. 3. All the systems selected are trained with 20% and 50% label error rates. From Fig. 3, one can make the following observations: 1) Substantial improvements are obtained by label confidence training. 2) Re-scaling and SubAM-Softmax have complementary properties, and the systems yield further improvements when used in tandem. 3) NL-PLDA always performs better than PLDA when training labels are noisy. 4) The superiority of NL-PLDA benefits from a more robust front-end, especially in the presence of high label error rates.

Fig. 3. DET curves on SRE16 Cantonese for different front-ends and back-ends with 20% and 50% label error rates.

E. Effects of LDA Configurations
Since LDA training also requires speaker labels, in this set of experiments, we investigate the effects of different LDA configurations on PLDA and NL-PLDA. Three distinct back-end configurations are compared: without LDA, LDA trained on noisy labels, and LDA trained on clean labels. The x-vector baseline is used as the front-end, and the experimental results are shown in Table VI. For convenient comparison, we re-present the results of the x-vector baseline (in the middle part of Table VI). From Table VI, it is clear that the back-end without LDA yields the worst results. Moreover, without LDA, NL-PLDA loses its ability to combat label noise and even achieves worse results than PLDA. However, the performance of NL-PLDA can be improved by using LDA projections, even if LDA is trained on incorrect labels. LDA trained on clean labels achieves the best results, but the performance gain is not very large, especially when the label error rate is small. It seems that LDA is robust to noisy labels.
To further observe the effects of LDA, we visualize the speaker embeddings by plotting the t-SNE embeddings. The embeddings from ten distinct clusters with different LDA configurations are shown in Fig. 4.

In Table VI, "NL" denotes noisy labels and "CL" denotes clean labels.

F. Validations on Open-set Label Noise
In this section, we conduct experiments to validate the performance of the proposed front-end and NL-PLDA back-end on open-set noisy datasets. The open-set noisy datasets are simulated by randomly selecting p ∈ {5%, 10%, 20%, 30%, 50%} of the training utterances per speaker in the original SWBD and SRE04-10 datasets and replacing them with utterances randomly selected from the concatenated VoxCeleb2 dataset (subsegments belonging to the same video are concatenated to form a single utterance, which is then down-sampled to 8 kHz). It is noteworthy that the labels and the number of utterances per speaker remain unchanged. These open-set noisy datasets are used to train an x-vector baseline and Front6 extractors, respectively. Experimental results for four validation sets using PLDA and NL-PLDA scoring are shown in Table VII. As shown, Front6 achieves better performance than the x-vector baseline, which indicates that our method robustly trains front-ends from open-set noisy datasets. Although these noisy samples cannot be mapped to correct labels in the training set, our method reduces their detrimental effects by clustering them separately into similar classes in the training set. Compared with the x-vector baseline in Table II, we observe that closed-set label noise is more harmful than open-set label noise for the front-end. However, the opposite holds for the NL-PLDA back-end, as shown by the comparison with Front6 in Table V: NL-PLDA yields higher EER and minDCF under the same label error rates for open-set label noise. Nonetheless, it still outperforms PLDA in all scenarios.
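The open-set noise simulation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' script: the function name `inject_open_set_noise`, the rounding of the per-speaker replacement count, and drawing external utterances without replacement are all assumptions. Labels and the number of utterances per speaker are preserved, as in the text.

```python
import random
from typing import Dict, List


def inject_open_set_noise(spk2utt: Dict[str, List[str]], external_pool: List[str],
                          p: float, seed: int = 0) -> Dict[str, List[str]]:
    """Replace a fraction p of each speaker's utterances with external ones.

    Speaker labels and per-speaker utterance counts are left unchanged.
    """
    rng = random.Random(seed)
    pool = list(external_pool)   # copy so the caller's list is not modified
    rng.shuffle(pool)
    noisy: Dict[str, List[str]] = {}
    for spk, utts in spk2utt.items():
        n_replace = round(p * len(utts))                     # assumed rounding rule
        replace_idx = set(rng.sample(range(len(utts)), n_replace))
        new_utts = []
        for i, utt in enumerate(utts):
            if i in replace_idx:
                if not pool:
                    raise ValueError("external pool exhausted")
                new_utts.append(pool.pop())                  # out-of-set utterance, same label
            else:
                new_utts.append(utt)
        noisy[spk] = new_utts
    return noisy


# Toy data: two in-set speakers and a pool of out-of-set utterances.
spk2utt = {
    "spkA": [f"a{i}" for i in range(10)],
    "spkB": [f"b{i}" for i in range(10)],
}
external_pool = [f"vox{i}" for i in range(10)]
noisy = inject_open_set_noise(spk2utt, external_pool, p=0.2)
print({spk: sum(u.startswith("vox") for u in utts) for spk, utts in noisy.items()})
```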

A. Label Correction on Synthetic Datasets
As shown above, when a model is trained to be robust to noisy labels, the prediction accuracy of the model is greater than the correct label rate of the data. In addition, this prediction accuracy is based on chunk-level samples, and since an utterance is composed of multiple chunks, it can be expected that the accuracy will improve further when predictions are aggregated at the utterance level. So, a straightforward application is to use chunk-level predictions from a well-trained model to re-label the training datasets, and we call this application label correction. This is of practical interest, as label-corrected datasets can then be used to retrain front-end networks and back-end models.

Table note: " " denotes the label error rate before label correction.
To verify this method, we apply label correction on synthetic closed-set noisy datasets. The Front6 extractors trained with different label error rates are used to predict the chunk-level labels for the corresponding dataset. Specifically, in the label-prediction process, the inputs to the network are the chunk-level samples from each utterance, while the output is the speaker label corresponding to each chunk sample. Then, each utterance is relabeled with the prediction that occurs most frequently among its chunks. As expected, the utterance-level prediction accuracy of the dataset with label noise is further improved through label correction, and the updated label error rate (one minus the corresponding prediction accuracy) is significantly reduced compared to the original error rate. This is summarized as follows: 5% → 1.2%, 10% → 1.4%, 20% → 1.9%, 30% → 2.6%, and 50% → 8.6%. Then, we use the label-corrected datasets to retrain the speaker recognition systems. The final results are shown in Table VIII. One can clearly observe that this method corrects erroneous labels even in the case of high error rates, thereby significantly alleviating adverse effects from mislabeling.
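The relabeling rule, in which each utterance takes the most frequent prediction among its chunks, can be sketched as a simple majority vote. The helper names and the toy data are hypothetical, and resolving ties in favor of the prediction that appears first is an assumption, since the text does not say how ties are broken.

```python
from collections import Counter
from typing import Dict, List


def relabel_utterance(chunk_predictions: List[str]) -> str:
    """Return the most frequent chunk-level prediction for one utterance.

    Ties go to the prediction seen first (an assumption).
    """
    if not chunk_predictions:
        raise ValueError("an utterance needs at least one chunk prediction")
    return Counter(chunk_predictions).most_common(1)[0][0]


def correct_labels(chunk_preds_by_utt: Dict[str, List[str]]) -> Dict[str, str]:
    """Apply majority-vote relabeling to every utterance."""
    return {utt: relabel_utterance(preds) for utt, preds in chunk_preds_by_utt.items()}


def label_error_rate(labels: Dict[str, str], reference: Dict[str, str]) -> float:
    """Fraction of utterances whose label differs from the reference label."""
    wrong = sum(labels[utt] != reference[utt] for utt in reference)
    return wrong / len(reference)


# Toy example: utt1 carries a wrong original label.
chunk_preds = {
    "utt1": ["spkA", "spkA", "spkB"],
    "utt2": ["spkB", "spkB", "spkB"],
    "utt3": ["spkA", "spkC", "spkC", "spkC"],
}
original = {"utt1": "spkB", "utt2": "spkB", "utt3": "spkC"}
truth = {"utt1": "spkA", "utt2": "spkB", "utt3": "spkC"}

corrected = correct_labels(chunk_preds)
print(label_error_rate(original, truth), label_error_rate(corrected, truth))
```

Because a single misclassified chunk is outvoted by the remaining chunks of the same utterance, the utterance-level error rate can fall below the chunk-level one.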

B. Label Denoising on Real-world NIST SRE04-10 Dataset
We further investigate whether there are mislabels in the original "clean" NIST SRE04-10 dataset. This experiment adopts more rigorous methods, including network prediction, NL-PLDA estimation, and human validation, and we call this process label denoising. Specifically, the Front6 and NL-PLDA trained on the original dataset are used to filter out high-confidence noisy samples. Two types of samples are filtered out: those with predicted labels that are inconsistent with the ground-truth labels and those with low latent identity (z_{n,m} < 0.1). We find that most of the samples selected by the two criteria overlap. We obtain a sub-dataset that is 1.2% the size of the original SRE04-10 dataset. Then, we check these samples by human validation and find that more than half of them are indeed clearly mislabeled upon listening. Common types of corrupted labels include gender errors (utterances from female speakers mixed into a male speaker's class, or vice versa), language errors (multiple languages mixed in a speaker's samples that do not sound like the same person), and non-speech (barely audible human voice). Examples of mislabeled audio files are publicly available³. Finally, we retrain the x-vector baseline on the original dataset with these samples removed, and the results are shown in Table IX. Compared with the x-vector in Table II, slight improvements can be observed on all four evaluation sets. In addition, a relatively clean version of the SRE04-10 spk2utt file that contains speaker-to-utterance mappings is also available on the website³.
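The two filtering criteria can be sketched as set operations over per-utterance records. The record layout (keys `utt`, `label`, `predicted`, `z`) and the function name are hypothetical; only the threshold of 0.1 on the latent identity comes from the text. The flagged candidates would then be passed to human validation.

```python
from typing import Dict, List, Set, Tuple


def flag_suspect_utterances(records: List[Dict], z_threshold: float = 0.1) -> Tuple[Set[str], Set[str]]:
    """Return (label-mismatch set, low-latent-identity set) of utterance ids."""
    mismatch = {r["utt"] for r in records if r["predicted"] != r["label"]}
    low_identity = {r["utt"] for r in records if r["z"] < z_threshold}
    return mismatch, low_identity


# Toy records: "predicted" stands for the front-end's utterance-level prediction,
# "z" for the latent identity value estimated by NL-PLDA.
records = [
    {"utt": "u1", "label": "s1", "predicted": "s1", "z": 0.95},
    {"utt": "u2", "label": "s1", "predicted": "s2", "z": 0.03},
    {"utt": "u3", "label": "s2", "predicted": "s2", "z": 0.05},
    {"utt": "u4", "label": "s2", "predicted": "s1", "z": 0.60},
]

mismatch, low_identity = flag_suspect_utterances(records)
candidates = mismatch | low_identity      # everything sent to human validation
flagged_by_both = mismatch & low_identity  # samples on which the two criteria agree
print(sorted(candidates), sorted(flagged_by_both))
```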

VII. CONCLUSIONS
In this paper, we demonstrate that label noise leads to significant performance degradation for both the x-vector front-end and the PLDA back-end. Then, we propose a simple yet effective approach to combat label noise in front-end training. Our proposed framework contains three strategies: a label confidence training scheme, a posterior probability re-scaling strategy, and an improved AM-Softmax loss function. When these three strategies are combined progressively, experiments conducted on the pooled SWBD and SRE04-10 datasets show consistent improvements in robustness against label noise. Since a speaker recognition system consists of both a front-end and a back-end, it is necessary to optimize both to achieve the best performance. Consequently, we also optimize the back-end PLDA when the training labels are noisy. When the optimized front-end and back-end are combined, the whole speaker recognition system demonstrates strong resistance to noisy labels.
In addition, we show two practical applications of this improved system: label correction and label denoising. Label correction is used to correct noisy labels that occur in a dataset. We propose correcting noisy samples based on an utterance's chunk-level predictions from a well-trained network. Experimental results show that label correction greatly reduces the number of noisy samples within a dataset. As a result, models retrained on a label-corrected dataset perform similarly to those trained on a clean dataset. Furthermore, we apply label denoising to the real-world NIST SRE04-10 dataset to weed out the original label errors, where both the front-end and back-end are used to algorithmically filter out high-confidence noisy samples, which we then verify by human validation. As a result, we verify that approximately 1% of the samples in the original SRE04-10 dataset are noisy. Experimental results show that models trained on the label-denoised dataset achieve slight improvements compared to the baseline system.
In the future, we are interested in validating our method on other front-end networks and conducting experiments on real-world label noise datasets. In addition, we plan to apply this approach to semi-supervised learning and self-supervised learning networks.

APPENDIX A PLDA LEARNING ALGORITHM WITH NOISY LABELS
This Appendix presents the EM algorithm for learning the NL-PLDA parameters {µ, Σ_b, Σ_w} [17]. For convenience, let ⟨•⟩ denote the expectation of a given random variable.
In the E-step, let us first pre-compute the number of error samples as: the number of samples for the m-th individual: the first-order moment for the m-th individual: and the global second-order moment: Then, we compute the first and second moments of the latent variables: where Next, we need to compute the following auxiliary matrices:
For the M-step, we update the matrix of label latent identity Z by: = N_e / N.
After that we update the NL-PLDA parameters as follows:

Fig. 1. An overview of the proposed method. During training, the SubAM-Softmax outputs the predicted label of each chunk, and the total loss is formed by a confidence-weighted combination of predicted and original labels.

In Fig. 4, each embedding is represented by its corresponding true label. As Fig. 4(a) shows, embeddings without LDA projections are more isolated within their classes, and the boundaries between classes are not clearly separated. In Fig. 4(b) and (c), by contrast, the embeddings with LDA are more effectively separated into clusters. The embeddings presented in Fig. 4(b) are very similar to those in Fig. 4(c), demonstrating that LDA shows robustness to label noise.

Fig. 4. t-SNE visualization of speaker embeddings for w/o LDA, LDA trained on noisy labels, and LDA trained on clean labels (in order from left to right). The speaker embeddings are extracted by the x-vector baseline trained with a 50% label error rate. Each number or color represents a class.

TABLE I CONFIGURATIONS OF THE X-VECTOR BASED FRONT-ENDS

TABLE II RESULTS OF FRONT1 USED IN CONJUNCTION WITH PLDA AND NL-PLDA WITH DIFFERENT LABEL ERROR RATES ( )

TABLE III PERFORMANCE OF FRONT2 IN CONJUNCTION WITH PLDA

TABLE IV PERFORMANCE OF FRONT3 IN CONJUNCTION WITH PLDA

One can observe that Front3 yields lower EERs than Front1 on the clean training dataset, indicating that adding label regularization causes slight performance degradation when the training labels are clean. However, incorporating label regularization achieves consistent improvements on the noisy datasets. One explanation is that label regularization results in redundant information within the clean training dataset's loss values, while this information is useful for learning with label noise.

TABLE V PERFORMANCE COMPARISONS OF FRONT4-6 WITH DIFFERENT LABEL ERROR RATES ( )

TABLE VI PERFORMANCE OF THE X-VECTOR BASELINE WITH DIFFERENT LDA CONFIGURATIONS

TABLE VII PERFORMANCE COMPARISONS OF THE X-VECTOR BASELINE AND FRONT6 WITH OPEN-SET LABEL ERROR RATES (p)