Self-Augmentation Based on Noise-Robust Probabilistic Model for Noisy Labels

Learning deep neural networks from noisy labels is challenging, because high-capacity networks attempt to describe the data even when the class labels are noisy. In this study, we propose a self-augmentation method without additional parameters, which handles noisy labeled data based on small-loss criteria. To this end, we use small-loss samples by introducing a noise-robust probabilistic model based on a Gaussian mixture model (GMM), in which small-loss samples follow class-conditional Gaussian distributions. With this sample augmentation using the GMM-based probabilistic model, we can effectively solve the over-parameterization problems induced by label inconsistency in small-loss samples. We further enhance the quality of the small-loss samples using our data-adaptive selection strategy. Consequently, our method prevents networks from over-parameterization and enhances their generalization performance. Experimental results demonstrate that our method outperforms state-of-the-art methods for learning with noisy labels on several benchmark datasets. The proposed method produced a remarkable performance gap of up to 12% compared with the previous state-of-the-art methods on the CIFAR datasets.


I. INTRODUCTION
The generalization performance of supervised classification problems depends heavily on accurate label information. However, the construction of high-quality ground-truth class labels for large-scale datasets from domain experts is extremely expensive and time-intensive. Thus, in practice, data with labels are collected from the internet through various methods such as crowdsourcing [3] and online queries [4]. However, these data inevitably contain noisy labels, which results in a poor generalization performance of deep neural networks (DNNs).
To solve this problem, recent methods [1], [5], [6], [7], [8], [9] have adopted small-loss criteria. By training networks using samples that produce small losses, data samples with correct labels can be employed to improve the robustness of the network against noisy labels. However, although networks based on small-loss criteria can be trained with clean samples in the early training stages, they gradually overfit to samples with false labels as the training proceeds, owing to the memorization effect of DNNs [10], [11], which degrades the learning performance of the networks.
The following two representative approaches can prevent the aforementioned degradation when learning samples with small loss values. In the first approach, filtering methods [1], [8], [9] are used to filter out wrongly selected noisy samples from the small-loss samples, thereby reducing the accumulated errors induced by noisy samples. For this, two networks are utilized simultaneously, and the different learning abilities of the networks are shared. However, employing multiple networks requires additional learning parameters, inducing a high computational cost for training. In addition, as shown in Fig.1(b), the learning abilities of the two networks do not significantly differ from each other after over-parameterization. Thus, they fail to effectively filter out noisy samples.

FIGURE 1. Visualization of learned representations. We conducted experiments using the CIFAR-10 training dataset with 50% symmetric noise and visualized the pre-softmax layer using 2D t-SNE (t-distributed stochastic neighbor embedding) [2]. Each cluster indicates a set of data with the same prediction class inferred by the network. Blue colors represent samples with correct labels and correct network predictions. Orange colors represent samples with correct labels but false network predictions. Green colors represent network predictions made by the additional network in (b) and newly generated samples from the proposed GMM in (c). (a) Deep neural networks that are directly trained on noisy datasets (i.e., Standard) cannot differentiate between samples with correct (i.e., blue) and false (i.e., orange) predictions owing to network over-parameterization. (b) Co-teaching [1] augments samples (i.e., green) using the different learning abilities of dual networks. However, owing to network over-parameterization, the learning abilities of the two networks become similar. Thus, they fail to differentiate between samples with correct (i.e., blue) and false (i.e., orange) predictions. (c) Our method enables reliable augmentation (i.e., green) using the GMM, which prevents the network from overfitting to noisy samples. Thus, correct (i.e., blue) and false (i.e., orange) predictions are better separated than in previous methods.

In the second approach, sophisticated small-loss selection rules are
used. In [1], small-loss samples are empirically selected using manually tuned selection strategies based on the network memorization effect, which is not suitable for real-world data. Thus, to find the optimal selection strategy, a data-adaptive selection strategy [12] has been proposed using automated machine learning. However, it requires substantial computational resources to determine the optimal strategy.

In this study, we propose a self-augmentation method using only a single network, which can efficiently handle noisy labeled data based on small-loss criteria. Our method solves the existing problems of the two aforementioned approaches without additional parameters. It effectively regularizes over-parameterization based on the network's own inference. In particular, to protect deep neural networks from over-parameterization, we present a noise-robust probabilistic model with a strong statistical structure, a Gaussian mixture model (GMM), in which small-loss samples follow class-conditional multivariate Gaussian distributions. Using the proposed GMM, we augment samples to effectively regularize the over-parameterization of the network. Unlike [13], where training samples are assumed to follow a Gaussian distribution at test time with a pre-trained network, we conduct self-augmentation during training without additional parameters or supplementary information. Furthermore, we present a novel data-adaptive selection strategy based on the covariance structure of the proposed GMM. Using our data-adaptive selection strategy, we can considerably enhance the learning performance of the networks by training with high-quality (i.e., less noisy) small-loss samples. Fig.1 shows the effectiveness of the proposed method.
Our main contributions are as follows:
• We propose a novel self-augmentation method based on a single network that handles noisy labeled data based on small-loss criteria. To this end, we present a robust probabilistic model, the GMM, that employs class-conditional Gaussian distributions with moving weighted Gaussian moments across learning steps. Sample augmentation using the proposed GMM then effectively prevents over-parameterization.
• We introduce a novel data-adaptive selection strategy that enhances the quality of small-loss samples. By training networks with our high-quality small-loss samples, our method considerably improves the generalization performance of the networks.
• We experimentally demonstrate that our method significantly outperforms other state-of-the-art methods for supervised classification tasks with noisy labels. Small-loss samples selected by our sample selection strategy are less noisy than those of other strategies.
The organization of this paper is as follows. Section II relates the proposed method to existing methods. Section III introduces the notations and terminology used in this study. Section IV presents the proposed self-augmentation method to reduce the negative effects of noisy samples in small-loss data. Sections V and VI present the experimental results and a discussion, respectively. We conclude the study in Section VII.

II. RELATED WORK
Although there is a long history of methods that deal with noisy labels, we only discuss the most relevant approaches, including loss or label correction, small-loss selection, and mixture model methods. Table 1 highlights the differences between the proposed method and other methods.

A. LOSS OR LABEL CORRECTION
Noisy labels have been addressed during training via loss or label correction. For example, the loss correction method FW [14] estimates the probability that each class is corrupted into another and uses this probability during training. GLC [15] enhanced this method by utilizing additional data with trusted labels. In [16], samples were re-weighted based on their gradient directions to avoid overfitting to label noise. In addition, noisy labels have been handled by considering the consistency of predictions in the loss function [17]. In SELFIE [7], noisy samples with consistent label predictions were exploited and corrected to prevent noise accumulation. These methods transform noisy labels either explicitly or implicitly into clean labels by correcting classification losses. In this case, the influence of noisy labels can accumulate when the correction mechanism does not work appropriately. In contrast, small-loss selection-based approaches with well-designed selection strategies, such as our method, can be free from this noise accumulation by discarding noisy samples.

B. SMALL-LOSS SELECTION
Noisy labels have also been avoided via small-loss selection. However, the selected small-loss samples can still contain noisy samples. To solve this problem, in MentorNet [5], a pre-trained teacher network was used to select small-loss samples that guide the training of a student network. In Co-teaching [1] and Co-teaching+ [9], small-loss samples were exploited and exchanged between two networks to update the network parameters. Similar to [1], a joint training strategy was employed in JoCoR [8] to select clean samples effectively. However, employing additional networks results in a high computational cost and an increased number of parameters. In contrast, we propose a self-augmentation strategy that trains a single network without additional parameters.

C. MIXTURE MODEL
Mixture models have been introduced for noise-robust training. In [6] and [18], mixture models were employed to compute the confidence of samples being clean. In particular, a two-component mixture model with beta distributions [18] was used to estimate sample confidence. Similarly, in DivideMix [6], a two-component mixture model with Gaussian distributions was used for this purpose. However, two-component mixture models cannot sufficiently capture the class-wise properties of the training data. In RoG [13], generative classifiers were induced and integrated into pre-trained networks to obtain more accurate decision boundaries in embedded feature spaces. This method assumes that class-conditional distributions follow multivariate Gaussian distributions. As in [13], our method assumes that small-loss samples follow class-conditional multivariate Gaussian distributions. However, unlike [13], our method introduces no additional classifiers, parameters, or supplementary information during training, owing to a novel self-augmentation method based on data-adaptive probabilistic models.

III. NOTATIONS AND PROBLEM SETUP
For classification tasks with C classes, D̃ = {(x_i, ỹ_i)}_{i=1}^{N} denotes N pairs of training data in the noisy training dataset, where x_i ∈ R^d is the i-th sample paired with the possibly noisy label ỹ_i ∈ {1, ..., C}. We denote the network-predicted class of a data sample as ŷ and a deep neural network with parameters θ as f_θ. We then optimize the network parameters by minimizing the following objective loss function (i.e., cross-entropy loss):

L_ce(D̃, θ) = −(1/N) Σ_{i=1}^{N} log p_{ỹ_i}(f_θ(x_i)),   (1)
where f_θ(x_i) outputs an estimated label (logit) and p_c(·) transforms the output into a probability for class c. However, because of the existence of noisy labels, learning the network parameters θ by optimizing (1) inevitably induces over-parameterization. Thus, the test accuracy decreases considerably, leading to poor generalization performance of the network.
To solve this over-parameterization problem, small-loss criteria have been widely adopted to deal with noisy labeled data. Intuitively, training samples that produce small losses are likely to be correctly labeled. Thus, by training DNNs with small-loss samples, the networks can be made robust to noisy samples. To select high-quality small-loss samples, we employ a cross-entropy-based selection strategy similar to that used in [1] and [5]. In particular, at each learning step k, we select the 1 − C(k) fraction of samples that have low cross-entropy loss values in (1).
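As a minimal sketch, the cross-entropy-based small-loss selection can be written as follows; all function and variable names are illustrative, not from the authors' code:

```python
import numpy as np

def select_small_loss(logits, noisy_labels, keep_fraction):
    """Keep the keep_fraction of samples with the lowest
    cross-entropy loss, following the small-loss criterion."""
    # Softmax probabilities p_c(f_theta(x_i)); subtract the max for stability.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Per-sample cross-entropy against the (possibly noisy) labels.
    losses = -np.log(probs[np.arange(len(noisy_labels)), noisy_labels] + 1e-12)
    n_keep = int(np.ceil(keep_fraction * len(losses)))
    return np.argsort(losses)[:n_keep]  # indices of small-loss samples

# Toy usage: 4 samples, 3 classes; the last label disagrees with its logits,
# so that sample produces a large loss and is filtered out.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0],
                   [0.0, 0.0, 4.0],
                   [4.0, 0.0, 0.0]])
labels = np.array([0, 1, 2, 1])
idx = select_small_loss(logits, labels, keep_fraction=0.5)
```

Here `keep_fraction` plays the role of 1 − C(k) at learning step k.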
Because the label information is not considered by the proposed selection strategy, the selected small-loss training samples may contain inconsistent label information (e.g., flipped labels). Moreover, the proportion of noisy samples tends to gradually increase as the training proceeds [1]. The network trained with small-loss samples then eventually suffers from over-parameterization problems. Thus, we require an additional strategy to address this phenomenon.
In Section IV, we present a self-augmentation method to reduce the negative effects of noisy samples in small-loss data. In addition, we propose a data-adaptive selection strategy C(k) in Section IV-C, which determines high-quality small-loss samples at each learning step. For readability, we denote a small-loss sample as x and the corresponding network output (pre-softmax logit) as X = f_θ(x). Furthermore, x_{c,k,i} represents the i-th small-loss sample that is predicted to the c-th class at learning step k, and X_{c,k,i} denotes the corresponding logit. To avoid confusion and improve readability, we omit the indices when they are clear from the context.

FIGURE 2. Pipeline of the proposed method. For each mini-batch of training data, we select samples with small cross-entropy loss values, i.e., small-loss samples (Section IV-C). Then, we update the moving weighted moment queue (Section IV-B) based on the network predictions of the selected samples. In particular, we enqueue current data with the same predicted label and dequeue old data for each predicted class. Subsequently, we induce the GMM (Section IV-A) using weighted inference statistics of the data in the queue at the previous learning step and soft-prediction results at the current step.

IV. SELF-AUGMENTATION WITH NETWORK PREDICTION
We propose a self-augmentation method using a noise-robust probabilistic model as a GMM with network inference. We assume that small-loss samples with the same predicted class follow class-conditional multivariate Gaussian distributions. Fig.2 shows the pipeline of our method.

A. NOISE-ROBUST PROBABILISTIC MODEL
For C-class classification, we define C Gaussian distributions and induce a probabilistic model as a GMM with a class-wise mixture probability prior φ_c as follows:

P(X) = Σ_{c=1}^{C} φ_c N(X | µ_c, Σ_c),   (2)

where N(X | µ_c, Σ_c) denotes a normal distribution with mean µ_c and covariance Σ_c. To adopt the GMM as an augmented-sample generative model, we compute data-adaptive mixture probabilities and Gaussian statistics (i.e., µ_c and Σ_c) with respect to the network outputs X, where each term of the GMM is computed as follows:

φ_c = |X_c| / N_s,   (3)

µ_c = (1/|X_c|) Σ_{X ∈ X_c} X,   Σ_c = (1/|X_c|) Σ_{X ∈ X_c} (X − µ_c)(X − µ_c)^T,   (4)

where N_s denotes the number of selected small-loss samples, 0 ≤ φ_c ≤ 1, ∀c ∈ [C], X_c = X · 1_{ŷ=c} denotes the logits predicted to the c-th class, and 1_(·) denotes an indicator function for the predicted label of X. To prevent over-parameterization, we regularize the network by minimizing the discrepancy between small-loss samples and augmented samples Z from the GMM (i.e., Z ∼ P(X) in (2)). However, the probabilistic model in (2) cannot guarantee the avoidance of over-parameterization. Thus, to prevent over-parameterization through guaranteed augmentation, we introduce an additional regularization term.
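The class-wise GMM estimation described above can be sketched as follows; the helper name and the exact normalization are our assumptions, since the paper gives the estimators only in equation form:

```python
import numpy as np

def gmm_from_predictions(X, y_hat, num_classes):
    """Estimate mixture priors and class-conditional Gaussian moments
    from pre-softmax logits X, grouped by the network's predicted
    class y_hat (an illustrative reading of the GMM terms)."""
    phi = np.zeros(num_classes)
    mu, sigma = [], []
    for c in range(num_classes):
        Xc = X[y_hat == c]                 # logits predicted to class c
        phi[c] = len(Xc) / len(X)          # hard mixture prior
        mu.append(Xc.mean(axis=0))         # class mean
        sigma.append(np.cov(Xc, rowvar=False))  # class covariance
    return phi, np.stack(mu), np.stack(sigma)

# Toy usage: 60 random 3-D logits evenly "predicted" to 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y_hat = np.repeat(np.arange(3), 20)
phi, mu, sigma = gmm_from_predictions(X, y_hat, num_classes=3)
```

Note that a class with no predicted samples would need special handling (e.g., skipping it), which this sketch omits.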
In the course of minimizing the conventional objective function in (1), the confidence of the network prediction can be skewed toward specific uncertain classes [19]. If a network is eventually over-parameterized, it tends to assign most labels to a single class [20]. Thus, the network predictions for the selected cross-entropy-based small-loss samples are also skewed toward particular classes. In addition, the mixture probability prior φ_c in (3), which is based on the network predictions for small-loss samples, peaks at a few classes. Subsequently, the proposed probabilistic model built with φ_c cannot deal with all classes fairly. In this case, we cannot ensure reliable augmented samples.

Algorithm 1 Moving Weighted Moment Queue
Require: x_k: data with small-loss values at the current step; M: the number of collected samples for each class; Q_{k−1} = {Q_{1,k−1}, ..., Q_{C,k−1}}: queue at the previous step.
for c = 1 to C (i.e., C-class classification) do
  Calculate β_{c,k} = |{x_k | ŷ_k = c}|.
  Enqueue the β_{c,k} small-loss samples predicted to the c-th class into Q_{c,k}.
  Dequeue the oldest samples of Q_{c,k} so that at most M samples remain.
end for
To address this issue, we replace φ_c with the soft mixture probability φ̂_c to attenuate the skewness of the probabilistic model [21]. The soft mixture probability φ̂_c is defined as

φ̂_c = (1/N_s) Σ_{i=1}^{N_s} p_c(f_θ(x_i)).   (5)

Using φ̂_c, we can consider the network confidence for all classes, where φ̂_c in (5) is flatter than φ_c in (3), which reflects only a few skewed classes. In addition, we present an additional regularization loss term to further attenuate the skewness of the mixture probability as the training proceeds. The regularization loss L_soft for the soft mixture probability φ̂_c is defined as follows:

L_soft = D_KL(φ̂ ‖ γ) = Σ_{c=1}^{C} φ̂_c log(φ̂_c / γ_c),   (6)

where D_KL denotes the Kullback-Leibler (KL) divergence, and γ_c is the prior probability of the c-th class among all training samples.
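A minimal sketch of the soft mixture probability and its KL regularizer, under the assumption that the soft prior is the average softmax confidence per class (function names are ours):

```python
import numpy as np

def soft_mixture_prior(logits):
    """Soft mixture probability: average softmax confidence per class,
    a flatter alternative to hard prediction counts."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.mean(axis=0)

def l_soft(phi_hat, gamma):
    """KL(phi_hat || gamma): pulls the soft prior toward the class
    prior gamma to attenuate skewness of the mixture probability."""
    return float(np.sum(phi_hat * np.log(phi_hat / gamma + 1e-12)))

# Toy usage: two symmetric logit rows give a perfectly flat soft prior,
# so the regularizer against a uniform class prior is (near) zero.
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
phi_hat = soft_mixture_prior(logits)
gamma = np.array([0.5, 0.5])
```

The small constant inside the log guards against zero probabilities.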

B. MOVING WEIGHTED MOMENT QUEUE
Because the selected small-loss training samples inevitably contain inconsistent label information, the network is gradually fitted to noise-labeled samples. Thus, the decision boundaries of the network on the test dataset fluctuate as the training proceeds, which leads the network to produce inconsistent predictions. The accumulated error from these inconsistent network predictions eventually induces over-parameterization problems. Because the Gaussian parameters are computed using inconsistent network predictions, the GMM with the naive Gaussian statistics in (4) cannot provide guaranteed augmentation to prevent the network from over-parameterization.
In this section, we build machinery that reduces the aforementioned fluctuations in the decision boundaries. Instead of computing Gaussian parameters based only on the network prediction at the current step, we consecutively collect the first- and second-order statistical moments (i.e., µ, Σ) of small-loss samples with our moving weighted moment queue Q_k at each learning step k. In particular, we dynamically collect M small-loss samples based on the network predictions in the queue. For this, we update the c-th queue with the β_{c,k} small-loss samples whose network prediction is the c-th class at learning step k. By weighting the statistical moments of the small-loss samples by β_{c,k} for each learning step k, we consider the confidence of the network predictions over previous learning steps altogether. When updating (i.e., enqueueing) the queue with small-loss samples at the current learning step, we discard (i.e., dequeue) M − β_{c,k} small-loss samples from the early learning steps. Thus, small-loss samples selected in the early steps are forced to have less influence on the network parameter updates. To reflect the aforementioned considerations, we define the moving weighted moment queue Q_k in Algorithm 1 and the corresponding Gaussian moments as

µ_{c,k} = (1 / Σ_{j=0}^{l_c} β_{c,k−j}) Σ_{j=0}^{l_c} Σ_{i=1}^{β_{c,k−j}} X_{c,k−j,i},
Σ_{c,k} = (1 / Σ_{j=0}^{l_c} β_{c,k−j}) Σ_{j=0}^{l_c} Σ_{i=1}^{β_{c,k−j}} (X_{c,k−j,i} − µ_{c,k})(X_{c,k−j,i} − µ_{c,k})^T,   (7)

where X_{c,k,i} is the i-th small-loss sample (logit) predicted to the c-th class at learning step k, and β_{c,k} is the number of small-loss samples predicted to the c-th class at learning step k (i.e., β_{c,k} = |{x_k | ŷ_k = c}|). In (7), l_c is a moving window that is adaptively determined by l_c = argmin_{j_c} max[(Σ_{j=0}^{j_c} β_{c,k−j}) − M, 0].
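The per-class queue can be sketched with a fixed-capacity deque, so that the oldest samples are dequeued automatically once M is reached; this is an illustrative reading of Algorithm 1 with our own naming, not the authors' implementation:

```python
from collections import deque
import numpy as np

class MomentQueue:
    """Per-class FIFO of at most M recent small-loss logits.
    Gaussian moments are computed over the queue contents, so each
    step k contributes its beta_{c,k} samples to the moving moments."""
    def __init__(self, num_classes, M):
        self.queues = [deque(maxlen=M) for _ in range(num_classes)]

    def update(self, logits, y_hat):
        # Enqueue each logit into the queue of its predicted class;
        # the deque drops the oldest entry when capacity M is exceeded.
        for x, c in zip(logits, y_hat):
            self.queues[c].append(x)

    def moments(self, c):
        # First- and second-order moments over the class-c queue.
        Xc = np.stack(self.queues[c])
        return Xc.mean(axis=0), np.cov(Xc, rowvar=False)

# Toy usage: 12 two-dimensional logits alternating between 2 classes,
# with capacity M = 5, so each class queue keeps only its last 5 rows.
mq = MomentQueue(num_classes=2, M=5)
logits = np.arange(24, dtype=float).reshape(12, 2)
y_hat = np.array([0, 1] * 6)
mq.update(logits, y_hat)
mu0, cov0 = mq.moments(0)
```

Using `deque(maxlen=M)` makes the enqueue/dequeue bookkeeping implicit.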
Using the new Gaussian statistics µ_{c,k} and Σ_{c,k} in (7) and the proposed soft mixture probability φ̂_c in (5), we improve the conventional probabilistic model in (2) into a noise-robust probabilistic model as follows:

P̂(X) = Σ_{c=1}^{C} φ̂_c N(X | µ_{c,k}, Σ_{c,k}).   (8)

Subsequently, we regularize the network parameters by minimizing the KL-divergence between the small-loss samples and augmented samples from the modified GMM in (8), without label information, as follows:

L_reg = D_KL(p(Z) ‖ p(X)),   (9)

where Z ∼ P̂(X) in (8). The augmented samples Z are directly sampled from the proposed GMM. In (9), our augmentation helps the network mitigate the accumulated error from inconsistent predictions by providing consistent guidance to the network at each learning step. Therefore, it enables the network to avoid overfitting to noisy samples and prevents over-parameterization. Algorithm 1 describes our moving weighted moment queue.
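Sampling augmented logits Z from the class-conditional GMM can be sketched as follows (illustrative names; in the paper these samples enter the KL regularization term rather than being used directly as training data):

```python
import numpy as np

def sample_from_gmm(phi, mu, sigma, n, rng=None):
    """Draw n augmented logits Z ~ sum_c phi_c N(mu_c, Sigma_c)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pick a mixture component for each sample according to phi ...
    comps = rng.choice(len(phi), size=n, p=phi)
    # ... then draw from that component's Gaussian.
    return np.stack([rng.multivariate_normal(mu[c], sigma[c]) for c in comps])

# Toy usage: two well-separated 2-D components with tiny covariance.
phi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [10.0, 10.0]])
sigma = np.stack([np.eye(2) * 0.01, np.eye(2) * 0.01])
Z = sample_from_gmm(phi, mu, sigma, n=50)
```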

C. DATA ADAPTIVE SELECTION STRATEGY
In contrast to conventional methods that manually tune selection thresholds, we propose a data-adaptive small-loss selection strategy based on the eigenvalues of the second moments Σ_{c,k} in the queue. Note that we assume each covariance matrix Σ_{c,k} is diagonal for easy computation. Eigenvalues can be considered an indicator of network certainty in each class. In particular, if a network is over-parameterized, it confidently predicts classes, even for noisy samples. In this case, the eigenvalues of Σ_{c,k} have small values owing to this certainty. For example, if the network is perfectly fitted to the training samples, its prediction can be represented as a one-hot vector (over-certainty); then, the eigenvalues λ_{c,k} of Σ_{c,k} have small values for all classes. In contrast, if the network is under-parameterized, the certainty of the network prediction for a training sample is relatively lower than in the over-parameterized case, which means that λ_{c,k} has large values for all classes. Thus, the following inequality holds:

E_c[λ_{c,k}] (over-parameterized) < E_c[λ_{c,k}] (under-parameterized),   (10)

where E_c[λ_{c,k}] denotes the mean of the eigenvalues of the second moments Σ_{c,k} over all classes at learning step k. Fig.3 shows under- and over-parameterization in terms of eigenvalues.
To enhance the generalization performance of the networks, we balance under-parameterization and over-parameterization by selecting an appropriate threshold for small-loss samples at each learning step. To determine an appropriate threshold dynamically, we design a data-adaptive selection strategy in proportion to the difference between the maximum and minimum values of the mean eigenvalues E_c[λ_{c,k}] across the previous learning steps:

C(k) = α · (max_{k'≤k} E_c[λ_{c,k'}] − min_{k'≤k} E_c[λ_{c,k'}]),   (11)

where α is a hyperparameter that determines the selection scale. Our selection strategy C(k) is designed to satisfy the following conditions: (1) C(k) ∈ [0, 1]; (2) C(k) is a non-decreasing function of k. Because of the memorization effect of DNNs, they learn clean samples at the beginning of training and then gradually memorize noisy samples, which induces over-parameterization. Thus, C(k) needs to increase as the training proceeds, reducing the number of selected small-loss samples so that noisy samples are ignored. Specifically, before the network memorizes the noisy samples, the network is safely updated with clean samples. The certainty of the network prediction increases, and E_c[λ_{c,k}] reaches its highest value at a certain step, immediately before memorization. Then, when the network starts to memorize noisy samples, it gradually becomes over-parameterized. Hence, E_c[λ_{c,k}] decreases as the training proceeds. Consequently, the gap between the maximum and minimum values of E_c[λ_{c,k}] across the previous learning steps increases, and the number of selected small-loss samples decreases. Thus, the network parameters are updated safely with qualified small-loss samples.
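A sketch of the data-adaptive threshold, assuming C(k) is the α-scaled gap between the largest and smallest mean eigenvalues observed so far, clipped at 0.85 as described in the experimental settings (the function name is ours):

```python
def selection_threshold(mean_eigvals_history, alpha, cap=0.85):
    """C(k) proportional to the gap between the max and min of the
    mean covariance eigenvalues recorded up to step k. The gap can
    only grow as history accumulates, so C(k) is non-decreasing."""
    gap = max(mean_eigvals_history) - min(mean_eigvals_history)
    return min(alpha * gap, cap)
```

For example, with a single observed value the gap is zero and all samples are kept; as the eigenvalue mean peaks and then decays, the threshold grows toward the cap.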

D. OBJECTIVE FUNCTION
Our main objective function is expressed as follows:

L = L_ce + L_reg + ω · L_soft,   (12)

where L_ce in (1) denotes the cross-entropy loss function, L_reg in (9) and L_soft in (6) denote the regularization loss functions for the augmented samples and the soft mixture probability, respectively, and ω is a hyperparameter that weighs the regularization effect of the soft mixture probability. By minimizing (12), we prevent over-parameterization and make the network robust to noisy labels. Algorithm 2 describes the overall algorithm of the proposed method.

Algorithm 2 Self-Augmentation in a Single Network
Require: f: neural network; θ: network parameters; D̃: training dataset; Q: moment queue; M: number of samples in the queue; C: number of classes; γ: class probability prior; α, ω: hyperparameters; η: learning rate.
for t = 1 to T (i.e., total number of training epochs) do
  Shuffle training set D̃.
  for k = 1 to K (i.e., total number of iterations) do
    Fetch mini-batch D̃_k from D̃.
    Obtain small-loss samples {x_k, ỹ_k} // sample a (1 − C(k)) fraction of data with low cross-entropy losses in D̃_k.
    Compute the soft mixture probability φ̂_c in (5).
    Update the moving weighted moment queue Q_k using Algorithm 1.
    Update θ ← θ − η∇L(x_k, ỹ_k; θ, Q_k, γ) using our objective function in (12).
    Obtain the eigenvalues λ_{c,k} from Σ_{c,k} in (7).
    Update C(k) in (11).
  end for
end for

V. EXPERIMENTS
A. EXPERIMENTAL SETTINGS
1) NETWORK ARCHITECTURE
Table 2 shows the detailed network architectures of the baseline methods with the 9-layer CNN models [1], in which conv, LReLU, BN, and Dense denote a convolution filter, a leaky rectified linear unit, batch normalization, and a fully connected layer, respectively.

2) HYPERPARAMETERS AND BASELINES
For fair comparisons on the CIFAR-10/100 datasets, we followed the experimental settings suggested in [1]. We used a single 9-layer CNN with a batch size of 128 as the baseline architecture. The Adam optimizer with a momentum of 0.9 was used with an initial learning rate of 0.001, which was linearly decayed to zero as training proceeded. Because the CIFAR datasets contain the same number of images for all classes (5K images per class for CIFAR-10), we set the class probability γ_c in (6) to 1/10 for all classes. For appropriate selection, we control C(k) via α in (11) under different noise settings. In particular, we set α = {0.3, 0.5, 0.6} for CIFAR-10 and α = {0.1, 0.18, 0.25} for CIFAR-100, respectively.
For the Clothing1M dataset, a pre-trained ResNet50 with a batch size of 64 was used as the baseline. We also adopted the Adam optimizer with a momentum of 0.9. We ran 200 epochs in total and set the learning rate to 5 × 10^−5 and 8 × 10^−6 for the first and second 40 epochs, respectively, and 8 × 10^−6 from 80 epochs onward. Because data imbalance existed, we randomly selected balanced subsets using the noisy labels to mitigate the imbalance [22] and used them for training. These subsets contained 270K images over all classes. For each epoch, we selected 32K images for efficient training. For pre-processing, we applied random center cropping, random flipping, and normalization to 224 × 224 pixels. We set γ_c in (6) to 1/14 for all classes. Table 3 shows the detailed hyperparameter settings.

3) SMALL LOSS SAMPLE SELECTION SETTINGS
The network was trained with the entire noisy data (i.e., C(k) = 0) for the first 5 epochs as a warm-up stage. In this warm-up stage, the network was adapted to the training data and its certainty for the training samples was stabilized. After the warm-up stage, we gradually increased the forget rate C(k) as training proceeded based on our selection strategy, to reduce the number of selected small-loss samples and avoid the memorization effect, as shown in Fig.4. For appropriate sample selection, we adopted the hyperparameter α in (11) to scale C(k) to [0, 1] and clipped C(k) to 0.85 if C(k) > 0.85.

4) TEST DATASETS AND EVALUATION METRICS
We verified the effectiveness of our method on three benchmark datasets, namely CIFAR-10, CIFAR-100, and Clothing1M. Clothing1M [23] contains real-world noisy data with 14 classes [8], [14], [24], [25]. Because all datasets except Clothing1M contain only clean data, we manually injected label noise into them using the noise transition matrix Q, where Q_{i,j} = P(ỹ = j | y = i), given that a clean label y is flipped into a noisy label ỹ. We assumed that Q has one of two typical structures [1]: (1) symmetric flipping [26], which flips a label uniformly to all other classes, and (2) asymmetric flipping [14], which flips a label to a specific different class. The noise rate of label flipping was chosen from {20%, 50%, 45%}. For asymmetric flipping with a noise rate of 45%, the training data are extremely noisy.
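The symmetric-flipping noise injection can be sketched as follows (function names are ours; the asymmetric case would instead place the off-diagonal mass on one specific class per row):

```python
import numpy as np

def symmetric_noise_matrix(num_classes, eps):
    """Symmetric flipping: a clean label stays with probability
    1 - eps and flips uniformly to any of the other classes."""
    Q = np.full((num_classes, num_classes), eps / (num_classes - 1))
    np.fill_diagonal(Q, 1.0 - eps)
    return Q

def inject_noise(labels, Q, rng=None):
    """Flip each clean label y to class j with probability Q[y, j]."""
    if rng is None:
        rng = np.random.default_rng(0)
    return np.array([rng.choice(len(Q), p=Q[y]) for y in labels])

# Toy usage: a 50% symmetric matrix for 10 classes, and a sanity check
# that a 0% noise matrix leaves the labels untouched.
Q = symmetric_noise_matrix(10, 0.5)
clean = np.zeros(100, dtype=int)
noisy = inject_noise(clean, symmetric_noise_matrix(10, 0.0))
```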
We evaluated the compared methods in terms of test accuracy (i.e., # of correct predictions / # of test data) and label precision (i.e., # of clean samples / # of selected samples). A high label precision means that less noisy data are used for training; thus, the networks are less affected by noisy data and are robust to noisy labels. Table 5 describes the datasets used in the experiments in detail. We manually injected label noise into the CIFAR-10/100 datasets using the noise transition matrix, as shown in Fig.5.
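A direct reading of the two metrics (function names are ours):

```python
def accuracy(preds, labels):
    """Test accuracy: # of correct predictions / # of test data."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def label_precision(selected_idx, clean_mask):
    """Label precision: # of clean samples among the selected
    small-loss samples / # of selected samples."""
    return sum(clean_mask[i] for i in selected_idx) / len(selected_idx)
```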

5) COMPARED METHODS
We compared our method against recent state-of-the-art single-network- and two-network-based methods. For single-network-based methods, we compared our method against (i) a standard deep network that is directly trained on noisy datasets with a cross-entropy loss; (ii) GCE [27], which combines the advantages of both MAE [28] and cross-entropy losses; (iii) SL [19], which combines a reversed cross-entropy loss with a standard cross-entropy loss; and (iv) NPCL [29], which extracts true-labeled samples from a noisy dataset with a manually specified threshold. For two-network-based methods, we compared our method against (v) MentorNet [5], which uses an extra pre-trained network to select small-loss samples; (vi) Co-teaching [1], which develops two interactive networks to be robust to noisy labels; (vii) Co-teaching+ [9], which improves the two networks using disagreement- and cross-update steps; (viii) JoCoR [8], which employs a joint loss to select small-loss samples; and (ix) DivideMix [6], which maintains two networks that select unlabeled sets for each other.

B. EXPERIMENTAL RESULTS
1) COMPARISONS USING CIFAR-10
Table 4 shows the test accuracy of the compared methods on the CIFAR-10 dataset. As shown in the table, our method exhibited state-of-the-art results in all cases. In particular, our method was significantly superior to the other methods in the Asymmetric 45% case. As shown in Fig.6, our method solved the over-parameterization problem, and its test accuracy did not decrease even after 500 epochs. The label precision of our method was close to 100% in all three cases, which means that our selection strategy accurately selected clean samples and that the method was robust to severely noisy labels.
2) COMPARISONS USING CIFAR-100
Table 6 shows the test accuracy on the CIFAR-100 dataset. In the Symmetric 50% case, our method and DivideMix considerably outperformed the other baselines. In the Asymmetric 45% case, our method significantly outperformed the other methods. It should be noted that our method uses a single network and enables end-to-end training with noisy data.

3) COMPARISONS USING CLOTHING1M
Table 7 shows the test accuracy of the compared methods on real-world noisy labels using the Clothing1M dataset. Our method outperformed the single-network-based methods in Table 7(a), whereas it is comparable to the two-network-based methods in Table 7(b).

4) COMPARISONS ON COMPUTATIONAL COMPLEXITY
Because Co-teaching, Co-teaching+, JoCoR, and DivideMix require additional networks, their numbers of network learning parameters are approximately twice that of a standard network, as shown in Table 8(b). Our method effectively prevents over-parameterization with only half the parameters of the two-network-based methods. Table 9 reports the network training times of Co-teaching, DivideMix, and the proposed method on the CIFAR-10 dataset.

C. ABLATION STUDY
To verify the effectiveness of our method and provide an in-depth analysis, we evaluated the method in a component-wise manner (i.e., without the regularization loss for the soft mixture probability, L_soft, in (12); without the regularization loss for the augmented samples, L_reg, in (12); and without the moving weighted moment queue in Algorithm 1). For this experiment, we used the CIFAR-10 dataset with 50% symmetric noise. To analyze the contribution of L_soft, we trained our network with ω = 0 in (12). As shown by the green curves in Fig.7(a), without L_soft, the test accuracy and label precision decreased as the epochs increased, which indicates that the regularization of the soft mixture probability is crucial for selecting small-loss samples. As shown by the blue curves in Fig.7(a), owing to the network memorization effect, the test accuracy gradually increased even without L_reg. However, the accuracy was considerably lower than that of our method with L_reg. Fig.7(b) presents the contribution of the class-wise queues in Algorithm 1. As shown in the figure, by computing the Gaussian statistics using the queues (red), the proposed method produced accurate results, whereas our method without the queues induced over-parameterization (blue).

VI. DISCUSSION
Small-loss selection is a widely adopted strategy for handling noisy labels. Specifically, it assumes that a sample with a small loss value is likely to have a clean label. Thus, it trains the network only with these small-loss samples while discarding samples with large loss values. In this case, the critical point is how to select the small-loss samples. If we select too few samples, the network cannot learn a sufficient data representation, because we discard many essential samples. In contrast, if we select too many samples, the network is affected by noisy-label samples, and its generalization performance is degraded. The pre-designed selection strategies of previous methods cannot flexibly adapt to the network's learning dynamics.
To remedy the above issues, we propose two crucial concepts: ''self-augmentation from a robust GMM'' and a ''data-adaptive selection strategy.'' First, we assume that data samples in the pre-softmax layer follow class-conditional Gaussian distributions, which is a common assumption. Then, we obtain class-wise Gaussian parameters via the ''moving weighted moment queue'' from samples that we expect to be clean with high probability. After we collect enough samples in this queue, we construct the GMM from these class-conditional distributions. Then, we draw augmented samples Z from the GMM. However, as training proceeds, the Gaussian parameters become contaminated by noisy labels if we do not use a carefully designed selection strategy. Therefore, we propose the ''data-adaptive selection strategy'' derived from the second moments of the Gaussians (covariance matrices).

VII. CONCLUSION
In this study, we induced accurate small-loss samples based on the proposed class-conditional GMM without additional classifiers or parameters. Because small-loss samples inevitably contain inconsistent (noisy) label information, we regularized the Gaussian distributions with moving weighted Gaussian moments. In addition, we proposed a novel data-adaptive strategy to enhance the quality of small-loss samples. Our self-augmentation using the noise-robust GMM effectively prevented over-parameterization of the network. The experimental results indicate that our method outperforms state-of-the-art methods on several datasets. Because our method relies on mixture distributions, the samples generated from the GMM cannot be directly used for back-propagation, which restricts our method to regularization. To make it more powerful, the reparameterization trick for mixture sampling might help back-propagate directly through the Gaussian parameters. Moreover, for future work, a theoretical analysis of the proposed method based on sub-Gaussianity may provide an in-depth analysis of the generalization bound in the noisy-label setting.