Limited Gradient Descent: Learning With Noisy Labels

Label noise may affect the generalization of classifiers, and the effective learning of main patterns from samples with noisy labels is an important challenge. Recent studies have shown that deep neural networks tend to prioritize the learning of simple patterns over the memorization of noise patterns. This suggests a possible method to search for the best generalization: learn the main pattern until the noise begins to be memorized. Traditional approaches often employ a clean validation set to find the best stop timing of learning, i.e., early stopping. However, the generalization performance of such methods relies on the quality of validation sets. Further, in practice, a clean validation set is sometimes difficult to obtain. To solve this problem, we propose a method that can estimate the optimal stop timing without a clean validation set, called limited gradient descent. We modified the labels of a few samples in a noisy dataset to obtain false labels and to create a reverse pattern. By monitoring the learning progress of the noisy and reverse samples, we can determine the stop timing of learning. In this paper, we also theoretically provide some necessary conditions on learning with noisy labels. Experimental results on CIFAR-10 and CIFAR-100 datasets demonstrate that our approach has a comparable generalization performance to methods relying on a clean validation set. Moreover, on the noisy Clothing-1M dataset, our approach surpasses methods that rely on a clean validation set.


Introduction
Noisy labels tend to degrade the generalization performance of machine learning. Errors of manual annotation are often inevitable. Therefore, research on learning with noisy labels has great importance. In this field, some works have employed clean samples to aid in learning [17,31,32] or for verification [34,29]. However, in practical applications, clean verification sets are sometimes not readily available. In this paper, we focus on learning with noisy labels without the use of clean samples.
Deep neural networks (DNNs) have been applied to achieve breakthroughs in many fields. Many DNN-based methods have been proposed for learning with noisy labels. However, owing to their powerful fitting ability, DNNs may even memorize noise [33], which might hamper the learning of the main pattern (pattern of interest). A previous paper [2] reported that DNNs prioritize the learning of simple patterns over the memorization of noise. According to these characteristics of DNNs, Tanaka et al. [29] proposed a supervised method that uses a clean verification set to search for the best stop timing of training.
Unlike [29], we consider an unsupervised situation in which a clean verification set is not available. To estimate the best stop timing of training, we propose a method called limited gradient descent (LGD) under the assumption that a classifier learns the main pattern until the noise pattern begins to be memorized. Ideally, one would monitor the learning progress of the main pattern and of the noise pattern separately. Unfortunately, samples of the two patterns cannot be distinguished. To address this problem, we randomly select a few samples from the noisy trainset to form a reverse pattern, which is mutually exclusive with the main pattern. Specifically, we shift the labels of the selected samples to obtain reverse labels (as in label + 1). It can be proved that almost all of the reverse labels are false. Note that the samples of the main pattern are still unknown. We can obtain the training accuracies for the two parts of the samples: the reverse samples and the leftover noisy samples. At the early stage of training, the accuracy for the leftover samples increases because the main pattern is learned first, while the accuracy of the reverse samples does not increase (or may even decrease). We can therefore monitor the ratio of the two accuracies to select the point of best generalization. In addition, we can apply a relabeling strategy [18] to LGD to further improve performance. Empirically, for uncomplicated datasets such as MNIST, we use LGD with relabeling to improve performance, while for challenging datasets such as CIFAR-10, we use LGD only once because the relabeling strategy may not significantly improve the performance.
The main contributions of the present study are as follows. First, we propose an unsupervised method called limited gradient descent (LGD) that can learn the main pattern as much as possible from noisy labels. Second, we prepare a few samples with false labels for training, which no study has attempted thus far to our knowledge. Third, we theoretically prove some sufficient conditions on LGD learning with noisy labels. Lastly, our method is model-agnostic; thus, it can be applied to most DNNs and loss functions based on stochastic gradient descent (SGD) optimization.

Related Works
Learning with noisy labels has been a long-standing problem in machine learning, which can be traced back to the 1980s [1]. A detailed survey [6] summarized the early studies on this problem. In recent years, approaches in this field have often resorted to DNNs. There are four streams of research within this field, as summarized below.
First, Sukhbaatar et al. [28] embedded a known noise transition matrix into the loss function. This is a Bayesian method that views real labels as latent variables. Unfortunately, the exact confusion matrix is usually unknown. Later, several methods focused on estimating it [9,12,23,16]. However, accurate estimations can be difficult to obtain, especially when the number of classes is very large. Moreover, these methods are not suitable for symmetric label noise.
Second, some approaches aimed at sample selection to address noisy labels. Decoupling [21] and Co-sampling [10] introduced a sample-selection mechanism for carefully training predictors. They both maintain two predictors. The former selected disagreement samples to update the predictors, while the latter used small-loss samples to train the predictors. However, the selection mechanism itself is not very reliable, because the sample-selection bias may cause an accumulated error.
Third, in the context of noise-tolerant methods, several theoretically motivated noise-robust loss functions such as ramp loss and unhinged loss have been introduced [4,30]. Ghosh et al. [8,7] proved and empirically demonstrated that the mean absolute error (MAE) is robust against noisy labels. However, the convergence speed of MAE is slow. Zhang et al. [35] proposed a loss function $L_q$, which unifies categorical cross entropy (CCE) and MAE, to obtain a trade-off between training speed and robustness. Additionally, regularization is an effective method to resist noisy labels, e.g., dropout [27]. Zhang et al. [34] proposed the mixup method based on the idea that linear interpolations of feature vectors should lead to linear interpolations of the associated targets. Tanaka et al. [29] integrated the $L_p$ regularization [13] and the confidence penalty regularization [24] into the KL divergence loss function. However, since DNNs have the characteristic of memorizing noise [33], long-time training leads to performance degradation. Therefore, for noise tolerance, it is important to consider the stop timing in training. These methods often used clean validation samples to find the best epoch, which is similar to early stopping. It is noteworthy that some robust methods based on max-margin do not need clean samples, such as the method reported in [5]. However, such methods cannot deal with asymmetric label noise (also called class-dependent noise or pair-flipping noise) owing to the limitation of max-margin.
Fourth, an alternative approach attempts relabeling [18], in which predictors and noisy labels are updated in turns. Bootstrapping [26] is a self-learning method with an assumption of consistency. However, the assumption of consistency is not always valid. Furthermore, little attention is given to the selection of the best time to update labels. By using clean samples, the method reported in [29] could be applied to select a suitable stop timing of training and update the labels of the trainset for further training. However, as mentioned above, clean samples are not always available. Without clean samples, this method might not choose a suitable stop timing.
To solve this problem, in this paper, we propose an unsupervised learning method called LGD, which creates a few reverse samples to help estimate the timing of best generalization. LGD is based on the characteristic [2] that DNNs tend to learn simple patterns before memorizing noisy patterns. LGD is model-agnostic and can be applied to most of the noise-robust methods mentioned above. It can also adopt a relabeling strategy to further improve the generalization ability of predictors according to specific tasks.
Problem Formulation

Polluted Dataset
Consider a problem of k-way multi-class classification. Let $X \subset \mathbb{R}^d$ be the feature space. Each sample $x_i \in X$ has an observed label $y_i$, and $y_i^*$ is the true label of $x_i$, i.e., $y_i^*$ is the oracle. The noisy label $y_i$ is corrupted with respect to the true label $y_i^*$ with probability η ∈ (0, 1). Specifically, a value is drawn from the continuous uniform distribution $U(0, 1)$: if it falls within (0, η], then $y_i = y_{i\neg}^*$, where $y_{i\neg}^*$ denotes any defined label except $y_i^*$; if it falls within (η, 1), then $y_i = y_i^*$, i.e., $y_i^*$ is not flipped. Statistically, η is the pollution ratio. The labels of test data are true, i.e., the test labels are the oracle. We further assume that clean validation samples are not available; therefore, we do not set up a validation set.

Pollution source
Assume $y_i^*$ is corrupted to $y_{i\neg}^*$. We consider two sources of pollution: symmetric label noise and asymmetric label noise. The symmetric label noise model obeys the uniform distribution over the $k-1$ labels other than $y_i^*$, i.e., each false label is chosen with probability $\frac{1}{k-1}$. The asymmetric label noise model is a specific map $y_i = f(y_i^*)$, $\forall y_i \neq y_i^*$. Here, $f(\cdot)$ can be seen as fixed-rule flipping. Taking the MNIST dataset as an example, we illustrate the two pollution sources in Figure 1.
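The two pollution sources can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: labels are taken to be integers in {0, ..., k-1}, the function names `pollute_symmetric` and `pollute_asymmetric` are hypothetical, and the fixed flipping rule is passed in by the caller as a `mapping` array because the concrete rule used in the experiments is not specified in this section.

```python
import numpy as np

def pollute_symmetric(true_labels, eta, k, rng):
    """Flip each label with probability eta to one of the other k-1 labels, uniformly."""
    true_labels = np.asarray(true_labels)
    noisy = true_labels.copy()
    # One uniform draw per sample decides whether its label is polluted.
    flip = rng.random(true_labels.shape[0]) <= eta
    # An offset in {1, ..., k-1} added modulo k never returns the true label.
    offsets = rng.integers(1, k, size=int(flip.sum()))
    noisy[flip] = (true_labels[flip] + offsets) % k
    return noisy

def pollute_asymmetric(true_labels, eta, mapping, rng):
    """Flip each label with probability eta according to a fixed rule y -> mapping[y]."""
    true_labels = np.asarray(true_labels)
    mapping = np.asarray(mapping)
    noisy = true_labels.copy()
    flip = rng.random(true_labels.shape[0]) <= eta
    noisy[flip] = mapping[true_labels[flip]]
    return noisy
```

For example, `pollute_symmetric(y, 0.4, 10, np.random.default_rng(0))` pollutes roughly 40% of the labels in `y`, and `mapping = (np.arange(10) + 3) % 10` is one arbitrary (assumed) fixed rule for the asymmetric case.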

Prerequisites of Learning with Noisy Labels

Regularity and Scale
A previous study [33] has shown that DNNs have the ability to memorize noise. If a DNN is adequately trained with noisy samples, it will learn not only the main pattern but also noise patterns. This affects the generalization of the main pattern. While the literature [2] has emphasized that DNNs tend to prioritize the learning of simple patterns over the memorization of noise patterns, we argue that this simple pattern is the regular pattern with the largest proportion in samples. Here, we should notice two important facts: regularity and scale. Regularity refers to samples that are subject to a certain rule, similar to a case where the label of the handwritten digit "1" is 1, that of the digit "2" is 2, and so on. However, if the label of the digit "1" is 2, that of the digit "2" is 3, and so on, the samples follow another rule. We call the latter label shifting. The two rules are mutually exclusive.
Scale refers to the number of samples of the regular pattern. In gradient descent optimization, the learning sequences of different regular patterns in samples vary according to their scales. Assume that two regular patterns are mutually exclusive. One pattern consists of large-scale regular samples (LSRS), while the other consists of small-scale regular samples (SSRS). At the beginning of training, the magnitude of the accumulated gradient of LSRS is larger than that of SSRS. Furthermore, the directions of the two are quite different. Consequently, the direction of the gradient sum will be biased towards LSRS such that the LSRS learning takes the higher priority. With the progression of training, the loss and gradient magnitude of the LSRS both gradually decrease. When the gradient magnitude of the LSRS drops to a certain extent, the learning of SSRS will proceed progressively. For a chaos pattern (e.g., symmetric noise pattern), in general, the scale can be ignored because the scattered gradient directions of chaos samples lead to a negligible magnitude at the beginning of learning. Therefore, when the chaos pattern exists together with regular patterns, the regular patterns will always be learned first.
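The scale argument can be illustrated with a toy calculation. The sketch below does not use real DNN gradients: it assumes, purely for illustration, that each sample contributes a unit-length vector, that samples of a regular pattern share a common direction up to small noise, and that chaos samples point in unrelated random directions. The group sizes and the noise level 0.05 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50                                    # dimensionality of the toy "gradient" space
n_large, n_small, n_chaos = 200, 100, 400   # hypothetical scales; chaos is the largest group

def unit_rows(v):
    """Normalize a vector, or each row of a matrix, to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Regular patterns: every sample points roughly along a shared direction.
dir_large = unit_rows(rng.normal(size=dim))
dir_small = unit_rows(rng.normal(size=dim))
g_large = unit_rows(dir_large + 0.05 * rng.normal(size=(n_large, dim)))
g_small = unit_rows(dir_small + 0.05 * rng.normal(size=(n_small, dim)))
# Chaos pattern: unit vectors with scattered, unrelated directions.
g_chaos = unit_rows(rng.normal(size=(n_chaos, dim)))

sum_large, sum_small, sum_chaos = g_large.sum(0), g_small.sum(0), g_chaos.sum(0)
total = sum_large + sum_small + sum_chaos

norm_large = float(np.linalg.norm(sum_large))
norm_small = float(np.linalg.norm(sum_small))
norm_chaos = float(np.linalg.norm(sum_chaos))
# Cosine between the overall update direction and each regular pattern's direction.
cos_large = float(total @ dir_large / np.linalg.norm(total))
cos_small = float(total @ dir_small / np.linalg.norm(total))

print(f"accumulated norms: LSRS={norm_large:.1f}, SSRS={norm_small:.1f}, chaos={norm_chaos:.1f}")
print(f"cosine with total: LSRS={cos_large:.2f}, SSRS={cos_small:.2f}")
```

In this toy setting the aligned groups accumulate a norm proportional to their size, whereas the scattered group grows only on the order of the square root of its size, so the summed direction leans towards the largest regular pattern.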
To demonstrate this, we conducted a simple training experiment using the MNIST dataset with the three different patterns mentioned above: the LSRS pattern, the SSRS pattern, and the chaos pattern. Assume that their scales are $N_L$, $N_S$, and $N_C$, respectively, where $N_C = 2N_L$ and $N_L = 2N_S$. Further, assume that each sample's pattern is known. From Figure 2 (left), we can see that the LSRS pattern is learned first at the beginning of training, and its training accuracy increases rapidly, while the accuracy of the SSRS pattern does not increase or even decreases. Figure 2 (right) shows a 2D plot of their gradients after dimension reduction via t-SNE [20] at this stage. The direction of the gradients' sum is very close to that of the LSRS and deviates from that of the SSRS. Therefore, the learning of SSRS stagnates or even deteriorates. As the training progresses, the accuracy of the SSRS begins to increase gradually, and its speed of increase is greater than that of the chaos pattern. Although the chaos pattern's scale is the largest, its training speed is the slowest because the magnitude of its accumulated gradient is too small. In agreement with our analysis, the regularity and scale of patterns play important roles in training based on gradient descent. For the sequence of learning, regular patterns are prioritized over chaos patterns. Moreover, LSRS patterns are prioritized over SSRS patterns.
Although the training of DNNs has the above characteristics, for the actual training set, one cannot distinguish between the main pattern's samples (clean samples) and polluted samples. Furthermore, no clean sample for validation exists. Therefore, one cannot determine when the main pattern is best generalized. In order to solve this problem, this paper proposes to prepare some reverse samples to form a reverse pattern that is mutually exclusive with the main pattern. We randomly select a β-proportion of samples from the trainset and perform label shifting on them to create the reverse pattern. We utilize the reverse pattern to help find the best generalization timing of the main pattern. The method is introduced in Section 5. Next, we theoretically provide some sufficient conditions of learning with noisy labels and illustrate how to choose the ratio β of the artificial reverse samples.
Proof of Lemma 1. Consider a sample set $P$ whose labels are all polluted by symmetric noise. Let $P_j$ be the subset with the label $y = j$, containing $r_j$ samples. Recall that the symmetric noise follows a uniform distribution. The true labels of $P_j$ are therefore spread evenly over the $k-1$ labels other than $j$, so $\frac{r_j}{k-1}$ of these samples have the true label $\mathrm{MOD}(j+1, k)$. After the labels of $P_j$ are shifted, $\frac{r_j}{k-1}$ samples will attain true labels. The label shifting of the samples with other labels leads to similar conclusions. After all the labels of $P$ are shifted, the number of samples that attain true labels is $\sum_j \frac{r_j}{k-1} = \frac{r}{k-1}$.
Theorem 1. Suppose the number of samples of set $S$ with noisy labels is $n$, the pollution source of labels is symmetric noise, the pollution ratio is η, and the number of categories is $k$. β · n samples are randomly selected from $S$ as the reverse pattern via the label-shifting operation, where β ∈ (0, 1) is the rate of selection. If the samples with true labels are the main pattern that can be learned first in a noisy environment, then the pollution rate η should satisfy $\eta < \frac{k-1}{k}$, and the selection rate β should satisfy $\beta < \frac{(k-1)(1-\eta)}{2(k-1)(1-\eta)-\eta}$.
Proof. According to the assumptions, we list the numbers of the samples of all patterns before and after label shifting in Table 1 and Table 2, respectively. Note that only the labels of the selected samples are shifted. After the chaos labels of the selected samples are shifted, some of the labels become clean labels (Lemma 1), while the other labels remain chaos labels. The clean labels of the selected samples form a regular shifted pattern after label shifting. The selected samples and leftover samples are separately cross-analyzed with the chaos pattern (symmetric noise pattern), clean pattern (main pattern), and regular shifted pattern.
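Table 1: Cross-analysis of the selected and the leftover samples with all patterns before label shifting.

| | Chaos Pattern | Clean Pattern |
| The selected | $\eta\beta n$ | $(1-\eta)\beta n$ |
| The leftover | $\eta(1-\beta)n$ | $(1-\eta)(1-\beta)n$ |

Table 2: Cross-analysis of the selected and leftover samples with all patterns after label shifting.

| | Chaos Pattern | Regular Shifted Pattern | Clean Pattern |
| The selected | $\frac{k-2}{k-1}\eta\beta n$ | $(1-\eta)\beta n$ | $\frac{1}{k-1}\eta\beta n$ |
| The leftover | $\eta(1-\beta)n$ | $0$ | $(1-\eta)(1-\beta)n$ |

We need to produce as many reverse samples as possible that are mutually exclusive with the main pattern via the label-shifting operation. Thus, after label shifting, the clean labels of the selected samples should be reduced:

$$\frac{\eta\beta n}{k-1} < (1-\eta)\beta n, \quad \text{i.e.,} \quad \eta < \frac{k-1}{k}.$$

To make the scale of the clean pattern larger than that of the regular shifted pattern,

$$(1-\eta)(1-\beta)n + \frac{\eta\beta n}{k-1} > (1-\eta)\beta n, \quad \text{i.e.,} \quad \beta < \frac{(k-1)(1-\eta)}{2(k-1)(1-\eta)-\eta}.$$

This work studies the learnability of samples with symmetric noisy labels from the perspective of creating a reverse pattern and attains the condition $\eta < \frac{k-1}{k}$, which is exactly the same result as in [7]. For a symmetric noise source, we conjecture that $\eta < \frac{k-1}{k}$ might be the most relaxed condition of learning with noisy labels.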
On the other hand, $\beta < \frac{(k-1)(1-\eta)}{2(k-1)(1-\eta)-\eta}$ is only a basic condition. If the selection rate β of reverse samples is close to this upper bound, it is actually very difficult to train successfully. Empirically, we need a tighter condition on β to ensure a sufficient learning performance.
Proposition 1. Following Theorem 1, further assume that the scale of the clean pattern is not less than δ times that of the reverse pattern. Then, the selection rate β should satisfy $\beta \le \frac{(k-1)(1-\eta)}{(1+\delta)(k-1)(1-\eta)-\eta}$.
Proof. Similar to Theorem 1, according to the assumptions, the following inequality should be met:

$$(1-\eta)(1-\beta)n + \frac{\eta\beta n}{k-1} \ge \delta(1-\eta)\beta n.$$

Hence,

$$\beta \le \frac{(k-1)(1-\eta)}{(1+\delta)(k-1)(1-\eta)-\eta}.$$

When δ = 1, this bound coincides with that of Theorem 1. When δ is sufficiently large, $\beta \le \frac{1}{1+\delta}$ approximately. According to practical experience, δ is usually set to δ ≥ 9. Then, $\beta \le \frac{1}{10}$.
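As a numerical illustration of how the selection rate might be bounded, the helper below evaluates one reading of the condition. It assumes that, per unit of n after label shifting, the clean pattern has scale (1−η)(1−β) + ηβ/(k−1) (leftover clean samples plus the selected noisy samples that become true by Lemma 1), that the shifted pattern has scale (1−η)β, and that the former must be at least δ times the latter. The function name `beta_upper_bound` is hypothetical.

```python
def beta_upper_bound(eta, k, delta=1.0):
    """Largest beta with clean scale >= delta * shifted scale, under the assumed counts.

    Solves (1-eta)(1-beta) + eta*beta/(k-1) >= delta*(1-eta)*beta for beta.
    """
    if not 0.0 < eta < (k - 1) / k:
        raise ValueError("eta must lie in (0, (k-1)/k) for symmetric noise")
    numerator = (k - 1) * (1.0 - eta)
    denominator = (1.0 + delta) * (k - 1) * (1.0 - eta) - eta
    return numerator / denominator


if __name__ == "__main__":
    # With k = 10 classes, eta = 0.6, and delta = 9, the bound is close to 1 / (1 + delta).
    print(beta_upper_bound(0.6, 10, delta=9))
    print(beta_upper_bound(0.6, 10, delta=1))
```

For k = 10, η = 0.6, and δ = 9 this evaluates to 3.6 / 35.4, about 0.102, which is consistent with the approximate rule β ≤ 1/(1+δ) = 0.1 quoted in the text.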

Sufficient Conditions of Learning with Asymmetric Label Noise
Theorem 2. Suppose that the number of samples of set $S$ with noisy labels is $n$, the pollution source of labels is asymmetric noise, the pollution ratio is η, and the number of categories is $k$. β · n samples are randomly selected from $S$ as the reverse pattern via the label-shifting operation, where β ∈ (0, 1) is the rate of selection. If the samples with true labels are the main pattern that can be learned first in a noisy environment, then the pollution rate η should satisfy $\eta < \frac{1}{2}$, and the selection rate β should also satisfy $\beta < \frac{1}{2}$.
Proof. According to the assumptions, we list the numbers of the samples of all patterns before and after label shifting in Table 3 and Table 4, respectively. Recall that the asymmetric polluted samples belong to a regular pattern. After label shifting, the original polluted samples among the selected samples will form a new regular pattern, called the shifted polluted pattern, and the original clean samples among them will become another new regular pattern, called the shifted clean pattern. In this case, there are four patterns in the sample set $S$: the polluted pattern (asymmetric noise pattern), the shifted polluted pattern, the clean pattern (the main pattern), and the shifted clean pattern. Owing to the asymmetric pollution, all the samples belong to regular patterns. Thus, the scale of all the patterns must be considered.
Table 3: Cross-analysis of the selected and leftover samples with all patterns before label shifting.

| | Polluted Pattern | Clean Pattern |
| The selected | $\eta\beta n$ | $(1-\eta)\beta n$ |
| The leftover | $\eta(1-\beta)n$ | $(1-\eta)(1-\beta)n$ |

Note that we further assume that the selected samples will attain the true labels with an extremely small probability after label shifting. Therefore, after the selected labels are shifted, the number of polluted samples that turn into clean samples is almost 0.
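The two bounds of Theorem 2 can be recovered from a short scale comparison. The following is a sketch rather than the paper's own derivation: the four post-shifting counts are inferred from the proof text (all selected samples leave the clean and polluted patterns, and none of them becomes clean), and the requirement is read as "the clean pattern must be larger than each of the other three regular patterns".

```latex
% Assumed scales after label shifting (inferred, not copied from Table 4):
\begin{align*}
\text{clean (leftover)}            &: (1-\eta)(1-\beta)n, &
\text{polluted (leftover)}         &: \eta(1-\beta)n, \\
\text{shifted clean (selected)}    &: (1-\eta)\beta n, &
\text{shifted polluted (selected)} &: \eta\beta n.
\end{align*}
% The clean pattern dominates each of the other patterns when
\begin{align*}
(1-\eta)(1-\beta)n > \eta(1-\beta)n   &\iff \eta < \tfrac{1}{2}, \\
(1-\eta)(1-\beta)n > (1-\eta)\beta n  &\iff \beta < \tfrac{1}{2}, \\
(1-\eta)(1-\beta)n > \eta\beta n      &\quad \text{(implied by the two conditions above).}
\end{align*}
```

Under the same reading, requiring the clean scale to be at least δ times the shifted clean scale gives (1−β) ≥ δβ, i.e., β ≤ 1/(1+δ), which matches the bound stated in Proposition 2.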
Theorem 2 establishes a sufficient condition that the asymmetric polluted samples should be less than half of the total samples, while $\beta < \frac{1}{2}$ is only a basic condition. If the selection rate β is close to the upper bound $\frac{1}{2}$, it is very difficult to train successfully. Empirically, we need a tighter upper bound on β to ensure a sufficient learning performance.
Proposition 2. Following Theorem 2, further assume that the scale of the clean pattern is not less than δ times those of the reverse patterns and that $\eta < \frac{1}{2}$ holds. Then, the selection rate β should satisfy $\beta \le \frac{1}{1+\delta}$.
Proof. Recall that the reverse samples contain two regular patterns: the shifted polluted pattern and the shifted clean pattern.
The condition on the selection rate β is similar to that in Proposition 1. When δ = 1, this bound coincides with that of Theorem 2. According to practical experience, δ is usually set to δ ≥ 9. Then, $\beta \le \frac{1}{10}$.

Limited Gradient Descent
To solve the classification problem with noisy labels, it is necessary to know the type of noise source and estimate the pollution ratio [25]. If the pollution ratio η satisfies the prerequisites of learning with noisy labels (refer to Sections 4.2 and 4.3), the proposed LGD method can be used for learning. We randomly select a β-proportion of samples from the trainset and perform label shifting on them to create the reverse pattern. We can utilize the reverse pattern to help estimate the best generalization timing of the main pattern. Algorithm 1 illustrates the LGD method. According to the characteristic that DNNs tend to prioritize the learning of the LSRS pattern (see Section 4.1), the main pattern will be learned first if the scale of the main pattern is dominant. The reverse pattern and the main pattern are mutually exclusive. We can estimate the generalization performance of the main pattern by observing the training accuracies of the leftover samples and reverse samples. The accuracy of the leftover samples approximates that of the main pattern. Meanwhile, the accuracy of the reverse samples approximates that of other regular patterns. In our opinion, training should be stopped when the main pattern is generalized as much as possible and the learning of other regular patterns is suppressed. We design a leftover-over-reverse (LoR) metric, i.e., the ratio of the leftover accuracy to the reverse accuracy, to estimate the learning performance of the main pattern. When the LoR reaches its maximum, the main pattern might be best generalized. See Algorithm 1 for details. Because LGD is model-agnostic, it can be applied to most DNNs and loss functions based on SGD optimization. Therefore, the model and loss function are not specified in Algorithm 1.
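The monitoring loop described above can be sketched as follows. This is a minimal, framework-independent illustration, not the authors' implementation: `train_step` and `predict` are hypothetical callables supplied by the user, labels are assumed to be integers in {0, ..., k-1}, and the small constant `eps` is an added assumption that guards against division by zero when no reverse sample is fitted yet.

```python
import copy
import numpy as np

def make_reverse_split(labels, beta, k, rng):
    """Shift the labels of a random beta-fraction of samples (label + 1 modulo k).

    Returns the new label array and a boolean mask marking the reverse samples.
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    reverse_idx = rng.choice(n, size=int(round(beta * n)), replace=False)
    is_reverse = np.zeros(n, dtype=bool)
    is_reverse[reverse_idx] = True
    new_labels = labels.copy()
    new_labels[is_reverse] = (labels[is_reverse] + 1) % k
    return new_labels, is_reverse

def lgd_train(model, train_step, predict, features, labels, is_reverse,
              n_iters, eps=1e-3):
    """Train for n_iters steps and keep the model snapshot with the largest LoR.

    train_step(model, features, labels) updates the model in place (e.g., one epoch);
    predict(model, features) returns an integer label array.
    """
    best_lor, best_model = 0.0, copy.deepcopy(model)
    for _ in range(n_iters):
        train_step(model, features, labels)
        correct = predict(model, features) == labels
        acc_left = correct[~is_reverse].mean()   # accuracy on leftover samples
        acc_rev = correct[is_reverse].mean()     # accuracy on reverse samples
        lor = acc_left / (acc_rev + eps)         # leftover-over-reverse ratio
        if lor > best_lor:
            best_lor, best_model = lor, copy.deepcopy(model)
    return best_model, best_lor
```

The returned snapshot plays the role of the recorded model that is later used for prediction; the choice of `eps` affects how strongly epochs with near-zero reverse accuracy are favored, so it should be treated as a tunable assumption.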
For some datasets, we can further improve the generalization of the main pattern through relabeling [18]. The framework of learning is described in Algorithm 2. After LGD training, one can also update the labels of the trainset S with the trained model net_rec, which could increase the scale of the main pattern. With iterative relabeling, the main pattern can be gradually spread in samples, and the generalization of the main pattern can be improved continuously. Further, during relabeling, the initial rising speed of LoR will keep increasing. Consequently, the gradient magnitude of the main pattern will keep increasing at the beginning of each LGD training (recall Section 4.1). We do not load the last trained model but randomly initialize the model before each LGD training. This is more conducive to the generalization of the main pattern. If one loads the trained model before each LGD training [29], the initial rising speed of LoR will be reduced. In addition, in order to prevent the introduction of extra regular pollution, we also randomly prepare new reverse samples before each LGD training.
It is worth mentioning that LGD is different from the methods reported in [21] and [10], which only learn reliable samples generally based on confidence or loss. In fact, they are based on a relatively tight learning condition. However, the relatively tight condition could sometimes be difficult to hold. In other words, selecting reliable samples is sometimes less reliable. Our method relies on [...] with ROBUST+LGD. Note that the comparison is in the same environment, i.e., the same network structure and hyper-parameters.
All networks used PreAct ResNet-18 [11,19] with dropout (0.3) [27]. We attempted advanced noise-tolerant loss functions, such as the MAE [7] and $L_q$ [35], as well as advanced noise-robust regularization, e.g., mixup [3]. For the $L_q$ loss function, the hyper-parameter q was set to 0.7. For the mixup regularization, the hyper-parameter α was fixed at 8. The corresponding experimental results are listed in Table 5.
From Table 5, we can see that the performances of LGD are slightly lower than those of the corresponding supervised robust methods. There are two reasons for this result. One is that the maximum value of LoR does not exactly correspond to the best test accuracy. The other is that the introduction of reverse samples slightly increases the number of noisy labels. Although LGD slightly sacrifices performance, it can deal with cases in which no clean validation set is available. Overall, the performances of our unsupervised approaches are close to those of the supervised approaches. Next, we show training details of LGD through two experiments. We take mixup vs. LGD + mixup as an example of comparison. In the first experiment, shown in Figure 3 (left), the pollution source is symmetric noise and the pollution rate η is 0.6. In the second experiment, shown in Figure 3 (right), the pollution source is asymmetric noise and the pollution rate η is 0.3. In Figure 3 (top), the three accuracy curves correspond to the leftover samples, reverse samples, and test samples, respectively. The LoR curves are shown in Figure 3 (bottom). In general, the peak of LoR is located to the left of the peak of the test accuracy curve but not far from it, so it can often be regarded as the point of best generalization of the main pattern. The test accuracy corresponding to the LoR peak is not much different from the best accuracy of the test curve. Therefore, for the main pattern, the generalization performance indicated by the LoR peak is very close to the actual best generalization performance.

Assessment on MNIST Dataset
These experiments demonstrate the performance of LGD relabeling on the MNIST dataset, which is of low complexity. Empirically, we use LGD relabeling (Algorithm 2) on low-complexity datasets and utilize LGD (Algorithm 1) on high-complexity datasets.
The labels of the trainset were polluted by symmetric and asymmetric noisy labels, respectively. Our model was a 2CNN-784-100-100-10 neural network, which used BN [14] and ReLU [22] in hidden layers. For universality, our loss function was CCE. We applied the relabeling strategy to LGD.
We show the performance of LGD relabeling through two experiments. In the first experiment of learning with symmetric noisy labels, the pollution rate η varies from 0.2 to 0.8 in increments of 0.1. We compare the performance of LGD relabeling with those of current advanced methods such as Large margin [5], Co-sampling [10], and MentorNet [15]. In the second experiment of learning with asymmetric noisy labels, the pollution rate η incrementally varies from 0.30 to 0.46. We compare the results of LGD relabeling with those of current advanced methods such as Bootstrapping [26], Co-sampling, and MentorNet.
Figure 4 shows the performances of these methods. For the classification with symmetric noisy labels, as shown in Figure 4 (left), when the pollution rate η is in the interval [0.2, 0.6], our method is close to the best method, Large margin. When η is in the interval [0.7, 0.8], our method is superior to other methods. For the classification with asymmetric noisy labels shown in Figure 4 (right), when η is in the interval [0.30, 0.38], our method is superior to all other methods. When η is in the interval [0.40, 0.46], our method is inferior to Bootstrapping-recon and Bootstrapping-hard. Overall, our approach is comparable to the current state-of-the-art methods. It is worth noting that Large margin and Bootstrapping are not compatible with both types of pollution. Large margin is only suitable for symmetric noise owing to the use of max-margin. Bootstrapping can only be applied for asymmetric noise because it involves the evaluation of the transition matrix. In contrast, our approach can be applied for both types of pollution. Although Co-sampling and MentorNet are also suitable for both types of pollution, the performances of our method are superior.

Conclusion
This paper presented the LGD method for learning with noisy labels. LGD is based on an interesting characteristic that DNNs tend to prioritize the learning of an LSRS pattern over the learning of an SSRS pattern or even a chaos pattern. Traditional methods often use a clean validation set to supervise the training process. In contrast, LGD is an unsupervised method that does not require a clean validation set; it creates a few samples that are different from the main pattern to help estimate the learning progress of the main pattern. It works under a quite relaxed condition that the scale of the main pattern is dominant in samples. We empirically verified the effectiveness of the algorithm on various datasets. The results of experiments with two different datasets support the practical application of LGD.

Figure 1: Left: visualization of the symmetric label noise model. The true labels are flipped to other labels with equal probability. Right: example of the asymmetric label noise model. The true labels are flipped to specific false labels according to a fixed rule.

Figure 2: Left: training accuracy curves. Three different patterns are mixed into a trainset. The LSRS is the set of clean samples. The SSRS can be regarded as the samples with asymmetric noisy labels. The chaos samples can be regarded as the samples with symmetric noisy labels. Right: 2D plot of the gradients of the different patterns after dimension reduction via t-SNE at the beginning of training.

Sufficient Conditions of Learning with Symmetric Label Noise
Lemma 1. Consider a k-class classification problem. Suppose that the labels of $r$ samples are all polluted by symmetric noise. The label-shifting operation is defined as $\hat{y} = \mathrm{MOD}(y + 1, k)$, where $y$ is a polluted label and $\hat{y}$ is the label-shifting result of $y$. Then, after the $r$ labels are shifted, $\frac{r}{k-1}$ samples will attain true labels.
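The fraction stated in Lemma 1 can be checked empirically with a quick simulation. The sketch below assumes integer labels in {0, ..., k-1} and true labels drawn uniformly at random; the function name `fraction_true_after_shift` is hypothetical.

```python
import numpy as np

def fraction_true_after_shift(k, r, rng):
    """Pollute r labels with symmetric noise, shift them, and return the fraction now true."""
    true = rng.integers(0, k, size=r)
    # Symmetric pollution: move to one of the other k-1 labels with equal probability.
    offsets = rng.integers(1, k, size=r)
    polluted = (true + offsets) % k
    # Label shifting as defined in the lemma: MOD(y + 1, k).
    shifted = (polluted + 1) % k
    return float((shifted == true).mean())

rng = np.random.default_rng(0)
k, r = 10, 200_000
frac = fraction_true_after_shift(k, r, rng)
print(f"empirical fraction: {frac:.4f}, 1/(k-1): {1 / (k - 1):.4f}")
```

A shifted label equals the true label exactly when the pollution offset was k−1, which happens with probability 1/(k−1), so the empirical fraction should sit close to that value for large r.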

Figure 4: Left: comparison of methods for learning with symmetric noisy labels. Right: comparison of methods for learning with asymmetric noisy labels.

Table 4: Cross-analysis of the selected and leftover samples with all patterns after label shifting.
Algorithm 1 Limited Gradient Descent
Randomly select β · n samples from the original training set S and shift their labels to form the subset S_r; the leftover samples form the other subset S_l. The new training set becomes S′ = S_r ∪ S_l.
Require: Net, loss function, the training set S′, LoR ← 0, LGD iterations N
for each i ∈ [1, N] do
    Train Net by one step (e.g., one epoch) with SGD on S′
    Infer S_l and S_r and obtain the accuracies Acc_l and Acc_r, respectively
    if Acc_l / Acc_r > LoR then
        LoR ← Acc_l / Acc_r
        net_rec ← Net
    end if
end for
Predict the labels of the test set with net_rec to calculate test accuracy.

Table 5: Average test accuracy (%) of five experiments on the CIFAR-10 dataset. The experiment does not compare the performances of different robust methods but rather compares LGD with the corresponding supervised robust method. Therefore, there is no boldface to mark the best accuracies.