Adjusting Logit in Gaussian Form for Long-Tailed Visual Recognition

It is not uncommon that real-world data are distributed with a long tail. For such data, the learning of deep neural networks becomes challenging because it is hard to classify tail classes correctly. In the literature, several existing methods have addressed this problem by reducing classifier bias, provided that the features obtained with long-tailed data are representative enough. However, we find that training directly on long-tailed data leads to uneven embedding space. That is, the embedding space of head classes severely compresses that of tail classes, which is not conducive to subsequent classifier learning. This paper therefore studies the problem of long-tailed visual recognition from the perspective of feature level. We introduce feature augmentation to balance the embedding distribution. The features of different classes are perturbed with varying amplitudes in Gaussian form. Based on these perturbed features, two novel logit adjustment methods are proposed to improve model performance at a modest computational overhead. Subsequently, the distorted embedding spaces of all classes can be calibrated. In such balanced-distributed embedding spaces, the biased classifier can be eliminated by simply retraining the classifier with class-balanced sampling data. Extensive experiments conducted on benchmark datasets demonstrate the superior performance of the proposed method over the state-of-the-art ones. Source code is available at https://github.com/Keke921/GCLLoss.


I. INTRODUCTION
Deep learning methods have achieved better-than-human performance on a variety of visual recognition tasks [1], [2], [3] by virtue of large-scale annotated datasets. In general, the success of deep neural networks (DNNs) relies on balanced-distributed data and sufficient training samples. That is, the number of samples in each class is roughly the same and large enough. Unfortunately, from a practical perspective, data collected from the real world often follow a power-law distribution [4], [5], which means that a tiny number of head classes occupy large volumes of instances while the vast majority of tail classes each have fairly few samples, showing a "long tail" in the data distribution. In fact, class importance is independent of the number of training samples. In other words, having few samples does not imply that the tail classes are unimportant [6]. Moreover, misclassification of tail classes can have severe consequences, especially in critical applications such as medical diagnosis [7] or road monitoring [8]. Therefore, it is important to develop methods that can effectively address the long-tailed distribution of data and improve the recognition performance on tail classes in particular.
In the literature, many researchers have addressed the issue of long-tailed visual recognition by focusing on the classifier level. It is well known that a DNN can be decoupled into a feature extractor and a classifier [9], [10]. Recently, Zhou et al. [11] have conducted empirical studies to demonstrate that the features (also referred to as embeddings interchangeably hereinafter) obtained from the original long-tailed dataset are already sufficiently representative. Consequently, they shifted their focus to balancing the classifier through two versions of sampling data. Also, two-stage decoupling methods [12], [13], [14], [15] have been proposed to obtain a representation in the first stage and then re-train the classifier on balanced sampling data in the second stage. These methods obtain the representation by cross-entropy (CE) loss, which, however, leads to a severely uneven distribution of the embedding space, hindering the acquisition of a better classifier. Furthermore, re-training the classifier can only alleviate the classifier bias but cannot adjust the distorted embedding space, which is not conducive to further promoting the model performance.
For the feature issue, specifically, the embedding spatial span of tail classes is drastically compressed by head classes because tail classes have limited training samples that cannot cover the true distribution in embedding space. For ease of understanding, we use a simple experiment to demonstrate the distortion of the embedding space, as illustrated in Fig. 1, where the features are projected by t-SNE [16]. It can be observed that the tail class occupies a much smaller spatial span than the head class.
Fig. 1. t-SNE visualization of the distorted embedding space. The embedding distributions of head and tail classes are shown in shaded areas. We can see that there are many overlapping regions between each class.
A straightforward way to calibrate the distorted embedding space is to enlarge the spatial distribution of tail classes. Analogous to human cognition, where a person is capable of inferring the extension of an entire category from a single instance [17], we treat one training sample as a set of similar samples. By augmenting the features, we can control the spatial span of the embedding. As only the orientation of the class anchors contributes to the classification, we increase the perturbation amplitude of the tail classes along the direction of the corresponding class anchors. This expands the spatial distribution of tail classes and prevents them from being overly compressed by head classes. Conversely, these amplitudes for head classes should be small: since their samples are diverse enough to cover the actual spatial span, additional expansion is unnecessary. Eventually, as shown in Fig. 2, the tail class samples can be pushed further away from the other classes so that the distortion of the embedding space can be well calibrated. To this end, we first expand the embedding spatial span with a Gaussian form of perturbation. Based on this, we propose a novel logit adjustment method in two forms: normalized Euclidean and angular. This method improves model performance with negligible additional computation. Since the Gaussian distribution has a cloud-like shape, we name the perturbation amplitude cloud size and the proposed method Gaussian clouded logit (GCL). After calibrating the embedding space with GCL, the features of different classes can be more evenly distributed. It turns out that the classifier bias can be easily eliminated through class-balanced sampling data [18], [19] in such a balanced-distributed space. Extensive comparison experiments implemented on multiple commonly used long-tailed benchmarks demonstrate the superiority of the proposed GCL.
Fig. 2. Overview of the proposed method. The embedding distribution obtained by CE loss is uneven, leading to difficulty in classifying the tail class. By assigning larger cloud sizes to the tail class features, the distortion of the embedding space can be well calibrated.
Compared to our preliminary work reported in [20], the primary distinctions of this paper can be summarized as follows. Firstly, this paper provides a general form of perturbed logit by perturbing the logit to calibrate the distribution of embedding space. Accordingly, two specific forms based on different metrics are derived from this general form. Secondly, we present the analysis and explanation of the rationale of GCL in detail, based on which more general parameter selection strategies are provided. After calibrating the embedding space with GCL, the classifier bias can be mitigated by simply retraining with the balanced sampling data. Thirdly, more experiments are conducted to demonstrate the effectiveness of the proposed method. Specifically, we add more classification baselines to show the efficacy of GCL. Furthermore, we demonstrate that GCL can enhance the performance of the mixture of experts (MoE) model. Additionally, we provide in-depth theoretical and experimental analyses of the characteristics of GCL in both its normalized Euclidean and angular forms. In summary, the main contributions of this paper are threefold.
• We propose a simple but effective GCL adjustment method derived from the Gaussian perturbed feature. Tail classes are assigned larger cloud sizes than head classes along the direction of the corresponding class anchors. Consequently, it can address the problem of the distorted embedding space caused by long-tailed data.
• We provide in-depth discussions of GCL for long-tail learning from the perspectives of optimization and geometric interpretation. They help set the sign and magnitude of the perturbation and provide a new idea for better generalization to the test set.
• We obtain two specific forms of GCL. Both of them outperform state-of-the-art counterparts on long-tailed benchmark datasets with negligible additional computation. Their advantages and disadvantages in different long-tailed scenarios are analyzed in detail.
The remainder of this paper is organized as follows. Section II gives an overview of recent related works. Section III details the derivation and rationale analysis behind the proposed Gaussian clouded logit. Section V presents our experimental results in comparison with the baseline methods, as well as model validation and analysis. Finally, Section VI draws a conclusion.

II. RELATED WORKS
Over the past years, a number of methods have been proposed to address long-tailed visual recognition. This section provides an overview of the four most related regimes: data augmentation, two-stage methods, mixture of experts, and loss modification and logit adjustment.

A. Data Augmentation
Input augmentation increases sample diversity in the data space. The classical augmentation methods [1] encompass operations such as flipping, rotating, cropping, padding, etc. Most recently, Wang et al. [21] proposed the rare-class sample generator (RSG), which augments tail classes by utilizing encoded variation information obtained from head classes. M2m [22] establishes a well-balanced dataset through the translation of samples from head classes to tail classes, facilitated by an auxiliary pre-trained classifier.
Feature augmentation serves to enhance data diversity within the feature space. Knowledge transfer is a promising technology. For instance, Yin et al. [23] exemplified knowledge transfer by leveraging the intra-class variance derived from head classes in an encoder-decoder-based network to augment the features of tail class samples. Liu et al. [24] employed the transfer of angular variance, computed from head classes, to enrich the intra-class diversity within tail classes. Moreover, recent applications in addressing long-tailed data incorporate the use of class activation maps (CAM) [25]. Chu et al. [26] utilized CAM to decompose the features into a class-generic and a class-specific component. Then, tail classes are augmented by fusing the class-specific components obtained from the tail classes with the class-generic components of the head classes. Also, Zhang et al. [27] exploited CAM to obtain the foreground in an image and then augmented the obtained foreground object by flipping, rotating, jittering, etc. The augmented foreground is then overlaid on the unchanged background to obtain a new informative image.
The methods mentioned above require either an increase in data size or model complexity to solve the issues in long-tailed distribution, resulting in additional computational costs.

B. Two-stage Method
Recently, two-stage methods have been proposed and have empirically demonstrated their efficacy. For example, Cao et al. [13] proposed LDAM-DRW, wherein features are learned in the initial stage, and a deferred re-weighting (DRW) strategy is employed to refine the classifier in the subsequent stage. While it markedly enhances long-tailed prediction accuracy, the theoretical underpinnings of the DRW strategy remain unclear. Following this, Kang et al. [12] pointed out that the learning process of representation and classifier can be decoupled into two separate stages. The first stage performs representation learning on the original long-tailed data. The second stage fixes the parameters of the backbone network and re-trains the classifier using class-balanced sampling data. Several studies [14], [15], [28] have further refined this strategy. For example, Zhang et al. [15] proposed an adaptive calibration function to calibrate the predicted logits of different classes, aligning them with a balanced class prior in preparation for the second stage. Zhong et al. [28] proposed class-based soft labels to address varying degrees of overconfidence in the predicted logit of each class, which can improve the classifier learning in the second stage. An alternative approach is proposed by Zhou et al. [11], wherein the network structure is bifurcated into two branches. One branch focuses on learning the representation of head classes, while the other is tailored for tail classes. This structure incorporates feature mixup [29] into a cumulative learning strategy, yielding state-of-the-art results. Subsequently, Wang et al. [30] introduced contrastive learning into this bilateral-branch structure, further enhancing the performance of long-tailed classification.

C. Mixture of Experts
More recently, researchers have explored the use of mixture of experts (MoE) methods to enhance performance by integrating multiple models into the learning framework. The fundamental concept behind these approaches is to introduce diversity to the data or models, which enables experts to concentrate on different portions of the data or allows experts with different structures to analyze the data. BBN [11] proposes a two-branched classifier that learns both the long-tailed and inverse distributions simultaneously, with a smooth transition of focus between them. BAGS [31], LFME [32], and ACE [33] divide the long-tailed data into different sub-splits and fit multiple experts on them. ResLT [34] designs residual structured classifiers that allow experts to specialize in different parts of the long-tailed data and complement each other. RIDE [35] and TLC [36] employ multiple experts, each trained on different augmented data, to independently learn the long-tailed distribution. The predictions of all experts are then gradually integrated to reduce overall model variance or uncertainty. SHIKE [37] investigates the impact of feature depth on data of varying scales in long-tailed visual recognition. The authors propose a new architecture, which incorporates features from different layers of a neural network to exploit the rich information present at different depths of a network. NCL [38] adopts multiple complete networks to learn the long-tailed data individually and uses a self-supervised contrastive strategy [39] to collaboratively transfer knowledge among the individual experts.

D. Loss Modification and Logit Adjustment
Re-weighting the loss function is one of the most intuitive ways to improve the attention of a DNN model on tail classes. In the literature, sample-wise re-weighting [40], [41] introduces fine-grained coefficients into the loss function to make the model pay more attention to the difficult samples. Furthermore, class-wise re-weighting [42], [18], [43] assigns the standard CE loss category-specific parameters that are inversely proportional to the class sizes. These methods can alleviate the data imbalance to a certain extent. However, when the imbalance ratio is very high, large weights may cause overfitting to the tail classes. Besides that, another side effect of assigning higher weights to difficult samples/tail classes is overly focusing on harmful samples (e.g., abnormal samples or mislabeled data) [44].
The loss function can also be modified by adjusting the logit. Menon et al. [45] proposed logit adjustment (LA), which is consistent in minimizing the balanced error. The logit shifting in LA of different classes is based on label frequencies of training data. By contrast, LADE [46] post-processes the model prediction by disentangling the training set distribution from the prediction. This method does not require the test set to be a uniform distribution. Also, DisAlign [15] adjusts the logit by calibrating the distribution of model prediction to a balanced one by minimizing the expected KL divergence. Overall, these three methods can adjust the classifier well but do not take into account the distorted embedding space. Alternatively, re-margining methods [13], [47], [48] address long-tailed data by leaving large relative margins for tail classes during training. For example, label-distribution-aware margin (LDAM) loss [13] utilizes Rademacher complexity to theoretically prove that the margin should be inversely proportional to a quarter power of class sizes. The hard margin on the target logit helps make the samples within a class more compact, but the strict margin constraints increase the risk of overfitting and cannot actually expand the tail class coverage area in embedding space.
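To make the two families of logit modification above concrete, the following minimal sketch contrasts a frequency-based logit shift with a class-size-dependent margin. It is an illustration under stated assumptions, not the implementation of [45] or [13]: the function names, the temperature `tau`, and the cap `max_margin` are hypothetical choices made here for demonstration.

```python
import math

def logit_adjusted(logits, class_counts, tau=1.0):
    """Shift each logit by tau * log(prior_j), with priors taken from label frequencies.

    Assumption: an additive shift applied during training; tau is a hypothetical
    temperature introduced for this sketch.
    """
    total = sum(class_counts)
    return [z + tau * math.log(n / total) for z, n in zip(logits, class_counts)]

def quarter_power_margins(class_counts, max_margin=0.5):
    """Margins proportional to n_j^(-1/4), rescaled so the largest equals max_margin.

    Assumption: max_margin is a hypothetical cap chosen for this sketch.
    """
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [m * scale for m in raw]

counts = [5000, 500, 50]                      # head, medium, tail class sizes
adjusted = logit_adjusted([2.0, 2.0, 2.0], counts)
margins = quarter_power_margins(counts)
print(adjusted)   # the head class receives the largest training-time shift
print(margins)    # the tail class receives the largest margin
```

Both mechanisms act on the classifier output only, which is why the text notes that they leave the shape of the embedding space untouched.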

III. PROPOSED METHOD
The basic idea of our proposed method is to perturb the features with varying magnitudes in the directions of different class anchors, thereby automatically balancing the spatial span of head and tail classes. The details of the proposed approach are presented as follows.

A. Basic Notations
This section defines the notation used throughout this paper. For the dataset: suppose $\{x, y\} \in \mathcal{T}$ represents a sample from the training set $\mathcal{T}$, where $\mathcal{T}$ has $C$ classes and $N$ training samples in total, $x$ represents the image to be classified, and $y \in \{1, \ldots, C\}$ is the ground-truth label. The number of training samples of class $j$ is denoted by $n_j$. For the backbone: the feature vector $f \in \mathbb{R}^{D}$ is derived from the embedding layer, with a dimensionality of $D$. $W = \{w_1, w_2, \cdots, w_C\} \in \mathbb{R}^{D \times C}$ represents the weight matrix of the classifier, where $w_j$ represents the anchor vector of class $j$ in the classifier. The predicted logit of class $j$ is represented by $z_j$; thus, $z_j = w_j^{T} f$. The subscript $y$ indicates the target class. That is, $z_y$ denotes the target logit and $z_j$, $j \neq y$, is a non-target logit.

B. Embedding Space Calibration
Suppose a feature point and a small area around it belong to the same type. It is reasonable that the adjacent points around a feature can be regarded as similar to it, and can naturally be considered as the same class.
1) General form via perturbing the embedding representation: We sample a set of points by adding perturbations following a specific distribution to a given feature. Then, a perturbed feature $f^{ptb}$ of the input is represented as:
$$f^{ptb} = f + \delta E, \tag{1}$$
where $E$ represents the perturbation and $\delta > 0$ is its amplitude. To avoid misleading the final classification, the perturbation amplitude cannot be too large; thus, $\delta$ should be a small number. This perturbed feature is the input of the classifier. Then, the corresponding perturbed logit $z_j^{ptb}$ of class $j$ is calculated by:
$$z_j^{ptb} = w_j^{T} f^{ptb} = z_j + \delta\,(w_j^{T} E), \tag{2}$$
where $z_j^{ptb}$ is the original logit $z_j$ augmented by a perturbing item $\delta\,(w_j^{T} E)$.
2) Normalized Euclidean form: It should be noted that the perturbing item has different degrees of influence on the final predicted results based on different predicted logits. The impact on $z_j^{ptb}$ is relatively minor when the original logit $z_j$ is large. Conversely, it becomes more pronounced when $z_j$ is small. Consequently, it is imperative to normalize the effects induced by varying predicted logits while preserving the consistency of the perturbing item's influence. We achieve this by employing cosine distance through the normalization of the perturbed logits. Here, $s_e$ and $s_a$ represent the norms of the embedding and the class anchor, respectively, that is, $s_e = \|f\|$ and $s_a = \|w_j\|$. The normalized perturbed logit $\tilde{z}_j^{ptb}$ is expressed as:
$$\tilde{z}_j^{ptb} = s \cdot \left( \frac{w_j^{T} f}{\|w_j\|\,\|f + \delta E\|} + \frac{\delta\, w_j^{T} E}{\|w_j\|\,\|f + \delta E\|} \right), \tag{3}$$
where $s = s_a \cdot s_e$. $\|f + \delta E\|$ is approximately equal to $\|f\|$ because $\delta$ is a small number. For the second term, we use $I_j$ to represent the unit vector that has the same direction as $w_j^{T}$, namely $I_j = w_j^{T} / \|w_j\|$. Eq. (3) is simplified as:
$$\tilde{z}_j^{ptb} \approx s \cdot \left( \cos\theta_j + \frac{\delta}{s_e}\, I_j E \right), \tag{4}$$
where $\theta_j$ is the angle between $f$ and $w_j$. Inspired by [49], the predictions can be made solely based on the angle between the feature and the class anchor. Therefore, following [2], [50], we can utilize a fixed norm of the individual class anchor to substitute $s_a$. Without loss of generality, we employ $s_a = 1$. Additionally, following [51], [52], [49], the norm of the embedding feature can also be replaced with a constant $s$, that is, we set $s_e = s$. Consequently, the logit is calculated using features distributed on a hypersphere of radius $s$. As for the perturbation, we set it to a Gaussian distribution, i.e., $E \sim \mathcal{N}(M, \Sigma)$, where $M \in \mathbb{R}^{D}$ and $\Sigma \in \mathbb{R}^{D \times D}$. The rationale behind this choice lies in the widespread adoption of additive Gaussian noise in machine learning [53], attributed to the simplicity and universality [54], [55] of the Gaussian distribution. Moreover, we specifically set $\Sigma = \sigma I$, where $I \in \mathbb{R}^{D \times D}$ is the identity matrix. Then $I_j E$ is the projection of the perturbation on the direction of the anchor vector of class $j$. We directly use $\varepsilon$ to represent this value, which can be interpreted as the amplitude of the projection. By substituting the aforementioned norms and perturbation into Eq. (4) and uniformly shifting the class-related variable to the pre-defined perturbation amplitude $\delta$ for simplicity, we derive a more concise expression for $\tilde{z}_j^{ptb}$:
$$\tilde{z}_j^{ptb} = s \cdot \left( \cos\theta_j + \delta_j\, \varepsilon \right). \tag{5}$$
Since $\varepsilon$ is also distributed in Gaussian form, it has a cloud-like shape. $\delta_j$ is the class-based perturbation amplitude that depends on label frequencies. We name $\delta_j$ cloud size because it controls the amplitude of $\varepsilon$. To broaden the embedding space for the tail classes, the cloud size for tail classes is required to be larger than that of the head classes. Therefore, $\delta_j$ is negatively correlated with $n_j$. In addition, given that $\cos\theta_j \in [-1, 1]$, the consistency of the influence of the perturbing item can be maintained. As $\varepsilon$ gives the logit a cloud-like shape, we name the perturbed logit Gaussian clouded logit (GCL).
We now examine Eq. (5). If $\varepsilon > 0$, $\tilde{z}_j^{ptb}$ corresponds to the points that are closer to the anchor vector of class $j$. The correct classification of proximal points does not guarantee the accurate classification of distant points within the same class. Therefore, $\varepsilon > 0$ will not be helpful for classification. On the contrary, a reduced logit corresponds to the points that are relatively far from the class anchor. If the relatively distant points can be predicted correctly, the closer ones will also be assigned the correct label. The points in the same class that are relatively far from the class anchor should therefore be the focus, so the perturbing item should always be negative. We name this logit GCL in normalized Euclidean form (GCL-E for short) because it is derived from the normalized Euclidean distance metric. We modify the perturbed logit and use $\tilde{z}_j^{GCL\text{-}E}$ to represent it, which is expressed as:
$$\tilde{z}_j^{GCL\text{-}E} = s \cdot \left( \cos\theta_j - \delta_j^{E}\, \|\varepsilon\| \right), \tag{6}$$
where $\delta_j^{E}$ is the cloud size for GCL-E.
3) Angular form: The final logit of GCL in normalized Euclidean form is equivalent to adding a class-based perturbation to the cosine logit. From another perspective, namely metric learning, Eq. (6) corresponds to adding a Gaussian form margin with class-based variance to the cosine logit (Section IV-B provides a detailed analysis). Inspired by Deng et al. [49], this Gaussian form margin can also be introduced into the angular distance metric. To distinguish it from GCL-E, this version of GCL is named GCL in angular form (GCL-A for short), and its logit is denoted by $\tilde{z}_j^{GCL\text{-}A}$. These two forms can be unified into a single expression:
$$\tilde{z}_j^{GCL} = s \cdot \left( \cos\!\left(\theta_j + \nu_A\, \delta_j^{A}\, \|\varepsilon\|\right) - \nu_E\, \delta_j^{E}\, \|\varepsilon\| \right), \tag{7}$$
where $\nu_A \in \{0, 1\}$ and $\nu_E \in \{0, 1\}$ are the switch parameters.
• When $\nu_A = 1$ and $\nu_E = 0$, we obtain the angular form, expressed as follows:
$$\tilde{z}_j^{GCL\text{-}A} = s \cdot \cos\!\left(\theta_j + \delta_j^{A}\, \|\varepsilon\|\right). \tag{8}$$
• When $\nu_A = 0$ and $\nu_E = 1$, we obtain the normalized Euclidean form, denoted as $\tilde{z}_j^{GCL\text{-}E}$, as expressed in Eq. (6).
By taking the Gaussian clouded logit into the original softmax, we obtain the final loss function of GCL:
$$\mathcal{L}_{GCL} = -\log \frac{e^{\tilde{z}_y^{GCL}}}{\sum_{j=1}^{C} e^{\tilde{z}_j^{GCL}}}, \tag{9}$$
where $\tilde{z}_j^{GCL}$ is given by Eq. (7). $\mathcal{L}_{GCL\text{-}E}$ is utilized to represent the loss function of GCL-E and $\mathcal{L}_{GCL\text{-}A}$ denotes that of GCL-A.
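As an illustration of the two logit forms and the resulting loss, the sketch below assumes the GCL-E logit is s(cos θ_j − δ_j‖ε‖) and the GCL-A logit is s·cos(θ_j + δ_j‖ε‖), which matches the expansion of the angular form given in the time-complexity analysis. The function names, the scale s = 30, and the use of one shared noise magnitude per sample are assumptions made for this example and are not taken from the released code.

```python
import math

def gcl_logits(cos_theta, cloud_sizes, eps_abs, s=30.0, form="E"):
    """Gaussian clouded logits for one sample (a sketch, not the official code).

    cos_theta   : cosine between the feature and each class anchor
    cloud_sizes : class-wise cloud size delta_j (larger for tail classes)
    eps_abs     : magnitude of one Gaussian draw, shared by all classes (assumption)
    s           : radius of the hypersphere (hypothetical default)
    """
    logits = []
    for c, d in zip(cos_theta, cloud_sizes):
        if form == "E":      # normalized Euclidean form: subtract from the cosine
            logits.append(s * (c - d * eps_abs))
        elif form == "A":    # angular form: add the perturbation to the angle
            theta = math.acos(max(-1.0, min(1.0, c)))
            logits.append(s * math.cos(theta + d * eps_abs))
        else:
            raise ValueError("form must be 'E' or 'A'")
    return logits

def softmax_cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]

cos_theta = [0.8, 0.3]       # class 0 is a head class, class 1 a tail class
cloud_sizes = [0.0, 1.0]     # the tail class gets the larger cloud
eps_abs = 0.5
logits_e = gcl_logits(cos_theta, cloud_sizes, eps_abs, form="E")
logits_a = gcl_logits(cos_theta, cloud_sizes, eps_abs, form="A")
loss_plain = softmax_cross_entropy([30.0 * c for c in cos_theta], target=1)
loss_gcl_e = softmax_cross_entropy(logits_e, target=1)
print(logits_e, logits_a, loss_plain, loss_gcl_e)
```

In this toy case the tail-class target logit is lowered in both forms, so the loss for a tail-class sample is larger than with the plain cosine logit, which is the intended pressure to push tail samples further from the other classes.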

C. Classifier Re-balance
Although both GCL-E and GCL-A calibrate the distorted embedding space well, the problem of classifier bias still remains to be addressed.
In the following, we analyze the reasons for the biased classifier. Eq. (13) implies that a sample of the target class $y$ penalizes the classifier weights $w_j$ of each non-target class $j$, $j \neq y$, w.r.t. $p_j$. In general, the number of training instances in head classes is enormously greater than in tail classes. Therefore, the classifier weights of tail classes receive much more penalty than positive signals during training. Consequently, the classifier will be biased towards the head classes and the predicted logits of the tail classes will be seriously suppressed, resulting in low classification accuracy of the tail classes [43], [56], [57]. We call this problem of the cross-entropy loss function in long-tailed learning negative gradient over-suppression. A straightforward approach to cope with it is to make the sample numbers of each class equal [58] to balance the negative gradients. To achieve this goal, we can over-sample the tail classes and then re-train the classifier. The sampling rate of each class is $1/C$. Then, the class-balanced sampling rate $q_j^{cb}$ of each sample $x$ from class $j$ is calculated by:
$$q_j^{cb} = \frac{1}{C} \cdot \frac{1}{n_j}. \tag{10}$$
This strategy is called classifier re-training (cRT) [12]. It can also be combined with the effective number [18]. Replacing the actual sample number $n_j$ of class $j$ with the so-called effective number $n_j^{en}$, the effective sampling rate $q_j^{en}$ of each sample from class $j$ is given by:
$$q_j^{en} = \frac{1}{C} \cdot \frac{1}{n_j^{en}}, \tag{11}$$
where $n_j^{en}$ is calculated by:
$$n_j^{en} = \frac{1 - \beta^{n_j}}{1 - \beta}, \tag{12}$$
with hyper-parameter $\beta \in [0, 1)$. Algorithm 1 summarizes the overall training procedure of the proposed method.
Algorithm 1 (excerpt). Calculate the logit cloud size $\delta_j$ by Eq. (16) (or Eq. (17)); calculate the sampling rate $q_j$ by Eq. (10) (or Eq. (11)); sample a batch of data $B'$ with the sampling rate $q_j$ and the batch size $b$; calculate the loss using Eq. (9); update the classifier parameters $\omega_{cls}$ while keeping the representation parameters frozen.
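The sampling rates used for classifier re-balancing can be sketched as follows. The sketch assumes each class receives a total rate of 1/C that is spread evenly over its samples, and that the effective number takes the form (1 − β^n)/(1 − β) from [18]; the function names are hypothetical, and the effective-number rates are left unnormalized, which is also an assumption.

```python
def effective_number(n, beta):
    """Effective number of samples (1 - beta^n) / (1 - beta), with beta in [0, 1)."""
    return (1.0 - beta ** n) / (1.0 - beta)

def per_sample_rates(class_counts, beta=None):
    """Per-sample sampling rate for each class.

    With beta=None the actual class sizes are used (class-balanced sampling);
    otherwise each class size is replaced by its effective number.
    """
    num_classes = len(class_counts)
    if beta is None:
        sizes = [float(n) for n in class_counts]
    else:
        sizes = [effective_number(n, beta) for n in class_counts]
    return [1.0 / (num_classes * m) for m in sizes]

counts = [100, 10]                               # one head class, one tail class
q_cb = per_sample_rates(counts)                  # class-balanced rates
q_en = per_sample_rates(counts, beta=0.999)      # effective-number rates
class_mass_cb = [n * q for n, q in zip(counts, q_cb)]
print(q_cb, q_en, class_mass_cb)
```

Under class-balanced sampling every class ends up with the same total probability mass, so each tail-class sample is drawn far more often than each head-class sample. Such per-sample rates could be passed as unnormalized weights to a weighted sampler.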

IV. RATIONALE ANALYSIS
This section provides a detailed rationale analysis of how Eq. (7) and Eq. (8) balance the embedding space from the perspectives of model optimization and metric learning, followed by a time-complexity analysis.

A. The Perspective of Model Optimization
In backward propagation, the gradients on logit $z_i$ are calculated by:
$$\frac{\partial \mathcal{L}}{\partial z_i} = \begin{cases} p_i - 1, & i = y, \\ p_i, & i \neq y, \end{cases} \tag{13}$$
where $p_i = e^{z_i} / \sum_{j=1}^{C} e^{z_j}$. We take the binary case to illustrate without loss of generality. Suppose the input image is from class 1. The gradient on $z_1$ is calculated by:
$$\frac{\partial \mathcal{L}}{\partial z_1} = p_1 - 1 = -\frac{1}{1 + e^{z_1 - z_2}}. \tag{14}$$
It indicates that the gradient of the target class rapidly approaches zero as the target logit increases. This phenomenon is called softmax saturation [59], [60]. This premature gradient vanishing weakens the validity of training samples and impedes model training. Therefore, softmax can only slightly separate various classes, and lacks the impetus to evenly distribute each class in the embedding space. We can also observe that there are many overlapping areas among the classes in Fig. 1. Especially under the circumstances of long-tailed classification, the tail class features are insufficient to cover the ground-truth distribution in embedding space. The early gradient vanishing caused by softmax saturation exacerbates the squeezing of the embedding distribution of tail classes. Different from the original softmax loss function, the logit difference ($\Delta_{y-j}$) obtained by GCL of Eq. (6) between the target and non-target classes is calculated by:
$$\Delta_{y-j} = \tilde{z}_y^{GCL\text{-}E} - \tilde{z}_j^{GCL\text{-}E} = s\,(\cos\theta_y - \cos\theta_j) - s\,(\delta_y - \delta_j)\,\|\varepsilon\|. \tag{15}$$
In case the target class is a tail class, $\delta_y - \delta_j > 0$, which decreases the softmax saturation and thereby helps increase the validity of tail class samples. Eq. (8) has the same effect. Thus, Eq. (6) and Eq. (8) can automatically balance the sample validity of different classes and provide incentives for the model to make each class more separable. They achieve the aim of calibrating the distorted embedding space.
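A small numeric check makes the saturation argument tangible. The sketch below evaluates the binary-case gradient on the target logit, which for softmax cross-entropy equals −1/(1 + e^{z_1 − z_2}), at several logit gaps. The function name and the chosen gap values are illustrative assumptions.

```python
import math

def target_gradient(z_target, z_other):
    """Gradient of binary softmax cross-entropy w.r.t. the target logit: p_1 - 1."""
    return -1.0 / (1.0 + math.exp(z_target - z_other))

for gap in (0.0, 5.0, 10.0):
    print(gap, target_gradient(gap, 0.0))

# Lowering the target logit relative to the non-target logit, as a larger cloud
# size on the target class does, shrinks the gap and restores gradient magnitude.
saturated = abs(target_gradient(5.0, 0.0))
relieved = abs(target_gradient(5.0 - 3.0, 0.0))   # hypothetical reduction of 3
print(saturated, relieved)
```

At a gap of zero the gradient magnitude is 0.5, while at a gap of 5 it is already below 0.01, so a sample that is only moderately well classified contributes almost nothing to training. Reducing the gap for tail-class targets keeps those samples informative for longer.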

B. The Perspective of Metric Learning
Compared with the prior work that enlarges the inter-class separability via a "hard margin" (e.g., [49], [13], [60]), Eq. (6) and Eq. (8) are equivalent to adding a "soft" margin. That is, the farther away from the class anchor, the lower the probability that the point belongs to this class. Fig. 3 schematically shows the comparison of the prior hard margin and the proposed soft margin. Hard margins will cause the samples to shrink toward the class anchor if the margin is too large. In addition, hard margins can lead to overfitting because they prohibit outliers, which can impair the robustness of the model. The proposed soft margin provides a smooth transition area, allowing the outliers to appear near the target class with a lower probability. This is both intuitively and theoretically more reasonable.
The cloud size $\delta_j^{*}$ may also take different expression forms, where the superscript $*$ indicates the specific form adopted. Cao et al. [13] obtained the optimal trade-off between the hard margin ($m_i$) and the class size via Rademacher complexity. They proved that $m_i \propto n_i^{-1/4}$. According to Wei et al. [61], the exponent should instead be $-1/3$. Inspired by these works, we can set the cloud size in power function form:
$$\delta_j \propto \left( \frac{n_{max}}{n_j} \right)^{k}, \tag{16}$$
where $n_{max}$ is the sample number of the most frequent class and $k$ can be $1/3$ or $1/4$. Menon et al. [45] used the Fisher consistency with respect to the balanced error and obtained that $m_i \propto \log(1/n_i)$. Therefore, we can also set the cloud size in logarithmic form:
$$\delta_j = \log n_{max} - \log n_j. \tag{17}$$
We also experimentally demonstrate the effectiveness of the cloud size in different expression forms in Section V-D.
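For illustration, the two families of cloud size can be computed from class counts as below. The exact expressions are assumptions made for this sketch: the power form is taken as (n_max/n_j)^k and the logarithmic form as log(n_max/n_j), and both are divided by their maximum so that the largest cloud size equals 1, following the normalization described in the experimental settings. The function name is hypothetical.

```python
import math

def cloud_sizes(class_counts, form="log", k=0.25):
    """Class-wise cloud sizes, decreasing in n_j and normalized to a maximum of 1.

    form="log": log(n_max / n_j)      (assumed logarithmic form)
    form="pow": (n_max / n_j) ** k    (assumed power form, k = 1/4 or 1/3)
    """
    n_max = max(class_counts)
    if form == "log":
        raw = [math.log(n_max / n) for n in class_counts]
    elif form == "pow":
        raw = [(n_max / n) ** k for n in class_counts]
    else:
        raise ValueError("form must be 'log' or 'pow'")
    top = max(raw)
    # If all classes have the same size, the logarithmic form is all zeros.
    return [d / top for d in raw] if top > 0 else raw

counts = [5000, 500, 50]
delta_log = cloud_sizes(counts, form="log")
delta_pow = cloud_sizes(counts, form="pow", k=0.25)
print(delta_log)
print(delta_pow)
```

One visible difference under these assumptions is that the logarithmic form assigns a zero cloud to the most frequent class, whereas the power form leaves every class with a strictly positive cloud.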
In short, GCL in the form of either normalized Euclidean distance or angular distance offers the following three advantages: 1) it reduces the softmax saturation and thereby increases the sample validity of tail classes; 2) it avoids overfitting and improves robustness by randomly sampling values from a Gaussian distribution; 3) it enlarges the margin of the class boundary for tail classes and thus calibrates the distortion of the embedding space. The slight disparity between the two forms lies in the procedural approach: GCL-E incorporates class-based perturbation onto features prior to logit calculation, whereas GCL-A is equivalent to sampling disturbed feature points subsequent to determining their distance from the class anchor. In addition, we systematically illustrate the two versions of GCL and their distinctions from previous methods, exemplified by CE and LDAM [13], as depicted in Fig. 4.

C. Time-Complexity Analysis
The softmax has a time complexity of $O(C)$, which is linear in the dimension of the logit. It is the same for the cross-entropy loss $\mathcal{L}_{CE}$ and $\mathcal{L}_{GCL}$ in both forms. The main difference in time complexity comes from the calculation of the logit. For the original normalized logit (which is denoted as $\tilde{z}_j = s \cdot \cos\theta_j$), the main computational cost is vector multiplication. It contains $D \cdot C$ multiplications and $(D - 1) \cdot C$ additions. Thus, the time complexity of computing $\tilde{z}_j$ is $O(DC)$. Eq. (6) shows that GCL-E only adds $C$ scalar additions to $\tilde{z}_j$. As a result, computing $\tilde{z}_j^{GCL\text{-}E}$ has $O(DC)$ time complexity. For GCL-A, we first expand Eq. (8) to $\tilde{z}_j^{GCL\text{-}A} = s \cdot (\cos\theta_j \cos(\delta_j \|\varepsilon\|) - \sin\theta_j \sin(\delta_j \|\varepsilon\|))$. The sine value can be obtained from the

B. Basic Setting
The parameters that need to be pre-set are the Gaussian distribution parameters $(\mu, \sigma^2)$. For GCL-E, the maximum cloud size cannot exceed 1 because $\cos\theta_i \in [-1, 1]$. Since a Gaussian variable falls in $[\mu - 3\sigma, \mu + 3\sigma]$ with a probability of 99.7%, we set $\mu = 0$ and $\sigma = 1/3$. We further clamp $\varepsilon$ to $[-1, 1]$ to prevent the cloud size from exceeding 1. For GCL-A, we first constrain the range of $\varepsilon$ to $[-1, 1]$ in the same way as the cosine form GCL. Then, we multiply $\varepsilon$ by a constant $\pi/2$ to limit the cloud size in angular form to $[-\pi/2, \pi/2]$ based on Lemma 3 proposed by Ranjan et al. [51]. Moreover, we normalize $\delta_i$ by $\delta_i \triangleq \delta_i / \max_i(\delta_i)$, $i = 1, 2, \ldots, C$, to ensure that the maximum value of $\delta_i$ does not exceed 1. For data augmentation techniques, we follow Zhong et al. [28]: apart from basic augmentation such as image flip, rotation, and random crop, only mixup [67] is adopted in all experiments to ensure fair comparisons.
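The noise settings described above can be sketched as follows: draw from a zero-mean Gaussian with σ = 1/3, clamp to [−1, 1], and rescale by π/2 for the angular form. Returning the magnitude of the draw is an assumption made here, motivated by the logit forms using ‖ε‖; the function name and the fixed seed are hypothetical.

```python
import math
import random

def sample_noise_magnitude(rng, sigma=1.0 / 3.0, angular=False):
    """Draw eps ~ N(0, sigma^2), clamp it to [-1, 1], and return its magnitude.

    For the angular form the magnitude is rescaled by pi/2 so that the
    perturbation of the angle stays within [0, pi/2].
    """
    eps = rng.gauss(0.0, sigma)
    eps = max(-1.0, min(1.0, eps))      # clamp the rare draws beyond 3 sigma
    magnitude = abs(eps)
    return magnitude * math.pi / 2.0 if angular else magnitude

rng = random.Random(0)                  # fixed seed for reproducibility of the demo
euclidean_draws = [sample_noise_magnitude(rng) for _ in range(1000)]
angular_draws = [sample_noise_magnitude(rng, angular=True) for _ in range(1000)]
print(max(euclidean_draws), max(angular_draws))
```

With σ = 1/3 the clamp is rarely active, so it mainly serves as a guarantee that the class-wise perturbation, after multiplication by a cloud size of at most 1, stays within the range of the cosine or of the angle.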
PyTorch [68] is used to implement the backbone network training. We adopt the SGD optimizer with a momentum of 0.9, coupled with a multi-step learning rate schedule. All models are trained from scratch, except for ResNet-152, which is pre-trained on the original balanced version of ImageNet-1K. For the first stage, we select ResNet-32 as the backbone network and follow the experimental settings of Cao et al. [13] for CIFAR-10/100-LT. For the experiments conducted on large-scale datasets, namely ImageNet-LT, iNaturalist 2018, and Places-LT, we mainly follow the settings of Kang et al. [12], except for the learning rate schedule. For the second stage, i.e., re-balancing the classifier, we follow the settings of Kang et al. [12] for all datasets.

1) Competing Methods:
The competing methods can be categorized into the following two groups.

Baseline Methods: Vanilla training with cross-entropy (CE) loss serves as one of our baseline methods. Previous studies in visual recognition [13], [73], [74], [75] have demonstrated the effectiveness of cosine similarity in mitigating the impact of feature bias within imbalanced data distributions. Therefore, we also include CosFace [50] and ArcFace [49] as additional baseline methods.
2) Comparison Results: Extensive comparative experiments are conducted to illustrate the efficacy of our proposed GCL in two forms (GCL-E and GCL-A). The evaluation metric is top-1 accuracy on the test/validation sets. For comparison methods that have not released official code or relevant hyper-parameters, we quote the results directly from the original papers.

Results on CIFAR-10/100-LT: The proposed GCL-E and GCL-A both outperform the previous methods by notable margins at all imbalance ratios. The improvement is most evident for the largest r, i.e., 200. For example, GCL-E achieves 79.03% and 44.84% top-1 classification accuracy on CIFAR-10-LT and CIFAR-100-LT with r = 200, which surpasses the second-best methods, i.e., FBL [72] (on CIFAR-10-LT) and MisLAS [28] (on CIFAR-100-LT), by 0.93% and 2.51%, respectively. GCL-A further improves the performance over the cosine form (GCL-E), except on CIFAR-10-LT with r = 100 (82.72% top-1 accuracy, which is still higher than that of the existing methods). For example, it increases the top-1 accuracy from 44.84% to 46.53% on CIFAR-100-LT with r = 200 compared to the cosine form. The margin over MisLAS is more than 3%. Interestingly, CosFace [50] and ArcFace [49] perform well compared to CE loss, illustrating the efficacy of the angular distance metric in long-tail learning. Compared with LDAM-DRW [13], which is also based on an angular distance metric, our proposed solution is still the clear winner. This gain comes from the smooth margin, which avoids overfitting and improves robustness. The clear performance gain over decoupling [12] demonstrates that calibrating the feature space via GCL benefits the subsequent classifier learning. The results on the CIFAR-10/100-LT datasets are summarized in Table I.
Results on Large-scale Datasets: The results on large-scale long-tailed datasets, including ImageNet-LT, iNaturalist 2018, and Places-LT, are reported in Table II. We observe that GCL-E is superior to the prior arts on all datasets. On ImageNet-LT, it achieves 54.84% top-1 accuracy, surpassing DisAlign [15] by 1.97% and MisLAS [28] by 2.77%. On iNaturalist 2018, the proposed GCL-E achieves a top-1 accuracy of 72.01%, outperforming the second-best method by 0.44%. On Places-LT, our proposed method achieves 40.62% top-1 classification accuracy. Although the performance gain over MisLAS on iNaturalist 2018 and Places-LT is not as high as on the other datasets, our method does not require hyper-parameter searching for different datasets and is thus relatively easy to implement. GCL-A further improves the performance on ImageNet-LT from 54.84% to 55.12%, but it slightly decreases the accuracy on iNaturalist 2018 and Places-LT. GCL-A achieves 71.14% top-1 classification accuracy on iNaturalist 2018, which is lower than MisLAS but still outperforms the other baseline methods by notable margins, showing the effectiveness of angular perturbation in balancing the embedding space distribution. On Places-LT, it has a lower accuracy than MisLAS and DisAlign.

Note: img-LT, iNat, and Pla-LT are short for ImageNet-LT, iNaturalist 2018, and Places-LT, respectively. Others are the same as in Table I.

Table IV shows the results. The re-sampling strategies (samplers) include the instance-balanced sampler (IBS) [12], square-root sampler (SRS) [76], effective number sampler (ENS) [18], and class-balanced sampler (CBS) [12].
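To make the difference between three of these samplers concrete, the sketch below computes per-class sampling probabilities in the commonly used form p_j = n_j^q / Σ_i n_i^q, where q = 1 gives IBS, q = 1/2 gives SRS, and q = 0 gives CBS. The function name and the example class sizes are hypothetical, and ENS is left out because its weighting depends on an additional hyper-parameter not specified in this section.

```python
def class_sampling_probs(class_counts, q):
    """Probability of drawing each class when class j is weighted by n_j ** q."""
    weights = [n ** q for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]


counts = [5000, 500, 50]  # hypothetical head, middle, and tail class sizes

ibs = class_sampling_probs(counts, 1.0)  # instance-balanced: follows the data
srs = class_sampling_probs(counts, 0.5)  # square-root: flattens the imbalance
cbs = class_sampling_probs(counts, 0.0)  # class-balanced: uniform over classes
```

Moving from IBS to SRS to CBS steadily raises the probability of drawing the tail class, which is the property the ablation relies on.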

E. Further Analysis
We conduct a series of experiments to further analyze the proposed method.

Effectiveness on MoE model: We select RIDE [35] as a representative of MoE models. Our reproduction of RIDE follows the original settings, which utilize the LDAM loss and the DRW strategy. We employ three experts in our MoE model and adopt the mixup technique to ensure a fair comparison. MoE models have been shown to outperform single models, albeit at the expense of increased model size. For instance, RIDE with GCL-E achieves an accuracy of 81.32% on CIFAR-10-LT with an imbalance ratio of 200, a clear improvement over the 79.03% achieved by a single ResNet-32 model with GCL-E. However, the model size of RIDE is 5.38 Mb, whereas the single model has a size of only 1.84 Mb. Tables V and VI show the performance improvement achieved by GCL on RIDE. Both versions of GCL improve RIDE's performance significantly on all datasets. The improvement of GCL-A ranges from 0.90% to 2.62%, while that of GCL-E ranges from 0.82% to 2.64%.

GCL-E vs. GCL-A: Combining Tables I and II, it can be observed that neither GCL-A nor GCL-E consistently outperforms the other. The reason is that iNaturalist 2018 and Places-LT have much larger imbalance ratios (r = 500 and 996, respectively) than the other datasets, among which ImageNet-LT has the largest r, namely 256. We draw the logit curves of the different forms of GCL in Fig. 5. In our setting, a large class has a small δ; the smaller the class size, the larger its corresponding δ. As the distance θ increases, the logit of GCL-A decreases faster than that of GCL-E. This is more noticeable for larger δ, as shown in Fig. 5b. A small distance produces a more obvious logit difference for GCL-A than for GCL-E. Therefore, in the case of a high imbalance ratio, GCL-E can make the separability of minority classes stronger so that the logit difference is more significant.
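The behavior of the two logit curves can be checked numerically with a small sketch. It assumes, for illustration only, that the GCL-A logit has the form cos(θ + δ‖ε‖), consistent with the expansion used in the time-complexity analysis, and that the GCL-E logit has the form cos θ − δ‖ε‖; the scale s is omitted as in Fig. 5, and all function names are hypothetical.

```python
import math


def logit_gcl_a(theta, delta, eps_norm=1.0):
    """Assumed angular form: the perturbation is added to the angle."""
    return math.cos(theta + delta * eps_norm)


def logit_gcl_a_expanded(theta, delta, eps_norm=1.0):
    """Same quantity written with the cosine addition formula."""
    phi = delta * eps_norm
    return math.cos(theta) * math.cos(phi) - math.sin(theta) * math.sin(phi)


def logit_gcl_e(theta, delta, eps_norm=1.0):
    """Assumed Euclidean (cosine) form: the perturbation is subtracted from the cosine."""
    return math.cos(theta) - delta * eps_norm


delta = 0.5
theta_near, theta_far = 0.2, 0.8

# How much each logit drops when the angle grows from theta_near to theta_far.
drop_a = logit_gcl_a(theta_near, delta) - logit_gcl_a(theta_far, delta)
drop_e = logit_gcl_e(theta_near, delta) - logit_gcl_e(theta_far, delta)
```

Under these assumptions, the drop for the angular form exceeds the drop for the cosine form over the same interval, in line with the observation that the GCL-A logit decreases faster as θ increases.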
Another rationale arises from the discrepancy in logit restrictions caused by varying imbalance ratios. Excessively strict logit constraints may lead the model astray. Without loss of generality, we analyze the most frequent class (denoted by the subscript 'head') and the least frequent class (denoted by the subscript 'tail'). For an input image that belongs to the tail class, GCL-A necessitates:
Considering δ = 0.5 as an example, when $\theta_{head} < \pi/2$, $\theta_{tail}$ being negative satisfies the requirements of the loss function, which could mislead the model training. The requirement that the angle between non-target classes and the target weight be greater than $\pi/2$ is overly stringent. For highly imbalanced datasets, namely iNaturalist 2018 and Places-LT, the discrepancies in perturbations between tail and head classes are more pronounced, which contributes to this phenomenon. In datasets with a smaller imbalance ratio, the disparities in perturbations are comparatively smaller, making this restriction relatively weaker. The majority of classes can adhere to their respective soft margin restrictions. However, opting for a smaller δ might make the added perturbation less conspicuous, thereby leading to less differentiation between classes. For GCL-E, an input image belonging to the tail class should satisfy the following inequality: When δ = 0.5, $\theta_{head} > \pi/3$ will cause $\theta_{tail}$ to be negative. In contrast, the constraints imposed by GCL-E are more lenient, resulting in a slight decrease in performance compared to GCL-A on datasets with a low imbalance ratio. Nonetheless, this relaxation does not predispose the model to erroneous interpretations stemming from excessively stringent restrictions.
Moreover, from another perspective, the selection of the perturbation magnitude δ plays a pivotal role for GCL-A. Additionally, cloud size selection should extend beyond mere class size considerations, and each variant of GCL may require its own strategy for cloud size selection. It is conceivable that the logarithmic form of cloud size utilized for GCL-A is not the optimal choice. We leave these questions for future study.

The Effect of Gaussian Cloud: To obtain additional insight, we visualize the embedding distribution using t-SNE projection. Since CE loss is selected as the loss function for several methods [11], [12], [65], notably MisLAS, which performs second-best in most cases, we visualize the embedding distribution obtained by CE loss for comparison. LDAM [13] is also based on an angular distance metric but uses a hard margin, so we show its embedding distribution as well. The embeddings are calculated from the samples in CIFAR-10-LT with r = 100. Fig. 6 shows the results. From Fig. 6a, [...] larger than that of other approaches. LDAM and GCL in both forms are angular distance metric based methods, so their embeddings are basically radial. Fig. 6b shows that the LDAM embedding of each class is more slender. This is caused by the hard margin, which strictly restricts the class region and results in overfitting to the training set. Thus, LDAM does not generalize as well on the test set as our proposed GCL. In Fig. 6c and Fig. 6d, on the training set, the embeddings of each class obtained via GCL in both forms have more obvious margins than those of CE and are more scattered than those of LDAM. The results on the test set verify the efficacy of our proposed approach. GCL-E and GCL-A generalize better, and the misclassified samples lie mainly in the edge regions of each class.

For better illustration, we additionally compare the embedding distributions of the most frequent (class 0) and least frequent (class 9) classes, along with their respective decision boundaries derived from various loss functions, in Fig. 7. Concerning the acquired features, within the training set, the overlap between the features of the head and tail classes obtained by LDAM and GCL is reduced compared to that obtained by CE loss, with a pronounced disparity observed for GCL-A. In addition, Fig. 7 shows more clearly that, compared to our proposed GCL, the LDAM embeddings appear to perform better on the training set but do not generalize well to unseen test samples. In Fig. 7b, more points of class 9 appear inside the class 0 area on the test set. By contrast, as shown in Fig. 7c and Fig. 7d, the misclassified points of class 9 lie mainly in the edge area of class 0 on the test set. Regarding the decision boundary, CE loss tends to ensure accurate classification of head classes while often disregarding tail classes. In contrast, owing to margins or perturbations that benefit the tail class, both LDAM and GCL take a holistic approach to class performance. However, this comes at the expense of head class performance to some extent: the decision boundary assigns some head class samples to the tail class.

Performance on Classes with Different Scale: To investigate the impact of GCL, we report the accuracy on classes of various scales on ImageNet-LT. The results are presented in Table VII. The classification accuracy of the baseline methods drops substantially on the middle and tail classes. LDAM-DRW increases the accuracy of middle and tail classes but considerably decreases that of head classes. GCL-E outperforms the other state-of-the-art methods on middle and tail classes by large margins. Meanwhile, its accuracy on the head classes decreases the least. By contrast, GCL-A improves more on middle and tail classes, but its damage to head classes is slightly higher than that of GCL-E and decoupling. In general, GCL-E performs well across all class scales, while GCL-A has the highest overall classification accuracy. Significantly improving the accuracy of tail classes while limiting the loss on head classes illustrates the superiority of our approach.

VI. CONCLUSION
In this paper, we have proposed to use Gaussian-form perturbation to augment the features for long-tailed classification. We have derived two forms of GCL, which are simple but effective. Both forms give tail classes larger perturbation amplitudes around their corresponding class anchors, which expands the spatial distribution of tail class embeddings. Furthermore, we have analyzed the rationale of the proposed method from different perspectives, which provides insights into how to obtain a representative and balanced-distributed embedding. Once a balanced-distributed embedding space is obtained, the classifier bias can be effectively addressed by simply retraining the classifier with class-balanced sampling. Comprehensive experiments on various benchmark datasets have demonstrated that the proposed Gaussian clouded logit in both forms achieves significant performance gains compared to the state-of-the-art methods. In addition, we have validated the properties of the proposed GCL through t-SNE visualization and the performance on classes of different scales.

Fig. 1. t-SNE visualization of the distorted embedding space. The embedding distributions of head and tail classes are shown in shaded areas. We can see that there are many overlapping regions between each class.

Algorithm 1: GCL with cRT
Input: Training dataset T;
Output: Predicted labels;
Initialize the model parameters ω of the backbone network ϕ((x, y); ω) randomly;
for iteration = 1 to I_0 do
    Sample a batch of data B from the original long-tailed dataset T with a batch size of b;

Fig. 3. Schematic comparison of hard margin and soft margin. The blue dots and pink triangles represent the head and tail classes, respectively. (Color for the best view.) (a) The hard margin strictly restricts samples from appearing in the corresponding region. (b) The soft margin allows outliers to appear in the region with a lower probability, which increases generalization.
corresponding cosine value. Compared to $z_j$, GCL-A adds an additional 2C multiplications and C subtractions. Computing $z_j^{GCL-A}$ therefore also has O(DC) time complexity. GCL in both forms thus imposes a negligible additional burden on the training process.

V. EXPERIMENTS
This section first introduces the five long-tailed datasets used in our experiments in Section V-A. Then, the detailed implementation settings of the experiments are presented in Section V-B. To demonstrate the effectiveness of GCL, we compare the proposed two forms of GCL with state-of-the-art methods based on a single model structure. The classification accuracy is compared in Section V-C. Moreover, Section V-E validates that GCL can also enhance the performance of the MoE model. Finally, the model validation experiments and ablation studies are conducted to show the properties of our proposed method in Section V-E.

A. Benchmark Datasets
We use five benchmarks: CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, iNaturalist 2018, and Places-LT.

CIFAR-10/100-LT: The original versions of CIFAR-10 and CIFAR-100 [62] are uniformly distributed datasets, which consist of 10 and 100 classes, respectively. They both contain 60K images with a size of 32 × 32. The training set contains 50K samples and the test set has 10K samples. Following the experimental settings in [18], [13], we down-sample the training images per class with the exponential function $n_i = n_i^o \times \lambda^i$, where $i$ is the class index (0-indexed), $n_i^o$ is the label frequency in the original balanced version, and $\lambda \in (0, 1)$. The test sets are kept unchanged. The imbalance ratio r is defined as the ratio of the maximum and minimum label frequencies, i.e., $r = \max(n_i) / \min(n_i)$, $i = 1, 2, \ldots, C$.
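As a worked example of the exponential down-sampling rule, the sketch below derives λ from a target imbalance ratio and lists the resulting class sizes. It assumes every class has the same original size, so that r = λ^(−(C−1)) and hence λ = r^(−1/(C−1)); the function name is hypothetical, and rounding to the nearest integer is a choice made here rather than something stated in the text.

```python
def long_tailed_counts(n_original, num_classes, imbalance_ratio):
    """Class sizes n_i = n_original * lam**i, with lam chosen so that n_0 / n_{C-1} = r.

    Assumes num_classes >= 2 and an equal original size for every class.
    """
    lam = imbalance_ratio ** (-1.0 / (num_classes - 1))
    return [round(n_original * lam ** i) for i in range(num_classes)]


# CIFAR-10 has 5,000 training images per class; target imbalance ratio r = 100.
counts = long_tailed_counts(5000, 10, 100)
ratio = max(counts) / min(counts)
```

Here the most frequent class keeps all 5,000 images and the least frequent class keeps 50, which recovers the requested ratio of 100.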
In the comparative experiments, we employ the three most widely used imbalance ratios, namely r = 50, 100, and 200.

ImageNet-LT and Places-LT: The original versions of ImageNet [63] and Places [64] are artificially balanced, large-scale real-world datasets for classification and localization. Following Liu et al. [65], we construct long-tailed versions of these datasets by truncating a subset from the balanced versions using the Pareto distribution with a power value α = 6. The original validation sets are employed for testing. In summary, ImageNet-LT comprises 115.8K training images from 1K categories with r = 1,280/5. Places-LT consists of 62.5K training images spanning 365 categories with r = 4,980/5.

iNaturalist 2018: iNaturalist 2018 [66] is a real-world fine-grained dataset for classification and detection, exhibiting a naturally long-tailed distribution. It contains different species of plants and animals collected from the real world in a wide variety of situations. This dataset contains over 437.5K training samples and more than 24.4K validation images from 8,142 categories. The official validation set is utilized for testing in the experiments. The imbalance ratio of iNaturalist 2018 is r = 1,000/2.
Fig. 5. The logit curve of GCL in different forms. For ease of visualization, the scale parameter s is omitted.

Fig. 6. Visualization of the embedding distribution obtained by different methods. t-SNE projection is utilized. The dataset is CIFAR-10-LT with r = 100. ResNet-32 is used as the backbone. (Color for the best view.)

Fig. 7. t-SNE visualization of the decision boundary (dashed line) between head (class 0) and tail (class 9) classes. The dataset is CIFAR-10-LT with r = 100 and the backbone network is ResNet-32. (Color for the best view.)

TABLE I
COMPARISON RESULTS ON CIFAR-10/100-LT W.R.T. TOP-1 ACCURACY (%).
⋆ denotes that the results are quoted from the corresponding papers. Other results are obtained by re-implementing with the official codes. The best and the second-best results are shown in underline bold and bold, respectively.

TABLE II
COMPARISON RESULTS ON IMAGENET-LT, INATURALIST 2018 AND PLACES-LT W.R.T. TOP-1 ACCURACY (%)

We explore several different cloud size adjustment strategies, including the power form with different exponents (1/3 and 1/4) and the logarithmic form. For a fair comparison, we use GCL-E, and the sampler and retraining strategy are selected as class-balanced sampling and cRT, respectively. The results are presented in Table III. The

TABLE IV
ABLATION EXPERIMENT OF DIFFERENT RE-SAMPLING AND RE-TRAINING STRATEGIES ON CIFAR-10-LT WITH r = 100.
The form of GCL is GCL-E and the re-training technique for all samplers is cRT. IBS decreases the performance slightly (from 80.55% to 80.52%), which indicates that training the classifier with IBS leads to classifier overfitting. SRS improves the model performance because it increases the sampling probability of tail classes. ENS and CBS perform better because they address the problem of negative gradient over-suppression by balancing the amount of data in each class. We use CBS in the comparison experiments because it achieves the best results among these samplers.

For the selection of the RT technique, we first train the backbone using GCL-E without any RT technique. Then we freeze the representation and re-balance the classifier with learnable weight scaling (LWS), the τ-normalized classifier (τ-NC), and cRT, respectively. We can observe that even without any RT technique, our approach (with a top-1 classification accuracy of 80.55%) still beats most state-of-the-art methods, including two-stage methods (for example, LDAM-DRW and BBN achieve 77.03% and 79.82%, respectively). All RT techniques significantly improve model performance, which demonstrates that, given a good representation, classification accuracy can be improved by simply re-balancing the classifier. cRT performs best among the classifier re-training techniques, improving the accuracy by 2.18% compared with no RT. Thus, we use cRT in the comparison experiments.
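Of the re-balancing techniques compared here, τ-normalization is the simplest to illustrate because it needs no further training. The sketch below rescales each class weight vector by its norm raised to the power τ, i.e., w_i / ‖w_i‖^τ, which is the usual form of this technique; the function name and the toy weights are hypothetical, and the sketch assumes no weight vector is all zeros.

```python
import math


def tau_normalize(class_weights, tau):
    """Rescale each class weight vector w to w / ||w||**tau (assumes ||w|| > 0)."""
    normalized = []
    for w in class_weights:
        norm = math.sqrt(sum(x * x for x in w))
        scale = norm ** tau
        normalized.append([x / scale for x in w])
    return normalized


# Toy classifier: the head class has a much larger weight norm than the tail class.
weights = [[3.0, 4.0], [0.3, 0.4]]

unchanged = tau_normalize(weights, 0.0)  # tau = 0 leaves the weights as they are
unit_norm = tau_normalize(weights, 1.0)  # tau = 1 gives every class unit norm
```

Intermediate values of τ interpolate between the two extremes, shrinking the norm gap between head and tail classifiers without removing it entirely.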

TABLE V
VALIDATION OF THE EFFECT ON MOE MODEL ON CIFAR-10/100-LT.

TABLE VI
VALIDATION OF THE EFFECT ON MOE MODEL ON LARGE-SCALE DATASET.