Adjusting Decision Boundary for Class Imbalanced Learning

The training of deep neural networks heavily depends on the data distribution. In particular, these networks easily suffer from class imbalance. The trained networks would recognize the frequent classes better than the infrequent classes. To resolve this problem, existing approaches typically propose novel loss functions to obtain better feature embedding. In this paper, we argue that drawing a better decision boundary is as important as learning better features. Based on our observations, we investigate how the class imbalance affects the decision boundary and deteriorates the performance. We also investigate the feature-distributional discrepancy between training and test time. Accordingly, we propose a novel, yet simple method for class imbalanced learning. Despite its simplicity, our method exhibits outstanding performance. Specifically, the experimental results show that we can significantly improve a network by scaling the weight vectors, even without additional training processes.


Introduction
Data is imbalanced in nature. We frequently encounter everyday information, while we rarely face singular information. Despite this imbalance, humans do not have any trouble in learning and recognizing things. Also, we often learn from rare but intense experiences. However, when it comes to the domain of machine learning, the imbalance of data becomes a critical issue. It deteriorates the performance of the trained machines. Especially in deep neural networks, an imbalanced data distribution is highly critical since the networks learn directly from the data distribution. To this end, many of the widely used public datasets provide a well-balanced class distribution. In other words, their collectors have discarded a large amount of data from frequent classes to adjust the class balance. This is clearly wasteful in terms of both information and human effort.
The optimal scenario is certainly to train a machine using all the data that we have access to. However, the disparity between samples often induces a disparity between the accuracy of classes. Features are often biased toward frequently appearing classes, so the less frequent classes have poor feature representation. Consequently, a trained machine recognizes frequent classes well, whereas it shows poor performance on infrequent classes. Understanding how such a phenomenon develops in class imbalanced learning could provide a novel viewpoint to mitigate the problem of imbalanced performance. In this work, we provide an in-depth analysis based on observation and propose a simple but powerful method for class imbalanced learning.
Code available: https://github.com/feidfoe/AdjustBnd4Imbalance
One important observation is that minimizing the empirical loss with a conventional training framework results in decision boundaries that allocate a larger volume of the feature space to more frequent classes. This suggests that the decision boundary is biased toward less frequent classes. Furthermore, we show that the bias in the decision boundary is closely related to the norm of each weight vector. Low sample frequency reduces the norm of the weight vector and leads to a disadvantageous decision boundary. Therefore, we propose the Weight Vector Normalization (WVN) method to draw the decision boundary at the middle of the weight vectors.
Another motivational observation concerns how the features of each class are distributed. If a network is trained for recognition, the features of each class form a cluster in the feature space. In the image space, the size of each cluster follows the sample frequency; more samples literally form a larger cluster. However, we have observed that the more frequent classes instead form smaller clusters in the feature space; more samples induce a higher density. This size reversal of clusters is due to the disparity of generalization. A trained neural network generalizes better to frequent classes, whereas it is over-fitted to infrequent classes. This suggests that a larger margin is required for less frequent classes. To resolve this problem, we propose a weight re-scaling method (RS). Once the network is trained, we adjust the decision boundary depending on the sample frequency. Despite the simplicity of the proposed method, it shows outstanding performance. Interestingly, we achieved better performance than the existing methods without using any additional training process. This suggests that we can obtain a feature extractor of fine quality by minimizing the empirical loss and that the problems with class imbalanced learning mainly lie in how to draw an appropriate decision boundary.
Our main contributions can be summarized as follows. Firstly, we present an in-depth analysis of class imbalanced learning in terms of the norm of the weight vector. Our analysis shows that there is a clear correlation between the norm and the sample frequency. Secondly, we show that we can adjust the decision boundary by controlling the norm of the weight vector. Based on these concordant observations, we propose a novel method that outperforms the previous methods. Lastly, we experimentally show that the features from our baseline network are already of fine quality; hence we can achieve better performance than the existing methods with a delicately drawn decision boundary.

Related Works
The great majority of existing algorithms used to resolve the data imbalance problem can be categorized as either re-sampling or re-weighting. The data re-sampling approach is intuitively straightforward and relatively simple: "Since we have an imbalanced number of data for each class, duplicate or discard what we already have." Properly over-sampled [5,12,3,1,22,13] or under-sampled [16,11,2,19] data makes a model perform better. However, both the over-sampling and under-sampling approaches have notable weaknesses. The over-sampling method causes a model to become over-fitted to the duplicated samples. To minimize the over-fitting problem, SMOTE [5] and its variants [12,3,1,22] have been proposed to generate samples of infrequent classes. The recently proposed generative adversarial networks [10,29,8] can also resolve this problem. However, it is difficult to overcome the fundamental deficiency in data samples.
On the other hand, the under-sampling approach easily deteriorates the overall performance, since it suffers from an even more severe data deficiency. As Sun et al. described in [24], the performance of neural networks increases logarithmically with the volume of training data. This implies that discarding samples is critical in terms of overall performance. In [19], the authors pointed out that the natural distribution is also valuable information, so we need to fully exploit the data. For this reason, the over-sampling strategy is preferred to under-sampling.
The re-weighting approach is also considered a cost-sensitive approach. The underlying concept of a cost-sensitive approach for class imbalanced learning is to treat different predictions differently. In [9], the authors studied weighting methods for the binary classification task. Similarly, cost-sensitive SVMs for highly imbalanced datasets were proposed in [25,28,20]. To obtain a better performing model, the ensemble method was adopted in both cost-sensitive [30] and sampling approaches [27,26].
Following the explosive development of CNN-based models, deep learning based algorithms that resolve the class imbalance problem have been proposed. Under excessive class imbalance, re-weighting the classification loss by the inverse of the sample frequency can make a network diverge during training. To this end, Cui et al. proposed the concept of the effective number of samples to rebalance the classification loss [6]. Prior to [6], Lin et al. proposed focal loss [21], which weights the classification loss depending on the prediction results. Focal loss helps the network to focus on poorly predicted samples and not become over-fitted to the well-predicted samples. Cao et al. proposed label-distribution-aware margin loss [4], which aims to generalize the minority class better by considering the label distribution in the loop. Khan et al. proposed a novel loss function by estimating the uncertainty of each class [17]. In [15,7], the authors also proposed another form of loss to train the neural networks by sampling neighbors.
As enumerated above, deep learning based methods mainly focus on studying a novel loss function for class imbalanced learning. Unlike these studies, our work involves neither re-sampling nor re-weighting. Our proposed method regulates the neural network and adjusts the decision boundary based on the sample frequency of each class. Similar to [4], we analyze the class imbalance problem in terms of generalization. We compare the generalization for each class and use the analysis as prior information to adjust the decision boundary. The details of the method are presented in the following section with a thorough justification.

Method
Before describing the method, we define the notations and overall framework. Unless otherwise mentioned, the notations that appear in this paper refer to the following.

Empirical Loss Minimization
Suppose that we have a training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ image-label pairs, where the label space is $\{1, ..., K\}$; it is a classification problem with $K$ classes. Since our target task is class imbalanced learning, we further segment the dataset as $D = \bigcup_{j=1}^{K} D_j$, where $D_j$ is the subset of the whole dataset that consists of samples from class $j$. Then, we define $n_j$ as the number of samples in $D_j$. Without loss of generality, we can set $n_1 \geq ... \geq n_K$. Following prior research [6], we define the imbalance ratio of the dataset as $n_1 / n_K$.
For training, we employ a general framework. We first feed an input image $x$ into a feature extraction network $f(\cdot)$. It outputs a feature vector, $f(x) \in \mathbb{R}^{d}$. Then, a classifier, which consists of a single fully connected layer, outputs a logit vector, $l(x) \in \mathbb{R}^{K}$, by calculating the inner product between $f(x)$ and the learnable parameter, $W \in \mathbb{R}^{d \times K}$. We can write $W$ in a vector form as $W = [w_1, ..., w_K]$, where $w_j \in \mathbb{R}^{d}$ is a weight vector for class $j$. The operation of the classifier can be written as follows:
$$l(x) = W^{\top} f(x). \tag{1}$$
For brevity, we drop the additive bias term. Note that we are considering a linear classifier. Then, we apply the softmax operation to convert $l(x)$ into a vector of probabilities, $p(x)$.
Each element of $p(x)$ represents the probability of input $x$ belonging to the corresponding class. We compute the cross-entropy loss between the one-hot encoded ground truth label and $p(x)$, so that we can calculate the gradients for the learnable parameters.
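The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the toy values of `W` and `f`, the helper names, and the 0-based class indices (the text uses labels 1, ..., K) are assumptions made for the example.

```python
import math

def logits(W, f):
    """l(x) = W^T f(x); W is given as a list of K weight vectors w_j of length d."""
    return [sum(w_i * f_i for w_i, f_i in zip(w, f)) for w in W]

def softmax(l):
    m = max(l)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in l]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(p, y):
    """Cross-entropy against a one-hot label; y is a 0-based class index."""
    return -math.log(p[y])

# Toy values (assumed): K = 3 classes, d = 2 feature dimensions, no bias term.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
f = [2.0, 1.0]

l = logits(W, f)
p = softmax(l)
loss = cross_entropy(p, 0)
```

The additive bias is omitted on purpose, matching the linear classifier considered in the text.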
The described framework trains the neural network by minimizing the empirical loss. Given a dataset, $D$, the empirical loss can be formulated as follows:
$$L(D) = \frac{1}{|D|} \sum_{(x, y) \in D} \ell(y, x), \tag{2}$$
where $|D|$ denotes the size of the dataset, and $\ell(\cdot, \cdot)$ denotes the cross-entropy loss between the label and $p(x)$. Considering $D$ to be a union of $D_j$, we can rewrite the empirical loss as the weighted summation of the class-wise empirical loss as follows:
$$L(D) = \sum_{j=1}^{K} \frac{n_j}{N} L(D_j). \tag{3}$$
From Eq. (3), it can be seen that minimizing $L(D)$ is highly likely to result in $L(D_1) \leq L(D_2) \leq ... \leq L(D_K)$ if the number of samples for each class is highly imbalanced. The asymmetrically optimized class-wise empirical loss is likely to result in a decision boundary that is biased toward less frequent classes [17]. We agree with the analysis of [17]; however, we focus more on the norm of each weight vector, unlike the authors, who focused on the directions.
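The decomposition of the empirical loss into class-wise terms weighted by n_j / N can be checked numerically. The per-sample loss values below are made up for illustration; only the identity itself comes from the text.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical per-sample losses, grouped by class label.
# Class 1 is frequent (4 samples), class 2 is infrequent (2 samples).
losses_by_class = {
    1: [0.1, 0.2, 0.1, 0.2],
    2: [0.9, 1.1],
}

all_losses = [v for group in losses_by_class.values() for v in group]
N = len(all_losses)

# Empirical loss over the whole dataset.
overall = mean(all_losses)

# Weighted summation of the class-wise empirical losses.
weighted = sum(len(group) / N * mean(group) for group in losses_by_class.values())
```

Because the frequent class carries the weight 4/6, reducing its loss lowers the total more than the same reduction on the infrequent class, which is the asymmetry the paragraph describes.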

Norm and Decision Boundary
We start with an observation on the tendency of the norm of each weight vector. Figure 1 shows how the norm of each weight vector changes over the training process. Early in the training, the norms do not show a clear correlation with the sample frequency, since the network still struggles even on the training data. Notably, the norms of all weight vectors are increasing. During the later stage of the training, the curves of the norms become disentangled, presenting an apparent correlation. Figure 2 illustrates the relative norm of the weight vectors. The norms are relatively uniform if the training data is well-balanced (Imb 1). Since $\|w_k\|_2$ can be interpreted as a multiplicative bias, the fluctuation presents a natural variation of the bias. However, when the training data is imbalanced, the norms follow the sample frequency. This tendency can be explained by the derivative of the loss for a sample $x \in D_j$ with respect to the norm of each weight vector:
$$\frac{\partial \ell(j, x)}{\partial \|w_k\|_2} = \left(p_k(x) - \mathbb{1}[k = j]\right) \|f(x)\|_2 \cos(\theta_k^x), \tag{4}$$
where $p_k(x)$ denotes the $k$-th element of $p(x)$, and $\theta_k^x$ denotes the angle between $f(x)$ and $w_k$. Eq. (4) shows that the sign of $\partial \ell(j, x) / \partial \|w_k\|_2$ is dependent on $\theta_k^x$, since the other terms always have a fixed sign. Once the network is sufficiently trained, so that the empirical loss has been sufficiently minimized, $\cos(\theta_k^x)$ is highly likely to have a positive value if $k = j$ for all $x \in D_j$. This suggests that $\partial L(D_j) / \partial \|w_j\|_2$ has a negative value, so $\|w_j\|_2$ is increased by minimizing $L(D_j)$. On the other hand, if $k \neq j$, $\partial L(D_j) / \partial \|w_k\|_2$ can be either positive or negative depending on the correlation between classes $j$ and $k$. Assuming a highly imbalanced sample frequency, Eq. (3) and Eq. (4) imply that $\|w_1\|_2$ is likely to have the largest value among the weight vectors.
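The sign argument above rests on the derivative of the cross-entropy loss with respect to the norm of each weight vector, which equals (p_k(x) - 1[k = j]) * ||f(x)|| * cos(theta_k) for a sample of class j. The sketch below compares that expression with a central finite difference. The directions, norms, and feature values are arbitrary assumptions, and the class index is 0-based.

```python
import math

def softmax(l):
    m = max(l)
    e = [math.exp(v - m) for v in l]
    s = sum(e)
    return [v / s for v in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(norms, dirs, f, j):
    """Cross-entropy for a sample of class j, with w_k = norms[k] * dirs[k]."""
    l = [s * dot(u, f) for s, u in zip(norms, dirs)]
    return -math.log(softmax(l)[j])

def grad_wrt_norms(norms, dirs, f, j):
    """Analytic derivative of the loss with respect to each norm."""
    proj = [dot(u, f) for u in dirs]  # equals ||f(x)|| * cos(theta_k) for unit u
    p = softmax([s * c for s, c in zip(norms, proj)])
    return [(p[k] - (1.0 if k == j else 0.0)) * proj[k] for k in range(len(norms))]

# Toy setting (assumed): three unit-length directions in a 2-D feature space.
dirs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
norms = [1.5, 1.0, 0.5]
f = [2.0, 0.5]
j = 0  # ground-truth class of the sample

analytic = grad_wrt_norms(norms, dirs, f, j)

eps = 1e-6
numeric = []
for k in range(len(norms)):
    up = list(norms)
    down = list(norms)
    up[k] += eps
    down[k] -= eps
    numeric.append((loss(up, dirs, f, j) - loss(down, dirs, f, j)) / (2 * eps))
```

For the ground-truth class the derivative is negative whenever the feature points roughly along its weight vector, so a gradient step enlarges that norm.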
The norm of the weight vector and the decision boundary are closely related. In the feature space, the decision boundary between classes $i$ and $j$ is the set of points that satisfy $w_i^{\top} f(x) = w_j^{\top} f(x)$; we can rewrite this hyperplane as follows:
$$\|w_i\|_2 \cos(\theta_i^x) = \|w_j\|_2 \cos(\theta_j^x). \tag{5}$$
This implies that the weight vector of larger norm forms a wider angle with the decision boundary. Figure 3 illustrates how the norm of each weight vector affects the decision boundary. Although the direction of each weight vector is fixed, the boundary changes depending on the norm. If we train the network without any regularization, the weight vectors are formed as shown in Figure 3 (a); the weight vector for the more frequent class has a larger norm, so the decision boundary is biased toward the less frequent class. As a result, a smaller volume of the feature space is allocated to the less frequent class. On the other hand, Figure 3 (b) shows that the decision boundary is drawn at the middle of the two weight vectors since they have similar norms. As illustrated in Figure 2, a well-balanced sample frequency brings about well-balanced norms. As a result, comparable volumes of feature space are allocated to each class. To sum up, an imbalanced sample frequency causes an imbalance in the norm of each weight vector, which indicates a discrepancy in the volume of feature space allocated to each class.
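The relation between norm and boundary can be made concrete in a two-dimensional feature space, where the boundary between two classes is the line through the origin orthogonal to w_i - w_j. The sketch below measures the angle between that line and each weight vector; the function name and the example vectors are assumptions chosen for illustration.

```python
import math

def boundary_angles(w_i, w_j):
    """Angles in degrees between the 2-D decision boundary and w_i, w_j."""
    d = (w_i[0] - w_j[0], w_i[1] - w_j[1])
    b = (-d[1], d[0])  # direction orthogonal to w_i - w_j
    # Pick the half-line on which both logits are positive.
    if w_i[0] * b[0] + w_i[1] * b[1] < 0:
        b = (-b[0], -b[1])

    def angle(u, v):
        c = (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))
        return math.degrees(math.acos(max(-1.0, min(1.0, c))))

    return angle(w_i, b), angle(w_j, b)

# Same directions (90 degrees apart), different norms.
imbalanced = boundary_angles((2.0, 0.0), (0.0, 1.0))
balanced = boundary_angles((1.0, 0.0), (0.0, 1.0))
```

With norms 2 and 1, the boundary sits about 63.4 degrees away from the larger vector and about 26.6 degrees away from the smaller one, so it leans toward the class with the smaller norm. With equal norms, it bisects the two directions.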
The volume discrepancy is in accord with the empirical distribution. Since only a small number of samples are provided from the $K$-th class, the network is trained to allocate a small volume to it. Conceptually, this phenomenon is against our desire, since it implies that the network considers more frequent classes as more important classes. We want to train the network to treat all classes as equally important. To this end, we propose Weight Vector Normalization (WVN), which normalizes the weight vectors at the end of each training iteration. Then, the stochastic gradient descent optimizer becomes a projected stochastic gradient descent optimizer. From the perspective of the prior distribution, WVN is used to force the class conditional distribution to have the same variance independent of the sample frequency.
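A single WVN update can be sketched as an ordinary gradient step followed by a projection of each class weight vector back to a fixed norm. The target norm of 1, the function names, and the small `eps` guard are assumptions; the text only states that the weight vectors are normalized at the end of each iteration.

```python
import math

def normalize_rows(W, eps=1e-12):
    """Rescale every class weight vector to unit L2 norm (assumed target norm)."""
    out = []
    for w in W:
        n = math.sqrt(sum(v * v for v in w))
        out.append([v / max(n, eps) for v in w])
    return out

def sgd_step_with_wvn(W, grad_W, lr):
    """One plain SGD step on the classifier weights, then the projection."""
    stepped = [[v - lr * g for v, g in zip(w, gw)] for w, gw in zip(W, grad_W)]
    return normalize_rows(stepped)

# Toy values (assumed): two classes, d = 2, arbitrary gradients.
W = [[3.0, 4.0], [0.3, 0.4]]
grad_W = [[0.5, -0.5], [-1.0, 2.0]]
W_next = sgd_step_with_wvn(W, grad_W, lr=0.1)
norms_next = [math.sqrt(sum(v * v for v in w)) for w in W_next]
```

Only the direction of each weight vector survives the projection, so the sample frequency can no longer express itself through the norm.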

Generalization
Another important observation of ours concerns generalization and the size of the feature cluster. A higher sample frequency implies a bigger cluster in the image space. Even if we consider the effective number of samples [6], the size of the cluster monotonically increases with the number of provided samples. On the contrary, we have observed that the size of the cluster is not monotonic in the feature space, since the feature extraction network is trained to project all the samples from each class to a corresponding point. Moreover, owing to the gap in generalization between classes, the size of the feature cluster at test time monotonically decreases with the number of samples.
Consider a neural network trained to minimize empirical loss. There is broad agreement that more training data implies better generalization. Intuitively, more training data represents a high sampling rate, which is associated with less uncertainty [17]. The same analysis is applicable to each class. If the empirical distribution of the training dataset is imbalanced, the network would provide better generalization for more frequent classes. This is intuitively straightforward, since the network has seen more diverse data points from classes with a high sample frequency. Consequently, for frequent classes, the training and test features form clusters close to each other. On the contrary, if only a few samples are provided, the over-fitting problem arises. A feature extraction network projects training samples close to each other in the feature space, while projecting the test samples far apart from the training samples. The most representative method for resolving the over-fitting problem is to reduce the model capacity. Unfortunately, this is not practicable in the class imbalanced scenario, since the reduction of model capacity will deteriorate the performance on the frequent classes. As a result, the network is trained to provide poor generalization for less frequent classes.
Figure 4 presents the cluster size and generalization for each class. σ-Train and σ-Test denote the size of the feature cluster for each class during the training and test times, respectively. To measure the size of each cluster, we project all the features onto the unit sphere of the feature space. Then, we calculate the angular standard deviation of each cluster. Although it is not precisely monotonic, σ-Train increases with the sample frequency. Moreover, the size of each cluster becomes saturated if the class has sufficient training samples. This concurs with the concept of the effective number proposed in [6]. However, when it comes to the test time, the size of the cluster shows the opposite tendency; σ-Test suggests that the features from less frequent classes are more broadly distributed, forming larger clusters with lower density. This shows that the network is well generalized for frequent classes, whereas it is over-fitted for infrequent classes.
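One way to compute such a cluster size is sketched below. The text does not give the exact formula for the angular standard deviation, so the definition used here (root mean square of the angles between each normalized feature and the normalized mean direction) is an assumption, as are the function names and toy features.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angular_std_deg(features):
    """RMS angle (degrees) between unit-normalized features and their mean direction."""
    U = [unit(f) for f in features]
    d = len(U[0])
    center = unit([sum(u[i] for u in U) / len(U) for i in range(d)])
    angles = []
    for u in U:
        c = sum(a * b for a, b in zip(u, center))
        angles.append(math.acos(max(-1.0, min(1.0, c))))
    return math.degrees(math.sqrt(sum(a * a for a in angles) / len(angles)))

# Toy clusters (assumed): a tight one and a broad one around the same direction.
tight = angular_std_deg([[1.0, 0.1], [1.0, -0.1]])
broad = angular_std_deg([[1.0, 1.0], [1.0, -1.0]])
```

Normalizing first removes the feature magnitude, so the statistic reflects only how widely the directions are spread.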
The disparity in generalization is more evident when we measure the difference between the training and test clusters. In Figure 4, ∠µ of each class denotes the angular gap between the centers of the training and test clusters. We consider this as a measure of generalization. It shows how far apart the cluster centers are placed during training and test times. The gap shows a distinct correlation with the sample frequency. In the case of the least frequent class, C10, the training and test clusters are nearly 40° apart. Since the last layer of the feature extraction network is a ReLU activation, the maximum angular distance between two feature vectors is 90°. Considering this, we can roughly appreciate the significance of the angular gap between the training and test clusters.
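The angular gap between cluster centers can be sketched in the same spirit. Taking the center as the normalized mean of the unit-normalized features is an assumption, and the toy features are made up; they are non-negative to mimic the output of a ReLU layer, which is why the gap cannot exceed 90 degrees.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def center_direction(features):
    """Normalized mean of the unit-normalized features (assumed definition)."""
    U = [unit(f) for f in features]
    d = len(U[0])
    return unit([sum(u[i] for u in U) / len(U) for i in range(d)])

def angular_gap_deg(train_features, test_features):
    a = center_direction(train_features)
    b = center_direction(test_features)
    c = sum(x * y for x, y in zip(a, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

# Toy non-negative features (assumed), as produced after a ReLU activation.
train = [[1.0, 0.0], [2.0, 0.0]]
test = [[1.0, 1.0], [3.0, 3.0]]
gap = angular_gap_deg(train, test)

# Two orthogonal non-negative directions give the largest possible gap.
max_gap = angular_gap_deg([[1.0, 0.0]], [[0.0, 1.0]])
```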
Figure 5 visually describes the disparity of generalization. It is a t-SNE plot of the most and the least frequent classes from Long-Tailed CIFAR-10 with an imbalance ratio of 100. For brevity, the same number of samples are plotted from both classes. We can visually verify the disparity in generalization. For the most frequent class, the test features are distributed similarly to the training features, while the test features from the least frequent class are more broadly distributed, covering the training feature distribution. Naturally, the center of the test cluster is far apart from the center of the training cluster.
These observations suggest that the decision boundary should instead lean toward classes with a high sample frequency, thereby allocating a smaller volume to them. This is the opposite of the tendency that the minimization of empirical loss induces. A similar analysis appears in [4,17], where the authors suggest that we should encourage a bigger margin for minority classes and propose novel loss functions based on their suggestions. Since Figure 3 shows that we can adjust the decision boundary by controlling the norm of each weight vector, we re-scale the weight vectors according to the sample frequency of each class, as given in Eq. (6), where γ is a hyper-parameter. To sum up, our overall training algorithm (Algorithm 1) computes the gradient and updates the parameters, normalizes each weight vector at the end of every training iteration, and re-scales the weight vectors once the training is complete. Note that if γ = 0, all the weight vectors remain the same, ablating the re-scaling method. With a larger value of γ, we allocate more volume of feature space to infrequent classes, admitting that our network is poorly generalized for those classes.
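A post-training re-scaling step with the stated properties can be sketched as follows. The exact scaling rule is not recoverable from the text here, so the factor `(n_max / n_k) ** gamma` is an assumption. It was chosen because it leaves the weights unchanged for γ = 0 and enlarges the norm of infrequent classes as γ grows. The function name and toy values are also hypothetical.

```python
import math

def rescale_weights(W, counts, gamma):
    """Scale each class weight vector by (n_max / n_k) ** gamma (assumed rule)."""
    n_max = max(counts)
    return [[(n_max / n) ** gamma * v for v in w] for w, n in zip(W, counts)]

def l2_norm(w):
    return math.sqrt(sum(v * v for v in w))

# Toy classifier (assumed): class 0 is frequent, class 1 is infrequent.
W = [[3.0, 4.0], [0.3, 0.4]]
counts = [5000, 50]

unchanged = rescale_weights(W, counts, gamma=0.0)
rescaled = rescale_weights(W, counts, gamma=0.5)
```

Since the step touches only the classifier weights, it can be applied to an already trained network and repeated for several values of γ without retraining.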

Experiments
In this section, we present the experimental results and analysis. We evaluate our proposed methods on the object classification task with modified CIFAR [18] and Tiny ImageNet [23] datasets. The proposed weight vector normalization is denoted as WVN, and the re-scaling method is denoted as RS.

Visual recognition on CIFAR
The CIFAR dataset originally contains 50,000 training images and 10,000 test images. Since the dataset provides a well-balanced empirical distribution, we need to artificially implant the imbalance. To verify our algorithm and compare with the results of previous research, we applied the long-tailed imbalance implanting protocol proposed in [6]. The number of training samples decreases according to an exponential function, while all the test samples were used as they are. This means that a network should be trained to recognize every class regardless of its sample frequency. Moreover, we used the imbalanced CIFAR dataset for further analysis; the decreasing number of samples allows us to analyze whether a tendency is dependent on the sample frequency or not. We used this dataset for the figures in prior sections as well. For the network architecture, we used ResNet-32 [14] for all the experiments on CIFAR. We trained the network over 180 epochs with an initial learning rate of 0.1. The learning rate was decayed by a factor of 0.1 at the 80th and 150th epochs.
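The exponentially decreasing class sizes can be sketched as below. The specific profile, n_k = n_max * (1 / ratio) ** (k / (K - 1)) truncated to an integer, is an assumption about how the protocol of [6] is realized. The function name is hypothetical, and 5,000 is the per-class training size of balanced CIFAR-10 (50,000 images over 10 classes).

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class training set sizes under an exponential profile (assumed form)."""
    counts = []
    for k in range(num_classes):
        frac = k / (num_classes - 1)
        counts.append(int(n_max * (1.0 / imbalance_ratio) ** frac))
    return counts

# CIFAR-10-like setting with an imbalance ratio of 100.
counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_ratio=100)
```

The most frequent class keeps all of its samples, and the least frequent class keeps roughly n_max divided by the imbalance ratio, so n_1 / n_K matches the ratio defined earlier up to rounding.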
Table 1 summarizes the classification error rates for the long-tailed CIFAR dataset. As a baseline algorithm, a network is trained by minimizing the empirical cross-entropy loss without any regularization. The under-sampling strategy severely degrades the performance when the dataset is highly imbalanced. Moreover, the re-weighting approach was not effective either with a high imbalance ratio. The results show that our proposed method outperforms the other methods when the classes are imbalanced. If the classes are well balanced, normalizing each weight vector is the same as not using a multiplicative bias in the classifier network. It reduces the total degrees of freedom and affects the performance. However, irrespective of whether the performance was improved or degraded, the variation was marginal.
Figure 6 presents the confusion matrices of our baseline and WVN+RS model on Long-Tailed CIFAR-10 with an imbalance ratio of 100. In Figure 6 (a), the color of the diagonal elements is fading. This shows that the accuracy increases with the number of samples, suggesting that the different sample frequency induces the disparity of the accuracy. On the other hand, the high values in the bottom left corner represent the low precision of frequent classes and the low recall of the infrequent classes. This implies that the decision boundary leans toward minority classes, while the feature points are biased toward majority classes. Figure 6 (b) shows that the WVN+RS method alleviates the disparity. Compared to the baseline, the model trained using our method provides more balanced accuracy. The performance on infrequent classes is improved while preserving the performance on frequent classes.
Furthermore, the most striking result is that of Baseline+RS, which is the off-the-shelf proxy of our proposed method. Algorithm 1 shows that each weight vector needs to be normalized at the end of every training iteration. Instead, we only apply Eq. (6) after the network is trained. The parameters of the classifier are re-scaled only once after the whole training is done. In other words, we have ablated the weight vector normalization. Therefore, all the parameters except those in the classifier are identical to those of the baseline model; it uses the same features as the baseline model. The direction of each weight vector is also preserved from the baseline. Notably, it shows better performance than the other methods. This suggests that the features extracted by baseline models are of satisfactory quality.

Visual recognition on Tiny ImageNet
We also evaluated the proposed method with Tiny ImageNet [23]. The Tiny ImageNet dataset has 200 classes, and each class has 500 training samples and 50 test samples. To implant the data imbalance, we applied the same protocol used for the CIFAR dataset. In addition, a step imbalance [2] was implanted to verify whether our proposed method can resolve various types of imbalance. In step imbalance, all the majority classes have the same number of samples. All the minority classes also have the same, but smaller, number of samples. Half of the classes were selected as the minority classes. We used the ResNet-18 architecture, and γ was fixed at 0.1 for all the experiments. Table 2 presents a summary of the validation errors on the Tiny ImageNet dataset. Similar to the results on the imbalanced CIFAR dataset, the WVN+RS method showed the best performance in every experiment except for the case of step imbalance with a ratio of 100. In that case, the Baseline+RS model performed the best. The Baseline+RS models showed remarkable performance on Tiny ImageNet as well. Considering that the extracted features are completely identical to those of the baseline model, the superior results of Baseline+RS show the importance of the decision boundary. To this end, we further analyze the off-the-shelf proxy of our proposed method.
The last row of Table 3 denotes the performance gap between the proposed method and Oracle. Although the Oracle performance with the baseline feature extractor is superior to that of the WVN model, the results suggest that the feature extractor of the WVN model can achieve better performance when we apply the RS method.
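The step imbalance setting described for Tiny ImageNet can be sketched as follows. Which classes become the minority is not stated in the text, so assigning the second half of the class indices to the minority group is an assumption, as is the function name.

```python
def step_imbalance_counts(n_max, num_classes, imbalance_ratio, minority_fraction=0.5):
    """Per-class sizes: majority classes keep n_max, minority classes n_max / ratio."""
    n_min = int(n_max / imbalance_ratio)
    num_minority = int(num_classes * minority_fraction)
    num_majority = num_classes - num_minority
    # Assumed layout: majority classes first, minority classes last.
    return [n_max] * num_majority + [n_min] * num_minority

# Tiny ImageNet-like setting: 200 classes, 500 training samples each, ratio 10.
counts = step_imbalance_counts(n_max=500, num_classes=200, imbalance_ratio=10)
```

Unlike the long-tailed profile, only two distinct class sizes occur here, which makes it easy to see whether a method helps the minority group as a whole.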

Discussion
The overall experimental results indicate that (1) adjusting the norm of each weight vector can effectively regulate the network to learn from imbalanced data, and (2) the features from the baseline network are already of fine quality. In particular, by observing the results of the Baseline+RS model, we conclude that drawing an appropriate decision boundary is as important as extracting features of superior quality. The results shown in the previous sections imply that the softmax cross-entropy loss is advantageous in training the feature extractor, whereas the resulting classifier provides a biased decision boundary. To quantify the quality of the extracted features, we have fine-tuned the classifier with test samples, while the feature extraction network is fixed. We denote these performances as Oracle. Table 3 summarizes the validation error of the proposed methods and their Oracle performance depending on the feature extractor. Since the classifier of the Oracle model is trained and validated with the same test data, its validation error can be interpreted as a lower bound for the corresponding feature extractor. In Table 3, RS denotes the performance after the weight vectors are re-scaled. Interestingly, the Oracle of the Baseline feature extractor performs better than the Oracle of WVN in both cases. This suggests that the feature extractor is rather degraded by the weight vector normalization in terms of the potential performance. Nevertheless, the WVN+RS model performs better than the Baseline+RS model. The last row of Table 3 shows the performance gap between the proposed methods and their Oracle. Note that the performance improved by adding the RS method approaches much closer to the Oracle performance if the weight vector is normalized. This shows that the features from the WVN model are aligned more appropriately, so that we can draw a better decision boundary.
A more important benefit of WVN is its lower sensitivity to γ. Figure 7 shows that the Baseline+RS model is more sensitive to the selection of γ. Note that if γ is zero, Baseline+RS is the same as Baseline. A larger γ denotes a stronger adjustment on the decision boundary. Therefore, the robustness with regard to γ implies that the features are clustered with a large margin at test time. The robustness is also important in terms of selecting the hyper-parameter. Since the proposed weight re-scaling is a post-processing of the training procedure, adjusting γ is relatively handy compared with other cost-sensitive methods. Nevertheless, it is clearly advantageous that we can determine the hyper-parameter effortlessly.
From another viewpoint, the models are more sensitive on Long-Tailed CIFAR-100 than on Long-Tailed CIFAR-10. In the experiments on Long-Tailed CIFAR-10, the proposed method consistently showed better performance than the baseline. Moreover, the variation of performance is marginal across the values of γ. However, on CIFAR-100, the robustness against γ was comparatively degraded. A small interval of γ improves the performance, and a larger value degrades the performance. This is understandable since the classes in CIFAR-100 are more fine-grained than the classes in CIFAR-10. Fine-grained classes are highly likely to have a smaller margin than coarse-grained classes. Since the proposed re-scaling method is effective when the decision boundary is adjusted within the margin, a small margin causes high γ-sensitivity.

Conclusion
In this paper, we proposed two methods for class imbalanced learning: weight vector normalization (WVN) and weight re-scaling (RS). Our methods showed outstanding performance despite their simplicity. The key idea of our methods is the causal relationship between the imbalanced sample frequency, the norm of the weight vector, and the decision boundary. We showed that the disparity in the norm is a consequence of class imbalance and described how the disparity affects the decision boundary. Moreover, we experimentally showed that we could successfully adjust the decision boundary.
Most of the deep learning based methods for class imbalanced learning follow a cost-sensitive approach, seeking a better loss function. The underlying concept of cost-sensitive methods is to train a better feature extractor. Although the models trained with those methods perform better than the baseline, this work showed that a simple adjustment on the baseline further improves the performance. In particular, the results of the Baseline+RS model show that the baseline features were already of fine quality. This suggests that drawing a better decision boundary is as important as training a better feature extractor. We hope that this work provides a novel viewpoint and inspiration for class imbalanced learning.

Figure 1. How the norm changes during the training process. Note that class 1 is the most frequent class, while class 9 is the least frequent class in the figure. Early in training, the norms do not show a clear correlation with the sample frequency. However, in the later stage, the norm of each class is aligned with the sample frequency.

Figure 3. Correlation between the decision boundary and the weight vectors. (a) If two weight vectors have different norms, the decision boundary is drawn leaning toward the weight vector with the smaller norm. (b) If they have identical norms, the decision boundary is drawn at the middle. This figure also shows that we can adjust the decision boundary by adjusting the norm.

Figure 4. The disparity in generalization depending on the sample frequency. σ-Train and σ-Test denote the size of the feature cluster for each class during the training and test times, respectively. Although the training features are well-clustered, σ-Test suggests that the test features from less frequent classes are more broadly distributed. ∠µ denotes the angular gap between the centers of the training and test clusters. This suggests that the decision boundary should lean toward more frequent classes.

Figure 6. Confusion matrices of the (a) baseline and (b) proposed method on Long-Tailed CIFAR-10 with an imbalance ratio of 100. The fading color of the diagonal elements in (a) implies the disparity of the accuracy. With our proposed algorithm, the performance on infrequent classes is improved while preserving the performance on frequent classes.

Figure 7. Validation errors of the proposed methods depending on γ. This figure illustrates that the WVN model is more robust to γ than the baseline model.

Table 2. Validation errors of ResNet-18 on Tiny ImageNet datasets. The proposed method shows notable improvements.

Table 3. Evaluation error of the Oracle and proposed method.