Class Balanced Loss for Image Classification

In the study of image classification, neural network learning relies heavily on datasets. Due to variability in the difficulty of collecting images in reality, datasets tend to have class imbalance problems, which undoubtedly increases the difficulty of classification. During the training of a neural network, classes with a large number of images are naturally trained more often than classes with a small number of images. Because of this imbalanced training, the classification ability of neural networks on test and validation sets differs greatly across categories: classes that receive more training yield better test results, while classes that receive less training yield poor ones. In this paper, we propose two kinds of balanced loss functions, namely, CEFL loss and CEFL2 loss, by rebalancing the cross-entropy loss function and focal loss function. The experimental results show that the proposed loss functions significantly improve classification accuracy on class-imbalanced datasets.


I. INTRODUCTION
In the past decade, deep learning methods have been widely used in image classification, target detection, voice recognition and other related fields. Many well-known network structures, such as AlexNet [1], VGG [2], ResNet [3], Inception [4], and Res2Net [5], have been widely applied and studied. Furthermore, breakthroughs are still being made in those fields [6], [7]. Despite these advances, network performance tends to decline significantly in the face of some class-imbalanced data problems [8]. Therefore, research addressing the network performance degradation caused by class-imbalanced data has great importance [9].
Supervised learning requires a large quantity of accurately labelled data. In image classification, every data sample belongs to a known category. However, real-world datasets often face a class imbalance problem [10], [11]. As illustrated in Fig. 1, the Kaggle Mushroom Classification dataset demonstrates an imbalanced distribution of different categories of data. In this dataset, the data samples of some classes occupy a large proportion of the total samples, while those of other classes take up a small proportion. The training process of the neural network will shift towards those classes
with a large number of samples, which is conducive to the learning of those classes with a large number of samples. Correspondingly, classes with a small number of samples are studied less. As a result, classes with a few samples are more likely to be misclassified. In extreme cases, the designed neural network may ignore the minority class [10]. Models trained with class-imbalanced datasets have poor generalization ability, are prone to overfitting and tend to perform well in the training set but poorly in the validation set or test set [9], [12].
When neural networks are used to study the problem of image classification, the loss function describes the relationship between the predicted output of the network ŷ and the true label y, reflects the fitting degree of the model to the data, and is an important index to evaluate network performance. Common loss functions, including the cross-entropy loss function, perform well when training on datasets with a balanced distribution, such as CIFAR-10/100 [13] and MNIST [14]. However, when the training datasets exhibit class imbalance, the cross-entropy loss function often does not classify well [15].
Samples from minor classes tend to have higher loss than those from major classes, as the features learned for minor classes are usually poorer [11], [15]. Focal loss [16], proposed by Kaiming He, can alleviate the class imbalance problem in classification to a certain extent [17]. Fig. 2 shows the performances of both the focal loss and cross-entropy loss functions when dealing with the class imbalance problem. It can be seen from the figure that focal loss offers a certain degree of improvement over the cross-entropy loss function.
Focal loss has the following form:

FL(p) = −(1 − p)^γ · log(p),

where p is the predicted probability of the network output for the ground truth and γ is a constant hyperparameter. When γ = 0, the focal loss is formally equivalent to the cross-entropy loss. Focal loss distinguishes the weights of different samples through (1 − p)^γ. With an increase in network training frequency, classes with a large number of samples in the training set will have a higher ratio of training frequency, and neural networks will naturally learn more from these classes. Therefore, the classification of classes with a large number of samples will have greater accuracy, and samples in such classes will be easier to classify (p > 0.5) [16]. Specifically, focal loss assigns a smaller loss to these classes. In contrast, the accuracy of classes with fewer samples is relatively lower, which makes it more likely that they will become poorly classified (p ≤ 0.5) [16]. Focal loss assigns a relatively larger loss to them. Therefore, focal loss focuses more on poorly classified samples from minor classes. The focal loss graph is shown in Fig. 3, which reveals an interesting phenomenon: when using the focal loss function in network training (e.g., γ = 2), the loss of easy samples (p > 0.5) is relatively small; the loss of hard samples (p ≤ 0.5) remains relatively large but, compared to that of the cross-entropy loss function at the same probability p, becomes smaller. Naturally, a bold idea arises: based on the characteristic of cross-entropy loss of keeping a higher loss on well-classified samples, we can add a cross-entropy loss term to focal loss so that the neural network can better distinguish between easy samples and hard samples. Aiming to solve this problem, we introduce two loss functions, namely, CEFL loss and CEFL2 loss, based on cross-entropy loss and focal loss, to further increase the loss for easy samples and hard samples.
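As a concrete illustration, the focal loss above can be sketched directly on the ground-truth probability p (a minimal NumPy sketch for intuition, not the authors' implementation; in practice the loss is computed from logits inside the training loop):

```python
import numpy as np

def cross_entropy(p):
    """Cross-entropy loss for the ground-truth probability p."""
    return -np.log(p)

def focal_loss(p, gamma=2.0):
    """Focal loss: down-weights well-classified samples via (1 - p)**gamma."""
    return -((1.0 - p) ** gamma) * np.log(p)

# Hard sample (p <= 0.5): loss stays comparatively large.
# Easy sample (p > 0.5): loss is strongly suppressed relative to cross-entropy.
p_hard, p_easy = 0.2, 0.9
```

With γ = 0 the two functions coincide, which matches the statement above that focal loss formally reduces to cross-entropy loss.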
Renderings of the CEFL loss and CEFL2 loss functions are shown in Fig. 4 and are discussed in more detail in Section III. The figure shows the proposed CEFL loss function and CEFL2 loss function. Assume p is the predicted probability of the network for the ground truth. When p ≤ 0.5 (the samples are poorly classified), CEFL loss and CEFL2 loss are closer to the cross-entropy loss, while when p > 0.5 (well-classified samples), CEFL loss and CEFL2 loss approach the focal loss.
Our main contributions can be summarized as follows: (1) Based on the existing cross-entropy loss function and focal loss function, we propose two new balanced loss functions, CEFL loss and CEFL2 loss, which help to improve the image classification accuracy of class-imbalanced datasets.
(2) We verify the superiority of CEFL loss and CEFL2 loss on three datasets with imbalanced data. The experimental results show that the two loss functions performed similarly on the imbalanced CIFAR-10/CIFAR-100 dataset, while the CEFL loss function was slightly better on the Kaggle Mushroom Classification dataset and the CEFL2 loss function was slightly better on the AI Challenger 2018 Crop Disease dataset. We believe that, like focal loss, CEFL loss and CEFL2 loss can be extended to the field of target detection to address extreme imbalances between foreground and background classes during training [16], [18]-[20].

II. RELATED WORK
Many methods have recently been proposed to address class-imbalanced datasets [21]-[25] and can be broadly divided into two categories. One approach resamples the training dataset, and the other reweights the training loss for imbalanced classes.

A. RESAMPLING
Resampling is the process of over-sampling (adding duplicate samples to) classes with minority samples, under-sampling (deleting samples from) classes with majority samples, and sometimes both. Resampling can rebalance the distribution of training data to some extent, but over-sampling may increase the number of duplicate samples in classes with few samples, which may lead to overfitting of the model. In addition, for some datasets, the number of samples in minority classes is extremely limited, perhaps only a few images. In this case, it is very inappropriate to delete images from classes with many samples to balance the dataset. To solve this problem, adjacent samples can be interpolated [26] or synthesized [27], [28] for classes with minority samples. However, these new samples will inevitably include noise, which could reduce the generalization ability of the model.
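The over-sampling idea above is often realized by drawing training samples with probability inversely proportional to class frequency. A minimal sketch (the per-sample weights computed here are the kind one would pass to a weighted sampler such as PyTorch's `WeightedRandomSampler`; the label list is made up for illustration):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """One weight per sample: 1 / (count of that sample's class).

    Drawing with these weights makes every class equally likely per draw,
    which over-samples minority classes and under-samples majority ones.
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

labels = [0, 0, 0, 0, 1]          # class 0: 4 samples, class 1: 1 sample
weights = inverse_frequency_weights(labels)
```

Each class then receives the same total sampling mass (here, 4 × 0.25 for class 0 and 1 × 1.0 for class 1), regardless of its size.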

B. DATA AUGMENTATION
Data augmentation [29] is one of the most widely used methods before data are put into the network. There are many ways to augment data, most of which are available in current deep learning frameworks (such as PyTorch, Caffe and TensorFlow), which users can easily call as needed. To some extent, data augmentation can expand classes with small sample sizes and reduce overfitting slightly. However, the data augmentation method is much less effective for classes with very few data, so it is often used in conjunction with other methods.
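Two of the most common augmentations, random horizontal flip and padded random crop (the usual CIFAR recipe), can be sketched directly on NumPy arrays; the padding size and flip probability here are illustrative defaults, not values from the paper:

```python
import numpy as np

def random_horizontal_flip(img, p=0.5, rng=np.random):
    """Flip an HxWxC image left-right with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def random_crop(img, pad=4, rng=np.random):
    """Zero-pad by `pad` pixels on each side, then crop back to the original size."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.randint(0, 2 * pad + 1)
    left = rng.randint(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

img = np.zeros((32, 32, 3), dtype=np.uint8)   # a dummy CIFAR-sized image
aug = random_crop(random_horizontal_flip(img), pad=4)
```

In practice one would call the equivalent built-in transforms of the chosen framework rather than hand-rolling these.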

C. HARD EXAMPLE MINING (HEM)
The features learned for minor classes are usually poorer, so samples from minor classes tend to become hard samples. HEM is a common method to deal with hard samples. The core idea of this method is to select hard samples according to the loss of input samples and put them back into network training. Online hard example mining (OHEM) [30] is a typical hard sample mining method. It aims to screen out the first few samples with the greatest loss and then rejoin the network for retraining. Like focal loss, OHEM places more emphasis on misclassified examples, but unlike focal loss, OHEM completely discards easy samples.
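The OHEM selection step described above, keeping only the k samples with the largest loss for the backward pass, can be sketched as follows (a schematic of the selection logic, not the original implementation):

```python
def select_hard_examples(losses, k):
    """Return indices of the k samples with the largest loss in a batch.

    Only these samples would be kept for the backward pass; the rest
    (the easy examples) are discarded entirely, unlike focal loss,
    which merely down-weights them.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])

batch_losses = [0.1, 2.3, 0.05, 1.7, 0.4]
hard_idx = select_hard_examples(batch_losses, k=2)
```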

D. FOCAL LOSS
In 2017, He et al. proposed focal loss. The paper makes it clear that data imbalance raises two problems: (1) training is inefficient, as most locations are easy negatives that contribute no useful learning signal; and (2) en masse, the easy negatives can overwhelm training and lead to degenerate models [16]. In general, when training class-imbalanced datasets with neural networks, classes with a large number of samples will receive more training and become easier to classify, while classes with a small number of samples will receive less training and become more difficult to classify. Focal loss is designed expressly so that the few difficult samples make a large contribution to the total training loss while the many simple samples make only a small contribution. In other words, focal loss causes the network to focus more on hard samples from classes that account for only a few data points in the whole dataset. However, when the predicted probability p of the ground truth is small, for example, p ≤ 0.5, focal loss decreases the loss value compared to that of cross-entropy loss, which may not help to increase the loss difference between easy examples and hard examples.

III. CEFL/CEFL2 LOSS
Suppose that the output of a j-classification model for all classes is a set of tensors s = {s_1, s_2, ..., s_j}, where j is the total number of classes. After the softmax calculation, s can be mapped to the probability distribution of different classes. Given the model output s of a set of samples, the probability output after softmax is:

p_i = exp(s_i) / Σ_{k=1}^{j} exp(s_k),

where i is the label of the ground truth and p_i is the model's estimated probability of the ground truth.
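The softmax mapping above can be written out explicitly (a numerically stable sketch using the standard max-subtraction trick):

```python
import math

def softmax(scores):
    """Map raw class scores s_1..s_j to probabilities that sum to 1.

    Subtracting the max score first avoids overflow in exp() without
    changing the result.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # illustrative scores for a 3-class model
```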

A. CROSS-ENTROPY LOSS
The cross-entropy loss function is a loss function often used in multi-classification problems, and its form is as follows:

CE(p, y) = −Σ_{k=1}^{j} y_k · log(p_k),

where p is the model's estimated probability for all classes and y is the set of labels of all classes. In a multi-classification problem, the label is given in a one-hot coding form. Therefore, y should be 1 when only considering the ground truth [16]. Thus, the cross-entropy loss can be reduced to:

CE(p) = −log(p).

For a batch with an input of N images, the cross-entropy loss is:

L_CE = −(1/N) Σ_{n=1}^{N} log(p_n).
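Under the one-hot reduction above, the batch cross-entropy loss is just the mean negative log-probability of each sample's ground-truth class (a minimal sketch):

```python
import math

def batch_cross_entropy(gt_probs):
    """Mean of -log(p_n) over the N ground-truth probabilities in a batch."""
    return -sum(math.log(p) for p in gt_probs) / len(gt_probs)

# One confident, one moderate, and one poorly classified sample.
loss = batch_cross_entropy([0.9, 0.6, 0.2])
```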

B. FOCAL LOSS
Focal loss adds a term of (1 − p)^γ to the cross-entropy loss function to reduce the relative loss of well-classified samples and increase the relative loss of poorly classified samples. The formula is shown as follows:

FL(p) = −(1 − p)^γ · log(p),

where p is the model's estimated probability for the ground truth and γ (γ > 0) is a tuneable focusing parameter. The focal loss function is drawn for different γ values in Fig. 3. It is worth mentioning that when γ = 0, focal loss is formally the same as the cross-entropy loss function. It can be seen from Fig. 3 that with increasing γ, the value of focal loss generally declines compared to that of cross-entropy loss. In [16], the author introduces the α-balanced variant to improve accuracy. In this case, the α-balanced focal loss is denoted as:

FL(p) = −α · (1 − p)^γ · log(p).

For a batch with an input of N images, the focal loss should be:

L_FL = −(1/N) Σ_{n=1}^{N} (1 − p_n)^γ · log(p_n).

C. CEFL/CEFL2 LOSS
Samples from minor classes tend to have higher losses than those from major classes, as the features learned for minor classes are usually poorer. Consequently, samples from major classes will be well classified and samples from minor classes will be poorly classified [15], [16]. To address this problem, cost-sensitive learning methods [31] assign a loss-function weight to each class according to a given data distribution. An interesting problem is how to assign the weight distribution for major and minor classes automatically. Motivated by this idea of weight distribution, we propose two loss functions, called CEFL loss and CEFL2 loss, to address the problem of training from imbalanced data samples by rebalancing the cross-entropy loss and focal loss so as to expand the loss values of well-classified and poorly classified examples. Mathematically, we distribute the weights (1 − p) and p to the cross-entropy loss term and the focal loss term, respectively.
A new loss function, called CEFL loss, takes the following form:

CEFL(p) = −(1 − p) · log(p) − p · (1 − p)^γ · log(p),

where γ (γ > 0) is a tuneable focusing parameter, as in focal loss. Note that when γ = 0, CEFL loss formally reduces to cross-entropy loss. When a sample is poorly classified (p ≤ 0.5), p is relatively smaller and (1 − p) is relatively larger, which makes the weight of the cross-entropy loss term relatively larger while the weight of the focal loss term is relatively smaller; thus, CEFL loss is closer to cross-entropy loss. Conversely, when a sample is well classified, p is relatively larger and (1 − p) is relatively smaller, which gives a smaller weight to the cross-entropy loss term and a larger weight to the focal loss term; thus, CEFL loss is closer to focal loss. This polarized treatment of well-classified and poorly classified samples assigns larger losses to poorly classified samples from minor classes and smaller losses to well-classified samples from major classes, as shown in Fig. 4. Therefore, the network will focus on poorly classified classes.
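The weighting just described can be sketched as follows (a minimal math-module sketch of the CEFL form, with p the ground-truth probability):

```python
import math

def cefl_loss(p, gamma=2.0):
    """CEFL: (1 - p) weights the cross-entropy term, p weights the focal term."""
    ce = -math.log(p)
    focal = -((1.0 - p) ** gamma) * math.log(p)
    return (1.0 - p) * ce + p * focal
```

For small p the (1 − p) weight dominates and the loss tracks cross-entropy; for large p the p weight dominates and the loss tracks focal loss, with no new hyperparameter beyond γ.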
For a batch with an input of N images, the CEFL loss should be:

L_CEFL = −(1/N) Σ_{n=1}^{N} [(1 − p_n) · log(p_n) + p_n · (1 − p_n)^γ · log(p_n)].

CEFL loss has two properties. (1) For well-classified samples from major classes, CEFL loss assigns them relatively smaller loss values by approaching focal loss.
(2) For poorly classified samples from minor classes, CEFL loss assigns them relatively larger loss values by approaching cross-entropy loss. To further enhance these two properties, CEFL2 loss is introduced. CEFL2 loss is similar to CEFL loss in form, except that the weight distribution between the cross-entropy loss term and the focal loss term is different, which expands the effective loss of easy samples and hard samples to some extent, as shown in Fig. 4. The CEFL2 loss function is as follows:

CEFL2(p) = −[(1 − p)² / (p² + (1 − p)²)] · log(p) − [p² / (p² + (1 − p)²)] · (1 − p)^γ · log(p).

For a batch with an input of N images, the CEFL2 loss should be:

L_CEFL2 = −(1/N) Σ_{n=1}^{N} [((1 − p_n)² / (p_n² + (1 − p_n)²)) · log(p_n) + (p_n² / (p_n² + (1 − p_n)²)) · (1 − p_n)^γ · log(p_n)].

The proposed CEFL loss and CEFL2 loss are able to distinguish between well-classified and poorly classified samples according to the predicted probability p of the ground-truth class. Fig. 5 demonstrates the degree of polarization of the cross-entropy loss and focal loss terms as the predicted probability of the ground-truth class changes.
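The more polarized weighting described above can be sketched as follows. Note this sketch assumes a normalized squared-weight split, (1 − p)² and p² divided by their sum, which is our reading of the "different weight distribution" described in the text rather than a quoted formula:

```python
import math

def cefl2_loss(p, gamma=2.0):
    """CEFL2 sketch: like CEFL, but with normalized squared weights (assumption).

    The squared weights push the split between the cross-entropy and
    focal terms further toward the extremes than the linear (1-p)/p split.
    """
    ce = -math.log(p)
    focal = -((1.0 - p) ** gamma) * math.log(p)
    w_ce = (1.0 - p) ** 2 / (p ** 2 + (1.0 - p) ** 2)   # weight on the CE term
    return w_ce * ce + (1.0 - w_ce) * focal
```

At p = 0.5 the two weights are equal; for p well below 0.5 the loss hugs cross-entropy even more tightly than CEFL does, and for p well above 0.5 it hugs focal loss.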

IV. EXPERIMENTS
To verify the effectiveness of the two loss functions proposed in this paper, CEFL loss and CEFL2 loss, three different datasets were used: the manually processed imbalanced CIFAR-10/CIFAR-100 (which we call the imbalanced CIFAR-10/CIFAR-100 dataset), the AI Challenger 2018 Crop Disease Detection dataset and the Kaggle Mushroom Classification dataset. All three datasets exhibit different degrees of class imbalance; thus, they can be used to evaluate the loss functions that we propose. Furthermore, all three datasets are available on the Internet.
ResNet is one of the most widely applied network architectures. ResNet introduced a shortcut structure that alleviates the vanishing-gradient problem to some extent as the network deepens, and it exhibits good generalization performance on ImageNet, CIFAR, PASCAL and other datasets [3]. Therefore, we use deep residual networks (ResNet), both pretrained and non-pretrained, with different depths as our network framework.
When analyzing the experimental results, we used several different metrics, including global accuracy, precision, recall, and F1_score. Accuracy measures the classifier performance across all classes, while the others are specific to each single class. They are written as:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1_score = (2 · Precision · Recall) / (Precision + Recall).

A. DATASET INTRODUCTION

1) IMBALANCED CIFAR-10/CIFAR-100 DATASET
The original CIFAR-10/CIFAR-100 contains a total of 60,000 images (50,000 training images and 10,000 test images). According to different subdivisions, the CIFAR datasets are divided into two categories: CIFAR-10 (the same number of training images in each class, 5000) and CIFAR-100 (the same number of training images in each class, 500). We conducted a certain amount of random sampling on the original CIFAR-10/CIFAR-100 training set for each class to obtain a class-imbalanced training set. To measure the imbalance of the new imbalanced CIFAR-10/CIFAR-100 training set, we adopt the imbalance ratio [9], i.e., the ratio of the number of images in the class with the most images to that in the class with the fewest images:

ρ = max{N_i} / min{N_i},

where i (i ∈ {0, 1, ..., 9}) and j (j ∈ {0, 1, ..., 99}) are the indexes of the class labels in CIFAR-10 and CIFAR-100, respectively, and N_i and N_j are the numbers of images for each class of the newly generated imbalanced CIFAR-10/CIFAR-100. For example, sampling from CIFAR-10 with the imbalance factor ρ = 10, the image numbers of each class in imbalanced CIFAR-10 should be N = [500, 645, 834, 1077, 1391, 1796, 2320, 2997, 3871, 5000]. Moreover, the test set is the same as the original one. By sampling the original CIFAR-10/CIFAR-100 according to different imbalance factors, we created several different datasets, which are shown in Fig. 6.

2) AI CHALLENGER 2018 CROP DISEASE DETECTION DATASET
The AI Challenger 2018 Crop Disease Detection dataset is divided into training and validation sets. Fig. 7 shows the per-class sample statistics of the training set and validation set.
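Returning to the imbalanced CIFAR construction above, the per-class counts under an imbalance factor ρ can be reproduced with a geometric profile (a sketch inferred from the ρ = 10 example; the exact rounding rule is an assumption):

```python
def class_counts(n_min, rho, num_classes):
    """Per-class sample counts growing geometrically from n_min to n_min * rho."""
    return [int(n_min * rho ** (i / (num_classes - 1))) for i in range(num_classes)]

# Reproduces the example in the text for rho = 10 on CIFAR-10:
# [500, 645, 834, 1077, 1391, 1796, 2320, 2997, 3871, 5000]
counts = class_counts(n_min=500, rho=10, num_classes=10)
```

By construction, the ratio between the largest and smallest class counts equals ρ.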
As seen from Fig. 7, the total number of samples in categories with a large quantity of data can reach approximately 2500, while categories with a small quantity of data have only a few images.

3) KAGGLE MUSHROOM CLASSIFICATION DATASET
The Kaggle Mushroom Classification dataset is a publicly available dataset on Kaggle's official website. The dataset contains the most common images of the northern European mushroom species. There are 5614 images divided into 9 folders in the dataset. Each folder contains 300 to 1500 or more images of the mushroom genus, which makes it a typical class-imbalanced dataset. We randomly selected 300 images from each category as the test set and the rest as the training set. Therefore, the data distribution of the training set and test set is shown in Fig. 8.

B. IMPLEMENTATION
All experimental models are based on PyTorch. We used a stochastic gradient descent (SGD) optimizer with the hyperparameter settings listed in Table 1. None of the experiments uses a warm-up strategy [10] in training.
For the imbalanced CIFAR-10/CIFAR-100 dataset, we used a ResNet-18 network without pretraining, and the size of the input image was 32 × 32 pixels. A total batch size of 128 on a single GPU for 100 epochs was used, and the initial learning rate was 0.05. Then, when training to 60 epochs and 80 epochs, the learning rate was multiplied by 0.1. A number of ablation experiments were conducted on the imbalanced CIFAR-10/CIFAR-100.
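The step schedule above (initial rate 0.05, multiplied by 0.1 at epochs 60 and 80) corresponds to PyTorch's `MultiStepLR` scheduler; a framework-free sketch of the same schedule:

```python
def learning_rate(epoch, base_lr=0.05, milestones=(60, 80), gamma=0.1):
    """Piecewise-constant schedule: multiply by gamma at each milestone epoch."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays

# epochs 0-59 -> 0.05, epochs 60-79 -> 0.005, epochs 80-99 -> 0.0005
```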
For the AI Challenger 2018 Crop Disease Detection dataset, we used ResNet-18, ResNet-34, ResNet-50 and ResNet-101 for the pretraining model, and the image input network size was 224 × 224 pixels. For each pretraining network, we used a batch size of 64 on a single GPU for 60 epochs, initialized the learning rate to 0.01, and then multiplied the learning rate by 0.1 when training to 30 epochs and 50 epochs.
For the Kaggle Mushroom Classification dataset, we used a pretraining model of ResNet-18, ResNet-34 ResNet-50 and ResNet-101 with an input image size of 224 × 224 pixels. For each pretraining network, a batch size of 64 on a single GPU for 100 epochs was used, and the initial learning rate was 0.01. Then, when training to 60 epochs and 80 epochs, the learning rate was multiplied by 0.1.

C. EXPERIMENTAL RESULTS
To demonstrate that our proposed loss functions can be used on real-world imbalanced datasets, we conducted experiments on the three above-mentioned datasets.

1) EXPERIMENTS ON IMBALANCED CIFAR-10/CIFAR-100 DATASET
We conducted a series of ablation experiments on the imbalanced CIFAR-10/CIFAR-100 datasets based on ResNet-18, and the results show that the proposed CEFL loss and CEFL2 loss can improve the accuracy. We show the results in Tables 2 and 3, where the average accuracy is computed over the results for ρ = 10, 20, 50, 100, 200 and 500. Clearly, the best accuracy on each dataset for the different ρ values, as well as the best average accuracy, is achieved by the proposed CEFL and CEFL2 losses. Fig. 9 shows the average accuracy on the imbalanced CIFAR-10/CIFAR-100 dataset over the various imbalance ratios ρ (except ρ = 1) listed in the last column of Tables 2 and 3. In Fig. 9, the proposed CEFL loss (green cylinder) and CEFL2 loss (yellow cylinder) significantly help the network improve classification accuracy compared with cross-entropy loss (blue cylinder) and focal loss (red cylinder).

2) EXPERIMENTS ON THE IMBALANCED AI CHALLENGER 2018 CROP DISEASE DETECTION DATASET
For the AI Challenger 2018 Crop Disease Detection dataset, we conducted a set of ablation experiments, and the results are shown in Table 4 and Fig. 10. Table 4 shows that the CEFL2 loss function is slightly better than the CEFL loss function on the AI Challenger 2018 Crop Disease Detection dataset. Fig. 10 shows the experimental results with different loss functions and different network depths. It can be seen that the proposed CEFL loss and CEFL2 loss obtain better performance than cross-entropy loss and focal loss on this imbalanced dataset.

3) EXPERIMENTS ON THE KAGGLE MUSHROOM CLASSIFICATION DATASET
For the Kaggle Mushroom Classification dataset, we trained images on ResNet-18, ResNet-34, ResNet-50 and ResNet-101. Fig. 11 shows the confusion matrices for the 9 classes of the Kaggle Mushroom test set when training with ResNet-101. Table 4 shows the F1_score and accuracy results for different network depths. It can be seen that the CEFL loss function is slightly better than the CEFL2 loss function on the Kaggle Mushroom Classification dataset. Table 5 presents the classification results with precision, recall and F1_score for each class. The results indicate that the proposed loss functions generally improve the classification performance by enhancing the classification ability on minor classes. Fig. 12 shows the accuracy curves for both the training and test processes. It can be concluded that the proposed CEFL loss and CEFL2 loss improve accuracy on the test set across different network depths.

V. CONCLUSION
In this paper, we present two theoretical loss functions to address the problem of class imbalance for training neural networks. The idea of the proposed loss functions comes from considering the cross-entropy loss function and focal loss function. Some important considerations are as follows: (1) When the training samples are well classified (p > 0.5), the loss approaches the focal loss. (2) When the training samples are poorly classified (p ≤ 0.5), the loss is close to the cross-entropy loss. If the proposed loss function satisfies both considerations, it will cause the network to learn more from classes with fewer samples. Thus, the trained network will improve the total accuracy of classification. We experiment with cross-entropy loss, focal loss, and the proposed CEFL loss and CEFL2 loss on three different datasets with class imbalance problems, and the results show that the proposed CEFL loss and CEFL2 loss have better classification accuracy than the other two kinds of loss functions.
In addition, the proposed CEFL loss and CEFL2 loss do not add any new hyperparameters and can be regarded as an automatically balanced loss function between the cross-entropy loss function and focal loss function. This strategy allows our loss function to be broadly applied to models of image classification. We believe that our approach can also be extended to target detection to address the imbalance problem of foreground and background in a similar way to that of focal loss, which we will research further in the future.