Rectified Softmax Loss With All-Sided Cost Sensitivity for Age Estimation

In Convolutional Neural Network (ConvNet) based age estimation algorithms, the softmax loss is usually chosen directly as the loss function, and Cost Sensitivity (CS) problems, such as class imbalance and the difference in misclassification cost between classes, are not considered. Focusing on these problems, this paper constructs a rectified softmax loss function with all-sided CS and proposes a novel cost-sensitive ConvNet-based age estimation algorithm. First, a loss function is established for each age category to address the imbalance in the number of training samples. Then, a cost matrix is defined to reflect the cost differences caused by misclassification between different classes, yielding a new cost-sensitive error function. Finally, the above methods are merged to construct a rectified softmax loss function for the ConvNet model, and a corresponding Back Propagation (BP) training scheme is designed so that the network learns robust face representations for age estimation during the training phase. We also prove theoretically that the rectified softmax loss satisfies the general conditions required of a classification loss function. The effectiveness of the proposed method is verified by experiments on face image datasets of different races.


I. INTRODUCTION
Age estimation means determining an accurate age (or age group) from a given face image. Because this technology has important application value in human-computer interaction, intelligent marketing, intelligent monitoring and criminal investigation, it has become a research hotspot in computer vision in recent years. However, due to the influence of growth environment, lifestyle and ethnic genetic diversity, accurately estimating facial age remains very challenging.
In machine learning methods, age estimation is usually divided into two parts: facial age representation and age determination. Age representations include shallow characterizations such as AAM [1]-[3], LBP [4] and BIF [5], [6]. After extracting an appropriate age representation from the face image, an age determination method is used to predict the corresponding age, such as KNN [6], a quadratic function [6], [7], SVM [8], [9] or SVR [8], [9]. However, there are some drawbacks: 1. The design of handcrafted age representations is quite cumbersome. 2. Traditional generative and discriminative models cannot fully exploit the benefits of the big data era; that is, as the amount of data increases, the performance improvement is not obvious. 3. Because the pipeline is not end-to-end, it is quite time-consuming to estimate the age of an unseen facial image.
Inspired by the excellent performance of deep learning in image classification, many researchers have begun to apply it to facial age estimation. According to the underlying algorithmic principle, current methods are mainly divided into regression and multi-classification. In the regression methods, the age label is regarded as a continuous value, so the output layer of the Convolutional Neural Network (ConvNet) contains only one neuron and the output is the specific age. In the multi-classification methods, the age label is regarded as a discrete category, so the number of neurons in the output layer matches the number of age groups in the experimental dataset and the output is the specific age category. In this paper, age estimation is studied based on a multi-classification ConvNet. When using a classification ConvNet for age estimation, previous methods usually employ a plain ConvNet model or a ConvNet originally designed for other image classification tasks (such as VGG16 [11]), and the performance improvement mainly comes from fine-tuning on an age image dataset after pre-training on other image sets. Because the shallow representations of most images, and even of all face images, can be shared within a ConvNet, transfer learning spares researchers unnecessary pre-training work and often enjoys the performance benefits of an increased sample size. However, the obtained model does not satisfy Ockham's razor. It is therefore necessary to analyze the age estimation problem itself and design the ConvNet model in a targeted manner. Since age estimation is a typical cost-sensitive computer vision problem, our method addresses exactly this.
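The difference between the two output-layer designs can be sketched as follows. This is a hypothetical illustration with random stand-in features and weights, not a trained model; the function names and the feature dimension are invented for the example.

```python
import numpy as np

def regression_head(features, w, b):
    """Regression ConvNet: a single output neuron, so the output is the age."""
    return features @ w + b

def classification_head(features, W, b):
    """Classification ConvNet: n output neurons (one per age class),
    softmax-normalized into class probabilities."""
    logits = features @ W + b
    e = np.exp(logits - logits.max())   # shift by max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
feat = rng.standard_normal(128)         # stand-in for pooled ConvNet features
n_classes = 62                          # e.g. ages 16..77 as in MORPH

age = regression_head(feat, rng.standard_normal(128), 0.0)
probs = classification_head(feat, rng.standard_normal((128, n_classes)),
                            np.zeros(n_classes))
```

For classification, the predicted age category is simply `probs.argmax()`, while the regression head outputs the age directly.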
Using cost-sensitive theory for age estimation raises two main problems. 1. The class imbalance problem: Fig. 1 shows the statistical distribution of the sample size of each age group in the two largest face datasets, AFAD [10] and MORPH [12], and the sample sizes are extremely unbalanced. If a classifier is trained directly on class-unbalanced data, it will always tend to misclassify categories containing fewer samples into categories containing more samples. In real life, the probabilities of occurrence of some categories differ greatly; for example, young people appear in shopping plazas more often than the elderly and children, but we should still try to give each class the same influence. In this paper, we name this type of problem Exterior Cost Sensitivity (ECS), because it is caused by how people collect samples. 2. The uneven misclassification cost problem: when age estimation is regarded as a multi-classification problem, the cost of misclassification between different classes should be considered. For example, the cost of misclassifying a 22-year-old as 23 and misclassifying them as 50 is not the same, as shown in Fig. 3. We name this Interior Cost Sensitivity (ICS), because it is inherent when age estimation is regarded as a multi-classification problem.
In view of ECS, ICS and the shortcomings of previous age estimation methods, this paper embeds cost sensitivity into the deep learning framework during the training phase and proposes a rectified softmax loss with all-sided cost sensitivity. The core idea is first to establish a loss function for each age category, in order to reduce the influence of the unbalanced sample counts of different categories on the performance of the whole classification algorithm. In addition, a cost matrix is constructed to reflect the cost differences of misclassification among different classes, yielding a cost-sensitive error function. Finally, the two proposed loss functions are integrated into a rectified softmax loss for the ConvNet model to comprehensively solve the cost-sensitivity problem in age estimation. The whole algorithm flow is shown in Fig. 2. To sum up, the contributions of this paper are as follows:
1) For ECS, we build a loss function for each category to maximally eliminate the impact of class imbalance on the total loss.
2) For ICS, we construct a novel cost matrix to reflect the differences in misclassification cost based on the Desired Class Maximum Principle (see Section III.C). In addition, we verify its superiority with some examples.
3) For ECS and ICS, we integrate an all-sided cost-sensitive loss function for age estimation and prove its practicability with relevant theory. To the best of our knowledge, this is the first time the cost-sensitivity problem has been considered so comprehensively in age estimation methods.
4) The corresponding Back Propagation (BP) algorithm is modified to analyze how the change of loss function influences the training process of the network.
5) Experiments were carried out on two interracial datasets to demonstrate the generalization ability of the proposed method.

II. RELATED WORKS
Before describing our method systematically, we first briefly introduce age estimation and cost sensitivity learning methods in the previous literature.

A. AGE ESTIMATION
Even though age estimation has not received as much attention as face recognition, it has been studied for more than 20 years. The mainstream methods can be divided into the following categories.

1) POINT DISTRIBUTION MODEL
Active Appearance Models (AAM) were described by Cootes et al. in 1998 [1] as a statistical model of shape and grey-level appearance. Because appearance intuitively reflects the growth process of an individual, much later work builds on AAM. Lanitis et al. [2] used it as a parametric description of facial images, and then tested many classifiers for age estimation (i.e., a quadratic function, a shortest-distance classifier, a multilayer perceptron and a self-organizing map). Besides, they designed appearance-specific classifiers for people who look similar and, similarly, age-specific classifiers. The AGing pattErn Subspace (AGES) was introduced by Geng et al. [3], and it likewise adopts AAM to extract facial features. In the training stage, Principal Component Analysis (PCA) is used to construct the global aging pattern subspace, where an aging pattern is defined as the sequence of all training images of a person sorted chronologically. For a previously unseen facial image, its aging pattern is selected by projecting into the subspace under the minimum reconstruction error criterion, and the position of the image obtained iteratively within the proper aging pattern indicates its age.

2) MANIFOLD LEARNING METHOD
A ''manifold'' is a structure that is locally homeomorphic to Euclidean space. The main idea of manifold learning is to project data located in a high-dimensional space into a low-dimensional space such that the low-dimensional data reflects some essential structural characteristics of the original high-dimensional data. An important premise for applying it is that the high-dimensional data actually lies on a low-dimensional manifold embedded in the high-dimensional space. Naturally, using manifold learning for age estimation also assumes this premise, with the high-dimensional data being the face image (the pixel matrix). Regarding the aging process as a low-dimensional distribution in chronological order, Fu et al. [13] employed linear manifold learning methods, i.e., PCA, Neighborhood Preserving Projections (NPP), Locality Preserving Projections (LPP) and Orthogonal Locality Preserving Projections (OLPP), to find a low-dimensional embedding space, took the projection of the image in this space as the aging representation, and finally constructed a quadratic function for regression. Two years later, the same age-manifold approach was revisited by Fu et al. [7], but the tool for finding an effective low-dimensional embedding space was replaced by Conformal Embedding Analysis (CEA), which improved the accuracy.

3) THE SHALLOWER BIO-INSPIRED NETWORK
Riesenhuber et al. [14] described a 4-layer network (usually called BIF) following the organization of the visual cortex, which alternates between a simple (S) layer and a complex (C) layer. The S layer is built on multi-orientation, multi-scale Gabor filters to extract discriminative low-level features. The C layer gains some tolerance to shift and size through a max-pooling operation. Guo et al. [6] used two layers of BIF to estimate facial age; specifically, they adopted a standard deviation (STD) operation combined with max-pooling in the C layer to better reveal the local variation of facial textures. Soon afterwards, Guo et al. [15] combined BIF with a variety of supervised manifold learning methods (i.e., marginal Fisher analysis and locality-sensitive discriminant analysis), and then estimated age according to gender and age group.

4) THE DEEPER BIO-INSPIRED NETWORK (i.e., ConvNet)
Levi and Hassner [16] implemented age classification by means of a classic AlexNet-like network (5 layers). To our knowledge, it is the first successful application of deep learning to age estimation. In addition, the works [9], [10], [17]-[19], [38]-[40] (see Section IV.E) used ConvNets for age estimation. However, they are all based on one or more baseline ConvNets and do not consider class imbalance or uneven misclassification costs. In contrast, the proposed network modification can learn parameters that are robust and discriminative with respect to cost sensitivity.

B. COST SENSITIVITY LEARNING
Cost-sensitivity learning at the algorithm level is similar to data resampling at the data level; both often aim to solve the class imbalance problem. Specifically, it assigns different misclassification costs to different classes through a cost matrix and/or a cost vector. Ting [20] introduced cost sensitivity into decision trees, and the resulting cost-sensitive tree is simpler and more efficient. Cai et al. [21] proposed a cost-sensitive SVM for text classification. In addition, cost-sensitivity learning is particularly widely used in boosting methods [22], [23]. Although the theory of cost-sensitivity learning has been adopted in much of the literature, research on it in deep learning is still quite rare; as far as we know, only [24], [25] are available.
Several cost-sensitive age estimation algorithms have been considered previously. Chang et al. [26] proposed a cost-sensitive ranking algorithm, which embeds cost sensitivity in the ranking model; however, cost sensitivity was considered only in the age prediction stage, so the learned parameters were not sufficiently discriminative and robust. Lu et al. [4] proposed a cost-sensitive LBP algorithm, but because LBP only considers local texture information, the method has limited representation ability. In addition, they only consider the misclassification cost problem and ignore the imbalance in training sample quantities, so the resulting cost-sensitive age estimation model is incomplete. For these reasons, this paper proposes a modified softmax loss that works in the training phase for age estimation.
The remainder of this paper is organized as follows. The proposed approach is described in detail in the next section. Section IV presents our experiments. In Section V we discuss the transferability of our method and future research directions. The conclusion is given in Section VI.

III. THE PROPOSED APPROACH
In this paper, the cost-sensitive problem in age estimation is solved in the training phase, and our main work focuses on the improvement and innovation of the loss function.

A. TRADITIONAL LOSS FUNCTION
When the classification ConvNet model is used for age estimation, its loss function is usually defined as

    E(w, b) = \frac{1}{L} \sum_{i=1}^{L} Loss(d^i, y^i(w, b))    (1)

where L is the number of training samples, Loss(·) is a defined error function such as Mean Square Error or Cross Entropy, d^i = (d^i_1, ..., d^i_n) is the expected output of the network (i.e., the label), in which n is the number of categories and d^i_k is usually Boolean, and y^i(w, b) is the prediction probability vector of the ConvNet model for the ith sample, whose entries are usually floating-point values; w and b denote the weight and bias matrices of the ConvNet model, respectively. During training, a large value of (1) means that the ConvNet model cannot fit the sample distribution of the training set well. Iterative learning should then continue to obtain the optimal parameter values (w*, b*) that minimize the loss in (1):

    (w*, b*) = \arg\min_{w, b} E(w, b)    (2)

Let o^i_k(w, b) denote the prediction output of the kth neuron for the ith image, and use the standard softmax regression function to calculate the posterior probability that the ith image belongs to the kth class:

    y^i_k(w, b) = \frac{\exp(o^i_k(w, b))}{\sum_{t=1}^{n} \exp(o^i_t(w, b))}    (3)

The Cross Entropy error function can then be formulated as

    E^i(w, b) = -\sum_{k=1}^{n} d^i_k \log y^i_k(w, b)    (4)

If ECS and ICS are not considered, the traditional deep-learning-based age estimation loss [9]-[11], [16]-[19] is obtained:

    E(w, b) = \frac{1}{L} \sum_{i=1}^{L} E^i(w, b)    (5)

where E^i(w, b) is the error of the ith image. The whole network is then optimized as in (2), which iteratively minimizes the uncertainty of the network over the training set, making the output probability distribution of the logit layer closer and closer to the expected distribution. The normal BP algorithm is used to adjust the weights and biases, and the optimized model can finally be used for age prediction (overfitting excluded). However, the resulting model is not robust, since (5) is a 'consideration-poor' loss function for the multi-class age estimation problem.
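The computation in (3)-(5) can be sketched numerically. This is a minimal stand-alone illustration with toy logits and one-hot labels; the function names are invented for the example.

```python
import numpy as np

def softmax(o):
    """Eq. (3): posterior probabilities from the logits o."""
    e = np.exp(o - o.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(d, y):
    """Eq. (4): cross-entropy error of one sample with one-hot label d."""
    return -np.sum(d * np.log(y + 1e-12))

def traditional_loss(logits, labels):
    """Eq. (5): average cross-entropy over the L training samples."""
    return np.mean([cross_entropy(d, softmax(o))
                    for o, d in zip(logits, labels)])

# two toy samples, three age classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.2, 3.0, 0.5]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])
loss = traditional_loss(logits, labels)
```

Note that this loss averages uniformly over samples, which is exactly why a class-imbalanced training set skews it, as discussed next.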

B. CONSTRUCT ECS-SOFTMAX LOSS FUNCTION
In the loss function defined in (5), if the loss of each class of samples is considered separately, the formula can be rewritten as

    E(w, b) = \frac{1}{L} \sum_{j=1}^{n} E_j(w, b),  where  E_j(w, b) = \sum_{i \in \text{class } j} E^i(w, b)    (6)

where E_j(w, b) denotes the total loss of the samples of category j, and L_j denotes the number of samples of category j. It can be seen intuitively from (6) that if the dataset is class-imbalanced (as shown in Fig. 1), the classes containing more samples have a decisive impact on the total loss, while the classes containing fewer samples have a negligible impact. As a result, the classifier learns weights and biases skewed in favor of the classes containing more samples.
In this paper, in order to solve the ECS problem in age estimation, the loss of each category is calculated separately, and the ECS-softmax loss function is proposed as

    E(w, b) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{L_j} E_j(w, b)    (7)

Compared with the original softmax loss (see (5) and (6)), ECS-softmax loss adopts a divide-and-conquer strategy for the loss on each category: it constructs a loss term for each category and computes its value separately. In this way, regardless of the sample size of each category, all categories have the same impact on the loss over the entire dataset, so the influence of sample counts on classifier performance can be largely neglected.
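The per-class averaging idea behind ECS-softmax can be seen numerically in a minimal sketch (this illustrates only the divide-and-conquer weighting of (7), with pre-computed per-sample losses standing in for the cross-entropy terms; the function names are invented):

```python
import numpy as np

def plain_average(per_sample_losses):
    """Original softmax loss style, eq. (5)/(6): one global average."""
    return per_sample_losses.mean()

def ecs_average(per_sample_losses, class_ids, n_classes):
    """ECS style, eq. (7): average inside each class first, then across
    classes, so every class contributes equally regardless of its size."""
    total = 0.0
    for j in range(n_classes):
        mask = class_ids == j
        if mask.any():
            total += per_sample_losses[mask].mean()
    return total / n_classes

# imbalanced toy set: class 0 has 4 samples, class 1 has only 1
losses = np.array([1.0, 1.0, 1.0, 1.0, 5.0])
ids = np.array([0, 0, 0, 0, 1])
plain = plain_average(losses)            # 1.8: dominated by the big class
ecs = ecs_average(losses, ids, 2)        # (1.0 + 5.0) / 2 = 3.0
```

The minority class's high loss, nearly invisible in the plain average, dominates the ECS average, which is exactly the intended rebalancing.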

C. CONSTRUCT ICS-SOFTMAX LOSS FUNCTION
1) CONSTRUCT COST MATRIX
In the standard softmax regression function shown in (3), the misclassification cost-sensitivity problem (that is, ICS) is not considered when calculating the posterior probability of each category. A cost matrix should therefore be designed to reflect the misclassification cost differences between categories, since age estimation is itself a cost-sensitive computer vision problem.
For ICS, the traditional cost matrix follows the reasonableness conditions [25]: 1. the cost of correct classification should always be the minimum; 2. misclassification between two classes that are far apart should cost more than misclassification between two classes that are close together. In addition, another condition must be satisfied when the cost matrix is applied to the softmax regression function: when the misclassification costs are all equal, the cost-sensitive softmax function should coincide with the original softmax function. To satisfy these properties, we propose a preliminary cost matrix:

    c_{ρ,k} = |ρ - k| + 1    (8)

where c_{ρ,k} is the cost of mistaking class ρ for class k. Fig. 3 illustrates these cost values when the real age is 22. For instance, the cost of misclassifying someone of age 22 as age 23 is |22 - 23| + 1 = 2, while the cost of misclassifying them as age 50 is |22 - 50| + 1 = 29, which makes logical sense.
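The preliminary cost matrix of (8) can be built directly; the sketch below uses the MORPH-like age range 16..77 as an assumed example:

```python
import numpy as np

def preliminary_cost_matrix(n_classes):
    """Eq. (8): c[rho, k] = |rho - k| + 1, indexed by age class."""
    idx = np.arange(n_classes)
    return np.abs(idx[:, None] - idx[None, :]) + 1

ages = np.arange(16, 78)                  # ages 16..77 as in MORPH
C = preliminary_cost_matrix(len(ages))

# real age 22: mistaking it for 23 costs 2, mistaking it for 50 costs 29
cost_near = C[22 - 16, 23 - 16]
cost_far = C[22 - 16, 50 - 16]
```

Note that the diagonal of `C` is all 1 rather than 0, which is the property discussed next.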
Except that the diagonal values are not zero even when the classification result is correct, the preliminary cost matrix behaves like the usual absolute cost matrix. (In fact, the minimum value we allow is 1, and this is not wrong: on the one hand, when the misclassification costs of all categories are equal, the ICS-softmax loss reduces back to the original softmax loss; on the other hand, it is essential for the proposed final cost matrix.)
The reason why we do not directly use the preliminary cost matrix as the weight of the logit values is that, although it does embed the misclassification cost factor into the ConvNet parameters during training, it ultimately cannot guarantee that the probability of the desired class improves gradually. Here we propose a general rule, which we believe all work on cost matrices should follow, called the Desired Class Maximum Principle (DCMP): every cost matrix has only one purpose, namely to make the logit value of the correct class larger and larger (reflected by an increasing o_ρ), so as to make the probability of the correct class larger and larger (reflected by an increasing y_ρ). This is not only the core meaning of the proposed cost matrix, but also what distinguishes it from previous cost matrices. Concretely, we classify the logit values of both the desired and the non-desired (i.e., predicted) classes into positive and negative cases, and thereby derive a new cost matrix c, whose specific values are given in (9).
Different from [25], which learns the cost matrix automatically, the cost matrix we propose is a fixed formula; its effectiveness is verified below (see Section III.E.A).

2) ICS-SOFTMAX LOSS
In order to completely solve the ICS problem, we combine (8) and (9) so as to embed the cost matrix into the softmax regression function, modify the output values of the last layer of the ConvNet (as shown in Fig. 4), and calculate the error of the ith image that originally belongs to class ρ but is misclassified as class k (eq. (10)). The overall loss function aggregates these per-image errors over the training set.

D. RECTIFIED SOFTMAX LOSS WITH ALL-SIDED COST SENSITIVITY
1) CS-SOFTMAX LOSS FUNCTION
In order to deal with the ECS and ICS problems in age estimation at the same time, (7) and (10) are combined and expanded to construct a rectified softmax loss function for the ConvNet model (see Fig. 5).

2) CS-BP ALGORITHM
The BP algorithm is the most effective learning algorithm for multi-layer neural networks; its main characteristic is that the signal is propagated forward while the error is propagated backward. By constantly adjusting the weights and biases of the network, the final output is brought as close as possible to the expected value, thus achieving the training goal. Therefore, once the loss function is modified, the BP algorithm must be adjusted accordingly.

a: Predefined symbols
A multi-layer neural network is assumed to be composed of L layers, with the following predefined symbols: 1) w^l_{ji} is the connection weight between the ith neuron in the (l-1)th layer and the jth neuron in the lth layer, where s_l is the number of neurons in the lth layer.
2) net^l_k is the input of the kth neuron in the lth layer. 3) a^l is the output of the lth hidden layer. 4) f(·) is the non-linear activation function of the hidden layers; this paper adopts Rectified Linear Units (ReLU) [33].

b: The derivation of BP algorithm
The iterative optimization algorithm adopted here to update the weight and bias matrices is mini-batch gradient descent, so the key is to calculate the partial derivatives of the error of each training image with respect to the weights and the biases. It is worth noting that we omit the derivation for the biases, because compared with the partial derivative with respect to a weight it merely lacks the output of a neuron in the previous layer.
Suppose a facial image that originally belongs to class ρ is classified as class k; then the following conclusion is drawn (see the Supplementary Material for a detailed derivation):

3) CONVNET MODEL AND ALGORITHM STEPS
Before the formal experiments, we carried out some necessary preprocessing steps on the raw facial images, namely face detection, facial landmark localization and face alignment. Specifically, we adopted the classical cascaded VJ object detector [29] for face detection, and AAM [1] for facial landmark localization. Face alignment was then conducted based on the location of the nasal tip. Finally, each image was resized to 224 × 224 for age estimation. In addition, we adopted ResNet-50 [27] as the backbone architecture and integrated the Squeeze-and-Excitation module [28] into it. To accelerate convergence and avoid over-fitting caused by excessive parameters, the fully-connected layer was removed. After integrating the rectified softmax loss function into the ConvNet, we name the network Deep Cost-Sensitivity ConvNet. The optimization process of the network is shown in Algorithm 1. In the following example, the image is classified into the first category even though it does not belong to it; when the proposed ICS-softmax is used for classification, it adjusts the probability of each category according to the actual situation:

Algorithm 1 Iterative Optimization About ConvNet
Looking up (9), the corresponding probability distribution can be computed for each case. From these, ICS-softmax keeps increasing the probability of the desired class, although by a small margin. In this way, it not only accelerates the convergence of the network, but also greatly improves the accuracy of the ConvNet after repeated iterations.

F. THEORETICAL PRACTICABILITY ANALYSIS
After proposing the method, we verify its theoretical practicability by analyzing whether it satisfies the general properties of a classification loss function. To this end, we consider two properties, namely classification-calibration [25] and guess-aversion [22]. Note that the classification direction of the all-sided CS-softmax loss function is determined by the ICS-softmax loss, so we only need to check whether the ICS-softmax loss for a single sample satisfies these properties (for simplicity, we omit the superscript i).

Classification-calibration.
Classification-calibration is essentially a pointwise form of Fisher consistency for classification. Since using a surrogate for the 0-1 loss introduces additional risk, classification-calibration guarantees that the decision of the proposed loss function is statistically consistent with the Bayesian decision rule.
Argument 1: The ICS-softmax error function conforms to classification-calibration.
Proof: Consider a sample X belonging to the ρth class that is correctly classified; its ICS-softmax error is given by (14), and the corresponding expected risk by (15). Minimizing the expected risk with respect to the ConvNet predicted outputs o_t (t ∈ [1, n]) yields the ideal outputs in (16). Integrating (14), (15) and (16), and using the fact that the diagonal elements of our proposed cost matrix are all 1, we obtain that the optimal ConvNet output probability y_t is inversely proportional to the Bayesian risk of class t.
∵ All the costs in c are positive. ∴ o t is positively correlated with y t . ∴ Optimal ConvNet predicted output o t and Bayesian risk of class t have an inverse relationship.
∴ Argument 1 is satisfied.
Guess-aversion. Guess-aversion verifies whether the classification result of our loss function always tends toward the desired class rather than an arbitrary guess.
Argument 2: The ICS-softmax error function conforms to guess-aversion.
Proof: We now prove that the sufficient condition of Argument 2 holds, where ρ is the desired class and the set of arbitrary guess points is as defined in [22]. From [22], the original softmax loss (3) reasonably conforms to guess-aversion. For a ConvNet the relevant set is all-zero, so the condition simplifies. Since the proposed cost matrix is based on DCMP (see also the example in Section III.C.A), we have E(ρ, o, c) < E(d, ·, c).
Argument 2 is satisfied.

IV. EXPERIMENT
In this section, we first introduce the datasets, evaluation criteria and experimental details, and then carry out two groups of experiments: an ablation experiment and a contrast experiment.

A. DATASETS
We first chose MORPH Album2 [12], which mainly consists of White and Black subjects. In addition, for the sake of universality of our algorithm, we also adopted the Asian Face Age Dataset (AFAD) [10], which consists of Asian subjects.

1) MORPH
The full MORPH Album2 contains over 94,000 facial images, but since only part of it is public, we use and discuss a subset, the MORPH public release (MORPH for short). This dataset contains 55,608 unique images of more than 13,000 individuals, with ages ranging from 16 to 77. For the detailed distribution and some examples of MORPH at each age, see Fig. 1(a) and Fig. 6, respectively. Since MORPH does not provide an official training/testing split, in order to compare with other age estimation methods we follow the split standard of [10], [11], [16]-[19], namely randomly dividing MORPH into 80% for training and the remaining 20% for testing.
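The random 80/20 split can be sketched as follows; the seed is an arbitrary assumption, since no official split is provided:

```python
import numpy as np

def split_80_20(n_samples, seed=0):
    """Randomly permute sample indices and cut at 80% for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    cut = int(0.8 * n_samples)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_80_20(55608)   # MORPH public release size
```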

2) AFAD
AFAD consists of 164,432 facial selfie images collected from a social networking site, with ages ranging from 15 to 40. It is not only the largest public dataset for age estimation so far, but also a powerful dataset for studying facial age in the wild. For the detailed distribution and some examples of AFAD at each age, see Fig. 1(b) and Fig. 7, respectively. Its split standard is analogous to that of MORPH.

B. EVALUATION CRITERIA
In age estimation, the commonly used evaluation criteria are the Mean Absolute Error (MAE) and the Cumulative Score (CS).
MAE:

    MAE = \frac{1}{L_t} \sum_{i=1}^{L_t} |ρ^i - k^i|

where L_t is the size of the testing set, and ρ is the desired class while k is the predicted class. MAE measures the mean age error over all facial images in the testing set.
For example, if the MAE of an age estimation algorithm is 5, then given a facial image the estimated age will on average be 5 years younger or older than the actual age.
CS:

    CS(m) = \frac{1}{L_t} \sum_{i=1}^{L_t} bool(|ρ^i - k^i| ≤ m)

where bool(·) is the Boolean function and m is the tolerable age error. CS can be understood as the accuracy rate under an allowable error.
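Both criteria are easy to compute on a toy testing set; the example values below are invented for illustration:

```python
import numpy as np

def mae(true_ages, pred_ages):
    """Mean Absolute Error: average |rho - k| over the testing set."""
    t, p = np.asarray(true_ages), np.asarray(pred_ages)
    return np.mean(np.abs(t - p))

def cumulative_score(true_ages, pred_ages, m):
    """CS(m): fraction of test images whose absolute age error is <= m."""
    t, p = np.asarray(true_ages), np.asarray(pred_ages)
    return np.mean(np.abs(t - p) <= m)

# toy testing set: absolute errors are 1, 5, 0, 7
true_a = np.array([22, 30, 45, 18])
pred_a = np.array([23, 35, 45, 25])
err = mae(true_a, pred_a)                  # (1+5+0+7)/4 = 3.25
cs5 = cumulative_score(true_a, pred_a, 5)  # 3 of 4 errors <= 5 -> 0.75
```

Lower MAE is better, while higher CS(m) is better for every tolerance m.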

C. THE DETAILS OF EXPERIMENTS
Our experiments were based on the Caffe open-source framework [30] (solver mode: GPU) on Ubuntu 18.04. Deep cost-sensitivity learning adopted a transfer learning strategy, with the Caffe model derived from [31]. Starting from these weights, we fine-tuned on MORPH and AFAD. The hyper-parameters in the solver prototxt were set as follows: the base learning rate is 0.0015 and decays exponentially; the weight decay coefficient is 0.0005; stochastic gradient descent is used with a mini-batch size of 64; the momentum is the default 0.9.

D. ABLATION EXPERIMENT
In this section, an ablation experiment with the control-variables method was conducted to compare the effects of different loss functions on age estimation performance, so as to further highlight the superiority of our idea. We use one of four loss functions, namely the proposed ECS-softmax loss, ICS-softmax loss, CS-softmax loss, or the original softmax loss, during the training phase, while the rest of the network stays the same. As a result, four ConvNets with parameters carrying different meanings complete the age estimation during the testing phase. The final results are shown in Table 1, and the following Q&A unfolds:
Q1: Why is the age estimation performance on AFAD worse than on MORPH (3.806 > 3.377), even though the same original softmax loss is used?
A1: Because MORPH was collected in a controlled environment, the image resolution and lighting conditions are easy to control. On the contrary, most facial images in AFAD were collected in a natural way, so many images are blurred (e.g., a bad front-facing camera), noisy (e.g., a weird hairstyle covering even the whole face) and so on. The gap is therefore quite reasonable.
A2: This indicates that, compared with MORPH, the ECS problem of AFAD is more serious; that is, its number of samples at each age is skewed, as can be seen from the two datasets' age distributions.
A3: This shows that ICS is a crucial factor in age estimation, and that the proposed cost matrix is effective.
Q4: Why does CS-softmax loss yield the best performance improvement?
A4: First, it formally combines ICS and ECS; second, the significance of CS for age estimation is evident.
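As a rough illustration of the two cost-sensitive ingredients being ablated, the sketch below composes a class-weighted cross-entropy term (ECS) with an expected-misclassification-cost term (ICS). The exact CS-softmax formulation is defined earlier in the paper; the inverse-frequency style weights and the additive combination used here are assumptions for illustration only.

```python
import numpy as np

def cs_softmax_loss(logits, label, class_weight, cost):
    # Numerically stable softmax over the age classes.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    # ECS term: class-weighted cross-entropy; class_weight could hold, e.g.,
    # inverse class frequencies to counter the imbalanced age distribution.
    ecs = -class_weight[label] * np.log(p[label] + 1e-12)
    # ICS term: expected misclassification cost, where row `label` of the cost
    # matrix holds the costs of predicting each other age class.
    ics = float(np.dot(cost[label], p))
    return float(ecs + ics)
```

A confident, correct prediction drives both terms toward zero, while a flat predictive distribution is penalized by both the cross-entropy and the expected cost.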

E. CONTRAST EXPERIMENT
To illustrate the advantage of deep cost-sensitive learning, we compared our results with state-of-the-art age estimation methods. MAE and CS are shown in TABLE 2 and in Fig. 8 and Fig. 9, respectively. The following is a brief introduction to the methods involved in the table and figures. Deeply learned feature [9] uses the feature maps of the convolution and pooling layers of a 4-layer ConvNet as representations, and finally feeds them to SVR for age regression. MR-CNN [10] and C-CNN [11] regard age estimation as regression and classification problems, respectively. OH-ranker [26], Ranking-CNN [19] and OR-CNN [10] consider the relative order of age labels [37] and then perform a series of binary classifications. The difference among the three lies in the choice of classifier: the first uses SVM and the last two use ConvNets. The difference between OR-CNN and Ranking-CNN is that the former model is more concise because it shares the intermediate representation of a single ConvNet. To improve the ConvNet's fitting ability, multi-scale analysis and locally aligned patches have been added in Mult-scale [18]. DEX [17] refines the prediction with the softmax expected value after the softmax loss layer during the testing phase to estimate apparent age. CNN-ELM [38] uses a CNN to extract features from the input face image, and then uses an Extreme Learning Machine (ELM) to classify the intermediate results. D2C [39] proposes a novel cumulative hidden layer supervised by a point-wise cumulative signal; through this layer, their model is learned indirectly from faces of neighboring ages.
GA-DFL [40] splits ordinal ages into a set of discrete groups and learns deep feature transformations across age groups to project each face pair into a new feature space, where the intra-group variances of positive face pairs from the training set are minimized and the inter-group variances of negative face pairs are maximized simultaneously. In addition, to illustrate the portability of the proposed rectified softmax loss function, we embedded it into Ranking-CNN and OR-CNN respectively, forming Ranking-CSCNN and OR-CSCNN.
As can be seen from TABLE 2, our method achieved first place on both the white and black races (MORPH) and the yellow race (AFAD), which shows that our method is simple yet effective, in line with the idea of Occam's razor: entities should not be multiplied unnecessarily. In addition, some conclusions can be drawn: 1. Three of the methods are based on rank theory [37], and all of them achieve good results, which indicates that the relative order among age labels is a very important factor in age estimation.
2. Except for OH-ranker, the other state-of-the-art methods are based on deep learning, but unfortunately they mainly rely on transfer learning to improve performance.
3. The only method comparable to our approach is Ranking-CNN, but its downside is that it uses multiple ConvNets, which wastes computing resources and space.
4. The two methods with the rectified softmax loss function performed very well, and the result of Ranking-CSCNN is even better than our method, which shows that our loss adapts well to CNN-based facial age estimation methods.
As can be seen from Fig. 8 and Fig. 9, when the tolerable age error is 0-15 years, our method is ahead of the other methods most of the time. This adds a further argument to our view: modifying the ConvNet structure to account for the specific factors of a specific task is more reliable than blind transfer learning. In this paper, the specific task is age estimation and the specific factor is cost-sensitive learning.
In Fig. 10 and Fig. 11, we further compare the per-age accuracy of the original loss function and the rectified loss function. Again, our method leads in almost all cases. Note that the per-age accuracy of the original softmax loss function is extremely skewed, and its overall performance is mediocre. Our CS-softmax loss function is more uniform because it solves the class-imbalance problem. In addition, owing to the consideration of misclassification cost, it surpasses the original softmax loss function in almost every case. This result demonstrates in detail the superiority of using the rectified softmax loss function to estimate age.

V. DISCUSSION
As SE-ResNet is one of the most advanced ConvNets at present, experiments based on it alone are not entirely convincing. Therefore, we retested the performance of the CS-softmax loss on a relatively shallow network composed of three convolutional layers, two max-pooling layers and a fully connected layer. We first pre-trained the network on VGGFace2 [31] and then obtained a clear performance improvement on MORPH, i.e., 3.723 - 3.151 = 0.572 in MAE, which is even slightly larger than that of the deeper ConvNet. This suggests that the performance gain of our method is independent of the depth of the ConvNet.
The proposed CS-softmax loss can serve as a loss function for age classification in later works, just as Dropout [32], ReLU [33], and Batch Normalization (BN) [34] have become deeply embedded in current ConvNets. The reason is that both ECS and ICS are deeply rooted in age classification.
Although the deep cost-sensitive learning we discuss is aimed at age estimation, it is undeniable that for any multi-class or binary classification problem, the classification accuracy can be improved by merely modifying the preliminary cost matrix c toward the specific problem (e.g., if the misclassification costs are all equal, let c be a uniform matrix, i.e., c == [1]_{n×n}). After all, even though misclassification costs are likely to differ across pattern recognition tasks, the class-imbalance problem is present in almost all benchmark datasets.
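A minimal sketch of such a preliminary cost matrix follows. The |i - j| ordinal form is an illustrative choice suited to age labels (the paper's own matrix, defined earlier, may differ), while the uniform case matches c == [1]_{n×n} from the text.

```python
import numpy as np

def make_cost_matrix(n, uniform=False):
    # uniform=True reproduces c == [1]_{n x n}: all misclassification costs
    # equal, recovering a plain multi-class setting.
    if uniform:
        return np.ones((n, n))
    # Otherwise, an illustrative ordinal choice: the cost grows with the
    # label distance |i - j|, so confusing age 20 with 50 costs more than
    # confusing it with 21.
    idx = np.arange(n)
    return np.abs(np.subtract.outer(idx, idx)).astype(float)
```

Adapting the method to a new task then amounts to swapping in whichever matrix encodes that task's misclassification costs.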
In addition, in the future work, some other factors inherent in age estimation can be considered to further improve its performance.

VI. CONCLUSION
We proposed a rectified softmax loss with all-sided cost sensitivity for age estimation, called CS-softmax loss. Specifically, it covers the most common problems in age classification: the uneven distribution of facial datasets and the cost of age misclassification. For the latter, we further proposed a novel cost matrix that maximizes the desired class probability while considering the misclassification cost. Finally, the effectiveness of the method was demonstrated by theoretical analysis and by experiments on interracial benchmark datasets.