Combined Deep Learning With Directed Acyclic Graph SVM for Local Adjustment of Age Estimation

To further improve the accuracy of age estimation, a locally adjusted age estimation algorithm based on deep learning and a directed acyclic graph SVM is proposed. In the training phase, an SE-ResNet-50 network pre-trained on the VGGFace2 dataset is first fine-tuned. Once the network converges, the vector consisting of the parameters of the last fully connected layer is used as a representation to train multiple one-versus-one SVMs. In the test phase, the face image to be estimated is first fed into SE-ResNet-50 to obtain a rough age estimate; the trained SVMs are then combined into a directed acyclic graph SVM, and a specific neighborhood centered on this global estimate is searched to obtain an accurate age estimate. To show the universality of the proposed coarse-to-fine, global-to-local method, experiments were carried out on the MORPH and AFAD image sets, which cover different races, and the results verified the effectiveness of the algorithm.


I. INTRODUCTION
Age estimation aims to identify the age value or age group of the input face image. Although automatic age estimation based on face images is an important technology in many practical applications, such as multimedia applications and human-computer interaction, estimating age from face images is still a challenging problem. Different people age in different ways: the aging process depends not only on genes but also on many external factors, such as physical condition, lifestyle, place of residence, and weather conditions. In addition, because of differing use of cosmetics and accessories, the apparent age of men and women may also differ. How to extract general discriminative characteristics of aging while reducing the negative effects of individual differences remains an open problem.
Age estimation methods based on classic machine learning usually include two steps: feature extraction and age discrimination. Feature extraction usually uses the active appearance model [1], local binary patterns [2], manifold learning [3], bionic features [4] and other shallow representation methods, after which machine learning methods such as the K-nearest neighbor method [4], quadratic regression functions [5] or support vector regression [6] are used for the final age discrimination.
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
In recent years, deep learning methods have often been used for age estimation. Zhang et al. [6] proposed a novel fine-grained age estimation method based on long short-term memory (LSTM) networks and inspired by the visual attention mechanism. This method combines the residual network with LSTM units to construct an LSTM-ResNet network that extracts the local features of age-sensitive areas, thereby effectively improving the accuracy of age estimation. Xie and Pun [7] adopted a decomposition idea and proposed two-group classification for deep and ordinal ensemble learning. Specifically, they first establish an ensemble based on convolutional neural network (CNN) techniques, in which the ordinal relationship is implicitly constructed by the base learners. Each base learner classifies the target face into one of two specific age groups. After obtaining the probability predictions for the different age groups, they aggregate them by converting them into a value distribution over the entire set of age groups, and obtain the final age estimate from the votes. Lee et al. [8] proposed a deep residual learning model for age and gender estimation. Their method detects faces in the input image and then estimates the age and gender of each face. The estimator is composed of three deep neural networks and adopts residual learning. Li and Xing [9] proposed a label-sensitive deep metric learning method for facial age estimation. Inspired by the fact that human age labels are chronologically ordered, the proposed algorithm seeks a series of hierarchical nonlinear transformations through a deep residual network to project face samples into a latent common space, in which similarity is isotonic with age so that ranking differences are preserved. Singhal and Majumdar [10] addressed the problem of estimating age and gender from frontal photos. They formulate it as a regression problem, which is a natural way to handle gender and age: gender can be expressed as a single variable taking a binary value (male or female), while age can be expressed as a single variable taking a non-negative real value. They formulate the regression within the newly proposed deep dictionary learning framework. Previous work on this topic addressed unsupervised representation learning; in their work, regression is built into the deep dictionary learning framework to supervise the formulation.
The above methods, whether based on traditional machine learning or deep learning, usually use only one specific generative model, discriminative model, classification CNN or regression CNN for age estimation. For traditional machine learning, the disadvantage is that the performance is usually unsatisfactory. For deep learning methods, once hyperparameters such as the sample size and the number of iterations are set unreasonably, or the parameters have not fully converged, there is no fault tolerance, and the final age estimation accuracy can be decisively affected.
To address this shortcoming and further improve the accuracy of age estimation, the classic machine learning method is combined with the deep learning method, yielding a locally adjusted age estimation method (LAAE) that proceeds from coarse to fine and from global to local. The flow chart can be seen in Figure 1. Specifically, in the training phase, the SE-ResNet-50 [11] pre-trained on the VGGFace2 [12] data set is first fine-tuned. When it converges, the fully connected layer is extracted, and the vector formed by concatenating its parameters end to end is used as a representation to train multiple one-versus-one SVMs. In the test phase, the face image to be estimated is first fed into SE-ResNet-50 to get a rough age estimate; a specific neighborhood is then set, and the trained SVMs are combined into a directed acyclic graph SVM for accurate age estimation.

II. RELATED WORKS
A. FACE AGE ESTIMATION
Face age estimation extracts age-related facial features from face images and builds an age estimation model that estimates the specific age or age range of an input test image [13]. The current age representation models for face images mainly include anthropometric models [14], active appearance models (AAM) [1], the aging pattern subspace (AGES) [15], the age manifold [16], and bio-inspired features (BIF) [5].

1) ANTHROPOMETRIC MODELS
The anthropometric model measures and compares the distances and ratios between facial feature points, and mainly reflects the changes in facial bone development with age. This method has a small error when estimating younger ages but is not applicable to adult faces [17]. In other words, to distinguish between young adults and the elderly, the age-related changes of facial soft tissue and skin must be incorporated so that texture features can be analyzed [14]. The anthropometric model for age estimation was proposed by Mlinar [7] and became the main method of age estimation research in the 1990s. This model manually marks facial feature points on the two-dimensional image and measures the changes in the distances and proportions of the feature points to estimate age [18]. Subsequently, Farkas [19] conducted a facial metrology study and defined 57 feature points that change with age. Pitanguy et al. [20] measured the sizes of facial organs and bones and selected features that characterize how the face changes with age; the resulting parameters indicate a non-linear relationship between age and face parameters. Takimoto et al. [21] summarized, from the perspective of the face image, how detailed facial features change across three age stages. Wang et al. [22] proposed a facial image feature representation that combines facial geometric proportion features extracted from the craniofacial growth model with local facial texture features extracted using fractional differential theory, and achieved good results in age estimation. In summary, the anthropometric model is mainly applicable to teenagers. Because it considers only geometric features and ignores texture features, and relies on manually marked feature points, it cannot adequately reflect how faces change with age.
VOLUME 9, 2021

2) AAM
AAM, first proposed by Cootes et al. [1], is a method for the rapid extraction of image features. By jointly considering global shape and texture information in a statistical analysis, it establishes a combined face model. In 2004, Lanitis et al. [23] first applied it to the age estimation of human faces and established the relationship between age and facial image features through functional expressions. Suo et al. [24] used AAM to improve the localization of face details in terms of both shape and texture features, and then used an artificial immune recognition system to estimate age from face images. Luu et al. [25] used AAM to divide face images into children and adults and combined it with a support vector machine (SVM) to estimate age. In 2014, Du et al. [26] extracted feature points using AAM and constructed proportion vectors and relative displacement vectors of different facial expressions as the key input features for face recognition and facial expression analysis. They used these key features to pre-classify and analyze the facial images in the face database, which effectively improved recognition accuracy and efficiency. Compared with the anthropometric model, AAM is applicable to face images of any age because it considers both the shape and texture features of the face [14]. However, this model requires accurate automatic localization of facial feature points at the outset; otherwise the localization error is easily amplified in subsequent processing [21].

3) AGES
Based on the idea of age estimation proposed by Fu et al. [18], Singhal and Majumdar [10] proposed AGES in 2007: facial images of the same individual are collected continuously and ordered by age, so as to establish a representative subspace of facial aging patterns, on which age estimation of facial images is then based. Wang et al. [22] similarly combined age weights with shape model parameters to form a model space and established a strict statistical age model in this space. The AGES model uses the morphological changes of the same individual's face, which is more in line with objective reality. However, sample collection requires face images of each subject at all ages, which is difficult to achieve. Moreover, the vectors in this model are high-dimensional and require a large amount of computation, which may lead to the curse of dimensionality [21].
Age Manifold. The age manifold [11] is a general low-dimensional face aging pattern, learned with manifold embedding techniques from face images of different individuals at different ages [23]. Currently, common methods for age manifold learning include locality preserving projection (LPP) [24], orthogonal locality preserving projection (OLPP) [25] and conformal embedding analysis (CEA) [26]. Hu et al. [11] applied manifold learning to find an effective embedding space and used a linear regression function to fit the low-dimensional manifold data. Finally, the manifold data points were modeled with a quadratic regression function.
Compared with AGES, the age manifold trains a common aging model from face images of different individuals at different ages, without requiring an individual-specific aging pattern, but it requires sufficient training data [21].

4) BIF
In 2009, Cao et al. [12] proposed a BIF model for face age research, building on an object recognition framework based on feature combination [27]. Currently, the methods of Luu et al. [28] and Lu and Shi [29] are widely used in this field; they complete face age estimation automatically with BIF, which mimics the information processing mechanism of the mammalian visual cortex.
Compared with other models, face age estimation based on BIF achieves higher accuracy [30].
The remainder of this paper is organized as follows. Our backbone network, SE-ResNet, is described in Section II-B. Section III gives a detailed description of the proposed method. Section IV reports the experiments. Section V discusses the transferability of our method. Section VI concludes the paper and outlines future research directions.

B. SE-RESNET-50
1) RESNET
In VGGNet [31], the CNN reached 19 layers, and in GoogLeNet [32], the number of layers reached an unprecedented 22. However, in deep learning, increasing the number of network layers is usually accompanied by several problems: consumption of computing resources, model overfitting, and vanishing or exploding gradients. For enterprises or universities with sufficient research funds, the shortage of computing resources can be solved with a GPU cluster. Overfitting can be mitigated by collecting a large amount of valid sample data together with regularization methods such as Dropout [33]. The gradient problem can be mitigated by batch normalization. It might therefore seem that continually increasing the number of layers always brings benefits, but experimental data do not support this view [34]: beyond a certain depth, the training error increases as the network gets deeper. When the network degrades in this way, a shallow network achieves a better training result than a deep one. If, however, the features of the lower layers are passed directly to the higher layers, the result should be no worse than that of the shallow network. From the perspective of information theory, because of the data processing inequality, the original image information contained in the feature maps decreases layer by layer during forward propagation, while adding an identity mapping lets each later layer retain the image information of the layer before it. Based on this idea of shortcut mapping, the residual neural network was developed.
The residual network is formed by adding a series of residual modules to the original neural network, as shown in Figure 2. The module in Figure 2 can be expressed as $X_{l+1} = H(X_l) + F(X_l, W_l)$, in which $H(X_l) = X_l$ is the identity mapping on the left-hand branch of the figure and $F(X_l, W_l)$ is the residual on the right-hand branch, where $W_l$ denotes the weights and biases of layer $l$. When the feature maps of the current layer and the next layer have different numbers of dimensions, a $1 \times 1$ convolution is required to reduce or raise the dimension. In that case $H(X_l) = W_l X_l$, in which $W_l$ here denotes the $1 \times 1$ convolution.
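The residual formulation above can be illustrated with a minimal sketch in plain Python. The names `residual_block` and `toy_residual` are hypothetical, and the element-wise toy branch only stands in for the convolutional layers of a real residual module; the 1 * 1 projection case is not modelled.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return X_{l+1} = H(X_l) + F(X_l, W_l) with the identity shortcut H(X_l) = X_l."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        # With mismatched dimensions the paper applies a 1x1 convolution on the
        # shortcut; that projection is deliberately left out of this toy sketch.
        raise ValueError("shortcut and residual branch must have the same dimension")
    return [xi + fi for xi, fi in zip(x, fx)]


def toy_residual(x: Vector) -> Vector:
    """Hypothetical residual branch F: a scaled ReLU standing in for conv layers."""
    return [0.5 * max(xi, 0.0) for xi in x]


out = residual_block([1.0, -2.0, 4.0], toy_residual)
# If the residual branch outputs zero, the block reduces to the identity mapping.
identity_out = residual_block([1.0, -2.0, 4.0], lambda x: [0.0] * len(x))
```

The second call shows why the shortcut makes a deeper network no worse than a shallower one in principle: a block whose residual branch learns zero simply passes its input through.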

2) SQUEEZE-AND-EXCITATION MODULE
In the convolutional layer of a convolutional neural network, a set of convolution kernels can be regarded as a neighborhood spatial connection pattern on the input channels, which fuses spatial information and channel information within the local receptive field [11]. The convolutional neural network generates robust representations by stacking a series of convolutional layers, nonlinear activation functions and pooling operations, capturing hierarchical patterns and obtaining a theoretically global receptive field. Much research has sought to improve network performance at the level of spatial information. For example, the Inception structure embeds multi-scale information and aggregates features from multiple receptive fields, and the Inside-Outside network considers spatial neighborhood information. The Squeeze-and-Excitation (SE) module instead improves network performance by considering the relationships between feature channels. The approach is to learn the importance of each feature channel automatically, and then use this importance to enhance useful features and suppress features that are of little use for the current task. The operations of each part of the SE module are as follows:
(1) $F_{tr}$: Generally a convolution operation.
(2) $F_{sq}$: Squeeze operation. Features are compressed along the spatial dimension so that the output dimension matches the number of input feature channels. Each two-dimensional feature channel is transformed into a scalar, which to some extent has a global receptive field. It represents the global distribution of responses on the feature channel and makes the global receptive field available even at layers close to the input.
(3) $F_{ex}$: Excitation operation. It is a mechanism similar to the gates in recurrent neural networks, which generates a weight for each feature channel through learned parameters that explicitly model the correlation between feature channels.
(4) $F_{scale}$: Reweighting (i.e., Scale) operation. The weights output by the excitation are regarded as the importance of each feature channel after feature selection, and are multiplied channel by channel onto the previous features to complete the recalibration of the original features along the channel dimension.
The SE module can be integrated into a network such as Inception or a residual network. This paper uses SE-ResNet-50 as the backbone network, as shown in Figure 3. The features first pass through a residual module and are then squeezed by global average pooling. Two fully connected layers follow to explicitly model the correlation between channels: the first reduces the feature dimension to 1/r of the original (r is generally 16), and the second restores it to the original dimension. Compared with a single fully connected layer, this bottleneck has stronger nonlinearity and greatly reduces the number of parameters and the computational complexity. A sigmoid then maps the feature weights to values between 0 and 1, and finally the Scale operation applies these weights to the channel features.
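The squeeze, excitation and scale operations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the network used in the paper: the function names, the toy weights and the bias-free layers are hypothetical, and the ReLU between the two fully connected layers follows the usual SE design rather than anything stated in the text.

```python
import math
from typing import List

Channel = List[List[float]]  # one 2-D feature map (rows of values)


def squeeze(features: List[Channel]) -> List[float]:
    """F_sq: global average pooling, giving one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]


def dense(x: List[float], weights: List[List[float]]) -> List[float]:
    """Bias-free fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def excite(z: List[float], w_down: List[List[float]], w_up: List[List[float]]) -> List[float]:
    """F_ex: bottleneck C -> C/r -> C followed by a sigmoid gate per channel."""
    hidden = [max(h, 0.0) for h in dense(z, w_down)]  # ReLU (assumed, see lead-in)
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w_up)]


def se_block(features: List[Channel], w_down: List[List[float]],
             w_up: List[List[float]]) -> List[Channel]:
    """F_scale: multiply every channel by its gate from the excitation step."""
    gates = excite(squeeze(features), w_down, w_up)
    return [[[g * v for v in row] for row in ch] for ch, g in zip(features, gates)]


# Toy input: C = 2 channels of size 2x2, reduction ratio r = 2 (hidden width 1).
features = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
w_down = [[0.5, 0.5]]   # hypothetical weights, shape (C/r) x C
w_up = [[1.0], [-1.0]]  # hypothetical weights, shape C x (C/r)

pooled = squeeze(features)
gates = excite(pooled, w_down, w_up)
recalibrated = se_block(features, w_down, w_up)
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first channel is emphasized and the second suppressed.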

III. LAAE
A. LOCAL ADJUSTMENT
The idea of LAAE is to move the age value estimated by the CNN as close as possible to the real age within a local neighborhood, as shown in Figure 4.
Assume that for input data y, the corresponding CNN output is f(y), the small black circle in Figure 4. f(y) may still be some distance from the actual age value L, the small red circle in the figure. The idea of locally adjusted age estimation is therefore to slide the estimate f(y) to the left or right (i.e., decrease or increase it) within a neighborhood of width 2d so that it moves closer to the actual age value L, which can be expressed as $L \in [f(y) - d, f(y) + d]$.
In this way, locally adjusted age estimation can be divided into two steps:
1) Age classification of all training data using the CNN. This step can be considered a rough estimate or a global estimate.
2) Local adjustment within a small range around the result of the first step. Correspondingly, this step can be considered fine-tuning or local estimation.
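The second step only has to consider the age labels inside the neighborhood of the coarse estimate. A minimal sketch of that candidate set follows; the helper names are hypothetical, and clamping the window to the label range of the data set is an assumption that the text does not spell out. The range 16 to 40 is taken from the AFAD description later in the paper.

```python
from typing import List


def local_candidates(coarse_age: int, d: int, min_age: int, max_age: int) -> List[int]:
    """Integer age labels inside [f(y) - d, f(y) + d], clamped to the label range."""
    low = max(min_age, coarse_age - d)
    high = min(max_age, coarse_age + d)
    return list(range(low, high + 1))


def covers_true_age(coarse_age: int, d: int, true_age: int) -> bool:
    """True when the real age L lies in the neighborhood, i.e. |L - f(y)| <= d."""
    return abs(true_age - coarse_age) <= d


middle = local_candidates(30, 2, 16, 40)  # window fully inside the label range
edge = local_candidates(17, 4, 16, 40)    # window clipped at the lower bound
```

A local adjustment can only recover the real age when `covers_true_age` holds, which is why the choice of d matters.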
The key problem is then how to verify the different age values within a given range for local adjustment. Our goal is to move the age originally estimated by the global step as close as possible to the real age. We treat each age label as a class and use classification to adjust or verify the different age values locally. Because only a small number of age labels are involved in each local adjustment, regression does not work properly. Many classifiers could be used for classification-based local adjustment; here we use a linear SVM, mainly because SVMs are robust when few training samples are available. This has been demonstrated in previous small-sample studies, such as face recognition [35], [36], image retrieval [37], audio classification and retrieval [38] and facial expression recognition [39].

B. LINEAR SVM
Given training vectors $(y_1, z_1), \ldots, (y_n, z_n)$ belonging to two classes, where $y_i \in \mathbb{R}^d$ and $z_i \in \{-1, +1\}$, a linear SVM learns an optimal classification hyperplane $w \cdot y + b = 0$ that maximizes the margin between the two classes [40], [41]. The learning essence of the SVM is to find the saddle point of the Lagrange functional (1), where S is the Lagrange multiplier. The optimization objective can be translated into a dual problem, and the optimal hyperplane can be expressed through the dual solution. The value of b can then be obtained by substituting into the original equation $w \cdot y + b = 0$.
At test time, the classification result for any data point y is given by the sign function $\mathrm{sign}(w \cdot y + b)$ (Formula 4). If the training data are not separable, slack variables $\xi_i$ can be introduced. A detailed introduction can be found in [40].
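For reference, the standard hard-margin linear SVM that this description matches can be written as below. This is a sketch under assumptions: the multipliers, which the text denotes S, are written $\alpha_i$ here, and the numbering of (2) and (3) is inferred from the references to (1) and Formula 4 in the text.

```latex
\begin{align}
L(w, b, \alpha) &= \tfrac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \bigl[ z_i (w \cdot y_i + b) - 1 \bigr],
  \quad \alpha_i \ge 0 \tag{1} \\
\max_{\alpha}\; & \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j \, (y_i \cdot y_j)
  \quad \text{s.t. } \alpha_i \ge 0,\ \sum_{i=1}^{n} \alpha_i z_i = 0 \tag{2} \\
w &= \sum_{i=1}^{n} \alpha_i z_i y_i \tag{3} \\
f(y) &= \operatorname{sign}(w \cdot y + b)
  = \operatorname{sign}\Bigl( \sum_{i=1}^{n} \alpha_i z_i \, (y_i \cdot y) + b \Bigr) \tag{4}
\end{align}
```

Only the training vectors with $\alpha_i > 0$ (the support vectors) contribute to (3) and (4), which is consistent with the robustness to small sample sizes mentioned above.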

C. DIRECTED ACYCLIC GRAPH SVM
The classical SVM was originally designed for binary classification. It can be extended to multi-class problems in the following ways:
1) One-versus-one: learn a classifier for every pair of classes;
2) One-versus-many: for each class, train an SVM that separates it from the rest;
3) Many-versus-many: train an SVM on all classes at the same time.
The last two methods are clearly unsuitable for our algorithm: because the local adjustment involves only a small subset of the classes, they would require SVMs to be trained dynamically at every local adjustment, which would undoubtedly increase the complexity. The first method is feasible for this task because it does not require online training; all pairwise SVM classifiers are trained offline.
When combining multiple one-versus-one binary classifiers, the idea of the directed acyclic graph from graph theory can be used to combine them into a multi-class classifier [42]. For an n-class problem, the directed acyclic graph SVM requires the construction of $C_n^2 = n(n-1)/2$ classifiers, corresponding to $n(n-1)/2$ nodes distributed over an n-layer structure. Taking n = 4 as an example, the topology of the directed acyclic graph SVM is shown in Figure 5.
As can be seen from Figure 5, the top layer of the directed acyclic graph contains only one node, the root node; the second layer contains two nodes; and so on, with the i-th layer containing i nodes, until the bottom layer completes the classification into n classes. When a sample is input, evaluation starts from the root node, and the decision value of the sign function $\mathrm{sign}(w \cdot y + b)$ is calculated at each node (see Formula 4). If the value is -1, the sample enters the left child node; if it is 1, it enters the right child node. Continuing in this way, the leaf node reached in the last layer gives the category of the sample. From this point of view, the directed acyclic graph is equivalent to operating on a list: the list initially contains all the classes; each node compares the first and last classes in the list, excludes the one to which the sample is least likely to belong, and deletes it from the list; the only class remaining at the end is the class of the sample.
In general, for an n-class problem, only n − 1 comparisons are needed during the test phase. Here, the number of pairwise comparisons is reduced to m − 1, because only m classes are involved in the local adjustment (m < n).
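The list view of the directed acyclic graph can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: `toy_pairwise` is a hypothetical stand-in for a trained linear SVM on a one-dimensional feature, and the convention that -1 eliminates the larger label is chosen here for the example.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Tuple

PairwiseClassifier = Callable[[List[float]], int]  # returns -1 or +1


def dag_svm_predict(x: List[float], classes: Iterable[int],
                    classifiers: Dict[Tuple[int, int], PairwiseClassifier]) -> Tuple[int, int]:
    """Eliminate one class per pairwise decision; return (label, comparisons)."""
    remaining = sorted(classes)
    comparisons = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        comparisons += 1
        if classifiers[(first, last)](x) < 0:
            remaining.pop()    # -1: the sample is judged not to be `last`
        else:
            remaining.pop(0)   # +1: the sample is judged not to be `first`
    return remaining[0], comparisons


def toy_pairwise(a: int, b: int) -> PairwiseClassifier:
    """Hypothetical stand-in for a trained SVM separating ages a < b."""
    # sign(w * y + bias) with w = 1 and bias = -(a + b) / 2 on a 1-D feature.
    return lambda x: -1 if x[0] - (a + b) / 2.0 < 0 else 1


ages = range(16, 41)  # n = 25 labels, all pairs trained offline
classifiers = {(a, b): toy_pairwise(a, b) for a, b in combinations(ages, 2)}

global_result = dag_svm_predict([30.2], ages, classifiers)           # n - 1 comparisons
local_result = dag_svm_predict([30.2], range(27, 34), classifiers)   # m - 1 comparisons
```

Because every pair is trained once offline, restricting the test to the m labels of a neighborhood only changes which entries of `classifiers` are looked up.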

D. THE DESIGN OF NEIGHBORHOOD
In theory, it is difficult to design the neighborhood $U(f(y), d) = \{x \mid f(y) - d < x < f(y) + d\}$ for local adjustment because it is determined by many factors, such as the sample size and the performance of the coarse estimator. There are, however, broad guidelines: the wider the search, the greater the chance of including the real age within the range. If the search range is too small to reach the actual age label, the local search may return an arbitrary age label. On the other hand, if the local search range is too wide, it also increases the possibility of adjusting the estimate away from the real age, because local classification is only a locally optimal search.
To adjust the age estimate locally and satisfy the special topology of the directed acyclic graph SVM, we tried different local search ranges that are powers of 2, i.e., 2 (d = 1), 4 (d = 2), 8 (d = 4), and 16 (d = 8).
Theoretically, we could extend the search scope to cover the whole age range of the data set, but this would not satisfy the "local adjustment" strategy, so we set the search scope to 16 at most.
In the experiments in the next section, we specify different scopes and demonstrate the impact of different local search scopes on the results. The main purpose is to show that local adjustment can indeed improve the age estimation performance of a single machine learning classifier or deep learning network.

IV. EXPERIMENT
A. IMAGE SETS
To verify the effectiveness and universality of the proposed method, the AFAD image set [43], composed of Asian faces, and the MORPH image set [44], composed of White and Black faces, were selected for the ablation and comparison experiments.

1) AFAD IMAGE SET
AFAD includes approximately 160,000 images from social media, ranging in age from 16 to 40 years. Not only is it the largest open source data set available for age estimation, but it is also very useful for studying the facial age in unconstrained environments. Since there is no official criterion for dividing the training set and the testing set in AFAD, AFAD was randomly divided into 80% training set and 20% test set in order to compare with other age estimation methods. Some examples of the AFAD image set are shown in Figure 6.

2) MORPH IMAGE SET
MORPH consists of more than 55,000 face images of about 13,000 people, ranging in age from 17 to 77 years. Some examples of the MORPH dataset are shown in Figure 7, and the training protocol is similar to that of AFAD.

B. PREPROCESSING, EXPERIMENTAL SETUP AND EVALUATION INDEX
Before age estimation, the original face images are preprocessed as follows: the cascaded VJ (Viola-Jones) detector [45] is used for face detection, AAM [1] is then used to locate the facial reference points, and finally the face image is scaled to 224 × 224 for the experiments. The experiments were carried out with the open source GPU-enabled framework Caffe [46], and the pre-trained SE-ResNet-50 model was taken from [12].
The performance of age estimation is evaluated by means of two measures: Mean Absolute Error (MAE) and Cumulative Score (CS).
MAE is defined as the average absolute error between the predicted and actual age values: $\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} |l_k - \hat{l}_k|$, where $l_k$ is the actual age value of test sample $k$, $\hat{l}_k$ is the estimated age value and $N$ is the sample size of the test set.
CS is defined as $\mathrm{CS}(j) = N_{e \le j} / N \times 100\%$, where $N_{e \le j}$ is the number of test images whose absolute error is not greater than $j$ (i.e., the tolerated age error).
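Both measures are straightforward to compute. The sketch below uses hypothetical function names and made-up toy ages purely to illustrate the two definitions; it is not the evaluation code behind the reported results.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> float:
    """MAE: average of |l_k - l_hat_k| over the N test samples."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_ages, predicted_ages)) / len(true_ages)


def cumulative_score(true_ages: Sequence[float], predicted_ages: Sequence[float],
                     tolerance: float) -> float:
    """CS(j): percentage of samples whose absolute error is at most j."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(true_ages, predicted_ages) if abs(t - p) <= tolerance)
    return 100.0 * hits / len(true_ages)


# Toy data (illustrative only): absolute errors are 2, 0, 3 and 5 years.
true_ages = [20, 25, 30, 35]
predicted_ages = [22, 25, 27, 40]

mae = mean_absolute_error(true_ages, predicted_ages)
cs_2 = cumulative_score(true_ages, predicted_ages, 2)
cs_5 = cumulative_score(true_ages, predicted_ages, 5)
```

A lower MAE and a higher CS at a given tolerance both indicate better age estimation.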

C. ABLATION EXPERIMENT
To demonstrate the validity of LAAE, we specify different neighborhoods and show the influence of different local search scopes on the results. For comparison, ablation experiments using only SE-ResNet and only the DAG SVM were also included (in the latter case, features were extracted from the three-channel image pixels with a linear dimension reduction, namely principal component analysis (PCA)); the results are shown in TABLE 1. The following conclusions can be drawn from TABLE 1:
1. Performance on MORPH is always better than on AFAD. The reason is that the images in MORPH were taken in an official setting, with benign lighting conditions and camera performance, while the images in AFAD were crawled from social networks and therefore vary in resolution, which makes a difference in performance.
2. On both data sets, the deep learning method alone performs better than the classic machine learning method alone (i.e., DAG-SVM in this section), which further demonstrates the superiority of deep learning.
3. Local adjustment always outperforms the pure machine learning method and the pure deep learning method, but different neighborhoods yield quite different performance, and the best neighborhood settings on the two data sets are not the same. The reason lies in the difference between MORPH and AFAD: MORPH has more categories, so a larger search range gives better performance, whereas on AFAD the best performance is reached at d = 4, and larger neighborhoods make the result worse.
We take d = 8 as the maximum neighborhood here. Besides the reason mentioned in the previous section, namely that a larger neighborhood increasingly violates the premise of local adjustment, there is another important reason: if d = 16 were used, the scope of local adjustment would expand to 32, which exceeds the number of categories in AFAD, 40 − 16 + 1 = 25, and would therefore amount to a second global estimate of age.

D. CONTRAST EXPERIMENT
To further verify the validity of the method, it is compared with other age estimation methods based on deep learning; the results are shown in TABLE 2 and Figures 8 and 9.
In TABLE 2, the methods of Zhang et al. [6], Xie and Pun [7], Lee et al. [8], Li and Xing [9], and Singhal and Majumdar [10] are described in the introduction. The remaining methods are described below. The deep embedding method [47] proposes an end-to-end deep embedding neural network for robust age estimation. Specifically, it uses a combination of a classification loss and a triplet-based ranking loss to train a deep embedding network that maps input facial images into an embedded metric space, where features of the same age are compact and features of different ages are pushed apart. The deep embedding network can therefore learn more discriminative features and improve the performance of age estimation. TF [48] uses a deep neural network with pre-trained weights to perform image-based gender recognition and age estimation. Specifically, VGG19 and VGGFace pre-trained models are adopted, and transfer learning is studied by testing the influence of different design schemes and training parameter changes so as to improve prediction accuracy. In the test phase, subjects are first classified by sex, and age is then predicted using separate male and female age models. To allow for multiple labels per image, the apparent age method [49] does not use the average age of the labeled face images as a class label; instead, it groups face images within a specific age range. Cluster-CNN [50] is a deep clustering convolutional neural network proposed to estimate age from face images. It is based on clustering rich CNN features, which helps the network deal effectively with the nonlinearity of this task. In particular, for a given face image, the face is first roughly normalized to a standard size based on the distance between the two eyes, and the normalized face is then fed into Cluster-CNN for prediction. The proposed cluster module can capture multi-modal transformations and is differentiable, so that it can be optimized in a unified back propagation framework.
In addition, the mean absolute error of our method on MORPH and AFAD reached 3.04 and 3.17 respectively, clearly better than the previous methods. Compared with Cluster-CNN, the best of the comparison algorithms, the performance of our method improved by about 6% on average.
For the comparison on the cumulative score, we selected the best LAAE setting on each data set: d = 8 on MORPH and d = 4 on AFAD. The results are shown in Figure 8 and Figure 9. When the tolerated age error is greater than 4, our method is ahead of the other comparison methods. In Figure 9, our method is always superior to Cluster-CNN.
In Fig. 10 and Fig. 11, we further compare the accuracy of pure SE-ResNet-50 and LAAE (d = 8 and d = 4) at each age. Again, our method achieves a consistent lead in almost all cases. Note that the accuracy of SE-ResNet-50 is extremely skewed across ages, and its performance is mediocre. Our approach is more uniform, probably because it implicitly mitigates the class imbalance problem. This result also demonstrates in detail the benefit of using local adjustment to estimate age.

V. DISCUSSION
As SE-ResNet is one of the most advanced CNNs at present, experiments based on it alone are not fully convincing. Therefore, we retested the performance of CS-softmax loss on a relatively shallow network composed of four convolutional layers, two max-pooling layers and a fully connected layer. We first pre-trained the network on VGGFace2 [12] and obtained a good performance improvement on MORPH, i.e., 3.873 − 3.415 = 0.458, which is even slightly better than that of the deeper CNN. This suggests that the performance of our method is independent of the depth of the CNN. Most importantly, this result shows that the essence of LAAE is the neighborhood-based local adjustment rather than the specific classifier.

VI. CONCLUSION
This paper presents a locally adjusted age estimation method named LAAE. Specifically, deep learning is first used for a rough global estimate of age, and a fine local estimate is then made with a DAG SVM within a specified neighborhood. The experimental results show that the proposed coarse-to-fine, global-to-local approach performs better than pure deep learning and pure machine learning methods, and the comparison with other methods further illustrates the effectiveness of LAAE. LAAE is also theoretically applicable to other pattern recognition problems. In future research, the search scope could be determined in a data-driven or self-adapting manner rather than set manually and mechanically.