2M BeautyNet: Facial Beauty Prediction Based on Multi-Task Transfer Learning

Facial beauty prediction (FBP) has become an emerging area in the field of artificial intelligence. However, the lack of data and of accurate face representations hinders the development of FBP. Multi-task transfer learning can effectively avoid over-fitting and utilize auxiliary information from related tasks to optimize the main task. In this paper, we present a network named Multi-input Multi-task Beauty Network (2M BeautyNet) and use transfer learning to predict facial beauty. In the experiments, beauty prediction is the main task and gender recognition is the auxiliary task. For multi-task training, we employ a multi-task loss weights automatic learning strategy to improve the performance of FBP. Finally, we replace the softmax classifier with a random forest. We conduct experiments on the Large Scale Facial Beauty Database (LSFBD) and the SCUT-FBP5500 database. Results show that our method achieves good results on LSFBD, with FBP accuracy of up to 68.23%. Our 2M BeautyNet structure is suitable for multiple inputs from different databases.


I. INTRODUCTION
At present, basic research on facial beauty prediction (FBP) has promoted the rapid development of the plastic surgery and cosmetics industries, with applications such as cosmetic recommendation [1], esthetic surgery planning [2], face-based pose analysis [3] and facial beautification [4]. Facial beauty plays an essential role in our daily life, while its perception is influenced by factors such as appearance, social status and personal feelings. Consequently, FBP has become an emerging area in the field of artificial intelligence.
In recent years, most studies of FBP have been based on deep learning [5]-[7]. Although these methods have achieved good results, some challenges remain. One of the main difficulties is the lack of training data: current studies of FBP are based on small databases with only a few thousand images. SCUT-FBP5500 [6] is a facial beauty database of 5,500 images, constructed by South China University of Technology. Our group has built the Large Scale Facial Beauty Database (LSFBD) [7], including 20,000 labeled images (10,000 male and 10,000 female images) and 80,000 unlabeled images. However, training a deep convolutional neural network on a small database is prone to over-fitting. Hence, our group constructed a multi-scale network structure with transfer learning to solve the problem of insufficient data, reaching an accuracy of 67.4% on LSFBD [8].
Currently, most researchers consider single-task learning (STL) in FBP, ignoring the correlation between tasks. Unlike a single task, multi-task learning (MTL) utilizes additional useful information from auxiliary tasks, improving generalization performance and learning efficiency [9], [10]. Besides, a multi-task network has other merits. For example, it contains shared layers that avoid recalculating the characteristics of each task, thus memory usage is reduced [11] and computation speed is improved [12]. Gao et al. [13] proposed a novel multi-task network consisting of facial beauty prediction and landmark detection; the best correlation score is up to 0.92 on the SCUT-FBP benchmark. However, the related task of FBP in their framework is landmark detection, ignoring the influence of other facial attributes, such as gender, emotion, and age. It is worth noting that some studies show smiling faces are more attractive [14], [15]. Therefore, we consider using other face attributes as auxiliary tasks of FBP for MTL. Most existing multi-task network structures have a single input and multiple outputs. They are suitable for datasets in which each image carries multiple labels, which is called multi-label learning. Unfortunately, LSFBD only has a facial beauty label and does not include other facial attribute labels. Hence, our MTL research on FBP and other facial attribute tasks cannot use multi-label learning, but we can perform multi-task research on facial attribute tasks drawn from different databases. For this reason, combining the related tasks of FBP with transfer learning, we propose a multi-input and multi-output network to improve the classification accuracy. Fig. 1 displays the whole framework of our proposed method. Our major contributions are as follows:
(i) We propose a novel network for MTL. 2M BeautyNet is designed for FBP and its auxiliary tasks. We study FBP and gender recognition based on multi-task learning. Meanwhile, transfer learning is combined to extract the deep shared features of faces. Thereby, the shallow features of the main task are enriched, and the classification accuracy is improved.
(ii) We utilize a multi-task loss weights automatic learning strategy. With this strategy, 2M BeautyNet avoids the situation in which one task dominates the entire loss while the other tasks cannot influence the learning process of the shared layers.
(iii) We combine a traditional method with the deep learning method. After multi-task training, the softmax classifier is replaced by a random forest for classification. As a result, the accuracy is improved by about 1%.
The remainder of this paper is organized as follows: Section II reviews related work on FBP and multi-task transfer learning. Section III describes the 2M BeautyNet architecture and the multi-task loss weights automatic learning strategy. Section IV presents the experimental platform and implementation details and analyzes the experimental results. Section V concludes our work.

II. RELATED WORKS

A. FACIAL BEAUTY PREDICTION
Research on FBP has gone through three stages: verification of facial attractiveness hypotheses, handcrafted feature classification, and deep feature classification. At first, people's understanding of facial beauty mainly focused on a series of facial attractiveness hypotheses and aesthetic research elements proposed by psychology and biology. The facial attractiveness hypotheses include the averageness hypothesis, evolutionary hypothesis, symmetry hypothesis, sexual dimorphism, etc. Aesthetic research elements include the "three courts and five eyes" rule, the golden ratio, etc. [16]-[18]. Later, people began to extract handcrafted features of face images for FBP, such as geometric features and texture features. In recent years, using machine learning and computer vision technology to analyze facial beauty empowers machines to judge beauty, which has become another emerging research topic in the field of artificial intelligence [19], [20].
In general, FBP is usually regarded as a classification or regression problem. The process is usually divided into feature extraction and feature classification/regression. In early studies, researchers made great efforts to use different machine learning algorithms and design various features. Eisenthal et al. [21] measured geometric feature point distances and ratios on 92 frontal face images, using SVM and KNN algorithms for facial beauty evaluation. Gunes et al. [22] used a 15-dimensional distance vector and a 13-dimensional distance ratio vector as geometric features and trained a classifier with the C4.5 decision tree. Gray et al. [23] were the first to use texture features to analyze facial beauty: the face image is filtered by 48 filters, and then different features are extracted by changing the image resolution. Although these methods can achieve certain results, handcrafted features lack universality; they are low-level features that cannot capture the deeper levels of facial beauty information.
With the spread of up-to-date deep learning methods, higher-level features learned by CNNs have been applied to facial beauty computation tasks. Our group utilized deep self-learning and Convolutional Restricted Boltzmann Machines (CRBM) to predict facial beauty [24]. A set of works used the pre-trained VGG-16 to extract deep features for facial beauty prediction [25], [26]. However, simply using a trained network as a feature extractor cannot achieve a superior effect. Xu et al. [27] used a new deep cascaded fine-tuning scheme, which fine-tunes the network with various face image inputs respectively. These works indicate that the hidden layers of a deep model can learn useful facial features. Before long, a psychologically inspired convolutional neural network (PI-CNN) for FBP was proposed [28]. To address the lack of labeled data and of discriminative features in FBP, Liu et al. [29] fused deep features and geometric features for the first time, and an end-to-end label distribution learning (LDL) framework was constructed. Shi et al. [30] used a co-attention learning mechanism to weigh the importance of different regions and different facial components, where pixel-wise labeling masks were treated as the meta information of the face. Whether based on features or other means, these works are single-task learning, ignoring the correlation between tasks.

B. MULTI-TASK TRANSFER LEARNING
Multi-task learning can be seen as a type of transfer learning, first proposed by [31]. It makes use of shared information between complementary tasks to enhance generalization, learning, and recognition ability. In deep multi-task learning (DMTL), MTL typically shares parameters between hidden layers. The sharing mechanism is divided into hard sharing and soft sharing. Hard sharing is the most common mechanism in deep multi-task learning: all tasks share the structure and parameters, but each has its own output layer. Baxter et al. [32] proved that the hard sharing mechanism can effectively reduce the risk of over-fitting. In the soft sharing mechanism, by contrast, each task has its own structure and parameters, and the similarity of parameters is guaranteed by regularizing the distance between model parameters [33], [34]. In traditional multi-task learning, soft sharing is largely influenced by regularization technology. Recently, there is a growing wave of research in multi-task learning, by which diverse computer vision tasks are solved. Lu et al. [35] proposed a hierarchical multi-task network (HMTNet), which can simultaneously identify a person's gender, race, and facial attractiveness from a given portrait image. A DMTL model for keypoint detection, face detection, and posture estimation was proposed by [36]. In MTL, the difficulty of each task is different, and the loss weight of each task in these methods is a fixed value set according to experience. Guo et al. [37] proposed a dynamic task priority method, utilizing performance indicators to determine the difficulty level of each task. If the task losses are simply accumulated, one task will converge while the others will degrade. Therefore, Alex et al. [10] used uncertainty to weigh the losses in multi-task learning. To a certain extent, these efforts can avoid the phenomenon of one task dominating the whole loss.
The purpose of transfer learning is to improve learning efficiency in the target domain by transferring knowledge from the source domain [38]. The easiest way to perform transfer learning is for the target domain to reuse the weights of the source domain. Because the pre-trained model already contains a lot of basic information, it enriches the low-level features of the target task and improves model learning performance. For the problem of heterogeneous unsupervised domain adaptation, a shared fuzzy equivalence relation (SFER) method was proposed [39]. Based on distribution adaptation, Jiang et al. [40] proposed multi-label metric transfer learning (MLMTL). Recently, Tan et al. [41] studied a novel transfer learning problem termed Distant Domain Transfer Learning (DDTL), in which large differences can exist between the target and source domains. Xin et al. [42] used transfer learning for smile detection and fine-tuned a face recognition model with different inputs, which effectively improved the performance of smile detection. Also, Amira et al. [43] applied transfer learning to medical research and proposed a segmentation recommender for skin lesion extraction. Lv et al. [44] proposed an unsupervised incremental learning algorithm, realized by transferring pedestrians' spatio-temporal patterns to the target domain.
A multi-task transfer learning method is used in our work. In this way, not only the low-level information from the source domain but also the shared information of auxiliary tasks can be used. A host of experiments demonstrate that multi-task transfer learning can improve the classification accuracy of FBP and alleviate over-fitting to some extent.

III. PROPOSED METHOD

A. 2M BEAUTYNET ARCHITECTURE
Most existing MTL network structures use multiple labels of one image to jointly train the network with a single input and multiple outputs; at this stage, the basic multi-task network usually has a single input. In contrast, we combine the related tasks of FBP to construct a network structure with multiple inputs, for two reasons. On the one hand, we study facial beauty prediction on LSFBD, which carries only a beauty degree label, so the commonly used single-input multi-output networks are not suitable for our multi-task research. On the other hand, training a deep network with few facial beauty samples is prone to over-fitting. With a multi-input multi-task structure, the input data of related tasks can make up for the lack of training samples to some extent. Fig. 2 illustrates the architecture of 2M BeautyNet, and the parameter settings are shown in Table 1. Our model consists of multi-input layers, shared layers and task-specific layers. The multi-input layers include the first block of VGG16 (2 convolution layers) and a Max Feature Map (MFM) layer; they can reuse existing CNN parameters and learn tasks from different databases. MFM reduces the dimension of each branch and recombines the feature maps for retraining. The shared layers are composed of the remaining blocks of VGG16, and the task-specific layers consist of Global Average Pooling (GAP) layers and two classifiers. The inputs of the network are RGB images from SCUT-FBP5500 and LSFBD. Firstly, the images of Input_0 and Input_1 are fed to the multi-input layers for training respectively. Secondly, the resulting feature maps are fed into the shared layers for joint training. Finally, two classifiers are used for FBP and gender recognition. Here, Input_0 and Input_1 represent facial beauty prediction images and gender images, respectively; Task_beauty is the facial beauty prediction classifier, while Task_gender is the gender recognition classifier.
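To make the layout concrete, below is a minimal Keras sketch of this structure. The layer names (e.g., beauty_block1_conv1, Task_beauty), filter sizes and block configuration are our illustrative assumptions following the standard VGG16 layout, not the exact settings of Table 1.

```python
# A minimal sketch of 2M BeautyNet, assuming VGG16-style blocks;
# names and sizes are illustrative, not the exact values of Table 1.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_2m_beautynet(input_shape=(128, 128, 3),
                       n_beauty_classes=5, n_gender_classes=2):
    in_beauty = layers.Input(input_shape, name="Input_0")
    in_gender = layers.Input(input_shape, name="Input_1")

    def first_block(x, prefix):
        # Task-specific copy of VGG16 Block1 (2 conv layers, 64 filters).
        for i in (1, 2):
            x = layers.Conv2D(64, 3, padding="same", activation="relu",
                              name=f"{prefix}_block1_conv{i}")(x)
        return x

    b0 = first_block(in_beauty, "beauty")
    b1 = first_block(in_gender, "gender")

    # Concatenate the two 64-channel branches (-> 128 channels), then
    # MFM 1/2 halves the channels so VGG16 Block2 weights still fit.
    merged = layers.Concatenate(axis=-1)([b0, b1])
    mfm = layers.Lambda(lambda t: tf.maximum(t[..., :64], t[..., 64:]),
                        name="mfm")(merged)

    # Shared layers: VGG16 Blocks 2-5, named to mirror VGG16 so that
    # pre-trained weights can later be copied in by layer name.
    x = layers.MaxPooling2D(name="block1_pool")(mfm)
    for block, (filters, convs) in {2: (128, 2), 3: (256, 3),
                                    4: (512, 3), 5: (512, 3)}.items():
        for i in range(1, convs + 1):
            x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                              name=f"block{block}_conv{i}")(x)
        x = layers.MaxPooling2D(name=f"block{block}_pool")(x)

    # Task-specific layers: GAP plus one softmax classifier per task.
    x = layers.GlobalAveragePooling2D(name="gap")(x)
    out_beauty = layers.Dense(n_beauty_classes, activation="softmax",
                              name="Task_beauty")(x)
    out_gender = layers.Dense(n_gender_classes, activation="softmax",
                              name="Task_gender")(x)
    return Model([in_beauty, in_gender], [out_beauty, out_gender])

model = build_2m_beautynet()
```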

B. MULTI-INPUT MULTI-OUTPUT NETWORK
In this section, we introduce two basic multi-task networks based on hard sharing and soft sharing. Fig. 3(a) is a simple parameter hard-shared multi-task network, and a parameter soft-shared network is shown in Fig. 3(b). Based on the parameter hard-shared multi-task network, we designed a multi-input multi-task network for tasks from different databases, ensuring that parameters can be transferred while multiple tasks are trained. Fig. 4 shows the multi-input multi-task network. All tasks of a parameter hard-shared network share the bottom layers. The shared part learns a shared representation of multiple tasks, which has strong abstraction capability.
It is adaptable to several different but related target tasks, often enabling better generalization of the main task in MTL. In contrast, as shown in Fig. 3(b), each task of a parameter soft-shared model has its own underlying layers. All or some parameters can be shared by designing a shared structure between the underlying layers. Based on the parameter hard-shared model, this paper designs 2M BeautyNet in combination with related tasks from different databases.
Research has shown that transfer learning can effectively solve the problem of insufficient training samples [9], [24]. For multi-task transfer learning, we transferred the trained parameters of VGG16 on the ImageNet database to different layers of our multi-input multi-task network for retraining; the details are described in Section IV-C. Next, we illustrate why we designed a multi-input multi-task network.
Since the output size of each transferred convolution layer is fixed, parameter transfer can only be performed if the input data dimension matches the transferred layer. However, in the case of multiple inputs, the dimensions of the input data usually cannot be guaranteed to match. To solve this problem, we design a multi-input multi-task network that contains a multi-input layer. We use VGG16 as the basic transfer network, whose input shape is (128, 128, 3). If the two input images of beauty and gender were directly concatenated, the input shape of the first convolution layer would become (128, 128, 6), and the trained parameters could not be transferred. Therefore, we design a multi-input layer with two branches. The output feature map shape of each branch after its convolution layers is (128, 128, 64). To continue transferring the parameters of the subsequent VGG16 layers, the data fed to the next convolution layer must have shape (128, 128, 64), but directly concatenating the two branches yields shape (128, 128, 128). Therefore, we reduce the dimension through the MFM activation function instead of a 1 × 1 convolution kernel. The first reason is that the ReLU function causes the collapse of low-dimensional data: the probability that a low-dimensional feature distribution lies on the ReLU activation band is small, and the information is damaged severely after the feature map passes through a ReLU layer. Hence we use MFM to process low-dimensional feature maps. The MFM activation function suppresses low-activation neurons in each layer through competitive relationships and has the advantages of compact features and reduced parameters. If a convolution kernel were used instead, not only would the number of model parameters increase, but subsequent parameter transfer would also be hindered. Another reason is that MFM adopts a split-aggregation method, which achieves compact feature representation, variable selection, dimensionality reduction and a sparse gradient. Assume the output of the convolution layer is $C \in \mathbb{R}^{h \times w \times 2n}$ and divide it along the channel axis into two halves. The MFM 1/2 activation function is expressed as

$$\hat{y}_{ij}^{k} = \max\left(C_{ij}^{k},\ C_{ij}^{k+n}\right), \quad 1 \le k \le n, \tag{1}$$

where the input convolution layer has $2n$ channels, $1 \le i \le h$ and $1 \le j \le w$. The gradient of (1) takes the following form:

$$\frac{\partial \hat{y}_{ij}^{k}}{\partial C_{ij}^{k'}} = \begin{cases} 1, & \text{if } C_{ij}^{k'} = \max\left(C_{ij}^{k},\ C_{ij}^{k+n}\right), \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $1 \le k' \le 2n$. Usually, the activation ratio of MFM 1/2 is 50%. A similar network structure can be used if the network input comes from three different databases: we just change the activation function to MFM 2/3. Some studies have proved that the activation performance of MFM 2/3 is better than that of MFM 1/2 [10], [45].
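As a small worked illustration of Eq. (1), the following sketch applies MFM 1/2 to a concatenated feature map; it assumes channels-last tensors.

```python
# A small sketch of MFM 1/2 from Eq. (1): split the 2n channels into
# two halves and take the element-wise maximum, halving the channels.
import tensorflow as tf

def mfm_half(x):
    """MFM 1/2: (..., h, w, 2n) -> (..., h, w, n)."""
    n = x.shape[-1] // 2
    return tf.maximum(x[..., :n], x[..., n:])

# A (1, 128, 128, 128) concatenated feature map becomes (1, 128, 128, 64),
# matching the input shape expected by the transferred VGG16 Block2.
x = tf.random.normal([1, 128, 128, 128])
print(mfm_half(x).shape)  # (1, 128, 128, 64)
```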

C. AUTOMATIC LEARNING OF MULTI-TASK LOSS WEIGHTS
In MTL, joint optimization between multiple tasks may make a simple single task converge prematurely due to data imbalance and differing task difficulty. If we add the multiple losses directly, one task tends to do well while the other tasks do badly. To solve this problem, Alex et al. [10] proposed obtaining the weights of multiple losses by modeling uncertainty. The effectiveness of the method was verified on multiple regression tasks, but not on classification tasks. We verify this method on two classification tasks.
We assume that the multi-task prediction error obeys a Gaussian distribution with mean 0 and variance $\sigma^{2}$. Based on maximizing a Gaussian likelihood with homoscedastic uncertainty, the multi-task loss function is adjusted. Let $f^{W}(x)$ be the output of a neural network with weights $W$ on input $x$. For a classification task, we usually pass the model output through a softmax function, and the final output is expressed as

$$p\left(y \mid f^{W}(x), \sigma\right) = \operatorname{Softmax}\left(\frac{1}{\sigma^{2}} f^{W}(x)\right), \tag{3}$$

where $y$ represents the output of the model and $\sigma$ is a positive scalar. This scalar is either fixed or learned, and the softmax input is scaled by $1/\sigma^{2}$. The log-likelihood for this output can be written as

$$\log p\left(y = c \mid f^{W}(x), \sigma\right) = \frac{1}{\sigma^{2}} f_{c}^{W}(x) - \log \sum_{c'} \exp\left(\frac{1}{\sigma^{2}} f_{c'}^{W}(x)\right), \tag{4}$$

where $f_{c}^{W}(x)$ is the $c$-th element of $f^{W}(x)$. In the case of multiple outputs, we assume that the tasks are independent, so the joint likelihood can be expressed as

$$p\left(y_{1}, \ldots, y_{j} \mid f^{W}(x)\right) = p\left(y_{1} \mid f^{W}(x)\right) \cdots p\left(y_{j} \mid f^{W}(x)\right), \tag{5}$$

where $y_{1}, \ldots, y_{j}$ means that the model has $j$ outputs, i.e., there are $j$ tasks. In this paper, FBP is the main task and gender recognition is the auxiliary. Taking the negative log of the joint likelihood of the two tasks, we obtain the minimization objective of our multi-output model:

$$\mathcal{L}\left(W, \sigma_{1}, \sigma_{2}\right) \approx \frac{1}{\sigma_{1}^{2}} \mathcal{L}_{1}(W) + \frac{1}{\sigma_{2}^{2}} \mathcal{L}_{2}(W) + \log \sigma_{1} + \log \sigma_{2}, \tag{6}$$

where $\mathcal{L}_{1}(W)$ and $\mathcal{L}_{2}(W)$ are the cross-entropy losses of the two tasks. The weight of each task is $1/\sigma_{1}^{2}$ and $1/\sigma_{2}^{2}$, respectively: the larger $\sigma$ is, the smaller the contribution of the corresponding task to the training process.
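To show how Eq. (6) can be trained in practice, here is a hedged Keras sketch of the weighting scheme of [10]. For numerical stability it learns $s_i = \log \sigma_i^{2}$ rather than $\sigma_i$ directly, a common reformulation; the constant factor on the regularizer follows the classification approximation in [10] only roughly, so treat it as an assumption.

```python
# A hedged sketch of the uncertainty-weighted loss of Eq. (6), after
# Kendall et al. [10]. We learn s_i = log(sigma_i^2); exp(-s_i) then
# plays the role of the task weight 1/sigma_i^2.
import tensorflow as tf

class UncertaintyWeightedLoss(tf.keras.layers.Layer):
    def __init__(self, n_tasks=2, **kwargs):
        super().__init__(**kwargs)
        # One trainable log-variance per task, initialized to 0 (sigma = 1).
        self.log_vars = self.add_weight(name="log_vars", shape=(n_tasks,),
                                        initializer="zeros", trainable=True)

    def call(self, losses):
        total = 0.0
        for i, task_loss in enumerate(losses):
            precision = tf.exp(-self.log_vars[i])  # 1 / sigma_i^2
            # Weighted task loss plus the log(sigma) regularizer that
            # keeps the learned weights from collapsing to zero.
            total += precision * task_loss + 0.5 * self.log_vars[i]
        return total

# Usage sketch (variable names are assumptions): compute per-task
# cross-entropies, combine them, and register the result via add_loss.
# weighted = UncertaintyWeightedLoss(n_tasks=2)([beauty_loss, gender_loss])
# train_model.add_loss(weighted)
```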

D. RANDOM FOREST CLASSIFIER
Leo Breiman proposed a classifier called random forest [46]. Its principle is to use an ensemble of decision trees to train on and predict samples. It has the advantage of giving an importance score for each variable, evaluating the role each variable plays in the classification. Note that bagging is an important idea in random forests, implemented mainly through bootstrap sampling. That is to say, each decision tree in a random forest is a classifier, and n trees give n classification results. The final result is obtained by considering all the classifiers: all trees vote, and the output with the most votes is selected. The implementation process of a random forest can be summarized as follows:
Step 1: Suppose the size of the original training set is S. We randomly draw k bootstrap sample sets from S and construct k classification trees. The samples not drawn each time constitute k out-of-bag data sets.
Step 2: Assuming that the feature dimension of each sample is M, specify a constant m satisfying m ≪ M. Then an m-dimensional feature subset is randomly selected from the M-dimensional features, and each time a tree is split, the best feature is selected from the m-feature subset.
Step 3: Every tree grows to its fullest extent, with no pruning.
Step 4: The generated classification trees form a random forest, which can be used to recognize and classify new samples.
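As a concrete illustration, the sketch below hands the trained network's shared features to scikit-learn's RandomForestClassifier in place of the softmax head. The variable names (model, x_beauty_train, y_beauty_train, etc.) and the "gap" layer name are assumptions carried over from our architecture sketch above, not the paper's exact implementation.

```python
# A minimal sketch of replacing the softmax head with a random forest.
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import Model

# Reuse the trained multi-task network up to the shared GAP layer.
feature_extractor = Model(model.inputs, model.get_layer("gap").output)

train_feats = feature_extractor.predict([x_beauty_train, x_gender_train])
test_feats = feature_extractor.predict([x_beauty_test, x_gender_test])

# Bootstrap-sampled trees, sqrt(M) candidate features per split,
# no pruning -- matching Steps 1-3 above.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(train_feats, y_beauty_train)

print("Out-of-bag accuracy:", rf.oob_score_)
print("Test accuracy:", rf.score(test_feats, y_beauty_test))
```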

IV. EXPERIMENTS AND ANALYSIS
We implement our method with Keras on an Ubuntu server with a GTX 1080 GPU, an i3-7350K CPU, and 48 GB of memory. In this paper, we use different classifiers from the scikit-learn toolkit and achieve the best classification by adjusting some of the parameters of each function. Because the 2M BeautyNet structure is built on VGG16, all single-task experiments are based on VGG16, while the multi-task experiments are based on 2M BeautyNet.

2) SCUT-FBP5500
The SCUT-FBP5500 database contains 5,500 high-resolution frontal face images of different races, genders and ages. Each image is labeled with a beauty score ranging from 1 to 5; the larger the score, the more attractive the face. These scores were assessed by 60 volunteers. The facial images of males and females, Asians and Caucasians, are named in different ways, so we can parse the name of each image to obtain a gender label, with "0" for males and "1" for females. The face image quality of this database is excellent, which is beneficial to multi-task shared feature learning, so we take this database as the input data for gender recognition.

3) CELEBA
The Large-scale CelebFaces Attributes (CelebA) dataset [47] is a large-scale face attributes dataset with 40 attribute annotations. We use the skin labels, with "0" for white and "1" for black. Class "0" contains 400 male and female images, and class "1" contains 7,017 images. We calculated the distribution of gender and skin labels; the distribution between categories is shown in Fig. 7.

B. DATA PREPROCESSING AND AUGMENTATION
For the training data, we use several data augmentation methods to enlarge the training set. We mainly use three methods to extend the two databases: random cropping, random left-right flipping, and random rotation from −45° to 25°. Since the SCUT-FBP5500 database has fewer samples than LSFBD, we apply more types of data augmentation to it. To perform multi-task training normally, the training and validation samples of the two databases must be of the same order of magnitude. In our experiments, we increased the number of samples in both training sets to 36,000. The validation set is not augmented, and its number of samples is 996. Our network works only on faces, so we use Multi-task Cascaded Convolutional Networks (MTCNN) [48] to detect the face in each image. Some examples are shown in Fig. 8.
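A hedged sketch of this preprocessing pipeline follows; the exact crop and rotation parameters of the paper may differ, and the mtcnn package is only one of several MTCNN implementations.

```python
# A sketch of the augmentation and face-detection steps above; exact
# parameter values are assumptions, not the paper's settings.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from mtcnn import MTCNN

augmenter = ImageDataGenerator(
    rotation_range=45,       # random rotation (paper: -45 to 25 degrees)
    horizontal_flip=True,    # random left-right flip
    width_shift_range=0.1,   # random shifts approximate random cropping
    height_shift_range=0.1,
)

detector = MTCNN()

def crop_face(rgb_image):
    """Crop the first detected face; fall back to the full image."""
    faces = detector.detect_faces(rgb_image)  # [{'box': [x, y, w, h], ...}]
    if not faces:
        return rgb_image
    x, y, w, h = faces[0]["box"]
    return rgb_image[max(y, 0):y + h, max(x, 0):x + w]
```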

C. IMPLEMENTATION DETAIL
Our experiments combine two pairs of tasks: facial beauty prediction with gender recognition, and gender recognition with skin classification. We use LSFBD for facial beauty prediction, SCUT-FBP5500 for gender recognition and CelebA for skin classification. Our 2M BeautyNet is implemented via Keras. During training, the initial weights of the network are transferred from the weights of VGG16 pre-trained on the ImageNet database. The Block1 weights of the two branches in the single-task layers are the Block1 weights of VGG16. Then, after the MFM activation function, concatenation layer and convolution operations, the feature maps of the tasks are jointly trained. The convolutional layer weights of Block2-Block5 correspond to those of VGG16. In all of our tables, TL represents transfer learning and CA represents classification accuracy.
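A hedged sketch of this weight transfer follows, reusing the layer names from the architecture sketch in Section III-A (those names are our assumptions, chosen to mirror VGG16's own layer names):

```python
# A sketch of the transfer step above: VGG16 Block1 weights go to both
# input branches, Block2-5 weights to the shared layers.
from tensorflow.keras.applications import VGG16

vgg = VGG16(weights="imagenet", include_top=False,
            input_shape=(128, 128, 3))

# Block1 conv weights -> both task-specific branches.
for prefix in ("beauty", "gender"):
    for i in (1, 2):
        model.get_layer(f"{prefix}_block1_conv{i}").set_weights(
            vgg.get_layer(f"block1_conv{i}").get_weights())

# Block2-Block5 conv weights -> the shared layers (same names).
for layer in vgg.layers:
    if "conv" in layer.name and not layer.name.startswith("block1"):
        model.get_layer(layer.name).set_weights(layer.get_weights())
```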
In the experiments, we use the Adam optimizer to train the network for 50 epochs with a batch size of 64, an initial learning rate of 0.001, and momentum parameters of (0.9, 0.999). Besides, we use a Keras callback to monitor the validation loss of the FBP task: when the validation loss does not decrease within two epochs, the current learning rate is multiplied by 0.1. Finally, we use the scikit-learn toolkit to evaluate the performance of different classifiers.
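Put together, this training setup might look like the following sketch; the x_*/y_* arrays are assumed placeholders, and the loss names match the output layers of our earlier sketch.

```python
# A sketch of the training configuration above; data arrays are
# assumed placeholders.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

model.compile(
    optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss={"Task_beauty": "categorical_crossentropy",
          "Task_gender": "categorical_crossentropy"},
    metrics=["accuracy"])

# Multiply the learning rate by 0.1 when the FBP validation loss has
# not decreased for two epochs.
reduce_lr = ReduceLROnPlateau(monitor="val_Task_beauty_loss",
                              factor=0.1, patience=2)

model.fit([x_beauty_train, x_gender_train],
          [y_beauty_train, y_gender_train],
          validation_data=([x_beauty_val, x_gender_val],
                           [y_beauty_val, y_gender_val]),
          epochs=50, batch_size=64, callbacks=[reduce_lr])
```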

D. COMPARISON BETWEEN SINGLE TASK AND MULTIPLE TASKS
Many studies have shown that MTL outperforms STL. On the one hand, the multi-task network structure contains shared layers for multiple tasks. By learning a shared representation of multiple tasks, the network gains strong abstraction ability, and the main task can achieve better generalization when trained together with several different but related target tasks. On the other hand, a specific independent layer structure is designed for each task, which learns to use the shared representation to improve the performance of that specific task. In this paper, gender recognition is regarded as an auxiliary task for FBP, and Table 2 compares the performance of single-task and multi-task training with the multi-task loss weights automatic learning strategy and the softmax classifier. Meanwhile, we also compare the classification accuracy obtained with transfer learning. Furthermore, to illustrate the effectiveness of the proposed method, we conducted another experiment jointly training gender recognition and skin classification. The classification accuracies of the FBP, gender recognition and skin classification tasks when trained alone are shown in Table 2. In addition, we also show the classification accuracy of multi-task training in two experiments: FBP with gender recognition, and gender recognition with skin classification.
In the experiments, the network framework of STL uses VGG16, while the network framework of MTL adopts our proposed 2M BeautyNet. Whether for single-task or multi-task training, the VGG16 parameters trained on ImageNet are transferred to the corresponding layers of our network. From Table 2, we can see that the classification accuracy of STL is slightly worse than that of MTL, and transferring ImageNet weights is better than training from scratch. To evaluate the generalization performance of our 2M BeautyNet, we change the task pair from FBP and gender recognition to skin detection and gender recognition. Experiments show that our method is better than single-task training; the classification accuracy of facial beauty prediction reaches 66.82%.

E. THE IMPACT OF MULTI-TASK LOSS AUTOMATIC LEARNING STRATEGY
In MTL, joint optimization between multiple tasks may make a simple single task converge prematurely due to data imbalance and differing task difficulty. If we add multiple task losses directly, one task tends to do well while the others do badly. We compared the effects of different loss weightings on multi-task performance, namely manually set fixed values versus automatically learned values. Multi-task loss weight automatic learning derives a reasonable multi-task loss function that learns to balance the various losses. The impact of the multi-task loss automatic learning strategy on task performance is shown in Table 3.
If the experiment does not adopt the multi-task loss weights automatic learning strategy, the default task weight ratio is 1:1. Our experiments show that if the losses of multiple tasks are directly added and optimized, the model converges toward the task with fewer classification categories. For example, if five-class FBP and two-class gender recognition are jointly optimized, the model converges on gender recognition; if five-class FBP and seven-class emotion recognition are jointly optimized, the model converges on FBP. However, if the multi-task loss automatic learning strategy is adopted, both tasks converge steadily. As can be seen from Table 3, the FBP classification accuracy with the multi-task loss automatic learning strategy increases by 2%-3% compared with the fixed 1:1 ratio. For FBP, the best classification accuracy is 68.23%.

F. THE IMPACT OF RANDOM FOREST CLASSIFIER
Existing FBP methods usually regard a deep neural network as a feature extractor that extracts features from the final output layer and then uses softmax for classification. A random forest, by contrast, is a supervised ensemble learning model that aggregates multiple machine learning models to improve overall performance. It has the advantages of being parallelizable and reducing over-fitting. Thus, we also compare the influence of different classifiers on the classification accuracy of FBP. The FBP classification accuracies of the softmax classifier and the random forest classifier are shown in Table 4.
In the experiment, we first train multiple tasks using 2M BeautyNet with the softmax classifier, so that feature representations with higher accuracy can be extracted, and finally softmax is replaced by a random forest for classification. As shown in Table 4, replacing the softmax classifier with a random forest increases the classification accuracy by about 1% to 2%.

G. COMPARISON OF DIFFERENT METHODS
In order to verify the validity of 2M BeautyNet, we compare its performance with other existing algorithms on LSFBD, including one traditional method and four basic CNN models. The detailed comparison is shown in Table 6. The first part shows the classification accuracy of the traditional FBP method on LSFBD [9]. The second part compares the classification accuracy of different deep learning networks, including NIN [49], GoogleNet [50], VGG16 [51] and BeautyNet [10]. As we can see, the classification accuracy of the deep learning methods is higher than that of the traditional method. At the same time, we also compare the performance of these four networks after transfer, where the accuracy of facial beauty classification generally increases by 2%-5%. The above experiments all concern the FBP single task, and the highest accuracy is 67.48%. The last part compares the accuracy of 2M BeautyNet combined with different strategies, including transfer learning, the multi-task loss automatic learning strategy, and the random forest. In summary, the classification accuracy of our method is better than the others, and the highest classification accuracy is 68.23%. In addition, we verified our proposed network on SCUT-FBP5500. As shown in Table 5, our network achieves similar results there, while being based on VGG16 with relatively few parameters.

V. CONCLUSION
In this work, we have proposed an effective multi-input multi-task network, 2M BeautyNet, to jointly learn two tasks: facial beauty prediction and gender recognition. This network can properly transfer pre-trained network parameters when receiving inputs from different databases simultaneously. Different from single-task network structures in FBP, 2M BeautyNet combines auxiliary task information from other databases related to FBP to improve the performance of the main task. Besides, transfer learning is combined to extract deep shared features of faces. Multi-task transfer learning can enrich the low-level features of the main task and improve classification accuracy. We then utilize the multi-task loss automatic learning strategy to avoid the phenomenon in which one task dominates the entire loss while the other tasks cannot influence the learning of the multi-task shared layers. Finally, we combine a traditional method with the deep learning method: after multi-task training, the softmax classifier is replaced by a random forest for classification. Extensive experiments on LSFBD and SCUT-FBP5500 show that our method achieves superior accuracy over the other methods in the facial beauty prediction task, up to 68.23%.
In the future, we will find tasks more relevant to FBP for multi-task training by measuring the correlation between tasks. Considering how to design a more versatile and effective multi-input multi-task network, and how to combine local information and the other factors that influence facial beauty in multi-task learning, we believe this area will advance the progress of esthetic surgery planning, cosmetic recommendation, facial beautification, etc.