A New Deep Neural Architecture Search Pipeline for Face Recognition



I. INTRODUCTION
Face recognition is extensively utilized in many fields, such as video surveillance, public security, face payment, and smart homes. The critical problem in face recognition is how to acquire facial features accurately. Face recognition methods can be divided into two categories: those based on shallow features such as SIFT [1], LBP [2], and HOG [3], and those based on deep convolutional neural networks (DCNNs) [4]. The main advantage of deep learning is that a large number of training samples can be used to learn robust face representations across different face recognition datasets. Such methods do not need hand-designed features for each type of intra-class variation (such as illumination, pose, facial expression, and age); instead, deep convolutional features are extracted directly. In the training process, bottleneck features representing faces are extracted by a DCNN. Other techniques (such as joint Bayesian modeling or different loss functions) are then used to fine-tune the CNN models. CNN architectures for face recognition have been inspired by the conventional architectures of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [5]. ResNet [6] has become the preferred choice for many recognition tasks, including face recognition, while MobileNet [7] is favored by many terminals and customers on lightweight devices with limited computing capability. Consequently, the selection of loss functions to train these two network families has become one of the most active research directions in face recognition. Previous work in this field has mainly followed two directions. The first converts the loss function to improve recognition accuracy in traditional deep convolutional neural networks (such as ResNet and DenseNet).
The second combines the latest loss functions with lightweight networks (such as MobileNet and ProxylessNet) to reduce the size of the network at a minimal cost in accuracy. The development of loss functions in advanced face recognition algorithms mainly follows two ideas:
• Metric Learning: Contrastive Loss, Triplet Loss [8], and related sampling methods;
• Margin-Based Classification: Softmax with Center Loss [9], SphereFace [10], Soft-Margin Loss [11], AM-Softmax (CosFace) [12], and ArcFace [13].
NAS [14] has made remarkable progress in the past few years. Computationally intensive NAS algorithms based on reinforcement learning or evolutionary algorithms have proved able to produce models that surpass traditional networks designed by humans [15]. Other methods, such as DARTS [16], SNAS [17], ProxylessNAS [18], MiLeNAS [19], and SGAS [20], have also developed significantly. Although these methods accelerate the search process by reducing the search space and changing the search strategy, their accuracy is relatively lower. Most NAS methods have been evaluated on image classification tasks such as CIFAR-10 and ImageNet, on which NAS algorithms achieve prominent performance despite their exorbitant computational cost [21].
In this paper, we propose a novel method that applies NAS to face recognition: the network structure for the recognition task is customized through a reinforcement learning algorithm. To account for the influence of network size, we incorporate inference latency into the reinforcement learning reward during the neural architecture search. The goal of this work is to examine whether NAS can automatically design better network structures for face recognition.
To prove the effectiveness of this method, we carry out extensive experiments. With the neural architecture searched by NAS, we achieve 98.77% accuracy on the large-scale MS-Celeb-1M [22] dataset and 99.89% accuracy on the classical paired LFW [23] dataset, both competitive with the state of the art. Besides its accuracy, the algorithm also performs remarkably in terms of network size: the largest of all the searched networks has only 19.1M parameters.
In Section II, we introduce the evolution of loss functions in face recognition and give an overview of neural architecture search. We describe the details of our NAS algorithm, which is based on reinforcement learning, in Section III. In Section IV, we conduct extensive experiments to show that NAS is superior to traditional face recognition algorithms. Finally, we summarize this paper in Section V and provide some guidance for future research.

II. RELATED WORKS
The loss functions play an essential role in the gradient updating of neural networks. A proper loss function can significantly improve the efficiency of the neural network and achieve better results. At present, the loss function in face recognition has been substantially developed.
NAS can be divided into many types from two dimensions: search space and search strategy. At the same time, the advantages and disadvantages of NAS are evaluated in terms of acceleration strategy, calculation consumption, parameter numbers, and inference latency. We provide a comprehensive classification of NAS in the following paragraphs.

A. EVOLUTION OF LOSS FUNCTION
For face recognition tasks, we present the four most representative loss functions, namely Cross-Entropy Loss, Angular-Softmax Loss, Additive Margin Softmax Loss, and ArcFace Loss.

1) CROSS-ENTROPY LOSS
The cross-entropy loss is one of the most widely used loss functions in classification scenarios [4]. In face recognition tasks, the cross-entropy loss is an effective method [25] for suppressing outliers, and is expressed as

L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i+b_j}}

where x_i is the i-th training sample, N is the number of samples, and W_j and W_{y_i} are the j-th and y_i-th columns of W, respectively.
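As a minimal numerical sketch (function and variable names are our own, not from the paper), the softmax cross-entropy loss above can be computed as:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy loss over a batch.

    logits: (N, C) raw class scores W_j^T x_i + b_j; labels: (N,) indices y_i.
    """
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    # Average the negative log-probability of each sample's true class.
    return -log_probs[np.arange(n), labels].mean()

logits = np.array([[2.0, 0.5, 0.1], [0.2, 3.0, 0.3]])
labels = np.array([0, 1])
loss = softmax_cross_entropy(logits, labels)
```

When the logits agree with the labels, as here, the loss is small; mislabeling the same logits raises it.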

2) ANGULAR-SOFTMAX LOSS
The Angular-Softmax (A-Softmax) loss builds on the softmax loss. It enables the CNN to learn angularly discriminative features and is one of several modifications of the softmax function that introduce margin-based learning. It is described as

L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\psi(\theta_{y_i,i})}}{e^{\|x_i\|\psi(\theta_{y_i,i})}+\sum_{j\neq y_i}e^{\|x_i\|\cos\theta_{j,i}}}

where \psi(\theta_{y_i,i})=(-1)^{k}\cos(m\theta_{y_i,i})-2k for \theta_{y_i,i}\in[k\pi/m,(k+1)\pi/m], k\in[0,m-1], and m ≥ 1 is an integer that controls the size of the angular margin.

3) ADDITIVE MARGIN SOFTMAX LOSS
Building on the Angular-Softmax loss, a general function is added to the softmax loss to obtain the large-margin property. The loss function is defined as

L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}

where \cos\theta_{y_i}-m serves as an additive margin for the softmax loss, and the hyperparameter s is used to scale up the cosine values.

4) ArcFace LOSS
Given the above loss functions, a new margin cos(θ + m) is proposed, which has the best geometric interpretation. The ArcFace loss with an additive angular margin is formulated as

L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}

where s is the radius of the hypersphere, m is the additive angular margin penalty between x_i and W_{y_i}, and cos(θ_{y_i} + m) makes the class separation more stringent.
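To make the margin variants concrete, the following sketch (function names, default s and m, and the normalisation details are illustrative assumptions, not the papers' reference code) modifies the target logit cos θ_{y_i} into cos θ_{y_i} − m (AM-Softmax/CosFace) or cos(θ_{y_i} + m) (ArcFace) before scaling:

```python
import numpy as np

def margin_logits(embeddings, weights, labels, s=64.0, m=0.5, kind="arcface"):
    """Scaled margin logits for AM-Softmax (cos θ − m) or ArcFace (cos(θ + m)).

    embeddings: (N, d) features x_i; weights: (C, d) class centers W_j.
    Both are L2-normalised so their dot product equals cos θ.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(x @ w.T, -1.0, 1.0)            # (N, C) cosine similarities
    idx = np.arange(len(labels))
    target = cos[idx, labels]
    if kind == "cosface":                        # additive cosine margin
        target = target - m
    else:                                        # additive angular margin
        target = np.cos(np.arccos(target) + m)
    out = cos.copy()
    out[idx, labels] = target                    # only the true-class logit shrinks
    return s * out                               # feed these logits to softmax CE

emb = np.array([[1.0, 0.0]])
w = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0])
arc = margin_logits(emb, w, labels, s=1.0, m=0.5, kind="arcface")
cosface = margin_logits(emb, w, labels, s=1.0, m=0.5, kind="cosface")
```

Penalising only the target logit forces the network to place samples well inside their class's angular region before the loss becomes small.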

B. NEURAL ARCHITECTURE SEARCH
The NAS search space defines the network architecture to ensure the rationality and quality of the sampling model in the search process. For some specific tasks, introducing prior knowledge can reduce the search space, but to a certain extent, it also restricts the generation of the network beyond previous experience. The backbone structure of the NAS network includes chain architecture space, multi-branch architecture space, and block-based search space. Its operator space contains convolution, pooling, residual connection, and other topological structures.
The search strategy of NAS defines what algorithm can quickly and accurately find the optimal configurations of network architecture. According to different search strategies, NAS can be divided into the following categories.
• Other search algorithms: SMASH [31], PNAS [32], Auto-Keras [33], Graph HyperNetworks [34], ProxylessNAS, and Efficient Multi-Scale Architectures [35].

Since NAS requires a great deal of training time and computing capability, the commonly used acceleration strategies are parameter sharing, network morphism, and network pruning. Across these NAS algorithms, different tasks and scenarios call for different performance evaluation indicators; those used to evaluate NAS include test accuracy, number of parameters, inference latency, and memory utilization.

III. PROPOSED METHOD
We search the MS-Celeb-1M dataset to find network structures and rank them according to the accuracy of network validation. Based on the hypothesis that the accuracy ranking of subnetworks is consistent with that of fixed networks, the top three networks are taken out and trained from scratch. Finally, we test the accuracy of the system on the LFW dataset.

A. FACE DATASETS
1) TRAINING DATASETS
Among the numerous face datasets, the most representative training sets are CASIA-WebFace [36] and MS-Celeb-1M, while the most commonly used test set is LFW (Labeled Faces in the Wild). CASIA-WebFace contains 494,414 face images belonging to 10,575 different individuals. The official MS-Celeb-1M dataset consists of 10M pictures of 100k face identities. Since the original dataset contains much noise, we use the high-quality subset of MS-Celeb-1M refined by the ArcFace authors, which is cleaned according to the distance from each image to its class center. This subset consists of 5.8 million images of 85k identities, with dozens to hundreds of faces per identity. Compared with CASIA-WebFace, MS-Celeb-1M provides far more images per label, so we employ this most generalized and challenging face recognition dataset as the training set for NAS.

2) TESTING DATASET
The LFW dataset was established to study face recognition in unconstrained environments. It contains more than 13,000 face images, all collected from the Internet rather than a laboratory environment. Each face is labeled with the person's name, and about 1,680 people have two or more images. Following the standard LFW evaluation protocol, we measure verification accuracy on 6,000 face pairs. This dataset is widely used to evaluate face verification algorithms, so we use LFW to validate the performance of the network architectures retrieved by NAS.

3) DATA PREPROCESSING
Following standard face preprocessing, we use MTCNN [37] to detect five facial landmarks (both eye centers, the nose tip, and both mouth corners) and apply a similarity transformation to align the face images. Every pixel is normalized by subtracting the per-channel mean and dividing by the per-channel standard deviation. All pictures are then randomly flipped, padded to 120 × 120, and randomly cropped to 112 × 112, and the order of all data is shuffled. The above preprocessing is applied to both the MS-Celeb-1M and LFW datasets. In addition, we regularize the MS-Celeb-1M training set with 16 × 16 Cutout patches to improve the robustness of the network. We divide the MS-Celeb-1M dataset into training, validation, and test sets in the proportion 8:1:1.
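The augmentation pipeline above can be sketched as follows; this is an assumed NumPy approximation (the seed, epsilon, and function name are ours, and MTCNN alignment is taken as already done), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(img, train=True):
    """img: (112, 112, 3) uint8 face crop, assumed already MTCNN-aligned."""
    x = img.astype(np.float32)
    # Per-channel standardisation: subtract the mean, divide by the std.
    x = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-7)
    if train:
        # Zero-pad to 120 x 120, then take a random 112 x 112 crop.
        x = np.pad(x, ((4, 4), (4, 4), (0, 0)))
        top, left = rng.integers(0, 9, size=2)
        x = x[top:top + 112, left:left + 112]
        if rng.random() < 0.5:                    # random horizontal flip
            x = x[:, ::-1]
        # Cutout: zero out a random 16 x 16 patch for regularisation.
        cy, cx = rng.integers(0, 112 - 16, size=2)
        x[cy:cy + 16, cx:cx + 16] = 0.0
    return x

img = rng.integers(0, 256, size=(112, 112, 3)).astype(np.uint8)
out = preprocess(img, train=False)
```

At test time (train=False) only the standardisation is applied, matching the usual train/eval split of augmentations.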

B. TRAINING DETAILS
The NAS search space we construct consists of all possible directed acyclic graphs (DAGs) on L nodes. Fig.1 briefly illustrates the framework of our algorithm. In the network search stage, we use a reinforcement learning algorithm (policy gradient) to guide the controller network toward the optimal child architecture. The goal of reinforcement learning is to find an optimal behavior strategy for the agent that maximizes the reward. Policy gradient methods model and optimize the policy directly, optimizing parametrized policies with respect to the expected return by gradient ascent [38]. The controller network and child network are trained alternately. To improve search efficiency, we share the parameters of the child networks. The controller network consists of an LSTM with 100 hidden units; the main trainable parameters are the LSTM weights and the shared parameters of the child network.
The training process of NAS includes two interweaving stages. When training the child network, the generation strategy of the controller network is steadfast. At the same time, when training the controller network, the weight parameters of the child network are fixed.
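The controller's policy gradient (REINFORCE) update can be illustrated with a toy example reduced to a single categorical choice over candidate operations; the operation names, learning rate, and constant baseline below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
OPS = ["sep3x3", "sep5x5", "avgpool", "maxpool"]   # candidate ops (illustrative)
theta = np.zeros(len(OPS))                          # toy controller parameters

def policy():
    """Softmax distribution over the candidate operations."""
    p = np.exp(theta - theta.max())
    return p / p.sum()

# REINFORCE: theta += lr * (R - baseline) * d/dtheta log pi(a).
# Pretend the first op always earns the higher validation reward.
for _ in range(300):
    p = policy()
    a = rng.choice(len(OPS), p=p)
    reward = 1.0 if a == 0 else 0.0
    grad_logp = -p
    grad_logp[a] += 1.0          # gradient of log-softmax at the sampled op
    theta += 0.5 * (reward - 0.5) * grad_logp
```

After a few hundred updates the sampling probability concentrates on the rewarded operation, which is the mechanism by which high-reward child architectures become more likely to be sampled.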

1) TRAINING CHILD NETWORK
For each mini-batch of data, the LSTM adjusts its policy and samples a network architecture. Momentum SGD is used to optimize the parameters of the child network by minimizing the cross-entropy loss, the most common choice in classification problems. To reduce the size of the search space and the cost of training, we draw on the well-proven ENAS model but change its search space. Each node in the child network chooses among four operations: 3 × 3 or 5 × 5 separable convolution, average pooling, and max pooling. To save computation and reduce inference latency, we replace all regular convolutions with depthwise separable convolutions in the search space. The convolution blocks, defined as in Fig.2(a), connect in series a 1 × 1 convolution, an N × N convolution, batch normalization, and ReLU, where the kernel size N is chosen by NAS. The pooling block, shown in Fig.2(b), is comparatively straightforward: a 1 × 1 convolution is added before max or average pooling to limit the number of channels. Since parameters are shared, we fix the input and output shapes of each search block. This process runs for Q iterations until all input pictures have been traversed, which completes one epoch for the child network.
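The saving from replacing a regular convolution with a depthwise separable one can be checked by counting weights; the channel sizes below are arbitrary examples:

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def sep_conv_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) plus a pointwise 1 x 1."""
    return k * k * c_in + c_in * c_out

regular = conv_params(128, 128, 3)        # 3*3*128*128 = 147,456 weights
separable = sep_conv_params(128, 128, 3)  # 3*3*128 + 128*128 = 17,536 weights
```

For a 3 × 3 kernel at 128 channels the separable form uses roughly 8× fewer weights, which is why it dominates the search space here.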

2) TRAINING CONTROLLER NETWORK
After each epoch of child network training, we fix its weights and combine different operations and connections to find the optimal architecture. The reward signal updates the gradient of the controller network, with reinforcement learning playing the role of the optimizer. We evaluate the child network on each mini-batch of the validation set to measure its accuracy, which is substituted into the reinforcement learning reward. As shown in Eq.5, we use coefficients to control the trade-off between network accuracy and complexity. Each time the child network runs inference on the validation set, we obtain its accuracy and inference latency L(m). As expected, inference latency rises with increasing network complexity, so we add a target latency to the reward function. At the same time, applying arcsin in the multi-objective reward provides more positive feedback for the final small accuracy improvements, where q is the weight factor. This incentive mechanism greatly reduces the size of the network; the reward is designed to balance average validation accuracy against model complexity. To prevent over-fitting, the controller generates different architectures at every step. Finally, we summarize the whole NAS training procedure in Algorithm 1.
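Since the exact form of Eq.5 is not reproduced here, the following is only a plausible sketch of such a reward: arcsin-warped accuracy multiplied by a latency penalty relative to the target T, with q as the weight factor (the function name and default q are assumptions):

```python
import math

def reward(acc, latency, target_latency, q=0.07):
    """Multi-objective reward balancing accuracy and inference latency.

    A sketch consistent with the description in the text (arcsin-warped
    accuracy, latency penalised relative to a target T); the exact Eq.5
    of the paper may differ.
    """
    # arcsin steepens near acc = 1, so late, small accuracy gains still
    # produce a noticeable reward increase; exceeding the target latency
    # shrinks the reward multiplicatively.
    return math.asin(acc) * (target_latency / latency) ** q
```

With this shape, an accuracy gain from 0.98 to 0.99 moves the reward by more than the raw 0.01, while doubling the latency past the target reduces it.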

3) TRAINING FIXED NETWORK
Data preprocessing includes cropping, flipping, and Cutout, and the optimizer is momentum SGD. According to accuracy on the validation dataset, we sort the child networks and select the top three structures. All networks are trained from scratch, and their accuracy is verified on the test dataset. The procedure is exhibited in Algorithm 2.

IV. EXPERIMENTS
A. NETWORK SETTINGS
At the beginning of the network, we use a 3 × 3 convolution to extract features, followed by batch normalization. As shown in Fig.3(a), the searched skip operations are processed in sequence by a ReLU function, a 1 × 1 convolution, and batch normalization. These operations not only increase the nonlinearity of the neural network but also control the number of channels during transmission.

Algorithm 1 Training NAS With Cross-Entropy Loss
1: Input: the training data {x_i, y_i}_{i=1}^{n}, the optimizer, the number of nodes L, the controller training steps S, the number of sampled architectures Q, the target latency T
2: for each epoch do
3:   the child network updates its weights W with the selected architecture
4:   for each of the S controller training steps do
5:     the controller samples Q architectures and updates its parameters by policy gradient
6:   end for
7:   calculate accuracy on the whole test dataset
8: end for
9: Output: accuracy on the test dataset

We add a feature reduction module between the first three search blocks in the network. In these modules, we halve the height and width but double the number of channels. Each module splits the input into two paths. The first path passes through average pooling with a stride of 2 and then a 1 × 1 convolution that fixes the number of output channels. The second path is preprocessed: after padding the feature map by one pixel, a region of the original size offset toward the lower-right corner is taken as the new input; it then follows the same operations, an average pooling and a 1 × 1 convolution. We concatenate the two paths into a new feature reduction operator; the detailed schematic is exhibited in Fig.3(b). Because of parameter sharing, all skip operations need not only nonlinear processing but also feature reduction at the corresponding parallel positions. After the final search block, a DropBlock [39] layer and a fully connected layer follow. In the network architecture, each skip operation is accompanied by a nonlinear module.
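A shape-level sketch of the two-path feature reduction module described above; for simplicity it uses stride-2 sampling in place of the average pooling, and all names and sizes are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def one_by_one(x, w):
    """1 x 1 convolution: a per-pixel linear map over channels."""
    return x @ w                              # (H, W, C_in) @ (C_in, C_out)

def feature_reduction(x, w1, w2):
    """Halve H and W, double C, via two offset paths (cf. Fig.3(b))."""
    p1 = one_by_one(x[::2, ::2], w1)          # path 1: stride-2 sampling + 1x1 conv
    # Path 2: pad by one pixel and take the same-sized region shifted toward
    # the lower-right corner, then apply the same stride-2 sampling + 1x1 conv.
    shifted = np.pad(x, ((0, 1), (0, 1), (0, 0)))[1:, 1:]
    p2 = one_by_one(shifted[::2, ::2], w2)
    # Concatenating the two paths doubles the channel count.
    return np.concatenate([p1, p2], axis=-1)

x = rng.normal(size=(8, 8, 16))
w1 = rng.normal(size=(16, 16))
w2 = rng.normal(size=(16, 16))
y = feature_reduction(x, w1, w2)              # spatial halved, channels doubled
```

The one-pixel shift in the second path lets the reduction see the pixels the first path skips, so no spatial information is discarded outright.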

B. HYPERPARAMETER SELECTIONS
We train NAS on a machine with 8 V100 GPUs and a batch size of 128 per GPU. When training the child networks, we adopt the cross-entropy loss, the most classical loss function in face recognition. We set the number of search blocks in the child network to 5. To reduce the variance between independent training runs, we decay the learning rate by cosine annealing. In addition, we add a function of the training epoch [40] to the original cosine annealing, as shown in Eq.6, which converges to the optimal solution faster. To let the neural network adapt to a large learning rate, it warms up from 0 to 0.1 over the first 20 epochs and then decays to 0.0001 according to the cosine annealing schedule. We use a weight decay of 10^{-4}, and the child network optimizer is momentum SGD with β fixed to 0.9.
where lr^i_min and lr^i_max are the minimum and maximum learning rates, T_i is the total number of epochs, T_cur is the current epoch, and i indexes a sequence of warm restarts in which lr^i_max typically decays. In contrast, the learning rate of the controller network decreases piecewise from 0.1 to 0.0001, dropping every 20 epochs, and the controller uses the stochastic gradient descent (SGD) optimizer. The child and controller networks are trained alternately for 100 epochs, each of which traverses all training and validation data. After this alternating training, we rank the searched architectures by validation accuracy. The top three structures are extracted and resampled by the controller network. Apart from increasing the total number of epochs from 100 to 150, all hyperparameters of the fixed networks are consistent with those of the child network, including the learning rate schedule and optimizer. All the data traversed by the child and controller networks is used as the training set, and at the end of each epoch we examine accuracy on the held-out test set. Fig.4 shows the schematic diagram of the network structures searched by NAS; the three networks differ in the operators inside the search blocks and the connections of the architecture. After training the fixed networks, we test their accuracy on the 350k-image test split previously partitioned from MS-Celeb-1M, and compare the results of the three networks with previous studies on this dataset. As shown in Table.1, all the NAS structures exceed previous results. Table.2 shows the test accuracy of different methods on the MS-Celeb-1M dataset; among them, NAS-C achieves state-of-the-art accuracy with 98.77%.
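The learning rate schedule described above can be sketched as follows: linear warmup to 0.1 over 20 epochs, then cosine annealing toward 0.0001. The exact epoch-dependent modification of Eq.6 is not reproduced, so this shows only the base schedule, with assumed function and parameter names:

```python
import math

def learning_rate(epoch, warmup=20, total=100, lr_max=0.1, lr_min=1e-4):
    """Linear warmup to lr_max, then cosine annealing down to lr_min."""
    if epoch < warmup:
        # Warmup: grow linearly so early steps with large gradients stay stable.
        return lr_max * (epoch + 1) / warmup
    # Cosine anneal over the remaining epochs: cos goes 1 -> -1,
    # so the rate glides smoothly from lr_max down to lr_min.
    t = (epoch - warmup) / (total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The warmup peak is reached exactly at epoch 19, and by the final epoch the rate has decayed to near lr_min.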
To further verify the performance of the proposed method, we test it on LFW, the dataset most commonly used in face recognition. We fix the three architectures and their parameters and test them directly on LFW. As expected, NAS-B and NAS-C achieve remarkable performance, while NAS-A tests slightly worse. Consistent with the pattern of the fixed-network test results, NAS-C surpasses existing face recognition techniques with 99.89% accuracy, the best result.

C. TRAINING RESULTS
Although we account for inference latency in Eq.5, we set a relatively loose constraint so as to improve test accuracy as much as possible. When the network is small, accuracy drops much more slowly than the parameter count. Moreover, the complexity of the network structure is not strongly correlated with test accuracy: among the three searched structures, NAS-B is the largest at 19.1M parameters, but the most accurate is NAS-C, with only 16M. Even so, NAS-C still outperforms the traditional ResNet. Once the final architecture is fixed, training time depends mainly on network size; thanks to the latency term in the reinforcement learning reward, the structures found by NAS are not large, so the training time of our fixed architectures is on the same order of magnitude as that of ResNet and MobileNet. This experimental result helps select an appropriate architecture, obtaining relatively small networks while preserving excellent performance.

V. CONCLUSIONS AND FUTURE WORK
In convolutional neural network solutions, model performance is mainly determined by the backbone architecture. In this paper, we depart from the traditional approach of taking a classical network as the backbone and modifying the loss function: instead, we search the face recognition architecture with NAS. The searched network surpasses all previous results on the MS-Celeb-1M dataset with 98.77% accuracy. Applying the searched networks directly to the LFW dataset, the test accuracy also reaches 99.89%. The performance on these two famous face datasets demonstrates the generality of our method. While guaranteeing accuracy, we reduce the size of the searched network structures as much as possible.
In the future, we would like to improve the efficiency of the NAS search and design a more general search space. Moreover, we will test the model on additional face recognition datasets covering low and high resolutions, varied face poses, and different lighting conditions. We believe that combining the NAS strategy with various loss functions is also a viable research direction.