Exploiting the Tail Data for Long-Tailed Face Recognition

Long-tailed distributions generally exist in large-scale face datasets, which poses challenges for learning discriminative features in face recognition. Although a few works have conducted preliminary research on this problem, the value of the tail data is still underestimated. This paper addresses the long-tailed problem from the perspective of maximally exploiting the tail data. We propose a Joint Alternating Training (JAT) framework to learn discriminative features from both the long-tailed data and the tail data by using an alternating training strategy. JAT consists of two branches: 1) the long-tailed data branch is adopted to learn the universal discrimination information from the whole long-tailed data with instance-balanced sampling; 2) the tail data branch is designed to exploit the discriminative information in the tail data with class-balanced sampling. To compensate for the insufficient samples and the lack of intra-class variations, we apply data augmentation (DA) to the tail data. We further propose margin-based mixup (MarginMix) for data augmentation, which can deal with the nonlinearity of the margin-based softmax loss and stabilize the training process in mixup. Furthermore, we obtain the best combination of strategies (i.e., JAT+DA+MarginMix) for long-tailed face recognition, which can maximally exploit the discriminative information in the tail data while retaining the universal discrimination learned from the long-tailed data. Extensive experiments on 8 face datasets demonstrate that our proposed methods and combination of strategies can effectively address the long-tailed problem in face recognition.


Deep face recognition has made significant progress in recent years, which largely benefits from the large-scale training data provided in [1], [3]. Therefore, much effort has been put into building large-scale face datasets recently [1], [2], [3]. Most large-scale face datasets in the real world exhibit a long-tailed distribution, in which a small number of identities account for most of the samples (the head data), while many other identities have relatively few face images (the tail data). A common problem in training on a long-tailed dataset is that the head identities are properly trained, but the tail identities are under-represented. Consequently, this brings difficulty in learning features with good representation and generalization ability for face recognition. According to [5], a model trained on part of the long-tailed dataset (with 20% or 50% of the tail data removed) obtains higher accuracy than one trained on the whole dataset. On the other hand, if too much tail data (70% or more) is discarded, the model performance drops. This preliminary research reveals that the tail data is a double-edged sword: if used properly, it can boost the performance of the trained model; otherwise, it brings a negative effect in learning discriminative features. Therefore, how to make the best of the tail data is a key issue in long-tailed face recognition.

The associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Framework of the proposed JAT. It consists of two data branches: 1) the long-tailed data branch is used to learn the universal information from the whole long-tailed dataset with instance-balanced sampling; 2) the tail data branch is designed to maximally exploit the discriminative information in the tail data by using class-balanced sampling. The alternating training strategy is adopted to learn more discriminative features from the two data branches jointly.

We address the long-tailed face recognition problem from the perspective of maximally exploiting the tail data in this paper. A Joint Alternating Training framework is proposed to learn more discriminative features from both the long-tailed data and the tail data by using an alternating training strategy.

As shown in Fig. 1, there are two data branches in our JAT framework. The long-tailed data branch is responsible for learning the universal information from the whole long-tailed dataset.

The tail data branch is designed to exploit the discriminative information in the tail data, which cannot be discovered by training on the long-tailed data alone. Instance-balanced sampling is used for the long-tailed data branch, while class-balanced sampling, in which each class is drawn with equal probability, is used for the tail data branch. By using the alternating training strategy, JAT can effectively exploit the discriminative information in the tail data while retaining the universal discrimination learned from the whole long-tailed dataset.
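The two sampling strategies can be sketched as follows; a minimal illustration assuming a dataset of (image, label) pairs, with illustrative function names rather than the paper's actual implementation:

```python
import random
from collections import defaultdict

def instance_balanced_sample(dataset, k, rng=random):
    """Draw k samples uniformly over instances; head classes dominate,
    mirroring the raw long-tailed distribution."""
    return [rng.choice(dataset) for _ in range(k)]

def class_balanced_sample(dataset, k, rng=random):
    """Draw k samples by first picking a class uniformly at random, then
    an instance within that class, so every class (including the tail)
    is seen equally often."""
    by_class = defaultdict(list)
    for sample in dataset:
        by_class[sample[1]].append(sample)   # sample = (image, label)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(k)]
```

On a long-tailed dataset, instance-balanced batches are dominated by head identities, while class-balanced batches give tail identities equal exposure, which is why each branch uses a different sampler.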

Due to the insufficient number of samples, the intra-class variance of the tail data is usually limited, which leads to an unreasonable squeezing of the tail classes in the feature space. To alleviate this problem, we propose MarginMix for data augmentation, which can deal with the nonlinearity of the margin-based softmax loss and stabilize the training process in mixup. MarginMix applies different loss functions to different parts of the linearly mixed label of a mixed sample. More specifically, MarginMix adopts the margin-based softmax loss for the label with the larger mixing coefficient and the traditional softmax for the label with the smaller coefficient. To further enlarge the intra-class variance of the tail data, the larger mixing coefficient is assigned to the sample whose label is in the minority class. Furthermore, MarginMix is combined with other data augmentation techniques to generate face images with more intra-class variations.

The major contributions of this work are the following:
1) We address the long-tailed problem in face recognition from the perspective of maximally exploiting the tail data. We are the first to explore how to maximally exploit the long-tailed data and the tail data jointly in the deep face recognition literature.
2) We propose the JAT framework to learn discriminative features from both the long-tailed data and the tail data jointly. JAT aims to exploit the discriminative information in the tail data while retaining the universal discrimination learned from the whole long-tailed dataset, which is achieved by alternating training on the long-tailed data branch and the tail data branch.

Although these methods can obtain performance improvements for long-tailed face recognition, they often suffer from high complexity or sensitivity to hyper-parameters in the training process. Furthermore, the discriminative information in the tail data is not fully exploited by the existing methods. Although simple tricks (e.g., data re-sampling, mixup) have been studied for generic long-tailed recognition, their feasibility in face recognition has not been well studied. In this paper, we make a preliminary attempt to apply such simple tricks, which are commonly used and insensitive to hyper-parameters, to long-tailed face recognition.

We first describe our proposed JAT framework in detail, and then we introduce MarginMix for data augmentation.

A. JOINT ALTERNATING TRAINING
We propose the JAT framework to learn discriminative features from both the long-tailed data and the tail data jointly. When training a model on a long-tailed dataset, the head classes are properly trained, but the tail classes are inadequately trained due to their limited number of samples. Therefore, we design an additional branch in our framework to exploit the hidden discriminative information in the tail data. JAT follows the training paradigm of multi-task learning, where the long-tailed data branch and the tail data branch learn feature representations from the long-tailed data and the tail data with an alternating training strategy. As shown in Fig. 1, the two branches share the same base model and weights for deep feature learning. In each branch, the base model is followed by a fully connected layer (i.e., a classifier) that maps the deep feature into the respective label space. The classification loss of each branch is calculated separately during training. We describe the two data branches and the alternating training process in detail below.

The tail data branch is designed to discover and exploit the hidden discriminative information in the tail data. The tail data is defined as the samples of the tail identities, which have a limited number of samples each but account for a significant portion of the whole dataset. After removing the head data, the distribution of the tail data becomes much more balanced than that of the long-tailed data, so the class-balanced sampling strategy is used in this branch. To compensate for the insufficient training samples and the lack of intra-class variance of the tail data, face data augmentation is further applied. Equipped with class-balanced sampling and data augmentation, the discriminative information in the tail data can be maximally exploited.
Algorithm 1 Joint Alternating Training
Input: the long-tailed data D_l, the tail data D_t
Output: the parameters of the base model θ, the classifier weights of the long-tailed data branch W_l and of the tail data branch W_t
Require: instance-balanced sampling IBS(·), class-balanced sampling CBS(·), the step ratio between the long-tailed data branch and the tail data branch S_1 : S_2, the maximum number of training epochs E_max, the classification loss function CL(·)
Alternating Training:
for epoch = 1 to E_max:
    repeat until the epoch ends:
        for step = 1 to S_1: sample a mini-batch B_l = IBS(D_l); update θ and W_l with CL(B_l)
        for step = 1 to S_2: sample a mini-batch B_t = CBS(D_t); update θ and W_t with CL(B_t)

Since the scale of the long-tailed data is larger than that of the tail data, we set a step ratio to balance the training process (i.e., the update frequency of the parameters) of the two branches.

As mentioned above, insufficient training samples lead to a small intra-class variance in the tail data. To alleviate this problem, data augmentation techniques, including mixup and generic data augmentation methods, are adopted to generate samples with more variations for the tail data.

Let (x, y) denote a sample and its label in the training dataset. Based on the assumption that linear interpolations of samples should be labelled by the linear interpolations of their associated labels, a mixed sample (x̃, ỹ) is generated by

x̃ = λ x_i + (1 − λ) x_j,  ỹ = λ y_i + (1 − λ) y_j,

where λ ∈ [0, 1] is the mixing coefficient, and the pair of samples (x_i, y_i) and (x_j, y_j) are drawn from the training dataset. Accordingly, the loss of the mixed sample is calculated as the linear weighted sum of the two losses on the labels y_i and y_j, i.e., L = λ L_{y_i} + (1 − λ) L_{y_j}.

However, this linear assumption does not hold for the nonlinear margin-based softmax loss. In margin-based softmax (e.g., ArcFace [10]), a margin penalty is added on the target label to learn more discriminative features by enforcing intra-class compactness and inter-class discrepancy. Therefore, the linearly generated label ỹ does not accurately describe the probabilities of the classes that the mixed image x̃ belongs to. Furthermore, margins are added on both labels y_i and y_j in mixup training, so the margin-based softmax loss of the mixed sample becomes much larger, especially for the label with the smaller mixing coefficient. This causes unstable training of the neural network and brings difficulty in the convergence of the model.

We propose MarginMix to deal with the nonlinearity of the margin-based softmax loss and to stabilize the training process. The key idea of MarginMix is to apply different loss functions to different parts of the linearly mixed label. Specifically, the margin-based softmax loss is adopted for the label with the larger mixing coefficient and the traditional softmax is used for the label with the smaller coefficient. Taking ArcFace as an example, the loss in MarginMix can be formulated as:

L = −λ log( e^{s·cos(θ_{y_i} + m)} / ( e^{s·cos(θ_{y_i} + m)} + Σ_{k=1, k≠y_i}^{n} e^{s·cos θ_k} ) ) − (1 − λ) log( e^{s·cos θ_{y_j}} / Σ_{k=1}^{n} e^{s·cos θ_k} ),

where λ ≥ 0.5 is the mixing coefficient.
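The MarginMix loss can be sketched for a single mixed sample as follows; a simplified NumPy illustration assuming λ ≥ 0.5 and that `cos_theta` holds the classifier's cosine similarities cos θ_k for the mixed sample:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()                 # subtract max for stability
    return z - np.log(np.exp(z).sum())

def marginmix_loss(cos_theta, y_i, y_j, lam, s=64.0, m=0.5):
    """MarginMix for one mixed sample: the ArcFace margin is applied to the
    label with the larger mixing coefficient (y_i, lam >= 0.5), and plain
    softmax to the other label (y_j)."""
    assert lam >= 0.5
    logits_margin = s * cos_theta.copy()
    theta_yi = np.arccos(np.clip(cos_theta[y_i], -1.0, 1.0))
    logits_margin[y_i] = s * np.cos(theta_yi + m)   # add angular margin m
    term_i = -lam * log_softmax(logits_margin)[y_i]
    term_j = -(1.0 - lam) * log_softmax(s * cos_theta)[y_j]
    return term_i + term_j
```

Note that the margin can only increase the y_i term relative to plain softmax, which is exactly why applying margins to both labels (as naive mixup would) inflates the loss, especially on the smaller-coefficient label, and destabilizes training.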
The coefficient λ is usually sampled from a beta distribution in practice. In the case that λ is smaller than 0.5, we simply use 1 − λ to replace λ to ensure λ ≥ 0.5, which is formulated as λ = max(λ, 1 − λ). (7)

TABLE 1. The detailed information of the training datasets. The imbalance ratio is the ratio of the sample number between the largest class and the smallest class. The percentages of identities and images in the tail data are given in parentheses.

As introduced previously, several generic data augmentation techniques are adopted for face image augmentation, i.e., color jittering, occlusion, blur, horizontal flip, and grey-level transformation. These augmentation techniques are each selected with a probability of 0.25 successively, so a combination of different techniques may be applied to a single sample. The maximum number of augmentations applied to a single image is limited to 3 in our experiments to avoid the augmented image drifting far from the original image. MarginMix is applied with a probability of 0.5, and the mixing coefficient λ is drawn from a beta distribution with α = β = 0.2 in all our experiments. We further use (7) to ensure λ ≥ 0.5. All of these data augmentation methods are applied only to the tail data.

We adopt ResNet50 and ResNet100 as our backbone networks and use ArcFace [10] as our loss function. We set the angular margin m to 0.5 and the feature scale s to 64 for ArcFace. The Stochastic Gradient Descent optimizer is employed, and the learning rate starts from 0.1 with a fixed momentum of 0.9 and weight decay of 5e-4. On ID30K, we divide the learning rate by 10 at epochs 10, 16, and 22 and finish training at epoch 25. On Glint360K, the learning rate is divided at epochs 8, 12, 16, and 20 and training is finished at epoch 22.

We first train a model on the original long-tailed ID30K dataset as a baseline, which is trained without using any strategy. In the DA strategy, we only apply generic data augmentation to the tail data with instance-balanced sampling.
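The augmentation schedule for a tail-data image can be sketched as follows; the augmentation names are labels for the generic techniques listed above, not real library calls:

```python
import random

GENERIC_AUGS = ["color_jitter", "occlusion", "blur", "hflip", "grey_level"]

def choose_tail_augmentations(rng=random):
    """Select generic augmentations for one tail-data image: each technique
    is chosen with probability 0.25, capped at 3 per image so the augmented
    image does not drift too far from the original."""
    chosen = [a for a in GENERIC_AUGS if rng.random() < 0.25]
    return chosen[:3]

def sample_mixing_coefficient(alpha=0.2, rng=random):
    """lambda ~ Beta(0.2, 0.2), then flipped via Eq. (7) so lambda >= 0.5."""
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam)

def apply_marginmix(rng=random):
    """MarginMix itself is applied with probability 0.5."""
    return rng.random() < 0.5
```

Beta(0.2, 0.2) is U-shaped, so most draws are near 0 or 1; after the Eq. (7) flip, the dominant label usually keeps a coefficient close to 1.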

The face verification accuracies of these strategies are shown in Table 2. The results of the strategy combinations are reported in Table 3.

From the results, we can see that all three combinations outperform the baseline and the model using the single JAT strategy. The combination of JAT+DA+MarginMix (JDM) achieves the highest accuracy on all the benchmarks, which demonstrates its effectiveness in dealing with the long-tailed problem in face recognition. More importantly, the accuracy improvement of JDM is no longer trivial. For instance, the verification accuracy increases by about 1% on CFP-FP and CALFW, and the identification and verification accuracy on MegaFace increases by about 1.5%. When JAT, DA, and MarginMix are applied incrementally, the accuracies steadily improve on all the test sets, i.e., JAT+DA is better than JAT and JAT+DA+MarginMix is better still, which demonstrates that there is negligible conflict between the three strategies.

Another informative observation is that data augmentation steadily improves model performance when combined with the JAT framework, and this holds for both generic DA and MarginMix. This is because the tail data is particularly enhanced in both sample number and intra-class variance within the JAT framework, which helps to exploit more discriminative information.

We randomly select a tail identity in ID30K for illustration, which contains only two face images. We show these two images with their augmented variants in both image space and feature space in Fig. 2. In Fig. 2(a), we can see that data augmentation can effectively enlarge the intra-class variance of the tail class. We visualize the image features by projecting them onto 2D space using t-SNE [29] in Fig. 2(b), where the features on the right are extracted from our JAT+DA+MarginMix model.

The distance between the red circle and the blue circle becomes smaller in the right of Fig. 2(b).

Following the testing protocol in [10], we adopt the feature norm and the face detection score to reweight the faces within each template. We show the TAR@FAR=1e-5 and 1e-4 of different methods in Table 5. Our method outperforms the other methods on both the IJB-B and IJB-C benchmarks. Compared with the baseline trained with ArcFace, the JDM combination gains accuracy improvements of 1.05% and 0.27% on IJB-B and 0.58% and 0.25% on IJB-C at TAR@FAR=1e-5 and 1e-4, respectively.
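The template reweighting used in the testing protocol can be sketched as follows; an illustration of the general idea only, assuming each face contributes its unit feature weighted by feature norm × detection score (the exact scheme follows [10] and may differ in detail):

```python
import numpy as np

def aggregate_template(features, det_scores):
    """Pool a template's face embeddings into one unit-length feature,
    weighting each face by its feature norm times its detection score, so
    confident, well-detected faces dominate the template representation."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    weights = norms * np.asarray(det_scores, dtype=float)
    pooled = (weights[:, None] * (features / norms[:, None])).sum(axis=0)
    return pooled / np.linalg.norm(pooled)
```

Template-to-template similarity is then a simple cosine (dot product) between the pooled unit features.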

In this paper, we address the long-tailed face recognition problem from the perspective of maximally exploiting the tail data in a long-tailed dataset. We propose the JAT framework to learn more discriminative features by alternating training on the long-tailed data and the tail data with different sampling strategies. We further propose MarginMix to deal with the nonlinearity of the margin-based softmax loss in mixup training, and combine it with other DA techniques to generate face images with more variations for the tail data. Furthermore, we obtain the best combination of strategies, i.e., JAT+DA+MarginMix, for long-tailed face recognition, which can maximally exploit the discriminative information in the tail data while retaining the universal discrimination learned from the long-tailed dataset. Extensive experiments demonstrate that our proposed methods and combination of strategies learn more discriminative deep features on long-tailed face datasets.

From 2007 to 2020, she worked as a Researcher at a university and in the information-communication industry, with research interests in image analysis, computer vision, and artificial intelligence. Since 2020, she has been a Researcher at Fujitsu Laboratories Ltd. Her current research interests include biometrics and related technologies.