Angular Margin-Mining Softmax Loss for Face Recognition

Face recognition has improved significantly in recent years owing to advances in loss functions. Typically, loss functions are designed to enhance separability either by concentrating on hard samples (mining-based approaches) or by increasing the feature margin between different classes (margin-based approaches). However, margin-based methods fail to exploit informative hard samples, and mining-based methods fail to learn the latent correlations between classes. Moreover, no existing method simultaneously considers the effects of hard samples and the feature margin through the same shape of feature angular margin. Therefore, this paper introduces the Angular Margin-Mining Softmax (AMM-Softmax) loss function, which adaptively emphasizes hard samples while also increasing the decision margins. The proposed AMM-Softmax loss function introduces a linear angular margin for hard samples, enabling direct optimization of the geodesic distance margin and maximization of class separability. Furthermore, the proposed AMM-Softmax loss function is computationally efficient and converges easily by rapidly switching from hard samples to easy samples. The results of extensive experiments conducted on popular benchmarks demonstrate the superiority of the proposed AMM-Softmax loss function over existing state-of-the-art methods.


I. INTRODUCTION
Existing deep face recognition methods primarily focus on designing effective loss functions that enhance the separability of feature embeddings. Adapting a triplet loss [14] for embedding learning is one of the pioneering methods in this field. This method first selects representative triplet samples and then enforces that a pair of samples from the same class lies closer in distance than samples from different classes. Center loss [24] also improves discriminability by learning a center for the features of each identity and using the centers to enforce intra-class compactness. However, the triplet loss suffers from a combinatorial explosion in the number of triplets, which results in a time-consuming learning process on large-scale datasets. Additionally, careful mining strategies are required to select representative triplets within an informative mini-batch. For the center loss, updating the centers becomes difficult as the number of classes increases, and its embedding produces a sub-optimal solution that does not optimize the inter-class margin. Several methods have been developed to improve the efficiency and capability of face recognition by implementing margin-based softmax loss functions [1], [10], [20], [22].
The margin-based softmax methods focus on increasing the feature margin between different classes, each proposing its own margin function to improve the power of feature discrimination. SphereFace [10] introduces a multiplicative angular margin (A-Softmax) between the features of different classes to impose a larger inter-class distance. CosFace [22] adds a cosine margin in the cosine similarity (logit) space and resolves the training instability of SphereFace. ArcFace [1] proposes an additive angular margin to obtain better feature separability. Margin-based softmax methods present several advantages; however, they do not consider the significance of informative samples such as hard samples. Recently, hard sample mining has been acknowledged as a critical technique to further improve performance. MV-Softmax [23] attempts to integrate a margin-based softmax loss function with feature mining. In this method, hard samples are defined as misclassified samples and are emphasized by increasing their weights in the loss. CurricularFace [7] also improves performance by adaptively adjusting the relative importance of samples during different training stages. However, MV-Softmax and CurricularFace suffer from unstable training due to their inconsistent decision margins for easy and hard samples. Their decision condition changes with the hardness of each sample from the perspective of the linearity of the decision boundary. MV-Softmax presents a non-linear margin for hard samples, while its decision margin for easy samples becomes linear or non-linear depending on the type of margin function. CurricularFace presents linear and non-linear decision boundaries for easy and hard samples, respectively, throughout all stages.
This inconsistency in decision boundaries interrupts the transition from hard samples to easy samples, because the models update their weights with the loss for hard samples while the mining of hard samples follows the loss for easy samples. It also leaves the models stuck at a local optimum with a considerable number of samples that fail to escape from the hard-sample regime even at the final stage of training.
This paper proposes the Angular Margin-Mining Softmax (AMM-Softmax) loss function, which provides a margin and mines hard samples using only consistent linear angular margins for both easy and hard samples. The proposed AMM-Softmax loss corresponds exactly to the geodesic distance on the hypersphere because of its linear margins, which improves the discriminability of the network for face recognition by enabling it to directly optimize the geodesic distance. The margin consistency presented in this paper for both easy and hard samples also accelerates the relocation from hard to easy samples, thus increasing the speed of convergence during training when compared to existing state-of-the-art (SOTA) models.
The mechanism of the proposed AMM-Softmax loss function for hard samples is compared to that of the margin-mining softmax functions in Figure 1. Our loss function presents consistent linear decision boundaries for both easy and hard samples, unlike earlier margin-mining softmax losses [7], [23]. Consider four samples, x1, x2, x3, and x4, which all belong to the same class. At first, all the samples are hard samples because they are all misclassified according to the decision boundary for easy samples (green dashed line). The hard samples (sky-blue dots) then become easy samples (blue dots), except for one sample, x2, of margin-mining softmax (red dot), as each method learns toward the decision boundary for hard samples (red dashed line). This is attributed to the fact that margin-mining softmax has a non-linear decision boundary for hard samples, which differs from that for easy samples. The consistency of the decision boundaries presented by the proposed loss function accelerates convergence by rapidly switching from hard samples to easy samples. The main contributions of this paper can be summarized as follows:
• We propose a novel AMM-Softmax loss function to obtain greater feature separability for face recognition. The proposed loss function presents a constant linear angular margin for hard samples, enabling direct optimization of the geodesic distance margin along with maximization of class separability. To the best of our knowledge, this is the first attempt to introduce a linear angular margin for hard samples.
• The proposed AMM-Softmax loss function improves the speed and stability of training by rapidly switching from hard to easy samples, using consistent linear decision margins for both easy and hard samples. Additionally, the AMM-Softmax loss function is computationally efficient, as it provides a margin to the hard samples using the same parameter m, without calculating the weighting parameter t used in previous loss functions.
• Extensive experiments are conducted on LFW, CALFW, CPLFW, AgeDB, CFP, and IJB-C, demonstrating the superiority of the proposed AMM-Softmax loss function over existing state-of-the-art methods.
The rest of the paper is organized as follows: Section II discusses relevant works in the field of face recognition. Section III explains the underlying concepts of loss functions and the proposed AMM-Softmax loss function. Section IV presents a detailed description of the experimental results and, lastly, Section V concludes the paper.

II. RELATED WORKS
A. METRIC LEARNING
Metric learning aims to automatically construct a task-specific distance metric by providing a mapping from a sample to a latent vector space. Face recognition methods therefore implement such a metric to discriminate the identity of a person in an open-set environment. Some studies [14], [17]-[19], [24] have attempted to apply metric learning to the embedding features extracted from a Deep Convolutional Neural Network (DCNN). The triplet loss [14] selects a triplet consisting of an anchor, a positive, and a negative feature to separate the negative feature from the anchor and positive features. Contrastive loss [17]-[19] utilizes feature pairs to place positive pairs together and separate negative pairs. The center loss [24] learns center vectors that represent the position of each class and reduces the distance between the sample features and the corresponding center vector. However, the aforementioned metric learning approaches suffer from high computational cost and sensitivity to the sample selection strategy, which significantly affects performance. NormFace [21] utilizes a normalized softmax loss function, eliminating the cost of selecting samples and thereby solving the two metric learning problems described above.
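The triplet loss described above can be sketched in a few lines of pure Python; the margin value of 0.2 and the use of plain Euclidean distance are illustrative assumptions, not the paper's settings.

```python
import math

def euclidean_dist(u, v):
    # Plain Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge form of the triplet loss: pull the positive pair together and
    # push the negative at least `margin` farther away than the positive.
    d_ap = euclidean_dist(anchor, positive)
    d_an = euclidean_dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)
```

When the negative is already farther than the positive by more than the margin, the hinge clips the loss to zero, which is why careful triplet selection matters: most random triplets contribute nothing.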

B. MARGIN SOFTMAX FUNCTIONS
Several studies [1], [10], [20], [22] have added a margin to the softmax loss function, based on NormFace [21], to enhance feature separability. They add the margin between the embedded feature and the positive class weight vector, creating a decision boundary gap between the classes. AM-Softmax [20] and CosFace [22] apply the margin in an additive manner to the cosine similarities, while SphereFace [10] applies it multiplicatively. ArcFace [1] proposes an additive margin directly on the geodesic distance and unifies the previous margin softmax loss functions by formulating the loss with three different margins: m1, m2, and m3. The margin softmax losses improve the separability power of face recognition models. However, they are vulnerable to hard samples, such as large-pose variant faces. Furthermore, they do not utilize the logits of non-ground-truth classes and ignore the significance of mining informative samples when training a separable feature extractor.

C. MINING SOFTMAX FUNCTIONS
The mining softmax approaches [8], [9], [16] are applied to solve the large-pose variant face problem and to improve model performance on hard samples. They select hard samples during the training stage and give them more weight to enhance the model's discrimination power. Focal loss [8] focuses on hard samples by reducing the weights of easy samples, which exhibit a high inference probability. Shrivastava et al. [16] proposed an online hard example mining method that selects samples with high loss values from a non-uniform and non-stationary distribution. AdaptiveFace [9] weights hard samples by adaptively sampling the training data based on a sampling probability that depends on the model's accuracy on each sample. However, these methods are vulnerable to over-fitting because of repeated training on mini-batches consisting only of outliers. They also fail to learn the latent correlations between classes since they apply mining at the sample level.

D. MARGIN-MINING SOFTMAX FUNCTIONS
The major obstacle in combining the methods of the previous two families is that the margin softmax methods add the angular margin at the logit level, while the mining softmax approaches select samples at the sample level. The margin-mining softmax functions [7], [23] present the advantages of both the mining and margin softmax functions by weighting hard samples and adding margins at the same logit level. MV-Softmax [23] combines the two methods by introducing an indicator function into the previous margin softmax equation that selects misclassified logits as hard samples, and emphasizes the importance of the misclassified logits by weighting them with a constant parameter. CurricularFace [7] re-designs the weighting function by adopting curriculum learning, adaptively emphasizing easy samples at earlier stages and hard samples at later stages of the training process. The previously developed margin-mining approaches [7], [23] successfully overcome the over-fitting problem of mining softmax and succeed in learning a difficulty correlation between classes. However, they present a non-linear geometric margin for hard samples, which results in instability and a low convergence rate in training. Conversely, to increase the convergence rate and improve model stability, the proposed AMM-Softmax uses a linear weighting function for misclassified logits.

III. PROPOSED METHOD
A. ANGULAR MARGIN-MINING SOFTMAX LOSS FUNCTION
This section explains the novel AMM-Softmax loss function, which adds a linear angular margin to the misclassified cosine similarities of hard samples. Figure 2 illustrates the overall pipeline of the proposed AMM-Softmax function.
We first consider the naive softmax loss function, which is represented as follows:

L_softmax = −log( e^{w_y^T x + b_y} / Σ_{k=1}^{n} e^{w_k^T x + b_k} ),   (1)

where x ∈ R^d represents an output feature vector for a sample extracted from a DCNN and y denotes the ground-truth label corresponding to x, which lies in the set k ∈ {1, 2, . . . , n}, where n denotes the number of classes. w_k ∈ R^d represents the weight vector of the k-th class, d and w_y denote the embedding feature dimension and the weight vector of the ground-truth class, respectively, and b_y and b_k represent the biases. The feature vector x and class weight vector w_k are normalized to x/||x|| and w_k/||w_k||, respectively, using l2 normalization, and the biases b_k ∈ R are set to zero, as in [10], [21]. The cosine similarities are obtained in the form cos θ_k = w_k^T x / (||w_k|| ||x||) by multiplying the two normalized vectors, as shown in Figure 2, where θ_k indicates the angle between the feature vector x and the k-th class weight vector w_k. The cosine similarities are rescaled by s to modulate the logit magnitudes. Subsequently, the normalized and scaled softmax loss function L_norm can be represented as follows:

L_norm = −log( e^{s cos θ_y} / Σ_{k=1}^{n} e^{s cos θ_k} ).   (2)

The cosine similarities (logits) are classified into four different groups: positive (blue), negative (red), misclassified (purple), and well-classified (green) logits, as shown in Figure 2. The positive logit cos(θ_y) represents the cosine similarity between the feature vector x and the center vector of the ground-truth class w_y. A negative logit cos(θ_k) represents the cosine similarity for a non-ground-truth class, k ∈ {1, 2, . . . , n} and k ≠ y, and can be separated into well-classified and misclassified logits. The red box in Figure 2 represents the method used to extract the misclassified logits from the negative logits. The dashed line represents the decision boundary formulated as cos(θ_y + m) < cos(θ_k), where m denotes the parameter for the margin between the class weight vectors.
If a given sample x satisfies this condition, the corresponding logit cos(θ_k) for class k is classified as a misclassified logit; otherwise, it is classified as a well-classified logit.
The sample x is categorized as a hard sample if at least one misclassified logit is identified while calculating the AMM-Softmax loss function. Correspondingly, the sample is categorized as an easy sample if no misclassified logit is identified. The face recognition task primarily aims to minimize the number of misclassified samples and logits.
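The cosine logits and the normalized, scaled softmax loss L_norm described above can be sketched in pure Python as follows; the scale value in the test is illustrative, and a log-sum-exp is used for numerical stability.

```python
import math

def cosine_logits(x, weights):
    # cos(theta_k) = w_k . x / (||w_k|| ||x||) for every class weight w_k.
    nx = math.sqrt(sum(v * v for v in x))
    cosines = []
    for w in weights:
        nw = math.sqrt(sum(v * v for v in w))
        cosines.append(sum(a * b for a, b in zip(w, x)) / (nw * nx))
    return cosines

def norm_softmax_loss(cosines, y, s=32.0):
    # L_norm = -log( e^{s cos(theta_y)} / sum_k e^{s cos(theta_k)} ),
    # computed via a numerically stable log-sum-exp.
    logits = [s * c for c in cosines]
    mx = max(logits)
    log_denom = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return -(logits[y] - log_denom)
```

Because both vectors are l2-normalized, the dot product collapses to the cosine of the angle between them, so the loss depends on the geometry of the hypersphere rather than on feature magnitudes.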
An angular decision margin m is added to the positive logit, following previous approaches that use an angular margin [1], [7], [23], to enhance the discrimination of the embedding features (blue logit in Figure 2). Subsequently, f(m, θ_y) is used instead of cos θ_y in Eq. 2, where f(m, θ_y) denotes a margin function for the positive cosine similarity. Correspondingly, the decision boundary condition for the positive logit is set to f(m, θ_y) = cos(θ_k), as shown in Table 1. The normalized softmax loss function with an angular margin is represented as:

L_margin = −log( e^{s f(m, θ_y)} / ( e^{s f(m, θ_y)} + Σ_{k=1, k≠y}^{n} e^{s cos θ_k} ) ).   (3)

In the general form, the margin function is formulated as f(m, θ_y) = cos(m1 θ_y + m3) − m2. In our method, m1, m2, and m3 are set to 1, 0, and m, respectively, yielding f(m, θ_y) = cos(θ_y + m) with m > 0. An additional linear margin reweighting function v(θ_k, m, I_k) is introduced for the misclassified logits. The proposed linear margin reweighting function is represented as follows:

v(θ_k, m, I_k) = cos(θ_k − m I_k),   (4)

where I_k denotes a binary mining indicator that categorizes the misclassified logits among the negative logits: I_k = 1 if cos(θ_y + m) < cos(θ_k), and I_k = 0 otherwise. The misclassified logits categorized by the binary mining indicator I_k are weighted by our linear margin function to increase the discrimination between the class weight vectors. The overall AMM-Softmax loss function with the mining indicator and the corresponding reweighting function is represented as follows:

L_amm = −log( e^{s cos(θ_y + m)} / ( e^{s cos(θ_y + m)} + Σ_{k=1, k≠y}^{n} e^{s v(θ_k, m, I_k)} ) ).   (5)

The proposed AMM-Softmax has linear angular margins for both the positive and misclassified logits. The linear angular margin for misclassified logits enables the maximization of class separability and directly optimizes the geodesic distance margin. Additionally, the AMM-Softmax loss function simplifies the training procedure and reduces the time required to relocate hard samples to easy samples.
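A minimal pure-Python sketch of the L_amm computation described above is given below. The explicit form of the reweighting, cos(θ_k − m·I_k), follows the boundary conditions stated in the comparison section; the margin and scale values in the test are illustrative rather than the tuned settings.

```python
import math

def amm_softmax_loss(cosines, y, m=0.5, s=32.0):
    # cosines: cos(theta_k) for every class; y: ground-truth class index.
    thetas = [math.acos(max(-1.0, min(1.0, c))) for c in cosines]
    pos = math.cos(thetas[y] + m)  # positive logit with angular margin
    logits = []
    for k, th in enumerate(thetas):
        if k == y:
            logits.append(s * pos)
        else:
            # Mining indicator I_k: the negative logit is misclassified
            # when cos(theta_y + m) < cos(theta_k).
            i_k = 1.0 if pos < math.cos(th) else 0.0
            # Linear angular margin applied only to misclassified logits.
            logits.append(s * math.cos(th - m * i_k))
    mx = max(logits)
    log_denom = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return -(logits[y] - log_denom)
```

A sample whose negative cosine approaches the positive one triggers the indicator and receives a boosted negative logit, so the loss for hard samples is sharply larger than for easy ones, which is the intended emphasis.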

B. COMPARISONS WITH DIFFERENT LOSS FUNCTIONS
This section presents a comparison between the proposed AMM-Softmax function and the margin, mining, and margin-mining softmax functions, as shown in Figure 3 and Table 1.
The normalized naive softmax function without the angular margin is set as our baseline. The first subplot of Figure 3 and the first row of Table 1 present the boundary condition cos θ_y = cos θ_k and the corresponding decision boundary plot of the baseline.

1) MARGIN SOFTMAX FUNCTIONS
The margin softmax functions (CosFace [22] and ArcFace [1]) consider an angular margin based only on the positive logits. They take the form of Eq. 3 and have different boundary conditions for positive logits, as shown in Table 1. CosFace [22] introduced the method of adding margins to the logits to increase the decision margin between classes, proposing the decision boundary condition cos(θ_y) − m = cos(θ_k). ArcFace [1] addressed the non-linearity problem of decision margins and defined a linear decision margin by adding a margin directly to θ, as cos(θ_y + m) = cos(θ_k).
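The two positive-logit margin functions can be contrasted numerically; the margin values below are the defaults reported later in the experimental setting, used here only for illustration.

```python
import math

def cosface_logit(theta, m=0.35):
    # CosFace: additive margin in cosine space, cos(theta) - m.
    return math.cos(theta) - m

def arcface_logit(theta, m=0.5):
    # ArcFace: additive margin directly on the angle (geodesic
    # distance), cos(theta + m).
    return math.cos(theta + m)
```

The CosFace penalty in cosine space is constant for every angle, whereas the ArcFace penalty varies with the angle precisely because its margin is constant in angular (geodesic) space, which is the linearity property the paper builds on.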
The proposed AMM-Softmax function adopts the boundary condition for the positive logit from ArcFace [1]. In addition to the margin softmax functions, the proposed AMM-Softmax introduces a mining indicator for the misclassified logits and exploits them by weighting them with a linear margin.

2) MINING SOFTMAX FUNCTIONS
The mining softmax approaches [8], [16] design a mining function to select informative samples and utilize them during training to emphasize the importance of difficult samples. The mining softmax function can be summarized as:

L_mining = −g(x) log( e^{s cos θ_y} / Σ_{k=1}^{n} e^{s cos θ_k} ),

where g(x) represents a mining function. In OHEM [16], samples with a higher loss are classified as hard samples with g(x) = 1, and g(x) = 0 for easy samples with a lower loss. In focal loss [8], the mining function is designed as g(x) = (1 − p_y)^γ, where the probability p_y = e^{s cos θ_y} / Σ_{k=1}^{n} e^{s cos θ_k} and γ is a constant modulating factor. However, previously proposed mining softmax functions utilize mining only to select informative samples, while the proposed AMM-Softmax utilizes mining at the logit level. Logit-level mining enables a more consistent strategy for training a discriminative feature extractor, since the model can determine the difficulty of the samples class by class. Furthermore, it is unclear to the training model how the loss [16] and the probability [8] used as mining signals relate to informative samples, whereas the proposed AMM-Softmax function clearly defines the informative samples as misclassified samples.
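The focal-loss mining function g(x) = (1 − p_y)^γ described above can be sketched as follows; the scale and γ values in the test are illustrative.

```python
import math

def softmax_prob(cosines, y, s=32.0):
    # p_y = e^{s cos(theta_y)} / sum_k e^{s cos(theta_k)},
    # computed stably by shifting by the maximum logit.
    logits = [s * c for c in cosines]
    mx = max(logits)
    denom = sum(math.exp(z - mx) for z in logits)
    return math.exp(logits[y] - mx) / denom

def focal_weight(p_y, gamma=2.0):
    # Focal-loss mining function g(x) = (1 - p_y)^gamma: confident
    # (easy) samples receive a small weight, hard samples a large one.
    return (1.0 - p_y) ** gamma
```

The weight goes to zero as the model becomes confident, so easy samples contribute little gradient while hard samples dominate, which is sample-level mining in contrast to the logit-level mining of AMM-Softmax.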

3) MARGIN-MINING SOFTMAX FUNCTIONS
The margin-mining softmax functions extract the misclassified logits from the negative logits and, in addition, insert a margin on the positive logits, similar to the margin softmax functions. In this subsection, the AMM-Softmax function is compared with two previous margin-mining softmax functions, MV-Softmax [23] and CurricularFace [7]. The general form of the margin-mining softmax functions is defined with a weighting function h(t, θ_k, I_k) applied to the misclassified logits, where t denotes a weighting parameter for the misclassified logits. MV-Softmax [23] defines the weighting function as h(t, θ_k, I_k) = e^{s t (cos(θ_k) + 1) I_k} = e^{s (t cos(θ_k) + t) I_k}. The decision boundary condition of MV-Softmax for the misclassified logits takes the non-linear form cos(θ_y + m) = (t + 1) cos θ_k + t, as shown in Figure 3. CurricularFace [7] proposes the reweighting function h(t, θ_k, I_k) = e^{s (t + cos θ_k) cos θ_k I_k}, with t defined as an Exponential Moving Average (EMA) of the positive logits cos(θ_y). It adaptively adjusts the leverage on the misclassified logits and presents the varying boundary condition cos(θ_y + m) = (t + cos θ_k) cos θ_k. The margin-mining softmax functions [7], [23] adopt the margin softmax of [1] to increase the decision margin of the positive logit, implying that they use the same mining condition for the misclassified samples, cos(θ_y + m) = cos θ_k. They [7], [23] propose a non-linear weighting function for the misclassified logits, while a linear decision margin for the positive logit and a linear mining indicator are used. This inconsistency in the loss function results in inconsistent decision conditions for easy and hard samples, as shown in Figure 3. Conversely, the AMM-Softmax loss function presents a simpler, linear boundary condition for hard samples, cos(θ_y + m) = cos(θ_k − m), which is consistent with the boundary condition for easy samples, cos(θ_y + m) = cos θ_k.
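The three hard-sample boundary targets quoted above can be evaluated side by side; the t and m values in the test are illustrative.

```python
import math

def mv_boundary(cos_k, t=0.3):
    # MV-Softmax hard-sample target: (t + 1) * cos(theta_k) + t,
    # an affine (shifted) function in cosine space.
    return (t + 1.0) * cos_k + t

def curricular_boundary(cos_k, t=0.3):
    # CurricularFace hard-sample target: (t + cos(theta_k)) * cos(theta_k),
    # quadratic in cos(theta_k), hence non-linear.
    return (t + cos_k) * cos_k

def amm_boundary(theta_k, m=0.5):
    # AMM-Softmax hard-sample target: cos(theta_k - m), a constant
    # angular (geodesic) shift, hence linear in the angle itself.
    return math.cos(theta_k - m)
```

The quadratic CurricularFace target fails the additivity check that a linear map would satisfy, while the AMM target is a pure shift in the angle, matching the easy-sample boundary's shape.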

IV. EXPERIMENT
A. EXPERIMENTAL SETTING
1) DATASETS
Table 2 presents the identities and the number of images of the datasets used in our experiments. All the models are trained on the same public dataset, MS1MV2, for a fair comparison. MS1MV2 is a semi-automatically refined version of the MS-Celeb-1M [4] dataset. The models are tested on two tasks: a verification task and an identification task. The verification performance of the AMM-Softmax model is compared with other recent face recognition models [1], [7], [10], [22], [23] on the widely used LFW [6] and CFP-FF [15] datasets. Furthermore, we also measure the models' verification robustness under large-age and large-pose variance using the more challenging benchmarks CFP-FP [15], AgeDB-30 [12], CPLFW [25], and CALFW [26]. The IJB-C [11] dataset, an extension of IJB-B, is adopted to test the verification and identification performance of the models on a large-scale benchmark using the 1:1 and 1:N protocols of IJB-C.

2) PREPROCESSING AND CNN SETUP
The input of the network is preprocessed in two steps: face detection and face alignment. Face detection is conducted with RetinaFace [2] to find the face in each image along with the corresponding five facial landmarks (2 eyes, 1 nose tip, 2 mouth corners). After finding the landmarks, affine warping is used for face alignment based on the landmarks. It stretches and rotates the faces and produces a preprocessed image of size 112 × 112 (width × height). We employ SE-ResNet50-IR as our backbone network, which consists of a ResNet50 backbone with a Squeeze-and-Excitation (SE) [5] module and an improved residual module (IR) [3] customized for the image recognition task. A latent vector with a 512-channel dimension is extracted from each 112 × 112 image by the backbone network.

3) TRAINING
We train the models with Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005 in all experiments. We use an initial learning rate of 0.05 and multiply it by 0.1 at the 7th, 14th, and 21st epochs, training for a total of 25 epochs. The default scale s is set to 32, and the margin m is set to 1.35, 0.35, and 0.5 for SphereFace [10], CosFace [22], and ArcFace [1], respectively. For the margin-mining softmax losses, we follow the m setting of their respective margin functions (i.e., 0.35 for MV-AM-Softmax; 0.5 for MV-Arc-Softmax, CurricularFace, and AMM-Softmax). The reweighting parameter t for misclassified logits is set to 0.2 for MV-AM and 0.3 for MV-Arc, following [23]. All experiments are implemented in the PyTorch [13] framework, and we use four NVIDIA RTX 3090 GPUs in parallel with a batch size of 512. We train the backbone network from scratch, attaching a randomly initialized fully connected (FC) layer at the end to calculate the loss and propagate gradients back to the backbone network.
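The step learning-rate schedule described above can be sketched as a small pure-Python helper; epochs are assumed to be 0-indexed, which is an illustrative convention rather than one stated in the paper.

```python
def learning_rate(epoch, base_lr=0.05, milestones=(7, 14, 21), factor=0.1):
    # Step schedule: start at base_lr and multiply by `factor`
    # once each milestone epoch has been reached.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```

This mirrors the behavior of PyTorch's MultiStepLR scheduler: the rate stays flat between milestones and drops by a constant factor at each one.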

4) TESTING
Preprocessed images are used as input to test the models, and we utilize only the trained backbone network, without the FC layer, to evaluate performance. All the input images are converted to feature vectors using the trained backbone network, and the cosine similarities of the vectors are calculated for 1:1 verification and 1:N identification. The verification and identification protocols utilize their own templates or lists to select the samples and quantify the performance of the models.

B. RESULTS AND ANALYSIS
1) CONVERGENCE OF TRAINING
Figure 4 depicts the training loss curves of the margin-mining softmax models during training. The AMM-Softmax loss function reduces its training loss faster than any other margin-mining method, even though it starts with the second highest training loss. CurricularFace presents favorable training behavior and model convergence since it is based on curriculum learning; it demonstrates a smaller dispersion and faster convergence than the MV-Softmax loss function. However, the AMM-Softmax loss function presents faster and more stable convergence than the other methods after the early stage of training. It is difficult to demonstrate this training superiority mathematically. However, considering the number of misclassified logits of the margin-mining models shown in Figure 5, the convergence behavior directly corresponds to the speed with which the model exploits misclassified logits. This is crucial for training a separable feature extractor because it reflects the speed at which the model turns samples with high loss into samples with small loss. In this case, the proposed AMM-Softmax loss function, which uses a consistent margin for easy samples (linear) and hard samples (linear), achieves faster convergence than the models with inconsistent decision margins.
Furthermore, the number of misclassified logits remaining at the last epoch of MV-AM is approximately 200K, which is quantitatively large, while AMM-Softmax retains only a few hundred, almost completely exploiting the misclassified logits during the training process.

2) DISTRIBUTION OF θy
To demonstrate the superiority in inter-class separability and intra-class compactness of training with our AMM-Softmax loss function, we present the distribution of θ_y. Figure 6 shows the histogram of the angle between the ground-truth class weight vector and the input feature extracted under four different losses (Softmax, Arc-Softmax [1], MV-Softmax [23], and our AMM-Softmax). The angles are measured three times, at the start (1st epoch), middle (2nd epoch), and end (25th epoch) of training, and the mean µ and variance σ² of the angle are calculated at the end phase for a detailed comparison. At the start of training, all the class weight vectors and backbone weights are randomly initialized and lack any similarity between the feature vector and the positive class weight vector, which face each other at an angle of approximately 90°. As training proceeds, the angle distribution shifts lower and becomes sharper because the model is trained to maximize intra-class compactness and inter-class separability. Since training a face recognition model aims to obtain a discriminative classifier, the distribution with the lower µ and lower σ² at the end phase indicates the better-performing model. In Figure 6, the Softmax function shows the lowest mean, µ = 30.7°, with a high variance, σ² = 38, indicating low intra-class compactness, while the MV-AM Softmax function shows the lowest variance, σ² = 21, with low inter-class separability and a high mean, µ = 45.7°. However, the AMM-Softmax loss function ensures a similarly high level of inter-class separability compared to the softmax function and a similarly low level of intra-class dispersion compared to the MV-AM Softmax function.
Specifically, compared with the MV-AM Softmax loss function, which represents the SOTA model for face recognition, AMM-Softmax reduces µ from 45.7° to 33.0°, while the σ² of the θ_y distribution remains 21, the same as that of the MV-AM Softmax loss function, demonstrating the best discrimination performance.

3) RESULTS ON VERIFICATION DATASETS
The proposed AMM-Softmax loss function is compared to recent face recognition models (SphereFace [10], CosFace [22], ArcFace [1], CurricularFace [7], and MV-AM [23]) on the popular public face verification benchmarks LFW and CFP, as shown in Table 3. Furthermore, the performance of the AMM-Softmax loss function is demonstrated on relatively difficult datasets that consist of samples with high pose and age variation (CALFW, CPLFW, AgeDB). The model trained with the AMM-Softmax loss function leads the baseline by a large margin and surpasses the other models [1], [7], [10], [22], [23], exhibiting the highest performance on each benchmark, as shown in Table 3. The proposed AMM-Softmax loss function achieves 92.15% on CPLFW, the most difficult dataset among the considered benchmarks, which is 0.42% higher than MV-AM and 4.81% higher than the baseline. This result demonstrates that the proposed AMM-Softmax method can considerably enhance the discriminability of the embedding network with a consistent linear angular margin by simply changing the softmax loss function. Table 4 presents the verification and identification results of the model learned with the AMM-Softmax loss function compared to existing face recognition models. This study employs the 1:1 protocol of IJB-C as a verification benchmark and the 1:N protocol as an identification benchmark. To evaluate the robustness and reliability of the models in the face recognition task, the true positive rate (TPR) at a very low false positive rate (FPR) is important. Therefore, the 1:1 verification TPR is reported at FPR = 1e-6, and the 1:N identification TPR at FPR = 1e-3, to estimate the AMM-Softmax performance. The proposed AMM-Softmax loss function outperforms the other state-of-the-art models [1], [7], [23] by a margin of 0.3% for verification and 3.6% for identification, as shown in Table 4.
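The TPR@FPR metric used above can be sketched from raw similarity scores; the thresholding convention below (accept at most the target fraction of impostor pairs) is one common implementation choice, not necessarily the exact IJB-C evaluation code.

```python
def tpr_at_fpr(pos_scores, neg_scores, target_fpr):
    # Choose the threshold so that at most `target_fpr` of the negative
    # (impostor) pair scores exceed it, then report the fraction of
    # positive (genuine) pair scores above that threshold.
    neg_sorted = sorted(neg_scores, reverse=True)
    n_false = int(target_fpr * len(neg_sorted))
    if n_false >= len(neg_sorted):
        n_false = len(neg_sorted) - 1
    threshold = neg_sorted[n_false]
    tp = sum(1 for s in pos_scores if s > threshold)
    return tp / len(pos_scores)
```

At very low target FPRs the threshold is pinned to the highest impostor scores, which is why a handful of difficult impostor pairs dominates the reported TPR@FPR = 1e-6.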
Figure 7 presents the receiver operating characteristic (ROC) curves of the proposed AMM-Softmax and other SOTA models on the IJB-C 1:1 protocol. The SOTA methods all achieve high performance on the test dataset, above 0.8 TPR@FPR = 1e-6. However, the proposed AMM-Softmax loss function reaches the highest TPR@FPR = 1e-6 and the highest area under the ROC curve (AUC). The results demonstrate that the proposed AMM-Softmax loss function delivers reliable performance across various FPR regimes and presents the best verification performance compared to the other SOTA methods.
Table 5 presents two ablation studies for different values of the margin m and scale s. The ablation study for the margin m is conducted over the range [0.2, 0.35] in steps of 0.05, with s fixed to 32, as shown in the first four rows of Table 5. In the other ablation study, for s, shown in the last four rows, m is set to a constant value of 0.3 and s is increased from 16 to 64. The results demonstrate that a small s and a small m provide insufficient discriminability to the model. However, a high margin and scale do not necessarily ensure high performance either. Instead, the proposed AMM-Softmax loss function achieves the best result with a margin of 0.3 and a scale of 32. This result demonstrates that the margin and scale are essential parameters for face recognition and that the optimal parameters for AMM-Softmax are m = 0.3 and s = 32.

V. CONCLUSION
This paper proposed the AMM-Softmax loss function, which provides a consistent margin for both easy and hard samples by applying linear angular margins to the geodesic distance. The proposed AMM-Softmax loss function improves the separability power of the feature extractor in the face recognition task, and its consistent decision margin also provides stability and acceleration in training. Extensive experimental analyses were conducted to demonstrate the superiority of the proposed AMM-Softmax loss function in both verification and identification tasks when compared to other state-of-the-art face recognition methods.