SphereFace Revived: Unifying Hyperspherical Face Recognition

This paper addresses the deep face recognition problem under an open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. To this end, hyperspherical face recognition, as a promising line of research, has attracted increasing attention and gradually become a major focus in face recognition research. As one of the earliest works in hyperspherical face recognition, SphereFace explicitly proposed to learn face embeddings with large inter-class angular margin. However, SphereFace still suffers from severe training instability which limits its application in practice. In order to address this problem, we introduce a unified framework to understand large angular margin in hyperspherical face recognition. Under this framework, we extend the study of SphereFace and propose an improved variant with substantially better training stability -- SphereFace-R. Specifically, we propose two novel ways to implement the multiplicative margin, and study SphereFace-R under three different feature normalization schemes (no feature normalization, hard feature normalization and soft feature normalization). We also propose an implementation strategy --"characteristic gradient detachment"-- to stabilize training. Extensive experiments on SphereFace-R show that it is consistently better than or competitive with state-of-the-art methods.


INTRODUCTION
Recent years have witnessed the tremendous success of deep face recognition (FR). Owing to the rapid development in discriminative loss functions [1], [2], [3], [4], [5] that promote large inter-class feature margin, the performance of deep FR has dramatically improved. These loss functions share a common goal: projecting deeply learned face embeddings onto a hypersphere and incorporating large geodesic inter-class margins. We call this series of deep FR methods hyperspherical face recognition.
Previous deep FR methods [6], [7] typically train neural networks by classifying identities in a training set. Such a training target largely deviates from the open-set testing (i.e., to determine whether two face images belong to the same person) in two aspects: (i) the similarity measure differs in training and testing; (ii) open-set testing must solve a metric learning problem [2] where the goal is to learn large-margin features, while training aims to solve a closed-set classification problem where the goal is to learn separable features. Motivated by the mismatch between training and testing in deep FR, hyperspherical FR aims to bridge the gap by (i) constraining the face embeddings on the hypersphere (i.e., using cosine similarity for both training and testing), and (ii) incorporating large geodesic margin on the hypersphere. Another motivation for hyperspherical FR comes from the observation that deep features are intrinsically discriminative on a hypersphere [8]. Hyperspherical FR essentially focuses on answering the question: How to effectively and stably incorporate large angular margin to face embeddings?
Large-margin softmax [1] is one of the first methods to incorporate large angular margin into deeply learned features. The core idea is to use a monotonically decreasing lower-bound function ψ(θ_{(x,W_y)}) to replace the target angular activation cos(θ_{(x,W_y)}) in the softmax-based loss function, where θ_{(x,W_y)} denotes the angle between the deep feature x and the classifier of the target class W_y (y is the label of x). The intuition is that the function ψ(θ_{(x,W_y)}) will make the angle θ_{(x,W_y)} smaller in order to achieve the same value as cos(θ_{(x,W_y)}). Such a design encourages the deep features to have large inter-class margins on the unit hypersphere. Most popular hyperspherical FR methods [2], [3], [4], [5] adopt this design principle.
Built upon [1], SphereFace [2] takes one step further by explicitly constraining decision boundaries on the hypersphere and simultaneously incorporating angular margins. Inspired by SphereFace, a series of works [3], [4], [5] design alternative lower-bound target functions ψ(θ_{(x,W_y)}) to achieve angular margin. Based on how ψ(θ_{(x,W_y)}) is constructed, loss functions in hyperspherical FR can be divided into additive margin [3], [4], [5] and multiplicative margin [1], [2]. As a representative multiplicative margin method, SphereFace offers promising geometric insights. However, in contrast to additive margin, SphereFace is known to be highly non-trivial to train, typically requiring a number of bells and whistles to stabilize its training, which limits its potential application.
In order to address this shortcoming, we take a detour by first identifying an intrinsic connection that bridges different margin designs [1], [2], [3], [4], [5], [9] in hyperspherical FR. We formulate this connection with a unified large-margin framework for hyperspherical FR. In this framework, we summarize a general principle for any loss function to achieve large angular margin. Following this principle, most existing hyperspherical FR methods can be viewed as special instantiations. This framework helps us gain a deeper understanding of hyperspherical FR, and serves as a portal to design new loss functions.
Under this unified framework, we extend our previous study of SphereFace [2] by proposing alternative yet effective ways to implement the multiplicative margin with improved training stability and better empirical performance. Specifically, the original realization of multiplicative margin in SphereFace is exact only when the angle between the feature and the target classifier is sufficiently small. When this angle is large, the multiplicative margin becomes approximate and the original intuition no longer holds. Motivated by this, we propose two novel variants that can exactly implement the intuition of multiplicative margin for all possible angles. Along with the new multiplicative margins, we also propose a novel implementation strategy which we call characteristic gradient detachment (CGD) that helps to stabilize training and improve generalization. We term our improved approach SphereFace-R.
Another significant difference between SphereFace and other hyperspherical FR methods [3], [4], [5], [9] is whether feature normalization (FN) is performed. Based on the empirical observation in [2], [8], we notice that feature magnitude still contains some information such as image quality. However, whether the information encoded in feature magnitude is useful for FR remains an open question. To address this, we consider three schemes here: no feature normalization (NFN), hard feature normalization (HFN) and soft feature normalization (SFN). HFN is identical to the popular feature normalization used in [3], [4], [5], [9]. In contrast, SFN formulates the feature normalization objective into a regularization term and optimizes it jointly with the neural network. Unlike HFN, SFN will take feature magnitude into account when training the neural network. This shares a similar spirit with [10]. While both FN-free learning and HFN can be viewed as limiting cases of SFN, SFN effectively unifies both approaches and serves as an interpolation between them. We conduct a systematic study to evaluate the effectiveness of all three FN strategies.
Our contributions can be summarized as follows: • We present a unified framework to understand large angular margin in hyperspherical FR. This framework effectively explains how and why angular margin can be incorporated in SphereFace and further summarizes a general principle for loss functions to introduce large angular margin. Moreover, most of the current hyperspherical FR methods can be viewed as special instantiations of this framework.
• Under the unified framework, we substantially extend our previous work on SphereFace [2] by addressing training instability and improving empirical performance. Compared to the original SphereFace, SphereFace-R uses a more intuitive way to incorporate the multiplicative margin and yields more stable training, a clearer geometric interpretation and superior generalization.
• We propose CGD, a generic implementation method for hyperspherical FR methods to improve training stability and generalizability.
• To evaluate the usefulness of feature magnitude, we comprehensively study SphereFace-R under three different FN schemes: NFN, HFN and SFN.
• Our paper comes with an easy-to-use codebase to facilitate future research. It serves as a platform to evaluate hyperspherical FR methods fairly.
In contrast to hyperspherical FR methods that use an additive margin, SphereFace-R is built upon our previous work [1], [2] and adopts a multiplicative margin approach.

A UNIFIED LARGE-MARGIN LEARNING FRAMEWORK FOR HYPERSPHERICAL FACE RECOGNITION
To gain deeper insights into large angular margin, we present a unified framework for hyperspherical FR. To start with, we consider the standard softmax cross-entropy loss:

$$\mathcal{L}=-\log\frac{\exp(W_y^\top x+b_y)}{\sum_{i=1}^{K}\exp(W_i^\top x+b_i)},\qquad(1)$$

where x ∈ ℝ^d denotes the deep feature (the input of the classifier layer), y is its ground truth label, K is the total number of classes, W_i ∈ ℝ^d is the weight of the i-th classifier and b_i is the bias for the i-th class. Note that here we consider the case of a single input sample for simplicity; we only need to average the loss objectives if we consider a mini-batch of input samples. Since the class-dependent bias term is not informative in open-set evaluation, we follow the common practice to remove it [2]. Then we normalize the classifier weights to one (i.e., ‖W_i‖ = 1, ∀i) and rewrite the objective function as follows:

$$\mathcal{L}=-\log\frac{\exp(\|x\|\cos\theta_y)}{\sum_{i=1}^{K}\exp(\|x\|\cos\theta_i)},\qquad(2)$$

where θ_i denotes the angle between the deep feature x and the i-th classifier W_i. By considering a generic angular activation rather than the cosine function, we have the following generalized objective function:

$$\mathcal{L}=-\log\frac{\exp(\|x\|\,\psi(\theta_y))}{\exp(\|x\|\,\psi(\theta_y))+\sum_{i\neq y}\exp(\|x\|\,\eta(\theta_i))},\qquad(3)$$

where ψ(θ_y) is the angular activation function for the target class (i.e., the ground truth label) and η(θ_i), i ≠ y, denotes the angular activation function for the i-th non-target class (the labels excluding the ground truth one). Similar to the cosine function, both ψ(θ) and η(θ) are generally required to be monotonically decreasing for θ ∈ [0, π]. After looking into different hyperspherical FR methods, we summarize a simple yet generic principle for any softmax-style loss to learn embeddings with large angular margin.
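As a concrete sketch (ours, not the authors' code), the generalized objective in Eq. (3) can be written down directly in NumPy; `generalized_angular_loss` is our name, and ψ and η default to the cosine, which recovers the softmax loss with normalized weights and zero biases.

```python
# A minimal NumPy sketch (ours) of the generalized angular objective in
# Eq. (3); psi/eta are the target/non-target angular activations.
import numpy as np

def generalized_angular_loss(x, W, y, psi=np.cos, eta=np.cos):
    """Loss for one feature x of shape (d,) against K classifiers W of shape (K, d)."""
    W = W / np.linalg.norm(W, axis=1, keepdims=True)    # enforce ||W_i|| = 1
    r = np.linalg.norm(x)                               # feature magnitude ||x||
    theta = np.arccos(np.clip(W @ x / r, -1.0, 1.0))    # angle to each classifier
    logits = r * eta(theta)                             # non-target activations
    logits[y] = r * psi(theta[y])                       # target class uses psi
    return np.log(np.exp(logits).sum()) - logits[y]     # -log softmax probability
```

With ψ = η = cos, this reduces to the softmax loss with normalized weights (Eq. (2)); plugging in a smaller target activation (ψ(θ) < η(θ)) only increases the loss, which is exactly the margin mechanism discussed next.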
To achieve large angular margin, the generic principle is to make ψ(θ) always smaller than η(θ) in (0, π], namely

$$\Delta(\theta):=\eta(\theta)-\psi(\theta)>0,\quad\forall\,\theta\in(0,\pi],\qquad(4)$$

where we define ∆(θ) as the characteristic function for large angular margin. ∆(θ) determines most of the properties of the angular margin, such as the size of the margin, its learning stability, etc.
As long as we guarantee that ∆(θ) is larger than zero, the objective function in Eq. (3) will define a task that can inherently introduce large angular margin. To see how ∆(θ) interacts with the loss function, we can rewrite Eq. (3) in the following mathematically equivalent form:

$$\mathcal{L}=\log\Big(1+\sum_{i\neq y}\exp\big(\|x\|\big(\eta(\theta_i)-\eta(\theta_y)+\Delta(\theta_y)\big)\big)\Big),\qquad(5)$$

which essentially aims to minimize η(θ_i) − η(θ_y) + ∆(θ_y). The term η(θ_i) − η(θ_y) represents the difference of classification confidence, and the characteristic function ∆(θ_y) controls the angular margin. When ∆(θ_y) = 0 and η(·) is the cosine function, Eq. (5) reduces to the standard softmax loss with weight normalization; ∆(θ_y) = 0 indicates that no angular margin has been introduced. When ∆(θ_y) > 0, we obtain a large angular margin because the classification becomes more stringent (i.e., the neural network has to make θ_y smaller in order to reach the same loss value as in the case of ∆(θ_y) = 0). It is also worth mentioning that when ∆(θ_y) < 0, Eq. (5) defines an easier task than the standard classification problem and is potentially useful for robust learning against noisy images or labels. Our paper focuses on the case of ∆(θ_y) > 0.
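The equivalence between the softmax form (Eq. (3)) and the rewritten margin form can be checked numerically; in the sketch below (ours), the CosFace-style ψ(θ) = cos(θ) − 0.35 is purely an illustrative choice of margin.

```python
# Numerical check (ours) that the softmax form of the loss equals the
# rewritten form exposing Delta(theta_y) = eta(theta_y) - psi(theta_y).
import numpy as np

def loss_softmax_form(r, theta, y, psi, eta):
    logits = np.array([r * (psi(t) if i == y else eta(t))
                       for i, t in enumerate(theta)])
    return np.log(np.exp(logits).sum()) - logits[y]

def loss_margin_form(r, theta, y, psi, eta):
    delta_y = eta(theta[y]) - psi(theta[y])             # characteristic function
    terms = [np.exp(r * (eta(theta[i]) - eta(theta[y]) + delta_y))
             for i in range(len(theta)) if i != y]
    return np.log1p(np.sum(terms))
```

Since η(θ_i) − η(θ_y) + ∆(θ_y) = η(θ_i) − ψ(θ_y), the two forms agree term by term for any ψ and η.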
We note that there exist scenarios where large angular margin can still be achieved even if ∆(θ_y) is smaller than zero in some range of θ_y ∈ [0, π]. For example, ∆(θ_y) for ArcFace can be smaller than zero when θ_y is close to π. ArcFace can still introduce angular margin because the case where θ_y is close to π hardly happens with real data distributions, as verified by [5]. Nonetheless, the characteristic function for ArcFace still approximately satisfies our principle, since it is larger than zero for most θ_y ∈ [0, π]. Therefore, as long as the characteristic function ∆(θ_y) is larger than zero for the angles where θ_y is densely distributed in practice (i.e., E_{θ_y}[∆(θ_y)] > 0), it will typically suffice to produce an effective angular margin. Our principle in fact serves as a sufficient condition for introducing angular margin. It is generally better to use our principle as the guideline for designing new angular margin losses, because the empirical distribution of the target angle θ_y can vary under different circumstances (e.g., network architectures, datasets, optimizers).
We now discuss in depth why ∆(θ_y) > 0 is able to introduce large angular margin. For ease of illustration, we consider the binary case where the first class is the target class. In this case, we only need to discuss η(θ_2) − η(θ_1) + ∆(θ_1). If ∆(θ_1) = 0, the decision boundary for the first class is η(θ_2) − η(θ_1) = 0, which is equivalent to θ_1 = θ_2. When θ_1 < θ_2, the sample x will be classified to the first class. If ∆(θ_1) > 0 and η(·) is monotonically decreasing, then the decision boundary for the first class becomes η(θ_2) − η(θ_1) + ∆(θ_1) = 0, which is equivalent to θ_1 + m(θ_1) = θ_2, where m(·) denotes some positive function (the specific form of m(·) is determined by η(·) and ∆(·), but it stays positive as long as ∆(·) is always positive). Therefore, we now need θ_1 + m(θ_1) < θ_2 in order to classify x to the first class, and the decision boundary for the first class becomes more stringent than in the previous case. The neural network has to learn a smaller θ_1 in order to correctly classify x, and a smaller θ_1 implies a more compact representation for the first class. The same reasoning also applies to the case where x belongs to the second class (i.e., the second class is the target class). As a result, if we can successfully train a neural network to correctly classify training samples under these more stringent classification criteria, ∆(θ_1) > 0 can effectively produce a large angular margin for the learned deep features. Importantly, current popular hyperspherical FR methods can be viewed as special cases under this unified framework, as shown in Table 1 (first four rows). To intuitively understand different variants of angular margin, we also compare their characteristic functions ∆(θ) in Fig. 1(a). One can observe that different hyperspherical FR methods yield different large-margin characteristic functions.
Each characteristic function determines how a hyperspherical FR method performs and therefore it is of great significance to design a suitable characteristic function. Specifically, the characteristic function ∆(θ) clearly reveals the induced angular margin for samples with different recognition hardness (larger θ y typically implies a harder sample). Instead of a static characteristic function, designing a dynamic characteristic function could be beneficial [37], [40]. It is also possible to learn the characteristic function in a data-driven and automatic fashion, as explored in [70], [71].
Besides the characteristic function ∆(·), the feature magnitude ‖x‖ in Eq. (5) also plays a non-negligible role in learning large angular margin. The original SphereFace approach preserves the feature magnitude in training, since the feature magnitude does not affect the angular decision boundary. [3], [4], [5], [9], [29] show that normalizing the feature magnitude to a constant s (i.e., making x ← s·x/‖x‖ in Eq. (5)) can stabilize training and also improve hyperspherical discriminativeness.

[Fig. 2: (a) how the loss value varies with the feature magnitude, for the binary case where θ_1 = π/3 (y = 1) and θ_2 = π/2, with both η(·) and ψ(·) the cosine function; (b) how the loss curve of Eq. (6) varies under different feature magnitude s, for the binary case where y = 1 and θ_2 = π/2.]

By normalizing the feature magnitude to a prescribed positive constant s, Eq. (5) becomes

$$\mathcal{L}=\log\Big(1+\sum_{i\neq y}\exp\big(s\big(\eta(\theta_i)-\eta(\theta_y)+\Delta(\theta_y)\big)\big)\Big),\qquad(6)$$

where s is a universal value instead of the original instance-dependent ‖x‖. There are two advantages of feature normalization. First, it can effectively avoid potential bad local minima. Second, it can help the loss function to better balance easy and hard training samples. For the first aspect, we consider Eq. (6) in a simple binary classification scenario (class 1 is the ground truth label for the deep feature x, i.e., y = 1). The loss value can easily go to zero once the deep feature x lies in the correct decision region, as demonstrated in the following equation:

$$\lim_{s\to\infty}\mathcal{L}=\begin{cases}0,&\eta(\theta_1)-\eta(\theta_2)>\Delta(\theta_1),\\ \log 2,&\eta(\theta_1)-\eta(\theta_2)=\Delta(\theta_1),\\ +\infty,&\eta(\theta_1)-\eta(\theta_2)<\Delta(\theta_1),\end{cases}\qquad(7)$$

where η(θ_1) − η(θ_2) > ∆(θ_1) means that x can be correctly classified, η(θ_1) − η(θ_2) = ∆(θ_1) means that x lies exactly on the decision boundary, and η(θ_1) − η(θ_2) < ∆(θ_1) means that x cannot be correctly classified. The results imply that when x can be correctly classified, a trivial solution to reduce the loss to zero is to simply increase s. However, increasing s does not help the neural network learn angularly discriminative face embeddings and results in bad local minima.
Because ‖x‖ can be viewed as an instance-dependent learnable s, a neural network without feature normalization is likely to simply increase ‖x‖ after θ_1 passes the decision boundary. Therefore, using a constant s can prevent this trivial way of reducing the loss value and eliminate these bad local minima. We also plot how the loss value changes as s increases in Fig. 2(a). The same argument easily generalizes to the multi-class scenario:

$$\lim_{s\to\infty}\mathcal{L}=\begin{cases}0,&\text{if }\eta(\theta_y)-\eta(\theta_i)>\Delta(\theta_y),\ \forall i\neq y,\\ +\infty,&\text{if }\eta(\theta_y)-\eta(\theta_i)<\Delta(\theta_y)\text{ for some }i\neq y,\end{cases}\qquad(8)$$

where s has a large influence on the loss value. For the multi-class scenario, the neural network can trivially increase s to minimize the loss once η(θ_y) − η(θ_i) > ∆(θ_y) for all i ≠ y. Interestingly, this also explains why the standard softmax loss cannot learn deep features with large angular margin. Empirically, the standard softmax loss tends to increase s instead of minimizing the target angle once the deep feature x falls into the correct decision region, leading to separable features rather than large-margin features. Large-margin losses take advantage of this phenomenon and make the decision boundary asymmetric for different classes (i.e., η(θ) ≠ ψ(θ)). Then the classification of x (i.e., forcing ψ(θ_y) > η(θ_i), ∀i ≠ y) naturally becomes equivalent to learning large-margin deep features.
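The bad-local-minimum argument can be reproduced in a few lines; the sketch below (ours) uses the binary form of Eq. (6) with η = cos and an illustrative constant characteristic function ∆ = 0.2, which are our example values, not the paper's.

```python
# Binary form of Eq. (6) with eta = cos and an illustrative constant
# characteristic function Delta = 0.2 (our example values).
import numpy as np

def binary_loss(s, theta1, theta2, delta=0.2):
    # L = log(1 + exp(s * (cos(theta2) - cos(theta1) + delta)))
    return np.log1p(np.exp(s * (np.cos(theta2) - np.cos(theta1) + delta)))
```

For a sample that already satisfies the margin (e.g., θ_1 = 0.4, θ_2 = 1.2), the loss decays to zero as s grows; without feature normalization, the network can realize the same effect by inflating ‖x‖ instead of shrinking θ_1, which is exactly the trivial solution a constant s rules out.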
For the second aspect, we use an example to demonstrate how the feature magnitude s can balance the easy and hard samples. We compare the loss function under different s in Fig. 2(b). By adjusting s, the loss function in Eq. (6) has different sensitivity for samples with different target angle θ y . Intuitively, samples with large target angle are considered to be hard, while samples with small target angle are viewed as easy. Therefore, feature magnitude s can also balance the loss value for easy and hard samples, which serves a role similar to hard sample mining [21] in deep metric learning. From Fig. 2(b), one can observe that larger s puts more focus on the hard samples, since the loss ratio between hard and easy samples increases. Finding a good s essentially can be viewed as searching for a suitable balance between easy and hard samples in hyperspherical FR methods, e.g., [3], [4], [5], [9].
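The balancing effect is also easy to verify numerically; in the sketch below (ours), the "hard" and "easy" target angles, ∆ = 0, and θ_2 = π/2 are illustrative choices.

```python
# How s re-weights hard vs. easy samples (our illustrative sketch): the
# binary loss of Eq. (6) with eta = cos, no margin, and theta_2 = pi/2.
import numpy as np

def binary_loss_no_margin(s, theta_y):
    return np.log1p(np.exp(s * (np.cos(np.pi / 2) - np.cos(theta_y))))

def hard_to_easy_ratio(s, theta_hard=1.4, theta_easy=0.3):
    # Ratio of the loss on a hard sample (large target angle) to that on an
    # easy sample (small target angle); it grows as s increases.
    return binary_loss_no_margin(s, theta_hard) / binary_loss_no_margin(s, theta_easy)
```

A larger s makes the hard sample dominate the loss, mirroring the hard-sample-mining interpretation above.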
To summarize, Eq. (6) essentially throws away the information encoded in the feature magnitude ‖x‖. Despite the two major advantages that ease the training of hyperspherical FR methods, it remains an open problem whether it is beneficial to incorporate the feature magnitude into training. Feature magnitude is closely related to image quality and semantic ambiguity [8], [72], and such information intuitively seems useful for distinguishing different faces. However, training hyperspherical FR methods without feature normalization generally yields inferior training stability and generalization performance in practice. In order to explore whether the feature magnitude is indeed helpful, we consider constraining the feature magnitude via a soft regularization in Section 4.2. This serves as an interpolation between no feature normalization and hard feature normalization, and can take the feature magnitude into account during training.
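The concrete SFN objective is deferred to Section 4.2; as a rough illustration only, a soft magnitude constraint could take the form of a quadratic penalty. The penalty shape, target magnitude s, and coefficient `lam` below are our assumptions, not the paper's formulation.

```python
# Hypothetical soft-feature-normalization (SFN) regularizer (our assumption,
# not the paper's definition): a quadratic pull of each feature magnitude
# ||x|| toward a target s, added to the recognition loss.
import numpy as np

def sfn_penalty(features, s=30.0, lam=1e-3):
    mags = np.linalg.norm(features, axis=1)         # per-sample ||x||
    return lam * np.mean((mags - s) ** 2)           # soft pull toward s
```

Under this reading, lam → ∞ effectively enforces ‖x‖ = s (HFN), while lam = 0 leaves the magnitude unconstrained (NFN), so SFN interpolates between the two limiting cases.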

SPHEREFACE-R: BETTER AND MORE STABLE
In this section, we elaborate the design of SphereFace-R and introduce two novel variants that perform well in practice. Sharing the same geometric interpretation as SphereFace, SphereFace-R yields improved training stability and superior open-set generalizability. We start by revisiting the design of the original SphereFace and then propose alternative ways to implement the multiplicative margin in Section 4.1. In Section 4.2, we discuss different feature normalization schemes. Finally, we list a few important open problems for hyperspherical FR in Section 4.3.
The original SphereFace requires m to be an integer, which is in fact unnecessary; m can be any positive value larger than 1. In order to improve training stability, our original SphereFace minimizes its objective function (with m = 4) jointly with a standard softmax loss, which approximately yields an effective m of 1.4. Therefore, the target angular function in the original SphereFace can be simplified to

$$\psi(\theta)=(-1)^k\cos(m\theta)-2k,\quad\theta\in\Big[\frac{k\pi}{m},\frac{(k+1)\pi}{m}\Big],\qquad(10)$$

where k = ⌊mθ/π⌋ is a non-negative integer and we usually use m ∈ [1, 2]. One can easily verify that the characteristic function ∆(θ) is always larger than zero for θ ∈ (0, π], so it satisfies the general principle to introduce large angular margin. Fig. 3 intuitively compares no angular margin, additive angular margin and multiplicative angular margin. The intuition of the multiplicative margin can be understood from a simple binary classification example (with two classifiers W_1 and W_2). We consider Eq. (3) with η(θ) = cos(θ) and ψ(θ) = cos(θ). For a sample x, we need cos(θ_1) > cos(θ_2) to correctly classify x. But what if we instead require cos(mθ_1) > cos(θ_2), where m > 1, in order to correctly classify x? This essentially makes the decision more stringent than before, because we require a lower bound of cos(θ_1) to be larger than cos(θ_2). The decision boundary for class 1 is cos(mθ_1) = cos(θ_2). Similarly, if we require cos(mθ_2) > cos(θ_1) to correctly classify samples from class 2, the decision boundary for class 2 is cos(mθ_2) = cos(θ_1). Suppose all training samples are correctly classified; such asymmetric decision boundaries will naturally produce an angular margin of size (m−1)/(m+1)·θ_{12}, where θ_{12} denotes the angle between W_1 and W_2. From the angular perspective, correctly classifying x from identity 1 requires θ_1 < θ_2/m, while correctly classifying x from identity 2 requires θ_2 < θ_1/m. If m > 1, both decision criteria are more difficult to achieve than in the vanilla case without any angular margin (i.e., θ_1 < θ_2 and θ_2 < θ_1, respectively).
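The piecewise target function of the original SphereFace (Eq. (10)) can be implemented directly; the sketch below (ours) computes k = ⌊mθ/π⌋ and checks the key property that ψ(θ) is monotonically decreasing on [0, π].

```python
# The original SphereFace target function (Eq. (10)):
# psi(theta) = (-1)^k * cos(m*theta) - 2k, with k = floor(m*theta / pi),
# which stitches cos(m*theta) into a monotonically decreasing curve on [0, pi].
import numpy as np

def psi_sphereface(theta, m=1.4):
    theta = np.asarray(theta, dtype=float)
    k = np.floor(m * theta / np.pi)                 # piece index
    sign = np.where(k % 2 == 0, 1.0, -1.0)          # (-1)^k
    return sign * np.cos(m * theta) - 2.0 * k
```

On [0, π/m] this coincides with cos(mθ); the −2k shifts glue the subsequent pieces continuously so that monotonicity is preserved.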
We can observe that Eq. (10) exactly matches the intuition of multiplicative margin only when θ ∈ [0, π/m]. When θ ∈ (π/m, π], the same argument no longer holds. Although such a heuristic design can still empirically achieve large angular margin and work reasonably well, it is inevitably less interpretable and also contributes to the training instability. The key to multiplicative angular margin is to guarantee that the equation ψ(θ) = η(mθ) (m > 1) always holds for θ ∈ [0, π]. In order to better implement the intuition of multiplicative margin, we propose two different approaches, i.e., designing either a new target angular function ψ(θ) or a new non-target angular function η(θ). In Section 4.1.1, we first follow the original idea of SphereFace and re-design a target angular function ψ(θ) that better reflects the intuition of multiplicative margin. In Section 4.1.2, we take a different approach by designing a new η(θ) which has a much simpler form yet exactly matches the intuition of multiplicative margin for θ ∈ [0, π]. Section 4.1.3 proposes a useful implementation method to further stabilize training. Section 4.1.4 gives implications and discussions.

SphereFace-R v1: On Designing ψ(θ)
Following the conventional way to design an angular margin loss [2], [3], [4], [5], we first focus on constructing a target angular function ψ(θ) based on the intuition of multiplicative margin. For θ ∈ [0, π/m], we can simply use ψ(θ) = cos(mθ), which is monotonically decreasing in [0, π/m] and exactly implements the multiplicative angular margin. When θ > π/m, SphereFace constructs a surrogate monotonically decreasing function to replace cos(mθ), as specified in Eq. (10). However, this design of ψ(θ) in [π/m, π] does not follow the original intuition of multiplicative margin and may be sub-optimal. In order to better implement the multiplicative margin on the entire domain [0, π], we propose the following target angular function:

$$\psi(\theta)=\cos\Big(\min\Big\{m,\frac{\pi}{\theta}\Big\}\cdot\theta\Big),\qquad(11)$$

where m is usually a prescribed positive constant. Eq. (11) remains a monotonic function in [0, π] and can be viewed as incorporating large angular margin with a dynamic multiplicative margin parameter. For θ ∈ [0, π/m], Eq. (11) is exactly the same as SphereFace and perfectly implements the multiplicative margin. For θ ∈ [π/m, π], we consider a new multiplicative margin parameter m′, and the target angular function becomes ψ(θ) = cos(m′θ). In order to (i) make m′ as large as possible and (ii) make ψ(θ) a monotonically decreasing function where m′θ does not exceed π, we propose an adaptive decreasing strategy for m′: m′ = π/θ. Combining the pieces, we end up with a dynamic multiplicative margin parameter m′ = min{m, π/θ}. The non-target angular function is the same as SphereFace, i.e., η(θ) = cos(θ). Therefore, the multiplicative margin is implemented through ψ(θ) = η(m′θ). The curve of the corresponding characteristic function ∆(·) is given in Fig. 1(b). More interestingly, we can observe that SphereFace-R v1 incorporates less angular margin for samples that are too easy or too hard (i.e., the target angle is around 0 or π) and the largest angular margin for samples of medium hardness. We also compare SphereFace-R v1 with the other hyperspherical FR methods in Table 1.
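The v1 target function (Eq. (11)) is a one-liner once the dynamic margin parameter is spelled out; the sketch below (ours) clips π/θ at a small positive θ only to avoid division by zero.

```python
# SphereFace-R v1 target function (Eq. (11)): a dynamic multiplicative margin
# m' = min(m, pi/theta), so psi matches cos(m*theta) on [0, pi/m] and
# saturates at cos(pi) = -1 afterwards.
import numpy as np

def psi_v1(theta, m=1.4):
    theta = np.asarray(theta, dtype=float)
    m_eff = np.minimum(m, np.pi / np.maximum(theta, 1e-12))  # dynamic m'
    return np.cos(m_eff * theta)
```

Note how the saturation at −1 makes the induced margin ∆(θ) = cos(θ) − ψ(θ) vanish as θ → π, matching the observation that v1 assigns little margin to the hardest samples.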
Despite the well-implemented multiplicative margin in Eq. (11), there is still a constraint on the effective multiplicative margin parameter m′, i.e., m′ ≤ π/θ. Moreover, Eq. (11) does not yield a consistent m′ for samples with different target angles. This means that, for an arbitrary sample whose target angle lies within [0, π], SphereFace-R v1 cannot guarantee the same m′. To address this limitation, we propose to implement the multiplicative margin from the perspective of the non-target angular function rather than the target angular function, leading to SphereFace-R v2.

SphereFace-R v2: On Designing η(θ)
We consider how to design the non-target angular function η(θ) based on the intuition of multiplicative margin. To achieve ψ(θ) = η(mθ) without changing the target angular function ψ(θ), we naturally arrive at the following desired non-target angular function η(θ):

$$\eta(\theta)=\cos\Big(\frac{\theta}{m}\Big),\qquad(12)$$

where m is a prescribed positive constant and ψ(θ) = cos(θ). Compared to Eq. (11) in SphereFace-R v1, Eq. (12) is much simpler and, more importantly, satisfies the property ψ(θ) = η(mθ) for θ ∈ [0, π] with a static m. While being extremely simple and conceptually appealing, SphereFace-R v2 can exactly incorporate a static multiplicative angular margin. In contrast, SphereFace-R v1 is unable to induce a static multiplicative margin with m > 1 and can only incorporate a dynamic multiplicative margin whose effective margin parameter has to be close to 1 when θ is near π. More importantly, unlike SphereFace-R v1, there is no constraint on the size of the induced angular margin in SphereFace-R v2 and we can use any desirable m ≥ 1.
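The two defining properties of the v2 design — ψ(θ) = η(mθ) exactly, and ∆(θ) > 0 on (0, π] — can be checked directly; the sketch below is ours.

```python
# SphereFace-R v2 non-target function (Eq. (12)): eta(theta) = cos(theta/m),
# which satisfies psi(theta) = eta(m*theta) exactly for all theta in [0, pi],
# with psi(theta) = cos(theta) left unchanged.
import numpy as np

def eta_v2(theta, m=1.4):
    return np.cos(np.asarray(theta, dtype=float) / m)
```

Since θ/m < θ and cosine is strictly decreasing on (0, π], the characteristic function ∆(θ) = cos(θ/m) − cos(θ) is strictly positive there, so the general margin principle holds with a single static m.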
From the corresponding characteristic function given in Fig. 1(b), we can see that SphereFace-R v2 assigns the smallest angular margin to the easiest samples and the largest angular margin to samples of medium hardness. Unlike SphereFace-R v1, which introduces a very small angular margin for hard samples, SphereFace-R v2 assigns a much larger angular margin to these samples. Therefore, SphereFace-R v1 and SphereFace-R v2 devote different effort to optimizing hard samples and may yield different generalizability.
To the best of our knowledge, SphereFace-R v2 is the very first method that introduces large angular margin through non-target angular functions. SphereFace-R v2 easily addresses the difficult problem of incorporating a static multiplicative margin by simply switching the design focus from target function to non-target function. We believe that this method provides an important and novel perspective on designing large angular margin losses.

Characteristic Gradient Detachment
In order to further stabilize training and improve performance, we introduce a simple and generic method -- characteristic gradient detachment (CGD) -- for implementing our multiplicative margin. In general, the shape of the characteristic function ∆(θ) determines the training stability and the convergence property. Empirically, we find that a characteristic function with simpler backward gradient computation typically leads to better training stability. For example, CosFace [3], [4] yields strong empirical training stability and its characteristic function is simply a positive constant whose backward gradient is 0. Inspired by this observation, we aim to simplify the backward gradient computation for the characteristic function. To gain more intuition, we first use a Taylor expansion to decompose the characteristic function in the target angular function at an arbitrary angle θ_0 ∈ (0, π) with a small angle deviation δθ:

$$\Delta(\theta_0+\delta\theta)=\sum_{k=0}^{n}\frac{\Delta^{(k)}(\theta_0)}{k!}(\delta\theta)^k+R_n(\delta\theta),\qquad(13)$$

where ∆^{(n)}(θ) denotes the n-th order derivative of ∆(θ) and R_n(δθ) denotes a higher-order infinitesimal of (δθ)^n. When the characteristic function is more complex, its Taylor expansion needs more terms to represent it accurately, which leads to more complex backward gradient computation. Motivated by the observation that simpler gradient computation often leads to better training stability, we propose to approximate the characteristic function by removing some of the higher-order terms in its Taylor expansion. Generally, removing different higher-order Taylor expansion terms yields different backward gradients. In particular, we draw inspiration from the constant characteristic function adopted in CosFace, and use the zero-order approximation for ∆(θ_0 + δθ) in Eq. (13):

$$\Delta(\theta_0+\delta\theta)\approx\Delta(\theta_0),\qquad(14)$$

which is much simpler and more robust to compute, and gives the following approximate gradient for ψ(θ) = η(θ) − ∆(θ) at θ_0:

$$\frac{\partial\psi(\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\approx\frac{\partial\eta(\theta)}{\partial\theta}\Big|_{\theta=\theta_0},\qquad(15)$$

which naturally leads to the proposed CGD, where we simply apply gradient detachment to the characteristic function ∆(θ).
Specifically, we stop the gradient of the characteristic function with a detachment operator:

$$\psi(\theta)=\eta(\theta)-\mathrm{Detach}(\Delta(\theta)),\qquad(16)$$

where Detach(·) denotes the detachment operator that allows forward computation but stops the backward gradient propagation. This essentially means that we only need to compute the characteristic function in the forward pass and completely ignore it in the backward propagation. In order to avoid computing the gradient of the characteristic function, we substitute Eq. (16) into Eq. (5) and finally obtain the following loss function:

$$\mathcal{L}=\log\Big(1+\sum_{i\neq y}\exp\big(\|x\|\big(\eta(\theta_i)-\eta(\theta_y)+\mathrm{Detach}(\Delta(\theta_y))\big)\big)\Big),$$

which can be generally used for the cases where the target angular function is modified, such as SphereFace and SphereFace-R v1. In the backward pass, CGD approximates the characteristic function of the multiplicative margin with a piecewise function consisting of many constant functions, as illustrated in Fig. 4. Equivalently, CGD can also be viewed as a step-function approximation to the characteristic function in the backward pass. We note that the approximation in CGD only exists in the backward direction; the forward computation is always identical to the CGD-free scenario. From a different perspective, CGD can be understood as interpreting the multiplicative margin as a dynamic CosFace-style additive margin (i.e., the effective additive margin parameter in CGD depends dynamically on the input target angle rather than being static as in CosFace). The discussion above applies to both SphereFace and SphereFace-R v1, since they modify the target angular function. As a concrete example, applying CGD to SphereFace-R v1 yields the target angular function ψ(θ) = cos(θ) − Detach(cos(θ) − cos(min{m, π/θ} · θ)), whose gradient is identical to that of CosFace.
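Detach(·) maps directly onto an autograd stop-gradient operator; below is a PyTorch sketch (ours, with `.detach()` playing the role of Detach(·)) of CGD applied to the SphereFace-R v1 target function.

```python
# CGD for SphereFace-R v1 (our PyTorch sketch): the forward pass computes
# cos(min(m, pi/theta) * theta), but the backward gradient is that of
# cos(theta) alone, because the characteristic function is detached.
import math
import torch

def psi_v1_cgd(theta, m=1.4):
    m_eff = torch.clamp(math.pi / theta, max=m)                     # dynamic m'
    delta = (torch.cos(theta) - torch.cos(m_eff * theta)).detach()  # Detach(Delta)
    return torch.cos(theta) - delta
```

The forward value equals the plain v1 target, while the gradient is d cos(θ)/dθ = −sin(θ), i.e., exactly the CosFace-style dynamic additive margin described above.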
For SphereFace-R v2, which modifies the non-target angular function, the derivation is similar except that we approximate the gradient of the non-target function η(θ) instead of the target function ψ(θ). We can therefore apply gradient detachment to the characteristic function in the non-target angular function:

η(θ) = cos(θ) + Detach(∆(θ)).   (17)

After putting Eq. (17) into the first line of Eq. (5), we end up with the following general loss function for the cases that modify the non-target function:

L = −log( exp(s·cos(θ_y)) / ( exp(s·cos(θ_y)) + Σ_{i≠y} exp(s·η(θ_i)) ) ),  with η(θ) = cos(θ) + Detach(∆(θ)),

from which we can see that the key is to detach the gradients of the characteristic function. Applying CGD to SphereFace-R v2 thus yields the non-target function

η(θ) = cos(θ) + Detach( cos(θ/m) − cos(θ) ).

The essence of CGD is to avoid computing the gradient of the characteristic function in the backward pass. As a simple generalization, one could use a higher-order Taylor approximation to the characteristic function instead of the zero-order one. Since CGD already yields satisfactory training stability and empirical performance, we stick to it for simplicity.
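Analogously, a minimal sketch for the non-target side (illustrative margin m = 1.4): the forward value of the CGD non-target function equals cos(θ/m), while its backward gradient is that of cos(θ):

```python
import math

def eta_v2(theta, m=1.4):
    # SphereFace-R v2 non-target angular function: cos(theta / m)
    return math.cos(theta / m)

def eta_cgd_forward(theta, m=1.4):
    # Forward pass: cos(theta) + Detach(cos(theta/m) - cos(theta))
    delta = math.cos(theta / m) - math.cos(theta)
    return math.cos(theta) + delta

def eta_cgd_grad(theta):
    # Backward pass: the detached characteristic function contributes
    # no gradient, leaving d eta / d theta = -sin(theta)
    return -math.sin(theta)
```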
In fact, CGD serves as a generally useful tool for implementing new types of angular margin and is not limited to SphereFace and SphereFace-R. For the backward propagation, CGD can approximate the characteristic function induced by any angular margin with a dynamic additive margin, and effectively stabilize the training.

Implications and Discussions
Comparison between additive and multiplicative margin.
While Table 1 provides a detailed comparison between additive and multiplicative margin, the fundamental difference between them is conceptual. Additive margin is introduced by adding or subtracting a parameter in the target function so that the characteristic function ∆(θ) is larger than zero in most cases. Specifically, this parameter can be either inside [5] or outside [3], [4] the cosine function. In contrast, multiplicative margin is achieved by multiplying a parameter into the target or non-target function so that ∆(θ) is larger than zero. It is also possible for a multiplicative margin loss and an additive margin loss to lead to the same characteristic function, in which case they may be technically the same loss. Their difference is therefore determined by the specific intuition that guides the loss design.
Generality of multiplicative margin.
SphereFace-R v1 and v2 demonstrate two different strategies to incorporate multiplicative angular margin, showing that many feasible designs can achieve multiplicative margin. In fact, the exact form of the loss function is not crucial; the core of multiplicative margin lies in the spirit of multiplying a factor to ensure that the characteristic function is larger than zero. Following this spirit, there are likely many potential loss designs that work as well as ours.
Comparison between SphereFace and SphereFace-R.
It is easy to see that SphereFace employs a surrogate characteristic function to implement the multiplicative margin and does not follow the intuition of multiplicative margin for θ ∈ [π/m, π] where m > 1. In contrast, both SphereFace-R v1 and v2 exactly follow the intuition of multiplicative margin on the entire domain [0, π].
SphereFace-R v1 implements a dynamic multiplicative margin (i.e., the effective margin parameter varies depending on the training sample), while SphereFace-R v2 implements a static one (i.e., the effective margin parameter stays the same for all training samples). Moreover, SphereFace and SphereFace-R use different effective margin parameters for samples of different hardness. Both SphereFace and SphereFace-R strictly satisfy the general principle in Eq. (4) for achieving large angular margin, validating the effectiveness of our proposed principle.
Jointly designing target and non-target functions.
Because SphereFace-R v1 focuses on the target angular function and SphereFace-R v2 focuses on the non-target angular function, it is natural to consider designing the target and non-target angular functions simultaneously. For example, Eq. (11) and Eq. (12) can easily be used together, and the resulting characteristic function is simply the combination of both. More interestingly, it is not necessary for both the target and non-target functions to use a cosine-based design. We can simply use linear functions as the target and non-target functions, as proposed in [31]. This can effectively alleviate some design constraints caused by the periodicity of the cosine function. Jointly designing the target and non-target functions greatly enlarges the search space of the characteristic function and may lead to a better multiplicative margin loss.
Beyond additive and multiplicative margin.
There are many more alternative types of angular margin beyond the additive and multiplicative ones. For example, we can also use the exponential function to achieve ∆(θ) > 0. Specifically, we use η(θ) = cos(2θ) and ψ(θ) = cos^m(2θ), where m > 1 is the margin parameter and larger m gives larger angular margin. Alternatively, we can also combine additive and multiplicative margin as η(θ) = cos(θ), ψ(θ) = cos(m_1·θ + m_2) − m_3.
It remains an open problem to design a simple yet well-performing angular margin.
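To make the margin principle concrete, the sketch below checks numerically that these designs keep the characteristic function ∆(θ) positive on (0, π); the ∆(θ) forms follow the definitions used in this paper, while the margin values are illustrative assumptions:

```python
import math

# Characteristic functions Delta(theta) induced by different margin designs
# (margin values are illustrative).
def delta_cosface(theta, m=0.35):
    return m                                      # additive: psi = cos(theta) - m

def delta_arcface(theta, m=0.5):
    # additive (inside the cosine): psi = cos(theta + m);
    # Delta = 2 sin(m/2) sin(theta + m/2), positive only while theta < pi - m/2
    return math.cos(theta) - math.cos(theta + m)

def delta_v1(theta, m=1.5):
    # multiplicative margin on the target function (SphereFace-R v1)
    return math.cos(theta) - math.cos(min(m, math.pi / theta) * theta)

def delta_v2(theta, m=1.5):
    # multiplicative margin on the non-target function (SphereFace-R v2)
    return math.cos(theta / m) - math.cos(theta)

grid = [i * math.pi / 1000 for i in range(1, 1000)]  # theta in (0, pi)
```

Sweeping the grid shows that the multiplicative designs stay positive on the whole domain, whereas the inside-the-cosine additive margin is positive only "in most cases" (it turns negative near π).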

Feature Magnitude
SphereFace [2] originally does not use feature normalization, because the feature magnitude does not affect the angular decision boundary. [3], [4], [5], [9] show that feature normalization can ease the difficulty of minimizing angular margin losses and greatly improve training stability. Despite being effective for stabilizing training, feature normalization inevitably loses useful information about individual samples (e.g., image quality). Existing hyperspherical FR methods either preserve the feature magnitude in the loss function [1], [2], [32] or normalize the feature magnitude to a constant s [9], [30]. To explore whether feature magnitude can be beneficial to generalization, we systematically study SphereFace and SphereFace-R under NFN and HFN. Moreover, we consider a soft feature normalization method that effectively unifies NFN and HFN and serves as an interpolation between the two.
Hard feature normalization.
HFN has become a default component in current hyperspherical FR methods [3], [4], [5], [26], [40]. By normalizing the feature x to a constant s, the objective function value depends only on the angles between x and the classifiers W_i, ∀i. In order to perform such a hard normalization on the feature x, we reparameterize the original x in Eq. (5) as s·x/‖x‖ and arrive at Eq. (6). Since s is a prescribed constant, this is equivalent to normalizing all the features onto a hypersphere of radius s.
Soft feature normalization as an interpolation.
We consider a soft feature normalization (SFN) that interpolates between FN-free learning and HFN. Specifically, besides the original loss L, we add a regularization term to constrain the feature magnitude:

L_SFN = L + t · ( ‖x‖_2 − s )^2,

where t is a hyperparameter that controls the regularization strength and s is a prescribed feature magnitude that serves a role similar to that in HFN. When t = 0, SFN reduces to FN-free learning. When t = +∞, SFN reduces to HFN. Therefore, SFN can be viewed as an interpolation between FN-free learning and HFN.
SFN has also been studied in [10].
SFN can make use of the instance-level information encoded in the feature magnitude during training while still encouraging a feature normalization effect. Moreover, the difference between SFN and HFN can be viewed as using different optimization techniques to constrain the feature norm to a prescribed constant. HFN has the flavor of projected gradient descent, where the solution is projected onto the feasible region to satisfy the constraint. In contrast, SFN is essentially a Lagrangian relaxation of the original problem in which the feature norm is constrained. Therefore, their empirical performance can be quite different in practice, even though they share the same optimization target.
Dynamic feature magnitude.
In contrast to HFN, which uses a static feature magnitude, both FN-free learning and SFN can be viewed as dynamic (data-dependent) ways to control the feature magnitude. Moreover, many other strategies can dynamically control the feature magnitude to improve empirical performance, such as [36].
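As an illustration of the interpolation behavior, here is a minimal sketch of an SFN-style objective. It assumes a squared penalty t·(‖x‖_2 − s)^2 added to the base loss, a plausible form consistent with the norm regularization studied in [10]; the exact regularizer may differ:

```python
import math

def sfn_loss(base_loss, x, s=30.0, t=0.1):
    # base_loss: the angular margin loss value for this sample (a float here).
    # The penalty pulls the feature norm ||x||_2 toward the prescribed magnitude s.
    norm = math.sqrt(sum(v * v for v in x))
    return base_loss + t * (norm - s) ** 2

# t = 0 recovers FN-free learning: the feature norm is unconstrained.
# As t grows, any deviation of ||x|| from s dominates the objective,
# approaching the hard constraint ||x|| = s enforced by HFN.
```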

Discussions and Open Problems
Optimal design of characteristic function.
It is clear that the characteristic function is the key to large angular margin, but is there an optimal characteristic function? The answer remains open. We argue that the optimal design of the characteristic function should be dynamic, depending on the specific dataset, the network architecture, the optimizer, the stage of training (i.e., the weights of the network), etc. Current studies on hyperspherical FR still focus on static characteristic functions. [70], [71] explore automatic ways to learn a characteristic function from data, but the learned characteristic functions are still static and do not lead to a significant performance gain. [45] incorporates sample quality into hyperspherical FR through a customized characteristic function that depends on the feature magnitude. How to design or learn a better characteristic function that dynamically depends on the data and is also easy to optimize remains a huge challenge. Moreover, the underlying mechanism that determines the performance of a characteristic function remains a mystery and needs to be understood both empirically and theoretically.
Making better use of feature magnitude.
In this paper, we have not considered incorporating the feature magnitude into testing and still stick to the cosine similarity for comparing pairs. However, it remains an interesting open problem whether it would be more beneficial to bring the feature magnitude back into the similarity score (especially when FN-free learning or SFN is used). We consider a generalized similarity score

S(x_1, x_2) = g( ‖x_1‖, ‖x_2‖ ) · cos(θ_{1,2}),

where x_1, x_2 are the deep features of two input samples, θ_{1,2} is the angle between x_1 and x_2, and g(‖x_1‖, ‖x_2‖) denotes a function that takes the norms of x_1 and x_2 as input.
We may require the function g(‖x_1‖, ‖x_2‖) to have a few properties: (i) permutation invariance: g(‖x_1‖, ‖x_2‖) = g(‖x_2‖, ‖x_1‖); and (ii) adjustable magnitude augmentation, where a hyperparameter t adjusts the augmentation strength of the feature magnitude. g(‖x_1‖, ‖x_2‖) = 1 reduces to the cosine similarity score. In general, how to design a good g is not clear and remains to be explored in future endeavours.
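A sketch of the generalized score with a hypothetical magnitude-augmentation function g(n_1, n_2) = (n_1·n_2/s^2)^t; this g is not from the paper and is chosen only because it is permutation-invariant and reduces to g = 1 when t = 0:

```python
import math

def generalized_similarity(x1, x2, g):
    # S(x1, x2) = g(||x1||, ||x2||) * cos(theta_{1,2})
    n1 = math.sqrt(sum(v * v for v in x1))
    n2 = math.sqrt(sum(v * v for v in x2))
    dot = sum(a * b for a, b in zip(x1, x2))
    return g(n1, n2) * (dot / (n1 * n2))

def make_g(s=30.0, t=0.0):
    # Hypothetical g: permutation-invariant in its two arguments;
    # t adjusts the augmentation strength; t = 0 gives g = 1,
    # i.e., plain cosine similarity.
    return lambda n1, n2: (n1 * n2 / (s * s)) ** t
```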

A UNIFIED CHARACTERIZATION OF LOSS FUNCTIONS IN HYPERSPHERICAL FACE RECOGNITION
In this section, we take a closer look at what characterizes hyperspherical face recognition. As our unified framework in Section 3 discusses, a feature normalization strategy, a (non-)target angular function, and a characteristic function fully determine the loss function of a hyperspherical FR method. In particular, the characteristic function ∆(·) controls the properties of the induced angular margin (e.g., size, training stability). However, it does not consider the feature magnitude and only focuses on the difference between the target and non-target functions. Here we take a step further by showing a unified way to characterize the loss function as a whole. Specifically, we have the following general form of the loss function for hyperspherical FR:

L_s = log( 1 + Σ_{i≠y} exp( s·η(θ_i) − s·ψ(θ_y) ) ),  with s·η(θ_i) − s·ψ(θ_y) := Q(θ_y, θ_i, s, m),   (19)

where we define Q(θ_y, θ_i, s, m) as the loss characteristics that fully determine how the loss function behaves. We compare the loss characteristics of normalized softmax [9], SphereFace, and the two variants of SphereFace-R in Fig. 5.
Although we show that the loss characteristics fully determine the loss function, the underlying mechanisms by which they affect performance are largely unclear and remain to be investigated. Typically, s and m jointly specify the loss characteristics, and their roles can be partially coupled, which we also observe empirically in our ablation study. From a back-propagation perspective, we have the gradient of L_s (w.r.t. either x or W_i, ∀i) as

∇L_s = Σ_{i≠y} ρ_i · Q′(θ_y, θ_i, s, m),   (20)

where Q′(θ_y, θ_i, s, m) denotes the gradient of the loss characteristics and the weighting factor is

ρ_i = exp( Q(θ_y, θ_i, s, m) ) / ( 1 + Σ_{j≠y} exp( Q(θ_y, θ_j, s, m) ) ).

Quite interestingly, if we apply CGD to Eq. (19), then Q′(θ_y, θ_i, s, m) for all hyperspherical FR methods immediately becomes identical to that of the normalized softmax loss [9]. The only critical difference lies in the weighting factor ρ_i in Eq. (20), where Q(θ_y, θ_i, s, m) varies across loss functions in the forward pass. This finding suggests that once CGD is applied, the gradient of every loss function in hyperspherical FR can be viewed as a particular weighting strategy for combining Q′(θ_y, θ_i, s, m) over different i ≠ y. In other words, only the weighting factors ρ_i, ∀i in the gradient ∇L_s = Σ_{i≠y} ρ_i · Q′(θ_y, θ_i, s, m) differ across loss functions. Therefore, the design space for loss functions can be switched from finding Q(θ_y, θ_i, s, m) to finding a weighting strategy for combining the gradients Q′(θ_y, θ_i, s, m), ∀i ≠ y. This gradient weighting perspective reveals that searching for suitable s and m is equivalent to designing a good gradient weighting strategy, which may open a brand new gate to deeper understandings of hyperspherical FR. Moreover, we believe that our novel loss characterization in Eq. (19) and Eq. (20) may inspire more effective designs for loss functions in hyperspherical FR.
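The gradient weighting view can be verified numerically: for L = log(1 + Σ_{i≠y} exp(Q_i)), the partial derivative ∂L/∂Q_i is exactly the weighting factor ρ_i. A small sketch with arbitrary illustrative Q values:

```python
import math

def loss_from_Q(Q):
    # L_s = log(1 + sum_i exp(Q_i)) over the non-target classes
    return math.log(1.0 + sum(math.exp(q) for q in Q))

def rho(Q):
    # rho_i = exp(Q_i) / (1 + sum_j exp(Q_j)): the gradient weighting factors
    denom = 1.0 + sum(math.exp(q) for q in Q)
    return [math.exp(q) / denom for q in Q]

def finite_diff(Q, i, eps=1e-6):
    # central difference of L with respect to Q_i
    Qp = list(Q); Qp[i] += eps
    Qm = list(Q); Qm[i] -= eps
    return (loss_from_Q(Qp) - loss_from_Q(Qm)) / (2 * eps)
```

The finite-difference gradient matches ρ_i for every non-target index, and the weights sum to strictly less than 1 because of the constant 1 in the denominator (contributed by the target class).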

EXPERIMENTS AND RESULTS
In this section, we present comprehensive experiments to explore the properties of the SphereFace family. The experimental setup is introduced in Section 6.1. We perform ablation studies in Section 6.2 and Section 6.3 to investigate different variants of SphereFace and their hyperparameters. In Section 6.4, we evaluate our methods on large-scale benchmarks and compare them with other state-of-the-art methods.

Implementation Details
Preprocessing.
Each face image is cropped based on five face landmarks (i.e., left eye, right eye, nasal tip, left mouth corner, and right mouth corner) detected by MTCNN [73] and RetinaFace [74], using a similarity transformation. The size of the cropped image is set to 112 × 112, and each RGB pixel ([0, 255]) is normalized to [−1, 1].
CNNs.
The SphereFace networks (SFNets) initially proposed in [2] are used as the backbone in our experiments. Slightly different from [2], we equip SFNets with batch normalization (BN) [75] to facilitate model optimization. For better comparison with existing methods, we also evaluate our models with IResNet-100 [5], a 100-layer modified ResNet. The affine parameters in the last BN layer are enabled when NFN and SFN are used. We use SFNet-20 and SFNet-64 in the ablation and exploration experiments, while SFNet-64 and IResNet-100 are adopted on the large-scale benchmarks to achieve state-of-the-art performance.
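The pixel normalization step maps each channel value v ∈ [0, 255] to 2v/255 − 1 ∈ [−1, 1]; a one-line sketch:

```python
def normalize_pixels(pixels):
    # Map RGB channel values from [0, 255] to [-1, 1], as in our preprocessing
    return [2.0 * v / 255.0 - 1.0 for v in pixels]
```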
Training.
The training images are horizontally flipped for data augmentation. We train all the models on two popular training datasets: VGGFace2 [76] (3.1M images from 8.6K IDs) and MS-Celeb-1M (5.8M images from 86K IDs; also called MS1M-V2, the cleaned version of MS-Celeb-1M used in [5]). Detailed statistics of these datasets are given in Table 2. In our experiments, all models are optimized using stochastic gradient descent with momentum 0.9. For VGGFace2, we train on 2 GPUs for 80k iterations, with a learning rate of 0.1 that is divided by 10 at the 40k and 60k iterations. For MS-Celeb-1M, we train on 4 GPUs for 240k iterations; the learning rate is initialized to 0.1 and divided by 10 at the 100k, 180k, and 220k iterations.
Testing.
We strictly follow the specific protocol provided by each dataset for evaluation. Table 2 shows the statistics of the testing sets. Given a face image, we extract two 512-dimensional embeddings from the original image and its horizontally flipped version, respectively; the final embedding is obtained by averaging the two. The scoring method is cosine similarity. The nearest neighbor classifier and thresholding are used for face identification and verification, respectively. To reduce randomness, 5 models from the last 10k iterations are used in testing and their averaged results are reported. Specifically, we evaluate the models at the 72k, 74k, 76k, 78k, and 80k iterations for VGGFace2, and at the 232k, 234k, 236k, 238k, and 240k iterations for MS-Celeb-1M.
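The step learning-rate schedule above can be written as a small helper; the default milestones follow the VGGFace2 setting in the text:

```python
def step_lr(iteration, base_lr=0.1, milestones=(40_000, 60_000), gamma=0.1):
    # Divide the learning rate by 10 at each milestone iteration
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```

For the MS-Celeb-1M schedule, the same helper would be called with `milestones=(100_000, 180_000, 220_000)`.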

Ablation and Exploration on VGGFace2
The validation set is a combination of multiple datasets, including LFW, AgeDB-30, CALFW, CPLFW, CFP-FP, CFP-FF, and VGG2-FP. The statistics of these datasets are summarized in Table 2. In total, there are 43,000 testing pairs (21,500 positive pairs and 21,500 negative pairs). The performance of a model is measured by the area under the ROC curve (AUC). Since it is important for a face recognition system to avoid false positives, we use AUC-x [87] as the metric, which integrates the ROC curve up to a false positive rate of x (x ∈ [0, 1]). We find that x = 0.0005 achieves the best trade-off between stability and effectiveness. Since VGGFace2 is a relatively small training set (8.6K subjects [76]), we use the SFNet-20 model as the backbone in this section.
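A minimal sketch of an AUC-x style metric: build the ROC from scored pairs, integrate TPR over FPR ∈ [0, x], and normalize by x so that a perfect verifier scores 1. This is only an illustration; [87] defines the exact convention used in the paper.

```python
def roc_points(scores, labels):
    # Sweep thresholds from high to low; labels are 1 (positive pair) / 0 (negative pair)
    data = sorted(zip(scores, labels), key=lambda t: -t[0])
    P = sum(labels)
    N = len(labels) - P
    tps = fps = 0
    pts = [(0.0, 0.0)]
    i = 0
    while i < len(data):
        s = data[i][0]
        while i < len(data) and data[i][0] == s:  # process score ties together
            if data[i][1] == 1:
                tps += 1
            else:
                fps += 1
            i += 1
        pts.append((fps / N, tps / P))
    return pts

def auc_x(scores, labels, x):
    # Area under the ROC curve restricted to FPR in [0, x], normalized by x
    pts = roc_points(scores, labels)
    area = 0.0
    for (f0, t0), (f1, t1) in zip(pts, pts[1:]):
        lo, hi = f0, min(f1, x)
        if hi <= lo:
            continue
        # linear interpolation of TPR on the clipped segment
        slope = (t1 - t0) / (f1 - f0)
        t_lo = t0 + slope * (lo - f0)
        t_hi = t0 + slope * (hi - f0)
        area += 0.5 * (t_lo + t_hi) * (hi - lo)
    return area / x
```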

No Feature Normalization
We start by exploring SphereFace and SphereFace-R without feature normalization. Since there is only one effective hyperparameter (i.e., the margin m), it is easy to find the optimal setting. We compare SphereFace, SphereFace-R v1, and SphereFace-R v2, which represent different types of margins as shown in Table 1. Note that SphereFace with NFN is the same as the original SphereFace [2] (with additional CGD). Our methods are equivalent to the standard softmax cross-entropy loss when m = 1.0. We vary m from 1.1 to 1.5 and report the corresponding AUC-0.0005 in Table 3. It can be observed that a small margin (e.g., 1.1) results in inferior performance, because the learned features are not sufficiently discriminative. On the other hand, incorporating a margin that is too large cannot produce good results either, due to the increased difficulty in optimization. In our experiments, m = 1.2 achieves the best trade-off between feature discriminativeness and optimization difficulty, leading to the best performance for all types of margins.

Hard Feature Normalization
HFN introduces an additional hyperparameter s to the loss function, which controls the norm of the deep features. To show how m and s affect the performance, we perform a grid search by varying these two hyperparameters for all three types of margins including SphereFace, SphereFace-R v1 and SphereFace-R v2. The results are given in Table 4.
We have several observations. First, there usually exists an optimal margin hyperparameter m that leads to the best performance for each scale s. If m is smaller than the optimal value, the performance usually improves as m increases; if m is greater than the optimal value, the performance usually degrades as m increases. Second, for larger s, the corresponding optimal m also tends to be larger. In Table 4(b), the optimal margin m increases monotonically from 1.1 to 1.5 as the scale s ranges from 20 to 60; the performance is affected by s and m in a coupled manner. The same pattern also appears in Table 4(a) and (c). Third, a wide range of s can lead to satisfying performance, as long as m is properly tuned. Based on these observations, we adopt a simple hyperparameter search strategy. First, we select a few representative scales s following common practice [2], [5]. Second, we find the optimal margin parameter m for each s with a uniform search. As shown in Table 4, the range of m depends on the specific s we use: for larger s, the optimal m is typically larger as well, so as a useful practice we gradually search larger m as s increases. Finally, we choose the scale s and the margin m that lead to the best performance on the validation set. We find that this simple hyperparameter search is generally useful and can be applied to tuning different kinds of hyperspherical FR methods in practice.
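The search strategy above amounts to a simple two-stage grid search; a sketch with a hypothetical `evaluate(s, m)` callback standing in for training a model and measuring its validation AUC:

```python
def grid_search(evaluate, scales, margins_for):
    # evaluate(s, m) -> validation score (e.g., AUC-0.0005), hypothetical callback;
    # margins_for(s) -> candidate margins, shifted larger for larger s
    best_score, best_s, best_m = float("-inf"), None, None
    for s in scales:
        for m in margins_for(s):
            score = evaluate(s, m)
            if score > best_score:
                best_score, best_s, best_m = score, s, m
    return best_s, best_m, best_score
```

For example, `margins_for` could return margins around 1.4-1.8 for a small s and 1.6-2.0 for a large s, mirroring the ranges searched in our experiments.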

Soft Feature Normalization
SFN is a soft regularization that constrains the feature norm and can be considered a trade-off between NFN and HFN. SFN has a weighting hyperparameter t controlling the contribution of the feature norm regularization term. Since there are three hyperparameters, i.e., m, s, and t, it is time-consuming and infeasible to enumerate all their possible combinations. To evaluate SFN efficiently, we adopt the combination of m and s that leads to the best performance in Section 6.2.2 and fix them throughout the SFN experiments, so that we can focus on how t affects the performance. We give the AUC-0.0005 in Table 5. SphereFace achieves 62.4% with SFN and 61.37% with HFN. SphereFace-R v1 achieves 61.37% with SFN and 60.45% with HFN. SphereFace-R v2 achieves 61.21% with SFN and 62.72% with HFN. Our experiments show that SFN generally yields comparable or even better results than HFN with a properly chosen t when the models are trained on VGGFace2.

Characteristic Gradient Detachment
In Table 6, we compare models trained with and without CGD. Except for the usage of CGD, the experiments are performed with exactly the same experimental settings (e.g., dataset, architecture, training setup, etc.). We evaluate CGD under all three FN strategies (i.e., NFN, HFN, and SFN). The results show that CGD significantly improves the results in all scenarios, which validates the importance of simplifying the gradient of the characteristic function. CGD works particularly well for SphereFace-R v1, since it enables gradient propagation for large margins (m > π/θ in Eq. (11)). Here we use the best-performing hyperparameters from our previous experiments; in fact, consistent improvements from CGD can also be obtained with other hyperparameters. We observe that CGD effectively improves SphereFace and SphereFace-R under all FN strategies. Because of its consistent effectiveness, we use CGD for both SphereFace and SphereFace-R by default, and the results in the other sections are also obtained with CGD.

Ablation and Exploration on MS-Celeb-1M
In this section, we conduct ablation and exploration with a deeper network that is trained on a large-scale dataset. Specifically, we use SFNet-64 as the backbone network architecture and MS-Celeb-1M as the training set. We report AUC-0.0005 on the same validation set as Section 6.2.

No Feature Normalization
We evaluate SphereFace, SphereFace-R v1 and SphereFace-R v2 without feature normalization. Note that SphereFace with NFN is equivalent to the original SphereFace [2] with CGD. The results are given in Table 7. Similar to the experiments on VGGFace2, NFN generally works well with small margins, achieving 79.89%, 84.66%, and 84.78% with m being 1.1, 1.2, and 1.2, respectively. The increased number of training identities (from 8K to 86K) leads to severe optimization difficulty during training. This can be empirically observed from the smaller optimal m from SphereFace and the dramatically decreased performance from both SphereFace and SphereFace-R with larger margins.

Hard Feature Normalization
Following the hyperparameter tuning strategy given in Section 6.2.2, we search for the optimal combination of s and m. Since the performance is stable across a wide range of s, here we use a relatively large step (i.e., 32). As mentioned in Section 6.2.2, the optimal m tends to increase with larger s. We search two feature scale hyperparameters for SphereFace (s = 32, 64), SphereFace-R v1 (s = 32, 64), and SphereFace-R v2 (s = 64, 128). Based on our observations on VGGFace2, the optimal m for larger s also tends to be larger. Therefore, for the smaller s we search m from 1.4 to 1.8, and for the larger s we search m from 1.6 to 2.0. The results in Table 8 match our expectation that there is a single optimal m for each s and that the optimal m increases with larger s. Under the same s, we find that the performance degrades as m deviates from its optimal value. The distribution of performance over different m exhibits strong unimodality, which greatly benefits hyperparameter tuning. The best models of SphereFace, SphereFace-R v1, and SphereFace-R v2 achieve 91.03%, 90.78%, and 91.58% with s = 32, 32, and 64, respectively. The performance on MS-Celeb-1M is also much better than that on VGGFace2, which shows that our methods readily enjoy the accuracy boost from a larger training set and a deeper network. Moreover, the performance of both SphereFace and SphereFace-R is not very sensitive to m and remains stable over a wide range of m.

Soft Feature Normalization
Similar to Section 6.2.3, we adopt the best-performing settings (m and s) from HFN and vary the hyperparameter t to evaluate the performance of SFN. The results are reported in Table 9. With a properly tuned t, SFN achieves 87.70%, 87.19%, and 86.10% AUC-0.0005 for SphereFace, SphereFace-R v1, and SphereFace-R v2, respectively. We observe that the performance of SFN on MS-Celeb-1M is not as good as that of HFN, which contradicts the observation on VGGFace2. This implies that SFN may be sensitive to the distribution of the training data. We hypothesize that SFN is more sensitive to noisy samples, since MS-Celeb-1M has far more low-quality and noisy images than VGGFace2.

Experiments with SFNet-20 and SFNet-64
In this section, we evaluate the performance of both SphereFace and SphereFace-R with the three FN strategies on popular large-scale benchmarks (i.e., IJB-B, IJB-C, and MegaFace). We also provide fair comparisons with state-of-the-art methods (using the same training set and backbone network). We use the same models from Section 6.2 and Section 6.3 with the best-performing hyperparameters on the validation set. Specifically, we train SFNet-20 on VGGFace2 and SFNet-64 on MS-Celeb-1M. We compare our models with current state-of-the-art methods, i.e., NormFace [9], CosFace [3], [4], ArcFace [5], circle loss [26], and CurricularFace [40]. The hyperparameters of these methods are tuned to achieve the best validation performance.
We make several useful observations from Table 10 and Table 11. First, the results on the large-scale testing sets are consistent with those on the validation set, especially for the metrics at low false acceptance rate (FAR) or false positive identification rate (FPIR). The consistent performance demonstrates the effectiveness of the selected validation set and metric: models that achieve higher performance on the validation set usually show better results on MegaFace and IJB as well.
Second, different types of margins tend to achieve similar performance, while different FN strategies yield significantly different results. In Fig. 6, we first show the training losses for different FN strategies. Fig. 6(a) shows that all the models equipped with NFN, HFN, or SFN converge well on VGGFace2, a relatively small, clean, and high-quality training set. From Table 10, NFN and HFN show comparable generalization ability, while SFN achieves the best performance among all FN strategies. This implies that making good use of the magnitude information during training can effectively improve the results. Fig. 6(b) shows that NFN is unable to converge to a sufficiently small training loss on MS-Celeb-1M, a large, noisy, and low-quality dataset. From Table 11, we can observe that NFN indeed converges to a bad local minimum that generalizes poorly. By introducing a magnitude regularization term into the objective function, SFN can effectively help the models escape this bad local minimum. In contrast to the results on VGGFace2, we find that SFN performs worse than HFN on MS-Celeb-1M, implying that SFN may be more sensitive to noisy samples in the training set and that HFN may be more robust to different training sets than NFN and SFN.
Finally, with our proposed modifications, all variants in the SphereFace family show competitive results compared with state-of-the-art methods. Both SphereFace and SphereFace-R perform particularly well at low FAR, such as 1:1 verification TAR at FAR = 1e-6 and 1:N identification TPIR at FPIR = 1e-1. These metrics are very important for designing a robust face recognition system in practice.

Experiments with IResNet-100
In order to have a comprehensive comparison with published results, we conduct experiments training our methods on MS-Celeb-1M [77] with IResNet-100 [5]. Comparing the results in Table 12 and Table 11, we observe that IResNet-100 achieves better performance than SFNet-64 (with the same set of hyperparameters), establishing a higher baseline for SphereFace. For the different FN strategies, IResNet-100 behaves similarly to SFNet-64, in the sense that HFN is slightly better than SFN and both are better than NFN. In general, both SphereFace and SphereFace-R are comparable to CosFace and ArcFace. SphereFace with HFN achieves the best performance on MegaFace. More interestingly, we find that both SphereFace and SphereFace-R v1 with HFN achieve significantly better 1:1 verification performance at low FAR than all the compared methods on IJB.

CONCLUDING REMARKS
Our paper proposes a novel framework that unifies hyperspherical face recognition. This framework provides a general principle for a loss function to incorporate large angular margins. Under this framework, we substantially extend and improve our previous work on SphereFace [2] by addressing training instability and significantly improving empirical performance. Specifically, we propose two new types of multiplicative margins that effectively implement the original intuition of SphereFace. Moreover, we also come up with a novel implementation technique called characteristic gradient detachment to further improve training stability and generalization. Extensive experiments on a number of popular benchmarks are conducted to validate the superiority of our SphereFace family.
Based on the unified framework, our paper demonstrates the strong flexibility and many unique advantages of hyperspherical FR. There still exist a number of exciting yet under-explored open problems in hyperspherical FR, such as how to design better angular margins, how to effectively incorporate feature magnitude into training and testing, how to learn the loss function directly from data, etc. We also present a few useful characterizations of the loss function in hyperspherical FR, leading to multiple equivalent loss design spaces. Current popular loss functions represent only a very small and limited subset of the huge design space of hyperspherical FR. We expect more work to be devoted to this promising line of research in the future.

Rita Singh received the B.Sc. (Hons.) degree in physics and the M.Sc. degree in exploration geophysics, both from Banaras Hindu University, India. She received the Ph.D. degree in geophysics in 1996 from the National Geophysical Research Institute of the Council of Scientific and Industrial Research, India. She is currently a Member of the Research Faculty at the School of Computer Science, Carnegie Mellon University (CMU), Pittsburgh, PA, USA. From March 1996 to November 1997, she was a Postdoctoral Fellow with the Tata Institute of Fundamental Research, India, where she worked with the Condensed Matter Physics and Computer Systems and Communications Groups. During this period, she worked on nonlinear dynamical systems and signal processing as an extension of her doctoral work on nonlinear geodynamics and chaos. Since November 1997, she has been affiliated with the Robust Speech Recognition and SPHINX Groups at CMU. She currently works on core algorithmic aspects of computer voice recognition and artificial intelligence applied to voice forensics. Her focus is on the development of technology for the automated discovery, measurement, representation, and learning of the information encoded in the voice signal for optimal voice intelligence.
Adrian Weller received his undergraduate degree from the University of Cambridge, and his PhD from Columbia University in New York. He is Programme Director for AI at The Alan Turing Institute, the UK national institute for data science and AI, where he is also a Turing Fellow leading work on safe and ethical AI. He is a Principal Research Fellow in Machine Learning at Cambridge, and at the Leverhulme Centre for the Future of Intelligence where he is Programme Director for Trust and Society. His interests span AI, its commercial applications and helping to ensure beneficial outcomes for society. He is Co-Director of the European Laboratory for Learning and Intelligent Systems (ELLIS) programme on Human-centric Machine Learning. Previously, he held senior roles in finance.