Gender Privacy Angular Constraints for Face Recognition

Deep learning-based face recognition systems produce templates that encode sensitive information next to identity, such as gender and ethnicity. This poses legal and ethical problems as the collection of biometric data should be minimized and only specific to a designated task. We propose two privacy constraints to hide the gender attribute that can be added to a recognition loss. The first constraint relies on the minimization of the angle between gender-centroid embeddings. The second constraint relies on the minimization of the angle between gender specific embeddings and their opposing gender-centroid weight vectors. Both constraints enforce the overlapping of the gender specific distributions of the embeddings. Furthermore, they have a direct interpretation in the embedding space and do not require a large number of trainable parameters as two fully connected layers are sufficient to achieve satisfactory results. We also provide extensive evaluation results across several datasets and face recognition networks, and we compare our method to three state-of-the-art methods. Our method is capable of maintaining high verification performances while significantly improving privacy in a cross-database setting, without increasing the computational load for template comparison. We also show that different training data can result in varying levels of effectiveness of privacy-enhancing methods that implement data minimization.


Zohra Rezgui, Nicola Strisciuglio, Member, IEEE, and Raymond Veldhuis, Senior Member, IEEE
Index Terms: Privacy-enhancing techniques, soft-biometric privacy, gender classification, face recognition.

I. INTRODUCTION
Deep learning has been revolutionary for face recognition. CNNs in particular have enabled the training and convergence of highly complex algorithms, allowing the learning of features that are highly discriminative for the face recognition task. This breakthrough resulted in face recognition systems that were progressively more effective at recognizing faces even in challenging scenarios, including changes in lighting, pose, and expression [1], [2], [3], [4]. Besides containing information that is highly useful for the face verification and identification tasks, the features learned by deep learning-based face recognition systems entangle a variety of other information auxiliary to identity. Previous studies have shown that features extracted from the last layers of face recognition networks can be transferred to other tasks, such as gender, age or ethnicity classification [5], [6], as well as the classification of other fine-grained attributes such as hairstyle or the shape of eyebrows [7]. This entanglement between identity and auxiliary soft biometric attributes present in the facial templates poses privacy risks. For instance, in the event that such templates or their source model are exposed, an adversary can train classifiers that profile subjects based on demographic or other highly sensitive information. This can be problematic as the subject may not have consented to the processing of their biometric information for profiling tasks. We illustrate this risk in Figure 1. From a legal perspective, the disclosure of such information stemming from a face recognition model poses a potential issue to both the model developer and the party responsible for storing the templates. This is because such information leakage runs counter to the data minimization principle outlined in the General Data Protection Regulation (GDPR).
The correlation between soft biometrics and identity can also contribute to the demographic unfairness of biometric systems, which is a topical research problem [8], [9]. Face recognition systems are biased toward demographic categories, meaning they produce more errors for certain categories than others. This is usually due to skewed distributions of different categories in the training data [8], which cause the neural networks to overfit on the dominating category. The entanglement between demographic attributes and identity can further exacerbate this issue. In fact, if the embeddings for face verification are easily separable by a demographic attribute, they can potentially form significantly different distributions for different categories of the attribute. This makes the verification step prone to generating different error rates by category.
To remedy the aforementioned issues, a few works have emerged that apply privacy-enhancing techniques at the template level. While some require a training procedure [10], [11], [12], others are training-free and rely on shuffling the information in the templates rather than removing the sensitive information [13]. However, finding an optimal trade-off between privacy and face verification performance remains a difficult challenge. In addition, previous studies often lack a sufficient assessment of the generalizability of their approaches, as they do not conduct evaluations on diverse datasets that are independent from the training set. In some cases, the evaluation data for privacy-preserving approaches comes from the same source as the training data [10], [11], which does not allow for an evaluation in a real-life scenario. We present a more detailed overview of such works in Section II.
In this paper, we focus on protecting the gender attribute as it is easily learned from facial templates [6], [14]. Our proposed method takes advantage of the hyperspherical nature of the feature space used in many face recognition systems [4], [15], [16], [17]. We fine-tune a face recognition model by passing the feature vectors through a shallow network that projects them onto an unbiased feature space. We introduce angular constraints that, when added to a recognition loss, make the distributions of the gender categories overlap while maintaining the verification performance.
The advantages of our method are that it is geometrically interpretable, is easy to implement, and effectively transforms gender-discriminative features into gender-neutral representations while upholding an acceptable verification performance with the same dimensionality. A fundamental point of distinction from prior methods involving fine-tuning [11] is that we do not depend on a given gender classifier during training, which lowers the computational burden of training our method and makes it independent of a specific decision boundary. We make the source code of our implementation publicly available. We summarize our contributions as follows:
1) We propose light-weight and geometrically interpretable constraints to enhance the privacy of face recognition systems.
2) Our method is designed to align gender distributions, thus fortifying the templates against any potential exploitation to train a robust gender classifier. It is specifically tailored to impede the capacity of unforeseen classifiers to learn any gender-related features.
3) Our method does not require a gender classifier during training and focuses instead on the training of a shallow network that imparts minimal additional computational load to the initial face recognition network.
4) We provide comprehensive cross-database results using various face recognition models and evaluate the effect of the training data on the performance of the method to an extent that was rarely observed in previous work.

II. RELATED WORK
Most of the privacy-enhancing techniques that focus on the face modality are image-based. The earliest methods rely on fusing specific parts of a face or morphing faces from different categories of the soft biometric attribute [18], [19], [20]. The authors of [18] determine the most gender-discriminant face components and use image fusing to choose, for a particular subject, the closest facial components from the opposing gender. Likewise, the authors of [19] transform the gender expression of the facial image as a privacy method. They use a spectrum of morphing parameters to generate numerous versions of the input image with varying gender confidence levels.
Other methods rely on adversarial perturbations to fool gender classifiers into making wrong predictions without fooling face recognition systems [20], [21]. For instance, the authors of [21] show that gradient-based adversarial attacks on a gender classifier are in some cases not transferable to face recognition systems. Therefore, the images are perturbed so that a gender classifier makes a false gender prediction, with imperceptible distortion of the images and a negligible decrease in the verification performance of face recognition systems. Other works such as [22], [23], [24], [25], [26] use GANs to alter the appearance of the facial image, making it imitate characteristics of a different category of the soft biometric attribute. However, while these methods perform well on facial images, they do not necessarily work on the template level, as shown in [11].
Therefore, a few works have emerged that focus on enhancing soft biometric privacy on the template level [11], [12], [13], [14], [27]. The authors in [11] fine-tune a face recognition model by training a 3-layer network with a modified triplet loss that incorporates constraints aiming to fool an adversary ethnicity or gender classification layer, which is trained in parallel to the face recognition model. This makes the representation learning dependent on the convergence state of the adversary classification network at different stages of the training.
More recently, [13] proposed an approach based on shuffling blocks of the information encoded in the templates. At the moment of template comparison, shuffled references and probes are realigned based on the Hungarian algorithm. While this method takes a more general approach to privacy, it does not tackle the learned bias in the face recognition system that generates the templates and is therefore considered a data protection approach instead of a data minimization approach.
The methods in [14] and [12] implement data minimization based on the identification and subsequent suppression of sensitive information. They allow control over which type of attribute to protect and the amount of information that can be suppressed from the templates. Both methods are easily reproducible, are based on an intuitive approach and provide competitive results on the privacy aspect.
While both data minimization and data protection approaches aim to hinder the retrieval of soft biometric information, the data protection approach does not aim to eliminate such information from the templates. Instead, it performs operations on the templates to block access to such information. The data minimization approach, on the other hand, aims to effectively remove the sensitive information from the stored templates [12], [28].
In this paper, we compare our method to [12], [14], both data minimization methods like our solution. Additionally, we compare to [13], a data protection method. We describe these methods in detail in Sections II-A to II-C, respectively.

A. IVE: Incremental Variable Elimination
The incremental variable elimination (IVE) algorithm introduced in [10] is based on estimating feature importance for the classification of the targeted attribute. Feature importance is estimated from the decrease in node impurity measures of tree-based classifiers. Following that, the most important features for the classification of the targeted attribute are iteratively eliminated.
The authors claim to suppress features that were discriminative for gender as well as age. While the method is intuitive, its main drawback is the significant loss of information due to reducing the dimension of the templates. This negatively impacts the utility of the templates for their intended verification task. Indeed, in order to achieve an acceptable level of sensitive attribute suppression, they report that 400 to 500 features out of 512 had to be eliminated, resulting in an equal error rate (EER) 4 times higher on the training data.
In this paper, we executed the IVE algorithm on different datasets to suppress gender and we report cross-database results for comparison to our method.

B. Multi-IVE
The authors in [12] propose an improvement on the IVE algorithm. Instead of eliminating the features from the original feature space of the face recognition model, they first transform the features by projecting them onto the domain generated by a principal component analysis (PCA) or an independent component analysis (ICA). The feature suppression based on feature importance estimation is then performed in this new domain. Finally, the feature vectors are projected back onto the original domain by inverting the PCA or ICA. This way, the dimension of the feature space remains the same as that of the original embeddings. They employed two settings when applying this method. The first setting does not exclude any principal component in the PCA/ICA from the elimination process. The second setting tries to indirectly minimize the loss in verification performance by locking the first $k$ principal components, with $k \in \{3, 5\}$. Furthermore, they adapt the IVE algorithm to suppress three soft biometric attributes (gender, age and ethnicity) simultaneously. However, to compare with our results, we run Multi-IVE solely to suppress gender and do not consider the other attributes, as this could bias the performance of the algorithm on gender suppression.
While both methods are intuitive approaches, they are not explicitly trained to maintain a high verification performance. We instead train the embeddings with a recognition loss and a privacy loss simultaneously. We also note that we train and evaluate both the IVE and Multi-IVE methods rigorously across multiple datasets and, unlike in [10], report the results on datasets that are completely independent from the training dataset.

C. PE-MIU
In [13], the authors introduce the privacy-enhancing minimum information units method (PE-MIU). This method does not require training. It is based on partitioning a feature vector into several blocks that are then randomly shuffled. The sensitive information is not removed from the templates but rendered inaccessible due to the random order of the blocks. The block size parameter controls the trade-off between privacy and verification performance. The authors report that using a block size of 16 features provides nearly perfect privacy for gender while having minimal effect on verification performance. During comparison, the Hungarian algorithm is used to assign to the blocks of the probe an order that matches the blocks of the reference. For mated comparisons, the block assignment is often successful and results in a high similarity score. For non-mated comparisons, the blocks are not assigned correctly, but this usually results in a low similarity score, which is the desired outcome. Due to the block assignment step, the time needed for comparisons is considerably longer than on unprotected templates. We run this method across multiple datasets using a block size of 16 features, which results in 32 blocks for every 512-dimensional template. Furthermore, we include runtime evaluations for all methods in Section V-C to assess their suitability for real-world applications.
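The shuffle-and-realign idea described above can be illustrated with a small sketch. This is not the implementation of [13]: the function names are hypothetical, the block cost (squared Euclidean distance) is an assumption, and a brute-force search over permutations stands in for the Hungarian algorithm, which is only feasible here because the toy template has four blocks instead of 32.

```python
import itertools
import numpy as np

def to_blocks(template, block_size):
    """Split a 1-D template into consecutive blocks of `block_size` features."""
    return template.reshape(-1, block_size)

def shuffle_blocks(template, block_size, rng):
    """Protect a template by randomly permuting its blocks."""
    blocks = to_blocks(template, block_size)
    return blocks[rng.permutation(len(blocks))].reshape(-1)

def realigned_similarity(reference, probe, block_size):
    """Cosine similarity after assigning each probe block to a reference block.

    Brute-force assignment (assumption: a stand-in for the Hungarian
    algorithm, practical only for a handful of blocks).
    """
    ref_blocks = to_blocks(reference, block_size)
    probe_blocks = to_blocks(probe, block_size)
    # cost[i, j]: squared distance between reference block i and probe block j
    cost = ((ref_blocks[:, None, :] - probe_blocks[None, :, :]) ** 2).sum(axis=2)
    n = len(ref_blocks)
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    aligned = probe_blocks[list(best)].reshape(-1)
    return float(reference @ aligned /
                 (np.linalg.norm(reference) * np.linalg.norm(aligned)))

# Toy mated comparison: two noisy, independently shuffled views of one template.
rng = np.random.default_rng(0)
template = rng.normal(size=16)
reference = shuffle_blocks(template, 4, rng)
probe = shuffle_blocks(template + 0.01 * rng.normal(size=16), 4, rng)
mated_score = realigned_similarity(reference, probe, 4)
```

The extra assignment step per comparison is what makes comparisons slower than on unprotected templates.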

III. PROPOSED METHOD
The templates that are generated by state-of-the-art face recognition systems suffer from a strong entanglement between identity-relevant information and gender information. In the first row of Figure 2, a t-distributed stochastic neighbor embedding (t-SNE) plot of the features obtained from an ArcFace model [4] reveals a high gender separability across several datasets. While in many ways the presence of gender information can facilitate identity verification, it is prone to resulting in different distributions by gender, causing unfair decisions for one of the categories. Furthermore, the presence of gender information can be exploited by a privacy adversary to infer the gender attribute on a large scale.

A. Adversary's Knowledge
As illustrated in Figure 1, we consider the scenario where the privacy adversary gains access to the feature extractor that generated the templates and has access to the templates. We also suppose that the adversary has access to a large dataset of facial images annotated with gender labels. The adversary would then use this dataset to train a set of gender classifiers that we describe in Section IV-C. Finally, they would feed the templates to these classifiers in order to retrieve the gender predictions. The goal of gender privacy-enhancing methods on the template level is to reduce the gender separability of the templates. The adversary, even with the ability to create labeled templates, should not be able to train a reliable gender classifier that can be used to infer the gender from the stored templates. After privacy-enhancing methods are applied, gender information in the templates is either removed through data minimization or made inaccessible through data protection. In this case, the templates become inadequate for accurate classification or for training an effective gender classifier. Successful implementation of the privacy-enhancing method should result in the balanced accuracy of the gender classifier approaching 50%, equivalent to the performance of a random binary classifier.
We propose a training method that is based on a constrained recognition loss. In the following sections, we first introduce the trainable parameters and then the composition of the proposed loss.

B. Architecture and Training Parameters of the Gender Privacy-Preserving Layers
To train our method, we feed the embeddings extracted from a pre-trained face recognition network into a shallow network consisting of two fully connected layers, as presented in Figure 3, in order to obtain the private features. We use a Leaky ReLU as an activation function between these two layers to allow for a non-linear projection of the embeddings. Afterwards, the embeddings are passed into a last fully connected layer to estimate the identity class weights necessary to calculate the logits for the recognition loss. This layer does not have any bias parameters, in accordance with the losses described in Section III-C.
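A minimal forward-pass sketch of the layers described above is given below. It is an illustration under assumptions, not the released implementation: the class name `PrivacyHead`, the hidden width, the Leaky ReLU slope, the number of identities and the random initialization are all hypothetical choices.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # slope is an assumed value; the text does not state it
    return np.where(x > 0, x, slope * x)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

class PrivacyHead:
    """Two fully connected layers plus a bias-free identity layer (sketch)."""

    def __init__(self, dim=512, hidden=512, n_ids=100, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=dim ** -0.5, size=(dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=hidden ** -0.5, size=(hidden, dim))
        self.b2 = np.zeros(dim)
        # identity class weights: no bias, as required by the normalized loss
        self.W_id = rng.normal(size=(n_ids, dim))

    def private_features(self, x):
        """Project pre-trained embeddings to features of the same dimension."""
        h = leaky_relu(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

    def cosine_logits(self, x):
        """Cosines between normalized private features and class weights."""
        z = l2_normalize(self.private_features(x))
        return z @ l2_normalize(self.W_id).T
```

At inference time only `private_features` would be needed, so templates keep their original dimension and can be compared as before.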

C. Composition of the Training Loss
The proposed loss has three components. The first component is a normalized softmax loss whose objective is to maintain the recognition performance. The two other components are weighted privacy constraints that minimize the angle between gender-centroid features as well as the angles between gender-specific features and their opposing gender-centroid identity class weights. We dissect the formulation of these three loss components in the following subsections.
1) Normalized Softmax Loss: This component ensures that the recognition performance stays high despite the privacy constraints. Following the modifications to the softmax function outlined in [29], the feature vectors obtained from the embedding layer of a face recognition network, which precedes the calculation of the logits, lie on a hypersphere. This is accomplished by imposing a bias $b_j = 0$ in that layer and $\ell_2$-normalizing both the learnable class weights, $\|W_j\| = 1$, and the feature vectors, $\|x_i\| = 1$. This step guarantees that the features lie on the unit hypersphere of a given dimension $d$, which in turn allows a straightforward calculation of the cosine similarity between the feature vectors and their corresponding identity class weights via their inner product.
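In [30], the authors propose the following improvement: the feature vectors are rescaled to a fixed norm $s$ after normalization. This loss is equivalent to ArcFace [4] with margin $m = 0$. The following equation describes the cross-entropy loss to be minimized with this addition:
$$L_r = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{s \cos\theta_{y_i, i}}}{\sum_{j} e^{s \cos\theta_{j, i}}}$$
where $\cos\theta_{j,i} = W_j^\top x_i$ is the cosine of the angle between the feature vector $x_i$ and the class weight $W_j$, $y_i$ is the ground-truth identity of $x_i$, $n$ is the batch size and the sum in the denominator runs over all identity classes.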
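As an illustration of the normalized softmax loss with scale $s$, the following NumPy sketch computes the cross-entropy on scaled cosine logits. The function name and argument layout are assumptions made for this example; $s = 64$ is the value used in the experiments.

```python
import numpy as np

def normalized_softmax_loss(features, class_weights, labels, s=64.0):
    """Cross-entropy on scaled cosines (ArcFace with margin m = 0).

    features: (n, d) embeddings, class_weights: (C, d), labels: (n,) ints.
    """
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    logits = s * (x @ w.T)                      # s * cos(theta_j) per class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())
```

Because both factors are normalized, the logits depend only on angles, which is what lets the privacy constraints be expressed as angles in the same space.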
2) Our Privacy Constraints: Building on the improvements in [4], [30] for the recognition loss, we formulate two angular constraints to ensure that the gender distributions overlap. The ensemble of the angles we minimize during training is illustrated in Figure 4.
In order to confuse a gender classifier, the feature distributions of each category have to be as similar as possible. The first constraint $L_{p_1}$ involves solely the embeddings. The distance between the distributions is approximated by the angle between the average feature vectors of each category, and this angular distance is minimized during training. For every batch, we select the feature vectors of each gender category separately. After $\ell_2$-normalization of the feature vectors, we calculate the average feature vector for the masculine category, $\bar{x}_m$, and for the feminine category, $\bar{x}_f$, and then $\ell_2$-normalize both averages. As a final step, we calculate the angle $\theta_{p_1}$ between these two vectors by applying the arccos function to their inner product. In order to improve the estimation of the mean vectors per gender, the batches are gender-balanced. The pseudo-code to calculate $L_{p_1}$ is given in Algorithm 1.
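The steps described above for $L_{p_1}$ can be sketched in NumPy as follows. The function signature is an assumption for illustration; in an actual training loop the same operations would be written in an automatic differentiation framework, where the cosine would typically be clipped slightly inside $[-1, 1]$ to keep the arccos gradient finite.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def privacy_constraint_p1(X, idx_m, idx_f):
    """Angle (radians) between the normalized gender-centroid features.

    X: (n, d) feature matrix of a batch; idx_m, idx_f: indices of the
    masculine and feminine samples in the batch.
    """
    X = l2_normalize(X, axis=1)
    mean_m = l2_normalize(X[idx_m].mean(axis=0))
    mean_f = l2_normalize(X[idx_f].mean(axis=0))
    cos = np.clip(mean_m @ mean_f, -1.0, 1.0)  # guard against rounding
    return float(np.arccos(cos))
```

The value is 0 when the two centroids coincide and grows as the gender-specific distributions drift apart on the hypersphere.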
Algorithm 1 Calculate Privacy Constraint $L_{p_1}$
Input: $i_m$, $i_f$: masculine and feminine indices; $n_m$, $n_f$: sizes of the masculine and feminine samples; $X$: feature matrix of shape $(n, d)$, with $n$ the batch size and $d$ the feature space dimension.
Output: $L_{p_1} = \theta_{p_1}$, the angle between the average feminine feature vector and the average masculine feature vector.

The second constraint $L_{p_2}$ involves both the embeddings and the weights of the identity classes. As the normalized softmax loss $L_r$ is minimized during training, the identity class weights and the embeddings are updated such that each embedding forms the smallest angle possible with the identity class weight vector corresponding to its ground-truth identity. The constraint $L_{p_2}$ is added to guide the updates of the identity class weights and the feature vectors simultaneously by enforcing that the average masculine identity vector gets as close as possible to the average feminine feature vector and vice versa.
For every batch, we select the feature vectors of each gender category separately, and we also select separately the $\ell_2$-normalized weights of the identity classes associated with each gender category. We calculate the average identity weight vector for each gender category and then $\ell_2$-normalize it to bound it to the surface of the unit hypersphere. Finally, we compute the angles between every feature vector and the average identity vector associated with its opposite gender category. The averages of these angles, $\theta_{p_2}$ and $\theta_{p_3}$, are then minimized during training. We provide the pseudo-code for calculating this constraint in Algorithm 2.
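A hedged NumPy sketch of the second constraint follows. Several details are assumptions: the function and argument names are hypothetical, which of the two averages is called $\theta_{p_2}$ and which $\theta_{p_3}$ is chosen arbitrarily, and the two angles are returned separately because the way they are combined into $L_{p_2}$ is not restated here.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def mean_angle(features, centroid):
    """Average angle (radians) between unit features and a unit centroid."""
    cos = np.clip(features @ centroid, -1.0, 1.0)
    return float(np.arccos(cos).mean())

def privacy_constraint_p2(X, W, idx_m, idx_f, ids_m, ids_f):
    """Angles between features and the opposite-gender identity centroid.

    X: (n, d) batch features; W: (C, d) identity class weights;
    idx_m / idx_f: sample indices per gender; ids_m / ids_f: identity
    class indices per gender (assumed to be supplied by the caller).
    """
    X = l2_normalize(X, axis=1)
    W = l2_normalize(W, axis=1)
    centroid_w_m = l2_normalize(W[ids_m].mean(axis=0))
    centroid_w_f = l2_normalize(W[ids_f].mean(axis=0))
    theta_p2 = mean_angle(X[idx_m], centroid_w_f)  # masculine features vs feminine weights
    theta_p3 = mean_angle(X[idx_f], centroid_w_m)  # feminine features vs masculine weights
    return theta_p2, theta_p3
```

Unlike the first constraint, this one touches the identity class weights as well, so it influences how the recognition layer itself is updated.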
The final loss to be minimized during training is given in the equation below, with $\alpha$ and $\beta$ as hyperparameters for increasing or decreasing the magnitudes of the privacy constraints:
$$L = L_r + \alpha L_{p_1} + \beta L_{p_2}$$

A. Datasets
In [10] and [11], the authors use one dataset for training and evaluating their privacy-preserving approach. In [12], the authors use separate datasets to train and evaluate their method; however, they do not evaluate the verification task and the privacy task simultaneously on the evaluation dataset. Instead, they pick one dataset to evaluate the gender suppression and another to evaluate the verification performance. To obtain a broad overview of the generalization ability of our method and of the methods from [10], [12], [13] that we reproduce, we use four facial datasets. The following datasets are alternated for training and evaluation, and all evaluations are performed simultaneously for the verification and the privacy tasks. The Labeled Faces in the Wild (LFW) dataset [31] consists of 13,233 images of 5,749 identities in unconstrained conditions. AgeDB [32] contains 16,516 images of 570 identities in uncontrolled conditions with a large variation in age. ColorFeret [33] contains 11,338 images of 994 identities collected under controlled conditions. We also randomly select a gender-balanced sample of 15,000 images of 5,000 identities from the VGGFace2 train set [34].
The samples from LFW, AgeDB, and VGGFace2 encompass a wide array of images taken in real-world conditions, exhibiting substantial variability in terms of pose, lighting, and demographic characteristics, including age, ethnicity, and gender expression.
The faces in these images are detected and aligned with an MTCNN face detection algorithm [35] and then resized to 112×112 pixels. Table I gives the total number of images and the gender distribution for these datasets.
We alternate using these datasets to train the privacy-enhancing methods and evaluate the performance on the remaining ones. We note that, contrary to what is reported in [10] and [11], whenever we use one of these datasets for training a method, we do not include it in the evaluation set, to avoid bias that might result from overfitting.

B. Pre-Trained Face Recognition Models
We run our experiments on three state-of-the-art face recognition models, namely ArcFace [4], SphereFace [15], [17] and ElasticFace [16], which are trained using distinct angular losses. The ArcFace model used is trained on the VGGFace2 dataset [34] with an IResNet50 architecture, while the SphereFace and ElasticFace models used are both trained on the MS-Celeb-1M dataset [36] with an IResNet100 architecture. All of the models take images of 112×112 pixels and output 512-dimensional embeddings.

C. Gender Classifiers
To evaluate the gender separability of the embeddings before and after applying privacy-enhancing methods, we use a 3-fold cross-validation setting. For each dataset used for evaluation, we form 3 folds where, in each fold, the train set and the test set do not have overlapping identities. For every fold, an ensemble of gender classifiers is trained on the train set and evaluated on the test set. This ensemble of gender classifiers consists of two linear classifiers, namely a linear SVM and a logistic regression, and one non-linear classifier, an SVM with an RBF kernel. The composition of the folds in terms of number of images and number of identities is given in Table I.

D. Evaluation Metrics and Model Selection
To evaluate the gender classification performance, we use the average balanced accuracy of the gender classifiers across the 3 folds, which we refer to as $ACC_G$. The balanced accuracy is defined as:
$$ACC_G = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
where TP, TN, FP and FN refer to the numbers of true positive, true negative, false positive and false negative predictions, respectively. We report the details regarding the partition of the folds in Table I.
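A small sketch shows why balanced accuracy, rather than plain accuracy, is the relevant privacy measure on gender-imbalanced data. The function name and the 0/1 label encoding are assumptions for this example.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of the true positive rate and the true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# A classifier that always predicts the majority class reaches 80% plain
# accuracy on this imbalanced toy set, but only 50% balanced accuracy.
majority_score = balanced_accuracy([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])
```

This is why a balanced accuracy near 50% can be read as chance-level gender prediction regardless of the gender distribution of the evaluation set.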
In order to obtain results that reliably describe the impact of privacy-enhancing techniques on verification, all verification evaluations are performed following the standard protocol 1 for benchmark on the LFW dataset in [37], where 6,000 pairs (3,000 mated and 3,000 non-mated) are compared using the Euclidean distance. The same procedure was used in [32] to create age-invariant verification protocols. The most challenging protocol is the one where the pairs have 30 years of age difference (AgeDB-30). This protocol is the most widely used to report verification performance on the AgeDB dataset [4]; therefore, we use it to guarantee comparability. Similarly, we use the same protocol to generate the verification pairs (6,000 pairs) for the other datasets (ColorFeret and the sampled VGGFace2). We choose the equal error rate (EER) to report verification performance, which we refer to as $EER_V$.
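For reference, the EER on distance scores can be approximated as in the sketch below. The threshold sweep over observed distances and the averaging of the two error rates at the closest crossing are simplifying assumptions, not necessarily the exact procedure used for the reported results.

```python
import numpy as np

def equal_error_rate(mated_dist, non_mated_dist):
    """Approximate EER for distance scores (a pair is accepted if d <= t)."""
    mated = np.asarray(mated_dist, dtype=float)
    non_mated = np.asarray(non_mated_dist, dtype=float)
    thresholds = np.unique(np.concatenate([mated, non_mated]))
    # false non-match rate: mated pairs rejected at threshold t
    fnmr = np.array([(mated > t).mean() for t in thresholds])
    # false match rate: non-mated pairs accepted at threshold t
    fmr = np.array([(non_mated <= t).mean() for t in thresholds])
    i = np.argmin(np.abs(fnmr - fmr))  # threshold where both rates are closest
    return float((fnmr[i] + fmr[i]) / 2)
```

With 3,000 mated and 3,000 non-mated pairs per protocol, both error rates are estimated on equally sized sets.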
To select the models with the best trade-off between the two tasks, we use the privacy gain (PG) - identity loss (IL) criterion (PIC) introduced in [38], $PIC = PG - IL$, where PG and IL are computed from the couples $(ACC^*_G, EER^*_V)$ and $(ACC_G, EER_V)$, which designate the gender classification and verification performances on the embeddings before and after the privacy-enhancing method, respectively.
The higher the PIC, the better the privacy-utility trade-off. In the case where the identity loss is greater than the privacy gain, this metric yields negative values. We note, however, that this metric calculates relative improvements in privacy and face verification performance with respect to the original embeddings. Therefore, if the original embeddings are not highly discriminative for gender or obtain near-perfect verification performance, the metric is likely to yield negative values even if the obtained privacy and verification performances are satisfactory.

V. EXPERIMENTS
For IVE, we ran the method with various total numbers of eliminations from the feature vectors and with different training datasets. The total number of eliminations ranges from only 20 eliminated features to 500. For each training set, we selected the resulting elimination algorithm that provided the highest PIC value. Across the training sets used, the highest PIC values correspond to a total elimination of 500 features from the ArcFace templates, compared to an elimination of 400 to 500 features from ElasticFace templates and 300 to 400 features from SphereFace templates, depending on the training set used.
Similarly, for Multi-IVE, we varied the type of intermediate transformation domain (PCA or ICA), the total number of eliminations, as well as the number of locked principal components in the transformation domain ($k \in \{0, 3, 5\}$). For each training set used, we select the parameters that correspond to the highest PIC value. In all cases, the highest PIC value was associated with a total of 120 eliminations in the PCA domain with $k \in \{3, 5\}$ for the ArcFace templates. For the ElasticFace templates, the optimal number of eliminations ranges from 276 to 432 in the PCA domain with $k \in \{3, 5\}$. As for the SphereFace templates, 81 to 354 eliminations are optimal in the PCA domain with $k \in \{0, 3, 5\}$, depending on the training set. For all models, more eliminations come at an even higher cost in verification performance.
As for PE-MIU, we run it and report all the results using a block size of 16 features, resulting in templates of 32 blocks. For our proposed losses, we minimize them by training the privacy fine-tuning layers for 100 epochs with a learning rate of 0.01. The scale factor for the recognition loss $L_r$ is set to $s = 64$. A batch size of 128 images is used with a roughly balanced number of images per gender. For the privacy weight factors $\alpha$ and $\beta$, we set $\alpha = 20$, as it gives $L_{p_1}$ a magnitude comparable to that of $L_r$, and set $\beta \in \{0, 1\}$. We note that higher values of $\beta$ often resulted in convergence problems. We also varied the training sets: each of the datasets is used as a training set, and we also formed pairwise combinations of the LFW, ColorFeret and AgeDB datasets. For model selection, we evaluate the performance of the parameters saved every 10 epochs and select the model that is associated with the highest PIC value.
Before applying any privacy-preserving technique, we investigated the original embeddings extracted from the different datasets. Figure 5 shows that the features obtained from images in the VGGFace2 dataset are not linearly separable regardless of the feature extractor used, as ACC_G does not exceed 71% for the linear SVM and logistic regression classifiers. They are, however, highly separable with a non-linear classifier, which reaches an ACC_G of 98.20% for SphereFace features.
AgeDB and ColorFeret are associated with more linearly separable features, in particular ColorFeret, with an ACC_G exceeding 80% using both linear classifiers. In order of decreasing linear separability, the datasets are ColorFeret, AgeDB, LFW and VGGFace2. However, all of the datasets are easily separable using an RBF kernel SVM classifier, which achieves an ACC_G exceeding 80% in all cases. We can speculate that the differences in separability among the datasets are caused by the different degrees to which these dataset distributions deviate from the training data distributions of the face recognition systems. In addition to disparities at the dataset level, we also notice that ElasticFace seems to result in less gender-separable features across all datasets compared to ArcFace and SphereFace.
These observations indicate that the gender separability of the embeddings varies from one dataset to another and from one face recognition model to another, and that it depends on the type of classifier used.

A. Cross-Database Evaluation
We illustrate in Figure 6 the gender classification and verification performances of all the methods with various training sets on the evaluation sets. We exclude the results where the training set of the privacy-preserving method is the same as the evaluation set.
Fig. 5. Verification performance and gender classification performance using three classifiers (linear SVM, logistic regression and an SVM with an RBF kernel) on the original embeddings from the pre-trained face recognition models. Zooming may be necessary for the best viewing of the plots.
For features extracted using ArcFace, we notice that our methods achieve a near-ideal trade-off on the LFW and VGGFace2 datasets, alongside the PE-MIU shuffling approach. On AgeDB and ColorFeret, PE-MIU achieves the best trade-off, with an ACC_G lower than 55% and an EER_V lower than 20%. On these datasets, our methods provide much lower privacy, with our best ACC_G being 69.15% on ColorFeret and 79.67% on AgeDB, while maintaining in all cases an EER_V below 17%. However, on all four datasets, Multi-IVE and IVE result in either a worse privacy level with a verification performance comparable to our methods, or better privacy with a significantly deteriorated verification performance. For instance, on the AgeDB dataset, IVE achieves an ACC_G of 60.84% but with an EER_V of 31.45%, which is better than ours in terms of privacy but severely hinders the verification task.
For ElasticFace features, we notice a similar ideal trade-off on the LFW dataset for PE-MIU and our methods, where the best of our methods achieves an ACC_G of 50.41% with an EER_V of 0.5%. PE-MIU achieves a similar level of privacy with an EER_V of 0.46%. For the remaining datasets, our methods are more effective than Multi-IVE in terms of privacy and less effective than PE-MIU and IVE, but they tend to achieve higher verification performance. For instance, on ColorFeret, PE-MIU has near-total privacy with an ACC_G of 53.01% and an EER_V of 16.65%, which is 3.47 times the original EER_V of 4.8%, while our best method achieves an ACC_G of 63.41% with an EER_V of 10.29%. Although our methods are not consistently the best at enhancing privacy, the best of them consistently achieve an ACC_G below 65% across all datasets while maintaining an EER_V equal to or lower than 10.29%.
As for the SphereFace features, similarly to the ArcFace and ElasticFace features, our methods are as successful in achieving an ideal trade-off on LFW as the data protection approach PE-MIU. Both sets of methods achieve an ACC_G of nearly 50% with only a negligible deterioration of verification performance. On the other hand, the best privacy results of IVE and Multi-IVE correspond to an ACC_G of 69.38% and 63.28% respectively. On the remaining datasets, the best results of our methods surpass those of IVE and Multi-IVE while maintaining an EER_V equal to or lower than 10%, but they are surpassed by PE-MIU, which in the case of ColorFeret achieves an ACC_G of 55.61% but results in an EER_V of 14.12%. Nevertheless, the best results of our methods consistently achieve an ACC_G lower than 68% across all datasets.
The results on ColorFeret and AgeDB show that for these two datasets, identity and gender tend to be highly entangled, in particular for ColorFeret. For these datasets, it is more difficult to remove gender information without severely impacting the verification performance. In contrast to the other data minimization approaches, our method includes a recognition loss term L_r in the training loss that explicitly prevents the verification performance from decreasing significantly. Multi-IVE implicitly attempts to maintain the identity-relevant information in the embeddings by excluding a number of principal components from elimination in the transformed PCA or ICA domain. IVE only performs the privacy enhancement, by suppressing gender-related features. The recognition loss included in our method could be the reason why the gender classification performance does not always decrease as drastically as with the other methods, given the high entanglement between gender and identity for certain datasets such as ColorFeret.
We also note that, overall, the choice of the training data has an impact on the performance of the methods. With our methods, training on ColorFeret and AgeDB tends to produce better results on each other, while VGGFace2 seems to be the least suitable training data in terms of privacy. Combinations of datasets are beneficial in some cases, such as the combination of LFW and AgeDB when evaluating on the ColorFeret dataset, or the combination of AgeDB and ColorFeret when evaluating on the VGGFace2 dataset for the ArcFace features.
In Figure 7, we plot the average verification performance and gender classification performance on the evaluation sets. We excluded the combinations of training sets used in our methods to guarantee that all methods appearing in the figure share the same training and evaluation sets. For our methods, the gender classification performance varies from an average ACC_G of 55.31% to 72.99%, with an average EER_V consistently below 10%, ranging between 3.47% and 8.21% across all feature extractors. For IVE, the performance varies depending on the training set, with a severe impact on verification performance. It achieves an average ACC_G ranging from 52.81% to 80.39% with an average EER_V from 3.7% to 22.47%. Multi-IVE achieves worse results from a privacy point of view but tends to have a higher verification performance with most training sets. It achieves an average ACC_G ranging from 67.72% to 84.47% and an average EER_V ranging from 2.91% to 9.57%. Finally, PE-MIU achieves near-optimal privacy results, with an average ACC_G consistently below 55% and an average EER_V ranging from 7.02% to 8.67%.
Fig. 6. The above plots summarize the effect of the training set with different methods on the privacy-utility trade-off evaluated on different datasets. The training set appears in bold in the legend, preceded by the loss or method used. Each column corresponds to a distinct face recognition model. Zooming may be necessary for the best viewing of the plots.
We conclude that our methods are able to minimize the impact on the verification performance while obtaining a significant privacy gain compared to Multi-IVE and IVE. IVE only tackles the privacy aspect and therefore causes a substantial loss in verification performance, while Multi-IVE results in limited privacy because it excludes a number of principal components in the PCA domain from elimination. While PE-MIU outperforms our methods based on the reported performances, it is crucial to note that PE-MIU has a higher risk of being compromised, because the sensitive information is not removed, as is the case in IVE, Multi-IVE and our methods. Instead, the sensitive information is only shuffled.
To qualitatively assess the performance of our method, we show in Figure 2 the t-distributed stochastic neighbor

B. Sensitivity Analysis of the Privacy Factor α
To see the impact of the α parameter on the privacy gain, we performed a sensitivity analysis with α ∈ {11, 14, 17, 23, 26, 29}. We performed a Wilcoxon signed-rank test to compare the performance of the gender classifiers at α = 20 with their performance at each value of α in this set.
The statistical test compares the predictions of the gender classifiers on the AgeDB dataset with ArcFace features. We chose to perform this experiment on AgeDB as it is the most challenging dataset in terms of privacy gain for ArcFace features. All the models generating the features are trained on the same dataset, ColorFeret. We report in Table II the p-values of the Wilcoxon signed-rank test per classifier and value of α, as well as the ACC_G averaged across all classifiers. We notice that the differences are significant for α = 17 and α = 26, with p-values < 0.05. The ACC_G is the highest for α = 17 and the lowest for α = 26. This shows that α is a sensitive parameter, although increasing it does not consistently result in a significant privacy gain. This could be explained by the fact that the composed loss has two main components, the recognition component and the privacy component, which correspond to two tasks competing against each other.
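The paired comparison above can be illustrated with a small, self-contained implementation of the two-sided Wilcoxon signed-rank test. This is a sketch under assumptions: it uses the normal approximation with a tie correction and drops zero differences, which may differ from the exact variant used to produce Table II. The function name and the toy inputs are hypothetical.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    a, b: paired sequences of equal length, e.g. per-sample classifier
    outputs under two settings of alpha. Returns (w_plus, p_value).
    """
    # Paired differences; pairs with no difference carry no information.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # Sum of the ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Mean and tie-corrected variance of w_plus under the null hypothesis.
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return w_plus, p_value
```

When both inputs are binary predictions, every non-zero difference has the same magnitude, so all ranks are tied and the test reduces to a comparison of how often each setting alone predicts the positive class.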

C. Computational Time Analysis
We assess the suitability of the aforementioned methods in a real-world situation by quantifying the computational time required for generating privacy-enhanced templates and executing comparisons. The template generation step includes pre-processing the facial image, running the original feature extractor and applying the privacy enhancement method to obtain the final template. The comparison step refers only to the computation of the Euclidean distance between two templates.
These runtime measurements are performed on the ArcFace features, consistently using the best model for each privacy-enhancing method. We note that for IVE, the privacy-enhanced template has only 12 features. We report the measurements in Table III, where we can see that the average time to generate the template is approximately the same for all methods, except for Multi-IVE, which is roughly 2.3 times slower than the other methods. This is due to the more involved pipeline of Multi-IVE, which requires at least three steps in addition to the generation of the pre-privacy template: a projection onto the transformed domain, the elimination of sensitive features (120 eliminations) in the transformed domain, and a reverse projection onto the original domain.
We note that despite the additional layers that we train on top of the initial templates, our method does not add a significant computational burden to obtain the privacy-enhanced templates. When it comes to the time needed for computing one-to-one comparisons using the Euclidean distance, we see that PE-MIU stands out as a substantially heavier method. Comparison between IVE templates is slightly faster than comparison between original templates, due to the reduced size of the templates after IVE.
While the remaining methods have a template comparison time similar to that of the original templates, PE-MIU is 1444 times slower. This is largely due to the assignment of the blocks between the reference and the probe, which is crucial for calculating reliable comparison scores, especially for mated pairs. Due to this significant computational load during comparison, it is unlikely that PE-MIU can be implemented in a practical application. This is especially true in situations where one-to-many comparisons are necessary for the purpose of identification.
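A sketch of how such a comparison timing could be obtained with the timeit module is given below. It assumes 512-dimensional templates (consistent with 32 blocks of 16 features) filled with random values. The function names are hypothetical, and absolute timings depend on the machine, so they are not comparable to the values in Table III.

```python
import math
import random
import timeit


def euclidean_distance(reference, probe):
    """Comparison score between two templates of equal length."""
    return math.dist(reference, probe)


def mean_comparison_time_ms(dim=512, iterations=1000, seed=0):
    """Average time in milliseconds of one template comparison.

    dim: template dimension (512 is assumed here for the original templates).
    iterations: number of timed repetitions, as in the averaging over 1000 runs.
    """
    rng = random.Random(seed)
    # Random stand-ins for a reference and a probe template (illustration only).
    reference = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    probe = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    total_seconds = timeit.timeit(
        lambda: euclidean_distance(reference, probe), number=iterations
    )
    return 1000.0 * total_seconds / iterations
```

Calling mean_comparison_time_ms(dim=12) gives the analogous estimate for a reduced template of 12 features, such as the one left after IVE.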

VI. CONCLUSION
In this paper, we finetune a face recognition system with the aim of enhancing gender privacy in facial templates. We propose two constraints that act on both the gender-specific feature vectors and the learnable identity class weights. These constraints are intuitive, take advantage of the hyperspherical nature of the feature space in state-of-the-art face recognition systems, and are effective for training a shallow network on top of the embedding layer of a pre-trained face recognition model. Our findings demonstrate that the inclusion of these constraints significantly improves privacy while preserving the face verification performance, with no additional computational burden, unlike other methods. We also highlight the impact of the choice of the training data for privacy-enhancing techniques. Additionally, we provide an extensive evaluation protocol that emphasizes the importance of performing the evaluations on several datasets that were not used for training the privacy-enhancing method. Future work is required to further assess the impact of the separability of the training data on the effectiveness of privacy-enhancing techniques.

Fig. 1. Threat illustration: Face recognition features contain discriminative information on gender and can be used to train a gender classifier.

Fig. 2. t-SNE visualizations of the original embeddings of ArcFace (first row) and after applying our method with the loss 20 L_p1 + 1 L_p2 trained on the AgeDB dataset (second row). Every column corresponds to the source dataset of the embeddings.

Fig. 4. 2D illustration of the different angles on the hypersphere that we minimize during training. x_i refers to the feature vector of a sample i, W_i refers to the weight vector corresponding to its identity class, W_m and W_f correspond to the average weight vectors for the masculine and feminine identity classes respectively, and x_m and x_f correspond to the average feature vectors for each gender category respectively.

Fig. 7. Average performance per training set and method on the remaining evaluation sets. Zooming may be necessary for the best viewing of the plots.

TABLE I OVERVIEW OF THE NUMBER OF IMAGES N AND IDENTITIES N_ids IN TOTAL AND IN THE FOLDS SETTING USED FOR THE EVALUATION OF GENDER CLASSIFICATION PERFORMANCE

L_p2: Privacy constraint based on angles θ_p2 and θ_p3
m : Numbers of feminine and masculine identities, X: Feature matrix of shape (n, d) with n: batch size and d: feature space dimension, W: Identity class weight matrix of shape (m, d) with m: number of identity classes and d: feature space dimension
Output:

TABLE III AVERAGE COMPUTATIONAL TIME IN MILLISECONDS (MS) OF PRIVACY-ENHANCING METHODS USING ARCFACE AS THE PRE-PRIVACY FEATURE EXTRACTOR. RUNTIME IS ESTIMATED WITH THE timeit PYTHON MODULE AND IS AVERAGED OVER 1000 ITERATIONS. THESE MEASURES ARE RUN ON AN INTEL