Fairness in Biometrics: a figure of merit to assess biometric verification systems

Machine learning (ML) systems have been widely deployed over the last decade in a myriad of scenarios that affect many aspects of our daily lives. With such a wide range of applications, fairness has come into the spotlight because of the social impact these systems can have on minorities. This work addresses fairness in biometrics. First, we introduce the first figure of merit able to evaluate and compare fairness between multiple biometric verification systems, the so-called Fairness Discrepancy Rate (FDR). A use case with two synthetic biometric systems demonstrates the potential of this figure of merit in extreme cases of fair and unfair behavior. Second, a use case on face biometrics is presented, in which several systems are evaluated and compared with this new figure of merit using three public datasets that explore gender and race demographics.


INTRODUCTION
The pipeline from research to deployment of an ML-based system can take several shapes with different steps. In abstract terms (and allow us such a simplification), the pipeline is composed of: (i) Data Collection, where the "state of the world" is reduced to a set of rows and columns of data (e.g. face images, bank transactions, medical data); (ii) Modelling, where the "model" is supposed to summarize the patterns in the data and be able to generalize (via supervised or unsupervised learning, etc.); (iii) Benchmarking, where the model is evaluated with respect to some figure of merit (e.g. accuracy, F1 score); (iv) Feedback, where it is decided whether the model is "good" enough for deployment; if not, steps (i) and/or (ii) need to be redone; (v) Deployment, where the ML system goes to production 1. During the benchmarking stage, it is common to use reference databases. Such reference databases are supposed to represent operational conditions, and it is hypothesized that high accuracy, a high F1 score, a low false-positive rate, a low false-negative rate, etc. on such benchmarks are a proxy for the same figures of merit in operational conditions. Once this is achieved (by whatever criteria the ML engineers decide), the ML system is considered "safe" to deploy.
Fairness issues arise when these figures of merit are analysed for specific demographic groups (e.g. gender, ethnicity, race, revenue level, or any covariate in general) and it is observed that the operational conditions originally estimated cannot be reproduced for those groups. The large-scale deployment of such systems in so many different scenarios raises the debate about their fairness and their impact on our lives. For instance, the book Weapons of Math Destruction [1] presents several cases where unfair decision-making tools based on ML negatively impacted the lives of city populations when, among other things, aspects of fairness were not taken into account.
1. Usually, feedback is also gathered after deployment, but we keep this simplification because it is enough for our purposes.
Decision-making tools based on biometrics, as part of this machine learning wave, have been widely deployed in the last decade. For instance, they are present in our daily lives for data protection (e.g. to unlock mobile phones and/or computers), law enforcement, and airport e-gates, among other applications. This work addresses fairness in biometric systems, and its contributions are twofold. First, we discuss the factors that make a biometric verification system fair and introduce the first figure of merit in this field, the Fairness Discrepancy Rate. Second, a case study of this figure of merit is presented using face recognition as the biometric trait. We aim to make this work reproducible: all the source code, trained models, and scores are made publicly available. Details on how to reproduce this work can be found at the provided link 2.

RELATED WORK
In this section, we present the related work by first discussing the efforts made by the Machine Learning community to suppress demographic biases and then we move to efforts made by the biometrics community.

Machine Learning Background
Many criteria to assess and address fairness in pattern recognition problems have been proposed over the years, each one phrasing the problem in a different way. The recent work in [2] hypothesizes that most of the criteria described in the machine learning literature boil down to three major categories of conditional independence: Independence, Separation, and Sufficiency.
To illustrate these criteria, let us consider X ∈ R^n a random variable denoting the input data, D = {d_1, d_2, ..., d_n} a random variable denoting a set of sensitive attributes (e.g. gender, demographics), Y ∈ {0, 1} (for simplicity) a random variable denoting the target variable (representing a binary classifier), and F : f(X, D) the trained predictor (which can possibly be thresholded). The first non-discrimination criterion, and the simplest one, is independence, which requires that the classifier F be independent of the sensitive attributes D, or F ⊥ D. This is also referred to as demographic parity or statistical parity. For our binary classification case, this can be rewritten as:
$$P(F = 1 \mid D = d_i) = P(F = 1 \mid D = d_j), \quad \forall\, d_i, d_j \in D \tag{1}$$
This criterion is widely used in ML to mitigate biases, either via regularization [3], representation learning [4], [5], [6], or post-processing mechanisms [7]. Assuming independence raises some issues when addressing fairness; these are discussed at length in [8] and, more recently, in [2].
The second criterion is separation, which explicitly acknowledges that the target variable Y might be correlated with D. This might be desirable in some scenarios. For instance, a medical doctor might argue that a particular disease is more likely to develop in one demographic group than in another, and a "disease" prediction function F must take this into account. This is summarized by the following conditional independence: F ⊥ D | Y. For our binary classification case, this is equivalent to these two requirements:
$$P(F = 1 \mid Y = 1, D = d_i) = P(F = 1 \mid Y = 1, D = d_j), \quad \forall\, d_i, d_j \in D \tag{2}$$
and
$$P(F = 1 \mid Y = 0, D = d_i) = P(F = 1 \mid Y = 0, D = d_j), \quad \forall\, d_i, d_j \in D \tag{3}$$
What separation requires is that all demographic groups experience the same true/false positive rates and the same true/false negative rates [2] in order to be fair. This is addressed at training time for some classification tasks in [8], [9].
2. https://gitlab.idiap.ch/bob/bob.paper.fdr
The third criterion is sufficiency, which formalizes the idea that the value of F already accounts for the sensitive attribute D when predicting Y. Hence, F is sufficient for D if Y ⊥ D | F, which means that F does not need to explicitly see D to predict Y. In this case:
$$P(Y = 1 \mid F = f, D = d_i) = P(Y = 1 \mid F = f, D = d_j), \quad \forall\, d_i, d_j \in D \tag{4}$$
These three basic fairness criteria underpin most of what has been published in the machine learning literature, either explicitly or implicitly.
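To make the difference between independence and separation concrete, the following minimal sketch measures, on a toy dataset, how far a binary predictor is from each criterion. The function names, the group labels and the data are hypothetical and chosen only for illustration. The example shows a well-known consequence of the definitions: when base rates differ between groups, even a perfect predictor violates independence while satisfying separation.

```python
from itertools import combinations


def positive_rate(preds):
    """Fraction of positive (1) predictions; 0.0 for an empty list."""
    return sum(preds) / len(preds) if preds else 0.0


def max_gap(rates):
    """Largest absolute difference between any two group rates."""
    return max((abs(a - b) for a, b in combinations(rates, 2)), default=0.0)


def independence_gap(f, d):
    """Gap in P(F=1 | D=g) across groups g; 0 means demographic parity."""
    groups = sorted(set(d))
    return max_gap([positive_rate([fi for fi, di in zip(f, d) if di == g])
                    for g in groups])


def separation_gaps(f, y, d):
    """Gaps in P(F=1 | Y=1, D=g) and P(F=1 | Y=0, D=g) across groups g."""
    groups = sorted(set(d))
    gaps = []
    for label in (1, 0):
        rates = [positive_rate([fi for fi, yi, di in zip(f, y, d)
                                if di == g and yi == label])
                 for g in groups]
        gaps.append(max_gap(rates))
    return tuple(gaps)  # (true-positive-rate gap, false-positive-rate gap)


# Hypothetical toy data: group "a" has a higher base rate of Y=1 than group "b".
d = ["a"] * 4 + ["b"] * 4
y = [1, 1, 1, 0, 1, 0, 0, 0]
f = list(y)  # a perfect predictor, F == Y

print(independence_gap(f, d))    # 0.5
print(separation_gaps(f, y, d))  # (0.0, 0.0)
```

Sufficiency is not covered by this sketch; checking it would require conditioning on the predictor value instead of the target.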

Fairness in Biometrics
In the biometrics literature, fairness has recently been addressed for some biometric traits. For instance, the Face Recognition Vendor Test (FRVT) has a special report on demographic effects in face recognition 3, in which several analyses, mostly observing the effect of race and gender, are made using more than 100 COTS (Commercial Off-The-Shelf) systems.
The recent work in [10] describes some underlying factors that bias COTS face recognition (FR) systems with respect to race. For instance, it was observed that the "Other Race Effect", well known in humans [11], can also be observed in FR algorithms: FR systems developed in Asia are more accurate with Asians than with Caucasians, and vice versa. Furthermore, it was observed that racial biases occur more frequently with low-quality samples. This observation about image quality was also raised in the FRVT report. Studying race, the work in [12] observed consistently higher False Match Rates (FMR) for African American cohorts than for Caucasian ones using two COTS systems. That work also extended its analysis to ICAO face checkers 4, and observed that ICAO SDKs work better for Caucasians than for African Americans. The work in [13] made an extensive study analysing several age cohorts using one COTS system. Among the several observations made, the most striking one was the high FMR and high False Non-Match Rate (FNMR) in pairs of images where the age is lower than four years.
Still on face biometrics, the work in [3] introduces the Racial Faces in the Wild dataset. This dataset is a subset of MS-Celeb-1M [14] whose identities are organized into four different races (Caucasian, Black, Indian and Chinese). Using this data, and following the independence criterion, the authors regularized different deep neural networks at training time by minimizing the mutual information between the face classifier and the demographic attributes.
Biases towards gender were also observed in the periocular region of the face. For instance, the work in [15] demonstrates that several periocular recognition systems perform better with male subjects than with female ones.
Fig. 2: Example of a canonical fair biometric verification system with three demographics (0, 1, 2) and six operational thresholds (depicted with the dashed lines). Performance measures in terms of FMR(τ) and FNMR(τ) can be found in Table 1.
The NIST SRE 5 is the most relevant benchmark for speaker recognition and, over its latest editions, it has consistently evaluated error rates by gender cohort.
To the best of our knowledge, the works in the biometrics literature that address fairness in some way, either by analysing COTS systems or by proposing a mitigation strategy, do so using different criteria. However, the trend seems to be to pursue statistical parity (or independence), even if this is not explicitly mentioned. Even so, a figure of merit that directly addresses it does not exist. For instance, the work in [16] uses the Area Under the ROC curve to assess the fairness of a biometric verification system across different demographic groups. ROC curves measure the trade-off between the True and False Positive Rates (TPR and FPR, respectively). Although this seems a sensible way to assess demographic discrepancies, it has a serious flaw: it assumes that the verification decision threshold (let us call it τ; it is formally defined later) is demographic-specific. Hence, TPR(τ) and FPR(τ) are computed under different decision thresholds depending on the demographic, which can give the false impression that a biometric verification system is fair (this problem is further discussed in section 3). Furthermore, this does not represent operational conditions, where one single τ is set and this operational point has to be fair with respect to all demographics. This problem can also be observed in several works on biometric verification; for instance, in [3], [10], [15], [17], [18], [19].
Some works in the biometrics literature explicitly advocate that the value of τ should be demographic-specific, such as [20], [21]. Even the fairness figure of merit proposed by [2, (sec. 2, p. 14)] (covering the general case of pattern recognition) assumes one τ per demographic. Again, in biometric verification, this is not practical for at least two reasons. First, at test time, it involves classifying privacy-sensitive attributes (e.g. gender, age), which might not be legal or ethical in some applications. Second, it adds another classification task to the pipeline, which might itself be error-prone and subject to biases.
FRVT goes in the right direction with respect to the aforementioned threshold problem by discussing the impact of demographics in terms of FMR(τ) and FNMR(τ) for one decision threshold only. This decision threshold is picked from an independent zero-effort score distribution, where the demographic does not play a role. This is the most sensible evaluation if the goal is to assess fairness in operational conditions. However, FRVT discusses FMR(τ) and FNMR(τ) separately; hence, the trade-off between them is not considered. Furthermore, only one decision threshold is analysed, which limits the perception of fairness across different operational points. In [22] a similar direction was taken: risk distributions among the different demographic groups were equalized via different approximation methods, yielding threshold-invariant classifiers. However, no analysis in terms of FMR(τ) and FNMR(τ) was carried out.
Our work tries to fill these evaluation gaps for biometric verification systems by: (i) taking into consideration the above-mentioned threshold problems; (ii) considering the FMR(τ) and FNMR(τ) trade-off in the fairness assessment; and (iii) taking into account different operational points (decision thresholds).

CATION
Biometric verification is the task of verifying whether a given sample comes from a claimed identity. This decision is made based on a scoring function s(e, p) and a decision threshold τ, where e is the claimed identity and p is a probe sample (test sample). If s(e, p) ≥ τ, e and p are said to be from the same identity. Conversely, if s(e, p) < τ, e and p are said not to be from the same identity. A biometric verification system can make two types of errors, measured by the False Match Rate (FMR) and the False Non-Match Rate (FNMR). It is worth noting that both errors are functions of the decision threshold τ, whose impact is discussed below.
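As a minimal sketch of the decision rule above, the two error rates can be computed directly from lists of impostor and genuine comparison scores. The scores below are hypothetical and serve only to show how both rates move in opposite directions as τ changes.

```python
def fmr(impostor_scores, tau):
    """False Match Rate: fraction of impostor comparisons accepted (s >= tau)."""
    return sum(s >= tau for s in impostor_scores) / len(impostor_scores)


def fnmr(genuine_scores, tau):
    """False Non-Match Rate: fraction of genuine comparisons rejected (s < tau)."""
    return sum(s < tau for s in genuine_scores) / len(genuine_scores)


# Hypothetical comparison scores, for illustration only.
impostors = [-2.0, -1.5, -1.0, -0.5, 0.2]
genuines = [-0.3, 0.5, 1.0, 1.5, 2.0]

# Raising tau lowers the FMR and raises the FNMR.
for tau in (-1.0, 0.0, 1.0):
    print(f"tau={tau}: FMR={fmr(impostors, tau)}, FNMR={fnmr(genuines, tau)}")
```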
Fig. 3: Example of a canonical unfair biometric verification system with three demographics (0, 1, 2) and six operational points (depicted with the dashed lines). Performance measures in terms of FMR(τ) and FNMR(τ) can be found in Table 2.
The value of τ plays a decisive role in these two errors, and it is usually set by targeting a specific FMR value on a reference impostor score distribution 6. Some examples of such operational points are: τ = FMR_10 corresponds to the τ where the FMR reaches 0.1 (or 10%) on the impostor score distribution; τ = FMR_1000 corresponds to the τ where the FMR reaches 0.001 (or 0.1%); τ = FMR_10^6 corresponds to the τ where the FMR reaches 10^−6 (or 0.0001%) 7.
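The operational points described above can be sketched as a search over a reference impostor score distribution. The helper below is a hypothetical illustration, not the authors' implementation: it returns the lowest observed score at which the FMR does not exceed 1/x, and signals with an infinite threshold when there are too few impostor scores to reach the requested operating point.

```python
import math


def threshold_at_fmr(impostor_scores, x):
    """Return the lowest observed score tau with FMR(tau) <= 1/x.

    FMR(tau) is the fraction of impostor scores s with s >= tau. If no
    observed score reaches the target (too few scores for such a small FMR),
    math.inf is returned, i.e. every comparison would be rejected.
    """
    n = len(impostor_scores)
    for tau in sorted(set(impostor_scores)):
        false_matches = sum(s >= tau for s in impostor_scores)
        if false_matches * x <= n:  # integer form of false_matches / n <= 1 / x
            return tau
    return math.inf


# Hypothetical reference impostor scores: 1000 evenly spaced values in [0, 1).
impostors = [i / 1000 for i in range(1000)]

for x in (10, 1000, 10**6):
    print(f"tau = FMR_{x}: {threshold_at_fmr(impostors, x)}")
```

With only 1000 impostor scores the lowest measurable non-zero FMR is 0.001, so the operating point at x = 10^6 cannot be placed; this is why very low FMR targets require very large impostor score sets.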
Given a test set, a "good" biometric recognition system should present FMR_x(τ) around the operational point given by x and an FNMR(τ) as low as possible. Furthermore, for a "good" biometric system to be considered fair, it should present FMR_x(τ) around the operational point x for all observed demographic groups and approximately the "same" FNMR(τ) for all observed demographic groups. The impact of the decision threshold is illustrated in Figure 1. In this example we chose two comparison scores each from male and female subjects of the MOBIO dataset, using one of our tested Deep Convolutional Neural Networks (DCNN) (see section 4 for further details). These genuine pairs were cherry-picked by looking at score values around the average genuine score of each demographic group. In this case τ equals −0.5298. It can be noticed that, at this operational point, both comparisons with female subjects are rejected and both comparisons with male subjects are accepted.
Let us put this in terms of the separation criterion discussed before (see equations 2 and 3) and define fairness more formally, first for FMR(τ) and then for FNMR(τ). Given a set of demographic groups D = {d_1, d_2, ..., d_n} and τ = FMR_x 8, a biometric verification system is considered fair with respect to FMR if the following premise holds:
Premise 1: the differences between the FMR(τ) of every pair of demographic groups are below ε.
This premise can be written with the following equation:
$$A(\tau) = \max_{d_i, d_j \in D} \left| \mathrm{FMR}^{d_i}(\tau) - \mathrm{FMR}^{d_j}(\tau) \right| \le \epsilon \tag{5}$$
where ε is a relaxation constraint.
Conversely, in terms of FNMR, a biometric verification system is considered fair if the following premise holds:
Premise 2: the differences between the FNMR(τ) of every pair of demographic groups are below ε.
This premise can be written with the following equation:
$$B(\tau) = \max_{d_i, d_j \in D} \left| \mathrm{FNMR}^{d_i}(\tau) - \mathrm{FNMR}^{d_j}(\tau) \right| \le \epsilon \tag{6}$$
Since equations 5 and 6 are functions of τ, both can be summarized in one figure of merit, which we refer to as the Fairness Discrepancy Rate (FDR), defined as:
$$\mathrm{FDR}(\tau) = 1 - \left( \alpha A(\tau) + (1 - \alpha) B(\tau) \right) \tag{7}$$
where α is a hyper-parameter that defines the weight of A(τ) in the figure of merit (the importance of False Matches). FDR ranges from 0 (the most unfair behavior possible) to 1 (the fairest behavior possible). As with equations 5 and 6, FDR can be thresholded with the slack variable ε, so that an overall threshold defining what is fair and what is not can be written as:
$$\mathrm{FDR}(\tau) \ge 1 - \epsilon \tag{8}$$
The role of ε is discussed further in this section.
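The following minimal sketch computes FDR(τ) from per-demographic score lists. It assumes that A(τ) and B(τ) are the largest pairwise differences in FMR and FNMR between demographic groups and that FDR(τ) = 1 − (αA(τ) + (1 − α)B(τ)), which matches the description of α and of the 0 to 1 range given in the text. The function names and the scores are hypothetical.

```python
from itertools import combinations


def fmr(impostor_scores, tau):
    """Fraction of impostor comparisons accepted at threshold tau."""
    return sum(s >= tau for s in impostor_scores) / len(impostor_scores)


def fnmr(genuine_scores, tau):
    """Fraction of genuine comparisons rejected at threshold tau."""
    return sum(s < tau for s in genuine_scores) / len(genuine_scores)


def max_pairwise_gap(values):
    """Largest absolute difference between any two values."""
    return max((abs(a - b) for a, b in combinations(values, 2)), default=0.0)


def fdr(impostors_by_group, genuines_by_group, tau, alpha=0.5):
    """Assumed form: FDR(tau) = 1 - (alpha * A(tau) + (1 - alpha) * B(tau))."""
    a = max_pairwise_gap([fmr(s, tau) for s in impostors_by_group.values()])
    b = max_pairwise_gap([fnmr(s, tau) for s in genuines_by_group.values()])
    return 1.0 - (alpha * a + (1.0 - alpha) * b)


# Hypothetical scores for two demographic groups sharing one threshold.
impostors = {"0": [-3, -2, -1, 0], "1": [-2, -1, 0, 1]}
genuines = {"0": [1, 2, 3, 4], "1": [0, 0.2, 2, 3]}
tau = 0.5

# Here A(tau) = 0.25 (FMR gap) and B(tau) = 0.5 (FNMR gap).
for alpha in (0.0, 0.5, 1.0):
    print(alpha, fdr(impostors, genuines, tau, alpha))
```

Sweeping α from 0 to 1 moves the result from 1 − B(τ) to 1 − A(τ), which is how the weight given to false matches changes the verdict for the same pair of score distributions.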
The following subsection presents one example of a desirable, fair biometric verification system and one example of an undesirable, unfair one, to illustrate how FDR evaluates these two systems.

Fairness Discrepancy Rate using synthetic data
Figure 2 shows a canonical, fictional example of a fair biometric recognition system. Each box plot shows the score distributions of both zero-effort impostors (in red) and genuines (in blue) for three abstract demographics (labeled 0, 1 and 2). The score distributions of the three demographics are systematically aligned in all quartiles, which indicates that Premises 1 and 2 can hold for both FMR and FNMR for any given τ. In this experiment τ = FMR_x, where x varies from 10 to 10^6. At the other end of the spectrum, an example of an unfair biometric verification system is presented in Figure 3. As can be seen, the score distributions of both zero-effort impostors and genuines are not as aligned as in the previous example (see Figure 2). Intuitively, one can argue that it is difficult to find a single threshold τ for which Premises 1 and 2 hold.
Let us now test FDR on these two theoretical systems 9. Table 1 presents FNMR(τ), FMR(τ) and FDR(τ) for different values of τ for the fair synthetic biometric system of Figure 2. In this experiment τ = FMR_x, where x varies from 10 to 10^6. FDR(τ) is stable and higher than 0.99 for all values of x, which indicates fair behavior with respect to these abstract demographics. To analyse the other end of the spectrum, Table 2 presents FNMR(τ), FMR(τ) and FDR(τ) for different values of τ for the unfair synthetic biometric system of Figure 3. The values of τ are set in the same way as in the previous experiment. FDR(τ) is consistently higher for the fair system than for the unfair one, which indicates that this figure of merit is consistent in evaluating fairness. On the other hand, Figure 4 shows the ROC curves of these two synthetic examples for the three demographics. All six ROC curves are perfectly aligned in the top corner of the figure, showing perfect recognition rates. In fact, the Area Under the ROC equals 1 for every single demographic in both the fair and the unfair synthetic verification systems. This example clearly gives the false impression that the unfair synthetic verification system is fair, which is not the case, as FDR reveals. FDR can also be plotted as a function of x (or τ) so that two biometric systems can be compared more intuitively.
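The gap between per-demographic ROC analysis and a single shared threshold can be reproduced with a very small, hypothetical example. In the sketch below each group is perfectly separable on its own, so its Area Under the ROC is 1, yet one shared τ produces very different false match rates. The data and function names are illustrative assumptions and are unrelated to the synthetic systems of Figures 2 and 3.

```python
def auc(genuines, impostors):
    """Area under the ROC: P(genuine score > impostor score), ties count 0.5."""
    wins = sum((g > i) + 0.5 * (g == i) for g in genuines for i in impostors)
    return wins / (len(genuines) * len(impostors))


def fmr(impostor_scores, tau):
    """Fraction of impostor comparisons accepted at threshold tau."""
    return sum(s >= tau for s in impostor_scores) / len(impostor_scores)


def fnmr(genuine_scores, tau):
    """Fraction of genuine comparisons rejected at threshold tau."""
    return sum(s < tau for s in genuine_scores) / len(genuine_scores)


# Hypothetical scores: each group is perfectly separable on its own, but the
# score distribution of group 1 is shifted upwards with respect to group 0.
scores = {
    0: {"impostors": [0, 1, 2, 3], "genuines": [6, 7, 8, 9]},
    1: {"impostors": [4, 5, 6, 7], "genuines": [10, 11, 12, 13]},
}
tau = 5.5  # one shared decision threshold

aucs = {g: auc(s["genuines"], s["impostors"]) for g, s in scores.items()}
fmrs = {g: fmr(s["impostors"], tau) for g, s in scores.items()}
fnmrs = {g: fnmr(s["genuines"], tau) for g, s in scores.items()}

print(aucs)   # {0: 1.0, 1: 1.0}
print(fmrs)   # {0: 0.0, 1: 0.5}
print(fnmrs)  # {0: 0.0, 1: 0.0}
```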
9. This example is available at the following link: https://github.com/tiagofrepereira2012/fdr/
Fig. 5: FDR as a function of x for the two synthetic biometric systems of Figures 2 and 3.
Figure 5 shows how two biometric systems can be compared under this figure of merit. FDR is stable for all values of x for the fair biometric system. For the unfair biometric system, FDR decreases substantially as x increases (when fewer false matches are allowed). Another way to compare two systems with respect to fairness is to analyse the Area Under FDR. For a given range of τ (estimated using x), the Area Under FDR can be calculated by simply integrating FDR(τ) over x. The value of x can be scaled from 0 to 1, so that the Area Under FDR is bounded between 0 and 1. However, when it is scaled, the range of x has to be reported. Hence, only Areas Under FDR whose ranges of x match can be fairly compared, as presented in Table 3. Under this figure of merit, the system intuitively considered fair (see Figure 2) also presents a higher Area Under FDR than the one intuitively considered unfair (see Figure 3).
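One possible way to compute the Area Under FDR is sketched below. It assumes trapezoidal integration over log10(x), with the explored range of x rescaled to [0, 1]; the text does not specify the integration scheme, so this choice, the function name and the FDR values are all assumptions made for illustration.

```python
import math


def area_under_fdr(xs, fdr_values):
    """Trapezoidal area under FDR over log10(x), range of x rescaled to [0, 1]."""
    logs = [math.log10(x) for x in xs]
    lo, hi = logs[0], logs[-1]
    pos = [(v - lo) / (hi - lo) for v in logs]  # positions in [0, 1]
    return sum((pos[k + 1] - pos[k]) * (fdr_values[k] + fdr_values[k + 1]) / 2.0
               for k in range(len(pos) - 1))


xs = [10, 10**2, 10**3, 10**4, 10**5]
fair = [0.99, 0.99, 0.99, 0.99, 0.99]    # hypothetical FDR values
unfair = [0.90, 0.80, 0.60, 0.50, 0.40]  # hypothetical FDR values

print(round(area_under_fdr(xs, fair), 4))            # 0.99
print(round(area_under_fdr(xs, unfair), 4))          # 0.6375
print(round(area_under_fdr(xs[1:], unfair[1:]), 4))  # 0.5667
```

The last line restricts the same hypothetical system to a narrower range of x and yields a different area, which illustrates why the range of x has to be reported alongside the value.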

The role of alpha
The hyper-parameter α in equation 7 plays a crucial role in the computation of FDR(τ). As previously mentioned, it controls the weight of False Matches in the FDR computation. Its value is a business/application decision. For instance, a bank that deploys a biometric verification system in an ATM might prefer to favor fairness with respect to False Non-Matches; in that case, α can take low values. On the other hand, in a border control scenario, where false matches are more critical, decision-makers might favor fairness with respect to False Matches; hence, α should be high.
Figure 6 shows the α trade-off for the two synthetic systems; the fair one is represented by the solid lines and the unfair one by the dashed lines. The fair system presents FDR(τ) ∼ 0.99 regardless of the value of α. For the unfair system, FDR(τ) presents a steeper decay as α decreases. In the limit (when α = 0) the unfair biometric system is completely unfair (FDR(τ) ∼ 0). This can also be seen via the Area Under FDR. As can be noticed in Table 4, for the unfair biometric system, the Area Under FDR decreases as α decreases. Rather than setting a value for ε, we use both FDR and the Area Under FDR to compare different biometric verification systems and define the relative fairness between them.
Fig. 7: MEDS II database: distribution of gender by race (extracted from [23])

FACE VERIFICATION USE CASE
In this section, a case study of the Fairness Discrepancy Rate is presented using four different face verification systems. The first system is the Facenet implementation by David Sandberg [24]. This is the closest open-source implementation of the model proposed in [25], for which neither training data nor source code was made available. For this evaluation we used the 20170512-110547 model (Inception-ResNet v1), trained on the MS-Celeb-1M dataset. The second system is also a DCNN, based on the Inception-Resnet v2 architecture [26]. This DCNN was also trained on the MS-Celeb-1M dataset, using a joint loss function combining the cross-entropy loss and the center loss. More details on how this DCNN was trained can be found in [27, p. 147]. For these two biometric systems, comparisons between samples are made on the embeddings of each DCNN using the cosine similarity metric. Given the embeddings e and p for enrollment and probing respectively, the similarity s is given by Equation 9.
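For readers unfamiliar with the metric, the sketch below implements the textbook cosine similarity between an enrollment embedding e and a probe embedding p. It is only an illustration under that assumption: the scoring code used by the authors may apply a different sign or offset convention, and the function name and the vectors are hypothetical.

```python
import math


def cosine_similarity(e, p):
    """Textbook cosine similarity between two equally sized, non-zero vectors."""
    if len(e) != len(p):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(e, p))
    norm_e = math.sqrt(sum(a * a for a in e))
    norm_p = math.sqrt(sum(b * b for b in p))
    return dot / (norm_e * norm_p)


# Hypothetical low-dimensional embeddings, for illustration only.
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because the result only depends on the angle between the two vectors, the comparison score is insensitive to the norm of the embeddings.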
The third face recognition system is a baseline that predates the DCNN era: Gabor Graph matching [28]. Finally, the fourth face verification system is a Commercial Off-The-Shelf (COTS) system developed by RankONE 11.

Dataset setup
11. https://www.rankone.io version 1.22.1
Several databases with privacy-sensitive attributes, on which face verification tests can be made using those attributes, are publicly available in the literature. The most recent ones are based on images of celebrities scraped from the web, such as Racial Faces in the Wild (RFW) [3], Balanced Faces in the Wild [17], and IARPA Janus Benchmark C (IJB-C) [29]. Although all the aforementioned datasets contain meta-information with which we could do our fairness assessment, they are not captured in controlled conditions, and this might interfere with our fairness assessment using FDR. Since this is the first work with this figure of merit, we focused on three datasets where capture conditions are relatively well controlled and whose demographic attributes are available. The selected datasets are: the MEDS II dataset [23], the MORPH dataset [30] and the MOBIO dataset [31].
Fig. 8: MORPH database: subject example (extracted from [30])
The MEDS II database was developed by NIST to support and assist their biometrics evaluation program. It is composed of 518 identities of both men and women (labeled M and F) with five different race annotations: Asian, Black, American Indian, Unknown and White (labeled A, B, I, U and W). Unfortunately, the distribution of gender and race is extremely unbalanced, as can be observed in Figure 7. Furthermore, only 256 subjects have more than one image sample (a biometric evaluation is not possible with one sample per subject). For this reason, we performed our evaluation on a subset of this dataset composed of 194 subjects, White and Black men only. More details on how this evaluation protocol is organized can be found on its webpage 12. The evaluation protocol is published as a Python package; hence, future researchers will be able to reproduce exactly the tests executed in this work.
The MORPH dataset is relatively old, but has gained traction recently ([10], [12]), mostly because of its richness in sensitive attributes. It is composed of 55,000 samples from 13,000 subjects, men and women, in five race clusters (called ancestry): African, European, Asian, Hispanic and Others. Figure 8 presents some samples from this database. More details on how its evaluation protocol is organized can be found on its webpage 13; its organization is similar to that of the previous dataset.
The MOBIO dataset is a video database containing bimodal data (face/speaker). It is composed of 152 people (male and female), mostly Europeans, recorded in 5 sessions (with a time lapse of a few weeks between sessions). The database was recorded using two types of mobile devices: mobile phones (NOKIA N93i) and laptop computers (standard 2008 MacBook). In this paper we only use the mobile phone data. As with the other datasets, its evaluation protocol is also published as a Python package 14.

Experiments
In this section we discuss how fair the four off-the-shelf face verification systems are using the Fairness Discrepancy Rate. Each of the following subsections discusses one database in isolation. In every experiment, False Matches and False Non-Matches have the same weight; therefore, α is equal to 0.5. As mentioned, we do not set a value for ε; instead, we use both FDR and the Area Under FDR to compare different biometric verification systems and define the relative fairness between them.
Table 5 presents the FMR(τ), FNMR(τ) and FDR(τ) on the test set for the Inception Resnet v2 system. For the sake of brevity, only this system is presented in such detail. Please check the supplementary material for information about the other systems. In this experiment, τ was set at different operational points on the impostor score distribution of an independent set (the development set in this case). It is worth noting that this impostor score distribution contains samples from all races, which is the scenario closest to reality, where one single threshold has to be fair to all demographic groups.

MEDS II database (Fairness with respect to race)
Both the FMR_x(τ) and FNMR_x(τ) tables are broken down by demographic (race in this case). Hence, in Table 5, "White - White" means biometric references from White subjects compared with probe samples from White subjects, and so on.
In terms of FMR_x(τ), for x = 10 and x = 10^2 (FMR_10(τ) or FMR_10^2(τ)) the face verification system tends to raise more false alarms for comparisons between biometric references and probes from Black subjects. In terms of FNMR_x(τ), the system tends to reject more White subjects than Black ones for x ≥ 10^2. Figure 9 presents the FDR plot of the four biometric systems over the same decision thresholds shown in Table 5. The Gabor Graph baseline is less fair than Facenet, Inception Resnet v2, and the COTS. Furthermore, Facenet is fairer than Inception Resnet v2 for only one decision threshold (x = 10^3). The COTS is fairer than all other systems for all decision thresholds. To give a full picture of the fairness of these systems, Table 6 presents the Area Under FDR (x varying from 10 to 10^5) of every biometric verification system. The COTS is indeed fairer than the other evaluated systems.
On the MORPH database, Table 7 presents the FMR(τ), FNMR(τ) and FDR(τ) on the test set (male subjects only) for the Inception Resnet v2 verification system. As in the previous section, for the sake of brevity only this system is presented in this extensive manner. Please check the supplementary material for information about the other systems. In this experiment τ was set at different operational points on the impostor score distribution of the development set.
[...] x where dev is the development set. FMR(τ), FNMR(τ) and FDR(τ) are reported using the test set.
Both the FMR_x(τ) and FNMR_x(τ) tables are broken down by demographic (race in this case) in the same manner as in the previous experiment. However, here we have four demographic groups: Asian, Black, Hispanic, and White (samples labeled as "Others" were left aside).
In terms of FMR_x(τ), from x = 10 to x = 10^2 (from FMR_10(τ) to FMR_10^2(τ)) the face verification system tends to raise more false alarms for comparisons between biometric references and probes from Hispanic and Asian subjects. It is also worth noting that for x = 10^1, a significant number of false alarms is observed between Asian biometric references and Hispanic probes, and vice versa.
In terms of FNMR_x(τ), the system tends to reject more White and Hispanic subjects for x ≥ 10^4.
Figure 10 presents the FDR plot of the four biometric verification systems over the same decision thresholds shown in Table 7. The Gabor Graph verification system is fairer than the other systems only for x = 10^1. Among the DCNN-based and COTS systems, Facenet tends to be the fairest. To give a full picture of these observations, Table 8 presents the Area Under FDR (x varying from 10 to 10^6) 15 of every biometric verification system. Indeed, under this figure of merit Facenet is the fairest one (under the observed decision thresholds). More surprisingly, Gabor Graph is fairer than Inception Resnet v2. The same trends are observed for the female demographics, as can be seen in the supplementary material.
On the MOBIO database, in terms of FMR_x(τ), from x = 10 to x = 10^2 (from FMR_10(τ) to FMR_10^2(τ)) the face verification system tends to raise more false alarms for comparisons between biometric references and probes from female subjects. The biggest gap is at x = 10^1, where FMR(τ) goes from 0.067 for comparisons between male samples to 0.28 for comparisons between female samples. In terms of FNMR_x(τ), the system also tends to reject more female subjects, with the biggest gap at x = 10^5.
15. In this experiment we have enough scores to place a τ at 10^−6.
To give an overall picture of fairness, Figure 11 presents the FDR plot of the four biometric systems over the same decision thresholds shown in Table 9. FDR(τ) for Gabor Graph decreases smoothly from x = 10 to x = 10^5, a behavior not seen in Facenet, Inception Resnet v2, and the COTS, for which FDR(τ) is below 0.95 at x = 10. This has an impact on the computation of the Area Under FDR, as discussed below. As can be seen in Table 10 (a), the Area Under FDR of the Gabor Graph based system is higher than that of Inception Resnet v2, Facenet and the COTS.
It is worth noting that fair behavior is not necessarily a proxy for more accurate behavior. Table 11 presents the FMR_x(τ), FNMR_x(τ), and FDR(τ) of the Gabor Graph face verification system on the MOBIO database. Although this system tends to be fairer than Inception Resnet v2, Facenet and the COTS (for the range of thresholds we selected), it presents very high FNMR_x(τ) and FMR_x(τ) regardless of the selected threshold.
Another important point to highlight is that the Area Under FDR depends on the range of decision thresholds explored. For instance, in this experiment five decision thresholds were explored: from x = 10^1 to x = 10^5. The selection of x controls the proportion of False Matches that can be tolerated in the biometric verification system, and the values that x can assume are basically a business decision. Let us imagine now that the business decision changes and the values to be explored are from x = 10^2 to x = 10^5.
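The dependence on the threshold range can be illustrated with a minimal sketch. The FDR values below are hypothetical and are not taken from any table in this work; the function `area_under_fdr` is an illustrative name, and we assume a trapezoidal rule over equally spaced operating points, normalised by the number of intervals, which may differ from the exact integration used in our experiments.

```python
import numpy as np

def area_under_fdr(fdr_values):
    """Trapezoidal area over equally spaced operating points.

    The result is divided by the number of intervals so that a system
    with FDR = 1 at every operating point gets an area of 1.
    """
    y = np.asarray(fdr_values, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0) / (len(y) - 1))

# Hypothetical FDR(tau) at five operating points x = 10^1 ... 10^5.
system_a = [0.60, 0.97, 0.98, 0.99, 0.99]  # unfair only at the first point
system_b = [0.93, 0.94, 0.95, 0.95, 0.96]  # moderately unfair everywhere

full_a, full_b = area_under_fdr(system_a), area_under_fdr(system_b)
# Restrict the range to x = 10^2 ... 10^5 by dropping the first point.
sub_a, sub_b = area_under_fdr(system_a[1:]), area_under_fdr(system_b[1:])

print(f"full range:       A={full_a:.4f}  B={full_b:.4f}")
print(f"restricted range: A={sub_a:.4f}  B={sub_b:.4f}")
```

In this toy case system B has the larger area over the full range, while system A has the larger area once the first operating point is excluded, so the ranking depends on the range chosen.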

Discussion
In this section, we presented a case study using the proposed Fairness Discrepancy Rate to assess error discrepancies with respect to different demographic groups using several FR systems. Three open-source FR baselines and one COTS system were used along with three databases in which gender and racial biases were studied. We could notice that both FDR and Area Under FDR were able to spot the race and gender biases in the tested databases. With the FDR plots it was also possible to spot the range of decision thresholds in which a biometric system presents the fairest behavior. Furthermore, with the Area Under FDR it was possible to directly compare different face recognition systems with respect to the discrepancies they present.
Another finding in this set of experiments is that the biometric system that predates the era of DCNNs (Gabor Graph) also presents unfair behavior. Actually, in most of the experiments, this system was the least fair. Hence, it would contradict a belief that Deep-Learning-based FR systems are necessarily biased.

CONCLUSIONS
In this work, we presented the Fairness Discrepancy Rate (FDR), a figure of merit that assesses recognition discrepancies between demographic groups in biometric verification systems. FDR tackles the threshold problem, which is the main issue in how the majority of the biometrics community addresses fairness, by assessing the separation criterion with respect to both FMR and FNMR. Most works in the biometrics community assess fairness in verification systems by comparing DET and/or ROC curves of different demographic groups separately. This type of comparison assumes that decision thresholds are demographic-specific, which is not feasible in operational conditions and is not a proxy for statistical separation. FDR addresses this by assessing demographic discrepancies under a single decision threshold. A biometric recognition system is fair if a decision threshold τ is "fair" for all demographic groups with respect to FMR(τ) and FNMR(τ), and FDR is a proxy for this behavior. Furthermore, the trade-off between FMR(τ) and FNMR(τ) with respect to fair behavior can be set through the value of α in Equation 7. Finally, the Area Under FDR provides a general overview of fairness over a range of decision thresholds and allows a quick comparison between different biometric verification systems. It is worth emphasizing that FDR is a proxy for the separation criterion (mentioned in Section 2).
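As a compact reminder of how a single threshold and the weight α enter the computation, the sketch below assumes that Equation 7 has the form FDR(τ) = 1 − (α·A(τ) + (1 − α)·B(τ)), where A(τ) and B(τ) are the largest pairwise gaps in FMR(τ) and FNMR(τ) between demographic groups. The function name `fdr` and the FNMR values are hypothetical; the two FMR values are the ones reported above for the MOBIO experiment at x = 10^1.

```python
def fdr(fmr_by_group, fnmr_by_group, alpha=0.5):
    """Fairness Discrepancy Rate at one shared decision threshold.

    Both arguments map a demographic label to an error rate measured
    at the same tau. The largest pairwise absolute difference between
    groups equals max minus min.
    """
    a = max(fmr_by_group.values()) - min(fmr_by_group.values())
    b = max(fnmr_by_group.values()) - min(fnmr_by_group.values())
    return 1.0 - (alpha * a + (1.0 - alpha) * b)

# FMR values as reported for x = 10^1; FNMR values are placeholders.
fmr_at_tau = {"male": 0.067, "female": 0.28}
fnmr_at_tau = {"male": 0.010, "female": 0.020}

fdr_fmr_only = fdr(fmr_at_tau, fnmr_at_tau, alpha=1.0)   # only FMR gaps count
fdr_balanced = fdr(fmr_at_tau, fnmr_at_tau, alpha=0.5)   # equal weighting
fdr_fnmr_only = fdr(fmr_at_tau, fnmr_at_tau, alpha=0.0)  # only FNMR gaps count
print(fdr_fmr_only, fdr_balanced, fdr_fnmr_only)
```

With α = 1 the large FMR gap dominates and the system looks clearly unfair, whereas with α = 0 the same threshold looks almost perfectly fair, which is why α should be chosen according to which error type matters for the application.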
Two groups of experiments were carried out to evaluate this new figure of merit. In the first one, a case study using synthetic data was presented and it was demonstrated how FDR behaves in extreme cases of fair and unfair scenarios 9. In the second, a case study using four different face verification systems and three databases was carried out. With the developed tools it was possible to observe that all evaluated face verification systems present gender and racial biases, whether observing FMR or FNMR, for different values of τ. Furthermore, it was possible to quickly compare different face recognition systems with respect to their demographic discrepancies. It is worth noting that neither FDR nor Area Under FDR is a direct proxy for how "accurate" a biometric verification system is. Error rates have to be analyzed in parallel in order to have a full picture of accuracy versus fairness.
For reproducibility, all the source code, trained models, and recognition scores are made publicly available.
We hope that these tools are useful for the biometrics community to assess fairness, and we advocate for some standardization so that fairness can be assessed as easily as any other figure of merit, such as FMR and/or FNMR.

Fig. 4: ROC curves for the canonical fair and unfair synthetic verification systems. It can be observed that analysing these curves gives the false impression that the unfair synthetic verification system is fair.

* In this example τ = FMR_x^dev, where dev is the development set. FMR(τ), FNMR(τ) and FDR(τ) are reported using the test set.

Fig. 9: MEDS II: Fairness Discrepancy Rate of different face verification systems for different decision thresholds.

Table 5 presents the FMR(τ), FNMR(τ) and FDR(τ) in the test set for the Inception Resnet v2 system. For the sake of brevity, only this system is presented in this extensive manner. Please check the supplementary material for information about the other systems. In this experiment, τ was set at different operational points in the impostor score distribution from an independent set (the development set in this case). It is worth noting that such impostor score distribution contains samples from all races; which is the

Fig. 10: Morph: Fairness Discrepancy Rate of different face verification systems for different decision thresholds.

TABLE 1: Canonical fair biometric verification system: FNMR(τ), FMR(τ), and FDR(τ) per demographic (Demog.), where the operational points are defined as τ = FMR_x.*
* τ is set using an independent zero-effort impostor score distribution with scores from all demographics. It can be seen as a development set.

TABLE 2: Canonical UNfair biometric verification system: FNMR(τ), FMR(τ), and FDR(τ) per demographic (Demog.), where the operational points are defined as τ = FMR_x.*
* τ is set using an independent zero-effort impostor score distribution with scores from all demographics. It can be seen as a development set.

TABLE 3: Area Under the Fairness Discrepancy Rate for x varying from 10^1 to 10^6.

TABLE 4: Area Under the FDR for different values of α for x varying from 10^1 to 10^6.

.3 The role of epsilon
In this work we will not draw a line to define what is fair and what is not for biometric verification systems. As mentioned before, there is no legal or technical basis for doing so, and the guidelines that do exist are not suitable for biometrics.

10. https://www.eeoc.gov/laws/guidance/employment-tests-and-selection-procedures

TABLE 6: MEDS II: Area Under the Fairness Discrepancy Rate for x varying from 10^1 to 10^5.

TABLE 7: MORPH - Inception Resnet v2: FNMR(τ), FMR(τ), and FDR(τ) per demographic (Demog.) in the test set. These figures of merit are fragmented by the race of the samples used for enrollment and the race of the samples used for probe ("(e-p)" in the table).*
* In this example τ = FMR_x^dev, where dev is the development set. FMR(τ), FNMR(τ) and FDR(τ) are reported using the test set.

TABLE 8: MORPH: Area Under the Fairness Discrepancy Rate for x varying from 10^1 to 10^6.

Table 9 presents the FMR(τ), FNMR(τ) and FDR(τ) in the test set for the Inception Resnet v2 system. As in the last section, for the sake of brevity, only this system is presented in this extensive manner. Please check the supplementary material for information about the other systems. The MOBIO dataset is composed basically of Caucasians and, for that reason, this experiment focuses on gender biases only. Hence, the FMR_x(τ) and FNMR_x(τ) tables are fragmented by gender in the same manner as in the previous experiment. In this setup, τ is set at different operational points in an independent zero-effort impostor score distribution (from the development set).
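The protocol described above (one τ chosen on development-set impostor scores, then error rates reported per gender on the test set) can be sketched as follows. All scores are synthetic Gaussian samples, not scores from MOBIO, and the helper names `threshold_at_fmr`, `fmr` and `fnmr` are illustrative; we assume that higher scores mean better matches and that τ is taken as a quantile of the development impostor scores.

```python
import numpy as np

def threshold_at_fmr(dev_impostor_scores, target_fmr):
    """Pick tau so that about `target_fmr` of dev impostor scores are >= tau."""
    return float(np.quantile(dev_impostor_scores, 1.0 - target_fmr))

def fmr(impostor_scores, tau):
    """Fraction of impostor comparisons accepted at tau."""
    return float(np.mean(np.asarray(impostor_scores) >= tau))

def fnmr(genuine_scores, tau):
    """Fraction of genuine comparisons rejected at tau."""
    return float(np.mean(np.asarray(genuine_scores) < tau))

rng = np.random.default_rng(0)

# Development set: impostor scores pooled over all demographics.
dev_impostors = rng.normal(0.0, 1.0, 100_000)

# Test set: synthetic scores where one group is deliberately shifted.
test_scores = {
    "male": {"impostor": rng.normal(0.0, 1.0, 50_000),
             "genuine": rng.normal(4.0, 1.0, 5_000)},
    "female": {"impostor": rng.normal(0.5, 1.0, 50_000),
               "genuine": rng.normal(3.5, 1.0, 5_000)},
}

# A single threshold for every group, fixed on the development set.
tau = threshold_at_fmr(dev_impostors, target_fmr=1e-2)

rates = {
    group: {"FMR": fmr(s["impostor"], tau), "FNMR": fnmr(s["genuine"], tau)}
    for group, s in test_scores.items()
}
for group, r in rates.items():
    print(f"{group}: FMR={r['FMR']:.4f} FNMR={r['FNMR']:.4f}")
```

Because τ is shared, any shift between the score distributions of the two groups appears directly as a gap in FMR(τ) and FNMR(τ), which is the quantity that FDR summarises.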

TABLE 9: MOBIO - Inception Resnet v2: FNMR(τ), FMR(τ), and FDR(τ) per gender in the test set. These figures of merit are fragmented by the gender of the samples used for enrollment and the gender of the samples used for probe ("(e-p)" in the table).*
* In this example τ = FMR_x^dev, where dev is the development set. FMR(τ), FNMR(τ) and FDR(τ) are reported using the test set.

Table 10 (b) demonstrates the outcome of this exercise by showing the Area Under FDR in this new range. It can be noticed that under this new decision rule, the COTS is the fairest one.

TABLE 10: MOBIO: Area Under the Fairness Discrepancy Rate for: (a) x varying from 10^1 to 10^5 and (b) x varying from 10^2 to 10^5.

TABLE 11: MOBIO - Gabor Graph: FNMR(τ), FMR(τ), and FDR(τ) per gender in the test set. These figures of merit are fragmented by the gender of the samples used for enrollment and the gender of the samples used for probe ("(e-p)" in the table).*
* In this example τ = FMR_x^dev, where dev is the development set. FMR(τ), FNMR(τ) and FDR(τ) are reported using the test set.