Adversarial Learning of Mappings Onto Regularized Spaces for Biometric Authentication

We present AuthNet: a novel framework for generic biometric authentication which, by learning a regularized mapping instead of a classification boundary, leads to higher performance and improved robustness. The biometric traits are mapped onto a latent space in which authorized and unauthorized users follow simple and well-behaved distributions. In turn, this enables simple and tunable decision boundaries to be employed in order to make a decision. We show that, in contrast to deep learning and traditional template-based authentication systems, regularizing the latent space to simple target distributions leads to improved performance as measured in terms of Equal Error Rate (EER), accuracy, False Acceptance Rate (FAR) and Genuine Acceptance Rate (GAR). Extensive experiments on publicly available datasets of faces and fingerprints confirm the superiority of AuthNet over existing methods.


I. INTRODUCTION
Biometric authentication systems are drawing increasing attention thanks to their convenience: users are authenticated based on information they inherently own, avoiding the need to remember passwords or provide keys. The typical approach followed by such systems is based on template matching: each biometric trait is associated with a template which should embed its most discriminative features. Hence, all templates of a biometric trait belonging to the same user should be close in some suitable distance metric. Once the user's face, fingerprint or other biometric trait has been acquired through a dedicated sensor, it is processed in order to obtain the corresponding templates, which are then stored in a secure fashion. This phase, which is referred to as the enrollment, prepares the system to grant access only to the enrolled users. At this point the system can be used in the verification phase: a fresh biometric trait of a user requesting authentication is acquired, the associated template is computed and then matched with the stored ones. Depending on the outcome of the matching process, the user can be either granted or denied access to the system.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhenhua Guo.
Focusing on authentication accuracy, the most critical part of a biometric authentication system is the feature extraction. Indeed, the extracted features not only have to be the most discriminative ones but should also be embedded in a proper metric space, in order to enable the template matching. Traditionally, features were extracted by means of hand-crafted design. However, the advent of deep learning methods highlighted the great advantage of learning the best features from data instead of using a model-based design, both for learning complex mappings [1], [2] and for addressing difficult classification tasks [3].
When considering deep learning approaches, the biometric authentication problems are usually addressed by learning a feature embedding in which a template is able to represent the most discriminative features of a specific biometric trait class in a suitable space. Similarly to standard biometric authentication systems, the learned features are shared among different users and the template matching is based on a distance measure between two or more embeddings. In this work we follow a different path: with AuthNet we rely on a classification-based approach in which the neural network not only learns the most discriminative features of a specific user's biometric traits, but also learns the boundaries which can separate that specific user with respect to every other user.
As classification-based approaches require per-user training, they trade off the added complexity for improved, user-specific features. Note that the training process of embedding-based networks requires a large amount of labelled data, as the network has to learn the very general features of the data class. The classification-based approach avoids this since, with user-specific training, the network learns the most discriminative features of that specific user. Conversely, embedding-based approaches learn the features of the considered class as a whole, e.g. faces, and may fail on a specific user.
In this regard, it is important to underline that deep learning based classification learns highly non-linear boundaries with complex shapes in order to partition the feature space [4]. As shown in [4], the geometry of the decision boundaries heavily affects the robustness of the classifier. More specifically, as discussed in [5] most of the mass of the data points gathers close to the decision boundaries. As such, two similar biometric traits of a user may be assigned to different classes, leading to an error. Moreover, this undesirable behavior is an intrinsic property of the classifier structure and does not depend on the visual properties of the input data [5].
For the above reasons, we propose a novel user-specific classification strategy which does not explicitly force the network to learn complex classification boundaries. Instead, we envision a network design which learns a mapping of the input biometric traits onto a regularized and well-behaved latent space. By following this approach, the feature distributions are regularized so as to lead to simple and tunable boundaries between the classes, thereby reducing the probability of misclassification. In particular, we aim to obtain "non-arbitrary" boundaries which can lead to improved accuracy and increased robustness.
The first step consists in learning a compact and meaningful mapping of the input biometric traits onto the latent space. The latent space should be shaped in a simple and well-defined way: authorized and unauthorized users should cluster in two different and compact regions of the space, leading to very regular boundaries. Then, a decision is made by employing a linear decision boundary to discriminate between the authorized user and everyone else. This system, which we will refer to as AuthNet, makes use of adversarial training in order to enforce a proper shaping of the latent space. With this paper we extend and improve our previous work [6] in several ways. We introduce a new loss function via the selection of better statistical parameters. We provide an in-depth discussion of how AuthNet correctly maps users that are misclassified by other approaches, and of the reasons behind the higher misclassification rate of competing methods. We introduce new architectural designs in two flavors, based on ResNet [7] and DenseNet [8], with a detailed performance comparison both on the average values of the considered metrics computed independently on each user and on aggregated scores. We further provide a detailed analysis of robustness on new datasets not seen during training and on targeted perturbations, and verify how regularizing the latent space to simple target distributions leads to more robust authentication than learning the boundaries. Finally, we add a discussion on the choice of the optimal system parameters.

II. RELATED WORK
Over the years, different methods have been proposed to address the biometric authentication task when dealing with different biometric traits such as faces, fingerprints, retinas and gait. In this work, we specifically focus on the most widespread biometric modalities, namely face and fingerprint.

A. FACES
The face as a practical biometric modality has appeared only recently because of the inherent difficulty in handling far from ideal acquisition conditions. Indeed, standard model-based approaches tend to exhibit a high variance with respect to pose and/or illumination changes. A pioneer in this sense is the well-known eigenfaces approach [9]: the features used to describe the faces are obtained by projecting the test image onto the space spanned by the eigenvectors computed on the training data. Some of its weaknesses have been surpassed with the introduction of the Fisherfaces method [10] in which the projection operator is learned in a supervised fashion in order to maximize (minimize) inter (intra) class variance. This approach allowed a higher degree of invariance with respect to illumination changes. Other standard approaches are based on low-dimensional representations of the faces; examples include sparse representations [11], [12], linear subspace [13], [14] and manifold [15] representations. Following a different approach, [16], [17] attempt to overcome the limitations in handling facial changes by employing local features.
The largest performance improvement has been achieved by means of deep learning methods. These made it possible to obtain excellent performance in far from ideal acquisition conditions with varying pose, expression and illumination, see e.g. DeepFace [18]. One of the most well-known methods is FaceNet [19], which uses a triplet loss in order to learn embeddings of the input images. More specifically, the network is trained in such a way that the embeddings preserve the notion of image similarity in terms of ℓ2 distance in the embedding space. However, because of the instability arising during triplet-loss training, it is common to train the network with a softmax cross-entropy loss. Nevertheless, in this case the intra-class compactness and inter-class dispersion are not guaranteed. A more recent approach named ArcFace [20] introduced the additive angular margin loss to improve the discriminative power of the learned embedding whilst leading to a stable training process. A few other works also adopt the same strategy, e.g. [21]-[25].
All the above works rely on the recent trend in face recognition based on embedding computation and matching. Indeed, most research efforts are spent on the design of novel loss functions which can lead to more effective and/or stable embeddings.
In this regard, let us better highlight the scope of AuthNet with respect to recent trends in unconstrained face recognition. In this work we specifically focus on the biometric authentication problem for which, apart from achieving a high recognition accuracy, it is even more crucial to reduce the number of wrongly authorized users. For the same reason, it is common to assume that the user puts him/herself in a controlled condition; as such, the face datasets we consider are those commonly used for biometric authentication tasks, see [26], [27]. Conversely, recent "in-the-wild" face datasets, because of the large number of users and poses, are better suited for the evaluation of recognition and clustering tasks. Lastly, such datasets do not cope well with a user-specific training procedure as done in AuthNet, since the number of samples per user is very limited.

B. FINGERPRINTS
The fingerprint was one of the first biometric traits to be commonly used in practical systems. As such, most of the approaches rely on standard template matching based on hand-crafted features computed from minutiae, ridge and valley patterns or the global intensity image. In general, they can be categorized based on the use of either global or local features. Among the methods relying on global fingerprint features we mention the works in [28], [29]. Conversely, the approaches proposed in [30]-[34] rely on descriptors making use of local information of the minutiae and their neighbourhood. Additionally, in works such as [35] it has been shown that performance improvements can be achieved when additional information such as shape context and orientation is included. In the last few years, new approaches have been proposed in order to take advantage of the representational capabilities of deep learning, for example in order to improve the robustness of minutiae extraction and classification. Examples include [36], [37], in which Convolutional Neural Networks (CNN) are used to extract minutiae from raw fingerprint images, and [38], where a stacked autoencoder is used to classify fingerprints into arch, left/right loop, and whorl. In [39] minutiae are filtered using a neural network to improve detection, whereas in [40] the authors use a neural network to extract the minutiae on thinned fingerprint images. Latent fingerprint minutiae extraction based on CNN has also been proposed in [41].

III. PROPOSED METHOD
In this section we introduce and describe the components of the proposed architecture for biometric authentication as shown in Fig. 1. As previously discussed, AuthNet strives to find a well-behaved representation of the input biometric traits in some latent space, which in turn enables simple decision boundaries to be used for the classification task. More specifically, as described in the following, we want to learn a mapping from a sample in the biometric space onto a sample of target probability distributions for authorized and unauthorized users. Ideally, the distance (in some suitable metric) between the probability distribution of the samples resulting from the mapping and the target one should be minimal. One of the most widely used approaches to tackle this kind of problem is by means of an adversarial game.

FIGURE 1. The goal of AuthNet is to map the input biometric traits onto target distributions in the latent space. Authorized users (blue) are mapped to a target distribution whose mean value is far from that of the unauthorized users (red).

A. ADVERSARIAL LEARNING
Adversarial models are now a very widespread approach to generative models. The first generative model trained by means of an adversarial loss, the Generative Adversarial Network (GAN) [1], gained immediate popularity and opened the path to the field of adversarial training.
A GAN tries to implicitly learn the probability distribution of the input data in such a way that the network is then able to generate samples similar to the input data. In other words, the network learns to minimize a distance metric between the distribution of the generated samples and that of the real data. The distance metric employed by GAN is Jensen-Shannon (JS) divergence which, interestingly, is the optimal solution of a two-player adversarial game. The main idea behind adversarial models is to reach the minimum of a functional defined as a minimax game where two entities have adversarial (opposite) goals. The global optimum corresponds to the equilibrium solution between the locally optimal solutions of the single entities. Within the deep learning framework, the two entities called generator and discriminator are modeled as neural networks and the minimax game is introduced in the loss function in order to make the two networks compete against each other during the training process. In more detail, the discriminator should be able to correctly discriminate between generated and real samples, while the generator should be able to generate samples which are realistic enough to fool the discriminator.
In AuthNet, as described in detail in the following, samples of the data distribution are mapped onto a latent representation which follows a target distribution. This can be considered as the inverse mapping of a conventional GAN, in which samples of a fixed distribution are mapped onto the captured distribution of the data.

FIGURE 2. AuthNet-R architecture at enrollment phase. Training biometric traits are given as input to the encoder, which consists of an 18-layered residual network followed by a fully connected layer. The output of the encoder, together with a one-hot vector and samples of the target distributions, is given as input to the discriminator, which is made of 6 fully connected layers.

FIGURE 3. AuthNet-R architecture at authentication phase. In this phase the biometric trait of a user requesting access is given to the pre-trained encoder, which will output a sample z coming from either P_0 or P_1. Then, the thresholding decision is made and a binary output (accept or reject) is returned.

B. LATENT MAPPING
We are now ready to provide the details of AuthNet whose main concept is depicted in Fig. 1.
Let B = {B_{a=0}, B_{a=1}} denote the set of all possible biometric traits and a ∈ {0, 1} an indicator variable such that a = 1 represents the authorized user and a = 0 represents all other unauthorized users. Moreover, let us define as x ∈ R^n a generic biometric trait in B and as z ∈ R^d its latent representation, with d < n. The goal is to learn an encoding function z = H(x) of the input biometric trait such that z ∼ P_1 if x ∈ B_{a=1} and z ∼ P_0 if x ∈ B_{a=0}, with P_1 and P_0 the target distributions in the latent space. If the distributions P_1 and P_0 are well-behaved, a simple distance-based thresholding approach can be employed to determine whether the user with its associated biometric trait x is authorized or not.
Let us set P_1 = N(µ_1, σ_1 I) and P_0 = N(µ_0, σ_0 I) to be Gaussian. This amounts to enclosing the energy of the latent representation of authorized and unauthorized users within hyperspheres whose radius depends on both d and the distribution parameters. For the sake of simplicity and without loss of generality, we set E[z_1] < E[z_0], with z_1 ∼ P_1 and z_0 ∼ P_0, and σ_1 = σ_0. If the distributions are taken as Gaussian with the same variance, a hyperplane is the optimal decision boundary, which further boils down to a simple threshold when z is a scalar. This leads to a very simple classifier, which learns a complex mapping to a high-dimensional latent space, in a way that mimics kernel-based methods.
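The optimality of a hyperplane for equal-variance Gaussian targets follows from the log-likelihood ratio. As a sketch, assuming equal priors on the two classes and writing σ² for the common variance of the two targets (both are assumptions of this illustration), with p_1 and p_0 the densities of P_1 and P_0, one obtains:

```latex
\begin{aligned}
\log\frac{p_1(z)}{p_0(z)}
  &= \frac{1}{2\sigma^2}\left(\lVert z-\mu_0\rVert^2-\lVert z-\mu_1\rVert^2\right) \\
  &= \frac{1}{\sigma^2}\left[(\mu_1-\mu_0)^{\top}z
     -\frac{\lVert\mu_1\rVert^2-\lVert\mu_0\rVert^2}{2}\right].
\end{aligned}
```

The ratio is linear in z, so the set where it vanishes is a hyperplane. For d = 1 with µ_1 < µ_0, requiring the ratio to be non-negative reduces to z ≤ (µ_1 + µ_0)/2.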

Modes of Operation:
AuthNet operates in two phases, an enrollment phase and an authentication phase. During the enrollment phase (see Fig. 2), users are registered in the system based on the training data. Latent representations of authorized users are forced to follow P_1, whereas latent representations of unauthorized users are forced to follow P_0, based on the one-hot label vector. Once the enrollment phase is completed, in the following authentication phase (see Fig. 3) the latent representations of the input biometric traits are tested against the target distributions to find out whether the test biometric trait belongs to the authorized user class or to the class of unauthorized users. For d = 1, if the metric value is less than the threshold, i.e. z ∼ P_1, the user is categorized as an authorized user; otherwise the user is categorized as an unauthorized user.

C. ENROLLMENT
During the enrollment phase the goal is to learn an encoding function H(x) which maps the user biometric traits onto the target distributions. The optimal H(x) is the one for which a distance metric between {H(x) : x ∈ B_{a=1}} and P_1, and between {H(x) : x ∈ B_{a=0}} and P_0, is minimized. To address this problem we propose to employ an adversarial model whose optimum is reached when the JS divergence between the latent mapping and target distribution is minimized.
The AuthNet architecture at enrollment phase is depicted in Fig. 2. It is made of two competing neural networks: an encoding function H(x, θ_h) having parameters θ_h and a discriminator D(p, θ_d) with parameters θ_d. For the sake of readability, unless needed, we will drop the parameters in the notation of the encoding and discriminator networks.
The encoding function H(·) takes as input the biometric traits x and outputs their encoded latent representation z. The discriminator D(p) takes as input the vector p ∈ {s, z}, namely it is given in an alternate fashion either a sample from one of the target distributions s or the encoded latent representation z. The vector s ∈ R^d is made of randomly drawn samples from the target distribution P_1 if x ∈ B_{a=1} or P_0 if x ∈ B_{a=0}. In order to improve the stability and performance of the training process, the input biometric trait label a is given to the discriminator as additional information: a acts as a switch to select a "subdiscriminator" function for either authorized or unauthorized users.
The discriminator D(p) outputs a scalar value which can be interpreted as the probability of the given input coming either from the encoding function or from the target distribution.
The loss function we consider to address the above-defined adversarial setting is optimized as a minimax two-player game, where the optimization is carried over the parameters θ_h and θ_d in an alternate fashion. Being an adversarial model, the specific goal of the encoding function H(x) is to generate samples which, when given to the discriminator, minimize the probability of D making a correct choice, i.e. generate samples z which will fool the discriminator. The task of the discriminator D(p) is to maximize the probability of assigning the correct label to both latent representations z and samples from the target distribution s.
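For reference, a minimax objective consistent with the definitions above can be sketched as follows. It assumes the standard GAN form of [1] conditioned on the label a, with D(·, a) read as the probability that its input was drawn from the target distribution P_a; the exact form and weighting of the terms are assumptions of this sketch.

```latex
\min_{\theta_h}\;\max_{\theta_d}\;
\mathcal{L}(\theta_h,\theta_d)
  = \mathbb{E}_{(x,a)}\Big[\,
      \mathbb{E}_{s\sim P_a}\big[\log D(s,a;\theta_d)\big]
      + \log\big(1-D(H(x;\theta_h),a;\theta_d)\big)\Big]
```

Under this reading, the discriminator increases the objective by scoring target samples s close to 1 and encoded samples H(x) close to 0, while the encoder decreases it by producing codes that the discriminator scores close to 1.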
At the beginning of the learning phase, the discriminator quickly learns how to distinguish the latent representation z and the samples from the target distribution s. After some iterations, the encoder learns to generate samples which are closer to the target distributions. Eventually, the encoder will start to generate samples z which are close enough to s so that the discriminator is not able to distinguish between them.
In the case of AuthNet, as commonly done for adversarial models, these two objectives are optimized in an alternate fashion: one step for the discriminator followed by one for the encoder.
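The two alternating objectives can be illustrated with a minimal sketch. It assumes a standard GAN-style log loss in which a discriminator output close to 1 means "drawn from the target distribution"; the function names `discriminator_loss` and `encoder_loss` are hypothetical and the actual loss used by AuthNet may differ.

```python
import numpy as np

def discriminator_loss(d_target, d_encoded, eps=1e-12):
    """Loss minimized in the discriminator step (assumed GAN log loss).

    d_target:  discriminator outputs on samples s from the target distribution
    d_encoded: discriminator outputs on latent codes z = H(x)
    """
    d_target = np.asarray(d_target, dtype=float)
    d_encoded = np.asarray(d_encoded, dtype=float)
    return float(-(np.mean(np.log(d_target + eps))
                   + np.mean(np.log(1.0 - d_encoded + eps))))

def encoder_loss(d_encoded, eps=1e-12):
    """Loss minimized in the encoder step: mean of log(1 - D(H(x))).

    It decreases as the discriminator is fooled into scoring z like a target sample.
    """
    d_encoded = np.asarray(d_encoded, dtype=float)
    return float(np.mean(np.log(1.0 - d_encoded + eps)))
```

In an alternating schedule, one gradient step on `discriminator_loss` with respect to θ_d would be followed by one step on `encoder_loss` with respect to θ_h.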

D. AUTHENTICATION
For AuthNet, during the authentication phase only the trained encoder network is utilized. This network computes the latent representation z of the input biometric trait. Then, a decision is made according to this value. As said, for our choice of target distributions a hyperplane can be used for the optimal decision. For d = 1, this boils down to comparing the scalar z with a threshold τ = (µ_1 + µ_0)/2 (see Fig. 3).
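The decision rule can be sketched in a few lines. This is an illustration under the assumption of equal-variance Gaussian targets and equal priors, for which the optimal hyperplane is the one equidistant from the two means; the function name `authenticate` is hypothetical.

```python
import numpy as np

def authenticate(z, mu1, mu0):
    """Return True (accept) if z lies on the P_1 side of the hyperplane
    equidistant from mu1 (authorized mean) and mu0 (unauthorized mean)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    mu1 = np.atleast_1d(np.asarray(mu1, dtype=float))
    mu0 = np.atleast_1d(np.asarray(mu0, dtype=float))
    w = mu1 - mu0                              # normal vector of the hyperplane
    b = 0.5 * (mu1 @ mu1 - mu0 @ mu0)          # offset of the hyperplane
    return bool(w @ z > b)
```

For d = 1 with µ_1 < µ_0 the test `w @ z > b` is equivalent to z < (µ_1 + µ_0)/2, so `authenticate(3.0, 0.0, 40.0)` accepts while `authenticate(25.0, 0.0, 40.0)` rejects.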

IV. TRAINING AND IMPLEMENTATION DETAILS

A. NETWORK INSIGHT

1) ENCODER SUB-NETWORK
A biometric trait in the form of either an RGB or a gray-scale image, with size depending on the employed dataset, is given as an input to the encoder sub-network. The choice of the encoder is crucial. In general, one may employ any state-of-the-art neural network architecture able to learn good features. To prove the idea, we conducted experiments on several neural network architectures such as plain CNN, ResNet [7] and DenseNet [8] with different numbers of layers. For the considered datasets, it was empirically found that either ResNet-18 or DenseNet-50, followed by a fully connected layer having an output of size d, is sufficient to effectively learn the latent mapping. It is important to notice that in this last layer of the encoder network we do not use any non-linear activation, as the output should be mapped to a sample of the target distributions. Further, it was found that if a network with too many parameters, like ResNet-101/152 or DenseNet-121/169, is employed for small/medium sized datasets, it leads to slower training without performance improvement. This motivates us to use ResNet-18 / DenseNet-50 as the encoder sub-network.
In the following sections we will refer to AuthNet with ResNet encoder sub-network as AuthNet-R and to AuthNet employing DenseNet encoder as AuthNet-D.

2) DISCRIMINATOR SUB-NETWORK
The discriminator sub-network has three main inputs: i) samples from the target prior distributions, ii) the latent vector z output by the encoder sub-network, having size d, and iii) the one-hot vector a used during the training process to tell the discriminator whether the sample is authorized or unauthorized. The discriminator is a fully connected network consisting of 8 layers with the ReLU activation function employed at the output of each layer. This number was chosen empirically so that the discriminator has enough capacity to compete with the given encoder (i.e. ResNet-18 or DenseNet-50), which leads to a stable training; the chosen layer sizes worked well across different values of d. Indeed, the layer size depends on the structure of the encoder sub-network: if the discriminator layers are not properly sized, the encoder loss might quickly drop to zero, thus stopping the training. We found that 8 discriminator layers are enough to cope with the "capacity" (or the number of parameters) of the encoder sub-network.
The input of the discriminator sub-network is the concatenation of the latent vector z from the encoder sub-network and the one-hot vector a indicating the class to which the corresponding user belongs. The first fully connected layer has an output size equal to 100. This size gradually increases to a maximum of 1000. After this, the size gradually decreases, with the final layer having an output of size 1 to which a sigmoid activation is applied, estimating the probability that the sample is coming from the encoder or the target prior distribution.
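A forward pass through such a discriminator can be sketched with NumPy. Only the first width (100), the maximum width (1000), the final width (1) and the number of layers (8) come from the description above; the intermediate widths, the weight initialization and the function names are hypothetical choices made for this illustration.

```python
import numpy as np

def init_discriminator(d, widths=(100, 250, 500, 1000, 500, 250, 100, 1), seed=0):
    """Random parameters for an 8-layer fully connected network.
    The input has size d + 2: a latent vector plus a one-hot label.
    Intermediate widths are assumed, not taken from the paper."""
    rng = np.random.default_rng(seed)
    sizes = (d + 2,) + tuple(widths)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def discriminator_forward(params, p, a):
    """p: (batch, d) latent codes or target samples; a: (batch,) labels in {0, 1}."""
    one_hot = np.eye(2)[np.asarray(a, dtype=int)]
    h = np.concatenate([np.asarray(p, dtype=float), one_hot], axis=1)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)       # ReLU on the hidden layers
    W, b = params[-1]
    logit = h @ W + b
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid output in (0, 1)
```

Feeding the label as a one-hot vector alongside p is what lets a single network behave as two label-specific sub-discriminators.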

3) PREPROCESSING AND TRAINING PARAMETERS
The network is trained with the Adam optimizer [42], using an iterative algorithm as discussed in [1]; the optimization is carried out one step for the encoder and one for the discriminator. Weight decay is set to 0.0004 and a dropout of 0.7 is used. The learning rate is set to 0.01 for the first 5000 iterations and is then decreased by a factor of 10 after every 5000 iterations. In total, the network is trained for 30000 iterations. The only pre-processing employed for AuthNet on all considered datasets is energy normalization of the input images.
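The learning rate schedule described above is a plain step decay, which can be written as a small helper. The function name `learning_rate` and its parameter names are hypothetical; the default values are the ones stated in the text.

```python
def learning_rate(iteration, base_lr=0.01, step=5000, factor=10.0):
    """Step schedule: base_lr for the first `step` iterations, then divided
    by `factor` after every further `step` iterations."""
    return base_lr / (factor ** (iteration // step))
```

With the stated settings, iterations 0 to 4999 use 0.01, iterations 5000 to 9999 use 0.001, and the last block of the 30000 iterations uses a rate of about 1e-7.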

B. DATA AUGMENTATION
Having a diverse and large dataset is crucial for deep neural networks training. The performance of a neural network depends upon the features learned from the training data. In the case of biometric authentication the acquisition process should be fast and usually the number of acquired samples during the enrollment is very limited. An efficient augmentation strategy is hence needed, so that enough data are provided to the network. In addition, we aim to have a general purpose augmentation strategy which could work for different biometric traits.
As summarized in Fig. 4, our augmentation process is based on both image crops and sample mixup. For each sample of size m × m, all possible crops of size n × n are extracted. Since the number of positive samples (authorized users) is much smaller than the number of available negative samples (unauthorized users), we employ two different augmentation factors, namely F and F_1. The former refers to the augmentation factor for the positive samples, the latter for the negative ones. Clearly, in our case F > F_1.
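The crop step can be sketched as follows, assuming square images stored as NumPy arrays; the function name `all_crops` is hypothetical. How the augmentation factors F and F_1 select among the available crops is not specified here, so the sketch simply enumerates all of them.

```python
import numpy as np

def all_crops(image, n):
    """Return every n x n crop of a square m x m image.
    Trailing axes (e.g. color channels) are kept unchanged."""
    m = image.shape[0]
    if image.shape[1] != m or n > m:
        raise ValueError("expected a square image with n <= m")
    return np.stack([image[i:i + n, j:j + n]
                     for i in range(m - n + 1)
                     for j in range(m - n + 1)])
```

An m × m sample yields (m − n + 1)² crops, so even a small margin between m and n multiplies the number of available samples considerably.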
After obtaining multiple crops of the samples, positive and negative training samples are mixed using a convex combination as described in [43] in order to create more diverse training samples. As a side advantage, as shown in [43], the mixup also helps to regularize and improve the network generalization. Given a positive and a negative sample, respectively denoted as x_{a=1} and x_{a=0}, a new sample is fabricated as x_m = λ x_{a=1} + (1 − λ) x_{a=0}, where λ ∈ [0, 1] follows a Beta distribution with parameters α and β that in our case are both fixed to 0.4. This parameter choice results in a distribution peaked at 0 and 1 which achieves the lowest probability for λ = 0.5. This avoids creating augmented samples that are too distant from the centroid of either class. To associate a label to a newly created sample, we use l = round(λ).
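A minimal sketch of this mixup step is given below. The function name `mixup_pair` is hypothetical, and the tie at exactly λ = 0.5 (an event of probability zero) is resolved towards the positive class as an arbitrary choice.

```python
import numpy as np

def mixup_pair(x_pos, x_neg, alpha=0.4, beta=0.4, rng=None):
    """Mix one positive and one negative sample with lambda ~ Beta(alpha, beta)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, beta)
    x_mix = (lam * np.asarray(x_pos, dtype=float)
             + (1.0 - lam) * np.asarray(x_neg, dtype=float))
    label = int(lam >= 0.5)   # l = round(lambda); 1 means authorized
    return x_mix, label, lam
```

Because Beta(0.4, 0.4) concentrates its mass near 0 and 1, most mixed samples stay close to one of the two originals, and the rounded label matches the dominant one.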

V. PERFORMANCE ANALYSIS
AuthNet is a general purpose network designed to seamlessly work on different types of biometric traits. We have conducted experiments on faces and fingerprints. In biometric authentication systems it is common to assume that the user puts him/herself in a controlled condition for the biometric traits acquisition. In this regard, the datasets we consider are among the biggest ones acquired in such conditions.

A. DATASETS
For face authentication, we evaluate our method on CMU Multi-PIE [44] and Yale Face database DB2 [45].
CMU Multi-PIE consists of 750,000 images of 337 subjects. The dataset is acquired over a span of 5 months in four different sessions. The dataset consists of images having 15 view points and 19 illumination conditions. It contains images with different poses, illuminations and expressions. We consider the frontal posed images with different expressions and illuminations to highlight the robustness of the algorithm. Indeed, as hinted above we assume to have controlled acquisitions, which leads us to consider only the frontal pose. However, to keep high intra-class variability, we do not fix other sources of noise such as facial expressions and illumination conditions.
For each enrolled user, 75% of the samples are employed for training and the remaining 25% are left for testing. For unauthorized users, out of 128 users, the samples of 96 users are drawn for training and the samples of the remaining 32 users are left for testing. Further, train and test splits are made in such a way as to avoid sharing the same facial expressions or illumination conditions, thus reducing the probability of overfitting.
Samples are resized to 144×192×3 maintaining the aspect ratio. To create more diverse samples, positive and negative users samples are combined through a mixup strategy as discussed in Sec. IV-B.

B. EVALUATION METRICS
The main metric we will use in our experiments is Equal Error Rate (EER) defined as the value at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). Given a threshold τ , the FAR indicates the number of accepted samples that should have been rejected over the total number of samples. Conversely, the FRR indicates the number of rejected samples which should have been accepted over the total number of samples.
It is important to notice that in biometric authentication systems the FAR is a critical parameter: a large value indicates a high number of unauthorized users wrongly authorized by the system. This situation is indeed more dangerous with respect to having high false rejections of authorized users (large FRR). For good biometric systems a minimum FAR is desired. For this reason we also test the systems at small values of FAR: we report the Genuine Acceptance Rate (GAR), namely the relative number of correctly accepted users, at FAR equal to 10^-2 and 10^-3. Finally, we report the maximum accuracy, defined as the value at which the number of correctly classified samples is maximized.
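These metrics can be computed from the scalar scores with a simple threshold sweep. The sketch below assumes the d = 1 convention in which a sample is accepted when its score is below τ, and it normalizes FAR by the number of unauthorized samples and FRR by the number of authorized samples, which is the common convention and an assumption here. All function names are hypothetical.

```python
import numpy as np

def far_frr(genuine, impostor, tau):
    """FAR and FRR at threshold tau, accepting when score < tau."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    far = np.mean(impostor < tau)     # unauthorized samples wrongly accepted
    frr = np.mean(genuine >= tau)     # authorized samples wrongly rejected
    return float(far), float(frr)

def _thresholds(genuine, impostor):
    scores = np.concatenate([genuine, impostor]).astype(float)
    return np.unique(np.concatenate([scores, [scores.max() + 1.0]]))

def equal_error_rate(genuine, impostor):
    """Return (EER, tau) at the threshold where FAR and FRR are closest."""
    taus = _thresholds(genuine, impostor)
    rates = np.array([far_frr(genuine, impostor, t) for t in taus])
    k = int(np.argmin(np.abs(rates[:, 0] - rates[:, 1])))
    return float(rates[k].mean()), float(taus[k])

def gar_at_far(genuine, impostor, target_far):
    """Largest GAR = 1 - FRR over thresholds with FAR <= target_far."""
    best = 0.0
    for t in _thresholds(genuine, impostor):
        far, frr = far_frr(genuine, impostor, t)
        if far <= target_far:
            best = max(best, 1.0 - frr)
    return best
```

On finite test sets FAR and FRR only take discrete values, so the EER is approximated by the threshold at which the two rates are closest rather than exactly equal.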
In the results section, the metrics are first computed independently for each of the considered users, reporting the resulting average values and their relative standard deviations. This will give insights on how the system performs, on average, on a per-user basis. Additionally, to gain a better understanding of the overall performance, we also report the aggregated results on the scores of all users and illustrate the Receiver Operating Characteristic (ROC) curve computed on the aggregated scores of the considered users.

C. DIMENSIONALITY OF LATENT SPACE
An important parameter in the design of AuthNet is the choice of the latent space dimensionality d. The datasets we are considering are medium sized, thus it is not surprising that a smaller d achieves better results. In case of large datasets, a larger latent space improves the data separation and leads to improved performance.
In our tests, we fixed the hyperparameter d = 1, since in our experiments this choice gave us better results, as can be seen in Tab. 1. Intuitively, as the latent space grows in dimensionality, a larger number of training samples is required to avoid overfitting. As an example, MultiPIE is larger than the Yale dataset, and it can be observed from Tab. 1 that for MultiPIE a higher GAR is achieved at larger values of d compared to the Yale dataset.

D. PARAMETERS OF AUTHORIZED AND UNAUTHORIZED USERS DISTRIBUTIONS
In AuthNet the authorized and unauthorized target distributions are set to be Gaussian. This choice comes from the fact that the output of a (large enough) fully connected layer will, by the central limit theorem, naturally tend to be Gaussian distributed [49], [50]. We set the distributions to be $P_1 = \mathcal{N}(0, 1)$ and $P_0 = \mathcal{N}(40, 1)$. We choose $\mu_1 = 0$ and $\mu_0 = 40$ to be different enough to keep the distributions far apart from each other. Further, we set $\sigma_1 = \sigma_0 = 1$, as this choice leads to simple decision boundaries: for the Gaussian discrimination problem, if $\sigma_1 = \sigma_0$ then a linear decision boundary (hyperplane) is optimal. In more detail, in Fig. 5 we show the maximum accuracy obtained by AuthNet, together with the skewness and kurtosis of the latent representation, as a function of $(\mu_0 - \mu_1)/\sigma$ for a randomly selected CMU-MultiPIE user. It can be seen that the region for which the accuracy is maximum corresponds roughly to $15 \le (\mu_0 - \mu_1)/\sigma \le 45$; in this region, skewness and kurtosis are close to 0 and 3 respectively, showing that the training indeed converges to Gaussian distributions. Further, if $(\mu_0 - \mu_1)/\sigma$ is too large, the training process becomes unstable and the distributions become far from Gaussian.
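To make the claim about the linear decision boundary concrete, the following short derivation works out the optimal threshold for the scalar case d = 1. It is a sketch under two assumptions that are not stated in the text: equal prior probabilities for the two classes, and latent outputs that follow the target distributions exactly. Here $p_1$ and $p_0$ denote the densities of $P_1$ and $P_0$, and $Q(\cdot)$ is the standard Gaussian tail function.

```latex
% Assumptions: equal priors, P_1 = N(\mu_1, \sigma^2), P_0 = N(\mu_0, \sigma^2), \mu_0 > \mu_1.
\begin{align}
\log\frac{p_1(z)}{p_0(z)}
  &= \frac{(z-\mu_0)^2 - (z-\mu_1)^2}{2\sigma^2}
   = \frac{(\mu_0-\mu_1)\,(\mu_0+\mu_1-2z)}{2\sigma^2}, \\
\log\frac{p_1(z)}{p_0(z)} \ge 0
  &\iff z \le \frac{\mu_0+\mu_1}{2} = 20
  \qquad (\mu_1 = 0,\ \mu_0 = 40), \\
P_{\mathrm{err}}
  &= Q\!\left(\frac{\mu_0-\mu_1}{2\sigma}\right) = Q(20)
  \qquad (\sigma = 1).
\end{align}
```

Under these assumptions the decision reduces to comparing z with the midpoint of the two means, and the ideal error probability is negligible; in practice the measured error is determined by how closely the learned latent distributions match the targets.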

E. RESULTS
Before presenting the results it is important to consider that the precision of the performance metrics we consider depends on the number of test samples. The maximum precision which can be obtained for the considered metrics (explained in Sec. V-B) is given by $1/c$ with $c = \min\{L \times F, Q \times F_1\}$. Therefore, we verify that the proposed augmentation strategy does not introduce any bias on the measured performance. As can be seen in Fig. 6, augmentation avoids coarse quantization of probability values without introducing any bias. For this reason, the metrics we consider from now on are computed on the augmented dataset.
For our results, in addition to biometric-related methods, we also include the comparison with the encoder network of AuthNet-R used as a classifier and trained with a sigmoid cross-entropy loss. In Sec. I we discussed the issue of classifiers having boundaries that are highly non-linear and complex to analyze. Therefore, we evaluate the behavior of a deep learning classifier based on the same architecture as the AuthNet-R encoder but which is not trained in an adversarial way, in order to assess the benefits of the adversarial scheme employed in AuthNet.
Tab. 2 presents the results achieved by AuthNet and the benchmarking methods in terms of EER, GAR values at FAR = $\{10^{-2}, 10^{-3}\}$ and maximum accuracies on the individual and aggregated scores. Figs. 7-9 depict the histograms of the aggregated scores obtained by the different methods. The ROC comparison for the different benchmarking methods is depicted in Fig. 10. Lastly, for the sake of readability, unless differently specified from now on we will refer to both AuthNet-R and AuthNet-D as ''AuthNet''.
[Figure caption fragment: ... [19] and ArcFace [20] in (c); with AuthNet encoder classifier, VeriFinger [51] and the hybrid approach [35] in (b). In all the cases, AuthNet (red and black curves) achieves higher GAR with respect to other authentication schemes at different values of FAR.]

1) FACE AUTHENTICATION
The datasets we employ for face authentication are CMU Multi-PIE and Yale Face database B, as detailed in Sec. V-A. For benchmarking with state-of-the-art deep learning techniques, we compare with ArcFace [20] and FaceNet [19]. FaceNet and ArcFace tend to work better on aligned face patches. For CMU-Multi-PIE, we pre-process the dataset by aligning and cropping the input faces using the well-known approach of joint face detection and alignment using Multitask Cascaded Convolutional networks (MTCNN) [52]. Yale Face database already consists of frontal face images of the subjects, so face alignment and crop are not needed.
Regarding the training process of FaceNet and ArcFace, we employ the standard architectures as described in their respective papers, using 512-dimensional embeddings. Since the above methods are meant to learn a generic face embedding to be used for either face recognition, verification, or clustering, they 1) require a very large training dataset, and 2) cannot learn a user-specific embedding. This would result in an unfair comparison with AuthNet. To alleviate this issue and make the comparison fair, we follow a two-step approach. At first we train FaceNet and ArcFace on the large CASIA WebFace dataset [53] in such a way that we can obtain 512-dimensional embeddings from given input face images. Then, given the embeddings, we train two-class FC classifiers (one for each user) which have to classify the embeddings as either authorized or unauthorized.
Tab. 2 presents a comparison of EER, GAR at FAR = $\{10^{-2}, 10^{-3}\}$, and maximum accuracy for CMU Multi-PIE and Yale Face Database B, calculated on the individual and the aggregated scores of the users. From the results it can be observed that AuthNet achieves the lowest EER, outperforming the other methods. Further, a very small advantage of AuthNet-R with respect to AuthNet-D can also be observed. Nevertheless, as shown in later experiments, the performance of these two AuthNet flavors is comparable and a clear winner cannot be identified.
It is also interesting to observe that AuthNet obtains high GAR values even for very small values of FAR. The higher performance on Multi-PIE compared to Yale Face database B is understandable, since the former has a significantly larger number of high-quality samples per user. Further, AuthNet outperforms the competing methods in terms of maximum accuracy. It can be observed that the EER of the AuthNet encoder classifier is an order of magnitude worse than that of AuthNet. We can exclude that this is due to the AuthNet encoder classifier overfitting on the negative samples: this can be seen in Fig. 13, which depicts the ROC of the considered approaches when tested on out-of-domain or never-seen negative examples. The performance drop of the AuthNet encoder classifier there is mostly bounded, and thus its poorer performance is due to the lack of regularization of the decision space. Indeed, the results of this comparison imply that by regularizing the latent space through well-behaved distributions, it is possible to increase the accuracy of the system by decreasing the number of false positives. This highlights the superiority of the proposed latent space regularization over a traditional classifier. Additionally, the EER achieved by FaceNet and ArcFace is also an order of magnitude worse than that of AuthNet. Furthermore, for small values of FAR the GAR of these methods drops significantly, which is not the case with AuthNet. This indicates a high variability of the results on a per-user basis, which can be observed from both individual and aggregated user scores in Tab. 2.
Furthermore, to better appreciate the effective regularization of the latent space of AuthNet, Figs. 7 and 8 depict the face authentication scores for authorized and unauthorized users for the different benchmarking algorithms. The blue curve in each figure depicts the histogram of the scores obtained for the authorized users, and the red curve depicts the histogram of the scores obtained for the unauthorized users. The histograms of the z scores obtained from AuthNet-R and AuthNet-D are depicted in Fig. 7a, 7b for Multi-PIE and Fig. 8a, 8b for Yale Face database B, respectively. It can be observed that for both datasets AuthNet very effectively separates the samples of authorized and unauthorized users, and there is no mixing of the two distributions. The sigmoid output scores obtained from the AuthNet encoder classifiers are depicted in Fig. 7c and 8c. It can be observed that, since the output is a sigmoid activation, the distributions are mainly peaked at 0 and 1; however, there is noticeable spillover in the area in between. This is the reason for the worse EER and the lower GAR at small values of FAR. The histograms of the sigmoid output obtained from the FaceNet and ArcFace embedding classifiers are depicted in Fig. 7d, 7e for Multi-PIE and Fig. 8d, 8e for Yale Face database B, respectively. In both cases the separation of the scores is not perfect: the resulting misclassified users eventually lead to lower performance.
Lastly, Fig. 10a and 10b illustrate the ROC comparison of AuthNet with respect to the other benchmark techniques on the aggregated scores of the users. It can be clearly observed that the ROC curves of AuthNet lie above those of all other methods and consistently achieve higher GAR even at very low values of FAR, confirming its superiority.

2) FINGERPRINT AUTHENTICATION
For the fingerprints, we employ the FVC 2006 DB2 dataset, detailed in Sec. V-A.
For benchmarking, we compare AuthNet with the AuthNet encoder classifier, VeriFinger [51] and the hybrid approach described in [35]. VeriFinger is a well-known and commercially available system commonly used for minutiae extraction and fingerprint matching, achieving state-of-the-art performance in fingerprint identification [54].
Tab. 3 reports the comparison of EER, maximum accuracy, and GAR at small values of FAR between AuthNet and the benchmarking methods. From Tab. 3, it can be observed that AuthNet achieves the lowest EER and highest accuracy, outperforming all benchmark methods. However, differently from the previous results, it can be observed that AuthNet-D has a slight performance advantage over AuthNet-R. In general it is difficult to state which of the two AuthNet flavors achieves higher performance. Indeed, the performance of AuthNet is to some extent independent of the encoder network architecture. As long as the encoder network has enough capacity, any recent CNN architecture will be able to reach, on average, high performance.
Additionally, for small values of FAR, both AuthNet-R and AuthNet-D achieve high values of GAR. VeriFinger, the AuthNet encoder classifier and the hybrid approach also achieve small EER values; however, it can be observed that their GAR values drop significantly as the FAR is decreased, which is not the case with AuthNet.
Further, it can be seen from Fig. 9a and 9b that the proposed method separates the authorized and unauthorized users very effectively. Conversely, in the case of non-deep learning approaches such as VeriFinger in Fig. 9d and the hybrid approach in Fig. 9e, the scores of the authorized and unauthorized users are not clearly separated and the related regions are not well-behaved. Moreover, it can be noticed in Fig. 9c that, similarly to the case of the face datasets, while the AuthNet encoder classifier provides a separation between the scores, it also introduces some ''leakage''.
Lastly, Fig. 10c depicts the ROC comparison of AuthNet with respect to the other fingerprint authentication schemes. The red curve depicts the GAR obtained by AuthNet at different FAR values. It can be seen that the ROC curve of AuthNet lies above those of the other benchmarking methods. Furthermore, at small values of FAR AuthNet clearly outperforms all the competing algorithms, maintaining the highest GAR values.

VI. IN-DEPTH ANALYSIS OF AuthNet
In order to better understand the performance improvement of AuthNet with respect to competing methods, this section provides a deeper technical insight, with the purpose of explaining how the regularization of the distributions performed by AuthNet yields fewer misclassifications compared to existing methods. Further, it is shown how AuthNet is able to correctly classify samples that are misclassified by competing approaches.

A. MOTIVATION BEHIND HIGHER MISCLASSIFICATION RATE BY COMPETING METHODS
In the first set of experiments, Fig. 11 shows the latent space outputs of AuthNet together with the logit scores obtained by the competing methods, normalized to the target means of $\mu = 0$ for the authorized users and $\mu = 40$ for the unauthorized users; this normalization allows us to directly compare these methods with AuthNet. It can be observed that the logit scores of the other methods naturally tend to be Gaussian, as expected from the central limit theorem [49], [50]. During AuthNet training, the latent representations are forced to follow well-separated Gaussian distributions with predefined mean and standard deviation. For traditional classification methods, however, this is not specifically enforced, which results in distributions with unpredictable mean and standard deviation. As a result, it can be observed in Fig. 11 that the normalized logit score distributions of the competing methods exhibit higher variance and heavier tails than those of AuthNet, which instead obtains distributions that are well separated in the latent space. Moreover, Fig. 11 highlights the normalized logit scores for correctly accepted authorized users (blue), wrongly rejected authorized users (green), correctly rejected unauthorized users (red), and wrongly accepted unauthorized users (yellow). It can be clearly observed that for AuthNet the scores of authorized and unauthorized users are well separated according to the predefined target distributions, yielding very few misclassifications, i.e. false rejections of authorized users (green area) and false acceptances of unauthorized users (yellow area). On the other hand, in the competing methods the logit score distributions of the authorized and unauthorized users are broader, which results in a higher number of misclassifications, as can be observed from the green and yellow areas.
Tab. 4 reports the standard deviation $\sigma$ and kurtosis $\beta_2$ of the latent space features of AuthNet and of the normalized logit scores obtained by the different methods. It can be observed from the table that, due to the lack of regularization, the distributions of the competing methods tend to have a much higher $\sigma$. Similarly, the distributions obtained by the competing methods are heavy-tailed, as can be seen from the measured values of $\beta_2$. This points to a higher spread of the authorized and unauthorized user distributions around their mass centers, resulting in a higher number of misclassifications.
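The quantities discussed above can be computed as in the following sketch. The exact normalization used for the competing methods is not detailed here, so the code assumes a single affine map that moves the two class means onto the target means 0 and 40; the function names are hypothetical. The kurtosis is the non-excess moment ratio, which equals 3 for a Gaussian.

```python
import math


def normalize_to_targets(auth, unauth, mu1=0.0, mu0=40.0):
    """Affinely rescale scores so the class means land on the target means.

    Assumption: one affine map s -> mu1 + scale * (s - mean(auth)) is applied
    to both classes; the normalization used for Fig. 11 may differ.
    """
    m1 = sum(auth) / len(auth)
    m0 = sum(unauth) / len(unauth)
    scale = (mu0 - mu1) / (m0 - m1)
    return ([mu1 + scale * (s - m1) for s in auth],
            [mu1 + scale * (s - m1) for s in unauth])


def std_and_kurtosis(x):
    """Return (sigma, beta_2) using population central moments m2 and m4."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return math.sqrt(m2), m4 / (m2 ** 2)
```

A value of $\beta_2$ well above 3 on the normalized scores indicates heavier tails than a Gaussian, and a larger $\sigma$ at a fixed separation of the means indicates more overlap between the two classes.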

B. HOW AuthNet CORRECTLY MAPS USERS THAT ARE MISCLASSIFIED BY OTHER METHODS
In Fig. 13, depicting the latent features obtained by AuthNet, the latent feature outputs corresponding to authorized users that are wrongly rejected by competing networks are highlighted in green, whereas the features corresponding to wrongly accepted unauthorized users are highlighted in yellow. It can be observed that in all the cases AuthNet maps the wrongly accepted unauthorized users near the mass center of the correctly rejected unauthorized users, i.e. the red area. Similarly, AuthNet properly maps the wrongly rejected authorized users to the right class, in the blue area.
In summary, defining well-separated target Gaussian distributions with specified mean and standard deviation during training avoids the spread of the samples of authorized and unauthorized users, yielding a lower number of misclassifications.

VII. ROBUSTNESS ANALYSIS
In this second set of experiments, we show that regularizing the latent space to simple target distributions leads not only to improved accuracy, but also to more robust authentication. In particular, we test AuthNet and the benchmark methods trained on MultiPIE on datasets that the network has not seen during the training. Further, we also test the robustness of the proposed approach against targeted perturbations.

A. EVALUATION ON NEW DATASETS NOT SEEN DURING TRAINING
To show the robustness and resilience of AuthNet against face datasets that the network has never seen during training, we test AuthNet-R and the competing methods trained on MultiPIE on the LFW [55], YTF [56], and CALFW [57] datasets. Fig. 13 shows the ROC comparison of methods trained and tested on MultiPIE versus the same methods trained on MultiPIE and tested on the YTF, LFW, and CALFW datasets, for the class of unauthorized users (note that in this setup the unauthorized users are not present in the test dataset). The solid curves depict the results when methods are trained and tested on the same dataset, while the dotted curves depict the test results on the datasets which the network has not seen during training. The robustness is measured in terms of the performance drop on the datasets that have not been seen during training. It can be observed that AuthNet is robust against the datasets which were not presented at training time: it correctly maps the samples from these datasets to the unauthorized target distribution. This effect is more significant at small FAR ($10^{-3}$), where a large performance drop can be observed for the competing methods, whereas AuthNet maintains a high GAR value, outperforming them by a large margin.
For a more detailed analysis, Tab. 5 reports the absolute difference in GAR at different values of FAR and the maximum accuracy difference achieved by different methods when tested on MultiPIE versus the other datasets. It can be seen that AuthNet consistently outperforms all competing methods, yielding a very small performance drop when tested on different datasets. The effect is very evident at small values of FAR.
To further evaluate the robustness of AuthNet, we also considered a non-face dataset: we test AuthNet and the competing methods trained on MultiPIE on the Caltech 101 [58] dataset. This dataset does not include faces and is made of images of objects belonging to 101 different categories. From both Fig. 13d and Tab. 5 it can be observed that the performance drop is very significant for the competing methods. Conversely, AuthNet still maps the images of Caltech 101 to the unauthorized distribution, giving stable results even at small FAR values.
The results in this section show that regularizing the latent space using well-behaved target distributions leads to robust authentication against features that have never been seen before. Furthermore, the behavior of the non-authorized region of AuthNet is consistent across different datasets.

B. EVALUATION ON TARGETED PERTURBATIONS
We further analyse the robustness of the AuthNet approach against targeted perturbations. We consider the white-box Fast Gradient Sign Method (FGSM) [59] due to its simplicity and speed in crafting the perturbations. In FGSM the input samples are adjusted to maximize the loss based on the back-propagated gradients. The model back-propagates to the input data to calculate $\nabla_x J(\theta, x, a)$, then the input samples are adjusted by a step of $\epsilon$ in the direction of $\mathrm{sign}(\nabla_x J(\theta, x, a))$, which maximizes the loss.
For this experiment, we compare AuthNet-R with the AuthNet encoder classifier trained on Multi-PIE in order to highlight the advantages of learning the mapping instead of the boundaries. The rationale is to show that for traditional methods producing arbitrary boundaries, it is usually possible to craft samples that result in incorrect classification with a minimal perturbation, whereas for the proposed method this is much more difficult, leading to improved robustness.
For both AuthNet and the AuthNet encoder classifier, every test sample is perturbed with an $\ell_\infty$-bounded perturbation and the results are aggregated. We define $n$ as the noise vector such that $\|n\|_\infty \le \epsilon$, where the noise strength is defined as the ratio $\|n\|_\infty / \|x\|_\infty$. As an example, 100% noise strength means that the attack is able to corrupt the image with noise values within the full range of the input image.
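The FGSM step and the noise strength defined above can be illustrated with the following sketch. It is not the attack code used in the experiments: to stay self-contained it replaces the network with a hypothetical logistic model (weights `w`, bias `b`) whose input gradient is available in closed form, and it does not clip the perturbed sample to a valid pixel range.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def loss(w, b, x, a):
    """Cross-entropy J for a logistic model p = sigmoid(w.x + b), label a in {0, 1}.

    Assumption: this toy model is a stand-in for the real network.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(a * math.log(p) + (1 - a) * math.log(1.0 - p))


def loss_grad_x(w, b, x, a):
    """Gradient of J with respect to the input x: (p - a) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - a) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, grad, eps):
    """One FGSM step: x_adv = x + eps * sign(grad), so ||x_adv - x||_inf <= eps."""
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


def noise_strength(x, x_adv):
    """Ratio ||n||_inf / ||x||_inf with n = x_adv - x."""
    n_inf = max(abs(u - v) for u, v in zip(x_adv, x))
    return n_inf / max(abs(v) for v in x)
```

For example, with `x_adv = fgsm(x, loss_grad_x(w, b, x, a), eps)` the loss on `x_adv` is expected to be at least as large as on `x` for this convex toy model, and `noise_strength(x, x_adv)` gives the value that would be placed on the horizontal axis of a success-probability plot.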
In Fig. 14 we depict the probability of success of FGSM as a function of the noise strength. For AuthNet it can be noticed that trying to move the authorized users into the unauthorized region (z > 20) has a high probability of success for large noise strength, i.e. larger than 10% of the maximum pixel values of the input images. However, by lowering $\epsilon$, the probability of success decreases accordingly. Conversely, granting access to unauthorized users is a much harder task. The maximum probability of success, reached at 2% noise strength, is 0.27. Furthermore, the probability of success in this setting is close to zero even for very large perturbations. This can be explained by the way AuthNet regularizes the latent space: authorized users are strictly enclosed within the high mass region of $P_1$. If the perturbation is too strong, the likelihood that the perturbed samples are treated as unauthorized users increases. We further study this effect in Fig. 15, where we show the trajectory of z in the latent space for a perturbed sample coming from an unauthorized user, as a function of the noise level. It can be seen that for large perturbations z stays within the high mass region of $P_0$. Similarly, if $\epsilon$ is limited to less than 1%, the value of z remains close to 40. Between these limits there is a region which may lead to misclassification of unauthorized users. An interpretation of this behavior is that the regularized decision boundary provided by AuthNet does not allow an attacker to choose an easy path for crossing the boundary from a generic point within the decision region, i.e., every point on the other side of the boundary tends to be equally far away. If we compare these results with those of the AuthNet encoder classifier in Fig. 14, it is immediately apparent that FGSM is overall much more successful, especially for large noise strength. Also in this case, FGSM targeting authorized users is more successful.
This confirms our conjecture that the highly complex boundaries learned through a classifier are more vulnerable to adversarial perturbations. Conversely, the proposed AuthNet architecture, by properly regularizing the latent space, is able to greatly reduce such effects and thus reduce the likelihood of targeted perturbations succeeding.

VIII. CONCLUSION
We presented a novel approach for biometric authentication based on adversarial learning, in which the latent space regularization leads to improved robustness and accuracy of the biometric classification. Our intuition behind this behavior is that the non-linear boundaries learned by standard deep learning classifiers become very complex as they try to closely fit the training data, leaving room for misclassification. Conversely, the adversarial learning of AuthNet enables much simpler boundaries to be used, as it does not learn how to partition the space but rather how to map the input space into the latent space. With extensive experimentation on multiple large biometric datasets and against several state-of-the-art benchmark methods, we showed that AuthNet consistently outperforms other existing techniques. We further showed that regularizing the latent space makes the architecture less vulnerable to targeted and non-targeted perturbations.
Future work will consider adding new users to a pre-trained AuthNet and handling user revocation.