Fingerprint Presentation Attack Detection Based on Local Features Encoding for Unknown Attacks

Fingerprint-based biometric systems have experienced significant development in recent years. Despite their many advantages, they are still vulnerable to presentation attacks (PAs). Therefore, determining whether a sample stems from a live subject (i.e., a bona fide presentation) or from an artificial replica is a mandatory task which has received a lot of attention recently. Nowadays, PAs can be successfully identified when the materials used for the fabrication of the Presentation Attack Instruments (PAIs) have also been used to train the PA Detection (PAD) methods. However, current PAD methods still face difficulties detecting PAIs built from unknown materials or captured using other sensors. Based on that fact, we propose a new PAD technique based on three image representation approaches combining local and global information of the fingerprint. By transforming these representations into a common feature space, we can correctly discriminate bona fide from attack presentations in the aforementioned scenarios. The experimental evaluation of our proposal over the LivDet 2011 to 2015 databases yielded error rates outperforming the top state-of-the-art results by up to 50% in the most challenging scenarios. In addition, the best configuration achieved the best results in the LivDet 2019 competition (overall accuracy of 96.17%).


I. INTRODUCTION
Biometric recognition is based on the use of distinctive anatomical and behavioural characteristics to automatically recognise a subject [1]. Among other biometric characteristics, fingerprints offer a high recognition accuracy and, at the same time, enjoy a high popular acceptance. Despite these and other advantages, fingerprint-based recognition systems can be circumvented by launching Presentation Attacks (PAs), in which an artificial fingerprint, denoted as Presentation Attack Instrument (PAI), is presented to the sensor [2], [3], [4], [5].
The threat posed by PAIs is not reduced to an academic issue. In 2002, Matsumoto et al. [4], [6] analysed the vulnerabilities of eleven commercial fingerprint-based biometric systems to gummy fingerprints. The experimental evaluation showed that 68% to 100% of the PAIs built with cooperative methods were accepted as bona fide presentations (i.e., genuine or live fingers). In 2009, Japan reported the use of presentation attacks in one of its airports, and in 2013, a Brazilian doctor used artificial silicone fingerprints to tamper with a biometric attendance system at the Sao Paulo hospital [7]. In order to tackle those severe security issues, the development of Presentation Attack Detection (PAD) techniques, which automatically detect PAIs presented to the biometric capture device, is a mandatory task, which has attracted a lot of attention within the biometric research community, not only for fingerprint systems [8], [9], but also for other characteristics such as face [10] or iris [11]. These PAD methods can be broadly classified as hardware- or software-based approaches. Whereas the former require dedicated, and mostly expensive, specific hardware, software-based approaches focus on dynamic or static characteristics extracted from the same biometric samples used for recognition purposes. Therefore, software-based methods are less expensive, and will be the focus of this article.
The newest fingerprint PAD techniques, based on deep learning and textural features, have proven to be a powerful tool to detect most PAIs [12], [13], [14], [15]. However, they share a common limitation: they depend both on i) the material used for fabricating the PAIs, and ii) the sensor used for acquiring the fingerprint samples. More specifically, their error rates are multiplied by a factor of five to 18 when either the PAIs' materials or the sensors utilised are not known a priori (see Table I).
To address the issue of generalisation to unknown factors, we analyse the combination of local features (i.e., the Scale-Invariant Feature Transform, SIFT [16]) with three different general-purpose feature encoding approaches, which have shown remarkable results in object classification tasks [17], [18], [19]: i) Bag of Words (BoW), ii) Vector of Locally Aggregated Descriptors (Vlad), and iii) Fisher Vector (FV). The local descriptors, computed over the image gradient, allow capturing different artefacts produced by the materials used for building the PAIs. Then, the aforementioned encoding approaches assign each local descriptor (i.e., SIFT) to the closest entry in a visual vocabulary [20]. This visual vocabulary defines a common feature space, thereby allowing a better generalisation to unknown attacks or capture devices.
In order to evaluate the performance of the proposed methods and to allow the reproducibility of the results, we conduct a thorough experimental evaluation on the LivDet 2011, LivDet 2013, and LivDet 2015 databases. The performance is reported in compliance with the ISO/IEC 30107 international standard on PAD evaluation [5], thereby allowing a rigorous analysis of the results. The evaluation shows the capacity of the new method to be used in high security applications: for a high security operating point with an Attack Presentation Classification Error Rate (APCER) of 1%, an average Bona Fide Presentation Classification Error Rate (BPCER) of 0.25%, 0.38% and 7.11% was achieved, respectively, on the three databases, thereby outperforming the state-of-the-art. In addition, we would like to highlight that the proposed method took part in the Fingerprint Liveness Detection Competition 2019, achieving the best detection performance with an average accuracy of 96.17% [21].
The remainder of this paper is organized as follows: related works are summarised in Sect. II. In Sect. III, we describe the proposed PAD methods. The experimental evaluation is presented in Sect. IV. Finally, conclusions and future work directions are presented in Sect. V.

II. RELATED WORK
As we mentioned in Sect. I, we focus on static software-based fingerprint PAD methods, since they are the most time- and cost-efficient. In particular, we review those methods based on either deep learning or addressing scenarios with unknown factors. For more details on other methods, the reader is referred to [8], [9], [22].
In this context, it has been observed that some textural properties, including the morphology, smoothness, and ridge-valley structure, may differ between attack and bona fide presentations, and can thus be used to discriminate them. Building upon this idea, several texture-based PAD methods have been proposed in the literature [23], [24]. More recently, new methods based on deep learning approaches have significantly outperformed earlier PAD techniques. For instance, Nogueira et al. [14] benchmarked three classic Convolutional Neural Networks (CNNs). One of their proposals achieved the best results in the LivDet 2015 competition, with an overall accuracy of 95.5%. In spite of those promising results, the main limitation of these methods is that they learn features from a whole image with a fixed size. In many cases, also within the LivDet databases, the Region of Interest (ROI) covers only a small area of the whole image (e.g., 19% for some subsets of LivDet 2011), thus not being large enough to allow an efficient PA detection. This is highlighted by the results achieved on the LivDet 2011 Italdata dataset, where the Average Classification Error Rate (ACER) increased up to 9.2%.
To address the small ROI issue, Pala and Bhanu [15] proposed training a triplet convolutional network on a single fixed-size, randomly extracted patch per image. In spite of the improvement obtained with respect to the previous whole-image-based approach [14], in the random patch extraction process several patches extracted from Italdata 2011 could stem from the background region of the image, thereby resulting in a still high ACER of 5.1%.
More recently, and based on the fact that PAIs produce spurious minutiae on a fingerprint image, Chugh et al. [12], [13] proposed a deep learning framework for independently classifying local patches around minutiae extracted from a fingerprint image. The final bona fide vs PA decision was defined as the average between PAD scores of the local patches. This approach additionally allows finding PA regions inside a sample, even if the PAI only covers part of the underlying fingerprint. The method achieves the lowest ACER values reported so far over the LivDet databases (see Table I, left column). However, despite the excellent results reported in the known environment (i.e., known attacks and known sensors), an evaluation on more challenging scenarios (i.e., unknown sensors and/or PAI fabrication materials) shows an increase in the error rates (see Table I).
Finally, Park et al. propose in [25] an efficient CNN based on the fire module of SqueezeNet to optimise the hardware and time requirements. Evaluated over LivDet 2011 to 2015, the CNN outperforms the work presented in [13] for some datasets, while at the same time reducing the execution time by a factor of more than six. It should, however, be noted that the performance of this PAD method under more challenging scenarios with unknown attacks or sensors remains unknown.
To sum up, the main drawback of the aforementioned methods is their high dependency both on the PAI fabrication materials and the capture device. To tackle these issues, several approaches based on handcrafted features have been followed. On the one hand, Rattani et al. proposed in [26] an automatic adaptation of Weibull-calibrated support vector machines (SVMs). Over the LivDet 2011 database, the obtained equal error rates (EERs) oscillated between 20 and 30% for the best configuration in the presence of unknown PAI species. On the other hand, Ding and Ross analysed an ensemble of one-class SVMs trained only on bona fide data in [27], which lowered the error rates to 10-22% over the same dataset.
More recently, in an extension of [13], Chugh and Jain identified in [28] a subset of six out of 12 PAI species which can yield detection rates similar to known-attack scenarios. That is, training the SpoofBuster with only those six PAI species and testing on all 12 species results in an APCER = 10.24% at BPCER = 0.2%, very close to the APCER = 9.03% obtained when all PAI species are used for training. In spite of these impressive results, it should be noted that the selection of the training PAI species plays a crucial role in this study.
This dependency is highlighted again by Engelsma and Jain in [29], where multiple generative adversarial networks (GANs) are trained on bona fide images acquired with the RaspiReader sensor. From the same 12 different PAI species, six are used for training and six for testing. In a benchmark with the method proposed in [27], the GANs outperform the SVMs. However, the average APCERs achieved for a BPCER = 0.2% vary from 31.42% to 68.98%, depending on the training set used. This shows again a high sensitivity to different training datasets. In addition, this approach is not directly comparable to those based on conventional (e.g., Crossmatch or Greenbit) sensors, since specific hardware, namely the RaspiReader, was used to acquire the samples. Finally, Gajawada et al. try to tackle this dependency on the PAI species contained in the training set from a different perspective in [30]. They propose a deep learning based "Universal Material Translator" (UMT). Given a reduced number (e.g., five) of samples from a new PAI species, the UMT extracts their main appearance features and embeds them into a database of bona fide samples, in order to generate synthetic samples of the new PAI species. Those synthetic samples can then be utilised to train any CNN. Over the LivDet 2015 database, the authors showed how the proposed approach can improve the detection rates by up to 17%, achieving a remarkable 21.96% APCER for a BPCER = 0.1%. However, it should be noted that this approach does require some samples (i.e., five) of the analysed unknown PAI species.
III. PROPOSED METHOD
In this context, our method tackles the issue of detection performance degradation in the presence of unknown factors (i.e., attacks, sensors, or databases) by transforming the local descriptors extracted from the fingerprint samples into a common feature space. This allows for better generalisation capabilities to more challenging scenarios, without needing any samples of the unknown attacks for training. Fig. 1 shows an overview of the proposed PAD approach, based on the fusion of three different feature encoding approaches. In the first common processing step, the Pyramid Histogram of Visual Words (PHOW) [31] algorithm is used to extract local features: the so-called dense Scale-Invariant Feature Transform (dense-SIFT) descriptors (Sect. III-A). Subsequently, three encoding methods are applied to bring the aforementioned local descriptors into a common feature space: i) Bag of Words (BoW), ii) Fisher Vector (FV), and iii) Vector of Locally Aggregated Descriptors (Vlad) (Sect. III-B).

A. Local Features Extraction: dense-SIFT Descriptors
As local feature descriptors, we have chosen the dense-SIFT approach, computed over the image gradient, since it can capture the lower-coherence areas introduced by the coarseness of different PAI fabrication materials. In particular, the Pyramid Histogram of Visual Words (PHOW) approach proposed in [31] computes SIFT descriptors densely at fixed points on a regular grid with uniform spacing S (e.g., 5 pixels), as summarised in Fig. 2 (left). For each point in the grid, the dense-SIFT descriptor computes the gradient vector for each pixel in the feature point's neighbourhood (Fig. 2, top right), taking into account 8 different directions. Subsequently, a normalised 8-bin histogram of gradient directions (Fig. 2, bottom right) is built over each of 4×4 sample regions. In addition, in order to account for the scale variation between fingerprints, these dense-SIFT descriptors are computed over four circular patches or windows with different scales σ = {5, 7, 10, 12}. Therefore, each point in the grid is represented by four SIFT descriptors (i.e., one per σ), each comprising a total of 128 features (i.e., 4 × 4 8-bin histograms).
It should be noted that windows with different scales allow extracting local information of fingerprints at different resolution levels, thereby detecting variable-size artefacts produced in the fabrication of PAIs. In addition, near-uniform local patches do not yield stable keypoints or descriptors. Therefore, we have used a fixed threshold δ on the average norm of the local gradient in order to remove local descriptors from low contrast regions (i.e., regions with an average norm value close to zero).
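To make this extraction step concrete, the following minimal NumPy sketch computes single-scale dense gradient-orientation descriptors on a regular grid and discards low-contrast points via the threshold δ, as described above. It is an illustrative simplification, not the paper's C++/VLFeat PHOW implementation: the grid step, window geometry, and δ value are assumptions, and the multi-scale patches and Gaussian weighting of real SIFT are omitted.

```python
import numpy as np

def dense_descriptors(img, step=5, cells=4, bins=8, win=16, delta=0.05):
    """Dense SIFT-like descriptors on a regular grid with spacing `step`:
    per grid point, an L2-normalised stack of `bins`-bin gradient-orientation
    histograms over a `cells` x `cells` spatial grid (4*4*8 = 128 dims).
    Grid points whose mean gradient norm is below `delta` (low contrast)
    are discarded, mirroring the fixed-threshold filtering in the text."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)    # orientation in [0, 2*pi)
    idx = np.minimum((ori * bins / (2 * np.pi)).astype(int), bins - 1)

    half, sub = win // 2, win // cells
    descs = []
    for y in range(half, img.shape[0] - half + 1, step):
        for x in range(half, img.shape[1] - half + 1, step):
            m = mag[y - half:y + half, x - half:x + half]
            if m.mean() < delta:                   # skip low-contrast regions
                continue
            b = idx[y - half:y + half, x - half:x + half]
            # one magnitude-weighted orientation histogram per spatial cell
            hists = [np.bincount(b[i*sub:(i+1)*sub, j*sub:(j+1)*sub].ravel(),
                                 weights=m[i*sub:(i+1)*sub, j*sub:(j+1)*sub].ravel(),
                                 minlength=bins)
                     for i in range(cells) for j in range(cells)]
            d = np.concatenate(hists)
            n = np.linalg.norm(d)
            descs.append(d / n if n else d)
    return np.array(descs)
```

With the default parameters, each surviving grid point yields a 128-dimensional descriptor, matching the 4 × 4 × 8 layout described above; a completely flat image yields no descriptors at all.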

B. Local Feature Encoding
In the second stage of the PAD algorithm, three different feature encoding approaches for the dense-SIFT descriptors are analysed.

1) Bag of Words (BoW):
Bag-of-Words (BoW) based techniques were first developed for text categorization tasks, in which a text document is assigned to one or more categories based on its content [33]. For this purpose, BoW represents the text document by a sparse histogram of word occurrence based on a visual vocabulary. Following this same idea, Csurka et al. [17] adopted and transformed this technique to represent local features from an image in terms of the so-called visual words. Our method builds upon this last approach.
As first proposed in [34], the BoW representation first computes the visual vocabulary as a codebook with K different centroids or visual words (see Fig. 1, top) with k-means clustering. Then, the BoW representation is defined as the histogram of the number of image descriptors assigned to each visual word. Its computation is summarised in Fig. 3. First, an m-level pyramid of spatial histograms is used in order to incorporate spatial relationships between patches. To do that, the fingerprint image is partitioned into increasingly fine subregions, and the dense-SIFT descriptors inside each sub-region are assigned to the closest centroid among the K visual words, using a fast version of k-means clustering [35]. Subsequently, the histograms inside each sub-region are computed and stacked into a single and final feature vector.
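The BoW pyramid computation above can be sketched as follows. This is a simplified NumPy illustration, not the actual implementation: it assumes the visual vocabulary (k-means centroids) has already been trained offline, uses plain nearest-centroid assignment instead of the fast k-means variant of [35], and takes descriptor grid coordinates as an explicit input.

```python
import numpy as np

def bow_pyramid(descs, coords, centroids, img_shape, levels=2):
    """BoW encoding with a spatial pyramid: each local descriptor is
    hard-assigned to its nearest visual word, then one K-bin histogram
    per sub-region (1x1 grid at level 0, 2x2 at level 1, ...) is
    computed and stacked into a single L1-normalised feature vector."""
    K = centroids.shape[0]
    # hard assignment: nearest centroid in Euclidean distance
    d2 = ((descs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(1)

    H, W = img_shape
    feats = []
    for level in range(levels):
        n = 2 ** level                              # n x n sub-regions
        ry = np.minimum(coords[:, 0] * n // H, n - 1)
        rx = np.minimum(coords[:, 1] * n // W, n - 1)
        for i in range(n):
            for j in range(n):
                sel = (ry == i) & (rx == j)
                feats.append(np.bincount(words[sel], minlength=K))
    feats = np.concatenate(feats).astype(float)
    s = feats.sum()
    return feats / s if s > 0 else feats
```

For a vocabulary of K words and a two-level pyramid, the stacked vector has K·(1 + 4) dimensions, which illustrates how the pyramid trades feature length for spatial information.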
2) Fisher Vector (FV): BoW approaches encode local features using a hard assignment, in which a local descriptor is only assigned to one visual word based on a similarity function. In contrast, the Fisher Vector (FV) method derives a kernel from a generative model of the data (e.g., Gaussian Mixture Model, GMM), and describes how the set of local descriptors deviate from an average distribution of the descriptors [20]. The aforementioned model can be understood as a probabilistic visual vocabulary, which thereby allows a soft assignment. Thus, the FV paradigm encodes not only the number of descriptors assigned to each region, but also their position in terms of their deviation with respect to the predefined model.
As proposed in [36], we train a GMM with diagonal covariances on the decorrelated dense-SIFT descriptors extracted in the previous step (see the second row in Fig. 1). In general, the K components of the GMM are represented by the mixture weights (w_k), Gaussian means (µ_k) and diagonal covariances (σ_k), with k = 1, . . . , K. This leads to an image representation which captures the average first-order and second-order differences between the N local features and each of the GMM centres [37]:

$u_k = \frac{1}{N\sqrt{w_k}} \sum_{i=1}^{N} \alpha_i(k)\, \frac{x_i - \mu_k}{\sigma_k}, \qquad v_k = \frac{1}{N\sqrt{2w_k}} \sum_{i=1}^{N} \alpha_i(k) \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^2} - 1 \right],$

where α_i(k) is the soft assignment weight of the i-th feature x_i to the k-th Gaussian. It is important to highlight that w_k, µ_k and σ_k are computed during the training stage. Finally, the FV representation that defines a fingerprint image is obtained by stacking the differences:

$\Phi = \left[ u_1, v_1, \ldots, u_K, v_K \right].$

With the aim of clustering the extracted local features with diagonal GMM covariance matrices, the dense-SIFT features are decorrelated using PCA [32]. In our approach, the dense-SIFT descriptor dimension was reduced from 128 to d = 64 components; hence, the final FV representation is a vector of size 2Kd = 128·K, where K is the number of Gaussian components in the GMM and d is the dimension of a decorrelated dense-SIFT descriptor.
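The FV computation can be sketched in NumPy as below. This is a simplified, self-contained version of the standard formulation (it takes pre-trained GMM parameters as input and adds the power- and L2-normalisation commonly applied to FVs), not the paper's VLFeat implementation.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Fisher Vector of N local descriptors X (N x d) under a
    diagonal-covariance GMM with weights w (K,), means mu (K, d) and
    std deviations sigma (K, d): stack first-order (u_k) and
    second-order (v_k) deviations -> 2*K*d dimensions."""
    N, d = X.shape
    K = w.shape[0]
    # soft-assignment posteriors alpha_i(k) via a stable log-softmax
    logp = (-0.5 * (((X[:, None, :] - mu[None]) / sigma[None]) ** 2).sum(-1)
            - np.log(sigma).sum(1)[None] + np.log(w)[None])
    logp -= logp.max(1, keepdims=True)
    alpha = np.exp(logp)
    alpha /= alpha.sum(1, keepdims=True)

    fv = []
    for k in range(K):
        diff = (X - mu[k]) / sigma[k]
        a = alpha[:, k:k + 1]
        u = (a * diff).sum(0) / (N * np.sqrt(w[k]))          # first order
        v = (a * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w[k]))  # second
        fv.extend([u, v])
    fv = np.concatenate(fv)
    # power- and L2-normalisation, as is common for FV pipelines
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    n = np.linalg.norm(fv)
    return fv / n if n > 0 else fv
```

With d = 64 decorrelated components, the resulting vector has 2Kd = 128·K dimensions, matching the size stated above.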
3) Vector of Locally Aggregated Descriptors (Vlad): In order to reduce the high-dimensional image representation produced by the FV and BoW approaches, thereby gaining in efficiency and memory usage, we have finally studied the Vector of Locally Aggregated Descriptors (Vlad) methodology [32] (see Fig. 1, third row). This is a simplified, non-probabilistic version of FV, which models the data distribution from the accumulated residuals between each local descriptor x and its closest visual word c in the visual vocabulary. Therefore, as in the BoW approach, a visual vocabulary needs to be computed in the first step with the k-means algorithm.
More specifically, the set of d-dimensional local feature descriptors x (i.e., dense-SIFT descriptors) is represented by a Vlad descriptor v of size Kd as follows:

$v_{i,j} = \sum_{x:\, \mathrm{NN}(x) = c_i} \left( x_j - c_{i,j} \right), \quad i = 1, \ldots, K, \; j = 1, \ldots, d,$

where x_j and c_{i,j} denote the j-th component of a descriptor x and of its closest visual word c_i, respectively. In our method, v is subsequently L2-normalised in order to further improve the classification accuracy. Finally, it is important to highlight that Vlad also uses PCA for decorrelating the training data.
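The residual accumulation above can be sketched as follows (a minimal NumPy illustration assuming a pre-trained k-means vocabulary; the PCA decorrelation step is performed beforehand and omitted here):

```python
import numpy as np

def vlad(X, centroids):
    """Vlad encoding: for each visual word c_i, accumulate the residuals
    x - c_i of all descriptors whose nearest neighbour is c_i, then
    flatten to a K*d vector and L2-normalise."""
    K, d = centroids.shape
    d2 = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)
    nn = d2.argmin(1)                      # nearest visual word per descriptor
    v = np.zeros((K, d))
    for i in range(K):
        sel = X[nn == i]
        if len(sel):
            v[i] = (sel - centroids[i]).sum(0)
    v = v.ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

Note how the output is only Kd-dimensional, i.e., half the size of the FV representation for the same vocabulary, which is the efficiency gain motivating Vlad here.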

C. Classification
In order to classify the final encoded representations, separate linear SVMs have been used for each encoding approach. In order to find the optimal hyperplane separating the bona fide from the attack presentations, the optimisation algorithm minimises an upper bound on the classification loss. Therefore, we have trained two complementary SVMs as follows:
• The first SVM labels the bona fide samples as +1 and the presentation attacks as -1, thereby yielding the corresponding W_bf (weights) and b_bf (bias) classifier parameters.
• The second SVM labels the bona fide samples as -1 and the presentation attacks as +1, thereby yielding the corresponding W_pa and b_pa classifier parameters.
Subsequently, given an encoded feature descriptor x, two different scores are computed, which estimate both the class of the sample (i.e., the sign of the score) and the confidence of that decision (i.e., the absolute value of the score, which is the distance to the hyperplane):

$s_{bf}(x) = W_{bf} \cdot x + b_{bf}, \qquad s_{pa}(x) = W_{pa} \cdot x + b_{pa}.$

The final score is then chosen as the most reliable decision for the given vector, i.e., the one whose distance to the corresponding hyperplane is largest:

$s(x) = \begin{cases} s_{bf}(x) & \text{if } |s_{bf}(x)| \geq |s_{pa}(x)|, \\ -s_{pa}(x) & \text{otherwise.} \end{cases}$

Given the three individual PAD scores, s_FV, s_Vlad and s_BoW, output by the corresponding SVMs, we define the final fused score s_fusion as follows:

$s_{fusion} = \alpha \cdot s_{FV} + \beta \cdot s_{Vlad} + (1 - \alpha - \beta) \cdot s_{BoW},$

where α + β ≤ 1.
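The scoring and fusion steps can be sketched as below. Note the hedge: the exact score-selection and fusion equations are not printed in this excerpt, so the "keep the more confident decision" rule and the weighted sum with weights α, β and 1 − α − β are our reading of the surrounding text, not a verbatim transcription of the paper's formulas.

```python
import numpy as np

def pad_score(x, W_bf, b_bf, W_pa, b_pa):
    """Scores from two complementary linear SVMs: the sign encodes the
    class and the magnitude the distance to the hyperplane. We keep the
    more confident of the two decisions, mapped so that positive values
    mean bona fide (our reading of the selection rule)."""
    s_bf = float(np.dot(W_bf, x) + b_bf)   # bona fide labelled +1
    s_pa = float(np.dot(W_pa, x) + b_pa)   # attack labelled +1
    return s_bf if abs(s_bf) >= abs(s_pa) else -s_pa

def fused_score(s_fv, s_vlad, s_bow, alpha, beta):
    """Weighted score-level fusion with alpha + beta <= 1; the
    remaining weight 1 - alpha - beta goes to the BoW score."""
    assert 0 <= alpha and 0 <= beta and alpha + beta <= 1
    return alpha * s_fv + beta * s_vlad + (1 - alpha - beta) * s_bow
```

In this sketch, a trained encoding-specific SVM pair produces one score per encoding, and the three scores are then combined into the final FPAD decision.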

IV. EXPERIMENTAL EVALUATION
In this section, we evaluate and benchmark the detection performance of each fingerprint encoding scheme described in Sect. III. Specifically, three goals were taken into account for the experimental protocol design: i) analyse the impact of the key parameter K (vocabulary size) on the detection performance of the three proposed PAD schemes, ii) benchmark the detection performance of our proposals against the top state-of-the-art approaches, and iii) study the computational performance of the three fingerprint encoding schemes.

A. Experimental Protocol
The proposed PAD methods were implemented in C++ using the open-source VLFeat library. All the experiments were conducted on an Intel(R) Xeon(R) CPU E5-2670 v2 processor at 2.50 GHz with 378 GB of RAM.

1) Databases:
The experiments were conducted on the well-established benchmarks from LivDet 2011 [38], LivDet 2013 [39] and LivDet 2015 [40]. A summary of the PAI fabrication materials is included in Table II.
2) Evaluation Protocol and Metrics: To reach the aforementioned objectives, the experimental evaluation considers three different scenarios: i) known-material and known-sensor, ii) known-sensor and unknown-material, and iii) unknown-sensor and cross-database.
The detection performance is evaluated in compliance with the ISO/IEC IS 30107 [5]: we report the Attack Presentation Classification Error Rate (APCER), which refers to the percentage of misclassified presentation attacks for a fixed threshold, and the Bona Fide Presentation Classification Error Rate (BPCER), which indicates the percentage of misclassified bona fide presentations. We also include the Detection Error Trade-Off (DET) curves between both error rates, as well as the BPCER for a fixed APCER of 10% (BPCER10), 5% (BPCER20) and 1% (BPCER100).
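These error rates can be computed as in the following NumPy sketch, which assumes a score convention where higher values mean "bona fide"; the threshold-search in `bpcer_at_apcer` is one simple way to obtain the operating points BPCER10/20/100, not necessarily the exact procedure used in the evaluation.

```python
import numpy as np

def apcer_bpcer(bona_scores, attack_scores, thr):
    """ISO/IEC 30107-3 style error rates (in %) at a fixed threshold:
    APCER = fraction of attacks accepted (score >= thr),
    BPCER = fraction of bona fide presentations rejected (score < thr)."""
    apcer = np.mean(np.asarray(attack_scores) >= thr)
    bpcer = np.mean(np.asarray(bona_scores) < thr)
    return 100 * apcer, 100 * bpcer

def bpcer_at_apcer(bona_scores, attack_scores, target_apcer=1.0):
    """BPCER at a fixed APCER, e.g. BPCER100 for target_apcer = 1.0:
    choose the threshold at which the APCER just reaches the target."""
    srt = np.sort(attack_scores)[::-1]                 # highest scores first
    k = max(int(np.ceil(len(srt) * target_apcer / 100)) - 1, 0)
    thr = srt[k]          # accepts ~target_apcer% of the attack scores
    return apcer_bpcer(bona_scores, attack_scores, thr)[1]
```

A DET curve is then simply the pair (APCER(thr), BPCER(thr)) traced over all thresholds.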
Then, in order to establish a fair benchmark with the existing literature, we report the ACER as the average of the APCER and the BPCER for a fixed detection threshold δ.
B. Experimental Results

1) Known-Material and Known-Sensor Scenario: First, we optimise the algorithms' detection performance in terms of the main key parameter: the visual vocabulary size K. To that end, we focus on the known scenario, in order to avoid a bias due to other variables. We test the following range of values: K = {256, 512, 1024, 2048}, since K > 2048 would yield feature vectors too long for real-time applications. We found that the best K value on average is K = 1024 (for more details, the reader is referred to the appendix), and optimised the fusion parameters (see Sect. III-C) for this value in terms of the D-EER. Fig. 4 shows the DET curves of the FPAD approach over all sensors for K = 1024. As can be observed, for low APCER values of 1% (i.e., high security thresholds), the FPAD achieves a remarkable average BPCER100 = 0.25% (vs. 4.05% in [13]) for LivDet 2011 and 0.38% for LivDet 2013. In more detail, for LivDet 2011, the Digital Persona and Sagem sensors report a BPCER = 0% for any APCER ≥ 0.2%. Regarding the LivDet 2013 database, the results are similar for all sensors, and we observe a BPCER = 0% for any APCER ≥ 10%. In contrast, for LivDet 2015, the FPAD suffers a detection performance decrease, with error rates multiplied by up to 42 times: it shows a BPCER10 = 0.94%, BPCER20 = 2.12% and BPCER100 = 7.11%.

Fig. 5: Avg. DET curves for the unknown-material scenarios: (a) protocol adopted from [18]; (b) unknown-material protocol from [14].

In Table VIa, we benchmark our results against the state-of-the-art in terms of the ACER. The lowest value in each row is highlighted in bold. As can be observed, even if the individual feature encoding approaches do not outperform the FSB, the fused FPAD approach yields the lowest average ACER for both LivDet 2011 (0.28% vs. 1.67%) and LivDet 2013 (0.43%). On the other hand, the FSB achieves the best performance over LivDet 2015 (0.97% vs. 2.82%).
Nonetheless, it should be noted that the main goal of the present work is not only to achieve the best performance at a single operating point (i.e., the ACER is measured for δ = 0.5) but overall for different applications requiring either a low BPCER (i.e., high convenience) or low APCER (i.e., high security), and also under more challenging and realistic conditions (i.e., unknown sensors or PAI species).
2) Known-Sensor and Unknown-Material Scenario: In this scenario, both training and test samples were acquired by the same sensor, while presentation attacks in the test set were acquired from unknown PAI species. We analyse in detail the best performing single approach (FV) and the FPAD method. For the latter, we select the fixed thresholds obtained for the known-scenario (see α, β values in Table I), and denote this configuration as "fixed thresholds". In addition, we also evaluate its performance on the best α, β threshold combination (hereafter referred to as "optimised thresholds"). The corresponding DET curves are reported in Fig. 5.
Regarding the LivDet 2015 protocol, we can observe a similar behaviour between the FV encoding and the fused FPAD algorithm with fixed thresholds in Fig. 5a. In particular, the BPCER10 and BPCER20 are slightly higher for the individual FV encoding (around 1.6-7% and 3.5-9%), but for high security thresholds, the FPAD achieves lower error rates (BPCER 14.3% vs. 14.4%). Also, the DET curves for Greenbit and Crossmatch are very close, whereas the performance for Hi Scan and Digital Persona decreases. In contrast, the optimised-thresholds FPAD achieves the best performance for Hi Scan, only showing a lower performance for Digital Persona. In all cases, the detection rates are higher, yielding a BPCER as low as 7%. Regarding the state-of-the-art, [30] achieves an average APCER of 22% for a BPCER = 0.1% on the Crossmatch dataset, whereas the FPAD approach achieves an APCER under 20%, thus highlighting its soundness.
In the second set of experiments, we follow the unknown-material protocol defined in [14]. In this case, Fig. 5b shows one of the main strengths of the FV encoding: under high security scenarios, an average BPCER100 under 5% can be achieved. In particular, for Italdata 2011 (BPCER100 = 6.20%) and Italdata 2013 (BPCER100 = 0.0%), those values outperform the ones reported in [13]. Regarding the fused algorithms, it can also be observed that even the fixed-thresholds configuration achieves a BPCER100 comparable to FSB [13] (i.e., BPCER100 = 4.48% vs. 4.24%). In addition, the optimised-thresholds FPAD reports a BPCER100 = 1.85%, less than half of that value.
We finally compare in Table VIb the performance of our methods and FSB [13] in terms of the ACER. We can observe that the FV encoding outperforms the remaining algorithms for three out of the four datasets. Moreover, for the fixed and optimised thresholds, our FPAD pipeline achieves an average ACER = 2.61% and ACER = 1.01% respectively, which considerably outperforms the top state-of-the-art.
3) Unknown-Sensor and Cross-Database Scenarios: Finally, we evaluate the soundness of our proposals in scenarios where different (i.e., unknown) sensors are used following the unknown-sensor and cross-database scenarios proposed by [14].
In the first set of experiments, training and test samples are acquired using different sensors (i.e., a sensor interoperability analysis). Fig. 6a shows the corresponding ISO-compliant evaluation. As can be observed, training over the Italdata subset yields a better performance at all operating points than training over Biometrika (grey vs. orange, and blue vs. yellow curves). Only low BPCERs ≤ 0.5% over LivDet 2013 show a different behaviour. Moreover, for a fixed APCER of 1%, the FV encoding achieves a BPCER100 of 26.80%, which reduces the top state-of-the-art result (BPCER100 = 52.52%) [13] by almost 50%. In addition, our optimised-thresholds FPAD approach attains a BPCER = 0% for all APCERs over the Italdata 2013 train set; we may thus conclude that the method found, from the Italdata 2013 training set, a common feature space which correctly classifies the Biometrika 2013 samples.
Table VIc benchmarks all methods to FSB [13] in terms of ACER. In general, and regardless of the particular traintest combination, FV encoding is able to outperform both the other two encoding approaches and the results obtained in [13] (i.e., average ACER = 7.83% for FV vs. 14.59% for FSB, which implies a relative improvement of 48%). Moreover, the FPAD also outperforms the FSB [13] for both the fixed and the optimised thresholds by a relative improvement of 38% and 55%, respectively.
In the second experiment, the performance is evaluated over a change of data collection for the same sensor (i.e., train and test over the same sensor, but with samples acquired for LivDet 2011 and LivDet 2013, respectively). We refer to this protocol as the cross-database scenario. In Fig. 6b, we can see a different behaviour of each algorithm for the different datasets. Whereas the Biometrika curves (orange and yellow) are very close for the FV encoding, this is not the case for the fused FPAD. This is due to the different generalisation capabilities of the remaining encoding approaches (BoW and Vlad), as may be seen in Table VId. In particular, the ACER achieved when training over Biometrika 2011 differs from that achieved when training over Biometrika 2013 for BoW (28.8% vs. 15.70%), and similarly for Vlad (15.70% vs. 11.10%). In addition, the poor performance of BoW also affects the fixed-thresholds FPAD, thereby yielding a poor BPCER100 of almost 60%. However, the optimised-thresholds FPAD can improve the error rates yielded by FV, achieving an average BPCER100 of 26%.
Finally, coming back to the ACER-based benchmark with FSB [13], we may observe that, on average, the FV approach (ACER = 9.15%), the fixed-thresholds FPAD (ACER = 17.75%) and the optimised-thresholds FPAD (ACER = 8.23%) are all able to outperform the FSB (ACER = 17.91%), by up to a 55% relative improvement.

Fig. 6: Performance evaluation over the unknown-sensor scenarios proposed by [14].

4) Computational Efficiency: In this last set of experiments, we study the computational efficiency of the proposed image encodings for different parameter configurations. For this purpose, we select the LivDet 2015 database, which contains the largest images. We found that the BoW encoding requires 0.38 seconds, Vlad 1.58 seconds, and FV 2.11 seconds. There is thus a trade-off between detection performance and time efficiency. However, in all cases, the algorithms can be utilised for real-time applications.
V. CONCLUSIONS
In this paper, we have proposed a new PAD method based on the combination of local dense-SIFT image descriptors and three different feature encoding approaches (i.e., FV, Vlad, and BoW). The experimental evaluation conducted over the publicly available LivDet 2011, LivDet 2013 and LivDet 2015 databases assessed the performance of our proposals with respect to the top state-of-the-art methods. The analysis of the detection performance showed that the FV reached the best individual detection accuracy for all databases. However, a score-level fusion of the three encoding approaches (denoted as FPAD) yielded an improved performance, significantly outperforming the top state-of-the-art results in the analysed scenarios, especially under the most challenging and realistic scenarios, where both unknown materials and unknown sensors are frequently employed. In addition, this fused approach achieved the highest detection accuracy in the LivDet 2019 competition [21].
It should also be noted that the fixed-thresholds configuration does not always outperform the FV encoding as a standalone algorithm. This highlights the challenges faced when unknown sensors or PAI species are included in the test set. However, a proper tuning of the thresholds yields a very promising performance for the FPAD algorithm.
In more detail, the ISO-compliant evaluation in terms of BPCER and APCER revealed one of the main strengths of the FV encoding and the FPAD proposal: the low BPCERs achieved even at very high-security operating points (i.e., APCER ≤ 1%). Specifically, the FPAD technique yielded an average BPCER100 of 25% in the unknown-sensor scenario, and a BPCER100 of 26% to 28% in the cross-database scenario, thereby outperforming the top state-of-the-art results [13] by up to a relative 50% to 60%, respectively. Moreover, both methods proved to be suitable in the presence of unknown PAI species, achieving BPCER100 values as low as 4.6% and 1%. In summary, the previous results indicate that i) the orientation histograms provided by the dense-SIFT method correctly represent the lack of continuity in the ridge flow, and hence the artefacts produced during the fabrication of PAIs, and ii) FV, as well as the fusion-based proposal, in combination with dense-SIFT descriptors found a new common feature space, which allows both known and unknown PAIs to be detected successfully.
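The BPCER100 operating point (the BPCER measured at the decision threshold where APCER ≤ 1%) can be computed directly from raw comparison scores. The sketch below assumes higher scores indicate bona fide presentations; the synthetic scores are illustrative only:

```python
import numpy as np

def bpcer_at_apcer(bona_scores, attack_scores, apcer_target=0.01):
    """BPCER at the threshold where APCER <= apcer_target.

    BPCER100 corresponds to apcer_target = 0.01 (APCER <= 1%);
    scores are assumed to be higher for bona fide presentations.
    """
    # Threshold at the (1 - target) quantile of the attack scores,
    # so that at most apcer_target of attacks score above it.
    thr = np.quantile(np.sort(attack_scores), 1.0 - apcer_target)
    return float(np.mean(np.asarray(bona_scores) <= thr))  # bona fides rejected

# Synthetic, well-separated score distributions:
bona = np.linspace(0.5, 1.0, 100)
attack = np.linspace(0.0, 0.6, 100)
print(bpcer_at_apcer(bona, attack))  # → 0.19, i.e., a BPCER100 of 19%
```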
Finally, the computational efficiency evaluation showed that the BoW encoding runs in under 400 milliseconds, while the Vlad and FV encodings take over 1150 milliseconds. As future work, we will reduce the computational cost of the Vlad and FV encodings in order to obtain a better trade-off between detection accuracy and computational efficiency.

ANALYSIS OF THE DETECTION PERFORMANCE FOR DIFFERENT VOCABULARY SIZES
As mentioned in the article, the main parameter shared by all feature encoding approaches is the vocabulary size K. The larger K is, the higher the number of visual words, and thus the less information is lost during the quantisation that converts the local dense-SIFT descriptors into the so-called common feature space. However, a larger K also entails a higher computational cost, and can eventually lead to overfitting. Therefore, we analyse here in detail the impact of K on the detection performance and the computational efficiency of the PAD method for each scenario.
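To make the quantisation step concrete, the following sketch assigns each local descriptor to its nearest visual word and builds a normalised BoW histogram. The two-dimensional descriptors and the two-word vocabulary are toy values for illustration; real dense-SIFT descriptors are 128-dimensional, and the vocabulary would be learned, e.g., by k-means:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantise local descriptors into a K-bin visual-word histogram.

    descriptors: (N, D) array of local features (e.g., dense-SIFT, D = 128).
    vocabulary:  (K, D) array of visual words (e.g., k-means centroids).
    Returns an L1-normalised histogram of length K: a larger K means a
    finer quantisation (less information loss) but a longer feature vector.
    """
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # index of the nearest visual word
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Toy example: 6 two-dimensional descriptors, K = 2 visual words.
desc = np.array([[0., 0.], [0.1, 0.], [0., 0.1], [1., 1.], [0.9, 1.], [1., 0.9]])
vocab = np.array([[0., 0.], [1., 1.]])
h = bow_histogram(desc, vocab)  # → [0.5, 0.5]
```

Vlad and FV refine this hard assignment by additionally accumulating (first- and second-order) residual statistics per visual word.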

A. Known-Material and Known-Sensor Scenario
First, we analyse the impact of K on the performance of the three proposed schemes individually. We do so under this all-known scenario in order to avoid a bias due to other variables (i.e., unknown PAI species or sensors). More specifically, we test the following range of values: K = {256, 512, 1024, 2048}, since K > 2048 would yield feature vectors too long for real-time applications.
The ACER values for each method and K are presented in Table VIa, and graphically in Fig. 7a. As can be observed, most curves reach a minimum (i.e., the lowest ACER, and thus the best detection performance) at K = 1024. In some cases, the ACER continues to decrease at K = 2048 (e.g., the BoW encoding for LivDet 2013), thus not reaching a minimum over the selected range. However, as mentioned above, such vocabulary sizes would preclude real-time detection, and are thus not considered in the present study. Focusing on the best value on average, K = 1024, we can highlight that the FV encoding achieves, averaged over all sensors, an ACER of 2.13%, 1.88% and 3.31% on LivDet 2011, LivDet 2013 and LivDet 2015, respectively. On the other hand, the best Vlad performance is also found at K = 1024 (i.e., 2.88% on LivDet 2011 and 2.68% on LivDet 2013) for all databases except LivDet 2015, where the best accuracy is reached at K = 2048 (ACER = 4.16%). Finally, the BoW encoding improves its detection performance as K grows, thereby achieving its minimum ACER at K = 2048.
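For reference, the ACER reported throughout this appendix is simply the average of the two ISO/IEC 30107-3 error rates. The numbers in the example below are illustrative, not values taken from the tables:

```python
def acer(apcer, bpcer):
    """Average Classification Error Rate: the mean of APCER (attack
    presentations wrongly accepted) and BPCER (bona fide presentations
    wrongly rejected), both expressed here in percent."""
    return (apcer + bpcer) / 2.0

# Hypothetical operating point: APCER = 2.5%, BPCER = 1.26%
# averages to an ACER of 1.88%.
example = acer(2.5, 1.26)
```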

B. Known-Sensor and Unknown-Material Scenario
In this scenario, both training and test samples were acquired by the same sensor, while presentation attacks in the test set were acquired from unknown PAI species.
In the first set of experiments, we select the LivDet 2015 database, since it already includes unknown PAI species for testing. Fig. 7b shows, in terms of ACER, the impact of the key parameter K on the performance of the proposed encoding techniques. As can be seen, the average performance (represented by a dashed red line) improves with increasing values of K, reaching a minimum at K = 2048. More specifically, the FV encoding yields the best ACER results, with an average value of 3.31%.
We have also analysed the unknown-materials protocols for LivDet 2011 and 2013 proposed in [14]. The results are presented in Table VIb. In this case, only the BoW encoding reaches its best detection performance at K = 2048. On average, the best results are yielded by K = 512 for FV and K = 256 for Vlad.
Finally, it should also be highlighted that, for all three datasets (i.e., LivDet 2011, 2013 and 2015), BoW shows a wider variability range over the different values of K. For instance, on LivDet 2015 the ACER varies between 3.03% and 4.44% for FV, between 1.64% and 8.61% for Vlad, and between 9.28% and 16.60% for BoW. Therefore, BoW is much more sensitive to changes in K.

C. Unknown-Sensor and Cross-Database Scenarios
Finally, in order to evaluate the soundness of our proposals in scenarios where different (i.e., unknown) sensors are used, we follow the unknown-sensor and cross-database scenarios proposed by [14].
In the first set of experiments, training and test samples are acquired using different sensors. Table VIc shows the ACER for different values of K. As can be observed, the FV encoding achieves its best results at different values of K, depending on the sensor used for training: whereas for Italdata 2011 and 2013 the lowest ACER is achieved at K = 512 (9.60% and 0.90%), for Biometrika it is obtained at K = 2048 (18.50% and 1.20%). In general, and regardless of the particular train-test combination, the FV encoding is able to outperform both the other two encoding approaches and the results obtained in [13] (i.e., ACER = 7.83% for FV vs. 14.59% for FSB [13], which implies a relative improvement of 48%). These results indicate that the FV encoding finds a set of common features in the training images that allows a correct detection of PAIs acquired with other sensors.

In the second experiment, the performance is evaluated over a change of data collection with the same sensor (i.e., training and testing on the same sensor, but on data acquired for LivDet 2011 and LivDet 2013, respectively). We refer to this protocol as the cross-database scenario, and Table VId shows the impact of K on each proposed approach. As can be observed, the FV encoding again outperforms both the other encoding approaches presented in this study and the top state-of-the-art results. In particular, in three out of four cases, the best performance is achieved at K = 2048. Only for Biometrika13-Biometrika11 is the best performance reached at K = 512.
Under these last two scenarios, the range of variability of BoW's performance is comparable to those of FV and Vlad. However, its ACER is up to 4.8 times higher, thus making this encoding less suitable for PAD purposes than the other two.
In general, we have seen how different values of K can impact the performance of the PAD method and how, depending on the scenario considered, different values yield the best performance. However, K = 1024 always either achieved the best performance for FV and Vlad or came close to it. Therefore, we can conclude that, if no data are available to carefully analyse the best option, 1024 can be chosen as a sub-optimal default value for K.

D. Computational efficiency
In this last set of experiments, we study the computational efficiency of the proposed image encodings for different parameter configurations. For this purpose, we select the LivDet 2015 database, since it contains the largest images. Table VII shows the average computational efficiency of the proposal over different vocabulary sizes K. As could be expected, different K values have an impact on the average computational efficiency of the proposed methods, since the feature-vector sizes depend directly on K. More specifically, these efficiency results indicate that larger vocabulary sizes K worsen the computational efficiency of the PAD methods in many cases. On the other hand, in some cases, larger K values also lead to a better detection performance.
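The dependence of the feature-vector length on K can be made explicit. The formulas below follow the standard definitions of the three encodings (BoW: one count per word; Vlad: one D-dimensional residual per word; FV: first- and second-order statistics per Gaussian, assuming a diagonal-covariance GMM without mixture-weight gradients); whether the paper's FV implementation matches this exactly is an assumption:

```python
def encoding_dims(K, D=128):
    """Feature-vector length per encoding for vocabulary size K and
    D-dimensional local descriptors (D = 128 for SIFT)."""
    return {
        "BoW": K,          # one occurrence count per visual word
        "Vlad": K * D,     # one D-dimensional residual per visual word
        "FV": 2 * K * D,   # mean and variance gradients per Gaussian
    }

# K = 1024 gives: BoW 1024, Vlad 131072, FV 262144 dimensions,
# which explains why BoW remains much cheaper to score than FV.
dims = encoding_dims(1024)
```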
It should be noted that, in all cases, the runtimes reported for the BoW encoding are below 400 milliseconds for every parameter combination, while for the FV encoding they are above 1100 milliseconds. Therefore, since FV is the most accurate approach, it will be interesting to improve its computational efficiency in future work in order to attain a better trade-off between detection accuracy and computational efficiency.