Introduction
A. Keystroke Dynamics for Biometric Recognition
Keystroke Dynamics (KD) refers to the typing behavior of human subjects. It is commonly regarded as a behavioral biometric trait, similar to voice [1], signature [2], [3], gait [4], [5], [6], touch gestures [7], [8], etc. In comparison with physiological counterparts such as face or fingerprint, behavioral biometrics represent a more challenging technical problem in terms of recognition performance, as they are in general characterized by higher intra-user variability and lower inter-user variability. These challenges are magnified when dealing with real-life applications that have up to millions of subjects. Nevertheless, behavioral traits offer the advantage of verifying identities transparently, improving both security and usability [9], since they spare users from having to actively carry out a specific verification procedure (such as having their fingerprint scanned, or typing a password).
In particular, the deployment of keystroke dynamics verification systems is also economical, as there is no need for additional hardware. Potential applications span from verifying the subject's identity while they write an email or take a test on online educational platforms (free-text format) [10], to identifying malicious users across multiple accounts based on their typing style (free-text format) [11], or adding a biometric security layer on top of a traditional knowledge-based password (fixed-text format) [12], etc. These aspects have prompted several companies to develop commercial solutions to enhance the security of users through KD [11].
A coarse classification of KD can take place according to two criteria: (i) the type of acquisition device (keyboard): desktop or mobile. Due to differences in the pose or activity of typing subjects, more variability is commonly associated with mobile touchscreens in comparison with desktop keyboards; and (ii) the text format, which can be free, fixed, or transcript. In the first case, the text typed is not the same across different samples: consequently, data are much sparser, more unstructured, and present a higher rate of typing errors compared to the fixed-text case, which aims to represent, for instance, the case of an intruder typing the password of the victim. Finally, transcript text can be defined as a hybrid format, as the subjects are asked to read, memorize, and type a text that is presented to them.
In its simplest form, keystroke dynamics are captured as discrete time instants: the time instants a key is pressed and released (for instance in Unix time format), accompanied by the code (ASCII) of the key pressed. More complex features can be extracted from these raw data. In particular, the ASCII codes are useful for learning relations between time and spatial distributions over the keyboard layout. Nevertheless, although handled in compliance with sensitive data protection regulations [13], they inevitably reveal the content of the text, putting at risk the privacy of the subjects [14], [15]. Other information such as the amount of pressure on the key or the size of the fingertip might be available depending on the specific hardware capabilities.
B. Limitations of Existing Evaluation Methodologies
In the field of keystroke biometrics, a typical obstacle to research advancement is the heterogeneity of databases, experimental protocols, and metrics. In Table 1, some of the most important public keystroke dynamics databases are reported in chronological order. Although the literature on keystroke biometrics is extensive, to the best of our knowledge, except for very few cases [11], [16], previous systems have mostly been evaluated with up to several hundred subjects only, which does not represent well the challenges that massive-usage applications face. In addition, most research works are mainly focused on desktop and fixed-text scenarios. Therefore, keystroke dynamics can still be considered a biometric modality at an early stage, especially for mobile devices. In fact, for mechanical keyboards of desktop computers, more in-depth evaluations have been conducted and commercial applications have been proposed [17]. Moreover, even when using the same databases, different systems proposed in the literature over the years have often been developed based on different subsets of users for development and evaluation, numbers of enrolment sessions, and metrics, hindering direct comparisons. In contrast, we propose a clearly defined experimental protocol based on the same realistic use cases (Sec. VI), which can be easily adopted by researchers and practitioners of the field using the provided comparison files (Sec. II).
C. Fairness Considerations
Moreover, in the context of decision-making, algorithms are vulnerable to biases that render their decisions “unfair” [28], [29]. In this context, fairness is defined as the absence of any prejudice or favoritism toward an individual or group based on their inherent or acquired characteristics. In recent years, numerous studies have highlighted the existence of biases in biometric systems with regard to categories such as age, gender, and ethnicity [30], leading to worse decisions that affect specific demographic groups. In addition, these aspects are also relevant from the point of view of the privacy of users, as the existence of bias due to sensitive attributes often implies that the sensitive attributes themselves might be embedded in the learned representations [31], [32]. Therefore, the risk of leakage of some soft-biometric1 information about the subjects should be assessed as well. In general terms, the existence of bias or privacy leakage in biometric systems presupposes specific patterns in the input data associated with different demographic groups. For instance, in face biometrics, the existence of biological differences between different genders, ages, or ethnic groups is a trivial hypothesis that does not need a formal demonstration. However, for many biometric modalities, including KD, it is not straightforward to make similar assumptions. Nevertheless, for KD, several studies have evaluated the predictability of gender [35], [36], age [37], [38], both [39], [40], and even emotions [41] and mother tongue [42]. In light of this, in this article we propose an experimental framework designed to highlight potential gender and age biases in the scores, which are still mainly unexplored aspects for KD at such a large scale. Within the current work, the focus is limited to age and gender because other potential sources of bias (such as the subjects' mother tongue, the device used, or the degree of familiarity of the subjects with keyboards) were not reported for most subjects in the raw databases.
D. Contributions
In brief, the main contributions of this article can be summarized as follows:
We propose a novel experimental framework to benchmark KD for biometric verification, which, to the best of our knowledge, is still lacking in this field. The framework is provided in the form of the Keystroke Verification Challenge (KVC),2 hosted on CodaLab.3 The CodaLab platform returns several metrics (Sec. III) that quantify the recognition performance as well as the fairness of biometric systems. To create the framework, we consider two of the largest public databases of keystroke dynamics to date, the Aalto Desktop [25] and Mobile [26] Keystroke Databases, extracting datasets that guarantee a minimum amount of data per subject, age and gender annotations, and the absence of corrupted data, and that avoid overly unbalanced subject distributions with respect to the considered demographic attributes.
We illustrate the main aspects of the proposed framework by considering two recent state-of-the-art keystroke biometric systems, TypeNet [11] and TypeFormer [16], [43]. To this end, we propose a thorough analysis considering four different sets of features (Sec. VI-A) towards more privacy-preserving biometric systems that do not require the ASCII code, which would reveal the text content, as an input feature. Our experiments show that by removing the spatial information of the key location on the keyboard layout (ASCII code) in favor of additional features in the time domain, an acceptable level of performance is maintained.
A comparative analysis of keystroke dynamics verification systems in desktop and mobile scenarios is provided (Sec. VII-A).
We propose a new metric, the Skewed Impostor Ratio (SIR), useful to quantify how much harder a pairwise comparison between subjects belonging to the same demographic group is for the classifier, in relation to comparisons between subjects belonging to different groups.
The remainder of the article is organized as follows: first, the resources provided within the proposed experimental framework are described (Sec. II). Then, Sec. III includes a detailed presentation of the evaluation protocol of the experimental framework and challenge, whereas Sec. IV presents the metrics adopted, including the definition of the SIR, a novel metric proposed in this article. Sec. V provides an overview of the two biometric systems, TypeNet [11] and TypeFormer [16], utilized to validate the framework, followed by Sec. VI, in which the set of experiments for privacy enhancement is illustrated. Finally, Sec. VII and Sec. VIII contain the analysis of the results obtained and the concluding remarks of the article, respectively.
Resources Provided
The proposed experimental framework is based on the two most complete and large-scale public databases of free-text keystroke dynamics to date, collected by the User Interfaces4 group of Aalto University (Finland). The two databases were collected respectively in a desktop5 [25] and mobile6 [26] acquisition environment, and include around 168,000 and 60,000 subjects, respectively, thus representing well the typical challenges related to massive application usage. Each of the acquisition sessions contains a sentence of transcript text (variable content, but not fully free-text). The data were captured through a web application in an unsupervised way under realistic scenarios. Subjects were asked to read, memorize, and type on their device English sentences that were randomly selected from a set of 1,525 sentences. Subject metadata such as age and gender are self-reported during the data acquisition.
The two databases have been processed to arrange the data in a convenient format for the analysis of KD. The raw data acquired consist of the timestamp of the instant a key is pressed, the timestamp of the instant the key is released, and the key ASCII code. After discarding some of the subject data due to insufficient acquisition sessions per subject (fewer than 15 per subject), the two databases as downloaded have been rearranged to form four datasets:
Desktop Dataset:
Development set: 115,120 subjects, provided in a single .npy file that contains a Python nested dictionary (subject IDs: session IDs: data). Average session length: 48.65 (σ = 18.50) characters typed.
Evaluation set: data from 15,000 subjects, provided in a single .npy file that contains a shallow Python dictionary (session IDs: data). Average session length: 48.77 (σ = 18.64) characters typed.
Mobile Dataset:
Development set: 40,639 subjects, provided in a single .npy file that contains a Python nested dictionary (subject IDs: session IDs: data). Average session length: 48.59 (σ = 21.84) characters typed.
Evaluation set: data from 5,000 subjects, provided in a single .npy file that contains a shallow Python dictionary (session IDs: data). Average session length: 47.98 (σ = 20.93) characters typed.
The proposed experimental framework follows an open-set learning protocol; in other words, the subjects in the development and evaluation sets are different.7 To quantify the variability of the results, the evaluation comparisons (Sec. III) were divided in 10 random subsets. Then, we computed the global EER (Sec. IV) for each of the random subsets, to provide mean and standard deviation. As an example, we report the following values for TypeFormer 5F (Sec. VI-A):
Table 2 shows the demographic distribution of the datasets provided in the KVC. The subjects have been divided into six age groups (10 - 13, 14 - 17, 18 - 26, 27 - 35, 36 - 44, 45 - 79). The evaluation sets are balanced with respect to gender. The gender and age labels of the development set are released together with the data.
The evaluation sets are separated by scenario (desktop and mobile), and they are provided in the form of two shallow Python dictionaries containing independent sessions. Such data are accompanied by the respective lists of pairwise comparisons to be carried out. Two Python script files are provided to load the data, and run the comparisons, generating a text file with the scores of each comparison, ready to be submitted to CodaLab for scoring. To push forward the state of the art and deepen the knowledge on the topic, the proposed protocol is designed for researchers working on KD as a novel tool to evaluate different approaches (pre-processing of input features, classifier architectures, learning approaches, etc.) for different goals (biometric recognition and fairness improvement) under the same experimental conditions, considering various metrics.
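As an illustration of how the provided resources fit together, the following sketch outlines how the evaluation data and comparison lists described above could be loaded and scored. The file names, the comparison-file format, and the similarity() placeholder are hypothetical; the actual loading and scoring code is the one distributed with the KVC.

```python
import numpy as np

# Minimal sketch of how the evaluation data and comparison lists could be used,
# assuming the shallow-dictionary layout described above. File names, the
# comparison-file format, and the similarity() placeholder are hypothetical.
sessions = np.load("desktop_evaluation_set.npy", allow_pickle=True).item()
# sessions: {session_id: array of (press_time, release_time, ascii_code) rows}

def similarity(session_a, session_b):
    # Placeholder for the model-based 1vs1 comparison (see Sec. V).
    raise NotImplementedError

with open("comparisons_desktop.txt") as f_in, open("scores.txt", "w") as f_out:
    for line in f_in:
        id_a, id_b = line.split()                 # one pairwise comparison per line
        score = similarity(sessions[id_a], sessions[id_b])
        f_out.write(f"{score:.6f}\n")             # one normalized score per comparison
```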
Evaluation Description
The design and the implementation of the evaluation protocol described in this section represent a significant novel aspect proposed in the current work.
The two tasks (desktop and mobile) are structured similarly, and they are designed for a biometric verification protocol. In other words, a score between 0 and 1 related to a single comparison of two biometric samples will be produced (1: same identity, 0: different identities). It is a binary classification problem, as it is not necessary to ascertain to which identity a specific biometric sample belongs (identification). In this experimental framework, a biometric sample corresponds to an acquisition session.
The total number of 1 vs 1 session-level comparisons is as follows:
Task 1 (Desktop): 2,250,000 comparisons, involving 15,000 subjects not included in the development set.
Task 2 (Mobile): 750,000 comparisons, involving 5,000 subjects not included in the development set.
The design of the comparisons is illustrated in Fig. 1. For each subject, there are 5 enrolment sessions and 10 verification sessions, leading to 50 1vs1 comparisons, which are averaged over the 5 enrolment sessions, generating 10 genuine scores per subject. In a similar manner, 20 impostor scores per subject are generated. The impostor sessions are divided into two groups: 10 similar impostor scores, for which the verification sessions are randomly selected from subjects belonging to the same demographic group (same gender and age); and 10 dissimilar impostor scores, in which the verification sessions are all randomly selected from subjects of different gender and age intervals (Sec. IV).
Fig. 1. Each one of the verification sessions is compared with each of the enrolment sessions. For easier comprehension, examples of faces showing gender and age are included instead of keystroke examples. Then, the scores generated are averaged over the enrolment sessions, leading to three distributions: genuine (green), similar impostor (orange), and dissimilar impostor (red).
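For clarity, the comparison design just described can be expressed programmatically. The sketch below assumes a generic 1vs1 scoring function and shows how the 5 enrolment sessions and the 30 verification sessions of a subject yield 10 genuine, 10 similar-impostor, and 10 dissimilar-impostor scores; names and structure are illustrative.

```python
import numpy as np

def subject_score_distributions(score_fn, enrol, genuine, similar_imp, dissimilar_imp):
    """Aggregate 1vs1 session scores per subject, following the design of Fig. 1.

    enrol: list of 5 enrolment sessions of the subject.
    genuine / similar_imp / dissimilar_imp: lists of 10 verification sessions each.
    Each verification session is compared against all 5 enrolment sessions, and
    the 5 resulting scores are averaged, yielding 10 scores per distribution.
    """
    def avg_over_enrolment(sessions):
        return [float(np.mean([score_fn(e, v) for e in enrol])) for v in sessions]

    return (avg_over_enrolment(genuine),         # 10 genuine scores
            avg_over_enrolment(similar_imp),     # 10 similar impostor scores
            avg_over_enrolment(dissimilar_imp))  # 10 dissimilar impostor scores
```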
Based on the described evaluation design, following [11] and [16], we consider two cases for evaluating the system:
Global distributions: this case corresponds to dividing all scores into two groups, genuine and impostor scores, regardless of which subject they belong to. This case corresponds to having a fixed, pre-determined threshold, implying a simpler deployment of the biometric system. In order to assess the performance of the biometric system, this choice means setting one single threshold for all comparisons to obtain a decision.
Mean per-subject distributions: the optimal threshold is computed at subject level, considering the 30 verification scores described above. This choice corresponds to providing the system with more flexibility, so that it can adapt to user-specific distributions [44], [45]. In a real-life use case, this would require processing the subject's enrolment samples to establish a threshold, and it can be done as follows: acquiring several enrolment samples, from which a genuine subject-specific score distribution is derived by considering pairwise comparisons between enrolment samples; considering a pool of samples from different subjects, from which an impostor subject-specific score distribution is derived by considering pairwise comparisons with the genuine enrolment samples; and computing a subject-specific threshold based on the two distributions. It is important to highlight that this does not require re-training or fine-tuning the biometric system using subject-specific data. Then, all metrics computed per subject are averaged over all subjects in the evaluation set to obtain the values displayed. Generally, the verification performance of the system benefits from considering a different threshold per user.
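As a minimal sketch of the per-subject case (assuming, as in Sec. V, normalized similarity scores in [0, 1], with genuine scores expected to be high and impostor scores low), a subject-specific EER and threshold could be computed as follows; the coarse threshold sweep is illustrative only.

```python
import numpy as np

def per_subject_eer(genuine_scores, impostor_scores):
    """EER and threshold for one subject from its 10 genuine and 20 impostor scores."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    best_gap, eer, threshold = np.inf, 1.0, 0.5
    for t in np.sort(np.concatenate([genuine, impostor])):
        fnmr = np.mean(genuine < t)      # genuine attempts rejected at threshold t
        fmr = np.mean(impostor >= t)     # impostor attempts accepted at threshold t
        if abs(fmr - fnmr) < best_gap:
            best_gap, eer, threshold = abs(fmr - fnmr), (fmr + fnmr) / 2, t
    return eer, threshold

# The mean per-subject EER is then the average of per_subject_eer(...)[0]
# over all subjects in the evaluation set.
```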
Metrics Adopted
Over the years, several metrics have been proposed for biometric verification. The common aspect of all metrics is that they are based on the (normalized) scores that are typically generated by pairwise comparisons of biometric data. However, comparing systems is often difficult if they are evaluated according to different metrics. Moreover, the attention of the scientific community has recently shifted towards the evaluation of the fairness of systems [46]. Consequently, based on the scores, we also provide an initial assessment of this important aspect which, to the best of our knowledge, is still unexplored for KD. An overview of all metrics considered is provided below. The scores will be computed considering global and mean per-subject distributions. All the presented metrics are returned by the KVC CodaLab scoring program, easily allowing experimental analyses of multiple aspects, such as different sets of input features for privacy enhancement (Sec. VII). To the best of our knowledge, these scenarios have not been proposed in previously existing literature.
A. Verification Metrics
In biometrics, a false match (FM) is defined as a comparison decision of a match for a biometric probe and a biometric reference that are from different biometric capture subjects, while a false non-match (FNM) is defined as a comparison decision of non-match for a biometric probe and a biometric reference that are from the same biometric capture subject and of the same biometric characteristic. The rates respectively associated with these events, the FMR (FNMR), correspond to the proportion of the completed biometric non-mated (mated) comparison trials that result in a false match (false non-match) [47].
1) Equal Error Rate (EER)
The EER describes the point at which the FMR and FNMR curves intersect. The two rates typically have opposite trends with respect to the threshold setting (in the case of genuine scores closer to 1 and impostor scores closer to 0, as the threshold of a biometric system increases, the FMR will drop and the FNMR will rise). On the DET curve (Sec. IV-B), which is the plot of the FNMR against the FMR at various threshold settings, the EER corresponds to the point where FMR = FNMR.
2) False Non-Match Rate at X% False Match Rate (FNMR @ X% FMR)
We consider the FNMR at a fixed FMR of X = 1% (and, for the global distributions, also X = 10%), i.e., the proportion of genuine attempts rejected when the decision threshold is set so that X% of impostor attempts are falsely accepted.
3) Area Under the Receiver Operating Characteristic (ROC) Curve (AUC)
The ROC curve (Sec. IV-B) is the plot of the TMR (True Match Rate) against FMR, at various threshold settings. A true match corresponds to the case of a genuine subject recognized as such. By definition, the TMR and the FNMR sum to 1. A perfect classifier has an Area Under the ROC Curve (AUC) of 1.
4) Accuracy
The accuracy is computed as the fraction of correctly classified attempts at a given discrimination threshold (in this framework, the threshold corresponding to the global EER).
5) Rank-n
This metric concerns the identification of subjects (i.e., 1-to-many comparisons), therefore assessing a different scenario from the previous metrics, which refer to the case of verification (i.e., binary classification). Starting from the comparison of the biometric enrolment samples of a subject with verification samples from multiple subjects, the candidate identities are ranked by comparison score; rank-n corresponds to the proportion of cases in which the genuine subject appears among the n highest-ranked candidates. In this framework, rank-1 is reported.
B. Curves
1) Score Histograms
They are computed considering the global genuine and impostor distributions. Ideally, there is a clear separation between the two, with the genuine distribution shifted toward 1 and the impostor one toward 0. A smaller overlap of the tails corresponds to a better performance of the system.
2) Detection Error Trade-Off (DET) Curve
It is the plot of the FNMR against the FMR at various threshold settings, typically on a non-linear scale. As the threshold decreases, the number of false matches (impostor subjects classified as genuine) increases, and the number of false non-matches (genuine subjects classified as impostors) decreases. The closer the DET curve is to the bottom left corner, the better the biometric system.
3) ROC Curve
It is the plot of the TMR against FMR, at various threshold settings.
C. Fairness Metrics
1) Standard Deviation (STD) of EER By Demographic Group
It considers the demographic differential assessment by calculating the standard deviation of the accuracy across all demographic groups at a given discrimination threshold (here, the threshold corresponding to the global EER).
2) Skewed Error Ratio (SER) of EER By Demographic Group
Skewness is a measure of the asymmetry of a distribution. Similarly to the STD, SER is computed across demographic subsets as the ratio between the greatest and smallest error scores. It mainly represents the difference between the sensitive attribute with the best and worst performance. The larger the value, the greater the difference in the algorithm’s discrimination towards a certain attribute. The optimal SER value is 1.
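A compact formulation consistent with this description, where EER_d denotes the EER of demographic group d, is:\begin{equation*} \textrm {SER} = \frac {\max _{d} \textrm {EER}_{d}}{\min _{d} \textrm {EER}_{d}}\end{equation*}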
3) Fairness Discrepancy Rate (FDR)
It was proposed in [48]. It considers the FMR and FNMR trade-off in the demographic differential assessment by calculating the maximum difference in FMR and FNMR performance between any two demographic groups, at a given discrimination threshold.
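For reference, the formulation proposed in [48] can be written as follows (notation adapted to this article), where D denotes the set of demographic groups and α ∈ [0, 1] weights the two differentials:\begin{align*} \textrm {FDR}(\tau) &= 1 - \left ({\alpha A(\tau) + (1-\alpha) B(\tau) }\right), \\ A(\tau) &= \max _{d_{i}, d_{j} \in D} \left |{ \textrm {FMR}_{d_{i}}(\tau) - \textrm {FMR}_{d_{j}}(\tau) }\right |, \\ B(\tau) &= \max _{d_{i}, d_{j} \in D} \left |{ \textrm {FNMR}_{d_{i}}(\tau) - \textrm {FNMR}_{d_{j}}(\tau) }\right |\end{align*}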
4) Inequity Rate (IR)
It was proposed in [49]. It is computed considering the ratios between the maximum and minimum FMR and FNMR across demographic groups, at a given discrimination threshold.
5) GINI Aggregation Rate for Biometric Equitability (GARBE)
It was proposed in [50] to overcome the limitations of the FDR and the IR. In fact, the former does not scale the values of FMR and FNMR to the same order of magnitude, whereas the latter has no theoretical upper bound and may have a denominator equal to zero. GARBE is inspired by the mathematics of the Gini coefficient, computed for x ∈ {FMR, FNMR} as:\begin{equation*} G_{x}(\tau) = \left ({\frac {n}{n-1} }\right) \left ({\frac {\sum _{i=1}^{n}\sum _{j=1}^{n} |r_{i}-r_{j}|}{2 n^{2}\bar {r}} }\right)\end{equation*} where n is the number of demographic groups, r_i is the error rate (FMR or FNMR) of the i-th group, and \bar{r} is the mean error rate across groups.
\begin{equation*} \textrm {GARBE}(\tau) = \alpha G_{\textrm {FMR}}(\tau) + (1-\alpha) G_{\textrm {FNMR}}(\tau)\end{equation*}
6) Skewed Impostor Ratio (SIR)
This is a novel metric proposed in this article. Normalized impostor scores are grouped according to a specific attribute (age or gender). For instance, considering age, it is possible to group the comparisons as follows: ‘10-13 vs 10-13’, ‘10-13 vs 14-17’, ‘10-13 vs 18-26’, and so on, considering all combinations. For each score group combination, the average value is taken. Then, all values can be arranged in a matrix, which is symmetric. The elements on the main diagonal represent comparisons between different subjects belonging to the same age or gender group (‘10-13 vs 10-13’, ‘14-17 vs 14-17’, etc.), while all other elements are obtained from the remaining cross-group comparisons. The ratio between the mean value of the elements on the main diagonal and that of the remaining non-duplicated elements is finally computed as a percentage. This value expresses how much harder a comparison between different subjects belonging to the same demographic group is, in relation to subjects belonging to different groups, quantifying to which extent demographic information is retained in the scores. It can be formulated as follows:\begin{equation*} \textrm {SIR} = 100 \left ({\frac {\mu (s_{ii})}{\mu (s_{ij, i\neq j})}-1}\right), \quad i,j = 1, \ldots, n\end{equation*} where s_{ij} denotes the mean normalized impostor score for comparisons between group i and group j, \mu(\cdot) denotes the average, and n is the number of groups. The SIR offers several advantages (a minimal computational sketch is provided after the following list):
It is not necessary to select a threshold value.
It focuses on both intra-group and inter-group relations, highlighting the differences between the two cases. If the differences between the two cases are neither significant nor consistent, then the system is bias-free.
It is not necessary to have access to the system, which can be treated as a black box. It is sufficient to run a significant number of appropriately distributed comparisons.
By considering the entire matrix as described above, it is possible to focus on comparisons between specific groups, gaining valuable insights about the system and the similarities between demographic groups.
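The sketch below illustrates the SIR computation from the matrix of mean impostor scores; the matrix layout is assumed to be the one described in the definition above.

```python
import numpy as np

def skewed_impostor_ratio(score_matrix):
    """SIR from an n x n symmetric matrix of mean normalized impostor scores.

    score_matrix[i, j] is the average impostor score for comparisons between
    subjects of group i and subjects of group j (n = number of groups, e.g.,
    n = 6 for age, n = 2 for gender).
    """
    s = np.asarray(score_matrix, dtype=float)
    intra = np.mean(np.diag(s))                # same-group (main diagonal) comparisons
    upper = np.triu_indices(s.shape[0], k=1)   # non-duplicated cross-group entries
    inter = np.mean(s[upper])
    return 100.0 * (intra / inter - 1.0)
```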
Biometric Verification Systems
Throughout the proposed framework, we evaluate two recent state-of-the-art deep-learning models:
TypeNet (2021) [11]: a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN), trained with triplet loss. In this case, we consider input sequences of 150 characters typed. TypeNet is implemented in TensorFlow [51].
TypeFormer (2023) [16], [43]: a novel transformer architecture consisting of a temporal and a channel module enclosing two LSTM RNN layers, a Gaussian Range Encoding (GRE), a multi-head self-attention mechanism, and a block-recurrent transformer structure. TypeFormer is also trained with triplet loss. In this case, we consider input sequences of 50 characters typed. TypeFormer is implemented in PyTorch [52].
Both approaches utilize Distance Metric Learning (DML) [53]. The fundamental concept of DML involves training a model that transforms input data into a new feature space, enabling straightforward distances to be used for analyzing and leveraging the “semantic” arrangement of the input space [54]. A DML approach aims to establish a neighborhood structure in the feature space by considering the relationship between intra-class distances (distances among samples from the same class) and inter-class distances (distances among samples from different classes). In an ideal feature space, samples from the same class remain in close proximity, while samples from different classes are distinctly separated. Following this idea, the input sequences obtained from all sessions are transformed into feature embeddings, which are expected to have lower Euclidean distances if belonging to the same subject, and higher otherwise. In the test stage, the distances obtained from the comparisons of the feature embeddings corresponding to each of the test sessions are normalized, and then subtracted from 1 in order to transform them into similarity scores.
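A minimal sketch of this final scoring step is given below, assuming that the trained model has already produced one embedding per session; the max-based normalization is only one possible choice, as the text above specifies only that distances are normalized and subtracted from 1.

```python
import numpy as np

def similarity_scores(embeddings_a, embeddings_b):
    """Convert embedding distances into similarity scores in [0, 1].

    embeddings_a / embeddings_b: arrays of shape (num_comparisons, dim),
    one row per session embedding (e.g., produced by TypeNet or TypeFormer).
    """
    distances = np.linalg.norm(embeddings_a - embeddings_b, axis=1)  # Euclidean distances
    normalized = distances / np.max(distances)                       # normalize to [0, 1]
    return 1.0 - normalized                                          # high score = same subject
```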
Experimental Protocol
A. Evaluation of Privacy-Enhancing Input Features
The two biometric systems considered, TypeNet and TypeFormer, take as input the same set of features extracted from the raw data (Unix timestamps of the actions of pressing and releasing a key), which include:
(i) Hold Time (HT): time interval between the release and press instants of a given key, expressed in seconds.
(ii) Inter-Press Time (IPT): time interval between two consecutive press actions, expressed in seconds.
(iii) Inter-Release Time (IRT): time interval between two consecutive release actions, expressed in seconds.
(iv) Inter-Key Time (IKT): time interval between a release and the following press action, expressed in seconds.
(v) ASCII code (ASCII), normalized by dividing it by 255.
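A minimal sketch of this feature extraction is given below, assuming that each raw keystroke is available as a (press_time, release_time, ascii_code) triplet with times already expressed in seconds; the zero-padding of the first keystroke is an illustrative design choice.

```python
import numpy as np

def extract_5f(session):
    """Per-keystroke features [HT, IPT, IRT, IKT, ASCII/255] (experiment 5F).

    session: array of shape (num_keys, 3) with columns
    (press_time, release_time, ascii_code), times in seconds.
    """
    press, release, ascii_code = session[:, 0], session[:, 1], session[:, 2]
    ht = release - press                                     # Hold Time
    ipt = np.diff(press, prepend=press[0])                   # Inter-Press Time
    irt = np.diff(release, prepend=release[0])               # Inter-Release Time
    ikt = np.concatenate([[0.0], press[1:] - release[:-1]])  # Inter-Key Time
    return np.stack([ht, ipt, irt, ikt, ascii_code / 255.0], axis=1)
```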
Such features are graphically represented in black in Fig. 2. Input features (i)-(iv) are useful to capture the typing behavior of the user in the time domain, whereas the ASCII code (v) describes the spatial relations due to the location of the key pressed on the keyboard layout. However, although handled in compliance with sensitive data protection regulations [13], the acquisition and processing of the ASCII codes inevitably reveal the content of the text, putting at risk the privacy of the users. Consequently, in this experiment, we strive to remove the ASCII code information in order to make the system content-agnostic and, consequently, more privacy-preserving. The sets of experiments run can be summarized as follows:
First, to evaluate and compare TypeNet and TypeFormer, we consider their original set of features (the experiment is named 5F, where “F” stands for “features”), marked in black in Fig. 2.
Then, to quantify the importance of the ASCII code information, we remove the ASCII code information from the original set of features (experiment 4F).
We consider an extended set of time-domain features. Several studies have in fact shown the usefulness of considering groups of keys typed such as digraphs, trigraphs, and n-graphs [55]. By considering not only adjacent keys, but also groups of three keys (Fig. 2, in red and blue), we obtain a set of 10 features (experiment 10F):\begin{align*} [\textrm {HT}, \textrm {IPT}, \textrm {IRT}, \textrm {IKT}, \textrm {IPT2}, \textrm {IRT2}, \textrm {IKT2}, \\ \textrm {IPT3}, \textrm {IRT3}, \textrm {IKT3}]\end{align*}
We consider the extended set of time-domain features, together with the ASCII code (experiment 11F).
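A possible extension of the previous sketch to the 10-feature time-only set is shown below; interpreting the extended features as intervals between keystrokes that are two and three positions apart is an assumption made for illustration, since the exact definition follows Fig. 2.

```python
import numpy as np

def extract_10f(session):
    """Time-only features for experiment 10F (sketch).

    IPT2/IRT2/IKT2 and IPT3/IRT3/IKT3 are interpreted here as intervals between
    keystrokes that are two and three positions apart, respectively (assumption).
    """
    press, release = session[:, 0], session[:, 1]

    def lagged_interval(later, earlier, lag):
        out = np.zeros(len(press))
        out[lag:] = later[lag:] - earlier[:-lag]   # zero-padded for the first `lag` keys
        return out

    features = [release - press]                   # HT
    for lag in (1, 2, 3):                          # adjacent keys, two apart, three apart
        features.append(lagged_interval(press, press, lag))      # IPT, IPT2, IPT3
        features.append(lagged_interval(release, release, lag))  # IRT, IRT2, IRT3
        features.append(lagged_interval(press, release, lag))    # IKT, IKT2, IKT3
    return np.stack(features, axis=1)              # 10 features per keystroke
```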
A diagram representing the initial feature extraction process for a given time instant is provided in Fig. 2.
In each case, the deep learning models are trained from scratch.
B. Model Training
The training of both models takes place on the KVC development set considering settings identical to those described in their respective papers. The only differences are related to the division into training and validation sets. For TypeNet, we consider a subset of 400 subjects to validate the model at the end of each training epoch in terms of average EER per subject. This choice is justified by the experimental protocol followed in [11]. For TypeFormer, we consider an 80%-20% train-validation division of the KVC development set, and we adopt the global EER as validation metric. According to these validation metrics, the model from the best-performing epoch is saved in each case.
Experimental Results
A. Biometric Verification
The results of the experiments are reported in Table 3. The table is divided into two parts, each one corresponding to one scenario: desktop and mobile. Each half can be further divided into the two cases considered: results obtained in the global genuine and impostor distributions (see Sec. III), and results considering the mean values obtained for per-subject genuine and impostor distributions. Each row shows a different system, TypeNet or TypeFormer, trained on a different set of input features, according to the 4 experiments described in Sec. VI-A, while the different metrics are reported along the columns.
By observing the overall trends in the table, it is possible to notice that in the desktop case higher verification results can be achieved, possibly due to a more constrained acquisition scenario, as, in contrast to mobile devices, subjects are more likely to be sitting down and in a still position while typing on a desktop keyboard. In fact, as an example, considering both TypeNet and TypeFormer with all possible sets of features (8 experiments in total), the mean value of all EERs obtained from the global distributions in the desktop scenario is 10.53%, whereas the corresponding value obtained in the mobile case is 13.11%. The trend is consistent if we analyze the other metrics, such as the FNMR @1% FMR (60.47% vs 72.95%) or the AUC (95.73% vs 93.99%). Furthermore, the desktop case also performs better in the case of mean per-subject distributions. In this case, we obtain a mean EER of 5.67% in the desktop case vs 7.61% in the mobile case. Similarly, the mean AUC is 97.70% vs 96.59%, and the mean rank-1 is 75.92% vs 68.08%, respectively for desktop and mobile devices.
Focusing on the four experiments involving different sets of input features, it is possible to draw some interesting conclusions. The best performing system for the desktop scenario is TypeNet, which is affected by the lack of the spatial information given by the ASCII code, and the extended set of features is not able to compensate for it (6.76% EER for TypeNet 5F vs 8.95% EER for TypeNet 10F for global distributions). Nevertheless, despite the performance decrease, the system still shows competitive performance against the Transformer-based TypeFormer 5F or TypeFormer 10F (respectively 12.95% EER and 12.75% EER for global distributions). Moreover, these tendencies are consistent considering mean per-subject distributions.
The opposite outcome can be observed in the case of TypeFormer, which is the best performing system in the mobile case: the model trained with the extended set of features (10F) proves to be the best performing system in the mobile scenario (9.45% EER vs the second best, 5F, with 10.17% EER for global distributions). Consequently, by introducing temporally-deepened input features, not only has it been possible to limit the performance decrease towards a more privacy-preserving verification system, but the verification performance is even significantly improved. In this case, it must be specified that the number of input features must be a multiple of the number of heads in the attention mechanism; consequently, we considered 5 heads for 5F and 10F, 4 for 4F, and 1 for 11F due to memory constraints. In the case of TypeFormer, it is also interesting to point out that the second best performing model corresponds to experiment 5F, with the initial set of features.
In addition, it is noticeable that subject-specific distributions lead to better verification performance in all cases. In fact, the system benefits from gaining more flexibility by setting a different threshold per subject. It is worth highlighting that this is realistic for security systems based on KD: in a real-life use case, once the system is deployed and the subject identity is verified in some other way, it is not hard to acquire multiple samples per subject. All these subject-specific data could be used as enrolment data, building a complete behavioral profile of the subject and leading to even better performance, without necessarily having to train or fine-tune the system. This would in fact be a further possible step, which leaves additional margin for improvement [56].
Another interesting observation is how distinctly TypeNet performs better in the desktop case, while TypeFormer shows superiority in the mobile case. TypeNet is based on a two-layer LSTM RNN, while TypeFormer is based on a Transformer architecture, composed of several modules and more parameters. As an example, in the desktop case the average TypeNet performance in terms of EER, FNMR @1% FMR, and AUC, over the four input feature experiments, is respectively 7.89%, 45.83%, and 97.48% for global distributions, and 3.44% (EER), 98.96% (AUC), and 86.59% (rank-1) for mean per-subject distributions. In each case, the corresponding values achieved by TypeFormer are 13.17%, 75.10%, 93.98% (global distributions), and 7.89%, 96.44%, 65.25% (mean per-subject distributions), showing a significant gap. These trends are clearly opposite in the mobile case, where TypeFormer shows 11.32%, 72.92%, 95.07% in the case of global distributions and 6.75%, 97.02%, 69.09% for the mean per-subject distributions. In the corresponding experiments, TypeNet achieves 14.91%, 72.98%, 92.91% for global distributions and 8.48%, 96.16%, 67.07% for the mean per-subject distributions. These trends show that LSTM RNNs seem to model well the desktop environment, while the higher variability of mobile devices is better modelled by a Transformer, which, however, does not reach the same level of performance in absolute terms.
In addition, in the case of the global distributions, the FNMR @10% FMR is also reported. This represents a more relaxed approach, as the threshold selected is less stringent (90% of the impostors are rejected, against 99% for the FNMR @1% FMR). According to both these metrics, TypeNet in the desktop case achieves significantly better results in comparison with all the other configurations, and its gap with TypeFormer is much greater than in the mobile case, where the performance of the two systems is closer. These metrics are not computed for the case of mean per-subject distributions, as there are not enough scores per subject for a sufficiently fine threshold resolution. They are substituted by the rank-1 (not determinable for global distributions, as it is subject-dependent), which also shows a more regular behavior of TypeNet in the desktop case in comparison with all other configurations.
Fig. 3 and Fig. 4 show the curves described in Sec. IV-B. In particular, it is possible to carry out a direct comparison of TypeNet in the desktop case and TypeFormer in the mobile case considering two of the four input feature sets presented above (5F vs 10F). From left to right, the histograms of the genuine and impostor distributions are reported in Fig. 3 and 4 (a). It is possible to see that in both rows the genuine distributions are more separated in the case of 5F, while for the mobile case (TypeFormer), the 10F setup is able to create a better separation of impostors, leading to a lower EER value. The threshold corresponding to the EER value is marked by the black line. From these macroscopic trends, the difference between the two impostor scenarios (“similar impostors” and “dissimilar impostors”, for comparisons between subjects of the same and different demographic groups, respectively) is not very pronounced. In all graphs, the threshold values are reported in black. Then, Fig. 3 and 4 (b) represent the DET curves (the thresholds corresponding to 1% and 10% FMR are marked by the dashed red lines), while Fig. 3 and 4 (c) report the ROC curves, showing similar trends from different perspectives.
Fig. 3. The graphs show a comparison between TypeNet 5F and TypeNet 10F in the desktop case. (a) shows the score histograms, (b) the DET curve, and (c) the ROC curve.
Fig. 4. The graphs show a comparison between TypeFormer 5F and TypeFormer 10F in the mobile case. (a) shows the score histograms, (b) the DET curve, and (c) the ROC curve.
B. Fairness Evaluation
Table 4 shows the performance in terms of accuracy based on the global EER threshold (%) considering different demographic groups. Age groups are placed along the rows, while genders are along the columns. As an example, from all experiments we take into consideration TypeNet in the desktop case, and TypeFormer in the mobile case, based on the same set of features (“5F”). In both cases, it is possible to see that males achieved higher values (93.13% against 92.49% accuracy at the global EER threshold for TypeNet, and 89.80% against 88.55% for TypeFormer, respectively for males and females). It is interesting to notice that, although the STD and SER values are considerably smaller in the desktop case, the difference in error rates across genders is still significant. Formulating a hypothesis that would explain this trend is not immediate, and is outside the scope of the current work. Furthermore, by analyzing the behavior of the systems considering different age groups, it is possible to notice that TypeFormer performs worse for younger subjects than for older ones, while this trend is not as evident for TypeNet. Such a discrepancy could be due to cultural differences related to the degree of ease and comfort in interacting with mobile devices across different generations.
Table 5 shows all results of the fairness assessment provided throughout the proposed framework. Along the rows, the table is divided into two parts: the upper part presents the results in the desktop scenario, while the lower part is focused on the mobile scenario. Each half is further divided into two sections, each reporting results of one of the two models considered, TypeNet and TypeFormer, considering the four experimental configurations presented in Sec. VI-A. Concerning the different metrics, it is necessary to point out that all metrics except the SIR are calculated considering 12 demographic groups, resulting from 6 age groups and 2 genders (see Sec. II). In contrast, SIRa is computed considering scores divided by age only (square matrix of dimension 6), while SIRg is computed considering gender only (square matrix of dimension 2), to keep the assessments of the two attributes independent (Sec. IV).
By observing Table 5, it is possible to notice that there is no clear superiority of one model, as there was in the case of the verification performance (TypeNet 5F for desktop devices, TypeFormer 10F for mobile devices). By carrying out an overall comparison of the desktop and mobile scenarios, it is possible to observe that demographic differentials related to age and gender tend to be higher in the mobile case. In fact, considering the mean values for each one of the metrics computed over all desktop against all mobile experiments, we obtain STD 0.77% vs 1.79%, SER 1.032 vs 1.07, FDR 95.90 vs 95.06, IR 1.36 vs 2.44, GARBE 0.06 vs 0.14, and correspondingly higher SIR values in the mobile case.
If we consider a comparison of the two architectures in each scenario, we can see that in the desktop case the average performance achieved by TypeNet in terms of SIRa is 2.80% and SIRg is 2.16%, while the scores of TypeFormer are respectively 3.06% and 2.53%, being more biased as well as less effective. A similar trend is reported for the mobile case, with TypeNet (although with lower verification performance) achieving SIRa=5.22% and SIRg=6.42%, against SIRa=6.84% and SIRg=6.83% achieved by TypeFormer.
Finally, having a look at trends due to different sets of input features within the same scenario and same model configuration, we can observe that the results are more irregular and they do not show a clear overall trend as in the previous cases. Nevertheless, considering for instance the SIR values, we can observe that generally the ASCII code is associated with higher bias in almost all experiments and scenarios, showing that comparisons among the same demographic groups are slightly harder than among different groups.
Conclusion and Future Work
In this article, an open experimental framework to benchmark keystroke dynamics for biometric verification is provided to the research community, to alleviate the heterogeneity of the experimental protocols and metrics and the limited size of the databases adopted in the literature. The framework is provided in the form of the Keystroke Verification Challenge (KVC),11 hosted on CodaLab,12 and held within the 2023 IEEE International Conference on Big Data,13 Sorrento, Italy, December 2023.
Finally, we make a first use of the proposed framework by employing two recent state-of-the-art keystroke biometric systems, TypeNet [11] and TypeFormer [16]. Besides a direct comparison of the two, which shows the superiority of the former in the desktop scenario and of the latter in the mobile one, we consider different sets of input features towards a more privacy-preserving keystroke verification system. The proposed solution is based on discarding the ASCII code, which reveals the text content, in favor of extended features in the time domain. We analyze four different experimental configurations that focus on deepening the temporal information provided to the models, in order to compensate for the removal of the spatial information due to the ASCII code, which is utilized to learn the relation between the location of the specific key on the keyboard layout and the corresponding time dynamics. Our experimental results show that such an approach makes it possible to maintain satisfactory performance in the desktop scenario, and even to improve it for mobile devices.
Concerning future work, the next directions of research will go toward the optimisation of the model architectures to improve the recognition performance and to reduce bias. More sophisticated training approaches will also be investigated, e.g., loss functions based on the selection of hard comparisons and adaptive margins [53], [57], specifically designed for the case of behavioral biometrics such as KD. Moreover, approaches based on the generation of synthetic subject-specific data will be considered to assess the suitability of such techniques for the problem of behavioral biometrics-based verification. To this end, the KVC represents a dedicated and unified test bench for the entire biometric research community. By doing so, we aim to foster the design of innovative solutions that achieve improved performance in comparison with existing ones, which are benchmarked here.
Finally, the results of our contributed benchmark KVC in terms of demographic attribute assessment also enable further large-scale studies focused on examining the differences in subjects’ typing behavior due to biological, cultural, or linguistic factors. In this sense, new findings might be of great interest for several branches of the Human-Computer Interaction (HCI) community, e.g. privacy protection [14], security of minors online [58], user experience improvement [59], etc.