Introduction
A. Keystroke Dynamics for Biometric Recognition
Keystroke Dynamics (KD) refers to the typing behavior of human subjects. It is commonly regarded as a behavioral biometric trait, similar to voice [1], signature [2], [3], gait [4], [5], [6], touch gestures [7], [8], etc. In comparison with physiological counterparts such as face or fingerprint, behavioral biometrics represent a more challenging technical problem in terms of recognition performance, as they are in general characterized by higher intra-user variability and lower inter-user variability. These challenges are magnified when dealing with real-life applications that have up to millions of subjects. Nevertheless, behavioral traits offer the advantage of verifying identities transparently, improving both security and usability [9], since they spare users from having to actively carry out a specific verification procedure (such as having their fingerprint scanned, or typing a password).
In particular, the deployment of keystroke dynamics verification systems is also economical, as there is no need for additional hardware. Potential applications span from verifying the subject's identity while they write an email or take a test on online educational platforms (free-text format) [10], to identifying malicious users across multiple accounts based on their typing style (free-text format) [11], or adding a biometric security layer on top of a traditional knowledge-based password (fixed-text format) [12], etc. These aspects have prompted several companies to develop commercial solutions to enhance the security of users through KD [11].
A coarse classification of KD can take place according to two criteria: (i) the type of acquisition device (keyboard): desktop or mobile. Due to differences in the pose or activity of typing subjects, more variability is commonly associated with mobile touchscreens in comparison with desktop keyboards; and (ii) the text format, which can be free, fixed, or transcript. In the first case, the text typed is not the same across different samples: consequently, data are much sparser, more unstructured, and present a higher rate of typing errors compared to the fixed-text case, which aims to represent, for instance, the case of an intruder typing the password of the victim. Finally, transcript text can be defined as a hybrid format, as the subjects are asked to read, memorize, and type a text that is presented to them.
In its simplest form, keystroke dynamics are captured as discrete time instants: the time instants a key is pressed and released (for instance in Unix time format), accompanied by the code (ASCII) of the key pressed. More complex features can be extracted from these raw data. In particular, the ASCII codes are useful for learning relations between time and spatial distributions over the keyboard layout. Nevertheless, although handled in compliance with sensitive data protection regulations [13], they inevitably reveal the content of the text, putting at risk the privacy of the subjects [14], [15]. Other information such as the amount of pressure on the key or the size of the fingertip might be available depending on the specific hardware capabilities.
B. Limitations of Existing Evaluation Methodologies
In the field of keystroke biometrics, a typical obstacle to research advancement is the heterogeneity of databases, experimental protocols, and metrics. In Table 1, some of the most important public keystroke dynamics databases are reported in chronological order. Although the literature on keystroke biometrics is extensive, to the best of our knowledge, except for very few cases [11], [16], previous systems have mostly been evaluated with up to several hundred subjects only, which does not represent well the challenges that massive-usage applications face. In addition, most research works are mainly focused on desktop and fixed-text scenarios. Therefore, keystroke dynamics can still be considered a biometric modality at an early stage, especially for mobile devices. In fact, for mechanical keyboards of desktop computers, more in-depth evaluations have been conducted and commercial applications have been proposed [17]. Moreover, even when using the same databases, different systems proposed in the literature over the years have often been developed based on different subsets of users for development and evaluation, numbers of enrolment sessions, and metrics, hindering direct comparisons. In contrast, we propose a clearly defined experimental protocol based on the same realistic use cases (Sec. VI), which can be easily adopted by researchers and practitioners of the field using the provided comparison files (Sec. II).
C. Fairness Considerations
Moreover, in the context of decision-making, algorithms are vulnerable to biases that render their decisions “unfair” [28], [29]. In this context, fairness is defined as the absence of any prejudice or favoritism toward an individual or group based on their inherent or acquired characteristics. In recent years, numerous studies have highlighted the existence of biases in biometric systems with regard to categories such as age, gender, and ethnicity [30], leading to worse decisions that affect specific demographic groups. In addition, these aspects are also relevant from the point of view of the privacy of users, as the existence of bias due to sensitive attributes often implies that the sensitive attributes themselves might be embedded in the learned representations [31], [32]. Therefore, the risk of leakage of some soft-biometric1 information about the subjects should be assessed as well. In general terms, the existence of bias or privacy leakage in biometric systems presupposes specific patterns in the input data associated with different demographic groups. For instance, in face biometrics, the existence of biological differences between different genders, ages, or ethnic groups is a trivial hypothesis that does not need a formal demonstration. However, for many biometric modalities, including KD, it is not straightforward to make similar assumptions. Nevertheless, for KD, several studies have evaluated the predictability of gender [35], [36], age [37], [38], both [39], [40], and even emotions [41] and mother tongue [42]. In light of this, in this article we propose an experimental framework designed to highlight potential gender and age biases in the scores, which are still mainly unexplored aspects for KD at such a large scale. Within the current work, the focus is limited to age and gender because other potential sources of bias (such as the subjects' mother tongue, the device used, or the degree of familiarity of the subjects with keyboards) were not reported for most subjects in the raw databases.
D. Contributions
In brief, the main contributions of this article can be summarized as follows:
We propose a novel experimental framework to benchmark KD for biometric verification, which, to the best of our knowledge, is still lacking in this field. The framework is provided in the form of the Keystroke Verification Challenge (KVC),2 hosted on CodaLab.3 The CodaLab platform returns several metrics (Sec. III) that quantify the recognition performance as well as the fairness of biometric systems. To create the framework, we consider two of the largest public databases of keystroke dynamics to date, the Aalto Desktop [25] and Mobile [26] Keystroke Databases, extracting datasets that guarantee a minimum amount of data per subject, age and gender annotations, and the absence of corrupted data, and that avoid overly unbalanced subject distributions with respect to the considered demographic attributes.
We illustrate the main aspects of the proposed framework by considering two recent state-of-the-art keystroke biometric systems, TypeNet [11] and TypeFormer [16], [43]. To this end, we propose a thorough analysis considering four different sets of features (Sec. VI-A) towards more privacy-preserving biometric systems that do not require the ASCII code, which would reveal the text content, as an input feature. Our experiments show that by removing the spatial information of the key location on the keyboard layout (ASCII code) in favor of additional features in the time domain, an acceptable level of performance is maintained.
A comparative analysis of keystroke dynamics verification systems in desktop and mobile scenarios is provided (Sec. VII-A).
We propose a new metric, the Skewed Impostor Ratio (SIR), useful to quantify how much harder a pairwise comparison between subjects belonging to the same demographic group is for the classifier, in relation to comparisons between subjects belonging to different groups.
The remainder of the article is organized as follows: first, the resources provided within the proposed experimental framework are described (Sec. II). Then, Sec. III includes a detailed presentation of the evaluation protocol of the experimental framework and challenge, whereas Sec. IV presents the metrics adopted, including the definition of the SIR, a novel metric proposed in this article. Sec. V provides an overview of the two biometric systems, TypeNet [11] and TypeFormer [16], utilized to validate the framework, followed by Sec. VI, in which the set of experiments for privacy enhancement is illustrated. Finally, Sec. VII and Sec. VIII contain the analysis of the results obtained and the concluding remarks of the article, respectively.
Resources Provided
The proposed experimental framework is based on the two most complete and large-scale public databases of free-text keystroke dynamics to date, collected by the User Interfaces4 group of Aalto University (Finland). The two databases were collected respectively in a desktop5 [25] and mobile6 [26] acquisition environment, and include around 168,000 and 60,000 subjects, respectively, thus representing well the typical challenges related to massive application usage. Each of the acquisition sessions contains a sentence of transcript text (variable content, but not fully free-text). The data were captured through a web application in an unsupervised way under realistic scenarios. Subjects were asked to read, memorize, and type on their device English sentences that were randomly selected from a set of 1,525 sentences. Subject metadata such as age and gender are self-reported during the data acquisition.
The two databases have been processed to arrange the data in a convenient format for the analysis of KD. The raw data acquired consist of the timestamp of the instant a key is pressed, the timestamp of the instant the key is released, and the key ASCII code. After discarding some of the subject data due to insufficient acquisition sessions per subject (fewer than 15 per subject), the two databases as downloaded have been rearranged to form four datasets:
Desktop Dataset:
Development set: 115,120 subjects, provided in a single .npy file that contains a Python nested dictionary (subject IDs: session IDs: data). Average session length: 48.65 (σ = 18.50) characters typed.
Evaluation set: data from 15,000 subjects, provided in a single .npy file that contains a shallow Python dictionary (session IDs: data). Average session length: 48.77 (σ = 18.64) characters typed.
Mobile Dataset:
Development set: 40,639 subjects, provided in a single .npy file that contains a Python nested dictionary (subject IDs: session IDs: data). Average session length: 48.59 (σ = 21.84) characters typed.
Evaluation set: data from 5,000 subjects, provided in a single .npy file that contains a shallow Python dictionary (session IDs: data). Average session length: 47.98 (σ = 20.93) characters typed.
The proposed experimental framework follows an open-set learning protocol; in other words, the subjects in the development and evaluation sets are different.7 To quantify the variability of the results, the evaluation comparisons (Sec. III) were divided in 10 random subsets. Then, we computed the global EER (Sec. IV) for each of the random subsets, to provide mean and standard deviation. As an example, we report the following values for TypeFormer 5F (Sec. VI-A):
Table 2 shows the demographic distribution of the datasets provided in the KVC. The subjects have been divided into six age groups (10 - 13, 14 - 17, 18 - 26, 27 - 35, 36 - 44, 45 - 79). The evaluation sets are balanced with respect to gender. The gender and age labels of the development set are released together with the data.
The evaluation sets are separated by scenario (desktop and mobile), and they are provided in the form of two shallow Python dictionaries containing independent sessions. Such data are accompanied by the respective lists of pairwise comparisons to be carried out. Two Python script files are provided to load the data, and run the comparisons, generating a text file with the scores of each comparison, ready to be submitted to CodaLab for scoring. To push forward the state of the art and deepen the knowledge on the topic, the proposed protocol is designed for researchers working on KD as a novel tool to evaluate different approaches (pre-processing of input features, classifier architectures, learning approaches, etc.) for different goals (biometric recognition and fairness improvement) under the same experimental conditions, considering various metrics.
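As an illustration of how the provided resources fit together, the following sketch outlines how the evaluation data and comparison lists described above could be loaded and scored. The file names, the comparison-file format, and the similarity() placeholder are hypothetical; the actual loading and scoring code is the one distributed with the KVC.

```python
import numpy as np

# Minimal sketch of how the evaluation data and comparison lists could be used,
# assuming the shallow-dictionary layout described above. File names, the
# comparison-file format, and the similarity() placeholder are hypothetical.
sessions = np.load("desktop_evaluation_set.npy", allow_pickle=True).item()
# sessions: {session_id: array of (press_time, release_time, ascii_code) rows}

def similarity(session_a, session_b):
    # Placeholder for the model-based 1vs1 comparison (see Sec. V).
    raise NotImplementedError

with open("comparisons_desktop.txt") as f_in, open("scores.txt", "w") as f_out:
    for line in f_in:
        id_a, id_b = line.split()                 # one pairwise comparison per line
        score = similarity(sessions[id_a], sessions[id_b])
        f_out.write(f"{score:.6f}\n")             # one normalized score per comparison
```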
Evaluation Description
The design and the implementation of the evaluation protocol described in this section represent a significant novel aspect proposed in the current work.
The two tasks (desktop and mobile) are structured similarly, and they are designed for a biometric verification protocol. In other words, a score between 0 and 1 related to a single comparison of two biometric samples will be produced (1: same identity, 0: different identities). It is a binary classification problem, as it is not necessary to ascertain to which identity a specific biometric sample belongs (identification). In this experimental framework, a biometric sample corresponds to an acquisition session.
The total number of 1 vs 1 session-level comparisons is as follows:
Task 1 (Desktop): 2,250,000 comparisons, involving 15,000 subjects not included in the development set.
Task 2 (Mobile): 750,000 comparisons, involving 5,000 subjects not included in the development set.
The design of the comparisons is illustrated in Fig. 1. For each subject, there are 5 enrolment sessions and 10 verification sessions, leading to 50 1vs1 comparisons, which are averaged over the 5 enrolment sessions, generating 10 genuine scores per subject. In a similar manner, 20 impostor scores per subject are generated. The impostor sessions are divided into two groups: 10 similar impostor scores, for which the verification sessions are randomly selected from subjects belonging to the same demographic group (same gender and age); and 10 dissimilar impostor scores, in which the verification sessions are all randomly selected from subjects of different gender and age intervals (Sec. IV).
Fig. 1. Each one of the verification sessions is compared with each of the enrolment sessions. For easier comprehension, examples of faces showing gender and age are included instead of keystroke examples. Then, the scores generated are averaged over the enrolment sessions, leading to three distributions: genuine (green), similar impostor (orange), and dissimilar impostor (red).
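For clarity, the comparison design just described can be expressed programmatically. The sketch below assumes a generic 1vs1 scoring function and shows how the 5 enrolment sessions and the 30 verification sessions of a subject yield 10 genuine, 10 similar-impostor, and 10 dissimilar-impostor scores; names and structure are illustrative.

```python
import numpy as np

def subject_score_distributions(score_fn, enrol, genuine, similar_imp, dissimilar_imp):
    """Aggregate 1vs1 session scores per subject, following the design of Fig. 1.

    enrol: list of 5 enrolment sessions of the subject.
    genuine / similar_imp / dissimilar_imp: lists of 10 verification sessions each.
    Each verification session is compared against all 5 enrolment sessions, and
    the 5 resulting scores are averaged, yielding 10 scores per distribution.
    """
    def avg_over_enrolment(sessions):
        return [float(np.mean([score_fn(e, v) for e in enrol])) for v in sessions]

    return (avg_over_enrolment(genuine),         # 10 genuine scores
            avg_over_enrolment(similar_imp),     # 10 similar impostor scores
            avg_over_enrolment(dissimilar_imp))  # 10 dissimilar impostor scores
```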
Based on the described evaluation design, following [11] and [16], we consider two cases for evaluating the system:
Global distributions: this case corresponds to dividing all scores into two groups, genuine and impostor scores, regardless of which subject they belong to. This case corresponds to having a fixed, pre-determined threshold, implying a simpler deployment of the biometric system. In order to assess the performance of the biometric system, this choice means setting one single threshold for all comparisons to obtain a decision.
Mean per-subject distributions: the optimal threshold is computed at subject level, considering the 30 verification scores described above. This choice corresponds to providing the system with more flexibility, so that it can adapt to user-specific distributions [44], [45]. In a real-life use case, this would require processing the subject's enrolment samples to establish a threshold, and it can be done as follows: acquiring several enrolment samples, from which a genuine subject-specific score distribution is derived by considering pairwise comparisons between enrolment samples; considering a pool of samples from different subjects, from which an impostor subject-specific score distribution is derived by considering pairwise comparisons with the genuine enrolment samples; and computing a subject-specific threshold based on the two distributions. It is important to highlight that this does not require re-training or fine-tuning the biometric system using subject-specific data. Then, all metrics computed per subject are averaged over all subjects in the evaluation set to obtain the values displayed. Generally, the verification performance of the system benefits from considering a different threshold per user.
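As a minimal sketch of the per-subject case (assuming, as in Sec. V, normalized similarity scores in [0, 1], with genuine scores expected to be high and impostor scores low), a subject-specific EER and threshold could be computed as follows; the coarse threshold sweep is illustrative only.

```python
import numpy as np

def per_subject_eer(genuine_scores, impostor_scores):
    """EER and threshold for one subject from its 10 genuine and 20 impostor scores."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    best_gap, eer, threshold = np.inf, 1.0, 0.5
    for t in np.sort(np.concatenate([genuine, impostor])):
        fnmr = np.mean(genuine < t)      # genuine attempts rejected at threshold t
        fmr = np.mean(impostor >= t)     # impostor attempts accepted at threshold t
        if abs(fmr - fnmr) < best_gap:
            best_gap, eer, threshold = abs(fmr - fnmr), (fmr + fnmr) / 2, t
    return eer, threshold

# The mean per-subject EER is then the average of per_subject_eer(...)[0]
# over all subjects in the evaluation set.
```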
Metrics Adopted
Over the years, several metrics have been proposed for biometric verification. The common aspect of all metrics is that they are based on the (normalized) scores that are typically generated by pairwise comparisons of biometric data. However, comparing systems is often difficult if they are evaluated according to different metrics. Moreover, the attention of the scientific community has recently shifted towards the evaluation of the fairness of systems [46]. Consequently, based on the scores, we also provide an initial assessment of this important aspect which, to the best of our knowledge, is still unexplored for KD. An overview of all metrics considered is provided below. The scores will be computed considering global and mean per-subject distributions. All the presented metrics are returned by the KVC CodaLab scoring program, easily allowing experimental analyses of multiple aspects, such as different sets of input features for privacy enhancement (Sec. VII). To the best of our knowledge, these scenarios have not been proposed in previously existing literature.
A. Verification Metrics
In biometrics, a false match (FM) is defined as a comparison decision of a match for a biometric probe and a biometric reference that are from different biometric capture subjects, while a false non-match (FNM) is defined as a comparison decision of non-match for a biometric probe and a biometric reference that are from the same biometric capture subject and of the same biometric characteristic. The rates respectively associated with these events, the FMR (FNMR), correspond to the proportion of the completed biometric non-mated (mated) comparison trials that result in a false match (false non-match) [47].
1) Equal Error Rate (EER)
The EER describes the point at which the FMR and FNMR curves intersect. The two rates typically have opposite trends with respect to the threshold setting (in the case of genuine scores closer to 1 and impostor scores closer to 0, as the threshold of a biometric system increases, the FMR will drop and the FNMR will rise). On the DET curve (Sec. IV-B), which is the plot of the FNMR against the FMR at various threshold settings, the EER corresponds to the point where FMR = FNMR.
2) False Non-Match Rate at X% False Match Rate (FNMR @ X% FMR)
We consider the FNMR at a fixed FMR of X = 1% (and, for the global distributions, also X = 10%), i.e., the proportion of genuine attempts rejected when the decision threshold is set so that X% of impostor attempts are falsely accepted.
3) Area Under the Receiver Operating Characteristic (ROC) Curve (AUC)
The ROC curve (Sec. IV-B) is the plot of the TMR (True Match Rate) against FMR, at various threshold settings. A true match corresponds to the case of a genuine subject recognized as such. By definition, the TMR and the FNMR sum to 1. A perfect classifier has an Area Under the ROC Curve (AUC) of 1.
4) Accuracy
The accuracy is computed as the fraction of correctly classified attempts at a given discrimination threshold (in this framework, the threshold corresponding to the global EER).
5) Rank-n
This metric concerns the identification of subjects (i.e., 1-to-many comparisons), therefore assessing a different scenario from the previous metrics, which refer to the case of verification (i.e., binary classification). Starting from the comparison of the biometric enrolment samples of a subject with verification samples from multiple subjects, the candidate identities are ranked by comparison score; rank-n corresponds to the proportion of cases in which the genuine subject appears among the n highest-ranked candidates. In this framework, rank-1 is reported.
B. Curves
1) Score Histograms
They are computed considering the global genuine and impostor distributions. Ideally, there is a clear separation between the two, with the genuine distribution shifted toward 1 and the impostor one toward 0. A smaller overlap of the tails corresponds to a better performance of the system.
2) Detection Error Trade-Off (DET) Curve
It is the plot of the FNMR against the FMR at various threshold settings, typically on a non-linear scale. As the threshold decreases, the number of false matches (impostor subjects classified as genuine) increases, and the number of false non-matches (genuine subjects classified as impostors) decreases. The closer the DET curve is to the bottom left corner, the better the biometric system.
3) ROC Curve
It is the plot of the TMR against FMR, at various threshold settings.
C. Fairness Metrics
1) Standard Deviation (STD) of EER By Demographic Group
It considers the demographic differential assessment by calculating the standard deviation of the accuracy across all demographic groups at a given discrimination threshold (here, the threshold corresponding to the global EER).
2) Skewed Error Ratio (SER) of EER By Demographic Group
Skewness is a measure of the asymmetry of a distribution. Similarly to the STD, SER is computed across demographic subsets as the ratio between the greatest and smallest error scores. It mainly represents the difference between the sensitive attribute with the best and worst performance. The larger the value, the greater the difference in the algorithm’s discrimination towards a certain attribute. The optimal SER value is 1.
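A compact formulation consistent with this description, where EER_d denotes the EER of demographic group d, is:\begin{equation*} \textrm {SER} = \frac {\max _{d} \textrm {EER}_{d}}{\min _{d} \textrm {EER}_{d}}\end{equation*}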
3) Fairness Discrepancy Rate (FDR)
It was proposed in [48]. It considers the FMR and FNMR trade-off in the demographic differential assessment by calculating the maximum difference in FMR and FNMR performance between any two demographic groups, at a given discrimination threshold.
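For reference, the formulation proposed in [48] can be written as follows (notation adapted to this article), where D denotes the set of demographic groups and α ∈ [0, 1] weights the two differentials:\begin{align*} \textrm {FDR}(\tau) &= 1 - \left ({\alpha A(\tau) + (1-\alpha) B(\tau) }\right), \\ A(\tau) &= \max _{d_{i}, d_{j} \in D} \left |{ \textrm {FMR}_{d_{i}}(\tau) - \textrm {FMR}_{d_{j}}(\tau) }\right |, \\ B(\tau) &= \max _{d_{i}, d_{j} \in D} \left |{ \textrm {FNMR}_{d_{i}}(\tau) - \textrm {FNMR}_{d_{j}}(\tau) }\right |\end{align*}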
4) Inequity Rate (IR)
It was proposed in [49]. It is computed considering the ratios between the maximum and minimum FMR and FNMR across demographic groups, at a given discrimination threshold.
5) GINI Aggregation Rate for Biometric Equitability (GARBE)
It was proposed in [50] to overcome the limitations of the FDR and the IR. In fact, the former does not scale the values of FMR and FNMR to the same order of magnitude, whereas the latter has no theoretical upper bound and may have a denominator equal to zero. GARBE is inspired by the mathematics of the Gini coefficient, computed for x ∈ {FMR, FNMR} as:\begin{equation*} G_{x}(\tau) = \left ({\frac {n}{n-1} }\right) \left ({\frac {\sum _{i=1}^{n}\sum _{j=1}^{n} |r_{i}-r_{j}|}{2 n^{2}\bar {r}} }\right)\end{equation*} where n is the number of demographic groups, r_i is the error rate (FMR or FNMR) of the i-th group, and \bar{r} is the mean error rate across groups.
\begin{equation*} \textrm {GARBE}(\tau) = \alpha G_{\textrm {FMR}}(\tau) + (1-\alpha) G_{\textrm {FNMR}}(\tau)\end{equation*}
6) Skewed Impostor Ratio (SIR)
This is a novel metric proposed in this article. Normalized impostor scores are grouped according to a specific attribute (age or gender). For instance, considering age, it is possible to group the comparisons as follows: ‘10-13 vs 10-13’, ‘10-13 vs 14-17’, ‘10-13 vs 18-26’, and so on, considering all combinations. For each score group combination, the average value is taken. Then, all values can be arranged in a matrix, which is symmetric. The elements on the main diagonal represent comparisons between different subjects belonging to the same age or gender group (‘10-13 vs 10-13’, ‘14-17 vs 14-17’, etc.), while all other elements are obtained from the remaining cross-group comparisons. The ratio between the mean value of the elements on the main diagonal and that of the remaining non-duplicated elements is finally computed as a percentage. This value expresses how much harder a comparison between different subjects belonging to the same demographic group is, in relation to subjects belonging to different groups, quantifying to which extent demographic information is retained in the scores. It can be formulated as follows:\begin{equation*} \textrm {SIR} = 100 \left ({\frac {\mu (s_{ii})}{\mu (s_{ij, i\neq j})}-1}\right), \quad i,j = 1, \ldots, n\end{equation*} where s_{ij} denotes the mean normalized impostor score for comparisons between group i and group j, \mu(\cdot) denotes the average, and n is the number of groups. The SIR offers several advantages (a minimal computational sketch is provided after the following list):
It is not necessary to select a threshold value.
It focuses on both intra-group and inter-group relations, highlighting the differences between the two cases. If the differences between the two cases are neither significant nor consistent, then the system is bias-free.
It is not necessary to have access to the system, which can be treated as a black box. It is sufficient to run a significant number of appropriately distributed comparisons.
By considering the entire matrix as described above, it is possible to focus on comparisons between specific groups, gaining valuable insights about the system and the similarities between demographic groups.
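The sketch below illustrates the SIR computation from the matrix of mean impostor scores; the matrix layout is assumed to be the one described in the definition above.

```python
import numpy as np

def skewed_impostor_ratio(score_matrix):
    """SIR from an n x n symmetric matrix of mean normalized impostor scores.

    score_matrix[i, j] is the average impostor score for comparisons between
    subjects of group i and subjects of group j (n = number of groups, e.g.,
    n = 6 for age, n = 2 for gender).
    """
    s = np.asarray(score_matrix, dtype=float)
    intra = np.mean(np.diag(s))                # same-group (main diagonal) comparisons
    upper = np.triu_indices(s.shape[0], k=1)   # non-duplicated cross-group entries
    inter = np.mean(s[upper])
    return 100.0 * (intra / inter - 1.0)
```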
Biometric Verification Systems
Throughout the proposed framework, we evaluate two recent state-of-the-art deep-learning models:
TypeNet (2021) [11]: a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN), trained with triplet loss. In this case, we consider input sequences of 150 characters typed. TypeNet is implemented in TensorFlow [51].
TypeFormer (2023) [16], [43]: a novel transformer architecture consisting of a temporal and a channel module enclosing two LSTM RNN layers, a Gaussian Range Encoding (GRE), a multi-head self-attention mechanism, and a block-recurrent transformer structure. TypeFormer is also trained with triplet loss. In this case, we consider input sequences of 50 characters typed. TypeFormer is implemented in PyTorch [52].
Both approaches utilize Distance Metric Learning (DML) [53]. The fundamental concept of DML involves training a model that transforms input data into a new feature space, enabling straightforward distances to be used for analyzing and leveraging the “semantic” arrangement of the input space [54]. A DML approach aims to establish a neighborhood structure in the feature space by considering the relationship between intra-class distances (distances among samples from the same class) and inter-class distances (distances among samples from different classes). In an ideal feature space, samples from the same class remain in close proximity, while samples from different classes are distinctly separated. Following this idea, the input sequences obtained from all sessions are transformed into feature embeddings, which are expected to have lower Euclidean distances if belonging to the same subject, and higher otherwise. In the test stage, the distances obtained from the comparisons of the feature embeddings corresponding to each of the test sessions are normalized, and then subtracted from 1 in order to transform them into similarity scores.
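A minimal sketch of this final scoring step is given below, assuming that the trained model has already produced one embedding per session; the max-based normalization is only one possible choice, as the text above specifies only that distances are normalized and subtracted from 1.

```python
import numpy as np

def similarity_scores(embeddings_a, embeddings_b):
    """Convert embedding distances into similarity scores in [0, 1].

    embeddings_a / embeddings_b: arrays of shape (num_comparisons, dim),
    one row per session embedding (e.g., produced by TypeNet or TypeFormer).
    """
    distances = np.linalg.norm(embeddings_a - embeddings_b, axis=1)  # Euclidean distances
    normalized = distances / np.max(distances)                       # normalize to [0, 1]
    return 1.0 - normalized                                          # high score = same subject
```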
Experimental Protocol
A. Evaluation of Privacy-Enhancing Input Features
The two biometric systems considered, TypeNet and TypeFormer, take as input the same set of features extracted from the raw data (Unix timestamps of the actions of pressing and releasing a key), which include:
(i) Hold Time (HT): time interval between the release and press instants of a given key, expressed in seconds.
(ii) Inter-Press Time (IPT): time interval between two consecutive press actions, expressed in seconds.
(iii) Inter-Release Time (IRT): time interval between two consecutive release actions, expressed in seconds.
(iv) Inter-Key Time (IKT): time interval between a release and the following press action, expressed in seconds.
(v) ASCII code (ASCII), normalized by dividing it by 255.
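A minimal sketch of this feature extraction is given below, assuming that each raw keystroke is available as a (press_time, release_time, ascii_code) triplet with times already expressed in seconds; the zero-padding of the first keystroke is an illustrative design choice.

```python
import numpy as np

def extract_5f(session):
    """Per-keystroke features [HT, IPT, IRT, IKT, ASCII/255] (experiment 5F).

    session: array of shape (num_keys, 3) with columns
    (press_time, release_time, ascii_code), times in seconds.
    """
    press, release, ascii_code = session[:, 0], session[:, 1], session[:, 2]
    ht = release - press                                     # Hold Time
    ipt = np.diff(press, prepend=press[0])                   # Inter-Press Time
    irt = np.diff(release, prepend=release[0])               # Inter-Release Time
    ikt = np.concatenate([[0.0], press[1:] - release[:-1]])  # Inter-Key Time
    return np.stack([ht, ipt, irt, ikt, ascii_code / 255.0], axis=1)
```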
Such features are graphically represented in black in Fig. 2. Input features (i)-(iv) are useful to capture the typing behavior of the user in the time domain, whereas the ASCII code (v) describes the spatial relations due to the location of the key pressed on the keyboard layout. However, although handled in compliance with sensitive data protection regulations [13], the acquisition and processing of the ASCII codes inevitably reveal the content of the text, putting at risk the privacy of the users. Consequently, in this experiment, we strive to remove the ASCII code information in order to make the system content-agnostic and, consequently, more privacy-preserving. The sets of experiments run can be summarized as follows:
First, to evaluate and compare TypeNet and TypeFormer, we consider their original set of features (the experiment is named 5F, where “F” stands for “features”), marked in black in Fig. 2.
Then, to quantify the importance of the ASCII code information, we remove the ASCII code information from the original set of features (experiment 4F).
We consider an extended set of time-domain features. Several studies have in fact shown the usefulness of considering groups of keys typed such as digraphs, trigraphs, and n-graphs [55]. By considering not only adjacent keys, but also groups of three keys (Fig. 2, in red and blue), we obtain a set of 10 features (experiment 10F):\begin{align*} [\textrm {HT}, \textrm {IPT}, \textrm {IRT}, \textrm {IKT}, \textrm {IPT2}, \textrm {IRT2}, \textrm {IKT2}, \\ \textrm {IPT3}, \textrm {IRT3}, \textrm {IKT3}]\end{align*}
We consider the extended set of time-domain features, together with the ASCII code (experiment 11F).
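A possible extension of the previous sketch to the 10-feature time-only set is shown below; interpreting the extended features as intervals between keystrokes that are two and three positions apart is an assumption made for illustration, since the exact definition follows Fig. 2.

```python
import numpy as np

def extract_10f(session):
    """Time-only features for experiment 10F (sketch).

    IPT2/IRT2/IKT2 and IPT3/IRT3/IKT3 are interpreted here as intervals between
    keystrokes that are two and three positions apart, respectively (assumption).
    """
    press, release = session[:, 0], session[:, 1]

    def lagged_interval(later, earlier, lag):
        out = np.zeros(len(press))
        out[lag:] = later[lag:] - earlier[:-lag]   # zero-padded for the first `lag` keys
        return out

    features = [release - press]                   # HT
    for lag in (1, 2, 3):                          # adjacent keys, two apart, three apart
        features.append(lagged_interval(press, press, lag))      # IPT, IPT2, IPT3
        features.append(lagged_interval(release, release, lag))  # IRT, IRT2, IRT3
        features.append(lagged_interval(press, release, lag))    # IKT, IKT2, IKT3
    return np.stack(features, axis=1)              # 10 features per keystroke
```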
A diagram representing the initial feature extraction process for a given time instant is provided in Fig. 2.
In each case, the deep learning models are trained from scratch.
B. Model Training
The training of both models takes place on the KVC development set considering settings identical to those described in their respective papers. The only differences are related to the division into training and validation sets. For TypeNet, we consider a subset of 400 subjects to validate the model at the end of each training epoch in terms of average EER per subject. This choice is justified by the experimental protocol followed in [11]. For TypeFormer, we consider an 80%-20% train-validation division of the KVC development set, and we adopt the global EER as validation metric. According to these validation metrics, the model from the best-performing epoch is saved in each case.
Experimental Results
A. Biometric Verification
The results of the experiments are reported in Table 3. The table is divided into two parts, each one corresponding to one scenario: desktop and mobile. Each half can be further divided into the two cases considered: results obtained in the global genuine and impostor distributions (see Sec. III), and results considering the mean values obtained for per-subject genuine and impostor distributions. Each row shows a different system, TypeNet or TypeFormer, trained on a different set of input features, according to the 4 experiments described in Sec. VI-A, while the different metrics are reported along the columns.
By observing the overall trends in the table, it is possible to notice that in the desktop case higher verification results can be achieved, possibly due to a more constrained acquisition scenario, as, in contrast to mobile devices, subjects are more likely to be sitting down and in a still position while typing on a desktop keyboard. In fact, as an example, considering both TypeNet and TypeFormer with all possible sets of features (8 experiments in total), the mean value of all EERs obtained from the global distributions in the desktop scenario is 10.53%, whereas the corresponding value obtained in the mobile case is 13.11%. The trend is consistent if we analyze the other metrics, such as the FNMR @1% FMR (60.47% vs 72.95%) or the AUC (95.73% vs 93.99%). Furthermore, the desktop case also performs better in the case of mean per-subject distributions. In this case, we obtain a mean EER of 5.67% in the desktop case vs 7.61% in the mobile case. Similarly, the mean AUC is 97.70% vs 96.59%, and the mean rank-1 is 75.92% vs 68.08%, respectively for desktop and mobile devices.
Focusing on the four experiments involving different sets of input features, it is possible to draw some interesting conclusions. The best performing system for the desktop scenario is TypeNet, which is affected by the lack of the spatial information given by the ASCII code, and the extended set of features is not able to compensate for it (6.76% EER for TypeNet 5F vs 8.95% EER for TypeNet 10F for global distributions). Nevertheless, despite the performance decrease, the system still shows competitive performance against the Transformer-based TypeFormer 5F or TypeFormer 10F (respectively 12.95% EER and 12.75% EER for global distributions). Moreover, these tendencies are consistent considering mean per-subject distributions.
The opposite outcome can be observed in the case of TypeFormer, which is the best performing system in the mobile case: the model trained with the extended set of features (10F) proves to be the best performing system in the mobile scenario (9.45% EER vs the second best, 5F, with 10.17% EER for global distributions). Consequently, by introducing temporally-deepened input features, not only has it been possible to limit the performance decrease towards a more privacy-preserving verification system, but the verification performance is even significantly improved. In this case, it must be specified that the number of input features must be a multiple of the number of heads in the attention mechanism; consequently, we considered 5 heads for 5F and 10F, 4 for 4F, and 1 for 11F due to memory constraints. In the case of TypeFormer, it is also interesting to point out that the second best performing model corresponds to experiment 5F, with the initial set of features.
In addition, it is noticeable that subject-specific distributions lead to better verification performance in all cases. In fact, the system benefits from gaining more flexibility by setting a different threshold per subject. It is worth highlighting that this is realistic for security systems based on KD: in a real-life use case, once the system is deployed and the subject identity is verified in some other way, it is not hard to acquire multiple samples per subject. All these subject-specific data could be used as enrolment data, building a complete behavioral profile of the subject and leading to even better performance, without necessarily having to train or fine-tune the system. This would in fact be a further possible step, which leaves additional margin for improvement [56].
Another interesting observation is how distinctly TypeNet performs better in the desktop case, while TypeFormer shows superiority in the mobile case. TypeNet is based on a two-layer LSTM RNN, while TypeFormer is based on a Transformer architecture, composed of several modules and more parameters. As an example, in the desktop case the average TypeNet performance in terms of EER, FNMR @1% FMR, and AUC, over the four input feature experiments, is respectively 7.89%, 45.83%, and 97.48% for global distributions, and 3.44% (EER), 98.96% (AUC), and 86.59% (rank-1) for mean per-subject distributions. In each case, the corresponding values achieved by TypeFormer are 13.17%, 75.10%, 93.98% (global distributions), and 7.89%, 96.44%, 65.25% (mean per-subject distributions), showing a significant gap. These trends are clearly opposite in the mobile case, where TypeFormer shows 11.32%, 72.92%, 95.07% in the case of global distributions and 6.75%, 97.02%, 69.09% for the mean per-subject distributions. In the corresponding experiments, TypeNet achieves 14.91%, 72.98%, 92.91% for global distributions and 8.48%, 96.16%, 67.07% for the mean per-subject distributions. These trends show that LSTM RNNs seem to model well the desktop environment, while the higher variability of mobile devices is better modelled by a Transformer, which, however, does not reach the same level of performance in absolute terms.
In addition, in the case of the global distributions, the FNMR @10% FMR is also reported. This represents a more relaxed approach, as the threshold selected is less stringent (90% of the impostors are rejected, against 99% for the FNMR @1% FMR). According to both these metrics, TypeNet in the desktop case achieves significantly better results in comparison with all the other configurations, and its gap with TypeFormer is much greater than in the mobile case, where the performance of the two systems is closer. These metrics are not computed for the case of mean per-subject distributions, as there are not enough scores per subject for a sufficiently fine threshold resolution. They are substituted by the rank-1 (not determinable for global distributions, as it is subject-dependent), which also shows a more regular behavior of TypeNet in the desktop case in comparison with all other configurations.
Fig. 3 and Fig. 4 show the curves described in Sec. IV-B. In particular, it is possible to carry out a direct comparison of TypeNet in the desktop case and TypeFormer in the mobile case considering two of the four input feature sets presented above (5F vs 10F). From left to right, the histograms of the genuine and impostor distributions are reported in Fig. 3 and 4 (a). It is possible to see that in both rows the genuine distributions are more separated in the case of 5F, while for the mobile case (TypeFormer), the 10F setup is able to create a better separation of impostors, leading to a lower EER value. The threshold corresponding to the EER value is marked by the black line. From these macroscopic trends, the difference between the two impostor scenarios (“similar impostors” and “dissimilar impostors”, for comparisons between subjects of the same and different demographic groups, respectively) is not very pronounced. In all graphs, the threshold values are reported in black. Then, Fig. 3 and 4 (b) represent the DET curves (the thresholds corresponding to 1% and 10% FMR are marked by the dashed red lines), while Fig. 3 and 4 (c) report the ROC curves, showing similar trends from different perspectives.
Fig. 3. The graphs show a comparison between TypeNet 5F and TypeNet 10F in the desktop case. (a) shows the score histograms, (b) the DET curve, and (c) the ROC curve.
Fig. 4. The graphs show a comparison between TypeFormer 5F and TypeFormer 10F in the mobile case. (a) shows the score histograms, (b) the DET curve, and (c) the ROC curve.
B. Fairness Evaluation
Table 4 shows the performance in terms of accuracy based on the global EER threshold (%) considering different demographic groups. Age groups are placed along the rows, while genders are along the columns. As an example, from all experiments we take into consideration TypeNet in the desktop case, and TypeFormer in the mobile case, based on the same set of features (“5F”). In both cases, it is possible to see that males achieved higher values (93.13% against 92.49% accuracy at the global EER threshold for TypeNet, and 89.80% against 88.55% for TypeFormer, respectively for males and females). It is interesting to notice that, although the STD and SER values are considerably smaller in the desktop case, the difference in error rates across genders is still significant. Formulating a hypothesis that would explain this trend is not immediate, and is outside the scope of the current work. Furthermore, by analyzing the behavior of the systems considering different age groups, it is possible to notice that TypeFormer performs worse for younger subjects than for older ones, while this trend is not as evident for TypeNet. Such a discrepancy could be due to cultural differences related to the degree of ease and comfort in interacting with mobile devices across different generations.
Table 5 shows all results of the fairness assessment provided throughout the proposed framework. Along the rows, the table is divided into two parts: the upper part presents the results in the desktop scenario, while the lower part is focused on the mobile scenario. Each half is further divided into two sections, each reporting results of one of the two models considered, TypeNet and TypeFormer, considering the four experimental configurations presented in Sec. VI-A. Concerning the different metrics, it is necessary to point out that all metrics except the SIR are calculated considering 12 demographic groups, resulting from 6 age groups and 2 genders (see Sec. II). In contrast, SIRa is computed considering scores divided by age only (square matrix of dimension 6), while SIRg is computed considering gender only (square matrix of dimension 2), to keep the assessments of the two attributes independent (Sec. IV).
By observing Table 5, it is possible to notice that there is no clear superiority of one model, as there was in the case of the verification performance (TypeNet 5F for desktop devices, TypeFormer 10F for mobile devices). By carrying out an overall comparison of the desktop and mobile scenarios, it is possible to observe that demographic differentials related to age and gender tend to be higher in the mobile case. In fact, considering the mean values for each one of the metrics computed over all desktop against all mobile experiments, we obtain STD 0.77% vs 1.79%, SER 1.032 vs 1.07, FDR 95.90 vs 95.06, IR 1.36 vs 2.44, GARBE 0.06 vs 0.14, and correspondingly higher SIR values in the mobile case.
If we consider a comparison of the two architectures in each scenario, we can see that in the desktop case the average performance achieved by TypeNet in terms of SIRa is 2.80% and SIRg is 2.16%, while the scores of TypeFormer are respectively 3.06% and 2.53%, being more biased as well as less effective. A similar trend is reported for the mobile case, with TypeNet (although with lower verification performance) achieving SIRa=5.22% and SIRg=6.42%, against SIRa=6.84% and SIRg=6.83% achieved by TypeFormer.
Finally, having a look at trends due to different sets of input features within the same scenario and same model configuration, we can observe that the results are more irregular and they do not show a clear overall trend as in the previous cases. Nevertheless, considering for instance the SIR values, we can observe that generally the ASCII code is associated with higher bias in almost all experiments and scenarios, showing that comparisons among the same demographic groups are slightly harder than among different groups.
Conclusion and Future Work
In this article, an open experimental framework to benchmark keystroke dynamics for biometric verification is provided to the research community, to alleviate the heterogeneity of the experimental protocols and metrics and the limited size of the databases adopted in the literature. The framework is provided in the form of the Keystroke Verification Challenge (KVC),11 hosted on CodaLab,12 and held within the 2023 IEEE International Conference on Big Data,13 Sorrento, Italy, December 2023.
Finally, we make a first use of the proposed framework by employing two recent state-of-the-art keystroke biometric systems, TypeNet [11] and TypeFormer [16]. Besides a direct comparison of the two, which shows the superiority of the former in the desktop scenario and of the latter in the mobile one, we consider different sets of input features towards a more privacy-preserving keystroke verification system. The proposed solution is based on discarding the ASCII code, which reveals the text content, in favor of extended features in the time domain. We analyze four different experimental configurations that focus on deepening the temporal information provided to the models, in order to compensate for the removal of the spatial information due to the ASCII code, which is utilized to learn the relation between the location of the specific key on the keyboard layout and the corresponding time dynamics. Our experimental results show that such an approach makes it possible to maintain satisfactory performance in the desktop scenario, and even to improve it for mobile devices.
Concerning future work, the next directions of research will go toward the optimisation of the model architectures to improve the recognition performance and to reduce bias. More sophisticated training approaches will also be investigated, e.g., loss functions based on the selection of hard comparisons and adaptive margins [53], [57], specifically designed for the case of behavioral biometrics such as KD. Moreover, approaches based on the generation of synthetic subject-specific data will be considered to assess the suitability of such techniques for the problem of behavioral biometrics-based verification. To this end, the KVC represents a dedicated and unified test bench for the entire biometric research community. By doing so, we aim to foster the design of innovative solutions that achieve improved performance in comparison with existing ones, which are benchmarked here.
Finally, the results of our contributed benchmark KVC in terms of demographic attribute assessment also enable further large-scale studies focused on examining the differences in subjects’ typing behavior due to biological, cultural, or linguistic factors. In this sense, new findings might be of great interest for several branches of the Human-Computer Interaction (HCI) community, e.g. privacy protection [14], security of minors online [58], user experience improvement [59], etc.