An Enhanced Electrocardiogram Biometric Authentication System Using Machine Learning

Traditional authentication systems use alphanumeric or graphical passwords, or token-based techniques that require "something you know and something you have". The disadvantages of these systems include the risks of forgetfulness, loss, and theft. To address these shortcomings, biometric authentication is rapidly replacing traditional authentication methods and is becoming an everyday part of life. The electrocardiogram (ECG) is one of the most recent traits considered for biometric purposes, and three typical use cases have been described: security checks, hospitals, and wearable devices. Here we describe an ECG-based authentication system suitable for security checks and hospital environments. The proposed authentication system will help investigators studying ECG-based biometric authentication techniques to define dataset boundaries and to acquire high-quality training data. We evaluated the performance of the proposed system using a confusion matrix and also by applying the Amang ECG (amgecg) toolbox in MATLAB to investigate two parameters that directly affect the accuracy of authentication: the ECG slicing time (sliding window) and the sampling time. Using this approach, we found that accuracy was optimized by using a sliding window of 0.4 s and a sampling time of 37 s.


I. INTRODUCTION
Biometric authentication is replacing typical identification and access control systems to become a part of everyday life [1,2]. The electrocardiogram (ECG) is one of the most recent traits to be explored for biometric purposes [3,4]. ECGs report electrical conduction through the heart and can be used to recognize specific individuals [5]. The utilization of ECGs as a biometric trait was first proposed in a 1977 US military report [6]. Although much progress has been achieved over the last 20 years, many challenges remain to be overcome [5], including data acquisition, pre-processing for data enhancement, the assignment of authentication categories, and the application of deep learning (DL) and other machine learning (ML) classification approaches [11]. ML techniques have recently been used to construct a verification model for identification based on live ECG data [7][8][9][10]. ML is considered a subset of artificial intelligence that allows computers to perform tasks without explicit instructions. Instead, ML algorithms build a mathematical model of training data to make regressions (predictions) or decisions (classification/pattern recognition) [12]. The diverse applications of ML include the analysis of videos, images, and sounds [13], as well as ECG data [11,14,15]. ECG research covers a wide range of fields with different requirements. Medical engineers set up electrocardiographs for the collection of rich ECG data [5,16,17], whereas electrical engineers use simpler sensors to detect ECG signals [18][19][20][21]. Therefore, our previous study defined three use cases that influence the setup of an ECG-based authentication system [22], focusing attention on aspects of the system that are relevant for external users [23]. The three use cases are security checks (SCK), hospitals (HOS), and wearable devices (WD). This use case analysis helps researchers in the field of biometric authentication to understand the conditions and setups required in each scenario [22].
In this article, we consider the SCK and HOS use cases in more detail. In a typical SCK scenario, biometric authentication would take place at a security checkpoint at the entrance to a building and would identify employees and visitors, while excluding unknown persons based on a simple ECG scan [22]. In contrast, the HOS use case involves complex medical equipment which collects detailed ECG data during the training and testing phases. This requires a longer sampling time, and multiple leads are used to gather the data [22]. Our novel authentication system uses multi-variable regression to break down the dataset into smaller subsets, and builds a decision tree (DT) model based on these variables to predict target values [24][25]. The convoluted nature of popular deep-structured machine learning means that such models lack transparency and interpretability. The knowledge obtained by interpretable learners (e.g., decision trees) is critical in biometric software design. We use the time-sliced ECG method to build the training and testing datasets [22] and also consider the optimum sliding window. Previous research has shown that the performance of ML is dependent on the ECG slicing time [7][8][9][10] and we have therefore investigated this relationship. Time-sliced ECG data provide a sufficient number of samples, and each sliced dataset can be used as the input for ML training. Although time-sliced ECG data offer sufficient flexibility to mix with other training inputs, we have used this source of data on its own. The minimum heartbeat interval within a typical heart rate [26] was chosen as the sliding window. Representative time-sliced ECG data are shown in Fig. 1.
We used the Amang ECG (amgecg) toolbox in MATLAB for ECG time slicing and to build the training input for the regression approach [22]. This article is divided into four sections. Section II describes the new ML-based authentication system and evaluates its performance using a confusion matrix [28]. Section III considers the two parameters (slicing time and sampling time) that directly affect authentication performance and evaluates their relationship and impact. Finally, the new authentication system and its contributions are summarized in Section IV.
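The time-slicing step described above can be illustrated with a minimal Python sketch. It is not the amgecg toolbox implementation (which is written in MATLAB); the function name `slice_signal`, the use of consecutive non-overlapping windows, and the decision to drop an incomplete trailing window are all assumptions made for illustration.

```python
def slice_signal(samples, fs_hz, window_s):
    """Split a 1-D signal into consecutive, non-overlapping windows.

    samples  : sequence of amplitude values
    fs_hz    : sampling frequency in Hz (assumed constant)
    window_s : slicing time (sliding window) in seconds
    Trailing samples that do not fill a whole window are dropped
    (an assumption; the original toolbox may behave differently).
    """
    if fs_hz <= 0 or window_s <= 0:
        raise ValueError("fs_hz and window_s must be positive")
    per_slice = int(round(fs_hz * window_s))  # samples per window
    if per_slice < 1:
        raise ValueError("window is shorter than one sample")
    count = len(samples) // per_slice
    return [list(samples[i * per_slice:(i + 1) * per_slice])
            for i in range(count)]


if __name__ == "__main__":
    # Synthetic ramp standing in for an ECG trace: 20 samples at 10 Hz.
    demo = list(range(20))
    for window in slice_signal(demo, fs_hz=10, window_s=0.6):
        print(window)
```

Each returned window can then serve as one training or testing input, which is how a short recording yields many samples.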

II. ECG-BASED AUTHENTICATION USING ML
This section describes the ECG-based authentication system using regression as an ML technique that particularly complements the SCK use case. The sampling time for the testing (validation) phase is relatively short (less than 20 s) and the system should identify unknown entities [22].

A. SECURITY CHECK CASE: EXPERIMENTAL SETUPS
Several ML approaches could be used to develop a regression model, but our previous study showed that the DT method achieves the best performance with time-sliced ECG data [22]. We confirmed this by comparing the performance of the decision tree (fine tree) and support vector machine (SVM) methods, and the results are shown in Table I. Based on these results, we selected the DT-based regression method for our authentication system. We therefore applied this method to the SCK use case based on 90 ECG data samples collected in a HOS environment [22]. An ECG-based authentication system can be used to identify employees and exclude unknown persons, assuming that employees have registered their identities and that the ECG data are sufficiently stable during both the training and testing phases. Of the 90 samples used to construct the dataset for this experiment, 63 were sourced from the PhysioBank database [27,29] and 27 from the Diabetes Complications Research Initiative [30]. Ten additional samples (described as 'unknown') were randomly selected from the dataset and added during the testing phase [22]. However, pre-processing for the HOS use case is still necessary even when dealing with the SCK scenario because the data do not originate from a security check.
The authentication process began with use-case categorization and pre-processing before training. All pre-processing steps recommended in our previous study [22] were applied (including baseline drift adjustment, power line interference (PLI) noise adjustment, and checking for flipped signals) because the data were sourced from medical equipment (HOS use case). The ECG data were collected at two different times, resulting in two datasets which were defined as the training and testing sets. The model was trained on the ECG data using the DT regression method (Fig. 2). Although many different measures and techniques are cited in the biometric literature for feature selection filters, we used mutual information in the DT model as the measure to score and rank the features. Mutual information was proposed by Shannon [31] as part of his information theory, which included the concept of entropy: the idea that random signals such as speech have an irreducible complexity, below which no further compression is possible. The entropy $H$ of a random variable $x$ with a probability mass function $p(x)$ can therefore be defined as shown below (1):

$$H(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \qquad (1)$$

where $\mathcal{X}$ is the set of all possible outcomes of $x$. The concept of entropy gives rise to: i) conditional entropy, where the entropy $H$ of a random variable $x$ is conditional upon the knowledge of another random variable $y$; and ii) mutual information, denoting the amount of information gained about $y$ as a result of knowing $x$. In terms of our mutual information theoretic feature selection filter, the random variables $x$ and $y$ can be flexibly used to represent features and class labels. For example, $x$ can denote a feature within the dataset and $y$ can denote a class label, so that the mutual information indicates how likely the machine learner is to correctly predict the class label for any given instance as a result of learning that feature. Computing the expressions (1)-(3) requires knowledge of the probability measures $p(x)$, $p(y)$ and $p(x, y)$. Since these quantities are frequently unknown a priori for any given dataset, we invoke a widely used histogram-based approach [32] to estimate the probability distributions. Histogram-based estimation is, however, known to introduce bias because it is sensitive to the bin size.
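For reference, the standard definitions of conditional entropy and mutual information can be written in the notation used above (random variables x and y with distributions p(x), p(y) and p(x, y)). These are the textbook forms and are assumed, not confirmed, to correspond to expressions (2) and (3); the outcome sets $\mathcal{X}$ and $\mathcal{Y}$ are notation introduced here.

```latex
\begin{align}
H(y \mid x) &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}}
               p(x, y) \log \frac{p(x, y)}{p(x)} \\
I(x; y)     &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}}
               p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
             = H(y) - H(y \mid x)
\end{align}
```

The second identity makes the interpretation explicit: mutual information is the reduction in uncertainty about the class label y obtained by observing the feature x.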
Rank-based mutual information theoretic feature selection is inherently based on the assumption that classifier performance is linked to the amount of mutual information shared between the class label and a feature, i.e., the greater the number of high-ranking features selected for classifier training, the better the classifiers perform in terms of correctly identifying the class label for any given instance. However, this approach ignores the potential for more optimal subsets to exist, comprising features that are not sequentially ranked in terms of their mutual information values.
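The histogram-based estimate and the rank-based filter discussed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' MATLAB code: the function names, the equal-width binning, the default of 8 bins, and the base-2 logarithm are assumptions.

```python
import math
from collections import Counter


def bin_index(value, lo, hi, bins):
    """Map a value to an equal-width histogram bin in [0, bins - 1]."""
    if hi == lo:
        return 0  # constant feature: everything falls in one bin
    idx = int((value - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the maximum value goes in the last bin


def mutual_information(feature, labels, bins=8):
    """Histogram estimate of I(feature; label) in bits."""
    if not feature or len(feature) != len(labels):
        raise ValueError("feature and labels must be non-empty, equal length")
    lo, hi = min(feature), max(feature)
    binned = [bin_index(v, lo, hi, bins) for v in feature]
    n = len(feature)
    joint = Counter(zip(binned, labels))   # counts for p(x, y)
    px = Counter(binned)                   # counts for p(x)
    py = Counter(labels)                   # counts for p(y)
    mi = 0.0
    for (b, c), k in joint.items():
        p_xy = k / n
        mi += p_xy * math.log2(p_xy / ((px[b] / n) * (py[c] / n)))
    return mi


def rank_features(columns, labels, bins=8):
    """Return (name, score) pairs sorted by decreasing mutual information."""
    scores = {name: mutual_information(col, labels, bins)
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    columns = {"informative": [0.0, 0.0, 1.0, 1.0],
               "noise": [0.0, 1.0, 0.0, 1.0]}
    print(rank_features(columns, labels, bins=2))
```

Changing `bins` changes the estimate, which is the bin-size sensitivity noted above; the ranking also scores each feature in isolation, so it cannot detect subsets of features that are jointly more informative.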
The sliced ECG time (sliding window) for this experiment was 0.6 s, which is equivalent to the interval between heartbeats at a typical rate of 100 beats per minute [26]. However, the slice time can be changed and may therefore affect the authentication performance; this relationship is discussed in Section III. The detailed process flow for the training and testing phases is summarized in Fig. 3. The training phase generates the reference regression functions for each sample (i.e., entity) using the DT technique and stores all functions as a database. This database is then used for comparison when new ECG data are detected during the testing phase. The sampling time for the training data was set to 50 s. The sampling time for the testing data should be shorter due to the properties of the SCK use case category. The detection of the ECG testing data should be faster because the SCK use case considers a scenario in which employees are entering a company building. Accordingly, the sampling time for testing was set to 15 s. The core process generates reference regression functions for each set of ECG training data. Some of the trained reference functions are shown in Fig. 4, and these can be compared with the ECG testing data without fixing the sampling frequency.
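The enroll-then-compare flow described above can be sketched as follows. To keep the example short and dependency-free, the reference function for each entity is replaced by the point-wise mean of its training slices; this is a deliberately simple stand-in, not the DT regression used in the paper. The function names, the MSE-based matching rule, and the `threshold` value are likewise hypothetical.

```python
def mean_template(slices):
    """Point-wise mean of equal-length slices (stand-in for a fitted
    reference regression function)."""
    if not slices:
        raise ValueError("at least one slice is required")
    length = len(slices[0])
    return [sum(s[i] for s in slices) / len(slices) for i in range(length)]


def mse(a, b):
    """Mean square error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def enroll(training):
    """Training phase: build the reference database {entity_id: template}."""
    return {entity: mean_template(slices)
            for entity, slices in training.items()}


def identify(references, test_slices, threshold):
    """Testing phase: return (entity_id or 'unknown', best_error)."""
    probe = mean_template(test_slices)
    best_id, best_err = min(
        ((entity, mse(ref, probe)) for entity, ref in references.items()),
        key=lambda pair: pair[1],
    )
    # Reject the match when even the closest reference is too far away.
    return (best_id if best_err <= threshold else "unknown"), best_err


if __name__ == "__main__":
    # Tiny synthetic slices, not real ECG data.
    db = enroll({"alice": [[0, 1, 0], [0, 1, 0]], "bob": [[1, 0, 1]]})
    print(identify(db, [[0, 0.9, 0]], threshold=0.05))
    print(identify(db, [[5, 5, 5]], threshold=0.05))
```

The rejection branch is what lets the system handle unknown entities in the SCK scenario; in practice the threshold would have to be derived from the training data rather than fixed by hand.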

B. EXPERIMENTAL RESULTS
Authentication performance was evaluated using a confusion matrix, i.e., a table layout that visualizes the performance of an algorithm, typically a supervised learning algorithm [28]. Given the SCK use case, the ability to handle unknown entities is also required as part of the authentication process. We also applied a data quality measure based on the mean square error (MSE) before starting to detect the testing ECG data. The experiment was performed 150 times using 100 samples, with a sampling time of 15 s for the authentication testing and the confusion matrix. The results are shown in Table II. The authentication score was 90 out of 122 (73.77%) and the successful identification of an unknown entity was achieved six times out of eight (75%). Notably, 28 of the 150 ECG datasets (17.61%) were rejected because they did not meet the data quality criteria [22]. The acceptance criteria for validation could be made stricter based on the quality of the training data (according to the upper control limit of the MSE). When these stricter data quality criteria were used, we achieved the results shown in Table III. In this case, only 82 of 150 entities in the ECG testing data were accepted for validation using the reference regression function because of the data quality criteria. On the other hand, the accuracy of this biometric authentication system was 76 out of 82 (92.7%). Notably, the values could vary because the ECG testing data were randomly selected for each trial to make the SCK use case more realistic.
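The percentages quoted above follow directly from the counts of correct decisions among the accepted trials. The short Python sketch below reproduces that arithmetic from the counts stated in the text; the helper name `percent` is hypothetical, and the full confusion matrices of Tables II and III are not reconstructed here.

```python
def percent(correct, total):
    """Share of correct outcomes, as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)


if __name__ == "__main__":
    # Counts as stated in the text.
    print("baseline accuracy (%):", percent(90, 122))
    print("unknown-entity detection (%):", percent(6, 8))
    print("accuracy with stricter quality criteria (%):", percent(76, 82))
    print("share of trials accepted under stricter criteria (%):",
          percent(82, 150))
```

Reading the last two figures together shows the trade-off: stricter data quality criteria raise accuracy among accepted trials, but fewer trials are accepted for validation.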

III. SLICING AND SAMPLING TIME DEPENDENCIES
Some ML performance measures depend on the ECG slicing time (sliding window). We gathered 70 samples from the various sources discussed above [27,29,30] and used the HOS use case to investigate the relationship between the authentication performance and two key parameters: the sliding window (i.e., ECG slice time) and the ECG data sampling time. The relationship between the slicing time and authentication accuracy is shown in Fig. 5, revealing that the optimal slicing time is approximately half the average interval between heartbeats (0.4 s). We also investigated the relationship between sampling time and authentication accuracy (Fig. 6). Although there was no clear relationship between these parameters, the optimal sampling time was 37 s in our experiment. The optimal slicing and sampling times may not remain the same if different datasets are used, but our experiments clearly demonstrated that optimal values for these parameters exist and can be used to improve performance.
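A parameter study of this kind amounts to a one-dimensional sweep: evaluate the authentication accuracy for each candidate value and keep the best. The Python sketch below shows the pattern under stated assumptions. The names `best_setting` and `slices_available` are hypothetical, and the score function in the demo is a synthetic placeholder, not the measured accuracy curve of Fig. 5.

```python
def slices_available(sampling_time_s, window_s):
    """Number of whole windows that fit into one recording."""
    if sampling_time_s <= 0 or window_s <= 0:
        raise ValueError("times must be positive")
    # Small epsilon guards against floating-point round-off in the ratio.
    return int(sampling_time_s / window_s + 1e-9)


def best_setting(candidates, score_fn):
    """Return (value, score) for the candidate with the highest score."""
    best = None
    for value in candidates:
        score = score_fn(value)
        if best is None or score > best[1]:
            best = (value, score)
    if best is None:
        raise ValueError("no candidates supplied")
    return best


if __name__ == "__main__":
    windows = [0.2, 0.3, 0.4, 0.5, 0.6]

    def placeholder_score(w):
        # Synthetic stand-in; a real study would train and test the
        # authentication model for each window and return its accuracy.
        return 1.0 - (w - 0.4) ** 2

    print(best_setting(windows, placeholder_score))
    print(slices_available(37, 0.4))
```

With a 37 s recording and a 0.4 s window, 92 whole slices are available per recording, which gives a sense of how many training inputs each entity contributes at the reported optimum.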

IV. CONCLUSION
Biometric authentication systems are poised to replace traditional authentication systems, but the particular use case (SCK, HOS or WD) defines the ECG setup and therefore the nature of the authentication system. Here we proposed an enhanced ECG-based biometric authentication system for the SCK and HOS use cases, in which a regression-based interpretable ML approach was used to define the dataset boundaries and to acquire good-quality training data. We trained on a total of 90 ECG data samples to generate the reference function database. The reference function for each ECG data entity (i.e., identification) was then generated using a mutual-information-based DT regression approach. The authentication performance of the proposed system was evaluated not only with a confusion matrix but also by using the amgecg toolbox in MATLAB to analyze two key parameters: the ECG slicing time (sliding window) and the sampling time. We found that a sliding window of 0.4 s achieved the best performance and that the optimal sampling time was 37 s. In conclusion, using these optimized parameters, the proposed authentication system is able to achieve accurate results.

APPENDIX
The amgecg toolbox v.05 [22] and WFDB toolbox v0.10.0 [33] were used to design our authentication system. The corresponding MATLAB code is available on GitHub (http://URL_will_be_available_after_accepting_the_paper) for users to try the demonstrations.

FIGURE 2. The training process for the SCK use case.

FIGURE 3. Process flow for ECG-based authentication in the SCK use case.

FIGURE 4. The reference ECG regression functions for the SCK use case.

FIGURE 5. Authentication accuracy based on slicing time.

FIGURE 6. Authentication accuracy based on sampling time.