Off-Person ECG Biometrics Using Spatial Representations and Convolutional Neural Networks

Targeting off-person ECG-based biometrics, we report a comparative analysis of the identification accuracy and verification Equal Error Rate (EER) performance of four distinct types of spatial representations of ECG signals applied as inputs to Convolutional Neural Networks. The algorithms used to transform the original time series into 2D/3D images are based on a modified version of the Continuous Wavelet Transform (the S-Transform), the Gramian Angular Field, the recurrence plot, and state-space representations. Extensive experiments have been conducted using the UofT and CYBHI datasets, which include recordings acquired on fingers and hand palms under various activity scenarios. The wavelet-based approach yielded the best results, while all analyzed solutions compare favorably with previously reported performances.


I. INTRODUCTION
During the last decades, the biometrics landscape has been largely dominated by systems that acquire and process fingerprint, face, iris, or speech data, or combinations of those. Due to the increasing success rate of counterfeiting or circumventing such approaches, researchers have been forced to consider alternative sensory information and better decision-making solutions, while keeping a comfortable, user-friendly acquisition and operation setup. One robust line of research has considered physiological traces such as the electrocardiogram (ECG), photoplethysmogram (PPG), or even electroencephalogram (EEG) as promising candidates for increasing resistance against fraud [1]. Moreover, solutions relying on more convenient off-person acquisition setups (e.g., ECG signals acquired in a less intrusive manner on the hand, fingers, or ears, without requiring special preparation of the subject's skin) may prove particularly attractive for mobile/wearable settings [2], [3].
ECG signals have been mostly preferred for biometric applications, due to their inherent liveness properties, the difficulties associated with fraudulent recording, and the increased availability of affordable, wearable, user-accepted sensory acquisition devices. While still facing problems related to variability induced by human activity, degradation of the signal-to-noise ratio in case of off-person acquisition, or the time lapse between training and testing sessions, ECG-based biometrics has been considered by a significant number of studies that have been extensively surveyed recently [4]. More than 90 papers have been analyzed in terms of acquisition protocols, signal denoising and preprocessing techniques, feature extraction, decision-making, and multimodal operation. While performance levels largely vary according to the actual setup scenario, several conclusions are worth investigating further. To start with, there is a clear need for additional studies using off-person acquired data, including public availability of reliable, long-term, activity-dependent recordings for a significant number of people. Many papers report identification accuracy and verification performances evaluated on private datasets with a limited (often, fewer than 20) number of subjects, which makes a fair comparison of the proposed methods difficult. Moreover, despite their practical appeal, experiments conducted on ECG records acquired off the person are rather scarce, and clearly outnumbered by those using sensors placed on the chest (although in the latter case, identification accuracy close to 100% and very small EER values have been reached by a set of different approaches [4]).
Datasets including variable activity scenarios and health statuses are also extremely rare, and furthermore, they seldom cover long time intervals, while robustness against such sources of variability is a clear requirement for any biometric solution. Secondly, the survey identifies the recent trend of using deep learning approaches for ECG-based biometric applications, following promising results obtained in other classification tasks, such as arrhythmia, coronary artery disease, or mental stress detection [5]- [7]. Convolutional Neural Networks (CNN's) have been the typical choice, although recurrent networks (sometimes combined with CNN's) or autoencoders have also been considered [4]. Such architectures are able to generate abstract representations of the raw input data by means of successive processing layers, yielding discriminative features relevant for the various classification tasks without the need for complicated preprocessing stages or handcrafted feature engineering. While most of the successful approaches directly applied the original ECG time series as inputs to 1D CNN's [8], [9], some studies have also considered generating 2D spatial representations, typically using the (short-time) Fourier or Continuous Wavelet transforms [10], [11]. As such, CNN's may classify the textures of compact bi-dimensional representations of lengthy time series, while benefiting from the availability of high-quality model parameters through the transfer learning paradigm (thus additionally coping with small-dataset scenarios). Additional advantages of the 1D-to-2D transformation include better noise robustness and flexibility for data augmentation, e.g.
using cropping: while the clinical significance of standard 1D representations may be adversely affected by various augmentation procedures, multiple patches of reduced size may be extracted from a given spatial representation and further resized by linear interpolation [12], coping with highly unbalanced training datasets. Comparative analyses of 1D versus 2D approaches to ECG classification revealed performance improvements of up to 3% [13], [14], at the cost of more complex models and additional conversion overhead.
Along the latter line, the present paper presents, to the best of the authors' knowledge, the first comparative analysis of four distinct spatial representations, namely a phase-corrected version of the Continuous Wavelet Transform (CWT) called the S-Transform [15]- [18], the Gramian Angular Field [19], the recurrence plot [20]- [22], and state-space portraits [23], [24] for ECG-based biometrics. Applying the resulting images as inputs to the same CNN network, comparative identification accuracy and verification EER performances have been assessed using two distinct databases, namely CYBHI [25] including data acquired on hand palms and UofT [26] using finger recordings, under various activity scenarios. Extensive experiments revealed that the wavelet-based approach yields the best results, while all analyzed solutions compare favorably with other previously reported methods. The approach does not require the detection of any fiducial points except for the R peaks, which benefits from the existence of a number of performant solutions, including some that may operate in a mobile/wearable setup [27], [28].
The paper extends the preliminary results published in [29], by including 3 additional spatial representation methods (S-Transform, Gramian Angular Field, recurrence plot) besides the state-space portraits approach (while the latter now additionally considers 3D state-space trajectories, as opposed to 2D input images in [29]), analyzing verification EER performances besides identification accuracy, and studying the effect of input data fusion and long-term test data.

II. SPATIAL REPRESENTATIONS OF TIME SERIES
Transforming ECG time series into images has previously been considered for compression purposes [30], building upon clever use of both the intra-segment and inter-segment correlation of data divided into individual heartbeats. Encoding (ECG) time series as images brings a series of advantages, including extending the range of texture features available for subsequent classification (repeated convolutions extract relevant information using the local spatial coherence of such features), the identification of regions of interest in the resulting bidimensional representation (allowing interpretation if the encoding enables time series reconstruction), and proper use of the transfer learning principle (by considering already available high-quality models to be tuned according to the specific ECG-based dataset).
The present approach aims at generating discriminative spatial representations from the one-dimensional (denoised and segmented) data, from which CNN's may first extract relevant features to be further classified, as described in Figure 1. The preprocessing steps are common to all considered approaches and include band-pass filtering for denoising purposes and extraction of fixed-length time segments centered around the R peaks, followed by a quality check of the resulting data and elimination of fragments with poor signal-to-noise ratio. In the following, we describe the methods considered and exemplify the corresponding images obtained.

A. THE S-TRANSFORM
Since the classical Fourier Transform (FT) is unable to reflect the temporal dynamics of the spectral components of a given signal, the Short-Time Fourier Transform (STFT) has been proposed as a time-frequency tool aiming at compensating for this limitation. The fixed width of the associated analysis window still limits the efficiency of the method, especially for non-stationary data. In order to correct this, the Continuous Wavelet Transform (CWT) introduces a frequency-dependent resolution of the time-frequency space, yielding better localization of significant changes in the characteristics of the signals under study at both the low and high ends of the spectrum. The CWT of a signal x(t) is [31]:

W(\tau, \sigma) = \int_{-\infty}^{+\infty} x(t)\, \psi(t - \tau, \sigma)\, dt \quad (1)

where \psi(t, \sigma) is a scaled version of a specific (sometimes, data-dependent) mother wavelet and the parameter \sigma controls the width and hence the resolution of the representation.
Since the standard CWT lacks precise phase information (which is critical for a set of applications, including source localization and geophysical and biomedical data processing), the S-Transform (ST) has been introduced as a phase-corrected version of the CWT as [17]:

S(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, \frac{|f|}{\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^2 f^2}{2}}\, e^{-i 2\pi f t}\, dt \quad (2)

obtained by choosing a particular mother wavelet defined as

\psi(t, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{t^2}{2\sigma^2}}\, e^{-i 2\pi f t}

with a frequency-dependent width \sigma(f) = 1/|f|. As such, in contrast to the standard CWT, the ST provides both absolutely referenced phase information and a frequency-independent amplitude response [17]. Moreover, the ST is defined in a time-frequency plane, by contrast to the standard CWT that uses a time-scale pair of axes, making the frequency information more interpretable in the case of the ST, and hence simplifying the design of ECG-oriented applications such as denoising procedures [32] and QRS detection schemes [33]. Comparative analyses of the ST against the FT, STFT, and CWT, along with in-depth analyses of its properties, implementation, applications, and limitations have been the subject of several papers [15]- [18].
In order to improve the flexibility of resolution adjustment in the time-frequency plane, a number of generalized versions of the standard ST have been proposed, yielding more complicated dependencies of the \sigma parameter on frequency. One of the most general formulations considers \sigma(f) = \lambda / |f|^p, where \lambda and p are positive real-valued parameters whose values are application-specific and set in order to maximize properly defined performance measures [18]. In our experiments we used the method described in [16] that includes code and optimal parameter values. Figure 2 presents examples of ST images corresponding to distinct subjects.
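As a concrete illustration, the discrete ST admits an efficient frequency-domain implementation: for each frequency, the shifted spectrum is weighted by a Gaussian window and inverse-transformed. The sketch below is our own minimal numpy version (the standard \sigma(f) = 1/|f| case only, not the optimized, parameter-tuned code of [16]):

```python
import numpy as np

def s_transform(x):
    """Discrete Stockwell (S-) transform, frequency-domain formulation.

    Returns an (N//2 + 1, N) complex array whose rows index frequency
    (0 .. N/2 cycles per record) and whose columns index time.
    """
    N = len(x)
    X = np.fft.fft(x)
    Xw = np.concatenate([X, X])              # wrapped spectrum, so X[m + n] never overflows
    S = np.zeros((N // 2 + 1, N), dtype=complex)
    S[0, :] = x.mean()                       # zero-frequency row is defined as the signal mean
    m = np.arange(N)
    m_signed = (m + N // 2) % N - N // 2     # signed frequency offsets in [-N/2, N/2)
    for n in range(1, N // 2 + 1):
        gauss = np.exp(-2.0 * np.pi**2 * m_signed**2 / n**2)   # voice Gaussian window
        S[n, :] = np.fft.ifft(Xw[n:n + N] * gauss)
    return S
```

Feeding a pure sinusoid through this sketch produces an image whose energy concentrates on the row matching the sinusoid's frequency, which is exactly the texture property the CNN later exploits.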

B. GRAMIAN ANGULAR FIELD
Gramian Angular Fields (GAF) [19] represent another time series-to-image transformation enabling visually-oriented classification of one-dimensional data. Given a time series X = {x_1, x_2, ..., x_n}, the algorithm starts by rescaling the data under study to the [-1, 1] interval, followed by expressing the resulting signal \tilde{x} in polar coordinates, using the angular cosine as the phase information and coding time as the distance from the origin (N is a scaling constant):

\phi_i = \arccos(\tilde{x}_i), \quad r_i = \frac{t_i}{N}, \quad \tilde{x}_i \in [-1, 1] \quad (3)

Due to the monotonicity and continuity of the cosine function over the interval [0, \pi], the encoding map given in Eq. (3) is bijective, hence it provides a unique representation in polar coordinates. Moreover, the map additionally preserves absolute temporal relationships, unlike a Cartesian-coordinate encoding [19]. We finally define the Gramian Angular Field as:

G = \left[ \cos(\phi_i + \phi_j) \right]_{i,j = 1, \ldots, n} \quad (4)

Upon close inspection, the definition in Eq. (4) reveals that computing the trigonometric sum of the phase values reflects the temporal correlation within various time intervals of the given time series. Examples of GAF plots obtained from ECG segments of different subjects are given in Figure 3.
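The two steps above (rescaling plus polar encoding, then the pairwise trigonometric sum) reduce to simple matrix algebra via the identity cos(a + b) = cos a cos b - sin a sin b; a minimal numpy sketch:

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular (Summation) Field of a 1-D series.

    Rescales x to [-1, 1], maps it to phases phi = arccos(x_tilde) and
    returns G[i, j] = cos(phi_i + phi_j), computed without explicit
    trigonometry as outer products of cos(phi) and sin(phi).
    """
    x = np.asarray(x, dtype=float)
    x_t = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0   # rescale to [-1, 1]
    sin_phi = np.sqrt(np.clip(1.0 - x_t**2, 0.0, None))     # sin(arccos(x_t)) >= 0
    return np.outer(x_t, x_t) - np.outer(sin_phi, sin_phi)
```

The resulting matrix is symmetric, with diagonal entries cos(2\phi_i) = 2\tilde{x}_i^2 - 1, so the main diagonal alone already encodes the rescaled waveform.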

C. PHASE-SPACE TRAJECTORIES
A number of well-known models have been proposed in order to explain the temporal dynamics and morphology of ECG waveforms [24]. These typically make use of coupled sets of nonlinear differential equations and enable the generation of a broad range of synthetic waveforms, including normal, pathologic or fetal ones. This approach has also motivated the use of ECG analysis tools rooted in nonlinear dynamic systems theory, including reconstructed phase space portraits, recurrence plots, Lyapunov exponents, and correlation dimension.
Following previous approaches [34], [35], we considered state-space trajectories generated according to the celebrated Takens embedding theorem [24], to be applied as inputs to a CNN network for classification purposes. Given a generic scalar time series X = {x_1, x_2, ..., x_n}, we may construct a high-dimensional representation using a set of successively delayed versions as:

s_n = \left[ x_n, x_{n-\tau}, x_{n-2\tau}, \ldots, x_{n-(m-1)\tau} \right] \quad (5)

where the delay \tau and the embedding dimension m should be set according to the characteristics of the given time series using the mutual information method and the false nearest neighbor algorithm, respectively [35], [36]. More specifically, experiments conducted on the datasets under test revealed a typical value of \tau \in {10, 11}, which is consistent with previous findings that the optimal setting of the \tau parameter lies around 20% of the QRS interval duration [37] (hence \tau \approx 20 ms for a 500 Hz sampling rate). From a graphical perspective, if m = 2 we obtain a 2D trajectory along the coordinates x_n and x_{n-\tau}, while for m = 3 we may generate a spatial 3D representation in terms of x_n, x_{n-\tau}, and x_{n-2\tau}. The final image is obtained by an initial appropriate scale adjustment, followed by binarization of the resulting trajectory. Much like [37], each time-delayed version of the (filtered, R-peak centered, normalized, fixed-length) ECG segment is first scaled into a common range, then a 200 × 200 pixels binary image is obtained by setting black color on all pixels hit by at least one point of the phase-space trajectory, while labeling all remaining pixels white. As a consequence, the subsequent CNN-based classifier may use single-channel bidimensional inputs (as analyzed in [29]) or three (generally, multiple) channel inputs obtained by projecting the 3D (generally, multidimensional) representation onto the corresponding orthogonal planes.
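The rasterization step described above can be sketched as follows for the 2D case, using the paper's 200 × 200 image size and an illustrative \tau = 10 (black trajectory pixels on a white background):

```python
import numpy as np

def phase_space_image(x, tau=10, size=200):
    """Rasterize the 2-D delay embedding (x_n, x_{n-tau}) as a binary image.

    Trajectory pixels are black (0) on a white (1) background; the segment
    is first scaled to a common range so all pixel indices are in bounds.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[tau:], x[:-tau]                         # the (x_n, x_{n-tau}) pairs
    lo, hi = x.min(), x.max()
    col = ((a - lo) / (hi - lo) * (size - 1)).astype(int)
    row = ((b - lo) / (hi - lo) * (size - 1)).astype(int)
    img = np.ones((size, size), dtype=np.uint8)      # white background
    img[row, col] = 0                                # black trajectory pixels
    return img
```

For m = 3, the same routine applied to the pairs (x_n, x_{n-\tau}), (x_n, x_{n-2\tau}), and (x_{n-\tau}, x_{n-2\tau}) yields the three orthogonal projections used as separate CNN input channels.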
In Figure 4 we present examples of 2D/3D representations in the phase space.

D. RECURRENCE PLOTS
Periodicity and irregular cyclicities are typical characteristics of time series generated by nonlinear dynamical systems and have been extensively studied. Among the various analysis techniques, recurrence plots (RP) represent a well-known tool originating in chaos theory for the visual inspection of time series [20]- [22]. More specifically, RPs aim at efficiently exploring the high-dimensional phase space associated with a nonlinear dynamical system through a bidimensional representation of its recurrences. As such, we may construct the following RP matrix:

R_{i,j} = \theta\left( \varepsilon - \| s_i - s_j \| \right), \quad i, j = 1, \ldots, N \quad (6)

where m defines the embedding dimension set as in [36], N is the total number of states, \varepsilon is a real-valued threshold, and \theta is the step function (the components of the state vectors s_i are obtained by subsampling the given time series using a proper delay). The definition above shows that the RP matrix may reveal trajectories that return (close) to previous states. The actual plot exhibits both global characteristics (homogeneity, periodicity) and local patterns such as isolated dots and diagonal, horizontal, or vertical lines.
The present paper proposes the direct use of RPs as inputs to a CNN network, whereas most previous approaches have classified recurrence quantification analysis (RQA) scalar measures [20] (describing the frequency and duration of recurrences) with SVM's or other solutions. Figure 5 presents examples of recurrence plots for several distinct subjects.
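The RP matrix defined above (a thresholded pairwise-distance matrix between delay-embedded states) can be sketched directly in numpy; the \tau, m, and \varepsilon defaults below are illustrative, not the tuned values used in the experiments:

```python
import numpy as np

def recurrence_plot(x, tau=10, m=3, eps=0.1):
    """Binary recurrence matrix R[i, j] = step(eps - ||s_i - s_j||).

    States s_i are delay vectors [x_i, x_{i+tau}, ..., x_{i+(m-1)tau}];
    the step function is 1 whenever the two states lie within eps.
    """
    x = np.asarray(x, dtype=float)
    n_states = len(x) - (m - 1) * tau
    # stack the m delayed copies as columns -> (n_states, m) state matrix
    s = np.stack([x[i * tau: i * tau + n_states] for i in range(m)], axis=1)
    dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    return (dist <= eps).astype(np.uint8)
```

By construction the matrix is symmetric with an all-ones main diagonal (every state recurs to itself); the off-diagonal line structures are what the CNN classifies.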

III. EXPERIMENTAL RESULTS
In this section we describe the datasets used in the experiments, the preprocessing procedures, the CNN architecture, and provide comparative identification accuracy and verification EER performances of the analyzed approaches against existing solutions.

A. DATASETS
The Check Your Biosignals Here initiative (CYBHi) [25] collected both on-person (chest) and off-person (hand palms and fingers) 5-minute-long ECG records from 65 persons (49 males and 16 females, average age 31.1 years) using dry Ag/AgCl sensors, sampled at 1 kHz with 12-bit resolution. The dataset includes distinct recordings for each person under three distinct scenarios: neutral discussion, exposure to a low-arousal video, and exposure to a high-arousal video, respectively. The database also provides a long-term dataset including ECG signals collected from the same set of subjects after a 3-month interval. All volunteers from whom data was collected were informed about the purpose of the research and the materials involved in the experiments, and were required to sign an informed consent document. While the above-mentioned dataset includes records from both hand palms and fingers, we used only the hand-palm data in the experiments.
As indicated in Table 1, the University of Toronto ECG Database (UofT) [26] includes up to six recordings acquired on fingers (with 2-5 minutes duration) over a period of six months, in five distinct postures: supine, tripod, exercise, sitting, and standing, for a total number of 1020 subjects. Out of those, only 46 subjects have records covering the overall six months period, and only 52 subjects include all five distinct postures. The data is sampled at 200 Hz rate, with 12 bits resolution.
The original CYBHI data was downsampled and UofT data was upsampled to a common 500 Hz rate (using a FIR anti-aliasing filter based on the Kaiser window).
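The rate conversion can be sketched with scipy's polyphase resampler, whose anti-aliasing FIR filter is designed with a Kaiser window by default, matching the described procedure (the exact filter order used in the paper is not specified, so scipy's default is assumed here):

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_common_rate(sig, fs_in, fs_out=500):
    """Resample a record to the common 500 Hz rate.

    resample_poly upsamples by fs_out/g, FIR-filters (Kaiser-window
    design), and downsamples by fs_in/g, where g = gcd of the two rates.
    """
    g = gcd(fs_in, fs_out)
    return resample_poly(sig, fs_out // g, fs_in // g)
```

For the two datasets this means UofT records (200 Hz) are upsampled by 5/2 and CYBHi records (1 kHz) are downsampled by 1/2.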

B. DATA PREPROCESSING PROCEDURES
Following common practice, we band-pass filtered the original ECG records using a 4th-order 1-40 Hz Butterworth filter (alternative solutions include Kalman and Savitzky-Golay filters, or the Discrete Cosine Transform).
We subsequently implemented a wavelet-oriented R peak detection scheme. More specifically, we began by using the Maximal Overlap Discrete Wavelet Transform (MODWT, an undecimated wavelet transform operating on arbitrary-length sequences [38]) in order to decompose the raw signals up to level 6, using the 'sym4' wavelet family. Then, we computed an approximated version of the signals using the set of coefficients on scales 5 and 6, covering a coarse 8-20 Hz frequency band containing most of the energy of the analyzed signal (reference [28] presents a comparative analysis of various QRS detection schemes, including the choice of the optimal frequency band). We further squared the absolute values of the approximation to enhance the samples around the R peaks, and finally used the findpeaks MATLAB function in order to detect the actual positions of the peaks. The function looks for local maxima, namely samples that are larger than their two neighboring points, while additionally imposing constraints on the minimum distance between successive peaks (set to 0.5 s in our experiments) and their minimum amplitude. The procedure exhibited better localization accuracy than the well-known Pan-Tompkins algorithm on the datasets under study, while remaining simpler than other wavelet-based approaches [39].
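A simplified Python stand-in for the final detection step is sketched below: it squares the (already band-limited) signal and picks local maxima with a 0.5 s minimum spacing, mirroring MATLAB's findpeaks constraints. The MODWT band-limiting stage is omitted, and the amplitude floor is an illustrative assumption (the paper does not state its value):

```python
import numpy as np
from scipy.signal import find_peaks

def detect_r_peaks(sig, fs=500.0, min_rr=0.5):
    """Simplified R-peak detector.

    Squares the band-limited signal to enhance QRS energy, then picks
    local maxima at least min_rr seconds apart and above an illustrative
    amplitude floor -- a findpeaks-style detector in scipy.
    """
    energy = np.abs(sig) ** 2
    min_height = 0.3 * energy.max()          # assumed amplitude floor
    peaks, _ = find_peaks(energy, distance=int(min_rr * fs), height=min_height)
    return peaks
```

The 0.5 s distance constraint corresponds to rejecting heart rates above 120 bpm as duplicate detections, which is why it is applied on top of the amplitude criterion.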
Along the lines of similar works, an interval of 0.2 s before and 0.4 s after each R peak (including the QRS complex, and the P and T waves) is selected in order to characterize successive individual heartbeats. The segments thus have a common duration of 300 samples (0.6 s at the 500 Hz rate). Every such segment is then normalized to zero mean and unit variance.
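The segmentation and normalization steps can be sketched as follows (windows that would fall outside the record are simply skipped, an assumption on our part):

```python
import numpy as np

def extract_segments(sig, r_peaks, fs=500.0, t1=0.2, t2=0.4):
    """Cut a fixed (R - 0.2 s, R + 0.4 s) window around every usable
    R peak (300 samples at 500 Hz) and z-normalize each segment."""
    pre, post = int(t1 * fs), int(t2 * fs)
    segments = []
    for r in r_peaks:
        if r - pre < 0 or r + post > len(sig):
            continue                          # window falls outside the record
        seg = np.asarray(sig[r - pre: r + post], dtype=float)
        segments.append((seg - seg.mean()) / seg.std())
    return np.array(segments)
```

Each returned row is one heartbeat, ready to be converted into any of the four spatial representations.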
Severe artifacts are present in some of the recordings due to the acquisition setup and protocol, and those may significantly impact the classification performances. As a consequence, we used a selection procedure in order to filter out unreliable segments. Several distinct options have been tested (including PCA-based reconstruction error or computing an average segment for each subject/acquisition scenario and calculating the normalized distance between individual segments and the resulting average template, to be compared against a proper threshold value), yet best results have been obtained using the Normalized Cross-Correlation Clustering (NCCC) approach [31]. The algorithm relies on the assumption that reliable/clean segments are mostly similar, while noisy/distorted ones are uncorrelated with those. It successively constructs a cluster of correlated waveforms and rejects potential outliers, using a set of threshold parameters indicated in [31].
The comparison criteria between various outlier rejection techniques included both visual inspection, as indicated in Figure 6, and computing the variance between the kept segments.
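The gist of the correlation-based selection can be conveyed by a greatly simplified stand-in: keep only the segments whose normalized cross-correlation with a median template exceeds a threshold. The actual NCCC algorithm grows a cluster of correlated waveforms iteratively, with the threshold values given in the cited reference; the 0.8 threshold and median template below are illustrative assumptions:

```python
import numpy as np

def select_reliable_segments(segments, threshold=0.8):
    """Simplified NCCC-style selection.

    Keeps segments whose zero-lag normalized cross-correlation with the
    elementwise-median template exceeds the threshold; noisy/distorted
    segments are assumed to be uncorrelated with the clean majority.
    """
    segments = np.asarray(segments, dtype=float)
    template = np.median(segments, axis=0)
    tz = (template - template.mean()) / template.std()
    keep = []
    for seg in segments:
        sz = (seg - seg.mean()) / seg.std()
        ncc = float(np.mean(sz * tz))        # Pearson correlation with template
        if ncc >= threshold:
            keep.append(seg)
    return np.array(keep)
```

The median template is chosen here (rather than the mean) so that a minority of heavily distorted segments does not drag the reference waveform toward the outliers.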

C. CNN ARCHITECTURE
The topology of the convolutional neural network is indicated in Table 2 and is very similar to the architectures used in [37], [40]. The actual structure resulted from repeated experiments using various dimensions of the convolution filters and numbers of neurons in the fully connected layers. It includes three successive (convolution + ReLU nonlinearity + max pooling) blocks, followed by a two-layer fully connected network using softmax activation functions for the output layer. Dropout is used within the fully connected module to improve the generalization capacity of the classifier, while learning is performed by the Adam algorithm using batch normalization [41]. The initial learning rate was set to 0.001, β_1 = 0.9, β_2 = 0.99, and the mini-batch size was 64.
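To make the spatial-size bookkeeping of the three blocks concrete, the sketch below implements one (convolution + ReLU + 2 × 2 max-pooling) block in plain numpy for a single channel; the 3 × 3 kernel is illustrative only (the exact filter dimensions and channel counts are those listed in Table 2):

```python
import numpy as np

def conv_relu_pool(img, kernel):
    """One (convolution + ReLU + 2x2 max-pooling) block on a
    single-channel image, written in plain numpy for clarity."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):            # 'valid' 2-D correlation
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    out = np.maximum(out, 0.0)               # ReLU
    # non-overlapping 2x2 max pooling (odd trailing row/col is cropped)
    H2, W2 = out.shape[0] // 2 * 2, out.shape[1] // 2 * 2
    pooled = out[:H2, :W2].reshape(H2 // 2, 2, W2 // 2, 2).max(axis=(1, 3))
    return pooled
```

Applied three times to a 200 × 200 input (as used for the phase-space images), the feature map shrinks 200 → 99 → 48 → 23 per side, which is the scale at which the fully connected layers take over.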

D. IDENTIFICATION PERFORMANCE ANALYSIS
While the duration of the recordings in both datasets is already low (2-5 minutes per session), the number of useful ECG segments available for training is further diminished by the quality checking procedure. We aimed at reaching a proper trade-off between the size of the training datasets and the number of subjects, under the additional constraint that in multiple-set experiments (the CYBHI set includes three distinct acquisition scenarios, and the UofT data considers five distinct postures) we should use the same number of ECG segments per scenario per subject. As a consequence, we were left with about 220 segments per subject and 700 segments per subject for the CYBHI and UofT datasets, respectively. Each segment was then transformed into each of the four spatial representations under consideration.
Due to the low dimensionality of the available datasets, the records were split into five equal subsequences in order to implement a 5-fold cross-validation strategy: for each dataset, training was successively performed on four subsequences (gathered from all subjects), while the identification and verification performances were evaluated on the remaining one. For each scenario considered, five distinct trials were performed in order to reduce bias related to dataset sampling and the learning process. Figure 7 reports identification accuracy in terms of Cumulative Match Characteristic (CMC) charts for the CYBHI dataset acquired on hand palms for two scenarios: a) short-term acquisition, using training and test data recorded in the same session; b) long-term acquisition, with the training and test sessions separated by a three-month interval. Figure 8 reports identification accuracy in terms of CMC charts for the UofT dataset acquired on fingers for three scenarios: a) training and test data recorded on 52 subjects in sitting position only, over 5 acquisition sessions; b) training and test data recorded on 52 subjects in 5 positions (sit, stand, exercise, supine, tripod), over 5 acquisition sessions; c) long-term acquisition on 52 subjects in sitting-only position, with the training set acquired in the first 4 sessions and the test set recorded in the 6th session, separated by a two-month interval from the training set. In the case of the data acquired in sitting position only, we also tested the efficiency of an input fusion approach, by merging the 2D representations provided by the S-Transform, GAF, and RP plots as input channels for the same CNN network (we discarded the state-space representation since it performs significantly worse than the other methods; a larger CNN input dimensionality would also have increased the training time).
The importance of the fixed window length around the R peaks was assessed through a set of experiments that considered variable-length segments successively including the QRS complex, and the P and T waves, respectively. More specifically, defining the fixed window centered on an R wave as (R - t_1, R + t_2), QRS-only segments correspond to t_1 = t_2 = 0.08 s, while including the P and T waves requires t_1 = 0.2 s and t_2 = 0.4 s, respectively. Classification performances reported in Figure 9 indicate that the complete waveform should be considered, with neglecting the T wave proving especially detrimental.
From a practical perspective, we would expect a set of successive individual segments of a given subject to be classified first, followed by a final decision given by a majority vote. In Figure 10 we plot the dependence of the identification accuracy on the number of individual segments taken into account, showing a steep performance improvement, exceeding 98.3% for 3 segments and 99.3% for 5 segments, for both datasets. Results reported in Figures 9 and 10 correspond to the use of the S-Transform for constructing the spatial representations, while similar trends are exhibited by GAF and recurrence plots. We also performed experiments following the transfer learning approach, considering pretrained models such as AlexNet, SqueezeNet, GoogleNet, VGG16, MobileNetv2, Inceptionv3, Resnet50, DenseNet201, and Xception (all models available in MATLAB). We progressively froze the layers of the models, starting from the input towards the first fully connected layer in the final classification stage. We noticed that keeping about 70% of the total number of layers frozen yielded optimal results: freezing too few layers renders the networks prone to overfitting, due to the limited dataset dimension compared to the large number of trainable parameters, while freezing all layers up to the first fully connected one forces the models to classify features generated from the natural images in the ImageNet dataset, which differ significantly from those produced by the methods described in the previous sections.
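The majority-vote fusion of per-segment decisions is straightforward; a minimal sketch (tie-breaking in favour of the smallest label id is our assumption, since the paper does not specify it):

```python
import numpy as np

def majority_vote(segment_predictions):
    """Fuse per-segment identity predictions into a single decision by
    majority vote (ties resolved in favour of the smallest label id)."""
    labels, counts = np.unique(segment_predictions, return_counts=True)
    return labels[np.argmax(counts)]
```

With 3 or 5 segments per decision, a single misclassified heartbeat is outvoted, which is the mechanism behind the steep accuracy gains reported in Figure 10.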
Comparative identification accuracy performances are indicated in Figure 11 and show that all pretrained models lag behind the custom-designed network for both datasets. The Resnet50 model performed best, with a marginal 1.5% Rank-1 accuracy degradation, while the maximal difference is about 3% for the SqueezeNet/MobileNetv2 models. These results can be explained by a combination of the two factors above, namely the reduced dimensionality of the datasets and the major differences between the texture-like representations generated by the ST/GAF/recurrence plot/state-space algorithms and the natural images included in the ImageNet database. After performing a paired-samples t-test on the CMC classification results, we concluded that the performance differences between the custom network and the pretrained ones were statistically significant (at the 5% significance level) for all models.
We have tested the efficiency of data augmentation techniques, performing horizontal and vertical translations with up to ±10 pixels, and scale variation in the range [0.9-1.1] on the various 2D representations, but the performances did not improve. Most data augmentation solutions in the literature act directly on successive ECG multi-segment blocks, and include flipping, magnitude scaling, and random permutations of subsegments from the original time series. Since our approach is oriented towards processing individual (scale normalized) segments, those methods are less efficient.
F1-scores follow the same trend as the classification accuracy, yielding best results for the ST algorithm (higher than 0.92 and 0.88 for the UofT sit and CYBHI datasets, respectively). Cohen's Kappa coefficient (measuring the agreement between true and predicted single classifier outputs, or the outputs of two distinct classifiers on the same task, corrected for random chance agreement [42]) was higher than 0.8 for all models and test scenarios.
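For reference, the chance-corrected agreement measure used here follows the standard definition \kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance; a minimal sketch:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement p_o corrected for the agreement
    p_e expected by chance, kappa = (p_o - p_e) / (1 - p_e)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)
    # chance agreement: product of marginal label frequencies, summed over classes
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return (p_o - p_e) / (1.0 - p_e)
```

Values above 0.8 are conventionally read as near-perfect agreement, which is why the threshold is reported for all models and scenarios.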

E. AUTHENTICATION PERFORMANCE ANALYSIS
A separate set of experiments focused on assessing the authentication performances of the spatial representation methods under study. For each genuine user we constructed a number of enrolled templates to be compared with the template of a person claiming a valid/enrolled identity. We further used metric-based matching that relies on computing a dissimilarity score between the enrolled and test templates to be compared against a certain threshold value before accepting or denying the identity claim.
In order to compute the templates we followed the same approach as in [41], [43], using the identification models from the experiments reported in the previous section as feature extractors, following the well-known transfer learning paradigm: after successful training, new ECG segments were applied at the inputs of the CNN models described in Table 2, for which we kept only the weights up to the output of the first fully-connected layer (the final fully connected and softmax layers were removed). As a consequence, we may define the enrollment/testing templates as the corresponding 1600-dimensional vectors. We used the following algorithm: Step 1: define the training, enrollment, and test datasets. We selected separate subjects for the model training and the enrolling/testing phases, respectively (as in [41], [43], we used identical persons in the enrollment and testing sets).
Step 2: training was performed as described in the previous section. As such, we considered two distinct scenarios: scenario S1 corresponds to the identification setup related to Figure 8a, namely training data recorded on 52 subjects from the UofT dataset in sitting position only, over 5 acquisition sessions, with 700 segments per subject. The test set included 200 persons from the UofT set, with 200 segments per subject (those persons were selected such that the quality check procedure based on the NCCC algorithm provided a significant number of segments for each subject). In scenario S2 we employed a cross-dataset procedure, by training the identification models using CYBHI data acquired on hands from 65 persons as reported in Figure 7a (220 segments per subject) and assessing the authentication performances on the same set of 200 persons selected from the UofT data.
Step 3: up to 10 templates were computed and stored for each subject in the enrollment set, using the very first data segments in order to cope with a realistic scenario.
Step 4: in the testing phase, the same subjects as in the enrollment phase were used, except that new ECG segments were applied at the input of the models. Similar to [41], when a test user claims an identity, we first compute the normalized Euclidean distances between each stored template of the given identity and the test template, then select the minimum of those values as the dissimilarity score. This score is further compared to a threshold whose optimal value is set according to the Receiver Operating Characteristic (ROC) curve and the associated Equal Error Rate (EER) performance. The normalized Euclidean distance between vectors u and v is defined as [41]:

d(u, v) = \frac{1}{2} \frac{var(u - v)}{var(u) + var(v)} \quad (7)

and lies in the [0, 1] interval (var denotes the vector variance). In both scenarios, the training and testing procedures were performed five times, in line with the 5-fold cross-validation approach. Figure 12 presents EER values for the two scenarios, according to the number of stored templates for each genuine subject. These correspond to the use of the S-Transform for constructing the spatial representations, while GAF and recurrence plots yield 2-3% larger values (we did not consider the state-space representation since it performs significantly worse than the other methods). The figure reveals a steady decrease of the EER values for both scenarios: for S1 we obtained 9.69% EER when using a single template, 5.48% for 3 templates, 4.86% for 5 templates, and a limit of 4.4% for more than 13 templates. For S2 the results are slightly worse, with 14.1% for one template, 8.61% for 3 templates, 7.62% for 5 templates, and a limit of 7% for more than 13 templates.
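The template-matching step can be sketched as follows, assuming the variance-normalized distance form d(u, v) = var(u - v) / (2 (var(u) + var(v))), which is consistent with the stated [0, 1] range and the role of the vector variance:

```python
import numpy as np

def normalized_euclidean(u, v):
    """Normalized Euclidean distance in [0, 1]:
    d(u, v) = 0.5 * var(u - v) / (var(u) + var(v))."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return 0.5 * np.var(u - v) / (np.var(u) + np.var(v))

def dissimilarity_score(test_template, stored_templates):
    """Score of an identity claim: the minimum distance between the test
    template and the templates enrolled for the claimed identity."""
    return min(normalized_euclidean(test_template, s) for s in stored_templates)
```

The claim is accepted when the score falls below the ROC-derived threshold; identical templates give a score of 0, while uncorrelated ones land near 0.5.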
Comparative performances against other existing solutions are presented in Table 3, showing favorable results for the proposed approach. Figures 7-10 and Table 3 indicate that both identification and authentication performances are comparable to or better than state-of-the-art approaches for the considered datasets, although the comparison is influenced by the use of different databases whose total numbers of subjects vary widely.

Results reported in Figure 7 show that, for the CYBHI hand set, the identification accuracy for the S-Transform, GAF, and recurrence plots lies in the interval 93.7-95%, placing the S-Transform in the top position, while the state-space portrait-based solution lags almost 2% behind. This ordering also holds for the UofT finger datasets, which show a slight improvement for the identification task, in the range 94.8-95.6%, for the sitting-only position experiments using the first three spatial representation methods, and a small degradation of about 1.5% when considering all five postures, due to the higher variability of the waveforms and poorer signal quality.
It is worth noting a steep increase in recognition accuracy from Rank-1 to Rank-5 classification, exceeding 99% in the latter case, with a roughly 3% improvement from Rank-1 to Rank-2 performances for both datasets. Moreover, Figure 8 indicates that input fusion of the various spatial representations yields only a marginal accuracy gain of less than 1%.
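The Rank-k metric reported above counts a test segment as correct when its true identity is among the k highest-scoring classes. A minimal sketch (the function name is ours):

```python
import numpy as np

def rank_k_accuracy(scores, labels, k):
    """Fraction of test segments whose true identity is among the k
    highest-scoring classes; Rank-1 coincides with top-1 accuracy.

    scores: (n_segments, n_classes) array of classifier scores
    labels: (n_segments,) array of true identity indices
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]  # k best classes per segment
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return float(np.mean(hits))
```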
In the case of phase portraits, we also tested the effect of the embedding dimension on the identification performances. We successively considered m = 2, 3, and 4, yielding one, three, and six channels, respectively, as inputs to the CNN (multiple channels were obtained from combinations of pairs of delayed ECG segments). Only marginal improvements of about 1% were obtained with three channels versus single-channel inputs, and less than 1.5% with six channels.
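The channel construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the paper's exact rendering parameters (image size, delay value, intensity encoding) are not restated here, and rendering each delay-pair trajectory as a binary occupancy image is our assumption.

```python
import numpy as np
from itertools import combinations

def phase_portrait_channels(x, m, tau, bins=64):
    """Build 2D phase-portrait images from a 1D ECG segment x.

    Delay embedding with dimension m and lag tau gives m delayed copies of
    the signal; each unordered pair of copies is rendered as one image
    channel, so m = 2, 3, 4 yield 1, 3, and 6 channels, respectively.
    """
    n = len(x) - (m - 1) * tau
    delayed = [x[i * tau : i * tau + n] for i in range(m)]
    channels = []
    for a, b in combinations(range(m), 2):
        # 2D histogram of the trajectory, thresholded to a binary occupancy map
        img, _, _ = np.histogram2d(delayed[a], delayed[b], bins=bins,
                                   range=[[x.min(), x.max()]] * 2)
        channels.append((img > 0).astype(np.float32))
    return np.stack(channels, axis=0)  # shape: (n_channels, bins, bins)
```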
Since most existing approaches use 5 to 10 successive ECG segments as inputs to 1D/2D CNN or recurrent models, a fair comparison against our results should consider the performances indicated in Figure 10 (S-Transform representation), showing more than 99.3% accuracy on identification tasks.
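Multi-segment decisions of this kind are typically obtained by fusing the per-segment classifier outputs. A minimal sketch of one common rule, score averaging (the specific fusion rule is our assumption, not a restatement of any particular cited method):

```python
import numpy as np

def fuse_segments(segment_probs):
    """Predict a single identity from several consecutive ECG segments by
    averaging the per-segment class-probability vectors (score-level
    fusion; majority voting over per-segment decisions is an alternative).

    segment_probs: (n_segments, n_classes) array of softmax outputs
    """
    return int(np.argmax(np.mean(segment_probs, axis=0)))
```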
As stated before, the phase-portrait representation performs significantly worse than the other spatial transformation methods when coupled with a CNN classifier. This is a consequence of the difficulties classical convolutional networks face in handling missing/sparse input data [51], which have been tackled by various techniques, including defining validity masks or performing sparse convolutions on 2D or 3D data [52], [53].
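To make the validity-mask idea concrete, the following is a simplified single-channel NumPy sketch of a renormalized ("partial") convolution in the spirit of the techniques cited above; it is not the authors' implementation, and a practical version would be a vectorized GPU layer.

```python
import numpy as np

def partial_conv2d(x, mask, w):
    """Correlation-form convolution where pixels with mask == 0 are treated
    as missing: each output is renormalized by the fraction of valid pixels
    under the kernel, and the mask marks windows with at least one valid pixel."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    new_mask = np.zeros_like(out)
    for i in range(out_h):
        for j in range(out_w):
            m = mask[i:i + kh, j:j + kw]
            valid = m.sum()
            if valid > 0:
                patch = x[i:i + kh, j:j + kw]
                out[i, j] = (w * patch * m).sum() * (kh * kw / valid)
                new_mask[i, j] = 1.0
    return out, new_mask
```

With an all-ones mask this reduces to an ordinary convolution, while sparse inputs such as phase portraits only contribute through their valid pixels.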
Significantly worse results were obtained in the long-term scenario for both datasets, i.e., for a large time lapse between the acquisition of the training and test sets, confirming the findings reported in [45]. Identification performances lag by more than 25% in both cases, indicating that the temporal variability of ECG signals may adversely affect biometric applications and calling for extended dataset acquisition under diversified health-status and activity-pattern scenarios.
Training sessions were conducted on a computer with an Intel i7 CPU, 16 GB RAM, and an NVIDIA Titan Xp GPU running the cuDNN library, taking less than 90 seconds per training epoch.

V. CONCLUSION
Building upon the remarkable classification performances of convolutional neural networks, encoding ECG time series as images brings a series of advantages, including extending the range of texture features available for classification, possible identification of regions of interest in the resulting 2D representation, and proper use of the transfer-learning principle.
The present paper showed that combining various time series-to-image transformations with CNN-based classification represents a viable alternative to existing ECG-oriented biometric applications, although at a higher computational cost. The reported results indicate that both identification and verification performances compare favorably with state-of-the-art approaches in the case of off-person acquired signals. The S-Transform, Gramian Angular Field, and recurrence plots yield comparable performances, while the method based on the state-space representation lags behind. In the latter case, alternative CNN architectures and learning algorithms better suited to the sparse characteristics of the resulting images may significantly improve the performances. Overall, additional experiments using larger datasets acquired from many subjects over long time intervals, under diverse activity scenarios and health statuses, are needed in order to fully validate the proposed approach, which may also prove useful for pathology-oriented (arrhythmia) ECG classification. Further work will aim at designing an end-to-end deep learning architecture that learns optimal 1D-to-2D representations (instead of using predefined ones) to be further classified.

He is the author of two books, three book chapters, and more than 50 articles. His research interests include deep learning applications, pattern recognition, biometrics, and biomedical signal processing. Dr. Ciocoiu is a member of the IEEE Computational Intelligence Society, and the IEEE Romania Section Membership Development, East Zone Coordinator.

He is currently an Associate Professor with the Faculty of Electronics, Telecommunications, and Information Technology, ''Gheorghe Asachi'' Technical University of Iasi. He is the author of one book and more than 20 articles. His research interests include digital signal processing, optimization problems, and neural networks.