Ensemble Networks for User Recognition in Various Situations Based on Electrocardiogram

Research on electrocardiogram (ECG) signals has been actively undertaken to assess their value as a next generation user recognition technology, because they require no stimulation and are robust against forgery and modification. However, even within the same user, the heart rate and waveform of ECG signals vary depending on physical activity, mental effects, and measurement time. Consequently, when data acquired across changes in the user state is used as registered data, an overfitting problem occurs due to data generalization, which degrades the recognition performance for newly acquired data. In this paper, we therefore propose parallel ensemble networks to solve the overfitting problem and prevent this degradation. First, ECG signals acquired in various environments are used as the input data of parallel convolutional neural networks (CNNs). Each CNN is set up with different parameters to detect different features. The ECG signals output by each network are classified for each subject and then fused into one database to be used as registered data for re-training. Instead of fusing all the output signals from every network, only the ECG signals of the Top-3 networks, i.e., those showing the best performance, are fused to compose the registered data. The reconstructed registered data are used for user recognition by re-training with time independent comparison data in the CNN. The experimental results of comparing the proposed parallel ensemble networks with previous studies, using the self-acquired actual ECG signals, show that the proposed method achieves higher recognition performance than the previous studies, with an accuracy rate of 98.5%.


I. INTRODUCTION
With the rapid development of the information society and the subsequent improvements to various systems and devices, research on the identification of individuals has been actively conducted, and is being applied in real life scenarios. Among the conventional user recognition methods currently being implemented, there are (i) knowledge based user recognition methods such as passwords, (ii) ownership based user recognition methods such as one time passwords (OTPs), and (iii) biometric based user recognition methods such as those using face and fingerprint information [1]. Knowledge and ownership based user recognition methods recognize users by verifying their knowledge or possessions and are cheaper than other recognition methods. However, these methods incur additional administrative costs and are vulnerable to security threats when the user forgets their password or loses an OTP or ID card [2], [3].
A biometric user recognition method recognizes the user using information about body features, such as a fingerprint, iris, face, or voice, or user behavior pattern features such as gait and signature. Biometric based user recognition technologies using fingerprints and faces remove the risks associated with loss or theft. They have already been applied in real life, owing to low consumer resistance. However, they utilize anatomical and physical information that is displayed externally and is therefore vulnerable to forgery and modification. Moreover, they have the disadvantage that the user must participate in a face to face manner, which requires their cooperation. Unlike general passwords, biometrics are also difficult to change once they are leaked, and additional information such as race and medical history can be extracted and collected without the user's consent. As a result, social issues such as the financial losses accompanying the leakage of users' personal information are a persistent risk.
Biosignal based user recognition methods using electrocardiography have recently been researched as a potential user recognition technology, because they use the uniquely characterizable signals generated within the body. They meet little consumer resistance and are thus being researched as a next generation user recognition method. The electrocardiogram (ECG) signal is a biosignal with characteristics unique to the individual; it is generated autonomously according to factors such as the position, size, and structure of the heart, as well as the user's age and gender [4]. The ECG signal is mainly used in clinical diagnoses, yet user recognition technology using the ECG signal was first proposed in 2001. Since then, research on user recognition algorithms and on the number of measurement channels and statuses of the ECG signal has been conducted and shows the possibility of user recognition using the ECG signal [5], [6].
However, most previous studies on ECG based user recognition methods applied recognition algorithms using only data from the same user state, in order to minimize the effects of heart rate and waveform changes, or using data measured only once per day in order to use only time independent ECG signals [7], [8]. Yet the heart rate and waveform of the ECG signal change depending on the physical activity, measurement time, and mental state of the individual, resulting in sections where the waveform changes significantly and sections where it changes relatively little. As shown in Figure 1, when comparing the waveforms of ECG signals before and after exercise, the P and T waves are generated closer to the QRS section. That is, using the ECG signals acquired across a change in user state as comparison data, instead of using ECG signals acquired in the same environment, is likely to cause a degradation of user recognition performance. In addition, when data acquired across changes in the user state is inputted into a single deep learning based network, data generalization over the large amount of data causes an overfitting problem that degrades the recognition performance for new data. Therefore, there is a demand for research on how to achieve high user recognition performance in real life environments, through a recognition algorithm that can use data acquired across changes in the user state.
In this paper, we propose a parallel ensemble network structure in order to process the data acquired before and after exercise, and in the lying and standing postures, which cause heart rate and waveform changes. The ensemble method combines different models to improve performance. In this paper, we design multiple single network models not only to realize a performance improvement but also to avoid the overfitting problem, and then construct the data output from each network into one registered dataset. By retraining the registered data constructed from the outputs of each network, using a CNN designed as a single network of the same structure, we verify the user recognition performance on comparison data.
The structure of this paper is as follows. Following the introduction of Section 1, Section 2 presents conventional user recognition methods which use the ECG signal. Section 3 describes in detail the proposed user recognition method, based on a parallel ensemble CNN using the ECG signal. Section 4 analyzes the results of its user recognition performance measurements. Section 5 concludes the paper and suggests future research directions.

II. RELATED WORKS
In order to solve the problems of certain previous studies, research into applying the ECG signal to deep learning methods has recently been undertaken. Deep learning outputs the optimal values from deeper layers by adding various types of layers, starting with the multilayer perceptron of conventional machine learning. Typically, deep learning is achieved through a variety of networks, such as a deep neural network (DNN) [9], a convolutional neural network (CNN) [10], a recurrent neural network (RNN) [11], and a deep belief network (DBN) [12].
One of the most prevalent methods in image processing is the CNN, which is designed to automatically learn and classify various feature extraction filters. In conventional feature detection methods, the features are extracted by humans and only the classification procedure is performed by machine learning; the CNN, however, automatically performs both feature detection and classification.
As a conventional method of applying the ECG signal to deep learning, Ubeyli proposed a method of detecting arrhythmia with the RNN, using eigenvector based feature extraction. Experimental results showed that this model has an average accuracy of 98.06% for four different types of arrhythmia [13]. Zubair et al. designed a nine layer CNN with an accuracy of 92.7% [14]. Zhai et al. transformed ECG waveforms into 2D images by duplicating them, and applied these to the CNN, which subsequently showed an accuracy of 98.6% and 97.5% for the specific waveform detection [15]. Acharya et al. designed a nine layer CNN with an accuracy of 94.03% and 93.49% for waveforms with and without noise, respectively [16].
Kiranyaz et al. applied a 1D CNN to ECG arrhythmia classification. Unlike the method of applying the CNN to 2D ECG images, Kiranyaz's method delivered excellent performance results by applying the CNN to 1D ECG signals [17]. Rajpurkar et al. proposed a 1D CNN model that used a deeper network and a larger dataset than Kiranyaz's CNN model. However, detection performance was low in spite of the use of more ECG data [18]. In the case of the ECG signal used in that experiment, the deeper network did not result in any performance improvement, as the ECG signal used as the input was still 1D even though the size of the dataset increased.
The user recognition methods using an existing single network show a fast learning speed when processing large quantities of data, yet have a relatively low accuracy due to an overfitting problem on the training data. In order to make up for this shortcoming, research on user recognition via ensemble structures combining various methods has been conducted. The ensemble model is a recognition model that combines output features and results by designing multiple recognition models rather than using a single one; it shows higher recognition performance than a single model. Xiao et al. initially used various machine learning methods such as KNN, SVM, random forest, decision trees, and gradient boosting decision trees as single classifiers, and then fused the classified results into a single training dataset. Next, they processed the fused training data using a single DNN to classify it. Experimental data was used to determine normality and abnormality for lung, stomach, and breast cancers. Experimental results showed that the proposed ensemble model achieved accuracies of 98.8%, 98.78%, and 98.41% for lung, stomach, and breast cancer data, respectively, an improvement over using a single classifier as the machine learning method [19]. Fan et al. proposed a two layer multi scale CNN model (shown in Figure 2) to classify normal signals and abnormal arrhythmia signals using the ECG signal. Its structure was designed with different filter sizes for each layer, to detect features of different scales, and they applied a database of the ECG signal sampled at 20 s intervals. Experimental results showed that the multi scale CNN model they proposed achieved a classification result of 98.13%, an improvement upon the results of 89.58% and 98.03% achieved when a single network and the VGGNet were applied, respectively [20].
Liu et al. proposed an ensemble network that combines multiple CNNs and a single RNN to detect myocardial infarction signals using 12 lead ECG signals. They designed multiple independent CNNs to receive different signals from the 12 leads. The features outputted from each network were used as the input data of the single RNN. They also solved the overfitting problem between the multiple CNNs and the single RNN by applying the lead random mask (LRM). The LRM solves a generalization problem that occurs when processing large amounts of data in training by using randomly selected leads, in the same way as dropout. Experimental results showed that the detection performance for myocardial infarction and normal signals achieved a detection rate of 99.9%, and the recognition rate of experimental subjects was 93.08% [21].
Oh et al. proposed an ensemble network model using a single CNN and single RNN to diagnose five types of arrhythmia. The single CNN was designed to enable a classifier to extract spatial features, and arrhythmia signals were detected by applying the detected features to the single RNN, which receives data according to temporal information. The results of applying various public databases of arrhythmia signals showed a high detection rate of 98.1% [22].
As described above, ECG based user recognition has recently gained attention as a next generation user recognition method that can effectively replace conventional recognition methods, owing to its high accuracy. However, although the ECG signal changes due to various factors such as environment, time, and user state, the previous research on ECG signals was conducted using data acquired in the same environment and time period. That is, the previous ECG based user recognition methods do not consider changes in the user state and environment, and thus there is a lack of research verifying the reproducibility of the results. Therefore, in this paper, we propose parallel ensemble networks to enable user recognition using ECG signals acquired across changes in the user state, to make up for the problems of conventional user recognition methods.

III. ENSEMBLE NETWORKS FOR USER RECOGNITION IN VARIOUS SITUATIONS
This section describes the ensemble networks that can apply the data acquired across changes in the user state to a user recognition method employing the ECG signal in various situations. First, the registered data uses the ECG signals acquired across user state changes as input data. Next, preprocessing, such as the removal of noise occurring in the process of acquiring the ECG signal and signal segmentation, is performed. Since the noise occurring in the ECG signal appears in different frequency bands depending on its source, it is removed by using multiple filters, such as a high pass filter (HPF) and a band reject filter. The final step of the preprocessing is the process of segmenting the ECG signal using the Pan and Tompkins method. Lastly, the data loss and overfitting problems occurring in a single network are remedied by designing a parallel ensemble CNN, as proposed in this paper.

A. PREPROCESSING FOR NOISE REMOVAL AND ECG SEGMENTATION
Various types of electrical signals generated by the human body contain important information relating to physical activities. Among these electrical signals, the noise occurring in the ECG signal is unnecessary to the process of recognizing individuals. In other words, noise removal is essential to prevent the propagation of false information.
As regards the types of noise that occur while acquiring the ECG signal, there are (i) power line interference arising from the equipment used for acquiring the ECG signal, (ii) motion artifacts caused by subjects' movement, (iii) muscle contraction noise caused by irregular muscle activity, and (iv) baseline drift caused by breathing [23]. The HPF used in this paper removes signals with frequencies below a certain threshold and passes those frequencies above it. The degree of reduction for each frequency depends on the design and parameters of the filter. By using the HPF with a cutoff frequency of 0.5 Hz, we remove baseline drift in the low frequency band.
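As a hedged illustration of this step, the baseline drift removal described above can be sketched with a zero-phase Butterworth high pass filter. The 0.5 Hz cutoff follows the text; the filter order, function name, and synthetic test signal are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def remove_baseline_drift(ecg, fs=360.0, cutoff=0.5, order=4):
    """Remove low frequency baseline drift with a zero-phase
    Butterworth high pass filter (0.5 Hz cutoff, as in the text)."""
    b, a = butter(order, cutoff / (fs / 2.0), btype="highpass")
    return filtfilt(b, a, ecg)

# Synthetic example: a 1 Hz "beat" riding on slow 0.2 Hz drift.
fs = 360.0
t = np.arange(0, 10, 1 / fs)
drift = 0.5 * np.sin(2 * np.pi * 0.2 * t)   # respiration-like drift
beat = np.sin(2 * np.pi * 1.0 * t)          # stand-in for the heartbeat
clean = remove_baseline_drift(beat + drift, fs=fs)
```

Away from the signal edges, the filtered output closely follows the drift-free beat, since the 0.2 Hz drift lies well below the cutoff while the 1 Hz component passes through nearly unchanged.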
Power line interference is noise caused by improper grounding, or proximity to high voltage or transformer lines; it is generally observed at 60 Hz. Power line interference affects the P and T waves, which have a smaller amplitude than other waveforms, when analyzing the ECG signal, causing errors in the diagnosis of arrhythmia and myocardial infarction, and it also degrades performance in measuring the QRS or QT sections by distorting the ECG signal. To remove this power line interference, we applied a band reject filter (i.e., notch filter) for the 60 Hz band. Lastly, we perform R wave peak detection using the Pan and Tompkins method, to segment the ECG signal after the noise removal process. The Pan and Tompkins method can typically be divided into a preprocessing stage and a QRS section detection stage using an adaptive dual threshold. In this paper, we perform the preprocessing of the Pan and Tompkins method using the noise removal methods described above, i.e., the HPF and the band reject filter. After calculating the peaks of the entire ECG signal that has gone through the preprocessing stages, the QRS section is detected by applying the dual threshold. In order to detect the Q wave peak, we calculate the signal difference within a certain range to the left hand side of the R wave peak, and look for the point that shows the maximum difference from the peak of the R wave. This point is detected as the peak of the Q wave. Similarly, the peak of the S wave is detected by defining a certain distance to the right hand side of the R wave peak as the range and finding the point within it that exhibits the largest difference from the peak of the R wave. Figure 3 shows the results of applying the multiple noise removal filters to the ECG signal used in this study, along with the results of signal segmentation.
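The 60 Hz notch filtering and the Q/S search around a detected R peak might be sketched as follows. This is a simplified stand-in for the full Pan and Tompkins pipeline: the window size, the notch Q factor, and the use of a minimum search in place of the maximum-difference criterion are assumptions for illustration.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def remove_powerline(ecg, fs=360.0, f0=60.0, q=30.0):
    """Suppress 60 Hz power line interference with a band reject
    (notch) filter, applied with zero phase via filtfilt."""
    b, a = iirnotch(f0, q, fs)
    return filtfilt(b, a, ecg)

def detect_qs(ecg, r_peak, search=20):
    """Toy Q/S localization around a known R peak: Q is taken as the
    minimum in a small window left of R, S as the minimum right of R
    (a simplification of the maximum-difference criterion in the text)."""
    q_idx = r_peak - search + int(np.argmin(ecg[r_peak - search:r_peak]))
    s_idx = r_peak + 1 + int(np.argmin(ecg[r_peak + 1:r_peak + 1 + search]))
    return q_idx, s_idx

# Notch demo: a pure 60 Hz tone is almost entirely suppressed.
fs = 360.0
t = np.arange(0, 5, 1 / fs)
hum = np.sin(2 * np.pi * 60.0 * t)
suppressed = remove_powerline(hum, fs=fs)

# Q/S demo on a synthetic beat: dips at samples 95 and 105 around R at 100.
beat = np.zeros(200)
beat[95], beat[100], beat[105] = -0.2, 1.0, -0.3
q_idx, s_idx = detect_qs(beat, r_peak=100)
```

On this synthetic beat the search recovers the Q and S positions at samples 95 and 105, while the notch output is near zero away from the signal edges.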

B. PARALLEL ARCHITECTURE BASED ENSEMBLE NETWORKS
The ECG signal is time series data in which heart rate and waveform changes can be caused by an individual's physical activity, measurement time, and mental effects. Even within the same user, the heart rate and waveform of the ECG signal acquired in daily activities will vary depending on his/her state and environment. Therefore, when data acquired across changes in the user state is inputted to a single deep learning based network, an overfitting problem may occur due to data generalization over the large amount of data, resulting in high recognition performance only on the registered data. In addition, when designing very deep public networks such as GoogLeNet, VGGNet, and AlexNet, recognition performance is degraded due to data loss during the training process of repeated feed forward and back propagation passes, and the number of errors increases proportionally with the number of parameters.
The recently introduced ResNet and DenseNet were proposed to solve the problem of the initial characteristics of images being lost at the final output stage as the CNN model becomes deeper. However, the ECG signal consists of simple waveforms, unlike general image data, which features complex patterns [24], [25]. Therefore, in this paper, we propose a parallel ensemble network structure using data acquired across changes in the user state (see Figure 4). First, as shown in Figure 4-(a), the data acquired across changes in the user state and environment is constructed into the registration database and used as input data for parallel 1D single CNNs. The structure of each 1D single network consists of a convolutional layer that detects unique features of the ECG signal and transforms them into a feature map through a convolution operation, a pooling layer that reduces the data size, and a dense layer that sets the input and output sizes. The pooling layer reduces the amount of calculation required, by decreasing the size of the data outputted from the convolutional layer, and also enables the robust extraction of features. The pooling layer uses maximum pooling, which selects the highest value in each window as the feature map is down scaled; this preserves the stronger activations of the convolutional layer's output, in which the result of the ReLU activation function is frequently '0'.
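The convolution, ReLU, and maximum pooling operations described above can be illustrated with a minimal numpy sketch; the kernel and input below are toy values, not the trained filters of the proposed network.

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1D convolution (cross-correlation, as in CNN frameworks)
    followed by the ReLU activation, as in a convolutional layer."""
    k = len(kernel)
    n = (len(x) - k) // stride + 1
    out = np.array([np.dot(x[i * stride:i * stride + k], kernel)
                    for i in range(n)])
    return np.maximum(out, 0.0)  # ReLU clips negative responses to 0

def max_pool1d(x, size=2):
    """Maximum pooling: keep the largest value in each window,
    shrinking the feature map and the downstream computation."""
    n = len(x) // size
    return np.array([x[i * size:(i + 1) * size].max() for i in range(n)])

x = np.array([0., 1., 3., 1., 0., -1., -3., -1.])  # toy input segment
k = np.array([1., 0., -1.])                        # toy edge-detecting kernel
fmap = conv1d(x, k)        # length 6; negatives removed by ReLU
pooled = max_pool1d(fmap)  # length 3; strongest activation per window
```

Because ReLU zeroes many responses, maximum pooling keeps the strongest remaining activation in each window, which is the behavior the paragraph above describes.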
An ensemble method improves performance by combining different models. Each 1D single network is configured with different parameters so as to detect different features, and the parameters are set as shown in Table 1. The number of training iterations was set to 500 and 750, and the batch size, which is the number of data used for one training step, was set to 256 and 512. The dropout rate, which reduces the calculation time and the amount of calculation required by omitting part of the network, was set between 50% and 70%, and the learning rate was set to 0.001, as is common practice.
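The per-network parameter settings described above can be enumerated as a simple grid. The values below follow the ranges quoted in the text; the exact per-network combinations of Table 1 are not reproduced here, so this grid is an illustrative assumption.

```python
from itertools import product

# Ranges quoted in the text; Table 1's exact assignments may differ.
iterations = [500, 750]
batch_sizes = [256, 512]
dropout_rates = [0.5, 0.6, 0.7]
learning_rate = 0.001  # fixed across every network

# Each config parameterizes one single 1D CNN in the parallel ensemble.
configs = [
    {"iterations": it, "batch_size": bs, "dropout": dr, "lr": learning_rate}
    for it, bs, dr in product(iterations, batch_sizes, dropout_rates)
]
```

Enumerating the settings this way makes it easy to instantiate each single network with a distinct configuration so that the ensemble members learn different features.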
Next, as shown in Figure 4-(b), the ECG signals outputted from each network are fused into one database and used as the registered data for retraining. However, when the entire ECG signal output of every network is used as the registered data, recognition performance is degraded, because the registered data then includes low quality recognition results caused by improper parameter choices and network design. Accordingly, instead of using all the output data, we fuse the result data of the Top-3 networks, which show excellent performance, and then construct the registered data. Lastly, we use the reconstructed registered data to perform user recognition by retraining with time independent comparison data in a single CNN.
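The Top-3 selection and fusion step might be sketched as follows. Ranking the networks by validation accuracy and concatenating their outputs along the sample axis are assumptions about how the fusion is realized, as the text does not fix these details.

```python
import numpy as np

def fuse_top_k(outputs, val_accuracies, k=3):
    """Keep only the outputs of the k best-performing networks
    (ranked by validation accuracy) and fuse them into one
    registered dataset, as in the Top-3 selection described above."""
    order = np.argsort(val_accuracies)[::-1][:k]   # best k network indices
    fused = np.concatenate([outputs[i] for i in order], axis=0)
    return fused, sorted(int(i) for i in order)

# Toy example: 5 networks, each emitting 4 feature vectors of length 2.
rng = np.random.default_rng(0)
outputs = [rng.normal(size=(4, 2)) for _ in range(5)]
accs = [0.91, 0.97, 0.95, 0.88, 0.96]
registered, kept = fuse_top_k(outputs, accs, k=3)
```

Here networks 1, 4, and 2 (accuracies 0.97, 0.96, 0.95) are kept, and their outputs form the reconstructed registered data used for retraining.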

IV. EXPERIMENTAL RESULTS

A. ECG DATABASE
To evaluate the performance of the parallel ensemble networks proposed in this paper, we used the MIT-BIH arrhythmia database, a public database provided by PhysioNet. The MIT-BIH arrhythmia database contains two channel ECG recordings of 30 min, acquired from 47 patients aged between 20 and 90 years and studied in the BIH arrhythmia laboratory. The measurement data consists of 23 recordings randomly selected from a set of about 4,000 ambulatory recordings collected from inpatients and outpatients; the remaining recordings were selected to include clinically significant arrhythmias. The signals were band pass filtered at 0.1∼100 Hz and digitized at 360 Hz with 11-bit resolution, and beat type annotations and reference point times are provided [26]. In this paper, we regarded arrhythmia signals as well as normal ECG signals as patients' intrinsic ECG signals and used them as user recognition data rather than for detecting arrhythmia. We set five cycles of the ECG signal as one datum. The total amount of data per individual was 180, consisting of 94 training data, 50 validation data, and 36 comparison data.
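The per-subject 94/50/36 split quoted above can be sketched as a simple random partition; the shuffling and the seed are illustrative assumptions, as the text does not state how the split was drawn.

```python
import numpy as np

def split_subject(data, n_train=94, n_val=50, n_cmp=36, seed=0):
    """Split one subject's 180 five-cycle segments into training,
    validation, and comparison sets (94/50/36, as in the text)."""
    assert len(data) == n_train + n_val + n_cmp
    idx = np.random.default_rng(seed).permutation(len(data))
    return (data[idx[:n_train]],
            data[idx[n_train:n_train + n_val]],
            data[idx[n_train + n_val:]])

subject = np.arange(180).reshape(180, 1)  # placeholder ECG segments
train, val, cmp_ = split_subject(subject)
```

The three subsets are disjoint and together cover all 180 segments, so no datum appears in more than one role.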
Additionally, in this study, so as to analyze changes in the user state and the variation of ECG signals across various environments, and to apply them to user recognition, we acquired ECG signals ourselves from 89 adults aged between 20 and 60 years. A survey was prepared to check user states. The instrument used for measurement was the BIOPAC MP160 model, and lead-1 ECG signals were acquired using wet electrodes. Lead-1 ECG signals are recorded from the electric potential difference between two electrodes attached to the left and right hands and are used alongside other limb lead methods. Measurements were taken over the course of one year, in order to acquire time independent ECG signals, and the ECG signals were measured by defining four situations that could easily change ECG signals in real life: lying, standing, and before and after exercise. Although previous studies placed restrictions on food or the intake of substances such as caffeine, which could affect ECG signals, this study did not impose these restraints. Depending on the subject's schedule, ECG signals were measured three times at a 2,000 Hz sampling rate across different days (i.e., multi-session).

B. PERFORMANCE ANALYSIS OF USER RECOGNITION USING ENSEMBLE NETWORKS
In this experiment, we demonstrate the effectiveness of the proposed ensemble networks by comparing their performance with the results of previous studies on user recognition using the MIT-BIH arrhythmia data, a public database. In addition, the user recognition performance results are analyzed by constructing comparison data with various combinations per cycle, using the virtual signals produced from the ECG signals self-acquired across changes in the user state and environment. The ECG signals used in the experiment were first segmented into single cycles, and each datum was then made to consist of 5 consecutive cycles. First, we conducted a comparative experiment by applying the MIT-BIH arrhythmia data to a single network used in the proposed parallel ensemble networks. The number of data used in the experiment (per individual) was 180, consisting of 94 training data, 50 validation data, and 36 comparison data. A total of 8,460 data were constructed from all 47 subjects.
Experimental results showed that the proposed parallel ensemble networks achieved a recognition accuracy of 99.6%, an improvement over the performance of using only a single network. The user recognition results of using single networks with different parameters showed more than 97% recognition accuracy in all single networks, which is higher than or similar to the performance of the previous studies, as shown in Table 2. This is the result of applying the public database consisting of training, validation, and comparison data, which was measured once in the same state so as to minimize the effects of heart rate and waveform changes. That is, it can be interpreted as a result of recognizing the environment at the time of data acquisition, not as a result of recognizing the subjects' intrinsic characteristics. Therefore, in this paper, we conducted a comparative experiment between the performances of a public network and the proposed parallel ensemble networks. The public network used in the experiment was the VGG network, which showed high recognition performance in the 2014 ImageNet large scale visual recognition challenge (ILSVRC). It is designed in a simple way compared to other public networks, so it is easy to modify.
As shown in Figure 6, the MIT-BIH arrhythmia data was applied to the VGG network and the proposed ensemble networks, both of which showed high recognition performance. In particular, the VGG network, which is composed of deeper neural networks than the proposed ensemble networks, showed high recognition performance and lower error rates despite the small amount of training data. However, as shown in Figure 8, there was an overfitting problem in that the error rate increased as the learning progressed, and when early termination of the learning process was employed to avoid overfitting, high recognition performance could not be achieved. On the other hand, the proposed ensemble networks required more training than the VGG network to achieve a high recognition performance and reduce the error rate. However, we confirmed the effectiveness of the proposed parallel ensemble networks, which produced a stable learning process by gradually converging to a high recognition rate and low error rate.
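The early termination of learning mentioned above can be sketched as patience-based early stopping on the validation error; the patience value and the error sequence below are illustrative assumptions, not the paper's actual training curves.

```python
def early_stop_epoch(val_errors, patience=3):
    """Return the epoch at which to stop training: learning halts once
    the validation error has failed to improve for `patience` epochs.
    A minimal stand-in for the early termination discussed above."""
    best = float("inf")
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, waited = err, 0       # new best: reset the counter
        else:
            waited += 1                 # no improvement this epoch
            if waited >= patience:
                return epoch            # stop before overfitting worsens
    return len(val_errors) - 1          # never triggered: run to the end

# Validation error improves, then rises (overfitting): stop early.
errs = [0.9, 0.5, 0.3, 0.28, 0.31, 0.33, 0.35, 0.4]
stop = early_stop_epoch(errs, patience=3)
```

On this sequence the best error occurs at epoch 3, and after three non-improving epochs training stops at epoch 6, before the rising error tail.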
Lastly, in order to verify the parallel structure ensemble networks proposed in this paper, we conducted a comparative performance experiment on the self-acquired actual ECG signals, between a parallel 1D CNN method and a single CNN method using two dimensional (2D) ECG images. The single CNN using 2D ECG images, proposed by Jun et al. (2018), applies 1D ECG signals to a single CNN by converting them into 2D images. Its convolutional neural network was designed with six convolutional layers and three maximum pooling layers, with batch normalization applied to each layer. The parallel 1D CNN proposed by Zhang et al. (2018) used wavelet transforms of the input ECG signals, and then extracted different features through autocorrelation coefficients. The extracted features were used as the input data of each single network, and user recognition was performed by fusing each feature in a fully connected layer. This convolutional neural network was designed with five convolutional layers, four maximum pooling layers, and two fully connected layers; since the number of parallel networks was not specified, six networks of identical structure were designed, following their study. Figure 9 shows the results of comparing our recognition performance with those of the previous studies. The single CNN method using 2D ECG images and the parallel 1D CNN method showed recognition accuracies of 94.4% and 95.7%, respectively. The parallel ensemble CNN proposed in this paper showed a recognition accuracy of 98.5%, which is higher than the previous studies. The 2D ECG image method used only ECG waveform information as features and was designed as a single CNN; it therefore showed the lowest recognition performance, because it did not recognize the ECG signals acquired across changes in the user state.
The parallel 1D CNN method showed lower performance than the ensemble CNN proposed in this paper, as a difference in the output characteristics according to the wavelet transform and user state dependent autocorrelation coefficient can occur even within the same subject.

V. CONCLUSION
Most previous studies on ECG based user recognition applied recognition algorithms using only data from the same user state, in order to minimize the effects of heart rate and waveform changes, or employing data measured only once a day so as to use only time independent ECG data. However, the heart rate and waveform of the ECG signal change depending on an individual's physical activity, measurement time, and mental effects, resulting in a degradation of user recognition performance.
Therefore, in order to solve the overfitting problem, we conducted user recognition experiments on the methods of previous studies, i.e., a single CNN method using 2D ECG images and a parallel 1D CNN method, and on the parallel ensemble networks proposed in this paper, using the self-acquired actual ECG signals. Regarding the experimental data, the first ECG signals acquired in the lying posture, standing posture, and before and after exercise were used as registered data; the second signals acquired were used as validation data; and the third signals acquired were used as comparison data. The experimental results show that the single CNN method using 2D ECG images and the parallel 1D CNN method achieved recognition accuracies of 94.4% and 95.7%, respectively. The parallel ensemble CNN proposed in this paper showed higher recognition performance than the previous studies, with a recognition accuracy of 98.5%. In conclusion, through the parallel ensemble networks proposed in this paper, we solved the overfitting problem caused by employing ECG signals acquired across changes in the user state as the comparison data.
We acquired the ECG signals using the MP160, a piece of measurement equipment from BIOPAC Systems, and applied them to the user recognition method. However, in order to apply the method in real life, ECG signals acquired from wearable devices such as smartphones and smartwatches will need to be used, as these make ECG measurement easily accessible. There remains a need for further research on realistic approaches, such as the removal of the noise caused by such measurements.