Robust Multimodal Heartbeat Detection Using Hybrid Neural Networks

Many arrhythmia datasets are multimodal due to the simultaneous collection of physiological signals of a subject. These datasets frequently have missing modalities or missing block-wise data, a characteristic that various recent applications of neural networks fail to consider. Most arrhythmia detection models only use electrocardiogram and blood pressure recordings. The physiological signals left unconsidered may be strongly correlated with other modalities despite having missing data. To improve robustness and accuracy of heartbeat detection, all available modalities should be considered in multimodal arrhythmia datasets. Several hybrid neural networks are proposed to robustly analyze heartbeats by considering every available physiological signal. These networks combine elements from convolutional neural networks, recurrent neural networks, and the EmbraceNet deep learning architecture. This enables researchers to analyze every signal of each subject even when the set of signals collected differs among subjects. The proposed hybrid neural networks provide more robust results in heartbeat detection when modalities have missing data.


I. INTRODUCTION
Biological events can be documented by multiple signals, or modalities. If multiple modalities are recorded for an event, then the existing multimodal data may reveal characteristics that each modality might not independently uncover. In recent years, much attention has been placed on analyzing multimodal data through machine learning [1].
Many multiparameter datasets have incomplete information in the form of missing modalities and missing block-wise data. This can occur when modalities are recording data at different times or when sensors fail to record data under certain conditions. When some modalities have more complete data than others, an asymmetric multimodal dataset is obtained. Many recent methods fail to obtain accurate detection results from these asymmetric datasets, since they do not account for the missing data.
Multimodal medical imaging datasets frequently have missing data. Physicians may not record every physiological signal for every patient. Some subjects may have respiratory rates recorded while others will have electrocardiogram (ECG) recordings. After appending a group of patients' data together, some modalities may have missing data. For example, if three patients have ECG recordings, while four other patients do not, an appended matrix will reveal missing data for the ECG modality. If neural networks train on large datasets where observations may have unique combinations of various biometric recordings, then these neural networks must be able to analyze modalities that have missing data. Various mechanisms and data processing techniques have been utilized to robustly analyze multimodal datasets with missing data: late fusion, co-learning, orthogonal regularization, probabilistic graphical models, and deep Boltzmann machines.
The associate editor coordinating the review of this manuscript and approving it for publication was Sungroh Yoon.
In recent years, various deep learning methods have been proposed. More frequently, neural networks, including convolutional neural networks (CNNs) [2], recurrent neural networks (RNNs) [3], and modular neural networks [4], have been used to analyze multimodal data. These artificial neural networks, inspired by biological neural networks, learn to perform tasks through a training dataset. Once trained, the neural networks should accurately classify inputs into various output categories and can be used to analyze biometrics. The entirety of a multimodal arrhythmia dataset should be considered to robustly detect heartbeats. Although some modalities may contain missing data, or some modalities may be missing in many observations, these physiological signals may have a strong correlation with other signals. Alternatively, some signals may be too noisy and decrease heartbeat detection accuracy. The intrinsic structure of an arrhythmia dataset would only be revealed through an analysis of every modality.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Three contributions are provided in this paper. First, several hybrid neural networks (HNNs) are proposed to analyze multiple modalities while obtaining robust results for heartbeat detection. Second, an evaluation of the proposed HNNs on three real-world multimodal arrhythmia datasets is provided. Third, a comparison of CNN, CNN-Dropout, long short-term memory (LSTM), gated recurrent unit (GRU), EmbraceNet (EN), and the proposed HNNs is provided, along with a discussion regarding which common physiological signals should and should not be included in future analyses.
In Section II, related research to biometric anomaly analysis, heartbeat detection, and deep learning with neural networks is explored. The discussed research will be related to neural network theory and robust multimodal heartbeat detection. In Section III, multiple HNNs are proposed to improve the robustness of heartbeat detection with missing multimodal data. In Section IV, the experimental set-up, datasets, and experimental results are reported. In Section V, a conclusion discusses the findings, and possible future work is presented.

II. RELATED RESEARCH
Anomaly detection has been applied to many biological domains [5]-[10]. Several papers propose the implementation of anomaly detection within existing medical wireless sensor networks [8], [9], while others propose to detect outliers with medical devices [11], [12]. Since anomalies within biological systems may be fatal, much attention has been focused on the detection of system abnormalities.
Deep learning has been widely applied to many applications, from agriculture [13] to medical image analysis [14], [15]. In recent years, various deep learning methods have been proposed to analyze large datasets [16]-[18]. With more advanced computing devices and larger datasets available, it became possible to utilize deep learning architectures with a large number of layers [19], [20].

A. DEEP LEARNING FOR MULTIMODAL DATASETS
Often, several biometric parameters are recorded simultaneously for an individual subject, generating a multimodal dataset. Many methods have been proposed to apply neural networks to multimodal biometric datasets [4], [21]-[28], several of which attempted to increase robustness for multimodal datasets with missing data [29]-[31].
A multimodal ensemble-based system for emotional detection has revealed robust qualities when exploring fusion methods for emotional recognition [29]. This technique solves the missing data problem at the multimodal fusion stage.
Deep Boltzmann machines are stochastic RNNs that have been proposed for learning a generative model for multimodal datasets [30]. Since most of these datasets are images, CNNs are favored over RNNs for detecting biometric anomalies, a noteworthy example being arrhythmia [32]. LSTMs have produced significant results in time-series anomaly detection [33], as well as multimodal biometric anomaly detection [34]. Although GRUs have not seen many multimodal applications, they may prove to be advantageous for smaller datasets [35].
EN, a deep learning architecture, was designed for multimodal tasks that consider cross-modal correlations while avoiding over-fitting. Missing information in one modality can be covered by other signals, revealing some robustness when encountering missing modalities or block-wise missing data [31]. Although EN reveals robust qualities when analyzing multimodal datasets with missing data, the conventional CNN demonstrates superior feature extraction for images, while taking advantage of local spatial coherence [36].

B. DEEP LEARNING FOR HEARTBEAT DETECTION
Various neural network types have been utilized in several disciplines. Recent progress has frequently been made in classification and detection problems [37], [38], most prominently in heartbeat detection. Many methods are used with artificial neural networks, including multi-domain feature extraction [36], backpropagation [39], [40], and linear and nonlinear feature combination [41]. Substantial work has also been done in heartbeat detection using CNNs. One significant example is found in CNN-based generalized information fusion, which reveals the implications of using multi-channel data to increase accuracy [42]. Directly reading the multimodal data, similar to what would be found in hospital readings, mitigates the need for intermediate estimating methods like signal differencing, filterbanks, wavelet transform, and Hilbert transform [43]-[46]. Examples of other solutions include the use of co-occurrence matrices [47] and 3-dimensional data structures [48].
Some efforts have been made to apply a hybrid CNN-LSTM neural network to detect biometric anomalies, such as arrhythmia [32], [49], [50], or to detect diseases, such as diabetes [51]. None of these efforts attempts to apply a CNN-LSTM architecture to a multimodal dataset. However, in the field of gesture recognition [52], CNN-LSTM architectures have been applied to multimodal datasets.

C. DATA PRE-PROCESSING
A need for data pre-processing arises due to the existence of noisy readings, which limits the efficiency of deep learning architectures [53]. A multitude of solutions have been proposed, including combining finite impulse response (FIR) filters and principal component analysis (PCA) [54], wavelet transform approaches [55]-[57], modified empirical mode decomposition [58], and iterative denoising [59].
The main objective of combining FIR filters and PCA is to minimize base line wandering, which significantly improves accuracy. Alternatively, wavelet transform approaches aim to change the wavelet coefficients rather than the actual base line. However, in these approaches, the scalability is limited.
In an attempt to mitigate this, the multiple wavelets transform theorem, based on the ideas of the single wavelet transform and multi-resolution analysis, was proposed, showing improvement upon the traditional single-wavelet transform method.
Similarly, the modified empirical mode decomposition method combines the empirical mode decomposition method and the nonlocal means method to conserve the morphological characteristics of the wavelets, while maintaining high accuracy.
In an effort to maximize overall accuracy, an iterative denoising method was implemented, applying every wavelet function and possible decomposition to denoise the wavelets. Consequently, accuracy increased at the cost of computation speed.

III. APPROACH
To propose novel HNNs, a mathematical framework must be provided to support the integration of the different architectures. Following a comparison of existing neural networks and the proposed HNNs, one can determine the deep learning architectures that best handle arrhythmia datasets with missing modalities or missing block-wise data.

A. DATA PROCESSING
Although the proposed HNNs can be implemented in most biometric applications, it is critical to correctly pre-process the data. A generalized pre-processing technique is provided for heartbeat analysis.
Since the physiological signals are obtained as time-series data, non-overlapping windows of the samples are scaled by the maximum magnitude. Additionally, normalizing physiological data increases learning speed and decreases convergence time. Missing values were filled with the mean of their neighboring values. All signals were resampled to 250 Hz and normalized to a range of [0,1].
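The pre-processing steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, missing samples are assumed to be encoded as NaN, resampling is done by plain linear interpolation, and the [0,1] range is obtained by min-max scaling.

```python
import numpy as np

def fill_missing(signal):
    """Replace each NaN with the mean of its nearest valid neighbours (assumed NaN encoding)."""
    x = np.asarray(signal, dtype=float).copy()
    valid = np.flatnonzero(~np.isnan(x))
    if valid.size == 0:
        return x  # modality entirely missing; left untouched for downstream handling
    for i in np.flatnonzero(np.isnan(x)):
        left = valid[valid < i]
        right = valid[valid > i]
        neighbours = []
        if left.size:
            neighbours.append(x[left[-1]])
        if right.size:
            neighbours.append(x[right[0]])
        x[i] = np.mean(neighbours)
    return x

def resample(signal, fs_in, fs_out=250.0):
    """Resample to fs_out Hz by linear interpolation (an assumed, simple resampler)."""
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) / fs_in * fs_out))
    t_in = np.arange(len(signal)) / fs_in
    t_out = np.arange(n_out) / fs_out
    return np.interp(t_out, t_in, signal)

def normalize_unit_range(signal):
    """Min-max scale a signal to [0, 1]; a constant signal maps to zeros."""
    signal = np.asarray(signal, dtype=float)
    lo, hi = signal.min(), signal.max()
    if hi == lo:
        return np.zeros_like(signal)
    return (signal - lo) / (hi - lo)

def preprocess(signal, fs_in):
    """Fill gaps, resample to 250 Hz, then normalize to [0, 1]."""
    return normalize_unit_range(resample(fill_missing(signal), fs_in))
```

For example, a record sampled at 360 Hz would be passed as `preprocess(record, 360.0)`, which returns a 250 Hz signal whose values lie in [0, 1].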
The weights of each HNN are optimized throughout the training and validation phases, with cross-validation applied to the training set of each dataset to provide a rigorous analysis of the proposed HNNs.
Snippets are generated from the biometric samples by moving a window of length l one sample at a time, where l can be selected based on the physiological signals analyzed. The value of l can be generalized to a length of 251 for most arrhythmia detection applications. The generated snippets are used to train and optimize the presented neural architectures.
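The snippet generation can be illustrated as follows. This is a minimal sketch assuming the K signals are already aligned in a single array of shape (K, T); the function name `make_snippets` and the output layout are hypothetical choices, not taken from the paper.

```python
import numpy as np

def make_snippets(signals, length=251):
    """Slide a window of `length` samples one sample at a time over K aligned signals.

    signals: array of shape (K, T)
    returns: array of shape (T - length + 1, K, length)
    """
    signals = np.asarray(signals, dtype=float)
    n_channels, n_samples = signals.shape
    if n_samples < length:
        raise ValueError("signal is shorter than the snippet length")
    n_snippets = n_samples - length + 1
    # Row s of `index` holds the sample positions s, s+1, ..., s+length-1.
    index = np.arange(n_snippets)[:, None] + np.arange(length)[None, :]
    windows = signals[:, index]              # shape (K, n_snippets, length)
    return np.transpose(windows, (1, 0, 2))  # shape (n_snippets, K, length)
```

With two channels of 300 samples each, this yields 50 snippets of shape (2, 251).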
Although previous attempts at detecting heartbeats utilize a logistic sigmoid nonlinearity in the fully connected network, the hyperbolic tangent, ReLU, and leaky ReLU nonlinearities are tested with the proposed HNNs. The sigmoid nonlinearity saturates: it is insensitive to changes in its input once its output is close to 0 or 1. Additionally, it is hard to initialize the weights of sigmoid neurons to prevent saturation; if the weights are slightly too large, the neural network may have difficulty learning. Finally, the sigmoid activation is not zero-centered, which matters when a neural network uses more than one convolution layer.

C. CONVOLUTIONAL NEURAL NETWORKS
Simultaneously, K physiological signals are recorded for a subject from time $t_1$ to time $t_2$, denoted $x^{(i)}_{t_1:t_2}$ where $1 \le i \le K$. Since multimodal datasets are analyzed, $K \ge 2$.
Like many neural networks, CNNs can systematically fuse information from multiple signals by adding convolution layers between the input layer and the output layer. With K signals, each filter contains K one-dimensional component filters with a common length L. The desired feature set, or overall filter output, is generated by summing these component filter outputs. Assuming p filters, the coefficients of the k-th channel of the j-th filter are denoted $\{h^{(j,k)}_l\}_{l=0}^{L-1}$, and the output of this component filter is
$$y^{(j,k)}_n = \sum_{l=0}^{L-1} h^{(j,k)}_l \, x^{(k)}_{n-l}.$$
By summing all channels, the overall output of the j-th filter is obtained:
$$y^{(j)}_n = \sum_{k=1}^{K} y^{(j,k)}_n.$$
At a given start time t, a window of length M of the signal vector, $x^{(k)}_{t:t+M-1}$, is considered for the k-th channel. Without zero-padding, a relatively short filter can produce a larger number of outputs than a long filter, which would improve accuracy for smaller datasets. To avoid zero-padding, the j-th filter output $y^{(j)}_n$ is computed only for $n = t+L-1, \cdots, t+M-1$. This results in $M - L + 1$ samples. The length of the produced feature vector for all filters is $p(M - L + 1)$.
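To make the output sizes concrete, the multi-channel filtering without zero-padding can be sketched with NumPy. The function name and the array layouts are assumptions made for illustration; `np.convolve` with `mode="valid"` is used as a stand-in for the convolution layer and is not the authors' implementation.

```python
import numpy as np

def multichannel_filter_bank(window, filters):
    """Apply p filters, each with K component filters of length L, without zero-padding.

    window:  array of shape (K, M), one row per physiological signal
    filters: array of shape (p, K, L)
    returns: array of shape (p, M - L + 1)
    """
    window = np.asarray(window, dtype=float)
    filters = np.asarray(filters, dtype=float)
    n_channels, window_len = window.shape
    n_filters, n_components, filter_len = filters.shape
    if n_channels != n_components:
        raise ValueError("each filter needs one component per channel")
    outputs = np.zeros((n_filters, window_len - filter_len + 1))
    for j in range(n_filters):
        for k in range(n_channels):
            # "valid" keeps only positions where the filter fully overlaps the window.
            outputs[j] += np.convolve(window[k], filters[j, k], mode="valid")
    return outputs

# Example sizes (hypothetical): K = 3 signals, M = 251 samples, p = 4 filters, L = 16.
demo_window = np.ones((3, 251))
demo_filters = np.ones((4, 3, 16))
demo_features = multichannel_filter_bank(demo_window, demo_filters)
feature_vector = demo_features.ravel()  # length p * (M - L + 1)
```

In this example each filter yields 251 - 16 + 1 = 236 outputs, so the flattened feature vector has 4 x 236 = 944 entries.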
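The filter coefficients and network weights were optimized using backpropagation under the cross-entropy loss function
$$C = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \ln a(x^{(i)}) + \left(1 - y^{(i)}\right) \ln\left(1 - a(x^{(i)})\right) \right],$$
where $X = \{x^{(1)}, \cdots, x^{(n)}\}$ is the set of input examples in the training dataset, $Y = \{y^{(1)}, \cdots, y^{(n)}\}$ is the corresponding label set for those input examples, and $a(x)$ is the output for a given input x. Throughout the experiments, the label 1 is used for inputs that are determined to be a heartbeat and the label 0 denotes the absence of a heartbeat. The values of $a(x)$ are computed through one of the following functions:
$$a_1(x) = \frac{1}{1 + e^{-(Wx + b)}}, \quad a_2(x) = \tanh(Wx + b), \quad a_3(x) = \max(0, Wx + b), \quad a_4(x) = \max(\alpha(Wx + b), Wx + b),$$
where W is a weight matrix, b is a bias vector, and $\alpha$ is a small, positive constant. The functions $a_1(x)$, $a_2(x)$, $a_3(x)$, and $a_4(x)$ are the logistic sigmoid, hyperbolic tangent, ReLU, and leaky ReLU activation functions, respectively. Since many arrhythmia datasets contain more than two physiological signals, the proposed HNNs may use multiple layers. If two layers are used, the activation functions can be expanded to
$$a_1(x) = \frac{1}{1 + e^{-(W z_1(x) + b)}}, \quad a_2(x) = \tanh(W z_2(x) + b), \quad a_3(x) = \max(0, W z_3(x) + b), \quad a_4(x) = \max(\alpha(W z_4(x) + b), W z_4(x) + b),$$
where
$$z_1(x) = \frac{1}{1 + e^{-(Vx + c)}}, \quad z_2(x) = \tanh(Vx + c), \quad z_3(x) = \max(0, Vx + c), \quad z_4(x) = \max(\alpha(Vx + c), Vx + c),$$
V is the weight matrix for the first layer, c is the bias vector for the first layer, and $z_i(x)$ for $i = 1, 2, 3, 4$ is the activation of the hidden layer. If multiple hidden layers exist, this process is repeated.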
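The four activation functions named above and the cross-entropy loss can be written compactly in NumPy. This is an illustrative sketch only: the value alpha = 0.01, the clipping constant `eps`, and the use of a sigmoid output layer in `two_layer_output` are assumptions made so the example is numerically safe and self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the small positive constant; 0.01 is an assumed value.
    return np.where(z > 0, z, alpha * z)

def cross_entropy(labels, outputs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and outputs in (0, 1)."""
    labels = np.asarray(labels, dtype=float)
    outputs = np.clip(np.asarray(outputs, dtype=float), eps, 1.0 - eps)
    return -np.mean(labels * np.log(outputs) + (1.0 - labels) * np.log(1.0 - outputs))

def two_layer_output(x, V, c, W, b, hidden=relu):
    """Hidden layer z(x) = hidden(Vx + c), followed by an assumed sigmoid output layer."""
    z = hidden(V @ x + c)
    return sigmoid(W @ z + b)
```

For instance, `cross_entropy([1, 0], [0.5, 0.5])` evaluates to ln 2, the loss of an uninformative detector.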

D. LONG SHORT-TERM MEMORY
One of the more popular RNNs for biometric analysis is the LSTM. This model is trained to reconstruct the normal time-series data inherent in the physiological recordings. Any reconstruction errors are used to obtain the likelihood of a point denoting a heartbeat. Since LSTM units do not perform activation, the same physiological signals can flow through the neural network for an arbitrarily long period of time. This characteristic enables LSTMs to represent time series, where the interaction between past and present data is sensitive to distant and recent events [61].
The jth unit of an LSTM maintains a memory $c^j_t$ at time t. The output of the jth unit is
$$h^j_t = o^j_t \, \sigma(c^j_t),$$
where $o^j_t$ is an output gate modulating the amount of content exposure for the memory and $\sigma(\cdot)$ is an activation function. Given a sequence $x = (x_1, x_2, \cdots, x_r)$, this gate is computed as
$$o^j_t = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t)^j,$$
where W and U are weight matrices, $h_t$ is the model's recurrent hidden state at time t, and $V_o$ is a diagonal matrix.
The memory $c^j_t$ is updated by adding a temporary memory $\tilde{c}^j_t$ after partially forgetting the existing memory,
$$c^j_t = f^j_t \, c^j_{t-1} + i^j_t \, \tilde{c}^j_t,$$
where $\tilde{c}^j_t = \sigma(W_c x_t + U_c h_{t-1})^j$. The extent of the temporary loss of the existing memory is controlled by a forget gate $f^j_t$, and the extent to which the temporary memory is added is controlled by an input gate $i^j_t$:
$$f^j_t = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1})^j, \quad i^j_t = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1})^j.$$
Similar to the output gate, $V_f$ and $V_i$ are diagonal matrices. An LSTM unit uses these gates to determine whether the existing or new memory content carries more information in feature extraction. If an LSTM model detects an important feature early in the training stage, its unit carries this information throughout the training, thereby capturing potential long-distance dependencies.
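One time step of the gated memory update described above can be sketched as follows. The sketch assumes the common choice of a logistic sigmoid for the gates and a hyperbolic tangent for the temporary memory and the output, and it stores each diagonal matrix as a vector; the parameter names and the helper `make_lstm_params` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_lstm_params(n_in, n_hidden, scale, rng):
    """Hypothetical helper: Gaussian-initialized parameters (scale=0 gives all zeros)."""
    params = {}
    for gate in ("i", "f", "o", "c"):
        params["W_" + gate] = rng.normal(0.0, scale, (n_hidden, n_in))
        params["U_" + gate] = rng.normal(0.0, scale, (n_hidden, n_hidden))
    for gate in ("i", "f", "o"):
        # Diagonal matrices V_i, V_f, V_o stored as vectors.
        params["v_" + gate] = rng.normal(0.0, scale, n_hidden)
    return params

def lstm_step(x_t, h_prev, c_prev, P):
    """Advance the hidden state h and memory c by one time step."""
    i = sigmoid(P["W_i"] @ x_t + P["U_i"] @ h_prev + P["v_i"] * c_prev)  # input gate
    f = sigmoid(P["W_f"] @ x_t + P["U_f"] @ h_prev + P["v_f"] * c_prev)  # forget gate
    c_tilde = np.tanh(P["W_c"] @ x_t + P["U_c"] @ h_prev)                # temporary memory
    c = f * c_prev + i * c_tilde                                         # memory update
    o = sigmoid(P["W_o"] @ x_t + P["U_o"] @ h_prev + P["v_o"] * c)       # output gate
    h = o * np.tanh(c)                                                   # exposed output
    return h, c

def run_lstm(sequence, n_hidden, P):
    """Process a sequence of input vectors and return the final hidden state."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, P)
    return h
```

Because the memory is carried forward additively and scaled only by the forget gate, information detected early in a sequence can persist over many steps.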

E. GATED RECURRENT UNIT
In recent years, GRUs, a modification of the LSTM, have been proposed to allow each unit to adaptively capture dependencies on different time scales [62]. GRUs do not use separate memory cells when modulating the flow of information inside each recurrent unit.
When working on less training data, GRUs train faster and perform better on testing data than LSTMs. Due to the simplicity of GRUs, they compute much more efficiently [63]. However, LSTMs can remember longer sequences than GRUs and more accurately model long-distance relationships [61], [64].
For a GRU, the activation $h^j_t$ acts as a linear interpolation between the previous activation $h^j_{t-1}$ and the new activation $\tilde{h}^j_t$,
$$h^j_t = (1 - z^j_t) \, h^j_{t-1} + z^j_t \, \tilde{h}^j_t,$$
where $\tilde{h}^j_t = \sigma(W x_t + U(r_t \odot h_{t-1}))^j$ and $r_t$ is a set of reset gates. The update gate $z^j_t = \sigma(W_z x_t + U_z h_{t-1})^j$ determines how much of the unit is updated.
When $r^j_t \approx 0$, the reset gate forces the recurrent unit to forget the previously computed state. Similar to the update gate, the reset gate is computed as $r^j_t = \sigma(W_r x_t + U_r h_{t-1})^j$. One key difference between the LSTM and GRU unit is that the GRU unit does not have a mechanism to control the degree of exposure for the unit's state; instead, the GRU unit exposes the state entirely at any given time.
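A single GRU step following this description can be sketched as below. As with any such sketch, the parameter names are hypothetical, and the choice of a logistic sigmoid for the gates and a hyperbolic tangent for the new activation is an assumption based on common practice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_gru_params(n_in, n_hidden, scale, rng):
    """Hypothetical helper: Gaussian-initialized parameters (scale=0 gives all zeros)."""
    params = {}
    for name in ("z", "r", "h"):
        params["W_" + name] = rng.normal(0.0, scale, (n_hidden, n_in))
        params["U_" + name] = rng.normal(0.0, scale, (n_hidden, n_hidden))
    return params

def gru_step(x_t, h_prev, P):
    """Advance the GRU activation by one time step."""
    z = sigmoid(P["W_z"] @ x_t + P["U_z"] @ h_prev)             # update gate
    r = sigmoid(P["W_r"] @ x_t + P["U_r"] @ h_prev)             # reset gate
    h_tilde = np.tanh(P["W_h"] @ x_t + P["U_h"] @ (r * h_prev)) # new activation
    # Linear interpolation between the previous and the new activation.
    return (1.0 - z) * h_prev + z * h_tilde
```

Unlike the LSTM sketch, there is no separate memory cell and no output gate: the returned activation is the full state of the unit.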

F. EmbraceNet
The deep learning architecture EN has the robust quality of dealing with missing data [31]. Assuming there exist K physiological signals with corresponding neural networks, let $x^{(k)}$ be the output vector from the k-th neural network, where $k \in \{1, 2, \cdots, K\}$. Through the use of docking layers, EN converts each input vector into vectors of the same size. The i-th component of the k-th docking layer is denoted
$$z^{(k)}_i = w^{(k)}_i \cdot x^{(k)} + b^{(k)}_i,$$
where $w^{(k)}_i$ is a weight vector and $b^{(k)}_i$ is a bias term. An activation function $f_a$ is applied to $z^{(k)}_i$ to obtain the output of the k-th docking layer, $d^{(k)}_i = f_a(z^{(k)}_i)$, where $d^{(k)} = [d^{(k)}_1, d^{(k)}_2, \cdots, d^{(k)}_c]$ and c is the dimension of all docked vectors. With the K vectors of dimension c obtained from the docking layers, EN combines these vectors into an embraced vector.
The fusion technique of the K docking vectors uses multinomial sampling, where $r_i = [r^{(1)}_i, r^{(2)}_i, \cdots, r^{(K)}_i]^T$ is a vector drawn from a multinomial distribution (i.e., $r_i \sim \mathrm{Multinomial}(1, p)$), where $p = [p_1, p_2, \cdots, p_K]^T$ and $\sum_k p_k = 1$. From these constraints, one value of $r_i$ will equal 1 with the remaining values equal to 0. Thereafter, the vector $r_i$ is applied to the docked vectors, and the i-th component of the embraced vector is $e_i = \sum_{k=1}^{K} r^{(k)}_i d^{(k)}_i$. This procedure is outlined in Algorithm 1.

Algorithm 1 EmbraceNet
16: end for
17: return $e = [e_1, e_2, \cdots, e_c]^T$
This process ensures that one physiological signal contributes to each component of the vector e. Through the modality selection process, the output of EN is generated from data of all signals. The selection process depends on the values of p. In cases where little to no data is missing, $p = [1/K, 1/K, \cdots, 1/K]^T$ gives an equal chance for all physiological signals to be selected.
However, arrhythmia datasets often have missing modalities or missing block-wise data. In this case, the probabilities p are adjusted. Let $u = [u_1, u_2, \cdots, u_K]^T$ indicate the presence of each signal, where $u_k = 1$ if $x^{(k)}$ exists and 0 otherwise. The multinomial distribution is adjusted by changing p to $\hat{p} = [\hat{p}_1, \hat{p}_2, \cdots, \hat{p}_K]^T$, where $\hat{p}_k = \frac{u_k p_k}{\sum_j u_j p_j}$. If the k-th signal is not available, $u_k = 0$ and $\hat{p}_k = 0$, eliminating the chance for that value of $r^{(k)}_i$ to become 1. This prevents invalid data coming from the k-th physiological signal from propagating to the EN output.
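The embracement step with availability-adjusted probabilities can be illustrated with a short NumPy sketch. It starts from already-docked vectors (the docking layers are omitted), and the function name, argument layout, and the choice to zero out unavailable rows are assumptions made for illustration rather than the implementation of [31].

```python
import numpy as np

def embrace(docked, available, p=None, rng=None):
    """Fuse K docked vectors of dimension c into one embraced vector.

    docked:    array of shape (K, c); rows of missing signals may hold NaN
    available: array of shape (K,) with 1 if the signal exists and 0 otherwise
    p:         selection probabilities; defaults to the uniform 1/K
    returns:   (embraced vector of shape (c,), adjusted probabilities of shape (K,))
    """
    docked = np.asarray(docked, dtype=float)
    u = np.asarray(available, dtype=float)
    n_signals, dim = docked.shape
    p = np.full(n_signals, 1.0 / n_signals) if p is None else np.asarray(p, dtype=float)
    rng = np.random.default_rng() if rng is None else rng

    weighted = u * p
    if weighted.sum() == 0:
        raise ValueError("at least one signal must be available")
    p_hat = weighted / weighted.sum()  # missing signals receive probability 0

    # One multinomial draw per component: each row of r is one-hot over the K signals.
    r = rng.multinomial(1, p_hat, size=dim)             # shape (c, K)
    safe = np.where(u[:, None] > 0, docked, 0.0)        # drop invalid values of missing signals
    embraced = np.sum(r.T * safe, axis=0)               # exactly one signal per component
    return embraced, p_hat
```

With three signals of which the third is missing, `p_hat` becomes [0.5, 0.5, 0], so every component of the embraced vector is taken from one of the two available signals.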

G. HYBRID NEURAL NETWORKS
Recent attempts at detecting heartbeats make use of two physiological signals, or channels: ECGs and blood pressure (BP). An analysis of these two modalities is a bimodal approach. The proposed HNNs can accurately analyze many channels. Since the proposed architectures are not restricted to unimodal or bimodal datasets, they may contain more than one convolution layer. Additionally, many arrhythmia datasets contain missing data, which deters a multimodal (K > 2) approach to analyze the data.
A combination of neural network architectures may reveal a more robust approach to analyzing datasets that have missing data for heartbeat detection; due to the rhythmic nature of this detection and the necessity for quality feature extraction with missing data, a single neural network may not be advantageous. It is for this reason that many researchers are implementing HNNs like CNN-LSTM and CNN-GRU in multimodal datasets.
CNNs are ideal when data is periodically sampled in one or more dimensions, which is the most common occurrence for arrhythmia data. LSTMs are useful when the data display a periodic rhythm and can return accurate results when detecting heartbeats. GRUs provide the benefits of LSTMs on smaller datasets. The benefit of EN arises when datasets have missing modalities or missing block-wise data, which is a common occurrence for the volatile nature of physiological recordings. Therefore, combinations of these four neural architectures may provide accurate, robust results. HNNs can incorporate many channel signals by CNN, LSTM, or GRU and feed these channels into the EN for final prediction. A diagram of the proposed CNN-LSTM-EN neural network is provided in Figure 1.
An explanation of the CNN-LSTM-EN is provided first, since it is the most involved neural network alongside the CNN-GRU-EN architecture. Note that since a GRU is a modified LSTM, the CNN-GRU-EN is similarly constructed to the CNN-LSTM-EN.
Since CNNs read periodically sampled data well, this neural network will be the first architecture in the proposed CNN-LSTM-EN model, leveraging its superior feature extraction over the other neural architectures. The benefit of including an LSTM in an HNN is that LSTMs can best analyze periodic data. EN is the last architecture since it requires periodically analyzed data to adjust the probabilities $\hat{p}$ before attempting to detect heartbeats in the arrhythmia data. This same reasoning justifies the structure of the existing hybrid network CNN-LSTM, where the LSTM is the architecture that classifies heartbeat location rather than an EN.
Typically, a CNN's probability vector $p_{cnn}$ is generated prior to data classification. In the CNN-LSTM-EN neural architecture, the dense layer of the ith modality is an LSTM, where $i \in \{1, 2, \cdots, m\}$. Following the feature extraction of the convolution layers in the m CNN architectures, the LSTM will produce a probability vector $p_l$ that determines the probabilities of points denoting heartbeats, where $l \in \{1, 2, \cdots, m\}$. Rather than executing the classification process after the dense layer, $p_l$ will become the input to the EN. An EN will take the m probability vectors as the input vectors, where the docking layers are computed as
$$z^{(l)}_i = w^{(l)}_i \cdot p_l + b^{(l)}_i,$$
with docking-layer weights $w^{(l)}_i$ and biases $b^{(l)}_i$. Thereafter, the EN architecture will compute more robust results by adapting the probabilistic distributions of the available data and then classifying heartbeat locations.
For the CNN-LSTM and CNN-LSTM-EN neural networks, the CNN architecture conducts the feature extraction. The extracted features become the input for the LSTM model. This is advantageous since the CNN extracts features from the arrhythmia data with accuracy while the LSTM detects arrhythmic anomalies following the CNN's feature extraction. The CNN-EN architecture operates similarly to the CNN-LSTM-EN, where the m probability vectors of the CNN models become the input vectors for the EN architecture.
As a modified LSTM, GRUs can also detect arrhythmic conditions following the implementation of a CNN. Therefore, CNN-GRU and CNN-GRU-EN models are examined throughout the experiment. Although LSTM-EN and GRU-EN models cannot leverage the periodic sampling technique and feature extraction of CNNs, their results are included in the experiment as a means of comparison.
CNN, regularized CNN with dropout, GRU, LSTM, and EN are included in the experiment as a baseline for comparing the results between single neural architectures and their hybrid counterparts. In total, twelve neural architectures are analyzed.

H. FEATURE EXTRACTION
Compared to scattering networks, RNNs, and EN, CNNs boast several additional strengths: (1) there are a wide variety of filters (random, supervised, and unsupervised filters) that can be employed (2) there exists a variety of non-linearities (rectified linear units, hyperbolic tangent, and logistic sigmoid) (3) pooling separators (sub-sampling, average-pooling, and max-pooling) can be applied and (4) these filters, nonlinearities, and pooling separators can be in different network layers. With these additional features, CNNs have more customization and flexibility than scattering networks, RNNs, and EN.
Let $L^p(\mathbb{R}^d)$ be the space of Lebesgue-measurable functions $f : \mathbb{R}^d \to \mathbb{C}$ that satisfy $\|f\|_p := \left(\int_{\mathbb{R}^d} |f(x)|^p \, dx\right)^{1/p} < \infty$, where $p \in [1, \infty)$. Since semi-discrete frames can be interpreted as shift-invariant frames of a countable index n with a continuous translation parameter, the following definitions can be provided.
In the nth network layer for a CNN, a convolution with atoms $g_{\lambda_n} \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ of a semi-discrete Parseval frame $\Psi_n := \{T_b I g_{\lambda_n}\}_{b \in \mathbb{R}^d, \lambda_n \in \Lambda_n}$ for $L^2(\mathbb{R}^d)$ is employed on a countable index set $\Lambda_n$, where $T_b$ and $I g_{\lambda_n}$ are frame coefficients outlined in [65] and the frame atoms $g_{\lambda_n}$ are arbitrary. The semi-discrete frame $\Psi_n$ functions as a feature extractor.
For $n \in \mathbb{N}$, let $M_n$ and $P_n$ be Lipschitz-continuous operators where $M_n f = P_n f = 0$ for $f = 0$. Using the previously defined semi-discrete frame $\Psi_n$, the sequence $\Omega = ((\Psi_n, M_n, P_n))_{n \in \mathbb{N}}$ is called a module-sequence. Given the module-sequence $\Omega$, the feature extractor $\Phi_\Omega$ maps $L^2(\mathbb{R}^d)$ to the feature vector
$$\Phi_\Omega(f) := \bigcup_{n=0}^{\infty} \Phi^n_\Omega(f),$$
where $\Phi^n_\Omega$ is the operator associated with the nth network layer, and the function $\chi_n$ is the output-generating atom of the nth layer. The set $\Phi^n_\Omega(f)$ corresponds to the features in the nth network layer generated by function f.
The feature extractors obtained through the implemented neural architectures use this feature-extracting mechanism to store all features in the feature vectors $\Phi^i_\Omega(f)$ before classifying the biometric data analyzed.

IV. EXPERIMENT
The experiments were run with an NVIDIA Tesla P100, a 16-GB GPU computing processor. The NVIDIA Tesla P100 contains 3584 CUDA cores with a bandwidth of 720 GBps. The experimental results follow a description of the datasets used.

A. DATASETS
To ensure robustness of the proposed HNNs, three multimodal datasets were analyzed: the MIT-BIH polysomnographic dataset, the MIT CC original dataset, and the MIT CC augmented dataset. These datasets include a variety of sample lengths. In each dataset, the patient record contains notes from cardiologists on the heartbeat locations. More specifically, the weighted mean of the R-peak and S-peak of an ECG is described as the heartbeat location. The individual databases are expanded upon below.

1) MIT-BIH Polysomnographic Database
The MIT-BIH dataset contains four-, six-, and seven-channel polysomnographic recordings that sum to over 80 hours' worth of data [66]. Additionally, heartbeat annotations, ECGs, EEGs, and respiration signals annotated with respect to sleep conditions are provided. These multimodal recordings were obtained during sleep for eighteen subjects. A LightWAVE visualization of subject 45 from the MIT-BIH polysomnographic dataset is provided [Fig. 2]. This sample was selected to display the variety in available modalities for each sample.

2) MIT Computing in Cardiology (CC) Database
The publicly available MIT CC database contains 200 patient records and a hidden test set with 210 records [67]. This database contains two datasets: an original dataset and an augmented dataset. Both datasets contain 100 training observations. Since the augmented dataset does not have a fixed fiducial point and the annotations were not generated from a specific physiological channel, this dataset provides a greater challenge for the neural networks. The sampling frequencies of the signals vary between 250 and 360 Hz. A LightWAVE visualization of sample 100 from the MIT CC database is provided [Fig. 3]. This sample was selected to display the variety in available modalities for each sample within the multimodal database. Physician annotations that are recorded in blue indicate a heartbeat. ECG, BP, EEG, SV, and SO2 represent the physiological signals for ECGs, BP, electroencephalography, stroke volume, and oxygen saturation, respectively. Resp (nasal) and Resp (abdominal) indicate signals for nasal respiration and abdominal respiration, respectively.

B. EXPERIMENTAL RESULTS
The testing accuracies for the CNN, CNN-Dropout, CNN-LSTM, and the proposed HNNs across fifty epochs for all three datasets are displayed in Figures 4, 5, and 6. These figures are included to reveal the effect that modality count has on the learning of each implemented neural network. When analyzing the testing accuracies for these neural networks, we compare only their best performance. The testing accuracies are reported as percentages following each model for the different datasets, where testing accuracy is defined as the percentage of correctly detected heartbeats. A comparison is made between each neural network's detected heartbeats and the physician-annotated heartbeats in each dataset. Tables 1, 2, and 3 report the change in percentage points for each model after dropping each modality in the three datasets. In some cases, the performance of a model improves when particular modalities are removed. This suggests that several modalities contribute noisy and uncorrelated data, since removing their impact on the network increases that network's predictive accuracy. However, several models reveal that common physiological signals can improve the accuracy of locating heartbeats. In Table 3, neural architectures that did not experience a change in percentage points for dropped modalities were removed.
The neural networks that reported the best accuracy on the MIT-BIH Polysomnographic database contained convolution layers. Namely, the CNN (97.63%), CNN-LSTM (97.62%), CNN-EN (96.86%), and CNN-Dropout (96.19%) models reported the highest accuracies on the polysomnographic testing data. This indicates that the MIT-BIH data requires more advanced feature extraction to obtain accurate results, which is best satisfied with neural architectures that leverage the feature extraction of CNNs. Note that despite CNN-LSTM having a higher accuracy than CNN at most epochs, CNN obtains the highest accuracy at epoch 35. Although the best testing accuracies were reported by the CNN and CNN-LSTM architectures, these models contained high variance in accuracy when modalities were dropped, indicating a relatively heavy reliance on ECG and BP readings. Additionally, the regularized CNN-Dropout model displayed relatively high variance.
The CNN-EN model had comparable results with the CNN and CNN-LSTM in terms of the best reported accuracy while exhibiting less volatility. This is likely due to the EN's ability to adjust probability assessments while working with the CNN, which would prove advantageous since the MIT-BIH data requires a more rigorous treatment towards feature extraction than analyzing a periodic rhythm. When analyzing arrhythmia datasets, a high testing accuracy with low variance is ideal; therefore, the results of CNN-EN are comparable to the CNN and CNN-LSTM models, while CNN-EN retains higher testing accuracy after modalities are removed. The GRU-EN (94.59%), CNN-GRU-EN (94.87%), and CNN-LSTM-EN (94.76%) models also reported stable testing accuracies following the removal of modalities.
The MIT CC original dataset reported similar results to the polysomnographic dataset. The CNN (98.91%), CNN-EN (98.68%), CNN-Dropout (98.56%), and EN (98.31%) models recorded the highest testing accuracy. Similar to the MIT-BIH data, this dataset requires advanced feature extraction. CNN-EN demonstrates less variance in testing accuracy than a regularized CNN with comparable results, likely due to the ability to adjust the probability vector outputted from the CNN architecture. The models containing RNNs such as the GRU-EN (95.56%) and CNN-GRU (94.86%) models performed relatively worse than the other HNNs, indicating that the MIT CC original dataset does not require a rigorous periodic analysis of the physiological signals.
The most accurate models for the MIT CC augmented dataset utilized an embracement layer for late fusion. The CNN-EN (97.74%), EN (96.83%), CNN-GRU-EN (95.28%), and LSTM-EN (95.23%) reported the best results, since the EN architecture works well with a large number of modalities (46 physiological signals). The neural architectures without an EN layer failed to learn after the first epoch, since the relatively large number of modalities inhibited learning; each of these neural architectures reported a 63.34% testing accuracy throughout the experiment, only utilizing ECG and BP readings as inputs.
For the three datasets tested, ECG and BP recordings were the most common physiological signals and have the largest impact on the outcomes of the models. Tables 1, 2, and 3 report the change in percentage points of accuracy as certain modalities are dropped from the input to the neural networks.
For the MIT polysomnographic dataset, the HNNs that do not utilize EN were mostly impacted by the removal of the ECG modality. After removing ECG signals from the observations, the CNN, GRU, LSTM, CNN-Dropout, and CNN-LSTM lost 51, 49, 39, 36, and 34 percentage points of accuracy, respectively. The CNN-EN, GRU-EN, and EN only lost 2, 1, and 1 percentage points, respectively, following the removal of ECG signals. The neural architectures without an embracement layer lost the most percentage points in accuracy because they failed to make use of the other physiological signals. With EN, the proposed HNNs were able to remain relatively accurate following the removal of either the ECG or BP modality. The CNN-LSTM-EN and CNN-GRU-EN architectures report little to no loss in percentage points for accuracy. The CNN-GRU and LSTM-EN models report a 1 percentage point increase in accuracy after removing ECG, indicating that these models better predict heartbeat locations from BP than ECG recordings. These findings indicate that the proposed HNNs and EN outperform the state-of-the-art neural networks when common modalities are not present, displaying the robustness of HNNs.
When BP signals are removed, CNN-EN barely loses accuracy, whereas CNN, CNN-LSTM, CNN-GRU, CNN-Dropout, GRU, and LSTM lose 35, 34, 30, 26, 23, and 13 percentage points, respectively. Additionally, CNN-GRU-EN, CNN-LSTM-EN, LSTM-EN, GRU-EN, and EN experience a loss of 42, 35, 32, 22, and 7 percentage points, respectively. The CNN-EN architecture remains relatively accurate compared to the CNN-LSTM-EN and CNN-GRU-EN architectures following the removal of the BP modality because this dataset does not require a rigorous periodic analysis to be completed by the RNNs. None of the HNNs that incorporate an embracement layer lose any accuracy when dropping physiological signals other than BP or ECG recordings, since they are able to make use of each modality in detecting heartbeats. For the polysomnographic dataset, CNN-EN provides comparable testing accuracy to CNN, regularized CNN and CNN-LSTM while providing much more robust accuracy following the removal of modalities. Therefore, in a messy arrhythmia dataset, CNN-EN provides robust results while offering testing accuracy that competes well with state-of-the-art neural networks.
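One way to summarize the losses quoted above is to rank models by their worst-case loss over the two common modalities. The sketch below uses only the percentage-point losses stated in the text for the polysomnographic dataset, restricted to models for which both an ECG loss and a BP loss are given numerically (CNN-EN is therefore omitted). The worst-case criterion and the function name are assumptions made for illustration, not a metric used in the experiments.

```python
# Losses in percentage points quoted in the text for the polysomnographic
# dataset, as (loss after dropping ECG, loss after dropping BP).
losses = {
    "CNN": (51, 35),
    "GRU": (49, 23),
    "LSTM": (39, 13),
    "CNN-Dropout": (36, 26),
    "CNN-LSTM": (34, 34),
    "GRU-EN": (1, 22),
    "EN": (1, 7),
}


def rank_by_worst_case(loss_table):
    """Sort models from smallest to largest worst-case loss (assumed criterion)."""
    worst = {model: max(values) for model, values in loss_table.items()}
    return sorted(worst.items(), key=lambda item: item[1])


ranking = rank_by_worst_case(losses)
for model, worst_loss in ranking:
    print(f"{model}: worst-case loss of {worst_loss} percentage points")
```

Under this criterion the two models with an embracement layer rank first, which is consistent with the discussion above; other summary statistics, such as the mean loss, could be substituted.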
Within the MIT polysomnographic dataset, ECG, BP, and nasal respiration appear to positively affect the networks' testing accuracies, whereas EEG (O2-A1) and the summation of respiration appear to negatively affect testing accuracies. While most of the remaining physiological signals do not greatly impact neural network performance, EEG (C4-A1) provides either a positive or negative impact on testing accuracies depending on the tested neural architecture.
The MIT CC original dataset reports similar results to the polysomnographic dataset. While CNN, CNN-EN, CNN-Dropout, and EN provide the best testing accuracies, EN and CNN-EN provide the most robust results when dropping modalities. The HNNs that implement an embracement layer also do not lose accuracy when the less common physiological signals are dropped. However, GRU-EN and CNN-GRU-EN lose a significant amount of accuracy after dropping BP or ECG signals, since the MIT CC original dataset has longer periods of recordings that the GRU architecture struggles to analyze. Thus, EN and CNN-EN report some of the highest testing accuracies while providing modality robustness.
For the MIT CC original dataset, the neural architectures that use an embracement layer tend to perform better after removing modalities, with the exception of BP and ECGs. However, the neural networks without EN reported lower testing accuracies following the removal of most modalities. Thus, the proposed HNNs that utilize an embracement layer report robust results following the removal of both common and uncommon biometric signals. Within the MIT CC original dataset, ECG, BP, nasal respiration, EEG, sulfur dioxide, chest respiration, and abdomen respiration appear to positively affect the networks' testing accuracies. Stroke volume, EOG (right), and the summation of respiration appear to negatively affect testing accuracies.
Since the neural architectures that did not incorporate EN did not learn on the MIT CC augmented dataset, dropping any modality would not affect their testing accuracy. Therefore, their results report a 0% change and are excluded from Table 3. It is likely that these neural architectures did not learn because they only analyzed the ECG and BP modalities, which did not reveal additional information past the first epoch. Relative to the other datasets, the CNN-EN, EN, and CNN-GRU-EN performed well on the MIT CC augmented dataset, only losing 4, 2, and 2 percentage points after dropping ECG signals, respectively. This is likely due to the relatively heavier reliance on the EN architecture over the CNNs and RNNs, since this dataset contains a larger number of modalities. Since the removal of the ECG modality negatively impacts the accuracy of almost every neural network on each dataset, ECG is a strong indicator for heartbeat location; thus, ECGs should be considered in multimodal heartbeat detection when available.
Neither LSTM-EN nor CNN-LSTM-EN experienced a change in accuracy after dropping ECG signals, revealing that they make use of some of the rather uncommon physiological signals. Therefore, if ECG recordings experience block-wise missing data in arrhythmia datasets that contain a large number of modalities, the proposed LSTM-EN and CNN-LSTM-EN neural networks provide accurate and robust results. The GRU-EN model suffers a loss of 29 percentage points following the removal of BP recordings while improving by 1 percentage point after dropping ECG signals. This finding indicates that the GRU-EN model relied more on BP than ECG recordings to locate heartbeats.
Additionally, GRU-EN was the only model that gained any percentage points in accuracy after dropping a physiological signal that was not BP or ECG, which can be attributed to the inherent noise in some of the more uncommon physiological signals. Based on testing accuracy and modality robustness, CNN-EN, EN, and CNN-GRU-EN appear to best predict heartbeat location on the MIT CC augmented dataset.
On the augmented dataset, BP provides a strong positive impact on the networks' testing accuracies, whereas ECG affects accuracies relatively less. Several physiological signals negatively affect network performance for each neural architecture: nasal respiration, EEG (C3-O1), EMG, abdominal respiration, EOG (right), EEG (C4-A1), stroke volume, sulfur dioxide, ECG II, abdomen respiration, and ECG Lead AVF. The remaining physiological signals have either little or no effect on testing accuracies. In the two other datasets, nasal respiration, EEG (C3-O1), and abdomen respiration positively affected testing accuracies. This discrepancy may be due to the amount of noise inherent in this dataset since there are more biometric signals present.

V. CONCLUSION AND FUTURE WORK
Many biometric datasets contain more than two physiological signals. To make full use of each dataset, all modalities must be considered. As seen in the experimental results, not every modality will positively contribute to predictive accuracy. However, some modalities, which have been neglected up until this point by use of bimodal analysis, positively contribute to the predictive accuracy of many commonly used neural networks.
This finding reveals room for improvement in many state-of-the-art multimodal approaches that only analyze two modalities. Regarding arrhythmia studies, most researchers only analyze BP and ECGs. However, other physiological signals such as nasal respiration, EEGs, and sulfur dioxide can positively contribute to predictive accuracy when implementing several neural architectures.
As indicated throughout Tables 1-3, the proposed HNNs do not necessarily outperform the state-of-the-art neural networks when analyzing complete datasets, although the HNNs do report similar accuracies. However, once confronted with missing data or modalities, the commonly used CNN, LSTM, CNN-LSTM, and GRU lose a significant amount of accuracy. In comparison, the proposed HNNs report robust results, losing a much smaller amount of accuracy with datasets containing missing data.
The CNN-EN provides the most robust testing accuracy of all of the models implemented on the MIT Polysomnographic Database. Since EN is capable of robustly handling missing data and CNNs read periodically sampled data very well, the CNN-EN borrows the strengths from each individual architecture. This model performed particularly well on messier, noisier data compared to the other tested architectures. For cleaner, complete datasets, such as the MIT CC original dataset, CNN with regularization appears to provide sufficient accuracy; however, CNN-EN boasted similar testing accuracy, while losing minimal accuracy following dropped modalities.
For the MIT CC original dataset, EN and CNN-EN provide higher accuracies while mitigating the volatility of their testing accuracies. Additionally, these models experience a minimal drop in accuracy after removing common and uncommon physiological signals. Therefore, EN and CNN-EN are ideal for arrhythmia datasets that are sparse or abundant in modalities.
Since the MIT CC augmented dataset contained up to 46 modalities for each subject, most models failed to learn. However, the HNNs that incorporated EN continued to learn throughout the fifty epochs of the experiment. EN, CNN-EN, and CNN-GRU-EN provided the most robust results, while providing high accuracies for heartbeat locations.
Future research directions may include implementing a multimodal fusion architecture prior to feature extraction for any of the tested neural architectures, through the use of adaptive, sensor, tensor, or memory fusion networks. Additionally, one can further test the efficacy of the proposed HNNs on other biometric datasets with a variety in size and dimensionality.
It is common that a patient has multiple missing modalities. Therefore, an experiment analyzing the robustness of the state-of-the-art neural architectures and the proposed HNNs on biometric datasets when multiple modalities are dropped may reveal more interesting findings.
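A possible layout for such a multi-modality experiment is sketched below. It assumes that a dropped modality is simulated by zeroing its channel, that inputs are stored as an array of shape (windows, modalities, samples), and that a trained model is available as a `predict` callable; all of these, along with the function names and the toy classifier, are hypothetical and are not the procedure used in the experiments.

```python
from itertools import combinations

import numpy as np


def drop_modalities(signals, dropped):
    """Return a copy of `signals` with the given modality channels zeroed."""
    out = signals.copy()
    out[:, list(dropped), :] = 0.0
    return out


def multi_drop_report(signals, labels, predict, names, max_dropped=2):
    """Percentage-point change for every set of up to `max_dropped` modalities."""
    baseline = 100.0 * np.mean(predict(signals) == labels)
    report = {}
    for k in range(1, max_dropped + 1):
        for idx in combinations(range(len(names)), k):
            preds = predict(drop_modalities(signals, idx))
            accuracy = 100.0 * np.mean(preds == labels)
            report[tuple(names[i] for i in idx)] = float(accuracy - baseline)
    return report


# Toy data (hypothetical): 4 windows, 3 modalities, 2 samples per window.
names = ["ECG", "BP", "EEG"]
labels = np.array([1, 0, 1, 0])
signals = np.zeros((4, 3, 2))
signals[[0, 2], 0, :] = 1.0   # only the first channel carries the label
signals[:, 1, :] = 0.3
signals[:, 2, :] = 0.7


def toy_predict(x):
    """Stand-in classifier that thresholds the mean of the first channel."""
    return (x[:, 0, :].mean(axis=1) > 0.5).astype(int)


report = multi_drop_report(signals, labels, toy_predict, names)
for dropped, change in report.items():
    print(dropped, change)
```

The number of subsets grows combinatorially with `max_dropped`, so for a dataset with 46 modalities the subsets would likely need to be sampled rather than enumerated.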
Since the EN and CNN-EN neural networks can be implemented on messier datasets with missing modalities, previously collected databases that were overlooked due to missing data can be revisited to gain new insights.
If one could identify missing block-wise data and report these locations to the HNNs, testing accuracies may improve. By telling the HNNs that incorporate EN which modalities are missing, testing accuracy may further improve. A more computationally expensive method for improving accuracy could be individually training a neural network on each modality and using their outputs as inputs for an embracement layer.
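The idea of reporting missing modalities to the fusion step can be sketched as follows. This is a minimal illustration that assumes an EmbraceNet-style embracement step: each modality-specific network outputs a vector of the same length, and each output feature is taken from one modality drawn at random among those flagged as available. The function name, the availability mask, and the toy vectors are hypothetical and do not reproduce the layer used in the experiments.

```python
import numpy as np


def embrace(docked, available, rng, probs=None):
    """Fuse per-modality feature vectors, skipping modalities flagged as missing.

    docked:    array of shape (n_modalities, c), one vector per modality
    available: boolean array of shape (n_modalities,)
    probs:     optional selection weights per modality (uniform if None)
    """
    docked = np.asarray(docked, dtype=float)
    available = np.asarray(available, dtype=bool)
    n_modalities, c = docked.shape
    weights = np.ones(n_modalities) if probs is None else np.asarray(probs, dtype=float)
    # Missing modalities receive zero selection probability.
    weights = np.where(available, weights, 0.0)
    if weights.sum() <= 0:
        raise ValueError("at least one available modality needs a positive weight")
    weights = weights / weights.sum()
    # For each of the c feature positions, draw the modality that supplies it.
    chosen = rng.choice(n_modalities, size=c, p=weights)
    return docked[chosen, np.arange(c)]


rng = np.random.default_rng(0)
# Toy outputs of three modality-specific networks (hypothetical values).
docked = np.array([[1.0, 1.0, 1.0, 1.0],
                   [2.0, 2.0, 2.0, 2.0],
                   [3.0, 3.0, 3.0, 3.0]])

# The second modality is reported as missing, so it never contributes.
fused = embrace(docked, [True, False, True], rng)
# With a single available modality, the fused vector equals its output.
only_second = embrace(docked, [False, True, False], rng)
print(fused, only_second)
```

Because the mask only changes the selection probabilities, the fused vector keeps the same length regardless of how many modalities are missing, which is what allows the downstream layers to stay unchanged.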
ACKNOWLEDGMENT
(Michael R. Schwob and Aeren Dempsey contributed equally to this work.)
MICHAEL R. SCHWOB is currently pursuing the bachelor's degree in mathematics with the University of Nevada, Las Vegas. Due to his diverse research experience in astrophysics, bioinformatics, big data, and economics, he was named a 2019 Goldwater Scholar. As a Research Assistant with the Big Data Hub, University of Nevada, Las Vegas, he explores his research interests in spatio-temporal statistics, Bayesian analysis, game theory, biostatistics, hypergraph theory, machine learning, and ecological modeling.
AEREN DEMPSEY is currently pursuing the bachelor's degree in computer science with the University of Nevada, Las Vegas. He is also a Research Assistant with the Big Data Hub, University of Nevada, Las Vegas. His main research focuses on machine learning, data science, and deep learning.
FELIX ZHAN is currently pursuing the degree with the Ed W. Clark High School, Las Vegas, NV, USA. He is also a non-degree seeking student at the University of Nevada, Las Vegas. He has various work and volunteer experiences, including being the President and a Student
JUSTIN ZHAN (Member, IEEE) is an ARA Scholar and a Professor of data science at the Department of Computer Science and Computer Engineering, College of Engineering, University of Arkansas. He is also a Professor at the Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences. His research interests include data science, biomedical informatics, artificial intelligence, information assurance, and social computing. He has published 240 articles in peer-reviewed journals and conferences and delivered more than 30 keynote speeches and invited talks. He has been involved in more than 50 projects, as a principal investigator (PI) or a Co-PI, which were funded by the National Science Foundation, Department of Defense, and National Institute of Health. He was the Steering Chair of the IEEE International Conference on Social Computing (SocialCom), the IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT), and the IEEE International Conference on BioMedical Computing (BioMedCom). He has been the Editor-in-Chief of the International Journal of Information Privacy, Security and Integrity and International Journal of Social Computing and Cyber-Physical Systems. He has served as the conference general chair, the program chair, the publicity chair, the workshop chair, and a program committee member for 150 international conferences; he has also served as the editor-in-chief, an editor, an associate editor, a guest editor, an editorial advisory board member, and an editorial board member for 30 journals.
ASIF MEHMOOD is a Senior Research Engineer at Air Force Research Laboratory (AFRL). He serves as a senior technical specialist and a researcher with extensive experience in developing and implementing algorithms in signal processing, pattern recognition, image processing, and machine learning. His primary research focuses on target detection, pattern recognition, target tracking, and classification. He is currently working on several deep learning-based projects, such as super resolution style transfer employing generative adversarial networks.