Evaluation of Capacitive ECG for Unobtrusive Atrial Fibrillation Monitoring

Unobtrusive collection of vital signs using sensors embedded in beds, chairs, and automobile seats can longitudinally monitor patients for abnormal heart conditions outside of the hospital to inform both preventative and postdiagnosis care. The capacitive electrocardiogram (cECG) shows potential for collecting electrical information about a patient's heart without requiring the skin contact that a conventional electrocardiogram (ECG) needs. However, motion artifacts and environmental factors easily corrupt cECG signal quality and reduce the diagnostic value of unobtrusively collected cECG. To evaluate the atrial fibrillation (AF) screening performance of cECG compared to ECG, we conducted three experiments on the clinical UnoViS dataset, which consists of 55 min of concurrent ECG and cECG signals from 92 patients with manual clinician annotations of AF. First, we trained and evaluated models to detect AF events on cECG and ECG separately. Then, we trained a model using ECG and evaluated it on cECG to measure the interchangeability of the ECG and cECG domains. For each experiment, five-fold subjectwise, class-stratified cross-validation was used to assess trained algorithm performance on hold-out test sets, and three different algorithmic methodologies were assessed. Although the ECG-trained model evaluated on cECG data (AUC: 0.874 ± 0.067, F1-score: 0.553 ± 0.148) performed slightly worse than when evaluated on ECG, it performed better than the model trained and evaluated on cECG alone, which suggests that models trained on ECG data can effectively screen for similar conditions in cECG data.


I. INTRODUCTION
Heart rhythm disorders, such as arrhythmia, were listed as a cause of death for more than half a million people in 2018 [1], and it is projected that 12.1 million people may have atrial fibrillation (AF) by 2030 [2]. Early detection and treatment of AF is critical to improving clinical outcomes [3]. Specifically, patients with incidentally detected asymptomatic AF (AF detected after screening) have an increased risk of cardiovascular and all-cause mortality compared with patients with typical AF symptoms [4].
Heart rhythm disorders are routinely detected via an electrocardiogram (ECG), which records the difference in electric potential between electrodes placed on the skin surface. The electric signals generated by the sinoatrial node and propagated through the heart muscles are picked up by the electrodes and converted to a digital signal [5]. When abnormalities occur in the electrical pulses of the heart, such as during AF, corresponding changes can be observed in recorded ECG signals and used by clinicians for diagnosis.
However, long-term screening for arrhythmia requires nonintrusive sensors that can be incorporated into a patient's life more comfortably [6]. First introduced in 1969 [7], the capacitive ECG (cECG) allows for measurement of electrical signals without requiring skin contact, making cECG well-suited for long-term monitoring of ECG through electrodes embedded in everyday objects, such as car seats and beds [8]. Previous work, such as Weil et al. [9], has concluded that cECG signals could identify ST-elevation myocardial infarction in patients. Bhardwaj et al. found coherence between measured cECG and ECG signals collected from sensors embedded in a car seat during driving.
Corresponding author: Emily Wittrup (e-mail: ewittrup@umich.edu). Associate Editor: K. Ozanyan. Digital Object Identifier 10.1109/LSENS.2023.3315223
Despite the convenience of noncontact electrodes for long-term monitoring, cECG signals are more susceptible to motion and static electrical artifacts than skin-contact ECG signals [8]. Previous work by Czaplik et al. [10] investigated the discrepancy between clinician annotations of cECG and ECG signals recorded in the seated position. Although high correspondence was achieved for clinician-labeled AF between signal modalities, a significant number of collected cECG signals were discarded due to artifacts from environmental factors.
Recent work has attempted to overcome the volatile nature of the cECG modality through methods such as denoising [11] and signal fusion [12]. Current cECG-based comparisons are also limited by the dearth of publicly available unobtrusive signal databases as compared with large clinical ECG databases, such as PhysioNet [13] and MIT-BIH [14]. A potential solution to this lack of data is to train diagnostic algorithms on the cleaner and more abundant ECG modality to learn more accurate representations of cardiac conditions and then apply the models to the diagnosis of cECG signals.
In this letter, we compare the AF detection performance of models trained and evaluated on a single modality (ECG or cECG) with the evaluation performance of a model trained on ECG and evaluated on cECG data.

II. DATASET
The UnoViS dataset contains unobtrusive signals, including photoplethysmogram, ECG, and cECG, collected from diverse scenarios, such as driving, bedside, and clinical environments [15]. In this letter, we utilized the "UnoViS clin2009" dataset, which contains 55 min of single-lead cECG and ECG recordings simultaneously collected from 92 seated patients, along with AF annotations from two clinicians [10]. ECG was recorded at a 500 Hz sampling rate, while cECG was recorded at a 125 Hz sampling rate. Patient age was 64.3 ± 21.9 years, with an average recording time per patient of 35.89 s. Details about clinical design, sensor make-up, and environmental controls can be found in [10] and [15].
The raw ECG and cECG signals were preprocessed by first upsampling the cECG signals to 500 Hz to match the sampling rate of the ECG recordings. Then, a fourth-order Butterworth bandpass filter with cutoffs at 0.5 and 40 Hz was applied to both the ECG and cECG signals to remove noise. Baseline wander was then removed using a double median filter with window lengths of 0.2 and 0.6 times the sampling frequency.
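The preprocessing chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the use of SciPy, polyphase resampling, zero-phase filtering, and the helper names `odd` and `preprocess` are all assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, medfilt, resample_poly, sosfiltfilt

FS_ECG = 500   # Hz, ECG sampling rate stated in the text
FS_CECG = 125  # Hz, cECG sampling rate stated in the text


def odd(n: int) -> int:
    """Return n if it is odd, else n + 1 (medfilt needs an odd kernel size)."""
    return n if n % 2 == 1 else n + 1


def preprocess(signal, fs_in: int, fs_out: int = FS_ECG) -> np.ndarray:
    """Upsample to fs_out, bandpass 0.5-40 Hz, then remove baseline wander."""
    x = np.asarray(signal, dtype=float)
    if fs_in != fs_out:
        # Assumption: polyphase resampling; the text does not name a method.
        x = resample_poly(x, up=fs_out, down=fs_in)
    # Butterworth bandpass designed with N=4. Assumption: zero-phase
    # (forward-backward) filtering, which the text does not specify.
    sos = butter(4, [0.5, 40.0], btype="bandpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x)
    # Two cascaded median filters estimate the baseline, which is subtracted.
    baseline = medfilt(x, kernel_size=odd(round(0.2 * fs_out)))
    baseline = medfilt(baseline, kernel_size=odd(round(0.6 * fs_out)))
    return x - baseline
```

For example, `preprocess(cecg, FS_CECG)` would return a cECG trace on the 500 Hz grid, while `preprocess(ecg, FS_ECG)` skips the resampling step.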
We divided each patient's recordings into 5-s intervals of corresponding cECG and ECG signals. Each interval was normalized to be in the range (−1, 1). Certain signal intervals within the dataset were manually observed to be flat or to have no observable R peaks. We removed uninformative signal intervals that contained more than 3 s of flat signal or fewer than two annotated R peaks, resulting in complete removal of signals from two patients. The resulting numbers of patients and 5-s intervals are shown in Table 1.
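A minimal sketch of the segmentation, normalization, and interval-rejection steps is given below. Non-overlapping windows, min-max scaling, and the sample-to-sample definition of "flat" are assumptions; the function names are hypothetical.

```python
import numpy as np

FS = 500   # Hz, common sampling rate after upsampling
WIN_S = 5  # seconds per interval, as stated in the text


def segment(x, fs: int = FS, win_s: int = WIN_S) -> np.ndarray:
    """Split a 1-D recording into non-overlapping windows, dropping the tail."""
    n = fs * win_s
    x = np.asarray(x, dtype=float)
    n_win = len(x) // n
    return x[: n_win * n].reshape(n_win, n)


def normalize(seg: np.ndarray) -> np.ndarray:
    """Min-max scale one interval to [-1, 1]; a constant interval maps to 0."""
    lo, hi = seg.min(), seg.max()
    if hi == lo:
        return np.zeros_like(seg, dtype=float)
    return 2.0 * (seg - lo) / (hi - lo) - 1.0


def flat_seconds(seg: np.ndarray, fs: int = FS, tol: float = 1e-6) -> float:
    """Duration (s) of samples that do not change from the previous sample."""
    return float(np.sum(np.abs(np.diff(seg)) < tol)) / fs


def keep_interval(seg, r_peaks, fs: int = FS) -> bool:
    """Keep an interval with at most 3 s of flat signal and at least 2 R peaks."""
    return bool(flat_seconds(seg, fs) <= 3.0 and len(r_peaks) >= 2)
```

In this sketch, the rejection rule is applied per interval, so a patient is dropped entirely only when every one of their intervals fails the check.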

III. METHODS
Three diagnostic algorithms for AF were implemented and compared on this dataset, each discussed in more detail in the following sections: a probabilistic deterministic finite-state automata (PDFA)-based algorithm [16], a convolutional neural network and long short-term memory network (CNN-LSTM) deep-learning model [17], and a random forest (RF) classifier using heart rate variability (HRV) features extracted as in Asl et al. [18].
We formulated AF detection as a binary classification problem and trained and evaluated the algorithms using the proposed methodology in Fig. 1. For each algorithm, we trained on one portion of the dataset and evaluated on a hold-out test set consisting of the remainder. During training, we used subjectwise class-stratified five-fold cross-validation (CV), in which patients are randomly sorted into folds such that the class ratio is balanced between folds. A hyperparameter grid search is performed over the five-fold CV, and the best combination of parameters is selected based on the averaged validation F1-score. We then use an ensemble of the five trained models from the best CV configuration for evaluation on the test set. This procedure is repeated 10 times over 10 random train-test splits to reduce the impact that random splitting has on performance results. Final reported results are the mean ± standard deviation (SD) of the test results across all 10 repetitions. This process was designed to address the potential shortcomings of a relatively small dataset in terms of predictive power and generalizability.
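One way to realize subjectwise, class-stratified fold assignment is sketched below. It assumes that each patient carries a single class label and that a round-robin deal within each class is an acceptable balancing strategy; the input format and function name are hypothetical.

```python
import random
from collections import defaultdict


def subjectwise_stratified_folds(patient_labels, k=5, seed=0):
    """Assign each patient to one of k folds, balancing classes across folds.

    patient_labels: dict mapping patient id -> class label (hypothetical format).
    Returns a list of k lists of patient ids. Because whole patients are
    assigned, no patient contributes intervals to more than one fold.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pid, label in sorted(patient_labels.items()):
        by_class[label].append(pid)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        pids = by_class[label]
        rng.shuffle(pids)
        # Deal patients round-robin, continuing where the previous class
        # stopped so that overall fold sizes also stay balanced.
        for i, pid in enumerate(pids):
            folds[(offset + i) % k].append(pid)
        offset += len(pids)
    return folds
```

Splitting by patient rather than by interval keeps intervals from the same recording out of both the training and validation folds, which would otherwise inflate validation scores.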

A. PDFA-Based Algorithm
A PDFA is a finite-state automaton that stochastically generates sequences of letters from a fixed alphabet [19]. Each state has a transition probability for each unique letter in the alphabet, which determines the next state. The PDFA model for arrhythmia prediction by Li et al. [16] first encodes ECG signals into a probabilistic string with symbols roughly corresponding to important subsections of a QRS complex, as can be seen in Fig. 2. A separate frequency prefix tree (FPT) of observed combinations of symbols is then constructed from the training set of encoded signals for each of the AF and control cases. After pruning the FPTs based on a tuned cutoff frequency parameter C, new signals are classified through expected likelihood comparison between the two PDFA distributions. In this letter, we grid search over the parameter C for both the AF and control PDFAs using suggested ranges from [16].
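To make the FPT idea concrete, the toy sketch below counts prefixes of already-encoded symbol strings, prunes prefixes seen fewer than C times, and compares log-likelihoods between two trees. It is a simplified illustration only and not the algorithm of Li et al. [16]: the symbol encoding step is omitted, and the probability floor for pruned or unseen transitions is an assumption.

```python
import math
from collections import Counter


def build_fpt(strings, cutoff):
    """Count every prefix of every training string; prune rare prefixes."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) + 1):
            counts[s[:i]] += 1
    return {prefix: c for prefix, c in counts.items() if c >= cutoff}


def log_likelihood(s, fpt, floor=1e-6):
    """Sum of log transition probabilities along the path spelled by s."""
    ll = 0.0
    for i in range(len(s)):
        parent = fpt.get(s[:i])
        child = fpt.get(s[: i + 1])
        if parent is None or child is None:
            # Assumption: pruned or unseen transitions get a small floor.
            ll += math.log(floor)
        else:
            ll += math.log(child / parent)
    return ll


def classify(s, fpt_af, fpt_control):
    """Label a string by whichever tree assigns it the higher likelihood."""
    if log_likelihood(s, fpt_af) > log_likelihood(s, fpt_control):
        return "AF"
    return "control"
```

In this toy version, raising the cutoff C removes more of the rarely observed branches, which trades sensitivity to unusual patterns for robustness to noise in the encoded strings.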
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

B. CNN-LSTM Deep-Learning Algorithm
For each 5-s ECG and cECG interval, a spectrogram is computed using a Tukey window of length 64, shape parameter 0.25, and 50% overlap. A log transform is then computed over the spectrogram. For the model architecture, we used the CRNN proposed by Zihlmann et al. [17], composed of blocks of six consecutive Conv2D, batch normalization, and ReLU activation layers. Parameter selection and details on the dropout burst and random resampling data augmentation methods follow the procedures in [17]. The final layer of the model is changed to a dense layer with one output using sigmoid activation to fit the binary classification problem. Loss weights for model training are set to the ratio between AF and control samples in the training set.
No grid search was conducted for the CNN-LSTM model due to time limitations. During five-fold CV, the validation set is used for early stopping and learning rate scheduling. The five trained CNN-LSTM models from CV are evaluated on the hold-out set as an ensemble.
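The log-spectrogram input described above could be computed with SciPy as in the sketch below. The window length, shape parameter, and overlap come from the text; the use of `scipy.signal.spectrogram`, the natural logarithm, and the small constant `eps` that guards against `log(0)` are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

FS = 500  # Hz, common sampling rate after upsampling


def log_spectrogram(x, fs: int = FS, eps: float = 1e-10):
    """Log-power spectrogram with a Tukey(0.25) window of 64 samples, 50% overlap."""
    f, t, sxx = spectrogram(
        np.asarray(x, dtype=float),
        fs=fs,
        window=("tukey", 0.25),
        nperseg=64,
        noverlap=32,
    )
    # eps is an assumption: it avoids taking the log of exact zeros.
    return f, t, np.log(sxx + eps)
```

Each 5-s interval (2500 samples at 500 Hz) then becomes a two-dimensional frequency-by-time array that can be fed to the convolutional layers.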

C. HRV Features + RF Algorithm
For each 5-s signal in the dataset, the Pan-Tompkins algorithm [20] is used for automatic annotation of R peak locations. A set of 17 unique HRV features is then extracted based on the R peak locations. Besides the seven linear time-domain and seven nonlinear features listed in [18], three additional linear frequency features are added: the individual high-frequency (HF) and low-frequency (LF) values, and the LF/HF ratio. Extracted features are fed into the RF classifier implemented in the scikit-learn package in Python [21]. The classifier is set to default hyperparameters except for the number of trees (200) and the number of candidate variables considered at each split (grid searched in the range 1-17).
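As an illustration of how HRV features follow from R peak locations alone, the sketch below computes a few common time-domain quantities. Whether these coincide with the seven time-domain features of [18] is not stated here, so the feature choice and the function names are assumptions.

```python
import math
import statistics


def rr_intervals_ms(r_peaks, fs=500):
    """Convert R peak sample indices into successive RR intervals (ms)."""
    return [1000.0 * (b - a) / fs for a, b in zip(r_peaks, r_peaks[1:])]


def hrv_time_features(r_peaks, fs=500):
    """Compute a few standard time-domain HRV features from R peak indices.

    Needs at least three R peaks (two RR intervals) so that the spread and
    successive-difference statistics are defined.
    """
    rr = rr_intervals_ms(r_peaks, fs)
    if len(rr) < 2:
        raise ValueError("need at least three R peaks")
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    mean_rr = statistics.fmean(rr)
    return {
        "mean_rr_ms": mean_rr,
        "sdnn_ms": statistics.stdev(rr),  # SD of RR intervals
        "rmssd_ms": math.sqrt(statistics.fmean([d * d for d in diffs])),
        "mean_hr_bpm": 60000.0 / mean_rr,
    }
```

Because only the R peak indices enter these formulas, distortions of the P and T waves leave the features unchanged as long as the R peaks are still detected.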

IV. EXPERIMENTS AND RESULTS
In Experiment 1 and Experiment 2, we compare the binary classification performance of ECG versus cECG across the three AF diagnosis algorithms, training and testing on a single data modality. In Experiment 3, we train on ECG and evaluate performance on the cECG modality.
Significance tests between each pair of models in each experiment are performed using a paired Student's t-test on the result distributions from the 10 repeated CVs. The t-test is the corrected resampled t-test first proposed by Nadeau and Bengio [22], which reduces Type-I error. We adjust for multiple significance tests using the Benjamini-Hochberg procedure [23]. The same significance test procedure is applied to all performance metrics.
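The two statistical steps above can be sketched in plain Python. The corrected resampled statistic follows the Nadeau and Bengio form, in which the variance of the per-repetition differences is inflated by the test-to-train size ratio; the function names and the example inputs are assumptions.

```python
import math
import statistics


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for paired per-repetition differences.

    diffs: metric differences between two models, one per repetition.
    n_train, n_test: training and test set sizes used in each repetition.
    """
    k = len(diffs)
    mean_d = statistics.fmean(diffs)
    var_d = statistics.variance(diffs)  # sample variance, k - 1 denominator
    return mean_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank r with p_(r) <= r * alpha / m.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected
```

A two-sided p-value for the statistic would then be read from a Student's t distribution with k − 1 degrees of freedom, and the resulting p-values for all model pairs and metrics would be passed to `benjamini_hochberg`.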

A. Experiment 1: Train and Evaluate on ECG
As shown in Table 2, compared to the PDFA and CNN + LSTM models, the HRV + RF algorithm performs significantly better in AUC (0.885 ± 0.084) and F1-score (0.583 ± 0.140). HRV + RF also performs better in all other metrics; however, the improvement is not statistically significant. Performance across the 10 random splits shows a high SD for all three diagnostic algorithms, particularly in F1-score.

B. Experiment 2: Train and Evaluate on cECG
The HRV + RF algorithm outperforms both other models in all metrics, with a significant increase in AUC (Table 2). The switch to cECG modality resulted in a performance drop for all models in Experiment 2 compared to Experiment 1. Specifically, HRV + RF had a 0.04 drop in mean F1-score and an overall increase in performance instability, measured by SD, across repetitions when compared with Experiment 1.

C. Experiment 3: Train on ECG and Evaluate on cECG
The HRV + RF model still outperforms both other models across all metrics and has significantly better mean AUC and F1-score, 0.874 ± 0.067 and 0.553 ± 0.148, respectively (Table 2). Compared with Experiment 2, HRV + RF had a slight increase in most performance metrics and a general decrease in SD. Performance is still lower than in Experiment 1, but with only a 0.03 gap in mean F1-score and a 0.016 gap in mean precision. PDFA mean performance for all metrics also improved in Experiment 3 over Experiment 2. In contrast to the other models, CNN + LSTM performed worse in Experiment 3 than in Experiment 2, with a decrease in mean F1-score and precision.

V. DISCUSSION
The highly imbalanced nature of the dataset (59 AF intervals compared to 410 control) means that the F1-score and precision metrics serve as the best basis for comparison between experiments and models.
Across all three evaluated models, training and testing on the cECG modality in Experiment 2 results in the lowest AF detection performance. This matches the consensus in the literature that cECG signals are more susceptible to noise and do not contain as much information as traditional ECG.
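A short worked example shows why accuracy is uninformative at this class ratio. Using the interval counts quoted above, a hypothetical classifier that labels every interval as control reaches high accuracy while detecting no AF at all; the helper function is illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


N_AF, N_CONTROL = 59, 410  # interval counts quoted in the text

# Hypothetical baseline: predict "control" for every interval.
tp, fp, fn, tn = 0, 0, N_AF, N_CONTROL
accuracy = (tp + tn) / (N_AF + N_CONTROL)  # about 0.874
precision, recall, f1 = precision_recall_f1(tp, fp, fn)  # all 0.0
```

The baseline's accuracy of roughly 0.87 would look competitive, whereas its F1-score of zero correctly reflects that it never identifies an AF interval.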
When comparing Experiments 1 and 2, the HRV + RF model only suffered a drop of 0.04 in F1-score, compared with larger drops in F1-score by PDFA and CNN + LSTM. From Table 2, the HRV + RF model remained relatively invariant to data modality changes compared to the other two models, with a mean F1-score around 0.55. This may be due to the inherent robustness of the HRV + RF model's feature format to noise. The loss in information and increased noise in cECG adversely affect models that depend on the full cECG time series for analysis. For example, the PDFA model encodes the full 5-s sequence into symbolic labels and learns to classify AF using characteristics of the QRS waveform. Likewise, the CNN + LSTM uses log spectrograms of the 5-s sequence as input. HRV features, however, only depend on accurate R peak detection. Signal artifacts in cECG have less of an impact on R peak detection than on more sensitive wave morphology, such as P and T waves.
Performance for PDFA and HRV + RF improved from Experiment 2 to Experiment 3, where ECG data were substituted for cECG during training, suggesting that the relative noisiness of cECG compared to ECG makes it harder for algorithms to learn how to diagnose AF. In contrast, CNN + LSTM performance decreased slightly, perhaps due to the model's complexity, which makes it prone to overfitting, especially on such a small dataset.

VI. CONCLUSION
In conclusion, we observed that simpler algorithms detecting AF from cECG signals benefited from training on higher-quality ECG rather than on cECG, which is prone to noise and artifacts. This suggests that models trained on large ECG databases can learn informative features for the diagnosis of AF, which can be applied to noninvasive cECG signals for longitudinal patient monitoring of heart rhythm disorders. Future work would ideally address a limitation of this letter by expanding to a larger, balanced dataset with signals from patients in a less controlled setting performing a variety of activities. Nevertheless, the results of this letter demonstrate the utility of transfer learning for signal-based event predictions. Potential applications that may benefit from this approach include detecting cardiac events with wearable devices or with objects, such as beds or car seats, into which noninvasive sensors can be embedded.

TABLE 1.
Cohort Size After Removing Uninformative Signals

TABLE 2.
Performance Metrics for Each Algorithm and Experiment Across 10 Repetitions Presented as Mean ± Standard Deviation