A Self-Contained STFT CNN for ECG Classification and Arrhythmia Detection at the Edge

Automated classification of the electrocardiogram (ECG) for arrhythmia monitoring is at the core of cardiovascular disease diagnosis. Machine learning (ML) is widely used for arrhythmia detection. Cloud-based inference is the prevailing deployment model for modern ML algorithms, but it does not always meet the availability and privacy requirements of ECG monitoring. Edge inference is an emerging alternative that addresses the concerns of latency, privacy, connectivity, and availability. However, edge deployment of ML models is challenging due to the demanding requirements of modern ML algorithms and the computation constraints of edge devices. In this work, we propose a lightweight, self-contained short-time Fourier transform (STFT) convolutional neural network (CNN) model for real-time ECG classification and arrhythmia detection at the edge. We provide a clear interpretation of the convolutional layer as a finite impulse response (FIR) filter and exploit this interpretation to develop an STFT-based 1D convolutional (Conv1D) layer that extracts the spectrogram of the input ECG signal. The Conv1D output feature maps are reshaped into a 2D heatmap image and fed to a 2D convolutional (Conv2D) CNN for classification. The MIT-BIH arrhythmia database is used for model training and testing. Four model variants are trained and tested on a cloud machine and then optimized for edge computing on a Raspberry Pi device. Weight quantization and pruning techniques are applied to optimize the developed models for edge inference. The proposed classifier achieves up to 99.1% classification accuracy and a 95% F1-score at the edge, with a maximum model size of 90 KB, an average inference time of 9 ms, and a maximum memory usage of 12 MB. These results enable its deployment on a wide range of edge devices for arrhythmia monitoring.

INDEX TERMS Electrocardiogram, machine learning, edge inference, convolutional neural network, interpretable neural network, finite impulse response, short-time Fourier transform.

The intricacy of arrhythmias and their mechanical and clinical interrelationships causes numerous misdiagnoses and cross-classifications when visual criteria are used. Moreover, clinical examination and diagnosis utilizing ECG data by physicians are time-consuming, impractical, and sometimes unavailable in remote places. Automatic arrhythmia beat categorization is thus urgently required for dynamic ECG processing. Electrocardiography is still the most accessible and extensively used method for measuring cardiac electrical activity due to its simplicity, non-invasiveness, and low cost. The electrocardiogram (ECG) represents the electrical activity of the heart and provides vital information about heart function.

Edge deployment of ML models is challenging due to the demanding requirements of modern AI algorithms and the computation constraints of edge devices. Moreover, the criticality of arrhythmia detection to the patient's life necessitates increasing the automatic detection accuracy, which introduces an extra challenge. To address the above challenges, some guidelines have been applied to the proposed classifier. The internationally accepted MIT-BIH arrhythmia database is used for training and testing the ECG classifier. A single lead is employed to capture the ECG signal to facilitate its usage by the patient. Due to its recent advancements, a deep neural network (DNN) model is used for ECG classification. The time-domain sampled ECG signal is fed directly to the DNN model without further preprocessing or feature engineering.
The real-time performance of the proposed model was planned in advance to fit the resource constraints of edge inference. The DNN model is optimized for edge deployment by applying state-of-the-art weight quantization and pruning techniques. Finally, the model is extensively tested on the edge device to verify its functional correctness.

A convolutional neural network (CNN) composed of a cascaded stack of 1D and 2D convolutional (Conv1D and Conv2D) layers and dense layers is developed. We provide a clear interpretation of the 1D convolutional (Conv1D) layer as a finite impulse response (FIR) filter and exploit this interpretation to develop a short-time Fourier transform (STFT) layer that extracts the spectrogram of the input ECG signal. To the best of our knowledge, this is the first work to provide a clear interpretation of the Conv1D layer as a frequency-selective FIR filter. The Conv1D layer kernels are designed as a bank of adjacent FIR band-pass filters (BPFs) acting as an STFT computation engine. The Conv1D feature maps produced by the FIR filter bank are then reshaped into a 2D heatmap image to be fed to a Conv2D CNN classifier. The advantage of this approach over the preprocessing STFT computation stage commonly used in the literature is that it produces a lightweight, self-contained CNN model amenable to edge optimization.

Four model variants are developed, tuned, trained, and tested on a cloud server. The testing results show that the proposed models achieve classification results, including accuracy, recall, precision, and F1-scores, comparable to state-of-the-art ECG classifiers. The developed models are then optimized using post-training quantization and quantization-aware training methods for edge deployment. Finally, the optimized models are tested and benchmarked on a Raspberry Pi device.
The proposed models achieve significant classification results using minimal computation resources, fitting the computational constraints of the edge device.

The main contributions of this work include:

• Advancing a novel CNN topology for time-series data tailored and optimized for edge inference.

• Providing a clear interpretation of the Conv1D layer as a finite impulse response (FIR) frequency-selective filter and visualizing the Conv1D layer feature maps.

• Testing the ECG classifier on an edge device and reporting its real-time performance.

The classical Fourier transform (FT) is also used to obtain the ECG frequency spectral features; however, it can only capture global frequency information decoupled from its time of occurrence. The ambiguity of the FT is overcome by the STFT, in which the FT is repeatedly computed over a fixed-length moving temporal window to provide local time-frequency information, or a spectrogram, of the signal; however, a trade-off arises between time and frequency resolution. The shortcoming of the STFT is overcome by the wavelet transform (WT), in which dilated versions of a mother wavelet are shifted and correlated with the ECG signal to extract a high-resolution time-frequency 2D image called the scalogram of the signal. Both the continuous WT (CWT) and discrete WT (DWT) have been extensively used for ECG preprocessing and feature extraction [7]. The approaches that provided the highest accuracy in the literature used features from the time/frequency domain and the RR interval.
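As an illustration of the STFT described above, the spectrogram of a sampled signal can be computed with a plain windowed-FFT loop. The 64-sample Hamming window, 16-sample hop, and 360 Hz sampling rate below are illustrative choices for this sketch, not parameters taken from this paper.

```python
import numpy as np

def stft_spectrogram(x, win_len=64, hop=16, fs=360.0):
    """Basic STFT: slide a Hamming window over x and FFT each frame.
    Returns (power spectrogram, frequency bins, frame center times)."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power per frame
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    return spec.T, freqs, times                       # (freq, time) grid

# Toy signal at the MIT-BIH sampling rate: a 10 Hz tone should dominate
# the frequency bin closest to 10 Hz in every frame.
fs = 360.0
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * 10.0 * t)
spec, freqs, times = stft_spectrogram(x)
peak_bin = int(spec.mean(axis=1).argmax())
print(spec.shape)
```

The fixed window length is exactly the time-frequency trade-off noted above: a longer window sharpens the frequency bins but blurs onset times.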

For a 1D input ECG signal, the sample points of the heartbeat signal can be used directly as features in 1D CNNs, which are known for their capability of automatic feature extraction. The time-frequency spectrograms and scalograms of ECG segments, obtained using the STFT and CWT, respectively, can also be used as input images to 2D CNNs for feature extraction and classification. 2D CNNs are more prevalent, with well-established models, due to their wide usage in image applications. The advantage of this approach is eliminating the need for cardiology experts and relying on the automatic power of CNNs for extracting the ECG features that maximize the classification accuracy.

The final and most important stage in ECG monitoring is the classification stage. Generally, machine and deep learning methods have been extensively investigated for this task [8], [9]. The most commonly used methods are support vector machines (SVMs) and deep neural networks (DNNs). SVM models with various feature types have been extensively used for ECG classification [3], [5], [6], [10], yet such models suffer from the computational complexity of the SVM algorithm. On the other hand, many recent works proposed various topologies of 1D and 2D CNNs in conjunction with different feature spaces for ECG classification [3], [5], [11], [12], [13]. Alqudah et al. [14] presented a comparative study between different ECG time-frequency representations.

The common theme between these works is using the STFT spectrogram accompanied by 2D CNN models to enhance the ECG classification results. In all related STFT-based ECG classifiers, the STFT computation is carried out as a preprocessing step using a separate computational block, which adds a computation overhead to the classifier model. The main feature of our proposed model is implementing the STFT extraction procedure as a part of the CNN model itself, exploiting the FIR interpretation of the Conv1D layer. Such an approach results in a self-contained classifier model with reduced computational requirements that fits edge inference.

Many recent works other than the STFT-based CNNs have also been proposed for ECG classification and arrhythmia detection. Cui et al. [10] proposed an ECG feature extraction method based on the DWT and Conv1D with Principal Component Analysis (PCA) to reduce the number of features. An SVM classifier is trained on an upsampled subset of the MIT-BIH dataset to address dataset imbalance.

Li et al. [18] presented an ECG classifier based on Incremental Broad Learning (IBL) and biased dropout. Incremental learning is a machine learning paradigm in which the learning process takes place whenever new examples emerge, adjusting what has been learned according to the new examples. Baseline wander filtering and the DWT are used for ECG signal denoising, and morphological and rhythm features such as RR intervals are extracted from the denoised ECG segments. An IBL DNN is used for ECG classification.

Liu et al. [19] proposed several models for ECG classification based on the Wavelet Scattering Transform (WST) and PCA for feature extraction and a variety of neural network (NN), probabilistic NN (PNN), and k-nearest neighbor (KNN) classifiers. The WST builds translation-invariant, stable, and informative signal representations. The WST is implemented with a CNN with preassigned weights that iterates over traditional WT, nonlinear modulus, and averaging operators. In a PNN, the class probability of a new input is estimated, and the Bayesian rule is then employed to allocate the class with the highest posterior probability to the new input. KNN is a non-parametric supervised classification algorithm that classifies test data by measuring its Euclidean distance in the feature space to all labeled training samples, finding the nearest K labels, and assigning the most frequent label to the test data.
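The KNN procedure summarized above (Euclidean distances to all labeled training samples, majority vote among the K nearest) fits in a few lines. The 2D feature vectors and class labels below are toy values for illustration only.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2D features with two classes ("N" = normal, "V" = ventricular).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["N", "N", "V", "V"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # -> N
```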
Mousavi and Afghah [20] proposed an ECG classification model.

Real-time performance analysis is essential for edge deployment of AI models, yet none of the related work presented real-time performance analysis of the developed models or an interpretation of the underlying DNN model. Other parameters such as the model size, number of parameters, memory usage, and training time are neither optimized nor presented in the literature. In this work, we aim to address the above challenges and provide a lightweight, interpretable ML model for ECG classification and arrhythmia detection ready for edge deployment.

The internationally recognized standard ECG databases include the MIT-BIH database, the AHA (American Heart Association) database, the European Community CSE database, and the European ST-T database. The MIT-BIH arrhythmia database is the most commonly utilized and highly acknowledged database in the academic world [3]. The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects between 1975 and 1979 [22], [23]. Twenty-three recordings were chosen at random from a set of 4000 24-hour ambulatory ECG recordings collected from a mixed population of inpatients (about 60%) and outpatients (about 40%); the remaining 25 recordings were selected from the same set to include less common but clinically significant arrhythmias that would not be well represented in a small random sample. The recordings were digitized at 360 samples per second per channel with 11-bit resolution over a 10 mV range. Two or more cardiologists independently annotated each record; disagreements were resolved to obtain the computer-readable reference annotations for each beat (approximately 110,000 annotations in all) included with the database. A total of 15 annotation types are assigned to the R-peaks of the ECG beats. According to the standard developed by the Association for the Advancement of Medical Instrumentation (AAMI) [24], 17 recommended arrhythmia categories are classified into 5 essential groups or superclasses. Following the AAMI-recommended practice, the four paced recordings are not used. The AAMI standard emphasizes the problem of distinguishing ventricular ectopic beats (VEBs) from non-ventricular ectopic beats, and hence normal and arrhythmic beats are remapped to the five AAMI heartbeat classes.
We followed the AAMI standard, which is commonly used in the literature [3], aiming to standardize the evaluation process from a clinical point of view per the AAMI recommendations and to ensure a fair comparison with the results in the related literature. Annotations in the MIT-BIH dataset are mapped to five different beat categories acting as dataset labels following the AAMI standard, as shown in Table 1.
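The symbol-to-superclass remapping can be expressed as a simple lookup table. The grouping below follows the commonly used assignment of MIT-BIH beat symbols to the five AAMI superclasses; it is written from the general convention, not taken from this paper's Table 1, and should be checked against it.

```python
# Conventional mapping of MIT-BIH beat annotation symbols to the five
# AAMI superclasses (N, S, V, F, Q); verify against the paper's Table 1.
AAMI_MAP = {
    "N": "N", "L": "N", "R": "N", "e": "N", "j": "N",  # normal / bundle branch / escape
    "A": "S", "a": "S", "J": "S", "S": "S",            # supraventricular ectopic
    "V": "V", "E": "V",                                # ventricular ectopic
    "F": "F",                                          # fusion
    "/": "Q", "f": "Q", "Q": "Q",                      # paced / unknown
}

def map_beat(symbol):
    """Map one MIT-BIH annotation symbol to its AAMI superclass label."""
    return AAMI_MAP.get(symbol, "Q")  # unmapped symbols fall back to Q here

print([map_beat(s) for s in ["N", "V", "A", "/"]])  # -> ['N', 'V', 'S', 'Q']
```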

Each ECG record of the MIT-BIH Arrhythmia Database includes two leads originating from different electrodes. The most common leads in the database are MLII and V1. To maintain lead consistency, only MLII lead readings are used in this research. Records 102 and 104 have been excluded from the dataset because they do not contain MLII lead readings. The number of beats provided in Table 1 does not include records 102 and 104.

To prepare the dataset for machine learning, the ECG signal is segmented into individual heartbeats. Some related works apply preprocessing stages such as the DWT for noise removal [3]. However, such approaches incur additional computation overhead to the classification model, limiting their usage on resource-constrained edge devices.

The total number of segmented annotated beats is 103,200 after excluding records 102 and 104. Three protocols are proposed in the literature for dividing the MIT-BIH dataset into training and test sets: the intra-patient, inter-patient, and random division schemes [3]. In the intra-patient division scheme, heartbeats from the same patient are used for both training and testing, which biases the evaluation process. In the inter-patient division scheme proposed in [26], the training and test datasets are divided by record numbers such that the heartbeats within each set come from different individuals, eliminating the evaluation bias. In this work, the random division scheme is used to preserve the dataset distribution statistics in both the training and testing sets while eliminating the evaluation bias. The MIT-BIH database is randomly shuffled, stratified, and split into training and test datasets with a splitting ratio of 25%. The training dataset is further split into training and validation datasets with a splitting ratio of 25%. The total numbers of annotated heartbeat examples in the training, validation, and test datasets are 58,050, 19,350, and 25,800, respectively.
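A stratified 25% split of the kind described above can be sketched as follows. The per-class counts in the toy label vector are illustrative and do not reproduce the actual MIT-BIH class distribution, only its 103,200-beat total.

```python
import numpy as np

def stratified_split(y, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) with each class split test_frac/1-test_frac,
    so the class distribution is preserved in both partitions."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])  # shuffle within class
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# 103,200 labeled beats -> 77,400 train+val / 25,800 test (25% split),
# then the 77,400 are split again 75/25 into train and validation.
y = np.repeat(np.arange(5), [88000, 2700, 7000, 800, 4700])  # toy imbalance
trainval, test = stratified_split(y, 0.25)
train, val = stratified_split(y[trainval], 0.25)
print(len(trainval), len(test))
```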

A CNN is a deep learning model for processing data with a grid pattern, such as photographs, that is inspired by the architecture of the human visual cortex and designed to learn spatial hierarchies of features automatically and adaptively, from low- to high-level patterns. A CNN is a mathematical construct made up of three types of layers (or building blocks): convolutional, pooling, and fully connected layers. The first two, the convolutional and pooling layers, extract features, while the third, the fully connected layer, uses the extracted features for classification. The convolutional layer output can be written as

y_k^l = σ(b_k^l + Σ_i w_{i,k}^l * x_i^{l−1}),

where y_k^l is the layer output, b_k^l is the bias of the k-th neuron at layer l, w_{i,k}^l are the kernel weights, x_i^{l−1} are the previous-layer feature maps, * denotes convolution, and σ is the activation function.

The FIR output is defined as

y[n] = F^{−1}{H(ω) · F{x[n]}},

where the operators F and F^{−1} denote the discrete-time Fourier transform (DTFT) and its inverse, respectively. The complex-valued, multiplicative function H(ω) is the filter's frequency response. Figure 2 illustrates an example of the filter impulse response h[n] and power spectrum |H(f)|^2 of an FIR low-pass filter (LPF) for N = 32.
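The FIR view of the convolution operation can be checked numerically: a Conv1D kernel applied with stride 1 and 'valid' padding is a sliding dot product, which is exactly FIR filtering of the input. (Keras-style Conv1D computes cross-correlation, so the taps appear flipped relative to `np.convolve`.)

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Keras-style Conv1D (cross-correlation, 'valid' padding, stride 1):
    a sliding dot product of the kernel against the signal."""
    n = len(kernel)
    return np.array([np.dot(x[i:i + n], kernel) for i in range(len(x) - n + 1)])

rng = np.random.default_rng(1)
x = rng.standard_normal(128)                    # stand-in for a sampled ECG segment
h = np.hamming(16) / np.hamming(16).sum()       # example FIR low-pass taps

y_conv1d = conv1d_valid(x, h)
# FIR filtering is convolution with the taps; flipping the kernel turns
# np.convolve into the same cross-correlation the Conv1D layer computes.
y_fir = np.convolve(x, h[::-1], mode="valid")
print(np.allclose(y_conv1d, y_fir))  # -> True
```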

Consequently, the Conv1D kernel operation is equivalent to applying an FIR frequency-selective filter to the input signal. Furthermore, since the Conv1D kernel works as a sliding window along the time axis, the filter output is a function of time as well. Therefore, the Conv1D kernel output, also known as the feature map, is a time-domain signal that indicates the existence of specific frequency components in the signal at specific time instants. The Conv1D kernel weights can be pre-designed and assigned as non-trainable parameters to apply specific FIR filtering operations, but they can also be trained to learn significant features of the signals, as is usually done in CNNs. Nonetheless, a trainable Conv1D kernel is still interpreted as an FIR filter with learned parameters since the kernel operation is invariant. The Conv1D layer comprises multiple kernel filters that output N_f feature maps, each of length N_s, where N_f is the number of Conv1D filters and N_s is the number of samples of the input signal. The Conv1D feature maps can collectively be grouped and treated as a 2D heatmap that exhibits the time-frequency features of the input signal.

Keras Tuner comes with the Bayesian Optimization, Hyperband, and Random Search algorithms. The three algorithms have been investigated, and the Hyperband algorithm was found to give the best results for the given dataset. Bayesian-optimized models tend to overfit the training set, resulting in significant variance. The Hyperband algorithm combines random search with adaptive resource allocation and early stopping, which accelerates the hyperparameter search process. Hyperparameter search is approached as an optimization problem with the objective of minimizing or maximizing a specific quantity. Usually, maximizing the validation accuracy is the main objective of classification algorithms.
However, due to the class-imbalanced nature of the training dataset, maximizing the model accuracy does not tend to give the best results in terms of detecting irregular heart activities, because of the dominance of the normal class in the dataset. For example, if a classifier is set to predict all beats as normal, it would achieve 86% accuracy, with all normal beats correctly classified and all other beats misclassified. The validation Area Under the Curve of the Receiver-Operating Characteristics (ROC-AUC), the validation recall score, and the validation F1-score have been inspected as optimization objectives. In our experiments, the F1-score with macro averaging tends to give the best results in terms of maximizing the classification accuracy of the minority classes. Unfortunately, the hyperparameters found by Keras Tuner cannot be used directly to develop the edge models because Keras Tuner does not consider the model complexity while searching for the best parameters. The Keras Tuner parameters are instead used as a guideline while developing the classifier models to be exported to the edge device.

2) Manual tuning of the proposed models is conducted to maximize the model F1-score while minimizing the model complexity. In this step, the model hyperparameters, including the number of layers, the regularization layers used, the loss function, the loss optimizer, the dataset class imbalance mitigation method, and the search objective, are drawn from the Keras Tuner step. The number of filters and kernel sizes of the Conv1D and Conv2D layers are manually tuned to apply the filtering operations described in Section IV and reduce the model complexity.
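The imbalance pitfall described above is easy to reproduce: with 86% normal beats, an always-normal classifier scores 86% accuracy yet a very low macro-averaged F1, which is why the macro F1 objective favors the minority classes. The 100-beat label vector below is a toy distribution.

```python
import numpy as np

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(f1s))

# 86% of beats are normal ('N'); a trivial classifier predicts 'N' for all.
y_true = np.array(["N"] * 86 + ["S"] * 4 + ["V"] * 5 + ["F"] * 2 + ["Q"] * 3)
y_pred = np.full_like(y_true, "N")

accuracy = np.mean(y_true == y_pred)   # 0.86, despite zero arrhythmias found
print(round(accuracy, 2), round(macro_f1(y_true, y_pred, "NSVFQ"), 3))
```

Only the normal class contributes a nonzero F1, so the macro average collapses to roughly 0.18, exposing the trivial classifier that plain accuracy hides.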
We designed an adjacent FIR filter bank of N_F BPFs, each of order N = N_k − 1, with equal bandwidths and equally spaced center frequencies between 0 and 64 Hz, using the Hamming window method [28]. Figure 3 shows the frequency response of the FIR filters for N_F = 8 and N_k = 16. A Conv1D input layer is instantiated with N_F filters, each of size N_k, in which the FIR filter coefficients are assigned to the layer kernel weights.

The first model comprises a stack of convolutional and MaxPooling layers, with the ECG signal fed to the input layer, and another stack of dense layers fed with the normalized post- and pre-RR intervals. Outputs from both stacks are then flattened, concatenated, and fed to a dense layer with softmax activation to output the five ECG class probabilities, as shown in Figure 4(a). The second model comprises a Conv1D input layer with Tanh activation for time-frequency feature extraction, followed by a stack of (Conv2D, BatchNormalization, ReLU Activation, and MaxPooling) layers, with the ECG signal fed to the input layer, and another stack of dense layers fed with the post- and pre-RR intervals. Outputs from both stacks are then fed to a dense layer with softmax activation to output the five ECG class probabilities, as shown in Figure 4(b).

The best parameters of the Conv1D FIR input layer are found to be N_F = 8 and N_K = 16. This layer is fixed in both the Conv1D and Conv1D_2D models, and the parameters of the remaining layers are manually tuned to maximize the F1-score and minimize the number of model parameters. Two variants of each model are trained, in which the Conv1D Trainable parameter is switched from False in the Conv1D_2D_T_FIR and Conv1D_T_FIR models to True in the Conv1D_2D_T and Conv1D_T models.
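A Hamming-window band-pass bank matching the earlier description (N_F = 8 equal-width bands spanning 0 to 64 Hz, kernel size N_K = 16, f_s = 360 Hz) can be designed with the windowed-sinc method. The Keras weight-assignment line in the comment assumes a single-channel Conv1D layer and is shown only as a sketch.

```python
import numpy as np

def bandpass_taps(f1, f2, num_taps, fs):
    """Windowed-sinc FIR band-pass design (Hamming window method):
    ideal band-pass = difference of two ideal low-pass impulse responses."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    lp = lambda fc: 2 * fc / fs * np.sinc(2 * fc / fs * n)
    return (lp(f2) - lp(f1)) * np.hamming(num_taps)

fs, n_filters, n_taps = 360.0, 8, 16
edges = np.linspace(0.0, 64.0, n_filters + 1)      # 8 equal-width bands, 0-64 Hz
bank = np.stack([bandpass_taps(edges[i], edges[i + 1], n_taps, fs)
                 for i in range(n_filters)])        # shape (8, 16)

# In Keras, these taps could be assigned to a single-channel Conv1D layer via
#   conv.set_weights([bank.T[:, None, :], np.zeros(n_filters)])
# (kernel shape (n_taps, in_channels=1, n_filters)); this line assumes a
# specific layer configuration and is illustrative only.
print(bank.shape)  # -> (8, 16)
```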
The training process was conducted on a cloud machine featuring 8 CPU cores, 30 GB of RAM, and an NVIDIA Quadro RTX 5000 GPU, hosted by the Paperspace Gradient cloud platform [37]. Experiments are repeated 10 times for each model, and the average results are reported.

Table 2 shows the training and testing results of the proposed models on the cloud machine. The number of parameters, model size, training time, and GPU memory usage during training are listed, and the training and testing accuracy of the developed models is reported. The ROC-AUC, recall, precision, and F1 weighted and macro average scores are presented for the test set only. In macro average scores, class weights are not considered when calculating the average of the individual class scores, unlike the weighted average, which gives higher scores due to considering class weights. The models are tested using the cloud machine's CPU and GPU, and the average inference time and throughput are calculated. Throughput is calculated by dividing the number of test examples by the inference time over the whole test dataset.

The number of parameters is the same for both the trainable and non-trainable Conv1D models, yet the average inference time is greater in the models with non-trainable parameters, which can be attributed to the fact that training the Conv1D layer results in sparse weight tensors that accelerate inference. Comparing the training and test accuracy shows that the variance of all models does not exceed 1%, indicating that the models do not overfit the training dataset and generalize well to the test dataset. Models with the Trainable parameter of the Conv1D layer set to True outperform their counterparts, which indicates that the initial FIR kernel weights have been updated during the backpropagation path of model training.

The Conv1D_2D_T and QAT Conv1D_2D_T_FIR TFLite models lose around 1% of accuracy and 4% of F1-score. On the other hand, the Conv1D_T model loses around 1.5% of accuracy and 6% of F1-score, which gives another advantage to the Conv1D_2D_T model. The same conclusions also apply to the recall and ROC-AUC scores.
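Weight quantization of the kind applied to the edge models maps each float32 tensor to int8 values plus a scale factor. The sketch below is a minimal symmetric per-tensor scheme, independent of the actual TFLite tooling used for the models in this paper.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an approximate float32 tensor from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)  # stand-in weight tensor
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32; the reconstruction error is
# bounded by half a quantization step (scale / 2).
err = np.max(np.abs(w - dequantize(q, scale)))
print(q.dtype)
```

The roughly 4x size reduction from float32 to int8 is consistent with the gap between the 300 KB cloud models and the 90 KB optimized edge models reported later.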

At the performance level, the Conv1D_T_FIR and Conv1D_T models provide the best results, as shown by Figures 6 and 7.

The last two rows of each figure illustrate the Conv1D and Tanh activation layer heatmaps, displayed as 2D mesh plots in which the horizontal axis is the time axis and the vertical axis is the feature-map number, representing the frequency axis for the designed non-trained FIR Conv1D adjacent BPFs shown in Figure 3. The center frequencies of the FIR filters are not ordered incrementally as in the non-trainable Conv1D layer.

The proposed method uses a DWT preprocessing stage, limiting its applicability for edge inference. The best classification results of the model proposed by Liu et al. [19] are achieved by the KNN. However, the testing results are reported for 10-fold cross-validation experiments, not on a separate hold-out test dataset, which does not demonstrate the model's generalization power. Moreover, the proposed method uses a WST feature extraction stage, which limits its applicability for edge inference. The model proposed by Mousavi and Afghah [20] is tested for both the intra- and inter-patient schemes, and the reported scores are superior. Surprisingly, unlike all related works, this model achieves such results without using the RR intervals, which are essential features for ECG classification, raising serious concerns about the presented results. The model has a size of 5.5 MB, and it requires neither computationally intensive preprocessing nor feature extraction stages. Compared to our Conv1D_2D_T model, with a maximum model size of 300 KB (non-optimized) and 90 KB for the optimized edge model, the model size is much larger, which also indicates that its inference time and memory usage will be much greater than our model's.

The method proposed by Raj and Ray [6] is prototyped on an ARM9 embedded platform and experimentally validated on the MIT-BIH arrhythmia database for both the intra- and inter-patient dataset division schemes. The implemented platform is recommended for utilization in hospitals to analyze long-term ECG recordings; however, the model size, memory usage, and performance results are not reported.

As future work, we will attempt to improve the model classification performance. Finally, the developed model will be investigated for other relevant time-series classification problems.