Towards Quantitative Precision for ECG Analysis: Leveraging State Space Models, Self-Supervision and Patient Metadata

Deep learning has emerged as the preferred modeling approach for automatic ECG analysis. In this study, we investigate three elements aimed at improving the quantitative accuracy of such systems. These components consistently enhance performance beyond the existing state-of-the-art, which is predominantly based on convolutional models. Firstly, we explore more expressive architectures by exploiting structured state space models (SSMs). These models have shown promise in capturing long-term dependencies in time series data. By incorporating SSMs into our approach, we not only achieve better performance, but also gain insights into long-standing questions in the field. Specifically, for standard diagnostic tasks, we find no advantage in using higher sampling rates such as 500 Hz compared to 100 Hz. Similarly, extending the input size of the model beyond 3 seconds does not lead to significant improvements. Secondly, we demonstrate that self-supervised learning using contrastive predictive coding can further improve the performance of SSMs. By leveraging self-supervision, we enable the model to learn more robust and representative features, leading to improved analysis accuracy. Lastly, we depart from synthetic benchmarking scenarios and incorporate basic demographic metadata alongside the ECG signal as input. This inclusion of patient metadata departs from the conventional practice of relying solely on the signal itself. Remarkably, this addition consistently yields positive effects on predictive performance. We firmly believe that all three components should be considered when developing next-generation ECG analysis algorithms.


I. INTRODUCTION
Machine learning for ECG analysis. Machine learning, in particular deep learning, has the potential to transform the entire field of healthcare. The electrocardiogram (ECG) is particularly well-suited to lead this development because of its widespread use (in the US, an ECG was ordered or provided at about 5% of office visits [1]). While it only requires basic recording equipment, it holds enormous diagnostic potential that we only gradually start to uncover with the help of machine learning [2]-[5].

Temesgen Mehari is with Physikalisch-Technische Bundesanstalt, Berlin, Germany and Fraunhofer Heinrich Hertz Institute, Berlin, Germany (email: temesgen.mehari@hhi.fraunhofer.de). Nils Strodthoff is with Oldenburg University, Oldenburg, Germany (email: nils.strodthoff@uol.de). Corresponding author: NS. This project (18HLT07 MedalCare) has received funding from the EMPIR programme co-financed by the Participating States and from the European Union's Horizon 2020 research and innovation programme.

Model architectures. On the algorithmic side, the analysis of ECGs based on raw sensor data is still largely dominated by convolutional neural networks [3], [6]-[8]. This default choice is slowly being challenged by the rise of transformer-based architectures or combinations of convolutional architectures with attention elements, as exemplified by the winning solutions of the two past editions of the Computing in Cardiology Challenge [9], [10]. In this work, we explore a novel algorithmic approach, Structured State Space Sequence (S4) models [11], which learn a continuous representation of time series data and are particularly suited for modeling long-term dependencies. S4 models have demonstrated exceptional performance in modeling and analyzing long-term time series data in various domains. By leveraging the inherent sequential dependencies within ECG signals, we anticipate that the application of S4 models to the ECG domain can significantly enhance the performance of existing prediction models. By capturing complex temporal dynamics and long-term dependencies, S4 models have the potential to unlock new insights and achieve more accurate ECG analysis outcomes. We demonstrate consistent improvements in quantitative accuracy through three components: (1) the use of S4 models as internal model architecture, (2) leveraging self-supervised pretraining, and (3) the inclusion of demographic metadata, see also Figure 1 for a visual summary of the achieved results. We discuss all three components in the following paragraphs.

1) More expressive architectures: SSMs. We thoroughly evaluate the model using a well-established benchmarking procedure [6] on the PTB-XL [12]-[14] and Chapman [15] datasets and show consistent improvements over the existing (convolutional as well as attention-based) state-of-the-art. Our main insight is methodological in nature. As a consequence, we deliberately focus on comprehensive ECG classification tasks rather than specific clinical prediction tasks, even though we also envision that data-driven methods will eventually supplement current rule-based decision support systems in ECG devices, see also [16].
Furthermore, we use the model's capability to capture long-term dependencies to systematically investigate long-standing questions in the field, i.e., how long-ranged the interactions in ECG data are that need to be explicitly captured, and whether models actually profit from input data with a higher sampling frequency of 500 Hz as compared to 100 Hz.

Fig. 1: Visual summary (values shown in the figure: 0.934 for the prior best performing model, followed by 0.9417(16), 0.9445(19) and 0.9463(08)): We demonstrate consistent performance improvements over the state-of-the-art through (1) the use of structured state space models (SSMs) as internal model architecture, (2) leveraging self-supervised pretraining, and (3) the inclusion of demographic metadata.
2) Self-supervision. Self-supervised pretraining has exhibited numerous beneficial properties in previous work, enabling models, e.g., to learn robust and data-efficient representations from unlabeled data. Building upon these findings [17], we investigate the applicability of self-supervised pretraining to S4 models. By harnessing self-supervised learning, we aim to further improve the performance and generalization capabilities of S4 models within the ECG domain. While the original work [17] used an LSTM model as the internal model architecture, we show that replacing it with causal S4 layers leads to unprecedented downstream performance on both datasets under consideration.
3) Demographic metadata. Previous work on automatic ECG analysis, at least in the context of deep learning, has largely focused on identifying the most effective ways to extract useful information from the (raw) signal itself. Performing this in a comparable and reproducible fashion is a challenge by itself [6], but can only be the first step in the development of clinical decision support systems. While raw ECG data provides valuable insights, clinicians often rely on accompanying metadata to aid in the diagnostic process. Consequently, we argue that the inclusion of meta-information in the application of machine learning models is essential for comprehensive ECG analysis. This is why we propose to widen the scope of the current benchmarking activities to also consider prediction tasks that include at least basic demographic metadata, which should be available to the clinician in all cases. Again, we demonstrate that the combination of S4 models, self-supervision and the inclusion of demographic metadata leads to unprecedented predictive performance. It is worth noting that this work builds on the material of an earlier (non-archival) workshop contribution [18] and extends it by including results on self-supervised pretraining and the incorporation of patient-specific metadata.

ECG classification
The field of ECG analysis is largely dominated by convolutional architectures, see [19], [20] for recent reviews. The superiority of modern ResNet- or Inception-based convolutional architectures was confirmed in an extensive comparative study on the PTB-XL dataset [6]. This is in line with the excellent performance of such architectures on a broad range of time series classification tasks, see [21]. Interestingly, this supremacy was already challenged in [17], where the convolutional baseline was outperformed by a large recurrent neural network with a fully-connected feature extractor. It is therefore natural to ask whether architectures that are even better adapted to the requirements of time series can lead to further performance improvements.

Structured State Space Models for clinical time series
The motivation for the development of structured state space models (SSMs) was the wish to devise an architecture that is suited to capture long-term dependencies in very long temporal data, including medical time series as a particular example [22]. To support the applicability to the latter, the authors considered a classical vital sign prediction task with ECG and photoplethysmography time series as input [22], [23] and clearly outperformed the then state-of-the-art for this task. These results represent a very encouraging sign for the application of these models in the broader context of medical time series. Nevertheless, the prior study cannot be considered a comprehensive ECG analysis task, which is the topic of this work.
In a different line of work, SSMs were used to model the internal state in diffusion models for time series imputation [24], which led to unprecedented imputation quality for (among others) ECG data. This provides additional hints for the potential advantages of SSMs also in a purely discriminative setting. We therefore aim to investigate the motivating claims for SSMs in the context of ECG data.

Self-supervised learning for ECG data. Driven by the recent success of self-supervised learning in natural language processing [25], speech [26] and most recently also computer vision [27], there have been several studies that applied related techniques in the field of ECG analysis [28]-[33]. Most of these studies show a rather strong methodological focus and clearly demonstrate the advantages of self-supervised pretraining as compared to conventional supervised training. However, as many of the models used tend to be shallow, the corresponding supervised baselines often fail to reach state-of-the-art performance on comprehensive ECG classification tasks, and it therefore remained unclear if and to what degree these improvements would carry over to state-of-the-art models. In [17], using an adaptation of contrastive predictive coding (CPC) [34], which has been established for representation learning in the context of speech, it was shown that self-supervised pretraining can in fact lead to statistically significant performance improvements compared to the state-of-the-art based on supervised training. These performance improvements then directly translate into an improved data efficiency, i.e., the ability to achieve the same level of performance as supervised training while using only 50-60% of the labeled data. The original CPC model, as well as the one used in [17], relied on an LSTM model [35] as predictive model to perform forecasting in latent space, which again raises the question to what extent more powerful predictive models, such as causal SSMs, can further improve these results.

A. Models
Structured State Space Models. Structured State Space Models (SSMs) were introduced in [11], showing outstanding results on problems that require capturing long-range dependencies. The model consists of stacked S4 layers that in turn draw on state-space models, frequently used in control theory, of the form

$$x'(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t), \tag{1}$$

that map a one-dimensional input $u(t) \in \mathbb{R}$ to a one-dimensional output $y(t) \in \mathbb{R}$ mediated through a hidden state $x(t) \in \mathbb{R}^N$ parametrized through matrices $A, B, C, D$. These continuous-time parameters can be mapped to discrete-time parameters $\bar{A}, \bar{B}, \bar{C}$ for a given step size $\Delta$. For an input sequence of length $L$, these allow forming the SSM convolutional kernel

$$K = (\bar{C}\bar{B}, \bar{C}\bar{A}\bar{B}, \ldots, \bar{C}\bar{A}^{L-1}\bar{B}), \tag{2}$$

which allows calculating the output $y$ by a convolution operation, $y = K * u + D u$. One of the main contributions of [11] lies in providing a stable and efficient way to evaluate the kernel $K$. Second, building on HiPPO theory [36], they identify a particular way of initializing the matrix $A \in \mathbb{R}^{N \times N}$ as key to capturing long-range interactions. $H$ copies of such layers, each parametrizing a mapping from $\mathbb{R} \to \mathbb{R}$, are then concatenated and fused through a point-wise linear operation to form an S4 layer mapping from $\mathbb{R}^H \to \mathbb{R}^H$, in close analogy to the architecture of a transformer block, where the SSM convolution serves as a replacement for multi-head self-attention. These $H$ copies can be thought of as $H$ convolutional filters, which are sequence-to-sequence mappings, parametrized through the state-space Equation (1).

Supervised model. The model used for supervised training follows the original S4 architecture [11] and consists of a convolutional layer as input encoder, followed by four S4 blocks which are connected through residual connections interleaved with normalization layers, with a global pooling layer and a linear classifier on top. The S4 blocks comprise the S4 layer accompanied by dropout and GELU activations and a linear layer.
S4 layers are designed to model long-term time series data by capturing the dependencies and patterns in the sequential data. They are based on the state-space formulation, see Equation (1), which describes the evolution of hidden states over time. As mentioned before, the output of the S4 layer can be calculated by a simple convolution with the SSM convolutional kernel, which has been shown to be equivalent to iteratively solving the equation to update the hidden states through time. This allows the model to capture the sequential patterns and relationships within the ECG signals using a highly parallelizable operation. Hence, by leveraging the state-space formulation (Equation (1)), S4 layers enable the modeling of long-term dependencies and complex temporal dynamics in the ECG data. We refer to [11] for details.

Fig. 2: The model used during supervised training follows the original S4 architecture [11]. The model consists of a convolutional layer as input encoder, followed by four S4 blocks which are connected through residual connections with a normalization layer. The prediction is obtained from a linear layer following a mean pooling layer. In the setting where we also use the patient-specific metadata (of PTB-XL), we include it in the model through a meta head (depicted in the dotted box at the bottom) that receives the metadata as input. Its output is concatenated with the pooled feature representation of the signal itself and passed as input to the final classification layer.
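The equivalence between the convolutional and the recurrent view mentioned above can be checked numerically on a toy example. The following is a minimal sketch, not the S4 implementation from [11]: it assumes a diagonal state matrix (instead of the HiPPO initialization), a bilinear discretization, and arbitrary illustrative parameter values; all function names are hypothetical.

```python
# Toy single-input, single-output SSM with a DIAGONAL state matrix.
# Assumptions (not taken from the paper): diagonal A, bilinear discretization,
# illustrative parameter values. This is not the S4 kernel algorithm of [11].


def discretize(a_diag, b, step):
    """Bilinear transform, applied elementwise because A is diagonal."""
    a_bar = [(1 + step * a / 2) / (1 - step * a / 2) for a in a_diag]
    b_bar = [step * bi / (1 - step * a / 2) for a, bi in zip(a_diag, b)]
    return a_bar, b_bar


def ssm_kernel(a_bar, b_bar, c, length):
    """K[l] = sum_n c[n] * a_bar[n]**l * b_bar[n] for l = 0..length-1."""
    return [sum(cn * an ** l * bn for an, bn, cn in zip(a_bar, b_bar, c))
            for l in range(length)]


def apply_convolution(kernel, u, d):
    """y = K * u + D u, written as an explicit causal convolution."""
    return [sum(kernel[j] * u[t - j] for j in range(t + 1)) + d * u[t]
            for t in range(len(u))]


def apply_recurrence(a_bar, b_bar, c, u, d):
    """x_t = A_bar x_{t-1} + B_bar u_t and y_t = C x_t + D u_t."""
    x = [0.0] * len(a_bar)
    y = []
    for ut in u:
        x = [an * xn + bn * ut for an, bn, xn in zip(a_bar, b_bar, x)]
        y.append(sum(cn * xn for cn, xn in zip(c, x)) + d * ut)
    return y


# Illustrative values: two hidden states, step size 0.01 s (100 Hz sampling).
a_diag, b, c, d = [-0.5, -2.0], [1.0, 1.0], [0.3, -0.7], 0.1
u = [float((7 * t) % 5) - 2.0 for t in range(40)]

a_bar, b_bar = discretize(a_diag, b, step=0.01)
y_conv = apply_convolution(ssm_kernel(a_bar, b_bar, c, len(u)), u, d)
y_rec = apply_recurrence(a_bar, b_bar, c, u, d)
max_abs_diff = max(abs(p - q) for p, q in zip(y_conv, y_rec))
```

Both code paths produce the same output up to floating-point error. The convolutional form is the one that parallelizes over the sequence, whereas the recurrence processes one time step at a time.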
The architecture is summarized schematically in Figure 2. We refer to this model as the S4 model. We distinguish causal/masked and bidirectional variants to demonstrate the impact of bidirectional context.
In this work, we also aim to quantitatively explore the impact of including patient-specific metadata on the prediction. To this end, we make use of age, sex, height and weight provided as metadata as part of the PTB-XL dataset. We impute missing values in the sex, height and weight columns using median values inferred from the training set. In all three cases, we include additional binary columns to indicate whether an imputation was applied in the respective columns. We process a total of seven static input features through a simple three-layer multi-layer perceptron (MLP) with ReLU activations and 64 hidden units per layer, which is a negligible number of parameters in comparison to the typically parameter-heavy feature extractors for the raw ECG signals. To improve generalization, we interleave these layers with batch normalization and dropout layers. We concatenate the output of this three-layer neural network with the pooled signal representation extracted from the respective signal classifiers considered in isolation before, and pass the concatenated representation to the final classification layer to obtain the final classification output. Combining the information from signal and metadata only before the final classification layer is an example of an "intermediate fusion" approach in the multimodal learning literature [37]. We leave the exploration of other fusion schemes to future work and focus on demonstrating that the inclusion of patient metadata leads to consistent performance improvements compared to the synthetic benchmarking case of a classifier operating on raw signals alone.

Self-supervised model. The idea of the contrastive predictive framework [34] is to learn informative representations for downstream tasks by solving a forecasting task in latent space. Here, a (causal) prediction model is supposed to predict a latent representation a few time steps ahead from the latent representations observed thus far. We mostly follow the self-supervised learning setup described in [17]. We briefly describe its most important aspects: The signal is processed by a series of four convolutional layers with kernel size 1 and 512 filters. This is an important step since the CPC pretraining is set up as a forecasting task in latent space. Unlike in the speech domain, there is no necessity to reduce the temporal resolution, as the input sampling frequency of 100 Hz is already considerably lower than typical sampling frequencies of 16 kHz in the speech domain. The latent representations defined in this way serve as targets for the forecasting task in the CPC framework. The task is to minimize a contrastive objective, the InfoNCE loss [34],

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\mathbb{E}_X \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right], \tag{3}$$

where $f_k(x_{t+k}, c_t) = \exp(z_{t+k} \cdot \mathrm{MLP}(c_t))$ models the conditional probability $p(x_{t+k} | c_t)$ between the encoded feature representation $g(x_{t+k}) \equiv z_{t+k}$ and the forecast, implemented as a 2-layer MLP operating on the output $c_t$ of a causal predictor model summarizing the sequence up to time step $t$. Here, $g$ refers to the encoder, $x_t$ to the input at time step $t$, $z_t$ to the corresponding latent representation and $X$ to the set of random samples drawn from $p(x_{t+k})$. In our case, we use the output of the SSM predictor as $c_t$. In practice, the negative samples for the evaluation of the denominator in Equation (3) are drawn from a single mini-batch. In this particular case, we even draw the negatives from the same sequence as the positive sample. Due to the nature of the contrastive forecasting task, the model architecture has to be causal, i.e., the predictor has to be an autoregressive model that processes the input in a unidirectional fashion. This generally results in slight performance losses compared to the corresponding bidirectional model in a supervised context, which, however, are typically overcompensated by the self-supervised pretraining step.
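To make the contrastive objective described above concrete, the following minimal sketch evaluates the InfoNCE loss for a single positive sample and a handful of negatives. It is an illustration under assumptions rather than the training code used in this work: the vector `prediction` is a hypothetical stand-in for the MLP output computed from c_t, and the latent vectors are plain Python lists.

```python
import math


def info_nce(prediction, positive, negatives):
    """InfoNCE for one positive: -log( f(pos) / sum_j f(x_j) ), f = exp(z . prediction).

    `prediction` stands in for MLP(c_t); `positive` is z_{t+k}; `negatives`
    are latent vectors of other time steps (hypothetical toy inputs).
    """
    def score(z):
        return sum(zi * pi for zi, pi in zip(z, prediction))

    # The denominator runs over the positive and all negatives.
    logits = [score(positive)] + [score(z) for z in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_denominator - logits[0]


negatives = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
loss_uninformative = info_nce([0.0, 0.0], [1.0, 0.0], negatives)
loss_aligned = info_nce([3.0, -3.0], [1.0, 0.0], negatives)
```

With an uninformative forecast all candidates receive the same score, so the loss equals the logarithm of the number of candidates; a forecast aligned with the positive sample drives the loss towards zero.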
The most important component is the aforementioned predictor model, which aggregates the latent representations seen up to a certain point in time to produce a forecast for the latent representation, in this case 12 steps, i.e., 120 ms, into the future. Unlike in [17], where the authors used a two-layer LSTM model, we use four S4 blocks in this work. The final prediction is provided after processing through two fully connected layers. During finetuning, the latter are discarded and replaced by mean pooling and a linear classification head. It is worth mentioning that this constitutes another difference to [17], where concat pooling and a non-linear classification head were used. However, we follow the training methodology put forward in [17] for the finetuning: we first freeze all weights except those in the classification head and subsequently finetune the whole model, except that we refrain from using discriminative learning rates. A schematic overview of the architecture used during pretraining and finetuning is presented in Figure 3. In the context of self-supervised models, we use the notation LSTM with FCE to refer to the model from [17] and S4 with FCE to refer to the architecture proposed in this work, where FCE stands for fully-connected encoder and refers to the use of four convolutional layers with kernel size 1 as opposed to a single convolutional encoder layer in the original S4 model.
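The metadata preprocessing described earlier in this section (median imputation with statistics from the training set plus binary imputation indicators, yielding seven static features) can be sketched as follows. This is a minimal illustration under assumptions: records are plain dictionaries with hypothetical keys, sex is assumed to be encoded as 0/1, age is assumed to be present, and any feature scaling is left out because the text does not specify it.

```python
from statistics import median

# Columns that receive median imputation and an indicator flag.
FIELDS_WITH_IMPUTATION = ("sex", "height", "weight")


def fit_medians(train_records):
    """Medians are inferred from the training set only."""
    return {
        field: median(r[field] for r in train_records if r[field] is not None)
        for field in FIELDS_WITH_IMPUTATION
    }


def to_features(record, medians):
    """Return [age, sex, height, weight, sex_imputed, height_imputed, weight_imputed]."""
    values = [float(record["age"])]
    flags = []
    for field in FIELDS_WITH_IMPUTATION:
        missing = record[field] is None
        values.append(float(medians[field] if missing else record[field]))
        flags.append(1.0 if missing else 0.0)
    return values + flags


train = [
    {"age": 60, "sex": 0, "height": 170, "weight": 70},
    {"age": 45, "sex": 1, "height": 180, "weight": None},
    {"age": 71, "sex": 1, "height": None, "weight": 90},
]
medians = fit_medians(train)
features = to_features({"age": 50, "sex": None, "height": None, "weight": 65}, medians)
```

The resulting seven-dimensional vector is what a small MLP head would receive before its output is concatenated with the pooled signal representation.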

B. Datasets and experimental procedure
To evaluate the effectiveness of our proposed method, we conducted tests using two large 12-lead ECG datasets that are publicly available. These datasets are known as PTB-XL [12]-[14] and Chapman [15]. In the case of PTB-XL, we followed a previously established benchmarking methodology [6], where the first eight label-balanced folds of the dataset are used for training, the ninth for validation and the tenth for testing in a multi-label classification task. We assessed the performance on a label set consisting of 71 different labels that cover a wide range of diagnostic, form-related, and rhythm-related statements. As performance metric, we use the macro AUC on the test fold. For the Chapman dataset, we focused on the primary annotations that are related to rhythm statements. To ensure a fair comparison with the PTB-XL dataset, we also divided the Chapman dataset into 10 label-balanced folds.
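The evaluation protocol above (folds 1-8 for training, fold 9 for validation, fold 10 for testing, macro AUC as metric) can be sketched in a few lines. This is an illustrative pure-Python version under assumptions: records carry a hypothetical `fold` key, the AUC is computed through the pairwise ranking definition, and labels for which only one class is present are skipped, which is a choice made here and not one stated in the text.

```python
def split_by_fold(records):
    """Folds 1-8: training, fold 9: validation, fold 10: test."""
    train = [r for r in records if r["fold"] <= 8]
    val = [r for r in records if r["fold"] == 9]
    test = [r for r in records if r["fold"] == 10]
    return train, val, test


def binary_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # AUC is undefined if one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of the per-label AUCs (multi-label setting)."""
    aucs = []
    for k in range(len(label_matrix[0])):
        auc = binary_auc([row[k] for row in label_matrix],
                         [row[k] for row in score_matrix])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


records = [{"id": i, "fold": i % 10 + 1} for i in range(50)]
train, val, test = split_by_fold(records)

labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.1]]
example_macro_auc = macro_auc(labels, scores)
```

In the toy example the first label reaches an AUC of 0.75 and the second an AUC of 1.0, so the macro AUC is their unweighted mean.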

IV. RESULTS
SSMs outperform the current supervised state-of-the-art. As mentioned above, state-of-the-art approaches in deep-learning-based ECG analysis mainly rely on modern convolutional architectures. As a representative example, we use a model with xresnet1d50 architecture that was shown to lead to competitive results compared to the state-of-the-art at that time [17]. We also compare it to a recurrent neural network with a fully-connected feature extractor (LSTM with FCE) [17] that showed the best reported supervised performance on PTB-XL to date. In Table II, we compare these models to the S4 model based on comprehensive ECG classification tasks on the PTB-XL and Chapman datasets. Here and in the following, we report mean and standard deviation of the test set scores over 10 runs using a concise error notation where, e.g., 0.9175(39) signifies 0.9175 ± 0.0039. On PTB-XL, the S4 model surpasses all baseline methods in a statistically significant manner, see below for a detailed description. The fact that PTB-XL has developed into a reference dataset for the benchmarking of ECG classification models allows us to compare the proposed method to a variety of other approaches on the dataset. Most notably, the S4 model outperforms recently proposed transformer/attention-based methods such as [41], [42]. However, the results also highlight the necessity of measuring and reporting statistical uncertainties, since only those allow assessing the significance of the reported results. Interestingly, the ranking of the algorithms is largely consistent on the Chapman dataset, but the differences between the approaches are not significant in this case. This might be explained by the fact that all models achieve very high predictive performance on this dataset with macro AUC values beyond 0.98, which is not the ideal situation if one aims to quantify performance differences in a statistically significant manner.
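The concise error notation used above, where 0.9175(39) stands for 0.9175 ± 0.0039, can be produced and parsed mechanically. The helper names below are hypothetical and the four-decimal default is an assumption that matches the values quoted in this section.

```python
import re


def concise(mean, std, digits=4):
    """Format mean and standard deviation as e.g. 0.9175(39)."""
    # Express the uncertainty in units of the last reported digit.
    err = round(std * 10 ** digits)
    return f"{mean:.{digits}f}({err:02d})"


def parse(text):
    """Invert the notation: '0.9175(39)' -> (0.9175, 0.0039)."""
    match = re.fullmatch(r"(\d+\.(\d+))\((\d+)\)", text.replace(" ", ""))
    if match is None:
        raise ValueError(f"not in concise error notation: {text!r}")
    mean = float(match.group(1))
    std = int(match.group(3)) * 10 ** -len(match.group(2))
    return mean, std


formatted = concise(0.9175, 0.0039)
mean, std = parse("0.9463(08)")
```

Parsing tolerates the stray space that sometimes separates the value from the bracket in extracted tables, e.g. "0.9295 (31)".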
Statistical significance. As a final comment on the significance of the improvements achieved, it should be noted that (a) ECG statement prediction on PTB-XL is so far the only widely accepted benchmarking setting for comprehensive ECG prediction, and (b) improvements in a target metric are difficult to assess solely on the basis of their numerical value, but should rather be judged on the significance of the improvements over the previous state-of-the-art. To this end, we have to refine our notion of statistical and systematic uncertainty measures. Following the methodology used in [17], we consider two sources of uncertainty: the statistical uncertainty due to the randomness of the training process, which can be assessed through multiple training runs, and the uncertainty due to the finiteness and the specificity of the label distribution of the test set. We address the latter via empirical bootstrapping ($n_{\mathrm{iter}} = 1000$ iterations) on the test set. Comparing two particular trained models, we consider the performance difference to be statistically significant if the bootstrap 95% confidence interval for the performance difference does not overlap with zero. We address the uncertainty due to the stochasticity of the training process by training $n_{\mathrm{runs}} = 10$ instances of each of the models we aim to compare. For each of the $n_{\mathrm{runs}}^2$ comparisons, we assess the statistical significance via bootstrapping as defined above. Finally, we define a model to perform statistically significantly better/worse in case that 60% of the model comparisons turn out to be statistically significantly better/worse. Like the significance level, this threshold can be chosen at will as long as it exceeds 50% for consistency reasons [43] (otherwise a model could be judged both better and worse at the same time) and is related to the amount of uncertainty one is willing to tolerate due to fluctuations across training runs.
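The two-stage procedure above (a bootstrap confidence interval per pair of trained models, followed by a vote over all pairs of runs) can be sketched as follows. This is a simplified illustration rather than the evaluation code of this work: it uses a single binary label with a pairwise AUC instead of the macro AUC over 71 labels, a percentile bootstrap interval as a stand-in for the empirical bootstrap, and hypothetical function names.

```python
import random


def auc(labels, scores):
    """Pairwise AUC for one binary label (ties count 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_difference_ci(labels, scores_a, scores_b, n_iter=1000, level=0.95, seed=0):
    """Percentile interval for AUC(a) - AUC(b), resampling test samples with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # redraw if the resample contains a single class only
        diffs.append(auc(y, [scores_a[i] for i in idx]) - auc(y, [scores_b[i] for i in idx]))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_iter)]
    upper = diffs[int((1 + level) / 2 * n_iter) - 1]
    return lower, upper


def compare_models(labels, runs_a, runs_b, threshold=0.6, **bootstrap_args):
    """Vote over all pairs of training runs; threshold must exceed 0.5."""
    better = worse = 0
    for scores_a in runs_a:
        for scores_b in runs_b:
            lower, upper = bootstrap_difference_ci(labels, scores_a, scores_b, **bootstrap_args)
            better += lower > 0  # interval entirely above zero
            worse += upper < 0   # interval entirely below zero
    total = len(runs_a) * len(runs_b)
    if better / total >= threshold:
        return "better"
    if worse / total >= threshold:
        return "worse"
    return "not significant"


labels = [0, 1] * 20
perfect = [float(l) for l in labels]
inverted = [1.0 - s for s in perfect]
verdict = compare_models(labels, [perfect, perfect], [inverted, inverted], n_iter=50)
```

Because the threshold exceeds 0.5, at most one of the two verdicts "better" and "worse" can be reached for a given pair of models.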

TABLE II: Comparing supervised performance of the state-of-the-art models on two large ECG datasets. The asterisk stands for statistically significantly better performance than LSTM with FCE (as the previous state-of-the-art with quantified statistical uncertainty). Results marked with † were obtained from our own experiments; all remaining values were taken from the literature, for which unfortunately no statistics over multiple runs are available.

Model                              PTB-XL        Chapman
ViT [44]                           0.862         -
BaT [44]                           0.905         -
inception1d [6]                    0.925         -
xresnet1d101 [6]                   0.925         -
ensemble [6]                       0.929         -
multi-period attention [41]        0.932         -
DLTB-ECG (signal only) [42]        0.934         -
xresnet1d50 † [6]                  0.9286(28)    0.9805(39)
LSTM with FCE † (causal) [17]      0.9295(31)    0.9854(12)
S4 model † (this work)             0.9417(16)*   0.9876(11)

In Figure 4, we show a bootstrap comparison between the S4 model architecture and the xresnet1d50 architecture on the level of individual ECG statements. The colored bars represent the median values and the black bars the standard deviation of the $n_{\mathrm{runs}}^2$ medians of the $n_{\mathrm{iter}}$ AUC difference comparisons per model combination. Labels on which the S4 model architecture performs statistically better (worse) are marked by * (-). The plot reveals that despite high median values for the difference of some label AUCs, e.g., for non-specific ST elevation (STE) or T-wave abnormality (TAB), the S4 model architecture does not necessarily perform statistically better on those labels, as the difference varies strongly over combinations of different runs. On the other hand, there are pathologies where the median AUC difference is close to zero but has such a low variance that the difference is statistically significant, as is the case for, e.g., healthy ECG signals (NORM) or inferior myocardial infarctions (IMI). In summary, we find statistically significant improvements for 8 (SARRH, PAC, ISCIN, STD, ABQRS, SR, IMI, NORM) of the 71 ECG statements, while only 2AVB showed a consistent decrease in performance.

SSMs allow inference at unseen sampling rates. A compelling aspect of state space models is that, due to the continuous character of the state-transition matrix A in Equation (1), the model can be evaluated on data that was sampled at a different rate than the training data, simply by adjusting the step size in the discretization step at test time. Table III depicts a cross-evaluation matrix, in which we trained and cross-evaluated the S4 model on 100, 200 and 500 Hz. We see no or just minor losses in performance when the test sampling rate differs from the training sampling rate, even if the sampling rates differ by a factor of 5. This is a particular asset of SSMs as it avoids the necessity to resample the data, which might otherwise be a source of additional systematic uncertainties.
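The step-size adjustment described above can be illustrated with a toy state space model. The sketch below rests on assumptions that are not part of this work: a two-dimensional diagonal state matrix with arbitrary values, a bilinear discretization, and a sine wave as stand-in signal. It feeds the same continuous signal sampled at 100 Hz and at 500 Hz through the same continuous-time parameters, once with a matching step size and once with the step size left at its 100 Hz value.

```python
import math


def run_ssm(a_diag, b, c, u, step):
    """Discretize a diagonal SSM with the given step size and run it as a recurrence."""
    a_bar = [(1 + step * a / 2) / (1 - step * a / 2) for a in a_diag]
    b_bar = [step * bi / (1 - step * a / 2) for a, bi in zip(a_diag, b)]
    x = [0.0] * len(a_diag)
    y = []
    for ut in u:
        x = [an * xn + bn * ut for an, bn, xn in zip(a_bar, b_bar, x)]
        y.append(sum(cn * xn for cn, xn in zip(c, x)))
    return y


def sample(rate_hz, seconds=2.0, frequency_hz=1.2):
    """Sample the same continuous sine wave at the requested rate."""
    n = int(round(rate_hz * seconds))
    return [math.sin(2 * math.pi * frequency_hz * i / rate_hz) for i in range(n)]


# Continuous-time parameters shared across sampling rates (illustrative values).
A, B, C = [-1.0, -4.0], [1.0, 1.0], [0.5, 0.25]

y100 = run_ssm(A, B, C, sample(100), step=1 / 100)
y500_adjusted = run_ssm(A, B, C, sample(500), step=1 / 500)
y500_mismatched = run_ssm(A, B, C, sample(500), step=1 / 100)

# Compare outputs at identical physical time points (every 5th sample at 500 Hz).
gap_adjusted = max(abs(y100[i] - y500_adjusted[5 * i]) for i in range(len(y100)))
gap_mismatched = max(abs(y100[i] - y500_mismatched[5 * i]) for i in range(len(y100)))
```

With the adjusted step size, the outputs at matching time points differ only by a small discretization error, whereas keeping the 100 Hz step size for 500 Hz data changes the response substantially.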
No long-term dependencies in ECG data beyond 3s. In this paragraph, we aim to clarify in a quantitative way how long-ranged the interactions present in ECG data actually are, which is a long-standing question that has not been systematically addressed so far. We address this question in terms of the size of the input window that is passed to the model (while performing test-time augmentation, i.e., combining information from different segments for the sample-level prediction at all times). We believe that this question has not been answered in an unbiased way so far due to the inability of prior architectures to capture long-term dependencies in the data without adjusting hyperparameters, such as kernel sizes in the case of convolutional models. In Figure 5, we investigate the model performance (macro AUC measured on the PTB-XL test set) as a function of input size for two convolutional models using input data sampled at 100 Hz and 500 Hz and two SSMs using the same kind of input data. As described above, we train models on various input sizes, measured in physical units for comparability, and obtain aggregated predictions for the full samples during test time by taking the average of ten input windows that are consecutively moved through the signal with varying stride. As a first observation from Figure 5, we see that the performance gap between convolutional models and structured state space models, which was already apparent at an input size of 2.5s in Table II, persists across all input sizes. Second, within each model architecture, the results from input data sampled at 100 Hz as compared to 500 Hz largely overlap. This already puts into question the potential advantage of using input data at 500 Hz for ECG classification purposes. We will revisit this question in more detail in the next paragraph. Third, the plot shows an interesting dependence on the input size that is qualitatively consistent across model architectures: The performance from aggregated model predictions shows a peak at input sizes of around 2-3s. This hints at the fact that for ECG classification tasks based on short 12-lead ECG data, the ability of the model to explicitly capture long-range interactions beyond about 3s is not beneficial. On the contrary, the models seem to profit more from averaging overlapping predictions from different sliding windows. This observation is very much in line with the fact that most pathologies affect all beats equally (with a few exceptions such as premature ventricular contractions) and the fact that for average heart rates between 60 and 100 bpm, a sliding window of 3s already contains 3-5 beats. This question is obviously completely independent from the question of capturing long-term dependencies in long-term ECGs with rhythm changes within the sample, which should be revisited in future work.

No significant advantages from sampling frequencies beyond 100Hz. We also revisit the comparison between sampling frequencies of 100 Hz and 500 Hz in a more statistically rigorous manner based on the methodology presented above. At an input size of 2.5s and for fixed model architecture, we find no statistically significant performance difference between both sampling frequencies. This applies equally well to the level of individual label AUCs, in an analysis analogous to the one carried out for the comparison between the S4 and xresnet models above. We want to stress that this statement obviously depends strongly on the label distribution of the dataset under consideration, in the sense that there might be systematic improvements for certain ECG statements that do not turn out to be statistically significant due to large fluctuations as a consequence of small sample sizes.
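The aggregation scheme and the beat-count argument above can be written down compactly. The sketch below is an assumption-laden illustration: it places ten windows at evenly spaced start positions so that they span the full record, which is one plausible reading of "varying stride", and `predict` is a hypothetical stand-in for a trained model that returns one probability per label.

```python
def window_starts(signal_length, window_length, n_windows=10):
    """Evenly spaced start indices such that the windows span the whole signal."""
    last_start = signal_length - window_length
    if last_start < 0:
        raise ValueError("window is longer than the signal")
    if n_windows == 1:
        return [0]
    # The stride follows from the window length: shorter windows, larger stride.
    return [round(i * last_start / (n_windows - 1)) for i in range(n_windows)]


def aggregate(signal, window_length, predict, n_windows=10):
    """Average the per-window output probabilities into one sample-level prediction."""
    starts = window_starts(len(signal), window_length, n_windows)
    outputs = [predict(signal[s:s + window_length]) for s in starts]
    n_labels = len(outputs[0])
    return [sum(o[k] for o in outputs) / len(outputs) for k in range(n_labels)]


def beats_in_window(heart_rate_bpm, window_seconds):
    """Expected number of beats within a window at a given heart rate."""
    return heart_rate_bpm * window_seconds / 60


# A 10 s record at 100 Hz (1000 samples) with 2.5 s windows (250 samples).
starts = window_starts(1000, 250)
prediction = aggregate([1.0] * 1000, 250, predict=lambda w: [sum(w) / len(w)])
beats_low, beats_high = beats_in_window(60, 3), beats_in_window(100, 3)
```

For heart rates between 60 and 100 bpm, a 3 s window contains between 3 and 5 beats, which matches the range quoted in the text.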
CPC with an SSM predictor outperforms the current self-supervised state-of-the-art. As a final experiment, we also study the impact of SSMs in the context of self-supervised pretraining. The results compiled in Table IV reveal a statistically significant improvement in downstream performance compared to the previously best result (LSTM with FCE) for the identical pretraining dataset All2020. Enlarging the pretraining dataset even further, as for All2021, leads to a further performance improvement, reaching 0.9445(19), the highest performance reached so far on PTB-XL. In all cases, we find statistically significant improvements on PTB-XL compared to the corresponding model trained in a supervised fashion. To show that the achieved results also generalize to other datasets, we repeat the same downstream analysis using the same pretrained model for the Chapman dataset and find qualitatively similar results. At this point, it is hard to assess whether the improvements in downstream performance are primarily consequences of the improved performance of the model architecture, as seen in supervised training. In any case, replacing LSTMs by S4 layers in architectures used for self-supervised pretraining represents a promising step also for other data modalities such as speech.

Incorporation of patient-specific metadata enhances predictive performance. In Table V, we report the performance evaluation of the incorporation of patient-specific metadata. It reveals a consistent improvement across all models compared to the performance of the corresponding model operating on input signals only. The convolutional xresnet1d50 model benefits most from the inclusion of metadata. The pretrained SSM model S4 with FCE (pretrained on All2021) reaches an overall performance high with a macro AUC of 0.9463(08). We urge the community to also adopt this clinically more relevant task as a benchmark. To make it possible to assess the resilience of the results, we also strongly recommend reporting statistical fluctuations over multiple training runs.

Visual summary. In Figures 1 and 6, we present visual summaries of the outcome of this study. They show the improvements achieved through the use of structured state space model layers both in the supervised and in the self-supervised domain. In Figure 6, we use the number of model parameters to give an impression of the model complexity, although other parameters such as inference time should be considered for a more complete picture [45].
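Results such as 0.9445(19) and 0.9463(08) combine a central value with a spread over training runs. The following minimal sketch shows one way to produce such a string, assuming the parenthesized digits denote the uncertainty in the last quoted digits and that the spread is a sample standard deviation; both are assumptions, since the exact spread measure is not restated here. The function names and the example scores are hypothetical.

```python
from statistics import mean, stdev


def format_concise(value, spread, decimals=4):
    """Format value and spread as e.g. '0.9445(19)'.

    Assumption: the digits in parentheses are the spread expressed
    in units of the last quoted decimal place.
    """
    digits = round(spread * 10 ** decimals)
    return f"{value:.{decimals}f}({digits:02d})"


def summarize_runs(scores, decimals=4):
    """Summarize per-run scores as mean(sample standard deviation)."""
    return format_concise(mean(scores), stdev(scores), decimals)


# Hypothetical macro AUC values from repeated training runs.
hypothetical_runs = [0.9431, 0.9452, 0.9447, 0.9460, 0.9438]
print(summarize_runs(hypothetical_runs))
```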

V. SUMMARY
In this study, we utilized structured state space models, which are well-suited for capturing long-term dependencies in time series, to challenge the dominance of convolutional architectures in the realm of deep-learning-based ECG analysis. Our results consistently surpassed the previous state-of-the-art on large, comprehensive ECG classification datasets, both in the supervised and self-supervised settings. We leveraged the model's capability to capture long-term dependencies to shed new light on the optimal sampling frequency and input size for the model, rebutting, for example, common myths about necessary sampling rates. The significant performance improvements achieved by including patient metadata suggest the need to move beyond artificial benchmarking scenarios, where the model predictions are based solely on the signal, to more realistic scenarios, while maintaining a strict evaluation protocol.

Fig. 6: Visual summary: Supervised performance on PTB-XL (with and without self-supervised pretraining). Models involving SSMs clearly advance the state-of-the-art while maintaining a smaller parameter budget than previous models using LSTMs as predictors. FCE stands for fully-connected encoder and (CPC) indicates that the model received self-supervised pretraining using contrastive predictive coding.

Fig. 3: Schematic representation of the pretraining and finetuning procedures followed in this work.

TABLE I: Overview of the datasets used in this study.
The first one, All2020, includes the training data from the Computing in Cardiology Challenge 2020 in addition to the Chapman dataset and the CODE test set [8]. This dataset ensures comparability to [17], where the same dataset was used for pretraining. Going beyond this, we compiled an even larger pretraining dataset, All2021, which includes the Computing in Cardiology Challenge 2021 [39] training set as well as the CODE test set and comprises 89080 samples in total. It is worth noting that the challenge datasets are themselves collected from various sources. The PTB-XL dataset is contained in both the 2020 and the 2021 version, and the Chapman dataset is part of the 2021 dataset. We summarize all datasets used in this study again in Table I.

For the experiments involving S4 layers, we use a batch size of 32, N = 8 for the state dimension in the S4 layers (the optimal value identified on the PTB-XL validation set) and [...] covered. The average of the ten respective output probabilities is used as the final prediction for the entire sample.
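The test-time aggregation mentioned above, averaging ten output probabilities into one prediction per sample, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the placement of the ten crops (evenly spaced start positions), the crop length, and the `predict_fn` callable are all assumptions introduced here.

```python
import numpy as np


def average_crop_predictions(signal, predict_fn, crop_len, n_crops=10):
    """Average class probabilities over crops of a single recording.

    signal:     array of shape (timesteps, leads)
    predict_fn: hypothetical callable mapping one crop of shape
                (crop_len, leads) to a vector of class probabilities
    Assumption: crops start at evenly spaced positions.
    """
    n_steps = signal.shape[0]
    if n_steps < crop_len:
        raise ValueError("signal is shorter than the crop length")
    # Evenly spaced start indices from the first to the last valid crop.
    starts = np.linspace(0, n_steps - crop_len, n_crops).astype(int)
    probs = np.stack([predict_fn(signal[s:s + crop_len]) for s in starts])
    # The mean over crops is the prediction for the entire sample.
    return probs.mean(axis=0)


# Usage with a dummy model that ignores its input (illustration only).
dummy_signal = np.zeros((1000, 12))  # e.g. 10 s at 100 Hz, 12 leads
prediction = average_crop_predictions(
    dummy_signal, lambda crop: np.array([0.2, 0.8]), crop_len=250
)
```

Averaging probabilities rather than hard labels keeps the aggregated output usable for threshold-free metrics such as macro AUC.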

TABLE III: Comparing supervised performance of the S4 model trained/tested at different sampling rates.

TABLE IV: Comparing downstream performance (macro AUC on the PTB-XL/Chapman test set for the most fine-grained level of the label hierarchy) after self-supervised pretraining on different datasets. The asterisk stands for statistically significantly better performance than its supervised counterpart. The first result was taken from [17]. FCE stands for fully-connected encoder.
Comprehensive comparison of the performance difference, per pathology, between the S4 and xresnet1d50 models at a sampling rate of 100 Hz. The x-axis enumerates the ECG statements (as well as macro AUC across all statements) in the PTB-XL dataset, see [12] for details, while the y-axis indicates the difference in AUC values. The bars, color-coded according to the respective superclasses, indicate the median performance difference over the 100 pairwise comparisons between 10 trained S4 models and 10 trained xresnet1d50 models. The thin vertical lines represent the respective interquartile ranges, as a measure of how much the differences vary. Additionally, we assess the statistical significance of the performance improvements through bootstrapping on the test set, see the main text for details, where an asterisk (hyphen) next to an ECG statement indicates that the S4 model performs statistically significantly better (worse) than the xresnet1d50 model.
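The bar heights and interquartile ranges described in this caption can be computed from per-run AUC values as sketched below. This is a hedged illustration under the assumption that every run of one model is compared with every run of the other (10 x 10 = 100 differences); the function name is hypothetical, and the bootstrap significance test is not reproduced here.

```python
import numpy as np


def pairwise_difference_summary(auc_a, auc_b):
    """Median and quartiles of all pairwise AUC differences (a minus b).

    auc_a, auc_b: per-run AUC values of the two models for one ECG
    statement. With 10 runs each this yields 100 differences.
    """
    a = np.asarray(auc_a, dtype=float)
    b = np.asarray(auc_b, dtype=float)
    # Broadcasting builds the full len(a) x len(b) difference matrix.
    diffs = (a[:, None] - b[None, :]).ravel()
    q1, median, q3 = np.percentile(diffs, [25, 50, 75])
    return median, q1, q3
```

Calling this once per ECG statement gives the median (bar) and the quartiles (thin vertical line) for each entry on the x-axis.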

TABLE V: Comparing supervised performance of the state-of-the-art models on PTB-XL that incorporate patient-specific metadata as compared to models operating on signals only. The notation follows Table II.