Analysis of a Deep Learning Model for 12-Lead ECG Classification Reveals Learned Features Similar to Diagnostic Criteria

Despite their remarkable performance, deep neural networks remain unadopted in clinical practice, which is considered to be partially due to their lack of explainability. In this work, we apply explainable attribution methods to a pre-trained deep neural network for abnormality classification in 12-lead electrocardiography to open this “black box” and understand the relationship between model prediction and learned features. We classify data from two public databases (CPSC 2018, PTB-XL) and the attribution methods assign a “relevance score” to each sample of the classified signals. This allows analyzing what the network learned during training, for which we propose quantitative methods: average relevance scores over a) classes, b) leads, and c) average beats. The analyses of relevance scores for atrial fibrillation and left bundle branch block compared to healthy controls show that their mean values a) increase with higher classification probability and correspond to false classifications when around zero, and b) correspond to clinical recommendations regarding which lead to consider. Furthermore, c) visible P-waves and concordant T-waves result in clearly negative relevance scores in atrial fibrillation and left bundle branch block classification, respectively. Results are similar across both databases despite differences in study population and hardware. In summary, our analysis suggests that the DNN learned features similar to cardiology textbook knowledge.


I. INTRODUCTION
The development and evaluation of algorithms for automatic interpretation of biosignals has attracted great interest in the last decade. Biosignals are time series, i.e., ordered sequences of measurements.

Submitted on 14.11.2022. This research was funded by the German Federal Ministry of Education and Research (grant no. 16TTP073 11, and HiGHmed, grant no. 01ZZ1802B) and the Lower Saxony "Vorab" of the Volkswagen Foundation and the Ministry for Science and Culture of Lower Saxony (grant no. 76211-12-1/21).
Traditionally, the field of ECG signal processing was dominated by methods based on mathematical or physical models recreating human physiology. Human experts defined semantic models or features which were used for different tasks, e.g. for generating synthetic waveforms [1], waveform delineation [2], or even human identification [3]. Evidently, this led to a plethora of proposed features and the question of which feature set is optimal for a specific task, e.g. for ECG classification [4]. Regarding this application, the aim is to assign a label either to individual heart beats or to a whole recording. As an example for the latter use case, the PhysioNet/CinC Challenge 2020 posed the task of automatically assigning one or multiple of 27 classes to a large, multi-institutional database of 12-lead ECGs [5]. More than 200 teams took part, with the most common algorithms being deep neural networks (DNNs).
In recent years, data-driven methods from the field of machine learning (ML) became popular, with a significant share accounted for by DNNs [6]. At first, many works used DNNs as classifiers with traditional, semantic features as their input. Recently, however, there has been a trend towards "end-to-end" pipelines where the raw signal is processed and DNNs extract relevant features themselves [7]-[11]. Although these methods are able to produce outstanding results and outperform conventional methods in many areas [12], [13], a pitfall lies in the fact that they are black box models and often based on agnostic features. While they bear the theoretical potential to aid in diagnostics or treatment decisions, clinicians need to be able to comprehend their reasoning, as a "Clever Hans" prediction [14], based on spurious or artifactual correlations, might lead to wrong decisions and adverse consequences for patients. Hence, next to issues such as inadequate performance metrics [15] and data leakage [16], one of the main reasons for DNNs remaining unadopted in clinical practice is missing explainability [17], [18].
To address this need, frameworks and methods from the field of Explainable Artificial Intelligence (XAI) are being developed and evaluated [19]. While XAI for text and tabular input data is advancing, XAI for time series data such as biosignals is still in need of further research [20]. XAI methods for DNNs include layer-wise relevance propagation (LRP) [21], integrated gradients (IG) [22], and Grad-CAM [23]. However, with regard to ECG classification, these methods are usually applied qualitatively [24]-[26] by showing individual recordings and corresponding XAI information, e.g. as pseudo-colored overlays. This qualitative evaluation of single recordings is rather anecdotal evidence and does not satisfy the requirements for integrating DNNs in clinical practice, which demands a comprehensive characterization of models and their limitations.
Hence, in this work, we address the unmet clinical need of missing explainability by proposing a quantitative analysis pipeline (Fig. 1) enabling an objective justification of a DNN's decision. We use a state-of-the-art, pre-trained DNN proposed by Ribeiro et al. for abnormality classification in 12-lead ECGs [27] and apply attribution XAI methods to public ECG databases. In order to analyze the generalizability of this approach, we evaluate the explanatory power of different XAI methods and validate results on two different databases.
The XAI methods assign to each sample of the ECG time series a relevance score reflecting how much it influenced the DNN's decision. The main contribution of this work is a set of novel analysis methods for processing these scores. These analyses allow us to gain insight into the DNN's reasoning when classifying unseen ECG signals. By mapping the results to clinical knowledge, we investigate to what extent the DNN's features align with it. In doing so, we also propose novel visualization methods for relevance scores, allowing an intuitive and quick assessment of DNN classifications.

A. Physiological Introduction
An ECG measures electrical activity on a patient's skin to monitor the cardiac cycle. It is a routine measurement in clinical settings, especially in emergency care, as it allows a fast, accurate, and comfortable assessment of key clinical parameters. Standard parameters derived from ECGs include heart rate, intervals between different peaks and waves, as well as the heart's electrical axis. Deviations of these parameters from normal values can be interpreted as abnormalities, substantiating diagnoses. The acquisition of ECGs differs in length, e.g. 10 s in acute care or 24 h for Holter measurements, as well as in circumstances, such as resting or exercise.
Raw ECG data is measured at equally-spaced points in time (samples) in millivolts (mV) from multiple directions (leads), which are computed from differences in electrical potential between two distinct electrodes. A standard resting ECG uses 10 electrodes, resulting in 12 leads, including six chest leads and six limb leads derived from electrodes on each arm and the left leg.
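As a concrete illustration of how limb leads arise from electrode potential differences, the following minimal sketch derives the six limb leads from right-arm, left-arm, and left-leg electrode signals. The function name and the toy input values are our own; the standard Einthoven and Goldberger lead definitions are assumed.

```python
import numpy as np

def derive_limb_leads(ra, la, ll):
    """Derive the six limb leads from right-arm (ra), left-arm (la),
    and left-leg (ll) electrode potentials (arrays in mV).
    Einthoven leads are plain differences; Goldberger's augmented
    leads reference the mean of the two remaining electrodes."""
    return {
        "I":   la - ra,
        "II":  ll - ra,
        "III": ll - la,
        "aVR": ra - (la + ll) / 2,
        "aVL": la - (ra + ll) / 2,
        "aVF": ll - (ra + la) / 2,
    }

# Toy electrode potentials (two samples each):
ra, la, ll = np.array([0.1, 0.2]), np.array([0.5, 0.4]), np.array([0.3, 0.9])
leads = derive_limb_leads(ra, la, ll)
# Einthoven's law II = I + III holds by construction:
assert np.allclose(leads["II"], leads["I"] + leads["III"])
```

The six chest leads are measured analogously against the Wilson central terminal, the mean of the three limb electrode potentials.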
The stages of the cardiac cycle, a single heart beat, are represented by characteristic waves and peaks in a P-QRS-T sequence. The P-wave represents the depolarization before the contraction of the atria, which is initiated by the sinus node. The QRS-complex consists of the Q-, R-, and S-waves and corresponds to the ventricular systole, and the T-wave represents the ventricular relaxation.
The morphology of the different waves, such as amplitude or width, as well as the intervals in between are clinically relevant. For example, atrial fibrillation (AF) is an arrhythmia based on uncoordinated electrical impulses in the atria of the heart and a non-functioning sinus node [28] that can be diagnosed from ECGs. Criteria for diagnosis are the absence of P-waves, as they are initiated by the sinus node, and irregular RR-intervals [28]. However, repeating fibrillatory waves (f-waves) mimic P-waves and can usually be observed best in leads V1-6, especially V1 [29]. Another abnormality is left bundle branch block (LBBB), where the cardiac conduction through the left bundle branch is compromised downstream from lesions of the His bundle or its derivatives. LBBB criteria for ECGs include unusually wide QRS-complexes with the ST-segment and T-waves pointing in the opposite direction [30]. I, aVL, V5, and V6 are left-sided leads, where broad notched or slurred R-waves can be observed, while Q-waves are absent [30]. Both AF and LBBB can be diagnosed by ECG acquisition with a reduced number of leads, but the gold standard for diagnosis is the 12-lead ECG [31].

B. Technical Background
Ribeiro et al. published a residual network (ResNet) trained on more than two million ECGs from a Brazilian telehealth network, showing F1-scores of more than 80 % for the classification of six ECG abnormalities. The outputs of the convolutional layers in each of four residual blocks are fed into a fully connected layer with sigmoid activation function, yielding independent probabilities for six classes of ECG abnormalities [32]. The thresholds calculated for the final classifications are available on GitHub. In previous work, we demonstrated methods and results reproducibility with local data [33].
The model accepts a matrix with dimensions N × 4096 × 12, with 4096 and 12 defining the number of samples and leads, respectively. N denotes the number of recordings to be processed. The model outputs a matrix with dimensions N × 6 assigning probabilities for six ECG abnormalities, namely first-degree AV block, right bundle branch block, LBBB, sinus bradycardia, AF, and sinus tachycardia.
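These tensor conventions can be sketched as a minimal shape check. The model itself is not loaded here; the helper `check_model_io` and the class-abbreviation list are illustrative stand-ins, not part of the published code.

```python
import numpy as np

# Six abnormality classes in the order described in the text
# (abbreviations are our own shorthand for illustration).
CLASSES = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]

def check_model_io(ecgs: np.ndarray, probs: np.ndarray) -> None:
    """Validate the input/output tensor shapes described in the text."""
    n, n_samples, n_leads = ecgs.shape
    assert (n_samples, n_leads) == (4096, 12), "model expects N x 4096 x 12"
    assert probs.shape == (n, len(CLASSES)), "model outputs N x 6"

ecgs = np.zeros((3, 4096, 12))   # three dummy recordings
probs = np.zeros((3, 6))         # dummy model output for those recordings
check_model_io(ecgs, probs)      # passes silently when shapes are consistent
```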
In medical applications such as ECG diagnostics, it is important for clinicians to understand the reasoning of a DNN. XAI methods build a wrapper around the black box model, giving insight into possible features that led to the DNN's output. In this paper, we focus on two state-of-the-art attribution methods, IG and LRP.
1) Integrated Gradients: IG attribute the prediction of a neural network on unseen data to its input features. In contrast to plain gradients, IG use a baseline input for the attribution calculation. The authors [22] motivate this by noting that if we assign blame to something, we implicitly consider the absence of it as a baseline for comparing outcomes.
IG are calculated as follows: Let $f$ be a function that represents a neural network, $x$ the input at hand, and $x'$ the baseline input. The IG are defined as the path integral of the gradients along the straight-line path from the baseline $x'$ to the input $x$. The straight-line path can easily be written down as $x' + \alpha(x - x')$ for $\alpha \in [0, 1]$. The integrated gradient for the $i$-th input dimension is defined as

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f\big(x' + \alpha(x - x')\big)}{\partial x_i}\, d\alpha, \tag{1}$$

where $\frac{\partial f(x)}{\partial x_i}$ is the gradient of $f(x)$ along the $i$-th dimension.
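A minimal sketch of this definition follows, approximating the path integral by a Riemann sum and verifying it on a toy linear model. The function names (`integrated_gradients`, `grad_f`) are our own, not from [22].

```python
import numpy as np

def integrated_gradients(f, grad_f, x, x_base, m=64):
    """Approximate IG (Eq. (1)) by a Riemann sum with m steps along the
    straight-line path from baseline x_base to input x.
    grad_f(point) returns the gradient of f at that point."""
    total = np.zeros_like(x, dtype=float)
    for s in range(1, m + 1):
        point = x_base + (s / m) * (x - x_base)   # point on the path
        total += grad_f(point)
    return (x - x_base) * total / m               # scale by input difference

# Toy check with a linear model f(x) = w.x, whose gradient is constant:
w = np.array([1.0, -2.0, 3.0])
f = lambda v: w @ v
grad_f = lambda v: w
x = np.array([0.5, 1.0, -1.0])
ig = integrated_gradients(f, grad_f, x, np.zeros(3))
# Completeness: attributions sum to f(x) - f(baseline).
assert np.isclose(ig.sum(), f(x) - 0.0)
```

For the linear model the attributions reduce to $x_i w_i$, which makes the completeness property easy to verify by hand.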
The property of the LRP method that the relevance scores of the input can be summed up to approximate the prediction score (see (4)) can also be proven for IG by using the fundamental theorem of calculus for path integrals. It states that if $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable almost everywhere, then

$$\sum_{i=1}^{n} \mathrm{IG}_i(x) = f(x) - f(x'). \tag{2}$$

For a baseline $x'$ with prediction $f(x')$ near zero, we can see that the sum over the IG in (2) also approximates the prediction score $f(x)$, similar to how the sum over the relevance scores calculated by LRP approximates the prediction score $f(x)$ in (4). This property is termed completeness in [22].
For computing IG, the integration is replaced by a sum over $m$ sufficiently small intervals along the straight-line path:

$$\mathrm{IG}_i^{\text{approx}}(x) = (x_i - x'_i) \cdot \frac{1}{m} \sum_{s=1}^{m} \frac{\partial f\big(x' + \frac{s}{m}(x - x')\big)}{\partial x_i}. \tag{3}$$

2) Layer-wise Relevance Propagation: LRP tries to explain the output $f(x)$ made by a classifier $f$ with respect to an input $x$ by decomposing the output $f(x)$ into sample-wise relevance scores $R_d$ such that

$$f(x) \approx \sum_{d=1}^{V} R_d, \tag{4}$$

where $V$ is the input dimension. $R_d > 0$ would then indicate the presence of the structure which is to be classified and $R_d < 0$ would indicate its absence.
Propagation of relevance scores works as follows: Let $R_j^{(\ell+1)}$ be a known relevance score of a certain neuron $j$ in the $(\ell+1)$-th layer of a neural network, for a classification decision $f(x)$. The decomposition of the relevance score $R_j^{(\ell+1)}$ in terms of messages $R_{i \leftarrow j}^{(\ell,\ell+1)}$ sent to neurons of the previous layer $\ell$ must satisfy the conservation property

$$R_j^{(\ell+1)} = \sum_i R_{i \leftarrow j}^{(\ell,\ell+1)}, \tag{5}$$

where $i$ runs over all neurons in the $\ell$-th layer of the neural network. One possible relevance decomposition that satisfies (5) would be to use the ratio of local and global pre-activations:

$$R_{i \leftarrow j}^{(\ell,\ell+1)} = R_j^{(\ell+1)} \frac{z_{ij}}{z_j}, \qquad z_{ij} = x_i\, w_{ij}^{(\ell,\ell+1)}, \qquad z_j = \sum_k z_{kj} + b_j, \tag{6}$$

where $x_i$ is the activation (calculated by a non-linear activation function) of the $i$-th neuron in the $\ell$-th layer, $w_{ij}^{(\ell,\ell+1)}$ is the weight connecting neuron $i$ in the $\ell$-th layer to neuron $j$ in the $(\ell+1)$-th layer, $b_j$ is a bias term, and $k$ runs over all neurons in the $\ell$-th layer.
A problem with (6) is that if $z_j$ gets very small, the relevance scores $R_{i \leftarrow j}$ can grow without bound. To overcome this problem, the authors of [21] introduced a stabilizer $\epsilon \geq 0$:

$$R_{i \leftarrow j}^{(\ell,\ell+1)} = R_j^{(\ell+1)} \frac{z_{ij}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)}. \tag{7}$$

As we can see in (7), if $\epsilon$ becomes very large, the relevance scores will tend to zero, which poses another problem. To counteract this, a different treatment of positive and negative activations $x_i$ is proposed in [21]. Let $z_j^+$ and $z_j^-$ denote the positive and negative parts of $z_j$ such that $z_j^+ + z_j^- = z_j$. The same notation will be used for the positive and negative parts of $z_{ij}$. The relevance decomposition can now be defined by

$$R_{i \leftarrow j}^{(\ell,\ell+1)} = R_j^{(\ell+1)} \left( \alpha\, \frac{z_{ij}^+}{z_j^+} + \beta\, \frac{z_{ij}^-}{z_j^-} \right), \tag{8}$$

where $\alpha + \beta = 1$. A different propagation rule has been proposed by [34] for real-valued inputs that redistributes relevance scores according to the squared magnitude of the weights:

$$R_{i \leftarrow j}^{(\ell,\ell+1)} = R_j^{(\ell+1)} \frac{\big(w_{ij}^{(\ell,\ell+1)}\big)^2}{\sum_k \big(w_{kj}^{(\ell,\ell+1)}\big)^2}. \tag{9}$$

Other papers such as [35] and [36] propose a combination of different decomposition rules for different layer types, like (7) for fully connected layers to truthfully represent the decisions made via the layers' linear mapping and (8) for convolutional layers with ReLU activation functions to separately handle the positive and negative parts of the pre-activations.
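The $\epsilon$-rule (7) can be sketched for a single fully connected layer as follows. This is a toy example with assumed shapes and our own function name, not the iNNvestigate implementation.

```python
import numpy as np

def lrp_epsilon_dense(x, W, b, R_out, eps=1e-7):
    """Backpropagate relevance through one fully connected layer with the
    epsilon rule (Eq. (7)). x: activations of layer l (length i),
    W: weights (i x j), b: biases (j), R_out: relevances of layer l+1."""
    z = x[:, None] * W                    # z_ij = x_i * w_ij
    zj = z.sum(axis=0) + b                # z_j  = sum_k z_kj + b_j
    denom = zj + eps * np.sign(zj)        # stabilized denominator
    messages = z / denom * R_out          # messages R_{i<-j}
    return messages.sum(axis=1)           # R_i = sum_j R_{i<-j}

x = np.array([1.0, 2.0])
W = np.array([[0.5, -1.0], [1.0, 0.6]])
b = np.zeros(2)
R_out = np.array([1.0, 1.0])
R_in = lrp_epsilon_dense(x, W, b, R_out)
# With zero bias, conservation (Eq. (5)) holds up to the stabilizer:
assert np.isclose(R_in.sum(), R_out.sum(), atol=1e-4)
```

Note how the second output neuron, with a small $z_j$, produces large opposing messages; this is exactly the instability the stabilizer $\epsilon$ mitigates.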

C. Experimental Design
Fig. 1 shows an overview of the DNN and XAI pipeline applied in this work. This pipeline is run separately on data stemming from two different databases.
1) Databases: The data set for our main analysis stems from the CPSC2018 database (https://storage.cloud.google.com/physionet-challenge-2020-12-lead-ecg-public/PhysioNetChallenge2020_Training_CPSC.tar.gz), acquired in eleven Chinese hospitals and containing 12-lead ECGs with a ground truth provided by human experts [37]. Additionally, we validate the generalizability of our results using the PTB-XL database [38]. For our main analysis on the CPSC database, we use a subset of 200 recordings each for AF, LBBB, and healthy subjects showing normal signals, resulting in N = 600 recordings. We investigate these two abnormality classes because AF is defined by an abnormal heart rhythm, i.e. irregular distances between heart beats, and can therefore only be diagnosed by analyzing multiple heart beats. In contrast, LBBB can be diagnosed from a single heart beat as it is characterized by distinct morphological features, e.g. a notched QRS-complex.
2) Processing pipeline: All recordings were resampled to 400 Hz and trimmed or zero-padded to 4096 samples. In the remainder of this work, we denote a single ECG sample as $E_{n,j,k}$ with $n \in \{0, 1, \ldots, 599\}$ representing the recording index, $j \in \{0, 1, \ldots, 4095\}$ representing samples, and $k \in \{0, 1, \ldots, 11\}$ representing leads. Regarding data processing, each ECG signal is fed to the model by Ribeiro et al. [39] for classification, resulting in a matrix with dimensions N × 6 assigning probabilities for six ECG abnormalities. In the following, we define $\{C_n \in \mathbb{R} \mid 0 \leq C_n \leq 1\}$ as the prediction score of the model with sigmoid activation, representing the classification probability. We utilize the package iNNvestigate [40], which implements multiple XAI methods, to compute relevance scores for each sample of the input ECGs. We use the XAI methods IG and LRP, with the IG implementation using a baseline input of zero and interval size m = 64, after changing the activation of the DNN's last layer to linear. Sigmoid activation does not change the ranking order of the predicted classes, but might obfuscate the true confidence of the model's individual class predictions.
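The resample-and-pad step can be sketched as follows. Linear interpolation stands in here for whatever resampler the authors actually used, and padding is applied at the end of the recording; both are assumptions for illustration.

```python
import numpy as np

TARGET_FS, TARGET_LEN = 400, 4096

def preprocess(ecg: np.ndarray, fs: float) -> np.ndarray:
    """Resample a (samples x leads) ECG to 400 Hz via linear interpolation
    and trim or zero-pad it to 4096 samples."""
    n_in, n_leads = ecg.shape
    n_out = int(round(n_in * TARGET_FS / fs))
    t_in = np.arange(n_in) / fs
    t_out = np.arange(n_out) / TARGET_FS
    resampled = np.stack(
        [np.interp(t_out, t_in, ecg[:, k]) for k in range(n_leads)], axis=1
    )
    if n_out >= TARGET_LEN:                  # trim long recordings
        return resampled[:TARGET_LEN]
    pad = TARGET_LEN - n_out                 # zero-pad short recordings
    return np.pad(resampled, ((0, pad), (0, 0)))

ecg_500hz = np.random.randn(5000, 12)        # 10 s recording at 500 Hz
out = preprocess(ecg_500hz, fs=500)
assert out.shape == (4096, 12)               # 4000 resampled + 96 zero samples
```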
The XAI methods assign a relevance score $R_{j,k} \in \mathbb{R}$ to each input sample of a classified ECG recording. By computing this for all N recordings we obtain $R_{n,j,k}$ with the same dimensions as our input ECG data $E_{n,j,k}$. Both are the basis for our analysis comparing features embedded in the DNN model to clinically-relevant criteria. We analyze the obtained relevance scores $R_{n,j,k}$ with three novel quantitative methods and one qualitative method as described in the following sections. With each new analysis, we take more details into account. While in the first analysis relevance scores are binned per class, in the second analysis we split relevance scores w.r.t. their lead, and in the third analysis w.r.t. lead and heart beats.
3) Binned and Average Relevance Scores Over Class: We first analyze relevance scores for all 200 normal, 200 LBBB, and 200 AF recordings separately and bin the values for their respective class, allowing us to compare the overall distribution of R n,j,k for the different classes.
We then aggregate all leads of each recording $n$ into

$$M_n = \frac{1}{K \cdot J} \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} R_{n,j,k},$$

with $K = 12$ and $J = 4096$. $R_{n,j,k}$ takes positive or negative values; hence, a higher $M_n$ is associated with a higher prediction score, termed completeness in [22]. Here, the prediction score is the output of the model with linear activation.
4) Average Relevance Scores Over Class and Lead: We aggregate relevance scores for each lead $k$ and recording $n$ into

$$M_{n,k} = \frac{1}{J} \sum_{j=0}^{J-1} R_{n,j,k},$$

with $J = 4096$. This allows comparing the distribution of $R_{n,j,k}$ w.r.t. class and ECG lead, and thus the importance of the individual ECG leads for the DNN. This is required as the different leads show different morphologies and signal shapes that might cancel out in the first analysis.
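Assuming the relevance scores are stored as an array of shape (N, J, K), both aggregations reduce to mean reductions over the appropriate axes; a minimal sketch:

```python
import numpy as np

# Relevance scores with shape (N recordings, J samples, K leads);
# random values stand in for actual XAI output.
N, J, K = 600, 4096, 12
rng = np.random.default_rng(0)
R = rng.normal(size=(N, J, K))

M_n = R.mean(axis=(1, 2))     # average over samples and leads (per recording)
M_nk = R.mean(axis=1)         # average over samples only (per recording, lead)

assert M_n.shape == (N,)
assert M_nk.shape == (N, K)
# Averaging the per-lead means over leads recovers the per-recording mean:
assert np.allclose(M_nk.mean(axis=1), M_n)
```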
5) Average Relevance Scores Over Class, Lead, and Beats: In the first two analysis methods, time information is lost. However, for explaining the DNN's decision this is crucial, as we need to assess whether the agnostic features trained by the DNN reflect the clinical features described in Section II-A, such as missing P-waves or unusually wide QRS-complexes. Analyzing individual ECG records gives only anecdotal evidence. Therefore, we perform a two-step averaging procedure which averages the information over several recordings while preserving time information. First, for each ECG record and lead, we use the concept of "average beats" [41] by splitting the whole signal into individual heart beats with the ecg_segment() function of NeuroKit2 and averaging them into a single, time-aligned representative beat for each lead. Then we use the exact same indices of the heart beats and perform the same steps on the relevance scores $R_{n,j,k}$, yielding an "average relevance score". All average beats and average relevance scores are then averaged for a given class. All segments are of equal size for one recording; hence, we fill segments overlapping the start or end of the recording with zeros. Finally, amplitudes are normalized to [−1, 1]. For scatter plot visualizations, relevance scores are upsampled by a factor of 5.
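The beat-wise averaging can be sketched without NeuroKit2, assuming R-peak indices are already available; the window size, peak positions, and function name below are illustrative, not the paper's actual parameters.

```python
import numpy as np

def average_beat(signal, relevance, r_peaks, half_window):
    """Average time-aligned beats and their relevance scores around given
    R-peak indices; segments overlapping the recording's start or end are
    zero-filled, mirroring the procedure described in the text."""
    n = len(signal)
    beats, rels = [], []
    for r in r_peaks:
        seg_sig = np.zeros(2 * half_window)
        seg_rel = np.zeros(2 * half_window)
        lo, hi = r - half_window, r + half_window
        src_lo, src_hi = max(lo, 0), min(hi, n)   # clip to the recording
        dst_lo = src_lo - lo                      # offset inside the segment
        seg_sig[dst_lo:dst_lo + src_hi - src_lo] = signal[src_lo:src_hi]
        seg_rel[dst_lo:dst_lo + src_hi - src_lo] = relevance[src_lo:src_hi]
        beats.append(seg_sig)
        rels.append(seg_rel)
    return np.mean(beats, axis=0), np.mean(rels, axis=0)

# Five identical fake "beats" and uniform relevance as a sanity check:
sig = np.tile(np.sin(np.linspace(0, 2 * np.pi, 100)), 5)
rel = np.ones_like(sig)
beat, rel_avg = average_beat(sig, rel, r_peaks=[50, 150, 250, 350, 450],
                             half_window=50)
assert beat.shape == (100,)
```

The same beat indices are applied to signal and relevance arrays, which is what keeps the "average relevance score" time-aligned with the "average beat".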
6) Qualitative Analysis of XAI Relevance Scores: The results of all processed ECG signals were visualized as heatmap-colored scatter plots for each lead, after a normalization of the output to [−1, 1], keeping the center of the values at zero. Furthermore, these relevance score plots were evaluated by an experienced cardiologist.
7) Comparison Between Databases: To evaluate the generalizability of our processing pipeline, we evaluate results on another publicly-available dataset. For this task we use PTB-XL [38], an older public database acquired between October 1989 and June 1996 in Germany. Therefore, the ECG measurement equipment and the subjects' origins are completely different from the CPSC database, and additionally different clinical guidelines may have been in practice for the annotation by cardiologists.
8) Comparison Between XAI Methods: Since IG and LRP differ substantially in their approach to calculating relevance scores for the input, we believe that using both methods will help uncover important information about why the DNN made certain decisions. Hence, we compare IG results to LRP using the following LRP decomposition rules implemented in the iNNvestigate [40] package:
a) The ϵ-LRP decomposition (see (7)) with $\epsilon = 10^{-7}$.
b) The αβ-LRP decomposition (see (8)) with α = 1 and β = 0.
c) The ω²-LRP decomposition (see (9)).
d) The combination of the αβ-LRP decomposition (see (8)) with α = 1 and β = 0 for convolutional layers and the ϵ-LRP decomposition (see (7)) with ϵ = 0.1 for fully connected layers.
The sigmoid function (used in the output layer) maps from $\mathbb{R}$ to $\mathbb{R}^+$ and thus inverts the signs of all negative values, as well as scaling all values into the interval [0, 1]. This results in only small and positive values being backpropagated by the LRP method, possibly yielding small and only positive relevance scores. Thus, we compared these relevance scores to those obtained by using a linear output in the last layer. Since both activations yield similar results when compared visually in heatmaps, we decided to continue with linear activation to avoid the possible sign flip.

D. Ethics approval
Human subject research: This work only makes use of public data and does not contain any additional information involving human participants obtained by the authors.

III. RESULTS
After processing recordings with the DNN, $C_n$ is the probability that a recording $n$ shows the interrogated abnormality. The recording is classified as this abnormality if $C_n$ is higher than a threshold defined by Ribeiro et al., which is 0.39 for AF and 0.05 for LBBB. Applying an XAI method results in a relevance score $R_{n,j,k} \in \mathbb{R}$ for each input sample of a classified ECG, with $j \in \{0, 1, \ldots, 4095\}$ representing the sample index and $k \in \{0, 1, \ldots, 11\}$ representing the lead.
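The thresholding step can be sketched as follows; the threshold values are those stated above, while the helper name and dictionary layout are our own.

```python
# Class-specific decision thresholds stated in the text (from Ribeiro et al.).
THRESHOLDS = {"AF": 0.39, "LBBB": 0.05}

def classify(C_n: float, abnormality: str) -> bool:
    """A recording is labeled with the abnormality if its prediction
    probability C_n exceeds the class-specific threshold."""
    return C_n > THRESHOLDS[abnormality]

assert classify(0.50, "AF") is True
assert classify(0.20, "AF") is False    # below 0.39: not classified as AF
assert classify(0.20, "LBBB") is True   # above 0.05: classified as LBBB
```

The much lower LBBB threshold means that even recordings with modest $C_n$ are labeled LBBB, which is relevant when interpreting the false negatives discussed below.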

A. Average Relevance Scores Over Class
The mean of the distributions of IG relevance scores $R_{n,j,k}$ for each class (Fig. 2) is close to zero, indicating that the majority of ECG samples is not relevant for the DNN's decision. Distributions for both abnormalities are similar to those of normal recordings, although they are slightly broader and shifted towards positive values. For LBBB, in the range [0.0, 0.10] there is a large number of more positive relevance scores compared to normal recordings (Fig. 2b).
The relevance scores of individual recordings are again centered close to zero and rather equally distributed (Fig. 3). In general, AF shows larger values in positive and negative direction compared to LBBB. While the median value is always very close to zero, the mean value of relevance scores increases with increasing $C_n$. For AF classification (Fig. 3a), a large number of normal recordings correctly classified as not showing AF have a $C_n$ near 0, and correctly classified AF recordings are near 1. In between is a "transition area" with nine false negative classifications in [0.1, 0.39). The remaining seven false negatives show $M_n$ values close to zero. LBBB has similar properties to AF, although there is no visible transition area and the values are not as close to 1 (Fig. 3b).

B. Average Relevance Scores Over Class and Lead
Analyzing model results for each lead $k$ for AF classification (Fig. 4a), mean relevance scores showed medians of 0.0002 and −0.0001 and ranges of [−0.0002, 0.0010] and [−0.0014, 0.0012] for AF and normal recordings, respectively. For LBBB classification (Fig. 4b), medians were 0.0001 and −0.0002 and ranges were [−0.0008, 0.0016] and [−0.0009, 0.0022] for LBBB and normal recordings, respectively. For each lead, the mean relevance scores were significantly higher for both abnormalities compared to normal recordings (Wilcoxon rank-sum test, p-value < 0.01). In particular, lead V1 shows the highest difference in mean relevance scores between abnormal and normal recordings.

C. Average Relevance Scores Over Class, Lead, and Beats
Average beats over 200 recordings show mostly positive relevance scores for both abnormalities, and mostly negative relevance scores for normal recordings for both classifications (Fig. 5).
When classifying AF, QRS-complexes are the most relevant areas, especially R-peaks. For normal recordings, we observed high negative values in the area of P-waves as well. Negative values of normal recordings are higher in magnitude compared to positive values of AF recordings. For LBBB classification, QRS-complexes are most relevant as well (Fig. 6). Furthermore, the concentration of high absolute relevance scores on specific waves or peaks is clearer, such as the negative T-wave in LBBB, which is assigned negative relevance scores when positive in normal recordings. In contrast, for AF many smaller relevance scores with higher variance are distributed over the whole beat.

D. Qualitative Analysis
We observed clusters of high absolute relevance scores in the area of QRS-complexes during visual inspection of single recordings visualized as heatmaps (Fig. 7). For LBBB, IG seems to focus on negative S-waves and prolonged ST-segments in lead V1. Occasionally, broad and notched R-waves were also marked relevant. For AF recordings, on the contrary, the relevant parts were usually R-waves and, in rare instances, areas with missing P-waves.
When looking at individual recordings, we also observed that in cases of artefacts, such as baseline drifts or noise, IG relevance scores usually accumulate mainly in these areas. This can be seen in multiple false negative classifications, such as recordings A1017 (lead V1, Fig. 8), A0745 (V6), and A0205, A0502 (both multiple leads, mainly V1-6). In some cases the classification was still correct despite the focus on artefacts, e.g. A0639 (V1) classified as AF with $C_n \approx 0.904$.

E. Comparison of Databases
We repeated all experiments conducted on the CPSC database using data from PTB-XL instead. All quantitative methods show similar results for PTB-XL data, exemplarily shown for average beats in AF classification in Fig. 9. However, the distribution of relevance scores for LBBB recordings is narrower and shifted closer to positive values than for CPSC data (Fig. 10).

F. Comparison of XAI Methods
IG and the considered LRP methods yield diverging results for the given data set. As exemplarily shown in Fig. 11, the LRP methods ϵ and αβ distribute high absolute relevance scores especially around R-peaks, while ω² shows higher absolute values on the waves in between as well as on artefacts. IG can also concentrate high absolute relevance scores around artefacts, but generally shows more high absolute values, especially on R-peaks, when comparing leads of single patients to each other.

IV. DISCUSSION
Results of the first analysis show that IG relevance scores follow a reasonable distribution (Fig. 2) with the majority of values being close to zero. This is expected, as the majority of samples in an ECG lies at baseline, e.g. the interval between two heart beats from the end of the T-wave to the beginning of the P-wave, and carries little clinically-relevant information. Comparing AF and LBBB classification shows that the AF relevance scores are more evenly spread around zero, while the LBBB relevance scores tend towards more positive values, which can also be seen clearly in Fig. 2b with a distinct gap for positive relevance scores between LBBB and normal recordings. We conclude that the DNN trained a larger inter-class distance for LBBB classification.
Analyzing individual recordings (Fig. 3) shows similar distributions for both classifications. Additionally, a distinct relationship between the averaged relevance scores $M_n$ and the probability $C_n$ of the DNN can be observed. An optimal DNN classifier would show a cluster near $C_n = 0$ and $M_n \ll 0$ for normal recordings as well as a cluster near $C_n = 1$ and $M_n \gg 0$ for AF/LBBB. The analyzed DNN shows a sub-optimal relationship that can generally be expected, with a transition area between both clusters in which the DNN does not have high certainty in its decisions (e.g. Fig. 3a: $C_n \in [0.1, 0.4]$). Furthermore, we observed many of the false negative classifications slightly below the threshold, indicating that the thresholds might not be optimal for the CPSC data set.
Analyzing individual leads revealed significant differences in relevance score distributions between abnormal and normal recordings (Fig. 4). This indicates which leads are most relevant for the DNN's decision. In general, for AF, the limb leads show lower relevance scores compared to the chest leads [29]. For AF as well as LBBB classifications, lead V1 shows clear positive relevance scores, indicating that the DNN trained clinically-relevant features: for AF, f-waves can often be observed in V1 [42], and for LBBB a negative terminal deflection in V1, e.g. an rS-complex with a tiny R-wave and a huge S-wave, is a clear diagnostic marker [43]. Interestingly, there is a large difference in the distributions of the precordial leads V4-V6. While in AF they show a clear tendency towards positive relevance scores, for LBBB the median is close to zero. Another sign for LBBB are prolonged R-waves and the absence of Q-waves in left-sided leads [44], which might not have been learned.
For these first analyses, we used averaged mean values of relevance scores, which have previously been used for explanations of models that take feature-based input instead of raw data [45], [46]. However, this is a rather coarse measure. As the relevance scores are signed, mean values can be composed of rather low relevance scores or of competing strong relevance scores for and against the respective class. Still, outliers in overall means or in means of leads could be an indicator of false classification due to artefacts, for example if a lead not typically relevant for an abnormality has the highest mean, such as lead V6 in Fig. 4b.
As time information is lost in these averages, we proposed the third analysis. As can be seen in Figs. 5 and 6, the "average beat" and "average relevance scores" of a single lead can give an even more detailed idea of the model's features. Although it is still not possible to uniquely identify the actual features learned by the DNN, positively relevant areas in case of missing P-waves for AF classification indicate a good fit to clinical criteria [42]. Additionally, for the healthy controls, there are very pronounced negatively relevant areas near P-waves, demonstrating that the DNN learned that the existence of P-waves is a counter-sign for AF. As IG does not provide insight into the time scale, we cannot quantify to what extent RR-interval variations impact relevance scores. However, as the QRS-complex has similar shapes in AF and normal recordings, we assume that the DNN took the arrhythmic RR-intervals of AF recordings into account.
Moreover, when analyzing the shape of an average relevance signal, which is continuously averaged over more and more recordings in Fig. 5 (see Supplemental Material for a video), it can be seen that, for AF as well as normal ECGs, the variance of relevance scores is quite low. This indicates a robustness of the DNN, as it generates similar relevance scores despite the natural inter-patient variability in abnormal ECGs. Regarding LBBB classification (Fig. 6), high relevance scores around broadened QRS-complexes indicate a good fit to clinical criteria [30]. The criterion of a T-wave displacement opposite to the major deflection of the QRS-complex [30] can also be observed very well, although it results in small positive relevance scores only. In contrast, for healthy controls, T-waves result in very pronounced negatively relevant areas (e.g. Fig. 12b). Similarly, for AF classifications, P-waves are learned as a feature that indicates the absence of AF (e.g. Fig. 12a). Furthermore, the robustness of the relevance scores in terms of variance is even higher than for AF.

1) Comparison of XAI methods:
In this work, we applied the XAI attribution methods IG and LRP. Other approaches for explaining models for biosignal data use ante-hoc methods as in [18], [47], [48], but these are not suitable for pre-trained DNNs, where no adaptation of the model itself is possible. Other methods, such as perturbation methods [49], [50], occlude different parts of the input and analyze the resulting changes in activations. These methods can also be used to calculate relevance scores for every input feature, but as shown by [51], they produce noisier heatmaps compared to LRP methods. Our results indicate that both methods, IG and LRP, are well suited for gaining insight into the reasoning of DNNs applied to biosignals. Additionally, we conducted a comparison of IG and LRP methods (Fig. 11) and concluded that IG gives the most distinct results.
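For reference, IG itself reduces to averaging gradients along a straight path from a baseline to the input and scaling by the input difference. The sketch below uses a midpoint Riemann sum and a user-supplied gradient function; the toy model and all names are illustrative assumptions, not part of the original pipeline:

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint-rule approximation of Integrated Gradients:
    (x - baseline) times the average gradient along the straight
    path from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

# Toy example: f(x) = x0^2 + 3*x1, so grad f = [2*x0, 3].
grad_f = lambda z: np.array([2.0 * z[0], 3.0])
attr = integrated_gradients(grad_f, np.array([1.0, 2.0]), np.zeros(2))
# Completeness axiom: the attributions sum to f(x) - f(baseline) = 7.
```

The completeness axiom (attributions summing to the change in model output) is what makes per-sample relevance scores comparable across a recording.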
2) Comparison of databases: To account for a change in the underlying data set, we validated the results obtained on the CPSC database using PTB-XL and obtained similar results. One noticeable difference was observed in the relevance score distribution of LBBB recordings, where the less negative values for PTB-XL could be explained by its more specific label "Complete LBBB", which might be easier to classify. These more differentiated labels bear potential for comparing model performance on complete and incomplete LBBBs.
3) Artifacts: We observed that the DNN tends to produce wrong classifications when artefacts are present, as can be seen exemplarily in Fig. 8. This effect has been observed by others as well [24]. Although we have not attempted it in this work, artefact detection based on our approach could be a promising avenue for future work. Additionally, we observed that the relevance scores form certain temporal patterns that might allow the application of analysis methods from nonlinear signal processing [52], which we will analyze in future work.

4) Key findings: In summary, our analysis suggests that the model by Ribeiro et al. learned features similar to cardiology textbook knowledge. IG relevance scores indicate that it learned features pointing towards a disease, such as the abnormal QRS-complex in LBBB, while other features, such as the T-wave pointing in the opposite direction, are not used for LBBB detection. Instead, the opposite of the feature, a T-wave pointing in the expected direction, is used as a feature for detecting healthy ECGs. Our proposed analysis and visualization methods for relevance scores facilitate a rapid and effective assessment of the DNN's learned features and were confirmed by cardiologists.

5) Limitations: A limitation of our analysis based on IG is that we cannot infer any time-dependent information from the relevance scores. Especially for AF, it is not clear whether, e.g., the R-peaks are marked as relevant because of their morphology or because of their distance to one another. Therefore, we rate our results as more robust for LBBB, a morphological abnormality, compared to AF, an arrhythmic and therefore time-dependent abnormality. Another limitation of our work is that we used public ECG databases, which might introduce a certain bias. Using a data set from actual clinical practice on a cardiology ward or in emergency care might therefore show different results. Thus, in future work, we will verify our results with more diverse data sources.
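The artefact-detection avenue mentioned under 3) could be prototyped with a simple concentration measure on the relevance scores. The window length and the idea of thresholding the returned fraction are our own illustrative assumptions, not part of the original work:

```python
import numpy as np

def relevance_concentration(relevance, window=250):
    """Fraction of the total absolute relevance that falls into the
    single most 'attractive' window of the given length. Values close
    to 1 suggest that relevance clusters around a transient artefact,
    as observed for false classifications such as recording A1017."""
    r = np.abs(np.asarray(relevance, dtype=float))
    total = r.sum()
    if total == 0.0:
        return 0.0
    # Sliding-window sums of absolute relevance over the recording.
    window_sums = np.convolve(r, np.ones(window), mode="valid")
    return float(window_sums.max() / total)
```

For a 500 Hz recording, window=250 corresponds to 0.5 s; recordings whose concentration score exceeds a chosen threshold could be flagged for manual review before the classification is trusted.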

V. CONCLUSION
Missing explainability of ML methods for ECG analysis is a pressing issue preventing the dissemination of these methods in clinical practice. In this work, we aimed at enabling an objective justification of a DNN's decision by analyzing a state-of-the-art DNN for ECG classification with different XAI methods and data from different databases. Although this approach does not provide absolute certainty about the features learned by the DNN, it allows for inferring assumptions about its decision process. For example, our results reveal that the DNN learned that clearly visible P-waves are a counter-sign for AF and that T-waves pointing in the same direction as the QRS-complex in particular leads are counter-signs for LBBB. Furthermore, decisions of the DNN for LBBB classification are based on unusual QRS-complexes. We conclude that the DNN learned cardiology textbook knowledge covering the whole cardiac cycle, including P-wave, QRS-complex, and T-wave. Moreover, we were able to explain false classifications due to transient noise, which attracts the DNN's relevance scores, leading to relevant features being ignored.
In future work, we will use the methods proposed here to develop an interactive tool for clinical practice that offers cardiologists an intuitive overview of the DNN's reasoning, supporting them in their decision whether to trust the DNN's classification or not.
J. M. Beinecke, D. Krefting, H. Dathe, N. Spicher and A.-C. Hauschild are with the Department of Medical Informatics, University Medical Center Göttingen, Göttingen, 37075 Germany (e-mail: theresa.bender@med.uni-goettingen.de). C. Müller and T. Seidler are with the Department for Cardiology & Pneumology/Heart Center, University Medical Center Göttingen, Göttingen, 37075 Germany. (N. Spicher and A.-C. Hauschild are co-last authors.)

successive and equally-spaced time intervals. Typical examples are the electrocardiogram (ECG), representing the electrical activity of the heart, or the electroencephalogram (EEG), representing brain activity. The temporal ordering discriminates biosignals from many other types of biomedical data without any order, such as lab tests or sequencing, and introduces challenges in their interpretation by humans and algorithms alike. Next to measurement artefacts, including loss of electrode contact, signals are influenced by other physiological processes, for example the ECG by respiration, and by (in)voluntary movement of the patient.

Fig. 1: Overview of the processing pipeline, which is applied separately to data stemming from two different databases (CPSC/PTB-XL): For each database, the data set consists of 200 healthy controls (Normal) that are compared to patients showing AF and LBBB. Each (unseen) 12-lead ECG is fed into the pre-trained DNN and subsequently the results are explored with the XAI methods, yielding a relevance score for each input sample, indicated here by blue (negative relevance score), grey (neutral), and red values (positive relevance score). We propose novel analysis methods for these scores, allowing insight into the DNN's reasoning.
Normal and AF recordings. Colors denote the ground truth label of the data set. Values for AF range from [−0.5, 0.5] and values for normal recordings from [−0.3, 0.4].

Normal and LBBB recordings. Colors denote the ground truth label of the data set. Values for LBBB range from [−0.6, 0.9] and values for normal recordings from [−0.4, 0.5].

Fig. 4: Distribution of M_{n,k} computed with IG w.r.t. ECG leads, colors denoting the ground truth label. For AF classification (a) and LBBB classification (b), boxplots show that the abnormal mean is higher for each lead, with the highest difference in V1.

Fig. 5: Left column: Average beats (black curves) and IG relevance scores for lead V1 in AF classification. Abnormal ECGs show positive relevance scores (red) distributed over the whole P-QRS-T cycle; negative relevance scores (blue) on normal recordings cover QRS-complexes and especially P-waves. Right column: Instead of average beats, the variance of relevance scores across recordings is shown (orange).

Fig. 6: Left column: Average beats and IG relevance scores for lead aVL in LBBB classification. Abnormal ECGs show positive relevance scores (red) on QRS-complexes; negative scores (blue) on normal recordings can be seen on P- and T-waves. Right column: Instead of average beats, the variance of relevance scores across recordings is shown (orange).

Fig. 8: Positive (red) and negative (blue) relevance scores calculated with IG on a false-negative classified ECG (AF: ∼0.008) from the CPSC data set (ID A1017). Relevance scores are clustered around the artefact in lead V1.

Fig. 7: a) Lead V1 used for AF classification. b) Lead aVL used for LBBB classification.

Fig. 12: Average beats (black curve) and relevance scores for individual leads in a single normal recording correctly classified by the DNN: a) Highly negative relevance scores (blue) are found during the occurrence of the P-wave. b) Negative relevance scores (blue) are found during the P- and T-waves, and especially during the occurrence of the T-wave following the QRS-complex.