LUDB: A New Open-Access Validation Tool for Electrocardiogram Delineation Algorithms

We report the Lobachevsky University Database (LUDB) of ECG signals, an open tool for validating ECG delineation algorithms that is superior to the existing publicly available databases in several respects. LUDB contains 200 recordings of 10-second 12-lead electrocardiograms (ECGs) from different subjects, representative of a variety of signal morphologies. The boundaries and peaks of QRS complexes and P and T waves are manually annotated by cardiologists for all recordings and independently for each lead, and all records received an expert classification by abnormalities. We present a case study for a recently proposed wavelet-based algorithm and the widely used ecg-kit tool, and demonstrate the advantage of multi-lead ECG data analysis. LUDB contributes to the diversity of public databases employed in developing and validating novel ECG analysis algorithms, including the most advanced ones based on deep neural networks.


INTRODUCTION
Recording the electrical activity of the heart, or electrocardiography, is one of the basic medical diagnostic means for assessing cardiac activity, in particular for determining the heart rate and rhythm disturbances. The voltage graphs, electrocardiograms (ECGs), show repeated activity with the commonly identified structural elements of each heartbeat: the QRS complex and the P and T waves (Fig. 1). Analysis of their amplitudes, shapes (morphologies) and durations allows for identifying cardiac rhythm disorders and cardiovascular diseases, such as ischemia and myocardial infarction [1]. The rich variety of signal morphologies, together with the nonstationary nature of the signals, potential defects in recordings and noise, makes an automated search for these waves and complexes, known as ECG delineation (or ECG segmentation, or ECG annotation), a challenging task. This problem has been tackled for quite a while, resulting in a number of algorithms that solve it at different levels of detail. The first ones were designed to detect the QRS complex only, relying on the amplitude of the ECG signal and its first derivative [2]. Detecting the boundaries and peaks of P and T waves required more sophisticated methods based on the wavelet transform [3], [4], Hilbert transform [5], phasor transform [6], hidden Markov models [7], gradient-based algorithms [8] and morphological transforms [9]. Validating delineation algorithms requires standardized datasets with complexes and waves that are manually annotated by specialists. Increasing their number and variety is important in itself, both for better training and for testing the robustness of the developed methods. Moreover, the collections currently available in the public domain, namely the MIT-BIH Arrhythmia Database [10], the European ST-T Database [11] and the QT Database [12], have certain limitations. Specifically, the MIT-BIH Arrhythmia Database and the European ST-T Database have a markup only for QRS complexes. In turn, the QT Database contains annotations for P, QRS and T waves, but has only 2-lead Holter recordings and is therefore not suitable for validating multi-lead delineators, which are currently the most common approach.
The ECG database assembled at Lobachevsky University (LUDB) is free from these issues. It consists of 200 standard 10-second 12-lead ECG recordings [13] from different subjects, representing a variety of signal morphologies. The boundaries of the P-waves, QRS complexes and T-waves in each lead are manually annotated by cardiologists for all 200 records, and each record is supplemented with the abnormalities noted (as in other studies, we skip the U-wave because of its small amplitude and noise issues). The overall number of annotated complexes in LUDB considerably exceeds that in QTDB. Altogether, these features make LUDB a valuable addition to the currently available public sources.
As a case study, we used this dataset to validate our recent algorithm [14], which applies the wavelet transform for multi-lead, multi-morphology analysis with error correction, and compared it with the popular ecg-kit tool [15], which employs one of its predecessors, a single-lead delineator [4]. As expected, the results demonstrate comparable performance of both on QTDB and a noticeable improvement in delineating P and T waves on LUDB achieved by the former algorithm.
We note that there are many recent studies related to ECG processing, including disease detection, delineation, sleep staging, biometric human identification, denoising, and others (see the recent overview [16]). In this article, we focus only on the task of ECG delineation. The solution to this task can be used to solve other problems, in particular disease detection. On the other hand, using standard annotations and expert features may not always be the best choice. Automatically generated features (such as deep learning features) can be more informative than expert features. In particular, there have been noticeable successes in the automatic recognition of cardiac diseases using sparse representations of ECG [17], deep learning generated features [18], [19], a combination of artificial intelligence methods and linear and non-linear decomposition [20], different feature extraction methods with machine learning algorithms [21], and different end-to-end ECG deep learning classifiers, e.g. [23], [24].
The paper is organized as follows. In Section I, we describe the LUDB database. Section II contains an outline of the delineation algorithm [14]. A case study of its validation with LUDB and QTDB is reported in Section III. Section IV summarizes the results and perspectives.

I. LOBACHEVSKY UNIVERSITY DATABASE
The publicly available Lobachevsky University Database [25] contains 200 records from 200 subjects in WFDB format [26].
The ECGs were collected in 2017-2018 from healthy participants and from patients of the Nizhny Novgorod City Hospital No. 5 who had various cardiovascular diseases; some of the patients had pacemakers. The records were made by specialized medical staff (functional diagnostics nurses). All participants provided written informed consent before participating in the experiment. The age of the subjects varied from 11 to 90 years, with an average of 52 years; the distribution by gender was 85 women and 115 men. Table 2 reports the breakdown by the type of rhythm and Table 3 by the type of heart electrical axis. These parameters are specified for all records in the database. The ECG recordings were obtained with a Schiller Cardiovit AT-101 cardiograph [27], using the conventional 12 leads (I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6). Each recording lasts 10 seconds, and the signals are digitized at 500 Hz, complying with the international standard [13].
The boundaries and peaks of QRS complexes and of P and T waves were determined by two certified and practicing cardiologists (A.V.N. and K.A.K.) by visual inspection of each ECG signal, independently for each of the 12 leads. The markup of all ECG forms, as well as the classification of abnormalities, was done jointly, relying on standard criteria [28] and based on consensus opinion. This approach was chosen to decrease subjective influence and to provide the end user with a definite annotation. The recordings and markup files in the database come separately and are open for download and further independent exploration, in particular with regard to assessing variability in expert opinion. In total, the dataset contains 58429 annotated waves, almost six times more than the widely referenced QT Database (Table 1), which is, to the best of our knowledge, the only other publicly available database with all the waves annotated.
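As an illustration of how such per-lead markup can be consumed, the following minimal sketch groups a flat stream of annotation points into (onset, peak, offset) waves. It assumes the common WFDB waveform-annotation convention in which each wave is written as "(" for the onset, a peak symbol ("p", "N" for QRS, "t") and ")" for the offset; this layout, the `Wave` type and the `group_waves` helper are assumptions made for the example and are not stated in the text above.

```python
from typing import List, NamedTuple, Optional


class Wave(NamedTuple):
    kind: str               # peak symbol, assumed 'p', 'N' (QRS) or 't'
    onset: Optional[int]    # sample index, None if the onset is not annotated
    peak: int               # sample index of the peak
    offset: Optional[int]   # sample index, None if the offset is not annotated


def group_waves(samples: List[int], symbols: List[str]) -> List[Wave]:
    """Group (sample, symbol) annotation points of one lead into waves.

    Assumes the order '(' , peak symbol, ')' for every wave (hypothetical
    layout); a missing onset or offset is stored as None.
    """
    waves = []
    onset = None
    i = 0
    n = len(symbols)
    while i < n:
        sym = symbols[i]
        if sym == "(":
            onset = samples[i]
        elif sym == ")":
            onset = None  # stray offset without a preceding peak: ignore it
        else:
            peak = samples[i]
            offset = None
            if i + 1 < n and symbols[i + 1] == ")":
                offset = samples[i + 1]
                i += 1  # consume the offset annotation
            waves.append(Wave(sym, onset, peak, offset))
            onset = None
        i += 1
    return waves


def duration_ms(wave: Wave, fs: float = 500.0) -> Optional[float]:
    """Wave duration in milliseconds at sampling rate fs (500 Hz in LUDB)."""
    if wave.onset is None or wave.offset is None:
        return None
    return 1000.0 * (wave.offset - wave.onset) / fs
```

With the 500 Hz sampling rate quoted above, a P-wave annotated from sample 100 to sample 120 would have a duration of 40 ms.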
Tables 4 and 5 summarize the content of the database by the main ECG abnormalities and their counts. Note that some patients have several abnormalities at the same time.
Examples of ECGs with manual annotations are shown in Figs. 2-6.
[Figure: Example of an ECG from LUDB, id=1, age: 51, sex: F. Yellow corresponds to P waves, red to QRS complexes, green to T waves. Distinct markers denote the onset and the offset of a wave; • marks the wave peak. Sinus rhythm. Sinus bradycardia. Electric axis of the heart: left axis deviation. Left ventricular hypertrophy. Left ventricular overload. Non-specific repolarization abnormalities: posterior wall.]
In the discrete wavelet transform (DWT), h[n] is the low-pass filter, g[n] is the high-pass filter, and D[k] and A[k] are the resulting detail and approximation coefficients, respectively. A more detailed representation of the frequency content of ECG signals is obtained by repeating the DWT on the approximation coefficients calculated at the previous round, according to the general scheme shown in Fig. 7.
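The repeated filtering scheme can be sketched as follows. In the textbook form, one DWT level convolves the signal with h and g and keeps every second sample; the next level is applied to the approximation coefficients only. The Haar filter pair used here is purely an illustrative assumption (the text does not state which wavelet [14] uses), and `dwt_step` and `dwt` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

# Haar filters, chosen only for illustration (assumption, not taken from [14]).
S = 1.0 / math.sqrt(2.0)
HAAR_LOW = [S, S]     # h[n], low-pass
HAAR_HIGH = [S, -S]   # g[n], high-pass


def dwt_step(x: Sequence[float], h: Sequence[float],
             g: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One DWT level: A[k] = sum_j h[j] x[2k+1-j], D[k] = sum_j g[j] x[2k+1-j].

    Samples outside the signal are treated as zero; h and g must have
    the same length.
    """
    approx, detail = [], []
    for k in range(len(x) // 2):
        a = d = 0.0
        for j in range(len(h)):
            idx = 2 * k + 1 - j
            if 0 <= idx < len(x):
                a += h[j] * x[idx]
                d += g[j] * x[idx]
        approx.append(a)
        detail.append(d)
    return approx, detail


def dwt(x: Sequence[float], levels: int,
        h: Sequence[float] = HAAR_LOW,
        g: Sequence[float] = HAAR_HIGH) -> Tuple[List[float], List[List[float]]]:
    """Repeated DWT: each level is applied to the previous approximation.

    Returns the final approximation and the detail coefficients of every
    level, finest scale first.
    """
    approx = list(x)
    details = []
    for _ in range(levels):
        approx, d = dwt_step(approx, h, g)
        details.append(d)
    return approx, details
```

For an orthonormal pair such as Haar, the energy of the input equals the summed energy of the final approximation and all detail coefficients, which gives a simple sanity check for an implementation.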
The popular ecg-kit tool [15] is based on a single-lead delineation scheme [4]. In the following we discuss the solutions of [14] that allow for improving the delineation accuracy of all waves and complexes, in particular, P and T waves. A comprehensive analysis of multi-lead recordings and error correction procedures are central here.
The developed delineation method consists of several stages. Delineation of each type of wave is first carried out for all ECG leads independently, and in a particular order. Then the results are refined by aggregating and comparing the signals from all leads. The general scheme of the algorithm is outlined in Fig. 8.
The algorithm receives a raw ECG signal as input, which is first preprocessed. Bandpass filtering removes the baseline drift and the high-frequency noise that can be caused by muscle tone, interference from electrical appliances, poor contact between electrodes and skin, etc. Next, a discrete wavelet transform is applied to the filtered signal, yielding a set of detail coefficients at different frequency scales. The subsequent analysis relies on these sets, obtained for the ECG from each lead.
Waves and complexes of the ECG signal are identified in a specific order: QRS complex, T-wave, and then P-wave. The QRS complex is detected first, since it typically has the largest amplitude, which simplifies the task. Then the T-wave is located, as its amplitude is usually greater than that of the P-wave. Delineation of the P-wave is viewed as the most complex task by both cardiologists and mathematicians [4], [30]. The amplitude of this wave is often comparable to noise or flutter, so that a quality detection procedure has to rely on restricting the temporal interval of interest from both sides, by the QRS complex and the T-wave. Processing of each type of wave follows a similar pipeline. First, the algorithm explores the ECG signal from each lead separately. It selects the best candidates for the corresponding wave, then determines its peak and boundaries. The algorithm by Kalyakulina et al. [14] implements yet another feature, classifying the morphology of the detected wave by determining reference points (onsets, peaks, ends). Matching them to model cases gives much richer diagnostic information than duration and amplitude values would offer. The particular morphologies of the QRS complex recognized by the algorithm are shown in Fig. 9. The orientation of the complex, its extremal points, and the number of additional peaks or, conversely, the lack of some, are key to the diagnostic process of detecting cardiac arrhythmias or the presence of cardiovascular diseases.
After all waves of a certain type are found in the outputs from all leads, the algorithm performs a comparative analysis aimed at correcting omissions or spurious waves that appear in the recordings for certain leads. As a formal validity threshold for the occurrence of a complex, we require its presence in at least 8 out of 12 leads. That is, if for some heartbeat the T-wave is detected in 10 leads out of 12, then this wave is taken to be present in the other two leads as well. Conversely, if the complex is found in at most one third of the total number of leads, then it is removed from the delineation. We do not use the multi-lead correction if the complex was detected in 5 to 8 leads. Additionally, averaging the times of the corresponding reference points for the matching complexes across the leads reduces the effect of noise and other disturbances. After this multi-lead correction, delineation proceeds to the next wave type, taking advantage of the adjusted locations of the preceding waves.
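The voting rule above can be sketched in a few lines. The text gives 8 leads both as the validity threshold and as the upper end of the range without correction; this sketch assumes that a count of exactly 8 confirms the wave, which is an interpretation rather than something the text settles. The function names and the string labels are hypothetical.

```python
from typing import List, Optional, Sequence


def multilead_decision(detected: Sequence[bool],
                       confirm_at: int = 8,
                       reject_at: int = 4) -> str:
    """Decide what to do with one wave of one heartbeat across all leads.

    detected[i] is True if the wave was found in lead i.
    confirm_at: minimum number of leads to accept the wave everywhere
                (assumed inclusive at 8 of 12).
    reject_at:  maximum number of leads at which the wave is retracted
                (one third of 12 leads).
    """
    count = sum(1 for flag in detected if flag)
    if count >= confirm_at:
        return "confirm"   # wave is taken as present in the remaining leads too
    if count <= reject_at:
        return "reject"    # wave is removed from the delineation
    return "keep"          # no multi-lead correction is applied


def average_reference_point(times: Sequence[Optional[float]]) -> Optional[float]:
    """Average a reference point (onset, peak or offset) over the leads
    in which it was found; leads without a detection are passed as None."""
    known: List[float] = [t for t in times if t is not None]
    if not known:
        return None
    return sum(known) / len(known)
```

For example, a T-wave found in 10 of 12 leads yields "confirm", one found in 3 leads yields "reject", and one found in 6 leads is left unchanged.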
It is instructive that some failures in the single-lead signal processing are apparently due to alternating morphologies of a complex in the ECG signal, which the adaptive detection threshold does not follow efficiently enough [14]. However, when the complexes are missed in fewer than one third of the leads, their delineation is restored by the multi-lead analysis, as exemplified in Fig. 10, and a corresponding morphological anomaly is noted.

III. ALGORITHM VALIDATION
We validate the described tools [14], [15] with two open-access datasets, the newly introduced LUDB and QTDB [12]. Both are manually annotated by cardiologists, but they differ in the number of leads (12 and 2, respectively), the number of subjects (200 and 105) and the duration of recordings (10 seconds and 15 minutes). The reference points of complexes found by an automated delineation are checked against the manually marked ones; the chosen tolerance window of 150 ms complies with the ANSI/AAMI-EC57:1998 standard [32].
When an algorithm determines a point correctly (i.e. within the 150 ms interval of a manual point), it is counted as a true positive (TP). Likewise, when a point suggested by the algorithm is absent in the manual markup, the case is counted as a false positive (FP). If the algorithm fails to identify a point that is present in the database, the case is a false negative (FN). For TP cases one also calculates the time mismatch between the automated and manually assigned locations, and this quantity is referred to as the "error". The quality of the algorithm is characterized by the following four metrics, used in [4], [30], [31], [33]: the average error m, its standard deviation σ, the sensitivity Se = TP/(TP + FN), and the positive predictive value (precision) PPV = TP/(TP + FP), with Se and PPV expressed in percent. For the method by Kalyakulina et al., all these quantities are computed on the set pooled from the point-to-point match analysis in each single lead. Table 6 summarizes the assessment of the two tools [14], [15] against LUDB and QTDB, and gives validation data for the other methods against QTDB, borrowed from the literature [4], [30], [31], [33], and against LUDB [34].
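The four metrics can be computed as in the following sketch for one type of reference point (for example, all QRS onsets of one lead). It rests on several assumptions that the text does not fix: the tolerance is read as ±150 ms around the manual point, points are paired by a simple greedy pass over the sorted lists, and σ is taken as the population standard deviation. The names `match_points` and `delineation_metrics` are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def match_points(reference: Sequence[float], detected: Sequence[float],
                 tolerance: float = 150.0) -> Tuple[List[float], int, int]:
    """Greedily pair detected points with manual reference points (times in ms).

    Returns (errors of matched pairs, number of FP, number of FN), where
    error = detected time minus reference time.
    """
    ref = sorted(reference)
    det = sorted(detected)
    errors: List[float] = []
    fp = fn = 0
    i = j = 0
    while i < len(ref) and j < len(det):
        diff = det[j] - ref[i]
        if abs(diff) <= tolerance:
            errors.append(diff)   # true positive
            i += 1
            j += 1
        elif diff < 0:
            fp += 1               # detection too early for any remaining reference
            j += 1
        else:
            fn += 1               # reference point has no detection in its window
            i += 1
    fn += len(ref) - i            # unmatched reference points
    fp += len(det) - j            # unmatched detections
    return errors, fp, fn


def delineation_metrics(reference: Sequence[float], detected: Sequence[float],
                        tolerance: float = 150.0) -> Dict[str, float]:
    """Se (%), PPV (%), mean error m and standard deviation sigma (ms)."""
    errors, fp, fn = match_points(reference, detected, tolerance)
    tp = len(errors)
    return {
        "Se": 100.0 * tp / (tp + fn) if tp + fn else float("nan"),
        "PPV": 100.0 * tp / (tp + fp) if tp + fp else float("nan"),
        "m": statistics.mean(errors) if errors else float("nan"),
        "sigma": statistics.pstdev(errors) if errors else float("nan"),
    }
```

To pool results over leads, as described for the method by Kalyakulina et al., one would concatenate the per-lead error lists and sum the per-lead TP, FP and FN counts before computing the four quantities.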
As a result, for both LUDB and QTDB, the sensitivity values for the onsets and peaks of the P, QRS and T waves are above 97%, and the standard deviation σ is almost within the limits set by the standard [35], which requires it to be at most 2σ_CSE. The exceptions are the P-wave onset for QTDB, where σ is 3 ms larger, and the QRS onset for both databases, where σ is 1.2 ms larger for LUDB and 0.1 ms larger for QTDB. The maximal error is observed for the T-wave offset, whose delineation is a well-known hard problem from both the mathematical and the cardiological perspectives [36]. For the QRS complex, a relatively simple task, the performance of all methods is next to perfect, with an occasionally slightly worse rate for the method by Kalyakulina et al. The more challenging task of detecting P and T waves is also performed almost equally well by all methods on QTDB, but the method by Kalyakulina et al. substantially outperforms ecg-kit on LUDB. This is an anticipated result, since the former method takes full advantage of the 12-lead format of LUDB, which makes it possible to reduce detection failures and the appearance of spurious complexes, and to improve the accuracy of timing the key points through the multi-lead refinement of delineation.
LUDB can be used to validate different methods for ECG delineation, as well as to train new deep learning algorithms for delineation. We believe that architectures like U-net [37] will allow achieving better results than known algorithms. For some preliminary results from using LUDB to train a U-net-like network, see [38].

IV. CONCLUSION
Despite an urgent need for thoroughly annotated and open datasets of human ECGs to serve as testbeds for delineation algorithms, the offer remains quite limited [10]-[12]. Moreover, each of them falls short of having both multi-lead recordings, a standard output of modern hospital cardiographs, and a manual expert markup of all kinds of waves (P, QRS, and T). Ideally, the recordings would also be supplied with a diagnosis or a note on abnormalities in the ECG, which additionally enables training and validating algorithms for automated identification of possible pathology.
The presented Lobachevsky University Database is a step towards filling the existing gap. Openly accessible at the Lobachevsky University website and available on PhysioNet [25], it contains 12-lead ECG recordings for 200 subjects (hospital patients and participants without a history of complaints) in WFDB (PhysioNet) format, manually annotated (except for U-waves) and supplied with the abnormalities noted. Moreover, it offers a variety of complex morphologies to challenge delineation algorithms. A case study that employed ecg-kit [15] and our recently developed delineation algorithm [14] demonstrates how one can take full advantage of multi-lead recordings to implement error correction in signals from separate leads, and to improve the recognition of complex wave morphologies, as well as the precision of timing for delineation points, as compared to the performance on the 2-lead dataset. A further extension of LUDB, which would not simply enlarge the database but would make it suitable for exploring machine learning and neural network algorithms for automated diagnosis, is to follow. It would also be important to obtain independent manual delineations by other experts.
Our results confirm that some delineation tools can perform considerably differently on different datasets. The different instrumental origin of the ECGs is only one, and probably a minor, reason for that. The inevitable variability in individual expert opinion on delineation and diagnosis could have a much greater impact, both at the validation stage and for the end use. However, there is still not enough data to evaluate and accommodate this issue. Arguably, the future quality assurance of delineation algorithms will emphasize robust, if only near-perfect, performance across a wealth of datasets, rather than maximizing performance on a single given dataset.