DeepCEDNet: An Efficient Deep Convolutional Encoder-Decoder Networks for ECG Signal Enhancement

The electrocardiogram (ECG) signal is an effective indicator for the detection of various arrhythmias. However, acquired ECG data are invariably corrupted by noise, which strongly affects the diagnosis of cardiovascular diseases. In this paper, an efficient deep convolutional encoder-decoder network framework, termed ‘DeepCEDNet’, is proposed to remove the noise from ECG signals. The network learns a sparse representation of data in the time-frequency domain via the high-order synchrosqueezing transform (FSSTH), together with a nonlinear function that maps the noisy data to the clean data based on the distribution difference between signal and noise in the training set. Extensive experiments are conducted on ECG signals from the MIT-BIH Arrhythmia database and the MIT-BIH Long-Term ECG database, with added noise taken from the MIT-BIH Noise Stress Test database. The denoising performance is evaluated by means of signal to noise ratio (SNR), root mean squared error (RMSE) and percent root mean square difference (PRD). The results indicate that the proposed DeepCEDNet achieves superior performance in both noise reduction and detail preservation, with higher SNR and lower RMSE and PRD than the traditional convolutional neural network (CNN) and the fully convolutional network-based denoising auto-encoder (FCN). We believe that DeepCEDNet has wide application prospects in the biomedical field.


I. INTRODUCTION
Recorded electrocardiogram (ECG) signals are inevitably contaminated by various types of noise (also called artifacts) [1], [2], such as baseline wander (BW), muscle artifact (MA), and electrode motion (EM). Such noise severely distorts the ECG waveform and masks the weak characteristics of the ECG signal, which poses a challenge for the subsequent diagnosis of cardiovascular diseases [3]-[6]. Therefore, noise removal from ECG signals has become an urgent task [7], [8].
Over the past decades, numerous efforts have been made to develop methods for denoising ECG signals, for example the adaptive filter [9]-[11], wavelet transform [12], principal component analysis (PCA) [13], independent component analysis (ICA) [14], and empirical mode decomposition (EMD) [15], [16]. The adaptive filter can effectively remove noise outside the ECG frequency band; however, it fails when the signal and noise share a common frequency range. The wavelet transform suppresses noise well by shrinking the wavelet coefficients in the transformed domain. Unfortunately, a suitable wavelet basis function and threshold strategy need to be selected with prior knowledge, which is usually a troublesome process in practice. In addition, the threshold algorithm is likely to affect the ECG waveform. The key idea of the methods based on PCA and ICA is to eliminate the dimensions corresponding to noise; however, the obtained mapping model is sensitive to small disturbances in the signal or noise. EMD-based approaches decompose the noisy signal into a series of intrinsic mode functions (IMFs), and then the noise-dominant IMFs are removed while the remaining IMFs are used to reconstruct the signal. As some useful signal components are embedded in the IMFs containing noise, such methods may not give satisfactory results. Mode mixing is another main drawback of EMD. Although the variants of EMD (e.g. ensemble empirical mode decomposition (EEMD) [17], [18] and complete ensemble empirical mode decomposition (CEEMD) [19], [20]) and variational mode decomposition (VMD) [21], [22] have greatly alleviated this issue, the redundancy problem has still not been fundamentally solved, which limits the potential application of these methods in many fields.
The associate editor coordinating the review of this manuscript and approving it for publication was Chih-Yu Hsu.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, deep learning techniques have been gaining attention due to their powerful capability in learning the characteristics of ECG signals [23]-[27]. Noise suppression techniques based on the denoising autoencoder (DAE) have shown better performance than conventional denoising methods. In [28], an improved DAE combined with a wavelet transform was created, in which a scale-adaptive thresholding algorithm is employed to attenuate most of the noise. In [29], a stacked contractive DAE with multi-level feature extraction was developed for noise reduction. In [30], a fully convolutional network based DAE was proposed for ECG signal denoising. However, these methods are usually carried out in the time domain and cannot fully exploit the capability of auto-encoders in learning a sparse representation of data.
In this paper, we present DeepCEDNet, a novel sparsely promoted deep convolutional encoder-decoder network in the time-frequency domain. This network simultaneously learns a sparse representation of the input data and a nonlinear function that maps the noisy data to the clean data according to the features of signal and noise learned from the training set. We use real ECG signals contaminated with various types of noise, such as BW, MA and EM, to train the network and demonstrate its performance, and we compare it with two state-of-the-art methods, CNN and FCN (fully convolutional network). Our contributions are as follows: (1) we propose a sparsely promoted deep neural network to extract ECG signal features; (2) DeepCEDNet has a strong advantage in learning a sparse representation of data; (3) our method achieves impressive denoising of ECG signals and preserves the details well.
The rest of this paper is organized as follows. Section II reviews the basic concepts of the autoencoder (AE) and DAE, the theory of the FSSTH, and then the proposed DeepCEDNet is depicted in detail. In Section III, the experimental results from two available MIT-BIH databases are exhibited in order to verify the effectiveness of the proposed architecture. The discussion with respect to experimental results is given in Section IV. Section V concludes this paper.

A. REVIEW OF AE AND DAE
As a deep learning model, the aim of the AE is to reconstruct the input as accurately as possible through the constraint of a loss function. A basic AE architecture comprises an encoder and a decoder. The former maps an input vector $x$ to a hidden representation $y$ by a deterministic mapping. In the second stage, the latent representation $y$ is mapped back to a reconstructed vector $z$. The two parts can be formulated as:
$$y = \phi(Wx + b) \tag{1}$$
$$z = \phi'(W'y + b') \tag{2}$$
where $W$ and $b$ are the weight matrix and bias vector of the encoder, respectively. Similarly, $W'$ and $b'$ are the corresponding weight matrix and bias vector of the decoder. $\phi$ and $\phi'$ are the non-linear activation functions. The parameters of this model are optimized by minimizing the reconstruction error:
$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(x_i, z_i) \tag{3}$$
where $\theta$ is the parameter set $\{W, b, W', b'\}$, $L(\cdot,\cdot)$ is the reconstruction loss (for example, the squared error), $N$ is the number of data samples, and $i$ is the sample index.
DAE, originally developed by Vincent et al. [31], is a variant of the classic AE. In the DAE, the input $\tilde{x}$ is a corrupted version of the data, which is created by means of a stochastic mapping $\tilde{x} \sim q(\tilde{x} \mid x)$. The corrupted input $\tilde{x}$ is first mapped to a hidden representation using Eq. (1), and then its reconstruction is obtained using Eq. (2). Throughout the mapping process, the parameters are trained to minimize the reconstruction error (Eq. (3)) over a training set in order to make $z$ as close as possible to the uncorrupted input $x$.
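The encoder, decoder and reconstruction error described above can be illustrated with a few lines of NumPy. This is only a minimal sketch under stated assumptions: the sigmoid encoder activation, the identity decoder activation, the squared-error loss, the additive Gaussian corruption used as a stand-in for $q(\tilde{x} \mid x)$, and all function and variable names are illustrative choices, not the configuration used in this paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(X, W, b):
    """Encoder in the style of Eq. (1): y = phi(W x + b), one sample per column."""
    return sigmoid(W @ X + b[:, None])

def decode(Y, W_p, b_p):
    """Decoder in the style of Eq. (2): z = phi'(W' y + b'); phi' is the identity here (assumption)."""
    return W_p @ Y + b_p[:, None]

def reconstruction_error(X_clean, X_in, params):
    """Mean squared reconstruction error over the N samples (columns)."""
    W, b, W_p, b_p = params
    Z = decode(encode(X_in, W, b), W_p, b_p)
    return float(np.mean(np.sum((X_clean - Z) ** 2, axis=0)))

rng = np.random.default_rng(0)
d, h, n = 8, 4, 16                      # input size, hidden size, number of samples
X = rng.standard_normal((d, n))
params = (0.1 * rng.standard_normal((h, d)), np.zeros(h),
          0.1 * rng.standard_normal((d, h)), np.zeros(d))

# Plain AE: the clean input is both the network input and the target.
ae_loss = reconstruction_error(X, X, params)

# DAE: the network sees a corrupted input but is scored against the clean one.
# Additive Gaussian noise is an assumed example of the stochastic corruption.
X_tilde = X + 0.3 * rng.standard_normal(X.shape)
dae_loss = reconstruction_error(X, X_tilde, params)
print(ae_loss, dae_loss)
```

The only difference between the two losses is which array is fed to the encoder; the target remains the uncorrupted data in both cases, which is what pushes a DAE to learn a denoising map.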

B. HIGH-ORDER SYNCHROSQUEEZING TRANSFORM
The key idea of FSSTH is to sharpen the short-time Fourier transform (STFT) representation of a signal by computing a new local instantaneous frequency estimate, using higher order approximations both for the amplitude and phase [32].
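Consider an AM-FM signal:
$$f(t) = A(t)\, e^{i 2\pi \varphi(t)} \tag{4}$$
where $A(t)$ denotes the instantaneous amplitude and $\varphi(t)$ is the instantaneous phase. Following [32], the Taylor expansion of the signal $f$ in Eq. (4) for $\tau$ close to $t$ is described as:
$$f(\tau) = \exp\left( \sum_{k=0}^{N} \frac{[\log A]^{(k)}(t) + i 2\pi \varphi^{(k)}(t)}{k!} (\tau - t)^{k} \right)$$
where $T^{(k)}(t)$ denotes the $k$th derivative of a function $T$ with respect to $t$. From this expansion, the local instantaneous frequency estimate $\omega_f(t, \eta)$ is derived. Based on the high-order Taylor expansions of the amplitude and phase of the signal, frequency modulation operators $q^{[k,N]}_{\eta,f}$ are then defined, from which the $N$th-order local complex instantaneous frequency $\omega^{[N]}_{\eta,f}$ at time $t$ and frequency $\eta$ is obtained; the explicit expressions are given in [32]. Finally, the FSSTH is defined as follows:
$$T^{g,\gamma}_{N,f}(t, \omega) = \frac{1}{g^{*}(0)} \int_{\{\eta \,:\, |V_f^g(t,\eta)| > \gamma\}} V_f^g(t, \eta)\, \delta\big(\omega - \hat{\omega}^{[N]}_{\eta,f}(t, \eta)\big)\, d\eta$$
where $V_f^g$ is the STFT of $f$ computed with the window $g$, $\hat{\omega}^{[N]}_{\eta,f}$ is the real part of $\omega^{[N]}_{\eta,f}$, and $\gamma$ denotes some threshold.
The $i$th mode can be approximately reconstructed by:
$$f_i(t) \approx \int_{|\omega - \phi_i(t)| < d} T^{g,\gamma}_{N,f}(t, \omega)\, d\omega$$
where $\phi_i(t)$ is an estimate for $\varphi_i'(t)$, and $d$ is a compensation factor.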

C. PROPOSED DeepCEDNet
The noisy signal $x(t)$ is first transformed into the time-frequency domain via the FSSTH [32], a variant of the STFT-based SST (FSST) [35], which can achieve a highly concentrated time-frequency representation [33], [34]. In the time-frequency domain, the noisy signal $X(t, f)$ can be expressed as:
$$X(t, f) = S(t, f) + N(t, f)$$
where $S(t, f)$ and $N(t, f)$ denote the useful signal and noise, respectively. Our aim is to estimate the underlying signal $\hat{s}(t)$ from its noise-corrupted version $x(t)$ by minimizing the following square error:
$$\left\| \hat{s}(t) - s(t) \right\|_2^2$$
where $s(t)$ is the true signal in the time domain and $\hat{s}(t)$ is the estimated signal in the time domain, which is obtained by
$$\hat{s}(t) = \mathrm{FSSTH}^{-1}\{ M(t, f) \cdot X(t, f) \}$$
Herein, $M(t, f)$ is a nonlinear function that maps $X(t, f)$ to a time-frequency representation of the estimated signal, and $\mathrm{FSSTH}^{-1}$ denotes the inverse FSSTH. The problem is thus regarded as a supervised learning problem: a deep neural network is designed to learn a sparse representation of data in the time-frequency domain and to create a nonlinear mapping function from a training set in which the distribution difference between signal and noise is well described.
Inspired by the advantage of auto-encoders in learning a sparse representation of data, we construct a sparsely promoted deep convolutional encoder-decoder network  (Figure 1), which is mainly composed of two modules: an encoder denoted by the convolutional layers and a decoder denoted by the deconvolutional layers. Meanwhile, the skip connections between two corresponding convolutional and deconvolutional layers are used, with which the training converges much faster and attains a higher-quality local optimum [36].
(1) Encoder module
DeepCEDNet projects the input data $X$ into a high-dimensional feature space in order to obtain the vector $y$ via a nonlinear mapping function $\mathrm{En}(X)$:
$$y = \mathrm{En}(X)$$
It is worth noting that the inputs to the first layer are the real and imaginary parts of the time-frequency coefficients of the noisy data $X$. $\mathrm{En}(X)$ is carried out by the encoder network, which consists of a series of 2D convolutional layers, rectified linear units (ReLU) and batch normalization (BN). The size of the convolution filter is constant (3 × 3), and the feature space is gradually reduced using strides of 2 × 2. These layers make up a feature extractor that learns a sparse representation of the time-frequency coefficients and captures the abstract content of the signal without the noise.
(2) Decoder module
The decoder module is almost symmetric to the encoder and can be thought of as its inverse process. The aim of the decoder is to remap $y$ back into the time-frequency space in order to recover the input details. Here, deconvolution is utilized to map the learned high-dimensional feature into the desired sparse representation. Then, the denoised signal $\hat{s}$ can be reconstructed by the inverse FSSTH with the constraint of Eq. (12). Throughout the training process, the network learns the sparse representation of the noisy data and the optimal map that recovers the desired signal by minimizing a loss function.
Figure 2 shows the flow diagram of the proposed denoising method. First, the noisy ECG signal is transformed into the time-frequency domain via the FSSTH. Then, DeepCEDNet takes the real and imaginary parts of the obtained time-frequency coefficients as the input and produces the mapping for the signal as the output. The estimated time-frequency coefficients $\hat{S}$ associated with the ECG signal are obtained by applying the mapping to the real and imaginary parts of the time-frequency coefficients of the noisy ECG signal $X$. Finally, the denoised ECG signal $\hat{s}$ is obtained by transforming $\hat{S}$ back into the time domain using an inverse FSSTH, followed by the least square constraint. Compared with conventional denoising algorithms, DeepCEDNet can automatically learn more abstract features from the noisy ECG data and remove the noise in the time-frequency domain.
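The encoder-decoder structure with a skip connection can be sketched in plain NumPy to make the data flow concrete. This is a toy illustration under explicit assumptions, not the DeepCEDNet architecture itself: it uses only two encoder and two decoder layers, random untrained weights, no batch normalization or dropout, and it replaces deconvolution with nearest-neighbour upsampling followed by a 3 × 3 convolution. The names `conv3x3`, `upsample2` and `forward` are hypothetical.

```python
import numpy as np

def conv3x3(x, w, stride=1):
    """3x3 cross-correlation with zero padding 1.
    x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    h_out = (x.shape[1] - 1) // stride + 1
    w_out = (x.shape[2] - 1) // stride + 1
    out = np.zeros((w.shape[0], h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            r, c = i * stride, j * stride
            patch = xp[:, r:r + 3, c:c + 3]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def relu(x):
    return np.maximum(x, 0.0)

def upsample2(x):
    """Nearest-neighbour upsampling by 2 (assumed stand-in for deconvolution)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def forward(x, weights):
    w1, w2, w3, w4 = weights
    e1 = relu(conv3x3(x, w1, stride=2))          # encoder: halve the feature map
    e2 = relu(conv3x3(e1, w2, stride=2))         # encoder: halve it again
    d1 = relu(conv3x3(upsample2(e2), w3)) + e1   # decoder + skip connection from e1
    out = conv3x3(upsample2(d1), w4)             # back to the input resolution
    return out, (e1, e2, d1)

rng = np.random.default_rng(0)
weights = (
    0.1 * rng.standard_normal((4, 2, 3, 3)),
    0.1 * rng.standard_normal((8, 4, 3, 3)),
    0.1 * rng.standard_normal((4, 8, 3, 3)),
    0.1 * rng.standard_normal((2, 4, 3, 3)),
)
# Two input channels: real and imaginary parts of a 16 x 16 time-frequency patch.
X = rng.standard_normal((2, 16, 16))
out, (e1, e2, d1) = forward(X, weights)
print(out.shape, e1.shape, e2.shape)
```

The skip connection is only possible because the decoder feature map `d1` has the same shape as the encoder feature map `e1`, which is why the decoder is built to mirror the encoder.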

III. EXPERIMENTS
A. EVALUATION CRITERIA
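In this paper, three criteria are used for the quantitative evaluation of denoising performance: signal to noise ratio (SNR), root mean squared error (RMSE) and percent root mean square difference (PRD):
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{t=1}^{N} s(t)^2}{\sum_{t=1}^{N} \left[ \hat{s}(t) - s(t) \right]^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left[ \hat{s}(t) - s(t) \right]^2}$$
$$\mathrm{PRD} = \sqrt{\frac{\sum_{t=1}^{N} \left[ \hat{s}(t) - s(t) \right]^2}{\sum_{t=1}^{N} s(t)^2}} \times 100$$
where $s(t)$ is the noise-free ECG signal, $\hat{s}(t)$ is the denoised ECG signal, and $N$ is the length of the ECG signal.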
It is worth noting that the SNR describes the level of noise suppression: the higher the SNR, the better the denoising performance. The RMSE measures the difference between the desired output and the actual one, so a lower RMSE indicates a smaller difference between the two. The PRD describes the signal recovery capability; a lower PRD means a better reconstruction.
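The three criteria can be computed with a few NumPy helpers. The sketch below assumes the commonly used definitions: SNR as the ratio of clean-signal energy to residual-error energy in dB, RMSE as the root of the mean squared residual, and PRD as the root of the relative residual energy in percent. The function names and the toy signal are illustrative only.

```python
import numpy as np

def snr_db(s, s_hat):
    """Output SNR in dB: clean-signal energy over residual-error energy."""
    s, s_hat = np.asarray(s, dtype=float), np.asarray(s_hat, dtype=float)
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s_hat - s) ** 2))

def rmse(s, s_hat):
    """Root mean squared error between the clean and denoised signals."""
    s, s_hat = np.asarray(s, dtype=float), np.asarray(s_hat, dtype=float)
    return np.sqrt(np.mean((s_hat - s) ** 2))

def prd(s, s_hat):
    """Percent root mean square difference."""
    s, s_hat = np.asarray(s, dtype=float), np.asarray(s_hat, dtype=float)
    return 100.0 * np.sqrt(np.sum((s_hat - s) ** 2) / np.sum(s ** 2))

# Toy example: a unit-amplitude signal with a constant residual error of 0.1.
s = np.array([1.0, -1.0, 1.0, -1.0])
s_hat = s + 0.1
print(snr_db(s, s_hat), rmse(s, s_hat), prd(s, s_hat))  # about 20 dB, 0.1, 10 %
```

Under these definitions SNR and PRD carry the same information, since PRD = 100 · 10^(−SNR/20); RMSE additionally depends on the absolute amplitude of the signal.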

B. DATASET DESCRIPTION
The proposed DeepCEDNet is tested using two standard ECG datasets, namely the MIT-BIH Arrhythmia Database and the MIT-BIH Long-Term ECG Database, because both contain long-duration recordings, which is helpful for training the deep neural network. The first dataset is composed of 48 ECG records, each 30 minutes long, sampled at 360 Hz and quantized with 11-bit resolution. The second one contains 7 long-term ECG records, each lasting about 14 h to 22 h at a sampling rate of 128 Hz. From each record, we extract 400 fragments, each with a length of 1024 samples. The real noise comprises BW, MA and EM, which are taken from the MIT-BIH Noise Stress Test Database.
We randomly divide the ECG datasets into three parts. More specifically, 90% of the samples are assigned to the training set and 10% to the test set; within the training set, 90% of the samples are kept for training and 10% are held out as the validation set, so that the overall split is 81% training, 9% validation and 10% test. Similarly, the noise dataset is also divided into three sections: the training set, validation set and test set. In the three segments, the noise contents are similar but not the same. To increase the randomness, we randomly choose noise samples from the training set for each noise source (BW, MA and EM) and randomly scale their amplitudes. Subsequently, the three kinds of noise are mixed with equal weight to form the complex noise, and the noisy ECG data are generated by adding this complex noise to the original ECG data. The FSSTH is applied to generate the time-frequency representation of the noisy ECG signal; for a detailed description of the FSSTH algorithm, please refer to [37]. Then, the obtained time-frequency matrix is normalized by removing the mean and dividing by the standard deviation. During prediction, the mean and standard deviation are temporarily saved in order to transform the processed time-frequency matrix back to the original scale after denoising. The real and imaginary parts of the time-frequency coefficients are fed to the deep neural network as two channels. The same procedure is applied to both the validation and test sets. We use the test set to analyze the final performance and demonstrate the denoising results.
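The data preparation steps above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the paper: the noise mixture is rescaled to hit a chosen input SNR, the normalization statistics are computed over the whole complex time-frequency matrix, and a random complex matrix stands in for the FSSTH output. All function names are hypothetical.

```python
import numpy as np

def split_indices(n, rng):
    """90/10 train/test split, then 90/10 train/validation split of the training part."""
    idx = rng.permutation(n)
    n_test = n // 10
    test, rest = idx[:n_test], idx[n_test:]
    n_val = len(rest) // 10
    return rest[n_val:], rest[:n_val], test      # train, validation, test

def mix_noise(clean, noise_sources, snr_db):
    """Mix the noise sources with equal weight, then scale the mixture so that
    the noisy signal has the requested input SNR (assumed scaling rule)."""
    noise = np.mean(noise_sources, axis=0)
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    return clean + scale * noise

def to_channels(Z):
    """Normalize a complex time-frequency matrix and split it into two real channels."""
    mean, std = Z.mean(), Z.std()                # np.std is real-valued for complex input
    Zn = (Z - mean) / std
    return np.stack([Zn.real, Zn.imag]), (mean, std)

def from_channels(channels, stats):
    """Undo to_channels using the saved mean and standard deviation."""
    mean, std = stats
    return (channels[0] + 1j * channels[1]) * std + mean

rng = np.random.default_rng(0)
train, val, test = split_indices(1000, rng)
print(len(train), len(val), len(test))           # 810 90 100

clean = np.sin(2 * np.pi * np.arange(1024) / 128.0)   # toy stand-in for an ECG fragment
noises = rng.standard_normal((3, 1024))               # toy stand-ins for BW, MA and EM
noisy = mix_noise(clean, noises, snr_db=0.0)

Z = rng.standard_normal((32, 64)) + 1j * rng.standard_normal((32, 64))  # placeholder TF matrix
channels, stats = to_channels(Z)
Z_back = from_channels(channels, stats)
```

Saving `stats` alongside each fragment is what allows the network output to be mapped back to the original scale before the inverse transform.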

C. EXPERIMENTAL RESULTS
In this section, we first evaluate the performance of the proposed DeepCEDNet. Figure 3 shows the training and validation losses of the presented model for the two datasets. It can be clearly seen that the training loss decreases rapidly and then converges to a small value, and the validation loss shows a similar trend. Moreover, after a certain number of epochs, the loss difference between the training set and the validation set is relatively small, which means that the proposed DeepCEDNet shows almost no overfitting; in other words, DeepCEDNet has strong learning ability.
CNN and FCN are two classic neural networks that have been widely applied in ECG signal analysis [27]; thus, they are employed for comparison. In this paper, the CNN consists of 22 convolutional layers, while the FCN is composed of 11 layers of convolution and deconvolution operators. Four examples from the test set are selected, and the denoising performance of DeepCEDNet (abbreviated CEDN below) is compared with that of the CNN and FCN. We set the number of decomposed modes for the FSSTH algorithm to 5. The parameters used by DeepCEDNet are summarized in Table 1. In addition, we apply the dropout technique with a rate of 0.5 to all layers, since it obtains the best performance on the validation set. Figures 4 and 5 show the denoised results for records 101m.dat and 223m.dat from the MIT-BIH Arrhythmia Database. In these figures, (a) is the original ECG signal and (b) is the noisy ECG signal with an SNR of 0 dB. The denoised results using CNN, FCN and CEDN are shown in (c), (d) and (e), respectively. As shown in Figure 4(b), the noise has a certain influence on the P wave, T wave and QRS complex. In Figure 5(b), however, the noise corrupts the ECG signal so severely that these characteristic waveforms (P, T and QRS) are completely distorted and the useful signal features cannot be effectively extracted. The denoised results indicate that CEDN successfully recovers the signal with less waveform distortion (see magenta rectangles in Figures 4(c), (d) and (e) and Figures 5(c), (d) and (e)); meanwhile, the signal leakage is minimal, and the ECG waveform and amplitude characteristics are well preserved after denoising. In contrast, CNN and FCN also suppress most of the noise, but residual noise remains, which may lead to an inappropriate diagnosis of the disease.
In addition, from the perspective of local details, FCN appears superior to CNN because it loses less ECG signal amplitude (see black rectangles in Figures 4(c), (d) and Figures 5(c), (d)). Similarly, our method is also applied to 14046m.dat and 14149m.dat from the MIT-BIH Long-Term ECG Database. In these data, it is difficult to accurately capture the features of the P, T and QRS waves in the presence of strong noise, which directly hinders correct diagnosis. Figures 6 and 7 show the denoised results based on CNN, FCN and CEDN. It can be clearly seen that CNN produces a severe loss of amplitude, especially in Figure 6(c). This phenomenon can also be found in FCN (Figure 6(d)), but it is not as serious as in CNN. In contrast, CEDN achieves better denoising performance and maintains the waveform characteristics of the ECG signal well compared to the other two methods (see magenta rectangles in Figures 6(c), (d) and (e)). In the example 14149m.dat, CEDN also yields a satisfactory denoised result, which is clearly superior to those of CNN and FCN. The difference is that the overall denoising performance of all three methods is better than for 14046m.dat. Besides, it is worth noting that FCN denoises better than CNN (see black rectangles in Figures 6(c), (d) and Figures 7(c), (d)). To further demonstrate the effectiveness of the proposed DeepCEDNet, we calculate the average SNR, RMSE and PRD over all ECG records in the test sets of the two datasets, which are plotted in Figures 8 and 9, respectively. The test set covers three input SNR levels of 0, 4, and 8 dB. As can be seen, CEDN obtains a higher SNR than CNN and FCN for all records (Figures 8(a) and 9(a)), which means that CEDN performs better in noise removal.
In Figures 8 and 9, CEDN also yields lower RMSE and PRD, which indicates that the signal denoised by DeepCEDNet is closer to the original signal; in other words, CEDN does a better job of preserving the details of the ECG signal.

IV. DISCUSSION
As an indicator, the ECG signal provides a large amount of valuable information that can be used to diagnose early-stage cardiovascular disorders. However, during acquisition and transmission the collected ECG signal is contaminated with various types of noise; thus, it is necessary to remove noise from the ECG signal. In this paper, we proposed a deep convolutional encoder-decoder network framework, DeepCEDNet. It learns a sparse representation of data in the time-frequency space in order to predict a mask that maps the noise-corrupted signal to the clean one by optimizing a loss function. The mask determined by the deep neural network effectively decomposes the input data into signal and noise. Experimental results show that DeepCEDNet performs clearly better in noise removal and detail preservation than the traditional CNN and FCN, which can be attributed to three aspects: (1) the difference between signal and noise is more obvious in the time-frequency domain; (2) DeepCEDNet makes full use of the capability of auto-encoders in learning a sparse representation of data; (3) the skip connections between corresponding convolutional and deconvolutional layers help to handle the vanishing gradient problem, improve the signal reconstruction performance, and enhance the robustness of the deep neural network. In addition, our tests indicate that DeepCEDNet significantly improves the SNR with minimal distortion to the underlying signal, which is extremely beneficial to clinical applications. As a deep learning model, DeepCEDNet also provides an end-to-end mapping, which is a complicated process. We visualize the responses of some of the convolutional layers for record 101m.dat in order to exhibit the abstract features learned by CEDN, as shown in Figure 10. In addition, our results indicate that FCN has a slight advantage over CNN in denoising performance. This may be due to the DAE framework, which helps to recover the characteristics of the signal of interest during the denoising process.
Although the denoising performance of DeepCEDNet is impressive, it does not achieve a perfect separation of signal and noise. Several issues should be discussed in depth, for example: (1) the number of data segments is not sufficient; (2) high-quality signal labels need to be taken into account; (3) the network imposes a large computational burden; (4) the parameter setting of CEDN needs further optimization. Solving these problems will help to improve the current DeepCEDNet so that it can be applied to more complex noise suppression tasks in the future.

V. CONCLUSION
We have developed a novel deep learning based denoising framework for ECG signals, DeepCEDNet. This network learns a sparse representation of data in the time-frequency domain and a nonlinear mapping function that aims at separating signal and noise. Experimental results indicate that DeepCEDNet significantly outperforms the traditional CNN and FCN in both noise removal and detail preservation, achieving higher SNR and lower RMSE and PRD. Thus, DeepCEDNet has the potential to provide a more effective denoising tool for ECG signal processing. Future work will focus on data segments, signal labels, parameter setting optimization and computational efficiency so that DeepCEDNet can be better applied in practice.