Uncertainty-Aware Denoising Network for Artifact Removal in EEG Signals

The electroencephalogram (EEG) is extensively employed for detecting various brain electrical activities. Nonetheless, EEG recordings are susceptible to undesirable artifacts, which can mislead data analysis and even significantly impact the interpretation of results. While previous efforts to mitigate or reduce the impact of artifacts have achieved commendable performance, several challenges in this domain persist: 1) owing to black-box skepticism, deep-learning-based automatic EEG artifact removal methods have been impeded from clinical application, so producing denoised EEG signals that are both reliable and highly accurate is important; and 2) effectively exploiting valuable local and global information from contaminated contexts remains challenging. On the one hand, feature extraction and aggregation in prior works are often performed blindly and assumed to be accurate, which is not always the case. On the other hand, global contextual information is modeled gradually, layer by layer, through local fixed single-scale convolutional filters, which is neither efficient nor effective. To address these challenges, we propose an Uncertainty-aware Denoising Network (UDNet) with multi-scale pooling attention for efficient context capturing. Specifically, we predict the aleatoric and epistemic uncertainty arising during the denoising process to help locate and refine uncertain feature representations. We further propose a simple yet effective architecture to capture local and global contexts at multiple scales. The predicted uncertainty can also serve as an effective metric for identifying low-confidence epochs that warrant deferral to human experts for further inspection and assessment. Experimental results on two public datasets show that the proposed model outperforms state-of-the-art baselines.


I. INTRODUCTION
The electroencephalogram (EEG) is a widely utilized method for detecting brain electrical activity, with applications in diagnosing various neurological pathologies [1], conducting cognitive science research [2], monitoring drivers [3], tracking health parameters [4], and constructing brain-computer interfaces [5], among others. Compared to other brain signal acquisition methods, EEG has several advantages, such as being non-invasive and user-friendly, having a high temporal resolution, and being more cost-effective. However, its weak amplitudes make it prone to contamination by various noises, especially physiological artifacts, including but not limited to ocular, myogenic, and cardiac artifacts. Such physiological artifacts can significantly disrupt neural information, potentially leading to its misinterpretation as normal phenomena in practical applications like brain-computer interfaces [6]. Furthermore, they might mimic cognitive or pathological activities, thereby introducing biases into the visual interpretation and diagnosis in clinical research studies, including Alzheimer's disease [7], sleep pattern analysis [43], and seizure detection [25], among others. Therefore, developing effective algorithms that reduce the impact of artifacts in EEG recordings while preserving neural information to the greatest possible extent is of utmost importance.
Detecting and removing artifacts automatically in such applications presents a significant challenge due to the overlapping nature of artifacts with background EEG rhythms and target events in both the temporal and spectral domains. Moreover, differentiating artifacts from the desired signal is difficult because artifacts can exhibit considerable variation depending on factors such as their origin, waveform shape, and frequency characteristics.
Traditional artifact removal algorithms have shown acceptable performance in various EEG-based applications. However, these algorithms are subject to certain limitations when applied to specific contexts [8], [10], [19]. In recent years, deep learning (DL) has emerged as a highly effective approach for automatic feature extraction and representation learning [20]. Consequently, significant research efforts have been dedicated to developing DL-based techniques for EEG artifact denoising [32], [33], [34], [35], [36], [37]. Compared to traditional models, DL-based approaches offer two major advantages. First, they exhibit universality, as their uniform architecture enables them to handle a wide range of artifact removal tasks without the need for manually designed prior assumptions specific to a particular type of artifact. Second, DL models possess higher capacity, which results in substantial performance improvements. Despite these advantages, several challenges persist in the domain of EEG artifact removal when utilizing DL-based methods.
C1: How to ensure the reliability of the denoising results. The widespread use of DL-based EEG artifact removal methods in clinical settings has yet to take off due to the common criticism that DL models are 'black boxes', especially when artificial intelligence is applied in healthcare and medicine [21]. This raises questions about how much to trust their results when denoising real-world signals. Fortunately, uncertainty estimation can be used to gauge model reliability and is already employed extensively in many other applications [22], [23], [24]. For instance, uncertainty estimation has been widely explored and proven beneficial in fields such as MRI reconstruction [22], image segmentation [23], and seizure prediction [24]. In these applications, uncertainty estimation not only aids in generating output results but also provides valuable confidence values, enabling better inference by agents. Furthermore, incorporating uncertainty measurement can lead to more informed decisions and potentially improve the quality of predictions [24]. Despite its integral role in many domains, uncertainty estimation has received limited attention in the field of EEG artifact removal. Moreover, existing studies merely treat uncertainty estimation as a regularization term, overlooking the relationship between uncertain regions and confidence. How to leverage the knowledge embedded in confident representations to improve uncertain ones remains an open question. Therefore, it is worth investigating the integration of uncertainty estimation into EEG denoising while simultaneously addressing the aforementioned challenges.
C2: How to restore accurate waveforms under severely noisy contexts. Leveraging convolution to extract contextual information to recover severely damaged segments is a common choice. In fact, local contexts in EEG segments reflect adjacent trend information, aiding the recognition of noisy positions. Current DL-based artifact removal methods model local contextual information by applying 1-D convolution with a fixed kernel size layer by layer. However, a single-scale kernel size is not suitable for all noisy segments. On the other hand, the global context reflects the long-term trend and potentially valuable information, which should also be taken into consideration. Correspondingly, the attention mechanism is known for its high flexibility in modeling global dependencies and has been widely applied across various domains [27], [30]. However, a key challenge in applying the attention mechanism lies in its inefficiency when dealing with long time series, which primarily stems from its high computation and memory complexity.
To address the aforementioned challenges, we propose a pioneering approach called the Uncertainty-aware Denoising Network (UDNet). UDNet focuses primarily on attaining precise denoising outcomes while concurrently providing accurate uncertainty estimation. To begin with, we incorporate an Uncertainty Estimation Module (UEM), which is responsible for assessing the combined aleatoric and epistemic uncertainty at each sampling point. Furthermore, we integrate a Feature Enhancement Module (FEM) to enhance the quality of the hidden representation of denoised signals by capturing local and global contextual information with an efficient multi-scale pooling-attention mechanism, since details at multiple scales contribute differently to signal recovery. By combining the UEM and FEM, our proposed UDNet can effectively leverage uncertainty information and improve the denoising process in both accuracy and reliability. In Figure 1, deeper colors in the uncertainty map and error map indicate higher uncertainty over the corresponding original noisy signal and higher reconstruction error, respectively.
To the best of our knowledge, our proposed approach represents the first attempt to incorporate uncertainty estimation for EEG artifact removal. As depicted in Figure 1, our model has the capability to predict the uncertainty at each sampling point during the denoising process. Consequently, it can effectively identify uncertain regions that are more likely to contain significant reconstruction errors. By leveraging this information, we are able to achieve more compelling denoising results. To summarize, this paper makes the following contributions:

II. RELATED WORK

A. Traditional Artifact Removal
Traditional methods can be divided into two main categories: those that estimate artifactual signals using a reference channel and those that decompose the EEG signal into other domains.
Specifically, regression-based methods [8] estimate noise signals by utilizing noise templates and subtracting them from the EEG data. Adaptive filtering [9] adjusts weights iteratively using optimization algorithms to reduce the amount of artifactual contamination in the primary input. However, the reliance on reference channels to enhance the accuracy of artifact removal poses limitations for certain applications [10], [11].
The Fourier transform and wavelet transform (WT) [15], [16] are used to map the signal from the time domain to the spectral domain, as EEG signals and artifacts often exhibit different spectral profiles. Wavelet quantile normalization [16] attenuates artifacts of different natures, requiring no auxiliary input, parameter tuning, or human intervention. The wavelet-domain optimized Savitzky-Golay (WOSG) filtering approach [17] uses an optimized SG filter in the wavelet domain for the removal of motion artifacts. The dyadic boundary points based empirical wavelet transform (DBPEWT) [18] introduced an optimal transition-width-based filter bank to decompose EEG time series into sub-band (SB) signals. However, due to the overlap between artifacts and the EEG spectrum [19], complete removal of artifacts may not be achievable, leading to the potential loss of neural information. Recent approaches have proposed hybrid methods such as EEMD-ICA [28] and EEMD-CCA [29], which combine traditional techniques [13], [14] to improve performance. However, these methods still do not address the limitations imposed by prior assumptions.

B. Deep Learning Artifact Removal
With the emergence of deep learning, deep denoising models have been developed to tackle the challenge of EEG artifact removal. One such model is the 1D-ResCNN [34], a one-dimensional residual convolutional neural network. It constructs a regression model capable of capturing the complex and intricate nonlinear relationship between noisy and clean EEG signals. DeepSeparator [33], an extension of linear blind source separation methods, is designed to learn the decomposition of the clean EEG signal and artifacts within a latent space. GRUMARSC [37] focuses on identifying the most relevant artifact pattern by utilizing an attention-based adaptive feature selection mechanism to prevent erroneous reconstruction of contaminated signals.
Compared to traditional methods, deep learning models offer significant advantages in terms of their universality and high capacity. However, the widespread adoption of deep learning in EEG denoising has been somewhat limited due to concerns regarding weak interpretability and safety. Therefore, there is growing interest in developing interpretable and reliable deep learning models specifically tailored for EEG denoising.

C. Uncertainty Estimation
In general, uncertainty in EEG artifact removal can be categorized into two types: aleatoric uncertainty (data uncertainty) and epistemic uncertainty (model uncertainty) [39]. Data uncertainty relates to the inherent noise present in the EEG signals. Model uncertainty, on the other hand, captures the uncertainty associated with the model parameters and can be reduced by increasing the number of training samples. Bayesian neural networks (BNNs) and their variants are commonly used to model epistemic uncertainty by introducing probability distributions over model parameters [40], [41]. However, these methods often require different training techniques for the neural network and may introduce additional model parameters, sometimes even doubling the parameter count. Gal et al. [42] proposed the Monte Carlo dropout framework (MC-Dropout), which can be applied directly to a pre-trained model. It applies stochastic dropout after each hidden layer and treats the output as a random sample from the posterior predictive distribution. In our approach, we propose a variant of MC-Dropout and focus on capturing the epistemic uncertainty of the noisy signal representation in each layer.

III. PRELIMINARIES

A. Problem Statement
Let $\mathcal{D} = \{X, Y\} = \{x_i, y_i\}_{i=1}^{N}$ be a training dataset, where $y_i \in \mathbb{R}^{L \times d}$ is the clean EEG signal for an input noisy EEG signal $x_i \in \mathbb{R}^{L \times d}$, $L$ denotes the length of each time-series sample, and $d$ denotes the number of channels of interest. Our primary objective is to learn a transformation function $f$, parameterized by weights $\omega$, that maps a given input $x$ to a cleaned EEG $\hat{y}$ and the associated uncertainty.

B. Bayesian Inference and Uncertainty Modeling
We define our likelihood as a Gaussian with mean given by the model output: $p(y \mid f^{\omega}(x)) = \mathcal{N}(f^{\omega}(x), \sigma^2)$, with an observation noise scalar $\sigma$. In the inference phase, given a test sample $x^*$, the predictive probability of $y^*$ is computed by
$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid f^{\omega}(x^*))\, p(\omega \mid \mathcal{D})\, d\omega,$$
where the posterior $p(\omega \mid \mathcal{D})$ is intractable and cannot be computed analytically. A variational posterior distribution $q_\theta(\omega)$, where $\theta$ are the variational parameters, is used to approximate the true posterior by minimizing the Kullback-Leibler (KL) divergence between $p(\omega \mid \mathcal{D})$ and $q_\theta(\omega)$, resulting in the approximate predictive distribution
$$q(y^* \mid x^*) = \int p(y^* \mid f^{\omega}(x^*))\, q_\theta(\omega)\, d\omega.$$
Minimizing the KL divergence is equivalent to maximizing the log evidence lower bound, and with the re-parametrization trick [45] a differentiable mini-batched Monte Carlo estimator can be obtained. The predictive (epistemic) uncertainty can be measured by performing $T$ inference runs and averaging predictions:
$$\hat{y}^* \approx \frac{1}{T} \sum_{t=1}^{T} f^{\omega_t}(x^*),$$
where $T$ corresponds to the number of sets of mask vectors drawn from the Bernoulli distribution in MC-Dropout, or the number of randomly trained models in an ensemble, each potentially leading to a different set of learned parameters $\omega = \{\omega_1, \ldots, \omega_T\}$.
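As a concrete illustration, the $T$-run Monte Carlo estimate above can be sketched with a toy stochastic model in numpy. The one-layer "network", its weights, and the dropout rate below are illustrative stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_stochastic(x, drop_p=0.5):
    """One stochastic forward pass of a toy one-layer 'network':
    a fixed linear map with Bernoulli dropout kept ON at test time."""
    W1 = np.ones((8, x.shape[0]))        # toy weights, stand-ins for learned omega
    W2 = np.ones((1, 8)) / 8.0
    h = W1 @ x
    mask = rng.random(h.shape) > drop_p  # Bernoulli mask vector (MC-Dropout)
    h = h * mask / (1.0 - drop_p)        # inverted-dropout scaling
    return (W2 @ h)[0]

x_star = np.array([0.3, -0.1, 0.7])
T = 200
samples = np.array([f_stochastic(x_star) for _ in range(T)])

y_mean = samples.mean()        # approximate predictive mean over T runs
epistemic_var = samples.var()  # spread of the T predictions = epistemic estimate
```

Each run draws a fresh dropout mask, so the spread of the $T$ outputs approximates the epistemic uncertainty of the prediction.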

C. Scaled Dot-Product Attention
Scaled Dot-Product Attention [44] is an attention function that calculates the weights by taking the dot product between queries and keys, offering benefits such as efficient use of space and time. Formally, it is defined as
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{F}}\right) V,$$
where $Q$, $K$, $V$, and $F$ are the queries, keys, values, and their dimension, respectively.
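For reference, the attention function can be sketched directly in numpy (the shapes and random toy inputs are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(F)) V."""
    F = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(F)                 # (L_q, L_k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
Q, K, V = rng.standard_normal((3, 16, 8))  # toy: length 16, dimension F = 8
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of the weight matrix is a probability distribution over the keys, so every output position is a convex combination of the value vectors.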
IV. METHODS

Our goal is to restore clean EEG segments with better confidence (less uncertainty) from their noisy observations containing EOG or EMG artifacts. To this end, we introduce an Uncertainty-aware Denoising Network (UDNet), which focuses on improving uncertain feature representations by leveraging the confident parts. As shown in Figure 2, UDNet consists of $M$ Uncertainty-aware Denoising Layers (UDLs) that are interconnected to enhance the hidden features (HFs). In the $m$th UDL ($m = 1, \ldots, M$), an Uncertainty Estimation Module (UEM) is utilized to estimate the uncertainty map $U_m$ and produce an intermediate denoising result $\hat{J}_m$. Additionally, a Feature Enhancement Module (FEM) is employed to modulate the previous hidden feature $HF_{m-1}$ by leveraging the confident features, generating the enhanced feature $EF_m$. To combine the enhanced feature $EF_m$ and the previous hidden feature $HF_{m-1}$, we use the uncertainty map $U_m$ as a gate. This linear combination allows the UDL to update the representation in $HF_{m-1}$ with $EF_m$ and output a more confident, improved representation $HF_m$. The process can be summarized as
$$HF_m = U_m \odot EF_m + (1 - U_m) \odot HF_{m-1},$$
where $\odot$ represents the element-wise product.
To provide more specific details, the first step in our approach is to convert the raw input $x \in \mathbb{R}^{L \times d}$ into a high-dimensional representation $HF_0 \in \mathbb{R}^{L \times F}$ via linear projection. After that, $HF_0$ is updated gradually by the $M$ UDLs to obtain $HF_1, \ldots, HF_M$. Finally, $HF_M$ is converted back to the original size by linear projection.
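A minimal numpy sketch of one gated UDL update. The exact form of the gate (uncertain positions take the enhanced feature, confident positions keep the previous hidden feature) follows our reading of the text and is an assumption; all names and sizes are illustrative:

```python
import numpy as np

def udl_update(HF_prev, EF, U):
    """One UDL update with the uncertainty map U (values in [0, 1]) as a gate.
    Assumption: high-uncertainty positions are replaced by the enhanced
    feature EF, while confident positions keep HF_prev."""
    return U * EF + (1.0 - U) * HF_prev

L, F = 6, 4
HF_prev = np.zeros((L, F))          # toy previous hidden feature
EF = np.ones((L, F))                # toy enhanced feature from the FEM
U = np.full((L, F), 0.25)           # moderately confident everywhere
HF_new = udl_update(HF_prev, EF, U)
```

At the extremes, a fully confident position ($U = 0$) is left untouched, and a fully uncertain one ($U = 1$) is overwritten by the enhanced feature.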

A. Aleatoric and Epistemic Uncertainty Estimation
Bayesian deep learning provides a comprehensive framework for modeling two distinct types of uncertainty: 1) aleatoric uncertainty, which arises from the noise inherent in the observations, and 2) epistemic uncertainty, which captures uncertainty within the model itself. These two forms of uncertainty are also present in denoising models. However, conventional methods typically yield deterministic outcomes without providing any information about their associated confidence. In this paper, we introduce an Uncertainty Estimation Module (UEM) to jointly model sampling-point-wise aleatoric uncertainty $\sigma_A^2$ and epistemic uncertainty $\sigma_E^2$. As shown in Figure 3, the UEM contains two branches that model aleatoric and epistemic uncertainty separately.
1) Aleatoric Uncertainty: Aleatoric uncertainty reflects the inherent noise and random influences that cannot be explained explicitly. In our approach, we assume that the denoising output at each sampling point, denoted as $p(J \mid \hat{J}, \omega)$, follows a Gaussian distribution whose mean and variance correspond to the ground-truth signal $J$ and the aleatoric uncertainty $\sigma^2$, where $\omega$ represents the network parameters. In the context of Bayesian neural networks, the functions are defined through the weights of the neural network, which serve as our sufficient statistics, denoted as $\omega = (W_m)_{m=1}^{M}$. We perform Maximum A Posteriori (MAP) inference to obtain the optimal values for $\omega$ given the observed data and any prior knowledge or assumptions:
$$\omega^{\text{MAP}} = \arg\max_{\omega} \; \log p(Y \mid X, \omega) + \log p(\omega).$$
We treat $\sigma_A^2 = \sigma^2$. Two branches in the UEM are used to predict $\sigma_A^2$ and $\hat{J}$ separately. The aleatoric branch is conditioned on the denoising result of the previous layer and is constrained by the following minimization objective:
$$\mathcal{L}_m^{r} = \frac{1}{D_m} \sum_{i=1}^{D_m} \frac{1}{2 (\sigma_{m,A}^{i})^{2}} \left\| J^{i} - \hat{J}_m^{i} \right\|^{2} + \frac{1}{2} \log (\sigma_{m,A}^{i})^{2},$$
where the superscript $i$ denotes the sampling-point index and $D_m$ is the number of output points. In practice, we train the UEM to predict the log variance $s_m^{i} := \log (\sigma_{m,A}^{i})^{2}$:
$$\mathcal{L}_m^{r} = \frac{1}{D_m} \sum_{i=1}^{D_m} \frac{1}{2} \exp(-s_m^{i}) \left\| J^{i} - \hat{J}_m^{i} \right\|^{2} + \frac{1}{2} s_m^{i}.$$

2) Epistemic Uncertainty: Epistemic uncertainty encompasses the uncertainty in the model parameters, representing our lack of knowledge about which specific aspects of the model generated the observed data. Existing denoising models fail to capture epistemic uncertainty since they keep the network's parameters deterministic and optimize the network directly.
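The log-variance form of the aleatoric objective can be sketched numerically (toy values; `aleatoric_loss` is an illustrative helper, not the paper's code):

```python
import numpy as np

def aleatoric_loss(J, J_hat, s):
    """Log-variance form of the heteroscedastic Gaussian objective:
    mean over points of  0.5 * exp(-s) * (J - J_hat)^2 + 0.5 * s,
    where s = log(sigma_A^2) is predicted by the network."""
    return np.mean(0.5 * np.exp(-s) * (J - J_hat) ** 2 + 0.5 * s)

J     = np.array([0.0, 1.0, -0.5])   # toy ground-truth points
J_hat = np.array([0.1, 0.8,  0.0])   # toy predicted denoised points
s     = np.zeros(3)                  # log-variance 0  =>  sigma_A^2 = 1
loss = aleatoric_loss(J, J_hat, s)
```

Predicting a large $s$ at a point down-weights its residual but pays a $\tfrac{1}{2}s$ penalty, so the network only claims high aleatoric uncertainty where the residual is genuinely hard to reduce.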
Bayesian neural networks are commonly used to capture epistemic uncertainty by replacing the deterministic network parameters with a prior distribution and then performing Bayesian inference to compute the posterior distribution over these weights. Since the posterior distribution is not tractable for a Bayesian NN, Bayesian convolutional neural networks [40] define an approximating variational distribution $q_\theta(W_m)$ for every layer $m$ to relate approximate inference to dropout training:
$$W_m = M_m \cdot \operatorname{diag}\!\left([z_{m,n}]_{n=1}^{F_m}\right), \quad z_{m,n} \sim \text{Bernoulli}(p_m),$$
where $z_{m,n}$ are random variables following a Bernoulli distribution with probability $p_m$, $M_m$ represents the variational parameters to be optimized, and the operator $\operatorname{diag}(\cdot)$ maps vectors to diagonal matrices. Following [40], we reframe the 1-D convolution operation as a linear operation over the kernels. Specifically, let $K_k \in \mathbb{R}^{l \times F_{m-1}}$, for $k = 1, \ldots, F_m$, be the CNN's kernels with height $l$ and $F_{m-1}$ channels, and let the input to the layer be $x \in \mathbb{R}^{L_{m-1} \times F_{m-1}}$. Convolving the input with a given stride $s$ can be interpreted as extracting segments from the input, each of dimension $l \times F_{m-1}$. These segments are vectorized and collected as rows of a matrix, yielding a new representation $\bar{x} \in \mathbb{R}^{L_m \times l F_{m-1}}$, where $L_m$ is the number of segments. The vectorized kernels are arranged as columns of the weight matrix $W_m \in \mathbb{R}^{l F_{m-1} \times F_m}$. Thus, the convolution operation can be expressed as the matrix product $\bar{x} W_m \in \mathbb{R}^{L_m \times F_m}$. To capture epistemic uncertainty, we introduce a prior distribution over the convolution kernels and leverage Bernoulli variational distributions to approximately integrate over each kernel-segment pair. We sample Bernoulli random variables $z_{m,n}$ and multiply the segments by the weight matrix $W_m \cdot \operatorname{diag}([z_{m,n}]_{n=1}^{F_m})$, which is equivalent to an approximating distribution that models each kernel-segment pair with a distinct random variable while tying the means of the random variables across segments. Such a modeling approach randomly sets certain kernels to zero for different segments. Implementing a Bayesian CNN in the inference stage is therefore equivalent to applying dropout after every convolution layer across multiple forward propagations.
Further, to quantify epistemic uncertainty and perform Bayesian CNN training, we adopt an alternative method. To be specific, we incorporate distributions over each 1D-Conv-Sigmoid layer of each UEM and employ a mask operation to approximate the inference process. Herein, we randomly mask part of the input feature channels by setting their values to 0. These masked inputs are then passed through a shared Conv layer to reconstruct the original EEG signals. This process is repeated $T$ times, resulting in $T$ different denoising results $\{\hat{J}_{m,t}\}_{t=1}^{T}$. Subsequently, we compute the predicted mean $\hat{J}_m = \frac{1}{T} \sum_{t=1}^{T} \hat{J}_{m,t}$ and the epistemic uncertainty $\sigma_{m,E}^{2} = \frac{1}{T} \sum_{t=1}^{T} (\hat{J}_{m,t} - \hat{J}_m)^{2}$, which quantifies the level of uncertainty in the model's prediction.
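The channel-masking procedure above can be sketched in numpy, assuming a linear stand-in for the shared Conv layer (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def masked_forward(HF, conv_w, mask_p=0.3):
    """One stochastic pass: randomly zero a fraction of input feature
    channels, then apply a shared linear ('Conv') readout."""
    mask = rng.random(HF.shape[1]) > mask_p      # channel-wise Bernoulli mask
    return (HF * mask) @ conv_w                  # (L, F) @ (F,) -> (L,)

L, F, T = 32, 8, 50
HF = rng.standard_normal((L, F))                 # toy hidden feature of one UDL
conv_w = rng.standard_normal(F) / np.sqrt(F)     # toy shared readout weights

J_samples = np.stack([masked_forward(HF, conv_w) for _ in range(T)])  # (T, L)
J_mean = J_samples.mean(axis=0)                  # predicted denoised signal
epistemic = J_samples.var(axis=0)                # per-point epistemic estimate
```

Because the mask changes across the $T$ repetitions while the readout weights are shared, the per-point variance of the outputs isolates the model's sensitivity to which channels it relies on.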
To summarize, the predicted uncertainty of each sampling point in the $m$th UEM can be approximated as
$$U_m \approx \sigma_{m,A}^{2} + \sigma_{m,E}^{2}.$$
With the assistance of the uncertainty map $U_m$ obtained from UEM$_m$, we can discern the level of uncertainty associated with each sampling point.

B. Feature Enhancement Module (FEM)
The primary objective of the FEM is to enhance uncertain features. We observe that hard-to-denoise regions are highly related to the estimated uncertainty map and usually contain complex contexts. In the artifact removal task, it is common for the clean signal to take the same amplitude at different time steps. This non-local similarity suggests that two distant sampling points in the clean signal, despite having the same amplitude, may experience different types of degradation, resulting in complex contextual dependencies in their corresponding noisy observations. More importantly, the denoising network is required to map these two different observations to the same amplitude, a challenging many-to-one mapping task. Therefore, capturing the trend around the current time step to help identify informative contexts is the key point.
However, capturing local trends under ambiguous contexts, as existing works do [33], [34], [35], is not trivial. Inaccurate contexts captured by 1-D convolution will accumulate errors layer by layer. Alternatively, global receptive fields can help learn effective contextual information to a certain extent, since similar trends at the far end may be relatively clean. In representation learning scenarios, the attention mechanism is commonly employed to automatically extract the most pertinent information, especially from a global perspective [44]. Nevertheless, when it comes to capturing contextual information from long time-series data such as EEG signals, the attention mechanism can pose challenges due to limitations in computing resources and memory.
Based on this premise, we have devised a pooling-attention architecture that endeavors to capture both local and global contextual information concurrently while mitigating the quadratic attention complexity. This design improvement enhances the efficiency of attention-like modules for EEG artifact removal applications. Additionally, we integrate multi-scale convolutions instead of linear mappings to enhance the perception of trends in the data.
1) Trend-Aware Multi-Head Attention: We propose the Trend-aware Multi-Head self-Attention (TMHA) mechanism to capture contextual information in signals. TMHA is built upon the self-attention mechanism, which derives queries, keys, and values from the same sequence of symbol representations. In TMHA, we first employ Multi-Head Self-Attention (MHA), which enables simultaneous attention across multiple representation subspaces.
To implement MHA, previous works apply linear projections to transform the inputs into separate representation subspaces. The attention function (Eq. 5) is then performed independently and in parallel for each subspace. Afterward, the resulting outputs from all subspaces are concatenated and projected to generate the final output. By employing MHA, we can effectively capture interdependencies between different parts of the signals, facilitating the modeling of contextual information in a trend-aware manner. Formally,
$$\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^{O}, \quad \text{head}_j = \text{Attention}(Q W_j^{Q}, K W_j^{K}, V W_j^{V}),$$
where $h$ denotes the number of attention heads, and $W_j^{Q}$, $W_j^{K}$, $W_j^{V}$, and $W^{O}$ are projection matrices applied to $Q$, $K$, $V$, and the final output, respectively.
The traditional multi-head self-attention mechanism, originally designed for discrete tokens such as words, may not adequately capture the local trend information inherent in continuous data such as EEG signals. Applying this mechanism directly to EEG signal transformation can lead to a mismatch between the attention mechanism and the data characteristics [31]. To address this limitation and incorporate local trend information into numerical data prediction, we propose a novel approach called TrSelfAttention, which stands for Transformer-based Self-Attention. TrSelfAttention is inspired by the Convolutional Self-Attention model [31] and introduces 1-D convolutions to replace the projection operations on queries and keys in Eq. 13. This modification enables the model to consider local contextual information and be more sensitive to the changing trends present in the noisy EEG signals.
Mathematically, TrSelfAttention is defined as
$$\text{TrSelfAttention}(Q, K, V) = \text{softmax}\!\left(\frac{(Q_j \star Q)(K_j \star K)^{\top}}{\sqrt{F}}\right) V,$$
where $\star$ indicates the convolution operation and $Q_j$, $K_j$ are the parameters of the convolution kernels.
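A minimal numpy sketch of trend-aware attention with a depthwise "same" convolution over queries and keys. The smoothing kernel and helper names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def conv1d_same(X, kernel):
    """Depthwise 'same' 1-D convolution along the time axis of X (L, F)."""
    return np.stack([np.convolve(X[:, j], kernel, mode="same")
                     for j in range(X.shape[1])], axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tr_self_attention(Q, K, V, kernel):
    """Queries and keys are convolved first, so similarity scores compare
    local trends rather than single sampling points."""
    Qc, Kc = conv1d_same(Q, kernel), conv1d_same(K, kernel)
    F = Q.shape[-1]
    return softmax(Qc @ Kc.T / np.sqrt(F)) @ V

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 8))          # shared sequence: Q = K = V = X
kernel = np.array([0.25, 0.5, 0.25])      # illustrative smoothing kernel
out = tr_self_attention(X, X, X, kernel)
```

With a kernel wider than one sample, two points match only when their surrounding trends agree, which is the intended trend-aware behavior.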
2) Pooling Attention: To address the quadratic complexity of self-attention blocks, we introduce a pooling operation before attending to the input. The pooling operator, denoted $P(\cdot; \Theta)$, downsamples the intermediate tensors $K$ and $V$. The parameter $\Theta := (k, s, p)$ specifies the pooling kernel size $k$, stride $s$, and padding $p$. By default, we employ non-overlapping kernels with shape-preserving padding in our pooling attention operators. This yields an output tensor whose signal length $\tilde{L}$ is reduced by a factor of $s$ compared to the input tensor's length $L$. The pooled tensors are denoted $\tilde{K} = P(K; \Theta_K)$ and $\tilde{V} = P(V; \Theta_V)$, and the attention computation is then performed on these shortened tensors.
Naturally, the pooling operation introduces the constraint $s_K \equiv s_V$, so that the pooled keys and values keep the same length.

Computational Analysis: By pooling the key and value tensors, the computation and memory requirements of attention, which scale quadratically with the signal length, are dramatically reduced. Denoting the reduction factors for the signal lengths as $f_Q$, $f_K$, and $f_V$, we have $f_K \equiv f_V$. Considering an input tensor to the pooling operator $P(\cdot; \Theta)$ of dimensions $D \times T$, the runtime complexity of TMHA is $O\!\left(\frac{DT}{h}\left(D + \frac{T}{f_Q f_K}\right)\right)$ per head, and the memory complexity is $O\!\left(T h \left(\frac{D}{h} + \frac{T}{f_Q f_K}\right)\right)$.
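The shrinkage of the attention score matrix can be sketched as follows. Average pooling and the toy shapes are illustrative assumptions; the point is only that pooled keys/values cut the score matrix from $L \times L$ to $L \times L/s$:

```python
import numpy as np

def pool1d(X, k, s):
    """Average pooling P(X; (k, s)) along the time axis (non-overlapping
    when k == s)."""
    L = (X.shape[0] - k) // s + 1
    return np.stack([X[i * s: i * s + k].mean(axis=0) for i in range(L)])

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pooling_attention(Q, K, V, k=4, s=4):
    """Attend over pooled keys/values: the score matrix shrinks from
    (L x L) to (L x L/s), cutting the quadratic cost accordingly."""
    Kp, Vp = pool1d(K, k, s), pool1d(V, k, s)   # s_K == s_V keeps lengths equal
    F = Q.shape[-1]
    return softmax(Q @ Kp.T / np.sqrt(F)) @ Vp

rng = np.random.default_rng(4)
X = rng.standard_normal((64, 8))
out = pooling_attention(X, X, X)                # scores: 64 x 16, not 64 x 64
```

Pooling K and V with the same stride is what enforces the $s_K \equiv s_V$ constraint: the softmax weights over the pooled keys must align one-to-one with the pooled values.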
3) Multi-Scale Pooling Attention: The effectiveness of examining EEG signals at multiple scales has been demonstrated in several studies [26], [49]. In the context of EEG artifact removal, two factors motivate our Multi-Scale Pooling Attention mechanism: (i) by working at multiple lower resolutions, we hope to reduce the computing requirements while maintaining satisfactory performance; and (ii) multiple scales provide a more comprehensive sense of context within the EEG signals, and this contextual information at lower resolutions can better guide the processing and decision-making at higher resolutions.
Herein we utilize 4 Pooling Attention units, each with different kernel sizes and pooling strides ranging from 1 to 4. These units operate on the same input feature map, enabling the extraction of information from multiple receptive-field scales. By applying pooling operations at various scales, we generate a feature map for each scale. These feature maps are then combined to create a multi-scale feature map. This integration of information from different receptive-field sizes enhances the representation of the input data, allowing for a more comprehensive understanding of its underlying patterns and structures.
$$X_{ms} = \text{Concat}\!\left(\text{PoTrAttention}_1(X), \ldots, \text{PoTrAttention}_4(X)\right),$$
where $\text{PoTrAttention}_r$ is the Trend-aware Pooling Attention with the kernel size and pooling stride of scale $r$ ($r = 1, \ldots, 4$), and $X_{ms}$ is the multi-scale feature map.
To minimize the number of parameters, we incorporate a bottleneck layer responsible for reducing the channels of the concatenated feature map:
$$EF = \text{Bottleneck}(X_{ms}),$$
where $EF$ is the final multi-scale feature map with $C_{out} = C_{in}/\text{rate}$ channels, $C_{in}$ is the channel number of $X_{ms}$, and rate is the down-sampling rate of the bottleneck layer.
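A numpy sketch of the multi-scale aggregation and bottleneck reduction, substituting a simple moving average for each Pooling Attention unit to keep the sketch short (all helper names and the random projection are illustrative):

```python
import numpy as np

def avg_pool_same_length(X, k):
    """Moving-average filter that keeps the sequence length: a stand-in for
    one Pooling Attention unit operating at receptive-field scale k."""
    kernel = np.ones(k) / k
    return np.stack([np.convolve(X[:, j], kernel, mode="same")
                     for j in range(X.shape[1])], axis=1)

def multi_scale_features(X, scales=(1, 2, 3, 4), rate=4):
    """Concatenate per-scale feature maps along channels, then reduce with a
    toy 1x1 'bottleneck' projection so that C_out = C_in / rate."""
    X_ms = np.concatenate([avg_pool_same_length(X, k) for k in scales], axis=1)
    C_in = X_ms.shape[1]
    rng = np.random.default_rng(5)
    W_b = rng.standard_normal((C_in, C_in // rate)) / np.sqrt(C_in)
    return X_ms @ W_b                            # enhanced feature EF

X = np.random.default_rng(6).standard_normal((32, 8))
EF = multi_scale_features(X)                     # 4*8 = 32 channels -> 8
```

With four scales and rate 4, the concatenation quadruples the channel count and the bottleneck brings it back to the input width, so the module adds context without widening the representation.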

C. Model Training
Considering the presence of $M$ UDLs in UDNet, we have $M$ UEMs responsible for estimating the denoising results and uncertainty maps. The overall training objective is defined as
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \theta_f \left\| y_i - \hat{y}_i \right\|^{2} + \theta_m \sum_{m=1}^{M} \mathcal{L}_m^{r} \right),$$
where $N$ is the total number of samples, $\hat{y}$ is the final denoised EEG signal, $\mathcal{L}_m^{r}$ is the reconstruction loss described in Eq. 9, and $\theta_f$ and $\theta_m$ are weight factors.

V. EXPERIMENTS
To validate the effectiveness of the proposed method, we conduct performance comparisons on semi-simulated EEG recordings using two publicly available datasets, ISRUC and TUSZ. Each dataset contains EEG that has undergone visual inspection and noise-reduction processing. We synthesize contaminated signals based on corresponding artifact sources.

A. Dataset and Experiment Settings
1) ISRUC Dataset: The ISRUC-S3 dataset [46] contains 10 healthy subjects (9 male and 1 female). Each recording in the dataset includes 6 EEG channels, 2 EOG channels, and 3 EMG channels. Furthermore, domain experts have classified these polysomnography (PSG) recordings into five sleep stages, adhering to the standards set by the American Academy of Sleep Medicine (AASM).
2) TUSZ Dataset: The TUSZ dataset [47] is one of the largest annotated datasets available for EEG seizure classification. It comprises a total of 5,612 EEGs, including 3,050 annotated seizures extracted from clinical recordings, covering four distinct seizure types [48]. The dataset includes 19 EEG channels following the standard 10-20 system.

B. Implementation Details
We implemented the UDNet model in the PyTorch framework and trained it with the Adam optimizer at a learning rate of $10^{-3}$. The model dimension $F$ is 64, and the number of UDLs $M$ is 4. We empirically set $\theta_f$ and $\theta_m$ to 1.
We perform subject-independent experiments and split each dataset into training and testing sets at a ratio of 6:1 for TUSZ and 9:1 for ISRUC. Each experiment was repeated 5 times, and the reported results represent the mean values.
We train in an end-to-end manner, relying on synthetic pairs of contaminated EEG and ground truth to optimize the total objective. Specifically, the contaminated EEG x is generated by linearly combining the clean EEG segments y with EOG or EMG artifact segments:

x = y + λ · n,

where the term n represents either ocular or myogenic artifacts. The hyperparameter λ regulates the signal-to-noise ratio (SNR) of the contaminated EEG signal:

SNR = 10 · log10( RMS(y) / RMS(λ · n) ), with RMS(z) = sqrt( (1/N_z) · Σ_{i=1}^{N_z} z_i² ),

where RMS(·) denotes the root mean square, N_z denotes the number of EEG points in the segment z, and z_i denotes the i-th sampling point. The SNR in our experiment is −10 dB.
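Solving the SNR definition for λ gives λ = RMS(y) / (RMS(n) · 10^(SNR/10)), so a segment can be contaminated at an exact target SNR. A small numpy sketch follows; the toy sine "EEG" and Gaussian "artifact" are illustrative stand-ins for real segments:

```python
import numpy as np

def rms(z):
    # RMS(z) = sqrt((1/N_z) * sum_i z_i^2)
    return np.sqrt(np.mean(np.square(z)))

def contaminate(y, n, snr_db):
    """Mix clean EEG y with artifact n so that
    10*log10(RMS(y)/RMS(lam*n)) equals snr_db exactly."""
    lam = rms(y) / (rms(n) * 10 ** (snr_db / 10.0))
    return y + lam * n, lam

rng = np.random.default_rng(0)
t = np.linspace(0, 2, 512)
y = np.sin(2 * np.pi * 10 * t)            # toy "clean" EEG segment
n = rng.standard_normal(t.size)           # toy artifact segment

x, lam = contaminate(y, n, snr_db=-10.0)  # -10 dB, as in the experiments
achieved = 10 * np.log10(rms(y) / rms(lam * n))
print(round(achieved, 6))                 # -10.0
```

At −10 dB the scaled artifact carries roughly ten times the RMS amplitude of the clean signal, which is what makes this a demanding denoising setting.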
C. Performance Metrics
1) Change in Signal-to-Noise Ratio: We define the metric ΔSNR as the change in the signal-to-noise ratio before and after artifact removal. The calculation of ΔSNR is defined as

ΔSNR = SNR_after − SNR_before, (25)

where SNR_before and SNR_after are the signal-to-noise ratios before and after artifact removal, respectively.
2) Normalized Mean Squared Error: The normalized mean squared error (NMSE), in decibels, is defined as:

NMSE = 10 · log10( Σ_i (ŷ_i − y_i)² / Σ_i y_i² ),

where y_i is the i-th sample of the signal y and ŷ_i the corresponding denoised sample.
3) Change in Correlation: We define the change in correlation ΔR before and after artifact removal as:

ΔR = R_after − R_before, (27)

where R_before and R_after are the Pearson correlation coefficients between the ground truth and the signal before and after artifact removal, respectively.

4) Improvement in Spectral Coherence: We define the improvement in coherence I_coh before and after artifact removal as:

I_coh = C_after − C_before,

where C_before and C_after denote the average magnitude-squared coherence, calculated between the ground truth and the signal before and after artifact removal, respectively.
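All four metrics can be computed directly from a ground-truth segment and the signals before and after denoising. The numpy/scipy sketch below evaluates them on toy data; the sampling rate, window length, noise levels, and toy signals are illustrative assumptions (`scipy.signal.coherence` returns the magnitude-squared coherence):

```python
import numpy as np
from scipy.signal import coherence

def snr(y, x):
    """SNR of x w.r.t. ground truth y, in dB, treating x - y as residual noise."""
    return 10 * np.log10(np.sqrt(np.mean(y ** 2)) / np.sqrt(np.mean((x - y) ** 2)))

def nmse_db(y, y_hat):
    # NMSE = 10*log10(sum (y_hat - y)^2 / sum y^2)
    return 10 * np.log10(np.sum((y_hat - y) ** 2) / np.sum(y ** 2))

def pearson_r(y, x):
    return np.corrcoef(y, x)[0, 1]

def mean_coherence(y, x, fs=256):
    _, c = coherence(y, x, fs=fs, nperseg=128)
    return c.mean()

rng = np.random.default_rng(1)
t = np.arange(1024) / 256.0
y = np.sin(2 * np.pi * 10 * t)                    # ground truth
x_before = y + 1.0 * rng.standard_normal(y.size)  # contaminated input
x_after = y + 0.1 * rng.standard_normal(y.size)   # hypothetical denoised output

delta_snr = snr(y, x_after) - snr(y, x_before)    # Delta-SNR, Eq. (25)
delta_r = pearson_r(y, x_after) - pearson_r(y, x_before)
i_coh = mean_coherence(y, x_after) - mean_coherence(y, x_before)
print(delta_snr > 0, nmse_db(y, x_after) < nmse_db(y, x_before), delta_r > 0)
```

A good denoiser should yield positive ΔSNR, ΔR, and I_coh and a more negative NMSE than the contaminated input.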

D. Comparison With the State-of-the-Art Methods
To validate the effectiveness of UDNet, we conducted experiments with the subject-independent procedure. We compare the proposed UDNet with four traditional methods (i.e., EMD-ICA, EMD-CCA [12], WT [15], and WQN [16]), two signal-processing-based methods (i.e., WOSG [17] and DBPEWT [18]), and five DL-based methods (i.e., MMNN [38], Novel CNN [35], 1D-ResCNN [34], DeepSeparator [33], and GRUMARSC [37]). We included artifacts of different natures (ocular, muscular) using semi-simulated datasets. As shown in Table I, UDNet achieved the highest ΔSNR and the lowest NMSE across all datasets. Furthermore, UDNet demonstrated superior performance in terms of correlation and coherence improvement, because it effectively preserves frequency information and avoids spectral distortion. In particular, EMD-ICA and EMD-CCA showed inferior performance on the seizure-related dataset, since TUSZ contains more subjects and channels than ISRUC. WOSG and DBPEWT exhibit stable results across different datasets. MMNN is ineffective on the NMSE and ΔSNR indicators, while it performs well on ΔR and I_coh, demonstrating its ability to recover information in the spectral domain. Novel CNN and 1D-ResCNN had relatively stable performance across different noise types and datasets. DeepSeparator severely damages spectral-domain information in EOG artifact scenarios. GRUMARSC showed a significantly reduced effect on the more complex epilepsy data. Overall, UDNet stood out among the tested methods, achieving the best improvement in both the temporal and spectral domains.
1) Ablation Experiment: To assess the individual contribution of each module in our model, we designed several variant models. These variants modify specific modules while keeping the rest of the architecture unchanged. We start with a 4-layer 1-D temporal convolution architecture as the basic model, upon which we gradually add and stack the remaining modules to form a complete branch. Then, we integrate data and model uncertainty estimation into the basic model, separately. Finally, we integrate the multi-scale pooling-attention feature enhancement module to form the proposed model. The specific variants are as follows: • variant a (Temporal Convolution (Base Model)): We utilize a 4-layer temporal convolution as the base model.
• variant b (+ Data Uncertainty): We add data uncertainty estimation σ²_{m,A} to variant a to form a data-uncertainty-aware temporal convolution network. • variant e (+ Single-scale Pooling-Attention): We replace the single-scale temporal convolution with single-scale pooling-attention feature enhancement, based on variant d equipped with the uncertainty estimation module.
• variant f (+ Multi-scale Pooling-Attention): We replace the single-scale temporal convolution with multi-scale pooling-attention feature enhancement, based on variant d with the whole uncertainty estimation module.

Figure 6 demonstrates the effectiveness of the key modules in our model. We begin by evaluating the impact of the two types of uncertainty individually. The results show that both types of uncertainty lead to performance improvements. Additionally, using the fused uncertainty U_m = σ²_{m,A} + σ²_{m,E}, which combines both types, further enhances the performance. Moreover, trend-aware pooling attention provides a global receptive field for capturing the global trend context. Meanwhile, multiple scales outperform a single scale, since more potential patterns can be characterized distinctively. In summary, the ablation experiment validates the effectiveness of each module in our model.
2) Effectiveness of UEM: We further visualize the denoised results, uncertainty maps, and reconstruction errors in Figure 5. From top to bottom: first, our denoised results are superior to those of traditional and deep methods. More importantly, the uncertainty maps are intricately linked to the reconstruction errors: the magnitude of the error correlates directly with the corresponding uncertainty value. Using these maps, epochs that receive high uncertainty scores from our model can be deferred to clinical experts for further scrutiny and examination. Providing uncertainty-aware reconstruction is therefore of great significance, as it actively prevents misleading data use and decision-making.
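Following the UEM description (a 1-layer convolution with Sigmoid for aleatoric uncertainty, and Monte Carlo estimation over shared convolutions for epistemic uncertainty; see Fig. 3), the two branches and their fusion U_m = σ²_{m,A} + σ²_{m,E} can be sketched in numpy as below. All shapes, the Bernoulli-mask emulation of dropout, and the number of MC samples K are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical feature map H (channels x time) from one denoising layer
C, T = 16, 128
H = rng.standard_normal((C, T))

# Aleatoric branch: a single 1x1 "convolution" followed by Sigmoid
w_a = rng.standard_normal((1, C)) / np.sqrt(C)
sigma2_A = sigmoid(w_a @ H)                    # (1, T) per-point data uncertainty

# Epistemic branch: Monte Carlo estimate over K stochastic forward passes,
# emulated here with Bernoulli masks applied to a shared projection
w_e = rng.standard_normal((1, C)) / np.sqrt(C)
K, p_keep = 20, 0.8
preds = []
for _ in range(K):
    mask = rng.random((C, 1)) < p_keep
    preds.append((w_e * mask.T / p_keep) @ H)  # masked shared weights
preds = np.stack(preds)                        # (K, 1, T)
sigma2_E = preds.var(axis=0)                   # predictive variance

U = sigma2_A + sigma2_E                        # fused uncertainty map U_m
print(U.shape)                                 # (1, 128)
```

Points where U is large flag low-confidence reconstructions, which is exactly the signal used to defer epochs to human experts.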

4) Downstream Task Performance of Different Artifact Removal Methods: To compare the quality of the generated denoising results, we compare UDNet with 6 representative artifact removal methods on two downstream tasks (i.e., seizure classification and sleep staging), with corr-DCRNN [48] and MSTGCN [50] as the task-related classification models. We pre-train them on clean datasets and test on the contaminated datasets. Figure 9 shows the total F1 score and per-class F1 scores on the TUSZ and ISRUC datasets with different artifact sources, respectively. We also provide clean-EEG classification results for comparison.
UDNet improves the classification performance of almost all classes in the TUSZ and ISRUC datasets. Specifically, on the TUSZ dataset, the pre-trained corr-DCRNN performs extremely poorly on all classes except CF, whose training samples outnumber the sum of all other types. Importantly, our method significantly improves the classification accuracy of the minority classes, i.e., AB, CT, and GN. Almost all remaining methods yielded results similar to the noisy EEG signal. For the ISRUC dataset, the results were similar to TUSZ. MSTGCN cannot accurately judge noisy EEG signals and drops severely in all five stages. Our UDNet improves the classification performance, especially for the REM and Wake stages. Overall, the denoised signals generated by traditional methods offer only limited improvement in downstream tasks. DL-based methods such as 1D-ResCNN, DeepSeparator, and GRU-MARSC are relatively better-performing and more stable. Our UDNet produces robust denoising results on both datasets.

VI. DISCUSSION
In this paper, we propose an innovative denoising network called UDNet, which aims to produce reliable and accurate denoised results. Our method has more stable and accurate denoising performance than traditional signal processing methods. Additionally, by utilizing dimensionality-invariant convolution operations, our method can process noisy signals of any length. Furthermore, once fully trained, our approach seldom requires retraining when the processed data closely resembles the training set. We incorporate uncertainty quantification to improve interpretability relative to other deep learning methods: alongside the denoised output, the corresponding credibility is provided, which improves the clinical application prospects of deep learning. Extensive experimentation on synthetic datasets demonstrates the remarkable quantitative and qualitative enhancements achieved by UDNet, surpassing current state-of-the-art techniques. Nevertheless, the proposed solution demands substantial computing resources; when computing resources are limited, fewer Monte Carlo samplings can be performed, which limits the ability to quantify uncertainty.

Fig. 1. Visualization of uncertainty maps and reconstruction errors. (a) Denoising results. (b) Uncertainty map. (c) Error maps (i.e., the absolute value of the difference between predicted and true values, |ŷ − y|). Deeper color in the uncertainty map and error map indicates higher uncertainty over the corresponding original noisy signal or higher reconstruction error.

Fig. 2. Overview of the proposed UDNet. UDNet is composed of M Uncertainty-aware Denoising Layers (UDLs), where each layer UDL_m (m = 1, …, M) uses UEM_m to estimate the corresponding uncertainty map U_m, along with an intermediate denoising result Ĵ_m. Additionally, FEM_m modulates HF_{m−1}, generating an enhanced feature EF_m and amplifying uncertain features from HF_{m−1}. Furthermore, a gate unit aggregates HF_{m−1} and EF_m, producing a more reliable and improved representation, HF_m.
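The gate unit that aggregates HF_{m−1} and EF_m is not spelled out in this excerpt; a common choice is a learned convex combination, sketched below in numpy. The gating form, shapes, and weights are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

C, T = 16, 128
HF_prev = rng.standard_normal((C, T))  # representation HF_{m-1} from the previous UDL
EF = rng.standard_normal((C, T))       # enhanced feature EF_m from FEM_m

# Hypothetical gate: project the concatenated inputs, squash to (0, 1),
# then blend the enhanced feature with the previous representation
W_g = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)
g = sigmoid(W_g @ np.concatenate([HF_prev, EF], axis=0))

HF = g * EF + (1.0 - g) * HF_prev      # aggregated representation HF_m
print(HF.shape)                        # (16, 128)
```

Because g stays strictly between 0 and 1, the gate can suppress unreliable enhanced features while still passing the previous representation through, which matches the stated goal of a "more reliable and improved representation."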

Fig. 3. The uncertainty estimation module (UEM) consists of aleatoric and epistemic uncertainty estimation. Aleatoric uncertainty is obtained through a 1-layer convolution and a Sigmoid. Epistemic uncertainty is obtained through Monte Carlo estimation over multiple shared convolutions.

Fig. 4. The feature enhancement module (FEM) consists of multiple multi-scale pooling attention units and a bottleneck layer.

• variant c (+ Model Uncertainty): We add model uncertainty estimation σ²_{m,E} to variant a to form a model-uncertainty-aware temporal convolution network.

Fig. 6. A comparison of the designed variant models to assess the effectiveness of different modules in UDNet.

3) Influence of Denoising on Nonlinear Characteristics of Signals: Figures 7 and 8 show the power spectral density (PSD) of clean EEG, contaminated EEG, and EEG denoised by MMNN, Novel CNN, 1D-ResCNN, DeepSeparator, GRU-MARSC, and our proposed UDNet, for both EOG and EMG noise. As depicted in the two figures, the PSD of the EEG signal noticeably decreases in the specific frequency range of the noise after denoising.

TABLE I
RESULTS OF VALIDATION ON DIFFERENT DATASETS FOR THE PROPOSED UDNET METHOD. ΔSNR AND NMSE VALUES ARE IN dB. THE BEST PERFORMANCE FOR EACH METRIC IS HIGHLIGHTED IN BOLD