An Approach for EEG Denoising Based on Wasserstein Generative Adversarial Network

Electroencephalogram (EEG) recordings often contain artifacts that lower signal quality. Many efforts have been made to eliminate or at least minimize these artifacts, but most of them rely on visual inspection and manual operations, which are time- and labor-consuming, subjective, and unsuitable for filtering massive EEG data in real time. In this paper, we propose a deep learning framework named Artifact Removal Wasserstein Generative Adversarial Network (AR-WGAN), in which the well-trained model can decompose input EEG, detect and delete artifacts, and then reconstruct denoised signals within a short time. The proposed approach was systematically compared with commonly used denoising methods, including Denoised AutoEncoder, Wiener Filter, and Empirical Mode Decomposition, on both public and self-collected datasets. The experimental results demonstrated the promising performance of AR-WGAN on automatic artifact removal for massive data across subjects, with a correlation coefficient up to 0.726±0.033, and temporal and spectral relative root-mean-square errors as low as 0.176±0.046 and 0.761±0.046, respectively. This work suggests that the proposed AR-WGAN is a high-performance end-to-end method for EEG denoising, with many online applications in clinical EEG monitoring and brain-computer interfaces.


I. INTRODUCTION
ELECTROENCEPHALOGRAPHY (EEG) can directly reflect brain activities by recording scalp potential changes, and thus it is widely used in cognitive psychology, neurology, psychiatry, and brain-computer interfaces (BCIs) [1], [2], [3]. In clinical applications, the recorded EEG contains not only useful information about brain activities but also unexpected artifacts such as eye blinks (ocular artifacts), head muscle signals (myogenic artifacts), and heartbeat signals, which usually contaminate EEG in different ways and generate high energy in multiple frequency bands. For example, myogenic and ocular artifacts can introduce large-amplitude spikes and significant drifts into EEG, respectively [4]. These interferences strongly lower the quality of EEG recordings, and therefore denoising is critical for EEG processing and analysis [5], [6], [7]. Traditionally, these artifacts were removed by manual operations and visual inspection [8], which is time- and labor-consuming and inappropriate for online and real-time EEG data processing [9].
Several conventional machine-learning (ML) approaches, including regression-based methods, adaptive filter-based methods, and blind source separation (BSS) methods, have been developed to remove artifacts from raw EEG data. The regression-based methods first estimate the artifacts and then subtract them from the raw data to obtain pure EEG, but these methods are sensitive to outliers, which may yield misleading results [10]. The adaptive filter-based methods dynamically estimate filter parameters based on the EEG input and then remove artifacts. However, they are linear least-squares estimators for stochastic processes that search over different schemes to do the estimation, which makes them relatively slow to converge and computationally expensive [11]. The BSS methods, especially independent component analysis (ICA), decompose EEG into different components corresponding to brain activities and artifacts, respectively, and thereafter the components of brain activities are selected and combined to reconstruct pure EEG. The BSS methods cannot identify artifact-related components automatically, so a pre-trained classifier [12], [13], [14], [15] or manual selection is needed to identify and reject artifactual components. Moreover, the ICA-based methods cannot cope with single-channel data and need auxiliary channels. In addition, they are commonly less effective in processing non-biological artifacts such as sudden impedance changes caused by headset movements [9]. Some studies investigated artifact subspace reconstruction (ASR) [16], [17] for EEG denoising, which is similar to principal component analysis (PCA) in that large-variance components are rejected and purified data are reconstructed from the remaining components. Similarly, ASR also needs auxiliary channels and has relatively high time complexity since it involves the eigenvalue decomposition of the sample covariance.
To address these issues, a fully automated and efficient artifact-removal algorithm that can process large amounts of data in real time is needed.
Deep learning (DL) methods [18] boomed after the introduction of AlexNet in 2012 [19] and have achieved great success in multiple fields, especially computer vision and natural language processing, and have also gained much attention in neural engineering [20], [21], [22]. Given the growing data size and advancing hardware support, more and more researchers have applied DL methods to EEG applications, including motor imagery classification [23], [24], [25], emotion recognition [26], [27], [28], and data augmentation [29], [30]. Recently, DL methods were introduced to EEG artifact removal, where they achieved more favorable performance than the conventional ML-based methods. Zhang et al. [31] provided a publicly available structured dataset and tested four deep neural networks as benchmarks, including a fully-connected neural network, a simple convolutional neural network (CNN), a complex CNN, and a recurrent neural network (RNN). Sun et al. [32] proposed a one-dimensional residual CNN (1D-ResCNN) to remove various physiological artifact signals from EEG. Inspired by the U-Net architecture in image segmentation, Chuang et al. [33] developed a novel model to remove pervasive EEG artifacts and reconstruct brain signals, which implemented an ensemble loss function to model complex signal fluctuations in EEG recordings. Leite et al. [34] presented a 2D deep convolutional autoencoder to filter the noise of eye blinking and jaw clenching. These DL methods avoid time-consuming preprocessing and feature extraction on raw EEG data and directly learn meaningful information, capturing underlying and high-level features. However, they are vulnerable to adversarial samples, and real-world EEG data are prone to corruption by small perturbations [35], [36]. These perturbations may mislead the model and lead to dramatic degradation in the denoising performance.
Adversarial training [37] can address this problem by intrinsically enhancing the robustness of the model, and the Generative Adversarial Network (GAN) is one of the most representative works. GAN is a novel technology for unsupervised learning, which was first used in computer vision for tasks such as generating non-existent images, transferring image styles, and improving image resolution. The initial GAN was proposed in 2014 by Goodfellow et al. [38], who synthesized real-like images by implicitly modeling data in high-dimensional spaces. Usually, a GAN contains two opposing components, i.e., a generator and a discriminator, and both are trained in the learning stage and compete with each other. The generator tries to generate real-like data from random input to fool the discriminator, whereas the discriminator receives both generated and real data and aims to judge which are real. After multiple rounds of competition, the generator can synthesize real-like data that the discriminator cannot tell apart, reaching a so-called Nash equilibrium. As a simple analogy, the generator is a counterfeiter trying to make fake banknotes that escape detection, and the discriminator is a policeman trying to distinguish between fake and real money. Different types of GAN have been proposed in the past few years. The Deep Convolutional GAN (DCGAN) replaces the multilayer perceptrons with convolutional layers [39]; the Conditional GAN (CGAN) provides additional information that helps the discriminator find the conditional probability instead of the joint probability [40]; the CycleGAN provides a way to perform image-to-image translation, such as translating horse images into zebra images [41]; the Wasserstein GAN (WGAN) minimizes an approximation of the Wasserstein distance rather than the Jensen-Shannon divergence of the original GAN formulation, which provides more stable training and could avoid mode collapse [42].
Many applications of GAN have been proposed and realized. Some studies have tracked GAN-based works and implemented GANs for sequence and time-series data generation, imputation, and augmentation [43]; others employed GANs in EEG data augmentation to solve the problem of data scarcity and imbalance, improving accuracies in various classification tasks [29], [30]; and a few works have implemented GANs in EEG denoising. For example, Brophy et al. [44] used a GAN to denoise real-world EEG signals by mapping noisy signals to clean signals according to the nature of the respective artifacts. An et al. [45] proposed a GAN-based denoising method to automatically filter multichannel EEG signals and defined a new loss function to ensure that the filtered signals preserve as much of the effective original information and energy as possible. Nevertheless, most current GAN-based models only remove EMG and EOG individually from EEG and do not remove the combined artifacts. Although some pioneering studies, e.g., Sumiya et al. [46], have proposed a GAN-based noise-reduction model to remove the combined artifacts in EEG, more solid evidence of the improvements in the filtered data should be explored and provided.
To address these issues, we present an EEG denoising approach based on the Wasserstein GAN (WGAN), named Artifact Removal WGAN (AR-WGAN), and evaluate its performance by using a public dataset and a self-collected dataset. This study systematically overviews the WGAN model and describes the algorithm of AR-WGAN. The proposed model was compared with some commonly used filtering algorithms, including Denoised AutoEncoder, Wiener Filter, and Empirical Mode Decomposition, in both the time and time-frequency domains, with evaluation metrics such as power ratios, relative root mean squared error (RRMSE), and correlation coefficient (CC). The robustness and generalization capacity of this work in real implementations are discussed and some outlook is provided. The main contribution of this study is an end-to-end denoising model based on WGAN, which can filter massive raw EEG data within a short time. The approach is robust enough to cope with real-world EEG data with inevitable perturbations, and its fast denoising makes AR-WGAN an automatic and online-capable artifact removal approach.

II. DATASET AND METHODS
A. Methods
1) Generative Adversarial Network: In a basic GAN, the discriminator can be approximately characterized as a function that maps the data to a probability (0 to 1) that the input belongs to the real data. The generator is fixed when the discriminator is being trained to classify the real data (output close to 1) and the fake data from the generator (output close to 0). Similarly, the discriminator is frozen when the generator is optimized, and the generator continues to learn the real data distribution to lower the accuracy of the discriminator.
The critical point of GANs is the probability density or probability mass function of the observation data. GANs are trained by implicitly computing and maximizing the similarity between the probability distributions of the real data and of the data generated by the candidate model. The training of GANs involves optimizing the discriminator parameters to maximize the classification accuracy, and finding the generator parameters that produce fake data that maximally confuse the discriminator. The cost of training can be formularized as

$$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x_r \sim p_r}\left[\log D(x_r)\right] + \mathbb{E}_{x_g \sim p_g}\left[\log\left(1 - D(x_g)\right)\right], \tag{1}$$

where $\theta_d$, $\theta_g$, $x_r$, and $x_g$ are the discriminator parameters, generator parameters, real data, and synthetic data, respectively, and $D(x)$ returns the probability (0 to 1) that $x$ belongs to the real data distribution $p_r$ rather than the generated data distribution $p_g$. Specifically, the input data to the generator are random noises, which can be stated as

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right], \tag{2}$$

where the variables $G$ and $D$ denote the generator and discriminator, respectively, and $z$ is the random noise input. In the training procedure, the parameters of one model are updated, whereas the parameters of the other model are fixed. Goodfellow et al. [38] demonstrated that for a fixed generator, the optimal discriminator is $D(x) = p_{data}(x) / (p_{data}(x) + p_g(x))$. They also proved that the generator is optimal when $p_{data}(x) = p_g(x)$, which is equivalent to the situation in which the probability predicted by an optimal discriminator for any sample is 0.5. In other words, the desired state is that the generator maximally fools the discriminator, preventing any accurate distinction between the real and fake samples.
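The optimality result quoted above can be checked numerically on a toy discrete distribution. The following is a minimal sketch rather than part of the proposed method: the three-outcome distributions `p_data`, `p_g_poor`, and `p_g_matched` are hypothetical, and the value function follows the standard GAN objective of Goodfellow et al.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for every outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_function(p_data, p_g, d):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]."""
    real_term = sum(pd * math.log(dx) for pd, dx in zip(p_data, d) if pd > 0)
    fake_term = sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d) if pg > 0)
    return real_term + fake_term

# Hypothetical toy distributions over three outcomes (illustration only).
p_data = [0.5, 0.3, 0.2]
p_g_poor = [0.2, 0.2, 0.6]      # generator far from the data distribution
p_g_matched = list(p_data)      # generator that matches the data distribution

d_poor = optimal_discriminator(p_data, p_g_poor)
d_matched = optimal_discriminator(p_data, p_g_matched)
v_poor = value_function(p_data, p_g_poor, d_poor)
v_matched = value_function(p_data, p_g_matched, d_matched)

print("D* for matched generator:", d_matched)
print("V at optimum (matched):", v_matched, " -log(4):", -math.log(4.0))
print("V at optimum (poor):", v_poor)
```

When the generator matches the data distribution, the optimal discriminator outputs 0.5 everywhere and the value function reaches its minimum over generators, which is why a mismatched generator yields a larger value.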
Initially, the fully connected layers were used in GAN architectures for both generator and discriminator, and were applied to three simple image data sets of MNIST, CIFAR-10, and Toronto Face Data Set [38]. Arjovsky et al. [42] proposed the WGAN with an alternative cost function which is an approximation of the Wasserstein distance (also known as the Earth-Mover distance). Considering its stability and fast convergence in training, the WGAN was implemented as the basic model in this study.
The adversarial training of conventional GANs can be formularized as minimizing the Jensen-Shannon divergence between the probability distributions of the real and generated data. However, the discontinuity of the Jensen-Shannon divergence makes it hard to provide useful gradients for updating the generator parameters, which is one of the main reasons for the instability of GANs. To solve this problem, the Jensen-Shannon divergence is replaced with the Wasserstein distance in WGAN as

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right], \tag{3}$$

where $\Pi(P_r, P_g)$ denotes all the possible joint distributions of the real data distribution $P_r$ and the generated data distribution $P_g$. The Wasserstein distance is continuous and differentiable almost everywhere, and thus can provide useful gradients to optimize the generator. Since it is difficult to implement (3), the early studies on WGAN normally used the Kantorovich-Rubinstein duality to measure the Wasserstein distance as

$$W(P_r, P_g) = \frac{1}{K} \sup_{\lVert f \rVert_L \le K} \mathbb{E}_{x \sim P_r}\left[f(x)\right] - \mathbb{E}_{x \sim P_g}\left[f(x)\right], \tag{4}$$

where the supremum is taken over the set of $K$-Lipschitz functions $f$. In real implementations, $f$ is replaced by the discriminator $D$ and $\lVert f \rVert_L \le K$ is replaced by $\lVert D \rVert_L \le 1$, and the final loss function can be formularized as

$$L = \max_{\lVert D \rVert_L \le 1} \mathbb{E}_{x \sim P_r}\left[D(x)\right] - \mathbb{E}_{x \sim P_g}\left[D(x)\right]. \tag{5}$$

In GAN applications, because convolutional layers are powerful feature extractors with a relatively low computational burden compared with fully connected layers, many EEG-relevant works concentrate on CNNs. Generally, most studies utilized 2D convolutional layers in GAN-based models to extract EEG features. Some of them treated 64 EEG channels as an 8 × 8 map; however, the whole model could not contain many convolutional layers and failed to extract deep latent features in EEG. Some other studies directly reshaped the EEG time series into an image format, but this inappropriate operation may cause irreversible damage to local features.
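The difference between the two measures can be illustrated with the classic example of two point masses, one at 0 and one at a shift θ. The sketch below is an illustration under simplifying assumptions: the helper names are hypothetical, and the Wasserstein distance is computed with the closed form for equal-size one-dimensional samples (mean absolute difference of the sorted samples) rather than with a trained discriminator.

```python
import math

def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl(a):
        # KL(a || m); m[k] > 0 whenever a[k] > 0.
        return sum(a[k] * math.log(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

shifts = [2.0, 1.0, 0.5, 0.0]
real = [0.0, 0.0, 0.0, 0.0]                      # point mass at 0
w_values = [wasserstein_1d(real, [theta] * 4) for theta in shifts]
js_values = [js_divergence({0.0: 1.0}, {theta: 1.0}) for theta in shifts]

for theta, w, js in zip(shifts, w_values, js_values):
    print(f"theta={theta}: W={w:.3f}, JS={js:.3f}")
```

As the shift shrinks, the Wasserstein distance decreases smoothly toward zero, whereas the Jensen-Shannon divergence stays at log 2 until the two distributions coincide, so it offers no gradient that points the generator in the right direction.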
Considering that the 1D convolutional layer is well suited for EEG time series, in this work, 1D convolutional and transposed convolutional layers were implemented in the generator, and 1D convolutional layers were used in the discriminator. The basic blocks and architecture of both the generator and discriminator are illustrated in Fig. 1, and the detailed input/output data are shown in TABLEs I and II. The training process can be seen in Fig. 2, where the contaminated EEG data y are input to the generator, and then decomposed and reconstructed by the generator to output the generated data G(y). Thereafter, the generated data G(y) and the real data are input together into the discriminator, which gives the probability, ranging from 0 to 1, that the input belongs to the real data. The discriminator can be seen as a binary classifier, and training the discriminator strengthens its ability to distinguish contaminated signals, forcing the generator to remove artifacts and generate filtered data as perfectly as possible. It is worth noting that the output G(y) from the generator is the input to the discriminator, and the discriminator outputs for the denoised data G(y) and the initial data n are D(G(y)) and D(n), respectively. Algorithm 1 summarizes the procedure and default parameters for AR-WGAN.
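The way a strided 1D convolution shortens a sequence and a transposed convolution restores its length can be sketched in plain Python. This is only an illustration: the kernel weights, kernel size of 4, stride of 2, and 16-sample input are hypothetical and do not correspond to the layer settings in TABLEs I and II.

```python
def conv1d(x, kernel, stride=1):
    """Valid (no padding) 1D cross-correlation of x with kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]

def conv_transpose1d(x, kernel, stride=1):
    """1D transposed convolution: every input sample scatters a scaled kernel."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j in range(k):
            out[i * stride + j] += value * kernel[j]
    return out

# Hypothetical 16-sample segment and kernel weights (illustration only).
signal = [float(i % 5) for i in range(16)]
kernel = [0.25, 0.25, 0.25, 0.25]

encoded = conv1d(signal, kernel, stride=2)            # length (16 - 4) / 2 + 1 = 7
decoded = conv_transpose1d(encoded, kernel, stride=2)  # length (7 - 1) * 2 + 4 = 16

print(len(signal), len(encoded), len(decoded))
```

With matching kernel size and stride, the transposed convolution maps the shortened feature sequence back to the input length, which is what allows an encoder-decoder style generator to output a signal of the same length as the contaminated input.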
2) Denoised AutoEncoder: The Autoencoder (AE) is one of the generative and unsupervised learning models in the deep learning field with an encoder-decoder architecture and is commonly used for the task of representation learning. Deep AEs can learn high-order statistical information from the input data and are usually built symmetrically with respect to their dimensionality. Typical AEs mainly contain three parts, i.e., the encoder, the bottleneck, and the decoder. The encoder is a set of blocks that compress the input data into the bottleneck. The bottleneck restricts the flow of information from the encoder to the decoder, allowing only the most important information to pass through. The bottleneck can also be seen as a compressed representation of the input data in a lower-dimensional space. Finally, the decoder serves as a decompressor to reconstruct the data from the bottleneck's output.

Fig. 2. The training process of AR-WGAN for denoising, where the generator received contaminated data and output the denoised data, and the discriminator received both initial and denoised data. After multiple rounds of adversarial training, the generator could remove artifacts in the contaminated data quite well.
Inspired by the work of Leite et al. [34], we designed a Denoised AutoEncoder (DAE) to remove artifacts from EEG data and compared the results with the cleaned EEG obtained by the AR-WGAN. We implemented 1D convolutional layers with 1D max-pooling layers in the encoder, and 1D convolutional layers followed by 1D upsampling layers in the decoder. The 1D max-pooling layers group neighboring neurons and use the maximum value to represent each group, reducing the dimensionality after each convolutional operation. The 1D upsampling layers in the decoder restore the dimensionality to compensate for the effect of the 1D max-pooling. To prevent possible overfitting, we implemented dropout regularization in every convolutional block.
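The effect of the pooling and upsampling pair can be sketched on a short sequence. The snippet below is a simplified illustration, assuming a pool size of 2 and nearest-neighbor upsampling; the upsampling mode and the sample values are assumptions, since they are not specified in the text.

```python
def max_pool1d(x, size=2):
    """Non-overlapping 1D max pooling: keep the maximum of each window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def upsample1d(x, factor=2):
    """Nearest-neighbor 1D upsampling: repeat every sample `factor` times."""
    return [value for value in x for _ in range(factor)]

# Hypothetical feature sequence (illustration only).
features = [0.1, 0.9, 0.4, 0.3, 0.8, 0.2, 0.0, 0.5]

pooled = max_pool1d(features)      # half the length
restored = upsample1d(pooled)      # original length, but details are lost

print(pooled)
print(restored)
```

The restored sequence has the original length, but every pair of samples now shares the window maximum, so the values discarded by pooling are not recovered by upsampling alone and must be re-estimated by the subsequent convolutional layers.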
3) Wiener Filter: The Wiener filter (WF) is a noise removal algorithm based on statistical approaches, which also served as a comparison for the method proposed in this work. The main idea of WF is to minimize the overall mean squared error, i.e., the average squared difference between the filtered signal and the pure signal [47]. The Wiener filter produces an estimate of the filtered signal, i.e., $\hat{d} = W^{T} Y$.
The final optimization goal can be represented by a minimum mean squared error (MMSE) as

$$W_{\mathrm{MMSE}} = \arg\min_{W} \mathbb{E}\left[\left(d - W^{T} Y\right)^{2}\right], \tag{6}$$

where $d$ is the desired signal estimated by $\hat{d}$.
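The MMSE criterion can be illustrated with the simplest possible case, a single filter coefficient. The sketch below is a deliberately reduced example and not the filter used in the comparison: it assumes a scalar weight w applied sample by sample, for which the least-squares solution is w = Σ d·Y / Σ Y², and it uses a hypothetical sinusoid with additive Gaussian noise as test data.

```python
import math
import random

def wiener_gain(desired, observed):
    """Single-coefficient least-squares solution: w = sum(d*y) / sum(y*y)."""
    numerator = sum(d * y for d, y in zip(desired, observed))
    denominator = sum(y * y for y in observed)
    return numerator / denominator

def mse(desired, estimate):
    """Mean squared error between the desired signal and an estimate."""
    return sum((d - e) ** 2 for d, e in zip(desired, estimate)) / len(desired)

# Hypothetical data: a sinusoid observed in additive Gaussian noise.
rng = random.Random(0)
desired = [math.sin(0.1 * i) for i in range(500)]
observed = [d + rng.gauss(0.0, 0.5) for d in desired]

w = wiener_gain(desired, observed)
estimate = [w * y for y in observed]

print("gain:", w)
print("MSE before:", mse(desired, observed), "MSE after:", mse(desired, estimate))
```

The optimal gain lies between 0 and 1 and shrinks the observation toward zero in proportion to the noise power. This lowers the mean squared error but also attenuates genuine signal peaks.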

4) Empirical Mode Decomposition: The Empirical Mode Decomposition (EMD) method is a data-driven and adaptive method, which breaks down signals into several components without leaving the time domain. Unlike the Fourier transform, the EMD method does not require signals to be stationary over time. The main idea of the EMD method is to decompose the nonstationary and nonlinear signal x(t) into intrinsic mode functions (IMFs). Each IMF is assumed to capture a meaningful local frequency, and different IMFs do not exhibit the same frequency at the same time [48]. Considering the frequency characteristics of EEG and the aim of artifact removal, in practical applications, the high-frequency IMFs are deleted, and the remaining IMFs and the residue are summed to reconstruct the denoised EEG data. In this paper, we adopted the EMD method proposed by Bono et al. [49] to filter EEG and compared its performance with our proposed method.
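The reconstruction step described above (discard the high-frequency IMFs, then sum the remaining IMFs and the residue) can be sketched as follows. This is only an illustration of that final step: the sifting procedure that produces the IMFs is omitted, the two "IMFs" and the residue are hand-made components rather than the output of an actual EMD, and `n_drop` is a hypothetical parameter.

```python
import math

def reconstruct_from_imfs(imfs, residue, n_drop):
    """Sum the residue and all IMFs except the first n_drop ones.

    IMFs are assumed to be ordered from highest to lowest frequency,
    which is the order in which EMD sifting normally extracts them.
    """
    kept = imfs[n_drop:]
    return [residue[t] + sum(imf[t] for imf in kept) for t in range(len(residue))]

# Hand-made stand-ins for IMFs (NOT produced by an actual EMD).
fs = 256
time = [i / fs for i in range(fs)]
fast = [0.5 * math.sin(2 * math.pi * 60 * s) for s in time]   # high-frequency mode
slow = [math.sin(2 * math.pi * 6 * s) for s in time]          # low-frequency mode
trend = [0.2 * s for s in time]                               # residue

imfs = [fast, slow]
full = reconstruct_from_imfs(imfs, trend, n_drop=0)       # original signal
denoised = reconstruct_from_imfs(imfs, trend, n_drop=1)   # fast mode removed
```

Dropping more IMFs removes more high-frequency content. This is also why genuine high-frequency EEG details can be lost along with the artifacts.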

B. Datasets
In this paper, we used a public dataset (EEGdenoiseNet Benchmark Dataset) and a self-collected dataset to evaluate the performance of the proposed approach. The EEGdenoiseNet is a semi-synthetic EEG dataset released by Zhang et al. [31], which contains 4514 clean EEG segments, 3400 ocular artifact segments, and 5598 muscular artifact segments. In the EEGdenoiseNet, the EEG, EOG, and EMG data were acquired from several public and open-access datasets, and then preprocessed by segmentation, filtering, and resampling to a frequency of 256 Hz. For the self-collected dataset, a total of four healthy subjects were recruited, where 64-channel raw EEG data were acquired from three of them (Neuroscan EEG Acquisition System, USA), and EMG (Acquisition System of Neuromuscular Electrophysiological Signals, NES-64B01, SIAT, CAS, China) and EOG (Neuroscan EEG Acquisition System, USA) artifacts were acquired from the remaining one. Before the superimposition of artifacts onto the EEG, the self-collected raw EEG data were filtered, and their initial artifacts were manually removed. A health examination before the experiments showed that all subjects were in a good mental state and qualified to participate in the whole experimental procedure. The experimental protocol of this study was approved by the Institutional Review Board of Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (IRB Number: SIAT-IRB-190315-H0325). All subjects agreed to participate in the study and signed informed consent and permission for the publication of data for scientific and educational purposes.
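The contaminated EEG can be seen as a linear combination of pure EEG and bio-signal artifacts, which can be written as

$$y = n + \lambda \cdot d, \tag{7}$$

where $y$ denotes the contaminated EEG, $n$ denotes the pure EEG, $d$ denotes the ocular or myogenic artifacts, and $\lambda$ is the parameter that controls the signal-to-noise ratio (SNR) in the mixed signal $y$. The SNR can be written as

$$\mathrm{SNR} = 10 \log_{10} \frac{\mathrm{RMS}(n)}{\mathrm{RMS}(\lambda \cdot d)}, \tag{8}$$

where RMS is the root mean square defined as

$$\mathrm{RMS}(d) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^{2}}, \tag{9}$$

where $N$ denotes the number of data segments and $d_i$ denotes the $i$th data segment.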
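The mixing procedure can be sketched as follows. The snippet is a minimal illustration under a stated assumption: it takes the SNR to be 10·log10(RMS(n) / RMS(λ·d)), the convention used by the EEGdenoiseNet benchmark, and solves it for λ. The synthetic `clean` and `artifact` signals and the function names are hypothetical.

```python
import math

def rms(x):
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mixing_weight(clean, artifact, target_snr_db):
    """Lambda such that 10*log10(RMS(clean) / RMS(lambda*artifact)) equals the target.

    Assumes the EEGdenoiseNet SNR convention (10*log10 of the RMS ratio).
    """
    return rms(clean) / (rms(artifact) * 10.0 ** (target_snr_db / 10.0))

def contaminate(clean, artifact, target_snr_db):
    """Return y = n + lambda * d together with the lambda that was used."""
    lam = mixing_weight(clean, artifact, target_snr_db)
    return [n + lam * d for n, d in zip(clean, artifact)], lam

def achieved_snr_db(clean, scaled_artifact):
    """SNR in dB between the clean signal and the already scaled artifact."""
    return 10.0 * math.log10(rms(clean) / rms(scaled_artifact))

# Hypothetical signals: a 10 Hz "EEG" rhythm and a slow "ocular" drift at 256 Hz.
fs = 256
clean = [math.sin(2 * math.pi * 10 * i / fs) for i in range(2 * fs)]
artifact = [3.0 * math.sin(2 * math.pi * 0.5 * i / fs) for i in range(2 * fs)]

noisy, lam = contaminate(clean, artifact, target_snr_db=-4.0)
print("lambda:", lam)
print("achieved SNR (dB):", achieved_snr_db(clean, [lam * d for d in artifact]))
```

A negative target SNR yields a scaled artifact whose RMS exceeds that of the clean EEG, as in heavily contaminated recordings.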

C. Preprocessing
Considering the public dataset has different data lengths for EEG, EMG, and EOG, we set the shortest length of 3400 among the three signals as the data length for processing. The self-collected data were resampled with a frequency of 256 Hz, which was the same as the frequency of the public data, and the data length was segmented as 3400, as well.
For the self-collected dataset, the EEG from the three subjects was randomly mixed to evaluate the cross-subject performance of our proposed model, and the data size was kept consistent with that of the public dataset, which was 512 × 3400 for all the EEG, EMG, and EOG. Besides, all the public and self-collected data were normalized to avoid possible exploding and vanishing gradients during training, and EMG and EOG artifacts were artificially added to pure EEG according to (7). Then, 448 and 64 samples were set as the training and testing data, respectively, to evaluate the generalization and robustness of the models. To better simulate the real situation, the original and artifact data were mixed with the same amplitude, i.e., the amplitudes of the added EOG ($A_{EOG}$) and EMG ($A_{EMG}$) artifacts were the same as that of the EEG ($A_{EEG}$). If only one type of artifact was superimposed, its amplitude ($A_{EOG}$ or $A_{EMG}$) was doubled to ensure that the total amplitude ($A_{total}$) equals three times $A_{EEG}$.

D. Evaluation Metrics
To quantitatively assess the performance of the models, some metrics commonly used in benchmark evaluations were adopted in this work, including the power ratios, the relative root mean squared error in the temporal ($\mathrm{RRMSE}_t$) and spectral ($\mathrm{RRMSE}_s$) domains, and the temporal CC, which are expressed as

$$\mathrm{RRMSE}_t = \frac{\mathrm{RMS}(g(y) - n)}{\mathrm{RMS}(n)},$$

$$\mathrm{RRMSE}_s = \frac{\mathrm{RMS}(\mathrm{PSD}(g(y)) - \mathrm{PSD}(n))}{\mathrm{RMS}(\mathrm{PSD}(n))},$$

and

$$\mathrm{CC} = \frac{\mathrm{Cov}(g(y), n)}{\sqrt{\mathrm{var}(g(y)) \, \mathrm{var}(n)}},$$

where $y$ is the contaminated data, $n$ is the initial data, $g(y)$ denotes the denoised data passed through the model, RMS is the root mean square defined in (9), PSD is the power spectral density of the input data, and var and Cov denote the variance and covariance, respectively.

Algorithm 1. The procedure and default parameters for AR-WGAN.
Require: The number of discriminator iterations per generator iteration n_disc, the batch size m = 5, Adam hyper-parameters α, β1, β2, the clipping parameter c.
Require: Initial discriminator parameters w_0, initial generator parameters θ_0.
Input: Initial data n to the discriminator, contaminated data y to the generator.
1: while θ has not converged do
2:   for t = 1, . . . , n_disc do
3:     sample {x^(i)}, i = 1, . . . , m, ∼ P_n, a batch from the initial data.
4:     sample {x^(i)}, i = 1, . . . , m, ∼ p(z), a batch of prior samples.
5:     …
6:     w ← Adam(w, L^(i), α, β1, β2)
7:     w ← clip(w, −c, c)
8:   end for
9:   sample a batch of latent variables …
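The temporal RRMSE, spectral RRMSE, and CC described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the PSD is estimated with a simple one-sided periodogram computed by a direct DFT (the PSD estimator used in the experiments is not specified here), and the test signals are hypothetical.

```python
import cmath
import math

def rms(x):
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rrmse(estimate, reference):
    """Relative RMSE: RMS(estimate - reference) / RMS(reference)."""
    return rms([e - r for e, r in zip(estimate, reference)]) / rms(reference)

def periodogram(x):
    """One-sided periodogram |DFT|^2 / N, used here as a simple PSD estimate."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]

def correlation(a, b):
    """Pearson correlation coefficient Cov(a, b) / sqrt(var(a) * var(b))."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    var_a = sum((x - mean_a) ** 2 for x in a) / len(a)
    var_b = sum((y - mean_b) ** 2 for y in b) / len(b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical initial signal n and a denoised output g(y) at half the amplitude.
initial = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
denoised = [0.5 * v for v in initial]

rrmse_t = rrmse(denoised, initial)
rrmse_s = rrmse(periodogram(denoised), periodogram(initial))
cc = correlation(denoised, initial)
print(rrmse_t, rrmse_s, cc)
```

The example shows why the metrics are reported together: an output with the correct shape but half the amplitude still has a CC of 1, while the temporal and spectral RRMSE expose the amplitude and power mismatch.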

III. RESULTS
Figs. 3 and 4 illustrate a comparison of denoising performance among different methods on the public and self-collected datasets in the time domain, respectively, where "initial" denotes the original data from either the public database or the self-collected signals, "contaminated" means the data that were first denoised and then manually superimposed with EMG and EOG artifacts, and "denoised" represents the final data after denoising. It should be noted that for clearer visualization, offsets were artificially introduced into the denoised data, and thus the displayed amplitudes do not reflect the true values. As can be seen, the proposed AR-WGAN shows better denoising performance than the other three methods for both the public and self-collected data: it properly removes the artifacts, and its output closely matches the initial data. Even though the other three methods can also filter the artifacts, some important and detailed time-domain features are not preserved. For instance, some spikes contained in the initial data are lost and the denoised data are overly smooth. In particular, DAE fails to preserve most features and its output is almost untrustworthy, even though its loss curve shows that the model has converged; WF merely smooths the contaminated data, and some obvious artifact spikes still exist; although EMD performs slightly better than DAE and WF, some obvious artifacts still remain in the denoised data.
Figs. 5 and 6 show the time-frequency analysis of the denoising performance for different methods on the public and self-collected datasets, respectively, which reflects the energy changes after the removal of artifacts. Because the artifacts were added artificially, the energy of the contaminated data is much higher than that of the initial data. As shown in the figures, AR-WGAN can effectively eliminate the artifacts and reduce the energy in the high-frequency bands that contain most artifacts; the data processed by DAE and WF lose much information in both the time and frequency domains and could not be used in subsequent analysis; the data processed by EMD are much better in the time domain than those processed by DAE and WF, but they are still noisier than the output of AR-WGAN.
Figs. 7 and 8 summarize the calculated temporal RRMSE, spectral RRMSE, and temporal CC to visualize the performance of different methods on the public and self-collected datasets, respectively. In general, AR-WGAN exhibits the best performance among all the methods. For the public dataset, AR-WGAN has the lowest temporal and spectral RRMSE and the highest CC with a limited standard deviation. For the self-collected dataset, AR-WGAN still has the lowest temporal and spectral RRMSE. Although its CC is similar to those of DAE and EMD, its standard deviation is smaller.
TABLEs III and IV summarize the power ratios in different EEG frequency bands for different methods on the public and self-collected datasets, respectively, where EEG signals are divided into five frequency bands, i.e., delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz), beta (13-30 Hz), and gamma. The mean SNRs across all the contaminated data from the public and self-collected datasets were calculated as −4 and −5 dB, respectively, and power ratios are used to evaluate the signal reconstruction quality under these computed mean SNRs. As can be observed from the tables, the signals denoised by AR-WGAN generally have power ratios in the different frequency bands that are relatively closer to those of the initial data, especially in the delta and theta bands, which are normally of interest. This result means that the proposed method can reconstruct signal powers well, while the other methods cannot properly filter the contaminated data and alter the power ratios of the processed data.
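A band power ratio of this kind can be sketched as follows. The snippet is an illustration under stated assumptions: the power ratio is taken to be the band power divided by the total non-DC power of a simple periodogram, band edges are treated as half-open intervals, and only the four bands whose ranges are given above are included. The exact definition used for the tables may differ, and the test signal is hypothetical.

```python
import cmath
import math

# Band edges in Hz as stated in the text; each band is treated as [low, high).
BANDS_HZ = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def periodogram(x, fs):
    """Return (frequencies, power) of a one-sided periodogram via a direct DFT."""
    n = len(x)
    freqs = [k * fs / n for k in range(n // 2 + 1)]
    power = [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
             for k in range(n // 2 + 1)]
    return freqs, power

def band_power_ratios(x, fs, bands=BANDS_HZ):
    """Fraction of the total non-DC power that falls inside each band (assumed definition)."""
    freqs, power = periodogram(x, fs)
    total = sum(p for f, p in zip(freqs, power) if f > 0)
    return {name: sum(p for f, p in zip(freqs, power) if low <= f < high) / total
            for name, (low, high) in bands.items()}

# Hypothetical one-second segment containing a pure 10 Hz (alpha) rhythm at 256 Hz.
fs = 256
alpha_wave = [math.sin(2 * math.pi * 10 * i / fs) for i in range(fs)]
ratios = band_power_ratios(alpha_wave, fs)
print(ratios)
```

Comparing such ratios between the denoised and initial signals indicates whether a method has shifted power between bands, which a single time-domain error figure would not reveal.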

IV. DISCUSSION
EEG provides a non-invasive way to reflect brain waves and activities with high fidelity, which can be used in BCI for real-time control of external devices and interaction with environments. However, EEG signals recorded on the scalp are inevitably contaminated by other bio-signals such as EMG and EOG, and most of the existing filtering methods are time- and labor-consuming. Many traditional ML-based denoising methods have been studied and implemented, such as the Wiener Filter, EMD-based methods, and ICA-based methods. These approaches require pre-defined parameters, visual inspection, and manual operations, which are time-consuming, laborious, subjective, and inappropriate for online and real-time implementations. Besides, the experiments in this work showed that transient, large-amplitude artifacts cannot be fully removed by these traditional methods. Many works tried to fix these problems with DL-based models, but these models are not robust enough when encountering small perturbations. Thus, an automatic, fast, and robust filtering algorithm that can support massive data processing is needed. In this work, we proposed an artifact removal method called AR-WGAN, which can effectively remove artifacts and extract useful EEG features, and its denoising performance has been evaluated through comparisons with some commonly used methods including DAE, WF, and EMD. We used a dataset from a public databank and a dataset from our lab to confirm the performance of the proposed model. In addition, to simulate realistic scenarios, both EMG and EOG artifacts were mixed into the EEG for signal denoising.
The key concern is how much information of the initial data can be retained in the denoised signal. Hence, in this work we analyzed the data in both the time and frequency domains and calculated RRMSE and CC to evaluate the performance of different denoising methods, where RRMSE measures the degree of information retention of the initial data and CC reflects the similarity between the denoised and initial data. The results on both the public and self-collected datasets suggest that our proposed AR-WGAN is superior to the reference methods in removing artifacts while preserving EEG features as much as possible, and may serve as a multipurpose and universal tool for extracting EEG from mixed signals.
The well-trained generator in AR-WGAN behaves like a non-linear function that implements the decomposition and the subsequent reconstruction of contaminated signals. The 1D convolutional operations capture the latent features that contain artifactual components, and the 1D transposed convolutional operations reconstruct these components under the supervision of the discriminator, trying to minimize the difference between the original and reconstructed data. Considering that deep convolutional layers were adopted by the model and some important features might deteriorate in the data flow, we implemented residual connections to ensure that the data quality would not be damaged by the convolutions. We tried to go deeper with the transposed convolutional layers in both the generator and discriminator, but the performance was not significantly improved, which suggests that the existing convolutional blocks are sufficient to extract EEG features. Besides, we found that adding 1D max-pooling layers could significantly improve the quality of the generated data. The intuitive interpretation of max pooling is that the network concentrates on some particular features of the initial EEG while the number of network parameters is decreased, and thus the computational load and the probability of overfitting are reduced and the performance is improved.
Theoretically, DAE contains an encoder and a decoder to decompose data, extract features, and reconstruct data, which corresponds to the generator of AR-WGAN. However, without adversarial training, DAE cannot learn to generalize and extract features efficiently, and its upsampling operation may not reconstruct the data accurately, resulting in low-quality output. WF aims to minimize the mean squared error between the estimated and desired signals, which preserves overall trends and smooths signals in the time domain, but important features are usually lost and large artifact drifts are not removed. EMD can decompose data into several IMFs and a residual component, but it is sensitive to noise and suffers from mode mixing.
Compared with other commonly used artifact removal algorithms, which rely on prior knowledge and manual removal of artifactual components, AR-WGAN achieves better results on arbitrary EMG and EOG artifacts. Once the model is well-trained, processing is fast and the model can filter massive data automatically. In addition, the quality of the data filtered by AR-WGAN can be assessed by the discriminator, which informs the generator whether the filtered data are still noisy. The training process is terminated when the filtered data are good enough. By comparison, other DL-based filtering methods simply remove artifacts, and the quality of the filtering model cannot be judged directly. Moreover, the results from the self-collected dataset show that AR-WGAN can cope with data from different subjects and perform well, meaning that the model may have a certain cross-subject ability and can cope with EEG data that it has never seen. Most importantly, EEG data in real applications are much more complicated, containing perturbations and large-amplitude artifacts caused by headset motions, which simple DL models can hardly handle but AR-WGAN can.
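The role of the discriminator as a quality monitor can be sketched with the standard Wasserstein GAN objectives, where the discriminator (critic) outputs an unbounded score per sample. The stopping rule below is a hypothetical illustration: the `tolerance` and `patience` parameters and the function `should_stop` are assumptions, not the criterion used in this work. The estimate is also only meaningful when the critic is kept approximately Lipschitz, for example by weight clipping or a gradient penalty, which this sketch does not implement.

```python
def mean(values):
    return sum(values) / len(values)

def critic_loss(scores_clean, scores_filtered):
    """Standard WGAN critic loss: minimized when clean data score higher."""
    return mean(scores_filtered) - mean(scores_clean)

def generator_loss(scores_filtered):
    """Standard WGAN generator loss: minimized when filtered data score high."""
    return -mean(scores_filtered)

def wasserstein_estimate(scores_clean, scores_filtered):
    """Critic-based estimate of the distance between clean and filtered data."""
    return mean(scores_clean) - mean(scores_filtered)

def should_stop(history, tolerance=0.05, patience=3):
    """Hypothetical rule: stop once the estimate stays small for `patience` checks."""
    recent = history[-patience:]
    return len(recent) == patience and all(abs(w) < tolerance for w in recent)

# Assumed critic scores for one batch of clean and filtered EEG segments.
scores_clean = [1.0, 3.0]
scores_filtered = [0.0, 1.0]

print(critic_loss(scores_clean, scores_filtered))
print(generator_loss(scores_filtered))
print(wasserstein_estimate(scores_clean, scores_filtered))
print(should_stop([0.5, 0.04, 0.03, 0.01]))
```

A small estimate indicates that the critic can no longer separate filtered segments from clean ones, which is one way to read the statement that the discriminator informs the generator whether the filtered data are still noisy.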
Even though AR-WGAN is effective in removing large-amplitude artifacts and handling inevitable perturbations, it is worth noting that there is room for several improvements to the proposed method. The model tends to excessively lower the energy in the low-frequency band, although it has a strong capability to remove EMG and EOG artifacts, which are mostly in the high- and low-frequency bands, respectively. The initial EEG data still contain some energy in the band of 0-20 Hz, and it is challenging for the current model to accurately remove extra noise without changing the original energy, which is a common issue as reported in other work [34]. In addition, this work only focuses on the information in the time and frequency domains, while the spatial information between EEG channels should be considered in future work. Furthermore, the proposed method still needs some manually processed data to train the model. In applications, classical methods such as independent component analysis (ICA) must first be applied to process the EEG data, and the purified and raw data are then input to the discriminator and generator, respectively, to train the model. It would therefore be worthwhile to design an unsupervised filtering network that can improve itself during training without annotated data.
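The low-frequency energy loss described above can be checked by comparing the energy in the 0-20 Hz band before and after filtering. The following is a minimal sketch using a direct discrete Fourier transform. The function `band_energy`, the 256 Hz sampling rate, and the synthetic two-tone signal are assumptions for illustration and are not taken from the datasets used in this work.

```python
import cmath
import math

def band_energy(signal, fs, f_low, f_high):
    """Sum of squared DFT magnitudes for one-sided bins within [f_low, f_high] Hz."""
    n = len(signal)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_low <= freq <= f_high:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)
            )
            energy += abs(coeff) ** 2
    return energy

fs = 256  # assumed sampling rate in Hz
n = 256   # one second of data

# Synthetic signal: a 10 Hz component plus a weaker 40 Hz component.
signal = [
    math.sin(2 * math.pi * 10 * t / fs) + 0.5 * math.sin(2 * math.pi * 40 * t / fs)
    for t in range(n)
]

low = band_energy(signal, fs, 0.0, 20.0)
total = band_energy(signal, fs, 0.0, fs / 2)
low_ratio = low / total

print("fraction of energy in 0-20 Hz:", low_ratio)
```

Computing the same ratio for the initial and the denoised signal gives a simple indicator of how much low-frequency energy a filter has removed. The direct transform is quadratic in the signal length, so an FFT routine would be preferable for long recordings.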

V. CONCLUSION
In this work, we proposed a denoising network called Artifact Removal Wasserstein Generative Adversarial Network (AR-WGAN) and compared its performance with other commonly used denoising methods on two datasets. The experimental results demonstrated that the proposed method outperforms the reference methods on EEG denoising. Additionally, this framework can extract EEG features from raw data with any number of channels and can also be extended to other signal-filtering tasks.