Automatic Sleep Scoring Using Intrinsic Mode Based on Interpretable Deep Neural Networks

Sleep experts manually label sleep stages via polysomnography (PSG) to diagnose sleep disorders. However, this process is time-consuming, requires a lot of labor from sleep experts, and makes the participants uncomfortable with the attachment of multiple sensors. Thus, automatic sleep scoring methods are essential for practical sleep monitoring in our daily lives. In this study, we propose an automatic sleep scoring model based on intrinsic oscillations in a single channel electroencephalogram (EEG) signal. We applied noise assisted bivariate empirical mode decomposition (NA-BEMD) to extract the intrinsic mode components and an attention mechanism in deep neural networks to provide weights to the components depending on their significance to sleep scoring. In particular, through the attention mechanism, we found an interpretable model by examining the oscillations that correspond to specific sleep stages. Therefore, we analyzed which frequency components are more weighted to a sleep stage than the others, when the model classifies sleep stages, and, as a result, confirmed that the model assigns convincing weights to the frequency components for each sleep stage. Additionally, the model consists of a one-dimensional convolutional neural network (1D-CNN) to extract features of an epoch and bidirectional long short-term memory (Bi-LSTM) to learn the sequential information of the consecutive epochs. We evaluated proposed model using Fpz-Cz, Pz-Oz, and F3-M2 channel EEG from three different public datasets (Sleep-EDF-2013, Sleep-EDF-2018, WSC) and demonstrated that our model yielded the best overall accuracy (Fpz-Cz: 86.22%-82.67%, Pz-Oz: 83.63%-80.15%, F3-M2: 84.20%) and macro F1-score (Fpz-Cz: 80.79%-76.90%, Pz-Oz: 76.89%-72.98%, F3-M2: 74.88%) compared with the state-of-the-art sleep scoring algorithms using single channel EEG. As a benchmark test, FIR bandpass filters were compared, and it was confirmed that NA-BEMD was superior to the traditional filters in all experiments, demonstrating that the proposed model is interpretable and a state-of-the-art sleep scoring algorithm.


I. INTRODUCTION
Sleep is a crucial factor to maintain a healthy life for human beings. Deterioration of sleep quality causes various sleep disorders such as insomnia, sleep apnea, and parasomnia [1]. Numerous people suffer from sleep disorders, which are diagnosed by sleep experts who conduct a multi-parametric test called polysomnography (PSG) [2]. The classification The associate editor coordinating the review of this manuscript and approving it for publication was Kin Fong Lei . of multiple sleep stages, called sleep scoring or sleep stage scoring, is carried out through PSG. Sleep experts determine sleep quality by analyzing the distribution of each stage across all night sleep stages. PSG monitoring measures physiological signals including electroencephalogram (EEG), electrocardiogram (ECG), electrooculogram (EOG), electromyography (EMG). Based on the study of Rechtschaffen and Kales (R & K) [3] or the American Academy of Sleep Medicine (AASM) [4], sleep stages are decided over 30-second intervals called epochs, by sleep VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ experts. In the R&K standard, PSG signals are classified into six stages: wakefulness (Wake), non-rapid eye movement (NREM) stages (N1-N4), and rapid eye movement (REM). The AASM standard combines the N3 and N4 stages into slow-wave sleep (SWS) stages or N3 stages. In this study, we use the AASM standard to classify the five sleep stages. Manual sleep scoring by sleep experts is time-consuming, requires a lot of labor, and makes participants feel uncomfortable due to a number of sensors attached to them during sleep. In addition, different sleep scoring results are produced by different sleep experts, even during re-scoring by the same expert [5]. To solve these issues associated with manual sleep scoring, automatic sleep scoring algorithms have been developed using a single-channel EEG. In contrast to the manual approach, automatic sleep scoring is consistent and reduces the scoring time, labor, and cost [6].
Several state-of-the-art machine learning and deep learning algorithms for the analysis of biosignals have been suggested for automatic health diagnosis and monitoring systems. In [7],the authors proposed the prediction of stroke using the random forest algorithm based on ECG signals. In addition, a health monitoring system for the stroke prognostics using SVM is proposed by [8]. In [9], convolutional neural network (CNN) [10] was proposed for the seperation of the motor imaginary responses using EEG signals. Moreover, Senyurek et al. [11] analyzed respiratory signals for the puff detection based on CNN and long short term memory (LSTM) [12].
Early studies on automatic sleep scoring extracted time-frequency features from an epoch of a single channel EEG and employed machine learning algorithms to classify the sleep stages. For example, Oropesa et al. [13] used wavelet packet transformation (WPT) to provide localized time-frequency components and estimate the sleep stages using an artificial neural network (ANN). A continuous wavelet transform (CWT) was utilized by Fraiwan et al. [14] for their time-frequency component features obtained from EEG signals, which were classified using linear discriminant analysis (LDA). Furthermore, Fraiwan et al. [15] applied Choi-Williams distribution (CWD), Hilbert-Huang transform (HHT), and the CWT for instantaneous feature extraction, which was classified using a random forest algorithm. In addition, Sharma et al. [16] decomposed EEG signals into amplitude and frequency-modulated components using an iterative filtering method to extract the sleep stage-related features and applied naïve Bayes, k-nearest neighbor, multi-layer perceptron, decision tree, and random forest algorithms for classification.
Recently, deep learning approaches have been applied for automatic sleep scoring algorithms with or without traditional signal processing, which exhibit superior performance than conventional methods. In particular, the architectures of CNN and recurrent neural network (RNN) [17] are mainly applied for sleep scoring algorithms. Because CNN is well known to extract accurate feature information from input data, they can well recognize the characteristics of the EEG amplitude or frequency components. Additionally, RNN has a memory cell architecture to utilize the temporal features of the EEG signals across consecutive epochs to estimate sleep stages. For example, Yildirim et al. [18] constructed a 16-layer one-dimensional CNN (1D-CNN) to extract EEG features and classified six sleep stages using fully connected layers and a softmax function. In addition, Zhu et al. [19] divided an epoch into 29 multiple windows to effectively capture sleep spindles or k-complex features and applied 1D-CNN and a self-attention mechanism for each window. In another study, 1D-CNN and Bi-LSTM were combined to estimate five class sleep stages [20], where 1D-CNN was pretrained using oversampled data to solve the class imbalance problem. There are other approaches that combine frequency analysis and deep learning algorithms. For example, Phan et al.. [21] employed Short Time Fourier transform (STFT) to produce time-frequency representations of a single channel EEG and then conducted sleep scoring using 1D-CNN. Michielli et al.. [22] extracted 55 features of EEG in the time and frequency domain using various statistical parameters, an infinite impulse response (IIR) bandpass filter, and a power spectrum, which were fed into the LSTM to classify four class sleep stages.
Currently, the state-of-the-art deep learning models for the automatic sleep scoring use the frequency components extracted through Fourier-based decomposition algorithm such as Fast Fourier Transform (FFT) or only raw EEG signals as their input [23]- [26]. The Fourier-based decomposition algorithms could not deal with reflect the non-linearity and non-stationarity of EEG signals [27]. If the raw EEG signal is fed into the deep learning model, the combined characteristics of the frequency components could be learned. However, it would be difficult for the model to train the features of each frequency components. In addition, it is crucial to look into the frequency components separately since several studies have already established specific components corresponding to a certain sleep stage [28]. For this issue, we propose a deep learning model with a data-driven frequency decomposition algorithm and the attention mechanism for the automatic sleep scoring. The main contributions of this paper are as follows.
• We apply a noise assisted bivariate empirical mode decomposition (NA-BEMD) to extract frequency components with its dyadic filter bank property. As a datadriven model, NA-BEMD can extract intrinsic oscillations in a signal, which can be accurately decomposed the sleep stage-related components in EEG signals [29]. • We employ the attention mechanism to choose important frequency components for each sleep stage, producing weights to the input nodes of the networks based on their significance [30]. The frequency components decomposed through NA-BEMD are the input of the deep neural networks, and their attention weights are decided based on their contributions to the classification of each sleep stage.
• The weight information through the attention mechanism could make us understand how the deep neural network architecture processes the input data to obtain the final classification results rather than a black-box function.
• We demonstrate that the proposed model could classify 5 sleep stages using three different datasets, recorded in different conditions, without modifying the model hyperparameters and architecture, inferring the accomplishment of the generalized sleep scoring model. This paper is structured as follows. Section II illustrates the public datasets used in the experiment, the proposed model architecture, and the evaluation method of the model in detail. In Section III, the performance of the generalized model is demonstrated using the evaluation process and compared with the existing works. In Section IV, the main contribution is discussed in more depth based on the experimental results, and the last section concludes the paper and describes the future research directions.

A. DATASETS
We employed three different channels of EEG from three public datasets to train and evaluate the performance of the proposed method. There are two Sleep-EDF dataset (2013 and 2018 version) [31] and Wisconsin sleep cohort dataset (WSC) [32]. In this study, a single-channel of EEG signal is analyzed considering the development of convenient sleep monitoring device with minimum number of wires. The distributions of epochs from the 5 sleep stages and the number of participants for each dataset are described in Table 1. As shown in Table 1, the numbers of epochs among all the sleep stages are unbalanced. In addition, WSC has the highest number of participants among the datasets, and Sleep-EDF-2018 has the largest number of epochs among the datasets.

1) SLEEP-EDF DATASET
The Sleep-EDF dataset is often used in many automatic sleep scoring studies [33]. There are three PSG records, consisting of EEG signals with configuration of Fpz-Cz and Pz-Oz channel, EOG signals, and chin EMG signals, recorded at a sampling rate of 100Hz. In addition, it contains hypnogram information with annotation of six sleep stages labeled based on R&K standard. We combined the N3 and N4 stages into the N3 stage to satisfy the latest AASM standard [4]  with five sleep stages classification. There are two different participant groups in the Sleep-EDF dataset, that is, Sleep Cassette Study(SC) and Sleep Telemetry Study (ST) groups. SC group records were obtained in the study about the age effects on the sleep of the healthy participants aged between 25 and 101 years old, without any sleep-related medication. On the other hand, ST group was recorded in the study of the temazepam effects on the sleep of the insomnia participants. We only utilized records of healthy participants from SC group in the 2013, 2018 versions, which have records from 20 and 78 participants, respectively, on the Fpz-Cz and Pz-Oz channel configurations.

2) WISCONSIN SLEEP COHORT DATASET
Wisconsin Sleep Cohort (WSC) dataset includes 2570 PSG records from 116 participants of various races with various physiological signals such as EOG, EEG, EMG, ECG, and SpO2 [32]. EEG was recorded on the configurations of F3-M2, Fz-M2, Cz-M2, C3-M2, Pz-M2, O1-M2 channels at a sampling rate of 200 Hz, and F3-M2 channel data was utilized for our benchmark test. In WSC dataset, the participants have a maximum of 5 records, obtain during 5 different visits of the participants (visit1-visit5). In this study, we employed the dataset from visit4 to visit5, which are the most recent data group in WSC dataset and they had become sufficiently familiar with the laboratory environments. beta (13-30Hz), and gamma (30Hz-) band components [34]. Also, at the same time, the epochs are z-score normalized. The five decomposed components are fed into the input layer of the proposed networks together with the normalized EEG signal as Z-scores. They are weighted through an attention layer, and the features of the 5 weighted frequency components and the normalized EEG signal are extracted respectively through the 1D-CNN. The extracted features of 10 epochs are considered as one sequence, and Bi-LSTM learns the successive epoch patterns among the 10 epochs. Additionally, the model is trained in two-step to more train network of first step than network of second step, since network of first step is deeper and more complex than second step and cause overfitting issue faster. Therefore, the model is trained network of second step using pre-trained network of first step as backbone. Through this, the generalization performance of the model was improved with preventing the overfitting issue of network of first step.

C. MODEL DESCRIPTION 1) NOISE ASSISTED BIVARIATE EMPIRICAL MODE DECOMPOSITION
Empirical mode decomposition (EMD) is a data-driven frequency decomposition algorithm in an adaptive way, which decomposes signals without any basis function [35], and thus it could well extract the intrinsic oscillations from the nonlinear EEG signals as demonstrated by Park et al.. [36] In contrast with the wavelet and Fourier analysis, EMD has few restrictions on a signal and no fixed basis function [37]. A raw signal x(t) is sequentially decomposed into intrinsic mode functions (IMFs), c k (t) and residue r n (t), through the iterations of the sifting process (see Eq 1).
During the sifting process, upper envelope e u (t) and lower envelope e l (t) are identified with interpolating all local maxima and minima, and compute the local mean m(t) of these two envelopes. After this, m(t) is repeatedly subtracted from the input data. If the residue r k (t) after the subtraction satisfies the two stopping criteria, c k (t) is assigned with r k (t).
The two stopping criteria are defined as follows: first, the mean of the upper and lower envelopes is close to zero, and second, the number of extreme values and zero crossing points should be the same or only differ by one. Algorithm 1 describes the details of EMD. BEMD was originally developed for complex-valued time series data [38]. Unlike the conventional univariate EMD, BEMD can process bivariate data in the complex-valued form and decompose it to complex-valued IMFs, which have real and imaginary parts for the bivariate data. The procedure of the BEMD algorithm is elaborated in Algorithm2.
Since the EEG signal is non-stationary and non-linear time series data, the EMD method could be suitable to decompose EEG signals due to its data driven process [35]. However, the distribution of the frequency components in each IMFs Algorithm 1 Procedure of Empirical Mode Decomposition 1: Let x(t) = x(t) 2: Find all local maximum and minimum values of x(t) 3: Identify upper envelope e u (t) and lower envelope e l (t), through a cubic spline interpolation 4: m(t) = (e u (t) + e l (t))/2 5: r k (t) = x(t)m(t) 6: If the condition of r k (t) meet the stopping criterion, assign c k (t) with r k (t), otherwise x(t) = r k (t) and iterate step 2-5 7: x(t) = x(t)c k (t) 8: Iterate step 2-7, until r k (t) has no more oscillations building a tube in time, real and imaginary axes 5: Obtain the tangent g d n on a tube in the direction of d n 6: Iterate step 2-5 twice with n = 1, 2 7: Average all tangents: m(t) = 2(g d 1 + g d 2 )/D 8: From this step, it is the same as sifting processing of the conventional EMD.
decomposed using EMD is not uniform. For example, the range of the frequency components in the first IMF of one EEG signal would be different from the frequency range of the other EEG signal. Thus, even if an EEG training dataset including a specific order of IMF is learned by a classification model, the model might not classify the other testing dataset due to the inconsistent frequency distribution of the IMFs. In order to solve the above issue, we apply to a noise-assisted BEMD using two channels which are a raw single channel EEG and a white Gaussian noise signal with 0.1 standard deviations of the EEG. Since the IMFs of a white Gaussian noise have a dyadic filter bank property, it could also derive the similar dyadic frequency components in the EEG IMFs of the other channel, which could solve the mode mixing problem within the decomposed IMFs [39].
In this study, we applied NA-BEMD to the epochs to extract six IMFs covering a frequency range from 0.5Hz to 50Hz. One example of the six IMFs from an epoch are displayed in Fig 2., and it shows that NA-BEMD produces the different oscillations in data-driven way considering the non-linearity and non-stationarity of the EEG signal. In addition, the averaged power spectra of all epochs' IMFs of the participant are shown in Fig 3., where the dyadic filter bank property can be noted. As the sifting process of NA-BEMD proceeds, slower oscillations are extracted than the earlier components. IMF1 contains high frequency noise, which could be found heuristically during an experiment learning the sleep scoring model using all 6 IMFs and 5 IMFs discarding IMF1. The 5 IMFs without IMF1 improved the sleep scoring performance, and thus the proposed model learned the 5 components only. Fig 4. illustrate the structure of an attention layers. The attention layer generates attention weights and executes element-wise multiplication to the input data. This attention mechanism consists of squeeze and excitation operations [30]. In squeeze operation, the five IMFs are fed into the global average pooling (GAP) layer to compress the global spatial information into a descriptor. The excitation operation computes the dependencies among the components using fully connected layers and non-linear functions. The formulas for these operations are as follows in Eq 2-4.

2) ATTENTION MECHANISM
x where δ and σ denote the Relu and sigmoid functions. W 1 and W 2 are trainable weight vector in the attention layer, W 1 ∈ R 3×C and W 2 ∈ R C×3 . The attention weights, A w , extracted through the sigmoid function are multiplied by the IMFs. Through the attention layers, the performance is  improved by focusing on the important IMFs for classifying the sleep stage. An IMF with a large attention weight could contain significant features for the model task. Since the outputs of the sigmoid function have values between 0 and 1, they are scaled corresponding to the significance of the components [30].

3) 1D-CNN FOR AN EPOCH's FEATURE LEARNING
We use 1D-CNN with various filter sizes in all the convolution layers to extract features from the decomposed 5 IMFs and raw EEG signal. A large sized filter would detect large patterns well, while a small sized filter would recognize small patterns [40]. In detail, there are 13 convolution layers with various filter sizes instead of the fixed size, 6 max-pooling VOLUME 10, 2022 layers, and global average pooling layer in the 1D-CNN, shown in Fig 5.. Each convolution layer has the ReLU, which is a non-linear activation function. In addition, we apply the residual connection to solve the vanishing gradient problem and to ease the training of the network by adding identity x to the output F(x) [41]. Therefore, this method allows the CNN layers to be deeper and improve the performance with the increased layer depth. [42] In addition, the residual connections are relatively easier to be optimized, compared with vanilla CNN. The features extracted from the IMF components and the normalized raw signal are concatenated and processed through two fully connected layers, and finally, five sleep stages are classified using the softmax function shown in Fig 1.. The extracted features from 10 epochs are transferred to the next Bi-LSTM layers.

4) BI-LSTM FOR LEARNING OF EPOCH SEQUENCE
In general, successive sleep stages have regular patterns [43]. For example, if the participant is in the wake stage, then the next stage is likely to be the wake or N1 stage. In addition, N1 and N2 stages have similar patterns of the brain waves, that is both stages have low amplitude and mixed frequency properties in EEG signal. However, if sleep spindles or k-complexes appear during a few minutes, all epochs in that duration are scored as N2 stages. Therefore, in order to decide a correct sleep stage between N1 and N2 or between Wake and N1, a learning structure based on the consecutive epochs is required. To learn this rule, the proposed model considers 10 epochs as one set and uses three Bi-LSTM layers that could memorize information of the previous or subsequent epochs [44]. Bi-LSTM layers can process sequences in forward and backward direction with merging two LSTMs. The input sequence x t is a feature vector of 10 epochs extracted using 1D-CNN structure and t denotes the time index in Bi-LSTM. This structure is formulated as follows: where − → h t and ← − h t are hidden states of the forward and backward networks. Finally, the output y t is made by concatenating both hidden states. Thus, it could learn past and future information. Fig 6. illustrates the Bi-LSTM layer architecture. The outputs of Bi-LSTM layer from y 0 to y 9 are fed into the next Bi-LSTM layer, and these Bi-LSTM layers are stacked in three layers in the model.

5) TRAINING DETAIL
In this study, the hyperparameters of our model were designed experimentally through trial and error, where several configurations were tried with twenty-fold cross validations using the Sleep-EDF-2013 dataset. We adjusted experimentally hyperparameters of 1D-CNN and Bi-LSTM, the number of convolution layers and fully connected layers, epoch sequence length, learning epochs, learning rate, and batch size. We used the Adam optimizer with a learning rate of 0.0001 and categorical cross-entropy as the loss function. The batch size is 80 when batch feds to network of feature learning for an epoch. 10 features of epochs are bundled into one set and 8 sets fed to network of learning of epoch sequence. In Fig 1., the output shape of the first fully connected layer of 1D-CNN is (8,10,64), where the time step and the number of attributes of Bi-LSTM are 10 and 64.
In addition, each learning epoch of an EEG epoch feature extraction learning and EEG epochs sequence learning is 8 and 50. Furthermore, the weight initializer in the model is GlorotUniform and its random seed is 777.

D. EXPERIMENTAL SETUP 1) EVALUATION METHOD
Ten and twenty-fold cross validations were conducted to validate the performance of the trained model [45]. In addition, the participants' test data were completely blinded to the training dataset for a fair inter-participant test. It should except epochs of the test participants from the training dataset. Therefore, because there are 20 participants in the Sleep-EDF-2013 dataset, twenty-fold cross validation was conducted, and ten-fold cross validation was performed for the Sleep-EDF-2018 dataset and WSC dataset. This interparticipant cross validation is practical to implement real world applications.

2) ATTENTION WEIGHT ANALYSIS METHOD
When the proposed model classifies the sleep stages, the attention weights are calculated and averaged during the cross validation to analyze which frequency components are weighted by the attention layer in a specific stage. In addition, One-tailed t-test is also conducted to determine whether the difference of the attention weights assigned to the frequency components in each sleep stage is significant.

4) PERFORMANCE METRICS
To evaluate the performance of the proposed model, three performance metrics-recall (RE), precision (PRE), and F1-score (F1)-were calculated. True positive (TP), true negative (TN), false negative (FN), and false positive (FP) were used to calculate the performance metrics as follows: In addition, the overall accuracy (ACC), macro F1-averaging (MF1), and Cohen's kappa coefficient (κ) were also calculated to evaluate the performance of sleep scoring.

MF1
where C denotes the number of classes and N the total number of test data. P e denotes the hypothetical probability of the chance agreement [47].

III. RESULT
In this study, the proposed model was designed to classify the five sleep stages: Wake, N1, N2, N3, and REM stages.
To evaluate the performance of the model, experiments were conducted using Fpz-Cz, Pz-Oz, and F3-M2 channel EEG signals from the Sleep-EDF datasets and WSC dataset, and several performance metrics were utilized for the evaluation.
A. CLASSIFICATION PERFORMANCE   Fig 7. illustrates the accuracy graph of the proposed model during the two-step training. The proposed model yields a stable convergence of the training and validation accuracy. Therefore, it demonstrates that the model is robust against overfitting.
There is an imbalance problem in the number of sleep stage labels. The epoch number of the N1 and N3 stages are smaller than those of the other stages in the dataset, as shown in Table 1. Thus, the F1 score and MF1 are important metrics to measure the performance of the imbalanced dataset as well as the overall accuracy. We verified the generalized performance of the model through a total of 5 experiments with three datasets using cross-validation. Table 2 shows the confusion matrices and performances of the proposed model using Fpz-Cz, Pz-Oz, and F3-M2 channel signals from the Sleep-EDF dataset and WSC dataset. The left and right parts in Table 2 show the confusion matrix and the performance VOLUME 10, 2022  metrics (PRE, RE, F1) of each sleep stage, and the overall performance of the model is described using ACC, MF1 and κ across all sleep stages. In Table 2, the model trained with the Fpz-Oz channel performed better than that with the Pz-Oz channel in Sleep-EDF dataset. This result demonstrates that the Fpz-Cz channel has more distinct features than the Pz-Oz channel for sleep scoring. Due to the data imbalance issue among the classes in all the datasets, the overall accuracy is higher than 80%, whereas MF1 is relatively lower than 80%. In particular, the F1 score of the N1 stage is lower than those of the other stages owing to the fewer number of epochs. In contrast, the Wake and N2 stages with relatively a number of epochs yielded high F1 scores. As a result, the performance of sleep scoring using Fpz-Cz signals of the 2013 Sleep-EDF dataset and Pz-Oz signals of the 2018 Sleep-EDF dataset showed 80.79% and 72.98% MF1, i.e., the highest and lowest performance among the five experiments, respectively.
Additionally, the proposed model was tested for the separation between two class sleep stages (Wake, sleep), and among the three (Wake, NREM, REM), and four class sleep stages (Wake, Light Sleep, Deep-Sleep, REM), which can be seen in Table 3. The fewer classes were classified, the higher the performance of the proposed model was shown. Fig 8. displays an example of automatic sleep scoring algorithms scored by an expert and the proposed model. Notably, most stages are well predicted except the N1 stage, which might be due to lower occurrence than the others.

B. BENCHMARKING WITH OTHER STATE-OF-THE-ART ALGORITHMS
In this section, we compare the sleep scoring performances of the proposed model with those of other state-of-theart models. All studies used the same dataset and EEG channel and the same cross-validation method as ours. Table4 summarizes the benchmarking results of the per-class F1 score and overall performance metrics. Our proposed model using NA-BEMD has the highest overall accuracy and Cohen's kappa coefficient metrics for the three channel EEG dataset. Additionally, the MF1 score is the highest except the Pz-Oz channel EEG of the 2013 Sleep-EDF dataset, where our model has a higher F1 score in the N2 and REM stages than the other state-of-the-art algorithms and has a comparable or higher F1 score in the Wake, N1, and N3 stages than the other models. As a result of the benchmark test using an FIR bandpass filter instead of NA-BEMD, up to 1% lower performance was observed than the model using NA-BEMD. Therefore, the proposed model proved that NA-BEMD is an efficient EEG decomposition method to classify the sleep stages rather than the Fourier-based frequency decomposition approach.

C. INTERPRETABLE ATTENTION WEIGHTS OF THE DECOMPOSED IMFs
The higher the attention weight given to the IMF, the more important the frequency component of the IMF is to classify the sleep stages. Each IMF has a certain frequency range based on the dyadic filter bank property of NA-BEMD (see Fig 3.), and the spectrum includes a unique EEG frequency spectrum. IMF2 contains the beta (13-30Hz) and alpha band (8-13Hz) frequency components, IMF3 alpha, theta bands (4-8Hz), IMF4 little alpha, and theta and delta bands (0.5-4Hz), and IMF5 and IMF6 mostly include delta band frequency components. Fig 9. displays the mean and standard deviation of the attention weights given to the IMFs for each sleep stage. IMF2, which has the widest spectrum including high-frequency components, had the highest mean of attention weight in all stages. Notably, the highest mean of attention weight was assigned to the Wake stage among the mean of attention weights of IMF2. Also, the mean of attention weights of IMF3 and IMF4, which contains theta band spectrum, decreased from Wake stage to N3 stage In particular, IMF4, which contains a strong theta band spectrum, had higher mean of attention weights for N1 stage and N2 stage than that of IMF3. Additionally, the mean of attention weights of IMF5 for the N3 stage epochs, which contains a strong delta band spectrum, were higher than other stages. One-tailed t-test was conducted to confirm whether these differences were significant. Interestingly, these attention weights to the different IMFs are consistent with the findings of the previous studies [28]. In addition, it was confirmed that each sleep stage does not depend on only one IMF, but multiple IMFs contribute to a sleep stage.

IV. DISCUSSION
Compared to the existing sleep scoring studies, the main contributions of the proposed method are the applications of NA-BEMD to extract the EEG frequency components and the attention mechanism to emphasize the corresponding frequency components to each sleep stage. Additionally, a novel deep neural network consisting of 1D-CNN and Bi-LSTM, and interpretability with the attention mechanism are the main contributions to the automatic sleep scoring research. These approaches are robust and reliable for the sleep scoring using only a single channel EEG signal.
Since the sleep experts consider the slow and fast scales in EEG signals as well as their amplitudes when classifying the sleep stages, the automatic sleep scoring algorithm also needs to decompose an EEG signal into several frequency components and consider the significance of the components corresponding to the sleep stages. For the frequency decomposition, the traditional Fourier-based method could be utilized using its sine and cosine basis functions. However, due to the limiation of its basis functions, it would not handle the inherent non-linearity and non-stationarity of EEG signals. To solve this issue, the data-driven decomposition algorithm, EMD, has been applied. Furthermore, in order to utilize the advantage of the dyadic filter bank property, NA-BEMD is eventually utilized. The benchmark test of the sleep scoring algorithms using NA-BEMD and FIR filters demonstrates NA-BEMD outperforms the other.
The attention mechanism for the decomposed frequency components as inputs to the proposed model yield high weights to the components with large information about a sleep stage, which could provide interpretability regarding the proposed model and enhance the performance. For instance, since IMF2 considerably varies among the sleep stages, it could be crucial to classify the sleep stages. Thus, the attention weights of IMF2 are high in all sleep stages, which contain high-frequency components such as alpha or beta bands, contributing the most to the classification of all sleep stages. For the same reason, we found that the IMF5, which contains low-frequency components such as the delta band, contributes the most to the N3 stage classification, and other IMFs contribute to the wake stage. Therefore, this method not only improve the model performance, but also makes it reliable with the explainability of the designed deep neural networks.
Another main novelty of the proposed model is the combination of the 1D-CNN and Bi-LSTM to extract the features in each epoch and the continuous epochs. The relations among the EEG epochs are crucial to define the sleep stages due to the nature of the sleep cycles [49]. Therefore, the automatic sleep scoring model is designed to consider this in the hidden state during the learning process. To improve the performance of the model, a deeper 1D-CNN layer with the residual connection is built, which also prevents the overfitting of 1D-CNN through two-step training. As a result, the proposed model outperformed the other state-ofthe-art methods in the five experiments the same channel of EEG signal from the same dataset.
In the experiment, inter-participant cross validation was conducted using the data of three EEG channels (Fpz-Cz, Pz-Oz, and F3-M2) from the three public datasets. In particular, the F3-M2 channel is useful for the implementation of sleep monitoring wearable devices. It can be recorded from two areas of the head, that is, behind the ear and in the front of the head, which is hairless and more comfortable to be attached than the other channels. In addition, because the WSC dataset has data from participants of various races -i.e., Asian, Black, Hispanic, Native American, and White-the experiments using the WSC dataset demonstrated that our proposed model could be generalized for various ethnicities.
Although our model showed the best results with the most performance metrics, there were a limitations. As the number of N1 epochs is insufficient, the prediction probability of the N1 stage was still low, and the proposed model could not learn features of sleep spindle or k-complex separately. Therefore, for future works, we need to improve the performance of the N1 scoring using a rule-based method which reflects the N1 and N2 stage transition probability [50] or oversamples augmenting N1 epochs [51]. Also, the proposed model could be improved in performance by detecting the sleep spindle or k-complex and extracting features in future works.

V. CONCLUSION
In this study, we propose a robust model for an automatic sleep scoring method using an single channel EEG. In contrast to other state-of-the-art models, the proposed model uses the noise-assisted bivariate EMD for the extraction of the intrinsic modes of the EEG signals and applies a residual connection and various filter sizes for the CNN architecture to design a deeper and more robust model to improve the representative feature extraction performance. Additionally, to learn the relationship between consecutive epochs, 10 epochs are used as one sentence and are learned using Bi-LSTM. Furthermore, the model was interpretable using the attention mechanism. Based on these findings, we prove that the proposed model shows the best performance for three channel EEG signals from the three public datasets compared with the other state-of-the-art algorithms. Therefore, we expect that our robust model can be practically applied to professional decision support systems and wearable sleep monitoring devices.