A Lightweight Segmented Attention Network for Sleep Staging by Fusing Local Characteristics and Adjacent Information

Sleep staging is an essential step in sleep quality assessment and the diagnosis of sleep disorders. However, most current automatic sleep staging approaches use recurrent neural networks (RNNs), which results in a relatively large training burden. Moreover, these methods extract information only from the whole epoch or from adjacent epochs, ignoring local signal variations within an epoch. To address these issues, this paper proposes a novel deep learning architecture named the segmented attention network (SAN). The architecture is divided into a feature extraction (FE) module and a time sequence encoder (TSE). The FE module consists of multiple multiscale CNNs (MMCNN) and a residual squeeze and excitation block (SE block). The former extracts features from multiple equal-length EEG segments and the latter reinforces those features. The TSE module, based on a multi-head attention mechanism, captures the temporal information in the features extracted by the FE module. Notably, SAN replaces the RNN module with the TSE module for temporal learning, which makes the network faster. The model was evaluated on two widely used public datasets, Montreal Archive of Sleep Studies (MASS) and Sleep-EDFX, and one clinical dataset from Huashan Hospital of Fudan University, Shanghai, China (HSFU). The proposed model achieved accuracies of 85.5%, 86.4%, and 82.5% on Sleep-EDFX, MASS, and HSFU, respectively. The experimental results show favorable performance and consistent improvements of SAN on different datasets in comparison with state-of-the-art studies. They also demonstrate the necessity of integrating the local characteristics within epochs and the adjacent informative features among epochs for sleep staging.


I. INTRODUCTION
SLEEP is an important activity for human beings. High-quality night sleep contributes to maintaining physical and mental wellbeing [1]. In contrast, lack of sleep and sleep disorders can lead to adverse cardiometabolic risks such as obesity, hypertension, diabetes and cardiovascular disease [2], [3], [4], [5], [6]. Thus, it is necessary to monitor sleep quality and treat sleep disorders expeditiously. In clinical practice, sleep is usually measured with a polysomnography (PSG) device, which records the electroencephalogram (EEG), electrooculogram (EOG), electromyogram (EMG), electrocardiogram (ECG) and other signals [7]. Physicians manually interpret the PSG recording and assign each part of it to a sleep stage according to either the Rechtschaffen and Kales (R&K) rules [8], which divide sleep into six stages, i.e., wake (W), rapid eye movement (REM) and four non-REM stages (S1, S2, S3 and S4), or the American Academy of Sleep Medicine (AASM) rules [9], which divide sleep into five stages, i.e., wake (W), rapid eye movement (REM) and three non-REM stages (N1, N2 and N3). Manual sleep staging is a very tedious and laborious task. It usually takes more than 4 hours to label a full night of sleep recordings. Therefore, to alleviate the manual interpretation burden on physicians, automatic sleep staging is deemed to be an effective alternative.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Automatic sleep staging methods can be roughly categorized into machine learning-based and deep learning-based approaches. In recent years, deep learning approaches [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23] have gradually replaced traditional machine learning approaches [24], [25] in automated sleep staging, because traditional machine learning methods require the extraction of hand-crafted features, which is time-consuming and has proven unreliable when tested on unseen data. In contrast, deep learning methods can avoid these problems by using neural networks for adaptive and adequate feature extraction. The majority of existing deep learning-based sleep staging approaches use a convolutional neural network (CNN) architecture [26], [27], [28]. For example, Yang et al. extracted features from raw EEG using CNNs and applied hidden Markov model (HMM) refinement as a post-processing step to correct unreasonable sleep stage transitions between adjacent EEG epochs [27]. Perslev et al. proposed U-Sleep, based on a fully convolutional neural network, and evaluated it across several clinical studies [28]. A number of studies use recurrent neural network (RNN) architectures such as long short-term memory (LSTM) and gated recurrent unit (GRU), with which temporal features can be fully learned and explored. For example, Phan et al. proposed SeqSleepNet, an RNN-based architecture for processing the sequential signal, which exhibited excellent performance but required a considerable amount of training time [16]. Dong et al. applied a multi-layer perceptron (MLP) and LSTM to address the temporal pattern recognition challenge [14]. A few approaches combine CNN and RNN in order to extract both temporal and spatial information from biomedical data [29], [30], [31]. Supratak et al. proposed DeepSleepNet, a combination of CNN and RNN whose five-class sleep staging results can reach 86.2% [10]. Sun et al. proposed an architecture based on CNN and RNN that considers both automatic and manual features [11].
Although most existing automatic sleep staging approaches achieve favorable results, they still face several major challenges. First, architectures based on one or more RNNs have high model complexity, mainly because of the computational approach and structural design of RNNs [32]. Since the hidden states of an RNN can only be computed serially, each step relies on information from the previous moment, and the model therefore takes a long time to train. This is detrimental when transferring the model to new datasets, considering that most existing methods lack strong generalization capabilities. On a huge amount of sleep data, the use of RNNs significantly increases the computational time and the model complexity. Second, CNN-based structures consider only the characteristics of the whole epoch or of adjacent epochs, and ignore local signal variations within an epoch [10], [11], [12], [17], [18]. In these works, the entire 30 s EEG signal is usually fed directly into the network, and features are extracted from the signal by convolutional kernels of different sizes. However, according to the American Academy of Sleep Medicine (AASM) rules [9], features of different sleep stages sometimes appear simultaneously in the same frame of the sleep record. In that case, the stage assigned to the frame is determined by how long the features of each sleep stage last. When the whole 30 s EEG signal is fed into the CNN, a stage transition within the epoch may cause some degree of confusion, because features are extracted from the epoch as a whole. Extracting features from a segmented signal avoids this drawback and yields the contribution of different regions to the decision outcome. In addition, the segmentation operation effectively divides the model into several submodules, and the joint collaboration of multiple submodules improves the overall performance.
In this paper, a lightweight segmented attention network (SAN) model for automatic sleep staging is proposed. The model consists of two main components: feature extraction (FE) and a time sequence encoder (TSE). The FE module is composed of multiple multiscale CNNs (MMCNN) and a residual squeeze and excitation block (residual SE block). The 30 s EEG signal is divided into multiple equal-length segments, and each segment is processed by a multiscale CNN for feature extraction. Each multiscale CNN has both large and small convolutional kernels to fully extract the information in its EEG signal segment. By segmenting the EEG signal before feature extraction, the signal features can be fully extracted, and features from different regions can then be integrated. The residual SE block adjusts the weights of the features and enhances them. The time sequence encoder learns the temporal information from the extracted features, and its core structure is multi-head attention. Unlike RNNs, multi-head attention can process data in parallel, which greatly improves learning efficiency. We also apply data augmentation to address the class imbalance in sleep data and to improve the generalization ability of the model. The main contributions are summarized as follows:
1) To explore the characteristics within an epoch more extensively, we divide the whole epoch into multiple equal-length segments and investigate the local information of each segment and the temporal information among segments. Integrating these characteristics provides a comprehensive feature that represents the various regions of the epoch.
2) We propose MMCNN, which consists of several multiscale CNNs with large and small convolutional kernels, to fully extract features from the EEG signal. Features with different temporal and frequency resolutions are acquired, and a residual SE block is then used to focus on the channel-wise informative features.
3) Instead of an RNN, we propose a time sequence encoder that mainly consists of a multi-head attention mechanism. It significantly reduces the complexity of the network while preserving efficiency, and it runs in parallel to learn the time sequence information between features. Thus, the model can obtain the contribution of different segments to the classification results.
This paper is organized as follows. Section II describes the details of the proposed model. Section III introduces the datasets, the experimental process and the evaluation indicators. Section IV presents and discusses the sleep staging results of the proposed model on different datasets, explores the computational efficiency of the proposed model, and compares our approach with others. Finally, Section V draws the conclusion.

II. METHOD
In this section, we introduce the proposed segmented attention network (SAN) model for sleep staging using a single-channel EEG signal. Fig. 1 shows the overall structure of the SAN model. During feature extraction, to preserve as much as possible the local characteristics of the different regions of the signal, we divide the signal into fixed-length segments with a 50% overlap, which helps prevent discontinuities in the signal. We also explore how the segment length affects performance, as reported in Section IV. The segmented signals are then processed by the feature extraction module, which is composed of multiple multiscale CNNs that extract features from the 30-second EEG signal. The multiple multiscale CNNs are designed to extract comprehensive features at various temporal resolutions. Each multiscale CNN includes small-kernel and large-kernel convolutions. Each multiscale CNN also contains a residual SE block [33], which makes the features more distinctive. After feature extraction, the TSE module learns the time sequence information from the features extracted by the multiple multiscale CNNs. The time sequence encoder consists of positional embedding, multi-head attention and feed-forward parts. The output of the TSE is connected to a fully connected layer with a softmax classifier. To address the class imbalance among sleep stages, we adopt several data augmentation strategies that enrich the diversity of the input signals, such as adding Gaussian noise and scaling. The following subsections present the details of each block.

B. Feature Extraction
An epoch of EEG signal is divided into several segments after data augmentation. Each segment is fed into its corresponding multiscale CNN and residual SE block. After the multiple multiscale CNNs and residual SE blocks, all the features are integrated by a connection layer to form the feature information.
1) The Segment of EEG Signal: As shown in Fig. 1, we divide the 30 s single-channel EEG signal into segments. Segmentation is equivalent to applying windows to the signal, so each segment can be treated as quasi-stationary. Therefore, the model can learn more stable statistical properties and acquire robust features. Each segment of the EEG signal is fed separately into a multiscale CNN for feature extraction. To prevent information loss at the boundaries between segments, two adjacent segments overlap by 50%. For the 30 s EEG signal, the length of each segment is calculated as follows:

$$\text{segment length} = \frac{30\ \text{s}}{L} \tag{1}$$

where L represents the number of selected segments. When L is determined, there are L − 1 overlap segments, and the total number of segments is n = 2L − 1. We refer to the model with L segments as SAN-L, and we explore the effect of the number of segments on the final results in Section IV.
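The segmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a 100 Hz epoch of 3000 samples, a segment length of 30/L seconds (consistent with the 6 s, 3 s and 2 s lengths reported for SAN-5, SAN-10 and SAN-15), and a hop of half a segment. The function name `segment_epoch` is hypothetical.

```python
import numpy as np

def segment_epoch(epoch, num_base_segments):
    """Split one epoch into equal-length segments with 50% overlap.

    num_base_segments plays the role of L in the text. The result has
    n = 2 * L - 1 rows, each len(epoch) // L samples long.
    """
    epoch = np.asarray(epoch)
    if len(epoch) % (2 * num_base_segments) != 0:
        raise ValueError("epoch length must be divisible by 2 * L")
    seg_len = len(epoch) // num_base_segments  # e.g. 3000 // 10 = 300 samples (3 s)
    hop = seg_len // 2                         # 50% overlap between neighbours
    n_segments = 2 * num_base_segments - 1
    return np.stack(
        [epoch[i * hop : i * hop + seg_len] for i in range(n_segments)]
    )

fs = 100                                  # sampling rate in Hz (assumed)
epoch = np.arange(30 * fs, dtype=float)   # stand-in for one 30 s EEG epoch
segments = segment_epoch(epoch, num_base_segments=10)
print(segments.shape)
```

With L = 10 the last segment starts at sample 2700 and ends at sample 3000, so the overlapping segments cover the epoch exactly.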
2) Multiple Multiscale CNN: Fig. 2 shows the structure of the multiscale CNN applied for feature extraction from one segment of the 30 s single-channel EEG signal. We propose MMCNN to fully extract the features of different sleep stages in the 30 s single-channel EEG signal. The input of each multiscale CNN is a segmented EEG signal. As shown in Fig. 2, each multiscale CNN has two branches. One branch, with small-kernel convolutions, extracts the detail features and high-frequency components of the segmented EEG signal. The other branch, with large-kernel convolutions, extracts the morphological features and low-frequency information. In each branch, three convolutional layers and two max-pooling layers are applied. The first convolutional layer reduces the dimensionality of the input signal for subsequent feature extraction. The last two convolutional layers perform feature extraction, so their parameters are similar in both branches. Each convolutional layer is followed by a batch normalization layer [34] that normalizes the data and a ReLU activation function. To prevent overfitting, dropout is applied after the max-pooling layer and after the concatenation of the two branches.
3) Residual Squeeze and Excitation Block: A residual network can prevent vanishing and exploding gradients as the network deepens [33]. It has since been improved and extended by many researchers. Hu et al. [35] proposed the Squeeze and Excitation block (SE block), which scales the extracted features to enhance those that have a significant impact on the results and weaken those that have a small impact. The structure of the module is shown in Fig. 3. The residual SE block combines a residual network and an SE block. The input $X \in \mathbb{R}^{H \times W \times C}$ is the output of the multiscale CNN. The residual layer is mainly composed of convolutional layers, and its output is $X_1 \in \mathbb{R}^{H \times W \times C}$. Next, the SE block compresses the extracted features. Global pooling reduces the dimensionality of the features, changing $X_1 \in \mathbb{R}^{H \times W \times C}$ to $X_2 \in \mathbb{R}^{1 \times 1 \times C}$. Afterwards, two fully connected layers and a ReLU layer parameterize the gating mechanism, reinforcing the important features of the center and weakening the features of the edge. The following sigmoid activation function gives the proportion of weight for each feature. The entire process is shown in the following equation:

$$s = \sigma\big(F_2(\mathrm{ReLU}(F_1(X_2)))\big) \tag{2}$$

where $F_1(\cdot)$ is the first FC layer, $F_2(\cdot)$ is the second FC layer, $\mathrm{ReLU}(\cdot)$ is the ReLU activation function, $\sigma(\cdot)$ is the sigmoid activation function, and $s$ denotes the resulting feature weights. Then, the feature weights are reassigned by multiplication:

$$X_3 = s \otimes X_1 \tag{3}$$

Finally, a shortcut connection superimposes the original input and the enhanced features. The final output is as follows:

$$Y = X + X_3 \tag{4}$$
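The squeeze, excitation and shortcut steps can be sketched in NumPy as below. This is a sketch under stated assumptions rather than the authors' code: features are treated as a one-dimensional map of shape (T, C), global pooling is taken to be average pooling, the reduction ratio `r` is hypothetical, and the residual convolutional layers are replaced by a random stand-in `x1`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_se_block(x, x1, w1, b1, w2, b2):
    """x: block input (T, C); x1: output of the residual conv layers (T, C)."""
    squeezed = x1.mean(axis=0)                    # global average pooling -> (C,)
    hidden = np.maximum(0.0, squeezed @ w1 + b1)  # first FC layer followed by ReLU
    weights = sigmoid(hidden @ w2 + b2)           # second FC layer followed by sigmoid
    enhanced = x1 * weights                       # channel-wise re-weighting
    return x + enhanced, weights                  # shortcut connection

rng = np.random.default_rng(0)
T, C, r = 50, 16, 4                    # r is an assumed reduction ratio
x = rng.standard_normal((T, C))        # stand-in for the multiscale CNN output
x1 = rng.standard_normal((T, C))       # stand-in for the residual layer output
w1 = rng.standard_normal((C, C // r)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C // r, C)) * 0.1
b2 = np.zeros(C)

out, weights = residual_se_block(x, x1, w1, b1, w2, b2)
print(out.shape, weights.shape)
```

Each channel receives a single weight between 0 and 1, so the block can only attenuate or keep channels of `x1` before the shortcut adds the original input back.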

C. Time Sequence Encoder (TSE)
The function of the TSE module is to perform temporal learning on the extracted features. The TSE module consists of a multi-head self-attention layer, add and normalize layers and a feed forward layer. The following subsections present the details of these layers.
1) Multi-Head Attention: Inspired by [36], we use an attention mechanism to obtain temporal features. It is more efficient than an RNN and consists of several self-attention heads. Self-attention predicts the final outcome by focusing attention on different features. As shown in Fig. 4, given the input signal $X \in \mathbb{R}^{N \times M}$, the three matrices Query ($Q \in \mathbb{R}^{N \times d_K}$), Key ($K \in \mathbb{R}^{N \times d_K}$) and Value ($V \in \mathbb{R}^{N \times d_V}$) are obtained by multiplying $X$ with the linear transformation matrices $W_Q \in \mathbb{R}^{M \times d_K}$, $W_K \in \mathbb{R}^{M \times d_K}$ and $W_V \in \mathbb{R}^{M \times d_V}$. The dimensions of Q and K must be the same, whereas the dimensions of V and Q can differ. The lengths of K and V must be the same, because K and V essentially correspond to representations of the input signal in different spaces. Finally, the output of self-attention is calculated by the following equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_K}}\right)V \tag{5}$$

The outputs of the individual heads are concatenated and projected:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W_o$$

where $W_o$ is the additional weight matrix, which is jointly trained in the model to adjust the weights. Compared with a single self-attention layer, multi-head attention extends the ability of the model to focus on different positions and provides multiple representation subspaces, which can find correlations between sequences from different angles. It also reduces the dimensionality of each vector when calculating the attention of each head, which can prevent overfitting.
2) Add and Normalize Layer: In the TSE module, there are two add and normalize layers. One is after the multi-head attention layer and the other is after the feed forward layer. Each adds the input signal to the output via a residual connection and then normalizes the sum. The process can be expressed as follows:

$$\mathrm{LayerNorm}\big(x + \mathrm{SubLayer}(x)\big)$$

where $x$ is the input signal of the multi-head attention or the feed forward layer and $\mathrm{SubLayer}(x)$ is the output of the multi-head attention or the feed forward layer. The residual connection helps feature learning, prevents vanishing gradients, and can speed up learning.
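A single attention head followed by an add and normalize step can be sketched as below. This is an illustrative sketch, not the authors' implementation: the sizes (N = 19 segments, M = 8 features, d_K = 4) are arbitrary, d_V is set equal to M so the residual sum is well defined, scaling by the square root of d_K follows [36], and the normalization is a plain layer normalization without learnable gain or bias.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # (N, d_K), (N, d_K), (N, d_V)
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (N, N) similarity between positions
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ v, attn

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection followed by normalization over the feature axis."""
    s = x + sublayer_out
    mean = s.mean(axis=-1, keepdims=True)
    std = s.std(axis=-1, keepdims=True)
    return (s - mean) / (std + eps)

rng = np.random.default_rng(0)
N, M, d_k = 19, 8, 4                  # illustrative sizes, not taken from the paper
x = rng.standard_normal((N, M))
w_q = rng.standard_normal((M, d_k))
w_k = rng.standard_normal((M, d_k))
w_v = rng.standard_normal((M, M))     # d_V = M so that x + output is defined

out, attn = self_attention(x, w_q, w_k, w_v)
y = add_and_norm(x, out)
print(attn.shape, y.shape)
```

Row t of `attn` gives the weight that position t assigns to every other position, which is one way to read off the contribution of each segment.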
3) Feed Forward Layer: The feed forward layer follows the multi-head attention. It contains two linear transformation layers with a ReLU activation function between them. The feed forward layer introduces nonlinearity (through the ReLU activation function) and transforms the space of the multi-head attention output, thus increasing the expressiveness of the model. The operation of the feed forward layer can be defined as follows:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

where $W_1$, $b_1$ and $W_2$, $b_2$ are the weights and biases of the first and second linear transformation layers, respectively.
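As a brief sketch of the two linear transformations with a ReLU in between, assuming row-vector inputs and hypothetical weight shapes (the hidden size of 32 is not from the paper):

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU between them, applied to each row of x."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # first linear layer followed by ReLU
    return hidden @ w2 + b2                # second linear layer

rng = np.random.default_rng(0)
x = rng.standard_normal((19, 8))           # e.g. 19 positions with 8 features each
w1, b1 = rng.standard_normal((8, 32)), np.zeros(32)
w2, b2 = rng.standard_normal((32, 8)), np.zeros(8)
ffn_out = feed_forward(x, w1, b1, w2, b2)

# With identity weights and zero biases the layer reduces to a plain ReLU.
relu_only = feed_forward(np.array([[1.0, -1.0]]), np.eye(2), np.zeros(2),
                         np.eye(2), np.zeros(2))
print(ffn_out.shape, relu_only)
```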

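4) Mask: In the TSE module, so that the model learns only from information before the current moment and does not leak information from after it, we add a mask function to the multi-head attention layer. Specifically, the matrix is made lower triangular by the operation. The operation can be defined as follows:

$$\mathrm{Mask}(X) = \begin{pmatrix} x_{11} & 0 & \cdots & 0 \\ x_{21} & x_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{pmatrix}$$

where

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{pmatrix}$$

This operation is performed after the calculation of $QK^{T}$. Therefore, equation (5) can be updated to:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\mathrm{Mask}(QK^{T})}{\sqrt{d_K}}\right)V$$

In this way, at moment t, which corresponds to row t of the matrix, only information from the first moment up to moment t can be read. Information after moment t cannot be read.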
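The masking step can be sketched as below. One detail is an assumption that goes beyond the text: the entries above the diagonal are set to negative infinity rather than zero, which is the common practice for causal attention, because a zero score would still receive a nonzero weight after the softmax. The function names are hypothetical.

```python
import numpy as np

def causal_mask(scores):
    """Keep the lower triangle of the score matrix and block future positions."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    return np.where(future, -np.inf, scores)

def masked_attention_weights(q, k):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    masked = causal_mask(scores)
    # The diagonal is never masked, so every row maximum is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                  # exp(-inf) evaluates to 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 4))
k = rng.standard_normal((5, 4))
w = masked_attention_weights(q, k)
print(np.round(w, 2))
```

Row t of `w` has nonzero entries only in columns up to t, so position t reads nothing from later positions.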

D. Data Augmentation
In this work, we apply several transformations to the input signal. Specifically, we designed three data augmentation methods: 1) Adding Gaussian noise. 2) Inverting, that is, multiplying by a factor of −1. 3) Scaling, where the input signal is multiplied by a random factor in the range from 0.5 to 2. We use different combinations of these three methods to produce sufficient signal variation. Applying various transformations to the input signal yields a more robust model.
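The three transformations can be sketched as follows. This is a minimal sketch with assumed details: the noise standard deviation `sigma`, the probability `p` of applying each transform, and the function names are hypothetical, since the text does not specify how the combinations are chosen.

```python
import numpy as np

def add_gaussian_noise(x, rng, sigma=0.05):
    """Add zero-mean Gaussian noise; sigma is an assumed value."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def invert(x):
    """Multiply the signal by -1."""
    return -x

def random_scale(x, rng, low=0.5, high=2.0):
    """Multiply the whole signal by one random factor drawn from [low, high)."""
    return x * rng.uniform(low, high)

def augment(x, rng, p=0.5):
    """Apply each transform independently with probability p (assumed policy)."""
    out = np.array(x, dtype=float)
    if rng.random() < p:
        out = add_gaussian_noise(out, rng)
    if rng.random() < p:
        out = invert(out)
    if rng.random() < p:
        out = random_scale(out, rng)
    return out

rng = np.random.default_rng(0)
epoch = np.sin(np.linspace(0.0, 20.0, 3000))  # stand-in for a normalized EEG epoch
augmented = augment(epoch, rng)
print(augmented.shape)
```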

III. EXPERIMENT
Our proposed new model is extensively validated on three datasets, including two public datasets and one clinical dataset. In this section, we introduce the database used for the experiment, and the process of our experiment.

A. Database
In this work, we use two public datasets, Sleep-EDFX and MASS, and a clinical dataset called HSFU, collected at Huashan Hospital, Fudan University, Shanghai, China, during 2019-2020, to validate the effectiveness of the proposed model.
2) Montreal Archive of Sleep Studies (MASS): MASS is a large dataset collected from a number of different hospitals [38]. It contains whole-night sleep recordings from 200 subjects (97 females and 103 males) aged from 18 to 76 years. It has five subsets: SS1-SS5. The epochs of the recordings were manually labeled based on the AASM standard [9] and the R&K standard [8]. The epoch length is 20 seconds in SS2, SS4 and SS5, and 30 seconds in SS1 and SS3. Each epoch contains EEG, EOG, EMG, ECG and other signals. In our experiments, we used the SS3 subset and adopted the C4 EEG channel.
3) Huashan Hospital Fudan University (HSFU): HSFU is a nonpublic database collected at Huashan Hospital, Fudan University, Shanghai, China, during 2019-2020. The research was approved by the Ethics Committee of Huashan Hospital (ethical permit no. 2021-811). It consists of 26 clinical PSG recordings acquired from patients diagnosed with obstructive sleep apnea, insomnia, and restless legs syndrome. The PSG recordings were annotated by one qualified sleep expert according to the AASM standard. We adopted the C4 EEG channel in this study.

B. Data Preprocessing
In this experiment, all EEG signals are filtered by a notch filter and a bandpass filter to eliminate power-line (industrial frequency) interference. The signals are then resampled to 100 Hz to fit the model. EEG signals are normalized to zero mean and unit standard deviation to reduce differences between individuals. All EEG signals are split into 30 s epochs without overlap between epochs. Each epoch of the EEG signal has a corresponding sleep stage label.
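The normalization and epoching steps can be sketched as below. The sketch assumes the signal has already been filtered and resampled to 100 Hz, and that normalization is applied per recording (the text does not say whether it is per recording or per epoch). Samples that do not fill a complete epoch are dropped, which is also an assumption.

```python
import numpy as np

def zscore(signal):
    """Normalize a recording to zero mean and unit standard deviation."""
    return (signal - signal.mean()) / signal.std()

def split_into_epochs(signal, fs=100, epoch_sec=30):
    """Cut a recording into non-overlapping epochs, dropping any incomplete tail."""
    samples = fs * epoch_sec
    n_epochs = len(signal) // samples
    return signal[: n_epochs * samples].reshape(n_epochs, samples)

rng = np.random.default_rng(0)
recording = rng.normal(5.0, 3.0, size=100 * 30 * 4 + 50)  # synthetic stand-in signal
normalized = zscore(recording)
epochs = split_into_epochs(normalized)
print(epochs.shape)
```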

C. Evaluation Indicators
To evaluate the model performance, we adopt a series of commonly used evaluation metrics. Accuracy (Acc) is the proportion of correctly predicted samples among all samples. Macro-F1 score (MF1) takes into account both precision and recall, and can evaluate model performance in multi-class problems on imbalanced datasets. Cohen's kappa (κ) assesses the consistency of the classification. Sensitivity (Sens) and Specificity (Spec) measure the ability of the model to correctly classify positive and negative cases, respectively. They are calculated as follows:

$$\mathrm{Acc} = \frac{\sum_{i=1}^{K} TP_i}{N}$$

$$\mathrm{MF1} = \frac{1}{K}\sum_{i=1}^{K} F1_i, \qquad F1_i = \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}$$

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \frac{1}{N^{2}}\sum_{i=1}^{K}(TP_i + FP_i)(TP_i + FN_i)$$

$$\mathrm{Sens}_i = \frac{TP_i}{TP_i + FN_i}, \qquad \mathrm{Spec}_i = \frac{TN_i}{TN_i + FP_i}$$

where True Positives ($TP_i$), False Positives ($FP_i$), True Negatives ($TN_i$) and False Negatives ($FN_i$) are the numbers of correctly or incorrectly identified samples for the i-th class, $p_o$ is the observed agreement (equal to Acc), and $p_e$ is the agreement expected by chance. N is the total number of samples and K is the number of sleep stages.
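These metrics can all be derived from a confusion matrix, as in the sketch below. It uses the standard definitions of accuracy, macro-F1, Cohen's kappa, sensitivity and specificity, with rows as true classes and columns as predicted classes. Whether the paper averages sensitivity and specificity over classes is not stated, so they are returned per class here. The function name and the 2×2 example matrix are hypothetical.

```python
import numpy as np

def staging_metrics(cm):
    """Metrics from a K x K confusion matrix (rows: true class, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class i but belonging to another class
    fn = cm.sum(axis=1) - tp   # belonging to class i but predicted as another class
    tn = n - tp - fp - fn
    acc = tp.sum() / n
    f1 = 2 * tp / (2 * tp + fp + fn)
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (acc - p_e) / (1 - p_e)
    return {
        "acc": acc,
        "mf1": f1.mean(),
        "kappa": kappa,
        "sens": tp / (tp + fn),
        "spec": tn / (tn + fp),
    }

cm = [[5, 1],
      [2, 2]]
metrics = staging_metrics(cm)
print(metrics)
```

The sketch assumes every class appears at least once as a true label and as a prediction; otherwise some denominators are zero and need explicit handling.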
We also evaluated the running time of each network to choose an efficient and expeditious model. The average time for each model to run a fold is recorded as an evaluation reference.

D. Baseline Networks and Setup
In this experiment, we compared the proposed approach with several well-performing baseline networks, namely DeepSleepNet [10], SeqSleepNet [16] and SimpleSleepNet [20]. A brief description of these networks is given below.
• DeepSleepNet [10]: An architecture for sleep staging proposed in 2017, which consists of a multiscale CNN and an LSTM with a shortcut residual connection. This structure combines the capabilities of two networks for feature extraction and temporal learning.
• SeqSleepNet [16]: A hierarchical bi-directional RNN structure. SeqSleepNet converts the raw EEG signal into power spectrum images by short-time Fourier transform (STFT), which allows the signal to be characterized in both the time and frequency domains.
• SimpleSleepNet [20]: It consists of two bidirectional gated recurrent unit structures. It also converts the raw EEG signal into power spectrum images, and the channels and frequencies of the power spectrum images are recombined after STFT. This network has few parameters and a small hidden layer size, so it runs very fast.
To avoid chance results and to test the performance of the models accurately, we used 10-fold cross-validation for each model on each dataset. In each cross-validation, we tested the models using the leave-one-subject-out method. We pooled the results of the 10 cross-validation tests as the final test results of the model. In addition, for the comparison of running times, we measured the time needed to train one fold for each model. We adopted early stopping and terminated training when the validation loss did not decrease for a number of consecutive epochs.
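The early stopping rule mentioned above can be sketched as a small patience counter. The patience value and the class name are hypothetical, since the text does not state how many consecutive epochs without improvement trigger termination.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss and return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Illustrative validation losses; real values would come from the training loop.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.65]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch_idx, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch_idx
        break
print(stopped_at, stopper.best_loss)
```

A loss equal to the best value so far counts as no improvement in this sketch, which is a design choice rather than something the text specifies.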

A. Effect of Different Number of Segments
To investigate the effect of the number of segments on the final result, we conducted experiments with four configurations: SAN-0 (no segmentation), SAN-5 (L = 5, each segment is 6 s long), SAN-10 (L = 10, each segment is 3 s long) and SAN-15 (L = 15, each segment is 2 s long). We performed 10-fold cross-validation to evaluate the impact of segmentation on model performance.
As shown in Fig. 5, within a certain range, from SAN-0 to SAN-10, the indicators of the model on the three datasets, such as accuracy, MF1 score and kappa coefficient, increase steadily. As the number of segments increases, more regions are separated, and a relatively comprehensive result originating from these regions is provided. Segmentation plays a role similar to ensemble learning, where multiple submodules collaborate to enhance the overall performance. However, segments of shorter duration may destroy the original morphological characteristics and thus degrade the performance. This is why SAN-15 performs worse than SAN-10. It indicates that the segment length is also an important parameter. In addition, the running time and complexity of the model gradually increase with the number of segments. SAN-5 requires about four times the runtime of SAN-0, and SAN-10 requires about six times the runtime of SAN-0.

B. Effect of the Number of Heads in Multi-Head Attention
We explored the effect of the number of heads on model performance. With the other parameters fixed, we validated models with different numbers of heads on the MASS dataset. As shown in Fig. 6, the number of heads does not have a significant impact on the performance of the model, and the values vary only within a small range. Nevertheless, a relatively good setting can be found as the number of heads changes, which slightly improves model performance. When the number of heads increases to 18, the model performance decreases slightly. In our experiments, we set the number of heads to 6 in SAN in order to accurately assess the impact of the segmentation we are interested in.
C. Hypnogram
Fig. 7 shows the hypnogram output by our proposed method, together with the ground-truth hypnogram and the posterior probability distribution per sleep stage, for one subject of the Sleep-EDFX dataset. The output hypnogram aligns very well with the corresponding ground truth. The model misclassifies sleep stages mostly during sleep stage transitions. This result suggests that transitioning epochs are much harder to classify correctly than non-transitioning ones. The rationale is that transitioning epochs often contain information from two or three sleep stages. Even with segmentation of the EEG signal to extract feature information, it remains difficult to discriminate the sleep stages in transitioning epochs. As a result, several stages are active at the same time, as indicated in the probability distribution in Fig. 7. However, we need to pick one of them as the final discrete output label for the sleep staging task.
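One simple way to turn per-epoch posterior probabilities into a discrete hypnogram is to take the most probable stage, as sketched below. The argmax rule, the stage order in `STAGES`, the `margin` threshold used to flag ambiguous epochs, and the example probabilities are all assumptions for illustration; the text only states that one label must be picked.

```python
import numpy as np

STAGES = ["W", "N1", "N2", "N3", "REM"]   # assumed ordering of the output classes

def posteriors_to_hypnogram(probs):
    """Pick the most probable stage for each epoch."""
    probs = np.asarray(probs)
    return [STAGES[i] for i in probs.argmax(axis=1)]

def ambiguous_epochs(probs, margin=0.2):
    """Flag epochs whose two most probable stages are closer than `margin`."""
    top2 = np.sort(np.asarray(probs), axis=1)[:, -2:]
    return (top2[:, 1] - top2[:, 0]) < margin

probs = np.array([
    [0.90, 0.05, 0.03, 0.01, 0.01],   # clear wake epoch
    [0.45, 0.40, 0.10, 0.03, 0.02],   # transition-like epoch: W and N1 both active
    [0.05, 0.15, 0.75, 0.03, 0.02],   # clear N2 epoch
])
hypnogram = posteriors_to_hypnogram(probs)
flags = ambiguous_epochs(probs)
print(hypnogram, flags)
```

Flagging low-margin epochs makes the transition-related uncertainty visible even after the label has been forced to a single stage.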

D. Compared With State-of-the-Art Approaches
We compared our proposed method with some state-of-the-art approaches. The accuracy, MF1, kappa coefficient, sensitivity, specificity and runtime of these methods were compared on three datasets.
As shown in Table I, compared with state-of-the-art methods, our proposed method obtains the best results on the Sleep-EDFX and HSFU datasets, and is only slightly inferior to SeqSleepNet on the MASS dataset. SAN is inferior to SeqSleepNet on MASS because the input to SeqSleepNet consists of multiple 30 s EEG signals that capture the information of adjacent sleep stages. In our proposed model, the signal of an epoch is divided into different segments, and different segments may contain features of different sleep stages. The proposed network integrates these features to output a decision to which all segments contribute, and thus a fairly robust performance can be obtained. If a certain feature lasts only a short time, the contribution of its segment may be overwritten by other segments. Although the attention mechanism is adopted to mitigate this problem, satisfactory results are still not obtained for the N1 stage. The proposed SAN obtained good results in tests on all three datasets. Although SAN is slightly less effective than SeqSleepNet on the MASS dataset, it is noteworthy that SAN significantly improves the discrimination accuracy of the wake, N2 and N3 stages compared to other approaches, especially on the HSFU clinical dataset. The excellent performance of SAN in these stages makes it potentially useful for the diagnosis and prevention of a number of sleep disorders. It is unlikely that SAN trained directly on MASS, Sleep-EDFX, and HSFU would work well for recordings of sleep disorders, because the structure and features of the data samples are different. However, SAN can be trained on a dataset of sleep disorders, or fine-tuned by transfer learning from the MASS, Sleep-EDFX and HSFU datasets. Moreover, the short running time of SAN provides the basis for relatively fast training and transfer learning on a new dataset.

E. Ablation Study
By comparing the results in Fig. 8, we can draw the following conclusions. First, adding either the SE block or the TSE module alone after the MMCNN leads to some degree of performance degradation. It is difficult for the network to learn the deep features and the connections between these features when using only one of the modules. Second, by combining the residual SE block and the TSE module, the model further improves its performance, and the network becomes more efficient by obtaining deep features and internal associations. The results on the three datasets illustrate the importance of combining these modules.

V. CONCLUSION
In this paper, we proposed a novel architecture called SAN for sleep stage classification from a single EEG channel. We used multiple multiscale CNNs for feature extraction from different segments of the EEG signal, applied a residual squeeze and excitation block to enhance the features, and assigned weights to the features in different regions based on the multi-head attention mechanism. In addition, we added noise to the raw EEG signal for data augmentation to address the class imbalance problem. The method improved the system performance by making decisions based on the features of all segments in an integrated manner. The proposed method performed well on two public datasets and one clinical dataset. We compared it with recent state-of-the-art studies and demonstrated the effectiveness of the algorithm. The results showed that our proposed method is competitive and can obtain better performance on sleep stage classification. In future work, the idea of object detection could be used to clearly locate the features at different locations in the signal segment, thus achieving more accurate identification and facilitating the diagnosis of related sleep disorders.