A Two-Stage Learning Convolutional Neural Network for Sleep Stage Classification Using a Filterbank and Single Feature

Sleep is an essential process that helps the body maintain its health and vitality. The first stage in the diagnosis and treatment of sleep disorders is sleep staging. Because manual sleep staging by a physician is complicated, computer-aided sleep stage classification algorithms are gaining attention. In this study, a novel approach is introduced to extract distinctive representations from a single-channel EEG signal for automatic sleep staging. The standard deviation, as a single feature, is extracted from the frequency subbands of the EEG, giving a comprehensive picture of the signal and its activity within various frequency ranges for the different sleep stages. These features form the input space of the proposed two-stream convolutional neural network (CNN), and a two-stage learning scheme is used to train the model, yielding improvements in accuracy, reliability, and robustness over traditional classifiers and the conventional method of training neural networks. For performance evaluation, three well-known benchmark datasets are used: Sleep EDF, Sleep EDFx, and DREAMS Subjects. Using simple and effective methods, the proposed algorithm improves sleep stage classification results, achieving overall accuracies of 93.48%, 93.14%, and 83.55%, respectively. The introduced framework has great potential for practical implementation in a home-based sleep staging device.

patient should spend one or two nights in the clinic. Some people may feel uncomfortable sleeping in a new place for the first time and may have difficulty falling asleep [4]. On the other hand, manually checking the PSG signals to detect changes in the sleep stage is not an efficient way for a physician to classify sleep stages. Manual sleep staging is a time-consuming and costly task, and the results can be affected by human error. As a result, many studies have been conducted to overcome these problems. These studies introduce algorithms for automatic sleep stage classification, which can acquire a hypnogram from the PSG signals without the need for a sleep expert. Generally, automatic sleep staging methods use single-channel EEG or a combination of EEG, EOG, and EMG signals for classification [5]–[7].
Automatic sleep stage classification algorithms span an extensive area of machine learning [8]. These algorithms can be divided into two groups: classical machine learning methods and deep learning-based methods. The first stage in classical machine learning methods is feature extraction. Various signal processing and feature extraction methods have been developed to extract unique characteristics of the signals for classification. Common features extracted from PSG signals for sleep staging include time-domain, frequency-domain, and nonlinear features [9]–[12]. Feature selection is essential to achieve acceptable recognition of the different sleep stages. In the next step, the extracted features provide the input space for a classifier. Well-known classifiers such as k-nearest neighbors (KNN) [13], decision trees (DT) [14], random forests (RF) [11], and support vector machines (SVM) [15] are commonly used in classical sleep staging algorithms.
With the introduction of novel deep learning frameworks and the development of neural network structures, deep learning methods have gained more attention in sleep staging studies. Neural networks are highly flexible across different data types and outperform traditional classifiers in many scenarios. Different types of neural networks have evolved for sleep stage classification. One of the most popular deep learning models is the recurrent neural network (RNN), which combines the current input with the previous state of the network to determine its current state. RNNs can handle sequential datasets such as sleep data, in which the current sleep stage is related to the previous one. A specific type of RNN called long short-term memory (LSTM) [16] is popular in sleep studies [5], [17]. Convolutional neural networks (CNNs), which have achieved promising results in classifying a variety of data types such as signals, images, and video, are among the most commonly used classifiers [18]–[20]. Consecutive non-linear functions and adaptive convolutional filters in the different layers of a CNN make it possible to design a flexible model that extracts complex features of the input data and makes good predictions.
Zhu et al. [21] proposed an attention-based convolutional neural network in which attention layers assist the convolutional layers in feature extraction and resemble an RNN in detecting sequential information of sleep stages. Chambon et al. [22] introduced a deep learning architecture for temporal classification of sleep stages using multi-channel EEG, EOG, and EMG signals. Their model performs spatial filtering on the different channels before the convolutional layers and uses separate pipelines for the EEG/EOG and EMG signals. In another study, conducted by Cui et al. [23], multi-channel EEG time series are used for sleep staging without extracting hand-crafted features: raw EEG signals are segmented by a method called fine-grained segments and fed to a multi-stream CNN. Li et al. [24] proposed an end-to-end CNN for sleep staging using raw EEG signals. The main idea of their work is to classify three subsequent EEG epochs using a novel CNN architecture based on a squeeze-and-excitation block and a micro-network.
Most deep learning-based studies concentrate on using only raw EEG for classification; therefore, valuable hand-crafted features that have been shown to be beneficial for sleep staging are neglected in these methods. On the other hand, raw EEG is a complex, non-stationary, and non-linear time series, and designing a deep learning structure that can capture its information requires considerable effort. Deep learning frameworks have the potential to perform very well when there is enough data to train the network. However, in the case of sleep staging, because of the difficulty of recording and labeling the data, only a limited number of samples is available. This point therefore needs special attention to prevent inconsistent results with low generalizability [25].
Considering the mentioned problems, we introduce a sleep staging framework that processes the single-channel EEG signal and extracts hand-crafted features. A single feature is extracted from the subbands of the EEG signal, which is decomposed by a filterbank. In addition, a two-stage learning CNN is introduced to classify the extracted features. We aim to demonstrate that, in the case of sleep stage classification, a simple solution can distinguish the main characteristics of the EEG for the different sleep stages. Furthermore, training the CNN in two stages helps to achieve consistent and reproducible results with high accuracy, even on small databases such as those available for sleep staging. The outline of the proposed method is shown in Fig. 1. The key contributions of the current work are as follows.
1) A novel framework is introduced for extracting hand-crafted features from the EEG signal using a filterbank and a single statistical feature. This simple and straightforward approach extracts distinctive representations of the EEG signal and is designed to obtain only useful information from the EEG with minimum redundancy and complication.
2) A new method is proposed for training the CNN in two stages that improves the performance of the CNN in terms of consistency, reproducibility, reliability, and accuracy without adding extra computational complexity to the problem.
3) The proposed algorithm is evaluated using three benchmark datasets for sleep staging: Sleep EDF, Sleep EDFx, and DREAMS Subjects.
The remainder of the study is organized as follows. In Section II, the benchmark datasets used for model evaluation are introduced and the proposed method is described in detail. In Section III, various experiments are designed to evaluate the performance of the algorithm in different situations. In Section IV, the proposed algorithm is discussed from different points of view. Section V draws conclusions on the current work.

A. DATASETS
1) Sleep-EDF
Sleep-EDF is a publicly available dataset accessible in the PhysioNet database [26], [27]. It is the most commonly used dataset for the evaluation of sleep staging algorithms. The dataset comprises two different recording scenarios: sleep telemetry (ST) and sleep cassette (SC). In the SC recordings, the physiological signals of 4 subjects were recorded during their daily lives for 24 hours using a portable recording device. The ST recordings were made in the hospital during the night and include the physiological signals of 4 healthy subjects who had mild difficulty falling asleep. The subjects were healthy males and females aged between 21 and 36. PSG signals including EEG, EOG, and EMG were recorded continuously during sleep, and a sleep specialist assigned a sleep stage to each 30-second epoch of the signals according to the R & K rule [28]. The acquired hypnogram contains the Wake, REM, S1, S2, S3, and S4 sleep stages as well as movement time and unscored epochs, which should be trimmed. In this study, the EEG signal from the Fpz-Cz channel with a sampling frequency of 100 Hz is used for classification. Furthermore, the S3 and S4 sleep stages are merged into N3 according to the AASM rule, and the epochs related to movement time and unscored periods are removed from the hypnogram and the EEG signal.

2) Sleep EDFx
Sleep EDF expanded (EDFx) is the expanded version of the Sleep-EDF dataset, with more participating subjects. The same recording scenarios as in Sleep-EDF were used to acquire the ST and SC recordings. However, in Sleep EDFx the ST recordings were obtained to examine the effects of temazepam on sleep in 22 healthy male and female Caucasians, and the SC recordings were obtained to examine the effects of age on sleep in healthy Caucasians aged between 25 and 101.

3) DREAMS Subject Database
DREAMS Subject database is another publicly available benchmark dataset for the assessment of sleep staging algorithms. It consists of overnight polysomnographic signals of 20 healthy subjects (4 male and 16 female) aged between 20 and 65. The dataset was collected at the Sleep Laboratory of the André Vésale hospital in Belgium. Two EOG channels, three EEG channels (CZ-A1 or C3-A1, FP1-A1 and O1-A1) and one submental EMG channel were recorded from each subject at a sampling frequency of 200 Hz. Each 30-second epoch of signals was labeled according to the AASM rule as Wake, N1, N2, N3, REM, and unknown sleep stage.
Detailed information about the datasets is mentioned in Table 1.

B. FEATURE EXTRACTION
In the manual sleep staging methods, the frequency activity of the EEG signal is the most powerful indicator for sleep specialists to recognize sleep stages. According to the AASM rule, two important factors are described for sleep experts to help them distinguish different sleep stages: the frequency content and the amplitude of the EEG signal. An EEG signal can be decomposed into different waves such that each of these waves is active in a specific frequency range. Table 2 shows the frequency range of the different waves and their presence in the different sleep stages [29]–[32].
Various studies have tried to reveal the relationship between the frequency content of EEG signals and the sleep stages, utilizing frequency-domain, time-domain, and non-linear features to detect the different stages. However, due to the large number of involved parameters and the complexity of these algorithms, it is not possible to isolate the effectiveness of frequency-domain features in recognizing the different sleep stages. On the other hand, most frequency-domain features are based on frequency bands such as delta, theta, and alpha, which do not provide frequency information with acceptable resolution. From Table 2, we can see that the same frequency bands are present in different sleep stages; therefore, band powers are not the best features for distinguishing them. Due to the mentioned problems, we propose a framework to extract features from the single-channel EEG signal that provide distinctive frequency and amplitude information for the different sleep stages.

1) Preprocessing
To generalize the method to different datasets, a baseline preprocessing procedure is applied. First, all signals are downsampled to 100 Hz using a polyphase filter, because only the 0 to 50 Hz frequency content is needed for further processing. Then, the recorded overnight EEG of each subject is centered and normalized by its standard deviation. This helps to achieve a good representation of the signals in the feature extraction steps; normalized inputs also allow a convolutional neural network (CNN) to train faster.
According to the AASM rule, each 30-second epoch of EEG signal should be associated with a specific sleep stage. Furthermore, in the AASM rule in many cases, to recognize each epoch of the EEG signal, the first half and second half of the epoch can be inspected for the presence of specific wave and frequency activity. Therefore, in the proposed method, 30-second epochs are divided into two 15-second subepochs for further processing. As a result, 15-second subepochs of EEG with 1500 data points are prepared for filtering and feature extraction.
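The preprocessing steps above (polyphase downsampling, full-night normalization, and splitting each 30-second epoch into two 15-second subepochs) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `preprocess` and the exact resampling call are our assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(eeg, fs):
    """Hypothetical helper: downsample a full-night EEG to 100 Hz,
    z-score normalize it, and split into two 15 s subepochs per epoch."""
    if fs != 100:
        eeg = resample_poly(eeg, up=100, down=fs)   # polyphase resampling
    eeg = (eeg - eeg.mean()) / eeg.std()            # center and scale the whole night
    n_epochs = len(eeg) // 3000                     # 30 s * 100 Hz = 3000 samples
    epochs = eeg[: n_epochs * 3000].reshape(n_epochs, 3000)
    return epochs.reshape(n_epochs, 2, 1500)        # two 15 s halves per epoch
```

Each returned subepoch of 1500 points is then ready for the filterbank described next.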

2) Filterbank
The first step is the decomposition of the input signal into frequency subbands. The proposed filterbank consists of 50 non-overlapping bandpass filters, each with a 1 Hz bandwidth, spanning the 0 to 50 Hz range. The filters are third-order infinite impulse response (IIR) Butterworth filters. Choosing a higher order for the bandpass filters would produce sharper frequency boundaries; however, it may cause unstable results and fluctuating signals at the output. In the case of sleep staging, selecting a lower order for the filterbank and using smoother bandpass frequency boundaries showed meaningful improvements in the results. Afterward, each subepoch is passed through the filterbank separately, producing 50 EEG subbands as output. In this filtering setup, an EEG signal with higher activity in a specific frequency range will have active subbands at the corresponding frequencies after filtering, while the other subbands are suppressed.
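A minimal sketch of such a filterbank using SciPy's Butterworth design is shown below. The handling of the band edges is our assumption: the lowest band cannot start at exactly 0 Hz and the highest must stay just below the Nyquist frequency, so both are nudged slightly.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def make_filterbank(fs=100, order=3):
    """50 third-order Butterworth bandpass filters covering 0-50 Hz in 1 Hz bands.
    Edge handling (0.1 Hz floor, just-below-Nyquist cap) is an assumption."""
    bank = []
    for lo in range(50):
        low = max(lo, 0.1)                  # a 0 Hz lower edge is not realizable
        high = min(lo + 1, fs / 2 - 0.01)   # keep below the Nyquist frequency
        bank.append(butter(order, [low, high], btype="bandpass", fs=fs, output="sos"))
    return bank

def apply_filterbank(subepoch, bank):
    """Return a (50, N) array: one filtered version of the subepoch per subband."""
    return np.stack([sosfiltfilt(sos, subepoch) for sos in bank])
```

For example, a 10.5 Hz sinusoid passed through this bank produces its strongest output in the 10-11 Hz subband, with the neighboring subbands attenuated.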

3) Standard Deviation
In the feature extraction step, a common approach in signal classification studies is to use a combination of features from various domains, such as time-domain, frequency-domain, and non-linear features. These methods have obtained acceptable results in sleep staging; however, such features are not specifically designed to distinguish the different sleep stages. In the proposed method, we use the standard deviation as a single feature to characterize the EEG subbands. The standard deviation is a measure of the dispersion of data around its mean, which can be calculated as

SD = sqrt( (1/N) * sum_{i=1}^{N} (x_i - x̄)^2 ),

where x is the input signal, N is the number of data points, and x̄ is the mean of the input signal.
The standard deviation is calculated for each subband of the first and second subepochs. As a result, there are 49 features, corresponding to the 1 Hz to 49 Hz frequencies, for each 15-second subepoch. Due to power line interference, the standard deviation of the 50 Hz frequency subband is removed from the features. Fig. 2 shows the extracted features for one epoch of the EEG signal.
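The feature extraction step then reduces each filtered subepoch to a short vector of standard deviations; a sketch, assuming the (50, N) layout produced by the filterbank above and dropping the last (power-line) subband:

```python
import numpy as np

def extract_features(subband_signals):
    """subband_signals: (50, N) array of one filtered subepoch.
    Returns the 49 per-subband standard deviations, with the
    power-line subband discarded (the drop index is an assumption)."""
    stds = subband_signals.std(axis=1)   # dispersion of each subband
    return stds[:-1]                     # remove the power-line-contaminated band
```

Concatenating the vectors of the two subepochs yields the 2 x 49 input that feeds the two-stream CNN.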

C. TWO-STREAM CNN AND TWO-STAGE LEARNING
A convolutional neural network (CNN) as a feature extractor and classifier achieved outstanding results on different data types and different tasks. Convolutional layers as the feature extractor are able to capture the low-level and high-level structures of input data in shallow layers and deep layers of the network, respectively. On the other hand, the high number of hyperparameters of the neural networks adds more flexibility to these models in comparison to the conventional classifiers.

1) Network Architecture
The input of the CNN can be interpreted as a signal with simple and meaningful structures in it. By considering the input as a whole, each sleep stage has a unique input shape in comparison to the other sleep stages which can be recognized by convolutional layers.
As shown in Fig. 3, the proposed CNN architecture comprises two streams, each taking the data points of one subepoch as input. This structure extends the capability of the network for representation learning, since the first and second halves of each epoch have separate convolutional layers. Each stream includes a set of convolutional, batch normalization, and max pooling layers. In each stream, a batch normalization layer is placed after the first convolutional layer, which makes the model more stable during training and accelerates the learning process. Each convolutional layer and fully connected layer is followed by a dropout layer, with rates of 0.3 and 0.5 respectively, to prevent the model from overfitting. After another two convolutional layers, the activations are concatenated and fed to two consecutive fully connected layers with 200 neurons each. The order of the layers and the other hyperparameters of the model, including the number of neurons, kernel size, optimizer, and learning rate, are tuned based on experimental investigation.
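A PyTorch sketch of a two-stream architecture in this spirit is given below. The channel counts and kernel sizes are illustrative guesses (the paper tunes them experimentally); only the overall layout follows the description: three convolutional layers per stream, batch normalization after the first, dropout of 0.3 after convolutions and 0.5 after the two 200-neuron fully connected layers.

```python
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """One convolutional stream; filter counts and kernel widths are assumptions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.BatchNorm1d(16),
            nn.ReLU(), nn.MaxPool1d(2), nn.Dropout(0.3),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2), nn.Dropout(0.3),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(), nn.Dropout(0.3),
        )

    def forward(self, x):                      # x: (batch, 1, 49)
        return self.net(x).flatten(1)          # flattened activations

class TwoStreamCNN(nn.Module):
    def __init__(self, n_features=49, n_classes=5):
        super().__init__()
        self.s1, self.s2 = StreamEncoder(), StreamEncoder()
        flat = 32 * (n_features // 4)          # length after two /2 poolings
        self.fc = nn.Sequential(
            nn.Linear(2 * flat, 200), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(200, 200), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(200, n_classes),
        )

    def forward(self, x1, x2):                 # one 49-feature subepoch per stream
        return self.fc(torch.cat([self.s1(x1), self.s2(x2)], dim=1))
```

Each stream sees the 49-feature vector of one subepoch, and the concatenated activations are classified by the fully connected head.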

2) Two-Stage Learning
Convolutional neural networks learn representations of the input data using convolutional layers and make predictions using fully connected layers. Each of these sets of layers includes a large number of trainable parameters; therefore, a large amount of training data is needed to achieve satisfactory results [33]. In the case of sleep stage classification, large datasets are not available, which leads to inconsistent results when classifying sleep stages with a CNN. Additionally, convolutional layers and fully connected layers pursue different goals in the learning process: feature extraction and classification, respectively. However, every parameter of the different layers is updated at the same time in each epoch, regardless of the layer type, which can slow the pace of learning and cause divergence in the learning process.
To overcome the mentioned problems, in our last work [34] we proposed a method for training a two-stream CNN such that each stream of convolutional layers was trained and frozen in the separate sub-networks, and transferred to the main model.
Each layer of a feedforward CNN consists of a number of neurons with learning capability. Each neuron has weights and a bias that hold the learned information during the training process. For the training, an optimizer computes the gradients of the loss function. Then, through the backpropagation method, gradients will be used to update weights and biases for all of the neurons from the last layer toward the first layer. The described process is the conventional way of training CNNs. In this process, all of the parameters of the model will be updated in each iteration of the training.
The proposed framework for training takes place in two stages. The hyperparameters of the model are manually tuned at first and are not changed during the training. The model includes 6 convolutional layers and 3 fully connected (FC) layers with a total of 390461 trainable parameters. Each neuron in a layer of a CNN is connected to all of the neurons in the previous layer by a trainable parameter. This parameter can be updated in the training to pass useful information to the next layer.
The Adam optimizer [35] with a learning rate of 0.0003 is used for training. In the first stage of the two-stage learning, the CNN is trained in the conventional way, and all of the model parameters are updated for 100 epochs. At this stage, the convolutional layers adapt themselves to the input data and learn its representations to extract distinctive information. The outputs of the convolutional layers from the two streams are concatenated as activations for the FC layers. Meanwhile, the FC layers learn to recognize these activations as the respective sleep stages. At the end of the first stage of training, the model has not overfitted and shows acceptable performance in classifying sleep stages. In the second stage, the parameters of the convolutional layers are frozen, so their values do not change during training. The training data are fed to the model for another 100 epochs so that the FC layers continue learning. In the second stage, the data representations do not change during training, which gives the model a chance to train the FC layers at their full capacity for classification.
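The two-stage schedule can be sketched as follows in PyTorch. This is an outline of the training loop only: the model is assumed to expose `s1`/`s2` (the convolutional streams) and `fc` (the fully connected head), and the loader yields `(subepoch1, subepoch2, label)` batches; both conventions are ours, not the authors'.

```python
import torch

def two_stage_train(model, loader, loss_fn, epochs_per_stage=100, lr=3e-4):
    """Sketch of two-stage learning: stage 1 trains all parameters,
    stage 2 freezes the convolutional streams and trains only the FC head."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs_per_stage):           # stage 1: everything updates
        for x1, x2, y in loader:
            opt.zero_grad()
            loss_fn(model(x1, x2), y).backward()
            opt.step()
    for p in list(model.s1.parameters()) + list(model.s2.parameters()):
        p.requires_grad = False                 # stage 2: freeze representations
    opt = torch.optim.Adam(model.fc.parameters(), lr=lr)
    for _ in range(epochs_per_stage):           # stage 2: only the FC head learns
        for x1, x2, y in loader:
            opt.zero_grad()
            loss_fn(model(x1, x2), y).backward()
            opt.step()
    return model
```

Because the frozen streams produce fixed representations in stage 2, the FC layers optimize against a stationary input distribution, which is the stabilizing effect the paper describes.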
In this approach, the training of the convolutional layers is stopped before the end of the training. After 100 epochs, only the weights of the FC layers need to be updated, resulting in a faster training process for the model.

III. EXPERIMENTS
In this section, the performance of the proposed method is investigated from different points of view. Different experiments are designed to verify that the algorithm performs well in different situations and on different datasets. In the following, the important criteria for the performance assessment of sleep staging algorithms are introduced.
Precision = TP / (TP + FP)    (4)

where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively.
In addition to accuracy as a performance measure, Cohen's kappa [36] is used to calculate the agreement between the expert-scored and algorithm-scored sleep stages. Cohen's kappa is generally a more robust measure than accuracy because it takes into account the possibility of agreement between raters occurring by chance.
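The chance correction can be seen directly in a from-scratch implementation: observed agreement is the accuracy, and expected agreement is computed from the marginal label frequencies of both raters.

```python
import numpy as np

def cohens_kappa(y_true, y_pred, n_classes):
    """Agreement between expert and algorithm labels, corrected for chance."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    po = np.trace(cm) / n                        # observed agreement (= accuracy)
    pe = (cm.sum(0) * cm.sum(1)).sum() / n**2    # agreement expected by chance
    return (po - pe) / (1 - pe)
```

A classifier that always predicts the majority class can still score high accuracy on an imbalanced dataset, but its kappa collapses toward zero, which is why kappa is the more informative summary here.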

A. EXPERIMENT 1: THE LEARNING CURVE OF THE PROPOSED CNN
This experiment is designed to evaluate the proposed CNN architecture and the two-stage learning process based on the accuracy and loss of the model during training. Two scenarios are considered: in the first, the CNN is trained for 200 epochs in the traditional way; in the second, the proposed 2-stage training method is implemented. The test loss and test accuracy in both scenarios are used for comparison. The train and test samples are identical in both scenarios, with the test and train data comprising 20% and 80% of the whole dataset, respectively.
The results are shown in Fig. 4, where each row represents a specific dataset: DREAMS, Sleep EDF, and Sleep EDFx. As can be seen, in 2-stage learning a significant improvement occurs after epoch 100. In comparison to 1-stage learning, the fluctuations in the accuracy curves decrease and the loss finely converges to its optimum value. For a quantitative comparison, the mean, standard deviation, minimum, and maximum of the accuracy and loss curves over epochs 100 to 200 are given in Table 3. In 2-stage learning, the results are more consistent, reliable, and reproducible across different runs. In addition, the results show an improvement in the accuracy values.

B. EXPERIMENT 2: PERFORMANCE COMPARISON OF THE PROPOSED METHOD AND TRADITIONAL CLASSIFIERS
This experiment is designed to illustrate the effectiveness of the proposed CNN in comparison to classical approaches to feature classification. Four cases are considered, and the results are presented in Table 4. Each value represents the sensitivity for a specific class using a specific classifier in a particular case. The proposed algorithm has the highest sensitivities for the different classes by a significant margin. As we can see, the RF and DT classifiers achieve acceptable results in comparison to the other methods, and even higher sensitivities than the proposed method in recognizing the Wake stage. However, the Sleep EDF dataset used in this experiment is highly imbalanced, with many epochs in the Wake stage and few epochs in the N1 stage. Therefore, classifiers with a high recognition rate for the Wake stage relative to the other stages are biased towards the Wake stage. On the other hand, the low number of N1 training epochs leaves LR, SVM, and GNB completely unable to recognize the N1 sleep stage.

C. EXPERIMENT 3: PERFORMANCE COMPARISON AGAINST BASELINE WORKS
In this section, a comprehensive evaluation of the proposed method is conducted based on different criteria using various datasets. The performance of the algorithm is compared to state-of-the-art and baseline methods for automatic sleep stage classification. Each dataset is randomly split into train and test sets, in which the test and train samples comprise 20% and 80% of the total dataset, respectively. For a small dataset such as Sleep EDF, the selection of samples for the test and train sets can affect the final result. Therefore, to provide a fair comparison between the proposed method and other methods, the reported result for the Sleep EDF dataset is the average of five different runs. Table 5 is the confusion matrix of the algorithm on the Sleep EDF dataset. Each row of the confusion matrix represents the epochs classified by the sleep expert for a specific stage, and each column contains the epochs classified by the algorithm. The diagonal elements of the confusion matrix are where the expert's and algorithm's opinions meet, and they determine the sensitivity for a specific sleep stage. Per-stage specificity, F1 score, and precision are also given in the bottom rows of the table. These measures help provide a comprehensive and robust evaluation of the proposed algorithm. As can be seen from Table 5, the algorithm succeeds in recognizing the sleep stages. Table 6 and Table 7 present the confusion matrices and detailed descriptions of the results obtained on the Sleep EDFx and DREAMS Subjects datasets. The sensitivity values for the DREAMS Subjects dataset are lower than those for the Sleep EDF and Sleep EDFx datasets. The reasons are the differences in the data acquisition system, EEG channel, and distribution of data samples over the sleep stages.
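The per-stage measures reported in these tables can all be derived from the confusion matrix; a sketch, assuming the same row/column convention as the paper (rows are expert labels, columns are algorithm predictions):

```python
import numpy as np

def per_stage_metrics(cm):
    """Sensitivity, precision, and F1 score per sleep stage from a
    confusion matrix (rows: expert labels, columns: predictions)."""
    tp = np.diag(cm).astype(float)
    sens = tp / cm.sum(axis=1)       # recall along each expert row
    prec = tp / cm.sum(axis=0)       # precision along each predicted column
    f1 = 2 * sens * prec / (sens + prec)
    return sens, prec, f1
```

For instance, with a toy 2-stage matrix [[8, 2], [1, 9]], the sensitivities are 0.8 and 0.9 while the precisions are 8/9 and 9/11, illustrating how row and column normalizations answer different questions.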
However, the overall performance of the method is similar across the three benchmark datasets, which shows the consistency of the proposed method in the face of different datasets. The final step in the assessment is to find where the proposed algorithm stands among sleep staging algorithms. Baseline and state-of-the-art studies in the single-channel sleep stage classification area are compared based on accuracy and Cohen's kappa in Table 8. The comparison is based on the subject non-independent paradigm, in which data samples from each subject can be used for both training and testing of the model. As can be seen, the proposed algorithm outperformed the other works in classifying the Sleep EDF, Sleep EDFx, and DREAMS Subjects datasets, with accuracy and kappa pairs of (93.48, 0.898), (93.14, 0.890), and (83.50, 0.775), respectively. We can conclude from the obtained results that the algorithm is capable of performing on different datasets and has good generalizability.

IV. DISCUSSION AND FUTURE WORK
In this study, a novel algorithm is proposed for sleep staging that includes a simple and effective framework for feature extraction and a 2-stage learning strategy for classification. The algorithm achieves robust and reliable results and improves sleep staging performance in comparison to the baseline methods. We now discuss the important details of the automatic sleep staging problem and the improvements this study contributes. Finding the parameters of a CNN can be interpreted as an optimization problem in which each parameter update decreases the loss, increases the accuracy, and moves the model towards the optimal point. However, the network must be tested on data different from the training data, and it is preferable that the model does not learn the training data perfectly, to prevent overfitting. Therefore, regularization methods such as dropout layers can be used to prevent the model from overfitting.
Regularization methods, by adding some error to the problem, can improve results on test data but escalate the overshoot around the optimal point, which can be seen in the 1-stage learning loss curves of Fig. 4. In this setup, the results are not reproducible and are less reliable. One possible solution is to use a lower learning rate for the optimizer, which can solve the problem to some extent but prolongs the training process; moreover, finding the best learning rate has its own complications. The proposed 2-stage learning is an alternative solution that overcomes the mentioned problems and achieves improvements in the consistency and accuracy of the model without any additional cost. Furthermore, in the second stage of 2-stage learning the trainable parameters of the convolutional layers are frozen, hence the computational cost and overall training time of the model are reduced.
The CNN structure was designed based on experimental investigation. Searching through different CNN architectures to find the best-performing model for a specific application can be a cumbersome task. In our study, however, we used a simple structure with good results and improved the model's performance through two-stage learning. For choosing the best number of CNN streams and EEG subepochs, 4 cases are considered, from a 1-stream to a 4-stream CNN. Fig. 5 shows the accuracies of the different cases for the last 50 epochs of training; the 2-stream CNN performs better than the other networks. Despite the simple nature of the hand-crafted features in this study, they are capable of reflecting distinctive characteristics of the EEG signal for the various sleep stages and improving sleep stage classification results. In the following, we discuss the information provided by the features in detail. Fig. 6 illustrates the extracted features of an EEG subepoch, which provide useful information about each sleep stage and its characteristics. The vertical axis is the logarithm of the calculated standard deviation of each EEG subband, each related to a 1 Hz frequency range. Each sleep stage curve was obtained by averaging the feature vectors of all epochs scored as that sleep stage. This plot can be interpreted as an indicator of the frequency activity of each sleep stage in the various frequency ranges. In the Wake stage, theta waves are compressed and alpha waves have dominant activity. The broadband activity of the Wake stage stands out in comparison to the other sleep stages, and the unique shape of the Wake curve makes it easy for the CNN to recognize.
However, as reported in the confusion matrices in Table 5, Table 6, and Table 7, stage N1 has the lowest recognition rate among the sleep stages and can be falsely classified as other stages. From the confusion matrices, N3 epochs are falsely classified as N2 with rates of 10.5%, 13.4%, and 18.7% for the Sleep EDF, Sleep EDFx, and DREAMS datasets, respectively. This is a well-known problem in sleep staging that arises from the presence of sleep spindles in both the N2 and N3 stages. As can be seen from Fig. 6, the N2 and N3 curves overlap in the 12 to 16 Hz frequency range, where sleep spindles are active. However, the N3 sleep stage has dominant frequency activity in the delta band, which makes it easy for the classifier to recognize.
The amplitude and frequency of the EEG signal are two important characteristics that change over time and determine the nonstationarity of the EEG; they do not depend on a specific application such as sleep staging. The proposed approach utilizes these two characteristics for further processing. Therefore, the hand-crafted feature extraction method is not designed for a specific task such as sleep staging but rather to obtain a detailed understanding of the EEG. Hence, it can be used in other EEG-based tasks such as brain-computer interfaces (BCI) [55], emotion recognition [56] and seizure detection [57]. On the other hand, 2-stage learning offers a novel point of view on training CNNs that helps the model learn faster and achieve robust and reliable results without any additional complications, and it can be applied to any CNN for various applications.
The current study is subject to some limitations that we need to highlight, together with future directions to address them. Recognition of the N1 sleep stage is a challenging problem in sleep staging studies due to its small number of epochs. Furthermore, N1 is the transition between wakefulness and sleep, which makes it hard to recognize. Our study was unable to fully recognize the N1 stage; novel feature extraction and data augmentation techniques specifically designed for the N1 classification problem are needed. The datasets used in this study include healthy subjects with minor sleep problems such as difficulty falling asleep. A study could be conducted to test the generalizability of the model on subjects with sleep disorders such as obstructive sleep apnea (OSA). Furthermore, the proposed method could be used to recognize different sleep problems or to classify a sleep disorder by its severity. OSA is one of the most common sleep disorders and can be classified into three levels: mild, moderate, and severe [58]. Such a classification can help sleep experts in the diagnosis and treatment of the patient's problem.
Considering the transitions between sleep stages and the sequential nature of the sleep cycle, future work could add a long short-term memory (LSTM) network to the proposed model to enhance its performance in the subject-independent paradigm. The goal of using a single-channel signal for sleep staging is to obtain a method that can be implemented on a device and easily used by an individual at home without any supervision. In a sleep clinic, however, there is no such limitation, and more signals can be used for sleep staging. In addition, extra signals add useful information to the problem and will improve the performance of the method. Therefore, a future study could investigate computer-aided sleep staging using multichannel PSG signals to surpass the expert-scored gold standard for sleep staging.

V. CONCLUSION
This paper presents a computer-aided sleep staging algorithm using a single-channel EEG signal. The main idea is to extract the most informative features from the EEG signal and classify them in a reliable framework with minimal complexity. A novel feature extraction framework was introduced to obtain useful representations of the sleep EEG: the EEG signal was decomposed into its frequency subbands by a filterbank, and the standard deviation was calculated as a single feature for each subband. These features provide distinctive characteristics of the EEG signal for different sleep stages. A 2-stream CNN was designed for the classification, and a 2-stage learning method was used to train the model. Using 2-stage learning improved the model's performance in terms of accuracy, robustness, and reproducibility. The performance of the method was evaluated from different aspects on three benchmark datasets, and it outperformed state-of-the-art sleep staging methods. Moreover, the introduced classification framework outperformed conventional classifiers, and 2-stage learning showed great improvement over traditional CNN training. The proposed method has potential for practical implementation, and the need for only one channel of EEG makes it usable on carry-on devices at home without supervision.