Deep learning enables accurate automatic sleep staging based on ambulatory forehead EEG

We have previously developed an ambulatory electrode set (AES) for the measurement of electroencephalography (EEG), electrooculography (EOG), and electromyography (EMG). The AES has been proven to be suitable for manual sleep staging and self-application in in-home polysomnography (PSG). To further facilitate the diagnostics of various sleep disorders, this study aimed to utilize a deep learning-based automated sleep staging approach for EEG signals acquired with the AES. The present neural network architecture comprises a combination of convolutional and recurrent neural networks previously shown to achieve excellent sleep scoring accuracy with a single standard EEG channel (F4-M1). In this study, the model was re-trained and tested with 135 EEG signals recorded with AES. The recordings were conducted for subjects suspected of sleep apnea or sleep bruxism. The performance of the deep learning model was evaluated with 10-fold cross-validation using manual scoring of the AES signals as a reference. The accuracy of the neural network sleep staging was 79.7 % (κ=0.729) for five sleep stages (W, N1, N2, N3, and R), 84.1 % (κ=0.773) for four sleep stages (W, light sleep, deep sleep, R), and 89.1 % (κ=0.801) for three sleep stages (W, NREM, R). The utilized neural network was able to accurately determine sleep stages based on EEG channels measured with the AES. The accuracy is comparable to the inter-scorer agreement of standard EEG scorings between international sleep centers. The automatic AES-based sleep staging could potentially improve the availability of PSG studies by facilitating the arrangement of self-administrated in-home PSGs.


I. INTRODUCTION
Current guidelines defined by the American Academy of Sleep Medicine (AASM) divide sleep into five stages: wake (W), N1, N2, N3, and rapid eye movement (R) [1]. The stages are identified based on electroencephalography (EEG), electrooculography (EOG), and chin electromyography (EMG) signals recorded during polysomnography (PSG). PSG is usually conducted in a sleep laboratory (type I PSG), as the standard 10-20 system EEG electrodes (Fig. 1c) require pre-application by a trained expert and standard type II PSG EEG electrodes are too complex to be fully self-administrated in a home environment.
To overcome this shortcoming of the PSG, various types of headbands and electrode sets have been developed for EEG measurement [2]-[4]. Moreover, we have previously introduced an ambulatory electrode set (AES) enabling the recording of the EEG, EOG, and EMG during sleep [5]. The Ag/AgCl electrodes in the AES are screen-printed on a flexible polyethylene terephthalate (PET) film that attaches easily to the skin with a self-adhesive hydrogel membrane and medical foam. The design of the AES and illustrated instructions help to achieve consistent placement of the electrodes. The AES has 1.5 mm touch-proof safety socket connectors suitable to be used with most modern EEG amplifiers. The EEG and EOG electrodes are located around the face, near the hairline, as presented in Fig. 1b. The EEG measured with the AES has been shown to be suitable for manual sleep staging [6], and the success rate and technical quality of the EEG, EOG, and EMG channels in self-applied AES recordings have been shown to be comparable to conventional type II in-home PSGs [5], [7].
In addition to the complex measurement setup used in a standard type I PSG, the sleep staging process requires time-consuming manual analysis of the recordings by experienced sleep technicians. However, recent advances in deep learning have brought a surge of automated sleep staging applications [8]-[11]. These deep learning applications have achieved excellent sleep staging accuracies and compare well with the inter-scorer agreement between experienced manual scorers from international sleep centers [12]-[14].
Recently, we introduced a deep learning model for the automatic identification of sleep stages [15]. The model achieved an accuracy of 82.9 % (κ = 0.77) by utilizing only a single EEG channel (F4-M1) extracted from type I PSGs. To further enhance the diagnostics of sleep disorders and facilitate the arrangement of type II PSG studies, we aim to utilize a similar model [15] for sleep staging based on signals acquired from AES recordings. Building on the recent results on deep learning-based automatic scoring [15] and the feasibility of the AES for manual sleep scoring [6], we hypothesize that deep learning can be utilized for accurate automatic AES-based sleep staging. Furthermore, electrode malfunctions and other EEG artifacts appear as high amplitude variations in the signals. Thus, we present a novel method for eliminating electrode-originated noise in the neural network input. We hypothesize that the automatic scoring accuracy can be increased by using the variance of the concurrent EEG epochs as a simple metric for selecting the input for the deep learning model. In addition, we investigate which AES channel derivations provide the most accurate results to advance the development of the AES. Successful implementation of automatic AES-based sleep staging could potentially enhance the clinical usability of the AES and improve the availability of PSG studies by facilitating the measurement of self-administrated type II PSG recordings.

II. METHODS
A. PATIENTS AND DATA
Related to the development and clinical validation of the AES, we have tested it in several small cohorts, some of which are currently under analysis and some already published [5], [7]. Three separate datasets were utilized in this study. In all datasets, the AES signals were recorded with a Nox A1 PSG monitor (Nox Medical, Reykjavík, Iceland). The AES data comprised four EEG (Af8-T9, Fp2-T9, Fp1-T10, and Af7-T10), two EOG (F8-T9 and F7-T10), and four EMG (S1-SF, S2-SF, MassL, MassR) channels (Fig. 1a). EEG, EOG, and chin EMG channels are used for manual scoring of sleep stages, whereas MassL and MassR are for the identification of sleep bruxism episodes. In addition, each electrode of the AES was also recorded against the common ground to enable the acquisition of non-standard channel derivations. The sleep staging was initially performed manually by four experienced scorers according to the latest AASM guidelines [1] based on EEG, EOG, and EMG signals recorded with the AES. Two individual scorers took part in the scoring of the bruxism dataset in Finland, sleep apnea dataset 1 was scored by one scorer in Iceland, and sleep apnea dataset 2 was scored by one scorer in Australia.

B. NEURAL NETWORK ARCHITECTURE
We developed a deep learning model in a previous study, which achieved a sleep staging accuracy of 82.9 % (κ = 0.77) based on a single standard EEG channel (F4-M1) using a dataset comprising 891 PSG recordings [15]. The same neural network architecture was used without modifications in the present study. The architecture comprises a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN) (Fig. 2). The CNN is used for epoch-wise feature extraction and the RNN for the learning of the sequential characteristics of the sleep stages. The input for the neural network is a sequence of a hundred consecutive 30 s EEG epochs and the output is a sequence of the corresponding sleep stages. The sequence length was set to one hundred as a compromise between memory preservation, data augmentation, and sufficiently long sequences for capturing sleep cycles.
The CNN comprised three convolutional blocks with each including two convolutional layers and a pooling layer. The first two blocks end with a max pooling layer with a size of 2 and the last block ends with a global average pooling layer. In the first block, the number of kernels was 128 and the kernel size was 21 data points. The first convolutional layer had a stride size of 5 to reduce the amount of data, and all later convolutional layers had a stride size of 1. In the latter two blocks, the number of kernels was 256, and the kernel size was 5. Each convolutional layer was followed by a batch normalization layer. After the convolutional neural network, a Gaussian dropout was applied. The data was then fed into a bidirectional LSTM neural network. The LSTM had 256 units, a dropout value of 0.3 and a recurrent dropout value of 0.5. The last layer of the neural network was a fully connected dense layer with softmax activation, which can be interpreted as a probability for each sleep stage.
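To make the layer dimensions above concrete, the following minimal sketch traces how the length of one 30 s epoch (1920 samples at 64 Hz) changes through the three convolutional blocks. It is only an illustration: the padding mode is not stated in the text, so unpadded ("valid") convolutions and a pooling stride equal to the pool size are assumed, and the function names `conv_out` and `trace_cnn` are hypothetical.

```python
# Hypothetical shape trace of the CNN feature extractor for one 30 s epoch.
# Assumptions (not stated in the text): no padding, pooling stride = pool size.

def conv_out(length, kernel, stride=1):
    """Output length of an unpadded 1-D convolution or pooling layer."""
    return (length - kernel) // stride + 1


def trace_cnn(n_samples=30 * 64):
    """Return a list of (layer name, output length, channels)."""
    layers = []
    length = n_samples
    # (number of kernels, kernel size, stride of the first conv layer)
    blocks = [(128, 21, 5), (256, 5, 1), (256, 5, 1)]
    for i, (filters, kernel, first_stride) in enumerate(blocks, start=1):
        length = conv_out(length, kernel, first_stride)
        layers.append((f"block{i}_conv1", length, filters))
        length = conv_out(length, kernel, 1)
        layers.append((f"block{i}_conv2", length, filters))
        if i < len(blocks):
            length = conv_out(length, 2, 2)  # max pooling with a size of 2
            layers.append((f"block{i}_maxpool", length, filters))
        else:
            length = 1  # global average pooling collapses the time axis
            layers.append((f"block{i}_global_avg_pool", length, filters))
    return layers


for name, length, channels in trace_cnn():
    print(f"{name:24s} length={length:4d} channels={channels}")
```

Under these assumptions, each epoch is reduced to a single 256-element feature vector, and a sequence of 100 such vectors would form the input of the bidirectional LSTM.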

C. DATA PREPARATION
The EEG and EMG signals were zero-phase filtered with a Butterworth bandpass filter (0.3-32 Hz) and downsampled from 200 Hz to 64 Hz, as the deep learning model was originally developed for 64 Hz signals [15]. To unify the scale of the signals, each signal was z-score normalized. After the signal processing, the signals were divided into 30 s epochs and the epochs were grouped into sequences of 100 consecutive epochs. In the training and validation sets, these sequences were generated with 75 % overlap between adjacent sequences to multiply the amount of training data, whereas test set sequences had no overlap. In accordance with previous studies [10], [15], [17], we included a maximum of 30 minutes of wake EEG signal before and after sleep whenever excess signal was recorded. Furthermore, to mimic the AASM guidelines of switching to contralateral channels in case of poor signal quality, we also created combination channels. The combination channels were constructed by calculating the variance of concurrent EEG epochs of the opposite EEG channels (e.g., Fp1-T10 and Fp2-T9) and adding the epoch with the lower variance to the combination channel. With this method, we can ideally select the side with less high-amplitude electrode-based noise from the opposite EEG channels. These combination channels are later referred to as the Fp1/Fp2 and Af7/Af8 combination channels.
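The epoching, the variance-based channel combination, and the overlapping sequence generation described above can be sketched as follows. This is a minimal illustration rather than the original implementation: the function names are hypothetical, the population variance is used, and ties are resolved in favor of the first channel, none of which is specified in the text.

```python
from statistics import pvariance

FS = 64               # sampling rate after downsampling (Hz)
EPOCH_LEN = 30 * FS   # samples in one 30 s epoch


def split_epochs(signal, epoch_len=EPOCH_LEN):
    """Cut a signal into non-overlapping epochs; a trailing partial epoch is dropped."""
    n = len(signal) // epoch_len
    return [signal[i * epoch_len:(i + 1) * epoch_len] for i in range(n)]


def combine_channels(left_epochs, right_epochs):
    """For each pair of concurrent epochs, keep the epoch with the lower variance."""
    combined = []
    for left, right in zip(left_epochs, right_epochs):
        # Assumption: on a tie the first (left) channel is kept.
        combined.append(left if pvariance(left) <= pvariance(right) else right)
    return combined


def make_sequences(epochs, seq_len=100, overlap=0.75):
    """Group epochs into sequences of seq_len epochs with a fractional overlap."""
    step = max(1, round(seq_len * (1 - overlap)))
    last_start = len(epochs) - seq_len
    return [epochs[s:s + seq_len] for s in range(0, last_start + 1, step)]


# Toy example with 4-sample "epochs": the noisier side is rejected epoch by epoch.
fp1 = split_epochs([0, 1, 0, -1, 0, 10, 0, -10], epoch_len=4)
fp2 = split_epochs([0, 5, 0, -5, 0, 1, 0, -1], epoch_len=4)
print(combine_channels(fp1, fp2))
```

With `overlap=0.75` consecutive sequences start 25 epochs apart, whereas `overlap=0` reproduces the non-overlapping test set sequences.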

D. TRAINING OF THE NEURAL NETWORK
Ten-fold cross-validation was used to test the performance of the network and to reduce the bias caused by the relatively small dataset. The three datasets were divided into 10 equal-sized folds so that each fold had a constant number of recordings from each dataset. One of the folds was used for testing, one for validation, and the remaining eight folds were used for training. Ten different models were then trained so that each fold was used once in the validation set and once in the test set. Each model was trained for a maximum of 200 cycles or until the value of the validation loss function had not decreased during 20 consecutive cycles. All calculations were conducted on a server with a 32-core AMD Ryzen Threadripper 2990WX, 128 GB of RAM, and an NVIDIA GeForce RTX 2080. With our server setup, the training of a single model takes approximately 2-4 hours, and the evaluation of a single patient's data with a trained model takes only a few seconds.
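The fold rotation and the stopping criterion described above can be sketched as follows. The text does not state which fold served as the validation set for a given test fold, so pairing test fold k with validation fold k+1 is an assumption, as are the round-robin assignment of recordings and the hypothetical function names.

```python
def make_folds(recordings_by_dataset, n_folds=10):
    """Deal the recordings of each dataset round-robin into n_folds folds."""
    folds = [[] for _ in range(n_folds)]
    for dataset in sorted(recordings_by_dataset):
        for i, rec in enumerate(recordings_by_dataset[dataset]):
            folds[i % n_folds].append(rec)
    return folds


def rotate_splits(folds):
    """Yield (train, validation, test); each fold is validation once and test once."""
    n = len(folds)
    for k in range(n):
        v = (k + 1) % n  # assumed pairing of test fold k with validation fold k + 1
        train = [r for j, fold in enumerate(folds) if j not in (k, v) for r in fold]
        yield train, folds[v], folds[k]


def should_stop(val_losses, patience=20):
    """True when the validation loss has not decreased during `patience` cycles."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


# Toy recording identifiers; the counts are illustrative, not the study cohort sizes.
toy = {"bruxism": [f"b{i}" for i in range(20)],
       "apnea1": [f"a{i}" for i in range(10)],
       "apnea2": [f"c{i}" for i in range(30)]}
for train, val, test in rotate_splits(make_folds(toy)):
    print(len(train), len(val), len(test))
```

Splitting by recording rather than by sequence keeps all overlapping sequences of one subject within the same set.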

E. DATA ANALYSIS
The performance of the deep learning model was evaluated as the test set accuracies across all the 10 folds. Furthermore, the inter-scorer agreement between automatic scoring and manual scoring was evaluated with Cohen's kappa coefficient (κ). Performance was calculated for complete test sets (all datasets included) and independently for each dataset by extracting the corresponding patients from the test sets. The confusion matrices were drawn for 5-stage, 4-stage, and 3-stage scorings across all folds including patients from all datasets. The same model was used for the 4-stage and 3-stage scorings, and the confusion matrices were derived by combining N1 and N2 to light sleep (4-stage), and N1, N2, and N3 to NREM (3-stage).
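As an illustration of these metrics, the sketch below computes the epoch-wise accuracy and Cohen's kappa, and derives the 4-stage and 3-stage scorings by relabeling the 5-stage output. The helper names and the label strings "Light", "Deep", and "NREM" are hypothetical, and the short hypnograms are toy data.

```python
from collections import Counter

TO_4_STAGE = {"W": "W", "N1": "Light", "N2": "Light", "N3": "Deep", "R": "R"}
TO_3_STAGE = {"W": "W", "N1": "NREM", "N2": "NREM", "N3": "NREM", "R": "R"}


def accuracy(manual, automatic):
    """Fraction of epochs on which the two scorings agree."""
    return sum(m == a for m, a in zip(manual, automatic)) / len(manual)


def cohens_kappa(manual, automatic):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    n = len(manual)
    p_observed = accuracy(manual, automatic)
    count_m, count_a = Counter(manual), Counter(automatic)
    p_expected = sum(count_m[s] * count_a[s] for s in count_m) / (n * n)
    # Undefined (division by zero) if both scorings contain only one identical stage.
    return (p_observed - p_expected) / (1 - p_expected)


def merge_stages(stages, mapping):
    """Relabel a 5-stage hypnogram, e.g. N1 and N2 to light sleep."""
    return [mapping[s] for s in stages]


manual = ["W", "N1", "N2", "N2", "N3", "R", "N2", "N1"]
auto = ["W", "N2", "N2", "N2", "N3", "R", "N2", "N1"]
for name, mapping in (("4-stage", TO_4_STAGE), ("3-stage", TO_3_STAGE)):
    m, a = merge_stages(manual, mapping), merge_stages(auto, mapping)
    print(name, accuracy(m, a), cohens_kappa(m, a))
```

Because an N1/N2 confusion disappears once both stages map to light sleep, the merged scorings can only keep or raise the accuracy of the 5-stage scoring.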
To further assess the usability of the AES in clinical use, total sleep time (TST) and wake after sleep onset (WASO) were calculated for the automatic and manual scorings. The results are reported as medians, interquartile ranges, and Bland-Altman plots with mean difference and 95 % confidence interval. Furthermore, the time spent in each sleep stage was calculated for each dataset for manual and automatic scoring. The statistical significance of the differences between manually and automatically acquired sleep parameters was evaluated with the Wilcoxon signed-rank test using p < 0.001 as the limit for statistical significance.
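A minimal sketch of how such parameters could be derived from a hypnogram is given below. The exact definitions are assumptions: WASO is taken here as all wake scored after the first sleep epoch, and the Bland-Altman interval is computed as the mean difference ± 1.96 standard deviations of the differences. The function names and the numerical values are illustrative only.

```python
from statistics import mean, stdev

EPOCH_MIN = 0.5  # duration of one 30 s epoch in minutes


def total_sleep_time(hypnogram):
    """Minutes scored as any sleep stage (everything except wake)."""
    return sum(s != "W" for s in hypnogram) * EPOCH_MIN


def wake_after_sleep_onset(hypnogram):
    """Minutes of wake after the first sleep epoch (assumed definition)."""
    for i, stage in enumerate(hypnogram):
        if stage != "W":
            return sum(s == "W" for s in hypnogram[i:]) * EPOCH_MIN
    return 0.0  # no sleep was scored


def bland_altman(manual_values, automatic_values):
    """Return (mean difference, lower limit, upper limit) of automatic - manual."""
    diffs = [a - m for m, a in zip(manual_values, automatic_values)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


hypnogram = ["W", "W", "N1", "N2", "W", "N2", "R", "W"]
print(total_sleep_time(hypnogram), wake_after_sleep_onset(hypnogram))
# Illustrative TST values in minutes, not results of the study.
print(bland_altman([400, 420, 380], [410, 415, 385]))
```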
Leino: Deep learning enables accurate automatic sleep staging based on ambulatory forehead EEG VOLUME XX, 2017

III. RESULTS
The performances of the neural network-based sleep staging with different channel derivations are presented in Table 2. The highest accuracy was achieved with the Fp1/Fp2 combination channel, outperforming the Fp2, Fp1, and Fp2-GND channels by 2.2 %, 0.7 %, and 1.9 %, respectively. The lowest accuracy was obtained either with the Fp1/Fp2 combination channel accompanied by the S2-SF EMG channel or when using all EEG and EOG channels. However, when considering all datasets, the overall accuracies with every tested input were close to each other, and the difference between the highest and the lowest accuracy was only 2.9 %. The accuracies of the individual datasets show the largest differences in sleep apnea dataset 2, where the highest accuracy with the Fp1/Fp2 combination channel was 80.5 % and the lowest accuracy with all EEG and EOG channels was 72.3 %. The most consistent performance throughout the three datasets was achieved with the Fp1/Fp2 combination channel. However, adding the S2-SF EMG channel to the input with the Fp1/Fp2 combination channel slightly worsened the accuracy, especially in sleep apnea dataset 2.
The confusion matrices of the most accurate Fp1/Fp2 combination channel are presented in Fig. 3a-c for the 5-stage, 4-stage, and 3-stage predictions. The 5-stage neural network model achieved high accuracy in the wake, N2, N3, and R stages. However, it struggled to identify the N1 stage. When the N1 and N2 stages were combined into light sleep, the overall accuracy increased from 79.7 % to 84.1 % (κ = 0.773). Scoring to only 3 stages (wake, NREM, R) further increased the accuracy to 89.1 % (κ = 0.801). Furthermore, the sensitivity, specificity, and F1-score for each sleep stage with 5-stage scoring based on the Fp1/Fp2 combination channel in all datasets are presented in Table 4. In general, sensitivity and specificity are high in all sleep stages except N1.
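For reference, the stage-wise sensitivity, specificity, and F1-score can be computed in a one-versus-rest manner as in the following sketch. The function name is hypothetical and the two short hypnograms are toy data, not values from Table 4.

```python
def stage_metrics(manual, automatic, stage):
    """One-vs-rest sensitivity, specificity, and F1-score for a single sleep stage."""
    tp = fp = fn = tn = 0
    for m, a in zip(manual, automatic):
        if m == stage and a == stage:
            tp += 1
        elif m != stage and a == stage:
            fp += 1
        elif m == stage and a != stage:
            fn += 1
        else:
            tn += 1
    nan = float("nan")  # returned when a metric is undefined for this stage
    sensitivity = tp / (tp + fn) if tp + fn else nan
    specificity = tn / (tn + fp) if tn + fp else nan
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan
    return sensitivity, specificity, f1


manual = ["W", "N1", "N2", "N2", "N3", "R", "N2", "N1"]
auto = ["W", "N2", "N2", "N2", "N3", "R", "N2", "N1"]
for stage in ("W", "N1", "N2", "N3", "R"):
    print(stage, stage_metrics(manual, auto, stage))
```

In this toy example, the single N1 epoch scored as N2 lowers the sensitivity of N1 and the specificity of N2 while leaving the other stages unaffected.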
Optimal sequence length was also tested with the best performing Fp1/Fp2 combination channel. The accuracies for sequence lengths of 25, 50, 75, 100, and 125 epochs were 78.5 %, 79.0 %, 79.6 %, 79.7 %, and 79.4 %, respectively. Thus, the optimal sequence length was 100 epochs, which was also used in the previous study by Korkalainen et al. [15].
The total sleep time (TST, Fig. 4a, Table 3), wake after sleep onset (WASO, Fig. 4b, Table 3), and time spent in different sleep stages (Table 3) were calculated from the manual scoring and the predictions made with the Fp1/Fp2 combination channel. TST and WASO show only minor differences between medians of manual and automated scoring (Table 3). Furthermore, the number of each sleep stage and unscored epochs based on manual scoring are presented in Table 5. The deep learning model slightly overestimates the time spent in N2 and underestimates the time spent in N1 (Table 3), which can also be seen from the 5-stage confusion matrix (Fig. 3a). Based on the Bland-Altman plots, the constant bias of the TST and WASO is minimal (Fig. 4).
To better illustrate the performance of the automatic scoring algorithm, hypnograms of the automatic and manual scoring of a subject with median scoring accuracy were plotted (Fig. 5). By visual inspection, the automatic scoring manages to identify the ultradian rhythm well. The manual scoring, however, has more rapid sleep stage transitions. Furthermore, a Sankey diagram of the scored sleep stages between manual and automatic scoring was plotted (Fig. 6). From the Sankey diagram and confusion matrix (Fig. 3a), sleep stage pairs that are almost never confused (e.g., REM-N3) and pairs that are sometimes confused (e.g., N1-N2 or N2-N3) can be identified.

IV. DISCUSSION
In this study, we utilized a previously developed deep learning model [15] for sleep staging based on frontal EEG, EOG, and EMG data measured with an ambulatory electrode set (AES). When considering all datasets, the highest automatic scoring accuracy of 79.7 % (κ = 0.729) was achieved with the Fp1/Fp2 combination EEG channel. The accuracy was only 3.2 percentage points lower than that achieved with the same deep learning model trained with a nearly 7 times larger PSG dataset [15]. The accuracy is also comparable with the inter-scorer agreement of standard EEG between individual scorers, in which kappa values have been reported to be between 0.58 and 0.76 [12]-[14], [18]. The accuracy also exceeded the 76.1 % manual inter-scorer agreement previously reported for the AES recordings [6]. Furthermore, the overall accuracy was similar in each of the three datasets, which indicates that the neural network generalizes well to different datasets with distinct recording environments (home, sleep lab) and different independent scorers in multiple international centers.
The most accurate EEG derivation was the Fp1/Fp2 combination channel. This confirms the hypothesis that the automatic scoring accuracy can be slightly increased by using the variance of EEG epochs as a simple metric to choose the optimal input for the deep learning model. The rationale behind this metric is to mimic the AASM scoring rules that recommend switching the scoring from the F4-M1, C4-M1, and O2-M1 channels to F3-M2, C3-M2, and O1-M2 channels in the case of an electrode malfunction [1]. Electrode malfunctions and other EEG artifacts appear as high amplitude variations in the signal. Furthermore, typical electrode-related artifacts appear only in single channels with the AES [5]. Thus, calculating the variance of the epochs from the opposite EEG channels can be used to select the epoch from the opposite channels with less noise induced by poor electrode contact.
In addition, the scoring accuracy of the neural network was the most consistent across the different datasets when using the combination channels. The differences between datasets were negligible when using combination channels, the accuracies being within 0.9% with the Fp1/Fp2 combination channel and 1.1% with the Af8/Af7 combination channel. With single bipolar channels, the differences between datasets were slightly larger. This, however, could be explained by electrode-based artifacts or other types of noise in single channels.
The largest difference between different inputs was seen in sleep apnea dataset 2, in which the scoring accuracy increased by 6.4 % and 3.4 % when comparing Fp2 and Fp1 channels to the Fp1/Fp2 combination channel, respectively. Even though the combination channels improved the accuracy compared to single channels, adding more EEG, EOG, or EMG channel derivations did not improve the accuracy. However, this is not the first time this kind of behavior has been reported [19], [20]. Several different deep learning models have been shown to have only minor benefit or even performance degradation from additional input signals [19], [20]. This may be caused by the fact that adding possibly redundant input increases the noise of the deep learning model's input, making it more difficult for the model to extract relevant information from the signals.
When considering all datasets, the highest accuracy (88 %) was achieved when scoring wake. Furthermore, N2, N3, and REM were predicted with high accuracy (81-84 %). The lowest accuracy (31 %) was obtained when predicting the N1 stage. This was expected, as the N1 stage is also the most difficult sleep stage to score manually even for experienced scorers. The inter-scorer agreement between experienced scorers for N1 has been reported to be only κ = 0.31-0.46 [12], [13], [18] based on standard EEG. In addition, the dataset was slightly imbalanced (Table 5), and the N1 stage is the least prevalent stage in all datasets, which further complicates the feature learning of the N1 stage. However, balancing the dataset with sample weights had a negative effect on the overall performance in preliminary tests. This indicates that despite the imbalance, there was enough information on each sleep stage for reliable feature learning, and the low accuracy of N1 is mainly caused by inconsistent manual scoring.
Based on the confusion matrix of the 5-stage scoring (Fig. 3a), a slight bias towards the N2 stage can be seen. This most probably results from the N2 stage being the most prevalent sleep stage in the datasets (Table 2). An unequal number of different sleep stages causes the loss function to reach its lowest value when uncertain epochs are scored as the most prevalent stage. This leads to a higher number of false positive N2 identifications when compared with other sleep stages. The bias could be corrected by using sample weights inversely proportional to the number of epochs of the corresponding sleep stage during the training. This would, however, have a slight negative impact on the overall accuracy of the neural network.
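The weighting scheme mentioned above can be sketched as follows. The normalization, in which the weights are scaled so that they sum to the number of epochs, is one common convention and an assumption here, and the function name and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-epoch weights inversely proportional to the prevalence of each stage."""
    counts = Counter(labels)
    n_epochs, n_stages = len(labels), len(counts)
    # Assumed normalization: the weights sum to the number of epochs.
    stage_weight = {s: n_epochs / (n_stages * c) for s, c in counts.items()}
    return [stage_weight[s] for s in labels], stage_weight


# Toy labels in which N2 is three times as common as N1 and wake.
labels = ["N2"] * 6 + ["N1"] * 2 + ["W"] * 2
weights, stage_weight = inverse_frequency_weights(labels)
print(stage_weight)
```

With such weights, a misclassified N1 epoch contributes more to the loss than a misclassified N2 epoch, which counteracts the tendency to score uncertain epochs as the most prevalent stage.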
The medians and Bland-Altman plots for TST and WASO (Fig. 4, Table 3) show only minor differences between the automatic and manual scorings. Furthermore, the 95 % confidence intervals of the difference between the manual and estimated parameters are fairly low. Only one extreme outlier is present in the dataset, which is caused by significant ECG artifacts in the EEG channels. The ECG artifacts cause the deep learning model to score significantly more wake than the manual scorer. Similar misinterpretations could be easily avoided in clinical circumstances with a visual inspection before feeding the signal to the neural network.
The achieved 79.7 % (κ = 0.73) scoring accuracy compares favorably to the recent advances in the deep learning-based automatic sleep scoring presented in the literature. Based on standard EEG recordings, several deep learning models with different architectures have achieved an accuracy of 82.9-86.2 % (κ = 0.77-0.80) [10], [15], [17], [21], [22] with single-channel inputs. Furthermore, a random forest classifier has been used to achieve 72.98 % accuracy based on a single frontal EEG channel [26]. There have also been other studies using automatic sleep staging methods on signals collected with different ambulatory EEG systems. Based on signals measured with ear-EEG electrodes, a random forest classifier has been used to achieve scoring accuracies of κ = 0.45-0.65 [23] and κ = 0.73 [24]. A fabric headband with forehead electrodes developed by Cognionics Inc. (San Diego, CA, USA) has been recently used to achieve 74.0 % deep learning scoring accuracy based on two-channel data [25]. Furthermore, a commercially available device, the Dreem headband (Dreem Inc., Paris, France), has been used to achieve an accuracy of 83.5 % (κ = 0.748) with a deep learning model that was trained with consensus hypnograms of five experienced scorers [4]. However, the deep learning models of these studies have not been published in detail. The high accuracy based on consensus manual scoring highlights the importance of the scoring quality and of using data from multiple different scorers to avoid unintentional and unwanted bias in the learning process.
A minor limitation in this study is the size of the dataset (n = 135). To compensate for this, we multiplied the number of the EEG sequences in the training and validation sets by overlapping consecutive sequences. Furthermore, we utilized 10-fold cross-validation to minimize the bias that could be caused by a small test set. Despite the relatively small dataset, the model achieved good accuracy, and further increasing the size or number of the AES datasets could increase the accuracy of the automatic sleep staging even further. Furthermore, the neural network was trained with three different datasets analyzed in three different international sleep centers. The automatic scoring accuracy would arguably be higher if the used dataset were scored by one person. Considering this, the achieved accuracy is excellent, which indicates that the algorithm is able to find the relevant signal features despite differences in the scorer habits and recording environments. In any case, the algorithm should be thoroughly validated on large AES datasets, preferably scored by multiple sleep experts, before advancing to routine clinical use.
Another factor that might limit the scoring accuracy is that the AES has been designed to be used without skin abrasion to allow easy self-application. This leads to higher electrode-skin impedances when compared to a standard EEG setup with cup electrodes accompanied by a skin abrasion procedure. The high electrode-skin impedance has been reported to lead to noticeable sweat artifacts [27], [28] in the delta wave bandwidth (0.5-4.0 Hz), which are suggested to partly originate from the imbalanced impedances of the electrodes [29]. However, with further research and product development, the signal quality of the AES could be enhanced to improve its resistance to artifacts [27], [28]. This has the possibility of simultaneously enhancing the sleep scoring accuracy of the utilized deep learning model.
The transition from type I PSGs to type II in-home measurements has been shown to improve the availability and reduce the waiting times of the sleep studies in diagnostics of obstructive sleep apnea (OSA) [30]. Furthermore, the direct costs of OSA management have been shown to decrease because of the absence of hospitalization [30]. In addition to financial benefits, type II PSGs with an accurate automatic scoring algorithm could potentially produce better quality sleep data, as the sleep efficiency has been shown to be better and total sleep time longer compared to type I PSGs since the patients can sleep in a familiar and comfortable home environment instead of a sleep laboratory [31], [32]. Self-applicable EEG electrodes with automatic scoring could also enable easy and cost-effective multi-night measurements, further improving the data quality by enabling the consideration of the first-night effect and night-to-night variability [2], [33], [34]. Therefore, type I PSG could be replaced in some cases with self-administrated type II PSG with the AES to free up human resources, given due consideration to the needs of the patient. In addition to facilitating the type II PSGs, the AES could be useful when conducting type I PSG recordings in cumbersome situations, such as for patients with reduced mobility in stroke units where routine screening for sleep apnea is uncommon [30]. Furthermore, the AES together with the proposed automatic sleep staging model could also be used as a supplement in a home sleep apnea test (HSAT) to improve its diagnostic accuracy without a significant increase in the costs or manual labor.
Our study also included subjects with possible sleep bruxism. Our previous study confirmed that home PSG with the AES, recording at least 3 nights, can help to better capture the oral behavior in a natural sleeping environment [7]. For such extensive clinical studies to be possible, new wearable technologies and automated analytical methods are needed. The neural network-based sleep staging together with self-applicable AES described in this study could also offer new possibilities for the diagnosis of sleep bruxism. Our study also showed that a very simple measurement solution might be sufficient to perform reliable automatic analysis. This will allow simplification of the current layout of the AES.
In conclusion, the previously developed deep learning model [15] was successfully utilized for automatic sleep scoring based on EEG recorded with an AES. The automatic scoring reached an accuracy comparable to inter-scorer agreement of standard EEG recordings [12]-[14], [18]. Therefore, the automatic scoring algorithm that can be used with a self-applicable EEG electrode set could significantly facilitate the arrangement of type II in-home PSGs.

V. ACKNOWLEDGMENT
We thank Sigríður Sigurðardóttir and Erna Sif Arnardóttir from Reykjavik University Sleep Institute for their valuable contribution in the data analysis.
The values are presented as median (interquartile range). *Missing data from six patients. BMI, body mass index; AHI, apnea-hypopnea index.