Detailed assessment of sleep architecture with deep learning and shorter epoch-to-epoch duration reveals sleep fragmentation of patients with obstructive sleep apnea

Traditional sleep staging with non-overlapping 30-second epochs overlooks multiple sleep-wake transitions. We aimed to overcome this by analyzing the sleep architecture in more detail with deep learning methods and hypothesized that the traditional sleep staging underestimates the sleep fragmentation of obstructive sleep apnea (OSA) patients. To test this hypothesis, we applied deep learning-based sleep staging to identify sleep stages with the traditional approach and by using overlapping 30-second epochs with 15-, 5-, 1-, or 0.5-second epoch-to-epoch duration. A dataset of 446 patients referred for polysomnography due to OSA suspicion was used to assess differences in the sleep architecture between OSA severity groups. The amount of wakefulness increased while REM and N3 decreased in severe OSA with shorter epoch-to-epoch duration. In other OSA severity groups, the amount of wake and N1 decreased while N3 increased. With the traditional 30-second epoch-to-epoch duration, only small differences in sleep continuity were observed between the OSA severity groups. With 1-second epoch-to-epoch duration, the hazard ratio illustrating the risk of fragmented sleep was 1.14 (p = 0.39) for mild OSA, 1.59 (p < 0.01) for moderate OSA, and 4.13 (p < 0.01) for severe OSA. With shorter epoch-to-epoch durations, total sleep time and sleep efficiency increased in the non-OSA group and decreased in severe OSA. In conclusion, more detailed sleep analysis emphasizes the highly fragmented sleep architecture in severe OSA patients which can be underestimated with traditional sleep staging. The results highlight the need for a more detailed analysis of sleep architecture when assessing sleep disorders.


I. INTRODUCTION
Sleep is a restorative state with a multitude of functions such as memory consolidation and clearance of metabolic waste products from the brain [1], [2]. Sleep can be objectively assessed using electroencephalography (EEG), electrooculography (EOG), and electromyography (EMG) recorded during overnight polysomnography (PSG) and subjectively categorized into stages according to defined criteria [3]. The current practice of sleep staging utilizes a segmentation method of assigning a nominal sleep stage to each non-overlapping 30-second epoch from the onset of the recording [3]. This 30-second division is an arbitrary system that is a historical remnant of when EEG recordings were printed on paper and is not wholly based on physiological factors [4]-[8]. The epoch-based scoring was optimized for convenience and less labor rather than producing a more accurate representation of sleep [5]-[7].
A defining characteristic of the 30-second epoch staging system is that multiple sleep stages may be present in a single epoch but only a single stage can be assigned to each epoch. Therefore, many transitions between sleep stages and between sleep and wakefulness remain overlooked. For example, wake periods with a duration of up to 30 seconds may be completely overlooked when divided over two consecutive epochs. This can cause overestimation of sleep quality and underestimation of sleep fragmentation and can also significantly affect the determination of sleep onset or the onset of REM sleep. Furthermore, the 30-second epoch-based sleep staging can cause large uncertainties in tests objectively assessing daytime sleepiness, for example the Multiple Sleep Latency Test and the Maintenance of Wakefulness Test, where the accurate identification of sleep onset is paramount. Additionally, missing transitions from sleep to wakefulness can affect the estimation of the duration of continuous sleep periods and also affect the values of various clinical parameters used to illustrate the sleep architecture, for example, the total sleep time (TST), sleep efficiency (SE), and duration of wake after sleep onset (WASO). The current sleep staging practice with non-overlapping 30-second epochs is problematic especially when the sleep architecture is disturbed due to sleep disorders [5].
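The effect of a wake period being split over two consecutive epochs can be illustrated with a toy example. The sketch below is not the scoring procedure used in this study: it assumes a hypothetical second-by-second ground truth (`"S"` for sleep, `"W"` for wake) and a simple majority rule standing in for the scorer, and the function name `score_epochs` is introduced only for illustration.

```python
def score_epochs(truth, epoch_s=30, step_s=30):
    """Label each 30-second window as wake ("W") if more than half of it is wake.

    truth   : list of per-second labels, "S" (sleep) or "W" (wake)
    epoch_s : epoch length in seconds
    step_s  : epoch-to-epoch duration in whole seconds (30 = no overlap)
    The majority rule is an illustrative assumption, not the study's model.
    """
    labels = []
    for start in range(0, len(truth) - epoch_s + 1, step_s):
        window = truth[start:start + epoch_s]
        labels.append("W" if window.count("W") > epoch_s / 2 else "S")
    return labels


# Two minutes of sleep with a 28-second wake period straddling the
# epoch boundary at 60 s (14 s in each neighbouring epoch).
truth = ["S"] * 120
truth[46:74] = ["W"] * 28

traditional = score_epochs(truth, epoch_s=30, step_s=30)
detailed = score_epochs(truth, epoch_s=30, step_s=1)

print("wake epochs, 30-s step:", traditional.count("W"))
print("wake epochs, 1-s step: ", detailed.count("W"))
```

Under these assumptions, neither of the two non-overlapping epochs contains a majority of wake, so the wake period disappears from the traditional scoring, whereas overlapping windows centred on the wake period do capture it.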
Obstructive sleep apnea (OSA) is one of the most common sleep disorders affecting over 900 million individuals [9]. OSA is characterized by recurrent obstructions of upper airways which often lead to arousals from sleep causing sleep fragmentation and multiple transitions between sleep stages and wakefulness [10,11]. However, due to the current convention of sleep staging based on non-overlapping 30-second epochs, many of these transitions can be easily missed. Therefore, we hypothesize that the current sleep staging with non-overlapping 30-second epochs heavily underestimates the extent of sleep fragmentation in patients suffering from OSA. As deep learning-based methods have demonstrated remarkable accuracy in automatic sleep staging [12][13][14][15][16][17], we hypothesize that deep learning offers a unique possibility for providing a more feasible and accurate representation of sleep architecture beyond the non-overlapping 30-second epochs.
We have recently introduced a deep learning-based automatic sleep staging method [16] that surpassed previous state-of-the-art methods on a publicly available research dataset (Physionet Sleep-EDF [18,19]). In a clinical dataset of patients with suspected OSA, the developed method reached at least similar inter-rater reliability (83.8% accuracy, κ = 0.78) [16] as between two manual scorers [20][21][22]. The deep learning-based sleep staging also succeeded in accurately identifying sleep stages from a single EEG channel [16] or even from a photoplethysmography signal [17]. However, the main advantage of the automatic sleep staging over manual scoring is the ability to always score the sleep stages consistently. Therefore, we aim to utilize this previously developed automatic method to assess sleep architecture in a more detailed manner. Furthermore, we aim to study how the sleep architecture of patients with varying degrees of OSA differs with more detailed sleep staging.

II. METHODS

A. Dataset
We have previously presented a deep learning-based automatic sleep staging model utilizing a clinical dataset of Type 1 polysomnographies (PSGs) [16]. The PSGs were conducted at the Princess Alexandra Hospital (Brisbane, Australia) for the clinical suspicion of OSA and recorded with the Compumedics Grael acquisition system (Compumedics, Abbotsford, Australia) between 2015 and 2017. The dataset comprised 891 recordings, of which 717 were used to train the final model. In the present study, we retrained the model using half of the same population (n = 445). The remaining 446 recordings were left outside the retraining process and were included in the analyses conducted in the present study (Table 1). The data collection was approved by the Institutional Human Research Ethics Committee of the Princess Alexandra Hospital (HREC/16/QPAH/021 and LNR/2019/QMS/54313).

B. Sleep staging
The deep learning model comprised a combined convolutional (CNN) and recurrent neural network (RNN) conducting the sleep staging in a sequence-to-sequence manner from sequences of one hundred 30-second epochs (Python 3.6 with Keras API 2.24 and TensorFlow 1.13 backend). The CNN architecture consisted of six 1D convolutions each followed by batch normalization and a ReLU activation. A max-pooling layer was located after the first two and the two following convolutions. The final layer comprised a global average pooling layer. The complete network architecture consisted of a time distributed layer of the CNN described above followed by a Gaussian dropout layer, a bidirectional long short-term memory (LSTM) layer, and a time distributed dense layer with softmax activation. An EEG channel (derivation F4-M1) and an EOG channel (derivation E1-M2) were used for the automatic sleep staging. No preprocessing was conducted on the signals aside from downsampling to 64 Hz from the original sampling frequency of 1024 Hz. The architecture of the model and the workflow for training the model are presented in more detail in Korkalainen et al. [16].
In the present study, the model architecture presented previously in [16] was trained using only half (n = 445) of the complete population of 891 recordings. This was further split into a training set (n = 400) and a validation set (n = 45). The validation set was used in selecting the best performing model during training, i.e. the model with the lowest validation loss was selected. The remaining 446 recordings were not used in the training of the model and were included only in the further analyses. The retraining of the previously presented model was conducted to allow for a larger dataset to be used in the present study without having to rely on recordings that have been used during the training of the model. After retraining the model, the study population not included in the training was reanalyzed with the deep learning-based sleep staging method. In addition to the traditional sleep staging with non-overlapping 30-second epochs, the sleep staging was conducted by allowing consecutive 30-second epochs to overlap with four different epoch-to-epoch durations: a new 30-second epoch taken every 15 seconds (50% overlap), every 5 seconds (83.3% overlap), every 1 second (96.7% overlap), or every 0.5 seconds (98.3% overlap). Each scoring then formed a time series of sleep stages (Fig. 1). The sleep architecture of the different OSA severity groups (non-OSA, mild OSA, moderate OSA, and severe OSA) was compared using three different approaches: 1) calculating the sleep stage percentages in each severity group, 2) calculating commonly used sleep parameters (total sleep time, sleep efficiency, and wake after sleep onset) for each individual in the groups, and 3) evaluating the continuity of sleep in each group based on survival analysis.
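The relationship between epoch-to-epoch duration, overlap, and the number of scored epochs can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study; the function names `epoch_starts` and `overlap_percent` and the 120-second toy recording are assumptions made for the example.

```python
def epoch_starts(recording_s, epoch_s=30.0, step_s=30.0):
    """Start times (in seconds) of every full epoch that fits in the recording.

    recording_s : recording length in seconds
    epoch_s     : epoch length in seconds (30 s throughout the study)
    step_s      : epoch-to-epoch duration in seconds
    """
    if epoch_s <= 0 or step_s <= 0:
        raise ValueError("epoch_s and step_s must be positive")
    if recording_s < epoch_s:
        return []
    n_epochs = int((recording_s - epoch_s) // step_s) + 1
    return [i * step_s for i in range(n_epochs)]


def overlap_percent(epoch_s, step_s):
    """Percentage of an epoch shared with the next epoch."""
    return 100.0 * (1.0 - step_s / epoch_s)


# Toy 2-minute recording scored with the five epoch-to-epoch durations.
for step in (30, 15, 5, 1, 0.5):
    n = len(epoch_starts(120, epoch_s=30, step_s=step))
    print(f"step {step:>4} s: {n:>3} epochs, "
          f"{overlap_percent(30, step):.1f}% overlap")
```

The computed overlaps reproduce the percentages quoted above (50%, 83.3%, 96.7%, and 98.3%), and the epoch count grows roughly in inverse proportion to the epoch-to-epoch duration.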
Three commonly used sleep parameters, total sleep time (TST), sleep efficiency (SE), and amount of wake after sleep onset (WASO), were calculated for each patient, and means and standard deviations were calculated for each OSA severity group. The statistical significance was evaluated using the Mann-Whitney U test when comparing the OSA groups to the non-OSA group and with the Wilcoxon signed-rank test when comparing the sleep parameters between the more detailed and the traditional sleep staging within the same OSA severity group.
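As a rough sketch of how these three parameters could be derived from a scored time series, consider the following. The definitions are assumptions made for illustration, since the text does not spell them out: each stage label is taken to represent one epoch-to-epoch duration (`step_s` seconds), SE is TST divided by the whole scored period, and WASO counts all wake after the first sleep label. The function name `sleep_parameters` is hypothetical.

```python
SLEEP_STAGES = {"N1", "N2", "N3", "REM"}


def sleep_parameters(stages, step_s=30.0):
    """Compute TST, SE, and WASO from a sequence of stage labels.

    stages : sequence of labels "W", "N1", "N2", "N3", "REM"
    step_s : epoch-to-epoch duration in seconds; each label is assumed to
             account for step_s seconds of the night (illustrative choice)
    """
    asleep = [s in SLEEP_STAGES for s in stages]
    minutes_per_label = step_s / 60.0

    tst_min = sum(asleep) * minutes_per_label
    scored_min = len(stages) * minutes_per_label
    se_percent = 100.0 * tst_min / scored_min if stages else 0.0

    if True in asleep:
        onset = asleep.index(True)  # first label scored as sleep
        wake_after_onset = sum(1 for a in asleep[onset:] if not a)
        waso_min = wake_after_onset * minutes_per_label
    else:
        waso_min = 0.0

    return {"TST_min": tst_min, "SE_percent": se_percent, "WASO_min": waso_min}


example = ["W", "W", "N1", "N2", "W", "N2", "N3", "REM", "W", "N2"]
print(sleep_parameters(example, step_s=30))
```

Passing the same recording scored with a shorter `step_s` yields parameter values on the same scale, which is what allows the traditional and the more detailed scorings to be compared directly.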
Sleep continuity was evaluated based on survival analysis methodology. The rationale behind evaluating the continuity of sleep with survival analysis was previously presented by Norman et al. [23]. A continuous sleep period was defined as the interval between the transition to any sleep stage from wakefulness until the next epoch was scored as wake. The mean duration of sleep periods was calculated for each individual in the study population and was used as the time to event (transition to wake) in the survival analyses. The sleep continuity of the OSA groups and the non-OSA group was compared using a Cox proportional hazards model with the hazard ratio illustrating the risk for fragmented sleep (i.e. short continuous sleep periods during the night). Furthermore, sleep continuity was studied with Kaplan-Meier survival curves. All statistical analyses were conducted with Matlab 2018b using the Statistics and Machine Learning Toolbox (The MathWorks, Natick, MA, USA).
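The definition of a continuous sleep period and the resulting survival curve can be sketched in a few lines. The statistical analyses in this study were run in Matlab; the Python below is only an illustrative stand-in. It assumes that each label spans `step_s` seconds, that a sleep period still open at the end of the recording is counted as is, and that every individual contributes an observed event (no censoring). The function names are hypothetical.

```python
def sleep_period_durations(stages, step_s=30.0):
    """Durations (s) of runs of consecutive non-wake labels."""
    durations, run = [], 0
    for stage in stages:
        if stage != "W":
            run += 1
        elif run:
            durations.append(run * step_s)
            run = 0
    if run:  # assumption: a period still open at the end is kept
        durations.append(run * step_s)
    return durations


def kaplan_meier(times):
    """Product-limit survival estimate, assuming no censored observations.

    times : one time-to-event per individual (here, the mean duration of
            that individual's continuous sleep periods)
    Returns a list of (time, survival probability) steps.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        events = times.count(t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
        at_risk -= events
    return curve


night = ["W", "N1", "N2", "W", "N2", "N2", "N3", "W", "W", "REM"]
periods = sleep_period_durations(night, step_s=30)
mean_period = sum(periods) / len(periods)
print("sleep periods (s):", periods, "mean:", mean_period)

# Hypothetical per-individual mean durations (minutes) for one group.
print(kaplan_meier([2, 4, 4, 8]))
```

A group with more fragmented sleep has shorter mean sleep periods, so its survival curve falls toward zero earlier; this is the difference that the Cox model summarizes as a hazard ratio.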

III. RESULTS

A. Sleep stages
The deep learning model reached a training accuracy (Cohen's kappa κ) of 89.2% (κ = 0.85) and a validation accuracy of 81.9% (κ = 0.76) during the retraining. In the current study population, the deep learning model had a sleep staging accuracy of 83.2% (κ = 0.77) when compared to the original manual analysis. This corresponded to accuracies of 91.7% for identifying wake, 41.4% for N1, 84.0% for N2, 83.4% for N3, and 90.9% for REM.
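For reference, the two agreement measures reported here, accuracy and Cohen's kappa, can be computed from paired epoch labels as in the sketch below. The function name `accuracy_and_kappa` and the short label sequences are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter


def accuracy_and_kappa(manual, automatic):
    """Epoch-by-epoch accuracy and Cohen's kappa between two scorings."""
    if len(manual) != len(automatic) or not manual:
        raise ValueError("scorings must be non-empty and of equal length")
    n = len(manual)

    # Observed agreement: fraction of epochs with identical labels.
    p_observed = sum(m == a for m, a in zip(manual, automatic)) / n

    # Chance agreement from the marginal stage frequencies of each scoring.
    manual_counts, auto_counts = Counter(manual), Counter(automatic)
    p_chance = sum(manual_counts[k] * auto_counts[k] for k in manual_counts) / n**2

    if p_chance == 1.0:  # both scorings use a single identical label
        return p_observed, 1.0
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return p_observed, kappa


manual = ["W", "W", "N2", "N2", "N3", "REM", "N2", "W"]
automatic = ["W", "N1", "N2", "N2", "N3", "REM", "N3", "W"]
acc, kappa = accuracy_and_kappa(manual, automatic)
print(f"accuracy = {100 * acc:.1f}%, kappa = {kappa:.2f}")
```

Kappa corrects the raw agreement for the agreement expected by chance, which is why it is lower than the accuracy whenever the stage distribution is unbalanced.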
When comparing the deep learning-based sleep staging with traditional 30-second epochs and with varying overlap, the more detailed sleep staging decreased the amount of scored wake, N1, and REM and increased the amount of N2 and N3 in the non-OSA, mild OSA, and moderate OSA groups (Table 2). In the mild OSA and moderate OSA groups, the amount of wake first decreased to the same level as with manual scoring with decreasing epoch-to-epoch durations but decreased even further with the shortest durations. In contrast, the amount of wake increased and the amount of N3 and REM decreased in patients with severe OSA with decreasing epoch-to-epoch duration. Examples of scored sleep stages with the traditional sleep staging and with decreasing epoch-to-epoch duration are shown in Fig. 2.

B. Sleep parameters
With the deep learning-based sleep staging, the TST and SE increased while WASO decreased in the non-OSA, mild OSA, and moderate OSA groups with decreasing epoch-to-epoch duration (Table 3). In contrast, the TST and SE decreased while WASO increased in the severe OSA group with shorter epoch-to-epoch durations, thus increasing the differences in the parameter values between the severe OSA group and the other groups.

C. Sleep continuity
With 30-second epoch-to-epoch duration in the deep learning-based sleep staging, the differences in the sleep fragmentation between the groups were small with the Cox proportional hazards model or Kaplan-Meier survival curves (Table 4, Fig. 3). When the sleep staging was conducted in more detail with shorter epoch-to-epoch durations, differences between the non-OSA and the OSA severity groups began to increase. The hazard ratios illustrating the risk of fragmented sleep with 1-second epoch-to-epoch duration were 1.14 (p = 0.39), 1.59 (p < 0.01), and 4.13 (p < 0.01) in mild, moderate, and severe OSA groups, respectively. The obtained hazard ratio for the severe OSA group even surpassed the value obtained with the manual sleep staging, indicating that the deep learning-based sleep staging with short epoch-to-epoch duration reveals larger differences between the non-OSA and severe OSA groups. Similar differences between groups with decreasing epoch-to-epoch duration can be seen in the Kaplan-Meier survival curves (Fig. 3).

IV. DISCUSSION
In this study, we introduced a novel method to analyze sleep in a more detailed manner using deep learning-based automatic sleep staging. Our results reveal that reducing the epoch-to-epoch duration between consecutive 30-second epochs considerably affects the evaluated sleep architecture and can provide greater insights into the sleep architecture beyond the traditional 30-second epoch-to-epoch duration. The results further reveal the highly fragmented sleep architecture of patients suffering from severe OSA. Overall, the results suggest that based on more detailed sleep analysis, severe OSA patients have considerably less REM sleep and slightly less N3 sleep than estimated via traditional epochs while the amount of N2 and wakefulness during the night is higher. Similarly, total sleep time and sleep efficiency were increasingly lower in OSA patients compared to traditional epochs when the epoch-to-epoch duration was decreased. Finally, the results with our detailed sleep staging approach expose larger differences in the sleep continuity between individuals without OSA and individuals in different OSA severity categories.
In the non-OSA group, the amount of wakefulness, N1 sleep, and REM decreased with shorter epoch-to-epoch durations. At the same time, the amount of N2 and N3 increased. Similar behavior was observed in the population with mild or moderate OSA. Conversely, the severe OSA group differed from the other groups: the amount of wakefulness and N1 increased while the amount of REM and N3 decreased with decreasing epoch-to-epoch duration. This illustrates that the sleep architecture of severe OSA patients is more disrupted than estimated with the traditional 30-second epoch-to-epoch duration, with more wakefulness and N1 sleep present during the night. Based on these results, the traditional 30-second epoch approach may be suitable for a healthy population but does not have the necessary detail for assessing the sleep of patients suffering from sleep disorders. Our more detailed sleep staging approach would appear to provide a more realistic representation of the highly disrupted sleep architecture which is easily overlooked when using the traditional non-overlapping 30-second epochs. This could ultimately lead to a more informed diagnosis of various sleep disorders and their effects on sleep architecture. Moreover, further studies linking detailed sleep architecture to daytime symptoms, cardiovascular risks, therapeutic outcomes, and perceived sleep quality are warranted.
In our sleep continuity analyses, only small differences between the healthy population and different OSA groups were seen using the traditional non-overlapping 30-second epoch approach. In contrast, the more detailed sleep staging approach revealed larger differences in the sleep continuity between the OSA groups and the healthy population, even surpassing manual scoring. For example, with 1-second epoch-to-epoch duration, the hazard ratio illustrating the risk of fragmented sleep was 1.14 (p = 0.39) for mild OSA, 1.59 (p < 0.01) for moderate OSA, and 4.13 (p < 0.01) for severe OSA. This shows that the risk of fragmented sleep increases with increasing OSA severity. Similarly, the Kaplan-Meier survival curves (Fig. 3) show that differences between all OSA groups become more apparent with more detailed sleep staging. However, it must be noted that with decreasing epoch-to-epoch duration, the mean duration of continuous sleep decreased in all the groups, as can be seen from the Kaplan-Meier curves. This is expected as the overlapping epochs provide a way to assess sleep architecture in a more detailed manner, capturing more transitions to wakefulness during the night. Moreover, decreasing the epoch-to-epoch duration even further to 0.5 seconds produced slightly smaller differences between the OSA severity groups. This may be due to too small differences between adjoining epochs or due to the small uncertainty always related to sleep staging; that is, epochs on the verge of being scored as wake may falsely be scored as such with short epoch-to-epoch durations. Investigating this effect and finding the optimal epoch-to-epoch duration warrants further studies.
We hypothesized that the current sleep staging procedure underestimates the degree of sleep fragmentation caused by OSA. These results support our hypothesis in severe OSA patients and only to some extent in mild and moderate OSA patients. The percentage of sleep stages, total sleep time, sleep efficiency, and wake after sleep onset were similar in non-OSA patients and patients with mild or moderate OSA. However, the survival analysis-based assessment of sleep continuity also revealed differences between non-OSA patients and patients with mild or moderate OSA. This can be seen both in the Kaplan-Meier survival curves (Fig. 3) and in the Cox regression (Table 4). A similar effect was also seen by Norman et al., who reported that no significant differences exist between the normal and mild OSA groups in traditional sleep parameters and differences only emerge when considering the sleep continuity with survival analysis [23]. However, it has to be noted that the division into OSA severity groups is highly artificial and simplistic and the severity assessment with AHI might not sufficiently reflect the physiological effects of OSA [24][25][26]. Therefore, it could be beneficial to study how sleep fragmentation varies when defining the OSA severity differently or even attempt to define the severity of OSA by using sleep fragmentation as a metric. Regardless, a more detailed analysis of sleep provides more insight into the sleep architecture and could be highly useful when assessing the sleep of OSA patients, and could supplement the evaluation of disease severity.
Our approach for detailed sleep analysis was based on overlapping 30-second epochs, which is both a strength and a limitation in the present study. The developed approach benefits from decades of clinical practice of sleep staging, is easily applicable to daily work, and its results can be interpreted in the same way as those of traditional manual sleep staging. However, the detailed analysis was still based on identifying a sleep stage for each epoch and does not provide a continuous scale of sleep depth in that sense. On the other hand, this approach allows comparison with the traditional 30-second epoch-based sleep staging and eases the interpretation of results over an arbitrary, continuous scale of sleep depth which has been previously attempted based on EEG frequency content [4,6]. The main advantage of the developed method over traditional sleep staging is the capability to observe the transitions between the discrete stages with better temporal resolution. Nevertheless, further studies are warranted to conduct similar approaches using shorter epoch durations without overlap to gain a deeper understanding of sleep microstructure. Furthermore, we only investigated the transitions between sleep stages and did not consider arousals from sleep. Arousals were discarded as the reliability of arousal scoring can be relatively low [27] and to focus solely on the sleep staging process and on how it could be improved. Therefore, future studies investigating the effect of arousals alongside the more detailed sleep staging are warranted.
The deep learning-based automatic sleep staging method was trained using manual sleep stage scoring. This manual scoring material can understandably suffer from human error and differences between scoring traditions of different scorers. However, all the manual scorers involved in scoring the study material participate regularly in intra- and inter-laboratory scoring concordance activities. Furthermore, in a previous study on inter-rater reliability at the sleep laboratory, the mean Cohen's kappa (standard error of the mean) of sleep staging was 0.74 (0.02) [28], illustrating high reliability between the scorers. The use of the deep learning approach can be considered one of the biggest strengths of our study. In contrast to manual scoring, the deep learning-based sleep staging is always conducted consistently and all the scorings are highly comparable. This is also the rationale behind why the comparison between traditional and detailed sleep staging was possible. The scoring of the same recordings with different approaches would have been highly biased and laborious with manual scoring. Moreover, our method can alleviate the biggest limitations of the manual sleep staging with a fast and easily applicable method requiring no increase in the working hours currently spent in clinical practice. Automatic sleep staging could reduce the workload and simultaneously produce the traditional sleep staging alongside the more detailed representation with overlapping epochs in a timely manner, generally in less than a minute.

V. CONCLUSION
More detailed sleep staging using a deep learning-based automatic method is a highly promising approach to gain further insight into the characteristics of the fragmented sleep architecture of patients suffering from sleep disorders. The detailed sleep staging emphasized the highly disrupted sleep architecture of patients with severe OSA which can be vastly underestimated with traditional sleep staging. These results highlight the need for a more detailed analysis of sleep architecture in daily clinical practice.