A Comparison of Signal Combinations for Deep Learning-Based Simultaneous Sleep Staging and Respiratory Event Detection

<italic>Objective:</italic> Obstructive sleep apnea (OSA) is diagnosed using the apnea-hypopnea index (AHI), which is the average number of respiratory events per hour of sleep. Recently, machine learning algorithms for automatic AHI assessment have been developed, but many of them do not consider the individual sleep stages or events. In this study, we aimed to develop a deep learning model to simultaneously score both sleep stages and respiratory events. The hypothesis was that the scoring and subsequent AHI calculation could be performed utilizing pulse oximetry data only. <italic>Methods:</italic> Polysomnography recordings of 877 individuals with suspected OSA were used to train the deep learning models. The same architecture was trained with three different input signal combinations (model 1: photoplethysmogram (PPG) and oxygen saturation (SpO<inline-formula><tex-math notation="LaTeX">$_{2}$</tex-math></inline-formula>); model 2: PPG, SpO<inline-formula><tex-math notation="LaTeX">$_{2}$</tex-math></inline-formula>, and nasal pressure; model 3: SpO<inline-formula><tex-math notation="LaTeX">$_{2}$</tex-math></inline-formula>, nasal pressure, electroencephalogram (EEG), oronasal thermocouple, and respiratory belts). <italic>Results:</italic> Model 1 reached performance comparable to models 2 and 3 for estimating the AHI (model 1 intraclass correlation coefficient (ICC) = 0.946; model 2 ICC = 0.931; model 3 ICC = 0.945), and REM-AHI (model 1 ICC = 0.912; model 2 ICC = 0.921; model 3 ICC = 0.883). The automatic sleep staging accuracies (wake/N1/N2/N3/REM) were 69%, 70%, and 79% with models 1, 2, and 3, respectively. <italic>Conclusion:</italic> AHI can be estimated using pulse oximetry-based automatic scoring. Explicit scoring of sleep stages and respiratory events allows visual validation of the automatic analysis, and provides information on OSA phenotypes.
<italic>Significance:</italic> Automatic scoring of sleep stages and respiratory events with a simple pulse oximetry setup could allow cost-effective, large-scale screening of OSA.

I. INTRODUCTION

Obstructive sleep apnea (OSA) is a breathing disorder characterized by recurrent, complete (apnea) or partial (hypopnea) breathing cessations during sleep due to upper airway obstruction [1]. Globally, over 900 million people are estimated to have OSA [2]. OSA is diagnosed based on the apnea-hypopnea index (AHI), which is defined as the number of apneas and hypopneas divided by the total sleep time (TST) [3]. Since the AHI does not take the type and duration of the respiratory events into account, additional parameters, such as the obstruction and desaturation duration and severity, have been suggested [4]. Another parameter that considers the duration of the respiratory events is the apnea-hypopnea time percentage (AHT-%), which is defined as the proportion of the total apnea and hypopnea duration of the TST [5]. Especially in the severe OSA group, there is significant variation in the AHT-% among patients with similar AHI [5].
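As an illustration, the two summary measures defined above can be computed as follows (a minimal sketch with hypothetical inputs; the helper names are ours):

```python
def ahi(n_events, tst_hours):
    """Apnea-hypopnea index: respiratory events per hour of sleep."""
    return n_events / tst_hours

def aht_percent(event_durations_s, tst_hours):
    """Apnea-hypopnea time percentage: proportion of total sleep time
    spent in apneas and hypopneas."""
    return 100.0 * sum(event_durations_s) / (tst_hours * 3600.0)
```

For example, 40 events of 30 s each over 8 h of sleep give an AHI of 5 (mild OSA) but an AHT-% of only about 4.2%, which illustrates why two patients with the same AHI can differ substantially in AHT-%.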
Since manual analysis of polysomnography (PSG) recordings is laborious and expensive, supervised machine learning-based automatic analysis methods have gained attention. However, one major problem in using machine learning methods that mimic human scorers is the limited intra- and interrater reliability between the scorers. In a study conducted by the American Academy of Sleep Medicine (AASM) including over 2500 scorers, the overall sleep staging agreement was 83%, with rapid eye movement (REM) sleep agreement being the highest (91%) and non-REM (NREM) stage 1 (N1) sleep agreement the lowest (63%) [6]. In another study, the overall agreement for sleep staging between seven human scorers was 82% (Cohen's κ = 0.76), with the agreement on REM sleep being the highest (κ = 0.91) and the agreement on N1 sleep the lowest (κ = 0.46) [7]. In a third study, the interrater agreement for sleep staging between nine international sleep centers (one scorer from each site) was κ = 0.63 [8].
Automatic sleep staging from electroencephalography (EEG) has been studied extensively, and deep learning methods have achieved accuracies similar to the agreement between human scorers [9], [10], [11]. More traditional approaches using handcrafted EEG features and simpler classifiers, such as k-nearest neighbors and random forests, have also been tested, with reported accuracies ranging from 72% to 87% [12]. Since the EEG measurement setup is quite complex and requires expert assistance for both in-laboratory and home measurements, alternative signal sources for sleep staging, such as electrocardiography (ECG) and body movements [13], [14], have been studied. In addition, automatic deep learning-based sleep staging from photoplethysmography (PPG) has been performed with promising results [15], [16], [17]. In particular, REM sleep has been detected from PPG with an accuracy of 87% using deep learning [16].
The interrater agreement on manual respiratory event scoring has also been studied, although these studies have used relaxed criteria, such as the epoch-wise agreement and overall event counts, to assess the agreement. In a study conducted by the AASM with over 3600 scorers, the agreement on whether an epoch contained obstructive apnea was 77%, and agreement on whether an epoch contained hypopnea was 65% [18]. In another study, the intraclass correlation coefficient (ICC) of the AHI between nine scorers was 0.95 [8]. However, the ICCs for total counts of apneas and hypopneas were 0.73 and 0.80, respectively [8]. Automatic scoring of apneas and hypopneas has been previously performed, for example, with long short-term memory (LSTM) networks using respiratory signals [19], [20]. However, these methods do not estimate the sleep architecture, which limits their diagnostic value, since the actual sleep time cannot be estimated, and sleep stage-related parameters cannot be calculated.
Many of the previously developed machine learning methods to assess OSA severity directly estimate either the AHI value or the OSA severity category [21], [22]. Such approaches are not well suited for clinical use because their outputs are difficult to validate manually. Another class of automatic OSA detection methods comprises the automatic event scoring algorithms [23], [24], which currently need to be revised by a qualified sleep technologist [25]. However, the potential of machine learning to augment and assist expert evaluation has been recognized [26].
By first scoring both sleep stages and respiratory events (i.e. apneas and hypopneas), and then calculating the parameters derived from them (such as the AHI), the model predictions can be validated through visual inspection. In addition, more information on the OSA phenotype can be acquired from the explicit scorings. For example, REM-related OSA is a phenotype in which the respiratory events occur mostly during REM sleep [27]. To assess REM-related OSA, the sleep stages during the respiratory events need to be estimated.
To the best of our knowledge, automatic scoring of both sleep stages and respiratory events has been conducted in only one previous study [28]. However, in that study separate models were used for the tasks, and the type of respiratory events was not taken into account [28]. Moreover, the methods developed in [28] utilized full PSG recordings including EEG, which is currently not widely included in the home sleep apnea tests (HSAT). It would be especially useful to be able to detect both sleep stages and respiratory events with a simple measurement setup such as a finger pulse oximeter, since this would allow easy at-home screening of OSA and its different sleep stage-related phenotypes.
The aim of this study was to evaluate the applicability of pulse oximetry for deep learning-based automatic scoring of both sleep stages and respiratory events, and for subsequent estimation of the AHI and AHT-%. This was done by training the same deep learning architecture for simultaneous scoring of sleep stages and respiratory events with three different signal combinations and comparing the model performances. The first hypothesis was that the deep learning models are capable of performing the simultaneous segmentation tasks with different temporal resolutions (30-s for sleep staging, 1-s for respiratory event detection). The second hypothesis was that the estimation of the AHI and AHT-% could be done for both REM and NREM sleep separately utilizing only pulse oximetry data.

II. METHODS

A. Data
The dataset used in this study consisted of 933 consecutive PSG recordings of suspected OSA patients collected in 2015-2017 at the Sleep Disorders Centre, Princess Alexandra Hospital (Brisbane, Australia). The recordings were collected and analyzed using Compumedics Grael devices with Profusion 4.1 software (Compumedics, Abbotsford, Australia). Nonin Xpod 3011 pulse oximeters (Nonin Medical Inc, Plymouth, MN, USA) were used for the PPG recordings. The sleep stages and respiratory events were manually scored at the Sleep Disorders Centre, Princess Alexandra Hospital, according to the recommendations of version 2.4 of the AASM rules [3]. In particular, the recommended hypopnea criterion was used, in which a reduction in flow of 30% or greater lasting 10 seconds or longer is accompanied by either a 3% desaturation from baseline or an EEG arousal at the termination of the event [3]. After excluding corrupted signals and recordings containing less than 1 h of sleep, 877 recordings were included in the final analyses. The data usage was approved by the Institutional Human Research Ethics Committee at the Princess Alexandra Hospital (HREC/16/QPAH/021 and LNR/2019/QMS/54313).
The data was randomly split into separate training (n = 710), validation (n = 79), and test (n = 88) sets. First, 10% of the PSG recordings were sampled to the test set. Then, 10% of the remaining recordings were sampled to the validation set, and the rest of the recordings were used as the training set. The validation set was used to monitor the training process and to choose the optimal model for further analyses. To ensure that no information from the test set leaked to the model, the test set was only used to report the final model's performance and to calculate the automatic parameters such as the AHI and AHT-%. OSA severity was defined using the AHI (no OSA: AHI < 5; mild OSA: 5 ≤ AHI < 15; moderate OSA: 15 ≤ AHI < 30; severe OSA: AHI ≥ 30). Demographic information on the whole population and the different subpopulations is provided in Table I.

Three models with different input signal combinations were tested. Model 1 utilized only pulse oximetry data (PPG and SpO2 signals). Model 2 utilized PPG, SpO2, and nasal pressure signals, to quantify the performance improvement when a breathing signal that can be used to reliably detect the respiratory events is added to pulse oximetry. Model 3 utilized SpO2, nasal pressure, EEG (C4-M1 channel), oronasal thermocouple, and the sum of the thoracic and abdominal respiratory inductance plethysmography (RIP) belts (RIPsum). This combination was chosen to compare the performance of models 1 and 2 with a setup that could be used to reliably detect both sleep stages and respiratory events, also manually by a human scorer.
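The OSA severity categorization above can be written as a small helper (a sketch; the thresholds are exactly those listed in parentheses):

```python
def osa_severity(ahi):
    """Map an AHI value to the severity categories used in this study."""
    if ahi < 5:
        return "no OSA"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"
```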
All signals were exported from the PSG software at the original sampling frequency without any additional filtering. The exported signals were downsampled to 32 Hz (except SpO2, which was upsampled since it was originally sampled at 16 Hz) to reduce computational complexity and memory consumption without sacrificing classification performance. All input signals were resampled to the same frequency to simplify the model design. An 8th-order Chebyshev type I lowpass zero-phase filter with a cutoff frequency of 25.6 Hz and a passband ripple of 0.05 dB was used for antialiasing. In addition, the signals were standardized to zero mean and unit variance. The downsampling and standardization were performed on the whole-night signals, and the signal segments were fed to the models as such without any further transformations.
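The preprocessing described above could be sketched as follows with SciPy (an illustrative reimplementation, not the authors' exact code; the function name and the rational-resampling detail are our assumptions):

```python
from fractions import Fraction

import numpy as np
from scipy import signal

def preprocess(x, fs_in, fs_out=32):
    """Resample a whole-night signal to fs_out Hz and standardize it.

    Mirrors the description in the text: an 8th-order Chebyshev type I
    low-pass zero-phase filter (25.6 Hz cutoff, 0.05 dB passband ripple)
    for antialiasing, followed by resampling and z-score normalization.
    """
    if fs_in > fs_out:  # antialias only when downsampling
        sos = signal.cheby1(8, 0.05, 25.6, btype="low", fs=fs_in, output="sos")
        x = signal.sosfiltfilt(sos, x)  # zero-phase (forward-backward) filtering
    # rational resampling to the target frequency
    frac = Fraction(int(fs_out), int(fs_in))
    x = signal.resample_poly(x, frac.numerator, frac.denominator)
    # standardize the whole-night signal to zero mean and unit variance
    return (x - x.mean()) / x.std()
```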

B. Deep Learning
The source code for the models used in this study is available online.1 A schematic diagram of the deep learning architecture used in this study is shown in Fig. 1. The architecture was based on the U-time architecture introduced by Perslev et al. [10], which in turn is based on U-net [31]. U-time is an encoder-decoder structure consisting of blocks of consecutive convolutional, batch normalization, and pooling layers. To preserve the low-level features, skip connections are added from the encoder layers to the decoder layers corresponding in size. The decoder output is aggregated to a lower temporal resolution and classified into the final representation by a segment classifier, which performs average pooling to downsample the decoder's dense feature maps to the final output resolution (30-s for sleep staging in the original paper [10]). The segment classifier applies pointwise convolution before and after the average pooling, and finally applies a softmax activation to produce the class confidence scores.
In this study, the original U-time architecture was modified in three major ways. First, the network had two outputs, and thus two segment classifiers, since it simultaneously performed sleep staging and respiratory event detection. Notably, the outputs had different temporal resolutions (30-s for sleep staging, 1-s for respiratory event detection). Second, a channelwise attention mechanism called Squeeze & Excitation (S&E) [29] was added within each block of the architecture. S&E has been shown to improve the performance and generalizability of state-of-the-art convolutional neural networks (CNN) [29]. Third, to capture information efficiently on multiple scales, a technique called Atrous Spatial Pyramid Pooling (ASPP) [30] was adopted. ASPP applies multiple dilated convolutions with different dilation rates in parallel to produce feature maps of the same size with different receptive fields. The feature maps are then concatenated, and a pointwise convolution is performed to learn the dependencies between the scales [30].
The model outputs were the softmax scores for each class (wake/N1/N2/N3/REM for every 30-s segment in sleep staging; no event/hypopnea/apnea for every 1-s segment in respiratory event detection). Obstructive, central, and mixed apneas were not separated. When scoring the sleep stages, the class with the highest softmax score was chosen. In respiratory event detection, a 1-s segment was scored as apnea or hypopnea if the softmax score for no event was lower than 0.5, after which the respiratory event type with the higher softmax score was chosen. This policy was chosen for respiratory event detection instead of simply taking the highest softmax score because there may be cases where the model is unsure about the respiratory event type but it is still desirable to score the event. For example, if the softmax scores for no event, hypopnea, and apnea were 0.35, 0.31, and 0.34, respectively, the 1-s segment was scored as apnea. Finally, consecutive 1-s segments of apnea or hypopnea were combined, and if the length of the combined segment was ten seconds or more, it was considered a respiratory event in the further analyses. The respiratory event was considered an apnea if it was labeled as apnea for more than half of its duration; otherwise, the event was scored as a hypopnea.
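The decoding policy above can be sketched as follows (an illustrative reimplementation; the column order and function name are assumptions):

```python
import numpy as np

def decode_events(softmax, min_len=10):
    """Convert per-second softmax scores into respiratory events.

    `softmax` is an (n_seconds, 3) array with columns assumed to be
    (no event, hypopnea, apnea). Returns (start_s, end_s, type) tuples,
    following the decoding policy described in the text.
    """
    is_event = softmax[:, 0] < 0.5            # "no event" confidence below 0.5
    is_apnea = softmax[:, 2] > softmax[:, 1]  # more confident event type per second
    events, start = [], None
    for i, flag in enumerate(np.append(is_event, False)):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:  # keep combined segments of >= 10 s only
                # apnea if labeled as apnea for more than half of the duration
                kind = "apnea" if is_apnea[start:i].sum() > (i - start) / 2 else "hypopnea"
                events.append((start, i, kind))
            start = None
    return events
```

With the example scores from the text (0.35/0.31/0.34), a second is flagged as an event and typed as apnea, since 0.35 < 0.5 and 0.34 > 0.31.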
The model weights were adjusted using stochastic gradient descent with momentum and decoupled weight decay [32]. Categorical cross-entropy loss was used. To adjust the training hyperparameters (learning rate, batch size, momentum, and weight decay), a disciplined approach described in [33] was adopted. In this approach, a one-cycle learning rate policy was used, in which the learning rate was first set to a small initial value, then increased linearly to a maximum, and then decreased back to the initial value. Finally, the learning rate was decreased exponentially until it was two orders of magnitude smaller than the initial learning rate. The initial and maximum learning rates were chosen as 0.05 and 0.5, respectively, using a learning rate range test [33].
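A minimal sketch of the one-cycle schedule described above (the split between the triangular cycle and the final exponential decay phase is an assumed parameter; the text does not state it):

```python
def one_cycle_lr(step, total_steps, lr_min=0.05, lr_max=0.5, cycle_frac=0.8):
    """One-cycle learning rate with a final exponential decay.

    Linear ramp lr_min -> lr_max -> lr_min over the first `cycle_frac`
    of training, then exponential decay to lr_min / 100 (two orders of
    magnitude below the initial rate) over the remaining steps.
    """
    cycle_steps = int(total_steps * cycle_frac)
    half = cycle_steps // 2
    if step < half:                      # linear warm-up to the maximum
        return lr_min + (lr_max - lr_min) * step / half
    if step < cycle_steps:               # linear decay back to the initial rate
        return lr_max - (lr_max - lr_min) * (step - half) / half
    decay_steps = total_steps - cycle_steps
    return lr_min * 0.01 ** ((step - cycle_steps) / decay_steps)
```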
A model with the same architecture and training process was trained separately for each input signal combination. Models 1-3 were separately trained for 100 epochs. During each model training epoch, the whole training set was iterated. First, the training set was randomly shuffled so that the ordering of the PSG recordings was different for every epoch. Then, the data was iterated in batches of 8 recordings. For each recording, two consecutive hours of data starting at a randomly chosen 30-second sleep staging epoch were sampled and fed to the network. The shuffling and sampling was repeated for every training epoch.
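The per-recording sampling step could look roughly like this (a sketch; the array layout and names are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_window(signals, fs=32, hours=2, epoch_s=30):
    """Sample `hours` consecutive hours of data starting at a randomly
    chosen 30-s sleep staging epoch (training-time sampling sketch).

    `signals` is assumed to be a (channels, samples) array at `fs` Hz.
    """
    win = hours * 3600 * fs
    n_starts = (signals.shape[-1] - win) // (epoch_s * fs)
    start = int(rng.integers(0, n_starts + 1)) * epoch_s * fs
    return signals[..., start:start + win]
```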

C. Statistical Analyses
Classifier performances for sleep staging and respiratory event detection were assessed using precision, recall, and F1-scores. For a given class, precision was calculated as TP/PP, where TP denotes the number of correctly detected segments (30-s for sleep staging, 1-s for respiratory event detection), and PP denotes the number of segments predicted as belonging to the class. Similarly, recall was calculated as TP/P, where P denotes the number of segments belonging to the class in the manual scorings. Finally, the F1-score was calculated as 2 × (precision × recall)/(precision + recall). The metrics were calculated for each class (wake/N1/N2/N3/REM; no event/hypopnea/apnea) separately, and weighted averages of the classwise metrics were also calculated by multiplying the classwise scores by the proportion of segments manually scored as belonging to the class out of the total number of segments before averaging. In addition, confusion matrices and overall Cohen's kappa (κ) values for each model are provided.
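The metrics defined above can be computed directly (an illustrative NumPy sketch equivalent to the Scikit-learn weighted averaging used in this study):

```python
import numpy as np

def classwise_metrics(y_true, y_pred, classes):
    """Per-class precision (TP/PP), recall (TP/P), and F1, plus
    support-weighted averages, as defined in the text."""
    scores, weights = {}, []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))  # correctly detected segments
        pp = np.sum(y_pred == c)                    # segments predicted as class c
        p = np.sum(y_true == c)                     # segments manually scored as c
        prec = tp / pp if pp else 0.0
        rec = tp / p if p else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
        weights.append(p / y_true.size)             # class proportion in manual scoring
    weighted = tuple(sum(w * scores[c][i] for w, c in zip(weights, classes))
                     for i in range(3))
    return scores, weighted
```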
The AHT-%, apnea time percentage (AT-%), hypopnea time percentage (HT-%), AHI, apnea index (AI), and hypopnea index (HI) calculated from the manual and automatic scorings were compared using Bland-Altman plots. Alongside the plots, intraclass correlation coefficients (ICC) and mean absolute errors (MAE) are also provided. The statistical significance of the differences between the values calculated from the manually and automatically scored recordings was assessed using p-values obtained with the Wilcoxon signed-rank test.
Interrater agreement between the manual PSG-based scoring and the automatic scoring of each of the models was assessed using the ICC [34]. Of the six different ICC versions described in [34], ICC(1,1) was used because, although each PSG was scored by a single human scorer, there was more than one scorer in total, and no information was available on which scorer processed which PSG recordings. Python 3.8.11 was used for all analyses. The deep learning models were implemented with Tensorflow 2.4.1 and its Keras API [35]. Statistical testing was performed using SciPy 1.6.0 [36]. Classification metrics were calculated using Scikit-learn 0.23.1 [37]. Bland-Altman plots and ICCs were calculated with Pingouin 0.4.0 [38].
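For reference, ICC(1,1) reduces to a one-way ANOVA computation (a NumPy sketch of the Shrout-Fleiss formulation; the study itself used Pingouin):

```python
import numpy as np

def icc_1_1(ratings):
    """ICC(1,1) from an (n targets x k raters) matrix, using the
    one-way random-effects ANOVA formulation of Shrout and Fleiss."""
    n, k = ratings.shape
    grand = ratings.mean()
    target_means = ratings.mean(axis=1)
    msb = k * np.sum((target_means - grand) ** 2) / (n - 1)               # between-target mean square
    msw = np.sum((ratings - target_means[:, None]) ** 2) / (n * (k - 1))  # within-target mean square
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfect agreement between raters gives an ICC of 1, and any within-target disagreement pushes the value below 1.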

III. RESULTS

A. Sleep Staging and Respiratory Event Detection
Confusion matrices of manual vs. automatic sleep staging for each model are provided in Fig. 2(a). Precision, recall, and F1-scores are reported in Table II. Cohen's κ values were 0.59, 0.60, and 0.74 for models 1-3, respectively. With all signal combinations, precision, recall, and F1-score were the highest for wake and REM sleep. Adding the nasal pressure signal to the pulse oximetry signals (model 2) did not improve the sleep staging performance, but when multiple signals including the EEG were used (model 3), the performance improved significantly. In particular, classification of N1 and N3 sleep was more accurate with the most comprehensive signal combination, although the accuracy of classifying N1 sleep was generally low across all models.

Fig. 2(b) shows the confusion matrices for manual vs. automatic classification of apneas and hypopneas for each second of all recordings. The corresponding precision, recall, and F1-scores are listed in Table III. Cohen's κ values were 0.54, 0.63, and 0.65 for models 1-3, respectively. The results show that combining the nasal pressure signal with pulse oximetry data improved the detection of apneas and hypopneas significantly (model 1 vs. model 2). When the EEG, oronasal thermocouple, and RIPsum signals were added, the performance of automatic respiratory event detection increased only slightly (model 2 vs. model 3).
Box plots of the distributions of patient-wise F1-scores in respiratory event detection are shown in Fig. 3. When the apneas and hypopneas were not differentiated, the performance of model 1 was close to that of the more comprehensive signal combinations (Fig. 3(a)). Similar results can be seen when only hypopneas were considered (Fig. 3(c)). In contrast, model 1 performed markedly worse in apnea detection (Fig. 3(b)). The performance of model 2 in terms of patient-wise F1-scores was almost on par with model 3.

B. AHI and Apnea-Hypopnea Time
Bland-Altman plots for AHI, AI, and HI for manual vs. automatic scoring for each model are shown in Fig. 4. The ICC, MAE, and p-value for statistical significance of the differences between manual and automatic scoring are reported with the plots. The results show that based on the ICCs, model 1 reached similar performance on AHI estimation as models 2 and 3. The mean of differences shown in the Bland-Altman plots was higher with model 1 compared to models 2 and 3. The downside of using only pulse oximetry becomes apparent when looking at the apnea index (AI). Model 3 yielded superior results (ICC = 0.971) compared to model 1 (ICC = 0.793). Model 3 also performed better for AI estimation when compared to model 2 (ICC = 0.914). When looking at the HI, a similar trend of model performance increasing along with added input signals was found but the relative differences between the models were smaller (ICC = 0.769 with model 1, ICC = 0.829 with model 2, and ICC = 0.883 with model 3).
Results similar to those for the AHI, HI, and AI can also be seen in the Bland-Altman plots of the AHT-%, AT-%, and HT-% (Fig. 5). When the apneas and hypopneas were not differentiated, model 1 performed comparably to model 3 (ICC = 0.932 for both models). However, unlike with the AHI, there was no difference between the models in the mean of the differences between the manual and automatic AHT-%. When only the apnea time (AT-%) was considered, the differences between the models became apparent: ICC = 0.734 for model 1, whereas model 3 yielded ICC = 0.960. The gap in performance can also be clearly seen in the 95% intervals of the Bland-Altman plots (Fig. 5).
The ICC and MAE were also calculated for the manual vs. automatic AHI separately for REM and NREM sleep. In REM-AHI estimation, models 1 and 2 performed similarly (model 1 ICC = 0.912; model 2 ICC = 0.921), while the performance of model 3 was slightly lower (model 3 ICC = 0.883).

IV. DISCUSSION
In this study, a deep learning architecture was developed to simultaneously score both sleep stages and respiratory events. Pulse oximetry-based automatic scoring (model 1) was compared to scoring based on pulse oximetry combined with nasal pressure (model 2), and to scoring based on a comprehensive signal combination also applicable for manual scoring (model 3). The results show that the proposed architecture was able to perform the multi-output segmentation task with different temporal resolutions (30-s for sleep staging, 1-s for respiratory event detection), which was the first hypothesis of this study. The pulse oximetry-based model achieved similar performance compared to models 2 and 3 in terms of the AHI and AHT-% calculated from the automatic scorings. In addition, the pulse oximetry-based model performance was on par with models 2 and 3 when estimating the REM-AHI and NREM-AHI, which was in line with the second hypothesis of this study.

Fig. 3. Patient-wise F1-scores for respiratory event detection in the test set (n = 88). The pulse oximetry-based model had poor performance on apnea detection. When the type of respiratory events was not taken into account, the performance relative to the additional signal setups was significantly better. Model 1 utilized photoplethysmography (PPG) and oxygen saturation (SpO2) signals. Model 2 utilized PPG, SpO2, and nasal pressure signals. Model 3 utilized SpO2, nasal pressure, electroencephalography, oronasal thermocouple, and the sum of the thoracic and abdominal respiratory inductance plethysmography signals.
The high performance of the pulse oximetry-based model when estimating the AHI and AHT-% was a surprising finding, since the more comprehensive signal combinations performed clearly better when considering only apneas (AI, AT-%) or only hypopneas (HI, HT-%). This was also seen in the patient-wise F1-scores (Fig. 3); when the type of respiratory events was not taken into account, the pulse oximetry-based model performed well compared to the other models. One probable cause for this is that although apneas and hypopneas are often related to oxygen desaturations seen in the SpO2 signal, it is hard to determine without airflow information whether the obstruction is complete or partial. An example of the pulse oximetry-based model 1 confusing apneas and hypopneas can be seen in Fig. 6. Moreover, even professional human scorers may be prone to confusing apneas and hypopneas. For example, in one earlier study the ICC of the AHI was reported to be 0.95, whereas the ICCs for the total apnea and total hypopnea counts were 0.73 and 0.80, respectively [8].
It is notable that the patient-wise F1-scores of the pulse oximetry-based model 1 were significantly lower when only apneas were considered than when only hypopneas were considered. One possible explanation is that most of the patients experienced only a few apneas, while the number of hypopneas was more evenly distributed in the studied population (median AI 1.2 vs. median HI 14.6, Table I). According to the AASM rules, oxygen desaturations are not considered when scoring apneas, and a single apnea of short duration may not cause a desaturation. This may explain the challenges model 1 had in detecting occasional apneas, leading to lower F1-scores in patients with a low apnea index. In severe cases, where the apneic events occur frequently and lead to clearly visible desaturations, model 1 detected the apneas most of the time.
Fig. 6. An example where the pulse oximetry-based model confused apneas and hypopneas. This supports the results showing that the pulse oximetry-based model estimated the apnea-hypopnea index well but had issues when the type of respiratory events was taken into account. All models correctly classified REM sleep and wakefulness, but none of the models correctly detected the single epoch of N1 sleep. REM = rapid eye movement; N1-N3 = non-REM stages 1-3; PPG = photoplethysmography; SpO2 = oxygen saturation; EEG = electroencephalography; RIPsum = sum of thoracic and abdominal respiratory inductance plethysmography signals. Model 1 utilized PPG and SpO2 signals. Model 2 utilized PPG, SpO2, and nasal pressure signals. Model 3 utilized SpO2, nasal pressure, EEG, oronasal thermocouple, and RIPsum signals.

In addition to good performance when estimating the total AHI, the pulse oximetry-based model also performed well for estimating the REM-AHI when compared to models 2 and 3. This was also our hypothesis, since we have previously shown that the REM classification accuracy of pulse oximetry-based deep learning models is almost as high as that of EEG-based models [16]. Since REM-related OSA is linked to an increased risk of hypertension [39], it is important to be able to differentiate between the sleep stages when developing novel screening tools. REM-related OSA is traditionally defined as an overall AHI ≥ 5, REM-AHI/NREM-AHI ≥ 2, and NREM-AHI < 15 [40]. Since the amount of REM sleep is only a fraction of the amount of NREM sleep (Table I), merely estimating the total AHI may lead to overlooking REM-related OSA.
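The traditional REM-related OSA criteria above amount to a simple rule (a sketch; the zero-NREM-AHI guard is our assumption):

```python
def is_rem_related_osa(ahi, rem_ahi, nrem_ahi):
    """Traditional REM-related OSA rule: overall AHI >= 5,
    REM-AHI/NREM-AHI >= 2, and NREM-AHI < 15 [40]."""
    if nrem_ahi == 0:  # guard against division by zero (our assumption)
        return ahi >= 5 and rem_ahi > 0
    return ahi >= 5 and rem_ahi / nrem_ahi >= 2 and nrem_ahi < 15
```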
The deep learning architecture used in this study supports arbitrary temporal resolutions of the outputs as long as the resolution does not exceed the sampling frequency of the input signals. This is very useful when the data contain scorings at different temporal resolutions, such as the 30-s epochs used in sleep staging and the manually scored respiratory events, which in this study were scaled down to a 1-s resolution. Another beneficial feature of the architecture is that the segment classifier performs a pointwise convolution on the decoder output, which has the same temporal resolution as the inputs, i.e., the sampling frequency. These high-resolution features are then simply averaged before the final classification, and the number of channels in the dense feature maps before averaging equals the number of output classes. Thus, as suggested in the original U-time paper [10], the dense feature maps could be used to study the microstructure of the sleep architecture: by comparing the dense representations to the averaged 30-s segments, one can inspect how the predicted sleep stage information fluctuates within the 30-s segments. However, the usability of these features warrants further studies.
One limitation of this study is that apneas and hypopneas were not manually scored when the sleep stage was scored as wake. This is not necessarily a scoring error, since the AASM scoring rules suggest discarding respiratory events if the whole duration of the event has been scored as wake [3]. An example of this can be seen in the upper part of Fig. 1, where the manually scored apneas are highlighted in red in the input signals (PPG, SpO2, and nasal pressure), and the predicted apneas are similarly highlighted in the respiratory events output. In the middle of the example figure, there are clearly visible apneas and corresponding oxygen desaturations that were not manually scored but were detected by the automatic model. Visual inspection of the PSG recordings with large errors between the manual and automatic AHI suggests that a significant proportion of the scoring mismatches shared this same underlying issue. The problem is twofold: since the models were trained with the manual scorings, this limitation also propagated to the automatic models, and there were cases where the automatic model misclassified a segment as wake and discarded clearly visible respiratory events during that segment. One option to alleviate this problem and improve the consistency of the models could be to revise the scoring rules so that respiratory events would be scored even if the whole segment was scored as wake. These events could then be discarded from subsequent analyses, such as the AHI calculation, if needed.
Another limitation of this study is that a single split into training, validation, and test sets was used. We chose this option to simplify the experimental setup, since training deep learning models is computationally heavy and we were comparing three separate models. There were more men and patients with severe OSA in the test set than in the training and validation sets (Table I). We chose to keep this initial random split, since the difference was randomly produced, and we wanted to avoid any data manipulation by redoing the splits to obtain more even distributions.
To our knowledge, there is no previously published research on machine learning-based simultaneous automatic scoring of both sleep stages and respiratory events with a single deep learning model. There has been work on using deep learning for both sleep staging and respiratory event detection, but separate models were used for the different tasks [28]. The reported sleep staging accuracy using two-channel EEG was 81.9%, and the r² correlation between the automatic and manual AHI was 0.85 [28]. Since the datasets in this study and in [28] were different, and since the dataset used in [28] was significantly larger (10 000 PSG recordings), the results cannot be directly compared.
It can be reasoned that simultaneous scoring has synergy benefits when training deep learning models. For example, respiratory events will likely not occur when the person is awake, so knowledge of the sleep architecture could be used to set constraints for the respiratory event detection algorithms. The same applies in reverse: knowledge of the occurrence of respiratory events could be used to constrain sleep staging. It can be assumed that modern deep learning methods are able to learn these constraints implicitly when trained to score both sleep stages and respiratory events simultaneously. However, the convention of discarding events during manually scored wakefulness may confuse the deep learning algorithms, which is one possible explanation for why the model 3 sleep staging accuracy (79.3%) was moderately lower than that of EEG-based sleep staging with a combination of CNN and LSTM networks (82.9%) previously performed on the same dataset [9]. On the other hand, the pulse oximetry-based model 1 achieved a slightly higher sleep staging accuracy (69.2%) than previously reported PPG-based automatic sleep staging with a combination of CNN and LSTM (68.7%) [16].
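While the paragraph above argues that a multi-task network can learn the wake/event constraint implicitly, the constraint can also be written out explicitly as a post-processing step. The following sketch is illustrative only (it is not part of the proposed architecture); the array layout, class ordering, and threshold are assumptions.

```python
import numpy as np

# Illustrative post-processing sketch of the constraint discussed above:
# suppress predicted respiratory events wherever wake is the likely stage.
# Not part of the proposed model; a multi-task network may learn this
# implicitly. Rows are time steps, columns are per-class probabilities.

def mask_events_during_wake(event_probs, stage_probs, wake_idx=0, thresh=0.5):
    """Force the 'no event' class at time steps predicted as wake.

    event_probs: (T, n_event_classes), column 0 assumed to be 'no event'
    stage_probs: (T, n_stage_classes), column wake_idx assumed to be wake
    """
    wake = stage_probs[:, wake_idx] > thresh  # boolean mask over time steps
    out = event_probs.copy()
    out[wake, 1:] = 0.0  # zero the apnea/hypopnea probabilities
    out[wake, 0] = 1.0   # assign full probability to 'no event'
    return out
```

Applied to the model outputs, this would make the automatic scoring consistent with the convention of not scoring events during wake, at the cost of inheriting any sleep staging errors, as noted in the limitation above.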
Comparing the respiratory event detection performance to previous studies is problematic, since they have often reported aggregated metrics, such as whether a respiratory event was present in a 30-s epoch, which does not give accurate information on the performance of a segmentation algorithm. In addition, many studies reporting automatic AHI values have not estimated the sleep time, but instead used the total recording time or the manually scored total sleep time in the calculations, and are thus not comparable to this study. The only study found to report respiratory event detection performance using predictions at every time point is [19], in which the reported accuracies at 0.5-s resolution for no event/apnea/hypopnea were 87% / 68% / 49%. Those accuracies were lower than in this study (97% / 71% / 55% for model 3), but the signal combination used in [19] was different, consisting of an oronasal thermocouple and thoracic and abdominal RIP belts.
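The difference between epoch-level and sample-level evaluation can be made concrete with a small sketch. This example is illustrative only and uses synthetic labels; the 0.5-s sampling and the single shifted event are assumptions chosen to expose the effect.

```python
import numpy as np

# Illustrative sketch: epoch-level presence metrics can hide segmentation
# errors that sample-level accuracy exposes. Synthetic per-sample labels
# (0 = no event, 1 = event) at an assumed 0.5-s resolution.

def epoch_presence(labels, epoch_len=60):  # 60 samples = one 30-s epoch
    """Collapse per-sample labels to per-epoch event presence."""
    n = len(labels) // epoch_len
    return np.array([labels[i * epoch_len:(i + 1) * epoch_len].max()
                     for i in range(n)])

ref = np.zeros(600, dtype=int)
ref[70:110] = 1    # one 20-s reference event
pred = np.zeros(600, dtype=int)
pred[80:120] = 1   # predicted event shifted by 5 s, same epoch

sample_acc = (ref == pred).mean()
epoch_acc = (epoch_presence(ref) == epoch_presence(pred)).mean()
print(round(sample_acc, 3), epoch_acc)  # → 0.967 1.0
```

Here the epoch-level metric reports perfect agreement even though the predicted event boundaries are off, which is why sample-level reporting, as in [19] and this study, is more informative for segmentation.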
A challenge with using pulse oximetry alone for diagnostic purposes is that the model's predictions are hard to validate visually if airflow and EEG are not recorded. Moreover, although the performance of the pulse oximetry-based model for estimating the AHI was good, differentiating apneas from hypopneas remains problematic. Thus, we propose that pulse oximetry-based deep learning models could be used for OSA screening, whereas a more comprehensive recording setup would still be needed for diagnosis. The deep learning architecture proposed in this study is also suitable for automatic scoring of full PSGs, for example to rapidly analyze large amounts of PSG data for research purposes. The computational resources needed to utilize the models are moderate: using 8 cores of an AMD Ryzen Threadripper 2990WX CPU and an NVIDIA GeForce GTX 1080 Ti GPU, analyzing a full-night recording takes less than half a second on average, and using only the CPU cores, less than two seconds.
As a possible future direction, large quantities of unlabeled pulse oximetry data could be collected, since pulse oximetry signals are easy to measure and pulse oximetry is already widely present both in clinical monitoring applications and in consumer-grade wearable devices. With unsupervised and semisupervised learning, for example using generative adversarial networks [41], the unlabeled data could be utilized to make the pulse oximetry-based models more accurate and to help them generalize to heterogeneous pulse oximetry hardware. Another future direction would be to study the effect of patient characteristics, such as age, gender, BMI, and the history of comorbidities and medications, on the model performance. However, including patient characteristics as model inputs would limit the applications, since all included characteristics would be required to evaluate the models on new data.

V. CONCLUSION
Pulse oximetry recordings can be used for simultaneous automatic scoring of sleep stages and respiratory events. Combining the pulse oximetry signals with the nasal pressure signal significantly improves the accuracy of differentiating apneas from hypopneas. When EEG, an oronasal thermocouple, and respiratory belts are further added to the combination, the sleep staging accuracy improves significantly, although the effect on respiratory event detection accuracy is minor. The proposed pulse oximetry-based model estimates the AHI in both REM and NREM sleep nearly as accurately as the models trained with the respiratory signals and EEG. Thus, a pulse oximetry-based approach to automatic OSA screening is a promising tool for better reaching the subjects in need of OSA treatment. In addition, the proposed deep learning architecture can be used for fast automatic scoring of full PSG recordings.