On the Generalization of Sleep Apnea Detection Methods Based on Heart Rate Variability and Machine Learning

Obstructive sleep apnea (OSA) is a respiratory disorder highly correlated with severe cardiovascular diseases that has unleashed the interest of hundreds of experts aiming to overcome the elevated requirements of polysomnography, the gold standard for its detection. In this regard, a variety of algorithms based on heart rate variability (HRV) features and machine learning (ML) classifiers have been recently proposed for epoch-wise OSA detection from the surface electrocardiogram signal. Many researchers have employed freely available databases to assess their methods in a reproducible way, but most were purely tested with cross-validation approaches and even some using solely a single database for training and testing procedures. Hence, although promising values of diagnostic accuracy have been reported by some of these methods, they are suspected to be overestimated and the present work aims to analyze the actual generalization ability of several epoch-wise OSA detectors obtained through a common ML pipeline and typical HRV features. Precisely, the performance of the generated OSA detectors has been compared on two validation approaches, i.e., the widely used epoch-wise, $k$ -fold cross-validation and the highly recommended external validation, both considering different combinations of well-known public databases. Regardless of the used ML classifiers and the selected HRV-based features, the external validation results have been 20 to 40% lower than those obtained with cross-validation in terms of accuracy, sensitivity, and specificity. Consequently, these results suggest that ML-based OSA detectors trained with public databases are still not sufficiently general to be employed in clinical practice, as well as that larger, more representative public datasets and the use of external validation are mandatory to improve the generalization ability and to obtain reliable assessment of the true predictive power of these algorithms, respectively.


I. INTRODUCTION
The obstructive sleep apnea (OSA) syndrome is a condition characterized by repetitive episodes of respiratory arrests during sleep [1]. According to recent global screenings, OSA The associate editor coordinating the review of this manuscript and approving it for publication was Venkata Rajesh Pamula . prevalence is considered high, ranging from 9 to 38% in the general population [2]. Patients suffering from this disorder generally describe excessive sleepiness and a characteristic loud snoring during sleep. These symptoms often lead to bad school performance, job loss, family or marital problems, and road accidents in the most severe cases [3]. Besides, there exists substantial evidence on OSA correlation with major cardiovascular diseases, such as atrial fibrillation, stroke, and heart failure [4]. Such comorbilities today represent the main form of non-communicable diseases and, since they are responsible of millions of deaths every year, the early detection of OSA has become a major public health concern [5]. However, most OSA cases are still undiagnosed [6], which may be in part because polysomnography (PSG) remains as the gold standard for its detection [7]. This diagnostic procedure is a resource-intensive and time-consuming method, which limits its global coverage to a large extent and leads to delayed diagnosis and increased waiting lists [8]. Hence, many researchers have committed to find a convenient alternative method to PSG [9], [10].
For that purpose, OSA detection from single or a limited set of physiological signals among those involved in PSG have been widely analyzed in the last years [9]. The problem has been addressed from two different perspectives, such that whilst some works have specifically focused on OSA severity diagnosis, others have pursued the development of epoch-wise OSA detectors [9]. Generally, OSA severity is quantified by means of the apnea-hypopnea index (AHI), a widely used parameter in clinical practice measuring the number of apnea or hypopnea events per hour of sleep [11]. A broad variety of studies have searched direct correlation between OSA severity (i.e., mild, moderate, and severe) and parameters derived from diverse physiological signals to globally diagnose subjects suffering from the disease. Although they have reported promising results, the proposed methods are unable to accurately identify and locate OSA episodes, thus hindering real-time and continuous monitoring of subjects outside of a sleep laboratory and without supervision of specialized technicians, such as at home [10], [12]. Thus, epoch-wise detection with minimum hardware requirements and computational burden currently plays a critical role in the OSA realm and will be the focus of the present work.
To date, epoch-wise OSA detectors based on single-lead ECG recordings have reported the greatest diagnostic accuracy when compared to other physiological signals, also providing a more convenient opportunity for the design and development of single-sensor wearable technology [8], [10]. Furthermore, beyond ECG morphology analyses from statistical [13], [14] and transformed domains [15], [16], the ECG signal also allows to easily obtain heart rate variability (HRV) and then study activation of the sympathetic nervous system, which is the main drive of breathing control and a major contributor to the metabolic dysfunction and elevated blood pressure noticed in OSA presence [8]. Actually, most of the recently proposed epoch-wise OSA detectors are based on the combination of HRV features through common machine learning (ML) classifiers. A complete list of these algorithms can be found in Table 1. As an illustrative example, Rajesh et al. [17] presented an ensemble random forest-based classifier to detect OSA episodes from traditional HRV time-frequency features. In the same line, Liang et al. [18] proposed a single-feature discriminant analysis for the identification of OSA epochs by means of a non-parametric form of sample entropy, and Tang and Liu [19] presented a similar approach based on the temporal dependency complexity analysis of the 2-Dimensional form of sample entropy. Eventually, Afrakhteh and Soltani [20] appealed to several ML algorithms, such as support vectors machine (SVM), the k-nearest neighbors (KNN), linear discriminant analysis (LDA), and diverse boosted ensemble variants, to detect OSA episodes from the empirical mode decomposition of the HRV information and the ECG-derived respiration.
The methods proposed in these works, as well as in most of those listed in Table 1, have reported very promising classification results when discerning between normal and apnea epochs, exhibiting accuracy rates higher than 85% with well-balanced values of sensitivity and specificity. However, these performance measures have been mostly obtained by training and validating the OSA detectors on the same set of ECG intervals through an epoch-wise, k-fold cross-validation approach and, therefore, they are suspected to be overestimated [21], [22], [23]. Compared with resubstitution validation, where the classifier is learned from all the available data and then tested on the same set of data, k-fold crossvalidation offers a more general and unbiased view of the algorithm's performance [24]. Nonetheless, the inclusion of ECG segments from the same patients in the subsets of learning and testing could facilitate classification, thus involving a bias [23], [25]. To avoid this problem, after developing a classification model, it is strongly recommended to evaluate its performance in a different set of patients from those used for training [24], [26]. Unfortunately, among the recently proposed OSA detectors, only a few ones, e.g., those presented by Varon et al. [27], Martín-Gonzalez et al. [13], and Lin et al. [28], have been validated using this external approach (such as Table 1 shows).
Given this context and bearing in mind that external validation in clinical applications has been vigorously advocated by the Transparent Reporting of a multivariate prediction model for Individual Prognosis Or Diagnosis (TRIPOD) initiative [26], the main goal of the present work is to analyze the actual generalization ability of several epoch-wise OSA detectors obtained through a common ML pipeline and typical HRV features. Precisely, the performance of the generated OSA detectors will be compared on two validation approaches, i.e., the widely used epoch-wise, k-fold cross-validation and the recommended external validation. For both cases, diverse combinations of the most widely used databases in the development of epoch-wise OSA detectors will be considered, which are freely available on PhysioNet [29]. Indeed, the Apnea-ECG database [30] has been the favorite one among most authors, being the only one used for training and validation of many epoch-wise OSA detectors (see Table 1). Nonetheless, the MIT-BIH Polysomnographic Database [31] and the St. Vincent's University Hospital/University College Dublin Sleep Apnea Database [32] have also been used by some authors [7], [14], [33] and will be employed in the present study.

A. STUDY POPULATION
The three aforementioned databases were annotated by clinical experts. However, in knowledge that the labeling system for each one was inherently different, the original annotations were retrieved to build the compatibility framework represented in Figure 1. This is explained in further detail in the following subsections.

1) APNEA-ECG DATABASE
The Apnea-ECG Database was first published in the 27 th International Conference of Computers in Cardiology (CinC) 2000 Challenge [30] and consists of 30 male and 5 female subjects from 27 to 63 years of age. These subjects were in turn classified into three groups according to their AHI level. Notwithstanding, since OSA severity was out of the competence of the present work, the AHI-related information was ignored. From the patients, 70 single-lead ECG recordings from 7 to 9 hours of duration, sampled at 100 Hz and 200 ADC units per mV, were obtained. These recordings were grouped into a released set and a withheld set of 35 recordings each (accordingly used as the training and testing subsets for the CinC Challenge). The released-as-training and withheldas-testing paradigm has been widely used by many authors for global subject diagnosis based on OSA severity, even after the end of the Challenge [9]. Conversely, many epoch-wise OSA detectors have only been trained and validated on the released subset (see Table 1). In the present work, all 70 ECG recordings were considered as a complete database. Last but not least, the original annotation system aimed to assess whether right at the beginning of each 1 minute-length ECG interval there was an apnea-related epoch (A-labeled), or not (N-labeled). This labeling system was adopted as a standard reference for the entire study since, compared to the rest of databases, this one presents the lowest annotation time resolution (see Figure 1a).

2) MIT-BIH POLYSOMNOGRAPHIC DATABASE
The MIT-BIH Polysomnographic Database [31], henceforward the MIT-BIH, consists of 18 PSG recordings between 2 and 7 hours of duration acquired from 16 male subjects within 32 and 56 years old. The single-lead ECG recording was sampled at 250 Hz and 12-bit per sample. Besides, the ECG recordings were annotated every 30 seconds by clinical experts, identifying OSA, central apnea, and hypopnea episodes with and without arousals. Considering these annotations, the presence of an apnea or hypopnea episode was evaluated right at the beginning of each one minute-length ECG interval to identify apnea (A-labeled) and normal (N-labeled) epochs, and then maintain a consistent labeling system with the ECG-Apnea database (see Figure 1b).  [32]. It comprises 21 male and 4 female subjects. Each entry of the dataset contains a single-lead ECG recording of 6 to nearly 8 hours of duration, sampled at 125 Hz and ADC units not specified. Annotations were made by a sleep technologist following the standard Rechtschaffen and Kales rules, indicating the exact hour, date, and duration of VOLUME 10, 2022 each obstructive/central apnea, hypopnea, and other kinds of non-sleep-related episodes. Again, on the basis of these annotations every minute in the ECG recordings was accordingly re-labeled as apnea (A-labeled) or normal (N-labeled), following the criterion employed in the ECG-Apnea database to maintain coherency between datasets (see Figure 1c).

B. DATA PREPROCESSING
To improve further analysis of the ECG recordings, baseline wander removal, signal filtering, and R-peak detection were initially developed. Firstly, every single-lead ECG was re-sampled at 500 Hz to enhance R-peak detection quality. Secondly, the baseline wander and high-frequency noise were subtracted from the original signal through a band-pass second-order Chebyshev filter with cut-off frequencies of 0.5 Hz and 100 Hz. In order to preserve phase and amplitude characteristics, a zero-phase digital filter was implemented [57]. Secondly, R-peak detection was performed with the Pan-Tompkins algorithm, which is well-known due to its effectiveness and low computational cost for real-time purposes [58]. However, since this algorithm may produce slight delays in the output, these were corrected by analyzing the range of samples around the corresponding R-peak and updating the location to the maximum value. Thirdly, the ECG recording was broken down into one minute-length segments to then compute the RR intervals as the consecutive differences of R-peaks in time. Such differences, in general, represent what is usually known as the HRV series [59].
Thereafter, a manual signal screening was conducted to discard segments that contained high presence of noise or artifacts, such as those located at the boundaries of the ECG signal, which are often caused by non-physiological reasons (e.g., lead detaching, signal saturation by motion, sneezes, cough, etc.). After this data clearance, 81.83% of Apnea-ECG, 87.2% of MIT-BIH, and 91.1% of UCD-DB 1-minute length ECG segments remained as the actual payload of each database for the present study. Should any of the employed databases be named in the present work, it will be referred to as this corresponding payload. Segments altogether from the three selected databases added up to 38,150 ECG intervals of 1-minute length, i.e., about 635 hours of recording.
Eventually, given that both groups of ECG intervals (i.e., A-labeled and N-labeled) were notably unbalanced in the three databases, a balanced subset for each one was also built. The balancing process was conducted by randomly discarding segments from the bigger group. Thus, the group with more segments was downsized until it equaled the number of segments of the other group. Precisely, the balanced subset for the Apnea-ECG was 83.5% of the corresponding payload, whilst that subset was 79.6% for the MIT-BIH, and 21.4% for the UCD-DB. These balanced subgroups were used to train the generated ML-based OSA detectors, whereas the complete payloads were reserved for their external validation.

C. FEATURE EXTRACTION
Most relevant markers and indices of HRV present in the literature of epoch-wise OSA detection were computed following the guidelines of the Electrophysiology Task Force of the European Society of Cardiology and the North American Society of Pacing, hereafter, the Task Force [59]. In this regard, the Task Force contemplated three different domains of study: time domain, frequency domain, and complexity domain. In the time domain, multiple statistical parameters have been extracted from the HRV series. It is worth mentioning that all 1-minute length ECG segments selected in the present work contained only normal RRi (NNi). In this particular case, a determined RRi was considered normal if it was free of artifacts and a high presence of noise [59]. Be that as it may, the following features were computed from the ECG segments: • Standard deviation of the differences between adjacent RRi (SDSD): defines the SD of the differences between a determined RRi and its successive value.
• Root mean square of differences between adjacent RRi (RMSSD): defined as the square root of the mean squared differences of successive RRi.
• NN50: defined as the number of adjacent NNi pairs whose time difference is greater than 50 ms.
• pNN50: is the quotient between the NN50 and the number of all available NNi.
• Interquartile range (IQR): equal to the difference between the 75 th and the 25 th percentiles.
In the frequency domain, the Task Force contemplates three different spectral bands or spectral components of interest of study [59], i.e.: • Very low frequency (VLF, 0.003 -0.04 Hz): carries information on the recording baseline and may be affected by breathing modulation.
• Low frequency (LF, 0.04 -0.15 Hz): it is considered to vary by means of changes in autonomic modulations of the heart period.
• High frequency (HF, 0.15 -0.4 Hz): carries further information on autonomic modulation of RRi available in short-term recordings (less than 5 minutes).
In the present work, these parameters were computed using two spectral analysis, i.e., the fast Fourier transform (FFT) and the Lomb-Scargle periodogram (LSP) [60]. Although FFT-based spectrum bands were widely used in the corresponding literature [9], the LSP is deemed to provide further information on power spectral density (PSD), particularly in unevenly sampled signals, such as the HRV data [61].
In the complexity domain, the Task Force initially included some non-linear methods with the aim of characterizing HRV sequences influenced by complex physiological behaviors. Such methods included the estimation of Lyapunov exponents, Kolgomorov entropy, and spectral scaling indices. Afterward, Richman & Moorman introduced the concept of sample entropy (SampEn) [62], which proved to outperform the former non-linear markers in several works [9], [63].
In brief, the computation of SampEn of N points, with tolerance r and maximum vicinity of m points, can be defined as follows [62]. Let u(n) be a set of N points, two sequences can be conformed, the target sequence, and the posterior sequence, Now, let the distance between two vectors be defined as: ≤ r are counted as well. Thus, SampEn can be obtained following the expression in Equation (1). Also, the effect of m and r may change significantly the final SampEn value. For that reason, the selected parameters were m = 2 and r = 0.2 times the SD of the evaluated series (HRV), as suggested in Richman & Moorman's works on empirical grounds [62], [64].
Additionally, recurrence plot-based features have also revealed interesting results in the scope of epoch-wise OSA detection [13], [16]. A recurrence plot (RP) is an advanced technique of non-linear data analysis, usually applied to the study of dynamical chaotic systems [70]. Specifically, a RP determines the relationship between instants of time for each state in a phase space (which is comparable to each sequence x m (i) defined for SampEn), which results in a squared matrix that represents the remoteness between states according to a threshold distance, expressed as [71]: The value of RP i,j denotes the recurrence of a state at time i at a different time j, the threshold distance, and (·) the Heaviside function. Note that this function will only bring a binary result for each RP i,j depending on , for which a black dot is represented if the norm is lower than the threshold, whilst a white dot is represented otherwise. This would output a symmetric graph, very similar to those shown in Figure 2a and Figure 2b, which may vary depending on the selected threshold distance [72]. As for computation of aforementioned entropy-based indices, an embedded dimension of m = 2 was used. Moreover, to minimize the dependence of the resulting RP on and establish a common and reproducible criterion, the neighborhood distance was determined using the fixed amount of nearest neighbors (FAN) algorithm, in which for each state x i all columns of the RP have the same recurrence density. This neighborhood criterion leads to an asymmetric RP, as illustrated in Figure 2c and Figure 2d [73]. From this map, the most common RP-based features were extracted as [73]: • Recurrence rate (REC): retrieves the percentage of recurrence points in an RP.
• Determinism (DET): retrieves the percentage of recurrence points which form diagonal lines. VOLUME 10, 2022 • Shannon entropy (ENTR): is the Shannon entropy of the probability distribution of the diagonal line lengths.
• Average diagonal line length (L): is defined as the average length of all diagonal lines within an RP.
• Divergence (DIV): is the inverse of the length of the longest diagonal line, which also corresponds to the sum of the positive Lyapunov exponents.
For the sake of completeness, a summary of all extracted features is provided in Table 2. Additionally, it should be noted that the computational software tool for feature extraction and data analysis was MATLAB R2020b, alongside with PhysioNet Cardiovascular Signal Toolbox [74], Marcus Vollmer's HRVTool [75], and Gaoxiang Ouyang's Recurrence Quantification Analysis (RQA) Toolbox [76]. Eventually, for database downloading and preparation, PhysioNet's WaveForms DataBase Toolbox for MATLAB was employed [77].

D. FEATURE SELECTION AND DEVELOPMENT OF THE ML-BASED OSA DETECTORS
Before the development of diverse ML models, two sequential feature selection (SFS) algorithms were employed, i.e., the sequential forward feature selection (SFFS) and the sequential backward feature selection (SBFS). In SFFS, features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion function [78]. Conversely, in SBFS, features are sequentially removed from a full candidate set until the removal of further features increases the criterion function [78]. It is worth noting that both procedures use an internal cross-validation approach for the evaluation of each feature set [79].
For both cases, the features selected by minimizing the classification error between A-labeled and N-labeled ECG intervals were employed to build several OSA detectors using five common ML classifiers, such as support vector machine (SVM), k-nearest neighbors (KNN), decision tree (TREE), and two random forest-based algorithms, i.e., adaptive boosting (ADA) and bootstrap aggregation (BAG). Moreover, to serve as a reference, additional five OSA detectors were also obtained by combining all the computed features (see Table 2) through the same five ML classifiers. As a summary, three predictive models were built using each ML classifier and a total of 15 OSA detectors were then obtained.
In this regard, the SVM models were trained with Gaussian radial basis function kernel, kernel scale = 1, and polynomial order = 3. The KNN models were trained with distance method = euclidean, and equal distance weights. The TREE models were trained with maximum number of splits = 1, minimum leaf size = 1, minimum parent size = 10, and following the standard CART (Classification and Regression Trees) predictor. Eventually, the ensemble models were both trained with the same number of learning cycles (100 cycles) under the tree decision learner. The only difference in ensemble models consisted in the ADA's learning rate, which value was established at 1.

E. TRAINING AND VALIDATION OF THE ML MODELS
To thoroughly analyze the generalization ability achieved by the obtained ML-based OSA detectors, they were trained and validated by making use of two different approaches. The first one consisted in a common epoch-wise, k-fold crossvalidation strategy, where the training set is divided into k stratified subsets of randomly selected observations. The most common value for k was chosen based on the reviewed literature, that is, k = 10 [9]. Thereafter, the corresponding model is trained with k −1 subsets and tested with the remaining one [80]. This procedure is subsequently carried out until all combinations of training and testing subsets are evaluated. The approach was repeated for the six combinations of the balanced subsets of the three analyzed databases, such as described in Table 3. Thus, each obtained OSA detector was trained and validated six times using the balanced datasets of Apnea-ECG, MIT-BIH, UCD-DB, Apnea-ECG along with MIT-BIH, Apnea-ECG along with UCD-DB, and MIT-BIH along with UCD-DB, respectively.
For direct and fair comparison, the second validation approach consisted in training the ML-based OSA detectors with the same aforementioned six combinations of the balanced subsets of the analyzed databases and then testing them with the non-balanced sets of the remaining databases, such as Table 3 describes. Hereafter, this latter approach will be referred to as external validation. More exactly, the followed combinations of training and testing sets were collected as experiments, these were:  Figure 3 shows the workflow of the first experiment, in which Apnea-ECG balanced payload was used as the training set and the full datasets of UCD-DB and MIT-BIH constituted the testing set.
For both validation approaches, performance classification of the OSA detectors was assessed in terms of the common values of accuracy (Ac), sensitivity (Se), and specificity (Sp). These measures were computed as: where true positives (TP) represent the number of 1 minute-length ECG intervals correctly classified as apneic,

III. RESULTS
On the grounds of the large amount of experiments conducted and the high similarity between the results provided by all the generated ML-based OSA detectors, only those from the algorithms based on the BAG classifier are presented below. Nonetheless, the outcomes for the remaining OSA detectors, based on the SVM, KNN, TREE and ADA classifiers, can be found in Appendix. Table 4 shows the classification results obtained for the six conducted experiments, the three analyzed feature sets, and the two considered validation approaches. As can be seen, SBFS usually brought more features than SFFS for all the experiments. An overview of how often each feature was selected by any of the two selection methods, i.e., SFFS or SBFS, is displayed in Figure 4. In general terms, the measures computed from the time domain of the HRV data were more frequently selected than those from the frequency domain, and these latter ones more frequently than those from the complexity domain. Within each domain, the variables MEAN, LS-LF and FuzzEn were respectively the most prevalent in the obtained OSA detectors. However, regardless the set of features, the classification performance noticed in all the experiments maintained the same trends for both validation approaches. Of note, SFFS only selected a feature in the experiment #2 (i.e., NN50), and classification obtained VOLUME 10, 2022 by both 10-fold cross-validation and external validation was notably lower than when a larger number of variables was considered (i.e., in the cases of SBFS and ALL). In fact, about 7-10% lower values of Ac were seen regarding the other experiments for both validation approaches.
Anyway, for the six experiments the OSA detectors presented drastically lower performance values with external validation than with 10-fold cross-validation, globally around 10 to 30% lower in terms of Ac, Se, and Sp. Moreover, this tendency was more intense when the ML models were trained with the balanced form of the UCD-DB (e.g., highest performance drop in experiment #3 for all cases, see Table 4 and tables in Appendix), which corresponds to the smallest training subset. As previously mentioned, similar results were observed when the OSA detectors based on other classifiers than BAG were analyzed (see Appendix). However, the SVM-based models presented slightly higher resistance to performance dropping from one validation approach to another, because differences about 5 to 15% between both validation approaches were observed. Conversely, the TREE-based OSA detectors presented the lowest resistance with drops in performance about 15-25%, as well as the lowest classification rates for all the experiments in general terms, with values of Ac about 50% (see Table 7 in Appendix).

IV. DISCUSSION
Validation is a very important step in the development of every ML-based prediction model. To the best of our knowledge, the present work is the first one assessing the generalization ability of diverse epoch-wise HRV-based OSA detectors by comparing their classification performance with two different validation approaches. The results obtained by all the generated ML models for an epoch-wise, 10-fold cross validation approach were similar to those reported in most of the previous works, which also used the same validation methodology (although some variety in the number of folds was noticed). In fact, in line with the values of Ac, Se, and Sp about 85% reported by the ML-based OSA detectors in the present work (see Section III and Appendix), Table 1 shows that many previous works obtained Ac values above 80%, with well-balanced classification rates both for normal and apnea ECG segments.
These outcomes are initially promising, suggesting that the obtained OSA detectors were able to properly generalize the apnea detection problem in a minute-by-minute basis, such as some previous authors have claimed [25], [28], [47]. However, the same OSA detectors performed significantly worse when externally validated with alien datasets to those employed for their training, thus demonstrating that epoch-wise cross-validated results overestimated the prediction accuracy of the ML models. As can be seen in Table 4, drops in diagnostic accuracy between 20 and 40% were noticed in most of the conducted analyses, along with notably unbalanced values of sensitivity and specificity. This outcome was previously suspected by other authors [21] and partially proven by only training two ML-based OSA detectors with ECG-based features on the Apnea-ECG [22]. Nonetheless, in the present work a wider study has been conduced by considering a longer set of freely available databases, combinations of them, HRV features, and ML classifiers.
The main reason behind that overestimated classification performance of the ML models reported by the epoch-wise, 10-fold cross validation approach could be the bias introduced by involving ECG intervals from the same patients into training and testing subsets [23]. In this setting, complex classifiers can pick up a confounding relationship between a subject-specific feature (e.g., the heart rate) and diagnostic status, and so produce unrealistically high prediction accuracy [23], [81]. Such association is still easier to be established when more limited and unrepresentative databases are analyzed, because poor diversity of subjects could be found [23]. However, even when the balanced subsets of the three databases analyzed in this work were considered together, an epoch-wise, 10-fold cross validation approach was unable to offer a realistic view of the classification performance of the ML models. In this case, BAG-based OSA detectors still obtained values of Ac, Se, and Sp of 80.77%, 80.94%, and 80.61% for the SFFS-based feature set, 80.72%, 80.67%, and 80.77% for the SBFS-based feature set, and 78.74%, 78.25%, and 79.22% for the set of all HRV features, respectively. These overestimated results, along with the drastic drops in classification observed when external validation was used, suggest that the three most famous public databases are not sufficiently large and representative to train general ML-based OSA detectors.
To this last respect, although more than 38,000 ECG intervals were analyzed, they were obtained from 70 patients collected in the Apnea ECG database, 16 in the MIT-BIH, and 25 in the UCD-DB. Moreover, whereas Apnea-ECG and UCD-DB involved both males and female subjects (notably unbalanced), in MIT-BIH all subjects were male. It is known that OSA has been considered a risk factor for cardiovascular death in men, but whether it was also in women was unknown until recent years [82]. Namely, it has been demonstrated that severe OSA is also associated with cardiovascular death in women [83], especially in women with polycystic ovary syndrome [84]. Another aspect is the age range, which is wider in Apnea-ECG and UCD-DB than in MIT-BIH. Knowing that age represents one of the highest risk factors for OSA right after weight and sex [1], this aspect should be considered in future proposed OSA detectors. To a lesser extent than age and sex, ethnicity may also influence the OSA behavior in HRV patterns, as Quin et al. recently stated in their latest work [85]. As a consequence, the collection of larger, more representative, and multi-center datasets is mandatory to achieve general and accuracy ML-based OSA detectors, which can be transferred into clinical practice.
On the other hand, it is worth noting that in the context of OSA detection, two main dimensional reduction algorithms are frequently mentioned, i.e., principal component analysis (PCA) and SFS [10]. Whilst PCA represents an approach focused on the variance of relative components computed from an altered space of features, SFS adds or removes features from a candidate subset while evaluating a criterion function, which usually depends on the classifier internal belief [38]. In the present work, SFS was chosen instead of PCA because of its ability to quantify the exact contribution of each combination of features for every experiment [16]. Anyway, the main goal of both algorithms is to lessen the computational burden and improve the model generalization, but no significant differences in the classification performance of the generated ML-based OSA detectors were noticed for the three feature sets (SFFS, SFBS and ALL). Indeed, although SFFS-based and SBFS-based models were slightly lower in Ac than those models trained with all HRV-based features when a epoch-wise, 10-fold crossvalidation approach was used, a considerable drop in performance was still reported for the three cases in the setting of an external validation (see Table 4).
Despite this last outcome, decreasing the feature set size is always convenient to minimize computational burden and some interesting observations were noticed to this respect. Given the usefulness of HRV non-linear characterization in the OSA realm [8], some of the most recent and innovative non-linear features were implemented beyond the well-known SampEn, such as DispEn, DistEn, QSampEn, RP-based features, etc., to verify their global contribution with SFS. However, they did not perform as well as expected, since they were apparently unable to properly separate A-labeled and N-labeled ECG segments. Only the original form of FuzzEn appeared in almost all the feature sets selected by SFS algorithms, demonstrating certain improvement in the ML models' performance, providing that it was used in combination with other time and frequency measures. Moreover, both the frequency domain and time domain features were the most selected ones by the SFS algorithms (near the 90% of the cases), therefore they are VOLUME 10, 2022 also useful in combination with other time features to detect OSA [10], [51]. It is worth saying that the proportion of feature appearance is not totally indicative of the individual feature efficiency. For instance, the individual separation of groups using solely SampEn (Figure 5a) is very similar to FuzzEn (Figure 5b), but this latter one is more prone to be used in combination with further features (according to SFS) to contribute to the ML model's performance.
Regarding the performance provided by the OSA detectors based on the analyzed ML classifiers others than BAG, no relevant differences were noticed (see Section III and Appendix). Nonetheless, it is remarkable that more advanced ML classifiers, such as ensemble classifiers (ADA and BAG), did not bring significantly better results than traditional ML classifiers, like SVM or KNN, especially in the case of external validation. Moreover, the SVM-based models presented the highest resistance to performance dropping from cross validation to external validation, e.g., experiments #1 and #2 in SVM with all features shown a performance drop of approximately 5%, whereas for KNN, TREE, ADA, and BAG, have shown over 10% drop in the same conditions (see Appendix). These results makes SVM an attractive choice for OSA detection, such as several authors like Pombo et al. [50] and Zarei et al. [16] have also addressed previously.
Finally, in the interest of understanding and reproducibility of the results, solely well-known and publicly available databases were analyzed in the present work. The generalization ability achieved by the ML models for OSA prediction using other wider private or limited databases is out of scope of this work, but it will be conduced in the future. Additionally, the generalization ability of the growing deep learning algorithms for OSA detection will also be tackled in further works. These algorithms have not been included in the present work because, unlike the case of ML models, there is no a generic pipeline for their development and, even worse, a broad variety of factors can impact the resulting model. In this respect, many kinds of networks can be found, such as recurrent neural networks and convolutional neural networks, presenting different forms of layer architectures and several hyperparameters. Besides, diverse initial training conditions, learning rates, and techniques of early stopping are available. Also, many alternatives to transform the HRV signal into a 1-D sequence (e.g., the signal itself, instantaneous frequency, entropy-based indices, etc.), and into a 2-D image (spectrograms, wavelet scalograms, Poincare plots, etc.) are possible. This convoluted pool of options may hamper appropriate comparison among deep learning models, as well as their impact on OSA detection, and their thorough analysis will require a totally different work.

V. CONCLUSION
The present study has shown that epoch-wise OSA detectors based on common ML models and HRV features, and trained on the three most famous freely available databases, such as most of those published in the last decade, are not sufficiently general to be employed in clinical practice. To improve the generalization ability of these OSA detectors, larger, more representative, and multi-center public databases are still required. Also, external validation, where separate subjects must be considered for training and testing, has resulted to be essential for reliable assessment of the true predictive power of these algorithms. This finding is in line with the recommendations found in recent guidelines to progress in the transition into ML-driven, data-rich medicine, such as the TRIPOD guidelines.

APPENDIX CLASSIFICATION RESULTS FOR THE GENERATED OSA DETECTORS BASED ON SVM, KNN, TREE, AND ADA
In previous Section III, only classification results obtained by the OSA detectors based on the BAG classifier were presented. Thus, the corresponding results for the OSA detectors based on SVM, KNN, TREE and ADA are presented below in Tables 5, 6, 7 and 8, respectively.