Multi-Objective Hyperparameter Optimization of Convolutional Neural Network for Obstructive Sleep Apnea Detection

Obstructive sleep apnea (OSA) is a common sleep disorder characterized by interrupted breathing during sleep. Because of the cost, complexity, and accessibility issues related to polysomnography, the gold standard test for apnea detection, automation of the diagnostic test based on a simpler method is desired. Several signals can be used for apnea detection, such as airflow and electrocardiogram. In particular, the reduction of airflow normally leads to a decrease in the blood oxygen saturation level (SpO2). This signal is usually measured by a pulse oximeter, a sensor that is cheap, portable, and easy to assemble. Therefore, SpO2 was chosen as the reference signal. Feature-based classifiers with shallow neural networks have been developed to provide apnea detection using SpO2. However, two main issues arise: the need for feature creation and the selection of the most relevant features. Deep neural networks can solve these issues by employing featureless methods. Among the multiple deep classifiers that have been developed, convolutional neural networks (CNN) are gaining popularity. However, the selection of the CNN structure and hyperparameters is typically done by experts, where prior knowledge is essential. With these problems in mind, an algorithm for automatic structure selection and hyperparameter optimization of a one-dimensional CNN was developed to detect OSA events using only the SpO2 signal. Three different input sizes and databases were tested, and the best model achieved an average accuracy, sensitivity, and specificity of 94%, 92%, and 96%, respectively.


I. INTRODUCTION
Sleep is a circadian rhythm that significantly contributes to maintaining a pleasant daily routine. However, more than sixty sleep-related disorders have been identified, and obstructive sleep apnea (OSA) is one of the most prevalent in the population [1]. It is characterized by obstruction or reduction of the airflow during sleep, which decreases the oxygen level in the blood. It was also verified that sleep apnea increases cognitive impairment [2] and increases the risk of hypertension [3], coronary artery disease [4], stroke [5] and other diseases. Population-based studies estimated that around 200 million people suffer from this disorder [6], [7], which is more prevalent in adult men (around 4% of adult men compared to 2% of adult women) [7]. However, 80% of the patients may be unaware that they have the disorder.
The associate editor coordinating the review of this manuscript and approving it for publication was Sabah Mohammed.
Polysomnography (PSG) is the gold standard for OSA detection, using signals from multiple sensors to produce accurate diagnostics. Nevertheless, it is a tedious and time-consuming task [8] that requires specialized equipment, with a high economic cost for maintenance, and professionals to perform the exam [9], [10]. So, a simple and less costly system with a lower number of sensors is desirable for analysis.
Full reviews on commercial devices [11], algorithm performance [12], deep learning [13] and classification methods [14] for OSA detection have been performed. Different source signals, such as oximetry [15], [16], electrocardiogram (ECG) [17], [18], respiration [19], [20], and sound [21], have been previously evaluated. Among the studied sensors, pulse oximetry, which estimates blood oxygen saturation (SpO2), has shown significant potential for noninvasive OSA detection with a sensor that is easy to use and to self-assemble. Therefore, SpO2 was chosen for this work. The problem of building a reliable system using an SpO2 sensor is mainly two-fold: finding the best features that describe the apnea events, and using these features to detect apnea accurately. Multiple researchers have analyzed different time domain [22] and frequency domain characteristics [23], thus creating a vast pool of suitable features. The creation of handcrafted features that achieve good performance requires significant domain knowledge. In addition, it is becoming significantly harder to find a new set of features that achieves a higher performance, since combining two or more features does not guarantee an improvement [24]. Therefore, a large number of features need to be sorted according to relevance to increase the accuracy. Various techniques have been employed to address this problem, such as minimum Redundancy Maximum Relevance (mRMR), Sequential Forward Search (SFS) [25] and Genetic Algorithms (GA) [26]. However, these techniques are either slow or, most of the time, do not guarantee that the best features are chosen. In addition, the best features sometimes depend on the classifier used. Deep learning has the ability to automatically learn features from raw data [27]; thus, these problems can be solved by using a deep network.
The deep convolutional neural network (CNN), which is inspired by the visual system, is one of the most successful deep networks. Traditionally, CNNs are designed for two-dimensional (2D) image inputs with different channels [28]; however, they can also be used for single-channel one-dimensional (1D) signals [29], [30]. In most cases, automated detection of obstructive sleep apnea events using a CNN performs better than shallow classifiers [31]. Some authors used nasal airflow [32] or a combination of SpO2, oronasal airflow, and ribcage and abdomen movements [33], and converted these one-dimensional signals to a two-dimensional input in order to employ a two-dimensional CNN (CNN2D) directly for apnea detection. A one-dimensional CNN (CNN1D) is a good alternative for 1D signals because it requires far less preprocessing (no 1D-to-2D conversion is needed). ECG [34], [30], [35] and nasal pressure or airflow [36], [19] signals have been previously used with a CNN1D for OSA classification. Haider et al. [37] employed three one-dimensional signals (nasal airflow, abdominal and thoracic plethysmography) to feed a CNN1D with three input channels for OSA detection. Following this line of research, the SpO2 signal, which is 1D in nature, was selected to be fed directly to a CNN1D without any dimensional conversion stage.
However, implementing a CNN presents significant challenges. The structure and/or hyperparameters of the network are typically selected through an experimental search. Such methods require a significant amount of time as well as experience and expert knowledge for the creation of a handcrafted network structure and hyperparameters [38]. A possible alternative is to use evolutionary algorithms, such as GA, to solve the structural optimization problem. The algorithm starts with a randomly generated population of individuals and uses mutation and crossover over a defined number of generations to reach an optimized solution by optimizing a fitness function. Zhining and Yunming [39] designed a genetic convolutional neural network model based on random samples and found that it performs better than a CNN on the MNIST data set. Evolutionary algorithms have also achieved significant success in the configuration of topologies [40] and connections for convolution layers [41]. Regarding the selection of the network hyperparameters, an asynchronous evolutionary approach was successfully used on the Titan supercomputer [42]. Also, neuro-evolution was able to construct large, accurate networks from trivial initial conditions while searching a large space without experimenter participation [40]. By combining Dynamic Structured Grammatical Evolution (DSGE) with GA, Assunção et al. [43] were able to achieve better results without resorting to prior knowledge. Grammatical Evolution (GE) was also used for handwritten digit recognition [44] as well as human activity recognition [45]. EvoDeep, a DNN where GA was used for optimization, was developed by Martín et al. [46]. This concept of using GA for choosing the best network was also successfully extended to transfer learning [47]. For sleep apnea detection, Falco et al. [48] used Evolutionary Algorithms (EAs) to optimize the hyperparameters of a deep network with Heart Rate Variability (HRV).
Therefore, a GA was also employed in this work for hyperparameter optimization of the CNN.
Unbalanced data is also a common issue in sleep apnea detection, where one class (apnea) is underrepresented and the other class (normal) is prevalent. A single-objective technique (as applied in the previously mentioned applications) commonly tries to maximize the accuracy, which leads to a biased classifier, since an increase in accuracy can sacrifice the sensitivity (detection of apnea events) that is related to the less prevalent class. The designed model addresses this issue by simultaneously considering the accuracy (Acc), sensitivity (Sen) and specificity (Spc) in a multi-objective problem. The Non-dominated Sorting Genetic Algorithm II (NSGA-II) [49] was selected for this work due to its success in other areas, such as filter design [50] and water distribution systems [51]. Therefore, the primary objectives of this work are:
- Design an automatic featureless sleep apnea event detection algorithm using a CNN and the SpO2 signal;
- Develop an independent algorithm capable of choosing the CNN structure and hyperparameters without any human intervention, using multi-objective optimization;
- Analyze the effects of input sizes, database dependencies and layer size on the classification.
To achieve the desired objectives, an algorithm was developed for the automatic creation and hyperparameter optimization of a CNN using one database with three input sizes (Section II). The results of the creation and hyperparameter optimization of the CNN are discussed in Section III. Finally, the conclusions, limitations and future work are discussed in Section IV.

II. MATERIALS AND METHODS
The proposed system used a CNN with the NSGA-II algorithm to solve the multi-objective problem by choosing a suitable structure that can attain balanced results. Three different databases were used to create and test the system. A simplified workflow of the developed methodology is shown in Fig. 1. For simplicity, the iterative elements of the workflow are not shown (such as two-fold, leave-one-out, different inputs, etc.). First, the databases are resampled and given a minute-based annotation. Then, the CNN hyperparameter optimization is carried out on the HuGCDN2008 database with a two-fold technique using half of the subjects (SNo/2) for training and half of the subjects for testing, where SNo is the total number of subjects. After obtaining the best CNN for each of the three inputs, these classifiers were tested on the other two databases (AED and UCD, discussed later) using cross-database testing and transfer-learning testing. Finally, the performance of the proposed system was analyzed with different performance parameters. A brief description of the materials and methods is given in the subsequent subsections.
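The subject-independent two-fold protocol described above can be illustrated with a minimal sketch. It assumes that subjects are identified by integer IDs and assigned to the two halves at random; the function name `two_fold_subject_split` and the fixed seed are illustrative assumptions, not part of the original MATLAB implementation.

```python
import random

def two_fold_subject_split(subject_ids, seed=0):
    """Split subjects into two disjoint halves and return both train/test folds."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)  # assumption: subjects are assigned at random
    half = len(ids) // 2
    first, second = ids[:half], ids[half:]
    # Fold 1 trains on the first half and tests on the second; fold 2 swaps them.
    return [(first, second), (second, first)]

folds = two_fold_subject_split(range(1, 71))  # 70 subjects, as in HuGCDN2008
for train_ids, test_ids in folds:
    # Subject independence: no subject contributes data to both sets.
    assert not set(train_ids) & set(test_ids)
    print(len(train_ids), len(test_ids))
```

Splitting by subject rather than by epoch is what keeps one person's minutes from appearing on both sides of a fold.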

A. DATABASES
Three databases were tested in this work. Two of them were collected from Physionet and are freely available, specifically the Physionet apnea ECG database (AED) and the St. Vincent's University Hospital / University College Dublin Sleep Apnea Database (UCD). The third one was collected in Gran Canaria University Hospital, named Dr. Negrín University Hospital Database.
The dataset from the Sleep Unit of the Dr. Negrín University Hospital in Gran Canaria, which will be referred to as the HuGCDN2008 database, has 70 referred (suspected sleep apnea) patients (51 males and 19 females, from 18 to 82 years old). The subjects do not have any arrhythmia and the SpO2 signal was sampled at 50 Hz. The annotations were made in 30 s epochs [23].
AED has 70 recordings but only eight have SpO2 signals available. These eight recordings were used; their duration ranges from 7 to 10 hours, with minute-by-minute annotation [52], [53]. The sampling frequency of this data is 50 Hz.
UCD [54] has 25 referred (suspected sleep apnea) subjects, including 21 males and four females. This database is continuously annotated with hypopnea (HYP) and central (C), obstructive (O) and mixed (M) apnea. The sampling rate of the SpO2 signal was 8 Hz.

B. PREPROCESSING OF THE DATABASES
Due to the non-uniformity between datasets, such as sampling frequency and annotation methods, a normalization preprocess was performed.
To have all of the databases at the same sampling rate, UCD [54] was resampled to 50 Hz. In this work, apnea events were detected with one-minute epochs (as employed by AED [53]). Therefore, the annotations for HuGCDN2008 were produced by labeling a minute as an apnea minute if either or both of its 30 second windows were annotated as apnea by the physician. For UCD [54], if 10 seconds or more in a minute were annotated as apnea by the physician, then the one-minute epoch was labeled as apnea.
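The two relabeling rules can be sketched as follows. This is a minimal illustration under stated assumptions: the 30 s epoch labels and the continuous UCD annotation are assumed to be already available as binary lists (1 = apnea), the continuous annotation is assumed to be expanded to one flag per second, and the 10 seconds are counted in total within the minute (the text does not state whether they must be contiguous). The function names are hypothetical.

```python
def label_minutes_from_30s(epoch_labels):
    """HuGCDN2008 rule: a minute is apnea (1) if either of its 30 s epochs is apnea."""
    minutes = []
    for i in range(0, len(epoch_labels) - 1, 2):
        minutes.append(1 if (epoch_labels[i] or epoch_labels[i + 1]) else 0)
    return minutes

def label_minutes_from_seconds(second_flags, min_apnea_seconds=10):
    """UCD rule: a minute is apnea (1) if at least 10 of its seconds are apnea."""
    minutes = []
    for start in range(0, len(second_flags) - 59, 60):
        apnea_seconds = sum(second_flags[start:start + 60])
        minutes.append(1 if apnea_seconds >= min_apnea_seconds else 0)
    return minutes

# Three minutes of 30 s epochs: normal/normal, apnea/normal, apnea/apnea.
hugcdn_minutes = label_minutes_from_30s([0, 0, 1, 0, 1, 1])

# Two minutes of per-second flags: 9 apnea seconds, then 10 apnea seconds.
ucd_minutes = label_minutes_from_seconds([1] * 9 + [0] * 51 + [1] * 10 + [0] * 50)

print(hugcdn_minutes, ucd_minutes)
```

Any trailing partial minute is dropped in this sketch.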
The input sizes of 1 minute, 3 minute (with 2 overlapping minutes) and 5 minute (with four overlapping minutes) were created considering the central minute as the one that defines the label. Therefore, taking into consideration the selected sampling frequency the 1 minute (60 seconds), 3 minute (180 seconds), and 5 minute (300 seconds) windows had 3000, 9000 and 15000 sampling points, respectively.
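The windowing described above can be sketched as follows, assuming a 50 Hz signal stored as a flat list and one label per minute. How the first and last minutes of a recording (which lack full context) were handled is not stated in the text; this sketch simply drops them, which is an assumption, as is the function name `make_windows`.

```python
FS = 50                       # Hz, common sampling rate after resampling
SAMPLES_PER_MINUTE = 60 * FS  # 3000 samples

def make_windows(signal, minute_labels, window_minutes):
    """Return (window, label) pairs; the label is that of the central minute."""
    assert window_minutes % 2 == 1, "an odd length is needed for a central minute"
    half = window_minutes // 2
    pairs = []
    # Assumption: minutes without full context on both sides are skipped.
    for center in range(half, len(minute_labels) - half):
        start = (center - half) * SAMPLES_PER_MINUTE
        stop = (center + half + 1) * SAMPLES_PER_MINUTE
        pairs.append((signal[start:stop], minute_labels[center]))
    return pairs

# Toy 10 minute recording with a constant SpO2 value and one apnea minute.
signal = [97.0] * (10 * SAMPLES_PER_MINUTE)
labels = [0] * 10
labels[4] = 1

for window_minutes in (1, 3, 5):
    pairs = make_windows(signal, labels, window_minutes)
    print(window_minutes, len(pairs), len(pairs[0][0]))
```

Consecutive windows advance by one minute, which yields the 2 and 4 overlapping minutes mentioned for the 3 minute and 5 minute inputs.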

C. CONVOLUTION NEURAL NETWORK
CNN commonly comprise different types of layers. In this work, the CNN has an input layer, convolution layers, nonlinear layers, fully connected layers, batch normalization layers, softmax layer and a class output layer.
The input layer is the first layer of the network and receives the raw data. Thus, the size of the input layer is the same as the size of the input data, so three input layer sizes were tested.
The convolution kernels (or filters) of the convolution layer slide over the input, providing a diversity of features that capture different local information [55]. The values of the filters are learned during the training process [56]. The feature map size depends on the number of filters (sometimes referred to as depth), the filter size (for 1D, only the width is considered), the stride (the number of sample points the filter slides in each step) and the padding (adding zeros at the end of the input data so that the filters can be run on the bordering elements).
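The dependence of the feature map size on these parameters can be made concrete with the standard output-length relation for a 1D convolution. The sketch below is illustrative: the filter size, stride and number of filters are hypothetical values, not the ones selected by the optimization, and `total_padding` denotes the total number of zeros added to the input.

```python
def feature_map_length(input_length, filter_size, stride=1, total_padding=0):
    """Number of output samples produced by each filter of a 1D convolution."""
    return (input_length + total_padding - filter_size) // stride + 1

# Hypothetical layer: 8 filters of width 16 with stride 2 on the 1 minute input.
n_filters = 8
length = feature_map_length(3000, filter_size=16, stride=2)
print((length, n_filters))  # feature map shape: samples x filters
```

With these hypothetical settings each filter produces 1493 samples, so the layer output has one more dimension (the filter index) than the 1D input.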
If the whole convolution layer is considered, the feature maps can be seen as an (n + 1)-dimensional map, where n is the dimension of the input [57]. The feature map of the convolution layer is given by

$$C_d = f \circledast k_d + b_d \tag{1}$$

where $1 \le d < nk_d$, $nk_d$ is the number of convolution kernels in a layer, $C$ is the feature map of the entire convolution layer ($C \in \mathbb{R}^{i \times j \times nk_d}$), $\circledast$ is the n-dimensional (n = 1) convolution operation, $k$ is the kernel, $f$ is the input matrix, $b$ is the bias and $d$ is the kernel index.

In this work, the nonlinear layer uses the Rectified Linear Unit (ReLU), which introduces non-linearity in the network by replacing all negative values with zero. This non-linearity can improve the classification performance [58]; therefore, ReLU was chosen as the nonlinear function. It is defined as

$$f(x) = \max(0, x) \tag{2}$$

where $x$ is the input of the function.

The pooling layer uses a subsampling operation to reduce the dimensionality of the data that passes through the network. It has the same parameters as a convolution filter and can perform operations such as average and maximum (max). Max pooling commonly provides the best results [59]; thus, it was used. It is defined as

$$f_{\max}(x) = \max_{i}(x_i) \tag{3}$$

where $x$ is the input of the max pooling layer, with the same number of input samples as the pooling size.

The batch normalization layer normalizes its input over a batch as [60]

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{4}$$

where $\mu_B$ is the mean and $\sigma_B^2$ the variance over the batch, and $\epsilon$ is a constant that improves numerical stability when the batch variance is very small. This layer is added to increase the training speed and decrease the sensitivity to network initialization.

A fully connected layer implies that every neuron in the previous layer is connected to every neuron in the next layer, and it helps the network learn nonlinear relationships. In this layer, a neuron has the same function as a common perceptron with an activation function, defined by [61]

$$y = \phi\left(\sum_{i=1}^{n} w_i x_i + b\right) \tag{5}$$

where $x$ is the input, $w$ is the weight, $b$ is the bias, $n$ is the number of inputs and $\phi$ is the nonlinear function.

This layer is followed by a softmax layer, also known as the normalized exponential function. The softmax function $\phi_{softmax}$ represents a categorical distribution, which is a probability distribution over $k$ different possible outcomes. In this implementation $k = 2$ (binary case), thus

$$\phi_{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{2} e^{x_j}}, \quad i = 1, 2 \tag{6}$$

Afterwards, a class output layer generates the class label assigned to the input.
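The layer operations described in this subsection can be written compactly in plain Python. The sketch below is only an illustration of the standard definitions (ReLU, max pooling, batch normalization without the learnable scale and shift, a single perceptron, and softmax); it is not the MATLAB implementation used in this work, and all numeric values are hypothetical.

```python
import math

def relu(x):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in x]

def max_pool(x, pool_size, stride):
    """Keep the maximum of each pooling window."""
    return [max(x[i:i + pool_size]) for i in range(0, len(x) - pool_size + 1, stride)]

def batch_norm(batch, eps=1e-5):
    """Normalize one feature over a batch (learnable scale/shift omitted)."""
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return [(v - mu) / math.sqrt(var + eps) for v in batch]

def neuron(x, w, b, phi):
    """Perceptron: weighted sum of the inputs plus bias, passed through phi."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)

def softmax(z):
    """Normalized exponential; the maximum is subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(relu([-1.5, 0.0, 2.0]))
print(max_pool([1.0, 3.0, 2.0, 5.0], pool_size=2, stride=2))
print(batch_norm([1.0, 2.0, 3.0]))
print(neuron([1.0, -2.0], [0.5, 0.25], 0.1, lambda s: max(0.0, s)))
print(softmax([2.0, 0.5]))  # two outputs, matching the binary case k = 2
```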

D. OPTIMIZATION OF CNN HYPERPARAMETERS USING MULTI-OBJECTIVE HYPERPARAMETER OPTIMIZATION
The CNN hyperparameters were optimized using a multi-objective genetic algorithm named NSGA-II [49], [62]. The multi-objective technique was used to obtain a balanced performance across all objectives, in contrast to what happens with a single-objective optimization method [50]. The multi-objective problem can be written as

$$\max_{v \in V} \; y(v) = O(v) = [O_1(v), O_2(v), \ldots, O_k(v)] \tag{7}$$

and

$$v_i^{L} \le v_i \le v_i^{U}, \quad i = 1, 2, \ldots, N \tag{8}$$

where $v$ is the vector of design variables in the parameter space $V$ with $N$ elements, each with upper bound $v_i^{U}$ and lower bound $v_i^{L}$, $y(v)$ is the objective space and $O(v)$ is the vector representation of the objective functions that have to be maximized [63].
A simplified representation of the implementation strategy is presented in Fig. 2, where all inputs of layers are represented as x and outputs as y. For every generation (Gen), the chromosomes of the population (Pop), named $P_t$, were generated using mutation and crossover and carry the information needed to create a CNN. Each chromosome was then translated to the CNN structure and parameters using the decoding methods indicated in Table 1. After two-fold training and testing, the next generation population ($P_{t+1}$) was chosen according to Pareto fronts and crowding distance using the Acc, Sen and Spc of the two-fold test.
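The decoding mentioned above (each real-coded gene in [0, 1] is scaled to the range of its parameter and a ceiling function yields a natural number, as detailed in Step 2) can be sketched as follows. The exact scaling used in the original implementation and the ranges of Table 1 are not reproduced here; the scaling rule, the parameter names and the ranges in this sketch are assumptions chosen for illustration.

```python
import math

def decode_gene(gene, low, high):
    """Map a gene in [0, 1] to an integer in [low, high] by scaling, then ceiling."""
    n_values = high - low + 1
    # Assumed scaling: ceil(gene * n_values) falls in 0..n_values; 0 is clamped to 1.
    return low - 1 + max(1, math.ceil(gene * n_values))

# Hypothetical parameter ranges (stand-ins for Table 1, which is not shown here).
RANGES = {
    "n_flexible_layers": (1, 8),
    "kernel_size": (2, 64),
    "pool_size": (1, 4),
}

chromosome = [0.37, 0.92, 0.05]  # one gene per parameter, all within [0, 1]
decoded = {name: decode_gene(g, *RANGES[name])
           for g, name in zip(chromosome, RANGES)}
print(decoded)
```

Under this assumed rule every integer in the range corresponds to an equally wide interval of gene values, so no parameter value is favored by the encoding.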
The implemented technique can be described in 11 steps:
Step 1: A parent population $P_0$ of size N is randomly generated.
Step 2: The system converts the chromosome to a CNN. A fixed input layer (3000 neurons for 1 minute, 9000 neurons for 3 minute and 15000 neurons for 5 minute) and a fixed output stage (fully connected layer with 2 outputs, softmax layer and class output) are always present, regardless of the structure chosen by the GA. The GA was only allowed to choose the number of layers between the fixed layers, the type of layers, the size of kernels, the pooling sizes, the stride and the number of neurons of a fully connected layer. A real-coded chromosome that ranges from 0 to 1 was used in this work. However, different types of CNN parameters have different ranges, so proper decoding was done according to Table 1. First, the generated chromosomes were scaled to the defined range, and then a ceiling function was used to obtain natural numbers. To reduce the number of possible solutions, and hence the simulation time, two different types of layer combinations were used. The first, named ConvX, was a convolution layer with batch normalization and a ReLU layer. The second, named MaxP, was a maximum pooling layer. The ConvX layer has three cascaded functions: convolution of the input with a defined kernel (k), then batch normalization, and finally a ReLU, indicated together by $f_{bd}$ in Fig. 1. To prevent losing too much information in each layer, a back-to-back MaxP layer was replaced by a ConvX layer. Each parameter has its range defined in Table 1.
VOLUME 8, 2020
Step 3: After generating the CNN structure (hyperparameters), the network was trained using the ADAM algorithm [64] with the two-fold method. The HuGCDN2008 database has 70 subjects, which were divided into 35 subjects for each of the training and testing sets. Subject independence between the training and testing sets was ensured by not mixing the subjects' data between the sets. An initial learning rate of 0.001 was employed during training, and the learning rate was multiplied by a drop factor of 0.1 every 10 epochs. The batch size was 256 and the data were shuffled in every epoch. The averages of the Acc, Sen and Spc over the two folds were calculated and used as objective parameters.
Step 4: Using the objective parameters, a non-dominated sorting was performed to sort the parent population, where $P = P_0$ [49].
Step 5: Simulated binary crossover [65] and polynomial mutation were used to create a new offspring population (Q) of size N.
Step 6: A combined population $R_t$ was created using the offspring population $Q_t$ and the parent population $P_t$. Thus, the size of $R_t$ becomes 2N.
Step 7: Fast non-dominated sorting was used to sort the entire population in the same way as in Step 4.
Step 8: The crowding distance was calculated using the method defined by Deb et al. [49].
Step 9: The combined population, $R_t$, was sorted according to the non-dominated rank and the crowding distance. Fronts were added to the next population in order, starting from the first front $F_1$. If adding the last admitted front $F_l$ made the population size greater than N, the members of $F_l$ were sorted in descending order with the crowded-comparison operator $\prec_n$ and only as many as needed to reach N were kept; the members of later fronts ($F_{>l}$) were discarded. The partial order $\prec_n$ is given by

$$i \prec_n j \ \text{ if } \ (i_{rank} < j_{rank}) \ \text{ or } \ \big((i_{rank} = j_{rank}) \ \text{ and } \ (i_{distance} > j_{distance})\big)$$

Step 10: The first N elements of the sorted list were kept and the generation counter was increased.
Step 11: Steps 5 to 10 were repeated until the termination condition (50 generations produced) was met.
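The survivor selection of Steps 7 to 10 can be sketched in a few lines. This is a simplified illustration rather than the original implementation: it uses a straightforward (not the "fast") non-dominated sort that yields the same fronts, all three objectives are maximized, and the objective values in the example are hypothetical.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Return the Pareto fronts as lists of indices into objs (best front first)."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

def crowding_distance(objs, front):
    """Crowding distance of each member of one front (boundary points get infinity)."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (hi - lo)
    return dist

def select_survivors(objs, n):
    """Fill the next population front by front; split the last front by distance."""
    survivors = []
    for front in non_dominated_sort(objs):
        if len(survivors) + len(front) <= n:
            survivors.extend(front)
        else:
            dist = crowding_distance(objs, front)
            front.sort(key=lambda i: dist[i], reverse=True)
            survivors.extend(front[: n - len(survivors)])
            break
    return survivors

# Hypothetical (Acc, Sen, Spc) triples for a combined population of four networks.
objs = [(0.89, 0.74, 0.94), (0.88, 0.72, 0.94), (0.85, 0.80, 0.87), (0.80, 0.70, 0.85)]
print(non_dominated_sort(objs))
print(select_survivors(objs, 3))
```

In the example, the first and third networks are mutually non-dominated (one has higher Acc and Spc, the other higher Sen), so both belong to the first front, which is the kind of trade-off a single accuracy objective would hide.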

E. PERFORMANCE EVALUATOR
The multi-objective method (Eq. 7 with k = 3) optimized the Accuracy (Acc), Sensitivity (Sen) and Specificity (Spc), defined by

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

$$Sen = \frac{TP}{TP + FN} \tag{10}$$

$$Spc = \frac{TN}{TN + FP} \tag{11}$$

where TP is the number of apnea minutes classified as apnea minutes, TN is the number of normal minutes classified as normal minutes, FP is the number of normal minutes classified as apnea minutes, and FN is the number of apnea minutes classified as normal minutes. The optimization of the CNN hyperparameters was performed using the HuGCDN2008 database. The other two databases (UCD and AED) were used for cross-database and transfer-learning performance assessment.
Cross-database: For cross-database testing, the CNN is developed and trained using one database (HuGCDN2008) and its performance is tested on other databases (UCD and AED).
Transfer-learning: Traditionally, transfer learning is the process of applying knowledge gained in one field to a different but related problem. In this work, the knowledge gained (the optimized CNN) with one database (HuGCDN2008) was applied to other databases (UCD and AED) with minimal modification. In practice, this is achieved by replacing the FC and output layers with new FC and output layers while retaining the rest of the CNN.
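The three objectives can be computed from minute-level labels as sketched below. The example uses hypothetical counts (80 normal and 20 apnea minutes) to show why accuracy alone is misleading on unbalanced data: a classifier that labels every minute as normal still reaches 80% accuracy with zero sensitivity. The function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with apnea encoded as 1 and normal as 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def acc_sen_spc(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from the confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    spc = tn / (tn + fp) if (tn + fp) else 0.0
    return acc, sen, spc

# Hypothetical unbalanced test set and a classifier that always predicts "normal".
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100

acc, sen, spc = acc_sen_spc(*confusion_counts(y_true, y_pred))
print(acc, sen, spc)
```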
In addition to this, this work investigated the relationship of input sizes, number of layers and performance.

III. RESULTS
The algorithm was implemented in MATLAB and run on a computer with an Intel Core (TM) i7-8700k processor, 64 GB RAM, and two NVIDIA GeForce GTX 1080 Ti GPUs. The two folds were run in parallel on the two GPUs and the average of the obtained objective function values was computed. The optimizations were carried out with three different input sizes: 1 minute with 3000 samples, 3 minute with 9000 samples and 5 minute with 15000 samples. The termination condition was 50 generations with a population size of 50 (population size = offspring population size = N), which leads to 50 × 50 = 2500 different networks and 5000 networks to train for each input size (because a two-fold method was employed). The optimization took 587.83926 hours (≈ 24.49 days), 832.04399 hours (≈ 34.67 days) and 911.226716 hours (≈ 37.97 days) for the 1 minute, 3 minute and 5 minute inputs, respectively.
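The counts and durations quoted above follow from simple arithmetic, reproduced here as a short check. Only the values stated in the text are used; the variable names are illustrative.

```python
generations = 50
population = 50
folds = 2

networks = generations * population  # candidate networks evaluated per input size
trainings = networks * folds         # each candidate is trained once per fold

hours = {"1 minute": 587.83926, "3 minute": 832.04399, "5 minute": 911.226716}
days = {name: round(h / 24, 2) for name, h in hours.items()}

print(networks, trainings)
print(days)
```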

A. HYPERPARAMETER OPTIMIZATION
In the first step, the algorithm generates a random population to ensure the diversity of the population and ranks its members according to the multi-objective optimization (Acc, Sen and Spc) [50]. From Fig. 3 it can be seen that the three simulations (1 minute, 3 minute and 5 minute input) started (1st Gen) with random solutions spread all over the problem space. Over the generations, using mutation and crossover, the algorithm was able to reach better solutions, as can be verified in Fig. 4, Fig. 5 and Fig. 6. Almost all the solutions of the 50th generation were better than those of the 1st Gen. Among the 50th Gen solutions for the three different inputs, it is noticeable that the 3 minute and 5 minute inputs have better results than the 1 minute solutions, although in a similar range.
The multi-objective algorithm, NSGA-II, ranked the output solutions according to their Pareto front numbers. All the solutions on the first Pareto front are valid solutions, as NSGA-II does not generate a single solution but a set of Pareto non-dominated solutions. One way of assessing the classifier performance is the Receiver Operating Characteristic (ROC) curve (Sen vs 1-Spc). Since the algorithm uses a three-dimensional problem space (Fig. 3), a modified two-dimensional version of the ROC curve, in which all of the first Pareto front solutions are shown (Fig. 7), allows the solutions to be compared. For all three inputs, the CNN's first Pareto front of the 50th Gen has better and more solutions than the first Pareto front of the 1st Gen, as can be seen in Fig. 7 (a), (b), (c). For the 50th Gen, it was assessed that the 5 minute input has the best solution, followed by the 3 minute and 1 minute CNN (Fig. 8).
Although NSGA-II treated the three optimization objectives (Acc, Sen, and Spc) equally, the solutions do not have equal Sen, Spc, and Acc. This is due to restrictions of the problem space and constraints such as the overlap of apnea and normal events (Sen and Spc are interdependent, and Acc depends on both). Therefore, the final solution was chosen as the one with the highest Acc among all the valid solutions, which also helps the comparison with other methods presented in the literature. By using Acc as the criterion for choosing one solution over the others, three solutions for the 1 minute, 3 minute and 5 minute inputs were selected, with Acc of 88.2%, 89.24%, and 89.32%; Sen of 72.55%, 74.05%, and 74.75%; and Spc of 94.21%, 94.60%, and 94.44%, respectively, for the two-fold method. These solutions are marked with a black dot and their values are indicated in a box in Fig. 3, Fig. 4, Fig. 5, Fig. 6, Fig. 7 and Fig. 8. The detailed results of the two-fold method are shown in Table 2 and the average is shown in Table 5. The number of flexible layers (NoFL) was 5, 5 and 7 (Fig. 9, Section III C), resulting in 19, 19 and 21 layers for the 1 minute, 3 minute and 5 minute CNNs, respectively (Table 3). The layer sequences of the 1 minute and 3 minute CNNs are also similar, with the same first Conv layer (L2) and Batchnorm layer (L3) in both. However, the remaining layers have more kernels or channels. The 5 minute CNN has two more flexible layers, which are maximum pooling layers (L9, L13). It has the same number of Conv, ReLU and Batchnorm layers as the 1 minute and 3 minute CNNs, except that the number of kernels and channels in the first layer is higher.

B. CROSS-DATABASE PERFORMANCE
In order to check the universality of the system, the CNNs trained on the HuGCDN2008 database were tested on the AED [53] and UCD [54] databases. The results are presented in Table 2. The performance of all three networks on the UCD [54] database was lower than on the database used for training, but it was higher on AED [53]. The highest accuracy, 92.65%, was achieved with the 1 minute input on the AED [53] database. For the HuGCDN2008 and UCD [54] databases, the main difficulty was the detection of short apnea events. This could be related to the fact that some respiratory pauses do not produce a clear pattern in the oximetry signal, which in turn could be explained by the hemoglobin dissociation curve: short events may not decrease the SpO2 percentage because a marked reduction in the partial oxygen pressure does not occur. In addition, pH, temperature and 2,3-diphosphoglycerate (2,3-DPG) levels, which are specific to each person, can displace the hemoglobin dissociation curve [23].

C. EFFECT OF INPUT SIZE
With increasing input size, the performance on the HuGCDN2008 dataset improved slightly in Acc, from 88.52% to 89.28% and 89.32%. However, between the 3 minute and 5 minute inputs, the results were almost the same. The Sen (apnea events) was affected by the input size, with an improvement of more than 2% from the 1 minute to the 5 minute input size. The Spc (normal events) remained almost the same. By increasing the Sen and keeping the Spc stable, the classifiers were able to increase the Acc. A possible reason why longer inputs achieve better Sen is that an apnea event can span different minutes; thus, having the information of longer apnea events increases the detection capability. Another reason could be that, in some works [23], spectral features computed over longer (five minute) windows showed more relevance. However, in the other datasets (AED [53] and UCD [54]) this trend was not consistent. For the AED [53] dataset, the highest Acc, 92.65%, and Sen, 91.64%, were achieved by the 1 minute input, and for UCD [54] the best results, Acc of 84.96% and Sen of 67.35%, were achieved by the 3 minute input. Sometimes a network with a larger input size performs worse than one with a smaller input size. Therefore, the performance parameters are more dependent on the data and the trained weights than on the input size.

D. EFFECT OF LAYERS
Due to the success of big (deeper) networks, one can assume that more layers give better results in deep learning. But this assumption is not always true. The number of chosen layers for each solution can be seen in Fig. 9. Analyzing the figures, it is possible to assess that the algorithm tried different layer sizes to solve the problem and that the better solutions did not have the highest number of flexible layers (NoFL). A similar conclusion was presented by Urtnasan et al. [30], [34], who found a six-layer CNN to be optimal when testing networks with 3 to 9 layers.

E. TRANSFER LEARNING PERFORMANCE
Transfer learning can be useful to apply the information learned from one problem to others. This work mainly focused on OSA detection, so the transfer learning performance was analyzed with AED [53] and UCD [54], where the main network was trained using the HuGCDN2008 database. The last three layers (L 17-19 for 1 minute and 3 minute, L 19-21 for 5 minute, indicated in Table 3) were removed and replaced with layers of the same types. Afterwards, the network was retrained with the leave-one-out method (due to the low number of subjects in these databases). There were two differently weighted networks for each CNN input size (generated by the two-fold method applied to the HuGCDN2008 dataset). Because the actual training data for these transfer learning networks came from a different dataset, there is no need to average the results of the two networks (as was done for the two-fold method in HuGCDN2008). Table 4 presents the results of the two networks, and the best results are summarized in Table 5. It was verified that in all of the cases the transfer-learned networks have a better accuracy. However, there were database dependencies. In some of the cases, the original network that performed best on HuGCDN2008 (e.g., CNN1DF2_1, the second-fold network in Table 2) was not the one that performed best on the other two datasets (e.g., CNN1DF1_1 and CNN1DF1_1 in Table 4).
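The two mechanical steps of this procedure, replacing the last three layers and building leave-one-out folds, can be sketched as follows. The sketch is deliberately framework-free: a network is modeled as a list of dictionaries, the layer list is a shortened hypothetical stand-in for the architectures of Table 3, and the leave-one-out unit is assumed to be the subject (recording). Whether the retained layers were frozen or fine-tuned during retraining is not stated in the text, so the sketch does not model it.

```python
def replace_last_layers(layers, n_replaced=3):
    """Keep the earlier layers (with their weights) and re-create the last ones untrained."""
    kept = [dict(layer) for layer in layers[:-n_replaced]]
    fresh = [{"type": layer["type"], "weights": None} for layer in layers[-n_replaced:]]
    return kept + fresh

def leave_one_subject_out(subject_ids):
    """Return (training subjects, held-out subject) pairs, one per subject."""
    ids = list(subject_ids)
    return [([s for s in ids if s != held_out], held_out) for held_out in ids]

# Hypothetical, shortened network; "trained" marks weights learned on HuGCDN2008.
source_net = [{"type": t, "weights": "trained"}
              for t in ("input", "conv", "batchnorm", "relu", "maxpool",
                        "fc", "softmax", "classoutput")]
target_net = replace_last_layers(source_net)

folds = leave_one_subject_out(range(1, 9))  # eight AED recordings with SpO2
print([layer["weights"] for layer in target_net])
print(len(folds), folds[0])
```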

F. COMPARISON WITH STATE OF THE ART WORK
The closest match for comparison with this work was developed by Ravelo-García et al. [23], where the same database, HuGCDN2008, was used. A shallow classifier, linear discriminant analysis (LDA), was employed with the SpO2 signal and with a combination of SpO2 and HRV. The proposed work achieved 89.32% Acc using a 5 minute window with only SpO2, compared to the 86.5% and 86.9% obtained in [23] with a mix of 1 minute and 5 minute windows using SpO2 alone and SpO2 with HRV, respectively. The proposed implementation was also able to keep the same performance level with a 3 minute window, without sacrificing the performance parameters. Even the 1 minute window has better Acc and Sen compared to the other works in Table 5.
For the AED [53] dataset, the proposed optimized CNN achieved 92.65% Acc, 93.36% Spc and 91.64% Sen. Though the Acc was not the best among the other implementations, it has one of the best Sen, only surpassed by the long short-term memory (LSTM) [66] and the deep auto encoder network (DAE) [26]. However, neither of these works [26], [66] was subject independent. For UCD [54], the transfer learning approach achieved the highest accuracy compared to the other works, except for the DAE [26], which was also not subject independent. In both databases, transfer learning increased the performance parameters.
If the comparison only includes deep learning, the proposed networks achieved the best accuracy among all subject independent implementations, even compared to some implementations where more signals were employed, such as a combination of SpO2, airflow and respiration [67] or the combination of SpO2, oronasal airflow and movements (ribcage and abdomen) [33].
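For reference, the performance parameters compared in this section (Acc, Sen and Spc) follow the standard confusion-matrix definitions. The sketch below is only an illustration: the function name `performance_parameters` and the toy labels are assumptions, with 1 denoting an apnea window and 0 a normal window.

```python
from typing import Dict, Sequence


def performance_parameters(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy, sensitivity and specificity for binary labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # apnea detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal kept normal
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed apnea
    return {
        "Acc": (tp + tn) / len(pairs),
        "Sen": tp / (tp + fn) if (tp + fn) else float("nan"),
        "Spc": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical labels for eight windows (not data from any database).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = performance_parameters(y_true, y_pred)
```

Because Acc mixes both classes, a network can raise Acc while lowering Sen when normal windows dominate, which is why the three parameters are reported together.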

IV. CONCLUSION, LIMITATIONS AND FUTURE WORK
The goal of this work was to develop and test a novel fully automated hyperparameter optimization algorithm for CNN, and significant results were attained.
Three different window sizes were also tested, and it was verified that there is almost no difference between the 3 minute and 5 minute window sizes. In some cases, the 1 minute input outperformed the 3 minute and 5 minute inputs. Compared to shallow networks, the developed CNNs were able to achieve better performance parameters with a smaller input size and without the need for feature extraction.
It was also verified that the performance of networks with almost similar structures was more sensitive to the training and the data than to the choice of hyperparameters. Also, it was verified that transfer learning has a strong potential for implementation in similar domains.
One of the limitations of this work is that multi-objective optimization was only applied to the hyperparameter optimization and was not used for the training. As a result, when transfer learning was implemented, the network sacrificed Sen to achieve a better Acc. The second limitation is the population size of only 50, which cannot ensure that the networks had a strong diversity to start with. However, this issue was mitigated by the use of mutation. This work was not designed to optimize the layer size, so even the 50th generation has networks of substantially different sizes. One way of solving this issue would be to run more generations until stable solutions are found. Another way could be to include the number of layers as one of the objectives, which is under consideration for future research. It was verified in the literature that increasing the number of signals [37] or selecting a recurrent neural network (RNN) [68] could possibly improve the results [34], [30]. So, increasing the number of signals could be investigated in the future.
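The idea of including the number of layers as an additional objective can be sketched with a plain Pareto-dominance filter. This is an illustration under stated assumptions, not the optimization algorithm developed in this work: the `Candidate` fields, the choice of Acc and Sen as the other objectives, and all numeric values are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    acc: float     # objective to maximize
    sen: float     # objective to maximize
    n_layers: int  # hypothetical extra objective to minimize


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = a.acc >= b.acc and a.sen >= b.sen and a.n_layers <= b.n_layers
    better = a.acc > b.acc or a.sen > b.sen or a.n_layers < b.n_layers
    return no_worse and better


def pareto_front(population: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in population
        if not any(dominates(other, c) for other in population)
    ]


# Hypothetical population of four networks (values are made up).
population = [
    Candidate("A", acc=0.90, sen=0.88, n_layers=21),
    Candidate("B", acc=0.90, sen=0.88, n_layers=15),
    Candidate("C", acc=0.92, sen=0.85, n_layers=19),
    Candidate("D", acc=0.89, sen=0.84, n_layers=19),
]
front = pareto_front(population)
front_names = [c.name for c in front]
```

In this toy population, network A is discarded because B reaches the same Acc and Sen with fewer layers, so the added objective pushes the search toward smaller networks of equal performance.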
GABRIEL JULIÁ-SERDÁ received the degree in medicine from the Autonomous University of Barcelona, the specialty of pulmonology from the Bellvitge Hospital, Barcelona, and the Ph.D. degree from the University of Las Palmas de Gran Canaria. He spent three years as a Postdoctoral Fellow with the Toronto General Hospital and the Mount Sinai Hospital, Toronto, ON, Canada, where he investigated issues related to airway physiology. He has worked as a Pulmonologist in several hospitals in Spain. He has been an Associate Professor of medicine with the University of Las Palmas. He is currently with the Dr. Negrín University Hospital of Las Palmas de Gran Canaria. He is also with the Perpetuo Socorro Clinic, Las Palmas. His research interests include physiology and respiratory diseases. He has made several publications both nationally and internationally on these topics.

He was a Lecturer with the Technical University of Setúbal, Portugal. He is currently an Assistant Professor with the Universidade da Madeira. He is also a Researcher with the Madeira Interactive Technologies Institute. He is also the Director of the Ph.D. degree in automation and instrumentation. His research interests include sleep monitoring, renewable energy, artificial neural networks, and FPGA implementations.
Dr. Morgado-Dias is also a member of the Fiscal Council of the Portuguese Association of Automatic Control.

VOLUME 8, 2020