Automatic Sleep Arousals Detection From Polysomnography Using Multi-Convolution Neural Network and Random Forest

Sleep arousal is a type of sleep disorder in which the sleeper briefly wakes up and then falls asleep again. Monitoring the number and duration of sleep arousals is a crucial aspect of sleep quality assessment. Detecting sleep arousals caused by apnea is relatively easy, and existing methods already give high-quality results. However, detecting sleep arousals that are not caused by apnea remains an ongoing challenge, and it was the subject of the PhysioNet/Computing in Cardiology Challenge 2018 (CinC 2018). We propose an automatic detection algorithm for non-apnea sleep arousals based on polysomnography (PSG) data. We took the 8 most representative signals from the 13 PSG channels as input, performed a preliminary classification with multiple convolutional neural networks (CNNs), and then passed the initial results to a random forest module for ensemble voting to obtain the final decision. We carried out experiments on the CinC 2018 database. We grouped the original dataset, trained one CNN on each group, and kept the positive and negative samples balanced during training. Under 4-fold cross-validation, the AUROC and AUPRC were 0.953 and 0.552, respectively, which were better than the results of the team ranked first in CinC 2018.


I. INTRODUCTION
Sleep disorders are abnormalities of sleep quality that arise from various causes, and mainly include insomnia, sleep arousals, and abnormal sleep behavior. Arousal during sleep is the main factor affecting sleep, and it has a negative impact on the sleep/wake cycle [1]. An arousal is a brief intrusion of wakefulness into sleep, after which sleep resumes [2]. Correct detection of arousals helps to assess sleep quality accurately [3] and makes well-directed treatment possible, but arousals are difficult to detect.
One of the most studied causes of sleep arousal is Obstructive Sleep Apnea Hypopnea Syndrome (or simply, apnea) [4]. While apneas are well-known sleep disturbances, they are not the only cause of disturbance. Other factors, such as respiratory effort related arousals (RERAs), snoring, partial airway obstructions and teeth grinding, may also cause sleep arousals [5]; these are collectively called non-apnea arousals.
The associate editor coordinating the review of this manuscript and approving it for publication was Rajeswari Sundararajan.
The region of a non-apnea arousal is defined as running from 2 seconds before a RERA arousal begins to 10 seconds after it ends, or from 2 seconds before a non-RERA non-apnea arousal begins to 2 seconds after it ends [5].
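The region definition above can be written as a small helper. This is a minimal sketch: the function name `target_region`, its arguments, and the clamping of the start time at zero are assumptions made for illustration and are not part of the challenge tooling.

```python
def target_region(onset_s, offset_s, is_rera):
    """Return (start, end) in seconds of the scored region for one non-apnea arousal.

    onset_s / offset_s: annotated start and end of the arousal, in seconds.
    is_rera: True for a RERA arousal, False for any other non-apnea arousal.
    """
    lead = 2.0                        # 2 s before the arousal begins (both cases)
    tail = 10.0 if is_rera else 2.0   # 10 s after a RERA ends, otherwise 2 s
    start = max(0.0, onset_s - lead)  # clamp at the start of the record (assumption)
    return (start, offset_s + tail)
```

For example, a RERA annotated from 100 s to 110 s would give the region (98.0, 120.0), whereas a non-RERA arousal with the same annotation would give (98.0, 112.0).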
The representative indicators for evaluating the performance of sleep arousal detection are the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), which are widely used in medical target detection, machine learning, data mining and related problems to assess classification and test results [6].
Polysomnography (PSG) is the gold standard for diagnosing sleep diseases [7]. It contains 13 channels of physiological signals, such as electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), electrocardiogram (ECG) and oxygen saturation (SaO2). It is recorded by medical staff using a variety of devices attached to the patient, and it typically contains 6 to 8 hours of sleep data, which are used to analyze sleep disorders.
Since manually annotating the various physiological signals is tedious and time-consuming, automatic detection of sleep arousals has become an essential research field. In previous research, Cho, Lee et al. proposed an automatic detection method based on time-frequency analysis and a support vector machine (SVM) classifier [8]. They experimented on single-channel EEG from 9 patients. Alvarez-Estevez and Moret-Bonillo proposed an automatic arousal detection algorithm based on Fisher's linear and quadratic discriminants, support vector machines and artificial neural networks [9]. They used 2 EEG channels and EMG data from 20 patients. Kantar and Erdamar designed a decision support system for arousal detection by analyzing and extracting features of EEG signals [10]. They used a deep neural network for classification. However, all of these methods were developed on datasets containing relatively few patients, and may not generalize well across different populations.
The CinC 2018 challenge [11] also addressed this subject, and a large amount of research was carried out for it. The top-ranked entrants of CinC 2018 proposed several models for arousal detection. Daniel Miller, Andrew Ward et al. proposed a Convolution-Deconvolution arousal recognition network with cross connections [12], which consists of 8 convolution layers, 8 deconvolution layers and a fully connected layer. Matthew Howe-Patterson et al. constructed a dense recurrent convolutional neural network (DRCNN) to detect sleep arousals [13]. This network has multiple dense convolution units (DCUs) and a bidirectional long short-term memory (BiLSTM) layer with residual skip connections. Heiðar Már Þráinsson, Hanna Ragnarsdóttir et al. used a recurrent neural network (RNN) to classify arousal regions and non-arousal regions [14]. Their network has two layers: the first is a bidirectional recurrent neural network (BRNN) and the second is a dense neural network. Among the CinC 2018 results, the large variance in performance across entrants (mean AUPRC of 0.28, ranging from 0.07 to 0.55) indicates that arousal detection still requires further research [11].
Most of the above CinC 2018 studies use a single classifier to detect arousal regions. A single classifier cannot account for the differences between populations, so it cannot fully learn the crucial features of the physiological data of each group of people. In addition, most of them neither reorganized and recombined the experimental dataset nor analyzed in depth which PSG channels are more meaningful for arousal detection, so there was no guarantee that the subsequent networks would extract the most valuable information.
To address these deficiencies, and inspired by ensemble learning [15], in this paper we applied two classification schemes: multiple convolutional neural networks and a random forest. We used several CNNs to extract signal features and make a preliminary classification, and we used the random forest to determine the weight of these initial classifiers and give the final result.
The remainder of this paper is organized as follows. We describe the database and the related theories and concepts in Section II. Our methodology is presented in Section III. We perform extensive experiments and explain them in detail in Section IV. We present our results and evaluations in Section V. We then analyze and discuss our methods in Section VI. Finally, conclusions and future directions are outlined in Section VII.

A. DATABASE
We experimented on the CinC 2018 training set, which consists of 994 PSG records from different subjects. The data were recorded at an MGH sleep laboratory for the diagnosis of sleep disorders. Each record contains approximately 7-8 hours of monitoring data with 13 channels of physiological signals: 6 channels of EEG, 1 channel of EOG, 3 channels of EMG, 1 channel of Airflow, 1 channel of SaO2, and 1 channel of ECG. All signals were sampled at 200 Hz and were measured in microvolts.
The dataset also reports the number and types of arousals, as shown in Table 1. In the training set, the target arousal labels are given. Over the whole dataset, the arousal regions (label '+1') account for 4 percent of the data, the non-arousal regions (label '0') for 80 percent, and the undefined regions (label '−1') for 16 percent.
The target arousal regions have a minimum duration of about 30 s and a maximum duration of more than 4 min. Moreover, most of these non-apnea arousal regions are located in stage 1 and stage 2, with a few in stage 3, of the six sleep stages.

B. CONVOLUTIONAL NEURAL NETWORKS
A Convolutional Neural Network (CNN) is a special kind of deep neural network model. Given training examples, it can approximate the mathematical relationship between input and output. In recent years, it has been widely used in classification and identification [16].
Its structure can be divided into three parts. The first part is the convolutional layer, which is generally placed at the beginning of the network and receives the input. It is used to extract feature information. In this layer, each neuron is connected only to some adjacent neurons, and these neurons together form a feature plane. Neurons in the same plane share the same weights, and the shared weights are called the convolution kernel. Sharing the convolution kernel reduces the risk of overfitting, and the partial connections reduce the complexity of the network. The second part is the pooling layer, which sits between two successive convolutional layers. It compresses the data to reduce the risk of overfitting, and it provides feature invariance. It also reduces the feature dimension by removing redundant information. The third part is the fully connected layer. Its structure is the same as that of a general neural network: each neuron is connected to every neuron in the adjacent layer. It combines the extracted feature information and then sends the output value to the next stage.

C. RANDOM FOREST
Random Forest (RF) is a highly flexible machine learning algorithm with wide application prospects. It integrates multiple trees through the idea of ensemble learning. Its basic unit is the decision tree. Each decision tree is a classifier, and the forest collects the votes of all trees and takes the category with the most votes as the final output [17].
It runs efficiently on large datasets. It can also handle thousands of input variables without variable deletion, and it can maintain accuracy even when a large proportion of the data are missing.

A. THE FRAME OF AROUSAL DETECTION MODEL
The overall architecture of the automatic detection system for non-apnea arousals is shown in Figure 1. We first preprocessed the raw data. The input of the preprocessing module is the raw multi-channel PSG signals, and the output is a set of processed data segments, each 4 s long. We then fed each segment into the automatic recognizer module, which gives a classification result for each data segment. After that, we post-processed these classification results and finally output the arousal detection results for the whole PSG recording.
We used independent component analysis (ICA) and the double-density discrete wavelet transform (DWT) algorithm [18] to remove artifacts from the 3 EEG channels, and then applied a 50th-order FIR filter (designed with the MATLAB FDATool) to the 8 selected channels. After that, we downsampled these signals to 50 Hz, reducing the data size by a factor of 4. To eliminate the differences in scale among these signals, we normalized each signal using its mean and standard deviation. Finally, we divided the data into non-overlapping segments of length 4 s to facilitate subsequent network processing. The entire data preprocessing module is shown in Figure 2.
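The last three preprocessing steps (downsampling, normalization and segmentation) can be sketched for a single channel as follows. This is only an illustration under stated assumptions: the function names are hypothetical, plain decimation is used on the assumption that the FIR filter has already limited the bandwidth, and an incomplete final segment is simply dropped. Artifact removal and filtering are not reproduced here.

```python
from statistics import mean, pstdev

FS_IN = 200     # original sampling rate in Hz (from the text)
FS_OUT = 50     # sampling rate after downsampling in Hz (from the text)
SEGMENT_S = 4   # segment length in seconds (from the text)


def downsample(signal, factor=FS_IN // FS_OUT):
    """Keep every `factor`-th sample; assumes the signal was low-pass filtered first."""
    return signal[::factor]


def zscore(signal):
    """Normalize a channel to zero mean and unit standard deviation."""
    mu = mean(signal)
    sigma = pstdev(signal)
    if sigma == 0:
        return [0.0 for _ in signal]  # constant channel: nothing to scale
    return [(x - mu) / sigma for x in signal]


def segment(signal, length=FS_OUT * SEGMENT_S):
    """Split into non-overlapping segments, dropping an incomplete tail (assumption)."""
    n_full = len(signal) // length
    return [signal[i * length:(i + 1) * length] for i in range(n_full)]


def preprocess_channel(raw):
    """Downsample, normalize and segment one channel."""
    return segment(zscore(downsample(raw)))
```

With these settings, 9 s of raw data (1800 samples at 200 Hz) would yield two complete segments of 200 samples each.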

C. THE AUTOMATIC RECOGNIZER
This module is the core of the algorithm. We propose a method that combines multiple CNN models with a random forest. We first fed each data segment into n CNN models (n being the number of CNNs), and each CNN model gave a classification result for the segment. We then sent the decision vectors generated by the CNNs to the random forest, which determined the weight of each CNN and gave the final verdict for each data segment. The structure of the automatic recognizer is illustrated in Figure 3.

1) CNN NETWORK STRUCTURE
We used an original CNN model to extract features from the short data segments. The proposed network structure is shown in Figure 4. The activation function of our network is ReLU, and we used the MSRA method (the initialization proposed by He et al. at Microsoft Research Asia) to initialize the network weights, which works better for networks whose activation function is nonlinear [19]. Increasing the number of layers improves the nonlinear fitting ability but slows learning, so we made a compromise between learning speed and accuracy. Our network consists of a feature extraction part, composed of 4 one-dimensional convolution layers and 2 pooling layers, and a feature classification part, composed of 2 fully connected layers and 1 softmax layer.
VOLUME 8, 2020
The input signal consists of the 8 channels selected from the 13 PSG channels. The data segment length is 4 s and the sampling rate is 50 Hz. Therefore, each channel has 200 data points and the input data size is 200 * 8.
First, the input signals enter a one-dimensional convolution layer with 12 convolution kernels of size 5. To keep the length of each channel unchanged after the convolution layer, we zero-padded the front and back ends of the data during the convolution. Batch normalization and the ReLU activation function were then applied. The output size after this convolution layer is 200 * 12. We then used a module with 3 convolutional layers and 2 pooling layers to extract the higher-order features of the signal. The number of convolution kernels is halved from one convolution layer to the next; the kernel size is 3 in the first convolution layer of this module and 2 in the second. The pooling layers use max pooling, with a stride of 4 in the first pooling layer and 2 in the second. After this module, the size of the output data is 25 * 3. We then flattened the data into a vector and mapped the features to fully connected layers, where they are easier to classify; these layers form the final feature classification module. The first fully connected layer contains 50 units, and the second contains 5 units. Both fully connected layers use the ReLU activation function, with a dropout rate of 0.2 to improve the generalization ability. Finally, a softmax layer determines whether the segment is an arousal region and gives the decision result.
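The layer sizes quoted above can be checked with simple shape arithmetic. The sketch below is not the authors' network code. It assumes stride-1 convolutions padded to preserve length, pooling windows equal to their strides, and the layer order convolution, convolution, pooling, convolution, pooling. Only the two convolution layers of the module whose kernel sizes are stated in the text are traced, with 6 and 3 kernels inferred from the halving rule and the stated 25 * 3 output.

```python
def conv_len(n, kernel, pad_total):
    """Output length of a stride-1 1-D convolution with pad_total zeros added in total."""
    return n + pad_total - kernel + 1


def pool_len(n, stride):
    """Output length of max pooling whose window equals its stride (assumption)."""
    return n // stride


def trace_shapes():
    """Return (layer, length, channels) for each stage of the assumed layer order."""
    shapes = []
    n, c = 4 * 50, 8                       # 4 s at 50 Hz, 8 PSG channels
    shapes.append(("input", n, c))
    n, c = conv_len(n, 5, 4), 12           # 12 kernels of size 5, length preserved
    shapes.append(("conv k=5", n, c))
    n, c = conv_len(n, 3, 2), c // 2       # kernels halved to 6 (inferred)
    shapes.append(("conv k=3", n, c))
    n = pool_len(n, 4)                     # first max pooling, stride 4
    shapes.append(("max pool /4", n, c))
    n, c = conv_len(n, 2, 1), c // 2       # kernels halved to 3 (inferred)
    shapes.append(("conv k=2", n, c))
    n = pool_len(n, 2)                     # second max pooling, stride 2
    shapes.append(("max pool /2", n, c))
    shapes.append(("flatten", n * c, 1))   # vector fed to the 50-unit layer
    return shapes


if __name__ == "__main__":
    for name, length, channels in trace_shapes():
        print(f"{name:12s} {length} * {channels}")
```

Under these assumptions the trace reproduces the 200 * 12 and 25 * 3 sizes given in the text, and the flattened vector passed to the first fully connected layer has 75 elements.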

2) RANDOM FOREST MODULE
Since each CNN model outputs a result, we need to combine these results to make the final decision. Inspired by the idea in ensemble learning of combining several weak classifiers into one strong classifier [15], and considering that the CNN models do not depend on each other, we used a random forest, which is an improvement on the Bagging algorithm [15], to realize this process. We used it so that the CNNs that perform better carry more weight, while the influence of the other CNNs on the final result is diminished. The decision trees in our random forest are classification and regression trees (CART), which are commonly used in data mining. CART selects the attribute with the smallest Gini index as the optimal splitting attribute [20]. A schematic diagram of this process is shown in Figure 5. The input of the random forest is the decision vector generated by the multiple CNN models; that is, we regarded each CNN output as one feature of the random forest. Since the number of CNNs is n, the input data size is 1 * n. Each decision tree then classifies using a random selection of these CNN outputs, and a simple vote over the classification results of all decision trees determines the final result. For each data segment, the random forest gives a 1 * 1 result.
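The two ingredients described above, choosing a split by Gini impurity and combining trees by a simple vote, can be illustrated as follows. This is a minimal sketch and not the authors' implementation: the function names, the fixed split threshold of 0.5 and the toy data in the usage note are all assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum over classes of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(rows, labels, threshold=0.5):
    """Index of the CNN output whose split gives the lowest weighted impurity."""
    n_features = len(rows[0])
    return min(range(n_features),
               key=lambda f: split_gini(rows, labels, f, threshold))


def majority_vote(tree_predictions):
    """Final label for one segment: the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

As a usage example, with rows `[[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.2, 0.1]]` (two hypothetical CNN outputs per segment) and labels `[1, 1, 0, 0]`, splitting on the first output separates the classes perfectly, so `best_feature` selects index 0.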

D. POST-PROCESSING
For each segment, the automatic recognizer gives a decision value. Because successive data segments are correlated in time, we adopted some post-processing methods to improve the performance of the system, as shown in Figure 6. First, we smoothed the output of the automatic recognizer. We used a moving average filter [21] to process the decision values of consecutive data segments and to correct decision errors that may occur over a short time. Next, we took the output value of each segment as the label of every data point in that segment, and then upsampled the data to 200 Hz to obtain the arousal detection results for the whole recording.
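The two post-processing steps can be sketched as follows. The window length of the moving average is not given in the text, so the default of 5 segments is an assumption, as are the function names and the choice to shrink the window at the ends of the record.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the edges of the record."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def expand_to_samples(segment_scores, segment_s=4, fs=200):
    """Repeat each segment score for every sample it covers at the original rate."""
    per_segment = segment_s * fs  # 800 samples per 4 s segment at 200 Hz
    return [score for score in segment_scores for _ in range(per_segment)]
```

A single isolated decision, for example `[0, 0, 1, 0, 0]` smoothed with `window=3`, is spread over its neighbours, which is how short-lived errors are damped before the scores are expanded back to 200 Hz.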

IV. EXPERIMENTS
A. INTRODUCTION OF THE EVALUATING INDICATORS FOR EXPERIMENT
We used the AUROC and AUPRC to evaluate the performance of the designed arousal detection system.
The ROC curve is obtained by sweeping the decision threshold of a binary classifier; it plots sensitivity on the vertical axis against 1 − specificity (the false positive rate) on the horizontal axis. It shows the relationship between the specificity and sensitivity of the algorithm and weighs the influence of missed detections against that of false alarms. The area under this curve is the AUROC. The larger the value (the closer to 1), the higher the accuracy of the system. The system has low accuracy when the AUROC is 0.5∼0.7 and a certain degree of accuracy when it is 0.7∼0.9. When the AUROC is above 0.9, the system has high accuracy [22]. The precision-recall curve (PRC) better reflects the real classification performance when the positive and negative samples are heavily imbalanced [23]. It shows the relationship between precision and recall. Taking precision as the vertical axis and recall as the horizontal axis, we can also draw a curve and obtain the area under it, the AUPRC. The larger its value (the closer to 1), the more comprehensive and accurate the recognition.
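Both indicators can be computed directly from per-sample scores and labels. The sketch below is an illustration and not the CinC 2018 scoring program: the AUROC is computed as the fraction of positive-negative pairs ranked correctly, and the AUPRC is approximated by average precision, which may differ slightly from the challenge's own integration. The function names are hypothetical.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive sample."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(1 for y in labels if y == 1)
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos
```

The pairwise AUROC is quadratic in the number of samples, so it is suitable only for small illustrations; a rank-based formulation would be preferable on full recordings.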

B. DATA PARTITION
We divided the records of the 994 subjects into three parts: 696 records were used for training, 99 records for validation, and the other 199 records for testing. The entire dataset contains 12 series, named tr-03, tr-04, up to tr-14. We split the three datasets by series, which makes the partition more general and the final experimental results more convincing. Our data partition is shown in Table 2.

C. BUILD TRAINING DATA
Since the ratio of arousal regions to non-arousal regions in the dataset is about 1:20, the detection of arousal regions can be regarded as a classification problem with unbalanced categories. Some studies have shown that if 90% of the training samples belong to the same category, the classifier will assign all samples to that class [24]. In that case the classifier is useless, even though the final classification accuracy is very high. Therefore, balancing the dataset is crucial. In addition, when the data are unbalanced, accuracy has little value as an evaluation measure, and it is more appropriate to use precision and recall for evaluation [25].
Our method of balancing the dataset is to expand the arousal regions. We oversampled with the Synthetic Minority Over-sampling Technique (SMOTE) algorithm [26], using it to synthesize new arousal-region data rather than duplicating the existing data. We eliminated the undefined data points, since they were not useful for detecting the arousal regions. We kept all the data points in non-arousal regions. Finally, we recombined these data into a new dataset and used it to train our network.
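The effect of a 1:20 class ratio on accuracy can be made concrete with a toy example. The sketch below scores a classifier that labels every sample as non-arousal; the helper name and the 21-sample toy set are assumptions for illustration only.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall of the positive class) for binary labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(1 for t in y_true if t == 1)
    return correct / len(y_true), true_pos / positives


# One arousal sample for every 20 non-arousal samples, as in the dataset ratio.
y_true = [1] * 1 + [0] * 20
# A degenerate classifier that always predicts "non-arousal".
y_pred = [0] * len(y_true)

acc, rec = accuracy_and_recall(y_true, y_pred)
```

Here the accuracy is 20/21, roughly 95%, while the recall is 0, which is why accuracy alone says little about arousal detection.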
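The core idea of SMOTE is to create a synthetic minority sample on the line segment between an existing minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that interpolation step, not the implementation used in the experiments: the function names, the default of 5 neighbours and the fixed random seed are assumptions.

```python
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(others, key=lambda o: dist2(point, o))[:k]


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = rng.choice(nearest_neighbors(x, others, k))
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic
```

Because every synthetic sample is a convex combination of two real minority samples, the new data stay inside the region already occupied by the arousal class instead of repeating existing points.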

D. MODEL TRAINING
Our programming language was Python 3.7, and we trained the model on a computer with an Intel Core i7-8700K CPU, an NVIDIA GTX 1080Ti GPU, and 32 GB of RAM.
We divided the records of the 12 series in the reconstructed dataset into 12 groups. For each group of data, the total data length is about 5400, and we trained one CNN model on it. The CNN models do not depend on each other during training or computation, so they can be trained in parallel, which reduces the total training time. The batch size for training was 50. The loss function was the cross-entropy cost function, and the Adam optimization method [27] was used to optimize our model. The learning rate affects the convergence rate of the network, and the optimal learning rate depends on the model structure and the experimental dataset. After several experimental adjustments, we set it to 0.0005. The maximum number of training epochs was set to 1000. Training was stopped if the AUPRC on the validation set did not increase by more than 0.0001 for 20 consecutive epochs. For the random forest, the number of decision trees was 10. We randomly selected 8 of the 12 features as candidates each time, and we set the maximum number of iterations to 30.
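The stopping rule can be sketched as follows. The text does not say whether the improvement is measured against the previous epoch or against the best value so far; this sketch assumes the latter, and the function name is hypothetical.

```python
def stopping_epoch(auprc_per_epoch, min_delta=1e-4, patience=20, max_epochs=1000):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation AUPRC has failed to beat the best value
    so far by more than min_delta for `patience` consecutive epochs, or when
    the epoch limit (or the end of the supplied history) is reached.
    """
    best = float("-inf")
    stalled = 0
    history = auprc_per_epoch[:max_epochs]
    for epoch, value in enumerate(history, start=1):
        if value - best > min_delta:
            best = value      # meaningful improvement: reset the counter
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                return epoch
    return len(history)
```

For instance, if the validation AUPRC rises for 5 epochs and then stays flat, this rule would stop training at epoch 25, long before the limit of 1000 epochs.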

V. RESULTS
We used the scoring program provided on the CinC 2018 website [5] to test the performance of our system. The AUROC on the whole test set was 0.953980±0.033123, and the AUPRC was 0.557513±0.113802. In addition, we calculated several other commonly used performance indicators (sensitivity of 88.92%, specificity of 96.56% and precision of 87.6%), as well as the maximum, minimum and average AUROC and AUPRC scores for each series. The details are shown in Table 3 and Table 4. These tables show that the proposed method obtains a remarkable detection result for most patients. The results indicate that it can be a powerful tool for automatic arousal detection in clinical practice. Moreover, to further demonstrate the generalization of our model, we performed a 4-fold cross-validation experiment. We repartitioned the training, validation and test sets so that each model could be evaluated on its own test set. The results are shown in Table 5.
Finally, we compared the models and results of the CinC 2018 winners with ours, as shown in Table 6, which supports the superiority of our method.

VI. DISCUSSION
In the preprocessing part, we selected 8 physiological signals from the 13 PSG channels and sent them to the automatic recognizer. This choice is the result of experiments we conducted on different combinations of physiological signals, as shown in Table 7. We carried out one experiment for each combination. Among them, the sixth combination (3 channels of EEG, 3 channels of EMG, Airflow, and SaO2) achieved the best results, with an AUPRC of 0.552. The data of these 8 channels must be segmented before being sent to the automatic recognizer. If the data segment is too short, the network cannot extract some crucial features, and if it is too long, the classification accuracy is affected. We carried out several comparison experiments to find the optimal data segment length. The results are shown in Table 8. The system performs best when the data segment length is 4 s. We used CNNs in our model because a CNN can extract signal features automatically and avoids the trouble of manual feature extraction. We also trained multiple CNNs by group, with the number of CNNs equal to the number of available groups of training data.
Because of individual differences between subjects, the physiological signals change in different ways when an arousal occurs. The CinC 2018 dataset already contains 12 groups divided by population. To capture as much as possible of the crucial characteristics of the physiological data of each type of patient, we trained one CNN model on each data series, which reduced the difficulty of CNN training and avoided underfitting caused by the differences between groups. We compared this group training method with plain multi-CNN integration (training each CNN model on all the data). The results are shown in Table 9, and they demonstrate the superiority of group training. In the post-processing part, we used the moving average filter to smooth the output of the automatic recognizer and correct short-lived decision errors, and the resulting AUPRC is 0.03 higher than that obtained without this processing.

VII. CONCLUSION
In this study, we have proposed a method for identifying arousals by using multiple convolutional neural networks and a random forest. The AUPRC score we obtained under 4-fold cross-validation was 0.552, which is better than the best score of 0.54 in CinC 2018 and shows that our method is effective. However, our model is not accurate enough at detecting sleep arousals for a small number of patients, and the detection results vary between individuals. Therefore, we intend to generalize our method to other datasets, and to improve the universality of our model through feature selection and code optimization, so that it can become a powerful tool for clinical automatic sleep arousal detection.
YITIAN LIU received the B.S. degree in communication engineering from Northeastern University, China, in 2019. He is currently pursuing the M.S. degree in electronic information engineering with Nanjing University, China.
HONGXING LIU received the Ph.D. degree from Xi'an Jiao Tong University, China, in 1997. He is currently a Professor of information electronics with the School of Electronic Science and Engineering, Nanjing University, China, and a Master's Tutor for signal and information processing. He is also a member of the Biomedical Electronics Branch of the Chinese Institute of Electronics. In recent years, he has mainly carried out research in the field of intelligent information processing and biomedical electronics. He has focused on fetal electrocardiogram technology and achieved remarkable results. He has presided over three projects of the National Natural Science Foundation, one Social Development Project of the Jiangsu Science and Technology Support Program, and more than ten horizontal cooperation projects, and has published nearly 100 articles as first author or corresponding author, including more than 20 SCI-indexed articles. As first inventor, he has applied for 33 invention patents, of which 17 have been granted.
BUFANG YANG, photograph and biography not available at the time of publication.