Sleep Apnea Detection From Variational Mode Decomposed EEG Signal Using a Hybrid CNN-BiLSTM

Sleep apnea, a severe sleep disorder, is a clinically complicated disease that requires timely diagnosis for proper treatment. In this paper, an automated deep learning-based approach is proposed for the detection of sleep apnea frames from electroencephalogram (EEG) signals. Unlike conventional methods of direct feature extraction from EEG signals, the proposed method utilizes the variational mode decomposition (VMD) algorithm to decompose the EEG signals into a number of modes. Using such decomposed EEG signals for feature extraction offers efficient processing of the variations introduced in the frequency spectrum during apnea events, irrespective of the particular patient. Afterward, a fully convolutional neural network (FCNN) is proposed to extract the temporal features from each VMD mode separately and in parallel while maintaining their temporal dependencies. The FCNN block utilizes causal dilated convolutions with increasing dilation rates along with multi-kernel convolution operations. Subsequently, for further exploration of the inter-modal temporal variations, the features extracted from the different EEG modes are jointly optimized with a stack of bi-directional long short-term memory (BiLSTM) layers. Hence, the trained and optimized network is capable of generating predictions of apnea frames during the evaluation phase. Contrary to other studies, this study is carried out in a subject-independent manner where separate subjects are considered for training and testing. Additionally, a semi-supervised approach is explored where, to facilitate better classification performance on a subject's frames, a small portion of the patient's data is included in training to leverage insight regarding possible environmental variations. Extensive experiments on three publicly available datasets provide average accuracies of 93.22%, 93.25% and 89.41% in the subject-independent cross-validation scheme.


I. INTRODUCTION
Sleep apnea is one of the most prevalent and severe sleep disorders, causing restriction of airflow with repetitive interruptions of breathing during sleep. According to some studies, about 5-21% of the general adult population is diagnosed with sleep apnea [1], [2]. Apnea events can lead to substantial harmful physiological disorders which have the potential to develop into longstanding sequelae, for instance, hypertension and cardio-respiratory disorders including heart attacks [3], [4], impairment of neuropsychological competence [5], [6], depression [7] and early all-cause mortality [8]. An increased cyclic alternating pattern rate was observed among apnea patients in [9], leading to possible impairments in certain cognitive domains. In traditional polysomnography (PSG), an expert collects numerous physiological signals including the electroencephalogram (EEG) from the patient during sleep, and apneic episodes in the readings are later annotated manually. The shortage of sleep centers and sleep technologists [10], along with inter-technologist scoring variability and other human errors [11], [12], are some of the major obstacles to proper diagnosis of apnea. An automated apnea detection scheme can provide a powerful tool for circumventing these human errors and infrastructural shortcomings.

(The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott.)
Towards achieving this goal, various physiological signals, such as oxygen saturation, the variation of heart rate and respiratory effort along with varieties of bio-signals like ECG, EMG, and EOG have been used in numerous studies [13]- [15]. Due to the technical complications and high expenditures of acquiring multiple bio-signals along with the rapid development of wearable-wireless EEG acquisition systems [16], EEG based analysis has received special attention from researchers in recent times in numerous sleep-related problems [17]. In [18], the authors proposed Hermite coefficient based decomposition of EEG signal for apnea detection using artificial bee colony optimization. De-trended fluctuation analysis based study of non-linear behavior and power-law correlations of EEG signal is considered for apnea detection in [19].
According to [20]-[24], the spectral contents in various frequency bands of the EEG signal are significantly different during apneic episodes compared to other non-apnea periods of sleep. Hence, a number of studies have been carried out to extract effective features for apnea detection from decomposed EEG data. Depending on the vigilance state, the frequency spectrum of the EEG signal is traditionally divided into five frequency bands, namely delta (0.25-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), sigma (12-16 Hz) and beta. Since these frequency bands differ in frequency content, amplitude and activity level, such frequency divisions facilitate the feature extraction process in various applications [21], [25], [26]. The delta, theta and alpha bands correspond to deep sleep, mild sleep and the relaxed state, respectively, while the sigma and beta bands refer to alert states. In [26], it is shown that during apnea, the energy contents in various frequency bands change significantly with respect to non-apnea events; in particular, the higher-frequency bands exhibit a greater relative energy contribution than the lower-frequency bands. In [27], increased delta and beta activity is reported for obstructive sleep apnea patients with a high Apnea-Hypopnea Index (AHI) compared to a simple snoring group having a lower AHI. In [28], a relative decrease in delta band power is found during apnea events. Instead of frequency division, empirical mode decomposition has also been used to analyze the characteristics of the EEG signal during apnea events. In [22], it is shown that variational mode decomposition with adaptive center frequencies surpasses the conventional spectral division of EEG data with manual band-pass filtering, owing to its better capability of capturing the variations in EEG characteristics induced by apneic events.
All these approaches depend on complicated, hand-crafted statistical feature extraction processes, such as subframe-based entropy and log-variance extraction [21], [22], extraction of inter-band energy contribution transitions [20], and monitoring of different spectral band ratios [20]. Nevertheless, it is extremely difficult with these manual, hand-crafted schemes to capture the optimal features from the various complex temporal and inter-modal relationships of EEG data.
To the best of our knowledge, all other studies have been carried out in a subject-dependent manner where information from each subject is used for both training and testing, which severely limits the applicability of these schemes in real-life test scenarios with unknown subjects. In this paper, a subject-independent automated approach is proposed for detection of apnea frames, where a customized deep convolutional-BiLSTM neural network is used for efficient feature extraction from the EEG signal. Rather than using the raw EEG signal, the pre-processed signal is decomposed into a number of modes using the variational mode decomposition (VMD) algorithm, which allows adaptive variation of the mode center frequencies. Instead of extracting hand-crafted features from the VMD-decomposed EEG data, a deep fully convolutional network (FCNN) is proposed to independently extract the complex temporal features from each EEG mode. During this operation, causal relationships are maintained over large time windows using causal dilated convolutions. These FCNN modules project the temporal information of each EEG mode into a smaller subspace that contains more generalized variations of features relating to apnea episodes, utilizing multiple kernel convolutions for efficient pooling. Thereafter, a stack of bi-directional long short-term memory (BiLSTM) layers is introduced to process the outputs of all FCNNs together for exploring the inter-modal temporal feature variations across the different VMD modes. The whole process is schematically shown in Fig. 1. This approach automatically extracts the complicated temporal and inter-modal feature relationships of the EEG data that are expected to be affected during apnea events. Extensive experiments carried out in a subject-independent manner provide outstanding performance, demonstrating the robustness of the proposed scheme for real-world applications.

II. PROPOSED METHODOLOGY
The proposed scheme consists of a number of operations for precise recognition of apnea events. All of these are discussed in detail in the following subsections.

A. FRAME CREATION AND PROBLEM FORMULATION
The raw sampled and digitized EEG signals collected from patients are divided into a number of frames of uniform length to facilitate processing with a deep neural network. Frames of a predefined length are created from the continuous EEG data, and each frame's label is assigned according to the annotated apnea label at the middle of the respective frame. Subsequently, a predefined shift is applied between subsequent frames to introduce overlapping samples between frames. Such overlapping increases the number of available frames that can be used for training. Moreover, a larger frame length can be considered for exploring long-term dependencies of temporal features without a significant reduction in the number of training frames, by increasing the overlapping ratio. Alternatively, a shorter frame length can be used with a smaller overlapping ratio between subsequent frames.
Let us consider the set of extracted EEG frames as $D = \{(x_i, y_i)\,|\, i = 0, 1, 2, \ldots, N-1\}$, where $x_i$ denotes the $i$-th frame with corresponding label $y_i$. All the frames and labels are extracted from the raw signal $X$ and annotation vector $Y$, respectively, which can be represented by

$$x_i = X[is : is + l], \qquad y_i = Y\left[is + \lfloor l/2 \rfloor\right],$$

where $l$ denotes the length of each frame, $s$ is the frame shift, and $N$ is the total number of frames. Hence, a smaller shift $s$ between subsequent frames will increase the number of overlapping samples between frames. A neural network model is trained using these extracted frames and corresponding labels. Hence, a binary cross-entropy loss function $\mathcal{L}$ can be defined that is optimized in the training stage so that the network generates correct predictions:

$$\mathcal{L} = -\frac{1}{n}\sum_{i=0}^{n-1}\left[ y_i \log \tilde{y}_i + (1 - y_i)\log(1 - \tilde{y}_i) \right] + \lambda \lVert w \rVert_2^2,$$

where $n$ is the total number of samples in a batch, $w$ denotes the weight vector, $y_i$ and $\tilde{y}_i$ are the actual label and generated prediction, respectively, for the $i$-th input, and $\lambda$ is the regularization parameter adjusted for reducing overfitting. During the evaluation phase, the trained and optimized model is used to generate predictions about the occurrence of apnea on a frame-by-frame basis. In every case, the frame is labelled according to the annotation of the mid-sample of the EEG frame.
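The frame extraction and loss computation described above can be sketched as follows (a minimal NumPy sketch; the function names and parameter values are illustrative, not taken from the paper):

```python
import numpy as np

def make_frames(X, Y, l, s):
    """Slice a continuous EEG record X into overlapping frames of length l with
    shift s; each frame takes the annotation at its mid-sample as its label."""
    N = (len(X) - l) // s + 1
    frames = np.stack([X[i*s : i*s + l] for i in range(N)])
    labels = np.array([Y[i*s + l // 2] for i in range(N)])
    return frames, labels

def bce_loss(y_true, y_pred, w=None, lam=0.0, eps=1e-12):
    """Binary cross-entropy with an optional L2 weight penalty lam*||w||^2."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    if w is not None:
        loss += lam * np.sum(w ** 2)
    return loss
```

For example, one minute of 128 Hz EEG sliced into 10-second frames (l = 1280) with a 5-second shift (s = 640) yields 11 half-overlapping frames.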

B. VARIATIONAL MODE DECOMPOSITION (VMD)
As apnea events lead to complicated neurological activities, these may arouse the patient during sleep to balance out the abnormalities. Hence, random variations of spectral content can be introduced in the EEG signals of different apnea patients. The time-domain representation of the EEG signal does not provide an effective representation for detecting apnea events, restricting improvements in recognition accuracy. Neurological activity of the brain changes from non-apnea to apnea episodes, and this can cause significant variation in the various frequency bands of the EEG signal which is difficult to observe and capture in the raw time-domain signal. The dominant frequencies in these bands, driven by the underlying neural activities, can be expected to shift with time and from person to person. Such limitations can be overcome by using a dynamic band division with adaptive center frequencies, which can be achieved using the variational mode decomposition (VMD) algorithm [22]. Since VMD decomposes the EEG into modes whose center frequencies are adaptively determined, these modes can better follow the frequency shifts and offer an improved representation of the apnea-related neural variations. Hence, VMD facilitates the optimization of the deep neural network towards the optimal solution by extracting more effective features. Moreover, VMD adds only minimal computational overhead while gaining better performance compared to purely end-to-end deep learning approaches.
As proposed in [29], VMD decomposes an input signal $y(t)$ into a predefined number $N$ of principal modes $m_i(t)$, which is given by

$$y(t) = \sum_{i=1}^{N} m_i(t).$$

These modes are dynamically determined to minimize the sum of the bandwidths of all modes while reconstructing the input signal in the least-squares sense through the addition of modes. Hence, the constrained variational optimization problem becomes

$$\min_{\{m_i\},\{\omega_i\}} \sum_{i=1}^{N} \left\lVert \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * m_i(t) \right] e^{-j\omega_i t} \right\rVert_2^2 \quad \text{subject to} \quad \sum_{i=1}^{N} m_i(t) = y(t),$$

where $m_i$ and $\omega_i$ denote the $i$-th mode function and its corresponding center frequency, respectively.

However, various environmental recording conditions, along with varying neurophysiological characteristics at different sleep stages, may cause misleading amplitude contrast between frames of different subjects. These issues can be overcome by carrying out DC offset removal followed by amplitude normalization at the pre-processing stage. After that, the variational mode decomposition algorithm is applied to split each frame into a number of modes that are later processed using the proposed neural network.
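The decomposition can be illustrated with a bare-bones NumPy implementation of the VMD update equations (a simplified sketch that omits the signal mirroring and full ADMM machinery of [29]; the parameter values and initialization are illustrative assumptions):

```python
import numpy as np

def vmd(signal, K=2, alpha=2000.0, n_iter=200, tol=1e-7):
    """Simplified VMD: alternate the Wiener-filter mode update and the
    power-weighted center-frequency update in the Fourier domain."""
    T = len(signal)
    freqs = np.arange(T) / T - 0.5                 # normalized freqs in [-0.5, 0.5)
    f_hat = np.fft.fftshift(np.fft.fft(signal))
    f_hat_plus = f_hat.copy()
    f_hat_plus[: T // 2] = 0                       # keep the positive-frequency half
    u_hat = np.zeros((K, T), dtype=complex)
    omega = 0.5 * np.arange(K) / K                 # initial center frequencies
    for _ in range(n_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            # residual spectrum excluding mode k, filtered around omega[k]
            residual = f_hat_plus - u_hat.sum(axis=0) + u_hat[k]
            u_hat[k] = residual / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # center frequency = power-weighted mean of positive frequencies
            power = np.abs(u_hat[k, T // 2:]) ** 2
            omega[k] = np.sum(freqs[T // 2:] * power) / (np.sum(power) + 1e-12)
        diff = np.sum(np.abs(u_hat - u_prev) ** 2) / (np.sum(np.abs(u_prev) ** 2) + 1e-12)
        if diff < tol:
            break
    # real modes recovered from the positive-frequency half-spectrum
    modes = 2 * np.real(np.fft.ifft(np.fft.ifftshift(u_hat, axes=-1), axis=-1))
    return modes, omega
```

Applied to a 4-second frame sampled at 128 Hz containing 2 Hz and 24 Hz components, the recovered center frequencies (scaled by the sampling rate) settle near 2 Hz and 24 Hz, illustrating how the adaptive center frequencies track the actual spectral content.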
C. PROPOSED DEEP NEURAL NETWORK

As shown in Fig. 1, the proposed deep neural network architecture consists of three separate sub-networks: a fully convolutional neural network, a bi-directional LSTM network, and densely connected layers. Firstly, each mode of an EEG frame obtained from the VMD is processed separately using a fully convolutional neural network (FCNN). A number of such FCNNs are used in parallel to extract the temporal feature variations from each mode separately. These operations transform the feature space of each mode into a comparatively low temporal dimension while maintaining the causal temporal relationships among the extracted features. This process facilitates the extraction of the general trends of unimodal feature variations that are expected to occur during apnea events, irrespective of the patient. Although dividing the raw EEG frames into a number of modes with adaptive center frequencies facilitates the temporal feature extraction of apnea events, significant inter-modal temporal relationships exist among these modes. Hence, the output feature maps obtained from the FCNN modules operating on the different modes of the EEG frame are processed together. A multi-layer bi-directional long short-term memory network (BiLSTM) is fed with these temporal feature maps obtained from the different FCNNs. This module, operating on the multi-modal features of the EEG frame, extracts high-dimensional inter-modal temporal features that are expected to vary during apnea events. Eventually, a combined feature vector is generated from the BiLSTM module that contains a generic feature representation of all the modes of the corresponding EEG frame. This feature vector is processed with a series of densely connected layers that extract the general relationships among the extracted features to converge towards the final prediction of an apnea event. A detailed architectural analysis of all these sub-networks is provided in the following discussion.

1) PROPOSED FULLY CONVOLUTIONAL NEURAL NETWORK (FCNN)
All the operations in the proposed FCNN can be divided into two blocks in general: the causal dilation block and the multi-kernel block (Fig. 2). At any instant of the EEG signal, all the previous history of the signal should be taken into consideration to distinguish the neurological activity pattern that instigates an apnea event. Furthermore, for proper recognition of an apnea event, the variations of the EEG signal should be analyzed over a longer temporal window. As short-term fluctuations at various stages of sleep may lead to improper recognition of apnea events, EEG signals should be analyzed over different ranges of observation. Both of these objectives are incorporated in the 'Causal Dilation Block' (CDB) utilizing a series of dilated causal convolutions. As a causal convolution operation at any time instant takes into consideration all the previous history [30], it achieves the first objective. However, a very deep architecture is needed to achieve a large receptive field using causal convolutions alone. Hence, by utilizing dilated convolutions, the receptive field can be increased by a large margin (shown in Fig. 3). Causal dilated convolutions are widely used in numerous speech-related applications, such as speech generation and denoising, where the context is to be extracted from a large window [31], [32]. If the input signal corresponds to $x \in \mathbb{R}^n$ and the filter of length $k$ is represented by $f : \{0, 1, \ldots, k-1\} \to \mathbb{R}$, the causal dilated convolution operation $F$ on any element $s$ is given by

$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i)\, x_{s - d \cdot i},$$

where $*_d$ denotes the causal dilated convolution with dilation rate $d$, and the index $(s - d \cdot i)$ restricts the convolution to previous time stamps. Hence, the effective receptive field added by one such layer is $(k-1)d$. The dilation rate is varied exponentially, with $d = O(2^i)$ at the $i$-th block.
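The convolution above can be sketched directly (a minimal NumPy sketch with zero padding for negative indices; the function name is illustrative):

```python
import numpy as np

def causal_dilated_conv(x, f, d):
    """y[s] = sum_i f[i] * x[s - d*i], using zeros where s - d*i < 0,
    so each output depends only on current and past samples."""
    y = np.zeros(len(x), dtype=float)
    for s in range(len(x)):
        for i, fi in enumerate(f):
            t = s - d * i
            if t >= 0:
                y[s] += fi * x[t]
    return y
```

Stacking such layers with dilation rates 1, 2, 4, ... grows the receptive field exponentially with depth, which is why a shallow stack can still cover a long EEG window.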
In each such block, two causal dilated convolutions are applied in series, each followed by a parametric rectified linear unit (PReLU) [33] as the nonlinear activation function and a normalization operation, where

$$\mathrm{PReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0. \end{cases}$$

Here, $\alpha$ is the slope used for mapping negative input values. In this study, $\alpha = 0.2$ is chosen for its faster convergence.
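The activation is a one-liner in NumPy (an illustrative sketch with the paper's α = 0.2 as the default):

```python
import numpy as np

def prelu(x, alpha=0.2):
    """Parametric ReLU: identity for positive inputs, slope alpha for negative ones."""
    return np.where(x > 0, x, alpha * x)
```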
If $f_i$ is the output feature map obtained from the $i$-th CDB block, a residual output $R_i$ is also generated and fed to the following CDB block, which can be represented by

$$R_i = R_{i-1} + f_i, \qquad i = 1, 2, \ldots, k,$$

where $R_0 = x_i$ is the input signal and $k$ is the total number of causal dilation blocks. Afterward, a combined output $F$ is generated by combining the features extracted by each of the CDB unit blocks, which can be expressed as

$$F = \sum_{i=1}^{k} f_i.$$

After processing with the causal dilation blocks, the input data frame of the corresponding EEG mode is transformed into a resultant feature map containing numerous extracted features with diverse temporal relationships. However, the temporal dimension of this feature map is the same as that of the input data frame, and it should be reduced further to extract more general unimodal temporal features. Thus, the multi-kernel block (MKB) is proposed to reduce the temporal dimension of the feature map while performing convolution operations with multiple kernels in parallel. Here, an average pooling operation is carried out alongside the parallel convolutions with different kernels, which helps to incorporate diverse temporal contexts into the pooling operation. Subsequently, all these pooled and convolved feature maps are merged with another convolution operation that extracts the general variations. Hence, the output from each of the MKB units can be generalized as

$$O_j = H\big(\big[P(h_1(O_{j-1}; \theta_1)),\, P(h_2(O_{j-1}; \theta_2)),\, P(h_3(O_{j-1}; \theta_3))\big];\, \theta_H\big), \qquad j = 1, 2, \ldots, m,$$

where $O_0 = F$; $h_1$, $h_2$, $h_3$ represent convolutions with temporal kernels of size 1, 3 and 5, respectively; $P$ denotes average pooling; $H$ represents the combined convolution; $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_H$ are their respective parameters; and $m$ is the total number of such MKB units. After passing through the series of multi-kernel blocks, the transformed feature map incorporates increasingly generalized unimodal features with reduced temporal dimensions. Finally, an output feature map is obtained from the final multi-kernel block that contains the generalized temporal representation of a particular EEG mode.
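The multi-kernel block's data flow can be sketched on a single channel (a NumPy sketch; fixed averaging kernels and a mean-combination stand in for the learned convolutions $h_1$, $h_2$, $h_3$ and $H$, which are assumptions for illustration):

```python
import numpy as np

def conv1d_same(x, k):
    """'same'-padded 1-D convolution with an averaging kernel of width k
    (a stand-in for a learned kernel)."""
    kernel = np.ones(k) / k
    return np.convolve(x, kernel, mode="same")

def avg_pool(x, stride=2):
    """Average pooling that halves the temporal dimension."""
    n = len(x) // stride * stride
    return x[:n].reshape(-1, stride).mean(axis=1)

def multi_kernel_block(x):
    """Parallel convolutions with kernel sizes 1, 3, 5, each average-pooled,
    then merged (here by averaging, standing in for the combining convolution H)."""
    branches = [avg_pool(conv1d_same(x, k)) for k in (1, 3, 5)]
    return np.mean(branches, axis=0)
```

Each pass halves the temporal length while mixing contexts of three different widths, mirroring how stacked MKBs compress the feature map.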

2) PROPOSED BIDIRECTIONAL LONG SHORT-TERM MEMORY NETWORK (BILSTM)
The extracted generalized sequential representations of the modes of the EEG frame obtained from the FCNN module are processed together with the bidirectional LSTM network (shown in Fig. 4). LSTM units [34] are proven to capture long-term temporal dependencies and are used in numerous sequence-processing applications. Generally, processing in traditional LSTM units depends on the output of previous units, and information flows in one direction. An improved variant is the bi-directional LSTM, where two layers of LSTM memory cells process the sequence simultaneously in opposing directions [35]. In many applications, such operation provides more temporal context over a longer time frame.
In the proposed BiLSTM module, two such bidirectional LSTM layers are stacked together. The generalized variational features of the different EEG modes undergo further processing with these LSTM layers, which extract effective higher-level features considering long-term inter-modal temporal dependencies. Hence, both the first forward-pass and backward-pass layers process the feature sequence $\{x_1, x_2, \ldots, x_N\}$ produced by the FCNN blocks simultaneously. Each basic LSTM unit cell is comprised of three special data-manipulating structures known as the input, forget and output gates, as outlined in Fig. 4b. Here, $\sigma$ and $\tanh$ denote the logistic sigmoid and hyperbolic tangent activation functions, respectively. The relations used for calculating the hidden state $\overrightarrow{h}_t$ of a forward layer are as follows:

$$i_t = \sigma(W_i x_t + U_i \overrightarrow{h}_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f \overrightarrow{h}_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o \overrightarrow{h}_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c \overrightarrow{h}_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$\overrightarrow{h}_t = o_t \odot \tanh(c_t)$$

In our stacked model, the hidden states of the first forward and backward layers are concatenated and passed to both the secondary forward and backward LSTM layers for further processing of inter-modal temporal relations. These secondary hidden layers further explore the feature space to produce a sequence of feature vectors for the forward and backward pass layers, respectively. Finally, after concatenating the corresponding feature vectors obtained from each secondary forward and backward pass LSTM unit, all of these feature vectors are added together to produce the final output feature vector. Therefore, the stack of bidirectional LSTM layers converges all the temporal inter-modal features of the different modes of the respective EEG frame into a resultant feature vector that contains the global temporal representation of that particular frame.
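The gate relations above can be sketched as a single-direction pass plus a bidirectional concatenation (a minimal NumPy sketch with the four gates stacked in one weight matrix; the parameter layout is an illustrative assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(xs, W, U, b, reverse=False):
    """One-direction LSTM over xs of shape (T, d_in). Gates are stacked as
    [input, forget, output, candidate] in W (4h, d_in), U (4h, h), b (4h,)."""
    h_dim = U.shape[1]
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    out = np.zeros((len(xs), h_dim))
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    for t in order:
        z = W @ xs[t] + U @ h + b
        i, f, o = (sigmoid(z[k * h_dim:(k + 1) * h_dim]) for k in range(3))
        g = np.tanh(z[3 * h_dim:])
        c = f * c + i * g          # cell state update
        h = o * np.tanh(c)         # hidden state
        out[t] = h
    return out

def bilstm(xs, params_fwd, params_bwd):
    """Concatenate forward and backward hidden states at every time step."""
    return np.concatenate([lstm_pass(xs, *params_fwd),
                           lstm_pass(xs, *params_bwd, reverse=True)], axis=1)
```

Because the hidden state is bounded by the sigmoid and tanh nonlinearities, every output component stays in (-1, 1) regardless of the input scale.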

3) DENSELY CONNECTED LAYERS AND CLASSIFIERS
The resultant temporal feature vector needs to be mapped to the final prediction of an apnea incident. A series of densely connected layers is used to exploit the global relationships across all the extracted temporal features, as shown in Fig. 4c. This can be given by

$$d_i = \sigma(W_i d_{i-1} + b_i),$$

where $d_i$ represents the output of the $i$-th densely connected layer with weight matrix $W_i$ and bias vector $b_i$, $\sigma$ represents the activation function, and $d_0$ is the input feature vector. In total, three densely connected layers are stacked in series for converging the output feature vector of the stacked LSTM units toward the final prediction. Finally, the output $z$ obtained from the final densely connected layer with a single node is mapped into the final prediction using the sigmoid activation function, which is given by

$$\tilde{y} = \frac{1}{1 + e^{-z}}.$$
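The classifier head reduces to a few matrix-vector products (a NumPy sketch; the PReLU hidden activation with α = 0.2 matches the rest of the network, while the layer sizes are illustrative):

```python
import numpy as np

def dense_stack(x, layers):
    """Apply a series of densely connected layers; layers is a list of (W, b).
    PReLU on hidden layers, sigmoid on the single-node output."""
    for W, b in layers[:-1]:
        z = W @ x + b
        x = np.maximum(z, 0.2 * z)        # PReLU, alpha = 0.2
    W, b = layers[-1]
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid -> apnea probability
```

With all-zero weights the head outputs exactly 0.5, the decision boundary of the sigmoid, which is a convenient sanity check.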

III. RESULTS
In this section, results obtained from extensive experimentation on various publicly available databases will be discussed and analyzed from diverse perspectives.

A. DATABASE
The proposed method was applied on three public datasets to validate the applicability and robustness of the proposed scheme in the subject-independent scenario.

1) DATABASE [36]
This is a large publicly available database used in a number of other studies. It contains overnight polysomnogram recordings of 4 female and 21 male patients with various Apnea-Hypopnea Index (AHI) values. The signals were obtained using the Jaeger-Toennies system (Erich Jaeger GmbH, Germany), with EEG recorded at 128 Hz. The start and end of all apnea events were manually annotated in the polysomnogram recordings by sleep specialists according to standard scoring rules [37]. Although the dataset contains various common physiological signals for every patient, including EOG, EMG, and EEG, we focus solely on EEG for its ease of collection without disturbing the patient during sleep, and decompose the signal after DC offset removal and normalization of the data.

2) DATABASE [38]
In line with the objective of this paper, only the EEG signals were extracted from this database and pre-processed by removing the DC offset and normalizing. In this database, the collected EEG signals were sampled at a rate of 200 Hz.

3) MIT-BIH POLYSOMNOGRAPHIC DATABASE [39]
This database is a collection of recordings of multiple physiologic signals during sleep. Subjects were monitored in Boston's Beth Israel Hospital Sleep Laboratory for evaluation of chronic obstructive sleep apnea syndrome. It contains over 80 hours' worth of polysomnographic recordings, each with an ECG signal annotated beat-by-beat, and EEG and respiration signals annotated with respect to sleep stages and apnea.
For our study, only the EEG signals were extracted from the dataset which had been digitized at a sampling rate of 250 Hz.
In this study, all annotated apnea-hypopnea events are considered as apnea. The extensive size and the wide range of AHI observed among the patients in all of the above-mentioned datasets provide adequate opportunity for extensive experimentation. A subject-independent k-fold cross-validation scheme is employed for evaluating the performance measures. In this scheme, the total set of patients is divided into k subfolds. Hence, in a single stage, all the patients in (k − 1) subfolds are used for training, while the patients in the remaining subfold are used for evaluation with the model optimized in that stage. This process is repeated k times such that each patient appears in exactly one of the test folds, and finally the performance measures are averaged.
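The key point of the subject-independent protocol is that patients, not frames, are partitioned, so no subject contributes data to both training and testing. A minimal sketch (the function name and the round-robin split are illustrative assumptions):

```python
def subject_independent_folds(patient_ids, k):
    """Partition patients (not frames) into k folds; each fold serves once as
    the held-out test set while the remaining k-1 folds form the training set."""
    folds = [patient_ids[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        splits.append((train, test))
    return splits
```

With k equal to the number of patients, this reduces to the leave-one-out scheme reported as the best-performing configuration.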
Since a number of empirical parameters are used in the proposed scheme, all the experiments are carried out in a systematic way to determine the optimum value of each parameter. The raw EEG signals are divided into a number of frames of uniform length with overlap between subsequent frames. With an increase in frame length, the EEG frame contains information over a larger interval, which facilitates the apnea detection process, although it also increases computational complexity. Taking both issues into consideration, and depending on the available ground truth, the proposed method employs a relatively large frame length of 10 seconds for databases [36], [38] and 30 seconds for database [39].
In the subject-independent cross-validation scheme, frames from disjoint sets of patients are assigned to the training and evaluation folds. Results obtained from cross-validation schemes with varying numbers of folds are summarized in Table 1. With an increasing number of folds, more data are available in each training stage, which yields better performance. It should be noted that the best performance is achieved with the leave-one-out cross-validation scheme. However, considerable performance is achieved even with a 3-fold cross-validation scheme, which uses a significantly smaller amount of data in each training stage compared to the best-performing leave-one-out scheme. This proves the robustness of the scheme, which can learn the representations of apnea events from a significantly smaller amount of training data. Each frame of EEG data is decomposed into a number of mode functions that are processed in parallel, and the features extracted from each mode are jointly optimized later. In Table 2, the effect of the traditional bandpass-filtering decomposition of the EEG data is compared with the VMD, along with no decomposition of the raw data. The results reported in the first row of Table 2 correspond to the proposed end-to-end DL architecture without the VMD, denoted as Proposed Method (only end-to-end). The results reported in the second and third rows of Table 2 correspond to the combination of the proposed end-to-end DL network with the bandpass-filtering decomposition and with the VMD, denoted as Proposed Method (end-to-end with bandpass) and Proposed Method (end-to-end with VMD), respectively. It should be noted that the adaptive VMD provides higher average accuracy than both the traditional band-pass filtering scheme and no decomposition when integrated with the proposed scheme.
Such improvements signify the effectiveness of using VMD in combination with traditional end-to-end deep learning approaches. As the VMD adaptively adjusts the mode center frequencies, it provides more opportunity to extract the general feature variations introduced in the EEG signal by apnea events in the patient-independent scenario. In Fig. 5, the effect of the number of variational mode functions (VMFs) used in the VMD algorithm is shown. It is clearly visible that with an increasing number of mode functions the proposed scheme performs better, and the best performance is attained with 5 modes. For higher numbers of modes, performance gradually decreases as frequent variations are introduced in the mode frequencies, which makes adaptation difficult in the patient-independent scenario.

TABLE 3. Subject-specific cross-validation performance obtained using proposed method (database [36]).
Performance of the proposed scheme on different subjects in the subject-independent leave-one-out cross-validation scheme for databases [36], [38] and [39] is provided in Tables 3, 4 and 5, respectively. After close observation of all the performance metrics obtained for different subjects, it is to be noted that the proposed scheme provides consistent performance irrespective of the subject. This further proves the robustness of the method, which performs well even for subjects unseen in the evaluation phase. Moreover, all the evaluation metrics show comparable values, which represents the balanced performance of this scheme on both apnea and non-apnea frames. A stack of bidirectional LSTM layers is introduced for further processing of the temporal features obtained from the FCNN modules operating on the different modes. In Table 6, the effect of different numbers of nodes in the LSTM units of each bidirectional LSTM layer is investigated and summarized. The model provides optimum performance in most cases with 128 and 256 nodes in the LSTM units of the first and second Bi-LSTM layers, respectively. With more nodes, the LSTM units employ more memory cells that can extract longer temporal dependencies of the inter-modal features. However, with a large number of nodes, the network also becomes more intricate, which makes convergence difficult due to vanishing gradients and overfitting. After complete training and optimization, apnea frames are predicted for each unknown subject. The conditions affecting the quality of the collected EEG signal may be quite different for each sleep center and the staff administering the process, introducing difficulty in apnea prediction. Furthermore, each subject naturally exhibits unique biological and physiological characteristics, leading to variations in the phenomena experienced during sleep apnea events. These changes may not be properly reflected in networks trained on other patients.
Incorporating such patient-specific information into the scheme can help the network adapt and tailor its response to better accommodate each subject. Though in a clinical scenario this would require manual annotation of a segment of the data, it can still reduce the burden on professionals and speed up the diagnosis process significantly. To explore this effect, a semi-supervised approach is explored where 30% or 50% of the data of the subject under consideration is included in the training phase. As illustrated in Table 7, the insight gained from the new data boosts the prediction capability of the network. Utilizing a larger portion of a newly recruited subject's data is not feasible, as it would increase the manual annotation burden. In Table 8, the proposed scheme is compared with other state-of-the-art single-channel EEG based methods for automated apnea detection. While considering the comparisons, it must be noted that most of these approaches reported performance in a subject-dependent manner, keeping data from the same subject in the training and testing sets. These approaches are generally based on hand-crafted feature extraction combined with traditional shallow classifiers, which makes the apnea detection task very difficult in a subject-independent scenario. As these features undergo significant variations with random perturbations of the high-frequency EEG data due to noise and other artifacts, such hand-crafted features are not sufficient to capture the general feature variations introduced by apnea events. Bhattacharjee et al. used traditional band division of EEG data [21] as well as VMD [22] with Rician modeling of entropy and log-variance features, and demonstrated the advantage of applying VMD to the EEG signal for apnea detection. However, it is to be noted that such hand-crafted feature extractors are not sufficient to provide considerable performance. Zhou et al. [19] introduced a unique method to predict sleep apnea by analysing detrended fluctuations for feature extraction.
This method, however, suffered from similar limitations. In our previous work [40], an end-to-end fully convolutional network incorporating residual units was proposed to automate the feature extraction process using full-band EEG data. This end-to-end network provides better performance than [19], [41] and [22], thereby demonstrating the advantage of deep learning in the prediction of sleep apnea. However, it lacks architectural variations in the CNN module for processing the long-term temporal variations of apnea events in EEG data, and it carries the additional complexity of operating on raw EEG data. Owing to the more sophisticated architectural blocks along with the Bi-LSTM modules, the proposed end-to-end DL network outperforms the previous approaches even without using VMD. Furthermore, the proposed method (end-to-end DL with VMD) clearly provides the best performance in all of the evaluation metrics compared to the other approaches reported in the table. In particular, the sensitivity is improved by more than 2%, which is significant for any disease diagnosis scheme.

IV. DISCUSSION
In this study, an automated sleep apnea frame detection scheme is proposed using a deep fully convolutional-BiLSTM neural network for the subject-independent test scenario. It significantly outperforms the existing apneic frame detection approaches in all performance metrics using only a single-channel EEG signal. Consistently high accuracy and F1 scores are obtained for all three databases, which have different properties such as sampling rate, data collection procedure and scoring standard. Two major reasons for this highly satisfactory performance are (1) utilizing the VMD-operated input data and (2) employing the proposed end-to-end deep learning network. Instead of directly using the raw EEG data, the variational mode decomposed EEG data are used as input. The advantages of using variational mode decomposed EEG signals in an end-to-end deep learning network are summarized as follows:
• improved performance with minimal computational overhead,
• better representation of the apnea events for gaining better generalization with the deep neural network, and
• improved explainability of the achieved performance.
According to Table 8, the use of VMD in combination with the proposed end-to-end deep learning network offers relatively better performance compared to the other methods. Such improvement signifies the effectiveness of using VMD alongside traditional end-to-end deep learning approaches, while VMD adds only minimal computational overhead in exchange for the performance gain. Generally, neural activity varies between non-apnea and apnea periods, which is reflected in different EEG frequency bands.
These changes are more distinct in the variational mode decomposed signals: since the center frequencies of the VMD modes are calculated adaptively, the decomposition captures the center-frequency shifts caused by varying neurological activity and reduces the burden on the proposed end-to-end deep neural network to learn these complex relationships on its own. Hence, VMD facilitates the optimization of the deep neural network, helping it converge to the optimal solution by extracting more effective features. Furthermore, VMD improves the interpretability of the achieved performance by exposing the contributions of the different frequency modes to the overall recognition performance. It can be observed from the results that the inter-modal information learned from the decomposed EEG modes generated through VMD helped improve the classification performance. The statistical significance of this improvement is confirmed by a paired t-test, which yields a p-value of 0.0329.
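The adaptive center-frequency behaviour described above can be illustrated with a minimal NumPy sketch of the VMD update loop: Wiener-filter mode updates alternated with center-frequency updates on the one-sided spectrum. This is a simplified sketch, not the full algorithm as used in the paper; the penalty `alpha`, the mode count `K`, the test tones, and the omission of the Lagrangian dual-ascent step are all illustrative assumptions.

```python
import numpy as np

def vmd(x, K=2, alpha=2000.0, n_iter=100):
    """Minimal VMD sketch: each mode is a Wiener-filtered copy of the
    residual spectrum around its center frequency, and each center
    frequency is updated to the power centroid of its mode."""
    T = len(x)
    f_hat = np.fft.fft(x)
    freqs = np.fft.fftfreq(T)
    half = freqs >= 0                          # non-negative half-spectrum
    f_plus = np.where(half, f_hat, 0.0)
    u_hat = np.zeros((K, T), dtype=complex)
    omega = np.linspace(0.05, 0.45, K)         # initial center frequencies
    for _ in range(n_iter):
        for k in range(K):
            resid = f_plus - u_hat.sum(axis=0) + u_hat[k]
            u_hat[k] = np.where(
                half, resid / (1 + 2 * alpha * (freqs - omega[k]) ** 2), 0.0)
            p = np.abs(u_hat[k]) ** 2
            omega[k] = (freqs * p).sum() / (p.sum() + 1e-16)
    modes = 2 * np.real(np.fft.ifft(u_hat, axis=1))  # back to time domain
    return modes, omega

# Two tones at normalized frequencies 0.05 and 0.20 stand in for EEG rhythms.
t = np.arange(1000)
x = np.sin(2 * np.pi * 0.05 * t) + 0.8 * np.sin(2 * np.pi * 0.20 * t)
modes, omega = vmd(x, K=2)
```

Even from poor initial guesses, the center frequencies migrate to the dominant spectral peaks, which is the adaptive behaviour that lets VMD track apnea-related center-frequency shifts across subjects.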
The use of dilated causal convolutions together with the LSTM modules helps the proposed end-to-end deep learning network extract the general pattern of EEG variations for apnea events. As explained earlier, the stack of bidirectional LSTM layers offers a global temporal representation of a particular frame by utilizing the temporal inter-modal features of the different modes of the respective EEG frame.

Further improvement of the proposed apnea detection scheme could be achieved by incorporating effective learning from other relevant applications through a non-conventional learning mechanism called meta-learning, or through similar transfer learning based schemes. The basic idea of meta-learning is to effectively transfer knowledge from several similar tasks, extracting a general representation so that a new task can be learned with fewer training samples. In [42], this mechanism was applied to the sleep stage classification task by adopting a transfer learning based technique that utilizes sleep staging knowledge acquired from a large dataset to classify frames of unseen new subjects. As such knowledge-transfer schemes from other relevant applications are beyond the scope of this study, they are left as potential future work.

It is to be noted that a major limitation inherent to most EEG-based sleep apnea detection methods is their dependency on the availability of a large amount of data from any new subject. In a clinical scenario, when diagnosing a new subject, pre-existing annotated frames of such large volume will not be available, rendering those methods infeasible and impractical for real-life applications. Only a patient-independent scheme, in which a network is trained on data of already diagnosed patients, can be employed in a real-life situation. In addition, as can be seen from the semi-supervised experiments in this paper and as evidenced by the results obtained in [43], patient-dependent approaches can produce a misleading overestimation of the prediction accuracy.
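For concreteness, the dilated causal convolution mentioned at the start of this discussion can be sketched in NumPy as follows. The kernel values and dilation rate are illustrative; the point is that left zero-padding keeps the operation causal, so each output sample depends only on present and past input samples.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1-D convolution: y[t] = sum_i kernel[i] * x[t - i*dilation].
    Left zero-padding ensures no future samples are used."""
    k = len(kernel)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(kernel[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.arange(10, dtype=float)
y = causal_dilated_conv1d(x, kernel=[0.0, 1.0], dilation=2)  # pure delay of 2 samples
```

Stacking such layers with increasing dilation rates (1, 2, 4, ...) grows the receptive field exponentially with depth, which is what allows the FCNN branches to cover the long apnea-related temporal context at a modest parameter cost.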
The proposed subject-independent method does not suffer from these drawbacks and is therefore suitable for practical deployment.

V. LIMITATIONS OF THE STUDY
While the results on the individual datasets show promising performance, some limitations need to be acknowledged. Training the proposed model requires a significant number of normal and apneic EEG frames from various patients. Reducing the number of frames in the training set results in a slight decline in performance, as can be seen by comparing the subject-independent cross-validation results of dataset [39] and dataset [36]: since [36] contains more patients than [39], the model performs better on [36]. However, this may no longer be a problem given the easy access to different public datasets. Another minor issue is the absence of transfer learning from other relevant applications in the proposed model, as our goal was to analyze the advantage of using variational mode decomposed EEG data in an end-to-end deep learning network while restricting the bio-signals to only one type (the EEG signal). Lastly, the current study only considers the binary classification between sleep apnea frames and normal EEG frames without detecting the type of apnea. We leave the detection of multiclass apnea frames, as well as the addition of meta-learning or other transfer learning based approaches, for future studies.

VI. CONCLUSION
In this paper, an automated sleep apnea frame detection scheme is proposed using a deep fully convolutional-BiLSTM neural network for the subject-independent test scenario. For efficient processing of raw EEG data, variational mode decomposition is adopted, introducing adaptive variations of the mode center frequencies. It is shown that such decomposition contributes considerably towards extracting subject-independent feature variations for apnea events, resulting in substantial performance improvement compared to raw EEG data processing. Furthermore, VMD facilitates the interpretability of the achieved performance by exposing the contributions of the different frequency modes to the overall recognition performance. To automate the feature extraction process as well as to analyze the long-term temporal variations in the EEG data, the proposed network effectively utilizes dilated causal convolutions with long short-term memory modules, which are shown to extract the general pattern of EEG variations for apneic episodes irrespective of the subject. Through experimentation, the individual parameters of the proposed method are varied and optimal values providing satisfactory prediction performance are found. Consistent performance is achieved for all individual subjects in the subject-independent cross-validation scheme. Moreover, to further improve the prediction on unknown EEG frames, a semi-supervised algorithm is introduced that increases the apnea detection performance by leveraging insight into the individual subject's conditions. Despite using separate subjects in the training and evaluation phases, significantly higher performance is achieved in all the evaluation metrics, which ensures the applicability and robustness of the proposed method for practical sleep apnea prediction.