Simplifying Multimodal With Single EOG Modality for Automatic Sleep Staging

Polysomnography (PSG) recordings, which contain signals from multiple modalities (e.g., EEG and EOG), have been widely used for sleep staging in clinics. Recently, many studies have combined the EEG and EOG modalities for sleep staging, since they are, respectively, the most and the second most powerful modalities for sleep staging among PSG recordings. However, EEG is complex to collect and sensitive to environmental noise and other body activities, impeding its use in clinical practice. Comparatively, EOG is much easier to obtain. To make full use of the predictive power of EEG and the easy collection of EOG, we propose a novel framework that simplifies multimodal sleep staging to a single EOG modality, so that it still performs well with only EOG when EEG is absent. Specifically, we first model the correlation between EEG and EOG, and then, based on this correlation, generate multimodal features with time- and frequency-guided generators by adopting the idea of generative adversarial learning. We collected a real-world sleep dataset containing 67 recordings and used four other public datasets for evaluation. Compared with existing sleep staging methods, our framework performs the best when solely using the EOG modality. Moreover, under our framework, EOG provides performance comparable to EEG.


I. INTRODUCTION
SLEEP quality is vital for everyone's wellbeing, since an individual spends almost one-third of her life either sleeping or trying to do so [1], [2]. Sleep staging, which categorizes sleep into different stages such as Wake, REM (Rapid Eye Movement), and non-REM sleep, is important for both monitoring sleep quality and diagnosing sleep disorders [3]. In clinical practice, polysomnography (PSG) has been widely used for sleep staging, recording various physiological signals of the human body, such as EEG (electroencephalogram), EOG (electrooculogram), EMG (electromyogram), and ECG (electrocardiogram). PSG recordings are stored as consecutive epochs, each of which lasts 30 seconds. Traditionally, experts categorize each epoch into one of five stages, namely W, N1, N2, N3, and REM, following the sleep staging standards established by the American Academy of Sleep Medicine (AASM) [4]. It usually takes an expert several hours to score the overnight PSG recordings of one person, so this manual process is time-consuming. Meanwhile, the staging results are relatively subjective, since manual staging depends heavily on the expert's experience.
With the rapid advancement of deep learning techniques, there is growing interest in the development of automatic sleep staging methods using PSG recordings [5], [6], [7], [8]. Given that PSG contains various types of signals, many studies have tried different single modalities for sleep staging; the most commonly used are EEG, EOG, EMG, and ECG. For instance, several studies [10], [11], [12] have solely used EEG for sleep staging and achieved good performance across multiple publicly available datasets. Instead of employing EEG, Eognet [13] showed that a single EOG modality can also effectively discriminate different sleep stages, although the predictive ability of EOG is not as strong as that of EEG. As for EMG, Andreotti et al. [14] demonstrated that using only the EMG modality is not feasible for sleep staging, and its predictive ability is much weaker than that of EEG and EOG. Similarly, solely using other types of signals in PSG recordings (e.g., ECG [15]) does not work for the sleep staging task. Taken together, the EEG modality stands out as the most powerful for sleep staging among all PSG signals, and EOG is the second most powerful.
However, the collection of EEG signals is quite complex and expensive. Typically, subjects must go through several unavoidable preparations, such as preparing the scalp, wearing a cap with dozens of electrodes, and injecting conductive gel. Moreover, EEG signals are subtle and very sensitive to environmental noise and to disturbance from other body activities (e.g., eye movement, leg movement). It is hard to guarantee high-quality EEG signals, especially for a person lying down for approximately 8 hours of sleep. These limitations severely restrict the usability of EEG in real-world sleep-related applications. Comparatively, EOG, another important type of cue for sleep staging, is relatively easy to collect by simply placing sensors near the eyes during sleep. Meanwhile, EOG signals are less sensitive to the environment and other body activities. Given the predictive power of EEG in sleep staging and the easy collection of EOG, it is necessary to figure out how to use only the EOG modality for sleep staging while still taking advantage of EEG information.
To make full use of the advantages of EEG and EOG for sleep staging, some studies have employed both modalities [16], [17], [18], [19]. As expected, combining EEG and EOG improves sleep staging performance compared with using EEG or EOG alone [14]. The success of these multimodal studies indicates a potential correlation between the EEG and EOG modalities. It motivates us to first learn a multimodal representation of EEG and EOG and capture their correlation. Then, based on this correlation, we generate a multimodal representation from the single EOG modality for sleep staging. As shown in Fig. 1, we can thus simplify multimodal sleep staging to a single EOG modality and classify sleep stages when EEG is not available, exploiting both the easy collection of EOG and the predictive power of EEG while avoiding the complex collection of EEG.
However, simplifying multimodal sleep staging with EOG is a nontrivial task that poses two difficulties. The first is how to generate multimodal representations containing both EEG and EOG information when the EEG modality is not available. The generated multimodal representations are the key factor in the simplification, since they determine the performance when only EOG is used for sleep staging. We train generators to produce multimodal representations from the single EOG modality based on the correlation between EEG and EOG, adopting the idea of generative adversarial learning. The second difficulty is how to align the time and frequency characteristics of EOG with those of EEG. Many studies have demonstrated the significance of the temporal and spectral features of EEG for sleep staging [20]. Thus, we conditionally guide the generators from the perspectives of time and frequency, respectively, to generate multimodal representations for sleep staging when EEG is not available.
In this paper, to overcome the limitations of EEG in clinical practice, we propose a novel framework that simplifies multimodal sleep staging to a single EOG modality. It makes full use of the predictive power of EEG and the easy collection of EOG. We collected one real-world dataset consisting of 67 subjects and used four other public datasets to evaluate our framework. Our contributions can be summarized as follows:
• We propose a novel framework to simplify multimodal sleep staging with a single EOG modality, which performs well with only EOG as input instead of EEG and EOG together.
• We first model the correlation between EEG and EOG. Then, we generate multimodal representations based on this correlation by adopting the idea of generative adversarial learning. In particular, we incorporate temporal and spectral features into the generated multimodal representations.
• Our framework is evaluated on our collected dataset and four public datasets, and the results demonstrate its effectiveness. Compared with existing methods, our framework performs the best when only EOG is used as input. Moreover, under our framework, EOG provides performance comparable to EEG.

II. RELATED WORK

A. Sleep Staging With Single Modality
Considering the multiple types of signals present in PSG, many previous studies have employed different modalities for BCI tasks [21], [22], [23], such as EEG, EOG, and ECG. For instance, Supratak et al. [9] proposed DeepSleepNet, a CNN-BiLSTM-based network using EEG, which aims to extract invariant features across different shifts and learn the transition rules among different sleep stages. U-time [10], [24] is a fully CNN network based on the U-net architecture that models sleep-related features from EEG well. RecSleepNet [11]

Fig. 2. Overview of the multimodal simplification framework. The entire process can be divided into three stages. The multimodal correlation is modeled in Stage I. In Stage II, synthetic multimodal representations are generated, and the dashed lines represent the guiding conditions. In Stage III, only the EOG data are input for inference. The concatenation operation is marked by a dedicated symbol in the figure.

B. Sleep Staging With Multiple Modalities
Considering the complementarity between EEG and other modalities, some studies constructed sleep staging models based on multimodal feature representations, achieving better performance than single-modal approaches. Based on EEG and EOG, Jia et al. [17] proposed SalientSleepNet, which includes a Multimodal Attention Module designed to extract multimodal features for specific sleep stages. Compared with SalientSleepNet, MMASleepNet [25] and XSleepNet [16] additionally introduced the EMG modality, learning sleep information from three different modalities. These multimodal models demonstrate better performance than single-modal methods on public sleep datasets. Whether single-modal or multimodal approaches are employed, the EEG modality proves irreplaceable for sleep staging.
Some studies [26], [27] have explored cross-modal knowledge distillation. Zhang et al. [28] proposed a visual-to-EEG cross-modal knowledge distillation method for emotion recognition. Zhang et al. [29] proposed a knowledge distillation algorithm based on Multi-Channel Multi-Domain to enhance sleep staging based on a single EEG channel. Liang et al. [30] proposed SleepKD, a teacher-assistant-based model using knowledge distillation, which can capture multi-level sleep features using a single EEG channel. These existing knowledge distillation methods can effectively improve the performance of single-EEG-based tasks. However, the complexity and high cost of EEG acquisition limit their practical application in real-world scenarios. Based on GANs (generative adversarial nets) [31], Yan et al. [32] tackled the challenges of one-to-many cross-modal transfer in the domain of emotion recognition. Inspired by this previous work, we propose a novel framework based on multimodal representation generation to simplify multimodal sleep staging with a single EOG modality, alleviating the limitations of EEG utilization.

A. Problem Formulation
Here, we simplify multimodal sleep staging to a single EOG modality and introduce a novel task of generating multimodal representations for sleep staging. Formally, we are given a sleep sequence $S = (x_1, x_2, x_3, \ldots, x_L)$ of length $L$, where $x_i$ denotes the $i$-th epoch in $S$. Our goal is to compute the sequence of outputs $Y = (y_1, y_2, y_3, \ldots, y_L)$ that maximizes the conditional probability $p(y_1, y_2, y_3, \ldots, y_L \mid x_1, x_2, x_3, \ldots, x_L)$. Here, $y_i \in \{0, 1\}^N$ is the one-hot encoding of the true sleep stage of $x_i$, and $N = 5$ denotes the number of sleep stages.
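To make the label format concrete, the following minimal sketch encodes a sequence of stage names as one-hot vectors with $N = 5$ and maps a per-epoch probability vector back to a stage. The ordering of the stages and the helper names (`one_hot`, `encode_sequence`, `decode`) are assumptions made for illustration; the text only fixes the number of stages.

```python
# Stage order is an assumption for illustration; the text only fixes N = 5.
STAGES = ["W", "N1", "N2", "N3", "REM"]


def one_hot(stage):
    """Return the one-hot vector y_i in {0, 1}^N for a stage name."""
    vec = [0] * len(STAGES)
    vec[STAGES.index(stage)] = 1
    return vec


def encode_sequence(stage_names):
    """Encode a sequence of L stage labels as Y = (y_1, ..., y_L)."""
    return [one_hot(name) for name in stage_names]


def decode(prob_vector):
    """Map a per-epoch probability vector to its most likely stage name."""
    best = max(range(len(prob_vector)), key=lambda k: prob_vector[k])
    return STAGES[best]
```

For a sequence of length $L = 20$, `encode_sequence` would return 20 such vectors, one per 30-second epoch.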

B. Overview
Fig. 2 illustrates the architecture of the proposed framework, which is composed of three stages. In the first stage, we model the correlation between EEG and EOG to obtain real multimodal features. In the second stage, based on the modeled correlation, we generate synthetic multimodal features by training dual generators with a generative adversarial learning method. In the third stage, with the trained generators, we input the single EOG modality for sleep staging.
Stage I: Taking the EEG and EOG modalities as input, we first capture the correlation between EEG and EOG by pretraining a model, and obtain real multimodal representations containing EEG and EOG information.
Stage II: We generate multimodal representations from Gaussian noise based on the correlation learned in Stage I, and employ adversarial learning with the real multimodal features obtained in Stage I to enhance the reliability of the generated multimodal representations. The temporal and spectral features of EOG are incorporated into the generated multimodal representations.
Stage III: At test time, we use only the EOG modality as input to generate reliable multimodal features and classify sleep stages.
In the subsequent subsections, we introduce each stage in more detail.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

C. Stage I: Modeling Multimodal Correlation
In this stage, we first model the correlation between EEG and EOG by pretraining a sleep staging model, which is fundamental for simplifying multimodal sleep staging. After pretraining, we obtain real multimodal features, whose quality is crucial for generating synthetic multimodal representations in Stage II.
Here, the pretrained model consists of three components: the Feature Extractor, the Fusion Block, and the Temporal Encoder. Taking EEG and EOG sequences as inputs, we employ multi-scale convolutional networks to extract invariant features across different shifts from each modality. Given that small-scale convolutions are good at capturing temporal information, while large-scale convolutions are usually used to capture frequency information [9], we perform the small-scale and large-scale convolutions separately for each channel of the EEG and EOG modalities. For the spatial features of different channels, we apply the Style-based Recalibration Module (SRM) [33] to weight each channel and focus on the more important channels. We use several fully connected layers to fuse the EEG and EOG channels. Subsequently, a feedforward neural network is employed to integrate the EEG and EOG modalities and extract multimodal features. After fusion, the multimodal features are input into the Transformer Encoder [34] to obtain real multimodal representations containing EEG and EOG information. We use a feed-forward network as the classifier and train the model by minimizing the cross-entropy loss between its predictions and $y_i$, the real sleep stage label for input epoch $x_i$.
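As a minimal sketch of the classification objective used for pretraining, the cross-entropy between softmax outputs and one-hot labels can be written as below. Averaging over the epochs of a sequence and the function names are assumptions for illustration; the exact normalization used in the framework is not stated in the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one epoch's class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_seq, onehot_seq):
    """Mean cross-entropy over the epochs of one sequence (assumed averaging)."""
    loss = 0.0
    for logits, onehot in zip(logits_seq, onehot_seq):
        probs = softmax(logits)
        # Only the true class contributes, since the label is one-hot.
        loss -= sum(y * math.log(p) for y, p in zip(onehot, probs))
    return loss / len(logits_seq)
```

With uniform scores over the five stages, the loss equals $\log 5$, which is a convenient sanity check for an untrained classifier.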

D. Stage II: Generating Multimodal Representations
Based on the real multimodal representations obtained in Stage I, our objective in this stage is to generate synthetic multimodal features using the single EOG modality. Similarly to other generation methods, we generate multimodal features from Gaussian noise. Considering the sequential nature of sleep signals, it is challenging to guarantee the reliability of the generation without guidance. Hence, we use the EOG modality as a guiding condition to generate synthetic multimodal features from Gaussian noise. Given the real multimodal features obtained in Stage I and the generated synthetic multimodal features, we first map these two multimodal representations into a high-dimensional feature space and then employ adversarial training to align them. Specifically, a discriminator $D$ is trained to classify the real and synthetic multimodal features, while two generators $G_1$ and $G_2$ try to generate representations that are indistinguishable from the real features. In this way, the generated multimodal features are trained to approximate the distribution of the real ones, promoting better performance when using the single EOG modality. Adopting a standard GAN loss, the discriminator is optimized with a cross-entropy loss. The resulting objective $L_D$ (Eq. 3) is computed over $f_r$ and $f_g$, the real and generated multimodal features, respectively, with $X_{eog}$ as the EOG conditional constraint. $L_D$ is used to optimize the discriminator separately so that it discriminates between the real and synthetic features. Here, we adopt inverted labels to address the vanishing gradient problem [35].
Simultaneously, the generators are trained to confuse the discriminator by generating synthetic features that closely resemble the real ones. The generator objective (Eq. 4) involves $X_{eog}$ and $X_n$, the EOG constraint and the Gaussian noise, as well as $f_r$, the real multimodal features. Notably, we apply conditional constraints to both the generator and the discriminator. Because the pretrained network parameters are frozen, the first term of Eq. 4 does not participate in network optimization, so the equation can be reformulated without it.
The above describes how synthetic multimodal features are generated. Next, we introduce our generators, discriminator, and classifiers in detail.
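The adversarial objectives described above can be illustrated with a minimal sketch that operates on scalar discriminator outputs. It assumes the standard conditional GAN cross-entropy form and interprets "inverted labels" as the common non-saturating generator loss, in which generated features are labeled as real; both the interpretation and the function names are assumptions, not the exact formulation of Eq. 3 to Eq. 5.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with real features labeled 1 and generated 0.

    d_real and d_fake stand for the discriminator probabilities
    D(f_r | X_eog) and D(f_g | X_eog), respectively.
    """
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss_inverted(d_fake, eps=1e-12):
    """Generator loss with inverted labels: minimize -log D(f_g | X_eog)."""
    return -math.log(d_fake + eps)


def generator_loss_saturating(d_fake, eps=1e-12):
    """Original minimax form log(1 - D(f_g | X_eog)), shown for contrast."""
    return math.log(1.0 - d_fake + eps)
```

The inverted-label form keeps a large gradient when the discriminator confidently rejects generated features (small `d_fake`), which is the usual motivation for preferring it early in training.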
1) Time-Frequency Dual Generators: Considering the importance of the time and frequency characteristics of sleep signals for sleep staging [36], [37], we employ time-frequency dual generators to generate different multimodal features. In the time-domain generator (generator 1), we use the time-domain EOG signal as the input. In the frequency-domain generator (generator 2), we perform a Fourier transform on the EOG signal and take its modulus as the input. We then use randomly generated Gaussian noise as one of the inputs of the generators. It is worth noting that the variance of the standard normal distribution differs significantly from that of the original sleep signals. We therefore set the mean and variance of the noise to match those of the original EOG signals, making the distribution of the Gaussian noise closer to that of the EOG signal. This step facilitates the subsequent adversarial generation training. The noise is input into multiple fully connected layers to improve the fitting ability. Then, we input the time-domain and frequency-domain EOG signals into their respective feature extractors, the Time Feature Extractor and the Frequency Feature Extractor, which share the same structure and each contain multiple CNN layers. After feature extraction, the time-domain and frequency-domain EOG features are combined with the noise in order to add conditional constraints. Further details are illustrated in Fig. 3. Subsequently, we apply a transformer encoder to learn temporal information within a sequence of synthetic features. The generated multimodal features $F_1$ and $F_2$ are input into the discriminator to optimize the generators by minimizing Eq. 4. The total adversarial loss combines $L_{adv1}$ and $L_{adv2}$, the adversarial losses of the generated time-domain multimodal features $F_1$ and frequency-domain multimodal features $F_2$, respectively.
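The two input-preparation steps described above, drawing Gaussian noise whose mean and variance match the EOG signal and taking the modulus of its Fourier transform, can be sketched as follows. This is an illustration in plain Python under simplifying assumptions: the function names are hypothetical, the statistics are computed over a single one-dimensional signal, and a naive DFT stands in for whatever transform implementation the framework actually uses.

```python
import cmath
import math
import random


def matched_gaussian_noise(eog, n, seed=0):
    """Draw n samples from a Gaussian with the mean and variance of `eog`."""
    mean = sum(eog) / len(eog)
    var = sum((v - mean) ** 2 for v in eog) / len(eog)
    rng = random.Random(seed)
    return [rng.gauss(mean, math.sqrt(var)) for _ in range(n)]


def magnitude_spectrum(signal):
    """Modulus of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff))
    return spectrum
```

In this sketch, the time-domain generator would be conditioned on the raw samples and the frequency-domain generator on the output of `magnitude_spectrum`, with `matched_gaussian_noise` supplying the noise input for both.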
2) Discriminator: The discriminator is a binary classifier used to distinguish the real multimodal features $f_r$ from the synthetic multimodal features $f_g$. We first concatenate the two multimodal features $f_r$ and $f_g$. In the discriminator, the input vector is then concatenated with the guiding condition, i.e., the single EOG modality, and the concatenated vector passes through a linear fusion layer that maps it to a dimension of 512. We stack multiple layers of linear units so that the discriminator outputs a binary classification probability indicating whether the multimodal features come from the real or the generated distribution. The discriminator is optimized by minimizing Eq. 3. Its total loss combines $L_{dis1}$ and $L_{dis2}$, the discriminator losses for the generated multimodal features $F_1$ and $F_2$, respectively.
3) Dual Classifiers: After generation, we add the multimodal features $F_1$ and $F_2$, weighting the addition with a hyperparameter $\alpha$ to obtain the fused feature $F_{out}$. Some existing studies [38], [39] have demonstrated that dual classifiers can help the model reduce variance during training and decrease the probability of low-confidence predictions by using the average prediction vector. Given the real multimodal feature $R$ and the synthetic multimodal feature $F_{out}$ generated by the dual generators, the discrepancies are aligned through discriminative cross-modality alignment. Then, the dual classifiers, which share the same architecture, further refine the sleep staging decision boundaries and improve the robustness of predictions. We use two initialization methods, He [40] and Glorot and Bengio [41], to ensure the diversity of predictions, so that the two classifiers do not converge to the same one during training. We use the cross-entropy loss to optimize the dual classifiers. To sum up, we integrate the adversarial loss and the classification loss into one objective loss function.
Notably, the discriminator loss $L_D$ is updated and optimized separately. In our study, we are more concerned with the parameters updated on the classifiers, so we set $\gamma$ to 0.7. For generator 1 and generator 2, we attach equal importance to them, setting $\lambda_1 = \lambda_2 = 0.5$. For the multimodal features $F_1$ and $F_2$, which have different generation conditions, we pay more attention to the features guided by time-domain information, based on the parameter experiment on $\alpha$. Finally, we set $\alpha = 0.7$.

TABLE I AN OVERVIEW OF THE DATASETS

E. Stage III: Sleep Staging With Single EOG Modality
During the inference phase, we use only the single EOG modality as input. The trained generator 1 and generator 2 generate the corresponding synthetic multimodal features $F_1$ and $F_2$. After addition, $F_{out}$ is passed into the trained dual classifiers. We average the probabilities from the two classifiers to obtain the final prediction $\hat{y}$. The complete procedure of our multimodal simplification framework is given in Algorithm 1.
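The last two inference steps can be sketched on plain Python lists. The convex combination $\alpha F_1 + (1 - \alpha) F_2$ is an assumed form of the weighted addition, chosen to be consistent with the 0.7 versus 0.3 time-to-frequency proportion reported in the hyperparameter study; the function names are hypothetical.

```python
def fuse(f1, f2, alpha=0.7):
    """Weighted addition alpha * F_1 + (1 - alpha) * F_2 (assumed form)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(f1, f2)]


def average_prediction(probs_a, probs_b):
    """Average two classifiers' class probabilities for one epoch.

    Returns the index of the most likely class and the averaged vector.
    """
    avg = [(p + q) / 2.0 for p, q in zip(probs_a, probs_b)]
    best = max(range(len(avg)), key=lambda k: avg[k])
    return best, avg
```

Averaging lets one classifier's confident vote outweigh the other's weaker disagreement, which is the variance-reduction effect attributed to dual classifiers above.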

IV. EXPERIMENTS

A. Datasets
To evaluate our framework, we collected a sleep dataset, namely SSND. For a fair evaluation, we further conducted experiments on four publicly available sleep datasets: ISRUC, SleepEDF-153, HMC, and MASS SS2. The PSG recordings in each dataset differ from the others in many aspects, such as EEG channels, collection instruments, and populations. A brief summary of all the datasets is given in Tab. I.
SSND was collected by ourselves and contains 67 PSG recordings from 17 healthy subjects and 50 subjects with narcolepsy. The dataset consists of 42 females and 35 males, and the PSG recordings were collected at the Affiliated Mental Health Center & Hangzhou Seventh People's Hospital, Zhejiang University School of Medicine. The research was conducted at Zhejiang University with Institutional Review Board approval, and written consent was acquired from all the subjects or their caregivers. For each subject, we collected PSG recordings for an entire night, starting at approximately 21:00 and ending at 5:00 the following morning, totaling approximately 8 hours. All signals were stored in the standard EDF+ data format with a .edf extension. The recordings were divided into 30-second epochs, with each epoch manually labeled as a sleep stage by sleep experts or technicians according to the AASM [4] guidelines. In our SSND dataset, there are a total of 84,546 epochs, consisting of 56,895 sleep epochs from 50 patients aged 11 to 49, and 27,651 sleep epochs from 17 healthy people aged 22 to 32. All 67 PSG recordings were used for evaluation.
ISRUC is a public dataset [42] composed of 3 subgroups. We chose subgroup 1, which includes overnight PSG recordings of 100 adults. We excluded subjects 8 and 40 due to some missing channels.
SleepEDF-153 is a public PhysioNet dataset [43] consisting of 78 healthy subjects aged 25-101. Each subject has two day-night PSG recordings, except subjects 13, 36, and 52, each of whom is missing one recording due to device failure. All 153 PSG recordings were used for evaluation.
HMC is a public dataset [44] including a total of 154 PSG recordings gathered retrospectively from the sleep center dataset of the Haaglanden Medisch Centrum (The Netherlands). We excluded subjects 14, 32, 33, 64, 112, and 135 due to some missing channels.
MASS SS2 is a subset of the public MASS dataset [45] composed of 20 PSG recordings, which were segmented into 20 s epochs. All 20 PSG recordings were used for evaluation.

B. Settings
1) Implementation and Metrics: We employ K-fold cross-validation (CV) to assess the performance of our model across the 5 datasets; the values of K are listed in Tab. I. In each fold, we apply a subject-independent policy, under which samples of the same subject cannot appear in the test data and the training data simultaneously. We use the Adam optimizer to train the model. The coefficients β are set to [0.5, 0.99], the weight decay is set to 3e-4, and the learning rate is set to 1e-4. The sequence length L is set to 20 and the batch size is set to 16. We employ Accuracy (ACC) and Macro-F1 score (MF1) as the evaluation metrics. The model is trained on a single machine equipped with an Intel Core i9 10900K CPU and eight NVIDIA RTX 3080 GPUs using PyTorch.
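The subject-independent policy can be illustrated with a small splitting routine that assigns whole subjects, rather than individual epochs, to folds. This is a sketch under stated assumptions: the function name, the round-robin assignment, and the shuffling seed are hypothetical choices, not details taken from the experimental setup.

```python
import random


def subject_independent_folds(subject_ids, k, seed=0):
    """Yield (train_idx, test_idx) lists of epoch indices for k folds.

    subject_ids[i] is the subject that epoch i belongs to. No subject
    contributes epochs to both the training and the test side of a fold.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Deal the shuffled subjects round-robin into k held-out groups.
    groups = [set(subjects[i::k]) for i in range(k)]
    for held_out in groups:
        test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
        yield train_idx, test_idx
```

Splitting by subject avoids the optimistic bias that arises when epochs from the same night appear on both sides of a fold.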
2) Compared Methods of Sleep Staging: Using the single EOG modality as input, we compared our framework with several deep learning methods and one traditional machine learning method for automatic sleep staging. All the selected methods are designed for a single modality. Methods designed for multiple modalities, such as SalientSleepNet [17], MMASleepNet [25], and XSleepNet [16], are not compared here, since we focus on the performance of using the single EOG modality. RF [46] is a classical ensemble learning method that has been widely used in classification tasks. Here, we calculate the average power spectral density of different frequency bands to construct features. DeepSleepNet [...]
sDREAMER, which employs a self-distillation strategy to transfer multimodal information to a single-modal network, performs slightly worse than our approach, with an average ACC difference of 1.4% (79.5% vs. 80.9%). It even performs slightly better than our model on the easily distinguishable W (wake) stage on the MASS and ISRUC datasets. However, its average MF1 is worse than ours, with a gap of 3.4% (72.0% vs. 75.4%). Moreover, on several datasets it performs much worse than our model on the N1 stage, which is difficult to distinguish: about 20.7% lower MF1 on SSND, 19.7% lower on HMC, and 10.1% lower on SleepEDF.
[...] and multiple modalities of EEG and EOG, using the pretrained model (baseline model) from Stage I (as shown in Fig. 2, Stage I). Notably, when we input the single EEG or single EOG modality, we remove the Fusion Block, which is designed to fuse multimodal features, from the Stage I model. We then compare the results of the above three experiments with the performance of the single EOG modality under our framework, as shown in Tab. II. As expected, in most cases the multimodal input of EEG and EOG performs the best, which is reasonable as discussed above (on the MASS dataset, solely using EEG signals performs slightly better than using multimodal EEG and EOG). More notably, when using the single EOG modality, our framework significantly improves performance over the baseline model (1.96% average improvement in ACC and 2.4% in MF1), especially on ISRUC (2.7% improvement in ACC and 3.5% in MF1) and MASS (2.8% improvement in ACC and 2.0% in MF1).
This demonstrates the effectiveness of our multimodal simplification framework in generating synthetic multimodal features even without EEG. Moreover, solely using the EOG modality under our framework provides performance comparable to using the single EEG modality under the baseline model (80.9% vs. 82.5% on average ACC, 75.5% vs. 77.2% on average MF1). This significantly closes the gap between single EOG and single EEG (from an average difference of 3.6% in ACC and 4.2% in MF1 under the baseline model to an average difference of 1.6% in ACC and 1.7% in MF1). In particular, on the SSND dataset, using single EOG under our framework outperforms using the single EEG modality under the baseline model (83.8% vs. 83.7% on average ACC and 79.3% vs. 78.8% on average MF1). This demonstrates the potential of the EOG modality, which is convenient to collect using wearable devices in daily life, making it possible to monitor sleep quality in a home-based setting.
3) Ablation Study: In this experiment, we investigate the effectiveness of the time-frequency domain generators of our framework. In our work, we design two generators that use the time-domain and frequency-domain EOG signals as guiding conditions, respectively, to generate synthetic multimodal features. Here, we conduct ablation experiments to validate the effectiveness of each generator in our model. The model variants are defined as follows:
• G1: generator 2 (frequency domain) is removed from our framework.
• G2: generator 1 (time domain) is removed from our framework.
• G1+G2: both generator 1 and generator 2 are applied, in time and frequency.
As shown in Tab. IV, both time and frequency features contribute to generating synthetic multimodal features for sleep staging, and combining them performs the best. When we employ the time-domain EOG signal as the guiding condition, the model provides superior overall performance compared to using the frequency-domain EOG signal as the guiding condition (80.4% vs. 78.3% on average ACC and 74.6% vs. 72.6% on average MF1). In particular, on both the HMC and MASS datasets, the frequency-domain EOG guided models perform much worse than the time-domain EOG guided models, about 3% to 5% lower in both ACC and MF1. On the ISRUC and Sleep-EDF datasets, using solely the time domain or solely the frequency domain as the guiding condition yields similar performance. This may be caused by differences in collection devices and environmental conditions during data gathering. In this ablation study, we set the hyperparameter α to 0.7 when fusing the dual synthetic features from the different generators (α denotes the proportion used in the time-frequency domain feature fusion). More details about the choice of α are given in a subsequent experiment.
4) Comparison With Other EEG-Based Methods: In this section, we list three existing EEG-based methods [10], [11], [47] for comparison, as shown in Tab. V. We took the EEG-based performance of the existing methods from the corresponding papers, and we obtained their EOG-based performance by implementing them ourselves. Due to the page limit, we conducted this experiment only on the ISRUC and Sleep-EDF datasets. The results demonstrate that our proposed framework provides competitive performance when only using EOG, compared with only using EEG: the gap is less than 1% (ACC: 0.9%, MF1: 0.4%) on the ISRUC dataset, and 1.8% in ACC and 2.6% in MF1 on the SleepEDF dataset. Although the EEG-based performance of the other existing models surpasses ours, their gap between EEG and EOG inputs is much larger than ours. For example, TinySleepNet has an average ACC gap of 7.4% and an average MF1 gap of 11.2% on the SleepEDF dataset. Among the compared methods, U-time has the smallest gap between EEG and EOG, with an average MF1 difference of 6.0% on both the ISRUC and SleepEDF datasets.

5) Single and Multiple Modal Features Visualization: To demonstrate the effectiveness of our method, we chose a subject from the ISRUC dataset to visualize the intermediate features based on single or multiple modalities. The visualization, shown in Fig. 4, is based on the t-SNE method [49]. The samples from the EEG modality belonging to the same sleep stage are nicely grouped within the same cluster in Fig. 4 (b), compared with the samples from the EOG modality in Fig. 4 (a). This demonstrates that the EEG modality has a more powerful predictive ability than the EOG modality. Notably, the samples represented by the generated multimodal features from the same stage also form a cluster, which looks quite similar to those formed by the EEG modality features, as shown in Fig. 4 (b) and Fig. 4 (c). Compared with the single EOG modal features in Fig. 4 (a), where different stages are scattered chaotically, the well-formed clusters of the generated features in Fig. 4 (c) further demonstrate the effectiveness of our method. As shown in Fig. 4 (d), based on the real multimodal features of EEG and EOG, the different stages are clustered separately. The visualization comparison shows that our method is capable of learning the correlation between EEG and EOG and generating reliable multimodal feature representations based on the single EOG modality.
6) Analysis of Time-Frequency Generators Ratio: In our framework, the multimodal generators consist of two parts: a time-domain generator and a frequency-domain generator. The two generators share the same structure but use the time-domain and frequency-domain EOG modality as the guiding condition, respectively. As mentioned above, we use a hyperparameter α to control the ratio in which the multimodal features from the dual generators are fused. In this section, we explore the impact of this time-frequency fusion ratio α on the experimental results. We conducted the hyperparameter study on the five datasets and varied α from 0 to 1 in increments of 0.1. As shown in Fig. 5, in most cases the model performs the best when α is equal to 0.7, which means that the time-domain synthetic feature has a higher proportion than the frequency-domain feature (0.7 vs. 0.3). In other words, the synthetic feature generated using the time-domain EOG signal as the guiding condition carries more of the important information in this study. In particular, on the Sleep-EDF dataset, the model performs the best when α is set to 0.9 (with an average accuracy of 80.6% compared to 80.7%). On the MASS dataset, the average MF1 score is slightly better when α is set to 0.8 (76.4% compared to 76.8% with α set to 0.7 and 0.8, respectively). Notably, on the Sleep-EDF and SSND datasets, the curves of the evaluation metrics vary more smoothly than on the other datasets, implying that the networks are not very sensitive to changes in the feature fusion ratio on these two datasets. To summarize, the variation trends of the metrics on the five datasets remain consistent: as α increases, the model's performance first improves and then declines, reaching its optimum within the range where the time-domain features have the larger proportion. Some subtle differences in variation may be due to the different environmental conditions during data collection.
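
The role of α in this study can be illustrated with a short sketch. It is not the implementation used in this work: it assumes that the fusion is a convex combination in which α weights the time-domain feature and 1 − α weights the frequency-domain feature, which matches the 0.7 vs. 0.3 reading above. The names `fuse_features` and `sweep_alpha` are hypothetical, and `toy_score` is a stand-in for training and evaluating the model at a given α, not a reported result.

```python
import numpy as np

def fuse_features(f_time, f_freq, alpha):
    """Assumed fusion rule: alpha * time feature + (1 - alpha) * frequency feature."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * np.asarray(f_time) + (1.0 - alpha) * np.asarray(f_freq)

def sweep_alpha(evaluate, step=0.1):
    """Evaluate every alpha in [0, 1] on a grid and return the best one."""
    # Round the grid to remove floating-point drift such as 0.30000000000000004.
    alphas = np.round(np.arange(0.0, 1.0 + step / 2, step), 10)
    scores = np.array([evaluate(a) for a in alphas])
    best = int(np.argmax(scores))
    return alphas, scores, float(alphas[best])

# Hypothetical stand-in for a real evaluation run (e.g., ACC at a given alpha):
# a curve that rises and then falls, with its peak placed at 0.7 by construction.
def toy_score(alpha):
    return -(alpha - 0.7) ** 2

alphas, scores, best_alpha = sweep_alpha(toy_score)

# Fusing two toy feature vectors with the selected ratio.
fused = fuse_features(np.ones(4), np.zeros(4), best_alpha)
```

With α = 0 only the frequency-guided feature is used and with α = 1 only the time-guided feature, so the two ends of the sweep correspond to single-generator ablations.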

V. DISCUSSION
In this work, our proposed method effectively simplifies the multimodal sleep staging task, bringing the performance based on the single EOG modality close to that based on the EEG modality. This simplification framework allows us to make full use of both the easy collection of EOG and the powerful capabilities of EEG in sleep staging, and makes it possible to use only a single EOG modality for sleep staging. In current clinical practice, patients are required to wear a cap with dozens of electrodes to collect EEG and EOG data from several dozen channels for monitoring sleep quality in the hospital, which is complex and expensive. Hence, we are interested in exploring the possibility of home-based sleep monitoring. In a home-based setting, it is hard to guarantee the quality of the collected EEG signals due to their sensitivity to the environment. Existing mainstream end-to-end sleep staging models require both EEG and EOG signals of high quality to achieve good performance [50]. Fortunately, we can collect high-quality EOG signals at home due to their low environmental requirements. As mentioned above, our framework simplifies the multimodal sleep staging task and requires only EOG data as input. Therefore, our proposed method can be integrated into wearable devices that gather individual EOG data for sleep staging. Once the single EOG modality is fed into the pre-trained network, highly reliable predictions can be obtained. This process can be conducted at home by wearing a lightweight EOG data collection device, eliminating the need to visit a hospital and use professional EEG acquisition equipment to collect EEG data for sleep staging.
On the downside, this work still has some limitations. Firstly, the total number of subjects in the datasets we used is very small. As shown in Fig. 2, our framework has two pretraining stages before it can be used for inference: one for obtaining the real multimodal features and another for training the time-frequency generators and dual classifiers. The quality of the real multimodal features obtained from the pre-trained network directly affects, to some extent, the final performance in the test stage. This calls for a large-scale sleep staging dataset for pretraining. However, the largest dataset in our experiments is HMC, which contains only 153 PSG recordings. The size of the datasets limits the generalization of the two-step pre-trained network. Secondly, the individual discrepancies among different subjects are significant. Our method is fundamentally based on pretrained generators that generate synthetic multimodal features for subjects using a single EOG modality. The features generated from the input EOG modality combined with Gaussian noise should align with the real multimodal features obtained from the training set. If there are significant individual differences between the target domain and the source domain, the multimodal feature distributions learned from the training set may differ from those of unseen subjects, potentially resulting in poor performance for them.

VI. CONCLUSION
In this paper, we propose a novel multimodal simplification framework for sleep staging that allows us to generate multimodal feature representations based on the single EOG modality. Specifically, we first model the multimodal correlations between the EEG and EOG modalities. Leveraging this correlation, we adopt a conditional generative framework guided by the time-frequency EOG signals to generate multimodal feature representations in the absence of the EEG modality. Then, in the test stage, we input only the single EOG modality for sleep staging, reducing the dependence on the EEG modality. The framework was evaluated on our collected dataset and four public datasets. Compared with existing methods, our framework performs the best when using only EOG as input. Moreover, under our framework, the single EOG modality provides performance comparable to that of the single EEG modality. The results demonstrate the potential of the single EOG modality for sleep staging in clinics, overcoming the collection limitations of EEG. Motivated by the success of simplifying multimodal sleep staging with single EOG, in the near future we plan to generalize the proposed framework to other more easily collected signals, such as ECG, for monitoring sleep quality, making sleep monitoring more accessible.

Fig. 4. Feature visualization based on different modalities, where different colors represent different sleep stages. Panels (a) and (b) show the feature distributions obtained with single EOG and single EEG as input; panels (c) and (d) show the generated and real multimodal feature distributions.

Fig. 5. Analysis of the Time-Frequency Generator Ratio, where we vary the hyperparameter α from 0 to 1 in increments of 0.1.

Algorithm 1 Multimodal Simplification Algorithm
Input: X_EEG, X_EOG
Output: Evaluation indicators of test data
Stage I:
    Initialize parameters θ in the pretrained model.
    for i = 1 to n do
        Optimize θ by minimizing Eq. 2.
    end
    return multimodal feature R.
Stage II:
    Initialize generators G1 and G2, classifiers C1 and C2, and discriminator D.
    for i = 1 to n do
        Generate synthetic multimodal features F1 and F2.
        Concatenate F1 and F2 with R, respectively.
        Optimize D by minimizing Eq. 7.
        Optimize G1, G2, C1 and C2 by minimizing Eq. 10.
    end
    return trained G1, G2, C1 and C2.
Stage III:
    Use random noise and X_EOG to generate multimodal features through the trained G1 and G2.
    Use the trained classifiers C1 and C2 for sleep staging.
    return evaluation indicators.