Multi-View Spatial-Temporal Graph Convolutional Networks With Domain Generalization for Sleep Stage Classification

Sleep stage classification is essential for sleep assessment and disease diagnosis. Although previous attempts to classify sleep stages have achieved high classification performance, several challenges remain open: 1) How to effectively utilize time-varying spatial and temporal features from multi-channel brain signals remains challenging. Prior works have not been able to fully utilize the spatial topological information among brain regions. 2) Because biological signals differ considerably between individuals, overcoming subject differences and improving the generalization of deep neural networks is important. 3) Most deep learning methods ignore the interpretability of the model with respect to the brain. To address the above challenges, we propose multi-view spatial-temporal graph convolutional networks (MSTGCN) with domain generalization for sleep stage classification. Specifically, we construct two brain view graphs for MSTGCN based on the functional connectivity and physical distance proximity of the brain regions. The MSTGCN consists of graph convolutions for extracting spatial features and temporal convolutions for capturing the transition rules among sleep stages. In addition, an attention mechanism is employed for capturing the most relevant spatial-temporal information for sleep stage classification. Finally, domain generalization and MSTGCN are integrated into a unified framework to extract subject-invariant sleep features. Experiments on two public datasets demonstrate that the proposed model outperforms the state-of-the-art baselines.


I. INTRODUCTION
Sleep stage classification is important for the assessment of sleep quality and the diagnosis of sleep disorders. Sleep experts identify sleep stages based on the American Academy of Sleep Medicine (AASM) standard [1] and observations recorded in polysomnography (PSG), which includes electroencephalography (EEG) at different positions on the head and electrooculography (EOG). The AASM standard also records the transition rules among different sleep stages, which can assist sleep experts in identifying the sleep stages. Although these rules provide valuable information, classifying the sleep stages by human sleep experts is still a tedious and time-consuming task. Moreover, the classification results are affected by the variability and subjectivity of sleep experts.
This work was supported by the National Natural Science Foundation of China under Grant 61603029. Li-wei H. Lehman was supported by the NIH grant R01EB030362.
Automatic sleep stage classification can greatly improve the efficiency of traditional sleep stage classification and has important clinical value. Many researchers have made great contributions to automating this classification task. At first, traditional machine learning methods based on time domain, frequency domain, and time-frequency domain features were adopted [2], [3]. However, the classification accuracy of these methods depends heavily on feature engineering and feature selection, which require substantial expert knowledge. Recently, deep learning methods have been widely applied to automatically classify sleep stages thanks to their powerful representation learning ability. For example, Convolutional Neural Networks (CNN) [4] and Recurrent Neural Networks (RNN) [5] are often utilized to learn appropriate feature representations from transformed data or directly from raw data.
Although the existing methods [6]-[10] achieve high accuracy for sleep stage classification, these methods have not sufficiently solved the following challenges: 1) The spatial-temporal features of sleep stages have not been fully considered. In particular, the topology among brain regions has not been effectively employed to capture richer spatial features. 2) Physiological signals vary significantly across different subjects, which hinders the generalizability of the trained classifiers. 3) Most deep learning methods, especially related graph neural network models, ignore the importance of model interpretability with respect to the brain.
There have been several attempts to address the first challenge [7], [11]-[13]. For example, CNN is usually applied to extract the spatial features of the brain, and RNN is applied to capture temporal features during sleep transition. However, the limitation of these networks is that their input must be grid data (image-like representations), which does not utilize the connections among brain regions [14]. Because brain regions lie in non-Euclidean space, a graph is the most appropriate data structure to represent brain connections. Therefore, GraphSleepNet [15] was proposed to classify sleep stages based on the functional connectivity of the brain network, using spatial-temporal graph convolution to achieve state-of-the-art performance. However, in a brain network based on functional connectivity, physically adjacent brain regions are not necessarily connected. In fact, existing neuroscience research shows that brain regions that are physically close to each other can influence each other [16]. GraphSleepNet only utilizes the functional connectivity of the brain to construct the sleep stage networks, which ignores the importance of the physical proximity of brain regions.
arXiv:2109.01824v1 [eess.SP] 4 Sep 2021
For the second challenge, some researchers try to apply transfer learning methods to improve the generalization of the models [17], [18]. The existing sleep stage classification models based on transfer learning are all two-step training paradigms. That is, these models need to be pre-trained and then fine-tuned on new subject data. The fine-tuning operation requires collecting sleep data from specific new subjects or datasets, which is quite expensive and inconvenient. In addition, the generalization of transfer learning models that need to be fine-tuned is limited. These models are designed for specific subjects and may not perform well on other new subjects. Therefore, fine-tuning is only applicable to the personalized (subject-variant) model of a specific subject. Whenever a new subject needs to be evaluated, data must be collected again and the existing model must be fine-tuned again. Hence, for clinical systems intended for unknown users, fine-tuning may become inefficient.
For the third challenge, previous attempts to develop interpretable CNN or RNN classification models have been sparse [7], [12], [19]. Specifically, no attempt has been made to interpret the key modules of graph neural networks for sleep stage classification from the perspective of the brain network.
In order to address the above challenges, we propose the multi-view spatial-temporal graph convolutional networks (MSTGCN) with domain generalization for sleep stage classification. Figure 1 illustrates the overall architecture of our model. Specifically, 1) we construct two brain view graphs based on the spatial proximity and functional connectivity of the brain, where each EEG channel corresponds to a node of the graph, and the specific connections among the channels correspond to the edges of the graph. 2) Then, we utilize spatial graph convolution to capture rich spatial features. Temporal convolution is applied for capturing the transition among different sleep stages. In practice, sleep experts usually identify the class label of one sleep state according to both the characteristic EEG waves of the current state and the class labels of its neighbors. 3) We design a spatial-temporal attention mechanism to capture the most relevant spatial-temporal information on the sleep stages. 4) Finally, we apply adversarial domain generalization, which is a typical method of transfer learning without fine-tuning. In the process of model training, each subject is employed as a specific source domain for subject-invariant sleep feature extraction. The subject-invariant sleep feature does not vary with different subjects and is related to sleep stage classification. The advantage of domain generalization is that it does not require any information from the new subjects (target domain).
To the best of our knowledge, this is the first attempt to apply spatial-temporal graph neural networks with domain generalization for sleep stage classification. Overall, the main contributions of the proposed model for sleep stage classification are summarized as follows:
• We construct different brain views based on the functional connectivity and physical distance proximity of the brain. The complementarity of different views provides rich spatial topology information for classification tasks.
• We design a spatial-temporal graph convolution with attention mechanism, which consists of spatial-temporal graph convolution for spatial-temporal features and an attention mechanism for capturing the most relevant spatial-temporal information for sleep stage classification.
• We integrate domain generalization and spatial-temporal graph convolutional networks into a unified framework to extract subject-invariant sleep features.
• We conduct experiments on two public sleep datasets, namely ISRUC-S3 and MASS-SS3. Experimental results demonstrate that the proposed model achieves state-of-the-art performance.
• We explore the interpretability of the key modules of the model. In particular, we present the functional connectivity obtained through adaptive graph learning. The results indicate that functional connectivity during light sleep is more complex than that during deep sleep.
Compared to the Adaptive Spatial-Temporal Graph Convolutional Networks (called GraphSleepNet) published in our preliminary work [15], MSTGCN has the following important improvements:
1) The brain network based on physical distance proximity is constructed. Together with the preliminary adaptive functional connectivity brain network, it forms a multi-view brain network, which can provide rich brain spatial topology information for sleep stage classification.
2) Domain generalization is integrated with spatial-temporal graph convolutional networks into a unified framework to improve the generalization of the proposed model.
3) Experiments are conducted to evaluate the effectiveness of MSTGCN on two sleep datasets, of which ISRUC-S3 is not evaluated in our preliminary work. Moreover, we conduct ablation experiments to evaluate the impact of each component of MSTGCN on the performance.
4) The interpretability of the key modules in MSTGCN is explored and discussed.
II. RELATED WORK
In recent years, time series analysis has attracted the attention of many researchers [20], [21]. As a typical time series, physiological signals are used in many fields, such as motor imagery [22]-[24], emotion recognition [25], [26], and sleep stage classification [15]. With the development of deep learning, two popular deep learning models, CNN and RNN, are widely applied in sleep stage classification. Specifically, a fast discriminative complex-valued CNN (FDCCNN) [27] is proposed to capture the sleep information hidden inside EEG signals. A CNN model based on multivariate and multimodal physiological signals [7] takes into account the transitional rules of sleep stages to assist classification. A hierarchical RNN named SeqSleepNet [13] tackles the task as a sequence-to-sequence classification task. At the same time, hybrid models are also employed by some researchers. DeepSleepNet [12] utilizes CNN to extract time-invariant features, and Bidirectional Long Short-Term Memory (BiLSTM) to learn the transition rules among sleep stages. A hierarchical neural network [28] implements a comprehensive feature learning stage and a sequence learning stage, respectively. Additionally, with the development of attention mechanisms, a deep bidirectional RNN with attention mechanism is utilized for single-channel sleep staging [29]. Although CNN and RNN models achieve high accuracy, their limitation is that the model's input must be grid data, which ignores the connections among brain regions. As different brain regions are not in Euclidean space, grid data may not be the optimal data representation. Hence, the graph is the most appropriate data structure. GraphSleepNet [15] is proposed to utilize a graph neural network to model the functional connectivity brain network and achieves SOTA performance. However, it only considers the spatial functional connectivity, and to a certain extent ignores the spatial proximity of brain regions.

Fig. 1. The overall architecture of the MSTGCN for sleep stage classification. First, two different views of the brain are constructed: the functional connectivity-based brain graph and the spatial distance-based brain graph. Different views reflect different spatial relationships of the brain. Then, an attention-based spatial-temporal graph convolution is designed to extract the most relevant spatial-temporal features for sleep stage classification. Finally, domain generalization with the gradient reversal layer is implemented to improve the generalization of the model. In domain generalization, each subject in the training set is treated as a specific source domain. The advantage of domain generalization over other transfer learning methods is that it does not require any information (a small number of labeled samples or the unlabeled sample data distribution) from the test set (called the unknown domain or target domain). Therefore, domain generalization improves the generalization of the model and is more suitable for clinical systems used by unknown users.
Some previous researchers attempt to solve the subject difference problem found in physiological signals. Transfer learning methods are applied to improve the robustness of deep learning models for individual differences [17], [18]. For example, MetaSleepLearner [18] based on model-agnostic meta-learning is proposed to overcome the subject difference problem by training in the source domain and fine-tuning in the target domain. Although the existing transfer learning methods for sleep stage classification can achieve improved results, almost all existing work needs to fine-tune the pretrained model for sleep stage classification. That is, these models require additional fine-tuning operations using part of the labeled data in the target domain. Therefore, these transfer learning methods that need to be fine-tuned are only suitable for the specific subject's personalized model. In this case, whenever a new subject needs to be evaluated, data must be collected again and the existing model must be fine-tuned again. Therefore, for clinical systems that need to be adapted to unknown subjects, fine-tuning operations may become inefficient.

III. PRELIMINARIES
Sleep Stage. Polysomnography (PSG) is usually employed for recording physiological signals during sleep in clinical medicine. The PSG is segmented into 30-second epochs for sleep stage classification. Sleep experts usually classify sleep epochs into different stages based on the sleep staging standard. Specifically, according to the AASM sleep staging standard, the human sleep process can be divided into three main parts: Wakefulness (Wake), rapid eye movement (REM), and non-rapid eye movement (NREM). Furthermore, the NREM can be subdivided into three parts: N1 stage, N2 stage, and N3 stage. In general, sleep experts directly divide the sleep state into 5 different classes (Wake, N1, N2, N3, and REM).
Sleep Brain Network. A sleep brain network is defined as a graph $G = (V, E, A)$, where $V$ represents the set of vertices and each vertex in the network represents an electrode on the brain; $|V| = N$ is the number of vertices in the sleep brain network; $E$ denotes the set of edges and indicates the connections between vertices; $A$ denotes the adjacency matrix of the sleep brain network $G$. As presented in Figure 2, $G^{FC}_t$ represents the sleep brain network constructed from functional connectivity and $G^{DC}_t$ represents the sleep brain network constructed from spatial distance. A 30 s EEG signal sequence $S_t$ (called a sleep epoch) is transformed into $G^{FC}_t$ and $G^{DC}_t$.
Sleep Feature Matrix. The sleep feature matrix is the input of the graph neural network. We define the raw signal sequence as $S = (S_1, S_2, \ldots, S_L) \in \mathbb{R}^{N \times T_s \times L}$, where $L$ is the number of sleep epochs and $T_s$ represents the time series length of each sleep epoch $S_i \in S$ ($i \in \{1, 2, \cdots, L\}$). For each sleep epoch $S_i$, we extract the node features by using a feature extraction network in Supplementary Material S.1 and define each epoch

Fig. 2. Multi-view sleep brain network. The left network is the functional connectivity-based network and the right network is the spatial distance-based network.
Sleep Stage Classification Problem. The research goal is to learn the mapping relationship between the encoded signals and sleep stage classes. The problem of sleep stage classification is defined as: given $S = (S_{i-d}, \ldots, S_i, \ldots, S_{i+d}) \in \mathbb{R}^{N \times T_s \times T_n}$, identify the current sleep stage $y$, where $S$ represents the temporal context of $S_i$, $y$ denotes the sleep stage class label of $S_i$, and $T_n = 2d + 1$ is the number of sleep brain networks, where $d \in \mathbb{N}^+$ is the temporal context size. Specifically, in order to identify the sleep stage of the current sleep epoch $S_i$, we utilize its previous $d$ epochs and following $d$ epochs as the context. For each epoch, we construct $G^{DC}$ and $G^{FC}$ respectively, and they are employed as the input of our model to identify the sleep stage $y$ of the current sleep epoch.
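The temporal context described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `build_context_windows`, the epoch-first array layout `(L, N, Ts)`, and the choice to drop the first and last `d` epochs (which lack a full context) are assumptions, not details taken from the paper.

```python
import numpy as np


def build_context_windows(epochs: np.ndarray, d: int) -> np.ndarray:
    """Group each epoch with its d previous and d following epochs.

    epochs: array of shape (L, N, Ts), i.e. L sleep epochs, N channels,
            Ts samples per epoch (layout is an assumption of this sketch).
    Returns an array of shape (L - 2d, 2d + 1, N, Ts); window j is centred
    on epoch j + d. Epochs without a full context are dropped (assumption).
    """
    if d < 1:
        raise ValueError("d must be a positive integer")
    num_epochs = epochs.shape[0]
    if num_epochs < 2 * d + 1:
        raise ValueError("need at least 2d + 1 epochs")
    return np.stack(
        [epochs[i - d:i + d + 1] for i in range(d, num_epochs - d)]
    )
```

With labels stored in an array `y` of length `L`, the label matching window `j` is `y[j + d]`, so `y[d:L - d]` lines up with the returned windows.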

IV. MULTI-VIEW SPATIAL-TEMPORAL GCN
The overall architecture of the proposed model is exhibited in Figure 1. We summarize four key ideas of the proposed MSTGCN model: 1) Construct multiple views of the brain connection to fully indicate the spatial information of the brain. 2) Combine spatial graph convolution and temporal convolution to extract both spatial and temporal features. 3) Employ a spatial-temporal attention mechanism to automatically pay more attention to valuable spatial-temporal information. 4) Integrate domain generalization and spatial-temporal GCN in a unified framework to extract subject-invariant sleep features. The overall architecture is designed to accurately identify sleep stages.

A. Multi-view on Brain Graph
In this section, we introduce two different views of the brain graph: the functional connectivity-based brain graph and the spatial distance-based brain graph. Different views reflect different spatial relationships of the brain. Specifically, the functional connectivity-based brain graph presents the collaboration of different brain regions in space. The actual physical locations of these brain regions may not be adjacent. However, existing neuroscience studies have shown that physically adjacent brain regions also interact. Therefore, these two views of the brain are complementary to a certain degree and can fully demonstrate the spatial relationships of the brain.
1) Functional Connectivity-based Brain Graph: Functional connectivity is usually constructed based on correlations or dependencies among physiological signals [30]. Pearson Correlation Coefficient (PCC) [31] and Mutual Information (MI) [32] are two common methods to determine the functional connectivity of the brain. Due to the limited understanding of the brain, it is still challenging to determine a suitable graph structure in advance for sleep stage classification. Hence, we propose a data-driven graph generation for functional connectivity. This data-driven approach constructs the functional connectivity graphs adaptively for different sleep stages based on the feature correlation between nodes, as displayed in Figure 3. We define a non-negative function $A^{FC}_{mn} = g(x_m, x_n)$ ($m, n \in \{1, 2, \cdots, N\}$) to represent the functional connectivity between nodes $x_m$ and $x_n$ based on the input features. The function $g(x_m, x_n)$ is implemented through a neural network layer, which has the learnable weight vector $w$. The learned graph structure (adjacency matrix) $A^{FC}$ is defined as:

$$A^{FC}_{mn} = g(x_m, x_n) = \frac{\exp\left(\mathrm{ReLU}\left(w^{\top} |x_m - x_n|\right)\right)}{\sum_{n=1}^{N} \exp\left(\mathrm{ReLU}\left(w^{\top} |x_m - x_n|\right)\right)},$$

where the rectified linear unit (ReLU) is an activation function that guarantees that $A^{FC}_{mn}$ is non-negative. The softmax operation normalizes each row of $A^{FC}$. The weight vector $w$ is updated by minimizing the following loss function:

$$\mathcal{L}_{graph\_learning} = \sum_{m,n=1}^{N} \|x_m - x_n\|_2^2 \, A^{FC}_{mn} + \lambda \|A^{FC}\|_F^2.$$

That is, the larger the distance $\|x_m - x_n\|_2$ between $x_m$ and $x_n$, the smaller $A^{FC}_{mn}$ is. Because the brain connection structure is not a fully connected graph, we utilize the second term in the loss function to control the sparsity of the graph $A^{FC}$, where $\lambda = 0.001$ is a regularization parameter.

Fig. 3. The adaptive sleep graph learning to generate functional connectivity for sleep stage classification. $x_m$ and $x_n$ represent the features of two nodes respectively, and $w$ is the learnable weight. The more similar the node features, the greater the probability of establishing a connection.
The proposed graph generation mechanism automatically constructs the neighborhood connections of the nodes. Minimizing the above loss function $\mathcal{L}_{graph\_learning}$ independently would lead to the trivial solution (i.e., $w = (0, 0, \cdots, 0)$). To avoid this, we utilize it as a regularization term in the overall loss function.
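A minimal NumPy sketch of the adaptive graph learning step follows. It assumes that the adjacency is a row-wise softmax over $\mathrm{ReLU}(w^{\top}|x_m - x_n|)$ and that the loss is the feature-distance term weighted by the adjacency plus $\lambda$ times the squared Frobenius norm, which is how the surrounding text describes them. The function names are hypothetical, and in the actual model $w$ would be trained by backpropagation rather than passed in.

```python
import numpy as np


def learn_adjacency(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """x: (N, F) node features, w: (F,) weight vector -> (N, N) adjacency."""
    diff = np.abs(x[:, None, :] - x[None, :, :])       # |x_m - x_n|, (N, N, F)
    scores = np.maximum(diff @ w, 0.0)                 # ReLU(w^T |x_m - x_n|)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)  # row softmax


def graph_learning_loss(x: np.ndarray, adj: np.ndarray, lam: float = 0.001) -> float:
    """Distance-weighted connectivity term plus sparsity regularizer."""
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)  # ||x_m - x_n||_2^2
    return float((sq_dist * adj).sum() + lam * (adj ** 2).sum())
```

Note that with $w = 0$ every row of the adjacency becomes uniform, which is the degenerate case the regularized joint training is meant to avoid.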
2) Spatial Distance-based Brain Graph: Previous studies have shown that adjacent brain regions affect each other and that the strength of the impact is inversely proportional to the actual physical distance [16]. That is, the closer the brain regions, the greater the impact. Therefore, we construct a spatial distance-based brain graph for sleep stage classification, as illustrated in Figure 4.
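One way to turn the inverse-distance idea into an adjacency matrix is sketched below. The exact construction used in the paper is not given in this section, so everything here is an assumption made for illustration: the function name, the use of 3-D electrode coordinates, the plain $1/\text{distance}$ weighting, the zero diagonal, and the rescaling to a maximum of 1.

```python
import numpy as np


def distance_adjacency(coords: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """coords: (N, 3) electrode positions -> (N, N) inverse-distance weights.

    Closer electrodes get larger weights; the diagonal is left at zero and
    the matrix is rescaled so that its largest entry equals 1.
    """
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = np.zeros_like(dist)
    off_diag = ~np.eye(len(coords), dtype=bool)
    adj[off_diag] = 1.0 / (dist[off_diag] + eps)   # inverse-distance weighting
    return adj / adj.max()
```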

B. Spatial-Temporal Attention
The attention mechanism is often utilized to automatically extract the most relevant information. In this study, we employ a spatial-temporal attention mechanism [15] to capture valuable spatial-temporal information on the sleep brain network. The spatial-temporal attention mechanism contains spatial attention and temporal attention.
1) Spatial Attention: In the spatial dimension, different brain regions have different effects on the sleep stage, and these effects change dynamically during sleep. To automatically extract the attentive spatial dynamics, we utilize a spatial attention mechanism (taking the spatial attention based on the functional connectivity view as an example), whose attention matrix is normalized as $P'_{m,n} = \mathrm{softmax}(P_{m,n})$, where $X^{(l-1)} = (X_1, X_2, \ldots, X_{T_{l-1}}) \in \mathbb{R}^{N \times C_{l-1} \times T_{l-1}}$ is the $l$-th layer's input, $C_{l-1}$ represents the number of neural network channels of each node (for $l = 1$, the dimension of the input node features), the weights of the attention computation are learnable parameters, and $\sigma$ denotes the sigmoid activation function. $P$ represents the spatial attention matrix, which is dynamically computed from the current layer's input. $P_{m,n}$ represents the correlation between nodes $m$ and $n$. The softmax operation is utilized to normalize the attention matrix $P$. In the proposed model, when the graph convolution is performed, the learned adjacency matrix $A^{FC}$ and the spatial attention matrix $P$ can dynamically adjust the update of nodes.
2) Temporal Attention: In the temporal dimension, there are correlations among neighboring sleep stages, and the correlations vary in different situations. Therefore, a temporal attention mechanism is utilized to capture dynamic temporal information among sleep brain networks.
The temporal attention mechanism produces a temporal attention matrix $Q$, where $Q_{u,v}$ denotes the strength of correlation between sleep brain networks $G_u$ and $G_v$. The softmax operation is utilized to normalize the attention matrix $Q$. The input of the MSTGCN is tuned by the temporal attention: $\hat{X}^{(l-1)} = (\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_{T_{l-1}}) = (X_1, X_2, \ldots, X_{T_{l-1}}) Q \in \mathbb{R}^{N \times C_{l-1} \times T_{l-1}}$, so that the model pays more attention to informative temporal information.
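The re-weighting step $\hat{X}^{(l-1)} = X^{(l-1)} Q$ can be illustrated as follows. The sketch takes raw attention scores as given, since their computation is not reproduced here, and it assumes that the softmax normalizes each column of $Q$ so that every output time step is a convex combination of the input time steps. The function names are hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def apply_temporal_attention(x: np.ndarray, raw_scores: np.ndarray):
    """x: (N, C, T) layer input, raw_scores: (T, T) unnormalized attention.

    Returns the re-weighted input x @ Q with shape (N, C, T) and the
    normalized attention matrix Q (columns sum to 1; an assumption).
    """
    q = softmax(raw_scores, axis=0)
    return x @ q, q                           # contracts over the time axis
```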

C. Spatial-Temporal Graph Convolution
Spatial-temporal graph convolution is a combination of spatial graph convolution and standard temporal convolution, which is utilized to extract both spatial and temporal features. The spatial features are extracted by aggregating information from neighbor nodes for each sleep brain network and the temporal features are captured by exploiting temporal dependencies from neighbor sleep stages.
1) Spatial Graph Convolution: We employ graph convolution based on spectral graph theory to extract spatial features in the spatial dimension. For each sleep stage to be identified, the adjacency matrices $A^{FC}$ and $A^{DC}$ are provided for graph convolution. In addition, we employ the Chebyshev expansion of the graph Laplacian to reduce computational complexity. Chebyshev graph convolution [33] using polynomials of order up to $K - 1$ is defined as:

$$g_\theta *_G x = g_\theta(L)\, x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\, x,$$

where $g_\theta$ denotes the convolution kernel, $*_G$ denotes the graph convolution operation, $\theta \in \mathbb{R}^{K}$ is a vector of Chebyshev coefficients, and $x$ is the input data. $L = D - A$ is the Laplacian matrix, where $D \in \mathbb{R}^{N \times N}$ is the degree matrix. $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$, where $\lambda_{max}$ is the maximum eigenvalue of the Laplacian matrix and $I_N$ is an identity matrix. $T_k(x)$ are the Chebyshev polynomials, defined recursively by $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$.
Through this approximate Chebyshev polynomial expansion, the information of the 0 to $K - 1$ order neighbors centered at each node is extracted.
We generalize the above definition to nodes with multiple neural network channels. The $l$-th layer's input is $\hat{X}^{(l-1)} = (\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_{T_{l-1}}) \in \mathbb{R}^{N \times C_{l-1} \times T_{l-1}}$, where $C_{l-1}$ represents the number of neural network channels of each node and $T_{l-1}$ denotes the $l$-th layer's temporal dimension. For each $\hat{X}_i$, we obtain $g_\theta *_G \hat{X}_i$ by using $C_l$ filters on $\hat{X}_i$, where $\Theta = (\Theta_1, \Theta_2, \ldots, \Theta_{C_l}) \in \mathbb{R}^{K \times C_{l-1} \times C_l}$ is the convolution kernel parameter [33]. Hence, the information of the 0 to $K - 1$ order neighbors is aggregated to each node.
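The Chebyshev graph convolution above can be sketched for a single time step in NumPy. The sketch assumes a symmetric adjacency matrix with at least one edge (so that `eigvalsh` applies and $\lambda_{max} > 0$) and treats the kernel `theta` as given; the function names are hypothetical and no attention re-weighting is included.

```python
import numpy as np


def scaled_laplacian(adj: np.ndarray) -> np.ndarray:
    """L~ = 2 L / lambda_max - I for a symmetric adjacency matrix."""
    lap = np.diag(adj.sum(axis=1)) - adj            # L = D - A
    lam_max = np.linalg.eigvalsh(lap).max()
    return 2.0 * lap / lam_max - np.eye(adj.shape[0])


def cheb_polynomials(l_tilde: np.ndarray, k: int) -> list:
    """T_0(L~), ..., T_{k-1}(L~) via T_k = 2 L~ T_{k-1} - T_{k-2}."""
    polys = [np.eye(l_tilde.shape[0]), l_tilde]
    for _ in range(2, k):
        polys.append(2.0 * l_tilde @ polys[-1] - polys[-2])
    return polys[:k]


def cheb_graph_conv(x: np.ndarray, l_tilde: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """x: (N, C_in), theta: (K, C_in, C_out) -> (N, C_out)."""
    polys = cheb_polynomials(l_tilde, theta.shape[0])
    # Sum over polynomial orders: T_k(L~) mixes nodes, theta[k] mixes channels.
    return sum(t_k @ x @ theta[k] for k, t_k in enumerate(polys))
```

Applying `cheb_graph_conv` to each time slice of the layer input gives the spatial part of one spatial-temporal block. For a learned, non-symmetric adjacency, the eigenvalue step would need `np.linalg.eigvals` or a symmetrization first.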
2) Temporal Convolution: To capture the sleep transition rules, which are utilized by sleep experts to classify the current sleep stage in combination with neighboring sleep stages, we employ CNN to perform the convolution operation in the temporal dimension. Specifically, after the graph convolution operation has sufficiently extracted the spatial features from each sleep brain network, we implement a standard 2D convolution layer to extract the temporal context information of the current sleep stage. The temporal convolution operation on the $l$-th layer is defined as:

$$X^{(l)} = \mathrm{ReLU}\left(\Phi * \left(\mathrm{ReLU}\left(g_\theta *_G \hat{X}^{(l-1)}\right)\right)\right) \in \mathbb{R}^{N \times C_l \times T_l},$$

where ReLU is the activation function, $\Phi$ denotes the convolution kernel's parameters, and $*$ denotes the standard convolution operation.
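The temporal convolution can be illustrated with a small NumPy routine that slides a kernel along the time axis of every node independently, which corresponds to a 2D convolution whose kernel has size 1 along the node axis. The "same" zero padding, the odd kernel length, and the function name are assumptions of this sketch; a deep learning framework's convolution layer would normally be used instead.

```python
import numpy as np


def temporal_conv(x: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """x: (N, C_in, T) graph-convolved features,
    phi: (C_out, C_in, Kt) kernel with odd Kt -> (N, C_out, T).

    Cross-correlation along time with zero padding, followed by ReLU.
    """
    n, _, t = x.shape
    c_out, _, kt = phi.shape
    pad = kt // 2
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros((n, c_out, t))
    for j in range(kt):
        # Contribution of kernel tap j: mix input channels, shift in time.
        out += np.einsum("oc,nct->not", phi[:, :, j], x_pad[:, :, j:j + t])
    return np.maximum(out, 0.0)
```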
After the multi-view ST-GCN extracts a large number of features, we employ the concatenate operation to perform feature fusion on $X^{FC}$ and $X^{DC}$, written $X^{FC} \,\|\, X^{DC}$, where $X^{FC}$ and $X^{DC}$ represent the features extracted from the functional connectivity-based view and the spatial distance-based view, respectively, and $\|$ is the concatenate operation.

D. Domain Generalization
In order to reduce the influence of individual differences, we exploit an adversarial domain generalization method to enhance the robustness of our model. Figure 5 presents the intuitive idea of adversarial domain generalization. Specifically, this method aims to make it impossible to distinguish which source domain the sample data originated from during model training. At the same time, it aims to improve the sleep stage classification performance as much as possible. This means that all subjects' common features (subject-invariant features) related to sleep stage classification are extracted. For example, the model cannot tell that the samples of Domain 1 belong to Domain 1, but it can still accurately identify the sleep stages. This shows that the model did not learn the personalized features (F-1) belonging to Domain 1, but rather some common features related to sleep stage classification. In fact, previous studies have shown the advantages of adversarial domain generalization [34], and theoretically this method aligns the marginal distributions of different domains. Specifically, domain generalization includes three parts: the feature extractor $G_f$, the domain classifier $G_d$, and the label predictor $G_y$. The feature extractor $G_f$ maps the input data to a domain-invariant feature space,

$$X' = G_f(X; \theta_f),$$

where $X$ is the input feature matrix, $\theta_f$ is the trainable parameter, and $X'$ is the transferred feature matrix. The transferred features are put into the label predictor $G_y$ and the domain classifier $G_d$ with the softmax function:

$$\hat{y}_i = \mathrm{softmax}\left(G_y(X'_i; \theta_y)\right), \qquad \hat{d}_i = \mathrm{softmax}\left(G_d(X'_i; \theta_d)\right),$$

where $X'_i$ denotes the transferred features of sample $i$, and $\hat{y}_i$ and $\hat{d}_i$ are the predicted results of $G_y$ and $G_d$, respectively. Both $G_y$ and $G_d$ are multi-class classifiers, so we employ the cross entropy as the loss function:

$$L_y = -\frac{1}{L} \sum_{i=1}^{L} \sum_{r=1}^{R_y} y_{i,r} \log \hat{y}_{i,r}, \qquad L_d = -\frac{1}{L} \sum_{i=1}^{L} \sum_{r=1}^{R_d} d_{i,r} \log \hat{d}_{i,r},$$

where $L_y$ is the cross entropy loss function of the multi-class sleep stage classification task and $L_d$ is the corresponding loss of the domain classification task, $L$ denotes the number of samples, and $R_y$ and $R_d$ denote the number of classes and the number of domains, respectively. $y$ is the true label and $\hat{y}$ is the value predicted by the model. $d$ is the true domain and $\hat{d}$ is the value predicted by the model. Besides, a special layer called the Gradient Reversal Layer (GRL) is implemented between the feature extractor $G_f$ and the domain classifier $G_d$ to form an adversarial relationship [35]. Compared with other methods that usually require training the classifier and the discriminator in separate steps, GRL can integrate feature learning and domain generalization in a unified framework trained with standard backpropagation. In the optimization process, $\theta_d$ and $\theta_y$ are the parameters that minimize the losses of $G_d$ and $G_y$, respectively, while $\theta_f$, the parameters of $G_f$, minimize the loss of $G_y$ and maximize the loss of $G_d$ at the same time. The aims of the feature extractor $G_f$ and the domain classifier $G_d$ are exactly opposite: the feature extractor $G_f$ aims to prevent the domain classifier $G_d$ from identifying the right domain, whereas the domain classifier $G_d$ aims to correctly classify the domain that the data comes from.
The whole loss function of the domain generalization combines the label prediction loss $L_y$ and the domain classification loss $L_d$. By optimizing this loss function, the feature extractor $G_f$ can achieve the goal of finding the domain-invariant feature space.
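The effect of the gradient reversal layer can be shown with a hand-written backward pass on a toy linear model. Everything in this sketch is an assumption made for illustration: the linear feature extractor `w_f`, the linear domain classifier `w_d`, the reversal strength `lam`, and the function names. It is not the paper's TensorFlow implementation.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam going backward."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def domain_loss(x, w_f, w_d, d_onehot) -> float:
    """Mean cross entropy of the domain classifier on features x @ w_f."""
    probs = softmax((x @ w_f) @ w_d)
    return float(-(d_onehot * np.log(probs)).sum() / len(x))


def domain_gradients(x, w_f, w_d, d_onehot, grl):
    """Gradients of the domain loss as seen by w_f (reversed) and w_d."""
    feats = x @ w_f
    probs = softmax(grl.forward(feats) @ w_d)
    d_logits = (probs - d_onehot) / len(x)        # d(loss) / d(logits)
    grad_w_d = feats.T @ d_logits                 # ordinary gradient
    grad_feats = grl.backward(d_logits @ w_d.T)   # sign flipped by the GRL
    grad_w_f = x.T @ grad_feats
    return grad_w_f, grad_w_d
```

A plain gradient-descent step on `w_d` lowers the domain loss, while the same kind of step on `w_f` raises it, which is the adversarial relationship between the domain classifier and the feature extractor. In TensorFlow, the same behavior is usually obtained with a custom gradient rather than a manual backward pass.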
We compare our MSTGCN with 7 baselines, which are described in detail in Supplementary Material S.3. For a fair comparison, we employ the same experimental settings for all models. Specifically, we employ 10-fold cross-validation and 31-fold cross-validation to evaluate the performance of all models on the ISRUC-S3 dataset and the MASS-SS3 dataset, respectively. In addition, we adopt the subject-independent strategy for cross-validation, so that no subject contributes data to both the training and the test set of a fold. We implement the proposed model using TensorFlow. The code is released on GitHub (footnote 1).
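A subject-independent split can be sketched as follows: folds are formed over subjects rather than over epochs, so every epoch of a held-out subject lands in the test set. The function name, the random shuffling, and the seed parameter are assumptions of this illustration and are not taken from the released code.

```python
import numpy as np


def subject_independent_folds(subject_ids: np.ndarray, n_folds: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs where folds never share a subject.

    subject_ids: (num_epochs,) array giving the subject of each epoch.
    """
    subjects = np.unique(subject_ids)
    if n_folds > len(subjects):
        raise ValueError("n_folds cannot exceed the number of subjects")
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    for held_out in np.array_split(subjects, n_folds):
        test_mask = np.isin(subject_ids, held_out)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

Setting `n_folds` equal to the number of subjects gives leave-one-subject-out evaluation.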

B. Comparison with the State-of-the-Art Methods
We compare the proposed model with the other baseline models for sleep stage classification on ISRUC-S3 and MASS-SS3, as presented in Table I and Table II. The results show that our proposed model outperforms the baseline methods on multiple overall metrics (overall Accuracy, F1-score, and Kappa) for ISRUC-S3 and MASS-SS3. Specifically, the traditional machine learning methods (SVM and RF) cannot learn the complex spatial or temporal features well. In contrast, existing deep learning models such as CNN and RNN [7], [11]-[13] can directly extract the spatial or temporal features. Therefore, their performance is better than that of the traditional machine learning methods.
Although CNN and RNN achieve high accuracy, their limitation is that the model's input must be grid data, which ignores the connections among brain regions. Because brain regions lie in non-Euclidean space, a graph is the most appropriate data structure to represent the connections. Therefore, the proposed model and ST-GCN often achieve the best or second-best overall results, especially on the MASS-SS3 dataset. In addition, the proposed model extracts both spatial and temporal features based on multi-view brain graphs and integrates domain generalization to learn subject-invariant features. Hence, the proposed model achieves state-of-the-art performance.
For different sleep stages, MSTGCN can accurately identify most of the corresponding stages. Specifically, in the ISRUC-S3 dataset, the classification accuracy of the Wake and N3 stages is the highest. In the MASS-SS3 dataset, the classification accuracy of the REM and N2 stages is the highest. However, as with the other baseline models, the classification performance for the N1 stage does not meet expectations on the two datasets. This may be because the N1 stage is a transitional period between the Wake stage and the N2 stage, and the number of N1 samples is relatively small. Therefore, as Figure S.2 in Supplementary Material shows, the N1 stage is often misclassified as other sleep stages, such as the Wake stage and the N2 stage. Nevertheless, the classification performance of MSTGCN for the N1 stage is still higher than that of most baseline models. Table II shows that MSTGCN has the highest F1-score for the N1 stage on the MASS-SS3 dataset, which is 4% higher than the second-best result.
1 https://github.com/ziyujia/MSTGCN

1) Ablation Experiment:
To validate the effect of each module in our model, we design several variant models. First, we use the spatial graph convolution with the spatial distance brain graph as the base model and gradually stack the remaining modules to form a whole branch. Then, we add another whole ST-GCN branch with the functional connectivity brain graph to form a multi-view ST-GCN. Finally, we integrate the domain generalization method to form the proposed model. The specific process is described as follows:
• variant a (Spatial Graph Convolution (Base Model)): We utilize a spatial graph convolution network with the spatial distance brain graph as the base model.
TABLE I: THE PERFORMANCE COMPARISON OF THE STATE-OF-THE-ART APPROACHES ON THE ISRUC-S3 DATASET (columns: Method, Overall results, F1-score for each class)
... variant d, and variant e. Specifically, the attention mechanism helps to capture valuable spatial-temporal features and thereby improves the classification performance of our model. The designed multi-view brain graphs provide complementary information for sleep stage classification. In addition, domain generalization is integrated into the multi-view ST-GCN to extract subject-invariant features, which helps to improve the generalization of the model. In summary, the ablation experiment demonstrates the effectiveness of each module in our model.
2) Adaptive Functional Connectivity Graph: To further investigate the effectiveness of the adaptive functional connectivity graph learning, we design five fixed functional connectivity graphs to compare with it. These graphs are defined as different adjacency matrices. The last three graphs are constructed by functional connectivity methods commonly found in neuroscience.
• Fully Connected Adjacency Matrix: A matrix whose elements are all 1. It represents a graph in which every pair of nodes is connected and each node also has a self-connection.
• K-Nearest Neighbor (KNN) Adjacency Matrix [38]: A matrix that represents a k-nearest neighbor graph. That is, each node has k neighbor nodes.
• Pearson Correlation Coefficient (PCC) Adjacency Matrix [31]: A matrix generated by the Pearson correlation coefficient between each pair of nodes.
Figure 7 illustrates that the adaptive (learned) adjacency matrix achieves the highest accuracy for sleep stage classification. In addition, the adjacency matrices that incorporate prior neuroscience knowledge, such as the PCC, PLV, and MI adjacency matrices, achieve the second-best results. The fully connected adjacency matrix does not work well because the brain network is not a fully connected graph. In general, the adjacency matrix can significantly affect the classification performance. The proposed adaptive functional connectivity graph is superior to the fixed functional connectivity graphs for the classification task.
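The three fixed adjacency matrices defined in the list above (fully connected, KNN, and PCC) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the construction used in the paper: the function names are hypothetical, the KNN graph is assumed to use Euclidean distance between per-channel feature vectors and is left directed (not symmetrized), and the PCC matrix is used without thresholding.

```python
import numpy as np


def fully_connected_adjacency(n_nodes):
    """All-ones matrix: every pair of nodes is linked, including self-loops."""
    return np.ones((n_nodes, n_nodes))


def knn_adjacency(features, k):
    """Binary k-nearest-neighbor graph.

    features: array of shape (n_nodes, n_features).
    Assumption: neighbors are chosen by Euclidean distance; row i marks
    the k nearest neighbors of node i, so the matrix may be asymmetric.
    """
    n_nodes = features.shape[0]
    diff = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n_nodes, n_nodes))
    rows = np.repeat(np.arange(n_nodes), k)
    adj[rows, neighbors.ravel()] = 1.0
    return adj


def pcc_adjacency(signals):
    """Pearson correlation between every pair of channels.

    signals: array of shape (n_nodes, n_samples), one row per channel.
    """
    return np.corrcoef(signals)


# Synthetic stand-in for one epoch of 10-channel data (not real EEG).
rng = np.random.default_rng(0)
signals = rng.standard_normal((10, 3000))

a_full = fully_connected_adjacency(10)
a_knn = knn_adjacency(signals, k=3)
a_pcc = pcc_adjacency(signals)
```

All three matrices are fixed once computed, which is the property that distinguishes them from the adaptive graph, whose entries are learned jointly with the classifier.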
To present the interpretability of the adaptive functional connectivity graph, we visualize the brain adjacency matrices obtained by adaptive learning for different sleep stages. These matrices reflect the brain functional connectivity in different sleep stages, as illustrated in Figure 8. Specifically, there is more functional connectivity in the Wake and N1 stages, whereas the N3 stage has the least. These findings are consistent with existing neuroscience research [40], [41]. The N3 stage is a typical deep sleep period, during which the brain is usually inactive. In contrast, the N1 stage is a light sleep period, during which the brain is relatively active. Therefore, the functional connectivity of the brain in the N1 stage is relatively complicated.
3) Attention Mechanism: To explore the interpretability of the attention mechanism, we first visualize the learned weights of the temporal attention mechanism, which indicate the importance of different sleep epochs for classification. The higher the weight, the higher the degree of attention. Figure 9 illustrates that the weight of the current sleep epoch T is the largest, while the previous and following sleep epochs receive similar but lower attention. That is, the current epoch receives the most attention, which is consistent with the AASM sleep standard [1]. In fact, sleep experts mainly judge the current sleep stage based on the characteristics of the current epoch and refer to the adjacent epochs as appropriate. Therefore, the temporal attention mechanism has learned expert knowledge to a certain extent.
In addition, we also visualize the learned weight of spatial attention mechanism for EEG channels. Figure 10

VI. CONCLUSION
In this paper, we propose a novel deep graph neural network, MSTGCN, for sleep stage classification. MSTGCN models the dynamics of sleep data along both the spatial and temporal dimensions and accounts for the differences among subjects. Specifically, we design different brain views based on the functional connectivity and physical distance proximity of the brain. The complementarity of the different views provides rich spatial topology information. We develop a spatial-temporal graph convolution with an attention mechanism to capture the most relevant spatial-temporal features for sleep stage classification. Moreover, to extract subject-invariant sleep features, we integrate domain generalization and spatial-temporal graph convolutional networks into a unified framework. Experiments on two public sleep datasets demonstrate that MSTGCN achieves state-of-the-art performance. Finally, the proposed approach provides a general framework for multivariate physiological time series.

S.3 Baseline Methods
We compare the proposed model with the following baselines:
• Ref [2]: Support vector machine is a typical machine learning method for classification tasks.
• Ref [3]: Random forests are an ensemble learning method for classification tasks.
• Ref [12]: A mixed model that combines CNN and BiLSTM to capture time-invariant features and transition rules among sleep stages.
• Ref [7]: A temporal sleep stage classification model that uses multivariate and multimodal time series.
• Ref [13]: SeqSleepNet recasts single sleep stage classification as a sequence-to-sequence classification problem by using an attention-based bidirectional RNN (ARNN) and an RNN.
• Ref [15]: A spatial-temporal graph neural network with an adaptive sleep graph learning mechanism that achieves state-of-the-art performance.

S.6 Effect of Different Network Configurations
To further investigate how hyper-parameter settings affect the model performance, we conduct experiments with different network configurations. All experiments use the same hyper-parameter settings except for the factor under study. As shown in Figure S.5, the proposed MSTGCN is not sensitive to the hyper-parameter settings. Fewer ST-GCN layers and more convolution kernels improve performance slightly. Appropriate values of the Chebyshev polynomial order K and the regularization parameter λ are also conducive to improving the model performance.
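To make the role of the Chebyshev polynomial parameter K concrete, the following sketch implements a standard Chebyshev graph convolution in NumPy. It is a generic illustration, not the layer used in MSTGCN: it assumes a symmetric adjacency matrix, assumes that K counts the number of polynomial terms (orders 0 to K-1), and uses hypothetical function names.

```python
import numpy as np


def scaled_laplacian(adj):
    """Rescale the normalized graph Laplacian so its spectrum lies in [-1, 1].

    Assumption: adj is a symmetric, non-negative adjacency matrix.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    degree = adj.sum(axis=1)
    inv_sqrt = np.zeros(n)
    nonzero = degree > 0
    inv_sqrt[nonzero] = degree[nonzero] ** -0.5
    # L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(n) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(lap).max()
    return 2.0 * lap / lam_max - np.eye(n)


def chebyshev_polynomials(l_tilde, K):
    """Return [T_0, ..., T_{K-1}] with T_k = 2 L~ T_{k-1} - T_{k-2}."""
    n = l_tilde.shape[0]
    polys = [np.eye(n)]
    if K > 1:
        polys.append(l_tilde.copy())
    for _ in range(2, K):
        polys.append(2.0 * l_tilde @ polys[-1] - polys[-2])
    return polys


def cheb_graph_conv(x, polys, theta):
    """Graph convolution: sum over k of T_k(L~) @ x @ theta[k].

    x: (n_nodes, f_in) node features.
    theta: (K, f_in, f_out) weights, one matrix per polynomial order.
    """
    return sum(t @ x @ theta[k] for k, t in enumerate(polys))


# Toy example: a 4-node ring graph with random features and weights.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
K = 3
polys = chebyshev_polynomials(scaled_laplacian(adj), K)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
theta = rng.standard_normal((K, 8, 16))
out = cheb_graph_conv(x, polys, theta)  # shape (4, 16)
```

Under this convention, a larger K lets each node aggregate information from neighbors up to K-1 hops away, at the cost of more parameters per layer.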