A Spatial-Temporal Transformer Architecture Using Multi-Channel Signals for Sleep Stage Classification

Sleep stage classification is a fundamental task in diagnosing and monitoring sleep diseases. Two challenges remain open: (1) since most methods rely on input from a single channel, the spatial-temporal relationship of sleep signals has not been fully explored; (2) the scarcity of sleep data makes models hard to train from scratch. Here, we propose a vision Transformer-based architecture to process multi-channel polysomnogram (PSG) signals. The method is an end-to-end framework consisting of a spatial encoder, a temporal encoder, and an MLP head classifier. The spatial encoder, a pre-trained Vision Transformer, captures spatial information from multiple PSG channels. The temporal encoder uses the self-attention mechanism to model transitions between nearby epochs. In addition, we introduce a tailored image generation method that extracts features from multiple channels and reshapes them for transfer learning. We validate our method on 3 datasets and outperform state-of-the-art algorithms. Our method fully explores the spatial-temporal relationship among different brain regions and addresses the problem of data insufficiency in clinical environments. Benefiting from the reformulation of the problem as image classification, the method could be applied to other 1D-signal problems in the future.

Sleep is essential for human health, supporting cognition, memory, and other functions [1]. To address sleep problems effectively, it is necessary to monitor sleep quality through sleep staging. Sleep signals collected in polysomnograms (PSG) consist of multi-channel electroencephalogram (EEG), electrooculogram (EOG), electromyogram (EMG), and electrocardiogram (ECG) recordings. By splitting the signal into a sequence of 30-second epochs, experts classify the sleep stages epoch by epoch. Sleep stages can be broadly divided into 5 categories: Wakefulness (W), Rapid Eye Movement (REM), Non-REM1 (N1), Non-REM2 (N2), and Non-REM3 (N3) [2]. Experts mostly use features of EEG frequency content, such as frequency bands and amplitude, to classify the stages [3], [4]. Eye movement inferred from the EOG signal also provides useful information about sleep status, especially for the REM stage [5].
The manual classification of sleep stages by experts is tedious and exhausting, and its accuracy depends on the expertise of the scorer. The research community aims to reduce this intensive labour and increase the robustness of results. Researchers have applied machine learning methods including Support Vector Machines (SVM) and Random Forests (RF) [6], [7]. These methods rely heavily on feature engineering and struggle with complex spatial-temporal information.
With the advent of the deep learning era, more algorithms based on convolutional or recurrent neural networks (CNNs and RNNs) have been proposed to address the problem. Huy Phan et al. used deep bidirectional RNNs to process feature vectors learnt through filter banks [8]. Yuyang et al. used a similar model to process the time-frequency (TF) features extracted by fractional Fourier transform [9]. Supratak et al. proposed DeepSleepNet to extract time-invariant features and learn transition rules between stages [10]. Goshtasbi et al. used residual dilated causal convolutions to process the temporal information [11]. Dongdong et al. utilized multi-convolution blocks and pooling layers to process 90-second epochs [12]. To reduce computation overhead, they built LightSleepNet, which relied entirely on convolution blocks with fewer parameters [13].
Graph-based methods are well suited to processing multi-channel sleep signals. Ziyu Jia et al. proposed GraphSleepNet to adaptively learn the intrinsic connections among different EEG channels through an adjacency matrix [14]. They then designed multi-view spatial-temporal graph convolutional networks (MSTGCN), constructing two graphs that separately represent the functional connectivity and the physical distance proximity of different brain regions [15].
The attention mechanism has attracted many researchers in recent studies. Several works leveraged multi-resolution CNNs to extract spatial features, together with a temporal encoder using multi-head attention to capture the temporal dependencies within one epoch, for single-channel [16], [17] or multi-channel signals [18], [19]. Jing Huang et al. further improved the attention modules based on Squeeze-and-Excitation Networks to perform feature fusion [20]. While the Transformer, one of the most successful applications of the attention mechanism, has demonstrated its superiority in various tasks, its use in sleep staging remains rather limited. Huy Phan et al. proposed SleepTransformer to utilize the self-attention mechanism on both the epoch and sequence levels [21].
Many works have used transfer learning to address data variability and data insufficiency in sleep signals, since models well trained on one dataset can suffer a large accuracy drop when evaluated on a different one. In computer vision, pre-trained models are fine-tuned on the target domain to increase efficiency and accuracy. In recent years, similar approaches have appeared in sleep staging. Huy Phan et al. trained their model on the Montreal Archive of Sleep Studies (MASS) dataset with 200 subjects and transferred it to three small datasets to explore the impact of domain shift [22]. Jadhav et al. created TF RGB images of sleep signals through continuous wavelet transform and fed them into pre-trained convolutional neural networks [3]. ElMoaqet et al. used pre-trained GoogLeNet CNNs to process TF images generated from multi-channel signals [23].
However, three limitations remain. (1) Since most methods rely on input from a single channel, the spatial-temporal relationship among different brain regions has not been fully explored. (2) Although the attention mechanism has been widely investigated, researchers usually hybridize attention modules with convolution or recurrent modules to construct an end-to-end framework [16], [17], [18], [20]. Since these modules operate differently and interpret the incoming information in their own fashion, mixing mechanisms cannot exploit performance to the fullest extent. (3) Current transfer learning strategies are not optimized in terms of the input and the pre-trained backbones. Public PSG datasets are small compared with public image datasets containing millions of samples, so pre-training on them is unlikely to provide enough information for domain learning [22]. Besides, while Transformer-based vision models have become state-of-the-art choices for various image tasks, recent studies processing TF images generated from sleep signals still relied on CNN backbones [3], [23]. Note that simply feeding previously generated TF images to vision Transformer backbones is suboptimal, since these backbones work ideally only under certain input structures (see details in section II-B); the TF image generation method should therefore be tailored to meet this requirement.
The Transformer-based method is an ideal candidate to address the limitations mentioned above, having shown great potential in the computer vision field [24], [25]. Research on patch encoding and shifted windows reveals its ability to comprehend spatial information. For the first limitation, the ability of the vision Transformer to encode spatial information could contribute to understanding multi-channel signals. Longformer [26], a Transformer architecture for processing long sequences, could help learn the transition rules across several epochs. Similar work has demonstrated that combining the two can process 3D input, as in video recognition [27]. Combining these blocks, which build entirely upon the attention mechanism, provides a solution to the second limitation. For the third limitation, transfer learning using pre-trained language models will not work here, since PSG signals exhibit no grammatical or semantic features. However, pre-trained vision Transformer models can be adapted to classify feature images generated from 1D sleep signals, which share similar pixel-wise and patch-wise features. PSG signals can also be seen as a sequence of epochs [28]. By upscaling the one-dimensional signal from a single epoch to a multi-epoch sequence centered on the current epoch, the PSG signal can be processed like a video, outputting labels in a many-to-one manner [29]. This approach is closer to the pattern of sleep staging performed by human experts.
In this paper, we reformulate sleep stage classification on 1D signals as a 2D image classification problem. We construct a novel image-level feature generation method to transform multi-channel PSG signals into fixed-size images for pre-trained image models. We propose the Visual Spatial-Temporal Transformer Network (VSTTN), which relies entirely on the attention mechanism to learn spatial dependencies between different channels and temporal transition rules across several epochs at the image level. VSTTN consists of a 2D feature encoder, a temporal attention-based encoder, and an MLP head for classification. Specifically, (1) generated images are directly fed into pre-trained vision Transformer models to obtain a global feature representation;
(2) Longformer works as our temporal attention-based encoder to explore relationships between nearby epochs. We evaluate our model on 3 public datasets: ISRUC-S3 [30], PSG-35 [31], and ISRUC-S1 [30]. All experimental results demonstrate that our model achieves state-of-the-art performance. Our contribution is threefold:
• To the best of our knowledge, this is the first method to process spatial-temporal information on multi-channel PSG data purely based on the Transformer.
• The benefit of reformulating the problem as image classification is demonstrated through the transfer learning strategy based on pre-trained Vision Transformer models.
• We visualize and investigate how the attention mechanism weighs the contribution of different PSG channels.

II. METHOD

A. Overview
Sleep is associated with various physiological behaviours, and considering as many eye and brain activities as possible helps improve sleep staging accuracy [5]. Thus, integrating multi-channel features allows the system to better understand the spatial relationship among different brain regions.
Besides, sleep involves slow stage transitions across time, and sleep stages have strong dependencies between successive epochs. Experts consider the previous and following epochs when classifying the current one, motivating the wide application of the many-to-one scheme in machine learning-based sleep staging studies [28], [32].
Here, we first introduce a novel image generation method to transform 1D PSG signals into 2D images. The image generator extracts TF features from multi-channel signals. We then reshape the features into a 2D image whose size is ideal for transfer learning from large image datasets. Details of the image generation process are described in section II-B.
Next, we put forward a novel spatial-temporal Transformer architecture for sleep stage classification. The architecture consists of 3 parts: a spatial encoder, a temporal encoder, and a classification head, as shown in Fig. 3. The spatial encoder first extracts features from the input images. The temporal encoder then combines the spatial representations from multiple epochs. Since the temporal encoder originally takes a token sequence as input, several feature vectors are first grouped into a sequence and embedded to the standard size to act as 1D tokens in the Transformer. A special classification token ([CLS]) is added to the start of the token sequence before entering the temporal encoder. As the output of the temporal encoder, this classification token then goes into the MLP head to obtain the final prediction. Details of the network architecture are described in section II-C.
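A minimal PyTorch sketch of this pipeline is shown below; it is illustrative rather than our full implementation, and the module names (`spatial_encoder`, `temporal_encoder`) and internal sizes are simplifying assumptions:

```python
import torch
import torch.nn as nn

class VSTTNSketch(nn.Module):
    """Sketch of the three-part pipeline: spatial encoder -> temporal
    encoder over a [CLS]-prefixed epoch sequence -> MLP head."""
    def __init__(self, spatial_encoder, temporal_encoder, feat_dim, n_classes=5):
        super().__init__()
        self.spatial_encoder = spatial_encoder    # any 2D backbone: (B*T,3,H,W)->(B*T,D)
        self.temporal_encoder = temporal_encoder  # attention encoder over the sequence
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.head = nn.Sequential(                # LN, then 2-layer MLP with GELU/dropout
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, 200), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(200, n_classes),
        )

    def forward(self, x):                         # x: (B, T, 3, H, W), T = 2n+1 epochs
        B, T = x.shape[:2]
        feats = self.spatial_encoder(x.flatten(0, 1)).view(B, T, -1)        # per-epoch features
        seq = torch.cat([self.cls_token.expand(B, -1, -1), feats], dim=1)   # prepend [CLS]
        return self.head(self.temporal_encoder(seq)[:, 0])                  # classify from [CLS]
```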

B. Image Generation
Unlike other algorithms that extract features of arbitrary size from raw signals, VSTTN expects inputs of a fixed size identical to those fed into large pre-trained image models. The image size is 3 × H × W, where H and W denote the height and width of the input images. We introduce a tailored TF image generation method that both satisfies the size requirement and reconstructs the spatial relationship across multiple channels. The pipeline is shown in Fig. 2.
Most vision Transformer models split images into fixed-size patches for encoding [24], [25], after which the attention mechanism explores the spatial relationship between different patches. Similarly, experts classify sleep stages based on features from time segments and multiple channels [2]. To match this computational design, we decompose each 30-second PSG epoch into shorter segments and extract TF features from each channel within each short segment to create the fixed-size patches. The patches are finally concatenated to form a full-size image.
Specifically, the raw signal sequence is defined as $S = (s_1, s_2, \cdots, s_{N_s}) \in \mathbb{R}^{N_s \times N_c \times T_s}$, where $N_s$ and $N_c$ denote the number of epochs and PSG channels, and $T_s$ denotes the time-series length of each epoch $s_i \in S$ $(i \in \{1, 2, \cdots, N_s\})$.
To capture more short-time information, we apply three windowing methods to each PSG epoch. Each method transforms one epoch into $N_f$ time frames, with the frame duration varying across the three methods. Each PSG channel in a frame is then multiplied by a Hanning window and transformed to the frequency domain by the Fast Fourier Transform (FFT). Given that the patch size in the vision Transformer is $N_p \times N_p$, we slice the frequency band of the transformed signal into $N_p \times N_p$ equally spaced sub-bands. We extract the differential entropy (DE) feature of each sub-band as

$$DE = \frac{1}{2}\log\left(2\pi e \cdot \frac{1}{N_{sub}}\sum_{i=1}^{N_{sub}} f_i^2\right),$$

where $N_{sub}$ and $f_i$ denote the number of points within each frequency sub-band and the frequency magnitude at point $i$. After these procedures, the sequence is redefined as $F = (f_1, f_2, \cdots, f_{N_s}) \in \mathbb{R}^{N_s \times 3 \times N_f \times N_c \times N_p^2}$. By repeating certain channels for emphasis and stacking patches, we reshape the last four dimensions to produce the final set of images $I = (i_1, i_2, \cdots, i_{N_s}) \in \mathbb{R}^{N_s \times 3 \times H \times W}$. The TF images are then normalized by min-max normalization at the image level and rescaled to the range 0-255.
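As a concrete illustration, the following NumPy sketch extracts per-frame DE features for one windowing method; the variance-based Gaussian DE estimate, the zero-padded FFT, and the equal sub-band split are simplifying assumptions of this sketch:

```python
import numpy as np

def de_features(epoch, n_frames=14, n_subbands=256):
    """Sketch of per-frame differential-entropy extraction.
    epoch: (n_channels, T) raw 30-second PSG epoch; one windowing method
    (overlapping frames and the exact sub-band split are simplified)."""
    n_ch, T = epoch.shape
    frame_len = T // n_frames
    feats = np.zeros((n_ch, n_frames, n_subbands))
    for f in range(n_frames):
        seg = epoch[:, f * frame_len:(f + 1) * frame_len]
        seg = seg * np.hanning(seg.shape[-1])               # Hanning window
        nfft = max(seg.shape[-1], 2 * n_subbands)           # enough FFT bins
        mag = np.abs(np.fft.rfft(seg, n=nfft, axis=-1))     # frequency magnitudes
        for b, band in enumerate(np.array_split(mag, n_subbands, axis=-1)):
            var = np.mean(band ** 2, axis=-1) + 1e-8        # energy per sub-band
            feats[:, f, b] = 0.5 * np.log(2 * np.pi * np.e * var)  # Gaussian DE
    return feats                                            # (n_ch, n_frames, n_subbands)
```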
C. Network Architecture

1) Spatial Feature Encoder: The feature encoder acts as an operator extracting 2D spatial features from the input image. Any network that operates on 2D input could serve as the encoder, whether Convolution-based or Transformer-based. We choose image models as the feature extractor because physiological signals like PSG have no grammatical or semantic features and are thus unlikely to be understood by language models such as BERT. By transforming the signals into images, however, the features can be learned by image models pre-trained on large image datasets that share similar image-level features. Through transfer learning, pre-trained image models help improve performance with a short training time.
Specifically, Transformer-based image models have demonstrated their capability on multiple vision tasks, including image classification and video recognition. Compared with the convolution approach, the multi-head self-attention mechanism is better suited to capturing global information. In our study, the $i$-th input image $i_i$ is first split into $k$ patches of the same size $N_p \times N_p$: $i_i = [p_1, p_2, \cdots, p_k]$, where $p_j$ denotes the $j$-th patch. The sequence of patches then goes through the following process:

$$z_0 = [p_{cls}; \, p_1 E; \, p_2 E; \cdots; \, p_k E] + E_{pos},$$
$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \ldots, L,$$
$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, \ldots, L,$$
$$y = \mathrm{LN}(z_L^0),$$

where MSA denotes multi-head self-attention; LN denotes LayerNorm; $p_{cls}$ is a learnable classification embedding; $E$ and $E_{pos}$ denote the patch and position embeddings, respectively; $z_\ell$ denotes the input of each layer; and the first token of the Transformer encoder output, $z_L^0$, serves as the image representation $y$.
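The patch tokenization step can be illustrated with a short sketch (the pre-trained backbone performs this split internally; the helper below is only for exposition):

```python
import torch

def split_into_patches(img, patch=16):
    """Split a (3, H, W) image into k flattened N_p x N_p patches,
    mirroring the tokenization step described above (sketch only)."""
    c, h, w = img.shape
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, H/p, W/p, p, p)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

tokens = split_into_patches(torch.randn(3, 224, 224))  # (196, 768): k = 14*14 patches
```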
2) Temporal Feature Encoder: Via the spatial feature encoder, the input images are transformed into feature vectors. To extract information from context, we concatenate the spatial feature vectors of $2n+1$ epochs, with $n$ previous and $n$ following epochs; the concatenated feature sequence represents the central epoch to be learned. We then pass the sequence through a multi-head attention encoder to obtain temporal correlations. Here, we apply the Longformer [26], which performs local and global attention at the same time, reducing the computational complexity over spatial-temporal features by replacing full self-attention with sparse sliding-window attention patterns. The attention scores are computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ refer to the query, key, and value, and $d_k$ denotes their dimension. Two sets of projections are used: $Q_s, K_s, V_s$ for sliding-window attention and $Q_g, K_g, V_g$ for global attention. As the layers go deeper, the window sizes grow.
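To make the sliding-window idea concrete, the sketch below emulates Longformer's attention pattern with an additive mask on a standard dense PyTorch encoder; the real Longformer uses sparse kernels for efficiency, and the dimensions here are illustrative:

```python
import torch
import torch.nn as nn

def sliding_window_mask(seq_len, window=4, global_idx=(0,)):
    """Additive mask approximating Longformer's pattern: each token attends
    within a local window, while listed positions (e.g. the [CLS] token at
    index 0) attend globally. Sketch only; our model uses window size 18."""
    i = torch.arange(seq_len)
    allowed = (i[:, None] - i[None, :]).abs() <= window // 2  # local band
    for g in global_idx:                                      # global rows/cols
        allowed[g, :] = True
        allowed[:, g] = True
    mask = torch.zeros(seq_len, seq_len)
    mask[~allowed] = float("-inf")
    return mask

# Dense stand-in for the temporal encoder (sizes are illustrative):
layer = nn.TransformerEncoderLayer(d_model=768, nhead=8,
                                   dim_feedforward=1024, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
seq = torch.randn(4, 1 + 5, 768)                          # [CLS] + 5 epoch features
out = encoder(seq, mask=sliding_window_mask(6))           # (4, 6, 768)
```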
3) MLP Head: The classification MLP head is similar to that of VTN [27] and provides the final predicted sleep stage for each epoch. Our MLP head is a two-layer fully connected network with GELU as the activation function and a dropout layer to prevent overfitting. Layer normalization is performed on the input before it enters the fully connected network. Cross-entropy loss is used as the loss function:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} t_i \log(p_i),$$

where $t_i$ denotes the truth label and $p_i$ denotes the Softmax probability for the $i$-th class.

III. EXPERIMENTS

A. Dataset
We evaluate the performance on 3 datasets: ISRUC-S3 [30], PSG-35 [31], and ISRUC-S1 [30]. Table I summarises the details of these datasets. The 3 datasets are small, medium, and large public datasets with recordings of 10 (9 males, 1 female), 35 (25 males, 10 females), and 100 (56 males, 44 females) volunteers, respectively. For all datasets, 6 EEG channels and 2 EOG channels are utilized to classify sleep stages. In the experiments, Accuracy (ACC), F1-score (F1), and Cohen's Kappa (Kappa) are used as evaluation metrics. Accuracy denotes the proportion of correctly predicted samples, the F1-score is the harmonic mean of precision and recall, and Kappa expresses the consistency of classification results [33]. With $N$ and $C$ denoting the total number of samples and sleep stage classes, the metrics are calculated as

$$ACC = \frac{\sum_{i=1}^{C} TP_i}{N}, \qquad F1 = \frac{1}{C}\sum_{i=1}^{C}\frac{2 \cdot Precision_i \cdot Recall_i}{Precision_i + Recall_i}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e},$$
$$Precision_i = \frac{TP_i}{TP_i + FP_i}, \qquad Recall_i = \frac{TP_i}{TP_i + FN_i},$$

where $p_o$ is the observed agreement (the accuracy), $p_e$ is the expected agreement by chance, and True Positives ($TP_i$), False Positives ($FP_i$), True Negatives ($TN_i$), and False Negatives ($FN_i$) measure the classification correctness of the $i$-th class.
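In practice these metrics can be computed with scikit-learn, as in the short sketch below (macro-averaged F1 is our assumption):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def evaluate(y_true, y_pred):
    """Metrics matching the definitions above (macro-averaged F1 assumed)."""
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred, average="macro"),
        "Kappa": cohen_kappa_score(y_true, y_pred),
    }
```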

B. Implementation
Since images in the ImageNet dataset are of size 3 × 224 × 224, we generate TF images with the same dimensions. We first construct a feature sequence of size 3 × 14 × 14 × 256, where each dimension represents a characteristic of the signal. We apply 3 different windowing methods to obtain multi-scale temporal features within one epoch. Each method decomposes the signal into 14 frames, with frame durations of 2 s, 6 s, and 10 s under the 3 methods, respectively. Prior studies showed that channels in the occipital lobe (O1, O2) provide less information for sleep staging than other channels [35], so we repeat the features of the other 6 channels to form 14 PSG channels capturing the spatial relationship between brain regions. We then apply the Fast Fourier Transform with Hanning windows to transform each frame into the frequency domain, split the frequency signal into 256 equally spaced segments, and calculate the differential entropy features of the segments. Lastly, we reshape the feature sequence to form the TF image. We train the model in an end-to-end manner. For the comparison and ablation experiments, we use models from the field of computer vision as spatial backbones to extract spatial features from the TF images. Each backbone is pre-trained on ImageNet; configuration and pre-training details can be found in the Hugging Face PyTorch Image Models (timm) library. We remove the last MLP classification head of these backbones, which is designed to classify ImageNet, making use only of the preceding feature extractor.
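The reshape from the 3 × 14 × 14 × 256 feature sequence to a 3 × 224 × 224 image, followed by loading a head-less pre-trained backbone from timm, can be sketched as follows (the backbone name is one example from the timm registry, and the random features are stand-ins):

```python
import numpy as np
import timm, torch

# Each 256-value DE vector fills a 16x16 patch; the 14x14 grid of patches
# tiles a 224x224 plane per windowing method (sketch of our reshape step).
feat = np.random.rand(3, 14, 14, 256)            # stand-in feature sequence
img = (feat.reshape(3, 14, 14, 16, 16)
           .transpose(0, 1, 3, 2, 4)             # interleave patch rows
           .reshape(3, 224, 224))
img = (img - img.min()) / (img.max() - img.min()) * 255  # image-level min-max to 0-255

# Pre-trained backbone with the ImageNet head removed; in timm,
# num_classes=0 returns pooled features instead of class logits.
backbone = timm.create_model("swin_base_patch4_window7_224",
                             pretrained=True, num_classes=0)
feats = backbone(torch.tensor(img, dtype=torch.float32).unsqueeze(0))  # (1, D)
```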
The Longformer and the MLP classification head are randomly initialised from a normal distribution with mean 0 and standard deviation 0.02. We use an effective attention window of size 18 for the Longformer. The embedding size of the Longformer matches the output size of the spatial backbone, while the size of the intermediate layer is 1024. The number of attention heads is 8. We apply attention dropout with a probability of 0.1 to prevent overfitting. The MLP classification head has 2 layers, with an intermediate layer of size 200. We further evaluate and justify these hyper-parameter settings in section III-D.
We implement the proposed model using PyTorch. All experiments are performed on 8 GeForce RTX 3090 graphics cards with 24 GB of memory per GPU. The parameters are updated using the Adam optimizer. For each validation fold, we train the model for 30 epochs with an initial learning rate of $10^{-5}$ and a weight decay of $10^{-8}$.
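A minimal training loop matching these settings might look as follows (`model`, `criterion`, and `train_loader` are assumed to be defined elsewhere):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-8)
for epoch in range(30):                          # 30 training epochs per fold
    for images, labels in train_loader:          # (B, T, 3, 224, 224), (B,)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # cross-entropy over 5 stages
        loss.backward()
        optimizer.step()
```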
C. Results

From the classification metrics presented in Table II, the confusion matrices shown in Fig. 4, and the visualization shown in Fig. 5, we reach the following conclusions:
• VSTTN outperforms other state-of-the-art methods on all datasets. Machine learning techniques such as SVM and RF struggle to learn complex spatial and temporal information through feature engineering; VSTTN achieves accuracy over 10% higher than these methods. Deep learning approaches such as CNN- or RNN-based methods obtain high accuracy by directly extracting features from the raw input, but they are constrained by ignoring the relationship between different brain regions; VSTTN achieves improvements of around 6% and 3.15% over them. Although GCN-based methods succeed in interpreting spatial information, creating graph representations for EEG signals is complex, and graph knowledge obtained from other areas cannot be transferred to sleep staging. VSTTN surpasses these methods by 2-4%.
• Specifically, we compare VSTTN with other attention-based methods. First, unlike AttenSleepNet and SleepTransformer, which consider only single-channel signals, our method incorporates more spatial information by leveraging multiple channels. Second, SleepTransformer has shown the benefit of relying solely on the attention mechanism over mixing convolution and attention modules [21], which is further validated by our results. Lastly, studies in computer vision have shown that the effectiveness of the attention mechanism is maximized with larger inputs [24], [25], so VSTTN benefits most from the pre-trained weights on ImageNet. These factors result in an improvement of 3-10% over other attention-based methods.
• From the confusion matrix, VSTTN's accuracy in classifying the N3 stage falls slightly behind MSTGCN and DeepSleepNet, while significantly outperforming these methods on the other 4 stages. We observe that VSTTN tends to mislabel N3 as N2, which is tolerable in clinical application. The large improvement in classifying the Wake and REM stages, however, demonstrates its superiority in capturing richer information. For example, the improvement in classifying REM may result from the two EOG channels, which provide eye movement information during the REM stage.
• We observe a good alignment between the ground truth and predicted hypnograms. Since our model exploits the many-to-one scheme to consider nearby epochs, we achieve higher accuracy when the sleep stage is stable than when it is transitioning. Errors mainly occur in classifying the N1 stage, which we attribute to the rapid transition from N1 to N2 after leaving the Wake stage. As the N1 stage only accounts for 5-10% of overnight sleep, the impact of this misclassification is relatively marginal.

D. Ablation Study
We conduct several ablation experiments to examine our design principles. Unless otherwise mentioned, parameters are the same as in sections III-B and III-C. Results can be found in Table III (ablation study on the PSG-35 dataset; Time denotes the average training time in seconds per epoch).
1) Windowing Method: According to the AASM, sleep experts pay attention to time fragments lasting 0.5-3 seconds within one epoch during manual sleep staging [2]. To examine the impact of the time duration of each frame, we compare the results of different windowing methods used in image generation. We equally decompose the 30-second signal into frames with durations ranging from 0.5 to 7 seconds. Examining longer durations is not worthwhile, since the feature waveforms usually last under 3 seconds; furthermore, the generated frames become similar when each frame nearly spans the whole epoch. Overlapping time windows are employed to guarantee that the same number of frames is generated in each experiment.
We observe a negligible impact on classification performance as the duration of the time segment grows. Although the feature waveforms used by sleep experts during manual staging have their own characteristic durations, the model appears able to learn them from the data as long as the whole epoch is presented. This result shows the robustness of our image generation method.
2) Spatial Backbone Variations: To prove the effectiveness of the attention mechanism, we investigate the impact of different 2D spatial backbones on classification performance. We compare the results of ResNet-50 (Res50), ResNet-101 (Res101), ResNet-152 (Res152) [36], Vision Transformer (ViT) [24], and Swin Transformer (Swin) [25] as spatial backbones, each using weights pre-trained on ImageNet. Swin Transformer is the best-performing backbone, reaching 89.24% accuracy, followed by the Vision Transformer with 86.1%. Transformer-based backbones outperform CNN-based backbones by 3-7%, demonstrating the advantage of the attention mechanism. We acknowledge that Transformer-based backbones require significantly longer training time due to their intrinsic complexity. However, since the expensive training process is done beforehand, the minimal time difference at the testing stage is negligible for real-time evaluation. Clinical applications also prioritize classification accuracy over time cost, as long as patients can tolerate the evaluation time.
3) Many-to-One Scheme: Next, we explore how the many-to-one scheme impacts classification performance. We define the temporal sequence length as the number of epochs considered when classifying one epoch. For example, a temporal sequence length of 1 means only the current epoch is used, while a length of 9 combines the 4 epochs before and the 4 epochs after it. We vary the temporal sequence length from 1 to 9, as longer lengths are too computationally heavy. We observe that accuracy increases with the temporal sequence length, although the steady increase in training time is also worth noting. The result indicates that the model understands the transition rules better when more nearby epochs are taken into consideration; however, the time cost should also be weighed against hardware limitations.
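The many-to-one sampling can be sketched as below; repeating the boundary epoch at recording edges is an assumption of this sketch, not a detail specified above:

```python
import numpy as np

def context_window(images, labels, n=4):
    """Many-to-one sampling sketch: each training item is the 2n+1 epochs
    centered on the target epoch (sequence length 9 when n=4), paired with
    the centre epoch's label. images: (N, 3, H, W) array; labels: (N,)."""
    N = len(images)
    items = []
    for c in range(N):
        idx = np.clip(np.arange(c - n, c + n + 1), 0, N - 1)  # clamp at edges
        items.append((images[idx], labels[c]))                # sequence -> centre label
    return items
```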

E. Discussions
Here we mainly discuss the impact of pre-training, the attention mechanism, and excluding channels.
1) Pre-Training the Spatial Backbone: We fully explore the benefit of pre-training in our study. We compare 4 experimental setups regarding transfer learning:
• Setup 1: training VSTTN directly from scratch.
• Setup 2: non-end-to-end training, where we freeze the pre-trained weights of the spatial backbone, only allowing weight updates in the other modules (a code sketch of this freezing step follows the conclusions below).
• Setup 3: training with weights pre-trained on ImageNet-22K.
• Setup 4: training with weights pre-trained on ISRUC-S1.
Note that for the 4th setup, we only transfer the weights learned on ISRUC-S1 to training on ISRUC-S3, since PSG-35 has a different experimental configuration from these 2 similar datasets. We show the results for 2 spatial backbones (ViT and Swin) in Table IV. We reach the following conclusions:
• VSTTN benefits from weights pre-trained on image datasets. For Swin-VSTTN, we observe improvements of 5.8%, 0.9%, and 1.6% on the 3 datasets. The improvement is most pronounced on ISRUC-S3, which contains far fewer epochs for the model to learn from scratch. For the other 2 datasets, even though the accuracy increases less prominently, the gain corresponds to the correct classification of more than 1000 additional epochs given the large size of the datasets. This result shows that prior knowledge learnt from image datasets, such as how to represent pixel-wise or patch-wise features, can be transferred to TF images generated from 1D sleep signals.
• End-to-end training is necessary for VSTTN. We observe an accuracy drop of around 20% when we freeze the weights of the spatial backbone. This result is expected, since the original weights were learnt for 22k different classes, which do not match the current 5-stage classification. It indicates that prior knowledge from ImageNet alone is not enough for sleep staging; further calibration of the pre-trained weights is needed to achieve the best result.
• Transfer learning with weights pre-trained on a large sleep dataset works for small datasets. For the ViT and Swin backbones, the accuracy on ISRUC-S3 after pre-training on ISRUC-S1 is 0.4% and 3% higher than training directly from scratch. Although this is 2-3% less accurate than pre-training on ImageNet, it confirms the transfer of domain knowledge from larger to smaller datasets.
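The freezing step of setup 2 amounts to disabling gradients for the spatial backbone, e.g. (the attribute name `spatial_encoder` is an assumption of this sketch):

```python
# Setup 2 (non-end-to-end): freeze the pre-trained spatial backbone so that
# only the temporal encoder and the MLP head receive gradient updates.
for p in model.spatial_encoder.parameters():
    p.requires_grad = False
```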
Here we briefly explore why models pre-trained on ImageNet achieve good results in sleep staging. During training, large image models, whether convolution-based or attention-based, generalize feature representations from 2D inputs. While updating their weights, these models develop components such as edge filters or attention windows to better capture the features [24], [25], [36]. Although our time-frequency images are generated from physiological signals, they still contain distinctive edges and value distributions that the image models can exploit.
2) Interpreting the Attention Mechanism: By visualizing the attention mechanism, we can further understand how our method learns to focus on different features for classification. After training Swin-VSTTN, we calculate the attention weights targeting the output layer of the spatial backbone. For each sleep stage, we average the weights across multiple subjects and epochs, obtaining an attention distribution over PSG channels and sleep stages. We first investigate the attention level received by each channel when classifying different stages. As shown in Fig. 6(a), O1 and O2 exhibit the highest attention in the Wake and N3 stages, while receiving the lowest attention in the remaining 3 stages. This suggests that information from O1 and O2 may play a crucial role in recognizing the Wake and N3 stages, potentially attributable to the characteristic theta wave oscillations and slow spindles over the occipital lobe in N3, as well as beta wave oscillations in the Wake stage [37], [38]. Furthermore, we observe that the attention paid to E1 and E2 is similar to that paid to F3, F4, C3, and C4 in most stages, including the peak during the N1 stage. This observation aligns with the concept of cortical-pupil coupling during sleep [39], indicating that EOG signals can provide valuable information to improve classification performance compared with single-modality methods.
We then show the attention distribution over channels and frames during automatic sleep staging. As shown in Fig. 6(b), our model assigns higher weights to specific time frames and channels to distinctively recognize different stages. We observe that the attention distribution for N1 inputs appears less pronounced; this difficulty in focusing on particular time frames or channels explains the algorithm's low classification accuracy for the N1 stage. The diffuse distribution may be attributed to the relatively low number of N1 stages in the input, as well as the integration of nearby non-N1 epochs when classifying the central N1 epoch. Furthermore, we find that the method tends to concentrate on the middle frames within each epoch. This behaviour may arise from the continuous nature of the input signal, where the signal at the epoch boundaries often partially resembles the adjacent epoch.
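The per-stage averaging behind these heatmaps can be sketched as follows; extracting the per-channel attention mass from the backbone is model-specific and omitted here, so the input format is an assumption:

```python
import numpy as np

def channel_attention_profile(attn, stages, n_stages=5):
    """Average attention weight per PSG channel for each sleep stage.
    attn: (N_epochs, n_channels) attention mass per channel (extraction
    from the trained backbone omitted); stages: (N_epochs,) labels.
    Returns an (n_stages, n_channels) profile as plotted in Fig. 6(a)."""
    return np.stack([attn[stages == s].mean(axis=0) for s in range(n_stages)])
```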
Fig. 6. Heatmaps of the attention distribution across channels and frames. We observe a selective focus on spatial-temporal features when classifying each stage, suggesting the effective recognition capability of VSTTN.

3) Excluding PSG Channels: Since VSTTN learns from signals in multiple PSG channels, we are interested in the impact of excluding some channels on classification performance. In each experiment, we exclude 1 or 2 PSG channels. From the results shown in Table V, we draw the following conclusions:
• We observe a minimal impact on performance when removing a single channel. This may indicate that our model captures the missing information from the contralateral channel to maintain stable performance. Interestingly, there is a comparatively larger drop in the accuracy of classifying the N1 stage when we remove the E1 and E2 channels. This result aligns with the high attention our model assigns to these channels when classifying the N1 stage.
• In general, the performance of our method degrades no matter which 2 channels are excluded, with an accuracy drop of 1-2%. Comparatively, the impact of excluding F3/F4 or C3/C4 is less prominent than that of excluding E1/E2 or O1/O2. This result shows that the different EEG and EOG channels all contribute to staging performance.
• The worst performance occurs when we exclude E1/E2, mainly caused by the misclassification of the N1 stage. This further validates the observation in section III-E.2 that E1 and E2 receive the most attention when classifying the N1 stage (Fig. 6); excluding these channels therefore has a significant impact on performance. Similar arguments can be made for the contribution of the O1 and O2 channels to classifying the N3 stage.
F. Limitations and Future Work

There are several limitations of and possible improvements to the proposed method. First, despite optimizations including pre-training, parallel computing, and optimized data structures, the computational cost remains high due to the large Transformer-based modules; the extended training time limits application to larger datasets. We therefore need to improve efficiency by streamlining and pruning the Transformer architecture. Second, we currently only investigate the EEG and EOG channels, disregarding other physiological signals such as EMG and ECG. Expanding the scope to incorporate additional physiological signals could validate the algorithm's generality and further improve its classification performance.

IV. CONCLUSION
Deep learning has made great strides in sleep stage classification. However, many models neglect the spatial-temporal information within PSG signals, and the limited size of sleep datasets makes it challenging to train complex networks from scratch. Our model focuses on solving these problems and improving classification performance. In this paper, we propose a novel Transformer-based network named VSTTN, trained in an end-to-end manner. We introduce a tailored image generation method to extract features from multi-channel PSG data. The attention mechanism is then utilized to learn the spatial relationships among PSG channels and the transition rules between epochs in the temporal domain. By reformulating the task as image classification, we utilize weights pre-trained on large image datasets to address data insufficiency. We validated our model on 3 public datasets of different sizes, achieving state-of-the-art performance. We further visualized and investigated the contribution of different channels to the classification. The proposed method has the potential to be applied to other 1D physiological signals.