Channel Contribution in Deep Learning Based Automatic Sleep Scoring—How Many Channels Do We Need?

Machine learning based sleep scoring methods aim to automate the process of annotating polysomnograms with sleep stages. Although sleep signals of multiple modalities and channels should contain more information according to sleep guidelines, most multi-channel multi-modal models in the literature showed only small performance improvements compared to single-channel EEG models and sometimes even failed to outperform them. In this paper, we investigate whether the high performance of single-channel EEG models can be attributed to specific model features in their deep learning architectures and to what extent multi-channel multi-modal models take the information from different channels of modalities into account. First, we transfer the model features from single-channel EEG models, such as combinations of small and large filters in CNNs, to multi-channel multi-modal models and measure their impact. Second, we employ two explainability methods, the layer-wise relevance propagation as a post-hoc method and the embedded channel attention network as an intrinsic interpretability method, to measure the contribution of different channels to predictive performance. We find that i) single-channel model features can improve the performance of multi-channel multi-modal models and ii) multi-channel multi-modal models focus on one important channel per modality and use the remaining channels to complement the information of the focused channels. Our results suggest that more advanced methods for aggregating channel information using complementary information from other channels may improve sleep scoring performance for multi-channel multi-modal models.

parts of the human body, e.g., electroencephalograms (EEGs), electrooculograms (EOGs) and electromyograms (EMGs). For annotation, PSGs of approximately 8-h sleep are segmented into 30-s epochs and annotated by sleep technicians following standardized guidelines. The Rechtschaffen and Kales standard (R&K manual) [1] and the American Academy of Sleep Medicine rules (AASM manual) [2] are the two most widely used guidelines, distinguishing between seven stages: Wake, Non-REM1 (N1), Non-REM2 (N2), Non-REM3 (N3), Rapid Eye Movement (REM), Movement and Unscored. Each sleep stage is characterized by distinctive time- and frequency-domain patterns. Table I provides a summary of these specific patterns as defined in the AASM manual [2].
Sleep scoring is traditionally performed manually by sleep technicians. To reduce the manual effort and time for annotation, automatic sleep scoring approaches have been developed. In general, automatic sleep scoring approaches can be categorized into traditional machine learning approaches and deep learning approaches. The former (e.g., [3], [4], [5]) relied on manually defined features and applied traditional machine learning models to classify sleep stages based on these features. The latter (e.g., [6], [7], [8]) captured temporal and sequential features from raw sleep signals or transformed frequency representations (e.g., spectrograms) automatically using end-to-end deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). In our work, we focus on deep learning approaches, as they are more generalizable when applied to highly heterogeneous data sets [9].
Most early deep learning approaches were based on single-channel EEG (e.g., [6], [7]), as it contains the most information [2]. Khalighi et al. [10] showed that incorporating multiple modalities and channels (i.e., EOGs and EMGs) could improve the performance. Yet, surprisingly, most current multi-channel multi-modal models obtained very little performance improvement compared to single-channel EEG models and sometimes even failed to outperform them (cf., Table II). Additionally, it was found in [11] that while adding EEG channels improved the performance, using more than 6 EEG channels did not improve it further. We propose the following hypotheses: i) some single-channel EEG models successfully added particular model features into their deep learning architectures, which improved the performance, such as combining small and large filters in temporal learning to capture time- and frequency-domain features, respectively [6], but the utility of these model features has not been tested in the multi-channel multi-modal setting; ii) although all modalities and channels contain information, the information of certain channels of modalities may be sufficient to obtain accurate predictions.
To verify the hypotheses proposed above, in this paper, we investigate in two directions: i) whether multi-channel multi-modal models can be improved by adding particular model features from high-performing single-channel EEG models and ii) which channels contribute to a high-performing multi-channel multi-modal model. Specifically, our contributions are:
1) We evaluate the impact of particular model features proposed for high-performing single-channel EEG models in the multi-channel multi-modal setting on a public benchmark data set, SleepEDF-13.
2) We incorporate the model features that improve the performance into a multi-channel multi-modal model and evaluate it on two public benchmark data sets, SleepEDF-13 (39 PSGs, small) and SHHS-1 (5,793 PSGs, large), obtaining state-of-the-art results.
3) We apply the layer-wise relevance propagation (LRP) [12], a model-agnostic post-hoc explainability method, to extract channel importance. We also adopt an embedded channel attention network (eCAN), motivated by [13] and [14], which intrinsically incorporates channel importance into deep sleep scoring models. We compare the results from both methods.
4) Based on the observations obtained from the interpretability experiments, we hypothesize that incorporating all channels is not necessary to obtain acceptable performance and verify it in a reverse ablation study.
The remainder of the paper is organized as follows. Section II presents the related work on deep learning based automatic sleep scoring and reviews the methods for extracting channel importance from deep learning models. Section III introduces data sets and data preprocessing. The experiment for evaluating single-channel model features in the multi-channel multi-modal setting and the accordingly improved multi-channel multi-modal model structure are described in Section IV. Afterwards, we present two interpretability methods to analyze channel contribution in Section V. Section VI provides the experiment setup for evaluating our multi-channel multi-modal model. All results are presented and discussed in Section VII. Finally, we conclude and outline the directions for future work in Section VIII.

II. RELATED WORK
In this section, we review deep learning based automatic sleep scoring approaches and the prior work on channel contribution analysis.

A. Automatic Sleep Scoring
We distinguish between single-channel EEG and multi-channel multi-modal deep sleep scoring models. Table II provides an overview of them in terms of data sets, used features, approaches and performance.
1) Single-Channel EEG Models: The most classic architecture is a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (e.g., [6], [7], [16], [17]). Usually, CNNs are used to extract temporal features from sleep epochs and RNNs are employed to capture transition information from sleep sequences. To further improve the scoring performance, particular model features were added to this base architecture. Supratak et al. [6] employed filters of small and large sizes in the first layer of CNNs to capture both time- and frequency-domain features. Sors et al. [15] used deep CNNs to extract complex patterns of sleep epochs, because feature complexity can be increased by deeper layers [27]. Mousavi et al. [7] applied attention mechanisms in RNNs to focus on the important parts of sleep sequences when considering context information. Moreover, to enable the model to consider temporal and sequential features evenly in the final classification of sleep stages, Supratak et al. [6] also built a residual connection that concatenated features from temporal encoding layers. Additionally, different model architectures were also proposed, such as using large-scale CNNs to capture transition information from neighbouring sleep epochs instead of using RNNs [15], [20] and learning sleep features not only from raw signals but also from their frequency representations [8], [18].

TABLE II. OVERVIEW OF THE STATE-OF-THE-ART DEEP LEARNING BASED AUTOMATIC SLEEP SCORING MODELS. WE DISTINGUISH BETWEEN SINGLE-CHANNEL EEG AND MULTI-CHANNEL MULTI-MODAL MODELS. THE 'FEATURES' COLUMN INDICATES WHETHER RAW SIGNALS OR TRANSFORMED FREQUENCY REPRESENTATIONS WERE USED IN CORRESPONDING PUBLICATIONS

2) Multi-Channel Multi-Modal Models: Most of the multi-channel multi-modal models also used classic CNN-RNN architectures but with some additional spatial learning modules added to incorporate the sleep features of multiple modalities and channels. Paisarnsrisomsuk et al. [21] employed large-scale CNNs to extract both temporal and sequential information from 2 EEGs and 1 EOG and found that adding EOG signals increased the accuracy by 1%. A similar result was observed by Phan et al. [22] who generated the spectrograms of sleep signals and trained multi-task CNNs to create joint predictions for the current and neighbouring sleep epochs. Chambon et al. [11] proposed a spatio-temporal CNN architecture and used linear spatial filters to increase the signal-to-noise ratio. Pathak et al. [24] also designed a spatial-temporal-sequential model to extract sleep features from multi-channel multi-modal data input and verified that EEG is the most important modality using post-hoc interpretability methods. In [25], Phan et al. created a multi-view sequential model via learning joint representations from both raw signals and time-frequency images. Recently, more advanced architectures were developed to particularly model correlations among modalities and among channels within a modality. For instance, Jia et al. [23] employed graph CNNs to capture intrinsic connections among EEG channels. In another paper [26], they designed a multi-modal attention module which helped detect the relevance between EEG and EOG signals.
In summary, although some studies showed that adding the information of multiple modalities and channels could improve the performance, the improvement was rather small (cf., Table II). In addition, there was also evidence that the improvement vanished after adding many EEG channels [11].
Hence, it is important to understand how multi-channel multi-modal models use the information from different channels of multiple modalities. To that end, we built a state-of-the-art multi-channel multi-modal model and analyzed channel contribution in detail. More specifically, we first tested the impact of particular model features developed for single-channel EEG models in the multi-channel multi-modal setting. Then, we designed an architecture aggregating the promising model features and investigated channel importance using interpretability methods. Since advanced architectures (i.e., [23], [26]) were designed for particular data and focused on modelling the relevance among modalities and channels instead of understanding channel importance, we stayed with classic CNN-RNN architectures in our study to analyze channel contribution.

B. Analyzing Channel Contribution
We propose to analyze channel contribution in a deep sleep scoring model by assessing the importance of channel information acknowledged by that model. To the best of our knowledge, no particular study has investigated channel importance in multi-channel multi-modal sleep scoring, although some literature (e.g., [6], [7]) showed that the scoring performance varied based on different channels.
In related domains, Bohle et al. [28] used the layer-wise relevance propagation (LRP) to assist clinicians in explaining the area importance of MRIs for diagnosing Alzheimer's disease.
In this work, we used two kinds of methods to assess channel importance in multi-channel multi-modal sleep scoring from different directions. We applied the LRP [12], where the importance score of a channel is derived from its relevance score for predictions. We also employed an embedded channel attention network (eCAN), motivated by [13] and [14], to intrinsically measure channel importance, i.e., the importance score of a channel is learned and allocated via an extra neural network.

III. DATA SETS AND PRE-PROCESSING
We based our experiments on two public benchmark data sets: the SleepEDF-13 data set containing 39 PSGs and the SHHS-1 data set containing 5,793 PSGs. Table III provides an overview of them.

B. SHHS-1
Sleep Heart Health Study (SHHS) [32] is a large data set used for sleep-disordered breathing research and consists of the data collected during two patient visits. Following prior work (e.g., [15], [24]), we used the subjects from the first visit (SHHS-1). Overall, 5,793 PSGs are collected from 5,793 subjects, where 2 EEGs (channels C3-A2 and C4-A1), 2 EOGs (left and right) and 1 EMG are recorded. The EEG and EMG signals are sampled at 125 Hz, while the EOG signals are sampled at 50 Hz. Similar to SleepEDF-13, the R&K manual [1] is used for annotating SHHS-1, also resulting in 8 sleep stages.

C. Data Pre-Processing
For both data sets, we merged stages N3 and N4 into stage N3 to comply with the AASM manual [2] and removed the Movement and Unscored epochs which are irrelevant for sleep scoring. In addition, following [6], we excluded long wake periods that are located 30 minutes before and after sleep periods for SleepEDF-13.
We preprocessed the signals in both data sets as follows. First, we resampled the signals with lower sampling rates to the highest sampling rate among all signals in that data set (i.e., resampling the EMG signals in SleepEDF-13 to 100 Hz and the EOG signals in SHHS-1 to 125 Hz) such that all modalities in a data set share an identical feature extraction mechanism in deep sleep scoring models. Second, following [11], [24], [33], we filtered the EEG and EOG signals of both data sets to 0.16-30 Hz and the EMG signals to 10-30 Hz and standardized the signals of every channel to mean 0 and standard deviation 1.
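The preprocessing steps above can be sketched for a single channel as follows. This is a minimal illustration, not the authors' pipeline: the Butterworth design, the filter order, the zero-phase filtering and the function name `preprocess_channel` are assumptions, since the text only specifies the target sampling rates and the pass bands.

```python
import numpy as np
from scipy.signal import butter, resample_poly, sosfiltfilt


def preprocess_channel(signal, fs, target_fs, band, order=4):
    """Resample, band-pass filter and standardize one channel.

    The Butterworth design, its order and the zero-phase filtering are
    assumptions of this sketch; the text only specifies the pass bands.
    """
    signal = np.asarray(signal, dtype=float)
    if fs != target_fs:
        # Rational resampling, e.g. 50 Hz -> 125 Hz uses up=5, down=2.
        g = np.gcd(int(target_fs), int(fs))
        signal = resample_poly(signal, int(target_fs) // g, int(fs) // g)
    sos = butter(order, band, btype="bandpass", fs=target_fs, output="sos")
    filtered = sosfiltfilt(sos, signal)
    # Standardize the channel to mean 0 and standard deviation 1.
    return (filtered - filtered.mean()) / filtered.std()


# Synthetic 30-s EOG segment at 50 Hz, resampled to the 125 Hz of SHHS-1.
rng = np.random.default_rng(0)
eog = rng.standard_normal(30 * 50)
eog_clean = preprocess_channel(eog, fs=50, target_fs=125, band=(0.16, 30.0))
```

For an EMG channel, the same hypothetical helper would be called with `band=(10.0, 30.0)`.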

IV. IMPROVING MULTI-CHANNEL MULTI-MODAL MODEL
In this section, we present the improved multi-channel multi-modal model that is based on the promising model features developed for single-channel EEG models.

A. Evaluating Single-Channel Model Features
To apply particular model features that have been successfully used by single-channel EEG models for performance improvement to multi-channel multi-modal models, we first tested their utility in the multi-channel multi-modal setting. Based on the review presented in Section II-A.1, we selected the four model features presented in Table IV as candidates. The assumptions for these model feature choices are as follows. Adding large filters to capture frequency-domain patterns enables the model to capture distinctive frequency features, e.g. the Delta waves in stage N3. Increasing feature complexity helps detect the sleep stages whose time-domain patterns are indistinguishable, e.g. stages N1 and REM. A focus on the important parts of sleep sequences improves the extraction of transition information and thus benefits the associated transition stages. Moreover, an even attention on temporal and sequential features avoids the loss of temporal information in sequential learning. We selected the model by Pathak et al. [24] as the baseline, because it was based on classic CNN-RNN architectures and obtained state-of-the-art performance. Additionally, we performed a reverse ablation study, i.e. adding  Table IV, iv): we measured the performance on the model obtained from the second step with and without the residual connection to verify its impact. We ran the experiment on SleepEDF-13 using all four accessible channels (cf., Section III-A) and evaluated all model variants under the nested cross validation scheme (cf., Section VI-C). To determine the significance of adding the model features from single-channel EEG models to multi-channel multi-modal models, we employed statistical hypothesis testing. Specifically, we assumed that the sleep data of the 20 subjects in SleepEDF-13 is independent and identically distributed, thus the evaluation metrics computed over the data of each subject follow a Gaussian distribution [35].
Then, we reported a sequence of 20 macro F1-scores (cf., Section VI-B) per model variant, obtained from the 20-fold cross validation in the outer loop of the evaluation scheme. Afterwards, we compared the model variants pairwise on their respective sequences using a one-sided Welch's t-test, with the null hypothesis that the performance improvement obtained by adding a model feature is smaller than or equal to zero. We set the significance level to 0.05 for the test: if the p-value is smaller than 0.05, we consider the added model feature to improve the performance of multi-channel multi-modal sleep scoring models.
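As an illustration of the test described above, the following sketch runs a one-sided Welch's t-test with SciPy. The two score sequences are synthetic placeholders, not results from this study, and the helper name `improves` is hypothetical.

```python
import numpy as np
from scipy import stats


def improves(variant_scores, baseline_scores, alpha=0.05):
    """One-sided Welch's t-test: is the variant's mean score larger?

    H0: mean(variant) - mean(baseline) <= 0.
    """
    result = stats.ttest_ind(
        variant_scores,
        baseline_scores,
        equal_var=False,        # Welch's test: no equal-variance assumption
        alternative="greater",  # one-sided
    )
    return result.pvalue, bool(result.pvalue < alpha)


# Synthetic macro F1-scores for 20 outer folds (placeholders only).
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.75, scale=0.02, size=20)
variant = rng.normal(loc=0.80, scale=0.02, size=20)
p_value, significant = improves(variant, baseline)
```

Setting `equal_var=False` is what distinguishes Welch's test from Student's t-test; `alternative="greater"` encodes the one-sided null hypothesis.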
The results for evaluating the model features from single-channel EEG models in the multi-channel multi-modal setting are presented in Fig. 1. Overall, all four model features proved statistically significant, since they all achieved p-values smaller than 0.05 in the Welch's t-tests. We thus concluded that all four tested model features from single-channel EEG models are useful for improving multi-channel multi-modal models under classic CNN-RNN architectures.

B. Final Multi-Channel Multi-Modal Model
Based on the experiment presented in the previous section, we introduce our improved multi-channel multi-modal sleep scoring model here. Our model consists of four components: a temporal learning module used to extract temporal features, a spatial learning part embedded in the first layer of temporal learning to incorporate channel information within a modality, a sequential learning module applied to capture sequential features from sleep sequences and a residual connection employed to concatenate CNN and RNN features. The final classification of sleep stages is performed on the obtained feature representations via a fully-connected layer with the SoftMax activation function. Fig. 2 shows the full structure of the improved multi-channel multi-modal model on the SHHS-1 data set which contains 2 EEGs, 2 EOGs and 1 EMG. Note that each modality m ∈ {EEG, EOG, EMG} can have more than one signal, each of which is referred to as a channel throughout this paper. We describe the four components above in more detail as follows.
1) Temporal Learning: The first convolutional layer has two pipelines, one with a small filter size and the other with a large filter size, to capture time- and frequency-domain features from raw sleep signals, respectively. Additional convolutional layers are added to extract complex underlying features. Specifically, each pipeline of CNNs consists of four convolutional layers and two max-pooling layers. Each convolutional layer is followed by a batch normalization layer [36] and a rectified linear unit (ReLU) activation layer (i.e., ReLU(x) = max(0, x)). Details on the number of filters, filter sizes, stride and pooling sizes are shown in Fig. 2. Following [6], we set the smaller filter size in the first convolutional layer to half the sampling rate, as distinctive time-domain features (e.g., K-complex) usually appear in 0.5-s ranges in sleep epochs. We set the larger filter size to 4 times the sampling rate to better detect the frequency components of these signals. Different from [6], we set the stride size in the first convolutional layer to 1 instead of a large value to prevent the information loss of basic features. Accordingly, we applied larger pooling sizes in the max-pooling layers to filter out more representative features and avoid overfitting. At the end of CNNs, the features extracted from the time- and frequency-domain pipelines are concatenated as the final temporal feature representations of sleep epochs. We also employed two dropout layers [37] of probability 0.5 as regularization techniques to help prevent overfitting in the training process.
2) Spatial Learning: Li et al. [38] have shown that low temporal relevance can exist among EEG channels in non-Wake stages. To detect and incorporate channel information, i.e. spatial correlations among channels within a modality, we integrated a spatial block in the first temporal convolutional layer, following [24], [39]. First, we reshaped the signals of multiple channels of a modality m into an input of shape C_m × D, where C_m is the number of channels in this modality and D is the number of data points. Then, we passed them to the first convolutional layer of temporal learning, which has C_m input channels and 64 output channels. This layer learns temporal features from raw sleep signals and then spatially aggregates the feature maps learnt from each channel. Compared to [24] where spatial learning was applied directly on raw sleep signals, our spatial learning block is applied on temporal feature maps, which has the advantage that distinctive patterns (e.g., sawtooth waves) existing in raw sleep signals will not be changed before they are identified.
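The first convolutional layer described above, with its two filter sizes and the embedded spatial aggregation over the C_m channels of one modality, can be sketched in PyTorch as follows. This is a simplified illustration of the first layer only: the class name `DualFilterFirstLayer` and the pooling size are hypothetical, and the remaining convolutional, pooling and dropout layers from Fig. 2 are omitted.

```python
import torch
import torch.nn as nn


class DualFilterFirstLayer(nn.Module):
    """First temporal layer with a small- and a large-filter pipeline.

    Each Conv1d takes the C_m channels of one modality as input channels,
    so the feature maps of the channels are aggregated inside the layer.
    pool_size is a placeholder; the actual sizes are given in Fig. 2.
    """

    def __init__(self, n_channels, fs, n_filters=64, pool_size=8):
        super().__init__()

        def pipeline(kernel_size):
            return nn.Sequential(
                nn.Conv1d(n_channels, n_filters, kernel_size, stride=1),
                nn.BatchNorm1d(n_filters),
                nn.ReLU(),
                nn.MaxPool1d(pool_size),
            )

        self.small = pipeline(fs // 2)  # about 0.5 s: time-domain patterns
        self.large = pipeline(4 * fs)   # 4 s: frequency-domain patterns

    def forward(self, x):
        # x has shape (batch, C_m, D).
        small = self.small(x).flatten(1)
        large = self.large(x).flatten(1)
        return torch.cat([small, large], dim=1)


torch.manual_seed(0)
layer = DualFilterFirstLayer(n_channels=2, fs=125)  # e.g. 2 EEGs in SHHS-1
epochs = torch.randn(4, 2, 30 * 125)                # four 30-s epochs
features = layer(epochs)
```

With a stride of 1, each pipeline keeps almost the full temporal resolution after the convolution, which is why the subsequent pooling does most of the downsampling.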
3) Sequential Learning: To concentrate on the important sleep epochs of a sleep sequence, we employed the encoder-decoder sequential learning module by Luong et al. [40] with attention mechanisms in order to learn transition information from sleep sequences, as the stage of a sleep epoch is determined by both its own features and the information of neighbouring epochs [2]. Specifically, there are two phases: an encoding phase used to capture context information of sleep sequences and a decoding phase used to predict sleep stages epoch by epoch. The encoder employs two bidirectional long short-term memory layers (Bi-LSTM) with 256 hidden units to learn the context dependencies of sleep sequences containing multi-channel multi-modal CNN features in both forward and backward directions. The decoder uses a block composed of two long short-term memory layers (LSTM), an attention module and a fully-connected layer (FC) to predict sleep stages iteratively. Consider a sequence of n sleep epochs. The specific computation to predict the sleep stage for an epoch t can be expressed as follows: ,

4) Residual Connections: We added a residual connection that concatenates CNN features to RNN features in order to consider temporal and sequential information evenly in the final classification of sleep stages. The residual connection employs a fully-connected layer to map CNN features into a feature vector, RC_t, which has the same dimension as the RNN features. Then, both features are concatenated side-by-side to address: i) data imbalance arising in the sequential learning, as the data balancing techniques discussed in Section IV-C were only employed in the training process for CNNs and not for RNNs and ii) possible information loss of temporal features when the model was trained for sequential features.
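A minimal PyTorch sketch of such a residual connection is given below. The feature dimensions and the class name `ResidualHead` are hypothetical; only the projection to the RNN feature dimension and the side-by-side concatenation follow the description above.

```python
import torch
import torch.nn as nn


class ResidualHead(nn.Module):
    """Project CNN features to the RNN feature size, concatenate, classify."""

    def __init__(self, cnn_dim, rnn_dim, n_classes=5):
        super().__init__()
        self.project = nn.Linear(cnn_dim, rnn_dim)        # produces RC_t
        self.classify = nn.Linear(2 * rnn_dim, n_classes)

    def forward(self, cnn_features, rnn_features):
        rc = self.project(cnn_features)
        joint = torch.cat([rc, rnn_features], dim=-1)     # side-by-side
        # Logits; the SoftMax is typically folded into the loss function.
        return self.classify(joint)


torch.manual_seed(0)
head = ResidualHead(cnn_dim=128, rnn_dim=512)  # placeholder dimensions
logits = head(torch.randn(3, 128), torch.randn(3, 512))
```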

C. Addressing Class Imbalance
In PSGs, stages N1 and N3 usually occur much less frequently, yielding imbalanced data sets (cf., Table III). Moreover, complex deep neural networks are often biased toward detecting majority classes better than minority classes [42]. To ensure that all classes can be learnt equally, we employed two data balancing techniques: applying the weighted loss function (WLF) in the training process and oversampling (OS) the instances of minority classes [43]. For WLF, we calculated the categorical cross entropy loss with the weighting function W_c = 1 − N_c/N to assign a higher loss to minority classes, where W_c is the weight for class c, N_c is the number of instances in class c and N is the total number of instances in all classes. For OS, we duplicated the whole batch of instances of minority classes multiple times until their number of instances was close to the number of instances of the majority class. Then, we randomly duplicated single instances from minority classes again to make the number of instances in all classes exactly the same. Note that we only applied data balancing techniques during the training of the temporal and spatial learning components (i.e., in CNNs), as the arrangements of sleep sequences would be invalidated in the sequential learning phase (i.e., in RNNs) if we applied data balancing techniques there.
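Both balancing techniques can be illustrated in a few lines of plain Python. The sketch below is an interpretation of the description above rather than the authors' code; the function names and the toy label list are hypothetical.

```python
import random
from collections import Counter


def class_weights(labels):
    """W_c = 1 - N_c / N, so rarer classes receive larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return {c: 1 - n / total for c, n in counts.items()}


def oversample_indices(labels, seed=0):
    """Return instance indices such that every class has equally many.

    Each minority class is first duplicated as a whole as often as fits,
    then topped up with randomly chosen single instances.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        repeats, remainder = divmod(target, len(indices))
        balanced.extend(indices * repeats)               # whole-class copies
        balanced.extend(rng.sample(indices, remainder))  # random top-up
    rng.shuffle(balanced)
    return balanced


labels = ["W"] * 6 + ["N1"] * 2 + ["N2"] * 4  # toy example
weights = class_weights(labels)
balanced = oversample_indices(labels)
```

In this toy example, N1 makes up 2 of 12 instances and therefore receives the largest weight, 1 − 2/12 ≈ 0.83.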

V. CHANNEL CONTRIBUTION ANALYSIS
In this section, we present two explainability methods to analyze channel contribution in multi-channel multi-modal sleep scoring. The layer-wise relevance propagation (LRP) [12] is a model-agnostic post-hoc explainability method that extracts information from a trained deep neural network. Although widely applied, post-hoc explainability methods might not be faithful to the underlying model [44]. In contrast, the embedded channel attention network (eCAN), motivated by [13] and [14], learns channel importance intrinsically. In our study, we focused on channel importance in the CNNs of our model structure to exclude context information interactions from neighbouring sleep epochs. Furthermore, we proposed a hypothesis based on the obtained results and subsequently employed channel exclusion experiments to verify our conclusion. All experiments were performed on SHHS-1, as it contains a broad range of research subjects (5,793 subjects) and more channels of sleep signals than SleepEDF-13 (cf., Section III).

A. Layer-Wise Relevance Propagation
We employed the LRP [12] to compute the relevance scores of sleep signals of different channels to represent their importance for predictions. Since we used ReLU activation layers in CNNs, whose outputs are non-negative and which are monotonically increasing, we employed the propagation rule by Montavon et al. [45] to allocate the relevance scores from a current layer k to a preceding layer j. This rule has a positive and a negative contribution term. We focused on the positive one, because it shows channel importance straightforwardly. The relevance scores were then calculated as follows:

$$R_j = \sum_k \frac{a_j w_{jk}^{+}}{\sum_j a_j w_{jk}^{+}} R_k$$

where R_j and R_k are the relevance scores of the neurons at layer j and k, a_j is the activations of the neurons at layer j and w_{jk}^{+} denotes the positive connections of the neurons between layer j and k. Note that this propagation rule has the constraint that the activations in every preceding layer (including the input data) must be non-negative. However, our input is sleep signals and thus can be negative. To address this problem, we adapted the original rule to

$$R_j = \sum_k \frac{(a_j w_{jk})^{+}}{\sum_j (a_j w_{jk})^{+}} R_k$$

by considering a_j and w_{jk} as a whole. In this way, if the product of the input data and the associated weights in the first layer is positive, the input data has a positive contribution to the output of this layer and is counted. In the experiment, the relevance score of a channel to a prediction was defined as the sum of the relevance of all signal points in the data input of that channel. The channel importance for a particular sleep stage was obtained by averaging the channel relevance scores over all predicted sleep epochs of that stage.
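To make the propagation step concrete, the following NumPy sketch applies the positive rule to a single fully-connected layer, including the adapted variant for signed inputs in the first layer. It is a toy illustration under simplifying assumptions (one dense layer, a small stabilizing constant `eps`, hand-picked numbers) and not the implementation used for the CNNs in this work.

```python
import numpy as np


def lrp_positive(a, W, R_k, signed_input=False, eps=1e-12):
    """Redistribute relevance R_k (layer k) onto the neurons of layer j.

    a: activations of layer j, shape (J,)
    W: weights between layer j and k, shape (J, K)
    signed_input=True uses (a_j * w_jk)^+ for inputs that can be negative;
    otherwise the standard a_j * w_jk^+ is used.
    """
    if signed_input:
        z = np.maximum(a[:, None] * W, 0.0)
    else:
        z = a[:, None] * np.maximum(W, 0.0)
    denom = z.sum(axis=0) + eps  # sum over j, one value per neuron k
    return (z / denom) @ R_k


# Toy input: 2 channels x 2 signal points, flattened to 4 values.
a = np.array([1.0, -2.0, 0.5, -1.0])
W = np.array([[0.5, -1.0], [-0.25, 0.5], [1.0, 1.0], [2.0, -0.5]])
R_k = np.array([1.0, 2.0])

R_j = lrp_positive(a, W, R_k, signed_input=True)
# Channel relevance: sum over the signal points of each channel.
channel_relevance = R_j.reshape(2, 2).sum(axis=1)
```

In this example the total relevance is conserved (the entries of `R_j` sum to 3, like those of `R_k`), and the second channel receives the larger share.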

B. Embedded Channel Attention Network
Our eCAN (cf., Fig. 3) uses a channel attention module which takes the sleep features of all channels as inputs and outputs the attention weights per channel. We used the same CNNs as outlined in Section IV-B to extract temporal features but removed the spatial learning component to capture the individual contribution from each channel. Specifically, the extracted temporal features of each channel of the modalities obtained from CNNs, in a shape of d × w where d and w respectively denote the depth and width of the feature representations, were first flattened and passed into a global average pooling layer to generate one representative feature per channel. Then, the generated features of all channels were input to a block of two fully-connected layers and a ReLU activation layer, to compute an attention weight for each channel. Note that, this block has the same number of input and output neurons as the number of channels. Next, both the features of each channel and the attention weights were self-normalized using the SoftMax function. The normalized attention weights were multiplied with the normalized features of corresponding channels, resulting in attention weighted features for every channel which were finally passed into another fully-connected layer with the SoftMax activation function for sleep stage classification. We trained the eCAN using the same data balancing techniques as introduced in Section IV-C. Here, the channel importance for a particular sleep stage was obtained from the trained model by averaging the channel attention weights over all predicted sleep epochs of that stage.
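The attention module described above might look as follows in PyTorch. This is a rough sketch under several assumptions: the hidden size of the two fully-connected layers, the position of the ReLU between them, the class name `ChannelAttention` and all tensor sizes are not specified in the text and are chosen here for illustration only.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Attention over channels on top of per-channel CNN features."""

    def __init__(self, n_channels, feature_dim, n_classes=5):
        super().__init__()
        # Two fully-connected layers with as many inputs and outputs as
        # there are channels; the hidden size is an assumption.
        self.attention = nn.Sequential(
            nn.Linear(n_channels, n_channels),
            nn.ReLU(),
            nn.Linear(n_channels, n_channels),
        )
        self.classifier = nn.Linear(n_channels * feature_dim, n_classes)

    def forward(self, features):
        # features has shape (batch, channels, d, w).
        flat = features.flatten(2)                   # (batch, channels, d*w)
        pooled = flat.mean(dim=2)                    # one value per channel
        weights = torch.softmax(self.attention(pooled), dim=1)
        normalized = torch.softmax(flat, dim=2)      # per-channel features
        weighted = normalized * weights.unsqueeze(-1)
        logits = self.classifier(weighted.flatten(1))
        return logits, weights


torch.manual_seed(0)
ecan = ChannelAttention(n_channels=5, feature_dim=8 * 6)  # 5 SHHS-1 channels
logits, attention_weights = ecan(torch.randn(4, 5, 8, 6))
```

Averaging `attention_weights` over all epochs predicted as a given stage would then give the per-stage channel importance described in the text.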

C. Verification Using Reverse Ablation
To verify the channel contribution results derived from the LRP and the eCAN, we also performed a reverse ablation study. Similar to the eCAN, we used the same CNNs in Section IV-B and removed the spatial learning component. Then, we excluded one channel of the data input at a time and trained the model on remaining channels. We reported performance decreases in terms of per-class F1-scores for particular stages to illustrate the importance of the excluded channel.
In addition, we also performed the same experiment on the whole model structure including the sequential learning and residual connection components, i.e. on CNNs & RNNs & RC, to investigate the influence of sequential features on compensating for the information loss of the excluded channel. We still focused on performance decreases to identify the positive contribution of a channel, i.e. the information added by incorporating that channel.

VI. EXPERIMENTAL SETUP
In this section, we introduce the training scheme, model parameters, evaluation metrics and evaluation designs for our multi-channel multi-modal sleep scoring model.

A. Training Scheme and Model Parameters
We used a two-step training scheme to address the class imbalance problem. In the first step, we pre-trained CNNs (i.e., temporal and spatial learning) via minimizing the categorical cross entropy loss between model predictions and the ground truth. We used one of the two data balancing techniques, WLF and OS, as discussed in Section IV-C. The CNN features were passed into a fully-connected layer with the SoftMax activation function for sleep stage classification. This step enables our model to capture the time-invariant information of a sleep epoch precisely and learn minority classes equally to the majority class. In the second step, we froze the parameters of CNNs and trained RNNs (i.e., sequential learning), the residual connection and the final fully-connected layer. We used the categorical cross entropy loss here again. Note that, in this step, we did not use any data balancing technique.
For both steps, we used early stopping with a patience of 16. We used Adam [46] as the optimizer and set the learning rate to 10^{-4}, β_1 = 0.9 and β_2 = 0.999 in both training steps. Following [24], we set the mini-batch size to 192 segments of 30-s sleep epochs, as a sleep cycle usually lasts around 96 minutes. We expected that one mini-batch could thus cover all classes of sleep stages. For training RNNs and the residual connection, we set the mini-batch size to 24 and the sequence length to 8. However, the number of epochs in a PSG may not be an exact multiple of 8. To still include the last epochs in the training, validation and test set, we padded them with the starting epochs of the same PSG. Our models were implemented using PyTorch and the source code is publicly available: https://github.com/Bobby-Lu/Analyzing-channel-contributionin-multi-channel-multi-modal-sleep-scoring.
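One way to read the padding step is sketched below in plain Python: the last, incomplete sequence of a PSG is filled up with the first epochs of the same recording. The function names are hypothetical, and the sketch assumes that a PSG contains at least as many epochs as the sequence length.

```python
def pad_to_multiple(epochs, seq_len=8):
    """Pad a PSG's epochs so that their number is a multiple of seq_len.

    Assumes len(epochs) >= seq_len, which holds for whole-night PSGs.
    """
    epochs = list(epochs)
    remainder = len(epochs) % seq_len
    if remainder == 0:
        return epochs
    n_pad = seq_len - remainder
    # Reuse the starting epochs of the same PSG as padding.
    return epochs + epochs[:n_pad]


def to_sequences(epochs, seq_len=8):
    """Split the padded epochs into consecutive sequences of seq_len."""
    padded = pad_to_multiple(epochs, seq_len)
    return [padded[i:i + seq_len] for i in range(0, len(padded), seq_len)]


# Toy PSG with 10 epochs, identified by their index.
sequences = to_sequences(range(10))
```

For the toy PSG, the second sequence consists of epochs 8 and 9 followed by epochs 0 to 5.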

B. Evaluation Metrics
We report the performance of our multi-channel multi-modal sleep scoring model using commonly used evaluation metrics (e.g., [6], [7]): accuracy (Acc), macro F1-score (MF1), Cohen's kappa (κ) and per-class F1-score (pF1). Among them, MF1 is the unweighted mean of the per-class F1-scores, each of which is the harmonic mean of precision and recall, and reflects the detection performance on minority classes. κ measures the agreement between a model and the ground truth. pF1 shows the detection performance for specific classes. They are calculated as follows:

$$\mathrm{Acc} = \frac{\sum_{c=1}^{C} TP_c}{N}, \qquad \mathrm{MF1} = \frac{\sum_{c=1}^{C} \mathrm{pF1}_c}{C},$$

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \frac{\sum_{c=1}^{C} TP_c}{N}, \qquad p_e = \frac{\sum_{c=1}^{C} n_{cg}\, n_{cp}}{N^2},$$

$$\mathrm{pF1}_c = \frac{2 \cdot Pr_c \cdot Re_c}{Pr_c + Re_c},$$

where $c$ is one class of sleep stages and $C$ is the number of classes. $TP_c$ is the number of true positives of class $c$. $N$ is the total number of epochs. $\mathrm{pF1}_c$ is the per-class F1-score of class $c$. $p_o$ is the relative agreement between the ground truth and the predictions; $p_e$ is the hypothetical probability of chance agreement, obtained from the number of sleep epochs in the ground truth for a class, $n_{cg}$, and the number of epochs in the predictions for that class, $n_{cp}$. $Pr_c$ is the precision of class $c$ and $Re_c$ is the recall of class $c$.
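As a worked illustration, the metrics can be computed from a confusion matrix as sketched below. The sketch assumes the standard definitions of accuracy, macro F1 and Cohen's kappa, with rows as ground truth and columns as predictions; the function name and the toy two-class matrix are hypothetical and are not results from this paper.

```python
def sleep_metrics(cm):
    """Compute Acc, MF1, kappa and per-class F1 from a confusion matrix.

    cm[g][p] = number of epochs with ground-truth class g predicted as class p.
    """
    C = len(cm)
    N = sum(sum(row) for row in cm)
    tp = [cm[c][c] for c in range(C)]
    n_g = [sum(cm[c]) for c in range(C)]                        # ground-truth counts
    n_p = [sum(cm[r][c] for r in range(C)) for c in range(C)]   # prediction counts

    pf1 = []
    for c in range(C):
        pr = tp[c] / n_p[c] if n_p[c] else 0.0                  # precision
        re = tp[c] / n_g[c] if n_g[c] else 0.0                  # recall
        pf1.append(2 * pr * re / (pr + re) if pr + re else 0.0)

    acc = sum(tp) / N
    mf1 = sum(pf1) / C
    p_o = acc                                                   # observed agreement
    p_e = sum(g * p for g, p in zip(n_g, n_p)) / N ** 2         # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return {"Acc": acc, "MF1": mf1, "kappa": kappa, "pF1": pf1}


# Toy two-class confusion matrix (hypothetical numbers).
toy = sleep_metrics([[40, 10], [5, 45]])
print(toy["Acc"], round(toy["kappa"], 3))  # 0.85 0.7
```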

C. Evaluation Designs
We used different evaluation designs for SleepEDF-13 and SHHS-1, because the two data sets differ greatly in size. The SleepEDF-13 data set only contains 20 subjects with 39 PSGs. Hence, we used a nested cross validation scheme. The outer loop is a 20-fold cross validation, corresponding to the 20 subjects, used to estimate the global performance. This means that, in every outer fold k, we left out one of the 20 subjects as the test subject. At the end, we combined the results of all 20 test subjects. Every inner loop is a 10-fold cross validation used to optimize the trainable weights of the models. We trained 10 models on 10 training-validation data combinations and tested them on the data of subject k. Finally, we combined the results of the 200 sets (i.e., 20 outer loops × 10 inner loops) and calculated the performance metrics on the global confusion matrix. For the SHHS-1 data set, we randomly shuffled the 5,793 subjects and split the data set into training (81%), validation (9%) and test (10%) sets following [24]. We trained our model on the training set, used early stopping on the validation set and reported model performance on the test set. Note that the subjects were always kept separate to prevent information leakage.
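Both evaluation designs can be sketched at the subject level as follows. This is an illustration under assumptions: how the 19 remaining subjects are partitioned into 10 inner folds, how the split sizes are rounded, and the random seed are not specified in the text, and the function names are hypothetical.

```python
import random


def nested_cv_splits(subjects, n_inner=10):
    """Yield (train, val, test_subject) for leave-one-subject-out outer folds
    with n_inner training/validation combinations each."""
    for test_subject in subjects:
        rest = [s for s in subjects if s != test_subject]
        for j in range(n_inner):
            val = rest[j::n_inner]                 # assumed fold assignment
            train = [s for s in rest if s not in val]
            yield train, val, test_subject


def subject_split(subjects, val_frac=0.09, test_frac=0.10, seed=0):
    """Shuffle subjects and split them into disjoint train/val/test sets."""
    shuffled = list(subjects)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)      # assumed rounding
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# SleepEDF-13: 20 subjects -> 20 outer folds x 10 inner folds = 200 sets.
edf_splits = list(nested_cv_splits(list(range(20))))
print(len(edf_splits))  # 200

# SHHS-1: 5,793 subjects split by subject into roughly 81% / 9% / 10%.
train, val, test = subject_split(range(5793))
```

Splitting by subject identifier rather than by epoch is what keeps the data of one subject from appearing in more than one set.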

VII. RESULTS & DISCUSSION
In this section, we present and discuss the results for the performance of our multi-channel multi-modal sleep scoring model and for the channel contribution obtained by the LRP and the eCAN methods. We also show the results of channel exclusion experiments.

A. Sleep Scoring Performance
[...] 85 on the small SleepEDF-13 and large SHHS-1 data sets, respectively. We observe that adding transition information from sleep sequences helps compensate for the insufficient temporal information in single sleep epochs and thus improves the detection of sleep stages, especially for stages N1, N3 and REM. The two data balancing techniques, WLF and OS, showed comparable impacts on the performance, although the latter is much more computationally expensive during training. Nevertheless, the minority classes, stages N1 and N3, were still difficult to detect, whereas the other three stages each achieved an F1-score of around 90%.
Compared to the state of the art, our model outperformed all single-channel EEG and multi-channel multi-modal models that were based on classic CNN-RNN architectures. Moreover, compared to SalientSleepNet [26], which used the advanced U-Net architecture, our model achieved close albeit slightly lower performance (87.2% Acc for our model vs. 87.5% Acc for SalientSleepNet), showing that classic CNN-RNN architectures are still competitive for multi-channel multi-modal sleep scoring. SalientSleepNet relied heavily on inter-modality attention modules to minimize the redundancies in data streams, an approach that may be extended based on our conclusion (cf. Section VIII) to reduce inter-channel redundancies as well. Note, however, that the focus of this paper is to investigate channel contribution and not to propose a novel multi-channel sleep scoring model that outperforms the state of the art.
To conclude, adding particular model features from single-channel EEG models (cf. Table IV) improves multi-channel multi-modal models. The improved model outperforms previous single-channel EEG and multi-channel multi-modal models. Compared to the best performing single-channel EEG models, the advantage of incorporating the information of multiple modalities and channels is around 2% Acc on both SleepEDF-13 and SHHS-1. This result suggests that the information from a subset of the modalities and channels may be sufficient to obtain accurate predictions for sleep scoring.

B. Channel Contribution
The proposed channel contribution experiments were based on 60 randomly selected subjects from the training set of SHHS-1. For the LRP, we computed channel importance for both models, including and excluding the spatial learning component. Comparing Fig. 4b and Fig. 4c, we observe that both the LRP and the eCAN attributed information usage to all channels, which suggests that deep sleep scoring models try to utilize all accessible information. However, the EEG modality achieved much higher importance scores than the EOG and EMG modalities, in line with the AASM manual [2]. More specifically, both methods identified channel C4-A1 as the most important EEG channel, which is also the channel recommended in [2], with channel C3-A2 serving as a backup for channel C4-A1. However, the two methods gave no clear indication of the most important EOG channel, matching the fact that the sleep signals of the 2 EOG channels in SHHS-1 are collected from symmetric sensors and thus contain similar information [2], [32].

Table V: Performance comparison of our multi-channel multi-modal model and the state of the art on SleepEDF-13 and SHHS-1. 'CNN' and 'CNN-RNN' in the 'Model' column correspond to the models obtained in the two training steps introduced in Section VI-A; 'WLF' and 'OS' refer to the two data balancing techniques discussed in Section IV-C. Best values among the models based on classic CNN-RNN architectures are marked in bold. The model whose metrics are marked in italics used advanced architectures and outperformed our model. "−" denotes that the value is not available.
Observations on the most important EEG and EOG channels suggest that deep multi-channel multi-modal sleep scoring models may select one channel per modality as their main feature source and use the other channels to complement the information. Moreover, compared to the model without the spatial learning component, the model with it distributed attention more evenly across the channels of a modality (cf. Fig. 4a and Fig. 4b), which reflects the intention of spatial learning, i.e., aggregating information from multiple channels. Furthermore, the LRP and the eCAN are different feature attribution methods (i.e., post-hoc vs. intrinsic), and the specific importance scores they assigned to channels for particular sleep stages varied slightly. For instance, the LRP more naturally assigned larger importance scores to the EOG channels for stages Wake and REM, as these stages contain more eye movements [2]. Similar patterns can also be found in the relations between the EMGs and stages Wake and N1. Despite this, both methods showed a general pattern: multi-channel multi-modal models mainly rely on a single important channel per modality, which suggests that incorporating all channels may not be necessary to obtain acceptable prediction performance.

Fig. 5 presents the performance decreases when excluding single channels of different modalities. Overall, the results verify the hypothesis above: mostly, one channel per modality is relevant for model predictions. Specifically, excluding EEG channels, especially the EEG C4-A1 channel, resulted in a larger performance decrease than excluding others. Excluding the EOG left and right channels led to almost identical performance decreases. However, we observe that the difference between the decreases caused by excluding different channels of a modality is rather small. In addition, comparing Fig. 5a and Fig. 5b, it is interesting to find that CNNs & RNNs & RC showed smaller performance decreases when excluding single channels than CNNs. This indicates that CNNs & RNNs & RC recovered from the exclusion of channels and suggests that the addition of sequential information compensates for the information loss of the excluded channel. We thus conclude that adding transition information is beneficial, especially when only a few channels are available.

C. Hypothesis Verification
For the sake of completeness, we additionally tested whether one channel per modality is really sufficient for acceptable prediction performance. We trained our multi-channel multi-modal model on SHHS-1 with only the identified important channel in each modality (the EEG C4-A1 channel, the EOG left channel and the EMG channel). Results are shown in the last two rows of Table V. Compared to the original multi-channel multi-modal model (i.e., the rows above), the best κ only dropped from 0.85 to 0.84. This indicates that incorporating multiple channels per modality, while increasing the number of parameters to train, does not improve the performance much. The most predictive information is contained in single important channels of the different modalities.

VIII. CONCLUSION AND FUTURE WORK
In this paper, we investigated to what extent multi-channel multi-modal sleep scoring models utilize information from different channels of multiple modalities. To obtain a state-of-the-art multi-channel multi-modal model, we first tested the impact of particular model features from high-performing single-channel EEG models on the performance in the multi-channel multi-modal setting. We found that all four model features presented in Table IV improve the performance. Second, we employed two explainability methods, the LRP and the eCAN, to extract channel importance in multi-channel multi-modal sleep scoring. We found that deep learning based multi-channel multi-modal models incorporate information from all accessible channels but tend to focus on one important channel per modality and use the remaining channels to complement the information. We verified this hypothesis in a reverse ablation study, where we retrained the multi-channel multi-modal model by excluding single channels of different modalities. Overall, the performance difference between single-channel EEG and multi-channel multi-modal approaches is still rather small. This indicates that, while additional channels contain useful information, current multi-channel multi-modal models based on classic CNN-RNN architectures may not be able to reliably use the predictive information from additional channels and may instead get distracted by confusing information or noise from other channels.
A first direction for future work is to evaluate the channel contribution results on a sleep data set with many channels per modality (e.g., the Montreal Archive of Sleep Studies [47]). Moreover, based on our empirical results, a second direction is to analyze channel contribution in more depth, combining deep learning based predictions with actual sleep mechanisms. The validated hypothesis can then be utilized for efficient sleep scoring in small sleep study laboratories: i) collecting and using only the important channels of the modalities and ii) training small deep learning models on a limited number of research subjects. Additionally, considering that sleep is not a global homogeneous event in the brain, another interesting direction is to design multi-channel multi-modal sleep scoring models that, while learning from the channels that contain the most important predictive information, can incorporate additional predictive information from other channels without also learning the distractors.