
SleepFC: Feature Pyramid and Cross-Scale Context Learning for Sleep Staging



Abstract:

Automated sleep staging is essential to assess sleep quality and treat sleep disorders, so the issue of electroencephalography (EEG)-based sleep staging has gained extensive research interest. However, the following difficulties exist in this issue: 1) how to effectively learn the intrinsic features of salient waves from single-channel EEG signals; 2) how to learn and capture the useful information of sleep stage transition rules; 3) how to address the class imbalance problem of sleep stages. To handle these problems in sleep staging, we propose a novel method named SleepFC. This method comprises a convolutional feature pyramid network (CFPN), cross-scale temporal context learning (CSTCL), and a class adaptive fine-tuning loss function (CAFTLF) based classification network. CFPN learns the multi-scale features from salient waves of EEG signals. CSTCL extracts the informative multi-scale transition rules between sleep stages. The CAFTLF-based classification network handles the class imbalance problem. Extensive experiments on three public benchmark datasets demonstrate the superiority of SleepFC over the state-of-the-art approaches. Particularly, SleepFC has a significant performance advantage in recognizing the N1 sleep stage, which is challenging to distinguish.
Page(s): 2198 - 2208
Date of Publication: 28 May 2024
PubMed ID: 38805336

SECTION I.

Introduction

Sleep is important for humans [1]. Different sleep stages, such as non-rapid eye movement (NREM) and rapid eye movement (REM), are essential for memory consolidation, attention improvement, emotion regulation, and so forth [2], [3]. Accurately classifying the sleep stages is indispensable for comprehending how sleep impacts human physical and mental health. However, manual sleep staging heavily relies on the knowledge and labor of sleep experts, and this manual process is empirical and time-consuming [4], [5]. By contrast, automatic sleep staging is promising to enhance the accuracy and efficiency of sleep analysis [6], [7].

Sleep staging refers to distinguishing the stages of human sleep. Sleep specialists generally categorize the sleep stages based on polysomnography (PSG), which consists of EEG, electrooculogram (EOG), electromyogram (EMG), and electrocardiogram (ECG) [8]. This paper focuses on single-channel EEG for sleep staging. Compared with PSG or multi-channel EEG, single-channel EEG holds great practical significance, because only one type of signal needs to be collected via a single channel, which is convenient and efficient. Besides, technological improvement of sleep staging based on single-channel EEG is very helpful for enhancing the performance of sleep staging using multi-channel EEG as well as PSG. According to the American Academy of Sleep Medicine (AASM) criteria, PSG data can be divided into WAKE, REM, and NREM. NREM can further be classified into the N1, N2 and N3 stages. In different sleep stages, the EEG signals display different waveforms, amplitudes, and spectra [9]. For instance, the salient waves of the REM stage are sawtooth waves, while the salient waves of the N2 stage are sleep spindles or K-complexes [10]. Capturing the characteristics of signal wave patterns can be beneficial for sleep stage classification. Moreover, sleep transition rules are also informative for distinguishing sleep stages, especially those between the neighboring sleep stages, such as W-N1-N1-W-N1-N1, N2-N2-N3-N2-N3, N2-N2-REM, etc.

Many researchers have recommended deep learning for sleep staging based on EEG. Typical methods include the convolutional neural network (CNN) [11], [12], the convolutional recurrent neural network (CRNN) [13], [14], the fully convolutional network (FCN) [15], etc. Early methodology relies on the one-to-one scheme, in which an EEG epoch corresponds to one sleep stage [16]. Generally, the EEG signal waves of different sleep stages display distinctive temporal and spectral characteristics. For example, K-complexes occur approximately every 1.0-1.7 minutes, whereas the alpha rhythm oscillates periodically within a frequency range of 8 to 12 Hz. Therefore, multi-scale feature extraction plays an important role in sleep staging, because it can capture the different characteristics of salient EEG waves. Eldele et al. [17] designed two parallel CNNs, which utilize small and large filters to learn the representations from EEG salient waves for classifying sleep stages. Wang et al. [18] employed the attention mechanism and multi-scale convolution to extract the salient wave features from EEG to classify sleep stages. Although CNN models have shown inspiring performance in sleep stage classification, their one-to-one scheme ignores the important sleep transition rules between neighboring sleep stages.

In recent years, both many-to-one and sequence-to-sequence schemes, which rely on multiple EEG epochs for sleep staging, have attracted increasing research interest [11], [12], [14]. These two schemes take into account the transition patterns of neighboring sleep stages and thus achieve encouraging performance [19]. Dong et al. [20] put forward a rectifier neural network to learn the hierarchical features from EEG epochs and adopted long short-term memory (LSTM) to recognize sleep stages. Seo et al. [21] brought forward the intra- and inter-epoch temporal context network (IITNet), which is composed of a deep residual network and two layers of bi-directional LSTM (BiLSTM), to extract the time-invariant features from single-channel EEG epochs and learn the sleep transition rules for distinguishing sleep stages. Phan et al. [22] came up with SleepTransformer, which extracts the intra-epoch features from each 30-second EEG epoch and learns the inter-epoch temporal representation from these epoch-wise features to separate sleep stages.

Nevertheless, because humans spend different amounts of time in different sleep stages, the number of signal samples in each sleep stage is usually unequal. Therefore, we need to address such a class imbalance problem for sleep staging [12], [23]. Recently, some studies have suggested using data augmentation to balance the class distribution of sleep datasets [24], [25]. Data augmentation approaches usually generate synthetic samples of the minority classes from existing samples at the expense of computational time. Other studies recommend applying cost-sensitive learning to penalize the misclassification of minority classes, which, however, sacrifices the classification rate on the majority classes [17].

In this paper, we propose a novel and effective method named SleepFC for sleep staging based on single-channel EEG. The main contributions of SleepFC are summarized as follows.

  1. The proposed SleepFC has a new architecture, which consists of convolutional feature pyramid network (CFPN), cross-scale temporal context learning (CSTCL), and class adaptive fine-tuning loss function (CAFTLF) based classification network, as illustrated in Fig. 1.

  2. In SleepFC, CFPN takes charge of learning a feature pyramid of salient waves; CSTCL is responsible for capturing the multi-scale sleep transition rules between successive sleep stages; the CAFTLF-based classification network plays the role of resolving the class imbalance problem for sleep staging, without causing extra computational expense or compromising the classification rate on the majority classes.

  3. Extensive experiments on three public benchmark datasets demonstrate the superiority of SleepFC over the related state-of-the-arts for sleep staging based on single-channel EEG.

Fig. 1. Overall architecture of SleepFC. At first, CFPN extracts the multi-scale features of salient waves from successive EEG epochs. Then, CSTCL learns to capture the sleep stage transition rules from the extracted multi-scale features. In more detail, CSTCL fuses the multi-scale features by SCL, TDCL and BUCL, and encodes the temporal context information of the fused features via the Transformer encoder. At last, the CAFTLF-based classification network distinguishes the imbalanced classes of sleep stages.

SECTION II.

Method

As shown in Fig. 1, the proposed SleepFC is comprised of three components: CFPN, CSTCL, and the CAFTLF-based classification network. The algorithmic procedures of SleepFC are briefly described as follows. At first, CFPN learns the multi-scale features of salient waves from successive EEG epochs. Then, CSTCL captures the sleep stage transition rules from the multi-scale features. At last, the CAFTLF-based classification network predicts the sleep stages whilst tackling the class imbalance problem.

A. Preliminary

We denote $L$ successive single-channel EEG epochs sampled at $F$ Hz as $\mathbf{S}^{(L)} \in \mathbb{R}^{T \cdot F \cdot L \times C}$, where $T$ is the duration of an EEG epoch in seconds and $C$ is the number of EEG channels. We recommend $T=30$ and $F=100$ in our work, following the general research on EEG-based sleep staging [6], [15], [26], [27]. Besides, we denote the one-hot encoding label of an EEG epoch as $y_{s}^{k} \in \{0, 1\}^{k}$, which corresponds to the true label $y_{s}$. Here, we set $k = 5$, following the five-stage sleep classification in the AASM criteria [28].
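To make the notation concrete, the following minimal sketch instantiates the input layout described above; the variable names are our own choices, not identifiers from any released code.

```python
import torch

L, T, F, C = 10, 30, 100, 1            # epochs per sequence, seconds, Hz, channels
S = torch.randn(T * F * L, C)          # L successive 30-s single-channel EEG epochs
y = torch.nn.functional.one_hot(torch.tensor(2), num_classes=5)  # e.g., stage N2
print(S.shape, y)                      # torch.Size([30000, 1]) tensor([0, 0, 1, 0, 0])
```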

B. Convolutional Feature Pyramid Network

To characterize the intrinsic features of salient waves from EEG signals, CFPN learns the feature pyramid by means of convolutional blocks, max-pooling layers, and convolutional layers.

The feature pyramid consists of three feature maps $\{\mathbf{F}_{3}^{(L)}, \mathbf{F}_{4}^{(L)}, \mathbf{F}_{5}^{(L)}\}$, where $\mathbf{F}_{i}^{(L)}\in \mathbb{R}^{d_{t,i}\times d_{c}}$, $d_{t,i}$ denotes the temporal dimension of the $i$-th feature map ($i = 3, 4, 5$), and $d_{c}$ denotes the channel dimension of the feature maps. CFPN involves five convolutional blocks, four max-pooling layers, and three convolutional layers; the three convolutional layers are designed to unify the channel dimension of the feature maps. Each of the first two convolutional blocks contains two 1-D convolutional layers, two 1-D batch normalization layers, and two parametric rectified linear units (PReLU) [29]; each of the last three convolutional blocks contains three 1-D convolutional layers, three 1-D batch normalization layers, and three PReLUs. In each convolutional block, all the convolutional layers have the same kernel size. Besides, a squeeze-and-excitation module is positioned before the last PReLU of every convolutional block. The squeeze-and-excitation module adaptively recalibrates the channel-wise feature responses by explicitly exploring the inter-dependencies among feature channels [30]. A max-pooling layer is placed between every two convolutional blocks to decrease the temporal dimension of the feature maps. Moreover, a 1-D convolutional layer with a kernel size of 1 is placed after each of the last three convolutional blocks to reduce and unify the channel dimension of the feature maps as $d_{c}$.
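As a reading aid, here is a sketch of one CFPN convolutional block in PyTorch. The block layout (Conv1d + BN + PReLU stacks with a squeeze-and-excitation module before the last PReLU) follows the description above, while the channel widths and the SE reduction ratio are our assumptions.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-excitation over the channel dimension of 1-D feature maps."""
    def __init__(self, channels, reduction=8):        # reduction ratio is assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                             # x: (batch, channels, time)
        w = self.fc(x.mean(dim=-1))                   # squeeze: average over time
        return x * w.unsqueeze(-1)                    # excite: channel recalibration

def conv_block(c_in, c_out, n_convs, k=3):
    """One CFPN block: n_convs x (Conv1d + BN [+ SE] + PReLU); SE sits
    before the last PReLU, as described in the text."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv1d(c_in if i == 0 else c_out, c_out, k, stride=1, padding=1),
                   nn.BatchNorm1d(c_out)]
        if i == n_convs - 1:
            layers.append(SEModule(c_out))
        layers.append(nn.PReLU())
    return nn.Sequential(*layers)
```

In this reading, $\mathbf{F}_3^{(L)}$, $\mathbf{F}_4^{(L)}$ and $\mathbf{F}_5^{(L)}$ would be taken from the outputs of the last three blocks after the kernel-size-1 convolutions that map their channels to $d_c$.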

C. Cross-Scale Temporal Context Learning

To learn the EEG features for sleep staging, CSTCL captures the multi-scale sleep transition rules by integrating three context learning approaches and a Transformer encoder.

The sleep transition rules have multi-scale characteristics according to the AASM criteria (e.g., short scale: N2-REM; middle scale: N3-N1-N1-N3; long scale: N2-N1-N1-W-N1-W-W; here, "-" denotes a transition from one sleep stage to another) [31]. CSTCL learns to capture the multi-scale sleep transition rules from the feature pyramid. Specifically, CSTCL contains top-down context learning, self-context learning, bottom-up context learning, and a Transformer encoder. As the input of CSTCL, the feature pyramid $\{\mathbf{F}_{3}^{(L)}, \mathbf{F}_{4}^{(L)}, \mathbf{F}_{5}^{(L)}\}$ contains both the fine-grained feature map $\mathbf{F}_{l}\in \mathbb{R}^{d_{t,l}\times d_{c}}$ at the low level and the coarse-grained feature map $\mathbf{F}_{h}\in \mathbb{R}^{d_{t,h}\times d_{c}}$ at the high level. In the feature pyramid, every feature map $\mathbf{F}_{i}^{(L)}$ consists of a sequence of feature vectors $[\mathbf{f}_{i,1}^{(L)},\mathbf{f}_{i,2}^{(L)},\dots,\mathbf{f}_{i,T}^{(L)}]^{\top}$ along its temporal dimension.

1) Self-Context Learning:

Self-context learning (SCL) extracts the features of salient waves from EEG in different sleep stages, and learns the contextual relationships along the temporal dimension of these features. The output $\tilde{\mathbf{F}}$ of SCL has the same size as its input $\mathbf{F}$. Firstly, we use $M$ different learnable matrices to project the input $\mathbf{F}$ to $M$ pairs of query and key. Then, we perform a convolutional operation on the input feature map $\mathbf{F}$ to obtain the value $\mathbf{V}$. Next, we calculate the similarity score between each pair $\mathbf{Q}_{j}$ and $\mathbf{K}_{j}$, the $j$-th pair of query and key, by using a mixture of softmaxes (MoS) [32]. The MoS-based normalization is formulated as

\begin{align*} \mathbf{W}_{S} &= \sum_{j=1}^{M} \pi_{j}\, \sigma_{1}\left(\frac{\mathbf{Q}_{j}\mathbf{K}_{j}^{\top}}{\sqrt{d_{k}}}\right), \\ [\pi_{1},\pi_{2},\dots,\pi_{M}] &= \sigma_{2}\left(\mathbf{w}_{\mathrm{mos},j}^{\top}\bar{\mathbf{K}}\right), \\ \mathbf{Q}_{j} &= \mathbf{K}_{j} = f_{\mathrm{QKS},j}(\mathbf{F}), \\ \mathbf{V} &= f_{\mathrm{VS}}(\mathbf{F}), \tag{1}\end{align*}

where $M$ denotes the number of learnable linear projection matrices; $d_{k}$ denotes the channel dimension of $\mathbf{K}$; $\pi_{j}$ denotes the $j$-th aggregation weight; $\sigma_{1}(\cdot)$ and $\sigma_{2}(\cdot)$ are softmax functions; $\mathbf{w}_{\mathrm{mos},j}$ denotes the learnable linear projection vector for normalization; $\bar{\mathbf{K}}$ denotes the arithmetic mean of $\mathbf{K}$ along the temporal dimension; $f_{\mathrm{VS}}(\cdot)$ represents the convolutional operation, and $f_{\mathrm{QKS},j}(\cdot)$ the learnable linear projection matrix. Based on the MoS-based normalization, we obtain the output feature map $\tilde{\mathbf{F}}$ of SCL as

\begin{equation*} \tilde{\mathbf{F}}=\mathrm{BN}(\mathbf{W}_{S}\mathbf{V})+\mathbf{F}, \tag{2}\end{equation*}

where $\mathrm{BN}(\cdot)$ indicates batch normalization.
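The following sketch renders Eqs. (1)-(2) in PyTorch. The layer sizes, the convolution used for $f_{\mathrm{VS}}$, and the choice of taking $\bar{\mathbf{K}}$ as the temporal mean of the input are our assumptions; only the MoS-normalized attention and the residual connection come from the equations.

```python
import torch
import torch.nn as nn

class SelfContextLearning(nn.Module):
    """A sketch of SCL with mixture-of-softmaxes attention, Eqs. (1)-(2)."""
    def __init__(self, d_c=128, M=2):
        super().__init__()
        self.M, self.d_k = M, d_c
        self.qk = nn.ModuleList(nn.Linear(d_c, d_c, bias=False) for _ in range(M))
        self.v = nn.Conv1d(d_c, d_c, kernel_size=3, padding=1)   # f_VS (assumed kernel)
        self.w_mos = nn.Linear(d_c, M, bias=False)               # w_mos,j stacked
        self.bn = nn.BatchNorm1d(d_c)

    def forward(self, F):                                        # F: (batch, time, d_c)
        V = self.v(F.transpose(1, 2)).transpose(1, 2)            # value, Eq. (1)
        K_bar = F.mean(dim=1)                                    # temporal mean (assumed)
        pi = torch.softmax(self.w_mos(K_bar), dim=-1)            # aggregation weights
        W_S = 0
        for j in range(self.M):
            QK = self.qk[j](F)                                   # Q_j = K_j
            att = torch.softmax(QK @ QK.transpose(1, 2) / self.d_k ** 0.5, dim=-1)
            W_S = W_S + pi[:, j, None, None] * att               # MoS aggregation
        out = self.bn((W_S @ V).transpose(1, 2)).transpose(1, 2)
        return out + F                                           # residual, Eq. (2)
```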

2) Top-Down Context Learning:

Top-down context learning (TDCL) adopts a top-down attention mechanism. This mechanism fuses the global information of high-level feature map \mathbf {F}_{h} and the local information of low-level feature map \mathbf {F}_{l} together.

The pipeline of TDCL is briefly described as follows. First, we apply three convolutional layers $f_{\mathrm{QT}}(\cdot)$, $f_{\mathrm{KT}}(\cdot)$ and $f_{\mathrm{VT}}(\cdot)$ with a kernel size of 1 to reduce the channel dimension of $\mathbf{F}_{l}$ and $\mathbf{F}_{h}$ to $d_{c}/2$, thus generating $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$. After this process, the temporal dimension of $\mathbf{Q}$ is still $d_{t,l}$, and that of $\mathbf{K}$ and $\mathbf{V}$ is still $d_{t,h}$.

Next, we calculate the dot product between $\mathbf{Q}$ and $\mathbf{K}^{\top}$ and apply a normalization operation to produce the attention score. Then, we multiply the attention score by $\mathbf{V}$ to yield a new feature map. Finally, we utilize a convolutional layer with a kernel size of 1 and a stride of 1 to increase the channel dimension of this feature map to $d_{c}$ while keeping the temporal dimension as $d_{t,l}$, so that $\tilde{\mathbf{F}}_{l}\in \mathbb{R}^{d_{t,l}\times d_{c}}$. The above process can be formulated as follows:

\begin{align*} \tilde{\mathbf{F}}_{l} &= \mathrm{Conv}_{T}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{d_{t,h}}\mathbf{V}\right), \\ \mathbf{Q} &= f_{\mathrm{QT}}(\mathbf{F}_{l}), \\ \mathbf{K} &= f_{\mathrm{KT}}(\mathbf{F}_{h}), \\ \mathbf{V} &= f_{\mathrm{VT}}(\mathbf{F}_{h}), \tag{3}\end{align*}

where the size of the output $\tilde{\mathbf{F}}_{l}$ is the same as that of the input $\mathbf{F}_{l}$.
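A possible PyTorch rendering of Eq. (3) is sketched below; the kernel-size-1 convolutions follow the text, while everything else (tensor layout, initialization) is our assumption.

```python
import torch
import torch.nn as nn

class TopDownContextLearning(nn.Module):
    """A sketch of TDCL, Eq. (3): low-level queries attend to high-level keys/values."""
    def __init__(self, d_c=128):
        super().__init__()
        self.q = nn.Conv1d(d_c, d_c // 2, 1)              # f_QT on F_l
        self.k = nn.Conv1d(d_c, d_c // 2, 1)              # f_KT on F_h
        self.v = nn.Conv1d(d_c, d_c // 2, 1)              # f_VT on F_h
        self.out = nn.Conv1d(d_c // 2, d_c, 1, stride=1)  # Conv_T restores d_c

    def forward(self, F_l, F_h):                          # (batch, time, d_c) each
        Q = self.q(F_l.transpose(1, 2)).transpose(1, 2)   # (batch, t_l, d_c/2)
        K = self.k(F_h.transpose(1, 2)).transpose(1, 2)   # (batch, t_h, d_c/2)
        V = self.v(F_h.transpose(1, 2)).transpose(1, 2)
        att = Q @ K.transpose(1, 2) / K.shape[1]          # normalize by d_{t,h}
        out = att @ V                                     # (batch, t_l, d_c/2)
        return self.out(out.transpose(1, 2)).transpose(1, 2)  # (batch, t_l, d_c)
```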

3) Bottom-Up Context Learning:

Bottom-up context learning (BUCL) fuses the local information of $\mathbf{F}_{l}$ into $\mathbf{F}_{h}$. Specifically, $\mathbf{F}_{h}$ is linearly projected to $\mathbf{Q}$, and $\mathbf{F}_{l}$ is linearly projected to $\mathbf{K}$ and $\mathbf{V}$:

\begin{align*} \mathbf{Q} &= f_{\mathrm{QB}}(\mathbf{F}_{h}), \\ \mathbf{K} &= f_{\mathrm{KB}}(\mathbf{F}_{l}), \\ \mathbf{V} &= f_{\mathrm{VB}}(\mathbf{F}_{l}), \tag{4}\end{align*}

where $f_{\mathrm{QB}}(\cdot)$, $f_{\mathrm{KB}}(\cdot)$ and $f_{\mathrm{VB}}(\cdot)$ are learnable matrices.

Next, we process the low-level feature map $\mathbf{K}$ by a channel-wise attention operation:

\begin{equation*} \mathbf{w}_{c}=\mathrm{ReLU}(\mathrm{GAP}(\mathbf{K})), \tag{5}\end{equation*}

where the channel-wise attention weight $\mathbf{w}_{c}$ is computed by global average pooling (GAP) and ReLU.

Then, we calculate the Hadamard product between $\mathbf{W}_{c}$ and $\mathbf{Q}$ by

\begin{align*} \tilde{\mathbf{F}}_{h} &= \mathrm{ReLU}(\mathbf{Q}\odot \mathbf{W}_{c} + \mathbf{V}), \\ \mathbf{W}_{c} &= [\mathbf{w}_{c},\mathbf{w}_{c},\dots,\mathbf{w}_{c}]_{1\times d_{k}}, \tag{6}\end{align*}

where $d_{k}$ denotes the temporal dimension of $\mathbf{K}$, and $\odot$ represents the Hadamard product.
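Below is a sketch of Eqs. (4)-(6). The text leaves the temporal alignment between $\mathbf{V}$ (length $d_{t,l}$) and $\mathbf{Q}$ (length $d_{t,h}$) implicit; in this sketch we assume an adaptive average pooling of $\mathbf{V}$ down to the high-level length, which is our assumption rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class BottomUpContextLearning(nn.Module):
    """A sketch of BUCL, Eqs. (4)-(6)."""
    def __init__(self, d_c=128):
        super().__init__()
        self.q = nn.Linear(d_c, d_c, bias=False)   # f_QB on F_h
        self.k = nn.Linear(d_c, d_c, bias=False)   # f_KB on F_l
        self.v = nn.Linear(d_c, d_c, bias=False)   # f_VB on F_l

    def forward(self, F_l, F_h):                   # (batch, t_l, d_c), (batch, t_h, d_c)
        Q, K, V = self.q(F_h), self.k(F_l), self.v(F_l)
        w_c = Fn.relu(K.mean(dim=1))               # Eq. (5): GAP over time, then ReLU
        # Assumed alignment: pool V from t_l down to t_h before the addition.
        V = Fn.adaptive_avg_pool1d(V.transpose(1, 2), Q.shape[1]).transpose(1, 2)
        return Fn.relu(Q * w_c.unsqueeze(1) + V)   # Eq. (6): Hadamard product + V
```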

4) Transformer Encoder:

By performing SCL, TDCL and BUCL on the feature pyramid $\{\mathbf{F}_{3}^{(L)}, \mathbf{F}_{4}^{(L)}, \mathbf{F}_{5}^{(L)}\}$, we obtain three feature sets, each of which contains four feature maps. Then, we concatenate the four feature maps of each set along the channel dimension. To reduce the channel dimension, we process the concatenated feature maps by a convolutional layer $\mathrm{Conv}(\cdot)$ to yield the feature maps $\tilde{\mathbf{F}}_{3}^{(L)}$, $\tilde{\mathbf{F}}_{4}^{(L)}$ and $\tilde{\mathbf{F}}_{5}^{(L)}$:

\begin{align*} \tilde{\mathbf{F}}_{3}^{(L)} &= \mathrm{Conv}\left(\mathrm{Concat}\left(\mathbf{F}_{3}^{(L)},\tilde{\mathbf{F}}_{l(5,3)}^{(L)},\tilde{\mathbf{F}}_{l(4,3)}^{(L)},\tilde{\mathbf{F}}_{(3,3)}^{(L)}\right)\right), \\ \tilde{\mathbf{F}}_{4}^{(L)} &= \mathrm{Conv}\left(\mathrm{Concat}\left(\mathbf{F}_{4}^{(L)},\tilde{\mathbf{F}}_{l(5,4)}^{(L)},\tilde{\mathbf{F}}_{(4,4)}^{(L)},\tilde{\mathbf{F}}_{h(3,4)}^{(L)}\right)\right), \\ \tilde{\mathbf{F}}_{5}^{(L)} &= \mathrm{Conv}\left(\mathrm{Concat}\left(\mathbf{F}_{5}^{(L)},\tilde{\mathbf{F}}_{(5,5)}^{(L)},\tilde{\mathbf{F}}_{h(4,5)}^{(L)},\tilde{\mathbf{F}}_{h(3,5)}^{(L)}\right)\right), \tag{7}\end{align*}

where $\mathrm{Concat}(\cdot)$ represents the concatenation operation along the channel dimension of the four feature maps in each set, the feature map $\tilde{\mathbf{F}}_{i}^{(L)}$ consists of a sequence of feature vectors $[\tilde{\mathbf{f}}_{i,1}^{(L)},\tilde{\mathbf{f}}_{i,2}^{(L)},\dots,\tilde{\mathbf{f}}_{i,T}^{(L)}]^{\top}$, and $T$ denotes the temporal dimension of $\tilde{\mathbf{F}}_{i}^{(L)}$.

Finally, we encode the context information of the temporal sequence $[\tilde{\mathbf{f}}_{i,1}^{(L)},\tilde{\mathbf{f}}_{i,2}^{(L)},\dots,\tilde{\mathbf{f}}_{i,T}^{(L)}]$ by the Transformer encoder. In the positional encoder of the Transformer, we adopt the sine and cosine functions to incorporate the order information of the feature vectors into

\begin{align*} \mathbf{E}_{i}^{(L)} &= \mathrm{TransformerEncoder}\left(\tilde{\mathbf{P}}_{i}^{(L)}\right), \\ \tilde{\mathbf{P}}_{i}^{(L)} &= \tilde{\mathbf{F}}_{i}^{(L)}+\mathbf{P}_{i}^{(L)}, \tag{8}\end{align*}

where $\mathbf{P}_{i}^{(L)}$ denotes the positional encoding matrix; $\mathbf{E}_{i}^{(L)}$ denotes the encoded feature map of the $i$-th feature map $\tilde{\mathbf{P}}_{i}^{(L)}$; $\mathrm{TransformerEncoder}(\cdot)$ represents the encoder component of the Transformer. Because of the large number of parameters in the Transformer, we reduce the hidden dimension $d_{FF}$ of the feed-forward network. Besides, we retain the original number of attention heads $N_{h}$ and encoder layers $N_{e}$ in the Transformer [33]. The parameter settings will be detailed in Section III-B.
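For concreteness, the standard sinusoidal positional encoding and a Transformer encoder with the settings reported in Section III-B ($N_h=8$, $N_e=6$, $d_{FF}=128$) can be sketched as follows; the batch-first layout is our choice.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(T, d_c):
    """Standard sine/cosine positional encodings (Vaswani et al.), as in Eq. (8)."""
    pos = torch.arange(T).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_c, 2) * (-math.log(10000.0) / d_c))
    P = torch.zeros(T, d_c)
    P[:, 0::2] = torch.sin(pos * div)
    P[:, 1::2] = torch.cos(pos * div)
    return P

d_c, N_h, N_e, d_FF = 128, 8, 6, 128
layer = nn.TransformerEncoderLayer(d_model=d_c, nhead=N_h,
                                   dim_feedforward=d_FF, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=N_e)

F_tilde = torch.randn(4, 30, d_c)                      # (batch, T, d_c) fused features
E = encoder(F_tilde + sinusoidal_positions(30, d_c))   # Eq. (8)
```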

D. CAFTLF-Based Classification Network

The CAFTLF-based classification network predicts the sleep stages whilst handling the class imbalance via an attention mechanism and the loss function CAFTLF in a two-stage training process.

In the CAFTLF-based classification network, we fuse the encoded feature map $\mathbf{E}_{i}^{(L)}=[\mathbf{e}_{i,1}^{(L)},\mathbf{e}_{i,2}^{(L)},\dots,\mathbf{e}_{i,T}^{(L)}]^{\top}$ into an attention vector $\tilde{\mathbf{e}}_{i}$ by an attention layer. More concretely, we combine the feature vectors $[\mathbf{e}_{i,1}^{(L)},\mathbf{e}_{i,2}^{(L)},\dots,\mathbf{e}_{i,T}^{(L)}]$ via the weighted sum:

\begin{equation*} \tilde{\mathbf{e}}_{i} = \sum_{t=1}^{T} \alpha_{i,t}\,\mathbf{a}_{i,t}, \tag{9}\end{equation*}

where $\alpha_{i,1},\alpha_{i,2},\dots,\alpha_{i,T}$ denote the attention weights, which are learned by an attention layer:

\begin{align*} \mathbf{a}_{i,t} &= \tanh\left(\mathbf{W}\mathbf{e}_{i,t}^{(L)}+\mathbf{b}\right), \\ \alpha_{i,t} &= \frac{\exp\left(\mathbf{a}_{i,t}^{\top}\mathbf{w}_{\alpha}\right)}{\sum_{t=1}^{T}\exp\left(\mathbf{a}_{i,t}^{\top}\mathbf{w}_{\alpha}\right)}, \tag{10}\end{align*}

where $\mathbf{W}$ and $\mathbf{b}$ are the learnable weight matrix and bias, respectively; $\mathbf{w}_{\alpha}$ is the trainable weight vector.

Finally, the $i$-th feature vector $\tilde{\mathbf{e}}_{i}$ passes through a fully connected layer to yield the $i$-th output logit $\mathbf{O}_{i}$, and thus the sleep stage can be predicted via

\begin{equation*} \hat{y}=\mathrm{argmax}\left(\sum_{i \in \{3,4,5\}} \mathbf{O}_{i}\right), \tag{11}\end{equation*}

where $\hat{y}$ denotes the predicted sleep stage.
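Eqs. (9)-(11) can be sketched as an attention-pooling layer followed by per-scale linear heads whose logits are summed; the layer sizes here are our assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """A sketch of Eqs. (9)-(10): learned attention weights collapse the
    temporal sequence of encoded vectors into one feature vector."""
    def __init__(self, d_c=128):
        super().__init__()
        self.proj = nn.Linear(d_c, d_c)                # W, b
        self.w_alpha = nn.Linear(d_c, 1, bias=False)   # w_alpha

    def forward(self, E):                              # E: (batch, T, d_c)
        a = torch.tanh(self.proj(E))                   # a_{i,t}, Eq. (10)
        alpha = torch.softmax(self.w_alpha(a), dim=1)  # attention weights
        return (alpha * a).sum(dim=1)                  # weighted sum, Eq. (9)

# Eq. (11): per-scale logits are summed before the argmax over classes.
pools = nn.ModuleList(AttentionPooling() for _ in range(3))
heads = nn.ModuleList(nn.Linear(128, 5) for _ in range(3))
E3, E4, E5 = (torch.randn(4, 30, 128) for _ in range(3))
logits = sum(h(p(E)) for p, h, E in zip(pools, heads, (E3, E4, E5)))
y_hat = logits.argmax(dim=-1)                          # predicted sleep stage
```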

In the classification network, we employ a piecewise loss function to address the class imbalance problem of sleep stages. The training process of SleepFC consists of two stages. In the first stage, we utilize the standard multi-class cross-entropy [34] as the loss function; in the second stage, we devise the loss function CAFTLF as

\begin{align*} \mathcal{L}_{\mathrm{CAFTLF}} &= -\frac{1}{S} \sum_{i\in\{3,4,5\}} \sum_{s=1}^{S} \sum_{k=1}^{K} w_{k}\, y_{i,s}^{k} \log\left(\widehat{y}_{i,s}^{k}\right), \tag{12}\\ w_{k} &= \begin{cases} 1+\mu_{(y_{s},\widehat{y}_{s})} \cdot \max\left(1, \log\left(\frac{S}{S_{k}}\right)\right), & \text{if } y_{s} \neq \widehat{y}_{s},\\ 1, & \text{if } y_{s} = \widehat{y}_{s}, \end{cases} \tag{13}\end{align*}

where $\widehat{y}_{s}^{k}$ denotes the predicted probability of the $s$-th sample belonging to class $k$; $S$ denotes the total number of samples, and $K$ denotes the total number of classes; $w_{k}$ denotes the weight assigned to class $k$; $S_{k}$ denotes the number of samples belonging to class $k$; $\mu_{(y_{s},\widehat{y}_{s})}$ is an adjustable parameter indicating the distinctness of the class.
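The weighting in Eq. (13) can be sketched as follows, rendered per sample. Here `mu` is a single scalar standing in for the class-distinctness parameter $\mu_{(y_s,\widehat{y}_s)}$, and the per-class counts are hypothetical.

```python
import math
import torch

def caftlf_weights(y_true, y_pred, class_counts, mu=1.0):
    """A sketch of the weights in Eq. (13): misclassified samples of rare
    classes receive up-weighted penalties; correct predictions keep weight 1."""
    S = sum(class_counts)                          # total number of samples
    w = torch.ones(len(y_true))
    for s, (yt, yp) in enumerate(zip(y_true, y_pred)):
        if yt != yp:                               # misclassified samples only
            w[s] = 1 + mu * max(1.0, math.log(S / class_counts[yt]))
    return w                                       # multiplies the CE terms of Eq. (12)

counts = [8000, 600, 9000, 2500, 3500]             # hypothetical per-class sizes
print(caftlf_weights([1, 2], [0, 2], counts))      # ≈ tensor([4.67, 1.00])
```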

When training SleepFC, an early stopping technique is employed to reduce the overfitting risk and enhance the generalization performance. In the training process, once the validation loss stops decreasing for at least a certain number of training iterations (i.e., the early stopping patience $\phi_{1}$), the first stage of training ends and the second stage of training starts. In the second training stage, if the validation accuracy ceases to increase for at least a certain number of training iterations (i.e., the early stopping patience $\phi_{2}$), which indicates that the trained model can no longer be improved, the training process ends. In the second training stage, the class weight $w_{k}$ is influenced by two factors: first, the number of samples in each class; second, the classification rate for each class in the first training stage. Hence, CAFTLF prevents the classification network from over-trusting the classification rate while neglecting the class size, which is thereby conducive to overcoming the imbalanced classification problem in sleep staging. In addition, the hyperparameters for training SleepFC will be detailed in Section III-B. The two-stage schedule is sketched below.
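In the following sketch, `train_step` and `validate` are hypothetical callables, and running `validate` once per validation period is our simplification of the procedure described above.

```python
def two_stage_training(train_step, validate, phi1=20, phi2=10):
    """A sketch of the two-stage early-stopping schedule: `validate` is a
    hypothetical callable returning (val_loss, val_acc) per validation period."""
    best_loss, stale = float("inf"), 0
    while stale < phi1:                            # stage 1: minimize val loss (CE)
        train_step(stage=1)
        loss, _ = validate()
        best_loss, stale = (loss, 0) if loss < best_loss else (best_loss, stale + 1)
    best_acc, stale = 0.0, 0
    while stale < phi2:                            # stage 2: maximize val acc (CAFTLF)
        train_step(stage=2)
        _, acc = validate()
        best_acc, stale = (acc, 0) if acc > best_acc else (best_acc, stale + 1)
```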

SECTION III.

Experiments

A. Datasets

We evaluate our proposed method, SleepFC, on three public benchmark datasets: SleepEDF-20 [26], SleepEDF-78 [27], and ISRUC-S3 [35], whose key characteristics are summarized in Table I.

TABLE I Dataset Characteristics and Evaluation Protocols

SleepEDF-20: SleepEDF-20 comprises 20 subjects (10 male and 10 female) aged from 25 to 34 years old without sleep disorders. Two consecutive nights of PSG recordings were collected from each subject, except that one recording of subject 13 was lost due to device failure. Based on the Rechtschaffen and Kales criteria [6], sleep experts manually annotated the PSG sleep periods in 30-second sleep epochs and categorized the sleep epochs into eight classes: MOVEMENT, UNKNOWN, WAKE, N1, N2, N3, N4, and REM.

SleepEDF-78: SleepEDF-78 is the Sleep-EDF Expanded dataset (version 2013), consisting of 78 healthy subjects aged from 25 to 101. Each subject underwent two consecutive nights of PSG sleep recording, except for subjects 13, 36 and 52, for each of whom one recording was lost due to device failure. Every sleep epoch was categorized into the same eight classes as in SleepEDF-20.

ISRUC-S3: ISRUC-S3 contains the PSG recordings collected from 10 healthy subjects (9 males and 1 female). The recordings of ISRUC-S3 lasted continuously for 8 hours with a sampling frequency of 200 Hz. Each recording includes 6 EEG channels, 1 ECG channel, 3 EMG channels, and 2 EOG channels. According to the AASM criteria, sleep experts categorized these PSG signals into five sleep stages: WAKE, N1, N2, N3, and REM.

B. Experimental Settings

For each dataset, we use a single channel of original EEG, except for ISRUC-S3, whose signals are downsampled to 100 Hz. In experiments, we use the Fpz-Cz channel of EEG from SleepEDF-20 and SleepEDF-78, and the C4-A1 channel of EEG from ISRUC-S3 for method evaluation. The MOVEMENT class refers to physical activity during sleep. There are also movement artifacts that cannot be scored at both the beginning and end of the recording from each subject; these noisy parts of each recording are labeled as UNKNOWN [36]. Because these two classes do not represent any specific sleep stage, we exclude them before the experiments [6], [31], [37]. Moreover, according to the AASM criteria, we merge the N3 and N4 stages into N3 for classification [6], [17], [38], [39]. Besides, we keep 30 minutes of the WAKE periods before and after the sleep period as the WAKE stage [40].

We follow the widely used evaluation protocols for method evaluation [6], [17], [22]. The evaluation protocols on the different datasets are described in Table I. It is worth mentioning that, in the experiments, the validation set is randomly selected from the training set and is independent of the testing set. Besides, we adopt three metrics to evaluate the method performance: accuracy (ACC), macro F1-score (MF1), and Cohen's Kappa ($\kappa$) [6], [22], [36].

The parameter settings of SleepFC are given in the following. $L$ is set as 10, which means that the current and nine previous adjacent EEG epochs are used as the input data of SleepFC. In each convolutional block of CFPN, for every convolutional layer, the kernel size is set as 3, the stride as 1, and the padding as 1; for every max-pooling layer, the kernel size is set as 5 and the stride as 5. In CFPN, the output channel number $d_{c}$ of the convolutional layer is set as 128. In CSTCL, all the outputs of SCL, TDCL and BUCL have the same channel dimension $d_{c}=128$. In SCL, the number of mixture models in MoS is set as 2. In the Transformer encoder, the number of heads is set as $N_{h}=8$, and the number of encoder layers is set as $N_{e}=6$. The hidden dimension of the feed-forward network $d_{FF}$ is set as 128. Besides, SleepFC is trained using the Adam optimizer [41] with $\eta = 5 \times 10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon = 1\times 10^{-8}$. In training, the mini-batch size of SleepFC is set as 32. To mitigate overfitting, L2-weight regularization with a coefficient of $1\times 10^{-6}$ is adopted for SleepFC.

On SleepEDF-20 and SleepEDF-78, SleepFC is evaluated on the validation set every 500 training iterations (i.e., the validation period $\psi=500$); on ISRUC-S3, SleepFC is evaluated with $\psi=150$. At the same time, the validation loss is monitored for early stopping. The first stage of training focuses on minimizing the validation loss; if the validation loss stops decreasing for $\phi_{1}=20$ validation periods, the first training stage ends and the second training stage starts. The second stage of training turns to maximizing the validation accuracy, with $\eta=1\times 10^{-4}$ and $\phi_{2}=10$.

C. Feature Evaluation

To evaluate the performance gain brought by CFPN, we compare the feature extraction components of SleepFC, U-Time, XSleepNet, and AttnSleep with and without the feature pyramid method on SleepEDF-78. In this comparison, each feature extraction component is followed by the Transformer encoder and the CAFTLF-based classification network of SleepFC.

U-Time [15] is a fully convolutional network for sleep staging. U-Time has an encoder-decoder structure, in which the encoder performs feature extraction and the decoder performs time series segmentation. In our experiments, we only utilize the encoder component to extract EEG features directly from the raw EEG signal.

XSleepNet [6] is a sequence-to-sequence bidirectional RNN for sleep staging. XSleepNet is composed of two network streams: one for processing raw signals and the other for processing time-frequency images. In our experiments, we only use the former stream to extract features, considering its suitability for EEG.

AttnSleep [17] is an attention-based deep learning approach for sleep staging using single-channel EEG. The feature extraction component of AttnSleep is a multi-resolution convolutional neural network (MRCNN), which is bifurcated into two distinct branches. The low-resolution branch extracts low-frequency features, and the high-resolution branch extracts high-frequency features. The features from the two branches are then concatenated as the extracted features.

From Table II, we can see that, with the feature pyramid, the overall ACC, MF1, and $\kappa$ performances of all the evaluated methods consistently rise. These results not only validate the competence of CFPN in SleepFC, but also verify the compatibility of the feature pyramid with all the compared networks for sleep staging.

TABLE II Evaluation of the Feature Extraction Method Feature Pyramid in SleepFC

D. Method Comparison

Table III visualizes the confusion matrices of SleepFC for sleep stage classification on SleepEDF-20, SleepEDF-78 and ISRUC-S3. From these confusion matrices, we can observe that the class imbalance problem strongly influences the performance of SleepFC. Specifically, the WAKE stage, which belongs to the majority class of the long-tailed distribution, is the easiest to identify on all the datasets, while the N1 stage, which belongs to the minority class at the other end of the distribution, is the hardest to classify.

TABLE III Confusion Matrices of SleepFC for Sleep Stage Classification on SleepEDF-20, SleepEDF-78 and ISRUC-S3 (in Each Confusion Matrix, the Rows Stand for the Ground-Truth Labels and the Columns for the Predicted Classes; the Upper Value Indicates the Classification Rate on Each Class, and the Lower Value Indicates the Number of Predicted Samples in Each Class)

Moreover, we compare our proposed SleepFC with the state-of-the-art approaches in Table IV. We directly report the results of the methods whose original papers use the input setting of $L=10$, including IITNet [21] and SleepEEGNet [13]. For the methods with a different setting of $L$, we also evaluate them with input data of $L=10$ for fairness. In particular, we implement AttnSleep [17], Multi-Task CNN [42], TinySleepNet [37], XSleepNet [6] and U-Time [15] using publicly available code, and reproduce DeepSleepNet [36], ResnetLSTM [43], SleepFCN [44], Single-Stream XSleepNet [6], SleepTransformer [22], TSA-Net [39], MNN [20] and SeqSleepNet [10] ourselves.

TABLE IV Comparison of SleepFC With Related State-of-the-Arts for Sleep Staging

From Table IV, we can see that SleepFC performs the best for sleep staging in terms of ACC, MF1 and $\kappa$ on the whole. In greater detail, SleepFC achieves remarkable F1-score performance on N1, N2 and REM. Besides, the results of SleepFC on WAKE and N3 are also relatively encouraging. Actually, the sleep stage N1 is a challenging minority class, which accounts for only 5%-15% of the total sleep time. Even so, SleepFC still obtains a relatively high F1-score on N1. These results readily demonstrate the ability of SleepFC to deal with the class imbalance problem in sleep staging.

Furthermore, we measure the model size of SleepFC in Table V. Although the performance advantage of SleepFC over XSleepNet is not as obvious as that over the other compared approaches, SleepFC has a smaller number of parameters and requires fewer EEG epochs.

TABLE V Model Size of SleepFC

E. Model Ablation

We carry out an ablation study on SleepEDF-20 to validate the rationality and effectiveness of the key components CFPN, CSTCL and CAFTLF in SleepFC. The following four experiments are conducted:

  1. Ablation on CFPN: CFPN and CAFTLF-disabled classification network with the first training stage.

  2. Ablation on CFPN+CAFTLF: CFPN and CAFTLF-based classification network with the two-stage training.

  3. Ablation on CFPN+CSTCL: CFPN, CSTCL and CAFTLF-disabled classification network with the first training stage.

  4. Ablation on CFPN+CSTCL+CAFTLF: CFPN, CSTCL and CAFTLF-based classification network with the two-stage training, i.e., SleepFC.

From Fig. 2, we can see that CSTCL enables SleepFC to capture the informative multi-scale transition rules between sleep stages, thus boosting the performance of SleepFC. These results reveal the value of this context learning component in SleepFC for sleep staging. By comparing CFPN with CFPN+CAFTLF, as well as CFPN+CAFTLF with CFPN+CSTCL+CAFTLF, we can find that the CAFTLF-based classification network not only enhances the overall ACC, MF1, and $\kappa$ performances of SleepFC, but also significantly improves its F1-score for N1 classification in spite of the severe class imbalance problem. Such results confirm that CAFTLF enables SleepFC to attach importance to the minority class without compromising its performance on the majority classes.

Fig. 2. Ablation of SleepFC on SleepEDF-20.

F. Sensitivity Analysis

1) Evaluation on the Number of EEG Epochs:

We evaluate the influence of the number of EEG epochs, denoted as $L$, on the performance of SleepFC, by adjusting $L$ as 1, 2, 5, 10 and 20. Our results under three different evaluation metrics, as illustrated in Fig. 3, reveal that SleepFC achieves its peak performance on SleepEDF-20, SleepEDF-78 and ISRUC-S3 when $L$ is set as 10. By contrast, both increasing and decreasing $L$ result in a performance decline for SleepFC. This is mainly because lower values of $L$ cannot offer sufficient temporal context information for the CSTCL of SleepFC to learn discriminative feature maps $\{\mathbf{E}_{3}^{(L)}, \mathbf{E}_{4}^{(L)}, \mathbf{E}_{5}^{(L)}\}$ from the feature pyramid, while higher values of $L$ involve much redundant and noisy information that harms the discriminability of the learned feature maps. As a compromise, $L=10$ is the relatively best choice for SleepFC.

Fig. 3. Evaluation on the number of EEG epochs as the input for SleepFC using SleepEDF-20, SleepEDF-78 and ISRUC-S3: (a) the results of SleepFC under ACC; (b) the results of SleepFC under MF1; (c) the results of SleepFC under $\kappa$.

2) Evaluation on the Convolution Kernel Size of CFPN:

We evaluate the influence of the convolution kernel size of CFPN, denoted as $K$, on the performance of SleepFC, by adjusting $K$ from 1 to 9. By observing the results of SleepFC under three different evaluation metrics on ISRUC-S3 in Fig. 4, we can find that the performance of SleepFC fluctuates as $K$ increases, resulting in more than one peak. This phenomenon can be explained as follows. For a sleep stage, a larger convolution kernel enables CFPN to encode rich and varied information, which benefits learning more robust features at the expense of discriminability; a smaller convolution kernel enables CFPN to encode detailed and typical information, which is conducive to learning more discriminative features at the cost of robustness. To ensure good generalization performance, CFPN should balance discriminability and robustness in feature learning. Moreover, the signal data in different sleep stages have different characteristics, which correspond to different kernel sizes of CFPN for learning the features with the strongest generalizability. Therefore, SleepFC exhibits fluctuating performance as $K$ increases. However, as shown in Fig. 4(d), a larger kernel size also means more model parameters and higher computational complexity for SleepFC. Considering this, we recommend $K=3$ for SleepFC, because it is the smallest kernel size for which SleepFC obtains the relatively highest ACC, MF1, and $\kappa$.

Fig. 4. Evaluation on the convolution kernel size of CFPN in SleepFC using ISRUC-S3: (a) the results of SleepFC under ACC; (b) the results of SleepFC under MF1; (c) the results of SleepFC under $\kappa$; (d) the number of parameters in SleepFC with different sizes of convolution kernels.

3) Evaluation on the Concatenation Order of Feature Maps:

We evaluate the concatenation order of feature maps on ISRUC-S3, including $[\tilde{\mathbf{F}}^{(L)}_{(5,i)},\mathbf{F}^{(L)}_{i},\tilde{\mathbf{F}}^{(L)}_{(4,i)},\tilde{\mathbf{F}}^{(L)}_{(3,i)}]$, $[\mathbf{F}^{(L)}_{i},\tilde{\mathbf{F}}^{(L)}_{(5,i)},\tilde{\mathbf{F}}^{(L)}_{(3,i)},\tilde{\mathbf{F}}^{(L)}_{(4,i)}]$, $[\mathbf{F}^{(L)}_{i},\tilde{\mathbf{F}}^{(L)}_{(3,i)},\tilde{\mathbf{F}}^{(L)}_{(4,i)},\tilde{\mathbf{F}}^{(L)}_{(5,i)}]$, $[\tilde{\mathbf{F}}^{(L)}_{(3,i)},\mathbf{F}^{(L)}_{i},\tilde{\mathbf{F}}^{(L)}_{(4,i)},\tilde{\mathbf{F}}^{(L)}_{(5,i)}]$, $[\mathbf{F}^{(L)}_{i},\tilde{\mathbf{F}}^{(L)}_{(3,i)},\tilde{\mathbf{F}}^{(L)}_{(5,i)},\tilde{\mathbf{F}}^{(L)}_{(4,i)}]$, and $[\mathbf{F}^{(L)}_{i},\tilde{\mathbf{F}}^{(L)}_{(5,i)},\tilde{\mathbf{F}}^{(L)}_{(4,i)},\tilde{\mathbf{F}}^{(L)}_{(3,i)}]$. From Fig. 5, we can see that the concatenation order of feature maps has nearly no influence on the performance of SleepFC under the different evaluation metrics on ISRUC-S3. Generally, the concatenation order of feature maps in deep learning affects the classification performance whenever the downstream layers learn the deep representation relying on the position of the concatenated feature maps [45], [46], [47]. Nevertheless, SleepFC concatenates the four feature maps along the channel dimension instead of the temporal dimension, so the concatenation order has no impact on the process of temporal feature learning. Actually, the minor performance variation of SleepFC across different feature map concatenations is mainly caused by the random initialization of the convolutional layers after concatenation, which is inevitable and negligible in applications.

Fig. 5. Evaluation on the concatenation order of feature maps in SleepFC using ISRUC-S3: (a) the results of SleepFC under ACC; (b) the results of SleepFC under MF1; (c) the results of SleepFC under $\kappa$.

4) Evaluation on the Scale of Learned Representation:

We evaluate the effectiveness of each scale of learned representation output from SleepFC on ISRUC-S3. For convenience, we denote the three scales of learned representations as $\mathbf{O}_{3}$, $\mathbf{O}_{4}$, and $\mathbf{O}_{5}$, which correspond to the output logits of SleepFC, as shown in Fig. 1(c). As reported in Table VI, the performance of SleepFC improves as more scales of learned representations are used; when all three scales of representations $\mathbf{O}_{3}$, $\mathbf{O}_{4}$ and $\mathbf{O}_{5}$ are used together, SleepFC performs the best. Such results validate the effectiveness of the multi-scale representations learned by SleepFC for sleep staging. Further, we can observe the contributions of the different scales of learned representations to sleep stage prediction. In greater detail, $\mathbf{O}_{3}$ is especially effective for classifying the WAKE and N3 stages, and $\mathbf{O}_{4}$ is particularly effective for classifying the N1 stage, while $\mathbf{O}_{5}$ is effective for classifying all the sleep stages on the whole without special superiority in any class. Since different scales of learned representations have different classification advantages, combining these representations, as done by SleepFC, can well integrate their advantages and hence achieves the best performance for sleep staging.

TABLE VI Evaluation on the Scale of Learned Representation Output From SleepFC

G. Significance Test

We evaluate the statistical significance of the performance improvement of SleepFC over the three related advanced methods AttnSleep, DeepSleepNet, and XSleepNet by means of the paired Wilcoxon signed-rank test. To be specific, we assess the p-values for the ACC, MF1, and $\kappa$ improvements of SleepFC in comparison to the three methods. To this end, we set the null hypothesis $H_{0}$ as follows: the performance difference between SleepFC and each compared model in the control group is not significant; if the p-value is less than 0.05, $H_{0}$ is rejected. In the statistical significance tests, all the methods adopt the same input of EEG epochs with $L=10$.
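For reference, such a paired test can be run with SciPy as sketched below; the per-fold accuracies are hypothetical values for illustration only.

```python
from scipy.stats import wilcoxon

acc_sleepfc = [0.86, 0.85, 0.87, 0.84, 0.88]    # hypothetical per-fold ACC
acc_baseline = [0.84, 0.83, 0.86, 0.82, 0.85]   # hypothetical baseline ACC
stat, p = wilcoxon(acc_sleepfc, acc_baseline)   # paired signed-rank test
print(f"p = {p:.4f}")                           # reject H0 if p < 0.05
```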

As recorded in Table VII, in almost all the cases, SleepFC has obvious performance improvements over the compared approaches, and the p-values for these improvements are much lower than the significance level of 0.05. Such results clearly demonstrate the statistical significance of the performance superiority of SleepFC for the task of sleep staging.

TABLE VII Statistical Significance Tests on the Performance Improvement of SleepFC

SECTION IV.

Conclusion

In this paper, we have proposed a novel method, SleepFC, for single-channel EEG-based sleep staging. SleepFC can not only effectively extract and fuse the representative features from the salient waves of EEG epochs, but also learn and capture the informative multi-scale sleep transition rules among sleep stages, and competently tackle the serious class imbalance problem that has long troubled this task. Experimental results on three public benchmark datasets have demonstrated the superiority of the proposed method over the related state-of-the-arts. In the future, we will incorporate an appropriate transfer learning strategy into SleepFC to handle the thorny problem of the cross-subject domain gap, so as to further enhance the performance of our model.
