Adaptive Margin Aware Complement-Cross Entropy Loss for Improving Class Imbalance in Multi-View Sleep Staging Based on EEG Signals

Sleep data are typically characterized by class imbalance, which can bias a model toward the frequent classes and yield low classification accuracy on the minority classes. However, the minority classes in sleep staging have important diagnostic value; for example, an N1 stage that is too short may indicate hypersomnia or narcolepsy. To address this problem, we propose a multi-view CNN model based on an adaptive margin-aware loss. A novel margin-aware factor that considers the relative sample sizes of the frequent and minority classes mitigates minority-class overfitting: it increases the regularization strength on the minority classes by enlarging their decision margins, without changing the sample sizes. On this basis, we propose margin-aware cross-entropy and margin-aware complement entropy losses. Margin-aware complement entropy increases the regularization of the minority classes while neutralizing their errors, thus improving their classification accuracy. Finally, margin-aware complement entropy and cross-entropy are combined in an adaptive way to improve sleep staging accuracy. We tested our method on three sleep datasets and compared it with the state of the art; the results demonstrate that the proposed algorithm not only improves the accuracy of sleep staging overall but also improves the minority classes to a greater extent.


I. INTRODUCTION
Sleep is a very important physiological process for human beings, and the quality of sleep has a great impact on physical and mental health [1]. Deep learning not only enables end-to-end learning but can also operate without prior knowledge, so sleep staging research based on deep learning has become a hot topic [2]. The main deep learning frameworks used in the field of sleep staging are Deep Belief Networks (DBNs) [3], Auto-encoders [4], Deep Neural Networks (DNNs) [5], Convolutional Neural Networks (CNNs) [6] and Recurrent Neural Networks (RNNs) [7], as well as hybrid structures such as CNN+RNN [8], CNN+LSTM [9] and RNN+LSTM [10]. Numerous studies [6], [11], [12] have shown that CNNs are still the most important network type in sleep staging. In addition, networks in the sleep domain are mostly built in a serial or parallel manner and can be divided into two categories. One class [8], [9], [13] uses different network architectures to mine different feature representations of the same data. The other uses different scales of the same network architecture, or different types of inputs, to mine information at different scales [14], [15], [16]. Compared to two-stream networks using different network types, two-stream networks using the same architecture are simpler. However, most existing neural networks [2], [6], [9], [10] in the field of sleep staging use single-view data from the same data source as input and lack information mining of data from different views. Feature representations extracted from multi-view data are usually complementary in nature [18]. Therefore, using our previously proposed automatic CNN architecture search algorithm [19], we automatically constructed a 1D CNN taking the original EEG sequence signals and a 2D CNN taking EEG spectral images.
Although a two-stream network with multiple views can improve the mining ability of the data, its performance still degrades when the data are class-imbalanced. Sleep datasets are classic examples of imbalanced datasets because each sleep stage has a different duration: the N2 stage usually occupies 45%-55% of the entire night, while the N1 stage accounts for only 2%-5% [20]. Only a few sleep staging studies have addressed class imbalance, and the available studies fall into three main categories. The first category improves the class imbalance by reweighting. Emadeldeen et al. [21] proposed a class-sensitive weighted loss function for the sleep EEG staging class imbalance problem; however, reweighting may make model optimization unstable and difficult [22]. The second category improves the class imbalance problem by oversampling the minority class. Supratak et al. [9] proposed balancing the classes by replicating minority-class samples to increase that class's sample size. Chawla et al. used the synthetic minority oversampling technique (SMOTE) [23], based on synthesizing minority-class samples, to improve the class imbalance. Phan et al. [15] trained the network by random sampling so that the classes in each batch were balanced. However, oversampling a minority class may cause overfitting of that class [23]. The third category improves the class imbalance problem by augmenting the minority-class data: e.g., Sun et al. [24] proposed augmenting the minority class by geometric transformations such as flips and translations, F. Lotte [25] proposed amplifying the minority-class samples by segmenting and recombining the original signal, and J. Fan et al. [26] generated minority-class samples using generative adversarial networks (GANs).
However, sleep EEG is time-series data, and such expansion may alter the original time-dependent information in the signal.
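For reference, the core interpolation step behind SMOTE [23] can be sketched as follows; the neighbour count `k` and the fixed random seed are illustrative choices, not the reference implementation:

```python
import numpy as np

def smote_like_sample(minority_x, k=3, rng=None):
    """Generate one synthetic minority sample by interpolating between
    a random minority point and one of its k nearest minority-class
    neighbours -- the core idea of SMOTE."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(minority_x))
    x = minority_x[i]
    # distances from x to every minority sample
    d = np.linalg.norm(minority_x - x, axis=1)
    neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
    nb = minority_x[rng.choice(neighbours)]
    lam = rng.random()                    # interpolation factor in [0, 1)
    return x + lam * (nb - x)
```

For time-series EEG epochs, exactly this kind of linear interpolation between epochs is what may distort the original time-dependent structure noted above.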
Resampling and data expansion techniques are the most common ways in sleep staging to alleviate minority-class overfitting by changing the number of samples. Margin loss [27], [28] does not require changing the number of samples; instead, it improves the final classification accuracy by giving the classifier the maximum classification margin, i.e., making each class lie at a larger interval from the decision boundary, thus alleviating overfitting. The large-margin softmax [29] and angular softmax [30] expand the class margins by minimizing the intraclass distance and maximizing the interclass distance. However, the classes in these methods are treated independently, so every class ends up with an equal margin from the decision boundary, which does little to alleviate minority-class overfitting. Cao et al. [31] proposed label-distribution-aware margin (LDAM) loss to alleviate overfitting of the minority class. LDAM improves the generalization error of minority classes by applying stronger regularization to them, which improves their classification accuracy on the test set without reducing the model's fitting ability. The margin of each class in LDAM depends inversely on a root of its own sample size (it is proportional to n_j^{-1/4}), resulting in a larger decision margin for the minority class. However, LDAM ignores the imbalance ratio between the minority and frequent classes, which may reduce the probability of the minority classes obtaining larger margins.
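For concreteness, LDAM's margin rule Δ_j = C / n_j^{1/4} can be checked numerically (the class counts below are illustrative, loosely shaped like a sleep dataset with a rare N1 and a frequent N2):

```python
import numpy as np

# Illustrative per-class sample counts: [W, N1, N2, N3, R]
n = np.array([9000.0, 400.0, 20000.0, 5000.0, 7000.0])

C = 0.5                      # global scale of the margins
margins = C / n ** 0.25      # LDAM margin: larger for rarer classes
```

Note that each class's margin here depends only on its own count n_j; the gap to the most frequent class plays no role, which is exactly the limitation the margin-aware factor proposed below addresses.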
In addition, Chen et al. proposed complement entropy loss [32], which suppresses misclassification and achieves error neutralization by exploiting the complement information of the samples in a class, thus improving the confidence level. Because the minority class carries richer error information, Kim et al. [33] used complement entropy to improve the class imbalance problem. However, the existing complement entropy loss considers accuracy improvement only from the perspective of neutralizing error information; when the minority class makes up a very small fraction of the data, the available training samples remain insufficient, which limits how effectively complement entropy loss can neutralize the errors of minority classes. Therefore, we introduce the proposed margin-aware factor into the complement entropy loss to further improve its classification accuracy on class-imbalanced data from the perspective of increasing the regularization of the minority class.
To avoid introducing unknown noise by generating new simulated samples and the possible instability of model fitting caused by reweighting techniques, this paper proposes a novel approach to improve the sleep staging class imbalance problem. We enhance the minority-class regularization strength by increasing the decision margins of the minority classes, and we strengthen error suppression through complement information by learning from wrong predictions, ultimately improving the model's ability to identify minority sleep stages without adding new samples.

II. PROPOSED METHOD
In this paper, we propose an adaptive margin aware complement-cross entropy loss based multi-view CNN (AMC-MVCNN), which is shown in Fig. 1.

A. Architecture of the Multi-View CNN
To more fully exploit the information contained in the sleep data, we also propose a multi-view CNN (MVCNN) architecture. Data from different views provide complementary feature representations [18], which helps to indirectly improve the model's learning of the minority class. The 1D CNN and 2D CNN in this architecture are derived from our previous architecture search algorithm, SOSCNN [19], which is driven by the sleep EEG data and automatically designs a network architecture suitable for sleep staging. In this study, two CNN architectures with different dimensions are combined to construct the MVCNN. The MVCNN architecture is divided into a 1-dimensional convolutional branch and a 2-dimensional convolutional branch, as shown in Fig. 1. The 1-dimensional branch contains 7 layers: layers 1, 2, 4 and 6 are convolutional layers whose numbers of neurons and kernel sizes are (57@21 * 1, 211@21 * 1, 154@58 * 1, 140@69 * 1), respectively, with a stride of 3; layers 3 and 5 are average pooling layers with a pooling window of 3 * 1 and a stride of 2; layer 7 is a flatten layer. The 2-dimensional branch also contains 7 layers: layers 1, 2, 4, 5 and 6 are convolutional layers whose numbers of neurons and kernel sizes are (46@5 * 5, 187@6 * 6, 68@7 * 7, 68@7 * 7, 254@7 * 7), respectively, with a stride of 1; layer 3 is an average pooling layer with a pooling window of 3 * 3 and a stride of 2; layer 7 is a flatten layer. Each convolutional layer is followed by an activation layer (ReLU) and a batch normalization layer.
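As a concrete sketch, the two-branch layer specification above can be transcribed into PyTorch as follows. The padding scheme, the input sizes (a 3000-sample epoch and a 32 × 32 spectrogram) and the `LazyLinear` classification head are our assumptions chosen for a runnable example; the original SOSCNN-derived network may differ in these details.

```python
import torch
import torch.nn as nn

def conv1d_block(cin, cout, k):
    # conv -> ReLU -> BatchNorm, stride 3; padded so deep layers stay valid
    return nn.Sequential(nn.Conv1d(cin, cout, k, stride=3, padding=k // 2),
                         nn.ReLU(), nn.BatchNorm1d(cout))

def conv2d_block(cin, cout, k):
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=1, padding=k // 2),
                         nn.ReLU(), nn.BatchNorm2d(cout))

class MVCNN(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        # 1-D branch: layers 1, 2, 4, 6 convolutional; 3, 5 average pooling; 7 flatten
        self.branch1d = nn.Sequential(
            conv1d_block(1, 57, 21), conv1d_block(57, 211, 21),
            nn.AvgPool1d(3, stride=2),
            conv1d_block(211, 154, 58),
            nn.AvgPool1d(3, stride=2),
            conv1d_block(154, 140, 69),
            nn.Flatten())
        # 2-D branch: layers 1, 2, 4, 5, 6 convolutional; 3 average pooling; 7 flatten
        self.branch2d = nn.Sequential(
            conv2d_block(1, 46, 5), conv2d_block(46, 187, 6),
            nn.AvgPool2d(3, stride=2),
            conv2d_block(187, 68, 7), conv2d_block(68, 68, 7),
            conv2d_block(68, 254, 7),
            nn.Flatten())
        self.head = nn.LazyLinear(n_classes)   # joint representation -> classes

    def forward(self, x_raw, x_img):
        # concatenate the flattened branch outputs into the joint representation
        joint = torch.cat([self.branch1d(x_raw), self.branch2d(x_img)], dim=1)
        return self.head(joint)
```

The concatenation in `forward` mirrors the joint-representation step of Equation (3) before the fully connected layer.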
S_i^1 denotes the i-th sample of the original signal data type, and S_i^2 denotes the i-th sample of the time-frequency image data type. y_i denotes the one-hot label of the i-th sample. The outputs of the CNN feature extraction layers of the two MVCNN branches are shown below,
where F_1 and F_2 denote the feature extraction processes of the 1-dimensional CNN and the 2-dimensional CNN, respectively. The representations extracted by the 1D CNN and 2D CNN in MVCNN are flattened, and the outputs of the two branches are then concatenated to obtain the joint sleep EEG representation. The process of obtaining the joint representation is shown in Equation (3).
The MVCNN passes the joint representation (o_1, · · · , o_N) through the fully connected layer and uses the softmax function to calculate the predicted probability ŷ of the input samples. The classification error between the predicted values and the true labels is calculated with the L_AMCE loss function proposed in this study. The final sleep EEG staging task is accomplished by minimizing the classification error and backpropagating to update the MVCNN network weights. The weight update can be expressed by Equation (4).

B. Margin Aware Factor
Margin loss [29], [30], [36], [37], [38] refers to a class of methods that seek the maximum classification margin. In these methods, the larger a sample's distance from the decision boundary, the higher the confidence of its classification. The core of the margin-aware factor is to improve model generalization by applying stronger regularization to the minority class, based on the ratio of the sample sizes of the majority class and the class being classified, thereby improving the classification accuracy of the minority class on the test set. The stronger regularization of the minority class is achieved through larger margins, and this adjustment requires computing fine-grained upper bounds on the generalization error. The error for each class in the classification task can be defined as follows [31]. We use R^d to denote the input space, {1, · · · , k} to denote the label space, and x and y to denote the input data and their corresponding labels, respectively. Assuming that the class-conditional probability distributions P(x|y) of the training and test datasets are the same, we denote the conditional distribution of class j by P_j = P(x|y = j). We first define the relevant quantities on a balanced dataset and later extend them to the class-imbalanced dataset. We denote by L_bal[MVCNN] the standard 0-1 test error of MVCNN under the balanced data distribution, which can be expressed by the following equation.
Similarly, the error of the j-th class can be expressed by the following equation.
We use n_j to denote the number of samples in class j of the training dataset. The margin of a sample (x, y) is defined as follows.
The training margin of class j is defined as follows.
We denote by G the family of hypothesis classes on the dataset, by C(G) the complexity of G, and by γ_min = min{γ_1, · · · , γ_k} the minimum margin over all classes. The typical generalization error bound when the training and test data are identically distributed is C(G)/√n. The imbalanced generalization error when the training and test sets share the same class-imbalanced distribution can then be expressed as follows.
Since the bound holds with high probability (1 − n^{−5}) over the randomness of the training data, the fine-grained error of each class, such as the j-th class error, can be reduced to L_j.
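For reference, the per-class bound of this form in [31] can be written, up to constants and low-order terms, as:

```latex
% Fine-grained generalization error of class j (following LDAM [31]),
% holding with probability at least 1 - n^{-5} over the training sample:
L_j[\mathrm{MVCNN}] \;\lesssim\;
  \frac{1}{\gamma_j}\sqrt{\frac{C(\mathcal{G})}{n_j}}
  \;+\; \frac{\log n}{\sqrt{n_j}}
```

The first term shows directly why enlarging the margin γ_j of a class with few samples n_j tightens its error bound.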
From the per-class generalization error formula, it follows that the generalization error L of the minority class is reduced by increasing its margin. The margin modification of the original LDAM [31] considers only a class's own number of samples and ignores the relative relationship between the sample counts of the minority and frequent classes, although this relative relationship directly affects model performance. We propose a margin-aware factor m that considers both a class's own sample count and the count of the most frequent class. Fig. 2(a) shows a schematic diagram of margin modification with the margin-aware factor: l_1 is the original decision margin without any adjustment, equidistant between the two classes; l_2 is the decision margin considering only the class's own sample count; l_3 is the decision margin adjusted by the margin-aware factor; and γ_1, γ_2, γ_3, γ_4 indicate the distances between the two classes and the different decision margins. Our proposed margin-aware factor accounts for the difference between the minority- and frequent-class sample counts: the more imbalanced the data, the more the minority class is encouraged to obtain a larger margin. The model of m is shown in Equation (11).
where n_max is the number of frequent-class samples, n_i is the number of class-i samples, and k is the number of classes in the dataset. C is a hyperparameter to be tuned and is set to 1 in this paper. Fig. 2(b) shows the effect of the minority- and frequent-class sample counts on the margin-aware factor; y_1, y_2, y_3, y_4 represent the variation curves of the margin-aware factor as the number of samples in the majority class grows. When the number of minority-class samples is fixed (e.g., n = 291), a larger frequent class makes the dataset more imbalanced and increases the margin-aware adjustment of the minority class, so that the minority class obtains a larger margin and thus greater correctness and certainty.
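Reading Eq. (11)/(15) as m_i = (C / n_i) · log(n_max / (n_max − n_i + n_i / n_max)) — the exact normalization is our transcription and should be treated as an assumption — the factor can be computed as:

```python
import numpy as np

def margin_aware_factor(n, C=1.0):
    """Margin-aware factor m_i for each class, transcribed from Eq. (15).

    n : per-class sample counts; n_max is the frequent-class size.
    C : scaling hyperparameter (set to 1 in the paper).
    """
    n = np.asarray(n, dtype=float)
    n_max = n.max()
    # the log argument lies in [1, n_max], so every m_i is non-negative
    return C / n * np.log(n_max / (n_max - n + n / n_max))
```

Unlike the LDAM margin, the argument of the logarithm couples each class size n_i to n_max, so the factor changes when the imbalance ratio changes even if n_i itself does not.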
The mathematical model of the modified margin-aware cross entropy loss (L_MCE) is shown in Equation (12).
where N is the number of samples in a batch and C is the number of classes. z_j denotes the network output corresponding to the true class j, and z_k denotes the network output corresponding to class k (k ≠ j).
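A minimal numpy sketch of L_MCE, assuming (LDAM-style) that the margin is subtracted from the true-class logit inside the softmax; the per-class margin vector is taken as an input rather than recomputed here:

```python
import numpy as np

def margin_aware_ce(logits, labels, margins):
    """L_MCE: cross entropy with the margin m subtracted from the
    logit z_j of each sample's true class j (cf. Eq. (12))."""
    z = np.array(logits, dtype=float)        # copy, don't mutate the input
    idx = np.arange(len(labels))
    z[idx, labels] -= margins[labels]        # enforce the class margin
    z -= z.max(axis=1, keepdims=True)        # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()
```

With all margins set to zero this reduces to ordinary softmax cross entropy; a positive margin forces the true-class logit to exceed the others by that margin before the loss is satisfied.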

C. Margin-Aware Complement Entropy
The purpose of complement entropy loss is to exploit the valuable complementary information provided by misclassifications. Complement entropy loss indirectly increases the probability of correct classification by suppressing errors through the complementary information provided by the error probabilities of all class samples. Under class imbalance, a sample has a higher probability of being misclassified because of minority-class overfitting. Since the error probability of the minority class is larger than that of the frequent class, the suppression of error information during error neutralization is stronger for the minority class. Therefore, complement entropy loss [32] is introduced to exploit the error information. Although complement entropy loss improves classification accuracy through complementary information, the minority-class overfitting problem persists when the minority-class training samples are insufficient.
The original complement entropy loss improves the probability of correct classification only by suppressing misinformation. The overfitting of minority classes can be further alleviated if their regularization is also increased. Therefore, this paper modifies the sample margins in the complement entropy by introducing the margin-aware factor m proposed above, increasing the regularization of the minority class so that the modified loss is more favorable for class-imbalanced data. We name the improved loss the margin-aware complement entropy L_MCOE. The process of L_MCOE is shown in Fig. 3.
m_i = (1/n_i) log( n_max / (n_max − n_i + n_i/n_max) ),  for i ∈ {1, . . . , k}   (15)
where N is the number of samples in a batch and C is the number of classes. P_i is the complementary information calculated for the i-th sample using the error probabilities. g is the index of the correct class. z_ij is the output of the i-th sample for class j. m is the proposed margin-aware factor, which depends on both the number of frequent-class samples and the class's own sample size. e^{z_ig − m} denotes the evidence that the i-th sample belongs to class g, corrected by the margin-aware factor m.
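Combining the complement entropy of [32] with the margin correction e^{z_ig − m}, one plausible reading of L_MCOE can be sketched as follows (the sign convention, returning the negative complement entropy so that minimizing it flattens the wrong-class probabilities, is our assumption):

```python
import numpy as np

def margin_aware_complement_entropy(logits, labels, margins):
    """L_MCOE sketch: complement entropy computed on probabilities whose
    true-class evidence e^{z_g} is replaced by e^{z_g - m}."""
    z = np.array(logits, dtype=float)
    idx = np.arange(len(labels))
    z[idx, labels] -= margins[labels]            # e^{z_g - m}
    z -= z.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    pg = p[idx, labels][:, None]                 # probability of the true class
    comp = p / (1.0 - pg + 1e-12)                # normalized complement probs
    comp[idx, labels] = 1.0                      # true-class term: 1*log(1) = 0
    H = -(comp * np.log(comp)).sum(axis=1)       # per-sample complement entropy
    return -H.mean()                             # minimize the negative entropy
```

H is maximal (log(K − 1) for K classes) when the probability mass outside the true class is uniform, i.e., when no single wrong class is confidently favored, which is the error-neutralization effect described above.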

D. Adaptive Margin Aware Complement-Cross Entropy
We propose the adaptive margin aware complement-cross entropy loss to achieve synergy between margin aware complement entropy and margin aware cross entropy. The margin aware cross entropy (MCE) loss focuses on learning the correct-class information to guide the training of the model, while the margin aware complement entropy (MCOE) loss focuses more on learning the complementary error information. Therefore, we use an increasing function f(x) ∈ (0, 1) to achieve adaptive synergy between MCE and MCOE. t is the horizontal translation and is set to 50; λ is a parameter to be tuned and is set to 1 in this paper. In the early stages of training, the model focuses more on learning the correct samples to directly improve the predicted probability of the correct class. As training progresses, the final predicted probability is indirectly improved by increasing the learning of incorrect information. The mathematical model of the adaptive margin aware complement-cross entropy loss is shown in Equations (16) and (17).
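The text fixes t = 50 and λ = 1 but does not name f; a sigmoid is one increasing function into (0, 1) that satisfies the description, so the schedule below is an assumption rather than necessarily the authors' exact f:

```python
import math

def adaptive_weight(epoch, t=50, lam=1.0):
    """Increasing f(epoch) in (0, 1): near 0 early in training
    (focus on L_MCE), near 1 late (more weight on L_MCOE)."""
    return 1.0 / (1.0 + math.exp(-lam * (epoch - t)))

def amc_loss(l_mce, l_mcoe, epoch):
    """One plausible synergy: keep the correct-class term and fade in
    the complement term as training progresses."""
    return l_mce + adaptive_weight(epoch) * l_mcoe
```

At epoch t the complement term carries exactly half its full weight; λ controls how sharply the transition happens.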

III. EXPERIMENTS

A. Datasets
We used three different sleep staging benchmark datasets to validate the performance of the model. The percentage of data in each sleep stage is described in detail in Table I and Table II. 1) DRM-SUB [39]: This dataset consists of 20 full-night sleep recordings from people aged between 20 and 65 years. The twenty subjects comprised 4 males and 16 females, all healthy individuals without sleep-related disorders. The data were sampled at 200 Hz. Since the sleep EEG from the Cz-A1 channel possesses better sleep staging quality [40], the EEG signal from this channel was used in this paper to evaluate the model. On the DRM-SUB dataset, the AASM [41] staging criteria were used to classify sleep into five stages: W, N1, N2, N3 and R.
2) SleepEDF-20 [42]: This dataset consists of data from 10 male and 10 female subjects between the ages of 24 and 35. Due to equipment problems, data were collected for only one night for Subject 13; data were collected for two nights for each of the other subjects. The collected data were divided into 30-second epochs, which sleep experts manually classified into eight categories (W, N1, N2, N3, N4, R, MOVEMENT, UNKNOWN). N3 and N4 were merged into N3 according to the AASM staging criteria [43]. In addition, we removed the MOVEMENT and UNKNOWN epochs during preprocessing.
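The SleepEDF-20 preprocessing just described (merge N3/N4, drop MOVEMENT and UNKNOWN) can be sketched as follows; the label strings are illustrative, since the raw annotations use dataset-specific codes:

```python
# Map the raw 8-class annotations to the 5 AASM-style classes used here.
MERGE = {"W": "W", "N1": "N1", "N2": "N2",
         "N3": "N3", "N4": "N3",   # N3 and N4 merged into N3
         "R": "R"}                 # MOVEMENT / UNKNOWN are simply dropped

def preprocess(epochs, labels):
    """Keep only 30-second epochs whose label maps to one of the 5 stages."""
    return [(e, MERGE[l]) for e, l in zip(epochs, labels) if l in MERGE]
```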
3) SleepEDF-ST [44]: This dataset is a subset of the Sleep-EDF extended dataset. The sleep telemetry subset is composed of 22 Caucasian subjects, all of whom have mild sleep disorders. These subjects were between 18 and 79 years of age, with 7 males and 15 females. The sampling frequency of this dataset was 100 Hz, and this dataset was labeled according to the R&K sleep staging criterion.
The experiment was divided into three parts. First, the performance of our algorithm was compared with the state of the art for sleep staging with single-channel EEG. To ensure the accuracy of the comparison, the results of the other algorithms on DRM-SUB and SleepEDF-20 are taken from the published literature. Second, we performed ablation experiments on the proposed algorithm to verify the performance contribution of each improvement step. Finally, the classification performance of the AMC-MVCNN was tested under the R&K staging standard. In this paper, the average classification accuracy, Cohen's kappa coefficient (kappa), precision, sensitivity, the per-class F1-score and the macro-averaged F1-score were used to measure the model's sleep staging performance.
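For reference, kappa and the macro-averaged F1-score can be computed from the confusion matrix; a self-contained sketch consistent with the standard definitions:

```python
import numpy as np

def confusion(y_true, y_pred, k):
    cm = np.zeros((k, k), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def cohen_kappa(y_true, y_pred, k):
    cm = confusion(y_true, y_pred, k)
    n = cm.sum()
    po = np.trace(cm) / n                          # observed agreement
    pe = (cm.sum(0) * cm.sum(1)).sum() / n ** 2    # chance agreement
    return (po - pe) / (1 - pe)

def macro_f1(y_true, y_pred, k):
    cm = confusion(y_true, y_pred, k)
    f1 = []
    for j in range(k):
        tp = cm[j, j]
        prec = tp / max(cm[:, j].sum(), 1e-12)     # precision of class j
        rec = tp / max(cm[j, :].sum(), 1e-12)      # sensitivity of class j
        f1.append(0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec))
    return float(np.mean(f1))
```

Because the macro average weights every class equally, it is far more sensitive to minority-class performance (e.g., N1) than overall accuracy is, which is why it is reported alongside accuracy here.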

B. Comparison of AMC-MVCNN and State-of-the-Art
We compared the overall performance of the AMC-MVCNN with other superior algorithms for sleep staging with single-channel EEG. Table III shows the comparison results on the DRM-SUB dataset. The average classification accuracy of our model improves by 8.46% compared to the model that also uses only the CNN structure, and by 3.91% compared to the improved CNN-HMM. This indicates that our adjustments improve accuracy when the same type of network model is used. Our model also improves by 2.64%-3.85% compared to models using only LSTM or its improved versions, indicating that our model has stronger sleep staging capability than the LSTM. Compared with the hybrid CNN-LSTM models, our model improves the classification accuracy by 3.39%-3.43% on average, indicating that our multi-view CNN achieves higher classification performance than the hybrid models on DRM-SUB. Compared with AMC-MVCNN, EEMD+RUSBoost uses a resampling technique to balance the training samples, which may change the original EEG temporal distribution and increase the memory overhead. In addition, AMC-MVCNN uses only two CNN models with different dimensions, so it is less complex and easier to fit, and it effectively improves the recognition of minority classes compared with methods that improve classification accuracy by mixing multiple model structures. When the test set is class-imbalanced, the average classification accuracy may be too biased toward the frequent classes to fully reflect the performance of a classification algorithm, so we also compare from a consistency perspective. Our model has a larger kappa value, which indicates that our algorithm achieves stronger classification performance on class-imbalanced data.
To validate the classification performance of our model for each sleep stage on the DRM-SUB dataset in more detail, we compared the per-stage precision and sensitivity with other state-of-the-art approaches; the detailed data are shown in Table IV. In terms of precision, our model achieves the highest precision in the four sleep stages W, N1, N3 and R; the percentages of sleep stages in the DRM-SUB dataset are ranked N1<R<W<N3<N2. Compared with the best values produced by the comparison models, our precision improves in all four stages N1, R, W and N3, with the highest improvement of 9.33% in the N1 stage, which accounts for the least data, and 4.95% in the R stage, which accounts for the second-least. The improvement in the N3 and W stages was 1.05%-1.54%. In N2, the class with the largest sample size, our precision is 0.62% lower than the best value. This indicates that our model effectively improves the precision of the minority classes. In terms of sensitivity, our model is highest in the four stages W, N1, N2 and R, with a maximum improvement of 3.23% in stage W; in stage N3, our sensitivity is 0.7% lower than the best comparison value. Overall, our model improves in the three sleep stages with smaller sample sizes, N1, R and W, in both precision and sensitivity, indicating better classification performance for the minority classes of the DRM-SUB dataset.
To more visually compare the performance of our model with the overall state-of-the-art classification for the 5 sleep stages, we compared the F1-score values of the models. The overall comparison results are shown in Fig. 4. Our model's F1-score achieves the highest values for all 5 sleep stages, indicating that our model offers a superior sleep staging capability.
To verify the generalization ability of the model on different datasets, we also compared AMC-MVCNN with the state of the art on the SleepEDF-20 dataset. Table V presents the model comparison results on the SleepEDF-20 dataset. In terms of average classification accuracy, our model improves by 2.52%-5.52% compared to models that also use only the CNN structure, and by 0.44%-1.62% compared to the improved CNNs. This shows that our improvements enhance the average classification accuracy of the model relative to CNN+RNN and 1-max CNN, which use class imbalance processing techniques. In addition, the AMC-MVCNN model is simple: it improves classification performance only through the loss function, so its fitting cost is lower than that of the hybrid architectures among the comparison methods, and it effectively improves the recognition of minority classes in sleep staging. We also compared models in terms of consistency: our model obtains a larger kappa value, indicating that AMC-MVCNN achieves stronger classification performance on class-imbalanced data. Similarly, for the SleepEDF-20 dataset, we compared the per-stage classification performance of AMC-MVCNN with the state of the art. Table VI shows the comparison results on the SleepEDF-20 dataset. Based on the precision values, our model achieved the highest values in three sleep stages: N1, N3 and R. In the N1 stage, which has the smallest sample size, precision improved by 9.47% over the best of the other state-of-the-art models; it was slightly worse than the comparison models in the W and N2 stages.

(Table I: Overview of the DRM-SUB and SleepEDF-20 sleep datasets with the AASM criterion. Table II: Overview of the SleepEDF-ST sleep dataset with the R&K criterion. Table III: Comparison results on DRM-SUB.)
From the perspective of sensitivity, our model obtained the highest values in the three sleep stages N1, N2 and R. The sensitivity in the N1 stage, which has the smallest sample size, improved by 7.12% compared to the best value among the comparison models. The sensitivity values in the W and N3 stages were slightly worse than those of the comparison models. Overall, from both the precision and sensitivity perspectives, AMC-MVCNN gains improvement in the N1 stage with the smallest sample size, and the improvement in the N1 stage is much larger than that in the other sleep stages for both metrics. This result indicates that our model performs better for classes with smaller sample sizes.
We also compared the F1-score values of the models. The overall comparison results are shown in Fig. 5. Our model's F1-score values are the highest for the N1, N2, N3 and R stages, while in the W phase, our model's F1-score values are not competitive.

C. Ablation Experiments
To verify the effectiveness of each step of the model improvement, we conducted ablation experiments on the proposed algorithm.

(Table V: Comparison of the average classification accuracy and kappa values between our model and other state-of-the-art models on the SleepEDF-20 dataset; '-' indicates that no relevant data are provided in the literature. Table VI: Per-stage comparison results on the SleepEDF-20 dataset.)

D. Classification Performance Under the R&K Staging Standard
The results under the R&K staging standard are shown in Table VIII. Based on these results, our proposed AMC-MVCNN method achieves the highest overall classification accuracy compared with the RW (reweighting) and SMOTE class imbalance techniques. In addition, the F1-score values for the six sleep stages show that AMC-MVCNN improves the classification performance in all stages, with the greatest improvement in stage S4, which contains the fewest samples. AMC-MVCNN does not need to introduce new synthetic samples and has less computational overhead than SMOTE. Compared with the reweighting technique, AMC-MVCNN fits the model more easily and identifies S4 better.

E. Discussion
Existing methods for improving the accuracy of the minority classes in sleep staging still have shortcomings. Reweighting not only requires manual tuning of hyperparameters but also tends to make model optimization unstable and difficult under extreme data imbalance or in large-scale scenarios. Oversampling a minority class may lead to overfitting of that class. The way data augmentation techniques enlarge the sample size of a minority class of sleep EEG is controversial: it is uncertain whether the EEG signals generated by augmentation match the true EEG distribution and whether the temporal correlation of the original signal is retained, and augmentation may even introduce unknown noise and increase the cost of model fitting. The AMC-MVCNN method proposed in this paper avoids destroying the original sleep EEG temporal structure and introducing unknown information by adjusting the generalization margin. In addition, the minority class N1 is highly susceptible to misclassification because of its greater similarity to other sleep stages. The AMC-MVCNN method reduces the confidence of incorrect classes by suppressing sample error information, which further and indirectly improves the confidence of correct classification and the identification of minority classes. AMC-MVCNN achieved good classification performance on all three datasets, indicating good generalization ability. Its performance improvement was most significant in the minority class N1 of sleep staging, a result that may have clinical implications, because better recognition of the N1 stage is potentially valuable in aiding the diagnosis and evaluation of multiple sleep disorders, such as episodic narcolepsy [45].
The AMC-MVCNN model proposed in this paper learns complementary information from multi-view data and improves the representation mining of minority class samples, but it may also introduce redundant features. Therefore, future work could introduce an attention mechanism or a feature selection mechanism for further learning of the joint representations extracted by AMC-MVCNN, which would help improve the recognition of minority classes. In addition, most existing algorithms for improving the accuracy of minority classes (e.g., stage N1) work by increasing the sample size or adjusting weights, and they usually ignore the unique characteristics of the N1 stage of sleep EEG. Learning of the special waveforms of the N1 stage (e.g., vertex waves) could therefore be strengthened by introducing prior knowledge in the future to further improve the recognition of the N1 stage.

IV. CONCLUSION
To mitigate the impact of imbalanced sleep data on the performance of a sleep staging model, we propose an adaptive margin aware complement-cross entropy loss based multi-view CNN. First, we learn complementary feature representations from the raw sleep data and the time-frequency images by building two-stream CNNs with different convolutional dimensions, further mining the effective information of the minority classes. Second, we propose a margin aware factor that considers the relative relationship between the numbers of minority and frequent class samples, encouraging minority classes to obtain larger margins and thus improving their classification accuracy. In addition, we improve the complement entropy by means of the margin aware factor and propose an adaptive margin aware complement-cross entropy, enabling the model not only to learn correct information but also to achieve error neutralization by suppressing incorrect predictions. Finally, we tested the method on three benchmark sleep datasets, and the results demonstrate that we improved not only the overall accuracy of sleep staging but also the classification accuracy for the minority classes. The AMC-MVCNN has been considered only from the perspective of improving the method itself; future work can combine new approaches such as contrastive learning with the characteristics of sleep EEG to further improve minority-class accuracy by mining more discriminative representations of the minority sleep stages.