Mixture of Experts for EEG-Based Seizure Subtype Classification

Epilepsy is a pervasive neurological disorder affecting approximately 50 million individuals worldwide. Electroencephalogram (EEG) based seizure subtype classification plays a crucial role in epilepsy diagnosis and treatment. However, automatic seizure subtype classification faces at least two challenges: 1) class imbalance, i.e., certain seizure types are considerably less common than others; and 2) a lack of a priori knowledge integration, so that a large number of labeled EEG samples are needed to train a machine learning model, particularly a deep learning model. This paper proposes two novel Mixture of Experts (MoE) models, Seizure-MoE and Mix-MoE, for EEG-based seizure subtype classification. Particularly, Mix-MoE adequately addresses the above two challenges: 1) it introduces a novel imbalanced sampler to address significant class imbalance; and 2) it incorporates a priori knowledge of manual EEG features into the deep neural network to improve the classification performance. Experiments on two public datasets demonstrated that the proposed Seizure-MoE and Mix-MoE outperformed multiple existing approaches in cross-subject EEG-based seizure subtype classification. Our proposed MoE models may also be easily extended to other EEG classification problems with severe class imbalance, e.g., sleep stage classification.


I. INTRODUCTION
Epilepsy, a pervasive neurological disorder, affects a diverse global population, spanning all age groups and totaling approximately 50 million individuals [1]. Its diagnosis typically relies on the identification of two or more unprovoked seizures [2], which manifest as abrupt and unregulated electrical disturbances in the brain, causing an array of symptoms such as convulsions, loss of consciousness, and involuntary movements [3].
Electroencephalogram (EEG) is the gold standard in epilepsy diagnostics. However, manual interpretation of EEG signals for seizure identification is labor-intensive and hinges on expert knowledge, necessitating the development of intelligent epilepsy detection algorithms [4]. Furthermore, seizures have different subtypes, e.g., focal, generalized, and unknown seizures [5], classified according to their symptoms and the brain region affected. Precise seizure subtype identification enables healthcare practitioners to perform tailored treatments.
Traditional machine learning algorithms have been widely employed for automated seizure subtype classification. The initial step involves pre-processing the raw EEG signals to eliminate artifacts originating from muscle or eye movements and electrical noise [6]. Then, time domain [7], frequency domain [8], temporospatial [9], and/or nonlinear [10] features can be extracted, and feature selection [11] may also be performed. Finally, machine learning models such as Logistic Regression (LR) [9] and Multilayer Perceptron (MLP) [12] can be used for seizure classification.
Manual feature extraction can be time-consuming and sub-optimal, promoting the development of Deep Neural Networks (DNNs) for automatic EEG-based seizure subtype classification, such as EEGNet [13], CE-stSENet [14], and EEGWaveNet [15], which automatically extract features from the raw EEG data. While DNN-based approaches have demonstrated promising performance, they still face some challenges: 1) Class imbalance. Due to the extended duration of EEG recordings, the brief durations of seizures, and the varying occurrence probabilities of different subtypes, seizure datasets often exhibit considerable class imbalance [16]. Some examples are shown in Fig. 1. If class imbalance is not explicitly considered, this may lead to low balanced classification accuracy (BCA) in seizure subtype classification. 2) Lack of a priori knowledge integration. DNNs can extract EEG features automatically, but they require a large amount of labeled training data, which may not always be available. In contrast, traditional approaches employ a priori expert knowledge to manually extract EEG features, which generally requires fewer training data, but the performance may be suboptimal. It is desirable for DNNs to incorporate the a priori expert knowledge to reduce their labeled data requirement and enhance their performance.

Fig. 1. The number of different seizure subtype events in the TUSZ dataset [17]. FNSZ: focal non-specific seizure; GNSZ: generalized seizure; CPSZ: complex partial seizure; ABSZ: absence seizure; TNSZ: tonic seizure; TCSZ: tonic-clonic seizure; SPSZ: simple partial seizure; MYSZ: myoclonic seizure.

This paper presents two Mixture of Experts (MoE) [18] based models for seizure subtype classification, addressing the aforementioned challenges. Our primary contributions include:

1) We propose a novel model, Seizure-MoE, which employs an end-to-end MoE framework for seizure subtype classification. It uses existing DNN-based approaches as feature extractors and MLPs as experts, achieving improved performance with minimal additional parameters.
2) We further propose Mix-MoE, which fuses a traditional machine learning model utilizing manual features with the aforementioned Seizure-MoE. By incorporating the a priori expert knowledge embedded in traditional machine learning into the MoE, Mix-MoE can further outperform Seizure-MoE.
3) We implement a two-phase training strategy and an imbalanced sampler in Mix-MoE, to better harness the expertise of the various experts while addressing the class imbalance. Moreover, a custom-designed loss is adopted during the second training stage to further improve the performance.

The remainder of this paper is organized as follows: Section II gives a concise review of the pertinent literature. Section III proposes Seizure-MoE and Mix-MoE. Section IV evaluates the performance of Seizure-MoE and Mix-MoE. Finally, Section V draws conclusions and outlines some future research directions.

II. RELATED WORKS
This section provides a brief overview of existing seizure subtype classification algorithms, the MoE framework, and approaches for incorporating a priori knowledge in DNNs.

A. Seizure Subtype Classification
Although significant progress has been made in EEG-based seizure detection, there are only a small number of seizure subtype classification approaches.
Roy et al. [19] employed k-Nearest Neighbors, Stochastic Gradient Descent, and XGBoost for seizure subtype classification. Saputro et al. [20] utilized Mel Frequency Cepstral Coefficients, the Hjorth Descriptor, and Independent Component Analysis for EEG feature extraction, and a Support Vector Machine for seizure subtype classification. In addition to these manual feature extraction approaches, some studies have also used DNNs for automatic feature extraction. Zhang et al. [21] developed DWT-Net, a CNN inspired by Discrete Wavelet Transformation feature extraction, for seizure subtype classification. Asif et al. [22] introduced SeizureNet, an ensemble model that learns multi-spectral feature embeddings for seizure subtype classification. Peng et al. [4] extended EEGNet [13] to TIE-EEGNet, by adding a temporal information enhancement module with sinusoidal encoding before its first convolutional layer. This module enables better temporal information capture from EEG signals and improves seizure subtype classification performance.
However, none of the DNN-based approaches explicitly utilized a priori domain knowledge to facilitate seizure subtype classification.

B. Mixture of Experts
The MoE model was first introduced in 1991 [18], under the divide-and-conquer principle. It divides the problem space among multiple local experts, supervised by a routing network, allowing each expert to manage a small local region of the problem space. The routing network determines the experts to be employed for a new input, and aggregates their outputs through a weighted average [23]. The MoE possesses remarkable abilities in handling complex and heterogeneous data sources, adapting to varying input data distributions, and enhancing model performance by adaptively combining multiple experts with complementary skills. Moreover, MoE provides great interpretability and flexibility in model design, as it allows for independent analysis and adjustment of the contribution of each expert [24]. MoE models have found extensive applications in natural language processing [25], computer vision [26], speech recognition [27], and so on.
MoE models have also been used in EEG classification. Subasi [28] used the discrete wavelet transform to decompose single-channel EEG into multiple frequency sub-bands, and then input them into an MoE model. Karimu et al. [29] extracted features from Continuous Wavelet Transform scalograms of five EEG channels, and then used them in a fuzzy MoE model to classify attention-deficit/hyperactivity disorder patients and healthy children. Yang et al. [30] proposed an identity-based multi-gate MoE for human emotion recognition, which customized a portion of the model subspace for each subject based on his/her identity.
However, to our knowledge, end-to-end MoE networks have not been used for seizure subtype classification.

C. A Priori Knowledge Incorporation
Data-driven DNNs have demonstrated remarkable performance in image classification [31], speech recognition [32], natural language processing [33], etc. However, it is also interesting to incorporate a priori expert knowledge into DNNs, to reduce their labeled data requirement and enhance their performance [34]: 1) Use expert knowledge to augment or transform the inputs. For instance, Yin et al. [35] used a medical knowledge graph's adjacency information to model the attention mechanism of a recurrent neural network for learning electronic health records. 2) Add penalty terms to the loss function, reflecting constraints or a priori knowledge specific to the problem domain [34]. For instance, Fu [36] imposed concept-relationship based constraints in Hopfield network training. 3) Add specific layers or modules to the DNNs. For instance, Li and Srikumar [37] used first-order logic statements to encode domain knowledge as relations, and added connections to a DNN based on the logical constraints imposed by these relations.

DNNs for seizure subtype classification have also used various network structures to mimic traditional filters. For instance, CE-stSENet [14] employed multi-level spectral and multi-scale temporal analysis, whereas EEGWaveNet [15] utilized spatial-temporal feature extractors. However, these works used DNNs to replace traditional approaches, instead of complementing them. This paper will show that the latter may result in better performance.

III. METHODOLOGY
This section introduces two novel MoE models, Seizure-MoE and Mix-MoE, for EEG-based seizure subtype classification. Our code is available at https://github.com/ZhenbangDu/Seizure_MoE.

A. Seizure-MoE
Fig. 2 illustrates the structure of the proposed Seizure-MoE, where $f_\theta$ can be any existing DNN that takes raw EEG signals $X \in \mathbb{R}^{C \times T}$ as input, in which $C$ is the number of channels, and $T$ the number of time domain samples. By removing the final classification layer, we obtain the automatically extracted feature $x = f_\theta(X) \in \mathbb{R}^{1 \times D}$, where $D$ is the dimensionality of the features. Two-layer MLPs are used as the expert networks ($Exp$), and a single fully connected layer as the routing network ($Router$), both of which take $x$ as the input. Overall, Seizure-MoE is an end-to-end neural network.
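To make this structure concrete, here is a minimal PyTorch sketch of the architecture described above. It is an illustration under stated assumptions (the class name, hidden layer size, and bias-free router are ours, not taken from the released code), and it uses plain Softmax routing here; the Gaussian noise term of the actual routing rule is sketched after the routing equation below.

```python
import torch
import torch.nn as nn

class SeizureMoE(nn.Module):
    """Sketch of Seizure-MoE: a shared backbone f_theta (any DNN with its
    final classification layer removed), n two-layer MLP experts, and a
    router realized as a single fully connected layer."""

    def __init__(self, backbone, feat_dim, n_experts, n_classes, hidden=64):
        super().__init__()
        self.backbone = backbone                   # maps (B, C, T) -> (B, D)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_classes))
            for _ in range(n_experts)])
        self.router = nn.Linear(feat_dim, n_experts, bias=False)  # W_R

    def forward(self, X):
        x = self.backbone(X)                       # x = f_theta(X), (B, D)
        w = torch.softmax(self.router(x), dim=-1)  # routing weights, (B, n)
        y = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n, K)
        return (w.unsqueeze(-1) * y).sum(dim=1)    # weighted average, (B, K)
```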
The output $y_o$ of Seizure-MoE is
$$y_o = \sum_{i=1}^{n} \omega_i(x)\, y_i(x),$$
where $n$ represents the number of expert networks, and $\omega_i(x)$, the $i$th component of $\omega(x)$, denotes the weight assigned by the routing network to the $i$th expert network output $y_i(x)$.
To compute the routing coefficients, we initially apply a trainable weight matrix $W_R \in \mathbb{R}^{D \times n}$ to the input, and subsequently pass the result through a Softmax function. Additionally, during training, we introduce adjustable Gaussian noise into the routing network's output before applying the Softmax function, to facilitate load balancing among the experts [38]. The noise level for each component is determined by a secondary trainable weight matrix $W_{noise} \in \mathbb{R}^{D \times n}$, so $\omega(x)$ is formulated as
$$\omega(x) = \mathrm{Softmax}\big(x W_R + \mathrm{Randn}() \cdot \mathrm{Softplus}(x W_{noise})\big),$$
where $\mathrm{Randn}()$ generates a random number following the standard normal distribution, and $\mathrm{Softplus}(x W_{noise}) = \log\big(1 + e^{x W_{noise}}\big)$.
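This routing rule translates almost directly into code. Below is a minimal sketch; the function name and the convention of injecting noise only during training are our assumptions.

```python
import torch
import torch.nn.functional as F

def noisy_routing(x, w_router, w_noise, training=True):
    """omega(x) = Softmax(x W_R + Randn() * Softplus(x W_noise)).
    x: (B, D) features; w_router, w_noise: (D, n) trainable matrices."""
    logits = x @ w_router                        # x W_R, shape (B, n)
    if training:                                 # noise only while training
        noise_std = F.softplus(x @ w_noise)      # per-expert noise level
        logits = logits + torch.randn_like(logits) * noise_std
    return F.softmax(logits, dim=-1)             # routing weights omega(x)
```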
Considering both the classification performance and the balance among different experts, the training loss of Seizure-MoE is
$$L = L_{cls}(\mathcal{B}) + L_{Imp}(\mathcal{X}) + L_{Load}(\mathcal{X}),$$
where $\mathcal{B} \in \mathbb{R}^{B \times C \times T}$ is a batch of $B$ training samples, $\mathcal{X}$ the corresponding batch of features, $L_{cls}$ is the cross-entropy loss, and $L_{Imp}$ and $L_{Load}$ are introduced to balance the contributions from different experts [38]:

1) $L_{Imp}$ encourages uniform routing weights across all experts:
$$L_{Imp}(\mathcal{X}) = \mathrm{CV}\big(\{Imp_i(\mathcal{X})\}_{i=1}^{n}\big)^2,$$
where $Imp_i$ is the importance of expert $i$, and $\mathrm{CV}$ the coefficient of variation [39]:
$$\mathrm{CV}(\cdot) = \frac{\mathrm{Std}(\cdot)}{\mathrm{Mean}(\cdot)},$$
in which $\mathrm{Std}(\cdot)$ is the standard deviation and $\mathrm{Mean}(\cdot)$ the mean. The importance of an expert, $Imp_i$, is the sum of the routing values across $\mathcal{X}$:
$$Imp_i(\mathcal{X}) = \sum_{x \in \mathcal{X}} \omega_i(x).$$

2) $L_{Load}$ facilitates balanced assignments of the training samples across all experts [38], [40]:
$$L_{Load}(\mathcal{X}) = \mathrm{CV}\big(\{Load_i(\mathcal{X})\}_{i=1}^{n}\big)^2,$$
where $Load_i(\mathcal{X}) = \sum_{x \in \mathcal{X}} P_i(x)$ is the load of the $i$th expert, in which $P_i(x)$ is the likelihood that $\omega_i(x)$ is non-zero:
$$P_i(x) = \Phi\left(\frac{(x W_R)_i - \mathrm{excluding}(\omega(x), i)}{\mathrm{Softplus}\big((x W_{noise})_i\big)}\right),$$
where $\mathrm{excluding}(\omega(x), i)$ is the minimum of $\omega(x)$ excluding the $i$th component, and $\Phi(\cdot)$ the cumulative distribution function of the standard normal distribution.
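The two balance terms can be sketched directly from the definitions above. The threshold computation in load_loss is our reading of the minimum-excluding rule, i.e., an approximation of the scheme in [38] rather than the authors' exact implementation.

```python
import torch
from torch.distributions import Normal

def cv_squared(v, eps=1e-10):
    """Squared coefficient of variation, CV(.)^2 = (Std / Mean)^2."""
    return v.var() / (v.mean() ** 2 + eps)

def importance_loss(weights):
    """L_Imp: weights is (B, n), the routing values omega(x) over a batch."""
    importance = weights.sum(dim=0)              # Imp_i = sum_x omega_i(x)
    return cv_squared(importance)

def load_loss(clean_logits, noise_std):
    """L_Load: P_i(x) is the probability, under the Gaussian noise model,
    that expert i's logit exceeds the smallest competing logit.
    clean_logits: (B, n) noise-free router logits; noise_std: (B, n) > 0."""
    B, n = clean_logits.shape
    mask = torch.eye(n, dtype=torch.bool, device=clean_logits.device)
    others = clean_logits.unsqueeze(1).expand(B, n, n)
    others = others.masked_fill(mask, float('inf'))  # exclude component i
    threshold = others.min(dim=-1).values            # (B, n)
    p = Normal(0.0, 1.0).cdf((clean_logits - threshold) / noise_std)
    return cv_squared(p.sum(dim=0))                  # Load_i = sum_x P_i(x)

# Total loss for a batch (y_o: MoE output, w: routing weights,
# logits/std: pre-Softmax router outputs and noise levels):
# loss = F.cross_entropy(y_o, y) + importance_loss(w) + load_loss(logits, std)
```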

B. Mix-MoE
Prior work [34] has suggested that incorporating a priori knowledge into DNNs can enhance model performance. Some studies [19], [20] have also demonstrated the significance of manual features, derived from raw EEG signals using time or frequency domain expert knowledge, for seizure subtype classification. Furthermore, traditional machine learning algorithms with handcrafted features have achieved performance comparable to that of DNNs [4]. These results indicate the potential of integrating manual features, as a priori knowledge, into DNNs to further improve their performance.
To leverage both manual EEG features and raw EEG signals, we propose Mix-MoE, which combines a traditional machine learning classifier using manual features with Seizure-MoE using raw EEG signals. As depicted in Fig. 3, we train Mix-MoE in two stages [41]:

1) Imbalanced Pretraining.
To encourage the diversity among different experts, we employ pretraining to enhance each expert's ability to handle its specialized task. Due to the severe class imbalance in seizure subtype classification, previous research [4] used a balanced sampler (e.g., PyTorch's WeightedRandomSampler, https://pytorch.org/docs/stable/_modules/torch/utils/data/sampler.html#WeightedRandomSampler) to ensure an equal frequency of samples from each class in each data batch. This was achieved by setting the probability of a sample from Class $j$ appearing in a batch to
$$Pr_j = \frac{N/n_j}{\sum_{k=1}^{K} N/n_k},$$
where $N$ is the total number of samples, $n_j$ the number of samples in Class $j$, and $K$ the number of classes.
Assume the number of experts equals the number of classes. We adopt imbalanced samplers that prioritize a different class for each expert during training, by introducing a factor $U$ to $Pr_j$ to boost Class $j$ for the $j$th expert:
$$\widetilde{Pr}_k = \begin{cases} \dfrac{U \cdot Pr_j}{U \cdot Pr_j + \sum_{m \neq j} Pr_m}, & k = j,\\[2mm] \dfrac{Pr_k}{U \cdot Pr_j + \sum_{m \neq j} Pr_m}, & k \neq j. \end{cases}$$
At this stage, a shared $f_\theta$ is used by all experts, and $Router$ is not utilized. As a result, the imbalanced pretraining loss is the sum of the experts' cross-entropy losses, $\sum_{i=1}^{n} L_{cls,i}(\mathcal{B})$. A code sketch of both samplers is given below.
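Both samplers can be realized with PyTorch's WeightedRandomSampler (referenced above). The helper below is a sketch; the function name and the implicit renormalization via per-sample weights are our assumptions.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels, boost_class=None, U=1.0):
    """With boost_class=None this is the balanced sampler: every class is
    equally likely in a batch. Setting boost_class=j multiplies Class j's
    probability by the imbalance rate U, giving the imbalanced sampler used
    to pretrain the expert specializing in Class j."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    class_w = {c: 1.0 / cnt for c, cnt in zip(classes, counts)}  # balance
    if boost_class is not None:
        class_w[boost_class] *= U                # boost the target class
    weights = [class_w[y] for y in labels]       # one weight per sample
    return WeightedRandomSampler(weights, num_samples=len(labels),
                                 replacement=True)
```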

2) Joint Training.
In this stage, we extract manual features from the raw EEG data using a priori knowledge, and then train a traditional machine learning classifier on them as the first expert (denoted as $Exp_0$) in Mix-MoE. Specifically, we adopt the parameters obtained from the pretraining stage and $Exp_0$ to initialize the expert networks in Mix-MoE, and then add the routing network to weight all experts' outputs, where the input to $Router$ is formed by concatenating the manual features with the features extracted by $f_\theta$. The final joint training loss is
$$L = L_{cls}(\mathcal{B}) + L_{Imp}(\mathcal{X}) + L_{Load}(\mathcal{X}) + L_{KL}(\mathcal{X}),$$
where $L_{KL}$ performs knowledge distillation [42] to make the DNN learn the a priori knowledge from $Exp_0$:
$$L_{KL}(\mathcal{X}) = \sum_{x \in \mathcal{X}} \mathrm{KL}\big(y_0(x) \,\|\, y_o(x)\big),$$
in which $y_0(x)$ is the output of $Exp_0$ for input $x$.
In this way, Mix-MoE can make use of both manual features (a priori knowledge) and the features extracted automatically by DNNs.
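As a sketch, one joint-training step could combine the classification and distillation terms as follows. The trade-off weight lambda_kl is an assumed hyperparameter, the use of F.kl_div is our choice of distillation loss, and the L_Imp/L_Load balance terms defined for Seizure-MoE are omitted for brevity.

```python
import torch.nn.functional as F

def joint_step(model, X, y, exp0_probs, lambda_kl=1.0):
    """exp0_probs: (B, K) class probabilities from the pretrained Exp_0,
    e.g., gbdt.predict_proba(manual_features) for this batch."""
    logits = model(X)                            # Mix-MoE output, (B, K)
    loss_cls = F.cross_entropy(logits, y)
    log_q = F.log_softmax(logits, dim=-1)
    # KL(y_0(x) || y_o(x)): distill Exp_0's a priori knowledge into the DNN
    loss_kl = F.kl_div(log_q, exp0_probs, reduction='batchmean')
    return loss_cls + lambda_kl * loss_kl
```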
IV. EXPERIMENTS

This section conducts experiments to evaluate the performance of Seizure-MoE and Mix-MoE on two public seizure datasets.

A. Datasets
Two public seizure datasets, TUSZ and CHSZ, were used in our experiments. Their characteristics are shown in Table I.
The TUSZ dataset [17] consists of EEG recordings from 68 epilepsy patients, selected from the Temple University Hospital EEG Data Corpus [43], with subtype annotations.
The CHSZ dataset [4] was collected from Wuhan Children's Hospital, affiliated with Tongji Medical College of Huazhong University of Science and Technology. It includes 22-channel EEG signals from 27 infant and child patients, with varying sampling rates, e.g., 500 Hz, 698 Hz, and 1,000 Hz. As in [4], we selected Focal Seizures, Absence Seizures, Tonic Seizures, and Tonic-Clonic Seizures, to facilitate the comparison of four-class classification results on both datasets.

B. Preprocessing
The same preprocessing was performed on both datasets. We first down-sampled the EEG signals to 128 Hz, and re-referred them to the 20 common channels [14] listed in Table II. Then, a 50 Hz notch filter, a 64 Hz low-pass filter, and detrending were applied to remove artifacts and noise.
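A minimal SciPy sketch of this pipeline is given below. The filter orders are assumptions, and since 64 Hz is exactly the Nyquist frequency at 128 Hz, the sketch places the low-pass cutoff just below it.

```python
from scipy.signal import butter, detrend, filtfilt, iirnotch, resample_poly

def preprocess(eeg, fs_orig, fs_new=128):
    """eeg: (channels, samples) array sampled at fs_orig Hz."""
    eeg = resample_poly(eeg, fs_new, int(fs_orig), axis=-1)  # down-sample
    b, a = iirnotch(w0=50.0, Q=30.0, fs=fs_new)              # 50 Hz notch
    eeg = filtfilt(b, a, eeg, axis=-1)
    b, a = butter(4, 63.0, btype='low', fs=fs_new)           # ~64 Hz low-pass
    eeg = filtfilt(b, a, eeg, axis=-1)
    return detrend(eeg, axis=-1)                             # remove trend
```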
Three-fold cross-validation was used to assess the performance of our model. Subjects were randomly shuffled and divided into three subsets, ensuring a similar number of events for each seizure subtype. In each fold, one subset was used as the test set, and the other two were combined for training and validation. To avoid overlap between the training and validation sets, we divided the seizure events into two halves, one for training and the other for validation. We then segmented all seizure events into samples using a 4-second sliding window, with 50% overlap for the training set and no overlap for the validation and test sets, as illustrated in Fig. 4. The total number of samples is shown in Table I.
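The windowing step can be sketched as follows; at 128 Hz a 4-second window is 512 samples, and the function name is ours.

```python
def segment(event, fs=128, win_sec=4, overlap=0.5):
    """Cut one seizure event (channels x samples) into fixed-length windows.
    Use overlap=0.5 for training data and overlap=0.0 for validation/test,
    matching the protocol described above."""
    win = int(win_sec * fs)
    step = max(1, int(win * (1 - overlap)))
    return [event[:, s:s + win]
            for s in range(0, event.shape[1] - win + 1, step)]
```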

C. Experimental Settings
Seizure-MoE and Mix-MoE were compared with existing traditional machine learning approaches using manual features and DNNs using raw EEG signals.
For the traditional baselines, the average and kurtosis of the db5 wavelet decomposed components of the raw EEG data were used as manual features, and Support Vector Machine (SVM), Ridge Classifier (RC), Logistic Regression (LR), and Gradient Boosting Decision Tree (GBDT) as classifiers, whose hyperparameters were determined on the validation set. To assess the model performance more realistically, we focused on event-level rather than sample-level performance. The test events were divided into 4-second fragments, and we employed majority voting to combine the classifications of fragments belonging to the same event into a single class. In the final evaluation, we computed both BCAs and F1 scores by comparing the model's predicted label for each seizure event against the ground-truth label, instead of doing so for each sample (fragment). We repeated all experiments 10 times, and report the averages and standard deviations.
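The event-level evaluation can be sketched as follows. The weighted F1 averaging is our assumption, since the averaging scheme is not specified here.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

def event_level_scores(frag_preds, frag_event_ids, event_labels):
    """frag_preds / frag_event_ids: per-fragment predictions and event ids;
    event_labels: dict mapping event id -> ground-truth label."""
    event_ids = sorted(event_labels)
    y_pred = [np.bincount(frag_preds[frag_event_ids == eid]).argmax()
              for eid in event_ids]              # majority vote per event
    y_true = [event_labels[eid] for eid in event_ids]
    return (balanced_accuracy_score(y_true, y_pred),      # BCA
            f1_score(y_true, y_pred, average='weighted'))
```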

D. Main Experimental Results
Table III shows the experimental results. Observe that: 1) Traditional machine learning approaches using manual features achieved performance comparable to that of DNNs. 2) In most cases, MoE improved the DNN performance on both datasets, and MoE-CE-stSENet demonstrated a considerable improvement on CHSZ, with about a 0.08 increase in both BCA and F1. 3) Through the integration of a priori knowledge from traditional machine learning classifiers, Mix-EEGNet and Mix-EEGWaveNet attained notable performance improvements over their respective MoE counterparts, i.e., the BCA increased by at least 0.06 and the F1 by 0.02 on TUSZ, and the BCA by 0.12 and the F1 by 0.18 on CHSZ. 4) Mix-TIENet was the best performer on TUSZ, and Mix-CE-stSENet on CHSZ. 5) Mix-MoE outperformed both GBDT and Seizure-MoE, indicating the benefits of fusing DNNs and a priori knowledge.

Table III also gives the results of ensembling various DNNs, where the number of networks in the ensemble equaled the number of experts in the MoE model. Compared with ensemble learning, Mix-MoE achieved better performance with fewer parameters, as it used a single shared feature extractor $f_\theta$.
To validate whether the performance improvements of Mix-MoE over the DNN baselines in Table III were statistically significant, we performed paired t-tests on the BCAs and F1 scores. The null hypothesis was that the performance difference between two algorithms has zero mean, and it was rejected if p ≤ 0.05. The results are shown in Table IV. Mix-MoE achieved statistically significant BCA and F1 improvements over the DNN baselines in most cases.
In summary, our proposed Seizure-MoE can effectively utilize the features generated by DNNs, whereas Mix-MoE can further improve the performance by incorporating a priori knowledge of the manual features.

E. Effect of the Number of Experts
To demonstrate the impact of the MoE structure on the DNNs, we conducted experiments to analyze the sensitivity of Seizure-MoE to the number of experts.The experimental results are depicted in Fig. 5.
On TUSZ, MoE improved the BCA for all four backbone networks, and also enhanced the F1 score for EEGNet, TIENet, and CE-stSENet. As the number of experts increased, the performance first increased (more experts gave finer partitions of the feature space) and then decreased (due to the increased number of parameters and the limited amount of training data), which is intuitive. Similarly, on CHSZ, MoE improved the BCA of EEGWaveNet, TIENet, and CE-stSENet, and substantially boosted the F1 score of CE-stSENet.
In summary, Seizure-MoE consistently enhanced the network's performance, particularly BCA.
For the remaining experiments, we employed 4 experts, equal to the number of seizure subtypes in the datasets.

F. Effect of the Imbalanced Sampler
To validate the efficacy of our proposed imbalanced sampler, we compared the performance of Mix-MoE with and without an imbalanced sampler, as well as different imbalance rates U .The experimental results are shown in Fig. 6.
Compared with the "None" baseline (without imbalanced pretraining), imbalanced sampling improved the BCA and the F 1 score, indicating that imbalanced pretraining could allow different experts to focus on different classes.After introducing traditional machine learning models as a priori knowledge, the homogenization of outputs from different experts could be prevented, thus enhancing the MoE model's ability to handle imbalanced data.
As the imbalance rate U increased from 1 to 4, the BCA first increased and then decreased. When the imbalance rate increased, the input data of different experts became more concentrated on a certain class, enhancing the expertise of the corresponding expert and thus improving the BCA. However, if the imbalance rate was too large, each expert may be completely dominated by a certain class, making the experts exclusive to each other. This led to performance degradation, since the ideal relationship among different experts should be "competition with cooperation".
For the remaining experiments, we set the imbalance rate to be equal to the number of classes, which was 4, for simplicity.

G. Effect of a Priori Knowledge
To demonstrate the effectiveness of incorporating a priori knowledge into DNNs, we conducted experiments using different a priori knowledge sources in Mix-MoE: None (no a priori knowledge), and LR/RC/SVM/GBDT as $Exp_0$. The results are shown in Fig. 7.
On TUSZ, a priori knowledge consistently improved the F1 scores. Although the BCA had a few counter-examples, in most cases incorporating a priori knowledge was still beneficial. Moreover, stronger a priori knowledge, e.g., GBDT, generally led to better performance. Similar observations can also be made on the CHSZ dataset.
In summary, incorporating a traditional machine learning classifier as a priori knowledge in Mix-MoE improved the overall classification performance.

H. Expert Allocation of Each Class
To examine which experts the MoE model relies on more heavily when classifying samples from different classes, Fig. 8 shows, for Mix-CE-stSENet, the proportion of samples of each class for which each expert received the Top-1 Router probability.
For the majority classes (e.g., FSZ and TCSZ in TUSZ, and FSZ in CHSZ), the classification relied more on the a priori knowledge provided by $Exp_0$; however, for the minority classes, the classification depended more on the expert corresponding to that class during imbalanced pretraining.
These results demonstrated that Mix-MoE effectively combined the traditional machine learning classifier with DNNs. For majority classes with a large number of samples, using $Exp_0$ for classification produced more stable and confident results, whereas for minority classes with few samples, imbalanced pretraining enabled an expert to focus on its feature subspace, leading to more specialized experts and enhancing the model's ability to handle imbalanced data.
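The proportions visualized in Fig. 8 can be computed from the router outputs as follows (a sketch; the array names are ours).

```python
import numpy as np

def expert_allocation(routing_weights, labels, n_experts, n_classes):
    """routing_weights: (N, n_experts) router outputs; labels: (N,) classes.
    Returns an (n_classes, n_experts) matrix whose rows give, per class,
    the fraction of samples for which each expert is the Top-1 choice."""
    top1 = routing_weights.argmax(axis=1)        # chosen expert per sample
    alloc = np.zeros((n_classes, n_experts))
    for c in range(n_classes):
        chosen = top1[labels == c]
        if len(chosen):
            alloc[c] = np.bincount(chosen, minlength=n_experts) / len(chosen)
    return alloc                                 # each row sums to 1
```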
V. CONCLUSION

This paper has proposed two novel MoE-based models, Seizure-MoE and Mix-MoE, for cross-subject EEG-based seizure subtype classification. Experimental results on two public seizure datasets demonstrated that they outperformed multiple existing approaches. Our experiments also showed that introducing traditional machine learning classifiers as a priori knowledge in Mix-MoE, together with our proposed imbalanced sampler, could further improve the performance.
Our future research will: 1) Explore the combination of MoE with other techniques, such as the attention mechanism and graph convolutional networks, to further improve the performance. The importance of a priori knowledge has been demonstrated in Mix-MoE; therefore, it is interesting to explore different a priori knowledge sources and their impact on the performance of Mix-MoE, as well as the use of other knowledge distillation techniques. 2) Extend Seizure-MoE and Mix-MoE to other BCI tasks or paradigms, such as sleep stage classification. Due to the varying lengths of different sleep stages, sleep stage classification also suffers from significant class imbalance [45], which may make it another suitable application of Seizure-MoE and Mix-MoE.


Fig. 8. The proportion of Top-1 selected experts for different classes. The black frames mark the most frequently selected experts. (a) TUSZ; (b) CHSZ.

TABLE I. Comparison of the event/sample numbers of TUSZ and CHSZ.

TABLE II. The 20 re-referred channels.

TABLE IV. Paired t-test results between Mix-MoE and the DNN baselines on the BCAs and F1 scores in Table III. Statistically significant ones are marked in bold.