TIE-EEGNet: Temporal Information Enhanced EEGNet for Seizure Subtype Classification

Electroencephalogram (EEG) based seizure subtype classification is very important in clinical diagnostics. However, manual seizure subtype classification is expensive and time-consuming, whereas automatic classification usually needs a large number of labeled samples for model training. This paper proposes an EEGNet-based slim deep neural network, which reduces the amount of labeled data required in EEG-based seizure subtype classification. A temporal information enhancement module with sinusoidal encoding is used to augment the first convolution layer of EEGNet. A training strategy for automatic hyper-parameter selection is also proposed. Experiments on the public TUSZ dataset and our own CHSZ dataset, collected from infants and children, demonstrated that our proposed TIE-EEGNet outperformed several traditional and deep learning models in cross-subject seizure subtype classification. Additionally, it also achieved the best performance in a challenging transfer learning scenario. Our code and the CHSZ dataset are publicly available.

Fig. 1. An example of an epileptic EEG signal with spike-and-wave discharges (in red).
Electroencephalogram (EEG), the electrophysiological signal generated by synchronized neuronal activity in the brain, is the most popular and successful non-invasive signal for epilepsy-related disease diagnosis. Epileptic EEG signals are characterized by spike-and-wave discharges (SWDs) [3], as shown in Fig. 1. Reading EEG signals for epilepsy diagnosis requires experienced clinicians. Such labor-intensive work not only demands a significant time investment, but also relies heavily on the experience and subjective judgment of the labelers. To relieve the burden on clinicians and improve treatment efficiency, automated epilepsy diagnosis algorithms are desired.

More specifically, for clinical epilepsy diagnosis, seizure detection is the first step. The next is to determine the subtype, which is important for identifying epilepsy syndromes, targeted therapies, and eligibility for epilepsy surgery [4]. According to the 2017 International League Against Epilepsy guideline [5], [6], seizure subtypes include:

1) Generalized seizures, e.g., absence seizures and tonic-clonic seizures, which affect both sides of the brain.

2) Focal seizures, e.g., simple and complex focal seizures, which originate in only one area of the brain and are distinguished by the degree to which the patient's consciousness is affected.

3) Combined generalized and focal seizures, which begin in one part of the brain and then spread to both sides.

Machine learning algorithms are promising for automatic seizure subtype classification. In a typical pipeline with traditional machine learning approaches, EEG artifacts, e.g., eye/muscle movements and electrical noise, are first removed by band-pass filtering and detrending [7], [8].
Then, time-domain features [9], frequency-domain features [10], temporal-spatial features [11], [12], or nonlinear features [13] can be extracted, selected [14], and finally fed to a classifier, e.g., a support vector machine (SVM) [10], [15] or logistic regression (LR).

Deep learning eliminates the burden of manual feature extraction, and has hence become popular in seizure classification. Asif et al. [34] proposed SeizureNet, a CNN-based neural network, which obtained a 0.620 weighted F1 score for within-patient seizure type classification. Li et al. [35] proposed CE-stSENet, a multi-scale model embedded with group convolution to process the temporal and spatial data in parallel, achieving an average F1 score of 0.937 and an accuracy of 0.920.
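This traditional pipeline can be sketched as follows; the filter order, band edges, window length, and toy feature set here are illustrative assumptions, not the settings of the cited works:

```python
import numpy as np
from scipy.signal import butter, filtfilt, detrend

def preprocess(eeg, fs, low=0.5, high=40.0):
    """Detrend and band-pass filter a (channels, samples) EEG array."""
    eeg = detrend(eeg, axis=-1)
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, eeg, axis=-1)

def simple_features(eeg):
    """Toy time-domain features per channel: mean, std, line length."""
    return np.concatenate([
        eeg.mean(axis=-1),
        eeg.std(axis=-1),
        np.abs(np.diff(eeg, axis=-1)).sum(axis=-1),
    ])

fs = 256
x = np.random.randn(19, 4 * fs)        # one 19-channel, 4-second segment
feats = simple_features(preprocess(x, fs))
print(feats.shape)                     # (57,) feature vector, e.g., for an SVM
```

The resulting feature vector would then be fed to a classifier such as an SVM or LR.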

Most existing approaches for automatic seizure classification are either patient-specific (a separate machine learning model is built for each patient, where the training and test data come from the same patient) or patient-independent (all patients' data are mixed together and then partitioned randomly into training and test sets, without distinguishing among individuals or considering temporal causality). These scenarios assume that the distributions of the training and test data are consistent, which unfortunately does not always hold in reality, due to individual differences and the non-stationarity of EEG signals. As a result, these models have limited generalization and robustness. It is desirable to learn a model from already collected data that generalizes well to new patients.

In data-scarce applications, e.g., seizure classification and brain-computer interfaces (BCIs) [36], slim neural networks are preferred to reduce overfitting.

Multiple slim CNN architectures have been proposed. For example, Ding et al. [37] proposed ACNet, which introduces an asymmetric convolution block to decompose one traditional square-kernel convolutional layer into three parallel branches of convolutional layers with extra horizontal and vertical kernels to emphasize the skeleton position. SqueezeNet [21] embeds a Fire module into the CNN to reduce the number of parameters. The Fire module consists of a squeeze layer with kernel size 1 × 1 to compress the feature maps, and an expand layer with kernel sizes 1 × 1 and 3 × 3 to recover the feature maps' resolution. Hu [...]

EEGs are also frequently used in BCIs, and hence differ-[...]

The architecture of the basic EEGNet is shown in Fig. 2. In the first block, two convolutional layers are arranged sequentially. The first layer is dedicated to capturing the input EEG signal's feature maps containing different band-pass frequencies. Its kernel size is (1, f_s/2), where f_s is the sampling frequency. The next layer executes depth-wise convolution with kernel size (C, 1), where C is the number of channels. Without full connection to the previous feature maps, this layer not only reduces the number of trainable parameters, but also explicitly learns spatial filters for each temporal filter. The outputs of Block 1 are frequency-specific spatial feature maps.
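Block 1 as described above can be sketched in PyTorch; the kernel sizes (1, f_s/2) and (C, 1) follow the text, while the filter count f1, the depth multiplier, and the activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EEGNetBlock1(nn.Module):
    """Temporal convolution followed by depth-wise spatial convolution."""
    def __init__(self, n_channels, fs, f1=8, depth=2):
        super().__init__()
        self.temporal = nn.Conv2d(1, f1, kernel_size=(1, fs // 2),
                                  padding=(0, fs // 4), bias=False)
        # groups=f1 -> separate spatial filters learned per temporal filter
        self.spatial = nn.Conv2d(f1, f1 * depth, kernel_size=(n_channels, 1),
                                 groups=f1, bias=False)
        self.bn = nn.BatchNorm2d(f1 * depth)
        self.act = nn.ELU()

    def forward(self, x):            # x: (batch, 1, C, T)
        return self.act(self.bn(self.spatial(self.temporal(x))))

fs, C, T = 256, 19, 4 * 256
block = EEGNetBlock1(C, fs)
out = block(torch.randn(2, 1, C, T))
print(out.shape)                     # (2, 16, 1, T'): frequency-specific spatial maps
```

The channel dimension collapses to 1 after the depth-wise layer, leaving one spatially filtered time series per temporal filter.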

Block 2 employs separable convolution, with a depth-wise convolutional layer followed by a point-wise convolutional layer. In addition to reducing the number of parameters, relationships within and across feature maps are obtained in a decoupled way. Finally, the classifier is a fully-connected layer with softmax activation.

For an input X ∈ R^{H×W}, let K ∈ R^{(2m+1)×(2n+1)} be the convolutional kernel. Then, the (u, v)th element of the output feature map Z ∈ R^{H×W}, Z_{u,v}, obtained from a typical dot-product convolution, is:

Z_{u,v} = Σ_{i=-m}^{m} Σ_{j=-n}^{n} K_{i,j} X_{u+i,v+j} = Σ_{i,j} (P_{u,v})_{i,j},    (3)

where (P_{u,v})_{i,j} = K_{i,j} X_{u+i,v+j} is the element-wise product between the kernel and the input patch centered at (u, v). In EnK, the convolutional kernel uses linear time encoding to embed temporal information into the convolution operation:

Z'_{u,v} = Z_{u,v} + t_v Σ_{i,j} (P_{u,v})_{i,j},    (4)

where t_v is a linear time index along the temporal dimension. According to (4), the time encoding term should satisfy the following criteria:

1) [...]

2) It should be periodic, to simulate the repeated SWDs.

3) It should preserve the time-sequence information that reflects temporal features, such as the wave shape within a period.
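The dot-product convolution described above, with the kernel-patch element-wise product P_{u,v} written out explicitly, can be checked numerically; the zero padding at the borders is an assumption of this sketch:

```python
import numpy as np

def conv_patch(X, K, u, v):
    """Element-wise product P_{u,v} between kernel K and the patch of X at (u, v)."""
    m, n = K.shape[0] // 2, K.shape[1] // 2
    Xp = np.pad(X, ((m, m), (n, n)))          # zero padding keeps the output H x W
    return K * Xp[u:u + 2 * m + 1, v:v + 2 * n + 1]

X = np.arange(16.0).reshape(4, 4)
K = np.ones((3, 3)) / 9.0                     # a 3 x 3 averaging kernel
P = conv_patch(X, K, 1, 1)
Z_uv = P.sum()                                # plain convolution output Z_{1,1}
print(Z_uv)                                   # 5.0: mean of X[0:3, 0:3]
```

Summing the elements of P_{u,v} recovers the ordinary convolution output; the temporal encodings discussed here are additive corrections built from the same patch product.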

Sinusoidal functions are used in TIE, as in Transformer [45].

More specifically, the proposed sinusoidal encoding (SE) is:

SE_v = sin(2πv/ω),    (5)

where ω is the SE period regulator.

Additionally, TIE uses the average of all elements in P_{u,v}, instead of their sum as in EnK, for temporal information encoding. As a result, the encoding has the same magnitude as the elements of P_{u,v}.

In summary, the proposed TIE module is:

Z'_{u,v} = Z_{u,v} + SE_v · R_{u,v},    (6)

where the element of the representation matrix R is calculated by:

R_{u,v} = (1/((2m+1)(2n+1))) Σ_{i=-m}^{m} Σ_{j=-n}^{n} K_{i,j} X_{u+i,v+j}.    (7)

According to previous studies [46], EEG is typically described in terms of rhythmic activities and transients. The rhythmic activities are divided into bands by frequency, as shown in Table I. Note that infants' and children's brains are still developing, and hence the rhythmic activities of their EEG signals are significantly different from adults' [47]. EEG activities also vary across individuals. Thus, it is of practical significance to establish a reliable and general model that is robust to the variations among patients.
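The TIE computation can be sketched end to end as follows; the specific sinusoidal form sin(2πv/ω) and the zero padding are assumptions of this sketch, chosen to be consistent with ω's role as a period regulator:

```python
import numpy as np

def tie_conv(X, K, omega):
    """Convolution whose output at (u, v) is Z_{u,v} + SE(v) * mean(P_{u,v})."""
    H, W = X.shape
    m, n = K.shape[0] // 2, K.shape[1] // 2
    Xp = np.pad(X, ((m, m), (n, n)))               # zero padding (sketch assumption)
    Z = np.empty((H, W))
    for u in range(H):
        for v in range(W):
            P = K * Xp[u:u + 2 * m + 1, v:v + 2 * n + 1]
            se = np.sin(2 * np.pi * v / omega)     # assumed SE form, period omega
            Z[u, v] = P.sum() + se * P.mean()      # R_avg instead of EnK's sum
    return Z

X = np.random.randn(8, 64)                          # (channels, time) toy input
Z = tie_conv(X, np.ones((3, 5)), omega=16.0)
print(Z.shape)                                      # (8, 64)
```

Because the correction term uses mean(P) rather than sum(P), its magnitude stays comparable to the individual elements of P, matching the motivation given above.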

A fixed ω in (5) may not give the optimal performance, so we select the optimal ω from a candidate list Ω. Let f_s be the sampling rate, and f_c = {0.5, 4, 8, 12, 16, 32, 64} the crucial frequency list, consisting of the lower and upper frequency bounds of all EEG bands in Table I. To correlate with the EEG frequency bands, a candidate ω of the sinusoidal encoding should make the SE period correspond to one crucial frequency, i.e., f_s/ω ∈ f_c. Thus, Ω is generated by:

Ω = { f_s / f : f ∈ f_c }.

In the training and validation stage, models with different candidate period regulators ω from Ω are trained in parallel.
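Assuming one candidate per crucial frequency (ω = f_s/f, our reading of the generation rule), the seven candidates for, e.g., f_s = 256 Hz can be listed as:

```python
# Hypothetical candidate generation: one SE period regulator per crucial
# frequency, omega = fs / fc (an assumed reading of the generation rule).
fs = 256
f_c = [0.5, 4, 8, 12, 16, 32, 64]
omegas = [fs / f for f in f_c]
print(omegas)
```

Only these seven candidates need to be evaluated, which keeps the parallel training cheap compared with a free search over ω.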

The model with the lowest validation loss is used in the test stage.

We considered a more realistic cross-patient classification scenario. Fig. 5 depicts the experimental setup, including the patient subset partition and the training/validation/test sample generation.

We used 3-fold cross-validation. Patients were shuffled and partitioned into three subsets, ensuring that the numbers of events of each seizure subtype in the three subsets were similar. Each of the three subsets took turns to be the test set, and the other two were merged as the training and validation sets. To ensure that the training and validation sets do not overlap, the first half of the events were used for training, and the last half for validation. Then, the seizure events of all sets were empirically sliced by a 4-second sliding window to generate samples. Two successive sliding windows had 50% overlap to increase the number of training samples, but no overlap at all in generating the validation and test samples.

As shown in Tables II and III, different seizure subtypes had significantly different numbers of events. Therefore, we used the balanced classification accuracy (BCA) and the weighted F1 score as our performance measures. Both are already implemented in the scikit-learn package of Python.
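The sliding-window slicing described above (4-second windows, 50% overlap for training samples, no overlap for validation/test samples) can be sketched as:

```python
import numpy as np

def slice_event(eeg, fs, win_s=4, overlap=0.5):
    """Slice a (channels, samples) event into fixed windows; overlap in [0, 1)."""
    win = win_s * fs
    step = int(win * (1 - overlap))
    starts = range(0, eeg.shape[1] - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])

fs = 256
event = np.random.randn(19, 20 * fs)                  # one 20-second seizure event
train_samples = slice_event(event, fs, overlap=0.5)   # 50% overlap
test_samples = slice_event(event, fs, overlap=0.0)    # no overlap
print(train_samples.shape[0], test_samples.shape[0])  # 9 vs 5 windows
```

The overlap nearly doubles the number of training samples from the same recording, while validation and test fragments stay disjoint.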

To simulate real-world applications, we considered event-level classification rather than sample-level classification in the final evaluation. More specifically, seizure event recordings were sliced into 4-second fragments as training/validation/test samples, and each fragment was classified with probabilities of the different seizure types. Since labels were assigned per seizure event, fragments from the same event should have the same class label, even though they may be classified as different types by the model. Thus, majority voting was used to aggregate the classifications of the fragments of the same event into a final class. In performance evaluation, both the BCA and the F1 score were calculated by comparing the classified and ground-truth labels of each seizure event, rather than each sample (fragment).

The deep learning baselines are EEGNet, EnK-EEGNet, EEGWaveNet [49], and CE-stSENet. In addition, we improved EnK-EEGNet by replacing its summation representation matrix R with the average representation matrix in (7), named Avg-EnK-EEGNet. We also implemented TIE-CE-stSENet and TIE-EEGWaveNet, which add a TIE module to the convolution layers of CE-stSENet and EEGWaveNet, respectively. [...] Table V.
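Event-level aggregation and BCA can be sketched as follows; scikit-learn's balanced_accuracy_score and f1_score(average='weighted') provide the same measures, but numpy is used here to stay self-contained:

```python
import numpy as np
from collections import Counter

def majority_vote(fragment_preds):
    """Aggregate the per-fragment predictions of one event into a single label."""
    return Counter(fragment_preds).most_common(1)[0][0]

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (BCA), matching sklearn's balanced_accuracy_score."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Three events, each classified fragment by fragment.
events = [[0, 0, 1, 0], [2, 2, 2], [1, 0, 1]]
y_pred = [majority_vote(e) for e in events]       # [0, 2, 1]
y_true = [0, 2, 2]
print(y_pred, balanced_accuracy(y_true, y_pred))  # [0, 2, 1] 0.75
```

Voting suppresses isolated fragment-level mistakes, so the event-level scores are usually more stable than sample-level ones.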

TIE-EEGNet obtained the best BCA and F1 score on CHSZ.

To evaluate the effectiveness of the average representation matrix R in (7), experiments were conducted to compare the performance of TIE-EEGNet and EnK-EEGNet with different representation matrices: summation (R_sum), maximum (R_max), and average (R_avg). Fig. 6 shows the results.

For EnK-EEGNet, both R_max and R_avg outperformed R_sum on both datasets. Between the former two, R_avg outperformed R_max on TUSZ, but the opposite was observed on CHSZ.

In any case, changing R was effective for EnK-EEGNet, though it is difficult to conclude whether R_max or R_avg was better.

For TIE-EEGNet, again both R_max and R_avg outperformed R_sum on both datasets, and R_avg was always slightly better than R_max.
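The three pooling choices differ only in how the elements of one kernel-patch product P_{u,v} are reduced; a toy patch makes the magnitudes concrete:

```python
import numpy as np

P = np.array([[0.2, -0.1, 0.4],
              [0.0,  0.3, -0.2],
              [0.1,  0.1,  0.2]])   # one element-wise kernel-patch product P_{u,v}

R_sum = P.sum()    # EnK's original encoding term; grows with the patch size
R_max = P.max()    # largest single response in the patch
R_avg = P.mean()   # same magnitude as the individual elements of P
print(R_sum, R_max, R_avg)
```

R_sum scales with the number of kernel elements, whereas R_avg stays on the same scale as the feature map itself, which matches the motivation for the averaging variant.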

In summary, replacing R_sum in the original EnK-EEGNet with R_avg was effective.

Fig. 8 shows their performances. The BCA and F1 score varied with ω, and there was not a single best ω. The ω selection approach in TIE-EEGNet almost always outperformed both a fixed ω and 20 Bayesian optimization (BO) iterations searching ω ∈ [2, 256]. Furthermore, our proposed ω selection approach only needed to evaluate seven ω candidates, so its computational cost was low.

H. Transfer Learning

A common assumption in machine learning is that the distributions of the training and test data are identical or similar, which may not hold in practice. For example, the CHSZ dataset was acquired from children and infants, whereas the TUSZ dataset covers patients of all ages. A seizure subtype classification model trained on TUSZ may perform poorly on CHSZ, due to the significant differences between adults' and children's EEG signals. In the ideal case, we should train a separate model for each patient group. However, this [...]

[...] (instead of spatial) information from the raw EEG signals.

The following directions will be considered in our future research:

1) TIE focuses on enhancing the learning of temporal information. However, the location of the epilepsy, especially the onset region, and the transmission mode are also important indicators of the epilepsy subtype. Therefore, incorporating spatial information may further improve the classification performance.

2) [...]

3) A very simple fine-tuning transfer learning approach was used in this paper. However, there are many other more sophisticated approaches [52], e.g., instance/feature/knowledge transfer, which could be used to further enhance the transfer learning performance. Additionally, it has been shown that data augmentation [53] and data alignment [54], [55] can significantly improve the transfer learning performance in other BCI paradigms, e.g., motor imagery. Similar approaches may also be developed for seizure classification.
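The simple fine-tuning baseline mentioned in 3) amounts to pre-training on the large source dataset and then updating only part of the network on the small target dataset; the stand-in architecture, the checkpoint name, and the choice of frozen layers below are all illustrative assumptions, not the paper's exact setting:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in for a pre-trained network
    nn.Conv1d(19, 8, kernel_size=64), nn.ELU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(8, 4),                        # 4 seizure subtypes
)
# model.load_state_dict(torch.load("tusz_pretrained.pt"))  # hypothetical checkpoint

for p in model[:-1].parameters():           # freeze the feature extractor...
    p.requires_grad = False
optimizer = torch.optim.Adam(               # ...and fine-tune only the classifier head
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

x = torch.randn(8, 19, 1024)                # a small batch of target-domain samples
y = torch.randint(0, 4, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()
```

Freezing the early layers keeps the source-domain features intact while adapting the decision boundary to the few labeled target samples.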