B2-ViT Net: Broad Vision Transformer Network With Broad Attention for Seizure Prediction

Seizure prediction is necessary for epileptic patients. The global spatial interactions among channels and long-range temporal dependencies play a crucial role in seizure onset prediction. In addition, it is necessary to search for seizure prediction features in a vast space to learn new generalized feature representations. Many previous deep learning algorithms have achieved some results in automatic seizure prediction. However, most of them do not consider global spatial interactions among channels and long-range temporal dependencies together, and only learn the feature representation in the deep space. To tackle these issues, this study proposes a novel bi-level programming seizure prediction model, B2-ViT Net, for learning new generalized spatio-temporal long-range correlation features, which characterize the global spatial interactions among channels and the long-range temporal dependencies required for seizure prediction. In addition, the proposed model can comprehensively learn generalized seizure prediction features in a vast space owing to its strong deep and broad feature search capabilities. Extensive experiments are conducted on two public datasets, CHB-MIT and Kaggle. Compared with existing methods, the proposed model shows promising results in automatic seizure prediction tasks and provides a certain degree of interpretability.


I. INTRODUCTION
EPILEPSY is a chronic non-infectious disease caused by paroxysmal abnormal super-synchronous discharge activity of brain neurons. It is one of the most common neurological diseases worldwide and affects all age groups, with around 50 million epileptic patients worldwide [1]. Epilepsy is associated with adverse outcomes, including serious comorbidities, injury and death [2]. The central problem of epilepsy is the unpredictability of seizures, which can have a persistent negative impact on patients' lives.
If seizures can be predicted a few minutes before onset, patients will be able to take precautions against injury, and the door will open to new and timely treatments for the prevention or control of impending seizures [3]. In addition, doctors usually provide treatment plans for patients with epilepsy based on the type and number of seizures recorded by patients. However, epilepsy data recorded by patients and their caregivers are often unreliable, and it takes a lot of time and energy for doctors to detect seizures from long-term electroencephalogram (EEG) records. To make effective treatment plans, it is necessary to use seizure prediction algorithms to identify seizure events automatically. Therefore, an automatic seizure prediction algorithm is vitally important for patients with epilepsy. EEG is generated by the synchronous activity of a large number of neurons in the brain, which is consistent with the super-synchronous discharge mechanism of epilepsy, so EEG is an indispensable source of data for predicting seizures. Seizure prediction algorithms usually have two main functions: (1) they can be integrated into wearable technology and combined with an online alarm system to start therapeutic interventions [4], [5]; (2) they can assist medical workers in reviewing offline long-term EEG records to detect seizures automatically [6].
A complete seizure often includes interictal, preictal, ictal and postictal periods [7], [8]. The seizure prediction task can be simplified as a classification of interictal and preictal data. When a certain amount of preictal data is predicted, an early warning can be issued for the impending seizure onset.
In recent years, deep learning algorithms have attracted extensive attention in various fields because of their strong generalization ability and more automatic feature extraction, encouraging their application to seizure prediction. Truong et al. [9] used the short-time Fourier transform (STFT) to extract features from the original EEG signals and a convolutional neural network (CNN) [10] to classify interictal and preictal data. Ozcan and Erturk [11] extracted spectral band power, statistical moments and Hjorth parameters to reveal the frequency- and time-domain features of the EEG signals, and fed these features to a 3D CNN [12]. Daoud and Bayoumi [13] used a deep convolutional neural network (DCNN) concatenated with a bidirectional long short-term memory (Bi-LSTM) network as the back-end of the model for classification.
Many studies have shown that seizures involve not only the seizure onset zone and its surroundings, but also brain areas far away from the seizure onset zone [3], [14]. Abnormal interactions among different brain areas may lead to seizure onset. To characterize interactions among different brain areas within a whole-brain range, recent studies generally construct brain functional connectivity networks based on scalp EEG using channels as nodes [15], [16], [17]. According to the international standard electrode positions, different channels in multi-channel EEG data correspond to different brain regions, so abnormal interactions among different brain regions can be reflected by the interactions among different channels. Furthermore, seizures do not occur randomly and have been shown to have long-range temporal dependencies [3], [18], [19]. In summary, global spatial channel interactions and long-range temporal dependencies are crucial to seizure prediction algorithms. However, most previous deep learning algorithms, such as CNN, can only capture local spatial channel interactions and short-range temporal dependencies because of the regular and local receptive field of convolution operators. They do not consider global channel interactions and long-range temporal dependencies together, which results in a lack of model interpretability and in unremarkable results.
In fact, the vision transformer (ViT) [20] algorithm, which is based on a global attention mechanism, can capture the global spatial channel interactions and the long-range temporal dependence features required for seizure prediction. However, ViT only considers the deep features of the last transformer module, whereas transformer modules at different depths may contain complementary features related to seizure prediction tasks [21]. These complementary features can be obtained through the broad connection of shallow and deep transformer modules, but they are redundant and complicated. By applying attention mechanisms to them, we can further extract the critical spatio-temporal long-range correlation complementary features that are beneficial to seizure prediction. However, the broad connection above is only used in the attention mechanism to extract connected attention information from different transformer modules, instead of mapping all features together into a new vast space to learn new generalized features. It is necessary to search for seizure prediction features in a vast space, so as to learn new generalized spatio-temporal long-range correlation features that help predict seizures [22].
Therefore, according to the neuroscience mechanism of seizures, a novel bi-level programming seizure prediction model, the broad vision transformer network with broad attention, called B2-ViT Net, is proposed for learning new generalized spatio-temporal long-range correlation features. These features characterize the global spatial channel interactions and the long-range temporal dependencies, and capture generalized information that is beneficial to seizure prediction, thus improving the prediction performance. Compared with other black-box deep learning models, our model can quantify the interaction weights among channels and evaluate the importance of each channel at any time, thus providing a certain degree of interpretability.
Specifically, the contributions of our proposed method can be summarized in the following aspects.
1) Based on the neuroscience mechanism of seizure onset, we propose a novel bi-level programming seizure prediction model, B2-ViT Net, which considers the global spatial interactions among channels and the long-range temporal dependencies together through the global attention mechanism; we call these spatio-temporal long-range correlations. The global attention mechanism can quantify the interaction weights among channels and evaluate the importance of each channel at any time.
2) Both deep and broad features are crucial for seizure prediction tasks. Previous seizure prediction algorithms only focused on deep features while ignoring the generalized features that combine deep and broad features. In our model, generalized features are characterized through linear and nonlinear random mappings. The proposed model can comprehensively learn, in a vast space, generalized spatio-temporal long-range correlation features that are conducive to automatic seizure prediction, thereby improving the prediction performance.
3) Extensive experiments are conducted on two public datasets, CHB-MIT and Kaggle. Compared with existing methods, our proposed method achieves promising results in automatic seizure prediction tasks and obtains the highest AUC and the lowest FPR. On the CHB-MIT dataset, B2-ViT obtains an AUC of 0.923, a sensitivity of 93.3%, and an FPR of 0.057/h. On the Kaggle dataset, the proposed model reaches an AUC of 0.816, a sensitivity of 85.2%, and an FPR of 0.013/h.

II. PRELIMINARY KNOWLEDGE
This section introduces the preliminary knowledge of ViT and BLS, which helps to build B2-ViT Net.

A. ViT: Vision Transformer
The transformer is a deep neural network mainly based on the self-attention mechanism, which was initially applied in natural language processing. Inspired by its powerful global representation ability, researchers extended the transformer to computer vision tasks, resulting in ViT [20]. Compared with other networks (such as CNN), the model shows competitive performance on various benchmarks. The model follows these steps: (1) converting image data into sequence form as the transformer input; (2) applying a linear projection to the sequences; (3) adding an extra learnable classification token and a positional embedding; (4) applying a transformer encoder to the processed data, which mainly includes a multi-head self-attention (MHSA) block and a multi-layer perceptron (MLP) block.
In BLS, the feature nodes are obtained by a random mapping, and the feature nodes are then mapped to a possibly high-dimensional vector space to obtain enhancement nodes, so that the model can automatically search for task-related features in a vast vector space. Both kinds of nodes contribute to the final output.

III. DATASETS AND METHODOLOGY
This section introduces the datasets, data preprocessing, the modeling method of B2-ViT Net, and postprocessing. In addition, the model frame diagram and algorithm table are provided. The structure of B2-ViT is shown in Fig. 2, and the detailed implementation steps of B2-ViT are summarized in Algorithm 1.

A. Datasets
1) CHB-MIT Dataset: The CHB-MIT seizure EEG dataset [24] was collected at Boston Children's Hospital and is included in the EEG database of the Massachusetts Institute of Technology. It contains 23 records from 22 subjects (chb21 was recorded from the same subject as chb01 after 1.5 years). Each subject has 9-24 recordings lasting 1 hour (some of which are long records of 2-4 hours), and the dataset includes 884 hours of continuous scalp EEG recordings and 163 seizures. All EEG data are collected using the international 10-20 standard electrode positions with 18 or 23 leads, and the sampling frequency is 256 Hz.
2) Kaggle Dataset: The American Epilepsy Society Seizure Prediction Challenge dataset on Kaggle [25], simply denoted as the Kaggle dataset, contains intracranial EEG (iEEG) data from 5 dogs and 2 patients, with 48 seizures and 627.7 hours of interictal records. The iEEG data of the 5 dogs are recorded from 16 implanted electrodes at a sampling rate of 400 Hz. The iEEG data of the 2 patients are recorded from 15 deep electrodes (Patient 1) and 24 subdural electrodes (Patient 2) at a sampling rate of 5 kHz. Because the high sampling rate of the patient recordings makes computation difficult, the two patients' iEEG data are not considered, which is consistent with [9], [26], [27]. These two datasets are used in most seizure prediction studies [8], [9], [26], [28].

B. Preprocessing
As shown in Fig. 1, the recording around a seizure can be divided into the preictal period, the interictal period, the seizure interictal horizon (SIH), the seizure prediction horizon (SPH), and the seizure occurrence period (SOP). SPH is the prediction period before the seizure, during which appropriate measures can be taken to prevent or control the impending seizure in advance. SOP is the interval in which the seizure is expected to occur. SIH is defined as the EEG signals from about 4 hours before to 4 hours after the seizure [29], which reduces the interference caused by the near-seizure state. For a prediction to be correct, the seizure must occur after the SPH and within the SOP. This paper follows the definitions of SOP and SPH proposed by [30]. In this work, SPH is set to 5 minutes and SOP is set to 30 minutes, which is consistent with most studies. The CHB-MIT dataset contains many seizures that occur within a short time. For seizures occurring less than 30 minutes after the previous seizure, we consider only the leading seizure. In addition, this work only considers patients with fewer than 10 seizures a day, because this task is of little use for patients who have a seizure every 2 hours on average. Based on the above definitions and considerations, this work evaluates 64 seizures in the CHB-MIT dataset and 42 seizures of 5 dogs in the Kaggle dataset. The available data of the two datasets are summarized in Table I and Table II.
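The timing rule above can be made concrete with a short sketch. It assumes that times are expressed in minutes and that an alarm raised at a given time counts as correct when the seizure onset falls after the SPH has elapsed and before the following SOP ends. The function name `is_correct_alarm` and its arguments are illustrative and do not come from the paper.

```python
SPH_MIN = 5.0   # seizure prediction horizon used in this work (minutes)
SOP_MIN = 30.0  # seizure occurrence period used in this work (minutes)


def is_correct_alarm(alarm_time, onset_time, sph=SPH_MIN, sop=SOP_MIN):
    """Return True if a seizure at onset_time validates an alarm at alarm_time.

    Assumption of this sketch: the alarm is correct when the onset lies in
    the closed interval [alarm_time + sph, alarm_time + sph + sop].
    """
    window_start = alarm_time + sph
    window_end = window_start + sop
    return window_start <= onset_time <= window_end
```

With the settings of this work, an alarm at minute 0 would be validated by an onset at minute 10, but not by an onset at minute 3 (still inside the SPH) or at minute 36 (after the SOP).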
Classification tasks often face the problem of class imbalance, and automatic seizure prediction is no exception: interictal data far outnumber preictal data. To solve this problem, a sliding overlap technique with step size s is used to obtain more preictal data. The number of extra preictal data N obtained by oversampling is computed from the following quantities: w is the sliding window length, P is the total length of preictal data, I is the total length of interictal data, R is the ratio of the total length of preictal data to the total length of interictal data, and s = w × R.
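A minimal sketch of the sliding overlap technique is given below. It only illustrates how overlapping windows are cut with step s = w × R; it is not the paper's expression for N. The function name `oversample_preictal`, the sample-based units and the rounding of the step to at least one sample are assumptions of this sketch.

```python
def oversample_preictal(signal, window, ratio):
    """Cut overlapping windows from a preictal signal.

    signal: sequence of samples.
    window: window length w, in samples.
    ratio:  R, total preictal length divided by total interictal length.

    The step is s = w * R as in the text; rounding it to a whole number of
    samples (at least one) is an assumption of this sketch.
    """
    step = max(1, round(window * ratio))
    starts = range(0, len(signal) - window + 1, step)
    return [signal[i:i + window] for i in starts]
```

For example, a 10-sample signal with w = 4 and R = 0.5 is cut with a step of 2 samples into four windows, whereas R = 1 (no imbalance) gives non-overlapping windows.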
In this paper, STFT [31] is used to preprocess the raw EEG data and extract time-frequency features, converting the original EEG signal into a time-frequency matrix. The window length of the STFT is 30 s. STFT is chosen because it can capture the dynamic changes in the frequency characteristics of the EEG signals of epileptic patients and, compared with the wavelet transform (WT) [32] and other signal analysis methods, it has a shorter processing time, which is helpful for real-time seizure prediction. Besides, it is widely used in EEG processing and retains most of the information in the original signal, and many studies have shown its advantages for EEG [9], [33]. The datasets used are contaminated with 60 Hz power line noise, so components in the 57-63 Hz and 117-123 Hz frequency ranges are excluded to eliminate power line interference, and the DC component (0 Hz) is also removed.
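The frequency exclusion described above amounts to dropping a fixed set of STFT bins. The sketch below lists the bins of a one-sided spectrum that would be kept. The helper name `kept_frequency_bins` is hypothetical, and treating the band edges as inclusive is an assumption.

```python
EXCLUDED_BANDS_HZ = [(57.0, 63.0), (117.0, 123.0)]  # power line noise and its harmonic


def kept_frequency_bins(fs, n_fft, bands=EXCLUDED_BANDS_HZ):
    """Return (index, frequency) pairs of the one-sided STFT bins that are kept.

    fs:    sampling frequency in Hz.
    n_fft: number of samples per STFT window.

    Bin k of a real-input FFT sits at k * fs / n_fft Hz. The DC bin (0 Hz)
    and every bin inside an excluded band (edges included) are dropped.
    """
    kept = []
    for k in range(n_fft // 2 + 1):
        freq = k * fs / n_fft
        if freq == 0.0:
            continue  # remove the DC component
        if any(lo <= freq <= hi for lo, hi in bands):
            continue  # remove power line interference
        kept.append((k, freq))
    return kept
```

For the CHB-MIT settings in this paper (256 Hz sampling, 30 s window), the call would be `kept_frequency_bins(256, 30 * 256)`.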

C. Proposed Method
B2-ViT Net is a novel bi-level programming model for seizure prediction. It considers the spatio-temporal long-range correlation features required for seizure prediction. In addition, it has strong global deep and broad feature search capabilities, which allow it to comprehensively learn, in a vast space, generalized spatio-temporal long-range correlation features that are conducive to automatic seizure prediction, thus improving the prediction performance.
For a given preprocessed image $I \in \mathbb{R}^{L \times C_1 \times W}$, $L$ is the length of the sequence, and $C_1$ and $W$ are the number of channels and the width of the image patches, which can be processed directly by the standard transformer. To get the input $x_1 \in \mathbb{R}^{L \times C \times D}$ of the first transformer layer, a linear projection is adopted to satisfy the required dimension $D$ of the transformer, and a classification token is appended to $C_1$, which gives $C$. After the input data are processed, the model is divided into two parts: the ViT backbone, which obtains the deep features $Out_{Deep}$, and the broad attention, which obtains the local broad features $Out_{Broad}$. The transformer layer includes two blocks: MHSA and MLP. In addition, residual connections are used in the MHSA and MLP blocks, and LayerNorm (LN) is applied before each block. Next, the calculation process of MHSA, MLP and broad attention is introduced in detail.
Multi-Head Self-Attention: Given the input $x_i \in \mathbb{R}^{L \times C \times D}$ of the $i$-th layer, the query $q_i \in \mathbb{R}^{L \times C \times (h \times d_q)}$, key $k_i \in \mathbb{R}^{L \times C \times (h \times d_k)}$ and value $v_i \in \mathbb{R}^{L \times C \times (h \times d_v)}$ are obtained by chunking $x_i$ into three tensors and rearranging them, where $h$ is the number of heads, $d_q$, $d_k$ and $d_v$ are the dimensions of $q_i$, $k_i$ and $v_i$, respectively, and $i \in [1, l]$, with $l$ the number of transformer layers. Then the inner product, softmax and the second linear projection are performed. The output of MHSA can be obtained by the following:

\[ \mathrm{MHSA}(x_i) = \mathrm{Concat}\left(head_i^1, \ldots, head_i^h\right) w_o, \qquad head_i^j = \mathrm{softmax}\left(\frac{q_i^j (k_i^j)^{T}}{\sqrt{d_k}}\right) v_i^j, \]

where $q_i^j$, $k_i^j$, $v_i^j$ are the corresponding values of the $j$-th head in the $i$-th layer of $q_i$, $k_i$, $v_i$, and $w_o$ is the weight matrix of the second linear projection. Because of the residual connection between the layers, the hidden layer's output $\hat{y}_i$ in the $i$-th layer is formulated by

\[ \hat{y}_i = \mathrm{MHSA}(\mathrm{LN}(x_i)) + x_i. \]

Multi-Layer Perceptron: MLP has two fully connected layers and an activation function layer; the activation function used in this paper is GELU. The output of MLP can be denoted as

\[ \mathrm{MLP}(\hat{y}_i) = \mathrm{GELU}(\hat{y}_i w_{1l} + b_{1l})\, w_{2l} + b_{2l}, \]

where $w_{1l}$, $b_{1l}$, $w_{2l}$ and $b_{2l}$ are the weights and biases of the corresponding linear layers. The output $y_i$ in the $i$-th layer is formulated as

\[ y_i = \mathrm{MLP}(\mathrm{LN}(\hat{y}_i)) + \hat{y}_i. \]

The output $y_i$ of the $i$-th layer is the input $x_{i+1}$ of the $(i+1)$-th layer, so the deep feature $Out_{Deep}$ is the output of the last layer:

\[ Out_{Deep} = y_l. \]

Broad Attention: Queries, keys and values of different layers are concatenated respectively as below:

\[ Q = \mathrm{Concat}(q_1, \ldots, q_l), \quad K = \mathrm{Concat}(k_1, \ldots, k_l), \quad V = \mathrm{Concat}(v_1, \ldots, v_l). \]

Self-attention is performed on the concatenated query $Q$, key $K$ and value $V$ to get $\mathrm{Attention}(Q, K, V)$. In this paper, 1D adaptive average pooling is introduced to solve the problem of dimension inconsistency between $Out_{Deep}$ and $\mathrm{Attention}(Q, K, V)$. The output features $Out_{Broad}$ of broad attention can be denoted as:

\[ Out_{Broad} = \mathrm{AdaptiveAvgPool1d}\left(\mathrm{Attention}(Q, K, V)\right), \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V, \]

where $d$ is the hidden dimension of the transformer layer.
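The broad attention step can be illustrated with a small, dependency-free sketch. It assumes standard scaled dot-product attention, concatenation of the per-layer queries, keys and values along the feature axis, and average pooling back to the layer width; these choices, and all function names, are assumptions made for illustration rather than details confirmed by the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for matrices given as lists of rows."""
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def adaptive_avg_pool_1d(row, out_len):
    """Average a row into out_len bins with floor/ceil bin edges."""
    n = len(row)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = ((i + 1) * n + out_len - 1) // out_len  # ceil((i + 1) * n / out_len)
        pooled.append(sum(row[start:end]) / (end - start))
    return pooled


def concat_layers(layers):
    """Concatenate per-layer token features along the feature axis (assumed axis)."""
    n_tokens = len(layers[0])
    return [[x for layer in layers for x in layer[t]] for t in range(n_tokens)]


def broad_attention(qs, ks, vs, out_dim):
    """Attend over concatenated q, k, v of all layers, then pool to out_dim."""
    Q, K, V = concat_layers(qs), concat_layers(ks), concat_layers(vs)
    return [adaptive_avg_pool_1d(row, out_dim) for row in attention(Q, K, V)]
```

Here `qs`, `ks` and `vs` are lists with one (tokens × features) matrix per transformer layer, and `out_dim` plays the role of the width of $Out_{Deep}$, so that the pooled result can be combined with the deep feature.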
Combining the deep feature $Out_{Deep}$ and the local broad feature $Out_{Broad}$, the final output feature $Out_{DB}$ of BViT is computed as:

\[ Out_{DB} = Out_{Deep} + \gamma \cdot Out_{Broad}, \]

where $\gamma$ can be used to adjust the weights of the two types of features. Finally, the probability of each category is calculated by a softmax function. So far, we have obtained the spatio-temporal long-range correlation features required for seizure prediction. The feature data are denoted as $Out_{DB}$ and their labels as $Y_T$.
It is necessary to search for seizure prediction features in a vast space to learn new generalized spatio-temporal long-range correlation features that help predict seizures. Therefore, the above algorithm is extended to a vast space through BLS to learn generalized features, so as to improve the performance and representation ability of seizure prediction. Firstly, $Out_{DB}$ is randomly extended to a vast space via a linear random mapping, that is:

\[ Z_i = \varphi_i\left(Out_{DB} W_{z_i} + \beta_{z_i}\right), \quad i = 1, 2, \ldots, n, \]

where $W_{z_i}$ and $\beta_{z_i}$ are the randomly generated weights and biases of the linear random mapping $\varphi_i$. Then the set of $n$ groups of feature nodes can be defined as $Z^n = [Z_1, Z_2, \ldots, Z_n]$. Secondly, the $j$-th group of enhancement nodes can be constructed by

\[ H_j = \xi_j\left(Z^n W_{h_j} + \beta_{h_j}\right), \quad j = 1, 2, \ldots, m, \]

where, similarly, $W_{h_j}$ and $\beta_{h_j}$ are the randomly generated weights and biases of the nonlinear random mapping $\xi_j$. The set of $m$ groups of enhancement nodes is $H^m = [H_1, H_2, \ldots, H_m]$. Here $\xi_j$ is the tansig function, a hyperbolic tangent sigmoid nonlinear function, which is defined as:

\[ \mathrm{tansig}(x) = \frac{2}{1 + e^{-2x}} - 1. \]

Therefore, the output $Y_P$ of the improved algorithm with BLS can be constructed by the following formula:

\[ Y_P = [Z^n \mid H^m]\, w_2 = A w_2, \]

where $w_2$ can be obtained by solving the ridge regression problem:

\[ w_2 = \arg\min_{w_2} \|A w_2 - Y_T\|_2^2 + \lambda_2 \|w_2\|_2^2, \]

where $\lambda_2$ is the regularization coefficient.
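The BLS stage can be sketched in plain Python as follows. For brevity the sketch uses a single group of feature nodes and a single group of enhancement nodes, draws the random weights uniformly from [-1, 1], and solves the ridge problem through its standard closed form $(A^T A + \lambda I)^{-1} A^T Y$. All function names, sizes and the weight distribution are assumptions of this sketch, not the authors' settings.

```python
import math
import random


def tansig(x):
    """Hyperbolic tangent sigmoid, 2 / (1 + exp(-2x)) - 1 (equal to tanh(x))."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def random_layer(X, n_out, rng, activation=None):
    """Map rows of X through random weights and biases, then an optional activation."""
    W = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(len(X[0]))]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    out = [[v + bj for v, bj in zip(row, b)] for row in matmul(X, W)]
    if activation is not None:
        out = [[activation(v) for v in row] for row in out]
    return out


def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def ridge_weights(A, Y, lam):
    """Closed-form ridge solution w = (A^T A + lam I)^-1 A^T Y."""
    At = [list(col) for col in zip(*A)]
    G = matmul(At, A)
    for i in range(len(G)):
        G[i][i] += lam
    return solve(G, matmul(At, Y))


def bls_fit_predict(X, Y, n_feature=4, n_enhance=6, lam=1e-3, seed=0):
    """Fit the output weights on (X, Y) and return (predictions, w2)."""
    rng = random.Random(seed)
    Z = random_layer(X, n_feature, rng)           # feature nodes (linear mapping)
    H = random_layer(Z, n_enhance, rng, tansig)   # enhancement nodes (tansig)
    A = [z + h for z, h in zip(Z, H)]             # A = [Z | H]
    w2 = ridge_weights(A, Y, lam)
    return matmul(A, w2), w2
```

In the paper's setting, `X` would hold the $Out_{DB}$ features and `Y` the one-hot labels $Y_T$; only `w2` is fitted, while the random mappings stay fixed.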
Our proposed model B2-ViT is formulated as a novel bi-level programming problem whose objective is given in Eq. (9), where $x_1$ is the input data, $Y_T$ is the true label, $w_{1,r}$ is the weight of the first $r$ layers of the proposed model, $W_z$ and $\beta_z$ are the weights and offsets of the feature nodes, $W_h$ and $\beta_h$ are the weights and offsets of the enhancement nodes, $\varphi$ and $\xi$ are the random mappings used to generate feature nodes and enhancement nodes, $L$ is the length of $x_1$, $f(x_1; w_{1,r})$ is the BViT function of the input $x_1$ parameterized by the weight vector $w_{1,r}$, $l$ is the loss function of BViT, $Out_{DB}$ can be denoted as $f(x_1; w_{1,r-1})$, $\lambda_1 \|w_{1,r}\|_2^2$ is the regularization term that penalizes the complexity of the weights, and softmax is a classification function.

D. Postprocessing
In this work, the k-of-n method is used to predict seizures, as in [9], [11]: an alarm is raised only when at least k of the last n predictions are positive, and we set k to 4 and n to 5. In addition, to avoid the increase in false prediction rate (FPR) caused by multiple alarms in a short time, we set the refractory period to 30 min, that is, any alarm reoccurring within 30 min after an alarm is ignored.
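The postprocessing rule can be sketched as below. The sketch assumes that each window-level prediction covers 30 s, that the k-of-n test is applied to the n most recent predictions, and that alarm times are measured at the end of the latest window; the function name and these conventions are illustrative.

```python
def k_of_n_alarms(predictions, k=4, n=5, window_sec=30.0, refractory_min=30.0):
    """Return the times (in minutes) at which alarms are raised.

    predictions: sequence of 0/1 window-level outputs in time order
    (1 = preictal). An alarm is raised when at least k of the last n
    predictions are positive; alarms that would fall within
    refractory_min minutes of the previous alarm are ignored.
    """
    alarms = []
    last_alarm = None
    for i in range(n - 1, len(predictions)):
        t = (i + 1) * window_sec / 60.0  # end time of window i, in minutes
        if sum(predictions[i - n + 1:i + 1]) < k:
            continue
        if last_alarm is not None and t - last_alarm < refractory_min:
            continue
        alarms.append(t)
        last_alarm = t
    return alarms
```

With k = 4 and n = 5, a single isolated positive prediction never triggers an alarm, which is how this step suppresses sporadic false positives.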

A. Experimental Settings and Evaluation Metrics
In this work, the area under the curve (AUC), sensitivity ($S_n$), FPR and p-value are chosen as the evaluation metrics of the proposed method. AUC measures the quality of the classifier; the closer it is to 1, the better. $S_n$ is the ratio of correctly predicted seizures to all seizures. FPR is the number of false predictions per hour. The p-value is the probability that a random predictor predicts at least $m$ of $M$ seizures, which can be obtained by the following formula [9], [11]:

\[ p = \sum_{i \ge m} \binom{M}{i} P^{i} (1 - P)^{M - i}, \]

where $P \approx 1 - e^{-\mathrm{FPR} \times \mathrm{SOP}}$ and SOP is the seizure occurrence period, which is set to 30 min. If $p < 0.001$, it can be considered that our model is superior to random prediction at the 0.001 significance level.

TABLE V PERFORMANCE COMPARISON OF VIT, B-VIT AND B2-VIT ON CHB-MIT DATASETS
TABLE VI PERFORMANCE COMPARISON OF VIT, B-VIT AND B2-VIT ON KAGGLE DATASETS
TABLE VII COMPUTATIONAL COMPLEXITY EVALUATION METRICS OF VIT, B-VIT AND B2-VIT ON CHB-MIT AND KAGGLE DATASETS
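The p-value described above is a binomial tail probability and can be evaluated directly. The sketch below assumes that FPR is given per hour and that SOP is converted to hours (30 min = 0.5 h) so that the product FPR × SOP is dimensionless; the function name is illustrative.

```python
import math


def random_predictor_p_value(m, M, fpr_per_hour, sop_hours=0.5):
    """Probability that a random predictor predicts at least m of M seizures.

    P = 1 - exp(-FPR * SOP) is the chance that a random predictor raises an
    alarm within one SOP; the p-value is the binomial tail sum from m to M.
    """
    P = 1.0 - math.exp(-fpr_per_hour * sop_hours)
    return sum(math.comb(M, i) * P**i * (1.0 - P)**(M - i)
               for i in range(m, M + 1))
```

A subject is then counted as significantly better than chance when the returned value is below 0.001.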
To make the results more reliable, leave-one-out cross-validation (LOOCV) is used for each subject. If the subject has M seizures, M − 1 seizures are used for training and the remaining seizure is used for testing; each seizure is taken as the test set in turn. In addition, to monitor overfitting during training and to tune the model parameters, the data of the M − 1 training seizures are divided into a training set and a validation set, with the proportion of the validation set set to 25%, as in most seizure prediction studies. The number of transformer blocks l is 6, the number of self-attention heads h is 8, the dimension of one self-attention head is 64, and the hidden layer size is 512.
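The per-subject evaluation protocol can be sketched with two small helpers. The text fixes only the 25% validation proportion, so holding out the last quarter of the training samples is an assumption of this sketch, as are the function names.

```python
def loocv_splits(seizure_ids):
    """Yield (training_ids, test_id), holding out each seizure once."""
    for i, test_id in enumerate(seizure_ids):
        yield seizure_ids[:i] + seizure_ids[i + 1:], test_id


def split_train_validation(samples, val_fraction=0.25):
    """Split the training samples into (train, validation).

    Assumption: the last val_fraction of the samples is held out; the text
    specifies the proportion but not how the samples are selected.
    """
    n_val = int(len(samples) * val_fraction)
    cut = len(samples) - n_val
    return samples[:cut], samples[cut:]
```

For a subject with M seizures, `loocv_splits` produces M folds, and the reported metrics would be aggregated over these folds.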
Our experiments are implemented with PyTorch 1.11.0, Python 3.8 and CUDA 11.3.0. The loss is optimized by the stochastic gradient descent (SGD) optimizer (learning rate = 0.001, momentum = 0.9, weight decay = 5e-5), a cosine scheduler is used to adjust the learning rate, the number of training epochs is set to 100, and the loss function is the cross-entropy loss. Early stopping with a patience of 10 is used to obtain better generalization performance and avoid overfitting. Two GeForce RTX 3090 Ti GPUs are used, and training needs approximately 64 GB of GPU memory in total.
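Two of the training settings above, the cosine learning-rate schedule and early stopping, can be illustrated without any deep learning framework. The sketch assumes a cosine schedule that decays from the base learning rate to zero over the 100 epochs and early stopping that monitors the validation loss; neither detail is stated in the text, and the names are illustrative.

```python
import math


def cosine_lr(epoch, total_epochs=100, base_lr=0.001, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (min_lr = 0 is assumed)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


class EarlyStopping:
    """Stop training when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `cosine_lr(epoch)` would be applied at the start of each epoch and `EarlyStopping.step` called once per epoch with the validation loss.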

B. Overall Performance Comparison
The overall performance of B2-ViT Net is compared with the following methods. CNN [9] is a feed-forward neural network with a deep structure and convolution computation; it is one of the representative algorithms of deep learning, has achieved good results in computer vision and natural language processing in recent years, and is currently one of the most popular deep learning methods for EEG decoding [41]. DCNN+Bi-LSTM [13] uses a DCNN to extract spatial features and a Bi-LSTM as a classifier to improve classification accuracy, and is specifically designed to predict seizures from EEG. Vision Transformer [20] is the application of the transformer to computer vision and achieves performance beyond CNN in most visual tasks. AdderNet [8] is a simple and effective end-to-end adder network with supervised contrastive learning, which uses addition instead of multiplication to significantly reduce computational costs. Multi-scale ProtoPNet [26] is a deep learning model for patient-specific seizure prediction that measures the similarity between the inputs and prototypes (learned during training) as evidence for its final predictions.
As shown in Table III, Table IV and Fig. 3, the proposed B2-ViT scheme yields an average AUC of 0.923 on the CHB-MIT dataset, whereas the other methods achieve average AUCs of 0.834, 0.824, 0.861, 0.917 and 0.843. On the Kaggle dataset it yields an average AUC of 0.816, whereas the baseline methods achieve average AUCs of 0.792, 0.806, 0.759, 0.794 and 0.764. This shows that our proposed method has good classification ability. In particular, patients 1, 19, 20 and 23 of CHB-MIT reach an AUC greater than 0.99, which demonstrates the effectiveness of our method in distinguishing preictal from interictal EEG signals. In addition, our seizure prediction method is superior to the compared methods, successfully warning of 60 out of 64 seizures on the CHB-MIT dataset and 37 out of 42 seizures on the Kaggle dataset. Meanwhile, our method achieves a low FPR of 0.057/h on CHB-MIT and 0.013/h on Kaggle.
As a bi-level programming model, B2-ViT Net obtains promising AUC, $S_n$ and FPR, which indicates the effectiveness of our proposed method in automatic seizure prediction. In addition, for all subjects in the CHB-MIT and Kaggle datasets, the p-value is less than 0.001, which shows that our seizure predictor is significantly better than a random predictor at the 0.001 significance level (99.9% confidence level).

V. DISCUSSIONS
A. Ablation Studies
To verify the effectiveness of our proposed B2-ViT model, we conducted further ablation experiments; the results are shown in Tables V and VI and Fig. 4. On the CHB-MIT dataset, all the evaluation metrics of the BViT model are better than those of ViT: AUC is increased by 1%, $S_n$ is increased by 4.9%, FPR is decreased by 0.004, and the proportion of subjects with a p-value below the 0.001 significance level rises from 11/13 to 12/13. The evaluation metrics of B2-ViT are in turn significantly better than those of BViT, with AUC increased by 4.2%, $S_n$ increased by 8.2%, FPR decreased by 0.024, and the proportion of subjects with a p-value below 0.001 rising from 12/13 to 13/13. Similar results are obtained on the Kaggle dataset. These results demonstrate the effectiveness of the B2-ViT model, which consistently outperforms the ViT and BViT models. Moreover, Table VII shows the number of parameters and the training time of the different algorithms, which indicates that our method achieves a large performance improvement with a small increase in training time and without increasing the number of trainable parameters.

B. Effects of Different Window Lengths of EEG Signals
An appropriate window length is expected to achieve better performance. We evaluate the effect of different window lengths on the experimental results using the baseline method ViT, and find that a window length of 30 s is the most appropriate. The results are shown in Table IX. Up to 30 s, as the window length increases, the input to ViT contains more and more distinctive feature information and its performance improves. When the window length exceeds 30 s, all evaluation metrics decline and the classification performance reaches a bottleneck. This shows that a window length of 30 s contains enough feature information for classification, so a window length of 30 s is chosen for seizure prediction.

C. Effects of Parameter Settings in BLS
The parameter settings of BLS may affect the experimental results of our proposed model, and the numbers of feature nodes and enhancement nodes can be adjusted according to different experimental scenarios. To verify the robustness of our proposed model, the influence of the important BLS parameters on AUC is analyzed. Fig. 5 shows the AUC under different numbers of mapping feature nodes and enhancement nodes. The number of feature-node groups ranges from 10 to 15, and the number of enhancement nodes is set to 1, 100, 500, 1000 and 5000. The best experimental results on patient 1 of the CHB-MIT dataset are obtained when the numbers of mapping feature nodes and enhancement nodes are 165 and 100, respectively. The AUC of B2-ViT is relatively stable, and good experimental results are obtained throughout. Therefore, the automatic seizure prediction performance of B2-ViT does not fluctuate noticeably when the BLS parameters change, which shows that the BLS module of our proposed method is robust.

D. Performance Comparison of the Existing State-of-the Art Methods
Table VIII shows the experimental settings and performance results of the existing state-of-the-art methods on the CHB-MIT and Kaggle datasets, where NR denotes values that are not reported. It should be pointed out that it is difficult to compare our method directly with the existing methods because of the different experimental settings, such as the interictal distance, preictal length and validation scheme. Compared with our LOOCV strategy, the no-CV and k-fold CV schemes in [34], [35], [36], and [37] are much less challenging and less stable, and they ignore the intra-patient variation of seizures. In addition, although the importance of statistical significance has always been emphasized, only [9], [11], [26], [34], and [42] report a statistical evaluation. Moreover, deep learning methods are usually black boxes, and model interpretability is an important research direction at present, but few studies provide interpretability in specific scenarios.
As a result, compared with other methods, our bi-level programming model B2-ViT Net yields competitive AUC, $S_n$, FPR and p-value. The AUC, FPR and p-value reach the state of the art (SOTA), and only the $S_n$ is lower than that of [37]. Although [37] achieved very high sensitivity on the CHB-MIT dataset, it adopted a complex and time-consuming feature extraction method and 10-fold CV instead of LOOCV. Because each seizure is treated independently in LOOCV, this scheme is more realistic and useful in clinical applications. Moreover, our model uses the attention mechanism to explain the global spatial interactions among channels and the long-range temporal dependencies required for seizure prediction, so the model has a certain degree of interpretability.

E. Limitations and Future Directions
Although our proposed seizure prediction algorithm achieves strong prediction performance, some limitations remain in the current work. On the one hand, owing to the lack of detailed information on the patients' epileptogenic zones and the corresponding biomarkers, the results were not validated through neuroscience experiments. For example, channels located in the epileptogenic zone may show strong abnormal connections with other channels and may be assigned higher attention weights than other channels. Besides, the neural links between brain regions assigned high attention weights were not captured. Fig. 6 shows some of our conjectures.
On the other hand, our method is patient-dependent, meaning that both the training and test sets come from the same patient. It cannot be directly used for patient-independent seizure early warning tasks, i.e., a model trained on one patient cannot be applied to another patient. This is mainly because our method lacks the ability to handle the distribution shift between the training and test sets. Therefore, transfer learning strategies [43], [44] will be considered in our future work to improve the performance on patient-independent seizure prediction tasks. In addition, we will try to cooperate with medical institutions to further explore the biomarkers of the epileptogenic zone and the neural links between the brain regions assigned high attention weights, and to apply our proposed method to realistic seizure prediction tasks.

VI. CONCLUSION
Based on neuroscience mechanisms, we consider the global spatial channel interactions and the long-range temporal dependencies together, and explore in a vast space the generalized spatio-temporal long-range correlation features required for seizure prediction. A novel bi-level programming model, B2-ViT Net, is proposed for extracting generalized spatio-temporal long-range correlation features for automatic seizure prediction. The proposed model has a strong generalized feature search capability, which allows it to comprehensively learn, in a vast space, generalized spatio-temporal long-range correlation features that are conducive to automatic seizure prediction, improving its feature representation ability. The attention mechanism of our proposed model can calculate the interaction weights among channels and evaluate the importance of each channel at any time. We evaluated the performance of the B2-ViT model on the CHB-MIT and Kaggle datasets; the model yields promising results in terms of AUC, $S_n$, FPR and p-value, where the AUC, FPR and p-value reach SOTA. Experimental results illustrate that our proposed method can predict seizures efficiently, help patients prevent or control impending seizures, and improve their quality of life.

Fig. 3. The AUC for each seizure prediction on the (a) CHB-MIT and (b) Kaggle datasets. Each bar represents one seizure. Correct and incorrect seizure predictions are marked with ▲ and △, respectively.
Algorithm 1 Implementation of the B2-ViT Algorithm
Require: Input seizure EEG data $X$ and label $Y_T$;
Ensure: Prediction $Y_P$ for seizure detection;
Parameters: $W_{p1}$, $b_{p1}$, $W_{p2}$, $b_{p2}$: linear projection parameters; $W_c$, $p$: the class token and positional embedding matrices; $W_i$: multi-head attention parameters for layer $i$; $\gamma^1_i$, $\beta^1_i$, $\gamma^2_i$, $\beta^2_i$: two sets of layer-norm parameters for layer $i$; $d$: the dimension of one head; $W^i_{1l}$, $b^i_{1l}$, $W^i_{2l}$, $b^i_{2l}$: MLP parameters for layer $i$; $l$: the depth of the transformer block; $\gamma$: the coefficient factor; $\lambda_1$: the regularization coefficient of BLS.
1: Feed $X$ into the STFT to obtain the initial features $X_s$;

TABLE VIII EXPERIMENTAL SETUP AND PERFORMANCE RESULTS OF EXISTING METHODS ON CHB-MIT AND KAGGLE DATASETS

TABLE IX PERFORMANCE COMPARISON OF DIFFERENT WINDOW LENGTHS ON THE CHB-MIT DATASET