Seizure Prediction Using Multi-View Features and Improved Convolutional Gated Recurrent Network

Epilepsy is one of the most common neurological diseases worldwide. Early prediction of seizure onsets is of great significance for the safety of intractable epilepsy patients. This work aims to develop a reliable and accurate method for patient-specific seizure prediction based on scalp electroencephalograms (EEGs). Local fractal spectrum, relative band energy, and synchronization modularity features are used to reveal the characteristics of multi-channel EEG in perspectives of time domain, frequency domain, and functional connectivity, respectively. A novel framework, named multi-view convolutional gated recurrent network (Mv-CGRN), is proposed to comprehensively analyze the spatio-temporal sequences of multi-view features and capture the potential variations preceding the impending seizure. Moreover, an attention mechanism is embedded in Mv-CGRN to determine the optimal feature combinations for each patient by adaptively tuning the weight parameters. The proposed system achieves an average sensitivity of 94.50% and an average false positive rate (FPR) of 0.118/h on CHB-MIT scalp EEG dataset, using the leave-one-out cross validation (LOOCV). Our work shows a promising performance compared with the state-of-the-art works in the same filed.


I. INTRODUCTION
Epilepsy is the second most common neurological disorder that affects about 65 million individuals worldwide [1]. Epilepsy seizures arise spontaneously from sudden surges of electrical activity in the brain. It may cause direct injuries to the neural tissues and indirect injuries such as fractures, burns, accidents, and even death [2]. Current treatment for epilepsy are mainly pharmacological and surgical. Unfortunately, antiepileptic drugs failed to control seizures among 20%-30% of patients [3], and surgery is not always an available option. Therefore reliable and accurate prediction for an impending seizure attack is of great significance to ensure the safety and life quality of intractable epilepsy patients.
The associate editor coordinating the review of this manuscript and approving it for publication was Victor Hugo Albuquerque .
Hitherto, much work has been devoted to seizure prediction based on electroencephalogram (EEG), which is widely used to measure the neural electrical activity. It has long been observed that the transition from non-seizure state to seizure state is not an abrupt and isolated phenomenon and may be preceded by clinical, metabolic, or electrical changes [4]. Based on this observation, researches on seizure prediction define the stages of epilepsy into four major periods: (1) the preictal state which is the period preceding the seizure onset; (2) the ictal state which is the period of the seizure itself; (3) the postictal state which is the brain recovery period after the seizure; (4) the interictal state which can be considered as the normal state. Generally, seizure prediction can be perceived as a binary classification problem of preictal signals and interictal signals. Most current seizure prediction methods consist of two major steps: extracting features of EEG signals, and classifying the features with a pre-trained classifier.
Although the dysfunction mechanism of epilepsy remains unclear, various kinds of linear and nonlinear features have been presented in previous studies to detect early changes in EEG before the seizure onset. Examples of these features include absolute and relative spectral band power [5], [6], Hjorth parameters [7], autoregressive coefficients [8], wavelet coefficients [9], largest Lyapunov exponent [10], statistical features [11], instantaneous amplitude and phase [12], phase correlation [13], phase synchronization [14], and common spatial pattern features [15]. Considering that seizure types and characteristics vary across patients and may evolve over time with the same patient, multi-view feature extractions are becoming necessary in EEG processing and have been employed in several studies [16], [17]. Yet efficient and reasonable combination of features is a crucial part and a challenging problem for seizure prediction. In addition to traditional features in time, frequency, and time-frequency domain, the interaction between different electrodes should be noticed and explored in depth. Based on the graph theory, the human brain can be regarded as a functional connection network. The functional connectivity [18]- [20] between brain regions is assessed by time coupling or time dependence between signals collected in different regions, which can act as an important indicator of brain dysfunction before the seizure onset.
On the other hand, the classification method also plays a vital role in seizure prediction. With the rapid development of machine learning, especially neural network algorithms, many classifiers have been employed to automatically learn and classify the extracted features of EEG, including support vector machine (SVM) [13], [14], [21], [22], linear discriminant analysis (LDA) classifier [15], Bayesian classifier [5], [23], Adaboost classifier [24], convolutional neural network (CNN) [9], [25], long short-term memory (LSTM) network [26], [27], gated recurrent network (GRN) [28], generative adversarial network (GAN) [29], etc. CNN is widely used in image processing and has an advantage in extracting the spatial information. Thus, it is a powerful tool to analyze the feature matrices of multi-channel EEG. At the same time, the time-varying information is essential for seizure prediction since the neural electrical activity is always a dynamic and nonstationary process. In recent studies, 3D CNN [16] and convolutional long short-term memory (C-LSTM) network [30] have been introduced to process sequences of feature matrices so that the fluctuation pattern of EEG can be recognized. To achieve a better prediction performance, we intend to propose an improved classification framework which is capable of analyzing sequences of multiview feature matrices.
This work presents a novel patient-specific seizure prediction method, and the main contributions can be summarized as follows: (i) A well-designed combination of features are used, consisting of local fractal spectrum, relative band energy, and synchronization modularity. The characteristics of multi-channel EEG are revealed in perspectives of time domain, frequency domain, and functional connectivity, respectively. The multi-view feature set provides a reliable foundation for seizure prediction. (ii) A novel framework based on CNN and GRN, named multi-view convolutional gated recurrent network (Mv-CGRN), is proposed to process multiple views of feature matrix sequences. The spatio-temporal evolution pattern of multi-channel EEG can be learned and categorized synthetically. (iii) A depth-wise separable CNN is introduced in the fore-end of Mv-CGRN to address the multi-band feature matrices. Features in different frequency bands can be analyzed precisely and comprehensively while the parameter explosion is avoided. (iv) An attention mechanism for multi-view features is proposed to improve the classification performance by focusing on essential features automatically. It enhances the reliability and adaptability of the network for patient-specific seizure prediction.
The rest of this paper is organized as follows. Section II describes the utilized dataset. Section III presents the proposed seizure prediction system including the preprocessing procedure, feature extraction methods and the framework of Mv-CGRN. Section IV gives the experimental results, followed by relevant comparisons and analysis. A conclusion is given in Section V.

II. DATASET
In this work, the proposed prediction system is trained and tested on CHB-MIT scalp EEG dataset [31], collected from Children's Hospital Boston. The EEG were recorded according to the international 10-20 system of electrode positions. Recordings of 23 pediatric patients are grouped into 24 cases. Each case contains 9-42 hours of continuous recordings, with the total length of 866 hours and 198 seizures. All recordings were sampled at 256 Hz with a bipolar montage. To match the composition of channels for each case, 20-channel signals are adopted in this work.
The categorization of recordings is as follows: 30 minutes' recordings preceding seizure onsets are categorized as preictal (P1); recordings of seizure periods are categorized as ictal (P2); 30 minutes' recordings following seizure onsets are categorized as postictal (P3); the rest are categorized as interictal (P0). The purpose of seizure prediction is to distinguish P1 from P0, regardless of P2 and P3. For sequences of seizures that occurred close to each other, the incoming seizures are discarded if the interval time is less than 1 hour. We evaluated the cases which have at least three seizures and three hours of interictal period in EEG recordings. This is due to the fact that inadequate samples will cause an overfitting problem in the training phase and deprive the generalizability of the classifier. Considering all the above categorizations and constraints, 140 seizures and 476 hours of interictal recordings are available for 24 cases.

III. METHODOLOGY
The block diagram of the whole system proposed for seizure prediction is given in Fig. 1. In the preprocessing step, raw EEG are segmented and decomposed into sub-band signals.  The next step is to extract multi-view features from preprocessed signals and divide the feature set into training set and test set. The classification step consists of the training phase (phase 1) and test phase (phase 2). In phase 1, the classifier with trainable parameters is trained iteratively by the training set to obtain the classification ability. In phase 2, the test set is used to simulate the practical situations and evaluate the prediction performance of the proposed system.

A. PREPROCESSING
A multi-band feature extraction scheme is designed to obtain the more accurate and distinguishable features from EEG segments. Wavelet packet transform (WPT) is a powerful tool to analyze non-stationary bioelectrical signals and is used to decompose raw EEG into sub-band signals in this work. In WPT, a complete binary tree is created by decomposing both detail coefficients and approximation coefficients of the previous level without omission or redundancy. Thus, WPT can provide a multi-scale signal set with higher frequency resolution, compared with discrete wavelet transform (DWT). For N -level decomposition, WPT produces 2 N different sets of coefficients that give the time-frequency representations of the original signal. In this work, a 7-level WPT is implemented on raw EEG signals with Daubechies 4 (db4) wavelet. According to the intrinsic frequency bands commonly used in human brain researches, the wavelet packet coefficients are combined to reconstruct the signals in six sub-bands that are delta (0-3 Hz), theta (4-7 Hz), alpha (8-13 Hz), beta (14-30 Hz), gamma-1 (31-60 Hz), and gamma-2 (61-120 Hz) band. To eliminate the power line hums at 60 Hz and its harmonics, wavelet coefficients of 57-63 Hz and 117-120 Hz are excluded. As an example, Fig. 2 gives a raw EEG segment in case Chb01 and its corresponding decomposed signals.

B. FEATURE EXTRACTION 1) LOCAL FRACTAL DETRENDED ANALYSIS
Fractal analysis is frequently employed in biomedical signal processing to define the scale invariant structure in ECG, EEG, MR, and X-ray pictures [32]. The scale invariant structures of neuron firing time series differentiates between healthy and pathological conditions and between different types of pathological conditions [33], [34]. Therefore fractal analysis can be an important tool to extract distinguishable features in time domain for interictal and preictal state. Conventional fractal analysis estimates the Hurst exponent that defines the particular kind of scale invariant structure of the time series. As an extension of detrended fluctuation analysis (DFA), local detrended fluctuation analysis (L-DFA) was conceived by [35] to characterize the dynamical fractal structure of biomedical signals. The fractal spectrum is calculated from the local Hurst exponent which will fluctuate in time and identify the time instant of structural changes within the signal. Thus, temporal variations in the structure of EEG can be effectively revealed by L-DFA.
For a given time series {x k }, the scale-dependent measure µ s,m 0 is defined as a root-mean-square fluctuation of the integrated response time series y m around a polynomial trendŷ m,λ of order λ within a floating trial interval [m 0 −s/2, m 0 +s/2], where m 0 is the center of the interval and s is the sample size (i.e., scale): A second-order polynomial detrending (i.e., λ = 2) of the response series is used in this work. The local Hurst exponent h loc is estimated as the linear regression slope of log-log plot of µ s,m 0 versus scale s, which portrays the local variation and self-similarity of the time series {x k }. The fractal spectrum D h [36] is defined as follows: where P h is the normalized distribution of h loc in logcoordinates, ε is the bin size of the histogram used to define P h and P h max is the maximum probability of h loc . The fractal spectrums of local Hurst exponent series in each sub-band for channel Fp1-F7 of the EEG segment in case Chb01 are shown in Fig. 3. The abscissa value of the apex, which is the most prominent index in the spectrum to characterize the temporal structure of each channel, is used to form the time-domain feature matrix of the EEG segment.

2) RELATIVE BAND ENERGY
Signal energy shows enormous potential in tracking the seizure generation since it is sensitive to waveforms such as seizure-like bursts of neural electrical activity. To bring out the signal energy evolution hidden in different frequency ranges that is related to the potential seizure attack, we calculate the relative band energy based on the wavelet packet coefficients.
As mentioned before, a 7-level (i.e., N = 7) WPT is implemented on raw EEG signals in this work. Each subspace of the binary tree is labeled by (g, p), where g is the depth of the subspace in the tree, and p is the index of the subspace at the same depth. The sub-band signal is reconstructed by the wavelet packet coefficients {d(g, p)} of corresponding subspaces. Mathematically, the energy of sub-band ω is computed as The relative energy measures the ratio of the energy in band ω to the total energy of the signal in logarithm scale, which is computed as where E Total is the sum of band energy in delta, theta, alpha, beta, gamma-1, and gamma-2 band. The relative energy RE ω is computed for each channel of the EEG segment.

3) SYNCHRONIZATION MODULARITY
In this work, the functional connectivity of multi-channel EEG is quantified by the synchronization modularity. Firstly, the synchronization is calculated as the connection strength between different channels. Thus, a weighted graph can be constituted based on the connection matrix by regarding each channel as a node. Then the community detection algorithm is applied on the graph to obtain the synchronization modularity of EEG. The phase locking value (PLV) is a commonly used bivariate measurement that reflects the degree of phase synchronization in a specific frequency band [37]. Given that φ X (t) and φ Y (t) are the instantaneous phase of the input signal X (t) and Y (t) at time point t, respectively, PLV is calculated by where l is the time point index, L is the number of time points in a time window. PLV indicates the phase synchronization between two signals by the average of phase differences in this time window. The PLVs range from 0 to 1, where ''0'' means two signals are completely asynchronized, and ''1'' means two signals are fully synchronized. The modularity was defined by Newman and Girvan [38] as the quality function of community detection. In this work, the modularity is utilized to measure the functional connectivity of scalp electrical activity based on phase synchronization. The synchronization modularity Q sm can be calculated by where S ij is the phase synchronization (i.e., PLV) between channel i and channel j, and S is the sum of phase synchronization of all channel pairs. V i denotes the community index of channel i. δ(V i , V j ) yields ''1'' if channel i and channel j are in the same community (i.e., V i = V j ), ''0'' otherwise (i.e., V i = V j ). The detailed procedure of the graph community detection algorithm is as follows: 1) Start from v communities (v is the number of channels for a v-variate signal), where each community contains a single channel. Calculate the original synchronization modularity Q 0 ; 2) Reduce the number of communities by merging one community into another. Calculate the Q sm of each merging mode, and determine the mode with maximum modularity increment Q as the best merging mode; 3) Merge the communities according to the best merging mode, and repeat the iteration step until only one community exists; 4) The community partition mode with maximum Q sm is considered as the best mode. For channel i, the modularity Q i of its corresponding community is recorded.
The connection graph and the corresponding modularity map in a 10-second window of beta-band EEG are given in Fig. 4. A higher Q i in the map means a stronger interaction among channels inside the community. The modularity map digs out the potential information from the connection graph and clearly reveals the functional regions of scalp electrical activity.

C. MULTI-VIEW CONVOLUTIONAL GATED RECURRENT NETWORK 1) DEPTH-WISE SEPARABLE CONVOLUTION
Depth-wise separable convolution was proposed by [39] to accelerate the computation of CNN by separating the standard convolution into depth-wise convolution and point-wise convolution. It has been applied and made great progress in different research fields, especially in computer vision tasks. The typical implementation of depth-wise separable convolution is Xception [40], which is a variant model of Inception [41].
In [40], it is clarified that the standard convolutional kernel can be seen as a three-dimensional filter: depth dimension and space dimension (corresponding to the width and height of the feature map). The convolution operation actually implements a joint mapping of depth correlation and spatial correlation. The combination of depth features and the combination of spatial features can be performed separately by the depth-wise convolutional layer and the 1 × 1 point-wise convolutional layer. It shows that the computation can be significantly reduced with slight performance loss [39], [40].

2) GATED RECURRENT UNIT
GRN is one of the most efficient and powerful variants of the standard recurrent neural network (RNN) [42]- [44]. It is composed of a series of cascaded gated recurrent units (GRUs). The GRU employs a gating structure to control the flow of information inside the unit. Thus, the cascaded GRUs can adaptively capture the sequence dependencies over various lengths of time and avoid the vanishing gradient problem of long-term dependence in standard RNN. Compared with long short-term memory (LSTM) network, which is another popular variant of RNN, GRN can achieve the similar performance with a much simplified structure and fewer parameters. Fig. 5 shows the typical architecture of a GRU which has two main gates, namely the update gate and the reset gate. The update gate z n is used to determine how much information of the previous state is brought into the current state. The reset gate r n is used to determine how much information hidden in the previous state needs to be forgotten. All the relationship can be mathematically expressed as where u n is the input vector in the n-th state, c n andc n are the output vector and candidate output vector, respectively, W z , W r and W are trainable weight parameters, and σ is the sigmoid function.

3) ATTENTION-ASSISTED MV-CGRN
The framework of the proposed Mv-CGRN is given in Fig. 6. In each time window, three views of features are extracted from raw EEG as the input of the network, which are local fractal spectrum (View 1), relative band energy (View 2), and synchronization modularity (View 3). The channel-wise features in each sub-band are mapped into a 8 × 8 matrix according to the relative positions of electrodes and eventually form a 6 × 8 × 8 matrix for each view. A depth-wise convolution layer is employed to process the feature matrix separately in each band, followed by a two-stack point-wise convolution layer which combines the multi-band feature matrices.
An attention layer is embedded between the flatten layers and the GRU layer to adaptively integrate multi-view features by weight parameters α f , and thus to capture the more distinguishable classification features for the particular patient.
Supposing that H f is the output vector of the f -th view of feature, the attention mechanism is defined as follows: where W e and b e are trainable parameters. The integrated vector exported by the attention layer in the n-th window is the input vector of the GRU layer in the n-th state. The last output vector of the GRU layer is forwarded to the fully connected layer. Finally, a binary classification result of ''preictal'' or ''interictal'' is obtained based on the output of the softmax layer. Convergence of parameters in Mv-CGRN model during training is achieved by optimizing the crossentropy loss function.

IV. RESULTS AND DISCUSSIONS
The proposed seizure prediction method has been tested on CHB-MIT scalp EEG dataset, as described in Section II.
In spite of the abundant research in epilepsy, there is no univocally defined interval for the preictal state. In this work, the preictal interval is defined as 30 minutes' recordings preceding the seizure onset, if any. Recordings inside and outside the prediction interval constitute the positive and negative, respectively. A seizure is correctly predicted if at least one sample segment inside the corresponding preictal interval is classified as positive. On the contrary, the positive classification of a sample segment outside the preictal interval is counted as a false positive alarm. Usually, false positive alarms tend to appear in clusters during an abnormal period. Therefore, a refractory period is defined in which subsequent alarms are ignored once the system is triggered and gives the first alarm. The length of the refractory period is determined as 5 minutes in this work. Sensitivity (SEN, the ratio of correctly predicted seizures to the total number of seizures) and false-positive-rate (FPR, the number of false alarms per hour) are employed as two dominant evaluation indexes for the prediction system. The prediction time of a seizure is calculated as the distance between the first positive sample segment determined by the system and its onset. The prediction time indicates how early a seizure is predicted and acts as another important index. The average, minimum, and maximum prediction time of successfully predicted seizures are denoted byT P , T P min , and T P max , respectively. For each case, the features are computed over a 1-second time window, and 10 successive windows make up a 10-second sample segment. The preictal and interictal recordings are divided into half-overlapping segments. Due to the  limited number of seizures and the relatively short length of preictal intervals, the number of negative samples is much larger than that of positive samples. The classifier tends to be more accurate toward the class with the larger number of training samples [45]. Thus, equal numbers of preictal segments and interictal segments are selected to overcome   groups are adopted for training except one group reserved for test. By this method, all seizures are covered in the test and the tested seizure is unseen during training. The evaluation indexes are averaged across M test groups in each trial and then averaged across 10 trials of LOOCV for each case. Table 1 gives the experimental results of the proposed seizure prediction method. Owing to the combination use of depth-wise separable convolution and GRN, the proposed classification framework can process the features with complicated data structure, which overcomes the limitation of conventional classifiers like SVM, artificial neural network (ANN), CNN, and RNN. In most cases, all seizures can be successfully predicted with a low FPR. The area-under-curve (AUC) value provides an alternative for evaluating the classifier performance. It shows the direct classification ability on all the EEG segments, without any constraints. A high average AUC value of 0.901 is achieved by the attentionassisted Mv-CGRN. Fig. 7 plots the stacked bar chart of average weight parameters α f in the attention layer for each case, by which the patient-specific combination of multi-view features can be visualized. We have also compared the performance of Mv-CGRN with relevant classification frameworks that utilize benchmark algorithms. The depth-wise separable convolution and GRN were replaced by standard convolution and LSTM, respectively. The rest part of Mv-CGRN remained unchanged, and the same hyper-parameters were used including the batch size, learning rate, epoch, and the number of hidden units. As shown in Table 2, the highest sensitivity and AUC value are achieved by the combination of standard convolution and LSTM. It holds a slight advantage at the cost of a large number of trainable parameters, which means a high computational complexity. By contrast, Mv-CGRN achieves the similar performance with much fewer parameters, reduced by 61.96%. Table 3 gives the comparison of our work with recent seizure prediction works that use the same CHB-MIT scalp EEG dataset. It is difficult to determine the best work without any dispute since each work employs different constraints and is tested on a limited set of dataset with different evaluation methods. Moreover, there are some differences between the application scenarios of each work. Thus, we have tried our best to do the comparison pertinently and objectively. Zhang and Parhi [5] presented a seizure prediction method based on spectral power features that are ranked and selected in a patient-specific manner. The method, tested on 78 seizures of 17 cases, achieves the best performance among all the compared works in terms of SEN and FPR, which are 98.68% and 0.046/h, respectively. However, it requires a prior knowledge and an adequate validation process to select the optimal classification features for each patient. The procedure needs to be performed repeatedly for new patient datasets, which reduces the generalizability of the method. Liu et al. [17], proposed a multi-view CNN framework to analyze the features of time domain and frequency domain. The method also achieves a relatively high sensitivity over 90%. However, it was only tested on 2 cases and didn't use the LOOCV. Our work achieves an average sensitivity of 94.50% and an average FPR of 0.118/h, outperforming all the compared works except [5]. In our work, 140 seizures are selected from 24 cases for training and test based on the aforementioned standards. Though the number of seizures is not the most, it is sufficient to ensure the reliability and generalizability of our method. The length of preictal interval is defined individually by each work, ranging from 5 minutes to 120 minutes, which has a direct impact on relevant indexes including SEN, FPR, and prediction time. A moderate length of 30 minutes is adopted in our work, and we have focused on exploring the potential evolutions during this interval. The average prediction time of our method is 27.15 minutes. It is an appropriate distance for the daily prevention of seizure onset because the patient doesn't have to suffer the uncertainty of an attack over a long time while still having enough time to take the suitable action before the onset.

V. CONCLUSION
This work presents a patient-specific seizure prediction method using multi-view features and a novel Mv-CGRN framework. Raw EEG are decomposed into sub-band signals by WPT according to the intrinsic frequency bands of human brain. Local fractal spectrum, relative band energy, and synchronization modularity features are extracted to characterize the variations of EEG from the interictal state to the preictal state in various perspectives. The Mv-CGRN, assisted by the attention mechanism, is employed to process the spatiotemporal feature sequences for different patients in a selfadaption manner. The proposed method achieves an average sensitivity of 94.50%, an average FPR of 0.118/h, and an average prediction time of 27.15 minutes, using the leaveone-out cross validation on CHB-MIT scalp EEG dataset. The seizure prediction method in this work shows a reliable and promising performance compared with the state-ofthe-art works in the same field. With the aim of achieving a more accurate prediction and meeting the real-life use requirements, the future research will focus on three aspects: (i) exploring an efficient method to automatically extract features in a patient-specific manner; (ii) introducing trainable parameters to determine the appropriate preictal interval for each patient; (iii) furtherly simplifying the structure and shortening the computing time of the Mv-CGRN model.