Spatial-Temporal Neural Network for P300 Detection

P300 spellers are common brain-computer interface (BCI) systems designed to transfer information between human brains and computers. In most P300 detection settings, the P300 signals are collected by averaging multiple electroencephalographic (EEG) responses to the same target stimuli, so participants must endure many repeated stimuli. In this study, a spatial-temporal neural network (STNN) based on deep learning (DL) is proposed for P300 detection. It detects P300 signals by combining the outputs of a temporal unit and a spatial unit. The temporal unit is a flexible framework consisting of several temporal modules designed to analyze brain potential changes in the time domain. The spatial unit combines one-dimensional convolutions (Conv1Ds) and linear layers to generalize P300 features in the space domain, and it can decode EEG signals recorded with different numbers of electrodes. Both amyotrophic lateral sclerosis (ALS) patients and healthy subjects can benefit from this study. In both within-subject and cross-subject P300 detection, our approach achieved higher performance with fewer repeated stimuli than the comparative approaches. Furthermore, we applied the proposed STNN to the P300 detection challenge of BCI Competition III. The accuracy score was 89% in the fifth round of repeated stimuli, outperforming the best result in the literature (accuracy = 80%), to the best of our knowledge. These results demonstrate that the proposed STNN performs well with limited stimuli and is robust across various P300 detection tasks.


I. INTRODUCTION
Brain-computer interface (BCI) systems enable neural signals to control external devices directly. In recent years, BCIs have been applied in many fields, such as environmental control [1], communication [2], and neurofeedback rehabilitation [3]. Electroencephalography (EEG) monitoring is one of the most popular measurement tools in BCI applications because of its non-invasiveness, mobility, and relatively low cost [4].
The P300 speller, an EEG-based BCI paradigm, was first proposed by Farwell and Donchin [5], as shown in Figure 1. During spelling, the participants are required to focus their gaze on the lighted characters while the rows and columns of 36 alphanumeric characters are randomly intensified. In this process, the participants' brain activity changes evoked by the target characters are called event-related potentials (ERPs). Within the ERPs, the P300 signal is one of the most robust components; it corresponds to a positive deflection occurring 250-500 ms after a target presentation [6].
An efficient P300 detection technique is a valuable contribution to the BCI community, especially for humans such as amyotrophic lateral sclerosis (ALS) patients. The main contributions of this study include the following: we designed a spatial unit constructed using Conv1Ds and linear layers, which can generalize the spatial features of EEG signals recorded using different numbers of electrodes (for example, 8, 16, or 64); and we designed a temporal unit inspired by [8], which analyzes brain potential changes in the time domain by stacking multiple temporal modules, where the number of temporal modules can be adjusted according to the corresponding P300 detection task.
We demonstrate the effectiveness of our model using three public databases: P300 speller with ALS patients [9], covert and overt ERP-based BCI [10], and BCI Competition III-dataset II [11].
The remainder of this paper is structured as follows. Section II introduces related work; Section III describes the databases and data preprocessing procedures; Section IV details the proposed STNN; Sections V and VI present the results and discussion; and Section VII concludes the paper.

II. RELATED WORK
Current mainstream P300 detection approaches can be categorized into two types: deep learning (DL) approaches and traditional approaches using statistical features and classifiers. In the traditional ones, feature extraction mainly relies on measures such as independent component analysis (ICA) [12], canonical correlation analysis (CCA) [13], common spatial patterns (CSP) [14], and the xDAWN spatial filter [15]. Commonly used classifiers include linear discriminant analysis (LDA) [16], the support vector machine (SVM) [17], and the Riemannian geometry classifier (RGC) [18], among others. Of these, the combination of xDAWN and RGC is perhaps the most potent approach for P300 detection [19], as it exhibits a strong generalization capability for variable EEG signals. Nevertheless, it is still not as competitive as DL approaches [20].
The convolutional neural network (CNN), as a representative DL framework [21]-[26], has attracted widespread attention from the BCI community. In 2010, Cecotti et al. [23] first proposed a CNN-based P300 detection approach that won the third BCI competition. This method adopts a four-layer CNN architecture to extract channel features and temporal features in sequence, demonstrating that a CNN can capture both spatial peculiarities and latent serial dependencies from EEG signals. However, although CNNs improved detection accuracy to an unprecedented level, two major obstacles still lie ahead for such methods. First, the network accuracy depends on the quality and quantity of training data, while the amount of high-quality data commonly remains limited in P300 tasks because of the high cost in time and labor. Second, the P300 response is a relatively small potential change presented at high resolution in the time domain [24], yet CNN-based frameworks are not skilled at decoding sequential information with limited EEG data.
To resolve the above problems, some recent DL approaches strengthen the learning capability of neural networks when limited data are available [27]-[29], or adopt more advanced architectures to optimize the feature extraction procedure [30]-[33]. EEGNet [33], a generic DL network implemented with depth-wise and separable convolutions, yields satisfactory results in various EEG detection tasks. This network first extracts temporal features from the EEG signals and then performs spatial filtering on each temporal feature map. With this design, the network can directly perform sequential learning on raw EEG signals and then generalize the captured dependencies in the space domain. It is more competitive for P300 detection than other DL-based pure sequence models, such as recurrent neural networks and long short-term memory networks. However, this network relies on multiple repeated stimuli to collect EEG signals.
III. DATABASES

A. DATASET 1: P300 SPELLER WITH ALS PATIENTS
In Dataset 1 [9], every participant went through 35 trials, with 10 rounds of repeated stimuli in each trial. Every round of stimuli contained two target stimuli and 10 nontarget stimuli, where a stimulus was a random intensification of a row or a column. The two target stimuli indicated the intensifications of the row and the column of the target character, respectively; the nontarget stimuli were the intensifications of the rows and columns of the nontarget characters. The time between the onsets of two adjacent stimuli, called the stimulus onset asynchrony (SOA), was 250 ms, where the intensification time and the inter-stimulus interval (ISI) were both 125 ms.

B. DATASET 2: COVERT AND OVERT ERP-BASED BCI
In Dataset 2 [10], EEG signals were recorded using two interfaces: Farwell and Donchin's interface and the GeoSpell interface (Figure 2). The recordings using the two interfaces both included three sessions, with six trials in each session. Each trial contained eight rounds of repeated stimuli, with 12 stimuli (two target stimuli and 10 nontarget stimuli) in every round. The SOA and the ISI were 250 ms and 125 ms, respectively. As for the stimulation patterns, rows or columns were illuminated on Farwell and Donchin's interface as described for Dataset 1, whereas the GeoSpell interface displayed six characters per time interval until all 36 had appeared twice.

C. DATASET 3: BCI COMPETITION III-DATASET II
In Dataset 3 [11], the EEG signals, recorded using Farwell and Donchin's interface, were bandpass filtered between 0.1 and 60 Hz and digitized at 240 Hz from 64 channels. The EEG signals from each of the two subjects (A and B) were divided into a training set (85 trials) and a testing set (100 trials). Every trial contained 15 rounds of repeated stimuli, and the intensification time and the ISI were 100 ms and 75 ms in each round.

D. DATA PREPROCESSING
The EEG signals of Datasets 1-3 were downsampled to 128, 128, and 120 Hz, respectively. They were then bandpass filtered between 0.1 and 20 Hz with a fifth-order Butterworth filter [36] to remove short-term fluctuations and keep the longer-term trends [37]. Finally, epochs were extracted from 0 to 0.5 s after each stimulus onset, as shown in Table 1.
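As a concrete illustration, this preprocessing chain can be sketched with SciPy. The function name is ours, and the zero-phase (forward-backward) filtering choice is an assumption, since the paper does not state how the Butterworth filter was applied:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample

def preprocess_epoch(eeg, fs_in, fs_out, band=(0.1, 20.0), epoch_sec=0.5):
    """Downsample, bandpass with a fifth-order Butterworth filter, and crop
    the first `epoch_sec` seconds after stimulus onset.

    eeg: array of shape (channels, samples) sampled at fs_in Hz.
    """
    # Resample to the target rate (e.g. 240 Hz -> 120 Hz for Dataset 3).
    n_out = int(round(eeg.shape[-1] * fs_out / fs_in))
    x = resample(eeg, n_out, axis=-1)
    # Fifth-order Butterworth bandpass (0.1-20 Hz), applied forward-backward
    # via second-order sections for numerical stability and zero phase shift.
    sos = butter(5, band, btype="bandpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x, axis=-1)
    # Keep the 0-0.5 s window after stimulus onset.
    return x[..., : int(epoch_sec * fs_out)]
```

For Dataset 3, for example, `preprocess_epoch(raw, fs_in=240, fs_out=120)` maps a 1-s, 240-sample epoch to a 60-sample epoch per channel.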

IV. METHODS
This section describes the proposed STNN, in which the temporal unit and the spatial unit are connected in parallel, as shown in Figure 3. The details are as follows.

A. PARALLEL MECHANISM
The proposed model adopts a parallel mechanism to perform simultaneous analysis of EEG information in the time and space domains, which is expressed as:

y = σ( f_T(x; θ_T) + f_S(x; θ_S) ),  (1)

where x and y denote the input EEG signals and the output of predicted results (the ideal output is either 1 (target) or 0 (nontarget)), f_T(·; θ_T) and f_S(·; θ_S) represent the functions of the temporal unit and the spatial unit, θ_T and θ_S are the network parameters, and σ is an S-shaped activation function.
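In PyTorch terms, this joint decision can be sketched as follows. The class and argument names are ours, the unit internals are placeholders, and only the sum-then-sigmoid structure reflects the mechanism described above:

```python
import torch
import torch.nn as nn

class STNN(nn.Module):
    """Sketch of the parallel decision mechanism:
    y = sigmoid(f_T(x) + f_S(x)). The two units are passed in as
    black boxes; see the unit descriptions for their real structure."""

    def __init__(self, temporal_unit: nn.Module, spatial_unit: nn.Module):
        super().__init__()
        self.temporal_unit = temporal_unit  # f_T(x; theta_T) -> logit
        self.spatial_unit = spatial_unit    # f_S(x; theta_S) -> logit

    def forward(self, x):                   # x: (batch, channels, time)
        # The two logits are summed before the S-shaped activation,
        # so both units contribute to one joint prediction in [0, 1].
        return torch.sigmoid(self.temporal_unit(x) + self.spatial_unit(x))
```

Any modules mapping a `(batch, channels, time)` tensor to a `(batch, 1)` logit can be plugged in for the two units.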

B. SPATIAL UNIT
The spatial unit utilizes the global features of the EEG signals in the space domain for P300 detection. It is composed of Conv1Ds, linear layers, weight norms (WNs) [38], max-pooling operations, rectified linear units (ReLUs), and a dense layer, as shown in Table 2. The hyperparameter tuning process is given in A.1 and A.2 (Appendix-A). This unit generalizes spatial features from the horizontal and vertical dimensions through the combination of multiple Conv1Ds and linear layers, which improves the model's robustness in different P300 detection tasks.

Dense layer. The dense layer is connected to the extracted features, producing the predicted results of the spatial unit.
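A minimal sketch of such a unit can be assembled from the ingredients listed above (Conv1D, weight norm, max-pooling, ReLU, linear, and dense layers). The layer ordering and channel widths here are illustrative assumptions, not the exact Table 2 configuration:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

def make_spatial_unit(n_channels: int, n_times: int, hidden: int = 128):
    """Hypothetical spatial unit built from the components named in the
    text; a 1x1 Conv1D mixes the electrode dimension, so the unit adapts
    to different electrode counts (8, 16, or 64) via `n_channels`."""
    return nn.Sequential(
        weight_norm(nn.Conv1d(n_channels, hidden, kernel_size=1)),  # mix electrodes
        nn.ReLU(),
        nn.MaxPool1d(2),                        # compress the time axis
        nn.Flatten(),
        nn.Linear(hidden * (n_times // 2), hidden),
        nn.ReLU(),
        nn.Linear(hidden, 1),                   # dense layer -> spatial logit
    )
```

For 8-channel, 64-sample epochs (Dataset 1), `make_spatial_unit(8, 64)` maps a `(batch, 8, 64)` tensor to a `(batch, 1)` logit.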

C. TEMPORAL UNIT
The temporal unit detects P300 signals by learning the features of temporal changes in EEG signals. It comprises n temporal modules and a dense layer, as shown in Figure 3. The number of temporal modules can be customized according to the input EEG signals.
Temporal module. As shown in Figure 4, each temporal module is assembled from a temporal analyzer and a global generalizer with a residual connection. The temporal analyzer performs sequence analysis, which is the core of learning the features of temporal changes. The global generalizer generalizes features from the raw EEG signals or from the outputs of the previous temporal module, providing global information for the next sequence analysis. This can be expressed as:

x^(i) = M( x^(i-1) ),  (2)

where x^(i-1) and x^(i) are the input and output of the i-th temporal module, both of size C × T; C and T indicate the number of channels and time points, respectively; and M represents the temporal module.

Temporal analyzer. The temporal analyzer is composed of four components: a dilated Conv1D [39], a clipping operation, a weight norm, and a ReLU. Within the temporal analyzer of the i-th temporal module, the hyperparameters of the dilated Conv1D include the input channels, output channels, kernel size, dilation, and zero-padding, where the input and output channels equal the number of electrodes of the input EEG signals, and the kernel size, dilation, and zero-padding are k, 2^i, and (k − 1) × 2^i, respectively. By these settings, the range over which temporal changes are learned is constantly extended. The clipping operation cuts off the extra outputs introduced by the zero-padding so that the sequence length is preserved. For a temporal unit stacked with n temporal modules, the output x^(n)_{:,t} is the n-level mapping of the raw EEG signals x^(0)_{:,t−R+1}, …, x^(0)_{:,t}, which yields the final result of the temporal unit by connecting with a dense layer; R indicates the length of the receptive field, as calculated in (3):

R = k × 2^(n−1),  (3)

where k is the kernel size of the dilated Conv1D within each temporal module. For EEG epochs in different P300 detection tasks, we can customize the temporal detection range by adjusting n (the number of stacked temporal modules) and k (the kernel size of the dilated Conv1D in each module).
Global generalizer. The structure of the global generalizer is similar to that of the spatial unit, consisting of Conv1Ds, linear layers, and WNs, as shown in Table 3. The hyperparameter tuning process is given in A.3 and A.4 (Appendix-A).
Residual connection. The residual connection simplifies the network learning process, especially when multiple temporal modules are stacked.
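Putting the pieces together, one temporal module can be sketched as a dilated causal convolution block in PyTorch. This is an assumption-laden sketch: the global generalizer is reduced to a 1×1 Conv1D placeholder, and the clipping convention (trimming the padded tail) is our reading of the description:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class TemporalModule(nn.Module):
    """Sketch of the i-th temporal module: a temporal analyzer (dilated
    Conv1D with dilation 2**i and zero-padding (k-1)*2**i, a clipping step
    restoring the original length, weight norm, ReLU), a placeholder global
    generalizer, and a residual connection."""

    def __init__(self, channels: int, k: int, i: int):
        super().__init__()
        d = 2 ** i                              # dilation of the i-th module
        self.pad = (k - 1) * d                  # zero-padding from the text
        self.conv = weight_norm(
            nn.Conv1d(channels, channels, k, dilation=d, padding=self.pad))
        self.relu = nn.ReLU()
        self.generalizer = nn.Conv1d(channels, channels, 1)  # placeholder

    def forward(self, x):                       # x: (batch, C, T)
        h = self.conv(x)[..., : -self.pad]      # clip the extra padded tail
        h = self.relu(h)
        return self.generalizer(h) + x          # residual connection
```

Stacking modules for i = 0, …, n−1 keeps the sequence length fixed at T while the dilation, and hence the temporal range, doubles at each level.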

V. EXPERIMENTS
We performed three experiments to evaluate the proposed STNN: 1) P300 detection under multiple repeated stimuli with ALS patients; 2) P300 detection on two speller paradigms with healthy subjects; and 3) an ablation and combination study using BCI Competition III-dataset II.
All the experiments involved 30 iterations of network training within 5 minutes, where the Adam optimizer [40] (learning rate = 0.001) was used to minimize the binary cross-entropy (BCE) [41] between the outputs and the labels in the PyTorch [42] environment. The evaluation metrics included accuracy, the area under the receiver operating characteristic curve (AUC) [43], the F1-score, and the Kappa coefficient [44].
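This training setup can be sketched as below. The function signature and the full-batch update are our assumptions, since no batch size is reported:

```python
import torch
import torch.nn as nn

def train_stnn(model, x, y, iters: int = 30, lr: float = 1e-3):
    """Minimal sketch of the described training setup: Adam (lr = 0.001)
    minimizing binary cross-entropy for 30 iterations.

    x: EEG epochs, shape (N, channels, time); y: labels in {0, 1}, shape (N, 1).
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()                 # model outputs sigmoid probabilities
    for _ in range(iters):
        opt.zero_grad()
        loss = loss_fn(model(x), y)        # y: 1 = target, 0 = nontarget
        loss.backward()
        opt.step()
    return loss.item()
```

Any model ending in a sigmoid, such as the parallel STNN described in Section IV, fits this loop unchanged.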
The reference approaches were 3D Input CNN [45], the winner in BCI Competition III [23], and EEGNet-t&p [27], where t and p were the number of temporal filters and pointwise filters, respectively.

A. EXPERIMENT 1
The first experiment explored our model's performance under multiple rounds of repeated stimuli with ALS patients using Dataset 1 [9]. As described in Section III, Dataset 1 was composed of 8-channel EEG signals from eight ALS subjects, with 35 trials for each subject. Each trial included EEG signals under 10 rounds of repeated stimuli, with 12 stimuli (two target and 10 nontarget stimuli) in each round. In a trial, by averaging the EEG epochs under the same stimuli from rounds 1 to r, we obtained 12 EEG epochs for training or testing the model under the r-th round of repeated stimuli. The proposed model is denoted STNN-n&k, where n is the number of temporal modules and k is the kernel size of the dilated Conv1D in each module. STNN-3&15, 4&7, 4&8, 5&3, and 5&4 were used in the experiment; they produce receptive fields of length 60, 56, 64, 48, and 64, covering the main portion of the EEG signals (the data length is 64 in Dataset 1) in the temporal domain. The reference models were EEGNet-4&2, 8&2, 16&2, 4&4, 8&4, and 16&4, of which EEGNet-8&2 and EEGNet-4&2 are the most used in [33].
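Each receptive-field length quoted above equals k · 2^(n−1) (our reading of Eq. (3)), which can be checked in a few lines:

```python
def receptive_field(n: int, k: int) -> int:
    """Receptive-field length of a temporal unit with n stacked temporal
    modules and kernel size k: R = k * 2**(n-1)."""
    return k * 2 ** (n - 1)

# The STNN-n&k configurations used in Experiment 1:
configs = [(3, 15), (4, 7), (4, 8), (5, 3), (5, 4)]
print([receptive_field(n, k) for n, k in configs])  # -> [60, 56, 64, 48, 64]
```

All five configurations thus cover most or all of the 64-sample epochs of Dataset 1.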
We implemented a within-subject P300 detection and a cross-subject P300 detection, respectively. In the within-subject task, we randomly selected 20 trials (240 EEG epochs) from each subject for model training and used the remaining 15 trials (180 EEG epochs) for testing. The average results of the eight subjects with 1-10 rounds of repeated stimuli are shown in Tables 4 and 5. From the performance comparison, we can see that the average AUC and F1 scores of the proposed models are higher than those of their competitors under 1-10 rounds of repeated stimuli. All our models reach AUC scores above 0.95 using the EEG signals from the first five rounds of repeated stimuli, while EEGNet cannot reach this level until at least the ninth round. Moreover, the average F1 score of our models under the fifth round of repeated stimuli was 25.3% higher than that of EEGNet under the same condition and close to that of the reference models using 10 rounds of stimuli. This demonstrates that the proposed models can attain similarly high detection accuracy using fewer repeated stimuli, thereby reducing the number of stimuli ALS patients must endure in applications.
In the cross-subject P300 detection, we utilized all trials of five randomly selected subjects for network training and the trials from the remaining three subjects to evaluate the network performance. The average AUC and F1-score results of five experiments following the above steps are listed in Tables 6 and 7, where we can see that our models obtain improvements of 2% in the average AUC score and 12.4% in the average F1 score under 10 rounds of stimuli. Notably, our models using EEG signals from six rounds of stimuli can reach performance similar to that of the reference models using 10 rounds of stimuli. The average results of the Kappa coefficient are given in Figure 6. According to [44], a study has substantial reliability when the Kappa coefficient is greater than 0.6. STNN-average reached this standard under the 3rd and 6th rounds of repeated stimuli in the within-subject detection and cross-subject detection, respectively, while EEGNet-average fulfilled the criterion under the 7th and 10th rounds of stimuli, respectively. Overall, the proposed models achieved advantages in both the within-subject and cross-subject P300 detections, and they require four fewer rounds of repeated stimuli than the reference models to attain similar results.

B. EXPERIMENT 2
The second experiment studied our model's performance on two P300 speller paradigms (Farwell and Donchin's paradigm and the GeoSpell paradigm) with healthy subjects using Dataset 2 [10]. Dataset 2 comprised 16-channel EEG signals from 10 healthy subjects, with three sessions recorded for each paradigm. In the within-subject experiment, two sessions from each healthy subject were randomly chosen as the training set, and the remaining one was used for testing the models. Figures 7 and 8 give the AUC, F1-score, and Kappa coefficient results of the proposed and reference models using Farwell and Donchin's paradigm and the GeoSpell paradigm. We can see that STNN-3&15, 4&7, 4&8, 5&3, and 5&4 all achieved perfect detection (AUC, F1 score, and Kappa coefficient equal to 1) under the second round of stimuli on Farwell and Donchin's paradigm and under the fourth round of stimuli on the GeoSpell paradigm, while the reference models needed at least four and six rounds of repeated stimuli, respectively, to reach this goal. This shows that our models achieve better performance using fewer stimuli with healthy subjects.
In the cross-subject experiment, we utilized all the trials from five randomly selected subjects to train the network parameters, and the trials from the remaining five subjects were used for testing. Figures 9 and 10 show that our models always score higher than or equal to the reference models under 1-8 rounds of repeated stimuli and reach perfect detection under the second round of stimuli on both paradigms, while the reference models do not fulfill this condition until the third or fourth round of stimuli. Therefore, the proposed models can reduce the repeated stimuli required for healthy subjects and are robust to the two different P300 speller paradigms.

C. EXPERIMENT 3
To measure the contribution of individual components and component combinations to the model performance, the third experiment was an ablation and combination study on Dataset 3 (BCI Competition III-dataset II, 64 channels) [11], where the training and testing sets of the two subjects (A and B) are described in Section III. To compare with other models, we implemented the P300 detection using the same evaluation metrics and rounds of stimuli as in the literature [23], [45]. In the combination study, we tested the accuracy of multiple combinations of STNN, including STNN-1&60, 2&30, 3&15, 4&7, 4&8, 5&3, 5&4, and 6&2. In the ablation study, the above combinations were compared with their temporal units alone and with an STNN assembled with only the spatial unit. The results are shown in Table 8, from which we can see that 1) the network performance is continuously improved by stacking one to four temporal modules in the temporal unit, while the performance is not further enhanced when stacking more than four modules; 2) the parallel mechanism of the temporal unit and the spatial unit improves the accuracy by 1-3% over the temporal unit working alone; 3) the temporal unit is superior to the spatial unit in terms of average accuracy; and 4) to the best of our knowledge, the proposed STNN stacked with four or more temporal modules outperforms the best state-of-the-art model in the literature by at least 9% in accuracy.

VI. DISCUSSION
In this paper, we propose a novel DL model called STNN for P300 detection. The results show that STNN performs better than other DL models and reduces the number of repeated stimuli in different P300 detection tasks. Both healthy subjects and ALS patients can benefit from this research, even with limited data.
The main reasons are as follows: 1) the temporal unit, as a flexible DL-based network dedicated to time-domain modeling, can capture the temporal dependencies from brain potential changes by constructing an end-to-end multi-level sequential mapping, so it is more sensitive than the previously mentioned approaches when detecting P300 signals; 2) the spatial unit can constantly generalize and compress P300 features in the space domain, which hedges complex noise interference to a certain extent; 3) a joint decision-making mechanism is built into the network by connecting the temporal unit and the spatial unit concurrently, which can utilize the above advantages of the two units, thus achieving both better performance and stronger robustness, as shown in Experiments 1 and 2.
Furthermore, it should be emphasized that stacking multiple temporal modules within the temporal unit is critical for sequential modeling, as shown in Experiment 3. The network accuracy is constantly improved when one to four temporal modules are stacked, which demonstrates that a more complicated multi-level sequence model is more suitable for characterizing temporal changes in human brain regions. Nevertheless, over-stacking temporal modules cannot endlessly improve performance; it rather increases the model complexity due to the larger number of training parameters, as the accuracy scores of STNN-4&7, 4&8, 5&3, 5&4, and 6&2 are almost equivalent in Table 8. Even so, our results still significantly outperform the best methods in the literature, to the best of our knowledge, in BCI Competition III. This is possible because, driven by the great success of 2D and 3D CNNs in image processing and video analysis, some current state-of-the-art DL frameworks are commonly focused on high-dimensional feature extraction from EEG data. However, the P300 signals present significant 1D features (deflections in the time domain) rather than high-dimensional ones, and 2D or 3D frameworks are not skilled at decoding features from EEG signals recorded with a small number of channels, because the EEG data inherently lack spatial resolution [46]. In contrast, our network focuses more on the temporal activities within the P300 signals and on EEG channel generalization in the space domain, and can thus capture more hidden information from EEG signals at a low SNR.
In the future, the proposed STNN is expected to reach a high information transfer rate (ITR) in online P300 detection, because the information transferred per unit of time is likely to increase as the rounds of repeated stimuli decrease. Moreover, we consider that this network has potential for applications in other areas of EEG-BCI systems, seeing that it is designed with a flexible structure and can be trained and tested quickly with limited data, as shown in B.1-4 (Appendix-B).

VII. CONCLUSION
Spatial-temporal neural network (STNN), a DL-based P300 detection network, is proposed in this paper. The network is a parallel architecture consisting of a temporal unit and a spatial unit. It can perform EEG channel generalization and analyze the brain's potential changes simultaneously.
The results on three public databases reveal that our network performs better with fewer rounds of stimuli than its competitors. It is robust with limited data and is suitable for decoding EEG data recorded with various electrode configurations. In the future, we expect the proposed network to play a critical role in online P300 detection and other areas of EEG-BCI systems.

APPENDIX-A: HYPERPARAMETER TUNING
A.1 lists the hyperparameter tuning process of the spatial unit, where the average AUC scores of STNN-3&15, 4&7, 4&8, 5&3, and 5&4 are given. We utilized Dataset 3 for training, validating, and testing the models: we performed 5-fold cross-validation on the training dataset (85 trials), and the testing dataset (100 trials) is described in Section III. We can see that the model performance can be improved by increasing the channel numbers in the spatial unit, while an excessive increase does not significantly improve its performance. Therefore, the spatial unit with a maximum channel number of 128 was adopted in our P300 detection study.
APPENDIX-A.1. Hyperparameter tuning of the spatial unit.

A.3 gives the hyperparameter tuning process of the global generalizers in the temporal module using Dataset 3, and A.4 shows the average training loss and validation loss of subjects A and B. We separately assembled global generalizers with different maximum channels into STNN-3&15, 4&7, 4&8, 5&3, and 5&4. According to the average 5-fold cross-validation and testing AUC scores of these five models, the model performance can be improved by using the global generalizers in the temporal modules. However, a huge number of training parameters led to computational redundancy without obviously improving the model performance. Therefore, the output channel of the global generalizer was set to 128 in our P300 detection study.

APPENDIX-B: RUNNING TIME
B.1-4 present the average running time of the three experiments. All the experiments were implemented using a Linux PC with two GeForce GTX 1080 GPUs.
B.1 shows the processing times of the within-subject and cross-subject P300 detection in the first experiment, where we list the average training and testing times for 1-10 rounds of stimuli. In the second experiment, we calculated the average training and testing times for 1-8 rounds of stimuli in the within-subject and cross-subject P300 detection; the results for Farwell and Donchin's paradigm and the GeoSpell paradigm are given in B.2 and B.3, respectively.
B.4 gives the average training and testing times of two subjects (A and B) using our model components and combinations in the third experiment.