Introduction
Brain Computer Interfaces (BCIs) are intended to create a direct connection between the human brain and computerized devices, enabling individuals to operate such devices without peripheral muscle involvement [1]. Several BCIs have been proposed over the years, leveraging information provided by brain electrical activity. Depending on the specific application, brain activity can be acquired either invasively through electrocorticogram (ECoG) or non-invasively through electroencephalogram (EEG) [2], [3]. EEG is commonly preferred because it is non-invasive, but scalp-measured brain electrical activity has a lower signal-to-noise ratio, which poses significant challenges to developing effective BCIs [4]. Despite their effectiveness, BCIs that rely on Steady-State Visual Evoked Potentials (SSVEP) and P300 (a positive deflection of electric brain activity that occurs 300ms after stimulus recognition [5], indicating the presence of conscious mental attention processes [6]) provide a reduced pool of available functionalities [7]. This is because P300 remains the same regardless of the stimulus that elicits it, and SSVEP is limited by the resolution of the system in detecting depolarization changes due to the different flashing frequencies [8]. To address these limitations, researchers have explored the use of Event-Related Potentials (ERPs), which offer a wider range of evoked responses that can be associated with different actions [9]. ERPs are time-locked responses to stimuli, and their shape depends on the category of the stimulus presented.
To exploit ERPs' potential, neuroscientists have extensively studied their morphology to identify reliable and characteristic markers. In more detail, the intent is to leverage domain knowledge to define a set of rules that an expert BCI system can use to recognize different ERPs and perform the corresponding action. Additionally, these studies can provide insight into the cognitive processes underlying perception and a better understanding of this phenomenon. For example, in a recent work [10], researchers analyzed ERPs generated by 10 stimuli categories from both visual and auditory domains to extract reliable markers for the perception process. By interpreting the spatio-temporal coordinates of the brain activity, they identified relevant voltage peaks and statistically investigated their relationship to the specific stimulus category. Despite the interpretability benefits provided by an expert BCI system, the definition of hard rules can be challenging due to the complexity of EEG data. Indeed, EEG typically contains multiple channels, which refer to different brain regions. Moreover, the morphology and amplitude of brain waves vary across individuals, requiring a time-consuming fine-tuning procedure for each new user. To address these challenges, research has been conducted to leverage machine- and deep-learning techniques to automatically identify ERPs, achieving promising results [11]. The advantage of these techniques is that they do not need any prior knowledge concerning the investigated domain. A significant machine-learning approach is presented in [12] and [13], where ERPs belonging to 14 different semantic categories of stimuli are recognized. The proposed approach largely overcame the accuracy threshold of 70% for each category of stimulus, which is considered the minimum requirement to guarantee a meaningful BCI communication [14].
In addition to perceived stimuli, the use of imagined stimuli in BCI has also been explored for patients who are unable to produce observable responses, such as those in a coma or severe locked-in state. Indeed, mind-reading applications would considerably expand the BCI potential in the clinical field. Again, studies have been conducted that demonstrate the potential of motor imagery-based BCI systems in enabling communication and control of external devices. For instance, in [15] a BCI system based on motor imagery was developed to allow patients to control robots. Similarly, in [16], a motor imagery-based BCI system was designed to control a wheelchair. These studies highlight the potential of motor imagery-based BCI systems in improving the quality of life for individuals with severe motor disabilities. Despite being effective at moving objects, motor imagery BCIs still provide a reduced pool of additional functionalities. Therefore, researchers investigated ERP-based BCIs using imagined stimuli as an alternative to provide a wider set of possible actions for the patients. These works have been inspired by evidence reported by recent studies that have found an overlap in neural processing related to perceived and imagined stimuli [17]. In more detail, studies demonstrated that similar brain regions activate in response to the same stimulus, whether it is perceived or imagined [18]. Furthermore, multi-voxel analyses revealed that similar sensory visual features can be identified in both perceived and imagined stimuli, with additional activity in the anterior fronto-temporal region during imagery tasks due to attention and memory processes [19]. For auditory stimuli, the secondary auditory cortex activates similarly for both perceived and imagined stimuli, while the primary auditory cortex tends to be more active in response to perceived stimuli [20]. However, recent findings have provided evidence of an activation of the Heschl gyri (A1) during music imagery [21].
To further explore these findings, a recent study has identified reliable markers in imagery ERPs to develop a robust set of hard rules for implementing a mind-reading BCI system [22]. Despite the promising results achieved by neuroscientists in identifying a reliable set of markers to distinguish imagery ERPs, few approaches have been proposed that resort to machine- and deep-learning techniques. Prior works, such as those presented in [23] and [24], leverage deep-learning techniques to distinguish between ERPs elicited by the imagery of different stimuli. Specifically, the former focused on distinguishing between ERPs related to the imagery of homes and human faces, achieving an accuracy of 68%. On the other hand, the latter study combined the potential of convolutional neural networks and genetic algorithms to distinguish between ERPs associated with the imagery of dogs, airplanes, and houses, achieving 60% accuracy. However, these performances are not enough to create a reliable BCI system.
In this study, we introduce MIRACLE, a novel machine learning-based BCI system that can learn characteristic patterns to accurately classify ERPs associated with different semantic categories of stimuli. To assess the performance of the approach, we conducted an extensive experimental campaign involving 20 volunteers, who were presented with 40 stimuli belonging to 10 semantic categories and asked to both perceive and imagine them. The proposed system demonstrates a very high accuracy, largely exceeding the 70% threshold for both perception and imagery ERPs. To reduce the number of electrodes needed, we also investigate how the number of considered channels affects the classifier’s performance. Our findings indicate that several electrodes can be neglected, improving patient comfort. These promising results lay the groundwork for the development of effective mind-reading BCI systems relying on stimuli imagery only. MIRACLE offers three main contributions over the existing literature. To the best of the authors’ knowledge, it represents the first attempt to use Functional Data Analysis (FDA) for the ERP recognition problem. This tool is crucial to retain the time information and reduce data dimensionality by means of Functional Principal Component Analysis (fPCA). Secondly, we propose a hierarchical classifier architecture that simplifies the multi-category stimuli classification problem into a series of binary classifications, thus enhancing the stimuli recognition performance and exceeding the results obtained so far for mind reading, over a wider set of stimuli. Last, we contribute to BCI technology by presenting an approach to quantify the importance of each channel in the decision-making process of the hierarchical classifier. This approach calculates the Kendall correlation between each channel and the first functional principal component.
The rest of the paper is organized as follows: Section II provides insights concerning the experimental procedure used to collect the ERP data considered in this work. Then, Section III presents the method designed to associate each ERP with the semantic category of the stimulus that elicited it. Section IV discusses the system’s performance. Section V explains the approach designed to investigate EEG channels’ importance and presents the obtained results.
Experimental Setup
This Section outlines the experimental methodology designed for ERP data collection. For additional information, please refer to [10] and [22], where further neuroscientific insights into the collected ERPs’ components are provided.
The data collection process involved a group of 20 volunteers (13 females and 7 males) with an average age of 23.9±3.34 years. All participants were right-handed according to the Edinburgh Inventory Questionnaire [25]. Furthermore, they had normal or corrected-to-normal vision and hearing and did not experience deficits in language comprehension, reading, or spelling. None of the volunteers had previously been diagnosed with a psychological or psychiatric disorder or drug abuse. The experimental protocol adhered to the Helsinki Declaration of 1964 and was approved by the Ethics Committee of Bicocca University (protocol number RM-432). Each participant provided written informed consent before participating in the data collection process.
Each participant was given instructions regarding the standardized experimental protocol. Firstly, a high-density electrode cap was carefully applied to the volunteer’s scalp. Then the participant was instructed to wear Sennheiser electronic GmbH headphones and sit inside an anechoic and faradized cabinet, located 114cm away from an HR VGA color monitor positioned outside. All the volunteers were specifically instructed to remain still and avoid any eye or body movements while focusing their gaze on a central point displayed on the screen. The stimuli were organized into 12 runs, consisting of 8 runs with visual stimuli and 4 runs with auditory stimuli. Each visual run lasted for 3 minutes, while each auditory run lasted for 2 minutes and 30 seconds. The allocation of stimuli to the runs was randomized within their respective sensory domains. In total, there were 40 stimuli instances for each category, encompassing 10 categories, with 7 visual and 3 auditory categories. Consequently, there were a total of 280 visual stimuli and 120 auditory stimuli. Although all subjects perceived the same stimuli, the order of presentation varied between participants. The visual stimuli consisted of images measuring
During the experiment, EEG data was continuously recorded by a cap embedding 126 electrodes sampling at 512Hz and placed according to the 10/5 system by Oostenveld and Praamstra [27]. The electrodes’ impedance was kept below
Perception vs Imagery: CPz Trend. This Figure refers to Subject 1 and shows the trends measured by CPz channel for all the semantic categories of stimuli, comparing perception and imagery ERPs. Please notice that, to ease the comparison, the plotted trends have been subjected to baseline correction.
Proposed Method
This Section details MIRACLE’s pipeline, tailored to recognize the semantic category of the stimulus related to the measured ERPs, and presents the performance metrics used for its evaluation. A comprehensive set of metrics has been considered, to provide a reliable understanding of MIRACLE’s performance and enable comparison with other methods in the literature.
A. Pre-Processing
The first step of the MIRACLE classification pipeline is pre-processing. It consists of two stages: baseline correction and frame selection. As grand averaged ERPs are affected by noise, baseline correction is essential. This step removes the resting-state activity from the signal and guarantees that any measured voltage changes are due to the stimulus rather than to ongoing brain processes [29]. For each grand averaged ERP, the mean voltage recorded before the onset of the stimulus is subtracted channel by channel. After this procedure, the unnecessary channels, i.e., the ocular (vEOG, hEOG) and mastoid (M1, M2) ones, are removed.
Then, frame selection is performed to improve the signal-to-noise ratio [29]. For imagery stimuli categories, we consider the 400 to 1000ms range as relevant, while for perception stimuli categories we keep the portion of the grand averaged ERP included in the 100 to 1000ms range. The smaller interval for imagery ERPs is due to the lack of the P300 component: as reported in Section I, this component is related to attention processes, so it is not typically observed in potentials that are not evoked by an external stimulation [30].
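The two pre-processing stages can be sketched as follows. The sampling rate (512Hz) matches the acquisition setup; the 200ms pre-stimulus baseline length is an illustrative assumption, not a value stated in the text.

```python
import numpy as np

def preprocess(erp, fs=512, pre_ms=200, lo_ms=100, hi_ms=1000):
    """Baseline-correct and frame-select one grand averaged ERP.

    erp: array of shape (channels, samples), whose first `pre_ms`
    milliseconds precede stimulus onset (pre_ms is an assumption).
    lo_ms/hi_ms: frame bounds relative to onset (100-1000ms for
    perception; 400-1000ms would be used for imagery).
    """
    onset = int(pre_ms * fs / 1000)                 # sample index of stimulus onset
    baseline = erp[:, :onset].mean(axis=1, keepdims=True)
    corrected = erp - baseline                      # remove resting-state activity
    lo = onset + int(lo_ms * fs / 1000)
    hi = onset + int(hi_ms * fs / 1000)
    return corrected[:, lo:hi]                      # keep only the relevant frame
```

Removing the ocular and mastoid channels then amounts to dropping the corresponding rows before this step.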
B. Windowing
Windowing is a technique widely used in machine learning to enhance model training and provide robustness to time shifts in data collection. It involves dividing the pre-processed signals into smaller segments of fixed length by means of a window that slides by a specific step, referred to as the slope. Windowing is also essential when designing an online BCI to enable real-time ERP classification. Indeed, the window and the slope determine the time required to get the first prediction and the update rate, respectively. It follows that the window size is a crucial parameter that must be properly fine-tuned. Wider windows correspond to fewer instances and a longer wait for the first prediction, while smaller windows may not accurately represent the grand averaged ERP behavior, leading to inconsistent results. In the design of MIRACLE, we tuned the window size according to an a posteriori approach, which relies on the performance of the classifier in recognizing the classes from the extracted features while varying the window size, and evaluated sizes ranging from 100ms to 600ms. For each evaluated window size, we extracted the corresponding set of features, using 75% of the data to train the classifier and the remaining 25% for testing purposes. The train-test split was performed according to a stratified split. To evaluate the classifier’s performance in recognizing the semantic category of the stimuli, we computed the F1-Score using k-fold cross-validation. Fine-tuning of the classifier was conducted separately for the perception and imagery datasets. In both cases, we determined the optimal window size to be 500ms, while setting the slope to 1s.
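As an illustration, the windowing step with the tuned values (500ms window, 1s slope) can be written as a simple slicing routine; the function itself is a sketch rather than the system's actual implementation.

```python
import numpy as np

def sliding_windows(x, fs=512, win_ms=500, slope_ms=1000):
    """Split a (channels, samples) signal into fixed-length windows.

    win_ms is the window size and slope_ms the step by which the
    window slides (the "slope"), both set to the tuned values.
    """
    w = int(win_ms * fs / 1000)        # window length in samples
    s = int(slope_ms * fs / 1000)      # slide step in samples
    return [x[:, i:i + w] for i in range(0, x.shape[1] - w + 1, s)]
```

With a 1s slope and a 500ms window, consecutive windows do not overlap; shrinking the slope would increase the prediction update rate at the cost of more correlated instances.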
C. Features Extraction
After the windowing stage, the perception and imagery datasets consist of two collections of grand averaged ERPs’ windows. In more detail, they are composed of a set of 6600 and 9300 windows 500ms long, respectively. It follows that each window collects the samples measured in the window time frame by the 122 channels. Formally, we can define the perception and the imagery datasets as \begin{equation*} \chi = \{ERP_{1,1,1}, {\dots }, ERP_{i,j,w}, {\dots }, ERP_{Su,St,W}\},\end{equation*}
\begin{align*} ERP_{i,j,w} &= [ERP_{i,j,w}(1,a), {\dots }, ERP_{i,j,w}(k,t), \\ &\qquad \qquad {\dots }, ERP_{i,j,w}(K,b)]\end{align*}
where i, j, and w index the subject, the stimulus category, and the window, respectively, ERP_{i,j,w}(k,t) is the sample measured by channel k at time t, K is the number of channels, and [a, b] is the window time frame.
The large number of channels can cause a decrease in classifier performance due to the curse of dimensionality problem [31]. This occurs because as more features are added, the dimension of the space in which the instances are represented enlarges, making it more challenging for the classifier to learn a robust decision function without overfitting. One common technique used to address this problem is Principal Component Analysis (PCA), which reduces the instances dimensionality while preserving most of the information [32]. However, PCA does not account for the temporal dependency of grand averaged ERPs, for which the shape of the curves is a critical factor that cannot be neglected in their classification. To overcome this issue, we decided to represent the data using FDA, which allows us to use fPCA. fPCA is an extension of PCA that accounts for the time dimension and is specifically designed to handle data where observations can be represented as functions, such as EEG time-series [33].
Therefore, we transform our data resorting to FDA techniques. According to this representation, each window is composed of a set of 122 functions fitted to the discrete samples provided by each EEG channel. To fit the functions, B-splines are used, which are piece-wise polynomial functions defined over a sequence of knots that partition the function domain in intervals. Within each interval, the B-spline is a polynomial of a fixed degree. The degree of smoothness is determined by the number and location of the knots, as well as the order of the B-splines. B-splines were preferred over Fourier or other options because they allow for flexible modeling of complex functions. Additionally, the use of B-splines can help to maintain important features of the original data, such as peaks and troughs, while removing noise [34].
In MIRACLE, the EEG channels’ functions are approximated using 5 knots, as \begin{equation*} \widehat {ERP}_{i,j,w}(k,t) = \sum _{n=1}^{m+2r-1} {\omega _{n} B_{n,r}(k, \tau )} \tag{1}\end{equation*}
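As an illustration of the fitting step, a least-squares cubic B-spline with 5 interior knots can be computed with SciPy. The toy signal and the equally spaced knot positions below are illustrative assumptions; the text does not specify where the knots are placed.

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Least-squares cubic B-spline fit of one channel's samples in a
# 500ms window, with 5 interior knots as in the text (positions
# here are illustrative: equally spaced inside the window).
fs = 512
t = np.linspace(0.0, 0.5, int(0.5 * fs), endpoint=False)  # one 500ms window
y = np.sin(2 * np.pi * 2 * t)                             # toy channel trace
knots = np.linspace(0.1, 0.4, 5)                          # 5 interior knots
tck = splrep(t, y, t=knots, k=3)                          # cubic B-spline fit
y_hat = splev(t, tck)                                     # smoothed function
```

Evaluating the spline on the original sample grid (`y_hat`) shows how the fit preserves the curve's peaks and troughs while smoothing sample-level noise.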
fPCA Computation. Each 500ms window is composed of the samples collected for the 122 channels in that time range. To compute fPCA, the first step is to estimate from the samples of each channel the respective function. Then, fPCA computes the functional principal components that best explain the variance in the window. As in each window the first component explains more than 95% of the variance, it is the only one considered.
As previously mentioned, FDA representation enables us to leverage fPCA to reduce instances dimensionality, while preserving temporal information [35]. In fPCA, the objective is to decompose the data into a set of orthogonal functions known as principal components, which capture the main sources of variability in the data. Each functional principal component is associated with a score that represents the amount of explained data variance. Usually, the first few principal components capture the largest sources of variability, so that the others can be neglected. To perform fPCA, we first center and scale the windows to have zero mean and unit variance. Then, we compute the eigenfunctions and eigenvalues of the covariance operator, which describes the variation in the data over time and is defined as \begin{align*} &\hspace {-1pc}Cov[\widehat {ERP}_{i,j,w}(k, t), \widehat {ERP}_{i,j,w}(l, t)] \\ &= \frac {1}{N_{w}-1} \sum _{t=1}^{N_{w}} \big [\left ({\widehat {ERP}_{i,j,w}(k, t) - \mu _{\widehat {ERP}_{i,j,w}(k, t)}}\right ) \\ &\quad \times \left ({\widehat {ERP}_{i,j,w}(l, t) - \mu _{\widehat {ERP}_{i,j,w}(l, t)}}\right )\big ],\end{align*}
\begin{align*} &\hspace {-1.5pc}\int _{a}^{b} Cov[{\widehat {ERP}}_{i,j,w}(k, t), I({\widehat {ERP}}_{i,j,w}(l, t))] dt \\ &= \lambda \int _{a}^{b} I({\widehat {ERP}}_{i,j,w}(k, t))I({\widehat {ERP}}_{i,j,w}(l, t)) dt,\end{align*}
\begin{equation*} I(t) = \sum _{p=1}^{\infty} \sqrt {\lambda _{p}} \phi _{p}(t)\xi _{p}, \tag{2}\end{equation*}
The first functional principal component captures most of the variance (over 95%) in both perception and imagery datasets; therefore this is the only one retained. It follows that fPCA reduces each window dimensionality from 122 channels to 1 functional principal component. To highlight its variations within the window, we also calculate the first derivative of the component. Finally, we extract statistical features, i.e., mean, standard deviation, minimum, and maximum from both the component and its first derivative. By extracting these 8 features for each window, we can provide a concise summary of the available information to the machine-learning classifier, guaranteeing robust training performance.
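A discretized stand-in for this reduction can be sketched with plain NumPy: channels are standardized, the leading eigenvector of the channel covariance gives the projection weights, and projecting yields one component curve over time, from which the 8 statistical features are extracted. This operates on the raw sampled curves rather than their B-spline representations, so it is an approximation of the fPCA pipeline, not the method itself.

```python
import numpy as np

def first_fpc_features(window):
    """Reduce one (channels, samples) window to 8 statistical features.

    Discretized analogue of the fPCA step: the first principal
    component of the channel covariance gives one curve over time;
    mean, std, min, and max of that curve and of its first
    derivative form the feature vector.
    """
    X = window.T                                        # (samples, channels)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # center and scale
    cov = np.cov(X, rowvar=False)                       # channel covariance
    eigval, eigvec = np.linalg.eigh(cov)                # ascending eigenvalues
    pc1 = X @ eigvec[:, -1]                             # first component curve
    d1 = np.gradient(pc1)                               # its first derivative
    feats = []
    for curve in (pc1, d1):
        feats += [curve.mean(), curve.std(), curve.min(), curve.max()]
    return np.array(feats)                              # 8 features per window
```

The explained-variance ratio of the retained component can be read off as `eigval[-1] / eigval.sum()`, mirroring the 95% check described above.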
D. Classifier Learning and Evaluation
After the feature extraction process, each window of the grand averaged ERPs is represented by eight features: the mean, standard deviation, minimum, and maximum of the first functional principal component and of its derivative. These instances are presented to the hierarchical classifier for training and evaluation purposes. The evaluation procedure was first conducted considering k-fold cross-validation with a stratified split, where 25% of the data is used for testing purposes. This enables us to assess the model’s predictive capabilities when presented with grand averaged ERPs belonging to the same subjects considered in training. Additionally, we investigated the model’s performance on new users by resorting to leave-one-out validation, where the classifier is trained on all the instances in the dataset except for those belonging to one user, which constitute the test set. We repeat this procedure for all subjects, and the average evaluation metrics are considered.
To design the hierarchical classification structure, stimuli categories are grouped into fictional macro-categories according to a semantic perspective to create a tree structure, whose architecture is reported in Figure 3. Grouping similar categories together improves model interpretability and, depending on the requirements of the end user, allows for high-level predictions while maintaining a correct semantic. Each binary classification is performed by a k-Nearest Neighbours (k-NN) machine-learning model. k-NN identifies the k nearest neighbors to a new data point in the training set and classifies it based on the most commonly represented class among the neighbors [36]. One of the advantages of k-NN is that it is a non-parametric algorithm that makes no assumptions about the underlying data distribution. The key parameter in this model is the number of neighbors considered, k, which we fine-tuned and set to 5. Each k-NN in the tree is trained and evaluated individually, and the overall architecture’s performance is also estimated.
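A minimal sketch of the hierarchical scheme is given below, with a single macro-category split for brevity (the full tree in Figure 3 has more levels). The macro-category membership of each class (`groups`) is assumed given, mirroring the semantic grouping described above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

class HierarchicalKNN:
    """Two-level sketch of the hierarchical classifier: a root k-NN
    separates two macro-categories, then a per-branch k-NN picks
    the final class. Not the full architecture of Figure 3."""

    def __init__(self, groups, k=5):
        self.groups = groups                 # class label -> macro-category (0 or 1)
        self.root = KNeighborsClassifier(n_neighbors=k)
        self.leaves = [KNeighborsClassifier(n_neighbors=k) for _ in range(2)]

    def fit(self, X, y):
        g = np.array([self.groups[c] for c in y])
        self.root.fit(X, g)                  # root learns the macro-split
        for b in (0, 1):                     # each leaf learns its branch only
            self.leaves[b].fit(X[g == b], y[g == b])
        return self

    def predict(self, X):
        g = self.root.predict(X)             # route samples down the tree
        out = np.empty(len(X), dtype=object)
        for b in (0, 1):
            if (g == b).any():
                out[g == b] = self.leaves[b].predict(X[g == b])
        return out
```

Note that, as in the text, a routing error at the root hands a leaf classifier samples from classes it was never trained on, which is why the per-node and whole-architecture evaluations differ.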
Hierarchical Classifier Architecture. This Figure shows the hierarchical classification architecture designed to identify the grand averaged ERPs. Each binary split is performed by a machine-learning model, ad-hoc trained to distinguish the two classes.
The F1-Score is considered as the main evaluation metric, due to its ability to provide a balanced view of performance, unlike accuracy, which can be misleading in datasets with unbalanced class distribution. It is defined as the harmonic mean of precision and recall.
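As a worked example of the definition (the counts below are arbitrary, not taken from the experiments):

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall,
    computed from true positives, false positives, and
    false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 2/3, F1 = 8/11 ≈ 0.727
```

Because the harmonic mean is dominated by the smaller of the two terms, a classifier cannot reach a high F1-Score by trading recall for precision or vice versa.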
Experimental Results: Evaluation and Discussion
To evaluate MIRACLE’s ability to recognize new grand averaged ERPs belonging to the same subjects as those in the training set, we utilized 10-fold cross-validation. This approach is widely adopted in machine learning as it ensures reliable and consistent estimation of the classifier’s performance [37]. In each of the 10 iterations, the dataset was partitioned into two subsets: a training set containing 75% of the instances and a test set including the remaining 25%. Stratified sampling was used to ensure a similar class distribution in the training and test sets, which is crucial in the case of imbalanced datasets. The performance metrics were computed as the average assessed on the test set at each iteration. We evaluated the classifiers both individually and within the hierarchical architecture. In the former case, we evaluated each classifier’s ability to distinguish between the classes on which it was trained. In the latter, we evaluated the performance of the entire hierarchical architecture, where each binary classifier predicted all the samples provided by the previous nodes. As a result, in case of misclassifications, the samples provided to a binary classifier may not belong to any of the classes on which it was trained.
We applied the same procedure to both perception and imagery datasets. Tables II and III present the precision, recall, and resulting F1-Score of each binary k-NN classifier on the perception and imagery datasets, respectively. The classifiers achieved an average F1-Score of 94.26% and 97.34%, with both datasets performing best for written and spoken words. The most frequently misclassified stimulus was the infant face. Upon further investigation, it was discovered that 7.63% and 1.89% of the grand averaged ERP windows associated with the infant face were incorrectly attributed to the adult face during the split, considering the perception and imagery datasets, respectively. Clearly, the reaction process to the perception or imagination of faces shares commonalities, so this specific error does not pose particular harm to the overall performance of the system.
Then, we evaluated the performance of the binary k-NN classifiers within the hierarchical structure. To this extent, we provided all instances in the dataset to the first binary classifier, which aimed to distinguish between visual and auditory stimuli categories. The instances were then passed on to the following binary classifiers according to the predicted category, until a leaf of the hierarchical structure is reached. Tables IV and V show the resulting precision, recall, and F1-Scores for the perception and imagery datasets, respectively. The F1-Score reached 91.10% and 94.13% for the perception and imagery datasets, respectively. The infant face was the least recognized class, with a higher percentage of its wrongly predicted windows attributed to the animal stimulus. This could be due to the baby schema, i.e., some animals sharing physical characteristics with babies, as suggested by ethologist Konrad Lorenz’s research in 1943, eliciting similar responses in humans [38], [39].
The purpose of leave-one-out validation is to assess the performance of a BCI system at classifying ERP windows belonging to a new subject. Although high leave-one-out validation performance would be desirable, as it would guarantee that the system is effective on new patients without requiring classifier re-training, it is well known in the literature that differences in human physiology and brain patterns cause BCI illiteracy [40]. To evaluate the performance of MIRACLE on new users, we trained the hierarchical architecture on all but one user and repeated the procedure for all subjects, considering the average performance. The results for the perception and imagery datasets at each split level are reported in Tables VI and VII, which also provide a comparison with k-fold cross-validation outcomes. Although MIRACLE achieves a high F1-Score even on new users, in both perception and imagery datasets, the classifier performs better on the subjects considered in the training phase than on new ones.
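The two evaluation protocols can be sketched with scikit-learn on toy data, using subject identifiers as groups so that leave-one-out holds out one subject entirely; the data and dimensions below are illustrative.

```python
import numpy as np
from sklearn.model_selection import (LeaveOneGroupOut, StratifiedKFold,
                                     cross_val_score)
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the feature matrix: 60 windows, 8 features each,
# two stimulus classes, three subjects with windows of both classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = np.repeat([0, 1], 30)
X[y == 1] += 4                                    # make classes separable
subjects = np.tile(np.repeat([0, 1, 2], 10), 2)   # subject id per window

clf = KNeighborsClassifier(n_neighbors=5)
# Same-subject protocol: stratified k-fold over all windows.
kfold = cross_val_score(clf, X, y, cv=StratifiedKFold(5))
# New-subject protocol: each fold holds out one subject's windows.
logo = cross_val_score(clf, X, y, groups=subjects, cv=LeaveOneGroupOut())
```

In real ERP data the gap between `kfold.mean()` and `logo.mean()` reflects the inter-subject variability discussed above; on this separable toy set both are near perfect.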
Additionally, Figure 4 shows the F1-Score obtained by the hierarchical classifier according to leave-one-out validation for each subject, considering the perception and imagery datasets. The results demonstrate inter-subject variability in performance, with some subjects achieving a higher recognition accuracy than others. This finding is consistent with previous studies indicating that similarity in cortical organization or patterns of brain activity can influence BCI performance [41]. Similarly, evidence is reported in literature that individuals who had similar EEG patterns in the training set achieved higher classification accuracy in BCI tasks [42].
Leave-One-Out Validation Results. This Figure shows the average F1-Score assessed by each subject according to leave-one-out validation procedure.
Finally, the trend in predictions over time for Subject 1 was investigated and is reported in Figure 5 for the imagery dataset; the perception dataset shows similar results. The left subplots show the predictions made when the hierarchical classifier was trained on 75% of the instances, equally distributed by stimuli categories and subjects. The right subplots show the predictions made by the hierarchical classifier when trained on the instances of all the subjects except for Subject 1, who constituted the test set. Few classification errors occurred for both perception and imagery stimuli categories according to k-fold cross-validation. Furthermore, by performing majority voting and attributing to the grand averaged ERP the label most frequently assigned to its windows, the performance can be enhanced. The same applies to the hierarchical classifier trained according to leave-one-out validation. However, in this case, more errors occur, mostly in discriminating adult and infant faces or music and emotional vocalization stimuli categories.
Predictions for Subject 1. This Figure compares the stimuli categories predicted for Subject 1’s data by the hierarchical classifier when trained on imagery dataset and evaluated according to k-fold cross-validation (left) and leave-one-out validation (right).
EEG Channels Reduction
It is known in the literature that PCA hinders the interpretation of each input attribute’s contribution to the predictive process, as the principal components are computed as linear combinations of the inputs. fPCA inherits the same interpretability issue. To address this problem, we embed in MIRACLE an approach to estimate the contribution of each channel in determining the first functional principal component. Specifically, for each grand averaged ERP related to the same stimulus, we compute the correlation between each channel and the first functional principal component. Although several formulations exist to estimate correlation, we use Kendall’s rank correlation coefficient (τ).
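A sketch of the channel-importance computation: each channel's samples are correlated with the first functional principal component via Kendall's τ, and channels are ranked by the magnitude of the correlation. The ranking-by-absolute-value convention is an assumption consistent with, but not explicitly stated in, the text.

```python
import numpy as np
from scipy.stats import kendalltau

def channel_importance(window, component):
    """Kendall's tau between each channel of a (channels, samples)
    window and the first functional principal component, plus a
    ranking of channels from most to least important (by |tau|)."""
    taus = np.array([kendalltau(ch, component)[0] for ch in window])
    order = np.argsort(-np.abs(taus))     # most important channel first
    return taus, order
```

Averaging these rankings over all grand averaged ERPs of a stimulus yields the per-stimulus channel importances shown in Figure 6.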
Channels Importance. This Figure shows, for each considered stimulus, the channels that, on average, are most correlated to the first fPCA component, i.e., that are most considered in the classification process.
At this point, the hierarchical classifier has been iteratively re-trained and evaluated according to k-fold cross-validation by removing one channel at a time, from the least to the most important one. Figure 7 reports the F1-Score assessed by the hierarchical classifier on the perception and imagery datasets according to the number of channels considered. In the perception dataset, the most important channel alone provides a 74.32% F1-Score; in the imagery dataset, 80.09%. Moreover, reaching a 90% F1-Score requires 93 electrodes in the perception dataset, while 37 electrodes are enough in the imagery one. According to the performance requirements of the specific application, this analysis provides valuable insights for designing a cap that embeds only the minimum number of important electrodes.
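The ablation can be sketched as follows. For brevity, this toy version adds channels from most to least important (which traces the same curve in reverse as removing them) and reports cross-validated accuracy on a binary task in place of the hierarchical F1-Score.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def ablation_curve(X, y, order):
    """Score as a function of the number of retained channels.

    `order` ranks channels most-important-first (e.g. from the
    Kendall-tau analysis); X has one column per channel feature.
    """
    scores = []
    for n in range(1, len(order) + 1):
        keep = order[:n]                          # n most important channels
        clf = KNeighborsClassifier(n_neighbors=5)
        scores.append(cross_val_score(clf, X[:, keep], y, cv=5).mean())
    return scores
```

Plotting `scores` against `n` reproduces the shape of Figure 7: performance saturates once the informative channels are included, so the remaining electrodes can be dropped with little loss.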
Setup Simplification. This Figure shows the average F1-Score assessed on grand averaged ERPs classification as a function of the number of considered channels.
Concluding Remarks and Outlook
This study presents an FDA and machine learning-based approach that demonstrates the feasibility of developing a passive BCI that recognizes specific semantic categories of stimuli from grand averaged EEG data. Specifically, we recognize imagery stimuli categories, which had not been addressed by previous studies. MIRACLE’s classification performance for both perceived and imagined stimuli categories is remarkable, surpassing the 70% threshold for effective communication. With accuracy rates of 96.37% and 83.11% in k-fold cross-validation and leave-one-out validation, respectively, MIRACLE shows great potential for clinical applications, particularly in the realm of mind-reading and communication with locked-in patients. Moving forward, our future work will focus on refining the hierarchical classification architecture, broadening the scope of stimuli categories, and assessing real-time performance.
NOTE
Open Access provided by 'Politecnico di Milano' within the CRUI CARE Agreement