
MIRACLE: MInd ReAding CLassification Engine


Abstract:

Brain-computer interfaces (BCIs) have revolutionized the way humans interact with machines, particularly for patients with severe motor impairments. EEG-based BCIs have limited functionality due to the restricted pool of stimuli that they can distinguish, while those elaborating event-related potentials have so far employed paradigms that require the patient's perception of the eliciting stimulus. In this work, we propose MIRACLE: a novel BCI system that combines functional data analysis and machine-learning techniques to decode patients' minds from the elicited potentials. MIRACLE relies on a hierarchical ensemble classifier that recognizes 10 different semantic categories of imagined stimuli. We validated MIRACLE on an extensive dataset collected from 20 volunteers, with both imagined and perceived stimuli, to compare the system's performance on the two. Furthermore, we quantify the importance of each EEG channel in the decision-making process of the classifier, which can help reduce the number of electrodes required for data acquisition, enhancing patients' comfort.
Page(s): 3212 - 3222
Date of Publication: 03 August 2023

PubMed ID: 37535483

License: CC BY 4.0. IEEE is not the copyright holder of this material; see https://creativecommons.org/licenses/by/4.0/.
SECTION I.

Introduction

Brain Computer Interfaces (BCIs) are intended to create a direct connection between the human brain and computerized devices, enabling individuals to operate such devices without peripheral muscle involvement [1]. Several BCIs have been proposed over the years, leveraging information provided by brain electrical activity. Depending on the specific application, brain activity can be acquired either invasively through electrocorticogram (ECoG) or non-invasively through electroencephalogram (EEG) [2], [3]. EEG is commonly preferred because it is non-invasive, but scalp-measured brain electrical activity has a lower signal-to-noise ratio, which poses significant challenges to the development of effective BCIs [4]. Despite their effectiveness, BCIs that rely on Steady-State Visual Evoked Potentials (SSVEP) and P300 (a positive deflection of electric brain activity that occurs 300ms after stimulus recognition [5], indicating the presence of conscious mental attention processes [6]) provide a reduced pool of available functionalities [7]. This is because P300 remains the same regardless of the stimulus that elicits it, while SSVEP is limited by the resolution of the system in detecting depolarization changes due to the different flashing frequencies [8]. To address these limitations, researchers have explored the use of Event-Related Potentials (ERPs), which offer a wider range of evoked responses that can be associated with different actions [9]. ERPs are time-locked responses to stimuli, and their shape depends on the category of the stimulus presented.

To exploit the potential of ERPs, neuroscientists have extensively studied their morphology to identify reliable and characteristic markers. In more detail, the intent is to leverage domain knowledge to define a set of rules that an expert BCI system can use to recognize different ERPs and perform the corresponding action. Additionally, these studies can provide insight into the cognitive processes underlying perception and a better understanding of this phenomenon. For example, in a recent work [10], researchers analyzed ERPs generated by 10 stimuli categories from both visual and auditory domains to extract reliable markers of the perception process. By interpreting the spatio-temporal coordinates of the brain activity, they identified relevant voltage peaks and statistically investigated their relationship to the specific stimulus category. Despite the interpretability benefits provided by an expert BCI system, the definition of hard rules can be challenging due to the complexity of EEG data. Indeed, EEG typically contains multiple channels, which refer to different brain regions. Moreover, the morphology and amplitude of brain waves vary across individuals, requiring a time-consuming fine-tuning procedure for each new user. To address these challenges, research has been conducted to leverage machine- and deep-learning techniques to automatically identify ERPs, with promising results [11]. The advantage of these techniques is that they do not need any prior knowledge of the investigated domain. A significant machine-learning approach is presented in [12] and [13], where ERPs belonging to 14 different semantic categories of stimuli are recognized. The proposed approach largely exceeded the accuracy threshold of 70% for each category of stimulus, which is considered the minimum requirement to guarantee meaningful BCI communication [14].

In addition to perceived stimuli, the use of imagined stimuli in BCIs has also been explored for patients who are unable to produce observable responses, such as those in a coma or severe locked-in state. Indeed, mind-reading applications would considerably expand the BCI potential in the clinical field. Studies have demonstrated the potential of motor imagery-based BCI systems in enabling communication and control of external devices. For instance, in [15] a BCI system based on motor imagery was developed to allow patients to control robots. Similarly, in [16], a motor imagery-based BCI system was designed to control a wheelchair. These studies highlight the potential of motor imagery-based BCI systems in improving the quality of life of individuals with severe motor disabilities. Despite being effective at moving objects, motor imagery BCIs still provide a reduced pool of additional functionalities. Therefore, researchers have investigated ERP-based BCIs using imagined stimuli as an alternative that provides a wider set of possible actions for the patients. These works have been inspired by evidence from recent studies that found an overlap in the neural processing of perceived and imagined stimuli [17]. In more detail, studies demonstrated that similar brain regions activate in response to the same stimulus, whether it is perceived or imagined [18]. Furthermore, multi-voxel analyses revealed that similar sensory visual features can be identified in both perceived and imagined stimuli, with additional activity in the anterior fronto-temporal region during imagery tasks due to attention and memory processes [19]. For auditory stimuli, the secondary auditory cortex activates similarly for both perceived and imagined stimuli, while the primary auditory cortex tends to be more active in response to perceived stimuli [20]. However, recent findings have provided evidence of an activation of the Heschl gyri (A1) during music imagery [21]. To further explore these findings, a recent study identified reliable markers in imagery ERPs to develop a robust set of hard rules for implementing a mind-reading BCI system [22]. Despite the promising results obtained by neuroscientists in identifying a reliable set of markers to distinguish imagery ERPs, few approaches have been proposed that resort to machine- and deep-learning techniques. Prior works, such as those presented in [23] and [24], leverage deep-learning techniques to distinguish between ERPs elicited by the imagery of different stimuli. Specifically, the former focused on distinguishing between ERPs related to the imagery of homes and human faces, achieving an accuracy of 68%. The latter combined convolutional neural networks and genetic algorithms to distinguish between ERPs associated with the imagery of dogs, airplanes, and houses, achieving 60% accuracy. However, this level of performance is not sufficient to create a reliable BCI system.

In this study, we introduce a novel machine learning-based BCI system that learns characteristic patterns to accurately classify ERPs associated with different semantic categories of stimuli. To assess the performance of the approach, we conducted an extensive experimental campaign involving 20 volunteers, who were presented with 40 stimuli belonging to 10 semantic categories and asked to both perceive and imagine them. The proposed system demonstrates very high accuracy, largely exceeding the 70% threshold for both perception and imagery ERPs. To reduce the number of electrodes needed, we also investigate how the number of considered channels affects the classifier's performance. Our findings indicate that several electrodes can be neglected, improving patient comfort. These promising results lay the groundwork for the development of effective mind-reading BCI systems relying on stimuli imagery only. MIRACLE offers three main contributions over the existing literature. To the best of the authors' knowledge, it represents the first attempt to use Functional Data Analysis (FDA) for the ERP recognition problem. This tool is crucial to retain the time information and reduce data dimensionality by means of Functional Principal Component Analysis (fPCA). Secondly, we propose a hierarchical classifier architecture that simplifies the multi-category stimuli classification problem into a series of binary classifications, thus enhancing the stimuli recognition performance and exceeding the results obtained so far for mind reading, over a wider set of stimuli. Last, we contribute to BCI technology by presenting an approach to quantify the importance of each channel in the decision-making process of the hierarchical classifier. This approach calculates the Kendall τ correlation coefficient between each EEG channel and the first principal component identified by fPCA, providing an effective tool to reduce the number of electrodes embedded in the acquisition cap and improve patients' comfort. Furthermore, it enables the automatic identification of the relevant brain areas involved in stimuli classification, which turn out to be consistent with neuroscientific knowledge. To validate MIRACLE, grand averaged ERPs triggered by 10 semantic categories of stimuli, both visual and auditory, were collected during an extensive experimental campaign involving 20 volunteers. Performance is evaluated through k-fold cross-validation, to ensure robustness when considering new ERPs belonging to the same subjects analyzed during training, and leave-one-out validation, to investigate the impact of BCI illiteracy on the hierarchical classifier's accuracy. The results show that the proposed BCI system reaches outstanding performance on both perception and imagery ERPs, vastly exceeding the 70% threshold for effective communication. In summary, the most significant achievement of this work is demonstrating the ability to recognize the imagined stimulus that triggers an ERP, which represents a significant step forward in the field of mind-reading applications and can be effective even in patients who are in a coma or affected by locked-in syndrome.

The rest of the paper is organized as follows: Section II provides insights into the experimental procedure used to collect the ERP data considered in this work. Then, Section III presents the method designed to associate each ERP with the semantic category of the stimulus that elicited it. Section IV discusses the system's performance. Section V explains the approach designed to investigate the EEG channels' importance and presents the obtained results. Finally, Section VI draws concluding remarks and outlines future work.

SECTION II.

Experimental Setup

This Section outlines the experimental methodology designed for ERP data collection. For additional information, please refer to [10] and [22], where further neuroscientific insights into the collected ERPs' components are provided.

The data collection process involved a group of 20 volunteers (13 females and 7 males) with an average age of 23.9±3.34 years. All participants were right-handed according to the Edinburgh Inventory Questionnaire [25]. Furthermore, they had normal or corrected-to-normal vision and hearing and did not experience deficits in language comprehension, reading, or spelling. None of the volunteers had previously been diagnosed with a psychological or psychiatric disorder or drug abuse. The experimental protocol adhered to the Helsinki Declaration of 1964 and was approved by the Ethics Committee of Bicocca University (protocol number RM-432). Each participant provided written informed consent before participating in the data collection process.

Each participant was given instructions regarding the standardized experimental protocol. Firstly, a high-density electrode cap was carefully applied to the volunteer's scalp. Then the participant was instructed to wear Sennheiser electronic GmbH headphones and sit inside an anechoic and faradized cabinet, located 114cm away from an HR VGA color monitor positioned outside. All the volunteers were specifically instructed to remain still and avoid any eye or body movements while focusing their gaze on a central point displayed on the screen. The stimuli were organized into 12 runs, consisting of 8 runs with visual stimuli and 4 runs with auditory stimuli. Each visual run lasted 3 minutes, while each auditory run lasted 2 minutes and 30 seconds. The allocation of stimuli to the runs was randomized within their respective sensory domains. In total, there were 40 stimulus instances for each category, encompassing 10 categories, with 7 visual and 3 auditory categories. Consequently, there were a total of 280 visual stimuli and 120 auditory stimuli. Although all subjects perceived the same stimuli, the order of presentation varied between participants. The visual stimuli consisted of images measuring 18.5 × 13.5 cm, presented at the center of the monitor for 1500ms against a white background. The auditory stimuli consisted of 1500ms recordings played through the headphones from an iPhone 7 and a Huawei P10. The visual stimuli were carefully matched in terms of sensory properties such as luminance, color, and size. Likewise, the human-related visual and auditory stimuli were matched based on perceptual properties like sex and age, while written and spoken words were matched based on linguistic properties. Specifically, a set of 40 common Italian words was selected. The auditory stimuli were normalized and adjusted to intensity levels ranging from 20 to 30 dB, ensuring consistency in intensity and volume. As for the choice of the specific categories of stimuli, detailed in Table I, they represent the 10 most distinctive categories of sensory and perceptual stimuli for which humans display innate and dedicated neural mechanisms, studied in terms of both anatomical localization and timing of activation. When a run started, the first stimulus was presented. The presentation of the stimuli to the volunteers leveraged the Eevoke Software for audiovisual presentation (ANT, Enschede, The Netherlands). Following the removal of the stimulus, a gray screen was displayed for an intra-stimulus interval of 500±100ms. Subsequently, a yellow frame appeared on the screen, signaling to the participant to imagine the picture or sound that had just been presented. After a designated 2000ms duration for performing the imagery task, the yellow frame disappeared, and the participant was provided with an inter-trial interval of 900±100ms before the next stimulus was presented. It is important to note that the decision to allocate 2000ms to the volunteers for imagining each stimulus is supported by evidence in the literature that at least 500ms are required to imagine an alphabet letter, and this interval increases considerably for auditory stimuli [26]. To foster concentration, volunteers were informed beforehand that at the end of the experiment they would be asked to complete a questionnaire related to the presented stimuli.

TABLE I Experimental Campaign: Semantic Categories of Stimuli. This Table Presents the Categories of the Stimuli Employed in the Experiment; 40 Different Stimuli are Collected for Each Category

During the experiment, EEG data was continuously recorded by a cap embedding 126 electrodes sampling at 512Hz and placed according to the 10/5 system by Oostenveld and Praamstra [27]. The electrodes' impedance was kept below 5 kΩ. The electrodes record both EEG and electrooculogram (EOG), and use the linked mastoids (M1, M2) as reference leads. ANT software was used to acquire and clean the data. In detail, it applies a band-pass filter between 0.016 and 30Hz to all the EEG channels, and between 0.016 and 70Hz to the EOG channels. Artifacts caused by eye movements, blinks, or excessive muscle potential were removed using a peak-to-peak amplitude exceeding 50 μV as the criterion, leading to a rejection rate of 5%. Furthermore, the ANT acquisition software performs ERP grand averaging to improve the signal-to-noise ratio of the potential elicited by each stimulus [28]. Specifically, for both the perception and imagery recordings, the 40 responses of each subject associated with stimuli belonging to the same semantic category were first synchronized based on the trigger onset and then averaged. The resulting collections of averaged ERPs for perception and imagery stimuli constitute the datasets produced at the end of the experimental procedure. Both the perception and imagery datasets consisted of 200 ERPs, 10 per subject, one for each category of stimulus. In the perception dataset, each ERP lasted 1500ms, with a 100ms pre-stimulus baseline; in the imagery dataset, each ERP lasted 2000ms, with a 100ms pre-stimulus baseline. An example of the ERPs collected in the perception and imagery datasets is reported in Figure 1, where Subject 1's trends for the CPz channel are presented for all the stimulus categories. It is interesting to note that, as reported in Section I, the P300 component was reduced and delayed in the imagery dataset. This result further suggests that imagery is a weaker and noisier experience, as opposed to the more vivid and detailed perceptual one [22].
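As a minimal illustration of this averaging step, the following sketch (NumPy, with illustrative shapes and names) averages the trigger-aligned trials of one subject and one semantic category:

```python
import numpy as np

def grand_average(epochs: np.ndarray) -> np.ndarray:
    """Average trigger-aligned responses into a grand averaged ERP.

    epochs: (n_trials, n_channels, n_samples), e.g. the 40 responses of
    one subject to stimuli of the same semantic category, already
    synchronized on the trigger onset.
    """
    return epochs.mean(axis=0)  # shape: (n_channels, n_samples)
```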

Fig. 1. Perception vs Imagery: CPz Trend. This Figure refers to Subject 1 and shows the trends measured by the CPz channel for all the semantic categories of stimuli, comparing perception and imagery ERPs. Please notice that, to ease the comparison, the plotted trends have been subjected to baseline correction.

SECTION III.

Proposed Method

This Section details MIRACLE's pipeline, tailored to recognize the semantic category of the stimulus related to the measured ERPs, and presents the performance metrics used for its evaluation. A comprehensive set of metrics has been considered to provide a reliable understanding of MIRACLE's performance and enable comparison with other methods in the literature.

A. Pre-Processing

The first step of the MIRACLE classification pipeline is pre-processing. It consists of two stages: baseline correction and frame selection. As grand averaged ERPs are affected by noise, baseline correction is essential. Indeed, this step aims to remove the resting-state activity from the signal and guarantees that any measured voltage changes are due to the stimulus rather than to ongoing brain processes [29]. For each grand averaged ERP, the mean voltage recorded before the onset of the stimulus is subtracted channel by channel. After this procedure, the unnecessary channels, i.e., the ocular (vEOG, hEOG) and mastoid (M1, M2) channels, are removed.
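A minimal sketch of this channel-wise baseline correction, assuming each grand averaged ERP is stored as a NumPy array of shape (channels, samples) with a 100ms pre-stimulus segment sampled at 512Hz (the function name and defaults are illustrative):

```python
import numpy as np

def baseline_correct(erp: np.ndarray, fs: float = 512.0,
                     baseline_ms: float = 100.0) -> np.ndarray:
    """Subtract each channel's mean pre-stimulus voltage.

    erp: (n_channels, n_samples), where the first `baseline_ms`
    milliseconds precede the stimulus onset.
    """
    n_baseline = int(round(fs * baseline_ms / 1000.0))  # 100 ms -> ~51 samples
    baseline_mean = erp[:, :n_baseline].mean(axis=1, keepdims=True)
    return erp - baseline_mean
```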

Then, frame selection is performed to improve the signal-to-noise ratio [29]. For the imagery stimuli categories, we consider as relevant the 400 to 1000ms range, while for the perception stimuli categories, we maintain the portion of the grand averaged ERP included in the 100 to 1000ms range. The choice of a smaller interval for imagery ERPs is due to the lack of the P300 component. Indeed, as reported in Section I, the P300 is related to attention processes, so it is not typically observed in potentials that are not evoked by an external stimulation [30].

B. Windowing

Windowing is a technique widely used in machine learning to enhance model training and provide robustness to time shifts in data collection. It involves dividing the pre-processed signals into smaller segments of fixed length, leveraging a window that slides by a specific step, referred to as the slope. Windowing is also essential when designing an online BCI to enable real-time ERP classification. Indeed, the window and the slope determine the time required to get the first prediction and the update rate, respectively. It follows that the window size is a crucial parameter that must be properly fine-tuned. Wider windows correspond to a reduced number of instances and a longer wait for the first prediction, while smaller windows may not accurately represent the grand averaged ERP behavior, leading to inconsistent results. In the design of MIRACLE, we tuned the window size according to an a posteriori approach, which relies on the performance of the classifier in recognizing the classes from the extracted features while varying the window size, and evaluated sizes ranging from 100ms to 600ms. For each evaluated window size, we extracted the corresponding set of features, using 75% of the data to train the classifier and the remaining 25% for testing purposes. The train-test split was performed according to a stratified split. To evaluate the classifier's performance in recognizing the semantic category of the stimuli, we computed the F1-Score using k-fold cross-validation. Fine-tuning was conducted separately for the perception and imagery datasets. In both cases, we determined the optimal window size to be 500ms, while setting the slope to 1s.
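A sketch of the windowing step under the same array conventions as above, using the tuned 500ms window and 1s slope (names are illustrative):

```python
import numpy as np

def sliding_windows(signal: np.ndarray, fs: float = 512.0,
                    window_ms: float = 500.0, slope_ms: float = 1000.0):
    """Split a (n_channels, n_samples) signal into fixed-length windows.

    The window slides by `slope_ms` (the "slope"); with a 500 ms window
    and a 1 s slope, consecutive windows do not overlap.
    """
    win = int(round(fs * window_ms / 1000.0))
    step = int(round(fs * slope_ms / 1000.0))
    return [signal[:, s:s + win]
            for s in range(0, signal.shape[1] - win + 1, step)]
```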

C. Features Extraction

After the windowing stage, the perception and imagery datasets consist of two collections of grand averaged ERP windows. In more detail, they are composed of 6600 and 9300 windows of 500ms, respectively. It follows that each window collects the samples measured in the window time frame by the 122 channels. Formally, we can define the perception and imagery datasets as \chi^{pro} and \chi^{fig}, where each \chi is a collection of the windows ERP_{i,j,w} extracted from the grand averaged ERP recorded for the i^{th} subject when triggered by the j^{th} stimulus, i.e.,
\begin{equation*} \chi = \{ERP_{1,1,1}, \dots, ERP_{i,j,w}, \dots, ERP_{Su,St,W}\}, \end{equation*}
where St is the number of stimuli categories in the dataset, i.e., 10, Su the number of volunteers, i.e., 20, and W the overall number of windows extracted from all the grand averaged ERPs. Moreover, each ERP_{i,j,w} contains all the discrete samples collected in the w^{th} window frame, that is
\begin{align*} ERP_{i,j,w} = [ERP_{i,j,w}(1,a), \dots, ERP_{i,j,w}(k,t), \dots, ERP_{i,j,w}(K,b)], \end{align*}
where K is the number of channels, i.e., 122, while a and b are the extremes of the grand averaged ERP window's domain, i.e., ERP_{i,j,w}: [a,b] with b - a = 500ms.

The large number of channels can cause a decrease in classifier performance due to the curse of dimensionality [31]. This occurs because, as more features are added, the dimension of the space in which the instances are represented grows, making it more challenging for the classifier to learn a robust decision function without overfitting. One common technique used to address this problem is Principal Component Analysis (PCA), which reduces the instances' dimensionality while preserving most of the information [32]. However, PCA does not account for the temporal dependency of grand averaged ERPs, for which the shape of the curves is a critical factor that cannot be neglected in their classification. To overcome this issue, we represent the data using FDA, which allows us to use fPCA. fPCA is an extension of PCA that accounts for the time dimension and is specifically designed to handle data whose observations can be represented as functions, such as EEG time series [33].

Therefore, we transform our data resorting to FDA techniques. According to this representation, each window is composed of a set of 122 functions fitted to the discrete samples provided by each EEG channel. To fit the functions, B-splines are used, which are piece-wise polynomial functions defined over a sequence of knots that partition the function domain into intervals. Within each interval, the B-spline is a polynomial of a fixed degree. The degree of smoothness is determined by the number and location of the knots, as well as by the order of the B-splines. B-splines were preferred over Fourier bases or other options because they allow for flexible modeling of complex functions. Additionally, the use of B-splines can help maintain important features of the original data, such as peaks and troughs, while removing noise [34].

In MIRACLE, the EEG channels' functions are approximated using m = 5 knots and polynomials of order r = 2. In more detail, let ERP_{i,j,w}(k,t), t \in [a,b], be the set of N_{w} discrete samples associated with the k^{th} channel of the grand averaged ERP window referred to the i^{th} subject and elicited by the j^{th} stimulus. Let also \tau_{n}, n = 1, 2, \dots, m, be a set of m knots equally spaced in the function domain. The knots divide the window interval [a,b] into m+1 sub-intervals, where a = \tau_{1} \leq \tau_{2} \leq \dots \leq \tau_{m} \leq \tau_{m+1} = b. It follows that the basis smoothing of the grand averaged ERP window's data can be defined by a linear combination of the fitted splines as
\begin{equation*} \widehat{ERP}_{i,j,w}(k,t) = \sum_{n=1}^{m+2r-1} \omega_{n} B_{n,r}(k,\tau) \tag{1} \end{equation*}
where \widehat{ERP}_{i,j,w}(k,t) is the smoothed function, B_{n,r}(k,\tau) is the B-spline of order r associated with the n^{th} knot, and \omega_{n} are the weights attributed to each B-spline in the linear combination, estimated by least squares. As reported in Figure 2, at the end of this step, the window consists of a set of 122 functions.
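The per-channel fit can be sketched with SciPy's least-squares B-spline routine, under the assumption that the paper's order r maps to the polynomial degree passed to SciPy (a dedicated FDA library such as scikit-fda would be the natural choice in a full pipeline):

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

def smooth_channel(t: np.ndarray, y: np.ndarray, m: int = 5, r: int = 2):
    """Least-squares B-spline fit of one channel's window samples.

    t: sample times in [a, b]; y: voltages. m equally spaced interior
    knots partition [a, b] into m + 1 sub-intervals; the boundary knots
    are repeated r + 1 times, as SciPy requires. The resulting basis has
    len(knots) - r - 1 = m + 2r - 1 = 8 functions, matching Eq. (1).
    """
    a, b = t[0], t[-1]
    interior = np.linspace(a, b, m + 2)[1:-1]            # m knots inside (a, b)
    knots = np.r_[[a] * (r + 1), interior, [b] * (r + 1)]
    return make_lsq_spline(t, y, knots, k=r)             # callable smoothed function
```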

Fig. 2. fPCA Computation. Each 500ms window is composed of the samples collected for the 122 channels in that time range. To compute fPCA, the first step is to estimate from the samples of each channel the respective function. Then, fPCA computes the functional principal components that best explain the variance in the window. As in each window the first component explains more than 95% of the variance, it is the only one considered.

As previously mentioned, the FDA representation enables us to leverage fPCA to reduce instance dimensionality while preserving temporal information [35]. In fPCA, the objective is to decompose the data into a set of orthogonal functions known as principal components, which capture the main sources of variability in the data. Each functional principal component is associated with a score that represents the amount of explained data variance. Usually, the first few principal components capture the largest sources of variability, so that the others can be neglected. To perform fPCA, we first center and scale the windows to have zero mean and unit variance. Then, we compute the eigenfunctions and eigenvalues of the covariance operator, which describes the variation in the data over time and is defined as
\begin{align*} Cov[\widehat{ERP}_{i,j,w}(k,t), \widehat{ERP}_{i,j,w}(l,t)] = \frac{1}{N_{w}-1} \sum_{t=1}^{N_{w}} \Big[\big(\widehat{ERP}_{i,j,w}(k,t) - \mu_{\widehat{ERP}_{i,j,w}(k,t)}\big)\big(\widehat{ERP}_{i,j,w}(l,t) - \mu_{\widehat{ERP}_{i,j,w}(l,t)}\big)\Big], \end{align*}
where \widehat{ERP}_{i,j,w}(k,t) represents the k^{th} channel function of the w^{th} window at time t, \mu_{\widehat{ERP}_{i,j,w}(k,t)} is its mean across all time points, and N_{w} is the number of time points for each channel in the window. As we are considering functions, the functional covariance must also be computed from the sample covariance by solving
\begin{align*} \int_{a}^{b} Cov[\widehat{ERP}_{i,j,w}(k,t), I(\widehat{ERP}_{i,j,w}(l,t))]\, dt = \lambda \int_{a}^{b} I(\widehat{ERP}_{i,j,w}(k,t))\, I(\widehat{ERP}_{i,j,w}(l,t))\, dt, \end{align*}
where \lambda is the eigenvalue associated with each eigenfunction and I(\cdot) is the integral operator. The solution to this problem can be expressed as a series expansion in the form
\begin{equation*} I(t) = \sum_{p=1}^{\infty} \sqrt{\lambda_{p}}\, \phi_{p}(t)\, \xi_{p}, \tag{2} \end{equation*}
where \phi_{p}(t) are the eigenfunctions of the covariance operator, \lambda_{p} are the corresponding eigenvalues, and \xi_{p} are the coefficients of the expansion. Then we can project each window onto the fPCA space to obtain a lower-dimensional representation, retaining only the information provided by a reduced set of principal components.
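On a discrete time grid, fPCA reduces to an eigendecomposition of the sample covariance, with the grid spacing acting as a quadrature weight for the L2 inner product. The following NumPy sketch computes the first component for one window, assuming the 122 smoothed channel curves have been evaluated on a common grid (a full implementation would work in the B-spline coefficient space):

```python
import numpy as np

def first_fpc(curves: np.ndarray, dt: float):
    """First functional principal component of a set of curves.

    curves: (n_curves, n_times) channel functions sampled on a common
    grid with spacing dt.
    """
    centered = curves - curves.mean(axis=0)          # center across curves
    cov = centered.T @ centered / (curves.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    phi1 = eigvecs[:, -1] / np.sqrt(dt)              # eigenfunction with unit L2 norm
    scores = centered @ phi1 * dt                    # projection of each curve
    explained = eigvals[-1] / eigvals.sum()          # share of variance explained
    return phi1, scores, explained
```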

The first functional principal component captures most of the variance (over 95%) in both the perception and imagery datasets; therefore, it is the only one retained. It follows that fPCA reduces each window's dimensionality from 122 channels to one functional principal component. To highlight its variations within the window, we also calculate the first derivative of the component. Finally, we extract statistical features, i.e., mean, standard deviation, minimum, and maximum, from both the component and its first derivative. By extracting these 8 features for each window, we provide a concise summary of the available information to the machine-learning classifier, guaranteeing robust training performance.
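A minimal sketch of this feature extraction, assuming the first component is available as a sampled curve:

```python
import numpy as np

def window_features(pc1: np.ndarray, dt: float) -> np.ndarray:
    """Eight summary features of the first functional principal component:
    mean, std, min, max of the component and of its first derivative
    (approximated here with np.gradient)."""
    d_pc1 = np.gradient(pc1, dt)
    stats = lambda x: (x.mean(), x.std(), x.min(), x.max())
    return np.array(stats(pc1) + stats(d_pc1))
```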

D. Classifier Learning and Evaluation

After the feature extraction process, each window of the grand averaged ERPs is represented by eight features: the mean, standard deviation, minimum, and maximum of the first functional principal component and of its derivative. These instances are presented to the hierarchical classifier for training and evaluation purposes. The evaluation procedure was first conducted with k-fold cross-validation and a stratified split, where 25% of the data is used for testing purposes. This enables us to assess the model's predictive capabilities when presented with grand averaged ERPs belonging to the same subjects considered in training. Additionally, we investigated the model's performance on new users by resorting to leave-one-out validation, where the classifier is trained on all the instances in the dataset except for those belonging to one user, which constitute the test set. We repeat this procedure for all subjects and consider the average evaluation metrics.
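Both evaluation protocols can be sketched with scikit-learn, where the leave-one-out scheme corresponds to leaving out one subject, i.e., a leave-one-group-out split (X, y, and groups are synthetic placeholders for the real feature windows, category labels, and subject identifiers):

```python
import numpy as np
from sklearn.model_selection import (StratifiedKFold, LeaveOneGroupOut,
                                     cross_val_score)
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # placeholder 8-feature windows
y = rng.integers(0, 10, size=200)       # placeholder category labels
groups = rng.integers(0, 20, size=200)  # placeholder subject ids

clf = KNeighborsClassifier(n_neighbors=5)

# Stratified k-fold: test windows come from subjects seen in training.
kfold_f1 = cross_val_score(
    clf, X, y, scoring="f1_macro",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

# Leave-one-subject-out: test windows come from an unseen subject.
loo_f1 = cross_val_score(
    clf, X, y, groups=groups, scoring="f1_macro", cv=LeaveOneGroupOut())
```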

To design the hierarchical classification structure, stimuli categories are grouped into fictitious macro-categories from a semantic perspective to create a tree structure, whose architecture is reported in Figure 3. Grouping similar categories together improves model interpretability and, depending on the requirements of the end user, allows for high-level predictions while maintaining semantic correctness. Each binary classification is performed by a k-Nearest Neighbours (k-NN) machine-learning model. k-NN identifies the k nearest neighbors of a new data point in the training set and classifies it based on the most common class among those neighbors [36]. One of the advantages of k-NN is that it is a non-parametric algorithm that makes no assumptions about the underlying data distribution. The key parameter in this model is the number of neighbors considered, k, which we fine-tuned and set to 5. Each k-NN in the tree is trained and evaluated individually, and the overall architecture performance is also estimated.
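A compact sketch of such a tree of binary k-NN classifiers is given below; in the actual system the grouping into macro-categories follows Figure 3, while here the tree shape and class names are left to the caller:

```python
from dataclasses import dataclass
from typing import Union
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

@dataclass
class Node:
    """One binary split of the hierarchical classifier."""
    left_labels: frozenset            # leaf categories routed to the left child
    left: Union["Node", str]          # subtree, or leaf category name
    right: Union["Node", str]
    clf: KNeighborsClassifier = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> "Node":
        # 0 -> left subtree, 1 -> right subtree
        side = np.array([0 if label in self.left_labels else 1 for label in y])
        self.clf = KNeighborsClassifier(n_neighbors=5).fit(X, side)
        for child, mask in ((self.left, side == 0), (self.right, side == 1)):
            if isinstance(child, Node):               # recurse on each subtree
                child.fit(X[mask], y[mask])
        return self

    def predict_one(self, x: np.ndarray) -> str:
        child = self.left if self.clf.predict(x[None])[0] == 0 else self.right
        return child if isinstance(child, str) else child.predict_one(x)
```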

Fig. 3. Hierarchical Classifier Architecture. This Figure shows the hierarchical classification architecture designed to identify the grand averaged ERPs. Each binary split is performed by a machine-learning model, ad-hoc trained to distinguish the two classes.

The F1-Score is considered the main evaluation metric, due to its ability to provide a balanced view of performance, unlike accuracy, which can be misleading in datasets with unbalanced class distribution. It is defined as the harmonic mean of precision and recall.
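Explicitly, denoting precision by P and recall by R,
\begin{equation*} \text{F1} = \frac{2\,P\,R}{P + R}. \end{equation*}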

SECTION IV.

Experimental Results: Evaluation and Discussion

To evaluate MIRACLE's ability to recognize new grand averaged ERPs belonging to the same subjects as those in the training set, we utilized 10-fold cross-validation. This approach is widely adopted in machine learning, as it ensures a reliable and consistent estimation of the classifier's performance [37]. In each of the 10 iterations, the dataset was partitioned into two subsets: a training set containing 75% of the instances and a test set including the remaining 25%. Stratified sampling was used to ensure a similar class distribution in the training and test sets, which is crucial in the case of imbalanced datasets. The performance metrics were computed as the average over the test sets of all iterations. We evaluated the classifiers both individually and within the hierarchical architecture. In the former case, we evaluated each classifier's ability to distinguish between the classes on which it was trained. In the latter, we evaluated the performance of the entire hierarchical architecture, where each binary classifier predicts all the samples passed on by the previous nodes. As a result, in case of misclassifications, the samples provided to a binary classifier may not belong to any of the classes on which it was trained.

We applied the same procedure to both the perception and imagery datasets. Tables II and III present the precision, recall, and resulting F1-Score of each binary k-NN classifier on the perception and imagery datasets, respectively. The classifiers achieved an average F1-Score of 94.26% and 97.34%, respectively, with both datasets performing best for written and spoken words. The most frequently misclassified stimulus was the infant face. Upon further investigation, it was discovered that 7.63% and 1.89% of the grand averaged ERP windows associated with the infant face were incorrectly attributed to the adult face in the perception and imagery datasets, respectively. Clearly, the reactions to the perception or imagination of faces share commonalities, so this specific error does not particularly harm the overall performance of the system.

TABLE II Classification Report for Binary Classifiers: Perception. This Table Reports the Precision, Recall, and F1-Score Assessed by Each Binary Classifier Independently at Predicting the Grand Averaged ERPs in the Perception Dataset. The Support is Equal to 660 Samples for Each Class
TABLE III Classification Report for Binary Classifiers: Imagery. This Table Reports the Precision, Recall, and F1-Score Assessed by Each Binary Classifier at Predicting the Imagery Dataset. The Support is Equal to 930 Samples for Each Class

Then, we evaluated the performance of the binary k-NN classifiers within the hierarchical structure. To this end, we provided all instances in the dataset to the first binary classifier, which distinguishes between visual and auditory stimuli categories. The instances were then passed on to the following binary classifiers according to the predicted category, until a leaf of the hierarchical structure was reached. Tables IV and V show the resulting precision, recall, and F1-Scores for the perception and imagery datasets, respectively. The F1-Score was 91.10% and 94.13% for the perception and imagery datasets, respectively. The infant face was the least recognized class, and a higher percentage of its wrongly predicted windows was attributed to the animal stimulus. This could be due to the baby schema effect: as suggested by ethologist Konrad Lorenz's research in 1943, some animals share physical characteristics with babies, eliciting similar responses in humans [38], [39].

TABLE IV Classification Report for Hierarchical Classifier: Perception. This Table Reports the Precision, Recall, and F1-Score for the Perception Dataset When Predicted by the Hierarchical Classifier. The Support is Equal to 660 Samples for Each Class
TABLE V Classification Report for Hierarchical Classifier: Imagery. This Table Reports the Precision, Recall, and F1-Score for the Imagery Dataset When Predicted by the Hierarchical Classifier. The Support is Equal to 930 Samples for Each Class

The purpose of leave-one-out validation is to assess the performance of a BCI system at classifying ERP windows belonging to a new subject. Although high leave-one-out validation performance would be desirable, as it would guarantee that the system is effective on new patients without requiring classifier re-training, it is well known in the literature that differences in human physiology and brain patterns cause BCI illiteracy [40]. To evaluate the performance of MIRACLE on new users, we trained the hierarchical architecture on all but one user and repeated the procedure for all subjects, considering the average performance. The results for the perception and imagery datasets at each split level are reported in Tables VI and VII, which also provide a comparison with the k-fold cross-validation outcomes. Although MIRACLE achieves a high F1-Score even on new users, in both the perception and imagery datasets the classifier performs better on the subjects considered in the training phase than on new ones.

TABLE VI Classification Performance According to k-Fold Cross-Validation and Leave-One-Out Validation: Perception. This Table Compares the F1-Score Assessed at Recognizing the Perception Stimuli Categories
TABLE VII Classification Performance According to k-Fold Cross-Validation and Leave-One-Out Validation: Imagery. This Table Compares the F1-Score Assessed at Recognizing the Imagery Stimuli Categories

Additionally, Figure 4 shows the F1-Score obtained by the hierarchical classifier according to leave-one-out validation for each subject, considering the perception and imagery datasets. The results demonstrate inter-subject variability in performance, with some subjects achieving a higher recognition accuracy than others. This finding is consistent with previous studies indicating that similarity in cortical organization or patterns of brain activity can influence BCI performance [41]. Similarly, evidence is reported in the literature that individuals with EEG patterns similar to those in the training set achieve higher classification accuracy in BCI tasks [42].

Fig. 4. Leave-One-Out Validation Results. This Figure shows the average F1-Score assessed for each subject according to the leave-one-out validation procedure.

Finally, the trend in predictions over time for Subject 1 was investigated and is reported in Figure 5 for the imagery dataset; the perception dataset shows similar results. The left subplots show the predictions made when the hierarchical classifier was trained on 75% of the instances, equally distributed across stimuli categories and subjects. The right subplots show the predictions made by the hierarchical classifier when trained on the instances of all the subjects except Subject 1, who constituted the test set. Little classification error occurred for both perception and imagery stimuli categories according to k-fold cross-validation. Furthermore, by performing majority voting and attributing to the grand averaged ERP the label most frequently assigned to its windows, the performance can be further enhanced. The same applies to the hierarchical classifier trained according to leave-one-out validation. However, in this case, more errors occur, mostly in discriminating the adult and infant face or the music and emotional vocalization stimuli categories.

Fig. 5. Predictions for Subject 1. This Figure compares the stimuli categories predicted for Subject 1's data by the hierarchical classifier when trained on the imagery dataset and evaluated according to k-fold cross-validation (left) and leave-one-out validation (right).

SECTION V.

EEG Channels Reduction

It is known in the literature that PCA hinders the interpretation of the contribution of each input attribute to the predictive process, as the principal components are computed as linear combinations of the inputs. fPCA suffers from similar interpretability issues. To address this problem, we embed in MIRACLE an approach to estimate the contribution of each channel in determining the first functional principal component. Specifically, for each grand averaged ERP related to the same stimulus, we compute the correlation between each channel and the first functional principal component. Although several formulations exist to estimate correlation, we use the Kendall τ coefficient, as it is non-parametric and more robust to noise [43]. The channels that are most correlated with the first principal component, considering all subjects, are deemed the most relevant, since the first principal component is the one from which the features presented to the classifier are extracted. This procedure is repeated for each grand averaged ERP in the perception and imagery datasets separately.

Figure 6 shows the outcomes of the process, reporting the channels as colored dots, whose size increases with importance. It turns out that, regardless of the specific stimulus, the central, dorsolateral, and centro-parietal brain areas are the most considered by the hierarchical classifier to perform the recognition task. It is also possible to see that the prefrontal cortex is considered more in perception than in imagery stimuli categories, where the centro-parietal area is most involved. It is worth mentioning that the correlation is computed over the whole frame of the grand averaged ERP considered, i.e., between 100ms and 1000ms for the perception dataset and between 400ms and 1000ms for the imagery one. Therefore, the preponderance of frontal and central activity is also due to the cognitive processes related to the perception of the stimulus.

These observations suggest that it is possible to reduce the acquisition setup. To this end, we first averaged the Kendall τ coefficients computed for each ERP. Then, according to the average correlation coefficient, we sorted the channels from the most to the least important. Table VIII reports the 10 most important channels in the perception and imagery datasets. In both datasets, the most important channels include electrodes of the central, dorsolateral prefrontal/frontocentral, and centro-parietal brain areas, e.g., C1, C2, and Cz, CCP1h and CCP2h, FC2, and FFC1h, FFC2h, and FFC4h. These results align with the physiological findings reported in [10] and [22], proving that our approach is able to identify important channels located in brain areas known to be involved in stimulus processing, without relying on prior knowledge. In more detail, many of the channels known to contain relevant markers according to previous studies were reported as important by the presented approach. This is the case of C1, C2, Cz, CPz, FFC1h, FFC2h, AF2, AF3, AF4, AFz, Fz, FPz, P3, and P4. Exceptions are channels in the midline occipital and left occipitotemporal regions, including Oz, Iz, P7, P8, PPO9h, and PPO10h, which are known to be relevant for distinguishing between word, checkerboard, object, human, and animal face stimuli, but were among the least important for the hierarchical classifier. Nevertheless, MIRACLE was still able to accurately distinguish among these stimuli categories, even with the limited contribution of these channels.
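A sketch of the channel-importance computation for one grand averaged ERP, using SciPy's Kendall τ (the absolute value of τ is taken here, an assumption made so that strongly anti-correlated channels also rank as important):

```python
import numpy as np
from scipy.stats import kendalltau

def channel_importance(channels: np.ndarray, pc1: np.ndarray) -> np.ndarray:
    """|Kendall tau| between each channel curve and the first fPC.

    channels: (n_channels, n_times); pc1: (n_times,). Averaging the
    returned vector over all grand averaged ERPs ranks the channels.
    """
    return np.array([abs(kendalltau(ch, pc1)[0]) for ch in channels])
```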

TABLE VIII Channels' Importance. This Table Reports the 10 Channels That, on Average, are Most Important to Recognize the Stimuli Categories Associated With the Grand Averaged ERPs in the Perception and Imagery Datasets
Fig. 6. Channels Importance. This Figure shows, for each considered stimulus, the channels that, on average, are most correlated to the first fPCA component, i.e., that are most considered in the classification process.

At this point, the hierarchical classifier was iteratively re-trained and evaluated according to k-fold cross-validation, removing one channel at a time, from the least to the most important. Figure 7 reports the F1-Score assessed by the hierarchical classifier on the perception and imagery datasets as a function of the number of channels considered. It turns out that, in the perception dataset, the most important channel alone provides a 74.32% F1-Score; in the imagery dataset, 80.09%. Moreover, 93 electrodes are required to reach a 90% F1-Score on the perception dataset, while 37 electrodes are enough to reach the same performance on the imagery one. Depending on the performance requirements of the specific application, this analysis provides valuable insights for designing a cap that embeds only the minimum number of important electrodes.
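The elimination loop can be sketched as follows, where build_features is a hypothetical hook standing in for the full MIRACLE pipeline (smoothing, fPCA, and feature extraction restricted to a channel subset):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def f1_vs_channels(build_features, ranked_channels, y, cv=10):
    """F1-Score as channels are dropped, least important first.

    ranked_channels: channel names sorted from most to least important.
    build_features(subset) must rebuild the 8-feature windows using only
    the given channels (hypothetical hook).
    """
    scores = []
    for n_keep in range(len(ranked_channels), 0, -1):
        X = build_features(ranked_channels[:n_keep])
        clf = KNeighborsClassifier(n_neighbors=5)
        scores.append(cross_val_score(clf, X, y, scoring="f1_macro",
                                      cv=cv).mean())
    return scores
```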

Fig. 7. Setup Simplification. This Figure shows the average F1-Score assessed on grand averaged ERPs classification as a function of the number of considered channels.

SECTION VI.

Concluding Remarks and Outlook

This study presented an FDA- and machine learning-based approach that demonstrates the feasibility of developing a passive BCI that recognizes specific semantic categories of stimuli based on grand averaged EEG data. Specifically, we recognize imagery stimuli categories, which had not been addressed by previous studies. MIRACLE's classification performance for both perceived and imagery stimuli categories is remarkable, surpassing the 70% threshold for effective communication. With accuracy rates of 96.37% and 83.11% in k-fold cross-validation and leave-one-out validation, respectively, MIRACLE shows great potential for clinical applications, particularly in the realm of mind reading and communication with locked-in patients. Moving forward, our future work will focus on refining the hierarchical classification architecture, broadening the scope of stimuli categories, and assessing real-time performance.

NOTE

Open Access provided by 'Politecnico di Milano' within the CRUI CARE Agreement
