Less Is More: Brain Functional Connectivity Empowered Generalizable Intention Classification With Task-Relevant Channel Selection

Electroencephalography (EEG) signals are gaining popularity in Brain-Computer Interface (BCI)-based rehabilitation and neural engineering applications thanks to their portability and availability. Electrodes covering the entire scalp inevitably collect signals irrelevant to the particular BCI task, increasing the risk of overfitting in machine learning-based predictions. While this issue is being addressed by scaling up EEG datasets and handcrafting complex predictive models, doing so also increases computation costs. Moreover, a model trained on one set of subjects cannot easily be adapted to other sets due to inter-subject variability, which creates even higher overfitting risks. Meanwhile, although previous studies use either convolutional neural networks (CNNs) or graph neural networks (GNNs) to determine spatial correlations between brain regions, they fail to capture brain functional connectivity beyond physical proximity. To this end, we propose 1) removing task-irrelevant noise instead of merely complicating models; and 2) extracting subject-invariant discriminative EEG encodings by taking functional connectivity into account. Specifically, we construct a task-adaptive graph representation of the brain network based on topological functional connectivity rather than distance-based connections. Further, non-contributory EEG channels are excluded by selecting only functional regions relevant to the corresponding intention. We empirically show that the proposed approach outperforms the state-of-the-art, with around 1% and 11% improvements over CNN-based and GNN-based models, on motor imagery prediction. Also, the task-adaptive channel selection demonstrates similar predictive performance with only 20% of raw EEG data, suggesting a possible shift in direction for future work beyond simply scaling up the model.


I. INTRODUCTION
Electroencephalography (EEG)-based brain-computer interaction (BCI) systems have enabled a variety of neurological tasks, such as motion intention recognition [1], emotion analysis [2] and brain disease detection [3]. The EEG-based BCI uses non-invasive scalp electrodes to record and further analyze electrical fluctuations that occur as a result of brain activity. A task that is of interest to this study is recognising Motor Imagery (MI), a cognitive process in which subjects imagine moving different parts of their bodies. There have been years of research into algorithms for detecting a particular MI, as well as implications for individuals with disabilities in a range of applications, including brain typing [4], mind-controlled wheelchairs [5] and prosthetic arms [6].

A. NN-Based MI Recognition
Recent MI recognition research takes advantage of deep neural networks (NN) to extract discriminative representations from enormous amounts of data without the need for carefully-designed features. For instance, convolutional neural networks (CNNs) are widely adopted to extract spatial correlations between different sensory channels, whilst recurrent neural networks (RNNs) are promising for capturing temporal dependencies from raw EEG signals. A single CNN architecture consisting of two convolution blocks, dubbed EEGNet, was proposed by [7] for EEG classification across different paradigms. Zhang et al. [4] design a hybrid model using CNN and RNN to extract spatial and temporal features from EEG signals, paired with an Auto-encoder to eliminate artifacts. Zhang et al. [8] additionally transform raw EEG signals into a matrix-like form to explore correlations between physically adjacent sensory channels.

B. Subject-Independent MI Recognition
While achieving success in subject-dependent settings, these models still struggle under subject-independent evaluation, in which training and test data are collected from different subjects. In fact, EEG signals may manifest different patterns even when subjects are performing the same cognitive task. To address this issue, it is necessary either to adapt to new subjects or to model subject-invariant features. Fahimi et al. [9] combine general mental states with a new subject's data for personalized modeling. Chen et al. [10] reduce the discrepancy between two subjects with adversarial training. In [11], the mean covariance matrix of each user is transformed into an identity matrix for the alignment of their EEG trials. However, such methods must be adapted once for each new subject, and negative transfer may also occur in view of noisy EEG signals [12]. Alternatively, another line of work seeks to identify patterns that are universal across many subjects. For instance, Zhang et al. [13] incorporate self-attention [14] into a convolutional-recurrent model to explore concentrated temporal periods.

C. Topological Pattern of Brain Activity
Despite being dominant in learning spatial representations, CNNs are structured on dense and regular "grid"-like inputs, which limits their ability to identify non-Euclidean relationships. EEG channels, however, are non-Euclidean by nature, as are brain regions. Additionally, even with the same acquisition device, signals do not necessarily originate from the same location due to the variance between subjects. Zhang et al. [15], [16] represent EEG nodes as a graph to learn a topology-based positioning relationship, which appears to be less subject-dependent than conventional representations. This has sparked recent interest in using graph neural networks (GNNs) as an alternative to CNNs for obtaining non-Euclidean representations from raw EEG input signals [17]. Although topological features show robustness to inter-subject variability, existing graph-based EEG representations raise three concerns.
1) Channel Redundancy: standard EEG acquisition devices measure brain activity from the entire scalp. A particular cognitive task will, however, not require the activation of all brain regions [18], and different tasks may even activate different regions [19]. It follows that redundancy would be included if all channels are used indiscriminately.
2) Edge Formation: confusion still exists regarding how inter-regional brain connections should be represented as graph edges. The graph representation of certain non-Euclidean structures, such as brain networks, is shown to be more effective than using CNNs [17]. Still, the extensively studied distance-based edge formation [15], [16] does not account for dynamic functional connectivity between different MI intentions.
3) Graph Noise: EEG recordings are bound to contain noise from both external and internal factors. Given GNNs' sensitivity to the quality of the input graph [20], raw EEG signals are therefore likely to result in poor model generalization and over-fitting. This makes it even more challenging to identify patterns that are task-relevant while remaining invariant to subjects, as EEG datasets are practically impossible to scale up as much as those in other domains.
Targeting each of these concerns, our study presents a Subject-Independent MI classification model using brain Functional Connectivity (SIFT-EEG). For channel redundancy, functional connectivity statistically contributes to identifying the brain regions involved when performing a particular cognitive task. For edge formation, graph adjacency is dynamically determined by functional connectivity, which filters out weak associations and identifies task-relevant active brain regions based on their importance in the functional connectivity graph. This is followed by self-attentive temporal convolution to extract discriminative task-adaptive temporal embeddings that mitigate the graph noise contained in raw EEG signals. Finally, we derive robust topological embeddings from the task-adaptive temporal graphs of different subjects using a Graph Isomorphism Network (GIN) that reliably detects equivalent graphs [21].

D. Contributions
• We propose a subject-independent MI prediction model built upon functional topological adjacency, which further takes into account self-attentive temporal convolution and graph isomorphism, thus capturing task-adaptive but subject-invariant EEG embeddings.
• We present a data-driven channel selection algorithm based on active brain regions, which can exclude non-contributory channels and reduce the impact of task-irrelevant noise.
• We evaluate SIFT-EEG on a large-scale EEG-based MI dataset, demonstrating its effectiveness for subject-independent classification, with competitive performance against the state-of-the-art even when using 20% of raw data.

A. Graph Representation for EEG Signals
A graph $G = \langle V, E, W \rangle$ is an abstract structure in non-Euclidean space, composed of nodes $V$, edges $E$, and weights $W$. Thus, the brain network topology can be represented by graphs. The nodes $\{v_i\}_{i=1:n} \in V$ refer to EEG scalp electrodes located on specific brain regions, with a total electrode count of $n$. Each edge $e_{ij} \in E$ represents the inter-regional relation between the electrode pair $(v_i, v_j)$. Accordingly, the pairwise edge weight $w_{ij} \in W$ indicates the strength of each relation.
Graph adjacency $A$ takes the form of a 2D matrix in $\mathbb{R}^{N \times N}$, where $N$ denotes the number of nodes. The graph can be either weighted or unweighted. For unweighted adjacency, $A_{ij} = 1$ represents a pair of connected nodes, while $A_{ij} = 0$ means there is no connection between them. A weighted adjacency additionally encodes the strength of the relationship by setting $A_{ij} = w_{ij}$. The degree $d(v_i)$ measures a node's centrality by the number of edges connecting to it, indicating the node's importance within the graph, and is calculated by adding up the $i$-th row of $A$.
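As a small illustration of these definitions, the sketch below computes node degrees as row sums of a toy unweighted adjacency matrix. The 4-node graph and the helper name `degree` are illustrative assumptions and are not taken from the original study.

```python
def degree(adjacency):
    """Return the degree of every node as the row sums of the adjacency matrix."""
    return [sum(row) for row in adjacency]


# Toy unweighted graph (assumed for illustration): node 0 connects to all others.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
degrees = degree(A)
```

For a weighted adjacency, the same row sum gives a weighted degree in which stronger relations contribute more.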
There is not yet an agreed edge definition for EEG-based graph representations. The literature mostly uses either a complete graph [17], i.e., each pair of nodes is connected, or distance-based connections [15], [16], that is, two nodes are connected if the physical distance between their electrodes is lower than a predefined threshold.
[Figure caption: Overview of the Subject-Independent MI classification model using brain Functional Connectivity (SIFT-EEG). We first crop the raw EEG signal into a sequence of time slices using the sliding window technique; adopt a CNN to extract temporal features and a self-attentive module to search for the most discriminative temporal slice; then combine nodes in the temporal embedding with functional connectivity to generate the graph representation of EEG signals; use three layers of Graph Isomorphism Network (GIN) to extract topological features; and lastly, classify the extracted topological embedding into different motion intentions using a fully connected network with a softmax activation function.]

B. Graph Neural Networks
Graph Neural Networks (GNNs) are a family of generalized neural networks excelling at analyzing graph-structured data.
Graph Convolutional Network (GCN) [22] simplifies spectral graph convolutions into the framework of neural networks. It defines a GCN operator that exchanges node-wise information through edge connections, along with a layer-wise propagation rule that updates hidden node features. Throughout the $K$-layer propagation process, each node receives the averaged features of its neighbors via mean aggregation, and the graph structure remains the same at the next layer:
$$h_v^{(k+1)} = \sigma\left(W \cdot \mathrm{MEAN}\left\{h_u^{(k)},\ \forall u \in \mathcal{N}(v) \cup \{v\}\right\}\right)$$
where $h_v^{(k+1)}$ is node $v$'s hidden feature produced by the $k$-th GCN layer, $\mathcal{N}(v)$ is the set of neighbors of $v$, $W$ are model parameters and $\sigma$ is a non-linear activation. A READOUT(·) function is further applied after propagating the last layer to extract the graph-level embedding,
$$h_G = \mathrm{READOUT}\left(\left\{h_v^{(K)} \mid v \in V\right\}\right).$$
GraphSAGE [23] interprets GNNs from a spatial perspective and generalizes the mean aggregator in GCN to a wider range of operators,
$$h_v^{(k+1)} = \sigma\left(W \cdot \left[h_v^{(k)},\ \mathrm{AGG}\left(\left\{h_u^{(k)},\ \forall u \in \mathcal{N}(v)\right\}\right)\right]\right)$$
where $W$ are model parameters and AGG(·) refers to a permutation-invariant function such as min/max/mean pooling. The recent Graph Isomorphism Network (GIN) [21] is derived from the Weisfeiler-Lehman (WL) isomorphism test, which checks whether graphs are topologically identical. GIN is proven to be as powerful as the WL test when AGG(·) and READOUT(·) are permutation invariant and injective, and it produces more discriminative embeddings than other GNN variants in graph-level classification.
Assume a function $f: \mathcal{X} \rightarrow \mathbb{R}^n$ for any countable node feature space $\mathcal{X}$. Then for infinitely many choices of $\epsilon$, including all irrational numbers, $h(c, X) = (1 + \epsilon) \cdot f(c) + \sum_{x \in X} f(x)$ is unique for each pair $(c, X)$, such that any function $g$ can be decomposed into
$$g(c, X) = \varphi\left((1 + \epsilon) \cdot f(c) + \sum_{x \in X} f(x)\right)$$
for some function $\varphi$, where $c \in \mathcal{X}$ and $X \subset \mathcal{X}$. A multilayer perceptron (MLP) with more than one hidden layer can be used to approximate an injective function [24], so the node embedding $h_v$ becomes distinguishable, using the GINConv operator defined by
$$h_v^{(k+1)} = \mathrm{MLP}^{(k)}\left(\left(1 + \epsilon^{(k)}\right) \cdot h_v^{(k)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k)}\right)$$
where the AGG(·) of neighbor nodes is implemented as a summation to impose an injective mapping.

III. METHODOLOGY
A. Overview
We now describe the goal of EEG-based Motor Imagery (MI) classification and overview the key steps of our approach. Given an EEG segment $X \in \mathbb{R}^{N \times K}$ collected while a subject was performing an MI task, we aim to estimate the associated intention $Y$ by training a predictive model that performs supervised classification under the subject-independent setting, meaning that model training and evaluation are conducted on two disjoint groups of subjects. $K = T \times f$ is the number of time points within a segment, $N$ is the number of electrodes, $T$ is the recording duration, and $f$ is the sampling frequency.
Let $X$ be an EEG segment and $Y$ be the intention. Our approach approximates the mapping $f: X \rightarrow Y$ with a neural network, $\hat{Y} = f(X)$, in the following five steps:
1) Calculating the adjacency $A \in \mathbb{R}^{N \times N}$ across the readings of all sensory channels $X$ based on functional connectivity;
2) Selecting the top-$N'$ strongest channels $X' \in \mathbb{R}^{N' \times K}$ tailored to the MI task and identifying task-relevant edges $E' \in \mathbb{R}^{N' \times N'}$;
3) Extracting the temporal node embedding $V \in \mathbb{R}^{N' \times \omega}$ by summarizing the features of each temporal slice $S_m \in \mathbb{R}^{N' \times \omega}$, where $S = \{S_m\}_{m=1}^{M}$ results from a sliding window applied to the raw data of the task-relevant channels $X'$;
4) Generating the topological graph embedding $H \in \mathbb{R}^{h}$ from the brain functional network $G = \langle V, E' \rangle$ constructed from the temporal node embeddings and task-relevant edges;
5) Predicting the intention $\hat{Y} = f(X)$ of an EEG segment from the graph-level topological embedding.

B. Functional Adjacency Matrix
We first calculate the adjacency of an EEG-based brain network with functional connectivity. We prefer this approach to previous distance-based or complete graph connections, which either ignore topological relationships or become computationally intensive when many nodes are involved. Functional connectivity defines the statistical dependencies among temporal signals, commonly measured as the similarity between two brain regions using the Pearson coefficient:
$$A_{ij} = \frac{\mathrm{Cov}(i, j)}{\sqrt{\mathrm{Var}(i)\,\mathrm{Var}(j)}}$$
where $\mathrm{Cov}(i, j)$ is the covariance of the readings between sensory electrodes $i$ and $j$, and $\mathrm{Var}(i)$ denotes the variance of the $i$-th channel's readings throughout the $K$ timesteps. This results in a complete functional adjacency $A \in \mathbb{R}^{N \times N}$, with $N$ being the number of electrodes. The following section discusses how we further reduce nodes and edges by only keeping task-relevant channels.
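A minimal sketch of this computation, assuming an EEG segment is stored as a list of $N$ channels with $K$ readings each. The function names `pearson` and `functional_adjacency` and the toy segment are illustrative assumptions, not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson coefficient between two equal-length reading sequences.

    Assumes neither sequence is constant (otherwise the variance is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / math.sqrt(var_x * var_y)


def functional_adjacency(segment):
    """Build the N x N functional adjacency of a segment (N channels, K readings)."""
    n = len(segment)
    return [[pearson(segment[i], segment[j]) for j in range(n)] for i in range(n)]


# Toy segment (assumed data): 3 channels, 4 timesteps.
segment = [
    [1.0, 2.0, 3.0, 4.0],  # channel 0
    [2.0, 4.0, 6.0, 8.0],  # channel 1: perfectly correlated with channel 0
    [4.0, 3.0, 2.0, 1.0],  # channel 2: perfectly anti-correlated with channel 0
]
A = functional_adjacency(segment)
```

The resulting matrix is symmetric with a unit diagonal, so it is complete in the sense used above: every pair of channels receives a weight before any thresholding.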

C. Task-Adaptive Channel Selection
The activation of different brain regions varies with brain activity [25], indicating that some regions may not respond as actively to certain tasks as others. Raw EEG data collected from electrodes distributed throughout the scalp will thus always contain information irrelevant to a specific MI task. The presence of such redundancy increases the computational burden and over-fitting risks.
This has led to a demand for filtering out task-irrelevant channels, yet most efforts are directed at manually solving complex optimization problems [26], which may even require extensive domain expertise [27]. In some cases, such as cross-subject analysis [28], an automatic solution is required. We therefore propose a simple yet effective data-driven channel selection strategy based on node importance across all EEG channels, which represents the activation intensity of the corresponding brain regions. In particular, we suggest two metrics for measuring the nodes' importance $W \in \mathbb{R}^{N}$.
1) Degree-Based Importance: The degree of a node indicates its centrality within a brain network as well as its influence at the graph level. Under this metric, we simply define the nodes' importance $W$ as the number of incoming edges, obtained from the channel-wise adjacency matrix $A$. Then, we apply a proportional threshold $T^*$ to preserve the $N'$ channels adapted to a specific intention, ranking the nodes by importance in descending order. The above steps are applied to each EEG segment $X$ in the dataset.
2) Strength-Based Importance: We assume the channels with high correlation strength across different MI tasks are more active than those less correlated. Say we have EEG segments of $C$ tasks with $N_c$ for each, performed by multiple subjects. We compute the absolute Pearson coefficient between the same channel's $N_c$ readings of every two intentions $c_i, c_j \in C$ across all intentions. Then the task-relevant strength of each channel is obtained by averaging its $C - 1$ coefficient values across all intention pairs. We define this strength of all channels as the importance weight $W$ for segment $X$. Lastly, we apply a threshold $T^*$ to preserve the $N'$ task-relevant channels, ranking the nodes by their task-relevant strength in descending order.

Algorithm 1 Task-Relevant Channel Selection
Require: raw EEG segments set $\mathcal{X}$, intention set $C$, importance weight $W$ and threshold $T^*$
Ensure: EEG segments set $\mathcal{X}'$ of task-relevant channels
1: $\mathcal{X}' \leftarrow \emptyset$
2: for $X \in \mathcal{X}$ do
3: Sort all channels by importance scores in descending order, and store the indices into $I$: $I \leftarrow \mathrm{argsort}(W)$
4: Select the most active channels with threshold $T^*$: $X' \leftarrow X_i$, for the first $T^*$ indices $i \in I$
5: Preserve the readings of the most active channels $X'$ and append them to the task-relevant EEG set $\mathcal{X}'$
6: end for
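The selection procedure can be sketched in a few lines. The sketch makes two assumptions that the text does not fix: on a complete correlation matrix, the degree is taken as the weighted degree (the sum of absolute correlations to all other channels), and the proportional threshold is expressed as a ratio of channels to keep. All function and variable names are hypothetical.

```python
def degree_importance(adjacency):
    """Weighted degree per channel: sum of absolute correlations to other channels.

    Assumption: on a complete correlation matrix, edge counts are identical for
    every node, so the weighted degree stands in for the edge count.
    """
    n = len(adjacency)
    return [sum(abs(adjacency[i][j]) for j in range(n) if j != i) for i in range(n)]


def select_channels(segment, importance, keep_ratio):
    """Keep the top `keep_ratio` share of channels ranked by importance."""
    n_keep = max(1, round(keep_ratio * len(segment)))
    ranked = sorted(range(len(segment)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original channel order
    return kept, [segment[i] for i in kept]


# Toy functional adjacency (assumed data): channel 3 is weakly connected.
A = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
segment = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4], [0.3, 0.3, 0.1], [0.9, 0.8, 0.7]]
importance = degree_importance(A)
kept, reduced = select_channels(segment, importance, keep_ratio=0.5)
```

The strength-based metric would plug into the same `select_channels` call, with `importance` replaced by the averaged cross-intention coefficients.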

D. Temporal Embedding
There might be different concentration periods between subjects during MI, leading to different temporal properties even if two subjects were performing the same task [13]. Thus, our objective is to identify the most discriminative period within each EEG segment of different subjects, which we refer to as the subject-invariant temporal embedding.
Specifically, after selecting the strongest $N'$ channels, we first crop the EEG segment $X' \in \mathbb{R}^{N' \times K}$ into $M$ temporal slices $S \in \mathbb{R}^{M \times N' \times \omega}$ using a sliding window of size $\omega$. We then create the initial temporal features $U_m$ of each temporal slice $S_m \in \mathbb{R}^{N' \times \omega}$ by applying a 3 × 3 convolution kernel followed by an ELU non-linear activation, where $U_m$ retains the same tensor shape as the input slice $S_m$. Next, we use self-attention [14] to capture slice-wise correlations and adapt temporal weights to their temporal features, leading to a temporal embedding $V \in \mathbb{R}^{N' \times W}$ that summarizes the EEG segment, where $W_m$, $W_h$ and $b_h$ are the learnable parameters of the attention module. The temporal embedding $V$ aggregates all $M$ slices, taking into account each slice's importance, which yields the most discriminative representation of the input values.
[Figure caption: We present part of the edges with the highest correlation for readability purposes. The graph is visualized using BrainNet Viewer [29].]
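The slicing and weighted aggregation can be illustrated as follows. This is a simplified sketch: the attention scores are passed in as plain numbers, whereas in the model they are produced by learned parameters, and the convolution step is omitted. The window size of 400 and step of 10 on a 64 × 496 segment follow the implementation details reported later in the paper. All function names are hypothetical.

```python
import math


def sliding_windows(segment, window, step):
    """Crop an N x K segment into M slices, each of shape N x window."""
    k = len(segment[0])
    starts = range(0, k - window + 1, step)
    return [[channel[s:s + window] for channel in segment] for s in starts]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(slices, scores):
    """Attention-style weighted sum of M slices into one N x window embedding."""
    weights = softmax(scores)
    n, w = len(slices[0]), len(slices[0][0])
    return [
        [sum(wt * sl[i][t] for wt, sl in zip(weights, slices)) for t in range(w)]
        for i in range(n)
    ]


# Toy segment (assumed data): 64 channels x 496 timesteps, channel i holds value i.
segment = [[float(i)] * 496 for i in range(64)]
slices = sliding_windows(segment, window=400, step=10)
embedding = aggregate(slices, scores=[0.0] * len(slices))  # uniform placeholder scores
```

With these settings the window fits at offsets 0, 10, ..., 90, which gives the ten slices mentioned in the implementation details.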

E. Topological Embedding
Brain activity is coordinated by multiple brain regions [30]. These inter-regional correlations are potentially beneficial when incorporated into predictive models. CNN-based methods [4], [8], [13] assume a Euclidean structure of EEG electrodes, and thereby ignore the natural geometry of the brain structure and the connections between regions beyond their immediate vicinity. In contrast, graph-based brain representations not only appear to better reflect the non-Euclidean nature of the human scalp [17], but also encode the subject-invariant positioning priors of electrodes [15], [16] into the model. Nevertheless, their graph constructions cannot encode dynamic functional connectivity that adapts to different MI tasks, since the edges therein are fixed by the distances between nodes.
1) Task-Adaptive Edge Formation: In a brain network, correctly defined edges model how each brain region associates with other active regions during an MI task. Meanwhile, as noted above, channel-wise correlations do not remain constant across MI intentions. Having identified the task-relevant channels $N'$, we now formulate the task-relevant edges $E' \in \mathbb{R}^{N' \times N'}$ to leverage strong associations tailored to the task [31], in the following stages:
1) Constructing: Calculate the channel-wise correlation adjacency $A_c \in \mathbb{R}^{L \times S \times N \times N}$ of all subjects for each intention $c \in C$, using the Pearson coefficient, where $L$ denotes the number of labels, $S$ is the number of subjects, and $N$ represents the number of channels/electrodes.
2) Generalizing: Find the generalized connectivity across all subjects by averaging the adjacency matrices of all subjects for each intention $c$.
3) Thresholding: Preserve critical connections that exceed the threshold $T_e$ for each intention $c$.
4) Merging: Derive the task-relevant edges $E'$ by merging the critical connections of all $C$ intentions.
Algorithm 2 shows the pseudocode to derive a purely functional adjacency $E'$. Moreover, considering that the temporal embedding $V$ results from the task-relevant channels $X'$, we construct the task-adaptive brain network $G = \langle V, E' \rangle$ to obtain the topological embedding.

Algorithm 2 Task-Relevant Edge Formation Algorithm
Require: EEG segments set $\mathcal{X}$, adjacency matrices $\mathcal{A}$, intention set $C$, edge threshold $T_e$
1: for $c$ in $C$ do
2: Average the matrices $A_c$ of all EEG segments within the same intention $c$: $\bar{A}_c = \mathrm{mean}_{X \rightarrow c}(A)$, $\forall A \in \mathcal{A}$, $\forall X \in \mathcal{X}$
3: Sort the edges $E_c$ in $\bar{A}_c$ by the absolute value of their weights in descending order
4: Derive the critical edges corresponding to the strongest associations using the edge threshold $T_e$: $E_c = \mathrm{indices}(E_c(0 : T_e \times |E_c|))$
5: end for
6: Merge the critical edges together to produce the task-relevant edges $E'$
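The four stages can be sketched as below, assuming the per-segment adjacency matrices are already grouped by intention and that edges are undirected, so only the upper triangle is ranked. Names such as `critical_edges` and `task_relevant_edges`, the intention labels, and the toy matrices are illustrative assumptions.

```python
def mean_adjacency(matrices):
    """Element-wise mean of equally sized adjacency matrices (Generalizing)."""
    n = len(matrices[0])
    count = len(matrices)
    return [[sum(m[i][j] for m in matrices) / count for j in range(n)] for i in range(n)]


def critical_edges(adjacency, edge_ratio):
    """Strongest share of undirected edges by absolute weight (Thresholding)."""
    n = len(adjacency)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(key=lambda e: abs(adjacency[e[0]][e[1]]), reverse=True)
    return set(edges[: int(edge_ratio * len(edges))])


def task_relevant_edges(adjacency_by_intention, edge_ratio):
    """Union of the critical edges of every intention (Merging)."""
    merged = set()
    for matrices in adjacency_by_intention.values():
        merged |= critical_edges(mean_adjacency(matrices), edge_ratio)
    return merged


# Toy adjacencies (assumed data): two segments for "left", one for "right".
adjacency_by_intention = {
    "left": [
        [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]],
        [[1.0, 0.6, 0.3], [0.6, 1.0, 0.2], [0.3, 0.2, 1.0]],
    ],
    "right": [
        [[1.0, 0.1, 0.2], [0.1, 1.0, -0.9], [0.2, -0.9, 1.0]],
    ],
}
edges = task_relevant_edges(adjacency_by_intention, edge_ratio=0.5)
```

Ranking by absolute value keeps strong negative correlations, such as the (1, 2) edge of the second intention in this toy example.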
2) Embedding Computation: We adopt an $L$-layer GIN to investigate the topological embedding of a given brain network $G$, as it shows promise for graph-level classification [21], [32]. The graph-level topological embedding $H \in \mathbb{R}^{W}$, where $W$ is the embedding dimension, is produced by applying a READOUT(·) function to the node embeddings after $L$ layers of propagation. In practice, we use an MLP with 2 hidden layers, $\epsilon = 0$, and three GIN layers ($k = 3$). While the READOUT(·) function can be either a non-injective aggregation (e.g., pooling) or an injective mapping (e.g., MLP(·)), it is believed that the latter will yield a more discriminative graph embedding, as shown by [21]. Our empirical studies experiment with different setups to verify this argument in EEG-based applications.
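To make the GIN update concrete, the sketch below performs one propagation step with sum aggregation and $\epsilon = 0$ on a toy graph. The `mlp` argument is a stand-in (here a plain ReLU) for the trained two-hidden-layer MLP, and `sum_readout` is a simple placeholder rather than the MLP-based READOUT favored in the paper. All names and data are assumptions for illustration.

```python
def gin_layer(features, neighbors, mlp, eps=0.0):
    """One GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor features)."""
    updated = []
    for v, h_v in enumerate(features):
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            agg = [a + b for a, b in zip(agg, features[u])]
        updated.append(mlp(agg))
    return updated


def sum_readout(features):
    """Graph-level embedding as the element-wise sum of all node embeddings."""
    return [sum(column) for column in zip(*features)]


def relu(vector):
    """Toy stand-in for the trained MLP (assumption for illustration)."""
    return [max(0.0, x) for x in vector]


# Toy brain network (assumed data): node 0 is connected to nodes 1 and 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
hidden = gin_layer(features, neighbors, mlp=relu)
graph_embedding = sum_readout(hidden)
```

Because the neighbor aggregation is a sum, reordering a node's neighbors leaves its update unchanged, which is the permutation invariance that GIN relies on.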

F. Intention Classification
We take the topological embedding $H$ of an EEG segment $X$ to perform intention prediction, with a softmax function to estimate the probability of each possible intention:
$$P = \mathrm{softmax}(W_o H + b_o)$$
where $W_o$ and $b_o$ are trainable parameters. The predicted intention is thus given by $\hat{Y} = \arg\max_c P_c$.
IV. EMPIRICAL STUDIES
A. Experiment Setting
1) Dataset: In this work, we evaluate the performance of the proposed SIFT-EEG on a widely studied large-scale cross-subject EEG-based Motor Imagery dataset, eegmmidb (EEG motor movement/imagery database), downloaded from the Physionet database [33]. The dataset was collected using BCI2000 [34] with 64 electrodes at a sampling frequency of 160 Hz from 109 healthy subjects. Following the conventions of [15], [16], we remove subjects #88, #89, #92 and #100 from the dataset due to consecutive resting states. Each subject performed three sessions of Motor Imagery tasks, during which EEG signals were recorded while the subject imagined opening and closing the left or right fist. Each session contains approximately fifteen 3.1-second segments of imagined left/right-hand movement. We randomly select 90 subjects and use their EEG segments as the training set, whereas the remaining 15 subjects are used as the test set. No subject appears in both the training and test sets, which ensures the evaluation is conducted in a subject-independent setting. We experiment with twelve different train/test splits to reduce randomness in the results.
We fix a unique random seed for each split that specifies the training and testing subjects. Noticeably, the last split includes all untested subjects from the previous eleven splits to ensure each subject is tested at least once.
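A seeded subject-level split of this kind might look as follows. The subject numbering from 1 to 109 and the function name `split_subjects` are assumptions for illustration; the sketch covers a single split and does not reproduce the special handling of the last split.

```python
import random


def split_subjects(subject_ids, n_test, seed):
    """Shuffle subjects with a fixed seed and hold out `n_test` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# 109 subjects minus the four excluded ones leaves 105 (assumed 1-based ids).
excluded = {88, 89, 92, 100}
subjects = [s for s in range(1, 110) if s not in excluded]
train_subjects, test_subjects = split_subjects(subjects, n_test=15, seed=0)
```

Splitting by subject identifier rather than by segment keeps every segment of a held-out subject out of the training set.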
2) Preprocessing: In line with most related works, raw EEG data is normalized with a z-score, calculated by subtracting the mean value and dividing by the standard deviation of training samples X . We do not further apply filters or data augmentation on raw EEG signals. We use temporal data as model input.
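The normalization can be sketched as below, with the statistics estimated on training samples only and reused for test samples. Whether the statistics are computed globally or per channel is not stated, so this sketch assumes a single global mean and standard deviation; the function names and values are hypothetical.

```python
import statistics


def fit_zscore(train_values):
    """Estimate mean and (population) standard deviation from training readings."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_zscore(values, mean, std):
    """Normalize readings with previously fitted training statistics."""
    return [(v - mean) / std for v in values]


# Toy readings (assumed data).
train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_zscore(train_values)
test_normalized = apply_zscore([5.0, 9.0], mean, std)
```

Fitting on the training subjects alone avoids leaking test-subject statistics into the model input, which matters in the subject-independent setting.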
3) Baselines: We reproduce several baselines for subject-independent MI classification, including both traditional and DNN-based approaches. We adopt the common spatial pattern and linear discriminant analysis (CSP+LDA) [35] as the representative traditional method. The DNN-based approaches are further divided into the CNN and GNN families.
Specifically, the CNN-based methods include EEGNet [7], CRAM [13] and GHAM [16]. The GNN members GIN0 [21], GraphSAGE [23] and EEG-GAT [36] apply different GNNs under the same framework [17]. In addition, we implement a two-layer RNN as a vanilla DNN baseline to model temporal data.
4) Implementation Detail: Each input EEG segment $X \in \mathbb{R}^{64 \times 496}$ contains 496 timesteps with 64 channels. For temporal embedding, we apply the sliding window technique with a window size of 400 and a step size of 10. Hence, the input signal contains ten temporal slices, where each slice has the shape [64, 400] (i.e., $N = 64$, $\omega = 400$). All the models are implemented with PyTorch and trained on an NVIDIA 3060-Ti GPU in a fully-supervised manner. We use cross-entropy as the objective function and optimize model parameters using Adam with a learning rate of 0.001. The training batch size is 500. Each model is trained for 120 epochs, and the dropout probability is set to 0.5 to avoid over-fitting. We implement all GNN-related components using Torch_Geometric.

B. Result & Discussion
The proposed SIFT-EEG is empirically compared with a range of baselines, focusing on three research questions.
1) Does the modeling of temporal dependencies benefit spatial/topological correlations and model performance?
2) Does task-adaptive dynamic connectivity outperform distance-based connectivity for topological learning?
3) Does task-relevant channel selection with topological features still achieve competitive performance?
The evaluation metrics include classification Accuracy and the Area Under the ROC Curve (ROC-AUC). All models are trained and evaluated with the same setting for fair comparison.
1) Overall Performance: Tab. I reports the model performance of all approaches on both metrics. Observe that SIFT-EEG outperforms all baseline models, with 1.13% and 14.68% accuracy improvements, as well as ROC-AUC gains of 1.09% and 15.44%, compared to the best performing CNN-based state-of-the-art (GHAM) and GNN-based state-of-the-art (EEG-GAT), respectively. Meanwhile, the top performers in both DNN families prove to be more accurate than CSP+LDA. In addition, all DNN baselines (except RNN) include mechanisms to represent the spatial correlation between EEG channels, providing better results than the vanilla RNN. It follows that such representations should be incorporated into modeling. Fig. 5 depicts how the training loss changes as the number of training epochs increases. GNN-based methods generally have lower training losses and faster convergence rates compared with CNN-based methods, which suggests the capability of topological features in task-specific predictions. However, their test performance is worse than that of CNN-based methods, showing a tendency to overfit the training data. The reason may be that GNN-based methods extract topological features directly from raw EEG signals. Recall that our evaluations take place in a subject-independent setting. In this case, the variances in temporal patterns between subjects, i.e., graph noise, may cause them to perform inconsistently [37], [38]. Conversely, CNN-based methods explicitly handle temporal correlations before looking at the spatial domain. For instance, CRAM locates discriminative temporal features adaptively for different subjects by using attention. This eases the burden on the spatial feature extractor and generalizes the model to new subjects, although fitting these models takes longer. The proposed SIFT-EEG combines the merits of both, leveraging flexible topological features of the brain signal, as well as subject-adaptive temporal features that reduce task-irrelevant noise.
In addition, we perform statistical significance tests to evaluate the performance improvements of SIFT-EEG over the baselines. We use the pairwise t-test and regard a pairwise difference as significant if the p-value is less than 0.05. The results are reported in Tab. III, where statistically significant differences are shown in bold.
2) Impact of Temporal Embedding: Now we analyze how temporal embedding contributes to EEG-based MI classification and answer the first research question. EEGNet designs a CNN block to extract temporal features, while CRAM and GHAM adopt self-attention on a recurrent network to capture long-term temporal dependency and adaptive subject-specific patterns, thus improving model performance in subject-independent experiments. In contrast, GIN0 and GraphSAGE take raw EEG signals directly as input without learning temporal features. This leads to much lower predictive results than their CNN-based counterparts. EEG-GAT, on the other hand, attempts to extract temporal information with a 2D temporal convolution operator, which yields around 32% and 24% improvement over GIN0 and GraphSAGE, even though all three approaches are developed under a similar predictive framework. We also examine the impact of temporal embedding within SIFT-EEG. A SIFT-EEG variant without temporal embedding, denoted as SIFT-EEG (w/o t), is compared with the full SIFT-EEG to reveal its effect. The full model shows an increase of 29.01% and 28.96% in accuracy and ROC-AUC, respectively. It is thus clear from the comparisons of all three groups that modeling temporal dependency in conjunction with spatial/topological embedding is essential.
3) Comparison of Topological Feature: The second research question investigates whether the proposed task-adaptive dynamic connectivity facilitates topological learning. We first identify the graph connection of each model in comparison. GHAM uses distance-based graph representations of EEG channels, but does not use a GNN for representation learning. All three GNN baselines define complete graph connections to yield the best results. SIFT-EEG chooses functional connectivity and bypasses the position limits. GHAM extends CRAM with a graph definition of the input, which leads to an increase of 1.5% in accuracy and 0.7% in ROC-AUC over CRAM, implying the merit of the non-Euclidean assumption imposed by graph representation. Among the three GNN baselines, EEG-GAT shows better performance than the other two. A possible explanation is that GAT could benefit from well-designed attention in large and noisy graphs [39], suggesting GAT wins out over GIN in a complete graph of the brain network (as per their framework).
We further replace dynamic connectivity with complete connection in SIFT-EEG and observe a clear performance drop, shown as SIFT-EEG (w/o d). The GIN applied to SIFT-EEG aggregates all nodes indiscriminately under complete connections. This leads to the conclusion that SIFT-EEG needs to be defined with task-adaptive edges.
[...] and max-pooling. The performance also drops noticeably. As a reminder, SIFT-EEG emphasizes a) graph-level prediction instead of node-level prediction; and b) structural information of the graph (brain network). Our results empirically align with the theoretical findings of [21], i.e., MLP-based injective mapping increases capacity over its non-injective READOUT counterpart.
5) Analysis of Task-Adaptive Channel Selection: Our third research question examines task-adaptive channel selection from four perspectives. First, we visualize and compare the effects of the two importance metrics. We next apply task-adaptive channel selection to CRAM and GHAM to examine its applicability. We evaluate its effectiveness against three additional channel selection approaches using SIFT-EEG. Finally, we discuss the model elapsed time with different channel selection ratios.
a) Qualitativeness: Fig. 2 showcases the positions of the task-relevant channels selected by the two importance metrics. The nodes resulting from degree-based importance are primarily located around the central sulcus, while those derived by strength-based importance cluster in the frontal lobe, corresponding to the functional area for motor control in the study of neuroscience [40].
b) Applicability: In addition to SIFT-EEG, the performance of two other models, CRAM and GHAM, is compared under task-adaptive and random selection by keeping 20% of the original channels. For comparison, CRAM makes no graph assumptions, GHAM represents the input with a hard-ruled graph definition, whilst SIFT-EEG dynamically determines the graph representation.
As shown in Table II, with only 20% of the channels, task-adaptive selection yields accuracy similar to (for CRAM and GHAM) or even slightly higher than (for SIFT-EEG) that obtained with all channels, whereas the random strategy returns lower performance. Hence, our task-adaptive channel selection could benefit all models in general. Still, it works best when coupled with dynamic functional graph connectivity.
[Figure: Classification accuracy for top t% task-relevant channels selected by the proposed algorithm.]
c) Effectiveness: Moreover, we compare the proposed task-adaptive selection with two additional channel selection strategies. Shan et al. [41] identify subject-specific channels by finding channels with strong correlations to the central channel. In reproducing this strategy, we select the channels with an average correlation ≥ 0.7, except for the reference channels C3/C4/Cz. Mattioli et al. [42] reduce the number of channels required by segmenting motion functional regions and producing regions of interest. In our experiments, we select channels located in the motor cortex region for this method. Table IV reports the least number of channels required to achieve an accuracy greater than 60% for each of these strategies applied to SIFT-EEG. The task-adaptive selection with strength-based importance is the top performer, requiring the fewest channels. The degree-based metric, however, appears to be relatively ineffective. A possible reason is that the nodes are unequally distributed over the scalp. According to Fig. 6, the sensory nodes near edge areas (e.g., AF7, O1 and P10) have fewer neighbors than those in central areas such as Cz, Cpz and Fcz. Meanwhile, the distance between nodes affects both the number of incoming edges and the strength of the connection. This eventually biases the estimation when using the degree-based importance metric.
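The difference between the two importance metrics can be sketched on a toy graph. The snippet below is a minimal illustration, assuming the standard graph-theoretic definitions (degree as the number of edges attached to a node, strength as the sum of their weights); the exact formulation used by SIFT-EEG is defined elsewhere in the paper, and the function names and the 5-channel adjacency matrix are hypothetical.

```python
import numpy as np

def degree_importance(adj):
    """Number of non-zero edges attached to each channel."""
    return np.count_nonzero(adj, axis=1)

def strength_importance(adj):
    """Sum of the edge weights attached to each channel."""
    return adj.sum(axis=1)

def select_top_channels(importance, ratio=0.2):
    """Indices of the top `ratio` fraction of channels, most important first."""
    k = max(1, int(round(ratio * len(importance))))
    order = np.argsort(-np.asarray(importance, dtype=float), kind="stable")
    return order[:k]

# Toy 5-channel weighted adjacency (symmetric, zero diagonal).
# Channel 0 has many weak edges; channel 4 has few but strong edges.
adj = np.array([
    [0.0, 0.2, 0.2, 0.2, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.9],
    [0.2, 0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.9, 0.8, 0.0],
])
by_degree = select_top_channels(degree_importance(adj), ratio=0.2)
by_strength = select_top_channels(strength_importance(adj), ratio=0.2)
```

Here the degree-based metric favors channel 0, which has many weak connections, while the strength-based metric favors channel 4, which has fewer but stronger ones. This mirrors how an edge count alone can be biased by how many neighbors a node happens to have.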
On the other hand, the strength-based task-adaptive selection not only reports higher accuracy but also does so with fewer channels, even compared to using channels from the widely recognized motor cortex region [42]. The results indicate that task-adaptive channel selection can help encode the most discriminative EEG embedding with the fewest channels, even in the absence of extensive domain knowledge.
d) Efficiency: Our final step is to examine the model performance and running efficiency when varying the number of channels and edges. Fig. 7 and Fig. 8 show accuracy and relative elapsed time, respectively. Random selection consistently results in degraded performance as the number of available channels decreases. In contrast, both task-adaptive selections perform better than random selection. The strength-based selection [...] In an interesting twist, running with 20% of the channels costs slightly more time than running with 40%, but it is accompanied by a rebound in classification accuracy.
In addition, we compare the computational efficiency of SIFT-EEG with the baselines when all 64 channels are used, as shown in Fig. 9. We evaluate how long it takes to handle 400 arbitrarily sampled EEG segments with 400 forward passes with batch size 100, for both training and inference. The proposed SIFT-EEG takes 3.4% less training time than EEG-GAT, the strongest GNN baseline, whilst improving the predictive performance by 14.68%. Moreover, SIFT-EEG takes 5 seconds less than CRAM but 10 seconds more than GHAM. The higher recognition accuracy of these three models comes at the expense of being more costly during inference. Nevertheless, we note that SIFT-EEG can improve its efficiency with our task-relevant channel selection, denoted by SIFT-EEG (w/c) in Fig. 9. In this configuration, it runs with the top-40% task-relevant channels, which improves training and inference efficiency by more than 50% and 75%, respectively. This shows that SIFT-EEG can be faster than most baselines while gaining substantial performance benefits.
6) Visualization of Dynamic Functional Connectivity: Having learned the task-adaptive dynamic functional connectivity topology, we compare it with a fixed functional connectivity topology originating from the pre-motor, supplementary, and primary motor areas [43]. We visualize the active brain regions across certain proportions of the 105 subjects performing MI tasks, according to the two importance metrics, as in Fig. 4. The degree-based importance is associated with activities in the frontal and parietal lobes, whereas the strength-based importance seems to target mainly the occipital lobes. There are overlaps between the dynamic active region and the fixed motor region regardless of the number of subjects used. Moreover, our dynamic brain topology shows that the active regions within each importance metric are distributed similarly across subjects.
Increasing the number of subjects makes the distribution converge to an almost identical pattern, implying that the statistical characteristics of the dynamic topology can be improved by having more samples. Still, it shows robustness despite different sample sizes.
7) Impact of Varying Training Subjects: Furthermore, we investigate how different numbers of training subjects affect classification accuracy. Our experiment begins with fifty subjects used for training, i.e., the training and test subjects are close to a 1:1 ratio. Subsequently, the training set expands by five subjects at each step until 100 subjects have been included. For each train/test ratio, we perform cross-validation to split the train and test sets, ensuring each subject has been tested at least once. We report the performance improvements of the GNN family methods with varying numbers of training subjects, using fifty as the baseline, together with the mean and standard deviation for each ratio, as illustrated in Fig. 10. Fig. 10 shows a strong correlation between the number of training subjects and classification accuracy. Increasing the number of training subjects from fifty to seventy improves classification accuracy almost linearly. This may suggest that SIFT-EEG learns more subject-independent features as it is trained on more subjects. However, the improvement becomes negligible when the training set grows from seventy to eighty-five subjects. Observations show that the overall performance continues to improve beyond eighty-five training subjects. The presence of more training subjects may increase the likelihood that test subjects exhibit patterns similar to those in the training set, thereby improving predictive performance.
We also include the performance changes of other GNN-based methods. Whereas all methods demonstrate performance improvements, SIFT-EEG benefits the most from the use of a greater number of training subjects, as this may facilitate the extraction of subject-independent patterns among the population. In contrast, the other methods show fluctuating results and irregular patterns while consistently underperforming SIFT-EEG, mostly because none of these methods bakes subject-independent features into the representation. It is interesting to note that EEG-GAT reports considerably higher variance despite being the best-performing GNN baseline.
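The subject-wise evaluation protocol described above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in the experiments: the fold construction (a shuffled subject list cut into consecutive test blocks, with the last block wrapping around) is an assumption, as is the helper name `subject_splits`. Only the total of 105 subjects and the training sizes from 50 to 100 in steps of five come from the text.

```python
import numpy as np

def subject_splits(n_subjects, n_train, seed=0):
    """Yield (train_ids, test_ids) so that every subject is tested at least once."""
    n_test = n_subjects - n_train
    order = np.random.default_rng(seed).permutation(n_subjects)
    n_folds = -(-n_subjects // n_test)  # ceiling division
    for fold in range(n_folds):
        # Wrap around so that the last fold is still full-sized.
        idx = np.arange(fold * n_test, (fold + 1) * n_test) % n_subjects
        test_ids = order[idx]
        train_ids = np.setdiff1d(order, test_ids)
        yield train_ids, test_ids

# Training-set sizes: 50, 55, ..., 100 out of 105 subjects.
train_sizes = list(range(50, 101, 5))
splits = {n: list(subject_splits(105, n)) for n in train_sizes}
```

Because the splits are made over subject identifiers rather than individual EEG segments, no subject contributes data to both the training and the test set of a fold, which is what the subject-independent setting requires.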

V. CONCLUSION
This paper focused on task-adaptive modeling of brain networks with functional connectivity. We conducted extensive experiments with a large-scale EEG dataset to demonstrate that the proposed predictive model outperforms the state-of-the-art for Motor Imagery classification in the subject-independent setting. Moreover, our investigation indicated that task-adaptive region selection produces similar predictive performance with only 20% of raw EEG data, with a considerable reduction in computation cost during model training and deployment. Our future work may address the data scarcity problem in EEG-related research by examining other Motor Imagery datasets with fewer subjects. We will examine how to apply the proposed subject-independent channel-selection methods more efficiently in situations where training subjects are limited.