Adaptive Gated Graph Convolutional Network for Explainable Diagnosis of Alzheimer’s Disease Using EEG Data

Graph neural network (GNN) models are increasingly being used for the classification of electroencephalography (EEG) data. However, GNN-based diagnosis of neurological disorders, such as Alzheimer’s disease (AD), remains a relatively unexplored area of research. Previous studies have relied on functional connectivity methods to infer brain graph structures and used simple GNN architectures for the diagnosis of AD. In this work, we propose a novel adaptive gated graph convolutional network (AGGCN) that can provide explainable predictions. AGGCN adaptively learns graph structures by combining convolution-based node feature enhancement with a correlation-based measure of power spectral density similarity. Furthermore, the gated graph convolution can dynamically weigh the contribution of various spatial scales. The proposed model achieves high accuracy in both eyes-closed and eyes-open conditions, indicating the stability of learned representations. Finally, we demonstrate that the proposed AGGCN model generates consistent explanations of its predictions that might be relevant for further study of AD-related alterations of brain networks.


I. INTRODUCTION
The brain is a complex, densely connected system that operates across multiple spatial and temporal scales. Neurological diseases, such as Alzheimer's disease (AD), can alter the connectivity of the brain and thus disrupt brain function [1-4]. AD is the most common cause of dementia and affects millions of patients worldwide [5,6]. Currently, AD is typically diagnosed using a combination of cognitive and neurological assessments, as well as neuroimaging techniques, such as positron emission tomography (PET) or magnetic resonance imaging (MRI), which can be time-consuming and expensive [7]. The development of rapid, economical, and explainable diagnostic methods is thus important [6].
Electroencephalography (EEG) is an economical and non-invasive neuroimaging method that records the sum of electrical potentials generated by various brain areas. EEG is extensively used in research on AD-related alterations in brain function and functional connectivity. Although EEG is not currently used in clinical settings for AD diagnosis, numerous studies have demonstrated the high effectiveness of EEG-based diagnosis of AD [8-13].
AD causes disruption of synaptic connections across multiple scales [3,14,15] and can thus be viewed as a network disorder [1]. The synaptic disconnection can be observed in EEG signals as alterations of synchronisation and functional connectivity (FC) [3,7]. Furthermore, the slowing of EEG signals is a reliable characteristic of AD [14,16], observed as a shift of spectral power towards low-frequency components. Graph-theoretic studies of AD also report reduced complexity, disruption of small-world properties, decreased integration, and increased segregation [10,15,17-20]. However, one of the challenges in EEG-based predictive models is the efficient utilisation of the information collected over multiple electrodes, since there is information to be gained both at the electrode level, e.g. the frequency spectrum, and at the cross-electrode level, e.g. FC.
Machine learning-based approaches often require domain knowledge and rely on manual feature extraction. For example, Oltu et al. [21] calculate power spectral density (PSD) and coherence across multiple EEG electrodes and then use descriptive statistics, such as sum and variance, as input features. Other feature-based methods use FC [12,22,23]. These methods first reconstruct the brain graph via measures of FC, such as phase lag index [12], generalised composite multiscale entropy vector [22], or phase synchronisation index [23]. The features can then be extracted via statistics [22] or graph-theoretic measures [12,23].
In contrast, deep learning methods can extract features automatically from the input. However, utilising the information from multiple electrodes with classical deep learning methods is challenging. To overcome this issue, several studies have transformed EEG signals into images to make use of convolutional neural networks (CNN) [13,24-27], which are efficient in image classification. For instance, Ieracitano et al. [24] compute the PSD across channels and compose them to form a channel-by-PSD image. Bi et al. [26] use spectral topology images and leverage the colour channels of an image to represent three frequency bands. Finally, Huggins et al. [27] create tiled images where each tile contains the continuous wavelet transform of an EEG electrode. Although these methods utilise multiple channels, the cross-electrode information is still omitted. A CNN trained on FC-based adjacency matrices has been proposed to address this limitation [28]. However, a CNN is not well suited for such input since the adjacency matrix is irregular and non-Euclidean.
A graph neural network (GNN) extends the CNN to process graph-structured inputs. Multiple studies propose GNN-based architectures to process EEG. However, GNN methods for EEG-based diagnosis of AD remain scarce [8,11]. GNN-EEG implementations often include several steps: (1) input construction, i.e. graph structure and node features; (2) a GNN encoder to learn node embeddings; and (3) aggregation of node embeddings into a graph embedding, which can be used in the final classification step.
There are various approaches to realise the graph construction in step (1). Node features are commonly defined as the EEG time-series signal [11,29-31], or a statistical summary of the signal in the time domain [32,33], the frequency domain [8,34], or the differential entropy [29,34-38]. Based on the network neuroscience literature, many approaches define the brain graph using FC measures [8, 11, 29, 31-33, 39, 40]. The graph structure can also be based on the distance between EEG electrodes [33,35,36]. However, such an approach largely ignores brain connectivity information. Alternatively, the brain graph can be automatically learned by the model, either as a learnable mask shared across samples [29,34,41] or by pairwise node feature distance minimisation regularised by an additional graph loss function [37,38,42]. While such approaches are flexible and should converge to an optimal graph structure with respect to a given learning task, the learned brain graph might not be representative of the underlying brain connectivity, i.e. such a graph structure might overestimate the strength of the task-relevant edges compared to the underlying connectivity. In this work, we propose an adaptive graph learning mechanism based on node feature enhancement via a CNN and subsequent graph construction. The graph is constructed using a correlation-based similarity measure of power spectral density and sparsified via k-nearest-neighbour (KNN) edge selection. Thus, it combines the strengths of the FC-based and automated graph learning methods. Such a combination overcomes the limitations of fully learnable graphs described above since the correlation computation is ultimately detached from the classification task. However, it should be noted that the adaptively learned graph structure reflects brain region similarity rather than a functional relationship assumed by classical FC measures.
In step (2), the node embeddings are commonly updated by graph convolutions that fully integrate the information aggregated from each neighbourhood. Such updating implies that information from different scales contributes equally to the final node embeddings, and hence to the graph embeddings as well. While brain disruptions caused by AD occur across multiple spatial scales, their predictive power is likely different. Therefore, a gating mechanism is crucial for filtering and weighting the information collected across different scales. We propose to adopt the gated graph convolution [44] to address this issue.
Finally, we implement the aggregation of node embeddings in step (3) by adopting the adaptive structure-aware pooling (ASAP) node pooling mechanism [45] to first learn the most important clusters of nodes, which are in turn concatenated to form the graph embedding. This is in contrast to the previous approaches that do not use any node pooling and form graph embeddings via simple element-wise readout layers [8,29,32,39,42,43,46] or by concatenating all nodes of the graph [11,40]. Other node pooling approaches were tested for EEG applications [46,47]. In contrast to ASAP pooling, these approaches pool the graph by selecting a specified number of nodes without considering their local context within the graph. Therefore, important information might be lost due to such node pooling.
In this paper, we propose a novel GNN model for explainable AD classification, which can adaptively enhance node features and dynamically construct brain graph structures, as shown in Fig. 1. The learned brain graphs can then be used for the interpretation of predictions. Moreover, a clustering-based node pooling mechanism is adopted to coarsen the brain graph, thus localising the brain regions that contribute to the predictions. Finally, we conduct extensive ablation and parameter sensitivity experiments to elucidate the importance of the individual blocks within the proposed model architecture.

II. DATA
EEG recordings were collected from 20 AD patients and 20 healthy control participants (HC) younger than 70 years. A detailed description of the experimental design and confirmation of the diagnosis is provided in [48]. All the AD participants were recruited from the Sheffield Teaching Hospital memory clinic. AD participants were diagnosed between one month and two years before data collection. All of them were in the mild to moderate stage of the disease at the time of recording, with an average Mini-Mental State Examination (MMSE) score of 20.1 (sd = 4). High-resolution structural magnetic resonance imaging (MRI) scans of all patients were acquired to eliminate alternative causes of dementia. Age- and gender-matched HC participants with normal neuropsychological tests and structural MRI scans were recruited. This study was approved by the Yorkshire and The Humber (Leeds West) Research Ethics Committee (reference number 14/YH/1070). All participants gave their informed written consent.
EEG data were acquired using an XLTEK 128-channel headbox and Ag/AgCl electrodes at a sampling frequency of 2 kHz, using a modified 10-10 overlapping a 10-20 international electrode placement system with a referential montage and a linked earlobe reference. The recordings lasted 30 minutes, during which the participants were instructed to rest and not think about anything specific. In case the participants showed signs of drowsiness, they were prompted. Each recording consisted of five-minute-long epochs during which the participants had their eyes closed, alternating with eyes-open epochs of equal duration.
As a neurophysiologist confirmed the EEG signal to be artefact-free, we did not further clean the signals. The signals are filtered using a band-pass Butterworth filter with a pass band of 0.5 Hz to 45 Hz and down-sampled to 200 Hz. Finally, 1-second-long windows with 50% overlap are created to increase the sample size.
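The windowing step can be sketched as follows. This is a minimal NumPy illustration, assuming the signal has already been band-pass filtered and down-sampled to 200 Hz; the function name `sliding_windows` and the synthetic data are hypothetical and not taken from the original implementation.

```python
import numpy as np

def sliding_windows(signal, fs=200, window_s=1.0, overlap=0.5):
    """Split a (channels, samples) array into overlapping windows.

    Returns an array of shape (n_windows, channels, window_length).
    """
    win = int(round(window_s * fs))            # samples per window
    step = int(round(win * (1.0 - overlap)))   # hop between window starts
    if step < 1:
        raise ValueError("overlap must be smaller than 1")
    n_samples = signal.shape[1]
    starts = range(0, n_samples - win + 1, step)
    return np.stack([signal[:, s:s + win] for s in starts])

# Synthetic stand-in for a filtered, down-sampled recording:
# 23 channels, 10 seconds at 200 Hz (filtering itself is not shown here).
rng = np.random.default_rng(0)
eeg = rng.standard_normal((23, 10 * 200))
windows = sliding_windows(eeg)
print(windows.shape)  # (19, 23, 200)
```

With 50% overlap, consecutive windows share half of their samples, which roughly doubles the number of samples compared to non-overlapping segmentation.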

III. METHODS
The proposed adaptive gated graph convolutional network (AGGCN) model consists of three blocks: a graph learning module, a GNN encoder and a classifier. The graph learning module receives a node feature matrix as input, enhances it using a 1D-CNN and learns the brain graph structure. The GNN encoder then uses the output of the graph learning module as input, i.e. a featured, weighted, undirected graph. The encoder generates a graph embedding used by the classifier to output the predicted probabilities.

A. Node feature and graph learning
The node features are defined as the power spectral density computed from 1-second-long EEG signals in 1 Hz increments from 1 to 45 Hz. Hence, the input is a node feature matrix $X \in \mathbb{R}^{N \times 45}$. The input is then passed to a convolutional neural network (CNN) with batch normalisation, $L_{CNN}$ 1D convolutional layers and maximum pooling with kernel size 2 and step size 2. The output is flattened and fed to a fully connected layer with hidden size $h_{CNN}$ and batch normalisation. This neural network outputs a matrix of enhanced node features $X' \in \mathbb{R}^{N \times h_{CNN}}$.
A graph structure is then inferred from the enhanced node features by computing the absolute value of Pearson's correlation for each pair of nodes. Thus, a unique graph structure is learned for each input sample and is defined by an adjacency matrix $A \in \mathbb{R}^{N \times N}$, with $N = 23$ being the number of EEG channels. To produce sparse graphs, the k-nearest-neighbours algorithm is utilised, i.e. the k strongest edges are preserved for each node.
The proposed graph learning module has multiple hyperparameters that control its architecture. Namely, these are the number of convolutional layers $L_{CNN}$, the kernel size (which is equal to the step size), the number of filters, the hidden size $h_{CNN}$, the dropout rate $drop_{CNN}$ and the $k_{KNN}$ parameter that controls the graph sparsity.
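The graph construction described above can be sketched in NumPy as follows. The sketch makes several assumptions that are not stated in the text: the PSD is estimated with a plain periodogram, the CNN enhancement is skipped (the raw PSD features stand in for the enhanced features), and the KNN-selected edges are symmetrised by keeping an edge if either endpoint selected it. The function names `psd_features` and `knn_correlation_graph` are hypothetical.

```python
import numpy as np

def psd_features(window, fs=200, f_min=1, f_max=45):
    """Periodogram PSD per channel for a (channels, samples) window.

    With a 1-second window the frequency bins are spaced 1 Hz apart,
    so bins 1..45 Hz give 45 features per channel.
    """
    n = window.shape[1]
    spectrum = np.fft.rfft(window, axis=1)
    psd = (np.abs(spectrum) ** 2) / (fs * n)
    freqs = np.arange(spectrum.shape[1]) * fs / n   # bin centre frequencies in Hz
    keep = (freqs >= f_min) & (freqs <= f_max)
    return psd[:, keep]

def knn_correlation_graph(features, k):
    """Absolute Pearson correlation between node feature rows, keeping
    only the k strongest edges selected by each node."""
    corr = np.abs(np.corrcoef(features))
    np.fill_diagonal(corr, 0.0)                     # no self-loops
    top = np.argsort(corr, axis=1)[:, -k:]          # k largest entries per row
    mask = np.zeros_like(corr, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    # Assumption: keep an edge if either endpoint selected it, which
    # yields a symmetric (undirected) weighted adjacency matrix.
    mask = mask | mask.T
    return corr * mask

rng = np.random.default_rng(0)
window = rng.standard_normal((23, 200))   # synthetic 1-second window, 23 channels
X = psd_features(window)
A = knn_correlation_graph(X, k=5)         # k=5 is an arbitrary illustrative value
print(X.shape, A.shape)  # (23, 45) (23, 23)
```

Because of the symmetrisation, a node may end up with more than k edges; a stricter alternative would keep only mutually selected edges, at the cost of sparser graphs.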

B. Graph neural network encoder and classifier
A graph convolution extends the classical convolution from the Euclidean domain to the graph domain. The input graph is given by $G = (\mathcal{N}, A, X')$, where $\mathcal{N}$ is the set of nodes, $A$ is the learned graph, and $X'$ is the enhanced node feature matrix. A simple graph convolution is defined by the message-passing mechanism wherein the node embedding of node $i$ is learned by aggregating information from its 1-hop neighbourhood, i.e. nodes connected with an edge, as follows:

$$x_i^{l+1} = \Theta \Big( x_i^{l} + \sum_{j \in \mathcal{N}(i)} e_{ij}\, x_j^{l} \Big) \tag{1}$$

where $x_i^{l}$ are the node features of node $i$ at the $l$th layer, $x_i^{0}$ is the $i$th row of the input node feature matrix $X'$, and $\Theta$ is a learnable linear transformation. $\mathcal{N}(i)$ and $e_{ij}$ are the neighbourhood of node $i$ and the edge weight connecting nodes $i$ and $j$ given by the adjacency matrix $A$, respectively. Stacking $L$ graph convolutional layers then means aggregating information iteratively from 1-hop to $L$-hop neighbourhoods, thus gradually going from local to global information about the graph. Note that the aggregated message is added to the initial node embedding $x_i^{l}$. Thus, the entire information collected from each $L$-hop neighbourhood is always fully integrated into the node embedding. However, information might be distributed unequally across spatial scales in brain graphs. The gated graph convolution (GGCN) [44] addresses this problem by introducing a mechanism to selectively decide what information should be retained at each scale:

$$m_i^{(r+1)} = \square_{j \in \mathcal{N}(i)}\; e_{ij}\, \Theta_r\, x_j^{(r)} \tag{2}$$

$$x_i^{(r+1)} = \mathrm{GRU}\big(m_i^{(r+1)},\, x_i^{(r)}\big) \tag{3}$$

where $m_i$ are the aggregated messages, $\square$ is the aggregation function, $\Theta_r$ is a learnable matrix for iteration $r$, which linearly maps the node features, and GRU is the gated recurrent unit [49]. Briefly, a GRU is a recurrent neural network layer with update and reset gates that allow the network to recursively update or forget information about the input. The node embeddings are learned recursively up to $R$ iterations with a shared GRU gate, which is equivalent to stacking $R$ GCN layers.
The node embeddings are then passed through an activation function and a batch normalisation layer. Finally, the node embeddings are passed to the node pooling module. The hyperparameters of the proposed encoder are the number of iterations $R$, the hidden size $h_{GNN}$, the activation function, the aggregation function and the dropout rate $drop_{GNN}$ applied after the encoder.
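To make the gated update concrete, the following NumPy sketch runs R iterations of message passing with one shared GRU cell. It is an illustration under stated assumptions rather than the implementation used in this work: sum aggregation is used, the input width is assumed to equal the hidden size, the GRU follows a standard textbook formulation with randomly initialised weights, and the class name `GatedGraphConvSketch` is hypothetical. The activation, batch normalisation and dropout that follow the encoder are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatedGraphConvSketch:
    """R iterations of weighted message passing with one shared GRU cell."""

    def __init__(self, hidden, n_iter, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden)
        # One message matrix Theta_r per iteration.
        self.theta = rng.uniform(-scale, scale, (n_iter, hidden, hidden))
        # GRU weights for the update (0), reset (1) and candidate (2) parts.
        self.w = rng.uniform(-scale, scale, (3, hidden, hidden))  # act on message
        self.u = rng.uniform(-scale, scale, (3, hidden, hidden))  # act on state
        self.b = np.zeros((3, hidden))

    def gru(self, m, x):
        z = sigmoid(m @ self.w[0] + x @ self.u[0] + self.b[0])   # update gate
        r = sigmoid(m @ self.w[1] + x @ self.u[1] + self.b[1])   # reset gate
        n = np.tanh(m @ self.w[2] + (r * x) @ self.u[2] + self.b[2])
        return (1.0 - z) * n + z * x      # gate decides how much state is kept

    def forward(self, x, adj):
        states, messages = [x], []
        for theta_r in self.theta:
            m = adj @ (x @ theta_r)       # sum aggregation weighted by e_ij
            x = self.gru(m, x)
            messages.append(m)
            states.append(x)
        return x, states, messages

rng = np.random.default_rng(1)
x0 = rng.standard_normal((23, 16))        # 23 nodes, hidden size 16 (illustrative)
adj = np.abs(rng.standard_normal((23, 23)))
adj = (adj + adj.T) / 2                   # symmetric weights for an undirected graph
np.fill_diagonal(adj, 0.0)
layer = GatedGraphConvSketch(hidden=16, n_iter=4)
out, states, messages = layer.forward(x0, adj)
print(out.shape)  # (23, 16)
```

The update gate `z` is what distinguishes this from a plain graph convolution: when `z` is close to one, the message gathered at that scale is largely ignored and the previous embedding is retained.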

Node pooling
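After learning the node embeddings, the model learns a coarsened graph using the ASAP pooling mechanism [45]. This pooling first learns $N$ clusters, each centred at one node, also named ego-graphs. The membership of node $j$ in the ego-cluster centred at node $i$ is given by the entry $S_{ij}$ of the matrix $S$. Note that this is a soft-cluster assignment matrix; thus, each node can belong to multiple clusters with varying membership strengths. The clusters are learned as follows:

$$x_i^{m} = \max_{j \in c(i)} x_j \tag{4}$$

$$a_{ij} = \operatorname{softmax}_j \Big( \theta^{\top} \sigma \big( \Theta x_i^{m} \,\Vert\, x_j \big) \Big) \tag{5}$$

$$S_{ij} = a_{ij} \tag{6}$$

where $c(i)$ denotes the ego-cluster centred at node $i$, $a_{ij}$ is the attention score and the membership strength, and $\theta$ and $\Theta$ are a learnable vector and matrix, respectively. $\sigma$ is the LeakyReLU activation function, and $x_i^{m}$ is the master query representing the initial cluster embedding. The attention scores are also subject to a dropout probability $drop_{pool}$. The final cluster embedding is then calculated as an attention-weighted sum, which is additionally weighted by the cluster score $\phi_i$:

$$x_i^{c} = \phi_i \sum_{j \in c(i)} a_{ij}\, x_j \tag{7}$$

where the cluster score $\phi_i$ is computed by the local extremum graph convolution [45]:

$$\phi_i = \operatorname{sigmoid} \Big( \Theta_1 \tilde{x}_i + \sum_{j \in \mathcal{N}(i)} e_{ij} \big( \Theta_2 \tilde{x}_i - \Theta_3 \tilde{x}_j \big) \Big), \qquad \tilde{x}_i = \sum_{j \in c(i)} a_{ij}\, x_j \tag{8}$$

which is designed to measure the relative importance of each cluster; $\Theta_1$, $\Theta_2$ and $\Theta_3$ are learnable matrices. The cluster embeddings $x_i^{c}$ and their scores are then used to select the top $k$ scoring clusters, which will be included in the coarsened graph:

$$\bar{i} = \operatorname{Top}_k(\phi), \qquad \hat{S} = S(\bar{i}, :), \qquad X^{p} = X^{c}(\bar{i}, :), \qquad A^{p} = \hat{S} A \hat{S}^{\top} \tag{9}$$

where $\operatorname{Top}_k$ is a function that returns the indices $\bar{i}$ of the selected clusters. $\hat{S}$ and $X^{p}$ are the pruned soft-cluster assignment matrix and the pruned cluster embedding matrix, respectively, and $A^{p}$ is the adjacency matrix of the coarsened graph.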
The graph pooling module has the following hyperparameters: the size of the pooled graph $k_{pool}$, the dropout rate $drop_{pool}$ and the negative slope of the LeakyReLU activation.
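The final selection step of the pooling module can be illustrated with a short NumPy sketch. It assumes that the soft-cluster assignment matrix, the cluster embeddings and the cluster scores have already been computed, and that rows of the assignment matrix index clusters; the function name `select_top_clusters` and the toy inputs are hypothetical.

```python
import numpy as np

def select_top_clusters(S, X_c, phi, A, k):
    """Keep the k highest-scoring clusters and coarsen the graph.

    S   : (N, N) soft assignment, S[i, j] = membership of node j in cluster i
    X_c : (N, h) cluster embeddings
    phi : (N,)   cluster scores
    A   : (N, N) adjacency matrix of the input graph
    """
    idx = np.argsort(phi)[::-1][:k]   # indices of the k largest scores
    S_hat = S[idx, :]                 # pruned assignment matrix, shape (k, N)
    X_p = X_c[idx, :]                 # pruned cluster embeddings, shape (k, h)
    A_p = S_hat @ A @ S_hat.T         # coarsened adjacency matrix, shape (k, k)
    return idx, S_hat, X_p, A_p

# Toy example with four nodes; every cluster contains only its centre node.
S = np.eye(4)
X_c = np.arange(8.0).reshape(4, 2)
phi = np.array([0.1, 0.9, 0.4, 0.7])
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
idx, S_hat, X_p, A_p = select_top_clusters(S, X_c, phi, A, k=2)
print(idx)  # [1 3]
```

With overlapping soft clusters, two selected clusters become connected in the coarsened graph whenever their member nodes are connected in the input graph, weighted by the membership strengths.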

Multilayer perceptron classifier
The cluster embedding matrix $X^{p}$ of the coarsened graph returned by the node pooling module is flattened and fed to a multilayer perceptron (MLP) classifier. Specifically, an $L_{MLP}$-layer MLP with hidden size $h_{MLP}$ is utilised, with a block of batch normalisation, activation function, and dropout layers placed between the fully connected layers. The final layer outputs a two-dimensional vector of log probabilities for each class.
The classifier has the following hyperparameters: the number of layers $L_{MLP}$, hidden size $h_{MLP}$, activation function and dropout rate $drop_{MLP}$.

C. Model implementation and evaluation
The proposed AGGCN model was implemented using PyTorch 1.10 [50] and PyTorch Geometric 2.0.2 [51], and trained on a laptop with an Intel i7 CPU, 16 GB RAM and an NVIDIA RTX 2070 GPU.
The model is trained by minimising the cross-entropy loss on the dataset collected during the eyes-closed condition. The model performance is evaluated using repeated (30 times) 10-fold stratified group cross-validation (one group = subject identifier). Since all participants have multiple samples, keeping all the samples from the same participant within the same fold is crucial to prevent information leakage. In order to prevent overfitting, another fold is utilised for validation to implement early stopping and to optimise hyperparameters. Thus, in each iteration of the cross-validation, one fold is used for validation, one fold for testing, and the remaining eight folds form the training set.
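The fold construction described above can be sketched as follows. This is a simplified illustration: subjects (not samples) are assigned to folds so that no participant is split across folds, classes are balanced across folds, and the fold following the test fold is used for validation. The last choice, as well as the function names and subject identifiers, are assumptions made for the example.

```python
import numpy as np

def stratified_group_folds(subject_labels, n_folds=10, seed=0):
    """Assign each subject to one fold, balancing the classes across folds.

    subject_labels maps subject identifier -> class label.
    """
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(set(subject_labels.values())):
        subjects = [s for s, y in subject_labels.items() if y == label]
        order = rng.permutation(len(subjects))      # shuffle within each class
        for pos, i in enumerate(order):
            folds[(offset + pos) % n_folds].append(subjects[i])
        offset += len(subjects)
    return folds

def train_val_test_splits(folds):
    """Yield (train, val, test) subject lists, one triple per test fold."""
    n = len(folds)
    for test_id in range(n):
        val_id = (test_id + 1) % n   # assumption: the next fold is the validation fold
        train = [s for i, fold in enumerate(folds)
                 if i not in (test_id, val_id) for s in fold]
        yield train, folds[val_id], folds[test_id]

# 20 HC and 20 AD subjects with hypothetical identifiers.
labels = {f"HC{i:02d}": 0 for i in range(20)}
labels.update({f"AD{i:02d}": 1 for i in range(20)})
folds = stratified_group_folds(labels)
splits = list(train_val_test_splits(folds))
print(len(splits), [len(part) for part in splits[0]])  # 10 [32, 4, 4]
```

All windows of a subject then inherit the split of that subject, which is what prevents samples from the same participant from appearing in both training and test sets.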
A stochastic gradient descent (SGD) optimiser and an exponential learning rate scheduler are used to train the model with a batch size of 128 for 200 epochs. If the validation loss does not decrease for 15 epochs, the training is stopped early. Additionally, zero-mean Gaussian noise with standard deviation $\sigma$ is added to the input during training with probability $p_{noise}$ to improve the generalisability of the model. Finally, the best model was identified using the average cross-validated F1 score measured on the validation folds. The selected model was then retrained and tested on the dataset obtained during the eyes-open condition and on the combined dataset from both conditions. The final results are then reported using the test folds only. The stability of the performance is assessed by computing the standard deviation of the samples collected over the 30-times repeated cross-validation.
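The two regularisation devices mentioned above, noise augmentation and early stopping, can be sketched as follows. The sketch assumes that the noise decision is made independently for each sample in a batch, which the text does not specify; the values of `sigma` and `p_noise` are arbitrary placeholders, and the helper names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(batch, sigma, p_noise, rng):
    """With probability p_noise per sample, add zero-mean Gaussian noise."""
    apply = rng.random(batch.shape[0]) < p_noise
    noise = rng.normal(0.0, sigma, size=batch.shape)
    # Broadcast the per-sample decision over the remaining dimensions.
    apply = apply.reshape((-1,) + (1,) * (batch.ndim - 1))
    return batch + noise * apply

class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

rng = np.random.default_rng(0)
batch = np.zeros((128, 23, 45))           # batch of PSD node feature matrices
noisy = add_gaussian_noise(batch, sigma=0.1, p_noise=0.5, rng=rng)

stopper = EarlyStopping(patience=15)
val_losses = [1.0, 0.9] + [0.95] * 20     # loss stops improving after epoch 1
stopped_at = next(epoch for epoch, loss in enumerate(val_losses)
                  if stopper.step(loss))
print(stopped_at)  # 16
```

In the toy loss curve, the best loss is reached at epoch 1, so training stops 15 epochs later at epoch 16.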
Note that the hyperparameters of the proposed model are optimised using Bayesian optimisation. Ten warm-up random iterations were used to initialise the optimisation, followed by 200 optimisation iterations. The optimisation is evaluated only on the validation sets to prevent overfitting. Moreover, we carry out parameter-sensitivity experiments to verify the influence of a few key hyperparameters of the proposed model architecture. Specifically, these are the number of iterations of the GGCN encoder, the size of the pooled graphs, the sparsity of the learned graph and the choice of aggregation function of the GGCN encoder. Due to the computational cost of running these experiments, we reduce the number of repeats of the cross-validation from 30 to 5. The hyperparameters of the model are reported in our supplementary materials.

IV. RESULTS AND DISCUSSION
In this section, we report the experimental results of our AGGCN model. As shown in Table I, our AGGCN achieves robust performance across all the conditions. Note that the best performance was achieved in the eyes-closed (EC) condition. This is likely because, with eyes closed, the ocular artefacts are minimised; thus, the underlying dynamics are easier to detect. The performance remains high even in the eyes-open (EO) condition, suggesting that the proposed model can detect underlying patterns in both EC and EO conditions. However, the performance decreases significantly on the EC+EO combined dataset. We hypothesise that the patterns learned under the EC and EO conditions share relatively little information; thus, the EC+EO model performs significantly worse. We explore this further in Section IV-C.
In addition, the hyperparameter values of the optimised model are reported in Table III in the Supplementary Materials.

A. Comparison with the baselines
The proposed model was compared to seven baseline models proposed in the literature across the three conditions. The first baseline is the best-performing model from our previous work [8]. It is a GNN with two spatial graph convolutional layers, maximum readout and a brain graph defined using the amplitude-envelope correlation (AEC-GNN). The second baseline model is the spatiotemporal GNN (STGCN) that uses temporal convolutions and ChebConv layers and defines the brain graphs using wavelet coherence [11]. Then, two CNN-based models, PSD-CNN [24] and Wavelet-CNN [27], trained on PSD and wavelet transform, respectively, were used. Next, two traditional machine learning approaches were utilised: a support vector machine trained on node degree computed from a phase lag index graph (NS-SVM) [12], and a logistic regression trained on vectorised adjacency matrices obtained from coherence graphs across seven frequency bands (AM-SVM) [52]. Finally, we use an MLP model where the input is a flattened PSD node feature matrix [8] without using graph-domain information.
Table II shows the F1 scores of the various models across the different conditions. Note that all seven models were evaluated under the same setting (e.g. the same 1-second EEG window samples). We can observe that our proposed AGGCN outperforms the baselines across all conditions. Moreover, STGCN was originally evaluated using a cross-validation setup that mixed samples from the same subject [11]. It is therefore expected that its performance drops significantly when evaluated using stratified group cross-validation in our experiments.

B. Model ablation study
We perform ablation experiments to determine the contribution of each module of the proposed model. The following seven ablated variants of the proposed model were tested in our experiments.
• A: no node pooling;
• B: graph learning replaced with a fully connected graph;
• C: GGCN replaced with an $R$th-order ChebConv ($R = 4$);
• D: variants A and B combined;
• E: variants A and C combined;
• F: variants B and C combined;
• G: variants A, B and C combined.
The ablation results in Fig. 2 reveal that each of the proposed modules contributes significantly to the high performance of the proposed architecture. For variant A, we can observe that the contribution of the node pooling module is significant, albeit relatively small. However, this module reduces the number of parameters of the model and helps to produce explainable predictions (Fig. 7 and Fig. 8). Without the node pooling, the final MLP classifier would have $N \times h_{GNN} \times h_{MLP}$ parameters ($N = 23$), but node pooling reduces it to $k_{pool} \times h_{GNN} \times h_{MLP}$ ($k_{pool} = 3$). For variant B, it is not surprising that its performance decreases significantly as the graph learning module is replaced with a fully connected graph. Thus, it cannot leverage graph-domain information except in the node pooling module.
Next, we demonstrate that the GGCN encoder improves performance significantly compared to a ChebConv encoder according to variant C. A ChebConv layer is similar to a GGCN in its iterative nature, i.e. ChebConv iteratively updates node embeddings by approximating the eigendecomposition of the graph Laplacian. However, ChebConv does not have any gating mechanism, which means that information across scales contributes to the final embedding equally. Since all of the major modules of the proposed model are shown to contribute to the final performance significantly, it is unsurprising that the rest of the ablated models with more than one of these modules removed perform significantly worse as well (variants D-G in Fig. 2). Note that some of the ablated models maintain a relatively low variance of performance. We speculate this is because the ablated models can still learn robust embeddings, but some of the information within the data, which the removed module would make accessible, remains inaccessible.
The parameter sensitivity experiments also support the optimal values of crucial hyperparameters of the proposed model (Supplementary Materials, Figs. 10-13). It is worth noting that the proposed architecture allows training relatively deep models (using up to eleven GGCN iterations) with only a minor performance decrease (Fig. 10). We can also observe that although the optimal values of these hyperparameters result in the best performance, the performance does not change much for values adjacent to the optima. This demonstrates that although the proposed model requires a relatively large number of hyperparameters to be determined, its performance remains robust with sub-optimal values, thus suggesting generalisability potential.

C. Explainability of AGGCN
The proposed model generates plausible and consistent explanations for its predictions. We generate multiple types of prediction explanations. Specifically, these are derived from the following: (1) graph learning, (2) node embedding and GGCN encoder, (3) node pooling, and (4) feature masking. Except for type (4), these explanations could be obtained for individual samples. However, we visualise the diagnosis-averaged explanations to explore the patterns learned by the proposed model.

Graph learning
The graph learning module learns a clear difference between the AD and HC cases, as shown in Fig. 3 (alternatively Fig. 14). The learned brain graphs show that AD cases have increased connectivity overall, while HC graphs seem more sparse with few densely connected regions. A well-defined cluster of densely connected nodes is present in both groups within the centro-parietal and occipital regions, along with a few strong edges in the frontal and temporal regions. The locations of the strongest edges are consistent across conditions. Fig. 4 then shows the top 30 edges, where the largest increase/decrease in coupling was observed in AD. AD seems to have increased coupling strength in long-distance edges, particularly between frontal and parietal/occipital regions. These increases are quite consistent between conditions. In contrast, AD cases have decreased coupling strength, mostly in local connections in the frontal (EC) and frontal and centro-parietal (EO) regions.
Additionally, we statistically compared the learned graph structures to determine differences between AD and HC cases across EC and EO conditions. The results of this analysis are reported in the supplementary materials (Fig. 15).

Node embeddings and GGCN
Another prediction explanation can be derived from the node embeddings obtained by the GGCN (Fig. 5). In particular, we visualise the node embeddings obtained after four iterations of GGCN and compress them to a 1D representation using principal component analysis (PCA) by extracting the first principal component. PCA is fitted for each condition separately. The node embeddings do not express a change in activity but rather a node similarity. Generally, the node embedding explanations show two large regions of similar embeddings. In EC, these are the frontotemporal and centro-parietal regions for HC, and the right frontotemporal region and the rest of the regions for AD. The HC similarity region in the EO condition is reduced from the frontotemporal to only the frontal region. In contrast, the AD similarity region expands from the right frontotemporal region to the left side. This further highlights the differences in learned patterns under the EC and EO conditions, thus explaining the reduced performance in the combined EC+EO condition.
Next, the role of the gating mechanism is elucidated by analysing the amount of information gathered at each scale, i.e. iteration of GGCN (Fig. 6). We measure this by computing the average Euclidean distance between the initial and updated node embedding at each iteration, i.e. $x_i^{(r)}$ and $m_i^{(r+1)}$ in Eq. 3. For instance, a small distance means a small amount of information was gathered at that scale. Local information contributes highly to the node embeddings of the AD cases, and then the degree of contribution linearly decreases with increasing graph scale. The opposite pattern is observed for HC cases, where the later iterations influence the node embeddings. This highlights the degradation of global and distributed information caused by AD since the model can efficiently learn with fewer iterations, i.e. most information is obtained from the first three iterations.
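The distance measure used in this analysis can be written compactly. The following NumPy sketch assumes that the node embeddings and aggregated messages of every iteration have been stored as lists of (nodes, hidden) arrays; the function name `information_per_iteration` and the toy inputs are hypothetical.

```python
import numpy as np

def information_per_iteration(states, messages):
    """Mean Euclidean distance between x^(r) and m^(r+1) for each iteration.

    states[r] and messages[r] are (nodes, hidden) arrays holding the node
    embeddings before iteration r and the messages aggregated during it.
    """
    return [float(np.mean(np.linalg.norm(m - x, axis=1)))
            for x, m in zip(states, messages)]

# Toy example: three nodes, hidden size two, two iterations.
states = [np.zeros((3, 2)), np.ones((3, 2))]
messages = [np.tile([3.0, 4.0], (3, 1)), np.ones((3, 2))]
print(information_per_iteration(states, messages))  # [5.0, 0.0]
```

In the toy example, the first iteration moves every node by a distance of 5, whereas the second gathers nothing new; averaging such profiles per diagnosis group gives curves of the kind discussed above.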

Node pooling module
The node pooling mechanism can be exploited to derive two explanations. First, we analyse the frequency with which each node is included in the coarsened graph, i.e. pooling frequency (Fig. 7). Second, cluster attention scores (i.e. $a_{ij}$ in Eq. 5) can be used to identify important hubs that are highly represented in the clusters learned by the node pooling module (Fig. 8).
The nodes in parieto-occipital regions are consistently selected with high pooling frequency for AD and HC cases across both EC and EO conditions (Fig. 7). Additionally, in the EC condition, HC cases frequently select frontal nodes while AD cases tend to select central nodes. In contrast, in the EO condition, there seems to be more variation in the pooling frequency, with temporal nodes having a high pooling frequency for AD and HC cases.
Note that the nodes of the pooled graphs are, in fact, cluster embeddings, i.e. attention-weighted sums of node embeddings (Eq. 7). We visualise the nodes with the highest attention scores of each cluster to highlight important hubs (Fig. 8). The attention scores are directed edges from a source node, transferring information to the cluster centred at the target node. Alternatively, these scores can be interpreted as a cluster membership strength. This information transfer should be interpreted as information flow within the model and most likely does not reflect an information flow within the brain.
In EC, AD cases show a large hub at the P4PZ node with strong long-distance and short-distance links to various nodes. Additionally, there is a smaller hub at the T5O1 node. Similarly, in EO, AD cases have a large hub at the T3C3 node and a smaller one at the T4T6 node. In contrast, HC cases do not have any apparent hubs in the EC condition, with only a small hub at the P4PZ node. The attention links also seem to be rather short-distance. In the EO condition, HC cases show a large hub at the T6O2 node and smaller hubs at the P4PZ and T4T6 nodes. This variance between EC and EO conditions displayed in the pooling frequency and attention scores suggests a plausible answer to why it is challenging for the model to learn a joint representation in the EC+EO combined condition. We speculate this is caused by the additional dynamics introduced by the visual processing during the EO condition.

Feature masking
We utilise feature masking to elucidate the importance of the frequency components summarised at each node by the node feature vector, i.e. PSD. In this analysis, values at a selected part of the node feature vectors are replaced by zeros, and the resulting change in performance is measured. In both EC and EO conditions, the frequencies between 6 and 10 Hz are the most important since their masking reduced performance by 4.82% and 9.18%, respectively. This fits well with the well-described increase of power as well as functional connectivity in AD within these frequencies corresponding to the θ and low α bands [7,9]. Similarly, masking of the [1, 5] Hz, [36, 40] Hz and [41, 45] Hz frequency ranges results in a significant performance decrease in both EC and EO. Additionally, in the EO condition, masking the [11, 15] Hz, [16, 20] Hz and [26, 30] Hz frequency ranges produces a significant performance decrease.
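The masking operation itself is simple and can be sketched in NumPy as follows, assuming PSD features with 1 Hz bins starting at 1 Hz as described in the Methods. The function name `mask_frequency_range` is hypothetical, and the re-evaluation of the trained model on the masked input is not shown.

```python
import numpy as np

def mask_frequency_range(psd, f_low, f_high, f_start=1):
    """Return a copy of the PSD features with bins f_low..f_high Hz (inclusive) set to zero.

    psd has shape (..., n_bins); bin 0 corresponds to f_start Hz and
    consecutive bins are 1 Hz apart.
    """
    masked = psd.copy()
    masked[..., f_low - f_start:f_high - f_start + 1] = 0.0
    return masked

psd = np.ones((23, 45))                 # placeholder node feature matrix, 1-45 Hz
masked = mask_frequency_range(psd, 6, 10)
print(int(masked.sum()))  # 920
```

Five of the 45 bins are zeroed for each of the 23 nodes, leaving 23 × 40 = 920 non-zero entries; the importance of a range is then the difference between the performance on the original and on the masked input.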

D. Limitations and future work
Although our approach achieves competitive performance, we identify a few drawbacks.First, the relatively small size of our dataset imposes a limit on fitting complex models.We address this issue by segmenting the EEG signals into short windows.The short window length means that the model might not be able to represent information from low-frequency components of the signal.
Next, we do not explore alternative node feature representations beyond PSD in this study.PSD is merely a linear frequency-domain representation of the signal.Including time-domain and nonlinear information in the node features might improve the expressiveness of the model.Similarly, the proposed graph learning mechanism is limited to linear coupling patterns because (1) it is inferred from the node features and (2) it is expressed as Pearson's correlation coefficient.Future work should explore other forms of FC that might be integrated into the graph learning mechanism and study ways to include more complex frequency-dependent coupling information.
Finally, the model architecture might be limited by the relatively large number of hyper-parameters that need to be optimised.However, this limitation should be mitigated by utilising a validation set during the optimisation.Moreover, we explore the model stability with respect to some of the important hyperparameters in the parameter sensitivity experiments.These suggest that the achieved performance of the proposed model is not limited purely to the optimal values of the hyperparameters.

V. CONCLUSION
This work proposes a novel graph learning model that performs highly in the AD diagnosis task.Additionally, we show that the model produces robust and clinically relevant explanations for its predictions via the novel graph structure learning module and the node pooling mechanism.Finally, we highlight the importance of utilising the gating mechanism within a message-passing encoder.This allows the model to accurately represent the multiscale distributed network disruptions displayed in the AD cases.

FIG. 1 :
FIG. 1: The architecture of the proposed adaptive gated graph convolutional network.A) The proposed model consists of a graph learning module, GGCN encoder, ASAP node pooling module, and a three-layer multilayer perception (MLP) outputting the predicted probabilities.B) Graph learning module takes a N × Din node feature matrix as input.Node features are defined as power spectral density from 1 to 45 Hz (Din = 45) computed for all N EEG electrodes (N = 23).Then, a 1D CNN enhances them.The brain graph structure is then constructed as a correlation graph between the outputs from the 1D CNN and made sparse by a k-nearest-neighbour edge selection (Corr-KNN).C) The enhanced node features and the learned graph structure are then passed to a gated graph convolutional neural network (GGCN) encoder.GGCN applies message passing and gated recurrent unit (GRU) recursively over R iterations.

FIG. 2 :
FIG. 2: F1 scores of model variants.The asterisks report the p-value of a nonparametric Mann-Whitney U test measuring the difference between AGGCN and the ablated variants.

FIG. 3 :
FIG. 3: Top 30 strongest edges of the AGGCN-learned graphs of AD and HC cases in EC and EO conditions (average of all samples).

FIG. 4 :
FIG. 4: The differences between AGGCN-learned graphs for AD and HC cases in EC and EO conditions show the AD-related connectivity disruption.The average of all samples, the top 30 strongest edges were preserved.Values above zero indicate AD increase, while values below zero indicate AD decrease.

FIG. 6 :FIG. 7 :
FIG. 6: The average distance between initial node embedding and updated node embeddings shows the amount of information retained in each iteration of GGCN, i.e. going from local to global information.The asterisks denote the p-value of non-parametric Mann-Whitney U tests comparing the average distance between AD and HC cases and EC and EO conditions.

FIG. 8 :
FIG.8: Attention scores learned by the node pooling module (aij in Eq. 5), indicating the amount of information transferred from the source node into a cluster centred at the target node.Averaged for all AD and HC cases across EC and EO conditions (single strongest edge preserved for each target cluster node).
roes and the model is retrained on this modified dataset.The relative reduction in f1 scores was then measured and visualised in Fig 9 for EC and EO conditions.

FIG. 9: Relative change in F1 score when parts of the node features are masked, showing the importance of frequency components for the classification task in the eyes-closed and eyes-open conditions. The asterisks denote the p-value of non-parametric Mann-Whitney U tests of whether the relative change is significantly different from 0.
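The masking step behind this figure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: the helper names `mask_band` and `relative_change_percent` are hypothetical, the relative change is assumed to be expressed as a percentage of the unmasked F1 score, the feature values are placeholders, and the retraining of the model on the masked data is omitted.

```python
def mask_band(features, freqs_hz, low_hz, high_hz):
    """Return a copy of the node features with one frequency band zeroed.

    features: list of node feature vectors (one PSD value per frequency bin).
    freqs_hz: frequency in Hz of each bin, same length as each feature vector.
    Bins with low_hz <= frequency <= high_hz are replaced by zeroes.
    """
    return [
        [0.0 if low_hz <= f <= high_hz else v for v, f in zip(row, freqs_hz)]
        for row in features
    ]


def relative_change_percent(f1_masked, f1_baseline):
    """Relative change in F1 score, in percent of the baseline (assumed form)."""
    return 100.0 * (f1_masked - f1_baseline) / f1_baseline


# Frequency bins from 1 to 45 Hz, matching the node features (D_in = 45).
freqs = list(range(1, 46))

# Placeholder PSD values for N = 23 nodes; real features would come from EEG.
psd = [[1.0] * len(freqs) for _ in range(23)]

# Mask the 6-10 Hz band; the model would then be retrained on `masked`.
masked = mask_band(psd, freqs, 6, 10)
```

Because `mask_band` returns a copy, the unmasked features remain available for computing the baseline score against which the masked score is compared.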

FIG. 10: Sensitivity of the proposed model to the number of iterations of the GGCN encoder. The error bars show the standard deviation of accuracies measured across 10 repetitions. The optimal value is shown in blue.

FIG. 16: Average adjacency matrix of attention scores obtained by the node pooling module for AD and HC cases across EC and EO conditions.

TABLE II: The F1 score and the number of trainable parameters of the baseline models and the proposed method across conditions. The best-performing model is highlighted in bold.