Multi-Modal Learning for Predicting the Genotype of Glioma

The isocitrate dehydrogenase (IDH) gene mutation is an essential biomarker for the diagnosis and prognosis of glioma. It is promising to better predict glioma genotype by integrating focal tumor image and geometric features with brain network features derived from MRI. Convolutional neural networks show reasonable performance in predicting IDH mutation, which, however, cannot learn from non-Euclidean data, e.g., geometric and network data. In this study, we propose a multi-modal learning framework using three separate encoders to extract features of focal tumor image, tumor geometrics and global brain networks. To mitigate the limited availability of diffusion MRI, we develop a self-supervised approach to generate brain networks from anatomical multi-sequence MRI. Moreover, to extract tumor-related features from the brain network, we design a hierarchical attention module for the brain network encoder. Further, we design a bi-level multi-modal contrastive loss to align the multi-modal features and tackle the domain gap at the focal tumor and global brain. Finally, we propose a weighted population graph to integrate the multi-modal features for genotype prediction. Experimental results on the testing set show that the proposed model outperforms the baseline deep learning models. The ablation experiments validate the performance of different components of the framework. The visualized interpretation corresponds to clinical knowledge with further validation. In conclusion, the proposed learning framework provides a novel approach for predicting the genotype of glioma.


INTRODUCTION
Glioma is the most common malignant brain tumor in adults, with remarkable heterogeneity and diverse survival outcomes [16,15,17]. The mutation of the isocitrate dehydrogenase (IDH) gene is one of the most significant molecular markers for the diagnosis and prognosis of glioma [21]. The current gold standard of IDH mutation detection, i.e., immunohistochemistry and targeted gene sequencing, is invasive and time-consuming, hindering timely clinical decision making [21].
An increasing number of studies have shown that MRI can predict the IDH mutation. Compared to radiomics approaches, deep learning has achieved better performance [18]. However, most deep learning models are based on convolutional neural networks (CNN), which cannot leverage the information offered by non-Euclidean data modalities. Recent studies show that the geometric data describing tumor shape provide robust tumor phenotyping across multiple tissue histologies and imaging modalities. In addition, glioma tends to invade the whole brain beyond the focal tumor. Characterizing the global brain using the network approach has shown significance in predicting survival and cognitive decline in brain tumor patients [19,35]. Hence, integrating multi-modal data, including tumor image, tumor geometrics, and global brain network, could enhance glioma genotype prediction.
Multi-modal learning shows excellent performance in integrating multi-modal data and minimizing the domain gap between modalities. For example, cross-modal attention has been shown to align fine-grained features between different modalities [14]. Additionally, cross-modal contrastive loss shows promising performance in extracting global representations from images and the corresponding texts [39]. Nonetheless, existing approaches are designed for data modalities with one-to-one correspondence, which may not suit data modalities with an inclusion relation, e.g., tumor images with localized information of the focal tumor and brain networks containing information from both the focal tumor and the global brain.
This study develops a learning framework that generates multi-modal data and then extracts and integrates multi-modal features to boost glioma genotype prediction. Specifically, apart from the image and geometric data produced from the tumor segmentation masks, we design a self-supervised approach to generate brain networks from anatomical MRIs. Then, we design three separate encoders for multi-modal feature extraction to characterize glioma from different aspects. At the same time, a hierarchical attention module is specially designed for the brain network encoder to assist the feature extraction. After that, a bi-level multi-modal contrastive loss is designed to tackle the inclusion-relation domain gap between the focal tumor and the global brain. Finally, we construct a weighted population graph that models the patient cohort as a large graph based on multi-modal features. A GNN is trained to classify nodes on the population graph to predict the IDH mutation of patients. Our contributions include:
- Structural brain networks are conventionally constructed from diffusion MRI. To mitigate the limited availability of diffusion MRI, we propose a self-supervised approach to reconstruct the edge attributes of the brain network from anatomical MRI through contrastive representation learning, which could help transfer the knowledge of diffusion MRI to anatomical MRI.
- We design a hierarchical attention module that sequentially attends to the edges and nodes of the brain network for identifying the brain network features associated with the focal tumor. This approach could allow the brain network encoder to extract the most relevant brain network features and reduce the confounding effect from concomitant pathology.
- We present a bi-level contrastive loss for multi-modal data, which aligns the tumor-level features from the focal tumor image and geometric points cloud and then aligns the tumor-level features with the brain-level network features. This approach could reflect the gradient tumor infiltration and tackle the domain gap across the focal tumor and the global brain.
- We construct a population graph for modelling the patient cohort with multi-modal data. The weighted nodes represent the multi-modal features of individual patients, while the weighted edges represent the continuous similarity between patients. This approach could help better integrate multi-modal features and characterize the patient cohort.

RELATED WORK

Genotype prediction
Studies predicting glioma genotypes fall into radiomics-based machine learning methods and deep learning methods. The radiomics-based machine learning approaches first extract hand-crafted features from the tumor core. Feature selection is performed before training models for predicting the genotype [4]. For example, Gihr et al. used intensity-based radiomics features to predict IDH mutation with reasonable accuracy [10]. However, the reproducibility and generalizability of radiomics are often limited by the non-standard feature engineering and selection procedure.
The end-to-end deep learning models, e.g., ResNet and DenseNet, provide a more robust prediction of tumor genotype than radiomics approaches [1,18]. Liang et al. used a 3D-DenseNet to predict the IDH mutation, establishing the feasibility of CNNs for predicting glioma genotype [18]. Other deep learning models incorporate radiomics features into the model. Choi et al. integrated radiomics features into the later layers of a CNN to enhance prediction [7], which outperforms the conventional ResNet. Despite achieving reasonable performance, the CNN-based models may not learn the information encompassed in non-Euclidean data, e.g., geometric points cloud and brain networks, which provide crucial tumor biology and neuroscience information. Hence, we propose specialized encoders to obtain features from multi-modal data.

Structural brain networks in glioma
A structural brain network is a graph representation of the complex connectivity among brain regions [5], where the nodes represent the brain regions, defined according to neuroanatomy, and the edges represent the white matter connections among the regions. To generate structural brain networks, most studies use approaches based on diffusion MRI, which promises to indicate subtle tumor invasion [36,35,8]. However, robust model training is significantly limited by the availability of diffusion MRI data. Recent studies indicate that the scalar map of diffusion MRI can be successfully generated from a single anatomical T1 sequence [11], which suggests a high-level correlation between anatomical MRI and diffusion MRI and indicates the potential of constructing brain networks using anatomical MRI. However, a single T1 sequence is insufficient to characterize the heterogeneous structural alteration caused by glioma invasion. Therefore, we propose reconstructing edge attributes by transferring the knowledge of diffusion MRI to multi-sequence MRI using a contrastive loss. Studies of diffusion-based brain networks generally only include edge attributes. To characterize the brain regions invaded by glioma, we further develop an autoencoder approach to reconstruct node attributes based on regional multi-sequence MRI.

Multi-modal learning
Multi-modal learning is the deep learning approach that learns from more than one data modality, e.g., images, text, and points cloud. Multi-modal learning has shown promising performance in a series of learning schemes. Lee et al. proposed a stacked cross attention to discover the full latent alignments between image regions and words in a sentence. Through inferring image-text similarity, the model produced interpretable prediction results [14]. Zhang et al. employed a contrastive loss between the lung X-ray and corresponding medical reports to extract relevant representations from both images and text [39]. Nevertheless, existing methods are not designed for data modalities with an inclusion relation, e.g., focal tumor and global brain. Therefore, we propose a bi-level contrastive loss to align the features from the focal tumor and global brain levels.

Graph neural networks
The fast-developing graph neural network (GNN) family promises to extract features and learn from geometric data, e.g., points cloud, which can be readily reconstructed from MRI [26]. For example, Qi et al. proposed hierarchically generating a graph of the points cloud and recursively training a GNN, which effectively learned local features from the geometric points cloud of the objects.
Further, brain networks are naturally learnable by the GNN due to their graph format. Based on brain networks, GNNs have shown high performance in classifying diseases. Ma et al. proposed a combination of a recurrent neural network and a GNN with an attention-guided random walk module to extract longitudinal structural graph features from the brain network for patient classification [22]. The results showed that the attention mechanism could reveal the most critical brain regions and temporal domain during AD progression. Nonetheless, the attention mechanism designed for other diseases may not suit glioma due to the distinct pathophysiology. We thus develop a hierarchical attention module that could attend to the brain structure to reduce the confounding effect from concomitant pathology and capture tumor-specific features.
Finally, GNNs also show high performance in classifying the nodes in a large graph such as a citation network [6]. The capability of GNNs in handling large graphs could be transferred to patient classification tasks. Parisot et al. proposed a population graph to model a dementia cohort by regarding the imaging features of individual patients as nodes and the clinical similarity between patients as edges [24]. A GNN is trained to classify patients, outperforming traditional machine learning models, e.g., random forest. This study develops a population graph to integrate the multi-modal features. Additionally, we permute the edge and node weights to select the best combination in constructing the population graph.

Differences from conference papers
This study is the extension of our two previous papers in four aspects [37,34]. Firstly, for brain network reconstruction, we propose a contrastive learning approach to replace the original autoencoder for the brain network edge reconstruction, which additionally incorporates the knowledge from diffusion MRI. Secondly, we combine brain networks with focal tumor data (images and geometrics) to comprehensively characterize glioma. Thirdly, we design an attention module and a bi-level multi-modal contrastive loss to extract the most relevant features from the multi-modal data. Finally, we construct a population graph for feature integration and patient classification.

Study overview
As shown in Fig. 1, our glioma genotype prediction network has three stages: (1) generating multi-modal data of tumor image, tumor geometrics and brain networks from the multi-sequence MRI; (2) multi-modal contrastive learning, which extracts features from the focal tumor image, tumor geometrics and global brain networks; (3) feature integration to construct a population graph for patient classification and genotype prediction.

Multi-modal data generation
Our method starts by generating three data modalities from the input multi-sequence MRI (see Fig. 1A): (1) the image data of the focal tumor (denoted as $x^I$), obtained by masking the MRI with the Boolean tumor masks; (2) the tumor geometric data (denoted as $x^P$), in the form of a points cloud, generated by sampling the surface meshes of the tumor masks using a standard farthest point sampling strategy; and (3) the brain networks (denoted as $x^B$), generated by two self-supervised neural networks (NNs) detailed below.
The MRI voxels within the brain regions defined by the node atlas [32] are extracted and fed into an NN-based autoencoder (AE) to produce the node attributes $v^N$ of the brain networks. The AE consists of an NN encoder that extracts high-level representation vectors from the voxels in the brain region and an NN decoder that attempts to restore the voxels from the representation vectors. By adopting this self-supervised model, representations of the voxels in the brain regions can be extracted as node attributes.
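The farthest point sampling step used above to build the tumor points cloud can be sketched as follows. This is a minimal pure-Python sketch, not the authors' implementation: the function name `farthest_point_sampling`, the random starting point and the `seed` argument are assumptions made for illustration.

```python
import math
import random


def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy farthest point sampling over a list of (x, y, z) tuples.

    Returns the indices of the selected points. The starting point is
    chosen at random (an assumption of this sketch).
    """
    if not 0 < n_samples <= len(points):
        raise ValueError("n_samples must be in 1..len(points)")
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < n_samples:
        # Pick the point that is farthest from the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected
```

Because each new point maximizes the distance to the points already chosen, the sampled subset tends to cover the tumor surface evenly rather than clustering in densely meshed areas.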
We use the probabilistic tractography atlas as the regions of interest for reconstructing the edge attributes of brain networks [36]; the atlas indicates the 2,309 pathways of white matter tracts connecting the 90 brain regions. Due to the clinical significance of the fractional anisotropy (FA) map derived from the diffusion MRI in characterizing brain connectivity, we utilize the FA map to guide the extraction of edge attributes from anatomical MRI. Firstly, voxels of the anatomical MRI and of the corresponding FA map enclosed by the tractography atlas are input into two multilayer perceptrons (MLPs), which respectively extract the attribute vectors $v^E$ and $\tilde{v}^E$ from the voxels. Next, two projection heads $g^E$ and $\tilde{g}^E$ project the attribute vectors to a common latent space, where domain alignment is performed between the latent attributes of the anatomical MRI ($z^E_n$) and of the FA map ($\tilde{z}^E_n$) using a contrastive loss. The edge attributes extracted from the anatomical MRI thus contain the corresponding information in the FA map.

Fig. 2. Brain network generation. Two self-supervised models are trained to extract node/edge attributes ($v^N$, $v^E$) from the MRI voxels bounded by the node/edge atlas: node attributes are extracted by the autoencoder, while edge attributes are reconstructed through contrastive learning between the anatomical MRI and the FA map of dMRI using projection heads ($g^E$, $\tilde{g}^E$), projected latent features ($z^E$, $\tilde{z}^E$) and a contrastive loss.

The contrastive loss of the edge $L_{edge}$ is defined as:

$$L_{edge} = -\log \frac{\exp\left(S(z^E_n, \tilde{z}^E_n)/\tau\right)}{\sum_{m=1}^{M} \exp\left(S(z^E_n, \tilde{z}^E_m)/\tau\right)} \tag{1}$$

where $n$ is the target tract index, while $m$ is the index of other tracts in the mini-batch; $S(\cdot)$ is the similarity score; $\tau$ is the temperature parameter; $M$ is the size of the mini-batch. Finally, the node and edge attributes reconstructed from the pre-trained models are reformatted into the brain network data $x^B = \{v^E, v^N\}$.
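The edge contrastive loss can be illustrated with a small InfoNCE-style computation. This is a hedged sketch under stated assumptions: cosine similarity is assumed for $S(\cdot)$, the default `tau=0.1` is a placeholder (the source does not give the temperature of the edge loss), the loss is averaged over the mini-batch, and the names `cosine` and `info_nce` are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(anchors, targets, tau=0.1):
    """Mean contrastive loss; anchors[n] and targets[n] are the positive pair.

    anchors: latent attributes from anatomical MRI, one vector per tract.
    targets: latent attributes from the FA map, in the same tract order.
    """
    losses = []
    for n, z in enumerate(anchors):
        logits = [cosine(z, t) / tau for t in targets]
        # log-sum-exp with max subtraction for numerical stability
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits))
        losses.append(log_denom - logits[n])
    return sum(losses) / len(losses)
```

The loss is small when each anatomical-MRI latent vector is closest to the FA latent vector of the same tract, and grows when it is closer to the vector of another tract in the mini-batch.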

Multi-modal learning for image, geometrics and brain networks
The proposed multi-modal learning framework extracts features from the three data modalities, i.e., focal tumor image, focal tumor geometrics and global brain networks. Moreover, hierarchical attention is developed for the brain network encoder to extract tumor-related brain network features. Finally, the extracted features are projected into a shared latent space for bi-level multi-modal contrastive learning, which could minimize the domain gap at the tumor level (image and geometrics) and at the global brain level (focal tumor and brain networks). As shown in Fig. 1B, the projection is conducted via three NN-based encoders as follows.

Image encoder
The image encoder is a 3DCNN defined by $u^I_i = f^I(x^I_i)$, where $x^I_i$ and $u^I_i$ are the image data and output features for the $i$th patient, and $f^I(\cdot)$ is the 3DCNN model (see Section 4.3 for implementation details).

Geometric encoder
The geometric encoder $f^P(\cdot)$ (Fig. 3A) outputs the geometric features $u^P$ and the geometric attention $a^P$ for every point in the points cloud, defined as $u^P_i, a^P_i = f^P(x^P_i)$ for the $i$th patient.

Brain network encoder with hierarchical attention
Brain network features are extracted by training an NN with graph convolution layers, where the NN weights are corrected following a novel hierarchical attention mechanism.
The attention mechanism consists of edge-level attention and node-level attention. The former is obtained by projecting the geometric attention of the tumor boundary onto the edges (Fig. 3B). Specifically, the points cloud is projected onto the edge atlas. The edges crossing the points cloud are then assigned the boundary attention of the points they cross. The edge attention $a^E_{i,j}$ of edge $(i, j)$ (Equation 2) is computed from the attention of these points, where $K$ is the number of points in the points cloud crossed by edge $(i, j)$ and $a^P_k$ is the attention of the $k$th point crossed by edge $(i, j)$.
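As an illustration of how point-level attention can be transferred to the edges, the sketch below averages the attention of the points crossed by each edge. Mean aggregation is an assumption of this sketch (the text only specifies that the $K$ crossed points and their attentions $a^P_k$ determine the edge attention), and the function name `edge_attention` and the `point_edges` lookup are hypothetical.

```python
from collections import defaultdict


def edge_attention(point_attention, point_edges):
    """Aggregate point-level attention onto brain network edges.

    point_attention[k] is the attention of point k of the points cloud.
    point_edges[k] is the (i, j) edge whose atlas tract contains point k,
    or None if the point lies on no tract.
    ASSUMPTION: the attention of an edge is the mean over its K points.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, edge in zip(point_attention, point_edges):
        if edge is None:
            continue
        sums[edge] += a
        counts[edge] += 1
    return {edge: sums[edge] / counts[edge] for edge in sums}
```

Edges that cross no point of the tumor boundary receive no entry here; how such edges are weighted in the encoder is not specified in the text.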
The outputs of the edge-level attention are further encoded by the GATConv layers, which convolve the nodes and edges of the brain networks to obtain a node embedding defined by $e^N = f^B(x^B)$, where $f^B$ here denotes the components of the brain network encoder before the global pooling layers (Fig. 3B). Afterwards, the node embeddings $e^N$ are projected to the latent space by a projection head $g^N_a$. To extract the tumor-related node embeddings, we apply another projection head $g^T_a$ to project the concatenated tumor features $(u^I, u^P)$, composed of both image and points cloud features, into the latent space shared with the node embeddings. We measure the similarity between the node embeddings and the tumor features by:

$$a^N_i = S\left(g^N_a(e^N_i),\; g^T_a(u^I, u^P)\right) \tag{3}$$

where $a^N_i$ is the attention of the $i$th node; $g^N_a$ and $g^T_a$ are linear projection heads projecting the node embeddings and the tumor features to the same latent space; $S(\cdot)$ is the similarity function.
By performing both edge and node attention in training the brain network encoder, we extract the most tumor-related features from the brain network and reduce the noise caused by confounding effects, e.g., ageing or other concomitant pathology (Fig. 3B). The feature extraction of the brain networks is defined as $u^B_i = f^B(x^B_i)$, where $x^B_i$ and $u^B_i$ are the brain network data and brain network features for the $i$th patient, and $f^B$ represents the GNN-based brain network encoder.
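The node-level attention described above can be sketched as a similarity between projected node embeddings and projected tumor features. The sketch assumes bias-free linear projection heads given as plain weight matrices and cosine similarity as $S(\cdot)$; the names `linear`, `node_attention`, `w_node` and `w_tumor` are illustrative and not taken from the source.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def linear(weights, x):
    """Apply a bias-free linear projection given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def node_attention(node_embeddings, tumor_features, w_node, w_tumor):
    """Attention of each node = similarity to the projected tumor features.

    tumor_features is the concatenation of image and points cloud features.
    w_node and w_tumor stand in for the two linear projection heads.
    """
    t = linear(w_tumor, tumor_features)
    return [cosine(linear(w_node, e), t) for e in node_embeddings]
```

Nodes whose projected embedding points in the same direction as the projected tumor features receive attention close to 1, while unrelated nodes receive attention close to 0.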

Bi-level multi-modal contrastive loss
We develop a bi-level multi-modal contrastive loss to further characterize the gradient tumor invasion and minimize the domain gap between the focal tumor and the global brain. After extracting the multi-modal features from the different encoders, two projection heads are adopted to respectively project the tumor-level features of images and points cloud to the same latent space: $z^I_i = g^I(u^I_i)$, $z^P_i = g^P(u^P_i)$, where $z^I_i$ and $z^P_i$ are the projected latent features of images and points cloud, and $g^I$ and $g^P$ are the corresponding projection heads.
Meanwhile, another two projection heads are employed to respectively project the extracted focal tumor features and brain network features into another latent space: $z^B_i = g^B(u^B_i)$, $z^T_i = g^T(u^I_i, u^P_i)$, where $z^B_i$ and $z^T_i$ are the projected latent features of the brain networks and the focal tumor; $g^B$ and $g^T$ are the projection heads for the brain network and the focal tumor.
Subsequently, a bi-level multi-modal contrastive loss is developed. Firstly, it reduces the domain gap of tumor-level features by minimizing the cosine distance (attract) between the multi-modal latent features $(z^I_i, z^P_i)$ from the same patient $i$ and maximizing the cosine distance (repel) of the multi-modal latent feature pairs $(z^I_i, z^P_j)$, $(z^P_i, z^I_j)$ from different patients $i$ and $j$. Secondly, the brain-level domain gap is optimized using a similar approach for the features of brain networks ($z^B$) and focal tumor ($z^T$). Due to the asymmetry of the contrastive loss, we design three contrastive losses for tumor image to tumor geometrics (Equation 4), tumor geometrics to tumor image (Equation 5) and global brain network to focal tumor (Equation 6). Finally, we integrate those three sub-losses with a weighting coefficient $\lambda$.

$$L^{I \to P}_i = -\log \frac{\exp\left(S(z^I_i, z^P_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(S(z^I_i, z^P_j)/\tau\right)} \tag{4}$$

$$L^{P \to I}_i = -\log \frac{\exp\left(S(z^P_i, z^I_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(S(z^P_i, z^I_j)/\tau\right)} \tag{5}$$

$$L^{B \to T}_i = -\log \frac{\exp\left(S(z^B_i, z^T_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(S(z^B_i, z^T_j)/\tau\right)} \tag{6}$$

where $i$ is the index of the target patient, and $j$ is the index of other patients in the mini-batch; $S(\cdot)$ is the similarity score function; $\tau$ is the temperature parameter; $N$ is the size of the mini-batch. The final multi-modal contrastive loss $L_{multi}$ (Equation 7) is computed as a weighted combination of the above three losses, where $\lambda \in [0, 1]$ is a scalar weight coefficient.
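One possible way to combine the three directional losses is sketched below on precomputed similarity matrices. The exact weighting is not recoverable from the text, so the combination used here (λ on the averaged tumor-level pair, 1 − λ on the brain-level term) is an assumption, as are the function names; the defaults `lam=0.8` and `tau=0.1` mirror the values reported in the implementation details.

```python
import math


def directional_loss(sim, tau=0.1):
    """Mean over i of -log softmax_j(sim[i][j] / tau) evaluated at j = i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        top = max(logits)  # max subtraction keeps exp() stable
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits))
        total += log_denom - logits[i]
    return total / len(sim)


def transpose(sim):
    """Swap the roles of anchor and target modality."""
    return [list(col) for col in zip(*sim)]


def bi_level_loss(sim_ip, sim_bt, lam=0.8, tau=0.1):
    """Combine tumor-level and brain-level contrastive terms.

    sim_ip[i][j]: similarity of the image latent of patient i and the
                  points cloud latent of patient j.
    sim_bt[i][j]: similarity of the brain network latent of patient i and
                  the focal tumor latent of patient j.
    ASSUMPTION: lam weights the averaged tumor-level pair of losses
    against the brain-level loss.
    """
    tumor = 0.5 * (directional_loss(sim_ip, tau)
                   + directional_loss(transpose(sim_ip), tau))
    brain = directional_loss(sim_bt, tau)
    return lam * tumor + (1.0 - lam) * brain
```

Transposing the similarity matrix yields the geometrics-to-image direction, which is why the two tumor-level losses generally differ even though they share the same pairwise similarities.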

The algorithm
The proposed multi-modal contrastive learning algorithm is shown in Algorithm 1.

Population graph for classifying glioma patients
With the focal tumor and brain network features generated from the multi-modal learning, we construct a population graph to characterize the patient cohort (Fig. 1C): each node represents the multi-modal features extracted from a patient, while each edge represents the similarity between the multi-modal features of two patients (Fig. 5). In the population graph, the node weight of patient $i$ is defined as $w^{node}_i = u_i$, and the edge weight between patients $i$ and $j$ is defined as:

$$w^{edge}_{i,j} = \begin{cases} r(u_i, u_j), & \text{if } r(u_i, u_j) > \theta \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

where $u \in \{u^T, u^B, (u^T, u^B)\}$ is the feature extracted from the multi-modal contrastive learning, i.e., the focal tumor features, the brain network features, or their concatenation; $r(\cdot)$ is the correlation operator; $\theta$ is the threshold of the correlation. We design five different combinations of the node weight $w^{node}$ and the edge weight $w^{edge}$, listed in Table 1.
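The thresholded edge construction of the population graph can be sketched as follows. This is a minimal illustration under assumptions: cosine similarity stands in for the correlation operator $r(\cdot)$, pairs at or below the threshold are left unconnected, and the function name `population_graph` is hypothetical; `theta=0.5` mirrors the value given in the implementation details.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def population_graph(edge_features, theta=0.5):
    """Return {(i, j): weight} for patient pairs with similarity above theta.

    edge_features[i] is the feature vector of patient i used to define
    edges (e.g., brain network features).
    """
    edges = {}
    n = len(edge_features)
    for i in range(n):
        for j in range(i + 1, n):
            r = cosine(edge_features[i], edge_features[j])
            if r > theta:  # weaker pairs stay unconnected
                edges[(i, j)] = r
    return edges
```

The node features passed to the GNN classifier can differ from `edge_features`, which is what allows the different node and edge weight combinations to be compared.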

Specifically, we consider different combinations of focal tumor features (tumor image and geometrics) and global brain network features as edge and node weights, and we construct the population graph based on the hypothesis that the two categories of features may reflect different patterns of tumor invasion, i.e., localized vs. widespread invasion. As such, we could integrate the two types of features and characterize both the homogeneity and heterogeneity of the cohort.

Algorithm 1: Multi-modal contrastive learning
Input: image: $x^I$, points cloud: $x^P$, brain network: $x^B$
for each mini-batch do
    Compute features and attention from image and geometric points cloud: $u^I_i = f^I(x^I_i)$; $u^P_i, a^P_i = f^P(x^P_i)$;
    Compute edge attention $a^E$ via (2).
    Compute node embedding using brain network encoder: $e^N_i = f^B(x^B_i)$;
    for each node $n$ do
        Compute node attention $a^N_{i,n}$ via (3).
    end for
    Extract features from brain networks: $u^B_i$;
    Project features to latent space:
    - Image: $z^I_i = g^I(u^I_i)$
    - Points cloud: $z^P_i = g^P(u^P_i)$
    - Brain network: $z^B_i = g^B(u^B_i)$
    - Focal tumor: $z^T_i = g^T(u^I_i, u^P_i)$
    Compute multi-modal contrastive loss by (7).
end for

Table 1. Population graph with different node and edge weights
Node weight        Edge weight
$u^B$              $r(u^T)$
$(u^T, u^B)$       $r(u^T)$
$u^T$              $r(u^B)$
$(u^T, u^B)$       $r(u^B)$
$(u^T, u^B)$       $r(u^T, u^B)$

Datasets
We collect the anatomical MRI data of 424 glioma patients available from The Cancer Imaging Archive (TCIA) [25,28,30] and an in-house cohort with 117 patients. The MRI modalities include pre-contrast T1, post-contrast T1, T2, and T2-FLAIR. We exclude 17 out of 424 patients due to low MRI quality or missing IDH mutation status. Finally, 407 of 424 patients are included, with 105 IDH mutants and 302 IDH wild-types.

Image pre-processing
A standard pre-processing pipeline is applied to the MRI data as described in [3]. Firstly, the pre-contrast T1, T2, and FLAIR images are co-registered to the post-contrast T1 images using the FMRIB's Linear Image Registration Tool of the FMRIB Software Library (FSL) [12]. Next, skull stripping is performed using the Brain Extraction Tool in FSL [31]. Finally, histogram matching [23] and voxel smoothing with SUSAN noise reduction [12] are conducted as normalization.
For the in-house cohort with diffusion MRI modalities available, the FA maps are derived from the diffusion MRI using the FMRIB's Diffusion Toolbox. The FA maps are used to train the self-supervised models to extract tract-related features from the anatomical MRI to generate brain networks.
Finally, all the MRI data are non-linearly transformed to the standard space by co-registering them to the MNI-152-T1-2MM-brain template available in FSL using the Advanced Normalization Tools [2].
TCGA datasets provide manually corrected tumor segmentation masks. For other datasets, we utilize the DeepMedic segmentation tool in CaPTk to segment the contrast-enhancing tumor [13]. Manual correction is performed by a neurosurgeon and a researcher, with a Dice score calculated to ensure cross-validation. The image of the tumor core is resampled to 2mm × 2mm × 2mm and cropped to 120 × 120 × 120.

Implementation details
The proposed framework is tested on an Nvidia 1070 max-Q GPU. All models are implemented using Python.

Brain network generation
The 20 patients yield 46,180 edges (20 × 2,309) and 1,800 nodes (20 × 90) for training the self-supervised models.
For the node autoencoder, all input node voxels are sampled to a dimension of 4,000; the encoder consists of six layers (dimension: 2048, 1024, 512, 128, 32, 16), and the output of the bottleneck is an attribute vector $v^N$ with a dimension of 16. For the edge encoder, all input edge voxels are sampled (anatomical: 4000, FA: 1000). Two MLPs (MRI: 2048, 1024, 512, 128, 32, 16; FA: 1024, 512, 128, 32, 16) respectively encode the voxel vectors of the anatomical MRI and FA to attribute vectors $v^E$ and $\tilde{v}^E$ with a dimension of 16. The final brain networks contain 90 nodes and 2,309 edges, each with attributes of dimension 16.

Image encoder
The image encoder is a 3DCNN architecture consisting of five 3D convolutional layers (dimension: 64, 128, 128, 256, 256), with four input channels corresponding to the four MRI sequences. Batch normalization and max pooling are performed for all convolutional layers. Three feed-forward layers (dimension: 512, 256, 32) follow to output features with a dimension of 32.

Geometric encoder
A specialized GNN is adopted to extract features from the points cloud. Firstly, the points are converted into a graph for each convolution by generating links between points and their nearest neighbors within a predefined radius. Secondly, the convolution operator NNConv [9] aggregates the point features (Euclidean coordinates of points) and the link features (distance between points) to the center node. Finally, farthest point sampling is adopted to sample the points with the furthest distance from other points. Our geometric encoder consists of convolutional layers (dimension: 32, 64, 64, 128, 128). After the last layer, a global attention pooling is employed to produce attention scores for the points. Finally, a feed-forward network (dimension: 256, 128, 32) outputs the geometric features with a dimension of 32.

Brain network encoder
The projection head $g^N_a$ for node attention is an NN (dimension: 16, 32, 64, 128) that projects node embeddings to the latent space shared with the focal tumor features $z^T$. The brain network encoder is a graph attention network with GATConv layers (dimension: 64, 128, 128, 256, 256, 256) that can handle the high-dimensional node and edge attributes [33]. The feed-forward network (dimension: 512, 256, 32) outputs the brain network features. Cosine similarity is used as the similarity score $S(\cdot)$ for generating node attention.

Bi-level multi-modal contrastive loss
The projection heads $g^I$, $g^P$ and $g^B$ are three separate NNs (dimension: 32, 64, 128, 128) that project the features to the latent space with a dimension of 128. The projection head $g^T$ is another NN (dimension: 64, 64, 128, 128) that projects $u^I$, $u^P$ to the latent space with a dimension of 128. Cosine similarity is selected as the similarity score $S(\cdot)$. $\tau$ is set to 0.1 and $\lambda$ is set to 0.8.

Population graph and GNN classifier
$\theta$ of the population graph is set to 0.5. GATConv is employed as the graph kernel of the GNN to perform node classification in the population graph. The GNN for the population graph consists of GATConv layers (dimension: 64, 128, 128, 128) followed by pooling layers and a classification layer.

Training parameters
For the self-supervised learning that generates brain networks, the node autoencoder adopts the mean squared error loss (MSELoss) and the Adam optimizer with a weight decay of 0.0005 and a batch size of 50, and is trained for 1000 epochs. We set the initial learning rate as 0.001, and the learning rate is reduced to 90% of its value after every 50 epochs. For edge reconstruction, we adopt the SGD optimizer [27] with a weight decay of 0.0005 and a batch size of 50, with 1000 training epochs. We set the initial learning rate as 0.001, and the learning rate is reduced to 90% of its value after every 50 epochs.
For multi-modal learning, we adopt the SGD optimizer with a weight decay of 0.0005 and a mini-batch size of 20, and train for 1000 epochs. We set the initial learning rate as 0.001, and the learning rate is reduced to 90% of its value after every 50 epochs. Data augmentation is performed by rotating both the images and the points cloud data by the same angles.
For the population graph-based GNN, we adopt the Adam optimizer [27] with weight decay and a mini-batch size of 20, and train for 200 epochs. We apply binary cross-entropy loss for patient classification. We set the initial learning rate as 0.001, and the learning rate is reduced to 90% of its value after every 50 epochs.
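The step-wise learning rate schedule shared by all training stages can be written out explicitly. The sketch below reads "reduced to 90% after every 50 epochs" as a multiplication by 0.9 at each 50-epoch boundary; the function name `step_decay` and its parameter names are illustrative.

```python
def step_decay(epoch, base_lr=0.001, factor=0.9, step=50):
    """Learning rate at a given (0-based) epoch under step decay."""
    return base_lr * factor ** (epoch // step)


if __name__ == "__main__":
    # Print the learning rate at the start of a few 50-epoch blocks.
    for epoch in (0, 50, 100, 950):
        print(epoch, step_decay(epoch))
```

Under this reading, a 1000-epoch run ends with a learning rate of 0.001 × 0.9^19, roughly 1.35e-4, while a 200-epoch run ends at 0.001 × 0.9^3 = 7.29e-4.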

Model evaluation
Evaluating performance of the overall framework
We implement two classic CNN backbones (3D-ResNet34, 3D-DenseNet50) as the benchmark methods for the complete multi-modal learning framework, while the MLP and support vector machines (SVM) are also implemented as benchmarks to compare with the proposed population graph-enhanced GNN.

Evaluate population graphs
We conduct experiments in constructing a population graph to choose the best combination of edge and node weights (Table 1).

Ablation experiments
We perform ablation experiments to test the importance of different components in the proposed framework. Specifically, we first test the performance of every single encoder of tumor image, tumor points cloud and brain networks, using an MLP and binary cross-entropy loss (BCELoss). Secondly, we test the performance of the pairwise combinations of training two encoders for the classification. Thirdly, we add the contrastive loss $L_{contra}$ to the training of the above two-encoder combinations for multi-task experiments. Further, we implement the multi-modal training without the contrastive loss. Finally, we implement the multi-modal training with the hierarchical attention removed.

Visualize interpretation
To interpret the results of the proposed multi-modal learning framework, we identify the critical regions contributing to the prediction from tumor images, tumor points cloud and brain networks using different interpretation approaches.
We employ Grad-CAM [29] to visualize the critical regions on the tumor images and visualize the geometric attention of the points cloud. To visualize the concordance between the points cloud and the tumor image, we project the surface points of the Grad-CAM map overlaid on the tumor image to the corresponding points cloud.
To interpret the learning process of the brain network encoder, we employ the GNNExplainer [38] to output a probability score that infers the importance of the edges in the brain network.We retain those edges with probability scores greater than 50%.

Performance of population graph
The performance of the different approaches to constructing the population graph is shown in Table 2. The results show that the population graph achieves the best performance (AUC 0.962) with the concatenated tumor features and brain network features defined as the node and the cosine similarity between brain network features defined as the edge. The population graphs that use the similarity between tumor features to define the edge ($u^B$, $r(u^T)$: AUC 0.914; $(u^T, u^B)$, $r(u^T)$: AUC 0.940) generally perform worse than those using the similarity between brain network features to define the edge ($u^T$, $r(u^B)$: AUC 0.939; $(u^T, u^B)$, $r(u^B)$: AUC 0.962). Strikingly, the population graph with concatenated tumor features and brain network features defined as both node and edge performs the worst (AUC 0.888).

Performance of the proposed framework
Our results (Table 3) show that the best setting of the multi-modal framework (AUC 0.962) outperforms the 3D-CNN backbones (DenseNet: AUC 0.938, ResNet: AUC 0.907), implying the importance of the population graph for feature integration. Notably, the performance of the 3D-CNN models is higher than that of the combination of multi-modal contrastive learning with traditional machine learning models (MLP: AUC 0.936, SVM: AUC 0.932).

Ablation experiments
The full results of the ablation experiments are shown in Table 4. The single-encoder experiments show that the brain network encoder (AUC 0.877) outperforms the tumor geometric (AUC 0.858) and tumor image (AUC 0.869) encoders. The multi-task two-encoder experiments indicate that the best combination of data modalities is tumor image and tumor points cloud (AUC 0.874). The two-encoder experiments with the additional contrastive loss show that combining the tumor image and tumor geometric encoders (AUC 0.929) consistently performs the best. Finally, removing the graph attention modules significantly decreases the performance of the contrastive training framework (with attention: AUC 0.936, without attention: AUC 0.924).

Interpretative visualization
The interpretative visualizations in Fig. 7 show that the image encoder and the geometric encoder indicate common tumor regions (Fig. 7D-F) that are important for model prediction, which suggests that these regions are specific to IDH mutation. Fig. 7D shows that Grad-CAM focuses on the tumor contrast-enhancing edges with high intensity in both T2 and T2-FLAIR images (Fig. 7B, C). Through the visualization in Fig. 8, we find that the brain networks of IDH wild-type demonstrate a higher density of important disrupted edges compared to IDH mutant. This finding aligns with our prior knowledge that the IDH wild-type is generally more invasive than the IDH mutant.

DISCUSSION
We propose a multi-modal contrastive learning framework that exploits the multi-modal features extracted from the tumor image, points cloud and global brain networks for predicting glioma genotype. We first develop a novel self-supervised learning approach to construct brain networks from anatomical multi-sequence MRI. Moreover, we develop a hierarchical graph attention for the brain network encoder to extract tumor-related brain network features. Further, we design a bi-level multi-modal contrastive loss that aligns tumor-related network features with focal tumor features across the domain gap. Finally, we construct a population graph that effectively integrates the multi-modal features and predicts patients' genotype. Our learning framework achieves the highest performance compared to the benchmark methods.
Previous studies show that tumor geometric features have crucial value in characterizing tumors. Although the single geometric encoder does not perform the best in our experiments, the combination of the image and geometric encoders shows high performance, which suggests that including geometric data helps to extract the most relevant features. The interpretative visualization in Fig. 7 indicates agreement between the points cloud and image features, which further supports the effectiveness of the multi-modal contrastive learning in aligning multiple modality domains. Biologically, this could be interpreted as the association between tumor content and tumor boundaries, indicating tumor aggressiveness and invasion patterns.
Glioma is characterized by diffuse infiltration beyond the focal lesion, disrupting the global brain. Therefore, characterizing the brain network promises to add value beyond the focal tumor. However, glioma patients also frequently demonstrate concomitant pathology beyond the lesion, which could make it challenging to extract specific features from the brain network for predicting the genotype. To efficiently extract tumor-specific features from the brain network, we introduce a hierarchical graph attention that attends to the edges of the brain network associated with the tumor points cloud, indicating the white matter tracts (Fig. 8) responsible for tumor invasion across tumor boundaries (Fig. 7). Similarly, the attention module can further identify the nodes associated with the focal tumor. This neuroscience-inspired attention module enhances both model performance and interpretability, as shown by the ablation experiments and the visualization.
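The node-level part of this attention can be illustrated with a minimal sketch: each brain network node embedding is scored by its similarity to the focal tumor feature, and the scores weight the global pooling. The use of cosine similarity followed by a softmax, the function names and the toy embeddings are assumptions made for illustration; the edge-level attention and the graph convolution that precede this step are not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(node_embeddings, tumor_feature):
    """Pool node embeddings into one brain-level feature vector.

    Each node is weighted by the softmax of its cosine similarity to the
    tumor feature, so tumor-related nodes dominate the pooled vector.
    """
    sims = [cosine(h, tumor_feature) for h in node_embeddings]
    attn = softmax(sims)
    dim = len(node_embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(attn, node_embeddings)) for d in range(dim)]
    return attn, pooled


# Toy example: three node embeddings and one tumor feature in a shared 2-D latent space.
node_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tumor_feature = [1.0, 0.0]
attn, pooled = attention_pool(node_embeddings, tumor_feature)
```

The node aligned with the tumor feature receives the largest weight and the node pointing in the opposite direction the smallest.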
Our experiments show that the multi-modal contrastive learning approach outperforms the CNN-based benchmarks, validating the usefulness of properly incorporating tumor geometrics and brain networks in predicting glioma genotype. Our hierarchical attention transfers the geometric attention of the points cloud to the crossing edges and minimizes the domain gap between the focal tumor and the brain networks. Different from the traditional cross-modal attention [14], our hierarchical attention is designed based on the data properties and a clinical hypothesis. In addition, we develop a bi-level contrastive loss, tailored to perform contrastive learning at the tumor and brain levels, reflecting the gradient invasion pattern. Instead of directly applying a cross-modal contrastive loss between the three modalities in the same latent space, we design a bi-level approach that considers the tumor gradient effect across the brain.
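To make the bi-level structure concrete, the following is a minimal sketch rather than the exact loss used in our experiments. It assumes a symmetric InfoNCE-style objective with cosine similarity and a temperature, applied once between image and points cloud features (tumor level) and once between focal tumor and brain network features (brain level), with the two terms summed with equal weight. The function names, the temperature value and the equal weighting are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should match positives[i].

    Features of the same patient (same index) attract; features of
    different patients in the batch act as negatives and repel.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        shift = max(logits)
        log_denominator = shift + math.log(sum(math.exp(v - shift) for v in logits))
        total += log_denominator - logits[i]
    return total / n


def bi_level_loss(u_img, u_pts, u_tumor, u_brain, temperature=0.1):
    """Sum of a tumor-level and a brain-level symmetric contrastive term."""
    tumor_level = 0.5 * (info_nce(u_img, u_pts, temperature)
                         + info_nce(u_pts, u_img, temperature))
    brain_level = 0.5 * (info_nce(u_tumor, u_brain, temperature)
                         + info_nce(u_brain, u_tumor, temperature))
    return tumor_level + brain_level


# Toy projected features for two hypothetical patients.
a = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = bi_level_loss(a, a, a, a)
misaligned_loss = bi_level_loss(a, swapped, a, swapped)
```

When the paired features of each patient agree, the loss is close to zero; when the pairing is swapped between patients, the loss is large, which is the behaviour the contrastive training relies on.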
To integrate the multi-modal features, we develop a population graph to characterize the patient cohort. The brain network features prove to be the best features for describing patient similarity, highlighting the importance of network features. In contrast, focal tumor features show weaker performance in characterizing patient similarity, which might be due to the remarkable tumor heterogeneity and the limited information they carry compared to the global brain, further supporting the value of incorporating comprehensive features in the prediction.
The proposed methods have the potential for automated, rapid diagnosis and prognosis in glioma patients based on pre-treatment MRI, which is essential for patient risk stratification and treatment planning towards precision medicine. Further, our hierarchical graph attention could help reveal tumor-related disruption beyond the lesion, which could support more precise planning of surgery and radiotherapy, as recent studies show that identifying disrupted white matter tracts could help reveal tumor invasion that is invisible on conventional MRI and indicate the recurrence location [35].
This study has limitations. Firstly, due to the rarity of glioma, our training sample is smaller than those available for other cancers, although our cohort is one of the largest in glioma. Secondly, our cohort is slightly more imbalanced (~25% IDH mutant) than the reported incidence (~40%) [20]. We use both AUC and accuracy to evaluate model performance, which show comparable results, implying the robustness of the model. Thirdly, constructing brain networks relies on neuroanatomical atlases; we use an atlas with 90 brain regions because of the limitation of computational costs. Adopting atlases with higher resolution could further increase the framework's performance. Our future work will involve larger datasets and transfer learning to further enhance the performance. In addition, manifold or mesh-based geometric encoders could be utilized to capture features from a more detailed geometric data format.

CONCLUSION
We present a novel multi-modal learning framework for predicting the IDH mutation of glioma. Our technical contributions include: a self-supervised approach for generating brain networks from anatomical MRI; a specialized hierarchical attention module that attends to tumor-related edges and nodes; a bi-level contrastive loss for minimizing the domain gap between different modalities; and a weighted population graph for feature integration and patient classification. Our framework outperforms the classic CNN backbones, while the population graph-based classifier outperforms traditional machine learning models. In future, we will further develop our model to include clinical variables in the prediction.

Fig. 1. Study overview: A. Multi-modal data generation. Image (x^I) and geometric (x^P) data are generated from tumor masks, while brain network data (x^B) are generated from the pretrained self-supervised models. B. Features of tumor image (u^I), tumor geometrics (u^P), focal tumor (u^P, u^I) and global brain network (u^B) are projected by the respective projection heads (g_I, g_P, g_T and g_B) for bi-level contrastive learning. A hierarchical attention module attends to the edges and nodes in the brain network. C. A population graph is used to integrate multi-modal features and classify patients using a GNN.

Fig. 3. Hierarchical graph attention: A. Geometric boundary attention produced by the geometric encoder is projected to brain networks to obtain edge-level attention. B-C. Edge-attended brain networks are convolved to produce node embeddings, which are projected to the latent space to generate node-level attention by computing similarity with tumor features. The node-level attention is then used at the global pooling level to generate tumor-related brain-level network features u^B.

Fig. 4. Bi-level multi-modal contrastive learning: Latent features of different modalities (u^I_i, u^P_i), (u^T_i, u^B_i) from the same patient (green) attract each other, while latent features of different modalities (u^I_i, u^P_j), (u^T_j, u^B_i) from different patients (red) repel each other. The bi-level loss consists of tumor-level and brain-level components trained together.

Fig. 5. Population graph for patient classification: Each node weight (w_node) represents the features of one patient, while each edge weight (w_edge) represents the similarity between the features of patients. A GNN node classifier is trained for classifying patients.

Fig. 6. AUC for the ablation experiments: A. Models with and without hierarchical attention. B. Models with and without bi-level contrastive loss. C. Models of the full framework and the benchmarks of population graph-enhanced GNN. D. Models of the full framework and CNN benchmarks.

Fig. 7. A case example of the interpretation of the image and geometric encoders. A. Pre-contrast T1. B. T2. C. T2-FLAIR. D. The Grad-CAM heatmap overlaid on post-contrast T1. E. The Grad-CAM voxels projected to the points cloud. F. Points attention generated by the geometric encoder.

Fig. 8. Examples of IDH mutant and wild-type. A. IDH wild-type. B. IDH mutant. The voxel distribution of disrupted tracts with over 50% probability of importance is indicated.

Table 2. Experiments for selecting the best combination for the population graph