Differentially Private Graph Classification with GNNs

Graph Neural Networks (GNNs) have established themselves as state-of-the-art models for many machine learning applications such as the analysis of social networks, protein interactions and molecules. Many of these datasets contain privacy-sensitive data. Machine learning with differential privacy is a promising technique for deriving insight from sensitive data while offering formal guarantees of privacy protection. However, the differentially private training of GNNs has so far remained under-explored due to the challenges presented by the intrinsic structural connectivity of graphs. In this work, we introduce differential privacy for graph-level classification, one of the key applications of machine learning on graphs. Our method is applicable to deep learning on multi-graph datasets and relies on differentially private stochastic gradient descent (DP-SGD). We show results on a variety of synthetic and public datasets and evaluate the impact of different GNN architectures and training hyperparameters on model performance for differentially private graph classification. Finally, we apply explainability techniques to assess whether similar representations are learned in the private and non-private settings, and establish robust baselines for future work in this area.


Introduction
The introduction of geometric deep learning, and more specifically Graph Neural Networks (GNNs) [1] has enabled training ML models on data in non-Euclidean spaces, with state-of-the-art performance in many applications. GNNs are able to directly leverage the graph structure of the data and propagate the information stored in nodes of the graph along the edges connecting nodes with each other. Thus, the information flow through the network respects the underlying topology of the graph.
In general, GNNs have been employed in three types of problem areas: node classification, edge prediction, and graph classification. In this work, we focus on graph classification tasks. In the setting of graph classification (also termed graph property prediction), the dataset consists of multiple graphs and a GNN is trained to predict one label for each individual graph, that is, a specific property of the whole graph. Application areas of geometric deep learning range from social networks [2] to medical applications [3,4], drug discovery or molecule classification [5], spatial biological networks [6] and shape analysis [7]. Drawing meaningful insights in many of these application areas fundamentally relies upon the utilisation of privacy-sensitive, often scarce, training data belonging to individuals. For example, when using functional magnetic resonance imaging (fMRI) to identify disease-specific biomarkers of brain connectivity as in [3] and [8], the graph data encodes sensitive, patient-specific medical data.

Figure 1: Overview of our differentially private training method for graph classification on a fingerprint dataset. In step (1) the fingerprint images are converted into graphs, which are passed in step (2) to a GNN model trained with differentially private stochastic gradient descent (DP-SGD). The individual gradients are clipped, then averaged, and Gaussian noise is added.
The reliance on sensitive data in machine learning holds potential for misuse and can therefore be associated with risks to individual participants' privacy. Various machine learning contexts have been shown to be vulnerable to exploitation by malicious actors, resulting in leakage of private attributes [9], of membership information [10] or even in full dataset reconstruction [11,12]. In graph machine learning, the data and the models trained on that data are by design more vulnerable to adversarial attacks targeting the privacy of the data owners. This is attributed to the fact that graphs incorporate additional information that is absent from typical Euclidean training contexts, such as the relational information about the nodes in the graph. This auxiliary, highly descriptive information can be leveraged by an adversary to assist in the extraction of sensitive information, which has been demonstrated in a number of prior works [13,14,15]. Such attacks can also be facilitated by the choice of learning context in cases where the model is trained collaboratively. For instance, transductive collaborative learning renders attacks aimed at disclosing the membership of individual training points trivial [14]. Of note, such additional information embedded in graphs is often essential for effective GNN training and is thus non-trivial to privatise or remove, as doing so would be highly detrimental to the performance of the model.
It is thus apparent that the implementation of privacy-enhancing techniques is required to facilitate the training of models on sensitive graph-structured data, but such techniques must also respect the particularities of graph machine learning. Our work utilises a formal method of privacy preservation termed differential privacy (DP) [16] which, when applied to machine learning training, is able to objectively quantify the privacy loss for individual input data points. DP methods have been successfully applied to numerous problems such as medical image analysis [17,18], natural language processing (NLP) [19], reinforcement learning [20] or generative models [21] and have shown promising results. DP guarantees that the information gain from observing the output of an algorithm trained on datasets differing in one individual is (sometimes only with high probability) bounded by a (typically small) constant.
In this work, motivated by the above-mentioned requirements for objective privacy guarantees in machine learning tasks involving graph-structured data, we study the problem of efficient differentially private graph neural network training for graph classification tasks. We utilise differentially private stochastic gradient descent (DP-SGD) [22], a technique tailored to neural network training which, due to its compatibility with existing deep learning workflows, can be seamlessly integrated and therefore generalises well to new model architectures and problem spaces. We show that DP-SGD can be applied to graph learning and evaluate our results with respect to privacy budgets and network performance on four different datasets. To the best of our knowledge, this is the first work that shows the application of differentially private GNNs to graph classification tasks. Combined with our investigation of the explainability technique GNNExplainer to determine differences between DP and non-DP models, this work can serve as a baseline for future work in this area. Our contributions can be summarised as follows:
1. We formally extend the application of DP-SGD to graph classification tasks;
2. We evaluate the differentially private training of commonly utilised graph neural networks on a number of benchmark and real-world datasets and investigate the effects of DP training on model utility and privacy guarantees;
3. To assess whether privately and non-privately trained models learn similar representations, we apply GNNExplainer, a state-of-the-art explainability technique tailored to graph neural networks.

Related Work
Specific facets of differentially private graph analysis have been addressed in prior work: since the introduction of differentially private computation on graph data in 2007 by Nissim et al. [23], node-level and edge-level DP have been established as the two DP formalisms on graphs [24]. As discussed in the Theory section, the definition of DP relies on the notion of adjacent datasets, that is, datasets differing in the data of one individual. In the setting of tabular data, for example, two datasets are adjacent if they differ in one row. In node-level DP, two graph datasets are regarded as adjacent if they differ by one node and its incident edges. For edge-level DP, on the other hand, two datasets are regarded as adjacent if they differ in exactly one edge. Node-level DP is therefore a strictly stronger privacy guarantee than edge-level DP [25]. As real-world graphs are typically sparse, the removal of a single node can severely alter the graph's structure [25], whereas the removal of an edge usually has a less severe impact on the resulting graph structure.
Implementations of the aforementioned techniques have been presented in the context of graph neural network training. For instance, Igamberdiev et al. [26] explore the application of DP to Graph Convolutional Networks (GCNs) [27] for node classification. They evaluate privacy guarantees for text classification on benchmark datasets and achieve rigorous privacy guarantees while maintaining high model performance. Daigavane et al. [28] formalise the notion of node-level DP on one-layer GNNs with an extension of privacy amplification by sampling to GNNs and evaluate their method on several benchmark datasets in node classification tasks. Approaches differing from the one introduced here have been explored in the context of federated learning on graphs and locally private graph neural network training: Zhou et al. [29], for example, introduce a vertically federated GNN for node classification tasks, and Sajadmanesh et al. [30] introduce a framework to train locally private GNNs.
However, to our knowledge, the application of DP algorithms specifically to graph property prediction has neither been formalised nor evaluated.

Theory
In this section, we introduce and formalise the theory to train graph neural networks for graph property prediction using the concept of differentially private stochastic gradient descent (DP-SGD).

GNNs for Graph Property Prediction
The objective of graph classification (also known as graph property prediction) is to predict a specific property of interest for an entire graph G = (V, E), an unweighted and undirected graph where V is the set of nodes and E is the set of edges. For the experiments in this work, we use three commonly used GNN models: Graph Convolutional Networks (GCNs) [27], Graph Attention Networks (GATs) [31], and GraphSAGE [32].
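As an illustration, a minimal graph classifier of this kind can be sketched in PyTorch Geometric as follows (our own simplified example, not the exact architectures evaluated in this work): message passing layers compute node embeddings, which a global pooling (readout) step aggregates into a single graph-level prediction.

```python
# Minimal sketch of a GCN-based graph classifier (illustrative only): two message
# passing layers followed by a global mean pooling readout and a linear head.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool


class GCNGraphClassifier(torch.nn.Module):
    def __init__(self, num_node_features: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.lin = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        # Message passing along the edges of each graph.
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        # Aggregate node embeddings into one embedding per graph in the batch.
        x = global_mean_pool(x, batch)
        return self.lin(x)
```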

Differential Privacy
Differential Privacy (DP) [16] is a theoretical framework and collection of techniques aimed at enabling analysts to draw conclusions from datasets while safeguarding individual privacy. Intuitively, an algorithm preserves DP if its outputs are approximately invariant to the inclusion or exclusion of a single individual in the dataset over which the algorithm is executed. The DP guarantee is stated in terms of the probability mass/density of the algorithm's outputs.
In the current study, we assume that an analyst A is entrusted with a multi-graph database D of cardinality N containing privacy-sensitive graphs G_i ∈ D, i ∈ {1, …, N}, by a group of individuals. We assume that each individual's graph is only present in the database once. From D, an adjacent database D′ of cardinality N ± 1 can be constructed by adding or removing a single individual's graph. We denote adjacency by D ≃ D′. The set (universe) of all adjacent databases forms a metric space X with associated metric d_X, in our case the Hamming metric.
We additionally assume that A executes a query function f over an element of X. In our study, the application of f represents a sequential composition of the forward pass, loss calculation and gradient computation of a graph neural network for each individual input (training example) to f. We then define the L2-sensitivity of f as follows.

Definition 3.1 (L2-sensitivity of f). Let f, X and d_X be defined as above. Additionally, let Y be the metric space of f's outputs with associated metric d_Y. When Y is the Euclidean space and d_Y the L2 metric, the L2-sensitivity ∆ of f is

∆(f) = max_{D ≃ D′} ‖f(D) − f(D′)‖₂,

where the maximum is taken over all adjacent database pairs in X. Moreover, ∆ describes a Lipschitz condition on f, implying that ∆ ≡ K_f, where K_f is the Lipschitz constant of f. This in turn implies that ∆ = sup ‖∇f‖₂. In our case, the L2-sensitivity of the loss function therefore corresponds to the upper bound on the norm of its gradient.
We can now define the Gaussian Mechanism on f.

Definition 3.2 (Gaussian Mechanism). Let f and ∆ be defined as above. The Gaussian mechanism M operates on the outputs of f, y = f(x) with y ∈ Rⁿ, as follows:

M(x) = f(x) + ξ,

where ξ ∼ N(0, σ²Iₙ), σ is calibrated to ∆, and Iₙ is the identity matrix with n diagonal elements.
When σ is appropriately calibrated to ∆, M preserves (ε, δ)-DP, that is, for all subsets S of the range of M and all adjacent databases D, D′ in X:

P(M(D) ∈ S) ≤ e^ε · P(M(D′) ∈ S) + δ.

We remark that the definition is symmetric in D and D′.
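As a simple illustration (our own sketch, with the noise standard deviation expressed as a noise multiplier times the sensitivity, as is common practice), the mechanism can be written as:

```python
# Sketch of the Gaussian mechanism: add noise drawn from N(0, sigma^2 * I) to the
# query output, with sigma calibrated to the L2-sensitivity via a noise multiplier.
import torch


def gaussian_mechanism(y: torch.Tensor, sensitivity: float, noise_multiplier: float) -> torch.Tensor:
    sigma = noise_multiplier * sensitivity
    return y + torch.normal(mean=0.0, std=sigma, size=y.shape)
```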

DP-SGD
Abadi et al. [22] introduced an extension to stochastic gradient descent (SGD), termed DP-SGD, to enable the differentially private training of neural networks. Here, at each training step, the Gaussian Mechanism is used to privatise the individual gradients of each training example before the model parameters are updated. However, since the sensitivity of the loss function in deep neural networks is, in general, unbounded, the gradient L2-norm of each individual training example is clipped, that is, projected onto an L2 ball of a pre-defined radius, to artificially induce a bounded sensitivity condition before noise is applied. Tracking the privacy expenditure over the course of training (privacy accounting) is enabled through the composition property of DP, which states that repeated application of DP algorithms over the same data degrades the privacy guarantees in a predictable way. In our study, a relaxation of DP termed Rényi DP (RDP) [33] is used for privacy accounting due to its favourable compositional properties. RDP guarantees can be converted to (ε, δ)-DP.
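To make the per-example clipping and noising explicit, the following simplified sketch shows one DP-SGD step for graph-level inputs. It is an illustration under our stated assumptions (per-example gradients are obtained by looping over the batch; in practice, dedicated libraries vectorise this computation) and not the exact training code used for our experiments.

```python
# Simplified DP-SGD step: clip each per-example gradient to an L2 norm of
# max_grad_norm, sum the clipped gradients, add calibrated Gaussian noise
# and average before the parameter update.
import torch


def dp_sgd_step(model, loss_fn, examples, optimizer, max_grad_norm, noise_multiplier):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for example in examples:  # one forward/backward pass per training example
        optimizer.zero_grad()
        batch_vec = torch.zeros(example.num_nodes, dtype=torch.long)  # single-graph batch vector
        loss = loss_fn(model(example.x, example.edge_index, batch_vec), example.y)
        loss.backward()
        grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
                 for p in params]
        # Project the per-example gradient onto the L2 ball of radius max_grad_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip_coef = torch.clamp(max_grad_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * clip_coef)

    sigma = noise_multiplier * max_grad_norm  # noise scale calibrated to the clipping bound
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.normal(0.0, sigma, size=p.shape)
            p.grad = (s + noise) / len(examples)  # noisy average gradient
    optimizer.step()
```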

Datasets
We evaluate the application of DP-SGD in the context of graph property prediction tasks on four datasets. We rely on three publicly available datasets and one synthetic dataset, the latter generated to provide a reproducible and easily controlled proof-of-concept. The three benchmark datasets tackle the problems of molecule classification (Molbace), fingerprint classification, and Left Bundle Branch Block (LBBB) detection on electrocardiogram (ECG) data. Table 1 provides an overview of the datasets and their characteristics.
Synthetic Dataset In order to derive a proof-of-concept of the novel application of DP-SGD to graph classification tasks, we construct a synthetic dataset in which parameters can be manually controlled. We generate 1000 individual Erdős-Rényi graphs of two classes. Each graph consists of twenty nodes and each node contains nine features, which are sampled from a normal distribution with mean values of 0 and 0.1 for the two classes, respectively, and a shared standard deviation of 0.5. The edge connection probabilities of the graphs from the two classes are set to 0.2 and 0.3, respectively. We generate 500 graphs per class to obtain a balanced dataset, split the dataset into 600 training samples, 100 validation samples and 300 test samples, and perform binary graph classification. To address the applicability of our approach to graphs containing unconnected components, we do not eliminate unconnected nodes from the generated Erdős-Rényi graphs.
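A sketch of how such a dataset could be generated with PyTorch Geometric is given below (helper names are our own; the pairing of the feature means with the edge probabilities per class is our assumption):

```python
# Sketch of the synthetic dataset: two classes of Erdős-Rényi graphs that differ
# in edge probability (0.2 vs. 0.3) and node feature mean (0.0 vs. 0.1).
import torch
from torch_geometric.data import Data
from torch_geometric.utils import erdos_renyi_graph


def make_synthetic_dataset(n_per_class=500, num_nodes=20, num_features=9, std=0.5):
    dataset = []
    for label, (edge_prob, feat_mean) in enumerate([(0.2, 0.0), (0.3, 0.1)]):
        for _ in range(n_per_class):
            edge_index = erdos_renyi_graph(num_nodes, edge_prob)
            x = torch.normal(mean=feat_mean, std=std, size=(num_nodes, num_features))
            dataset.append(Data(x=x, edge_index=edge_index, y=torch.tensor([label])))
    return dataset
```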
Fingerprints Dataset Fingerprint classification aims to separate images of fingerprints into different classes. Large within-class variability and small separation between classes make fingerprint classification a challenging task [34]. We rely on the dataset introduced by Riesen et al. [35] and provided by TU Datasets [36] to perform differentially private graph classification on fingerprints. The graphs are extracted from the images based on directional variance, and the task follows the Galton-Henry classification scheme of five classes. We merge the five classes into four classes following the approach described in [35]. Differentially private ML naturally befits this task, as it allows one to privatise the utilisation of the uniquely identifying fingerprint data, e.g. for training machine learning models in tasks such as automated authentication.
Molbace Dataset To perform molecule classification in a binary graph classification setting, we use the benchmark dataset Molbace from the OGB database [37], which is adapted from MoleculeNet [38]. It consists of 1513 graphs, where each graph represents a molecule. Edges represent molecular bonds and nodes correspond to the atoms in the molecule. Each node contains 9 node features and the average number of nodes per graph is 34. We split the dataset into 1210 training graphs, 152 test graphs and 151 validation graphs. Node features contain atom features, for example the atomic number, chirality, formal charge, or whether the atom is in a ring or not. The prediction task of this dataset is to correctly classify whether the molecule inhibits human β-secretase 1 (BACE-1) [37]. Such a task is representative of federated learning workflows in which, e.g., several pharmaceutical companies wish to jointly train a model for molecule property prediction while keeping their (possibly proprietary) molecule structures private from third parties.

Figure 2: Graph visualisation of ECG data. We connected the different signal channels based on the medical location of the leads as well as prior knowledge. Leads I, II, III, aVF, aVL, and aVR are located on the extremities and the remaining leads on the chest.

ECG Dataset For the task of electrocardiogram (ECG) classification, we use the publicly available dataset from the China Physiological Signal Challenge (CPSC) 2018 [39]. We formulate a classification task between ECGs showing signs of a Left Bundle Branch Block (LBBB) and normal ECGs showing a sinus rhythm. LBBB is an insidious type of arrhythmia (that is, an anomaly in the conduction system of the heart) which, when appearing suddenly, can herald acute myocardial ischemia or infarction. The ECG data consist of twelve ECG signal channels (leads), recorded at different locations on the human torso and extremities. Leads affixed to the extremities constitute signal channels I, II, III, aVR, aVF and aVL. Leads affixed to the chest are used to derive signal channels V1 through V6. To construct a graph dataset from the ECG data, we utilise this medical motivation and separate the extremity signal channels from the chest signal channels, fully connecting the nodes within the extremity subgraph and within the chest subgraph. In addition, we utilise prior knowledge about the leads typically used by physicians to delineate LBBB from sinus rhythm and thus additionally connect channels I, aVR, V5, and V6. The structure of these graphs is visualised in Figure 2. The dataset we use contains ECG data of 1125 subjects. As ECG signals are periodic, we sub-sample the signals by only retaining the first 512 signal points of each channel, leading to 512 node features per node. The binary classification dataset is highly imbalanced, with 207 subjects showing signs of LBBB and 918 having normal ECG curves. Evidently, ECG data, like all medical data, is highly sensitive and thus requires formal methods of privacy protection.
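The fixed twelve-node topology described above can be sketched as follows (an illustrative construction under the stated connectivity assumptions; the lead ordering and helper names are our own):

```python
# Sketch of the 12-lead ECG graph: the extremity leads, the chest leads and the
# prior-knowledge leads (I, aVR, V5, V6) each form a fully connected subgraph.
import itertools
import torch

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF", "V1", "V2", "V3", "V4", "V5", "V6"]
EXTREMITY = ["I", "II", "III", "aVR", "aVL", "aVF"]
CHEST = ["V1", "V2", "V3", "V4", "V5", "V6"]
PRIOR_KNOWLEDGE = ["I", "aVR", "V5", "V6"]  # leads typically used to delineate LBBB


def ecg_edge_index() -> torch.Tensor:
    idx = {lead: i for i, lead in enumerate(LEADS)}
    edges = set()
    for group in (EXTREMITY, CHEST, PRIOR_KNOWLEDGE):
        for a, b in itertools.combinations(group, 2):
            # undirected graph: store both directions
            edges.add((idx[a], idx[b]))
            edges.add((idx[b], idx[a]))
    return torch.tensor(sorted(edges), dtype=torch.long).t()
```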

GNN Models for Graph Classification and DP-SGD Training
Since the adoption of deep learning techniques to graph learning, most state-of-the-art methods for graph classification rely on a variant of message passing to aggregate information across the nodes [40,41,42,43,44].
For our experiments, we implement a variety of GNN models to compare performance and evaluate the impact of DP on different graph learning techniques. We use GraphSAGE [32], Graph Attention Networks (GATs) [31], and Graph Convolutional Networks (GCNs) [27]. For each dataset, we perform hyperparameter searches, leading to different models for each application. The depth of the GNNs varies from two to three layers, with or without Instance Normalisation layers and with or without dropout, depending on the problem space. We do not use Batch Normalisation because of its incompatibility with differentially private training: by taking averages across the batch during the forward pass, Batch Normalisation leaks information across the samples in a batch and precludes the computation of the per-sample gradients necessary for DP-SGD. More details about the model architectures can be found in the supplementary material.
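As an illustration of this design choice, a DP-compatible convolution block could look as follows (a sketch assuming PyTorch Geometric's InstanceNorm layer, which computes normalisation statistics per graph rather than across the batch; the block itself is our own example rather than the exact architecture used here):

```python
# Sketch of a graph convolution block that uses Instance Normalisation instead of
# Batch Normalisation, so that no statistics are shared between samples in a batch.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, InstanceNorm


class GCNBlock(torch.nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = GCNConv(in_channels, out_channels)
        self.norm = InstanceNorm(out_channels)  # per-graph statistics only

    def forward(self, x, edge_index, batch):
        return F.relu(self.norm(self.conv(x, edge_index), batch))
```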
When training graph classification models with DP-SGD, we follow the standard DP-SGD training procedure. Firstly, a privacy budget is set in terms of ε; then the model is trained with a specific noise multiplier, which defines the amount of Gaussian noise added to the gradients of the model, and an L2-sensitivity bound. The model can then be trained for a certain number of iterations until the privacy budget ε is reached. We report the scores of the best-performing model among those trained before the privacy budget is exhausted. For all differentially private training runs, we set δ = 1/N, where N is the cardinality of the dataset, and monitor the performance of the algorithm under different privacy budgets ε. Across all experiments, we utilise the same model architectures for DP-SGD and SGD training, removing potential dropout layers for DP-SGD training. In Table 2 we report the mean performance as well as the standard deviation of five independent runs for each experiment. We evaluate several scores for each model: ROC AUC, accuracy, sensitivity, specificity and F1 score. Here, sensitivity is the ratio of true positives to the sum of true positives and false negatives, and specificity is the ratio of true negatives to the sum of true negatives and false positives. The ROC AUC score is the area under the receiver operating characteristic curve, computed with a micro average for multi-class datasets. Accuracy is the ratio of correctly classified samples to all samples, and the F1 score is the harmonic mean of precision and recall, also computed with a micro averaging strategy for multi-class datasets.
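For reference, the reported metrics can be computed as sketched below for the binary case (using scikit-learn; for the multi-class Fingerprint dataset, ROC AUC and F1 are computed with micro averaging instead):

```python
# Sketch of the evaluation metrics for binary classification: sensitivity and
# specificity are derived from the confusion matrix, the remaining scores come
# directly from scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score


def evaluate(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5) -> dict:
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "f1": f1_score(y_true, y_pred),
    }
```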

Experimental Results
In this section, we evaluate our results, compare DP-SGD training with standard SGD training and show the impact of different privacy budgets on model performances. The results achieved on the four datasets are summarised in Table 2.

Summary of Results
For all datasets, we observe similar behaviour, namely a correlation between stronger privacy budgets and diminished model performance. Although this phenomenon is, in general, an unavoidable, information-theoretic consequence of the trade-off between privacy and utility, the individual models exhibit different tolerances towards the amount of Gaussian noise added for DP-SGD, as well as towards gradient clipping. For instance, for the synthetic dataset, an ε value of 5 does not lead to accuracy loss, whereas for the Molbace dataset, a privacy budget of ε = 10 already results in diminished model accuracy. Interestingly, the performance of DP-SGD training is overall not substantially influenced by the choice of GNN architecture (GCN, GAT or GraphSAGE). We observe high performance and similar convergence rates for all architectures, indicating that DP-SGD training is largely independent of the type of graph convolution. We find that our models achieve high performance when using our proposed DP-SGD training method. The performance decreases slowly with increasing privacy guarantees. For all models, we observe an increased inter-run variability with stronger privacy guarantees. This behaviour is reflected in the higher standard deviations reported in Table 2, and we attribute this phenomenon to the increased randomness injected by the DP mechanism.

Table 2: Summary of our experimental evaluation on four datasets: Synthetic, Fingerprints, ECG, and Molbace with different network types. We report results with SGD and DP-SGD training as well as varying privacy budgets ε. The scores are evaluated on the test sets with standard deviations based on five independent runs.
As an example, we visualise the impact of a stronger privacy guarantee on the performance on the ECG dataset in Figure 3. Given that the dataset is highly imbalanced, a constant prediction (marked by the lower dashed green line in Figure 3) would result in an approximate test accuracy of 81.6%. We examine the dependency of the results on the choice of ε and report the different performances. With a very strong privacy guarantee (corresponding to a low ε value), the performance of the network is barely better than a constant prediction. The looser the privacy guarantee (larger ε value), the better the performance; for a very loose ε the results reach non-DP performance. Interestingly, for some models we observe identical performance between DP-SGD and normal training, e.g. Fingerprint-GCN, where the DP-SGD model (privacy budget of ε = 5) reaches slightly higher performance than the normally trained model, see Table 2; this beneficial effect can be attributed to the regularising effects of gradient norm bounding and noise injection, indicating that, within certain constraints, DP training can go hand-in-hand with excellent overall model performance and generalisability.

Scalability In order to investigate the scalability of our approach, we vary the size of the generated Erdős-Rényi graphs in the synthetic dataset between 10 and 500 nodes per graph. Figure 4 shows the impact of the graph size on the performance under DP using a three-layer GCN and ε = 2.3. We visualise the performance for graph sizes between 10 and 50 nodes and find that performance improves with increasing graph size in this range. Beyond 50 nodes, the performance remains consistently high, which is why these results are not included in Figure 4. This behaviour indicates a strong performance of our model across varying graph sizes, i.e. robust scalability.
Explainability The interpretability of GNNs is a challenging and frequently discussed research topic. Recently, approaches like the GNNExplainer [45] formalised methods which can be used to interpret the results of trained GNNs. We make use of this method to interpret the differences in learned representations between models trained with DP-SGD and non-private SGD and visualise the results in Figure 5. The GNNExplainer is an approach for post-hoc interpretation of predictions generated by a trained GNN. It is used to identify which edges in the graph represent essential connections for a prediction of the network, thus indicating nodes important for the final prediction. GNNExplainer prunes the original graph to only contain the nodes and edges with the highest impact on the model prediction. We apply the GNNExplainer to our results on the Fingerprints dataset, comparing a GCN model trained with standard SGD and three GCN models trained with DP-SGD with ε = 5, ε = 1 and ε = 0.5. We set the GNNExplainer threshold for edge importance to 0.2. Qualitatively, we observe that the GNNExplainer results of the DP models and the standard models largely coincide. In the lower example (2) in Figure 5, both GNNExplainer graphs are identical (and almost identical in the upper row), showing that in both models the same edges and nodes have a high impact on the models' predictions. This indicates that the feature importance is the same (or almost the same) between both models and is not compromised by the privacy guarantees achieved through DP training.

Figure 5: Panels (B) and (C) show the pruned GNNExplainer graphs for SGD and DP-SGD training, respectively.

Table 3: Mean IoU scores of ten test samples from the Fingerprint dataset, comparing edges between the original graph, the GNNExplainer graph of the model trained with SGD, and the GNNExplainer graph of the model trained with DP-SGD. The IoU between the original graph and the non-DP graph is 0.739. The intersection between the DP and the non-DP graphs decreases with smaller ε values, which correspond to smaller ROC AUC results.
To provide a quantitative estimation of the GNNExplainer similarity of our results, we propose and use an Intersection over Union (IoU) score, measuring the pair-wise overlap of edges in the three resulting graphs. The IoU score of two graphs A and B is defined as

IoU(A, B) = |E_A ∩ E_B| / |E_A ∪ E_B|,

where E_X represents the set of all edges in graph X and | · | denotes the cardinality of a set. Table 3 summarises the mean IoU values between the original graph and the GNNExplainer graph based on training with DP, and between the two resulting GNNExplainer graphs from DP-SGD and SGD training. The IoU score of the original graph and the GNNExplainer graph of the model trained with standard SGD is 0.739 for all graphs. We compare the overlap between the graphs with the model performance, reported by the ROC AUC score. We find a high IoU score for DP vs. non-DP models, which is in line with the GNNExplainer plots we observe in Figure 5. Moreover, we observe that the GNNExplainer IoU score between the DP and the non-DP models slightly decreases with smaller ε and smaller ROC AUC scores, see Table 3. The increase in the IoU score between the original graph and the DP model with ε = 0.5 most likely only indicates that the DP-trained model with ε = 0.5 considers more edges relevant than the model trained with ε = 1.0. These qualitative and quantitative GNNExplainer results indicate that our proposed DP graph classification models exhibit strong and similar inductive biases compared to "normal" GNNs while preserving privacy guarantees.
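For clarity, the edge-level IoU score can be computed as sketched below, assuming each (explanation) graph is given as a set of undirected edges:

```python
# Sketch of the proposed IoU score over edge sets. Edge direction is normalised so
# that (u, v) and (v, u) count as the same undirected edge.
def edge_iou(edges_a, edges_b) -> float:
    a = {tuple(sorted(e)) for e in edges_a}
    b = {tuple(sorted(e)) for e in edges_b}
    if not (a | b):
        return 1.0  # two empty graphs overlap trivially
    return len(a & b) / len(a | b)


# Example: the two graphs share two of four distinct edges -> IoU = 0.5.
print(edge_iou({(0, 1), (1, 2), (2, 3)}, {(1, 0), (1, 2), (3, 4)}))
```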

Discussion
Conclusion Our work introduces and evaluates differentially private graph classification, a formal method to offer quantifiable privacy guarantees in applications where sensitive data of individuals is represented as a whole graph. Such contexts include medical data (as shown in our ECG classification example), where DP can enable training of machine learning models while maintaining both regulatory compliance and adherence to ethical standards mandating the protection of health-related data. Our experiments on benchmark and real-world datasets demonstrate that the training of GNNs for graph classification is viable with high utility and tight privacy guarantees. Expectedly, we observe a privacy-performance trade-off, whereby a decrease in the value of ε results in a decline in the accuracy of the model, as demonstrated in Figure 3.
Additionally, we investigate the utilisation of explainability techniques to compare the representations learned by models trained with SGD and DP-SGD. The application of the GNNExplainer indicates that models trained with DP-SGD learn relevant representations similar to those of the non-privately trained models. To quantitatively demonstrate the results of the GNNExplainer, we calculated an IoU score over the edges considered important by the technique in the resulting graphs. We observe an overall high IoU with a slight decline in overlap under tighter privacy guarantees, indicating that, as expected, the high levels of noise required to achieve such guarantees eventually become detrimental to learning.
Limitations and future work Inherent to the concept of differential privacy in machine learning is a performance-to-privacy trade-off. While our experiments visually illustrate the implications of this trade-off and provide insight into its practical importance in the context of machine learning on graphs, the actual relationship between privacy and accuracy is highly task- and user-specific [46,47]. Therefore, we note that one can interpret the value of ε as an additional design parameter that needs to be optimised in order to minimise the adverse effects that DP can have on performance in the context of graph classification (or most other learning tasks in general).
While the GNNExplainer concept can provide initial clues to interpret and explain GNN training and the intrinsic differences between models trained with SGD and DP-SGD, it is only an initial step towards full explainability and interpretability. We consider this a highly relevant and interesting direction for future research. In particular, we aim to investigate the effects of differentially private GNN learning on the adversarial robustness of the model. We hypothesise that, similarly to Euclidean settings [48,49], DP should have a mitigating effect against attacks that diminish the utility of the trained model in the context of machine learning on graphs.
In our experiments we utilise a limited set of standard model architectures (GCN, GraphSAGE, GAT). Evidently, more sophisticated architectures have been designed and deployed to real-world problems. As our proposed approach is general, we assume that an extension to such advanced graph learning models is natural and should exhibit similar behaviour.
Discussion of potential societal impacts We do not foresee any specific negative societal impact of our work. On the contrary, we strongly believe that the implementation of formal techniques for privacy preservation like DP in the setting of GNN training will mitigate the risks of using sensitive data in ML tasks. In the case of medical data (as in the ECG dataset example), we believe the utilisation of privacy-preserving methods also to have a positive effect in terms of encouraging data owners (such as patients) to make their data accessible for research purposes. Evidently, such implementations must go hand in hand with educating potential stakeholders in the correct application of DP mechanisms, including the appropriate choice of parameters like ε. In this work, we rely exclusively on public datasets collected with informed consent or with approval of institutional review boards wherever applicable; thus, the risk of potential leakage of sensitive information during our experiments has been appropriately considered.

A Appendix
Supplementary Material

B Model Architectures and Training Parameters
We here provide a brief overview of the model architectures used for our experiments as well as all corresponding hyperparameters. For each dataset, we utilised a different model architecture, which we determined through hyperparameter searches. Table 4 summarises the hyperparameters used for each experiment. For SGD training on the Molbace dataset we used a cyclic learning rate scheduler with defined lower and upper learning rates as described in Table 4. We did not observe increased performance when applying the cyclic learning rate scheduler to DP-SGD training on this dataset, which is why we did not utilise a learning rate scheduler in those cases. All models are implemented using PyTorch [50] and PyTorch Geometric [51]. For all binary models we use the Binary Cross Entropy loss; for the non-binary classification task of the Fingerprint dataset we use the Cross Entropy loss.

ECG Dataset
The experiments on the ECG dataset were performed with a model architecture consisting of two graph convolutional layers, followed by a max pooling layer and three linear layers. Each convolutional layer and each linear layer is followed by a ReLU activation function. The hidden channels of the graph convolutional layers are 256 and 512, and those of the linear layers 128, 56 and 24. In the non-DP training, all convolutional layers and the max pooling layer are followed by a dropout layer with dropout probability 0.2. We removed all dropout layers for the DP-SGD training, because the noise intrinsic to the algorithm already acts as regularisation. The learning rate and batch size of the SGD training were set to 0.05 and 24, respectively, for all models. The learning rates for the DP-SGD training runs depend on the graph convolution type and can be found in Table 4.
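A sketch of this architecture is shown below (hidden sizes are taken from the description above; the use of global max pooling as the pooling operation, the choice of GCNConv as the graph convolution, and the final single-logit output layer for the binary cross entropy loss are our assumptions):

```python
# Sketch of the ECG model: two graph convolutions (256, 512 channels), global max
# pooling, three linear layers (128, 56, 24) and a single-logit output head.
# The dropout layers are only active in the non-DP (SGD) variant.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_max_pool


class ECGNet(torch.nn.Module):
    def __init__(self, num_node_features=512, dropout=0.2):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, 256)
        self.conv2 = GCNConv(256, 512)
        self.lin1 = torch.nn.Linear(512, 128)
        self.lin2 = torch.nn.Linear(128, 56)
        self.lin3 = torch.nn.Linear(56, 24)
        self.out = torch.nn.Linear(24, 1)  # single logit for the BCE loss
        self.dropout = dropout

    def forward(self, x, edge_index, batch):
        x = F.dropout(F.relu(self.conv1(x, edge_index)), p=self.dropout, training=self.training)
        x = F.dropout(F.relu(self.conv2(x, edge_index)), p=self.dropout, training=self.training)
        x = F.dropout(global_max_pool(x, batch), p=self.dropout, training=self.training)
        x = F.relu(self.lin1(x))
        x = F.relu(self.lin2(x))
        x = F.relu(self.lin3(x))
        return self.out(x)
```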

Molbace Dataset
For the experiments on the Molbace dataset we utilise a neural network with an Instance Normalisation layer, followed by three graph convolution layers, each followed by another Instance Normalisation layer and a ReLU activation function. The convolutional layers are followed by a mean pooling layer and two linear layers with 512 hidden channels. For the non-DP training we use a batch size of 64 and a cyclic learning rate scheduler with upper and lower learning rates noted in Table 4. The learning rates were determined using a learning rate finder. For the DP-SGD training we use a batch size of 24 and a learning rate of 0.1 for all models. We did not use a learning rate scheduler for DP-SGD training, since it did not improve model performance.
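For illustration, the cyclic learning rate schedule can be set up as sketched below (a self-contained toy example: the placeholder module, dummy loss and the base/max learning rates are only stand-ins, the values we actually used are listed in Table 4):

```python
# Sketch of SGD training with a cyclic learning rate schedule, as used for the
# non-private Molbace runs. The learning rate is cycled once per optimisation step.
import torch

model = torch.nn.Linear(9, 1)  # placeholder module standing in for the GNN
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-3, max_lr=1e-1, step_size_up=20, mode="triangular")

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 9)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the learning rate cycle
```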

C Performances of Different Model Architectures
The comparable performance of the different GNN architectures we use for our experiments (GCN, GAT, and GraphSAGE) leads to the conclusion that DP-SGD training is largely independent of the type of graph convolution. For all datasets, we report similar performance for all models, and we show two examples of the corresponding training loss curves in Figures 6 and 7. Figure 6 shows the training loss curves on the ECG dataset for DP-SGD training with a noise multiplier of 0.6 and an L2 clipping bound of 5.0.

D Explainability using GNNExplainer
As noted in the Experimental Results section, we applied the explainability technique GNNExplainer [52] to our trained networks.