SC-FGCL: Self-Adaptive Cluster-Based Federal Graph Contrastive Learning

As a self-supervised learning method, graph contrastive learning (GCL) achieves admirable performance in graph pre-training tasks and can be fine-tuned for multiple downstream tasks such as protein structure prediction and social recommendation. One prerequisite for graph contrastive learning is the support of huge graphs in the training procedure. However, graph data nowadays are distributed across various devices and held by different owners, such as the smart devices in the Internet of Things. Considering the non-negligible costs of computing, storage, and communication, as well as data privacy and other issues, these devices often prefer to keep data locally, which significantly reduces graph contrastive learning performance. In this paper, we propose a novel federal graph contrastive learning framework. First, it updates node embeddings during training by means of a federation method, allowing the local GCL to acquire anchors with richer information. Second, we design a self-adaptive cluster-based server strategy to select the optimal embedding update scheme, which maximizes the richness of the embedding information while avoiding interference from noise. In general, our method builds anchors with richer information through a federated learning approach, thus alleviating the performance degradation of graph contrastive learning caused by distributed storage. Extensive analysis and experimental results demonstrate the superiority of our framework.


I. INTRODUCTION
In recent years, due to the outstanding performance of graph contrastive learning (GCL) in node classification, node clustering, and graph classification, it has attracted widespread attention in tasks such as protein structure prediction [1] and computer vision [2]. As a self-supervised graph representation learning method, GCL addresses the real-world problem that large amounts of graph data are unlabelled and therefore unusable [3], [4]. It learns representations of such graph data as a result of pre-training, and only slight fine-tuning is required when applying them to a specific task, thus saving significant time and equipment costs and providing good transferability [5]. Previous studies have shown that anchors have a decisive influence on the training effect, and that more representative anchors are more conducive to generalized and stable representations [8], [9]. In the case of centralized training, GCL uses the original graph as a proper anchor.
However, due to the large size of real-world graph data [10], [11], it is frequently scattered across multiple devices in general scenarios such as the Internet of Things (IoT) [12], [13], making it difficult to interoperate and resulting in data isolation when privacy protection is taken into account [14], [15], [16], [17]. This causes the absence of structural and feature information in the graphs stored on each device [18], which leads to a lack of representative ability compared with the original graph [19]. The corresponding anchors can then cause a significant reduction in the effectiveness of GCL.
In recent years, federated learning approaches [20] have been proposed and have flourished, aiming to solve the distributed learning of deep neural networks [21]. They usually aggregate neural network parameters across clients by weighted averaging and follow an iterative mode to train models without fusing all data at a central server. However, previous efforts in federated learning are of limited use for GCL, where specific designs towards mitigating the loss of anchor information are imperative.
To solve the above problem, this paper proposes Self-adaptive Cluster-based Federal Graph Contrastive Learning (SC-FGCL), which allows local devices to federate with other clients to update the node embeddings produced by the clients' network layers. SC-FGCL provides local clients with critical clues for obtaining anchors with richer information and thus better GCL results. Simultaneously, the self-adaptive clustering method on the server automatically explores the best embedding update solution in each round, enabling each embedding to maximize its own information enrichment while avoiding interference from other noisy node embeddings.
In SC-FGCL, each device is assumed to hold a graph locally, which can be regarded as a subgraph of a global graph with some node overlap between devices. We then propose a novel federal GCL framework in which local node embeddings are uploaded to the server, each node embedding is updated jointly on the server side, and the server returns the result to serve as the local GCL anchor. We present the corresponding paradigm for local training, feature updating, and communication, as well as an analysis of communication and local storage consumption.
To further improve GCL performance, we also introduce a self-adaptive clustering method in SC-FGCL for selecting the best update solution for each node on the server side. The method recursively selects the embeddings with the highest similarity and tightness for the node to be updated, adopting an internal clustering evaluation metric as the criterion. In addition, the threshold for this metric is selected according to the Calinski-Harabasz score: the server automatically computes the score and selects the update policy with the higher value as the final update policy.
Finally, our framework ensures privacy, using differential privacy to protect data during transfer [22], [23], [24]. We conduct experiments on three benchmark datasets, and the results show the superiority of our approach.
The main contributions of this paper are as follows: 1) We propose a novel federal GCL framework that obtains anchors with richer information by jointly updating node embeddings, effectively mitigating the GCL performance degradation caused by data isolation. 2) To enable node embeddings to be updated with maximally rich information while reducing the interference of noisy embeddings, we propose a self-adaptive clustering method that automatically selects the optimal update scheme for node embeddings. 3) Extensive evaluation results show that our framework is superior to the baselines.

II. RELATED WORK

A. GRAPH CONTRASTIVE LEARNING
Inspired by contrastive learning (CL), GCL has received wide attention in recent years; it performs contrastive learning on graph data on top of GNNs. Veličković et al. proposed DGI (Deep Graph Infomax) [25], which uses the concept of DIM (Deep InfoMax) [7] to optimize the InfoNCE loss in order to maximize mutual information between local and global node representations. Following that, Sun et al. proposed InfoGraph as an extension of DGI [26]. In contrast to DGI, which focuses on node-level GCL, InfoGraph focuses on graph-level GCL. It maximizes mutual information between graph-level and substructure representations at different scales (nodes, edges, triangles). Furthermore, Sun et al. proposed InfoGraph* to extend InfoGraph to semi-supervised settings. You et al. proposed GraphCL [6], currently the best-known GCL method. Based on SimCLR [27], GraphCL obtains two views of the original graph by data augmentation and also adopts DIM theory for optimization. The loss function of GraphCL follows the normalized temperature-scaled cross entropy loss (NT-Xent) [28], [29], [30], which is the most commonly used GCL loss at present. GraphCL offers rich data augmentation methods (node drop, edge drop, feature mask) and contrastive levels (node-level, graph-level, subgraph-level), which have had a profound impact on subsequent GCL research.
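The NT-Xent loss mentioned above can be sketched for a single anchor-positive pair as follows (a minimal pure-Python illustration; the function names and the temperature default are our own, not GraphCL's exact implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(anchor, positive, negatives, tau=0.5):
    """NT-Xent for one anchor: -log of the temperature-scaled softmax of
    the positive similarity against positive + negative similarities."""
    pos = math.exp(cosine(anchor, positive) / tau)
    negs = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + negs))
```

A well-aligned positive pair yields a small loss; rotating the positive away from the anchor increases it.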
Current work in GCL focuses on two main aspects: data augmentation approaches, and contrastive approaches.
For data augmentation approaches, Zhu et al. proposed GCA [31], which bases the probability of removing an edge on node centrality in order to preserve the more valuable structural information of the graph. Hassani et al. proposed MVGRL [32], which introduces a multi-view approach. Recently, some excellent work has also emerged on learnable data augmentation approaches [3], [33], [34].
For contrastive learning methods, Qiu et al. proposed GCC [35] to improve transferability via subgraph instance discrimination. In [36], Xu et al. proposed GraphLoG, which uses hierarchical prototypes to capture global semantic clusters while preserving local similarity, and employs EM for efficient learning. Furthermore, some recent research focuses on how negative examples are chosen, e.g., [37], [38]. Meanwhile, inspired by [39], [40], the work in [41] focuses on learning models without negative samples.

B. FEDERATED LEARNING
Federated learning was developed to address the machine learning performance degradation caused by data isolation, with the goal of achieving co-training while maintaining privacy and security. FedAvg [20], the industry and academic standard for federated learning, is still widely used to date. During training, factors such as network parameter weights, gradients, or losses of the local models are averaged on the server, eliminating the need to expose local data.
On this basis, various approaches to improving FedAvg have emerged. The main issues researched in the field of federated learning are optimization approaches and privacy protection.
Li et al. proposed FedProx [42] for federated learning algorithm optimization, which allows different workloads to be performed by taking the performance of different devices into account. FedMA [43] was proposed by Wang et al. to build global models layer-wise by matching and averaging hidden elements with similar features.
Currently, only [47] and [48] work on federal GCL, but they focus on data heterogeneity and on reducing, via graph contrastive methods, the structural interference that differential privacy introduces to graph federated learning. Our work instead focuses on addressing the limitations that distributed storage imposes on GCL through federated learning.

III. PROBLEM DEFINITION

A. SYSTEM SETTINGS
The system consists of one server and N clients C : {C_1, C_2, . . ., C_N}, where each client C_i holds a local graph G_i : {V_i, E_i} with node set V_i and edge set E_i. Accordingly, the adjacency matrix A_i can be obtained from G_i. We assume that nodes are partially overlapped between clients, i.e., V_i ∩ V_j ≠ ∅ for some i ≠ j.
All devices federate with the others to train their own local GCL pre-training models. The training is initiated by the server. At the beginning, the server distributes initialized network parameters to the individual clients. During the training process, clients first perform forward propagation locally and upload the results to the server. Then, the server aggregates the results and sends them back to the corresponding clients. Finally, clients update their local network parameters according to the aggregated results.

B. SECURITY
Federated learning necessitates frequent message exchanges between all devices. As in FedAvg, gradients and network weights are used as transmission parameters. In this process, malicious and semi-honest clients and servers may capture changes in the transmissions and thus infer the original data through differential attacks [49]. To protect privacy, we use differential privacy, which makes it difficult to capture a change to a single element in a local node embedding.
Differential privacy is defined as follows [22]: neighboring datasets are two datasets D and D′ that differ by only one record. A function f maps the dataset D to an abstract range R, f : D → R. The maximum difference between the mapping results on neighboring datasets is defined as the sensitivity Δf. A mechanism M is a randomized algorithm that transforms the result of f.

Definition 1: (ε, δ)-differential privacy [50]. For any neighboring datasets D and D′, and for every set of outputs S, a randomized mechanism M gives (ε, δ)-differential privacy if M satisfies:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.

C. DESIGN OBJECTIVE
The purpose of each client during training is to learn a local representation of its nodes via contrastive learning. Our optimization goal is to minimize the node contrastive loss over all clients:

arg min_{θ_1, . . ., θ_N} Σ_{i=1}^{N} (|V_i| / |V|) · L_i(θ_i),

where |V_i| denotes the number of nodes of G_i, |V| = Σ_{i=1}^{N} |V_i|, and L_i(θ_i) denotes the value of the loss function of C_i under C_i's local neural network parameters θ_i. The notations mentioned above are shown in Table 1.

IV. FRAMEWORK
In this section, we first briefly outline the general training process. Next, we detail the client local training method and the server aggregation strategy. Finally, the framework's communication mechanism is defined and analyzed.

A. OVERVIEW OF FEDERAL GRAPH CONTRASTIVE LEARNING
Motivated by the performance degradation of GCL under distributed storage, this framework lets multiple clients jointly update their local embeddings and use the updated embeddings as anchors for local GCL. The overall process of the framework is shown in Fig. 1. In each training round, a client C_i ∈ C forward-propagates its graph data to get the local embedding h_i. Then, the clients send their local embeddings to the server, which obtains h_S : {h_1, h_2, . . ., h_N}. On the server, h_S is updated by the self-adaptive aggregation method to get h′_S : {h′_1, h′_2, . . ., h′_N}. After that, the server sends the updated embeddings back to the corresponding clients. Finally, client C_i receives the updated embedding h′_i, uses it as an anchor for contrastive learning, and computes the local loss. Training continues until the local models converge.
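The round described above can be sketched structurally as follows (the `forward`/`step` client interface and the `server_update` hook are hypothetical names for illustration, not the paper's API):

```python
def federated_round(clients, server_update):
    """One SC-FGCL training round: clients forward-propagate locally, the
    server jointly updates the uploaded embeddings, and each client uses
    its updated embeddings as anchors for a local contrastive-loss step."""
    uploaded = {c.cid: c.forward() for c in clients}   # h_S on the server
    updated = server_update(uploaded)                  # h'_S
    return {c.cid: c.step(updated[c.cid]) for c in clients}
```

With an identity `server_update`, each client's anchor equals its own embedding, which is the degenerate (purely local) case the framework improves upon.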
At the start of training, the server delivers the same initial graph encoder and projection head network parameters to all clients. In addition, the data transmitted between clients and the server is protected with differential privacy.

B. CLIENT GRAPH CONTRASTIVE LEARNING MODEL
As shown in Fig. 2, in the k-th training round for the local client C_i, G_i is first propagated forward to obtain the node representations:

h^k_{i,j} = g(f(G_i))_j,

where f(·) is the graph encoder, g(·) is the projection head function, and h^k_{i,j} is the original embedding of node j in client C_i at the k-th round of training. The node embedding sequence of client C_i is then h^k_i : {h^k_{i,1}, . . ., h^k_{i,|V_i|}}. After uploading h^k_i to the server, C_i receives the updated embeddings h′^k_i from the server, where h′^k_{i,j} and h^k_{i,j} form a positive sample pair, while h′^k_{i,j} and the other nodes in h^k_i form negative sample pairs. The loss function follows the normalized temperature-scaled cross entropy loss mentioned in Section II-A; the loss for node v_{i,j} in the k-th round is defined as:

ℓ^k_{i,j} = −log [ exp(sim(h′^k_{i,j}, h^k_{i,j}) / τ) / Σ_{t=1}^{|V_i|} exp(sim(h′^k_{i,j}, h^k_{i,t}) / τ) ],

where τ is the temperature parameter and sim(·) is defined as:

sim(u, v) = u⊤v / (‖u‖ · ‖v‖).

The loss function of C_i in the k-th round is the average over its nodes:

L^k_i = (1 / |V_i|) Σ_{j=1}^{|V_i|} ℓ^k_{i,j}.

Unlike the usual GCL methods, which use the original graph as the anchor, our framework uses the updated embeddings returned by the server as the anchor. This enables the local client to learn from anchors with richer information, thereby alleviating the contrastive learning performance degradation caused by distributed storage.

C. SERVER AGGREGATION MODEL
When the server receives the embeddings uploaded by the clients, it needs to update them jointly. We design an update mechanism with self-adaptive capabilities, as shown in Fig. 3. It aggregates similar embeddings and iteratively adjusts the aggregation radius by computing a tightness score until the optimal radius is found.
To evaluate tightness through clustering quality, we introduce the Davies-Bouldin Index (DB) and the Calinski-Harabasz Index (CH) as evaluation indices.

Definition 2: Davies-Bouldin Index [51]. For an embedding e and an embedding set D, where all embeddings in D are considered as one cluster,

S_{e,D} = l(e, mean(D)) / s(D),

where s(D) is cluster D's diameter, mean(D) is the cluster centre, and l(e, mean(D)) is the distance between e and mean(D).

DB measures intra-class closeness: a smaller value of S_{e,D} means it is more reasonable for e to join the cluster.
Definition 3: Calinski-Harabasz Index [52].

CH(K) = [ B(K) / (K − 1) ] / [ W(K) / (N − K) ],

where K is the number of clusters, N is the total number of samples in the clusters, B(K) is the inter-cluster divergence, and W(K) is the intra-cluster divergence. CH measures the overall clustering effect; a higher CH indicates better global clustering.
The detailed process of the server aggregation algorithm is as follows:

1) For the embeddings received in the k-th round, compute each embedding's similarity to all the others and arrange them in descending order to obtain the Similarity Sequence SimSeq_{i,j}, where e_r denotes the embedding of the node with rank r in SimSeq_{i,j}.

2) The set D_{i,j} of embeddings for updating h^k_{i,j} is obtained recursively by sequentially querying embeddings from SimSeq_{i,j}.

Definition 4: Decision recursive formula. For selecting the optimal set for updating a cluster from adjacent embeddings: given a sensitivity value ε, end the recursion at rank γ when

min { S_{e_γ, D(γ−1)}, S_{e_γ, D(γ−2)} } ≥ ε

to obtain D_{i,j}, where D(γ−1) and D(γ−2) denote the candidate sets after the previous one and two iterations, respectively.

3) Perform mean pooling on the embeddings in D_{i,j} to update the initial embedding.

4) If ε is too small, insufficient information is obtained; if ε is too large, the update contains noisy information. To find the optimal sensitivity value, we first set a small ε and, once all updates are finished, cluster the updated embeddings; in this paper, mean-shift clustering [53] is used as an example. We then compute CH based on the clustering results.
Next, we gradually scale up ε by the amplification factor μ and repeat the above process until the current clustering CH becomes smaller than the previous one. The previous ε is then taken as the optimal sensitivity value for this round.
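Steps 1-3, with the Definition 4 stop condition, can be sketched as below. The tightness score here (distance of a candidate to the cluster centre over the cluster's mean radius) is our own stand-in for the DB-based S_{e,D}; only its qualitative behaviour (smaller = better fit) is assumed from the text:

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def spread_score(e, cluster):
    """Stand-in tightness score (assumption, not the paper's exact
    S_{e,D}): distance of e to the cluster centre over the cluster's
    mean radius; smaller means e fits the cluster better."""
    d = len(e)
    centre = [sum(c[k] for c in cluster) / len(cluster) for k in range(d)]
    radius = sum(dist(c, centre) for c in cluster) / len(cluster) + 1e-9
    return dist(e, centre) / radius

def update_embedding(j, embeddings, eps, score=spread_score):
    """Server-side update of embedding j: build the similarity sequence,
    grow the update set until the score reaches eps, then mean-pool."""
    target = embeddings[j]
    order = sorted((i for i in range(len(embeddings)) if i != j),
                   key=lambda i: dist(embeddings[i], target))
    cluster = [target, embeddings[order[0]]]   # recursion starts at r = 2
    for r in order[1:]:
        if score(embeddings[r], cluster) >= eps:
            break                              # Definition 4 stop condition
        cluster.append(embeddings[r])
    d = len(target)
    return [sum(e[k] for e in cluster) / len(cluster) for k in range(d)]
```

The ε search of step 4 then wraps this routine: update every embedding, cluster the results, compute CH, and enlarge ε by μ until CH stops improving.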
The procedure of server aggregation is given in Algorithm 1. The time complexity of Algorithm 1 is O(n³ + n² · m), where n is the number of input embeddings and m is the embedding dimension.

Algorithm 1: Server Aggregation Algorithm.
INPUT: training round k; number of clients N; k-th round initial embedding set of the server h^k : {h^k_{1,1}, . . ., h^k_{N,|V_N|}}; initial sensitivity value ε; amplification factor μ; distance function dis(·); initial iteration round r = 2; initial loop counter l = 1.

V. ANALYSIS

In this section, we conduct a communication security analysis as well as a framework efficiency analysis.

A. SECURITY ANALYSIS
To avoid privacy leakage during communication, we apply differential privacy in the communication process. To secure privacy while adding less noise and gaining more flexibility, we adopt the Gaussian mechanism, defined as follows.

Theorem 1: Gaussian mechanism [22]. For a function f : D → R over a dataset D, the mechanism M in (2) provides (ε, δ)-differential privacy:

M(D) = f(D) + Y,    (2)

where Y is Gaussian noise satisfying the condition in (3):

Y ∼ N(0, σ²),  σ ≥ Δf · √(2 ln(1.25/δ)) / ε.    (3)

For all transmitted data, the (ε, δ)-differential privacy defined in Definition 1 is satisfied by adding Gaussian noise meeting the above conditions.
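A minimal sketch of this mechanism applied to one transmitted embedding, using the standard noise calibration from (3) (the function and parameter names are ours):

```python
import math
import random

def gaussian_mechanism(embedding, sensitivity, eps, delta):
    """Perturb each coordinate with N(0, sigma^2) noise, where
    sigma >= sensitivity * sqrt(2 * ln(1.25/delta)) / eps, the standard
    Gaussian-mechanism calibration for (eps, delta)-DP."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps
    return [x + random.gauss(0.0, sigma) for x in embedding]
```

Each client would apply this to its embedding sequence before uploading, and the server to the updated embeddings before returning them.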

B. EFFICIENCY ANALYSIS
In this section, we compare our method with a vanilla method (GraphCL with FedAvg) in terms of communication efficiency and local storage consumption to analyze the applicability scenarios of our framework.

1) COMMUNICATION EFFICIENCY
To ensure a fair comparison, we set the neural network layers of GraphCL to be consistent with the framework of this paper. Assume that the local GraphCL of device D_i has m network layers in total, with parameter dimensions F_i × dim_1, dim_1 × dim_2, . . ., dim_{m−1} × dim_m. The gradients have the same size as the parameters.
When the gradients of GraphCL are federated with the FedAvg method, the size of the data that device D_i needs to transfer in each round is:

F_i · dim_1 + Σ_{t=2}^{m} dim_{t−1} · dim_t.    (4)

In contrast, the SC-FGCL method transmits node embeddings, so the size of the data to be transmitted is:

|V_i| · dim_m.    (5)

Comparing (4) and (5), our method is insensitive to the feature dimensionality and the size of the network parameters, depending only on the number of local nodes and the embedding dimension, while GraphCL with FedAvg is the opposite.
Under distributed storage, the number of nodes stored by each client tends to be significantly smaller than in the centralized case, so SC-FGCL better supports high-dimensional feature data and multi-layer networks, and its communication cost is not limited by the network structure.

2) LOCAL STORAGE SPACE CONSUMPTION
For a device D_i holding G_i : {N_i, E_i}, with E_i stored in the form of an adjacency matrix, the storage space occupied by the origin graph is:

M_o = |N_i|² + |N_i| · F_i.    (6)

GraphCL generally derives two augmented graphs from the origin graph, and we discuss the storage consumption under three common augmentation approaches.

Node mask: with node retention probability α ∈ (0, 1), each augmented graph keeps α|N_i| nodes, so the storage consumption is:

M_n = M_o + 2[(α|N_i|)² + α|N_i| · F_i].    (7)

Feature mask: with feature retention probability β ∈ (0, 1), each augmented graph keeps the full adjacency matrix but only a β fraction of the features, so the storage consumption is:

M_f = M_o + 2(|N_i|² + β|N_i| · F_i).    (8)

Subgraph: with subgraph node retention ratio η ∈ (0, 1), the storage consumption is analogous to (7):

M_s = M_o + 2[(η|N_i|)² + η|N_i| · F_i].    (9)

The only data that SC-FGCL needs to keep locally is the origin graph, so by (6)-(9), SC-FGCL always consumes less local storage than GraphCL.
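A small sketch of these storage estimates, counted in numbers of stored values (the exact closed forms are our assumption: adjacency entries plus feature entries, original plus two augmented graphs):

```python
def origin_storage(n, f):
    """Adjacency matrix (n*n) plus node features (n*f)."""
    return n * n + n * f

def node_mask_storage(n, f, alpha):
    """Origin graph plus two augmented graphs that each keep
    alpha*n nodes (alpha is the node retention probability)."""
    m = int(alpha * n)
    return origin_storage(n, f) + 2 * origin_storage(m, f)

def feature_mask_storage(n, f, beta):
    """Origin graph plus two augmented graphs keeping the full
    adjacency matrix but only a beta fraction of the features."""
    return origin_storage(n, f) + 2 * (n * n + beta * n * f)
```

Since SC-FGCL stores only the origin graph, its local footprint is `origin_storage(n, f)`, strictly below either augmented variant.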

VI. EXPERIMENTS

A. DATASETS
Cora, Pubmed, and Citeseer [54] are three benchmark datasets commonly used in graph neural network research. We partition the node, edge, and feature data across different clients to simulate data isolation. To validate the effectiveness of GCL as a pre-training method, we select node classification as the downstream task and use the downstream task accuracy as the evaluation criterion. Details of the three datasets are shown in Table 2.

B. COMPARISON METHODS
As there was no systematic approach focused on mitigating the performance degradation of local GCL through a federal approach before our work, we compare the downstream task accuracy of the following cases under the same training settings: 1) Centralized training: centralized training to obtain pre-trained models. 2) Local training: local training via GraphCL only. 3) Baseline: network gradient averaging via FedAvg in each round while training locally via GraphCL. 4) Our method. We compare the results of centralized training and local training with GraphCL to verify that data isolation reduces the effectiveness of GCL, and we use GraphCL+FedAvg as the baseline against which to compare the training effect on the local clients.

C. PARAMETER SETTING
In our experiments, the total number of clients is 3; the (ε, δ)-differential privacy parameters are set to ε ∈ {1, 2, 4, 8, 16, 32, 64, 128} and δ = 10⁻⁴; the learning rate for local GCL is 0.01. A multi-layer perceptron classifier is used for node classification, consisting of a fully connected layer and a ReLU activation layer, with a learning rate of 0.01. Table 3 shows the experimental results of the node classification task on the different datasets. Furthermore, for visual analysis, we select the results on the Cora set and plot the accuracy curves for all clients, as shown in Fig. 4.

D. ANALYSIS OF RESULTS
As shown in Fig. 4, the prediction accuracy of centralized training is consistently and significantly higher than that of distributed training, and it converges faster. According to Table 3, the prediction accuracy in the centralized case is 10%-20% higher than in the distributed one. This indicates that distributed storage of data can significantly reduce the effectiveness of GCL.
For the baseline, gradients are averaged in each round during training. According to the results in Table 3, FedAvg brings a 1%-5% improvement over distributed GraphCL. As shown in Fig. 4, convergence is also slightly faster when FedAvg is used.
The SC-FGCL method shows superior performance in terms of both accuracy and convergence speed. As shown in Table 3(a), SC-FGCL achieves a 4%-7% improvement over the baseline on the Cora set, and on the Citeseer and Pubmed sets its results are close to those of centralized training. The curves in Fig. 4 show that SC-FGCL converges significantly faster than the other methods, with a convergence speed similar to the centralized method. This demonstrates the significant role of SC-FGCL in mitigating the reduced GCL effectiveness due to distributed storage, as well as its superior performance as a federal GCL framework.

TABLE 3. Accuracy of Node Classification for Different Methods
To verify the effectiveness of the server aggregation scheme proposed in this paper, we conduct ablation experiments on the three datasets and select the results of client 1 for observational analysis. We show the results in Table 4. When no recursive decision is made, the top-10 embeddings in the similarity sequence are used to update the original embeddings, and the optimal clustering sensitivity value is otherwise selected via CH.
According to Table 4, the CH-index-binding-only approach outperforms the no-strategy approach, while the recursive-decision-only approach outperforms the CH-index-binding-only approach, and the result is optimal when both approaches are combined. This shows that the recursive decision method has the more significant utility and that CH-index binding acts as an adjustment scheme that assists the recursive decision method in achieving the optimal value.

VII. CONCLUSION
In this paper, we propose the Self-adaptive Cluster-based Federal Graph Contrastive Learning (SC-FGCL) framework, which unites all clients to update node embeddings used as local GCL anchors, mitigating the weakening of GCL caused by distributedly stored graph data and thus yielding better GCL results. Meanwhile, to enable the server to find the optimal update solution, we design a clustering method with self-adaptive capabilities, which allows each embedding to be updated in a way that maximizes the enrichment of its own information while preventing interference from noisy embeddings. Experimental results on multiple graph datasets show that our method significantly outperforms the comparative baselines and that the self-adaptive strategy yields better performance.