Fast and Privacy-Preserving Federated Joint Estimator of Multi-sUGMs

Learning multiple related graphs from many distributed, privacy-sensitive data sources is an important and common task in neuroscience applications. By analyzing the commonalities and differences of the brain connectomes predicted from fMRI data across multiple hospitals, medical researchers can comprehensively investigate diagnostic evidence and understand the causes of certain brain diseases. Previous sparse Undirected Graphical Model (sUGM) methods either cannot make full use of the heterogeneous data while preserving privacy or lack the capability to handle nonparanormal data, which are highly non-independent and identically distributed (non-i.i.d.). This paper proposes a novel and efficient approach, FEDJEM (federated joint estimator of multiple sUGMs), which trains the multiple sUGMs over a massive network encompassing various local devices and a global center. To efficiently process datasets with different nonparanormal distributions, the proposed federated algorithm fully exploits the computing power of the local devices and the cloud center, while federated updates ensure that personal data remain local, thus preserving privacy. We also implement a general federated learning framework for multi-task learning based on our method. We apply our method to multiple simulation datasets to evaluate its speed and accuracy against relevant baselines, and accordingly develop a strategy to balance its computation and communication costs. Finally, we predict several informative groups of connectomes from a real-world dataset.


I. INTRODUCTION
There has been a rapid revolution in collecting massive heterogeneous data [1], [2] across many scientific fields in recent years. For example, different hospitals have been constantly collecting medical data and information on Alzheimer's patients of different races, genders, ages, and regions with widely used and advanced medical devices. Learning multiple related graphs from such heterogeneous data has become an important task. For instance, we can learn multiple related brain connectomes of Alzheimer's patients from several fMRI datasets obtained from different hospitals. Those connectomes can help researchers better understand the characteristics and underlying causes of the disease. Unlike transfer learning, learning multiple related networks facilitates the exploration of the similarities and differences of connectomes without transferring datasets across related tasks. Doctors can investigate diagnostic evidence based on the shared features revealed by the commonalities of the connectomes. In addition, novel causes of a disease may emerge when analyzing the differences from normal connectomes [3].
FIGURE 1. An illustration of the training model. Multiple related graphs are inferred from the heterogeneous datasets over a three-step process. 1) We calculate the Kendall's tau correlation estimates of the fMRI data to obtain the correlation matrices. 2) We apply FEDJEM to the joint estimation of the precision matrices. 3) We recover the brain connectomes by decoding the sparsity pattern of the precision matrices.

Traditionally, the Ising model [4], a method based on binary Markov random fields, has been applied to such tasks. However, learning multiple related graphs from distributed medical data faces three major challenges:
• Trade-off between performance and communication: Transmitting the raw datasets to a central server imposes an unacceptable communication load due to the high data capacity [6], [7]. Applying fewer data samples is an intuitive way to resolve this problem, but it drives down model performance [8]. In most cases, it is not possible to obtain high performance and low communication cost simultaneously.
• Privacy preservation: Patient data typically contain personal information [9]. Collecting such data in the cloud center can cause serious privacy problems. In most circumstances, researchers are not allowed to directly access raw data from hospital databases [10] due to the high risk of leaking private information.
• Non-independent and identically distributed (non-i.i.d.) datasets: To enhance the heterogeneity of the data, it is necessary to collect massive data from different sources; this gives the datasets disparate distributions. Exploiting the training data becomes a serious bottleneck when the datasets from different tasks are not independent and identically distributed, e.g., when one dataset follows a Gaussian distribution but another does not. Many existing methods require the datasets from different tasks to follow the same or similar distributions, and their performance degrades sharply when applied to non-i.i.d. data.
To overcome these obstacles, researchers have developed a series of sparse Gaussian Graphical Model (sGGM) methods that can effectively predict graphs with relatively few data samples under the Gaussian assumption. The graph structure is hidden in the inverse of the covariance matrix, namely, the precision matrix (i.e., Ω = Σ^(-1)). GLasso [11], [12], for example, infers a graph by decoding the sparsity pattern of the precision matrix; the ℓ1-penalized log-likelihood method is applied to obtain the precision matrix. Unlike GLasso, CLIME estimates the inverse matrix by optimizing an ℓ1-constrained problem with respect to the precision matrix. These models overcome the communication and privacy challenges successfully because they learn each model on each dataset individually. Each model can be trained on local devices without requiring data transmission (i.e., with no communication cost), thus protecting patient privacy [13]. However, such individually trained models still tend to encounter the first obstacle described above; they also tend to have inferior performance compared to models based on heterogeneous data.
In an effort to resolve the model performance issue, researchers proposed multi-task learning [14] in 1997. This approach has since been applied throughout various machine-learning domains. Previous researchers have also introduced multi-task learning to improve the generalization of the single sGGM in response to the data scarcity problem. The studies in [15]-[24] proposed multi-task sGGMs, which jointly estimate K different but related sGGMs. These methods enhance the generalization ability of the model, but require all data to be collected in the cloud center. As mentioned above, this implies a significant risk that private data might be leaked, as well as a heavy communication load that causes unacceptable transmission times. In addition, current multi-task sGGMs assume that all the data from different tasks follow the Gaussian distribution, which is not the case in most real-world applications. In general, most multi-task sGGM methods fail to escape this dilemma and still struggle with the last two challenges.
In this study, we develop a novel model, the federated joint estimator of multiple sparse Undirected Graphical Models (FEDJEM), to estimate multiple sparse Undirected Graphical Models (sUGMs) jointly via a federated learning algorithm. We design this model to address the three challenges discussed above. The model allows local data to be processed and computed on local devices, which then communicate their updates to the cloud center in order to train a global graphical model. FEDJEM (illustrated in Fig. 1) can also handle multivariate nonparanormal data, relaxing the normality assumption that most real-world non-i.i.d. data do not satisfy. Our contributions can be summarized as follows:
• Novel model: We present the novel, privacy-preserving FEDJEM, a federated multi-task sUGM method, wherein a federated update algorithm separates the computation process between local devices and a global center. We design the model to safeguard private information, as only model parameters are transmitted in this process. All personal data are stored and processed locally prior to model training.
• Novel relaxation: Considering the properties of real-world non-i.i.d. data, our method can manage data with a nonparanormal distribution, a much larger superset of the Gaussian distribution.
• Swiftness and efficiency: Our method is a fast and efficient federated algorithm. The computing power of local devices is harnessed to minimize the computation load on the cloud center. The transmitted data are relatively small because only parameter updates are communicated between the edge devices and the center.
• General framework: We implement a general federated learning framework for multi-task learning based on our method. This framework allows others to easily utilize federated multi-task learning while focusing only on the implementation of local and global updates. Our code can be found at https://github.com/MahjongGod-Saki/FEDJEM.

The rest of this paper is organized as follows. Section II provides the relevant background material. Section III introduces our method in detail and discusses its main properties. Section IV analyzes its computation and communication costs and its convergence. Section V presents previous work related to our method. Our experimental results are discussed in Section VI. Section VII suggests several potential strategies to balance local and global computation and communication time in practice and provides some ideas for future work. Section VIII provides a brief summary and concluding remarks.

A. NOTATIONS
We choose X^(i) ∈ R^(n_i×p) to represent the i-th dataset with n_i samples and p features. {X^(1), X^(2), ..., X^(K)} denotes the K datasets generated by K different tasks and nodes. Σ^(i) ∈ R^(p×p) represents the covariance matrix of the dataset X^(i). {Σ^(1), ..., Σ^(K)} and {Ω^(1), ..., Ω^(K)} are the sets of covariance matrices and precision matrices corresponding to the datasets. We list the notations used in this paper in Table 1.

II. BACKGROUND

A. SPARSE GAUSSIAN GRAPHICAL MODEL
A single-task sparse Gaussian Graphical Model (sGGM) assumes that data samples follow a normal distribution N(μ, Σ) with mean vector μ and covariance matrix Σ. The graphical lasso (GLasso) is a penalized maximum likelihood estimator for precision matrix inference. The model can be written as:

Ω̂ = argmin_{Ω≻0} L(Ω) + λ‖Ω‖_1,  L(Ω) = −log det(Ω) + ⟨Σ̂, Ω⟩,   (1)

where L(Ω) is the (negative) log-likelihood function of Ω and Σ̂ is the sample covariance matrix.
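For concreteness, the following is a minimal single-task sketch of (1) using scikit-learn's off-the-shelf GraphicalLasso estimator (an illustration only, not part of our implementation); the nonzero off-diagonal pattern of the fitted precision matrix encodes the graph:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=200)  # toy Gaussian data

model = GraphicalLasso(alpha=0.1)   # alpha plays the role of lambda in (1)
model.fit(X)
Omega = model.precision_            # estimated precision matrix
edges = np.abs(Omega) > 1e-8        # graph edges: nonzero off-diagonal entries
np.fill_diagonal(edges, False)
print(np.argwhere(edges))
```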

B. FEDERATED MULTI-TASK LEARNING
The federated learning problem involves training a global machine-learning model from data stored locally, i.e., on multiple remote devices. The goal of this learning strategy is to store and process the data generated by the devices locally; only the intermediate parameter updates are communicated periodically, utilizing central computing power. The typical optimization problem of federated learning is:

min_ω Σ_{i=1}^{K} p_i L_i(ω),   (2)

where L_i is the objective function of the i-th device and p_i is its weight, with p_i ≥ 0 and Σ_i p_i = 1.

Federated learning comes with statistical challenges in regards to training machine-learning models, mainly due to the variability of the number of data points on each device. A single global model cannot capture every piece of local knowledge, so we should naturally obtain separate models for each node rather than training a single global model across the network. [25] shows that a combination of federated learning and multi-task learning, namely federated multi-task learning, performs significantly better. The federated multi-task learning framework can be formalized as follows:

min_{ω_1,...,ω_m} Σ_{i=1}^{m} p_i L_i(ω_i) + R(ω),   (3)

where ω = (ω_1, ω_2, ..., ω_m), p_i ≥ 0, and Σ_i p_i = 1.
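As a toy illustration of evaluating (3) (not the paper's R_total; the coupling term here is a hypothetical squared deviation from the mean parameter, chosen only to show how per-task losses and a similarity regularizer combine):

```python
import numpy as np

def federated_mtl_objective(local_losses, weights, params, coupling):
    """Evaluate sum_i p_i * L_i(omega_i) + R(omega) for the sketch above.

    local_losses: list of callables, L_i(omega_i) -> float
    weights:      list of p_i with p_i >= 0 and sum(p_i) == 1
    params:       list of per-task parameter vectors omega_i (equal shapes)
    coupling:     strength of the illustrative similarity penalty R
    """
    fit = sum(p * L(w) for p, L, w in zip(weights, local_losses, params))
    mean = np.mean(params, axis=0)
    R = coupling * sum(np.sum((w - mean) ** 2) for w in params)
    return fit + R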

III. METHOD: FEDERATED JOINT ESTIMATOR OF MULTIPLE sUGMs
We focus here on a federated multi-task undirected graphical model problem. To resolve this problem, we design the proposed method around four distinct properties. 1) Various data are stored and processed on different local devices in a distributed environment, so every node trains its own model locally and then communicates its updates to the center in order to train a global model. The model parameters are updated in a federated manner. 2) The collected data are highly non-i.i.d. due to the differences among storage devices and data sources. 3) Personal data are kept safe because each node does not communicate its data with other devices and instead processes it locally. 4) Compared with the single-task graphical model, our federated multi-task graphical model fuses both the global similarity knowledge and the differences between tasks. This encourages better performance on real-world data, as evidenced by our experimental results. Fig. 2 shows the detailed flow diagram of our proposed method.

FIGURE 2. The flow diagram of the proposed method. As the proposed method is applied, each hospital stores and processes the fMRI data of its patients locally. We estimate the precision matrix Ω^(i)(t) of each task in the t-th iteration through the local update step on local devices. Next, we communicate the updated precision matrix to the cloud center. The center updates the variables Z^(i) and U^(i).

Our goal is to estimate multiple related graphs {Ω^(i)}. Based on (3), our federated multi-task learning framework can be formalized as follows:

min_{{Ω^(i)}} Σ_{i=1}^{K} p_i L_i(Ω^(i)) + R_total(Ω^(1), ..., Ω^(K)).   (4)

Three components remain to be determined in the above equation: 1) the objective function L_i(·), which is related to the log-likelihood function mentioned in (1); 2) the weight p_i, which should be associated with the number of samples; and 3) the total regularization function R_total(·), which should be able to capture both the sparsity pattern and the heterogeneity of the precision matrices.
Based on (1), we first apply the log-likelihood function to design L_i(Ω^(i)) = N · L(Ω^(i)), where L(Ω^(i)) is as in (1) and N = Σ_i n_i. To make the effective weight of each task proportional to its sample size, we set p_i = n_i/N, so that p_i L_i(Ω^(i)) = n_i L(Ω^(i)). We choose the ℓ1 norm as our first regularization function to constrain the sparsity of every precision matrix Ω^(i). The second regularization function R(·) enforces the group sparsity or similarity of all the precision matrices {Ω^(i)}. From (4), our proposed method can be represented as:

min_{{Ω^(i)≻0}} Σ_{i=1}^{K} p_i L_i(Ω^(i)) + λ_1 Σ_{i=1}^{K} ‖Ω^(i)‖_1 + λ_2 R(Ω^(1), ..., Ω^(K)).   (5)
In the real world, it is not ideal to assume that data always follow a Gaussian distribution. We introduce the nonparanormal distribution here to relax the normality assumption. A nonparanormal dataset X^(i) contains n_i independent observations of a p-dimensional random vector Z^(i) = (Z_1, Z_2, ..., Z_p)^T. There exists a set of univariate strictly increasing transformation functions f^(i) = (f_1^(i), ..., f_p^(i)) such that:

f^(i)(Z^(i)) = (f_1^(i)(Z_1), ..., f_p^(i)(Z_p)) ∼ N(μ^(i), Σ^(i)).   (6)

While the variable Z^(i) follows a nonparanormal distribution, the transformation functions make f^(i)(Z^(i)) follow a Gaussian distribution. The remaining problem is to obtain graphs from the data observations. It is impossible to estimate the covariance matrix Σ^(i) directly under a nonparanormal distribution. However, there is a mathematical relationship between the covariance matrix Σ and the correlation matrix S, namely Σ = D S D, where D is the diagonal matrix of the marginal standard deviations. As a result, the inverse of the correlation matrix S^(-1) and the inverse of the covariance matrix Σ^(-1) have the same nonzero and zero entries; in other words, S^(-1) and Σ^(-1) have exactly the same sparsity pattern. Based on this observation, we can infer the graph structure by utilizing the correlation matrix S instead of Σ.
Therefore, an efficient nonparametric estimator [26] for the correlation matrix S has been established. To estimate S, [26] uses the population Kendall's tau correlation coefficients τ_jk. Given the above nonparanormal distribution, its correlation matrix can be estimated as follows:

Ŝ_jk^(i) = sin( (π/2) τ̂_jk ),   (7)

where the Kendall's tau τ̂_jk can be estimated as

τ̂_jk = (2 / (n_i (n_i − 1))) Σ_{1≤r<r′≤n_i} sign( (x_rj^(i) − x_r′j^(i)) (x_rk^(i) − x_r′k^(i)) ),   (8)

and x_rj^(i) denotes the j-th feature of the r-th sample in the dataset X^(i). By replacing the covariance matrix Σ^(i) with the estimated correlation matrix Ŝ^(i), our objective function (5) can be formalized as follows:

min_{{Ω^(i)≻0}} Σ_{i=1}^{K} p_i L_i(Ω^(i)) + λ_1 Σ_{i=1}^{K} ‖Ω^(i)‖_1 + λ_2 R(Ω^(1), ..., Ω^(K)),  with L(Ω^(i)) = −log det(Ω^(i)) + ⟨Ω^(i), Ŝ^(i)⟩.   (9)
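As a small illustration of (7)-(8) (a sketch, not our optimized implementation), the correlation matrix of one dataset can be estimated with SciPy; note that scipy.stats.kendalltau applies a tie correction, which coincides with (8) for continuous data:

```python
import numpy as np
from scipy.stats import kendalltau

def nonparanormal_correlation(X):
    """Estimate S via S_jk = sin(pi/2 * tau_jk), cf. (7)-(8)."""
    p = X.shape[1]
    S = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])
            S[j, k] = S[k, j] = np.sin(np.pi / 2 * tau)
    return S
```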

B. FEDERATED OPTIMIZATION OF THE FEDJEM
From (9) we make the following observations:
• When optimizing the multi-task objective function (9), we only use the correlation matrix Ŝ^(i) of the data. There is no data transmission between the local devices and the center server, so the risk of leaking the raw data is low.
• When optimizing L(Ω^(i)) = −log det(Ω^(i)) + ⟨Ω^(i), Ŝ^(i)⟩, we only use a single precision matrix. As a result, L(Ω^(i)) can be updated locally. However, optimizing the regularization function R_total(Ω^(1), Ω^(2), ..., Ω^(K)) requires all the updates communicated from the distributed devices. Therefore, R_total({Ω^(i)}) should be updated globally on the center server.

Based on these two observations, we choose the alternating direction method of multipliers (ADMM) to ensure that L(Ω^(i)) and R_total({Ω^(i)}) can be optimized locally and globally, respectively. Consequently, we introduce new variables {Z^(i)} and add a group of constraints Ω^(i) = Z^(i), i = 1, 2, ..., K. The ADMM is used to design the FEDJEM for the federated joint estimation of multiple sUGMs. The objective function is as follows:

min_{{Ω^(i)≻0}, {Z^(i)}} Σ_{i=1}^{K} p_i L_i(Ω^(i)) + R_total(Z^(1), ..., Z^(K))  s.t.  Ω^(i) = Z^(i), i = 1, ..., K,   (10)

where R_total({Z^(i)}) = λ_1 Σ_i ‖Z^(i)‖_1 + λ_2 R(Z^(1), ..., Z^(K)). The augmented Lagrangian function [27] of (10), in scaled form, is given by:

J({Ω^(i)}, {Z^(i)}, {U^(i)}) = Σ_{i=1}^{K} p_i L_i(Ω^(i)) + R_total({Z^(i)}) + (ρ/2) Σ_{i=1}^{K} ‖Ω^(i) − Z^(i) + U^(i)‖_F^2 − (ρ/2) Σ_{i=1}^{K} ‖U^(i)‖_F^2,   (11)

where {U^(i)} are the scaled dual variables. At the t-th iteration, we can solve (11) as follows:

1) Local update: Ω^(i)(t+1) = argmin_{Ω≻0} p_i L_i(Ω) + (ρ/2) ‖Ω − Z^(i)(t) + U^(i)(t)‖_F^2.

2) Global update: {Z^(i)}(t+1) = argmin_{{Z^(i)}} R_total({Z^(i)}) + (ρ/2) Σ_{i=1}^{K} ‖Ω^(i)(t+1) − Z^(i) + U^(i)(t)‖_F^2.

3) Global update: U^(i)(t+1) = U^(i)(t) + Ω^(i)(t+1) − Z^(i)(t+1).
The pseudo code of FEDJEM is summarized in Algorithm 1 and Fig. 3 is a visualization of the algorithm.
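A compact sketch of this iteration follows (assuming the hypothetical helpers local_update and global_update, which are sketched in the next subsections; a generic global_update stands in for either variant of the Z-update):

```python
import numpy as np

def fedjem(S_hats, n_samples, lam1, lam2, rho, T):
    """Skeleton of the FEDJEM federated ADMM loop (cf. Algorithm 1).

    S_hats:    list of K estimated p x p correlation matrices, one per device
    n_samples: list of per-device sample counts n_i (i.e., p_i * N)
    """
    K, p = len(S_hats), S_hats[0].shape[0]
    Omega = [np.eye(p) for _ in range(K)]
    Z = [np.eye(p) for _ in range(K)]
    U = [np.zeros((p, p)) for _ in range(K)]
    for t in range(T):
        # 1) Local update, run in parallel on each device i (Section III-C).
        Omega = [local_update(S_hats[i], Z[i], U[i], n_samples[i], rho)
                 for i in range(K)]
        # Each device sends only Omega[i] (never raw data) to the center.
        # 2) Global Z-update at the center (either variant of Section III-D).
        Z = global_update(Omega, U, lam1, lam2, rho)
        # 3) Global dual update at the center; Z[i], U[i] return to device i.
        U = [U[i] + Omega[i] - Z[i] for i in range(K)]
    return Omega
```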

C. FEDERATED LOCAL UPDATE OF Ω^(i)
Taking the derivative of (11) with respect to Ω^(i), we can update Ω^(i) as the minimizer of

p_i L_i(Ω^(i)) + (ρ/2) ‖Ω^(i) − Z^(i)(t) + U^(i)(t)‖_F^2.   (12)

Setting the derivative to 0, we obtain (13):

n_i (Ŝ^(i) − (Ω^(i))^(-1)) + ρ (Ω^(i) − Z^(i)(t) + U^(i)(t)) = 0.   (13)

The updated Ω^(i) can be represented as

Ω̂^(i) = V D̃ V^T,  D̃_jj = (D_jj + sqrt(D_jj^2 + 4 ρ n_i)) / (2ρ),   (14)

where D_jj denotes the j-th diagonal element of the diagonal matrix D, and V D V^T denotes the eigendecomposition of ρ(Z^(i)(t) − U^(i)(t)) − n_i Ŝ^(i), the rearranged right side of (13).

Algorithm 1 FEDJEM. Data: number of tasks K, original data {X^(i)}, the maximum number of iterations T, tuning parameters {λ_1, λ_2, ρ, ε}. Initialize the model variables {Ω^(i)}, {Z^(i)}, and {U^(i)}; then alternate the local and global updates until convergence or until T is reached.

D. FEDERATED GLOBAL UPDATE OF {Z^(i)}

Taking the derivative of (11) with respect to {Z^(i)}, we can update {Z^(i)} as the minimizer of

R_total({Z^(i)}) + (ρ/2) Σ_{i=1}^{K} ‖Ω^(i)(t+1) − Z^(i) + U^(i)(t)‖_F^2.   (15)

We choose the fused graphical lasso and the group graphical lasso as our regularization function R(·). For simplicity of notation, we denote B^(i) = Ω^(i)(t+1) + U^(i)(t).
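Before detailing the two variants of the global Z-update, the following is a sketch of the local Ω-update (13)-(14) run on each device (a reconstruction under the scaled ADMM form above, with n_i = p_i·N):

```python
import numpy as np

def local_update(S_hat, Z, U, n_i, rho):
    """Omega-update on device i: solve (13) in closed form via the
    eigendecomposition of A = rho*(Z - U) - n_i*S_hat, cf. (14)."""
    A = rho * (Z - U) - n_i * S_hat
    A = (A + A.T) / 2                   # symmetrize for numerical stability
    d, V = np.linalg.eigh(A)            # A = V diag(d) V^T
    d_new = (d + np.sqrt(d ** 2 + 4 * rho * n_i)) / (2 * rho)
    return V @ np.diag(d_new) @ V.T     # positive definite by construction
```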

1) VARIATION I: FEDJEM-GROUP
If R(·) is the group-ℓ2 norm of the parameters, we can plug it into (11). Then, the loss function J({Z^(i)}) has the following formulation:

J({Z^(i)}) = λ_1 Σ_{i=1}^{K} ‖Z^(i)‖_1 + λ_2 Σ_{j≠k} ( Σ_{i=1}^{K} (Z_jk^(i))^2 )^(1/2) + (ρ/2) Σ_{i=1}^{K} ‖Z^(i) − B^(i)‖_F^2.   (16)

It follows that

Ẑ_jk^(i) = S_{λ_1/ρ}(B_jk^(i)) · ( 1 − λ_2 / ( ρ ( Σ_{i=1}^{K} S_{λ_1/ρ}(B_jk^(i))^2 )^(1/2) ) )_+,   (17)

where S denotes the soft-thresholding operator, S_c(b) = sign(b)(|b| − c)_+.
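A vectorized sketch of (17) follows (the treatment of diagonal entries is simplified here; in [15] the ℓ1 and group penalties are typically applied to off-diagonal entries only):

```python
import numpy as np

def soft_threshold(B, c):
    """Elementwise soft-thresholding operator S_c."""
    return np.sign(B) * np.maximum(np.abs(B) - c, 0.0)

def global_update_group(Omega, U, lam1, lam2, rho):
    """FEDJEM-group Z-update, cf. (17): elementwise soft-thresholding,
    then groupwise shrinkage of each (j, k) entry across the K tasks."""
    B = np.stack([Om + Du for Om, Du in zip(Omega, U)])   # K x p x p
    A = soft_threshold(B, lam1 / rho)
    group_norm = np.sqrt((A ** 2).sum(axis=0))            # p x p group norms
    shrink = np.maximum(1.0 - lam2 / (rho * np.maximum(group_norm, 1e-12)), 0.0)
    Z = A * shrink                                        # broadcast over tasks
    return list(Z)
```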

2) VARIATION II: FEDJEM-FUSED
If R(·) is the generalized fused lasso penalty, then the loss function J({Z^(i)}) has the following formulation:

J({Z^(i)}) = λ_1 Σ_{i=1}^{K} ‖Z^(i)‖_1 + λ_2 Σ_{i<i′} Σ_{j,k} |Z_jk^(i) − Z_jk^(i′)| + (ρ/2) Σ_{i=1}^{K} ‖Z^(i) − B^(i)‖_F^2.   (18)

We can use an iterative solution to obtain the optimal value of Z_jk^(i) in (18). In the case of K = 2 and λ_1 = 0, the solution to (18) is:

Ẑ_jk^(1) = Ẑ_jk^(2) = (B_jk^(1) + B_jk^(2)) / 2,  if |B_jk^(1) − B_jk^(2)| ≤ 2λ_2/ρ;
Ẑ_jk^(1) = B_jk^(1) − (λ_2/ρ) sign(B_jk^(1) − B_jk^(2)),  Ẑ_jk^(2) = B_jk^(2) + (λ_2/ρ) sign(B_jk^(1) − B_jk^(2)),  otherwise.   (19)
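A sketch of the K = 2, λ_1 = 0 case (19), following the closed-form fused-lasso solution used by the fused JGL [15]:

```python
import numpy as np

def global_update_fused_K2(B1, B2, lam2, rho):
    """FEDJEM-fused Z-update for K = 2 and lam1 = 0, cf. (19)."""
    diff = B1 - B2
    fused = np.abs(diff) <= 2.0 * lam2 / rho   # entries shrunk to a common value
    avg = (B1 + B2) / 2.0
    step = (lam2 / rho) * np.sign(diff)
    Z1 = np.where(fused, avg, B1 - step)
    Z2 = np.where(fused, avg, B2 + step)
    return Z1, Z2
```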

IV. PRACTICAL CONSIDERATIONS

A. COMPUTATION ADAPTATION ON LOCAL DEVICES
The whole computation cost can be split into three parts according to the three steps. The first step updates Ω^(1), Ω^(2), ..., Ω^(K); its computational complexity is mainly determined by the eigendecomposition of K p×p matrices, i.e., O(Kp^3) [28]. The second step updates Z^(1), Z^(2), ..., Z^(K), which involves only basic matrix operations with complexity O(Kp^2), as does the third step. Thus, the overall computational complexity of our method is O(Kp^3).

B. COMMUNICATION ADAPTATION
The communication cost is determined by the size of the transmitted data. If we communicated the datasets {X^(i)} to the center server directly, the number of samples of X^(i) would dramatically affect the communication efficiency [29]. Usually, the number of samples n_i is several times greater than the dimension (n_i ≫ p), which is extremely costly. In our method, we only exchange the updated p×p matrices {Ω^(i)}, {Z^(i)}, and {U^(i)}, which reduces the communication cost substantially.

C. CONVERGENCE GUARANTEE
We assume that the functions L(·) and R_total(·) are closed, proper, and convex, and that the augmented Lagrangian J(·) has a saddle point. According to the theorem in [27], if there exists an optimal point Ω*, then Ω(t) → Ω* as t → ∞.

D. DATA PRIVACY
We do not share the original datasets {X^(i)}; rather, we transfer only the model updates between the edge devices and the center server. The raw data can hardly be inferred from these updates, so the data remain private.

V. RELATED WORK

A. JOINT ESTIMATION OF MULTI-sGGM
The joint graphical lasso is a technique for jointly estimating multiple graphical models from datasets belonging to different but related classes. This method is based on a penalized log-likelihood approach, where the penalty function comprises two parts. The first part is a lasso penalty [30] that encourages the precision matrices to be sparse. The choice of the second penalty depends on the characteristics that we expect the graphical models to share. Several relevant studies informed our development of a joint estimator. The fused JGL [15], for example, uses a fused norm to encourage a shared pattern of sparsity (shared positions of zeros); the group JGL [15] uses a group-ℓ2 norm to encourage shared non-zero elements. Innovative penalties have also been introduced by SIMONE [16], node-based JGL [31], and other recent work to capture special kinds of similarity among graphs.

B. CLIME FOR ESTIMATING SPARSE GAUSSIAN GRAPHICAL MODEL
The constrained ℓ1 minimization method for inverse matrix estimation (CLIME) estimates the precision matrix via an ℓ1-constrained optimization:

Ω̂ = argmin_Ω ‖Ω‖_1  subject to  ‖Σ̂ Ω − I‖_∞ ≤ λ,

where the tuning parameter λ > 0. CLIME can be solved column by column. Let β be one of the column vectors of the precision matrix Ω. Rather than estimating the entire Ω, each column β̂_j of Ω is estimated as:

β̂_j = argmin_β ‖β‖_1  subject to  ‖Σ̂ β − e_j‖_∞ ≤ λ,

where e_j is the j-th standard basis vector. Finally, CLIME maintains the symmetry of the estimator by keeping, for each pair (j, k), the entry with the smaller magnitude:

ω̂_jk = ω̂_kj = ω̂_jk · 1{|ω̂_jk| ≤ |ω̂_kj|} + ω̂_kj · 1{|ω̂_jk| > |ω̂_kj|}.
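Each column subproblem is a linear program; below is a minimal sketch using scipy.optimize.linprog with the standard split β = u − v, u, v ≥ 0 (illustrative only, not the solver used in our experiments):

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(Sigma_hat, j, lam):
    """min ||beta||_1 s.t. ||Sigma_hat @ beta - e_j||_inf <= lam."""
    p = Sigma_hat.shape[0]
    e = np.zeros(p); e[j] = 1.0
    c = np.ones(2 * p)                       # ||beta||_1 = sum(u) + sum(v)
    M = np.hstack([Sigma_hat, -Sigma_hat])   # Sigma_hat @ beta = M @ [u; v]
    A_ub = np.vstack([M, -M])                # two-sided infinity-norm bound
    b_ub = np.concatenate([e + lam, lam - e])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p))
    return res.x[:p] - res.x[p:]

def clime(Sigma_hat, lam):
    p = Sigma_hat.shape[0]
    W = np.column_stack([clime_column(Sigma_hat, j, lam) for j in range(p)])
    # Symmetrize: keep the entry of smaller magnitude for each (j, k) pair.
    return np.where(np.abs(W) <= np.abs(W.T), W, W.T)
```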

C. SOLUTIONS TO CHALLENGES IN FEDERATED LEARNING
There are four core challenges inherent to federated learning, and several current approaches are available to mitigate them.

1) Expensive communication. The federated network involves large numbers of devices and large amounts of data, so communication speed is restricted by limited resources. Current approaches focus on reducing either the number of communication rounds or the size of the data in each round. The local updating method proposed in [32] allows a variable to be updated on each device in parallel at each round. Compression schemes significantly reduce the size of transmitted data by forcing the updated models to be sparse, as in [29], [33], [34].

2) System heterogeneity. The performance of different devices may differ due to hardware conditions, network connectivity, or power availability. Sometimes an active device may drop out at a certain iteration due to system problems such as a poor network connection, which exacerbates issues such as stragglers and fault tolerance. A practical strategy for the device-dropout problem is simply to ignore such device failures, though this may introduce bias toward certain devices [6]. System heterogeneity can also be managed using asynchronous communication [35], [36].

3) Statistical heterogeneity. The size of datasets may vary significantly between devices. We use the MOCHA framework [25] here to balance our datasets by controlling each device's optimization quality. Additionally, the data on each node may be collected in a non-i.i.d. manner across the network, which gives rise to an underlying statistical structure capturing the relationship among devices and their associated distributions [37], [38]. [39] focused on reducing the variance of model performance across devices to obtain relative fairness beyond accuracy.

4) Privacy. Potential leakage of private data is a typical concern in federated learning. Though the shared information is a gradient or otherwise processed beyond raw data, it is still subject to leakage. Recently, researchers have used tools based on cryptographic protocols such as SMC to ensure data security [40], [41]; however, these tools sacrifice model performance and system efficiency. [42] applied differential privacy at the global level, which trades off security against model performance.

VI. EXPERIMENTS
We conduct three types of experiments as described in this section. We first evaluate the performance of the federated learning framework FEDJEM under different circumstances. We use the local computation, global computation, and communication time of every iteration as our time metrics to observe the balancing capability of our framework under different variables. We also implement our method on a simulation dataset to evaluate its performance: we draw the predicted neural connections in the brain according to the estimated precision matrices and compare them with the true brain connectomes. Next, we comprehensively compare FEDJEM with other baselines with respect to time, accuracy, energy consumption, and privacy. We compare the total computation time of FEDJEM and the baselines when varying the feature dimension. We also compare model performance by drawing FPR-versus-TPR curves for our method and the baselines on both Gaussian and non-Gaussian data. We measure the energy consumption and the raw data size transmitted by JGL, GLasso, CLIME, and FEDJEM to show the advantages of our method in both energy efficiency and privacy preservation.
Finally, we implement FEDJEM on a real-world dataset to predict the brain connectomes for further scientific research. The real-world dataset aggregates functional and structural brain imaging data of patients with Parkinson's disease. The dataset contains two tasks with 54 features and 103 samples in one task and 234 in the other.

2) EXPERIMENTAL ENVIRONMENT
We run our experiments on four servers: one dual-core server with 16 GB RAM, 40 GB cloud storage, and 5 Mbps bandwidth, plus three single-core servers with 4 GB RAM and 40 GB cloud storage each. The center server is much more powerful than the local devices and the bandwidth is limited, so this environment satisfies the conditions of federated learning.

3) IMPLEMENTATION
We also implement a general federated learning framework for multi-task learning based on our method. Fig. 4 shows our federated learning framework with the function of each Python file. We update the variable Ω^(i) in local_update.py, and Z^(i) and U^(i) in global_update.py. The updates of every iteration are transmitted between the cloud center and the local devices through main.py and worker.py. Our code is available at https://github.com/MahjongGod-Saki/FEDJEM.

4) EVALUATION METRICS
• Local computation time: In each iteration, we use the temporal cost of updating Ω^(i) to measure the local computation consumption.
• Global computation time: We measure the time it takes for the cloud center to update Z^(i) and U^(i) as the global computation cost.
• Communication time: We measure the time spent transmitting the updates between the local devices and the cloud center in each iteration.

5) HYPER-PARAMETER SELECTION
In this subsection, we discuss the effects of hyper-parameters, which play a crucial role in the convergence rate, sparsity, and accuracy of our experimental results.
a: λ_1
As the regularization parameter of the ℓ1 norm, λ_1 controls the sparsity of the precision matrices. A larger λ_1 leads to a sparser estimated network.
b: λ_2
As the regularization parameter of the group-ℓ2 penalty or the fused lasso penalty, λ_2 encourages a similar pattern of sparsity across all the estimated precision matrices in the group lasso. It also controls the similarity of elements across all the estimated precision matrices in the fused lasso. A larger λ_2 drives the edges across the estimated networks toward zero.

c: ρ
As the augmented Lagrangian parameter, ρ determines the step size of every iteration. Our algorithm converges faster but performs worse when ρ is relatively large.

6) SIMULATION DATASETS
We conduct simulations to explore a three-class problem. We generate three networks corresponding to three classes, each consisting of c equally sized unconnected subnetworks with d features and a power-law degree distribution. Each network has p = c×d dimensions. Among the c subnetworks, the three networks share (c − 2) subnetworks; of the last two subnetworks, the first network has both, the second has one, and the third has none. To realize the structure of the three networks, we first generate c covariance matrices with ones on the diagonal. The elements corresponding to edges are drawn from a uniform distribution U([−0.4, −0.1] ∪ [0.1, 0.4]); the elements not corresponding to edges are zero. We then compute 1.5 times the sum of the absolute values of the off-diagonal elements of each row and divide every off-diagonal element of that row by this value. We average the matrix with its transpose, then add to every diagonal element 1.5 times the sum of the absolute off-diagonal elements in its row. A strictly diagonally dominant matrix is positive definite, so we obtain a positive-definite covariance matrix Σ̃, which serves as the covariance matrix Σ_1 of the first class. Resetting one of the c subnetwork blocks in Σ_1 to the identity yields the covariance matrix of the second class, Σ_2; resetting an additional subnetwork block to the identity yields Σ_3. For each class, we generate independent, identically distributed samples from a N(0, Σ_i), i = 1, 2, 3 distribution.

We generate three categories of datasets by varying the subnetwork dimension d, the number of samples n, and the sparsity s, respectively. Note that d represents the feature dimension of every subnetwork rather than of the whole network. We choose d = 16, 32, 64, 128 for comparison with n = 160 and s = 0.01 in the first type of dataset (i.e., p = 160, 320, 640, 1280). Here n represents the number of samples; we choose n = d, 2d, 4d, 8d for comparison with d = 32 and s = 0.01 in the second type. The sparsity s is linearly positively related to the exponent of a power-law distribution, so a larger s means a sparser network; we choose s = 1.0, 0.5, 0.1, 0.01 for comparison with d = 128 and n = 150 in the third type. The simulation datasets are summarized in Table 2.
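A condensed sketch of this recipe for one subnetwork block follows (the power-law edge pattern is simplified to a Bernoulli mask here, so the edge-generation step is illustrative only):

```python
import numpy as np
from scipy.linalg import block_diag

def make_subnetwork_cov(d, edge_prob, rng):
    """One d x d positive-definite block, following the recipe above."""
    mask = np.triu(rng.random((d, d)) < edge_prob, k=1)
    A = rng.uniform(0.1, 0.4, (d, d)) * rng.choice([-1.0, 1.0], (d, d)) * mask
    A = A + A.T                                  # symmetric edge weights
    row = 1.5 * np.abs(A).sum(axis=1, keepdims=True)
    A = np.divide(A, row, out=np.zeros_like(A), where=row > 0)
    A = (A + A.T) / 2.0                          # average with its transpose
    np.fill_diagonal(A, 1.0 + 1.5 * np.abs(A).sum(axis=1))  # dominant diagonal
    return A

# Example: a p = c*d covariance matrix and n Gaussian samples for one class.
rng = np.random.default_rng(0)
c, d, n = 6, 16, 160
Sigma = block_diag(*[make_subnetwork_cov(d, 0.2, rng) for _ in range(c)])
X = rng.multivariate_normal(np.zeros(c * d), Sigma, size=n)
```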

B. THE PERFORMANCE OF FEDJEM UNDER VARIOUS CIRCUMSTANCES
We use local computation, global computation, and communication time as three metrics to analyze different aspects of computational performance while varying the dimension d, number of samples n, and sparsity s of the simulation datasets as shown in Table 2. We conduct three series of experiments in total. By combining them, we find that changes in n, d, and s impact the final results in distinct ways as shown in Fig. 5.

1) VARYING THE NUMBER OF FEATURES d
The first experiment on simulation datasets with varying d is conducted to observe the changes in computation and communication time as the feature dimension of the subnetwork d varies in {16, 32, 64, 128}. Fig. 5 (a)(b)(c) shows that communication takes the most time among the three processes when d is small [6], but local computation becomes the bottleneck when d exceeds a certain threshold. Fig. 5 (j) shows that ''local computation'' is lower than ''communication'' at first but surpasses ''communication'' as d increases. Fig. 5 (b) shows that global computation in the cloud center consistently consumes the least time among the three processes, because the global computation involves only basic matrix operations. Local computation, conversely, takes much more time than global computation due to the eigendecomposition operation in each iteration. Another point of concern is that the available physical memory runs out rapidly over the iterations when d is large. At a certain point, it becomes necessary to use swap space reserved in advance, which consumes extra time reading data from the hard disk and writing data back to it; this overhead is reflected in Fig. 5.

2) VARYING THE NUMBER OF SAMPLES n
The second simulation experiment with varying n is conducted to observe the changes in computation and communication time as the number of samples n varies in {d, 2d, 4d, 8d}. Fig. 5 (e)(f) shows that the number of samples does not influence the global computation or communication time. However, Fig. 5 (d) shows that more local computation time is consumed once n reaches a certain value, e.g., n = 8d. This is because, in (13), Ŝ^(i) is singular when n is small at the beginning of the algorithm, and applying the eigendecomposition to a singular matrix takes less time. When n is large, Ŝ^(i) is nonsingular from the start, so the local computation spends a similar amount of time on each iteration. Due to our privacy-preserving strategy, the algorithm itself operates only on the estimated correlation matrices rather than the raw samples. Our experimental results confirm this expected irrelevance between the number of samples and the implementation time.

3) VARYING THE SPARSITY s
The third experiment on simulation datasets with varying s is conducted to observe the changes in computation and communication time as the sparsity s varies in {1.0, 0.5, 0.1, 0.01}. (Recall that s is linearly positively related to the exponent of the power-law degree distribution [43].) Namely, it is more likely to generate a dense network when s is relatively small. Fig. 5 (g) shows that estimating a denser network takes less local computation time, as no format conversion into a sparse matrix representation is performed; as expected, the curve of ''s = 0.01'' is much lower than the other curves. Sparsity has little influence on global computation, but it does increase the communication time when denser parameters are transferred. Fig. 5 (i)(l) shows that sparse parameters with a sparse representation reduce the size of the data that needs to be communicated, resulting in less communication time. However, the time cost returns to an average level quickly when s = 1.0, as shown in Fig. 5 (i): after several iterations, the parameters communicated to the center are not as sparse as at the beginning of the process. The communication time is thus markedly reduced initially and then increases to an average level as the parameters lose their initial sparsity.

4) THE PREDICTED BRAIN CONNECTOMES FROM THE SIMULATION DATASET
We also test our model's predictive power via an estimated brain connectome. We choose 54 ROIs in the brain as target nodes. Using the data generation method described in Section VI-A6, we generate three 54-dimensional networks and 150 non-i.i.d. samples for each network. We draw three true brain images according to the networks, then implement our method on the three sets to infer the edges in order. Finally, we draw the brain connectomes according to the prediction results.

FIGURE 6. Comparison between the true and predicted connectomes. We predict three groups of connectomes corresponding to the three networks of the simulation dataset.

Fig. 6 shows a comparison between the true and predicted brain connectomes. Most edges representing strong connections are predicted accurately. In addition, the predicted brain connectomes of the three classes recover some shared edges, which reflects the correlation among the three tasks.

C. COMPARISON WITH BASELINES
We then conduct four categories of comparison experiments to evaluate the prediction accuracy, time consumption, energy efficiency, and privacy of the proposed method.

1) ACCURACY COMPARISON
To compare the predictive power of our estimator, we generate four simulation datasets with the number of samples n = 150 and the number of subnetworks c = 6. Two are simulated Gaussian data and the other two are non-Gaussian, with the feature dimension of the subnetwork d varying in {9, 18} (i.e., p = 54, 108). Fig. 7(a)(b)(c)(d) shows the FPR-TPR curves of FEDJEM, JGL, GLasso, and CLIME on the simulation datasets. We draw four curves in each subfigure by tuning the regularization parameter λ_1 from 0.001 to 0.1 with a step size of 0.001. A larger area under the FPR-TPR curve represents better model performance. Fig. 7(a)(b) shows that our method obtains better curves than the other methods in the Gaussian cases. For the non-Gaussian cases in Fig. 7(c)(d), our method achieves even better performance, as it can effectively manage highly non-i.i.d. heterogeneous data. Overall, the experimental results are consistent with our expectations in Section III.

2) COMPUTATION COST COMPARISON
We next compare the time consumption of FEDJEM and the baselines. We conduct one experiment over a series of simulation datasets to observe the computation time as the number of samples n varies in {2, 10, 100, 1000, 10000}. Fig. 8 provides four different experimental results for computation time with the feature dimension of the subnetwork d varying in {16, 32, 64, 128} (i.e., p = 160, 320, 640, 1280). Fig. 8 (a)(b)(c)(d) shows that the time consumption of FEDJEM is consistently much lower than that of the three baselines. In addition, our method is not significantly affected by the number of samples n, while the traditional multi-sGGM method JGL is very sensitive to it due to the increase in communication cost. The communication consumption of JGL contributes the most to its time cost when n increases, and thus becomes the dominant factor.
The four subfigures also show that the time consumption of all the methods increases substantially as the feature dimension increases. FEDJEM, however, is consistently less sensitive to d than the other three baselines. This is consistent with our theoretical computational analysis (Section IV). Overall, FEDJEM outperforms the baselines with faster computation, especially when the datasets are large scale.

3) THE ENERGY CONSUMPTION AND PRIVACY COMPARISON
Finally, we compare our method and the baselines with respect to energy consumption and privacy preservation. We select a dataset generating three networks with 768 features, 150 samples, and sparsity s = 0.01. To test energy consumption, we apply a package called ''CodeCarbon'' [44] to track the energy consumed by all four methods and use joules as the unit (i.e., ''Energy Cost'' in Table 3). To test privacy preservation, we use the amount of raw data transmitted from a local device to the global center as the evaluation metric (i.e., ''Data in Danger'' in Table 3). The results are shown in Table 3.¹ The first row of the table shows that our method consumes much less energy than the three baselines, which reflects the benefits of the federated learning framework. The second row shows that our method transfers no raw data to the cloud center while JGL transfers 18.397 MB. Therefore, FEDJEM incurs no risk of leaking private information.
¹CLIME takes more than 90 minutes to finish running; consequently, we record its energy consumption as greater than 40000.00 J.

D. THE PREDICTED BRAIN CONNECTOMES FROM THE REAL-WORLD DATASETS
Finally, we apply FEDJEM to a real-world dataset consisting of functional and structural brain imaging data of Parkinson's disease patients. The dataset contains two tasks, one containing 103 samples from healthy subjects and the other containing 234 samples from Parkinson's patients; both share the same 54 features. Fig. 9 shows the predicted brain connectomes based on the real-world dataset. All connections can be separated into three groups, distinguishable in the figure by different colors. The results align with our expectations. We find that some ROIs in the left and right hemispheres of the brain are connected, which is generally accepted by researchers. We also identify some edges that differ between the two brain images and may be attributable to Parkinson's disease.

VII. DISCUSSION
As computation is distributed from a single cloud center to many local devices, our method frees a large amount of computing power in the cloud center. This is a significant advantage; however, it necessitates communication during the distribution, which creates an extra need to properly balance communication and computation. There must be a trade-off [45] between local and global computation and communication time in any practical application of this method. If communication is more time-consuming than the other processes, for instance, when the network feature dimension is small or the true network is relatively sparse (e.g., Fig. 5), we suggest two potential solutions: 1) applying a special data structure, such as a sparse representation [46]-[48], when communicating the model updates; or 2) optimizing the communication network, for example by increasing the bandwidth or upgrading from 4G to 5G. It is also possible that local computation is the most time-consuming process. For example, Fig. 5 shows that the local computation costs become unacceptable when the feature dimension is relatively large. We suggest three potential solutions: 1) upgrading the local devices with better computing power, 2) adding an edge layer between the local and global computation to form a three-layer federated learning framework, or 3) optimizing the local updates, for example with a fast decomposition solver [49], [50].
In the future, we will extend our method to extract other group patterns by applying different group regularization functions (i.e., different R_total). Penalty functions such as the group-ℓ∞ norm provide different multi-task patterns; we will evaluate their performance and draw conclusions about their usage criteria. In addition, based on our experimental results, we realize that different device environments call for different implementation strategies. We will also establish an automatic mechanism to balance the three types of costs in the federated multi-task undirected graphical model.

VIII. CONCLUSION
In this study, we investigate a federated multi-task sUGM problem centered on learning multiple related graphs from many distributed and privacy-requiring resources in the context of neuroscience. We develop a novel method, FEDJEM, that can make full use of non-i.i.d. heterogeneous data to jointly estimate the precision matrices with alternating global and local updates. Our method fully exploits the computing power of local devices, thus reducing the computational load on the cloud center while safeguarding private data. The method shows excellent performance in a series of experiments.
XIAO TAN is currently pursuing the degree with the School of Artificial Intelligence, Southeast University, China. Her research interests include federated learning, graphical model, and distribution optimization.
TIANYI MA is currently pursuing the degree with the School of Artificial Intelligence, Southeast University, China. His research interests include deep learning and model interpretability.
TONGTONG SU is currently pursuing the degree with the School of Artificial Intelligence, Southeast University, China. Her research interests include graphical models and federated learning.