Distributed Learning of Generalized Linear Causal Networks

We consider the task of learning causal structures from data stored on multiple machines, and propose a novel structure learning method called distributed annealing on regularized likelihood score (DARLS) to solve this problem. We model causal structures by a directed acyclic graph that is parameterized with generalized linear models, so that our method is applicable to various types of data. To obtain a high-scoring causal graph, DARLS simulates an annealing process to search over the space of topological sorts, where the optimal graphical structure compatible with a sort is found by a distributed optimization method. This distributed optimization relies on multiple rounds of communication between local and central machines to estimate the optimal structure. We establish its convergence to a global optimizer of the overall score that is computed on all data across local machines. To the best of our knowledge, DARLS is the first distributed method for learning causal graphs with such theoretical guarantees. Through extensive simulation studies, DARLS has shown competing performance against existing methods on distributed data, and achieved comparable structure learning accuracy and test-data likelihood with competing methods applied to pooled data across all local machines. In a real-world application for modeling protein-DNA binding networks with distributed ChIP-Sequencing data, DARLS also exhibits higher predictive power than other methods, demonstrating a great advantage in estimating causal networks from distributed data.


Introduction
Causal diagrams are mathematical models for cause-effect relationships among variables (Pearl 1995), with applications in education (Dehejia and Wahba 1999;Hill 2012), economics (Imbens 2004), and epidemiology (Glymour 2006), etc.A causal diagram is represented by a directed acyclic graph (DAG), where edges encode causal effects among variables (nodes).Graphical models associated with DAGs are known as Bayesian networks (BNs).The probability density of a set of random variables {X 1 , . . ., X p } in a BN factorizes as p(x 1 , . . ., x p ) = p j=1 p(x j | PA j = pa j ), (1) where PA j ⊂ {X 1 , . . ., X p } \ {X j } is the parent set of X j with pa j being its value.Since randomized experiments are not always feasible, various approaches have been put forward to learn causal DAGs from observational data (Pearl 2009;Guo et al. 2020).
In this work, we focus on learning causal DAGs from distributed data.Distributed data storage has been used for privacy protection when managing the copious amount of data generated everyday by government agencies, research institutions, medical centers, technology companies, etc (Mehmood et al. 2016).Many of these organizations collect similar data, or data from the same population, and they often collaborate in social, scientific and business domains.For example, in 2021, Google and Apple were initiating an exposure notification system that tracks potential exposure to COVID-19 and a privacy-preserving analytics platform that collects health metrics (Apple and Google 2021).Also, small clinics often combine data to obtain statistically significant results since the individual datasets are otherwise too small.Such collaborations always require privacy disclosures stating that each institution cannot share the privately-hold sensitive data with the other parties.Even when data privacy is not a concern, merging data from multiple sources remains a difficult task.Since each local agency stores and manages data with different platforms, it is often time-consuming to extract, standardize and migrate the data.
From a purely computational perspective, distributed computing often uses parallelism across local machines, resulting in substantial reduction in computation time and allowing us to scale inference procedures to massive datasets.
The widespread use of distributed data has lead to a research area whose goal is to adapt common statistical and machine learning tasks to the distributed setting.A straightforward idea for deriving a distributed version of any inference procedure is to form a global estimate by averaging the local estimates, an approach known as one-shot parameter averaging.This approach, however, fails to obtain solutions with any desired level of suboptimality compared to the global estimate constructed based on all the data (Zinkevich et al. 2010;Shamir et al. 2014).To overcome drawbacks of one-shot averaging, communication-efficient algorithms have been proposed that utilize multiple rounds of communication between local and central machines to generate a sequence of (global) estimates (Zhang et al. 2013;Shamir et al. 2014;Jordan et al. 2018;Fan et al. 2019).Communication-efficient algorithms are particularly useful in distributed optimization of multi-agent systems, such as electronic power systems, sensor networks and smart manufacturing (Molzahn et al. 2017;Yang et al. 2019).
Despite these methodological advances, learning causal DAGs from data distributed across independent machines is still a challenging task.A major difficulty is how to integrate local Distributed causal graph learning information to form a global causal graph that satisfies the acyclicity constraint.One approach is to iterate over local datasets (once) and then combine the local graphs or the local p-values to form a global graph (Na and Yang 2010;Gou et al. 2007;Tang et al. 2019).However, there is no theoretical guarantee that aggregating local estimates, using this single-iteration approach, would lead to an estimate close to the global minimizer of the loss on the combined data.Another contribution of our work is the use of generalized linear models (GLMs) for local conditional distributions in BNs, which brings several advantages to causal structure learning.
Our proposed GLM DAG model is a flexible family for various data types beyond linear Gaussian models (with equal variance) and multi-logit models (Gu et al. 2019;Aragam et al. 2019), and hence it can be applied to multiple domains.Additionally, most models in the GLM family lead to convex loss functions which facilitate the optimization task.The objective function in our distributed learning algorithm is equivalent to a regularized likelihood of the overall data, which has been shown to be effective in learning both continuous and discrete DAGs (Fu and Zhou 2013;Aragam and Zhou 2015;Gu et al. 2019;Ye et al. 2021).Furthermore, we show that GLM DAG models lead to the identifiability of the underlying causal DAGs (Proposition 1, Section 2), while other common models, such as multinomial for discrete networks and Gaussian linear DAGs, are not identifiable in general (Pearl 1995;Heckerman et al. 1995).Under such identifiability, we establish the 2 -consistency of a global maximizer DAG of our regularized likelihood score (Theorem 2, Section 4.2).
The paper is organized as follows.Section 2 defines the generalized linear DAG model and establishes some identifiability results under this model.In Section 3, we set up the optimization problem for learning causal graphs and develop the DARLS algorithm that combines simulated annealing to search over permutation space and a distributed optimization algorithm to optimize the network structure given an ordering.We then establish theoretical results for the convergence of the distributed optimization algorithm and estimation consistency of DARLS in Section 4.
Section 5 consists of exhaustive simulation experiments, where we compare our method to existing ones using distributed data, test the robustness of DARLS against violations of its underlying model assumptions, and examine the accuracy loss and computational efficiency of distributed learning.We also apply the distributed learning methods to ChIP-sequencing data for modeling protein-DNA binding networks in Section 6.The paper is concluded with a discussion in Section 7.
All proofs are relegated to the supplementary material.

Generalized linear DAG models
Denote by x j ∈ R d j a realization of variable X j , where d j = 1 for a numerical X j and d j = r j − 1 for a categorical variable X j with r j classes, using the one-hot encoding.Let the influence of X i on X j and β ij = 0 if X i / ∈ PA j .Put where β 0j ∈ R 1×d j and d = p i=1 d i .Here and elsewhere, [x, y] denotes the vertical concatenation of two vectors or matrices x and y.We define a generalized linear DAG (GLDAG) as the Bayesian network (1) with conditional densities given by GLMs with canonical links, that is, where b j and c j are both functions from R d j to R. Note that β j x = i∈PA j β ij x i only depends on pa j .GLDAG models allow for many common distributions via the choice of the log partition- and the multinomial for b j (θ) = log 1 + d j l=1 e θ l .Note that in the multinomial case b j (•) is a multivariate function, operating on a vector θ = (θ l ), in contrast to the other example for which b j (•) is a scalar function.The Bernoulli and multinomial choices above give rise to logistic and multi-logit regression models for each node.
We collect all the parameters of model (3) in a matrix β ∈ R (d+1)×d which is obtained by horizontal concatenation of β j , j = 1, . . ., p, each as defined in (2).We say that a GLDAG (3) is continuous if all the variables are continuous.Recall that in this case, d j = 1 for all j ∈ [p] and thus β is a (p + 1) × p matrix.We rewrite the log pdf of (3), in the continuous case, as where β x ∈ R p and β ij = 0 if and only if X i → X j .Next, we define identifiability of DAG models following Peters et al. (2014); Hoyer et al. (2008); Peters and Bühlmann (2014), and show that continuous GLDAGs are identifiable.
Definition 1 (Identifiability).Suppose we are given a joint distribution L(X) = L(X 1 , . . ., X p ) that has been generated from an unknown GLDAG model (3) with a graph G 0 .If the distribution L(X) cannot be generated by any GLDAG model with a different graph G = G 0 , then we say G 0 is identifiable from L(X).
It is well-known that linear Gaussian DAGs and multinomial DAGs in general are not identifiable (Pearl 1995;Heckerman et al. 1995).In contrast, continuous GLDAG models (4) are identifiable under mild assumptions: Proposition 1. Suppose the joint distribution L(X) is defined by the log-pdf L(x; β) with a DAG G 0 according to (4) such that β ij = 0 if and only if i ∈ PA j in G 0 .If L(x; β) is second-order differentiable with respect to x and the first-order derivative of b j (•) exists and is not constant for all j, then G 0 is identifiable from L(X).
Proposition 1 establishes the identifiability of continuous GLDAG models (4), partially justifying our goal as to learn causal graphs.This result also expands the class of identifiable DAG models in the literature.DAGs generated from linear Gaussian structural equation models with equal variance can be fully identified (Peters and Bühlmann 2014), which is a special case of the GLDAG models with b j (θ) = θ 2 /2, ∀j.A different class of identifiable DAG models is the additive noise model, X j = f j (PA j ) + ε j , assuming nonlinear f j and/or non-Gaussian ε j (Shimizu et al. 2006;Hoyer et al. 2008;Peters et al. 2014).

Distributed DAG learning
In this section, we construct the objective function using distributed data and propose a simulated annealing search combined with an iterative optimization method to learn causal DAG structures.
We start with the definition of topological sorts for DAGs.Given a permutation π on [p] := {1, . . ., p}, we permute a vector v = (v 1 , . . ., v p ) according to π to obtain a relabeled vector Let {x h } n h=1 be an i.i.d.sample of size n from model (3).We also let x j h represent the observed value of the j-th variable (X j ) in the h-th data point.Consider a subset I ⊂ [n].The normalized negative log-likelihood of the subsample {x h } h∈I is given, up to an additive constant, by Note that in this notation, [n] denotes the normalized negative log-likelihood of the entire sample of size n.

Global objective function and annealing
We consider the case that the overall data is stored on K different servers, where each local machine M k holds its private data {x h } h∈I k and communicates with a central machine C. Let The normalized negative log-likelihood based on the entire data can be decomposed as . Let P be the set of all permutations on [p] and D(π) ⊂ R (d+1)×d the set of DAGs whose topological sorts are compatible with a permutation π ∈ P. Note that D(π) is a linear subspace of R (d+1)×d .We ideally would like to estimate β by minimizing a regularized loss function of the form and ρ(•) is an appropriate regularizer to promote sparsity in β.We call f (π) the global objective function since it is defined using all data across local machines.
Recall that β ij = 0 if and only if i ∈ PA j .To learn sparse DAGs, we apply group regularization of the form where ρ g (•) is a nonnegative and nondecreasing group regularizer and λ > 0 is a tuning parameter.
Restricted to D(π), the regularizer can be further simplfied to ρ(β) = λ j i≺πj ρ g (β ij ).In this paper, we consider the Group Lasso (i.e., group 2 ) penalty with the choice where |||β ij ||| F is the Frobenius norm of matrix β ij .As a convex penalty and a natural extension of Lasso regularization, Group Lasso has demonstrated remarkable performance in grouped variable selection (Yuan and Lin 2007).
To search over (π ∈ P, β ∈ D(π)) with distributed data as in ( 6), we propose the distributed annealing on regularized likelihood score (DARLS) algorithm, which applies annealing strategies to search over the permutation space, coupled with a distributed optimization method.Such manner of joint optimization over the topological sort space and the DAG space has demonstrated great effectiveness in learning BNs; see for example, Larrañaga et al. (1996); Friedman and Koller (2003); Scanagatta et al. (2015); Ye et al. (2021) and the references thereof.
The main steps of DARLS are outlined in Algorithm 1.At each annealing iteration, a permutation π * is proposed based on current π (line 5) and is accepted with probability according to simulated annealing given a decreasing temperature schedule.To compute the score of the optimal DAG structure for a given permutation, we use the distributed optimization approach outlined in Algorithm 2, the details of which are discussed in Section 3.2 below.This approach Algorithm 1 Distributed annealing on regularized likelihood score (DARLS).

5:
Central machine C proposes π + by flipping a random interval (length up to τ ) in π.
Algorithm 2 Distributed optimization to compute the global permutation score.
π ) and sends the result to C.

4:
and broadcasts it to local machines.

5:
Each M k calculates the minimizer β t) given by ( 9) and sends it to C.
and broadcasts it to local machines.

7: end for
allows multiple rounds of communications between local and the central machines to update and synthesize information.Note that DARLS can be applied to any objective function as long as the gradient w.r.t.β has a closed-form expression.

Local objective functions and distributed optimization
For any fixed π, we use distributed computing to evaluate f (π).That is, instead of directly working with the objective function in (6), we rely on local versions of it to guide a distributed algorithm that divides the task of computing f (π) among the K local machines.In particular, we consider the local objective functions The global version (6) can be rewritten as Typically, each of F and F k is nonsmooth due to the presence of the regularizer ρ, but the Ye, Amini, and Zhou The local regularized loss F k guided by ∇h k , denoted by F k , is a first-order approximation to the global regularized loss F , up to an additive constant.Let π be the global estimate of the algorithm at iteration t.At the next iteration, t + 1, we obtain local estimates β for k = 1, . . ., K.These local estimates are then passed to the central machine C to compute the next global estimate by averaging, i.e., β k,π .The main steps of this distributed optimization method are outlined in Algorithm 2.
The above approach is essentially a version of the DANE algorithm (Zhang et al. 2013;Shamir et al. 2014;Jordan et al. 2018;Fan et al. 2019).Note that to calculate local updates β Another piece in the distributed optimization is to compute the local update (9) (line 5, Algorithm 2), for which we use the proximal gradient algorithm (Algorithm 3).Given a current global estimate β (t) , optimizing local objective F k (ξ) (9) is equivalent to ), ξ , a surrogate for the global likelihood [n] (ξ).To solve (9), we use iterative proximal gradient descent.At each iteration, we minimize a quadratic approximation to I k around the current solution ξ, plus a regularization term, where s > 0 plays the role of a step size and ξ + is our next estimate of the solution.Equivalently, the update (10) can be re-written as where prox sρ (x) := argmin u sρ(u) + 1 2 |||x − u||| 2 F is a proximal operator applied to the scaled function sρ(•).Equation ( 11) is known as a proximal gradient update, and for our choice of the Algorithm 3 Use the proximal gradient algorithm to compute local permutation scores. 1).2: while iter < max-iter and err > tol do 3: s ← κs. 9: 10: ξ ← ξ + and iter ← iter +1.
11: end while 12: regularizer given by ( 7) and ( 8), has the following closed-form expression: This is often referred to as the block soft-thresholding operator.To determine the value of the step size s, we use backward line search, shrinking an initial value s 0 until a proper step size is found.We refer readers to Parikh and Boyd (2013) for more details on the proximal algorithms.

Tuning parameter selection and structure estimation
Given an initial permutation π 0 , we use BIC grid search to select tuning parameter λ that is used in the group Lasso penalty (7) (line 1, Algorithm 1).To construct the grid, we select 20 equal-space points of λ (i) in the log scale from the interval [0.01, 0.1], where λ = 0.1 is sufficiently large for an empty graph in our test.We select the tuning parameter λ (i) that minimizes the , where β (i) ∈ D(π 0 ) is the minimizer of F (β) with penalty parameter λ (i) , computed by Algorithm 2, and N ( β (i) ) is the number of free parameters in β (i) .
An estimated GLDAG parameter β is provided at the end of the DARLS annealing search, from which we can estimate a causal structure (line 9, Algorithm 1).Let W be a p × p weighted adjacency matrix of a DAG such that W ij := ||| β ij ||| F .The use of a group Lasso regularizer helps to eliminate some edges in learning a DAG, but it may still result in false positive edges.
Hence, we further refine estimated structures by setting One can adjust the value of α to achieve a desired sparsity level, especially when having prior knowledge.In our simulation tests, we fix α = 0.1 to remove edges whose weights are relatively small compared to others.

Theoretical guarantees
In this section, we study the convergence of the iterative distributed optimization algorithm (Algorithm 2) and establish the consistency of the global minimizer of (6).As our method is primarily motivated by applications involving a large amount of distributed data, we develop theoretical results under the setting that n is large and the number of variables p stays fixed.

Distributed estimate convergence
Recall the local iteration functions ϕ k,π defined in (9).The overall iteration function for the distributed algorithm can be written as be the population second-moment matrix of the model.For a matrix β, let B F (β; r) denote the Frobenius ball of radius r centered at β.We consider the case of numerical variables, i.e. d j = 1 for all j, which includes continuous and binary discrete random variables.The following theorem provides convergence guarantees on the distributed optimization algorithm represented by Φ π for any fixed π.Let β π be any global minimizer of the global objective function, i.e., where ρ(ξ) is a convex regularizer.Let Ω := π D(π) ⊂ R (p+1)×p be the parameter space of GLDAGs.We recall that for θ ∈ Ω, θ j denotes the jth column and that {x h } is an i.i.d.sample from a GLDAG model (3).
where ψ(x) := max{x, √ x} and m := min k |I k |.Assume further that np ≥ max{K + 1, 3}.There Theorem 1 applies to any θ ∈ Ω.It is natural to take θ to be β * π , the minimizer of the population loss defined as Since β π is a consistent estimate of β * π for any π (Theorem 2, Section 4.2), that P(||| β π − β * π ||| F > r) goes to zero as n grows.Thus, with high probability, the iteration operator Φ π (•) will be a contraction: the sequence {β (t) π } t≥0 produced by the distributed algorithm converges geometrically to β π if Cζ n < 1.For fixed p, and for sufficiently large r such that β Theorem 1 is proved by establishing the uniform concentration of the Hessian of the GLDAG model (3) around its expectation over certain balls in the parameter space, and then invoking a general convergence result for the DANE algorithm which we derive in the Supplementary material (cf.Theorem S2).Note that establishing such uniform concentration in the GLDAG model is challenging due to the highly dependent and nonlinear relation among the coordinates {x j h } p j=1 .A technical tool in establishing the concentration of the Hessian is the Ledoux-Talagrand contraction theorem.In order to extend the argument to the multi-logit and generally vector-valued DAG models with d j > 1, one needs a multivariate extension of the contraction theorem which is not available in literature at the moment.This extensions is, in principle, possible and we leave it for the future work.The T -boundedness assumption is trivially satisfied for binary and ordinal data, which are the primary focus of this work.Removing this assumption requires another nontrivial extension of the Ledoux-Talagrand theorem or a completely new approach.Whether an unbounded version of Theorem 1 with similar guarantees holds is unclear to us at this point.

Consistency
Let us write ψ(x; β) := − log p(x | β) for the negative log-likelihood of a single sample x from model (3).We view x, x h and β as vectors by concatenating the columns when dealing with ).The consistency results established in this section applies to the class of models in (3), not restricting to numerical variables.

Recall that
and Ω is the GLDAG parameter space.The optimization problem ( 6) is equivalent to min β∈Ω F (β).
We denote by β ∈ Ω a global minimizer of F (β) and β * ∈ Ω the true parameter with the true DAG G * .For any β, let us consider the (cross) Fisher information matrix We note that ψ(x; •) is a convex function for exponential families, and hence I(β; β * ) is always positive semi-definite.To establish the consistency of β as well as that of β π , used in Theorem 1, for any fixed π, we make the following assumptions: (A1) The true DAG G * is identifiable.
(A2) For every π, there exists a neighborhood of β * π , denoted by nb(β * π ) and functions M jkl such that , almost surely, and E β * [M jkl (x)] < ∞ for all j, k and l.
(A3) For every π, we have In (A2), it is impliclty assumed that ψ(x; •) is finite in nb(β * π ) almost surely.Before stating our theoretical results, we define Π * := {π : β * π = β * } which is exactly the set of permutations consistent with β * , and in particular, it is nonempty.Below is a sketch of argument, and a full proof is provided in the supplement (Section S2.3).First, for any π that is consistent with β * , we have β * ∈ D(π).A KL divergence argument then shows that β * is the unique solution of the optimization problem defining β * π .That is, any π consistent with , and hence π is consistent with β * .With this observation, we establish the desired consistency results in the following theorem.(b) F (•) has a unique minimizer β over Ω (the space of DAGs) and (c) With probability converging to one as n → ∞, β = β π for some (sequence of ) π ∈ Π * .
Theorem 2 confirms that the Group Lasso regularized estimator β, defined as a global minimzier of F , is √ n-consistent, and it will identify a correct topological sort π ∈ Π * in the large-sample limit.Moreover, the theorem also establishes the uniform consistency of restricted miminizers β π for all π, which is needed for Theorem 1. Assumption (A1) holds under mild conditions according to Proposition 1, Assumption (A2) is a standard regularity condition, and Assumption (A3) is related to the non-singularity of the second moment matrix E β * (xx ).For example, consider the case d j = 1 for all j and assume that the elements of x are T -bounded and let Denote by s 0 the number of edges in a graph on p nodes.We downloaded the following networks (p, s 0 ) from the Bayesian networks repository (Scutari 2007) to simulate data: Asia (8, 8), Sachs (11,17),Child (20,25) , Insurance (27, 52), Alarm (37, 46), Hailfinder (56, 66) and Hepar2 (70,123).We generated data under the GLDAG model (3) (Section 5.3) and other common DAG models (Section 5.4), where the latter is to examine the robustness of our method against violation of its model assumptions.

Methods
We compared the DARLS algorithm to the following DAG structure learning methods: the standard greedy hill climbing (HC) algorithm (Gámez et al. 2011), the Peter-Clark (PC) algorithm (Spirtes and Glymour 1991), the max-min hill-climbing (MMHC) algorithm (Tsamardinos et al. 2006), the fast greedy equivalence search (FGES) (Chickering 2002;Ramsey et al. 2017;Ramanan and Natarajan 2020), and the NOTEARS algorithm (Zheng et al. 2018).Among these methods, PC is a constraint-based method and MMHC is a hybrid method.The other three methods are score-based, where HC searches over DAGs, FGES searches over the equivalence classes, and NOTEARS uses continuous optimization to estimate DAG structure.
We applied a competing method on each local dataset D k to obtain a completed partially directed acyclic graph (CPDAG) A k , and then constructed a global graph using {A k } K k=1 .We used CPDAGs here because all the competing methods were developed under non-idenfitiable DAG models.Among the five competing methods, only PC and FGES output a CPDAG, and thus we converted estimated DAGs from the other methods to CPDAGs to obtain A k .Given 1 , we counted occurrences of the three possible orientations between each pair (i, j): i → j, i ← j, or i − j (undirected).We then ranked orientations across all node pairs in the descending order of their counts, and sequentially added these edge orientations to an empty graph, as long as they would not introduce a directed cycle (a cycle consisting of all directed edges).By the end of this process, we had a partially directed graph.Lastly, we applied Meek's rules (Meek 1995;Chickering 2002) to maximally orient undirected edges, and hence constructed a global CPDAG estimate.
Remark 1.The HC global estimates had too many edges using this approach, because its local CPDAGs lacked consensus, resulting in a large number of candidate edges and much higher FP edges.To solve this problem, we did not add any edge between a node pair (i, j) if the majority of the local graphs {A k } had no edges between them.In this way, global graphs estimated by HC became reasonably sparse compared to global estimates by the other methods.and true CPDAGs since they do not assume an identifiable DAG.
Table 1 reports the average performance metrics across 20 datasets for each of the seven graphs, using all six methods.Since NOTEARS estimates were too sparse to be competitive, using the default settings, we decreased the penalty tuning parameter to 10 −4 from the suggested value of 10 −1 .Its results are listed only for two small graphs, Asia and Sachs.PC had difficulty generating estimates within a reasonable time limit (30 minutes per estimation) for the last two networks, Hailfinder and Hepar2, and hence it was removed from the comparisons on these two graphs.In both cases n = 100p and n = 10, 000, DARLS consistently achieved the lowest SHD among all methods for every network, demonstrating higher accuracy in estimating graphical structures.DARLS identified more TP edges than the other methods in almost every case.The refinement step via thresholding β also helped to reduce SHD by cutting down FP edges.
To examine the loss of accuracy in estimating network structures on distributed data, we applied each method to combined data by pooling the K local datasets.For brevity in reporting the results, we chose the best performance on each network among the five competing methods, called the best competing method, to compare with DARLS. Figure 1 shows the performance, in terms of SHDs, of DARLS and the best competing method on distributed and combined data.
First, we observe that, applied to either distributed or combined data, DARLS achieves similar SHD values.The SHD difference between the combined and distributed data of DARLS is much smaller than that of the best competing method, demonstrating that DARLS is more effective in using distributed data.Second, consistent with Table 1, DARLS always outperformed the best competing method significantly when applied to distributed data (best-distributed).Furthermore, DARLS on distributed data shows a substantial overlap in the distribution of SHD, with the best competing method on combined data (best-combined), which is either HC or FGES for n = 100p and HC, MMHC or FGES for n = 10, 000.Such overlaps indicate the highly competitive performance of our distributed learning algorithm.The variability in SHD of DARLS is in general smaller than that of the best competing method, showing a higher consistency across different datasets.
To quantify the accuracy of distributed optimization using Algorithm 2, we computed permutation scores f (π) (6) under various values of K ∈ {1, 2, 5, 10} for a fixed π.For each value of K, we fixed the tuning parameter λ, permutation π and all internal computation parameters to ensure the only K varied in the calculation of f (π).Let f (K) be the value of f (π) computed by K local machines, and ∆f (K) := f (K) − f (1) be the relative increase in the loss f (K) .Figure 2a shows {∆f (K) : K = 2, 5, 10} across all the networks.The values of ∆f (K) is in the order of 10 −13 , verifying f (π) is essentially identical using either the overall (K = 1) or distributed data (K ≥ 2).Since the number of iterations is fixed, ∆f (K) increases with the network size.
We also examined computation time of finding f (K) for K ∈ [20] using the 20 Insurance datasets.In each test, the same data were split and distributed to different number K of local machines to solve the optimization problem ( 6), with all other parameters fixed.Figure 2b shows  Figure 1: SHD comparison between DARLS and the best competing method using combined or distributed data.For each graph, the box plots from left to right report the results for the best competing method on combined data (best-combined), DARLS on combined combined, DARLS on distributed data and the best competing method on distributed data (best-distributed).
the computation time versus K.As expected when distributing a complicated task, computing f (π) requires less time if using more machines.However, the reduction in computation time reaches approximately a stable level after K = 10, indicating a trade-off between the gains from parallel computation and the communication overhead.Additionally, the smallest local sample size m decreases when K is large, and Theorem 1 shows that would slow down the convergence of Algorithm 2.

Other data generation models
To test the robustness of DARLS against violations of its model assumptions, we also generated data from different DAG models.Particularly, we used threshold-Gaussian and multinomial  DAGs to generate discrete data, and then compared DARLS with the other methods.
For the threshold-Gaussian DAG model, we first generated continuous variables Y 1 , . . ., Y p using Gaussian structural equation models Y j = B j PA j + j where j is a Gaussian noise from N (0, 1) and each coefficient parameter in B j ∈ R s j was sampled uniformly from where s j = |PA j |.Then, we thresheld these continuous values to generate binary data X j = I(Y j > c j ), where c j is the sample mean of Y j .We used two-component mixture Gaussian to simulate continuous data for root variables in Y j , each drawn from N (1, 1) and N (−1, 1) with an equal probability.Under this design, most of the Y j 's were bi-modal in distribution, which greatly increased the robustness in thresholding.Such conversion between continuous and categorical variables has been used for modeling BNs in prior work (Lee and Kim 2010;Cai et al. 2020;Song et al. 2021), but the threshold-Gaussian DAG model is not the underlying model for any of six methods in our comparison.
We also simulated multinomial data from contingency tables provided in the BN repository (Scutari 2007).We made a few modifications to the contingency tables to ensure that (1) the number of states per variable was at most three, and (2) the marginal probability of any state was at least 0.1 for every variable by merging states.Due to the high structure complexity of Hailfinder and Hepar2, the original contingency tables of a few nodes were used without modifications, resulting in marginal probability less than 0.1 for some states.We note that the multinomial DAG model is the underlying model for the majority of the competing methods, including HC, PC, FGES and MMHC.Therefore, comparisons on these data would also test if the GLDAG model can approximate the multinomial model reasonably well.still could estimate the network with the lowest SHD for Sachs and Child.For most of the other cases, DARLS was only worse than HC, but better than or at least comparable to the other methods.Note that we specifically tuned the way for HC to combine local estimates and build a global graph, due to its lack of consensus among local estimates (Remark 1).It is encouraging to see that DARLS uniformly outperformed FGES, a consistent score-based method under the multinomial DAG model (Chickering 2002), highlighting the effectiveness of DARLS in using distributed data.This also suggests that the GLDAG (3) can be a pretty good approximation to the commonly used multinomial DAG model for discrete data.This study confirms that DARLS indeed performs relatively well on data generated from different DAG models, which is important for its practical use.This is further demonstrated by the application to a real dataset in next section.
For each TF, an association strength score, which is the weighted sum of the corresponding ChIP-Seq signal strength, was calculated for each of the 18,936 genes (Ouyang et al. 2009).
Roughly speaking, this score can be understood as a measure of the binding strength of a TF to a gene.Following the same preprocessing in Wang and Zhou (2021), the genes with zero association scores were removed from our analysis.Accordingly, our observed data matrix, of size n × p = 8462 × 12, contains the association scores of 12 TFs over 8,462 genes.We aim to build a causal network that reveals how these 12 TFs affect each other's binding to genes.The associate scores of a TF are typically bimodal, and they were discretized before network estimation; see Figure S1 in the supplemental material for illustration of the discretization.
We distributed this dataset across K = 20 local machines and applied DARLS, HC, MMHC, PC and FGES to the distributed data to learn the protein-DNA binding network.NOTEARS was excluded because its performance was not competitive.Local estimates of a competing method were combined to construct a global graph as we did in Section 5. To ease the comparison, we controlled the sparsity of estimated networks such that every method produced two graphs using the distributed data, with roughly s 0 = 17 and 29 edges, except FGES which had difficulty generating output close to 17 edges.We also applied each method on the combined data (i.e,K = 1) with the same parameters in Section 5.In this case, each estimated graph had around s 0 = 30 edges.The only exception is PC, whose estimate had 21 edges even when its significance level had been reduced to 10 −10 .Since the true network structure is unknown, test data likelihood under multinomial DAG models in ten-fold cross validation is used to assess the accuracy of estimated networks.Denote to add interaction terms to GLDAGs (3) to model nonlinear effects between variables.Such generalization is expected to approximate real causal relations with higher accuracy.We have established that continuous GLDAGs are identifiable, justifying their use in causal discovery, and it is left as future work to study the identifiability of general GLDAGs.
The primary focus of this paper is on big distributed data, with large n but moderate p.In this setting, we established the convergence of the solution obtained by distributed optimization to a global minimizer of the loss using the combined data, and the consistency of the global minimizer as an estimate of the true DAG parameter.However, generalizing the convergence and consistency results to allow diverging p is theoretically interesting and left as future work.
(t+1) k,π (line 5, Algorithm 2), only the current global estimate β (t) π and the global gradient ∇ [n] (β (t) π ) need to be communicated to each local machine.In Section 4.1, we show that for a sufficiently large minimum sample size per machine, i.e. min k n k , the sequence {β (t) π } t≥0 thus produced will converge to a global minimizer β π of F (•) over D(π).

Theorem 1 .
Assume that the coordinates of x h are T -bounded, that is, |x j h | ≤ T for all h ∈ [n] and j ∈ [p].Let θ ∈ Ω be any GLDAG parameter and r > 0, and set R * 1 = max j θ j 1 and r p := 2r √ p.Let b p = inf |t|≤T (rp+R * 1 ) b (t), and assume that b (•) is b p -Lipschitz on [−T r p , T r p ]. Define F ( β π , r), one can always satisfy the condition of Cζ n < 1 by taking m (the minimum sample size per machine) large enough.Hence, Theorem 1 provides a quantitative lower bound on m for the geometric convergence to kick in.

Following
the practice for DAG learning on distributed data as in Na and Yang (2010); Gou et al. (2007); Tang et al. (2019), we combined local estimates generated by a competing method to obtain a global graph estimate.Denote the dataset on a local machine by D

Figure 2 :
Figure 2: Accuracy and computation time comparison for f (π).computation time comparison, with mean and standard deviations plotted, is performed over 20 datasets generated from Insurance.
In this paper, we propose a score-based learning method that carries out multiple rounds of communication to estimate DAGs from distributed data.Our objective function is equivalent to a regularized log-likelihood of the overall data.To optimize it, the central machine proposes a candidate ordering π, where the score of π is evaluated via communication with local machines over DAGs compatible with π.Then, the candidate sort π is selected by simulated annealing.Because every DAG has at least one ordering, searching over the space of orderings ensures that the acyclicity constraint is always satisfied.We show that the convergence rate of our distributed estimate to the global one is O(log(n)/ √ m) for a fixed true DAG, where n is the total sample size across all local machines and m is the smallest local sample size (Theorem 1, Section 4.1).To the best of our knowledge, our approach is the first to learn causal graphs from distributed data with a theoretical guarantee of convergence to the global estimate as the sample size of the local machines grow.

Table 1 :
DARLS against other methods on distributed logistic data.

Table 2
reports the structural estimation accuracy of each method on data generated by the threshold-Gaussian and the mulitnomial DAG models, with n = 10, 000 and K = 20.When the underlying model is a threshold-Gaussian DAG, DARLS achieved the lowest SHD for 4 networks, i.e.Asia, Sachs, Child and Insurance.For data simulated from multinomial models, DARLS

Table 2 :
DARLS against other methods when its model assumption is violated.