Scalable and Privacy-Aware Online Learning of Nonlinear Structural Equation Models

An online topology estimation algorithm for nonlinear structural equation models (SEM) is proposed in this paper, addressing the nonlinearity and the non-stationarity of real-world systems. The nonlinearity is modeled using kernel formulations, and the curse of dimensionality associated with the kernels is mitigated using random feature approximation. The online learning strategy uses a group-lasso-based optimization framework with a prediction-corrections technique that accounts for the model evolution. The proposed approach has three properties of interest. First, it enjoys node-separable learning, which allows for scalability in large networks. Second, it offers privacy in SEM learning by replacing the actual data with node-specific random features. Third, its performance can be characterized theoretically via a dynamic regret analysis, showing that it is possible to obtain a linear dynamic regret bound under mild assumptions. Numerical results with synthetic and real data corroborate our findings and show competitive performance w.r.t. state-of-the-art alternatives.


I. INTRODUCTION
Structural Equation Models (SEM) are a prevalent tool to model interactions in real-world networks due to their simplicity and ability to express instantaneous directed relationships between interacting entities [1], [2], [3]. The advantages of SEM over simple correlation-based models lie in leveraging the directionality, which is key to many applications, such as modeling the functional connectivity between brain regions [4] and interactions in financial networks [5], to name a few. SEM modeling and its topology estimation are challenging because real-life networks are large, dynamic, and comprise nonlinear interactions, as well as leveraging directly node-specific data may raise privacy concerns [1].
Although SEM-based topology estimation has been explored in literature, most of the approaches are developed for stationary linear systems and provide offline (batch-based) solutions [6], [7]. Modeling time-varying systems call for online optimization strategies, which can be classified into timeunstructured and time-structured methods [8], [9]. The former update the model only when a new data sample arrives [10], whereas the latter first predict the model based on its evolution and then correct the prediction when the new data sample is available [11]. The time-structured algorithms are expected to perform better since they take advantage of the prior related to the model evolution but typically have a slightly higher computational cost. A SEM-based online topology estimation has been proposed in [12], but it adopts the time-unstructured strategy and fails to exploit the model evolution; hence, suboptimal. On the other hand, [9] and [13] propose time-structured online SEM learning strategies, but the models are restricted to linear interactions. Moreover, the node operations of these algorithms are computationally expensive, and they assume symmetric interactions of the network data, which destroys SEM's directionality features.
Aiming to overcome the above challenges, we propose an online topology learning algorithm for nonlinear and directed SEM using a time-structured optimization framework. The nonlinearity is tackled using kernel methods, and the curse of dimensionality of kernels is mitigated through random feature (RF) approximation. Kernel techniques are conventionally used for nonlinear topology estimation [14], [15], [16] and help transform the problem into an amenable form. Instead, RF is typically used to reduce the complexity of nonlinear models as well as to ensure that connectivity is inferred without revealing nodal attributes [17], [18], [19], [20], [21], [22]. Through a series of design choices and theoretical derivations, we show how kernels and RFs can be incorporated into the online nonlinear SEM model and show that the proposed algorithm has the following four properties: i) Sparse model evolution: The proposed SEM learning strategy uses a prediction-correction approach to model the SEM evolution. Exploiting the fact that real-world networks exhibit sparse directed interactions, we introduce a group-lasso-based regularizer to learn sparse models. ii) Scalability: The proposed algorithm is separable across the nodes with a fixed computational complexity per iteration, thereby facilitating scalability in large graphs. iii) Privacy: The node separability and the random features avoid sharing the true data, thus, ensuring node privacy. iv) Convergence Guarantee: A dynamic regret analysis of the proposed algorithm is conducted, guaranteeing convergence, and showing the role played by the different components of the proposed method. Numerical experiments on synthetic data and real data from neuroscience and finance corroborate the above contributions and show superior performance to competing alternatives. The rest of the paper is organized as follows. Section II presents the nonlinear SEM, kernel formulation, and random feature approximation. Section III develops an online strategy for learning the nonlinear SEM using a prediction-correction algorithm. The dynamic regret analysis of the proposed algorithm is performed in Section IV, and the numerical results are provided in Section V. Section VI concludes the paper. All proofs are collected in the Appendix.

II. PROBLEM FORMULATION
Consider N interdependent time series, and let y n [t] be the value of the n-th time series at time t. A nonlinear SEM with no exogenous variables models the dependencies among these time series as where f n,n (·) encodes the nonlinear influence of n -th time series on n-th time series, and u n [t] is the observation noise [23]. For example, in the context of brain networks, y n [t] represents the electroencephalogram (EEG) recorded at the n-th node (sensor) at time t, and f n,n (·) encodes the functional connectivity between the nodes n and n . Kernel representation: The nonlinear structure in (1) allows modeling a broader range of problems, but at the same time makes it more difficult to analyse and model the time series interactions. A typical way to approach these challenges is to consider the nonlinear function in (1) belonging to a reproducing kernel Hilbert space (RKHS): where κ n (·, ·) is a positive definite kernel function, measuring the similarity between its arguments. Every positive definite kernel has an associated RKHS characterized by the inner product: κ n (y, x 1 ), κ n (y, . RKHS kernels satisfy the reproducing property κ (p) n (y, x 1 ), κ n (y, and induces a norm f n,n . It is possible to express any function in RKHS as an infinite sum of kernel evaluations weighted by β n,n, t [15].
For a node n, the functional dependency can be obtained by solving f n,n n = argmin where (·) is a regularizing function with the hyperparameter λ > 0. We consider (x) = |x| to induce a sparse SEM model. In (3), the function f n,n (·) belongs to the RKHS, which is an infinite dimensional space [cf. (2)]. However, for non-decreasing regularizing functions such as (x) = |x|, x ≥ 0, the solution of (3) can be expressed with a finite number of parameters using the Representer Theorem [24]: As the number of data samples increases, the number of kernel evaluations in (4) and the parameters required to express the function also increases. We overcome this curse of dimensionality by using random feature (RF) approximation. RF approximation: RF approximation limits the kernel evaluations to a fixed lower-dimensional Fourier space for kernels with a shift-invariant property, i.e., κ n (y n [τ ], y n [t]) = κ n (y n [τ ] − y n [t]); thus, preventing the dimensionality growth. According to Bochner's theorem [25], an inverse Fourier transform of a probability distribution can represent a shift-invariant kernel: where π κ n (v) is the kernel-specific probability density function (pdf), v is the random variable associated to the pdf, and E[·] is the expectation operator. Given a sufficient number D of i.i.d. samples {v i } D i=1 drawn from distribution π κ n (v), the expectation is estimated by the sample mean: Finding the probability distribution which is the inverse Fourier transform of a kernel is a difficult task in general. However, choosing a Gaussian kernel with variance σ 2 avoids this difficulty since its Fourier transform is also a Gaussian with variance σ −2 . This allows writing the real part of (6) aŝ where A fixed dimensional (2D) representation of the function f n,n (·) is obtained by substituting (7) into (4): where α n,n = T −1 t=0 β n,n, t z v,n [t]. Using (9), we can reformulate the non-parametric problem (3) into a parametric optimization problem: The regularizer in (10) is a group-lasso regularizer to enforce sparsity in the RF coefficient α n,n ∈ R 2D . For brevity, we stack the vectors α n,n and z v,n [t] along the index n = 1, . . . , N, n = n to form α n ∈ R 2(N−1)D and z n [t] ∈ R 2(N−1)D , and compactly write (10) as where Solving problem (11) requires access to all the batch of time series {y n [τ ]} T −1 τ =0 which may be practically infeasible as they evolve over time and, at the same time, it is computationally demanding. Targeting real-world nonstationary systems with streaming data, we develop an online strategy enhanced by prediction correction mechanisms [11] that exploit the nonlinear SEM evolution. However, the group-lasso regularizer, required to enforce sparse dependencies is non-differentiable, making the deployment of prediction-correction methods not straightforward.

III. TIME-VARYING SOLUTION A. ONLINE LOSS FUNCTION
Following online optimization, we replace the batch loss in (12) with a recursive least square loss (RLS) using an exponential window: 2 is the instantaneous loss function, γ ∈ (0, 1) is the forgetting factor of the window, and μ = 1 − γ normalizes the window. The RLS loss function can be expanded as where The new optimization problem using the RLS loss becomes arg min The objective function in (17) has a differentiable loss but a non-differentiable regularizer. We solve it using composite objective mirror descent (COMID) [26] with the online updates: where α (1) n [t + 1] denotes the one-step COMID descent of α n [t], ν t the step size, and ∇ α˜ n t (α n [t]) the gradient of n t (α n [t]) w.r.t. α n , which can be computed from (14) as In an online setting, the parameters n [t] and r n [t] can be estimated recursively as (15) and (16)]. The COMID update (18) can be solved in closed-form for each lasso group α n,n ∈ α n [cf. (10)] using the multidimensional shrinkage thresholding operator (MSTO) [27]: where (20) involves a one-step CO-MID update. For brevity of the succeeding formulation, we represent the K-step version of (20) as which computes the K-step descent update of α n,n [t] as in (20), for n = 1, . . . , N, n = n, for the loss function˜ n t (·) with the parameters ν t and λ, and stacks them to form α (K ) n [t + 1].

B. PREDICTION-CORRECTION ALGORITHM
Although we can follow a time-unstructured learning strategy by directly using (21), such an approach discards the model evolution and leads to a suboptimal solution. Problem (17) features a strongly convex time-varying loss function and a properly convex regularizer, and such an optimization problem can be solved online using time-structured optimization methods that account for the model evolution. We follow the prediction-correction strategy as proposed in [11]. Prediction: The first step is to predict at time t, the yet unobserved loss function˜ n t+1 (α n ) using Taylor series expansion:˜ n,pr In addition to the gradient computed in (19), prediction (22) requires computing the Hessian ∇ αα˜ n t (α n [t]) and the partial derivative of ∇ α˜ The group-lasso regularizer is a time-invariant function and just performs the thresholding operation in (20), irrespective of the time indices. Hence, it does not require prediction.
Using the predicted loss (22) in place of (17), we predict the RF coefficients as where α pr n [t + 1] denotes the P-step COMID descent of α n [t] under the predicted loss. The gradient of the predicted loss involved in the MSTO operation (25) can be obtained from (22) as Correction: At time t + 1, the loss˜ n t+1 (·) [cf. the one appearing in (17)] becomes available, and the predicted RF coefficients α pr n [t + 1] are corrected via C-step COMID descents: A high-level system model of the proposed method is presented in Fig. 1. At each time instant, the algorithm computes two estimates of the model parameters. For instance, at time t + 1, we have two estimates: i) the predicted coefficient α pr n [t + 1], which is predicted based on evolution of the model and ii) the corrected coefficient α n [t + 1], which is obtained by correcting the predicted coefficient when a new data sample is available. Note that in the proposed framework, nodes do not share the actual data {y n [t + 1]} N n=1,n =n with each other; instead the random features {z v,n [t + 1]} N n=1,n =n are shared, which ensures the privacy. A pseudocode of the proposed prediction-correction algorithm is provided in Algorithm 1. The computational complexity of the proposed algorithm is mainly contributed by the gradient evaluation steps (26) and (19); and it is of order O(N 2 D 2 ) per node.

IV. DYNAMIC REGRET
To characterize the performance of the proposed online algorithm, we analyse its dynamic regret [28], which characterizes is the maximum eigenvalue operator. Dynamic regret is defined as the sum of differences between the online estimated cost function and optimal cost function: where α n [t] collects the estimated RF coefficients [cf. (27)] and z n [t] is the RF features; and β * n [t] ∈ R (N−1)t and κ n [t] are the optimal coefficients and the kernel-based features in RKHS, respectively. The function h n t (·, ·) is defined as which is related to (11) by replacing the cumulative loss by an instantaneous loss. We also define the optimal RF coefficients α * n [t] as Adding and subtracting h n t (α * n [t], z n [t]) in (28) gives where R rf n [T ] is the regret w.r.t. optimal cost in RF space and ξ n [T ] is the cumulative error in RF approximation. Dynamic regret can be bounded by bounding R rf n [T ] and ξ n [T ]. Theorem 1: Under assumptions A1, A2, A3, and A4, the dynamic regret R n (T ) satisfies where η > 0 is a constant, L h is the Lipschitz continuity parameter of function h n t (·, ·), d is the maximum temporal variation in the optimal solution α * n [t] − α * n [t − 1] 2 , and l is the maximum error in the optimal prediction α * n [t] − α pr * n [t] 2 with α pr * n [t] the optimum prediction at time t. The quantity q ∈ (0, 1) is the contraction coefficient, and its value for various optimization techniques is provided in [29].
Proof: See Appendix. The dynamic regret bound in Theorem 1 is linear in time, which implies that lim t→∞ R n (T )/T = constant, where constant is the steady state error, which depends on l = α * , and the constant ≥ 0. Having a linear variation of the dynamic regret is a favourable attribute of the proposed online algorithm, since it implies that asymptotically, the solution will converge to the optimal online solution, but with steady state errors. Further, if d and l are low (slowly varying systems), it is possible to have a very low bound for the asymptotic R n (T )/T by controlling . The constant is inversely related to the number of RF features [30], meaning that setting to zero requires an infinite number of RF features. Hence, is controlled in the expense of model complexity.

V. NUMERICAL EXPERIMENTS
This section compares the proposed algorithm with competing alternatives using both synthetic data from Erdös-Rényi graph models and real data from epileptic seizure and financial time series. We compare the proposed approach with the following alternatives: r MSTO: A nonlinear SEM by merely performing a onestep multidimensional shrinkage thresholding [cf. (21)] without any prediction-correction steps. The first two alternatives are considered as baselines as they have also shown superior performance to other online learning strategies in the respective papers. Instead, the third alternative is considered to highlight the importance of the proposed time-structured strategy.
In all experiments, the proposed algorithm has one-step prediction (P = 1) and one-step correction (C = 1). Wherever the SEM topologies are plotted for visualization, we use the normalized 2  Compared to the proposed algorithm, MSTO has the same computational complexity, whereas the linear Pro-SEM is computationally lighter. TV-SEM's computational complexity increases cubically, while that of the proposed method increases quadratically, which is favourable when scaling to large networks.

A. SYNTHETIC DATA
In this experiment, we consider simulated data from a slowlyvarying SEM model. We generate graph-connected time series using the following nonlinear SEM model: where y[t] ∈ R 5 is the signal at time t, u[t] ∼ N (0, 0.1), I ∈ R 5×5 is the identity matrix, and the operator sin(·) acts element-wise to introduce non-linearities. The matrix W[t] ∈ R 5×5 is constructed such that it attributes slowly-evolving model dynamics to (32), and is of the form: where W[0] ∈ R 5×5 is constructed using an Erdős-Rényi random graph with diagonal entries zero. 1 Our synthetic data set consists of 100 multi-variate time series, generated using (32), each having T = 5000 signal samples. Out of the 100 multi-variate time series, 20 are used to tune the hyperparameter of all the algorithms based on a grid search for the best model fitness. The model fitness is measured via Mean Squared Error (MSE), defined as whereŷ[t] ∈ R 5 is the signal estimated using the learned SEM model. The hyperparameter values of the proposed algorithm are (σ n , λ, γ , ν t ) = (5, 0.0009, 0.98, 2/ max{ max ( n [t])} n ) and the RF count is D = 5. The MSEs averaged across the remaining 80 multi-variate time series are plotted in Fig. 2, which shows that the proposed method outperforms all alternatives. This is because the alternatives do not exploit the evolution of the model or cannot learn non-linearities, whereas the proposed algorithm features both. Dynamic Regret: In Fig. 3, we plot the rate of change of the dynamic regret w.r.t. optimal cost function in RF 1 We choose a small Erdős-Rényi graph of size 5 to corroborate the dynamic regret, which involves high computational complexity.  space R rf n [T ]/T . The convergence of R rf [T ]/T is evident from Fig. 3, which supports our theoretical analysis in Theorem 1. We wish to note that a numerical evaluation of the second component of the dynamic regret ξ n [T ] is a daunting, complex process since it involves finding the optimal parameters in a high dimensional RKHS. However, ξ n [T ]/T is upper bounded by the value ηL h [cf. Lemma 2], where is a user-controlled parameter. By setting to be very small, the rate of change of the overall dynamic regret R n [T ]/T can be made closer to R rf [T ]/T , when T → ∞.

B. REAL DATA: EPILEPTIC SEIZURE
In this experiment, we examine the functional connectivities among different brain regions via learned SEM topologies using an EEG dataset. Our goal is to distinguish between the normal and epileptic dynamics in the brain networks. We use an EEG dataset of children with intractable seizures collected from the Children's Hospital, Boston [32]. The data set consists of multivariate time series of potential differences between electrodes inserted in the brain. There are a total of 23 times series measuring EEG activities in different brain regions. We fit this data using different algorithms and test their capability to distinguish the pre-seizure and the seizure events. We measure the performance via the Maximum Mean Discrepancy (MMD) of the distribution of nodal degrees, which is a standard approach used to measure the distance between two graphs [33], [34]. The MMD is defined as where k(x, y) is the radial basis kernel function computing the distance between x and y, and MMD 2 measures the distance between distributions p 1 and p 2 . In this experiment, p 1 and p 2  correspond to the distributions of nodal degrees for the preseizure and seizure events, respectively. We used the proposed method with the RF count D = 5 along with the hyperparameters (σ n , λ, γ , ν t ) = (1, 0.1, .98, 2/ max{ max ( n [t])}), obtained using a grid search for the best MMD. The hyperparameters of other algorithms are also tuned using the same strategy. Table 1 compares the MMD of the different algorithms using the seizure data from two subjects, S 1 and S 2 . The MMD of the proposed algorithm is an order-one magnitude higher compared to alternatives, which highlights that the proposed algorithm distinguishes the seizure and the pre-seizure events better. This is due to the fact that the functional connectivities in brain are highly nonlinear [15], and all alternatives, except MSTO, discard the nonlinear components in the connectivity. MSTO, on the other hand, can accommodate the non-linearities; however, it does not take advantage of the brain connectivity evolution, and is at the second place in the comparison.
A snapshot of the estimated graph topology before seizure and after seizure is shown in Fig. 4(a) and (e), respectively. Before the seizure, the connections are concentrated across certain regions, and during the seizure, they get more disrupted, which agrees with the observations in [35]. The reason for the disrupted topology is the increase in pathogenic neural discharge during seizure [36].
We further compare the per-node computational complexity of the proposed method and the time-structured benchmark

TABLE 2. Categorized List of Financial Times Series
TV-SEM. The experiment is conducted in a machine with specifications: 2.4 GHz 8-core Intel Core i9 and 16 GB 2667 MHz DDR4 RAM. In Fig. 5, we plot the cumulative computation time of the prediction and the correction steps, where it can be observed that the proposed model performs the prediction and the correction much faster. The shorter computation time stems from the node separability feature, which the TV-SEM does not have. The other alternatives are not considered in Fig. 5 since they are time-unstructured algorithms that do not take advantage of the model evolution, and hence, are faster than the time-structured methods.

C. FINANCIAL TIME SERIES
We consider financial time series belonging to three categories: airline industry, oil industry, and cryptocurrency, which are listed in Table 2. The data set includes 15 time series of 879 samples each, which are the closing price values of the stocks from 01-06-2019 to 14-10-2022, including the COVID-19 outbreak. The pandemic had a serious impact on world economy, affecting the natural dynamics of the stock market. A high dip in the S & P 500 index was observed around 25-02-2020 to 25-06-2020, which we mark as the pandemic period. Our goal in this experiment is to identify clusters in the data using the learned SEM topologies and examine the variations in the clusters during and after the pandemic. Since the stock groups in Table 2 are formed by selecting the stocks from similar industries, they are expected to show stronger intra-group dependencies than intergroup dependencies, under the normal market conditions [37].
Let V i = 1, 2, 3, denote the set of nodes corresponding to the stocks in each group. We measure the performance via the clustering coefficient ρ i that computes the ratio of the number of edges within group-i to the total number of edges connected to group-i members: where δ is a threshold selected to consider the strongest 2 N edges for clustering; and 1(·) is an indicator function defined as 1(x) = 1, when x is true, and 0, otherwise. A high value of ρ i indicates that intra-group interactions in groupi are stronger compared to its intergroup interactions. The first 20% of the data samples are used to tune the hyperparameter for the lowest MSE resulting in (σ n , λ, γ , ν t ) = (1, 1, .98, 2/ max{ max ( n [t])}) and RF count for the experiment is D = 10. Table 3 lists the clustering coefficients of the three groups, averaged across 80 days, randomly sampled from the COVID and post-COVID intervals. As expected, the clustering is more predominant with post-COVID market dynamics than with the COVID market dynamics. The proposed method identifies better such clusters compared with the alternatives. The MSTO algorithm is next in the comparison. This observation is supported by the fact that the interactions among the financial time series are complex [38], which cannot be effectively  modeled using the linear Pro-SEM and TV-SEM. It is further interesting to note here as the crypto cluster is much easier identified in the post-COVID period. This follows the intuition that the airline and oil sectors have more financial transactions between them, whereas cryptocurrencies are exchanged only with each other.
Further, the SEM topologies estimated using the proposed algorithm for a COVID-affected market day and a post-COVID day are shown in Figs. 6 and 7, respectively. In line with the expectation, more intra-group market interactions can be observed in Fig. 7, whereas these interactions get disrupted in Fig. 6.

VI. CONCLUSION
This paper proposed an online algorithm to learn the nonlinear structural equation model (SEM), targeting the streaming data from real-world systems with nonlinear dynamics. The proposed method leverages the kernel formulation with random feature approximation to obtain a low-dimensional representation of the nonlinear dynamics. The algorithm uses a prediction-correction strategy equipped with a group-lassobased optimization framework, solved via composite object mirror descent. Unlike the state-of-the-art algorithms, the proposed method offers data privacy at the network node through node separability and random features. In addition, the proposed online problem is separable across nodes, improving scalability in large graphs. A dynamic regret analysis has been derived to ensure the theoretical guarantee of the algorithm. Using synthetic, epileptic, and financial data, we demonstrated that the SEM topology learned using the proposed model fits the data better and can distinguish between the changes in the system dynamics with less computational complexity compared to the state-of-the-art alternatives. Our future research will involve extending the algorithm by incorporating vector autoregressive (VAR) or structural VAR models, considering time-lagged interactions among the nodes, which are inherent to many real-world networks.

APPENDIX PROOF OF THEOREM 1
Theorem 1 provides an upper bound for the dynamic regret R n (T ) = R rf n (T ) + ξ n (T ). We prove the theorem by bounding R rf n (T ) and ξ n (T ) using the following two lemmas. Lemma 1: Under assumptions A1, A3, and A4, and letting ν t = 2 L , the dynamic regret w.r.t. the optimal cost function in the RF space is upper bounded by Proof: The Cauchy-Schwarz inequality allows us to bound R rf n [T ] by bounding the cumulative optimality gap T −1 t=0 α n [t] − α * n [t] 2 and the gradient of the loss function ∇˜ n t (α n [t]) 2 [39]. The bound for optimality gap is given in [9, Prop. 1]: Since q < 1, we can express cumulative error in terms of the initial optimal solution α * n [0]. Setting α n [0] = 0, we bound the cumulative optimality gap as The gradient of the loss is bounded by following Lemma 1.2 in [19]: The claim can be then proved by adding (38) and (39). Lemma 2: Under assumptions A1 and A2, there exists a constant ≥ 0 such that the cumulative approximation error ξ n [T ] satisfies ξ n (T ) ≤ ηL h T.