Distributed Identification of Contracting and/or Monotone Network Dynamics

This paper proposes methods for identification of large-scale networked systems with guarantees that the resulting model will be contracting -- a strong form of nonlinear stability -- and/or monotone, i.e. order relations between states are preserved. The main challenges that we address are: simultaneously searching for model parameters and a certificate of stability, and scalability to networks with hundreds or thousands of nodes. We propose a model set that admits convex constraints for stability and monotonicity, and has a separable structure that allows distributed identification via the alternating directions method of multipliers (ADMM). The performance and scalability of the approach is illustrated on a variety of linear and non-linear case studies, including a nonlinear traffic network with a 200-dimensional state space.


I. INTRODUCTION
System identification is the process of generating dynamic models from data [1], and is also referred to as learning dynamical systems (e.g. [2]). When scaling control and identification algorithms to large-scale systems, it can be useful to treat a system as a sparse network of local subsystems interconnected through a graph [3], [4], [5]. In this paper, we propose algorithms for identification of such networked systems in state space form: where x t ∈ R n and u t ∈ R m are the state and input respectively, and the model dynamics a(·, ·, ·) can be either linear or non-linear. We assume that measurements (or estimates) of state and input sequences are available. Our approach: 1) uses distributed computation (i.e. network nodes only share data and parameters with immediate neighbors), 2) can generate models with a strong form of stability called contraction, 3) can generate monotone models, i.e. ordering relations between states are preserved. Imposing contraction and/or monotonicity on models provides two benefits when identifying systems that are known to satisfy those properties. Firstly, incorporating prior knowledge can significantly improve the quality of the identified models. Secondly, it guarantees that properties of the real system that are useful for controller design are present in the identified model. This work is motivated by the observation that many systems have the combination of large-scale, sparse dynamics, monotonicity and stability. Examples include traffic networks [6], [7], [8], chemical reactions [9], combination therapies [10], [11], [12], [13], wildfires [14] and power scheduling [10].
The key technical difficulty we address is the simultaneous identification and stability verification of large-scale networked systems. We propose a convex model set with scalable stability conditions and an algorithm based on ADMM that decomposes the identification problem into easily solvable, sub-problems that require only local communication between subsystems.

A. Networked System Identification
Standard approaches to system identification do not work well for large-scale networked systems for three reasons [15]: firstly, the dataset must be collected at a central location, a process which may be prohibitive for complex systems; secondly, the computational and memory complexities prohibit application to large systems; finally, the network structure may not be preserved by identification. For instance, standard subspace identification methods have O[n 3 ] an O[n 2 ] computational and memory complexities respectively, and any sparse structure in the dynamics is destroyed through an unknown similarity transformation [16].
Previous work in networked system identification can be loosely categorized into two areas; the identification of a network topology [17], [18], [19], and the identification of a system's dynamics with known topology. In the latter category, almost all prior work has focused on the case where subsystems are linear time invariant (LTI) and described by state space models [15], [20], [21] or transfer functions (a.k.a. modules) [22], [23].
When identifying the subsytem dynamics, states or outputs of neighbors are treated as exogenous inputs, ignoring feedback loops induced by the network topology. This improves scalability as the identification of each subsystem can be performed in parallel. However, accurate identification of the individual subsystems does not imply accurate identification of the full network, because the ignored feedback loops may have a strong effect and even introduce instability. A simple case with two subsystems which has received significant attention is closed-loop identification [24] Prior works in networked system identification assume stable LTI network dynamics and establish identifiability [25] and consistency [26]. These assumptions then imply model stability in the infinite data limit. However, model stability is not guaranteed with finite data sets or in non-linear black-box identification problems, where the true system is usually not in the model set.

B. Identification of Stable Models
Standard methods for system identification do not guarantee model stability, even if the system from which the data are collected is stable. For linear system identification considerable attention has been paid to this problem, and several methods have been suggested based on regularisation or model constraints [27], [28], [29], [30]. Even for linear systems, the set of stable models is not convex using standard parameterisations, to the authors' knowledge all existing methods introduce some bias in the identification procedure.
For linear systems most definitions of stability are equivalent. The nonlinear case is more nuanced, and the definition used depends on the requirements of the problem at hand. Standard Lyapunov methods are not appropriate in system identification as the stability certificate must be constructed about a known stable solution, whereas the very purpose of system identification is to predict a system's response to previously unseen inputs. Contraction [31] and incremental stability (e.g. [32]) are more appropriate since they ensure stability of all possible solutions and consequently, do not require a-priori knowledge of the inputs and state trajectories.
Stability guarantees have also been investigated for nonlinear system identification. For instance, systems can be identified using sets of stable recurrent neural networks [33] or stable Gaussian process state space models [34]. A limitation of these approaches is that they do not allow joint search for a model and its stability certificate, which can be conservative even for linear systems.
This paper builds on previous work in jointly-convex parameterization of models and their stability certificates via implicit models [35], [36], [37], [38], [39], [40] and associated convex bounds for model fidelity via Lagrangian relaxation [38], [41], [42]. The main development in this paper is to significantly improve scalability of this approach via a novel model parametrization and contraction constraint that are jointly convex and permit a particular upstream/downstream network decomposition (defined below).

C. Monotone and Positive Systems
Monotone systems are a class of dynamic system characterized by the preservation of an order relation for solutions (c.f. Definition 2 below). A closely related class is positive systems, for which state variables remain non-negative for all non-negative inputs (c.f. Definition 3 below). For linear systems, positivity and monotonicity are equivalent.
A useful property of monotone systems is that they often admit simplified stability tests. In particular, for linear positive systems the existence of separable Lyapunov functions, i.e. those representable as the sum or maximum over functions of individual state variables, is necessary and sufficient for stability [43]. This property has been used to simplify analysis [44], control [45] and identification [46] of positive systems. Separable stability certificates have also been shown to exist for certain classes of nonlinear monotone systems [47], [48], [49], [50]. and have been used for distributed stability verification [51] and control [52]. Monotonicity can also simplify nonlinear model predictive control [10] and formal verification using signal temporal logic [53].
There are however, few identification algorithms that guarantee monotonicity. In [54], monotone gene networks are identified using the monotone P-splines developed in [55]. This approach, however, does no guarantee model stability.

D. Least-Squares Equation Error
Identification typically involves the optimization of a quality of fit metric over a model set. In this paper we use what is arguably the simplest and most widely-applied quality-offit metric, least-squares equation error (a.k.a. one step ahead prediction error): wherex t ∈ R n andũ t ∈ R m are state and input measurements or estimates. Least-squares equation error is a natural choice for short-term prediction if state measurements are available. If long-term predictions are needed, then simulation error, defined as is a better measure of performance. The dependence on simulated states, however, renders the cost function nonconvex [56], [57] and notoriously difficult to optimize [58]. Consequently, equation error optimization is often used to initialize local search methods (e.g. gradient descent) for models with good simulation error or used as a surrogate for simulation error with better numerical properties. In the latter context, model stability is particularly important since a model can have small equation error but be unstable and therefore exhibit very large simulation error. In fact, when a model is contracting, it can be shown that small equation error implies small simulation error [35]. In many contexts, system state measurements are not available. Nevertheless, equation error frequently arises as a subproblem via estimated states, e.g. in subspace identification algorithms [59], [60], [21], where states are estimated using using matrix factorizations, or in maximum likelihood identification via the expectation maximization (EM) algorithm where they are estimated from the joint smoothing distribution [61], [62].

E. Contributions
The main contributions of this work as are follows: we propose a model structure and convex constraints that guarantee monotonicity, positivity, and/or contraction of the model. For large scale networked systems, we refine the model and constraints to have a separable structure, and we introduce a separable bound on equation error, so the identification problem can be solved using distributed computation. The algorithm, based on ADMM, decomposes into easily solved separable optimization problems at each step. Data and parameters are only communicated to immediate neighbours in the network. Finally, we evaluate the scalability and fitting performance of the method on a number of numerical examples.

Notation
A graph G is defined by a set of nodes (vertices) V = [1, ..., N ] and edges E ⊂ V × V . The vector 1 is the column vector of ones, with size inferred from context. For vectors v, v > 0 refers to the element-wise inequality. For matrices M , M ≥ 0 and M ≤ 0 refer to element-wise inequalities. For symmetric matrices M , M 0 means that M is positive definite. For a vector v, diag(v) is the matrix with the elements of v along the diagonal. The set of n × n symmetric matrices is denoted S n×n . The set of n × n non-singular M-matrices is denoted M n . For a matrix A, A ∈ M n means A ij ≤ 0, ∀i = j and real(λ i ) > 0 for i = 1, ..., n, where λ i are the eigenvalues of A. For brevity, we will sometimes drop the arguments from a function where the meaning may be inferred from context.

A. Differential Dynamics
The contraction and monotonicity conditions we study can be verified by way of a systems differential dynamics, a.k.a. linearized, variational, or prolonged dynamics. For the system (1), the differential dynamics are where A = ∂a ∂x and B = ∂a ∂u . In conjunction with (1), the differential dynamics describe the linearized dynamics along all solutions of the system.

B. Contraction Analysis
We use the following definition of nonlinear stability: Definition 1 (Contraction). A system is termed contracting with rate α, where 0 < α < 1, if for any two initial conditions x a 0 , x b 0 , given the same input sequence u t , and some p ∈ . Contraction can be proven by finding a contraction metric which verifies conditions on the differential dynamics [31]. A contraction metric is a function V (t, x, δ x ) such that: for some µ > 0 The choice of contraction metric V (t, x, δ) is problem dependent. Prior works have proposed quadratic contraction metrics for which (6) is linear in the stability certificate and can be verified using semi-definite programming. A number of works have also noted that using a weighted 1 norm can lead to separable constraints [63], [51] allowing for stability verification of large-scale networked systems.
In the context of system identification, the joint search for model a in (1) and contraction metric V is non-convex due to the nonlinear function composition

C. Monotone and Positive Systems
We now define system monotonicity and positivity of dynamical systems.
Definition 2 (Monotone System). A system (1) is termed monotone if for inputs u a t and u b t and initial conditions x a 0 , x b 0 , the following implication holds: Monotonicity results from A(x, u) ≥ 0 and B(x, u) ≥ 0 where A and B come from the differential dynamics (4). (1) is positive if for all inputs u 0 , ..., u T ≥ 0 and initial conditions x 0 ≥ 0, the resulting trajectory has x 1 , ..., x T ≥ 0.

Definition 3 (Positive System). A system
A sufficent condition for a system to positive is for it to be monotone and admit x t = 0, u t = 0 ∀t as a solution, i.e. a(0, 0, 0) = 0 in (1).

D. Network Structure
We assume model (1) is partitioned into N subsystems. The interactions between these subsystems is described by a directed graph G = (V , E ). Here, we have a set of nodes denoted V = {1, ..., N } corresponding to the subsystems. Each subsystem has its own state denoted x i ∈ R ni and may take an input denoted u i ∈ R mi (we allow for the case m i = 0). The global state and input is attained by concatenating the states and inputs of each subsystem, The set of edges E ⊆ V × V describes how the subsystems interact with each other. In particular, (j, i) ∈ E means that the state of subsystem j affects the state of subsystem i. The edge list E may arise naturally from the context of the problem, e.g. in traffic networks where edges come from the physical topology of the road network, or may be identified from data [17], [64].
For each subsystem i ∈ V , we define the set of upstream The term upstream neighbours of i refers to the subsystems whose state affects the state of subsystem i, and the term downstream neighbours refers to the subsystems whose state is affected by subsystem i's state. In general, we allow self-loops so that a node can be both upstream and downstream to itself. This notation is illustrated in Fig. 1.
We can write the dynamics of the individual interacting subsystems as follows: where a i corresponds to the i th element in (1)

E. Separable Optimization using ADMM
Consider an optimization problem of the form, which may include constraints on θ via indicator functions appearing in J. The indicator function for the constraint θ ∈ Θ is the function I Θ (θ) which is zero for θ ∈ Θ and infinite otherwise.
In this paper we encounter problems of the form: where {θ i a | i = 1, ..., N } and {θ j b | j = 1, ..., M } are two different partitions of the same vector θ. In our context, these partitionings correspond to the sets of upstream or downstream neighbors discussed in the previous section. For such problems, the alternating directions method of multipliers (ADMM) can be applied [65]. We write (10) as Applying ADMM results in iterations in which each step is separable with respect to the partition θ a or θ b , and can thus be solved via distributed computing. For convex problems, ADMM is guaranteed to converge to the optimal solution [65].

F. Problem Statement
To summarise, the main objective of this paper is as follows. Given state and input measurements {x t ,ũ t | t = 1, .., T }, and a graph G describing the network topology, identify models (8) at each node such that: • during the identification procedure, each subsystem only communicates with immediate (upstream and downstream) neighbours; • convergence is guaranteed and least-squares equation error is small at each subsystem; • model behavioural constraints such as contraction, monotonicity, and/or positivity can be guaranteed for the interconnected system (1).

III. CONVEX BEHAVIORAL CONSTRAINTS
In this section we develop a convex parametrization of models with contraction, monotonicity and/or positivity guarantees. As described in subsection II-B, jointly searching for a model (1) and contraction metric is non-convex.
Following [37], [38], we solve this problem by instead searching for models in the following implicit form: The differential dynamics of (12) are: where E = ∂e ∂x , F = ∂f ∂x and K = ∂f ∂u . Definition 5 (Well-Posed). An implicit model of the form (12) is termed well-posed if for every x t , u t , u t+1 there is a unique x t+1 satisfying (12).
I.e., well-posedness means that e(x, u) is a bijection with respect to its first argument, and implies the existence of an explicit model of the form (1) where a = e −1 •f . Furthermore, it implies that for any initial condition x 0 and sequence of inputs u 0 , ..., u T , there exists a unique trajectory x 1 , ..., x T satisfying (12).

A. Stability and Monotonicity Constraints
In this section, we develop convex conditions on the implicit model (12) that guarantee well-posedness, monotonicity, positivity, and contraction. The main result is the following: (b) contracting with rate α if (a) holds and there exists a matrix function S(x, u) : (c) monotone if (a) holds and for all (x, u): (d) positive if (c) holds and: (e) contracting and monotone if (a) and (c) hold, and for all (x, u) Positivity is also enforced if (18) holds.
Proof. See appendix A.
We refer to the stability conditions in Theorem 1 (b) or (e) as 1 contraction conditions as they ensure contraction using a state dependent weighted 1 norm of the differentials: V (t, x, δ) = |E(x, u)δ| 1 , noting that for the purpose of contraction analysis the exogenous input u can be considered as a time-variation. Remark 1. Theorem 1 requires an exponential contraction rate α to be specified. A weaker form of incremental stability can also be imposed by replacing (16) with

B. Model Parametrizations
As formulated above, Theorem 1 applies to models represented by the infinite dimensional space of continuously differentiable functions e and f . In practice, these functions are usually parametrized by a finite-dimensional vector. In this section we briefly discuss some common model parametrizations and how the constraints can be enforced.
For linear models, (14) is a semidefinite constraint, (15)- (19) are linear and can be enforced using semidefinite programming. Furthermore, if E is diagonal, then (14) is also linear and the model set is polytopic.
If the functions e and f are multivariate polynomials or trigonometric polynomials, then the constraints can be enforced using sum of squares programming [66], [67].
The model set (12) also contains a class of recurrent neural networks with slope-restricted, invertible activation functions. In this case, e(x) is the inverse of the activation functions, f (x, u) is affine, and simulation of the explicit model a = e −1 • f yields the equation of a standard recurrent neural network [68]. The conditions in Theorem 1 (b) or (d) then correspond to diagonal dominance conditions on the weight matrices which can be enforced via linear constraints.
Finally, if the requirement for global verification of these properties is relaxed, then these constraints can be applied pointwise for arbitrary parametrizations e and f , which amount to linear and semidefinite constraints if e and f are linearly parametrized.

IV. DISTRIBUTED IDENTIFICATION
In this section we consider the problem of distributed identification of networked systems with the behavioral constraints introduced in Theorem 1. First, we propose a particular structure for (12) for which the constraints in Theorem 1 are separable. We then propose an objective function that is separable (with respect to a different partition). Finally we propose an algorithm for fitting the proposed models that requires only local communication between subsystems at each step.

A. Distributed Model
We propose the following model structure for distributed identification, in which e depends only on local states and inputs, and f is a summation of nonlinear functions of states and inputs from upstream neighbours: Models of the form (21) are widely used for statistical modelling, and are referred to as generalized additive models (GAMs) [69]. This class of models also includes linear systems, and a class of recurrent neural networks. We assume that each of the functions e i : R ni × R mi → R ni and f ij : R nj × R mi → R ni are linearly parametrized by θ i e and θ ij f respectively. We define two partitions of the model parameters; the sets of upstream and downstream parameters. These are denoted Objective functions, constraints and optimization problems are called upstream-separable or downstream-separable if they are separable with respect to these partitions. Upstream and downstream separable optimization problems are closely related to the column-wise and row-wise separable optimization problems used in [70].
For the parametrization (21), the differential dynamics have a sparsity pattern determined by the network topology. In particular, the (i, k) th block of F is: and E is block diagonal. This means F ik depends only on parameters θ ik f and the block E ii depends only on θ i e . As each block of E and F has an independent parametrization, functions of disjoint sets of elements of E or F will be separable.

B. Convex Bounds for Equation Error
In Section IV-A we propose a convex set of implicit models. However, this approach shifts the convexity problem from the model set to the objective function as equation error (2), s.t. a = e −1 • f , is no longer convex in the model parameters.
One approach might be to minimize the implicit equation error as a surrogate for equation error. This approach however, strongly biases the resulting model and leads to poor performance [42]. Instead we use the convex upper bound for equation error proposed in [42], which is based on Lagrangian relaxation. The least-squares equation error (2) for the implicit model (12) is: Note that this problem is not jointly convex in x t+1 and θ.
The following convex upper bound was proposed in [42]: where λ t (x t+1 ) = x t+1 −x t+1 is a Lagrange multiplier. The function (24) is convex in θ as it is the supremum of an infinite family of convex functions [38]. For our parametrization (21), E is block diagonal which then implies that (24) is upstream separable so it can be written asĴ wherê The evaluation ofĴ i ee is not trivial as it involves the calculation of the supremum of a non-linear multivariate function. In this work we linearise (25) with respect to x i t and solved for the supremum of the resulting concave quadratic function, giving: The cost function (26) can be optimized via a semidefinite program. Alternative methods for minimizing LREE can also be found in [42].

C. Alternating Directions Method of Multipliers (ADMM)
In Section IV-A we introduced a model set for which the constraints in Theorem 1 are downstream separable and in Section IV-B we introduced an upstream separable objective function. Note however, that the constraints and objective are not jointly separable with respect to the same partition. We use ADMM to solve this problem.
We now develop the algorithm for the case where (12) is well-posed, monotone and contracting, however, a parallel construction without monotonicity or contraction constraints introduces no additional complexity. Consider the following set of parameters Applying ADMM as discussed in Section II-E to the problem min θ∈Θ m 1Ĵ ee gives the following iteration scheme for iteration k: φ(k + 1) = arg min for ρ > 0. When using a GAM structure (21), we have the following result: Proposition 1. For the model structure (21), the ADMM iteration (28) separates into N upstream-separable optimization problems of the form (31) and the ADMM iteration (29) separates into N downstream-separable optimization problems of the form (32).
Proof. See Appendix B.
In particular, the ADMM approach corresponds to performing the following iterations locally at each node i = 1, ..., N : The distributed algorithm is listed in Algorithm 1. The steps (31) and (32) require access to the upstream and downstream parameters respectively. These can be solved by the nodes in the graph, however, communication between both upstream and downstream parameters is necessary between steps. The update (33) is trivially separable and can be solved as either an upstream or downstream separable problem.
For this reason, we take φ as the solution to ensure that the well-posedness, monotonicity and contraction constraints (14), (17), (19) are satisfied.

A. Conservatism of the Separable Model Structure
We have proposed searching over the model set (21) with θ ∈ Θ m 1 (27), and it is important to understand which systems may fall into this model set. A particular question of interest is whether there are contracting and monotone systems which cannot be represented by this structure, and there are two main reasons why this may occur: the separable structure of the model (21), and the assumption of a separable contraction metric in condition (19).
An exact characterization of the functions functions that be approximated via the GAM structure (21) is difficult to give, however, they have widely applied in statistical modelling, see [69] for details. Note that while the functions in the implicit system (21) are additive, the resulting explicit system (8) may not be. For example, the scalar functions e(x) = √ x and f (x, y) = (x + y). Both e and f are additive; however, the function e −1 • f (x, y) = x 2 + 2xy + y 2 is not.
Conservatism may also be introduced by the assumption of a separable contraction metric. For the case of linear positive systems, it is has been shown that the existence of a separable Lyapunov functions is both necessary and sufficient [43]. This means that Θ m 1 contains all positive linear systems [46]: Theorem 2. For the system (21), if e and f are affine in (x, u), then the model set characterised by (14), (17) and (19) is a parametrization of all stable, discrete-time, positive linear systems.
Proof. See Appendix C.
Things are more complicated for nonlinear monotone systems. Separable contraction metrics have been shown to exist for certain classes of monotone systems [49] and separable weighted 1 contraction metrics have been used for the analysis of monotone systems [6], [51]. For incrementally exponentially stable systems, it has been shown that the existence of weighted 1 contraction metrics, are necessary and sufficient [50], however the state-dependant weighting depends on the all system states and is therefore not separable in the sense we use. To the authors' knowledge, a complete characterisation of the class of contracting monotone systems that admit separable metrics is still an open problem.

B. Consistency
It has be previously noted that system identification approaches that guarantee stability lead to a bias towards systems that are too stable [28], [29], [71]. Empirical evidence suggests that for methods based on Lagrangian relaxation [41], [42] this bias is smaller.
There are a number of situations that lend themselves towards consistent identification. Firstly, consider the situation where we have noiseless state and input measurements produced by a model with θ * ∈ Θ m 1 such that J ee (θ * ) = 0.
Then we also haveJ l (θ) * = 0 so the bound is tight and LREE recovers the true minimizer of equation error. Now, consider the situation where the unconstrained minimizer of equation error (2), is a monotone, additive function that is contracting in the identity metric. That is, for the function a φ * (x, u) where φ * = arg min J ee (φ), the following hold: 1) a φ * (x, u) is additive so that (8) can be written as Then, optimizing (26) returns the same solution as the unconstrained least squares minimizer of J ee . Proposition 2. Consider models of the form (21) with e θ (x, u) = Ex and f θ (x, u) = a φ * (x, u) for some θ. If properties 1, 2, 3 hold for a φ * (x, u) where φ * = arg min J ee (φ), then for θ * = arg min Proof. Our proof mirrors that of [42, Sec. IV Proposition 1].

C. Iteration Complexity of Distributed Algorithm
In this section, we investigate the computational complexity of each step in the distributed algorithm. In general, the complexity depends on the model parametrization used, however, we limit our discussion to the case where the models are parametrized by polynomials and the constraints are enforced using sum of squares programming.
The first step, (31), is a semi-definite program and can be solved using standard solvers. If no structural properties are exploited, a primal-dual interior point method (IPM), would require O max{n 3 θ i u , n θ i u n i 3 , n 2 θ i u n 2 i } operations per iteration per node [72], where n θ i u is the number of upstream free parameters .
The second step, (32), is a sum-of-squares problem that can solved as a semi-definite program. If e and f both have degree 2d, then the size of Gram matrix corresponding to (19) for the additive model (21)  If a local computational resource is associated with each node in the network, and the number of neighbours for each node satisfies a uniform bound, then the time taken for each iteration will not increase with the number of nodes. However, computation time will grow quickly with the number neighbours, the size of the local states and the degrees of the polynomials used in the model.

D. Other Quality of Fit Criteria
Lagrangian relaxation of least-squares equation error was chosen as it is convex, upstream separable, quick to compute, and leads to a simple implementation of ADMM. Any method that treats neighbouring states as exogenous inputs will be upstream separable. However, any such approach will also be susceptible to instability due to the introduction of new feedback loops via the network topology, even if it guarantees stability of the local models. Consequently, one can similarly apply any convex quality of fit criteria such us convex upper bounds on simulation error [38], [41] and still guarantee convergence of ADMM. Alternatively, a non-convex quality of fit criteria like simulation error can be used at the expense of ADMM's convergence guarantees.
If a model structure does not permit distributed identification, the conditions proposed in Section III can still be used to ensure stability and/or monotonicity. Joint convexity of the model set and stability constraints is still an important as it simplifies constrained optimization allowing for the easy application of penalty, barrier or projected gradient methods [39].

VI. NUMERICAL EXPERIMENTS
In this section we present numerical results exploring the scalability and identification performance the proposed approach.
This section is structured as follows: first, we look at the identification of positive linear systems, and explore the computational complexity of the 1 and 2 contraction conditions; we then explore the consistency of fitting nonlinear models when the true system lies in the model set, essentially analysing the effect of convex bound on equation error; finally, we apply the method to the identification of a (simulated) nonlinear traffic network. The traffic network does not lie in the model set so only an approximate model can be identified. We explore the regularising effect of the model constraints and scalability of the method to large networks.
Previous methods for the identification of models with stability guarantees have ensured contraction using a quadratic metric [38], [41], [42]. Contraction is implied by the following semidefinite constraint: where P ∈ S n×n , P 0, η > 0. We refer to (34) as an 2 contraction condition as it implies the contraction conditions (6) with a state dependent weighted 2 norm of the differentials We will make future reference to the following convex sets of parameters, in addition to θ ml1 defined in (27): Here the subscripts refer to the following properties: • m 1 -Monotone 1 contracting models i.e. θ ∈ Θ m 1 , • m -Monotone models i.e. θ ∈ Θ m , • u -Models that are not constrained to be contracting or monotone i.e. θ ∈ Θ u , • m 2 -Models that are monotone and contracting in 2 , i.e. θ ∈ Θ m 2 , All functions e i , and f ij are polynomials in all monomials of their arguments up to a certain degree. As a baseline for comparison, we will also compare to models denoted P oly, with explicit polyonomial models (1) fit by least-squares without any separable structure imposed. We will also compare to standard wavelet and sigmoid Nonlinear AutoRegressive with Exogenous input (NARX) models implemented as part of the Matlab system identification toolbox.
For the implicit models, the model class prefix is followed by the degrees of the polynomials in e and f in parenthesis. For example, the notation u (3,5) refers to unconstrained models with e having degree 3 and f having degree 5. For the explicit polynomial models P oly, the degree used follows in parenthesis, so P oly(5) are explicit polynomial models of degree 5 in all arguments.
The NARX models were fit at each node using the regressors (x i t ,ȗ i t ,ȗ i t+1 ). The wavelet NARX models were set to automatically choose the number of basis functions and the sigmoid NARX models were set to use 10 basis functions. The focus for each model was set to produce the best performance. For the wavelet network, we used a focus on simulation and for the sigmoid network, we used a focus on prediction.

A. Identification of Linear Positive Systems
In this subsection we study the scalability of the proposed method for the identification of linear positive systems.
We compare the computation time using the proposed 1 contraction constraint to a previously proposed 2 contraction constraint (i.e. quadratic Lyapunov function). Note that for linear systems, the model sets m 1 and m 2 both are parameterizations of all stable positive linear systems so no difference in quality of fit is observed.
1) Scalability of Separable Linear and Quadratic Metrics: We illustrate the difference in scalability between the models m 1 (1, 1) and m 2 (1, 1). Each experimental trial consists of the following steps: 1) A stable positive system with state dimension n x is randomly generated using Matlab's rand function; A ∈ R nx×nx has a banded structure with band width equal to 9. Stability was ensured by rescaling A to have a spectral radius of 0.95. 2) The system is simulated for T = 10 4 time steps;x 1:T is obtained by adding white noise to the simulated states at SNR equal to 40dB. 3) This process is repeated 5 times for each n x .
The time taken to solve each optimization problem is shown in Fig. 2. Here, we see a significant improvement in the computational complexity from approximately cubic growth for m 2 to linear growth for m 1 . The networked approach allows us to solve stable identification problems with at least 3000 states. Note that no explicit attempts to exploit the sparsity of the system were made; use of solvers and parsers designed to exploit sparsity could improve performance, especially for the SDPs associated with the LMI parametrization, e.g. [74].

B. Identification of Nonlinear Models
In this section we study the consistency of fitting nonlinear implicit models via the LREE bound on equation error. In Section V-B we saw that in the noiseless case, optimization of LREE will return the true model parameters. We will now explore the effect of introducing noise on the model estimates. The experiments in this section can be seen to supplement those in [42, Sec. IV] which studied the effects of noise and model stability on consistency in the linear setting.
We generate models a * (x, u) by sampling a parameter vector θ and then projecting onto the set Θ m 1 . The models have degree 3, state size n = 2 and m = 1. We then generate training data with T samples by randomly sampling (x t ,ũ t ) from the uniform distribution on [0, 1] and generated noisy measurements of x t+1 byx t+1 = a * (x t ,ũ t ) + v t , where v t is normally distributed noise with a specified Signal to Noise Ratio (SNR). Models a(x, u) are then trained by minimizingJ l with θ ∈ Θ m 1 and performance measured using Normalized Equation Error (NEE): where a(x, u) is the identified dynamic model and a * (x, u) is the true where |f (x)| 2 = x∈D |f (x)| 2 dx is the sample estimate of the 2-norm of the function f . In Figure 3, we have plotted the NEE that results from fitting models from m 1 (3, 3) by optimizing LREE (26) and implicit equation error (22). We can see that LREE provides a much better fit than implicit equation error, especially as the number of data points increases.
To explore the effect of noise on the consistency of LREE, we have plotted NEE versus the size of the dataset for varying noise level (measured in decibels) in Figure 4. If we had a  N EE = 0 with consistent slope for all SNR levels. What we in fact observe, however, is that in noisier conditions the NEE initially decreases and then plateaus at a certain level. This phenomena can also be seen in [75,Sec. IV], where LREE produces models biased towards being too stable, even in the infinite data limit.

C. Identification of Traffic Networks
In this section we examine a potential application of our approach, the identification of a traffic network. The dynamics of traffic networks are thought to be monotone when operating in the free flow regime [7]. Note that monotonicity of some traffic models is lost when certain nodes are congested [76].
The data are generated using the model in [7], which is not in the proposed model set. Hence this section provides a test of robustness of the proposed approach to modelling assumptions.
For this application, we consider using equation error as a surrogate for simulation error. Model performance is therefore measured using Normalized Simulation Error (NSE): where x t are the simulated states. We will first introduce the model, then study the effect of the model constraints by comparison to existing methods, and finally examine scalability to large networks.
1) Simulation of a traffic network : The dynamics are simulated over a graph (e.g. Fig. 6), where, each node i represents a road with state corresponding to the density of traffic on the road, denoted ρ i . Nodes marked in allow cars to flow into the network, and nodes marked out allow cars to flow out of the network. Each edge (i, j) is randomly assigned a turning preference denoted R ij such that i R ij = 1 (this ensures that the total number of cars at each intersection is conserved). Each node i has a capacity of C i = 1. Vehicles transfer from roads i to j according to the routing policy, where d i (ρ) = min(10, ρ) and s i (ρ) = max(2C i − ρ, 0) are monotone demand and supply curves for road i. The dynamics of the complete system are then found to bė where The input nodes i ∈ in take a time varying input u i . We use the following method to generate data sets of size T : (i) First, we generate an input signal for each u i of size T .
This signal changes value every 5 seconds to a new value that is normally distributed with mean µ u and standard deviation σ u . Negative values of u are set to zero. An example input signal is shown in Fig. 5. (ii) The dynamics (37) are integrated over t f seconds. (iii) A training set of size T = 2t f is generated by sampling every 0.5 seconds.  2) Regularization Effect of Model Constraints: In this section we will explore the effects of introducing monotonicity, positivity, and contraction constraints.
Introducing model constraints limits the expressivity of our model. Consequently, one might expect the estimator bias to increase and the variance to decrease [77,Chapter 7]. Empirical evidence in this section suggests that a judicious choice of constraints can reduce the variance with a minimal increase in bias.
Using the method outlined in Section VI-C1 for simulating a traffic network and the graph depicted in Fig. 6, we generate 100 different training sets of size T = 1000 with µ u = 0, σ u = 0.2 and then compare the results on three different validation sets. The first validation set has inputs generated with parameters µ u = 0, σ u = 0.2 (the same as the training set). The second and third validation sets have parameters µ u = 0, σ u = 0.3 and µ u = 0, σ u = 0.4 respectively. These are used to test the generalizability of our model to inputs outside the training set.
In figures 7 and 8, we have plotted the N SE on both the training set and validation sets 1 and 2 for our proposed model sets, the polynomial model, the NARX models and the model set m 2 . The percentage of total models that displayed instability is indicated in both the bar graph in the upper portion of the figures.
In all cases, the identified linear models performed poorly.   This is unsurprising as the true system is highly non-linear. Comparing the models m 1 and m 2 with the remaining models, we can see that the stability constraints have a regularizing effect where increasing the degree of the polynomials reduces the median NSE; in other words, increasing model complexity improves model fidelity. The other models on the other hand perform worse with increasing the complexity. This is most clearly seen in the models u, where increase the polynomial degree results in poorer fits on validation data.
Our results also suggest that model stability constraints significantly improve robustness. Without stability constraints, a model that appears stable during training may turn out to be unstable under a slight shift in the input data distribution. This can be seen most clearly in the models m(5, 5) and P oly (3), where on the training data distribution, most models are stable. However, increasing the variance of the inputs to the network results a large number of unstable models with unbounded NSE, c.f. Fig. 7c and Fig. 8c. Further evidence is shown in Table I, where we can see that once the variance of the input data doubles, almost all models that do not have stability constraints are unstable.
To compare to a standard approach, we also compare to wavelet and sigmoid NARX models fit using the Matlab system identification tool box. The resulting N SE is shown in Fig. 8 and show the number of models producing unstable models and negative state estimates in tables I and II respectively. While we observed extremely high performance of the individually identified sub-systems, simulating the network interconnection of those sub-systems produces many unstable models, many negative state estimates and poor quality of fit.
For positive linear systems, both Θ m 1 and Θ m 2 are parameterizations of the same set of models. This is not the case for nonlinear monotone systems and the choice of parametrization impacts the resulting model performance. This can be seen in Fig. 8a, Fig. 8b and Fig. 8c where the models fit using our proposed 1 contraction constraint outperform those fit using the previously-proposed 2 contraction constraint.
Finally, looking at Table II, we can see that when models were not constrained to be positive u and P oly, a large number of models producing negative state estimates were identified. This can lead to non-sensical results in many applications, and prevents the application of synthesis methods that depend on monotonicity.
3) Scalability Comparison of 1 and 2 contraction: We now explore the scalability of the 1 and 2 contraction constraints for nonlinear models.
We construct traffic networks consisting of N = P + 2M nodes by placing P points randomly in a unit square and triangulating. M in nodes and M out nodes are then randomly assigned throughout the network. We generate training data using the method described in Section VI with T = 600, µ u = 0, σ u = 0.4 and a corresponding validation set. We then fit models m 1 (3, 3) and m 2 (3, 3) using an interior point method. This is repeated 5 times for a varying number of nodes. Figure 9 shows a plot of the time taken to solve each problem versus the total number of nodes. We observe that fitting models with an 2 contraction constraint has a com-   The validation NSE versus the number of agents is shown in Fig. 10 for the model set in m 1 (3, 3). We observe no deterioration of model performance as the number of agents increases, suggesting that our method can be effective when scaled to large networks. 4) Scalability Compared to Interior Point Methods: We conclude our numerical experiments with a comparison of the computational complexity of the proposed distributed algorithm to centralized optimization via standard interior point methods.
We introduce additional notation to distinguish between the centralized and distributed algorithms. We will use a subscript C to refer to models fit using the off-the-shelf interior point method. The subscript D is used to denote models fit using ADMM. For example, m 1 (3, 3) D is the problem of fitting the model m 1 (3, 3) solved using the distributed algorithm.
To control for the number of neighbors of each node, we generate random, connected, regular graphs of size N and   degree 4 and randomly assign P 2 in nodes and P 2 out nodes. Training data is generated according to Section VI-C1 with T = 500 and σ u = 0.2.
We then solve the problems It is important to note, however, that this does not take into account many of the complexities of distributed computing, for example the overhead associated with communication between nodes.

VII. CONCLUSION
In this paper we have proposed a model set for system identification that allows model behavioural guarantees such as stability (contraction), monotonicity, and positivity. Furthermore, we have introduced a particular separable structure that allows distributed identification and scalability to large networked systems via local node-to-node communication.
We have examined the proposed approach via a selection of numerical case studies including a nonlinear traffic network. The main conclusions are that the approach scales much better than previous approaches guaranteeing stability, and that behavioural constraints such as stability and monotonicity can have a regularising effect that leads to superior model predictions.

A. Theorem 1
We use the following lemma in the proof of Theorem 1:  Fig. 11: Runtime of ADMM compared to IPM where the number of threads is one or equal to the number of nodes. When calculating the results for "simulated" distributed computing, ADMM is run in series and time per iteration is taken to be the sum of the maximum times to solve each step. The slopes of the lines are 1.05, 1.36 and 0.047 respectively. Lemma 1. Suppose that for the system (21), there exists a weighted differential 1 storage function V t = |E(x t , u t )δ t | 1 , where E : R n × R m → M n such that V t+1 ≤ αV t and there exists some K 0 such that |δ t | 1 ≺ K|E t δ t | 1 , then the system is contracting in the sense of definition 1.
Proof of Theorem 1. First we will show well-posedness and monotonicity. We will then prove stability of monotone contracting systems and finally just contracting systems. For brevity of the equations, we will use a subscript t to refer to the evaluation of a function at a specific time, so E t = E(x t , u t ).
Well-posedness: Assume (14). Since E is a non-singular M matrix, there exists a diagonal matrix D such that ED + DE 0. Well posedness follows from the same argument as [38,Theorem 5].
Monotonicity: Assume (17). Since E as an M-matrix, it is inverse positive and E −1 F ≥ 0. The differential dynamics of the explicit system (8) can be written as δ xt+1 = E −1 t+1 F δ xt . Therefore, the explicit system is monotone.
Condition (16) then implies, where || · || 1 is the induced matrix norm , ||M || := max j i M ij . Stability follows from the same argument as in the proof of Theorem 1. Multiply by |E t δ t | 1 , we get: Contraction then follows from Lemma 1 with contraction metric V t = |E(x t , u t )δ t | 1 . Monotonicity and Contraction Finally, to see how contraction follows from (17) and (19), note that they imply conditions (15) and (16).
B. Proof of Theorem 1 Proof. The first step (28) can be broken up into the following sum: θ(k + 1) = arg min which is equivalent to the N optimization problems in (31). The second step (29) can be written as We will show that the indicator function can be written as a sum over i = 1, ..., N indicator functions each depending on φ i d . Splitting it up in terms of the individual constraints, we get I Θ m 1 (φ) = I Fx≥0 (φ) + I Fu≥0 (φ) + I E∈M (φ)+ I 1 (αE−F ≥0) (φ). (49) The first two terms can be written as element-wise SOS constraints. The last two terms can then be written as a sum over the columns of the matrices E and F . We can therefore right (49) as: where,

C. Theorem 2
Sufficiency follows from Theorem 1. We now prove necessity, i.e. that if a positive linear system is Schur stable, then θ ∈ Θ m 1 . Suppose a matrix A is Schur stable. Then by [78, proposition 2], there exists some z > 0 such that z A − z < 0. We can always rescale z such that z A − z ≤ − 1. With this z, we choose E = diag(z) ≥ 0 and F = EA ≥ 0. Then