A Levenberg-Marquardt algorithm for sparse identification of dynamical systems

Low complexity of a system model is essential for its use in real-time applications. However, sparse identification methods commonly have stringent requirements that exclude them from being applied in an industrial setting. In this paper, we introduce a flexible method for the sparse identification of dynamical systems described by ordinary differential equations. Our method relieves many of the requirements imposed by other methods that relate to the structure of the model and the data set, such as fixed sampling rates, full state measurements, and linearity of the model. The Levenberg-Marquardt algorithm is used to solve the identification problem. We show that the Levenberg-Marquardt algorithm can be written in a form that enables parallel computing, which greatly diminishes the time required to solve the identification problem. An efficient backward elimination strategy is presented to construct a lean system model.


I. INTRODUCTION
I N many practical applications, a mathematical model of a dynamical system is a prerequisite to understanding, predicting and manipulating the behavior of the system in an effective manner. Obtaining a suitable model can be a challenging task. For some systems, first principles (e.g., established laws of physics) may be used to derive model equations that represent the dominant system dynamics. However, for many other systems, this way of model discovery is out of the question due to insufficient information about the system. Moreover, even if first principles can be used to obtain a model, the resulting model may be too complex to use in real-time applications for design, optimization and control. For example, in areas such as fluid dynamics, electrodynamics, and quantum mechanics, a first-principle model may consist of multiple partial differential equations for which the evolution of the state of the system is hard to compute, let alone in real time.
Alternatively, we may construct a system model using data following a system identification or machine learning This  approach. The corresponding system model can often be written in the form of an artificial neural network, where the type of neurons can range from commonly used neurons with sigmoidal or rectified linear activation functions to wavelets, splines, radial basis functions, monomials or Gaussian processes, to name a few [31]. Generally designed to be a universal function approximator, the use of an artificial neural network often leads to an unnecessarily complex model. High complexity strongly hampers the use of the model in real-time applications, and it commonly goes hand in hand with stringent data requirements. It should be noted that the complexity of the model strongly depends on the chosen coordinate frame and type of neurons. As a first example, the orbits of the planets in our solar system can be described much simpler and in fewer terms by taking the Sun as center of reference than Earth. As mentioned in [2] and [29], many physical systems allow for a sparse representation in a suitable coordinate frame. As a second example, it takes a large artificial network with rectified linear units to accurately approximate a quadratic function on a large (but finite) domain, while it requires only three monomials to do the same (i.e., monomials of degrees zero, one, and two, respectively). Training such a large artificial network needs many measurements of the quadratic function, while fitting the three monomials requires a minimum of only three data points (disregarding measurement noise). Without any knowledge of the system to go by, the chance that a chosen coordinate frame and neuron type result in an accurate model of low complexity is generally slim.
In many industrial applications, neither a first-principle model nor a purely data-driven model is constructed with high accuracy due to various reasons including unknown environmental conditions, a varying composition of raw materials, knowledge caveats, imprecise data registration, and a lack of exploration. In addition, there may be parts of the system for which only a few measurements are available due to a high cost or inability to conduct more. What complicates matters further is that these measurements may have irregular sampling rates because smart sampling algorithms (see [22]) have been applied or measurements have been conducted manually. To compose accurate models, one is often forced to combine both first-principle information and measurement data. These resulting grey-box models come in many different forms [15]. In an attempt to improve the model accuracy while keeping the model complexity low, we can use an artificial neural network to model only those parts of the system for which no first-principle model is available or for which the solution of the first-principle model is difficult to compute or inaccurate.

arXiv:2203.12379v1 [eess.SY] 23 Mar 2022
However, due to a high network complexity, this alone is generally not enough to ensure that the resulting model is simple enough to be used for real-time applications.
In this paper, we apply sparse identification to derive a low-complexity network model. The paper is inspired by the works in [2] and [29]. In these, a lean feedforward network is obtained by removing all edges (i.e., the connections between the neurons in the network) that have little or no influence on the quality of the fit. The network can be pruned by adding a sparsity-promoting regularization term to the cost function (or loss function) that is used to train the network; see lasso [33] and sparse relaxed regularized regression [36], for instance. Alternatively, a sequential algorithm, such as stepwise regression [6] or sequential thresholding [2], may be applied to remove all irrelevant edges in an iterative fashion. Other model-selection approaches are discussed in [7] and [20]. The sparse-identification methods in [2] and [29] rely on linear regression. Although they are relatively simple and fast enough to perform model identification in real time [9], [25], they rely on the limiting assumptions that the network model is linear in all its parameters, and that all system states are measured and their first-order time derivatives can be accurately approximated. This last condition commonly translates to the requirement of a sufficiently high sampling rate for all measurements in order to mitigate the negative influences of measurement noise by averaging over multiple samples. In [19], the computation of time derivatives of the states is based on Koopman theory. This technique can be applied to low-rate sampled data, but the requirement of a fixed sampling rate is often too restrictive to be applied to industrial data.
If the model is composed of differential equations, we may apply numerical integration to compute the state values of the system at any point in time and compare these to the measurements to optimize the model. This removes the requirement that the network should be linear in its parameters. Moreover, an arbitrarily sampled data set can be used. In [30], the solutions of differential equations modeled by deep neural networks are computed using a Runge-Kutta scheme. In [26], the numerical integration algorithm is directly encoded in the kernels of the Gaussian processes. As an alternative to traditional numerical integration, a neural network can be trained to produce solutions of differential equations [11], [14]. The deep neural network in [27] is designed to obtain solutions of differential equations and identify model parameters. To the best of our knowledge, no attempts of network sparsification have been made for any of the identification methods that rely on numerical integration.
A possible reason for this is the high computational demand for training the network. Numerical integration of the state equations implies calculating the state values on a finite (and flexible) time grid. Evaluating the cost function used for training the network subsequently requires computing the state values for all grid points. The effective dimension of the optimization problem to be solved for training the network is equal to the number of network parameters plus the number of state variables times the number of grid points. A large number of grid points may be required to accurately approximate the state solution, especially if the time series of measurements span a large time interval. Therefore, the dimension of the optimization problem may be very large. In turn, the optimization problem may be difficult to solve. Note that network pruning would require solving an even harder regularized optimization problem or retraining the network multiple times using a backward elimination strategy.
In this paper, we develop an efficient method for training an artificial neural network for system models described by ordinary differential equations. Common training algorithms include gradient descent, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm and the Levenberg-Marquardt algorithm [12], [21], [32]. Compared to the first-order convergence of gradient descent, a faster second-order convergence can be obtained using L-BFGS or Levenberg-Marquardt. The Levenberg-Marquardt algorithm has the advantage that it is relatively easy to parallelize. Due to the Markovian nature of the system model, we can rewrite the approximation of the optimization problem by Levenberg and Marquardt as batches of recursive small-scale optimization subproblems. We effectively divide the time interval from the first to the last measurement in separate subintervals, where each subinterval corresponds to one batch of optimization subproblems. All batches can be solved in parallel. Subsequently, the solution to the original approximation can be reconstructed by combining the results from all batches. Parallelization can significantly reduce the wall time required to train the network. In addition, we outline how a similar problem formulation can be used to prune the network using backward elimination by removing irrelevant network edges one at a time.
The paper is organized as follows. A formulation of the system identification problem is given in Section II. The standard approach of solving the system identification problem using the Levenberg-Marquardt algorithm is discussed in Section III. In Section IV, it is shown how the solution method can be rewritten in order to parallelize parts of the algorithm. In Section V, a method for network sparsification using backward elimination is presented that utilizes a similar approach as outlined in Section IV to efficiently approximate the effect of removing a network edge on the overall quality of the fit. In Section VI, the sparsification method is illustrated by means of two examples. Advantages and drawbacks of the proposed methodology are summarized in Section VII.

A. Notations
The sets of real numbers, nonnegative real numbers and positive real numbers are denoted by R, R ≥0 and R >0 , respectively. N and N >0 denote the sets of natural numbers (i.e. nonnegative integers) and positive integers. The ceiling function that maps a real number α to the least integer greater than or equal to α is denoted by α . The transpose of any vector x is written as x T . For any nonsingular, square matrix A, its inverse is denoted by A −1 . Similarly, A −T is the transpose of the inverse. The Moore-Penrose pseudoinverse of the matrix M is written as M + . We denote the union of two sets B and C by B ∪ C. If C is a subset of B, then the set difference of B and C (i.e. the relative complement of C in B) is written as B \ C. Let T = {t 1 , t 2 , . . . , t n } be an ordered set with n elements. The notation {y(t)} t∈T is short for {y(t 1 ), y(t 2 ), . . . , y(t n )}. For any time-varying vector signal s(t), where t denotes the time, we write ds dt (t) asṡ(t). The Euclidean norm is denoted by · .

PROBLEM
Let the state trajectories of a dynamical system be given by the solutions of the following ordinary-differential equations that comprise a first-principle model: Here, x(t) ∈ R nx is the state vector with dimension n x ∈ N >0 , f phys : R × R nx → R nx is a continuously differentiable function, e(t) ∈ R nx is the model error, and t ∈ R denotes the time. The exact value of the state is unknown in most practical scenarios. To improve the accuracy of the first-principle model, we approximate the error by a continuously differentiable feedforward network f net : The network consists of an input layer, one or multiple hidden layers, and an output layer. The inputs of the network are the state x(t) and the vector of known time-varying functions u(t) ∈ R nu with dimension n u ∈ N. The output of the network is the estimate of the error in the first-principle model. The vector of network parameters (e.g., weights and biases) is given by a ∈ R na , where n a ∈ N >0 is its dimension. We do not pose any restrictions on the structure of the network or the choice of neurons. Note that this formulation can be easily adapted to include other kinds of error models, such as the closure models in [23]. Although the focus of this paper is to improve the accuracy of the first-principle model by approximating the model error with a feedforward network, we note that this formulation can be made more general by considering models of the form: where a is allowed to be any parameter vector. The only essential requirement is that the function f : R × R nx × R na → R nx is differentiable with respect to the state and parameter vectors. Obviously, the system model consisting of the first-principle model in (1) and the error model in (2) can be written in this form.
To identify suitable parameter values, we compare the state solutions of the model with process measurements. Consider a finite time series of process measurements. Let T m ⊂ R be a finite set containing all measurement times. We denote by y(t m ) ∈ R ny the vector of measurements with dimension n y ∈ N >0 taken at time t = t m ∈ T m . The dimension n y is allowed to be time varying (i.e., n y = n y (t m )). The relation between the measurements and the state is modeled by a continuously differentiable function h : for all t = t m ∈ T m . Note that the sampling of the measurements can be arbitrary. Notation-wise, it is convenient to number the measurements. We define where N m ∈ N >0 is the cardinality of T m (i.e., the set T m contains N m elements). For any i ∈ I m , t (i) m denotes the i th element of the set T m in ascending order. In turn, the i th vector of measurements is denoted by y(t Obtaining the parameter values that fit the measurements best requires that the state and the parameters of the model are identified simultaneously. To find suitable values, we compute the minimizer of the nonlinear least-squares problem: are the times of the first and last measurement, respectively. The symmetric, positive-definite matrices W x (t) ∈ R nx×nx and W y (t m ) ∈ R ny×ny are weighting matrices of the model errors ε x (t) and ε y (t m ), respectively. If ε x (t) and ε y (t m ) are regarded as vectors of stochastic variables, then W x (t) and W y (t m ) can be chosen equal to the covariance matrix of ε x (t) and ε y (t m ) in order to obtain an estimate with minimal variance. A more pragmatic approach is to think of W x (t) and W y (t m ) as scaling matrices, such that are vectors of elements with roughly equal magnitude based on prior knowledge, where and The terms in the cost function related to the (small) constants µ x , µ a ∈ R ≥0 are regularization terms to remedy the effects of overfitting [35]. If multiple or even infinitely many minima exist, these regularization terms keep the magnitudes of the computed, minimizing state and parameter values relatively small. The selection of suitable regularization constants is an open problem, and a topic for future work.
III. SOLVING THE SYSTEM IDENTIFICATION PROBLEM USING THE LEVENBERG-MARQUARDT ALGORITHM In order to compute the solution of the optimization problem in (6), we first discretize the problem in time. Let ∆t ∈ R >0 be the maximal step size of the discretization. Note that all continuous-time signals in (6) are defined on the interval points that are chosen such that T m ⊆ T d , and the distance between all subsequent points is smaller than or equal to ∆t. A routine to generate such set is outlined in Appendix A. To make the proceeding notations easier, we define the index set For any j ∈ J d , we denote by t (j) d the j th element of T d in ascending order. In addition, we define the index set We discretize the least-squares problem in (6) using the midpoint rule and the trapezoidal rule. With a minor abuse of notation, we get the optimization problem in (12), with )} jm∈Jm , respectively. Because the discretization errors of the midpoint rule and the trapezoidal rule for each discretization step are O(∆t 3 ) and the total number of discretization steps is O(∆t −1 ), the corresponding difference between the solutions of (6) and (12) for all discretization steps combined is O(∆t 2 ). Hence, the solutions are identical in the limit as ∆t approaches zero.
Remark 1: The midpoint rule and the trapezoidal rule are not the only numerical integration methods that can be applied to discretize the optimization problem in (6). However, any such integration method is required to be explicit in order to apply the Levenberg-Marquardt algorithm later in this section. Alternatively, we may change the problem formulation in (6) by replacing all integrals by finite, weighted sums of function evaluations, as in [30], for instance. However, the solution of this altered optimization problem may not correspond to the solution of (6), not even in the limit.
We may simplify the optimization problem in (12) by eliminating the variables of the model errors {ε x } and {ε y } using the constraint equations. Subsequently, the optimization problem can be written as and where and The optimization problem in (14) may be solved iteratively. Let k ∈ N be the iteration number. Writing the values of the optimization variables in (15) at iteration k as b k , the firstorder Taylor series approximation is accurate if γ k is sufficiently small, where γ k is defined as Applying the approximation in (20) and the change of variables in (21), we obtain the Gauss-Newton method: In order to prevent γ k from being too large, Levenberg [13] and Marquardt [18] propose to add a regularization term (also denoted as damping term) to the cost function, resulting in where λ k ∈ R >0 is a tuning parameter. Noting that (23) is a linear least-squares problem, the minimizer of the optimization problem is given by  (21) and (24), we get the update law: Hence, the new values b k+1 are equal to the old values b k minus the product of a symmetric, positive-definite matrix and the gradient of the cost function in (14). Note that the update law in (25) is similar to that of the Gauss-Newton method for small values of λ k . Moreover, the search direction in which the optimization variables are updated is similar to the gradient-descent direction for large values of λ k . Due to the uncertainty in the linearized model far from the linearization point, convergence of the algorithm cannot be guaranteed for small values of λ k . On the other hand, the decrease in the value of the cost function is small for large values of λ k . Therefore, it may take many steps to converge. Various algorithms have been proposed to balance the uncertainty and magnitude of the decrease in cost function value; see [16], [24], [34] and references therein. Algorithm 1 is a simple illustration of an adaptation method for λ k . While the gradient of the cost function is larger than a small constant σ > 0, new values of the optimization variables are generated in Line 3 in accordance with (25). If these new values lead to a decrease in cost-function value, the values are accepted and λ k is decreased by a factor 1 ρ1 . If not, the values are rejected and λ k is increased by a factor ρ 2 , where ρ 2 ≥ ρ 1 > 1. Note that robustness measures, such as a positive lower bound on the decrease of the cost-function value in Line 4 and an upper bound on the maximal number of iterations, should be added for practical applications of the algorithm [16]. Moreover, the optimization method by Levenberg and Marquardt only converges to local optima. If the measurement times span a large interval (i.e., t m −t m is large) and the maximal discretization step ∆t is small, the number of discretization points N d is very large. Therefore, the number of variables of the optimization problem in (23), which is equal to the dimension n b = n a + n x × N d of the vector b in (15), can be very large. In turn, inverting a very large matrix to compute the update in (25) can be costly. In this section, we present a method to parallelize the computation of the solution of the optimization problem in (23) to speed up computations.

Algorithm 1 Standard Levenberg-Marquardt algorithm
Let us define (21) is given by It follows that the optimization problem in (23) can be written as with for all j ∈ J d \ {N d }, and where {β k } is short for {β k (t (j) d )} j∈J d . We have omitted the arguments of the functions p x,j and p a and their derivatives to shorten the expressions in (28)- (31). Note that the corresponding arguments of p x,j and its derivatives are for j = N d , and that the argument of p a and its derivative is a k . Now, to compute the solution of the optimization problem in (23) using N s ∈ N >0 parallel processes, let us divide the sum in (28) in N s smaller sums:  (33). Note that w s,k can be rewritten as using the recursion H s,j+1,k = min for all j ∈ {1, 2, . . . , ζ(s+1)−ζ(s)−1}, with H s,1,k = q ζ(s),k . Subsequently, the optimization problem in (28) itself can be formulated as the series of nested optimization subproblems; see (36). Alternatively, we can write (36) as where G Ns,k is obtained by the recursion for all s ∈ {2, 3, . . . , N s }, with G 1,k = w 1,k . We are able to write the optimization problem in (23) as nested optimization subproblems due to the Markovian structure of the model in (3)-(4). The approach of solving an optimization problem by recursively solving optimization subproblems is known as dynamic programming [1]. There are two important things to note here. First, all optimization subproblems in (35), (37), and (38) are small-scale problems, depending only on a few optimization variables. Therefore, the operative memory required to compute the solution is much lower than for largescale problem in (23). Second, for each s ∈ {1, 2, . . . , N s }, the solution w s,k in (34) can be computed independently and, thus, in parallel. Given sufficient computational power, parallelization greatly reduces the time required to solve the optimization problem. This makes it viable to use much longer measurement series for the identification of the system, which may considerably improve the accuracy of the identified model. It should be noted that q j,k , r k , w s,k , H s,j,k and G s,k are all quadratic, sum-of-squares functions of optimization variables. It follows that the minimizers of the optimization subproblems in (35), (37) and (38) can be computed analytically as linear functions of other optimization variables (i.e., the ones we have not minimized over yet); see Appendix B. To be precise, for any s ∈ {1, 2, . . . , N s } and any integer j that satisfies ζ(s) < j < ζ(s + 1), the minimizer of the subproblem in (35) can be written as for some linear function l j,k . Similarly, for any s ∈ {2, 3, . . . , N s }, the minimizer of the subproblem in (38) is given by for some linear function l ζ(s),k . Because we have minimized the optimization problem over all other variables, the minimizing values of δ k can be directly obtained from (37). We can recursively reconstruct the optimal values of the minimizers β k (t (ζ(s)) d ) in (38) by substituting the optimal value for δ k in the linear functions in (40). Subsequently, we can recursively reconstruct the optimal values of the minimizers β k (t An overview of the resulting algorithm is given by Algorithm 2. Note that the for loops in Lines 3 to 10 and in Lines 22 to 27 can be parallelized. The same remarks regarding the robustness and convergence of the algorithm apply as for Algorithm 1 in Section III. (1)-(2), where the error of the first-principle model is modeled by an artificial neural network. To obtain a lean network, we propose to prune the network by sequentially removing network edges. By removing a network edge, we also remove the weight that corresponds to the network edge. Therefore, the number of parameters of the network decreases. Moreover, if all edges of a neuron are removed, then the neuron is no longer part of the network. Hence, removing network edges may also decrease the number of neurons in the network. We aim to only remove the network edges that have little or no effect on the minimal value of the cost function. In that way, we can maintain the quality of the fit after an edge removal. In general, we do not know by how much the minimal value of the cost function changes if any network edges are removed. To find out the difference in minimal value, we need to retrain the network after every

Consider the system model in
change in network structure. Due to the large number of edges and the substantial computational effort it takes to retrain the network, it is often infeasible to retrain the network for every network configuration. In the following section, we present a method to identify which network edges are likely to have the least effect on the minimal value of the cost function. These edges are removed sequentially after verification. This allows us to efficiently identify irrelevant network edges while keeping the overall computational cost relatively low.

A. Estimating the change in minimal cost function value due to a network edge removal
We note that we can effectively remove a network edge by setting its corresponding weight to zero. Because the weights of the network are part of the parameter vector a, determining the minimal value of the cost function after an edge removal is equivalent to determining the minimal cost function value after a certain element of the parameter vector a is set to zero. Suppose the artificial neural network in (2) is trained using Algorithm 2 of Section IV. Let the optimal state and parameter values be denoted by {x c } and a c , or by b c , for short; see (15). Moreover, let the associated cost function value be given by V c = g(b c ) 2 ; see (14). Similar to (20), we may approximate the output value of g(b) at any point b = b e close to b c by the first-order Taylor series approximation where γ e is defined as For ease of notation, we define for all j ∈ J m , such that γ e in (42) is given by Without loss of generality, we assume that the desired edge is removed from the network by setting the last element of the parameter vector a = a e to zero. Let the last element of a c be denoted by α c . If the last element of a e is zero and the last element of a c is α c , it follows from (43) that the last element of δ e is −α c . Equivalently, the last element of γ e is −α c ; see (44). We let whereδ e andγ e are vectors containing all remaining elements of δ e and γ e , respectively. Assuming that b e is close to b c (i.e., assuming that γ e is small), it follows that the minimal value of the cost function after removing the network edge can be accurately approximated by This optimization problem is very similar to the one in (23). Thus, a similar approach to computing the solution can be applied. As in Section IV, we may rewrite the optimization problem in (46) in the following recursive form: where G Ns,e is recursively defined as for all s ∈ {2, 3, . . . , N s }, with G 1,e = w 1,e . For s ∈ {1, 2, . . . , N s }, the functions w s,e are given by where H s,j+1,e is obtained by the recursion Here, for all j ∈ J d , the functions q j,e and r e are similarly defined as the functions q j,k and r k in (29)- (31). The differences are that their arguments are related to the vector γ e in (42) instead of the vector γ k in (21), that the functions p x,j and p a , that are used in their definitions, have arguments related to b c instead of b k , and that λ k = 0. for s = 2 to N s do 13: Compute l ζ(s),k in (40) 14: Compute G s,k in (38) 15: end for 16: Compute minimizer δ k in (37) 17: a k+1 ← a k + δ k

18:
for s = N s to 2 do 19: (40) 20: end for 22: for s = 1 to N s do 23: for j = ζ(s + 1) − 1 to ζ(s) + 1 do 24: (39) 25: end if 35: k ← k + 1 36: end while In addition, we may compute the state and parameter values that correspond to the minimizer of the optimization problem in (46). These can serve as initial conditions for retraining the network if removing the network edge appears beneficial. Therefore, we introduce the linear functions that describe the minimizers of the optimization subproblems in (48) and (50), which can be used to evaluate the minimizer of the original optimization problem in (46); see Section IV for more details. Similar to (39), for any s ∈ {1, 2, . . . , N s } and any integer j that satisfies ζ(s) < j < ζ(s + 1), the minimizer of the Compute l ζ(s)+j,e in (51) 5: Compute H s,j+1,e in (50) 6: end for 7: w s,e ← H s,ζ(s+1)−ζ(s),e 8: end for 9: G 1,e ← w 1,e 10: for s = 2 to N s do 11: Compute l ζ(s),e in (52) 12: Compute G s,e in (48) 13: end for 14: Compute minimizerδ e in (47) 15: Compute V e in (47) 16: δ e ← δ e −α c 17: a e ← a c + δ e 18: for s = N s to 2 do 19: Compute β e (t (ζ(s)) d ) by evaluating l ζ(s),e in (52) 20: Compute β e (t (j) d ) by evaluating l j,e in (51) 25: , δ e ), otherwise (51) for some linear function l j,e . Also, as in (40), for any s ∈ {2, 3, . . . , N s }, the minimizer of the subproblem in (38) can be written as for some linear function l ζ(s),e . The proposed method for estimating the minimal value of the cost function and the corresponding state and parameter values is summarized in Algorithm 3. Similar to Algorithm 2, note that the for loops in Lines 1 to 8 and in Lines 22 to 27 can be parallelized. Moreover, it should be noted that, if we want to compute the minimal cost function value for any other network edge removal instead of the one we have already computed, the results of Lines 1 to 13 of Algorithm 3 are identical and may be reused. This makes it relatively fast to compute all possible edge removals. Although Algorithm 3 may be applied to predict the minimal cost function value for any possible edge removal, in some cases, it is desirable to consider only a subset of network edges for removal. For example, by prioritizing the removal of edges related to monomials with a high degree, the polynomial output of the network may be of lower degree than without prioritization. In turn, this may improve the extrapolatory properties of the network model outside the measured region of the state. An example of this approach is given in Section VI-A.

B. Acceptance criteria for network edge removal
After we have estimated which edge removal results in the smallest increase in minimal cost function value, we compute the corresponding true minimal cost function value by retraining the network without the specific edge. We may use an acceptance criterion to decide whether to accept or reject a proposed removal of a network edge. The simplest criterion is solely based on the value of the cost function. A proposed removal of a network edge is accepted only if the value of the cost function after retraining the network with the proposed removal is lower than a preset limit value. Depending on the amount of initial overfitting, one may be able to remove many edges before exceeding the limit value after retraining the network, even if the limit value is only slightly larger than the minimal value of the cost function for the fully connected network.
Alternatively, we may use an information criterion to assess the quality of the estimate of the model error. In this case, the removal of a certain network edge is accepted only if the quality of the estimate does not decrease. By using an information criterion, both the value of the cost function as well as the number of network parameters are taken into account. There is a large selection of different information criteria to choose from. Several of them are described in detail in [3] and [10]. The choice of information criterion may significantly influence the sparsification process due to the differences in assumptions on which the various information criteria are derived. A similar approach of quality assessment has been proposed in [17], where Akaike's Information Criterion and the Bayesian Information Criterion are used to evaluate the results of the sparse identification approach in [2].
A third approach of arriving at an acceptance criterion is by cross-validation. Although there are many different forms of cross-validation, the main principle behind cross-validation is that the measurement data is partitioned in a training set, that is solely used for training the neural network, and a validation set (and/or test set; see [28], for example), to evaluate how well the predictions of the trained network generalize to previously unseen data. In that case, a proposed removal of a network edge is only accepted if it results in a lower value of the cost function based on the validation data. Note that determining the cost function value for the validation set requires solving the optimization problem in (6), where the (fixed) values of the network parameters are determined in the training stage. The use of a validation set avoids the problem of selecting a suitable limit value for the cost function or a suitable information criterion. However, because the training set and validation set must each cover the relevant part of the state space, this may put an unreasonably large demand on the amount of available measurement data.

VI. EXAMPLES
As discussed in Section I, our formulation of the system identification problem in Section II and the corresponding solution method in Sections IV and V allow for significantly lower data requirements and less stringent limitation on the network structure than other methods (e.g., in [2] and [29]) that are developed to construct a sparse system model. We illustrate our proposed sparse identification approach with two examples that could not have been computed with the methods in [2] and [29]. In Section VI-A, we identify the structure of an unknown system without measuring the full state vector, where the identification is based on measurements with different sampling rates. In Section VI-B, we train a network to approximate the error in the first-principle model without the requirement that the network model is linear in its parameters. We show that sparsification can simplify a network model while at the same time improve its accuracy by reducing overfitting.
This means that the true model error is given by The measurement noise has a Gaussian distribution with zero mean and a variance of one. We use a polynomial network with monomials up to degree two to estimate the model error.
A polynomial network is a network with a single hidden layer with monomials as neurons. Each output of the network is a linear combination of monomials. The corresponding monomial gains (i.e., the weights of the edges between the hidden layer and the output layer) are tuned to best match the measurement data. Note that all elements of the error vector (55) are polynomials of maximal degree two. Therefore, it is possible to obtain an exact expression of the model error using the network. We apply the sparsification approach presented in this paper to identify the model error, with weighting matrices W x = 10I and W y = I and regularization constants µ x = 10 −8 and µ a = 10 −3 ; see Section II. The maximal discretization step size is set to ∆t = 10 −3 . We use the stage-wise sparsification approach mentioned in Section V-A, where we remove all irrelevant edges connected to monomials with the highest degree in the first stage before we continue to remove irrelevant edges connected to monomials with the second-highest degree in the second stage, etc. Following this approach, we obtain the following network model: (56) Although we only have a few noisy measurements with a heterogeneous sampling rate at our disposal, the identified network model in (56) has the same structure as the model error in (55). Moreover, the identified parameter values are reasonably accurate. The time signals of the state, their estimates and corresponding measurements are depicted in Fig. 1. Now, suppose that only measurements of the first two state variables are available (i.e., y(t m ) ≈ [x 1 (t m ), x 2 (t m )] T ). To successfully identify the model error by inferring the dynamics of the third state from the measurements of the first two states, the measurement rate should be increased and the variance of the measurement should be decreased. Let the measurements be taken every 0.01 time units from time t = 0 to time t = 5. Moreover, let the noise variance be given by 10 −4 . We obtain the following network model: (57) There seem to be significant differences in structure and value between the error vector in (55) and the network model in (57) at first glance. However, using the change of coordinatẽ with scaling parameter γ ∈ R \ {0}, we get the following equivalent form of the dynamics in (53): For γ = 0.073, the corresponding model error is given bỹ Note that the network model in (57) is an accurate estimate of the error vector in (60). The time signals of the equivalent dynamics in (59) and the corresponding estimates are depicted in Fig. 2. We observe that the differences between the true values of the states and the obtained estimates are small, also for the third state, for which there are no measurements available.

B. Van der Pol oscillator
Consider the forced Van der Pol oscillator: with state x(t) = [x 1 (t), x 2 (t)] T ∈ R 2 and parameters A = 1, µ = 1 and ω = 0.2. Let the first-principle model and the corresponding model error respectively be given by and .
(63) The error in the first-principle model is visualized in Fig. 3. It is modeled by an artificial neural network with two hidden layers of 10 units each. The (differentiable) exponential linear unit with shape parameter a = 1 (see [5]) is used as activation function for all neurons. The inputs of the network are the states x 1 and x 2 . Note that, if the value of A would be uncertain, we could include sin(ωt) as an extra input to the network (i.e., u(t) = sin(ωt) in (2)) to also learn the model error related to an incorrect estimate of A. The outputs of the network are the functions f net,1 (x(t)) and f net,2 (x(t)) that represent the estimates of the two elements of the error vector. Because there exist no network parameters that represent the model error exactly for all values of x, the network provides only a local approximation of the model error. The network is trained based on noisy measurements of the (full) state of the system. The measurements are taken from time t = 0 to time t = 100, with a step size of 0.1 between subsequent measurements. The measurement noise has a Gaussian distribution with a zero mean and a variance of 0.01. By applying the sparsification approach in Section V with weighting matrices W x = W y = I, regularization constants µ x = 0 and µ a = 10 −3 , and maximal discretization step size ∆t = 10 −3 , we are able to reduce the number of network edges from 140 to 36 and the number of neurons from 20 to 16. This means that the overall number of network parameters (i.e., weights and biases) is decreased from 160 to 52. The layout of the sparsified network is depicted in Fig. 4. The absolute values of the estimation errors for the two elements of the error vector in (63) for the fully connected network as well as the sparsified network are depicted in Figs. 5 and 6. The state values that correspond to the process measurements are visualized by black dots in Fig. 7. Despite having much fewer edges and parameters, we observe that the sparsified network arguably represents the error in the first-principle model better on average in a neighborhood around the process measurements than the fully connected network does. The     sparsified network captures the first element of the error vector more accurately than the fully connected network for 100% of the area of the state space shown in the figures; it captures the second element of the error vector more accurately for roughly 63% of the area of the state space shown in the figures. This improvement of accuracy can be largely attributed to the reduction of overfitting due to network sparsification.

VII. DISCUSSION
In this paper, we have presented an efficient method for sparse identification of dynamical systems described by ordinary differential equations. We have shown that the Levenberg-Marquardt algorithm, on which our method is based, can be rewritten in a form that enables parallel computation to a large extent. Therefore, the wall time required to solve the identification problem can be greatly decreased. In addition, we have presented a sparsification approach based on backward elimination that utilizes the same time-efficient computation strategy to estimate the relevance of any given network edge, so that only those network edges that have a small effect on the quality of the model fit are removed. We have illustrated with two examples that our method is not affected by many of the limitations, such as fixed sampling rates, full state measurements, and linearity of the model, that plague other methods (e.g., in [2] and [29]). This flexibility is essential for the applicability of the method in many industrial settings. The main drawback of our approach is the relatively large computational demand. However, utilizing the time-efficient algorithm presented in this paper, the wide applicability of our method is likely to outweigh the larger computational cost in many practical scenarios. Therefore, a minimizer is given by Note that this minimizer is a linear function of the other optimization variables s. This minimizer is unique if M T r M r is invertible, in which case we can exchange the pseudoinverse by the regular inverse. The corresponding solution of the optimization problem in (69) is given by: with A. Computation using the QR-decomposition The computation of the minimizer in (74) and solution in (75) of the optimization problem in (69) can be simplified using the QR-decomposition. The QR decomposition decomposes any real, rectangular matrix A into an orthogonal matrix Q and an upper triangular matrix R (where the rank of R is equal to its number of nonzero diagonal elements), such that A = QR; see [8], for example. Note that A T A = (QR) T QR = R T R, because Q is orthogonal. Thus, for any partitioned, real, rectangular matrix we have for some partitioned, upper triangular matrix where the elements of R are implicitly given by (78). It follows that Hence, the minimizer in (74) and the solution in (75) can be efficiently computed using the QR-decomposition.