Learning Nonlinear Systems via Volterra Series and Hilbert-Schmidt Operators

Abstract-This article examines the application of regularization techniques and kernel methods to the task of learning nonlinear dynamical systems from input–output data. Our assumption is that the estimator belongs to the space of polynomials composed of Hilbert-Schmidt operators, which ensures the ability to approximate nonlinear dynamics arbitrarily well, even on bounded but noncompact data domains. By employing regularization techniques, we propose a finite-dimensional identification procedure whose computational complexity is proportional to the square of the training set size. This procedure is applicable to a broad range of systems, including discrete- and continuous-time nonlinear systems on finite- or infinite-dimensional state spaces. We discuss the selection of the regularization parameter, taking into account the measurement noise, and the incorporation of causality constraints. Furthermore, we explore how to derive estimates of the Volterra series of the operator by selecting a parametric inner product between data trajectories.

Filippo Cacace, Senior Member, IEEE, Vittorio De Iuliis, Member, IEEE, Alfredo Germani, Senior Member, IEEE, and Mario Merone, Member, IEEE
Index Terms-Learning, nonlinear systems, reproducing kernel Hilbert space (RKHS), system identification, Volterra series.
In an abstract setting, the output of a system is generated by some nonlinear operator applied to the input, which may include the initial state of the system. The challenge is that the identification of this input–output operator is always an ill-posed problem, since the operator cannot be uniquely inferred from a finite set of input and output trajectories. We recall that a problem is ill-posed if one of the three conditions of existence, uniqueness, and continuous dependence on data fails to hold for its solution [33], [37]. Even when a solution exists in the solution space, the unavoidable presence of noise may cause existence to fail. This is where regularization comes into play. The essential idea is to find the best approximation of the solution in a space where the approximation exists, is unique, and is continuous with respect to the data. This introduces a tradeoff between accuracy, i.e., good accordance between actual and predicted data, and the requirement that solutions are well behaved. In the regularization approach, this tradeoff is usually controlled by a scalar regularization parameter [35], [38]. Kernel methods are then used to generate the best approximation in the infinite-dimensional operator space by projecting on the subspace of operators that can be represented from the available data. This approach has been widely exploited in the learning techniques mentioned above. The existing techniques have been applied to linear systems [23], [27], [39], [40], [41], [42], [43], linear discrete-time systems with outliers [44], discrete-time dynamics in Euclidean spaces [45], [46], [47], and nonlinear functions with discrete samples [22], [48]. It is also worth mentioning that the theoretical features of Volterra series and polynomial kernel regression have already been considered in several important works, for example, [45], [47], [49], [50], and a large literature exists on block-oriented approaches to nonlinear identification exploiting specific structures of nonlinear operators, such as Hammerstein models, Wiener models, and combinations of them (see [51] and [52]).
Our work extends the approach presented in [53] for the linear case. The core idea is to extend existing results to the general problem of nonlinear maps over separable Hilbert spaces. Our task is to compute a finite-dimensional approximation of the unknown input–output map H_x → H_y starting from a training set of input–output samples {x_i, y_i}, i = 1, ..., N, where H_x and H_y are generic Hilbert spaces. This training set may either be generated by user-defined probe inputs or be collected from the autonomous dynamics of the system under consideration. This abstract framework encompasses many cases: for example, x_i and y_i may be finite-dimensional vectors of input and output values, but also vector-valued functions of time corresponding to the input and the output of a dynamical system, or functions on a spatial domain when the operator of interest is the solution of a PDE. Thus, the results obtained in this setting can be applied to discrete- as well as continuous-time systems evolving in R^n, distributed parameter systems, nonlinear systems with delay, finite-dimensional sequences, etc. The aims and contributions of this work can be summarized as follows.
1) We establish an abstract framework to generalize regularization and kernel-based techniques to a broader class of systems. We study which hypotheses are needed to ensure that the essential properties of the solutions provided by these methods are not lost in the generalization.
2) We show that, with a suitable choice of solution space, regularization techniques can approximate arbitrarily well nonlinear systems subject to reasonable physical constraints.
3) We show how the available prior knowledge can be plugged into the framework to restrict the solution space and improve the accuracy of the learning process. However, we show that solutions can be found in the absence of any prior knowledge, in a completely black-box setting.
4) We show that solution spaces based on polynomial or Volterra series have computational complexity comparable to linear estimates, and that solutions can incorporate causality requirements.
As for the first point, we derive a closed-form optimal approximation of the Volterra series (or of its truncated version up to a specified order) by extending the well-known technique of minimizing a regularized quadratic performance index computed on the training set, and we show that this solution admits a finite-dimensional computation.
The second contribution is to identify the most general solution space that yields solution operators with a physical characterization. The unknown nonlinear input–output map is represented as a Volterra series of Hilbert-Schmidt operators (H-S operators in the following). This choice is motivated by the considerations reported in Section II-C.
As for the third point, we show that the general estimator can be specialized by defining the inner product that generates the estimator based on a specific kernel. In other words, one can introduce prior information on the unknown operator to be estimated by choosing the most appropriate kernel to define the inner product on the linear space of the input trajectories.
Finally, we prove that causal operators can be introduced in this framework without additional constraints on the solution space. Our results prove the existence of a solution without any a priori hypothesis on the structure of the operator.
Although most of the above points have been considered in the literature mentioned above, we argue that providing a general and flexible framework has important benefits and implications. In the first place, it avoids the need for "reinventing the wheel" in each specific situation, while at the same time it indicates where specific choices, for example, the kernel design or the choice of the meta-parameters, come into play to tailor the solution to specific application needs.
The rest of this article is organized as follows. In Section II, we introduce the basic definitions and assumptions used throughout the article. Section III contains the main results on the computation of the optimal polynomial and Volterra series approximations. Section IV describes the recursive computation of causal estimates. Section V links our framework to the well-established theory of RKHS, by showing that our polynomial operators with H-S terms constitute in fact an RKHS. Section VI discusses a numerical example to illustrate the results. Finally, Section VII concludes this article.
Notation: L^2(D; C) denotes the Hilbert space of square integrable functions f : D → C, with scalar product [f, g] = ∫_D f(s) \overline{g(s)} ds. Throughout the article, we denote by col_{i=1}^{n}(x_i) the column vector with n entries x_i, and by row_{i=1}^{n}(x_i) the row vector with n entries x_i, which can be scalars, vectors, functions, or matrices. The lower and upper bounds of the index are sometimes omitted for brevity. The linear combination Σ_{i=1}^{k} a_i x_i of a finite number k of elements x_i can then be compactly written as row_{i=1}^{k}(a_i) col_{i=1}^{k}(x_i).

A. Problem Statement and Assumptions
The input–output map of an unknown nonlinear system S is modeled as a nonlinear operator from the input linear space H_x = L^2(D; H_1) to the output linear space H_y = L^2(D; H_2). The input and the output of the system are therefore square integrable functions on D, which is a bounded domain (for example, D = [0, T]). When D is discrete, H_x = ℓ^2(D; H_1) and H_y = ℓ^2(D; H_2). H_1 and H_2 are real separable Hilbert spaces. The image spaces H_1 and H_2 are purposefully kept generic to include a variety of input and output data, such as those generated by finite-dimensional systems, distributed parameter systems, delay systems, etc. We will refer to x_i ∈ H_x and y_i ∈ H_y as input and output trajectories, bearing in mind the reference case where D = [0, T] and H_1 = R^{n_x}, H_2 = R^{n_y}, which models linear or nonlinear systems with finite-dimensional input and output.
We consider the following problem. Given a training set of pairs (x_i, y_i), where x_i ∈ H_x is the input corresponding to y_i ∈ H_y, we want to find a finite-dimensional nonlinear approximation, in the least-squares sense, of the unknown input–output map. To this end, we shall use polynomial functions of a given order of the input, and the corresponding Volterra series, as the solution space. For the reasons explained in Section II-C, the existence of a polynomial approximation of the input–output operator rests on the following assumption.
Assumption 1: The operator F : H_x → H_y is uniformly S-continuous on the sets of bounded energy signals of interest.
S-continuity is defined in Definition 4. We also require that the y_i be defined on all of D, since their values are needed to compute the output of the estimated operator.
Assumption 2: The output signals y_i of the training set are defined on all of D.

B. Volterra Series and Hilbert-Schmidt Operators
We briefly introduce some formal notation to represent polynomials and Volterra series in a compact way. The following definition is for the case H_1 = R^n, but it can be easily extended to separable Hilbert spaces.
Definition 1 [Kronecker product on Hilbert spaces]: Given x, z ∈ H_n = L^2(D; R^n), their Kronecker product x ⊗ z ∈ L^2(D^2; R^{n^2}) is the function defined by
$$ (x \otimes z)(t, s) = x(t) \otimes z(s), \qquad (t, s) \in D^2 $$
where the product on the right-hand side is the ordinary Kronecker product.
Remark 1: The value of x ⊗ z is the product of the values of x and z at different points of D. Consequently, the temporal or spatial correlation between functions, when D is respectively a temporal or spatial domain, can be expressed through operators acting on x ⊗ z.
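To make Definition 1 concrete in a sampled setting, the following minimal sketch (ours, not from the paper, assuming scalar signals on a uniform grid over D = [0, T]) forms x ⊗ z as the outer product of sample values and checks numerically the factorization of inner products used later.

```python
import numpy as np

# Minimal sketch (not from the paper): discretized Kronecker product of two
# scalar signals on D = [0, T], illustrating (x ⊗ z)(t, s) = x(t) z(s).
T, M = 1.0, 200
t = np.linspace(0.0, T, M)
dt = t[1] - t[0]

x = np.sin(2 * np.pi * t)
z = np.exp(-t)

xz = np.outer(x, z)          # samples of (x ⊗ z)(t, s) on the grid D x D
x2 = np.outer(x, x)          # samples of the Kronecker power x^{⊗2}

# Numerical check of the factorization [x ⊗ z, u ⊗ v] = [x, u][z, v]
u = np.cos(2 * np.pi * t)
v = t ** 2
lhs = np.sum(xz * np.outer(u, v)) * dt * dt      # inner product on L^2(D^2)
rhs = (np.sum(x * u) * dt) * (np.sum(z * v) * dt)
print(abs(lhs - rhs))        # equal up to floating-point error
```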
Let {φ_k} be an orthonormal basis of H_n. Then x ⊗ z belongs to H^2_n, the subspace of L^2(D^2; R^{n^2}) spanned by the products φ_j ⊗ φ_k. In essence, H^2_n is the span of the Kronecker products of the elements of a basis of H_n. When x = z, x^{⊗2} = x ⊗ x defines the Kronecker power of elements of H_n. This can be generalized to higher orders by defining x^{⊗m} = x ⊗ x^{⊗(m-1)} ∈ H^m_n.
Lemma 1: For x, z, u, v ∈ H_n,
$$ [x \otimes z, u \otimes v]_{H^2_n} = [x, u]_{H_n}\,[z, v]_{H_n} \qquad (3) $$
and, more generally,
$$ [x^{\otimes m}, z^{\otimes m}]_{H^m_n} = [x, z]^m_{H_n}. \qquad (4) $$
Proof: It is sufficient to prove the first relationship. Since H_n is separable, given an orthonormal basis {φ_k}, both sides can be expanded on the products φ_j ⊗ φ_k, and the equality follows from the ordinary properties of the Kronecker product.
In this article, we consider polynomials whose monomial terms are H-S operators. We first recall the definition of H-S operators.
We recall that, for any x ∈ H_x and any orthonormal basis {φ_j} of H_x, a linear operator L satisfies
$$ L x = \sum_{j} [x, \varphi_j]_{H_x}\, L\varphi_j. \qquad (8) $$
Definition 2 [H-S operators]: A Hilbert-Schmidt operator on separable Hilbert spaces L : H_x → H_y is a linear operator such that
$$ \|L\|^2_{H.S.} = \sum_{j} \|L \varphi_j\|^2_{H_y} < \infty \qquad (9) $$
where {φ_j} is any orthonormal basis of H_x (i.e., the norm does not depend on {φ_j}).
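As a quick numerical illustration of Definition 2 (our own sketch, with an arbitrarily chosen kernel, not taken from the paper), a linear integral operator on L^2([0, T]; R) with a square integrable kernel becomes, after discretization, a matrix acting on sample values; its H-S norm reduces to a quadrature-weighted Frobenius norm, whereas the discrete analogue of the identity has no finite limit as the grid is refined.

```python
import numpy as np

# Sketch (not from the paper): H-S norm of a discretized integral operator
# (L u)(t) = ∫_0^T k(t, s) u(s) ds with an arbitrary smooth kernel k.
T, M = 1.0, 400
t = np.linspace(0.0, T, M)
dt = t[1] - t[0]

k = np.exp(-np.abs(t[:, None] - t[None, :]))     # kernel samples k(t_i, s_j)
L = k * dt                                       # matrix acting on sampled trajectories

# ||L||_HS^2 = ∫∫ k(t, s)^2 ds dt ≈ quadrature-weighted Frobenius norm of k
hs_norm_sq = np.sum(k ** 2) * dt * dt
print(hs_norm_sq)

# The identity operator corresponds to k(t, s) = δ(t - s); its discrete analogue
# is np.eye(M) / dt, whose "H-S norm squared" equals M and diverges with the grid.
ident_hs_sq = np.sum((np.eye(M) / dt) ** 2) * dt * dt
print(ident_hs_sq)
```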
The definition implies that H-S operators from H_x to H_y are bounded finite-energy input–output operators, as is apparent from (9). We recall that on infinite-dimensional Hilbert spaces the identity operator is not an H-S operator, because its H-S norm diverges; thus, the input–output behavior is not represented by H-S operators when there is a direct term from the input to the output, e.g., y_i = x_i cannot be represented through H-S operators when H_y is infinite-dimensional. Finally, the linear space of H-S operators L : H_x → H_y endowed with the inner product
$$ [L_1, L_2]_{H.S.} = \sum_{j} [L_1 \varphi_j, L_2 \varphi_j]_{H_y} $$
is itself a Hilbert space. A monomial of degree m is an H-S operator M_m : H^m_x → H_y, whose argument is the Kronecker product x_1 ⊗ ··· ⊗ x_m, where x_i ∈ H_x, and whose value is in H_y. M_m is multilinear with respect to its arguments x_i. Thus, a monomial M_m(x^{⊗m}) is a linear operator with respect to x^{⊗m} but not with respect to x.
Since M_m is an H-S operator, it is defined by specifying its value on some basis of H^m_x, as prescribed by (8). Let {φ_k} be an orthonormal basis of H_x and K_m = (k_1, ..., k_m) a multi-index with m elements, with Φ_{K_m} = φ_{k_1} ⊗ ··· ⊗ φ_{k_m}. Then ξ ∈ H^m_x can be represented as ξ = Σ_{K_m} [ξ, Φ_{K_m}]_{H^m_x} Φ_{K_m}, and
$$ M_m\, \xi = \sum_{K_m} [\xi, \Phi_{K_m}]_{H^m_x}\, M_m(\Phi_{K_m}) \qquad (12) $$
where Σ_{K_m} ‖M_m(Φ_{K_m})‖²_{H_y} < ∞ as required by (9).
In particular, it descends from (3) that when ξ = x^{⊗m} the coefficients factorize as [x^{⊗m}, Φ_{K_m}]_{H^m_x} = ∏_{i=1}^{m} [x, φ_{k_i}]_{H_x}.
Definition 3 [Polynomial operators]: Given H-S operators M_m : H^m_x → H_y, m = 1, ..., ν, the operator
$$ P_\nu(x) = \sum_{m=1}^{\nu} M_m\big(x^{\otimes m}\big) \qquad (14) $$
is called a ν-degree polynomial operator.
Remark 2: Since P_ν : H_x → H_y, by definition y = P_ν(x) is square integrable over D. Notice, however, that P_ν is neither a linear operator nor, consequently, an H-S operator. Definition 3 does not include a constant term M_0 ∈ H_y, which can be added as a separate term. For example, in the case of L^2(D; R^n), a complete second-degree polynomial is
$$ y(t) = y_0(t) + \int_D \kappa_{M_1}(t, \tau)\, x(\tau)\, d\tau + \int_D\!\!\int_D \kappa_{M_2}(t, \tau_1, \tau_2)\, \big(x(\tau_1) \otimes x(\tau_2)\big)\, d\tau_1\, d\tau_2 \qquad (15) $$
and y is specified through the choice of M_0 = y_0 ∈ H_y and of the operator kernels κ_{M_1} and κ_{M_2}. The Volterra series can be obtained as the limit lim_{ν→∞} P_ν(x) of a sequence of polynomial operators. It is worth remarking that this representation of the input–output map is not necessarily a causal one. For example, to introduce causality in (15), one needs to add the constraints κ_{M_1}(t, τ) = 0 whenever τ > t and κ_{M_2}(t, τ_1, τ_2) = 0 whenever τ_1 > t or τ_2 > t. The following theorem is proved in Appendix A1.
Theorem 1: For any ν, the linear space of polynomial operators P_ν : H_x → H_y on separable Hilbert spaces defined in (14) is a Hilbert space L_{P_ν} with the inner product
$$ [P_\nu, Q_\nu]_{L_{P_\nu}} = \sum_{m=1}^{\nu} [M_m, N_m]_{H.S.} \qquad (19) $$
where M_m and N_m are the monomial terms of P_ν and Q_ν. In particular, the corresponding norm is
$$ \|P_\nu\|^2_{L_{P_\nu}} = \sum_{m=1}^{\nu} \|M_m\|^2_{H.S.}. \qquad (20) $$
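To fix ideas on how a second-degree polynomial operator such as (15) acts on a trajectory, the following sketch (ours, with arbitrarily chosen kernels y0, k1, k2 used only for illustration) evaluates the constant, linear, and quadratic terms by quadrature on a grid; the causality constraints amount to zeroing the kernels for arguments beyond the current time.

```python
import numpy as np

# Sketch (not from the paper): evaluation of a scalar second-degree polynomial
# y(t) = y0(t) + ∫ k1(t,τ) x(τ) dτ + ∫∫ k2(t,τ1,τ2) x(τ1) x(τ2) dτ1 dτ2.
T, M = 1.0, 150
t = np.linspace(0.0, T, M)
dt = t[1] - t[0]

# Arbitrary causal kernels: zero whenever an argument exceeds t
causal = (t[None, :] <= t[:, None]).astype(float)
k1 = np.exp(-(t[:, None] - t[None, :])) * causal
k2 = 0.5 * np.exp(-(2 * t[:, None, None] - t[None, :, None] - t[None, None, :]))
k2 *= causal[:, :, None] * causal[:, None, :]            # zero if τ1 > t or τ2 > t

y0 = 0.1 * np.ones(M)
x = 1.0 + np.sin(2 * np.pi * t)

lin = k1 @ x * dt                                        # M1 x
quad = np.einsum('tab,a,b->t', k2, x, x) * dt * dt       # M2 x^{⊗2}
y = y0 + lin + quad
print(y[:5])
```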

C. Polynomial Approximations of Nonlinear Operators
The choice of H-S operators to approximate nonlinear dynamical systems has two main motivations. The first one is that H-S operators are bounded, as one would expect from a physical system. The second one is that a polynomial based on H-S operators can approximate with arbitrary precision a large class of input–output maps. In fact, the approximation of nonlinear operators by means of polynomials requires some version of the Weierstrass approximation theorem suited to Hilbert or Banach spaces. Such extensions have been widely explored in the literature in the past decades [54], [55], [56], [57], but they crucially depend on the compactness of the domain. This hypothesis is restrictive in the context of system identification, e.g., the ball of finite energy signals is bounded but not compact. A possibility to overcome this problem is to restrict ourselves to maps that are uniformly continuous with respect to the S-topology [58], since it was proved in [59, Th. 3] that, given a map F : H_x → H_y uniformly continuous with respect to the S-topology on a bounded set Ω ⊂ H_x, for all ε > 0 there exists a continuous polynomial P : H_x → H_y such that ‖F(x) − P(x)‖_{H_y} < ε for all x ∈ Ω. In other words, on bounded sets, the continuous polynomials are dense with respect to the family of uniformly S-continuous functions.
Definition 4 ([58], [59]): A map F : H_x → H_y is said to be uniformly continuous with respect to the S-topology if, for any ε > 0, there exists a self-adjoint nonnegative definite trace-class operator S : H_x → H_x such that ‖F(x_1) − F(x_2)‖_{H_y} < ε whenever [S(x_1 − x_2), x_1 − x_2]_{H_x} < 1.
The S-topology is weaker than the norm topology, and maps that are uniformly continuous with respect to the S-topology are also compact [59, Th. 1]. In addition, the functions in this class can be represented as continuous nonlinear functions of H-S operators [60], and in particular linear operators that are uniformly continuous with respect to the S-topology are H-S operators. This is not only a nice mathematical characterization, but it also implies the very "physical" property of being an input–output map with a smoothing action. Summarizing, the choice of H-S operators as monomial terms of P(x) is justified both on physical and mathematical grounds.

A. Learning the Optimal Linear Approximation
We first consider the problem of estimating the best linear approximation [i.e., ν = 1 in (14)], to introduce regularized estimates that will then be extended to the polynomial case. The material in this section summarizes the results in [53] for the linear case. Given a training set {(x_i, y_i)}, the simplest way to compute a linear approximation L(x) = M_1 x + M_0 of the input–output operator is the well-known least squares estimate (LSE) L̂ = arg min_L J(L), with J(L) = Σ_i ‖y_i − L x_i‖²_{H_y}. This approach has two main drawbacks. The first one is that, when both {x_i} and {y_i} have N independent elements, the LSE yields an L̂ such that J(L̂) = 0 because y_i = L̂ x_i for all i, and consequently the approach is prone to over-fitting. On the other side, when there is no linear operator such that y_i = L x_i for all i, the LSE yields an infinite norm estimate. More precisely, any sequence {L̂_n} of operators with finite rank n that minimizes J(L) is such that ‖L̂_n‖_{H.S.} → ∞ [33], [53]. The problem is relevant in practice when the output y_i is affected by noise n_i, since the operator L̂ tends to track the output noise in order to reduce the residual. One method to cope with the above problems is to introduce a penalty for the norm of the estimate in the cost function. Intuitively, this regularization alleviates over-fitting because "too precise" solutions are discarded in favor of more compact ones, and at the same time, it yields finite norm estimates when the LSE does not exist. In the case of the best affine estimate, the regularization approach amounts to defining a cost functional
$$ J_N(\lambda, L) = \sum_{i=1}^{N} \|y_i - M_1 x_i - M_0\|^2_{H_y} + \lambda\, \|M_1\|^2_{H.S.} \qquad (22) $$
where λ ∈ R^+ is a regularization parameter [13], [23], [61] (notice that the regularization parameter does not affect M_0). The solution is found by using variational calculus to find arg min_L J_N(λ, L). Let ε > 0 and consider the perturbation M̂_0 + ε ΔM_0. The minimum is obtained by imposing that the derivative with respect to ε is null, which can be satisfied for all ΔM_0 only when
$$ \hat M_0 = \bar y - \hat M_1\, \bar x, \qquad \bar y = \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \bar x = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (23) $$
which yields M̂_0 as a function of M̂_1. We notice that the "affine" term M̂_0 disappears when ȳ = x̄ = 0. For this reason, we can consider the centered training set {x_i − x̄, y_i − ȳ}, which has ȳ = x̄ = 0, and compute the best linear and polynomial estimates with the tacit assumption that affine estimates are obtained by adding M̂_0 given by (23). On a centered training set and for a given choice of the weight λ, the best estimate of L is now defined as
$$ \hat M_1 = \arg\min_{M_1} \sum_{i=1}^{N} \|y_i - M_1 x_i\|^2_{H_y} + \lambda\, \|M_1\|^2_{H.S.} \qquad (25) $$
and it can be computed as follows [53].
Theorem 2: If Assumption 2 holds, then, given a centered training set {x_i, y_i}, i = 1, ..., N, and the matrix A(x) ∈ R^{N×N} with entries A_{ij}(x) = [x_i, x_j]_{H_x}, the following hold.
1) For any λ ∈ R^+, the optimal linear estimator M̂_1 defined by (25) exists and is given by
$$ \hat M_1 = \mathrm{row}_i\big([\,\cdot\,, x_i]_{H_x}\big)\,\big(\lambda I_N + A(x)\big)^{-1}\,\mathrm{col}_i(y_i). \qquad (27) $$
2) The optimal linear estimate of M̂_1 ξ, for all ξ ∈ H_x, is
$$ \hat M_1\, \xi = \mathrm{row}_i\big([\xi, x_i]_{H_x}\big)\,\big(\lambda I_N + A(x)\big)^{-1}\,\mathrm{col}_i(y_i). \qquad (30) $$
3) The sample residual Σ_i ‖y_i − M̂_1 x_i‖²_{H_y} is monotonically increasing with λ.
4) The norm ‖M̂_1‖_{H.S.} is monotonically decreasing with λ.
When the dataset is not centered, M̂_1 is given by (27) with x_i and y_i replaced by x_i − x̄ and y_i − ȳ, and M̂_0 is given by (23). For completeness, a proof of the first point is reported in Appendix A2. The second point follows from straightforward computations. For points 3)-4), see [53, Ths. 3 and 5] or [33, Ths. 3.12 and 3.15].
The second point of Theorem 2 provides a finite-dimensional closed-form expression of M̂_1 ξ for any ξ ∈ H_x as a linear combination of the y_i in the training set. Since (λI_N + A(x))^{-1} col_i(y_i), which is a vector of functions with entries in H_y, depends only on the training set, it can be computed once and for all. The computation of M̂_1 ξ for a generic ξ requires only the computation of the vector row_i([ξ, x_i]_{H_x}) ∈ R^N. From the point of view of computational complexity, the most intensive operation is the scalar product, which typically involves the numerical computation of an integral. In this sense, the computational complexity of (30) is quadratic in the size N of the training set, since A(x) contains N(N+1)/2 distinct entries of the kind [x_i, x_j]_{H_x}.
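The following sketch (our own, for scalar continuous-time trajectories on a grid, with a synthetic training set built from an arbitrary impulse response) implements the finite-dimensional computation in point 2) of Theorem 2: build the Gram matrix A(x), factor (λI_N + A(x))^{-1} col_i(y_i) once, and evaluate M̂_1 ξ for a new input ξ at quadratic cost in N.

```python
import numpy as np

# Sketch (not from the paper): regularized linear estimate (30) with a sampled
# inner product [u, v] = ∫_0^T u(s) v(s) ds on a uniform grid.
rng = np.random.default_rng(0)
T, M, N, lam = 1.0, 200, 30, 1e-3
t = np.linspace(0.0, T, M)
dt = t[1] - t[0]

def inner(u, v):                       # scalar product in H_x (and H_y)
    return np.sum(u * v) * dt

# Synthetic training set: x_i random harmonics, y_i produced by some LTI map
X = np.array([np.sin(2 * np.pi * rng.uniform(0.5, 3.0) * t + rng.uniform(0, 1))
              for _ in range(N)])
h = np.exp(-5 * t)                     # arbitrary impulse response (illustration only)
Y = np.array([np.convolve(x, h)[:M] * dt for x in X])

A = np.array([[inner(xi, xj) for xj in X] for xi in X])   # Gram matrix, N x N
W = np.linalg.solve(lam * np.eye(N) + A, Y)               # (λI + A)^{-1} col(y_i), reused

def M1_hat(xi):
    """Estimate M̂_1 ξ as row_i([ξ, x_i]) (λI + A)^{-1} col_i(y_i)."""
    r = np.array([inner(xi, x) for x in X])               # row vector in R^N
    return r @ W                                          # a trajectory sampled on the grid

xi_test = np.sin(2 * np.pi * 1.7 * t)
print(M1_hat(xi_test)[:5])
```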
Example 1: In the case of continuous-time systems with H_x = L^2([0, T]; R^{n_x}) and H_y = L^2([0, T]; R^{n_y}), the estimate (30) can be written as
$$ (\hat M_1\, \xi)(t) = \int_0^T \kappa_{\hat M_1}(t, s)\, \xi(s)\, ds, \qquad \kappa_{\hat M_1}(t, s) = \mathrm{row}_i\big(y_i(t)\big)\, M_\lambda^{-1}\, \mathrm{col}_i\big(x_i(s)^{\top}\big) \qquad (32) $$
with M_λ = λI_N + A(x), where we recognize the underlying operator kernel of M̂_1.

B. Learning the Optimal Polynomial Approximation
This section considers the problem of computing the νth-degree polynomial that best approximates F : H_x → H_y from a training set of input and output signals {x_i, y_i}, i = 1, ..., N. The setting and the notation are the same as in Section III-A. When M_0 = 0, our hypothesis is that there exists a polynomial operator P_ν such that y_i = P_ν(x_i), and we aim at computing M̂_m, m = 1, ..., ν, so that
$$ \hat P_\nu(x_i) = \sum_{m=1}^{\nu} \hat M_m\, x_i^{\otimes m} \qquad (34) $$
is the "best" approximation of y_i, where, as before, x_i ∈ H_x. Given an orthonormal basis {φ_k} of H_x, M̂_m can be represented as in (12). The optimal estimate P̂_ν is obtained by minimizing the functional
$$ J^{\nu}_N(\lambda, P_\nu) = \sum_{i=1}^{N} \|y_i - P_\nu(x_i)\|^2_{H_y} + \lambda\, \|P_\nu\|^2_{L_{P_\nu}}. \qquad (37) $$
Theorem 3: If Assumption 2 holds, then, given a training set {x_i, y_i}, i = 1, ..., N, and the matrix A_ν(x) ∈ R^{N×N} with entries
$$ \big(A_\nu(x)\big)_{ij} = \sum_{m=1}^{\nu} [x_i, x_j]^m_{H_x} \qquad (38) $$
for any λ ∈ R^+ the optimal polynomial estimator with respect to (37) exists and its monomials M̂_m are given by
$$ \hat M_m\big(\varphi_{k_1} \otimes \cdots \otimes \varphi_{k_m}\big) = \mathrm{row}_i\Big(\textstyle\prod_{j=1}^{m} [x_i, \varphi_{k_j}]_{H_x}\Big)\,\big(\lambda I_N + A_\nu(x)\big)^{-1}\, Y $$
where {φ_k} is an orthonormal basis of H_x and Y = col_i(y_i).
Proof: The minimum P̂_ν of J^ν_N is the solution of the stationarity system obtained by setting to zero the first variation of J^ν_N in L_{P_ν}(H_x; H_y). Proceeding as in the proof of Theorem 2 in Appendix A2, we obtain an expression (42) of the monomials M̂_m in terms of the residuals y_i − P̂_ν(x_i). Since, by definition, P̂_ν(x_i) = Σ_{m=1}^{ν} M̂_m x_i^{⊗m} (43), replacing (42) in (43) yields (44). Thus, by stacking the P̂_ν(x_i), we obtain (45). Finally, we replace (45) into (42) to obtain the expression of the monomials in the statement.
Again, Assumption 2 together with (46) guarantees that P̂_ν(ξ) is well defined on all of D.
Corollary 1: In the hypotheses of Theorem 3, the optimal polynomial estimate P̂_ν(ξ), ξ ∈ H_x, is given by
$$ \hat P_\nu(\xi) = \mathrm{row}_i\Big(\textstyle\sum_{m=1}^{\nu} [\xi, x_i]^m_{H_x}\Big)\,\big(\lambda I_N + A_\nu(x)\big)^{-1}\, Y. \qquad (47) $$
The proof is entirely analogous to that of Theorem 2 in Section III-A. Remarkably, the polynomial estimate (47) has complexity proportional to N^2, just as the linear estimate (30). In other words, the computational complexity of (47) is insensitive to the degree ν of the polynomial, since A_ν(x) is obtained by summing the entry-wise powers of A(x).
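A sketch of Corollary 1 (ours, reusing an inner product and training arrays defined as in the previous snippet): the only change with respect to the linear case is that the Gram matrix A(x) is replaced by the sum of its entry-wise powers, and the evaluation vector uses the same powers of [ξ, x_i].

```python
import numpy as np

# Sketch (not from the paper): ν-degree polynomial estimate (47).
# A_ν(x) is the sum of entry-wise powers of A(x); the cost stays quadratic in N.
def poly_gram(A, nu):
    return sum(A ** m for m in range(1, nu + 1))          # entry-wise powers

def poly_estimate(inner, X, Y, xi, lam, nu):
    """Return P̂_ν(ξ) = row_i(Σ_m [ξ, x_i]^m) (λI + A_ν)^{-1} col_i(y_i)."""
    N = len(X)
    A = np.array([[inner(a, b) for b in X] for a in X])
    A_nu = poly_gram(A, nu)
    C = np.linalg.solve(lam * np.eye(N) + A_nu, Y)        # depends only on training data
    r = np.array([sum(inner(xi, x) ** m for m in range(1, nu + 1)) for x in X])
    return r @ C
```

With inner, X, Y, and xi_test defined as in the linear sketch, poly_estimate(inner, X, Y, xi_test, 1e-3, 3) returns the degree-3 estimate sampled on the grid.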
Remark 3: When one considers the full polynomial including the affine term, i.e., P_ν(x) = M_0 + Σ_{m=1}^{ν} M_m x^{⊗m}, it is easy to prove that the estimation procedure of Theorem 3 and Corollary 1 is exactly the same and that the best estimate M̂_0 of M_0 is given by the polynomial analogue of (23).
Example 2: For vector-valued continuous-time systems, H_x = L^2([0, T]; R^{n_x}) and H_y = L^2([0, T]; R^{n_y}), consider the case ν = 2. The generic quadratic operator M_2 can be represented as
$$ \big(M_2\, x^{\otimes 2}\big)(t) = \int_0^T\!\!\int_0^T \kappa_{M_2}(t, \tau_1, \tau_2)\, \big(x(\tau_1) \otimes x(\tau_2)\big)\, d\tau_1\, d\tau_2. $$
Let M_{λ,2} = λI_N + A_2(x). If we are interested in the explicit expression of the operator kernels of the optimal quadratic estimator (47), for κ_{M̂_1} we can use (32) of Example 1, with M_λ replaced by M_{λ,2}. To express κ_{M̂_2} we use (47) with m = 2. Straightforward manipulations yield
$$ \kappa_{\hat M_2}(t, \tau_1, \tau_2) = \mathrm{row}_i\big(y_i(t)\big)\, M_{\lambda,2}^{-1}\, \mathrm{col}_i\big(x_i(\tau_1)^{\top} \otimes x_i(\tau_2)^{\top}\big) $$
which is the underlying kernel of the quadratic term.

C. Convergence of the Approximated Volterra Series
Having obtained the best polynomial estimate for any given order ν, our next step is to derive the best regularized estimate of the Volterra series of the system by letting ν → ∞. Before doing so, it is appropriate to offer some comments to put these results in the framework of recent kernel-based identification methods (see, among many others, [4], [23], [27], and [62]). The results of the previous sections emphasize the prominent role of the matrices A(x) (linear case) and A_ν(x), whose entries are of the form [x_i, x_j]_{H_x} and [x_i, x_j]^m_{H_x}, respectively, and encode the "similarity" of the elements x_i and x_j. The inner product may seem like a very specific choice of similarity, but the framework described so far is abstract, and the inner product in H_x is actually a design parameter that may lead to different estimates. In the continuous-time case with finite-dimensional inputs/outputs, the standard inner product [x_i, x_j] = ∫_D x_i(s)^⊤ x_j(s) ds is an instance of the more general case [x_i, x_j]_κ = ∫_D x_i(s)^⊤ κ(s) x_j(s) ds, where κ : D → R^{n_x × n_x} is symmetric and positive definite. By changing κ, we obtain different regularized estimates. As discussed in [53], κ corresponds to the notion of "kernel" of the regularized estimate (not to be confused with the kernel of an RKHS introduced in Section V), and its choice, discussed, for example, in [18] for the case of LTI systems, should crucially incorporate prior knowledge about the system to identify, for example, stability or frequency content. Although the solutions described in Sections III-A and III-B apply beyond LTI systems, they may incorporate this flexibility by customizing the inner product in H_x to reflect the prior knowledge about the system. Clearly, in a totally black-box approach, one may not possess sufficient prior knowledge to make this choice. In this case, it is still possible to estimate the unknown operator by choosing any "natural" inner product in the input space.
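As an illustration of the parametric inner product [x_i, x_j]_κ (a sketch of ours, with an arbitrarily chosen weighting matrix function κ), in the finite-dimensional continuous-time case the only change to the learning procedure is in how the Gram entries are computed:

```python
import numpy as np

# Sketch (not from the paper): weighted inner product
# [u, v]_κ = ∫_D u(s)^T κ(s) v(s) ds for vector-valued trajectories on a grid.
def weighted_inner(u, v, kappa, dt):
    """u, v: arrays of shape (M, n); kappa: array of shape (M, n, n), SPD for each s."""
    return np.einsum('si,sij,sj->', u, kappa, v) * dt

# Example with n = 2 and a κ(s) that down-weights the second channel at large s.
T, M = 1.0, 100
s = np.linspace(0.0, T, M)
dt = s[1] - s[0]
kappa = np.zeros((M, 2, 2))
kappa[:, 0, 0] = 1.0
kappa[:, 1, 1] = np.exp(-s)          # prior: late samples of channel 2 matter less

u = np.stack([np.sin(2 * np.pi * s), np.cos(2 * np.pi * s)], axis=1)
v = np.stack([s, 1.0 - s], axis=1)
print(weighted_inner(u, v, kappa, dt))
```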
We now apply this flexibility in the choice of the inner product to estimate the Volterra series as the limit of a polynomial approximation. To our knowledge, this problem was studied for the first time in [49]. In our case, the problem is whether the polynomial approximations in (47) converge for ν → ∞. From the definition of A_ν(x) in (38), it is immediate to see that, as ν → ∞, the entries of A_ν(x) diverge when |[x_i, x_j]| > 1.
In order to obtain a convergent series of polynomials, we may redefine the norm (20) with a weight function w so that higher order monomials have a larger weight. Given a positive and unbounded function w : N → R^+, lim_{m→∞} w(m) = ∞, let us redefine the inner product and the norm in L_{P_ν} as
$$ [P_\nu, Q_\nu]_{w} = \sum_{m=1}^{\nu} w(m)\, [M_m, N_m]_{H.S.}, \qquad \|P_\nu\|^2_{w} = \sum_{m=1}^{\nu} w(m)\, \|M_m\|^2_{H.S.}. $$
By repeating the steps in the proof of Theorem 3, we easily obtain the representation of the optimal polynomial estimator with respect to a functional based on this new norm.
Theorem 4: If Assumption 2 holds, then, given a positive and unbounded function w : N → R^+, a training set {x_i, y_i}, i = 1, ..., N, and the functional
$$ J^{w,\nu}_N(\lambda, P_\nu) = \sum_{i=1}^{N} \|y_i - P_\nu(x_i)\|^2_{H_y} + \lambda\, \|P_\nu\|^2_{w} \qquad (55) $$
then for any λ ∈ R^+ the optimal polynomial estimator with respect to (55) exists and its monomials M̂_m are given by
$$ \hat M_m\big(\varphi_{k_1} \otimes \cdots \otimes \varphi_{k_m}\big) = \frac{1}{w(m)}\, \mathrm{row}_i\Big(\textstyle\prod_{j=1}^{m} [x_i, \varphi_{k_j}]_{H_x}\Big)\,\big(\lambda I_N + A_{w,\nu}(x)\big)^{-1}\, Y $$
where {φ_k} is an orthonormal basis of H_x, Y = col_i(y_i), and the matrix A_{w,ν}(x) ∈ R^{N×N} has entries
$$ \big(A_{w,\nu}(x)\big)_{ij} = \sum_{m=1}^{\nu} \frac{[x_i, x_j]^m_{H_x}}{w(m)}. $$
At this point, we can let ν → ∞, provided that w is suitably chosen. As already remarked in [49], different choices of w yield different possible polynomial kernels. The step forward with respect to [49] is that, in light of the results mentioned in Section I, all these kernels are universal, in the sense that they are capable of uniformly approximating all S-continuous functions on bounded sets of H_x. For example, if we choose w(m) = w^m with w > 1, we obtain the following limit for the Volterra series.
Corollary 2: In the hypotheses of Theorem 4, if w(m) = w^m with w such that |[x_i, x_j]_{H_x}| < w and |[ξ, x_i]_{H_x}| < w, then the estimator converges as ν → ∞ to
$$ \hat P_\infty(\xi) = \mathrm{row}_i\!\left(\frac{[\xi, x_i]_{H_x}}{w - [\xi, x_i]_{H_x}}\right)\big(\lambda I_N + A_{w,\infty}(x)\big)^{-1}\, Y \qquad (59) $$
where
$$ \big(A_{w,\infty}(x)\big)_{ij} = \frac{[x_i, x_j]_{H_x}}{w - [x_i, x_j]_{H_x}}. \qquad (61) $$
It is natural to identify (59) as the estimate of the Volterra series of the unknown system based on the rational kernel (61). Notice that, since [x_1, x_2] ≤ ‖x_1‖ ‖x_2‖, the constraint on w of Corollary 2 translates into a bound on the energy of the input signals. Another interesting choice of w, inspired by [49], is
$$ w(m) = m!\, \kappa(t)^m \qquad (62) $$
with κ(t) > 0 a positive and monotonic function of t ∈ D. The choice (62) yields the exponential kernel
$$ \big(A_{w,\infty}(x)\big)_{ij} = e^{[x_i, x_j]_{H_x}/\kappa(t)} - 1. \qquad (63) $$
The function κ(t) is chosen to be monotonically increasing with t in order to normalize the (potentially large) values [x_i, x_j]_{H_x} and to prevent numerical issues.
Remark 4: The learning procedure described in Section III-B requires fixing two meta-parameters, (λ, ν), whereas the procedure of Section III-C requires (λ, w), where w is a function. The choice of λ is discussed in Section III-D. Other approaches are possible since, as discussed at the beginning of this section, the inner product itself can be considered a design choice. For example, if the inner product is defined so that |[x_i, x_j]| < 1, the function w is no longer necessary. In all cases, the choice of the kernel for [x_i, x_j] determines the kernel A_{w,∞} of the Volterra series.
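The two weightings discussed above translate into simple transformations of the Gram matrix entries. The sketch below (ours, not from the paper) builds the limiting Volterra-series Gram matrix for both the rational kernel (61) and the exponential kernel (63), under the stated assumption that the inner products are dominated by w in the rational case.

```python
import numpy as np

# Sketch (not from the paper): limiting kernels of the Volterra-series estimate.
def rational_kernel(A, w):
    """(A_{w,∞})_{ij} = [x_i,x_j] / (w - [x_i,x_j]), valid when |[x_i,x_j]| < w."""
    assert np.all(np.abs(A) < w), "w must dominate all inner products"
    return A / (w - A)

def exponential_kernel(A, kappa):
    """(A_{w,∞})_{ij} = exp([x_i,x_j] / κ) - 1, with κ a normalization factor."""
    return np.exp(A / kappa) - 1.0

def volterra_estimate(A_inf, r_inf, Y, lam):
    """P̂_∞(ξ) = r_inf (λI + A_inf)^{-1} Y, where r_inf is the row of transformed [ξ, x_i]."""
    N = A_inf.shape[0]
    return r_inf @ np.linalg.solve(lam * np.eye(N) + A_inf, Y)
```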
Remark 5: A property of the estimator in Section III-B is that its complexity does not depend on the degree chosen for the polynomial estimator. Moreover, the estimate of the Volterra series (59) only requires finite-dimensional operations. In general, the complexity of the learning procedure is quadratic in the size of the training set, due to the need to compute [x_i, x_j]_{H_x} for all possible pairs. The computational cost becomes linear in N if it is possible to design a training set of orthogonal input functions, since this makes the matrix A_ν(x) block diagonal.
Remark 6: Since P̂_ν : H_x → H_y, the framework described in this section guarantees that the estimated trajectories are square integrable on D for any input x. For example, when H_y = L^2(R^+; R^n) it is obvious that lim_{t→∞} P̂(x)(t) = 0. When D is a bounded temporal domain, it may be of interest to extend the estimate ŷ to a larger domain. In other words, given a training set on [0, T], one might want to estimate the output of the system for an input on [0, ∞). This interesting extension is a theme of further research.

D. Choice of the Regularization Parameter and Noisy Data
In Section V, we shall see that the space of the estimators is a reproducing kernel Hilbert space; thus, the choice of λ may exploit results in that area [27], [42]. In this section, we show how to choose λ in the case of noisy measurements.
The results in Theorem 2 are easily extended to the linear space of polynomial operators. In particular, whenever there exists a "true" polynomial operator P_ν such that y_i = P_ν(x_i) for all i, and the data {(x_i, y_i)} are exact, then the estimate (47) is such that
$$ \lim_{\lambda \to 0^+} \sum_{i} \big\| y_i - \hat P_\nu(x_i) \big\|^2_{H_y} = 0 $$
that is, lim_{λ→0^+} P̂_ν is the minimum norm polynomial operator of degree ν that yields a null residual on the training set. Notice that lim_{λ→0^+} P̂_ν = P_ν is not ensured (the true operator need not have minimum norm). In the presence of noise, in general
$$ \sum_{i} \big\| y_i - P_\nu(x_i) \big\|^2_{H_y} > 0 $$
that is, the residual is not null even for the true operator. It is therefore useless to decrease λ in order to reduce Σ_i ‖y_i − P̂_ν(x_i)‖² too much, and we obtain the following maximum likelihood rule to tune λ.
Optimal rule: Choose λ so that Σ_i ‖y_i − P̂_ν(x_i)‖² equals the expected value of the residual of P_ν.
The optimal rule can be implemented when the expected value does not depend on P_ν itself, for example with additive output noise, y_i = P_ν(x_i) + n^y_i, where the expected value of the residual is Σ_i E[‖n^y_i‖²_{H_y}]. Conversely, the optimal rule cannot be implemented when the residual depends on P_ν, which is not available. This is, for example, the case of additive noise in the input, y_i = P_ν(x_i + n^x_i). The following heuristic rule is commonly used in these cases [33].
Heuristic rule: Choose λ so that Σ_i ‖y_i − P̂_ν(x_i)‖² equals its expected value.
The rule can be justified as follows. In the first place, it is statistically satisfied by P_ν itself. In the second place, when the "true" operator belongs to L_{P_ν}, the rule yields a unique value of λ, since the sample residual of P̂_ν increases with λ (Theorem 2) and its expected value depends on P̂_ν, which decreases with λ (Theorem 2). Finally, it is worth mentioning that the heuristic rule can be implemented by extending criteria that hold for the finite-dimensional case, based on the concept of equivalent degrees of freedom of regularized estimators [63], [64].
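Both rules amount to matching the training residual, which increases monotonically with λ, to a target level (the expected noise energy in the simplest case), so a one-dimensional root search suffices. The sketch below (ours, not from the paper) uses bisection on log λ; norm_sq is a hypothetical user-supplied function returning the squared H_y norm of a sampled trajectory.

```python
import numpy as np

# Sketch (not from the paper): choose λ so that the training residual matches a
# target value (e.g., the expected noise energy Σ_i E[||n_i||^2]).
def training_residual(A_nu, Y, lam, norm_sq):
    """Residual Σ_i ||y_i - P̂_ν(x_i)||^2 on the training set for a given λ."""
    N = A_nu.shape[0]
    C = np.linalg.solve(lam * np.eye(N) + A_nu, Y)
    Y_hat = A_nu @ C                       # stacked P̂_ν(x_i)
    return sum(norm_sq(Y[i] - Y_hat[i]) for i in range(N))

def tune_lambda(A_nu, Y, target, norm_sq, lo=1e-10, hi=1e2, iters=60):
    """Bisection on log λ, exploiting that the residual is increasing in λ."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if training_residual(A_nu, Y, mid, norm_sq) < target:
            lo = mid                       # residual too small -> increase λ
        else:
            hi = mid
    return np.sqrt(lo * hi)
```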
When more hyper-parameters are added to the scheme, for example the function w, the degree ν of the polynomial, or parameters that specialize the inner product, it is necessary to resort to more powerful hyper-parameter estimation techniques, such as empirical Bayes and marginal likelihood [65], the C_p statistic [66], k-fold cross validation [67], generalized cross validation [68], etc.

IV. RECURSIVE COMPUTATION OF CAUSAL ESTIMATES
In the estimation of a dynamical system, D may be a temporal domain and the variables x, y in y = F(x) are functions of time, i.e., x ∈ L^2([0, T]; H_1), y ∈ L^2([0, T]; H_2). It then makes sense to introduce a causality constraint in the estimate of F, i.e., to restrict ourselves to causal operators.
Definition 5: An operator F : L^2([0, T]; H_1) → L^2([0, T]; H_2) is causal if, for any t ∈ [0, T], the value F(x)(t) depends only on the restriction of x to [0, t].
The optimal polynomial estimate (47) is not causal, because P̂_ν(ξ)(t) is obtained as a linear combination of the y_i(t) with coefficients that depend on ξ over the whole interval [0, T] via the inner products [ξ, x_i]_{H_x}. However, the estimate is causal for t = T. Consequently, a simple method to obtain a causal estimate is to define two parametric spaces H^t_x = L^2([0, t]; H_1), H^t_y = L^2([0, t]; H_2) and let the estimates evolve with t. This estimate can be computed at any t by integrating a differential representation. Let us rewrite (47) as
$$ \hat y(t) = \mathrm{row}_i\Big(\textstyle\sum_{m=1}^{\nu} [\xi, x_i]^m_{H^t_x}\Big)\,\big(\lambda I_N + A^t_\nu(x)\big)^{-1}\,\mathrm{col}_i\big(y_i(t)\big) \qquad (65) $$
where A^t_ν(x) has entries Σ_{m=1}^{ν} [x_i, x_j]^m_{H^t_x}. Clearly, ŷ(t) is a causal estimate, because it depends only on past values of x_i and y_i. We can easily obtain a differential representation by recalling that
$$ \frac{d}{dt}\, [\xi, x_i]_{H^t_x} = [\xi(t), x_i(t)]_{H_1}, \qquad [\xi, x_i]_{H^0_x} = 0 $$
thus, the terms in (65) can be computed by integrating the scalar products from a null initial condition. Equation (65) provides the sought-after causal and recursive estimate of the optimal P̂_ν(ξ)(t). This causal estimate can be easily adapted to the case of discrete-time systems by replacing the differential equation with a difference equation.
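A discrete-time version of the recursion is immediate: the running inner products [ξ, x_i]_{H^t_x} and the entries of A^t_ν(x) are accumulated sample by sample, and (65) is evaluated with the data available up to the current time. The sketch below (ours, not from the paper, for scalar trajectories) implements this; the per-step linear solve is kept for clarity rather than efficiency.

```python
import numpy as np

# Sketch (not from the paper): causal estimate obtained by letting the inner
# products evolve with t (discrete-time version of the recursion behind (65)).
def causal_estimate(X, Y, xi, lam, nu, dt):
    """X: (N, M) training inputs, Y: (N, M) training outputs, xi: (M,) test input."""
    N, M = X.shape
    G = np.zeros((N, N))            # running [x_i, x_j]_{H^t_x}
    g = np.zeros(N)                 # running [ξ, x_i]_{H^t_x}
    y_hat = np.zeros(M)
    for k in range(M):
        G += np.outer(X[:, k], X[:, k]) * dt       # update scalar products up to t_k
        g += xi[k] * X[:, k] * dt
        A_t = sum(G ** m for m in range(1, nu + 1))
        r_t = sum(g ** m for m in range(1, nu + 1))
        y_hat[k] = r_t @ np.linalg.solve(lam * np.eye(N) + A_t, Y[:, k])
    return y_hat
```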

V. CONNECTION WITH RKHS
In this section, we interpret the results of Section III in light of the theory of RKHS, in order to establish a connection and shed more light on the properties of the estimation framework previously described. We specialize the abstract definition of RKHS [39], [40] to our framework.
Definition 6: Given a set X and a Hilbert space H_y, a H_y-valued RKHS on X is a Hilbert space H such that the elements of H are functions f : X → H_y and, for all x ∈ X, there exists a positive constant C_x such that
$$ \|f(x)\|_{H_y} \le C_x\, \|f\|_{H} \qquad \forall f \in H. \qquad (70) $$
A H_y-valued kernel of positive type on X × X is a map K : X × X → B(H_y; H_y), where B(H_y; H_y) is the Banach space of bounded operators H_y → H_y with the uniform norm, such that for all N ∈ N, {x_i} ⊂ X, and {y_i} ⊂ H_y,
$$ \sum_{i,j=1}^{N} \big[K(x_i, x_j)\, y_j,\; y_i\big]_{H_y} \ge 0. $$
A H_y-valued RKHS on X canonically defines a H_y-valued kernel of positive type on X × X as follows [39], [40]. Given x ∈ X, define the evaluation operator ev_x : H → H_y such that, for any f ∈ H, ev_x f = f(x). Equation (70) guarantees that the evaluation map ev_x is a bounded operator with conjugate ev*_x : H_y → H. The reproducing kernel K : X × X → B(H_y; H_y) associated to H is
$$ K(x_1, x_2) = \mathrm{ev}_{x_1}\, \mathrm{ev}^{*}_{x_2}. \qquad (74) $$
Lemma 2 ([39]): K defined by (74) is of positive type.
Conversely, if H is a real vector space, any symmetric H_y-valued kernel of positive type (i.e., K(x_1, x_2) = K(x_2, x_1)) defines a unique H_y-valued RKHS H whose reproducing kernel is K [39, Prop. 2.3]. If H is a complex vector space, a kernel of positive type is always hermitian. The first property that descends from the definition (74) of the reproducing kernel is that, for all x_1, x_2 ∈ X and y ∈ H_y, the function K(·, x_2)y belongs to H and coincides with ev*_{x_2} y (76). The second property that immediately descends from the definition (74) is the reproducing property
$$ [f(x), y]_{H_y} = [\mathrm{ev}_x f, y]_{H_y} = [f, \mathrm{ev}^{*}_x y]_{H} \qquad (77) $$
that holds for all f ∈ H, x ∈ X, y ∈ H_y.
We now return to the operators H_x → H_y considered in Section III. Given two Hilbert spaces H_x and H_y, it is easy to prove that the Hilbert space L_{H.S.} of H-S operators H_x → H_y is an RKHS (with X = H_x).
Theorem 5: The Hilbert space L_{H.S.} of H-S operators L : H_x → H_y is an RKHS with kernel K(x_1, x_2) = [x_1, x_2]_{H_x} I_{H_y}, where I_{H_y} denotes the identity in H_y.
Proof: It is easy to verify that K is a H_y-valued kernel of positive type and that K(·, x)y = [·, x]_{H_x} y is an H-S operator H_x → H_y that satisfies the reproducing property (77), since, for all L ∈ L_{H.S.}, x ∈ H_x, y ∈ H_y, one has
$$ [L, \mathrm{ev}^{*}_x y]_{L_{H.S.}} = \big[L, [\cdot, x]_{H_x}\, y\big]_{L_{H.S.}} = \sum_k \big[L\varphi_k, [\varphi_k, x]_{H_x}\, y\big]_{H_y} = [Lx, y]_{H_y}. \qquad (79) $$
Finally, the completion of the linear space of operators of the form Σ_{k=1}^{N} K(·, φ_k) y_k = Σ_{k=1}^{N} [·, φ_k]_{H_x} y_k, with arbitrary N and {y_k}, is exactly L_{H.S.}, since it follows from (8) that any L ∈ L_{H.S.} admits such an expansion with y_k = Lφ_k. Since K is symmetric, the uniqueness of the kernel follows from [39, Prop. 2.3]. Theorem 5 shows that K(x_1, x_2) generates any operator L ∈ L_{H.S.}. H-S operators are, in this sense, the most "natural" space of linear operators between L^2 spaces that constitute an RKHS.
Example 3: In the case of continuous-time systems, H_x = L^2([0, T]; R^{n_x}) and H_y = L^2([0, T]; R^{n_y}), the kernel of Theorem 5 is K(x_1, x_2) = (∫_0^T x_1(s)^⊤ x_2(s) ds) I_{H_y}, and the corresponding RKHS is the linear space of H-S operators H_x → H_y. The essential property that connects the RKHS and the Volterra series based on H-S monomials is that the Hilbert space L_{P_ν} of polynomial operators P_ν : H_x → H_y introduced in Section II-B turns out to be an RKHS (with X = H_x).
Theorem 6: The linear space L_{P_ν} of polynomial operators P_ν : H_x → H_y is an RKHS with kernel K_ν(x_1, x_2) = (Σ_{m=1}^{ν} [x_1, x_2]^m_{H_x}) I_{H_y}, where I_{H_y} is the identity operator in H_y.
Proof: We already know that L_{P_ν} is a Hilbert space of functions H_x → H_y; thus, we need only to prove (70). This is straightforward, since
$$ \|P_\nu(x)\|_{H_y} = \Big\| \sum_{m=1}^{\nu} M_m\, x^{\otimes m} \Big\|_{H_y} \le \Big(\sum_{m=1}^{\nu} \|x\|^{2m}_{H_x}\Big)^{1/2} \|P_\nu\|_{L_{P_\nu}} $$
where we have used Theorem 1. We have thus proved that L_{P_ν} is an RKHS. The kernel K_ν is clearly of positive type. Notice that from (4) it follows that [x_1, x_2]^m_{H_x} = [x_1^{⊗m}, x_2^{⊗m}]_{H^m_x}, and thus K_ν(·, x)y is an operator in L_{P_ν}, with monomials M_m = [·, x^{⊗m}]_{H^m_x} y. We can easily verify that K_ν satisfies the reproducing property (77), where we have used (12).
Example 4: In the case of continuous-time systems, H_x = L^2([0, T]; R^{n_x}) and H_y = L^2([0, T]; R^{n_y}), the kernel of Theorem 6 is K_ν(x_1, x_2) = Σ_{m=1}^{ν} (∫_0^T x_1(s)^⊤ x_2(s) ds)^m I_{H_y}, and the corresponding RKHS is the linear space of polynomial operators P_ν : H_x → H_y defined in (14).
As a consequence of Theorem 6, we can apply the results concerning RKHS to L_{P_ν}. The optimal polynomial estimator (47) can be rewritten by using the notation of the RKHS L_{P_ν} as
$$ \hat P_\nu(\xi) = \sum_{i=1}^{N} K_\nu(\xi, x_i)\, \hat c_i, \qquad \mathrm{col}_i(\hat c_i) = M_{\lambda,\nu}^{-1}\, \mathrm{col}_i(y_i) $$
where we have denoted M_{λ,ν} = (λI_N + A_ν(x)) ∈ R^{N×N}. This result is a consequence of the representer theorem for RKHS [61, Th. 4.2]; see also [2], [32], and [43].
Theorem 7 ([61]): Given a set {(x_i, y_i)}, x_i ∈ H_x, y_i ∈ H_y, i = 1, ..., N, and the functional J^ν_N in (37), the minimizer of J^ν_N, P̂_ν = arg min_{L_{P_ν}} J^ν_N(P_ν), can be represented as
$$ \hat P_\nu = \sum_{i=1}^{N} K_\nu(\cdot, x_i)\, \hat c_i \qquad (84) $$
for suitable ĉ_i ∈ H_y. It is immediate to verify that (84) coincides with (47) with M_{λ,ν} col_i(ĉ_i) = col_i(y_i). In other words, the optimal polynomial estimate derived in Theorem 3 and Corollary 1 coincides with the solution arising from the theory of RKHS.

VI. NUMERICAL EXAMPLE
In this example, we show that a good estimate of the input–output behavior of a nonlinear system can be obtained with a moderate computational burden when the training set is based on input functions that are similar to the ones to be estimated. This similarity condition is reasonable, since in the nonlinear case the only way to "learn" the behavior of the system is from similar cases, a restriction that can be lifted for linear operators. We also illustrate how to set the regularization parameter λ from prior information on the measurement noise.
We consider a Volterra-Lotka system with polynomial nonlinearities and measurement noise, with state z = (z_1, z_2) driven by a nonnegative input x ∈ H_x = L^2([0, T]; R); the parameters are a_11 = a_22 = 0.5, a_12 = a_21 = 0.25, and we assume that y = (y_1, y_2) ∈ H_y = L^2([0, T]; R^2) are available noisy measurements of the state (z_1, z_2). The measurement noise n = (n_1, n_2) ∈ H_y chosen in the example is a colored stochastic process generated as
$$ \dot n_h(t) = a_n\, n_h(t) + b_n\, \omega_h(t), \qquad n_h(0) = 0 \qquad (89) $$
for h = 1, 2, with a_n = −2, b_n = 0.1, and ω_h mutually independent white-noise processes. Notice that above, in accordance with standard notation, the subscripts h = 1, 2 denote state, output, and noise components, and not their realizations, which will be denoted with the subscript i in what follows.
Clearly, if F(x) is the true nonlinear operator of the Volterra-Lotka system, the available measurement is y = F(x) + n. The noise can be represented as n = F_N(ω_1, ω_2), where F_N is the H-S operator described by (89). From white-noise theory, it is known that E[‖n‖²_{H_y}] = ‖F_N‖²_{H.S.} (see, for example, [53, Lemma 4]). Therefore, the knowledge of the noise parameters allows us to apply the optimal rule of Section III-D to choose the optimal value of λ. The norm of F_N can be computed as
$$ \|F_N\|^2_{H.S.} = 2\, b_n^2 \int_0^T\!\!\int_0^t e^{2 a_n (t - s)}\, ds\, dt $$
and, with T = 250 and the parameters listed above, we obtain ‖F_N‖²_{H.S.} = 1.25. We used a training set of N = 100 harmonic functions of the form x_i(t) = 1 + sin(θ_i t + φ_i), where θ_i ∈ [0, 15π/T] and φ_i ∈ [0, 1] are uniformly distributed random variables, T = 250 is the time horizon of the input trajectory, and i = 1, ..., N. The estimate was computed by using the Volterra series estimate (59) with the rational kernel (61). A sample of ten noisy output trajectories y_i in the training set is plotted in the phase plane for t ∈ [0, T] in Fig. 1, together with the corresponding estimates obtained by the estimator (59)-(61) for λ = 2·10^{-4} and the weight w in (61).
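The workflow of the example can be reproduced along the following lines. The predator-prey equations in the sketch below are only a placeholder consistent with the stated parameters (the exact model of the paper is not reproduced in this excerpt), the noise filter follows (89), and the estimator is the rational-kernel Volterra estimate (59)-(61); everything here is our own illustrative code, not the authors'.

```python
import numpy as np

# Sketch (not from the paper): workflow of the numerical example with a
# placeholder Volterra-Lotka model (coarse Euler step, for illustration only).
rng = np.random.default_rng(1)
T, M, N, lam = 250.0, 2501, 100, 2e-4
t = np.linspace(0.0, T, M)
dt = t[1] - t[0]
a11 = a22 = 0.5
a12 = a21 = 0.25
a_n, b_n = -2.0, 0.1

def simulate(x):
    """Euler integration of an assumed predator-prey model driven by x, plus noise (89)."""
    z = np.ones((M, 2))
    n = np.zeros((M, 2))
    for k in range(M - 1):
        z1, z2 = z[k]
        dz1 = a11 * z1 - a12 * z1 * z2 + x[k]          # hypothetical input channel
        dz2 = -a22 * z2 + a21 * z1 * z2
        z[k + 1] = z[k] + dt * np.array([dz1, dz2])
        n[k + 1] = n[k] + dt * a_n * n[k] + b_n * np.sqrt(dt) * rng.standard_normal(2)
    return z + n                                        # noisy measurement y = F(x) + n

inner = lambda u, v: np.sum(u * v) * dt                 # inner product in H_x
X = np.array([1.0 + np.sin(rng.uniform(0, 15 * np.pi / T) * t + rng.uniform(0, 1))
              for _ in range(N)])
Y = np.array([simulate(x) for x in X])                  # shape (N, M, 2)

x_test = 1.0 + np.sin(8 * np.pi / T * t + 0.5)
A = np.array([[inner(a, b) for b in X] for a in X])
r = np.array([inner(x_test, x) for x in X])
w = 1.1 * max(A.max(), r.max())                         # weight for the rational kernel (61)
C = np.linalg.solve(lam * np.eye(N) + A / (w - A), Y.reshape(N, -1))
y_hat = ((r / (w - r)) @ C).reshape(M, 2)               # Volterra-series estimate (59)
```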
The same estimator is used on the input test function x(t) = 1 + sin(8π t/T + 0.5). Fig. 2 shows the true and estimated output for this test function. Notice that the true output is plotted without measurement noise, whereas the estimated output is not as smooth because of the noise in the training set.
The choice λ = 2·10^{-4} for these plots was made in accordance with the optimal rule of Section III-D, based on the knowledge of the measurement noise. Fig. 3 shows that, for this value of λ, the residual on the training set equals E[‖n‖²_{H_y}] = ‖F_N‖²_{H.S.} = 1.25 (the vertical axis in this plot is in logarithmic scale). We notice that this value of λ yields an error norm on the test function very close to the minimum, which is obtained approximately for λ ∈ [10^{-4}, 10^{-3}]. We remark that this choice of the regularization parameter is based on some a priori knowledge of the measurement noise.
Next, we compare causal and noncausal estimates for this system. The difference between the two cases is more evident with larger levels of measurement noise; thus, the plots in Fig. 4 have been obtained with b_n = 0.4, which yields ‖F_N‖²_{H.S.} = 19.98. In this case, we used the exponential kernel (63),
$$ \big(A_{w,\infty}\big)_{ij}(x) = e^{[x_i, x_j]_{H^t_x}/\kappa(t)} - 1, \qquad \kappa(t) = \max_{i,j}\, [x_i, x_j]_{H^t_x} \qquad (91) $$
with κ(t) acting as a normalization factor to avoid numerical instabilities. Fig. 4 shows that causal estimates have a larger error on the training set, which is not surprising because of the additional causality constraint, but smaller errors on the test set. The error reduction when using the causal estimate, although not so evident from the logarithmic plot, is significant (around 15%). We also notice that, with the exponential kernel, for both estimates the minimum error on the test set is obtained when the error on the training set equals the norm of the measurement noise, in accordance with the optimal rule of Section III-D.

VII. CONCLUSION
In this work, we have presented a general estimator for nonlinear unknown input–output operators. Several aspects warrant further investigation, including the selection of the regularization parameter in the presence of input noise, the determination of the Volterra series kernel based on prior information, and the extension of the estimator to a domain beyond the training set, such as unbounded time intervals.

APPENDIX A1 Proof of Theorem 1
It is clear that a linear combination of polynomials (14) is still a polynomial. It is also immediate to check that (19) is an inner product and (20) is a norm. The nontrivial point to prove is that the operators M_m, which by definition form a linear space of H-S linear operators H^m_x → H_y, have an H-S norm that does not depend on the choice of the orthonormal basis {φ_k} of H_x. In other words, we need to prove that ‖M_i‖²_{H.S.} does not depend on {φ_k}. This can be done similarly to the linear case (see, for example, [69, Lemma 3.4.2]), thanks to property (3) in Lemma 1. In fact, given two multi-indexes K_i and L_i and denoting Φ_{K_i} = φ_{k_1} ⊗ ··· ⊗ φ_{k_i}, Φ_{L_i} = φ_{l_1} ⊗ ··· ⊗ φ_{l_i}, (3) implies that
$$ [\Phi_{K_i}, \Phi_{L_i}]_{H^i_x} = \prod_{j=1}^{i} [\varphi_{k_j}, \varphi_{l_j}]_{H_x} $$
so that the elements Φ_{K_i} form an orthonormal basis of H^i_x, and the basis independence of ‖M_i‖_{H.S.} follows as in the linear case.

Fig. 1. Noisy output y_i on ten samples of the training set compared to ŷ_i generated by the estimate (59) of the Volterra series for λ = 2·10^{-4}.

Fig. 2. Noise-less output y of the test function compared to (noisy) ŷ generated by the estimate (59) of the Volterra series for λ = 2·10^{-4}. The value of λ has been chosen according to the optimal rule of Section III-D.

Fig. 3. Average residual on the training set for the Volterra-Lotka system, (1/N) Σ_i ‖y_i − ŷ_i‖²_{H_y}, at varying λ.


Fig. 4. Average residuals and errors at varying λ for causal and noncausal estimates (log scale on the y-axis). Causal estimates have larger residuals but smaller errors. Notice that in both cases the minimum is reached when the residual equals the norm of the noise.

Filippo Cacace (Senior Member, IEEE) received the Laurea degree in electronic engineering and the Ph.D. degree in computer science from Politecnico di Milano, Milan, Italy, in 1988 and 1992, respectively. In 2003, he joined University Campus Bio-Medico of Rome, Selcetta, Italy, where he is currently an Associate Professor. His current research interests include nonlinear systems and observers, stochastic and delay systems, system identification, and applications to systems biology.

Vittorio De Iuliis (Member, IEEE) received the M.Sc. degree in computer and systems engineering and the Ph.D. degree in information and communication technology from the University of L'Aquila, L'Aquila, Italy, in 2014 and 2018, respectively. He currently serves as an Assistant Professor with the University of L'Aquila. From February 2021 to July 2021, he was a Visiting Researcher with the ICTEAM Institute, Université Catholique de Louvain, Ottignies-Louvain-la-Neuve, Belgium. His research interests include delay systems, positive systems, filtering and system identification, and machine learning methods for system identification.