Discrete-Time Signatures and Randomness in Reservoir Computing

A new explanation of the geometric nature of the reservoir computing (RC) phenomenon is presented. RC is understood in the literature as the possibility of approximating input–output systems with randomly chosen recurrent neural systems and a trained linear readout layer. Light is shed on this phenomenon by constructing what is called strongly universal reservoir systems as random projections of a family of state-space systems that generate Volterra series expansions. This procedure yields a state-affine reservoir system with randomly generated coefficients in a dimension that is logarithmically reduced with respect to the original system. This reservoir system is able to approximate any element in the fading memory filters class just by training a different linear readout for each different filter. Explicit expressions for the probability distributions needed in the generation of the projected reservoir system are stated, and bounds for the committed approximation error are provided.

between the time evolution of one or several explanatory variables (the input) and the second collection of dependent or explained variables (the output).
A generic question in all those fields is to determine the IO system underlying an observed phenomenon. This is the so-called system identification problem. For this purpose, first principles coming from physics or chemistry can be invoked when either these are known or the setup is simple enough to apply them. In complex situations, in which access to all the variables that determine the behavior of the systems is difficult or impossible, or when a precise mathematical relationship between input and output is not known, it has proved more efficient to carry out the system identification using generic families of models with strong approximation abilities that are estimated using observed data. This approach, which we refer to as empirical system identification, has been developed using different techniques coming simultaneously from engineering, statistics, and computer science.
In this article, we focus on a particularly promising strategy for empirical system identification known as reservoir computing (RC). RC capitalizes on the revolutionary idea that there are learning systems that attain universal approximation properties without the need to estimate all their parameters using, for instance, supervised learning. More specifically, RC can be seen as a recurrent neural network (RNN) approach to modeling IO systems using state-space representations in which the following holds.
1) The state equation is randomly generated, sometimes with sparsity features.
2) Only the (usually very simple) functional form of the observation equation is tailored to the specific problem using observed data.
RC can be found in the literature under other denominations, such as liquid state machines [1]-[5], and is represented by various learning paradigms, with echo state networks (ESNs) [6]-[8] being a particularly important example.
RC has shown superior performance in many forecasting and classification engineering tasks (see [9]-[12], and references therein) and has shown unprecedented abilities in the learning of the attractors of complex nonlinear infinite-dimensional dynamical systems [8], [13]-[15]. In addition, RC implementations with dedicated hardware have been designed and built (see [16]-[24]) that exhibit information processing speeds that largely outperform standard Turing-type computers.
The most far-reaching and radical innovation in the RC approach is the use of untrained, randomly generated, and, sometimes, sparse state maps. This circumvents well-known difficulties in the training of generic RNNs arising from bifurcation phenomena [25], which, despite recent progress in the regularization and training of deep RNN structures (see [26]-[28], and references therein), render classical gradient descent methods nonconvergent. Randomization has already been successfully applied in a static setup using neural networks with randomized weights, in particular, in seminal works on random feature models [29] and extreme learning machines [30]. This built-in randomness makes reservoir models different from other conventional approaches where state-space systems appear. For instance, Kalman filtering [31] has been used for decades in signal processing, and in that case, both linear and nonlinear [32], [33] Kalman techniques hinge on the idea of designing the state map to result in a posteriori residual errors of minimal variance. This requires a significant computational effort in relation to recursive parameter estimation, which is not needed for RC systems. In the context of dynamical systems, an important result in [34] shows that randomly drawn ESNs can be trained by exclusively optimizing a linear readout using generic 1-D observations of a given invertible and differentiable dynamical system to produce dynamics that are topologically conjugate to that given system; in other words, randomly generated ESNs are capable of learning the attractors of invertible dynamical systems. More generally, the approximation capabilities of randomly generated ESNs have been established in [35] in the broader setup of IO systems. There, approximation bounds have been provided in terms of their architecture parameters.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article, we provide additional insight on the randomization question for another family of RC systems, namely, for the nonhomogeneous state-affine systems (SAS). These systems have been introduced and proved to be universal approximants in [36] and [37]. We here show that they also have this universality property when they are randomly generated. The approach pursued in this work is considerably different from the one in the above-cited references and is based on the following steps. First, we consider causal and time-invariant (TI) analytic filters with semi-infinite inputs. The Taylor series expansion of these objects coincides with what is known as their Volterra series representation. Second, we show that the truncated Volterra series representation (whose associated truncation error can be quantified) admits a state-space representation with linear readouts in a (potentially) high-dimensional adequately constructed tensor space. We refer to this system as the signature SAS (SigSAS): on the one hand, it belongs to the SAS family, and on the other hand, it shares fundamental properties with the so-called signature process from the (continuous-time) theory of rough paths, which inspired the title of the article.
Rough path theory, introduced by Lyons in the seminal work [38], was initially developed to deal with controlled differential equations driven by rough signals in a pathwise way. These equations can be seen as a continuous-time analog of time-series models, where the rough signals play the role of the model innovations. The key object in this theory is the signature, which was first studied by Chen [39], [40] and consists in enhancing the rough input with additional curves (satisfying certain algebraic properties) mimicking what, in the smooth case, corresponds to iterated integrals of the curve with itself.
It is a deep mathematical fact that unique solutions of the rough differential equation exist and are a continuous map of the signature (in appropriately chosen topologies). Surprisingly, this nonlinear continuous map can be arbitrarily well approximated by linear maps of the signature. More generally, on compact sets of so-called "nontree-like" paths (see [41] for a precise definition), every continuous path functional (with respect to a certain p-variation norm) can be uniformly approximated by a linear function of the signature. Indeed, linear functionals of the signature form a point separating algebra on sets of "nontree-like" paths, which, by the Stone-Weierstrass theorem, then yields a universal approximation theorem for general path functionals (see [42]). The rough path theory has been substantially extended by Hairer [43] toward the theory of regularity structures and is, nowadays, the tool to analyze deep analytic properties of continuous-time IO systems.
From a machine learning perspective, the signature can be thought of as a feature map capturing all specific characteristics of a given path. More precisely, it serves as a linear regression basis and can, thus, be interpreted as an abstract reservoir (for the moment without random specifications) for solutions of rough differential equations. These appealing properties made signature methods highly popular for machine learning applications both for streamed data (in particular, in finance) and complex classification tasks. For inspiring examples of the rapidly growing literature on machine learning using signature methods, we refer to [44]-[51], and references therein.
Returning to the SAS family, we will show that the solutions of the SigSAS introduced in this article share exactly the two crucial properties that make the signature central in rough path theory: first, the SigSAS solutions fully characterize the input sequences; second, any (sufficiently regular) IO system can be written as a linear map of the SigSAS system. These properties have been exploited in the continuous-time setup in [52].
Finally, we use the Johnson-Lindenstrauss (JL) Lemma [53] to prove that a random projection of the SigSAS system yields a smaller dimensional SAS system with random matrix coefficients (that can be chosen to be sparse) that approximates the original system. Moreover, this constructive procedure gives us full knowledge of the law that needs to be used to draw the entries of the low-dimensional SAS approximating system, without ever having to use the original large-dimensional SigSAS, which amounts to a form of information compression with efficient reconstruction in this setup [54]. An important feature of the dimension-reduced randomly drawn SAS system is that it serves as a universal approximator for any reasonably behaved IO system and that only the linear output layer that is applied to it depends on the individual system that needs to be learned. We refer to this feature as the strong universality property.
This approach to the approximation problem in RNNs using randomized systems provides a new explanation of the geometric nature of the RC phenomenon. The results in the following show that randomly generated SAS reservoir systems approximate well any sufficiently regular IO system just by tuning a linear readout because they coincide with an error-controlled random projection of a higher dimensional Volterra series expansion of that system.

II. TRUNCATED VOLTERRA REPRESENTATIONS OF ANALYTIC FILTERS
We start by describing the setup in which we shall be working, together with the main approximation tool that we will be using later on in the article, namely, Volterra series expansions. Details on the concepts introduced in the following can be found in, for instance, [55]-[57], and references therein.
Throughout this article, the symbol $\mathbb{Z}$ denotes the set of all integers, and $\mathbb{Z}_-$ stands for the set of negative integers with the zero element included. Let $D_d \subset \mathbb{R}^d$ and $D_m \subset \mathbb{R}^m$. We refer to the maps of the type $U : (D_d)^{\mathbb{Z}} \to (D_m)^{\mathbb{Z}}$ between infinite sequences with values in $D_d$ and $D_m$, respectively, as filters, operators, or discrete-time IO systems, and to those like $H$ : [...] as $\mathbb{R}^m$-valued functionals. These definitions will sometimes be extended to accommodate situations where the domains and the targets of the filters are not necessarily product spaces but just arbitrary subsets of $(\mathbb{R}^d)^{\mathbb{Z}}$ and $(\mathbb{R}^m)^{\mathbb{Z}}$, such as, for instance, $\ell^\infty(\mathbb{R}^d)$ and $\ell^\infty(\mathbb{R}^m)$.
A filter $U : (D_d)^{\mathbb{Z}} \to (D_m)^{\mathbb{Z}}$ is called causal when, for any two elements $z, w \in (D_d)^{\mathbb{Z}}$ that satisfy $z_\tau = w_\tau$ for any $\tau \le t$, for a given $t \in \mathbb{Z}$, we have that $U(z)_t = U(w)_t$. The filter $U$ is called TI when it commutes with the time delay operator, that is, $T_\tau \circ U = U \circ T_\tau$ for any $\tau \in \mathbb{Z}$ (in this expression, the two operators $T_\tau$ have to be understood as defined in the appropriate sequence spaces). There is a bijection between causal and TI filters and functionals. We denote by $U_H$ : [...] Causal and TI filters are fully determined by their restriction to semi-infinite sequences, that is, $U : (D_d)^{\mathbb{Z}_-} \to (D_m)^{\mathbb{Z}_-}$, which will be denoted using the same symbol.
In most cases, we work in the situation in which $D_d$ and $D_m$ are compact and the sequence spaces $(D_d)^{\mathbb{Z}_-}$ and $(D_m)^{\mathbb{Z}_-}$ are endowed with the product topology. It can be shown (see [55]) that this topology is equivalent to the norm topology induced by any weighted norm defined by $\|z\|_w := \sup_{t \in \mathbb{Z}_-} \{\|z_t\| w_{-t}\}$, where $w : \mathbb{N} \to (0, 1]$ is an arbitrary strictly decreasing sequence (we call it weighting sequence) with zero limit and such that $w_0 = 1$. Filters and functionals that are continuous with respect to this topology are said to have the fading memory property (FMP).
A particularly important class of IO systems is those generated by state-space systems in which the output $y \in (D_m)^{\mathbb{Z}_-}$ is obtained out of the input $z \in (D_d)^{\mathbb{Z}_-}$ as the solution of the equations
$$x_t = F(x_{t-1}, z_t), \quad (1)$$
$$y_t = h(x_t), \quad (2)$$
where $F$ is the state map and $h$ is the observation map. When, for any input $z$, there is a unique output $y$ that satisfies (1) and (2), we say that this state-space system has the echo state property (ESP); in that case, it determines a unique filter $U^F_h$ : [...] When the ESP holds at the level of the state (1), then it determines another filter $U^F$ : [...] The filters $U^F_h$ and $U^F$, when they exist, are automatically causal and TI (see [55]). The continuity and the differentiability properties of the state and observation maps $F$ and $h$ imply continuity and differentiability for $U^F_h$ and $U^F$ under very general hypotheses; see [56] for an in-depth study of this question.
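The link between contractivity of the state map and the ESP can be illustrated numerically. The following is a minimal sketch, not taken from the article: the scalar state map `F` below is a hypothetical example chosen so that it is a 0.5-contraction in the state variable, and the input sequence is arbitrary. Two trajectories started from different initial states are driven by the same input, and their gap shrinks geometrically, which is the mechanism behind a unique input-to-state filter.

```python
import math

def run_state_map(F, x0, inputs):
    """Iterate x_t = F(x_{t-1}, z_t) over a finite input sequence."""
    x = x0
    states = []
    for z in inputs:
        x = F(x, z)
        states.append(x)
    return states

# Hypothetical scalar state map (an assumption for illustration only).
# tanh is 1-Lipschitz and the state enters with weight 0.5, so F is a
# 0.5-contraction in its first argument.
def F(x, z):
    return math.tanh(0.5 * x + z)

inputs = [math.sin(0.3 * t) for t in range(60)]
states_a = run_state_map(F, 0.9, inputs)   # first initial condition
states_b = run_state_map(F, -0.9, inputs)  # second initial condition

# Distance between the two trajectories at each time step.
gaps = [abs(a - b) for a, b in zip(states_a, states_b)]
```

Each entry of `gaps` is at most half of the previous one, so the influence of the initial condition is forgotten and only the input history matters.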
We denote by $\|\cdot\|$ the Euclidean norm if not stated otherwise and use the symbol $|||\cdot|||$ for the operator norm with respect to the two-norms in the target and the domain spaces. In addition, for any $z \in (\mathbb{R}^d)^{\mathbb{Z}_-}$, we define $p$-norms as [...] and use the same symbol $B_M$ whenever $d = 1$. In addition, we will write $L(V, W)$ to refer to the space of linear maps between the real vector spaces $V$ and $W$. The following statement is the main approximation result that will be used in the article.
Theorem 1: Let $U$ : [...] be a causal and TI fading memory filter whose restriction $U|_{B_M}$ is analytic as a map between open sets in the Banach spaces $\ell^\infty_-(\mathbb{R}^d)$ and $\ell^\infty_-(\mathbb{R}^m)$ and satisfies $U(0) = 0$. Then, for any $z \in B_M$, there exists a Volterra series representation of $U$ given by [...] Moreover, there exists a monotonically decreasing sequence $w^U$ with zero limit such that, for any $p, l \in \mathbb{N}$, [...]

A. Signature State-Affine System
We now show that the filter obtained out of the truncated Volterra series expansion in the expression (5) can be written down as the unique solution of a nonhomogeneous SAS with linear readouts that, as we shall show in Section II-B, have particularly strong universal approximation properties. We first briefly recall how the SAS family is constructed.
Let $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$ and $z = (z_1, \ldots, z_d) \in \mathbb{R}^d$, and define the monomials $z^\alpha := z_1^{\alpha_1} \cdots z_d^{\alpha_d}$. We denote by $\mathbb{M}_{N_1,N_2}$ the space of real $N_1 \times N_2$ matrices with $N_1, N_2 \in \mathbb{N}$ and use $\mathbb{M}_{N_1,N_2}[z]$ to refer to the space of polynomials in $z \in \mathbb{R}^d$ with matrix coefficients in $\mathbb{M}_{N_1,N_2}$, that is, the set of elements $p$ of the form
$$p(z) = \sum_{\alpha \in V_p} z^\alpha A_\alpha,$$
with $V_p \subset \mathbb{N}^d$ being a finite subset and $A_\alpha \in \mathbb{M}_{N_1,N_2}$ the matrix coefficients. An SAS is given by
$$x_t = p(z_t) x_{t-1} + q(z_t), \qquad y_t = W x_t, \quad (6)$$
where $p \in \mathbb{M}_{N,N}[z]$ and $q \in \mathbb{M}_{N,1}[z]$ are polynomials with matrix and vector coefficients, respectively, and $W \in \mathbb{M}_{m,N}$. If we consider inputs in the set $K_M$ and $p$ and $q$ in the state-space system (6) such that [...] where $\overline{B_{\|\cdot\|}(0, M)}$ denotes the closed ball in $\mathbb{R}^d$ of radius $M$ and center 0 with respect to the Euclidean norm, then a unique state-filter $U^{p,q} : K_M \to K_L$ can be associated with it, with [...]

It has been shown in [36] and [37] that SAS systems are universal approximants in the fading memory and in the $L^p$-integrable categories in the sense that, given a filter in any of those two categories, there exists an SAS system of type (6) that uniformly, or in the $L^p$-sense, approximates it. The SigSAS that we construct in this section exhibits what we call the strong universality property. This means that the state equation for this state-space representation is the same for any fading memory filter that is being approximated, and it is only the linear readout that changes. In other words, we provide a result that yields the approximation (as accurate as desired) of any fading memory IO system as the linear readout of the solution of a fixed nonhomogeneous SAS system that does not depend on the filter being approximated.
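To make the SAS recursion concrete, here is a minimal sketch for scalar inputs ($d = 1$), in which the polynomials are stored as lists of coefficients so that `A_coeffs[j]` and `b_coeffs[j]` multiply $z^j$. The function names and the particular matrices are hypothetical choices made for illustration; the coefficients are picked so that the operator norm of $p(z)$ stays below 1 for $|z| \le 1$.

```python
def mat_vec(A, x):
    """Matrix-vector product for matrices stored as lists of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def sas_step(A_coeffs, b_coeffs, x_prev, z):
    """One step x_t = p(z) x_{t-1} + q(z), with p(z) = sum_j z**j A_j
    and q(z) = sum_j z**j b_j (scalar input z)."""
    n = len(x_prev)
    x = [0.0] * n
    for j, A in enumerate(A_coeffs):
        Ax = mat_vec(A, x_prev)
        for i in range(n):
            x[i] += (z ** j) * Ax[i]
    for j, b in enumerate(b_coeffs):
        for i in range(n):
            x[i] += (z ** j) * b[i]
    return x

def sas_filter(A_coeffs, b_coeffs, W, inputs, x0):
    """Run the state equation and apply the linear readout y_t = W x_t."""
    x = x0
    outputs = []
    for z in inputs:
        x = sas_step(A_coeffs, b_coeffs, x, z)
        outputs.append(mat_vec(W, x))
    return outputs

# Illustrative coefficients (assumptions, not from the article).
A_coeffs = [[[0.2, 0.0], [0.0, 0.1]],   # A_0, multiplies z**0
            [[0.0, 0.1], [0.1, 0.0]]]   # A_1, multiplies z**1
b_coeffs = [[0.0, 0.0],                 # b_0
            [1.0, 0.0]]                 # b_1
W = [[1.0, 1.0]]                        # readout with m = 1

outputs = sas_filter(A_coeffs, b_coeffs, W, [0.5, -0.2, 0.9], [0.0, 0.0])
```

Only `W` would change from one target filter to another if the state equation were kept fixed, which is the structural point behind the strong universality property.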
Since the important property that we just described is reminiscent of an analogous feature of the signature process in the context of the representation of the solutions of controlled stochastic differential equations [52], we shall refer to this state system as the SigSAS system.
Before we proceed, we need to introduce some notation. First, for any $l, d \in \mathbb{N}$, we denote by $T^l(\mathbb{R}^d)$ the space of tensors of order $l$ on $\mathbb{R}^d$, that is, [...] The tensor space $T^l(\mathbb{R}^d)$ will be understood as a normed space with a crossnorm [58] that we shall leave unspecified for the time being. We shall be using an order lowering map $\pi_l$ : [...] The order lowering map is linear, and its operator norm satisfies $|||\pi_l||| = 1$. We shall restrict the presentation to 1-D inputs, that is, we consider input sequences $z \in K_M \subset \ell^\infty_-(\mathbb{R})$. Now, for fixed $l, p \in \mathbb{N}$, we define, for any $z \in K_M$ and $t \in \mathbb{Z}_-$,
$$\widetilde{z}_t := \sum_{i=1}^{p+1} z_t^{i-1} e_i \in \mathbb{R}^{p+1} \quad \text{and} \quad \widehat{z}_t := \widetilde{z}_{t-l} \otimes \cdots \otimes \widetilde{z}_t. \quad (7)$$
Note that $\widetilde{z}_t$ is the Vandermonde vector [59] associated with $z_t$ and that $\widehat{z}_t$ is a tensor in $T^{l+1}(\mathbb{R}^{p+1})$ whose components in the canonical basis are all the monomials on the variables $z_t, \ldots, z_{t-l}$ that contain powers up to order $p$ in each of those variables, namely,
$$\widehat{z}_t = \sum_{i_1, \ldots, i_{l+1} = 1}^{p+1} z_{t-l}^{i_1 - 1} \cdots z_t^{i_{l+1} - 1} \, e_{i_1} \otimes \cdots \otimes e_{i_{l+1}}.$$
Finally, given $I_0 \subset \{1, \ldots, p+1\}$ an arbitrarily chosen but fixed subset of cardinality higher than 1 that contains the element 1, we define [...]

The next proposition introduces the SigSAS state system for fixed $l, p \in \mathbb{N}$, whose solution is used later on in Theorem 4 to represent the truncated Volterra series expansions in Theorem 1 of polynomial degree $p$ and lag $-l$ [see expression (5)].

Proposition 2: [...] This state equation is induced by the state map $F^{\mathrm{SigSAS}}_{\lambda,l,p}$ [...], which is a contraction in the state variable with contraction constant [...] and, hence, restricts to a map [...] This state system has the echo state and the fading memory properties, and its continuous, TI, and causal associated filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}$ [...]

Remark 3: The state equation (9) is, indeed, an SAS with states defined in $T^{l+1}(\mathbb{R}^{p+1})$ as it has the same form as the first equality in (6). Indeed, this equation can be written as $x_t = p(z_t) x_{t-1} + q(z_t)$ with $p(z_t)$ and $q(z_t)$ being the polynomials in $z_t$ with coefficients in $L(T^{l+1}(\mathbb{R}^{p+1}), T^{l+1}(\mathbb{R}^{p+1}))$ and $T^{l+1}(\mathbb{R}^{p+1})$, respectively, given by [...]
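The Vandermonde vector and the tensor built from it in (7) are easy to compute explicitly. The sketch below flattens the tensor in lexicographic order of the canonical basis, so that the result has $(p+1)^{l+1}$ entries, one per monomial in $z_{t-l}, \ldots, z_t$ with powers up to $p$ in each variable. The function names are hypothetical and chosen only for this illustration.

```python
def vandermonde_vector(z, p):
    """Return (1, z, z**2, ..., z**p), the Vandermonde vector of a scalar z."""
    return [z ** i for i in range(p + 1)]

def tensor_product(u, v):
    """Flattened tensor (Kronecker) product of two vectors, lexicographic order."""
    return [a * b for a in u for b in v]

def signature_tensor(window, p):
    """Given window = [z_{t-l}, ..., z_t], return the flattened tensor product
    of the Vandermonde vectors of its entries, of length (p+1)**(l+1)."""
    out = [1.0]
    for z in window:
        out = tensor_product(out, vandermonde_vector(z, p))
    return out

# Example with l = 1 and p = 2: the nine monomials z_{t-1}**i * z_t**j.
example = signature_tensor([2.0, 3.0], 2)
```

With $p = 8$ and $l = 3$, the values used later in the numerical illustration, this vector already has $9^4 = 6561$ components, which is the dimensionality issue that the random projection is meant to address.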

B. SigSAS Approximation Theorem
As we already pointed out, $\widehat{z}_t$ is a vector in $T^{l+1}(\mathbb{R}^{p+1})$ whose components in the canonical basis are all the monomials on the variables $z_t, \ldots, z_{t-l}$ that contain powers up to order $p$ in each of those variables. Moreover, it is easy to see that all the other summands in the expression (13) of the filter $U^{\mathrm{SigSAS}}_{\lambda,l,p}$ are proportional (with a positive constant) to monomials already contained in $\widehat{z}_t$. This implies the existence of a linear map $A_{\lambda,l,p} \in L(T^{l+1}(\mathbb{R}^{p+1}), T^{l+1}(\mathbb{R}^{p+1}))$ with an invertible matrix representation with nonnegative entries such that [...] In the sequel, we will denote the matrix representation of $A_{\lambda,l,p}$ using the same symbol $A_{\lambda,l,p} \in \mathbb{M}_{N,N}$, $N := (p+1)^{l+1}$. This observation, together with Theorem 1, can be used to prove the following result.

Theorem 4: Let $M, L > 0$ and $U$ : [...] be a causal and TI fading memory filter whose restriction $U|_{B_M}$ is analytic as a map between open sets in the Banach spaces $\ell^\infty_-(\mathbb{R})$ and $\ell^\infty_-(\mathbb{R}^m)$ and satisfies $U(0) = 0$. Then, there exists a monotonically decreasing sequence $w^U$ with zero limit such that, for any $p, l \in \mathbb{N}$ and any $0 < \lambda <$ min{1, 1/ [...]

Remark 5: Theorem 4 establishes the strong universality of the SigSAS system in the sense that the state equation of this system is the same for any fading memory filter $U$ that is being approximated, and it is only the linear readout that changes. Nevertheless, we emphasize that the quality of the approximation is not filter independent, as the decreasing sequence $w^U$ in the bound (15) depends on how fast the filter $U$ "forgets" past inputs.
Remark 6: The analyticity hypothesis in the statement of Theorem 4 can be dropped by using the fact that finite order and finite memory Volterra series are universal approximators in the fading memory category (see [60] and [56,Th. 31]). In that situation, the bound for the truncation error in (15) does not necessarily apply anymore, in particular, its second summand, which is intrinsically linked to analyticity. A generalized bound can be formulated in that case using arguments along the lines of those found in [35].
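A finite order and finite memory Volterra series of the kind invoked in Remark 6 can be evaluated directly once its kernels are given. The following sketch handles scalar inputs and stores the kernels in a dictionary that maps a tuple of nonpositive lags $(m_1, \ldots, m_j)$ to a coefficient; this data layout, the function name, and the numerical kernel values are assumptions made for illustration and are not the article's notation.

```python
def volterra_output(kernels, z, t):
    """Evaluate sum over kernels of g * prod_i z[t + m_i] at time index t.

    kernels: dict mapping a tuple of lags (each <= 0) to a coefficient g.
    z: list of scalar inputs; t: index into z of the current time.
    """
    total = 0.0
    for lags, g in kernels.items():
        term = g
        for m in lags:
            if m > 0 or t + m < 0:
                raise ValueError("lags must be nonpositive and stay inside z")
            term *= z[t + m]
        total += term
    return total

# Hypothetical kernels: degree at most 2, memory of 2 past steps.
kernels = {
    (0,): 0.5,         # 0.5 * z_t
    (-1,): 0.25,       # 0.25 * z_{t-1}
    (0, -1): 0.1,      # 0.1 * z_t * z_{t-1}
    (-2, -2): -0.05,   # -0.05 * z_{t-2}**2
}

z = [1.0, 2.0, 3.0]
y = volterra_output(kernels, z, 2)
```

Every term above is a coefficient times a monomial in $z_t, z_{t-1}, z_{t-2}$, so the output is a linear function of the vector of such monomials, which is why a linear readout suffices once those monomials are available as states.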

III. JL REDUCTION OF THE SIGSAS REPRESENTATION
The price to pay for the strong universality property exhibited by the SigSAS that we constructed in Section II-B is the potentially large dimension of the tensor space in which this state-space representation is defined. In this section, we concentrate on this problem by proposing a dimension reduction strategy, which consists of using the random projections in the JL Lemma [53] in order to construct a smaller dimensional SAS system with random matrix coefficients (that can be chosen to be sparse). The results contained in Sections III-B and III-C quantify the increase in approximation error committed when applying this dimensionality reduction strategy.
We start by introducing the JL Lemma [53] and some properties that are needed later on in the presentation. Following this, we spell out how to use it in the dimension reduction of state-space systems, in general, and the SigSAS representation, in particular.

A. JL Lemma and Approximate Projections
Given an $N$-dimensional Hilbert space $(V, \langle \cdot, \cdot \rangle)$ and $Q$ an $n$-point subset of $V$, the JL Lemma [53], [61] guarantees, for any $0 < \epsilon < 1$, the existence of a linear map $f : V \to \mathbb{R}^k$, with $k \in \mathbb{N}$ satisfying
$$k \ge \frac{24}{3\epsilon^2 - 2\epsilon^3} \log n, \quad (16)$$
which respects $\epsilon$-approximately the distances between the points in the set $Q$. More specifically,
$$(1 - \epsilon)\|v_1 - v_2\|^2 \le \|f(v_1) - f(v_2)\|^2 \le (1 + \epsilon)\|v_1 - v_2\|^2 \quad (17)$$
for any $v_1, v_2 \in Q$. The norm $\|\cdot\|$ in $\mathbb{R}^k$ comes from an inner product that makes it into a Hilbert space, or in other words, it satisfies the parallelogram identity. This result is all the more remarkable in connection with further developments that guarantee that the linear map $f$ can be randomly chosen [61]-[63] and, moreover, within a family of sparse transformations [64], [65] (see also [66]).
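A random linear map of this kind can be sketched with a Gaussian matrix. The code below is an illustration under stated assumptions: the entries are drawn IID from $N(0, 1/k)$, which is one standard randomized construction, and the dimension bound is taken in the form $24 \log n / (3\epsilon^2 - 2\epsilon^3)$; both should be read as assumptions of this sketch rather than as a restatement of the article's formulas. The helper names are hypothetical.

```python
import math
import random

def jl_min_dim(n_points, eps):
    """Assumed JL dimension bound: ceil(24 ln(n) / (3 eps^2 - 2 eps^3))."""
    return math.ceil(24.0 * math.log(n_points) / (3.0 * eps ** 2 - 2.0 * eps ** 3))

def gaussian_jl_matrix(k, n_dim, rng):
    """k x n_dim matrix with IID N(0, 1/k) entries (assumed law)."""
    s = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, s) for _ in range(n_dim)] for _ in range(k)]

def project(A, v):
    """Apply the linear map v -> A v."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

rng = random.Random(0)
N, k = 400, 200
points = [[rng.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(5)]
A = gaussian_jl_matrix(k, N, rng)
images = [project(A, v) for v in points]

# Ratio of squared distances after and before projection, for every pair.
ratios = [sq_dist(images[i], images[j]) / sq_dist(points[i], points[j])
          for i in range(len(points)) for j in range(i + 1, len(points))]
```

The entries of `ratios` concentrate around 1, and their spread shrinks as `k` grows, which is the distance-preservation property that the rest of the section relies on.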
In the developments in this article, we use the original version of this result, in which the JL map $f$ is realized by a matrix $A \in \mathbb{M}_{k,N}$ whose entries are such that [...] It can be shown that, with this choice, the probability of the relation (17) to hold for any pair of points in $Q$ is bounded below by $1/n$.

Lemma 7: Let $(V, \|\cdot\|)$ be a normed space, and let $Q$ be a (finite or countably infinite) subset of $V$. Define $\|\cdot\|_Q$ : [...]
1) $\|\cdot\|_Q$ defines a seminorm in span{$Q$}. If $M_Q$ [...] is finite, then $\|\cdot\|_Q$ is a norm.

Remark 8: If the hypothesis $M_Q < \infty$ is dropped in part 1) of Lemma 7, then $\|\cdot\|_Q$ is, in general, not a norm, as the following example shows. Take $V = \mathbb{R}$ and $v_i = i$, $i \in \mathbb{N}$. It is easy to see that, in this setup, [...]

Proposition 9: [...] and let $f^* : \mathbb{R}^k \to V$ be the adjoint map with respect to a fixed inner product $\langle \cdot, \cdot \rangle$ in $\mathbb{R}^k$. Then [...] for any $w_1, w_2 \in$ span{$Q$}.
Corollary 10: In the hypotheses of the previous proposition, let [...] Then, for any $v \in$ span{$Q$} such that $(f^* \circ f)(v) \in$ span{$Q$}, we have [...] This corollary is just a consequence of the inequality (20) that guarantees that [...] which yields (22).

B. JL Projection of State-Space Dynamics
The next result shows how, when the dimension k of the target of the JL map f determined by (16) is chosen so that this map is generically surjective, then any contractive state-space system with states in the domain of f can be projected onto another one with states in its smaller dimensional image. This result also shows that, if the original system has the ESP and the FMP, then so does the projected one. In addition, it gives bounds that quantify the dynamical differences between the two systems.
Theorem 11: Let $F_\rho : \mathbb{R}^N \times D_d \to \mathbb{R}^N$ be a one-parameter family of continuous state maps, where $D_d \subset \mathbb{R}^d$ is a compact subset, $0 < \rho < 1$, and $F_\rho$ is a $\rho$-contraction on the first component. Let $Q$ be an $n$-point spanning subset of $\mathbb{R}^N$ satisfying $-Q = Q$. Let $f : \mathbb{R}^N \to \mathbb{R}^k$ be a JL map that satisfies (17) with $0 < \epsilon < 1$, where the dimension $k$ has been chosen so that $f$ is generically surjective. Then, the following holds.
1) [...] for any $x \in \mathbb{R}^k$ and $z \in D_d$. If the parameter $\rho$ is chosen so that [...] then $F^f_\rho$ is a contraction on the first entry. The symbol $|||\cdot|||$ in (24) denotes the operator norm with respect to the two-norms in $\mathbb{R}^N$ and $\mathbb{R}^k$.
[...] for any $x \in V_k$ and $z \in D_d$. If the contraction parameter satisfies (24), then $F^f_\rho$ is also a contraction on the first entry. Moreover, the restricted linear map $f^*$ : [...] Then, both $F_\rho$ and $F^f_\rho$ have the ESP and have unique FMP associated filters $U_\rho$ : [...] is isomorphic to the restricted version of $F^f_\rho$ and also has the ESP and an FMP associated filter [...] where $M_Q$ and $C_Q$ are given by (19) and (21), respectively. Alternatively, it can also be shown that [...] Then, the elements in the set $Q$ can be chosen so that the bounds in (26) and (27) reduce to [...] and [...] respectively.

C. JL-Reduced SigSAS System
We now use the previous theorem to spell out the JL projected version of SigSAS approximations and establish error bounds analogous to those introduced in (28) and (29). Given that Theorem 11 is formulated using the one- and the two-norms in Euclidean spaces and Proposition 2 defines the SigSAS system on a tensor space endowed with an unspecified cross-norm, we notice that those two frameworks can be matched by using the norms $\|\cdot\|$ and $\|\cdot\|_1$ in $T^{l+1}(\mathbb{R}^{p+1})$ given by [...] $e_{i_1} \otimes \cdots \otimes e_{i_{l+1}}$ and $\{e_{i_1} \otimes \cdots \otimes e_{i_{l+1}}\}_{i_1, \ldots, i_{l+1} \in \{1, \ldots, p+1\}}$ being the canonical basis in $T^{l+1}(\mathbb{R}^{p+1})$. It is easy to check that these two norms are crossnorms and that $\|\cdot\|$ is the norm associated with the inner product defined by the extension by bilinearity of the assignment $\langle e_{i_1} \otimes \cdots \otimes e_{i_{l+1}}, e_{j_1} \otimes \cdots \otimes e_{j_{l+1}} \rangle := \delta_{i_1 j_1} \cdots \delta_{i_{l+1} j_{l+1}}$, which makes $(T^{l+1}(\mathbb{R}^{p+1}), \langle \cdot, \cdot \rangle)$ into a Hilbert space, a feature that is needed to use the JL Lemma.
Corollary 12: Let $M > 0$, and let $(F^{\mathrm{SigSAS}}_{\lambda,l,p}, W)$ be the SigSAS system that approximates a causal and TI filter $U : K_M \to \ell^\infty_-(\mathbb{R}^m)$, as introduced in Theorem 4. Let $N := (p+1)^{l+1}$, $\widetilde{M}$ as in (11), and let $0 < \epsilon < 1$. Let $f : \mathbb{R}^N \to \mathbb{R}^k$ be a JL map that satisfies (17), where the dimension $k$ has been chosen to make $f$ generically surjective. Then, for any $R > \max\{1/|||f|||^2, 1/(\widetilde{M}|||f|||^2), 1\}$, $\lambda := 1/(R \widetilde{M} |||f|||^2)$, and $L$ as in (12), there exists a [...], which has the ESP and a unique FMP associated filter $U^{\mathrm{SigSAS}}_{\lambda,l,p,f}$ : [...] for any $z \in K_M$ and $t \in \mathbb{Z}_-$, where $W$ : [...]

This result shows that causal and TI filters can be approximated by JL-reduced SigSAS systems. The goal in the following paragraphs consists of showing that such systems are just SAS systems with randomly drawn matrix coefficients and, in addition, in precisely spelling out the law of their entries. These facts show precisely that a large class of filters can be learned just by randomly generating an SAS and by tuning a linear readout layer for each individual filter that needs to be approximated. We emphasize that the JL-reduced randomly generated SigSAS system is the same for the entire class of FMP filters that are being approximated and that only the linear readout depends on the individual filter that needs to be learned, which amounts to the strong universality property that we discussed in Sections I and II-A. As in Remark 5, we recall that the quality of the approximation using a JL-reduced random SigSAS system may change from filter to filter because of the dependence on the sequence $w^U$ in the bound (15) and the presence of the linear readout $W$ in (30) and (31).
The next statement needs the following fact that is known in the literature as Gordon's Theorem (see [67, Th. 5.32] and references therein): given a random matrix $A \in \mathbb{M}_{n,m}$ with standard Gaussian independent and identically distributed (IID) entries, we have that $\mathrm{E}[|||A|||] \le \sqrt{n} + \sqrt{m}$. In addition, the element $\widehat{z}^0 \in T^{l+1}(\mathbb{R}^{p+1})$ introduced in (8) for the construction of the SigSAS system will be chosen in a specific randomized way in this case. Indeed, this time around, we replace (8) by [...] where $r$ is a Rademacher random variable that is chosen independent of all the other random variables that will appear in the different constructions. If we take in $T^{l+1}(\mathbb{R}^{p+1})$ the canonical basis in lexicographic order, the element $\widehat{z}^0$ can be written as the image of a linear map as [...] with [...] and $S^c \in \mathbb{M}_{p+1}$ a diagonal selection matrix with the elements given by $S^c_{ii} = 1$ if $i \in I_0$, and $S^c_{ii} = 0$ otherwise.

Theorem 13: [...] (11), $l, p, k \in \mathbb{N}$, and define $N := (p+1)^{l+1}$, $N_0 := (p+1)^l$. Consider a SigSAS state map $F^{\mathrm{SigSAS}}_{\lambda,l,p}$ of the type introduced in (10) and defined by choosing the nonhomogeneous term $\widehat{z}^0$ as in (33). Let, now, $f : \mathbb{R}^N \to \mathbb{R}^k$ be a JL projection randomly drawn according to (18). Let $\delta > 0$ be small enough so that [...] Then, the JL-reduced version $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ of $F^{\mathrm{SigSAS}}_{\lambda_0,l,p}$ has the ESP and the FMP with probability at least $1 - \delta$, and in the limit $N_0 \to \infty$, it is isomorphic to the family of randomly generated SAS systems $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ with states in $\mathbb{R}^k$ and given by [...] where $A_1, \ldots, A_{p+1} \in \mathbb{M}_k$ and $B \in \mathbb{M}_{k,p+1}$ are random matrices whose entries are drawn according to [...] All the entries in the matrices $A_1, \ldots, A_{p+1}$ are independent random variables. The entries in the matrix $B$ are independent of each other, and they are decorrelated and asymptotically independent (in the limit as $N_0 \to \infty$) from those in $A_1, \ldots, A_{p+1}$.
We conclude with a result that uses, in a combined manner, the SigSAS approximation (see Theorem 4) with its JL reduction in Corollary 12, as well as its SAS characterization with random coefficients in Theorem 13. This statement shows that, in order to approximate a large class of sufficiently regular FMP filters with uniformly bounded inputs, it suffices to randomly generate a common SAS system for all of them and tune a linear readout for each different filter in that class that needs to be approximated.
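The workflow described here, one randomly generated state system shared by all targets and one linear readout per target, can be sketched as follows. This is only an illustration of the mechanics: the Gaussian entries and the scaling `scale` are hypothetical stand-ins and not the laws (37) and (38), the two target filters are made up for the example, and the readouts are fitted by ridge-regularized least squares solved with plain Gaussian elimination.

```python
import random

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def random_sas_states(inputs, k, degree, scale, rng):
    """States of x_t = sum_j z_t**j (A_j x_{t-1} + b_j) with random A_j, b_j.
    The Gaussian laws used here are placeholders (assumption)."""
    A = [[[rng.gauss(0.0, scale) for _ in range(k)] for _ in range(k)]
         for _ in range(degree + 1)]
    b = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(degree + 1)]
    x, states = [0.0] * k, []
    for z in inputs:
        new = [0.0] * k
        for j in range(degree + 1):
            Ax = mat_vec(A[j], x)
            for i in range(k):
                new[i] += (z ** j) * (Ax[i] + b[j][i])
        x = new
        states.append(x)
    return states

def solve_linear(G, c):
    """Solve G w = c by Gaussian elimination with partial pivoting."""
    n = len(c)
    M = [row[:] + [ci] for row, ci in zip(G, c)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for cc in range(col, n + 1):
                M[r][cc] -= f * M[col][cc]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][cc] * w[cc] for cc in range(r + 1, n))) / M[r][r]
    return w

def fit_readout(states, targets, ridge=1e-6):
    """Ridge least squares for a scalar target: (X^T X + ridge I) w = X^T y."""
    k = len(states[0])
    G = [[sum(s[i] * s[j] for s in states) + (ridge if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    c = [sum(s[i] * y for s, y in zip(states, targets)) for i in range(k)]
    return solve_linear(G, c)

def mse(states, targets, w):
    return sum((sum(wi * si for wi, si in zip(w, s)) - y) ** 2
               for s, y in zip(states, targets)) / len(targets)

rng = random.Random(1)
z = [rng.uniform(-1.0, 1.0) for _ in range(400)]
states = random_sas_states(z, k=6, degree=2, scale=0.03, rng=rng)

# Two hypothetical target filters learned from the SAME states.
X = states[1:]
target_a = [z[t] + 0.5 * z[t - 1] for t in range(1, len(z))]
target_b = [z[t] * z[t - 1] for t in range(1, len(z))]
w_a, w_b = fit_readout(X, target_a), fit_readout(X, target_b)
mse_a, mse_b = mse(X, target_a, w_a), mse(X, target_b, w_b)
```

The state trajectory `states` is computed once and reused; only `w_a` and `w_b` differ between the two tasks.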
Theorem 14: Let $M, L > 0$, and let $U$ : [...] be a causal and TI fading memory filter that satisfies the hypotheses in Theorem 4. Now, fix $l, p, k \in \mathbb{N}$ and $\delta > 0$ small enough so that (35) holds. Now, construct the SAS system with states in $\mathbb{R}^k$ given by [...] with matrix coefficients randomly generated according to the laws spelled out in (37) and (38).
If $p$ and $l$ are large enough, then the SAS system $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ has the ESP and the FMP with probability at least $1 - \delta$. In that case, $F^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$ has an associated filter $U^{\mathrm{SigSAS}}_{\lambda_0,l,p,f}$, and there exists a monotonically decreasing sequence $w^U$ with zero limit and a linear map $W \in L(\mathbb{R}^k, \mathbb{R}^m)$ such that, for any $z \in B_M$, it holds that [...] where $I_{l,p}$ is either [...] In these expressions, [...] (11), and $0 < \epsilon < 1$ satisfies (16) with $n$ replaced by $N$.
IV. NUMERICAL ILLUSTRATION
In order to illustrate the main contributions of the article, we consider an IO system given by the so-called generalized autoregressive conditional heteroskedastic (GARCH) model [68], [69]. GARCH is a popular discrete-time process in time-series analysis, which is used in the econometrics literature and by practitioners to model and forecast the dynamics of conditional volatilities in financial time series. More specifically, the GARCH(1, 1) model is given by
$$y_t = \sigma_t z_t, \qquad \sigma_t^2 = \omega + \alpha y_{t-1}^2 + \beta \sigma_{t-1}^2,$$
where $\omega > 0$, $\alpha, \beta \geq 0$, and $\alpha + \beta < 1$ (see [70] for a careful discussion of the properties of GARCH processes). The IO system is driven by the input innovations $\{z_t\}_{t \in \mathbb{Z}}$, and the observations $\{y_t\}_{t \in \mathbb{Z}}$ represent its output. In the experiment, we use $\omega = 0.0001$, $\alpha = 0.1$, and $\beta = 0.87$, and in order to learn the corresponding IO system, we construct: 1) a SigSAS system as in Proposition 2; 2) a JL-reduced SigSAS system as in Corollary 12; and 3) a randomly generated SAS as in Theorem 13. For all the systems, the corresponding readout maps are obtained by a linear regression. Fig. 1 illustrates the result in Theorem 4 and shows that the SigSAS approximation error decreases with N. Fig. 2 shows that the approximation errors committed by both the JL-reduced SigSAS and its randomly generated analog decrease as the JL dimension k increases. We emphasize that the mean errors are computed using 160 randomly drawn instances of these two reduced SigSAS systems, and note that the errors reported in this figure for the two systems are visually indistinguishable. We recall that, even though the result of Theorem 13 is proved to hold in the limit as $N_0 = (p+1)^l \to \infty$, it is clear from this particular example that, even for moderately small $N_0$ ($p = 8$ and $l = 3$), randomly generated small-dimensional SigSAS can excel in learning a given IO system.
Fig. 2. Box plots for the distributions of training mean squared errors (all MSE values are multiplied by $10^4$ for convenience) committed by 160 instances of randomly JL-reduced SigSAS systems and randomly generated SAS systems according to Theorem 13. The MSEs are computed with respect to one given GARCH path of length 7000 for different values of k. For each k, the box plots corresponding to the two systems are plotted next to each other to ease comparison (JL SigSAS in blue and random SAS in magenta). The subplot in the upper right corner shows a comparison of a part of this GARCH path for t = 1, . . . , 100 and its approximations using a JL SigSAS and a randomly generated SAS system with k = 10.
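The kind of experiment described in this section can be mimicked with the following minimal Python sketch. It is not the construction of Theorem 13: the coefficient matrices are plain Gaussian draws rescaled to make the state map contractive, the innovations are assumed to be uniform on $[-\sqrt{3}, \sqrt{3}]$ (bounded, zero mean, unit variance), and the polynomial degree `deg`, the `washout` length, and all variable names are illustrative choices rather than values taken from the article. The GARCH recursion is written in its standard form.

```python
import numpy as np

rng = np.random.default_rng(0)

# GARCH(1,1) parameters and path length quoted in the text.
omega, alpha, beta = 1e-4, 0.1, 0.87
T = 7000
k = 10      # state dimension (the value used in the inset of the box-plot figure)
deg = 2     # polynomial degree of p(z) and q(z): illustrative assumption

# Assumption: bounded innovations, uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1).
M = np.sqrt(3.0)
z = rng.uniform(-M, M, size=T)

# Standard GARCH(1,1) recursion:
#   y_t = sigma_t z_t,  sigma_t^2 = omega + alpha y_{t-1}^2 + beta sigma_{t-1}^2.
sigma2 = np.empty(T)
y = np.empty(T)
sigma2[0] = omega / (1.0 - alpha - beta)   # start at the unconditional variance
y[0] = np.sqrt(sigma2[0]) * z[0]
for t in range(1, T):
    sigma2[t] = omega + alpha * y[t - 1] ** 2 + beta * sigma2[t - 1]
    y[t] = np.sqrt(sigma2[t]) * z[t]

# Random state-affine system x_t = p(z_t) x_{t-1} + q(z_t), with
#   p(z) = sum_i A_i u^i,  q(z) = sum_i b_i u^i,  u = z / M in [-1, 1].
A = rng.standard_normal((deg + 1, k, k))
spectral_norms = np.array([np.linalg.norm(Ai, 2) for Ai in A])
# Rescale so that sum_i ||A_i||_2 = 0.9 < 1; p(z) is then a contraction for |u| <= 1.
A *= 0.9 / spectral_norms.sum()
b = rng.standard_normal((deg + 1, k))

u = z / M
X = np.zeros((T, k))
x = np.zeros(k)
for t in range(T):
    powers = u[t] ** np.arange(deg + 1)     # (1, u_t, u_t^2)
    P = np.tensordot(powers, A, axes=1)     # k x k matrix p(z_t)
    q = powers @ b                          # k-vector q(z_t)
    x = P @ x + q
    X[t] = x

# Linear readout with intercept, fitted by ordinary least squares.
washout = 100                               # discard the initial transient
features = np.column_stack([np.ones(T - washout), X[washout:]])
target = y[washout:]
W, *_ = np.linalg.lstsq(features, target, rcond=None)
mse = np.mean((features @ W - target) ** 2)
baseline = np.var(target)                   # MSE of the best constant predictor
print(f"training MSE x 1e4: {1e4 * mse:.4f} (constant baseline: {1e4 * baseline:.4f})")
```

Only the vector `W` is fitted; the matrices `A` and the vectors `b` stay at their random values, which is the point the section makes about the readout being the only trained component.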
The implications of the strong universality features of the randomly generated SAS systems are far-reaching in terms of their empirical performance since, as already emphasized, it is only the linear readout that is tuned for each individual IO system of interest. In particular, this opens the door to multitask learning (when different components of the readout are trained for different tasks in parallel) and to new hardware implementations of these randomized SAS systems.
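The multitask remark can be illustrated with a short Python sketch. The state matrix `X` below is a random stand-in for the states of one fixed reservoir, and the two targets are hypothetical tasks invented for the example. The sketch only shows that a single least-squares solve with a multicolumn target yields the same readouts as fitting each task separately.

```python
import numpy as np

rng = np.random.default_rng(1)

T, k = 500, 10
# Stand-in for the states of a single, fixed reservoir (one row per time step).
X = rng.standard_normal((T, k))
features = np.column_stack([np.ones(T), X])

# Two hypothetical tasks that share the same reservoir states.
Y = np.column_stack([
    X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(T),
    X[:, 2] * X[:, 3] + 0.1 * rng.standard_normal(T),
])

# One solve fits both readouts: column j of W_joint is the readout for task j.
W_joint, *_ = np.linalg.lstsq(features, Y, rcond=None)

# Fitting each task on its own gives the same coefficients.
W_sep = np.column_stack(
    [np.linalg.lstsq(features, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)
print("joint and separate readouts agree:", np.allclose(W_joint, W_sep))
```

Adding a new task therefore amounts to appending a column to `Y`; the reservoir itself is never re-simulated or retrained.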
V. CONCLUSION
RC capitalizes on the remarkable fact that there are learning systems that attain universal approximation properties without requiring that all their parameters are estimated using a supervised learning procedure. These untrained parameters are most of the time randomly generated, and it is only an output layer that needs to be estimated using a simple functional prescription. This phenomenon has been explained for static (extreme learning machines [30]) and dynamic (ESNs [34], [35]) neural paradigms, and its performance has been quantified using mostly probabilistic methods.
In this article, we have concentrated on a different class of RC systems, namely, the state-affine (SAS) family. The SAS class was introduced and proved universal in [36], and we have shown here that the possibility of randomly constructing these systems and, at the same time, preserving their approximation properties is of geometric nature. The rationale behind our description relies on the following points.
1) Any analytic filter can be represented as a Volterra series expansion. When this filter is additionally of fading memory type, the truncation error can be easily quantified.
2) Truncated Volterra series admit a natural state-space representation with linear observation equation in a conveniently chosen tensor space. The state equation of this representation has a strong universality property whose unique solution can be used to approximate any analytic fading memory filter just by modifying the linear observation equation. We refer to this strongly universal filter as the SigSAS system.
3) The random projections of the SigSAS system yield SAS systems with randomly generated coefficients in a potentially much smaller dimension, which approximately preserves the good properties of the original SigSAS system. The loss in performance that one incurs because of the projection mechanism can be quantified using the JL Lemma.
These observations, together with the numerical experiment, collectively show that SAS reservoir systems with randomly chosen coefficients exhibit excellent empirical performances in the learning of fading memory IO systems because they approximately correspond to very high-degree Volterra series expansions of those systems.
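The distance-preservation property behind point 3) can be checked numerically with a minimal Python sketch. It assumes a Gaussian random projection, which is one common choice in statements of the JL Lemma and not necessarily the distribution used in the article. The dimensions `N` and `k`, the number of points, and the helper `pairwise_sq_dists` are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

N, k, n_points = 2000, 400, 20      # ambient dimension, reduced dimension, sample size
points = rng.standard_normal((n_points, N))

# Gaussian JL map: i.i.d. N(0, 1/k) entries, so squared norms are preserved in expectation.
S = rng.standard_normal((k, N)) / np.sqrt(k)
projected = points @ S.T

def pairwise_sq_dists(V):
    """Matrix of squared Euclidean distances between the rows of V."""
    G = V @ V.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G

# Compare squared distances before and after projection over all distinct pairs.
iu = np.triu_indices(n_points, k=1)
ratios = pairwise_sq_dists(projected)[iu] / pairwise_sq_dists(points)[iu]
print(f"squared-distance ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")
```

The ratios concentrate around 1, and the spread shrinks as `k` grows, which is the mechanism that lets the projected system retain the approximation properties of the high-dimensional one up to a quantifiable distortion.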