Learning of Linear Dynamical Systems as a Non-Commutative Polynomial Optimization Problem

There has been much recent progress in forecasting the next observation of a linear dynamical system (LDS), which is known as improper learning of LDS, as well as in the estimation of its system matrices, which is known as proper learning of LDS. We present an approach to proper learning of LDS which, in spite of the non-convexity of the problem, guarantees global convergence of numerical solutions to a least-squares estimator. We present promising computational results.


I. INTRODUCTION
We consider the identification of vector autoregressive processes with hidden components from time series of observations, which is a key problem in system identification [32]. Its applications range from the identification of parameters in epidemiological models [3] and the reconstruction of reaction pathways in other biomedical applications [10], to the identification of models of quantum systems [5], [6]. Beyond this, one encounters either partially observable processes or questions of causality [43], [14] in almost any application domain. In the "prediction-error" approach to forecasting [32], it allows the estimation of subsequent observations in a time series.
To state the problem formally, let us define a linear dynamical system (G, F, V, W) as in [57].
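Concretely, the model presumably takes the standard dynamic-linear-model form (1) of [57]; the display below is a reconstruction chosen to be consistent with the definitions that follow:

$$
Y_t = F' \varphi_t + \nu_t, \qquad \varphi_t = G\,\varphi_{t-1} + \omega_t, \qquad t = 1, 2, \ldots \tag{1}
$$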
where ϕt ∈ R^{n×1} is the hidden state, Yt ∈ R^{m×1} is the observed output (measurements, observations), G ∈ R^{n×n} and F ∈ R^{n×m} are system matrices, and {ωt, νt}_{t∈N} are normally distributed process and observation noises with zero mean and covariances W and V, respectively. The transpose of F is denoted by F′. Learning (or proper learning) refers to identifying the quadruple (G, F, V, W) given the output {Yt}_{t∈N}. We assume that the linear dynamical system (G, F, V, W) is observable [37], i.e., its observability matrix [57] has full rank. Note that a minimal representation is necessarily observable and controllable (cf. Theorem 4.1 in [48]), so the assumption is not too strong. There are three complications. First, the dimension n of the hidden state ϕt is not known, in general. Although [42] have shown that a lower-dimensional model can approximate a higher-dimensional one rather well, in many cases it is hard to choose n in practice. Second, the corresponding optimization problem is non-convex, and guarantees of global convergence have been available only for certain special cases. Finally, the operator-valued optimization problem is non-commutative, and hence much work on general-purpose commutative non-convex optimization is not applicable without making assumptions (cf. [5]) on the dimension of the hidden state.
Here, we aim to develop a method for proper learning of LDS that could also estimate the dimension of the hidden state and that would do so with guarantees of global convergence to the best possible estimate, given the observations. This would promote explainability beyond what forecasting methods without global convergence guarantees allow for. In particular, our contributions are:
• We cast learning of a linear dynamical system with an unknown dimension of the hidden state as a non-commutative polynomial optimization problem (NCPOP). This also makes it possible to utilize prior information as shape constraints in the NCPOP.
• We show how to use the Navascués-Pironio-Acín (NPA) hierarchy [40] of convexifications of the NCPOP to obtain bounds and guarantees of global convergence. The runtime is independent of the (unknown) dimension of the hidden state.
• In two well-established small examples of [30], [18], [26], our approach outperforms standard subspace and least-squares methods, as implemented in the Matlab™ System Identification Toolbox™.

II. BACKGROUND
First, we set our work in the context of related work. Next, we provide a brief overview of non-commutative polynomial optimization, pioneered by [40] and nicely surveyed by [8], which is our key technical tool. Prior to introducing our own results, we introduce some common notation, following [57].

A. Related Work in System Identification and Control
There is a long history of research within system identification [32]. In forecasting under LDS assumptions (improper learning of LDS), considerable progress has been made in the analysis of predictions of the expectation of the next measurement using autoregressive (AR) processes in statistics and machine learning. In [2], first guarantees were presented for autoregressive moving-average (ARMA) processes. In [30], these results were extended to a subset of autoregressive integrated moving-average (ARIMA) processes. [26] have shown that, up to an arbitrarily small error given in advance, AR(s) will perform as well as any Kalman filter on any bounded sequence. This has been extended by [51] to Kalman filtering with logarithmic regret.
Another stream of work within improper learning focuses on subspace methods [23], [37] and spectral methods [18], [17]. [49], [50] presented the best guarantees available for traditional subspace methods. [47] utilize regularizations to improve sample complexity. Within spectral methods, [18] and [17] have considered learning LDS with inputs, employing certain eigenvalue-decay estimates of Hankel matrices in the analysis of an autoregressive process in a dimension increasing over time. We stress that none of these approaches to improper learning are "prediction-error" methods: they do not estimate the system matrices.
In proper learning of LDS, many state-of-the-art approaches consider the least-squares method, despite complications encountered in unstable systems [12]. [46] have provided non-trivial guarantees for the ordinary least-squares (OLS) estimator in the case of stable G and there being no hidden component, i.e., F′ being an identity and Yt = ϕt. Surprisingly, they have also shown that more unstable linear systems are easier to estimate than less unstable ones, in some sense. [45] extended the results to allow for a certain pre-filtering procedure. [41], [42] extended the results to cover stable, marginally stable, and explosive regimes. [38] provide a finite-horizon analysis of the Ho-Kalman algorithm. Most recently, [4] provided a detailed analysis of the use of the method of moments in learning linear dynamical systems, which could be seen as a polynomial-time algorithm for learning an LDS from a trajectory of polynomial length up to a polynomial error. Our work could be seen as a continuation of the work on the least-squares method, with guarantees of global convergence.

B. Non-Commutative Polynomial Optimization
Our key technical tool is non-commutative polynomial optimization, first introduced by [40]. Here, we provide a brief summary of their results, and refer to [8] for a book-length introduction. An NCPOP is an operator-valued optimization problem with the standard form in (2), where X = (X1, . . ., Xn) is a tuple of bounded operators on a Hilbert space H. In contrast to traditional scalar-valued, vector-valued, or matrix-valued optimization techniques, the dimension of the variables X is unknown a priori. Let [X, X†] denote these 2n operators, where † denotes the conjugate transpose (adjoint). The normalized vector ψ, i.e., ∥ψ∥² = 1, is also defined on H, with the inner product ⟨ψ, ψ⟩ = 1. Here p(X) and qi(X) are polynomials, and qi(X) ≽ 0 denotes that the operator qi(X) is positive semidefinite. Polynomials p(X) and qi(X), of degrees deg(p) and deg(qi), respectively, can be written as sums of monomials with real coefficients, as in (3), for i = 1, . . ., m. Monomials ω, µ, u, and ν in the following text are products of powers of variables from [X, X†]. The degree of a monomial ω, denoted by |ω|, is the sum of the exponents of all operators in ω. Let W_k denote the collection of all monomials of degree |ω| ≤ k, and W the collection of all monomials when no bound is specified. Following [1], we can define the moments, over R or C, of a feasible solution (H, X, ψ) of problem (2), as in (4), for all ω ∈ W, with y_1 = ⟨ψ, ψ⟩ = 1. Given a degree k, the moments of degree at most 2k form a sequence y = (y_ω)_{|ω|≤2k}. We call k the moment order. With a finite set of moments y of moment order k, we can define the corresponding k-th order moment matrix M_k(y) in (5), indexed by monomials |ν|, |ω| ≤ k, and the localizing matrices M_{k_i}(q_i y) in (6), indexed by monomials |ν|, |ω| ≤ k_i, where k_i = k − ⌈deg(q_i)/2⌉ and i = 1, . . ., m.
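Written out, the standard form (2) and the associated objects (3)-(6) presumably take the following form; this display is a reconstruction following [40] and [8], matching the definitions above:

$$
\begin{aligned}
& P^{*} \;=\; \inf_{(\mathcal H,\,X,\,\psi)} \;\langle \psi,\, p(X)\,\psi\rangle
\quad\text{s.t.}\quad q_i(X)\succeq 0,\; i=1,\ldots,m,\;\; \langle\psi,\psi\rangle=1, &&(2)\\
& p(X)=\sum_{|\omega|\le\deg(p)} p_\omega\,\omega,\qquad
q_i(X)=\sum_{|\omega|\le\deg(q_i)} q_{i,\omega}\,\omega, &&(3)\\
& y_\omega=\langle\psi,\,\omega(X)\,\psi\rangle,\;\omega\in W,\qquad y_1=\langle\psi,\psi\rangle=1, &&(4)\\
& M_k(y)(\nu,\omega)=y_{\nu^{\dagger}\omega},\qquad
M_{k_i}(q_i y)(\nu,\omega)=\sum_{u} q_{i,u}\, y_{\nu^{\dagger} u\,\omega}. &&(5),(6)
\end{aligned}
$$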
If (H, X, ψ) is feasible, one can utilize the sums-of-squares theorem of [19] and [35] to derive semidefinite programming (SDP) relaxations. In particular, we can obtain a k-th order SDP relaxation of the non-commutative polynomial optimization problem (2) by choosing a moment order k satisfying 2k ≥ max{deg(p), deg(qi)}. In the Navascués-Pironio-Acín (NPA) hierarchy [40], the SDP relaxation of moment order k has the form in (7). Notice that there are variants [56], [55], [53] that exploit sparsity and significantly reduce the computational burden.
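For completeness, the k-th order relaxation (7) presumably reads as the following SDP over the moment sequence y:

$$
P^{k} \;=\; \min_{y} \ \sum_{|\omega| \le 2k} p_\omega\, y_\omega
\quad\text{s.t.}\quad M_k(y) \succeq 0,\quad M_{k_i}(q_i y) \succeq 0,\ i = 1,\ldots,m,\quad y_1 = 1. \tag{7}
$$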
Let us define the quadratic module, following [40]. Let Q = {qi, i = 1, . . ., m} be the set of polynomials determining the constraints. The positivity domain S_Q of Q is the set of n-tuples of bounded operators X = (X1, . . ., Xn) on a Hilbert space H making all qi(X) positive semidefinite. The quadratic module M_Q is the set of polynomials of the form Σi fi†fi + Σi,j gij†qigij, where fi and gij are polynomials from the same ring. As in [40], we assume that M_Q is Archimedean (Assumption 1). If the Archimedean assumption is satisfied, Pironio et al. [40] have shown that lim_{k→∞} P^k = P* and how to use the so-called rank-loop condition [40] to detect global optimality. We refer to an extended version online [59] for further details.
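The Archimedean assumption (Assumption 1) can be stated, in the standard form of [40], roughly as follows:

$$
\exists\, C > 0 \ \text{ such that }\ \ C^2 \;-\; \sum_{i=1}^{n} X_i^{\dagger} X_i \ \in\ M_Q .
$$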

C. Minimizer Extraction and Gelfand-Naimark-Segal Construction
Notice that the solution of the SDP relaxation makes it possible to read out the value of the objective function ⟨ψ, p(X)ψ⟩ of (2) easily, by looking up the correct entries of the moment matrix (5). To extract the optimizer attaining this objective-function value, one may utilize a variant of the singular-value decomposition of the moment matrix pioneered by [20], which can be construed [25] as the Gelfand-Naimark-Segal (GNS) construction [15], [44], [11]. (The GNS construction essentially produces a *-representation from a positive linear functional of a C*-algebra on a Hilbert space. Under the Archimedean assumption, this method can be applied to non-commutative polynomials, which do not form a C*-algebra otherwise. We refer to an extended version online [59] for further details.) These SVD-based approaches do not require the rank-loop condition to be satisfied, as is well explained in Section 2.2 of [25]. Once global optimality is detected (cf. the previous section), it is possible to extract the global optimum (H*, X*, ψ*) from the solution of the SDP relaxation of (2) by Gram decomposition; cf. Theorem 2 in [40].

III. THE MAIN RESULT
Given a trajectory of observations Y1, . . ., Yt−1, the loss is a one-step error function at time t that compares an estimate ft with the actual observation Yt. Within the least-squares estimator, we aim to minimize the sum of quadratic loss functions, i.e., min Σt ∥Yt − ft∥², where the estimates ft, t ≥ 1, are decision variables. The properties of the optimal least-squares estimate are well understood: it is consistent, cf. Mann and Wald [34] and Ljung [31], and has favorable sample complexity, cf. Theorem 4.2 of Campi and Weyer [9] in the general case, and Jedra and Proutiere [22] for the latest result, parameterized by the size of a certain epsilon net. We stress, however, that it has not been understood how to solve the non-convex optimization problem, in general, outside of some special cases [16] and the recent, concurrent work of [4]. In contrast to [16], we focus on a method achieving global convergence under mild assumptions, and specifically without assuming that the dimension of the hidden state is known. When the dimension of the hidden state is not known, we need operator-valued variables mt to model the state evolution, and some additional scalar-valued variables. We denote the process noise and the observation noise at time t by ωt and νt, respectively. We also use the same symbols for the decision variables corresponding to the estimates thereof, when there is no risk of confusion. If we add the sum of the squares of ωt and the sum of the squares of νt as regularizers to the objective function, with sufficiently large multipliers, and minimize the resulting objective, we should reach a feasible solution with respect to the system matrices, with the process noise ωt and observation noise νt being close to zero.
Overall, such a formulation has the form of (9) subject to (10)-(11); see the sketch after this paragraph. The inputs are Yt, t ≥ 1, i.e., the time series of the actual measurements, or a time window T thereof, and the multipliers c1, c2. The decision variables are the system matrices G, F; the noisy estimates ft and the realizations ωt, νt of noise, for t ≥ 1; and the state estimates mt, for t ≥ 0, which include the initial state m0. We minimize the objective function (9), for a 2-norm ∥·∥, over the feasible set given by the constraints (10)-(11) for t ≥ 1. We call the terms F′mt noise-free estimates, which are regarded as our simulated/predicted outputs. Equation (9) subject to (10)-(11) gives us the least-squares model. We can now apply the techniques of non-commutative polynomial optimization to this model so as to recover the system matrices of the underlying linear system.
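A sketch of the formulation, reconstructed from the description above (the exact display is (9)-(11) in the original; the reconstruction assumes the noise terms enter the two constraints additively, as the text suggests):

$$
\min_{G,\,F,\,f_t,\,m_t,\,\omega_t,\,\nu_t}\ \ \sum_{t=1}^{T} \|Y_t - f_t\|^2 \;+\; c_1 \sum_{t=1}^{T} \|\omega_t\|^2 \;+\; c_2 \sum_{t=1}^{T} \|\nu_t\|^2 \tag{9}
$$
$$
\text{s.t.}\quad f_t = F' m_t + \nu_t, \tag{10}
$$
$$
\phantom{\text{s.t.}\quad} m_t = G\, m_{t-1} + \omega_t, \qquad t = 1, \ldots, T. \tag{11}
$$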
Theorem 2. For any observable linear system (G, F, V, W), for any length T of a time window, and for any error ϵ > 0, under Assumption 1, there is a convex optimization problem whose objective-function value is at most ϵ away from the optimum of (9) subject to (10)-(11). Furthermore, an estimate of (G, F, V, W) can be extracted from the solution of the same convex optimization problem.
Proof. First, we need to show the existence of a sequence of convex optimization problems whose objective-function values approach the optimum of the non-commutative polynomial optimization problem.
As explained in Section II-B above, [40] show that there is a sequence of natural semidefinite-programming relaxations of (2). The convergence of the sequence of their objective-function values is shown by Theorem 1 of [40], which requires Assumption 1. The translation of a problem involving multiple scalar- and operator-valued variables ft, mt, G, F, ωt, νt in (9)-(11) to the form (H, X, ψ) of (2), also known as the product-of-cones construction, is somewhat tedious, but routine and implemented in multiple software packages [58], [54, e.g.]. Second, we need to show that the extraction of an estimate of (G, F, V, W) from the SDP relaxation of order k(ϵ) in the series is possible. There, one utilizes the Gelfand-Naimark-Segal (GNS) construction [15], [44], as explained in Section 2.2 of [25] or in Section II-C above. Notice that the estimate of (G, F, V, W) may have a higher error than ϵ (cf. [29]).
This reasoning can be applied to more complicated formulations involving shape constraints. For instance, in quantum systems [5], density operators are Hermitian, and this constraint can be added to the least-squares formulation.
Crucially for the practical applicability of the method, one should like to exploit the sparsity in the NCPOP (9)-(11). Notice that one can decompose the problem (9)-(11) into t subsets of variables involving ft, mt−1, mt, G, F, ωt, νt, which satisfy the so-called running intersection property [54]. We refer to [24] for a seminal paper on trace optimization exploiting correlative sparsity, and to [54] for the variant exploiting term sparsity. We also present a brief summary online [59].
Also, note that the extraction of the minimizer using the GNS procedure, as explained in Section II-C above, is stable with respect to errors in the moment matrix, for any NCPOP, including the pre-processing above; see Theorem 4.1 in [25]. That is, it suffices to solve the SDP relaxation to a fixed error in order to extract the minimizer.
One can also utilize a wide array of reduction techniques on the resulting SDP relaxations. Notable examples include facial reduction [7], [39] and exploiting sparsity [13]. Clearly, these can be applied to any SDP, irrespective of the non-commutative nature of the original problem, but they can also introduce numerical issues [27]. We refer to [33] for an up-to-date discussion.

IV. NUMERICAL ILLUSTRATIONS
Let us now present an implementation of the approach using the techniques of non-commutative polynomial optimization [40], [8] and compare the results with traditional system identification methods. Our implementation is available online1. We present our experimental settings in more detail online [59].
A. The general setting
a) Our formulation and solvers: For our formulation, we use (9) subject to (10)-(11), where we need to specify the values of c1 and c2. To generate the SDP relaxation of this formulation as in (7), we need to specify the moment order k. Because the degrees of the objective (9) and of the constraints (10)-(11) are all at most 2, the moment order k within the respective hierarchy can start from k = 1.
In our implementation, we use the globally convergent Navascués-Pironio-Acín (NPA) hierarchy [40] of SDP relaxations, as utilized in the proof of Theorem 2, and its sparsity-exploiting variant, known as the non-commutative variant of the term-sparsity-exploiting moment/SOS (TSSOS) hierarchy [56], [55], [54]. (See [59] for a summary.) The SDP of a given moment order within the NPA hierarchy is constructed using ncpol2sdpa 1.12.2 of Wittek [58]. The SDP of a given moment order within the non-commutative variant of the TSSOS hierarchy is constructed using nctssos of Wang et al. [54]. Both SDP relaxations are then solved by mosek 9.2 [36].
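To make the workflow concrete, the following is a minimal sketch, in Python, of how a formulation like (9)-(11) could be passed to ncpol2sdpa and mosek. This is not the authors' released implementation: the variable names (G, Fd, m, f, w, v), the choice of a scalar output, and the use of equality constraints are illustrative assumptions only.

```python
# A minimal sketch, assuming ncpol2sdpa >= 1.12 and a MOSEK license are installed.
# It mirrors the structure of (9) subject to (10)-(11) for a scalar output; the
# exact modelling choices of the released implementation may differ.
import numpy as np
from ncpol2sdpa import generate_operators, SdpRelaxation

T = 20                                  # length of the time window
Y = np.random.randn(T)                  # placeholder for the observed time series
c1, c2 = 5e-4, 1e-4                     # multipliers on the noise regularizers

G, Fd = generate_operators("S", 2)      # state-transition operator and F' (observation operator)
m = generate_operators("m", T + 1)      # state estimates m_0, ..., m_T
f = generate_operators("f", T)          # noisy output estimates f_1, ..., f_T
w = generate_operators("w", T)          # process-noise estimates omega_t
v = generate_operators("v", T)          # observation-noise estimates nu_t

# Objective (9): squared residuals plus the two noise regularizers.
objective = sum((float(Y[t]) - f[t]) ** 2 for t in range(T)) \
    + c1 * sum(w[t] ** 2 for t in range(T)) \
    + c2 * sum(v[t] ** 2 for t in range(T))

# Constraints (10)-(11), written as polynomials required to vanish:
# f_t - F' m_t - nu_t = 0 and m_t - G m_{t-1} - omega_t = 0.
equalities = []
for t in range(T):
    equalities.append(f[t] - Fd * m[t + 1] - v[t])
    equalities.append(m[t + 1] - G * m[t] - w[t])

variables = [G, Fd] + list(m) + list(f) + list(w) + list(v)
sdp = SdpRelaxation(variables)
sdp.get_relaxation(level=1, objective=objective, equalities=equalities)
sdp.solve(solver="mosek")               # lower bound on (9); the moments allow extracting estimates
print(sdp.primal, sdp.dual, sdp.status)
```

The moment order level=1 suffices here because all polynomials involved are of degree at most 2, matching the discussion above.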
b) Baselines: We compare our method against leading methods for estimating state-space models, as implemented in the MathWorks™ Matlab™ System Identification Toolbox™. Specifically, we test against a combination of least-squares algorithms implemented in the routine ssest ("least squares auto"), the subspace methods of [37] implemented in the routine n4sid ("subspace auto"), and a subspace identification method of [21] with an ARX-based algorithm to compute the weighting, again utilized via n4sid ("ssarx").
To parameterize the three baselines, we need to specify the dimension d of the estimated state-space model. We either set d = n directly or, alternatively, iterate from 1 to the highest number allowed in the toolbox when the underlying system is unknown, e.g., on real-world stock-market data. Then, we need to specify the error to be minimized in the loss function during estimation. In fairness to the baselines, we use the one-step-ahead prediction error when comparing prediction performance, and the simulation error between measured and simulated outputs when comparing simulation performance.
c) The performance index: To measure the goodness of fit between the ground truth {Yt}_{t=1}^T (actual measurements) and the noise-free simulated/predicted outputs {F′mt}_{t=1}^T obtained using the different system identification methods, we use the normalized root mean squared error (nrmse) fitness value, where Y and F′m are the vectors consisting of the sequences {Yt}_{t=1}^T and {F′mt}_{t=1}^T, respectively. A higher nrmse fitness value indicates better simulation or prediction performance.
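The nrmse fitness value, numbered (52) in the extended version, follows the goodness-of-fit convention of the Matlab™ toolbox; a sketch of that convention is

$$
\mathrm{nrmse} \;=\; \left(1 - \frac{\| Y - F'm \|}{\| Y - \bar{Y} \|}\right) \times 100\,\%,
$$

where $\bar{Y}$ denotes the vector with every entry equal to the mean of Y.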

B. Experiments on the example of Hazan et al.
Experiments in Sections IV-B and IV-C utilize synthetic time series of T observations generated using an LDS of the form (1), with the tuple (G, F, V, W) and the initial hidden state ϕ0 detailed next. We use the dimension n to indicate that the time series of observations were generated using n × n system matrices, while we use operator-valued variables to estimate these. The standard deviations of the process noise and observation noise, W and V, are chosen from 0.1, 0.2, . . ., 0.9. Note that W is an n × n matrix in general, while we consider the spherical case W = 0.1 × In, where In is the n-dimensional identity matrix, which we abbreviate as W = 0.1.
In our first experiment, we explore the statistical performance of feasible solutions of the SDP relaxation using the example of Hazan et al. [18], [26]. We performed one experiment for each combination of the standard deviations of the process noise W and observation noise V from the discrete set 0.1, 0.2, . . ., 0.9, i.e., 81 runs in total.
Figure 1 illustrates the nrmse values of the 81 runs of our method for different combinations of standard deviations of process noise W and observation noise V (upper), and of another 81 experiments for different combinations of c1 and c2 (lower). In the upper subplot of Figure 1, we consider n = 2, G = [0.9, 0.2; 0.1, 0.1], F′ = [1, 0.8], the starting point ϕ′0 = [1, 1], and T = 20. In the lower subplot of Figure 1, we have the same settings as in the upper one, except that W = V = 0.5 and the parameters c1, c2 are chosen from 10⁻⁴, . . ., 1. It seems clear that the highest nrmse is observed when the standard deviations of both the process and observation noise are close to 0.5. While this may seem puzzling at first, notice that higher standard deviations of noise make it possible to approximate the observations by an autoregressive process with low regression depth [26, Theorem 2]. The observed behavior is therefore in line with previous results [26, e.g., Figure 3].

C. Comparisons against the baselines
Next, we investigate the simulation performance of our method in comparison with other system identification methods, for varying LDS used to generate the time series.Our method and the three baselines described in Section IV-A are run 30 times for each choice of the standard deviations of the noise, with all methods using the same time series.
Figure 2 illustrates the results, with the methods distinguished by colors: blue for "least squares auto", purple for "subspace auto", pink for "ssarx", and yellow for our method. The upper subplot presents the mean (solid lines) and the mean ± one standard deviation (dashed lines) of the nrmse as the standard deviations of both the process noise and the observation noise ("noise std") increase in lockstep from 0.1 to 0.9. The underlying system is the same as in the upper subplot of Figure 1, except that W = V = 0.1, 0.2, . . ., 0.9. The middle subplot is similar, except that the time series are generated by systems of a higher differential order, as in (13), and the formulation of our method is changed accordingly. In the lower subplot of Figure 2, we consider the mean (solid dots) and the mean ± one standard deviation (vertical error bars) of the nrmse at different dimensions n = 2, 3, 4 of the underlying system (1). As Figure 2 suggests, the nrmse values of our method on this example are almost 100%, while the other methods rarely reach 50%, despite the fact that the dimensions used by the baselines are the true dimensions of the underlying system (d = n). (We will use "least squares auto", which seems to work best among the other methods, in the following experiment on stock-market data.) Additionally, our method shows better stability: the gap between the yellow dashed lines in the upper and middle subplots, which indicates the width of two standard deviations, is relatively small.

D. Experiments with stock-market data
Our approach to proper learning of LDS can also be used in a "prediction-error" method for improper learning of LDS, i.e., forecasting its next observation (output, measurement). As such, it can be applied to any time series. To exhibit this, we consider real-world stock-market data first used in [30]. In particular, we predict the evolution of the stock price from the 21st period to the 121st period, where each prediction is based on the 20 immediately preceding observations (T = 20). For our method, we use the same formulation (9) subject to (10)-(11), but with the variable F′ removed. For comparison, the combination of least-squares algorithms "least squares auto" is used again. Since we are using the stock-market data, the dimension n of the underlying system is unknown. Hence, the dimension d of "least squares auto" is iterated from 1 to 4, where 4 is the highest setting allowed in the toolbox for 20-period observations.
Figure 3 shows, in the left subplot, the results obtained by our method (a yellow curve) and by "least squares auto" of varying dimensions d = 1, 2, 3, 4 (four blue curves). The true stock price ("origin") is displayed by a dark curve. The percentages in the legend correspond to the nrmse values (52). Both from the nrmse and from the shape of the curves, we notice that "least squares auto" performs poorly when the stock prices are volatile. This is highlighted in the right subplot, which zooms in on the 66th-101st period.

E. Runtime
Next, we consider the runtime of two implementations of solvers for (9) subject to (10)-(11). The first implementation constructs the SDP relaxation of the NPA hierarchy via ncpol2sdpa 1.12.2 with moment order k = 1. The second implementation constructs the non-commutative variant of the TSSOS hierarchy via nctssos, with moment orders k = 1, 2.

Fig. 3: Left: The time series of the stock price (dark) for the 21st-121st period used in [30], and the predicted outputs of our method (yellow) compared against "least squares auto" (blue) implemented in the Matlab™ System Identification Toolbox™. The dimension d of "least squares auto" is iterated from 1 to the highest number of 4. The percentages in the legend are the corresponding nrmse values of the one-step predictions. Right: a zoom-in for the 66th-101st period.

For comparison purposes, we include the baseline "least squares auto" at dimensions d = 1, 2. We randomly select a time series from the stock-market data, with the length of the time window T chosen from 5, 6, . . ., 30, and run these three methods three times for each T. Figure 4 illustrates the runtime of the SDP relaxations and of the baseline "least squares auto" as a function of the length of the time window. These implementations are distinguished by colors: blue for "least squares auto", green for the non-commutative variant of the TSSOS hierarchy ("nctssos"), and yellow for the NPA hierarchy ("npa"). The mean and the mean ± one standard deviation of the runtime are displayed by (solid or dashed) curves and shaded error bands. The upper-right subplot compares the runtime of our method with "nctssos" at moment order k = 1 against "least squares auto" with dimension d = 1. The red bars in the lower-right subplot display the sparsity of the NPA hierarchy in the experiment on stock-market data against the length of the time window, as ratios of non-zero coefficients out of all coefficients in the SDP relaxations.
As in most primal-dual interior-point methods [52], the runtime of solving the relaxation to an error ϵ is polynomial in its dimension and logarithmic in 1/ϵ, but it should be noted that the dimension of the relaxation grows quickly in the length T of the time window and the moment order k. It is clear that the runtime of solvers for the SDP relaxations within the non-commutative variant of the TSSOS hierarchy exhibits modest growth with the length of the time window, much slower than that of the plain-vanilla NPA hierarchy.

V. CONCLUSIONS
We have presented an alternative approach to the recovery of the hidden dynamics underlying a time series, without assumptions on the dimension of the hidden state. For the first time in system identification and machine learning, this approach utilizes non-commutative polynomial optimization (NCPOP), which has been developed recently within mathematical optimization [40], [58], [25], [54]. NCPOP can accommodate a variety of other objectives and constraints [60, e.g., in fairness]. This builds upon a long history of work on the method of moments [1], [19] and its applications in machine learning [28], as well as recent progress [33] in the scalability of semidefinite programming.

VI. TRACE OPTIMIZATION
Let S^n be the space of n-tuples of real symmetric matrices of the same order, where n is the number of variables. Suppose X = {X1, . . ., Xn} ∈ S^n; then X1, . . ., Xn have the same order, even though the specific order is not given. Let R[X]_k denote the polynomials whose coefficients are from R and whose degrees are at most k, and R[X] the polynomials of arbitrary degree. Let Q be a subset of Sym R[X] with elements qi, i = 1, . . ., m. Trace optimization of polynomials in non-commutative variables, with respect to a polynomial p ∈ Sym R[X], has the form in (15), where the former formulation is the unconstrained trace minimization and the latter is the constrained version, subject to the positive semidefiniteness of all polynomials in Q ⊆ Sym R[X]. Note that if Q = ∅, then trmin(p, Q) becomes an unconstrained problem. We will need the following definitions to relax these trace minimization problems.
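The two trace-minimization problems referred to as (15) presumably read, following [8], as (a reconstruction consistent with the surrounding text):

$$
\mathrm{tr}_{\min}(p) \;=\; \inf\{\,\mathrm{tr}\, p(X) \,:\, X \in S^n \,\}, \qquad
\mathrm{tr}_{\min}(p, Q) \;=\; \inf\{\,\mathrm{tr}\, p(X) \,:\, X \in D_Q \,\},
$$

where $D_Q$ is the semialgebraic set generated by Q, defined below.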
A commutator of two polynomials p, q ∈ R[X] is defined as [p, q] := pq − qp. Since operators can be cyclically interchanged inside a trace, i.e., tr(pq) = tr(qp), the trace of a commutator is zero. Conversely, trace-zero matrices are (sums of) commutators, using the proof in [?]. This fact allows us to define an equivalence relation between two polynomials when they have the same trace.
Definition 3 ([8], Definition 1.50). Cyclic equivalence. Two polynomials p, q ∈ R[X] are cyclically equivalent, denoted p cyc∼ q, if p − q is a sum of commutators.
Proposition 4 ([8], Proposition 1.51). Given two monomials ω, µ, we have ω cyc∼ µ if and only if there exist ν1, ν2 ∈ W such that ω = ν1ν2 and µ = ν2ν1. Given two polynomials p, q ∈ R[X], following (3), they can be written as p(X) = Σ_{|ω|≤deg(p)} pω ω and q(X) = Σ_{|µ|≤deg(q)} qµ µ, with pω, qµ ∈ R. Then p cyc∼ q if and only if the linear conditions in (17) hold. As shown in (17), cyclic equivalence between two polynomials can thus be formulated as a set of linear constraints.
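The linear characterization (17) can be written out as follows (a reconstruction following Proposition 1.51 of [8]):

$$
p \overset{\mathrm{cyc}}{\sim} q \quad\Longleftrightarrow\quad
\sum_{\omega \,\overset{\mathrm{cyc}}{\sim}\, u} p_\omega \;=\; \sum_{\mu \,\overset{\mathrm{cyc}}{\sim}\, u} q_\mu
\quad \text{for every monomial } u \in W . \tag{17}
$$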

Given a subset of polynomials Q ⊆ Sym R[X], recall from Section II-B that the quadratic module MQ is the set of polynomials of the form Σi fi†fi + Σi,j gij†qigij, where fi and gij are polynomials from the same ring. We use it to define the tracial variants, such that the polynomials within these sets have a nonnegative trace. Note that when Q = ∅, the quadratic module MQ, or alternatively M∅, only contains sums of hermitian squares of polynomials. As a tracial variant of MQ, the cyclic quadratic module Θ²_Q includes MQ and all other polynomials that are cyclically equivalent to elements of MQ. We also define the truncated quadratic module M_{Q,2k} and the truncated cyclic quadratic module Θ²_{Q,2k} of order 2k generated by Q, obtained by restricting all terms to degree at most 2k.
Lemma 6 ([19], Lemma 2.1). For a polynomial p ∈ Sym R[X]_{2k}, there exists a (non-unique) Gram matrix Gp such that p = W_k† Gp W_k, where W_k is the vector consisting of all monomials in W_k.
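In symbols (a reconstruction following [8]):

$$
M_Q = \Big\{ \textstyle\sum_i f_i^{\dagger} f_i + \sum_{i,j} g_{ij}^{\dagger}\, q_i\, g_{ij} \Big\}, \qquad
\Theta^2_Q = \Big\{ f \in \mathrm{Sym}\,\mathbb{R}[X] \,:\, f \overset{\mathrm{cyc}}{\sim} g \ \text{for some}\ g \in M_Q \Big\},
$$

with the truncated modules $M_{Q,2k}$ and $\Theta^2_{Q,2k}$ obtained by keeping only the elements that can be written with all terms of degree at most 2k.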
Proof. A symmetric polynomial p ∈ Sym R[X]_{2k} is a weighted sum of symmetrized monomials ω + ω†, as in (23), where each ω ∈ W_{2k} can be written as a product of two monomials u, ν ∈ W_k. Let →p denote the vector of coefficients of p ∈ R[X]. The monomials u, ν can also be regarded as polynomials, and one can show that the symmetrized monomial ω + ω† can be associated with a symmetric Gram matrix, as in (26). Substituting (26) into (23), we then obtain a Gram matrix Gp of p as the weighted sum Gp = Σ_{ω∈W_{2k}} pω Gω, where Gω is the symmetric matrix associated with ω + ω†. Using the fact that a weighted sum of symmetric matrices is again symmetric, we complete the proof.
Proposition 7 ([8], Propositions 1.16 and 3.10). In the unconstrained case (Q = ∅), the cones M∅ and Θ²_∅ can be formulated as sets of linear matrix inequalities (LMIs): p ∈ M∅ if and only if p = W†GpW for some positive semidefinite matrix Gp, and p ∈ Θ²_∅ if and only if p is cyclically equivalent to such a W†GpW, where W is a vector consisting of all monomials in W. Since Proposition 4 has shown that cyclic equivalence is linear, the cone Θ²_∅ is also described by LMIs.
In particular, Gp is a Gram matrix of p, and we call such a Gp a tracial Gram matrix for p.
Proposition 8 ([8], Proposition 5.7). In the constrained case, each element qi ∈ Q can be written as qi = Σ_{ω∈W} qi,ω ω, and the cone Θ²_Q can again be formulated as a set of linear matrix inequalities (LMIs). The semialgebraic set DQ generated by Q is the class of n-tuples X of real symmetric matrices of the same order such that q(X) ≽ 0 for all q ∈ Q. Note that if Q = ∅, the semialgebraic set is just S^n. Furthermore, let F be a type II1 von Neumann algebra. Then D^{II1}_Q is the union of the F-semialgebraic sets over all type II1 von Neumann algebras with separable predual. We have DQ ⊆ D^{II1}_Q (cf. Remark 1.61 of [8]). Finite von Neumann algebras are introduced because we cannot work with the algebra of all bounded operators on an infinite-dimensional Hilbert space, which does not admit a trace.
Following Proposition 1.25 of [8], it is easy to see that, for all X ∈ DQ, if f ∈ MQ then f(X) ≽ 0, and thus tr f(X) ≥ 0. According to Proposition 1.62 of [8], the analogous statement holds for Θ²_Q over D^{II1}_Q. Using the new notation, problem (15b) can be rewritten as finding the trace minimum over the region DQ. Instead of working with DQ directly, tr^{II1}_min(p, Q) provides a lower bound for trmin(p, Q), and it is also more approachable to solve. Consider the finite-dimensional case: a symmetric linear functional L : R[X]_{2k} → R (cf. Section VII) is in bijective correspondence with a Hankel matrix M_k, indexed by monomials u, ν ∈ W_k, such that M_k(u, ν) = L(u†ν). We also define a localizing matrix M_k^i for qi ∈ Q, indexed by monomials u, ν ∈ W_{k−⌈deg(qi)/2⌉}, such that M_k^i(u, ν) = L(u† qi ν).
Corollary 10. Given two polynomials p, q ∈ R[X] with p = Σ_u pu u and q = Σ_ν qν ν, it follows from the definition of the Hankel matrix M_k that L(p†q) = ⟨→p, M_k →q⟩.
Lemma 11 ([8], Lemma 1.44). Positivity of L is equivalent to the positive semidefiniteness of M_k. According to the definition, L is nonnegative on M_{Q,2k}. Hence, using Corollary 10, it is easy to see that the corresponding localizing matrices M_k^i are positive semidefinite. In the tracial case, L also needs to vanish on commutators; hence M_k is invariant under cyclic equivalence, i.e., M_k(u, ν) = M_k(w, z) whenever u†ν cyc∼ w†z, for u, ν, w, z ∈ W_k.
Definition 12 ([8], Remark 1.64). The dual cones. We define the dual cone (Θ²_{2k})^∨ to consist of symmetric linear functionals that are nonnegative on Θ²_{2k} and zero on commutators. Likewise, given Q ⊆ Sym R[X], we define the constrained dual cone (Θ²_{Q,2k})^∨ analogously. Using the Hankel matrix M_k, each dual cone can also be rewritten as a set of LMIs.
There are two important propositions that are crucial for the following relaxations.
Note that the ϵ in Proposition 14 can be arbitrarily small but non-zero. In other words, f is very close to Θ²_Q. Along with Propositions 13-14, the formulations in (36) can be relaxed to (37),
where the unconstrained problem is relaxed to a single SDP (cf. Corollary 15), with 2k1 = min_{g cyc∼ f} deg g. In the constrained case, tr^{II1}_min(p, Q) is relaxed to a hierarchy of SDPs (cf. Corollary 15) with 2k2 ≥ min_{g cyc∼ f} deg g.
To obtain the dual problem of (37a), we follow the standard Lagrangian duality approach, which yields the chain of (in)equalities in (38)-(42). The equality in (39) uses Definition 12 to transfer the cone constraint. The inequality in (40) is weak duality, since the variable a is unbounded. The equality in (41) is due to the linearity of L. Finally, the inner problem in (42) is bounded only if L(1) = 1, so that we obtain the dual problem in (43). Note that the same procedure can be applied to (37b) simply by changing Θ²_{2k} to Θ²_{Q,2k}; the resulting formulation is denoted by L^k_{Θ²}(p, Q).
Corollary 15. tr_{Θ²}(p) and tr^k_{Θ²}(p, Q) are SDPs. This conclusion follows directly from Propositions 7 and 8, as the constraints are LMIs.
Proof. For L_{Θ²}(p), we can transfer the problem to the space of Hankel matrices M_k. We first look at the objective function of (43). Since p ∈ Sym R[X]_{2k}, the polynomial p can be expressed by a Gram matrix Gp, such that p = W_k† Gp W_k and Gp is symmetric. Using the proof of Lemma 6, we have L(p) = ⟨Gp, M_k⟩. Then, using Definition 12, the first constraint L(1) = 1 can be written as ⟨E1,1, M_k⟩ = 1, where E1,1 is a matrix with all entries 0 except for a 1 in the (1, 1) entry. Further, L^k_{Θ²}(p, Q) only differs from L_{Θ²}(p) by the dual-cone constraint. By reusing Definition 12, we know that L^k_{Θ²}(p, Q) is also an SDP.
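Putting these pieces together, the dual SDP (43) in the space of Hankel matrices presumably reads (a sketch consistent with the proof of Corollary 15, not a verbatim reproduction of the original display):

$$
L^k_{\Theta^2}(p, Q) \;=\; \inf_{M_k} \;\langle G_p, M_k\rangle
\quad\text{s.t.}\quad M_k \succeq 0,\;\; M_k^i \succeq 0 \;(i=1,\ldots,m),\;\; \langle E_{1,1}, M_k\rangle = 1,
$$

with $M_k$ additionally invariant under cyclic equivalence.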

B. The exploiting correlative sparsity variant
As shown in the lower-right subplot of Figure 4, there is much room for exploiting sparsity in the interest of a significant reduction in runtime. Here, we give a brief summary of correlative-sparsity-exploiting trace optimization; we refer readers to [24] for more details and to [54] for the term-sparsity-exploiting variant.
The index set of the tuple X ∈ S^n is I = {1, 2, . . ., n}. Suppose there are subsets I1, I2, . . ., Ir ⊆ I with ∪_{i=1}^r Ii = I. Let X(Ii) ⊂ X denote the set of variables Xj whose indices belong to Ii. Given a sparse objective function p ∈ Sym R[X], we assume that p can be decomposed as p = Σ_{i=1}^r pi with pi ∈ Sym R[X(Ii)]. With the assumptions of boundedness of each subset X(Ii) and the running intersection property (cf. Assumptions 2.3 and 2.4 in [24]; see also the sketch after this paragraph), we obtain the sparse trace-minimization formulations.
Definition 17 ([24]). Sparse semialgebraic sets. Given a subset of polynomials Q ⊆ Sym R[X], the sparse version D^{sparse}_Q is defined analogously to DQ, with each constraint involving only the variables X(Ii) of one block.
Definition 18 ([24]). Sparse cyclic quadratic modules. Given a subset of polynomials Q ⊆ Sym R[X] and the sparse semialgebraic set D^{sparse}_Q, we define the sparse cyclic quadratic module analogously, with each generator supported on one block of variables. For the unconstrained case, we can drop the Q in the subscripts, i.e., Q = ∅.
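The running intersection property assumed above can be written, following [24], roughly as:

$$
\text{for every } i \ge 2 \ \text{there exists } j < i \ \text{such that}\ \
I_i \cap \bigl( I_1 \cup \cdots \cup I_{i-1} \bigr) \subseteq I_j .
$$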

VII. THE GELFAND-NAIMARK-SEGAL CONSTRUCTION
The Gelfand-Naimark-Segal (GNS) construction essentially produces a *-representation from a positive linear functional of a C*-algebra on a Hilbert space. Under the Archimedean assumption, this method can be applied to non-commutative polynomials, which do not form a C*-algebra otherwise. The GNS-construction view of [25] starts with the construction of the Hilbert spaces. Let L : R[X] → R be a positive linear functional with L(MQ) ≥ 0, where MQ denotes the quadratic module of (2). Then there exist a tuple of bounded operators X = (X1, . . ., Xn) and a normalized vector ψ such that L(p) = ⟨ψ, p(X)ψ⟩ for all p ∈ R[X], as in (44). By the Cauchy-Schwarz inequality, the sesquilinear form L in (44) induces an inner product on the quotient space R[X]/N, where N is the set of polynomials p with L(p†p) = 0, as in Section 1.5 of [8]. Notice that the moment matrix M_k of moment order k and its extensions M_δ, δ > k, are positive semidefinite Hankel matrices, where M_k is a short form of M_k(y). Associated with L, we recall from Section II-B that the entries of the moment matrix M_k are defined as in (5).

VIII. EXPERIMENTAL SETTINGS

b) Our formulation and solvers: For our formulation, we use (9) subject to (10)-(11), where we need to specify the values of c1 and c2. To generate the SDP relaxation of this formulation as in (7), we need to specify the moment order k. Because the degrees of the objective (9) and of the constraints (10)-(11) are all at most 2, the moment order k within the respective hierarchy can start from k = 1.
In our implementation, we use the globally convergent Navascués-Pironio-Acín (NPA) hierarchy [40] of SDP relaxations, as utilized in the proof of Theorem 2, and its sparsity-exploiting variant, known as the non-commutative variant of the term-sparsity-exploiting moment/SOS (TSSOS) hierarchy [56], [55], [54]. The SDP of a given order within the NPA hierarchy is constructed using ncpol2sdpa 1.12.2 of Wittek [58]. The SDP of a given order within the non-commutative variant of the TSSOS hierarchy is constructed using nctssos of Wang et al. [54], with trace minimization implemented. Both SDP relaxations are then solved by mosek 9.2 [36]. We refer readers to Section VI-B for a summary of the sparse version of trace minimization.
c) Baselines: We compare our method against other leading methods for estimating state-space models, as implemented in the MathWorks™ Matlab™ System Identification Toolbox™. Specifically, we test against a combination of least-squares algorithms implemented in the routine ssest ("least squares auto"), the subspace methods of [37] implemented in the routine n4sid ("subspace auto"), and a subspace identification method of [21] with an ARX-based algorithm to compute the weighting, again utilized via n4sid ("ssarx").
To parameterize the three baselines, we need to specify the dimension d (i.e., the order) of the estimated state-space model. We either set d = n directly or, alternatively, iterate from 1 to the highest number allowed in the toolbox when the underlying system is unknown, e.g., on real-world stock-market data. Then, we need to specify the error to be minimized in the loss function during estimation. In fairness to the baselines, we use the one-step-ahead prediction error when comparing prediction performance, and the simulation error between measured and simulated outputs when comparing simulation performance.
d) The performance index: To measure the goodness of fit between the ground truth {Yt}_{t=1}^T (actual measurements) and the noise-free simulated/predicted outputs {F′mt}_{t=1}^T obtained using the different system identification methods, we introduce the normalized root mean squared error (nrmse) fitness value (52), where Y and F′m are the vectors consisting of the sequences {Yt}_{t=1}^T and {F′mt}_{t=1}^T, respectively. A higher nrmse fitness value indicates better simulation or prediction performance.

B. Experiments on the example of Hazan et al.
In our first experiment, we explore the statistical performance of feasible solutions of the SDP relaxation. In the upper subplot of Figure 1, utilizing the same LDS as in [18], [26], we consider n = 2, G = [0.9, 0.2; 0.1, 0.1], F′ = [1, 0.8], the starting point ϕ′0 = [1, 1], and T = 20. The standard deviations of the process noise and observation noise, W and V, are chosen from 0.1, 0.2, . . ., 0.9. For our method, we set the parameters c1 = 5 × 10⁻⁴, c2 = 10⁻⁴, and the moment order k = 1; the formulation is then relaxed via ncpol2sdpa 1.12.2 and solved by mosek 9.2. We therefore performed one experiment for each combination of the standard deviations of process noise W and observation noise V, i.e., 9 × 9 runs in total.
In the lower subplot of Figure 1, we have the same settings as in the upper one, except that W = V = 0.5 and the parameters c1, c2 are chosen from 10⁻⁴, 10⁻⁴√10, . . ., 1. We again perform 9 × 9 experiments, one for each combination of c1 and c2. Figure 1 illustrates the nrmse values of the 81 experiments of our method in different combinations of standard deviations of process noise W and observation noise V (upper), and of another 81 experiments in different combinations of c1 and c2 (lower). It seems clear that the highest nrmse is observed when the standard deviations of both the process and observation noise are close to 0.5. While this may seem puzzling at first, notice that higher standard deviations of noise make it possible to approximate the observations by an autoregressive process with low regression depth [26, Theorem 2]. The observed behavior is therefore in line with previous results [26, e.g., Figure 3].

C. Comparisons against the baselines
Next, we investigate the simulation performance of our method in comparison with other system identification methods, for varying underlying systems. In the upper subplot of Figure 2, we still consider the same LDS as in the upper subplot of Figure 1, except that W = V = 0.1, 0.2, . . ., 0.9. For our method, we set the parameters c1 = 5 × 10⁻⁴, c2 = 10⁻⁴, and the moment order k = 1; the relaxation is then constructed via ncpol2sdpa and solved by mosek 9.2. For the baselines, we set d = n = 2 and minimize the simulation errors. Our method and the three baselines are each run 30 times at each noise standard deviation, i.e., 4 × 30 × 9 runs in total, with all methods using the same time series.
In the middle subplot of Figure 2, we consider the same setting as in the upper subplot of Figure 2, but another underlying system, with a higher differential order, as in (13), where G = [0.9, 0.2; 0.1, 0.1], F′1 = [1, 1], F′2 = [0.2, 0.5]. The relaxation is built via ncpol2sdpa 1.12.2 and solved by mosek 9.2, with c1 = 5 × 10⁻⁴, c2 = 10⁻³, and k = 1. Our method and the three baselines are each run 30 times at each noise standard deviation, i.e., 4 × 30 × 9 runs in total, with all methods using the same time series.
In the lower subplot of Figure 2, we consider varying dimensions n = 2, 3, 4 of the underlying system. Given a dimension n, we let G be a random n-dimensional unitary matrix, let both F and ϕ0 be n-dimensional column vectors with all entries equal to 1, and set T = 30 and W = V = 0.5. We only take the real part of the measured outputs as the time series. The settings for our method and the baselines are the same as in the upper subplot of Figure 2; notably, the dimension used by the baselines is the true dimension of the underlying system (d = n). Our method and the three baselines are run 30 times for each dimension n, i.e., 4 × 30 × 3 runs in total, with all methods using the same time series.
Figure 2 illustrates the simulation performance of our method, compared with the three baselines, for different underlying systems. These methods are distinguished by colors: blue for "least squares auto", purple for "subspace auto", pink for "ssarx", and yellow for our method. The upper subplot presents the mean (solid lines) and the mean ± one standard deviation (dashed lines) of the nrmse as the standard deviations of both the process noise and the observation noise ("noise std") increase in lockstep from 0.1 to 0.9, with the underlying system being (1). The middle subplot reports the same as the upper one, the difference being that the time series used for simulation are generated by the higher-differential-order system in (13), and the formulation of our method is changed accordingly. The lower subplot depicts the mean (solid dots) and the mean ± one standard deviation (vertical error bars) of the nrmse at different dimensions n of the underlying systems in (1).
As Figure 2 suggests, the nrmse values of our method on this example are almost 100%, while the other methods rarely reach 50%. (We will use "least squares auto", which seems to work best among the other methods, in the following experiment on stock-market data.) Additionally, our method shows better stability, as the gap between the yellow dashed lines in the upper or middle subplot, which indicates the width of two standard deviations, is relatively small.

D. Experiments with stock-market data
Our approach to proper learning of LDS can also be used in a "prediction-error" method for improper learning of LDS, i.e., forecasting its next observation (output, measurement). As such, it can be applied to any time series. To exhibit this, we consider real-world stock-market data first used in [30]. (The data are believed to be related to the evolution of daily adjusted closing prices of stocks within the Dow Jones Industrial Average, starting on February 16, 1885. Unfortunately, the data are no longer available from the website of the authors of [30].) In Figure 3, we predict the evolution of the stock price from the 21st period to the 121st period, where each prediction is based on the 20 immediately preceding observations (T = 20). For example, we use the first 20 periods of the stock prices to predict the stock price for the 21st period. Next, we use the prices from the 2nd period to the 21st period to predict the stock price for the 22nd period, and so on. For our method, we use the same formulation (9) subject to (10)-(11), but with the variable F′ removed. We set c1 = c2 = 0.01 and the moment order k = 1. The SDP relaxation is generated by ncpol2sdpa 1.12.2 and solved by mosek 9.2. For comparison, the combination of least-squares algorithms "least squares auto" is used again. Since we are using the stock-market data, the dimension n of the underlying system is unknown. Hence, the dimension d of the baseline is iterated from 1 to 4, as 4 is the highest number allowed in the toolbox for 20-period observations. Notably, "least squares auto" is set to focus on producing a good predictor model, as the one-step-ahead prediction error between measured and predicted outputs is minimized during estimation.
Figure 3 shows the prediction results for the 21st-121st period, using our method (the yellow curve) and the baselines "least squares auto" of varying dimensions d = 1, 2, 3, 4 (four blue curves). The true stock price ("origin") is displayed by a dark curve. The right subplot is a zoom-in of the left one for the 66th-101st period. The percentages in the legend correspond to the nrmse values (52). Both from the nrmse and from the shape of these curves, we again notice that "least squares auto" performs poorly when the stock prices are volatile. This is highlighted in the right zoom-in subplot.
In Figure 4, we reuse the stock-market data. Our method, using the formulation (9) subject to (10)-(11), is implemented twice: one implementation constructs the SDP relaxation of the NPA hierarchy via ncpol2sdpa 1.12.2 with k = 1, and the other implementation constructs the non-commutative variant of the TSSOS hierarchy via nctssos, with k = 1, 2. Both relaxations are solved by mosek 9.2. For comparison purposes, we implement the baseline "least squares auto" at dimensions d = 1, 2. We randomly select a time series from the stock-market data, with the length of the time window T chosen from 5, 6, . . ., 30, and run these three methods three times for each T.
Figure 4 illustrates the runtime of the SDP relaxations and of the baseline "least squares auto" as a function of the length of the time window. The implemented methods are distinguished by colors: blue for "least squares auto", green for our method using the non-commutative variant of the TSSOS hierarchy ("nctssos"), and yellow for our method using the NPA hierarchy ("npa"). The mean and the mean ± one standard deviation of the runtime are displayed by (solid or dashed) curves and shaded error bands. The upper-right subplot compares the runtime of our method with "nctssos" at moment order k = 1 against "least squares auto" with dimension d = 1. The red bars in the lower-right subplot display the sparsity of the NPA hierarchy in the experiment on stock-market data against the length of the time window, as ratios of non-zero coefficients out of all coefficients in the SDP relaxations.
As in most primal-dual interior-point methods [52], the runtime of solving the relaxation to an error ϵ is polynomial in its dimension and logarithmic in 1/ϵ, but it should be noted that the dimension of the relaxation grows quickly in the length T of the time window and the moment order k. It is clear that the runtime of solvers for the SDP relaxations within the non-commutative variant of the TSSOS hierarchy exhibits modest growth with the length of the time window, much slower than that of the plain-vanilla NPA hierarchy. This should be expected, given that the number of elements in the positive-semidefinite constraints in the NPA hierarchy [40] equals the number of entries in the moment matrix M_k(y). The sparsity of the NPA hierarchy in this experiment is crucial for the significant reduction in runtime when using the TSSOS hierarchy. We refer readers to the correlative-sparsity-exploiting variant in [24] and the term-sparsity-exploiting variant in [54], or to Section VI-B for a brief summary.

Fig. 1: Upper: The nrmse fitness values (52) of 81 experiments of our method at different combinations of noise standard deviations of the process noise W and observation noise V. Lower: the same at different combinations of the parameters c1 and c2. Both use data generated from systems in (1). Lighter colors indicate higher nrmse and thus better simulation performance.
Fig. 2: The nrmse fitness values (52) of our method compared to the leading system identification methods implemented in the Matlab™ System Identification Toolbox™. Upper & middle: the mean (solid lines) and the mean ± one standard deviation (dashed lines) of the nrmse as the standard deviations of both the process noise and the observation noise increase in lockstep from 0.1 to 0.9. The time series used for simulation are generated from the systems in (1) (upper) and the higher-differential-order systems in (13) (middle), with the dimension n of both systems being 2. Lower: the mean (solid dots) and the mean ± one standard deviation (vertical error bars) of the nrmse at different dimensions n of the underlying systems in (1). Higher nrmse indicates better simulation performance.

Fig. 4: Left: The (solid or dashed) curves show the mean runtime of the SDP relaxations of the baseline "least squares auto" (blue), the TSSOS hierarchy (green), and the NPA hierarchy (yellow), at different moment orders k or dimensions d. The mean ± one standard deviation of the runtime is displayed by shaded error bands. Upper-right: The mean and the mean ± one standard deviation of the runtime of the SDP relaxation of the TSSOS hierarchy at moment order k = 1 and of "least squares auto" with dimension d = 1. Lower-right: The red bars display the sparsity of the NPA hierarchy in the experiment on stock-market data against the length of the time window, as ratios of non-zero coefficients out of all coefficients in the SDP relaxations.

TABLE I: An overview of parameters in the implementation. Please refer to Section VIII-A for the definitions of the parameters.