Finite Sample Identification of Low-Order LTI Systems via Nuclear Norm Regularization

This paper studies the problem of identifying low-order linear time-invariant systems via Hankel nuclear norm (HNN) regularization. This regularization encourages the Hankel matrix to be low-rank, which corresponds to the dynamical system being of low order. We provide a novel statistical analysis for this regularization and contrast it with the unregularized ordinary least-squares (OLS) estimator. Our analysis leads to new finite-sample error bounds on estimating the impulse response and the Hankel matrix associated with the linear system using HNN regularization. We design a suitable input excitation, and show that we can recover the system using a number of observations that scales optimally with the true system order and achieves strong statistical estimation rates. Complementing these, we also demonstrate that the input design indeed matters by proving that intuitive choices, such as i.i.d. Gaussian input, lead to sub-optimal sample complexity. To better understand the benefits of regularization, we also revisit the OLS estimator. Besides refining existing bounds, we experimentally identify when HNN regularization improves over OLS: (1) for low-order systems with slow impulse-response decay, the OLS method performs poorly in terms of sample complexity; (2) the Hankel matrix returned by regularization has a clearer singular value gap that makes determining the system order easier; (3) HNN regularization is less sensitive to the hyperparameter choice. To choose the regularization parameter, we also outline a simple joint train-validation procedure.


I. INTRODUCTION
System identification is a central topic in control theory; accurate estimation of system dynamics is the basis of control or policy decisions in tasks varying from the linear-quadratic regulator to model predictive control. A linear time-invariant (LTI) dynamical system is described as

x_{t+1} = A x_t + B u_t,   y_t = C x_t + D u_t + z_t,   (1)

where x_t, u_t, y_t, z_t are the state, input, output, and noise vectors respectively, and the initial state is denoted by x_0. Suppose the true system order is R, i.e., R is the smallest state dimension in (1). In practice, the true order R is unknown, so it is customary to pick a user-specified order n ≥ R and fit a larger system with state dimension n. For such a problem, the total number of parameters in the A, B, C, D matrices grows linearly in n. However, due to the underlying low-order dynamics, the degrees of freedom of the system dynamics are proportional to R, regardless of n. This discussion naturally motivates regularization strategies that promote low-order systems to achieve system identification with near-optimal sample size. To this aim, we will use nuclear norm regularization for the Hankel matrix associated with the system. It is well known that the rank of a Hankel matrix corresponds to the system order R, and that nuclear norm regularization promotes low-rank solutions in matrix regression problems. In light of these, our main contribution is a statistical analysis of the Hankel nuclear norm (HNN) regularization approach. We will shed light on the fundamental limits of HNN, and contrast it with the ordinary least squares (OLS) estimator for low-order system identification. Table 2 covers the core notation used in the paper, most of which is introduced in Section II.
We note that a shorter version of this paper was published as [1]. Compared to that version, this paper makes several additional contributions: (1) a negative result on the suboptimality of the sample complexity with i.i.d. input (Section III-B), (2) a discussion of model selection and of how to tune the regularization parameter, with estimation bounds for the train-validation procedure (Section V), and (3) numerical simulations that demonstrate the advantage of HNN over least squares (Section VI-A).

A. CONTRIBUTIONS
Our main contribution is establishing data-driven guarantees for Hankel nuclear norm (HNN) regularization and shedding light on the benefit of regularization through a comparison to the ordinary least-squares (OLS) estimator. The core technical challenge in this work is the analysis of HNN-based system identification, which requires novel statistical machinery compared to prior finite-sample analyses, which mostly focus on OLS. Table 1 summarizes our findings at a high level: we not only show that HNN can overcome the O(n) sample-size dependence but also that its performance strongly depends on the input design, as illustrated in Fig. 1. More specifically, we make the following contributions.
• Sample complexity of Hankel nuclear norm (Section III): For multi-input/single-output (MISO) systems (p input channels), we establish near-optimal sample complexity bounds for Hankel-regularized system identification and show that the sample size grows as O(pR log²(n)), where R is the true system order and n is the user-specified order, which is equal to the number of blocks in the Hankel matrix. This result utilizes an input-shaping strategy (rather than i.i.d. excitation, see Fig. 1(a)) and builds on [2], which studied the recovery of a sum-of-exponentials signal.
In light of [3], systems with low-rank Hankel matrices can be viewed as simultaneously structured signals with Hankel and low-rank structures. Through degrees-of-freedom counting, one can deduce that (A) enforcing only low-rankness without Hankel structure would require O(nR) samples and (B) enforcing only Hankel structure without low-rankness would require O(n) samples. Thus, O(R log²(n)) is a best-of-both-worlds bound. This near-optimal bound is especially interesting since simultaneously structured signals can be provably hard to estimate from few measurements [3].
Interestingly, in Section III-B, we also show that our input-shaping strategy is indeed critical for establishing logarithmic sample-complexity dependence on n. Specifically, we prove that if the inputs are i.i.d. standard normal (Fig. 1(b)), the minimum number of observations to recover the impulse response grows polynomially, as T ≳ n^{1/3}.
• Improved statistical rates for HNN and OLS (Section IV): We establish noise robustness of HNN regularization by upper bounding the estimation errors of the impulse response (IR) of the system and the associated Hankel matrix. Unlike related results of [2], our bounds exhibit the optimal O(1/√T) decay with growing sample size T. These findings also reveal the competitiveness of HNN compared to OLS for noisy systems.
For OLS, we also establish a near-optimal spectral error rate for Hankel matrix estimation when T ≳ np. Our error rate improves over that of [4], and our sample complexity improves over [5] and [6], which require O(n²) samples rather than O(n). This refinement is accomplished by a tighter analysis that relates the IR and Hankel errors (see Thm. 3).
• Guarantees on model-selection (Section V): HNN method requires a proper choice of the regularization parameter λ. In practice, the optimal choice is data dependent and one usually estimates λ via trial and error based on the validation error. We provide a complete procedure for model selection (training & validation phases), and establish statistical guarantees.
• Contrasting HNN regularization and OLS (Section VI): Finally, we assess the benefits of regularization via numerical experiments on system identification, focusing on data collected from a single trajectory. Following Table 1, we consider synthetic data and focus on low-order systems with slow impulse-response decay. The slow decay is intended to exacerbate the FIR approximation error arising from truncating the impulse response at O(n) terms. In this setting, OLS as well as [5] are shown to perform poorly due to the O(n) dependence. In contrast, HNN regularization is able to avoid the truncation error, as it allows for fitting a long impulse response with few data (thanks to the log²(n) dependence).
Our real-data experiments (on a low-order example from the DaISy datasets [7]) show that the regularized algorithm has empirical benefits in sample complexity, estimation error, and Hankel spectral gap, and demonstrate that the regularized algorithm is less sensitive to the choice of the tuning  parameter λ, compared to OLS whose tuning parameter is the user-specified order n.
Finally, we remark that, while this work focuses on the estimation of the impulse response and the associated Hankel matrix, the results can be extended to identify the state-space matrices (A, B, C, D). This can be accomplished via Ho-Kalman-type post-processing algorithms, which can estimate the state-space matrices (up to a similarity transformation) given a noisy Hankel matrix [8]. Our primary interest is system identification and not control. However, it is also well known that stronger estimation guarantees translate to stronger model-based controllers (e.g., certainty-equivalent control). Following a reviewer's suggestion, we further elaborate on the benefits of our results for H∞ and adaptive control.

B. RELATED WORKS ON SYSTEM IDENTIFICATION
System identification has a rich and dynamic literature. We emphasize that our work focuses on regression methods (OLS and HNN) to estimate the impulse response of the system. These methods can be augmented with preprocessing/postprocessing techniques to improve estimation as well as to identify low-order state-space realizations. We gather the related works into the following categories.
Classical methods and asymptotic guarantees: The traditional unregularized methods include the Cadzow approach [9], [10], the matrix pencil method [11], the Ho-Kalman approach [12] and the subspace method introduced in [13], [14], [15], further modified into a frequency-domain subspace method in [16] for inputs that are single-frequency (sine/cosine) signals. Much of the prior literature on these methods provides asymptotic guarantees rather than finite-sample analysis.
Finite sample identification guarantees: This literature is closer to our work, and prior works can be categorized into identification with and without direct access to state observations. State observation corresponds to setting (C, D) = (I, 0) in the state-space model (1). A notable line of work derives statistical bounds in this setting from a single output trajectory (defined in Fig. 2) via least squares [17], [18], [19], [20], estimating the A, B matrices. [19] assumes that the system is marginally stable, whereas [20] removes the assumptions on the spectral radius of A.
Closer to us, for the partially observed setting (1), recent works show that least squares can be used to recover the Markov parameters and reconstruct A, B, C, D from the Hankel matrix via the classical Ho-Kalman algorithm [4], [12]. To identify a stable system from a single trajectory, [4] estimates the impulse response and [5] estimates the Hankel matrix via least squares. While [4], [5] use random input, [6, Thms. 1.1, 1.2] uses an impulse and a single-frequency signal, respectively, as input. These works assume a known system order, or traverse the Hankel size n to fit the system order. There are more recent extensions on identifying unstable systems [21], [22]. [23], [24] propose equipping least squares with frequency-domain filtering to assist system identification.
Representing the system as auto-regressive or ARX models: Besides fitting the impulse response and system matrices, one can also predict the output via past outputs. Recent works develop methods and finite sample analysis to identify auto-regressive models [25]. There is also a rich system identification literature on ARX models [26], iterative ARX models [27], and related kernel methods [28], [29] to best explain the dynamics of a system. These works focus on either empirical/algorithmic performance or asymptotically correct theoretical guarantees, rather than finite sample error bounds, which are the focus of our work.
Results on regularized system identification: With only access to the input/output data, [30], [31] propose convex optimization formulations to establish a confidence set around the estimated impulse response without making assumptions on the impulse response structure (e.g., low order, fast decay). However, these results do not say much about the required sample size under which the resulting confidence sets are guaranteed to be small. In contrast, our work determines the minimum sample size that ensures the parameters truly lie in a small ball around the estimate.
There has also been growing interest in kernel-based regularization methods for system identification [28], [29], [32], [33], [34]. Section 7.5 of [32] provides non-asymptotic guarantees on system identification through stable spline estimators, which is closest to our work. This is a kernel method that essentially solves regularized least squares and is an alternative to nuclear norm regularization. Specifically, they use a regularizer of the form Reg(h) = h^T M h, where M is a positive definite matrix that depends on the spectral radius or the a priori distribution of the impulse response. The guarantees for spline estimators depend heavily on the stability of the system and require a small spectral radius ρ (see Theorem 7.11 of [32]). This is unlike our bounds, where the sample size depends only on the true order of the system regardless of the stability ρ. We also highlight that [34] provides a further survey on kernel/spline methods and extensively compares kernel, atomic norm, and nuclear norm approaches for system identification. In Section VI, we find that nuclear norm regularization outperforms least-squares methods for systems with slow impulse response decay. This finding is consistent with the guarantees on spline estimators, which require a small spectral radius and fast decay.
Hankel nuclear norm regularization: These works focus on recovering a low-order system, in other words, a system with a low-rank Hankel matrix. Low-rank Hankel matrices arise in a range of applications [2], [35], [36], [37]. Nuclear norm regularization is a well-known tool for finding low-rank matrices [38], [39], [40], and [41] uses it for recovering a low-rank Hankel matrix. Traditional methods such as subspace methods can synergistically be applied together with regularization to further boost performance [27], [42], [43], [44], [45], [46], [47], [48], [49], [50]. The prior works on HNN regularization emphasize efficient optimization algorithms and do not provide statistical bounds. More recently, [2] theoretically shows that a low-order SISO system can be recovered from multiple trajectories of input-output pairs via the HNN approach. However, they provide pessimistic noise-robustness guarantees that are agnostic to the sample size (unlike our optimal O(1/√T) rates). [50] provides a thorough analysis of Hankel nuclear norm regularization applied to system identification, including a discussion of proper error metrics, the role of the rank/system order in formulating the problem, and the selection of tuning parameters. Besides nuclear norm regularization, the singular value shrinkage method can also be applied for low-rank matrix recovery [51], [52], which is anticipated to have similar benefits for finite-sample system identification.

C. ROADMAP
The rest of the paper is organized as follows. The next section introduces the technical setup. Section III-A presents our results on nuclear norm regularization, and Section III-B discusses the role of the input distribution and establishes lower bounds. Section IV provides our results on the least-squares estimator. Section V discusses model selection algorithms. Finally, Section VI presents the numerical experiments.

II. PROBLEM SETUP AND ALGORITHMS
Consider a linear time-invariant system of order R with the minimal state-space representation

x_{t+1} = A x_t + B u_t,   y_t = C x_t + D u_t + z_t,

where x_t ∈ R^R is the state, u_t ∈ R^p is the input, y_t ∈ R^m is the output, z_t ∈ R^m is the output noise, A ∈ R^{R×R}, B ∈ R^{R×p}, C ∈ R^{m×R}, D ∈ R^{m×p} are the system parameters, and x_0 is the initial state. We denote by ρ(A) the spectral radius of A. In general, for the same input-output behavior, the dimension of the hidden state x_t can be any number no less than R.
In this paper we are interested in the minimum dimensional representation (i.e., minimal realization). The goal of system identification is to find the system parameters, such as A, B, C, D matrices or impulse response, given input and output observations. If (C, D) = (I, 0), we directly observe the state.
The state evolves as x_{t+1} = A x_t + η_t, where η_t is white noise that provides excitation to the states [19], [20].
When we do not directly observe the state x_t (the hidden-state setting), one only has access to u_t and y_t and lacks full information on x_t. We instead recover the impulse response (also known as the Markov parameters), the sequence D, CB, CAB, CA²B, ..., which uniquely identifies the end-to-end behavior of the system. The impulse response can have infinite length, and we let h = [D, CB, CAB, CA²B, ..., CA^{2n−3}B] denote its first 2n − 1 entries, which can later be placed into an n × n block Hankel matrix. Without knowing the system order, we consider recovering these first 2n − 1 terms of h, where n is larger than the system order R. To this end, let us also define the block Hankel map H : R^{m×(2n−1)p} → R^{mn×pn} that arranges h into the n × n block matrix whose (i, j)-th m × p block is the (i + j − 1)-th block of h. If n ≥ R, the Hankel matrix H(h) is of rank R regardless of n [53, Sec. 5.5]. Specifically, we will assume that R is small, so the Hankel matrix is low rank. Throughout, we estimate the first 2n − 1 terms of the impulse response denoted by h. The system is excited by an input u over the time interval [0, t] and the output y is measured at time t, i.e., y_t = D u_t + Σ_{k=1}^{t} C A^{k−1} B u_{t−k} + z_t, where we assumed x_0 = 0. If the system has spectral radius ρ < 1, the effect of x_0 at time N scales as ‖x_0‖ρ^N, which is small and thus does not harm our results. The case where the system is close to marginally stable is discussed thoroughly in Section VI-A.
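As a concrete illustration of the Hankel map and the rank property above, here is a minimal numpy sketch for the SISO case (m = p = 1); the function name hankel_map is ours, not the paper's:

```python
import numpy as np

def hankel_map(h, n):
    """Arrange the first 2n-1 impulse-response terms h = [h_1, ..., h_{2n-1}]
    into an n x n Hankel matrix whose (i, j) entry is h_{i+j-1} (SISO case)."""
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = h[i + j]
    return H

# Impulse response of an order-1 system (scalar A = 0.9): h_t = 0.9^t
n = 8
t = np.arange(2 * n - 1)
h = 0.9 ** t
H = hankel_map(h, n)
print(np.linalg.matrix_rank(H))   # rank equals the system order R = 1
```

The rank stays 1 no matter how large n is chosen, matching the statement that rank(H(h)) = R for any n ≥ R.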
Performance criteria for system identification: To explain our contributions, we introduce common performance metrics. To ensure a standardized comparison, we define the sample complexity to be the number of equations (equality constraints in the variables h_t) used in the problem formulation, which is the same as the number of observed outputs (see Fig. 2 and Section II). With this, we explore the following performance metrics for learning the system from T output measurements.
• Sample complexity: The minimum sample size T for recovering the system parameters with noise robustness. This quantity is lower bounded by the system order R, which can be seen as the "degrees of freedom" of the system.
• Impulse Response (IR) Estimation Error: The Frobenius norm error ‖ĥ − h‖_F of the IR. A good estimate of the IR enables accurate prediction of the system output.
• Hankel Estimation Error: The spectral norm error ‖H(ĥ − h)‖ of the Hankel matrix. This metric is particularly important for system identification, as described below.
The Hankel spectral norm error is a critical quantity for several reasons. First, it connects to the H∞ estimation of the system [54]. Second, bounding this error allows for robustly finding balanced realizations of the system; for example, the error in reconstructing the state-space matrices (A, B, C, D) via the Ho-Kalman procedure is bounded by the Hankel spectral error. Finally, it is beneficial in model selection, as a small spectral error helps distinguish the true singular values of the system from the spurious ones caused by estimation error. Indeed, as illustrated in the experiments, the Hankel singular value gap of the solution of the regularized algorithm is more visible compared to least-squares, which aids in identifying the true order of the system, as explored in Section VI.
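The order-detection idea based on the singular value gap can be sketched as follows; this is an illustrative heuristic of ours, not the paper's exact procedure:

```python
import numpy as np

def detect_order(H_est, max_order=None):
    """Estimate the system order as the index of the largest relative gap
    in the singular values of the estimated Hankel matrix."""
    s = np.linalg.svd(H_est, compute_uv=False)
    if max_order is None:
        max_order = len(s) - 1
    gaps = s[:max_order] / np.maximum(s[1:max_order + 1], 1e-12)
    return int(np.argmax(gaps)) + 1

# Noisy rank-2 Hankel matrix: the order is recovered from the gap
rng = np.random.default_rng(0)
n = 10
t = np.arange(2 * n - 1)
h = 0.9 ** t + (-0.5) ** t          # order-2 impulse response
H = np.array([[h[i + j] for j in range(n)] for i in range(n)])
H_noisy = H + 0.01 * rng.standard_normal((n, n))
print(detect_order(H_noisy))
```

A small spectral estimation error keeps the spurious singular values well below the true ones, so the largest gap sits at the true order.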
Algorithms: Hankel-regularization & OLS. In our analysis, we consider a multiple-rollout setup where we measure the system dynamics with T separate rollouts. For each rollout, the input sequence is u^{(i)} = [u^{(i)}_{2n−1}, ..., u^{(i)}_1] ∈ R^{(2n−1)p}, and we measure the system output at time 2n − 1. Note that the i-th output at time 2n − 1 is simply h^T u^{(i)}. Define Ū ∈ R^{T×(2n−1)p} whose i-th row is u^{(i)}, and let y ∈ R^{T×m} denote the corresponding observed outputs. Hankel-regularization refers to the nuclear norm regularized problem

ĥ = arg min_h (1/2)‖y − Ū h‖_F² + λ‖H(h)‖_*.   (HNN)
Finally, setting λ = 0, we obtain the special case of ordinary least squares (OLS). Data acquisition: Generally there are several rounds (the i-th round is denoted with superscript (i) in Fig. 2) of inputs sent into the system, and the output can be collected or neglected at arbitrary times. In the setting that we refer to as "multi-rollout" (Fig. 2(b)), for each input signal u^{(i)} we take only one output measurement y_t at time t = 2n − 1 and then the system is restarted with a new input. Here the sample complexity is T, the number of output measurements as well as the number of input rounds. Recent papers (e.g., [4] and [5]) use the "single-rollout" model (Fig. 2(c)), where we apply an input signal from time 1 to T + 2n − 2 without restart and collect all outputs from time 2n − 1 to T + 2n − 2, in total T output measurements; we use this model in the numerical experiments in Section VI. We consider two estimators in this paper: the nuclear norm regularized estimator and the least squares estimator defined later.
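To make the multi-rollout data model and the regularized program concrete, the following is a heuristic numpy sketch for the SISO case. It replaces the exact convex solver with a proximal-style iteration (gradient step, singular value thresholding, anti-diagonal averaging); the paper's guarantees concern the exact program, and all function names here are ours:

```python
import numpy as np

def hankel(h, n):
    return np.array([[h[i + j] for j in range(n)] for i in range(n)])

def dehankel_avg(H):
    """Average the anti-diagonals of an n x n matrix back into a length-(2n-1) vector."""
    n = H.shape[0]
    g = np.zeros(2 * n - 1)
    for k in range(2 * n - 1):
        idx = [(i, k - i) for i in range(max(0, k - n + 1), min(n, k + 1))]
        g[k] = np.mean([H[i, j] for i, j in idx])
    return g

def hnn_heuristic(U, y, n, lam, iters=500):
    """Approximately minimize 0.5*||y - U h||^2 + lam*||H(h)||_* :
    gradient step on the quadratic loss, then singular value thresholding
    on the Hankel matrix followed by anti-diagonal averaging (a heuristic
    surrogate for the exact proximal step)."""
    step = 1.0 / np.linalg.norm(U, 2) ** 2
    h = np.zeros(U.shape[1])
    for _ in range(iters):
        h = h - step * U.T @ (U @ h - y)
        W, s, Vh = np.linalg.svd(hankel(h, n))
        s = np.maximum(s - step * lam, 0.0)        # singular value thresholding
        h = dehankel_avg((W * s) @ Vh)
    return h

# Multi-rollout data: T rollouts, each output y_i = <h, u_i> + noise
rng = np.random.default_rng(1)
n, T = 10, 60
h_true = 0.9 ** np.arange(2 * n - 1)               # order-1 system
U = rng.standard_normal((T, 2 * n - 1))
y = U @ h_true + 0.05 * rng.standard_normal(T)
h_hat = hnn_heuristic(U, y, n, lam=0.5)
print(np.linalg.norm(h_hat - h_true))              # small estimation error
```

In practice one would solve (HNN) with a convex solver; the sketch only illustrates how the rollout data enter the objective.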
We will bound the various error metrics mentioned earlier in terms of the sample complexity T, the true system order R, the dimension of the impulse response n ≫ R, and the signal-to-noise ratio (SNR), defined as snr = E[‖u‖²/n] / E[‖z‖²]. Table 3 provides a summary and comparison of these bounds. All bounds are order-wise and hide constants and log factors. We can see that, with nuclear norm regularization, our paper matches the least-squares impulse response and Hankel spectral error bounds while the sample complexity can be as small as O(R²), and we can recover the impulse response with a guaranteed (suboptimal) error when the sample complexity is O(R). Our least-squares error bound matches the best error bounds among [4] and [5], which is proven optimal for least squares.
Next, we discuss the design of the input signal and introduce input shaping matrix.
Input shaping: Note that the H operator does not preserve the Euclidean norm, so [2] proposes using a normalized operator G, defined via weights K_j: let K ∈ R^{(2n−1)p×(2n−1)p} be a block-diagonal matrix whose j-th diagonal block of size p × p is equal to K_j I_{p×p}.

[Table 3 caption: LS-IR and LS-Hankel stand for least-squares regression on the impulse response and on the Hankel matrix, respectively.]
In other words, using this change of variables and letting U = Ū K^{−1}, problem (HNN) can be written as an equivalent regression in U and the scaled impulse response Kh, with the normalized operator G in place of H.
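The block-diagonal shaping matrix can be sketched as follows. Note that the exact weights K_j come from [2] and are not reproduced in this excerpt; the multiplicity-based choice K_j = sqrt(min(j, 2n − j)) below, the number of times the j-th impulse-response block appears in the n × n Hankel matrix, is purely an illustrative stand-in:

```python
import numpy as np

def shaping_matrix(n, p):
    """Block-diagonal shaping matrix K in R^{(2n-1)p x (2n-1)p}; the j-th
    p x p diagonal block is K_j * I_p.  NOTE: the weights K_j used here,
    K_j = sqrt(min(j, 2n - j)), are an assumed stand-in for the exact
    weights of [2] (the multiplicity of entry j in the Hankel matrix)."""
    weights = [np.sqrt(min(j, 2 * n - j)) for j in range(1, 2 * n)]
    return np.kron(np.diag(weights), np.eye(p))

n, p, T = 5, 2, 12
rng = np.random.default_rng(2)
K = shaping_matrix(n, p)
U = rng.standard_normal((T, (2 * n - 1) * p))   # i.i.d. scaled input U = U_bar K^{-1}
U_bar = U @ K                                    # shaped input actually applied to the system
print(U_bar.shape)
```

Drawing U i.i.d. and applying Ū = U K is exactly the shaped-input design assumed in Theorem 1 below.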

III. HANKEL NUCLEAR NORM REGULARIZATION
To promote a low-rank Hankel matrix, we add nuclear norm regularization to the quadratic-loss objective and solve the regularized regression problem. Here we give a finite sample analysis for the recovery of the Hankel matrix and the impulse response found via this approach. We consider a random input matrix Ū and observe the corresponding noisy output vector y as in (4).

A. STATISTICAL GUARANTEES FOR HANKEL-REGULARIZATION
The theorem below shows that Hankel-regularization achieves near-optimal sample complexity with similarly strong estimation error rates that decay as 1/√T. For simplicity, we assume the initial state to be x_0 = 0. For stable systems, it is known that the impact of the initial state decays exponentially in the sampling time 2n − 1.
Theorem 1: Consider the problem (HNN) in the MISO (multi-input single-output) setting (m = 1, p input channels). Suppose the system order is R, Ū ∈ R^{T×(2n−1)p}, each row consists of an input rollout u^{(i)} ∈ R^{(2n−1)p}, and the scaled input U = Ū K^{−1} has i.i.d. Gaussian entries. Let snr = E[‖u‖²/n]/E[‖z‖²] and σ = 1/√snr. Then, for a suitable choice of λ, the estimation error bound (7) holds with R̃ = min(pR² log²(n), n).
Theorem 1 jointly bounds the impulse response and Hankel spectral errors of the system under mild conditions. We highlight the improvements that our bounds provide: When the system is low order, the sample complexity T is logarithmic in n and improves upon the O(n) bound of the least-squares algorithm. The number of samples for recovering an order-R system is O(R log(n)); the SISO case is also proven in [2], and we follow a similar proof. The error rate with respect to the system parameters n, R, T is the same as in [4], [5], [6] (e.g., compare to Thm. 3). We can choose λ within a constant multiplicative range, e.g., λ ∈ [Cσ√(pnT) log(n), 2Cσ√(pnT) log(n)] for a positive constant C. When λ is in this range, there is always a sample complexity T such that (7) holds. This flexibility makes tuning λ for Algorithm 1 easier.
The regularized method also has the intrinsic advantage that it does not require knowledge of the rank or the singular values of the Hankel matrix beforehand. The tuning method in Algorithm 1 and the numerical experiments on real data in Section VI demonstrate the performance and robustness of the regularized method.

B. SAMPLE COMPLEXITY LOWER BOUNDS FOR IID INPUTS AND THE IMPORTANCE OF INPUT SHAPE
Theorem 1 uses shaped inputs, whereas, in practice, one might expect that an i.i.d. input sequence (without the shaping matrix K) should be sufficient for recovering the impulse response with near-optimal sample size. For instance, [4] proves an optimal sample complexity bound for system identification via least squares and i.i.d. standard normal inputs. Naturally, we ask: Does Hankel-regularization enjoy similar performance guarantees with i.i.d. inputs? Do we really need input design?
The following theorem proves that for a special system of order 1, the sample complexity of the problem under i.i.d. input is no less than n^{1/3}, compared to O(log n) under the shaped input setting. This is accomplished by carefully lower bounding the Gaussian width induced by the Hankel-regularization. The Gaussian width directly corresponds to the square root of the sample complexity required for recovery with high probability [55, Thm. 1]. Thus, Theorem 2 shows that the sample complexity with i.i.d. input can indeed be provably larger than with shaped input.
Theorem 2: Suppose the impulse response h of the system is h_t = 1, ∀t ≥ 1, which is of order 1. Consider the tangent cone associated with Hankel-regularization at h, intersected with the unit ℓ₂ sphere. The Gaussian width of this set is lower bounded by (1/4) n^{1/6} when n ≥ 2.
This implies that, in the noiseless setting, the sample complexity to recover the impulse response is T ≳ n^{1/3}, which is larger than the log n dependence with shaped input. This result is rather counter-intuitive since i.i.d. inputs are often optimal for structured parameter estimation tasks (e.g., compressed sensing, low-rank matrix estimation). In contrast, our result shows the provable benefit of input shaping.

IV. REFINED BOUNDS FOR THE LEAST-SQUARES ESTIMATOR
To better understand the Hankel-regularization bound of Theorem 1, one can contrast it with the performance of unregularized least squares. Interestingly, the existing least-squares bounds are not tight when it comes to the Hankel spectral norm. In this section, we revisit the least-squares estimator, tighten the existing bounds, and contrast the result with our Theorem 1.
We consider the MIMO setup where y ∈ R^{T×m} and h ∈ R^{(2n−1)p×m}. This estimator is obtained by setting λ = 0 in (HNN); hence it is given via the pseudo-inverse

ĥ = Ū† y.   (8)

Using the fact that the rows of the Hankel matrix are subsets of the IR sequence, we always have the inequality

‖ĥ − h‖_F ≲ ‖H(ĥ − h)‖ ≤ √n ‖ĥ − h‖_F.   (9)

Observe that there is a factor of √n gap between the left-hand and right-hand sides. We show that the left-hand side is typically the tighter one, thus ‖ĥ − h‖_F ∼ ‖H(ĥ − h)‖. The next theorem formalizes this bound when the inputs and noise are randomly generated.
Theorem 3: Denote the solution to (8) as ĥ. Let Ū ∈ R^{T×(2n−1)p} be the input matrix obtained from multiple rollouts, with i.i.d. standard normal entries, let y ∈ R^{T×m} be the corresponding outputs, and let z ∈ R^{T×m} be the noise matrix with i.i.d. N(0, σ_z²) entries. Then the spectral norm error obeys ‖H(ĥ − h)‖ ≲ σ_z √(mnp/T) log(np) when T ≳ mnp.
This theorem improves the spectral norm bound compared to [4], which naively bounds the spectral norm in terms of the IR error using the right-hand side of (9). Instead, we show that the spectral error is the same as the IR error up to a log factor (when there is only output noise). We remark that O(σ_z √(np/T)) is a tight lower bound for ‖H(h − ĥ)‖ as well as for ‖h − ĥ‖ [4], [56]. The proof of the theorem above is in Appendix A5. Note that we apply i.i.d. input here for least squares, which differs from the regularized algorithm. Least squares works with at least O(n) samples, whereas the sample complexity of the regularized algorithm in Theorem 1 is O(R).
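A small numpy sketch of the least-squares estimator (8) in the multi-rollout setting with i.i.d. inputs, comparing the IR (Frobenius) error with the Hankel spectral error (SISO case, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, sigma_z = 12, 200, 0.1
h_true = 0.8 ** np.arange(2 * n - 1)             # SISO, order-1 impulse response

U_bar = rng.standard_normal((T, 2 * n - 1))      # i.i.d. standard normal inputs
y = U_bar @ h_true + sigma_z * rng.standard_normal(T)

h_hat = np.linalg.pinv(U_bar) @ y                # OLS via the pseudo-inverse

def hankel(g, n):
    return np.array([[g[i + j] for j in range(n)] for i in range(n)])

err_f = np.linalg.norm(h_hat - h_true)                   # IR (Frobenius) error
err_sp = np.linalg.norm(hankel(h_hat - h_true, n), 2)    # Hankel spectral error
print(err_f, err_sp)   # the two errors are comparable, far from the sqrt(n) worst case
```

With random noise, the spectral error tracks the IR error rather than the √n-inflated upper bound, which is the phenomenon Theorem 3 formalizes.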

V. MODEL SELECTION FOR HANKEL-REGULARIZED SYSTEM ID
In Thm. 1, we established the recovery error of the system's impulse response for a particular parameter choice λ, which depends on the noise level σ. In practice, we do not know the noise level, and the optimal λ choice is data-dependent. Thus, given a set of parameter candidates Λ ⊂ R₊, one can evaluate each λ ∈ Λ and minimize the validation error to perform model selection.
To cover model selection, that is, selecting the model with the correct λ, the end-to-end algorithm contains training, validation, and selection procedures: 1) Training: For every λ ∈ Λ, we run (HNN) on training samples containing T rollouts and record the solution ĥ_λ. 2) Validation: The validation dataset contains T_val rollouts. We collect all ĥ_λ and check their validation errors. 3) Selection: Choose the ĥ_λ associated with the smallest validation error. In Appendix G, Algorithm 1 summarizes our training and validation procedure, where |Λ| denotes the cardinality of Λ. The theorem below states the performance guarantee for this algorithm.
Theorem 4: Consider the setting of Theorem 1. Sample T i.i.d. training rollouts (U, y) and T_val i.i.d. validation rollouts (U_val, y_val). Set λ* = Cσ√(pnT) log(n), which is the choice in Thm. 1. Fix a failure probability P ∈ (0, 1). Suppose that: (a) there is a candidate λ̄ ∈ Λ obeying λ*/2 ≤ λ̄ ≤ 2λ*; (b) the validation set obeys T_val ≳ (T log²(|Λ|/P) / (R log²(n)))^{1/3}. Set R̃ = min(R², n). With probability at least 1 − P, Algorithm 1 achieves an estimation error equivalent to (7). In a nutshell, this result shows that as long as the candidate set contains a reasonable hyperparameter (Condition (a)), using little validation data (Condition (b)), one can compete with the hindsight parameter choice of Theorem 1. T_val depends mildly on T and scales only poly-logarithmically in |Λ|, which is consistent with the model selection literature [57].
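The training-validation-selection loop can be sketched generically as follows; for brevity, ridge regression stands in for the (HNN) training step, so this is an illustration of the selection logic only, with function names of our own choosing:

```python
import numpy as np

def select_lambda(solver, lambdas, U_tr, y_tr, U_val, y_val):
    """Train-validate-select: fit h_lambda on the training rollouts for every
    candidate lambda, then return the candidate with smallest validation error."""
    best = None
    for lam in lambdas:
        h_lam = solver(U_tr, y_tr, lam)
        val_err = np.linalg.norm(y_val - U_val @ h_lam)
        if best is None or val_err < best[0]:
            best = (val_err, lam, h_lam)
    return best[1], best[2]

# Demo with ridge regression standing in for the (HNN) training step
def ridge(U, y, lam):
    d = U.shape[1]
    return np.linalg.solve(U.T @ U + lam * np.eye(d), U.T @ y)

rng = np.random.default_rng(4)
d, T, T_val = 19, 40, 20
h_true = 0.9 ** np.arange(d)
U_tr = rng.standard_normal((T, d))
y_tr = U_tr @ h_true + 0.1 * rng.standard_normal(T)
U_val = rng.standard_normal((T_val, d))
y_val = U_val @ h_true + 0.1 * rng.standard_normal(T_val)
lam_star, h_hat = select_lambda(ridge, [0.01, 0.1, 1.0, 10.0, 100.0],
                                U_tr, y_tr, U_val, y_val)
print(lam_star, np.linalg.norm(h_hat - h_true))
```

The validation rollouts only enter through scalar prediction errors, which is why T_val can be much smaller than T in Condition (b).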
In Algorithm 1, we fix the size of the Hankel matrix (which is usually large/overparameterized) and tune λ > 0. In contrast to this, for OLS and the least-squares procedure of [5], model selection is accomplished by varying the size of the Hankel matrix. The next section provides experiments and contrasts these methods and provides insights on how regularization can improve over least-squares for certain class of dynamical systems.
Comparison of the sample complexity of the regularized algorithm and unregularized least squares: model selection with data requested online. Algorithm 1 uses static data for training and validation; that is, the total of T + T_val samples is given and fixed, and we split the data and run Algorithm 1. We denote the total sample complexity T_tot = T + T_val. To be fully efficient in sample complexity, we can start from T_tot = 0, keep requesting new samples (i.e., increasing T_tot), and run Algorithm 1 for each T_tot. When the validation error is small enough (which happens when T_tot ≳ R), we know the algorithm recovers an impulse response estimate with the error in Theorem 4, and we can terminate the algorithm.
We compare this with the model selection algorithm in [5] for the least-squares estimator and find that it does not terminate until T_tot ≳ n. For least squares, the parameter to be tuned is the dimension of the variable, i.e., we vary the length of the estimated impulse response. We call the tuning variable n_t, which is upper bounded by n. We keep increasing T_tot and train by varying n_t ∈ [1, T_tot/2] (so the least-squares problems remain overdetermined). The output y is collected at time 2n_t − 1. We consider two impulse responses truncated at length n: h¹ = 1_n (order 1) and h^{n₁} = [1_{n₁}; 0_{n−n₁}] (order n₁). As long as T_tot < n₁, one cannot differentiate h¹ and h^{n₁}, because y is collected at time T_tot and the (T_tot + 1)-th to the n-th terms of h¹ and h^{n₁} do not contribute to y. Even if the system is order 1, one does not know it and cannot terminate the algorithm. Thus the tuning algorithm in [5] requires T_tot ≳ n. This does not happen with regularization, because in Algorithm 1 we always collect y at time n rather than at time 2n_t − 1, so the algorithm always detects the difference between h¹ and h^{n₁} after time n₁.

A. EXPERIMENTS WITH SYNTHETIC DATA
When does regularization beat least-squares? Low-order slow-decay systems (Fig. 3): We use an experiment with synthetic data to answer this question. So far, we showed that for a fixed Hankel size, nuclear norm regularization requires less data than unregularized least-squares, especially when the Hankel size is set to be large (Table 3). However, for least squares, one can choose a smaller Hankel size with n ≈ R, so that we solve a problem of small dimension compared to n ≫ R. We ask if there is a scenario in which fine-tuned nuclear norm regularization strictly outperforms fine-tuned least-squares.
In what follows, we discuss a single-trajectory scenario. An advantage of Hankel-regularization is that it addresses scenarios which benefit from large n while both the sample complexity T and the system order R are small. Given a choice of n, in a single-trajectory setting, least-squares suffers an error of order ρ(A)^{2n}(1 − ρ(A)^n)^{−1} (Thm 3.1 of [4]). This error arises from FIR truncation of the impulse response (to n terms) and occurs for both regularized and unregularized algorithms. In essence, due to FIR truncation, the problem effectively incurs an output noise of order ρ(A)^{2n}(1 − ρ(A)^n)^{−1}. Thus, if the system decays slowly, i.e., ρ(A) ≈ 1, we suffer from significant truncation error.
As an example, suppose the sample size is T = 40 and ρ(A) = 0.98. If the problem is kept overdetermined (i.e., n < 40), then n will not be large enough to make the truncation error ρ(A)^{2n}(1 − ρ(A)^n)^{−1} ≈ 0.36 vanishingly small. In contrast, Hankel-regularization can intuitively recover such slowly-decaying systems by choosing a large Hankel size n, as its sample complexity is (mostly) independent of n (as in the setting of Theorem 1). This motivates us to compare the performance on recovering low-order slow-decay systems. In Fig. 3, we set up an order-1 system with a pole at 0.98 and generate single-rollout data of size 40. We tune λ when applying the regularized algorithm with n = 45 (as it is safe to choose a large Hankel dimension), whereas for the unregularized method, the Hankel size cannot be larger than 20 (an n × n Hankel matrix has 2n − 1 parameters and we need least squares to remain overdetermined). With these in mind, the first two figures show that the best validation error of the regularized algorithm is 0.44, which is significantly smaller than the unregularized validation error of 0.73. In the third figure, we use regularized least squares with n = 20, which also incurs large truncation error (due to large ρ(A)) compared to the initial choice n = 45 (the first figure). In this case, the best validation error is 0.56, again noticeably worse than the 0.44 error in the first figure.
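The truncation-error expression is quick to check numerically (numbers as in the discussion above):

```python
import numpy as np

def truncation_error(rho, n):
    """FIR truncation error proxy: rho^(2n) * (1 - rho^n)^(-1)."""
    return rho ** (2 * n) / (1.0 - rho ** n)

# For a slow pole rho = 0.98, the error decays very slowly in n:
errs = {n: truncation_error(0.98, n) for n in (10, 20, 40, 80)}
```

At n = 40 the expression evaluates to roughly 0.36, matching the figure quoted in the text, and it only becomes small for substantially larger n.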
When the number of variables 2n − 1 is larger than T, the system identification problem is overparameterized and there can be infinitely many impulse responses that achieve zero squared loss on the training dataset. This happens in the first figure when λ → 0, and in the second figure when n is large. In this case, the regularized algorithm chooses the solution with the smallest Hankel nuclear norm and least squares chooses the one with the smallest ℓ2 norm. The first figure has smaller validation error as 1/λ tends to infinity. We verified that, when the regularization weight is 0, the setup in the first figure achieves validation error 0.58, which is better than least squares for n = 45 with validation error 1.60. So among the solutions that overfit the training dataset, the one with small Hankel nuclear norm has better generalization performance when the true system is low-order.
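The overparameterized regime above is easy to reproduce: with T < 2n − 1 noiseless observations, both the true low-order impulse response and the minimum-ℓ2-norm solution interpolate the training data exactly, yet only the former has a (numerically) rank-1 Hankel matrix. A small numpy sketch with sizes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
d = 2 * n - 1
T = 25                                  # T < d: overparameterized

h_true = 0.9 ** np.arange(d)            # order-1 system: rank-1 Hankel matrix
U = rng.standard_normal((T, d))
y = U @ h_true                          # noiseless training data

# Minimum-l2-norm interpolant, as chosen by unregularized least squares.
h_min = np.linalg.pinv(U) @ y

idx = np.arange(n)[:, None] + np.arange(n)[None, :]
s_true = np.linalg.svd(h_true[idx], compute_uv=False)
s_min = np.linalg.svd(h_min[idx], compute_uv=False)

# Both interpolate the training data ...
res_true = np.linalg.norm(U @ h_true - y)
res_min = np.linalg.norm(U @ h_min - y)
# ... but the Hankel spectra of the two interpolants differ sharply.
```

The min-ℓ2-norm solution has the smaller vector norm by construction, while its Hankel matrix is generically full-rank; this is exactly the distinction the validation comparison above exploits.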

B. EXPERIMENTS WITH DAISY DATASET
Our experiment uses the DaISy dataset [7], where a known (non-random) input signal is applied and the resulting noisy output trajectory is measured. Using the input and output matrices, we solve the optimization problem (HNN) with single-trajectory data. While the input model is a single rollout rather than multiple rollouts, the experiments will demonstrate the advantage of Hankel-regularization over least-squares in terms of sample complexity, singular value gap, and ease of tuning.
Large data regime: Both Hankel and least-squares algorithms work well (Fig. 4): The first two figures in Fig. 4 show the training and validation errors of the Hankel-regularized and unregularized methods as functions of the hyperparameters λ and n, respectively. We then choose the best system by tuning the hyperparameters to achieve the smallest validation error. Fig. 4 also plots the training and validation output sequences of the dataset for these algorithms. We see that with sufficient sample size, the system is recovered well. However, the validation error is flatter as a function of 1/λ (first figure) whereas it is sensitive to the choice of n (second figure); thus λ is easier to tune than n. If the hyperparameter is very sensitive, it (intuitively) indicates that the algorithm is less robust, because small variations in the problem variables can seriously hurt the performance. In the machine learning literature, this is also known as sharpness [58].
Small data regime: Hankel-regularization succeeds while least-squares may fail due to overfitting (Fig. 5): The first two figures in Fig. 5 show that the Hankel spectra of the two algorithms have a notable difference: the system recovered by Hankel-regularization is low-order and has a larger singular value gap. The last two figures in Fig. 5 show the advantage of regularization through much better validation performance. As expected from our theory, the difference is most visible at small sample size (this experiment uses 50 training samples). When the number of observations T is small, Hankel-regularization still returns a solution close to the true system while least-squares cannot recover the system properly. We also remark that nuclear norm regularized least squares has been empirically compared to subspace-based methods such as the Matlab implementation N4SID; for example, see [45], which solves the Hankel nuclear-norm regularized least squares problem with interior-point methods, and [47], which solves the same problem with first-order optimization algorithms. Both papers observe empirical benefits of nuclear norm regularization over standard subspace system ID in single-rollout experiments on DaISy benchmark data [7].
Identifying a linear approximation of a nonlinear system with few data samples (Fig. 6): Finally, we show that Hankel-regularization can also identify a stable nonlinear system via its linearized approximation. We consider the inverted pendulum as the experimental environment. First we use a linearized controller to stabilize the system around the equilibrium, and apply a single-rollout input to the closed-loop system, which is i.i.d. random input of dimension 1. The state has dimension 4, and we observe a 1-dimensional output, the displacement of the system. We then use Hankel-regularization and least-squares to estimate the closed-loop system with a linear model and predict the trajectory using the estimated impulse response. We use T = 16 observations for training, and set the dimension to n = 45. Fig. 6 shows the singular values and estimated trajectories of the two methods. Despite the nonlinearity of the ground-truth system, the regularized algorithm finds a linear model of order 6 whose predicted output has small error, while the correct order is not visible in the singular value spectrum of unregularized least-squares.
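The order-from-spectrum readout used throughout these experiments can be sketched directly: form the Hankel matrix of a (noisy) impulse response and pick the index of the largest consecutive singular-value ratio. The toy system below (two real modes with a tiny noise floor) is our own choice, not the pendulum data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
k = np.arange(2 * n - 1)
h = 0.9 ** k + (-0.5) ** k            # order-2 impulse response: rank-2 Hankel
h_noisy = h + 1e-6 * rng.standard_normal(h.shape)

idx = np.arange(n)[:, None] + np.arange(n)[None, :]
s = np.linalg.svd(h_noisy[idx], compute_uv=False)

# Estimated order = position of the largest gap ratio in the spectrum.
order_hat = int(np.argmax(s[:-1] / s[1:])) + 1
```

A clear gap (here between the 2nd and 3rd singular values) makes the order easy to read off; the experiments show HNN regularization produces such a gap while unregularized least-squares often does not.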

VII. FUTURE DIRECTIONS
This paper established new sample complexity and estimation error bounds for system identification. We showed that nuclear norm penalization works well with small sample size regardless of the misspecification of the problem (i.e., fitting an impulse response of much larger length than the true order). For least-squares, we provided the first guarantee that is optimal in sample complexity and Hankel spectral norm error. These results can be refined in several directions. In the proof of Theorem 1, we used a weighted Hankel operator; we expect that directly computing the Gaussian width of the original Hankel operator will also lead to improvements over least-squares. We also hope to extend the results to account for single-trajectory analysis and process noise. In both cases, an accurate analysis of the regularized problem will likely lead to new algorithmic insights.

A. SAMPLE COMPLEXITY FOR MISO AND MIMO PROBLEMS
This section establishes sample complexity bounds for MISO and MIMO systems. The technical argument builds on [2] and extends their results from the SISO case to MISO. We consider recovering a MISO system impulse response. The system is given in (2), with output size m = 1 and system order R. For the multi-rollout case, we only observe the output at time 2n − 1 and set u_{2n} = 0. Denote the impulse response by h ∈ R^{p(2n−1)}, a block vector with blocks h^{(i)} ∈ R^p. Let β ∈ R^{p(2n−1)} be a weighted version of h, with blocks β^{(i)} = K_i h^{(i)} and β = [β^{(1)}; β^{(2)}; …; β^{(2n−1)}]. Define the reweighted Hankel map G for the same h, and let G* be the adjoint of G. We define the rollout inputs u_1, …, u_{2n−1} as independent Gaussian vectors. Now let U ∈ R^{T×p(2n−1)} with i.i.d. standard Gaussian entries, and let y ∈ R^T be the concatenation of outputs, where y_i ∈ R^m is defined in (13). We consider the estimation problem (15), where the norm of the overall (state and output) noise is bounded by δ. The following theorem generalizes the result of [2] from the SISO case to the MISO case.
Theorem 5: Let β be the true impulse response. If the number of output observations satisfies T ≥ C(√(pR) log n + t)² for some constant C, the solution β̂ to (15) satisfies ‖β̂ − β‖_2 ≤ 2δ/t with probability at least 1 − e^{−t²/2}. When the system output is y = Uβ + z and z is i.i.d. Gaussian noise with variance σ_z², we have ‖β̂ − β‖_2 ≲ (√(pR) + t)σ_z log n with high probability ([59, Thm 1]). This theorem says that when the input dimension is p, the sample complexity scales as pR log² n, i.e., linearly in the input dimension p. The proof strongly depends on the following lemma [2], [60]: Lemma 1: Define the Gaussian width w(S) = E_g[sup_{v∈S} vᵀg], where g is a standard Gaussian vector. The Gaussian widths of the normal cones of (6) and (15) differ by at most a constant factor [61].
We present the proof in Appendix B. MIMO: For the MIMO case, let the output dimension be m. We treat each output channel as a system of order at most R and solve m problems; each problem has failure probability equal to (17), so by a union bound the total failure probability is m times that quantity. We also propose another approach to MIMO system identification. For each rollout of input data, the output is m-dimensional, but we keep one output channel per observation and discard the other m − 1 channels. We pick uniformly among the channels, obtaining T observations per channel and mT observations/input rollouts in total. In this case, when the sample complexity is m times that of the single-channel case, we can recover the impulse response with Frobenius norm error √m·δ/t with the corresponding probability.

B. GAUSSIAN WIDTH OF REGULARIZATION PROBLEM WITH MISO AND MIMO SYSTEM
The following theorems and lemmas parallel [2]; we present them here for completeness of the proof. Theorem 6: Let β be the true impulse response. If T ≥ C(√(pR) log n + t)² for some constant C, the solution β̂ to (15) satisfies ‖β̂ − β‖_2 ≤ 2δ/t with probability at least 1 − e^{−t²/2}. Proof: Let I(β) be the descent cone of ‖G(·)‖_* at β; we have the following lemma, proven in [2, Lemma 1]. To prove Theorem 6, we only need to lower bound the left-hand side via Lemma 2. The following lemma gives the probability that the left-hand side is lower bounded. Define the Gaussian width w(Ω) = E_g[sup_{γ∈Ω} γᵀg] (16), where g is a standard Gaussian vector. Let Ω = I(β) ∩ S, where S is the unit sphere. Now we need to study w(Ω).
Note that I*(β) is just the cone of subgradients of ‖G(β)‖_*. Let P_W be the projection operator onto the subspace spanned by matrices W with V_1ᵀW = 0 and WV_2 = 0, and let P_V be the projection onto its orthogonal complement.
We will upper bound it with the specific choice λ = ‖P_W(G(g))‖, W = P_W(G(g))/λ.
We bound the first term by noting that V_1 and V_2 span R-dimensional spaces, so V_1 ∈ R^{n×R} and V_2 ∈ R^{pn×R}. We know that if p = 1, then E‖G(g)‖ = O(log n). For general p, we rearrange the matrix into blocks, and the expectation of the operator norm of each block is O(log n). Then (noting that v below also has a block structure [v^{(1)}; …; v^{(n)}]) ‖Ḡ(g)‖ = ‖G(g)‖, so ‖G(g)‖ ≲ √p log n. Hence w(Ω) = C√(pR) log n. Going back to (17), we want the failure probability to be smaller than 1, which gives √T ≳ √(pR) log n. Finally, we provide a different version of Theorem 6. Theorem 6 in [2] works for adversarial noise with bounded norm; here we consider i.i.d. Gaussian noise, and use the result in [59] to establish the following theorem.
Remark 1: Since the power of U is n times that of Ū and the variance of U is 1/T, the effective noise level rescales accordingly (see also Remark 2).

C. PROOF OF REGULARIZATION ALGORITHM'S SPECTRAL NORM ERROR (THM. 1)
We will prove the first case of (7). The second case is a direct application of [2].
Remark 2: To be consistent with the main theorem in the paper, we need to find the relation between σ_ξ and the SNR, or σ. We make the following observations: (1) G(β̂ − β) = H(ĥ − h), so we are bounding the Hankel spectral norm error here; (2) each column of the input has unit norm, so each input entry is N(0, 1/T), and the average power of the input is 1/T; (3) because of the scaling matrix K, the actual input Ū has n times the power of the entries of U. Combining the above, σ_ξ = σ√(n/T), which results in ‖G(β̂ − β)‖ ≲ σ√(np/T) log n. Proof: We bound ‖G(β̂ − β)‖ by partitioning it into ‖G((I − U†U)(β̂ − β))‖ and ‖G(U†U(β̂ − β))‖. Since β̂ is the optimizer, we obtain a bound on Uᵀ(Uβ̂ − y). Combining (20) and (21) yields the claim. First we prove a result for later use.
This is an important result to note. Following it, let U be an i.i.d. Gaussian matrix with scaling E(UᵀU) = I.
Here we need to study the Gaussian width w(J(β)) of the normal cone of (19). [61] proves that, if (22) holds and λ ≥ 2‖G(Uᵀξ)‖, then the Gaussian width of this set (intersected with the unit ball) is at most 3 times the Gaussian width of the corresponding descent cone. Simple algebra gives max{Σ_i σ_i² : 0 < σ_i < σ, Σ_i σ_i = S} ≤ Sσ. So letting σ_i be the singular values of P_V G(δ) or P_W G(δ), and S = ‖P_V G(δ)‖_* or ‖P_W G(δ)‖_*, the second-to-last inequality comes from (23). To conclude, we need √T/w(J(β)) ≳ √R, where T ≳ pR² log² n [62, Thm 9.1.1]. This is still not tight in R, but O(min{n, R² log² n}) matches [4] and improves on [5], which are O(n) and O(n²) respectively. [62, Thm 9.1.1] is a bound in expectation, but it turns into a high-probability bound directly since the underlying quantity is nonnegative.

D. PROOF OF SUBOPTIMAL RECOVERY GUARANTEE WITH IID INPUT (THM. 2)
We consider the Gaussian width w(Ω) in this specific case.
Proof strategy: Based on the previous derivation, we focus on the case 1ᵀg ≤ 0. Denote z = λH*(V + W) − g, and let z_{1:k} be the first k entries of z. We prove that (1) λ ≤ ‖z‖_2/√n, and (2) ‖z_{1:1/λ}‖_2 ≳ λ^{−1/2}. Then ‖z‖_2 ≥ ‖z_{1:1/λ}‖_2 ≳ λ^{−1/2} ≥ (‖z‖_2/√n)^{−1/2}, which gives ‖z‖_2 ≳ n^{1/6}. Lemma 6: Let g be a standard Gaussian vector of size 2n − 1 conditioned on 1ᵀg ≤ 0. Let z = λH*(V + W) − g, where V = (1/n)11ᵀ, W1 = Wᵀ1 = 0, and ‖W‖ ≤ 1. Then λ ≤ ‖z‖_2/√n. We observe that 1ᵀH*(X) is the sum of all entries of X for any matrix X; thus 1ᵀH*(W) = 0 since W1 = 0. Conditioned on 1ᵀg ≤ 0, setting the two sides equal and plugging the result into the lower bound for T_val in (30), we get the bound in the main theorem.
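Lemma 6 rests on two elementary properties of the Hankel map: the adjoint identity ⟨H(h), X⟩ = ⟨h, H*(X)⟩ and the fact that 1ᵀH*(X) equals the sum of all entries of X. Both are easy to verify numerically with our own implementation of H and H*:

```python
import numpy as np

def hankel_map(h, n):
    """H(h): n x n Hankel matrix with H[i, j] = h[i + j]."""
    idx = np.arange(n)[:, None] + np.arange(n)[None, :]
    return h[idx]

def hankel_adjoint(X):
    """H*(X): sums X over anti-diagonals i + j = k, giving a length-(2n-1) vector."""
    n = X.shape[0]
    idx = (np.arange(n)[:, None] + np.arange(n)[None, :]).ravel()
    return np.bincount(idx, weights=X.ravel(), minlength=2 * n - 1)

rng = np.random.default_rng(4)
n = 8
h = rng.standard_normal(2 * n - 1)
X = rng.standard_normal((n, n))

inner_matrix = np.sum(hankel_map(h, n) * X)        # <H(h), X>
inner_vector = h @ hankel_adjoint(X)               # <h, H*(X)>
col_sum = np.ones(2 * n - 1) @ hankel_adjoint(X)   # 1^T H*(X)
```

In particular, 1ᵀH*(W) = 0 whenever the entries of W sum to zero, which is how the W-term drops out in the lemma.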

G. EXPERIMENTS WITH SYNTHETIC DATA
We present Algorithm 1 for model selection which was omitted from the main text.
In this subsection, we verify Theorem 1 via synthetic experiments and compare with the least-squares estimator. In the following experiment, we have a fixed strictly stable SISO linear system of order 9; the Hankel size n is initialized to 20, which exceeds the true order. The input is multiple-rollout, shaped Gaussian exactly as in Theorem 1: we send in the input up to time 2n − 1, observe the output at the end as one observation, and restart the system. The input satisfies that, after shaping by K^{−1}, E(UᵀU) = I. The number of observations is 30 (underdetermined for least squares). We tune the regularized model by training with different regularization weights λ, and tune the least-squares model by changing the size of the Hankel matrix, observing the validation/test error.
In Fig. 7, we show the case with insufficient and noisy data. Since the problem is underdetermined for least squares, even if we take the solution whose impulse response has the smallest ℓ2 norm, it suffers large error on the validation and test sets. We can see that the regularized algorithm is robust to noise, while the least-squares algorithm performs poorly. This indicates that the solution with the least Hankel nuclear norm behaves better than the least-Frobenius-norm solution in the low-sample regime, which verifies Theorem 1.
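The noise-robustness can be seen in a stripped-down form: soft-thresholding the singular values of a noisy Hankel matrix (the proximal step underlying nuclear-norm regularization) suppresses the noise floor while keeping the dominant low-order structure. This is only a denoising proxy for the full algorithm, with the threshold and the toy order-1 system chosen by hand:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20
h = 0.9 ** np.arange(2 * n - 1)        # order-1 impulse response
idx = np.arange(n)[:, None] + np.arange(n)[None, :]
H_true = h[idx]                        # rank-1 Hankel matrix
H_noisy = H_true + 0.05 * rng.standard_normal((n, n))

# Singular value soft-thresholding: the prox of the nuclear norm.
P, s, Qt = np.linalg.svd(H_noisy)
tau = 0.6                              # hand-picked, a bit above the noise spectral norm
H_svt = (P * np.maximum(s - tau, 0.0)) @ Qt

err_noisy = np.linalg.norm(H_noisy - H_true)
err_svt = np.linalg.norm(H_svt - H_true)
rank_svt = int(np.sum(np.maximum(s - tau, 0.0) > 0))
```

The thresholded matrix keeps only the large singular values, so its Frobenius error against the true Hankel matrix drops well below that of the raw noisy matrix; the unregularized fit keeps the full noise floor.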
SAMET OYMAK (Member, IEEE) received the M.S. and Ph.D. degrees from the California Institute of Technology, Pasadena, CA, USA. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, University of California, Riverside (UCR), CA. Before joining UCR, he spent time at Google and in the financial industry, and prior to that he was a Fellow with the Simons Institute, Berkeley, CA, and a Postdoctoral Scholar with UC Berkeley. His research interests include the mathematical foundations of data science and machine learning by using tools from optimization and statistics, mathematical optimization, reinforcement learning, deep learning theory, and high-dimensional problems. He was the recipient of the Wilts Prize for the Best Thesis in electrical engineering at the California Institute of Technology.
MARYAM FAZEL (Senior Member, IEEE) received the B.S. degree from the Sharif University of Technology, Tehran, Iran, and the M.S. and Ph.D. degrees from Stanford University, Stanford, CA, USA. She is currently the Moorthy Family Professor of Electrical and Computer Engineering with the University of Washington (UW), Seattle, WA, USA, with adjunct appointments in computer science and engineering, mathematics, and statistics. Before joining UW, she was a Postdoctoral Scholar with Caltech, Pasadena, CA. She was the recipient of the NSF Career Award, UWEE Outstanding Teaching Award, and the UAI conference Best Student Paper Award with her student. She is currently the Director of the Institute for Foundations of Data Science, a multi-university NSF TRIPODS Institute. Prof. Fazel is also on the Editorial board of the MOS-SIAM Book Series on Optimization, and is an Associate Editor for the SIAM Journal on Mathematics of Data Science.