A Low Complexity Approach to Model-Free Stochastic Inverse Linear Quadratic Control

In this paper, we present a Model-Free Stochastic Inverse Optimal Control (IOC) algorithm for the discrete-time infinite-horizon stochastic linear quadratic regulator (LQR). Our proposed algorithm exploits the richness of the available system trajectories to recover the control gain K and cost function parameters (Q, R) with low space, sample, and computational complexity. By leveraging insights on the stochastic LQR, we guarantee well-posedness of the Model-Free Stochastic IOC LQR via satisfaction of the Certainty Equivalence optimality conditions. The exact solution for the control gain K is recovered via a deterministic, low complexity Least Squares approach. Using K, we solve a completely model-free, non-iterative SemiDefinite Programming (SDP) problem to obtain a (Q̂, R̂) that is unique up to a scalar ambiguity, in which optimality and feasibility are jointly ensured. Via derivation of sample complexity bounds, we show that the non-asymptotic performance of the Model-Free Stochastic IOC LQR can be characterized by the signal-to-noise ratio (SNR) of the finite set of system state and input signals. We present a model-based version of the algorithm for the special case where (A, B) is available, and we further provide the extension to the Stochastic Model-Free IOC linear quadratic tracking (LQT) case.


I. INTRODUCTION
The inverse optimal control (IOC) problem was first formulated in 1964 by Kalman [1] for the deterministic single-input linear quadratic regulator (LQR). The extension to the multi-input case was later formulated in [2]. Since then, research interest in the IOC problem has been strongly motivated by the variety of applications that require an understanding of "optimal behaviors" [3], [4]. By reverse-engineering the control policy and cost function from system trajectories, we can gain a greater understanding of how individuals make decisions, better informing system (and individual) behavioral models and thereby allowing better predictions.
Historically, many IOC LQR works have focused on recovering the cost function parameters (Q, R) for a known stabilizing gain K [2], [5], [6]. More recently, however, the assumption that the control gain K is known has been relaxed, and emphasis has been placed on the well-posedness of the IOC LQR problem, i.e., the feasibility of the IOC LQR and the existence and uniqueness of a feasible solution [7]. Not only does ill-posedness make the IOC LQR difficult to solve numerically, but even when a solution exists, non-uniqueness prevents useful inferences about the decision-making process [8]. Many papers have therefore established feasibility ( [8], [9]) and sufficient conditions for uniqueness ( [10]- [12]) via the Karush-Kuhn-Tucker (KKT) optimality conditions. Feasibility and uniqueness conditions are well established for the deterministic IOC LQR, which is more well-studied than the stochastic case. This is unsurprising, as it is difficult to demonstrate well-posedness once the assumption of noiseless data no longer holds. The solution of the deterministic IOC LQR was, however, proven to be statistically consistent with the stochastic case for zero-mean noise [9]. Still, the proofs of well-posedness of the IOC problem in [9] are only given for the ideal noiseless case. Proofs of the sufficient uniqueness conditions on the solution (Q, R) for the stochastic IOC LQR, using zero-mean noisy data, are presented in [10] for the infinite-horizon case and in [12] for the finite-horizon case.
Related works in inverse reinforcement learning (IRL) typically solve the stochastic IOC LQR using a direct approach [13]- [16], while traditional IOC LQR methods use model-based approaches. Approaches in which the nominal model (in our case, the system matrices (A, B)) is assumed to be completely unknown are called direct or model-free approaches. Correspondingly, indirect or model-based approaches either estimate the nominal model or assume it to be known a priori. Model-free approaches have the following distinct strengths: 1) model-free approaches are better able to capture parametric uncertainties and encapsulate the full system behavior that may remain unmodeled owing to data-fitting methods [17]; 2) it has been posited that directly learning a task is often easier than fitting a general-purpose system model [18]; and 3) model-based methods tend to achieve more conservative asymptotic behavior than model-free techniques owing to compounding model errors [17].
The lack of model knowledge in direct approaches further exacerbates the challenge of establishing well-posedness of the stochastic IOC LQR. Moreover, since model-free approaches utilize a finite number of data points, non-asymptotic properties must be derived in order to show performance guarantees [19]. In general, stochastic IRL works, such as [13]- [15], [20], tend to suffer from ill-posedness and non-convexity issues [7]. In addition, few IRL works comment on the sample and computational complexities of their algorithms, which require large quantities of data [21], [22]. We cite some of the most recent IRL papers to highlight this gap in the literature. Sample complexity bounds for general IRL methods were presented in both [22] and [23] without comments on well-posedness; and sufficient conditions for solution uniqueness, without a feasibility analysis, were presented in [16]. No derivations of or comments on the non-asymptotic properties of the proposed method in [16] were provided. Additionally, the method in [16] requires either a priori knowledge of, or a guess at, R in order to estimate Q̂ (which may then be inaccurate).
In this paper, we propose a Model-Free Stochastic IOC LQR algorithm for the linear time-invariant (LTI) system subject to additive zero-mean Gaussian noise. The contributions of this paper are three-fold: 1) A well-posed Model-Free Stochastic IOC LQR algorithm: the proposed algorithm is a SemiDefinite Programming (SDP) optimization problem whose well-posedness is proven via satisfaction of the Certainty Equivalence KKT optimality conditions, which establishes feasibility and uniqueness of the solution up to a scalar ambiguity. 2) Low complexities: we derive finite bounds on the sample and space complexities of the proposed algorithm for an LTI system with an n-dimensional state space and an m-dimensional control input space. The proposed algorithm also has worst-case polynomial-time computational complexity.

3) A variant Model-Based Stochastic IOC LQR algorithm: we provide a variant of the Model-Free Stochastic IOC LQR algorithm that incorporates the system parameters, if available, to formulate a stochastic model-based algorithm that is also well-posed.
This paper is organized as follows: Section II presents the problem formulation of the infinite-horizon discrete-time stochastic IOC LQR. Section III presents the Model-Free Stochastic IOC LQR algorithm with proofs of well-posedness and low complexities. In Section IV, we present the Model-Based Stochastic IOC LQR algorithm, a variant of the original model-free version. In Section V, we demonstrate the performance of the proposed model-free algorithm via simulation. Here, we also compare the performance of the model-based variant algorithm with the existing methods in [8] and [10]. Finally, conclusions are drawn in Section VI, and the extensions to the model-based and Stochastic IOC LQT cases are presented in Appendix A.

II. PROBLEM FORMULATION
Consider the discrete-time linear time-invariant (LTI) system

x_{k+1} = A x_k + B u_k + w_k, (1)

where x_k ∈ R^n and u_k ∈ R^m are the respective state and input vectors, and (A ∈ R^{n×n}, B ∈ R^{n×m}) is the pair of unknown system matrices. The external mismatched disturbance term w_k ∈ R^n ∼ N(0, Σ_w) is zero-mean Gaussian noise with covariance Σ_w ≻ 0. In this formulation, all state variables are fully observed, and, without loss of generality, the initial condition x_0 ∈ R^n ∼ N(0, X_0) has mean E(x_0) = 0 and (possibly unknown) covariance X_0 ⪰ 0. The notations M ≻ 0 and M ⪰ 0 denote a positive definite and a positive semidefinite matrix M, respectively. For the stochastic infinite-horizon linear quadratic regulator (LQR) in optimal control theory, a stabilizing optimal control policy u_k = K x_k, ∀k ≥ 0, is used to minimize the time-averaged cost function

J(Q, R) = lim_{T→∞} (1/T) E[ Σ_{k=0}^{T−1} (x_k^T Q x_k + u_k^T R u_k) ], (2)

parametrized by weights Q ∈ R^{n×n}, Q ⪰ 0, and R ∈ R^{m×m}, R ≻ 0, on the state and input terms, respectively. Conversely, the stochastic inverse optimal control (IOC) LQR problem is defined by the following two subproblems [6]. Given a control policy u_k = K x_k, ∀k ≥ 0, that generates a set Γ of noisy trajectories of (1): Subproblem P1: Recover the gain K ∈ K_{A,B} such that K is the stabilizing optimal control gain that is necessary and sufficient to minimize the cost function in (2).
Subproblem P2: Determine the pair of unknown weights (Q, R) such that the cost function (2) is minimized for every trajectory in Γ.
We denote K A,B to be the set of all stabilizing feedback gains such that, for K ∈ K A,B ⊂ R m×n , the spectral radius ρ(A + BK) < 1. To facilitate our discussion, we state the assumptions that are made throughout the rest of this paper.
Assumption 1. 1) The pair (A, B) is stabilizable. 2a) There is a set of r ≥ n + m state and input trajectories Γ ≜ {(x_{k_j}, u_{k_j}), . . . , (x_{N_j}, u_{N_j})}, for j ∈ {1, . . . , r}, available for a fixed control policy u_k = K x_k, ∀k > k_j, with varying lengths T_j and varying starting conditions v_{k_j} ≜ [x_{k_j}^T u_{k_j}^T]^T. Σ_v is the covariance of the starting condition v_{k_j}, and we impose that ||Σ_v||_2 ≥ ||[I_n; K] X_0 [I_n; K]^T||_2, where I_n ∈ R^{n×n} is the identity matrix of size n × n. 2b) The trajectories in Γ are rich enough that the associated (n + m) × (n + m) sample data matrix constructed from Γ is positive definite. Assumption 1.1) is standard in LQR applications [8], [24] and allows for the existence of a feasible solution to the (stochastic) LQR problem. Under Assumption 1.2a), unlike the methods in [9], [12], our proposed method allows for the use of multiple partial (or incomplete) trials of state and input information. As such, the trajectories in Γ can have varying lengths T_j and starting conditions v_{k_j}. The benefit is that there is no need for uniform trajectory lengths or fixed starting conditions, and extensive information preprocessing is rendered unnecessary. Assumption 1.2a) also facilitates and allows Assumption 1.2b) to hold. Assumption 1.2b) requires that the collected trajectories be rich enough that the system information varies sufficiently to encapsulate the entire behavior of the data-generating dynamical system in (1). This assumption is mild and common in data-driven/model-free approaches [25], [26], and is related to the idea of Persistency of Excitation [26], [27]. To ensure that this richness condition is met, the system trajectories are excited via two means: (i) varying the starting condition v_{k_j} (as per Assumption 1.2a), and (ii) injecting the zero-mean white noise process w_k. Both excitive means (i) and (ii) encourage exploration of the state space. We refer readers to [26] and [27] for a treatise on Persistency of Excitation in data-driven control methods. Remark 1.
We must remark that our proposed approach has a lower space complexity than the methods in [26] and [16], in which the dimensions of the relevant (Hankel) data matrices increase as the number r and the length T of the trajectories increase. In our approach, the associated data matrix, in Assumption 1.2b) above, maintains a fixed dimension (n + m) × (n + m), irrespective of r and T.
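This fixed-dimension property admits a concrete illustration. The sketch below (in Python, with hypothetical system matrices A, B, K and noise levels, none taken from the paper's example) streams closed-loop trajectory samples, with varying lengths and varying starting conditions, into a fixed (n + m) × (n + m) accumulator via rank-one updates; storage never grows with r or T:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stabilizing setup (not the paper's example system).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.1, -0.5]])
n, m = B.shape
assert max(abs(np.linalg.eigvals(A + B @ K))) < 1.0  # rho(A + BK) < 1

acc = np.zeros((n + m, n + m))   # fixed-size data-matrix accumulator
count = 0

# Stream r = 50 trajectories of varying length and varying starting conditions.
for _ in range(50):
    T = int(rng.integers(2, 6))               # varying trajectory length T_j
    x = rng.normal(0.0, 10.0, size=n)         # varying starting state
    u = rng.normal(0.0, 10.0, size=m)         # varying starting input (v_kj)
    for _ in range(T):
        v = np.concatenate([x, u])            # augmented state-input sample
        acc += np.outer(v, v); count += 1     # rank-one update: O((n+m)^2) space
        x = A @ x + B @ u + rng.normal(0.0, 0.1, size=n)
        u = K @ x                             # policy applies for k > k_j

X0_r = acc / count   # marginal sample covariance, still (n+m) x (n+m)
```

The accumulator plays the role of the paper's fixed-size data matrix; a Hankel-based construction would instead grow linearly in the total number of samples.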

III. THE INFINITE-HORIZON MODEL-FREE STOCHASTIC IOC LQR
As previously discussed in Subproblems P1 and P2 respectively, we are interested in 1) recovering the stabilizing optimal control gain K, and 2) computing (Q, R) in a model-free manner, since we only have access to a set of rich trajectories Γ. In this section, we present the proposed Model-Free Stochastic IOC LQR method and its corresponding algorithm (Algorithm 1), which comprises two main parts. The first part, in Section III-A, extracts an exact solution of K from the system trajectories Γ. We prove that we can efficiently and exactly recover K from as few as n + m rich trajectories, each of length at least 2. The second part, in Section III-B, presents the Model-Free Stochastic IOC LQR SDP formulation with a rigorous proof of its well-posedness and sample complexity bounds. The Model-Free Stochastic IOC LQR algorithm (Algorithm 1) generates a unique certainty-equivalence solution (Q̂, R̂) that is cheap in terms of its space, sample, and computational complexities.

A. EXTRACTION OF THE CONTROL GAIN K
In order to extract an exact solution for the control gain K, consider the data matrix X^i_r ≻ 0, the i-step-forward sample covariance of the augmented state and input vectors over all trajectories in Γ:

X^i_r ≜ (1 / Σ_{j=1}^{r} T_j) Σ_{j=1}^{r} Σ_{k=k_j}^{N_j − i} v_{k+i} v_{k+i}^T. (3)

Going forward, we define the partition of any matrix M ∈ R^{(n+m)×(n+m)} into the blocks M_11 ∈ R^{n×n}, M_12 ∈ R^{n×m}, M_21 ∈ R^{m×n}, and M_22 ∈ R^{m×m}. X^i_r can be viewed as a measure of the sample covariance of the system data about a zero-mean equilibrium point. In this work, we use X^0_r as the marginal sample covariance, as it incorporates the starting conditions v_{k_j}. Remember that varying the starting conditions encourages state-space exploration; as a result, X^0_r gives the best measure of the sample covariance. Since the trajectories in Γ are propagated using a control law u_k = K x_k, ∀k > k_j, from some starting condition v_{k_j}, the one-step-forward sample covariance X^1_r is directly parametrized in terms of K, and can be used to extract K exactly as K = ((X^1_r)_12)^T ((X^1_r)_11)^{-1}, as shown in Lemma 1.
We examine the partitions of X^1_r. We see that (X^1_r)_21 = K (X^1_r)_11, since u_{k+1} = K x_{k+1} for every one-step-forward sample in Γ, which yields (4).

Algorithm 1 Model-Free Stochastic IOC LQR
Require: Γ
Step 1: Construct X^0_r, X^1_r, and Z using Γ
Step 2: Compute K^1_r from (4)
Step 3: Solve for (P̂, Q̂, R̂) from (7)
return (K^1_r, Q̂, R̂)

The computation of K from (4) may seem trivial. As the reader may suspect, it is a linear least-squares computation. However, the contribution here is not an artifact of the computational method used, but rather of the fact that, by exploiting the richness of the data, the computation of K is a linear matrix equality system that is consistent and has exactly one solution. Therefore, we can compute K in a deterministic, low (computational) complexity manner that leaves an exactly zero residual, ϵ_K ≜ ||K − K^i_r||_F = 0. ||M||_F denotes the Frobenius norm of the matrix M. We do not use the matrices (A, B) to compute K, nor do we use trend-stationary regression methods, as in [8], [9], [12], to solve for K by fitting a curve to time-series stochastic data. For simple low-dimensional problems, linear regression techniques perform well and may give an almost perfect K̂ with 0 < ϵ_K ≪ 1. However, these regression techniques falter in settings where the data have large variances and the number of predictors exceeds the number of observations [28]. Now, two questions arise for Lemma 1: i) how easy is it to ensure that X^1_r ≻ 0 and (X^1_r)_11 ≻ 0?; and ii) what is the minimum number of data points required such that X^1_r ≻ 0? We denote by v_{k_j+i} the augmented state and input vector propagated i steps forward from v_{k_j}, for the j-th trajectory. An analysis of the matrix space of X^1_r reveals that the variability of the trajectory data that produces X^1_r ≻ 0 is a result of excitation by the process noise with Σ_w ≻ 0 and of the variability in the starting conditions v_{k_j}, since any v_{k_j+i} is a linear combination of v_{k_j} and the process noise.
We use Proposition 1 to show that, for a set of trajectories Γ of minimum length T_j = 2, generated by varying starting conditions and white process noise with a positive definite covariance, we require r ≥ n + m so that X^1_r ≻ 0 and (X^1_r)_11 ≻ 0.

Proposition 1. For a set of trajectories, Γ, of minimum length 2, we need r ≥ n + m rich trajectories so that X^1_r ≻ 0 and (X^1_r)_11 ≻ 0.

Proof. We fix T_j = 2 and let V_{k_j} ≜ v_{k_j} v_{k_j}^T. Then X^1_r is made up of linear combinations of the matrices V_{k_j} and H_{1,j} for j ∈ {1, . . . , r}. Thus, for a non-zero matrix A_K, we need at least n + m independent v_{k_j} or n + m independent w_{k_j} so that rank(X^1_r) = n + m and X^1_r ≻ 0. Additionally, X^1_r ≻ 0 if and only if all of its leading principal minors are positive. Hence, (X^1_r)_11 ≻ 0.

This result makes intuitive sense. Where the number of predictors (resp. the dimension of X^1_r) is n + m, we need at least n + m observations (resp. trajectories). This is a classic result from the literature on linear regression system identification techniques [29]. We can also ask what the minimum length of a single trajectory must be so that X^1_r ≻ 0, via Corollary 1.

Corollary 1. For a set, Γ, containing a single trajectory, i.e., r = 1, we need a trajectory of length T_r ≥ n + m + 1, excited by a sequence of n + m independent noise vectors w_k, so that X^1_r ≻ 0 and (X^1_r)_11 ≻ 0.

Remark 2. We see that the condition X^1_r ≻ 0 can be met using a single trajectory, but this requires a trajectory of length T_r ≥ n + m + 1 with a sequence of n + m independent noise vectors that excites all the modes of the system. However, it is arguably easier to vary the system starting conditions than to excite all the modes of an unknown dynamical system using noise.
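The exactness claim behind (4) can be illustrated directly: because u_{k+1} = K x_{k+1} holds sample-by-sample, the lower-left block of X^1_r equals K times its upper-left block, and K is recovered to machine precision even from noisy data. A minimal sketch with hypothetical system matrices (not the paper's example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.1, -0.5]])
n, m = B.shape

# Build the one-step-forward sample covariance X^1_r from r = n + m + 2 noisy
# trajectories of length 2 with varying starting conditions.
X1 = np.zeros((n + m, n + m))
count = 0
for _ in range(n + m + 2):
    x = rng.normal(0.0, 10.0, size=n)   # varied starting state
    u = rng.normal(0.0, 10.0, size=m)   # varied starting input (v_kj)
    for _ in range(2):
        x = A @ x + B @ u + rng.normal(0.0, 0.1, size=n)
        u = K @ x                        # policy applies for k > k_j
        v = np.concatenate([x, u])       # one-step-forward augmented vector
        X1 += np.outer(v, v); count += 1
X1 /= count

# K is recovered exactly (up to round-off) because (X^1_r)_21 = K (X^1_r)_11
# holds deterministically, sample by sample, despite the process noise.
K_hat = X1[n:, :n] @ np.linalg.inv(X1[:n, :n])
```

Note that only (X^1_r)_11 ≻ 0 is needed for this step; the noise merely guarantees that the state samples span R^n.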

B. STOCHASTIC MODEL-FREE IOC LQR SDP OPTIMIZATION
Given the exact computation of the control gain K, we present the Stochastic Model-Free IOC LQR SDP optimization problem, P_inv, in (7), and claim that P_inv is well-posed and obtains unique solutions, (Q̂_Pinv, R̂_Pinv), within a scalar ambiguity of (Q, R). We prove these claims via Theorems 1 and 2. Consider the Stochastic Model-Free IOC LQR SDP optimization problem P_inv in (7), where blkdiag(C, D) denotes a block diagonal matrix with matrices C and D. We investigate the well-posedness of P_inv via a two-pronged approach. First, we prove that P_inv is feasible by establishing that there exists a non-empty set of solutions that satisfy the constraints in (7), and that, despite the lack of an upper bound on P in (7), J_Pinv is bounded. We leverage insights on the feasibility of the stochastic LQR to inform that of the stochastic IOC LQR, and then solve the feasible stochastic IOC LQR using a Certainty Equivalence approach. The Certainty Equivalence approach involves adopting the estimate of a parameter as if it were the true parameter [30]. We refer readers to [30], [31] for further exposition.
Consider the stochastic LQR problem with the LTI dynamics (1) and the cost function (2). Let Ω_0 be the set of solutions of the stochastic LQR, given a pair of parameters (Q, R) belonging to a set Ω_{Q,R}. In the stochastic LQR setting under Assumption 1.1), it is commonly known that there exists an optimal solution (K^⋆, S^⋆), where S^⋆ solves the discrete algebraic Riccati equation (DARE) and K^⋆ = −(R + B^T S^⋆ B)^{-1} B^T S^⋆ A. Therefore, (K^⋆, S^⋆) ∈ Ω_0 and Ω_0 is non-empty. The stochastic LQR problem is equivalent to the primal problem P_1 in (9), with the parametrization (10): X ≜ lim_{T→∞} X_T and Σ_K ≜ lim_{T→∞} Σ_{K,T} are the time-averaged steady-state covariances of the augmented state and input vector and of the excitive inputs, respectively. We use Lemma 2 to prove that the primal problem P_1 is equivalent to the LQR problem. Let (K^⋆_P1, X^⋆_P1) be the optimal solution of the primal P_1.
Proof. For any stabilizing K ∈ K_{A,B}, the cost function J_0 can be written equivalently in terms of the covariances of the state and input trajectories propagated via the dynamics in (1). Then, given that all covariance terms E(x_k w_k^T) = E(u_k w_k^T) = 0, we take the limit lim_{T→∞} of the time average of (13) to find the time-averaged steady-state formulation, which is the matrix equality constraint in (9). The stochastic LQR problem also requires that the gain K be stabilizing. Moreover, X is unique. Thus, X^⋆_P1 is unique, K^⋆_P1 = K^⋆ is unique, and J^⋆_P1 = J^⋆_0.

From P_1, we can analyze the feasibility of the stochastic LQR (and thereby the feasibility of the stochastic IOC LQR) from consideration of its Karush-Kuhn-Tucker (KKT) conditions. The first-order necessary KKT conditions of P_1 are given in (14), and the second-order sufficient KKT conditions in (15), where A_b = [A B]. (P, Y) is the pair of Lagrangian dual variables of the respective equality and inequality conditions in (9). Given that there exists a non-empty Ω_0, for a fixed Ω_{Q,R}, that satisfies the KKT conditions, the KKT conditions can be used to find the feasible Ω_{Q,R} for a fixed non-empty Ω_0. We use Theorem 1 to prove that the Stochastic Model-Free IOC LQR problem P_inv is feasible. We set X^0_r as the (so-called) minimum-stress estimate of X by the Certainty Equivalence Principle.
Lemma 3. For a non-empty set Ω_0 containing (K, X), the set of solutions Ω_{Q̂,R̂} to P_inv is non-empty and contains a feasible (Q̂^⋆_Pinv, R̂^⋆_Pinv, P̂^⋆_Pinv).

Proof. By construction, X^0_r ≻ 0 satisfies the conditions on primal feasibility. Z^T P Z + X^0_r (Λ − P) X^0_r = 0 is the Certainty Equivalence formulation of X(Λ − P + A_K^T P A_K)X = 0, which satisfies the first stationarity condition in (14). The constraint P_12 + K^T P_22 = 0 satisfies the second condition on stationarity. For satisfaction of optimality and dual feasibility, we respectively include the sufficient conditions P_22 ≻ 0 and P ⪰ 0. The condition on complementarity is satisfied for Y = 0. Therefore, P_inv satisfies the necessary and sufficient KKT conditions. Given that (K, X) ∈ Ω_0 and the stochastic LQR necessarily satisfies (14), Ω_{Q̂,R̂} is non-empty. Now, we prove that the cost function J_Pinv is bounded.
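The stationarity structure that Lemma 3 exploits can be checked numerically for the standard stochastic LQR. The sketch below uses hypothetical (A, B, Q, R) and SciPy's DARE solver rather than the paper's data-driven pipeline; one natural candidate for the dual variable P is the augmented matrix P = blkdiag(Q, R) + [A B]^T S [A B], which at the optimal gain satisfies P_12 + K^T P_22 = 0 with P ⪰ 0 and P_22 ≻ 0:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical (A, B, Q, R); convention u_k = K x_k as in the paper.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
n, m = B.shape

# Solve the DARE and form the optimal gain for u = Kx.
S = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)

# Candidate augmented dual variable: P = blkdiag(Q, R) + [A B]^T S [A B].
Ab = np.hstack([A, B])
P = np.block([[Q, np.zeros((n, m))],
              [np.zeros((m, n)), R]]) + Ab.T @ S @ Ab

# Check the stationarity constraint P_12 + K^T P_22 = 0 at the optimum,
# plus dual feasibility P >= 0 and the sufficient condition P_22 > 0.
residual = np.linalg.norm(P[:n, n:] + K.T @ P[n:, n:])
```

Here P_22 = R + B^T S B and P_12 = A^T S B, so the constraint reduces to the usual first-order condition defining the optimal LQR gain.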

Lemma 4. J Pinv is bounded.
Proof. Any solution P_Pinv ⪰ 0 implies that J_Pinv ≥ 0. For any K such that ρ(A_K) < 1, the first stationarity condition in (14) ensures that P admits a convergent series representation in A_K and Λ. Since ||A^t_K||_2 ≤ ||A_K||^t_2, this series yields a finite upper bound P_max on P. Given that the Certainty Equivalence solution P_Pinv satisfies the first stationarity constraint, P_Pinv is upper bounded. Now, we have all the tools to discuss the feasibility of P_inv using Theorem 1.
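The series bound used in the proof of Lemma 4 can be written out explicitly (a sketch assuming ||A_K||_2 < 1; under ρ(A_K) < 1 alone, the same argument goes through in an adapted operator norm):

```latex
P = \sum_{t=0}^{\infty} (A_K^T)^t \,\Lambda\, A_K^t
\quad\Longrightarrow\quad
\|P\|_2 \;\le\; \sum_{t=0}^{\infty} \|A_K\|_2^{2t}\,\|\Lambda\|_2
\;=\; \frac{\|\Lambda\|_2}{1 - \|A_K\|_2^2} \;\triangleq\; P_{\max}.
```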

Theorem 1. P inv is feasible.
Proof. From Lemmas 3 and 4, P_inv obtains a non-empty set of feasible solutions Ω_{Q̂,R̂}, and J_Pinv is bounded between 0 and P_max.
Remark 3. P inv is a feasible SDP optimization problem that finds an ϵ-approximate solution, using the interior-point method, with worst-case computational complexity linear in log(1/ϵ) and polynomial in n + m [33].
The second prong in the investigation of the well-posedness of P_inv is to prove that the optimal solution tuple (Q̂^⋆_Pinv, R̂^⋆_Pinv) is unique. While (Q̂^⋆_Pinv, R̂^⋆_Pinv) satisfies the KKT conditions, it is not necessarily true that (Q̂^⋆_Pinv, R̂^⋆_Pinv) = (Q, R). This is because there is a manifold of possible solutions that satisfy the KKT conditions [8]. However, by solving (7), we can recover a unique optimal solution (Q̂^⋆_Pinv, R̂^⋆_Pinv) up to a scalar ambiguity, i.e., (Q̂^⋆_Pinv, R̂^⋆_Pinv) = (γ^2 Q, γ^2 R), where γ ∈ R is a scalar constant. Theorem 2 proves this result.
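The scalar ambiguity has a simple numerical illustration: scaling (Q, R) by γ^2 rescales the DARE solution by γ^2 and leaves the optimal gain unchanged, so no method that observes only K and the trajectories can resolve the overall scale of the weights. A sketch with hypothetical matrices, using SciPy's DARE solver:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical system and weights (not the paper's example).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

def lqr_gain(Q, R):
    """Optimal gain for u = Kx from the DARE solution S."""
    S = solve_discrete_are(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)

gamma = 3.7
K1 = lqr_gain(Q, R)
K2 = lqr_gain(gamma**2 * Q, gamma**2 * R)  # weights scaled by gamma^2
# K1 and K2 agree: the gain carries no information about the common scale.
```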

Theorem 2. (Q̂^⋆_Pinv, R̂^⋆_Pinv) = (γ^2 Q, γ^2 R) for some γ ∈ R.

Proof. Λ̂ ⪰ 0 is satisfied in (7). ρ(A_K) < 1, and Z^T P Z + X^0_r (Λ − P) X^0_r = 0 is the Certainty Equivalence form of X(A_K^T P A_K + Λ − P)X = 0. P ⪰ 0 is unique for the given (K, X). This means that all other possible solutions are such that X(Ā_K^T P̄ Ā_K − P̄ + Λ̄)X = X T^{−T}(A_K^T P A_K − P + Λ)T^{−1} X = 0. Thus, T^{−1}X and T^{−T}X must commute, since the columns of the transformation matrix T^{−1} do not provide the eigenbases of the invariant space. T^{−1}X and T^{−T}X commute if and only if T^{−1} is a scaled identity matrix, i.e., T^{−1} = γ I_{n+m} for γ ∈ R. Hence, (Q̂^⋆_Pinv, R̂^⋆_Pinv) = (γ^2 Q, γ^2 R).

From Theorems 1 and 2, P_inv is well-posed and the proposed Stochastic Model-Free IOC LQR algorithm yields a unique solution. Up to this point, we have investigated the well-posedness of the stochastic IOC LQR via the use of the Certainty Equivalence Principle. That is, we have obtained the performance weights (Q̂, R̂) by treating the marginal sample covariance X^0_r as the true covariance X. Although it may be efficient to use such an estimate, it is important to establish the performance gap between the Certainty Equivalence form and the true form via the derivation of non-asymptotic performance bounds. We investigate the sample complexity of the minimum-stress estimate X^0_r and provide the error bound ϵ_X ≜ ||X^0_r − X||_max as a measure of the performance of P_inv. The smaller the error bound, the smaller the performance gap between P_inv and the non-Certainty-Equivalence (true) form of P_inv, and the smaller the error ||Λ − Λ̂||_2. We provide the sample complexity bounds on X^0_r using Theorem 3.
Theorem 3. For the stochastic LTI system (1) under Assumption 1, with the marginal sample covariance X^0_r defined in (3), we have, with probability no smaller than 1 − p, a finite bound on ϵ_X in terms of ||X||_2 and the SNR of the trajectories, where p ≜ Σ_{j=1}^{r} 3(n + m)^2 exp(−(T_j + 1)/2). The proof of Theorem 3 is given in Appendix B. Theorem 3 offers some interesting insights that we can use to further develop our intuition on how the sample complexity affects the algorithm's performance. Let us consider the following extreme cases.
• Case 1: r large and all T j large such that r ≈ ∞ and T j ≈ ∞ for all j ∈ {1, . . . , r}. Theorem 3 states that with probability 1, the error ϵ X = 0. In other words, for an infinite set of trajectories of infinite length, the marginal sample covariance X 0 r is the true covariance X, and the Certainty Equivalence formulation P inv is the true formulation.
• Case 2: r = 1 and T_r ≫ n + m + 1 large, such that T_r ≈ ∞. In this case, from Theorem 3, we see that the error ϵ_X now depends on the term ||X||_2. We investigate the asymptotic behavior of X using (9) and (10). ||X||_2 is on the order of (1 + ||X_0/T||_2/||Σ_w||_2). If T = T_r is large, then the contribution of the excitation from the (noiseless) input, X_0, is filtered out over time, and the noise contribution ||Σ_w||_2 dominates. We can think of this as the signal X having a small signal-to-noise ratio (SNR). We denote (||X_0/T||_2/||Σ_w||_2) as a measure of the SNR of a single trajectory. Intuitively, this makes sense. If noise is injected into the trajectory over time from a fixed starting condition, then a single long trajectory of length T filters the excitive impact of the starting condition, and is mainly driven by the excitive noise input. A small SNR makes ||X||_2 large and widens the gap ϵ_X, so that ϵ_X is on the order of ||Σ_w||_2.
• Case 3: T_j = 2 for all j ∈ {1, . . . , r}, and r large, such that r ≫ n + m and r ≈ ∞. In an analysis similar to Case 2, we can see that the SNR is large in this case. For multiple trajectories of short lengths, ||X||_2 is driven by the contributions of the excitive starting conditions, and the effects of the process noise are filtered out over the r trajectories. In comparison to Case 2, Case 3 gives a gap ϵ_X on the order of ||X_0/T||_2, which is less than ||Σ_w||_2 for a large SNR.
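The trade-off among these cases can be probed empirically. The sketch below (hypothetical system, gain, and noise levels; many short trajectories as in Case 3) compares the marginal sample covariance against a long-run reference estimate and shows the gap shrinking as r grows:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.1, -0.5]])
n, m = B.shape

def X0(r, T=2, sigma0=10.0, sigma_w=0.1):
    """Marginal sample covariance X^0_r from r short trajectories (Case 3
    regime: excitation dominated by varied starting conditions)."""
    acc, count = np.zeros((n + m, n + m)), 0
    for _ in range(r):
        x = rng.normal(0.0, sigma0, size=n)   # varied starting state
        u = rng.normal(0.0, sigma0, size=m)   # varied starting input
        for _ in range(T):
            v = np.concatenate([x, u])
            acc += np.outer(v, v); count += 1
            x = A @ x + B @ u + rng.normal(0.0, sigma_w, size=n)
            u = K @ x
    return acc / count

X_ref = X0(50_000)                        # long-run proxy for the true X
err_few  = np.abs(X0(10)    - X_ref).max()
err_many = np.abs(X0(5_000) - X_ref).max()
# err_many < err_few: more short, well-excited trajectories shrink eps_X.
```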

Remark 4.
Characterizations of the non-asymptotic performance in terms of the SNR of the system data have only been carried out in one other paper, [19], for the data-driven stochastic LQR. The results in our paper, for the model-free stochastic IOC LQR, corroborate the result in [19], and motivate further discussion on how best to exploit the properties of the state and input data signals, collected in offline or online settings, to improve the performance of model-free controllers.

IV. A VARIANT ALGORITHM: INFINITE-HORIZON MODEL-BASED STOCHASTIC IOC LQR
Given (Â, B̂), we use Algorithm 2 to solve for an estimate (Q̂, R̂). In Algorithm 2, the Certainty Equivalence formulation problem P_inv in (7) is replaced by its model-based counterpart P^M_inv in (20), where Â_K ≜ [I_n (K^1_r)^T]^T [Â B̂].

Algorithm 2 Model-Based Stochastic IOC LQR
Require: Γ
Step 1: Construct X^1_r using Γ
Step 2: Compute K^1_r from (4)
Step 3: Solve for (P̂, Q̂, R̂) from (20)
return (K^1_r, Q̂, R̂)

This may seem a somewhat trivial replacement. However, in the model-based stochastic IOC LQR case, there are several associated challenges. Firstly, there is the issue of well-posedness. It is easy to see here that P^M_inv is well-posed and generates a solution that is unique up to a scalar ambiguity. In a proof similar to Theorem 2, we can guarantee that P^M_inv admits a feasible, unique solution that satisfies the necessary and sufficient KKT conditions in (14) and (15). Secondly, model-based IOC LQR methods utilize the set of trajectories, Γ, to estimate the control gain K̂ via some regression technique, and subsequently solve (8) (or some equivalent optimality equation) to compute (Q, R) given (Â, B̂). Feasibility of the IOC LQR is then particularly sensitive to inexact estimates of (K̂, Â, B̂) owing to the combinatorial nature of (8). An inaccurate estimate of K̂, Â, or B̂ can cause infeasibility so that there exists no solution to (8), as discussed in [8]. Most existing works, therefore, assume that the nominal model is perfectly known, i.e., (Â, B̂) = (A, B). The proposed method in this paper deals with this issue in two ways: 1) The estimation of K from K^1_r is exact and admits an error ϵ_K = 0, as discussed in Section III-A. In fact, since there is no longer a need to estimate the sample covariance X^0_r, K^1_r can be exactly computed from as few as r = n + m trajectories of length 2, or from a single trajectory of length n + m + 1, as discussed in Proposition 1 and Corollary 1, respectively. 2) If the pair (A, B) is not perfectly known, we can estimate (Â, B̂) using system identification techniques as in [34, Algorithm 1].
(Since such system identification techniques for an LTI system are numerous in the literature, we do not reproduce them in this paper.) We also refer readers to [34, Prop. 1.1] for a sample complexity analysis for estimating the matrices (Â, B̂) using a least-squares regression approach. We illustrate the comparative performance of the Model-Based Stochastic IOC LQR against the methods in [8] and [10], which use stochastic data (in simulation) to solve the Model-Based IOC LQR.
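As a sketch of the kind of regression step referenced here (a generic least-squares fit, not the specific algorithm of [34]), (Â, B̂) can be estimated by stacking the regression x_{k+1} ≈ [A B][x_k; u_k] over excited input-state data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical true system (not the paper's example).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
n, m = B.shape

# Collect excited open-loop data (x_k, u_k, x_{k+1}) with random inputs.
X, U, Xp = [], [], []
x = np.zeros(n)
for _ in range(500):
    u = rng.normal(size=m)                       # exploratory input
    xn = A @ x + B @ u + rng.normal(0.0, 0.01, size=n)
    X.append(x); U.append(u); Xp.append(xn)
    x = xn

# Least-squares fit: x_{k+1} ~ Z theta with Z = [x_k^T u_k^T].
Z = np.hstack([np.array(X), np.array(U)])        # 500 x (n+m) regressors
Theta, *_ = np.linalg.lstsq(Z, np.array(Xp), rcond=None)
A_hat, B_hat = Theta.T[:, :n], Theta.T[:, n:]    # recover [A B] = Theta^T
```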

V. SIMULATION RESULTS
Via an illustrative example, we demonstrate the performance of the proposed Stochastic Model-Free IOC LQR algorithm. For completeness, we also compare the performance of our model-based algorithm, Algorithm 2, with the existing works [10] and [8]. Consider the LTI dynamics of the cooling of a three-source data center, which corresponds to a marginally unstable Laplacian system [35]. Thus, it is imperative that a high-quality fit of the dynamics be obtained. We use weights Q = I_n and R = 1000 I_n to illustrate the case where there is a relatively high cost for power consumption, so as to encourage small control inputs. Small control inputs may destabilize an ill-fitted system by generating small inputs for (unstable) modes that are estimated to be stable. The optimal control gain K can be obtained by solving the DARE. In this simulation, we solve the sets of SDP equations in (7) and (20) using the MATLAB cvx SDPT3 solver [36], which has worst-case polynomial-time computational complexity.

A. NOISY MODEL-FREE CASE
In the model-free case, the system trajectories are generated with the nominal model (A, B) and control gain K. In Fig. 1, we present the Monte-Carlo simulation results of our proposed algorithm in the noisy case. For each episode, the r state and input trajectories were simulated from randomly generated initial points, v_{k_j} ∼ N(0, blkdiag(10^2 I_n, 10^2 I_m)), with T_j = N equal for all trajectories. Figure 1 shows that the accuracy of recovering the cost function increases as the number and length of the trajectories increase. This is to be expected since, with more trajectories of longer lengths, X^0_r is better able to encapsulate the entire system behavior and better approximates X, as discussed in Section III-B.
We analyze Fig. 1 using the insights gained from Theorem 3. For ease of analysis, we look at the corners of Fig. 1. The lower left-hand corner of Fig. 1 presents the case where r is small and T j is small, which yields an average error ||Λ −Λ|| 2 ≈ 0.25. (Remember that for all trajectories, T j = N , in this simulation). Here, the algorithm requires more data points to give a better estimate. In comparison, the upper left-hand corner yields a smaller performance error ||Λ −Λ|| 2 ≈ 0.2. The upper left-hand corner of Fig. 1 illustrates the case where r is small as compared to T j . Here, the lengths of the trajectories are longer, which accounts for some performance improvement, but this improvement is tempered by a small SNR. This is because the contribution of the input (noiseless) signal, to the covariance, is dominated by that of the process noise. As is expected from our insights on Theorem 3, a small SNR widens the performance gap. The lower right-hand corner outperforms the results from the left half of Fig. 1. Here, the SNR is larger than that of both the lower and upper left-hand corners, as r is large and T j is small. From Theorem 3, this narrows the performance gap while increasing the probability of success 1 − p. Finally, the upper right-hand corner illustrates the case where both r and T j are large. In this corner, the performance results are the best of all the cases.

B. NOISELESS MODEL-FREE CASE
For completeness, we present the performance results for the model-free noiseless case using our proposed Model-Free Stochastic IOC LQR algorithm. That is, the excitation of the system stems only from varying the starting points, v_kj ∼ N(0, blkdiag(10²I_n, 10²I_m)), with Σ_w = 0·I_n; this corresponds to an (ideally) infinite SNR. For (T_j = 6, r = 6), the normalized error is ||Λ − Λ̂||_2 = 3.4 × 10⁻⁴, and for (T_j = 300, r = 50), ||Λ − Λ̂||_2 = 6.93 × 10⁻⁶. In the noiseless case, we obtain good results even for the small dataset (T_j = 2, r = 6): ||Λ − Λ̂||_2 = 0.0046. Overall, since the system data is uncorrupted by noise, the model-free noiseless case is roughly two orders of magnitude more accurate than the noisy case.
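The role of the SNR in the two cases above can be illustrated with a generic empirical signal-to-noise ratio. The definition used here (mean signal energy over mean noise energy) is a stand-in for the paper's formal SNR, and the `empirical_snr` helper is an assumption made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_snr(signal, noise):
    """Generic empirical SNR: mean signal energy over mean noise energy."""
    return np.mean(signal**2) / np.mean(noise**2)

# excitation from initial conditions only (standard deviation 10, as above)
signal = 10.0 * rng.standard_normal(1000)

# noiseless case: Sigma_w = 0, so the noise energy vanishes and the SNR
# is effectively infinite; guard against the division by zero explicitly
noise = np.zeros(1000)
snr = np.inf if np.allclose(noise, 0) else empirical_snr(signal, noise)
```

With a nonzero `noise` sample drawn from the process-noise distribution, the same helper gives the finite SNR that governs the noisy-case performance gap.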

C. MODEL-BASED CASE
In this section, we demonstrate the comparative accuracy of our proposed model-based variant against the LMI-based method in [8] and the inverse KKT approach presented in [10]. Note that the model-based version of our algorithm (Algorithm 2) is employed for this comparison study, since the methods proposed in [8] and [10] are applicable only to the model-based case. We provide the eigenvalues of (Â_1 + B̂_1 K) and (Â_2 + B̂_2 K) to show the closed-loop stability for each case: the eigenvalues of (Â_1 + B̂_1 K) are {0.9432, 0.9608, 0.9608}, and the eigenvalues of (Â_2 + B̂_2 K) are {0.8791, 1.0329, 0.9511}. Table 1 shows that our algorithm outperforms the others in terms of accuracy across all the scenarios, with varying noise, varying numbers of trajectories of varying lengths, and even varying descriptions of the system models. Compared to [10], our proposed method is well-posed for any given (Â, B̂) and is able to obtain a solution for low sample complexities and data with high noise covariances. The inverse KKT approach in [10] relies on a specific rank condition on the recovery matrix, which is sensitive to the length of the trajectories, the covariance of the noise, and the estimates of the system matrices (Â, B̂). For example, if there are too few data points, then the rank of the recovery matrix is too low; if the covariance of the noise is large, then the rank of the recovery matrix is too high. Table 1 shows that the inverse KKT method in [10] is only guaranteed to obtain valid results for the cases in which there is a large dataset with low noise and (Â, B̂) = (A, B) perfectly known. To ensure well-posedness, the method in [8] artificially restricts the manifold of possible solutions to guarantee the uniqueness of any obtained solution. As such, our proposed method outperforms that of [8] by virtue of searching a larger solution space while still guaranteeing well-posedness. Additionally, the method in [8] does not explicitly model the disturbance in the development of its IOC LQR methodology, which may explain why it performs poorly for disturbances with larger covariances. We also note that, in general, when the estimated dynamics (Â, B̂) are ill-fitted, the performance of all the methods, including our proposed method, worsens. This is to be expected, as the model estimation error affects the computation of Λ̂.

TABLE 1. A comparison of the error ||Λ − Λ̂||_2 for the model-based approaches proposed in this paper, [8], and [10], across noise covariances Σ_w and dataset sizes T_j ∈ {2, 50, 300}, r ∈ {6, 50, 300}.
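The rank sensitivity discussed above can be illustrated with a simple numerical rank check. The matrix constructions here are generic stand-ins, not the specific recovery matrix of [10], and the `numerical_rank` helper is an assumption made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def numerical_rank(M, tol=1e-8):
    """Rank estimated by counting singular values above a relative tolerance."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# too few data points: a 3 x 6 stack can never reach the required rank 6
few = rng.standard_normal((3, 6))
rank_few = numerical_rank(few)

# heavy noise: a matrix that is exactly rank 1 in the noiseless setting
# becomes numerically full rank once a large disturbance is added
low_rank = np.outer(rng.standard_normal(6), rng.standard_normal(6))
noisy = low_rank + 5.0 * rng.standard_normal((6, 6))
rank_noisy = numerical_rank(noisy)
```

The first case mirrors "rank too low" (too few samples), and the second mirrors "rank too high" (large noise covariance inflating the numerical rank).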

D. COMPARING MODEL-FREE TO MODEL-BASED PERFORMANCES
In this last section, we compare the performance results of Algorithm 1 to those of Algorithm 2, as per Table 2. We see that the model-free algorithm (Algorithm 1) obtains overall better robustness to noise, while the model-based algorithm (Algorithm 2) generally yields better results.

INSEOK HWANG received the Ph.D. degree in Aeronautics and Astronautics from Stanford University, and is currently a Professor in the School of Aeronautics and Astronautics and a University Faculty Scholar at Purdue University. His research interests lie in high-assurance autonomy (which guarantees safety, security, and performance) for Cyber-Physical Systems (CPS) based on the hybrid systems approach and its applications to safety-critical systems such as aircraft/spacecraft/unmanned aerial systems (UAS), air traffic management, and multi-agent systems, which are complex systems with interacting cyber (or logical or human) elements and physical components.