Learning Lipschitz Feedback Policies from Expert Demonstrations: Closed-Loop Guarantees, Generalization and Robustness

In this work, we propose a framework to learn feedback control policies with guarantees on closed-loop generalization and adversarial robustness. These policies are learned directly from expert demonstrations, contained in a dataset of state-control input pairs, without any prior knowledge of the task and system model. We use a Lipschitz-constrained loss minimization scheme to learn feedback policies with certified closed-loop robustness, wherein the Lipschitz constraint serves as a mechanism to tune the generalization performance and robustness to adversarial disturbances. Our analysis exploits the Lipschitz property to obtain closed-loop guarantees on generalization and robustness of the learned policies. In particular, we derive a finite sample bound on the policy learning error and establish robust closed-loop stability under the learned control policy. We also derive bounds on the closed-loop regret with respect to the expert policy and the deterioration of closed-loop performance under bounded (adversarial) disturbances to the state measurements. Numerical results validate our analysis and demonstrate the effectiveness of our robust feedback policy learning framework. Finally, our results suggest the existence of a potential tradeoff between nominal closed-loop performance and adversarial robustness, and that improvements in nominal closed-loop performance can only be made at the expense of robustness to adversarial perturbations.


I. INTRODUCTION
Robustness of data-driven models to adversarial perturbations has attracted much attention in recent years. One of the approaches to robust learning seeks to modulate the Lipschitz constant of the data-driven model [1], [2], [3], either via a regularization [4], [5] of the learning loss function or by imposing a Lipschitz constraint [6], [7]. Since the Lipschitz constant determines the (worst-case) sensitivity of a model to perturbations of the input, data-driven models trained with Lipschitz constraints/regularizers are expected to be robust to bounded (adversarial) perturbations [8]. Prior works have primarily explored this approach for static input-output models [7], [8], [9]. However, in a feedback control setting, a static input-output robustness guarantee for a data-driven controller may not result in robust closed-loop performance. When a data-driven controller is integrated into the feedback loop, a static input-output robustness guarantee for the data-driven controller must be combined with appropriate robust control notions to yield a robustness certificate for the closed-loop system [10], [11]. Obtaining safety and robustness certificates for data-driven controllers in closed-loop systems remains an active area of research in general. In this work, we propose a learning framework to learn Lipschitz feedback policies with provable guarantees on closed-loop performance and robustness against bounded adversarial perturbation, where these policies are learned directly from expert demonstrations without any prior knowledge of the task and the system model.
The problem of learning optimal feedback control policies from data for a nonlinear system with unknown dynamics and control cost is not only technically challenging, but also has high sample complexity, which presents obstacles for data collection and the use of data-driven algorithms. In the imitation learning framework, this issue is often mitigated by expert demonstrations of the optimal feedback policy, which help reduce the problem to one of learning the policy implemented by the expert. Yet, learning is not a simple repetition of the expert controls, but rather the ability to generalize and respond to unseen conditions as the expert demonstrator would. A naive learning approach (such as behavioral cloning in imitation learning [12]) that overlooks the generalization and robustness requirements may not only result in pointwise differences between the expert and implemented policies, but also in unstable trajectories and failure of the controlled system [13]. This raises the following question: For an unknown system and control task, what is an appropriate method to learn a feedback policy from a finite number of expert demonstrations (dataset of state-control input pairs) such that (i) the learned policy generalizes expert performance beyond the finite data points to a broader region of interest, and (ii) closed-loop performance remains robust to (adversarial) disturbances of the state measurements?
Related work. There have been several proposals to address the above question in various settings. Broadly, this problem falls under the umbrella of imitation learning, which has been studied extensively in the literature and implemented in various contexts including video games [13], [14], robotics [15] and autonomous driving [16].
Generalization: The key obstacle to widespread adoption of imitation learning is that it is difficult to guarantee performance in unseen scenarios. One approach to overcome this obstacle is inverse reinforcement learning (also referred to as apprenticeship learning in the literature), where the learner infers the unknown cost function from expert demonstrations and then learns a policy that optimizes the learned cost using reinforcement learning [17], [18], [19], [20], [21]. Since the learned cost represents the task of the expert, inverse reinforcement learning algorithms are able to generalize to unseen scenarios that are not covered by the expert demonstrations. However, one drawback of inverse reinforcement learning is that there can exist multiple cost functions that are optimized by the expert's policy, which adds ambiguity in learning the cost function [22]. Another approach that overcomes the obstacle of performing in unseen scenarios is direct policy learning via an interactive expert [13], [23]. In this approach, the learner can query an interactive expert at each iteration and then uses the expert's feedback to correct its mistakes and improve its policy. Since this approach keeps expanding the expert's data, it will eventually cover all possible scenarios in the long run. However, one drawback of this approach is that it requires the expert to be always available for feedback. In [24], noise is injected into the expert's policy in order to provide demonstrations on how to recover from errors. In [25], the authors develop a framework for learning a generative model for planning trajectories from demonstrations, which allows it to capture uncertainty about previously unseen scenarios.
Closed-loop performance and robustness: Several approaches to adversarial imitation learning have been proposed in [26], [27], where inverse reinforcement learning is used. In [28], the authors proposed an adversarially robust imitation learning framework, where an agent is trained in an adversarially perturbed environment with the expert being available for queries at any time step. In [29], the authors learn robust control barrier functions from safe expert demonstrations. In all these works, robustness of imitation learning algorithms is considered to be the ability of the learned policy to recover from errors, which is similar to the notion of generalization.
In contrast to many of the works referenced above, we seek a principled feedback policy learning framework with strong theoretical guarantees. In particular, we seek explicit bounds on the finite sample performance, stability and robustness of the closed-loop system under the learned feedback policy. The broader problem of obtaining closed-loop performance and robustness guarantees for learned feedback policies and understanding the underlying tradeoffs has attracted attention recently [30], [31], [32]; yet it remains an active area of research. This requires the integration of theoretical tools from several areas: (i) the underlying control task is typically specified as an optimal control problem with performance measured in terms of the cost incurred, (ii) the feedback control policy is learned from finite offline data which involves considerations of generalization and robustness to distributional shifts, and (iii) closed-loop performance guarantees typically rely on an underlying robust stability guarantee for the learned policy. Prior works have addressed this problem within various frameworks, such as the H∞-control framework for linear systems [33]. However, the problem of obtaining guarantees on closed-loop generalization and robustness to distributional shifts of learned policies for general nonlinear systems still remains a challenge. In this work, we address this problem within a Lipschitz feedback policy learning framework. The Lipschitz property is a fairly mild requirement in nonlinear control, and through our analysis we will see that it can be exploited to provide closed-loop bounds on generalization and robustness to distributional shifts for learned policies, highlighting the effectiveness of this approach.
Contributions. Our primary contribution in this paper is a robust feedback control policy learning framework based on Lipschitz-constrained loss minimization, where the feedback policies are learned directly from expert demonstrations. We then undertake a systematic study of the performance of feedback policies learned within our framework using meaningful metrics to measure closed-loop stability, performance and robustness. Our work integrates robust learning, optimal control and robust stability into a unified framework for robust feedback policy learning. More specifically, our technical contributions include: (i) an analysis of the Lipschitz-constrained policy learning problem, resulting in a finite sample bound on the learning error, (ii) a robust stability bound for the closed-loop system under the learned feedback policies, (iii) a Lipschitz analysis resulting in a bound on the regret incurred by learned feedback policies in terms of the learning error, and a bound on the deterioration of performance in the presence of (adversarial) disturbances to state measurements. This sheds light on the dependence of closed-loop control performance and robustness on learning. Conversely, our results specify target bounds on policy learning error for desired closed-loop performance. We then demonstrate our robust feedback policy learning framework via numerical experiments on (i) the standard LQR benchmark, and (ii) a non-holonomic differential drive mobile robot model. Finally, our analysis points to the existence of a potential tradeoff between nominal performance of the learned policies and their robustness to adversarial disturbances of the feedback, which is borne out in numerical experiments where we observe that improvements to adversarial robustness can only be made at the expense of nominal performance.
Notation. For open and bounded sets X ⊂ R^dim(X) and Y ⊂ R^dim(Y), and a Lipschitz continuous map f : X → Y, we denote by ℓ_f the Lipschitz constant of f. Furthermore, we denote by Lip(X; Y) the space of Lipschitz continuous maps from X to Y. We denote by Vol(X) the volume of X. A function g : X → R is said to be λ-smooth if it has a Lipschitz-continuous gradient, i.e., ‖∇g(x_1) − ∇g(x_2)‖ ≤ λ‖x_1 − x_2‖ for any x_1, x_2 ∈ X, where ‖·‖ denotes the norm operator. A continuously differentiable function g : X → R is said to be µ-strongly convex if ‖∇g(x_1) − ∇g(x_2)‖ ≥ µ‖x_1 − x_2‖ for any x_1, x_2 ∈ X. The maximum eigenvalue of a square matrix A is denoted by ρ(A). The cardinality of a set X is denoted by |X|.

II. APPROACH
In this section, we set up the problem of learning robust feedback control policies from expert demonstrations and present an outline of our approach.

A. Problem setup
We begin by specifying the properties of the system, the control task and the dataset of expert demonstrations. Consider a discrete-time nonlinear system of the form (1), where the map f : R^n × R^m → R^n denotes the dynamics. We now explain the motivation behind the assumptions on the properties of System (1) stated in Assumption II.1. The control task is often formulated as one of stabilizing the system to the origin. Assumption II.1-(i) states that the origin, in the absence of control input, is indeed a fixed point of the system. The Lipschitz continuity Assumption II.1-(ii) specifies the level of regularity intrinsic to the system dynamics and is fairly standard in the literature. From a control design perspective, it is crucial that the system be stabilizable within the class of feedback policies considered in the design. In this paper, we seek to learn feedback policies with Lipschitz regularity, and Assumption II.1 specifies that System (1) is exponentially stabilizable by Lipschitz feedback. The task is one of infinite-horizon discounted optimal control of System (1) by a Lipschitz-continuous feedback policy, with stage cost c : R^n × R^m → R_{≥0} and discount factor γ ∈ (0, 1), as formulated in (2), where Lip(R^n; R^m) is the space of Lipschitz-continuous feedback policies. Furthermore, we would like the closed-loop performance to be robust to the disturbance δ.
Assumption II.2 (Task properties). The following hold for Task (2) and System (1): (i) Strong convexity and smoothness of stage cost: The stage cost c : R^n × R^m → R_{≥0} is µ-strongly convex and λ-smooth. Furthermore, c(x, u) = 0 if and only if x = 0 and u = 0.
Footnote 1: The output equation allows for the modeling of sensors that are susceptible to bounded (adversarial) disturbances [34].
The choice of optimal control cost function plays an important role in determining the properties of the optimal feedback policy. Assumption II.2-(i) specifies the convexity and smoothness properties of the control cost. Existence of an optimal feedback policy within the considered class, as specified in Assumption II.2-(ii), is a minimum requirement for control design.
We now verify the properties in Assumptions II.1 and II.2 in the Linear-Quadratic control setting.
Example 1 (Linear quadratic control). For a linear system with f(x, u) = Ax + Bu such that (A, B) is a controllable pair, it can be seen that the properties in Assumption II.1 readily follow. It can also be seen that a quadratic stage cost c(x, u) = xᵀQx + uᵀRu with Q ≻ 0 and R ≻ 0 satisfies Assumption II.2-(i). Furthermore, we note that Assumption II.2-(ii) readily follows from the existence of an optimal feedback gain for the discounted infinite-horizon LQR problem, and the fact that the corresponding optimal value function is quadratic.
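The strong convexity and smoothness constants of a quadratic stage cost can be made explicit with a short calculation. The sketch below assumes c(x, u) = xᵀQx + uᵀRu with symmetric Q ≻ 0 and R ≻ 0, the Euclidean norm, and the gradient-based definitions from the Notation paragraph; the stacked variable z and the block matrix W are notation introduced here for illustration only.

```latex
\begin{aligned}
z &= \begin{bmatrix} x \\ u \end{bmatrix}, \qquad
W = \begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix}, \qquad
c(z) = z^\top W z, \qquad \nabla c(z) = 2 W z, \\
\|\nabla c(z_1) - \nabla c(z_2)\| &= \|2W(z_1 - z_2)\|
\;\in\; \big[\, 2\lambda_{\min}(W)\,\|z_1 - z_2\|,\;\; 2\lambda_{\max}(W)\,\|z_1 - z_2\| \,\big].
\end{aligned}
```

Under these assumptions one may take µ = 2 min{λ_min(Q), λ_min(R)} and λ = 2 max{λ_max(Q), λ_max(R)}. The latter agrees with the value λ = max{ρ(2Q), ρ(2R)} used in the numerical experiments.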
In this paper, we consider the problem of data-driven feedback control, where we have access neither to the underlying dynamics f nor to the task cost function (stage cost c and discount factor γ). Instead, we have access to N < ∞ expert demonstrations of an (unknown) optimal feedback policy π* on System (1) over a finite horizon of length T. The initial states of the demonstrations are sampled i.i.d. uniformly from B_r(0) ⊂ R^n, the ball of radius r centered at the origin. The data is collected in the form of matrices X = [x^(1) ... x^(N)] and U = [u^(1) ... u^(N)], where x^(i) = (x^(i)_0, ..., x^(i)_T) and u^(i) = (u^(i)_0, ..., u^(i)_{T−1}) are the state and input vectors from the i-th demonstration, satisfying u^(i)_t = π*(x^(i)_t).
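The data collection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce the paper's results: the dynamics `f` (a double integrator), the gain inside `expert`, and all numerical values are hypothetical stand-ins, and the demonstrations are generated without measurement disturbance.

```python
import math
import random


def sample_uniform_ball(dim, radius, rng):
    """Draw one point uniformly at random from the ball B_radius(0) in R^dim."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in direction))
    # A radius proportional to U^(1/dim) makes the sample uniform in volume.
    scale = radius * rng.random() ** (1.0 / dim) / norm
    return [scale * v for v in direction]


def collect_demonstrations(f, expert, dim, radius, num_demos, horizon, seed=0):
    """Roll out the expert from random initial states; return state and input data."""
    rng = random.Random(seed)
    X, U = [], []
    for _ in range(num_demos):
        x = sample_uniform_ball(dim, radius, rng)
        states, inputs = [x], []
        for _ in range(horizon):
            u = expert(x)      # expert acts on the undisturbed state
            x = f(x, u)
            inputs.append(u)
            states.append(x)
        X.append(states)       # T + 1 states per demonstration
        U.append(inputs)       # T inputs per demonstration
    return X, U


def f(x, u):
    """Hypothetical dynamics: double integrator with sampling time 0.1."""
    return [x[0] + 0.1 * x[1] + 0.005 * u, x[1] + 0.1 * u]


def expert(x):
    """Hypothetical stabilizing linear feedback standing in for the expert."""
    return -1.0 * x[0] - 1.5 * x[1]


X, U = collect_demonstrations(f, expert, dim=2, radius=2.0, num_demos=5, horizon=20)
```

Each demonstration stores T + 1 states and T inputs, so the pair (X[i][t], U[i][t]) is one state-control sample available to the learner.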

B. Outline of the approach
Our objective is to learn a feedback policy from the dataset X, U of expert demonstrations to solve the control task (2) while remaining robust to (adversarial) disturbances δ of full-state measurements. To this end, we seek an optimization-based learning formulation that allows us to explicitly constrain the sensitivity of the learned policy to (adversarial) disturbances. The Lipschitz constant of the learned policy serves as a measure of its sensitivity to disturbances, and we thereby formulate the (adversarially) robust policy learning problem as a Lipschitz-constrained policy learning problem [7], stated in (3), where L is a strictly convex and Lipschitz continuous loss function for the learning problem, lip(π) is the Lipschitz constant of the policy π, and α ∈ R_{≥0} is a target upper bound for the Lipschitz constant of the learned policy π̂ (the minimizer in (3)). The Lipschitz constraint in (3) serves as a mechanism to induce robustness of the learned policy to disturbances δ (the smaller the parameter α, the more robust the policy π̂ is to the disturbances δ [1]). Figure 1 illustrates our setup.
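The role of the Lipschitz constraint as a sensitivity bound can be written out in one line. The inequality below follows directly from the definition of the Lipschitz constant; it assumes a measurement disturbance bounded as ‖δ‖ ≤ ζ, with ζ the disturbance bound that appears later in the analysis.

```latex
\|\hat{\pi}(x + \delta) - \hat{\pi}(x)\|
\;\le\; \mathrm{lip}(\hat{\pi})\,\|\delta\|
\;\le\; \alpha\,\zeta
\qquad \text{whenever } \mathrm{lip}(\hat{\pi}) \le \alpha \text{ and } \|\delta\| \le \zeta .
```

The deviation of the applied control caused by a perturbed measurement is therefore at most αζ, which is why a smaller α yields a less sensitive policy.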
We then use the robustness of the learned policy, along with a bound on the training loss, to obtain guarantees on the closed-loop performance under the learned policy π̂. For this, we first combine the training loss with the Lipschitz bound to obtain a bound on the worst-case error ‖π̂ − π*‖_∞, where ‖π̂ − π*‖_∞ = sup_{x∈B_r(0)} ‖π̂(x) − π*(x)‖. Then for a given worst-case learning error bound ‖π̂ − π*‖_∞ ≤ ε, we obtain via our analysis (i) a robust closed-loop stability bound as a function of ε, and (ii) bounds on closed-loop performance (measured in terms of the cost (16) incurred on the task) as a function of ε. Conversely, in order to satisfy target bounds on stability and performance, our analysis can be used to obtain a target bound on ε which must be satisfied by the learned policy, if some additional information on the system and task is available.
We now develop appropriate notions of closed-loop performance and robustness under the feedback policies learned from expert demonstrations. We note that the control task (2), being one of optimal control of System (1), has a natural performance metric given by the value function. Let V_π̂ be the value function associated with the learned feedback policy π̂ for System (1). Since the expert implements the optimal policy π*, the performance of the learned policy can be measured by its regret R(π̂) with respect to the expert policy π*, defined in (5). When R(π̂) = 0, the performance of the learned policy equals the performance of an optimal policy for the control task (2). Conversely, the performance of the learned policy degrades as R(π̂) increases. Naturally, the objective of the policy learning problem is now to minimize the regret incurred by the learned policy π̂. Note that this is a more important performance metric in the closed-loop setting than the loss function L used for learning in (3), as it encodes the cost incurred by the evolution of the system under the learned feedback policy. We now note that the regret R only measures the performance of the learned policy under nominal conditions (in the absence of perturbations on the state measurements) and does not shed light on its performance in the presence of adversarial perturbations. This calls for an appropriate robustness metric, for which we will use the regret associated with the policy π̂ when subject to perturbations relative to when deployed under nominal conditions, that is, the metric S(π̂) defined in (6), where π̂_δ(x) = π̂(x + δ). Intuitively, if S(π̂) is small, then the performance of the policy π̂ under perturbation is close to its performance in nominal conditions, and π̂ is robust to feedback perturbations. Again, we note that this robustness metric measures closed-loop robustness by encoding the cost incurred by the evolution of the system under the learned feedback policy subject to feedback perturbations. We would ideally like to keep both R and S low, which would imply that the policy performs well both under nominal conditions and when subjected to feedback perturbations. However, we shall see later that there may exist tradeoffs between the two objectives, presenting an obstacle to such a goal.
We now address some crucial technical issues arising in the closed-loop dynamic setting in relation to minimizing the performance metrics R and S. We note that the policy learning problem (3) is formulated over the set B_r(0) ⊂ R^n, which is the region of interest containing the data from expert demonstrations. Now, in order to measure the performance of a learned policy π̂ using the metrics R and S, we must first ensure that the closed-loop trajectories of the system, under policy π̂, remain in B_r(0) (for initial conditions in B_r(0)). In the absence of such a guarantee, the metrics R and S are likely to be unbounded, and would therefore not serve as useful measures of performance. We therefore obtain robust stability bounds that specify the conditions under which closed-loop trajectories remain bounded in B_r(0).
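The two metrics can be estimated numerically by rolling out the closed loop. The sketch below is an illustration under stated assumptions rather than the paper's definitions (5) and (6): it evaluates both quantities at a single initial state, truncates the infinite-horizon sum, and uses a hypothetical scalar system, cost, expert gain and learned gain. The function names are introduced here for illustration.

```python
import random


def discounted_cost(f, cost, policy, x0, gamma, horizon, disturbance=None):
    """Truncated discounted cost when the policy sees y_t = x_t + delta_t."""
    x, total = x0, 0.0
    for t in range(horizon):
        delta = disturbance(t) if disturbance is not None else 0.0
        u = policy(x + delta)                # policy acts on the perturbed measurement
        total += (gamma ** t) * cost(x, u)   # cost is charged on the true state
        x = f(x, u)
    return total


def regret(f, cost, learned, expert, x0, gamma, horizon):
    """Nominal cost of the learned policy minus nominal cost of the expert."""
    return (discounted_cost(f, cost, learned, x0, gamma, horizon)
            - discounted_cost(f, cost, expert, x0, gamma, horizon))


def robustness(f, cost, learned, x0, gamma, horizon, disturbance):
    """Perturbed cost of the learned policy minus its nominal cost."""
    return (discounted_cost(f, cost, learned, x0, gamma, horizon, disturbance)
            - discounted_cost(f, cost, learned, x0, gamma, horizon))


# Hypothetical scalar example; every number below is an illustrative assumption.
def f(x, u):
    return x + 0.5 * u


def cost(x, u):
    return x * x + 0.1 * u * u


def expert(x):
    return -1.5 * x   # stand-in for the expert policy


def learned(x):
    return -0.5 * x   # stand-in for a policy learned with a tight Lipschitz bound


rng = random.Random(0)
zeta = 0.1
noise = [rng.uniform(-zeta, zeta) for _ in range(200)]  # bounded by zeta

R = regret(f, cost, learned, expert, x0=1.0, gamma=0.9, horizon=200)
S = robustness(f, cost, learned, 1.0, 0.9, 200, lambda t: noise[t])
print("regret:", R, "robustness metric:", S)
```

A random bounded disturbance is used here for simplicity; an adversarial disturbance would instead be chosen to maximize the perturbed cost, so the value of S printed above should be read as an optimistic estimate.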

III. PERFORMANCE
In this section, we present the theoretical results underlying the robust feedback policy learning framework outlined in Section II. The results are presented in three parts: (i) We begin with an analysis of the Lipschitz-constrained policy learning problem (3). In Theorem IV.1, we provide a finite sample guarantee on the maximum learning error incurred in the region of interest B_r(0), i.e., ‖π̂ − π*‖_∞. This guarantee on the learning error bound is to be combined with the closed-loop stability and performance guarantees obtained later for policies satisfying a given learning error bound. (ii) We then present a closed-loop stability analysis for System (1) under learned feedback control policies satisfying a given learning error bound. In Theorem III.3-(i) we establish that the closed-loop system under optimal feedback π* is exponentially stable. Furthermore, in Theorem III.3-(ii) we establish a robust stability guarantee (to bounded adversarial disturbances on the state measurements) for learned feedback control policies satisfying a given learning error bound. (iii) We finally present an analysis of performance on the control task (14) under learned feedback control policies satisfying a given learning error bound. Theorem III.6-(i) provides an upper bound on the regret incurred by the learned policy with respect to the expert policy. Theorem III.6-(ii) quantifies the robustness of the closed-loop performance in terms of the Lipschitz constant of the learned feedback policy.

A. Robust stability with learned feedback policy
We first present the following result on the quadratic boundedness of the optimal value function:
Lemma III.1 (Quadratic boundedness of optimal value function). There exist κ̲*, κ̄* ∈ R_{≥0} such that µ/2 ≤ κ̲* ≤ κ̄* and the optimal value function in (2) is quadratically bounded, with κ̲* and κ̄* as the lower and upper constants, respectively.
We make an assumption on the constants κ̲*, κ̄* in Lemma III.1 (Assumption III.2). The following theorem establishes robust stability of the closed-loop system under the learned policy π̂ from a bound on the policy error ‖π̂ − π*‖_∞ and measurement disturbances δ:
Theorem III.3 (Robust exponential stability under Lipschitz policy). Let γ̄ = 1 − µ/(2κ̄*). Let π* be the minimizer in (2) for some γ ∈ (κ̄*γ̄/κ̲*, 1), and let π̂ be any policy satisfying ‖π̂ − π*‖_∞ ≤ ε and lip(π̂) ≤ α. Then the closed-loop trajectory f^t_{π̂_δ}(x) starting from x ∈ B_r(0) and generated by the policy π̂_δ satisfies a robust exponential stability bound.
We refer the reader to Appendix A-B for the proof. The robust stability result can be understood in the sense of input-to-state stability [35], [36], in that we exploit the exponential stability result for the expert policy π* and treat the learned policy π̂ as a perturbation on π*. By obtaining boundedness of the learning error along the closed-loop trajectory, we establish that the closed-loop trajectory under the learned policy both stays within a bounded region around the optimal trajectory and converges asymptotically to a bounded region around the origin.

B. Regret and robustness with learned feedback policy
Having clarified the issue of robust stability, we now present a regret analysis for the learned control policy π̂. We first present the following lemma on an incremental exponential stability property of exponentially stabilizing Lipschitz feedback policies on B_r(0):
Lemma III.4 (Incremental exponential stability). Let π be an exponentially stabilizing Lipschitz feedback policy for System (1). Then the closed-loop trajectories starting from any x_1, x_2 ∈ B_r(0) converge to each other exponentially, with constant M(x_1, x_2) and rate β.
We make an assumption on the existence of a uniform bound M on M(x_1, x_2) in Lemma III.4 over x_1, x_2 ∈ B_r(0) (Assumption III.5). The following theorem establishes a bound on the suboptimality of the closed-loop performance of System (1) under π̂ and a robustness bound for the deterioration of the closed-loop performance under bounded disturbances:
Theorem III.6 (Regret and robustness of learned policy). Let γ be as specified in Theorem III.3, let M, β be as in Lemma III.4 and Assumption III.5, and let π* be the minimizer in (2). (i) Regret: The regret R of the policy π̂ relative to π*, as defined in (5), admits an upper bound in terms of the learning error ε. (ii) Robustness: Let ‖δ_t‖ ≤ ζ for any t ∈ N. For any γ ∈ (0, 1), the robustness metric S of the policy π̂, as defined in (6), admits an upper bound in terms of ζ and the Lipschitz constant of π̂.
Theorem III.6-(i) establishes that the regret bound for the learned policy scales linearly with the deviation of the learned policy from the expert (optimal) policy. We also note that the regret bound scales with λ, the Lipschitz constant of the gradient of the stage cost, and the Lipschitz constant of the dynamics (w.r.t. u), as they modulate the sensitivity to variations of the input. Furthermore, we want the performance of the learned policy under disturbances to be close to its nominal performance, i.e., a low value of S.
Theorem III.6-(ii) establishes that the robustness of performance is determined by the sensitivity of the learned policy to disturbances, in particular that the robustness bound scales linearly with the Lipschitz constant of the learned policy. Theorem III.6-(ii) provides the designer with a robustness guarantee while implementing the learned policy in the presence of bounded (possibly adversarial) disturbances to measurements. Furthermore, we note that in the limit N → ∞ of the size of the dataset, Theorem III.6 suggests a tradeoff between the regret R(π̂) and the robustness metric S(π̂) as we vary the Lipschitz bound α in (3). As we decrease α, the deviation of the learned policy π̂ from the optimal policy π* increases, and so does the bound in Theorem III.6-(i) (via an increase in ε). Instead, as we increase α such that the constraint in (3) is no longer active, the learned policy converges to the optimal policy π*, and the bound in Theorem III.6-(i) decreases to zero. Similarly, as we decrease α, the Lipschitz constant of the learned policy π̂ decreases, and so does the bound in Theorem III.6-(ii). See Fig. 5 in Section V for an illustration of this tradeoff. Furthermore, we see that strong convexity of the cost induces stability properties and λ-smoothness allows for the tuning of regret.
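The tradeoff discussed above can be illustrated on a scalar discounted LQR problem where everything is available in closed form. The sketch below is only an illustration under stated assumptions: the system and cost parameters are hypothetical, the learned policy is modeled as the expert gain clipped to magnitude α (a stand-in for the Lipschitz-constrained minimizer in the large-data limit), and the regret is evaluated at unit initial state.

```python
def discounted_lqr_gain(a, b, q, r, gamma, iters=2000):
    """Optimal gain k for x+ = a x + b u, cost q x^2 + r u^2, discount gamma."""
    p = q
    for _ in range(iters):  # scalar Riccati (value) iteration
        p = q + gamma * a * a * p - (gamma * a * b * p) ** 2 / (r + gamma * b * b * p)
    return -gamma * a * b * p / (r + gamma * b * b * p)


def value_coefficient(k, a, b, q, r, gamma):
    """Discounted cost of u = k x from x0 = 1: (q + r k^2) / (1 - gamma (a + b k)^2)."""
    rate = gamma * (a + b * k) ** 2
    if rate >= 1.0:
        raise ValueError("discounted cost is unbounded for this gain")
    return (q + r * k * k) / (1.0 - rate)


def clipped_gain(k_star, alpha):
    """Linear policy with Lipschitz constant at most alpha, closest to the expert gain."""
    return max(-alpha, min(alpha, k_star))


# Hypothetical problem data; these values are assumptions made for illustration.
a, b, q, r, gamma = 1.0, 0.5, 1.0, 0.1, 0.9
k_star = discounted_lqr_gain(a, b, q, r, gamma)
v_star = value_coefficient(k_star, a, b, q, r, gamma)

alphas = [0.1 * i for i in range(21)]
tradeoff = []
for alpha in alphas:
    k = clipped_gain(k_star, alpha)
    nominal_regret = value_coefficient(k, a, b, q, r, gamma) - v_star
    sensitivity = abs(k)  # Lipschitz constant of the linear policy x -> k x
    tradeoff.append((alpha, nominal_regret, sensitivity))

for alpha, nominal_regret, sensitivity in tradeoff:
    print(f"alpha={alpha:.1f}  regret={nominal_regret:.4f}  lipschitz={sensitivity:.4f}")
```

In this toy setting the nominal regret shrinks as α grows and vanishes once α exceeds the magnitude of the expert gain, while the Lipschitz constant of the policy, which drives the robustness bound, grows with α until the constraint becomes inactive.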

IV. LIPSCHITZ-CONSTRAINED POLICY LEARNING
We now present results from our analysis of the Lipschitz-constrained policy learning problem (3). We note that the training data for the feedback policy learning problem (3) consists of evaluations of the expert policy π* over a finite set of points {x^(i)_t} ⊂ B_r(0) in the state space, and the objective is to generalize over the region of interest B_r(0). The following theorem establishes a (maximum) generalization error bound for the minimizer π̂ over the region B_r(0):
Theorem IV.1 (Finite sample guarantees on Lipschitz policy learning). Let π̂ be a minimizer in (3). For any δ > 0, the maximum learning error ‖π̂ − π*‖_∞ in B_r(0) satisfies a probabilistic bound in terms of the training error ε_train and the term (α + α*)δ.
We refer the reader to Appendix A-E for a proof of this result. Theorem IV.1 shows that although a larger α allows for achieving a lower ε_train, it can result in worse generalization performance. This is due to the fact that the (α + α*)δ term in the bound scales linearly with α, which can potentially result in a higher maximum learning error ‖π̂ − π*‖_∞. Furthermore, from Appendix A-E-(c) in the proof of Theorem IV.1, we remark that the probabilistic bound in Theorem IV.1 is a worst-case bound which can potentially be tightened. In particular, the tightness of the estimate provided by the bound worsens with an increase in the length T of the control horizon in the demonstrations.
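The origin of the (α + α*)δ term can be seen from a short triangle-inequality argument. The derivation below is a sketch under stated assumptions, not the proof in the appendix: it takes a point x ∈ B_r(0) that lies within distance δ of some training point x_i, takes ε_train to be the largest error at the training points, and uses α* for the Lipschitz constant of the expert policy π*.

```latex
\begin{aligned}
\|\hat{\pi}(x) - \pi^*(x)\|
&\le \|\hat{\pi}(x) - \hat{\pi}(x_i)\| + \|\hat{\pi}(x_i) - \pi^*(x_i)\| + \|\pi^*(x_i) - \pi^*(x)\| \\
&\le \alpha\,\|x - x_i\| + \varepsilon_{\mathrm{train}} + \alpha^*\,\|x_i - x\| \\
&\le \varepsilon_{\mathrm{train}} + (\alpha + \alpha^*)\,\delta .
\end{aligned}
```

The probabilistic part of a finite sample guarantee then concerns how likely it is that every point of B_r(0) has a training point within distance δ, which this sketch does not address.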
We finally note that Theorems III.3 and III.6 establish robust stability and performance bounds for policies π̂ that satisfy (i) ‖π̂ − π*‖_∞ ≤ ε, and (ii) lip(π̂) ≤ α, whereas Theorem IV.1 yields a (probabilistic) bound on the violation of the condition ‖π̂ − π*‖_∞ ≤ ε for finite datasets of size N (while the Lipschitz bound still holds). Therefore, by combining the bounds in Theorems III.3 and III.6 with the finite sample bound in Theorem IV.1, we obtain the desired closed-loop generalization and robustness bounds.
We now present a graph-based Lipschitz policy learning algorithm to solve (3). We sample n points {X_i}_{i=1}^n, i.i.d. uniformly from B_r(0). Considering the points {X_i}_{i=1}^n as the (embedding of) vertices, we construct an undirected, weighted, connected graph G = (V, E), with vertex set V = {1, ..., n} and edge set E ⊆ V × V. We then define a partition W = {W_i}_{i=1}^n of the training dataset D = {x^(i)_t} (the set of points in the state space where evaluations of the expert policy are available). Finally, we write the discrete (empirical) Lipschitz-constrained policy learning problem over the graph G as Problem (8), which can be viewed as the discretization of (3) over the graph G.
Fig. 2. This figure shows the surface of policy π̂ in the state space for system (11), which is learned using Alg. 1 with α = 1 (red surface) and α = 0.1 (green surface), and the expert being the LQR for system (11).
We note that Problem (8) is convex (strictly convex objective function with convex constraints). In the corresponding Lagrangian L_G(u, Λ), Λ = [λ_ij]_{i,j=1}^n is the matrix of Lagrange multipliers for the pairwise Lipschitz constraints. We define the primal-dual dynamics (9) for the Lagrangian L_G(u, Λ) with time-step sequence {h(k)}_{k∈N}. The primal dynamics is a discretized heat flow over the graph G with a weighted Laplacian, where ∆(Λ) is the Λ-weighted Laplacian of the graph G. The convergence of the solution {(u(k), Λ(k))}_{k∈N} of the primal-dual dynamics (9) to the saddle point of the Lagrangian L_G follows [37] from the convexity of Problem (8).
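A primal-dual iteration of this kind can be sketched on a small path graph. The code below is a minimal illustration under stated assumptions and is not the paper's Alg. 1: it uses a scalar state and input, a squared loss with one expert sample per vertex, pairwise constraints written as (u_i − u_j)² ≤ α²(X_i − X_j)², a constant time step, and a hypothetical expert π*(x) = 2x. The function name `primal_dual_lipschitz_fit` is introduced here for illustration.

```python
def primal_dual_lipschitz_fit(X, y, edges, alpha, step=0.02, iters=20000):
    """Primal-dual iteration for a Lipschitz-constrained fit on a graph.

    Minimizes 0.5 * sum_i (u_i - y_i)^2 subject to
    (u_i - u_j)^2 <= (alpha * (X_i - X_j))^2 for every edge (i, j).
    """
    u = list(y)                      # start from the unconstrained fit
    lam = {e: 0.0 for e in edges}    # one multiplier per pairwise constraint
    for _ in range(iters):
        # Gradient of the Lagrangian in u: loss term plus multiplier-weighted
        # Laplacian term 2 * Delta(Lambda) u (a heat-flow-like coupling).
        grad = [u[i] - y[i] for i in range(len(u))]
        for (i, j), l in lam.items():
            flow = 2.0 * l * (u[i] - u[j])
            grad[i] += flow
            grad[j] -= flow
        # Constraint values at the current iterate (positive means violated).
        viol = {(i, j): (u[i] - u[j]) ** 2 - (alpha * (X[i] - X[j])) ** 2
                for (i, j) in edges}
        # Primal descent step and projected dual ascent step.
        u = [ui - step * gi for ui, gi in zip(u, grad)]
        lam = {e: max(0.0, lam[e] + step * viol[e]) for e in edges}
    return u, lam


# Hypothetical data: five vertices on a line, expert policy pi*(x) = 2 x.
X = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [2.0 * x for x in X]
edges = [(i, i + 1) for i in range(len(X) - 1)]
alpha = 1.0

u, lam = primal_dual_lipschitz_fit(X, y, edges, alpha)
slopes = [abs(u[j] - u[i]) / abs(X[j] - X[i]) for (i, j) in edges]
print("fitted values:", u)
print("largest slope:", max(slopes))
```

Because the expert slope exceeds α in this example, all pairwise constraints are active at the solution and the fitted values have slope α between neighbouring vertices. Choosing α above the expert slope would leave the multipliers at zero and return the expert values unchanged.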

V. NUMERICAL EXPERIMENTS
In this section, we present the results from numerical experiments applying our algorithm to (i) learn the Linear Quadratic Regulator (LQR), and (ii) learn nonlinear control for a nonholonomic system (differential drive mobile robot).

A. Learning the Linear Quadratic Regulator
We consider a vehicle obeying the following dynamics (see also [30] and [38]): (10) where $x_t \in \mathbb{R}^4$ contains the vehicle's position and velocity in Cartesian coordinates, $u_t \in \mathbb{R}^2$ the input signal, $y_t \in \mathbb{R}^4$ the state measurement, $\delta_t \in \mathbb{R}^4$ bounded measurement noise with $\|\delta_t\| \le \zeta$ and $\zeta \in \mathbb{R}_{\ge 0}$, and $T_s$ the sampling time. We consider the problem of tracking a reference trajectory, and we write the error dynamics and the controller as in (11), where $e_t = x_t - \xi_t$ is the error between the system state and the reference state $\xi_t \in \mathbb{R}^4$ at time $t$, $v_t \in \mathbb{R}^2$ is the control input generating $\xi_t$, and $K$ denotes the control gain. We take the expert policy to be the optimal LQR gain, $K_{\mathrm{lqr}}$, which minimizes a discounted value function as in (2) but with horizon $T$ and quadratic stage cost $c(e_t, \bar{u}_t) = e_t^\top Q e_t + \bar{u}_t^\top R \bar{u}_t$, with error and input weighting matrices $Q \succ 0$ and $R \succ 0$, respectively. Notice that the quadratic stage cost is strongly convex and Lipschitz bounded over the bounded space $e \in B_r(0) \subset \mathbb{R}^4$ and $\bar{u} \in \mathbb{R}^2$.

Expert demonstrations. We generate $N$ expert trajectories using (11) with $K = K_{\mathrm{lqr}}$, $T_s = 0.1$, $\gamma = 0.82$, $Q = 0.1 I_4$, $R = 0.1 I_2$, and $\delta_t = 0$, contained in the data matrices $E = [e^{(1)} \cdots e^{(N)}]$ and $U = [u^{(1)} \cdots u^{(N)}]$, where $e^{(i)}$ and $u^{(i)}$ collect the tracking errors and control inputs of the $i$-th trajectory. Each trajectory is generated with a random initial condition $e^{(i)}_0 \in B_2(0)$ for $i = 1, \dots, N$. Note that, since the initial conditions $e^{(i)}_0$ are inside $B_2(0)$ and $K = K_{\mathrm{lqr}}$ is stabilizing, all the data points in $E$ are inside $B_2(0)$.

Policy learning. Using Alg. 1, we learn the policy $\hat{\pi}$ with $\alpha = 1$ and $\alpha = 0.1$, denoted by $\hat{\pi}_{\alpha=1}$ and $\hat{\pi}_{\alpha=0.1}$, respectively. Fig. 2 shows the surface of the learned policies $\hat{\pi}_{\alpha=1}$ and $\hat{\pi}_{\alpha=0.1}$ in the state space. Since the Lipschitz constant of the expert policy $\pi^* = K_{\mathrm{lqr}}$ is $\ell_{\pi^*} = \|K_{\mathrm{lqr}}\|_2 = 0.51 < \alpha = 1$, we get $\|\hat{\pi}_{\alpha=1} - \pi^*\|_2 = 0$, which implies that $\hat{\pi}_{\alpha=1}$ learns exactly the expert policy. On the other hand, since $\alpha = 0.1 < \ell_{\pi^*} = 0.51$, we get $\|\hat{\pi}_{\alpha=0.1} - \pi^*\|_2 = \varepsilon$ for some $\varepsilon > 0$, which implies that $\hat{\pi}_{\alpha=0.1}$ learns the expert policy with learning error $\varepsilon$. As observed in Fig. 2, the Lipschitz constant constrains the slope of the learned surface: $\hat{\pi}_{\alpha=0.1}$ has a smaller slope than $\hat{\pi}_{\alpha=1}$, and is hence more robust to perturbations in the states. However, a smaller Lipschitz constant implies a larger learning error, and hence poorer nominal performance.

Fig. 3 shows the trajectory tracking performance for the optimal LQR controller, $\hat{\pi}_{\alpha=1}$, and $\hat{\pi}_{\alpha=0.1}$. The policies are deployed in nominal conditions, Fig. 3(a), and in non-nominal conditions with $\zeta = 0.5$, Fig. 3(b). We observe in Fig. 3(a) that $\hat{\pi}_{\alpha=1}$ performs better than $\hat{\pi}_{\alpha=0.1}$ in nominal conditions. On the other hand, we observe in Fig. 3(b) that the performance of $\hat{\pi}_{\alpha=1}$ degrades when deployed in non-nominal conditions, while the performance of $\hat{\pi}_{\alpha=0.1}$ remains almost the same, as predicted by [7].
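The effect of the Lipschitz bound on a linear policy can be illustrated with a short sketch. It is not Alg. 1: it assumes a hypothetical gain `K` (not the $K_{\mathrm{lqr}}$ of the experiment) and enforces the bound by clipping singular values, which gives the closest matrix in Frobenius norm whose spectral norm is at most `alpha`. When `alpha` exceeds the Lipschitz constant of the linear policy, the constraint is inactive and the gain is recovered exactly; otherwise a nonzero error remains.

```python
import numpy as np

def lipschitz_constant(K):
    """Lipschitz constant of the linear policy x -> K x (spectral norm of K)."""
    return np.linalg.norm(K, 2)

def clip_to_lipschitz(K, alpha):
    """Closest matrix to K (in Frobenius norm) with spectral norm <= alpha."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return U @ np.diag(np.minimum(s, alpha)) @ Vt

# Hypothetical 2x4 gain, chosen only for illustration.
K = np.array([[-0.4, 0.0, -0.3, 0.0],
              [0.0, -0.4, 0.0, -0.3]])

ell = lipschitz_constant(K)                 # both singular values equal 0.5
K_loose = clip_to_lipschitz(K, alpha=1.0)   # alpha > ell: constraint inactive
K_tight = clip_to_lipschitz(K, alpha=0.1)   # alpha < ell: constraint active

print("Lipschitz constant of K:", ell)
print("error with alpha = 1.0:", np.linalg.norm(K_loose - K, 2))
print("error with alpha = 0.1:", np.linalg.norm(K_tight - K, 2))
```

In this toy setting the error with the tight bound plays the role of the learning error $\varepsilon$ discussed above.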
Regret bounds. The parameters of the bounds in Theorem III.6 are obtained as follows: $\lambda = \max\{\rho(2Q), \rho(2R)\}$, $\alpha^* = \|K_{\mathrm{lqr}}\|_2$, $\ell^u_f = \|B\|_2$, and $\Theta = \frac{1}{1 - \gamma \rho(A + BK)^2}$, where $K$ is a stabilizing gain. Fig. 4 shows the regrets $R(\hat{\pi})$ and $S(\hat{\pi})$ in (5) and (6), and the corresponding upper bounds derived in Theorem III.6, as a function of the Lipschitz bound $\alpha$ in (3). As can be seen, the regret $R(\hat{\pi})$ and the corresponding upper bound in Theorem III.6-(i) decrease as $\alpha$ increases, while the regret $S(\hat{\pi})$ and the corresponding upper bound in Theorem III.6-(ii) increase with $\alpha$. Further, the regrets and the bounds remain constant for $\alpha \ge 0.51$, since the constraint in (3) becomes inactive and $\hat{\pi}$ converges to the optimal LQR controller. Fig. 5 shows the tradeoff between the regrets, as well as the tradeoff between the upper bounds on the regrets, as we vary the Lipschitz bound $\alpha$ in (3). This suggests that improving the robustness of the learned policy to perturbations comes at the expense of its nominal performance.
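As a hedged illustration of how two of these parameters can be evaluated, the sketch below computes $\lambda$ from the weights $Q = 0.1 I_4$ and $R = 0.1 I_2$ used above, and $\Theta$ for a hypothetical stable closed-loop matrix `A_cl` standing in for $A + BK$ (the actual matrices of (10) are not reproduced here). The closed form for $\Theta$ is compared with a truncated version of the series $\sum_t \gamma^t \theta_t^2$ with $\theta_t = M\beta^t$, taking $M = 1$ and $\beta = \rho(A + BK)$.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def smoothness_constant(Q, R):
    """lambda = max{rho(2Q), rho(2R)} for the quadratic stage cost."""
    return max(spectral_radius(2 * Q), spectral_radius(2 * R))

def theta(A_cl, gamma):
    """Theta = 1 / (1 - gamma * rho(A_cl)^2), valid when gamma * rho^2 < 1."""
    rho = spectral_radius(A_cl)
    if gamma * rho**2 >= 1:
        raise ValueError("geometric series does not converge")
    return 1.0 / (1.0 - gamma * rho**2)

Q = 0.1 * np.eye(4)
R = 0.1 * np.eye(2)
gamma = 0.82

# Hypothetical stable closed-loop matrix (assumption, not from the paper).
A_cl = np.array([[0.9, 0.1, 0.0, 0.0],
                 [0.0, 0.8, 0.0, 0.0],
                 [0.0, 0.0, 0.9, 0.1],
                 [0.0, 0.0, 0.0, 0.8]])

lam = smoothness_constant(Q, R)
Theta = theta(A_cl, gamma)

# Truncated series sum_t gamma^t * beta^(2t) with beta = rho(A_cl), M = 1.
beta = spectral_radius(A_cl)
partial_sum = sum(gamma**t * beta**(2 * t) for t in range(500))

print("lambda:", lam)
print("Theta (closed form):", Theta, " truncated series:", partial_sum)
```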

B. Learning nonlinear control for nonholonomic system
We consider a nonholonomic differential-drive mobile robot (see Fig. 6) obeying the following discrete-time nonlinear dynamics for $t \ge 0$: (12) where $x_t \in \mathbb{R}$ and $y_t \in \mathbb{R}$ are the position of the robot's centroid in the Cartesian coordinate frame $(O; x, y)$, $\theta_t \in \mathbb{R}$ is the robot's orientation, $v_t \in \mathbb{R}$ and $\omega_t \in \mathbb{R}$ are the robot's forward and angular velocities at time $t$, respectively, which are the system's inputs, and $T_s > 0$ is the sampling time. The dynamics in (12) can be written in the following vector form: (13) Let $r_t = [r^x_t, r^y_t]^\top$ be a point fixed on the robot at a fixed distance $d$ from $[x_t, y_t]^\top$ (see Fig. 6). The task is to stabilize the point $r_t$ at $[0, 0]^\top$, which is described by the following regulator problem: (14) where $\gamma$ is the discount factor, and $Q \succ 0$ and $R \succ 0$ are weighting matrices.

Expert demonstrations. We consider the expert policy, $\pi^*$, to be the minimizer of (14). The derivation of $\pi^*$ for this example is presented in Appendix A-F.

Policy learning. Using Alg. 1, we learn the policy $\hat{\pi}$ with $\alpha = 50$ and $\alpha = 0.5$, denoted by $\hat{\pi}_{\alpha=50}$ and $\hat{\pi}_{\alpha=0.5}$, respectively. Fig. 7 shows the surface of the learned policies $\hat{\pi}_{\alpha=50}$ and $\hat{\pi}_{\alpha=0.5}$ that corresponds to the input $\omega$ in the subspace $[x, y]^\top$ for $\theta = 0$. Since the Lipschitz constant of the expert policy is $\ell_{\pi^*} = 16.65 < \alpha = 50$, the policy $\hat{\pi}_{\alpha=50}$ learns exactly the expert policy. On the other hand, since $\alpha = 0.5 < \ell_{\pi^*} = 16.65$, the policy $\hat{\pi}_{\alpha=0.5}$ learns the expert policy with some learning error. As observed in Fig. 7, the Lipschitz constant constrains the slope of the learned surface: $\hat{\pi}_{\alpha=0.5}$ has a smaller slope than $\hat{\pi}_{\alpha=50}$, and is hence more robust to perturbations in the states. However, since $\hat{\pi}_{\alpha=0.5}$ has a larger learning error, it has poorer nominal performance.

Fig. 8 shows the trajectory of the point $(r^x_t, r^y_t)$ (see Fig. 6) induced by the expert policy, $\hat{\pi}_{\alpha=50}$, and $\hat{\pi}_{\alpha=0.5}$, in nominal conditions, Fig. 8(a), and in non-nominal conditions, Fig. 8(b). We observe in Fig. 8(a) that $\hat{\pi}_{\alpha=50}$ performs as well as the expert and better than $\hat{\pi}_{\alpha=0.5}$ in nominal conditions. On the other hand, we observe in Fig. 8(b) that the performance of $\hat{\pi}_{\alpha=50}$ and that of the expert degrade when deployed in non-nominal conditions, while the performance of $\hat{\pi}_{\alpha=0.5}$ remains almost the same.

Fig. 5. Panel (a) shows the tradeoff between the regrets $R(\hat{\pi})$ and $S(\hat{\pi})$ in (5) and (6), respectively. Panel (b) shows the tradeoff between the upper bounds on $R(\hat{\pi})$ and $S(\hat{\pi})$ derived in Theorem III.6, respectively.

VI. CONCLUSION
In this paper, we propose a framework to learn feedback control policies with provable robustness guarantees. Our approach draws from our earlier work [7], where we formulate the adversarially robust learning problem as one of Lipschitz-constrained loss minimization. We adapt this framework to the problem of learning robust feedback policies from a dataset obtained from expert demonstrations. We establish robust stability of the closed-loop system under the learned feedback policy. Further, we derive upper bounds on the regret and robustness of the learned feedback policy, which bound its nominal suboptimality with respect to the expert policy and the deterioration of its performance under bounded (adversarial) disturbances to state measurements, respectively. The above bounds suggest the existence of a tradeoff between nominal performance of the feedback policy and closed-loop robustness to adversarial perturbations on the feedback. This tradeoff is also evident in our numerical experiments, where improving closed-loop robustness leads to a deterioration of the nominal performance. Finally, we demonstrate our results and the effectiveness of our robust feedback policy learning framework on several benchmarks.

$\sum_{t=0}^{\infty} \gamma^t c(x_t, u_t) = 0$, it follows that $V^*(0) = 0$. Now, we have that $\pi^*(0) \in \arg\min_{u \in \mathbb{R}^m} c(0, u) + \gamma V^*(f(0, u))$, from which we clearly get that $\pi^*(0) = 0$ is the only minimizer. Now, since $c$ is $\mu$-strongly convex, with $c(0, 0) = 0$, we have $c_{\pi^*}(x) \ge \mu \|x\|^2 / 2$. Furthermore, since $V^*(x) = \min_{u \in \mathbb{R}^m} c(x, u) + \gamma V^*(f(x, u))$, we get: Let $\pi$ be a Lipschitz-continuous (with constant $\alpha$), exponentially stabilizing feedback policy as in Assumption II.1. We then have: We then have: and the statement of the lemma follows.
B. Proof of Theorem III.3
(i) Exponential stability under expert (optimal) feedback policy: We now recall that $V^*(x) = c_{\pi^*}(x) + \gamma V^*(f_{\pi^*}(x))$ and $\underline{\kappa}^* \|x\|^2 \le V^*(x) \le \overline{\kappa}^* \|x\|^2$ (from Lemma III.1). It then follows that: It follows from the above inequality and the quadratic boundedness of $V^*$ that $f_{\pi^*}$ is uniformly globally exponentially convergent [39]. In what follows, we obtain an estimate for the upper bound on $\|f^t_{\pi^*}(x)\|$. From the above inequality and the fact that V . Therefore, we get:

(ii) Robust stability under learned policy: For $x \in B_r(0)$, let $x_t = f^t_{\hat{\pi}}(x)$ and $x^*_t = f^t_{\pi^*}(x)$. We have: It then follows that $\|\hat{\pi}(x_t) - \pi^*(x_t)\| \le \varepsilon$ for any $t \in \mathbb{N}$. We then have: Furthermore, we have from part (i) that $x^*_t \in B_r(0)$ for any $t \in \mathbb{N}$. Therefore, $B_r(0)$ is invariant under $f_{\pi^*}$ and $f_{\hat{\pi}}$, and we immediately obtain the uniform bound $\|x_t - x^*_t\| \le 2r$. We now have: where the final equality holds for $\ell_{f_{\pi^*}} \neq 1$. If $\ell_{f_{\pi^*}} = 1$, then we have: We also have: Now, for the policy $\hat{\pi}_\delta$, we have $\|\hat{\pi}_\delta - \pi^*\|_{(B_r(0),\infty)} \le \|\hat{\pi}_\delta - \hat{\pi}\|_{(B_r(0),\infty)} + \|\hat{\pi} - \pi^*\|_{(B_r(0),\infty)} \le \alpha\zeta + \varepsilon$, and the earlier analysis now carries through with this bound, and the statement of the theorem follows.
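The last step of the proof relies on the fact that an $\alpha$-Lipschitz policy evaluated at a measurement perturbed by at most $\zeta$ changes its output by at most $\alpha\zeta$. The sketch below checks this numerically for a hypothetical nonlinear policy, a saturated linear map $x \mapsto \tanh(Kx)$ with a random `K`, which is Lipschitz with constant at most $\|K\|_2$. The policy, the radius 2, and the sample count are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical policy: x -> tanh(K x). Since tanh is 1-Lipschitz elementwise,
# alpha = ||K||_2 is a valid Lipschitz bound for this policy.
K = rng.standard_normal((2, 4))
alpha = np.linalg.norm(K, 2)

def policy(x):
    return np.tanh(K @ x)

def random_in_ball(radius, dim):
    """Draw a point with norm at most `radius` (not uniform; enough for a check)."""
    v = rng.standard_normal(dim)
    return radius * rng.uniform() * v / np.linalg.norm(v)

zeta = 0.5   # bound on the measurement perturbation
worst = 0.0
for _ in range(1000):
    x = random_in_ball(2.0, 4)
    delta = random_in_ball(zeta, 4)
    worst = max(worst, np.linalg.norm(policy(x + delta) - policy(x)))

print("largest observed change:", worst, " bound alpha * zeta:", alpha * zeta)
```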
C. Proof of Lemma III.4
From the exponential stability of $f_{\hat{\pi}}$ and the $f_{\hat{\pi}}$-invariance of $B_r(0)$, for $x, x' \in B_r(0)$, we have

D. Proof of Theorem III.6
The following lemma establishes a difference bound for the value function under a Lipschitz feedback policy: ) of policy $\hat{\pi}$, the following holds: where $\Theta = \sum_{t=0}^{\infty} \gamma^t \theta_t^2$ and $\theta_t = M \beta^t$.

Proof. We first note that $(0, 0) \in B_r(0) \times \mathbb{R}^m$ is a strict minimizer of $c$ (by Assumption II.2) and, since $c$ is differentiable, we have $\nabla c(0, 0) = 0$. For any $x \in B_r(0)$: since $\|\hat{\pi}(x)\| = \|\hat{\pi}(x) - \hat{\pi}(0)\| \le \alpha \|x\|$. For any $x, x' \in B_r(0)$, let $p$ be the straight line segment between $x$ and $x'$, such that $p(t) = x + t(x' - x)$ for $t \in [0, 1]$. From the $\lambda$-smoothness of $c$, we have: We also have: and therefore we get: We now have: and the statement of the lemma follows. The statement of the theorem follows from the above two inequalities.

E. Proof of Theorem IV.1
We first let $X_N = \{x^{(1)}, \dots, x^{(N)}\}$ where $x^{(i)} = x$

In particular, the following holds: where we assumed that $T_s$ is very small and used the approximations $\sin(T_s \omega_t) \approx T_s \omega_t$ and $\cos(T_s \omega_t) \approx 1$. Let $[v_t, \omega_t]^\top = R^{-1} [\mu^x_t, \mu^y_t]^\top$; then (16) is written as $r_{t+1} = r_t + \mu_t$, where $\mu_t = [\mu^x_t, \mu^y_t]^\top$.
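The change of inputs above can be checked numerically with a small sketch. Since the dynamics (12) are not reproduced here, the code assumes an Euler-discretized unicycle and assumes that the point $r_t$ lies at distance $d$ ahead of the centroid along the heading; the matrix `Rmat` is this sketch's stand-in for $R$, and all numerical values are hypothetical. Under these assumptions the point satisfies $r_{t+1} = r_t + \mu_t$ up to an error of at most $d (T_s \omega_t)^2 / 2$, which is the residual of the small-angle approximation.

```python
import numpy as np

def step(state, v, omega, Ts):
    """One step of an Euler-discretized unicycle (assumed model, not eq. (12))."""
    x, y, th = state
    return np.array([x + Ts * v * np.cos(th),
                     y + Ts * v * np.sin(th),
                     th + Ts * omega])

def point(state, d):
    """Point r at distance d ahead of the centroid along the heading (assumption)."""
    x, y, th = state
    return np.array([x + d * np.cos(th), y + d * np.sin(th)])

def inputs_from_mu(state, mu, d, Ts):
    """Solve Rmat [v, omega]^T = mu for the first-order model r_next = r + Rmat [v, omega]^T."""
    th = state[2]
    Rmat = Ts * np.array([[np.cos(th), -d * np.sin(th)],
                          [np.sin(th),  d * np.cos(th)]])
    return np.linalg.solve(Rmat, mu)

Ts, d = 0.1, 0.1                      # hypothetical sampling time and offset
state = np.array([1.0, -0.5, 0.3])    # hypothetical [x, y, theta]
r = point(state, d)
mu = -0.05 * r                        # small step of r toward the origin

v, omega = inputs_from_mu(state, mu, d, Ts)
r_next = point(step(state, v, omega, Ts), d)

gap = np.linalg.norm(r_next - (r + mu))
bound = d * (Ts * omega) ** 2 / 2
print("gap:", gap, " small-angle bound:", bound)
```

The gap shrinks quadratically with the heading change per step, $T_s \omega_t$, which is consistent with the approximation being accurate only when that quantity is small.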

Fig. 1. The block diagram in panel (a) corresponds to the implementation of the learned control policy $\hat{\pi}$ in non-nominal conditions under adversarial perturbations $\delta$ on the state measurement. Panel (b) illustrates the Lipschitz-constrained policy learning scheme implemented on the expert-generated dataset to obtain policy $\hat{\pi}$.

Algorithm 1 Graph-based Lipschitz policy learning
Input: Training data, graph size n, number of edges |E|, Lipschitz bound α, number of iterations k
1: Sample n points (graph vertices) uniformly i.i.d. from B_r(0)
2: Partition training dataset as in (7)
3: Implement k iterations of primal-dual algorithm (9)
Output: Minimizer u
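Step 1 of Alg. 1 and the Lipschitz condition on graph edges can be sketched as follows. The code does not implement the partition in (7) or the primal-dual iterations in (9); it only draws vertices uniformly from $B_r(0)$ (Gaussian directions with radii scaled by $U^{1/\mathrm{dim}}$) and measures the largest slope of a candidate vertex assignment over an edge set. The random edge set, the linear assignment `U = X K^T`, and all sizes are hypothetical choices made for illustration.

```python
import numpy as np

def sample_ball(n, dim, r, rng):
    """Draw n points uniformly i.i.d. from the Euclidean ball B_r(0) in R^dim."""
    directions = rng.standard_normal((n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = r * rng.uniform(size=(n, 1)) ** (1.0 / dim)
    return radii * directions

def max_edge_slope(X, U, edges):
    """Largest ratio ||u_i - u_j|| / ||x_i - x_j|| over the given graph edges."""
    i, j = edges[:, 0], edges[:, 1]
    du = np.linalg.norm(U[i] - U[j], axis=1)
    dx = np.linalg.norm(X[i] - X[j], axis=1)
    return np.max(du / dx)

rng = np.random.default_rng(0)
X = sample_ball(n=200, dim=4, r=2.0, rng=rng)   # graph vertices

# Hypothetical edge set: random pairs of distinct vertices.
pairs = rng.integers(0, 200, size=(500, 2))
edges = np.array([(i, j) for i, j in pairs if i != j])

# Hypothetical vertex values from a linear map; its slope cannot exceed ||K||_2.
K = 0.1 * rng.standard_normal((2, 4))
U = X @ K.T

slope = max_edge_slope(X, U, edges)
print("max edge slope:", slope, " spectral norm of K:", np.linalg.norm(K, 2))
```

A vertex assignment is feasible for a Lipschitz bound $\alpha$ on this graph when the reported slope does not exceed $\alpha$.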

Fig. 3. Panel (a) and panel (b) show trajectory tracking performance for the LQR (dashed blue line) and the policy $\hat{\pi}$ learned using Alg. 1 with $\alpha = 1$ (dash-dotted red line) and $\alpha = 0.1$ (solid green line). In panel (a), the policies are deployed in nominal conditions. The policy $\hat{\pi}_{\alpha=1}$ performs as well as the LQR, while the policy $\hat{\pi}_{\alpha=0.1}$ performs poorly compared to the LQR and $\hat{\pi}_{\alpha=1}$. In panel (b), the policies are deployed in non-nominal conditions. The performance of the LQR and policy $\hat{\pi}_{\alpha=1}$ is worse than when deployed in nominal conditions, while the performance of policy $\hat{\pi}_{\alpha=0.1}$ in non-nominal conditions remains almost the same as in nominal conditions.

Fig. 4. Panel (a) and panel (b) show the true regrets $R(\hat{\pi})$ and $S(\hat{\pi})$ in (5) and (6) (solid blue line), and the regret bounds in Theorem III.6 (dashed red line), as a function of the Lipschitz bound $\alpha$ in (3), respectively. The regret $R(\hat{\pi})$ and the bound in Theorem III.6-(i) decrease as $\alpha$ increases, as shown in panel (a), while the regret $S(\hat{\pi})$ and the bound in Theorem III.6-(ii) increase with $\alpha$, as shown in panel (b).

Fig. 7. This figure shows the surface of policy $\hat{\pi}$ that corresponds to the input $\omega$ in the subspace $[x, y]^\top$ for system (13) with $\theta = 0$. Two policies are learned using Alg. 1 with $\alpha = 50$ (red surface) and $\alpha = 0.5$ (green surface), and the expert demonstrations are generated as in Appendix A-F.

Fig. 8. Panel (a) and panel (b) show the trajectory of the expert (solid blue line) and the policy $\hat{\pi}$ learned using Alg. 1 with $\alpha = 50$ (dashed red line) and $\alpha = 0.5$ (dash-dotted green line). In panel (a), the policies are deployed in nominal conditions. The policy $\hat{\pi}_{\alpha=50}$ outputs the same trajectory as the expert, while the policy $\hat{\pi}_{\alpha=0.5}$ outputs a different trajectory towards the equilibrium. In panel (b), the policies are deployed in non-nominal conditions. The performance of the expert and policy $\hat{\pi}_{\alpha=50}$ is worse than when deployed in nominal conditions, while the performance of policy $\hat{\pi}_{\alpha=0.5}$ in non-nominal conditions remains almost the same as in nominal conditions.
state, $u_t \in \mathbb{R}^m$ the control input and $y_t \in \mathbb{R}^n$ the full-state measurement at time $t \in \mathbb{N}$, respectively, with disturbance $\|\delta_t\| \le \zeta$ for any $t \in \mathbb{N}$.