Certifying Black-Box Policies With Stability for Nonlinear Control

Machine-learned black-box policies are ubiquitous for nonlinear control problems. Meanwhile, crude model information is often available for these problems from, e.g., linear approximations of nonlinear dynamics. We study the problem of certifying a black-box control policy with stability using model-based advice for nonlinear control on a single trajectory. We first show a general negative result that a naive convex combination of a black-box policy and a linear model-based policy can lead to instability, even if the two policies are both stabilizing. We then propose an adaptive λ-confident policy, with a coefficient λ indicating the confidence in a black-box policy, and prove its stability. With bounded nonlinearity, in addition, we show that the adaptive λ-confident policy achieves a bounded competitive ratio when a black-box policy is near-optimal. Finally, we propose an online learning approach to implement the adaptive λ-confident policy and verify its efficacy in case studies about the Cart-Pole problem and a real-world electric vehicle (EV) charging problem with covariate shift due to COVID-19.


I. INTRODUCTION
Deep neural network (DNN)-based control methods such as deep reinforcement learning and imitation learning have attracted great interest due to their success on a wide range of control tasks such as humanoid locomotion [1], playing Atari [2], and 3D racing games [3]. These methods are typically model-free and are capable of learning policies and value functions for complex control tasks directly from raw data. In real-world applications such as autonomous driving, it is impractical to dynamically update an already-deployed policy. In those cases, pre-trained black-box policies are applied. Those partially optimized solutions can sometimes be optimal or near-optimal, but they can also be arbitrarily poor when there is unexpected environmental behavior due to, e.g., sample inefficiency [4], reward sparsity [5], mode collapse [6], high variability of policy gradients [7], [8], or biased training data [9]. This uncertainty raises significant concerns about applications of these tools in safety-critical settings. Meanwhile, for many real-world control problems, crude information about system models exists, e.g., linear approximations of their state transition dynamics [10], [11]. Such information can be useful in providing model-based advice to the machine-learned policies.
To represent such situations, in this paper we consider the following infinite-horizon dynamical system consisting of a known affine part, used for model-based advice, and an unknown nonlinear residual function, which is (implicitly) used in developing machine-learned (DNN-based) policies:
$$x_{t+1} = Ax_t + Bu_t + f_t(x_t, u_t), \quad t \ge 0, \tag{1}$$
where $x_t \in \mathbb{R}^n$ and $u_t \in \mathbb{R}^m$ are the system state and the action selected by a controller at time $t$, and $A$ and $B$ are coefficient matrices in the affine part of the system dynamics. Besides the linear components, the system also has a state- and action-dependent nonlinear residual $f_t: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ at each time $t \ge 0$, representing the modelling error. The matrices $A$ and $B$ are fixed and known coefficients in the linear approximation of the true dynamics (1).
Given the affine part of the dynamics (1), it is possible to construct a model-based feedback controller $\overline{\pi}$, e.g., a linear quadratic regulator or an $H_\infty$ controller. Compared with a DNN-based policy $\widehat{\pi}(x_t)$, the induced linear controller often performs worse on average due to model bias, but is more stable in an adversarial setting. In other words, a DNN-based policy can be as good as an optimal policy in domains where accurate training data has been collected, but can perform highly sub-optimally in other situations, such as when covariate shift happens; a policy based on a linearized system is stabilizing, so it has a guaranteed worst-case performance under bounded system perturbations, but it can lose out on performance to DNN-based policies in non-adversarial situations. To illustrate this trade-off, we may look no further than widely used RL algorithms in well-established RL benchmark environments. Consider the Cart-Pole problem (see Section V) in Fig. 1. The figure shows that a pre-trained trust region policy optimization (TRPO) [12], [13] agent and an augmented random search (ARS) [14] agent achieve lower costs when the initial angle of the pole is small, but become less stable when the initial angle increases. Using the affine part of the nonlinear dynamics, a linear quadratic regulator achieves better performance when the initial angle becomes large. Motivated by this trade-off, we ask the following question in this paper: Can we certify a sub-optimal machine-learned policy $\widehat{\pi}$ with stability guarantees for the nonlinear system (1) utilizing only model-based advice from the known, affine part?
The goal of this paper is to answer the question above by designing a policy that ensures stabilization of states while also minimizing costs. Traditionally, switching between different control policies has been investigated for linear systems [15], [16], where all candidate policies need to be linear and can therefore be represented by their Youla-Kucera parametrizations (or Q-parameterizations). When a policy is a black-box machine-learned policy modeled by a DNN that may be nonlinear, how to combine or switch between policies remains an open problem, made challenging by the fact that the model-free policy typically has no theoretical guarantees on its performance. On the one hand, a model-free policy works well on average; on the other hand, model-based advice stabilizes the system in extreme cases.
Contributions: In this work we propose a novel, adaptive policy that combines model-based advice with a black-box machine-learned controller to guarantee stability while retaining the performance of the machine-learned controller when it performs well. In particular, we consider a nonlinear control problem whose dynamics are given in (1), where we emphasize that the unknown nonlinear residual function $f_t(x_t, u_t)$ is time-varying and depends not only on the state $x_t$ but also on the action $u_t$ at each time $t$. Our first result is a negative result (Theorem 1) showing that a naive convex combination of a black-box model-free policy with model-based advice can lead to instability, even if both policies are stabilizing individually. This negative result highlights the challenges associated with combining model-based and model-free approaches.
Next, we present a general policy that adaptively combines a model-free black-box policy with model-based advice (Algorithm 1). We assume that the model-free policy has some consistency error ε compared with the optimal policy, and that the residual functions $(f_t : t \ge 0)$ are Lipschitz continuous with a Lipschitz constant $C > 0$. Instead of employing a hard switch between policies, we introduce a time-varying confidence coefficient $\lambda_t$ that only decays, switching from a black-box model-free policy to a stabilizing model-based policy in a smooth way during operation, as needed to ensure stabilization. The sequence of confidence coefficients converges to λ ∈ [0, 1]. Our main result is the following theorem (summarized as a corollary in Section IV-A), which establishes a trade-off between competitiveness (Theorem 3) and stability (Theorem 2) of this adaptive algorithm.
Theorem (Informal): With system assumptions (Assumptions 1 and 2, and an upper bound on $C$), the adaptive λ-confident policy (Algorithm 1) has the following properties: (a) the policy is exponentially stabilizing, with a decay rate that increases as λ decreases; and (b) when the consistency error ε is small, the competitive ratio (Definition 2) of the policy satisfies
$$\mathrm{CR} \;\le\; \underbrace{\mathrm{CR}_{\mathrm{model}}}_{\text{model-based bound}} \;+\; \underbrace{O(\varepsilon)}_{\text{model-free error}} \;+\; \underbrace{O\big((\varepsilon + C)\|x_0\|^2/\mathrm{OPT}\big)}_{\text{nonlinear dynamics error}}. \tag{2}$$
The theorem shows that the adaptive λ-confident policy is guaranteed to be stable. Furthermore, if the black-box policy is close to an optimal control policy (in the sense that the consistency error ε is small), then the adaptive λ-confident policy has a bounded competitive ratio consisting of three components. The first is a bound inherited from a model-based policy; the second term depends on the sub-optimality gap between a black-box policy and an optimal policy; and the last term encapsulates the loss induced by switching from one policy to another, scaling with the $\ell_2$-norm of the initial state $x_0$ and the nonlinear residuals (through the Lipschitz constant $C$).
Our results imply an interesting trade-off between stability and optimality: if λ is smaller, the policy is guaranteed to stabilize at a higher rate; if λ is larger, the policy achieves a smaller competitive ratio bound when provided with a high-quality black-box policy. Different from the linear case, where a cost characterization lemma can be directly applied to bound the difference between the policy costs and optimal costs in terms of the difference between their actions [21], for the nonlinear dynamics (1) we introduce an auxiliary linear problem to derive an upper bound on the dynamic regret, whose value can be decomposed into a quadratic term and a term induced by the nonlinearity. The first term can be bounded via a generalized characterization lemma and becomes the model-based bound and model-free error in (2). The second term becomes the nonlinear dynamics error via a novel sensitivity analysis of an optimal nonlinear policy based on its Bellman equation. Finally, we use the Cart-Pole problem to demonstrate the efficacy of the adaptive λ-confident policy.
Related work: Our work studies a new learning-augmented control setting that draws inspiration from a variety of classical and learning-based policies for control and reinforcement learning (RL) problems that focus on combining model-based and model-free approaches.
Combination of model-based information with model-free methods: Our paper adds to the recent literature seeking to combine model-free and model-based policies for online control. Some prominent recent papers with this goal include the following. First, in [22], Q-learning is connected with model predictive control (MPC), whose constraints are Q-functions encapsulating the state transition information. Second, MPC methods with penalty terms learned by model-free algorithms are considered in [23]. Third, deep neural network dynamics models are used to initialize a model-free learner to improve sample efficiency while maintaining high task-specific performance [24]. Next, using this idea, the authors of [11] consider a more concrete dynamical system $x_{t+1} = Ax_t + Bu_t + f(x_t)$ (similar to the dynamics in (1) considered in this work), where $f$ is a state-dependent function, and they show that a model-based initialization of a model-free policy is guaranteed to converge to a near-optimal linear controller. Another approach integrates an $H_\infty$ controller into model-free RL algorithms for variance reduction [7]. Finally, model-based value expansion is proposed in [25] as a method to incorporate learned dynamics models into model-free RL algorithms. Broadly, despite many heuristic combinations of model-free and model-based policies demonstrating empirical improvements, there are few theoretical results explaining and verifying the success of combining model-free and model-based methods for control tasks. Our work contributes to this goal.
Combining stabilizing linear controllers: The proposed algorithm in this work combines existing controllers and so is related to the literature on combining stabilizing linear controllers. A prominent work in this area is [15], which shows that with proper controller realizations, switching between a family of stabilizing controllers uniformly exponentially stabilizes a linear time-invariant (LTI) system. Similar results are given in [16]. The techniques applied in [15], [16] use the fact that all stabilizing controllers can be expressed using the Youla parameterization. Different from these classical results on switching between or combining stabilizing controllers, in this work we generalize the idea to the combination of a linear model-based policy and a model-free policy, which can be either linear or nonlinear.
Learning-augmented online problems: Recently, the idea of augmenting competitive/robust online algorithms with machine-learned advice has attracted attention in online problems in settings like online caching [26], ski-rental [27], [28], smoothed online convex optimization [29], [30], and linear quadratic control [21]. In many of these learning-augmented online algorithms, a convex combination of machine-learned (untrusted) predictions and robust decisions is involved. For instance, in [21], competitive ratio upper bounds of a λ-confident policy are given for a linear quadratic control problem. The policy $\lambda\pi_{\mathrm{MPC}} + (1-\lambda)\pi_{\mathrm{LQR}}$ linearly combines a linear quadratic regulator $\pi_{\mathrm{LQR}}$ and an MPC policy $\pi_{\mathrm{MPC}}$ with machine-learned predictions, where λ ∈ [0, 1] measures the confidence in the machine-learned predictions. To this point, no results on learning-augmented controllers for nonlinear control exist. In this work, we focus on the case of nonlinear dynamics, show a general negative result (Theorem 1) that a simple convex combination of two policies can lead to unstable outputs, and then provide a new approach that yields positive results.
Stability-certified RL: Another highly related line of work is recent research on developing safe RL with stability guarantees. In [17], Lyapunov analysis is applied to guarantee the stability of a model-based RL policy. If an $H_\infty$ controller $\pi_{H_\infty}$ is close enough to a model-free deep RL policy $\pi_{\mathrm{RL}}$, by combining the two policies linearly as $\lambda\pi_{H_\infty} + (1-\lambda)\pi_{\mathrm{RL}}$ at each time in each training episode, asymptotic stability and forward invariance can be guaranteed using Lyapunov analysis, but the convergence rate is not provided [7]. In practice, [7] uses an empirical approach to choose a time-varying factor λ according to the temporal difference error. Robust model predictive control (MPC) is combined with deep RL to ensure safety and stability [20]. Using a regulated policy gradient, input-output stability is guaranteed for a continuous nonlinear control model $\dot{x}(t) = Ax(t) + Bu(t) + g_t(x(t))$ [10]. A common assumption in those works is the ability to access and update the deep RL policy during episodic training steps. Moreover, in the state-of-the-art results, the stability guarantees are proven either by considering the aforementioned episodic setting, where the black-box policy can be improved or customized [10], [17], or by assuming a small and bounded output distance between a black-box policy and a stabilizing policy for any input state in order to construct a Lyapunov equation [7], which is less realistic. Stability guarantees under different model assumptions such as (constrained) MDPs have also been studied [18], [19], [20]. Different from the existing literature, the result presented in this work is unique and novel in the sense that we provide stability and optimality guarantees for black-box deep policies on a single trajectory, where we can neither learn from the environment nor update the deep RL policy through extensive training steps. The related results are summarized in Table 1.

II. BACKGROUND AND MODEL
We consider the following infinite-horizon quadratic control problem with nonlinear dynamics:
$$\min_{(u_t : t \ge 0)} \ \sum_{t=0}^{\infty}\left(x_t^\top Q x_t + u_t^\top R u_t\right) \quad \text{subject to the dynamics (1)}, \tag{3}$$
where $Q, R \succ 0$ are $n \times n$ and $m \times m$ positive definite matrices and each $f_t: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ in (1) is an unknown nonlinear function representing state- and action-dependent perturbations. An initial state $x_0$ is fixed. We use the following assumptions throughout this paper. Our first assumption is the Lipschitz continuity of the residual functions, which is satisfied by various applications such as the Cart-Pole problem [11]. We further assume that the system is regulated to nominal states such that $f_t(0) = 0$ for all $t \ge 0$. Note that $\|\cdot\|$ denotes the Euclidean norm throughout the paper.
Assumption 1 (Lipschitz continuity): The function $f_t: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is Lipschitz continuous for each $t \ge 0$, i.e., there exists a constant $C > 0$ such that $\|f_t(x, u) - f_t(x', u')\| \le C\left(\|x - x'\| + \|u - u'\|\right)$ for all $x, x' \in \mathbb{R}^n$ and $u, u' \in \mathbb{R}^m$.
Next, we make a standard assumption on the system stability and cost function [31], [32].
Assumption 2 (System stabilizability and costs): The pair of matrices $(A, B)$ is stabilizable, i.e., there exists a real matrix $K$ such that the spectral radius $\rho(A - BK) < 1$. We assume the minimum eigenvalues of $Q$ and $R$ are bounded below by a constant $\sigma > 0$.
In summary, our control agent is provided with a black-box policy $\widehat{\pi}$ and system parameters $A, B, Q, R$. The goal is to utilize $\widehat{\pi}$ and the system information to minimize the quadratic costs in (3), without knowing the nonlinear residuals $(f_t : t \ge 0)$. Next, we present our policy assumptions.
(Crude) Model-based advice: In many real-world applications, linear approximations of the true nonlinear system dynamics are known, i.e., the known affine part of (1) is available to construct a stabilizing policy $\overline{\pi}$. To construct $\overline{\pi}$, note that the stabilizability of $(A, B)$ implies that the following discrete algebraic Riccati equation (DARE) has a unique positive semi-definite solution $P$ [33]:
$$P = A^\top P A - A^\top P B\left(R + B^\top P B\right)^{-1} B^\top P A + Q. \tag{4}$$
Given $P$, define $K := (R + B^\top P B)^{-1}B^\top P A$. The closed-loop system matrix $F := A - BK$ must have a spectral radius $\rho(F)$ less than 1. Therefore, Gelfand's formula implies that there exist constants $C_F > 0$ and $\rho \in (0, 1)$ such that $\|F^t\| \le C_F\rho^t$ for any $t \ge 0$. The model-based advice considered in this work is then defined as a sequence of actions $(\overline{u}_t : t \ge 0)$ provided by a linear quadratic regulator (LQR) such that $\overline{u}_t = \overline{\pi}(x_t) = -Kx_t$.
Black-box policy: To solve the nonlinear control problem in (3), we take advantage of both model-free and model-based approaches. We assume a pre-trained, possibly model-free policy $\widehat{\pi}: \mathbb{R}^n \to \mathbb{R}^m$ is provided beforehand. The policy is regarded as a "black box," whose details are not the major focus of this paper. The only way we interact with it is to obtain a suggested action $\widehat{u}_t = \widehat{\pi}(x_t)$ when feeding it the current system state $x_t$. The performance of the black-box policy is not guaranteed, and it can make errors, characterized by the following definition, which compares $\widehat{\pi}$ against a clairvoyant optimal controller $\pi^*_t$ knowing the nonlinear residual perturbations in hindsight.
Definition 1 (ε-consistency): A policy $\widehat{\pi}: \mathbb{R}^n \to \mathbb{R}^m$ is called ε-consistent if there exists $\varepsilon > 0$ such that for any $x \in \mathbb{R}^n$ and $t \ge 0$, $\|\widehat{\pi}(x) - \pi^*_t(x)\| \le \varepsilon\|x\|$, where $\pi^*_t$ denotes an optimal policy at time $t$ knowing all the nonlinear residual perturbations $(f_t : t \ge 0)$ in hindsight, and ε is called the consistency error.
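For concreteness, the model-based advice can be computed directly from the known parameters. Below is a minimal sketch using SciPy's DARE solver; the function name `lqr_advice` is ours, introduced only for illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_advice(A, B, Q, R):
    """Construct the model-based advice pi_bar(x) = -K x from the DARE (4)."""
    P = solve_discrete_are(A, B, Q, R)                 # unique PSD solution of (4)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # K = (R + B'PB)^{-1} B'PA
    return P, K

# Usage: P, K = lqr_advice(A, B, Q, R); the advice at state x is -K @ x,
# and the closed-loop matrix F = A - B @ K satisfies rho(F) < 1.
```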
The parameter ε measures the difference between the action given by the black-box policy $\widehat{\pi}$ and the optimal action given the state $x$. There is no guarantee that ε is small. With prior knowledge of the system's nonlinearity gained from data, the sub-optimal model-free policy $\widehat{\pi}$ suffers a consistency error ε > 0, which can be small if the black-box policy is trained on unbiased data, or large because of the high-variability issue of policy-gradient deep RL algorithms [7], [8] and distribution shifts in the environment. In this paper, we augment a black-box model-free policy with stability guarantees using the idea of adaptively switching to a model-based stabilizing policy $\overline{\pi}$, which often exists provided exact or estimated system parameters $A, B, Q$, and $R$. The linear stabilizing policy is conservative and highly sub-optimal, as it is neither designed based on the exact nonlinear model nor interacts with the environment as the training of $\widehat{\pi}$ potentially does.
Performance metrics: Our goal is to ensure stabilization of states while also providing good performance, as measured by the competitive ratio. Formally, a policy is (asymptotically) stabilizing if it induces a sequence of states $(x_t : t \ge 0)$ such that $x_t \to 0$ as $t \to \infty$. If there exist $C > 0$ and $0 \le \gamma < 1$ such that $\|x_t\| \le C\gamma^t\|x_0\|$ for any $t \ge 0$, the corresponding policy is said to be exponentially stabilizing. To define the competitive ratio, let OPT be the offline optimal cost of (3) induced by optimal control policies $(\pi^*_t : t \ge 0)$ when the nonlinear residual functions $(f_t : t \ge 0)$ are known in hindsight, and let ALG be the cost achieved by an online policy. Throughout this paper we assume OPT > 0. We formally define the competitive ratio as follows.
Definition 2: Given a policy, the corresponding competitive ratio, denoted by CR, is defined as the smallest constant C ≥ 1 such that ALG ≤ C · OPT for fixed A, B, Q, R satisfying Assumption 2 and any adversarially chosen residual functions ( f t : t ≥ 0) satisfying Assumption 1.

III. WARMUP: A NAIVE CONVEX COMBINATION
The main results in this work focus on augmenting a black-box policy $\widehat{\pi}$ with stability guarantees while minimizing the quadratic costs in (3), provided with the linear system parameters $A, B, Q, R$ of a nonlinear system. Before proceeding to our policy, to highlight the challenge of combining model-based advice with model-free policies in this setting, we first consider a simple strategy for combining the two via a convex combination. This approach has been proposed and studied previously, e.g., [7], [21]. However, we show that it can be problematic in that it can yield an unstable policy even when the two policies are individually stabilizing. Then, in Section IV, we propose an approach that overcomes this challenge.
A natural approach for incorporating model-based advice is a convex combination of a model-based control policy $\overline{\pi}$ and a black-box model-free policy $\widehat{\pi}$. The combined policy generates an action $u_t = \lambda\widehat{\pi}(x_t) + (1-\lambda)\overline{\pi}(x_t)$ given a state $x_t$ at each time, where λ ∈ [0, 1]. The coefficient λ determines a confidence level: if λ is larger, we trust the black-box policy more, and vice versa. In the following, however, we highlight that, in general, the convex combination of two policies can yield an unstable policy even if the two policies are stabilizing, with a proof in Appendix VI-C.
Theorem 1: Assume $B$ is an $n \times n$ full-rank matrix with $n > 1$. For any $\lambda \in (0, 1)$ and any linear controller $K_1$ satisfying $A - BK_1 \ne 0$, there exists a stabilizing linear controller $K_2$ such that their convex combination $\lambda K_2 + (1-\lambda)K_1$ is unstabilizing, i.e., the closed-loop matrix $A - B(\lambda K_2 + (1-\lambda)K_1)$ has spectral radius greater than 1.

Theorem 1 brings up an issue with the strategy of combining a stabilizing policy with a model-free policy. Even if both the model-based and model-free policies are stabilizing, the combined controller can lead to unstable state outputs. In general, the space of stabilizing linear controllers $\{K \in \mathbb{R}^{n \times m} : K \text{ is stabilizing}\}$ is nonconvex [34]. The result in Theorem 1 is a stronger statement. It implies that for any arbitrarily chosen linear policy $K_1$ and coefficient $\lambda \in (0, 1)$, we can always adversarially select a second policy $K_2$ such that their convex combination leads to an unstable system. It is worth emphasizing that the second policy does not have to be a complicated nonlinear policy; indeed, in our proof we construct a linear policy $K_2$ to derive the conclusion. In our problem, the second policy is a black-box policy $\widehat{\pi}$ potentially parameterized by a deep neural network, yielding much more uncertainty in a similar convex combination. As a result, we must be careful when combining policies.
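The phenomenon in Theorem 1 is easy to reproduce numerically. The snippet below is our own illustrative instance (not the construction used in the proof): with $A = 0$ and $B = I$, each closed-loop matrix $F_i = -K_i$ is nilpotent and hence stabilizing, yet the midpoint combination is unstable.

```python
import numpy as np

# Two individually stabilizing closed-loop matrices (spectral radius 0) ...
F1 = np.array([[0.0, 3.0],
               [0.0, 0.0]])
F2 = np.array([[0.0, 0.0],
               [3.0, 0.0]])

# ... whose convex combination with lambda = 1/2 is unstable.
F = 0.5 * F1 + 0.5 * F2
print(max(abs(np.linalg.eigvals(F))))  # spectral radius 1.5 > 1
```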
Note that applying a convex combination of an RL policy and a control-theoretic policy is not a new approach, and similar policy combinations have been proposed in previous studies [7], [21]. However, in those results, either the model-free policy is required to satisfy specific structures [21] or to be close enough to the stabilizing policy [7] in order to be combined. In [21], a learning-augmented policy is combined with a linear quadratic regulator, but the learning-augmented policy has a specific form and is not a black-box policy. In [7], a deep RL policy $\widehat{\pi}$ is combined with an $H_\infty$ controller $\overline{\pi}$, and they need to satisfy $\|\widehat{\pi}(x) - \overline{\pi}(x)\| \le C_\pi$ for any state $x \in \mathbb{R}^n$ and some $C_\pi > 0$. However, it is possible that when the state norm $\|x\|$ becomes large, the two policies behave entirely differently in practice. Moreover, it is hard to justify the benefit of combining two policies conditioned on the fact that they are already similar. Given that those assumptions are often not satisfied or hard to verify in practice, we need another approach that guarantees worst-case stability when the black-box policy is biased and, in addition, ensures near-optimality if the black-box policy works well.

IV. ADAPTIVE λ-CONFIDENT CONTROL
Motivated by the challenge highlighted in the previous section, we now propose a general framework that adaptively selects a sequence of monotonically decreasing confidence coefficients $(\lambda_t : t \ge 0)$ in order to switch between black-box and stabilizing model-based policies. We show that it is possible to guarantee a bounded competitive ratio when the black-box policy works well, i.e., it has a small consistency error ε, and to guarantee stability in cases when the black-box policy performs poorly.

Algorithm 1: Adaptive λ-Confident.
The adaptive λ-confident policy introduced in Algorithm 1 involves an input coefficient $\widetilde{\lambda}_t$ at each time. The value of $\lambda_t$ is the minimum of $\lambda_{t-1}$ decreased by a fixed step size $\alpha > 0$ and a variable $\widetilde{\lambda}_t$ learned from the known system parameters in (3) combined with observations of previous states and actions. In Section V-A, we consider an online learning approach to generate a value of $\widetilde{\lambda}_t$ at each time $t$, but it is worth emphasizing that the adaptive policy in Algorithm 1 and its theoretical guarantees in Section IV-A do not require specifying a detailed construction of $\widetilde{\lambda}_t$.
The adaptive policy differs from the naive convex combination discussed in Section III in that it adopts a sequence of time-varying, monotonically decreasing coefficients $(\lambda_t : t \ge 0)$ to combine a black-box policy and a model-based stabilizing policy, adaptively switching from the former to the latter. This is necessary to ensure that the stabilizing policy eventually takes full control if the black-box policy does not bring the system to the fixed point at 0 in finite time. The coefficient $\lambda_t$ converges to $\lim_{t\to\infty}\lambda_t = \lambda$, where the limit λ can be positive if the state converges to a target equilibrium (0 under our model assumptions) before $\lambda_t$ decreases to zero. This helps stabilize the system under assumptions on the Lipschitz constant $C$ of the unknown nonlinear residual functions; and if the black-box policy is near-optimal, a bounded competitive ratio is guaranteed, as we show in the next section.
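Since Algorithm 1 is described above only in prose, the following minimal Python sketch reconstructs the control loop under our reading; `lambda_tilde` (the learned candidate coefficient, e.g., from the online rule (8) in Section V-A) and `step` (the unknown environment transition) are hypothetical callables of our own naming.

```python
def adaptive_lambda_confident(x0, pi_hat, pi_bar, lambda_tilde, step, alpha, T, lam0=1.0):
    """Sketch of Algorithm 1: blend a black-box policy pi_hat with
    model-based advice pi_bar using decaying confidence coefficients."""
    x, lam = x0, lam0
    for t in range(T):
        u = lam * pi_hat(x) + (1.0 - lam) * pi_bar(x)   # lambda_t-weighted action
        x = step(x, u)                                  # x_{t+1} = A x + B u + f_t(x, u)
        # lambda_t = min{lambda~_t, lambda_{t-1} - alpha}, floored at 0
        lam = max(0.0, min(lambda_tilde(t), lam - alpha))
    return x
```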

A. THEORETICAL GUARANTEES
The theoretical guarantees we obtain are two-fold. First, we show that the adaptive λ-confident policy in Algorithm 1 is stabilizing, as stated in Theorem 2. Second, in addition to stability, we show that the policy has a bounded competitive ratio if the black-box policy used has a small consistency error (Theorem 3). Note that if a black-box policy has a large consistency error ε, then without model-based advice it can lead to instability and therefore a possibly unbounded competitive ratio.
Stability: Before presenting our results, we introduce some notation for convenience. Denote by $t_0$ the smallest time index when $\lambda_t = 0$ or $x_t = 0$, and note that 0 is an equilibrium state. Denote by $\lambda := \lim_{t\to\infty}\lambda_t$. Since $(\lambda_t : t \ge 0)$ is a monotonically decreasing sequence and $\lambda_t$ is bounded below, $t_0 \le 1/\alpha$ and λ exist and are unique. Let $H := R + B^\top P B$. Define the parameters $\gamma := \rho + C_F C(1 + \|K\|)$ and $\mu := C_F\left(\varepsilon(C + \|B\|) + C^{\mathrm{sys}}_a C\right)$, where $A, B, Q, R$ are the known system parameters in (3); $P$ is defined by the Riccati equation (4); $C_F > 0$ and $\rho \in (0, 1)$ are constants such that $\|F^t\| \le C_F\rho^t$ for any $t \ge 0$, as defined in Section II; $C$ is the Lipschitz constant in Assumption 1; $\varepsilon > 0$ is the consistency error in Definition 1; and, finally, $C^{\mathrm{sys}}_a, C^{\mathrm{sys}}_b, C^{\mathrm{sys}}_c > 0$ are constants that only depend on the known system parameters in (1) and (3). Their details are relegated to Appendix VI-B.
Given the above notation, the theorem below guarantees the stability of the adaptive λ-confident policy.
Theorem 2 (Stability): Suppose the Lipschitz constant $C$ satisfies $C < \frac{1-\rho}{C_F(1+\|K\|)}$. The adaptive λ-confident policy (Algorithm 1) is an exponentially stabilizing policy such that
$$\|x_t\| \le \frac{\mu^{t_0}}{\mu\gamma^{-1} - 1}\left(C_F + \mu\gamma^{-1}\right)\gamma^{\,t - t_0}\,\|x_0\| \quad \text{for any } t \ge t_0.$$
For Theorem 2 to hold with $\gamma < 1$, the Lipschitz constant $C$ needs the upper bound $(1-\rho)/(C_F(1+\|K\|))$, where $C_F > 0$ and $\rho \in (0, 1)$ are constants such that $\|F^t\| \le C_F\rho^t$ for any $t \ge 0$. Since $(A, B)$ is stabilizable, such $C_F$ and $\rho$ exist. The upper bound only depends on the known system parameters $A, B, Q$, and $R$. In addition to the stability guarantee, we show that the policy is competitive.
Competitiveness: The theorem below implies that when the model-free policy error ε and the Lipschitz constant C of the residual functions are small enough, Algorithm 1 is competitive.
Theorem 3 (Competitive Ratio Bound): Suppose the Lipschitz constant satisfies $C < \min\{1, C^{\mathrm{sys}}_a, C^{\mathrm{sys}}_c\}$. When the consistency error satisfies
$$\varepsilon \le C_\varepsilon, \tag{5}$$
the competitive ratio of the adaptive λ-confident policy (Algorithm 1) is bounded as in (2), i.e.,
$$\mathrm{CR} \le \mathrm{CR}_{\mathrm{model}} + O(\varepsilon) + O\big((\varepsilon + C)\|x_0\|^2/\mathrm{OPT}\big), \tag{6}$$
where $\mathrm{CR}_{\mathrm{model}} := 2\kappa\left(\frac{C_F\|P\|}{1-\rho}\right)^2/\sigma$. Note that $C_\varepsilon$ in (5) is a constant that only depends on the known system parameters $A, B, Q, R$ and $(f_t : t \ge 0)$. Combining Theorems 2 and 3 proves our main result, summarized below as a corollary, when the Lipschitz constant is sufficiently small.

Corollary 1 (Competitiveness and Stability Trade-off):
Under Assumptions 1 and 2, if the Lipschitz constant satisfies $C < \min\big\{1, C^{\mathrm{sys}}_a, C^{\mathrm{sys}}_c, \frac{1-\rho}{C_F(1+\|K\|)}\big\}$, the adaptive λ-confident policy (Algorithm 1) has the following properties: 1) if the consistency error $\varepsilon \le C_\varepsilon$, it has the competitive ratio upper bound in (6); 2) if the consistency error $\varepsilon > C_\varepsilon$, it is an exponentially stabilizing policy. Note that a small enough Lipschitz constant is commonly required to guarantee stability for nonlinear control models. For instance, in [11], exponential convergence to the equilibrium state is guaranteed when the Lipschitz constant satisfies $C = O\big(\sigma^2(1-\rho)^8/(\kappa^9 C_F^{15})\big)$. Theorems 2 and 3 and Corollary 1 have some interesting implications. First, if the selected time-varying confidence coefficients converge to some large λ, then we trust the black-box policy and use a higher weight in the per-step combination. This requires a slower decay rate of $\lambda_t$ to zero, so as a trade-off $t_0$ can be larger, which leads to a weaker stability result, and vice versa. In contrast, when the nonlinear dynamics in (1) become linear with unknown constant perturbations, [21] shows a trade-off between robustness and consistency, i.e., a universal competitive ratio bound holds regardless of the error of the machine-learned predictions. Different from the linear case, where a competitive ratio bound always exists [21] and can be decomposed into terms parameterized by some confidence coefficient λ, for the nonlinear system dynamics (1) there are additional terms due to the nonlinearity of the system that can only be bounded if the consistency error ε is small. This highlights a fundamental difference in the competitiveness and stability trade-off between linear and nonlinear systems, where the latter is known to be more challenging. Proofs of Theorems 2 and 3 are provided in Appendices F and G.

V. PRACTICAL IMPLEMENTATION AND EXPERIMENTS
A. LEARNING CONFIDENCE COEFFICIENTS ONLINE
Our main results in the previous section are stated without specifying a sequence of confidence coefficients $(\lambda_t : t \ge 0)$ for the policy; in the following, we introduce an online learning approach that generates confidence coefficients based on observations of actions, states, and known system parameters. The negative result in Theorem 1 highlights that the adaptive nature of the confidence coefficients in Algorithm 1 is crucial to ensuring stability. Naturally, learning the values of the confidence coefficients $(\lambda_t : t \ge 0)$ online can further improve performance.
In this section, we propose an online learning approach based on a linear parameterization of a black-box model-free policy,
$$\widehat{\pi}(x_t) = -Kx_t - H^{-1}B^\top\sum_{\tau=t}^{\infty}\left(F^\top\right)^{\tau-t}P\widehat{f}_\tau,$$
where $(\widehat{f}_t : t \ge 0)$ are parameters representing estimates of the residual functions for a black-box policy. Note that when $\widehat{f}_t = f^*_t := f_t(x^*_t, u^*_t)$, where $x^*_t$ and $u^*_t$ are the optimal state and action at time $t$ for an optimal policy, the model-free policy is optimal. Although in general a black-box model-free policy $\widehat{\pi}$ can be nonlinear, the linear parameterization provides an example of how the time-varying confidence coefficients $(\lambda_t : t \ge 0)$ are selected, and the idea can be extended to nonlinear parameterizations such as kernel methods.
Under the linear parameterization assumption, for linear dynamics with no nonlinear residual functions, [21] shows that the choice of $\lambda_{t+1}$ that minimizes the gap between the policy cost and the optimal cost for the first $t$ time steps is
$$\lambda_{t+1} = \frac{\sum_{s=0}^{t-1}\big\langle \eta(\widehat{f}; s, t-1),\, \eta(f^*; s, t-1)\big\rangle}{\sum_{s=0}^{t-1}\big\|\eta(\widehat{f}; s, t-1)\big\|^2}, \tag{7}$$
where $\eta(f; s, t) := \sum_{\tau=s}^{t}\left(F^\top\right)^{\tau-s}Pf_\tau$. Compared with a linear quadratic control problem, computing $\lambda_{t+1}$ in (7) raises two problems. First, different from a linear dynamical system where true perturbations can be observed, the optimal actions and states are unknown, making the computation of the term $\eta(f^*; s, t-1)$ impossible. The second issue is similar: since the model-free policy is a black box, we do not know the parameters $(\widehat{f}_t : t \ge 0)$ exactly. Therefore, we use approximations to compute the terms $\eta(f^*; s, t)$ and $\eta(\widehat{f}; s, t)$ in (7), and the linear parameterization and linear dynamics assumptions are used to derive the approximations. Let $(BH^{-1})^\dagger$ denote the Moore-Penrose inverse of $BH^{-1}$ and $\widetilde{f}_\tau := Ax_\tau + Bu_\tau - x_{\tau+1}$. We use the following online-learning choice of confidence coefficient:
$$\lambda_t = \min\{\widetilde{\lambda}_t,\ \lambda_{t-1} - \alpha\}, \tag{8}$$
where $\alpha > 0$ is a fixed step size and $\widetilde{\lambda}_t$ approximates (7) based on the crude model information $A, B, Q, R$, the previously observed states $(x_s : 0 \le s < t)$, the black-box policy actions $(\widehat{u}_s : 0 \le s < t)$, and the implemented actions $(u_s : 0 \le s < t)$. This online learning process provides a choice of the confidence coefficient in Algorithm 1. It is worth noting that other approaches for generating $\lambda_t$ exist, and our theoretical guarantees apply to any approach.
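A sketch of this computation follows; the inner-product form of the estimate and the helper names are our reading of (7), not code from the paper. In practice, the unobservable $\eta(f^*; s, t)$ terms are replaced by $\eta(\widetilde{f}; s, t)$ computed from the observed residuals, and the $\eta(\widehat{f}; s, t)$ terms are recovered from the black-box actions via $(BH^{-1})^\dagger$.

```python
import numpy as np

def eta(f_seq, s, t, F, P):
    """eta(f; s, t) = sum_{tau=s}^{t} (F^T)^{tau-s} P f_tau."""
    return sum(np.linalg.matrix_power(F.T, tau - s) @ P @ f_seq[tau]
               for tau in range(s, t + 1))

def lambda_next(eta_hat, eta_star, lam_prev, alpha):
    """lambda_t = min{lambda~_t, lambda_{t-1} - alpha}, where lambda~_t is the
    least-squares coefficient suggested by (7), clipped to [0, 1].
    eta_hat / eta_star: lists of (approximated) eta terms, one per s."""
    num = sum(float(eh @ es) for eh, es in zip(eta_hat, eta_star))
    den = sum(float(eh @ eh) for eh in eta_hat) + 1e-12
    lam_tilde = min(max(num / den, 0.0), 1.0)
    return max(0.0, min(lam_tilde, lam_prev - alpha))
```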

B. EXPERIMENTS
1) THE CART-POLE PROBLEM
To demonstrate the efficacy of the adaptive λ-confident policy (Algorithm 1), we first apply it to the Cart-Pole OpenAI gym environment (CartPole-v1) [35]. Next, we apply it to an adaptive electric vehicle (EV) charging environment modeled by a real-world dataset [36]. a) Problem setting: The Cart-Pole problem considered in the experiments is described by the following example.

Example 1 (The Cart-Pole Problem)
In the Cart-Pole problem, the goal of a controller is to stabilize the pole in the upright position, with nonlinear dynamics given by the standard Cart-Pole equations
$$\ddot{\theta} = \frac{g\sin\theta - \cos\theta\,\frac{u + ml\dot{\theta}^2\sin\theta}{m + M}}{l\left(\frac{4}{3} - \frac{m\cos^2\theta}{m + M}\right)}, \qquad \ddot{y} = \frac{u + ml\left(\dot{\theta}^2\sin\theta - \ddot{\theta}\cos\theta\right)}{m + M},$$
where $u$ is the input force; $\theta$ is the angle between the pole and the vertical line; $y$ is the location of the cart; $g$ is the gravitational acceleration; $l$ is the pole length; $m$ is the pole mass; and $M$ is the cart mass. Taking $\sin\theta \approx \theta$ and $\cos\theta \approx 1$ and ignoring higher-order terms provides a linearized system, and the discretized dynamics of the Cart-Pole problem can be represented as
$$x_{t+1} = Ax_t + Bu_t + f_t(x_t, u_t)$$
for any $t$, where $x_t = (y_t, \dot{y}_t, \theta_t, \dot{\theta}_t)$ denotes the system state at time $t$; $\tau$ denotes the time interval between state updates; $\eta := (4/3)l - ml/(m + M)$; and the function $f_t$ measures the difference between the linearized system and the true system dynamics. Note that $f_t(0) = 0$ for all time steps $t \ge 0$.
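One possible construction of the linearized pair $(A, B)$ in Example 1, assuming forward-Euler discretization with step $\tau$ and the state ordering $(y, \dot{y}, \theta, \dot{\theta})$; the exact matrices used in the experiments may differ in ordering or discretization.

```python
import numpy as np

def cartpole_linearization(m, M, l, g=9.8, tau=0.02):
    """Euler-discretized linearization of the Cart-Pole dynamics around
    the upright equilibrium (a sketch under the stated assumptions)."""
    eta = (4.0 / 3.0) * l - m * l / (m + M)
    A = np.array([
        [1.0, tau, 0.0,                                0.0],
        [0.0, 1.0, -tau * m * l * g / ((m + M) * eta), 0.0],
        [0.0, 0.0, 1.0,                                tau],
        [0.0, 0.0, tau * g / eta,                      1.0],
    ])
    B = np.array([
        [0.0],
        [tau / (m + M) * (1.0 + m * l / ((m + M) * eta))],
        [0.0],
        [-tau / ((m + M) * eta)],
    ])
    return A, B
```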

b) Policy setting:
The pre-trained Stable-Baselines3 agents [13] of A2C [37], ARS [14], PPO [1], and TRPO [12] are selected as four candidate black-box policies. The Cart-Pole environment is modified so that quadratic costs are considered rather than discrete rewards, to match our control problem (3). Note that we vary the values of the pole mass $m$ and the cart mass $M$ in the LQR implementation to model the case of only having crude estimates of the linear dynamics. The LQR outputs an action 0 if $-Kx_t + \overline{F} < 0$ and 1 otherwise, where a shifted force $\overline{F} = 15$ is used to model inaccurate linear approximations. The pre-trained RL policies output a binary decision in $\{0, 1\}$ representing the force direction. To use our adaptive policy in this setting, given a system state $x_t$ at each time $t$, we implement a convex combination of the two policies' force outputs with coefficient $\lambda_t$, where $F = 10$ is a fixed force magnitude, $\pi_{\mathrm{RL}}$ denotes an RL policy, $\pi_{\mathrm{LQR}}$ denotes an LQR policy, and $\lambda_t$ is a confidence coefficient generated based on (8). c) Results: In Fig. 2, the adaptive policy finds a trade-off between the pre-trained black-box policies and an LQR with crude model information (i.e., about 50% estimation error in the mass and length values). In particular, when $\theta$ increases, it stabilizes the state while the A2C and PPO policies become unstable given a large initial angle $\theta$.
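One plausible reading of this combination is sketched below (the thresholding details are our assumption): both policies are converted to signed forces, blended with $\lambda_t$, and mapped back to the binary action space by sign.

```python
def adaptive_cartpole_action(x, lam, K, pi_rl, F=10.0, F_shift=15.0):
    """Blend the binary RL action with the (shifted) LQR force, then
    threshold back to CartPole's binary actions (a sketch, see lead-in)."""
    f_rl = F * (2 * pi_rl(x) - 1)       # RL action {0, 1} -> signed force
    f_lqr = float(-(K @ x)) + F_shift   # shifted LQR force (crude model)
    blended = lam * f_rl + (1.0 - lam) * f_lqr
    return 1 if blended > 0 else 0
```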

2) REAL-WORLD ADAPTIVE EV CHARGING DURING COVID-19
The second application considered is an EV charging problem modeled by real-world large-scale charging data [36]. The problem is formally described below.

Example 2 (Adaptive EV Charging): a) Problem setting: Let $n$ be the number of EV charging stations. Denote by $x_t \in \mathbb{R}^n_+$ the charging states of the $n$ stations, i.e., $x^{(i)}_t > 0$ if an EV is charging at station $i$ and $x^{(i)}_t$ (kWh) of energy remains to be delivered; otherwise $x^{(i)}_t = 0$. Let $u_t \in \mathbb{R}^n_+$ be the allocation of energy to the $n$ stations. There is a line limit $\gamma > 0$ such that $\sum_{i=1}^{n}u^{(i)}_t \le \gamma$ for all $t$. At each time $t$, new EVs may arrive and existing EVs may depart from previously occupied stations. Each new EV $j$ induces a charging session, represented by $s_j := (a_j, d_j, e_j, i)$, where EV $j$ arrives at station $i$ at time $a_j$ with a battery capacity $e_j > 0$ and departs at time $d_j$. The system dynamics are $x_{t+1} = x_t + u_t + f_t(x_t, u_t)$, $t \ge 0$, where the nonlinear residual functions $(f_t : t \ge 0)$ represent uncertainty and constraint violations. Let $\tau$ be the time interval between state updates. Given fixed session information $(s_j : j > 0)$, the sets of sessions that are assigned to a charger $i$ and activated (deactivated) at time $t$ summarize the charging uncertainty: they determine each coordinate $f^{(i)}_t(x_t, u_t)$, $i = 1, \ldots, n$.

Note that the charging session profiles $(s_j : j > 0)$ change significantly due to COVID-19, and therefore so do the nonlinear residual functions $(f_t : t \ge 0)$. Moreover, the residual functions in Example 2 may not satisfy $f_t(0) = 0$ for all $t \ge 0$, as required in Assumption 1. Our experiments further validate that the adaptive policy works well in practice even if some of the model assumptions are violated. The goal of an EV charging controller is to maximize a system-level reward that includes maximizing energy delivery, avoiding a penalty due to uncharged capacities, and minimizing electricity costs; the reward function combines these terms with coefficients $\varphi_1, \varphi_2, \varphi_3$, and $\varphi_4$ shown in Table 3. The environment is wrapped as an OpenAI gym environment [35]. In our implementation, for convenience, the state $x_t$ is in $\mathbb{R}^{2n}_+$, with $n$ additional coordinates representing the remaining charging durations. The electricity prices $(p_t : t \ge 0)$ are average locational marginal prices (LMPs) on the CAISO day-ahead market in 2016.

b) Policy setting: We train an SAC [38] policy $\pi_{\mathrm{SAC}}$ for EV charging with 4 months of data collected from a real-world charging garage [36] before the outbreak of COVID-19. The public charging infrastructure has 54 chargers, and we use the charging history to set up our charging environment with 5 chargers. Knowledge of the linear part of the nonlinear dynamics is assumed, based on which an LQR controller $\pi_{\mathrm{LQR}}$ is constructed. Our adaptive policy presented in Algorithm 1 learns a confidence coefficient $\lambda_t$ at each time step $t$ to combine the two policies $\pi_{\mathrm{SAC}}$ and $\pi_{\mathrm{LQR}}$.

c) Impact of COVID-19: We test the policies on different periods from 2019 to 2021. The impact of COVID-19 on charging behavior is intuitive. As COVID-19 became an outbreak in early February 2020 and later a pandemic in May 2020, limited Stay-at-Home Orders and curfews were issued, which significantly reduced the number of active users per day. A dramatic fall in the total number of monthly charging sessions and the total monthly energy delivered can be observed between February 2020 and September 2020 [36]. Moreover, despite the recovery of these two factors since January 2021, COVID-19 has had a long-term impact on lifestyle behaviors.
The distribution of the arrival times of EVs (start times of sessions) has become more uniform in the post-COVID-19 period, compared to a more concentrated arrival peak before COVID-19. Overall, this significant shift in the distributions of key charging parameters prominently deteriorates the performance of DNN-based model-free policies (e.g., SAC) in the post-COVID-19 period if they are trained on normal charging data collected before COVID-19. In this work, we demonstrate that, by taking advantage of model-based information, the adaptive λ-confident policy (Algorithm 1) is able to correct the mistakes made by DNN-based model-free policies trained on biased data and achieve more robust charging performance. d) Rewards: We show the testing rewards and their bar plots in Figs. 4 and 3. In addition, the average total rewards for the SAC policy and the adaptive policy are summarized in Table 2. e) Summary: In the EV charging application, an SAC [38] agent is trained with data collected from a pre-COVID-19 period and tested on days before and after COVID-19. Due to a policy change (the work-from-home policy), the SAC agent becomes biased in the post-COVID-19 period. With crude model information, the adaptive policy has rewards matching the SAC agent in the pre-COVID-19 period and significantly outperforms the SAC agent in the post-COVID-19 period, with an average total reward of 1951.2 versus 1540.3 for SAC.
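Returning to the dynamics in Example 2, the following is a minimal sketch of one environment transition; the residual callable `f_t` (aggregating session arrivals, departures, and any constraint violations) and the scaling projection onto the line limit are our assumptions for illustration.

```python
import numpy as np

def charging_step(x, u, f_t, gamma):
    """One transition x_{t+1} = x_t + u_t + f_t(x_t, u_t) from Example 2,
    with the allocation projected onto the line limit sum_i u_i <= gamma."""
    total = float(u.sum())
    if total > gamma:
        u = u * (gamma / total)        # one possible projection onto the limit
    x_next = x + u + f_t(x, u)         # residual captures arrivals/departures
    return np.maximum(x_next, 0.0), u  # charging states remain nonnegative
```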

VI. CONCLUDING REMARKS
This work considers a novel combination of pre-trained black-box policies with model-based advice from crude model information. A general adaptive policy is proposed, with theoretical guarantees on both stability and near-optimality. The effectiveness of the adaptive policy is validated empirically. We believe that the results presented constitute an important first step toward improving the practicality of existing DNN-based algorithms when using them as black boxes in nonlinear real-world control problems. Exploring other forms of model-based advice theoretically, and verifying other implementations of learning the confidence coefficients online in practice, are interesting future directions.

A. HYPERPARAMETERS
We describe the experimental settings and choices of hyperparameters in Table 3.

B. CONSTANTS IN THEOREM 2 AND 3
Let $H := R + B^\top P B$. With $\sigma > 0$ defined in Assumption 2, the parameters $C^{\mathrm{sys}}_a, C^{\mathrm{sys}}_b, C^{\mathrm{sys}}_c > 0$ in the statements of Theorems 2 and 3 (Section IV) are constants that only depend on the known system parameters in (3).

C. PROOF OF THEOREM 1
Proof of Theorem 1: Fix an arbitrary controller $K_1$ with a closed-loop system matrix $F_1 := A - BK_1 \ne 0$. We first consider the case when $F_1$ is not a diagonal matrix, i.e., $A - BK_1$ has at least one non-zero off-diagonal entry. Consider the following closed-loop system matrix for the second controller $K_2$:
$$F_2 := \beta I - \frac{1-\lambda}{\lambda}L + S(F_1), \tag{10}$$
where $0 < \beta < 1$ is the value of the diagonal entries and $L$ is the strictly lower triangular part of the closed-loop system matrix $F_1$. The matrix $S(F_1)$ is a singleton matrix that depends on $F_1$, whose only non-zero entry $S_{i+k,i}$ corresponds to the first non-zero off-diagonal entry $(i, i+k)$ in the upper triangular part of $F_1$, searched in the order $i = 1, \ldots, n$ and $k = 1, \ldots, n$ with $i$ increasing first and then $k$. Such a non-zero entry always exists because in this case $F_1$ is not a diagonal matrix. If the non-zero off-diagonal entry appears in the lower triangular part, we can simply transpose, so without loss of generality we assume it is in the upper triangular part of $F_1$. Since $F_2$ is a lower triangular matrix, all of its eigenvalues equal $\beta \in (0, 1)$, implying that the linear controller $K_2$ is stabilizing. Then, based on the construction of $F_2$ in (10), the linearly combined controller $K := \lambda K_2 + (1-\lambda)K_1$ has a closed-loop system matrix $F$ that is upper triangular except for the single entry $\lambda S_{i+k,i}$, and its determinant satisfies (11)-(13), where the term $(-1)^{2i+k}$ in (11) comes from the cofactor expansion of $\det(F)$ along the entry $\lambda S_{i+k,i}$, and $\widetilde{F}$ is the sub-matrix of $F$ obtained by eliminating the $(i+k)$-th row and $i$-th column. The term $(-1)^{k+1}$ in (12) appears because $\widetilde{F}$ is a permutation of an upper triangular matrix with $n-2$ diagonal entries equal to $\beta$ and one remaining entry equal to $(F_1)_{i,i+k}$; the entries $S_{j,j+k}$ are zero for all $j < i$, since otherwise another non-zero entry would have been chosen according to our search order. Continuing from (13), since $S_{i+k,i}$ can be selected arbitrarily, setting $S_{i+k,i} = -\frac{2^n\beta(\lambda\beta)^{-n+1}}{1-\lambda}$ gives $2^n \le \det(F) \le |\rho(F)|^n$, implying that the spectral radius satisfies $\rho(F) \ge 2$. Therefore $K = \lambda K_2 + (1-\lambda)K_1$ is an unstable controller. It remains to prove the theorem when $F_1$ is a diagonal matrix; this case can be easily verified for $n = 2$ and extended to the general setting, showing that the spectral radius of the convex combination $\lambda F_2 + (1-\lambda)F_1$ is greater than one. This completes the proof.

D. USEFUL LEMMAS
To deal with the nonlinear dynamics in (1), we consider an auxiliary linear system with a fixed perturbation $w_t = f_t(x^*_t, u^*_t)$ for all $t \ge 0$, where each $x^*_t$ denotes an optimal state and $u^*_t$ an optimal action generated by an optimal policy $\pi^*$. We define a sequence of linear policies $(\widetilde{\pi}_t : t \ge 0)$, where $\widetilde{\pi}_t: \mathbb{R}^n \to \mathbb{R}^m$ generates an action
$$\widetilde{\pi}_t(x) = -Kx - H^{-1}B^\top\sum_{\tau=t}^{\infty}\left(F^\top\right)^{\tau-t}Pw_\tau, \tag{14}$$
which is an optimal policy for the auxiliary linear system. The gap between the optimal cost and the algorithm cost for the system in (3) can be characterized as follows.
Lemma 1: For any $\eta_t \in \mathbb{R}^m$, if at each time $t \ge 0$ a policy $\pi: \mathbb{R}^n \to \mathbb{R}^m$ takes an action $u_t = \pi(x_t) = \widetilde{\pi}_t(x_t) + \eta_t$, then the gap between the optimal cost OPT of the nonlinear system (3) and the algorithm cost ALG induced by selecting the control actions $(u_t : t \ge 0)$ equals the expression in (15) and (16), where $H := R + B^\top PB$ and $F := A - BK$. For any $t \ge 0$, we write $f_t := f_t(x_t, u_t)$, where $(x_t : t \ge 0)$ denotes the trajectory of states generated by the policy $\pi$ with actions $(u_t : t \ge 0)$ and $(x^*_t : t \ge 0)$ denotes the optimal trajectory of states generated by the optimal actions $(u^*_t : t \ge 0)$.
Proof of Lemma 1: The result follows by recursively representing the terminal cost at time $t$ given a state $x_t$, similar to the proof of Lemma 13 in [39], using backward induction. Note that with the optimal trajectories of states and actions fixed, the optimal controller $\pi^*$ induces the same cost OPT for both the nonlinear system in (3) and the auxiliary linear system. Moreover, the linear controller defined in (14) induces a cost $\widetilde{\mathrm{OPT}}$ that is smaller than OPT when running both in the auxiliary linear system, since the constructed linear policy is optimal there. Therefore $\mathrm{ALG} - \mathrm{OPT} \le \mathrm{ALG} - \widetilde{\mathrm{OPT}}$, and for any $t \ge 0$, (15) and (16) are obtained.
The next lemma generalizes a lower bound on OPT shown in the proof of Theorem 2.2 in [21].
Lemma 2: The optimal cost OPT can be bounded from below by $\mathrm{OPT} \ge \sigma\sum_{t \ge 0}\left(\|x^*_t\|^2 + \|u^*_t\|^2\right)$.

1) PROOF OUTLINE
In the sequel, we present the proofs of Theorems 2 and 3. The proof of our main results contains two parts: the stability analysis and the competitive ratio analysis. First, we prove that Algorithm 1 yields a stabilizing policy, regardless of the consistency error ε (Theorem 2). Second, in our competitive ratio analysis, we provide a competitive ratio bound in terms of ε and λ. We show in Lemma 3 that the competitive ratio is bounded if the adaptive policy is exponentially stabilizing with a decay rate that scales with $C$, which holds when the consistency error ε of the black-box model-free policy is small enough, as shown in Theorem 6. Theorem 6 is proven based on the stability and sensitivity analyses in Theorems 4 and 5.

2) STABILITY ANALYSIS
We first analyze the model-based policy $\overline{\pi}(x) = -Kx$, where $K := (R + B^\top PB)^{-1}B^\top PA$ and $P$ is the unique solution of the Riccati equation in (4).
Theorem 4: Suppose the Lipschitz constant $C$, the gain $K$, and the closed-loop matrix $F$ satisfy $\rho + C_F C(1 + \|K\|) < 1$, where $\|F^t\| \le C_F\rho^t$ for any $t \ge 0$. Then the model-based policy $\overline{\pi}(x) = -Kx$ exponentially stabilizes the system such that $\|x_t\| \le C_F(\rho + C_F\overline{C})^t\|x_0\|$ for any $t \ge 0$, where $\overline{C} := C(1 + \|K\|)$.
Proof of Theorem 4: Let $u_t = \overline{\pi}(x_t) = -Kx_t$ for all $t \ge 0$ and let $F := A - BK$, so that
$$x_{t+1} = Fx_t + f_t(x_t, u_t). \tag{17}$$
Rewriting (17) recursively, for any $t \ge 0$,
$$x_t = F^t x_0 + \sum_{\tau=0}^{t-1}F^{t-1-\tau}f_\tau(x_\tau, u_\tau). \tag{18}$$
Since $(f_t : t \ge 0)$ are Lipschitz continuous with a constant $C$ (Assumption 1), we have
$$\|x_t\| \le C_F\rho^t\|x_0\| + \frac{C_F\overline{C}}{\rho}\sum_{\tau=0}^{t-1}\rho^{t-\tau}\|x_\tau\|,$$
where $C_F > 1$ is a constant such that $\|F^t\| \le C_F\rho^t$ for any $t \ge 0$ and $\overline{C} := C(1 + \|K\|)$. Then, using (18), the weighted sums $S_t$ satisfy $S_t \le (1 + C_F\overline{C}/\rho)S_{t-1}$. Therefore, noting that $S_1 = (1 + \overline{C}/\rho)\|x_0\|$, recursively we obtain $S_t \le (1 + C_F\overline{C}/\rho)^{t-1}(1 + \overline{C}/\rho)\|x_0\|$, which further implies $\|x_t\| \le C_F(\rho + C_F\overline{C})^t\|x_0\|$.
Next, based on Theorem 4, we verify the stability of the convex-combined policy $\pi = \lambda\widehat{\pi} + (1-\lambda)\overline{\pi}$, where $\widehat{\pi}$ is a model-free policy satisfying the ε-consistency in Definition 1 and $\overline{\pi}$ is a model-based policy.
Theorem 5: Let $\pi^*$ be an optimal policy and $\overline{\pi}(x) = -Kx$ be a linear model-based policy. It follows that for any $t \ge 0$, $\|\pi^*_t(x) - \overline{\pi}(x)\| \le C^{\mathrm{sys}}_a C\|x\|$ for some constant $C^{\mathrm{sys}}_a > 0$, where $C$ is the Lipschitz constant defined in Assumption 1 and $C^{\mathrm{sys}}_a$ and $C^{\mathrm{sys}}_b$ are defined in Appendix B.
Proof Sketch of Theorem 5: The theorem follows by considering the Bellman equation $V_t(x) = \min_u\{x^\top Qx + u^\top Ru + V_{t+1}(Ax + Bu + f_t(x, u))\}$, where $V_t: \mathbb{R}^n \to \mathbb{R}_+$ denotes the optimal value function, which can be bounded by $V_t(x) \le V_\infty(0) + C_F^2\frac{\|Q + K^\top RK\|}{1 - (\rho + C_F\overline{C})^2}\|x\|^2$ utilizing Theorem 4. The result follows by writing $V_t(x) = x^\top Px + g_t(x)$, with $P$ the solution of the Riccati equation in (4), and analyzing the Jacobian of $g_t(x)$ to bound the action difference $\|\pi^*_t(x) - \overline{\pi}(x)\|$.
Let $(x_t : t \ge 0)$ denote a trajectory of states generated by the adaptive λ-confident policy $\pi_t = \lambda_t\widehat{\pi} + (1-\lambda_t)\overline{\pi}$ (Algorithm 1). The following theorem states the stability result for $\pi$.
Theorem 6: Let $\gamma := \rho + C_F C(1 + \|K\|)$. Suppose the black-box policy $\widehat{\pi}$ is ε-consistent with the consistency error satisfying $\varepsilon < \frac{1/C_F - C^{\mathrm{sys}}_a C}{C + \|B\|}$. Then the adaptive λ-confident policy is an exponentially stabilizing policy such that
$$\|x_t\| \le \frac{\gamma^t - \mu^t}{1 - \mu\gamma^{-1}}\left(C_F + \mu\gamma^{-1}\right)\|x_0\|$$
for any $t \ge 0$, where $\mu := C_F\left(\varepsilon(C + \|B\|) + C^{\mathrm{sys}}_a C\right)$.
Proof Sketch of Theorem 6: We first introduce a new symbol $x^{(\tau)}_t$, which is the $t$-th state of a trajectory generated by the combined policy $\pi_t(x) = \lambda_t\widehat{\pi}(x) + (1-\lambda_t)\overline{\pi}(x)$ for the first $\tau$ steps, switching to the model-based policy $\overline{\pi}$ for the remaining steps. Let $(\overline{x}_t : t \ge 0)$ and $(\overline{x}'_t : t \ge 0)$ denote the trajectories of states generated by the model-based policy $\overline{\pi}(x) = -Kx$ when the initial states are $x_0$ and $x'_0$, respectively. Using the Lipschitz continuity of $(f_t : t \ge 0)$ and Assumption 2, the same argument as in Theorem 4 gives that for any $t \ge 0$, $\|\overline{x}_t - \overline{x}'_t\| \le C_F(\rho + C_F\overline{C})^t\|x_0 - x'_0\|$, where $\overline{C} := C(1 + \|K\|)$. Therefore, a telescoping sum over the trajectories $x^{(\tau)}_t$ (illustrated in Fig. 5) bounds $\|x_t\|$ in terms of one-step differences, where $x^*_t$, $\widehat{x}_t$, and $\overline{x}_t$ are the states generated by running an optimal policy $\pi^*$, the model-free policy $\widehat{\pi}$, and the model-based policy $\overline{\pi}$, respectively, for one step from the same initial state $x_{t-1}$. Bounding $\|\widehat{x}_t - \overline{x}_t\|$ and $\|x_t - x^{(t-1)}_t\|$ respectively completes the proof.

F. PROOF OF THEOREM 2
To show the stability result, note that the policy is switched to the model-based policy for $t \ge t_0$. Since the Lipschitz constant $C$ satisfies $C < \frac{1-\rho}{C_F(1+\|K\|)}$, where $\|F^t\| \le C_F\rho^t$ for any $t \ge 0$, the model-based policy $\overline{\pi}$ is exponentially stabilizing, as shown in Theorem 4. Let $\mu := C_F\left(\varepsilon(C + \|B\|) + C^{\mathrm{sys}}_a C\right)$ and $\gamma := \rho + C_F C(1 + \|K\|) < 1$. Applying Theorems 6 and 4, for any $t \ge 0$,
$$\|x_t\| \le \frac{\mu^{t_0}}{\mu\gamma^{-1} - 1}\left(C_F + \mu\gamma^{-1}\right)\left(\rho + C_F C(1 + \|K\|)\right)^{t - t_0}\|x_0\|.$$
Since $\lambda_{t+1} \le \lambda_t - \alpha$ for all $t \ge 0$ with some $\alpha > 0$, $t_0$ is finite and the adaptive λ-confident policy is an exponentially stabilizing policy.