Improving the Robustness of Reinforcement Learning Policies with L1 Adaptive Control

Abstract: A reinforcement learning (RL) control policy could fail in a new or perturbed environment that differs from the training environment because of dynamic variations. For controlling systems with continuous state and action spaces, we propose an add-on approach to robustifying a pre-trained RL policy by augmenting it with an L1 adaptive controller (L1AC). Leveraging the capability of an L1AC for fast estimation and active compensation of dynamic variations, the proposed approach can improve the robustness of an RL policy that is trained either in a simulator or in the real world without consideration of a broad class of dynamic variations. Numerical and real-world experiments empirically demonstrate the efficacy of the proposed approach in robustifying RL policies trained using both model-free and model-based methods.


I. INTRODUCTION
Reinforcement learning (RL) is a promising way to solve sequential decision-making problems [1]. In recent years, RL has shown impressive or superhuman performance in the control of complex robotic systems [2], [3]. An RL policy is often trained in a simulator and deployed in the real world. However, the discrepancy between the simulated and the real environment, known as the sim-to-real (S2R) gap, often causes the RL policy to fail in the real world. An RL policy may also be trained directly in a real-world environment; however, environment perturbations resulting from parameter variations, actuator failures and external disturbances can still cause the well-trained policy to fail. Take a delivery drone as an example (Fig. 1). We could train an RL policy to control the drone in a nominal environment (e.g., nominal load, mild wind disturbances, healthy propellers); however, this policy could fail and lead to a crash when the drone operates in a new environment (e.g., heavier loads, stronger wind disturbances, loss of propeller efficiency). To a certain extent, the S2R gap can be considered a special case of environment perturbation by treating the simulated and real environments as the old/nominal and new/perturbed environments, respectively.
Fig. 1: Proposed approach to policy robustness improvement based on L1 adaptive augmentation.

A. Related work
Robust/adversarial training: Domain/dynamics randomization was proposed to close the S2R gap [4]-[6] when transferring a policy from a simulator to the real world. Robust adversarial training addresses the S2R gap and environment perturbations by formulating a two-player zero-sum game between the agent and the disturbance [7]. A similar idea was explored in [8], where the Wasserstein distance was used to characterize the set of dynamics over which a robust policy was searched by solving a min-max problem. Though fairly general and applicable to a broad class of systems, these methods often involve tedious modifications to the training environment or the dynamics, which can only happen in a simulator. More importantly, the resulting fixed policies could overfit to the worst-case scenarios and thus lead to conservative or degraded performance in other cases [9]. This issue is well studied in the control community; more specifically, robust control [10], which aims to provide performance guarantees for the worst-case scenario, often leads to conservative nominal performance.
Post-training augmentation: Kim et al. [11] proposed to use a disturbance observer (DOB) to improve the robustness of an RL policy, in which the mismatch between the simulated training environment and the testing environment is estimated as a disturbance and compensated for. A similar idea was pursued in [12], which used a model reference adaptive control (MRAC) scheme to estimate and compensate for parametric uncertainties. Our objectives are similar to those in [11] and [12], but our approach and end results are different: we address a broader class of dynamic uncertainties (e.g., unknown input gain, which cannot be handled by [11], and time-dependent disturbances, which cannot be handled by [12]), and we leverage the L1 adaptive control architecture, which is capable of providing guaranteed transient (instead of just asymptotic) performance [13]. Additionally, we validate our approach on real hardware, as opposed to merely in numerical simulations as in [11], [12]. We note that L1 adaptive control has been combined with model predictive control (MPC) with application to quadrotors [14], and it has been used for safe learning and motion planning applicable to a broad class of nonlinear systems [15]-[17]. To put things into perspective, this paper is focused on applying the L1 adaptive control architecture to robustify an RL policy. In terms of technical details, this paper considers more general scenarios, e.g., unmatched disturbances and unknown input gain, which were not considered in [16], [17].
Learning to adapt: Meta-RL has recently been proposed to achieve fast adaptation of a pre-trained policy in the presence of dynamic variations [18]-[22]. Despite the impressive performance of these methods, mainly in terms of fast adaptation, the intermediate policies learned during the adaptation phase will most likely still fail. This is because a certain amount of information-rich data needs to be collected in order to learn a good model and/or policy. On the other hand, rooted in the theory of adaptive control and disturbance estimation [13], [23], [24], our proposed method can quickly estimate the discrepancy between a nominal model and the actual dynamics, and actively compensate for it in a timely manner. We envision that our proposed method can be combined with these methods to achieve robust and fast adaptation.

B. Statement of contributions
For controlling systems with continuous state and action spaces, we propose an add-on approach to robustifying an RL policy, which can be trained in standard ways without consideration of a broad class of potential dynamic variations. The essence of the proposed approach lies in augmenting the policy with an L1 adaptive control (L1AC) scheme [13] that quickly estimates and compensates for the uncertainties, so that the dynamics of the system in the perturbed environment are close to those in the nominal environment, in which the RL policy is trained and thus expected to function well. The idea is illustrated in Fig. 1.
Different from most existing robust RL methods that use domain randomization or robust/adversarial training [4]-[8], the proposed approach can be used to robustify an RL policy that is trained either in a simulator or in the real world, using either model-free or model-based methods, without consideration of a broad class of uncertainties during training. We empirically validate the approach on both numerical examples and real hardware.

II. PROBLEM SETTING
We assume that we have access to the system dynamics in the nominal environment, either simulated or in the real world, and that they are described by a nonlinear control-affine model:
$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(t), \tag{1}$$
where $x(t) \in \mathcal{X} \subset \mathbb{R}^n$ and $u(t) \in \mathcal{U} \subset \mathbb{R}^m$ are the state and input vectors, respectively, $\mathcal{X}$ and $\mathcal{U}$ are compact sets, and $f: \mathbb{R}^n \to \mathbb{R}^n$ and $g: \mathbb{R}^n \to \mathbb{R}^{n \times m}$ are known and locally Lipschitz-continuous functions. Moreover, $g(x)$ has full column rank for any $x \in \mathcal{X}$.
Remark 1. Control-affine models are commonly used for control design and can represent a broad class of mechanical and robotic systems. In addition, a control non-affine model can be converted into a control-affine model by introducing extra state variables (see, e.g., [25]). Therefore, the control-affine assumption is not very restrictive.
The nominal model (1) can come from physics-based modeling, data-driven modeling or a combination of both. Methods exist for maintaining the control-affine structure in data-driven modeling (see, e.g., [26]).
Assumption 1. We have access to a nominal control policy, $\pi_o(x)$, which is trained using the nominal dynamics (1) and thus functions well under such dynamics. Moreover, $\pi_o(x)$ is Lipschitz continuous in $\mathcal{X}$ with a Lipschitz constant $l_\pi$.
The policy $\pi_o(x)$ can be trained either in a simulator or in the real world in the standard (i.e., non-robust) way, using either model-based or model-free methods. The Lipschitz continuity assumption is needed to derive an error bound for estimating the disturbances in Section III-D. The nominal policy $\pi_o$ could fail in the perturbed environment due to the dynamic variations. We therefore propose a method to improve the robustness of this nominal policy in the presence of such dynamic variations by leveraging L1AC [13]. To achieve this, we further assume that the dynamics of the agent in the perturbed environment can be represented by
$$\dot{x}(t) = f(x(t)) + g(x(t))\,\Lambda u(t) + d(t, x(t)), \tag{2}$$
where $\Lambda$ is an unknown input gain matrix that satisfies Assumption 2, and $d(t, x)$ is an unknown function that can capture parameter perturbations, unmodeled dynamics and external disturbances. The perturbed dynamics (2) can be equivalently written as
$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(t) + \sigma(t, x(t), u(t)), \tag{3}$$
where
$$\sigma(t, x, u) \triangleq g(x)(\Lambda - I)u + d(t, x). \tag{4}$$
Remark 2. Uncertain input gain is very common in real-world systems. For instance, actuator failures, and variations in mass or inertia for force- or torque-controlled robotic systems, normally induce such input gain uncertainty. For a single-input system, $\Lambda = 0.6$ indicates a 40% loss of control effectiveness. Our representation of such uncertainty in (2) is broad enough to capture a large class of scenarios, while still allowing for effective compensation of the input gain uncertainty using L1AC (detailed in Section III).
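The relationship between the nominal model, the perturbed model and the lumped disturbance can be checked numerically. The sketch below is only an illustration: it assumes the perturbed dynamics take the form $\dot{x} = f(x) + g(x)\Lambda u + d(t, x)$, so that the lumped disturbance is $\sigma = g(x)(\Lambda - I)u + d(t, x)$, and it uses a hypothetical double-integrator system and a made-up disturbance that do not come from the experiments in this paper.

```python
import numpy as np

def nominal_rhs(f, g, x, u):
    """Right-hand side of the nominal control-affine model: f(x) + g(x) u."""
    return f(x) + g(x) @ u

def perturbed_rhs(f, g, Lam, d, t, x, u):
    """Assumed perturbed model: f(x) + g(x) Lam u + d(t, x)."""
    return f(x) + g(x) @ (Lam @ u) + d(t, x)

def lumped_disturbance(g, Lam, d, t, x, u):
    """Everything that separates the perturbed model from the nominal one."""
    m = Lam.shape[0]
    return g(x) @ ((Lam - np.eye(m)) @ u) + d(t, x)

# Hypothetical toy system (n = 2, m = 1), chosen only for illustration.
f = lambda x: np.array([x[1], 0.0])
g = lambda x: np.array([[0.0], [1.0]])
Lam = np.array([[0.6]])                             # 40% loss of control effectiveness
d = lambda t, x: np.array([0.0, 0.2 * np.sin(t)])   # made-up external disturbance

t, x, u = 1.0, np.array([0.1, -0.3]), np.array([2.0])
sigma = lumped_disturbance(g, Lam, d, t, x, u)
gap = perturbed_rhs(f, g, Lam, d, t, x, u) - nominal_rhs(f, g, x, u)
print("model gap equals lumped disturbance:", np.allclose(gap, sigma))
```

Note that the lumped disturbance depends on the input $u$ whenever $\Lambda \neq I$, which is why it is written as $\sigma(t, x, u)$ rather than as a function of time and state alone.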
To provide a rigorous treatment, we make the following assumptions on the perturbed dynamics (2).
Assumption 2. The matrix $\Lambda$ in (2) is an unknown strictly row-diagonally dominant matrix with $\mathrm{sgn}(\Lambda_{ii})$ known. Furthermore, $\Lambda$ belongs to a known compact convex set.
Remark 3. The first statement in Assumption 2 indicates that $\Lambda$ is always non-singular with known signs of its diagonal elements, and is often needed in applying adaptive control methods to mitigate the effect of uncertain input gain (see [23, Sections 6 and 7]). Without loss of generality, we further assume that the compact convex set in Assumption 2 contains the $m \times m$ identity matrix, $I$.
Assumption 3. There exist positive constants, including $l_d$, $b_d$, $l_f$, and $l_g$, such that for any $x, y \in \mathcal{X}$ and $t, \tau \geq 0$, the following inequalities hold:
Remark 4. This assumption essentially indicates that the rates of variation of $d(t, x)$ with respect to both $t$ and $x$, and of $f(x)$ and $g(x)$ with respect to $x$, in $\mathcal{X}$, are bounded. It is needed for deriving the theoretical error bounds (in Lemma 2) for estimating the lumped disturbance, $\sigma(t, x, u)$.
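The structural condition in Assumption 2 is easy to test for a candidate input gain matrix. The following sketch, with hypothetical example matrices, checks strict row-diagonal dominance; by the Levy-Desplanques theorem, any matrix that passes this check is non-singular, which is the property used in Remark 3.

```python
import numpy as np

def is_strictly_row_diag_dominant(Lam):
    """True if |Lam[i, i]| > sum_{j != i} |Lam[i, j]| for every row i."""
    Lam = np.asarray(Lam, dtype=float)
    diag = np.abs(np.diag(Lam))
    off_diag = np.sum(np.abs(Lam), axis=1) - diag
    return bool(np.all(diag > off_diag))

# Hypothetical input gain matrices, for illustration only.
Lam_ok = np.array([[0.6, 0.1], [-0.2, 0.9]])   # satisfies the condition
Lam_bad = np.array([[0.1, 0.5], [0.3, 0.2]])   # violates it in both rows

print(is_strictly_row_diag_dominant(Lam_ok), is_strictly_row_diag_dominant(Lam_bad))
```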
The problem we are tackling can be stated as follows.
Problem Statement: Given an RL policy $\pi_o(x)$ well trained in a nominal environment with the nominal dynamics (1), and assuming that the dynamics in the perturbed environment are represented by (2) satisfying Assumptions 2 and 3, provide a solution to improve the robustness of the policy $\pi_o(x)$ in the perturbed environment.

A. Overview of the proposed approach
The idea of our proposed approach is depicted in Fig. 1. With our approach, the training phase is standard: the nominal policy can be trained using almost any RL method (model-free or model-based) in a nominal environment. Once a nominal policy that functions well in the nominal environment is available, an L1 controller is designed to augment and work together with the nominal policy during policy execution. The L1 controller uses the dynamics of the nominal environment (1) as an internal nominal model, estimates the discrepancy between the nominal model and the actual dynamics, and compensates for this discrepancy so that the actual dynamics with the L1 controller (illustrated by the shaded area of Fig. 1) are close to the nominal dynamics. Since the RL policy is well trained using the nominal dynamics, it is expected to function well in the presence of the dynamic variations when combined with the L1 augmentation.

B. RL training for the nominal policy
As mentioned before, the policy can be trained in the standard way, using almost any RL method, including both model-free and model-based ones. The only requirement is that one has access to the nominal dynamics of the training environment in the form of (1).
As an illustration of the idea, for the experiments in Section IV, we choose PILCO [27], a model-based policy search method using Gaussian processes; soft actor-critic [28], a state-of-the-art model-free deep RL method; and a trajectory optimization method based on differential dynamic programming (DDP) [29] to obtain the nominal policy.

C. L1 adaptive augmentation for policy robustification
In this section, we explain how an L1AC scheme can be designed to augment and robustify a nominal RL policy. An L1 controller mainly consists of three components: a state predictor, an adaptive law, and a low-pass filtered control law. The state predictor is used to predict the system's state evolution, and the prediction error is subsequently used in the adaptive law to update the disturbance estimates. The control law aims to compensate for the estimated disturbance. For the perturbed system (2) with the nominal dynamics (1), the state predictor is given by
$$\dot{\hat{x}}(t) = f(x(t)) + g(x(t))\,u(t) + \hat{\sigma}(t) - a\,\tilde{x}(t), \tag{9}$$
where $\tilde{x}(t) \triangleq \hat{x}(t) - x(t)$ is the prediction error, $a$ is a positive constant, and $\hat{\sigma}(t)$ is the estimate of the lumped disturbance, $\sigma(t, x, u)$, at time $t$. Following the piecewise-constant (PWC) adaptive law (which connects with the CPU sampling time) [13, Section 3.3], the disturbance estimates are updated as
$$\hat{\sigma}(t) = \hat{\sigma}(iT), \quad t \in [iT, (i+1)T), \qquad \hat{\sigma}(iT) = -\frac{a}{e^{aT} - 1}\,\tilde{x}(iT), \tag{10}$$
for $i = 0, 1, \cdots$, where $T$ is the estimation sampling time.
With $\hat{\sigma}(t)$, we further compute
$$\begin{bmatrix} \hat{\sigma}_m(t) \\ \hat{\sigma}_{um}(t) \end{bmatrix} = \begin{bmatrix} g(x) & g^{\perp}(x) \end{bmatrix}^{-1} \hat{\sigma}(t), \tag{11}$$
where $\hat{\sigma}_m(t)$ and $\hat{\sigma}_{um}(t)$ are the matched and unmatched disturbance estimates, respectively, and $g^{\perp}(x) \in \mathbb{R}^{n \times (n-m)}$ satisfies $g(x)^{\top} g^{\perp}(x) = 0$ and $\mathrm{rank}\,[g(x) \ g^{\perp}(x)] = n$ for any $x \in \mathcal{X}$. From (3) and (9), we see that the total or lumped disturbance, $\sigma(t, x, u)$, is estimated by $\hat{\sigma}(t) = g(x)\hat{\sigma}_m(t) + g^{\perp}(x)\hat{\sigma}_{um}(t)$. The control law is given by
$$u(t) = u_{RL}(t) + u_{L1}(t), \qquad u_{L1}(s) = -C(s)\,\mathcal{L}[\hat{\sigma}_m(t)], \tag{12}$$
where $u_{RL}(t) = \pi_o(x(t))$ is the control command from the nominal RL policy, $u_{L1}(s)$ is the Laplace transform of the L1 control command $u_{L1}(t)$, $\mathcal{L}[\cdot]$ denotes the Laplace transform, and $C(s) \triangleq K(sI + K)^{-1}$ is an $m \times m$ transfer matrix consisting of low-pass filters with $K \in \mathbb{R}^{m \times m}$.
Remark 5. As can be seen from (9), (10) and (12), in an L1AC scheme with a PWC adaptive law [13, Section 3.3], all the dynamic uncertainties (such as parametric uncertainties, unmodeled dynamics and external disturbances) are lumped together and estimated as a total disturbance. This is different from most adaptive control schemes [23], which rely on a parameterization of the uncertainty to design adaptive laws for updating parameter estimates and usually consider only stationary uncertainties that do not directly depend on time. Details on deriving the estimation and control laws can be found in [30], [31].
The working principle of the L1 controller can be summarized as follows: the state predictor (9) and the adaptive law (10) can accurately estimate the lumped disturbance and hence its components, $\hat{\sigma}_m(t)$ and $\hat{\sigma}_{um}(t)$. In fact, under certain conditions, a bound on the estimation error, $\hat{\sigma}(t) - d(t, x)$, can be derived and is included in Appendix A. Additionally, the control law (12) mitigates the effect of disturbances by cancelling those within the bandwidth of the low-pass filter. Note that unmatched disturbances (also known as mismatched disturbances in the disturbance-observer based control literature [24]) cannot be directly canceled by control signals and are more challenging to deal with.
Remark 6. In designing the L1 controller consisting of (9), (10) and (12), we assume that the states are measured without noise. In practice, as long as the estimation sampling time is not too small and the filter bandwidth is not too large, the moderate measurement noise that always exists in real-world systems usually does not cause serious issues, as demonstrated by the hardware experiments in Section IV-D.
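To make the interplay of the three components concrete, the following is a minimal simulation sketch on a scalar toy plant. Everything specific in it is an assumption made for illustration: the plant $\dot{x} = -x + \Lambda u + d$, the proportional feedback standing in for an RL policy, the forward-Euler integration with step $T$, and the values of $\Lambda$, $d$, $a$, $T$ and $K$. The predictor and the adaptive gain $a/(e^{aT}-1)$ are written in the standard PWC form of [13, Section 3.3], which is assumed here rather than taken from a reference implementation.

```python
import math

def simulate(use_l1, Lam=0.5, d=-1.0, a=10.0, T=0.001, K=50.0, t_end=5.0):
    """Return the final state of a scalar toy plant, with or without L1 augmentation."""
    f = lambda x: -x                      # nominal drift (assumed)
    g = 1.0                               # nominal input map (assumed, scalar)
    policy = lambda x: -2.0 * (x - 1.0)   # stand-in for the RL policy (assumed)
    gain = a / (math.exp(a * T) - 1.0)    # PWC adaptive gain

    x, x_hat, u_l1 = 0.0, 0.0, 0.0
    for _ in range(int(round(t_end / T))):
        x_tilde = x_hat - x                               # prediction error
        sigma_hat = -gain * x_tilde if use_l1 else 0.0    # adaptive law
        sigma_m = sigma_hat / g                           # scalar case: fully matched
        u = policy(x) + u_l1                              # RL command + L1 command

        x_dot = f(x) + g * Lam * u + d                    # perturbed plant
        x_hat_dot = f(x) + g * u + sigma_hat - a * x_tilde  # state predictor
        u_l1_dot = -K * (u_l1 + sigma_m)                  # low-pass filter K/(s+K)

        x += T * x_dot
        x_hat += T * x_hat_dot
        u_l1 += T * u_l1_dot
    return x

# In the nominal case (Lam = 1, d = 0) the closed loop x' = -3x + 2 settles at 2/3.
x_nominal_target = 2.0 / 3.0
print("without L1:", simulate(False), " with L1:", simulate(True))
```

In this toy setting the perturbed plant without augmentation settles far from the nominal equilibrium, while the augmented loop settles close to it. The filter gain `K` trades off how quickly the estimated disturbance is cancelled against how much high-frequency content reaches the actuator.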

D. Analysis of the L1 adaptive augmentation
In this section, we provide an analysis of the L1 augmentation presented in Section III-C and explain the cases under which its performance can be limited. The working principle of the L1 controller can be summarized as follows: the state predictor (9) and the adaptive law (10) can accurately estimate the lumped disturbance and hence its components, $\hat{\sigma}_m(t)$ and $\hat{\sigma}_{um}(t)$, while the control law (12) mitigates the effect of the matched disturbance, $\hat{\sigma}_m(t)$, by cancelling it within the bandwidth of the low-pass filter.
1) Estimation error bound: Next, we will show that under certain assumptions, an error bound in estimating the lumped disturbance $\sigma(t, x, u)$ (and thus the matched and unmatched components) can be derived. Furthermore, this bound can be arbitrarily reduced by decreasing the estimation sampling time $T$, which indicates that the estimation after one estimation sampling interval can be arbitrarily accurate.
Lemma 1. Given the perturbed dynamics (2) subject to Assumptions 2 and 3, if $x(t) \in \mathcal{X}$ and $u(t) \in \mathcal{U}$ for any $t$ in $[0, \tau]$, we have that where
Proof. See Appendix A1.
Let us define: where $g^{+}(x)$ is the pseudoinverse of $g(x)$, and $\theta$ and $\phi$ are defined in (14) and (16), respectively. We next establish the estimation error bounds associated with the estimation scheme in (9) and (10).
Proof. See Appendix A2.
Remark 8. Lemma 2 essentially states that under Assumptions 2 and 3 and the assumed boundedness of $x$ and $u$, the error in estimating the lumped disturbance is always bounded. Furthermore, the error after one sampling interval can be arbitrarily reduced (by decreasing $T$). In practice, the size of $T$ is limited by the computational hardware and measurement noise.
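The dependence of the estimation accuracy on $T$ can be seen from a short calculation in the simplest case of a constant disturbance. The derivation below is a sketch under stated assumptions, not a restatement of the lemmas: it assumes a state predictor of the form $\dot{\hat{x}} = f(x) + g(x)u + \hat{\sigma} - a\tilde{x}$ and the adaptive gain $a/(e^{aT}-1)$ of the standard PWC law in [13, Section 3.3], and it treats $\sigma$ as constant over a sampling interval.

```latex
% Prediction-error dynamics on [iT, (i+1)T), with \tilde{x} = \hat{x} - x:
\begin{align*}
\dot{\tilde{x}}(t) &= -a\,\tilde{x}(t) + \hat{\sigma}(iT) - \sigma(t), \\
\tilde{x}((i+1)T) &= e^{-aT}\,\tilde{x}(iT) + \frac{1 - e^{-aT}}{a}\,\hat{\sigma}(iT)
  - \int_{iT}^{(i+1)T} e^{-a((i+1)T - \tau)}\,\sigma(\tau)\,d\tau .
\end{align*}
% With \hat{\sigma}(iT) = -\frac{a}{e^{aT}-1}\tilde{x}(iT), the first two terms cancel:
\begin{align*}
e^{-aT} - \frac{1 - e^{-aT}}{e^{aT} - 1} = e^{-aT} - e^{-aT} = 0 .
\end{align*}
% For a constant disturbance \sigma, the next estimate is therefore
\begin{align*}
\tilde{x}((i+1)T) = -\frac{1 - e^{-aT}}{a}\,\sigma, \qquad
\hat{\sigma}((i+1)T) = \frac{a}{e^{aT} - 1}\cdot\frac{1 - e^{-aT}}{a}\,\sigma = e^{-aT}\,\sigma .
\end{align*}
```

Under these assumptions the relative estimation error after one interval is $1 - e^{-aT}$, which vanishes as $T \to 0$. For example, $a = 10$ and $T = 0.002$ second give an error of roughly 2% in this idealized constant-disturbance case.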
2) Limitations of the proposed approach: As mentioned before, the control law (12) only tries to cancel the matched disturbance $\hat{\sigma}_m(t)$, while ignoring the unmatched disturbance $\hat{\sigma}_{um}(t)$. Dealing with unmatched disturbances in the nonlinear setting has been a long-standing challenge for adaptive or disturbance-observer based control methods, and requires other methods, e.g., those based on robust control [32]. As a result, when the unmatched disturbance dominates the total disturbance, the performance of the proposed approach will be limited. This is demonstrated in Section IV, e.g., in the quadrotor example in the presence of wind disturbances.
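The split into matched and unmatched components can be illustrated with a few lines of linear algebra. The sketch below is one possible way to do it, assuming only that $g(x)$ has full column rank: it builds an orthonormal basis for the orthogonal complement of the range of $g(x)$ from the SVD and then solves for the two sets of coefficients. The matrix and the disturbance estimate are hypothetical values chosen for illustration.

```python
import numpy as np

def decompose_disturbance(g_x, sigma_hat):
    """Split sigma_hat into matched and unmatched coefficients.

    g_x: (n, m) input matrix with full column rank.
    Returns (sigma_m, sigma_um, g_perp) such that
    sigma_hat = g_x @ sigma_m + g_perp @ sigma_um and g_x.T @ g_perp = 0.
    """
    n, m = g_x.shape
    # Columns m..n-1 of U span the orthogonal complement of range(g_x).
    U, _, _ = np.linalg.svd(g_x, full_matrices=True)
    g_perp = U[:, m:]
    coeffs = np.linalg.solve(np.hstack([g_x, g_perp]), sigma_hat)
    return coeffs[:m], coeffs[m:], g_perp

# Hypothetical example: the input enters only the second state equation.
g_x = np.array([[0.0], [1.0]])
sigma_hat = np.array([0.3, -0.7])     # made-up lumped disturbance estimate
sigma_m, sigma_um, g_perp = decompose_disturbance(g_x, sigma_hat)
print("matched:", sigma_m, "unmatched magnitude:", np.abs(sigma_um))
```

Only `sigma_m` enters the control law; the part of the disturbance along `g_perp` acts on a state equation that the input does not reach directly, which is why it cannot be cancelled by the control signal. The sign of `sigma_um` depends on the orientation of the basis returned by the SVD, so only its magnitude is printed.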

E. Comparison with existing approaches
The comparison of our proposed approach with existing approaches is summarized in Table I. Our approach falls into the category of post-training augmentation (PTA), which does not require a special training process such as randomizing parameters and adding disturbances, and allows the training to be done in both simulated and real-world environments, as opposed to robust/adversarial training (RAT) methods. Additionally, RAT methods aim to find a fixed policy for all possible realizations of uncertainties, which could be infeasible when the range of uncertainties is large. Compared to existing PTA methods based on MRAC and DOB, our approach is able to deal with a broader class of uncertainties, and is validated on real hardware.
On the other hand, similar to other PTA methods, our approach needs the dynamics to be continuous and have a control-affine form, and can only effectively compensate for the matched disturbance. Dealing with unmatched disturbances in the nonlinear setting has been a long-standing challenge for adaptive or DOB-based control methods; other methods, e.g., those based on robust control [33], must be considered. As a result, when the unmatched disturbance dominates the total disturbance, the performance of the proposed approach will be limited. This is demonstrated in Section IV, e.g., in the quadrotor example in the presence of wind disturbances.

IV. EXPERIMENTS
We now apply the proposed approach to three systems, namely a cart-pole, a Pendubot and a 3-D quadrotor. In particular, for the Pendubot, experiments on real hardware are also conducted. An overview of the systems and test settings is given in Table II. The dynamic models for these systems are included in Appendix B.

A. Cart-pole swing-up and balance in simulations
The system states include the cart position $x_c$ and velocity $\dot{x}_c$, and the pole angle $\theta$ and angular velocity $\dot{\theta}$. The input is the force applied to the cart. The nominal values of the key parameters in the dynamics are $M = 0.5$ kg (cart mass), $m = 0.5$ kg (pole mass) and $l_{pole} = 0.6$ m (pole length). The pole initially hangs roughly straight down ($\theta = 0$) with small random perturbations. The goal is to search for a policy that can swing up the pole and balance it at the upright position (corresponding to $x_c = 0$ and $\theta = 180°$).
We used PILCO [27] to search for a policy in the nominal environment defined by the nominal values mentioned above. PILCO adopts Gaussian processes (GPs) to learn the system dynamics, uses the learned dynamics together with uncertainty propagation (e.g., based on moment matching or linearization) to predict the cost, and then applies gradient descent to search for the optimal policy. PILCO achieved unprecedented data efficiency in RL.
We next perturb the environment to test the robustness of the nominal policy with and without L1 augmentation. For the L1 augmentation design, we use the physics-based model with the nominal parameter values as the nominal model, instead of the GP model learned during policy training, for simplicity. Moreover, the parameters in (9), (10) and (12) were chosen to be $a = 10$, $T = 0.002$ second and $K = 200$, and fixed across all the tests. Figure 2 shows the results in the presence of perturbations in the cart mass and pole length, where the perturbations in the latter induce unmatched disturbances. One can see that the L1 augmentation significantly improves the robustness of the PILCO policy. For instance, PILCO plus L1 augmentation was able to consistently achieve the goal even when the cart mass was perturbed to 3 kg (six times its nominal value) or when the pole length was reduced to 0.2 m (one third of its nominal value). We further performed testing under ten scenarios, each of which involves random joint perturbations in the cart mass, pole mass and pole length, in the ranges $M \in [0.1, 5]$ kg, $m \in [0.1, 5]$ kg and $l_{pole} \in [0.3, 1]$ m. The sampled parameters and the success/failure results for each scenario are shown in Fig. 3. Once again, the L1 augmentation significantly improved the policy robustness, as validated by the much higher success rate. It is also not a surprise that PILCO plus L1 augmentation failed under Scenarios 9 and 10, as these two scenarios involve significant perturbations in pole mass (and additionally in pole length for Scenario 9), which induce unmatched disturbances that could not be compensated for.

B. Pendubot swing-up and balance in simulations
As depicted in Fig. 8, the Pendubot is a mechatronic system consisting of two rigid links interconnected by revolute joints, with the second joint unactuated. The states of the system include the angles and angular rates of the two links, and the control input is the torque applied to Link 1. The task is to swing up the links from the initial states $[q_1, q_2] = [\pi, \pi]$ to the upright position $[q_1, q_2] = [0, 0]$ and balance them there, as illustrated in Fig. 8. The same reward function is used for training the SAC and DR-SAC policies and is defined by $r = -3(|\sin(q_1)| + |\cos(q_1) - 1| + |\sin(q_2)| + |\cos(q_2) - 1|)$. The nominal RL policies were trained in simulation using soft actor-critic (SAC) [28] implemented in the MATLAB Reinforcement Learning Toolbox. For comparison, we also trained a few robust policies (termed DR-SAC) with SAC and domain randomization [5], [6], in which three parameters, namely the input gain ($\Lambda$), the mass of Link 1 ($m_1$) and the mass of Link 2 ($m_2$), were randomly sampled in a variety of ranges. Additionally, we tried imposing different control limits (through squashing). When training the SAC and DR-SAC policies, each agent includes an actor and two critics, all three of which share the same neural network structure, which has two hidden fully-connected layers with 300 and 400 neurons, respectively. The same hyper-parameters were used for training all the DR-SAC and SAC policies. We did five trials for each setting. Table III lists three of the many settings that we tested for training the DR-SAC policies and the setting for training the vanilla SAC policy. Figure 4 shows the average episode return (computed using a window of 10 episodes) during training. The solid curves correspond to the mean and the shaded region to the minimum and maximum average return over the five trials. As seen in Fig. 4, it was much easier and took far fewer episodes to find a good SAC policy, compared to training DR-SAC policies. We were able to find a good DR-SAC policy (i.e., DR-SAC3) under Setting IV, while further increasing the range of parameter perturbations associated with Setting IV led to degraded performance of the resulting DR-SAC policies even with a larger control limit, as illustrated by the training curves for DR-SAC1 and DR-SAC2. For subsequent tests, we chose the best DR-SAC3 from all five trials and compared it with other control policies.
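The reward function above is simple enough to write down directly. The snippet below is a plain-Python transcription of the stated formula (the function name is ours); it is zero at the upright target $[q_1, q_2] = [0, 0]$ and most negative at the hanging initial configuration $[\pi, \pi]$.

```python
import math

def pendubot_reward(q1, q2):
    """r = -3(|sin q1| + |cos q1 - 1| + |sin q2| + |cos q2 - 1|), angles in radians."""
    return -3.0 * (abs(math.sin(q1)) + abs(math.cos(q1) - 1.0)
                   + abs(math.sin(q2)) + abs(math.cos(q2) - 1.0))

r_upright = pendubot_reward(0.0, 0.0)          # target configuration
r_hanging = pendubot_reward(math.pi, math.pi)  # initial configuration
print(r_upright, r_hanging)
```

Because each term is a function of $\sin$ and $\cos$ of the joint angle, the reward is invariant to adding multiples of $2\pi$ to either angle, which avoids penalizing a swing-up that wraps around.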
We tested the performance of vanilla SAC, DR-SAC, SAC with L1 augmentation (SAC+L1) and DR-SAC with L1 augmentation (DR-SAC+L1) under a wide range of perturbations in $m_1$ and $m_2$, and under three input gain settings: $\Lambda = 1.0$, 0.5 and 0.3, where the latter two indicate a loss of control effectiveness of 50% and 70%, respectively. For the L1 augmentation design, the parameters in (9), (10) and (12) were chosen to be $a = 10$, $T = 0.005$ second and $K = 200$, and fixed across all the tests. The results in terms of the normalized accumulative reward under each test scenario are shown in Fig. 5. Note that perturbation in $m_2$ induces unmatched uncertainties that cannot be compensated for by the L1 control law. As one can see, the performance of vanilla SAC drops dramatically when the perturbations in $m_1$, $m_2$ and $\Lambda$ increase. DR-SAC3 achieved acceptable performance under $\Lambda = 0.5$ in general, except when the perturbations in $m_1$ and $m_2$ are near the maximum, which is beyond the perturbations encountered during training of DR-SAC3. However, when the control effectiveness further decreases to 30% of its nominal value, DR-SAC3's performance degrades significantly, while only slight performance degradation is observed under SAC+L1 and DR-SAC3+L1 when the perturbations increase to the maximum. It is worth noting that SAC+L1 and DR-SAC3+L1 show comparable performance under the tested scenarios. We conjecture that in the case of larger unmatched uncertainties, DR-SAC3+L1 will outperform SAC+L1.

C. 3-D quadrotor navigation in simulations
The states include the quadrotor position $(x, y, z)$ and linear velocities $(\dot{x}, \dot{y}, \dot{z})$ in an inertial frame, and the roll, pitch and yaw angles $(\phi, \theta, \psi)$ of the quadrotor body frame with respect to the inertial frame, as well as their derivatives. Motor mixing is also included in the dynamics. The inputs are the total thrust $f_z$ and the three moments about the three axes $(\tau_\phi, \tau_\theta, \tau_\psi)$ generated by the four propellers.
The nominal values of the key parameters are set to be $[I_x, I_y, I_z] = [0.082, 0.0845, 0.1377]$ kg·m² (moments of inertia), $m = 4.34$ kg (quadrotor mass), and $c_{p_i} = 1$ ($i = 1, 2, 3, 4$) (propeller control coefficients). The mission is to control the quadrotor to fly from the origin to the target point (4, 4, 2). To obtain a policy for achieving the mission, we chose to use trajectory optimization, which, together with model learning, is commonly used for model-based RL [34], [35]. We further chose to use differential dynamic programming (DDP) [29], a specific trajectory optimization method. Since our focus is not on the training but on robustifying a pre-trained policy, we use the physics-based dynamic model with the nominal parameter values as the model "learned" in the nominal environment. This model is used for computing the DDP policy and for designing the adaptive augmentation. For computing the DDP policy, we discretized the nominal dynamics and applied the method in [29] with the cost function $J = \tilde{x}_N^{\top} P_N \tilde{x}_N + \sum_{i=0}^{N-1} \left( \tilde{x}_i^{\top} P \tilde{x}_i + u_i^{\top} Q u_i \right)$, where $\tilde{x}_i = x_i - x_{target}$ for $i = 0, 1, \ldots, N$, $N$ is the control horizon, $P = \mathrm{diag}(2, 2, 2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)$, $P_N = \mathrm{diag}(10, 10, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5)$ and $Q = \mathrm{diag}(20, 4, 4, 4)$. For the L1 augmentation design, the parameters in (9), (10) and (12) were chosen to be $a = 10$, $T = 0.001$ second and $K = 200$, and fixed across all the tests.
We tested the performance of the DDP policy with and without L1 augmentation under three types of dynamic perturbations. The first one is loss of propeller efficiency, which mimics the effect of propeller failures and is simulated by adjusting the control coefficients $c_{p_i}$ ($i = 1, 2, 3, 4$). Figure 6a shows the resulting trajectories under ten scenarios, in each of which the control coefficients of two propellers were randomly selected in [0.5, 1]. One can see that L1 augmentation significantly improved the robustness of the DDP policy, leading to consistent trajectories that are close to the ideal trajectory obtained by applying the policy to the nominal dynamics. The second type of dynamic perturbation is a change in mass and inertia, e.g., to mimic the effect of carrying different packages for a delivery drone. Fig. 6b shows the results under ten scenarios in which the mass and inertia were increased by a factor randomly drawn from [2, 5]. Once again, L1 augmentation significantly improved the policy robustness, leading to close-to-ideal trajectories. The third type of dynamic variation is related to wind disturbances in the horizontal plane, which cause disturbance forces in the x and y directions. In each of the ten scenarios, the forces were simulated by stochastic variables with the mean values randomly sampled from [10, 25]. The results are depicted in Fig. 6c. L1 augmentation improved the robustness, but was not able to yield close-to-ideal performance. This is mainly because the wind disturbances induce unmatched disturbances ($\hat{\sigma}_{um}(t)$ in (9) and (10)), which are not compensated for in the control law (12). Finally, Fig. 7 illustrates the simulation results under joint perturbations in quadrotor mass, inertia and propeller efficiency, and wind disturbances.
Fig. 7: Results under joint perturbations in quadrotor mass, inertia and propeller efficiencies, and wind disturbances. In each of the ten scenarios, each type of perturbation was generated in the same way as for the results in Figs. 6a-6c.
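As a concrete reading of the quadratic cost used for the DDP policy, the sketch below evaluates $J$ for a given discrete-time trajectory with the weights stated above. The function name, the array layout and the ordering of the 12 states are assumptions made for illustration; in particular, the example target places the position (4, 4, 2) in the first three entries and sets all remaining states to zero, which is our assumption rather than something stated in the text.

```python
import numpy as np

# Weights as stated in the text.
P = np.diag([2, 2, 2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
P_N = np.diag([10, 10, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5])
Q = np.diag([20, 4, 4, 4])

def trajectory_cost(xs, us, x_target):
    """Terminal cost plus running cost for states xs (N+1, 12) and inputs us (N, 4)."""
    xs, us = np.asarray(xs, dtype=float), np.asarray(us, dtype=float)
    err = xs - x_target                                  # tracking errors, one row per step
    running = sum(e @ P @ e + u @ Q @ u for e, u in zip(err[:-1], us))
    terminal = err[-1] @ P_N @ err[-1]
    return float(terminal + running)

# Assumed state ordering: position first, all other target states zero.
x_target = np.zeros(12)
x_target[:3] = [4.0, 4.0, 2.0]

# One-step example: staying at the origin with zero input.
cost_at_origin = trajectory_cost(np.zeros((2, 12)), np.zeros((1, 4)), x_target)
cost_at_target = trajectory_cost(np.tile(x_target, (2, 1)), np.zeros((1, 4)), x_target)
print(cost_at_origin, cost_at_target)
```

The terminal weights on position are five times the running weights, so a trajectory that ends away from the target is penalized more heavily than one that merely takes longer to get there.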

D. Pendubot swing-up and balance on real hardware
We further tested the performance of the policies used in Section IV-B on the hardware setup depicted in Fig. 8. In addition to SAC and DR-SAC, we trained another policy using PILCO with the same reward function defined by (22). The ways to introduce dynamic variations include changing the input gain $\Lambda$, adding masses to Link 2, adding disturbance forces using a rubber band, and different combinations of these three. For the L1 augmentation design, the parameters in (9), (10) and (12) included $a = 150$ and $T = 0.005$ second. The outcomes are reported in Table IV, which marks each test as a success or a failure in achieving the mission. A video of the experiments is available at https://youtu.be/xZBcsNMYK3Y.
As one can see, in the nominal case (i.e., without intentionally introduced dynamic variations), all the policies with and without L1 augmentation succeeded in achieving the mission. This, to a certain extent, indicates that the L1 augmentation does not adversely affect the performance of RL policies in the presence of no or minimal dynamic variations. Additionally, L1 augmentation significantly improves the robustness of PILCO and vanilla SAC, enabling them to succeed under all the tested scenarios except Scenario V for SAC, due to the extreme dynamic variations induced by the largest perturbations in input gain and added masses. DR-SAC displayed much more robustness than vanilla SAC, as expected, and only failed under Scenario V. It is worth noting that L1 augmentation further enhanced the robustness of DR-SAC and made it succeed under Scenario V. In Scenario VI, a rubber band was attached to the joint connecting the two links to exert a disturbance force. The disturbance force applied by the rubber band changed quite rapidly and peaked when Link 1 reached the upright position. This caused great challenges for the RL policies, as evidenced by the struggling of PILCO and SAC in the video, since, by training, these policies are not expected to produce large control inputs near the upright position. Nevertheless, with the help of L1 compensation, PILCO and SAC were able to deal with this challenging scenario.
V. CONCLUSION
This paper presents an add-on scheme to improve the robustness of a reinforcement learning (RL) policy for controlling systems with continuous state and action spaces, by augmenting it with an L 1 adaptive controller (L 1 AC) that can quickly estimate and actively compensate for potential dynamic variations during execution of this policy. Our approach is easy to implement and allows the policy to be trained or computed using almost any RL method (model-free or model-based), either in a simulator or in the real world, as long as a control-affine model describing the dynamics of the nominal environment is available for the L 1 AC design. Experiments on different systems, both in simulation and on real hardware, demonstrate the general applicability of the proposed approach and its capability to improve the robustness of RL policies, including those trained robustly, e.g., using domain/dynamics randomization (DR). Future work includes incorporating mechanisms, e.g., based on robust control [31], [33], to mitigate the effect of unmatched disturbances, and model learning to safely and robustly learn the unknown dynamics.
The proposed approach and existing robust RL methods, e.g., those based on DR, do not necessarily replace each other. Instead, they can complement each other, as demonstrated by the experimental results in Section IV-D. As mentioned before, existing robust RL methods aim to find a fixed policy for all possible realizations of uncertainties, which could be infeasible when the range of uncertainties is large. On the other hand, the proposed adaptive augmentation approach can deal with a significant amount of matched uncertainties by using additional control effort to actively compensate for them, but cannot handle unmatched uncertainties in its current form. For systems subject to both matched and unmatched disturbances, a compelling solution would be to combine the strengths of both by (1) (partially) ignoring matched disturbances when training a policy using existing robust RL methods to reduce conservativeness, and (2) augmenting the trained policy with the proposed L 1 scheme during execution of this policy to compensate for matched disturbances.
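To illustrate how an adaptive augmentation can actively compensate for a matched disturbance, the following is a minimal sketch on a scalar toy system, not the implementation used in the experiments. It loosely follows the structure of a state predictor, a piecewise-constant adaptation law and a low-pass-filtered compensation signal. The toy dynamics, the linear stand-in for the pre-trained policy, the disturbance value, the filter bandwidth `omega`, and the values of `a` and `T` are all assumptions chosen for illustration.

```python
import math


def simulate(use_l1, t_end=5.0, T=0.001, a=10.0, omega=20.0):
    """Simulate a scalar toy system and return its final state.

    All dynamics and gains here are hypothetical and only illustrate
    the estimate-and-compensate idea.
    """
    f = lambda x: x                # assumed nominal drift (open-loop unstable)
    g = 1.0                        # assumed nominal input gain
    policy = lambda x: -3.0 * x    # stand-in for a pre-trained RL policy
    d = 2.0                        # matched disturbance, unseen in training

    # Gain of the piecewise-constant adaptation law (scalar case).
    gain = a / (math.exp(a * T) - 1.0)

    x, x_hat, u_a = 0.5, 0.5, 0.0
    for _ in range(int(round(t_end / T))):
        x_tilde = x_hat - x                 # prediction error
        sigma_hat = -gain * x_tilde         # uncertainty estimate
        if use_l1:
            # First-order low-pass filter of the cancelling input.
            u_a += T * omega * (-sigma_hat / g - u_a)
        u = policy(x) + u_a                 # policy plus adaptive input

        # Forward-Euler step of the perturbed plant and the predictor.
        x_dot = f(x) + g * (u + d)
        x_hat_dot = f(x) + g * u + sigma_hat - a * x_tilde
        x += T * x_dot
        x_hat += T * x_hat_dot
    return x


x_plain = simulate(use_l1=False)   # policy alone
x_l1 = simulate(use_l1=True)       # policy with adaptive augmentation
print(f"final state without augmentation: {x_plain:.3f}")
print(f"final state with augmentation:    {x_l1:.3f}")
```

In this toy setting, the policy alone is pushed to a nonzero steady state by the disturbance, whereas the augmented loop keeps the state close to the origin. Because the disturbance enters through the same channel as the control input, the filtered estimate can cancel it; an unmatched disturbance would not be removed by this mechanism.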

A. Proof of Lemmas
Hereafter, the notations Z_i and Z_1^n denote the integer sets where the two inequalities hold due to (5) and (6) in Assumption 3. Additionally,
2) Proof of Lemma 2: From (3) and (9), the prediction error dynamics are obtained as Therefore, σ̂(t) = 0 for any t ∈ [0, T) according to (10). Further considering the bounds on d(t, x) and g(x)(Λ − I)u in (13), we have We next derive the bound on σ̂(t) − σ(t, x, u) for t ∈ [T, τ).
3) Quadrotor: The dynamics are taken from [38], which uses Euler angles. The system is given by

Fig. 2: Results in the presence of perturbations in cart mass and pole length. Ten trials were performed and average results with variances are shown for each perturbation case. Cumulative reward is normalized.

Fig. 3: Results (bottom) under ten random perturbations in the cart mass, pole mass and length (percentage perturbation with respect to the nominal value shown at the top)

Fig. 5: Performance of SAC, DR-SAC3, SAC+L 1 and DR-SAC3+L 1 for Pendubot under perturbations in m 1 , m 2 and Λ. Percentage change with respect to the nominal value is used to measure the perturbations in m 1 and m 2 .

Fig. 6: Results under loss of propeller efficiency (a), perturbations in quadrotor mass and inertia (b), and wind disturbances (c). DDP (ideal) denotes the trajectory obtained by applying the policy to the nominal dynamics.

Fig. 8: Left: a Pendubot configuration. Middle: stabilization at the upright position. Right: added masses and a rubber band used to induce dynamic variations

TABLE I: Comparison with existing approaches to improving the robustness of RL policies

TABLE II: An overview of testing systems and settings

TABLE III: Selected training settings for Pendubot