Provably Safe Reinforcement Learning via Action Projection Using Reachability Analysis and Polynomial Zonotopes

While reinforcement learning produces very promising results for many applications, its main disadvantage is the lack of safety guarantees, which prevents its use in safety-critical systems. In this work, we address this issue with a safety shield for nonlinear continuous systems solving reach-avoid tasks. Our safety shield prevents the application of potentially unsafe actions from a reinforcement learning agent by projecting the proposed action to the closest safe action. This approach is called action projection and is implemented via mixed-integer optimization. The safety constraints for action projection are obtained by applying parameterized reachability analysis using polynomial zonotopes, which enables us to accurately capture the nonlinear effects of the actions on the system. In contrast to other state-of-the-art approaches for action projection, our safety shield can efficiently handle input constraints and dynamic obstacles, eases the incorporation of the spatial robot dimensions into the safety constraints, guarantees robust safety despite process noise and measurement errors, and is well suited for high-dimensional systems, as we demonstrate on several challenging benchmark systems.


I. INTRODUCTION
Reinforcement learning has been successfully applied to find solutions for many challenging applications, such as robotics [1], autonomous driving [2], and power systems [3]. Many of these applications are safety-critical, so that the lack of safety guarantees for standard reinforcement learning controllers prevents their deployment in the real world. We aim to overcome this limitation with a novel safety shield for reinforcement learning agents that considers the very general case of disturbed nonlinear continuous systems with input constraints that have to avoid dynamic obstacles. Note that our safety shield can be applied to arbitrary unsafe controllers, while reinforcement learning is the main focus of this work.

A. State of the Art
Fig. 1: Steps for action projection using parameterized reachability analysis, where the reachable set is depicted in gray and the unsafe regions are shown in red: 1) Computation of the reachable set for all actions starting from the current state x 0 . 2) Extraction of action constraints from the intersections between the reachable set and the unsafe regions. 3) Projection of the action u a outputted by the agent to the closest safe action.

We first provide a summary of the current state of the art in safety-related methods for reinforcement learning. The term safe reinforcement learning refers to approaches that aim to obtain safe agents but do not provide hard safety guarantees. One example for this is constrained reinforcement learning [4], [5], where the objective of the training phase is to maximize the reward while satisfying safety constraints. While advantages of this technique are that no system model is required and that even complex temporal logic safety specifications [6], [7] can be considered, the obvious
disadvantage is that hard safety guarantees can be provided during neither training nor deployment. The same is true for probabilistic approaches [8], [9] that aim to identify the safety probability of an action. Overall, safe reinforcement learning techniques can be used for non-critical applications, where unsafe actions do not cause major damage; however, these methods are not suited for safety-critical systems.
In contrast to safe reinforcement learning, provably safe reinforcement learning approaches provide hard safety guarantees. They can be divided into three main categories: action masking, action replacement, and action projection [10]. In action masking [11], [12], a mask that only allows the agent to choose actions from the set of safe actions is applied. One disadvantage of this method is that it is often hard to explicitly compute the set of safe actions, especially for continuous action spaces, where the set of safe actions often has a very complex non-convex shape, as shown in Fig. 1. In addition, it is non-trivial to correctly consider the masking during training so that the reinforcement learning algorithm is not perturbed [13]. For action replacement [14]-[17], unsafe actions returned by the agent are replaced by safe actions. As a replacement, one can either use a single safe action obtained from a failsafe planner [14] or via human feedback [15], or one can sample from the set of safe actions [16], [17]. The well-known simplex architecture [18]-[20], where a safe controller is used as a backup for an unsafe controller, can also be categorized as action replacement. One disadvantage of action replacement is that the difference between the original action and the replacement action can be very large, which might prevent the agent from completing its task. Action projection tries to avoid this issue by finding the safe action that is closest to the action suggested by the agent.
Since our approach applies action projection, we discuss this category in more detail. The most prominent methods for action projection are control barrier functions [21], [22], model predictive control [23], [24], and parameterized reachability analysis [25]. A control barrier function is a level-set function that divides the state space into a safe and a potentially unsafe region. Here, action projection is formulated as an optimization problem, where the correction of the action is minimized such that the system stays inside the safe region defined by the control barrier function. While an advantage is that control barrier functions can, for static environments, guarantee safety for an infinite time horizon, the method also has several disadvantages: 1) It is often not easy to find a suitable control barrier function, especially in the presence of dynamic obstacles. 2) Control barrier functions are often quite conservative since they usually exclude many states that are safe. 3) The approach is often limited to control-affine systems because the optimization problems would otherwise become non-convex. 4) It is challenging to consider input constraints as well as process noise and measurement errors. The second method is model predictive control, which also formulates the projection as an optimization problem but uses the safety constraint that the system should not enter any unsafe regions for a certain finite prediction horizon, which avoids the requirement for a control barrier function. However, one downside is that it is often not possible to guarantee that the solution is robustly safe despite process noise and measurement errors, since for nonlinear systems these uncertainties usually cannot be encoded directly into the optimization problem. Our safety shield is based on the parameterized reachability analysis approach, which is visualized in Fig. 1: The first step is to compute the reachable set for all available actions. Since this reachable set is parameterized by the actions, one can directly extract the safety constraints for action projection from the intersection between the reachable sets and the unsafe regions. Since process noise as well as measurement errors can conveniently be integrated into reachability analysis, this approach is very well suited for guaranteeing robust safety.
Due to these advantageous properties, several approaches apply reachability analysis to guarantee safety. One method [26] uses the Hamilton-Jacobi reachability framework [27] to compute the backward reachable set starting from the unsafe sets; a state is safe for all possible actions if it is outside of the backward reachable set. This has the disadvantage that a different backward reachable set has to be computed for each unsafe set. Moreover, the Hamilton-Jacobi framework requires gridding the state space, so that the computational complexity of the approach grows exponentially with the system dimension. Another method [28] applies reachability analysis for black-box systems and uses a differentiable collision check based on constrained zonotopes [29] to efficiently push the reachable set for the proposed action away from unsafe sets. This, however, has the drawback that the reachable set has to be recomputed after each correction update of the action, which is computationally demanding. The method closest to our approach is the reachability-based trajectory safeguard [25], which computes the parameterized reachable set for a simplified trajectory-generating model and determines a safe action satisfying the constraints extracted from the reachable set via random sampling. While this approach can be computationally efficient for some systems, sampling methods often fail to find feasible solutions, especially in high-dimensional action spaces.

B. Contributions and Outline
We present a novel safety shield that is based on action projection using parameterized reachability analysis. This safety shield extends our previous work on dependency-preserving reachability analysis [30]-[32] by a method for correcting unsafe actions, and we additionally study the effect that online verification has on the learning process. Unlike the related approach in [25], our safety shield directly operates on the original nonlinear system model rather than on a simplified trajectory-generating model. Moreover, in contrast to [25], we use conservative polynomialization [30] instead of conservative linearization [33] for reachability analysis, which enables us to efficiently capture the nonlinear effects the actions have on the system. Another advantage over [25] is that we use mixed-integer optimization instead of random sampling for the projection, which always finds the action with the smallest correction. Finally, the various design choices provided by our safety shield enable the user to fine-tune its performance for the considered application.
The remainder of this paper is structured as follows: After introducing some preliminaries in Sec. II, we provide the problem definition in Sec. III. Our main contribution is the reachability-based safety shield for reinforcement learning presented in Sec. IV, for which we discuss several extensions in Sec. V. Finally, we demonstrate our approach on several numerical examples in Sec. VI and conclude with a discussion of its properties in Sec. VII.

II. PRELIMINARIES
We first introduce our notation and define the set representations that we use in this paper.

A. Notation
Sets are denoted by calligraphic letters, matrices by uppercase letters, and vectors by lowercase letters. Given a vector a ∈ R n , a (i) is the i-th entry and the p-norm is denoted by ∥a∥ p . Given a matrix A ∈ R n×m , A (i,·) represents the i-th matrix row, A (·,j) the j-th column, and A (i,j) the j-th entry of matrix row i. The concatenation of two matrices C and D is denoted by [C D], I n ∈ R n×n is the identity matrix, and the symbols 0 and 1 represent vectors of zeros and ones of proper dimension. We further introduce an n-dimensional interval as I = [l, u] = {x ∈ R n | l (i) ≤ x (i) ≤ u (i) , i = 1, . . ., n} with bounds l, u ∈ R n .

B. Set Representations
Our approach relies on several different set representations, which we introduce here. Let us begin with polytopes, for which we use the halfspace representation:

Definition 1 (Polytope): Given a constraint matrix A ∈ R s×n and a constraint offset b ∈ R s , the halfspace representation of a polytope P ⊆ R n is

P = { x ∈ R n | A x ≤ b }.

We use the shorthand P = ⟨A, b⟩ P . Zonotopes are a special type of polytopes that can be represented efficiently using generators:

Definition 2 (Zonotope): Given a center vector c ∈ R n and a generator matrix G ∈ R n×p , a zonotope Z ⊂ R n is

Z = { c + Σ_{i=1}^{p} α i G (·,i) | α i ∈ [−1, 1] }

with so-called factors α i . We use the shorthand Z = ⟨c, G⟩ Z . An extension of zonotopes are polynomial zonotopes [30], which can represent non-convex sets. We use the sparse representation of polynomial zonotopes [32]:

Definition 3 (Polynomial Zonotope): Given a constant offset c ∈ R n , a generator matrix of dependent generators G ∈ R n×h , a generator matrix of independent generators G I ∈ R n×q , and an exponent matrix E ∈ N_0^{p×h} , a polynomial zonotope PZ ⊂ R n is

PZ = { c + Σ_{k=1}^{h} ( Π_{i=1}^{p} α i ^ E (i,k) ) G (·,k) + Σ_{j=1}^{q} β j G I(·,j) | α i , β j ∈ [−1, 1] }.

The scalars α k are called dependent factors and β j independent factors. We use the shorthand PZ = ⟨c, G, G I , E⟩ PZ . Polynomial zonotopes can equivalently represent intervals, zonotopes, polytopes, and Taylor models [32, Sec. II.B]. Moreover, due to their polynomial nature, they are closely related to polynomial level sets:

Definition 4 (Polynomial Level Set): Given a vector of coefficients a ∈ R h , an offset b ∈ R, and an exponent matrix E ∈ N_0^{p×h} , a polynomial level set LS ⊆ R p is

LS = { α ∈ R p | Σ_{k=1}^{h} a (k) Π_{i=1}^{p} α i ^ E (i,k) ≤ b }.

We use the shorthand LS = ⟨a, b, E⟩ LS .
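The definitions above can be made concrete with a small numerical sketch. The following Python snippet (function names are our own, not taken from the paper's implementation) evaluates a point of a zonotope from Def. 2 and of a sparse polynomial zonotope from Def. 3 for given factor values:

```python
import numpy as np

def zonotope_point(c, G, alpha):
    """Point of a zonotope <c, G>_Z for a factor vector alpha in [-1, 1]^p (Def. 2)."""
    return np.asarray(c, float) + np.asarray(G, float) @ np.asarray(alpha, float)

def poly_zonotope_point(c, G, GI, E, alpha, beta):
    """Point of a sparse polynomial zonotope <c, G, GI, E>_PZ (Def. 3):
    c + sum_k (prod_i alpha_i^E[i,k]) G[:, k] + sum_j beta_j GI[:, j]."""
    alpha = np.asarray(alpha, float)
    # one monomial per dependent generator: prod_i alpha_i^E[i,k]
    monomials = np.prod(alpha[:, None] ** np.asarray(E), axis=0)
    return (np.asarray(c, float) + np.asarray(G, float) @ monomials
            + np.asarray(GI, float) @ np.asarray(beta, float))
```

For instance, with E = [[1, 0], [0, 2]] the two dependent generators carry the monomials α 1 and α 2 ², which is exactly the kind of nonlinear dependence on the factors that the safety shield later exploits.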

III. PROBLEM FORMULATION
We consider general nonlinear disturbed systems with input constraints defined by the ordinary differential equation

ẋ(t) = f(x(t), u(t), w(t)),    (1)

where f : R n × R m × R z → R n describes the system dynamics with state x(t) ∈ R n , control input u(t) ∈ R m , and process noise w(t) ∈ R z . The process noise is bounded by a compact set w(t) ∈ W ⊂ R z , and the system has to satisfy the input constraints defined by the convex set u(t) ∈ U ⊆ R m . The set W can for example be determined from measurements of the real physical system using conformance checking [34]. Given a nonlinear system defined as in (1), the goal is to solve a reach-avoid problem, where the system state should be steered from the current state x 0 = x(0) to a goal set G ⊆ R n while avoiding collisions with potentially time-varying unsafe sets F i ⊂ R n , i = 1, . . ., o, where o denotes the number of unsafe sets. In case the measurements of the system state are subject to a measurement error v(t) ∈ V, the goal becomes to steer all states in the set x 0 ⊕ V to the goal set. We aim to solve reach-avoid problems with reinforcement learning, where we train an agent to return the control inputs u a (t) for a given state x(t) steering the system to the goal set while avoiding obstacles. However, we have no guarantee that the behavior learned by the agent is safe. Therefore, we add a safety shield that is based on reachability analysis to obtain formal guarantees:

Definition 5 (Reachable Set): Let ξ(t, x 0 , u(·), w(·)) denote the solution of (1) at time t for an initial state x 0 = x(0), control input trajectory u(·), and process noise trajectory w(·). The reachable set at time t is

R(t) = { ξ(t, x 0 , u(·), w(·)) | x 0 ∈ X 0 , ∀t* ∈ [0, t]: u(t*) ∈ U, w(t*) ∈ W },

where X 0 ⊂ R n is the initial set and W ⊂ R z is the set of process noise.
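The reachable set in Def. 5 can be under-approximated numerically by simulating sampled trajectories; unlike the set-based analysis used in this work, this provides no guarantees, but it is a useful sanity check. A minimal sketch with Euler integration (the solver choice and all names are our own):

```python
import numpy as np

def simulate(f, x0, u_fun, w_fun, t_end, dt=1e-3):
    """Euler approximation of the solution xi(t, x0, u(.), w(.)) of (1)."""
    x, t = np.asarray(x0, float), 0.0
    while t < t_end:
        x = x + dt * f(x, u_fun(t), w_fun(t))
        t += dt
    return x

def sample_reachable_points(f, X0_samples, U_samples, t_end, n=50):
    """Monte-Carlo under-approximation of R(t_end) from Def. 5 with constant
    inputs and zero noise -- a sanity check, never a guarantee."""
    rng = np.random.default_rng(0)
    pts = []
    for _ in range(n):
        x0 = X0_samples[rng.integers(len(X0_samples))]
        u = U_samples[rng.integers(len(U_samples))]
        pts.append(simulate(f, x0, lambda t, u=u: u, lambda t: 0.0, t_end))
    return np.array(pts)
```

Since only finitely many trajectories are sampled, every returned point lies inside R(t_end), but the sample never covers it; the conservative set-based computation of the paper is the exact opposite, enclosing R(t_end) from the outside.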
For our safety shield, we consider that U, W, and V are represented as zonotopes, and G and F i are represented as polytopes in halfspace representation.Moreover, we use polynomial zonotopes to represent reachable sets.In case other agents are present in the environment, we can apply set-based methods [35] to safely predict their future behavior and obtain the corresponding time-varying unsafe sets.

IV. SAFETY SHIELD
As visualized in Fig. 1, the high-level idea behind our safety shield is to compute the reachable set for a time horizon t f and the set of all control inputs satisfying the input constraints ∀t ∈ [0, t f ]: u(t) ∈ U, rather than for a single control input trajectory u(·). The intersection of this reachable set with the unsafe sets then yields constraints that define the safe control inputs, which we can use to formulate the projection of the control input u a provided by the reinforcement learning policy to the closest safe control input as an optimization problem. For simplicity, we first consider input trajectories that are constant over time and discuss more advanced control strategies later in Sec. V-A. For constant control inputs, we can compute the reachable set using the extended system dynamics

[ẋ(t); u̇(t)] = [f(x(t), u(t), w(t)); 0]    (2)

together with the initial set X 0 = x 0 × U, where we omit the set of measurement errors V for simplicity. For reachability analysis, we use the conservative polynomialization algorithm [30], which encloses the nonlinear dynamics in (1) by a differential inclusion ẋ ∈ p(x(t), u(t), w(t)) ⊕ E consisting of a polynomial approximation p(x(t), u(t), w(t)) and the abstraction error E. This reachability algorithm explicitly preserves dependencies between the initial states and the reachable states [31]. Since with the extended system dynamics in (2) the control inputs become part of the system state, we can directly determine from the reachable set which control inputs steer the system to unsafe regions. Let us demonstrate this dependency preservation with an example:

Example 1: As a running example we consider a nonlinear system with time horizon t f = 1 s together with an unsafe set F and a set of control inputs U. With the conservative polynomialization algorithm we obtain the final reachable set R(t f ), which is visualized in Fig. 2.
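The construction of the extended system dynamics in (2) can be sketched in a few lines: the constant control input is appended to the state with derivative zero, so that reachability analysis treats it like an initial-state variable. A hypothetical helper:

```python
import numpy as np

def extended_dynamics(f, n, m):
    """Extended system in the spirit of (2): the constant control input is
    appended to the state, so x_ext = [x; u] and
    d/dt [x; u] = [f(x, u, w); 0]."""
    def f_ext(x_ext, w):
        x, u = x_ext[:n], x_ext[n:n + m]
        return np.concatenate([f(x, u, w), np.zeros(m)])  # u' = 0
    return f_ext
```

Because the input now lives inside the state vector, any reachability algorithm that preserves dependencies on the initial set automatically preserves the dependency of the reachable states on the chosen input.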
Since the reachable set R(t f ) and the input set U are parameterized by the same factors α 1 and α 2 , we have a direct analytical relation between the control inputs and the corresponding reachable states. We now exploit this analytical relation to determine the set of safe control inputs. As demonstrated in the example above, the control input u(t) ∈ U = ⟨c u , G u ⟩ Z is unambiguously defined by the factors α via the relation u = c u + G u α through the definition of a zonotope in Def. 2. Instead of determining the set of safe control inputs directly, we therefore determine the safe set for α, since this simplifies the computations, as will become apparent later. The independent generators of the polynomial zonotope R(t f ) represent uncertainties that result from abstraction errors during reachability analysis as well as from the process noise. Consequently, a control input is safe only if the reachable set does not intersect the unsafe sets for any possible value of the independent factors β j . We formalize this in the following theorem, which extends our previous results for unsafe sets given as halfspaces [31, Sec. 4.1] to the more general case of polytopes:

Theorem 1: Given an unsafe set F = ⟨A, b⟩ P ⊂ R n consisting of s halfspace constraints and the reachable set R(t) = ⟨c, G, G I , E⟩ PZ ⊂ R n of the system in (2) computed with the conservative polynomialization algorithm [30] for the initial set X 0 = x 0 × ⟨c u , G u ⟩ Z , the following constraints on the zonotope factors α that parameterize the control input ensure that there exists no trajectory that enters the unsafe set:

∃ l ∈ {1, . . ., s}:  Σ_{k=1}^{h} A (l,·) G (·,k) Π_{i=1}^{p} α i ^ E (i,k)  ≥  b (l) − A (l,·) c + Σ_{j=1}^{q} | A (l,·) G I(·,j) |.
Proof: A single point x ∈ R n is located outside the unsafe set F if it is fully located outside of at least one halfspace:

∃ l ∈ {1, . . ., s}: A (l,·) x ≥ b (l) .    (3)

Moreover, due to the dependency preservation of the reachability analysis, it holds according to [31, Thm. 1] that the disturbed trajectory ξ(t, x 0 , c u + G u α, w(·)) for a specific control input u = c u + G u α is contained inside the reachable subset obtained by restricting the factors α k ∈ [−1, 1] in the definition of polynomial zonotopes in Def. 3 to the corresponding concrete value of α = [α 1 . . . α p ] T . Finally, combining this with (3) under the consideration that the constraints have to hold for all values of the independent factors β j yields the statement of the theorem after bringing the constant offset and the independent generators to the other side of the inequality.

Remark 1: A geometric interpretation of Thm. 1 is that we first bloat the obstacle F by the uncertainty given by the independent generators through pushing each polytope halfspace outward. Next, we obtain the constraints by intersecting the bloated polytope with the part of the polynomial zonotope spanned by the dependent generators, where the intersection with each halfspace of the bloated polytope F corresponds to a polynomial level set constraint for the factors α.
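The constraint extraction of Thm. 1 reduces to a few matrix-vector products per halfspace. The sketch below (our own naming, not the paper's code) computes the monomial coefficients and the right-hand side of the level-set constraint, with the independent generators moved to the right-hand side via their absolute values:

```python
import numpy as np

def halfspace_levelset(a_l, b_l, c, G, GI, E):
    """Level-set constraint on the factors alpha for one halfspace
    a_l^T x <= b_l of the unsafe set: the reachable states stay outside
    this halfspace for all independent factors beta iff
        sum_k (a_l^T G[:,k]) prod_i alpha_i^E[i,k]
            >= b_l - a_l^T c + sum_j |a_l^T GI[:,j]|.
    Returns the monomial coefficients and the right-hand side."""
    coeffs = np.asarray(a_l, float) @ np.asarray(G, float)
    rhs = b_l - np.asarray(a_l, float) @ np.asarray(c, float) \
          + np.abs(np.asarray(a_l, float) @ np.asarray(GI, float)).sum()
    return coeffs, rhs

def satisfies(coeffs, rhs, E, alpha):
    """Check the constraint for a concrete factor vector alpha."""
    monomials = np.prod(np.asarray(alpha, float)[:, None] ** np.asarray(E), axis=0)
    return coeffs @ monomials >= rhs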
Thm. 1 defines a feasible region α ∈ [−1, 1] ∩ ⋃ l LS l for the factors α that parameterize the control input such that the intersection between a reachable set at a specific point in time and a single unsafe set is empty. However, to guarantee safety we have to consider the reachable set for the whole time horizon t ∈ [0, t f ], which consists of a sequence of reachable sets R(τ 0 ), R(τ 1 ), . . ., R(τ f ) for consecutive time intervals τ 0 , τ 1 , . . ., τ f . Moreover, we might also have more than one unsafe set. So overall we obtain one feasible region α ∈ [−1, 1] ∩ ⋃ l LS l for each pair of reachable sets and unsafe sets resulting in an intersection. The feasible region for α that guarantees safety for all time intervals and all unsafe sets is given by the intersection of the feasible regions for the single pairs:

α ∈ [−1, 1] ∩ ⋂_{r=1}^{z} ⋃_{l=1}^{s r} LS rl ,

where the level sets LS rl are obtained from Thm. 1 and z is the number of intersecting pairs. To efficiently check if a reachable set represented by a polynomial zonotope intersects an obstacle represented by a polytope, the polynomial zonotope refinement algorithm [36] can be used. This algorithm recursively splits the polynomial zonotope along the longest generator until the intersection with the polytope can either be proven or disproven using zonotope enclosures of the split polynomial zonotopes. Overall, given a vector of factors α a ∈ R p that corresponds to the control input u a = c u + G u α a ∈ U = ⟨c u , G u ⟩ Z provided by the reinforcement learning policy, we can formulate the projection to the closest safe control input as an optimization problem:

min α ∥α − α a ∥ 2  subject to  α ∈ [−1, 1] ∩ ⋂_{r=1}^{z} ⋃_{l=1}^{s r} LS rl .    (4)

This is a disjunctive programming problem, which can be formulated as a mixed-integer quadratic program with polynomial constraints using the convex hull relaxation [37], with one set of constraints for each r = 1, . . ., z and l = 1, . . ., s r . Here, the disjunction is realized using binary variables λ rl ∈ {0, 1}, which modify the corresponding polynomial constraints to be either active (λ rl = 1) or inactive (λ rl = 0). Let us demonstrate the optimization for our running example:

Example 2: As shown in Fig. 2, for the nonlinear system in Example 1 only the final reachable set R(t f ) intersects the unsafe set F. We consequently obtain the feasible region for α by applying Thm. 1 to the sets R(t f ) and F, which yields α ∈ LS 1 ∨ LS 2 . The feasible regions for α 1 and α 2 are shown in Fig. 3.

In the presence of measurement errors v(t) ∈ V we can apply the same overall approach but have to change the initial set to X 0 = (x 0 ⊕ V) × U, where the set V has to be represented by independent generators to ensure that safety is guaranteed for all possible values of the measurement errors. While we focused on the conservative polynomialization algorithm [30] for simplicity, our safety shield is also compatible with other reachability approaches as long as they preserve dependencies between initial states and reachable states. This is for example the case for algorithms that compute reachable sets using the Picard-Lindelöf iteration together with Taylor models [38].
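For low-dimensional factor spaces, the structure of the projection problem (4), a conjunction over intersecting pairs of a disjunction over halfspace constraints, can be illustrated with a brute-force grid search. The paper solves (4) exactly with mixed-integer programming; the sketch below (all names hypothetical) is only an illustration for p = 2:

```python
import numpy as np

def project_action(alpha_a, constraint_groups, n_grid=41):
    """Brute-force illustration of (4): find the alpha in [-1, 1]^2 closest
    to alpha_a such that, for every intersecting pair r, at least one of
    its constraints (the disjunction) is satisfied. Each group is a list of
    callables alpha -> bool, one per halfspace."""
    grid = np.linspace(-1.0, 1.0, n_grid)
    best, best_dist = None, np.inf
    for a1 in grid:
        for a2 in grid:
            alpha = np.array([a1, a2])
            # conjunction over pairs r, disjunction over halfspaces l
            if all(any(g(alpha) for g in group) for group in constraint_groups):
                d = np.linalg.norm(alpha - alpha_a)
                if d < best_dist:
                    best, best_dist = alpha, d
    return best
```

For instance, with the single disjunctive constraint α 1 ≥ 0.5 ∨ α 1 ≤ −0.5 and the desired factors α a = (0.2, 0), the closest safe factor vector is approximately (0.5, 0), i.e., the action is pushed just past the boundary of the unsafe region.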
The safety shield can be used during reinforcement learning or for a learned agent.For every decision step, the action suggested by the agent is corrected to the closest safe action by (4) only if it violates safety constraints.If the safety shield is used during learning, it can be beneficial to adapt the reward to inform the agent about corrections of actions [10].

V. EXTENSIONS
We now discuss several extensions for our safety shield.

A. Different Types of Control Laws
For the basic safety shield presented in Sec. IV, we considered for simplicity that the control input is kept constant for the whole planning horizon. Since this is very restrictive and would in practice often prevent us from finding a feasible solution, we now discuss how to realize more advanced control strategies. Note that the reinforcement learning agent has to match the control law used for the safety shield.
a) Piecewise Constant Control Law: One simple but very effective extension of constant control inputs is piecewise constant control inputs. Instead of determining a single control input from the input set U, we determine control inputs for all piecewise constant segments from the set U × · · · × U. We can still use the extended system in (2), but have to reset the initial set for reachability analysis to R(t i ) × U after each of the i = 1, . . ., N piecewise constant time segments [t i−1 , t i ] with t i = i · t f /N, where R(t i ) is the final reachable set from the previous segment.
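A piecewise constant control law as described above can be sketched as follows, with one factor vector per segment defining the input u = c u + G u α on that segment (helper names are hypothetical):

```python
import numpy as np

def segment_times(t_f, N):
    """Boundaries t_i = i * t_f / N of the N piecewise constant segments."""
    return [i * t_f / N for i in range(N + 1)]

def piecewise_constant_input(c_u, G_u, alphas, t_f):
    """Piecewise constant control law: one factor vector per segment, each
    defining u = c_u + G_u @ alpha as in Def. 2. Returns the function u(t)."""
    N = len(alphas)
    def u(t):
        i = min(int(t / (t_f / N)), N - 1)  # index of the active segment
        return c_u + G_u @ alphas[i]
    return u
```

In the safety shield, the stacked factor vector over all N segments is what the projection problem (4) then optimizes over.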
b) Polynomial Control Law: We can also apply control inputs that are polynomial in time, e.g., u(t) = c (1) + c (2) t + c (3) t 2 for the quadratic case, by computing the reachable set for an extended system whose state contains the coefficient vector c, together with the initial set x 0 × C × 0. In the optimization problem (4) we then determine the values for the parameter vector c, where we add the constraint c (1) + c (2) t + c (3) t 2 ∈ U to ensure that the input constraints are satisfied. The initial set C ⊂ R 3 for the coefficient vector c can be determined by estimating the feasible values for c such that the constraint c (1) + c (2) t + c (3) t 2 ∈ U is satisfied for the whole time horizon.

c) Feedback Control: We can also apply a feedback control law u(t) = u ref (t) + K(x(t) − x ref (t)) with a fixed feedback matrix K ∈ R m×n , where either piecewise constant or polynomial control inputs can be used for the reference control input u ref (t) corresponding to the reference trajectory x ref (t). For the safety shield, we then compute the reachable set for an extended system that additionally contains the reference trajectory, using the initial set x 0 × U × x 0 . In the optimization problem (4) we then determine the optimal parameters for the reference control inputs u ref (t), where we add a constraint ensuring that the resulting control input satisfies the input constraints.

A comparison of the different control laws presented in this section is shown in Fig. 4 for the system in Example 1. The results demonstrate that even for a piecewise constant control law with only N = 2 segments we already obtain a larger reachable set than with a quadratic control law, which increases our chances of finding a safe control input. While piecewise constant control laws therefore seem preferable, their rapidly changing values often negatively impact comfort or durability for many systems, which can be avoided with polynomial control laws.
For all control strategies we apply the following control scheme: We plan for a time horizon of t f , but execute the resulting control law for only a shorter time period t c < t f before planning a new trajectory.This increases the chances to avoid getting stuck in dead ends and ensures that we can react quickly to dynamic changes in the environment.

B. Spatial Dimensions of Mobile Robots
So far we considered the case where the safety constraints are specified directly on the system state. For collision avoidance, however, this setup is usually not sufficient since we additionally have to consider the shape and spatial dimensions of the mobile robots, e.g., cars, vessels, or drones, which we want to control safely. While for many other approaches this poses a major problem, incorporating the spatial dimensions of the robot into our safety shield is quite straightforward since we simply have to replace the reachable set with the occupancy set. Given the reachable set R(t), which typically describes all possible positions of a reference point on the robot as well as all possible robot orientations, the occupancy set is defined as

O(t) = { o(x, d) | x ∈ R(t), d ∈ D },    (5)

where the function o : R n × R δ → R γ describes how the space occupied by the robot is computed from the system state and the set D ⊂ R δ specifying the spatial dimensions of the robot.
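A possible occupancy function o(x, d) for a rectangular robot, rotating a body-frame offset d into the world frame, can be sketched as follows (the rectangle assumption and all names are ours):

```python
import numpy as np

def occupancy_rectangle(x, d):
    """One possible occupancy function o(x, d) for a rectangular mobile
    robot: x[0], x[1] are the position of the center, x[2] the orientation,
    and d ranges over the body-frame set D = [-L/2, L/2] x [-W/2, W/2].
    Returns the occupied point in the world frame."""
    x = np.asarray(x, float)
    c, s = np.cos(x[2]), np.sin(x[2])
    R = np.array([[c, -s], [s, c]])       # rotation from body to world frame
    return x[:2] + R @ np.asarray(d, float)

def occupancy_corners(x, length, width):
    """Corners of the occupied rectangle for a concrete state x."""
    half = np.array([length / 2.0, width / 2.0])
    return [occupancy_rectangle(x, half * np.array(sgn))
            for sgn in [(1, 1), (1, -1), (-1, -1), (-1, 1)]]
```

Evaluating this map on a whole reachable set R(t) instead of a single state is what (5) expresses; the set-based version must additionally handle the trigonometric terms conservatively.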
Example 3: Let us consider a car where the states x (1) and x (2) describe the x- and y-position of its center, and state x (3) describes the orientation of the car. Then the function o(x, d) that defines the space occupied by the car is given as

o(x, d) = [x (1) ; x (2) ] + [cos(x (3) ), −sin(x (3) ); sin(x (3) ), cos(x (3) )] d,    (6)

where the shape of the car is for simplicity enclosed by a rectangle, so that d ∈ D ⊂ R 2 ranges over the car's length and width.

We use the methods from [40] to evaluate (5) and then convert the resulting set back to a polynomial zonotope. The resulting safety constraints that we obtain from the intersections between the occupancy set O(t) and the obstacles have to hold for all values d ∈ D. To ensure this, we could represent the set D with independent generators before computing O(t), similarly to what we did for the set of measurement errors in Sec. IV. However, since the set D is in general much larger than the set of measurement errors V, this would often yield very conservative results. A better approach is to project out all factors that correspond to the set D using Fourier-Motzkin elimination [41, Chapter 4.4]. Let us demonstrate this with an example:

Example 4: We consider a constraint from which we want to eliminate the factor α 3 . The first step of Fourier-Motzkin elimination is to solve all constraints for α 3 . Next, we have to form all combinations of the resulting constraints that have a non-empty solution, which yields an equivalent formulation of the original constraint without α 3 . Since Fourier-Motzkin elimination requires that the constraints are solvable for the variable that is eliminated, all terms that violate this condition have to be removed first by applying a zonotope enclosure [32, Prop. 5].
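The Fourier-Motzkin elimination step can be sketched for linear constraints A x ≤ b: every constraint with a positive coefficient on the eliminated variable is paired with every constraint with a negative one, so that the variable cancels. This is a sketch for the linear case, not the paper's polynomial-constraint variant:

```python
import numpy as np

def fourier_motzkin(A, b, j):
    """Eliminate variable j from the system A x <= b by Fourier-Motzkin
    elimination: constraints not involving x_j are kept, and each pair of
    an upper bound (positive coefficient) and a lower bound (negative
    coefficient) on x_j is combined so that x_j cancels."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    pos = [i for i in range(len(b)) if A[i, j] > 0]
    neg = [i for i in range(len(b)) if A[i, j] < 0]
    zero = [i for i in range(len(b)) if A[i, j] == 0]
    rows, rhs = [A[i] for i in zero], [b[i] for i in zero]
    for p in pos:
        for q in neg:
            rows.append(A[p] / A[p, j] - A[q] / A[q, j])  # x_j coefficient becomes 0
            rhs.append(b[p] / A[p, j] - b[q] / A[q, j])
    if not rows:
        return np.zeros((0, A.shape[1] - 1)), np.array([])
    return np.delete(np.array(rows), j, axis=1), np.array(rhs)
```

For instance, eliminating x 0 from {x 0 ≤ 1, −x 0 + x 1 ≤ 0} yields the single constraint x 1 ≤ 1, which is exactly the projection of the feasible set onto x 1.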

C. Mixed-Integer Linear Program Formulation
For some systems, solving the nonlinear mixed-integer optimization problem (4) might be computationally too expensive, especially when we have to evaluate the safety shield in real time for online application. Therefore, we now discuss how to obtain a feasible and close-to-optimal solution using mixed-integer linear programming, which is significantly faster. To achieve this, we enclose the polynomial zonotopes that represent the reachable set with zonotopes using [32, Prop. 5]. Since zonotopes are linear in the factors α, the feasible region for α calculated using Thm. 1 is then given as a union of polytopes ⋃ l ⟨A l , b l ⟩ P instead of a union of polynomial level sets. Consequently, if we additionally minimize the L 1 -norm instead of the L 2 -norm, we can simplify the optimization problem (4) to

min α ∥α − α a ∥ 1  subject to  α ∈ [−1, 1] ∩ ⋂_{r=1}^{z} ⋃_{l=1}^{s r} ⟨A rl , b rl ⟩ P ,    (9)

which can be formulated as a mixed-integer linear program using Balas' theorem [42], with one set of linear constraints for each r = 1, . . ., z and l = 1, . . ., s r . The structure of this optimization problem is very similar to (4), except that we introduce new variables ᾱ rl = α rl λ rl to avoid the bilinear terms and obtain a linear program. Due to the over-approximation of all nonlinear terms of the polynomial zonotope by the zonotope enclosure, every feasible solution of (9) is a feasible solution to the original problem (4), but some values that are feasible for (4) will not be feasible for (9). Note that if the system dynamics (1) is linear, we directly obtain a mixed-integer linear program in the form of (9). Moreover, we can always first check if the desired value α a satisfies the original nonlinear constraints and only perform the simplification to a mixed-integer linear program if it does not. A mixed-integer quadratic program can be obtained in a similar way as the mixed-integer linear program by enclosing only the generators that belong to higher-order polynomials by a zonotope. Finally, since mixed-integer programming can be highly parallelized, the computation time for optimization can always be reduced by using a more powerful machine with more cores.
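The zonotope enclosure used for this simplification can be sketched in the spirit of [32, Prop. 5]: a monomial whose exponents are all even ranges over [0, 1] rather than [−1, 1], so its generator is split into a center shift and a half-length generator. This is a simplified variant; the cited proposition may include further refinements:

```python
import numpy as np

def zonotope_enclosure(c, G, GI, E):
    """Enclose a polynomial zonotope <c, G, GI, E>_PZ with a zonotope:
    monomials with all-even exponents lie in [0, 1], so their generators
    contribute a center shift of one half plus a half-length generator;
    all other monomials lie in [-1, 1] and keep their generator."""
    c_new = np.array(c, float)
    G, GI, E = np.asarray(G, float), np.asarray(GI, float), np.asarray(E)
    gens = [GI[:, j] for j in range(GI.shape[1])]
    for k in range(G.shape[1]):
        if np.all(E[:, k] % 2 == 0) and np.any(E[:, k] > 0):
            c_new = c_new + 0.5 * G[:, k]   # monomial range [0, 1]
            gens.append(0.5 * G[:, k])
        else:
            gens.append(G[:, k])            # monomial range [-1, 1]
    if not gens:
        return c_new, np.zeros((len(c_new), 0))
    return c_new, np.column_stack(gens)
```

For example, a generator attached to the monomial α 2 ² is replaced by a center shift of half its length plus a generator of half its length, which is the tightest interval treatment of a square term.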

D. Constraint Grouping
Since the time step size for reachability analysis is usually relatively small, it often happens that many reachable sets for consecutive time intervals intersect the same obstacle, resulting in many very similar constraints. We can reduce the computation time by grouping similar constraints together, as we demonstrate with the following example:

Example 5: Two similar constraints can be grouped into a single constraint by introducing new variables ε 1 and ε 2 that capture the difference between their coefficients. To eliminate these variables, we represent their domains as a summation of the center with a zero-centered uncertainty, where a lower bound for the optimal value of the resulting minimization problem can be computed using interval arithmetic [43].
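Constraint grouping can be sketched for linear safety constraints of the form a r T α ≥ b r: the coefficient rows are replaced by their center plus a zero-centered uncertainty, and the uncertain term is bounded over α ∈ [−1, 1] p by interval arithmetic, yielding a single conservative constraint. The linear setting and all names are our simplification:

```python
import numpy as np

def group_constraints(rows, rhs):
    """Conservatively group similar safety constraints a_r^T alpha >= b_r
    into the single constraint
        a_c^T alpha - sum_i delta_i >= max_r b_r,
    where a_c is the center of the coefficient rows and delta bounds the
    zero-centered coefficient uncertainty over alpha in [-1, 1]^p via
    interval arithmetic. Any alpha satisfying the grouped constraint
    satisfies every original constraint."""
    rows = np.asarray(rows, float)
    a_c = rows.mean(axis=0)
    delta = np.abs(rows - a_c).max(axis=0)  # per-coefficient uncertainty
    return a_c, delta.sum(), max(rhs)

def grouped_satisfied(a_c, slack, b, alpha):
    return a_c @ np.asarray(alpha, float) - slack >= b
```

The grouped constraint is conservative in one direction only: it may reject some safe α (small slack lost to the interval bound), but never accepts an α that violates one of the originals.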
In addition to the number of constraints, constraint grouping also decreases the number of integer variables for the optimization in (4), which reduces the computation time. Since integer variables are required only if the safe region for the agent is non-convex, another strategy to accelerate the optimization is to replace non-convex safe regions by the largest convex subset [44].

E. Reachable Set Pre-Computation
In order to reduce the computation time of our safety shield, we can pre-compute the reachable set starting from an initial set X 0 offline and then apply the reachable subset approach [31] to efficiently extract the reachable set for the current state x 0 ∈ X 0 during online execution. Since for nonlinear systems the accuracy of the reachable set enclosure depends on the size of the initial set, we cannot make X 0 too large but instead have to divide the relevant state space into sets of suitable size. The number of required sets for such a division grows exponentially with the system dimension, so that this approach is not suited for high-dimensional systems. However, for many systems the differential equation ẋ(t) = f(x(t), u(t), w(t)) describing the system dynamics is invariant with respect to transformations of certain states [45, Sec. 4.1]. For example, the dynamics of a car are invariant with respect to translations of the car's position and with respect to rotations of the car's orientation. In this case only the state space for the states that are not invariant has to be divided, since we can always apply a suitable state space transformation to set the invariant states to 0.
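The invariance-based normalization described above can be sketched for a car-like system: translate the position to the origin and rotate the orientation to zero before looking up the pre-computed reachable set, then map the result back. The state ordering and all names are assumptions:

```python
import numpy as np

def normalize_car_state(x):
    """Exploit translation and rotation invariance for a car-like system
    whose first three states are x-position, y-position, and orientation:
    set them to zero and remember the transformation needed to map
    pre-computed reachable sets back to the original coordinates."""
    x = np.asarray(x, float)
    offset, psi = x[:2].copy(), x[2]
    x_norm = x.copy()
    x_norm[0], x_norm[1], x_norm[2] = 0.0, 0.0, 0.0
    return x_norm, (offset, psi)

def map_back(point, transform):
    """Map a 2D position from the pre-computed (normalized) reachable set
    back to the original coordinates by rotating and translating."""
    offset, psi = transform
    c, s = np.cos(psi), np.sin(psi)
    R = np.array([[c, -s], [s, c]])
    return offset + R @ np.asarray(point, float)
```

Only the remaining, non-invariant states (e.g., velocity) then need to be covered by the offline division of the state space.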

VI. EXPERIMENTAL EVALUATION
We now demonstrate the performance of our safety shield on several benchmark systems, where each benchmark highlights different properties of our approach. If not stated otherwise, all computations are carried out in Python on a 2.9 GHz quad-core i7 processor with 32 GB of memory. We use the CORA toolbox [46] to pre-compute reachable sets, proximal policy optimization [47] for reinforcement learning, Gurobi to solve the mixed-integer linear and quadratic programs, and CasADi together with the BONMIN solver to solve the mixed-integer nonlinear programs. Benchmark parameters as well as the applied extensions from Sec. V are listed in Tab. I. We published our implementation on CodeOcean and created a video showing our results.

A. F1tenth Racecar
To demonstrate that our safety shield is fast and robust enough to be applied to a real system, we conduct experiments on an F1tenth racecar [48], whose dynamics are described by a kinematic single-track model. Moreover, the car contains a low-level PI controller with gains k_P = 8 and k_I = 1 that takes the desired velocity as input and realizes the required acceleration. Overall, this results in the model (10), where the system state consists of the x- and y-position of the center s_x, s_y, the velocity v, the orientation ψ, and the integrated error of the PI controller e_I. The control inputs are the desired velocity u_1 and the steering angle u_2, which are bounded by the set U = [0, 0.5] m s⁻¹ × [−0.3, 0.3] rad. To ensure that the model (10) encloses all possible behaviors of the real system, we performed conformance checking using the AROC toolbox [49] to determine the process noise as well as the measurement error from data traces recorded from the real car. To incorporate the size of the car, we use the output function in (6) with length 0.51 m and width 0.31 m.
For control we use a piecewise constant control law with N = 2 segments and a planning horizon of t_f = 2 s, and we replan as soon as the previous computation is finished. Moreover, we simplify the optimization problem for action projection to a mixed-integer quadratic program, which on average took 0.14 s to solve during our experiments. The car uses a 1.9 GHz six-core ARMv8 processor with 7.6 GB of memory and is equipped with a LiDAR sensor. To obtain the unsafe sets F_i, we enclose all points measured by the LiDAR by a union of polytopes. Moreover, while the velocity and the integrated error can be directly obtained from the car's internal sensors, we use a particle filter [50] to determine the position and orientation of the car in the environment from the LiDAR measurements. For our experiments, we then applied reinforcement learning to train an agent on four environment maps that were similar to, but slightly different from, the map we used for the experiments on the real F1tenth car. In addition to the system state, we used the LiDAR measurements and the position of the goal set as observations for the agent, and we did not use the safety shield during training.
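One way to enclose a cluster of LiDAR points by a polytope, as done for the unsafe sets F_i above, is via a convex hull whose halfspaces are pushed outward by a small margin. The sketch below handles a single cluster; how the paper groups points into clusters is not specified here, so the clustering step and the margin value are assumptions:

```python
import numpy as np
from scipy.spatial import ConvexHull

def enclose_points(points, margin=0.05):
    """Enclose a cluster of 2D LiDAR points by a polytope A @ x <= b,
    with every halfspace pushed outward by `margin` (sketch; the
    grouping of points into obstacle clusters is omitted)."""
    hull = ConvexHull(points)
    A = hull.equations[:, :2]           # outward unit normals of the hull
    b = -hull.equations[:, 2] + margin  # offsets, bloated by the margin
    return A, b

pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]])
A, b = enclose_points(pts)
# every measured point satisfies A @ p <= b
assert np.all(A @ pts.T <= b[:, None] + 1e-9)
```

The union of such polytopes over all clusters then forms the unsafe set for the current LiDAR scan.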
As shown in Fig. 5, without the safety shield the trained agent is unsafe, since the car crashes into the obstacle. With our safety shield, however, the car avoids the obstacle and safely reaches the goal set. This not only demonstrates that our safety shield successfully works on a real system, but also that the modifications to the control inputs suggested by the reinforcement learning policy are small enough for the agent to still fulfill its objective.

B. Autonomous Driving
In order to show that our safety shield can handle very complex reach-avoid problems that include dynamic obstacles, we consider the motion planning benchmarks for autonomous cars provided by the CommonRoad database [51]. As system dynamics we use the kinematic single-track model from [20, Sec. VII] with the same input set U and set of process noise W as in [20, Sec. VII]. This model is very similar to the model in (10), with the only difference that the acceleration instead of the desired velocity is used as a control input. The car we consider is a BMW 320i, which has a length of 4.51 m and a width of 1.61 m. To guarantee safety even though the intentions of the other cars are unclear, we use the tool SPOT [52] to compute all possible occupancies of the other traffic participants that comply with traffic rules using set-based prediction.

Fig. 5: Trajectories driven by the F1tenth racecar with and without the safety shield, where the green area is the goal set and the orange area is the obstacle.
To counteract the large process noise in this benchmark, we use a feedback controller u(t) = u_ref + K(x(t) − x_ref(t)) for the safety shield, where the reference input u_ref is piecewise constant with N = 2 segments. The feedback matrix K ∈ R^(m×n) is determined by applying an LQR control approach with state weighting matrix Q = I_4 and input weighting matrix R = I_2 to the linearized system. Moreover, we use a planning horizon of t_f = 0.8 s and replan after t_c = 0.4 s. We apply reinforcement learning to train an agent that aims to safely control the car, where we do not use the safety shield during training. The observations for the agent are selected from [53, Tab. II]. In particular, we use the state of the ego vehicle, the distances of the ego vehicle to the road and lane boundaries as well as to the goal set, and the states of the surrounding vehicles.
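The LQR gain described above can be obtained from the continuous-time algebraic Riccati equation. The sketch below uses placeholder linearization matrices (a double integrator), not the paper's linearized single-track model, and follows the paper's sign convention u = u_ref + K(x − x_ref):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """LQR feedback gain K for u = u_ref + K (x - x_ref), computed from
    the continuous-time algebraic Riccati equation."""
    P = solve_continuous_are(A, B, Q, R)
    return -np.linalg.solve(R, B.T @ P)   # K = -R^{-1} B^T P

# illustrative linearization (placeholder double-integrator matrices)
A = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]], float)
B = np.array([[0, 0], [0, 0], [1, 0], [0, 1]], float)
K = lqr_gain(A, B, np.eye(4), np.eye(2))
# the closed-loop matrix A + B K is Hurwitz
assert np.all(np.linalg.eigvals(A + B @ K).real < 0)
```

With the negative-sign convention built into `lqr_gain`, the gain can be plugged directly into the feedback law used by the safety shield.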
The effect of the safety shield is highlighted by the results for 2000 traffic scenarios shown in Tab. II: While the original agent collides with other traffic participants in 10 scenarios, our safety shield successfully prevents all collisions. Moreover, applying the safety shield does not lead to a reduced goal-reaching percentage, but instead even increases the number of scenarios for which the goal set is reached. Tab. II also demonstrates the effect of constraint grouping (see Sec. V-D), which reduces the average computation time for solving the optimization problem, but slightly decreases the goal-reaching percentage due to the increased conservatism. In Fig. 6, the results for one specific traffic scenario are visualized. There, the agent without the safety shield changes lanes too early and collides with the adjacent truck, whereas the agent with the safety shield changes lanes just in time and finally reaches the goal set.

C. Quadrotor 2D
Next, we compare our safety shield with a safe reinforcement learning approach that modifies the optimization criterion. In particular, we incorporate the safety specification as a violation penalty in the reward function. For this, we consider a benchmark problem from the safe-control-gym [54] featuring a trajectory tracking task for a two-dimensional quadrotor. As shown in Fig. 7, the trajectory that should be tracked is partially located inside an unsafe region, so that there exists a conflict between tracking performance and safety constraint satisfaction. The dynamics of the quadrotor are given in [54, Eq. (3)], where m = 0.027 kg is the mass, g = 9.81 m s⁻² is the gravitational acceleration, a = 0.0397 m is the distance from each motor pair to the center of mass of the quadrotor, and I_yy = 1.4 · 10⁻⁵ kg m² is the moment of inertia.
The system state consists of the x- and z-positions s_x, s_z as well as the pitch angle ψ of the quadrotor, together with the corresponding velocities. To decouple forward thrust and tilting torque, the input set for the control inputs u_1 and u_2, which represent the thrusts generated by the two rotors, is restricted accordingly, and the process noise w_1, w_2, w_3 is bounded by the set W. For the safety shield, we use a constant control input with a planning horizon of t_f = 0.5 s, where we replan after t_c = 0.02 s. To perform action projection, we solve the original nonlinear optimization problem, which takes 0.004 s on average during our experiments. The main reason for the fast computation time is that the safe region for the quadrotor is convex, which results in an optimization problem without any integer variables. We train three different agents: a baseline agent that should track the trajectory and gets no information about the constraints, a constraint-penalty agent whose reward is extended with a penalty for constraint violations, and a safe agent that is trained with the safety shield. As shown in Fig. 8, the safe and baseline agents converge after 400 000 training steps, while the agent with the constraint penalty needs 2 million training steps to converge. Moreover, only the safe agent never violates any constraints during training and could therefore also be used for training directly on the real physical system. The results for deploying the different trained agents are shown in Fig. 7. As expected, the baseline agent without the safety shield violates the safety constraints, since they were not considered during training. The constraint-penalty agent also violates the constraints, which demonstrates that it is not sufficient to simply incorporate the safety constraints into the training process. Only the two agents that apply our safety shield stay inside the safe region at all times, where the agent that uses the safety shield during training achieves a smoother trajectory compared to the baseline agent.
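Because the safe region is convex here, action projection reduces to a small continuous program without integer variables. The sketch below projects an agent's action onto a convex region described by linear constraints and input bounds; the constraint data and bounds are illustrative placeholders, not constraints produced by the paper's reachability analysis:

```python
import numpy as np
from scipy.optimize import minimize

def project_action(u_a, A_safe, b_safe, u_lo, u_hi):
    """Project the agent's action u_a onto the convex safe region
    {u : A_safe @ u <= b_safe} intersected with the input bounds
    (sketch with placeholder constraints)."""
    res = minimize(
        lambda u: np.sum((u - u_a) ** 2),        # squared distance to u_a
        x0=np.clip(u_a, u_lo, u_hi),
        constraints=[{"type": "ineq",
                      "fun": lambda u: b_safe - A_safe @ u}],
        bounds=list(zip(u_lo, u_hi)),
        method="SLSQP",
    )
    return res.x

u = project_action(np.array([0.8, 0.2]),
                   np.array([[1.0, 0.0]]), np.array([0.5]),
                   np.array([0.0, -0.3]), np.array([1.0, 0.3]))
```

Since the problem is a convex quadratic program, a local solver already returns the global projection, which is why the convex case is so fast in practice.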

D. Quadrotor 3D
To compare our safety shield with reachability-based trajectory safeguard [25], we consider the three-dimensional quadrotor benchmark from [25, Sec. V.B]. Reachability-based trajectory safeguard [25] applies the safety shield to a simplified trajectory-generating model, and the resulting trajectory is then tracked by a low-level controller that uses the original nonlinear system model. For the quadrotor, the trajectory-generating model is identical for each of the three spatial directions, where x_i is the quadrotor position, v_i and a_i are respectively the velocity and the acceleration at the beginning of the trajectory, and t is time. The input u_i to the system is the peak velocity reached at time t = 1.5 s, which is bounded by the set U. To apply our safety shield, we tightly inner-approximate the set U with a zonotope using the method described in [55, Sec. IV]. A similar trajectory-generating model is used to decelerate the quadrotor from the peak velocity back to velocity 0, so that the overall planning horizon is t_f = 3 s. We consider the same control task as in [25, Sec. V.B], which is to safely navigate the quadrotor through a 100 m long tunnel with randomly generated box obstacles. For our experiments, we deployed the same trained reinforcement learning agent as used in [25] on 100 tunnels with different obstacles and compared the conservatism of the two safety shields in terms of the required control input correction ||u − u_a||_2 at each intervention of the safety shield. While both safety shields had to intervene in 5078 out of 5760 time steps, the average control input modification of 1.13 m s⁻¹ for our approach is smaller than the average modification of 1.22 m s⁻¹ for the safety shield from [25], which increases the chances that the agent can successfully complete its task.
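To give intuition for the peak-velocity parameterization, the sketch below uses a piecewise-linear velocity profile that reaches the commanded peak u_i at t = 1.5 s and decelerates back to 0 at t_f = 3 s. This is an illustrative stand-in, not the exact polynomial trajectory-generating model from [25]:

```python
import numpy as np

def velocity_profile(v0, u_peak, t, t_peak=1.5, t_f=3.0):
    """Illustrative trajectory-generating profile (piecewise-linear
    stand-in, not the exact parameterization from [25]): ramp from the
    initial velocity v0 to the peak velocity u_peak at t_peak, then
    decelerate back to 0 at the end of the planning horizon t_f."""
    t = np.asarray(t, float)
    up = v0 + (u_peak - v0) * t / t_peak         # acceleration phase
    down = u_peak * (t_f - t) / (t_f - t_peak)   # braking phase
    return np.where(t <= t_peak, up, down)

t = np.linspace(0.0, 3.0, 301)
v = velocity_profile(0.5, 2.0, t)           # u_peak = 2.0 is the action
x = np.cumsum(v) * (t[1] - t[0])            # position by integration
```

The shield then operates on the low-dimensional parameter u_peak per axis, while the low-level controller tracks the resulting trajectory with the full nonlinear model.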

VII. DISCUSSION
Finally, let us discuss some properties of our safety shield.

A. Safety Guarantees for Infinite Time
Our basic safety shield approach can guarantee safety only for the finite time horizon t f .To obtain safety guarantees for an infinite time horizon, one can either combine our safety shield with a fail-safe planner [56] that takes over when the safety shield cannot determine a safe trajectory anymore, or one can modify the safety shield in such a way that the system always stops in a safe final state at the end of the planning horizon [25].

B. Computational Complexity
The two main steps required for our safety shield are computing the reachable set and solving the mixed-integer optimization problem (4) for action projection. The complexity of the conservative polynomialization algorithm for reachability analysis is O(n⁵) with respect to the system dimension n according to [57, Sec. 4.1.4]. However, for many benchmarks one can apply the pre-computation discussed in Sec. V-E to avoid computing reachable sets online. Solving a mixed-integer optimization problem is in general NP-hard [58]. But, as we demonstrated with the numerical experiments in Sec. VI, by applying the simplification to a mixed-integer linear program from Sec. V-C and/or the constraint grouping from Sec. V-D, we can solve this optimization efficiently.

C. Safe Computation Time Consideration
As demonstrated by the experiments in Sec. VI, even with all the speed-ups discussed in Sec. V, the calculations required for our safety shield still need a certain amount of computation time that, depending on the system, might be too long to simply be neglected. Therefore, in order to consider the required computation time in a formally correct manner, we can apply the following well-known procedure [59]: We allocate a certain computation time t_comp for the calculations and use reachability analysis to predict the reachable states for the allocated computation time. By using this set as the initial set for our safety shield, we can guarantee safety even though the required calculations are not instantaneous. If the computation does not finish in the allocated computation time, we either stick to the safe solution from the previous time step or apply a fail-safe maneuver.
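The procedure can be summarized in a few lines. In this sketch the set-valued prediction is a crude interval bound assuming a maximum speed, and the shield interface is hypothetical; in the paper the prediction comes from reachability analysis:

```python
import numpy as np

def predict_reach(x, t_comp, v_max=1.0):
    """Crude interval prediction of every state reachable while the
    shield is computing (assumption: speed bounded by v_max); the paper
    uses reachability analysis for this step."""
    return x - v_max * t_comp, x + v_max * t_comp

def shielded_step(x_meas, shield, t_comp, u_prev_safe):
    """Run the shield from the predicted reachable set instead of the
    measured state; fall back to the previous safe action on failure."""
    X0 = predict_reach(x_meas, t_comp)
    u = shield(X0)                 # hypothetical interface: None = timeout
    return u if u is not None else u_prev_safe

# toy shield that succeeds whenever the whole predicted set stays below 1.0
shield = lambda X0: 0.3 if X0[1].max() < 1.0 else None
```

Starting the shield from the predicted set rather than the measured state is what makes the guarantee hold despite the non-zero computation time.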

D. Conservatism of the Safety Shield
Due to over-approximation errors, our safety shield might not always find a feasible solution even if one exists. In particular, there are four sources of conservatism:
• Since the exact reachable set cannot be computed for general nonlinear systems, we compute a tight enclosure instead (e.g., we aim to minimize the Hausdorff distance between the enclosure and the exact set).
• Due to dependency preservation, the abstraction error for reachability analysis is computed on the reachable set for the whole input set rather than on the smaller reachable set for a specific control input, which results in additional conservatism.
• For bloating the obstacles by the set of uncertainties defined by the independent generators, we use an over-approximative Minkowski sum in Thm. 1 that simply pushes the obstacle halfspaces outward.
• Since we choose a certain type of control law in advance, we restrict the space of possible control inputs.
However, all of these over-approximation errors can be made arbitrarily small: The over-approximation for reachability analysis converges to zero if the time step size is reduced and/or the reachable set is split, which also eliminates the error introduced by dependency preservation. Moreover, the approximative Minkowski sum in Thm. 1 can be replaced by the exact one, and every control law can be approximated arbitrarily closely by a piecewise constant control law with a sufficiently large number of segments.
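The halfspace-pushing operation from the third bullet can be sketched directly: each face of the obstacle polytope is moved outward by the support value of the uncertainty zonotope in its normal direction. The variable names are illustrative; the paper's Thm. 1 defines the exact construction:

```python
import numpy as np

def bloat_polytope(A, b, G):
    """Bloat an obstacle {x : A @ x <= b} by a zonotope uncertainty with
    generator matrix G by pushing each halfspace outward by the support
    value of the zonotope along its normal (an over-approximation of the
    Minkowski sum, in the spirit of Thm. 1)."""
    return A, b + np.abs(A @ G).sum(axis=1)

# unit box bloated by a small axis-aligned zonotope uncertainty
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], float)
b = np.ones(4)
Ab, bb = bloat_polytope(A, b, 0.1 * np.eye(2))
```

Here every face of the unit box is moved outward by 0.1, which contains the exact Minkowski sum but may add volume near the corners.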

E. Parameter Tuning
Since the settings for reachability analysis can be tuned automatically [60], [61], the main design parameters for our safety shield, in addition to the type of control law discussed in Sec. V-A, are the planning horizon t_f and the replanning time t_c. A longer planning horizon t_f often yields better control performance due to the larger lookahead, but also increases the computation time. Especially in the presence of dynamic obstacles, a small replanning time t_c is desirable in order to quickly react to a changing environment. However, a small t_c requires the approach to run faster in order to remain real-time capable. Finally, the extensions discussed in Sec. V-C, V-D, and V-E all reduce the computation time at the cost of introducing more conservatism.

VIII. CONCLUSION
We presented a novel safety shield for nonlinear continuous systems with input constraints that can be added to reinforcement learning agents in order to prevent them from applying unsafe actions. Since our safety shield uses set-based computations in the form of reachability analysis to determine which actions are safe and which are unsafe, it can guarantee robust safety despite process noise and measurement errors. Moreover, because our approach applies highly parallelized mixed-integer programming to project the action from the agent to the closest safe action, it is possible to reduce the computation time by using a more powerful machine with more cores. Finally, we demonstrated with several numerical examples as well as experiments on a real system that our safety shield modifies the actions proposed by the reinforcement learning agent as little as necessary to ensure robust safety.
Fig. 1: Steps for action projection using parameterized reachability analysis, where the reachable set is depicted in gray and the unsafe regions are shown in red: 1) Computation of the reachable set for all actions starting from the current state x_0. 2) Extraction of action constraints from the intersections between the reachable set and the unsafe regions. 3) Projection of the action u_a output by the agent to the closest safe action.

Fig. 2: Reachable set for the system from Example 1, where the initial state x_0 is shown as a black dot, the final reachable set R(t_f) is depicted in blue with a black border, and the unsafe set F is shown in red.

Fig. 3: Domain (left) and objective function (right) for the optimization problem from Example 2. For the domain plot, the set of infeasible values is shown in red, the desired value α_a is visualized as a black dot, and the optimal solution to the optimization problem is depicted as a blue dot.

Fig. 4: Final reachable set for the system in Example 1 for a quadratic control law and piecewise constant control laws with different numbers of segments N.

b) Polynomial Control Law: Another possibility is to use control laws that are polynomial functions with respect to time. We consider the quadratic case for simplicity, since the extension to general polynomials is straightforward. For a quadratic control law u(t) = c^(1) + c^(2) t + c^(3) t², parameterized by the vector of coefficients c ∈ R³, we can use the extended system

Fig. 6: Results for the CommonRoad scenario DEU_LocationALower-33_16_T-1 visualized at times 0 s, 1.2 s, 2.8 s, 4.4 s, and 9.2 s (from top to bottom), where the agent without the safety shield is depicted in purple, the agent with the safety shield is depicted in blue, the dynamic obstacles are depicted in red, and the goal set is depicted in green.

Fig. 8: Episode rewards and constraint violations for the 2D quadrotor benchmark observed during training without safety shield, with safety shield, and with constraint penalty.

TABLE I:
States n, control inputs m, planning horizon t f , number of pre-computed reachable sets, and extensions applied for each benchmark.

TABLE II:
Results for the evaluation of our safety shield on 2000 CommonRoad traffic scenarios.