Guaranteeing Control Requirements via Reward Shaping in Reinforcement Learning

In addressing control problems such as regulation and tracking through reinforcement learning, it is often required to guarantee, prior to deployment, that the acquired policy meets essential performance and stability criteria, such as a desired settling time and steady-state error. Motivated by this necessity, we present a set of results and a systematic reward shaping procedure that (i) ensures the optimal policy generates trajectories that align with specified control requirements, and (ii) allows us to assess whether any given policy satisfies them. We validate our approach through comprehensive numerical experiments conducted in two representative environments from OpenAI Gym: the Inverted Pendulum swing-up problem and the Lunar Lander. Using both tabular and deep reinforcement learning methods, our experiments consistently confirm the efficacy of the proposed framework, highlighting its effectiveness in ensuring policy adherence to the prescribed control requirements.


I. INTRODUCTION
The paradigm of using reinforcement learning (RL) for control system design has gained substantial traction due to its ability to autonomously learn policies that effectively address complex control problems, relying solely on data and employing a reward maximization process. This approach finds diverse applications, spanning from attitude control [1] and wind farm management [2] to autonomous car-driving [3] and the regulation of plasma using high-fidelity simulators [4]. However, a significant challenge in this domain revolves around ensuring that the learned control policy demonstrates the desired closed-loop performance and steady-state error, posing a crucial open question in control system design.

This work was in part supported by the Research Project "SHARESPACE" funded by the European Union (EU HORIZON-CL4-2022-HUMAN-01-14, SHARESPACE, GA 101092889 - http://sharespace.eu) and by the Research Project PRIN 2022 "Machine-learning based control of complex multi-agent systems for search and rescue operations in natural disasters (MENTOR)" funded by the Italian Ministry of University and Research (2023-2025).

F. De Lellis is with the Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy (e-mail: francesco.delellis@unina.it).

G. Russo is with the Department of Computer and Electrical Engineering & Applied Mathematics, DIEM, University of Salerno, Salerno, Italy (e-mail: giovarusso@unisa.it).

M. Musolesi is with the Department of Computer Science, University College London, London, U.K., and the Department of Informatics - Science and Engineering, University of Bologna, Bologna, Italy (e-mail: m.musolesi@ucl.ac.uk).

M. di Bernardo is with the Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy, and with the Scuola Superiore Meridionale, School for Advanced Studies, Naples, Italy (e-mail: mario.dibernardo@unina.it).
It is often argued that accurate knowledge of the system dynamics is necessary to provide analytical guarantees of stability and performance, which are crucial for industrial applications [5], [6]. In this paper, instead, we introduce a set of analytical results and a constructive procedure for shaping the reward function of reinforcement-learning-based approaches (both tabular methods and function approximation methods relying on deep learning). The goal is to obtain a learned policy that, without the use of a mathematical model of the system dynamics, verifiably meets predetermined control requirements in terms of desired settling time and steady-state error.
In the literature, reward shaping, which consists of modifying the reward function in order to improve learning or control performance, has mostly been used to increase sample efficiency [7]-[9], rather than to provide guarantees on the learned policy. An early example was presented in [7], where an agent was trained to ride a bicycle by exploiting a reward shaping mechanism. More recently, reduced sample complexity was demonstrated in [9] for a modified Upper Confidence Bound algorithm using shaped rewards. In [8], it was shown that adding a function of the state to the reward keeps the optimal policy unchanged if and only if the function is potential-based. A method to select potential-based functions is presented and validated analytically in [10]; it requires knowledge of an appropriate Lyapunov function to guarantee convergence to a state under the optimal policy. While this result can be used to solve regulation problems, it does not ensure a specific settling time. Moreover, finding a Lyapunov function is often cumbersome for many real-world problems.
When guarantees are given on reinforcement learning control [6], they are typically provided in terms of reachability of certain subsets of the state space [11], [12], or in terms of safety during learning and/or for the learned policy. Namely, in [11], RL is used to select a control law among a set of candidates, using Lyapunov functions to ensure that the system enters a goal region with unitary probability, under certain conditions on the controllers. In [12], a partially known system model is used to improve a safe starting policy through RL, avoiding actions that bring the system out of the basin of attraction of a desired equilibrium. Neither approach [11], [12] provides guarantees on the time required to reach the desired regions. Safety for RL control has been extensively explored in the literature using various frameworks, such as constrained Markov decision processes [13], "shields" [14], control barrier functions [15], and combinations of Model Predictive Control and RL [16]. Although these techniques ensure avoidance of unsafe subsets of the state space, they generally do not provide guarantees on reaching a specific goal region or on control performance metrics, such as settling time.
The problem of synthesizing rewards for control tasks is also the subject of inverse optimal control (IOC) [17], which focuses on estimating the rewards associated with given observations of states and control inputs, assuming closed-loop stability and/or policy optimality. Initially aimed at determining control functions producing observed outputs [18], IOC has since been connected to reinforcement learning [19], applied in nonlinear, stochastic environments [20], and its framework has been used to investigate the cost design problem [21]. However, to the best of our knowledge, IOC has not been used specifically to design reward functions that, when optimized for, can guarantee specific control performance.
Given a regulation or tracking problem with predetermined stability and performance requirements on steady-state error and settling time, we advance the state of the art as follows.
1) We introduce a model-free sufficient condition on the discounted return associated with a trajectory to determine if it is acceptable (i.e., it satisfies the control requirements).
2) We give a sufficient condition to assess whether a learned policy leads to an acceptable closed-loop trajectory.
3) We provide a procedure to shape the reward function so that the above conditions can be applied to a system of interest and the optimal policy is acceptable.
4) We successfully validate the approach through two representative control problems from OpenAI Gym [22]: the stabilization of an inverted pendulum [23] and landing in the Lunar Lander environment [24]. For reproducibility, the code is available on GitHub [25].

The rest of the paper is organized as follows. In Section II, we formalize the problem of constructing a reward function for learning-based control. The main results of our approach are presented in Section III and validated via numerical simulations on two representative application examples in Section IV. Concluding remarks are given in Section V.

II. PROBLEM STATEMENT

A. Problem set-up
We consider a discrete-time dynamical system of the form

x_{k+1} = f(x_k, u_k),  (1)

where k ∈ N_{≥0} is the discrete time, x_k ∈ X is the state at time k, X is the state space, x_0 ∈ X is an initial condition, u_k ∈ U is the control input (or action) at time k, U is the set of feasible inputs, and f : X × U → X is the system dynamics. Furthermore, we let π : X → U be a control policy, and let X^∞ := X × X × ⋯, with the Cartesian product being applied an infinite number of times. We denote by ϕ^π(x_0) ∈ X^∞ the trajectory obtained by applying policy π to system (1) starting from x_0 as the initial state.
We are interested in finding a policy such that the trajectory generated by it (starting from a given x_0) reaches a desired goal region G ⊂ X before some desired settling time k_s ∈ N_{>0} and remains in this region for at least a desired permanence time k_p ∈ N_{>0} (see Definition II.3 below for the rigorous statements). For example, G could be an arbitrarily small neighborhood of a reference state, with a radius equal to the admitted steady-state error. In our main results, we assume G, k_s, k_p are given; nonetheless, in Section III (see Remark III.8), we will observe that k_p can be arbitrarily large, and in Proposition A.1 (in the Appendix), we give a criterion to assess the feasibility of the settling time constraint when limited knowledge about the system to control is available.

Fig. 1. (a): A state-space sequence ξ, a trajectory ϕ^π(x_0), and a goal region G (see Section II); while a state-space sequence is simply a sequence of points in the state space X, a trajectory is generated by applying a policy to the dynamics in (1). (b): Terms of the reward structure in Assumption III.1.

B. Acceptable state-space sequences
We will now introduce concepts that will be used in the formalization of the proposed approach. A state-space sequence ξ = (x_0, x_1, x_2, ...) ∈ X^∞ is any sequence of points in the state space, not necessarily generated by the dynamics (1). Moreover, for a state-space sequence ξ, we denote by k_exit(ξ) the first time instant at which the sequence leaves the goal region, i.e., k_exit(ξ) := min{k ∈ N_{>0} : x_{k−1} ∈ G, x_k ∉ G}, with k_exit(ξ) := +∞ if no such k exists.
Note that all trajectories are state-space sequences, but the converse is not true. As a matter of fact, given a state-space sequence ξ with initial point x_0 ∈ X, there is no guarantee that there exists a policy π such that ϕ^π(x_0) = ξ. A graphical representation of these concepts is reported in Figure 1a.
Next, we define the set of acceptable state-space sequences, trajectories, and policies, i.e., those that satisfy the performance and steady-state specifications.
Definition II.3 (Acceptable state-space sequences, trajectories, and policies). Given the desired goal region G, the desired settling time k_s, and the desired permanence time k_p, a state-space sequence ξ = (x_0, x_1, x_2, ...), or equivalently a trajectory ϕ^π(x_0) = (x_0, x_1, x_2, ...), is acceptable if
1) ∃ k ≤ k_s : x_k ∈ G (i.e., the state is in G not later than time k_s);
2) k_exit(ξ) > k_p (i.e., the state does not exit G before time k_p, included).
A policy π is acceptable from x_0 if ϕ^π(x_0) is acceptable.
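For concreteness, the following is a minimal Python sketch of the acceptability test in Definition II.3, applied to a finite prefix of a state-space sequence; the membership test `in_goal` and the prefix `xs` are hypothetical placeholders, and a finite prefix can, of course, only certify condition 2) up to its own length.

```python
# A minimal sketch of the acceptability test in Definition II.3.
# `in_goal(x)` is a hypothetical membership test for the goal region G;
# `xs` is a finite prefix (x_0, ..., x_K) of a state-space sequence.

def first_entry_time(xs, in_goal):
    """First k with x_k in G, or None if the prefix never enters G."""
    for k, x in enumerate(xs):
        if in_goal(x):
            return k
    return None

def exit_time(xs, in_goal):
    """k_exit: first k with x_{k-1} in G and x_k not in G (inf if none)."""
    for k in range(1, len(xs)):
        if in_goal(xs[k - 1]) and not in_goal(xs[k]):
            return k
    return float("inf")

def is_acceptable(xs, in_goal, k_s, k_p):
    """Conditions 1) and 2) of Definition II.3, checked on the prefix."""
    k_enter = first_entry_time(xs, in_goal)
    return (k_enter is not None and k_enter <= k_s
            and exit_time(xs, in_goal) > k_p)
```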
It can be immediately verified that at least one acceptable state-space sequence exists provided that G ≠ ∅. Indeed, the state-space sequence ξ = (x_0, x_1, ...) with x_k ∈ G for all k can be verified to be acceptable by checking the two conditions in Definition II.3.

C. Using reinforcement learning to find acceptable control policies
Following [26], [27], we employ a reinforcement learning solution to automatically identify an acceptable policy for a given initial condition x_0, and to do so without the need of knowing the dynamics f. Namely, let r : X × X × U → R be a reward function, so that r(x′, x, u) is the reward obtained by the agent when taking action u in state x and arriving at the new state x′ at the next time instant. Let also J^π : X^∞ → R be the (discounted) return function defined as

J^π(ξ) := Σ_{k=1}^{∞} γ^{k−1} r(x_k, x_{k−1}, u_{k−1}),  (2)

where ξ = (x_0, x_1, ...) ∈ X^∞ is a state-space sequence, u_k = π(x_k), and γ ∈ [0, 1) is a given discount factor.¹ To find an acceptable policy, we set the following optimization problem and solve it via reinforcement learning:

π⋆ ∈ arg max_π J^π(ϕ^π(x_0)).  (3)

Thus, the problem we aim to solve can be stated as follows.

¹According to this formulation, it is possible to evaluate J^π on a state-space sequence that is not a trajectory (which is needed for the theoretical results presented in Section III); in this case, even though the states are not generated following policy π, in general it is still necessary to specify π to obtain the values of the inputs u_k used for the computation of the reward r. When J^π is evaluated on a trajectory, e.g., J^{π_1}(ϕ^{π_2}), we will only consider the case in which π_1 = π_2.
Problem II.4. Shape the reward function r so that: (i) it is possible to determine that a trajectory ϕ^π(x_0) is acceptable by assessing the value of J^π(ϕ^π(x_0)); (ii) an acceptable policy from x_0 (provided it exists) can be found by solving (3).
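To fix ideas before moving to the main results, the sketch below approximates the discounted return in (2) by truncating the series at a horizon N; since the reward is bounded (cf. Assumption III.1 below), the neglected tail is of order γ^N/(1 − γ). The callables `reward` and `policy` are hypothetical stand-ins for r and π.

```python
# A sketch of the discounted return (2), truncated at horizon N.
# `reward(x_next, x, u)` and `policy(x)` stand in for r and pi.

def discounted_return(xs, policy, reward, gamma, N=None):
    """Approximate J^pi on a state-space sequence xs = (x_0, x_1, ...)."""
    n_steps = len(xs) - 1 if N is None else min(N, len(xs) - 1)
    return sum(
        gamma**k * reward(xs[k + 1], xs[k], policy(xs[k]))
        for k in range(n_steps)
    )
```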

III. MAIN RESULTS
In Section III-A, we relate acceptable state-space sequences to their return (solving point (i) in Problem II.4); in Section III-B, we show that the assumptions we make are compatible; in Section III-C, we embed the theory in a constructive procedure to shape rewards; in Section III-D, we give analogous results for trajectories; finally, in Section III-E, we show that acceptable policies can be found using reinforcement learning algorithms. The assumptions we make and how they are related are schematically summarized in Figure 2.

A. Assessing acceptable state-space sequences
We start by defining the structure of the shaped reward.
Assumption III.1 (Reward structure). The reward function can be written as

r(x_k, x_{k−1}, u_{k−1}) = r_b(x_k, x_{k−1}, u_{k−1}) + r_c(x_k, x_{k−1}),  (4)

where:
• r_b : X × X × U → R is a bounded base reward; that is, there exist U_out, U_in, L_out, L_in ∈ R such that

r_b(x_k, x_{k−1}, u_{k−1}) ≤ U_out, if x_k ∉ G,  (5a)
r_b(x_k, x_{k−1}, u_{k−1}) ≤ U_in, if x_k ∈ G,  (5b)
r_b(x_k, x_{k−1}, u_{k−1}) ≥ L_out, if x_k ∉ G,  (5c)
r_b(x_k, x_{k−1}, u_{k−1}) ≥ L_in, if x_k ∈ G;  (5d)

• r_c : X × X → R is a correction term given by

r_c(x_k, x_{k−1}) = r_{c,in}, if x_k ∈ G; r_{c,exit}, if x_{k−1} ∈ G and x_k ∉ G; 0, otherwise.  (6)

In practice, r_{c,in} will typically be a positive reward for being inside the goal region, while r_{c,exit} will normally be a negative reward for having left the goal region; please refer to Figure 1b for a diagrammatic representation.
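As an illustration, a shaped reward complying with the structure in Assumption III.1 can be assembled as in the sketch below; `r_b`, `in_goal`, and the two correction values are placeholders for the paper's symbols.

```python
# A sketch of the reward structure (4)-(6) in Assumption III.1.
# `r_b` is a bounded base reward, `in_goal` tests membership in G,
# and `rc_in`, `rc_exit` are the correction terms of (6).

def make_shaped_reward(r_b, in_goal, rc_in, rc_exit):
    def r(x_next, x, u):
        if in_goal(x_next):      # first case of (6): the state is in G
            r_c = rc_in
        elif in_goal(x):         # second case of (6): the state just left G
            r_c = rc_exit
        else:                    # third case of (6): outside G throughout
            r_c = 0.0
        return r_b(x_next, x, u) + r_c   # shaped reward, cf. (4)
    return r
```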
Remark III.2 (Generality of Assumption III.1). Assumption III.1 is not too restrictive. Indeed, if one wants to use a preexisting reward, it is only required that it be bounded (see (5)); it can then be shaped by adding the correction term r_c to it.
We also define the differences

Δ_in := (U_in + r_{c,in}) − U_out,  (7)
Δ_exit := (U_in + r_{c,in}) − (U_out + r_{c,exit}),  (8)

measuring by how much the largest reward attainable when entering (or remaining in) G exceeds the largest reward attainable outside G and upon exiting G, respectively. To assess properties of state-space sequences, trajectories, and policies from their associated return, we define the return threshold σ ∈ R and introduce the following definition.
Definition III.3 (High-return state-space sequences, trajectories, and policies). A state-space sequence ξ is high-return if J^π(ξ) > σ for any policy π. A trajectory ϕ^π(x_0) is high-return if J^π(ϕ^π(x_0)) > σ; a policy π is high-return from x_0 if ϕ^π(x_0) is high-return.

Of the quantities introduced so far, those that we assume to be given (i.e., fixed) are G, k_s, k_p, γ, U_in, U_out, L_in, L_out; conversely, the quantities to be designed are σ, r_{c,in}, r_{c,exit}. Next, we introduce an assumption on the correction terms in the reward.
Assumption III.4 (Return threshold and correction terms). Assume that the differences in (7) and (8) are non-negative, i.e., Δ_in ≥ 0 and Δ_exit ≥ 0, and, given the desired settling time k_s and the desired permanence time k_p, assume that

σ ≥ U_out / (1 − γ),  (9)
σ ≥ (1 − γ^{k_s}) U_out / (1 − γ) + γ^{k_s} (U_in + r_{c,in}) / (1 − γ),  (10)
σ ≥ (U_in + r_{c,in}) / (1 − γ) − γ^{k_p − 1} Δ_exit.  (11)

In the following Proposition, we state a key result that solves point (i) in Problem II.4.
Proposition III.5. Let Assumptions III.1 and III.4 hold. Then, high-return state-space sequences are acceptable.
Proof. We will show that, for any policy π, if a state-space sequence ξ is not acceptable, then it is not high-return (consequently, if ξ is high-return, then it is acceptable).
ξ can be not acceptable if and only if one of the following three scenarios occurs (cf. Definition II.3):
1) ξ is never in the goal region G;
2) ξ is in G for the first time at a time later than k_s;
3) ξ exits from G at time k_exit(ξ) ≤ k_p.
We now consider the three cases one by one and show that, for any π, if any of them occurs then ξ is not high-return, i.e., J^π(ξ) ≤ σ.
Case 1: The state-space sequence is never in the goal region, that is, x_k ∉ G for all k ∈ [0, ∞). Therefore, only the third case in (6) is fulfilled, for all k, and we obtain r_c(x_k, x_{k−1}) = 0 for all k. For any policy π, exploiting (2), (4), (5a), and (9), we obtain²

J^π(ξ) = Σ_{k=1}^{∞} γ^{k−1} r_b(x_k, x_{k−1}, u_{k−1}) ≤ U_out Σ_{k=1}^{∞} γ^{k−1} = U_out / (1 − γ) ≤ σ.  (12)

Note that, in (12) and in the rest of the proof, the dependency of J^π on the specific policy π is made irrelevant by using the bounds in (5).

Case 2: Defining k_enter := min{k : x_k ∈ G}, we have that k_enter > k_s. For the sake of simplicity and without loss of generality, assume that the state is always in the region G after k_enter (i.e., x_k ∈ G, ∀ k ≥ k_enter). For any policy π, from (2), (4), and (6), we obtain

J^π(ξ) = Σ_{k=1}^{k_enter − 1} γ^{k−1} r_b(x_k, x_{k−1}, u_{k−1}) + Σ_{k=k_enter}^{∞} γ^{k−1} (r_b(x_k, x_{k−1}, u_{k−1}) + r_{c,in}).  (13)

Exploiting (5), (7), and recalling that k_enter > k_s, from (13), we obtain

J^π(ξ) ≤ Σ_{k=1}^{k_s} γ^{k−1} U_out + Σ_{k=k_s+1}^{∞} γ^{k−1} (U_in + r_{c,in}).  (14)

Then, computing the sums in (14), we obtain

J^π(ξ) ≤ (1 − γ^{k_s}) U_out / (1 − γ) + γ^{k_s} (U_in + r_{c,in}) / (1 − γ).  (15)

Exploiting (10), it is immediate to see that J^π(ξ) ≤ σ.

Case 3: From the definition of k_exit (see Sec. II), we have x_{k_exit(ξ)−1} ∈ G and x_{k_exit(ξ)} ∉ G. From (7) and (8), the largest J^π(ξ) is obtained when the state-space sequence ξ is such that x_k ∈ G, ∀ k ∈ [1, k_exit(ξ)), ξ then exits the region G at k_exit(ξ) = k_p, and enters it again at time k_p + 1. Thus, without loss of generality, we assume this is the case. Then, we have

J^π(ξ) ≤ (U_in + r_{c,in}) / (1 − γ) + γ^{k_p − 1} ((U_out + r_{c,exit}) − (U_in + r_{c,in})) = (U_in + r_{c,in}) / (1 − γ) − γ^{k_p − 1} Δ_exit.  (16)

Exploiting (11), we immediately verify that J^π(ξ) ≤ σ. ∎

²Recall that, for |γ| < 1, the geometric series satisfies Σ_{k=0}^{∞} γ^k = 1 / (1 − γ).
Notably, Proposition III.5 does not guarantee the existence of any high-return state-space sequence. The existence of the latter is instead guaranteed by Corollary III.7 below, under the following additional assumption on the return threshold.

Assumption III.6 (Existence of high-return state-space sequences). The return threshold satisfies σ < (L_in + r_{c,in}) / (1 − γ).

Corollary III.7. Let Assumptions III.1, III.4, and III.6 hold. Then, there exists at least one high-return (and hence acceptable) state-space sequence; in particular, any state-space sequence ξ with x_k ∈ G for all k is high-return, since J^π(ξ) ≥ (L_in + r_{c,in}) / (1 − γ) > σ for any policy π.
Remark III.8 (Selection of k_p). (11) captures the only assumption that depends on k_p. Given this assumption, it is possible to observe that k_p can be set to any arbitrarily large value (by choosing r_{c,exit} accordingly), thus not limiting the variety of problems that can be addressed using the present theoretical framework.
To summarize, we demonstrated that it is possible to check whether a state-space sequence is acceptable by verifying that it is high-return. Conversely, there may exist acceptable state-space sequences that are not high-return, e.g., those that exit (and re-enter) G before k_s, or those that enter G before k_s but not early enough to collect sufficient rewards to be high-return. In some cases, though, it is possible to prove that acceptable state-space sequences are high-return, such as those that enter the goal region not later than a certain time instant (k_z) and never exit it, as formalized by the next Proposition.
Assumption III.9 (Dependent on the choice of k_z). Given some k_z ∈ N_{≥0} with k_z ≤ k_s, assume that

(1 − γ^{k_z}) L_out / (1 − γ) + γ^{k_z} (L_in + r_{c,in}) / (1 − γ) > σ.  (17)

Proposition III.10. Let k_z ∈ N_{≥0} be such that k_z ≤ k_s. If Assumptions III.1 and III.9 hold, then state-space sequences that are in G for the first time at time k_z or earlier and have k_exit = ∞ are high-return.
Proof. According to the hypothesis, let ξ = (x_0, x_1, ...) be a state-space sequence such that x_k ∉ G for k < k_z and x_k ∈ G for k ≥ k_z. For all policies π, from (2), (4), and (6), we obtain

J^π(ξ) ≥ Σ_{k=1}^{k_z} γ^{k−1} r_b(x_k, x_{k−1}, u_{k−1}) + Σ_{k=k_z+1}^{∞} γ^{k−1} (r_b(x_k, x_{k−1}, u_{k−1}) + r_{c,in}).

Exploiting (5c) and (5d) yields

J^π(ξ) ≥ (1 − γ^{k_z}) L_out / (1 − γ) + γ^{k_z} (L_in + r_{c,in}) / (1 − γ).  (18)

Given (17), it follows that J^π(ξ) > σ. Moreover, let ξ′ be a state-space sequence that is in G for the first time at some time k′_z < k_z, i.e., with x_k ∉ G for k < k′_z and x_k ∈ G for k ≥ k′_z. For all policies π, exploiting (7) and the fact that J^π(ξ) > σ, we have J^π(ξ′) ≥ J^π(ξ) > σ. ∎

From (17), we notice that the larger r_{c,in} is, the later state-space sequences are allowed to be in G for the first time while still being high-return. Moreover, it is again important to remark that while state-space sequences that enter G within k_z time steps always exist, the same is not necessarily true for trajectories: this depends on the dynamics of the system being controlled.
Remark III.11 (Tracking). In the results in Section III-A, it was never assumed that G is a fixed region in the state space. Indeed, it is possible to carry out the same analysis by considering a time-dependent goal region G_k; in this case, letting U_{out,k}, U_{in,k}, L_{out,k}, L_{in,k} denote the bounds in (5) computed with respect to G_k, for simplicity of computation the quantities

sup_k U_{out,k}, sup_k U_{in,k}, inf_k L_{out,k}, inf_k L_{in,k}

rather than U_out, U_in, L_out, L_in should be used in (5), respectively. This reformulation can be used to address tracking control problems.
Reviewing the findings derived so far, a shaped reward r needs to satisfy Assumptions III.1, III.4, and III.6 (to exploit Proposition III.5 and Corollary III.7) and, optionally, Assumption III.9 (with some chosen k_z, to exploit Proposition III.10); see also Figure 2. Next, we characterize the relation between these assumptions and show that they can hold simultaneously.

B. Compatibility of the assumptions
First, we give a Lemma to aid the selection of r_{c,in}.

Lemma III.12. Given some k_z ≤ k_s, if Assumption III.9 holds, then Assumption III.6 also holds.
Proof.See the Appendix.
We say that two or more Assumptions are compatible if they can hold simultaneously. To guarantee that high-return state-space sequences are acceptable (Proposition III.5) and that such state-space sequences exist (Corollary III.7), we need Assumptions III.1, III.4, and III.6, whose compatibility is ensured by the next Lemma.

Lemma III.13. Assumptions III.1, III.4, and III.6 are compatible.

Proof. See the Appendix.
To guarantee that a class of acceptable state-space sequences are high-return, we need Assumption III.9 (Proposition III.10), whose compatibility with the previous ones is ensured by the next Lemma.

Lemma III.14. Given some k_z ≤ k_s, Assumptions III.1, III.4, III.6, and III.9 are compatible.

Proof. See the Appendix.
Algorithm 1: Reward shaping.
Input: a goal region G, a desired settling time k_s, and a desired permanence time k_p; a bounded reward function r_b and a discount factor γ.
Output: a reward function r satisfying Assumptions III.1, III.4, and III.6.

C. A constructive procedure for reward shaping
In Algorithm 1, we propose a constructive procedure that can be applied to shape the reward functions used in Section III. To provide more flexibility, the procedure takes a preexisting reward r_b as input, bounded according to (5). If no r_b is available, it is possible to set r_b = 0. As Lemma III.13 ensures that the set I in the algorithm is not empty, the latter always terminates successfully. Once Algorithm 1 has been used to obtain a shaped reward r (thus fixing r_{c,in}, r_{c,exit}, and σ, which remain constant), it is possible to run a reinforcement learning algorithm to learn a suitable control policy, as explained below in Section III-E.
It is to be noted that, in some cases, the values of r_{c,in} and r_{c,exit} resulting from Algorithm 1 might be significantly larger in absolute value than those in r_b. This can lead to a relatively sparse reward function r, i.e., one where relatively large values (in absolute value) are present but infrequent in the state-action space. Notoriously, this lack of frequent feedback can make learning more difficult, especially when deep reinforcement learning algorithms are used; see, e.g., [28], [29] and references therein. To mitigate this issue, it is possible to select r_{c,in} and r_{c,exit} as small in absolute value as possible, while still complying with Assumptions III.1, III.4, and III.6. Reward shaping methods that do not make the reward sparse will be the subject of future work.
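The sketch below illustrates one possible instantiation of the procedure, under the conditions (7)-(11) as stated in Section III-A; the specific selection rules (the margin and the doubling loop) are illustrative assumptions, not the exact line-by-line content of Algorithm 1.

```python
# A sketch in the spirit of Algorithm 1: pick rc_in, sigma, rc_exit so
# that (7)-(11) hold, together with the existence condition of
# Corollary III.7. The margin and the doubling rule are assumptions.

def shape_reward(U_in, L_in, U_out, L_out, gamma, k_s, k_p, margin=1.0):
    geo = 1.0 / (1.0 - gamma)
    # Make entering G strictly rewarding (Delta_in >= 0 in (7)) ...
    rc_in = max(0.0, U_out - U_in) + margin
    while True:
        # ... and pick sigma satisfying (9) and (10).
        sigma = max(
            U_out * geo,                                                 # (9)
            (1 - gamma**k_s) * U_out * geo
            + gamma**k_s * (U_in + rc_in) * geo,                         # (10)
        ) + margin
        # Existence of high-return sequences requires sigma below the
        # best achievable return (cf. Assumption III.6).
        if (L_in + rc_in) * geo > sigma:
            break
        rc_in *= 2.0  # increase the in-goal bonus and try again
    # Finally, make rc_exit negative enough that (11) holds.
    delta_exit = max(((U_in + rc_in) * geo - sigma) / gamma**(k_p - 1), 0.0)
    rc_exit = (U_in + rc_in) - U_out - delta_exit - margin
    return rc_in, rc_exit, sigma
```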
Remark III.15 (Advanced reward shaping algorithm). For simplicity, in Algorithm 1, we did not include the requirement captured by (17) on r_{c,in} (used to ensure that a family of acceptable state-space sequences are high-return, according to Proposition III.10), as it depends on the time instant k_z, which would be a further parameter to select. This constraint can be incorporated in Algorithm 1 by first selecting k_z ≤ k_s (possibly exploiting knowledge of the system to control), enforcing the compatibility condition of Lemma III.14 at line 4 (Lemma III.14 ensures that I is not empty), and using the right-hand side of (17) as the lower bound of I at line 5.

D. Assessing acceptable trajectories
In Section III-A, we showed how the value of the return J^π(ξ) can be used to assess whether ξ is an acceptable state-space sequence. The same theory applies to trajectories (which are state-space sequences, by definition).
It is important to remark that, while the existence of high-return state-space sequences is ensured by Corollary III.7, it can be much more difficult to establish whether there actually exist policies that generate high-return trajectories. This depends on the dynamics of the system at hand and the performance required, and is tightly related to the problem of reachability [30], with the addition of requirements on the settling time and the permanence time.

E. Assessing acceptable policies in value-based reinforcement learning
First, we provide a simple result stating that an acceptable policy (see Definition II.3) can be found by achieving the optimum in (3), thus solving point (ii) in Problem II.4.
Lemma III.16. Let Assumptions III.1 and III.4 hold. If there exists a high-return policy π⋄ from x_0 ∈ X, then the optimal policy π⋆ solving problem (3) is acceptable from x_0.
Proposition III.5 allows us to detect acceptable state-space sequences by evaluating their return J^π. However, the return is not normally known in a reinforcement learning setting; it is instead approximated through a value function. In particular, let Q : X × U → R be the state-action value function associated to the greedy policy

π_g(x) := arg max_{u ∈ U} Q(x, u).  (20)

Q is normally updated iteratively with the Bellman operator so that it converges to the value of J^{π_g}, in the sense that

max_{u ∈ U} Q(x, u) = J^{π_g}(ϕ^{π_g}(x)), ∀ x ∈ X.  (21)

In the next Theorem, we conclude the analysis by showing how the acceptability of a policy can be evaluated by assessing the value of Q.
Theorem III.17. Consider a state x_k ∈ X at time k. Let Assumptions III.1 and III.4 hold, and assume that (21) holds and that max_{u ∈ U} Q(x_k, u) > σ. Then, the trajectory ϕ^{π_g}(x_k) obtained by applying the greedy policy (20) from x_k is acceptable.

Proof. Exploiting (21), max_{u ∈ U} Q(x_k, u) > σ implies that J^{π_g}(ϕ^{π_g}(x_k)) > σ. Thus, it is immediate to apply Proposition III.5 (using Assumptions III.1 and III.4) to obtain that ϕ^{π_g}(x_k) is acceptable. ∎
It is important to clarify that ϕ^{π_g}(x_k) being an acceptable trajectory means that, by following policy π_g, (i) the state will be in G before k_s time instants have passed (i.e., ∃ k′ ∈ [k, k + k_s] : x_{k′} ∈ G), and (ii) the state will not exit from G before k_p + 1 time instants have passed. In practice, rather than the exact convergence in (21), one can verify a stricter, checkable condition on Q, (22); indeed, (22) implies (21) through Lemma A.2 in the Appendix. Condition (22) is fulfilled after a finite number of iterations if the algorithm used to update the value of Q converges asymptotically to J^{π_g}, which has been proved formally for reinforcement learning algorithms like SARSA and Q-learning [31]. In the latter, the greedy policy and the function Q are guaranteed to converge to the optimal policy and to its discounted return J, respectively; hence, if high-return policies exist, Lemma III.16 guarantees that the learned policy is acceptable.
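In a tabular setting, the certificate of Theorem III.17 is a one-line check on the learned table; the sketch below assumes a Q array of shape (number of states, number of actions), which is a hypothetical data layout.

```python
import numpy as np

# A sketch of the acceptability certificate of Theorem III.17 for a
# tabular Q function stored as an array of shape (n_states, n_actions).

def greedy_policy(Q):
    """Greedy policy (20): pick the action maximizing Q at each state."""
    return lambda s: int(np.argmax(Q[s]))

def certifies_acceptability(Q, s, sigma):
    """Sufficient condition of Theorem III.17 at (discretized) state s:
    if Q has (approximately) converged in the sense of (21), then
    max_u Q(s, u) > sigma certifies that the greedy trajectory from s
    is acceptable."""
    return float(np.max(Q[s])) > sigma
```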

IV. NUMERICAL RESULTS
We validate the theory presented in Section III by means of two representative case studies (and corresponding reinforcement learning environments from OpenAI Gym [22]): the Inverted Pendulum [23] and the Lunar Lander [24]. The former is a classic nonlinear benchmark problem in control theory, whereas the latter is a more sophisticated control problem with multiple inputs and outputs. In particular, we first validate Theorem III.17 using Q-learning to learn a policy that stabilizes an inverted pendulum within a predefined time; then, we show that the theory also holds when using a deep reinforcement learning algorithm, such as Double DQN, to learn a policy able to land a spacecraft while fulfilling the desired time constraints.
In each scenario, the learning phase and deployment phase are repeated in S ∈ N >0 independent sessions, which are composed of E ∈ N >0 episodes.Each episode is a simulation lasting N ∈ N >0 time steps.Moreover, we always use the ϵ-greedy policy [31] during learning.
For reproducibility, the code is available on GitHub [25].
A. Inverted Pendulum

1) Description of the Inverted Pendulum environment: In this environment, the objective is to stabilize a rigid pendulum affected by gravity to the upward position in a certain time, by exploiting a torque applied at the joint. In particular, the pendulum is a rigid rod of length l = 1 m, mass m = 1 kg, and moment of inertia I = ml²/3; the gravitational acceleration is taken equal to g = 10 m/s². A graphical depiction of the scenario is given in Figure 3a.
a) State: The state at time k is x_k = [x_{k,1} x_{k,2}]^T, where x_{k,1} and x_{k,2} are the angular position and angular velocity of the pendulum, respectively.

b) Control inputs: The control input is the torque applied at the joint. In order to apply Q-learning, which is a tabular RL method, we discretize the state space and the set of feasible inputs.

c) Control problem: The goal region is G = {x ∈ X : ∥x − x_ref∥ ≤ θ}, ∥·∥ being the Euclidean norm, where x_ref = [0 0]^T corresponds to the upward position with zero velocity, and θ = 0.42 amounts to 5% of the maximum distance from the origin in the state space. We select the desired settling time as k_s = 500 time steps and the desired permanence time as k_p = 1000 time steps (cf. Sec. II).
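A simple way to obtain the discretization mentioned above is uniform binning of angle and angular velocity, as in the sketch below; the grid resolution is an illustrative assumption (the exact grid used in the experiments is not reported here).

```python
import numpy as np

# A sketch of the state discretization used for tabular Q-learning.
# The bin counts are illustrative assumptions; +/-8 rad/s is the
# angular-velocity range of the Gym pendulum.

N_POS, N_VEL = 65, 65
pos_edges = np.linspace(-np.pi, np.pi, N_POS + 1)[1:-1]  # interior edges
vel_edges = np.linspace(-8.0, 8.0, N_VEL + 1)[1:-1]

def discretize(x):
    """Map x = (angle, angular velocity) to a single cell index."""
    i = np.digitize(x[0], pos_edges)   # 0 .. N_POS - 1
    j = np.digitize(x[1], vel_edges)   # 0 .. N_VEL - 1
    return i * N_VEL + j
```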
d) Reward: To guarantee the required performance and steady-state specifications, the reward function is chosen as in (4), with r_b being the standard Gym reward, given by

r_b(x_k, x_{k−1}, u_{k−1}) = −(x_{k,1}² + 0.1 x_{k,2}² + 0.001 u_{k−1}²).  (23)

From (23), we compute the bounds U_out, L_out, U_in, L_in in (5). Then, given γ = 0.99 [cf. (2)], we select σ = 10000 [cf. (9)], and the correction terms in (6) accordingly.

e) Parameters: We take S = 5 sessions, E = 1000 episodes, and N = 1000 time steps. We set the learning rate to 0.8. For the ϵ-greedy policy, we select ϵ = 0.05.
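Putting the pieces together for this environment, a sketch of the shaped pendulum reward is given below. The base term (23) and the values γ = 0.99, σ = 10000, and θ = 0.42 are from the text, while the numerical values of r_{c,in} and r_{c,exit} are placeholders: in practice they must be chosen via Algorithm 1 so that Assumptions III.4 and III.6 hold.

```python
import numpy as np

# A sketch of the shaped reward for the pendulum: base term (23) plus
# the correction (6). RC_IN and RC_EXIT are placeholder values; choose
# them via Algorithm 1 so that Assumptions III.4 and III.6 hold.

THETA = 0.42                      # goal-region radius (5% of max distance)
X_REF = np.zeros(2)               # upward position, zero velocity
RC_IN, RC_EXIT = 110.0, -1.0e4    # placeholders, NOT the paper's values

def in_goal(x):
    return np.linalg.norm(x - X_REF) <= THETA

def shaped_reward(x_next, x, u):
    r_b = -(x_next[0] ** 2 + 0.1 * x_next[1] ** 2 + 0.001 * u ** 2)  # (23)
    if in_goal(x_next):
        return r_b + RC_IN
    if in_goal(x):
        return r_b + RC_EXIT
    return r_b
```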
2) Results of Q-learning in the Inverted Pendulum environment: After training is completed for all sessions, we test the capability of the learned policies to swing up and stabilize the pendulum within the desired settling time. The results are portrayed in Figure 4, showing the distance of the trajectories from x_ref, the position, and the velocity over time. We observe that the control problem is solved in all sessions, suggesting that the optimal policy (which would be an acceptable one, according to Lemma III.16) has been found. Interestingly, this might be difficult to detect by looking only at the returns obtained during training, plotted in Figure 5. Indeed, the discounted returns per episode appear to decrease as training progresses. This happens because, as the agent progressively learns to enter the goal region, and to do so earlier in later episodes, the chance of it incurring the penalty r_{c,exit} for exiting the goal region increases as the result of random explorative actions taken by the ϵ-greedy policy used during learning. Although this does not prevent learning from converging to the optimal policy, in practical implementations this can be avoided by letting the exploration rate ϵ decay in later episodes. However, tuning the decay rate is highly problem-dependent and no general rule can therefore be given here.
In our experiments, learning ended after the planned number of episodes. An alternative heuristic method to determine when to terminate the learning stage is to pause training at regular intervals and simulate using the greedy policy in (20). If the return obtained exceeds σ, the greedy policy is deemed acceptable (see Proposition III.5), ending learning; otherwise, training continues.
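A sketch of this stopping heuristic is given below, assuming an environment exposing the Gymnasium-style step API and returning the shaped reward; `env`, `Q`, and `discretize` are the hypothetical objects sketched earlier, and the action `u` indexes a discretized set of torques in this hypothetical wrapper.

```python
import numpy as np

# A sketch of the stopping heuristic: pause training periodically, roll
# out the greedy policy (20) once, and stop when the discounted return
# exceeds sigma (Proposition III.5). Assumes a Gymnasium-style step API
# and that the environment returns the shaped reward r.

def greedy_rollout_return(env, Q, discretize, gamma, n_steps=1000):
    x, _ = env.reset()
    ret, disc = 0.0, 1.0
    for _ in range(n_steps):
        u = int(np.argmax(Q[discretize(x)]))        # greedy action (20)
        x, r, terminated, truncated, _ = env.step(u)
        ret += disc * r
        disc *= gamma
        if terminated or truncated:
            break
    return ret

def greedy_policy_is_acceptable(env, Q, discretize, gamma, sigma):
    return greedy_rollout_return(env, Q, discretize, gamma) > sigma
```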

B. Lunar Lander
1) Description of the Lunar Lander environment: In a 2D space, a stylized spaceship must land at low speed on a specific area within a predetermined time, in the presence of gravity and in the absence of friction. The spacecraft has three thrusters to guide its descent and two supporting legs at the bottom, as depicted in Figure 3b.
a) State: The state at time k is x_k = [p_k^T v_k^T θ_k ω_k l^l_k l^r_k]^T, where p_k ∈ R² is the horizontal and vertical position of the center of mass (arbitrary units; a.u.), v_k ∈ R² is its horizontal and vertical velocity (a.u./s), θ_k ∈ [0, 2π) rad is the orientation of the lander (with 0 corresponding to the orientation of a correctly landed spacecraft), ω_k is the rate of change of the orientation (rad/s), and l^l_k ∈ {0, 1} (resp. l^r_k) is 1 if the left (resp. right) leg is touching the ground. The initial conditions are given by p_0 = [0 1.4]^T (consequently, l^l_0 = l^r_0 = 0), v_0 = [0 0]^T, θ_0 = 0, and ω_0 = 0. The landing area is the region [−0.

b) Control inputs: At each time step k, the lander can use at most one of its three thrusters. In particular, we let u^m_k ∈ {0, 1} be 1 if at time k the main engine on the bottom of the spacecraft is used at full power or 0 if it is off, and define u^l_k, u^r_k ∈ {0, 1} analogously for the left and right thrusters, respectively. Then, the control input at time k is the vector u_k = [u^m_k u^l_k u^r_k]^T, which has four possible values, depending on which thruster, if any, is used.
c) Control problem: The goal region G is the set of states where p_k is in the landing pad, v_k = 0, θ_k = ω_k = 0, and both legs are touching the ground (l^l_k = l^r_k = 1). Additionally, we select the desired settling time as k_s = 500 time steps and the desired permanence time as k_p = 1000 time steps (cf. Section II). We also remark that the simulation stops if the lander touches the ground beyond the landing pad, or if it lands on the pad with a speed that is too high. During training only, the simulation is also halted if the spacecraft lands correctly. Further details can be found in [24].
d) Reward: The reward function is in the form introduced in (4), with r_b : X × X × U → R generated according to the standard environment definition [24] (referred to as (25) in the following). In particular, r_b includes two terminal terms governed by β_1 and β_2, two mutually exclusive Boolean conditions: namely, β_1 is true if the spacecraft lands on the ground and stops, and β_2 becomes true if the lander touches any point of the map with a speed that is too high (i.e., it crashes), or goes beyond the operating area of the environment, i.e., [−1.5, 1.5] × [−1.5, 1.5]. Following Algorithm 1, from (25), we derive that U_out = 100, L_out = −100, U_in = 100, L_in = 100. Given γ = 0.99 (cf. (2)), we select σ = 12000 (cf. (9)), and the correction terms in (6) accordingly.

e) Parameters: We take S = 5 sessions, E = 1000 episodes, and N = 1000 time steps. For the ϵ-greedy policy, we select ϵ = 0.1. To better stabilize the values of Q and help prevent overestimation, we use a standard Double DQN algorithm (a variation of DQN [32]; see [33] for a detailed description), implemented in TensorFlow 2. For the neural networks, we used an input layer with 8 nodes, 2 hidden layers each composed of 128 nodes with rectified linear unit (ReLU) activation functions, and an output layer with 4 nodes (one per possible control input value) and linear activation functions. The networks were trained using the Adam optimizer [34], with a learning rate of 0.001. Samples collected during training are stored in a replay buffer, and at each training update a batch of 128 samples is used.
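For reference, a sketch of the Q-network just described, in TensorFlow 2 / Keras, together with a Double DQN target computation; the batch helper and its names are our own illustrative choices, not the repository's exact implementation [25].

```python
import numpy as np
import tensorflow as tf

# A sketch of the Q-network described above: input of size 8 (the lander
# state), two hidden layers of 128 ReLU units, one linear output per
# discrete action. The Double DQN target selects actions with the online
# network and evaluates them with the target network.

N_ACTIONS = 4  # do nothing, left thruster, main engine, right thruster

def build_q_network(lr=1e-3):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(8,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model

def double_dqn_targets(online, target, x, u, r, x_next, done, gamma=0.99):
    """Regression targets for a batch sampled from the replay buffer."""
    a_star = np.argmax(online.predict(x_next, verbose=0), axis=1)
    q_eval = target.predict(x_next, verbose=0)[np.arange(len(u)), a_star]
    y = online.predict(x, verbose=0)
    y[np.arange(len(u)), u] = r + gamma * (1.0 - done) * q_eval
    return y
```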
2) Results of Double DQN in the Lunar Lander environment: In this environment, in all sessions, the policies learned with Double DQN are able to solve the given control problem, fulfilling the given control requirements, as shown in Figure 6. In Figure 7, we also report the returns obtained by the learning algorithm. It is possible to observe that the (averaged) returns are above the threshold value σ. In this case, the large negative returns, which were visible in Figure 5 for the Inverted Pendulum environment, are not present. The reason is that, in the Lunar Lander simulation environment, during training (but not during validation), once the lander stops, the simulation is halted; therefore, during training the lander never exits the goal region once it has entered it.

V. CONCLUSIONS
One of the most significant issues holding back the use of reinforcement learning for control applications is the lack of guarantees concerning the performance of the learned policies. In this work, we have presented analytical results that show how a specific shaping of the reward function can ensure that a control problem, such as a regulation problem, is solved with arbitrary precision, within a given settling time. We have validated the proposed theoretical approach on two representative experimental scenarios: the stabilization of an inverted pendulum and the landing of a simplified spacecraft.
One drawback of the present methodology is that the shaped reward might be relatively sparse (as discussed in Section III-C), which could possibly hamper learning when using deep reinforcement learning algorithms. Future work will focus on integrating existing techniques [35] (and developing new ones) for reward shaping that are able to deal with the potential sparse reward problem, on extending the current results to the case of stochastic system dynamics, and on deriving conditions to ensure feasibility of a set of control requirements (G, k_s, k_p) for a given system, which is highly problem-dependent.

Fig. 2. Schematic representation of the main assumptions and results in Section III. Green blocks denote assumptions, blue blocks indicate analytical findings, yellow blocks denote algorithms, and purple blocks refer to the problems being studied. Dashed arrows denote optional steps in the control design. "SSS" means "state-space sequence"; the symbols in the figure are defined in Section III.

Fig. 3. Sketch representation of the environments used in the numerical validation in Section IV: (a) Inverted Pendulum and (b) Lunar Lander. Both the pendulum and the lander are depicted in their initial states.

Fig. 4. Average (blue line) plus/minus standard deviation (shaded area) of ∥x_k − x_ref∥ (top panel), angular position x_{k,1} (middle panel), and angular velocity x_{k,2} (bottom panel), obtained by the S policies trained with Q-learning in the pendulum environment. The green solid line (top panel) indicates the goal region G; the green dashed lines (middle and bottom panels) indicate neighborhoods of width 2θ centered in x_{k,1} = x_ref,1 = 0 (middle panel) and in x_{k,2} = x_ref,2 = 0 (bottom panel). The red line indicates the time instant when the (averaged) trajectory enters the goal region.

Fig. 5. Discounted returns per episode obtained during training with Q-learning in the Inverted Pendulum environment (see Section IV-A).

Fig. 6. Average (blue line) plus/minus standard deviation (shaded area) of the trajectory obtained by the S policies trained with Double DQN in the Lunar Lander environment. From top to bottom: position on the horizontal axis, position on the vertical axis, velocity on the horizontal axis, velocity on the vertical axis. The green lines define the goal region. The red line indicates when the (averaged) trajectory enters the goal region.

Fig. 7. Returns per episode obtained during training with Double DQN in the Lunar Lander environment; the (averaged) returns exceed the threshold σ (see Section IV-B).