Model-Based Policy Iterations for Nonlinear Systems via Controlled Hamiltonian Dynamics

The infinite-horizon optimal control problem for nonlinear systems is studied. In the context of model-based, iterative learning strategies we propose an alternative definition and construction of the temporal difference error arising in policy iteration strategies. In standard architectures, this error is computed via the evolution of the Hamiltonian function (or, possibly, of its integral) along the trajectories of the closed-loop system. Herein the temporal difference error is instead obtained via two subsequent steps: first, the dynamics of the underlying costate variable in the Hamiltonian system is steered by means of a (virtual) control input in such a way that the stable invariant manifold becomes externally attractive. Then, the distance-from-invariance of the manifold, induced by approximate solutions, yields a natural candidate measure for the policy evaluation step. The policy improvement phase is then performed by means of standard gradient descent methods, which allow us to correctly update the weights of the underlying functional approximator. The above architecture yields an iterative (episodic) learning scheme based on a scalar, constant reward at each iteration, the value of which is insensitive to the length of the episode, in the original spirit of reinforcement learning strategies for discrete-time systems. Finally, the theory is validated by means of a numerical simulation involving an automatic flight control problem.


I. INTRODUCTION
THE optimal control problem consists in designing a control law that steers the state of the system from an arbitrary initial condition to a final equilibrium configuration while minimizing, along the trajectories of the resulting system, a prescribed cost functional [1], [2], [3], [4], [5]. Two alternative strategies to solve optimal control problems have been studied in the past decades: Dynamic Programming [6], [7] and Pontryagin's Minimum Principle [8]. In the setting of infinite-horizon problems, the two approaches share in fact a common bottleneck, namely the explicit knowledge of the positive definite solution of the Hamilton-Jacobi-Bellman (HJB) partial differential equation (pde) (see, e.g., [1], [9]): in the former method, the solution to the pde directly yields the optimal feedback, while in the latter, it provides the correct initial condition for the costate variable of the underlying Hamiltonian dynamics.
Methods based on a direct solution (or approximation, see, for instance, [9], [10], [11], [12]) of the HJB pde have been interpreted hitherto as offline strategies, namely requiring the knowledge (or an estimate) of the solution prior to the evolution of the plant or the simulation model. Recent intensive research efforts have been devoted to the objective of recasting classic optimal control design techniques into online methods. In particular, iterative learning control (ILC) and reinforcement learning (RL) [13], [14] aim at learning the optimal control law online by conducting a sufficiently large number of experiments on the actual plant, if feasible, or by relying on a simulation model of the former. This is achieved, for instance, by borrowing ideas similar to those employed in the context of adaptive control, leading to the so-called adaptive dynamic programming [15], [16], [17]. The methods can be further categorized by distinguishing model-based techniques [18], [19] and model-free approaches [20].
Policy iteration (PI) approaches [13], developed both in the context of model-based and model-free methods, encompass several iterative techniques that aim at providing (or approximating) the solution to the underlying HJB pde by relying on a sequence of (stabilizing) control laws designed to converge in some sense to the optimal policy. In the case of linear-quadratic (LQ) optimal control problems, for instance, the well-known Kleinman's Lyapunov iterations [21] essentially describe a PI approach to the solution of the optimal control problem. In fact, PI strategies are typically based on two main steps, policy evaluation and policy improvement, which may be intertwined in a discrete-time [22] or synchronous [18] fashion. The aim of the former step consists in evaluating the value of the current control policy, whereas the latter stage should lead to an improvement of such a value by suitably modifying the control law for the subsequent iteration.
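For the LQ case, Kleinman's Lyapunov iterations mentioned above admit a very compact implementation. A minimal scalar sketch (illustrative data A = 1, B = 1, Q = 3, R = 1, not taken from the article; the corresponding ARE solution is P* = 3):

```python
# Minimal sketch of Kleinman's Lyapunov iterations for a scalar LQ problem.
# Assumed illustrative data: A = 1, B = 1, Q = 3, R = 1, so the ARE
#   A'P + PA + Q - P B R^{-1} B' P = 0  reads  2P + 3 - P^2 = 0,  P* = 3.
A, B, Q, R = 1.0, 1.0, 3.0, 1.0

P = 5.0                      # initial guess; K0 = P0 must be stabilizing (A - B*K0 < 0)
for _ in range(10):
    K = (B / R) * P          # policy improvement: K_k = R^{-1} B' P_k
    Acl = A - B * K          # closed-loop dynamics under the current policy
    # policy evaluation: solve the (scalar) Lyapunov equation
    #   Acl' P_{k+1} + P_{k+1} Acl = -(Q + K' R K)
    P = -(Q + K * R * K) / (2.0 * Acl)

print(P)                     # converges (quadratically) to the ARE solution P* = 3
```

Each pass evaluates the current policy exactly (via the Lyapunov equation) and then improves it, which is precisely the two-step structure discussed above.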
While the policy improvement phase is carried out by relying on fairly standard methods (such as, for instance, gradient-descent approaches), the design choice of the measure devoted to capturing the quality of the current control policy is crucial to envision effective approximation mechanisms. In most of the results in the literature, the above-mentioned measure is provided essentially by the Hamiltonian function evaluated along trajectories of the closed-loop system, typically referred to as temporal difference error in the framework of RL and related to the so-called "reinforcement" employed for learning. An intrinsically similar situation arises also in the integral reinforcement learning (IRL) framework, see, e.g., [23], where the role of the temporal difference error, namely the approximate Bellman error, is inherited by an integrated (over a moving window) version of the Hamiltonian function along trajectories of the system, with the aim of circumventing the need for the explicit knowledge of the drift vector field in the overall architecture. As a consequence, whenever the candidate value function is replaced by a linearly parameterized functional approximator, the computation of the temporal difference error yields a scalar (linear) equation in a certain number of unknowns. Since the latter relation does not contain sufficient information to characterize the correct values of the underlying coefficients, it is necessary to collect a certain number of samples of such a reinforcement in a sequential implementation (see, e.g., [22]), or to monitor, over a receding window of prescribed length, such an error signal, which must then remain persistently exciting, within a synchronous learning framework (see, e.g., [18], [23]). The conflicting requirements of persistence of excitation and asymptotic stability lead, on one hand, to the need for the injection of additional (probing) control inputs into the closed-loop system.
On the other hand, the sampling rate or the length of the moving window must be suitably selected to ensure that the collected data is sufficiently rich for the computation of the correct value of the approximating parameterization. Note that the above rewarding policy for the learning algorithm is a common feature at the core of the majority of the existing methods, regardless of their nature (i.e., model-based or model-free, synchronous or with discrete updates).
In the context of model-based, iterative learning strategies for continuous-time nonlinear systems, the main contribution of this article consists in suggesting an alternative definition and practical construction of the temporal difference error with respect to the one on which the majority of RL and ILC approaches are intrinsically based. The result is achieved in sequential steps that are interesting per se, as discussed in the following.
First, it is shown that, provided the solution to the HJB pde is known, the stable manifold of the Hamiltonian dynamics can be externally stabilized by suitably steering the costate dynamics via a virtual control input, while preserving the property of invariance of the latter manifold as well as the behavior of the Hamiltonian dynamics restricted therein. This preliminary result possesses interesting consequences for the implementation of open-loop optimal control laws, typically employed in practical applications. In fact, whenever only the initial condition of the plant is available, open-loop optimal control laws are computed by considering the forward propagation of the underlying Hamiltonian dynamics initialized on the stable invariant manifold. Since the latter invariant manifold is structurally externally unstable, the above strategy is known to be particularly fragile, even with respect to extremely small numerical errors in the computation of the correct initial condition, and almost impractical in applications. The implementation of such a stabilized costate dynamics permits the construction of numerically robust open-loop control laws, which approximate the optimal feedback with an arbitrary degree of accuracy.
Then, we propose an alternative definition of the temporal difference error with respect to the one at the basis of any reinforcement or iterative learning method, based on the following, intuitive, observation. Suppose that an approximate value function is employed in the construction of the aforementioned stabilizing control law. It can be easily shown that, on one hand, the property of asymptotic stability of the zero equilibrium of the controlled Hamiltonian dynamics is preserved also for (sufficiently close) estimates of the solution to the HJB pde, whereas, on the other hand, the (fragile) property of invariance is not retained. Therefore, a candidate measure to capture the accuracy of the approximation of the current estimate is naturally provided by the maximal distance from the induced (approximate) manifold along the trajectories of the closed-loop (stabilized) Hamiltonian dynamics. This measure provides an alternative to more "standard" temporal difference errors. To circumvent the daunting computational task of a direct solution and in the spirit of iterative and episodic learning strategies, the above information is exploited by sequentially updating the approximate value function toward the minimization of the considered temporal difference error.
The rest of this article is organized as follows. The problem statement and a few preliminaries are discussed in Section II. The aim of Section III is to suggest and design a (virtual) control architecture for the underlying Hamiltonian dynamics, which is then employed in Section IV to propose an alternative temporal difference error for IL methods. These results are then specialized to the case of LQ optimal control problems, before the theory related to nonlinear systems is validated by means of a physically motivated numerical simulation involving an automatic flight control problem in Section VI. Finally, Section VII concludes this article.
Notation: R_{≥0} (resp. R_{>0}) denotes the set of nonnegative (resp. positive) real numbers. Given f : R^n → R, the mappings ∇f and ∂f/∂x denote the column and the row vectors, respectively, of the corresponding partial derivatives, whereas ∂²f/∂x² defines the Hessian matrix of the second-order derivatives, provided they exist. For a vector-valued function g : R^n → R^m, ∂g/∂x denotes the Jacobian matrix of first-order partial derivatives. The notation C^κ, κ ≥ 0, defines the set of functions that admit continuous derivatives up to order κ. Given r ∈ R_{>0} and x̄ ∈ R^n, the notation B_r(x̄) defines the open set {x ∈ R^n : ‖x − x̄‖ < r}.

II. PRELIMINARIES AND PROBLEM DEFINITION
Consider continuous-time, time-invariant, nonlinear systems described by the equation

ẋ = f(x) + g(x)u          (1)

with x : R → R^n and u : R → R^m denoting the state and the control input, respectively. It is assumed that the vector field f : R^n → R^n and the mapping g : R^n → R^{n×m} are sufficiently smooth and that, in the absence of control action, the origin of R^n is an equilibrium point of (1), i.e., f(0) = 0. The trajectories of the system are evaluated and compared by means of the infinite-horizon cost functional defined by

J(x_0, u) = ∫_0^∞ ( q(x(t)) + (1/2) u(t)^T R u(t) ) dt          (2)

where q : R^n → R_{≥0}, sufficiently smooth, represents a running cost imposed on the state evolution and the second term, with R = R^T > 0, represents a penalty on the control effort. Let A := (∂f/∂x)(0) and B := g(0) describe the linearization around the origin of the dynamics (1), namely δẋ = A δx + B u, let Q := (∂²q/∂x²)(0) and consider the following, standard structural assumption. • Under the given conditions it is known that (locally) the value function V associated with the optimal control problem is at least twice differentiable and that there exists a continuous feedback u that minimizes the cost functional (2) subject to the dynamic constraint (1).
Let V : R^n → R_{≥0}, V ∈ C^κ, κ ≥ 2, denote a positive definite solution, with V(0) = 0, of the HJB pde

(∂V/∂x)(x) f(x) + q(x) − (1/2) (∂V/∂x)(x) g(x) R^{-1} g(x)^T ∇V(x) = 0          (4)

for all x ∈ R^n (or locally in a neighborhood of the origin). Then, V represents the value function associated with the problem and the optimal state feedback, which minimizes the cost functional (2) along the trajectories of the closed-loop system, is given by

u*(x) = −R^{-1} g(x)^T ∇V(x).          (5)

As such, the control design method based on (4) and (5) constitutes an offline technique, in which the explicit knowledge of the positive definite solution of (4) is a priori instrumental for the construction of the optimal feedback (5). To circumvent the computational burden of a direct solution of the HJB pde, several alternative strategies have been explored to recast the control design into an online strategy, regardless of any assumption on the full or partial knowledge of the underlying plant. In particular, the PI approach consists of two sequential steps, i.e., policy evaluation, in which an estimate V̂ of the value (i.e., the cost in a minimization problem) of the current approximate feedback is provided and quantified by a temporal difference error in the context of RL, and policy improvement, in which the estimate V̂, and hence the approximate feedback û_{k+1} = u*|_{V=V̂_k}, is updated (in continuous or discrete time) to reduce the approximation error, where k denotes the index of the iteration. In the vast majority of these methods, the temporal difference error, on which learning is based, is envisioned on the basis of the following consideration. Let H_p : R^n × R^m × R^n → R denote the pre-minimized Hamiltonian function, defined as

H_p(x, u, λ) := λ^T (f(x) + g(x)u) + q(x) + (1/2) u^T R u.          (6)

It has been shown that the solutions of the HJB pde satisfy the condition

H_p(x(t), u*(x(t)), ∇V(x(t))) = 0

for all t ≥ 0.
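For LQ data the HJB pde above reduces to the ARE, which gives a quick sanity check of the offline construction. A sketch under the assumed conventions of this text, q(x) = (1/2) x^T Q x, control penalty (1/2) u^T R u and V(x) = (1/2) x^T P x, on scalar illustrative data (A = 1, B = 1, Q = 3, R = 1, so P* = 3):

```python
# Sketch: for LQ data (f(x) = A x, g(x) = B) and V(x) = P x^2 / 2, the HJB residual
#   dV/dx * f + q - (1/2) (dV/dx) B R^{-1} B (dV/dx)
# vanishes exactly when P solves the ARE. Assumed scalar data: A=1, B=1, Q=3, R=1.
A, B, Q, R = 1.0, 1.0, 3.0, 1.0

def hjb_residual(P, x):
    dV = P * x                                   # gradient of V(x) = P x^2 / 2
    return dV * A * x + 0.5 * Q * x**2 - 0.5 * dV * (B / R) * B * dV

print(hjb_residual(3.0, 2.0))   # 0.0 at the ARE solution P* = 3
print(hjb_residual(2.0, 2.0))   # nonzero for any other P
```

The residual is a polynomial in x whose coefficient is exactly the ARE mismatch, so checking it at a single nonzero x suffices in the scalar case.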
Therefore, if the value function V is replaced by an estimate V̂ (e.g., described in terms of a functional approximator), a measure of the accuracy of such an approximation may be suggested by monitoring the temporal difference error (or its integral over a moving window, as in IRL [23]) along the trajectories of the closed-loop system. The objective of this manuscript is to propose an alternative definition of the temporal difference error. Toward this end, recall that the optimal feedback (5) can be equivalently designed and implemented as the output u*(t) = −R^{-1} g(x(t))^T λ(t) of the Hamiltonian dynamics

ẋ = (∂H/∂λ)(x, λ)^T,   λ̇ = −(∂H/∂x)(x, λ)^T          (7)

with (x(0), λ(0)) = (x_0, ∇V(x_0)), where the (minimized) Hamiltonian function H : R^n × R^n → R is defined as

H(x, λ) := λ^T f(x) + q(x) − (1/2) λ^T g(x) R^{-1} g(x)^T λ.          (8)

The rationale of the proposed alternative definition of the temporal difference error is that the Hamiltonian function H, obtained by replacing the actual value function V with an approximation thereof, provides sufficient information to completely characterize the underlying parameters of V̂ at a single time instant. This is due to the dependence of the dynamics, as well as of the initial conditions, of the Hamiltonian system on the approximating coefficients, and the alternative definition is achieved, essentially, by determining the maximal error over a certain time window. Although this strategy has not been pursued hitherto, it is worth observing that a similar result may be obtained by summarizing the entire information of the classic temporal difference error in IRL with its maximal value over time. However, the possibility of exploiting the above-mentioned logic may be hindered by the following consideration, which concerns the instability of a certain invariant manifold and which concludes this preliminary section. This provides, in addition, a motivation for the technical results in Section III.
The Hamiltonian system (7) represents the lifted system defined on the state/costate space. Assumption 1 implies that the Hamiltonian dynamics (7) possess a hyperbolic equilibrium point at (x, λ) = (0, 0), with n-dimensional stable and unstable submanifolds N_s and N_u through the origin that are invariant for system (7) (see, for instance, the detailed discussion in [24] and [25] in the context of disturbance attenuation problems for nonlinear systems, in which a similar pde arises). Therefore, a straightforward consequence of the above discussion is that, for any approximate V̂ such that, necessarily, V̂(x(·)) differs from the corresponding time evolution of the optimal value function V(x(·)), any temporal difference error based on the evaluation of the Hamiltonian function H(x, λ) over time would be an exponentially diverging function of time. This issue is circumvented in the following section, whereas the definition of the proposed temporal difference error is deferred to Section IV.
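The fragility just described can be reproduced with a few lines of code. A minimal sketch on scalar LQ data (assumed for illustration: A = 1, B = R = 1, q(x) = (3/2)x², so that S = 1, Q = 3 and the stable manifold is λ = P*x with P* = 3):

```python
# Sketch of the external instability of the stable invariant manifold: propagate the
# (uncontrolled) Hamiltonian dynamics  x' = A x - S lam,  lam' = -Q x - A lam  with
# the costate initialized slightly off the manifold lam = P* x. Assumed scalar data:
# A = 1, S = 1, Q = 3, so P* = 3 and the Hamiltonian matrix [[1, -1], [-3, -1]]
# has eigenvalues +2 and -2.
A, S, Q, P = 1.0, 1.0, 3.0, 3.0
x, lam = 1.0, P * 1.0 + 1e-6          # costate error of only 1e-6
dt, T = 1e-3, 8.0
for _ in range(int(T / dt)):          # forward Euler integration
    x, lam = x + dt * (A * x - S * lam), lam + dt * (-Q * x - A * lam)

print(abs(lam - P * x))               # the tiny initial error is amplified like e^{2T}
```

With the exact initialization the trajectory would converge to the origin along the stable manifold; the 10⁻⁶ perturbation instead excites the unstable eigenvalue and the distance from the manifold grows by roughly e^{2T} ≈ 9·10⁶.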

III. EXTERNAL STABILIZATION OF THE STABLE INVARIANT SUBMANIFOLD
As anticipated above, the main objective of this section is to discuss preliminary results concerning the Hamiltonian dynamics (7), such that the latter can be employed to define and construct an alternative temporal difference error, which quantifies the accuracy of an intermediate estimate of the optimal value function. For clarity of exposition it is convenient to explicitly describe the Hamiltonian dynamics (7), with respect to the Hamiltonian function H defined in (8), as

ẋ = f(x) − S(x) λ =: f_1(x, λ)
λ̇ = −∇q(x) − (∂f/∂x)(x)^T λ + (1/2) (∂/∂x)( λ^T S(x) λ )^T =: f_2(x, λ)          (9)

with S(x) := g(x) R^{-1} g(x)^T, and to define the submanifold M := {(x, λ) ∈ R^n × R^n : λ = ∇V(x)}. Then, it can be shown that M is an invariant, externally unstable, manifold for (9), since N_s = graph(∇V(x)) = M and N_s is tangent at the origin to the stable subspace of the linearized Hamiltonian dynamics [24], [25].
To provide a concise statement of the following result, consider the controlled Hamiltonian dynamics

ẋ = f(x) − S(x) η,   η̇ = v          (10)

where S(x) := g(x) R^{-1} g(x)^T and v : R → R^n is a (virtual) control input, to be designed, in charge of arbitrarily steering the costate variable η. Note that the change in notation between the costate of (9) (i.e., λ) and of (10) (i.e., η) underlines the different evolution of the former with respect to the latter, since they satisfy distinct dynamics, with the latter being steered by the choice of v. The main result of this section, summarized in the statement below, provides a selection of the control input v such that the trajectories of the controlled Hamiltonian dynamics (10) approximate, arbitrarily closely, those of the Hamiltonian dynamics (7) [or, equivalently, (9)]. Namely, the control input is such that η is arbitrarily close to λ, while ensuring that M is in fact an invariant, externally stable manifold for (10).
Theorem 1: Consider the controlled Hamiltonian dynamics (10) in closed loop with the (virtual) control law v = v*(x, η) defined as

v*(x, η) := f_2(x, ∇V(x)) − ( ∇²V(x) S(x) + σ_F I )( η − ∇V(x) )          (11)

with σ_F ∈ R_{>0} a design parameter and f_2 as in (9). Then, for any t* ∈ R_{>0}, ε ∈ R_{>0} and μ ∈ R_{>0}, there exists σ_F ∈ R_{>0} such that¹
(i) M is an invariant and externally (locally) exponentially stable manifold for (10), (11);
(ii) for all t > t*, the trajectories of (10), (11) are such that ‖η(t) − λ(t)‖ ≤ ε, where λ denotes the solution to (9) initialized at (x_0, ∇V(x_0)), namely the optimal costate. •

Proof: To begin with, since the HJB equation (4) holds for all x ∈ R^n, it follows that

∇²V(x) f_1(x, ∇V(x)) = f_2(x, ∇V(x))          (12)

which is obtained by differentiating both sides of (4) with respect to x. Consider then the change of coordinates described by (z_1, z_2) := (x, η − ∇V(x)). In the transformed coordinates, the dynamics of the first component of the state become

ż_1 = f_1(z_1, ∇V(z_1) + z_2) = f_1(z_1, ∇V(z_1)) − S(z_1) z_2

where the second equality is obtained by recalling that the function f_1(x, ·) is affine for any x ∈ R^n. Moreover, the dynamics of z_2 is derived as shown in (13) (overleaf), where the third equality is obtained by recalling that (12) holds and the last equality is derived from the definition of the vector field f_1 as in (9). Therefore, by replacing the control law v*, defined in (11), into the dynamics (13) one has that ż_2 = −σ_F z_2, hence

z_2(t) = e^{−σ_F t} z_2(0).          (14)

This, in turn, immediately implies item (i) of the statement by recalling the definition of the submanifold M for (10).
Let now x_λ denote the x-component of the solution of (7) with initial condition (x_0, ∇V(x_0)). Moreover, since by (14) one has η(t) = ∇V(x(t)) + e^{−σ_F t} z_2(0), the difference η − λ can be decomposed as in (16), where x satisfies the equation

ẋ = f_1(x, ∇V(x)) − S(x) e^{−σ_F t} z_2(0).          (17)

Therefore, item (ii) of the claim follows by considering (15) and (17) and applying Lemma 5 in the Appendix: by continuity of the mapping ∇V, for any ε_V > 0 there exists σ_F sufficiently large such that ‖∇V(x(t)) − ∇V(x_λ(t))‖ ≤ ε_V for all t ≥ 0. Thus, letting ε_V = ε/2, there exists σ_F such that the first term of the second line of (16) satisfies ‖∇V(x(t)) − ∇V(x_λ(t))‖ ≤ ε/2 for all t ≥ 0. Moreover, since ξ_0 := z_2(0) is fixed, it follows that the second term in (16) satisfies the condition e^{−σ_F t} ‖ξ_0‖ < ε/2 for all t ≥ t*, from which item (ii) follows. ∎

Theorem 1 provides a robustified mechanism to compute the optimal costate. A further consequence of the constructions discussed in the statement, resulting essentially from the fact that the behavior on the manifold M is not modified by the feedback control input v* and from Assumption 1, is summarized in the following statement.
Proposition 1: Suppose Assumption 1 holds. Then the origin of the closed-loop system (10), (11) is a locally exponentially stable equilibrium point. •

Proof: In the transformed coordinates the closed-loop system is described by the equations

ż_1 = f_1(z_1, ∇V(z_1)) − S(z_1) z_2          (18a)
ż_2 = −σ_F z_2.          (18b)

By the structural requirements of Assumption 1, the manifold M is tangent, locally around the origin of the extended state/costate space, to the stable n-dimensional invariant subspace of the linearized Hamiltonian dynamics. Therefore, it follows that the linearization of the vector field f_1(z_1, ∇_{z_1}V(z_1)) in (18a) has all eigenvalues with negative real part, since V solves the underlying HJB pde. The claim is then shown by noting that the dynamics in (18b) is linear, with −σ_F I a Hurwitz matrix. Consequently, the overall linearized description of (18) is an upper triangular block matrix, which is Hurwitz. ∎

Example 1: Consider the nonlinear system ẋ = x² + u, with x(t) ∈ R and u(t) ∈ R, and let q(x) = x² in the cost functional (2), with R = 1. Then, the HJB pde is solved by the positive definite function V such that ∇V(x) = x² + x√(x² + 2). The graph of the function ∇V, x ∈ R, which coincides with the stable invariant submanifold N_s of the underlying Hamiltonian dynamics, is depicted by the dotted gray line in Fig. 1. The dashed gray line displays the phase plot of the solution of the Hamiltonian dynamics (7) (correctly) initialized at (x_0, ∇V(x_0)). It can be appreciated that the trajectory, simulated in the time interval [0, 30], does not converge to the origin, even setting the relative and absolute tolerances to 10^{−12} in the MATLAB routine ode45. Namely, due to the manifold M being externally unstable, (unavoidable) numerical errors cause the trajectory to diverge. Instead, the solid black line displays the phase plot of the controlled Hamiltonian dynamics (10), with a randomly generated initial condition η(0), in closed loop with (11).
It is worth observing that, even for a random initial condition η(0), the optimal costate is recovered arbitrarily fast via the selection of σ F = 100 and the trajectory robustly converges to the origin, since the invariant submanifold {(x, λ) : λ = ∇V (x)} is rendered externally exponentially stable.
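The stabilized costate dynamics of Example 1 can be reproduced numerically. A minimal sketch (forward Euler in place of ode45), assuming R = 1 and the control-penalty convention (1/2)u^T R u in (2), under which ∇V(x) = x² + x√(x² + 2) solves the scalar HJB; the gain σ_F = 100 follows the example:

```python
import math

def grad_V(x):
    """Gradient of the value function for x' = x^2 + u, q(x) = x^2, R = 1 (assumed data)."""
    return x**2 + x * math.sqrt(x**2 + 2.0)

def hess_V(x):
    s = math.sqrt(x**2 + 2.0)
    return 2.0 * x + s + x**2 / s

def simulate(x0, eta0, sigma_F=100.0, T=5.0, dt=1e-3):
    """Controlled Hamiltonian dynamics (10): x' = x^2 - eta, eta' = v (forward Euler)."""
    x, eta = x0, eta0
    for _ in range(int(T / dt)):
        f1 = x**2 - eta                                    # x-dynamics
        v = hess_V(x) * f1 - sigma_F * (eta - grad_V(x))   # stabilizing virtual input
        x, eta = x + dt * f1, eta + dt * v
    return x, eta

x, eta = simulate(1.0, 0.5)    # eta(0) deliberately off the manifold
print(x, eta - grad_V(x))      # both quantities approach zero
```

With exact knowledge of V, the input above renders z₂ = η − ∇V(x) exponentially contracting, so the random initialization of η is immaterial: the costate snaps onto the manifold on the fast time scale 1/σ_F and the state then converges to the origin.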

IV. PI VIA CONTROLLED HAMILTONIAN DYNAMICS
The objective of this section consists in discussing how the results of Theorem 1, which in principle relies on the knowledge of V, can be employed to construct an approximate optimal control policy. Inspired by the rationale behind PI architectures, the proposed approach consists of a policy evaluation phase, followed by a policy improvement (discrete time) step. The overall iterative strategy (illustrated in Fig. 2) relies on a finite-dimensional parameterization of candidate value functions by means of value function approximators (VFAs), such as (single-layer) neural networks with a polynomial basis, see, e.g., [19]. Therefore, the suggested approximator may be equivalently interpreted as a sum-of-squares (SOS) estimate of the underlying value function.
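The dimension of such a polynomial parameterization grows combinatorially with the state dimension n and the degree bound m. A small sketch (helper names are hypothetical) enumerating the exponent tuples of the basis monomials and checking the count ν = C(n+m, m), together with the induced number of independent weights μ = ν(ν+1)/2:

```python
import math
from itertools import combinations_with_replacement

def monomial_exponents(n, m):
    """Exponent tuples of all monomials of degree <= m in n variables (hypothetical helper)."""
    exps = []
    for deg in range(m + 1):
        for combo in combinations_with_replacement(range(n), deg):
            e = [0] * n
            for i in combo:
                e[i] += 1           # one unit of degree for each selected variable
            exps.append(tuple(e))
    return exps

n, m = 3, 4
nu = math.comb(n + m, m)            # size of the monomial basis c(x)
mu = nu * (nu + 1) // 2             # independent entries of a symmetric nu x nu matrix
print(nu, mu, len(monomial_exponents(n, m)))   # 35 630 35
```

Even for modest n and m the number of weights μ is already in the hundreds, which motivates the care taken below in designing an informative scalar reward for updating them.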
To this end, let c : R^n → R^ν denote a vector containing a basis for the monomials of degree less than or equal to m with respect to the variable x ∈ R^n. Note that c(x) ∈ R^ν with

ν = ( (n + m) choose m )          (19)

where the right-hand side denotes the binomial coefficient of n + m over m. In the spirit of (model-based) online PI schemes, define the function V̂ : R^μ × R^n → R as

V̂(ϑ, x) := c(x)^T Θ c(x) = Σ_{i=1}^{μ} ϑ_i c̄_i(x)          (20)

with Θ ∈ R^{ν×ν} a symmetric and positive definite matrix. In (20), the parameter ϑ ∈ R^μ, μ = ν(ν + 1)/2, denotes a vector containing the entries of the matrix Θ by columns, namely ϑ = vech(Θ), where vech denotes the half-vectorization operator for symmetric matrices, whereas c̄_i(x) describes the ith element of the vector-valued function (c(x) ⊗_S c(x)), with ⊗_S denoting the symmetric Kronecker product (see [26] for more detailed discussions and insights). Note that V̂ in (20) describes a single-layer, linear neural network with polynomial activation functions c̄_i, i = 1, . . ., μ. Furthermore, to avoid cumbersome notation in the derivations below, which involve (mixed) second-order derivatives of the function V̂ with respect to ϑ and x, the following compact notation is introduced. Let h(ϑ, x) := ∇_x V̂(ϑ, x) and define the approximate (virtual) control input

v̂(x, η, h(ϑ, x))          (21)

namely the feedback control input (11) obtained by replacing the gradient of the optimal value function with the parameterized function h. The following sections provide the definition and some properties of a finite-dimensional cost function R_{x_0} : R^μ → R_{≥0} that is shown to be instrumental for the construction, or the approximation, of optimal control laws. The rationale behind the formal derivations below can be intuitively anticipated as follows: the statement of Theorem 1 entails that the implementation of v* in the controlled Hamiltonian dynamics (10) renders the manifold M externally LES, while preserving its underlying invariance property.
The implementation of a generic v̂ instead, provided the feedback v̂ is sufficiently close to v*, preserves the attractivity property of the manifold M (due to the intrinsic robustness of Lyapunov asymptotic stability to uncertainties) while, however, compromising the (fragile) property of invariance. Therefore, the main idea below is to employ the distance of the ensuing trajectories from the manifold M as a reward value at each episode of the learning process. In particular, given a certain initial condition x_0 ∈ R^n, an episode is defined as a sufficiently long time interval in which the trajectories of the plant, ensuing from x_0 and in closed loop with the current estimate of v*, are monitored online. Such a scalar reward is subsequently employed to measure the difference between v̂ and v* and, consequently, to adjust the value of the parameter² ϑ to minimize such a distance. Note that, differently from alternative PI schemes, by the nature of the episode and of the corresponding reward proposed herein, the measured signal is not required to be sufficiently rich to obtain an informative measure, nor is the selection of the length of the episode crucial for the convergence of the algorithm. The intuition above is illustrated and motivated at this preliminary stage by means of the following numerical simulation.
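A minimal numerical sketch of an episode and its scalar reward, on scalar LQ data (assumed: A = 1, B = R = 1, Q = 3, hence P* = 3). The specific form of the virtual input below is an assumption, chosen to be consistent with the closed-loop structure reported in Remark 1; the reward max_t |η − θx|² then vanishes only at the stabilizing ARE solution:

```python
# Sketch of an episode for the scalar LQ problem (assumed data: A = 1, S = 1, Q = 3).
# The approximate costate feedback uses a generic estimate theta in place of P* = 3;
# the assumed form of the virtual input (consistent with Remark 1's closed loop) is
#   v = -Q x - A theta x + theta S (theta x - eta) - sigma_F (eta - theta x).
A, S, Q, sigma_F = 1.0, 1.0, 3.0, 100.0

def reward(theta, x0=1.0, T=2.0, dt=1e-3):
    """R_{x0}(theta): maximal squared distance of eta from the manifold eta = theta x."""
    x, eta = x0, theta * x0          # episode initialized on the approximate manifold
    worst = 0.0
    for _ in range(int(T / dt)):
        v = -Q*x - A*theta*x + theta*S*(theta*x - eta) - sigma_F*(eta - theta*x)
        x, eta = x + dt * (A * x - S * eta), eta + dt * v
        worst = max(worst, (eta - theta * x) ** 2)
    return worst

for theta in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(theta, reward(theta))      # the reward vanishes only at theta = P* = 3
```

Note that the reward is a single scalar per episode: no persistence of excitation is required, and lengthening the episode does not change its value once the transient has been captured.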

A. Maximal Distance From Invariance as Policy Evaluation
In this section, we introduce and characterize the reward signal of the model-based RL architecture that is employed to measure the distance between v̂ and v* along trajectories of the closed-loop state/costate dynamics. To this end, consider the finite-dimensional cost function R_{x_0} : R^μ → R_{≥0} defined as

R_{x_0}(ϑ) := max_{t ∈ R_{≥0}} ‖η(t) − h(ϑ, x(t))‖²          (22)

parameterized with respect to the initial condition x_0 ∈ R^n. Note that the dependence of the function R_{x_0} on its argument ϑ is threefold, the first being the, immediately evident, direct dependence of the function h on ϑ. Then, since (x(t), η(t)) appearing in R_{x_0} denote the trajectories of the controlled Hamiltonian dynamics (10) in closed loop with v̂(x, η, h(ϑ, x)) and initialized at (x(0), η(0)) = (x_0, h(ϑ, x_0)), one has that R_{x_0} depends on ϑ also via the fact that the underlying (closed-loop) vector field is parameterized with respect to ϑ, as well as via the specific selection of the initial conditions. The results of this section characterize certain properties of the function ϑ → R_{x_0}(ϑ), which are derived by relying on the following standing assumption. (Although the discussion is carried out for the polynomial parameterization (20) of the optimal value function, the analysis is general in terms of the manipulation of the underlying parameters ϑ; hence, a similar strategy could be adapted and extended to several functional approximators and different activation functions.)
Assumption 2: Fix x_0 ∈ R^n. Let the value of ν in (20) be given and let θ̄ ∈ R^μ denote the parameter vector yielding the best approximation of the value function V by V̂ with the chosen basis. The origin is a LES equilibrium point of system (10) in closed loop with v̂|_{ϑ=θ̄} in (11), (21), and (x_0, h(θ̄, x_0)) ∈ R^n × R^n belongs to the basin of attraction of the origin. •

Note that Assumption 2 is verified provided ν is selected sufficiently large, since it is known that the approximation errors of V̂ with respect to V, as well as those of the corresponding partial derivatives, are uniformly bounded on a compact set and converge uniformly to zero as ν tends to infinity, see, e.g., [27]. Moreover, the assumption is trivially satisfied whenever the positive definite function V, solution of the HJB pde (4), is in fact a SOS polynomial expressible in terms of the basis c(x) of degree m in (20). The latter condition ensures the existence of ϑ* with the property that V̂(ϑ*, x) = V(x) for all x ∈ R^n, and Assumption 2 is implied by the statement of Proposition 1 in the nominal case. Under Assumption 2, the reward value (22) represents the maximal distance between the trajectory of η and the manifold M. Intuitively, the optimal control law (or an approximation thereof) can be obtained by minimizing such a distance.
Lemma 1: Consider the function R_{x_0} in (22) and suppose that Assumption 2 holds. Then there exist a constant r_ϑ ∈ R_{>0} and an open set U ⊂ R^n, containing the origin, such that R_{x_0} is bounded and continuous for all ϑ ∈ B_{r_ϑ}(θ̄) and for all x_0 ∈ U.
• Proof: As a straightforward consequence of Assumption 2 and of the definition of the function R_{x_0} in (22), it follows that R_{x_0}(θ̄) is finite for any x_0 in a neighborhood U ⊂ R^n containing the origin. By definition, R_{x_0}(ϑ) is nonnegative for all ϑ ∈ R^μ. Let now z_θ̄ := (x, η) denote the solution of (10), (11) ensuing from z_θ̄(0) = (x_0, h(θ̄, x_0)), while z_ϑ describes, similarly, the solution of (10) in closed loop with v̂ from z_ϑ(0) = (x_0, h(ϑ, x_0)). Then, by continuity of the solution of (10), (11) with respect to the parameter ϑ and to the initial condition (see, e.g., [28]), for any ε > 0 there exists a nonempty open set of initial conditions x_0 with the property that ‖z_θ̄(t) − z_ϑ(t)‖ < ε for all ϑ sufficiently close to θ̄. Hence, boundedness of R_{x_0} follows by continuity, in fact linearity, of the mapping ϑ → h(ϑ, x) for fixed x ∈ R^n. Continuity of the function R_{x_0}, instead, follows immediately by relying on arguments identical to those employed above, showing that the flow (x, η) is a continuous function of the parameter ϑ appearing in the vector field and in the initial conditions, and by recalling that the maximum of a continuous function is continuous. ∎
As discussed in the proof of Lemma 1, the function R_{x_0} is such that R_{x_0}(θ̄) yields the minimal distance from invariance of the resulting attractive manifold, for any x_0, provided that Assumption 2 holds. Moreover, in the limiting case in which μ tends to infinity, one has lim_{μ→∞} R_{x_0}(θ̄) = 0. The aim of the following remark consists in showing that, in the case of LQ optimal control problems, the reward function R_{x_0} is equal to zero if and only if ϑ is equal to ϑ* = vech(P*), where P* denotes the unique positive definite solution of the underlying ARE, as discussed below. The above argument in turn ensures that in the LQ case the cost function R_{x_0} admits a unique global minimizer with respect to ϑ.
Remark 1: In the setting of LQ optimal control problems, the controlled Hamiltonian dynamics (10) are described by

ẋ = A x − S η,   η̇ = v          (24)

with S := B R^{-1} B^T, in closed loop with the (virtual) control law

v̂ = (−Q − A^T P + P S P − F P) x + (F − P S) η          (25)

where P = P^T > 0 denotes a generic positive definite matrix, which replaces the role of the matrix Θ in the parameterization (20), and F := −σ_F I. Therefore, the closed-loop system becomes

d/dt (x, η) = [ A   −S ;  Ξ   F − P S ] (x, η)

with Ξ = −Q − A^T P − F P + P S P. Provided that Assumption 1 holds, it can be shown that the reward function R_{x_0} is positive definite around ϑ* = vech(P*), with P* denoting the unique positive definite (maximal) solution of the algebraic Riccati equation (ARE)

A^T P + P A + Q − P S P = 0.

In fact, the function R_{x_0} can be zero for a certain ϑ if and only if ϑ = vech(P) is such that the control law v̂ in (25) is stabilizing for (24) and, simultaneously, the identity η(t) = P x(t) holds for all t ≥ 0 along the trajectory of the closed-loop system (24), (25) ensuing from the initial condition (x(0), η(0)) = (x_0, P x_0). More precisely, ϑ = vech(P) must be such that σ(H + G K(P)) ⊂ C⁻, with H and G denoting the (lifted) system and input matrices of (24) and K(P) the feedback gain in (25), and such that the subspace {(x, η) : η = P x} is invariant for the closed-loop dynamics; the only positive definite matrix P for which H + G K(P) is Hurwitz and the subspace is invariant is P = P*. The properties of the function R_{x_0} characterized in Lemma 1 ensure that the value θ̄ is (locally) attractive for a class of gradient descent methods (as discussed below in more detail). Therefore, the first objective of the following section is to provide the gradient of the reward function R_{x_0}.
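The invariance characterization in the remark can be checked directly on scalar data (assumed: A = S = 1, Q = 3, σ_F = 100, so F = −100). The defect of invariance of the subspace {η = Px} for the closed-loop matrix reduces to the ARE residual, a sketch:

```python
# Sketch checking the invariance characterization of Remark 1 on scalar data
# (assumed: A = 1, S = 1, Q = 3, sigma_F = 100, so F = -100 and the ARE gives P* = 3).
# The subspace {eta = P x} is invariant for the closed loop iff d/dt(eta - P x) = 0
# along it, i.e. iff  Xi + (F - P S) P - P (A - S P) = 0, which reduces to the ARE
# residual  -(Q + A P + P A - P S P).
A, S, Q, F = 1.0, 1.0, 3.0, -100.0

def invariance_defect(P):
    Xi = -Q - A * P - F * P + P * S * P          # (2,1) entry of the closed loop, per Remark 1
    return Xi + (F - P * S) * P - P * (A - S * P)

print(invariance_defect(3.0))   # 0.0: the ARE solution renders the subspace invariant
print(invariance_defect(2.0))   # nonzero for P != P*
```

Notice that the defect is independent of F: the stabilizing gain σ_F affects how fast trajectories approach the approximate manifold, not whether that manifold is invariant.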

B. Gradient of the Reward Function
The objective of this section is to characterize and compute the gradient of the cost function R_{x_0} with respect to ϑ.
Remark 2: Since the definition of R_{x_0} involves the maximization of a function of time, it is expected to possess points of nondifferentiability. Moreover, the cost function (22) is in general a nonconvex function of ϑ, which appears in the initial conditions of the controlled Hamiltonian dynamics (10) and as a parameter that determines the underlying vector field. While nonsmooth, nonconvex optimization problems are known to be NP-hard, a range of methods (including learning-based techniques) render the (finite-dimensional) problem of minimizing the cost (22) more readily solvable in practice than the original (infinite-dimensional) optimal control problem, as demonstrated in Section VI in the context of the optimal design of an automatic flight control system. One such method, based on gradient information gathered at several points, is selected and discussed in the following section.

The minimization of (22) is carried out via the controlled Hamiltonian dynamics (10), with the crucial difference that, while λ in (10) represents the optimal costate variable, η defines the approximate costate induced by the selection of the stabilizing control law v̂ in (10), initialized at (x(0), η(0)) = (x_0, h(ϑ, x_0)). Therefore, as mentioned in Section IV-A, the function R_{x_0} depends on the vector of parameters ϑ explicitly, as shown in (22), as well as via the initial condition and the dynamics itself. The computation of the gradient of R_{x_0} is detailed in the following formal statement, in which, to be concise, we let χ = (x, η) ∈ ℝ^{2n} and consider the auxiliary dynamics (28).

Lemma 2: Suppose that the set arg max_{t ∈ ℝ≥0} ‖η(t) − h(ϑ, x(t))‖² =: T(ϑ) is a singleton and let T(ϑ) =: t̄. Then the gradient of R_{x_0} with respect to ϑ is given by (30). •

Proof: To begin with, by considering the definition of the cost function R_{x_0} in (22) and the discussion following its definition, it immediately follows that the gradient can be expressed as in (31); therefore, only the sensitivity of the flow χ = (x, η) with respect to ϑ remains to be computed.
Toward this end, by the standard fixed-point representation of the solution of an ordinary differential equation, the flow χ(·, ϑ) admits an integral representation. By taking the partial derivative of the latter with respect to ϑ, one obtains (33), in which it has been recalled that χ(0, ϑ) = (x_0, h(ϑ, x_0)). By the fundamental theorem of integral calculus, the time derivative of (33) then satisfies (34). Therefore, by letting S_x(t) := ∂x(t)/∂ϑ and S_η(t) := ∂η(t)/∂ϑ, for all t ∈ ℝ≥0, the gradient of the cost function R_{x_0} with respect to ϑ is provided by (30). In fact, the matrices S_x : ℝ → ℝ^{n×μ} and S_η : ℝ → ℝ^{n×μ}, obtained as solutions of (28b) together with the initial conditions S_x(0) = 0 and S_η(0) = h_ϑ(ϑ, x_0), satisfy the dynamics (34), while ζ satisfying (28a) replicates the time evolution of χ in (34).
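The sensitivity computation in the proof amounts to integrating variational equations alongside the flow. A minimal sketch under illustrative assumptions (a scalar system ẋ = −ϑx standing in for the Hamiltonian dynamics; all names are placeholders) integrates the flow and its parameter sensitivity jointly and cross-checks the result against a finite difference:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Joint integration of a flow and its parameter sensitivity S(t) = dx(t)/dtheta,
# the mechanism behind (28b)/(34); here x' = -theta*x, so S' = -x - theta*S.
def rhs(t, z, theta):
    x, S = z
    return [-theta * x, -x - theta * S]

theta, x0 = 1.5, 2.0
sol = solve_ivp(rhs, (0.0, 1.0), [x0, 0.0], args=(theta,), rtol=1e-10, atol=1e-12)
x_T, S_T = sol.y[0, -1], sol.y[1, -1]

# finite-difference cross-check of the integrated sensitivity
eps = 1e-6
x_T_pert = solve_ivp(rhs, (0.0, 1.0), [x0, 0.0], args=(theta + eps,),
                     rtol=1e-10, atol=1e-12).y[0, -1]
fd = (x_T_pert - x_T) / eps
```

For this scalar example the flow is x(t) = x_0 e^{−ϑt}, so the sensitivity at t = 1 is −x_0 e^{−ϑ}, which both computations recover.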

Remark 3:
In the statement of Lemma 2 it has been assumed, for simplicity of exposition, that the set T(ϑ) is a singleton. Whenever the set T(ϑ) is multivalued for some ϑ, the gradient of the function R_{x_0} is in turn multivalued and can be obtained by considering (31) evaluated at any t ∈ T(ϑ). The constructions in the following section, yielding a gradient-descent algorithm that extends the gradient-sampling strategy to the setting of matrix manifolds, are applicable without changes also in the multivalued case. Furthermore, in practice, the issue of nondifferentiability may be entirely circumvented by considering a modified, truncated definition of the set over which the gradient is computed, namely T_{t̄}(ϑ) := arg max_{t ∈ [0, t̄]} ‖η(t) − h(ϑ, x(t))‖², for any t̄ ∈ ℝ>0. In fact, since the measure induced by R_{x_0} is based on a forward invariance property, R_{x_0}(ϑ) is equal to zero if and only if its restriction to any finite interval [0, t̄] is equal to zero. The numerical values selected for t̄ and σ_F (which is related to the rate of external convergence) are, however, relevant in practice (see Section V-B for an illustrative example).
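In practice the truncated maximizer T_{t̄}(ϑ) can be evaluated on a dense time grid. A small sketch, with closed-form scalar trajectories standing in for (x, η) (all data here is illustrative, not from the paper):

```python
import numpy as np

# Grid-based evaluation of the truncated maximizer
#   T_tbar(theta) = argmax_{t in [0, tbar]} ||eta(t) - P x(t)||^2
# using closed-form scalar stand-ins for the trajectories.
x0, eta0, P, tbar = 1.0, 2.0, 1.0, 5.0
t = np.linspace(0.0, tbar, 5001)
x = x0 * np.exp(-t)            # illustrative state trajectory
eta = eta0 * np.exp(-2.0 * t)  # illustrative approximate costate
dev = (eta - P * x) ** 2       # squared distance from the candidate manifold
t_star = t[np.argmax(dev)]     # truncated maximizer
R_val = dev.max()              # truncated reward value
```

For these particular trajectories the deviation is largest at t = 0, so the maximizer sits at the left endpoint of [0, t̄]; a finer grid or a local refinement step can sharpen the estimate when the maximizer is interior.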

C. Manifold Gradient Descent Algorithms for PI
The objective of this section consists in combining the intuitions and constructions of Sections III, IV-A, and IV-B to provide a hybrid dynamical system that yields an adaptive optimal control law via PI. Before stating the main result of this section, namely the construction of such a hybrid adaptive control law, a few preliminary definitions and tools are briefly recalled.
Definition 1: Given a vector w = [w_1, . . ., w_p]ᵀ ∈ ℝ^p, with p = n(n + 1)/2, the inverse half-vectorization operator, denoted vech⁻¹(w), maps w into the corresponding symmetric matrix of ℝ^{n×n}. •

It is worth observing that, by combining the constructions of Section IV-B with Definition 1, vech⁻¹(ζ_1(t)) is a symmetric matrix that describes the gradient of the (scalar) cost function R_{x_0} with respect to the matrix Θ. Therefore, before presenting and discussing in detail the hybrid control architecture mentioned above, the gradient-descent algorithm is first revisited with respect to the manifold of symmetric and positive definite matrices, denoted by S⁺(n) = {X ∈ ℝ^{n×n} : X = Xᵀ, X > 0}. This, in fact, is instrumental in updating the vector ϑ while ensuring that the matrix Θ does not leave S⁺(n). The tangent space at a point X ∈ S⁺(n) is defined as T_X S⁺(n) = {W ∈ ℝ^{n×n} : W = Wᵀ}, namely the space of symmetric matrices. The manifold S⁺(n) becomes a Riemannian manifold by introducing the Riemannian metric ⟨W_1, W_2⟩_X := tr(X⁻¹W_1X⁻¹W_2), for W_1 ∈ T_X S⁺(n) and W_2 ∈ T_X S⁺(n), where tr(·) denotes the trace of a matrix. The geodesic curve at the point X ∈ S⁺(n) in the direction W ∈ T_X S⁺(n) is defined as γ_{X,W}(α) = X^{1/2} exp(αX^{−1/2}WX^{−1/2})X^{1/2} and is entirely contained in S⁺(n) for any α ∈ [0, 1]. Finally, consider a few preliminary definitions, which are borrowed from [29].

Definition 2 (see [29, Def. 3.1]): Given a vector ϑ such that Θ = vech⁻¹(ϑ) and a positive real number ε, ∂_ε R_{x_0}(ϑ) denotes the ε-subdifferential of R_{x_0}.
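The geodesic above can be evaluated numerically, e.g., as in the following sketch (a generic illustration of the affine-invariant geometry of S⁺(n), not code from the paper; the data is arbitrary):

```python
import numpy as np
from scipy.linalg import expm, sqrtm

# Geodesic on S^+(n) under the affine-invariant metric:
#   gamma(alpha) = X^{1/2} expm(alpha * X^{-1/2} W X^{-1/2}) X^{1/2},
# which stays positive definite for every alpha (checked numerically below).
def spd_geodesic(X, W, alpha):
    Xh = np.real(sqrtm(X))
    Xh_inv = np.linalg.inv(Xh)
    return Xh @ expm(alpha * Xh_inv @ W @ Xh_inv) @ Xh

X = np.array([[2.0, 0.5], [0.5, 1.0]])    # point on S^+(2)
W = np.array([[-1.0, 0.3], [0.3, -2.0]])  # symmetric tangent direction

points = [spd_geodesic(X, W, a) for a in np.linspace(0.0, 1.0, 11)]
min_eig = min(np.linalg.eigvalsh(Pt).min() for Pt in points)
```

This is precisely the property exploited by the update of ϑ: moving along the geodesic, rather than along a straight line in coordinates, guarantees that Θ never leaves S⁺(n).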
• While the interested reader is referred to [29] for the details, note that the above subdifferential is in fact computed by considering the convex hull of the gradients determined in (30) at several points close (according to the underlying Riemannian distance and the underlying exponential map) to Θ = vech⁻¹(ϑ). By relying on [29, Thm. 3.12], it can then be concluded that a descent direction can be determined by selecting the element of minimal norm in ∂_ε R_{x_0}(ϑ), for some ε > 0, denoted as D(ϑ). More precisely, consider ∂_ε R_{x_0}(ϑ) and let w° := arg min{‖w‖ : w ∈ ∂_ε R_{x_0}(ϑ)} define the element of minimal norm in the set. Then, D(ϑ) := −w°/‖w°‖ yields a descent direction with uniform decrease.
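The minimal-norm element w° can be computed by solving a small quadratic program over the simplex of convex-combination weights. A sketch, with illustrative sampled gradients (the data and the use of SciPy's `minimize` are assumptions made here):

```python
import numpy as np
from scipy.optimize import minimize

# Minimal-norm element of the convex hull of sampled gradients, yielding the
# descent direction D(theta) = -w/||w|| of the gradient-sampling scheme.
G = np.array([[1.0, 2.0],
              [-1.0, 2.0],
              [0.5, 3.0]])  # one sampled gradient per row (illustrative)

def min_norm_in_hull(G):
    m = G.shape[0]
    obj = lambda lam: np.sum((lam @ G) ** 2)            # ||sum_i lam_i g_i||^2
    cons = ({"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0.0, 1.0)] * m,
                   constraints=cons)
    return res.x @ G

w = min_norm_in_hull(G)
D = -w / np.linalg.norm(w)
```

For the gradients above, the hull point of minimal norm is (0, 2), obtained by averaging the first two samples, so the normalized descent direction points along (0, −1).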
The above constructions essentially represent an extension of the gradient-sampling method to the case of (matrix) manifolds. The following formal statement describes the convergence properties of the proposed episodic learning strategy, a practical implementation of which is then provided by Algorithm 1. In particular, in the following statement, the time histories induced by the episodic learning of Fig. 2 are interpreted in terms of trajectories of a hybrid system.

Theorem 2: Consider the hybrid system with flow dynamics defined by (28) together with (38), and with jump dynamics defined by the reset condition (39). Then, all trajectories of the hybrid system (28), (38), (29), (39) ensuing from initial conditions ϑ̂(0, 0) with the property that v̂|_{ϑ̂(0,0)} is stabilizing for (10) are such that ϑ̂(t, k) converges to a stationary point of R_{x_0}(ϑ). •

Proof:
The claim is shown by noting that the sublevel sets of the function R_{x_0} are bounded and by relying on the convergence properties of the algorithm discussed in [29, Th. 3.18].
A systematic description of the practical implementation of the results expressed in Theorem 2 is provided in Algorithm 1, which yields a strategy based on episodic learning to construct approximate optimal control laws.
Note that Step (1) of Algorithm 1 aims at approximating the computation of the ε-subdifferential, and hence of the descent direction D(ϑ), via the evaluation of the gradient at a finite collection of nearby points, in such a way that the gradient descent method can be practically implemented. Furthermore, by relying on the properties established in Section IV-B, (local) convergence of Algorithm 1 follows from somewhat standard arguments on gradient-descent strategies, see, e.g., [29]. For clarity of exposition, the algorithm above presents the principles of the control design methodology by focusing on its essential features. A few classical refinements of the implementation of the gradient descent approach may, however, be straightforwardly included. In addition to the use of the sampling gradient method mentioned above, one could, for instance, consider a line search subroutine that computes an optimized step α in place of the constant one introduced in Algorithm 1.
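A backtracking (Armijo) line search of the kind mentioned above can be sketched as follows. The sketch is generic, with an illustrative quadratic cost; in Algorithm 1 the cost would be R_{x_0} and the update would follow the geodesics of Section IV-C:

```python
import numpy as np

# Backtracking (Armijo) line search: shrink alpha until a sufficient-decrease
# condition holds. This could replace the constant step size of Algorithm 1.
def armijo_step(f, grad, x, D, alpha0=1.0, beta=0.5, c=1e-4, max_iter=50):
    alpha = alpha0
    fx = f(x)
    slope = c * (grad(x) @ D)  # D is a descent direction: grad(x) @ D < 0
    for _ in range(max_iter):
        if f(x + alpha * D) <= fx + alpha * slope:
            return alpha
        alpha *= beta
    return alpha

# illustrative quadratic cost and normalized steepest-descent direction
f = lambda x: 0.5 * (x @ x)
grad = lambda x: x
x = np.array([4.0, -2.0])
D = -grad(x) / np.linalg.norm(grad(x))
alpha = armijo_step(f, grad, x, D)
x_new = x + alpha * D
```

The accepted step is guaranteed to decrease the cost; on a manifold the straight-line update `x + alpha * D` would be replaced by the geodesic evaluated at α.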

V. ROBUST OPEN-LOOP LQ OPTIMAL CONTROL
The main objective of this section consists in specializing the previous results to the case of linear systems with quadratic cost functionals. As discussed in Remark 1, in such a setting the Hamiltonian dynamics is described by a system of linear equations, the costate of which is controlled by means of the additional, virtual control input v, as shown in (24).

A. Comparison Between Optimal and Controlled Costate
Before discussing the main results of this section, the classic characterization of the optimal costate in this scenario is briefly revisited. Toward this end, consider a linear, time-invariant system described by the equation ẋ = Ax + Bu, with x : ℝ → ℝⁿ and u : ℝ → ℝᵐ denoting the state and the control input, respectively, together with the quadratic cost functional (41). It is well known that the optimal control policy is given by u⋆ = −R⁻¹BᵀP⋆x, where P⋆ denotes the symmetric, positive definite solution of the ARE (26). The optimal solution is equivalently obtained as u = −R⁻¹Bᵀλ, where λ denotes the solution of the underlying Hamiltonian system, which depends on the solution of the ARE (26), and hence on P⋆. It is then straightforward to show (Lemma 3) that λ⋆(t) = P⋆e^{A_cl t}x_0, with A_cl := A − SP⋆ and S := BR⁻¹Bᵀ. Indeed, in the transformed coordinates ẑ = T⁻¹z, with T defined in (47), the Hamiltonian dynamics takes a block-triangular form; by noting that z = Tẑ, so that λ(t) = ẑ_2(t) + P⋆ẑ_1(t), the claim follows. The proof is then concluded by recalling that λ_0 = P⋆x_0 and λ(t)|_{λ_0 = P⋆x_0} = λ⋆(t) = P⋆e^{(A−SP⋆)t}x_0.

By specializing the constructions discussed in the nonlinear setting to the case of linear systems, consider the controlled Hamiltonian system (51), with G defined as in (24), where v : ℝ → ℝⁿ in (11) reduces to a (linear) feedback, since h(ϑ, x) = Px and g(x) = B do not depend on the state variable x.

Lemma 4: Consider the controlled Hamiltonian system (51) in closed loop with (52). Then, the resulting controlled costate is given by (53). •

Proof: The claim is shown by relying on arguments identical to those of the proof of Lemma 3 and by noting that TG = G and that v = (A_cl − σ_F I)ẑ_2 in the transformed coordinates.
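The identity λ⋆(t) = P⋆e^{(A−SP⋆)t}x_0 can be verified numerically by propagating the Hamiltonian matrix (written here in its standard LQ form, an assumption of this sketch) from the initial condition (x_0, P⋆x_0); the system data is illustrative:

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_are

# Illustrative LQ data; not taken from the paper.
A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
S = B @ np.linalg.inv(R) @ B.T

P = solve_continuous_are(A, B, Q, R)
H = np.block([[A, -S], [-Q, -A.T]])  # standard LQ Hamiltonian matrix

x0 = np.array([1.0, -0.5])
z0 = np.concatenate([x0, P @ x0])    # lambda(0) = P x0: on the stable subspace

t = 2.0
z_t = expm(H * t) @ z0               # Hamiltonian flow at time t
lam_t = z_t[2:]
lam_closed_form = P @ expm((A - S @ P) * t) @ x0
```

Initializing the costate at P⋆x_0 keeps the trajectory on the stable invariant subspace, so the propagated costate matches the closed-form expression; any other initialization excites the unstable part of the Hamiltonian flow.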
It is now possible to compare the optimal costate λ⋆ with the costate computed by considering the stabilized Hamiltonian dynamics (51). To provide a concise statement, let c_1 ∈ ℝ>0 and c_2 ∈ ℝ>0 be such that ‖e^{−A_cl t}‖ < c_1e^{c_2 t}.
• Proof: By the definitions of the functions λ⋆ and η given in (45) and (53), respectively, it follows that e(t) = M_s(t)ξ − e^{−σ_F t}ξ. Consider first the norm of the matrix-valued function M_s. The claimed bound then holds for all t > t⋆, where the third inequality follows from the selection of σ_F.

B. Stabilized Computation of the Optimal Costate
By building on ideas inspired by the constructions of Section IV, the main objective of this section consists in specializing the cost function (22) and the subsequent gradient descent algorithm to the case of LQ optimal control problems. Toward this end, note first that in the latter case the candidate approximate value function V̂ in (20) can be a priori effectively limited to the class of quadratic functions, namely by letting c(x) = x ∈ ℝⁿ. As a consequence, the matrix Θ ∈ ℝ^{n×n}, with n describing the dimension of the state of system (40), represents a generic matrix P ∈ ℝ^{n×n} in the quadratic form V̂ = xᵀPx; the vector ϑ = vech(P), with P = Pᵀ > 0, therefore describes a candidate solution of the ARE (26). It is worth observing that this selection is such that Assumption 2 holds with P = P⋆. Such a matrix P must then be determined by specializing the gradient-descent algorithm on matrix manifolds discussed above to the cost function (56), where (x, η) are obtained from (24), (25), and with t̄ := arg max_{t ∈ ℝ≥0} ‖η(t) − Px(t)‖². As commented upon in the paragraph following the definition of R_{x_0} in (22), the dependence of L_{x_0} on P is threefold: direct, as immediately suggested by (56), via the parameterization of the vector field of the underlying Hamiltonian dynamics, as shown in (24) and (25), and via the initial conditions (x(0), η(0)) = (x_0, Px_0). Interestingly, since in the LQ setting the flow of the Hamiltonian dynamics can be obtained in closed form as a matrix exponential, such a threefold dependence can be explicitly expressed as in (57), with K(P) defined in (25).

Remark 4:
The construction in Algorithm 1, together with the discussion in Remark 1 for the LQ setting, ensures that the gradient-descent method introduced herein is such that the estimate P converges to P⋆. Therefore, the strategy can be interpreted as an alternative to the solution of the underlying ARE (26) that does not involve the solution of any algebraic equation, the computation of the eigenvalues of any square matrix, or the inversion of any nonsingular matrix.
Remark 5: Consider the first-order approximation of the matrix exponential function t ↦ e^{(H+GK(P))t} around the origin, evaluated at t = t̄, namely e^{(H+GK(P))t̄} = I + t̄(H + GK(P)) + O(t̄²).
By substituting the above approximation into the cost function (57), straightforward computations show that the cost function L_{x_0} can be approximated by an expression involving Y := Q + PA + AᵀP − PSP, which coincides with the right-hand side of the ARE (26).
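The quality of the first-order truncation in Remark 5 can be probed numerically; the matrix M below is an illustrative stand-in for H + GK(P):

```python
import numpy as np
from scipy.linalg import expm

# First-order Taylor truncation of the matrix exponential:
#   expm(M*t) = I + t*M + O(t^2),
# so the truncation error should shrink roughly quadratically in t.
M = np.array([[0.0, 1.0], [-2.0, -3.0]])  # illustrative stand-in

def trunc_err(t):
    return np.linalg.norm(expm(M * t) - (np.eye(2) + t * M))

e1, e2 = trunc_err(0.1), trunc_err(0.05)
ratio = e1 / e2  # close to 4 for an O(t^2) error when t is halved
```

The quadratic decay of the error is what makes the approximation of L_{x_0} by the ARE residual accurate for small t̄.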
To illustrate the properties mentioned in Remark 3, consider the (linear) double-integrator dynamics together with a cost functional as in (41) with Q = I and R = 1. The initial condition is x_0 = [5, 3]ᵀ and Algorithm 1 is initialized with a random P and the fixed step size α = 0.001. Figs. 4 and 5 show the dependence of the convergence rate on the selection of t̄ and σ_F, respectively: from the computational perspective, it is desirable to employ a longer time interval and a smaller value for σ_F, so that the trajectories are driven sufficiently far from the original subspace. The latter feature, in fact, yields a gradient matrix that is more reliable from the numerical point of view.

VI. AUTOMATIC FLIGHT CONTROL OF THE LONGITUDINAL MOTION OF AN AIRCRAFT
The results of the previous sections are validated and corroborated by means of numerical simulations concerning the optimal design of an automatic flight control system for the longitudinal motion of an aircraft. The model description is borrowed from [30], in which a detailed derivation of the dynamics is provided. The longitudinal motion of an aircraft is captured by dynamics as in (1), described by the vector fields f and g defined in (60) (at the bottom of the page), where x_1(t) ∈ ℝ denotes the angle of attack, in radians, while x_2(t) ∈ ℝ and x_3(t) ∈ ℝ are the angular displacement with respect to the horizon and its rate of change, respectively. The control action u(t) ∈ ℝ describes the prescribed tail deflection angle. In [30], two control strategies are suggested: a linear static feedback, optimally designed on the basis of the linearized model, is compared with a nonlinear feedback obtained from a power-series approximate solution of the underlying HJB pde. The latter policy is computed at the price of computationally intensive derivations that require the solution of nonlinear algebraic equations. The effectiveness of the two strategies above is therein measured by means of a quadratic cost functional similar to (2) with q(x) = (1/2)xᵀQx, where Q = (1/2)I and R = 2.
More precisely, the control law u⋆ solving the linearized problem is given in (61), whereas the nonlinear feedback u_ps, computed by approximating the HJB pde with the cubic terms of its power series expansion included, is given in (62). Note that the coefficients of the control law (61) are related to the entries of the symmetric and positive definite matrix P⋆ that solves the corresponding ARE (26). The constructions discussed in the previous sections are then carried out in the setting of the automatic flight control problem described by (60) by selecting m = 2 in the definition of c(x). The gradient-descent method on the manifold of symmetric and positive definite matrices illustrated in Theorem 2 is then applied by initially letting the matrix Θ_0 ∈ ℝ^{9×9} depend on a positive parameter ε chosen such that Θ_0 ∈ S⁺(9). In the following numerical simulations, the coefficient has been selected as ε = 0.001, while the parameter σ_F in the construction of the adaptive control law (21) is equal to 1. It is assumed that the initial configuration of the aircraft is such that the angle of attack corresponds to 25 degrees, hence x_0 = [(25/180)π, 0, 0]ᵀ. Fig. 6 depicts the costs yielded by the linearized solution u⋆ in (61), i.e., J(u⋆) = 0.1152, and by the nonlinear control law u_ps proposed in [30] and reported in (62), i.e., J(u_ps) = 0.091 (dash-dotted and dashed lines, respectively), together with the evolution over the iteration number k of the cost yielded by the iterative control law (21), defined by the approximate value function V̂(ϑ_k, x) = c(x)ᵀΘ_k c(x). It can be observed that even a relatively small number of iterations of the algorithm allows a cost lower than that obtained in [30] with u_ps to be achieved, essentially without solving any partial differential, or even algebraic, equation.
Note, in addition, that the latter control law, i.e., u_ps, has been constructed at the price of significant computational complexity and effort in solving algebraic equations, as commented upon in [30]. The graphs in Fig. 7 instead report the evolution over the iteration number (i.e., one curve for each value of k) of the phase plot in ℝ³ of the corresponding costate variable induced by the controlled Hamiltonian dynamics (10) with v(x, η, h(ϑ_k, x)). The solid black line describes the phase plot associated with the last iteration of the gradient descent algorithm.
Finally, the value of the matrix Θ_60, reported in (66) (at the bottom of the page), is employed to compute the time histories of the state x and of the control action v̂ induced by the value function V̂(ϑ_60, x). These are indicated by the solid black lines in Figs. 8 and 9, respectively. The time histories are also compared with those resulting from the use of the control laws u⋆ (dash-dotted lines) and u_ps (dashed lines).

VII. CONCLUSION
An iterative learning strategy to construct optimal control laws in infinite-horizon dynamic optimization problems has been proposed and discussed. The strategy relies on a modified (controlled) version of the classic Hamiltonian dynamics arising in this context, in which the stable invariant submanifold, instrumental for the construction of the optimal policy, is rendered externally asymptotically stable. This property is then exploited, in the presence of a VFA for the candidate solution, to yield an alternative temporal difference error in the episodic learning architecture, which measures the distance from invariance induced by the current approximating value function. The estimate is then updated by means of a gradient descent algorithm on the manifold of symmetric positive definite matrices. The theory has been validated by means of numerical simulations involving an automatic flight control problem.