Finite-Horizon Optimal Control for Linear and Nonlinear Systems Relying on Constant Optimal Costate

A class of finite-horizon optimal control problems, the solution of which relies on a time-varying change of coordinates that incorporates the transition matrix of the system linearized along the current estimate of the optimal process, is studied. The transformed dynamics exhibit a constant optimal costate. Differently from existing methods that hinge upon similar tools, the proposed strategy does not require at each step the (numerical) solution of a two-point boundary value problem or of a time-varying Riccati equation: only the solution of a linear initial value problem is needed. The method is first illustrated in the setting of linear dynamics and quadratic cost, for which the construction permits the identification of a class of problems in which the solution to the underlying (quadratic) Differential Riccati Equation exhibits a separation between homogeneous and particular contributions.


Lorenzo Tarantino, Mario Sassano, Senior Member, IEEE, Sergio Galeani, and Alessandro Astolfi, Fellow, IEEE
Index Terms-Iterative learning control, nonlinear systems, optimal control.

I. INTRODUCTION
The problem of controlling the state of a system in an optimal way from a given initial state to a certain terminal state, within a prescribed time interval, which is fixed a priori, has gained a central role in systems and control theory over the past decades [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. In the second half of the last century, two approaches were introduced for solving such a class of problems, i.e., Dynamic Programming (DP) [11], [12] and Pontryagin's Minimum Principle (PMP) [13]. The former provides necessary and sufficient conditions of optimality, leading to an optimal feedback solution characterized via a nonlinear partial differential equation (PDE), i.e., the so-called Hamilton-Jacobi-Bellman (HJB) equation. Since the optimal solution, as well as the optimal cost, is characterized for any initial condition in the state space, methods inspired by DP are particularly appealing; the main drawback of this approach is that closed-form solutions to the HJB equation can seldom be obtained. Methods relying on PMP, on the other hand, provide only necessary conditions of optimality, allowing the identification of candidate optimal solutions, referred to as extremals. This approach essentially leads to open-loop strategies, since the characterization of the optimal control law heavily relies upon the knowledge of the initial condition of the underlying system. Such necessary conditions are characterized in terms of ordinary differential equations, instead of PDEs, which must be satisfied by the optimal process together with an auxiliary variable, the so-called costate. It is worth mentioning that, in the case of infinite-horizon problems (see, e.g., [14], [15]), this approach recovers the computational complexity of DP, since determining the correct boundary conditions requires the solution of the HJB PDE. On the contrary, in the finite-horizon setting, the solution of a two-point boundary value problem (TPBVP) is
needed. As has been widely recognized, the major challenge in solving optimal control problems lies in their intrinsic infinite-dimensional nature, combined with the presence of nonlinearities. To address this challenge, a large number of algorithms have been envisioned over the past decades: these can be essentially classified as direct or indirect (see [16] and [17] for further discussion on this distinction). A direct method converts the original problem into a sequence of nonlinear optimization problems, for example, by using Chebyshev polynomials (see, e.g., [18], [19]). As a matter of fact, a direct method transforms an infinite-dimensional problem into a sequence of finite-dimensional problems, which are in general easier to handle. Indirect methods instead rely essentially on necessary conditions, such as those derived from the Calculus of Variations [6], and the problem is then handled in the infinite-dimensional context. On the other hand, since Newton's method provides particularly efficient iterative strategies to tackle nonlinear optimization tasks, methods based on the linearization of the underlying plant along certain candidate trajectories have emerged with the objective of replicating Newton's arguments in an infinite-dimensional context. This can be accomplished in two different ways (see [20] and [21]). The first consists of linearizing the underlying dynamics, together with expanding the cost functional in a Taylor series up to second-order terms, and then solving a Linear Quadratic Regulator (LQR) problem at each iteration (see [22], [23], [24], [25], and [26], where the latter refers to the context of stochastic systems). The second relies on imposing necessary conditions, such as those derived from PMP, and then linearizing the resulting dynamics (see, e.g., [3] and [1]). Although this latter approach has been extensively studied, since there exist efficient numerical techniques for solving TPBVPs (see, e.g., [27], [28], [29],
[30], [31], [32], [33], [34]), its major drawback is that the steps involved lead to losing sight of the behavior of the cost functional.
The strategy proposed here follows the first of the two approaches described above to replicate Newton's argument in an infinite-dimensional setting, which in principle relies on solving a time-varying LQR problem. This task could appear discouraging, since the solution of this problem relies on the solution of a time-varying Riccati equation, which is a quadratic differential equation. Nonetheless, by borrowing ideas from [1] and [35], in which a class of zero-sum differential games is solved by means of an auxiliary variable related to the underlying dynamics via a change of coordinates, it is possible to replace the Riccati equation with a linear, time-varying, differential equation with only initial boundary conditions. This makes the proposed approach computationally viable. Furthermore, this strategy also allows assessing the behavior of the cost functional after each iteration. Therefore, the steps involved lead to global convergence to a (possibly local) extremal. Since the resulting iterative scheme is characterized by a gradient-descent nature, the choice of the optimal step size along the descent direction plays a crucial role in the convergence rate of the entire scheme. This issue can be addressed by relying on approximate as well as exact line search methods, or by borrowing heuristic methods used in the field of machine learning for the training of neural networks, such as the RMSProp scheme [36].
Preliminary results have been presented in [37]. Differently from [37], herein we provide detailed proofs of all the technical results, together with further insights on the structure of the solution. Moreover, the main ideas are first presented in the case of the LQR problem, making it possible to identify a class of such problems with interesting properties. An extensive comparison is provided between strategies for obtaining the optimal step size at each iteration of the descent algorithm. Finally, the optimal control problem for a mechanical system on a redundant task is solved using the proposed approach.
The rest of the article is organized as follows. The class of problems considered, together with some preliminaries and a discussion of the main ideas obtained by revisiting the LQR problem, leading to a closed-form solution of the Differential Riccati Equation, is introduced in Section II. The algorithm that solves the finite-horizon optimal control problem by exploiting the constant optimal costate is introduced in Section III. Further discussion of the proposed algorithm, together with some implementation aspects, is given in Section IV. A discussion of the steps involved in the proposed approach and those of the quasilinearization scheme is provided in Section V. Finally, an example of application to a mechanical system on a redundant task and some concluding remarks are given in Sections VI and VII, respectively.
Notation: Given two matrices A ∈ R^{m×n} and B ∈ R^{p×q}, the Kronecker product between A and B, denoted as A ⊗ B ∈ R^{pm×qn}, is the block matrix whose (i, j)-th block is given by a_{ij}B, where a_{ij} denotes the (i, j)-th entry of A. A positive definite function is denoted as q : R^n → R_{>0}, whereas a positive semidefinite function is denoted as q : R^n → R_{≥0}.
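As a quick sanity check of the dimensions above, the Kronecker product is available as `np.kron`; the matrices below are illustrative and not taken from the article.

```python
import numpy as np

# Illustrative check of the Kronecker product dimensions: for A in R^{m x n}
# and B in R^{p x q}, A kron B is the block matrix whose (i, j)-th block is
# a_ij * B, hence it lies in R^{pm x qn}.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])            # m = 2, n = 2
B = np.array([[0.0, 1.0, -1.0]])      # p = 1, q = 3
K = np.kron(A, B)

assert K.shape == (A.shape[0] * B.shape[0], A.shape[1] * B.shape[1])
# The top-left block of A kron B equals a_11 * B.
assert np.allclose(K[:1, :3], A[0, 0] * B)
```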

II. PROBLEM STATEMENT AND PRELIMINARIES
The main purpose of this article is to study a class of finite-horizon optimal control problems described by the cost functional

J(u) = q_f(x(t_f)) + ∫_{t_0}^{t_f} ( q(x(t)) + (1/2) u(t)^T R u(t) ) dt    (1)

subject to the dynamic constraints

ẋ(t) = f(x(t)) + g(x(t)) u(t),    x(t_0) = x_0    (2)

where x(t) ∈ R^n denotes the state of the system, u(t) ∈ R^m represents the control input, and t_f > 0 is the fixed final time. Moreover, the functions q : R^n → R_{≥0} and q_f : R^n → R_{≥0} are assumed to be sufficiently smooth, namely, whenever their (higher-order) derivatives are required, it is assumed that such derivatives are continuous. The matrix R ∈ R^{m×m}, with R = R^T, is positive definite. Finally, suppose that the vector field f : R^n → R^n and the mapping g : R^n → R^{n×m} are sufficiently smooth. The problem under investigation is formally introduced in the following statement.

Problem 1: Let x_0 ∈ R^n be given. The finite-horizon optimal control problem consists in determining a control law u ∈ C^0(t_0, t_f) that minimizes (1) along the trajectories of (2). •

Problem 1 is studied by first considering the case of linear dynamics and quadratic cost functional, which is briefly revisited in the following section. The genuinely nonlinear formulation of Problem 1 is subsequently addressed by first letting, for illustrative purposes (see Theorem 1 below), q_f be the squared norm of the terminal state x_f and q ≡ 0. These structural assumptions are then removed, thus yielding a solution to Problem 1 in the general setting given by (1) and (2) (see Corollaries 1 and 2 below).

A. Revisiting the Finite Horizon LQR Problem
The main intuition behind the arguments employed to tackle the problem defined by (1), (2) is discussed here in the specially structured case in which the dynamics (2) are linear and time-invariant, namely

ẋ(t) = Ax(t) + Bu(t),    x(t_0) = x_0    (3)

where A ∈ R^{n×n}, B ∈ R^{n×m}, and the running and terminal costs in (1) are quadratic functions of the state, namely

q(x) = (1/2) x^T Q x,    q_f(x) = (1/2) x^T Q_f x    (4)

with Q = Q^T ≥ 0 and Q_f = Q_f^T ≥ 0. It is well known (see, for instance, [6]) that, in such a setting, the optimal solution is described by the time-varying state feedback

u(t) = -R^{-1} B^T P(t) x(t)    (5)

where P(t) ∈ R^{n×n}, with P(t) = P(t)^T ≥ 0, solves the Differential Riccati Equation (DRE)

Ṗ(t) = -A^T P(t) - P(t)A + P(t)BR^{-1}B^T P(t) - Q    (6a)
P(t_f) = Q_f.    (6b)

It is worth mentioning that (5) solves Problem 1 for all x_0 ∈ R^n. Furthermore, the minimal cost attained from a generic time t and state x is yielded by the value function V(x, t) = (1/2) x^T P(t) x, for all x ∈ R^n and t ∈ [t_0, t_f]. Whenever the state of the system is penalized only at the final time in (1), a particularly insightful structure of the solution, which can be determined in closed form, emerges. The latter then constitutes the main ingredient toward the extension to the nonlinear setting discussed in Section III. To provide a concise statement of the following result, let S(τ) := e^{A(t_f-τ)} B R^{-1} B^T e^{A^T(t_f-τ)}.

Proposition 1: Consider the system (3) and the cost functional (1) with q and q_f as in (4). Suppose that Q = 0 and Q_f > 0.
Fix x_0 ∈ R^n. Then, the solution to Problem 1 is provided by

u*(t) = -R^{-1} B^T e^{A^T(t_f-t)} K(t)^{-1} e^{A(t_f-t)} x*(t)    (7)

with K(t) defined as

K(t) := Q_f^{-1} + ∫_t^{t_f} e^{A(t_f-τ)} B R^{-1} B^T e^{A^T(t_f-τ)} dτ.

Proof: Consider the time-varying change of coordinates defined as z(t) := e^{A(t_f-t)} x(t). In such transformed coordinates, the cost functional (1), (4) with Q = 0 becomes

J(u) = (1/2) z(t_f)^T Q_f z(t_f) + (1/2) ∫_{t_0}^{t_f} u(t)^T R u(t) dt    (9)

while the underlying dynamics are

ż(t) = e^{A(t_f-t)} B u(t),    z(t_0) = e^{A(t_f-t_0)} x_0.    (10)

It is then straightforward to note that the Hamiltonian function associated with the problem in the transformed coordinates reduces to

H(z, u, λ, t) = (1/2) u^T R u + λ^T e^{A(t_f-t)} B u    (11)

which is, in fact, independent of the state z. Therefore, by relying on standard results (see, e.g., [6]), necessary and sufficient conditions of optimality for the problem described by (9), (10) imply that the optimal process (u*, z*) satisfies the Hamiltonian dynamics

ż*(t) = e^{A(t_f-t)} B u*(t)    (12a)
λ̇*(t) = 0    (12b)

together with the boundary conditions

z*(t_0) = e^{A(t_f-t_0)} x_0,    λ*(t_f) = Q_f z*(t_f).    (13)

The control law u* in (12) is instead obtained via the (unconstrained) minimization of H, yielding

u*(t) = -R^{-1} B^T e^{A^T(t_f-t)} λ*.    (14)

By replacing the control law (14) into the dynamics (12a), and noting that, since the optimal costate is constant (see (12b)), λ* = Q_f z*(t_f), integration from t_0 to t_f yields

z*(t_f) = z*(t_0) - ( ∫_{t_0}^{t_f} e^{A(t_f-τ)} B R^{-1} B^T e^{A^T(t_f-τ)} dτ ) Q_f z*(t_f).    (15)

From (15) it follows that:

z*(t_f) = (K(t_0) Q_f)^{-1} e^{A(t_f-t_0)} x_0.    (16)

Substitution of (16) into the control law in (14), together with the observation that the quantity K(t)^{-1} z*(t) = K(t)^{-1} e^{A(t_f-t)} x*(t) coincides with the constant costate λ* along the optimal trajectory, leads to (7), thus showing the claim.

By closely inspecting the structure of (7) in comparison with (5), one immediately realizes that the former implicitly suggests the structure of the matrix P solving the DRE in closed form in this setting. The above intuition is summarized in the following statement.
Proposition 2: Suppose that Q_f > 0 and let

P_h(t) := e^{A^T(t_f-t)} K(t)^{-1} e^{A(t_f-t)}.    (18)
Then P_h in (18) is a closed-form solution of the DRE (6) with Q = 0.

Proof: To begin with, observe that, by the definition of K as in (7), P_h in (18) can be rewritten as

P_h(t) = e^{A^T(t_f-t)} ( Q_f^{-1} + ∫_t^{t_f} S(τ) dτ )^{-1} e^{A(t_f-t)}.

Note now that (d/dt) K(t) = -S(t). Therefore, replacing the above identity into the time derivative of the candidate solution P_h in (18), and recalling that S(τ) = e^{A(t_f-τ)} B R^{-1} B^T e^{A^T(t_f-τ)}, a direct computation shows that P_h satisfies (6a) with Q = 0. Finally, note that the boundary condition is satisfied by construction, since K(t_f) = Q_f^{-1}; hence the proof is complete.

A few observations about Proposition 2 are in order. First note that, although the DRE (6) is quadratic instead of linear, the statement of Proposition 2 implicitly entails that P_h(t) in (18) solves the homogeneous part of (6), obtained by setting the constant term equal to zero in the vector field of (6). One may then wonder in which circumstances the DRE enjoys a separation into homogeneous and particular solutions, as for linear differential equations. This is precisely illustrated in the result below. To provide a concise statement, let B^⊥ ∈ R^{n×(n-m)} denote a full column rank matrix with the property that B^T B^⊥ = 0.
Proposition 3: Consider the Differential Riccati Equation (6). Suppose that there exists a matrix X ∈ R^{(n-m)×(n-m)} satisfying (21). Then, the solution to the DRE (6) can be decomposed as in (22), as the sum of homogeneous and particular contributions.

Proof: Since P(t_f) = P̃_h(t_f) + P_p = Q_f, the boundary condition is satisfied. Evaluating the time derivative of P, and exploiting (21), one then concludes that P satisfies (6a). It is worth noting that, since (21) can be equivalently rewritten, there exists a matrix X with the property that (21) is satisfied if and only if (24) holds. Therefore, if a given LQR problem is such that condition (24) is met, then the closed-form solution to the Differential Riccati Equation (6) is provided by (22).

Remark 1: The results given by Propositions 1-3 remain valid even in the case in which only the invertibility of the matrices Q_f and Q_f - B^⊥ X (B^⊥)^T is assumed, instead of their positive definiteness, which ensures the existence of a unique matrix P solving the DRE (19) and (6), respectively (see, e.g., [8]). In this case, however, the invertibility of the matrices (20) and ∫_t S(τ) dτ must be additionally assumed, provided that there exist unique solutions to the DRE (6) and (19).
Example 1: As a simple illustration of the previous discussion, consider the weighting matrices appearing in (4), with R equal to the identity matrix in the cost functional (1), and the dynamic constraint (3) represented by the matrices A and B. In this setting, by following the results discussed in this section, the matrices B^⊥ and X are given by B^⊥ = [1 1]^T and X = 1, respectively. Therefore, P(t) can be obtained by following the construction given in Proposition 3.
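The separation results above can be checked numerically. The sketch below (with illustrative data; A, B, Q_f, R are not those of Example 1) backward-integrates the homogeneous DRE, i.e., (6) with Q = 0, and compares the result with the closed form P_h(t) = e^{Aᵀ(t_f-t)} K(t)⁻¹ e^{A(t_f-t)}, where, following the standard terminal-cost-only LQ theory, K(t) is taken as Q_f⁻¹ plus the integral of S(τ) = e^{A(t_f-τ)} B R⁻¹ Bᵀ e^{Aᵀ(t_f-τ)} from t to t_f.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad_vec, solve_ivp

# Hedged numerical check (illustrative data): with Q = 0, the backward DRE
# solution should match the closed form
#   P_h(t) = e^{A'(tf-t)} K(t)^{-1} e^{A(tf-t)},
#   K(t)   = Qf^{-1} + int_t^{tf} S(s) ds,
#   S(s)   = e^{A(tf-s)} B R^{-1} B' e^{A'(tf-s)}.
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
Qf = np.diag([2.0, 1.0])
Rinv = np.array([[1.0]])
t0, tf = 0.0, 1.5

def S(s):
    E = expm(A * (tf - s))
    return E @ B @ Rinv @ B.T @ E.T

def P_closed(t):
    W, _ = quad_vec(S, t, tf)          # adaptive quadrature of the Gramian
    K = np.linalg.inv(Qf) + W
    E = expm(A * (tf - t))
    return E.T @ np.linalg.solve(K, E)

def dre(t, p):                          # homogeneous DRE: Q = 0
    P = p.reshape(2, 2)
    return (-A.T @ P - P @ A + P @ B @ Rinv @ B.T @ P).ravel()

# Backward integration from the terminal condition P(tf) = Qf.
sol = solve_ivp(dre, (tf, t0), Qf.ravel(), rtol=1e-10, atol=1e-12)
P0 = sol.y[:, -1].reshape(2, 2)

assert np.allclose(P0, P_closed(t0), atol=1e-6)
```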

III. ITERATIVE OPTIMAL CONTROL OF NONLINEAR SYSTEMS
To provide a concise statement of the main result in the nonlinear case, a few preliminary definitions are introduced. Let a process (û(t), x̂(t)), defined for all t ∈ [t_0, t_f], be given, such that x̂(t_0) = x_0 and the time derivative of x̂ satisfies (2) with u(•) ≡ û(•). Consider then the linearization of (2) along the pair (û, x̂), defined by

δẋ(t) = A(t) δx(t) + g(x̂(t)) δu(t),    A(t) := (∂f/∂x)(x̂(t)) + Σ_{j=1}^m û_j(t) (∂g_j/∂x)(x̂(t))    (26)

together with the initial condition δx(t_0) = 0, where δx = x - x̂ and δu = u - û describe the incremental variables and g_j, j = 1, . . ., m, are the column vectors of the mapping g. Finally, consider the operator

β(x(t)) := Φ(t_f, t) g(x(t))    (27)

where Φ(t_f, t_0) describes the transition matrix associated with A(t), and the controllability Gramian G(x_0) ∈ R^{n×n} associated with the system (26), given by

G(x_0) := ∫_{t_0}^{t_f} β(x̂(τ)) R^{-1} β(x̂(τ))^T dτ.    (28)

Further, note that the operator β relies upon the knowledge of the state transition matrix of a linear time-varying system. The latter can seldom be obtained in closed form. To overcome this issue, recall (see, e.g., [38]) that, for a given linear system with state matrix A(t), the matrix Θ(t) := Φ(t_f, t) satisfies the linear matrix differential equation

Θ̇(t) = -Θ(t) A(t),    Θ(t_f) = I.    (29)

It is possible to backward integrate the linear differential equation (29), obtaining the required expression for β(x(t)), i.e., β(x(t)) = Θ(t) g(x(t)), for any time. Finally, consider the cost functional

J(u) = (1/2) x(t_f)^T Q_f x(t_f) + (1/2) ∫_{t_0}^{t_f} u(t)^T R u(t) dt    (30)

where Q_f = Q_f^T ≥ 0. It is worth anticipating that the specially structured form of this cost functional is motivated by illustrative purposes of the underlying intuition; it is subsequently generalized, therefore allowing the solution of problems involving cost functionals of the form given in (1). By relying on the definitions (26)-(28), it is now possible to state the following theorem, which provides an iterative strategy leading to candidate optimal solutions for (30), (2).

Theorem 1: Consider the cost functional (30), subject to the constraints (2). Let u_0 ∈ C^0(t_0, t_f) be an arbitrary control law defined for all t ∈ [t_0, t_f] such that the solution x(•) to (2) with u(•) = u_0(•) exists in [t_0, t_f]. Consider the iterations described by

u_{i+1}(t) = (1 - α_i) u_i(t) + α_i Γ_i(t)    (31)

for i = 0, 1, . .
., with

Γ_i(t) := -R^{-1} β_i(t)^T Q_f (I + G(x_0) Q_f)^{-1} ( x_i(t_f) - ∫_{t_0}^{t_f} β_i(τ) u_i(τ) dτ )    (32)

where Θ_i(t_f) = I, A_i is defined as in (26) with respect to the process (u_i, x_i), and β_i : R → R^{n×m} is defined as

β_i(t) := Θ_i(t) g(x_i(t)).    (33)

Then there exists a sequence {α_i}, α_i ∈ R_{>0}, such that the sequence {u_i} converges pointwise to an extremal of (30), (2). Furthermore, if the problem described by (30), (2) admits an optimal solution u* that is the unique extremal, then

lim_{i→∞} u_i(t) = u*(t)    (34)

for all t ∈ [t_0, t_f].
Proof: Let δx_i = x_{i+1} - x_i and δu_i = u_{i+1} - u_i, and consider the linearization introduced in (26) along the process (u_i, x_i). By borrowing ideas similar to those in the constructions introduced in [35] in a different context, consider the auxiliary variable δz_i defined as

δz_i(t) := Φ_i(t_f, t) δx_i(t)    (35)

for t ∈ [t_0, t_f], where Φ_i denotes the transition matrix associated with A_i. In the transformed coordinates, the problem, once linearized, becomes the minimization of

(1/2) (x_i(t_f) + δz_i(t_f))^T Q_f (x_i(t_f) + δz_i(t_f)) + (1/2) ∫_{t_0}^{t_f} (u_i(t) + δu_i(t))^T R (u_i(t) + δu_i(t)) dt    (36)

since, by definition of state transition matrix, δz_i(t_f) = δx_i(t_f), subject to the dynamic constraints

δż_i(t) = β_i(t) δu_i(t)    (37)

together with the initial condition δz_i(t_0) = 0. The latter is a time-varying system such that the underlying vector field
is independent of the variable δz_i. The Hamiltonian function associated with (36), (37) is given by

H_i(δz_i, δu_i, δλ_i, t) = (1/2) (u_i + δu_i)^T R (u_i + δu_i) + δλ_i^T β_i(t) δu_i.    (38)

By relying on standard results (see, e.g., [6]), necessary conditions of optimality for the problem (36), (37) imply that the optimal process (δu_i, δz_i), if it exists, satisfies the Hamiltonian dynamics

δż_i(t) = β_i(t) δu_i(t)    (39a)
δλ̇_i(t) = 0    (39b)

together with the boundary conditions

δz_i(t_0) = 0    (40a)
δλ_i(t_f) = Q_f (x_i(t_f) + δz_i(t_f))    (40b)

whereas the control law δu_i in (39) is obtained, as in Section II, via the unconstrained optimization of H_i, yielding

δu_i(t) = -u_i(t) - R^{-1} β_i(t)^T δλ_i.    (41)

Moreover, since ∂²H_i/∂δu_i² = R > 0 (42) and because of the linearity of the underlying dynamics, the conditions provided by (39), (42) are also sufficient conditions for optimality of the process (δu_i, δz_i). It is worth observing that the dynamics in (39b) imply that the costate variable δλ_i is constant in the interval [t_0, t_f] and equal, by the boundary conditions in (40b), to Q_f(x_i(t_f) + δz_i(t_f)). By replacing the costate into the control law in (41), and, in turn, the latter into (39a), one obtains a linear equation which, once solved with respect to δz_i(t_f), yields

δz_i(t_f) = -(I + G(x_0) Q_f)^{-1} ( G(x_0) Q_f x_i(t_f) + ∫_{t_0}^{t_f} β_i(τ) u_i(τ) dτ ).    (43)

The matrix I + G(x_0)Q_f is nonsingular, since the matrix G(x_0)Q_f, although not necessarily symmetric, possesses nonnegative eigenvalues (see, e.g., [39]). Finally, replacing the latter into (41) yields

δu_i(t) = -u_i(t) + Γ_i(t)    (44)

with Γ_i as in (32). This yields the update rule

u_{i+1}(t) = u_i(t) + α_i δu_i(t)    (45)

which describes precisely the iterations given by (31). The first claim follows by noting that the control law δu_i is the optimal solution of the linearized problem and it yields a descent direction (see, e.g., [40]) for the nonlinear cost J.
Finally, the second claim follows immediately from the stated assumptions and the gradient-like nature of the update rule (31).
The update rule (45) reveals that moving along the descent direction δu_i is equivalent to considering the convex combination of the previous input u_i and the function Γ_i in (32). It is worth mentioning that, even if the proposed approach iteratively updates the control law in a gradient-like scheme, the properties related to the computational complexity of the proposed algorithm are different from those exhibited by the standard gradient descent (see, e.g., [3]) and conjugate gradient descent (see, e.g., [41]) methods. More precisely, the integration of 2n nonlinear differential equations is required for applying the gradient scheme, whereas 4n nonlinear differential equations must be integrated to implement the conjugate gradient scheme. The approach proposed herein relies instead on the (backward-in-time) integration of a linear, time-varying, matrix equation, namely (29), hence revolving around the solution of n^2 linear time-varying differential equations. As such, the comparison between the numbers of equations suggests a disadvantage of the proposed method with respect to the (conjugate) gradient methods. The performance of the proposed approach has therefore been compared, by means of a case study, with that provided by the gradient and conjugate gradient schemes. The numerical simulations have been carried out in the setting described by the cost functional (30), subject to the dynamic constraints given by a nonlinear system borrowed from [42, Ch. 4]. The initial condition is x_0 = (1, -1, 2, -1), whereas the matrix Q_f in (30) is the identity matrix. Finally, since in this example m = 1, R has been selected as R = 1. The three methods are implemented by using a prescribed step size, for each iteration, along the descent direction. More precisely, for each scheme, the step size has been numerically selected as the maximum value that ensures monotonic convergence. In particular, in the case of the gradient as well as
the conjugate gradient schemes, the maximum step size with the property above is α = 5.25 × 10^{-2}. On the other hand, the proposed approach ensures convergence with a unitary step size, i.e., with α_i = 1. The behavior of the cost functional J along the iterations, for each of the simulated schemes, is reported in Fig. 1, which shows that, thanks to the possibility of selecting a unitary step size, the proposed algorithm achieves convergence faster than the other two gradient schemes.
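For reference, the baseline gradient scheme used in the comparison can be sketched as follows for a hypothetical scalar plant ẋ = -x³ + u with a cost of the form (30); the dynamics, weights, step size, and discretization below are illustrative choices, not those of the case study.

```python
import numpy as np

# Illustrative sketch of a first-order (adjoint-based) gradient scheme for
#   J = 0.5*Qf*x(tf)^2 + 0.5*R*int_{t0}^{tf} u(t)^2 dt,  xdot = -x^3 + u.
# The L2 gradient of J with respect to u(t) is R*u(t) + lam(t), where lam
# solves the adjoint ODE lamdot = 3*x^2*lam backward from lam(tf) = Qf*x(tf).
N, t0, tf = 400, 0.0, 1.0
h = (tf - t0) / N
Qf, R, alpha, x0 = 1.0, 1.0, 0.3, 1.0

def rollout(u):
    x = np.empty(N + 1)
    x[0] = x0
    for k in range(N):                      # forward Euler
        x[k + 1] = x[k] + h * (-x[k] ** 3 + u[k])
    return x

def cost(u, x):
    return 0.5 * Qf * x[-1] ** 2 + 0.5 * R * h * np.sum(u ** 2)

def gradient(u, x):
    lam = np.empty(N + 1)
    lam[-1] = Qf * x[-1]
    for k in range(N - 1, -1, -1):          # backward sweep of the adjoint
        lam[k] = lam[k + 1] - h * 3.0 * x[k] ** 2 * lam[k + 1]
    return R * u + lam[:-1]

u = np.zeros(N)
J_hist = []
for _ in range(15):
    x = rollout(u)
    J_hist.append(cost(u, x))
    u = u - alpha * gradient(u, x)

assert J_hist[-1] < J_hist[0]               # the cost decreases overall
```

With a fixed step small enough relative to the curvature of J, the iterations decrease the cost monotonically, mirroring the step-size discussion in Section IV.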
Fig. 1. First ten iterations of the cost functional obtained by using the gradient descent scheme (diamonds), the conjugate gradient descent scheme (squares), and the proposed approach (circles).

Remark 2: The statement of Theorem 1 entails that the sequence {u_i}, produced by the iterations described by (31), converges in general to an extremal of (30), (2). It is worth
observing that this extremal cannot be a local maximizer of (30), (2). In fact, let {J_i} be the sequence of costs produced by the algorithm, where J_i is the cost associated with the u_i obtained at the i-th iteration, and consider the limiting process (x*, u*). The latter is not a (local) maximizer of (30), (2), since there exists a sequence of admissible processes (namely, (31) in reverse order) along which J monotonically increases. This property is a natural consequence of the gradient-descent nature of (31), and it is significantly different from other techniques based on the approximation of the Hamiltonian dynamics (see Section IV).
The following two corollaries generalize the structure of the cost functional (30). This is achieved by modifying first the terminal cost (Corollary 1) and then the running cost (Corollary 2).
Corollary 1: Consider the cost functional

J(u) = q_f(x(t_f)) + (1/2) ∫_{t_0}^{t_f} u(t)^T R u(t) dt    (46)

subject to the constraints (2). Then, letting δz_i^o(t_f) denote the solution of (47), defining Γ_i as in (48), and considering the iterations (31), there exists a sequence {α_i}, α_i ∈ R_{>0}, such that the sequence {u_i} converges pointwise to an extremal of (46), (2). Furthermore, if the problem described by (46), (2) admits an optimal solution u* which is the unique extremal, then (34) holds.

Proof: The claim follows by repeating the steps of the proof of Theorem 1 and noting that, since δλ_i is constant for all t ∈ [t_0, t_f], the equation (43) characterizing the terminal state can be rewritten in the form given by (47); the claim then follows by replacing in (48) the solution δz_i^o(t_f).

Remark 3: By possibly requiring that q_f ∈ C^2(R^n), it is always possible to find the solution of (47), at least locally, by relying on the implicit function theorem. To this end, note that, if the difference t_f - t_0 is sufficiently small, the left-hand side of (47) can be interpreted as a function of δz_i(t_f). Thus, local solvability of (47) is implied by the nonsingularity of the corresponding Jacobian matrix, where δz_i(t_f) has been replaced with zero, since by construction (given its role of describing an increment) it can be reasonably assumed to be sufficiently small.

Corollary 2: Consider the cost functional

J(u) = ∫_{t_0}^{t_f} ( q(x(t)) + (1/2) u(t)^T R u(t) ) dt    (49)

subject to the constraints (2), with q : R^n → R_{>0}. Let u_0 ∈ C^0(t_0, t_f) be an arbitrary control law defined for all t ∈ [t_0, t_f] such that the solution x(•) to (2) with u(•) = u_0(•) exists in [t_0, t_f]. Let β_i : R → R^{(n+1)×m} be given by (33), where β is given by (27) considering F(x) = [f(x)^T, q(x)]^T and G(x) = [g(x)^T, 0_{m×1}^T]^T, in place of f and g, for all x ∈ R^n, respectively. Then, letting the terminal state be given by (50), defining Γ_i as in (51), and considering the iterations described by (31), there exists a sequence {α_i}, α_i ∈ R_{>0}, such that the sequence {u_i} converges pointwise to an extremal of (49), (2). Furthermore, if the problem described by (49), (2) admits an optimal solution u* which is the unique extremal, then (34) holds.
Proof: Note that, by introducing the auxiliary state-variable x_v, with dynamics ẋ_v = q(x) and initial condition x_v(t_0) = 0, the problem given by (49), (2) can be rewritten as a problem with terminal cost only (see, e.g., [6]), since

x_v(t_f) = ∫_{t_0}^{t_f} q(x(τ)) dτ.

Therefore, the claim follows by direct application of Corollary 1, noting that, since in this case δλ_i(t) = δλ_i(t_f) = [0_{1×n}, 1]^T, the solution of (47) is given by (50).

Remark 4: Although the structure exhibited by the term Γ_i in (51) appears simpler than those in the previous statements, it is worth observing that all the information needed to solve the finite-horizon optimal control problem, related in particular to the cost-to-go q, is essentially encoded, at each iteration, within the (extended) β_i introduced in Corollary 2. The value of the terminal state in (50) appears unused in the construction of the update rule (51) due to the structural absence of a genuine terminal cost in (49). Nonetheless, it would be employed in the descent direction whenever the cost functional (49) contained, in addition to a running cost on the state, a terminal cost similar to those of (30) or (46). More precisely, by combining the results of Corollaries 1 and 2, it is then possible to solve, iteratively, the optimal control problem described by the cost functional (1), subject to the nonlinear dynamic constraints (2).
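The augmentation used in the proof can be illustrated numerically: appending a state x_v with ẋ_v = q(x) makes the running state cost accumulate in the auxiliary state, so that x_v(t_f) equals the integral of q along the trajectory. The dynamics, running cost, and input below are hypothetical placeholders.

```python
import numpy as np
from scipy.integrate import solve_ivp, quad

# Illustrative sketch (hypothetical scalar data) of the augmentation in the
# proof: with xv' = q(x), xv(t0) = 0, the running state cost accumulates in
# the auxiliary state, so that xv(tf) = int_{t0}^{tf} q(x(t)) dt.
def f(x):      return -x             # hypothetical drift
def g(x):      return 1.0            # hypothetical input map
def q(x):      return x ** 2         # running state cost
def u_of_t(t): return np.sin(t)      # any fixed admissible input

t0, tf, x0 = 0.0, 2.0, 1.0

def augmented(t, s):
    x, xv = s
    return [f(x) + g(x) * u_of_t(t), q(x)]

sol = solve_ivp(augmented, (t0, tf), [x0, 0.0],
                rtol=1e-10, atol=1e-12, dense_output=True)
xv_tf = sol.y[1, -1]

# Cross-check: integrate q along the same trajectory directly.
direct, _ = quad(lambda t: q(sol.sol(t)[0]), t0, tf)
assert abs(xv_tf - direct) < 1e-7
```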
Remark 5: By inspecting the structure of the update law (31), it appears evident that the main computational burden in employing expression (32) for the construction of Γ_i is represented by the evaluation of the function β_i, which in turn relies upon the knowledge of the state transition matrix of a linear time-varying system; the latter can be determined by backward integration of (29). This allows concluding that the proposed scheme solves a sequence of time-varying LQR problems but, instead of the solution of a time-varying Differential Riccati Equation, it requires only the solution of a linear time-varying differential equation.
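The backward integration of (29) mentioned above can be sketched as follows; a constant A is used here only so that the computed Θ(t₀) = Φ(t_f, t₀) can be checked against the matrix exponential (the data are illustrative).

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

# Hedged sketch of the backward integration of (29): Theta(t) = Phi(tf, t)
# satisfies the linear matrix ODE
#   d/dt Theta(t) = -Theta(t) A(t),  Theta(tf) = I,
# which requires only a single (terminal) boundary condition. A constant A
# is used so the result can be verified against expm.
n = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
t0, tf = 0.0, 1.0

def ode(t, th):
    Theta = th.reshape(n, n)
    return (-Theta @ A).ravel()

# Integrate backward from tf to t0 (solve_ivp accepts a decreasing span).
sol = solve_ivp(ode, (tf, t0), np.eye(n).ravel(), rtol=1e-10, atol=1e-12)
Theta0 = sol.y[:, -1].reshape(n, n)

assert np.allclose(Theta0, expm(A * (tf - t0)), atol=1e-6)
```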

IV. SELECTION OF THE STEP ALONG THE DESCENT DIRECTION
The structure of the update rule (31) entails that the knowledge of the descent direction δu_i in (44) is exploited at each step via a convex combination of the function Γ_i with the control law employed at the previous iteration. Once the descent direction is computed, the implementation of the rule (31) has the property that the cost functional J(u_{i+1}) becomes essentially a function of the step α_i alone. While, on the one hand, the formal statements above ensure the existence of a sufficiently small α_i such that J(u_{i+1}) < J(u_i), the practical selection of such a value remains a crucial design choice with significant consequences on the convergence rate of the updates. A few alternative strategies are discussed and compared in this section. It is worth pointing out that all the strategies discussed in the following aim to construct the sequence {α_i}, the existence of which is ensured by Theorem 1; therefore, in practice, some preliminary parameter tuning may be necessary to actually achieve convergence of the iterations.

A. Approximate Line Search
The most intuitive approach would be to study the variation of J(u_{i+1}) as a function of α_i, denoted J(α_i), and select α_i = arg min_{α_i ∈ (0,1)} J(α_i). However, since the analytic expression of J cannot be obtained in general, the previous strategy may be pursued only in an approximate fashion. This can be achieved in at least two different ways, i.e., either by approximating (discretizing) the admissible values for α_i or by approximating directly the expression of J(α_i). While the latter strategy is discussed in detail in the following section, the former requires the definition of a set T = {α_0, α_1, . . ., α_N} ⊂ (0, 1), of finite cardinality, and subsequently obtaining

α_i = arg min_{α ∈ T} J(α).

Since J(α_i) must be computed numerically at each of the values in T, the approach is in general particularly time-consuming for large values of N.
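The discretized line search above amounts to evaluating the (numerically computed) cost at finitely many candidate steps and keeping the minimizer; in the sketch below, `J_of_alpha` is a hypothetical stand-in for the simulation-based evaluation of J((1-α)u_i + αΓ_i).

```python
import numpy as np

# Hedged sketch of the discretized line search: evaluate the cost profile
# J(alpha) on a finite grid T of candidate steps in (0, 1) and keep the
# minimizer. J_of_alpha is a hypothetical placeholder for the numerical
# cost evaluation obtained by simulating the plant.
def J_of_alpha(a):
    return (a - 0.3) ** 2 + 1.0     # toy profile with minimizer at 0.3

N = 50
T = np.linspace(1.0 / N, 1.0 - 1.0 / N, N)   # finite subset of (0, 1)
alpha_star = T[np.argmin([J_of_alpha(a) for a in T])]

# The grid minimizer lies within one grid spacing of the true minimizer.
assert abs(alpha_star - 0.3) < 0.02
```

The cost of this strategy grows linearly with the grid size N, since each candidate step requires one full simulation of the plant.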

B. Optimal Step for Second-Order Expansion
Provided that the descent direction δu_i(t) is known, one can interpret x_{i+1}(•) as a function of the step α_i in addition to its canonical dependence on time. Thus, by considering the first-order Taylor series expansion of x_{i+1}(t, α_i) with respect to α_i about a nominal value α_0, it follows that

x_{i+1}(t, α_i) = x_{i+1}(t, α_0) + S_{i+1}(t)(α_i - α_0) + r(t, α_i)    (55)

for all t ∈ [t_0, t_f], where S_{i+1}(t) denotes the first-order sensitivity function and r(t, α_i) contains higher-order terms of the expansion, which decay to zero faster than α_i - α_0 whenever α_i tends to α_0. Considering the formal claims of Section III concerning a sufficiently small step, a reasonable choice of the nominal value α_0 consists in selecting α_0 = 0, corresponding to the case in which u_{i+1} ≡ u_i, implying as a consequence that x_{i+1}(t, α_0) = x_{i+1}(t, 0) = x_i(t), for all t ∈ [t_0, t_f]. Then, replacing (55), with α_0 = 0, in the first term of the cost functional (30), one obtains the second-order approximation J_q(α_i) of J(α_i) about the nominal step value α_0 = 0. The dependence of J_q only on α_i highlights again the property that, once the direction δu_i(t) is given, the optimal control problem reduces to the problem of finding the step α_i that, at least locally, yields
the minimum value of J(α_i). The latter is a finite-dimensional optimization task with respect to a scalar variable.

Proposition 4: The optimal step that minimizes J_q, subject to the dynamic constraints (57), is given by (58), with S_{i+1} obtained from the initial condition S_{i+1}(t_0) = 0.

Proof: Since finding the optimal step amounts to solving a scalar, unconstrained optimization problem with respect to α_i, the solution can be found by equating to zero the total derivative of J_q(α_i), given by (56), with respect to α_i, namely (60). The proof is then concluded by solving the previous equation with respect to α_i, provided that one is able to evaluate S_{i+1}(t) := ∂x_{i+1}(t, α_i)/∂α_i |_{α_i = 0}. This can be accomplished by relying on the theory of sensitivity functions (see, e.g., [43]), which yields the sensitivity equation (59) associated with the system (57).

Remark 6: Whenever the cost functional is in the form given in (46), the first-order Taylor expansion of q_f around α_0 = 0 transforms (60) into (61). Provided that one is able to solve (61), α_i is obtained. As a specially structured case, consider for instance the setting in which q_f(x_i(t_f)) = x_i(t_f), assuming that x_i(t_f) is a scalar (as in the case of x_v(t_f) introduced in Corollary 2). Then, the step size α_i that leads to the maximum reduction of J_q is given by (62).

It is worth observing that the construction of Proposition 4 revolves around the solution of a system of n linear differential equations (i.e., the sensitivity equations (59)) in order to provide, at least locally, the maximum descent of the cost functional along the direction δu_i. It is possible, however, to circumvent the need for such a set of equations, for instance by borrowing the heuristics used in the field of machine learning to improve the speed of convergence of the gradient descent algorithm employed in the training of neural networks, as discussed in the following section.
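The construction of Proposition 4 and Remark 6 can be sketched numerically on an illustrative scalar example, not the paper's setting: dynamics ẋ = −x + u, terminal cost q_f(x(t_f)) = x(t_f), and running cost (R/2)u². The direction Γ_i and all numerical values below are placeholders:

```python
import numpy as np

# Hedged sketch of the sensitivity-based step: scalar dynamics
# xdot = -x + u, terminal cost q_f(x(tf)) = x(tf), running cost (R/2) u^2.
t0, tf, n = 0.0, 1.0, 1000
t = np.linspace(t0, tf, n)
dt = t[1] - t[0]
R = 1.0
x0 = 1.0

u_i = np.zeros(n)          # current control iterate
Gamma_i = -np.ones(n)      # candidate update (assumed given)
du = Gamma_i - u_i         # descent direction delta u_i

# Forward Euler integration of the state and of the sensitivity
# S(t) = d x_{i+1}(t, alpha)/d alpha at alpha = 0, which here solves
# the linear IVP Sdot = -S + du, S(t0) = 0 (cf. the sensitivity equation).
x, S = x0, 0.0
for k in range(n - 1):
    x += dt * (-x + u_i[k])
    S += dt * (-S + du[k])

# Stationarity of the quadratic model J_q(alpha):
# S(tf) + R * int (u_i + alpha * du) du dt = 0  =>  closed-form alpha.
alpha = -(S + R * np.sum(u_i * du) * dt) / (R * np.sum(du * du) * dt)
alpha = float(np.clip(alpha, 0.0, 1.0))   # safeguard: keep the step in [0, 1]
```

Only one linear initial value problem (the sensitivity equation) is integrated, in line with the remark that n linear differential equations suffice.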

C. Heuristic-Based Evaluation of the Step Size
Even if the descent direction of the proposed algorithm (being infinite-dimensional) differs, in general, from the one provided by the gradient descent, the idea is that the chosen heuristics remain viable, since the two schemes are essentially based on descent directions. Among the vast number of heuristics available in the literature (see, e.g., [36]), the RMSProp scheme yields an update rule given by u_{i+1} = u_i + Δu_i, where Δu_i = α θ_i(t) ⊙ δu_i(t) and the symbol ⊙ stands for the elementwise product, or Hadamard product, between the two vectors θ_i(t) and δu_i(t). The former is given by θ_i(t) = (δ[1_{m×1}] + r_i(t))^{−1/2}, where the operation (·)^{−1/2} is performed elementwise on the m-vector δ[1_{m×1}] + r_i(t), α ∈ (0, 1) is a fixed step, and r_i(t) is given by the recursion r_{i+1}(t) = ρ r_i(t) + (1 − ρ) δu_i(t) ⊙ δu_i(t). The coefficients δ ∈ R_{>0} and ρ ∈ R_{>0} are left as design parameters, and r_0(t) = 0_{m×1}. Note that this scheme allows choosing the initial step α, which is then adjusted along the iterations according to the construction discussed above.
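The RMSProp-style rule recalled above can be sketched as follows; the elementwise accumulation and rescaling follow the standard RMSProp construction, and all parameter values are illustrative:

```python
import numpy as np

# Hedged sketch of the RMSProp-style step of Section IV-C: r accumulates
# an exponential moving average of the squared direction, and the fixed
# step alpha is rescaled elementwise; rho, delta values are illustrative.
def rmsprop_update(u_i, du_i, r_i, alpha=0.01, rho=0.9, delta=1e-6):
    """One control update u_{i+1} = u_i + Delta u_i, with
    Delta u_i = alpha * theta_i (elementwise product) du_i and
    theta_i = (delta * 1 + r)^(-1/2), all evaluated pointwise in time.
    u_i, du_i, r_i are (m, n_t) arrays: m inputs on n_t time samples."""
    r_next = rho * r_i + (1.0 - rho) * du_i * du_i   # elementwise EMA of du^2
    theta = 1.0 / np.sqrt(delta + r_next)            # elementwise (.)^(-1/2)
    return u_i + alpha * theta * du_i, r_next

m, n_t = 2, 500
u = np.zeros((m, n_t))
r = np.zeros((m, n_t))            # r_0 = 0, as in the text
du = np.ones((m, n_t))            # assumed descent direction
u, r = rmsprop_update(u, du, r)
```

Note how the effective step grows where the accumulated direction magnitude is small and shrinks where it is large, which is the intended adaptive behavior.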

D. Strategy Comparison on a Continuous Stirred-Tank Chemical Reactor
To compare the recipes introduced in this section, consider the model of a continuous stirred-tank chemical reactor, borrowed from [3, Ch. 6], in which x_1 represents the deviation from the steady-state temperature and x_2 describes the deviation from the steady-state concentration of the reactor. The cost functional, subject to the corresponding dynamical constraints, is given in (67), where t_0 = 0, t_f = 0.6, and R = 0.1. The structure of the cost functional suggests that the objective is to keep both the temperature and the concentration close to prescribed steady-state values while minimizing, at the same time, the control effort. The following numerical simulations are carried out with x(0) = [0.5, −0.5]ᵀ, and the initial guess u_0 on the optimal control, required to initialize the algorithm, is chosen similarly to [3]. According to Corollary 2, the use of the auxiliary dynamic constraint allows rewriting the cost functional (67) in the form required in (52), which in turn leads to the iterations described by (31), with Γ_i provided by (51).

Fig. 2. Top graph: first ten iterations of the cost functional obtained by using a fixed-step scheme (circle), the RMSProp scheme (square), the approximation of the optimal step (diamond), and the grid search (star). Bottom graph: time histories of the control laws, evaluated at the last (10th) iteration, together with the initial guess u_0.

Fig. 2 shows the behavior of the schemes introduced in this section for the selection of the step size α, namely the grid search, the RMSProp scheme, the scheme based on the second-order approximation of J(α_i), with α_i provided by (62), and finally the scheme characterized by the choice of a fixed step. Both the RMSProp and the fixed-step schemes are initialized with α = 0.01, whereas the grid search is performed by discretizing the interval (0, 1) into uniformly spaced samples with offset 10^{−4}, namely by defining T = {0.0001, 0.0002, ...}, with |T| = 9999. In particular, the top graph of Fig. 2 depicts the behavior of the cost functional after the first ten updates, with the lowest values achieved by the grid search (starred line). It is worth noting, however, that each iteration of the grid-search scheme is computationally much more expensive (time-consuming) than the alternative schemes, requiring approximately 750 s for the simulation to terminate, whereas the other three proposed schemes required less than 5 s. Therefore, the scheme based on the second-order approximation provides the best tradeoff between speed of convergence of J_i to J(u*) and time required for the execution of the algorithm. The comparison between the resulting control laws after ten iterations, reported in the bottom graph of Fig. 2, confirms the previous discussion.
In fact, the control law resulting from the second-order approximation of the cost functional is closer than the others to the one obtained by the grid-search strategy.
V. PERSPECTIVES ON QUASILINEARIZATION

Intuitively, the proposed scheme hinges upon two main ingredients, namely i) linearization and ii) the construction (and solution) of Hamiltonian dynamics. These are also the two main steps on which the strategy of quasilinearization (see [3, Ch. 6]) is based, although performed in the opposite order: in quasilinearization, the linearization is carried out after the construction of the nonlinear Hamiltonian dynamics. The following statement shows that these two steps do not commute in general. To avoid notational burden in the following result, consider the problem (30), (2), in which Q_f = I for simplicity.
Fact 1: Consider the optimal control problem given by (30), (2) with Q_f = I. The structures of the dynamics obtained by linearizing the nonlinear Hamiltonian dynamics and by constructing the Hamiltonian dynamics for the linearized system differ.
To show this, consider the Hamiltonian associated with (2), (30), and consider the dynamics obtained by applying the standard necessary conditions of optimality and by substituting into the resulting Hamiltonian dynamics the candidate optimal control u = −R^{−1} g(x)ᵀ λ, together with x(t_0) = x_0 and λ(t_f) = x(t_f), respectively. Considering then δx = x_{i+1} − x_i and δλ = λ_{i+1} − λ_i, and linearizing around the pair (x_i(t), λ_i(t)), one obtains the linearized dynamics (70),
where δλ_f = δλ(t_f), δx_f = δx(t_f), x_{f,i} = x_i(t_f), and λ_{f,i} = λ_i(t_f). Consider now the linearization of (2), introduced in Section II and given by (26), along the process (u_i, x_i), together with the cost given by (36), in which δz_i is replaced by δx_i, and consider the Hamiltonian function associated with this problem. By performing the same steps carried out at the beginning of the discussion and by considering the candidate optimal control δu = −(u_i(t) + R^{−1} B(t)ᵀ δλ), one obtains the dynamics (73), together with δx(t_0) = 0 and δλ_f = δx_f + x_{f,i}, and the claim follows by noting that the two dynamics given by (70) and (73a) differ.
To visualize the implications of the previous statement, consider the cost functional, subject to the dynamical constraint, with initial condition x_0 = 2.5. By applying, for example, Pontryagin's minimum principle, one obtains the candidate optimal control law u(t) = −λ(t)x(t), which is then used to initialize the proposed scheme by substituting x(t) and λ(t) with x_0(t) and λ_0(t), respectively, the latter obtained by forward integrating the costate equation starting from an arbitrary initial condition λ(t_0). Note that the pair (x_0(t), λ_0(t)) is used also for the initialization of the quasilinearization scheme. Fig. 3 shows how the control is updated by following, respectively, the quasilinearization scheme and the scheme proposed herein, confirming that, although the initial and terminal control laws coincide in both cases, the paths taken by the two schemes for the update of the optimal control law differ.

VI. OPTIMAL CONTROL OF MECHANICAL SYSTEMS ON REDUNDANT TASKS
The effectiveness of the strategies envisioned above is validated by means of an application involving a mechanical system that must accomplish a prescribed redundant task. Toward this end, consider a mechanical system described by the canonical equations

M(q)q̈ + C(q, q̇)q̇ + ∂P(q)/∂q = u,  q(t_0) = q_0,  q̇(t_0) = q̇_0  (74)

where q(t) ∈ R^n is the vector of generalized coordinates, q̇(t) ∈ R^n and q̈(t) ∈ R^n are the vectors of generalized velocities and accelerations, respectively, and M(q) ∈ R^{n×n} is the inertia matrix, which is symmetric and positive definite. The term C(q, q̇)q̇ accounts for the centrifugal and Coriolis forces, whereas the term ∂P(q)/∂q accounts for the gravity forces, with P(q) describing the potential energy due to gravity. Finally, u is the vector of the external generalized joint forces (i.e., the torques). This section is devoted to illustrating the application of the results presented in the previous sections to the case in which the mechanical system (74) is required to perform a desired task with respect to which it is redundant. In particular, task redundancy is characterized, for a given mechanical system, by a number of controlled degrees of freedom greater than the number of variables needed to describe the task (see, e.g., [44]). A popular solution to this control problem, as briefly recalled below, consists in rewriting the dynamics of the mechanical system with respect to the task, decoupling the task itself from the remaining (zero) dynamics, and finally stabilizing the task alone via linear design methods.
However, despite its simplicity, two relevant shortcomings of this strategy can be immediately identified. First, the hidden dynamics may be unstable for certain tasks. Second, and somewhat related to the former, the above approach does not permit fully exploiting the intrinsic degrees of redundancy. On the other hand, a direct solution of this problem within the framework of optimal control is in general a discouraging task. In the following, we envision a combination of the two approaches above, which permits retaining the advantages of both in terms of simplicity and robustness. Since a task can be efficiently modeled as a function h : R^n → R^r, q ↦ h(q), mapping the generalized-coordinate space into the operational space, the objective is to steer h(q) to a desired function h_d(t). Thus, redundancy arises whenever r < n. Introduce the auxiliary variable z = h(q) − h_d, for all t ≥ 0. Define x_1 = z, x_2 = ż, and substitute the preliminary control law u = ∂P(q)/∂q + C(q, q̇)q̇ + M(q)τ in (74), obtaining the task-space dynamics, where ∇h(q) is assumed to be full row rank. Then, replace the first into the last equation and select τ accordingly, in which the operator (·)† indicates the Moore–Penrose pseudoinverse, and K_p and K_d are such that the dynamic matrix of the resulting linear system possesses all eigenvalues with negative real parts. Thus, the control law (77), given by u = C(q, q̇)q̇ − M(q)(∇h(q))†(q̇ᵀ∇²h(q)q̇ − ḧ_d(t)), steers z and its successive time derivatives to zero asymptotically. Consider now, instead, the problem of applying the strategy of Section III to (74). The cost functional, subject to the dynamic constraints (74), is given by (78), where the matrices W_1, W_2, W_3, and R are of appropriate dimensions. This particular choice of the weights indicates that h(q(t)) should remain as close as possible to h_d(t) at all times, together with the smallest possible angular velocity, thus settling for a compromise between task accomplishment and the requirement of stability. Since the structure of
the problem, given by the cost functional (78) and the constraints (74), belongs to the class of problems addressed by Corollary 2, the update of the control law is given by u_{i+1} = (1 − α_i)u_i + α_i Γ_i, with Γ_i provided by (51). Therefore, the control law (77) can be used for the initialization of the algorithmic scheme proposed in the previous sections. Simulations are carried out for a 2-DOF manipulator evolving on a horizontal plane, for which the task is defined as h(q) := x_e(q) + y_e(q), where (x_e, y_e) are the coordinates of the origin of the end-effector reference frame. The desired task is then induced by selecting h_d = 0.4, with W_1 = 10^3, W_2 = I_{2×2}, W_3 = 10^2 · I_{2×2}, R = 10^{−3} · I_{2×2}, t_0 = 0 s, and t_f = 20 s. The link lengths of the 2-DOF planar manipulator are l_1 = l_2 = 0.3 m, both links having mass m_1 = m_2 = 1 kg. In the considered scenario, the gravitational effects may be neglected, since the robot lies in the horizontal plane. The initial conditions of the manipulator have been chosen as q_0 = (π/4, π/4) and q̇_0 = (0, 0). As can be appreciated from Figs. 4 and 5, the proposed algorithm is such that the robot reaches the desired objective with nearly zero angular velocity at time t_f = 20 s.

Fig. 6. Time histories of q_1 (red) and q_2 (blue) resulting from the application of the guess u_0 given by (77) (dashed) and from the application of the last update of the control law u* (solid).

As a beneficial consequence, the optimal solution, which exploits the redundancy to its full extent, reduces the undershoot that arises from the application of the control law u_0 given by (77); see Fig. 4. The angular positions at the end of the procedure can be inferred from Fig. 6, which also depicts the same quantities at the beginning of the algorithm, obtained from the application of the control law (77). As graphically illustrated by the dashed lines of Fig.
6, the control law (77) is such that, while accomplishing the prescribed task, the position of the end-effector continues to slide along the task (i.e., the line described by the identity x e (q) + y e (q) = h d ), inverting the direction of motion each time it reaches the boundary of the feasible workspace.
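The task-space control law recalled above can be sketched for the 2-DOF planar arm with task h(q) = x_e(q) + y_e(q) and a constant reference (so that ḧ_d = 0); the gains K_p and K_d, and the finite-difference evaluation of the drift term, are illustrative choices:

```python
import numpy as np

l1 = l2 = 0.3  # link lengths [m], as in the simulation setup

def h(q):
    """Task map h(q) = x_e(q) + y_e(q) for the planar 2-DOF arm."""
    q1, q12 = q[0], q[0] + q[1]
    return (l1 * np.cos(q1) + l2 * np.cos(q12)) + (l1 * np.sin(q1) + l2 * np.sin(q12))

def grad_h(q):
    """Row Jacobian of h (shape (1, 2)), obtained by differentiating h."""
    q1, q12 = q[0], q[0] + q[1]
    return np.array([[l1 * (np.cos(q1) - np.sin(q1)) + l2 * (np.cos(q12) - np.sin(q12)),
                      l2 * (np.cos(q12) - np.sin(q12))]])

def task_accel(q, dq, h_d, Kp=25.0, Kd=10.0):
    """Joint acceleration tau = (grad h)^+ (hdd_d - drift - Kp z - Kd zdot)
    for a constant reference h_d (hdd_d = 0); the drift term (the time
    derivative of grad h applied to dq) is evaluated by finite differences."""
    z = h(q) - h_d
    zdot = float(grad_h(q) @ dq)
    eps = 1e-6
    dJdt = (grad_h(q + eps * dq) - grad_h(q)) / eps   # ~ d(grad h)/dt
    drift = float(dJdt @ dq)
    return np.linalg.pinv(grad_h(q)) @ np.array([-drift - Kp * z - Kd * zdot])

q = np.array([np.pi / 4, np.pi / 4])   # initial configuration of the example
dq = np.zeros(2)
tau = task_accel(q, dq, h_d=0.4)
```

Since ∇h has full row rank here, the pseudoinverse makes the task error obey the stable linear dynamics z̈ = −K_p z − K_d ż, while leaving the redundant direction unactuated, which is exactly the shortcoming the optimal formulation then addresses.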
The updates of the control law are reported in Fig. 7, whereas Fig. 8 shows the behavior of the auxiliary state variable x_v, which, according to Corollary 2, satisfies the auxiliary dynamic equation with initial condition x_{v0} = 0. It is worth observing that, due to the choice of the weights on the quantities that constitute the cost functional (78), this variable represents the dominant part of the cost functional, since it is six orders of magnitude greater than the term involving the control variable. Therefore, Fig. 8 depicts, at each iteration of the algorithm, the behavior of the integral of the running cost with respect to time, which is related to the cost functional. The behavior of the cost functional across the iterations can be visualized in Fig. 9, showing that the proposed scheme significantly improves, with respect to the optimality criteria adopted herein, the performance of the control law (77), since after fewer than 2500 iterations the cost is reduced by more than 50% of its initial value. Finally, Fig. 10 reports the Cartesian trajectory of the end-effector of the planar manipulator. The gray dashed line corresponds to the equation x_e(q) + y_e(q) = h_d, whereas the solid gray lines represent the initial position of the manipulator. The trajectory identified by
the red-circle points is given by applying the control law (77), with the orange lines indicating the final configuration of the manipulator in this case, while the blue-star points represent the trajectory obtained by applying the last update u* of the control law, with the final configuration of the manipulator given by the blue lines. This figure shows, consistently with the previous figures, that the path traveled by the end-effector of the manipulator in the latter case is shorter than in the former.
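The role of the auxiliary variable x_v can be illustrated with a toy computation in which the running cost is accumulated as an additional state; the dynamics and running cost below are stand-ins, not those of the manipulator example:

```python
import numpy as np

# Hedged sketch of the auxiliary variable x_v of Corollary 2: the running
# cost is turned into a state through xdot_v = l(x, u), x_v(t0) = 0, so
# that its terminal value x_v(t_f) reproduces the integral term of the cost.
T, n = 2.0, 20000
dt = T / n
x, x_v = 1.0, 0.0
for _ in range(n):
    x_v += dt * x ** 2     # running cost l(x) = x^2 accumulated as a state
    x += dt * (-x)         # illustrative dynamics xdot = -x

# Closed-form check: with x(0) = 1, int_0^T e^{-2t} dt = (1 - e^{-2T}) / 2.
exact = (1.0 - np.exp(-2.0 * T)) / 2.0
```

This is the device that lets a Lagrange-type cost be rewritten in the Mayer form required in (52), with x_v(t_f) playing the role of the (scalar) terminal cost.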

VII. CONCLUSION AND FURTHER WORK
An algorithm to solve a class of finite-horizon optimal control problems for nonlinear systems has been presented. The key feature of the algorithm is an iterative strategy that relies, by means of a suitable time-varying change of coordinates, on the solution of a linear initial value problem: the latter solves the time-varying LQR problem derived from the linearization of the nonlinear system, together with the construction of the corresponding time-varying Hamiltonian dynamics, thus removing the necessity of solving a time-varying Riccati equation. It has been shown, for the considered class of problems, that this method yields different convergence properties with respect to methods that directly linearize the nonlinear Hamiltonian dynamics, such as the quasilinearization method. It has also been shown how heuristics borrowed from the machine learning field, such as RMSProp, may improve the speed of convergence of the proposed scheme. Further work will involve the extension of the proposed scheme to nonlinear differential games, the validation on more complex problems, and the solution of problems in which constraints on both the control and the state of the system are present.
a_{ij} indicates the (i, j)th element of the matrix A. The vectorization of a matrix C ∈ R^{n×m}, denoted by the operator vec(·), consists in stacking the columns of C, i.e., vec(C) = [c_{11}, ..., c_{n1}, ..., c_{1m}, ..., c_{nm}]ᵀ.
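In, e.g., numpy, this column-stacking operation corresponds to a reshape in column-major (Fortran) order:

```python
import numpy as np

def vec(C):
    """Stack the columns of C into a single vector, i.e., vec(C)."""
    return C.reshape(-1, order="F")   # column-major (Fortran) ordering

C = np.array([[1, 3],
              [2, 4]])
# vec(C) = [c11, c21, c12, c22] = [1, 2, 3, 4]
```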

Fig. 3. Successive control updates of the quasilinearization (top graph) and constant-costate (bottom graph) iterations. The dashed blue line indicates the initial guess u_0(t) = −λ_0(t)x_0(t), whereas the solid red line indicates the candidate optimal control law.

Fig. 7. Time histories of the components of the control law obtained from the proposed algorithm.

Fig. 8. Time histories of the auxiliary state variable x_v, which, in this case, represents the dominant part of the cost functional.

Fig. 9. Cost functional evaluated at each iteration of the proposed algorithm.

Fig. 10. Trajectories of the 2-DOF planar manipulator, obtained by applying the guess on the optimal control law given by (77) (circles) and by applying the last updated control law u* (stars). The links in the initial configuration are reported in gray.