Linear-Like Policy Iteration Based Optimal Control for Continuous-Time Nonlinear Systems

We propose a novel strategy to construct optimal controllers for continuous-time nonlinear systems by means of linear-like techniques, provided that the optimal value function is differentiable and quadratic-like. This assumption covers a wide range of cases and holds locally around an equilibrium under mild assumptions. The proposed strategy does not require solving the Hamilton-Jacobi-Bellman equation, a nonlinear partial differential equation that is known to be hard or impossible to solve. Instead, the Hamilton-Jacobi-Bellman equation is replaced with an easily solvable state-dependent Lyapunov matrix equation. We exploit a linear-like factorization of the underlying nonlinear system and a policy-iteration algorithm to obtain a linear-like policy iteration for nonlinear systems. The proposed control strategy solves optimal nonlinear control problems in an asymptotically exact, yet still linear-like, manner. We prove optimality of the resulting solution and illustrate the results via four examples.


I. INTRODUCTION
The solution of optimal control problems for nonlinear systems hinges upon the solution of the Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE), which can be extremely difficult or impossible to solve. Many approximation methods for solving the HJB PDE have been developed, under a variety of assumptions, at the cost of some optimality loss [1]. An alternative way of solving optimal control problems for nonlinear systems is based on the Pontryagin maximum principle, which provides necessary conditions for optimality. Direct discretization is another approach for solving optimal control problems; it is often used for problems over a finite horizon and to handle constraints. The resulting problem can be solved efficiently thanks to fast and reliable nonlinear programming solvers, which makes this the most widely used and popular approach. However, the HJB equation gives both necessary and sufficient conditions for an optimal feedback control solution and provides the optimal value function over the entire state space [1]. This makes a solution based on the HJB approach unique. For this reason, in this article we study unconstrained optimal control problems and their solutions via the HJB equation.

Adnan Tahirović is with the Faculty of Electrical Engineering, University of Sarajevo, 71000 Sarajevo, Bosnia and Herzegovina (e-mail: atahirovic@etf.unsa.ba).
Digital Object Identifier 10.1109/TAC.2022.3226671
A first class of techniques used to solve HJB equations is based on the theory of viscosity solutions [2]. This solution is proved to be the value function of the underlying optimal control problem. It is required to be continuous, but not necessarily differentiable, as is assumed for classical solutions. To obtain an approximate viscosity solution, finite-difference and finite-element methods have been used: both require a discretization of the state space, hence the computational cost increases exponentially with the dimension of the state space.
A second class of techniques is based on the principle of model-based reinforcement learning with the policy-iteration (PI) algorithm, which reduces the nonlinear HJB PDE to a linear PDE [3], [4]. This linear PDE is used to find the cost associated with an admissible control. The PI algorithm also provides an incremental improvement of the control policy and ensures convergence to the optimal control. In many cases, solving a linear PDE is still not easy. In [5], Galerkin approximations have been combined with the PI algorithm to approximately solve optimal control problems. Other approaches that approximate the solution of the HJB PDE, up to a desired degree of accuracy, have been presented in [12], [13], and [14].
A third class of techniques is based on results obtained for linear systems with a cost in quadratic form. For such systems the HJB PDE reduces to an algebraic Riccati equation (ARE), which is easy to solve. Methods based on Jacobian linearization of the nonlinear system, feedback linearization [6], [15], dynamic extensions [16], and state-dependent Riccati equations (SDRE) [17], [18], [19], [20], [21] approximate the optimal control while avoiding nonlinear PDEs. The linearization-based approach is feasible only in the vicinity of an equilibrium, while feedback linearization may cancel "useful" nonlinearities and may not provide a near-to-optimal control law. The dynamic-extension-based approach relies on a modified cost to avoid solving the HJB PDE, thus providing a suboptimal control law. It is worth noting that the dynamic-extension-based control can extract an upper bound on the modified cost, providing a measure of the suboptimality of the solution. The SDRE-based control approach relies upon a linear-like factorization of the nonlinear system. Its main disadvantage is the lack of a stability guarantee.
This article provides a thorough theoretical extension of our previous work [22]. We propose a control strategy for input-affine continuous-time nonlinear systems based on the PI paradigm combined with the linear-like factorization used in the SDRE approach. We use the PI algorithm to ensure convergence of the policy to the optimal control. Unlike other PI approaches, we use a linear-like factorization of the nonlinear system to avoid solving any PDE, thus replacing the PDE with a state-dependent Lyapunov matrix equation (SDLE). In this way the proposed control strategy solves the optimal nonlinear control problem in an asymptotically exact, but still linear-like, manner, provided the optimal cost has a quadratic-like form. If this is not the case, the obtained results suggest that the proposed approach has the potential to find an optimal solution in the vicinity of an equilibrium.
The rest of this article is organized as follows. In Section II we define the problem and recall a general form of the PI algorithm. In Section III we recall the SDRE approach with its associated factorization technique and redefine the optimal control problem. In Section IV we define the linear-like PI, which computes the optimal control with a modified cost. Section V introduces the modified linear-like PI to solve the considered nonlinear optimal control problem. Section VI provides an illustration of the results via four examples. Finally, Section VII concludes the article.

A. Problem Description
Consider a class of continuous-time nonlinear systems described by an equation of the form

ẋ = f(x) + g(x)u (1)

with state x(t) ∈ R^n, input u(t) ∈ R^m, and f and g Lipschitz continuous on a compact set Ω ⊂ R^n that contains the origin. Suppose in addition that the system (1) has an equilibrium at the origin for u = 0, i.e., f(0) = 0. Finally, assume that the system is controllable in Ω, i.e., it is possible to find an input signal u which steers the state of the system to the origin from any initial condition x_0 in Ω in some time t ≥ 0. Consider now the cost function

V(x_0) = ∫_0^∞ [ l(x(t)) + u^T(t) R u(t) ] dt (2)

where the state penalty function l is a positive function on Ω, such that l(0) = 0. Assume that the system (1) with output y = l(x) is zero-state observable, and R ∈ R^{m×m} is a symmetric positive definite matrix. Typically, l(x) is quadratic, i.e., l(x) = x^T Q x, where Q = Q^T is a positive semidefinite matrix.
A feedback control u = u(x) is called an admissible control, u ∈ A(Ω), with respect to l on Ω, if u is continuous on Ω, u(0) = 0, the zero equilibrium of the closed-loop system is locally asymptotically stable with a basin of attraction containing Ω, and the cost (2) is finite for all x_0 ∈ Ω. The minimal value of the cost function V, obtained for an admissible control u* = u*(x) (the optimal control), is denoted as the optimal cost V*(x), ∀x ∈ Ω. This optimal cost V*, called the value function, is the solution of the HJB equation

min_u { l(x) + u^T R u + [∇V*(x)]^T [f(x) + g(x)u] } = 0 (3)

provided it is differentiable. Eq. (3) is generally hard to solve even in those cases in which a (unique) solution is known to exist. The requirement to solve a PDE makes the optimal control problem virtually impossible to solve in closed form. If a solution exists, the optimal control is

u*(x) = -(1/2) R^{-1} g^T(x) ∇V*(x). (4)

B. Policy Iteration for Nonlinear Systems
To compute the value of the cost for a given initial condition x_0 and an admissible control û, one has to solve (1) with u = û, which is not always possible, and compute the integral (2) along the corresponding solution. Another way to deal with this problem is to differentiate (2) along the trajectories of the system, yielding the linear PDE

[∇V̂(x)]^T [f(x) + g(x)û(x)] + l(x) + û^T(x) R û(x) = 0 (5)

which represents an incremental expression of the cost of the admissible control û and does not depend on the trajectories of the system (1). If the optimal control (4) is used, i.e., û = u*, then (5) transforms into the nonlinear PDE (3), the solution of which directly provides the optimal cost V* and the optimal control law u*. For more detail, see, e.g., [4] and [5].
The optimal PI for continuous-time nonlinear systems has been proposed in [4]. The main idea of this iterative algorithm is to choose an arbitrary initial admissible control û = û(x) ∈ A(Ω) and solve the linear PDE (5) for V̂, which should be easier than solving the nonlinear PDE (3). In order to improve the performance of the arbitrarily selected control û, one then defines the policy-update

û*(x) = -(1/2) R^{-1} g^T(x) ∇V̂(x). (6)

Having a new and improved control û* (see, e.g., [4] and [5]), one can again solve (5) to obtain the value function V̂. By iteratively updating the value function and the control law, iterating (5) and (6), the optimal PI algorithm ensures, in principle, the desired convergence, i.e., lim_{k→∞} V̂_k(x) = V*(x) and lim_{k→∞} û_k(x) = u*(x), ∀x ∈ Ω, where k is the index of the iteration.
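As a concrete numerical illustration of the loop (5)-(6) (a sketch of ours, not an example from the article), consider a scalar system, for which the linear PDE (5) becomes an algebraic equation in V'(x) and can be evaluated pointwise on a grid. The drift f(x) = x - x^3 and the initial control below are illustrative assumptions:

```python
# Policy iteration (5)-(6) for the scalar system  xdot = x - x**3 + u,
# with l(x) = x**2 and R = 1.  For a scalar state the linear PDE (5) is
# algebraic in V'(x):  V'(x) = -(l + R*u**2) / (f + g*u),
# and the policy-update (6) reads  u <- -(1/2) * R**-1 * g * V'.
import numpy as np

x = np.linspace(0.05, 2.0, 100)   # grid over part of Omega (x > 0)
f = x - x**3                      # drift f(x)
u = -2.0 * x                      # initial admissible control (f + u < 0)
for _ in range(30):
    Vp = -(x**2 + u**2) / (f + u) # cost-update: solve (5) for V'
    u = -0.5 * Vp                 # policy-update (6)

# For this example the HJB (3) is solvable by hand: u* = -(f + sqrt(f^2 + x^2)).
u_opt = -(f + np.sqrt(f**2 + x**2))
print(np.max(np.abs(u - u_opt)))  # ~0: PI converges to the optimal control
```

Each pass improves the policy exactly as in (5)-(6); convergence to the HJB solution is quadratic here, mirroring the Kleinman-Newton behavior recalled below for linear systems.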
Choosing an arbitrary initial admissible control in analytical form as a first step of the policy-iteration algorithm can be difficult for some nonlinear systems. However, different techniques for constructing such a control law for classes of nonlinear systems can be found, e.g., in [6], [7], [8], [9], [10], [11].
Although (5) should be easier to solve for V̂ than solving (1) and (2), it is still a PDE. For this reason different approaches to approximately deal with (5) have been proposed, see, e.g., [4], [5]. The goal of this article is to show how PI can be exploited to find the optimal control solution, without the need to solve any PDE, on the basis of a simple linear-like procedure.

C. Policy Iteration for Linear Systems
In this section we consider linear systems, i.e., the system (1) with f(x) = Ax, where A ∈ R^{n×n}, g(x) = B, where B ∈ R^{n×m}, and a quadratic cost, i.e., l(x) = x^T Q x. Assume that the pair (A, B) is stabilizable and the pair (A, Q^{1/2}) is detectable. Assuming that the optimal value function is of the form

V*(x) = x^T P* x (7)

where P* = P*^T is a positive definite matrix, the HJB (3) becomes the ARE

A^T P* + P* A + Q - P* B R^{-1} B^T P* = 0 (8)

which is easily solvable and has a unique positive definite solution P*. The optimal control action can then be computed from (4), yielding

u*(x) = -R^{-1} B^T P* x = Π* x (9)

where Π* = -R^{-1} B^T P* is the optimal control policy.
Although the solution to the optimal control problem for continuous-time linear systems can be given in the closed form (9), we recall the optimal PI algorithm to understand how to construct the optimal control in an iterative manner.
In the simplified version of the optimal PI algorithm for linear systems, the cost-update (5) becomes the Lyapunov matrix equation

(A + BΠ̂)^T P̂ + P̂ (A + BΠ̂) + Q + Π̂^T R Π̂ = 0 (10)

which can be easily solved for a positive definite matrix P̂, provided an admissible control û = Π̂x is given. Additionally, the policy-update (6) for linear systems becomes

Π̂* = -R^{-1} B^T P̂. (11)

The proof of convergence of the PI for the linear case is provided in [23], where it has also been shown that the PI is actually the Kleinman-Newton method, which ensures convergence to the solution of the ARE whenever the initial control is admissible.
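The linear PI above can be sketched in a few lines with standard solvers (the system matrices below are illustrative assumptions, not from the article):

```python
# Kleinman-Newton policy iteration for the linear-quadratic case of
# Section II-C: each step solves the Lyapunov equation (10) for P given
# the current policy Pi, then applies the policy-update (11).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

Pi = np.array([[0.0, -1.0]])          # initial admissible policy u = Pi @ x
for _ in range(20):
    Acl = A + B @ Pi
    # cost-update (10):  Acl^T P + P Acl + Q + Pi^T R Pi = 0
    P = solve_continuous_lyapunov(Acl.T, -(Q + Pi.T @ R @ Pi))
    # policy-update (11):  Pi <- -R^{-1} B^T P
    Pi = -np.linalg.solve(R, B.T @ P)

P_are = solve_continuous_are(A, B, Q, R)
print(np.max(np.abs(P - P_are)))      # ~0: PI recovers the ARE solution (8)
```

As recalled from [23], the iteration converges to the ARE solution whenever the initial policy is admissible; here the convergence is quadratic.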

III. POINTWISE FACTORIZATION OF THE OPTIMAL CONTROL PROBLEM
Under mild regularity assumptions the nonlinear system (1) can be rewritten in the form

ẋ = A(x)x + g(x)u (12)

where A : R^n → R^{n×n} is a smooth matrix-valued function. The main idea behind the factorization of the function f as f(x) = A(x)x is to represent the nonlinear system (1) as a pointwise linear system by treating A and g as constant matrices for each state x along the trajectories of the system, see, e.g., [19].
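Such a factorization is generally non-unique. As a small illustrative sketch (the pendulum-like drift below is our assumption, not one of the article's examples), a smooth A(x) can often be obtained by dividing out the state:

```python
# Linear-like (state-dependent coefficient) factorization f(x) = A(x) x
# for the drift f(x) = [x2, -sin(x1) - x2]^T, using sin(x1)/x1
# (extended by 1 at x1 = 0) so that A(x) stays smooth at the origin.
import numpy as np

def f(x):
    return np.array([x[1], -np.sin(x[0]) - x[1]])

def A_of_x(x):
    sinc = np.sinc(x[0] / np.pi)      # equals sin(x1)/x1, with value 1 at x1 = 0
    return np.array([[0.0, 1.0],
                     [-sinc, -1.0]])

x = np.array([0.7, -0.3])
print(A_of_x(x) @ x - f(x))           # ~[0, 0]: A(x) x reproduces f(x)
```

Other factorizations of the same f exist; the choice affects the pointwise computations below but not the exactness of the representation f(x) = A(x)x.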
In the spirit of the above factorization, and similarly to the linear case, we assume a pointwise quadratic form for the optimal value function, namely

V*(x) = x^T P*(x) x (13)

where P*(x) = [P*(x)]^T is a state-dependent matrix-valued function which is positive definite for all x ∈ Ω.
For clarity, we first define the solution to the SDRE [19], which represents the factorized version of the ARE (8).
Definition 1 [SDRE]: A positive definite matrix P is the pointwise solution to the SDRE for the state x if

A^T(x) P + P A(x) + Q - P g(x) R^{-1} g^T(x) P = 0. (14)

As in the case of the ARE, the SDRE is easily solvable for each fixed x ∈ Ω. By mimicking the linear-like procedure presented in Section II-C, the control action can be computed in the pointwise form

u(x) = -R^{-1} g^T(x) P x. (15)

Equations (14) and (15) form the SDRE-based control method: (14) is solved for each x along the trajectories of the system and the control law is computed as in (15). Note that the SDRE-based control does not provide the optimal solution to the optimal control problem for the nonlinear system, since (14) has not been derived from the HJB (3). Another issue pertains to the matrix P, for which we do not have a closed-form solution, i.e., P = P(x), but only the pointwise value for each state x along the trajectories of the system. This prevents V(x) = x^T P(x)x from being a Lyapunov function candidate, since its time derivative along the trajectories of the system, namely

V̇(x) = ẋ^T P(x) x + x^T P(x) ẋ + x^T Ṗ(x) x (16)

has the additional term Ṗ(x), which is impossible to obtain analytically and to use for further analysis. To address this issue consider the following statement.

Lemma 1: [Direct optimal control] Assume that the optimal value function for the optimal control problem for the nonlinear system (12) is given in the quadratic-like form (13), where P*(x) = [P*(x)]^T is a positive definite matrix for all x ∈ Ω. Then, P*(x) is the solution of the HJB equation

x^T [(A + gΠ̄)^T P* + P*(A + gΠ̄) + Q + Π̄^T R Π̄] x + u_corr^T R u_corr + x^T Ṗ* x = 0 (17)
while the optimal control is given by u* = ū + u_corr, where

ū(x) = Π̄(x) x, with Π̄(x) = -R^{-1} g^T(x) P*(x), (18)

u_corr(x) = -(1/2) R^{-1} g^T(x) Σ_{i,j} x_i x_j ∇p_{i,j}(x) (19)

and p_{i,j} indicates the (i, j)th element of the matrix P*(x).
Proof: Starting from (5), one obtains (20) (dropping arguments), where the optimal control u* is obtained as the control that minimizes the left-hand side of (20), giving the two components (18) and (19). Note that the last term of (20) is the time derivative along the trajectories of the system once u* is used, i.e., x^T Ṗ* x. By replacing the control policy Π̄ in accordance with (18), one gets (17), which completes the proof.
Although Lemma 1 provides the exact solution to the optimal control problem, the HJB (17), which is itself a PDE, is as hard to solve for P* as the initial HJB (3). However, (17) allows for a separation of the optimal control problem into two simpler problems: one aimed at finding the solution ū, which is the counterpart of (14), and the other aimed at finding a correction term from the last two terms in (17). These are discussed in Sections IV and V, respectively.
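For reference, the pointwise SDRE computation (14)-(15), which the counterpart problem for ū builds upon, can be sketched as follows (the factorization reused here is the illustrative pendulum-like one assumed earlier, not one of the article's examples):

```python
# Pointwise SDRE control (14)-(15): at each state x along the trajectory,
# freeze the pair (A(x), g(x)), solve the resulting ARE, and apply the
# corresponding linear-like gain.
import numpy as np
from scipy.linalg import solve_continuous_are

def sdre_control(x, A_of_x, g_of_x, Q, R):
    A, g = A_of_x(x), g_of_x(x)
    P = solve_continuous_are(A, g, Q, R)      # pointwise solution of (14)
    return -np.linalg.solve(R, g.T @ P) @ x   # control (15)

A_of_x = lambda x: np.array([[0.0, 1.0], [-np.sinc(x[0] / np.pi), -1.0]])
g_of_x = lambda x: np.array([[0.0], [1.0]])
u = sdre_control(np.array([0.7, -0.3]), A_of_x, g_of_x, np.eye(2), np.eye(1))
print(u)
```

As the text notes, this scheme is only pointwise: P is never available in closed form, so Ṗ cannot be evaluated, which is precisely the limitation the SDLE of Section IV removes.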

A. State-Dependent Lyapunov Matrix Equation (SDLE)
The main idea behind the linear-like PI is to use the PI algorithm for nonlinear systems while avoiding PDEs, i.e., using only Lyapunov matrix equations as in the linear case discussed in Section II-C. To do so, we conduct the PI by omitting the last two terms in (17) to obtain a Lyapunov equation instead of the PDE, at the cost of some optimality loss. For clarity, we define the SDLE

[A(x) + g(x)Π̂(x)]^T P̂(x) + P̂(x)[A(x) + g(x)Π̂(x)] + Q + Π̂^T(x) R Π̂(x) = 0 (23)

which is used as the approximate cost-update equation in the PI algorithm.
Definition 2 [Approximate cost-update]: We call (23) the approximate cost-update equation for the nonlinear system and write P̂(x) = CU_SDLE(Π̂(x)), where the index SDLE indicates that one has to solve the state-dependent Lyapunov matrix equation (23) to obtain P̂(x).
Note first that this equation is easily solvable, as in the linear case (10). Moreover, unlike the idea behind the SDRE (14), where P is computed pointwise for each single x along the trajectories of the system, the SDLE provides an analytical form of P̂. Having P̂ in closed form, it is then possible to compute the time derivative of P̂ along the trajectories of the system, thus circumventing one of the main limitations of the SDRE-based approach.
Note also that the SDLE can be derived from (5) as in (20)-(22), by letting û = ū + u_corr, where the terms equal to the last two terms in (17) are omitted for simplicity. This means that the SDLE can be considered as the cost-update equation obtained by taking u_corr(x) = 0, for all x, and by omitting the time derivative of P̂. For this reason we call (23) the approximate cost-update equation and write P̂(x) = CU_SDLE(Π̂).
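Evaluated at a frozen state, the approximate cost-update is an ordinary Lyapunov matrix equation. A minimal sketch (the factorization and the policy below are illustrative assumptions, not from the article):

```python
# Approximate cost-update of Definition 2: for a frozen x and a given
# linear-like policy Pi(x), solve the SDLE (23) for P(x).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

x = np.array([0.7, -0.3])
A = np.array([[0.0, 1.0], [-np.sinc(x[0] / np.pi), -1.0]])  # A(x)
g = np.array([[0.0], [1.0]])                                # g(x)
Pi = np.array([[0.0, -1.0]])                                # admissible policy
Q, R = np.eye(2), np.eye(1)

Acl = A + g @ Pi
# (23):  Acl^T P + P Acl + Q + Pi^T R Pi = 0
P = solve_continuous_lyapunov(Acl.T, -(Q + Pi.T @ R @ Pi))
print(np.linalg.eigvals(P))   # both positive: P(x) is positive definite
```

Because Acl is Hurwitz and Q + Pi^T R Pi is positive definite, the solution P is the unique positive definite one, in line with the text.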

B. Control Based on the SDLE
Along with Definition 2, we introduce a new definition and two results to define the control based on the SDLE.
Definition 3 [Approximate policy-update]: Consider the differentiable function V̂(x) = x^T P̂(x)x : Ω → R (V̂(0) = 0), in which, for each x, P̂(x) is the positive definite matrix obtained from (23). The control û* is said to update the control û (or the policy Π̂* updates the policy Π̂) in accordance with the approximate policy-update equation for nonlinear systems

û*(x) = Π̂*(x) x, with Π̂*(x) = -R^{-1} g^T(x) P̂(x), (24)

and we write Π̂* = PU_SDLE(P̂(x)). Note that (24) includes only the first term (18) of the optimal control given by (18)-(19). For this reason, we also call Π̂* = PU_SDLE(P̂(x)) the approximate policy-update equation.
Lemma 2: [Stabilizability of the approximate policy-update] Consider an admissible control û_k(x) = Π̂_k x ∈ A(Ω) and the positive definite solution P̂_k obtained from (23) in accordance with Definition 2. Then, the updated control û_{k+1}(x) = Π̂_{k+1} x, with Π̂_{k+1} = PU_SDLE(P̂_k), is admissible on Ω as well.
Proof: To prove the statement we need to consider two different state-space regions, R_1 ⊂ R^n and R_2 ⊂ R^n, in which x^T Ṗ_k x ≥ 0 and x^T Ṗ_k x < 0, respectively, where the time derivative Ṗ_k is taken along the trajectories of the system (1) in closed loop with û_{k+1}. Let V_k be a candidate control Lyapunov function which is positive definite, i.e., (25),
where p is an arbitrary positive constant.This function is continuous by definition and differentiable for all x ∈ Ω, including the states x along the switching hypersurface x T Ṗk x = 0.
Namely, lim V^m_k = lim V̂_k also holds in the limiting case when x^T Ṗ_k x → 0^±.
In the region R_1, the time derivative of V_k along the trajectories of the system (1) becomes (26), which can be rewritten in the form (27). Since the pair (Π̂_k, P̂_k) satisfies (23), V̇^m_k is negative definite in the region R_1. In the region R_2, the time derivative of V_k along the trajectories of the system (1) is (28), which can be written in the form (29). This means that V̇_k is negative definite in R_2. This proves that V_k is a Lyapunov function, the origin is asymptotically stable, and the control law û_{k+1} is admissible.
Lemma 3: [Cost of the approximate policy-update] Consider an admissible control û_k = Π̂_k x ∈ A(Ω) and its corresponding positive definite solution P̂_k obtained from (23) in accordance with Definition 2. Then, the cost of û_k can be expressed in terms of V̂_k = x^T P̂_k x.
Proof: By assumption, we use (23) to construct P̂_k as (30), which can be modified into the form (31). For all x ∈ R_2, it is seen from (30) that V̂_k can be considered the cost for û_k related to the modified positive semidefinite state-cost matrix Q^m defined in (31). It follows from (31) that (32) holds, which is the claim of Lemma 3 for all x ∈ R_2. For all x ∈ R_1, it is not possible to conduct the same analysis as for x ∈ R_2, since Q^m might be negative definite for some x ∈ R_1.

However, for all x ∈ R_1 the cost can be expressed through the modified function V^m_k. This modification gives (30) in the form of (5), i.e., (33), which is the incremental expression of the cost of the admissible control. In addition, since V^m_k is positive definite in R_1, it represents the cost of û_k, which completes the proof of Lemma 3.
It should be noted that it is possible to conduct the same analysis for all x ∈ R_2 as for x ∈ R_1 since, due to (32), it is now possible to consider V^m_k as a positive definite solution of (33) for every x ∈ R_2, which is required to show that it is the cost of the control û_k.

C. Approximate Linear-Like PI Based on the SDLE
The approximate linear-like PI based on the SDLE iteratively uses the approximate cost-update (Definition 2) and the approximate policy-update (Definition 3) in order to construct the final form of the approximate control.The following result states that such a procedure is convergent.
Theorem 1: [Convergence of the approximate linear-like PI] If the pair (P̂_k, Π̂_{k+1}) is constructed at the kth iteration step, then the approximate linear-like PI based on the SDLE converges.
Proof: In accordance with the approximate cost-update (23), P̂_k is the unique and positive definite solution of

x^T [(A + gΠ̂_k)^T P̂_k + P̂_k (A + gΠ̂_k) + Q + Π̂_k^T R Π̂_k] x = 0. (34)

If P̂_k(û_k), related to the control û_k, is considered as an update of P̂_{k-1}(û_{k-1}), then it can be replaced in (34) by

P̂_k(û_k) = P̂_{k-1}(û_{k-1}) + ΔP̂_k(û_k) (35)

where ΔP̂_k(û_k) is by definition the variation of the matrix P̂_{k-1}(û_{k-1}) once the new control û_k is used. This further means that the variation of the cost (Lemma 3) for the control û_k is given as (36), where (37). Combining (34) and (35) leads to (38). In the case of a linear system, the first term would represent the Hamiltonian function. However, for a nonlinear system we call this term a linear-like Hamiltonian function for the pair (P̂_{k-1}, Π̂_k), i.e., Ĥ_{k-1}. The second term in (38) represents the time derivative of the cost variation (36) when the control û_k is used, i.e., (39). Starting again from (34), we write

x^T [(A + gΠ̂_{k+1})^T P̂_k + P̂_k (A + gΠ̂_{k+1}) ... (40)

Using now g^T P̂_k = -R Π̂_{k+1}, which follows from the approximate policy-update Π̂_{k+1} = -R^{-1} g^T P̂_k, we obtain (41), which by back-tracking becomes (42). Combining now (41) with (39) yields (43), which can be integrated over [0, ∞) to get (44). Due to Lemma 2 the controls û_k and û_{k-1} are admissible, hence the state of the system tends to zero as t → ∞ for both controls. This means that there is no variation in the cost at the origin between the two controls. As a result, the sequence {V^m_k}_{k=1}^n is monotonically decreasing while being bounded from below by zero due to Lemma 3, i.e., V^m_k ≥ 0; hence it is convergent as n → ∞.
We call the solution based on this approach the PI-SDLE control. One of the main advantages of the proposed PI-SDLE control is that the linear-like PI can also be computed pointwise using (23), instead of finding a closed-form solution. In such a case, one needs to conduct the whole PI algorithm for every single x along the trajectories of the system. Such a procedure is similar to the pointwise computation of the ARE solution when the SDRE-based control is used. Unlike the SDRE-based control, the PI-SDLE-based control is proven to be stabilizing in Ω provided the initial control is admissible.
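The pointwise variant mentioned above can be sketched directly: for a frozen x the iteration of Definitions 2 and 3 reduces to the linear PI of Section II-C, so it converges to the pointwise SDRE solution of (14). The factorization below is the illustrative one assumed earlier, not one of the article's examples:

```python
# Pointwise PI-SDLE: at a frozen state x, iterate the approximate
# cost-update (23) and the approximate policy-update (24).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def pi_sdle_pointwise(A, g, Q, R, Pi0, iters=25):
    Pi = Pi0
    for _ in range(iters):
        Acl = A + g @ Pi
        P = solve_continuous_lyapunov(Acl.T, -(Q + Pi.T @ R @ Pi))  # (23)
        Pi = -np.linalg.solve(R, g.T @ P)                           # (24)
    return P, Pi

x = np.array([0.7, -0.3])
A = np.array([[0.0, 1.0], [-np.sinc(x[0] / np.pi), -1.0]])
g = np.array([[0.0], [1.0]])
P, Pi = pi_sdle_pointwise(A, g, np.eye(2), np.eye(1), np.array([[0.0, -1.0]]))
print(np.max(np.abs(P - solve_continuous_are(A, g, np.eye(2), np.eye(1)))))
```

Running the whole loop at every x along the trajectory mirrors the pointwise SDRE computation, with the difference, stressed in the text, that admissibility of the initial control guarantees a stabilizing result.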

V. OPTIMAL CONTROL BASED ON LINEAR-LIKE POLICY ITERATION
We now show how to use the linear-like PI proposed in Theorem 1 to obtain the optimal solution of optimal control problems for continuous-time nonlinear systems.
Definition 4 [The plain-PI]: Consider the linear-like PI (45)-(46). We call (45)-(46) the linear-like plain-PI, the pair (ū^1(x), P̂^1(x)) its limit solution, and P_1 and P_2 the regions in which x^T Ṗ^1(x) x ≥ 0 and x^T Ṗ^1(x) x < 0, respectively. The structure of the plain-PI is given in Algorithm 1.
Definition 5 [The P_1-PI]: Consider, for all x ∈ P_1, the linear-like PI (47)-(48), where the index i indicates the outer iteration (Lines 1-11 in Algorithm 2) and one completed linear-like PI (47)-(48) over the index k (Lines 5-8 in Algorithm 2), with k indicating the inner iteration and one PI step for a fixed index i.
The control construction based on the linear-like P_1-PI from Definition 5 can be interpreted, for each fixed value i, as a plain-PI from Definition 4 with a modified state-cost Q^m_i, see (47) and Lines 4 and 6 in Algorithm 2. Whenever Q^m_i is a positive definite matrix, it is possible to find a unique and positive definite solution P̂^2_{k,i} from (47) at each iteration step over the index k. Lemma 4 shows that this is the case for all x ∈ P_1 and every k and i.
This means that the linear-like P_1-PI aims to find the solution through the state-cost modification. The underlying rationale is to see whether convergence can be obtained by modifying the state-cost on the basis of Ṗ^2_{i-1} from the preceding (i - 1)th PI iteration. In the latter case, we still solve the SDLE instead of the HJB equation.
Once convergence is achieved over k (Lines 5-8 in Algorithm 2), i.e., the new pair (ū^2_i, P̂^2_i) is obtained (Lines 9 and 10 in Algorithm 2), we repeat the procedure until convergence of the outer PI over the index i is achieved as well (Lines 1-11 in Algorithm 2). The proof of convergence is given in Lemma 6.
Definition 6 [The P_2-PI]: Consider, for all x ∈ P_2, the linear-like PI (49)-(50), where u_{j,corr} is the correction component obtained as the solution to the quadratic matrix equation (51).

Algorithm 3: P_2-PI(A(x), g(x), Q, R, ū^3_0 = ū^1, P̂^3_0 = P̂^1).
1: for j ← 1 to M_3 do (outer loop with M_3 steps)
2: ... (policy term from ū^3_{j-1})
4: u_{j,corr} ← QUADRATICEQ(R, Ṗ^3_{j-1}) (51)
5: for k ← 0 to N_3 do (inner loop with N_3 + 1 steps)
6: ...

The index j indicates the outer iteration (Lines 1-11 in Algorithm 3) and one completed linear-like PI (49)-(50) over the index k (Lines 5-8 in Algorithm 3), with k indicating the inner iteration and one PI step for a fixed index j.
The control construction based on the linear-like P_2-PI from Definition 6 can be interpreted, for each fixed value j, as a plain-PI from Definition 4 with a modified control, see (50) and (51), as well as Lines 4 and 7 in Algorithm 3. We show in Lemma 5 that it is possible to find a real-valued solution u_corr from (51) for all x ∈ P_2 and every j.
The algorithm for the P_2-PI (49) and (50) is as follows. For fixed Ṗ^3_{j-1} and ū^3_{j-1} (Lines 2 and 3 in Algorithm 3), we construct the correction term u_{j,corr} in accordance with (51) (Line 4 in Algorithm 3). Then, (49) is solved (Line 6 in Algorithm 3) to construct the modified control û*^3_{k+1,j} (50) for every k of the inner iteration (Line 7 in Algorithm 3). The underlying rationale is to see whether convergence can be obtained by modifying the control with respect to the plain-PI. The HJB (17) indicates that u_corr can be used to cancel out the last term of the left-hand side of (17) whenever it is possible to find a real-valued solution u_corr of u_corr^T R u_corr + x^T Ṗ* x = 0. In order to simplify the problem, we use (51) based on the pair (ū^3_{j-1}, P̂^3_{j-1}) obtained from the preceding (j - 1)th PI iteration. Since the next k steps of the inner iteration (Lines 5-8) require the policy Π̂^3_{k+1,j} of the modified control û*^3_{k+1,j} obtained in Line 7, one can easily derive it from (50) and (51).
Once convergence is achieved over k (Lines 5-8 in Algorithm 3), i.e., the new pair (ū^3_j, P̂^3_j) is obtained (Lines 9 and 10 in Algorithm 3), we repeat the procedure until convergence of the outer PI over the index j (Lines 1-11 in Algorithm 3) is achieved as well. The proof of convergence is given in Lemma 7.

Lemma 4: [Domain of the P_1-PI] Consider the P_1-PI from Definition 5. Then, (52) holds.

Proof: In accordance with Definition 5, the initial admissible control ū^2_0 and the matrix Ṗ^2_0, required by the P_1-PI (47) and (48), are the solutions of the plain-PI from Definition 4. The limit form of (45) and (46) can be written as (53), from which we select ū^2_0 = ū^1 and Ṗ^2_0 = Ṗ^1 to form the P_1-PI for i = 1. By comparing (53) with the form of the incremental expression of the cost (5) for ū^1 = Π̄^1 x, i.e.,

x^T ((A + gΠ̄^1)^T P̂^1 + P̂^1 (A + gΠ̄^1) ... (54)

one can see that the optimal cost and the optimal control are already achieved for all x along the hypersurface x^T Ṗ^1 x = 0.
For the ith outer iteration of the P_1-PI (47) and (48), the inner iteration over the index k can also be considered as a plain-PI, so convergence is achieved provided the initial control is admissible. Starting from the first P_1-PI (i = 1) and the first inner iteration (k = 1), we take Π̄^1 as the initial control policy, i.e., Π̂^2_{k=1,i=1} = Π̄^1, which is already optimal in the set x^T Ṗ^1 x = x^T Ṗ^2_0 x = 0. Furthermore, once we complete the iteration over the index k for i = 1, optimality in this set is preserved due to the convergence of the plain-PI.
More generally, once convergence of the ith iteration is achieved over the index k, the form (55) follows from (47) and (48). It is therefore possible to conclude from (55) that the optimal cost and the optimal control are already achieved for all x in the set x^T Ṗ^2_{i-1} x = 0. On the other hand, any optimal pair (Π̂^2_i, P̂^2_i) related to the original state-cost x^T Q x must satisfy (56), which holds only for x along x^T Ṗ^2_i x = 0. Assuming now that convergence is not yet achieved along the index i, this further means that optimality is not achieved for all other x outside the set x^T Ṗ^2_i x = 0. However, since optimality is already achieved for x^T Ṗ^2_{i-1} x = 0, we conclude that the sets x^T Ṗ^2_{i-1} x = 0 and x^T Ṗ^2_i x = 0 represent the same state-space set. This means that the initial hypersurface x^T Ṗ^1 x = 0, along which the optimal control and the optimal cost are achieved by the plain-PI, is preserved for all i of the P_1-PI. This discussion proves that the P_1-PI remains in the region P_1. Additionally, one can see from (53) and (54) that, in order to achieve the optimal cost and the optimal control by a form of linear-like PI, one needs to properly modify the state-cost x^T Q x. It can be seen that, for all x for which x^T Ṗ^1(x) x > 0, the modified state-cost has to be larger than x^T Q x. Assume now that ∃x : x^T Ṗ^2_i x < 0 while x^T Ṗ^2_{i-1} x > 0. This would mean that for such x the modified state-cost has to be smaller than x^T Q x, which contradicts the indication of the preceding iterations. This conclusion holds for the initial iteration as well, where x^T Ṗ^2_0 x ≥ 0, i.e., x^T Ṗ^1 x ≥ 0, which completes the proof.
Lemma 5: [Domain of the P_2-PI] Consider the P_2-PI from Definition 6. Then, (57) holds.

Proof: Starting from (50), we can write the preceding control of the inner iteration in the form (58). If we now plug the term Π̂^3_{k,j} x + u_{j,corr} into (49), in a similar manner as in (20)-(22), we obtain (59). Using (51), we can replace the last term to obtain (60). This means that (49)-(51) can be replaced with (60), together with an unmodified control law (61) and (51), in order to construct the linear-like P_2-PI. However, it is worth noting that the last term of (60) exists only in the case u_{j,corr} ≠ 0, due to (51). Observing now (60) and (61), the proof of Lemma 5 can be completed in a similar manner as that of Lemma 4.
In the following, we show the convergence of both the P 1 and P 2 policy-iterations and introduce the proposed optimal control.
Lemma 6: [Convergence of the P 1 -PI] The linear-like P 1 -PI is convergent for all x ∈ P 1 .
Proof: The ith outer iteration of the P_1-PI can be considered a plain-PI with a modified state-cost matrix Q^m_i. This means that it is convergent for every inner iteration over the index k provided the initial control is admissible (Theorem 1). Since the initial control policy Π̂^2_{k=1,i} for the ith outer iteration is taken as the convergent solution of the preceding (i - 1)th outer iteration, i.e., Π̂^2_{k=1,i} = Π̂^2_{i-1}, we conclude the convergence of each inner-iteration cycle over the index k. Moreover, by recalling Lemma 3 and taking into account that the P_1-PI uses a modified state-cost Q̃, the real cost V^m_{k,i}(Q̃) of the control ū^2_i
converges as k → ∞ during the ith outer iteration. This leads to the real cost of the control ū^2_i related to the original state-cost, given by (63). To complete the proof, we show that the cost V^{m,2}_i(Q) does not increase during the transition from the ith to the (i+1)th outer iteration. Once convergence of the ith outer iteration is achieved over the index k, (47) becomes (55), while the first expression for k = 1 of the (i+1)th outer iteration is given by (64). The difference between (64) and (55) can be written in the form (65), where ΔP^2_{i+1} = P^2_{k=1,i+1} − P^2_i. Observing now the real cost of the control ū^2_i for k = 1 during the ith outer iteration, given by (66), it can be seen that (65) can be written in the compact form (67), where the first time derivative is taken along the trajectories of the system when the control policy Π^2_i is used. Once the overall convergence is achieved, where lim_{i→∞} ΔV^{m,2}_{i+1}(Q) = 0 and lim_{i→∞} ΔP^2_{i+1} = 0, the modified cost (63) and the control (48) become (68) and (69), respectively.

Lemma 7: [Convergence of the P^2-PI] The linear-like P^2-PI is convergent for all x ∈ P^2.
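The non-increase of the cost across iterations invoked in this proof can be observed numerically in the frozen-state (Kleinman-type) iteration that underlies each inner cycle. The following sketch uses our own illustrative matrices A, g, Q, R and initial policy Π, not data from the article:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative frozen-state factors (an assumption, not from the article).
A = np.array([[0.0, 1.0], [-1.0, 0.5]])
g = np.array([[0.0], [1.0]])
Q = np.diag([0.0, 1.0])          # e.g., a state-cost l(x) = x2^2
R = np.array([[1.0]])

Pi = np.array([[0.0, -1.0]])     # admissible (stabilizing) initial policy
P_prev = None
for k in range(10):
    Acl = A + g @ Pi
    # Policy evaluation: (A + g Pi)^T P + P (A + g Pi) + Q + Pi^T R Pi = 0
    P = solve_continuous_lyapunov(Acl.T, -(Q + Pi.T @ R @ Pi))
    if P_prev is not None:
        # Cost matrices do not increase: P_prev - P stays positive semidefinite
        assert np.min(np.linalg.eigvalsh(P_prev - P)) >= -1e-8
    P_prev = P
    Pi = -np.linalg.solve(R, g.T @ P)   # policy improvement
```

For constant matrices this is the classical monotone policy iteration for the linear-quadratic problem; the paper's argument extends the same monotonicity to the state-dependent, modified-cost setting.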
Proof: By Lemma 5, (49)–(51), which form the proposed linear-like P^2-PI, can be replaced with (51), (60), and (61). Starting from the latter, one can complete the proof of Lemma 7 in a similar manner as the proof of Lemma 6.
It is worth noting that here, at the end of the jth outer iteration, the cost of the control ū^3_j related to the original state-cost matrix Q converges to (70). Once the overall convergence is achieved, the correction term u_{j,corr} vanishes, i.e., lim_{j→∞} u_{j,corr} = 0. Recalling (51), the last two terms of (70) then vanish as well, so the modified cost (70) and the control (50) become (72) and (73), respectively.

Theorem 2: [Optimal control] Assume that the optimal value function is of the form (13). Then, the optimal control u*(x) ∈ A(Ω) is given by (74), where ū^2 and ū^3 are the final solutions of the linear-like P^1 and P^2 policy iterations, respectively.

Proof: In accordance with Lemma 6, (63) represents the cost of the control ū^2_i for x ∈ P^1, which means that the pair (ū^2_i, V^{m,2}_i(Q)) satisfies the incremental expression of the cost (5), which can be written as (55). On the other hand, the control ū^2_i minimizes (55), which means that it is optimal with respect to the cost V^{m,2}_i(Q). Since, in the limiting case i → ∞, (55) converges to the original incremental expression of the cost of the control ū^2, while the cost is given in the quadratic form (68), we conclude that the limit pair (ū^2, V^2) is optimal for x ∈ P^1.
In accordance with Lemma 7, (70) represents the cost of the control ū^3_j for x ∈ P^2, which means that the pair (ū^3_j, V^{m,3}_j(Q)) satisfies the incremental expression of the cost (5), which can be written as (76).
In the limiting case j → ∞, (76) converges to (77), and the last two terms of (70) vanish as well. It can be seen that the control ū^3 minimizes (77), while (77) can be rewritten in the form of the original incremental expression of the cost of the control ū^3 with the quadratic function (72), i.e.,

x^T [ (A + g Π^3)^T P^3 + P^3 (A + g Π^3) + Q + (Π^3)^T R Π^3 ] x = 0

holds along ẋ = A(x)x + g(x)ū^3 in the limiting case. This leads to the conclusion that the limit pair (ū^3, V^3) is optimal for x ∈ P^2 as well.

VI. ILLUSTRATIVE EXAMPLES
We provide simulation results for four nonlinear systems. For the first and the second system the optimal control and the optimal value function are known, so it is possible to assess the proposed approach against the optimal solution. The nonlinear system used in the second example is obtained through a nonlinear transformation of a linear system, providing a test case with a nonconstant matrix Q = Q(x).
In the third example we consider a nonlinear system for which the optimal control and the optimal value function are also known; however, the value function is not quadratic-like. In the fourth example we compare our approach against the control based on the Galerkin approximation (GAC) by considering a nonlinear system with an unknown optimal control policy and an unknown optimal value function. These last two examples illustrate an additional capability of the proposed approach, namely solving nonlinear control problems even outside the quadratic-like setting.
In all examples, after the plain-PI (45)–(46) is applied, we complete only one outer iteration of the P^1-PI (47)–(48) and of the P^2-PI (49)–(50), i.e., i = 1 and j = 1, each with three inner iterations. The number of inner iterations used for the plain-PI is indicated in each example.
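To make the iteration structure concrete, the following is a minimal sketch, under our own assumptions rather than the authors' code, of one inner cycle of the plain-PI at a frozen state: a policy evaluation via the state-dependent Lyapunov equation, followed by the policy update Π = −R⁻¹gᵀP. The helper name plain_pi and the constant test matrices are illustrative.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def plain_pi(A, g, Q, R, Pi, inner_iters=3):
    """One plain-PI inner cycle at a frozen state x: A, g, Q stand for the
    linear-like factors A(x), g(x), Q(x) evaluated at that state."""
    for _ in range(inner_iters):
        Acl = A + g @ Pi                                  # closed-loop factor
        # Policy evaluation: (A + g Pi)^T P + P (A + g Pi) + Q + Pi^T R Pi = 0
        P = solve_continuous_lyapunov(Acl.T, -(Q + Pi.T @ R @ Pi))
        Pi = -np.linalg.solve(R, g.T @ P)                 # policy improvement
    return P, Pi

# Illustrative constant factors (assumed, not taken from the article):
A = np.array([[0.0, 1.0], [-1.0, 0.5]])
g = np.array([[0.0], [1.0]])
Q = np.diag([0.0, 1.0])
R = np.eye(1)
P, Pi = plain_pi(A, g, Q, R, Pi=np.array([[0.0, -1.0]]), inner_iters=10)
```

In the constant-matrix case this reduces to Kleinman's iteration, whose limit is the stabilizing solution of the associated Riccati equation; in the state-dependent case the same solve is repeated pointwise in x.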

A. Optimal Control of the Van Der Pol Oscillator
Consider the Van Der Pol oscillator with μ = 0.5, the state-cost l(x) = x_2^2, and R = 1. The optimal control is u* = −x_1 x_2, with optimal value function V* = x_1^2 + x_2^2. The system (79) can be easily factorized in the linear-like form ẋ = A(x)x + g(x)u. The initial admissible control for the linear-like PI can be selected as the one that cancels out the nonlinearities and stabilizes the system, i.e., u = −(1/2) x_1 x_2 (e.g., Π = [0, −(1/2) x_1]). In such a case, the initial control is close to the optimal control, so convergence is expected to be achieved quickly. Fig. 1 shows the optimal value function (left) and the value function obtained by the proposed approach with three inner iterations of the plain-PI and three inner iterations each of the P^1-PI and the P^2-PI (right).
However, in order to illustrate how convergence improves with the number of plain-PI iterations alone, without using the P^1-PI and the P^2-PI, we start from a different initial control u = −(1/2) x_1 x_2 − 0.1 x_1^3 x_2 (e.g., Π = [0, −0.5 x_1 − 0.1 x_1^3]), which also ensures asymptotic stability of the feedback system (79). Fig. 2 illustrates how the associated cumulative cost converges depending on the number of iterations used in the plain-PI (45)–(46) (left). In some examples, the plain-PI can achieve an optimal solution without the modifications based on the P^1-PI and the P^2-PI. In any case, the proposed approach, based on the plain-PI followed by three iterations of the P^1-PI and the P^2-PI, achieves the optimal cost value (right).
Fig. 3 provides a comparison between the proposed approach and the optimal control for x_0 = [−1; 1], with three inner iterations of the plain-PI and three inner iterations of the P^1-PI and the P^2-PI. From the control signals, cumulative costs, and phase portrait, we conclude optimality of the proposed solution. Fig. 3 also shows the switching function x^T Ṗ^1 x along the trajectories of the system, which is used in (74). This function indicates the time intervals in which the two different forms of the optimal control (74) have been used. Another interesting observation is that this function becomes zero before the state reaches the origin. This means that the system trajectory has approached the hypersurface x^T Ṗ^1 x = 0 (in this example, a curve) and then has moved along this surface towards the equilibrium. This phenomenon has not been investigated in this work and is a promising direction for further understanding of the proposed framework. Somewhat surprisingly, once the system state is on this hypersurface, along which the PIs (45)–(46), (47)–(48), and (49)–(50) are equivalent, one only needs the linear-like PI (45)–(46) to obtain the remaining part of the optimal control. Fig. 4 shows different shapes of the switching function obtained for different initial conditions.

B. Nonlinear System With a Nonquadratic Cost
Consider the nonlinear system (81) with the state-cost l(x) = (x_1 − (1/2) x_2)^2 + x_2^4, and let R = 1. This system can be obtained by transforming the linear system (82), with Q the identity matrix and R = 1, through a nonlinear transformation of the state x_2. The optimal control of the system (81) can be computed through the same nonlinear transformation, starting from the optimal control of the linear system (82). The nonlinear system (81) can be easily factorized in the linear-like form, and the state-cost can be rewritten in the state-dependent quadratic form l(x) = x^T Q(x) x. In order to illustrate the proposed PI algorithm, the initial admissible control is obtained through the same nonlinear transformation, starting from the control of the linear system (82) that places the system poles at (−3, j0) and (−1, j0). Fig. 5 shows the optimal value function (left) and the value function obtained by the proposed approach (right) using only two inner iterations of the plain-PI, indicating that convergence is locally achieved.

C. Nonlinear System Without a Quadratic-Like Form of the Optimal Value Function
Example 1: Consider the nonlinear system (86) with the state-cost l(x) = x_2^2 and R = 1 [24]. The optimal control of this system is given by (88). The nonlinear system (86) can be easily factorized in the linear-like form, for which the initial admissible control for the linear-like PI is selected as the one that cancels out the nonlinearities and stabilizes the system, i.e., u = −x_2 − x_2 sinh(x_1^2 + x_2^2) (e.g., Π = [0, −1 − sinh(x_1^2 + x_2^2)]).
Fig. 6. Optimal value function (left) and the value function obtained by the proposed approach using only two inner iterations of the plain-PI (right) for the system without a quadratic-like form of the optimal value function.
Fig. 6 shows the optimal value function (left) and the value function obtained by the proposed approach (right) using only two inner iterations of the plain-PI, indicating that convergence is locally achieved. The obtained results also suggest that the proposed PI algorithm has the potential to construct the optimal control even in case the optimal value function is not quadratic-like. In the fourth example, the initial admissible control for the linear-like PI is selected as the control based on feedback linearization (FL), which is obtained in the form (93) [5]. The solution based on the GAC has been obtained for different orders of the approximation, which can be found in [5]. In this example, we use two such controls, obtained for N = {8, 15}.

D. Nonlinear System With Unknown Optimal Control
We compare the GAC, the FL control, and the proposed approach in terms of their associated costs, as in [5]. Fig. 7 shows the costs obtained for different initial conditions in x_1, while x_2 is kept constant, i.e., x_2 = 0 (a) and x_2 = 0.2 (b). One can observe that the proposed linear-like policy iteration generates the minimal cost, although the GAC with N = 15 is similar. However, we stress that the GAC requires a number of preconditions for a valid implementation [5].
Fig. 8 shows the optimal value function (left) and the value function obtained by the proposed approach (right) using only one inner iteration of the plain-PI, indicating that convergence is locally achieved. The obtained results also suggest that the proposed PI algorithm has the potential to construct the optimal control even in case the optimal value function is not known.

VII. CONCLUSION
We have developed a method to determine optimal control strategies for continuous-time nonlinear systems. In particular, we have defined the approximate linear-like PI based on the SDLE to compute an approximate control law. Stabilizability of this approximate policy update is proved in Lemma 2, its cost is derived in Lemma 3, and convergence of the resulting control law is established in Theorem 1. Section V contains the main result, in which Definitions 4–6 introduce three slightly different linear-like PIs. Lemmas 4–7 prove their convergence properties, while Theorem 2 describes the optimal control law based on these PIs.
The algorithm has been tested on four different nonlinear systems: the Van Der Pol oscillator, a nonlinear system with a nonquadratic cost, a nonlinear system without a quadratic-like optimal value function, and a nonlinear system with unknown optimal control. The results show the optimality of the proposed approach and its fast local convergence. They also suggest that the proposed approach has the potential to be used in cases in which the optimal value function is not quadratic-like.

Manuscript received 13 November 2021; revised 1 August 2022; accepted 20 November 2022. Date of publication 5 December 2022; date of current version 27 September 2023. This work was supported in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant 739551 (KIOS CoE), in part by the Italian Ministry for Research in the framework of the 2017 and 2020 Programs for Research Projects of National Interest (PRIN) under Grants 2017YKXYXJ and 2020RTWES4, and in part by the EPSRC grant "Model Reduction from Data" under Grant EP/W005557. Recommended by Senior Editor Tetsuya Iwasaki. (Corresponding author: Adnan Tahirović.)

Fig. 1. Van Der Pol oscillator: the optimal value function (left) and the value function obtained by the proposed approach using only three inner iterations of the plain-PI (right).

Fig. 2. Cumulative costs obtained for the initial condition x_0 = [−1; 1] using only the plain-PI with different numbers of inner iterations (left); and the cumulative cost of the proposed approach based on three iterations of the P^1-PI and the P^2-PI (right).

Fig. 3. Comparison between the optimal and the proposed controls in terms of (a) control signals, (b) cumulative costs, and (c) phase portrait, obtained along the trajectory from the initial condition x_0 = [−1; 1] with three iterations of the plain-PI (k = 3). (d) shows the values of the boundary function used in (74), i.e., x^T Ṗ^1 x.

Fig. 5. Optimal value function (left) and the value function obtained by the linear-like PI approach using only two inner iterations of the plain-PI (right) for the nonlinear system with a state-dependent cost.

In the fourth example, the state-cost is l(x) = x_1^2 + x_2^2 with R = 1, and the system can be easily factorized in the linear-like form with a state-dependent A(x).

Fig. 8. Optimal value function (left) and the value function obtained by the linear-like PI approach using only one inner iteration of the plain-PI (right) for the nonlinear system with an unknown optimal value function.