Regulation With Guaranteed Convergence Rate for Continuous-Time Systems With Completely Unknown Dynamics in the Presence of Disturbance

This paper presents the design of a novel <inline-formula> <tex-math notation="LaTeX">$H_{\infty }$ </tex-math></inline-formula>-based control framework for state regulation of continuous-time linear systems with completely unknown dynamics. The proposed method solves the regulation problem with a desired convergence rate while simultaneously attenuating the adverse effect of disturbance on the system. The <inline-formula> <tex-math notation="LaTeX">$H_{\infty }$ </tex-math></inline-formula> regulation problem adopts a cost function that accounts for regulation with a guaranteed rate of convergence as well as disturbance attenuation. The problem is then recast as a two-player zero-sum game that can be solved through the associated algebraic Riccati equation (ARE), which provides a model-based solution. To solve this problem in a model-free way, a novel integral reinforcement learning (IRL) algorithm is designed to learn the solution online without requiring any prior knowledge of the system dynamics. It is shown that the model-free method (i.e., the IRL-based method) provides the same solution as the model-based method (i.e., the ARE). The effectiveness of the proposed method is ascertained through simulation examples, which show that it effectively addresses the problem for both stable and unstable systems.


I. INTRODUCTION
Optimal regulation involves developing a controller that ensures the system states optimally converge to zero while balancing the state costs against the control effort. Solving optimal control problems requires solving the Hamilton-Jacobi-Bellman (HJB) equation. For linear systems, this can be achieved by solving the linear quadratic regulation (LQR) problem through the algebraic Riccati equation (ARE), the simplified version of the HJB equation [1], [2], [3], [4]; for nonlinear systems, however, the HJB equation lacks an analytical solution and must be solved numerically. The disadvantage of this approach is the need for explicit knowledge of the system model; therefore, model-free methods based on the concept of reinforcement learning (RL) have been proposed in the literature. Such methods have been effectively used to solve the regulation problem for nonlinear systems and to learn the optimal control solution in real time while removing the need for complete knowledge of the system dynamics [5], [6], [7], [8], [9], [10], [11], [12].

(The associate editor coordinating the review of this manuscript and approving it for publication was Rajeeb Dey.)
Reinforcement learning has been widely employed as a machine learning technique for solving complex optimization problems [13]. RL is commonly used as a tool to solve optimal control problems by solving the HJB equations iteratively, using either policy iteration (PI) [5], [14], [15] or value iteration (VI) [2], [16], [17] techniques. Hence, RL can provide a real-time solution to the HJB equation by iteratively optimizing the control cost, even when the system dynamics are unknown [18], [19], [20]. In both PI and VI algorithms, the policy evaluation and policy improvement steps are carried out iteratively until an optimal solution is achieved. Another merit of leveraging RL for feedback control problems is its ability to address the ''curse of dimensionality'' in such problems [21].
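As a concrete illustration of the model-based route that RL-based methods aim to replace, the following sketch solves an LQR problem through the ARE for an assumed double-integrator plant. The matrices, weights, and the use of SciPy are illustrative assumptions, not part of this paper's method:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical double-integrator plant: x1' = x2, x2' = u.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # state weight
R = np.array([[1.0]])  # control weight

# Solve the ARE  A'P + PA - P B R^{-1} B' P + Q = 0  for P.
P = solve_continuous_are(A, B, Q, R)

# Optimal state-feedback gain, u = -K x.
K = np.linalg.solve(R, B.T @ P)

# The closed-loop matrix A - BK must be Hurwitz.
assert np.all(np.linalg.eigvals(A - B @ K).real < 0)
```

The model-free algorithms discussed in this paper learn an equivalent gain from measured data instead of from `A` and `B`.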

A. LITERATURE REVIEW
The off-policy integral reinforcement learning (IRL) algorithm was first introduced in [22] and [23] to avoid the time derivatives in continuous-time (CT) systems and to design an optimal state-feedback controller. The authors of [24] present an online IRL algorithm to find the solution to the tracking-constrained HJB equation for partially unknown systems with a bounded control input. This algorithm produces observations using a policy that differs from the one being evaluated, hence the term off-policy. In [25], the authors apply the IRL algorithm, which provides the solution to the HJB equation, to learn the CT optimal control solution for nonlinear systems with an infinite-horizon cost and incomplete knowledge of the system dynamics. IRL is also used in [26] to solve the linear quadratic tracking (LQT) problem. A PI-based IRL algorithm is presented in [27] to find the Nash equilibrium of a two-player zero-sum differential game. The authors of [28] focused on output regulation with the help of the IRL algorithm, without the need for a discounting factor, to design a model-free controller for linear systems. The integral RL algorithm is also employed in [29] for adaptive control of high-order multivariable nonlinear systems with unknown control coefficients.
A discussion on Q-learning and IRL algorithms for discrete-time (DT) and CT systems is provided in [30]. The authors of [31] developed a model-free off-policy RL algorithm to solve the ARE for robustly stabilizing a DT linear system affected by bounded and mismatched uncertainties. In [32], the optimal tracking control of DT nonlinear systems is studied under unknown system drift dynamics; the tracking problem is turned into a regulation problem by augmentation, and the associated HJB equation is solved by a new RL-based scheme using an actor-critic neural network structure. The discrete-time LQR problem is also studied in [33], where the focus is on achieving robustness using a new off-policy model-free RL algorithm, called optimistic least-squares policy iteration, for a system with additive stochastic noise.
Reinforcement learning and adaptive dynamic programming (ADP) techniques are studied in [34] to achieve an optimal output regulation controller for linear systems with unmeasurable disturbances and unknown dynamics. An ADP-based dynamic output feedback scheme is also developed in [35] for the linear quadratic regulation of CT systems; the solution is model-free and uses measurable input-output data to find the optimal control parameters.
In [36], the authors developed an off-policy model-free IRL algorithm to learn the optimal output-feedback (OPFB) solution for CT linear systems without knowledge of the system dynamics; the algorithm is applied to both regulation and tracking problems based on a discounted performance function and a discounted ARE. In addition, the authors of [37] presented an IRL-based online algorithm for learning a suboptimal OPFB H∞ control law to address CT linear tracking problems under disturbances; the IRL solves the game ARE online, which gives a Nash equilibrium solution of the associated optimization problem. The same scheme is also developed in [38], but for OPFB control of CT linear systems with input delay. Furthermore, the authors of [39] developed an H∞ tracking controller for CT nonlinear systems by deriving a tracking Hamilton-Jacobi-Isaacs (HJI) equation and solving it through an online off-policy RL algorithm without the need for the system dynamics; they show that the algorithm does not require a specified disturbance input. In [40], the authors employed an RL algorithm to solve the H∞ tracking problem in real time for a nonlinear system without the need for the system dynamics. This is also the case in [41], in which the authors used an event-triggered RL algorithm to solve the tracking HJI equation. Moreover, IRL-based event-triggered ADP is introduced in [42] to remove the need for drift dynamics and to control a CT nonlinear system with saturated input. In [43], the authors considered a CT neural network parameter update law, based on variable-gain gradient descent augmented with robust terms, for the model-free IRL-based H∞ optimal tracking control of a CT nonlinear system with unknown dynamics, with the aim of disturbance rejection.
The convergence rate, or speed of regulation, is an important consideration, and it is desirable for regulation to be as fast as possible. In [44], an RL-based method for solving the regulation problem with a guaranteed convergence rate for CT linear systems is developed. Also, an off-policy model-free RL-based solution to the LQR problem with a guaranteed convergence rate of the state variables is developed in [45] for a DT linear system. However, the adverse effect of disturbance is ignored in both works.

B. MAIN CONTRIBUTIONS
Reviewing the literature shows that the majority of previous studies have focused on either guaranteeing the convergence rate or attenuating the disturbance, but not both concurrently. Hence, the contributions of this paper are as follows:
• This paper proposes an H∞-based control method for solving the regulation problem with a guaranteed convergence rate for CT linear systems by leveraging a novel model-free IRL algorithm; therefore, no knowledge of the system dynamics is required. The proof of regulation with the desired convergence rate is also provided.
• This paper considers the system to be adversely affected by an unknown time-varying disturbance and shows that the proposed method not only attenuates the disturbance but also ensures regulation with a desired convergence rate. Simulation results also demonstrate the effectiveness of the proposed method for various unknown harsh disturbances.

C. MANUSCRIPT LAYOUT
The remainder of this manuscript is organized as follows. Section II formulates the regulation problem with a guaranteed convergence rate under the effect of disturbance. Section III presents the novel model-based solution to this problem with proofs. The proposed model-free method is presented in Section IV, and Section V provides examples that support the proposed method. Finally, the manuscript is concluded in Section VI, where some related future research directions are also provided.

II. PROBLEM FORMULATION
This section presents the basics of the state-feedback regulation control strategy. The CT state-evolution equation of the system can be written as

ẋ(t) = Ax(t) + Bu(t) + Dω(t), (1)

where x(t) ∈ R^n is the state vector, u(t) ∈ R^m is the control input, ω(t) is an unknown external disturbance, and A, B, and D are the system matrices, all of which are unknown. The objective of the regulation problem is to design a control input u(t) such that the trajectories of the system tend to zero, i.e., lim_{t→∞} x(t) = 0. The control input is represented by

u(t) = K_u x(t), (2)

in which K_u ∈ R^{m×n} is the gain matrix of the feedback loop.

Definition 1: The system is called normal if it is not directly affected by disturbance, i.e., ω(t) = 0.
Considering (2) as the control input and letting (1) be normal, (1) can be rewritten as

ẋ(t) = (A + BK_u)x(t), (3)

which is called the closed-loop equation of the system. It should be noted that the feedback gain K_u guarantees regulation of the system states at a convergence rate faster than e^{-αt} if all eigenvalues of the closed-loop system (3) are located to the left of the line s = -α in the s-plane, i.e., max{Re[λ(A + BK_u)]} < -α, thus guaranteeing the convergence rate of the regulation problem.
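The spectral condition above is easy to check numerically. The sketch below, with an assumed plant, gain, and α (not taken from this paper), verifies both the eigenvalue test and its consequence e^{αt}x(t) → 0:

```python
import numpy as np
from scipy.linalg import expm

alpha = 0.5
A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # assumed open-loop matrix
B = np.array([[0.0], [1.0]])
K_u = np.array([[-2.0, -1.0]])             # assumed stabilizing gain

# Spectral condition: max Re λ(A + B K_u) < -α,
# equivalently A + B K_u + αI is Hurwitz.
Acl = A + B @ K_u
assert max(np.linalg.eigvals(Acl).real) < -alpha

# Consequence: e^{αt} x(t) -> 0 along x(t) = expm(Acl t) x0.
x0 = np.array([1.0, -1.0])
norms = [np.exp(alpha * t) * np.linalg.norm(expm(Acl * t) @ x0)
         for t in (0.0, 5.0, 10.0)]
assert norms[-1] < 1e-3 * norms[0]         # weighted state norm has decayed
```

Here the closed-loop eigenvalues sit at -2, well to the left of s = -0.5, so the α-weighted state norm decays as claimed.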
Given α and γ as predetermined values, the problem is to devise a control strategy (2) for the defined CT linear system (1) such that:
1) Regulation is achieved at least at the rate of e^{-αt} as t approaches ∞, i.e., lim_{t→∞} e^{αt} x(t) = 0 when ω(t) = 0.
2) The following bounded L_2-gain condition is satisfied when ω(t) ≠ 0:

∫_0^∞ e^{βt} z(t)^T z(t) dt ≤ γ^2 ∫_0^∞ e^{βt} ω(t)^T ω(t) dt, (4)

where z(t) is the performance output with z(t)^T z(t) = x(t)^T Q x(t) + u(t)^T R u(t), Q = Q^T ≥ 0 and R = R^T > 0 are weight matrices, α is the minimum degree of stability, β is the rate of disturbance attenuation, and γ is the level of disturbance attenuation.

III. H∞ REGULATION CONTROL WITH GUARANTEED CONVERGENCE RATE
The above-mentioned H∞ regulation problem can be regarded as a two-player zero-sum game. Therefore, with regard to the second condition of the problem, the cost function for this problem can be defined as

J(x(t), u(t), ω(t)) = ∫_0^∞ e^{βt} (x^T Q x + u^T R u - γ^2 ω^T ω) dt. (5)

Since the satisfaction of the second condition of the H∞ regulation problem is equivalent to minimizing the cost function (5), the optimization problem turns into the following:

min_{u} J(x(t), u(t), ω(t)) s.t. (1). (6)

By choosing β = 2α, the cost function (5) can be written in the following form:

J = ∫_0^∞ [(e^{αt}x)^T Q (e^{αt}x) + (e^{αt}u)^T R (e^{αt}u) - γ^2 (e^{αt}ω)^T (e^{αt}ω)] dt. (7)

In order to proceed with the solution, x̄(t), ū(t), ω̄(t), and Ā are introduced as

x̄(t) = e^{αt}x(t), ū(t) = e^{αt}u(t), ω̄(t) = e^{αt}ω(t), Ā = A + αI, (8)

and substituted in (7); consequently, the cost function can be written as

J(x̄(t), ū(t), ω̄(t)) = ∫_0^∞ (x̄^T Q x̄ + ū^T R ū - γ^2 ω̄^T ω̄) dt. (9)

Now, by taking the derivative of x̄(t),

x̄̇(t) = Āx̄(t) + Bū(t) + Dω̄(t), (10)

the optimization problem turns into the following:

min J(x̄(t), ū(t), ω̄(t)) s.t. (10). (11)
As mentioned previously, the H∞ regulation problem can be regarded as a two-player zero-sum game in the sense that the minimizing player is the control input ū(t) and the maximizing player is the disturbance ω̄(t). In other words, ω̄(t) attempts to maximize the cost function (9) while ū(t) aims to minimize it. Besides, due to the linearity of the system, the value function is quadratic with the following form:

V(x̄(t)) = x̄(t)^T P x̄(t), P = P^T > 0. (12)

Consequently, the solution to the optimization problem (11) is equivalent to the solution to the min-max optimization problem

V*(x̄(t)) = min_{ū} max_{ω̄} J(x̄(t), ū(t), ω̄(t)), (13)

where V*(x̄(t)) is the optimal value of V(x̄(t)). In order for this optimization problem to have a unique solution, a game-theoretic saddle point must exist, i.e., the following condition must hold:

min_{ū} max_{ω̄} J(x̄, ū, ω̄) = max_{ω̄} min_{ū} J(x̄, ū, ω̄). (14)

Now, taking (10), (11), and (12) into account, the Hamiltonian function can be written as

H(x̄, ū, ω̄) = x̄^T Q x̄ + ū^T R ū - γ^2 ω̄^T ω̄ + V̇(x̄(t)), (15)

and, continuing by differentiating (12), the following Bellman equation is achieved:

x̄^T Q x̄ + ū^T R ū - γ^2 ω̄^T ω̄ + (Āx̄ + Bū + Dω̄)^T P x̄ + x̄^T P (Āx̄ + Bū + Dω̄) = 0. (16)

The minimizing (optimal) control input and the maximizing (worst-case) disturbance can be obtained by applying the stationarity conditions ∂H(·)/∂ū(t) = 0 and ∂H(·)/∂ω̄(t) = 0, which results in:

ū*(t) = -R^{-1} B^T P x̄(t), (17)
ω̄*(t) = γ^{-2} D^T P x̄(t). (18)

The substitution of (17) and (18) into (16) produces the equality

x̄^T (Ā^T P + P Ā + Q - P B R^{-1} B^T P + γ^{-2} P D D^T P) x̄ = 0,

which, holding for all x̄(t), simplifies to the following ARE:

Ā^T P + P Ā + Q - P B R^{-1} B^T P + γ^{-2} P D D^T P = 0. (19)

Theorem 1: Consider the CT linear system (1). By means of the control input signal (2), the H∞ regulation problem is solved if

K_u = -R^{-1} B^T P, (20)

where P = P^T > 0 is the solution to the ARE (19). The proof is required to show that the optimal control input signal (17) satisfies both conditions of the problem; therefore, the proof is separated into two parts.
Proof of Theorem 1 (Part 1): The first condition of the problem disregards the disturbance, i.e., considers the system normal. Since ω(t) = 0 implies ω̄(t) = 0, (10) can be written as

x̄̇(t) = Āx̄(t) + Bū(t), (21)

and by substituting (17) in (21), the dynamic representation of the system changes into

x̄̇(t) = A_c x̄(t), (22)

where A_c = Ā - BR^{-1}B^T P. Now, considering the quadratic value function (12) as a Lyapunov candidate, its derivative yields

V̇(x̄) = x̄^T (A_c^T P + P A_c) x̄. (23)

Since the system is considered normal, the ARE (19) reduces to

Ā^T P + P Ā + Q - P B R^{-1} B^T P = 0, (24)

and the substitution of (24) in (23) produces

V̇(x̄) = -x̄^T (Q + P B R^{-1} B^T P) x̄, (25)

for which V̇(x̄) < 0 due to Q ≥ 0, P > 0, and R > 0; thus lim_{t→∞} x̄(t) = 0. Now, since x̄(t) = e^{αt}x(t), it can be concluded that lim_{t→∞} e^{αt}x(t) = 0. Hence, the first part of the proof is completed.

Proof of Theorem 1 (Part 2): Considering the optimal values ū*(t) and ω̄*(t), the Hamiltonian function (15) can be represented as follows:

H(x̄, ū*, ω̄*) = x̄^T Q x̄ + ū*^T R ū* - γ^2 ω̄*^T ω̄* + V̇(x̄). (26)

Regarding (16), (17), and (18), it is known that H(x̄, ū*, ω̄*) = 0. Consequently, in order to complete the proof, it is sufficient to show that (17) is the solution to the H∞ regulation problem. To continue the proof, (15) can be rewritten by completing the squares as

H(x̄, ū, ω̄) = H(x̄, ū*, ω̄*) + (ū - ū*)^T R (ū - ū*) - γ^2 (ω̄ - ω̄*)^T (ω̄ - ω̄*). (27)

Considering (26) and (27), the following equality holds:

x̄^T Q x̄ + ū^T R ū - γ^2 ω̄^T ω̄ + V̇(x̄) = (ū - ū*)^T R (ū - ū*) - γ^2 (ω̄ - ω̄*)^T (ω̄ - ω̄*). (28)

Considering ū = ū*, (28) reduces to

x̄^T Q x̄ + ū*^T R ū* - γ^2 ω̄^T ω̄ + V̇(x̄) = -γ^2 (ω̄ - ω̄*)^T (ω̄ - ω̄*) ≤ 0. (29)

Taking into account (8) and (12), the terms of (29) can be written in the original variables as

x̄^T Q x̄ + ū*^T R ū* = e^{2αt}(x^T Q x + u*^T R u*), γ^2 ω̄^T ω̄ = γ^2 e^{2αt} ω^T ω, (30)

and by considering (8), (29), and (30), the following inequality can be introduced:

e^{2αt}(x^T Q x + u*^T R u*) + V̇(x̄) ≤ γ^2 e^{2αt} ω^T ω. (31)

Applying integration over [0, t_f] to both sides of the inequality (31) results in

∫_0^{t_f} e^{2αt}(x^T Q x + u*^T R u*) dt + V(x̄(t_f)) - V(x̄(0)) ≤ γ^2 ∫_0^{t_f} e^{2αt} ω^T ω dt, (32)

which leads to

∫_0^{t_f} e^{2αt}(x^T Q x + u*^T R u*) dt ≤ γ^2 ∫_0^{t_f} e^{2αt} ω^T ω dt + V(x̄(0)) - V(x̄(t_f)). (33)

Since V(x̄(t_f)) = e^{2αt_f} x(t_f)^T P x(t_f) ≥ 0 for every t_f > 0 and ω(t) ∈ L_2[0, ∞), one can conclude that

∫_0^∞ e^{βt}(x^T Q x + u*^T R u*) dt ≤ γ^2 ∫_0^∞ e^{βt} ω^T ω dt, (34)

in which β = 2α, t_f = ∞, and V(x̄(0)) can be regarded as zero without loss of generality; hence, the second part of the proof is also completed.

Remark 1: Up to this point, the validity of Theorem 1 has been proved, and it has been demonstrated that the ARE (19) solves the H∞ regulation problem.
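When the dynamics are known, the game ARE (19) can be solved directly; one common route is the stable invariant subspace of the associated Hamiltonian matrix. The sketch below uses assumed matrices and an assumed disturbance channel, not this paper's examples:

```python
import numpy as np

# Model-based solution of the game ARE
#   Abar'P + P Abar + Q - P B R^{-1} B' P + (1/γ²) P D D' P = 0,  Abar = A + αI,
# via the stable invariant subspace of the Hamiltonian matrix.
alpha, gamma = 0.5, 10.0
A = np.array([[0.0, 1.0], [1.0, 0.0]])   # assumed unstable plant (eigenvalues ±1)
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])             # assumed disturbance channel
Q, R = np.eye(2), np.array([[1.0]])

Abar = A + alpha * np.eye(2)
S = B @ np.linalg.solve(R, B.T) - D @ D.T / gamma**2   # combined quadratic term

# Hamiltonian matrix; its stable eigenvectors [X1; X2] give P = X2 X1^{-1}.
H = np.block([[Abar, -S], [-Q, -Abar.T]])
w, V = np.linalg.eig(H)
Vs = V[:, w.real < 0]                    # stable invariant subspace (n columns)
X1, X2 = Vs[:2, :], Vs[2:, :]
P = np.real(X2 @ np.linalg.inv(X1))
P = (P + P.T) / 2                        # symmetrize against round-off

# The ARE residual should be numerically zero.
res = Abar.T @ P + P @ Abar + Q - P @ S @ P
assert np.linalg.norm(res) < 1e-6
```

The gains then follow as K_u = -R^{-1}B^T P and K_ω = γ^{-2}D^T P; the next section removes the need for the system matrices altogether.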
Therefore, all eigenvalues of the closed-loop system will be located to the left of the line s = -α in the s-plane, i.e., max{Re[λ(A + BK_u)]} < -α, thus guaranteeing the convergence rate of the regulation problem. However, (19) depends on the system dynamics. In this regard, a method that is independent of the system dynamics will be proposed in the next section.

IV. ONLINE OFF-POLICY MODEL-FREE INTEGRAL REINFORCEMENT LEARNING
This section focuses on developing an online off-policy model-free integral reinforcement learning (IRL) algorithm. This algorithm enables the ARE to be solved using measured data in real time without needing prior knowledge of the system dynamics.
In order to develop the IRL algorithm, (10) can be rewritten as

x̄̇(t) = Āx̄(t) + Bū_k(t) + Dω̄_k(t) + B(ū(t) - ū_k(t)) + D(ω̄(t) - ω̄_k(t)), (35)

where ū(t) is the behavior policy applied to the system for data generation, ω̄(t) is the actual disturbance affecting the system, ū_k(t) = K_{u,k}x̄(t) is the control policy, ω̄_k(t) = K_{ω,k}x̄(t) is the disturbance policy, and, lastly, k indicates the iteration of the algorithm, as it is a recursive algorithm. Data generated by the behavior policy are used to evaluate and update the control and disturbance policies. In addition, the value function can be rewritten as

V_k(x̄(t)) = x̄(t)^T P_k x̄(t), (36)

and its differentiation along the trajectories of (35) can be written as

V̇_k(x̄(t)) = x̄̇(t)^T P_k x̄(t) + x̄(t)^T P_k x̄̇(t), (37)

which can be further represented as

V̇_k(x̄) = x̄^T [(Ā + BK_{u,k} + DK_{ω,k})^T P_k + P_k (Ā + BK_{u,k} + DK_{ω,k})] x̄ + 2(ū - ū_k)^T B^T P_k x̄ + 2(ω̄ - ω̄_k)^T D^T P_k x̄. (38)

The Bellman equation (16) can be rewritten for iteration k as

(Ā + BK_{u,k} + DK_{ω,k})^T P_k + P_k (Ā + BK_{u,k} + DK_{ω,k}) + Q + K_{u,k}^T R K_{u,k} - γ^2 K_{ω,k}^T K_{ω,k} = 0, (39)

and, with regard to (17) and (18), ū_{k+1}(t) and ω̄_{k+1}(t) can be determined as

ū_{k+1}(t) = K_{u,k+1}x̄(t) = -R^{-1}B^T P_k x̄(t), (40)
ω̄_{k+1}(t) = K_{ω,k+1}x̄(t) = γ^{-2}D^T P_k x̄(t). (41)

Therefore, (38) can be rewritten as

V̇_k(x̄) = -x̄^T (Q + K_{u,k}^T R K_{u,k} - γ^2 K_{ω,k}^T K_{ω,k}) x̄ - 2(ū - ū_k)^T R ū_{k+1} + 2γ^2 (ω̄ - ω̄_k)^T ω̄_{k+1}. (42)

In order to determine the Bellman equation for the IRL algorithm, the integral of (42) over the interval [t, t + Δt] is calculated as follows:

V_k(x̄(t + Δt)) - V_k(x̄(t)) = -∫_t^{t+Δt} [x̄^T (Q + K_{u,k}^T R K_{u,k} - γ^2 K_{ω,k}^T K_{ω,k}) x̄ + 2(ū - ū_k)^T R ū_{k+1} - 2γ^2 (ω̄ - ω̄_k)^T ω̄_{k+1}] dτ. (43)

Now, with the consideration of (8), (12), (35), (40), and (41), one can rewrite (43) in terms of the measured variables x(t), u(t), and ω(t) as

e^{2α(t+Δt)} x(t + Δt)^T P_k x(t + Δt) - e^{2αt} x(t)^T P_k x(t)
= -∫_t^{t+Δt} e^{2ατ} [x^T (Q + K_{u,k}^T R K_{u,k} - γ^2 K_{ω,k}^T K_{ω,k}) x + 2(u - K_{u,k}x)^T R K_{u,k+1} x - 2γ^2 (ω - K_{ω,k}x)^T K_{ω,k+1} x] dτ. (44)

Now that the preliminaries of the online off-policy model-free IRL algorithm are complete, the step-by-step procedure of the algorithm is given in Algorithm 1. A graphic demonstration of Algorithm 1 is also depicted in Fig. 1 as a flowchart diagram. Algorithm 1 employs the IRL Bellman equation (44) to iteratively solve the Bellman equation (16). The online implementation of Algorithm 1 employs least squares and is similar to the practice described in [28]; it is therefore omitted.

Algorithm 1: Online Off-Policy Model-Free IRL
Step 1. Regarding (10), apply an initial stabilizing control input u_0(t) for collecting the required data (system states) in N sample times.
Step 2. Solve the IRL Bellman equation (44) by least squares for P_k, K_{u,k+1}, and K_{ω,k+1}.
Step 3. If F_k < η, stop; otherwise, set k = k + 1 and return to Step 2.
Step 4. Set u*(t) = u_k(t) and ω*(t) = ω_k(t) on convergence.
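For intuition, the iteration that Algorithm 1 performs from measured data coincides, in the model-based limit, with a Kleinman-style policy iteration on the shifted system: evaluate the current policy pair through a Lyapunov equation, then improve both policies. The sketch below is that model-based analogue under assumed matrices; the model-free algorithm replaces the Lyapunov solve with the least-squares IRL Bellman equation so that the system matrices are never used explicitly:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Model-based analogue of the policy iteration behind Algorithm 1.
# All numerical values are illustrative assumptions.
alpha, gamma = 0.5, 10.0
A = np.array([[0.0, 1.0], [1.0, 0.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q, R = np.eye(2), np.array([[1.0]])
Abar = A + alpha * np.eye(2)

Ku = np.array([[-5.0, -3.0]])  # initial stabilizing control gain (assumed)
Kw = np.zeros((1, 2))          # initial disturbance policy

for k in range(25):
    Ak = Abar + B @ Ku + D @ Kw
    # Policy evaluation: Ak'P + P Ak + Q + Ku'R Ku - γ² Kw'Kw = 0.
    Qk = Q + Ku.T @ R @ Ku - gamma**2 * (Kw.T @ Kw)
    P = solve_continuous_lyapunov(Ak.T, -Qk)
    # Policy improvement, cf. (17)-(18).
    Ku_new = -np.linalg.solve(R, B.T @ P)
    Kw_new = (D.T @ P) / gamma**2
    if np.linalg.norm(Ku_new - Ku) + np.linalg.norm(Kw_new - Kw) < 1e-10:
        break
    Ku, Kw = Ku_new, Kw_new

# The converged P solves the game ARE.
res = Abar.T @ P + P @ Abar + Q \
      - P @ B @ np.linalg.solve(R, B.T) @ P + (P @ D @ D.T @ P) / gamma**2
assert np.linalg.norm(res) < 1e-6
```

The stopping rule mirrors Step 3 of Algorithm 1: iterate until successive gains change by less than a tolerance.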
Theorem 2: Algorithm 1 ensures the convergence of P_k to P*, in which P* is the unique solution to the ARE (19).
Proof of Theorem 2: The proof follows the practice described in [39] and is therefore omitted.

V. NUMERICAL EVALUATIONS
In this section, the effectiveness of the proposed approach is illustrated through two examples: a four-state stable system and a two-state unstable system. Algorithm 1, which is a model-free algorithm, is used to solve the H∞ regulation problem, and the results are compared with the model-based solution obtained by (19) and (20).
This paper considers two different types of disturbance: (i) a sawtooth waveform with a bounded amplitude range and constant frequency, and (ii) a sinusoidal waveform with increasing amplitude and frequency. Both are plotted in Fig. 2.
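Disturbances of these two shapes can be generated as follows; the amplitudes and frequencies here are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np
from scipy.signal import sawtooth

t = np.linspace(0.0, 10.0, 1001)

# (i) Sawtooth: bounded amplitude, constant frequency (assumed 1 Hz, amplitude 1).
w_saw = sawtooth(2 * np.pi * 1.0 * t)

# (ii) Sinusoid with linearly growing amplitude and frequency (chirp-like).
w_sin = (0.1 * t) * np.sin(2 * np.pi * (0.5 + 0.1 * t) * t)

assert np.max(np.abs(w_saw)) <= 1.0       # bounded amplitude
assert np.all(np.abs(w_sin) <= 0.1 * t)   # linearly growing envelope
```

Either signal can be fed as ω(t) when simulating the closed-loop system to compare disturbance attenuation across values of α.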

A. AN EXAMPLE OF STABLE SYSTEMS
The first example is a stable system with its state-space matrices given as follows:  The eigenvalues of this system are −14.8474, −0.5260 ± 3.2531i, and −1.4967. Also, the weight matrices for this example are Q = I_{4×4} and R = I_{1×1} with γ = 10. In order to show the efficacy of the proposed IRL algorithm, the model-based and model-free methods are compared through the solutions each provides to the H∞ regulation problem. Besides, two values of α are considered for this system in order to compare the effect of α on the speed of regulation and disturbance attenuation. Fig. 3 shows the convergence of the model-free solutions to the model-based solutions for the two values of α. More details are provided in Table 1. According to this table, the model-free method achieves the same solution as the model-based method (the table shows the solution achieved in the 10th iteration). Fig. 4 and Fig. 5 depict the response of the states for different values of α under both disturbances. Regarding Table 1, Fig. 4, and Fig. 5, it can be seen that the eigenvalues always lie to the left of the line s = −α, and that increasing α increases the speed of regulation and attenuates the disturbance more effectively.

B. AN EXAMPLE OF UNSTABLE SYSTEMS
The second example is an unstable system with its state-space matrices given as follows:  The eigenvalues of this system are −1 and +1. Also, the weight matrices for this example are Q = I_{2×2} and R = I_{1×1} with γ = 10. Similar to the previous example, the same comparison between the model-based and model-free methods is provided here with two different values of α. Fig. 6 shows the convergence of the model-free solutions to the model-based solutions for the two values of α. Model-based and model-free solutions are also compared in Table 2 (the table shows the solution achieved in the 10th iteration). Fig. 7 and Fig. 8 depict the response of the states for different values of α under both disturbances. Furthermore, it can be observed that the eigenvalues always lie to the left of the line s = −α, and that increasing α increases the speed of regulation and attenuates the disturbance more effectively.

VI. CONCLUSION AND FUTURE WORK
In this paper, the problem of designing a model-free H∞ controller for the state regulation of continuous-time linear systems with a guaranteed convergence rate in the presence of disturbance was studied. Both the regulation and disturbance attenuation problems were addressed in a single unified cost function by formulating them as a two-player zero-sum game optimization problem. The optimization problem was then solved using the associated algebraic Riccati equation, which provides a model-based solution. A novel model-free integral reinforcement learning algorithm was developed to learn the solution in real time using no prior knowledge of the system dynamics. The results show that the algorithm substantially attenuates the adverse effect of the disturbance on the system performance and also guarantees a predefined rate of regulation. The efficacy of the proposed method for both stable and unstable systems was verified numerically. The approach proposed in this paper may serve as an effective tool for studying optimal control design with a guaranteed convergence rate in a wide range of applications, such as robotics, industrial manufacturing systems, and process control. In future work, this development could be extended to systems with input delay. In addition, developing a similar method for systems with bounded inputs or constrained states could be an important area of research.

BEHZAD MOSHIRI (Senior Member, IEEE) received the M.Sc. and Ph.D. degrees in control systems engineering from the University of Manchester Institute of Science and Technology (UMIST), in 1987 and 1991, respectively. He is currently a Full Professor in control systems engineering at the School of ECE, University of Tehran. He has also been an Adjunct Professor with the Department of ECE, University of Waterloo, Canada, since 2014. He has been working as the Chairperson of the IEEE Control System Chapter in the IEEE Iran Section, since December 2018.
He has been a member of the International Society of Information Fusion (ISIF), since 2002, and has also been a member of the Waterloo AI Institute, since 2019. He is the author/coauthor of more than 360 articles, including more than 130 journal articles and more than 20 book chapters. His research interests include advanced industrial control, advanced instrumentation systems, data fusion theory, and its applications in areas such as robotics, process control, mechatronics, information technology (IT), intelligent transportation systems (ITS), bioinformatics, and financial engineering.

His research interests include control theory and digital signal processing. Particularly, he is interested in adaptive signal processing and control, robust control, linear and nonlinear system identification, estimation theory, and evolutionary computing and optimization, with applications in active noise and vibration control systems and robotics. He was a recipient of the European Research Consortium for Informatics and Mathematics (ERCIM) and Humboldt research awards, in 2010 and 2011, respectively. He is also a fellow of the Higher Education Academy. His research is funded by different councils and industries in the U.K., such as the Engineering and Physical Sciences Research Council, Sellafield Ltd., the National Nuclear Laboratory, and the Nuclear Decommissioning Authority. He is currently serving on the IFAC Technical Committees ''Adaptive and Learning Systems'' and ''Modeling, Identification, and Signal Processing''.