Online Adaptive Critic Learning Control of Unknown Dynamics With Application to Deep Submergence Rescue Vehicle

As a powerful tool for robust controller design of nonlinear systems, robust adaptive dynamic programming (RADP) methods require an initial admissible control and prior knowledge of the disturbance to be effective. Active disturbance attenuation (ADA), one of the most effective approaches for providing robustness to uncertainties, has rarely been considered in the RADP literature. To combine ADA with RADP, a neural-network identifier was first developed to approximate the plant dynamics and the imposed external disturbance. The system states were extended with the approximated disturbance to establish ADA actor-critic learning. To relax the initial admissible control constraint, a novel auxiliary system was constructed based on the identifier dynamics. Theoretical analysis and simulations on an unstable nonlinear system show that the approximated control law derived from the auxiliary system and a newly proposed cost function guarantees asymptotic stability of the original system. Simulations and comparisons with other model-free control techniques demonstrate the excellent performance and robustness of the proposed method. The applicability of the proposed method was validated by applying it to trajectory tracking control of a deep submergence rescue vehicle.


I. INTRODUCTION
In the past decade, adaptive dynamic programming (ADP), wherein adaptive parameter identification is combined with conventional dynamic programming to solve nonlinear optimal control problems forward in time, has received increasing attention in adaptive and intelligent control research [1], [2]. For complex nonlinear systems, it is difficult to derive an analytical solution to the Hamilton-Jacobi-Bellman (HJB) equation. Neural networks (NNs) and fuzzy systems are generally incorporated as intelligent components for value approximation [3], [4]. ADP is a promising approach that provides optimal control solutions for complex tasks and has been applied effectively in robotic manipulation [4]-[6], multi-agent systems [7], [8], and power systems [9], [10]. However, the identifiers respond slowly to parameter variations of the plant.
The associate editor coordinating the review of this manuscript and approving it for publication was Jin-Liang Wang.

Vrabie et al. proposed the integral reinforcement learning (IRL) algorithm to determine the solution of the HJB equation for linear [11] and nonlinear [12] systems without requiring knowledge of the state-transition dynamics. Modares et al. applied an experience replay technique to speed up the convergence of IRL [13]. Li et al. applied the IRL method to solve the H∞ control problem for systems with unknown dynamics [14]. Palanisamy et al. proposed continuous-time Q-learning to solve the optimal control problem of systems with completely unknown dynamics [15]. Modares et al. extended the ADP method to optimal tracking control by augmenting the control plant with reference trajectory dynamics [16]. Wen et al. proposed a novel actor-critic RL method for nonlinear system tracking control [17]. Vamvoudakis [18] and Sahoo et al. [19] applied ADP technology to event-triggered control, which significantly improves the efficiency of in-system communications. Yang et al. provided solutions to event-triggered robust control of continuous-time nonlinear systems [20].

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

The abovementioned ADP methods, however, rarely considered dynamical uncertainties. Real systems are generally subject to parameter variations and external disturbances. To deal with these adverse effects, ADP and RL methods can be applied to solve the H∞ control problem [20], [21], which is formulated as a two-player zero-sum game: the optimal controller minimizes the cost function, whereas a worst-case disturbance generator maximizes it. The H∞ optimal control method aims to find the Nash equilibrium point by solving the Hamilton-Jacobi-Isaacs (HJI) equation. Modares et al.
proposed an on-policy IRL algorithm for solving the HJI equation without requiring the drift dynamics of the control plant [16]. Using an NN-based actor-critic structure, many weight update laws have been proposed to minimize the Bellman error. The on-policy IRL method requires the imposed disturbance to be adjustable online, which is very difficult to implement in practice. Luo et al. proposed an off-policy RL method to solve the H∞ control problem of nonlinear systems [21]; however, the policy iteration cannot be implemented online. Zhang et al. presented an online algorithm for obtaining the HJI solution for discrete-time nonlinear system control [22]. Wang et al. proposed a model-free method by introducing a system dynamics identifier, in which the control policy is updated online with guaranteed system stability. However, even for off-policy methods, the H∞ solution for robust control requires the imposed disturbance to be known, which means that an additional disturbance observer is necessary for algorithm implementation.

Robust adaptive dynamic programming (RADP) methods achieve intelligent robust control from a different perspective. In RADP, the utility function and the optimal control policy for the nominal plant are designed based on nonlinear control theory, such as robust redesign, back-stepping, and the small-gain theorem [23]-[26]. Problem-transformation methods are usually used. Dipak et al. designed a controller to optimize a cost function that includes penalties on constrained control efforts and a maximum bound on uncertainties [25]. Ding et al. found that weighting the upper bound of the uncertainty with a scalar larger than the maximal eigenvalue of the control-effort weighting matrix guarantees uniformly ultimately bounded (UUB) stability of the uncertain system [26]. Jiang et al. used a model-free method for cost-function estimation and added a small gain to the control policy, which was proven to be robustly optimal [23].
The RADP method has also been applied to the decentralized optimal control of large-scale systems [27] and the output feedback control of interconnected systems [28]. RADP methods aim to guarantee the stability of the uncertain plant by approximating a value function for the nominal plant; in this robust optimal design, only the upper bound of the uncertainty is required. To approximate the robust optimal policy, ADP methods require the system to meet two conditions: the initial admissible control condition and the persistence of excitation condition. These conditions are generally difficult to satisfy and narrow the scope of ADP applications. To address this problem, Dierks et al. designed a single-estimator-based control scheme for solving the HJI equation without requiring an initial admissible control [29]. Chowdhary et al. introduced a concurrent learning technique to relax the persistence of excitation condition [30]. Yang et al. proposed a robust control strategy for nonlinear systems subject to unmatched uncertainties [31]; they developed a new critic learning algorithm that relaxes both conditions.

In summary, the RADP method requires a nominal model and knowledge of the structure of the uncertainties, both of which are difficult to obtain in many real applications. Some ADP-based H∞ methods are completely model-free and can even be applied online. However, initial admissible control and disturbance laws are necessary to guarantee convergence of the system states during the learning process, a requirement that is very difficult to satisfy when the plant dynamics are completely unknown. Therefore, an online model-free intelligent critic control design method is needed that removes both the initial admissible control and persistence of excitation requirements.
In this study, an ADP-based model-free robust optimal control scheme is developed for a class of continuous-time non-affine nonlinear systems with unmatched uncertainty. A NN system identifier is first developed to reconstruct the system dynamics along with the uncertainty effects. Based on the identifier dynamics, an auxiliary system is constructed, and a critic network is employed to approximate a newly designed value function using the currently observed system states and the approximated disturbance. A concurrent learning technique is integrated with the critic update law to relax the persistence of excitation condition. We show that for a given set of critic weights, the approximated optimal controller of the auxiliary system guarantees the asymptotic stability of the uncertain system. We also show that, when applied to the original unknown plant, the optimal control for the auxiliary system achieves optimality for a specified value function. Using Lyapunov's method, all signals in the closed-loop system are shown to be stable in the sense of UUB. The contributions of this paper are summarized as follows:

1) A novel NN-based non-affine dynamics identifier is introduced. In addition to approximating the system dynamics, the external disturbance is approximated for active disturbance attenuation. The error dynamics of the proposed identifier are proven to be asymptotically stable, and no prior knowledge about the external disturbance is required.

2) A critic-learning-based robust control scheme is presented. The designed control system is robust not only to external disturbances but also to the identifier approximation error and the critic network value prediction error. Thus, the proposed method guarantees system stability during the training process without requiring an initial admissible controller. The proposed method is completely model-free and can be implemented online for critic learning control.
3) Inspired by research on ADP-based linear quadratic tracking problems [32], [33], we extend the system states with the approximated external disturbance and introduce an intermediate auxiliary system to generate data for robust critic learning. To the best of our knowledge, this is the first research article that combines robust design and active disturbance attenuation in an ADP-based control scheme.

The remainder of this paper is organized as follows. In Section II, the robust control problem is formulated, and the aim of the critic design is introduced. In Section III, the system identifier is presented with a stability analysis. In Section IV, we describe the auxiliary system and discuss the associated critic learning rule. Section V presents the main results of the new robust critic design with proofs of stability and optimality. Simulation results are given in Section VI to illustrate the effectiveness and applicability of the proposed method. Finally, conclusions are drawn in Section VII.

II. PROBLEM FORMULATION
Consider the following nonlinear non-affine system subjected to an unmatched disturbance:

\dot{x} = f(x, u) + w \qquad (1)

where x ∈ Ω ⊂ R^n is the state vector, u ∈ R^m is the control input vector, f(x, u) ∈ R^n is an unknown function that is continuously differentiable with respect to x and u, and w ∈ R^n is an uncertain function representing the disturbance effects.
x 0 = x(0) denotes the initial system state.
The goals of this research are summarized below:
(i) Design a dynamics identifier to approximate the unknown system dynamics along with the external disturbance.
(ii) Considering the existence of identification error and external disturbance, design an adaptive critic control law such that system (1) is asymptotically stable for any intermediate control policy applied during the on-policy training process.
(iii) Find a value function such that the controller obtained in (ii) is a solution to the optimal control problem defined in (2).

Based on the work of [36], the robust control problem (2) can be solved by solving the optimal control problem of an auxiliary system. Since the identifier is a fully known (certain) dynamical system, we construct the auxiliary system based on the identifier dynamics.
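To make the problem setup concrete, the sketch below simulates a system of the form (1) under a state-dependent disturbance of the kind used later in Section VI. The specific f, the θ values, and the test policy are illustrative assumptions, not the paper's benchmark plant.

```python
import numpy as np

# Hypothetical instance of system (1), x_dot = f(x, u) + w: the concrete f,
# the theta values, and the test policy are illustrative stand-ins only.
def f(x, u):
    # non-affine in u through the sin(u)**2 term
    return np.array([-x[0] + x[1],
                     -0.5 * (x[0] + x[1]) + x[1] * np.sin(u) ** 2 + u])

def w(x, theta1=0.6, theta2=-0.3):
    # disturbance of the form used in Section VI: theta1 * x1 * sin(theta2 * x2)
    return np.array([0.0, theta1 * x[0] * np.sin(theta2 * x[1])])

def simulate(x0, policy, dt=1e-3, T=5.0):
    x, traj = np.array(x0, dtype=float), []
    for _ in range(int(T / dt)):
        x = x + dt * (f(x, policy(x)) + w(x))  # forward-Euler step
        traj.append(x.copy())
    return np.array(traj)

traj = simulate([1.0, -1.0], policy=lambda x: -2.0 * x[1])
print(traj.shape)  # (5000, 2)
```

Because the disturbance is unmatched (it enters the state equation directly, not through the control channel), a stabilizing policy alone cannot cancel it; this is what motivates estimating w explicitly in Section III.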

III. IDENTIFIER DESIGN WITH STABILITY ANALYSIS
In this section, a three-layer NN is applied to reconstruct the non-affine system dynamics from the system input-output data. Using extended state observer theory [38], we design an observer that estimates the disturbance and uncertainty using only measurable signals. Let the number of neurons in the hidden layer be denoted by l_m. The dynamics of system (1) can be rewritten as

\dot{x} = \omega_m^T \sigma(\bar{z}) + w + \varepsilon_m \qquad (3)

where w ∈ R^n is the external disturbance added to the system, ω_m ∈ R^{l_m×n} is the ideal weight matrix between the hidden and output layers, σ(·) is a continuously differentiable and monotonically increasing activation function, and z̄ = v_m^T z ∈ R^{l_m} is the input signal of the hidden layer, where v_m ∈ R^{(n+m)×l_m} is the ideal weight matrix between the input and hidden layers.
z = [x^T, u^T]^T ∈ R^{n+m} is the input vector, and ε_m ∈ R^n is the reconstruction error. The differentiable activation function is Lipschitz continuous; for any ξ_a, ξ_b ∈ R^n, there exists a constant λ_0 > 0 such that the following inequality holds:

\|\sigma(\xi_a) - \sigma(\xi_b)\| \le \lambda_0 \|\xi_a - \xi_b\| \qquad (4)

Note that σ(·) applies the same operation to every element of its input. For simplicity, the input-hidden weight matrix v_m is kept constant and only the hidden-output weight matrix is tunable; we randomly initialize v_m and keep it fixed during training. The NN identifier dynamics are represented as

\dot{\hat{x}} = \hat{\omega}_m^T \sigma(\hat{z}) + \hat{w} + \eta_1 \tilde{x} \qquad (5)

where ω̂_m is the current estimate of the ideal hidden-output weight matrix ω_m, and ẑ = v_m^T [x̂^T, u^T]^T denotes the input signal of the hidden layer. Let x̃ = x − x̂ and w̃ = w − ŵ represent the state and disturbance approximation errors, respectively. Then w can be approximated by the following observer:

\dot{\hat{w}} = \eta_2 \tilde{x} \qquad (6)

where η_1 > 0 and η_2 ≤ 1 are real scalar design parameters. As described in (6), the external disturbance is approximated by integrating the state approximation error scaled by the design parameter η_2. The current approximated system states and the system inputs are concatenated as the input vector of the NN. Then the NN output, the approximated disturbance, and the approximation-error feedback term η_1x̃ are summed together as an approximation of the time derivative of the system states. We obtain the approximated system states, disturbance, and NN weights by integrating (5), (6), and (10). The scheme of the NN identifier is displayed in Fig. 2.
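The identifier and disturbance observer described above can be sketched as follows. The tanh activation, the dimensions, and the Euler integration are implementation assumptions; η₁ = 6 and η₂ = 1 follow the Section VI experiments. The NN weights are deliberately left untuned here, so the η₁ feedback and the integral disturbance estimate do all the tracking.

```python
import numpy as np

# Sketch of the NN identifier with extended-state disturbance observer in the
# spirit of (5)-(6): tanh activation, dimensions, and Euler integration are
# implementation assumptions; eta1 = 6, eta2 = 1 follow Section VI.
n, m_in, l_m = 2, 1, 10                       # state dim, input dim, hidden neurons
rng = np.random.default_rng(0)
v_m = rng.uniform(0.0, 1.0, (n + m_in, l_m))  # fixed input-hidden weights
omega_hat = np.zeros((l_m, n))                # tunable hidden-output weights
x_hat, w_hat = np.zeros(n), np.zeros(n)
eta1, eta2 = 6.0, 1.0

def identifier_step(x, u, dt):
    """One forward-Euler step of the state/disturbance observer."""
    global x_hat, w_hat
    x_tilde = x - x_hat                               # state estimation error
    z_hat = np.concatenate([x_hat, np.atleast_1d(u)])
    sigma = np.tanh(v_m.T @ z_hat)                    # hidden-layer output
    x_hat = x_hat + dt * (omega_hat.T @ sigma + w_hat + eta1 * x_tilde)
    w_hat = w_hat + dt * eta2 * x_tilde               # integral disturbance estimate
    return x_tilde

# track a test trajectory x(t) = [sin t, cos t] with untuned NN weights
dt = 1e-3
for k in range(4000):
    t = k * dt
    err = identifier_step(np.array([np.sin(t), np.cos(t)]), u=0.0, dt=dt)
print(np.linalg.norm(err) < 0.5)  # True: the observer alone tracks the states
```

The integral term plays the role of the extended state: whatever residual dynamics the NN and feedback term miss accumulates into ŵ.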
Using (3) and (6), the error dynamics of the proposed observer are given by

\dot{\tilde{x}} = \tilde{\omega}_m^T \sigma(\hat{z}) + \omega_m^T \tilde{\sigma} + \tilde{w} - \eta_1 \tilde{x} + \varepsilon_m \qquad (7a)
\dot{\tilde{w}} = \dot{w} - \eta_2 \tilde{x} \qquad (7b)

where ω̃_m = ω_m − ω̂_m and σ̃ = σ(z̄) − σ(ẑ). Note that when the number of nodes in the hidden layer l_m is large enough, the NN identification error can be made arbitrarily small. Moreover, (7a) indicates that ε_m is closely linked with x̃, and (7b) indicates that ẇ is closely linked with w̃. It is therefore reasonable to assume that ε_m is bounded by a function of x̃, ẇ is bounded by a function of w̃, and w̃ is bounded by a function of x̃. We give two assumptions often used in the ADP literature [22], [34], [35], which are helpful for analyzing the stability of the proposed identifier.

Assumption 2: According to (7), the system dynamics identification error ε_m, the time derivative of the external disturbance ẇ, and the disturbance approximation error w̃ satisfy

\|\varepsilon_m\| \le \lambda_{\varepsilon_m}\|\tilde{x}\|, \qquad \|\dot{w}\| \le \lambda_{\dot{w}}\|\tilde{w}\|, \qquad \|\tilde{w}\| \le \lambda_{\tilde{w}}\|\tilde{x}\| \qquad (8)

where λ_{ε_m} > 0, λ_{w̃} > 0, and λ_{ẇ} > 0 are constant scalars.
Theorem 1: Select suitable values of η_1 and η_2 such that µ < 0, and let the NN weights be tuned by

\dot{\hat{\omega}}_m = \Gamma \sigma(\hat{z}) \tilde{x}^T \qquad (10)

where Γ is a positive-definite adaptation gain matrix. Then the state approximation error dynamics are asymptotically stable; that is, x̃ converges to zero.

Proof: Choose a Lyapunov candidate L_1 composed of the error terms, as given in (11). Taking the derivative of L_11 along the trajectory of the error dynamics (7), we obtain (12). Considering Assumptions 1 and 2 and using Young's inequality yields (13). According to (4), σ̃^Tσ̃ is bounded by a term proportional to x̃^Tx̃. Combining (12) and (13), one gets (14). Using the adaptive criterion (10), the trace property tr(AB) = tr(BA), and the fact that the ideal weights are constant, we obtain (15). Along (14) and the inequalities in Assumption 2, one gets (16). By selecting suitable η_1 and η_2 such that µ < 0, one gets L̇_1 ≤ 0. Based on the standard Lyapunov extension theorem, the identifier error dynamics are asymptotically stable. According to (10) and the fact that \dot{\tilde{\omega}}_m = -\dot{\hat{\omega}}_m, we know that \dot{\tilde{\omega}}_m is a function of x̃, and there exists a constant λ_{ω̃_m} > 0 such that (17) holds. According to (9) and (17), as the state approximation error converges to zero, the identifier weight error and the disturbance approximation error also converge to zero. This completes the proof.

Remark 1: The proposed system dynamics identifier is asymptotically stable in the presence of a disturbance. It also acts as a high-performance disturbance observer that requires no prior knowledge of the external disturbance. With these advantages, the identifier is widely applicable, and the proposed control method achieves strong external disturbance attenuation using the approximated disturbance.
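A gradient-style weight tuning rule consistent with the description above can be sketched as below. The exact form and scaling of criterion (10) may differ in the paper, so treat ω̂̇_m = Γσ(ẑ)x̃ᵀ as an assumed form; Γ ≈ 15·I roughly matches the Section VI choice Γ⁻¹ = 0.067.

```python
import numpy as np

# Assumed gradient-style form of the tuning criterion (10):
# omega_hat_dot = Gamma * sigma(z_hat) * x_tilde^T. Gamma ~ 15 roughly matches
# the Section VI choice Gamma^{-1} = 0.067; the paper's exact scaling may differ.
Gamma = 15.0

def weight_update(omega_hat, sigma, x_tilde, dt):
    """One Euler step of the hidden-output weight adaptation."""
    return omega_hat + dt * Gamma * np.outer(sigma, x_tilde)

# toy check: with a fixed regressor, the update drives omega_hat^T sigma
# toward a fixed target (the residual plays the role of x_tilde)
rng = np.random.default_rng(1)
sigma = np.tanh(rng.normal(size=10))
target = np.array([0.3, -0.2])
omega_hat = np.zeros((10, 2))
for _ in range(2000):
    omega_hat = weight_update(omega_hat, sigma, target - omega_hat.T @ sigma, dt=1e-2)
print(np.round(omega_hat.T @ sigma, 3))  # converges to approximately [0.3, -0.2]
```

The update is correlation-based: each weight column moves in proportion to the product of the hidden-layer output and the corresponding state error component, which is exactly the term that makes the cross terms in the Lyapunov derivative cancel.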

IV. ADAPTIVE CRITIC CONTROL OF THE AUXILIARY SYSTEM
In this section, based on the identifier dynamics, we present an auxiliary system and the associated HJB equation. A critic NN is used to approximate a solution for the HJB equation.

A. HJB EQUATION FOR THE AUXILIARY SYSTEM
Since the measured system state is a part of the observer's inputs, we define the auxiliary system as

\dot{\hat{x}} = \hat{\omega}_m^T \sigma(\hat{z}) + \hat{w} + \eta_1 v_1, \qquad \dot{\hat{w}} = \eta_2 v_2 \qquad (18)

where the auxiliary control inputs v_1 and v_2 are used to handle the model uncertainty and the external disturbance. The augmented system state and the augmented control input are defined as s = [x̂^T, ŵ^T]^T and v = [v_1^T, v_2^T]^T, respectively. Thus, (18) can be rewritten as

\dot{s} = F(s, u) + Bv \qquad (19)

where F(s, u) collects the identifier drift terms and B = diag(η_1 I_n, η_2 I_n). Associated with the auxiliary system (19), the value function is described by (20), where β_1, β_2, and α are positive design parameters, ẇ_M(x) is the time derivative of the approximated disturbance, and Q(x) is a symmetric positive-definite function of the system states. The optimal value function is defined as [37]

V^*(s) = \min_{v \in \mathcal{A}(\Omega)} V(s) \qquad (22)

where A(Ω) is the set of admissible controls defined on Ω. According to Abu-Khalaf and Lewis [39], the optimal value function satisfies the Lyapunov equation (23). Define the Hamiltonian with respect to V*(s), u, and v as in (24). Then the optimal value function can be obtained by solving (25) with V*(0) = 0. The corresponding optimal control laws that minimize the Hamiltonian in (24) are given by (26), where ĝ(x), derived from ω̂_m^T σ(ẑ), is the approximated system input dynamics. Combining the augmented optimal control laws (26) with the Hamiltonian (24), the HJB equation can be rewritten as (27). In the following sections, the subscripts of ẑ and z̄ denote the source of the control input signal.
Remark 2: The newly proposed auxiliary system has two main advantages: 1) it models the disturbance dynamics, which removes the requirement of prior knowledge about the disturbance; and 2) its input dynamics are completely known, so an actor network is unnecessary and the inner loop of policy iteration can be removed, improving the critic learning efficiency.
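To make the Hamiltonian construction concrete, the following sketch evaluates H = r(s, u, v) + ∇V(s)ᵀṡ for the augmented state s = [x̂ᵀ, ŵᵀ]ᵀ with a quadratic value candidate V(s) = sᵀPs. The utility r and the stand-in dynamics are placeholders for the paper's (21) and (19); β₂ = 180 and α = 0.2 follow the Section VI settings.

```python
import numpy as np

# Structural sketch of the Hamiltonian for the augmented state s = [x_hat; w_hat],
# assuming a quadratic value candidate V(s) = s^T P s and a placeholder utility
# r = Q(x_hat) + alpha*u^2 + beta2*||v||^2; the paper's utility (21) carries
# additional terms.
alpha, beta2 = 0.2, 180.0

def utility(x_hat, u, v):
    Q = x_hat @ x_hat                        # Q(x) chosen here as ||x||^2
    return Q + alpha * float(u) ** 2 + beta2 * (v @ v)

def hamiltonian(P, s, s_dot, x_hat, u, v):
    grad_V = 2.0 * P @ s                     # gradient of V(s) = s^T P s
    return utility(x_hat, u, v) + grad_V @ s_dot

P = np.eye(4)
x_hat, w_hat = np.array([0.5, -0.2]), np.array([0.1, 0.0])
s = np.concatenate([x_hat, w_hat])
s_dot = -0.5 * s                             # stand-in for the auxiliary dynamics (19)
H = hamiltonian(P, s, s_dot, x_hat, u=0.3, v=np.zeros(2))
print(round(H, 4))  # 0.008
```

Solving the HJB equation amounts to finding a value function for which this scalar is driven to zero along optimal trajectories, which is precisely the residual the critic minimizes in the next subsection.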

B. APPROXIMATE SOLUTION TO THE HJB EQUATION
It is difficult to compute an analytical solution of the HJB equation (27). In many studies, policy iteration procedures are employed to solve the HJB equation offline, and an initial admissible control is generally required to guarantee a bounded value function. In this section, we develop a learning law that updates the critic NN online. Convergence and stability of the learning rule are analyzed in subsequent sections.
According to the NN universal approximation property, the optimal value function can be represented by a NN as

V^*(s) = \omega_c^T \sigma_c(s) + \varepsilon_c(s) \qquad (28)

where ω_c ∈ R^{n_c} is the unknown ideal weight vector, σ_c(s) is the feature vector obtained by applying a fixed nonlinear mapping to the augmented state vector, n_c is the number of neurons in the hidden layer, and ε_c(s) ∈ R is the NN reconstruction error. Differentiating V*(s) with respect to the augmented state, it follows that

\nabla V^*(s) = \nabla \sigma_c(s)^T \omega_c + \nabla \varepsilon_c(s) \qquad (29)

The corresponding optimal control laws are given by (30). Since the ideal weights are unknown, we introduce a critic network to approximate the value function:

\hat{V}(s) = \hat{\omega}_c^T \sigma_c(s) \qquad (31)

where ω̂_c is the approximated NN weight vector. Meanwhile, the approximated optimal control policy can be formulated as in (32). For a control policy pair (u, v), applying the NN expression to the Hamiltonian yields (33). According to (33) and the Lyapunov equation (23), we obtain (34).
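A minimal realization of the critic features: with quadratic polynomial components of the augmented state (the choice used in Section VI), a 4-dimensional s yields exactly n_c = 10 features. The numerical Jacobian is an illustrative substitute for the analytic ∇σ_c(s) used in the update law.

```python
import numpy as np

# Quadratic polynomial critic features for the augmented state, mirroring the
# Section VI choice; the numerical Jacobian stands in for the analytic
# gradient of sigma_c.
def sigma_c(s):
    s = np.asarray(s, dtype=float)
    i, j = np.triu_indices(len(s))
    return s[i] * s[j]                        # all monomials s_i * s_j, i <= j

def grad_sigma_c(s, eps=1e-6):
    s = np.asarray(s, dtype=float)
    base = sigma_c(s)
    J = np.zeros((len(base), len(s)))
    for k in range(len(s)):
        d = np.zeros(len(s)); d[k] = eps
        J[:, k] = (sigma_c(s + d) - base) / eps   # finite-difference column
    return J

s = np.array([0.5, -0.2, 0.1, 0.0])           # s = [x_hat; w_hat], 4-dimensional
print(len(sigma_c(s)))                        # 10, matching n_c = 10 in Section VI
omega_c_hat = np.zeros(10)                    # critic weights, initialized to zero
V_hat = omega_c_hat @ sigma_c(s)              # V_hat(s) = omega_c_hat^T sigma_c(s)
```

With zero initial weights, V̂ and hence the derived control are identically zero, which is exactly the inadmissible initial policy exercised in the Section VI experiments.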

Replacing ω_c in (34) with ω̂_c, we derive the approximated Hamiltonian as in (35). Define the critic approximation error as e_c = Ĥ(s, V̂, u, v).
The relationship between the true and approximated Hamiltonians is given by (36). The objective of critic learning is to find the critic weights that minimize the squared-error index E_c = 0.5 e_c^2. To learn the optimal control policy online, we propose an on-policy learning algorithm; in the training stage, the approximated control policies are applied. Let the critic regressor be φ = ∇σ_c(s)(F(s, u) + Bv) ∈ R^{n_c}. Then the critic weight vector is adjusted by

\dot{\hat{\omega}}_c = -\alpha_c \frac{\phi}{(1+\phi^T\phi)^2} e_c - \alpha_c \sum_{k=1}^{m} \frac{\phi_k}{(1+\phi_k^T\phi_k)^2} e_{c,k} \qquad (37)

where α_c is the learning rate, m is the length of the memory reserved for concurrent learning, and (1+φ^Tφ)^2 is the gradient normalization term. The approximation error of the critic network is defined as ω̃_c = ω_c − ω̂_c (38). Noting that \dot{\tilde{\omega}}_c = -\dot{\hat{\omega}}_c, based on the critic weight update criterion (37), the error dynamics of the critic weights are given by (39).

A signal flowchart of the proposed robust control scheme is depicted in Fig. 1. As shown in the figure, the identifier approximates the system states, the external disturbance, and the system dynamics using system input-output data. The Hamiltonian (35) is computed based on the auxiliary system dynamics (19), which is driven by the approximated system states and disturbance. The utility function (21) and the approximated value-function gradient (29) are computed online as necessary components for constructing the Hamiltonian. Then the critic network is updated by (37) using the concurrent learning technique to minimize the obtained Hamiltonian. Finally, with the converged critic weights and the identifier NN weights, the optimal control law is derived from (32a).
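The normalized-gradient update with concurrent learning can be sketched as follows. The Bellman-error model here is synthetic (a linear residual with known "ideal" weights) and exists only to exercise the update; α_c = 3.5 and m = 16 follow the Section VI settings.

```python
import numpy as np

# Sketch of a normalized-gradient critic update with concurrent learning:
# the instantaneous sample is combined with m replayed regressors. The
# Bellman-error oracle is synthetic and only exercises the update mechanics.
alpha_c, m = 3.5, 16
rng = np.random.default_rng(2)
omega_true = rng.normal(size=10)              # synthetic ideal critic weights

def bellman_error(omega_hat, phi):
    # toy e_c: zero exactly when omega_hat equals the ideal weights
    return (omega_hat - omega_true) @ phi

def critic_update(omega_hat, phi_now, memory, dt):
    total = np.zeros_like(omega_hat)
    for phi in [phi_now, *memory]:            # current sample + replay pool
        total += phi / (1.0 + phi @ phi) ** 2 * bellman_error(omega_hat, phi)
    return omega_hat - dt * alpha_c * total   # normalized gradient step

memory = [rng.normal(size=10) for _ in range(m)]  # pool spans all weight directions
omega_hat = np.zeros(10)
for _ in range(20000):
    omega_hat = critic_update(omega_hat, rng.normal(size=10), memory, dt=0.05)
print(np.linalg.norm(omega_hat - omega_true) < 1e-2)  # True: weights converge
```

The replay pool is what relaxes the persistence of excitation condition: even if the instantaneous regressor stops exciting some weight directions, the stored samples keep the combined regressor matrix full rank.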

V. MAIN RESULTS
In this section, we demonstrate that system (1) is guaranteed to be asymptotically stable by properly choosing the parameters β_1, β_2, and α. We also prove the optimality of the derived critic control law under a specific value function.
A. ROBUST OPTIMAL CONTROL SCHEME
Before continuing further, we give the following assumptions about the critic network.
Assumption 5: For a given compact set Ω, ĝ(x) and the model identification NN weight approximation error are upper bounded such that ‖ĝ(x)‖ ≤ λ_g and ‖ω̃_m‖ ≤ λ_{ω̃_m}, where λ_g and λ_{ω̃_m} are positive constant scalars.

The following lemma, demonstrated in [31], is given here for the subsequent stability proof.
Lemma 1: Let x = 0 be an equilibrium point of a system with dynamics ẋ = F(x), and let Ω ⊂ R^n be a domain containing x = 0. If there exists a function V(x) ∈ C^1(Ω) such that

W_1(x) \le V(x) \le W_2(x), \qquad \dot{V}(x) \le -W_3(x)

for all x ∈ Ω, where W_k(x) (k = 1, 2, 3) are positive-definite functions, then the system is asymptotically stable and convergent with respect to x = 0.

Theorem 2: Consider the auxiliary system (19) and the value function (20). Let Assumptions 1-5 hold. Choose α ≤ 1/3 and select β_1 according to (40). Then, if the optimal auxiliary control given in (26b) satisfies condition (41) for any critic weights in the set {ω̂_m : ‖ω_m − ω̂_m‖ ≤ λ_{ω̃_m}}, the control law given by (32a) guarantees the asymptotic stability of system (1).

Proof: The optimal value function given in (22) is a positive-definite function defined on Ω. Hence, there exist two class-κ functions γ_1(·) and γ_2(·) such that γ_1(‖x‖) ≤ V*(x) ≤ γ_2(‖x‖) for each x ∈ Ω.
Choose the optimal value function as the Lyapunov function and let W_1(x) = γ_1(‖x‖) and W_2(x) = γ_2(‖x‖). Differentiating V*(x) along the trajectory of the original system, we obtain (43). From (26) and (27) we have (44). Substituting (44) into (43) and letting σ̄ denote σ(ẑ_{u*}), we obtain (45)-(47). According to Assumptions 4 and 5, (48) holds. Substituting (48) into (47) and using (40) yields (49). Combining (49) with (45), we obtain (50). With 0 < α ≤ 1/3 and condition (41) holding, we arrive at (51). Since ρQ(x) ≥ 0, let W_3(x) = ρQ(x). Based on (42) and Lemma 1, we conclude that by applying the approximated control policy, system (1) is asymptotically stable. This completes the proof.

Remark 3: Unlike [20] and [40], where the identifier and controller are designed independently, the proposed critic control design provides robustness to both the identifier approximation error and the critic approximation error. The control system is guaranteed to be asymptotically stable even before the learning processes have converged; hence the requirement of an initial admissible control is relaxed.

Remark 4: Condition (41) cannot be verified directly in [31]; instead, it is verified via numerical simulations. However, observing (32b), v* can be adjusted to satisfy condition (41) by selecting η_1 and η_2 of matrix B in (19). This is a significant advantage of the proposed method.
In what follows, we show that the control law (30a) minimizes a specific value function given by (52).

Lemma 2: If α ≤ 0.5, the function appearing in (52) is positive definite.
Proof: Using Young's inequality, we obtain (53). With α ≤ 0.5, (54) and (55) follow. Together with (41) and (49), this establishes nonnegativity. This completes the proof.

Theorem 3: If α ≤ 0.5, the controller given in (30a) is the solution to the optimal control problem defined by system (1) and the value function (52).
Proof: The Hamiltonian with respect to the newly defined value function J(x) is formulated in (56). Letting J(x) = V*(x) and substituting the first two formulas of (44) into (56), (57) follows. Therefore, J(x) = V*(x) and u = u* constitute the solution to the HJB equation (56); furthermore, the control law given in (30a) solves the optimal control problem defined by (1) and (52). This completes the proof.

B. STABILITY ANALYSIS
We present the stability analysis of the proposed on-policy critic learning process via Lyapunov's method.

Theorem 4: Consider the auxiliary system (19) and the HJB equation (27). Let Assumptions 1-4 hold. Let the approximated optimal control laws û and v̂ be given by (32a) and (32b), respectively, and let the critic network weights ω̂_c be tuned by (37). Then the system states x, the state approximation error x̃, and the critic weight error ω̃_c are UUB, with bounds B_x, B_x̃, and B_ω̃c, respectively.

Proof: Choose a Lyapunov function candidate composed of three terms, L_2 = L_21 + L_22 + L_23. According to the theorems established above, the time derivatives of L_21 and L_22 are given by (59) and (60), respectively. Considering (39), the time derivative of L_23 is given by (61). Given that Q(x) is a symmetric positive-definite function, there exists a positive constant q such that qx^Tx < Q(x); therefore, L̇_21(t) ≤ −ρq‖x‖^2. Meanwhile, using Young's inequality and noting that 1 + φ^Tφ > 1, we obtain (62). According to (62), L̇_2(t) < 0 if one of the inequalities in (58) holds. By applying the Lyapunov extension theorem, we conclude that the estimated state vector, the model identification error, and the critic weight approximation error are UUB with bounds B_x, B_x̃, and B_ω̃c, respectively. This completes the proof.

Remark 5: As shown in Theorem 2, the system states are asymptotically stable under any approximated control policy applied during learning; it is therefore unnecessary to enlarge the parameter q to restrict the system states within a small bound. The bounds on x̃ and ω̃_c can be made arbitrarily small by enlarging design parameters such as α_c, η_1, and η_2.

VI. SIMULATION VERIFICATION
In this section, we first demonstrate the capability of the proposed method to maintain system stability during the critic learning process, using an unstable nonlinear system. We then compare the proposed method with other model-free control schemes to demonstrate its control performance. Finally, we apply the proposed method to trajectory tracking control of an autonomous underwater vehicle to verify the applicability of the developed theoretical results.
To study the robustness of the proposed control scheme, we measured the L_2-gain, an index typically used for quantifying system robustness: it represents the sensitivity of the observable performance output with respect to the disturbance input.
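Assuming the standard trajectory-based definition (the square root of the ratio of accumulated performance-output energy to disturbance-input energy), the L₂-gain can be estimated numerically as below; the performance output z is a placeholder choice.

```python
import numpy as np

# Trajectory-based L2-gain estimate: sqrt of the ratio of accumulated
# performance-output energy to disturbance-input energy. The output z = 0.5*v
# is a placeholder response used only to exercise the computation.
def l2_gain(z_traj, v_traj, dt):
    num = np.sum(z_traj ** 2) * dt        # integral of ||z||^2 dt
    den = np.sum(v_traj ** 2) * dt        # integral of ||v||^2 dt
    return np.sqrt(num / den)

t = np.arange(0.0, 10.0, 1e-3)
v = (3.0 * np.exp(-t) * np.cos(t))[:, None]   # perturbation used in Section VI-B
z = 0.5 * v                                   # placeholder output trajectory
print(l2_gain(z, v, 1e-3))                    # 0.5 for this scaled example
```

A smaller gain means the closed loop passes less disturbance energy through to the performance output, which is the sense in which the comparisons below rank the controllers.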

A. NONLINEAR SYSTEM STABILIZING
Consider a continuous-time nonlinear non-affine system modified from the one used in [31], given in (63), where x = [x_1, x_2]^T ∈ R^2 is the system state vector and u ∈ R is the control input. Here, we used the same disturbance signal w(x) as in [31], i.e., w(x) = θ_1 x_1 sin(θ_2 x_2), where θ_i (i = 1, 2) are randomly selected within the interval [−1, 1]. The key to obtaining robust optimal control of system (63) is to solve the HJB equation associated with the auxiliary system (18) and the utility function (64). The design parameters were selected as η_1 = 6, η_2 = 1, and adaptation gain Γ^{-1} = 0.067. The number of neurons in the hidden layer was n_c = 10. The concurrent-learning memory pool size was chosen as m = 16. The weight matrix ω_m was initialized as a zero matrix; the elements of v_m were initialized as random numbers in the interval [0, 1]. According to (40), we set the utility function parameters as β_2 = 180 and α = 0.2; β_1x̃^Tx̃ and Q(x) were combined such that the term x^Tx is weighted by 80, which specifies the value function to be optimized. The activation function for the critic network was selected as the quadratic polynomial components of the augmented system state. With n_c = 10, the critic weight vector is a 10-dimensional vector written as ω̂_c = [ω̂_c1, ω̂_c2, ..., ω̂_c10]. It is worth mentioning that the dimension of the activation function was determined by computational simulation and depends more on the implementer's design experience than on theoretical analysis. The elements of the initial critic weight vector were all set to zero, which means the initial control policy was u = 0. As verified in [31], this initial control law cannot stabilize the system; here, it was applied to demonstrate that the proposed method relaxes the initial admissible control condition.
In the experiment, the critic network was updated by computing the Hamiltonian (35) online. It is worth noting that the auxiliary control v is involved only in the Hamiltonian computation; it was not applied as a system input. Since the objective of training the critic network is to minimize the Hamiltonian for arbitrary states in Ω, the state samples need not be conditioned on a specific trajectory; we used the system states approximated by the observer as the input to the Hamiltonian. The critic weights were then updated, and the approximated control policy was applied directly to the plant. We ran the simulation for 20 seconds. The system state trajectory and the state trajectory approximated by the observer are depicted in Fig. 3; notably, the state approximation errors converged. The control output during the learning process is depicted in subplot (a) of Fig. 4; Q(x) and ‖v(x)‖^2 are compared in subplot (b) to verify condition (41). The converged identifier weight matrices ω_m and v_m are shown at the bottom of this page. Fig. 6 displays the convergence of the critic weights. The converged critic weight vector was ω_c^T = [0.9848, 0.6003, 0.5402, 0.3299, −0.0794, 0.6677, −0.2765, −0.0555, 0.4303, 0.0272]. It is worth noting that the critic weights were not updated in the first 16 steps; instead, this period was used to prepare data for concurrent learning. With these parameters, the optimal control input u* can be derived online from (26).
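The data-preparation phase described above (no critic updates during the first 16 steps) can be sketched as a simple memory-fill gate; `update_fn` is a stand-in for the actual critic update.

```python
import numpy as np

# Memory-fill gate for concurrent learning, mirroring the Section VI-A runs
# where the critic was not updated during the first 16 steps; update_fn is a
# stand-in for the actual critic update.
m = 16
memory = []

def step(phi, update_fn):
    if len(memory) < m:
        memory.append(phi)            # preparation phase: store the regressor only
        return False
    update_fn(phi, memory)            # learning phase: current sample + pool
    return True

updates = [step(np.full(10, k, dtype=float), lambda p, mem: None) for k in range(20)]
print(updates.count(True))            # 4: only the last 4 of 20 steps updated
```

Gating the update until the pool is full ensures every weight adjustment is backed by a regressor set rich enough to excite all critic weight directions.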
We then examined the effects of parameter uncertainties. To this end, f(x) in the system plant (63) was modified with a perturbation term, where the perturbation is a random number sampled from a normal distribution. With the optimal controller obtained from critic learning applied to the plant, we compared the system state evolution with and without parameter uncertainties. Simulation results are displayed in Fig. 5: the parameter uncertainties caused only minor effects on the state evolution trajectories, showing that the obtained optimal controller is highly robust to parameter uncertainties.

B. NONLINEAR SYSTEM DISTURBANCE ATTENUATION PERFORMANCE STUDY
In this section, we apply the newly designed control scheme and two other promising model-free control schemes, widely recognized as highly effective, to a nonlinear plant and compare their performance. The L2-gain, which is commonly used as an index of a control system's disturbance attenuation performance [35], [40], was measured and compared to demonstrate robustness to disturbances. In addition to the L2-gain, indexes reflecting control performance and learning efficiency were measured and compared to verify the effectiveness of the proposed method. Consider the system used in [40] for numerical simulations, where x1 and x2 are the system states, and u, v ∈ R denote the control input and the perturbation input, respectively. We selected the initial system state as x0 = [1, −0.5]^T. In this experiment, the critic network and the NN identifier share the same structures as those described in Section A. The same external perturbation signal as in [40] was imposed, i.e., v(t) = 3e^(−t) cos t, t > 0. The value function parameters were chosen as β^2 = 180 and α = 0.2, and the learning rate of the critic network was α_c = 3.5. Simulation results are displayed in Figs. 7-11. The approximated and actual system states and the approximated disturbance during training are depicted in Fig. 7. The system state trajectories under the obtained approximate optimal control are compared in Fig. 9. The convergence process of the critic weights is depicted in Fig. 10; the critic weight vector converged to ω_c = [2.6658, 0.5470, 0.0395, 0.4896, 0.4194, −0.6941, 0.6320, 0.1468, 0.1737, 0.2040]^T, and the obtained optimal controller feedback gain was ω_a = [0.9390, 2.1959, 0.5902]^T. Although the critic network structure was more complicated, our method required fewer than 80 iterations to converge, whereas the method proposed in [40] required roughly five times as many iteration steps.
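The L2-gain compared here can be estimated numerically from sampled trajectories as the square root of the ratio of output energy to disturbance energy. The sketch below is generic (the exact penalized output follows [35], [40]); the disturbance signal is the one used in the experiment.

```python
import numpy as np

def _trapz(y, t):
    """Trapezoidal time integral of a sampled scalar signal."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def l2_gain(t, z, v):
    """Empirical L2-gain sqrt( int ||z||^2 dt / int ||v||^2 dt ),
    with z the penalized output and v the disturbance, both sampled
    on the time grid t (1-D arrays here for simplicity)."""
    return np.sqrt(_trapz(z ** 2, t) / _trapz(v ** 2, t))

# Disturbance used in the experiment: v(t) = 3 e^{-t} cos t
t = np.linspace(0.0, 20.0, 20001)
v = 3.0 * np.exp(-t) * np.cos(t)
```

As a sanity check, an output that is exactly half the disturbance at every instant yields a gain of 0.5; the measured 0.585 for the proposed controller is computed the same way from the logged trajectories.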
The corresponding control output and ρ(t) are depicted in Fig. 11. As shown in Fig. 11, the control system designed using our method had an L2-gain of 0.585, lower than the value of 1.015 obtained in [40]. Since the auxiliary system was augmented with the estimated external disturbance, the learned control policy contains a feedforward loop for disturbance attenuation, which significantly improves robustness to disturbances. With the state approximation error and the disturbance approximation terms involved in the HJB equation, the system dynamics become more deterministic and a more accurate critic gradient is obtained. Moreover, the concurrent learning technique reuses historical data, further improving learning efficiency. We then compared our method with the online model-free control method known as model-free adaptive control (MFAC). The MFAC controller was designed based on the full-form dynamic linearization (FFDL) data model, with the desired state trajectory set to constant zeros. A comparison of the state evolution processes is displayed in Fig. 8, and the parameter setups of the three controllers are listed in Table 1. We further measured and compared the control performance of the three schemes in four aspects: the L2-gain; the states quadratic cost (SQC), i.e., the time integral of x^T(t)Qx(t); the number of learning steps required for the adaptive algorithm to converge (CS); and the states convergence time. The results are listed in Table 2. From Table 2, the robustness and efficiency of our method outperform both the widely applied MFAC scheme and the novel ADP-based H∞ scheme, which is optimal in the sense of a zero-sum game.
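The SQC index used in Table 2, the time integral of x^T(t) Q x(t), can be computed directly from a sampled state trajectory. A minimal sketch (the array layout is our choice):

```python
import numpy as np

def sqc(t, x, Q):
    """States quadratic cost: trapezoidal integral of x(t)^T Q x(t).
    x has shape (n_states, n_samples); t is the matching time grid."""
    vals = np.einsum('it,ij,jt->t', x, Q, x)   # x^T Q x at each sample
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(t)))
```

For example, with x(t) = [e^(−t), 0] and Q = I, the integral over a long horizon approaches the analytic value of 1/2, which makes the implementation easy to validate before applying it to logged trajectories.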

C. SIMULATION STUDY ON DSRV MOTION CONTROL
In this section, we discussed an experiment in which the proposed ADP method was applied to the depth and pitch control of a deep submergence rescue vehicle (DSRV) [41]. The simulation was carried out based on the Marine Systems Simulator (MSS) [42].
Consider a 5-meter-long DSRV cruising at a constant speed of 8 knots (4.11 m/s). The altitude of the vehicle is adjusted by a horizontally deployed rudder, so the vehicle moves up and down by altering its pitch angle. Fig. 12 displays the coordinate system setup and the DSRV motions. The inertial frame (denoted by I) is fixed to the ground, with its x-axis and y-axis pointing north and east, respectively, and its z-axis pointing downward. The DSRV motion state is described by a five-dimensional vector x = [w, q, x, z, θ]^T, where w and q represent the vehicle heave velocity and pitch angular velocity, respectively; x and z represent the advancing distance and diving depth of the vehicle, i.e., the vehicle coordinates in the inertial frame; and θ is the vehicle pitch angle with respect to the ground. The dynamics of the DSRV are expressed in terms of the inertia matrix, which comprises the mass, inertia, and added-mass terms. The projection matrix J(θ) = [cos θ, sin θ; −sin θ, cos θ] projects the velocity measured in the body frame to the velocity in the inertial frame. The heave force Z and pitch torque τ are given by the hydrodynamic model, with all inertia and hydrodynamic parameters listed in Table 3. To simplify the implementation, we extracted the control system states in terms of the tracking errors, where z_d and θ_d are the expected depth and pitch angle, respectively. The objective of this experiment was to control the DSRV to track a desired trajectory in the x-z plane. For simplicity, we set the desired trajectories of the depth and pitch angle as functions of time as in (70). The vehicle states were initially set to zero with the cruise speed U = 8 knots. First, we applied the proposed method to learn the optimal control policy while tracking the desired trajectory (70). Fig. 13 displays the actual and approximated depth and pitch angle errors. The DSRV motion trajectory is shown in Fig. 14(a). The convergence process of the critic weights is depicted in Fig. 15; the weights converged in fewer than 60 learning iterations.
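The vertical-plane kinematics above can be illustrated directly: J(θ) rotates the body-frame surge and heave velocities into inertial-frame advance and dive rates. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def J(theta):
    """Body-to-inertial projection matrix for the vertical plane,
    J(theta) = [cos t, sin t; -sin t, cos t] as given in the text."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s],
                     [-s, c]])

def inertial_velocity(theta, surge, heave):
    """[xdot, zdot] = J(theta) @ [surge, heave]."""
    return J(theta) @ np.array([surge, heave])
```

Note that J is orthogonal, so the inverse mapping back to the body frame is simply its transpose; at zero pitch the body surge speed (4.11 m/s at 8 knots) maps directly to the inertial advance rate.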
Then, we applied the obtained optimal controller to the DSRV. Fig. 14 depicts the desired and actual vehicle motion trajectories in the x-z plane of the inertial frame. As shown in Figs. 13 and 14, the proposed method guarantees that the trajectory tracking error is asymptotically stable, with the vehicle ultimately reaching and maintaining a cruise depth of 5 m. The corresponding rudder command is depicted in Fig. 16. It is worth noting that in real applications the rudder angle cannot exceed ±30 degrees (0.524 rad); the obtained optimal controller output was limited by this nonlinear saturation constraint. Fig. 17 displays the SQC of the proposed method; the obtained optimal controller improved control performance in terms of SQC by a factor of 100. The abovementioned numerical simulations were implemented in Matlab. We applied the Runge-Kutta (RK) algorithm to solve the proposed differential equations. The control scheme was implemented in a callback function that provides the time derivatives of the system states, the NN weights, and the identifier-approximated states to the RK solver. The simulation program was executed on a PC with an Intel-i7 8500 processor; the average elapsed time for a 20-second simulation was 3.4 s. A flow chart of our implementation is depicted in Fig. 18. The system states and adaptive parameters are reshaped and concatenated into a vector s to be transmitted among the computational modules.
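The implementation pattern described above, concatenating the plant states and all adaptive parameters into one vector s so a single RK callback can integrate everything jointly, can be sketched as follows. The pack/unpack layout and the classical RK4 step are generic; the actual derivative callback is the control scheme's.

```python
import numpy as np

def pack(x, wc, xhat):
    """Concatenate plant states, critic weights, and observer states
    into a single vector s for the ODE solver (cf. Fig. 18)."""
    return np.concatenate([x, wc, xhat])

def unpack(s, nx, nw):
    """Split s back into (plant states, critic weights, observer states)."""
    return s[:nx], s[nx:nx + nw], s[nx + nw:]

def rk4_step(f, t, s, dt):
    """One classical fourth-order Runge-Kutta step; f(t, s) returns ds/dt."""
    k1 = f(t, s)
    k2 = f(t + dt / 2, s + dt / 2 * k1)
    k3 = f(t + dt / 2, s + dt / 2 * k2)
    k4 = f(t + dt, s + dt * k3)
    return s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Inside the callback, one unpacks s, evaluates the plant, identifier, and weight-update dynamics, and packs the derivatives in the same order, which keeps the solver interface to a single vector-valued function.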

VII. CONCLUSION
Previous work has documented the effectiveness of RADP and H∞ methods in solving nonlinear robust optimal control problems. Existing methods are constructed on the policy iteration framework, with intelligent components applied for value function approximation, so that the optimal control problem can be solved forward in time without requiring knowledge of the system dynamics. However, it is very difficult to guarantee system stability while collecting data safely for critic learning. In this study, a novel NN-based identifier was developed to reconstruct the plant dynamics, and an auxiliary system was constructed from the identifier for critic learning with a newly designed utility function. The obtained control policy is robust to variations in the critic NN and identifier NN weight parameters, and the control system is guaranteed to be asymptotically stable before the critic learning converges. This work effectively extends RADP methods by introducing a dynamics identifier and providing robustness to variations in the critic and identifier NN weights. In general, the proposed method achieves online model-free critic learning for nonlinear non-affine systems without requiring an initial admissible controller, overcoming most of the difficulties encountered in real applications of ADP. The proposed technique is expected to allow robots and autonomous systems to improve their behaviors much as humans do, by learning through safe interactions with the environment.
XIUFEN YE (Senior Member, IEEE) was born in 1966. She received the B.S., M.S., and Ph.D. degrees in control theory and control engineering from Harbin Shipbuilding Engineering University (now Harbin Engineering University), Harbin, China, in 1987, 1990, and 2003, respectively. She has been a Professor with the College of Automation, Harbin Engineering University, since September 2003. She is the author of more than 180 articles and holds more than 20 patents. Her current research interests include underwater vehicle intelligent control systems, digital image processing, and object detection and tracking. She was a recipient of two provincial and ministerial science and technology progress awards. She served as a Program Committee Chair for the IEEE ICIA 2010 and the IEEE/ICME CME 2011.
WENZHI LIU was born in 1968. He has been a Professorate Senior Engineer with the College of Information and Communication Engineering, Harbin Engineering University, since April 2019. His current research interests include underwater robotics and other novel unmanned ocean systems.