SODO-Based Reinforcement Learning Anti-Disturbance Fault-Tolerant Control for a Class of Nonlinear Uncertain Systems With Matched and Mismatched Disturbances

This paper proposes a reinforcement learning anti-disturbance fault-tolerant control structure for a class of nonlinear uncertain systems with time-varying matched and mismatched disturbances. To deal with these disturbances, two second-order disturbance observers (SODOs) are designed for the inner- and outer-loop dynamic equations. To enhance the robustness and adaptivity with respect to the system uncertainties, two long short-term memory (LSTM) networks, which possess excellent fitting ability, are introduced as the critic and actor networks. Moreover, to overcome the difficulty caused by the unknown perturbations of the control effectiveness, several fault-tolerant adaptive laws are designed. Consequently, a novel reinforcement learning anti-disturbance fault-tolerant control structure is established for the concerned disturbed nonlinear uncertain system. Finally, two numerical examples are provided, demonstrating the satisfactory performance of the proposed control structure.


I. INTRODUCTION
Nonlinear control systems have long been one of the focal points and challenges of the control field [1]-[3]. To solve the control problem of nonlinear systems, researchers have proposed a series of control methods and strategies [4], [5]. Isidori and his colleagues first proposed a feedback linearization method based on differential geometry to solve nonlinear control problems [6]. A discontinuous sliding mode control method for nonlinear systems was proposed in [7]. In [8], to deal with nonlinear control systems with disturbances and input uncertainties, a novel disturbance observer has been designed and a stable control law has been proposed. In [9], the authors proposed an optimal H∞ tracking control method for nonlinear multivariable dynamic systems. In [10], a gap metric method is introduced to design a multi-model stable controller for a class of nonlinear systems. In [11], taking the matched and mismatched disturbances into consideration, a trajectory linearization control method has been designed for disturbed nonlinear systems. In [12], for a class of nonlinear systems with time-varying delay and state constraints, a novel quantitative adaptive control strategy has been established. In [13], for a miniature high-precision nonlinear system, a proportional-integral-derivative control method based on a dynamic hysteresis nonlinear model and its inverse model has been proposed. However, in the above-mentioned results, intelligent methods such as neural networks, deep learning methods, and reinforcement learning approaches have never been utilized to construct the control laws, and the suppression performance for unknown nonlinearities or system uncertainties may still need to be enhanced.

The associate editor coordinating the review of this manuscript and approving it for publication was Haiquan Zhao.

VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Among the many intelligent algorithms, reinforcement learning strategies and methods possess the advantages of autonomous learning and the ability to handle complex dynamics [14]. Reinforcement learning control is a deep combination of control techniques and reinforcement learning methods; it possesses an excellent ability to handle the complex or uncertain dynamics existing in control systems, and thereby effectively realizes stabilization or tracking control. Recently, many reinforcement learning control methods have been investigated or reported.
In [15], a novel adaptive fault-tolerant attitude control approach has been designed based on the long short-term memory (LSTM) network for fixed-wing UAVs subject to high dynamic disturbances and actuator faults. In [16], a reinforcement learning state feedback control method has been designed. In [17], a reinforcement learning control structure using a hidden reward function has been constructed. In [18], a control method has been synthesized using reinforcement learning and an actor-critic strategy. In [19], a deep reinforcement learning control method has been provided, and a novel model-free reinforcement learning fault-tolerant control structure has been established. In [20], an improved adaptive reinforcement learning control method has been proposed for the deformation control of aerospace unmanned systems.
Moreover, because of this excellent ability to handle complex or uncertain dynamics, reinforcement learning control methods have been applied to plenty of practical engineering systems. In [21] and [22], two reinforcement learning trajectory tracking control methods have been investigated for underactuated ships and soft robots. In [23], a reinforcement learning control method has been proposed for a vehicle robot with a variable center of gravity. In [24], by using a reinforcement learning algorithm, a novel precise control strategy has been proposed for the nonlinear fast hot machining control system. In [25], a policy-based reinforcement learning control method has been reported, minimizing the switching time and overshoot of a nonlinear floating piston system. In [26], a real-time reinforcement learning control method based on experience replay has been proposed. In [27], a model-free reinforcement learning controller has been designed for an electrically driven cold/heat storage system. Besides, reinforcement learning control methods have also been applied to air injection systems [28], humanoid robots [29], and HVAC (heating, ventilation and air conditioning) systems [30].
However, reinforcement learning control methods have never been designed for nonlinear systems with both matched and mismatched disturbances. Moreover, for matched and mismatched disturbances with time-varying features, reinforcement learning anti-disturbance control laws have rarely been reported. Furthermore, for nonlinear systems suffering from actuator faults and multiple disturbances simultaneously, reinforcement learning controllers are lacking. Therefore, this paper carries out the research of SODO-based reinforcement learning anti-disturbance fault-tolerant control for a class of nonlinear uncertain systems with matched and mismatched disturbances. The main contributions of this paper can be summarized as follows:
• To the best of the authors' knowledge, a reinforcement learning fault-tolerant anti-disturbance controller is proposed for the first time for nonlinear uncertain systems with matched and mismatched disturbances.
• By using the LSTM networks as the critic and actor networks, the robustness and adaptivity with respect to the system uncertainties can be enhanced.
• Benefitting from the estimation ability of the SODOs, both the matched and the mismatched time-varying disturbances can be handled.

II. PROBLEM FORMULATION
A. THE UNCERTAIN NONLINEAR SYSTEM WITH MATCHED AND MISMATCHED UNCERTAINTIES
Consider the following nonlinear uncertain system with matched and mismatched uncertainties:

ẋ1(t) = x2(t) + d0(t),
ẋ2(t) = g(x1(t), x2(t)) + f(x1(t), x2(t)) + B(I + N)u(t) + d1(t) + d2(t),   (1)

where x1 ∈ R^n and x2 ∈ R^n denote the system states, and u ∈ R^m is the control input. g(x1(t), x2(t)) and f(x1(t), x2(t)) are the known and unknown nonlinearities existing in the considered nonlinear uncertain system. B ∈ R^{m×m} is a known matrix representing the control effectiveness. N ∈ R^{m×m} denotes the unknown perturbation of the control effectiveness. d0(t) denotes the mismatched time-varying disturbance, while d1(t) and d2(t) are the matched disturbances.
The objective of this paper is to design a reinforcement learning anti-disturbance controller to realize stable tracking of the desired signal y_d(t), in the presence of the unknown nonlinearities, the unknown perturbations of the control effectiveness, and the matched and mismatched time-varying disturbances.
To achieve the design objective, the following assumptions are required:
Assumption 1: The matched and mismatched disturbances are all bounded, i.e., there exist constants d̄0, d̄1 and d̄2 such that ‖d0(t)‖ ≤ d̄0, ‖d1(t)‖ ≤ d̄1 and ‖d2(t)‖ ≤ d̄2.
Assumption 2: The unknown perturbation of the control effectiveness is bounded, i.e., there exists an unknown constant M > 0 such that ‖N‖ ≤ M.
Assumption 3: The desired signal y_d(t) is assumed to be smooth and twice differentiable.

B. LONG SHORT-TERM MEMORY NETWORK
To achieve reinforcement learning anti-disturbance control for the concerned nonlinear uncertain system with matched and mismatched disturbances, long short-term memory networks are introduced as the actor network and the critic network.
The output of the LSTM can be formulated as z = LSTM(y), where y ∈ R^n and z ∈ R represent the input and output signals of the LSTM. The LSTM includes the forget gates, input gates, memory states, update gates and output gates, which can be described as follows:

f_t = σ(W_f [h_{t−1}, y_t] + b_f),
i_t = σ(W_i [h_{t−1}, y_t] + b_i),
c̃_t = tanh(W_c [h_{t−1}, y_t] + b_c),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,
o_t = σ(W_o [h_{t−1}, y_t] + b_o),

where σ(·) denotes the sigmoid function and ⊙ denotes the element-wise product. The final output state is

h_t = o_t ⊙ tanh(c_t).
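As an illustration, one forward step of the standard LSTM cell described by the gate equations above can be sketched in NumPy. The layer sizes and the random weights below are illustrative assumptions, not parameters from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(y_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM cell: y_t is the input,
    (h_prev, c_prev) is the previous hidden/memory state."""
    v = np.concatenate([h_prev, y_t])
    f_t = sigmoid(W["f"] @ v + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ v + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ v + b["c"])    # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde        # memory-state update
    o_t = sigmoid(W["o"] @ v + b["o"])        # output gate
    h_t = o_t * np.tanh(c_t)                  # final output state
    return h_t, c_t

# Illustrative sizes and random weights (assumed, for demonstration only).
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = {k: rng.standard_normal((n_h, n_h + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_h) for k in "fico"}
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

Because the output gate and tanh are both bounded, every component of h stays strictly inside (−1, 1), which is the property the bounded-approximation lemma below relies on.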

Lemma 1 ([38], [39]): For any unknown smooth function, the LSTM network can achieve approximation with bounded error. In detail, for any given smooth function f: R^n → R, the following equation holds:

f(x) = W^T ς(x) + ε,

where W is the weight matrix of the LSTM, ς(x) is the vector of primary functions generated by the LSTM, and ε is the bounded approximation error.

III. CONTROLLER DESIGN
A. THE CONTROL STRUCTURE OF THE REINFORCEMENT LEARNING ANTI-DISTURBANCE CONTROLLER
The control structure of the proposed reinforcement learning anti-disturbance control law is shown in Fig. 1. Two SODOs are utilized to handle the matched and mismatched time-varying disturbances. The critic network is utilized to evaluate the anti-disturbance control performance of the closed-loop system, and the actor network is introduced as a component of the anti-disturbance fault-tolerant control law.

B. REINFORCEMENT LEARNING ANTI-DISTURBANCE CONTROL LAW
Define e1(t) = x1(t) − y_d(t). Based on the dynamic equation of the nonlinear uncertain system (1), a cost function J(t) is selected to evaluate the tracking performance of the closed-loop system.

To approximate the cost function, an LSTM network is selected as the critic network:

J = W_c^T ς_c(x) + ε_c,

where W_c ∈ R^{p_c} is the desired weight of the critic network, ε_c is the bounded error of the critic network, p_c is the node number of the network, and ς_c(x) ∈ R^{p_c} is the vector of primary functions. Define Ĵ and Ŵ_c as the estimated values of J and W_c, respectively. Hence, we can get that

Ĵ = Ŵ_c^T ς_c(x).

Construct the residual mean-square error function of the critic network as

E_c = (1/2) e_c^2,

where e_c denotes the residual of the critic network. The updating objective of the weight of the critic network is to minimize E_c. Therefore, according to the gradient descent method, the update law for the weight of the critic network can be designed as

dŴ_c/dt = −c_Wc ∂E_c/∂Ŵ_c,

where c_Wc > 0 is the learning rate. In this paper, the actor network is fused into the adaptive fault-tolerant controller. Considering that f(x1(t), x2(t)) is unknown, an LSTM network is introduced as the actor network:

f(x1(t), x2(t)) = W_a^T ς_a(x) + ε_f,

where W_a ∈ R^{p_a×n} and ς_a(x) ∈ R^{p_a} represent the desired weight and the primary function vector of the actor network, and ε_f is the bounded error of the actor network, satisfying ‖ε_f‖ ≤ ε̄_f. The estimated value of W_a is defined as Ŵ_a. Taking the derivative of both sides of e1(t) yields

ė1(t) = ẋ1(t) − ẏ_d(t) = x2(t) + d0(t) − ẏ_d(t).

To force the inner loop of the nonlinear uncertain system to be stable, the virtual control signal is designed as

x2c(t) = −K1 e1(t) − d̂0(t) + ẏ_d(t),

where K1 > 0 is the control gain, and d̂0(t) is the estimated value of the mismatched time-varying disturbance d0(t), obtained from a second-order disturbance observer. Define e2(t) = x2(t) − x2c(t).
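A hedged sketch of the gradient-descent weight update for a critic that is linear in its weights: here the primary-function vector phi and the target cost values are illustrative stand-ins for the LSTM primary functions and the true cost function, not the paper's actual networks.

```python
import numpy as np

def train_critic(steps=2000, lr=0.1):
    """Gradient descent on E_c = 0.5 * e_c^2 for a critic
    J_hat = Wc_hat^T phi(x) that is linear in its weights."""
    rng = np.random.default_rng(0)
    W_true = np.array([1.0, -2.0, 0.5])   # assumed 'true' weights for demo
    Wc_hat = np.zeros(3)
    for _ in range(steps):
        x = rng.uniform(-1, 1)
        phi = np.array([1.0, x, x**2])    # primary-function vector (assumed)
        J = W_true @ phi                  # target cost value (assumed)
        e_c = Wc_hat @ phi - J            # critic residual
        Wc_hat -= lr * e_c * phi          # Wc_hat <- Wc_hat - lr * dE_c/dWc_hat
    return Wc_hat

W = train_critic()
```

Because the approximator is linear in Ŵ_c, the gradient of E_c is simply e_c·phi, so the continuous-time update law in the text reduces to this least-mean-squares iteration after discretization.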
Taking the derivative of both sides of e2(t) yields the outer-loop error dynamics. By combining the actor network, the baseline control signal is then designed, where K2 > 0 denotes the control gain matrix of the outer loop, d̂1(t) is an adaptive parameter generated from a fault-tolerant adaptive law, and d̂2(t) is the estimated value of the matched time-varying disturbance d2(t), generated from another second-order disturbance observer. The update law of Ŵ_a is designed by the gradient descent method, similarly to that of the critic network. The final control signal is designed as

u(t) = u_b(t) + u_a(t),

where u_b(t) is the baseline control signal and u_a(t) is utilized to deal with the unknown perturbations of the control effectiveness. The compensating signal u_a(t) is constructed from M̂, the estimated value of M, together with its adaptive law.

C. STABILITY ANALYSIS

Theorem 1: Consider the nonlinear uncertain system (1) with matched and mismatched uncertainties. Suppose Assumptions 1, 2 and 3 are satisfied. If the critic network and the actor network are selected as (5) and (9) respectively, the reinforcement learning anti-disturbance control law is designed as (15), (18) and (20), and the update laws for the network weights are designed as (8) and (17), then all the signals of the closed-loop control system will be bounded and the tracking error can be forced to converge into a compact neighborhood of zero.

Proof: Define the SODO estimation errors with respect to d0(t), ḋ0(t), d2(t) and ḋ2(t), together with the adaptive estimation errors d̃1, W̃_a, W̃_c and M̃. By using equations (12) and (16) and taking derivatives, the closed-loop error equations can be obtained. Select the Lyapunov function L(t) as the sum of quadratic terms in the tracking errors e1(t) and e2(t), the disturbance estimation errors, and the adaptive estimation errors. Taking the derivative of L(t) along the closed-loop trajectories, substituting the update laws (17), (20) and (25), and applying Young's inequality, where ς̄_a denotes the upper bound of ‖ς_a‖, it can be shown that there exist positive constants κ and δ such that

L̇(t) ≤ −κ L(t) + δ.

Therefore, the system states, the disturbance estimation errors of the SODOs, and the adaptive estimation errors d̃1, W̃_a, W̃_c and M̃ are all bounded. In addition, the boundedness of ẋ1, ẋ2, d̂1, d̂0, d̂2, M̂, Ŵ_a and Ŵ_c can be verified. Moreover, the tracking error can be forced to converge into a compact neighborhood of zero, which completes the proof.
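The paper does not reproduce the SODO equations here, but one common realization of a second-order disturbance observer is a linear extended-state observer that estimates both the mismatched disturbance d0 and its derivative. The sketch below applies this form to the inner-loop dynamics ẋ1 = x2 + d0; the bandwidth parameterization of the gains and the test signals are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulate_sodo(T=5.0, dt=1e-3, omega=20.0):
    """Linear ESO estimating x1, d0 and dot(d0) for dot(x1) = x2 + d0.
    Gains follow the standard bandwidth parameterization (s + omega)^3."""
    b1, b2, b3 = 3 * omega, 3 * omega**2, omega**3
    x1, x1_hat, d0_hat, nu_hat = 0.0, 0.0, 0.0, 0.0
    t = 0.0
    for _ in range(int(T / dt)):
        d0 = np.sin(t)              # true time-varying disturbance (assumed)
        x2 = np.cos(2 * t)          # known inner-loop state signal (assumed)
        e = x1 - x1_hat             # observer innovation
        x1 += dt * (x2 + d0)        # plant: dot(x1) = x2 + d0
        x1_hat += dt * (x2 + d0_hat + b1 * e)   # state estimate
        d0_hat += dt * (nu_hat + b2 * e)        # disturbance estimate
        nu_hat += dt * (b3 * e)                 # derivative-of-d0 estimate
        t += dt
    return d0_hat, np.sin(t)

d0_hat, d0_true = simulate_sodo()
```

With the poles placed at −omega, the steady-state estimation error for a sinusoidal disturbance scales roughly like 1/omega^2, which is why the observer bandwidth must dominate the disturbance frequency content.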

IV. SIMULATION STUDY
In order to evaluate the effectiveness and performance of the proposed reinforcement learning anti-disturbance fault-tolerant control law, numerical examples are provided in this section.
In the simulation, we select f = 2 sin(x1 + 5) + cos(6x2 − 4), B = 1, N = 1. The desired signal y_d is generated by a reference model, and the initial values of the system states and of the weight parameters of the actor and critic networks are specified accordingly. For the proposed control method, the control gains are K1 = 3 and K2 = 10, and the adaptive parameters are c_Wa = 5, c_Wc = 5, c_d = 0.5, c_M = 0.5, σ_a = 3, σ_c = 0.003, σ_d = 0.1 and σ_M = 2.
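For orientation, the nominal backstepping part of the simulated loop can be sketched as follows. This stripped-down version uses the paper's f(x1, x2) = 2 sin(x1 + 5) + cos(6x2 − 4) and gains K1 = 3, K2 = 10, but omits the disturbances, actuator faults, SODOs and LSTM networks entirely, and assumes f is known and the reference is y_d = sin(t) — all of which are simplifying assumptions for illustration only.

```python
import numpy as np

def simulate(T=10.0, dt=1e-3, K1=3.0, K2=10.0):
    """Nominal two-loop backstepping for dot(x1)=x2, dot(x2)=f+u
    (no disturbances, no faults, f assumed known)."""
    x1, x2 = 1.0, 0.0
    t = 0.0
    for _ in range(int(T / dt)):
        yd, yd_dot, yd_ddot = np.sin(t), np.cos(t), -np.sin(t)
        f = 2 * np.sin(x1 + 5) + np.cos(6 * x2 - 4)
        e1 = x1 - yd
        x2c = -K1 * e1 + yd_dot                  # virtual control signal
        e2 = x2 - x2c
        x2c_dot = -K1 * (x2 - yd_dot) + yd_ddot  # analytic derivative of x2c
        u = -f - K2 * e2 + x2c_dot               # baseline control signal
        x1 += dt * x2                            # Euler integration of plant
        x2 += dt * (f + u)
        t += dt
    return abs(x1 - np.sin(t))                   # final tracking error

err = simulate()
```

In this nominal case the error dynamics reduce to ė1 = −K1 e1 + e2 and ė2 = −K2 e2, so the tracking error decays exponentially; the full method in the paper adds the SODO, actor/critic and fault-tolerant terms precisely to recover this behavior when f is unknown and disturbances and faults are present.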
In Case 1, the matched and mismatched disturbances are set as time-varying trapezoidal disturbances. The simulation results are shown in Figs. 2-5. They show that the proposed control method can achieve satisfactory results under actuator failure and constant or changing external disturbances. All signals in the closed-loop control system remain bounded during the whole control process.
In Case 2, the matched and mismatched disturbances are set as time-varying square waves. The simulation results are shown in Figs. 6-9. It is obvious that the proposed reinforcement learning anti-disturbance fault-tolerant control method can still guarantee stable tracking, while control methods without the SODOs or reinforcement learning may produce undesired tracking errors and time delays. From the simulation results of the two cases, the effectiveness and the advantages of the proposed reinforcement learning anti-disturbance fault-tolerant control method can be verified.

V. CONCLUSION
This paper addressed the reinforcement learning control problem for nonlinear uncertain systems with matched and mismatched time-varying disturbances, as well as unknown perturbations of the control effectiveness. Two SODOs have been designed for the concerned nonlinear uncertain system, estimating and compensating the time-varying matched and mismatched disturbances. Two LSTM networks, which possess excellent fitting ability, have been utilized as the critic and actor networks, improving the adaptivity with respect to the system uncertainties. Then, by designing several fault-tolerant adaptive laws, a reinforcement learning anti-disturbance fault-tolerant control structure which can handle the matched and mismatched time-varying disturbances has been established. Two cases of simulation have been performed in this paper, demonstrating the advantages of the proposed control structure.

SHIYI HUANG received the bachelor's degree in software engineering from Nanchang Hangkong University, Nanchang, China, in 2019. He is currently pursuing the master's degree with East China Jiaotong University. His current research interests include intelligent control of unmanned aerial vehicle systems, reinforcement learning control and design, and research of complex neural networks.