Data-Driven Nonlinear Near-Optimal Regulation Based on Multi-Dimensional Taylor Network Dynamic Programming

Using a data-driven control formulation, an iterative dynamic programming approach based on a multi-dimensional Taylor network (MTN) is established to design near-optimal regulators for discrete-time nonlinear systems. For general discrete-time nonlinear systems, the iterative adaptive dynamic programming algorithm is developed and proved to guarantee convergence and optimality. Three networks are constructed, namely, the identification network, the critic network and the action network. Moreover, a globalized dual heuristic programming technique with a detailed implementation is developed. The cost function and its derivative can be approximated by this novel architecture. Besides, without knowledge of the system dynamics, this technique can learn the near-optimal control law simultaneously and adaptively. In addition, this technique greatly improves the existing results of the iterative adaptive dynamic programming algorithm by reducing the requirement on the control matrix. Furthermore, because the approach is based on the multi-dimensional Taylor network, the amount of calculation needed is also greatly reduced. A simulation experiment is described to illustrate the effectiveness of the data-driven optimal regulation method proposed in this paper.


I. INTRODUCTION
A wide range of applications in engineering technology involve optimal control. To optimize the performance index of the controlled system, controller design is the basis of optimal control research [1]. Therefore, optimal control has become one of the main topics of modern control theory [2]. Unlike the optimal control problem for linear systems, optimal control problems for nonlinear systems usually require solving nonlinear Hamilton-Jacobi-Bellman (HJB) equations [3]. However, it is very difficult to solve such nonlinear partial differential equations, and some of them cannot be solved under certain conditions. (The associate editor coordinating the review of this manuscript and approving it for publication was Sun Junwei.) The emergence of dynamic programming therefore provides a new method for optimal control [4], [5]. A novel iterative two-stage dual heuristic programming method has been proposed to solve the optimal control problem for a class of discrete-time switched nonlinear systems subject to actuator saturation [6]. In the past few years, adaptive-based methods have been well developed [7]-[9], such as heuristic dynamic programming (HDP) [10], dual heuristic dynamic programming (DHP) [11], and globalized dual heuristic dynamic programming (GDHP) [12]. In [13], a robust adaptive control scheme is discussed that is based on a cascaded structure with a full-state feedback controller with integrator terms as the inner control loop and computed torque as the outer control loop for flexible-joint robots. Based on adaptive laws and finite-time stability theory, a nonsingular terminal sliding mode control is designed in [14]. Reference [15] proposes a new output-constrained robust adaptive controller for a class of uncertain multi-input multi-output (MIMO) nonlinear systems. These methods, which are simple to implement, do not require a model of the controlled object.
Because of these advantages, the adaptive dynamic programming method has been well developed [16]. Reference [17] is concerned with a novel generalized policy iteration algorithm for solving optimal control problems for discrete-time nonlinear systems. In [18], a simple, effective method is given for designing autonomous memristor chaotic systems.
With the current technological development of big data, research on data-driven thinking, and the study of learning algorithms, the adaptive dynamic programming (ADP) algorithm has become an effective means of optimal design and intelligent control [19]-[21]. Since the introduction of this algorithm, a large amount of research on the ADP algorithm has emerged [22]-[25]. A novel data-driven robust approximate optimal tracking control scheme has been proposed for unknown general nonlinear systems using the ADP method [26]. A new iterative ADP method has been proposed to solve a class of continuous-time nonlinear two-player zero-sum differential games [27]. In [28], it is shown how to implement ADP methods using only measured input-output data from the system. An online adaptive policy learning algorithm (APLA) based on ADP is proposed in [29]. The first HDP algorithm based on greedy iteration was reported in [30]; it mainly studies infinite-horizon optimal control design. In the basic ADP algorithm, it is generally necessary to construct two networks, namely, a critic network and an action network. The critic network is used to approximate the cost function, and the action network is used to approximate the control function [31]. Therefore, in existing iterative ADP algorithms, the training of the action network relies on the system dynamics. However, the structure of HDP cannot directly output the derivative of the cost function. Moreover, in general, network construction is mainly based on neural networks (NNs) [32]. A memristor-based neural network circuit is discussed in [33]. As is known, an NN with a larger number of hidden nodes tends to make the control more complex and increases the computational burden on the control system.
Moreover, NN neurons involve exponential functions, which add to the complexity of the calculation and can cause NN control to fail to meet real-time requirements.
Recently, the multi-dimensional Taylor network (MTN), a new structure, has been proposed [34]-[40]. The MTN is a simple function of the state and input, and its polynomials are easy to analyze and solve. The MTN is good at approximating nonlinear dynamical systems, even unstable ones, as polynomials can approximate them arbitrarily closely, and it can also exactly express polynomial dynamical systems. In addition, the MTN involves only multiplication and addition; thus, its simple computation makes desirable real-time control possible. Due to the unique structural characteristic of the MTN, its output is a linear combination of finitely many products of its inputs, which is well suited to computer implementation. Besides, the discrete MTN eliminates the dependence on the system model and reduces the design complexity. An adaptive controller based on the MTN was proposed in [36]; however, the parameters of that controller are fixed. Then, an MTN tracking control scheme was proposed for a class of stochastic nonlinear systems with unknown input dead-zone [37], and the problem of adaptive MTN control for SISO uncertain stochastic nonlinear systems was investigated in [38].
To address the above issues, a data-driven approximate optimal control method for discrete-time nonlinear systems based on multi-dimensional Taylor network dynamic programming is proposed. The main contributions of the proposed control scheme are as follows:
1) Three networks, namely, the identification network, the critic network and the action network, are constructed based on the MTN. The parameter selection algorithm and the detailed control process are given.
2) The convergence of the algorithm is proved. The adaptive algorithm proposed in this paper is guaranteed to converge under infinite-time conditions.
3) A simulation experiment demonstrates the effectiveness of the optimal control method proposed in this paper.

II. PROBLEM DESCRIPTION
Consider the following discrete-time nonlinear system:

x(k+1) = F(x(k), u(k)),        (1)

where x(k) = [x_1(k), x_2(k), · · ·, x_n(k)]^T ∈ Ω_x ⊂ R^n is the state vector of the system; u(k) = [u_1(k), u_2(k), · · ·, u_m(k)]^T ∈ Ω_u ⊂ R^m is the control vector of the system; and y(k) = [y_1(k), y_2(k), · · ·, y_n(k)]^T ∈ Ω_y ⊂ R^n is the output vector of the system.
In addition, x(0) is the initial state of the system when k = 0. Here, the following assumptions are made:
Assumption 1: System (1) is controllable; that is, there exists a set of control laws that stabilizes the system.
Assumption 2: The nonlinear mapping F(·) is Lipschitz continuous within the set Ω_x, and F(0, 0) = 0. This implies that x(0) = 0 is an equilibrium state of the system under the control law u(0) = 0.
Control target: in the infinite time domain, design the output feedback control law u(y) to drive the system from the initial state to the equilibrium state while minimizing the cost function.
The cost function provides a standard by which to evaluate the effect of learning. For a controllable system, the control target is to choose a control scheme from the admissible solutions so that the cost function of the system is continuously reduced during the motion.
Define the cost function

J(x(k)) = Σ_{j=k}^{∞} U(x(j), u(j)),        (2)

where U(·, ·) ≥ 0 is the utility function. Set the optimal cost function

J*(x(k)) = min_{u(·)} J(x(k)).

It then follows that J*(x(k)) satisfies the discrete-time HJB equation

J*(x(k)) = min_{u(k)} { U(x(k), u(k)) + J*(x(k+1)) }.

The corresponding optimal control u*(k) is

u*(k) = arg min_{u(k)} { U(x(k), u(k)) + J*(x(k+1)) }.

This shows that the next state vector x(k+1) is required to solve for the optimal control u*(k) at the current moment, which is impossible at the current moment. Therefore, an iterative algorithm based on the multi-dimensional Taylor network is proposed in this paper to obtain an approximate solution.

III. MULTI-DIMENSIONAL TAYLOR NETWORK
The MTN can approximate any nonlinear function with a finite number of points of discontinuity. A neat structure is the merit of the MTN, and its parameters are easy to adjust.
The detailed application of the MTN can be found in [18]-[20]. Let z(k) = [z_1(k), z_2(k), · · ·, z_{n_z}(k)]^T be the input vector of the MTN. The basic structure of the MTN is shown in Fig. 1. In other words, there exists a set of parameter vectors W_j(k) = [w_{j1}(k), w_{j2}(k), · · ·, w_{jN(n_z,t)}(k)] such that the output of the MTN, Out_{jn}(k), can be expressed as

Out_{jn}(k) = Σ_{i=1}^{N(n_z,t)} w_{ji}(k) Π_{s=1}^{n_z} z_s(k)^{λ(s,i)},

where N(n_z, t) is the total number of expansion terms, w_{ji}(k) is the weight of the i-th product term, λ(s,i) is the power of z_s(k) in the i-th product term, and Σ_{s=1}^{n_z} λ(s,i) ≤ t.
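As a concrete illustration of this expansion, the following sketch (with hypothetical function names, and one standard ordering of the exponent tuples) computes the MTN output as a weighted sum of all monomials of total degree at most t in the n_z inputs — only multiplications and additions:

```python
from itertools import combinations_with_replacement
from math import prod

# Minimal MTN output sketch: Out(k) = sum_i w_i * prod_s z_s^{lambda(s,i)},
# over all exponent tuples with total degree <= t.

def mtn_exponents(n_z, t):
    """All exponent tuples (lambda(1,i), ..., lambda(n_z,i)) with sum <= t."""
    exps = []
    for degree in range(t + 1):
        for combo in combinations_with_replacement(range(n_z), degree):
            lam = [0] * n_z
            for s in combo:
                lam[s] += 1          # count how often each input appears
            exps.append(tuple(lam))
    return exps

def mtn_output(z, w, exps):
    """Linear combination of the product terms -- multiply/add only."""
    return sum(wi * prod(zs ** e for zs, e in zip(z, lam))
               for wi, lam in zip(w, exps))

exps = mtn_exponents(n_z=2, t=2)
print(len(exps))   # -> 6 terms: 1, z1, z2, z1^2, z1*z2, z2^2
print(mtn_output([2.0, 3.0], [1, 0, 0, 0, 0, 0], exps))  # constant term only -> 1.0
```

For n_z inputs and maximum power t, the number of terms N(n_z, t) equals the binomial coefficient C(n_z + t, t), which is why the structure stays compact for moderate t.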

IV. ITERATIVE ALGORITHM CONVERGENCE ANALYSIS
To prove the convergence of the iterative algorithm, two basic sequences, {V_i(x(k))} and {v_i(x(k))}, are constructed here.
{V_i(x(k))} denotes the cost function sequence, and {v_i(x(k))} denotes the sequence of approximate optimal control laws. Each v_i is a vector with the same number of elements as the control vector, i = 0, 1, · · ·.
Here, the following two lemmas are used.
Lemma 1 (Monotonicity): Let the cost function sequence {V_i(x(k))} be as shown in equation (10), with V_0(·) = 0. With the control law sequence {v_i(x(k))} given by equation (9), {V_i(x(k))} is a monotone non-decreasing sequence. In other words, ∀i, 0 ≤ V_i(x(k)) ≤ V_{i+1}(x(k)).
Lemma 2 (Boundedness): If the system is controllable and the cost function sequence {V_i(x(k))} is given by equation (10), then there is an upper bound D such that ∀i, 0 ≤ V_i(x(k)) ≤ D.
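These two properties can be visualized numerically. The sketch below runs the value-iteration recursion V_{i+1}(x) = min_u { U(x,u) + V_i(F(x,u)) } with V_0 = 0 on a grid, for a toy one-dimensional system of our own choosing (not the paper's), and checks that the sequence is monotone non-decreasing and bounded:

```python
import numpy as np

# Toy value-iteration demo illustrating Lemma 1 (monotonicity) and
# Lemma 2 (boundedness). The dynamics and utility are assumptions
# chosen only for illustration.

xs = np.linspace(-1, 1, 101)          # state grid
us = np.linspace(-1, 1, 201)          # control grid

def F(x, u):                           # assumed dynamics for the demo
    return 0.5 * x + u

def U(x, u):                           # quadratic utility
    return x**2 + u**2

V = np.zeros_like(xs)                  # V_0(x) = 0
history = [V.copy()]
for i in range(50):
    nxt = np.clip(F(xs[:, None], us[None, :]), xs[0], xs[-1])
    cost = U(xs[:, None], us[None, :]) + np.interp(nxt, xs, V)
    V = cost.min(axis=1)               # greedy minimization over u
    history.append(V.copy())

# Lemma 1: V_i <= V_{i+1} pointwise for every i
assert all(np.all(history[j] <= history[j + 1] + 1e-12) for j in range(50))
# Lemma 2: the sequence stays bounded above
print(float(np.max(history[-1])))
```

Because V_0 = 0 and each sweep takes a pointwise minimum of monotone quantities, the grid values increase toward a bounded limit, exactly the behavior the two lemmas describe.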

V. ITERATIVE ALGORITHM AND ITS IMPLEMENTATION
Because the general HJB equation of a nonlinear system is difficult to solve, the optimal control law and the optimal cost function are obtained via an iterative algorithm in principle.
However, because the controlled system is unknown, {V_i(x(k))} and {v_i(x(k))} must be constructed without explicit system dynamics. A dynamic iterative implementation based on the multi-dimensional Taylor network is proposed in this section. The model mainly includes the construction of three networks, namely, the identification network, the critic network and the action network.

A. IDENTIFICATION NETWORK
Before implementing the iterative control process, an identification network must be built to ensure that the control process does not require dynamic information about the system. The weight vector of the identification network is ω_m ∈ R^{N_m}, where N_m is the number of expansion terms of the MTN identification network.
The output of the identification network is

x̂(k+1) = ω_m^T η_m(x(k), u(k)),

where η_m(·) is the vector of expansion terms of the MTN identification network. The error function of the identification network is

e_m(k) = x̂(k+1) − x(k+1).

The training objective function is

E_m(k) = (1/2) e_m^T(k) e_m(k).

The weight vector of the identification network is updated via the gradient method:

ω_m(j+1) = ω_m(j) − α_m ∂E_m(k)/∂ω_m(j),

where α_m > 0 is the model learning rate and j is the training weight iteration index. When the identification network is fully trained and the weights no longer change, the training of both the critic network and the action network is started.
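A minimal sketch of this training loop is given below, with a hypothetical one-dimensional plant and a degree-2 polynomial expansion standing in for η_m(·); everything here is an illustrative assumption, not the paper's system:

```python
import numpy as np

# Identification-network sketch: x_hat(k+1) = w_m^T eta_m(x(k), u(k)),
# trained by the gradient rule w_m <- w_m - alpha_m * dE_m/dw_m,
# with E_m = 0.5 * e_m^2. Plant and basis are hypothetical.

rng = np.random.default_rng(0)

def eta_m(x, u):
    # expansion terms up to total degree 2 in (x, u)
    return np.array([1.0, x, u, x * x, x * u, u * u])

def plant(x, u):                      # "unknown" dynamics to identify
    return 0.8 * x + 0.5 * u - 0.1 * x * x

w_m = np.zeros(6)
alpha_m = 0.05                        # model learning rate
for _ in range(20000):
    x, u = rng.uniform(-1, 1, 2)      # sampled input-output data
    e_m = w_m @ eta_m(x, u) - plant(x, u)     # identification error
    w_m -= alpha_m * e_m * eta_m(x, u)        # gradient step
print(np.round(w_m, 3))               # weights approach [0, 0.8, 0.5, -0.1, 0, 0]
```

Because the (assumed) plant lies in the span of the expansion terms, the identification error is driven toward zero and the weights recover the plant coefficients, after which critic and action training can begin.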

B. CRITIC NETWORK
The role of the critic network is to approximate the cost function V_i(x(k)) and the partial derivative of the cost function ∂V_i(x(k))/∂x(k). Let λ_i(x(k)) = ∂V_i(x(k))/∂x(k) be the co-function. According to Theorem 1, when i → ∞, V_i(x(k)) converges; thus, the co-function λ_i(x(k)) is also convergent. Set N_c to be the number of expansion terms of the MTN critic network. Thus, the weight vector of the critic network is ω_c ∈ R^{N_c}. At the i-th iteration, the output of the critic network is

[V̂_i(x(k)), λ̂_i(x(k))] = ω_ci^T η_c(x(k)),

where ω_ci = [ω^V_ci, ω^λ_ci]. Expanding the equation, we can obtain

V̂_i(x(k)) = (ω^V_ci)^T η_c(x(k)),   λ̂_i(x(k)) = (ω^λ_ci)^T η_c(x(k)),

where η_c(·) is the vector of expansion terms of the MTN critic network. Although the introduction of the co-function increases the amount of computation to a certain extent, the cost function can be output directly, and the control effect is also improved to some extent.
The structure is shown in Fig. 2.
The training objective of the critic network consists of two parts, namely, the cost function part and the co-function part. The training errors likewise include two parts:

e^V_ci(k) = V̂_i(x(k)) − V_i(x(k)),   e^λ_ci(k) = λ̂_i(x(k)) − λ_i(x(k)),

with

E^V_ci(k) = (1/2)(e^V_ci(k))^2,   E^λ_ci(k) = (1/2)(e^λ_ci(k))^T e^λ_ci(k).

The weights of the critic network are updated by the gradient descent method to obtain

ω_ci(j+1) = ω_ci(j) − α_c [ β ∂E^V_ci(k)/∂ω_ci(j) + (1 − β) ∂E^λ_ci(k)/∂ω_ci(j) ],

where α_c > 0 is the learning rate of the critic network, j is the training weight iteration index, and 0 < β < 1 is a constant used to weight E^V_ci(k) against E^λ_ci(k).
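One β-weighted critic update can be sketched as follows; the expansion terms, shapes, and numbers are illustrative assumptions, and the cost/co-function targets are treated as given:

```python
import numpy as np

# Critic-update sketch: the critic outputs both V_hat and lambda_hat
# from shared expansion terms eta_c(x); the two squared-error parts
# are mixed by beta in the gradient step. All names are hypothetical.

def eta_c(x):                          # assumed critic expansion terms
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x1 * x2, x2 * x2])

N_c = 5
w_V = np.zeros(N_c)                    # weights for the cost function V_hat
w_lam = np.zeros((N_c, 2))             # weights for the co-function lambda_hat
alpha_c, beta = 0.1, 0.5

def critic_update(x, V_target, lam_target):
    global w_V, w_lam
    phi = eta_c(x)
    e_V = w_V @ phi - V_target                   # cost-function error
    e_lam = w_lam.T @ phi - lam_target           # co-function error
    # gradient of beta*E_V + (1-beta)*E_lam, with E = 0.5*||e||^2
    w_V -= alpha_c * beta * e_V * phi
    w_lam -= alpha_c * (1 - beta) * np.outer(phi, e_lam)

critic_update(np.array([0.5, -0.2]), V_target=1.0, lam_target=np.array([0.4, 0.1]))
print(np.round(w_V, 4))                # each weight moved against its error gradient
```

Repeating this step over sampled states drives both outputs toward their targets, with β trading off accuracy of the cost function against accuracy of its derivative.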

C. ACTION NETWORK
The role of the multi-dimensional Taylor network action network is to approximate the optimal control law. Set N_a to be the number of expansion terms of the MTN action network. Thus, the weight vector of the action network is ω_a ∈ R^{N_a}, and the output of the action network is

v̂_i(x(k)) = ω_a^T η_a(x(k)),

where η_a(·) is the vector of expansion terms of the MTN action network. Set the error function

e_a(k) = V̂_{i−1}(x(k+1)) − S(k),

where S(k) = 0 is the target value of V̂_{i−1}(x(k+1)). Thus, the objective function is

E_a(k) = (1/2) e_a^2(k).

To minimize the objective function, the weights of the action network are adjusted using the gradient method:

ω_a(j+1) = ω_a(j) − α_a ∂E_a(k)/∂ω_a(j),

where α_a > 0 is the learning rate of the action network, and j is the training weight iteration index. Traditional control methods rely too heavily on dynamic information about the controlled object: in the process of training the action network, it is necessary to use the control matrix directly or to rely on a neural network to express it. In contrast, while retaining the basic framework of these traditional control strategies, the control method proposed in this paper guarantees the convergence of the iterative algorithm. Moreover, the proposed method relaxes the requirements on the system dynamics, making it easier to achieve the desired control effects.
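The corresponding action-network step can be sketched like this — a hypothetical scalar example in which the identified model and trained critic are replaced by fixed stand-ins, so that the chain rule through the predicted x(k+1) is explicit and no knowledge of the real plant is used:

```python
import numpy as np

# Action-network sketch: v_hat(x) = w_a^T eta_a(x) is trained to
# minimize E_a = 0.5*e_a^2 with e_a = V_hat(x(k+1)) - S(k), S(k) = 0,
# where x(k+1) is predicted by the identification network. All
# functions and constants below are illustrative assumptions.

rng = np.random.default_rng(1)

def eta_a(x):                          # assumed action-network expansion
    return np.array([x, x**2, x**3])

def model(x, u):                       # stand-in for the trained identifier
    return 0.8 * x + 0.5 * u

def V_hat(x):                          # stand-in for the trained critic
    return x * x

w_a = np.zeros(3)
alpha_a = 0.05                         # action-network learning rate
for _ in range(5000):
    x = rng.uniform(-1, 1)
    u = w_a @ eta_a(x)
    x_next = model(x, u)               # predicted next state
    e_a = V_hat(x_next) - 0.0          # error toward target S(k) = 0
    # chain rule: dE_a/dw_a = e_a * dV_hat/dx_next * dmodel/du * eta_a(x)
    w_a -= alpha_a * e_a * (2.0 * x_next) * 0.5 * eta_a(x)

# the learned feedback shrinks the predicted next state toward equilibrium
print(abs(model(1.0, w_a @ eta_a(1.0))))
```

Note that the sensitivity of the predicted next state with respect to the control comes from the identification network's stand-in (the factor 0.5 here), which is exactly how the scheme avoids needing the real control matrix.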
The basic control structure is shown in Fig. 3.

D. CONTROL PROCESS
Suppose x(k) is an arbitrary controllable state, and J*(x(k)) is the optimal cost function. According to Theorem 1, when the iteration index i → ∞, V_i(x(k)) → J*(x(k)). However, it is impossible to perform the iterative algorithm infinitely. As a result, an error bound ε is introduced so that

|J*(x(k)) − V_i(x(k))| ≤ ε,        (40)

which guarantees that the cost function converges after a finite number of iterations. This approximation in the practical sense can meet the needs of general engineering design. However, in the general situation, the optimal cost function J*(x(k)) is unknown in advance, so it is difficult to use (40) as a stopping criterion. Therefore, the following stopping criterion is used here:

|V_{i+1}(x(k)) − V_i(x(k))| ≤ ε.        (41)

Theorem 2: For the nonlinear system (1) and the cost function (2), the convergence criteria (40) and (41) are equivalent in the iterative process.
Proof: If |J*(x(k)) − V_i(x(k))| ≤ ε is established, then according to Theorem 1, the following inequality holds:

V_i(x(k)) ≤ V_{i+1}(x(k)) ≤ J*(x(k)).

In other words,

|V_{i+1}(x(k)) − V_i(x(k))| ≤ J*(x(k)) − V_i(x(k)) ≤ ε.

Therefore, equation (41) is established. Conversely, if (41) holds, then according to Theorem 1, {V_i(x(k))} is monotone non-decreasing and bounded; in other words, V_i(x(k)) → J*(x(k)). Thus, for any small ε, when i → ∞,

|J*(x(k)) − V_i(x(k))| ≤ ε.

The proof is completed. Theorem 2 thus provides a design criterion: in practical applications, applying the control strategy proposed in this paper can obtain reasonable control effects. Theorem 2 validates the equivalence between (40) and (41), and its importance lies in providing a practical design criterion for the approximate optimal regulation of discrete-time nonlinear systems using the iterative MTN dynamic programming method.
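The practical criterion (41) is straightforward to apply, since it compares only successive iterates. A toy sketch (our own one-dimensional nonlinear system, chosen purely for illustration) stops the value recursion as soon as (41) holds:

```python
import numpy as np

# Demonstration of stopping rule (41): iterate the value recursion
# until max_x |V_{i+1}(x) - V_i(x)| <= eps; per Theorem 2 this is
# equivalent to being eps-close to the unknown optimal cost J*.
# Dynamics, utility, and grids are illustrative assumptions.

xs = np.linspace(-1, 1, 101)           # state grid
us = np.linspace(-1, 1, 201)           # control grid
eps = 1e-6

V = np.zeros_like(xs)                  # V_0 = 0
for i in range(1, 10001):
    nxt = np.clip(0.4 * xs[:, None] + 0.1 * xs[:, None]**3 + us[None, :], -1, 1)
    cost = xs[:, None]**2 + us[None, :]**2 + np.interp(nxt, xs, V)
    V_new = cost.min(axis=1)
    if np.max(np.abs(V_new - V)) <= eps:   # stopping criterion (41)
        V = V_new
        break
    V = V_new
print(i)                               # a finite number of sweeps suffices
```

The key point is that the loop never needs J* itself: monotone convergence of {V_i} makes the successive-difference test a valid surrogate for (40).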

VI. SIMULATION EXAMPLE
To validate the controller proposed in this paper, consider the following nonlinear system. Setting x(k) = [x_1(k), x_2(k)]^T, the utility function and the initial parameters of the system are specified accordingly. When the excitation function is

u_r(k) = 0.8 sin(k/17) + 0.7 cos(k/80) + 0.4 sin(k/27),

the curve of the identification model produced by the proposed identification network is shown in Fig. 4.
For the method proposed in this paper, when the dimension n of the controller is equal to 2, the unit step response curve is as shown in Fig. 5.
For comparison, the BP neural network self-adaptive reconstruction algorithm gives the unit step response curve also shown in Fig. 5. Fig. 5 shows that the data-driven approximate optimal control method based on the multi-dimensional Taylor network has a faster response.
To verify the tracking performance of the controller, a sinusoidal signal is superimposed on the input at k = 10; Fig. 6 shows the response curves. Fig. 6 reveals that the data-driven approximate optimal control method tracks the desired signal more quickly. The convergence proof of the proposed algorithm has been described in detail in [39] and [40].
The algorithm proposed in this paper and the neural network algorithm are both differential adjustment algorithms. That is, an error first occurs in the system, and then the controller acts to reduce the error, eventually driving it to zero. Under this control mechanism, a faster response speed ensures that the system fluctuates within a relatively small error range. To illustrate this point, Fig. 7 shows the absolute error curves of the two algorithms.
It can be seen from the figure that the absolute error of the algorithm proposed in this paper is always smaller than that of the comparison algorithm.

VII. CONCLUSION
For discrete-time nonlinear systems, an approximate optimal iterative dynamic programming method based on the MTN and a data-driven approach was proposed in this paper. Moreover, the convergence of the iterative algorithm was proved. Based on the MTN, three networks were constructed: the identification network, the critic network and the action network. Since the construction and execution of the control network do not require dynamic information about the controlled system, the dependence on the structure of the controlled object model is greatly reduced. The effectiveness of the proposed method was verified by the simulation results. Compared with the traditional NN control method, the MTN-based iterative algorithm proposed here has a faster response speed.
The current research in this paper focuses on theoretical analysis, and the convergence of the algorithm under infinite-time conditions was proved. Directions for further research include how to extend the results to the finite-time case and how to prove the convergence of the iterative algorithm in finite time. In addition, approaches that combine the theoretical methods with practice require further study.