Online Dual-Network-Based Adaptive Dynamic Programming for Solving Partially Unknown Multi-Player Non-Zero-Sum Games With Control Constraints

In this article, a novel online method based on neural networks (NN) is developed for multi-player non-zero-sum (NZS) differential games of nonlinear, partially unknown continuous-time (CT) systems with control constraints. The issue of multi-player NZS games with saturated actuators is analyzed in detail, and the unknown dynamics are learned by an identifier NN. Unlike the standard identifier-actor-critic framework of adaptive dynamic programming (ADP), the proposed method uses only identifier networks and critic networks for all the players to solve the coupled Hamilton-Jacobi (HJ) equations of multi-player NZS games, which effectively simplifies the algorithm and saves computing resources. Moreover, a tuning law based on the gradient descent method is designed for each critic network. Meanwhile, to remove the requirement for an initial stabilizing control, a novel stability term is designed to ensure system stability during the training phase of the critic NN. By means of the Lyapunov approach, it is proven that the system states, the critic network weight estimation errors and the obtained controls are all uniformly ultimately bounded (UUB). Finally, two numerical examples are simulated to illustrate the validity of the developed method for multi-player NZS games with control constraints.


I. INTRODUCTION
Theories of differential games have received increasing attention since they were first studied in [1]. With the efforts of scholars worldwide, differential game theory, which is closely linked to our daily life nowadays, has been widely used in economics, sociology and many other domains [2]-[7]. Three indispensable elements, i.e., players, control policies and performance functions, jointly build the cornerstone of differential game theory. In multi-player NZS games, the key is to obtain a cluster of optimal control policies, called a Nash equilibrium, with which each player pursues the minimization of its own performance function. In the meantime, the system stability should be ensured [8]. For linear NZS games, solving the issue is equivalent to solving coupled algebraic Riccati equations, while for nonlinear NZS games it is hard to obtain analytic solutions of the coupled HJ equations. In view of this, numerous intelligent methods have been designed by scholars to approximately attain the solutions.
During recent years, ADP methods have been widely utilized to solve nonlinear optimal control issues. The related algorithms can expediently obtain approximate optimal control schemes with the help of NN and avoid the ''curse of dimensionality'' induced by conventional dynamic programming approaches [9]-[12]. Owing to this merit of ADP, many algorithms have been proposed to address optimal control issues. To solve the optimal tracking issue, a novel data-driven ADP method was presented in [13]; owing to the use of recurrent NN, the method required only the available input-output data. In [14]-[17], on the basis of an actor-critic framework, different ADP algorithms were proposed to seek optimal control schemes. Instead of the actor-critic architecture employed by the above algorithms, a few algorithms utilized only critic NN to cope with optimal control issues [18]-[20]. Generally, this sort of framework can simplify the algorithm structure and effectively save computing resources. On account of the difficulty of coping with the coupled HJ equations, the issues of NZS games have become a challenging research area. Moreover, in many cases it is often difficult to acquire the actual model of the system dynamics [21]. Therefore, many scholars have employed various methods to learn the unknown system dynamics [22]. For instance, T-S fuzzy models can be employed to learn the dynamics because they are powerful in approximating nonlinear systems [23]. In [24], fuzzy logic models, which are similar to T-S fuzzy models, were combined with ADP to solve NZS game issues. For an unknown multi-input system, a three-layer NN identifier, a reinforcement learning scheme and NZS game theory were utilized together to solve the optimal tracking control issue [25].
In [26], a model network was designed to identify the system whose dynamics are unknown. Critic networks and actor networks were then employed to approximately learn the value functions and control policies for every player. The algorithm presented in [27] used an identifier-critic framework to address the issue of NZS games when the dynamics were unknown, and a novel tuning law was adopted to increase the convergence speed of the controlled system. The integral reinforcement learning (IRL) technology, which is a crucial method of implementing ADP, was used to deal with the unknown dynamics for NZS games [28], [29]. Recently, event-triggered mechanisms have been widely employed to save transmission bandwidth and computing resources [30], [31]. This technology was also combined with ADP to obtain control schemes for every player in NZS games [32].
In actual physical systems, it is often inevitable that the saturation nonlinearity of actuators severely hurts system performance. Hence control issues with constraints have attracted intensive attention. The method presented in [33] used an identifier-critic architecture to solve the Hamilton-Jacobi-Isaacs (HJI) equations for unknown systems with constrained input. A term was additionally designed to remove the need for an initial stabilizing control. In [34], by means of IRL, a data-based method was proposed to seek the solutions when the dynamics of a constrained-input system were completely unknown. An algorithm based on an event-triggered mechanism was proposed for CT systems with constrained input [35]. The method in [36] also utilized an event-triggered mechanism to solve nonlinear H∞ control issues with constraints. In [37], an IRL method was used to find the optimal control policies for players in NZS games with saturated actuators. Nevertheless, this method employed both actor NN and critic NN, which complicated the algorithm and aggravated the computing burdens. Besides, initial stabilizing control policies were required.
In general, owing to the existence of the coupled terms, it is hard to attain the control schemes for every player. To the best of our knowledge, few algorithms deal with multi-player NZS games with control constraints. Inspired by the existing methods, in this article, an online dual-network-based ADP method is developed to solve multi-player NZS games with control constraints. The contributions of this work can be summarized as follows. In the proposed online method, which can control the system online in real time without using historical data, an identifier-critic framework is designed. The identifier NN are used to estimate the system dynamics, and the critic NN are utilized to approximate the solutions of the coupled HJ equations. With this architecture, the algorithm is simplified and the computing burdens are lightened. By introducing an additional term into the update laws of the critic NN, the requirement of initial stabilizing control policies is relaxed. Furthermore, the stability of the multi-player NZS system is demonstrated by Lyapunov theory. Finally, simulation results verify the method.
The remainder of this article is structured as follows. Section II formulates the issue of multi-player NZS games with control constraints. In Section III, identifier NN and critic NN are constructed to learn the unknown internal dynamics and solve the coupled HJ equations; besides, for the critic NN, a novel stabilizing term is added to the update law to remove the need for an initial stabilizing control. Section IV demonstrates the UUB of the states, the critic weight estimation errors and the obtained controllers by the Lyapunov approach. Two numerical examples are given to verify the effectiveness of the presented method in Section V. Finally, Section VI provides the conclusion.
Notations: R represents the set of all real numbers, R^n is the n-dimensional Euclidean space, and R^{n×m} is the space of all real n×m matrices. Ω denotes a compact set containing the origin, and C^q(Ω) denotes the class of functions with continuous q-th derivatives on Ω. N denotes the set N = {1, . . . , N}, where N ≥ 2 is a positive integer. I_{n×n} denotes the identity matrix of dimension n, and ‖·‖ represents the Euclidean norm of a vector or a matrix. ∇(·) ≜ ∂(·)/∂x denotes the gradient operator.

II. PROBLEM FORMULATION
Consider the nonlinear system in the following affine form:

ẋ = f(x) + Σ_{j=1}^N g_j(x)u_j, (1)

where x ∈ Ω ⊂ R^n is the state, u_j ∈ Ω_u ⊂ R^m is the control input of the j-th player, Ω_u = {u_j ∈ R^m : |u_jp| ≤ α_j, p = 1, . . . , m}, and α_j > 0 denotes the constraint bound. f(x) ∈ R^n, g_j(x) ∈ R^{n×m}, and f(0) = 0.
Assumption 1: ∀j ∈ N, the system functions f(x) and g_j(x) are locally Lipschitz on Ω, and system (1) is controllable and observable.
Assumption 2: ∀j ∈ N, g_j(x) is bounded, i.e., there exists a positive constant g_jM such that ‖g_j(x)‖ ≤ g_jM.
Define the performance index for player i as

J_i(x(0)) = ∫_0^∞ r_i(x(τ), u_1, . . . , u_N) dτ, (2)

where the utility function is r_i(x, u_1, . . . , u_N) = x^T Q_i x + Σ_{j=1}^N C_j(u_j). Here the function C_j(u_j) is set as

C_j(u_j) = 2 ∫_0^{u_j} α_j (ϕ^{-1}(v/α_j))^T R_ij dv, (3)

where ϕ(·) ∈ R^m is a monotonic bounded odd function which belongs to C^q (q ≥ 1) and L_2(Ω), and ϕ^{-T} represents (ϕ^{-1})^T. Meanwhile, the above-mentioned Q_i and R_ij are positive definite symmetric matrices. In this work, we choose ϕ(·) = tanh(·); then we have

C_j(u_j) = 2 ∫_0^{u_j} α_j (tanh^{-1}(v/α_j))^T R_ij dv. (4)

The key point in solving NZS games is to seek the Nash equilibrium solution, whose definition is provided as follows [28], [38].

Definition 1: For the NZS game of system (1), an N-tuple of policies u* = {u_1*, . . . , u_i*, . . . , u_N*} is said to constitute a Nash equilibrium solution if, for any stabilizing policy u_i,

J_i(u_1*, . . . , u_i*, . . . , u_N*) ≤ J_i(u_1*, . . . , u_i, . . . , u_N*), ∀i ∈ N.

Definition 1 indicates that, without reasonable adjustments of the policies of all the other players, player i can hardly improve on its optimal performance.
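As a concrete check of the non-quadratic cost above, the following sketch considers a hypothetical scalar input channel with ϕ = tanh and illustrative values of α and R = r (these numbers are not from the paper's examples); it compares a numerical evaluation of C(u) with the closed form obtained by integrating tanh^{-1}:

```python
import numpy as np

def nonquadratic_cost(u, alpha, r, n_grid=200001):
    """Numerically evaluate C(u) = 2 * Int_0^u alpha * atanh(v/alpha) * r dv
    for a scalar input channel (trapezoid rule on a fine grid)."""
    v = np.linspace(0.0, u, n_grid)
    integrand = 2.0 * alpha * np.arctanh(v / alpha) * r
    h = u / (n_grid - 1)
    return h * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

def nonquadratic_cost_closed_form(u, alpha, r):
    """Closed form: since Int atanh(v/a) dv = u*atanh(u/a) + (a/2)*ln(1-u^2/a^2),
    C(u) = 2*alpha*r*[u*atanh(u/alpha) + (alpha/2)*ln(1 - (u/alpha)^2)]."""
    return 2.0 * alpha * r * (u * np.arctanh(u / alpha)
                              + 0.5 * alpha * np.log(1.0 - (u / alpha) ** 2))
```

Both evaluations agree closely, and the cost is positive and grows steeply as |u| approaches the bound α, which is what penalizes saturation.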
For any stabilizing control policy pair u = {u_1, . . . , u_N}, the value function can be defined as

V_i(x) = ∫_t^∞ (x^T Q_i x + Σ_{j=1}^N C_j(u_j)) dτ. (5)

Assume that the function (5) is continuous and differentiable; then the differential equivalent for the i-th player can be described as

0 = x^T Q_i x + Σ_{j=1}^N C_j(u_j) + (∇V_i)^T (f(x) + Σ_{j=1}^N g_j(x)u_j), V_i(0) = 0. (6)

Then we have the Hamiltonian function

H_i(x, ∇V_i, u_1, . . . , u_N) = x^T Q_i x + Σ_{j=1}^N C_j(u_j) + (∇V_i)^T (f(x) + Σ_{j=1}^N g_j(x)u_j). (7)

Moreover, the optimal value function for the i-th player is given as

V_i*(x) = min_{u_i} ∫_t^∞ (x^T Q_i x + Σ_{j=1}^N C_j(u_j)) dτ, (8)

and, from the stationarity condition ∂H_i/∂u_i = 0, the optimal control for the i-th player is

u_i* = -α_i tanh((1/(2α_i)) R_ii^{-1} g_i^T(x) ∇V_i*). (9)

Then we obtain the coupled HJ equations of system (1) as

0 = x^T Q_i x + Σ_{j=1}^N C_j(u_j*) + (∇V_i*)^T (f(x) + Σ_{j=1}^N g_j(x)u_j*), (10)

where

D_j ≜ (1/(2α_j)) R_jj^{-1} g_j^T(x) ∇V_j*. (11)

Then, the coupled HJ equation (10) is denoted as

0 = x^T Q_i x + (∇V_i*)^T f(x) - Σ_{j=1}^N α_j (∇V_i*)^T g_j(x) tanh(D_j) + Σ_{j=1}^N C_j(u_j*). (12)

Note that it is intractable to tackle equation (12) because of the coupled terms and partial derivatives. Therefore, in what follows, an identifier-critic NN framework is developed to approximately solve the coupled HJ equations.
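The tanh-form policy above can be sketched as a small helper; the matrices in the usage test are illustrative placeholders, not the paper's examples:

```python
import numpy as np

def constrained_policy(grad_V, g, R, alpha):
    """Compute u = -alpha * tanh((1/(2*alpha)) * R^{-1} g(x)^T grad_V),
    the tanh-saturated policy form.
    grad_V: (n,) value-function gradient at x
    g:      (n, m) input matrix g(x)
    R:      (m, m) positive definite weight
    alpha:  scalar constraint bound."""
    D = np.linalg.solve(R, g.T @ grad_V) / (2.0 * alpha)
    return -alpha * np.tanh(D)
```

Because |tanh(·)| < 1, every component of the returned control is strictly inside the bound α regardless of how large the value gradient becomes, which is exactly how the saturation constraint is enforced by construction.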

III. MULTI-PLAYER LEARNING FOR NZS GAMES WITH CONTROL CONSTRAINTS USING IDENTIFIER NN AND CRITIC NN
In this section, for the purpose of solving the multi-player NZS games with control constraints, an online adaptive method based on NN is proposed. An identifier-critic framework is employed to approximately solve the HJ equations. Besides, for the critic NN, by designing an additional stabilizing term based on Lyapunov theory, the requirement for an initial stabilizing control is relaxed.

A. IDENTIFIER DESIGN
By constructing an NN-based ideal identifier, (1) can be rewritten as

ẋ = Sx + W_f γ_f(x) + Σ_{j=1}^N g_j(x)u_j + ε_f, (13)

where W_f ∈ R^{n×N_ii} is the ideal weight matrix, γ_f(x) ∈ R^{N_ii} is the activation function vector, S ∈ R^{n×n} is a Hurwitz matrix, and ε_f ∈ R^n denotes the reconstruction error. N_ii represents the number of neurons of the identifier NN.
In practice, when employing the identifier NN to learn the unknown internal dynamics, we have

dx̂/dt = Sx̂ + Ŵ_f γ_f(x̂) + Σ_{j=1}^N g_j(x)û_j, (14)

where x̂ is the estimated state vector and Ŵ_f is the estimated weight matrix.
Let the state estimation error be x̃ = x - x̂ and the weight estimation error be W̃_f = W_f - Ŵ_f; then we can derive that

dx̃/dt = Sx̃ + W_f (γ_f(x) - γ_f(x̂)) + W̃_f γ_f(x̂) + ε_f. (15)

Before proceeding, two mild assumptions are necessary [39].
Assumption 3: The ideal identifier weight W_f and the reconstruction error ε_f in the reconstructed dynamics (13) are both bounded, i.e., there exist positive constants W_fM and ε_fM such that

‖W_f‖ ≤ W_fM, ‖ε_f‖ ≤ ε_fM. (16)

Assumption 4: The activation function γ_f(·) is Lipschitz continuous on Ω, i.e., ‖γ_f(x) - γ_f(x̂)‖ ≤ L_γf ‖x̃‖, where L_γf is a positive constant.

Theorem 1: Consider the system (1) and employ the identifier (14). When the identifier weight is tuned with

dŴ_f/dt = x̃ γ_f^T(x̂) H, (17)

where H ∈ R^{N_ii×N_ii} is a positive definite symmetric matrix, the state estimation error x̃ is UUB.
Proof: The Lyapunov function candidate is constructed as

L_d = (1/2) x̃^T x̃ + (1/2) tr(W̃_f H^{-1} W̃_f^T). (18)

Due to (15) and the above two assumptions, we have

L̇_d ≤ -x̃^T M_s x̃ + x̃^T W̃_f γ_f(x̂) + ε_fM ‖x̃‖ + tr((dW̃_f/dt) H^{-1} W̃_f^T), (19)

where M_s = -S - W_fM L_γf I_{n×n}. The matrix M_s can be made positive definite by appropriately choosing the matrix S. Since dW̃_f/dt = -dŴ_f/dt and the tuning law (17) holds, we have

tr((dW̃_f/dt) H^{-1} W̃_f^T) = -x̃^T W̃_f γ_f(x̂). (20)

From (19) and (20), it yields that

L̇_d ≤ -P_Ms ‖x̃‖^2 + ε_fM ‖x̃‖, (21)

where P_Ms = λ_min(M_s). When ‖x̃‖ > ε_fM/P_Ms, it holds that L̇_d < 0. Based on the analysis above, x̃ is UUB when the tuning law (17) is utilized.
Here, the identifier dynamics Sx + Ŵ_f γ_f(x) is obtained and will be utilized below.
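To illustrate the identifier and its tuning law, here is a minimal one-dimensional sketch under assumed choices (S = -2, scalar activation γ_f(x) = x, true internal dynamics f(x) = -x so the ideal weight is W_f = 1, and a sinusoidal probing input). It is an illustration of the scheme, not the paper's simulation:

```python
import numpy as np

dt, T = 1e-3, 20.0
S, H = -2.0, 2.0          # Hurwitz "matrix" and positive tuning gain (assumed)
g = 1.0                   # known input gain g(x)
x, x_hat, w_hat = 1.0, 0.0, 0.0
errs = []
for k in range(int(T / dt)):
    t = k * dt
    u = np.sin(t)                                  # probing input for excitation
    x_til = x - x_hat                              # state estimation error
    # identifier (14)-style: x_hat_dot = S*x_hat + w_hat*gamma_f(x_hat) + g*u
    x_hat += dt * (S * x_hat + w_hat * x_hat + g * u)
    # plant: x_dot = f(x) + g*u = -x + u
    x += dt * (-x + u)
    # tuning law (17)-style: w_hat_dot = x_til * gamma_f(x_hat) * H
    w_hat += dt * (H * x_hat * x_til)
    errs.append(abs(x_til))
```

Over the run, the estimation error shrinks from its initial value as the weight adapts, consistent with the UUB claim of Theorem 1.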

B. ADAPTIVE CONTROLLER DESIGN
Owing to the universal approximation property of NN, the value function for player i can be represented as

V_i*(x) = W_i^T γ_i(x) + ε_i, (22)

where W_i ∈ R^{N_ci} and γ_i ∈ R^{N_ci} are the ideal weight vector and the activation function vector, respectively. N_ci denotes the number of neurons for player i and ε_i ∈ R is the reconstruction error. According to (22), the derivative of V_i*(x) with respect to x is

∇V_i*(x) = (∇γ_i)^T W_i + ∇ε_i, (23)

where ∇γ_i = ∂γ_i(x)/∂x ∈ R^{N_ci×n} and ∇ε_i denotes the partial derivative of ε_i. Since the ideal weights are difficult to obtain, we utilize estimated critic networks to approximate V_i*:

V̂_i(x) = Ŵ_i^T γ_i(x). (24)

Accordingly, the partial derivative of V̂_i is

∇V̂_i(x) = (∇γ_i)^T Ŵ_i. (25)

From (9) and (23), u_i* is derived as

u_i* = -α_i tanh((1/(2α_i)) R_ii^{-1} g_i^T(x)((∇γ_i)^T W_i + ∇ε_i)). (26)

Similarly, the approximate control for the i-th player is presented as

û_i = -α_i tanh((1/(2α_i)) R_ii^{-1} g_i^T(x)(∇γ_i)^T Ŵ_i). (27)

Combining (10), (12), (24) and (27), we derive the approximate HJ error as

e_i = x^T Q_i x + Σ_{j=1}^N C_j(û_j) + Ŵ_i^T ∇γ_i (f(x) + Σ_{j=1}^N g_j(x)û_j). (28)

Since the dynamics f(x) is unknown, the identifier dynamics Sx + Ŵ_f γ_f(x) is used here to substitute for f(x) in the approximate Hamiltonian function when computing the error e_i.
For the purpose of minimizing e_i, it is required to find the weight Ŵ_i that minimizes the squared residual E_i = (1/2) e_i^T e_i. By employing the gradient descent method [14], [40], the updating law of the critic NN can be described as

dŴ_i/dt = -η_i (κ_i/(1 + κ_i^T κ_i)^2) e_i, (29)

where η_i > 0 is the learning rate and κ_i = ∇γ_i (Sx + Ŵ_f γ_f(x) + Σ_{j=1}^N g_j(x)û_j).

Note that by utilizing the tuning law (29) for player i, an initial stabilizing control is generally needed. If the initial control is not stabilizing, the stability of the system might not be ensured with the tuning law (29). In addition, a noise signal is usually required to be added to the system to satisfy the persistence of excitation (PE) condition during the training process of the critic NN, which could influence the stability of the closed-loop system. Therefore, based on (29), a novel tuning law is developed by Lyapunov theory to remove the requirement for the initial stabilizing control. Before proceeding, a reasonable assumption is given.
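The normalized gradient-descent update can be sketched as a single discrete step on the squared residual; here `sigma` stands in for the regressor κ_i and all values are illustrative assumptions:

```python
import numpy as np

def critic_update_step(w_hat, sigma, target, eta=0.5):
    """One normalized gradient-descent step on E = e^2/2, where the residual
    is e = w_hat^T sigma + target (target plays the role of the utility term).
    The normalization (1 + sigma^T sigma)^2 keeps the step well-conditioned."""
    e = w_hat @ sigma + target
    m = 1.0 + sigma @ sigma
    w_new = w_hat - eta * (sigma / m**2) * e
    return w_new, e
```

For a fixed data point the residual magnitude shrinks at every step, since the update scales e by (1 - η σᵀσ/(1+σᵀσ)²), a factor strictly between 0 and 1 for η ≤ 1.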
Assumption 5: ∀i ∈ N, there exists a Lyapunov candidate L_i(x) which is continuously differentiable and radially unbounded such that, when the control policies are optimal,

L̇_i = (∇L_i)^T (f(x) + Σ_{j=1}^N g_j(x)u_j*) = -(∇L_i)^T ϒ_i(x) ∇L_i < 0, (30)

where the matrix ϒ_i(x) ∈ R^{n×n} is symmetric and positive definite.

Consider the Lyapunov candidate L_i(x) for player i; along the trajectories generated by the approximate controls, its derivative is

L̇_i = (∇L_i)^T (f(x) + Σ_{j=1}^N g_j(x)û_j). (31)

Aiming at ensuring the system stability during the NN learning phase, a novel stabilizing term is developed by the gradient descent method as

T_s = (η_si/2) ∇γ_i g_i(x) R_ii^{-1} g_i^T(x) ∇L_i, (32)

where η_si > 0 is a designed gain. Based on the aforementioned assumptions and analysis, by introducing the developed term T_s, a novel tuning law for critic NN i is developed as

dŴ_i/dt = -η_i (κ_i/(1 + κ_i^T κ_i)^2) e_i + Π(x, û_1, . . . , û_N) T_s, (33)

where the indicator function Π(x, û_1, . . . , û_N) is defined as

Π(x, û_1, . . . , û_N) = 0 if (∇L_i)^T (f(x) + Σ_{j=1}^N g_j(x)û_j) < 0 for all i ∈ N, and Π(x, û_1, . . . , û_N) = 1 otherwise. (34)

Remark 1: Consider the tuning law (33) and the indicator function (34). Both the first and the second term in (33) are developed on the basis of the gradient descent method. The former is employed to minimize the error function E_i, and the latter stabilizes the closed-loop system during the training phase of the critic NN. It can be seen that when the system is unstable, Π(x, û_1, . . . , û_N) = 1 and the latter term in (33) works to ensure the system stability. By this means, the requirement for an initial stabilizing control is removed.
Remark 2: Note that the purpose of the NZS games is to obtain control policies for all the players that stabilize the system while each player minimizes its own performance index function. Each player has its individual goal to follow. Hence Π(x, û_1, . . . , û_N) = 0 holds only when the condition (∇L_i)^T (f(x) + Σ_{j=1}^N g_j(x)û_j) < 0 is satisfied for all the players.

Note that dW̃_i/dt = -dŴ_i/dt. For the convenience of the stability analysis later, from (29) and (33), we derive

dW̃_i/dt = η_i (κ_i/(1 + κ_i^T κ_i)^2) e_i - Π(x, û_1, . . . , û_N) T_s. (35)
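The indicator logic above can be sketched as a small helper; the function name and its list-based interface are assumptions made for illustration:

```python
import numpy as np

def indicator(grad_L_list, f_x, g_list, u_list):
    """Indicator-function sketch: return 0 when every player's Lyapunov
    derivative grad(L_i)^T (f(x) + sum_j g_j(x) u_j) is negative (all
    players see a decreasing Lyapunov candidate), and 1 otherwise.
    grad_L_list: list of (n,) gradients, one per player
    f_x:         (n,) drift f(x)
    g_list:      list of (n, m) input matrices
    u_list:      list of (m,) controls."""
    x_dot = f_x + sum(g @ u for g, u in zip(g_list, u_list))
    return 0 if all(gL @ x_dot < 0 for gL in grad_L_list) else 1
```

With a stable drift (negative Lyapunov derivative) the helper returns 0 and the stabilizing term in the tuning law is switched off; otherwise it returns 1 and the term is activated.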

IV. STABILITY ANALYSIS
In this section, the stability of the system under the proposed method is demonstrated. Before proceeding, according to [14], [20], [41]-[43], some assumptions are needed.

Assumption 6: ∀i ∈ N, the normalized signal κ̄_i = κ_i/(1 + κ_i^T κ_i) is persistently excited over the time interval [t, t + T], i.e., there exist positive constants ω_i > 0 and T > 0 such that

ω_i I ≤ ∫_t^{t+T} κ̄_i κ̄_i^T dτ. (36)

Assumption 7: ∀i ∈ N, the ideal critic NN weight W_i is bounded such that ‖W_i‖ ≤ W_iM, where W_iM is a positive constant. Besides, there exist positive constants ε_iM, e_iM, ε*_iM and γ_iM such that ‖∇ε_i‖ ≤ ε_iM, ‖e_i‖ ≤ e_iM, ‖ε_{u_i*}‖ ≤ ε*_iM and ‖∇γ_i‖ ≤ γ_iM.
To expediently present the main theorem, according to [9], [24], a requisite lemma is stated in advance.
Lemma 1: Consider the NZS system (1). The critic tuning law and the control policy for each player are given by (29) and (27), respectively. Let Assumptions 6-7 hold; then W̃_i is UUB for all i ∈ N.
Proof: Construct the Lyapunov function candidate L_0 = (1/(2η_i)) W̃_i^T W̃_i. Taking its derivative along (35) with Π = 0 and applying Young's inequality to the cross terms, an upper bound of the form

L̇_0 ≤ -ω_i ‖W̃_i‖^2 + c_i ‖W̃_i‖ (39)

can be derived, where c_i is a positive constant determined by the bounds in Assumption 7. Then, (39) implies that L̇_0 < 0 when ‖W̃_i‖ > c_i/ω_i, so W̃_i is UUB.

Theorem 2: Consider the constrained-input NZS system (1). The novel critic tuning law and control policy for each player are designed by (33) and (27), respectively. Suppose all the aforementioned assumptions hold. Then, the system states and the weight estimation errors of the critic NN are UUB under the condition

N^2 γ_M^2 η_RM^2 g_M^4 < λ_m ω_m, (40)

where N is the number of players and ω_m denotes the minimum level of the exploratory noise signal.
Proof: The Lyapunov function candidate is constructed as

L = L_a + L_b, (41)

where L_a = Σ_{i=1}^N (1/(2η_i)) W̃_i^T W̃_i and L_b = Σ_{i=1}^N L_i, with L_i given in Assumption 5. Taking the derivatives of L_a and L_b in (41) and invoking Lemma 1, an upper bound on L̇ is obtained, denoted as (45). The following two cases are considered for the convenience of the proof.
Case i: Π(x, û_1, . . . , û_N) = 0, which means that (∇L_i)^T (f(x) + Σ_{j=1}^N g_j(x)û_j) < 0, ∀i ∈ N. Due to the PE condition, ‖ẋ‖ > 0 is guaranteed. By choosing a positive constant δ that satisfies 0 < δ < ‖ẋ‖, we can derive (∇L_i)^T ẋ < -‖∇L_i‖ δ < 0. Then, (45) implies that L̇ < 0, ∀i ∈ N, whenever ‖∇L_i‖ or ‖W̃_i‖ exceeds its corresponding bound.

Case ii: Π(x, û_1, . . . , û_N) = 1, which means that (∇L_i)^T (f(x) + Σ_{j=1}^N g_j(x)û_j) ≥ 0 holds for at least one player. The system (1) might be unstable, and the additional term in (33) works. By utilizing a Taylor series expansion and the properties of the hyperbolic function tanh(·), the bound (45) can be further estimated as (52), where

χ = 2√m Σ_{j=1}^N α_j g_jM + Σ_{j=1}^N g_jM ε*_jM.

Observing (52), if the condition N^2 γ_M^2 η_RM^2 g_M^4 < λ_m ω_m is satisfied, then, ∀i ∈ N, L̇ < 0 whenever ‖∇L_i‖ or ‖W̃_i‖ exceeds its corresponding bound.

Based on the analysis of the above two cases and the Lyapunov extension theorem, we can conclude that, ∀i ∈ N, ∇L_i and the weight estimation error W̃_i are both UUB. Moreover, the system state x is UUB due to the radial unboundedness of L_i.
Remark 3: Different from the method proposed in [37], in which each player needs both a critic NN and an actor NN, the method presented in this article requires only a critic NN for each player to tackle the coupled HJ equations. By discarding the actor NN, an identifier-critic framework is employed, which effectively reduces the algorithm complexity and saves computing resources.
Remark 4: Owing to the introduction of the indicator function Π(x, û_1, . . . , û_N), the above analysis considers two different cases to demonstrate the UUB of the system states and the critic NN weight estimation errors. When Π(x, û_1, . . . , û_N) = 1, the system might not be stable, and the stabilizing term is added to stabilize it; by this means the requirement for an initial stabilizing control is removed. It is noted that when the initial control is stabilizing and the added noise signal does not affect the system stability, Π(x, û_1, . . . , û_N) = 0 holds throughout the NN training phase, and the designed tuning laws for the critic NN are equivalent to the standard tuning laws without the designed term. In addition, based on the analysis of the above two cases, the bound of the weight estimation error W̃_i is obtained, which will be utilized in the following Theorem 3.
Theorem 3: Consider the NZS game with control constraints of system (1). The critic NN weight tuning laws and the control schemes are given by (33) and (27), respectively. Under the aforementioned assumptions and the condition (40), the obtained control û_i converges to u_i* within a small bound.
Proof: When all the assumptions and the condition (40) hold, according to (26), (27) and the conclusion of Theorem 2, we can derive

‖û_i - u_i*‖ ≤ ε_uib, (55)

where ε_uib is a small positive constant determined by the bounds of W̃_i, ∇ε_i and ε*_iM.

Remark 5: By observing (55), we conclude that the acquired control û_i converges to u_i* within the small bound ε_uib. Hence, the acquired control signal is UUB with the tuning law (33).

V. SIMULATIONS
In this section, two examples are simulated to illustrate the validity of the developed method.
We show the simulation results in Figs. 1-5. The state estimation error of the identifier is given in Fig. 1, which shows the validity of the identifier. Fig. 2 shows the trajectories of the system states during the training phase. The amplitudes of the states gradually decrease over time and rapidly go down to zero after t = 150 s, when the probing signal is removed. The convergence of the critic NN weights for both players is indicated by Fig. 3 and Fig. 4, from which we can observe that the weights converge to constants after t = 120 s. Then, Fig. 5 presents the control policies for the players during the training process of the critic NN. It can be clearly observed that the amplitudes of the control signals are limited by the threshold 0.4.
The simulation results are presented in Figs. 6-11. The state estimation error curve of the identifier is presented in Fig. 6 to show the validity of the designed identifier. From Fig. 7, by observing the trajectories of the states, we can find that the amplitudes go down to zero rapidly when the noises are removed. Figs. 8-10 present the convergence of the weights of all the critic NN, and the weights of all networks converge to constants after t = 110 s. Fig. 11 shows the control policies during the learning phase of the critic NN. From Figs. 6-11 and the main theorems, it is not difficult to conclude that all the signals (i.e., the system states, the critic weights and the controls) in the three-player nonlinear system (57) with the developed control policies are UUB.

VI. CONCLUSION
In this article, a novel online ADP method was developed to solve the multi-player NZS game issue with control constraints. To tackle the constrained inputs, non-quadratic terms were first utilized to constitute the performance index functions. Then, to simplify the algorithm and ease the computing burdens, only identifier NN and critic NN were employed to approximately solve the coupled HJ equations of the partially unknown NZS games. A tuning law with a novel stability term was developed for each critic NN such that the stability of the closed-loop system was guaranteed during the NN training phase and the need for an initial stabilizing control was removed. All the signals of the system were demonstrated to be UUB by utilizing Lyapunov theory. Finally, a two-player linear system and a three-player nonlinear system were simulated to verify the validity of the novel ADP method.
Our future work is to employ the event-triggered mechanism in the present algorithm framework to further save communication and computing resources. Meanwhile, in order to improve the applicability and practicability of the algorithm, we will focus on extending the present method to stochastic systems, switched systems and other types of nonlinear systems with completely unknown dynamics.
PENGDA LIU received the B.S. degree in automation control and the M.S. degree in control engineering from Northeastern University, Shenyang, China, in 2012 and 2014, respectively, where he is currently pursuing the Ph.D. degree in control theory and control engineering with the College of Information Science and Engineering.
His research interests include adaptive dynamic programming, reinforcement learning, and optimal control.

HUAGUANG ZHANG (Fellow, IEEE) received the B.S. and M.S. degrees in control engineering from Northeast Dianli University, Jilin City, China, in 1982 and 1985, respectively, and the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991.
He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and the Head of the School of Information Science and Engineering, Institute of Electric Automation, Northeastern University. His main research interests include fuzzy control, stochastic system control, neural networks-based control, and nonlinear control and their applications. He has authored or coauthored over 280 journal articles and conference papers, six monographs, and co-invented 90 patents.
His research interests include reinforcement learning, optimal control, fuzzy control, and adaptive dynamic programming and their applications.