Data-Driven Integral Reinforcement Learning for Continuous-Time Non-Zero-Sum Games

This paper develops an integral value iteration (VI) method to efficiently find, online, the Nash equilibrium solution of two-player non-zero-sum (NZS) differential games for linear systems with partially unknown dynamics. To guarantee closed-loop stability at the Nash equilibrium, an explicit upper bound on the discount factor is given. To show the efficacy of the presented online model-free method, the integral VI method is compared with the model-based offline policy iteration method. Moreover, a theoretical analysis of the integral VI algorithm is provided in detail in terms of three aspects: the positive definiteness of the updated cost functions, the stability of the closed-loop system, and the conditions that guarantee monotone convergence. Finally, simulation results demonstrate the efficacy of the presented algorithms.


I. INTRODUCTION
Game theory is a powerful and natural framework for representing the interactions among multiple players, where each player seeks to maximize its own interest. It has been widely and successfully used in a variety of engineering domains, including power systems [1], transportation [2], and control systems [3]. In zero-sum (ZS) games, which are strictly competitive, each player's gain or loss is exactly balanced by the losses or gains of the others. In contrast, non-zero-sum (NZS) games can take into account both individual self-interest and global group interest, as in mixed H2/H∞ control [4]. In this paper, NZS games with two players for continuous-time linear systems are investigated.
Differential games, in which the states of the agents evolve according to differential dynamic equations, were originally introduced in [5]. (The associate editor coordinating the review of this manuscript and approving it for publication was Xiaoli Luan.) In ZS differential game theory, seeking the Nash equilibrium amounts to solving coupled
Hamilton-Jacobi equations (HJEs) [6]-[8]. For linear systems, the HJEs reduce to coupled algebraic Riccati equations (CAREs) [9], [10]. For NZS differential games, on the other hand, the Nash equilibrium solution is found by solving coupled HJEs for nonlinear systems and CAREs for linear systems [11], [12]. It is difficult or even impossible to obtain an analytical solution to coupled HJEs or CAREs. Many approaches have been presented to approximate the solution of the CAREs, such as Newton's method, Lyapunov iteration [13], Riccati iteration [14], and the parallel synchronous algorithm [15]. However, these numerical methods are essentially offline and require complete knowledge of the system dynamics, which might not be available in reality. Therefore, it is desirable to develop an online method that obviates the complete-model requirement.
Adaptive dynamic programming (ADP)/reinforcement learning (RL) is a bio-inspired learning method that tries to find the optimal policy optimizing the cumulative reward [16]. RL has been widely used in dynamic optimization applications, such as ZS games with two players [17], NZS games with multiple players [18], optimal regulation/tracking problems for single-agent systems [19], [20], and consensus problems of multi-agent systems [21]-[24]. Value iteration (VI) and policy iteration (PI) algorithms are typical ADP/RL methods for approximating the optimal value function and policy [25], [26]. In the PI algorithm, the initial policy has to be admissible in order to guarantee closed-loop stability in the iterative learning process [27]-[29]. In contrast, the VI algorithm does not require an admissible initial policy [30]; however, the closed-loop stability of the VI algorithm in each iteration cannot be guaranteed. In this paper, a novel integral VI method is developed to obviate the requirement of an initially admissible policy while guaranteeing closed-loop stability during the learning process. Typical model-free ADP/RL methods are Q-learning [31]-[33] and off-policy RL [34]-[36]. In the Q-learning algorithm, an action-dependent value function representation is used to evaluate the action given the state; however, only convergence is considered for the Q-learning algorithm in [32]. On the other hand, the off-policy RL method is equivalent to the model-based PI algorithm, which still requires the initial policy to be admissible. Therefore, it is desirable to obviate the admissibility requirement on the initial policy while guaranteeing closed-loop stability for model-free ADP/RL methods.

VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
The main contributions of this paper are summarized as follows. 1) A novel data-driven value iteration algorithm is developed for solving the NZS games for linear dynamical systems. 2) An explicit upper bound on the discount factor is given to ensure the asymptotic stability of the closed-loop system at the Nash equilibrium; moreover, it is shown that the undiscounted NZS games can be viewed as a special case of the discounted NZS games. 3) For the presented data-driven value iteration algorithm, a theoretical analysis is given in terms of the positive definiteness of the iterative value function, the closed-loop stability, and the convergence to the optimal solution.

The rest of this paper is organized as follows. In Section II, the problem formulation and preliminaries are presented; it is shown that the coupled AREs are necessary and sufficient for the Nash equilibrium. In Section III, an integral VI algorithm and its equivalent form are developed. In Section IV, the positive definiteness of the updated cost functions, the stability of the closed-loop system, and the conditions that guarantee monotone convergence are proven. In Section V, examples are given to demonstrate the effectiveness of the proposed algorithm. Finally, conclusions are drawn in Section VI.

II. PROBLEM FORMULATION
We consider the continuous-time linear dynamical system

ẋ(t) = Ax(t) + B1 u1(t) + B2 u2(t),   x(0) = x0,   (1)

where x ∈ R^n is the system state with initial state x0, u1 ∈ R^{m1} is the control input of player one, and u2 ∈ R^{m2} is the control input of player two.
For each player, the NZS differential game on an infinite time horizon aims to minimize the discounted cost function

V_i(x0) = ∫_0^∞ e^{−α_i τ} ( x^T Q_i x + u1^T R_{i1} u1 + u2^T R_{i2} u2 ) dτ,   i = 1, 2,   (2)

where Q_i ≥ 0, R_{i1} > 0, and R_{i2} > 0 are penalty matrices for players one and two, and α1 > 0, α2 > 0 are discount factors for players one and two, respectively. As shown later, an explicit bound on the discount factor is given to ensure the asymptotic stability of the closed-loop system at the Nash equilibrium.
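As a quick numerical illustration (not from the paper): under fixed stabilizing feedback, a discounted cost of the form (2) is quadratic in the initial state, V(x0) = p x0^2, where p solves a discounted Lyapunov equation. The scalar system and weights below are hypothetical; the simulation checks the identity numerically.

```python
import math

# Hypothetical scalar example: closed-loop dynamics x' = abar*x with
# stabilizing abar < 0, stage cost qbar*x^2, and discount factor alpha.
abar, qbar, alpha = -1.0, 1.0, 0.2

# Scalar discounted Lyapunov equation: 2*abar*p - alpha*p + qbar = 0
p = qbar / (alpha - 2.0 * abar)

# Accumulate V(x0) = integral of e^{-alpha*t} * qbar * x(t)^2 dt by Euler simulation
x0, dt, T = 1.0, 1e-3, 30.0
x, t, V = x0, 0.0, 0.0
while t < T:
    V += math.exp(-alpha * t) * qbar * x * x * dt
    x += dt * abar * x          # Euler step of x' = abar*x
    t += dt

print(p * x0 * x0, V)           # the analytic and simulated values agree closely
```

The same quadratic structure, V_i(x) = x^T P_i x, is what the later value-function parameterization exploits.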
The following definitions are required for subsequent discussions.
Definition 1 (Admissible Control): A feedback control policy pair µ = {u1, u2} is said to be admissible with respect to the performance (2) on a compact set Ω ⊆ R^n, denoted as µ ∈ ψ(Ω), if µ = {u1, u2} is continuous on Ω, µ stabilizes (1) on Ω, u_i(0) = 0 for i = 1, 2, and the performance functions V_i(x0) in (2) take finite values for all x0 ∈ Ω.

Definition 2 (Nash Equilibrium Strategies): A two-tuple of strategies µ* = {u*_1, u*_2} with µ* ∈ ψ(Ω) is said to constitute a Nash equilibrium solution of the two-player game if the following two inequalities are satisfied:

V1(u*_1, u*_2) ≤ V1(u1, u*_2),   V2(u*_1, u*_2) ≤ V2(u*_1, u2)

for all admissible u1 and u2. The two-tuple {u*_1, u*_2} is known as a Nash equilibrium of the two-player game.
In this paper, the problem of interest can be formulated as follows.
Problem 1 (Discounted Two-Player NZS Games): For the two players in system (1), find the Nash equilibrium strategies (u * 1 , u * 2 ) with respect to the cost functions defined in (2).

A. COUPLED ALGEBRAIC RICCATI EQUATIONS FOR DISCOUNTED NZS GAMES
In this subsection, the necessary and sufficient conditions for the Nash equilibrium of Problem 1, namely the coupled algebraic Riccati equations, are introduced.

Lemma 1 [6]: Under Assumption 1, consider the system (1) with the performance functions defined by (2). Then, (K*_1, K*_2), defined as K*_i = R_ii^{-1} B_i^T P*_i for i = 1, 2, is a feedback Nash equilibrium if and only if (P*_1, P*_2) is a symmetric stabilizing solution of the CAREs (4) and (5).

Definition 3 (Riccati Operator): For each player, the Riccati operator Ric_{α_i}(X1, X2) is defined as in (6) and (7).
Remark 1: The Riccati operator Ric plays an important role in evaluating the policy of each player with respect to the performance defined by (2). 1) If Ric_{α_i}(P*_1, P*_2) = 0, then the performance indices (2) are minimized and both players in system (1) have reached the Nash equilibrium. 2) If ‖Ric_{α_i}(P^(k+1)_1, P^(k+1)_2)‖ ≤ ‖Ric_{α_i}(P^(k)_1, P^(k)_2)‖ holds, then the performance at step k + 1 is closer to the optimal solution than at step k.

B. OFFLINE POLICY ITERATION ALGORITHM
In iterative ADP algorithms, the optimal feedback gain is obtained by successive approximation. In the k-th iteration, we denote the admissible policy for player i as u^(k)_i = −K^(k)_i x. Then, the corresponding discounted Bellman equations can be written as (8) and (9) [37], which can be equivalently expressed as the Lyapunov equations (11) and (12). The PI algorithm has been successfully used to solve the HJE and ARE in optimal control theory [20], [37]; here, it is extended to approximate the solution of the CAREs iteratively. The closed-loop dynamics under the gains K^(k)_i is Ā^(k) = A − B1 K^(k)_1 − B2 K^(k)_2. Then, the following offline PI algorithm can be presented to find the solution of the CAREs (4) and (5).

Algorithm 1 Offline Policy Iteration Algorithm
1: Initialization: select initial admissible gains K^(0)_1 and K^(0)_2 such that the closed-loop system (1) is stable, and set k = 0.
2: Policy Evaluation: solve (11) and (12) for P^(k)_1 and P^(k)_2.
3: Policy Improvement: update the control policy gains as K^(k+1)_i = R_ii^{-1} B_i^T P^(k)_i, i = 1, 2.
4: Stop at convergence; otherwise set k = k + 1 and go to Step 2.

Remark 2: As shown in [8], the convergence of Algorithm 1 to the solution of the CAREs (4) and (5), and the closed-loop stability of the iterative control policy for each player, can be guaranteed.
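As an illustration (not taken from the paper), the evaluation/improvement loop of Algorithm 1 can be sketched for the undiscounted case α_i = 0. The system matrices, the weights Q_i and R_ij, and the coupled-Lyapunov form used here are hypothetical stand-ins for the paper's equations (11) and (12); the Lyapunov solver uses standard Kronecker vectorization.

```python
import numpy as np

def lyap(Ac, Qtot):
    """Solve Ac^T P + P Ac + Qtot = 0 via Kronecker vectorization."""
    n = Ac.shape[0]
    M = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
    P = np.linalg.solve(M, -Qtot.flatten(order="F")).reshape((n, n), order="F")
    return 0.5 * (P + P.T)           # symmetrize against round-off

# Hypothetical two-player example; A is stable, so K1 = K2 = 0 is admissible.
A  = np.array([[0.0, 1.0], [-1.0, -2.0]])
B1 = np.array([[0.0], [1.0]]); B2 = np.array([[1.0], [0.0]])
Q1 = Q2 = np.eye(2)
R11 = R22 = np.array([[1.0]]); R12 = R21 = np.array([[1.0]])

K1 = np.zeros((1, 2)); K2 = np.zeros((1, 2))
for k in range(50):
    Ac = A - B1 @ K1 - B2 @ K2
    # Policy evaluation: coupled Lyapunov equations (undiscounted case)
    P1 = lyap(Ac, Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2)
    P2 = lyap(Ac, Q2 + K2.T @ R22 @ K2 + K1.T @ R21 @ K1)
    # Policy improvement
    K1n = np.linalg.solve(R11, B1.T @ P1)
    K2n = np.linalg.solve(R22, B2.T @ P2)
    done = np.linalg.norm(K1n - K1) + np.linalg.norm(K2n - K2) < 1e-10
    K1, K2 = K1n, K2n
    if done:
        break

print(P1, P2)   # approximate stabilizing solution of the coupled equations
```

At a fixed point of this loop, the coupled Lyapunov equations coincide with the (undiscounted) CAREs, which is why PI converges to the game solution when it converges at all.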
Note that in Algorithm 1, the solution to the CAREs (4) and (5) is obtained offline, complete knowledge of the system dynamics (1) is required, and the initial policy has to be admissible in order to guarantee closed-loop stability in each iteration. In the subsequent sections, an online integral VI algorithm is developed that solves the CAREs (4) and (5) with only partial knowledge of the system dynamics and relaxes the admissibility requirement on the initial policy.

III. INTEGRAL VI ALGORITHM
In this section, a novel integral VI algorithm is developed to solve Problem 1.

A. INTEGRAL VI ALGORITHM
Consider the system (1) with the performance functions (2). For a stabilizing policy pair, the value function admits the equivalent interval representation

V_i(x_t) = ∫_t^{t+T} e^{−α_i(τ−t)} x^T(τ) Q̄_i x(τ) dτ + e^{−α_i T} V_i(x_{t+T}),   (15)

where Q̄_i = Q_i + K_1^T R_{i1} K_1 + K_2^T R_{i2} K_2. From (15), the integral temporal-difference error for a given policy u_i can be defined as

δ^i_t = ∫_t^{t+T} e^{−α_i(τ−t)} x^T(τ) Q̄_i x(τ) dτ + e^{−α_i T} V^(k)_i(x_{t+T}) − V^(k)_i(x_t).   (16)

To design the TD(0) algorithm, the value-function update can be represented with the learning rates η1 and η2 as

V^(k+1)_i(x_t) = V^(k)_i(x_t) + η_i δ^i_t.   (17)

The learning rates η1 and η2 should be properly designed to guarantee the closed-loop stability and the convergence of the learning process, as discussed later in Section IV.
The next control policy gain is designed by

K^(k+1)_i = R_ii^{-1} B_i^T P^(k+1)_i,   i = 1, 2.   (18)

Note that the value function V^(k)_i(x_t) = x_t^T P^(k)_i x_t is quadratic in the state, so it can be parameterized as V^(k)_i(x_t) = (P̄^(k)_i)^T x̄_t, where x̄_t ∈ R^{n(n+1)/2} represents a column vector of the quadratic monomials of x_t and P̄^(k)_i collects the corresponding entries of P^(k)_i. Then, update rule (17) can be expressed in the parameter form (19). From the definition of δ_t in (16), the term d^k_i in (19) contains a discounted integral. Therefore, to compute d^k_i, the additional dynamics (20) is introduced; during the simulation, note that the initial state of (20) is reset to zero at each interval (t, t + T). Then the integral term in (16) can be calculated as in (21). Inserting (21) into (16), one has the equivalent form (22), and the term d^k_i in (19) can then be expressed accordingly. Therefore, the update rule (19) can be rewritten as (23). Note that in the k-th iteration, only the term P̄^(k+1)_i is unknown. Therefore, the least-squares method is employed to solve for P̄^(k+1)_i by collecting N ≥ n(n + 1)/2 sample points; this requirement ensures that the number of equations is at least the number of unknowns in (23).
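The least-squares step can be illustrated in isolation: given samples of a quadratic value function, the parameter vector over the n(n+1)/2 quadratic monomials recovers the matrix P. The helper names and the example matrix below are hypothetical.

```python
import numpy as np

def quad_features(x):
    """Quadratic basis xbar: [x1^2, x1*x2, ..., xn^2], length n(n+1)/2."""
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

def unpack(pbar, n):
    """Rebuild symmetric P from the weight vector (off-diagonal weights halved,
    since V contains 2*P_ij*x_i*x_j for i < j)."""
    P = np.zeros((n, n)); k = 0
    for i in range(n):
        for j in range(i, n):
            P[i, j] = pbar[k] if i == j else 0.5 * pbar[k]
            P[j, i] = P[i, j]
            k += 1
    return P

rng = np.random.default_rng(0)
n = 3
Ptrue = np.array([[2.0, 0.3, 0.0], [0.3, 1.5, -0.2], [0.0, -0.2, 1.0]])

# Collect N >= n(n+1)/2 samples of V(x) = x^T P x and solve by least squares
N = 12
X = rng.standard_normal((N, n))
Phi = np.vstack([quad_features(x) for x in X])   # N x n(n+1)/2 regressor matrix
v = np.array([x @ Ptrue @ x for x in X])         # sampled value-function data
pbar, *_ = np.linalg.lstsq(Phi, v, rcond=None)
Pest = unpack(pbar, n)
print(np.linalg.norm(Pest - Ptrue))
```

With N = 12 random samples and n(n+1)/2 = 6 unknowns, the regressor is full rank almost surely and the recovery is exact up to round-off.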
Remark 3: It is worthwhile to highlight the role of the discount factor α_i in the discounted NZS games. In Algorithm 1, it is required that the matrix in (24) is Hurwitz. In contrast, in the integral VI algorithm, the initial policy does not need to be admissible: as given in Step 1 of Algorithm 2, it is only required that the matrix in (25) is Hurwitz. A comparison between (24) and (25) indicates that the admissibility requirement on the initial policy is no longer needed for the integral VI algorithm.

Remark 4: Existing results on NZS games are usually obtained without discount factors, as in [7]-[10]. In this paper, the discount factor is allowed to be zero, i.e., the NZS games without discount factors can be viewed as special cases of our formulation.
In optimal control theory, the discount factor in the performance index affects the closed-loop stability and is required to lie within a certain range, as discussed in [19]. To guarantee closed-loop stability, a bound on the discount factor α_i for the NZS games is given in the next theorem.
Theorem 1 (Upper Bound for the Discount Factor α_i): Consider the system (1). Then, the origin of the closed-loop system (1) under the Nash equilibrium policies is asymptotically stable if (26) and (27) hold.
Proof: The CAREs (4) and (5) can be rewritten as (28) and (29). Assume that λ is an eigenvalue of the closed-loop matrix Ā, so that Āx = λx for some nonzero x ∈ C^n. First, for player one, multiplying both sides of (28) by the nonzero vectors x^T and x yields (30) [19]. Using the inequality a² + b² ≥ 2ab and the fact that P*_1 > 0, (30) becomes (31). To guarantee the stability of the closed-loop system, it is required that Re(λ) < 0, i.e., (32); (32) holds if inequality (33) is satisfied. Using the fact that ‖A‖‖B‖ ≥ ‖AB‖, one can obtain the sufficient condition (26) for (33). From the above analysis, condition (26) guarantees the asymptotic stability of the closed-loop system. Similarly, for the discount factor α2, one can obtain (27). This completes the proof.
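A scalar, single-player analogue (hypothetical numbers, not the paper's game) illustrates why the discount factor needs an upper bound: solving the discounted scalar ARE for an unstable plant yields a stabilizing feedback only while α stays below a threshold.

```python
import math

def closed_loop_eig(a, b, q, r, alpha):
    """Scalar discounted ARE: 2*(a - alpha/2)*p - (b^2/r)*p^2 + q = 0.
    Returns the closed-loop eigenvalue a - (b^2/r)*p for the positive root p."""
    ashift = a - alpha / 2.0
    c = b * b / r
    # positive root of c*p^2 - 2*ashift*p - q = 0
    p = (2.0 * ashift + math.sqrt(4.0 * ashift**2 + 4.0 * c * q)) / (2.0 * c)
    return a - c * p

# Hypothetical unstable scalar plant: a = 1, b = q = r = 1
a, b, q, r = 1.0, 1.0, 1.0, 1.0
print(closed_loop_eig(a, b, q, r, 0.0))   # negative: stable for small alpha
print(closed_loop_eig(a, b, q, r, 3.0))   # positive: unstable once alpha is too large
```

For this plant the stability boundary sits at α = 2, consistent with the qualitative message of Theorem 1 that the discount factor must stay below an explicit upper bound.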

B. EQUIVALENT INTEGRAL VI WITH DISCOUNT FACTOR
In this subsection, we give an equivalent formulation, in compact form, of the integral VI algorithm developed in the previous subsection.
Consider the system (1) with the feedback control u^(k)_i = −K^(k)_i x. Then the value-function update (17) admits the equivalent compact matrix form (34), in which the iteration matrix A^(k)_{α_i} depends on the model matrix A.

Algorithm 2 Online Integral VI Algorithm With Discount Factor
3: Stop when the following criterion is satisfied for a specified threshold ε: ‖P^(k+1)_i − P^(k)_i‖ ≤ ε. Otherwise, set k = k + 1 and go to Step 2.

Algorithm 3 Equivalent Integral VI Algorithm With Discount Factor
Otherwise, set k = k + 1 and go to step 2.

Remark 5: Algorithms 2 and 3 are equivalent to each other, but they differ for implementation purposes. As shown in (34), A^(k)_{α_i}, which contains the model knowledge A, is required to calculate P^(k+1)_i; therefore, Algorithm 3 is a model-based algorithm. In contrast, as shown in (23), the value-function parameter P^(k+1)_i is determined by collecting online data instead of model knowledge; therefore, Algorithm 2 is a data-driven algorithm.
Remark 6: As shown in Figures 1(a) and 1(b), classical VI and PI iterate between the value function and the control policy: the value-function update depends on the policy from the previous iteration, and each iteration includes two steps. However, in Algorithm 3, P^(k+1)_i can be determined directly from P^(k)_i using equation (34), as shown in Figure 1(c). That is, the value-function update depends only on the value function itself. Therefore, the integral VI algorithm can be viewed as a simple one-step iteration.
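As a structural analogue of this one-step iteration (not the paper's equation (34)), continuous-time value iteration for a single-player, undiscounted LQR updates the value matrix directly through the Riccati operator, P^(k+1) = P^(k) + η Ric(P^(k)); for a small learning rate this converges to the ARE solution. The system below is a hypothetical double integrator with a known analytic solution.

```python
import numpy as np

# Hypothetical single-player LQR: double integrator, Q = I, R = 1, no discount.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = 1.0

def ric(P):
    """Riccati operator Ric(P) = A^T P + P A + Q - P B R^{-1} B^T P."""
    return A.T @ P + P @ A + Q - P @ B @ B.T @ P / R

# One-step VI: P_{k+1} = P_k + eta * Ric(P_k), started from P_0 = 0.
# Any fixed point satisfies Ric(P) = 0, i.e., the ARE itself.
P, eta = np.zeros((2, 2)), 0.01
for _ in range(20000):
    P = P + eta * ric(P)

# Known analytic ARE solution for this system
Pstar = np.array([[np.sqrt(3.0), 1.0], [1.0, np.sqrt(3.0)]])
print(np.linalg.norm(P - Pstar))
```

Note that no policy is ever computed inside the loop: the value matrix evolves on its own, which is exactly the one-step character Remark 6 attributes to Algorithm 3.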

IV. MAIN RESULTS
In this section, we provide the theoretical analysis of the integral VI algorithm in terms of three aspects: the positive definiteness of the updated cost functions, the stability of the closed-loop system, and the conditions that guarantee monotone convergence.

A. POSITIVE DEFINITENESS OF THE INTEGRAL VI ALGORITHM
In this subsection, the positive definiteness of the iterative value function in the integral VI algorithm is analyzed.
Theorem 2: Suppose that η_i ∈ (0, 1] for i = 1, 2, and that V^(0)_i is positive definite. Then V^(k)_i is positive definite for all k ∈ N.

Proof: Expanding the update rule (17), one can obtain (35). Since V^(k)_i(x_t) is a positive definite function, P^(k)_i is a positive definite matrix. Therefore, the first and last terms in (35) are both positive definite functions. In addition, the second term in (35), a quadratic form in x_t, is also positive definite. Hence V^(k+1)_i is positive definite, and the claim follows by induction. This completes the proof.

B. STABILITY DISCUSSION
In this subsection, the stability analysis of the closed-loop system (1) is given.
Before moving on, the following lemma is required.

Lemma 2: For a symmetric matrix G ∈ M^{n×n} and any nonzero matrices E1, E2, F1, F2 ∈ C^{n×n}, if there exists a constant ε > 0 such that (37) holds, then (36) follows.

Proof: By Young's inequality, for any matrices X and Y and any ε > 0,

X^H Y + Y^H X ≤ ε X^H X + ε^{−1} Y^H Y.

Applying this bound to the pairs (E1, F1) and (E2, F2) gives the two inequalities (39) and (40) for any ε > 0. Inserting (39) and (40) into (37), one obtains (36). This completes the proof.

The next theorem discusses the stability of the closed-loop system when applying the integral VI algorithm.
Theorem 3: Suppose that Ā^(0)_{α_i} is Hurwitz and that η1 and η2 satisfy (41). Then Ā^(k)_{α_i} is Hurwitz for all k ∈ N.

Proof: We prove this theorem by induction. First, by assumption, Ā^(0)_{α_i} is Hurwitz. In the k-th iteration, there exists a positive definite matrix Y^(k)_i ∈ R^{n×n} such that (41) is satisfied. Next, we show that the matrix Ā^(k+1)_{α_i} is Hurwitz by finding a sufficient condition for (45). Based on (45) and Lemma 2, the resulting inequality can be guaranteed by (46). Note that (46) is a matrix inequality quadratic in the variable ε_i. To transform it into a scalar inequality, multiply both sides of (46) by the nonzero vectors x^T and x with x ∈ R^n to obtain (47). Since H1 and H2 are positive definite matrices, Y^(k)_i is also positive definite, and x ≠ 0, the leading coefficient of (47) is positive. Therefore, (47) is a scalar inequality quadratic in ε_i, and the existence condition for ε_i ∈ R+ can be determined from its discriminant, which yields (48) and thereby ensures the existence of ε_i > 0 in (47). Finally, the requirement on η1 and η2 that guarantees Ā^(k+1)_{α_i} is Hurwitz can be summarized as in (48). This completes the proof.

C. CONVERGENCE ANALYSIS
In this subsection, the effect of the learning rates η_i on the convergence of the integral VI algorithm is discussed.
Theorem 4: Define the following parameters. Then, the following propositions hold. a) Considering (6) and (34), the matrix recursive equations (49) and (50) hold for P^(k+1)_1 and P^(k+1)_2. b) For player one, if the learning rate η1 satisfies η1 ∈ (0, 1], and (48) together with Condition 1, i.e., (51)-(53), holds, then (54) holds for every k ∈ N. For player two, if η2 ∈ (0, 1], and (48) together with Condition 2 holds, then (55) holds for every k ∈ N. c) If the learning rates η1 and η2 do not vanish as k → ∞, the pair (P^(k)_1, P^(k)_2) converges to the solution of the CAREs.

Proof: a) Applying (56) to the Riccati operator representation (6) yields (57). The terms in (57) involving Ā^(k)_{α_i} can be written as (58). Finally, inserting (58) into (57) yields (49); similarly, for player two, one obtains (50). b) When the learning rate η1 satisfies Condition 1 and (48), combining (49) from proposition a) with the properties of the matrix norm yields (59). In order for the bound to hold for each k ∈ N, a sufficient condition can be selected as ψ11 ≤ 1 in (59), i.e., (60). To guarantee the existence of a solution to this quadratic inequality, its discriminant must be nonnegative, which yields the conditions in Condition 1. c) The sequence ‖Ric_{α_i}(P^(k)_1, P^(k)_2)‖ satisfies ‖Ric_{α_i}(P^(k)_1, P^(k)_2)‖ ≥ 0 for all k ∈ N, i.e., it is bounded below by zero. In addition, from proposition b), the sequence is monotonically decreasing. Therefore, Ric_{α_i}(P^(k)_1, P^(k)_2) → 0 as k → ∞. In addition, from (34), the update involves an exponential factor; since the exponential function cannot be zero, the limit satisfies Ric_{α_i}(P^(∞)_1, P^(∞)_2) = 0. That is, (P^(k)_1, P^(k)_2) converges to the solution (P*_1, P*_2) of the CAREs. This completes the proof.
Remark 7: As the learning rates η1 and η2 increase, the convergence of the integral VI algorithm becomes faster; when the learning rate is sufficiently large, the integral VI algorithm outperforms the PI algorithm. To guarantee the positive definiteness in Theorem 2, the maximum value of η1 and η2 cannot exceed 1.
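The learning-rate effect in Remark 7 can be seen in a scalar toy problem (hypothetical, single-player): iterating p_{k+1} = p_k + η(q + 2ap_k − p_k² b²/r) with a = 0 and b = q = r = 1, whose fixed point is p* = 1, the larger of two admissible learning rates reaches the tolerance in fewer iterations.

```python
def iterations_to_converge(eta, tol=1e-6, max_iter=10000):
    """Scalar one-step VI: p <- p + eta * Ric(p) with Ric(p) = 1 - p^2
    (a = 0, b = q = r = 1, fixed point p* = 1). Returns iteration count."""
    p, k = 0.0, 0
    while abs(1.0 - p) > tol and k < max_iter:
        p += eta * (1.0 - p * p)
        k += 1
    return k

slow = iterations_to_converge(0.2)
fast = iterations_to_converge(0.5)
print(slow, fast)   # the larger learning rate needs fewer iterations here
```

Note that in this toy problem the speed-up need not keep growing all the way to η = 1; the comparison only illustrates that, within the admissible range, a larger rate can converge in fewer steps, as Remark 7 states for the matrix case.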
Remark 8: For player one, the learning rate η1 needs to satisfy (51)-(53) in Condition 1, which contain the parameter ρ, and ρ is affected by η2. Hence, the learning rate η1 is not independent of η2. Similarly, for player two, the learning rate η2 also depends on η1.

V. SIMULATION STUDY
Here we present simulations of NZS differential games for linear systems. The games are solved by the integral VI method and, as a reference, by the Lyapunov iteration method, in order to verify the effectiveness of the proposed method for the NZS differential games.
Consider the two-player continuous-time linear system given in [38]. The integral VI algorithm is implemented with T = 0.5, and the threshold of the stopping criterion is selected as ε = 10^−8. The initial matrices P^(0)_1 and P^(0)_2 are selected for the learning process. Figures 2 and 4 present the evolution of the parameters of the value functions for players one and two during the learning process when the learning rates are selected as η1 = η2 = 0.7 and η1 = η2 = 1.0, respectively. It can be seen that the learning algorithm converges within 5 steps. Moreover, to investigate the convergence of the integral VI algorithm to the solution of the CAREs (4) and (5), the differences between (P^(k)_1, P^(k)_2) and (P*_1, P*_2) are shown in Figures 3 and 5, respectively. One can observe that the value functions of both players converge to (P*_1, P*_2). A comparison between the integral VI algorithm with different learning rates and the PI algorithm is shown in Figure 6. First, a larger learning rate results in faster convergence, i.e., the case η_i = 1.0 converges to the optimal solution faster than the case η_i = 0.7. Second, when the learning rate of the integral VI algorithm is sufficiently large, the integral VI algorithm outperforms the PI algorithm, as shown in Figure 6.

VI. CONCLUSIONS
In this paper, an integral VI algorithm is proposed to find the Nash equilibrium of NZS games. The presented integral VI algorithm is implemented using online data to obviate the requirement of the drift dynamics. First, the reward function in the NZS games contains a discount factor, which is required to lie within a given range to guarantee closed-loop stability. Second, compared with existing RL algorithms with fixed convergence speed, the convergence speed of the integral VI method can be tuned by the learning rate. Moreover, as discussed in Section IV, additional conditions on the learning rate are imposed to guarantee the positive definiteness of the iterative value function, the closed-loop stability during learning, and the convergence of the integral VI algorithm to the solution of the CAREs. Simulation examples demonstrate the effectiveness of the presented integral VI algorithm.