An Enhanced Model-Free Reinforcement Learning Algorithm to Solve Nash Equilibrium for Multi-Agent Cooperative Game Systems

Solving the Nash equilibrium is important for multi-agent game systems, and the speed of reaching the Nash equilibrium is critical for agents to make real-time decisions. A typical scheme is the model-free reinforcement learning algorithm based on policy iteration, which is slow because each iteration is evaluated from the start state to the end state. In this paper, we propose a faster scheme based on value iteration, using the Q-function in an online manner to solve the Nash equilibrium of the system. Since each calculation is based on the value from the previous iteration, the convergence speed of the proposed scheme is much faster than that of policy iteration. The rationality and convergence of this scheme are analyzed and proved theoretically. An actor-critic network structure is used to implement this scheme in simulation. The simulation results show that the convergence speed of our proposed scheme is about 10 times faster than that of the policy iteration algorithm.


I. INTRODUCTION
MULTI-AGENT consensus research involves the knowledge, goals, skills, and planning needed to enable agents to take coordinated actions to solve problems. It mainly studies how to design algorithms that drive all agents from a specified state to an optimal control policy by independently optimizing each agent's performance index based on the Hamilton-Jacobi-Bellman equation, whose result is the Nash equilibrium solution [1]-[3]. Game theory can therefore provide a framework for multi-agent consensus research [4], [5]. References [6], [7] proposed the concept of cooperative Nash equilibrium, under which the dynamics and value function of each agent depend only on the actions of the agent and its neighbors, and the graphical game provides Nash equilibrium solutions among neighbors.
Research on the consensus problem began in the 1980s. Traditional methods such as natural animal group models, the Boid model, and the Vicsek model are inspired by the movement rules of nature [8]-[10]. In recent years, reinforcement learning (RL) has been one of the areas attracting the most research and development interest. RL maps the relationship between the learning state and the behavior of agents, addressing how agents choose their behavior in a dynamic environment to optimize the sum of cumulative rewards [11]-[13]. Many algorithmic ideas from RL can be applied to the consensus research of multi-agent systems [7], [14], [15]. Multi-agent RL algorithms can be model-based or model-free [16], [17], where models are used to represent environments. Normally, models can be very helpful for an agent to deal with various situations and choose the best action [18]. However, the model-building process requires a lot of information and time [19]. If the environment is unknown, the possible successor states of the current state cannot be known, so the best action to take in the next state cannot be determined. Therefore, model-free algorithms are an important research direction for multi-agent systems operating in unknown environments [20], [21]. Such algorithms include those proposed in references [16], [22], which solve the related Nash equilibrium based on policy iteration, working as follows. Starting from an initial policy, the algorithm evaluates and improves the policy, then further evaluates and improves the improved policy; after continuous iteration and updating, the policy is optimized. Since the evaluation of the policy in each iteration is calculated from the start state to the end state, it takes a long time to obtain the best policy and optimal value when the state space is large [11]. The policy iteration algorithm proposed in reference [23] has relatively strong learning ability by adding a sub-iteration for iterative performance index functions.
In this paper, we propose a value iteration algorithm to solve the Nash equilibrium for multi-agent game systems by designing a cooperative RL algorithm that jointly uses the Q-function in an online manner. The working steps of value iteration are basically the same as those of policy iteration. However, because the policy is adjusted as the state value iterates, the number of iteration steps is greatly reduced and the convergence speed is improved. Value iteration guarantees that the value, rather than the policy, is optimal at each step. In addition, policy iteration requires an admissible initial policy to ensure convergence, whereas value iteration does not [11]. The rationality and convergence of this algorithm are analyzed and proved theoretically. An actor-critic network structure, consisting of an actor network and a critic network, is used to implement the algorithm. The actor network uses the policy function to generate actions and interact with the environment. The critic network uses the value function to evaluate the performance of the actor network and then guides the actor's action in the next stage. The simulation results with MATLAB show that the convergence speed of the proposed algorithm is about 10 times faster than that of the policy iteration algorithm.
In comparison with the model-based algorithms proposed in [1], [6], the complexity of our model-free algorithm is relatively low, with no requirement for knowledge of the system dynamics. The simulation results in Section VI compare the convergence speeds of these algorithms.
The remainder of this paper is organized as follows. Section II describes the synchronization control problem of the multi-agent system on the graph. Section III introduces some results about optimal control used to solve the dynamic graphical games proposed in this paper. Section IV discusses the existing Nash solution and the best response for the graphical games. Section V proposes an enhanced RL algorithm, which is implemented by a simulation study in Section VI. The paper is concluded in Section VII.

II. MULTI-AGENT GRAPHS AND SYNCHRONIZATION
This section introduces the multi-agent graphs and the synchronization problem to be addressed in this paper. Some existing results used here and henceforth are summarized in Section III. The definitions of the main notations used in the discussion are listed in Table 1.

A. MULTI-AGENT GRAPHS
The multi-agent graph $Gr = (V, \omega)$ here is a directed graph with a set of agents or nodes $V = \{v_1, \ldots, v_N\}$ and a set of edges $\omega \subseteq V \times V$. The connectivity matrix $E = [e_{ij}]$ is defined by the edges between agents: if $(v_j, v_i) \in \omega$, then $e_{ij} > 0$; otherwise $e_{ij} = 0$. The set of neighbors of node $i$ is defined as $N_i = \{v_j : e_{ij} > 0\}$.
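As a concrete illustration (a hypothetical 3-follower digraph with illustrative edge weights, not the paper's example), the connectivity matrix, graph Laplacian, and neighbor sets can be sketched as follows:

```python
import numpy as np

# Hypothetical 3-follower digraph (illustrative values only): e_ij > 0
# iff there is an edge (v_j, v_i), i.e. node i receives information from j.
E = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# In-degree matrix D and graph Laplacian L = D - E, used later in the
# Kronecker-product form of the tracking error.
D = np.diag(E.sum(axis=1))
L = D - E

def neighbors(E, i):
    """Neighbors of node i: nodes j with e_ij > 0."""
    return [j for j in range(E.shape[0]) if E[i, j] > 0]
```

Each row of `L` sums to zero by construction, a property used when relating the local tracking errors to the global synchronization error.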

B. STATE SYNCHRONIZATION AND ERRORS
A communication graph is composed of $N$ follower agents. The local dynamic system of agent $i$ is defined as
$$x_{i(k+1)} = A\,x_{ik} + B_i\,u_{ik}, \quad (1)$$
where $x_{ik} \in \mathbb{R}^n$ is the state of agent $i$ and $u_{ik}$ is its control input. Consider a leader node $v_0$ that has command generator dynamics $x_0(k) \in \mathbb{R}^n$, given by
$$x_{0(k+1)} = A\,x_{0k}. \quad (2)$$
The consensus problem for multi-agent systems is to synchronize the states of the follower and leader agents by designing the inputs $u_i(k)$ and combining them with the information from neighbor agents.
To study the synchronization problem on graphs, we define the local neighborhood tracking error $\varepsilon_i(k) \in \mathbb{R}^n$ for follower agent $i$ as
$$\varepsilon_i(k) = \sum_{j \in N_i} e_{ij}\,\big(x_j(k) - x_i(k)\big) + g_i\,\big(x_0(k) - x_i(k)\big), \quad (3)$$
where $g_i \ge 0$ is the pinning gain between the leader agent and follower agent $i$; if node $i$ has a connection to the leader node, then $g_i > 0$ [6].
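A short sketch of this definition (assuming the standard form $\varepsilon_i = \sum_{j \in N_i} e_{ij}(x_j - x_i) + g_i(x_0 - x_i)$, which is consistent with the Kronecker expression $\varepsilon = -((L+G)\otimes I_n)\,\eta$ used below):

```python
import numpy as np

def local_tracking_error(x, x0, E, g, i):
    """Neighborhood tracking error for follower i (assumed form):
    eps_i = sum_j e_ij (x_j - x_i) + g_i (x_0 - x_i)."""
    err = g[i] * (x0 - x[i])
    for j in range(len(x)):
        err = err + E[i, j] * (x[j] - x[i])
    return err
```

Stacking the local errors for all followers reproduces the global relation $\varepsilon = -((L+G)\otimes I_n)\,\eta$, where $\eta$ stacks the per-agent synchronization errors $x_i - x_0$.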
The overall tracking error vector is written as $\varepsilon(k) = [\varepsilon_1^T(k), \ldots, \varepsilon_N^T(k)]^T$, where $x(k) = [x_1^T(k), \ldots, x_N^T(k)]^T$ is the global node state vector and $\eta(k)$ is the global tracking error vector. So (4) can be written as $\varepsilon(k) = -\big((L+G)\otimes I_n\big)\,\eta(k)$.
The synchronization error vector is written as $\eta(k) = x(k) - \bar{x}_0(k)$, where $\bar{x}_0 = \bar{I}\,x_0$, $\bar{I} = \underline{1} \otimes I_n$, and $\underline{1}$ is the $N$-vector of ones.
If the graph contains a spanning tree and $g_i \neq 0$ for the root node, then $(L + G)$ is nonsingular, so
$$\|\eta(k)\| \le \|\varepsilon(k)\| / \underline{\sigma}(L+G), \quad (6)$$
where $\underline{\sigma}(L+G)$ is the minimum singular value of $(L+G)$. For ease of identification, when the time index $k$ is clear, $x_i(k)$ is written as $x_{ik}$. This result shows that the synchronization error vector can be made arbitrarily small by making the local neighborhood tracking errors small, whose dynamics for node $i$ are given by
$$\varepsilon_{i(k+1)} = A\,\varepsilon_{ik} - (d_i + g_i)\,B_i u_{ik} + \sum_{j \in N_i} e_{ij}\,B_j u_{jk}, \quad (7)$$
where $d_i = \sum_j e_{ij}$ is the in-degree of node $i$. Therefore, the task of solving the Nash equilibrium of the multi-agent system is to minimize the local neighborhood tracking error given by (7) [1], [6]. Game theory and graph theory are used to explain the responses of agent $i$ to other agents in multi-agent systems. To define a dynamic graphical game, the control inputs of the neighbors of agent $i$ are defined as
$$u_{-ik} = \{u_{jk} : j \in N_i\}. \quad (8)$$
Then the local performance index (value function) for agent $i$ can be written as
$$V_i^{\pi}(\varepsilon_{ik}) = \sum_{l=k}^{\infty} U_i(\varepsilon_{il}, u_{il}, u_{-il}), \quad (9)$$
with the initial condition $V_i^{\pi}(0) = 0$, where $U_i(\varepsilon_{ik}, u_{ik}, u_{-ik})$ is the utility function for agent $i$, written as
$$U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}) = \tfrac{1}{2}\Big(\varepsilon_{ik}^T Q_{ii}\,\varepsilon_{ik} + u_{ik}^T R_{ii}\,u_{ik} + \sum_{j \in N_i} u_{jk}^T R_{ij}\,u_{jk}\Big). \quad (10)$$
Remark 1: Given the policies, (9) captures the local information given by (7) for agent $i$. The solution will be given in terms of the local neighborhood tracking error (7).

III. THE RELATED WORKS USED IN THE PROPOSAL
This section introduces some results of optimal control used to solve the dynamic graphical game proposed in this paper [6].

A. BELLMAN FUNCTION
The Bellman function is defined as
$$V_i^{\pi}(\varepsilon_{ik}) = U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}) + V_i^{\pi}(\varepsilon_{i(k+1)}). \quad (11)$$
The difference of the value function $V_i^{\pi}(\varepsilon_{ik})$ is defined as
$$\Delta V_i^{\pi} = V_i^{\pi}(\varepsilon_{i(k+1)}) - V_i^{\pi}(\varepsilon_{ik}) = -U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}),$$
and its gradient is $\nabla V_i^{\pi} = \partial V_i^{\pi}(\varepsilon_{ik}) / \partial \varepsilon_{ik}$. The goal of the multi-agent graphical game is to find the optimal value for agent $i$ as the minimum of the value function. Given stationary admissible policies for the neighbors of agent $i$, applying the Bellman optimality principle yields
$$V_i^{*}(\varepsilon_{ik}) = \min_{u_{ik}} \big[\, U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}) + V_i^{*}(\varepsilon_{i(k+1)}) \,\big]. \quad (13)$$
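For intuition, the Bellman optimality backup can be iterated on a toy single-agent problem (scalar linear dynamics with quadratic utility — an assumption for illustration, not the paper's game). The quadratic value kernel $p$ in $V(x) = p\,x^2$ then converges to the fixed point of the optimality equation:

```python
import numpy as np

# Toy scalar system x_{k+1} = a x + b u with utility q x^2 + r u^2
# (all values assumed for illustration).
a, b, q, r = 0.9, 1.0, 1.0, 1.0

def bellman_backup(p):
    # Minimize q x^2 + r u^2 + p (a x + b u)^2 over u in closed form,
    # which gives the scalar Riccati-type recursion for the value kernel.
    return q + p * a**2 - (p * a * b) ** 2 / (r + p * b**2)

p = 0.0
for _ in range(200):
    p = bellman_backup(p)
# p now approximates the fixed point of the Bellman optimality equation.
```

The converged kernel satisfies $p = \text{backup}(p)$, i.e. the Bellman optimality equation for this toy problem.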

B. HAMILTONIAN FUNCTION FOR DYNAMIC GRAPHICAL GAMES
According to the error dynamics (7) and the performance index (9), the Hamiltonian function of agent $i$ can be defined as
$$H_i(\varepsilon_{ik}, \lambda_{i(k+1)}, u_{ik}, u_{-ik}) = U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}) + \lambda_{i(k+1)}^T \varepsilon_{i(k+1)} - \lambda_{ik}^T \varepsilon_{ik}, \quad (15)$$
where $\lambda_{ik}$ is the Lagrange multiplier vector. According to [24], the Lagrange multiplier vector for optimal control can be written as $\lambda_{ik} = \partial V_i^{*}(\varepsilon_{ik}) / \partial \varepsilon_{ik}$. The optimal policy based on the Hamiltonian function is then given by applying the stationarity condition $\partial H_i / \partial u_{ik} = 0$, such that
$$u_{ik}^{*} = (d_i + g_i)\,R_{ii}^{-1} B_i^T \lambda_{i(k+1)}.$$
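As a sketch of the stationarity step (assuming the error dynamics $\varepsilon_{i(k+1)} = A\varepsilon_{ik} - (d_i+g_i)B_i u_{ik} + \sum_{j \in N_i} e_{ij} B_j u_{jk}$ and a quadratic utility with control weight $R_{ii}$):

```latex
\begin{aligned}
\frac{\partial H_i}{\partial u_{ik}}
  &= \frac{\partial U_i}{\partial u_{ik}}
   + \left(\frac{\partial \varepsilon_{i(k+1)}}{\partial u_{ik}}\right)^{T}\lambda_{i(k+1)}
   = R_{ii}\,u_{ik} - (d_i + g_i)\,B_i^{T}\lambda_{i(k+1)} = 0, \\
u_{ik}^{*} &= (d_i + g_i)\,R_{ii}^{-1}B_i^{T}\lambda_{i(k+1)},
\qquad \lambda_{i(k+1)} = \frac{\partial V_i^{*}}{\partial \varepsilon_{i(k+1)}}.
\end{aligned}
```

The sign of the control term follows from the $-(d_i+g_i)B_i u_{ik}$ term in the error dynamics.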

C. BELLMAN FUNCTION BASED ON Q-FUNCTION
The Q-function of agent $i$ is defined as
$$Q_i^{\pi}(\varepsilon_{ik}, u_{ik}, u_{-ik}) = U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}) + V_i^{\pi}(\varepsilon_{i(k+1)}), \quad (17)$$
where $V_i^{\pi}$ is the value function defined in (9). For the multi-agent graphical game optimization problem, the objective is to find the optimal value
$$Q_i^{*}(\varepsilon_{ik}, u_{ik}^{*}, u_{-ik}^{*}) = \min_{u_{ik}} Q_i^{\pi}(\varepsilon_{ik}, u_{ik}, u_{-ik}^{*}),$$
where $u_{-ik}^{*}$ denotes the optimal policies among the neighboring policies. Since the policies $\pi_{jk}$ ($j \in N_i$) are stationary and admissible, there exists a best response Bellman equation. Note that $Q_i^{\pi}(\varepsilon_{ik}, u_{ik}, u_{-ik}) = V_i^{\pi}(\varepsilon_{ik})$ when $u_{ik} = \pi_{ik}$.

So the best response Bellman function is defined as
$$Q_i^{\pi}(\varepsilon_{ik}, u_{ik}, u_{-ik}) = U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}) + Q_i^{\pi}(\varepsilon_{i(k+1)}, u_{i(k+1)}, u_{-i(k+1)}), \quad (19)$$
with the initial condition $Q_i^{\pi}(0) = 0$, $\forall i$. The difference of the Q-function is defined as
$$\Delta Q_i^{\pi} = Q_i^{\pi}(\varepsilon_{i(k+1)}, u_{i(k+1)}, u_{-i(k+1)}) - Q_i^{\pi}(\varepsilon_{ik}, u_{ik}, u_{-ik}) = -U_i(\varepsilon_{ik}, u_{ik}, u_{-ik}),$$
and its gradient is $\partial Q_i^{\pi} / \partial u_{ik}$. Setting this gradient to zero yields the optimal control policy of agent $i$,
$$u_{ik}^{*} = \arg\min_{u_{ik}} Q_i^{\pi}(\varepsilon_{ik}, u_{ik}, u_{-ik}),$$
which is the same as the best policy mentioned in (14). So the best response Bellman function based on the Q-function is defined as
$$Q_i^{*}(\varepsilon_{ik}, u_{ik}^{*}, u_{-ik}^{*}) = U_i(\varepsilon_{ik}, u_{ik}^{*}, u_{-ik}^{*}) + Q_i^{*}(\varepsilon_{i(k+1)}, u_{i(k+1)}^{*}, u_{-i(k+1)}^{*}).$$
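The Q-function Bellman recursion can be made concrete on a toy deterministic problem (a 5-state chain with unit step cost — an assumption for illustration, unrelated to the graphical game):

```python
import numpy as np

# Toy deterministic 5-state chain: action 0 moves left, 1 moves right;
# cost 1 per step until the absorbing goal state 4 (zero cost there).
N_STATES, GOAL = 5, 4

def step(s, a):
    if s == GOAL:
        return s, 0.0
    return (max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)), 1.0

Q = np.zeros((N_STATES, 2))
for _ in range(50):
    Q_new = np.zeros_like(Q)
    for s in range(N_STATES):
        for a in range(2):
            s2, cost = step(s, a)
            # Q-function Bellman backup: utility plus value of the successor.
            Q_new[s, a] = cost + Q[s2].min()
    Q = Q_new

V = Q.min(axis=1)  # optimal cost-to-go recovered from the Q-function
```

Here the recursion "utility plus Q of the successor under the greedy action" converges to the optimal cost-to-go, mirroring the role of (19) for the graphical game.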

D. COUPLED HAMILTON-JACOBI-BELLMAN EQUATIONS
According to [25], the Hamilton-Jacobi theory is used to relate the Hamiltonian function and the Bellman equation.
The discrete-time Hamilton-Jacobi (DTHJ) equation relates the optimal value function to the Hamiltonian along the optimal trajectories. Combining it with the Q-function, we obtain the Q-function form of the DTHJ equation (26). According to the Hamiltonian function, the Lagrange multiplier vector for the Q-function can be written as $\lambda_{ik} = \partial Q_i^{*}(\varepsilon_{ik}, u_{ik}^{*}, u_{-ik}^{*}) / \partial \varepsilon_{ik}$. The following relates the Hamiltonian (15) along the optimal trajectories and (26), with the initial condition $Q_i^{*}(0) = 0$ and the best policy given by the stationarity condition. The proof is given by Theorem 1 in [25].

IV. NASH SOLUTION FOR THE DYNAMIC GRAPHICAL GAME
At a Nash equilibrium, no agent can improve its expected return by unilaterally changing its policy while all other agents keep theirs [6], [25]. The objective of solving the Nash equilibrium of the dynamic graphical game is to solve (13), leading to (26).

A. NASH EQUILIBRIUM FOR THE GRAPHICAL GAMES
A Nash equilibrium solution for the game is given with respect to the actions of the other players $u_{-i} = \{u_j \mid j \in N, j \neq i\}$ in the graph. Definition 2: An $N$-player game with the $N$-tuple of optimal control policies $\{u_1^{*}, u_2^{*}, \ldots, u_N^{*}\}$ is said to have a Nash equilibrium solution if all agents $\forall i \in N$ satisfy
$$J_i^{*} = J_i(u_i^{*}, u_{-i}^{*}) \le J_i(u_i, u_{-i}^{*}).$$
So the Nash equilibrium condition for the Q-function can be written as
$$Q_i^{*}(\varepsilon_{ik}, u_{ik}^{*}, u_{-ik}^{*}) \le Q_i^{*}(\varepsilon_{ik}, u_{ik}, u_{-ik}^{*}),$$
where $Q_i$ is the local performance index in Q-function form [6], [25].
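The inequality in Definition 2 can be checked directly on a small example (a hypothetical 2-player cost matrix game, not the paper's graphical game):

```python
import numpy as np

# C1[a1][a2] / C2[a1][a2] are the costs of players 1 and 2 when player 1
# plays a1 and player 2 plays a2 (illustrative values only).
C1 = np.array([[1.0, 3.0],
               [2.0, 4.0]])
C2 = np.array([[1.0, 2.0],
               [3.0, 4.0]])

def is_nash(a1, a2):
    # Nash condition: neither player can lower its own cost by a
    # unilateral deviation while the other keeps its action fixed.
    p1_ok = all(C1[a1, a2] <= C1[b, a2] for b in range(C1.shape[0]))
    p2_ok = all(C2[a1, a2] <= C2[a1, b] for b in range(C2.shape[1]))
    return p1_ok and p2_ok
```

For these cost matrices the action pair (0, 0) is a Nash equilibrium, while (1, 1) is not, since either player could deviate to action 0 and pay less.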

B. STABILITY AND NASH SOLUTION FOR THE GRAPHICAL GAMES
A stable Nash solution for the dynamic graphical game is shown to be equivalent to the solution of (26). Definition 3 (Stability and Nash equilibrium solution): Let $Q_i^{\pi*}(\varepsilon_{ik}, u_{ik}^{*}) > 0$, $Q_i^{\pi*} \in C^2$, $\forall i$, satisfy the Bellman optimality equation, and let all agents use the policies given by (26). Then:
a) The error dynamics (7) is asymptotically stable and all agents synchronize to the leader's dynamics (2).
b) The optimal performance index for agent $i$ is $J_i^{*}$ [25]. The proof is given by Theorem 2 in [25].

C. BEST RESPONSE SOLUTION OF DYNAMIC GRAPHICAL GAMES
Consider that the neighbor policies $u_{-i} = \{u_j : j \in N_i\}$ are fixed. The best response Bellman equation based on the Q-function for agent $i$ can then be defined as in (32), with the initial condition $Q_{ik}^{0}$ given by (17).
The best response Hamilton-Jacobi (HJ) equation is defined as in (33), with the initial condition $Q_i^{\pi*}(0, 0) = 0$ and $u_{ik} = u_{ik}^{*}$ given by (29).
The following lemma shows the relation between (32) and (33). Lemma 1: The best response HJ equation (33) is equivalent to the best response Bellman equation (32) with the initial condition $Q_i^{\pi*}(0, 0) = 0$, and the optimal control policy is given by (29).
The proof of Lemma 1 is given in the Appendix.
b) The optimal performance index for agent $i$ is $J_i^{*}$. c) All agents are in Nash equilibrium. The proof of Theorem 1 is given in the Appendix.

V. THE PROPOSED VALUE ITERATION ALGORITHM
This section proposes a value iteration algorithm to solve discrete-time dynamic graphical games in real time using a cooperative multi-agent RL algorithm. The single-agent RL algorithm is extended to solve multi-agent dynamic graphical games, with a simplified quadratic form of the Q-function value, which consists of the control input and the state of the agent as well as the states of its neighbors:
$$\tilde{Q}_i(\varepsilon_{ik}, u_{ik}) = \tfrac{1}{2}\, z_{ik}^T \bar{H}_i\, z_{ik}, \qquad z_{ik} = \big[\varepsilon_{ik}^T \;\; \varepsilon_{-ik}^T \;\; u_{ik}^T\big]^T, \quad (34)$$
where, for agent $i$, $\bar{H}_i$ is the solution matrix and $\bar{H}_i(u_{ik}, \varepsilon_{yk})$ denotes a sub-block of $\bar{H}_i$. The diagonal blocks of $\bar{H}_i$ weight the agent's own state, and $\bar{H}_i(u_{ik}, u_{ik})$, $\bar{H}_i(\varepsilon_{yk}, \varepsilon_{ik})$, and $\bar{H}_i(u_{ik}, \varepsilon_{yk})$ are the weighting matrices of the agent's policy, the neighbor agents, and the policies of the neighbor agents, respectively.
The optimal control policy is obtained by setting the partial derivative to zero, i.e., $\partial \tilde{Q}_i / \partial u_{ik} = 0$, which yields
$$u_{ik}^{*} = -\bar{H}_i(u_{ik}, u_{ik})^{-1}\, \bar{H}_i(u_{ik}, \varepsilon_{yk})\, \varepsilon_{yk}.$$
Algorithm 1 is proposed to solve (36) for the multi-agent system.
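A minimal sketch of extracting the greedy policy from a quadratic Q-kernel (block names and the partition point are assumptions): with $Q(z) = \tfrac{1}{2} z^T H z$ and $z = [\varepsilon;\, u]$, setting $\partial Q/\partial u = 0$ gives $H_{u\varepsilon}\,\varepsilon + H_{uu}\,u = 0$, i.e. $u^{*} = -H_{uu}^{-1} H_{u\varepsilon}\,\varepsilon$:

```python
import numpy as np

def greedy_policy(H, n_eps):
    """Feedback gain K with u* = K eps, extracted from the quadratic
    kernel H of Q(z) = 0.5 z^T H z, z = [eps; u] (assumed partition:
    the first n_eps entries of z are the error state)."""
    H_ue = H[n_eps:, :n_eps]   # control/state cross block
    H_uu = H[n_eps:, n_eps:]   # control/control block
    return -np.linalg.solve(H_uu, H_ue)
```

For any symmetric positive definite kernel, the resulting `u* = K @ eps` satisfies the stationarity condition exactly.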

Algorithm 1 The Proposed Value Iteration Algorithm
1: Begin from a random initial policy $u_{ik}^{0}$ and value $\tilde{Q}_i^{0}(\varepsilon_{ik}, u_{ik}^{0})$. 2: Start acting with the policy $u_{ik}^{0}$ and value $\tilde{Q}_i^{0}(\varepsilon_{ik}, u_{ik}^{0})$. 3: Obtain the local error state vector $\varepsilon_{i(k+1)}$ at the next moment with (7), and solve for $\tilde{Q}_i^{l+1}$ by using
$$\tilde{Q}_i^{l+1}(\varepsilon_{ik}, u_{ik}^{l}) = U_i(\varepsilon_{ik}, u_{ik}^{l}, u_{-ik}^{l}) + \tilde{Q}_i^{l}(\varepsilon_{i(k+1)}, u_{i(k+1)}^{l}). \quad (35)$$
4: Update the policy $u_{ik}^{l+1}$ by using
$$u_{ik}^{l+1} = \arg\min_{u} \tilde{Q}_i^{l+1}(\varepsilon_{ik}, u). \quad (36)$$
5: Repeat steps 3 and 4 until convergence.

Remark 2: The algorithm does not require the knowledge of any of the agents' dynamics in the system (7).
Remark 3: With (34) and (35), (36) enables Algorithm 1 to solve dynamic graphical games for both directed and undirected graph topologies [25].
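The value-update/policy-update pattern of Algorithm 1 can be sketched for a single scalar agent (dynamics and weights assumed for illustration; the paper's multi-agent, model-free version estimates the kernel from measured data rather than from a and b):

```python
import numpy as np

# Assumed scalar system x_{k+1} = a x + b u, utility q x^2 + r u^2.
a, b, q, r = 0.95, 1.0, 1.0, 0.5

H = np.zeros((2, 2))              # quadratic Q-kernel: Q(x, u) = [x u] H [x u]^T
for _ in range(300):
    # Policy update (Algorithm 1, step 4): greedy action from the kernel.
    K = 0.0 if H[1, 1] == 0 else -H[1, 0] / H[1, 1]
    # Value of the successor state under the greedy policy, V(x) = P x^2.
    P = np.array([1.0, K]) @ H @ np.array([1.0, K])
    # Value update (Algorithm 1, step 3): Q^{l+1} = utility + V^l(successor).
    H = np.array([[q + P * a**2, P * a * b],
                  [P * a * b,    r + P * b**2]])

K = -H[1, 0] / H[1, 1]            # converged feedback gain, u = K x
```

At convergence the kernel satisfies the scalar Riccati-type fixed point and the resulting closed loop is stable.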
Remark 4: The main differences between the algorithm proposed here and that in [6] are the value function and the way the policies are updated; the updating policy of the algorithm in [6] converges to (28).

c) Since Algorithm 1 is a value iteration algorithm, its convergence speed is faster than that of the policy iteration.
Proof: a) When the neighbor policies are fixed, the difference of the Q-function (37) yields, for all $i$ and $l$,

Rearranging (40) and using the stability of the policies shows that, for all policies $u_{ik}^{l}$, the value functions $\tilde{Q}_{ik}^{l}$, $\forall i$, form a monotonically increasing sequence. From (19), $\tilde{Q}_{ik}^{l}$, $\forall i$, is bounded by $\tilde{Q}_i^{l+1}$, which means that $\tilde{Q}_{ik}^{l}$, $\forall i$, has an upper bound. So Algorithm 1 converges to (32), and the limit satisfies Theorem 1 in Section IV. b) For the second case, after manipulation we again obtain, for all policies $u_{ik}^{l}$, a monotonically increasing sequence of value functions $\tilde{Q}_{ik}^{l}$, $\forall i$. From (19), $\tilde{Q}_{ik}^{l}$, $\forall i$, is bounded by $\tilde{Q}_i^{l+1}$ with an upper bound, so Algorithm 1 converges to the best response Bellman function (25), and the limit satisfies Theorem 1 in Section IV. c) With the policy iteration algorithm in [25], each iteration improves the policy and calculates the value. However, each value is calculated from the start state to the end state rather than from the previous value. With our proposed algorithm, the calculation of the value is based on the value from the last iteration, so it converges faster than policy iteration.
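The speed argument in part c) can be illustrated on a toy discounted chain problem (an assumption for illustration): value iteration reuses the previous value in each backup, while policy iteration re-evaluates each policy from scratch and therefore performs many more backups:

```python
import numpy as np

GAMMA, N, GOAL = 0.9, 5, 4   # toy discounted chain MDP (illustrative values)

def step(s, a):
    # Deterministic transitions: 0 moves left, 1 moves right; unit cost
    # everywhere except the absorbing goal state.
    if s == GOAL:
        return s, 0.0
    return (max(0, s - 1) if a == 0 else min(N - 1, s + 1)), 1.0

def value_iteration(tol=1e-10):
    V, backups = np.zeros(N), 0
    while True:
        V_new = np.array([min(step(s, a)[1] + GAMMA * V[step(s, a)[0]]
                              for a in (0, 1)) for s in range(N)])
        backups += N
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, backups
        V = V_new

def policy_iteration(tol=1e-10):
    pi, backups = np.zeros(N, dtype=int), 0
    while True:
        V = np.zeros(N)
        while True:   # full evaluation of the current policy, from scratch
            V_new = np.array([step(s, pi[s])[1] + GAMMA * V[step(s, pi[s])[0]]
                              for s in range(N)])
            backups += N
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        pi_new = np.array([min((0, 1), key=lambda a, s=s: step(s, a)[1]
                               + GAMMA * V_new[step(s, a)[0]]) for s in range(N)])
        if np.array_equal(pi_new, pi):
            return V_new, backups
        pi = pi_new
```

Both methods reach the same optimal value function, but policy iteration pays a full evaluation per candidate policy, so its total backup count is much larger on this example.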

VI. GRAPHICAL GAME SOLUTION BASED ON Q-FUNCTION BY ACTOR-CRITIC LEARNING
This section develops an actor-critic network to implement the proposed Algorithm 1 for solving the dynamic graphical games in real time. Each agent has its own critic network to perform the value update and its own actor network to perform the policy update. The actor-critic network structures depend only on the local information of the agents.

A. ACTOR-CRITIC NETWORKS
The value function for agent $i$, $Q_i(\varepsilon_{ik}, u_{ik})$, is approximated by the critic network $\hat{Q}_i(W_{ic})$, and the control policy $\hat{u}_{ik}(W_{ia})$ is approximated by an actor network, where $W_{ic}$, $\forall i$, are the weighting matrices of the approximating structures $\hat{Q}_i(W_{ic})$ and $W_{ia}$ are the actor weights. The input vector $L_{ik} = [\varepsilon_{ik}^T \;\; \varepsilon_{-ik}^T]^T$ consists of the state $\varepsilon_{ik}$ and the states of its neighbors $\varepsilon_{-ik}$. Let $\xi_{\hat{Q}_i}(\varepsilon_{ik}, \hat{u}_{ik})$ be the approximation error of the actor network. The update of the control policies with the critic structure is given by (36), where $W_{ic}(\hat{u}_{ik}, \varepsilon_{ik})$ represents the block matrix defined by the positions of the control approximation and the state of agent $i$.
The squared approximation error of the actor network is minimized by gradient descent, and the update rules for the actor weights are given by the corresponding gradient step, where $0 < \mu_{ia} < 1$ is the learning rate of the actor network. Let $J_{\hat{Q}_i}(\varepsilon_{ik}, \hat{u}_{ik})$ be the target value of the critic network at step $l$. The critic network approximation error at step $l$ is the difference between this target and the critic output, the squared approximation error for the critic network is defined accordingly, and the update rules for the critic network weights are given by the corresponding gradient-descent step.
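A minimal sketch of the gradient-descent weight updates (linear-in-weights critic and actor; all names, shapes, and learning rates are assumptions, not the paper's exact structures):

```python
import numpy as np

# Assumed linear approximators: critic Qhat = w_c . phi(eps, u),
# actor uhat = W_a @ eps, with learning rates 0 < mu < 1.
rng = np.random.default_rng(0)
w_c = rng.normal(size=3)          # critic weights on features [eps1, eps2, u]
W_a = rng.normal(size=(1, 2))     # actor weights (feedback gain)
mu_c, mu_a = 0.1, 0.05            # learning rates

def critic_update(w_c, phi, target):
    err = w_c @ phi - target          # approximation error vs. target value
    return w_c - mu_c * err * phi     # gradient of 0.5 * err^2 w.r.t. w_c

def actor_update(W_a, eps, u_target):
    err = W_a @ eps - u_target        # policy approximation error
    return W_a - mu_a * np.outer(err, eps)
```

Iterating either update against a fixed target drives the corresponding squared approximation error to zero, which is the role these rules play in the online tuning below.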

B. ACTOR-CRITIC WEIGHTS ONLINE TUNING
Algorithm 2 is proposed for real-time tuning of the actor-critic networks using data from the system trajectories. It begins from an initial state $\varepsilon_{i0}$ on the system trajectory and uses (51) to calculate $\hat{u}_{ik}$ in Algorithm 1.

C. SIMULATION STUDIES
In this section, we use simulation to verify the effectiveness of the algorithm proposed in this paper. The graph of the multi-agent system is shown in Fig. 1, where agent 0 is the leader and the others are followers. The states and inputs of the agent systems follow (1). Since the proposed scheme is model-free, the matrices A, B_1, B_2, and B_3 are used only to generate the tracking error dynamics; the pinning gains and graph edge weights are set accordingly. Fig. 2(a) shows the critic weights of agent 2 under the initial data; agent 2 has two neighbors. The state vector of each agent has two components, so the critic network for agent 2 is a 2 × 4 matrix, and the 8 curves in Fig. 2(a) correspond to the 8 entries of that matrix. Every 4 curves refer to one agent's state, and some curves flatten out at the same time, such as wc2_1, wc2_2, wc2_5, and wc2_6. Fig. 2(b) shows the actor weights of agent 2 under the initial data. Since the policy is a scalar rather than a matrix, and the state vector of each agent has two components, the actor weight matrix of agent 2 is a 1 × 4 matrix; similarly, every two curves refer to one agent's policy. Fig. 3 shows the tracking error dynamics of all the follower agents. By comparing the number of iterations needed to reach convergence, the number of iterations required by the proposed algorithm in Fig. 3(a) is almost one-tenth of that in Fig. 3(b), so the convergence speed in Fig. 3(a) is about 10 times faster than that in Fig. 3(b).
According to Figs. 2(a), 2(b), and 3(b), when the neighborhood tracking error reaches 0, the critic and actor weights reach a steady state, and Algorithm 1 converges. Fig. 4 shows the dynamics of all the follower agents; they finally synchronize to the leader while reaching optimality. The number of iterations needed by the policy iteration to reach state synchronization is clearly larger than that of the value iteration. Fig. 5 compares the tracking error dynamics given by the proposed algorithm and the model-based algorithm proposed in [6]. Fig. 6 compares the tracking error dynamics given by the proposed algorithm and the model-based policy iteration algorithm proposed in [1].
Therefore, according to Fig. 5 and Fig. 6, although the complexity of our model-free algorithm is relatively low, with no requirement for knowledge of the system dynamics, our algorithm also makes the tracking error dynamics quickly converge to 0.

VII. CONCLUSION
In this paper, we improve the convergence speed of solving the Nash equilibrium of multi-agent systems by replacing the policy iteration algorithm proposed in [25] with the proposed value iteration. The rationality and convergence of this algorithm are proved theoretically. An actor-critic network structure is used to implement this algorithm. The simulation results show that the convergence speed of the proposed value iteration algorithm is much faster than that of the policy iteration algorithm proposed in [25].
More research is still necessary to further improve the proposal. First, there are many forms of confrontation and cooperation between multiple agents that are not taken into account by the proposal. Second, in unknown environments, deep learning and reinforcement learning algorithms need to be combined to improve the suitability of the proposal. Third, it is interesting to investigate how to apply the proposed algorithm in dynamic formation systems.