Formal Reachability Analysis for Multi-Agent Reinforcement Learning Systems

Reachability analysis is one of the most basic and challenging problems in verification. We investigate this problem in multi-agent reinforcement learning (MARL) systems by transforming reachability analysis into a decision-making problem solved with the mixed integer linear programming (MILP) solver Gurobi. We define the syntax and semantics of multi-role strategy logic (mrSL), which is used to describe reachability specifications. The logic mrSL is a variant of SL that provides a holistic perspective for describing reachability properties, instead of specifying that a particular agent reaches a certain state in the MARL system. We also provide algorithms that translate a reachability property into MILP constraints. Finally, we introduce a tool called MAReachAnalysis, which uses Gurobi to solve the corresponding reachability problem; we evaluate it on the predator-prey task of the multi-agent deep deterministic policy gradient (MADDPG) example and discuss the experimental results obtained on a range of test cases.


I. INTRODUCTION
As AI methods are increasingly used in safety-critical applications, concerns have been raised about the suitability of multi-agent reinforcement learning (MARL) for deployment in these applications. To ease this concern and gain users' trust, a MARL system needs guarantees that certain user-expected properties are satisfied. MARL systems are not programmed directly; instead, neural networks are first trained on data and then employed to perform a particular task. Consequently, the formal verification methods widely used for multi-agent systems (MAS) cannot directly address the validation of AI applications, owing to these inherently different components [1], [2].
Generally speaking, a formal verification problem consists of two parts: (i) a neural network and (ii) a property to be checked; its result is either a formal guarantee that the network satisfies the property, or a concrete input for which the property is violated. In the last few years, formal verification of neural networks has grown into a very active field of research in computer science, with numerous success stories.
The associate editor coordinating the review of this manuscript and approving it for publication was Utku Kose.
Reference [3] provides a method of formally verifying desirable properties of neural networks, i.e., obtaining provable guarantees that neural networks satisfy specifications relating their inputs and outputs. Reference [4] proposes a new direction for formally checking security properties of DNNs, using interval arithmetic to compute rigorous bounds on the DNN outputs. Reference [5] introduces a general framework to certify the robustness of neural networks with general activation functions. Reference [6] develops a method to automatically verify that no unwanted states are reached in a single-agent reinforcement learning (RL) system. Reference [7] presents a technique for the automatic verification of MAS populated by arbitrarily many agents adhering to different roles. Despite the emergence of various verification methods, reachability analysis remains one of the most basic problems in verification, because it can be instantiated into several key decision problems, including safety verification [8]-[10], output range analysis [7], [11], and robustness comparison [12], [13]. To our knowledge, however, there is little work on reachability analysis for MARL systems.
To specify the verification property, alternating-time temporal logic (ATL) and strategy logic (SL) are the best-known logical formalisms. ATL is used to express cooperation and competition among agents in order to achieve certain temporal goals, but it suffers from the strong limitation that strategies are treated only implicitly, through modalities that refer to games between competing coalitions. To overcome this problem, SL was proposed, in which strategies are referred to explicitly, using first-order quantifiers and bindings to agents. However, SL assigns a policy to each individual agent; if there are many agents in the MAS, the property formula becomes quite complex. Specifications concerning strategic interplay based on roles are therefore naturally required in verification. For this reason we put forward multi-role strategy logic (mrSL), a variant of SL that provides a holistic perspective for describing system properties, regardless of which agent is responsible for the concrete task.
Due to their complex structure, manual reasoning about MARL systems is infeasible. Inspired by MILP-based verification [14], [15], we transform reachability analysis into a decision-making problem tackled by a MILP solver, which answers queries about reachability properties by turning them into constraint satisfaction problems. This article (i) provides a formal logic, mrSL, to describe reachability properties of MARL systems; (ii) develops effective methods for transforming mrSL properties into constraints; and (iii) integrates these methods into a larger implementation and reports experiments on the MADDPG predator-prey task.
The rest of the paper is organized as follows. Section II provides preliminaries for the problem, along with a motivating example, the predator-prey task in MADDPG. Section III introduces the syntax and semantics of mrSL. Section IV presents the MILP-based reachability analysis method and the algorithms that translate a reachability property into MILP constraints. Section V gives examples illustrating our approach; related work and conclusions are presented in Sections VI and VII.

II. PRELIMINARIES
This section summarizes and fixes the notation for some of the key notions used later in the paper.
A MARL system comprises several RL agents, each of which is a neural network (RLNN). An RLNN is usually a fully connected feed-forward neural network made up of a number of layers. Each layer consists of multiple nodes: single computation units that combine the values of nodes in the previous layer. These values are used to iteratively compute the assignments of nodes in each succeeding layer, finally producing an output.
Definition 1 (RLNN): An m-layer RLNN is a tuple RN = (L, L_0, x, W, b, σ), where L_i denotes the i-th layer of RN, L_0 is the input layer, L_m is the output layer, and all the other layers are the so-called hidden layers. Each layer L_i has a weight matrix W_i, a bias vector b_i, and an activation function σ_i; we use x_i to denote the variables of layer L_i. Our purpose is to use MILP to analyze the reachability of MARL systems, so translating the reachability problem into a MILP model boils down to expressing the constraints between the input and the output.
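To make Definition 1 concrete, the following is a minimal sketch (ours, not the authors' code) of the layer-by-layer computation of an m-layer RLNN with ReLU on the hidden layers. The toy weights are illustrative only.

```python
# Sketch of an RLNN forward pass as in Definition 1:
# layer i computes x_i = sigma_i(W_i x_{i-1} + b_i).

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, x):
    # W is a list of rows; returns W x + b
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def rlnn_forward(layers, x):
    # layers: list of (W, b) pairs; ReLU on hidden layers, identity on output
    for idx, (W, b) in enumerate(layers):
        x = affine(W, b, x)
        if idx < len(layers) - 1:  # hidden layer
            x = relu(x)
    return x

# toy 2-layer network: 2 inputs -> 2 hidden -> 1 output
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
          ([[1.0, 1.0]], [0.0])]
out = rlnn_forward(layers, [2.0, 1.0])
```

The actor networks used later in the paper have the same shape, only larger (16 or 14 inputs, three hidden layers of 64 nodes, 5 outputs).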
Definition 5 (MILP): MILP is an optimization technique used to maximize or minimize a linear objective function subject to a set of linear constraints. The variables in a MILP may be of integer, real, or binary type.
Definition 6 (Linear RLNN): An m-layer RLNN is linear if every layer computes a linear function of its input. In a MARL system, however, the activation function introduces non-linearity into the output of a neuron. The best-known activation function is ReLU, a piecewise linear function that outputs its input directly if it is positive and outputs zero otherwise. We call an RLNN a ReLU-RLNN if it uses ReLU as its activation function. A method that uses binary variables to eliminate the non-linearities is discussed in [6]; it relies on the ''big-M'' encoding. In this article, the linear constraints encoding layer i are defined as follows.
Definition 7 (Linear Encoding for ReLU-RLNN): Let N be an m-layer RLNN with computed function f. Suppose x^(i-1) and x^(i) are vectors of real variables representing the input and output of layer i, and δ^(i) is a vector of binary variables. Writing z^(i) = W^(i) x^(i-1) + b^(i), the set of linear constraints C_i encoding layer i is defined as: x^(i) ≥ z^(i), x^(i) ≥ 0, x^(i) ≤ z^(i) + M(1 − δ^(i)), x^(i) ≤ M δ^(i), δ^(i) ∈ {0, 1}, for a sufficiently large constant M. The MILP model can then be obtained via the composition of constraints of the above form with appropriate linear constraints. Applying this linear encoding to linearize all ReLU nodes yields a set of linear constraints.
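The ''big-M'' linearization of a ReLU neuron can be sanity-checked in a few lines. The sketch below (ours; it follows the standard big-M encoding, since the paper's constraint set is not reproduced here) checks that a candidate assignment (y, δ) for y = ReLU(z) satisfies the encoding, and shows that the true ReLU output with δ_j = 1 iff the neuron is active is always feasible.

```python
# Standard big-M encoding of y = ReLU(z) with binary delta:
#   y >= z,  y >= 0,  y <= z + M*(1 - delta),  y <= M*delta.
# delta = 1 forces y = z (active neuron); delta = 0 forces y = 0.

def satisfies_big_m(z, y, delta, M=1e3, eps=1e-9):
    return all(
        y_j >= z_j - eps and
        y_j >= -eps and
        y_j <= z_j + M * (1 - d_j) + eps and
        y_j <= M * d_j + eps and
        d_j in (0, 1)
        for z_j, y_j, d_j in zip(z, y, delta))

# the exact ReLU output is a feasible point of the encoding
z = [1.5, -0.25, 0.0]
y = [max(0.0, v) for v in z]
delta = [1 if v > 0 else 0 for v in z]
```

In a real model, M must bound the magnitude of every pre-activation value; choosing it too small cuts off feasible behavior, while choosing it too large weakens the LP relaxation.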
In a MARL system, the environment is stateful and updates its state in response to the actions of the agents. Agents usually interact with non-linear environments; in this article, the training environment is modified to be linear so that we can use MILP to solve the reachability problem.
Definition 8 (Linear Environment): LE = (S, t_E), where:
• S is the set of states of the environment;
• t_E : S × Act → S is a transition function which, given the current state of the environment and an action performed by the agents, returns the next state of the environment.
Definition 9 (Linear MARL System): A linear MARL system is a tuple (LRNs, LE, Init), where:
• LRNs is the set of linear RLNNs;
• LE is the linear environment;
• Init is the initial information, which may include the initial state s_0 ∈ S, the initial observation o_0 ∈ O, and the initial action a_0 ∈ A.
MADDPG is one of the most advanced MARL methods and performs well in multi-agent environments [16], but once it has been trained there is no guarantee that the training result satisfies the required properties. We propose to study the reachability problem for the predator-prey task and report the results.
Example 1: MADDPG can be viewed as a multi-agent version of DDPG (deep deterministic policy gradient). Each agent has two networks: an actor network π and a critic network Q. The actor network calculates the action to be executed based on the state acquired by the agent, while the critic network evaluates the action calculated by the actor network to improve the performance of the actor network. In the training phase, the actor network only obtains observation information from itself, while the critic network acquires information such as the actions and observations of other agents. In the execution phase, the critic network does not participate, and each agent only needs an actor network.
The predator-prey task in MADDPG involves prey agents and a group of predator agents, as shown in figure 1. The prey is 30% faster than the predators, so the aim of the predators is to learn how to team up in order to catch the prey, while the aim of the prey is to learn how to escape from the predators. The task thus contains not only a collaborative strategy but also a competitive strategy. Because only the actor network is used to calculate the action of each agent, and in MADDPG the actor network is a ReLU-RLNN, the structures of these ReLU-RLNNs are as follows when there are 3 predators (ag_1, ag_2 and ag_3) and 1 prey (ag_4) in the predator-prey task.
If the network is for a predator, the input layer consists of 16 nodes; if the network is for the prey, the input layer consists of 14 nodes. There are 3 hidden layers, each consisting of 64 nodes with the ReLU activation function. The output layer consists of 5 nodes denoting the action.
In the MADDPG example the agent model is not linear, because it uses the Euclidean distance to determine whether a collision happens between two agents. In this article, the Manhattan distance is used instead: a collision happens when both |x_1 − x_2| and |y_1 − y_2| are less than some fixed value v. For a collision between the prey and a predator, v is 0.125; between two predators, v is 0.1; and between two preys, v is 0.15.
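The linearized collision test can be sketched as follows (our illustration, not the toolkit's code); the role-dependent thresholds are those given above.

```python
# Manhattan-distance collision test: a collision requires both
# |x1 - x2| < v and |y1 - y2| < v, with v depending on the role pair.

THRESHOLDS = {
    ("prey", "predator"): 0.125,
    ("predator", "predator"): 0.1,
    ("prey", "prey"): 0.15,
}

def collides(pos1, role1, pos2, role2):
    v = THRESHOLDS.get((role1, role2)) or THRESHOLDS[(role2, role1)]
    return abs(pos1[0] - pos2[0]) < v and abs(pos1[1] - pos2[1]) < v
```

Unlike the Euclidean test, each of the two comparisons is linear in the coordinates (after splitting the absolute value into two inequalities), which is what makes the MILP encoding possible.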

III. MULTI-ROLE STRATEGY LOGIC
SL is an extension of the classic linear-time temporal logic LTL along with the concepts of strategy quantifications and agent binding [17]. SL provides a decidable language for talking in a natural and uniform way about all kinds of properties on game graphs, including zero-sum, as well as nonzero-sum objectives.
However, SL assigns a policy to each agent, so if there are multiple agents in the MAS, the formula becomes complex. For example, suppose there are 4 agents in the predator-prey task: Ag_1, Ag_2 and Ag_3 are predators, while Ag_4 is the prey. Now we want to know whether agent Ag_4 can be caught by the predator agents. We do not actually care whether it is agent Ag_3 or one of the other agents that catches Ag_4. The more agents there are, the more complex the SL formula becomes. We therefore propose mrSL, in which roles are added. It is a variant of SL that provides a holistic perspective for describing system properties, regardless of which agent is responsible for the concrete task. We also set a bound, a natural number denoting the temporal depth up to which the formula is evaluated.
Definition 10 (mrSL Syntax): Given a set of environment states S ⊆ R^m, the mrSL specification language over linear inequalities is defined by the following BNF. In Definition 10, two new symbols are introduced: the quantification prefix ℘ and the binding prefix b. A quantification prefix over a set V ⊆ Var of variables is a finite word ℘ ∈ {⟨⟨x⟩⟩, [x] : x ∈ V}^|V| of length |V| in which each variable occurs exactly once. The existential ⟨⟨x⟩⟩ and the universal [x] can be read as ''there exists a strategy x'' and ''for all strategies x'', respectively. A binding prefix over a set of variables V ⊆ Var is a finite word b ∈ {(r, x) : r ∈ role ∧ x ∈ V}^|role| of length |role| such that each role r ∈ role occurs exactly once in b. The role binding (r, x) can be read as ''bind role r to the strategy associated with the variable x''. Atomic propositions α are linear constraints on the components of a state, in which p_i represents the linear constraint on the components of r_i.
Other temporal operators appear in logics such as LTL and CTL, for example F (eventually), G (always) and R (release). We do not introduce these operators because G and F, and R and U (until), are dual and can be expressed in terms of each other. For example, F^{≤k} ϕ ⇐⇒ ¬G^{≤k} ¬ϕ, ϕ_1 U^{≤k} ϕ_2 ⇐⇒ ¬(¬ϕ_1 R^{≤k} ¬ϕ_2), and F^{≤k} ϕ ⇐⇒ ⊤ U^{≤k} ϕ. The goal in the example above can thus be expressed in mrSL as ∀x (r_2, x) ∃y (r_1, y) F^{≤10} d(r_1, r_2) < 0.5, meaning that for all strategies used by role r_2 there is a strategy used by role r_1 such that the distance between the agents in r_1 and r_2 becomes less than 0.5 within 10 steps.
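The bounded F/G duality mentioned above is easy to check over finite paths. The following sketch (ours) evaluates the bounded operators on a concrete run; `close` is a hypothetical atomic proposition on distances.

```python
# Bounded temporal operators over a finite path:
# F<=k phi: phi holds at some position 0..k;
# G<=k phi: phi holds at every position 0..k.

def F_leq(k, path, phi):
    return any(phi(s) for s in path[:k + 1])

def G_leq(k, path, phi):
    return all(phi(s) for s in path[:k + 1])

path = [0.9, 0.6, 0.4, 0.2, 0.1]      # e.g. distances along a run
close = lambda d: d < 0.5             # atomic proposition d < 0.5
```

On any such path, F_leq(k, path, phi) coincides with not G_leq(k, path, lambda s: not phi(s)), which is exactly the duality F^{≤k} ϕ ⇐⇒ ¬G^{≤k} ¬ϕ.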
From the mrSL definition, there are two strategy formula patterns, [x](r, x)ϕ and ⟨⟨x⟩⟩(r, x)ϕ, where the binding (r, x) expresses that the agents in role r have a strategy to enforce the formula ϕ. Intuitively, mrSL formulae quantify over the concrete agents. For instance, [x](r, x)ϕ expresses that every concrete agent in role r can enforce ϕ independently of the actions performed by the other agents, whether they belong to role r or not.
We can encode the predator-prey task as a MILP model H composed of linearized RLNN models, some of which are preys and the others predators. The input of these RLNNs is the state observed by each agent, and the output is the action for each agent. The reachability property for the preys may be that they can escape from the predators, while for the predators it is that they can catch the preys.

Definition 11 (Path): A path in a linear MARL system is a finite sequence of states ρ ∈ St* such that, for all j ∈ [0, |ρ| − 1], there exists a decision dc ∈ Dc with ρ(j+1)^i_r = fs(ρ(j)^i_r, dc), where Dc = Ac × Ag^i_r denotes the choice of an action for each agent i of role r. We use ρ^i_r(s) to denote a path starting at state s for agent Ag^i_r, and ρ(i) to denote the i-th state of ρ.
Definition 12 (Strategy): A strategy for a concrete agent Ag^i_r is a function f : St* → Ac_i that gives a choice of action for the agent depending on the path. We write Str(s, Ag^i_r) for the set of all strategies of agent i of role r starting from state s.
A valuation is a function ν : Var → Str that maps variables to strategies. A binding is a function β : Ag^i_r → Var that maps agents to variables.
Definition 13 (Assignment): An assignment in H is a partial function χ : Var ∪ Ag^i_r → Str mapping variables and agents to strategies.
Actually, we can use ν(x) and the composition ν(β(Ag^i_r)) to represent the assignment of strategies to variables and agents, respectively. Both expressions can be abbreviated as χ(l), where l ∈ Var ∪ Ag^i_r. Given an assignment χ ∈ Asg, a variable or agent l ∈ Var ∪ Ag^i_r, and a strategy f ∈ Str, we write χ[l ↦ f] for the new assignment that returns f on l and agrees with χ on the remaining part of its domain. We now give the semantics of the satisfaction relation for the logic.
Definition 14 (mrSL Semantics): Given a linear MARL system H, for all mrSL formulas ϕ, states s ∈ S, and assignments χ, the modeling relation H, s, χ ⊨ ϕ is inductively defined as follows.
H, s, χ ⊨ c_1 l_1 + c_2 l_2 + · · · + c_m l_m op c iff (Σ_{i=1}^{m} c_i × s.l_i) op c, where op is the comparison operator of the atomic constraint and l_i is the local state component for role r.

IV. REACHABILITY ANALYSIS
In this section, we outline the ideas and details of our reachability analysis method. First, the reachability problem in a linear MARL system is formally introduced; then the construction of MILP models for MADDPG is described. Finally, the multi-agent multi-step state reachability (MMSR) algorithm is provided.

A. PROBLEM FORMULATION
The general reachability problem for a MARL system is to determine whether some state property on the output holds for all, or for some, inputs in a bounded input domain. Section II shows that a trained RL neural network can be represented by a MILP model regardless of its depth and number of neurons, so the reachability problem for a MARL system can be translated into a MILP reachability analysis.
Definition 15 (Reachability of a Linear MARL System): From the definition above, we can transform the reachability objective into constraints; that is, the objective of the agents in the MARL system can be described as z = 0 with C_reach = C. Hence reachability analysis reduces to solving a linear program defined over the constraints C_reach, which is a MILP problem.
We have shown how to transform the ReLU function into a linear constraint set C_i in Definition 7. Let C_R = ∪_{i=2}^{m} C_i denote the linear constraint set for all the layers of each RLNN.
Definition 16 (Linear Model for a MARL System): Let C_I and C_O denote the linear constraints on the input and output of each RLNN; the linear encoding of the MARL system is defined by the corresponding set of constraints and variables, from which the multi-agent disjunctive constraints can be expressed. We consider the following specifications. S1: Is it ever the case that each prey can be caught by a predator within k steps? It can be described in mrSL as follows: ∀x (prey, x) ∃y (pred, y) F^{≤k} d(prey, pred) < 0.125. S2: Is it ever the case that the preys have not been caught from the initial states within k steps?
∀x (prey, x) ∀y (pred, y) F^{≤k} d(prey, pred) > 0.125. S3: Is it ever the case that there exists one prey that has been caught from the initial states within k steps?
∃x (prey, x) ∃y (pred, y) F^{≤k} d(prey, pred) < 0.125. To analyze reachability, the above specifications must be converted into constraints in the MILP model. The logic mrSL contains the strategy formulas ∀x (a, x) and ∃x (a, x), so there are four types of combinations: (1) ∀x (a, x) ∀y (b, y) ϕ; (2) ∀x (a, x) ∃y (b, y) ϕ; (3) ∃x (a, x) ∀y (b, y) ϕ; (4) ∃x (a, x) ∃y (b, y) ϕ. The semantics of formula (1) is that all strategies x used by any agent in role a and all strategies y used by any agent in role b satisfy ϕ. The semantics of formula (4) is that there is a strategy x used by an agent in role a and a strategy y used by an agent in role b that satisfy ϕ. We treat the semantics of formulas (2) and (3) as the same: for all strategies x used by any agent in role a, there exists a strategy y used by an agent in role b such that ϕ is satisfied. So we need to translate formulas (1), (4) and (2) into MILP constraints; the algorithms are described as follows.
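The effect of the quantifier combinations on the constraint shape can be sketched in a few lines (our illustration; the constraint representation as tuples is hypothetical, not the toolkit's): a universal pairing yields a conjunction over agent pairs, while an existential inner quantifier yields a disjunction.

```python
# Constraints are represented here as (pred_agent, prey_agent, op, d) tuples;
# ("or", [...]) marks a disjunctive group.

def forall_exists_constraints(preds, preys, d):
    # formula (2): for every prey there must exist some predator within d
    # -> conjunction over preys of a disjunction over predators
    return [("or", [(p, q, "<=", d) for p in preds]) for q in preys]

def forall_forall_constraints(preds, preys, d):
    # formula (1): every predator/prey pair must satisfy the bound
    # -> pure conjunction
    return [(p, q, "<=", d) for p in preds for q in preys]

preds = ["pred0", "pred1", "pred2"]
preys = ["prey0"]
```

For S1 with 3 predators and 1 prey, formula (2) produces one disjunctive group with three alternatives, matching the three Distance constraints shown later in Section V.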
In Algorithm 1, the input includes the roles ar1 and ar2 and the distance value d to be compared; formula (1) is transformed into the set of conjunctive constraints that must be satisfied. In Algorithm 2, formula (2) is transformed into a combination of disjunctive and conjunctive constraints, and in Algorithm 3, formula (4) is transformed into the disjunctive set of all constraints that must be satisfied.

C. MULTI-AGENTS MULTI-STEP STATE REACHABILITY
In fact, the three specifications above can be reduced to the multi-agent multi-step state reachability (MMSR) decision problem: can the agents arrive at the specified states within a bounded number of steps?
Definition 17 (MMSR Decision Problem): Let k ∈ N and S'' ⊆ S be the target states, where S_i ∈ S'' represents the target state of agent ag_i. The MMSR decision problem is to determine, for each agent ag_i, whether within k steps there exists X^i_0 ∈ I such that S_i = F^i_k(X^i_0). The MMSR problem can thus be solved by k-step iteration.
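The k-step iteration behind MMSR can be sketched as follows (ours; in the toolkit the unfolding is encoded as MILP constraints and handed to Gurobi rather than simulated, and the toy one-dimensional dynamics below is purely illustrative).

```python
# Apply the (linear) transition up to k times from an initial state and
# report at which step, if any, the target predicate holds.

def mmsr_iterate(step, reached, s0, k):
    # step: state -> next state; reached: state -> bool
    s = s0
    for i in range(k + 1):
        if reached(s):
            return i          # target reached at step i
        s = step(s)
    return None               # not reachable within k steps

# toy 1-d chase: the gap shrinks by 0.3 per step; target gap < 0.125
step = lambda gap: gap - 0.3
result = mmsr_iterate(step, lambda gap: gap < 0.125, 1.0, 5)
```

The MILP version answers the same question symbolically for all initial states in the hyper-rectangle Init at once, which is what makes it a verification rather than a simulation.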
The MMSR analysis procedure is given in Algorithm 4, where the method spec2constraints calls Algorithms 1, 2 and 3 according to the formula type and returns the constraint set.

V. EXPERIMENTS
Algorithm 4 The MMSR Analysis Procedure
Procedure MMSR(models, ϕ)
1: Input: MILP models, formula ϕ
2: Output: feasible
3: feasible = MILP_Solver(C_lp)
6: return feasible

We implemented the methods described in the previous sections in a toolkit, called MAReachAnalysis, that solves the reachability decision problem in MARL systems. MAReachAnalysis takes as input several linear RLNNs and an mrSL specification ϕ, and returns whether or not the specification ϕ holds. The set Init of initial states is given in the form of a hyper-rectangle [l_1, u_1] × · · · × [l_n, u_n]. We experimented with ReLU-RLNNs built using TensorFlow, but other choices are possible. To evaluate the correctness and performance of MAReachAnalysis, we analyzed the predator-prey task described in Example 1 with different numbers of agents. The neural networks used in this article were trained for 10000 iterations. The experiments were run on an Intel Core i7-9750 CPU (2.60GHz, 8 cores) running Ubuntu 18.04, on which we invoked Gurobi version 9.0.

A. SCENARIO 1 WITH 3 PREDATORS AND 1 PREY
In this scenario, we describe the global state of the predator-prey task by the set of observations of each agent, S = (obs_1, obs_2, obs_3, obs_4), where obs_1, obs_2 and obs_3 are for the predators and obs_4 for the prey. Each obs_i is itself a tuple: obs_i = (p_vel, p_pos, entity_pos, other_pos, other_vel), where
• p_vel is the velocity of the agent, a two-dimensional vector;
• p_pos is the position of the agent, a two-dimensional vector;
• entity_pos is the relative position between the agent and the landmarks, a two-dimensional vector per landmark;
• other_pos is the relative position between the agent and the other agents, a two-dimensional vector per other agent (the number of agents minus one);
• other_vel is used for communication; it is not used in this example.
The length of obs 1 , obs 2 and obs 3 is 16, but for obs 4 is 14.
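A back-of-the-envelope check (ours) of the reported observation sizes, under two assumptions not stated explicitly in the text: that the scene contains 2 landmarks, and that the 2-entry other_vel component appears only in predator observations (which is consistent with the 16 versus 14 lengths).

```python
# Every component is a 2-d vector per entity.

def obs_len(n_agents, n_landmarks, n_other_vel):
    p_vel = 2
    p_pos = 2
    entity_pos = 2 * n_landmarks
    other_pos = 2 * (n_agents - 1)
    other_vel = 2 * n_other_vel
    return p_vel + p_pos + entity_pos + other_pos + other_vel

predator_obs = obs_len(n_agents=4, n_landmarks=2, n_other_vel=1)
prey_obs = obs_len(n_agents=4, n_landmarks=2, n_other_vel=0)
```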
A joint action Act is a tuple of local actions for all the agents in the system: Act = (Act_1, Act_2, Act_3, Act_4, Act_E), where each Act_i is 5-dimensional, Act_i = (up, down, left, right, reserved). The first 4 items specify the desired movement in Cartesian coordinates and the last item is reserved. In the MADDPG example the environment has one dummy action ε, which does nothing to the agents, so the environment action Act_E is always ε.
Similarly, a transition is a tuple of local transitions for all the agents in the system, T = (t_1, t_2, t_3, t_4), where each local transition function is defined as t_i(s^(i), a_i) = p_i_pos + p_i_vel * t, with p_i_vel also obtained from obs_i.
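The local transition above is linear in position and velocity, which is what keeps the unfolded system MILP-encodable. A minimal sketch (ours; the step length t is an assumed value, and how the chosen action updates the velocity is abstracted away here):

```python
# Next position = current position + velocity * step length t.

DT = 0.1  # step length t (an assumption, not stated in the text)

def local_transition(p_pos, p_vel, dt=DT):
    return [p + v * dt for p, v in zip(p_pos, p_vel)]

next_pos = local_transition([0.5, -0.5], [1.0, 2.0])
```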
We perform reachability analysis on the predator-prey task and report the analysis results. The MAReachAnalysis toolkit builds the necessary set of linear constraints based on Definitions 7 and 15. Each MILP model consists of three groups of constraints: (1) constraints on the input, ensuring that its value stays within bounds (the C_I constraints); (2) constraints on the output of each layer, ensuring that it stays within a safe level of floating-point precision (the C_R constraints); (3) constraints on the final output (the C_O constraints). We define the range of p_pos to be [−1, 1] and of the output values to be [0, 1], and we add a total perturbation tolerance for the RLNN, the sum of the per-layer perturbations, which belongs to C_R.
The constraints are then passed to Gurobi, which determines whether the generated linear program admits a solution. Based on the output from Gurobi, the result presented to the user is either feasible, indicating that an output state is reachable from the input states, or infeasible, representing unreachability of the output states. S1: Is it ever the case that each prey can be caught by a predator within k steps?
In the predator-prey task, the predators catch the prey when the distance between them is less than 0.125 (the maximum collision distance is the sum of the sizes of predator and prey: the size of a predator is 0.075, of the prey 0.05, and of a landmark 0.2). Using Algorithm 2, S1 can be transformed into the following disjunctive set:
Distance(Agt^0_pred, Agt^0_prey) ≤ 0.125
Distance(Agt^1_pred, Agt^0_prey) ≤ 0.125
Distance(Agt^2_pred, Agt^0_prey) ≤ 0.125
In this article we use the Manhattan distance to determine whether there is a collision between two agents. The linear constraints are added to the model as follows:
gmodel.addConstr(C_1)
gmodel.addConstr(C_2)
gmodel.addConstr(C_3)
where the variables delta03_x and delta03_y are the X-axis and Y-axis distances between predator Agt^0_pred and prey Agt^0_prey, expressed as linear constraints; similarly, delta13_x and delta13_y are the distances between Agt^1_pred and the prey. We denote by (ȯbs_1, ȯbs_2, ȯbs_3, ȯbs_4) the initial state of the environment, and by (ōbs_1, ōbs_2, ōbs_3, ōbs_4) the final state of the environment after n time steps. We first fix a set of initial states in which the predators are at the edge of the land and the prey is in the middle, and check whether the prey can be caught by the predators within k steps, for k equal to 5, 10 and 15.
The second column (vars number) gives the numbers of continuous (c) and binary (b) variables. The solver found a solution to the problem when the result in the table is ''Y'', showing that the specification is satisfied.
In fact there are many strategies for the predators to catch the prey; here we give one strategy in which the prey is caught from the initial state within 5 steps. S2: Is it ever the case that the preys have not been caught from the initial states within k steps?
The formal formula for the S2 specification is: ∀x (prey, x) ∀y (pred, y) F^{≤k} d(prey, pred) ≥ 0.125. First we use Algorithm 1 to transform the mrSL formula into constraints; it returns the result in conjunctive form. Because the number of preys is 1, we obtain the same set of atomic constraints as for specification S1.
B. SCENARIO 2 WITH 4 PREDATORS AND 2 PREYS
First we check reachability for the S1 specification, which can be transformed into the corresponding disjunctive set. We denote by (ȯbs_1, ȯbs_2, ȯbs_3, ȯbs_4, ȯbs_5, ȯbs_6) the initial state of the environment and by (ōbs_1, ōbs_2, ōbs_3, ōbs_4, ōbs_5, ōbs_6) the final state of the environment after n time steps. We set up an environment similar to scenario 1 for the S1 specification, where the four predators are at the edge of the land and the two preys are in the middle. The results in table 4 show that there always exists a strategy by which the two preys are caught by the predators within k steps, for k equal to 5, 10 and 15. For example, when k equals 5, the relative X-axis position between predator 2 and prey 1 is 0.06328125, and the relative X-axis position between predator 4 and prey 2 is 0.06328125; both are less than 0.125, indicating that the S1 specification is satisfied.
We obtain the same results for the S2 specification in scenario 2 as in scenario 1: there is no strategy by which the two preys can escape from the predators. We believe this is correct, because the aim of the predators in this task is to catch the preys; it shows that the network works well after being trained for 10000 iterations.
We cannot provide comparisons with other tools because we are not aware of other tools that support systems of multiple neural agents and role-based strategic properties as we do here.

VI. RELATED WORKS
The purpose of reachability analysis is to compute the output reachable set to verify the problem through layer-by-layer analysis. We are not aware of any work addressing reachability analysis/verification for MARL systems. However, reachability analysis for MAS or single-agent RL systems has received much attention over the years. These approaches can be divided into 3 main groups: (1) MILP-based techniques that formulate the verification problem as a mixed integer linear program. For example, [19] presents the verification problem of agent-environment systems against bounded CTL and proves that verification is coNExpTime. Reference [20] presents an ideal mixed-integer programming (MIP) formulation for a ReLU appearing in a trained neural network; this method is much more computationally efficient and scalable than the extended formulation. Reference [21] presents an efficient implementation of a MILP verifier for properties of piecewise-linear feed-forward neural networks; this method improves performance by several orders of magnitude when compared to a naïve MILP implementation. Reference [6] is also a MILP-based technique, which automatically verifies single-agent RL systems.
(2) SMT-based techniques that encode the verification problem as a satisfiability modulo theories problem. Reference [22] partitions the input domain into small grid cells and loosely approximates the reachable set for each grid cell, considering the maximum sensitivity of the network at each grid cell. It works for any activation function and its runtime scales well with the number of layers but poorly with the input dimension. Reference [23] presents Marabou, a framework for verifying deep neural networks. Marabou is an SMT-based tool that can answer queries about a network's properties by transforming these queries into constraint satisfaction problems. In [4], an SMT solver named Reluplex is proposed for a special class of neural networks with ReLU activation functions. (3) Other techniques that give a definite answer. Reference [24] computes the exact reachable set for networks with only ReLU activations; as there is no over-approximation, this method is sound and complete, but because the number of polytopes grows exponentially with each layer, it does not scale. Reference [25] investigates model checking algorithms for variants of SL over pushdown multi-agent systems; these algorithms are automata-theoretic. Reference [26] describes various properties of activation functions using Quadratic Constraints (QCs) and develops a novel framework based on semidefinite programming (SDP) for safety verification and robustness analysis of neural networks against norm-bounded perturbations of their input.

VII. CONCLUSION AND FUTURE WORK
With the deployment of systems based on MARL there has been growing interest in their verification. While the benefits of formal methods have long been recognized, and they have found wide adoption in safety-critical systems as well as in industrial-scale software, little research has been done on verifying MARL systems.
In this article we put forward a formal methodology for solving the reachability problem in MARL systems. First, the reachability property is described in a new logic, mrSL; then the property is translated into a union of constraints; finally, the reachable-set computation algorithm MMSR is run in Gurobi, and the result, feasible or infeasible, is returned.
In future work we plan to explore other methodologies, including introducing parameters into mrSL to make verification more flexible, and job partitioning to improve computation speed. Furthermore, we believe the technique put forward here can serve as an ideal stepping stone toward verifying MARL systems.

Since 2011, he has been a Lecturer with the College of Computer Science and Technology, Jilin University. He is the author of one book, more than ten articles, and five inventions. His research interests include computer architecture, software engineering, software design and development patterns, model checking and validation, machine learning, and deep learning.
SHUQIU LI was born in Heilongjiang, China, in 1966. He received the B.S. degree in computer science and technology from Zhejiang University, in 1989, and the M.S. degree in optical instrument major from the Changchun Institute of Optics, Fine Mechanics and Physics (CIOMP), in 1994.
Since 2001, he has been an Associate Professor with the College of Computer Science and Technology, Jilin University. He is the author of one book and more than 15 articles. His research interests include software engineering, software design and development patterns, model checking and validation, machine learning, and deep learning.
BING LI received the B.Sc., M.Sc., and Ph.D. degrees from Jilin University, in 1999, 2005, and 2011, respectively. He is currently a Lecturer in computer science with Jilin University. His main research interests include network security, software engineering, and machine learning.