Hidden Markov Random Field for Multi-Agent Optimal Decision in Top-Coal Caving

Applying model-based learning to the optimal decision problem of a multi-agent system is not trivial, because obtaining ground truth for training a model of a complex environment is expensive or even impossible. For example, when learning the optimal actions of hydraulic supports in top-coal caving, the optimal action for a given state is not accessible as ground truth in such an intricate process. Treating the latent ground truth as a hidden variable is an effective technique in the hidden Markov model. This paper extends the hidden ground-truth variable to the multi-agent setting and proposes a hidden Markov random field (HMRF) model combined with reinforcement learning for optimizing multi-agent action decisions. In the HMRF model, the input states and the output actions of the multi-agent system are treated as an observable random field and a latent Markov random field, respectively. Based on the HMRF model, the optimal decision is inferred by maximum a posteriori estimation, with the prior probability obtained by Q-learning. Meanwhile, the parameters of the HMRF model are estimated by the expectation-maximization algorithm. In the experiments, top-coal caving demonstrates the effectiveness of the proposed method: the recall of top-coal is improved prominently at a very small price in increased rock rate. Furthermore, the proposed method is applied to a predator-prey problem in Gym. The experimental results show that communication between agents through the HMRF increases the reward of the preys.


I. INTRODUCTION
Top-coal caving is currently the most efficient mining method for thick underground coal seams [1]. As shown in Fig. 1, the hydraulic supports (HSs) are the key equipment of the fully mechanized caving workspace (FMCW) for both safety and top-coal exploitation. Each hydraulic support is equipped with a window (e.g., component C in Fig. 1b) for top-coal exploitation. In the FMCW, hundreds of HSs jack up the coal roof, and the coal cutter cuts coal in front of the HSs. As the HSs move forward, the top-coal falls under gravity and is captured by the drag conveyor if the windows at the back of the HSs are open.
The associate editor coordinating the review of this manuscript and approving it for publication was Chenping Hou.
Controlling the top-caving windows of the HSs is the key to maximizing the output of top-coal exploitation in general practice. Ideally, the windows should stay open as long as possible so that more coal can fall into the drag conveyor. However, rocks from the immediate roof inevitably fall during top-coal exploitation. If the windows are not closed in time, rocks from the immediate roof or even the main roof will also fall into the drag conveyor, which increases the cost of mining because the rocks must be filtered out subsequently. Therefore, the optimal decision on the windows' actions is one of the most important issues in top-coal caving: maximizing top-coal capture while minimizing the amount of rock falling into the drag conveyor.
At present, there are two ways to control whether the windows open or close: by hand, or by a preset timer based on experience. The window opens to capture the top-coal after the coal in front of the HSs has been exploited by the coal cutter and the HSs have moved forward, while the window closes (shown in the sub-graph of Fig. 1(b) within the red ellipse) to prevent rocks in the immediate roof and main roof from falling into the drag conveyor after the top-coal has been excavated completely. However, the hundreds of HSs require many experienced workers, which increases the cost significantly and may raise health and safety issues. Moreover, in the FMCW, the structures of the top-coal, immediate-roof, and main-roof layers are very complicated, and it is hard to determine the boundaries between them [1], [3], [4]. Hence, presetting a timer to control each window individually cannot achieve maximum top-coal mining; in particular, as the coal cutter moves forward, the dynamic processes of the top-coal, immediate roof, and main roof deteriorate the intricate boundary and produce rock-funnel areas.
In order to maximize the output of top-coal with less rock by automating the windows' actions, it is important to develop models that can control the windows collaboratively: by doing so, the straightness of the boundary between coal and rock can be improved, avoiding the rock-funnel areas and increasing top-coal exploitation. Essentially, this is an optimization problem for a multi-agent system in which each window can be regarded as an agent. The window action decision can also be treated as a random process, since the action of each window is independent of past environment states given the current one and thus has the characteristics of a Markov process. Hence, each window's decision maker is an agent that can learn from experience which action to take.
In this multi-agent system of the FMCW, the dimension of the state space and the computational complexity grow exponentially with the number of agents [5]. This makes it difficult to optimize the windows' actions, since the hundreds of HSs in the FMCW produce a very large state space. To address these issues, this paper proposes a probabilistic graphical model (PGM) [6] for the multi-agent system to obtain the optimal decision, with the following motivation: in a PGM of a multi-agent system, the agents and their relationships are depicted uniformly by nodes and edges, so the model architecture explicitly describes the relationship between an agent and its neighbors, and the optimal decision can be inferred from the PGM directly. However, training a PGM requires samples with ground truth, which is unfortunately unavailable for top-coal caving. Inspired by the hidden Markov model (HMM), which treats the unobserved variable of each agent as hidden during training, this paper extends the hidden variables of the multi-agent system to a random field and proposes an optimal decision methodology based on the hidden Markov random field (HMRF) [7] for the multi-agent system.
The proposed methodology mainly contains two parts: the HMRF architecture and the Q-learning algorithm. In the first part, the optimal decision model is established by the HMRF, in which the input and output are the state random field and the action Markov random field of the multi-agent system, respectively, and the correlations among agents are depicted by the nodes and edges of the HMRF. As a supervised model, the HMRF cannot be trained without a large number of labeled samples, yet the complex dynamic environment of the FMCW makes acquiring training samples hard. Hence, the action of each agent is regarded as a hidden variable, meaning the agent cannot observe the optimal action during inference, and the expectation-maximization (EM) algorithm [8] is employed to estimate the parameters of the HMRF model. In the Q-learning part, the trained Q-value function produces the prior probability for the HMRF, and the HMRF model learns its parameters without labels; essentially, the proposed methodology is unsupervised machine learning. The main contributions of this work are: (I) The actions of the HS windows for top-coal caving are formulated as a multi-agent decision problem to approach optimal performance.
(II) The HMRF for multi-agent decision making is proposed to establish a unified model of both the current agent and its neighbors, in which the action of each agent is considered a hidden random variable and the neighbors' actions are embedded into the HMRF to achieve the optimal decisions collaboratively. Hence, the optimal decision is transformed into probabilistic inference based on the HMRF model.
(III) Based on the HMRF, the optimal decisions of the multi-agent system are formulated as a maximum a posteriori (MAP) estimate of the actions under the joint probability distribution, and the Q-learning algorithm is employed to obtain the prior probability for the MAP. Consequently, training the HMRF model without labels can accommodate the complex environment of top-coal caving. The rest of the paper is organized as follows: Section II presents related work on optimal decision making by PGM and reinforcement learning (RL). Section III introduces the hidden Markov random field for multi-agent optimal decision. Section IV presents Q-learning for obtaining the prior of the MAP for the HMRF. Section V demonstrates automated top-coal caving and the predator-prey experiment to prove the effectiveness of the algorithm. Section VI gives the conclusion.

II. RELATED WORKS
For a multi-agent system, a single agent can be abstracted as a node of the PGM [9], while the communication between agents can be treated as its edges. Hence, the temporal-spatial characteristics of the agents are described by the nodes and edges.
RL [10] is an effective method for obtaining optimal decisions [11]-[13]. The optimal policy is obtained by iterating a learning mechanism driven by rewards from the environment, or by a neural network approximating the environment model in an actor-critic architecture [14], [15]. Hence, RL is widely applied in multi-agent systems [16], such as traffic control [17], [18], mobile robots [19]-[21], and resource management [22], [23].
RL obtains the optimal policy by revising the possible state transitions backwards from the terminal state [24]; hence, a multi-agent system encounters the curse of dimensionality dramatically. Furthermore, the rewards of the agents are interdependent, so the performance index of the system should be optimized over all the rewards [25], [26], and sometimes it must balance the whole against the individuals [27]. Moreover, each agent executes its policy locally while affecting the other agents, and different policy combinations produce different environment changes; hence, the environment is non-stationary while the agents execute their policies [25], [26]. This makes it difficult to obtain training data that covers all combinations of states, especially when there is a large number of agents.
Hence, combining the advantages of PGM and RL is a natural choice for multi-agent optimal decision making: the communication topology of the multi-agent system is explicitly formulated by the PGM, and the temporal-spatial characteristics of the multi-agent actions are clearly depicted by its output. This combination can alleviate the difficulties confronted by the RL system [9].
Bayesian networks and Markov random fields (MRFs) are the two major kinds of PGM. A series of Bayesian methods has been proposed for RL in single-agent systems [28]-[30] and multi-agent systems [31]-[33]. The conditional random field (CRF) is a classical MRF in which the input and output random variables each form their own random field, and the output of the CRF is obtained by probabilistic inference with the joint conditional probability. Applications of CRFs to multi-agent systems can be found in [34], [35], in which the incidence relations between agents are formulated as potential functions whose parameters are estimated by maximum likelihood based on input-output data pairs.
However, one of the most important challenges for optimal decision making via PGM is estimating the parameters of the approximate model, since it is hard to obtain ground truth for the training data, especially when the process and environment are complex and non-stationary. Treating the ground truth of the training data as a hidden variable is an effective way to estimate the parameters of a PGM [36]. The hidden Markov model (HMM) was proposed to deal with such hidden-variable systems, and it has been well applied in many control and decision-making systems, such as robot control [37]-[40], autonomous manufacturing [41]-[43], and fault diagnosis [44], [45].
In RL systems, the HMM is often used to deal with partially observable Markov decision problems [46], [47]. In [48], [49], the HMM was employed to describe the hidden variable in the RL framework, and the parameters were estimated effectively. In [50], a restricted Boltzmann machine with hidden variables was adopted to approximate the actions and states of RL, and the hidden variables were estimated by the state-action-reward-state-action (SARSA) algorithm. In [51], the HMM is used with RL to predict the zone in a 5G system; similar work can be found in [52]. However, the methods mentioned above apply only to single agents in the RL framework.
Extending the HMM to multi-agent optimal decision making is the most direct approach: the single random variable denoting the input or output of a single agent is promoted to random fields for the multi-agent system, yielding the hidden Markov random field (HMRF) [7].
The HMRF was proposed for image segmentation [7], [53], [54], in which the input image and the segmentation output are considered an observable random field and a hidden Markov random field, respectively. More recently, the HMRF has been extended to cone penetration tests [55], light-field stereo matching [56], and electronic medical record analysis [57], among others. For image segmentation, the best result can be regarded as the optimal decision of choosing labels for the pixels by unsupervised learning. Viewed from another perspective, the pixels make decisions regulated by the HMRF: the pixels can be regarded as a multi-agent system and the label set as its action space. Naturally, then, the HMRF can be used to infer the optimal decision of a multi-agent system.
Since a PGM can depict the relationships between the elements of a system, graph properties have recently been combined with other machine learning methods to address combinatorial generalization, which is considered a top priority for artificial intelligence [58]. The graph neural network [59], [60] is the most famous branch and is rising rapidly; it has been extended to RL [61] for robot control [62], Boolean satisfiability [63], and crowd navigation [64], among others. Although graph neural networks belong to a different category of methods from the one proposed in this paper, they share the same direction of describing the relationships between the elements of a system.
Motivated by the references above and aiming at the optimal actions of the HS windows for top-coal caving, this paper proposes an HMRF methodology based on probabilistic inference and RL to optimize multi-agent decisions.

III. HIDDEN MARKOV RANDOM FIELD FOR MULTI-AGENT OPTIMAL DECISION
In a multi-agent system, the states of all agents can be denoted by a random field; meanwhile, the action decision process of all agents satisfies a Markov random process, and all the actions can be considered a Markov random field [65]. The states and actions are the input and output of the multi-agent system, and these two random fields are the foundation for employing the HMRF [7] to obtain the optimal decision.

A. PRELIMINARIES
For a multi-agent system, let A = {a_1, a_2, ..., a_L} be the action space of an agent, where L is the number of actions for a single agent and a_i is the i-th action value. The state space is S = {s_1, s_2, ..., s_D}, where D is the number of states for a single agent and s_i is the i-th state value.
As a random field, the states of the multi-agent system are denoted by X = {X_1, X_2, ..., X_N}, where N is the number of agents and X_i ∈ S is a discrete random variable. A configuration of X is x = {x_1, x_2, ..., x_N}, and the set of all configurations is defined as χ = {x = (x_1, ..., x_N) | x_i ∈ S}. The Markov random field of the actions is denoted by Y = {Y_1, Y_2, ..., Y_N}, Y_i ∈ A; its configuration is y = {y_1, y_2, ..., y_N}, and the configuration set is γ = {y = (y_1, ..., y_N) | y_i ∈ A}. For the multi-agent action decision, given an action Markov random field Y, every state random variable X_i follows a known conditional probability distribution

$$p(x_i \mid y_i) = f(x_i;\,\theta_{y_i}), \tag{1}$$

where θ_{y_i} is the parameter associated with the action y_i. Note that X_i and Y_i are discrete random variables, hence θ_{y_i} should satisfy

$$\sum_{x_i \in S} f(x_i;\,\theta_{y_i}) = 1, \qquad 0 \le f(x_i;\,\theta_{y_i}) \le 1. \tag{2}$$

Suppose each state and action in x and y is conditionally independent; that is, for any configuration y ∈ γ, the conditional probability distribution of x given y is independent over the x_i. Therefore, the following formulation is yielded:

$$p(x \mid y) = \prod_{i=1}^{N} p(x_i \mid y_i). \tag{3}$$
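As a small numerical sketch of the conditional-independence factorization of Eq. (3), the emission table below is a hypothetical example (two actions, three states; it is not taken from the paper's experiments):

```python
import numpy as np

# Hypothetical emission table: rows are actions, columns are states,
# each row is a distribution f(x; theta_y) over states (sums to 1, Eq. (2)).
f = np.array([[0.7, 0.2, 0.1],   # p(x | y = a1)
              [0.1, 0.3, 0.6]])  # p(x | y = a2)

def joint_conditional(x, y, f):
    """p(x | y) = prod_i p(x_i | y_i) for state indices x and action indices y (Eq. (3))."""
    return float(np.prod([f[yi, xi] for xi, yi in zip(x, y)]))

x = [0, 2, 1]                       # states of three agents
y = [0, 1, 0]                       # actions of three agents
p = joint_conditional(x, y, f)      # 0.7 * 0.6 * 0.2
```

Each factor is a single table lookup, so the joint conditional probability of N agents costs only N multiplications under the independence assumption.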

B. HIDDEN MARKOV RANDOM FIELD FOR MULTI-AGENT SYSTEM
In a Markov random field, the neighborhood set of agent i is denoted by N_i, where N_i is a subset of the agent indices with i ∉ N_i. According to the Markov local characteristic [7], [65], the joint conditional probability distribution of the state and action of each agent is

$$p(x_i, y_i \mid y_{N_i}) = p(x_i \mid y_i)\, p(y_i \mid y_{N_i}). \tag{4}$$

The marginal probability distribution of x_i conditioned on the neighborhood N_i can be described as

$$p(x_i \mid y_{N_i}) = \sum_{y_i \in A} p(x_i, y_i \mid y_{N_i}). \tag{5}$$

By Eq. (4), and since every distribution p(x_i | y_i) has the same functional form shown in Eq. (1),

$$p(x_i, y_i \mid y_{N_i}) = f(x_i;\,\theta_{y_i})\, p(y_i \mid y_{N_i}). \tag{6}$$

By Eq. (6), Eq. (5) can be rewritten as the hidden Markov random field model [7]

$$p(x_i \mid y_{N_i}) = \sum_{y_i \in A} f(x_i;\,\theta_{y_i})\, p(y_i \mid y_{N_i}), \tag{7}$$

where θ_{y_i} is the parameter of the distribution function. If the random variable X_i follows a Gaussian distribution, f(x_i; θ_{y_i}) is formulated as

$$f(x_i;\,\theta_{y_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{y_i}} \exp\!\left(-\frac{(x_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}\right), \tag{8}$$

where

$$\theta_{y_i} = (\mu_{y_i}, \sigma_{y_i}), \tag{9}$$

and the HMRF can be rewritten as

$$p(x_i \mid y_{N_i}) = \sum_{y_i \in A} \frac{1}{\sqrt{2\pi}\,\sigma_{y_i}} \exp\!\left(-\frac{(x_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}\right) p(y_i \mid y_{N_i}). \tag{10}$$

Remark 1: In this paper, the state variable X and the action variable Y of the multi-agent system are discrete, hence the probability distribution function f(x_i; θ_{y_i}) should meet the condition of Eq. (2), i.e., 0 ≤ f(x_i; θ_{y_i}) ≤ 1.

Remark 2:
The form of f(x_i; θ_{y_i}) in Eq. (8) differs from a probability density function for a continuous random variable, for which f(·) > 1 can occur if the standard deviation σ is a very small constant. Hence, if Eq. (8) is used as the probability distribution function for a discrete random variable, σ_{y_i} > ρ must hold so that 0 ≤ f(x_i; θ_{y_i}) ≤ 1. The simulation experiments verify that this condition must hold: without the constraint σ_{y_i} > ρ the results are chaotic, while when the condition is obeyed the results are reasonable.
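The mixture of Eq. (7) with the Gaussian emission of Eq. (8) can be sketched numerically; the parameter values and the neighborhood prior below are hypothetical, not taken from the experiments:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian emission f(x; theta_y) of Eq. (8)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def hmrf_marginal(x_i, prior, theta):
    """p(x_i | y_Ni) = sum over actions of f(x_i; theta_y) * p(y | y_Ni), Eq. (7)/(10)."""
    return sum(gaussian(x_i, *theta[y]) * p for y, p in prior.items())

theta = {0: (0.0, 1.0), 1: (3.0, 1.0)}   # hypothetical (mu, sigma) per action
prior = {0: 0.4, 1: 0.6}                 # hypothetical neighborhood prior p(y | y_Ni)
m = hmrf_marginal(1.0, prior, theta)
```

Note that both sigmas are 1.0, which respects the constraint σ > ρ of Remark 2, so each emission value stays in [0, 1].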

C. MAXIMUM A POSTERIORI WITH HMRF FOR OPTIMAL DECISION
Based on the HMRF, the multi-agent optimal decision can be transformed into a MAP problem. Given the observable state random field X = {X_i}, X_i ∈ S, the aim is to find the most probable hidden actions. With the configurations x = {x_1, ..., x_N} and y = {y_1, ..., y_N}, the multi-agent optimal decision is obtained by MAP:

$$y^* = \arg\max_{y \in \gamma} p(y \mid x). \tag{11}$$

By the conditional independence of Eq. (3), the common distribution form of p(x_i | y_i) in Eq. (1), and the independence of the elements of the random variable Y = {Y_i}, Y_i ∈ A, the optimal action decision y* is determined by

$$y^* = \arg\max_{y} \prod_{i=1}^{N} p(x_i \mid y_i)\, p(y_i), \tag{12}$$

where p(y_i) is the prior probability and p(x_i | y_i) is an adjustment term that improves the output precision over p(y_i) alone. In the HMRF, by Eq. (4) and Eq. (1), the optimal decision of Eq. (11) can be formulated as

$$y^* = \arg\max_{y} \prod_{i=1}^{N} f(x_i;\,\theta_{y_i})\, p(y_i \mid y_{N_i}). \tag{13}$$
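Because the objective factorizes over agents, the MAP of Eq. (12) reduces to a per-agent argmax. A minimal sketch, assuming hypothetical Gaussian parameters and a uniform prior (not the paper's trained values):

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian emission f(x; theta_y) of Eq. (8)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def map_decision(x, prior, theta):
    """Per-agent MAP of Eq. (12): y_i* = argmax_y f(x_i; theta_y) * p(y)."""
    actions = list(theta)
    return [max(actions, key=lambda y: gaussian(xi, *theta[y]) * prior[y]) for xi in x]

theta = {0: (0.0, 1.0), 1: (4.0, 1.0)}   # hypothetical (mu, sigma) per action
prior = {0: 0.5, 1: 0.5}                 # hypothetical uniform prior
y_star = map_decision([0.2, 3.8, 1.9], prior, theta)   # -> [0, 1, 0]
```

With a uniform prior and equal sigmas, each agent simply picks the action whose mean is closest to its observed state; a non-uniform prior (Section IV derives one from Q-learning) shifts that choice.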

D. PARAMETER ESTIMATION FOR OPTIMAL DECISION VIA EXPECTATION-MAXIMIZATION ALGORITHM
Parameter estimation is one of the keys to probabilistic inference with a PGM. However, the non-stationary dynamic process of the multi-agent system makes it hard to obtain ground truth for the training data; hence, in the HMRF, the actions in the training data are treated as latent variables. The EM algorithm [8] is a well-known method for estimating the parameters of a model with latent variables. In the multi-agent system, the state of each agent X = {X_1, ..., X_N} is a known discrete random variable, while the action Y = {Y_1, ..., Y_N} is the latent ground truth. The aim is to estimate θ = {θ_{y_i}, y_i ∈ A} in Eq. (11) and then obtain the optimal actions.
The joint probability distribution of x with the parameter θ is p(x; θ), and the log-likelihood is

$$L_1(\theta) = \sum_{i=1}^{N} \ln p(x_i;\theta) = \sum_{i=1}^{N} \ln \sum_{y_i \in A} p(x_i, y_i;\theta). \tag{14}$$

Suppose Φ_i is a probability distribution over y_i, i.e.,

$$\sum_{y_i \in A} \Phi_i(y_i) = 1, \qquad \Phi_i(y_i) \ge 0. \tag{15}$$

By Jensen's inequality [8] and Eq. (14), a lower bound of the likelihood L_1(θ) is

$$L_1(\theta) \ge \sum_{i=1}^{N} \sum_{y_i \in A} \Phi_i(y_i) \ln \frac{p(x_i, y_i;\theta)}{\Phi_i(y_i)}. \tag{16}$$

Since Φ_i(y_i) is a probability distribution, the right-hand side of inequality (16) reaches the greatest lower bound when

$$\frac{p(x_i, y_i;\theta)}{\Phi_i(y_i)} = C, \tag{17}$$

where C is a constant. By Eq. (17),

$$\Phi_i(y_i) = \frac{1}{C}\, p(x_i, y_i;\theta). \tag{18}$$

Substituting the result of Eq. (18) into the normalization condition of Eq. (15) gives

$$C = \sum_{y_i \in A} p(x_i, y_i;\theta), \tag{19}$$

so the probability distribution Φ_i can be obtained:

$$\Phi_i(y_i) = \frac{p(x_i, y_i;\theta)}{\sum_{y_i \in A} p(x_i, y_i;\theta)} = p(y_i \mid x_i;\theta). \tag{20}$$

That is the E-step of the EM algorithm. The next step is to find the maximum of L_1(θ) by

$$\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{y_i \in A} \Phi_i(y_i) \ln p(x_i, y_i;\theta). \tag{21}$$

The updated θ is then substituted into Eq. (20) to recalculate Φ_i(y_i), and the iteration repeats until convergence.
The key to Eq. (11) is estimating the parameter θ = {θ_{y_i}, y_i ∈ A} of the HMRF by the EM algorithm. Since the latent variable is y, by Eq. (20) and Eq. (6), the E-step becomes

$$\Phi_i(y_i) = \frac{f(x_i;\,\theta_{y_i})\, p(y_i \mid y_{N_i})}{\sum_{y_i \in A} f(x_i;\,\theta_{y_i})\, p(y_i \mid y_{N_i})}, \tag{22}$$

where f(x_i; θ_{y_i}) is formulated as in Eq. (8). In the M-step, θ is derived by substituting Eq. (8) into Eq. (21) and setting the derivatives with respect to μ_y and σ_y to zero; for each action value y ∈ A,

$$\mu_y = \frac{\sum_{i=1}^{N} \Phi_i(y)\, x_i}{\sum_{i=1}^{N} \Phi_i(y)}, \tag{27}$$

and, in the same way,

$$\sigma_y^2 = \frac{\sum_{i=1}^{N} \Phi_i(y)\, (x_i - \mu_y)^2}{\sum_{i=1}^{N} \Phi_i(y)}. \tag{28}$$
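One EM sweep of this section can be sketched with NumPy; the per-agent prior array and the toy data below are hypothetical illustrations, not the paper's trained values:

```python
import numpy as np

def em_step(x, prior, mu, sigma):
    """One EM sweep of Section III-D.
    E-step (Eq. (22)): phi[i, y] proportional to f(x_i; theta_y) * p(y | y_Ni).
    M-step (Eq. (27)/(28)): Phi-weighted mean and standard deviation per action.
    x: (N,) observed states; prior: (N, L) per-agent prior over the L actions;
    mu, sigma: (L,) current Gaussian parameters."""
    x = np.asarray(x, float)
    lik = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    phi = lik * prior
    phi /= phi.sum(axis=1, keepdims=True)              # E-step, Eq. (22)
    w = phi.sum(axis=0)
    mu_new = (phi * x[:, None]).sum(axis=0) / w        # Eq. (27)
    var = (phi * (x[:, None] - mu_new) ** 2).sum(axis=0) / w
    sigma_new = np.sqrt(var)                           # Eq. (28)
    return phi, mu_new, sigma_new

x = [0.1, 0.0, 3.9, 4.1]                # hypothetical states of four agents
prior = np.full((4, 2), 0.5)            # hypothetical uniform neighborhood prior
phi, mu, sigma = em_step(x, prior, mu=np.array([0.0, 4.0]), sigma=np.array([1.0, 1.0]))
```

In a full run, the updated (mu, sigma) would be fed back into the E-step until the convergence condition of Remark 4 is met.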

IV. HMRF WITH Q-LEARNING ALGORITHM FOR MULTI-AGENT SYSTEM OPTIMAL DECISION
The Q-learning algorithm [10] is an effective off-policy RL method. Its core is learning the Q matrix, which represents the relationship between actions and states. If the action decision is represented probabilistically, the output of the RL system can be treated as the prior probability for the MAP.

A. PRIOR PROBABILITY ESTIMATION BY Q-LEARNING ALGORITHM
Consider the multi-agent system in the MRF, where the state space and action space are the same for every agent. Let the action and state configurations of the agents be y = {y_1, y_2, ..., y_N}, y_i ∈ A, and x = {x_1, x_2, ..., x_N}, x_i ∈ S, and let the rewards of the agents be {R_1, R_2, ..., R_N} in one step of the Markov process. We define the action-value function Q(x̄, ȳ), x̄ ∈ S, ȳ ∈ A. Each agent acts with the same Q-value function under its own state. The learning mechanism of the Q matrix is

$$Q_{t+1}(\bar{x}_t, \bar{y}_t) = Q_t(\bar{x}_t, \bar{y}_t) + \alpha \left[ R_{t+1} + \beta \max_{\bar{y}} Q_t(\bar{x}_{t+1}, \bar{y}) - Q_t(\bar{x}_t, \bar{y}_t) \right], \tag{29}$$

where t denotes the learning step, β > 0 is the discount rate, and α > 0 is the learning step size.
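The tabular update of Eq. (29) can be sketched as follows; the array is indexed here as (state, action), and the state/action indices and reward in the usage line are hypothetical:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, beta):
    """One tabular Q-learning step of Eq. (29), with the Q matrix shared by all agents.
    Q is indexed as Q[state, action]; alpha is the step size, beta the discount rate."""
    td = r + beta * Q[s_next].max() - Q[s, a]   # temporal-difference error
    Q[s, a] += alpha * td
    return Q

Q = np.zeros((12, 2))   # 12 states x 2 actions, as in the top-coal experiment
Q = q_update(Q, s=3, a=0, r=1.0, s_next=4, alpha=0.2, beta=0.1)   # Q[3, 0] becomes 0.2
```

Starting from a zero matrix, the first update simply deposits alpha times the reward into the visited (state, action) cell.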
Since the Q-value function of every agent is the same, the dimension of the Q matrix is L × D, where L and D are the dimensions of the action space and state space, respectively.
The agents learn the same Q matrix independently while interacting with the environment through rewards and actions. Then, the softmax of the Q matrix can be considered the prior knowledge of the policy decision for the multi-agent system in the HMRF; hence, the prior probability p(y_i | N_i) in Eq. (13) can be written as

$$p(y_i \mid N_i) = \frac{1}{Z} \exp\!\left(-U(y_i \mid N_i)\right), \tag{31}$$

where U(y_i | N_i) is the energy function and Z is the normalization constant, set as

$$U(y_i \mid N_i) = -Q(\bar{x}_i, y_i), \qquad Z = \sum_{\bar{y} \in A} \exp\!\left(Q(\bar{x}_i, \bar{y})\right). \tag{32}$$

According to Eq. (8), (13), and (31), the optimal actions of the multi-agent system can be inferred by MAP based on the HMRF as

$$y^* = \arg\max_{y} \prod_{i=1}^{N} f(x_i;\,\theta_{y_i})\, \frac{1}{Z} \exp\!\left(-U(y_i \mid N_i)\right). \tag{33}$$

Taking the negative logarithm, the contribution of the prior is ℓ_1(y_i) = U(y_i | N_i), and the negative log-likelihood of f(x_i; θ_{y_i}) is

$$\ell_2(y_i) = \frac{(x_i - \mu_{y_i})^2}{2\sigma_{y_i}^2} + \ln \sigma_{y_i}, \tag{34}$$

where θ_{y_i} = (μ_{y_i}, σ_{y_i}). The optimal decision of y is then obtained as

$$y^* = \arg\min_{y} \sum_{i=1}^{N} \left[ \ell_1(y_i) + \ell_2(y_i) \right]. \tag{35}$$

It should be noted that the state of the multi-agent system is the parameter that describes the agents in the environment. For the algorithm of this paper, the state parameters may be continuous or discrete, but the action of each agent must be discrete and describable in probabilistic form. The reward model is one of the keys to an RL system, and a perfect model is hard to obtain. In our methodology, the Q value is trained to serve as the prior probability, so the reward depends on the application; in the top-coal caving of this paper, the reward depends on the capture of each agent, and the model is given in Eq. (37).
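Because U(y) = −Q(x̄, y) in Eq. (32), the Gibbs prior of Eq. (31) is exactly a softmax over the Q-row of the current state. A minimal sketch with a hypothetical Q matrix:

```python
import numpy as np

def q_softmax_prior(Q, s):
    """Prior p(y | N_i) of Eq. (31)/(32) from the learned Q-row of state s:
    U(y) = -Q(s, y), Z = sum_y exp(Q(s, y)) -- i.e., a softmax over actions."""
    q = Q[s] - Q[s].max()   # shift by the max for numerical stability (Z is unchanged)
    e = np.exp(q)
    return e / e.sum()

Q = np.array([[1.0, 0.0],    # hypothetical trained Q values
              [0.0, 2.0]])
p = q_softmax_prior(Q, 0)    # softmax of [1, 0]: higher mass on action 0
```

Subtracting the row maximum before exponentiating leaves the distribution unchanged while avoiding overflow for large Q values.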

B. ALGORITHMS INTEGRATION
The methodology proposed in this paper includes four blocks: HMRF, EM, Q-learning, and MAP. The HMRF provides the model for the optimal decision; the EM algorithm estimates the parameters of the HMRF; Q-learning produces the prior probability for the MAP; and the MAP computes the optimal action for each agent. The architecture is shown in Fig. 2.
In Fig. 2(a), the input of the multi-agent system is the observable states x = {x_1, ..., x_N}, and the output is the optimal decision y* of Eq. (35). The HMRF model of Eq. (8) provides the model for the MAP, and the prior probability is produced by the probability form of the Q value in Eq. (32). In Fig. 2(b), iterations of the E-step, Eq. (22), and the M-step, Eq. (27) and Eq. (28), are used to estimate the parameters of the HMRF. The methodology is implemented in two stages, shown in Algorithm 1 and Algorithm 2, respectively.
In Algorithm 1, Q(x̄, ȳ) is shared by all agents and is trained for a single agent with the Q-learning algorithm [10].

Algorithm 1 Q-Value Training by a Single Agent
1  Initialize Q(x̄, ȳ), x̄ ∈ S, ȳ ∈ A, the learning step size α, and the discount factor 0 < β < 1;
2  Set the agent number N, the maximum number of episodes N_e, the maximum steps per episode N_t, and the terminal state x̄_E ∈ S;
3  for j ← 1 to N_e do
4      Initialize the states;
5      for t ← 1 to N_t do
6          Get the states x_t, and choose the action y_t of each agent based on x_t by the ε-greedy algorithm [10];
7          Take action y_t, then get the reward R_{t+1} and the states x_{t+1};
8          Update Q(x̄, ȳ) by Eq. (29), and break if the terminal state x̄_E is reached;
9      end
10 end

Algorithm 2 Optimal Decision by HMRF (per decision step)
11 Compute the prior p(y_i | N_i) from the trained Q matrix by Eq. (31);
12 Estimate θ by iterating the E-step, Eq. (22), and the M-step, Eq. (27) and Eq. (28), until the EM iteration converges (Remark 4);
13 Output f(x_i; θ_{y_i}) by Eq. (8);
14 Output the action y* by Eq. (35);
15 Update the states x;
16 end

The output of Algorithm 1, in probability form, is used as the prior probability p(y_i | N_i) for the HMRF.
Algorithm 2 obtains the optimal action decision for the multi-agent system based on the prior probability p(y_i | N_i). The environment changes while the multi-agent actions are being executed, so the states of the multi-agent system vary, and the optimal decisions must be updated as the environment changes. Hence, the optimal decision in Algorithm 2 is a transient result.
Remark 3: Generally, in Algorithm 2, the optimal decision is calculated from Eq. (13) by an iterative method such as iterated conditional modes [7]. In this paper, the prior probability of the HMRF is obtained by RL, hence Eq. (13) can be evaluated directly; equivalently, the iteration number is 1.
Remark 4: In Algorithm 2, the condition for terminating the EM iteration is that μ_{y_i}, σ_{y_i} approach convergence, i.e., ℓ_2(y_i) in Eq. (35) becomes constant. In the implementation, the condition is often set as

$$\left| T_{\varphi}^{(m)} - T_{\varphi}^{(m-1)} \right| < \zeta, \tag{36}$$

where T_φ = Σ_i ℓ_2(y_i), m is the counter of the EM iteration, and ζ is a small constant.
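The stopping rule of Remark 4 amounts to a one-line check on successive values of T_φ; a minimal sketch with the paper's default threshold:

```python
def em_converged(t_phi_curr, t_phi_prev, zeta=1e-4):
    """Stopping rule of Remark 4: terminate the EM iteration when the
    convergence index T_phi changes by less than zeta between sweeps."""
    return abs(t_phi_curr - t_phi_prev) < zeta
```

In Algorithm 2 this check would be evaluated after each M-step, with T_φ recomputed as the sum of the ℓ_2(y_i) terms under the updated parameters.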

V. EXPERIMENT AND RESULT ANALYSIS

A. TOP-COAL CAVING EXPERIMENT

1) TOP-COAL CAVING SIMULATION PLATFORM
In order to validate the proposed methodology, we developed a simulation platform in Matlab based on an open-source two-dimensional discrete element model (DEM) named dice [66]. We set three kinds of particles in the DEM to denote the coal, the rock of the immediate roof, and the rock of the main roof, respectively, and five windows to demonstrate the dynamic process of top-coal caving under the decisions of the proposed methodology. The main output is shown in Fig. 3. In the coal-particle area there are 60 rock particles at random locations; in the rock area there are 10 coal particles at random locations; and on both sides of the boundary between coal and rock there are 20 particles of the opposite type at random locations.

2) MODEL OF HMRF AND LEARNING PROCESS
The action space of an HS is A = {a_1, a_2}, where a_1 and a_2 denote the window being open and closed, respectively. In this paper, the state of an HS is the ratio of coal among the particles near the window, quantized into 12 levels S = {s_1, ..., s_12} corresponding to the ratio values (or intervals) 0, (0, 0.1], (0.1, 0.2], ..., (0.9, 1.0), 1.0. Hence, the dimension of the Q matrix is 2 × 12. Each window is considered an agent, so the state random variables of the multi-agent system are X = {X_1, ..., X_5} and the hidden random variables are Y = {Y_1, ..., Y_5}. The conditional probability distribution is Eq. (8) with the constraint σ_{y_i} > ρ. Since y_i ∈ A = {a_1, a_2}, the parameters of the HMRF can be denoted θ = {μ_{a_1}, σ_{a_1}; μ_{a_2}, σ_{a_2}}. Based on the HMRF model, the learning process of the experiment is organized as follows.
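The 12-level quantization of the coal ratio can be sketched as a small helper; the 0-based index convention is an assumption of this sketch, not stated in the paper:

```python
import math

def coal_ratio_state(ratio):
    """Map the coal ratio near a window to one of the 12 states of Section V-A2:
    s1 <-> ratio 0; s2..s11 <-> the intervals (0, 0.1], ..., (0.9, 1.0); s12 <-> ratio 1.
    Returns a 0-based state index (0 for s1, ..., 11 for s12)."""
    if ratio <= 0.0:
        return 0
    if ratio >= 1.0:
        return 11
    # interval (k/10, (k+1)/10] maps to index k + 1; min() guards float round-up
    return min(10, int(math.ceil(ratio * 10)))
```

This index then selects the row of the 12-state Q matrix used in the next subsection.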

3) LEARNING THE Q MATRIX BY Q-LEARNING ALGORITHM
In this process, the reward is calculated by

$$R = n_r r_r + n_c r_c, \tag{37}$$

where n_r and n_c are the numbers of rock and coal particles obtained by caving, and r_r and r_c are the rewards for getting one rock particle and one coal particle, respectively. In the experiment, r_r = −3, r_c = 1, and the parameters of the Q-learning algorithm are set as α = 0.2, β = 0.1. After training, the output of the Q matrix in probability form is shown in Tab. 1.
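The reward of Eq. (37) is a simple weighted count; a sketch using the paper's r_r = −3 and r_c = 1 as defaults (the particle counts in the usage line are hypothetical):

```python
def caving_reward(n_rock, n_coal, r_rock=-3.0, r_coal=1.0):
    """Reward of Eq. (37): R = n_r * r_r + n_c * r_c.
    Defaults use the paper's experimental values r_r = -3, r_c = 1."""
    return n_rock * r_rock + n_coal * r_coal

R = caving_reward(n_rock=2, n_coal=10)   # 2 * (-3) + 10 * 1 = 4
```

The strongly negative rock reward is what pushes the learned policy to close the window once the coal ratio near it drops.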

4) LEARNING THE HMRF MODEL
In this process, the HMRF model is trained following Algorithm 2. The parameters are initialized as μ_{a_1} = μ_{a_2} = 0, σ_{a_1} = σ_{a_2} = 1; the constraint is set as ρ = 1; the maximum number of EM iterations is 10; and the EM termination threshold is ζ = 0.0001.

5) EXPERIMENT RESULT ANALYSIS
Simulations are carried out to compare the effectiveness of the proposed methodology with the RL method [10] and with reinforcement learning with equivalent action [16], in which the communication between the current agent and its neighbors is depicted by an equivalent action based on mean-field theory. In the following analysis, the three compared methods are denoted RL, MF, and HMRF, respectively. The main performance indexes are the coal recall (CR) and the ratio of rock (RR) in all mined material:

$$CR = \frac{n_c}{n_{c\_total}}, \qquad RR = \frac{n_r}{n_c + n_r}, \tag{38}$$

where n_c and n_r are the numbers of coal and rock particles obtained by caving, and n_{c_total} is the number of all coal particles.
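The two indexes of Eq. (38) can be sketched directly; the particle counts in the usage lines are hypothetical, chosen only to echo the order of magnitude of the reported results:

```python
def coal_recall(n_coal, n_coal_total):
    """CR of Eq. (38): fraction of all coal particles that were captured."""
    return n_coal / n_coal_total

def rock_ratio(n_rock, n_coal):
    """RR of Eq. (38): fraction of rock among everything captured."""
    return n_rock / (n_coal + n_rock)

cr = coal_recall(919, 1000)   # hypothetical counts -> 0.919
rr = rock_ratio(153, 847)     # hypothetical counts -> 0.153
```

Note the different denominators: CR is normalized by all coal available, while RR is normalized by the total captured material.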
Since the locations of some particles are randomly assigned, 10 tests with different random locations were carried out to validate the effectiveness. The dynamic process of T_φ, which indicates convergence during training, is shown in Fig. 4; the convergence index T_φ is the sum of the ℓ_2(y_i) terms of Eq. (35). It should be noted that the environment is non-stationary while the windows are being regulated: whenever the environment changes, the convergence index jumps and then converges back to a minimum. In the middle of the process, the states of the windows change frequently as they approach the boundary, so the dynamics of the convergence index are complex; at the end of the process, the coal has been exploited and the environment is more stationary, so the convergence index approaches a steady value.
The performance indexes CR and RR of the experimental results are shown in Tab. 2. According to Tab. 2 and Fig. 5, it is obvious that the CR is increased prominently by the HMRF method at a small price in RR: the average CR rises from 0.5693 to 0.9188, a growth of 0.3495, while the RR is pushed up only from 0.0605 to 0.1530, a growth of 0.0925. Meanwhile, the HMRF obtains a higher reward than MF. The intuitive advantage can be seen in Fig. 6: the top-coal is exploited more fully by the HMRF method.
Tab. 1 clearly shows that for state s_9 and below, the probability of opening under RL regulation is less than that of closing. Fig. 7 shows the action record of one window: Fig. 7(a), (b), and (c) display the entire action process regulated by RL, MF, and HMRF, respectively. The actions of the three methods differ: since RL does not consider the neighbors of each agent, it cannot obtain a high CR and the top-coal is not exploited completely, whereas the actions of MF and HMRF are based on the neighbors, which brings the high CR.

B. PREDATOR-PREYS EXPERIMENT 1) EXPERIMENT ENVIRONMENT
This experiment is carried out in the multi-agent competitive environment [67] based on OpenAI's reinforcement learning toolkit Gym [68]. There are two preys and a predator in a restricted area: the predator ''eats'' the preys, and the preys try to avoid being eaten.

FIGURE 7. Action record of a window for top-coal caving. The action value is 1 or 2, meaning the window is open or closed. In the RL regulating process, the window is closed for state s_9 and below and opened otherwise; in the MF and HMRF regulating processes, the action dynamics are more complex, and the window action depends not only on the static state but also on the actions of the neighbors.

To validate the algorithm of this paper, the three competitive agents are first trained, so that the predator learns its action policy and the two preys can act collaboratively with the different algorithms. Then the preys act as the multi-agent system to test the performance of the different algorithms.
The reward is defined as follows. (1) Collision reward: the predator receives a positive reward of 10 if it collides with a prey; meanwhile, the prey is punished with a negative reward of −10. (2) Outside punishment: if an agent runs out of the area, it is punished in proportion to the distance between the agent's location and the area boundary.
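The two reward terms above can be sketched directly. This is a minimal illustration, not the environment's actual code: the boundary half-width and the exact shape of the outside punishment are assumptions (here, the Euclidean distance from the agent to the nearest point of the boundary).

```python
import math

AREA_HALF_WIDTH = 1.0  # hypothetical half-width of the restricted square area

def collision_reward(is_predator, collided):
    # +10 for the predator on a catch, -10 for the caught prey
    if not collided:
        return 0.0
    return 10.0 if is_predator else -10.0

def outside_punishment(x, y):
    # Penalty proportional to how far outside the boundary the agent sits
    dx = max(abs(x) - AREA_HALF_WIDTH, 0.0)
    dy = max(abs(y) - AREA_HALF_WIDTH, 0.0)
    return -math.hypot(dx, dy)

def reward(is_predator, collided, x, y):
    return collision_reward(is_predator, collided) + outside_punishment(x, y)
```

Note the asymmetry: collision events are zero-sum between predator and prey, while the outside punishment is private to each agent, which is what pushes the preys to evade without simply fleeing the arena.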
There are 12 state parameters for each agent: its own position (2 states) and velocity (2 states), and the relative positions (4 states) and velocities (4 states) of the other agents. In this paper, the two preys form the cooperative multi-agent system and try to avoid being eaten. For the method of this paper, the state of each prey is the relative position of the predator, and the HMRF model is formalized with the parameters θ_{y_i} = {σ^x_{y_i}, σ^y_{y_i}, µ^x_{y_i}, µ^y_{y_i}}, where x^x_i and x^y_i are the relative positions of the predator along the x-axis and y-axis. The action space is A = {0, 1, 2, 3} with y_i ∈ A, and these values denote the action directions {right, left, up, down}. According to Eq. (9), the HMRF model of a prey is formulated accordingly, and the optimal action y* = arg min (·) is obtained by the same derivation as in Section IV. Following the analysis of Section III, the parameter updates are the same as Eq. (27) and Eq. (28).
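The MAP decision described above can be sketched as a scoring loop over the four actions. This is a hedged reconstruction, not the paper's exact objective: it assumes an independent Gaussian emission per axis with the parameters θ_{y_i}, the Q-learning log-prior, and a simple pairwise potential rewarding agreement with the neighbor prey's action; the coupling weight `beta` and the function names are assumptions.

```python
import math

ACTIONS = [0, 1, 2, 3]  # right, left, up, down

def log_gaussian(x, mu, sigma):
    # Log-density of a univariate Gaussian N(mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def map_action(rel_x, rel_y, theta, log_prior, neighbor_action, beta=1.0):
    """Pick the action maximizing: emission log-likelihood of the observed
    relative predator position + Q-learning log-prior + a pairwise term
    favoring agreement with the neighbor. theta[a] = (mu_x, sigma_x, mu_y, sigma_y)."""
    best, best_score = None, -math.inf
    for a in ACTIONS:
        mu_x, sx, mu_y, sy = theta[a]
        score = (log_gaussian(rel_x, mu_x, sx)
                 + log_gaussian(rel_y, mu_y, sy)
                 + log_prior[a]
                 + (beta if a == neighbor_action else 0.0))
        if score > best_score:
            best, best_score = a, score
    return best
```

With a uniform prior and an observation lying near the mean of one action's emission model, the loop returns that action; the neighbor term only tips the balance when the emission and prior scores are close.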

2) EXPERIMENT AND ANALYSIS
In the experiment, four typical reinforcement learning algorithms, Q-learning, deep Q-network, deep policy gradient, and actor-critic, are carried out to compare the optimal decision performance. In the following analysis, these algorithms and our method are denoted by RL, DQN, DPG, AC, and HMRF, respectively.
The code of RL and DQN is based on reference [67]. In the RL method, a neural network is used to approximate the Q-value; the DQN applies the same network architecture for the Q-value while adopting a fixed target network to update the parameters. The core of DPG is the REINFORCE algorithm [69], and the actor-critic implementation is based on reference [10].
At first, RL, DQN, and HMRF are trained for 20000 episodes, while DPG and AC take 40000 episodes of training. In the HMRF, the prior probability comes from the training result of RL. After training, the predator agent is fixed and the HMRF algorithm is applied to the prey agents. Then, 100 tests are carried out to compare the reward of each agent, in which the initial location of each agent is random. The experiment result is shown in Fig. 8.
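The test protocol amounts to a fixed-policy evaluation loop. The sketch below is generic, assuming a simple env interface (`env_reset`/`env_step` names are assumptions, not the Gym API): run each of the 100 episodes from a random initial state and average the episodic reward.

```python
def evaluate(policy_fn, env_step, env_reset, n_tests=100, horizon=50):
    """Average episodic reward over n_tests episodes with random initial
    locations, mirroring the paper's test protocol (interface names assumed)."""
    totals = []
    for _ in range(n_tests):
        state = env_reset()          # random initial location
        total = 0.0
        for _ in range(horizon):
            action = policy_fn(state)
            state, r, done = env_step(state, action)
            total += r
            if done:
                break
        totals.append(total)
    return sum(totals) / len(totals)
```

Because the predator is frozen after training, differences in this average across RL, DQN, DPG, AC, and HMRF preys isolate the effect of the decision method rather than of the opponent.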
As shown in Fig. 8, the two preys with the HMRF method improve their reward in most of the 100 tests, and the predator's reward decreases in the corresponding episodes. Because the initial location of each agent is random, in several episodes the preys with the HMRF method are punished more than with the other two algorithms.
Because the environment is non-stationary, the HMRF model trains its parameters at every step. As can be seen in Fig. 9 and Fig. 10, the values of µ_i and σ_i follow a certain distribution.
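Per-step adaptation of the emission parameters can be illustrated with a running Gaussian estimate. This is a deliberate simplification of the EM updates of Eq. (27) and Eq. (28), using an exponential moving average so the model tracks a drifting environment; the learning rate is an assumed hyper-parameter.

```python
class RunningGaussian:
    """Exponential moving estimate of (mu, sigma) updated at every step,
    a simplified stand-in for the per-step EM parameter update."""
    def __init__(self, mu=0.0, var=1.0, lr=0.05):
        self.mu, self.var, self.lr = mu, var, lr

    def update(self, x):
        # Move the mean toward the new observation, then the variance
        # toward the squared residual around the updated mean.
        self.mu += self.lr * (x - self.mu)
        self.var += self.lr * ((x - self.mu) ** 2 - self.var)

    @property
    def sigma(self):
        return self.var ** 0.5
```

When the observed relative positions drift, such estimates converge toward the new regime at a rate set by `lr`, which is consistent with the parameter trajectories settling into a distribution as in Fig. 9 and Fig. 10.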
Since the HMRF regulates the actions of the multi-agent system based on the RL, the Q-value is first trained as the prior probability for the HMRF. Fig. 11 shows the actions of the two methods. Since the agents communicate with each other through the HMRF model, the two preys obtain more reward than with RL.

VI. CONCLUSION
This paper proposes a new method based on the HMRF for the optimal decision of a multi-agent system in top-coal caving, in which each agent is considered as a node of a probabilistic graphical model and the decision of each agent is inferred by the MAP with the EM algorithm estimating the parameters. Top-coal caving simulation experiments were carried out, and the case study shows that the average ratio of top-coal yield, denoted by coal-recall, is 0.57 and 0.62 when employing the RL and MF methods, while it is 0.92 when adopting the proposed method. It also shows that the average ratios of rock in all mining are 0.06, 0.11, and 0.15 when applying the RL, MF, and HMRF algorithms, respectively. Furthermore, the predator-preys experiment validates the effectiveness of this method, in which the typical reinforcement learning algorithms RL, DQN, DPG, and AC are adopted as comparison methods for the HMRF; the experiment results show that the communication between agents through the HMRF can improve the reward of each agent.
To the best of our knowledge, this is the first time the HMRF has been employed for the optimal decision of multi-agents in such complex environments. To extend and improve this work, future work includes: (1) The training process of the HMRF is divided into two stages: obtaining the prior probability by Q-learning and learning the parameters of the HMRF model by the EM algorithm. Hence, further study is needed to train the prior probability and the parameters simultaneously.
(2) The Q-value in the probability form of Eq. (31) is similar to the Gibbs distribution. In fact, the output actions of the Markov random field of the multi-agent system are equivalent to a Gibbs distribution [70]. Hence, incorporating the Hammersley-Clifford theorem into the HMRF would be very beneficial for improving the current approach.
(3) In the predator-preys experiment, there are 12 states for the Q-learning algorithm, while just 2 states are selected for the HMRF, since it is complex to formulate the Markov random field model for a system with multiple state variables. Hence, how to establish the Markov random field for multiple state variables in a simple way needs further study.