A Distributed Multi-Agent Dynamic Area Coverage Algorithm Based on Reinforcement Learning

Dynamic area coverage is widely used in military and civil fields. Improving coverage efficiency is an important research direction for multi-agent dynamic area coverage. In this paper, we focus on the non-optimal coverage problem of free dynamic area coverage algorithms. We propose a distributed dynamic area coverage algorithm based on reinforcement learning and a γ-information map. The γ-information map can transform the continuous dynamic coverage process into a discrete γ point traversal process, while ensuring no-hole coverage. When agent communication covers the whole target area, agents can obtain the global optimal coverage strategy by learning the whole dynamic coverage process. In the event that communication does not cover the whole target area, agents can obtain a local optimal coverage strategy; in addition, agents can use the proposed algorithm to obtain a global optimal coverage path through off-line planning. Simulation results demonstrate that the required time for area coverage with the proposed algorithm is close to the optimal value, and the performance of the proposed algorithm is significantly better than that of the distributed anti-flocking algorithm for dynamic area coverage.


I. INTRODUCTION
Dynamic area coverage has been widely used in target detection [1], monitoring [2] and searching [3], [4]. In the process of dynamic area coverage, agents are equipped with sensors to establish a mobile sensor network (MSN), and the agents can be controlled to cover the target area with a dynamic coverage algorithm [5]-[7]. Compared with the traditional static area coverage method, where sensor nodes cannot be rearranged easily once deployed [8], [9], the dynamic area coverage method is more flexible. Moreover, when the target area is large, dynamic area coverage requires fewer sensors than static area coverage.
For autonomous agents in multi-agent systems (MAS), we classify dynamic coverage control algorithms for multi-agents into two categories: 1) non-self-organizing control algorithms, and 2) self-organizing control algorithms.
The associate editor coordinating the review of this manuscript and approving it for publication was Yichuan Jiang.
The non-self-organizing control algorithm for dynamic coverage reduces repeated coverage of the area through constraint relationships between the agents to achieve efficient dynamic coverage of the target area [10]-[12]. However, this non-self-organizing method is not robust: an agent that stops working because of an emergency during the coverage process leaves part of the area uncovered [13]. The self-organizing control algorithm gives agents great autonomy, robustness and flexibility [7], [14]-[16], and solves the problem where the target area cannot be covered completely in emergencies. In this paper, we focus on the self-organizing control algorithm.
The anti-flocking control algorithm is a classic self-organizing algorithm for dynamic area coverage. The concept of anti-flocking control was first introduced in [17]. Ganganath et al. proposed a distributed anti-flocking algorithm for dynamic area coverage based on the flocking algorithm proposed in [18] and an information map [19], and improved the coverage efficiency of their algorithm using a territorial-marking-inspired information map in [13]. However, when calculating the target point with the anti-flocking control algorithm proposed in [13], agents consider only the current coverage status and ignore the entire coverage process. Thus, the obtained target position may not be the globally optimal point for the coverage process. Because the target point calculated by the anti-flocking control algorithm is non-optimal, the agent covers some areas repeatedly, which reduces the coverage efficiency and increases the coverage time. Without doubt, other self-organizing control algorithms for dynamic area coverage can also produce non-optimal solutions. Therefore, a feasible way to improve dynamic coverage efficiency is to select the best target position using a method that considers the whole coverage process. At present, there exist some process optimization methods, including ant colony optimization (ACO) [20], particle swarm optimization (PSO) [21] and model predictive control (MPC) [22]. These methods can optimize the motion path of each agent in the MAS. However, they cannot take the motion state of an agent's neighbors into account. Therefore, it is also difficult for them to determine the optimal dynamic coverage strategy in the MAS.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Reinforcement learning (RL) is a process learning algorithm that can learn the best behavior strategy for the entire process [23]. Agents can select the optimal action in different states according to their previous experience [24]. RL is widely used in various process control fields, including sensor networks [25], [26], path planning [27], [28] and robotics [29], [30]. Recently, reinforcement learning has been applied increasingly to multi-agent collaborative control [31]-[34]. However, to the best of our knowledge, reinforcement learning has not been applied to dynamic area coverage.
In this paper, we construct a MAS motion model based on the flocking algorithm proposed in [18]. We then propose a distributed self-organizing multi-agent dynamic area coverage algorithm based on RL, which transforms the area coverage problem into an optimal on-line planning problem for the agents' target points (γ points). To construct the RL model, we design a γ-information map by gridding the target area to record the coverage information for the γ points. By learning during the coverage process with the proposed algorithm, agents can plan the best γ points, obtain the optimal coverage path, and cover the free area efficiently. In some extreme environments, it is difficult for communication to meet the real-time interaction requirements. For these cases, we propose an off-line global optimal coverage path planning method based on our proposed algorithm, which achieves off-line optimal planning that is not possible with other self-organizing dynamic coverage algorithms.
The main contributions of the paper can be summarized as follows: (1) A distributed multi-agent dynamic area coverage algorithm based on RL is proposed to obtain the global optimal or local optimal dynamic coverage strategy for the entire coverage process. (2) We present an off-line global optimal area coverage scheme based on the proposed algorithm. (3) A γ-information map is proposed and used to convert continuous area coverage problems into discrete γ point planning problems. Based on the γ-information map, a continuous-discrete hybrid control system is established.
The rest of the paper is organized as follows. In Section II, we state the problem formulation, including the system framework, the construction of the multi-agent motion model, and the design of the γ-information map. In Section III, we construct the state space and action space for the RL and define the reward function. In Section IV, the computational complexity of the proposed algorithm is analyzed. The simulation and evaluation process are introduced in Section V, and the simulation results and result analysis are described in Section VI. The conclusion is presented in Section VII.

II. PROBLEM STATEMENT
In this section, we describe the framework of a multi-agent dynamic area coverage system, construct a multi-agent motion model based on the flocking algorithm, and give the definition of the γ-information map.

A. SYSTEM FRAMEWORK
The multi-agent dynamic area coverage system constructed in this paper is a continuous-discrete hybrid control system, which is described in Fig. 1. In Fig. 1, the MAS motion control model transforms a discrete action into a continuous action and controls agent i to move in a continuous space. The γ-information map discretizes the continuous state of agent i, and agent i selects the best γ point (a discrete action) based on reinforcement learning.

B. MAS MOTION MODEL
In this paper, we assume that each agent in the MAS satisfies the particle motion model. Let p_i, v_i and u_i ∈ ℝ² denote the position, velocity and control input of the ith agent, respectively. Then agent i satisfies the following equation of motion: where N is the number of agents in the MAS, and the N agents constitute the set V = {1, 2, ..., N}. We define the neighbor set of agent i based on the communication distance, r_c, as follows: In the coverage process, each agent needs to plan its target position for the next moment according to its coverage state, and avoid collisions with other agents. To achieve self-organizing control for dynamic coverage, we take the agent and the target position as the α-agent and γ-agent in [18], respectively. Because there is only a repulsive force between agents, we need to redefine the formula for calculating the potential field force between α-agents in [18].
We define the potential field force between agent i and agent j as follows: where φ(z), ‖z‖_σ and σ(z) are defined as follows: ‖z‖_σ is a map ℝ² → ℝ, and σ(z) is the gradient of ‖z‖_σ. d in (4) is the avoidance distance, and d_σ = ‖d‖_σ. ρ_h(z) in (4) is the bump function introduced in [18]:
$$\rho_h(z) = \begin{cases} 1, & z \in [0, h) \\ \frac{1}{2}\left[1 + \cos\left(\pi \frac{z - h}{1 - h}\right)\right], & z \in [h, 1] \\ 0, & \text{otherwise} \end{cases}$$
where h is a constant, and 0 < h < 1. ρ_h(z) maps z to [0, 1] smoothly. Considering all the neighbors of agent i, we obtain the control quantity of avoidance given by where c_α is a positive constant. Based on the PID control algorithm [35], we obtain the control quantity generated by the γ-agent, represented as follows: where c_γ1 and c_γ2 are the proportional and differential control parameters in the PID algorithm, respectively, and p_r is the position of the γ-agent.
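The motion model, neighbor set and σ-norm referenced in this subsection can be written out as follows. This is a sketch that assumes the paper keeps the standard forms of [18]: the double-integrator particle model, a distance-based neighbor set (the strict inequality is an assumption), and the σ-norm with its gradient, where ε > 0 is a fixed σ-norm parameter that the text above does not name. The repulsive potential φ(z) redefined by the authors is not reconstructed here.

```latex
\begin{aligned}
\dot{p}_i &= v_i, \qquad \dot{v}_i = u_i, \qquad i = 1, 2, \dots, N \\
N_i &= \{\, j \in V : \|p_j - p_i\| < r_c,\; j \neq i \,\} \\
\|z\|_\sigma &= \frac{1}{\epsilon}\left[\sqrt{1 + \epsilon \|z\|^2} - 1\right] \\
\sigma(z) &= \nabla \|z\|_\sigma = \frac{z}{\sqrt{1 + \epsilon \|z\|^2}}
\end{aligned}
```

The last line follows by differentiating the third with respect to z, which is consistent with the statement that σ(z) is the gradient of ‖z‖_σ.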
We obtain the following expression for the control quantity from (9) and (10): For ease of reference, the parameters and notations used in this paper are summarized in Table 1.
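The motion model above can be illustrated with a short simulation. The sketch below assumes a double-integrator agent, a proportional-derivative form for the γ-agent term, and a hard speed cap `v_max`; the gains, the time step, and the function names `neighbors`, `gamma_control`, `step` and `drive_to` are all hypothetical, and the repulsive term between agents is omitted.

```python
import math

def neighbors(i, positions, r_c):
    """Indices j != i whose distance to agent i is below r_c (strictness assumed)."""
    xi, yi = positions[i]
    return [j for j, (xj, yj) in enumerate(positions)
            if j != i and math.hypot(xj - xi, yj - yi) < r_c]

def gamma_control(p, v, p_r, c1, c2):
    """PD term steering an agent at p with velocity v toward the gamma point p_r."""
    return (-c1 * (p[0] - p_r[0]) - c2 * v[0],
            -c1 * (p[1] - p_r[1]) - c2 * v[1])

def step(p, v, u, dt, v_max):
    """One semi-implicit Euler step of p' = v, v' = u with a speed cap (assumed)."""
    vx, vy = v[0] + u[0] * dt, v[1] + u[1] * dt
    speed = math.hypot(vx, vy)
    if speed > v_max:
        vx, vy = vx * v_max / speed, vy * v_max / speed
    return (p[0] + vx * dt, p[1] + vy * dt), (vx, vy)

def drive_to(p, p_r, c1=1.0, c2=2.0, dt=0.05, v_max=1.0, steps=2000):
    """Run the closed loop from rest at p toward p_r; returns final (p, v)."""
    v = (0.0, 0.0)
    for _ in range(steps):
        u = gamma_control(p, v, p_r, c1, c2)
        p, v = step(p, v, u, dt, v_max)
    return p, v
```

With these illustrative gains the loop is critically damped, so the agent settles on the γ point without oscillating around it.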

C. γ -INFORMATION MAP
To increase the instantaneous coverage area of agents and reduce the size of the RL state space, the target area is designed as a γ-map with traversal information. The γ-information map for agent i is defined as follows: γ-Information Map: Assuming that the target area can be constructed as a rectangular region of m × n, we divide the area into k × l small rectangular regions, and the center of each small rectangle is regarded as a γ point, which is equivalent to a γ-agent. The centers of all small rectangles constitute the γ point set, and (x, y) and m_i(γ_{x,y}) represent the position and information value of γ_{x,y} on the γ-information map.
k and l can be calculated from the following formula: where r_s is the perceived radius of agent i, and ⌈z⌉ is the ceiling operation for z. Equation (12) guarantees that the small rectangular region where γ_{x,y} is located is completely covered when agent i reaches γ_{x,y}. Fig. 2 shows an example of γ-information map traversal. When communication between agents is established, the agent will fuse its γ-information map and its neighbor's γ-information map using the method given by the following equation: Equation (13) guarantees that the traversal information for γ_{x,y} is the latest, which helps the agent select γ points that have not been traversed and improves the traversal efficiency of the MAS.
Therefore, the γ-information map of agent i has two functions. 1) It records the information on the γ points that have been traversed by agent i. 2) It records the information on the γ points that have been traversed by the neighbors of agent i through the information interaction between the agents. The proposed γ-information map allows us to convert dynamic area coverage into γ point traversal; thus we can achieve a transformation from a continuous coverage process to a discrete traversal process, which provides the conditions for the RL algorithm proposed in the following section.
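The construction of the γ-information map can be sketched in a few lines. The sketch assumes that the grid rule takes the form k = ⌈m / (√2 r_s)⌉ and l = ⌈n / (√2 r_s)⌉, which is the coarsest square grid whose cells fit inside a sensing disc centered on the cell and which reproduces the k = l = 8 reported for the simulation; the element-wise maximum used in `fuse` is likewise an assumption, valid only if information values never decrease and 0 means "not traversed". All function names are hypothetical.

```python
import math

def grid_shape(m, n, r_s):
    """Cells per side so each cell fits inside a disc of radius r_s at its center."""
    side = math.sqrt(2.0) * r_s  # largest square side whose half-diagonal is r_s
    return math.ceil(m / side), math.ceil(n / side)

def gamma_points(m, n, k, l):
    """Centers of the k x l cells, indexed [x][y] with 0-based indices."""
    w, h = m / k, n / l
    return [[((x + 0.5) * w, (y + 0.5) * h) for y in range(l)] for x in range(k)]

def fuse(map_i, map_j):
    """Element-wise maximum of two information maps (assumed fusion rule)."""
    return [[max(a, b) for a, b in zip(row_i, row_j)]
            for row_i, row_j in zip(map_i, map_j)]
```

For a 50 m × 50 m area with r_s = 4.5 m this gives an 8 × 8 grid of 6.25 m cells, whose half-diagonal of about 4.42 m is below the sensing radius.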

III. REINFORCEMENT LEARNING

A. ASSUMPTIONS
Q-learning is a typical reinforcement learning algorithm that records the learned experience in a Q-value table, from which we can obtain the best action strategy. In the traversal process of the γ-information map, we use Q-learning to plan the γ points of agents, and we obtain the best planning strategy for the γ points after learning. Thus agents can complete dynamic coverage of the target area efficiently. In applying Q-learning to dynamic area coverage, we make the following assumptions.
(1) Agents in the MAS move on the same horizontal plane and satisfy the MAS motion model proposed in Section II.
(2) Agents have the same detection range to the ground, r_s > 0, and the communication distance r_c between agents is the same. When communication is established between agents, agents can share information on position, velocity, γ-information map, learning experience and so on.
(3) At the start of motion, communication is established between agents. Assumption (3) constrains the initial positions of the agents, which creates better initial conditions for Q-learning and improves the learning efficiency.
To illustrate the application of Q-learning to the traversal process of the γ-information map, we define the symbols used in Q-learning. Let the current state, action and reward of agent i be s_i, a_i and r_i, respectively, and let the next state and action of agent i be s′_i and a′_i, respectively. Next, we construct the state and action space, and give the calculation method for r_i and the Q-value and the criterion for action selection.

B. STATE
During the traversal process of the γ-information map, agent i obtains a fused γ-information map by interacting with its neighbors, then decides the next γ point according to its γ-information map and the target positions of its neighbors. So, we define the state of agent i as follows: where p_i^γ is the position of the target γ point of agent i on the γ-information map, calculated as follows: In (14), if agent j is not adjacent to agent i, then p_j^γ = (0, 0); otherwise p_j^γ can be calculated from (15).
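One way to turn this state description into a Q-table key is sketched below. It assumes 1-based grid indices, so that (0, 0) is free to act as the placeholder for non-neighbors, and it assumes that a γ point's index is obtained by dividing its coordinates by the cell size and rounding up. The helper names `gamma_index` and `make_state` are hypothetical, and the γ-information map component of the state is left out for brevity.

```python
import math

def gamma_index(p, m, n, k, l):
    """1-based (x, y) index of the cell containing position p, clamped to the grid."""
    x = min(k, max(1, math.ceil(p[0] / (m / k))))
    y = min(l, max(1, math.ceil(p[1] / (n / l))))
    return (x, y)

def make_state(i, targets, neighbor_ids):
    """State key: own target index, then every other agent's target index,
    with (0, 0) standing in for agents that are not neighbors of agent i."""
    others = tuple(targets[j] if j in neighbor_ids else (0, 0)
                   for j in range(len(targets)) if j != i)
    return (targets[i],) + others
```

Because the key is a tuple of tuples, it is hashable and can index a dictionary-based Q-value table directly.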

C. ACTION SPACE
The action in Q-learning is expressed as the choice of the γ point. When agent i is in state s_i, its optional γ points are the γ_{x,y} determined by p_i^γ and the eight γ points adjacent to γ_{x,y}. We use 1-9 to represent the nine γ points shown in Fig. 3. Therefore, the action space of agent i can be defined as: When the agent is at the boundary of the γ-information map, the action space of the agent will be a subset of the action space represented by (16). For example, when the agent is located at the upper left corner of the γ-information map, then A_i = {1, 2, 8, 9}; when the agent is located at
When one of the following conditions is met, agent i will update its action.
(2) Agent i and agent j have the same γ point.
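The boundary behavior of the action space can be sketched as follows. The sketch identifies each action by the grid index of the γ point it leads to rather than by the labels 1-9, because the label layout of Fig. 3 is not reproduced here; the 1-based indexing and the function name `candidate_points` are assumptions.

```python
def candidate_points(x, y, k, l):
    """Current gamma index plus its in-grid neighbors (1-based indices)."""
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if 1 <= x + dx <= k and 1 <= y + dy <= l]
```

An interior point keeps all nine candidates, an edge point keeps six, and a corner point keeps four, which matches the four-element corner action space given above.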

D. REWARD FUNCTION
We estimate the value of the selected action according to the traversal state of the γ-information map. The reward function is defined as follows: where γ_{x,y} is the next γ point obtained by executing action a_i, 0 < c_r < 1 is a constant, k_r is the number of repeated traversals, T is the time consumed during the process of γ-information map traversal or dynamic area coverage, and R(T) is defined as follows: where c_T1 and c_T2 are constants and c_T1 > c_T2 > 1; r_ref is a standard reward value for the whole traversal process and T_min is the minimum traversal time in the ideal condition, calculated from the following formula: where |v_max| is the magnitude of the maximum velocity of the agent. As the velocity change of the agent is not considered in the T_min calculation process, we have T > T_min. Equation (17) shows that when m_i(γ_{x,y}) = 0, the reward value of a_i ∈ {1, 2, 4, 6, 8} is lower than the reward value of a_i ∈ {3, 5, 7, 9}. This accounts for the time cost. Assuming that the magnitude of the agent's velocity is constant and each small rectangular region on the γ-information map is a square, the required time ratio of the two cases is √2 : 1. However, the traversal effect of the action in both cases is the same, because both traverse one γ point. When the agent repeatedly traverses a γ point, we give a negative reward, and the more times the traversal is repeated, the larger the penalty.
After agents traverse the whole γ-information map, we use R(T) to reward the whole traversal process, and R(T) is depicted in Fig. 5. The figure shows that when c_T2 T_min < T < c_T1 T_min, the smaller T is, the greater the reward.

E. Q-VALUE UPDATE AND ACTION SELECTION
To accelerate learning for the MAS, we use the distributed cooperative Q-learning algorithm proposed in [33]. The Q-value table of agent i is updated as follows: where α is the learning rate, λ is a discounting factor, Q_i^κ(s_i, a_i) is the Q-value under s_i and a_i, w is the weight, which satisfies 0 ≤ w ≤ 1, and the superscript κ denotes the number of iterations.
To further improve the learning efficiency of Q-learning, we restrict the agent's behavior space according to the γ-information map: where Ā_i is a subset of A_i. The traversal information of all γ points in Ā_i is the same, and their values are zero. For example, in Fig. 2, the optional action space of agent i is Ā_i = {2, 3, 6, 7, 8, 9}. It can be inferred from (22) that an agent prefers to traverse the untraversed γ points around it, which shows that the RL-based traversal algorithm proposed in this paper can complete the whole area coverage even without training.
Obviously, when the γ points of A_i are all traversed, Ā_i will be an empty set. So, we divide the action selection of agent i into the following two cases based on whether Ā_i is an empty set.
Case 1 (Ā_i ≠ ∅): In this case, the next action in Q-learning is selected according to the principle of the maximum Q-value, as follows:
Case 2 (Ā_i = ∅): In this case, we find the nearest untraversed γ point on the γ-information map by the following formula: We consider γ_{x1,y1} as the field source, and its attractive force on γ_{x,y} ∈ A_i is inversely proportional to the square of the distance between them. Therefore, we can choose the best action according to the magnitude of the attractive force, as represented by the following formula: The proposed episodic procedure for the distributed traversal algorithm based on Q-learning (herein Q-Traversal) is shown in Algorithm 1.
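The two selection cases and a table update can be sketched as below. This is a simplified illustration, not the authors' Algorithm 1: it uses the standard single-agent tabular Q-learning update and leaves out the cooperative weight w of [33], it identifies actions by the grid index of the γ point they lead to, and it assumes that an information value of 0 means "not traversed". The names `select_point`, `q_update` and `dist2` are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two grid indices."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def select_point(state, here, candidates, info_map, q_table):
    """Pick the next gamma point following the two cases described above."""
    fresh = [c for c in candidates if info_map[c] == 0]
    if fresh:
        # Case 1: some candidate is untraversed, so act greedily on the Q-values.
        return max(fresh, key=lambda c: q_table.get((state, c), 0.0))
    unvisited = [g for g, value in info_map.items() if value == 0]
    if not unvisited:
        return here  # the whole map has been traversed
    # Case 2: head for the nearest untraversed point via the strongest attraction.
    goal = min(unvisited, key=lambda g: dist2(g, here))
    # goal is untraversed and every candidate is traversed, so dist2 > 0 here.
    return max(candidates, key=lambda c: 1.0 / dist2(c, goal))

def q_update(q_table, s, a, r, s_next, next_actions, alpha, lam):
    """Standard tabular Q-learning update (cooperative weighting omitted)."""
    best_next = max((q_table.get((s_next, b), 0.0) for b in next_actions),
                    default=0.0)
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old + alpha * (r + lam * best_next - old)
```

Maximizing an attraction proportional to the inverse squared distance is equivalent to picking the candidate closest to the goal; the force is kept explicit only to mirror the description.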

IV. COMPUTATIONAL COMPLEXITY
The main difference among dynamic area coverage algorithms is the way in which γ points are obtained. After an agent is trained with the Q-Traversal algorithm, the agent can select the best γ point by querying the trained Q-value table. The process of selecting the best γ point is represented by (23). Most traditional dynamic area coverage algorithms solve for the next γ point according to the current coverage state. The computational complexity of a dynamic area coverage algorithm is mainly determined by the solution process for the γ point. The solution equations for the γ point in the distributed anti-flocking algorithm proposed in [13] are expressed as follows: where X_i is the two-dimensional position space of the target area. In applications, we usually convert the target area into a discrete grid map. Assume that the size of the target area is m × n and the grid factor is k_g (k_g > 0, k_g ∈ Z). Then the size of the grid map is k_g m × k_g n, and the number of γ points is k_g² mn. In the process of solving for the γ point, (26) and (27) need to be executed cyclically k_g² mn times. Equation (27) includes exponentiation and norm operations, and (28) is a process of finding the maximum value in a list of length k_g² mn. According to (14), the size of the state space of the Q-Traversal algorithm is (kl)^N. In the calculation process, M_i is regarded as a record of the historical positions of agents, and the capacity of the state depends on the size of the γ-information map and the number of agents. After all the states are trained, the Q-value of the current state can be obtained by querying a list of length (kl)^N, and then the best γ point can be obtained by selecting the maximum Q-value in the action space. Therefore, the computational complexity of the Q-Traversal algorithm consists of querying a list and finding the maximum Q-value in the action space. In fact, list query and maximum search are composed of multiple comparison operations. The number of comparisons is determined by the length of the list and the type of comparison data.
Based on the above analysis, the computational complexity of the anti-flocking algorithm and the Q-Traversal algorithm are obtained and summarized in Table 2. When k_g is large and k, l, and N are small, the computational complexity of the Q-Traversal algorithm will be less than that of the anti-flocking algorithm. In practice, most states do not need to be trained during the training process of the Q-Traversal algorithm, so the length of the Q-value table is much less than (kl)^N, and the computational complexity of the Q-Traversal algorithm is much less than the expression presented in Table 2.
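The two sizes that drive this comparison are easy to evaluate numerically. The sketch below only computes the quantities named in the text, the state-space bound (kl)^N and the grid count k_g² mn; it does not reproduce Table 2, and the function names and the example value k_g = 2 are illustrative assumptions.

```python
def q_table_states(k, l, n_agents):
    """Upper bound (k*l)**N on the number of Q-Traversal states."""
    return (k * l) ** n_agents

def anti_flocking_cells(m, n, k_g):
    """Number of grid points k_g**2 * m * n scanned per anti-flocking decision."""
    return k_g ** 2 * m * n
```

For the simulation settings k = l = 8 and N = 3 the bound is 262144 states, while a 50 × 50 area with k_g = 2 has 10000 grid points. Note that the first number bounds a table length and the second counts evaluations repeated at every decision, so they are not directly comparable costs.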

V. SIMULATION AND EVALUATION PROCESS
In this section, we present the simulation process, including parameter settings and the end conditions for simulation, and then define evaluation parameters for the performance of the algorithm.

A. SIMULATION PROCESS
In the simulation experiment, we set N = 3, r_s = 4.5 m, m = n = 50 m, the simulation time step to 0.5 s and the initialization time of the MAS to 0.3 s. From (12), we obtained k = l = 8.
The motion parameters of the agent are shown in Table 3; the distributed cooperative Q-learning parameters are shown in Table 4.
When setting c_r, c_T1 and c_T2, we need to consider r_c. When r_c ≥ 50 m, communication between agents can be established; to avoid falling into a local optimum, the values of c_T1 − c_T2 and c_r should be set reasonably. When r_c < 50 m, the smaller r_c is, the lower the probability of establishing communication between agents, and the harder it is to obtain a globally optimal traversal strategy. To prevent the Q-Traversal algorithm from failing to converge because no global optimal traversal strategy is found, the value of c_T1 − c_T2 should be increased, and c_r should be decreased.
In the simulations, we set the termination conditions for the whole training process. The training will be terminated in either of the following two scenarios.
(1) The number of trainings is more than N_T.
(2) T_AVE ≤ 1, with T_AVE defined as follows: where T_κ is the time for the κth training episode. Equation (29) shows that when the training time is stable, we assume that the algorithm has converged.
The stop criterion for each training episode of the learning algorithm includes the following two scenarios:
(1) T_κ > 3T_min, where T_min = 131.25 s can be calculated from (19).
(2) m(γ_{x,y}) ≠ 0 for all γ_{x,y} ∈ M_1(γ), where m(γ_{x,y}) is the fused information map of the three γ-information maps, which we can obtain from the following formula: The first scenario means that the agents do not complete the coverage of the target area within time 3T_min, and the second scenario means that the agents have completed coverage of the target area.
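The per-episode stop test can be sketched as a single predicate. The sketch assumes that the fused map is a nested list of information values in which 0 marks an untraversed γ point; the function name `episode_finished` is hypothetical.

```python
def episode_finished(t, t_min, fused_map):
    """True when the episode has timed out or every gamma point is traversed."""
    timed_out = t > 3 * t_min
    covered = all(value != 0 for row in fused_map for value in row)
    return timed_out or covered
```

With T_min = 131.25 s the timeout threshold is 3 × 131.25 = 393.75 s.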

B. EVALUATION CRITERION
When the motion parameters of agents are determined, T can be used to evaluate the performance of the coverage algorithm. We define the mean T̄ and variance s² of the coverage time as performance indicators of the traversal algorithm.
T̄ indicates the traversal efficiency of the traversal algorithm, and s² indicates the stability of T under different initial conditions. In the simulation, we compare the coverage performance in free space between Q-Traversal and the anti-flocking algorithm proposed in [13] with r_c = 60 m, 50 m, 40 m, 30 m, 20 m and 10 m. In the comparison, we only limit the velocity of the agents in the anti-flocking algorithm, while its other parameters remain unchanged.
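The two indicators can be computed from a list of coverage times as sketched below. The sketch uses the unbiased sample variance (divisor n − 1); whether the paper uses this or the population variance is not stated, so that choice is an assumption, as is the function name `coverage_stats`.

```python
import statistics

def coverage_stats(times):
    """Mean and sample variance of coverage times from repeated runs."""
    return statistics.mean(times), statistics.variance(times)
```

For example, `coverage_stats([140.0, 150.0, 160.0])` returns `(150.0, 100.0)`; these inputs are placeholders, not results from the paper.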

VI. SIMULATION RESULTS AND ANALYSIS
In this section, we first give the coverage effect diagram for the anti-flocking algorithm and Q-Traversal algorithm with r_c = 50 m. Then we give the test results for the T̄ and s² of the anti-flocking algorithm and Q-Traversal with different values of r_c. Finally, the results of the convergence experiment and robustness experiment are presented to verify the feasibility of Q-Traversal.

A. COVERAGE EFFECT
In the simulation, we set r_c = 50 m, c_r = 1.2, c_T1 = 1.18 and c_T2 = 1.12. With these parameters, we obtained the coverage results for the anti-flocking, untrained and trained Q-Traversal algorithms as shown in Fig. 6 (a), (b) and (c), respectively. Fig. 6 (a) shows that the anti-flocking algorithm can complete the coverage of the target area, but some regions are covered repeatedly during the coverage process, which is similar to the results shown in Fig. 6 (b). This repeated coverage greatly reduces the coverage efficiency. Fig. 6 (c) shows that, once training of the Q-Traversal algorithm is completed, the actions chosen by the agent from the trained Q-value table are mostly horizontal and vertical, which reduces the time cost of oblique motion. In addition, there are almost no crossovers between the trajectories of the agents, which avoids repeated coverage of the area. Therefore, the action strategy of the agent is globally optimal.
To study the coverage rate of the three coverage methods, we recorded the cumulative area coverage at different times and plotted Fig. 7 based on the recorded data. Fig. 7 depicts the relationship between cumulative coverage area and time during the coverage process. The slope of the curve represents the coverage rate. Fig. 7 shows that the coverage rates of the three methods differ little at the beginning of the coverage process, but when the time is greater than 80 s, the coverage rates of the anti-flocking and untrained Q-Traversal algorithms are significantly lower than that of the trained Q-Traversal algorithm. In addition, because there is no repeated coverage, the coverage rate of the trained Q-Traversal algorithm is more stable than those of the other two.
Table 5.
Fig. 8 shows that when r_c ≥ 20 m, the T̄ of the untrained Q-Traversal algorithm is roughly the same as that of the anti-flocking algorithm, but the T̄ of the trained Q-Traversal algorithm is clearly smaller than that of the anti-flocking algorithm. In addition, as r_c decreases, the time for the anti-flocking algorithm increases significantly, while the Q-Traversal algorithm remains relatively stable. The T̄ curve for the Q-Traversal algorithm shows that T̄ increases slightly as r_c decreases. This is due to the lack of communication between agents in some cases, which means the γ-information maps of agents cannot be shared, so the action strategy is only locally optimal. When r_c = 50 m and 60 m, T̄ is close to T_min, and the agent can obtain the global optimal action strategy. Fig. 9 shows the s² curves for the three methods with different values of r_c. For all values of r_c, the s² of the Q-Traversal algorithm is clearly smaller than the s² of the anti-flocking algorithm. This shows that the Q-Traversal algorithm can reduce the impact of the different initial conditions of agents on the traversal effect; when r_c ≥ 50 m, s² is close to 0, and the Q-Traversal algorithm can almost eliminate the impact.

C. CONVERGENCE OF Q-TRAVERSAL
In this experiment, we tested the convergence of Q-Traversal with r_c = 20 m, 40 m and 60 m. We set the initial positions of the three agents to (16.75, 19.25), (21.75, 25.25) and (21.75, 19.25), and the parameters of the Q-Traversal algorithm were the same as in Part B. In the experiment, we tested T during the training process, and the results of the simulations are given in Fig. 10.
The simulation results show that the Q-Traversal algorithm can achieve convergence at different communication distances. Training the Q-Traversal is a process of finding the optimal coverage strategy. When an optimal strategy under the parameters in Table 5 is found, the algorithm will converge quickly.
In addition, in the simulation process, we find: 1) When r_c ≥ 50 m, the convergence value of T is close to T_min, which can be regarded as the global optimal value. 2) When r_c < 50 m, it is difficult for T to converge on the global optimal value, but it can converge on the local optimal value. 3) When c_T2 is set too small, the optimal coverage strategy is difficult or even impossible to find. When c_T1 is set too large, it is easy to fall into a sub-optimal strategy.
In practical application, when the communication distance is not enough or communication cannot be established, the MAS can also obtain the optimal dynamic coverage path from the off-line Q-Traversal algorithm. Thus, no matter what situation the MAS is in, the Q-Traversal algorithm allows it to dynamically cover the target area in a globally optimal manner.

D. ROBUSTNESS OF Q-TRAVERSAL
In the robustness simulation, we simulated the situation in which an agent stops working during dynamic coverage with r_c = 40 m. In the simulation, we trained the Q-Traversal algorithm with N = 2, 3. We stopped running the Q-Traversal algorithm for one agent once T ≥ 100 s, so that agent could be regarded as out of work. We considered the following two cases and the corresponding solutions.
Case 1: Agents in the MAS can recognize that the agent is out of work. In this case, we let the agent query the Q-value table trained with N = 2, and we obtained a track of the area coverage shown in Fig. 11 (a). Case 2: Agents in the MAS cannot recognize that the agent is out of work. In this case, nothing will be changed, and the track of the area coverage is shown in Fig. 11 (b).
We can see that the coverage effect in Fig. 11 (a) is significantly better than that in Fig. 11 (b), because we have trained the coverage strategy of the two agents.
In the experiment, we verified that the Q-Traversal algorithm is robust in practical applications, demonstrating that the Q-Traversal algorithm can also complete the task of dynamic coverage efficiently in the event of an emergency situation.

VII. CONCLUSION
In this paper, we first design a MAS motion model based on a flocking algorithm. Then, we construct a γ-information map by gridding the target area. Finally, we propose a free area Q-Traversal algorithm based on the γ-information map and the distributed cooperative Q-learning algorithm.
The simulation demonstrated that the MAS can cover the whole target area with the untrained Q-Traversal algorithm; the effects of the untrained Q-Traversal and anti-flocking algorithms were almost the same. When communication between agents can cover the target area, the agent can obtain a global optimal coverage strategy after training with the Q-Traversal algorithm, and the coverage time is close to the minimum time under ideal conditions. When communication cannot cover the whole target area, the agent can obtain a local optimal coverage strategy after training with the Q-Traversal algorithm, and the coverage time is less than with the anti-flocking or untrained Q-Traversal algorithms. In addition, simulation results show that the Q-Traversal algorithm can effectively reduce any instability in the coverage time caused by different initial conditions and is very robust.
In the event that the communication distance is insufficient and online planning is difficult, we can also use the Q-Traversal algorithm for off-line global path planning, providing another global optimal coverage method for area coverage. In these extreme cases, the Q-Traversal algorithm still has great practical value.