Multi-Robot Flocking Control Based on Deep Reinforcement Learning

In this paper, we apply deep reinforcement learning (DRL) to solve the flocking control problem of multi-robot systems in complex environments with dynamic obstacles. Starting from the traditional flocking model, we propose a DRL framework for multi-robot flocking control that eliminates the tedious work of modeling and controller design. We adopt the multi-agent deep deterministic policy gradient (MADDPG) algorithm, which additionally uses the information of all robots in the learning process to better predict the actions that the robots will take. To address problems such as the low learning efficiency and slow convergence speed of the MADDPG algorithm, this paper introduces a prioritized experience replay (PER) mechanism and proposes the Prioritized Experience Replay MADDPG (PER-MADDPG) algorithm. Based on the temporal difference (TD) error, a priority evaluation function is designed to determine which experiences are sampled preferentially from the replay buffer. In the end, simulation results verify the effectiveness of the proposed algorithm: it has a faster convergence speed and enables the robot group to complete the flocking task in environments with obstacles.


I. INTRODUCTION
Multi-robot systems play an important role in a wide range of applications, such as target tracking and navigation, collaborative patrol, search and rescue, forest inspection, and agricultural spraying [3]-[7]. When multiple robots work together in a complex environment, it is crucial to ensure the safety of each robot. Inspired by the group behavior of biological colonies, such as bird migration and fish schooling, where the entire system reaches a coordinated and orderly state to respond to external threats without any organizer, many scholars have conducted research on multi-robot flocking control. However, when the working environment of the robot group is relatively complex, the group behavior strategy is required to run in real time and to avoid various obstacles. Besides, due to the limitation of actual communication capabilities, each robot's communication range is limited. Therefore, the robot group must consider connectivity during the task to ensure that the robots can communicate with each other.
Regarding the flocking problem, Reynolds et al. [8] have come up with three basic rules, namely separation, aggregation, and consistency of velocity. The three rules are
instructive to the establishment of flocking motion models, and most of the subsequent flocking models are based on these three rules. Vicsek [9] studied the consistency of velocities in Reynolds' rules and controlled agents perturbed by random noise such that their directions of motion converge. Tian et al. [10] proposed an improved Vicsek model with a limited field of view, and this model was further extended by Zhang et al. [11] with random line-of-sight directions. Among the extensive studies on flocking control problems, most use traditional methods such as those based on LQR [12], PCA [13], or a virtual leader [14], which are not effective in dealing with external disturbances and the nonlinear, time-varying nature of the flocking control problem. This paper uses the deep reinforcement learning (DRL) method to complete the flocking task without requiring the accurate modeling and sophisticated control design demanded by traditional methods.
This paper adopts the DRL method of multi-agent deep deterministic policy gradient (MADDPG) [1], which combines neural networks with the deterministic policy gradient algorithm, to solve the multi-robot flocking control problem in 2D environments. Based on the MADDPG algorithm, we propose an improved version, called the PER-MADDPG algorithm, by introducing the prioritized experience replay (PER) [2] mechanism. The new algorithm effectively improves the training efficiency and shortens the convergence time. The three main contributions of this paper are as follows: (1) To the best of our knowledge, this is the first work to use the MADDPG method to solve the multi-robot flocking problem. (2) Exploiting the centralized-training and decentralized-execution features of the MADDPG algorithm, we use only one replay buffer to store the information of all the robots. (3) We propose the PER-MADDPG algorithm, which combines MADDPG and PER. Simulation results show that the new algorithm significantly improves the training efficiency.
The rest of the paper is structured as follows: Section II overviews the existing studies on reinforcement learning in the field of cooperative flocking control. Section III formulates the multi-robot flocking problem and illustrates the multi-robot reinforcement learning process. Section IV introduces the algorithmic framework for the flocking task proposed in this paper. Section V verifies the algorithm through simulation experiments. Finally, Section VI concludes the paper.

II. RELATED WORK

A. MULTI-ROBOT REINFORCEMENT LEARNING
The problem of single-robot reinforcement learning has been extensively studied, and a number of algorithms have been proposed in the literature [15]-[19]. However, only a few methods are available to solve the multi-agent reinforcement learning problem. Dai et al. [20] used DQN to solve the multi-robot task assignment problem and achieved some results. Sukhbaatar et al. [21] designed a neural network called CommNet to enable continuous communication in a collaborative environment. Also for multi-robot communication problems, Foerster et al. [22] used Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL) to enable end-to-end collaborative communication among multiple robots. Palmer et al. proposed Lenient-DQN [23], which introduces a lenient loss function based on Double DQN [24] to adapt to cooperative multi-agent reinforcement learning problems. Foerster et al. [25] proposed COMA, which uses a centralized critic. The centralized critic can obtain global information to guide each agent and further improve each agent's ability to model that information. However, as there is only one centralized critic, the agents are not allowed to have different reward functions. Recently, Rashid et al. [26] proposed the QMIX algorithm, which uses a hybrid network structure and adds global state information to improve the algorithm's performance during the training process. The MADDPG algorithm used in this paper adopts the centralized-learning and decentralized-execution mechanism and adds the action information of each agent into the training process. Empirically, obtaining the action information of each agent helps to understand the policies of the other agents, so the MADDPG algorithm adapts well to cooperative-competitive environments. The MADDPG algorithm has been applied in many areas, such as multi-robot communication [27], multi-robot target assignment and path planning [28], and multi-robot target encirclement formation control [29].

B. THE APPLICATION OF REINFORCEMENT LEARNING IN FLOCKING
Several studies have been conducted to apply reinforcement learning methods to the flocking control problem. Having considered the model of flocking behavior, Morihiro et al. [30] proposed a multi-robot cooperative flocking control framework based on the Q-learning algorithm and implemented it in a simulation environment. On this basis, Tomimasu et al. [31] studied cooperative flocking control based on reinforcement learning; adopting the Q-learning algorithm and introducing potential field methods, they further built a simulation model to make the robot learn flocking behavior. Hung et al. [32], [33] studied the flocking control problem of small fixed-wing UAVs in the setting of model-free reinforcement learning. Xu et al. [34] proposed a flocking control framework for a multi-vehicle system (MVS) based on deep reinforcement learning, taking collision avoidance and communication maintenance into account. Although multi-robot cooperative flocking control based on reinforcement learning has been initially verified on simulation and physical platforms, most of the existing studies consider discrete action or state spaces and do not deal with problems such as slow convergence speed that may significantly deteriorate the control performance in a complex environment. Therefore, we propose the PER-MADDPG algorithm, which combines features of both MADDPG and PER so that the training efficiency and convergence speed are both noticeably improved. In addition, the introduction of the PER mechanism on top of MADDPG allows robots to output actions in a continuous action space.

III. PROBLEM FORMULATION

A. ROBOT DYNAMICS MODEL
We use $p^t = (p_1^t, \dots, p_N^t)$ and $v^t = (v_1^t, \dots, v_N^t)$ to denote the positions and velocities of the N robots at time t, and $o^t = (o_1^t, \dots, o_M^t)$ to denote the positions of the M obstacles. The discretized dynamic model for each robot r_i is described as
$$v_i^{t+1} = v_i^t + \frac{F_i^t}{m_i}\,\Delta t, \qquad p_i^{t+1} = p_i^t + v_i^t\,\Delta t, \tag{1}$$
where v_i^t and p_i^t denote the velocity and the position of the i-th robot, respectively, at time instance t, and ∆t denotes the sampling period. We define the control input of robot r_i as the force F_i^t ∈ R^2. When a variable carries no superscript, the default is to represent the information at time t.

FIGURE 1. The three components of the flocking task. (a) When any two robots in the group, or a robot and an obstacle, are too close together, a force will be generated to increase their distance. (b) When the distance between one robot and the other robots in the group is greater than the communication radius, this robot will approach the others. The blue dotted line in the figure indicates that the two robots have established communication. (c) Robots will move towards the target area, and those with inconsistent velocity will generate a force to correct the direction of the movement.
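The following minimal Python sketch illustrates one step of the discretized update, assuming the double-integrator form of (1) with force input F_i^t, unit mass, and the speed limit used later in the experiments; the function and variable names are illustrative only.

```python
import numpy as np

def step_dynamics(p, v, F, dt=0.1, mass=1.0, v_max=2.0):
    """One discretized update of positions p and velocities v (arrays of shape (N, 2))
    under force inputs F, assuming the double-integrator form of (1)."""
    v_next = v + (F / mass) * dt
    # Clip the speed to the maximum value allowed in the simulation.
    speed = np.linalg.norm(v_next, axis=1, keepdims=True)
    v_next = np.where(speed > v_max, v_next * (v_max / np.maximum(speed, 1e-8)), v_next)
    p_next = p + v_next * dt
    return p_next, v_next
```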

B. DESCRIPTION OF MULTI-ROBOT FLOCKING CONTROL PROBLEM
In this paper, an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is used to describe the communication relationship among the robots, where $\mathcal{V} = \{r_1, \dots, r_N\}$ is the set of robots, $\mathcal{E} = \{(r_i, r_j) \mid d(r_i, r_j) \le \rho_c,\ i \ne j\}$ is the set of edges, and ρ_c is the maximum communication distance. In other words, when the distance between two robots is less than the maximum communication distance, there is an undirected edge between them and they can communicate with each other. Therefore, we can define the neighbor set of r_i as $\mathcal{N}_i = \{r_j \mid d(r_i, r_j) \le \rho_c,\ j \ne i\}$. The flocking task consists of three parts: reaching the target position, avoiding collision, and maintaining connectivity, as shown in Fig. 1.
(1) Reaching the target position: Given a target position g, the control objective is to minimize the sum of the distances between the robots in the flock and the target: $e = \sum_{i=1}^{N} \|p_i - g\|_2$.
(2) Avoiding collision: The safe distance between two robots and that between a robot and an obstacle are given as ρ_n and ρ_o, respectively, and the robots should keep $d(r_i, r_j) > \rho_n$ and $d(r_i, o_j) > \rho_o$ during the task.
(3) Maintaining connectivity: The maximum communication distance between two robots is ρ_c. To ensure that the robots can communicate with each other, the distance between two robots should not exceed this range, i.e., $d(r_i, r_j) \le \rho_c$.
To better conform to the actual situation of flocking, we assume that the target area is globally perceived, and the positions of the obstacles are locally perceived. We define a distance ρ_p (ρ_p > ρ_o); then a robot can locally perceive an obstacle when $d(r_i, o_j) \le \rho_p$. When a robot in the group senses an obstacle, the other robots within its communication range can also obtain their relative positions to the obstacle, as shown in Fig. 2.
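As a concrete illustration, the sketch below computes the neighbor sets N_i and the shared local obstacle perception described above; the value of ρ_p and all names are assumptions for illustration.

```python
import numpy as np

def neighbor_sets(positions, rho_c=0.2):
    """Neighbor set N_i = {j != i : ||p_i - p_j|| <= rho_c} for every robot."""
    n = len(positions)
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    return [set(np.where((d[i] <= rho_c) & (np.arange(n) != i))[0]) for i in range(n)]

def perceived_obstacles(positions, obstacles, neighbors, rho_p=0.15):
    """A robot perceives an obstacle within rho_p; the observation is shared with
    every robot in its communication neighborhood (cf. Fig. 2)."""
    d = np.linalg.norm(positions[:, None, :] - obstacles[None, :, :], axis=-1)
    seen = d <= rho_p
    shared = seen.copy()
    for i in range(len(positions)):
        for j in neighbors[i]:
            shared[i] |= seen[j]
    return shared  # boolean (N, M): obstacle j is known to robot i
```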

C. MULTI-AGENT REINFORCEMENT LEARNING (MARL)
Solving a multi-agent problem through reinforcement learning avoids modeling the behavior of the agents in advance and designing controllers by hand. The agents only need to interact with the environment to generate their strategies.
Reinforcement learning is usually described as a Markov decision process (MDP). The Markov process of multi-robot flocking can be represented as a tuple $G = \langle N, S, A, T, R, O \rangle$, where S is the state space describing the state of the environment and the states of the robots. The joint action of all robots can be expressed as $A = A_1 \times \dots \times A_N$, where $A_i \subset \mathbb{R}^2$ is the two-dimensional continuous action space of each robot. In each iteration, the state transition of the robots is described by the transition function $T(s_t, a_t, s_{t+1}) : S \times A_1 \times \dots \times A_N \times S \to [0, 1]$. The robots receive a reward R during the iteration according to the reward function. The observations of all robots can be expressed as $O = (O_1, \dots, O_N)$, where $O_i$ represents the observation of robot r_i, which includes its own velocity and the positions of each robot and the target. The multi-robot learning process is shown in Fig. 3.
During the flocking task, each robot computes its action in the continuous two-dimensional space through its learned policy based on its observation. The action taken by robot r_i is determined by the policy $\pi_{\theta_i}(a_t \mid s_t)$, which gives the probability of selecting action $a_t \in A_i$ in state $s_t$, where $\theta_i \in \mathbb{R}^l$ is a parameter vector with l elements. The actions convert the state $s_t$ into a new state $s_{t+1}$ according to the transition function T. During the training process, the goal of each robot is to learn the best policy that maximizes its own cumulative discounted reward $G_t = \sum_{j=0}^{\infty} \gamma^j r_{t+j}$, where γ (0 < γ < 1) represents the discount factor of each step.
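For clarity, the cumulative discounted reward G_t can be computed recursively from a reward sequence, as in the short sketch below (names are illustrative).

```python
def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward G_t = sum_j gamma^j * r_{t+j} for every step t."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

# Example: discounted_return([1.0, 0.0, 2.0]) -> [2.805, 1.9, 2.0]
```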

IV. THE FLOCKING CONTROL BASED ON PER-MADDPG

A. PRIORITIZED EXPERIENCE REPLAY MADDPG
1) MADDPG
The MADDPG algorithm is an extension of DDPG [19]. Similar to DDPG, MADDPG uses the actor-critic [16] structure, and both the actor and the critic have an online network and a target network. The actor online network calculates the action a_i = π_i(o_i) to be performed based only on the current observation o_i of robot r_i, and the critic online network evaluates the action to improve the performance of the actor online network. The target networks regularly copy parameters from the online networks.
In the stochastic policy gradient algorithm, if we use π = {π_1, ..., π_N} to represent the policies of the N robots and θ = {θ_1, ..., θ_N} to represent the policy parameters, then the gradient of the expected return for robot r_i, J(θ_i) = E[R_i], can be written as
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim p^{\pi},\, a_i \sim \pi_i}\big[\nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\, Q_i^{\pi}(s, a_1, \dots, a_N)\big],$$
where p^π is the state distribution, s = (o_1, ..., o_N) represents the joint state, a = (a_1, ..., a_N) represents the joint action, and Q_i^π(s, a) is a centralized action-value function whose inputs are the joint action and joint state of all robots and whose output is the Q value of robot r_i. MADDPG, however, adopts the deterministic policy gradient. If we consider N continuous policies μ_{θ_i} with parameters θ_i (abbreviated as μ_i), the gradient can be written as
$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{s, a \sim D}\big[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\mu}(s, a_1, \dots, a_N)\big|_{a_i = \mu_i(o_i)}\big],$$
where μ = {μ_1, ..., μ_N}, D represents the experience replay buffer, which contains a series of tuples (s, s', a, r) recording the experiences of all robots, s' is the new state of the robots after executing the actions, and r = (r_1, ..., r_N) is the reward of all robots. Every once in a while, experiences are randomly sampled from D to update the network parameters. The critic network Q_i^μ is updated with the following loss function:
$$L(\theta_i) = \mathbb{E}_{s, a, r, s'}\big[(Q_i^{\mu}(s, a_1, \dots, a_N) - y)^2\big], \qquad y = r_i + \gamma\, Q_i^{\mu'}(s', a_1', \dots, a_N')\big|_{a_j' = \mu_j'(o_j)}, \tag{5}$$
where μ' = {μ'_1, ..., μ'_N} is the set of target-network policies with parameters θ'_i and a' is the output of the actor target network.
The actor network is updated with the sampled policy gradient of robot r_i, which can be written as
$$\nabla_{\theta_i} J(\mu_i) \approx \frac{1}{K}\sum_{k} \nabla_{\theta_i} \mu_i(o_i^k)\, \nabla_{a_i} Q_i^{\mu}\big(s^k, a_1^k, \dots, a_i, \dots, a_N^k\big)\big|_{a_i = \mu_i(o_i^k)},$$
where K is the minibatch size of samples and k is the index of samples.
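A minimal PyTorch sketch of one MADDPG update step for robot r_i is given below: the centralized critic is trained toward the target y from (5), and the actor follows the sampled deterministic policy gradient. The network containers, optimizer lists, and batch layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # Simple fully connected network (actor: obs -> action; critic: joint obs+actions -> Q value).
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95):
    """One MADDPG update for robot i. `batch` holds per-agent lists of tensors:
    obs[j], act[j]: (B, d); rew[j]: (B, 1); next_obs[j]: (B, d)."""
    obs, act, rew, next_obs = batch["obs"], batch["act"], batch["rew"], batch["next_obs"]

    # Centralized critic: TD target y built from all agents' target policies (cf. (5)).
    with torch.no_grad():
        next_act = [pi(o) for pi, o in zip(target_actors, next_obs)]
        y = rew[i] + gamma * target_critics[i](torch.cat(next_obs + next_act, dim=-1))
    q = critics[i](torch.cat(obs + act, dim=-1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # Actor: ascend Q w.r.t. agent i's own action; the other agents' actions come from the batch.
    act_i = actors[i](obs[i])
    joint_act = act[:i] + [act_i] + act[i + 1:]
    actor_loss = -critics[i](torch.cat(obs + joint_act, dim=-1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()
```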

2) PRIORITIZED EXPERIENCE REPLAY MECHANISM
The experience replay method overcomes the correlated-data and non-stationary-distribution problems of experience through a store-and-sample mechanism. However, because random sampling draws experiences of uneven quality, the MADDPG algorithm suffers from low learning efficiency and slow convergence speed. To solve this problem, this paper introduces the PER mechanism. The PER method has been widely used in DQN, DDPG, and other algorithms, and it performs well in single-agent reinforcement learning problems. However, in multi-agent tasks, if each agent keeps a separate replay buffer to store its own experience, storing and replaying according to each agent's own evaluation would disrupt the consistency of the centralized experience used for training, thereby preventing the training from completing.
In view of the centralized-training characteristic of the MADDPG algorithm, this paper uses a single centralized experience buffer, which stores the joint information (s, a, r, s') of all agents. The agents then preferentially sample experiences according to their importance, thereby improving the algorithm's efficiency.
The main idea of PER is to replay more frequently those experiences that are more important to the network updates, so how to define the importance of an experience is the critical issue. The TD-error is used in most reinforcement learning algorithms to update the estimate of the critic network Q_i^μ(s, a). Since the TD-error performs the maximum likelihood estimation, its value can be used as a correction of the estimate. It implicitly reflects how much the agent can learn from an experience, thereby making the estimated result more in line with the trend of future data. The bigger the value of the TD-error, the more positive the correction of the expected action value; conversely, the smaller the TD-error value is, the worse the action taken by the agent in this state would be. Replaying these experiences more frequently helps the agents gradually correct wrong behaviors and avoid repeating them, thereby improving the overall performance of the algorithm. In this paper, we use the absolute value of the TD-error |δ| as the basis for ranking. The TD-error of experience k is calculated through the formula below:
$$\delta_k = r_i^k + \gamma\, Q_i^{\mu'}\big(s'_k, a'_1, \dots, a'_N\big)\big|_{a'_j=\mu'_j(o_j^k)} - Q_i^{\mu}\big(s_k, a_1^k, \dots, a_N^k\big).$$
A large TD-error shows that the difference between the evaluation value of the target network and the actual value for this experience is significant. Hence, the sampling frequency of such experiences needs to be increased so that the target network and the evaluation network are updated as soon as possible to achieve the optimal training effect. We define the probability that experience k is sampled as
$$P(k) = \frac{D_k^{\alpha}}{\sum_{j} D_j^{\alpha}},$$
where $D_k = \frac{1}{\mathrm{rank}(k)} > 0$ and rank(k) represents the rank of experience k in the experience replay buffer based on the absolute value of the TD-error. The parameter α determines the degree of prioritization, and when α = 0 the scheme reduces to uniform sampling. The relationship between the probability of an experience being sampled and its rank is shown in Fig. 4. From this definition of the sampling probability, even experiences with a low TD-error value still have a chance to be sampled, which ensures the diversity of the sampled experiences and prevents the neural network from overfitting. Nevertheless, experiences with a higher TD-error value are replayed more frequently, which changes the sampling frequency of each state and may further result in oscillation or even divergence during training. To deal with this issue, we adopt importance sampling and adjust the update by reducing the weight of the top-ranking experiences.
The importance-sampling weight of experience k is defined as
$$\omega_k = \left(\frac{1}{S} \cdot \frac{1}{P(k)}\right)^{\beta}, \tag{10}$$
where S is the size of the experience replay buffer, P(k) is the probability that experience k is sampled, and the parameter β is used to control the impact of the importance-sampling weights on learning. The parameter β gradually increases to 1 during the training process. As β increases, the weight of the high-priority samples in (10) is almost unchanged, while the weight of the low-priority samples is greatly increased. When training eventually starts to converge, this unbiased update is crucial for error convergence. In order to improve the stability of model training, we always normalize the weights by 1/max_k ω_k so that they only scale the update downwards. Therefore, the definition of the loss function in (5) is changed to
$$L(\theta_i) = \frac{1}{K}\sum_{k} \omega_k\, \delta_k^2.$$
Based on the prioritized experience replay method introduced above, the integrated algorithm of MADDPG with prioritized experience replay is shown in Algorithm 1, and the framework of PER-MADDPG is shown in Fig. 5.
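The rank-based sampling probability and the importance-sampling weights can be combined in a single centralized buffer, as sketched below; the class interface and the default values of α and β are illustrative assumptions.

```python
import numpy as np

class RankPrioritizedBuffer:
    """Centralized replay buffer with rank-based prioritized sampling:
    P(k) = D_k^alpha / sum_j D_j^alpha with D_k = 1/rank(k), plus importance weights."""
    def __init__(self, capacity, alpha=0.7):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.td_abs = [], []

    def store(self, transition, td_error=1.0):
        # `transition` is the joint tuple (s, a, r, s') of all robots.
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.td_abs.pop(0)
        self.data.append(transition)
        self.td_abs.append(abs(td_error))  # fresh experiences start with a default priority

    def sample(self, k, beta=0.5):
        ranks = np.argsort(np.argsort(-np.asarray(self.td_abs))) + 1  # rank 1 = largest |TD-error|
        prob = (1.0 / ranks) ** self.alpha
        prob = prob / prob.sum()
        idx = np.random.choice(len(self.data), size=k, p=prob)
        # Importance-sampling weights, normalized by the maximum weight so they only scale updates down.
        w = (len(self.data) * prob[idx]) ** (-beta)
        w = w / (len(self.data) * prob.min()) ** (-beta)
        return idx, [self.data[j] for j in idx], w

    def update_priorities(self, idx, td_errors):
        for j, d in zip(idx, td_errors):
            self.td_abs[j] = abs(d)
```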

B. REWARD FUNCTION SETTING
Based on this algorithmic framework, we design the reward function of each robot according to the desired multi-robot flocking behavior. The reward function is mainly used to reward the expected behaviors and punish the undesirable actions:
$$R_i = \omega_g R_i^g + \omega_s R_i^s + \omega_c R_i^c + \omega_o R_i^o + \omega_p R_i^p, \tag{12}$$
where R_i^g is the reward for reaching the target point in the flocking task, R_i^s is the reward for generating a separation force to avoid collisions among robots, R_i^c is the reward for maintaining the aggregation of the flock so that the robots can communicate, R_i^o is the reward for ensuring that robots in the group can avoid obstacles, and R_i^p is the reward for keeping the velocity of each robot relatively smooth. In (12), ω_g, ω_s, ω_c, ω_o, ω_p are positive weighting factors. Details about each reward term are given below.

1) REACHING THE TARGET
This reward term ensures that the robot group can reach the target. Each robot receives a reward r_goal when it reaches the target point g, and receives a punishment when it moves away from the target; the punishment is proportional to the distance from the robot to the target point.
Here p_i represents the position of robot r_i, ρ_g > 0 represents the radius of the target area, and r_goal is a positive constant.

Algorithm 1 MADDPG With Prioritized Experience Replay
1: Initialize priority parameters α, β, minibatch size K, and replay buffer D
2: for episode = 1 to total-episode do
3:   Initialize a random process N for action exploration
4:   Receive initial state s = (o_1, ..., o_N)
5:   for t = 1 to max-episode-length do
6:     For each agent r_i, select action a_i = μ_i(o_i) + N_t w.r.t. the current policy and exploration
7:     Execute actions a = (a_1, ..., a_N) and receive reward r and new state s'
8:     Store (s, a, r, s') in replay buffer D
9:     s ← s'
10:    for agent r_i, i = 1 to N do
11:      for j = 1 to K do
12:        Sample experience k with probability P(k) from D
13:        Compute the corresponding importance-sampling weight ω_k and TD-error δ_k
14:        Update the priority of experience k according to the absolute TD-error |δ_k|
15:      end for
16:      Update the critic by minimizing the loss L(θ_i) = (1/K) Σ_k ω_k δ_k²
17:      Update the actor using the sampled policy gradient
18:    end for
23: end for

2) AVOIDING COLLISION
This reward function is used to avoid collisions among robots. When the distance between two robots is less than the minimum safety distance ρ n > 0, they will be punished. Conversely, if the distance is greater than the minimum safety distance, they will be rewarded.
Here r_avoid is a positive constant.

3) MAINTAINING COMMUNICATION
This reward is used to promote connectivity within the group. When the distance between a robot and any other robot in the group exceeds the maximum communication distance ρ c > 0, the robot will be punished. The greater the distance is, the more severe the penalty will be. When the distance between the two robots in the group is less than the maximum communication distance, they will get a reward. Such a setting is to maintain connectivity within the group.
Here r_comm is a positive constant.

4) AVOIDING OBSTACLE
Besides, in order to effectively avoid the obstacles in the process of completing the flocking task, a robot in the flock will be punished when its distance to any obstacle is less than the safe distance ρ_o > 0.
Here r_obstacle is a positive constant.

5) SMOOTHING VELOCITY
Furthermore, we appropriately restrict changes in the velocity direction to make sure that the robot group can move relatively smoothly. Suppose the angle difference between the velocities at two consecutive time instances is denoted by ϕ (see Fig. 6); we do not want ϕ to be too large. Therefore, we define a reward that penalizes a large ϕ, where ϕ is obtained from the normalized inner product of the two velocity vectors through the arccos function; here <a, b> denotes the inner product of two vectors, and the range of the arccos function is [0, π].
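To make the five reward terms concrete, the sketch below combines them as in (12); the exact functional forms, constants, and weights are assumptions consistent with the verbal descriptions above, not the paper's exact definitions.

```python
import numpy as np

def flocking_reward(i, p, v, v_prev, obstacles, g,
                    rho_g=0.1, rho_n=0.1, rho_c=0.2, rho_o=0.1,
                    r_goal=1.0, r_avoid=0.1, r_comm=0.1, r_obstacle=0.1,
                    w=(1.0, 1.0, 1.0, 1.0, 0.1)):
    """Hedged sketch of R_i = w_g R_g + w_s R_s + w_c R_c + w_o R_o + w_p R_p for robot i.
    p, v, v_prev: (N, 2) arrays; obstacles: (M, 2); g: (2,). All forms are illustrative."""
    others = [j for j in range(len(p)) if j != i]
    d_goal = np.linalg.norm(p[i] - g)
    # 1) Reaching the target: bonus inside the target area, distance penalty otherwise.
    R_g = r_goal if d_goal <= rho_g else -d_goal
    # 2) Avoiding collisions with the other robots.
    d_rr = [np.linalg.norm(p[i] - p[j]) for j in others]
    R_s = sum(r_avoid if d > rho_n else -r_avoid for d in d_rr)
    # 3) Maintaining communication: penalty grows with the distance beyond rho_c.
    R_c = sum(r_comm if d <= rho_c else -(d - rho_c) for d in d_rr)
    # 4) Avoiding obstacles.
    R_o = sum(-r_obstacle for o in obstacles if np.linalg.norm(p[i] - o) < rho_o)
    # 5) Smoothing velocity: penalize a large turning angle phi between consecutive steps.
    cos_phi = np.dot(v[i], v_prev[i]) / (np.linalg.norm(v[i]) * np.linalg.norm(v_prev[i]) + 1e-8)
    R_p = -np.arccos(np.clip(cos_phi, -1.0, 1.0))
    w_g, w_s, w_c, w_o, w_p = w
    return w_g * R_g + w_s * R_s + w_c * R_c + w_o * R_o + w_p * R_p
```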

V. SIMULATION AND EXPERIMENT
A. EXPERIMENT SETTING
1) ENVIRONMENT SETTING
We have designed a simulation training environment for multi-robot flocking based on the OpenAI platform, including robots, obstacles, and target locations. The experimental area is a square centered at the origin with a side length of 2, and the radius and mass of each robot are 0.05 and 1, respectively. The maximum speed of the robot is limited to 2, and the maximum acceleration is limited to 2. The radius of each obstacle is 0.05, and the radius of the target area is 0.1. At the beginning of training, the robots in the group are generated at coordinates (-0.7, 0.7), (-0.7, 0.9), and (-0.9, 0.7), the obstacles are generated at random coordinates, and the target position is selected randomly within the area {(T_x, T_y) | 0.8 ≤ T_x, T_y ≤ 1}.

2) PARAMETER SETTING AND THE TRAINING PROCESS
The four networks (the online and target networks of both the actor and the critic) have the same structure, i.e., each network has three fully connected layers, and each layer has 64 units. The learning rate l_r is 0.01, and the discount factor γ is 0.95. The mini-batch size is 1024, and the network parameters are updated every 100 steps. The robot can move 60 steps per episode, and the total number of training episodes is set to 60,000. The action that the i-th robot takes is the force F_i^t ∈ R^2 at time instance t, so the velocity and position are updated by (1). The maximum communication range ρ_c is 0.2, the collision avoidance distance between robots ρ_n is 0.1, the target area radius is 0.125, and the collision avoidance distance between robots and obstacles ρ_o is 0.1.
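For reference, the training hyperparameters stated above can be grouped in a single configuration object; the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Training hyperparameters as stated in the text (names are illustrative)."""
    lr: float = 0.01               # learning rate
    gamma: float = 0.95            # discount factor
    minibatch: int = 1024
    update_every_steps: int = 100
    steps_per_episode: int = 60
    total_episodes: int = 60_000
    hidden_units: int = 64
    rho_c: float = 0.2             # maximum communication range
    rho_n: float = 0.1             # robot-robot collision avoidance distance
    rho_o: float = 0.1             # robot-obstacle collision avoidance distance
    target_radius: float = 0.125
```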

B. THE EXPERIMENTAL RESULTS
In order to verify the effectiveness of our PER-MADDPG algorithm in completing the multi-robot flocking task in different scenarios, we carried out experiments with no obstacles, static obstacles, and dynamic obstacles, respectively. We evaluate the performance of our algorithm and the MADDPG algorithm in terms of four indices: the collision count between robot and robot (CCBRR), the collision count between robot and obstacle (CCBRO), the mean distance between robot and target (MDBRT), and the mean distance between robot and robot (MDBRR). If the distance between two robots is less than ρ_n, CCBRR is increased by 1. If the distance between a robot and an obstacle is less than ρ_o, CCBRO is increased by 1.
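A possible way to compute the four indices for one episode is sketched below, assuming collisions are counted per time step; the array shapes and names are illustrative.

```python
import numpy as np

def episode_metrics(traj_p, obstacles, goal, rho_n=0.1, rho_o=0.1):
    """CCBRR, CCBRO, MDBRT, and MDBRR over one episode.
    traj_p: array (T, N, 2) of robot positions; obstacles: (M, 2); goal: (2,)."""
    ccbrr = ccbro = 0
    mdbrt, mdbrr = [], []
    for p in traj_p:
        d_rr = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
        iu = np.triu_indices(len(p), k=1)
        ccbrr += int(np.sum(d_rr[iu] < rho_n))            # robot-robot collision events
        if len(obstacles):
            d_ro = np.linalg.norm(p[:, None, :] - obstacles[None, :, :], axis=-1)
            ccbro += int(np.sum(d_ro < rho_o))            # robot-obstacle collision events
        mdbrt.append(np.linalg.norm(p - goal, axis=-1).mean())
        mdbrr.append(d_rr[iu].mean())
    return ccbrr, ccbro, float(np.mean(mdbrt)), float(np.mean(mdbrr))
```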

1) OBSTACLE-FREE
We first conducted an experiment without obstacles. We selected the results of the first 10,000 episodes as shown in Fig. 7. It is obvious that PER-MADDPG has a faster convergence speed. In addition, the PER-MADDPG algorithm performs better in avoiding collisions among robots. The paths of the trained robots are shown in Fig. 10a.

2) STATIC OBSTACLE
In this experiment, five randomly distributed obstacles are added. The experiment results are shown in Fig. 8. It can be seen that, in the stable stage, the difference between the two reward curves is small, but in the process of convergence, the convergence speed of PER-MADDPG is faster than MADDPG. Fig. 10b is the trajectory diagram of multiple robots when there are static obstacles in the scene.

3) DYNAMIC OBSTACLE
The five randomly generated obstacles in this experiment all move at random velocities within the range v = {(v_x, v_y) | -0.5 < v_x, v_y < 0.5}, making it more difficult for the robots to complete the flocking task. The training results are shown in Fig. 9. In the dynamic obstacle scene, compared to MADDPG, the convergence speed of PER-MADDPG is significantly improved, and the number of collisions is noticeably reduced.

VI. CONCLUSION
This paper uses MADDPG to solve the multi-robot flocking control problem without requiring complex control design as most traditional analysis methods do. Besides, to solve the algorithmic problem of low learning efficiency and slow convergence speed, this paper proposes a novel deep reinforcement learning algorithm, namely PER-MADDPG, by introducing the prioritized experience replay mechanism to enable the training results to converge faster. In addition, the experimental results also show that PER-MADDPG is better than MADDPG in completing multi-robot flocking tasks, having fewer collisions. Considering the cooperative features of the flocking task, we will improve the algorithm's training efficiency through parameter sharing in the future.