Cross-Entropy Regularized Policy Gradient for Multirobot Nonadversarial Moving Target Search

This article investigates the multirobot efficient search (MuRES) for a nonadversarial moving target problem from the multiagent reinforcement learning (MARL) perspective. MARL is deemed as a promising research field for cooperative multiagent applications. However, one of the main bottlenecks of applying MARL to the MuRES problem is the nonstationarity introduced by multiple learning agents. With learning agents simultaneously updating their policies, the environment cannot be modeled as a stationary Markov decision process, which results in the inapplicability of fundamental reinforcement learning techniques such as deep $Q$-network and policy gradient (PG). In view of that, we adopt the centralized training and decentralized execution scheme and thereby propose a cross-entropy regularized policy gradient (CE-PG) method to train the learning agents/robots. We let the robots commit to a predetermined policy during execution, collect the trajectories, and then perform centralized training for the corresponding policy improvement. In this way, the nonstationarity problem is overcome, in that the robots do not update their policies during execution. During the centralized training stage, we improve the canonical PG method to consider the interactions among robots by adding a cross-entropy regularization term, which essentially functions to “disperse” the robots in the environment. Extensive simulation results and comparisons with state of the art show CE-PG's superior performance, and we also validate the algorithm with a real multirobot system in an indoor moving target search scenario.


I. INTRODUCTION
MULTIROBOT efficient search (MuRES) for a nonadversarial moving target has been an active research topic, attracting increasing attention from both academic researchers and industrial entrepreneurs over the past several decades. Here, a "nonadversarial" moving target refers to the type of target whose movement dynamics are independent of, and thus do not react to, the searchers' movement strategies. On the one hand, the MuRES problem has many potential real-world applications, such as multirobot search and rescue in hazardous environments [1], [2], [3], [4], [5], collaborative source leakage localization [6], [7], and multirobot security defense and surveillance [8], [9]. On the other hand, MuRES also serves as a representative operations research topic and lies at the intersection of many fundamental research areas, such as multiagent learning [4], [10], [11], game theory [12], swarm dynamics [13], [14], [15], cooperative control [16], [17], and graph theory [18], [19].
Researchers have proposed various algorithms to solve the MuRES problem, and a brief literature review of MuRES will be provided in Section II. Here, we wish to articulate that the prevailing MuRES solutions are planning methods, which formulate the MuRES problem into a monolithic mathematical programming paradigm and then employ off-the-shelf optimization solvers, e.g., CPlex [20] and branch and bound [21], or take advantage of the special nature of the problem for distributed solutions [22]. However, to the best of our knowledge, almost all the planning methods for the MuRES problem require, as inputs, the a priori information of the target's motion dynamics and its initial position distribution, both of which are not always available in many real-world applications. On the other hand, learning methods are inherently model free, which do not need the pregauged target motion dynamics as inputs.
Therefore, in this article, we turn our attention to the field of multiagent reinforcement learning (MARL) and treat the MuRES problem from the perspective of the decentralized partially observable Markov decision process (Dec-POMDP) framework. MARL is regarded as a promising field for cooperative multiagent applications. However, the nonstationarity caused by multiple learning agents during the search process prevents its direct application to the MuRES problem. Moreover, in MuRES, the neighbors of a learning robot change dynamically during task execution, and the total number of robots is not necessarily known to each robot and might even change during task execution, e.g., some robots might malfunction and quit the team, or new robots might be added to the team for reinforcement. These features of the MuRES problem prohibit us from directly applying canonical MARL methods as a straightforward MuRES solution. In view of the aforementioned challenges, i.e., nonstationarity, dynamically changing neighbors, and an unknown and varying total number of robots, we design a cross-entropy regularized policy gradient (CE-PG) method as the MuRES solution. CE-PG adopts the centralized training and decentralized execution (CTDE) scheme, which trains the learning agents¹ in a centralized manner and lets them execute the pretrained policy in a fully decentralized way. Through the CTDE scheme, the nonstationarity problem is resolved, in that the agents, i.e., robots in the MuRES context, do not change their policies during execution, which keeps the underlying Markov decision process stationary. Moreover, during the execution phase, each individual robot uses an online Bayesian computation method to recursively estimate the probability distribution of the target's position as its decision-making basis, which does not require communication or coordination with other robots; thus, CE-PG avoids designing a complex robot-robot interaction mechanism.
Furthermore, we improve the vanilla policy gradient (PG) method for the moving target search problem to include a cross-entropy regularization term, which prevents the robots from conglomerating. The cross-entropy term is calculated between the ego robot's policy and the average policy of all the other robots in the system. The average policy is much more stable than an individual policy and, hence, is robust against individual robot failures. Therefore, CE-PG has the unique feature of behaving well in the face of individual robot failures, which is also verified in the simulation section.
The contributions of CE-PG can be summarized as follows: 1) CE-PG adopts the CTDE scheme, which resolves the nonstationarity problem during multiagent learning; 2) the execution process of each CE-PG agent is independent from its neighbors, which avoids designing the complex robot-robot interaction mechanism; and 3) the cross-entropy regularization term ensures that the robots are dispersed in the environment, and in the meanwhile, the calculation process of cross entropy between the ego robot's policy and the average policy from all the other robots makes CE-PG robust against individual robot failures. We perform simulations in a range of canonical MuRES test environments and also deploy CE-PG to a real multirobot system for nonadversarial moving target search in a self-constructed indoor environment with satisfying results.
The rest of this article is organized as follows. Section II presents a brief literature review of MuRES along the taxonomies of its objective, environment type, target's behavior, sensor type, and methodology, followed by the MuRES problem formulation and background introduction of the CTDE scheme and the vanilla PG method in Section III. The CE-PG framework, its pseudocode, and computational complexity analysis are introduced in Section IV. We present the simulation results, comparisons, and analysis in Section V, followed by showcasing the deployment of CE-PG to a real multirobot system in Section VI. Finally, Section VII concludes this article. We deliver the proofs of related theorems in the Appendixes.
¹ Note that in the MuRES domain, "agent" refers to the searching robot, and we use the terms "agent" and "robot" interchangeably in the MuRES context.

II. LITERATURE REVIEW
Broadly speaking, the domain of multirobot target search can be divided into two subareas: multirobot guaranteed search (MuRGS) and MuRES. MuRGS aims at coordinating a group of robots in such a way that the target cannot escape being detected, regardless of its motion characteristics and/or sensing capabilities [23]. On the other hand, MuRES targets the problem of designing efficient multirobot search strategies so that the overall search effort, e.g., the target's expected capture time, is minimized. Since this article tackles the MuRES problem, in this section, we focus on reviewing MuRES-related research along the taxonomies of 1) MuRES objectives; 2) environment types; 3) target's motion behaviors; 4) robot's sensor characteristics; and 5) prevailing MuRES methodologies. For MuRGS-related research, one may refer to [24], [25], [26], [27], [28], and [29]. Fig. 1 presents a bird's-eye view of the MuRES-related research.

A. Taxonomies of the MuRES Problem
This subsection describes the MuRES problem from different perspectives.
1) Objectives: There are two mainstream objectives in the MuRES literature, namely, MuRES Problem I, which aims at minimizing the target's expected capture time (min. CT) [22], [30], [31], and MuRES Problem II, whose objective is to maximize the target's probability of detection (max. PD) within a given time budget [20], [21], [32], [33].
2) Environments: One may split MuRES environments into discrete environments, where the environment is represented by topological graphs [20], [22], [30], [34] or partitioned into Cartesian grids [32], and continuous environments [32], [33].
3) Target's motion dynamics: The target to be searched for can be either a stationary target [32], which does not move during the search process, or a moving target [20], [21], [22], [30], [31]. Moving targets can be further divided into the nonadversarial moving target [20], [21], [22], whose motion dynamics do not change with respect to the searchers' strategy, and the adversarial moving target² [30], [31], which changes its movement pattern based on its observation of the searchers' positions and actions.
4) Sensor characteristics: Different types of environments call for different descriptions of the robots' sensing range. For continuous environments, the sensor's detection range can be circular [26], which detects the target within a certain distance from the sensor, or line of sight [31], which detects the target as long as there is an unblocked straight line connecting the sensor with the target. For discrete environments, the sensor's detection range can be defined as same-node detection [22] or arbitrary range detection [20]. Another dimension of the sensor characteristics is whether the sensor is perfect or probabilistic. The perfect sensor always returns the true information of the target, i.e., whether there is a target within the detection range [32], while the probabilistic sensor has a certain false negative detection probability [20], i.e., it fails to detect a target even if the target is within the sensor detection range, and/or a false positive detection probability [17], i.e., it mistakenly deems another object within the sensor range as the target.

B. MuRES Methodologies
Researchers have designed various MuRES methodologies, and in this article, we partition them into three groups: planning methods, learning methods, and swarm dynamics.
Planning methods are deemed as the most canonical MuRES solutions. Within this subgroup, the MuRES problem is usually treated as a mathematical optimization problem. With the models of target's motion dynamics, initial location distributions, and the robots' motion characteristics, researchers establish a set of mathematical equations describing the MuRES objective and related constraints. Then, off-the-shelf solvers, e.g., CPlex and branch and bound, are invoked for the exact solution [20], [21], [32]. To expedite the solution process, the preestablished mathematical optimization problem is decomposed into several small but easy-to-solve subproblems, and distributed solutions are proposed for efficient approximate solutions [20], [22]. For example, Asfora et al. [20] establish the MuRES problem as a mixed-integer linear programming (MILP) problem and use CPlex to solve the problem. In the meanwhile, they make the solution distributed by sequentially allocating the robot's decision sequence to each subsequent robot. In this way, the later robots are able to incorporate the former robots' decisions for better cooperative search strategies. The distributed solution is shown to have the similar level of performance with much less computation time.
Learning methods are recently emerging methods for the MuRES problem. Researchers within this category treat MuRES as a multirobot sequential decision-making problem and usually establish it within the framework of Dec-POMDP. After that, various (decentralized) policy optimization algorithms, such as deep deterministic policy gradient [17], deep Q-network (DQN) [35], and Monte Carlo tree search [30], are proposed. For example, Qin et al. [35] design a DQN method for MuRES within the four-connected grid world for stationary targets. To overcome the sparse reward problem, the authors additionally incorporate the environmental uncertainty reduction as the auxiliary reward for decision making. Multiagent learning for the MuRES problem is a promising direction; however, as we have stated, nonstationarity during the robots' simultaneous learning process and dynamic neighborhood information make it difficult to directly transplant the prevailing MARL methods into the MuRES domain.
The third group of methods for the MuRES problem is swarm dynamics. Researchers in this group design various agent-based interaction mechanisms as the behavior guideline for each robot to follow. While each robot is executing its own dynamics, the robot swarm, as a team, is exhibiting a certain group-level behavior for efficient target search [7], [13], [15], [36], [37], [38]. For example, Tang et al. [36] revise the grey wolf optimization method to dynamically decide the next-goal point for each individual robot; with the next goal information, each robot also takes its observed obstacle information and momentum into consideration and reaches the finally merged dynamics. Designing swarm dynamics for the MuRES problem is easy to implement, and the team behavior is naturally robust to individual robot failures or new team member additions. However, to the best of our knowledge, it is very difficult, if not impossible, to establish a clear relationship between the MuRES objective and the robot-robot interaction mechanism. It means that one has to try different types of robot-robot interaction mechanisms and "hope" that one of the emerged team behaviors from the individual interactions fits the MuRES objective.
In this article, we propose CE-PG as a new multiagent learning method for the MuRES problem. Different from state-of-the-art learning methods, we adopt the CTDE scheme, which defers the learning process to the centralized training (CT) stage and lets the robots commit to the precalculated policies during the execution stage. In this way, we overcome the nonstationarity problem. Furthermore, we improve the vanilla PG method to incorporate a cross-entropy regularization term, which disperses the robots in the field. As a result, we do not need to design a complex robot-robot interaction mechanism for the execution stage; instead, each robot just follows its precalculated policy, and the robots are dispersed automatically.

III. PROBLEM FORMULATION AND BACKGROUND
In this section, we first lay down the MuRES problem formulation and then introduce the basics of CTDE and PG, both of which CE-PG builds on. Table I presents a list of major notations used throughout this article. Note that we also state each symbol's definition when it is first introduced in the main text.

A. MuRES Problem Formulation
The MuRES problem that we are investigating is to deploy a team of N robots, also named as searchers, in a discrete environment to search for one nonadversarial moving target with the minimal expected time. The term "nonadversarial" means that the target is moving according to its own motion dynamics and does not react to the searchers' positions or search strategies. In the following, we will provide descriptions of 1) the environment; 2) the target (position and motion); 3) the robots (position, action, observation, reward, and policy); and 4) the capture event.
1) The environment is represented by an undirected and connected "unit-cost" graph, $G(V, E)$, where $V$ ($|V| = n$) refers to the set of nodes and $E$ ($|E| = m$) refers to the set of edges. The term "unit-cost" means that each robot's action, i.e., traversing an edge or staying at the same node, has a time cost of 1. Note that the "unit-cost" assumption is a common one in the MuRES literature for discrete environments; see [17], [20], [22], and [32] for examples.
2) The target's position at time $t$ is denoted as $e_t$. Note that this article presumes that both the robots and the target can only reside in nodes, i.e., $e_t \in V$. The target moves according to its own motion dynamics, represented by a stochastic matrix $\Gamma$, which stochastically moves the target from its current position to one of its neighboring nodes in $G$ or makes the target stay at the same node, i.e., $P[e_{t+1} \mid e_t] = \Gamma(e_t, e_{t+1})$.

3) The robot's position at time $t$ is denoted as $p_t^{(i)} \in V$, where $i \in \{1, 2, \ldots, N\}$ is the index of the robot. The robot's action, denoted as $a_t^{(i)}$, can be to traverse any edge connected to $p_t^{(i)}$ at time $t$ or to stay at the same node, which results in the next position $p_{t+1}^{(i)}$. The robot's reward, denoted as $r_t^{(i)}$, is equal to the negative of the action's cost, i.e., $r_t^{(i)} = -1$. The robot's observation is denoted as $z_t^{(i)} \in \{0, 1\}$, with $z_t^{(i)} = 0$ meaning that the robot "believes" that the target is not in node $p_t^{(i)}$ at time $t$, and with $z_t^{(i)} = 1$ meaning that the robot "thinks" that the target is in node $p_t^{(i)}$ at time $t$. Note that in this article, we consider probabilistic sensors with both false negative and false positive detection probabilities. Moreover, since we assume that there is no communication or explicit coordination among robots during execution, the robot's decision-making policy at time $t$, which is denoted as $\pi^{(i)}$, should depend only on the robot's own history of positions and observations, i.e., $p_{1:t}^{(i)}$ and $z_{1:t}^{(i)}$.
4) The capture event by robot $i$ at time $t$ is defined as $e_t = p_t^{(i)}$ and $z_t^{(i)} = 1$. It means that robot $i$ and the target reside in the same node at time $t$, and meanwhile, robot $i$ detects the target ($z_t^{(i)} = 1$). We denote the target's capture time as $t_{\mathrm{cap}}$ and use $t_{\mathrm{cap}}^{(i)}$ to indicate the capture time by robot $i$; apparently, $t_{\mathrm{cap}} = \min_i \{t_{\mathrm{cap}}^{(i)}\}$. The MuRES problem is then defined as finding the optimal joint policy $\pi = \{\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(N)}\}$, which minimizes the expected capture time, i.e., $E[t_{\mathrm{cap}}]$.
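The formulation above can be illustrated with a short Python sketch of one simulation step: the target moves according to a row of $\Gamma$, a robot senses its own node with false positive and false negative errors, and the capture event is checked. This is a minimal sketch under stated assumptions, not the simulator used in this article: the function names (`step_target`, `sense`, `is_captured`), the representation of $\Gamma$ as a row-stochastic NumPy array, and the parameter names `eta_fp` and `eta_fn` are chosen here for illustration only.

```python
import numpy as np

def step_target(e_t, gamma, rng):
    """Sample e_{t+1} from row e_t of the row-stochastic matrix Gamma."""
    return int(rng.choice(gamma.shape[0], p=gamma[e_t]))

def sense(p_t, e_t, eta_fp, eta_fn, rng):
    """Return z_t in {0, 1} for a same-node sensor with detection errors."""
    if p_t == e_t:
        # Target is in the robot's node: missed with probability eta_fn.
        return int(rng.random() >= eta_fn)
    # Target is elsewhere: false alarm with probability eta_fp.
    return int(rng.random() < eta_fp)

def is_captured(p_t, e_t, z_t):
    """Capture requires co-location and a positive detection."""
    return p_t == e_t and z_t == 1

# Usage on a 3-node path graph 0 - 1 - 2 (illustrative values).
rng = np.random.default_rng(0)
gamma = np.array([[0.0, 1.0, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 1.0, 0.0]])
e_next = step_target(0, gamma, rng)           # from node 0 the target must move to 1
z = sense(1, e_next, eta_fp=0.0, eta_fn=0.0, rng=rng)
```

With both error rates set to zero, the sensor reduces to the perfect same-node sensor discussed in Section II.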

B. Centralized Training and Decentralized Execution
In MARL, there are many decision-making agents, which simultaneously learn and interact with the environment. At one extreme, one may be tempted to train each agent completely independently by treating other agents' behaviors as part of the environment, e.g., independent Q-learning (IQL) [39]. At the other extreme, one may treat all the agents as one monolithic global agent and use centralized reinforcement learning (RL) methods to learn the joint optimal policy. However, both extremes suffer from severe drawbacks. On the one hand, the simultaneously learning agents make the environmental transitions nonstationary, and methods like IQL cannot even guarantee ultimate convergence. On the other hand, the computational process of training a global joint policy with centralized methods is extremely complex, and in MuRES, the total number of agents is also subject to change, which prohibits us from applying centralized RL methods.
[Fig. 2. Overall framework of CE-PG. The DE module runs on each robot, updates the robot's probabilistic target belief ($b_t^{(i)}$) online, and chooses actions according to the pretrained policy network ($\pi(\theta_i)$). The CT module runs on the centralized server, collects all the robots' position and observation sequences, updates the centralized target's motion dynamics, and trains all the robots' policy networks with the cross-entropy regularization in an offline centralized manner. The symbol "⊕" indicates the concatenation operation.]
In contrast, the CTDE scheme [40] lies between these two extremes. CTDE performs the policy training process in a centralized way, where each agent is able to access the global state as well as all the other agents' action-observation histories. However, during the execution stage, each agent only has access to its own position-observation history and makes local decisions. The benefits of CTDE are twofold: first, during the training stage, the learning agent is able to make full use of the centralized information and has the potential to learn the optimal policy; second, during the execution stage, each agent functions in a fully decentralized manner by committing itself to the pretrained policy. Committing to a precalculated policy also makes the environment behave in a stationary way, which provides the theoretical foundation for the ultimate convergence of the related RL algorithm. In summary, the essence of CTDE is to train an online decentralized policy for each agent in an offline centralized way.
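The alternation between the two stages can be summarized by the following Python skeleton. It is only a schematic sketch: `run_episode` and `centralized_update` are hypothetical callables standing in for the execution and training stages, and the sketch makes no claim about how either stage is implemented in CE-PG.

```python
def ctde(policies, run_episode, centralized_update, rounds):
    """Alternate decentralized execution and centralized training.

    policies           : list with one policy object per agent
    run_episode        : callable(policies) -> trajectories (hypothetical)
    centralized_update : callable(policies, trajectories) -> new policies (hypothetical)
    """
    for _ in range(rounds):
        # Execution stage: every agent commits to its current policy, so the
        # policies are fixed while the trajectories are being collected.
        trajectories = run_episode(policies)
        # Training stage: the server sees all trajectories and returns the
        # improved policies, which are only deployed in the next round.
        policies = centralized_update(policies, trajectories)
    return policies
```

The point of the skeleton is that no policy changes inside `run_episode`, which is what keeps the environment stationary from each agent's perspective during execution.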

C. Policy Gradient
PG methods aim at maximizing the expected return based on the PG theorem [41], by directly computing an estimate of the gradient of the expected return with respect to the policy parameters. With the help of (deep) neural networks, PG algorithms have become prevalent RL methods. Defining $J(\pi_\theta)$ as the expected return, the gradient of $J(\pi_\theta)$ is calculated as $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G(s_t, a_t)\right]$, where $G(s_t, a_t)$ is the estimated accumulated reward from $(s_t, a_t)$ and $T$ is the length of the episode. One may refer to [42, p. 325] for the derivation details of the PG theorem. In this article, we make use of the vanilla PG to let the robot learn to search for the moving target and, meanwhile, add a cross-entropy regularization term to foster cooperation by dispersing the robots in the environment.
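As a concrete illustration of the vanilla PG estimate, the sketch below computes a single-episode gradient estimate for a tabular softmax policy. The tabular parameterization `theta[s, a]`, the undiscounted reward-to-go used for $G(s_t, a_t)$, and the function names are assumptions made for illustration; the article itself uses a neural policy network.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()

def reward_to_go(rewards):
    """G_t = r_t + r_{t+1} + ... + r_T (undiscounted)."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]

def pg_estimate(theta, states, actions, rewards):
    """Single-episode estimate of grad J for logits theta[s, a]."""
    grad = np.zeros_like(theta, dtype=float)
    returns = reward_to_go(rewards)
    for s, a, g in zip(states, actions, returns):
        # For a softmax policy, d log pi(a|s) / d theta[s, :] = onehot(a) - pi(.|s).
        dlog = -softmax(theta[s])
        dlog[a] += 1.0
        grad[s] += dlog * g
    return grad

# Usage: one step with reward -1, matching the unit-cost reward of MuRES.
theta = np.zeros((2, 2))
grad = pg_estimate(theta, states=[0], actions=[1], rewards=[-1.0])
```

Because every step costs $-1$, the estimate pushes probability away from actions that were followed by long episodes, which is the mechanism by which shorter capture times are favored.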

IV. CROSS-ENTROPY REGULARIZED POLICY GRADIENT
This section presents the CE-PG algorithm for the MuRES problem. As stated previously, CE-PG follows the CTDE scheme and consists of two modules, namely, the offline CT module and the online decentralized execution (DE) module. The CT module collects all the agents' trajectories, i.e., their position and observation sequences, estimates the target's motion dynamics, and trains the policy networks, while the DE module runs on each agent, updates the agent's probabilistic target belief (PTB) online, and chooses actions according to the pretrained policy. Fig. 2 presents the overall framework of CE-PG. In the following subsections, we first introduce CE-PG's DE module, which consists of the online PTB update and policy execution, and then, in two subsections, introduce the CT module, which corrects the estimates of $\hat{\Gamma}$ and $b_0$ and thereby trains each agent's policy with interagent cross-entropy regularization. Thereafter, we present the computational complexity analysis of CE-PG's DE module and skip the corresponding analysis of the CT module, in that the DE module determines the online decision-making time of the deployed algorithm, while the CT module can be executed offline. We extend CE-PG's application to robots with a broad field of view, e.g., aerial robots, and discuss the use case of an "all-to-all" communication scheme toward the end of this section.

A. Online Decentralized Execution
The online DE module of CE-PG is deployed to each agent and will 1) update the agent's PTB ($b_t^{(i)}$) online and 2) choose actions according to the pretrained policy ($\pi(\theta_i)$). Before proceeding, we present the formal definition of PTB, which constitutes part of the robot's state and serves as the basis for policy parameterization.
Definition 1 (Probabilistic target belief): An agent's PTB is the agent's estimated probability distribution of the target's position.
Note that different agents may have different PTBs, as they have different "experiences" while interacting with the environment, and even for the same agent, its PTB will change as it interacts with the environment and collects more information. The PTB for agent $i$ at time $t$ is denoted as $b_t^{(i)}$, and we have $b_t^{(i)} \in \mathbb{R}^n$. All the agents have the same PTB at time 0, and we denote it as $b_0$, which is also referred to as the target's initial position distribution.
With PTB, we are ready to define the robot's state. Robot $i$'s state at time $t$, denoted as $s_t^{(i)}$, consists of its current position and its current PTB, i.e., $s_t^{(i)} = p_t^{(i)} \oplus b_t^{(i)}$. Note that in this article, we use the one-hot encoding scheme to represent $p_t^{(i)}$, i.e., $p_t^{(i)} \in \{0, 1\}^n$, and in this way, $s_t^{(i)} \in \mathbb{R}^{2n}$. Next, we provide the definitions of two quantities of sensor characteristics, namely, the false positive ratio ($\eta_{fp}$) and the false negative ratio ($\eta_{fn}$), which will be used in the online PTB update procedure.
Definition 2 (False positive ratio): The sensor's false positive ratio refers to the probability that the sensor mistakenly believes that the target is currently in the same node as the ego robot. $\forall t$, we have $\eta_{fp} = P[z_t^{(i)} = 1 \mid e_t \neq p_t^{(i)}]$.
Definition 3 (False negative ratio): The sensor's false negative ratio refers to the probability that the sensor fails to detect the target when they are in the same node. $\forall t$, we have $\eta_{fn} = P[z_t^{(i)} = 0 \mid e_t = p_t^{(i)}]$.
Armed with the definitions of $\eta_{fn}$ and $\eta_{fp}$, we state the online calculation process of $b_t^{(i)}$ with the following theorem.
Theorem 1: […] (1)
where $\Lambda \in \mathbb{R}^{n \times n}$ is a diagonal matrix, with its elements at the main diagonal set as […].
We defer the proof of Theorem 1 to Appendix A to keep the article's main contents succinct. Note that 1) for $t = 1$, we assign $b_{t-1}^{(i)} = b_0$; and 2) since $b_t^{(i)}$ is essentially a vector of probabilities, we can calculate the exact values through normalization with respect to its $L_1$ norm.
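To make the online belief update concrete, the following Python sketch implements the standard Bayes filter that Definitions 2 and 3 imply: a prediction step through the estimated motion dynamics, a diagonal likelihood weighting determined by the observation, and an $L_1$ normalization. It is offered as a plausible reading of the ingredients named around Theorem 1, not as a transcription of (1); in particular, the diagonal entries used below and the function name `update_ptb` are assumptions.

```python
import numpy as np

def update_ptb(b_prev, gamma_hat, p_t, z_t, eta_fp, eta_fn):
    """One Bayes-filter step for a robot's probabilistic target belief.

    b_prev    : belief at time t-1, shape (n,)
    gamma_hat : row-stochastic estimate of the target dynamics, shape (n, n)
    p_t, z_t  : robot's node index and binary observation at time t
    """
    n = b_prev.shape[0]
    # Prediction: P[e_t = l] = sum_k b_prev[k] * Gamma[k, l].
    predicted = gamma_hat.T @ b_prev
    # Likelihood P[z_t | e_t = k] for every node k (assumed diagonal entries).
    if z_t == 1:
        likelihood = np.full(n, eta_fp)
        likelihood[p_t] = 1.0 - eta_fn
    else:
        likelihood = np.full(n, 1.0 - eta_fp)
        likelihood[p_t] = eta_fn
    Lam = np.diag(likelihood)
    unnormalized = Lam @ predicted
    # Normalize with respect to the L1 norm so the belief sums to one.
    return unnormalized / unnormalized.sum()

# Usage: a static target, uniform prior, and a negative reading at node 0.
b1 = update_ptb(np.full(3, 1 / 3), np.eye(3), p_t=0, z_t=0, eta_fp=0.0, eta_fn=0.0)
```

With a perfect sensor, a negative reading removes all probability mass from the robot's node and redistributes it over the remaining nodes; with nonzero error rates the mass is only down-weighted. The sketch assumes the unnormalized belief is not identically zero.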
With the online updated PTB ($b_t^{(i)}$) according to (1) and the pretrained policy parameter ($\theta_i$), the robot executes the policy by choosing $a_t^{(i)}$ with the following probability: $P[a_t^{(i)} = a] = \pi(s_t^{(i)}, a; \theta_i)$, where $s_t^{(i)}$ is the robot's state and $\pi(\cdot, \cdot\,; \theta_i)$ is the policy network parameterized by $\theta_i$. While the implementation details will be introduced in Section V, we note here that the policy network, $\pi(s_t^{(i)}, a_t^{(i)}; \theta_i)$, is a fully connected feedforward neural network, which maps the featured states, i.e., $s_t^{(i)}$, to the action selection probabilities.
Algorithm 1: Online Decentralized Execution.
The pseudocode of the online DE module for robot $i$ is presented in Algorithm 1. Note that, in Algorithm 1, the target's movement (line 5) is for simulation purposes only, and the robot does not need $e_t$ to execute its decision-making policy. Since the sensors have both false negative and false positive detection rates, we deem the target as "captured" by robot $i$ only when $e_t = p_t^{(i)}$ and $z_t^{(i)} = 1$.
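The action selection step can be sketched in Python as follows. The sketch builds the state by concatenating the one-hot position with the PTB and passes it through a small fully connected network with a masked softmax. Several choices are assumptions for illustration and are not taken from the article: a single `tanh` hidden layer, one output per node (so that an action is identified with the next node), the `feasible` mask restricting moves to the current node and its neighbors, and all function and parameter names.

```python
import numpy as np

def make_state(p_t, b_t):
    """Concatenate the one-hot position with the PTB, giving s_t in R^{2n}."""
    one_hot = np.zeros(b_t.shape[0])
    one_hot[p_t] = 1.0
    return np.concatenate([one_hot, b_t])

def action_probs(state, W1, c1, W2, c2, feasible):
    """Forward pass followed by a softmax restricted to feasible next nodes.

    feasible : boolean vector of shape (n,), True for the current node and
               its neighbors; all other nodes receive probability zero.
    """
    hidden = np.tanh(W1 @ state + c1)
    logits = W2 @ hidden + c2
    shifted = logits - logits[feasible].max()          # numerical stability
    weights = np.exp(np.where(feasible, shifted, -np.inf))
    return weights / weights.sum()

# Usage on a 3-node graph with an untrained (all-zero) network.
n, n_hidden = 3, 4
state = make_state(1, np.array([0.2, 0.3, 0.5]))
probs = action_probs(state,
                     np.zeros((n_hidden, 2 * n)), np.zeros(n_hidden),
                     np.zeros((n, n_hidden)), np.zeros(n),
                     feasible=np.array([True, True, False]))
rng = np.random.default_rng(0)
next_node = int(rng.choice(n, p=probs))                # sample a_t from the policy
```

Masking before the softmax is one simple way to guarantee that the sampled move is valid on the graph; a network with a different output layout would need a different mask.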

B. Offline Centralized Training
The offline CT module provides two functionalities: 1) estimating $b_0$ and $\hat{\Gamma}$; and 2) updating each agent's policy network's parameters, i.e., $\theta_i$, with CE-PG. This subsection presents the first functionality of the offline CT module, and we introduce CE-PG in a separate subsection, as it serves as the core part of multirobot decision making for efficient target search. The following theorem presents the posterior estimation process of $b_0$.
Theorem 2 (The posterior initial PTB update theorem): With $\eta_{fp}$ and $\eta_{fn}$, the a priori initial PTB $b_0$, and $\forall i \in \{1, 2, \ldots, N\}$: […] where $\Lambda_0 \in \mathbb{R}^{n \times n}$ is a diagonal matrix, with its elements at the main diagonal set as […] where $1\{\text{statement}\}$ is the indicator function, which returns 1 if the statement is true and 0 if the statement is false.
While the proof of Theorem 2 is deferred to Appendix B, here, we wish to state that the derivation makes use of Bayes' rule to calculate the conjugate relationship between the prior and posterior $b_0$. Before presenting the update process of $\hat{\Gamma}$, we present the definition of collective PTB and then lay down the offline centralized collective PTB update theorem, which makes use of all the robots' position and observation information. We deliver the proof of Theorem 3 in Appendix C.
[…] (6)
where $\Lambda_c \in \mathbb{R}^{n \times n}$ is a diagonal matrix, with its elements at the main diagonal set as […]. With the updated collective PTB from (6), we can estimate $\hat{\Gamma}$ with the maximum likelihood estimate as follows: […] (8) where $T$ is the length of the position/observation sequence, and […]. The derivation process is straightforward and, hence, is omitted in this article. Note that, theoretically, one can loop between (6) and (8) for improved estimates of both the collective PTB and the target's motion dynamics, which is in accordance with the expectation-maximization algorithm [43] used in many unsupervised machine learning methods. However, in practice, we find that one round of update for the MuRES problem is enough to obtain well-behaved collective PTB and $\hat{\Gamma}$ values, and thus, we stick with only one round of the collective PTB and $\hat{\Gamma}$ updates in the CT procedure.
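One simple way to turn a sequence of collective beliefs into an estimate of the motion dynamics is to accumulate soft transition counts and normalize each row. The Python sketch below does this with outer products of consecutive beliefs, restricted to the graph's edges and self-loops. It should be read as an illustrative approximation under stated assumptions rather than as (8): the outer-product soft counts, the uniform fallback for rows without any mass, and the function name `estimate_gamma` are all choices made here.

```python
import numpy as np

def estimate_gamma(beliefs, adjacency, eps=1e-12):
    """Row-normalized soft transition counts from consecutive beliefs.

    beliefs   : array of shape (T, n); row t is the collective PTB at time t
    adjacency : array of shape (n, n) with 1 where a move (or a stay) is
                allowed, i.e., graph edges plus self-loops
    """
    adjacency = np.asarray(adjacency, dtype=float)
    counts = np.zeros_like(adjacency)
    for b_now, b_next in zip(beliefs[:-1], beliefs[1:]):
        counts += np.outer(b_now, b_next)      # soft count of k -> l transitions
    counts *= adjacency                        # forbid moves that are not in G
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows that never received mass fall back to a uniform move over allowed nodes.
    uniform = adjacency / adjacency.sum(axis=1, keepdims=True)
    return np.where(row_sums > eps, counts / np.maximum(row_sums, eps), uniform)

# Usage: a target that is known to alternate between two nodes.
beliefs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
gamma_hat = estimate_gamma(beliefs, np.ones((2, 2)))
```

When the beliefs are one-hot, the soft counts reduce to ordinary transition counts and the result is the usual maximum likelihood estimate of a Markov chain; for diffuse beliefs the outer product ignores correlations between consecutive positions, which is why this is only an approximation.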

C. Cross-Entropy Regularized Policy Gradient
This subsection introduces CE-PG's policy optimization process, which updates each agent's parameterized policy network.
The underlying rationale of CE-PG is to maximize the individual robot's expected return, i.e., $J(\pi(\theta_i))$, and, meanwhile, disperse the robots from each other by maximizing the cross entropy. In the following, we begin with the definition of cross entropy between robots and then introduce CE-PG's policy optimization objective, followed by the derivation of the objective's gradient.
Definition 5 (Cross entropy from robot $j$ to robot $i$): The cross entropy from robot $j$ to robot $i$, denoted as $H(\pi(\theta_j), \pi(\theta_i))$, refers to the expected cross entropy with respect to robot $i$'s position sequence, i.e., $H(\pi(\theta_j), \pi(\theta_i)) =$ […], where $T_i$ denotes the length of robot $i$'s position sequence.
Note that the smaller $H(\pi(\theta_j), \pi(\theta_i))$ is, the more robot $j$ "agrees" with robot $i$'s policy. In MuRES, however, we do not want the robots to conglomerate. Thus, we add a regularization term to disperse the robots from each other, which corresponds to cross-entropy maximization.
The objective of CE-PG for robot $i$, denoted as $\tilde{J}(\theta_i)$, is to maximize the weighted summation of robot $i$'s expected return and the average cross entropy from all the other robots to robot $i$, which is stated as […], where $0 \leq \beta_i \leq 1$ is robot $i$'s balance parameter between the two subobjectives. Note that different robots in the robot team may have different balance parameter values. The gradient of $\tilde{J}(\theta_i)$ can be expressed as […] (10), where the first term, i.e., $\nabla_{\theta_i} J(\pi(\theta_i))$, can be calculated from the canonical PG method in a straightforward way, and we have […] (11). In (11), $T_i$ has the same meaning as stated in Definition 5, and $G(s_t^{(i)}, a_t^{(i)})$ is the accumulated reward obtained when robot $i$ starts from $s_t^{(i)}$, takes action $a_t^{(i)}$, and follows $\pi(\theta_i)$ thereafter. The second term (cross entropy) in (10), i.e., $\nabla_{\theta_i} H(\pi(\theta_j), \pi(\theta_i))$, is calculated as […]. With the derivative of $\tilde{J}(\theta_i)$, one can update robot $i$'s policy network's parameter with the gradient ascent algorithm as follows: $\theta_i \leftarrow \theta_i + \alpha_i \nabla_{\theta_i} \tilde{J}(\theta_i)$, where $0 < \alpha_i < 1$ is robot $i$'s learning rate.
Algorithm 2: Offline Centralized Training.
The pseudocode for the offline CT module is presented in Algorithm 2. In Algorithm 2, we first collect all the robots' trajectories, e.g., position sequences and observation sequences, update the collective PTB, i.e., $b_t$ (from line 1 to line 4), and, hence, reach better estimates of the target's motion dynamics, i.e., $\hat{\Gamma}$ (line 5). After that, we update each robot's policy network, i.e., $\pi(\theta_i)$, with CE-PG (from line 7 to line 10).
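The regularized objective can be illustrated numerically with the Python sketch below, which evaluates a cross entropy between two policies over robot $i$'s visited states and adds the average over the other robots to the expected return. The sketch rests on assumptions that should be stated plainly: the cross entropy is taken as $-\sum_a \pi_j(a \mid s) \log \pi_i(a \mid s)$ averaged over robot $i$'s $T_i$ visited states, and the two subobjectives are combined as $J + \beta_i \cdot (\text{average cross entropy})$. Other weightings, e.g., a convex combination, are equally compatible with the verbal description, and all function names are hypothetical.

```python
import numpy as np

def cross_entropy(pi_j, pi_i, eps=1e-12):
    """Average over robot i's visited states of -sum_a pi_j(a|s) log pi_i(a|s).

    pi_j, pi_i : arrays of shape (T_i, A) holding the action probabilities of
                 robots j and i evaluated on robot i's visited states.
    """
    return float(np.mean(-np.sum(pi_j * np.log(pi_i + eps), axis=1)))

def average_cross_entropy(pi_i, pi_others):
    """Mean cross entropy from every other robot j to robot i."""
    return float(np.mean([cross_entropy(pi_j, pi_i) for pi_j in pi_others]))

def regularized_objective(J_i, pi_i, pi_others, beta_i):
    """Assumed form: expected return plus beta_i times the average cross entropy."""
    return J_i + beta_i * average_cross_entropy(pi_i, pi_others)

# Usage with one visited state and two actions (illustrative numbers).
pi_i = np.array([[0.25, 0.75]])
pi_j = np.array([[1.0, 0.0]])
pi_k = np.array([[0.0, 1.0]])
avg = average_cross_entropy(pi_i, [pi_j, pi_k])
# Under this form the cross entropy is linear in its first argument, so the
# average over robots equals the cross entropy with the average policy.
via_mean_policy = cross_entropy((pi_j + pi_k) / 2, pi_i)
```

Under the assumed form, the linearity noted in the last comment is what allows the regularizer to be computed against a single average policy of the other robots rather than against each robot separately.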

D. Computational Complexity Analysis
This subsection analyzes the computational complexity of CE-PG. In this article, we express the computational cost of an operation through the number of floating-point operations (flops). A flop is defined as an addition, subtraction, multiplication, or division of two floating-point numbers [44]. To evaluate the computational complexity of an algorithm, we count the total number of flops, express it as a function (usually a polynomial) of the dimensions of the involved matrices and vectors, and simplify the expression by ignoring all terms except for the leading ones. Note that we focus on analyzing the computational complexity of CE-PG's online DE module, in that this module dictates the algorithm's actual reaction time in deployment, while the offline CT module does not need to provide the real-time reaction functionalities and, thus, bears the heavy computational load at the server side.
Examining Algorithm 1, we can see that the core computational load occurs within the "while" loop (lines 3 and 7, to be more specific), in that the initialization procedure can be done offline. Line 3 corresponds to the robot's action selection process, which requires calculating all $\pi(s_t, a; \theta_i)$ values from the neural network. In practice, we design a three-layer neural network with input dimension $n_i = n$, $n_h = 2 n_i = 2n$ hidden nodes, and output dimension $n_o = n + m$; therefore, the computational complexity of line 3 is $O(n_i \times n_h \times n_o) = O(2n^2(n + m))$. Line 7 refers to the agent-based online PTB update process; referring to (1), we can see that executing line 7 has a computational complexity of $O(n^3)$. Summing the computational complexities of lines 3 and 7 and ignoring the constant factors, we conclude that the online DE module has a computational complexity of $O(n^3 + n^2 m)$. Note that the online DE module is deployed to each robot, and there is no communication among robots during execution. Therefore, the computational complexity of CE-PG's online DE module does not depend on the number of robots, i.e., $N$.
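As a rough sanity check on the cost of line 3, the flops of one forward pass through the three-layer network can be counted directly. The sketch below is illustrative only: it assumes bias-free fully connected layers, ignores the cost of the activation functions, and uses hypothetical sizes ($n = 60$, $m = 5$) that are not taken from the experiments.

```python
def dense_flops(n_in, n_out):
    """Flops of one bias-free dense layer: n_out dot products of length n_in.

    Each dot product needs n_in multiplications and n_in - 1 additions.
    """
    return n_out * (2 * n_in - 1)


def policy_forward_flops(n, m):
    """Flops of one forward pass with n inputs, 2n hidden nodes, n + m outputs."""
    n_i, n_h, n_o = n, 2 * n, n + m
    return dense_flops(n_i, n_h) + dense_flops(n_h, n_o)


def stated_bound(n, m):
    """The bound quoted in the text: n_i * n_h * n_o = 2 n^2 (n + m)."""
    return 2 * n * n * (n + m)


# Hypothetical sizes, chosen only for illustration.
n, m = 60, 5
print(policy_forward_flops(n, m), stated_bound(n, m))
```

For these sizes the direct count is well below the quoted bound, so under these assumptions the $O(n^3)$ PTB update of line 7 remains the dominant term of the online DE module.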

E. CE-PG+: Robot Team With a Broader Field of View
So far, we have presented both the offline CT module and the online DE module of CE-PG. However, the basic CE-PG algorithm assumes that the robots only have the same-node detection ability, which fits the use case of ground robots. In this subsection, we extend CE-PG to robot teams with a broader field of view, e.g., aerial robots, and also discuss the use case of an "all-to-all" communication scheme to see how it enhances CE-PG's performance.
When the robots have a broader field of view than same-node detection, e.g., aerial robots, we need to upgrade Theorems 1 and 3 to versions that account for the robots' broader sensing capabilities. Before that, we present the following formal definition of sensing range.
Definition 6 (Sensing range): Robot $i$'s sensing range at $p^{(i)}$, denoted as $\delta(p^{(i)})$, refers to the set of nodes that can be detected by the ego robot when it resides in $p^{(i)}$.
With Definition 6, we are ready to deliver the augmented PTB update theorem and the augmented offline collective PTB update theorem, which collectively upgrade the basic CE-PG to the version for robots with a broader field of view, denoted as "CE-PG+."
Theorem 4 (The augmented PTB update theorem): (14) where $\Lambda \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose main-diagonal elements are set according to $\delta(p_t^{(i)})$, the set of nodes that can be detected by robot $i$ when it resides in $p_t^{(i)}$.
The proof of Theorem 4 is quite similar to that of Theorem 1 and, hence, is omitted. Note that with Theorem 4, each robot can update its individual PTB with broader sensing information; hence, the individual PTB is closer to the true value than the original one, which makes the decision making more efficient. Similarly, the offline collective PTB update process needs to be adapted to the broader sensing capabilities as follows.
Theorem 5 (The augmented offline collective PTB update theorem): With $\Gamma$, $\eta_{fp}$, $\eta_{fn}$, $\forall i \in \{1, 2, \ldots, N\}$: p (16) where $\Lambda_c \in \mathbb{R}^{n \times n}$ is a diagonal matrix.
With Theorems 4 and 5, we can upgrade the basic CE-PG algorithm to CE-PG+. The pseudocode of CE-PG+ is omitted here, in that it is quite similar to that of CE-PG: one merely replaces line 7 of Algorithm 1 with the augmented PTB update equation in (14) and line 4 of Algorithm 2 with the augmented collective PTB update equation in (16).
So far, we have upgraded CE-PG to CE-PG+, which fits robots with a broader field of view, e.g., aerial robots. Another way to enhance the basic CE-PG algorithm is to make use of interrobot communication during deployment. Currently, CE-PG assumes that the robots do not communicate with each other during deployment, in that intermittent communication would alter the decision-making process of the individual robot and make the overall environment nonstationary. However, when all-to-all communication among robots is always available, we can reach a centralized CE-PG algorithm, named "C-CE-PG," which updates the collective PTB with all the robots' observations at each time step and broadcasts the real-time collective PTB to each robot. With C-CE-PG, each robot has much better PTB information, and thus, its decision-making process is more efficient than that of CE-PG. We will evaluate and compare the performance of CE-PG, CE-PG+, and C-CE-PG in the next section.

V. SIMULATION RESULTS AND ANALYSIS
In this section, we evaluate and compare CE-PG's performance with the state-of-the-art MuRES algorithms in a range of MuRES test environments. For the state of the art, we select 1) the finite horizon path enumeration (FHPE) method proposed in [22]; 2) FHPE's improvement with implicit coordination among robots through sequential allocation (FHPE-SA) [22]; 3) MILP for MuRES path planning proposed in [20]; and 4) MILP's distributed implementation (D-MILP) [20]. In addition, since CE-PG is essentially an MARL algorithm under the CTDE scheme, we also apply canonical MARL algorithms under the CTDE scheme, i.e., multiagent deep deterministic policy gradient (MADDPG) [45], FACMAC [46], MASAC [47], and MAMBPO [48], to the MuRES problem and include them as baseline algorithms. The algorithm-related parameter configurations are summarized in Table II. Note that: 1) for FHPE, the computational complexity is too high, and thus, we set the planning horizon ($h$) to 2, while for its distributed version (FHPE-SA), we set $h$ to 5; and 2) since FACMAC, MASAC, and MAMBPO have never been applied to the MuRES problem before, we provide the specifications in the publicly available code repository, detailing the related parameter configurations and state-space definition in the MuRES domain. For MADDPG, there exists a prior work [17] on multirobot search, and we adopt the related parameters specified in that work for the implementation.
For MuRES test environments, we select two canonical ones in the multirobot search domain, namely, OFFICE and MUSEUM, which are shown in Fig. 3(a) and (b), respectively. All the algorithms are implemented in Python 3.7, with source code publicly available. We evaluate the algorithms on a computer with a 2.30-GHz Intel(R) Core(TM) i5-8300H CPU, 16-GB RAM, and the 64-bit version of the Windows 10 operating system. In the following subsections, we will 1) showcase, with a simple yet illustrative example, how CE-PG is able to generate diverse strategies for a robot team with exactly the same PTB information; 2) evaluate the impact of $\beta_i$ in the two MuRES test environments and select the best $\beta_i$ value for the subsequent baseline comparison; 3) compare the performance of CE-PG, CE-PG+, and C-CE-PG in the OFFICE and MUSEUM environments; 4) benchmark CE-PG's performance against state-of-the-art MuRES solutions as well as canonical MARL baselines for the MuRES problem; 5) evaluate and compare the impact of sensor inaccuracies, i.e., $\eta_{fn}$ and $\eta_{fp}$, on the performance of CE-PG and other baseline algorithms; and 6) compare CE-PG's robustness with that of the MARL baselines when one or multiple robots malfunction during deployment.

A. Simple Use Case
In this subsection, we construct a simple yet illustrative multirobot search scenario, as shown in Fig. 4(a), to illustrate the per-episode training processes of two robots and how the two robots with exactly the same PTB information learn to make different/diversified decisions due to the cross-entropy regularization term.
In Fig. 4(a), two robots, starting at the robot depot, try to search for the target, which starts at the target spot. Both the robots and the target have two actions, i.e., go to "A" or go to "B," and the simulation stops after one time step. If any robot resides in the same node as the target, e.g., both Robot 1 and the target reside in node "A" after one time step, a reward of 1 is given to the robot team. Otherwise, the reward is 0. The target has a 60% probability of going to "A" and a 40% probability of going to "B," and we set $\beta_i = 0.5$ for both robots. Fig. 4(b) shows one representative learning curve of both robots' policies (since each robot can only select "A" or "B," we simply plot the probability that each robot's policy selects "A"). In the figure, we can see that, as the number of episodes increases, the robots separate their action selection preferences so as to simultaneously cover both "A" and "B." On the other hand, Fig. 4(c) shows the learning curves of vanilla PG without the cross-entropy regularization term, obtained by simply setting $\beta_i = 1$. In the figure, we can see that, without the cross-entropy regularization, both robots converge to "A" for the larger expected return. Note that, for this specific example with the regularization term, which robot ultimately converges to "A" depends on the randomly initialized policy parameters. In theory, Robot 1 and Robot 2 have equal probabilities of converging to "A," but they cannot concurrently converge to "A" due to the cross-entropy term, and Fig. 4(b) shows just one representative learning curve.
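The dispersing effect in the regularized case can be reproduced with a few lines of code. The following is a minimal sketch, not the training procedure used for Fig. 4: it assumes each robot's policy is a single sigmoid parameter (the probability of selecting "A"), replaces sampled episodes with the closed-form expected team reward, updates both robots simultaneously, and takes the objective as $\beta_i J + (1-\beta_i) H$ with one other robot. The initial parameters (0.2 and -0.2), the step size, and the number of iterations are arbitrary illustrative choices.

```python
import math

P_TARGET_A = 0.6  # from the example: target goes to "A" w.p. 0.6, to "B" w.p. 0.4


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def team_return(p1, p2):
    """Expected team reward when robot k selects "A" with probability pk."""
    some_robot_at_a = 1.0 - (1.0 - p1) * (1.0 - p2)
    some_robot_at_b = 1.0 - p1 * p2
    return P_TARGET_A * some_robot_at_a + (1.0 - P_TARGET_A) * some_robot_at_b


def objective_gradient(p_i, p_j, beta):
    """d/d(theta_i) of beta * J + (1 - beta) * H(pi_j, pi_i), with pi_j held fixed."""
    # dJ/dp_i = P_TARGET_A - p_j, and dp_i/dtheta_i = p_i * (1 - p_i) for a sigmoid.
    grad_return = (P_TARGET_A - p_j) * p_i * (1.0 - p_i)
    # H(pi_j, pi_i) = -[p_j log p_i + (1 - p_j) log(1 - p_i)], so dH/dtheta_i = p_i - p_j.
    grad_cross_entropy = p_i - p_j
    return beta * grad_return + (1.0 - beta) * grad_cross_entropy


def train(theta1, theta2, beta=0.5, lr=0.5, episodes=200):
    """Simultaneous gradient ascent for both robots; returns their P(select "A")."""
    for _ in range(episodes):
        p1, p2 = sigmoid(theta1), sigmoid(theta2)
        g1 = objective_gradient(p1, p2, beta)
        g2 = objective_gradient(p2, p1, beta)
        theta1 += lr * g1
        theta2 += lr * g2
    return sigmoid(theta1), sigmoid(theta2)


# Assumed (arbitrary) initial parameters: robot 1 slightly prefers "A".
p1, p2 = train(theta1=0.2, theta2=-0.2)
print(round(p1, 3), round(p2, 3), round(team_return(p1, p2), 3))
```

In this sketch the cross-entropy gradient with respect to robot $i$'s parameter is simply $p_i - p_j$, so any initial difference between the two policies is amplified, and the robot that starts with the higher preference for "A" ends up covering "A" while the other covers "B."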

B. Evaluating the Impact of $\beta_i$
In this subsection, we evaluate the impact of $\beta_i$ on CE-PG's performance in the two canonical MuRES test environments, i.e., OFFICE and MUSEUM. In theory, $\beta_i$ can be set differently for different robots, which offers a wider selection range than setting $\beta_i$ to the same value across robots. However, in practice, it is difficult to tune the parameters if $\beta_i$ differs across robots. Therefore, we keep $\beta_i$ at the same value for all robots, test the MuRES performance for a range of values, i.e., $\beta_i \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$, and select the one that performs best on average for the subsequent comparison with baseline algorithms.
We set up the MuRES for a nonadversarial moving target problem in the two canonical MuRES test environments as follows: 1) the robots are initialized at node 43 for OFFICE and at node 1 for MUSEUM; 2) the target is randomly initialized in the environment according to the discrete uniform distribution and moves randomly with respect to its available actions, i.e., at each time step, the target moves with equal probability to one of the adjacent nodes or stays in the same node; 3) CE-PG's policy network for each robot is a randomly initialized three-layer neural network, with $n_i = n$ inputs, $n_h = 2 n_i = 2n$ hidden nodes, and $n_o = m + n$ output channels, with "softmax" as the activation function; 4) the maximal allowed capture time is $3n$, i.e., $t_{cap} = 3n$ if the employed MuRES algorithm fails to detect the target within $3n$ time steps; and 5) we set $\eta_{fn} = \eta_{fp} = 0$ for fairness in comparison, because neither FHPE nor FHPE-SA applies to the use case of stochastic sensors. Note that the same MuRES problem setup will be used in the remaining subsections as well. Fig. 5 shows the performance of different $\beta_i$ values across different settings of the MuRES problem. For the remaining experiments, including the comparisons with baseline algorithms, we select $\beta_i = 0.5$, which yields the best performance on average.
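The evaluation protocol in items 2), 4), and 5) can be sketched as a small simulation loop. The code below is only an illustration of the protocol, not of CE-PG itself: the six-node ring graph is hypothetical (the OFFICE and MUSEUM topologies are not reproduced here), the robots follow a placeholder random walk instead of a trained policy, and all function names are introduced for this example.

```python
import random

# Hypothetical 6-node ring used only for illustration.
ADJ = {k: [(k - 1) % 6, (k + 1) % 6] for k in range(6)}


def step_uniform(node, rng):
    """Move to an adjacent node or stay, with all options equally likely."""
    return rng.choice(ADJ[node] + [node])


def run_episode(num_robots, start_node, rng):
    """Return the capture time, capped at 3n as in the problem setup."""
    n = len(ADJ)
    t_max = 3 * n
    robots = [start_node] * num_robots
    target = rng.randrange(n)  # discrete uniform initial position
    for t in range(1, t_max + 1):
        robots = [step_uniform(p, rng) for p in robots]  # placeholder policy
        target = step_uniform(target, rng)
        if target in robots:  # perfect same-node detection (eta_fn = eta_fp = 0)
            return t
    return t_max


def mean_capture_time(num_robots, runs=1500, seed=0):
    """Average capture time over independent runs, as reported in the figures."""
    rng = random.Random(seed)
    return sum(run_episode(num_robots, 0, rng) for _ in range(runs)) / runs


print(mean_capture_time(num_robots=2))
```

Replacing `step_uniform` for the robots with a learned policy would turn this loop into an evaluation harness of the kind described above; the capture-time cap guarantees that every run terminates.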

C. Comparison Among CE-PG, CE-PG+, and C-CE-PG
In Section IV-E, we extended CE-PG to CE-PG+, which fits robots with a broader field of view, e.g., aerial robots, and also considered the use case of an all-to-all communication scheme by proposing C-CE-PG. In this subsection, we compare the performance of CE-PG, CE-PG+, and C-CE-PG in the two MuRES test environments, following the problem setup stated in the last subsection. Fig. 6 shows the comparative simulation results in both OFFICE and MUSEUM. In the figure, we can see that both C-CE-PG and CE-PG+ perform better than CE-PG, in that CE-PG+ augments the agent's sensing capability and C-CE-PG endows each agent with online collective PTB information, which is more accurate than the individual one.

D. Performance Comparison With State of the Art
This subsection compares CE-PG's performance and efficiency with state-of-the-art MuRES solutions as well as the canonical MARL algorithms. Fig. 7(a) and (b) shows the performance comparison of CE-PG with the state of the art in OFFICE and MUSEUM, respectively. We run each experiment 1500 independent times and report each algorithm's mean target capture time ($t_{cap}$) for different numbers of robots. In the figure, we can see that: 1) as the number of robots increases, the mean target capture time of all the algorithms decreases, which is reasonable; and 2) CE-PG ranks among the top three algorithms across different MuRES problem setups; moreover, the gap between CE-PG's performance and the best baseline performance is barely discernible. Here, we wish to note that the target's motion dynamics and the target's initial position distribution are not provided to CE-PG as inputs; instead, they are learned while the robots interact with the environments in the offline CT stage. For FHPE, MILP, and the corresponding distributed versions, both the target's motion dynamics ($\Gamma$) and the target's initial position distribution ($b_0$) are provided as inputs, which is unrealistic in certain application scenarios.
The decision-making time comparisons for the two environments are presented in Fig. 8(a) and (b), respectively. From the figures, we can see that the decision-making time of all MARL algorithms, including CE-PG, is significantly smaller than that of FHPE, MILP, and their variations in both MuRES test environments. The underlying reason is that, for learning-based algorithms, the decision-making time during deployment depends solely on the neural network's forward computation time, which is usually in the range of milliseconds, while for planning-based methods, such as FHPE and MILP, the decision-making process involves solving the formulated mathematical problem with enumeration or commercial solvers, which incurs nonpolynomial complexity and costs a lot of time, as shown in Fig. 8.

E. Impacts of Sensor Characteristics: $\eta_{fn}$ and $\eta_{fp}$
In the last subsection, we compared the performance and efficiency of CE-PG with the state of the art in two canonical MuRES test environments. However, we set $\eta_{fn} = \eta_{fp} = 0$ for comparison fairness. In this subsection, we evaluate the impacts of $\eta_{fn}$ and $\eta_{fp}$ on the performance of CE-PG. Fig. 9(a) and (b) shows the performance trends of CE-PG with different $\eta_{fn}$ values for the two MuRES test environments when we set $\eta_{fp} = 0$. We run the same experimental setup 1500 independent times and report the target's mean capture time as the performance indicator.
In Fig. 9, we observe that, with the other parameters of the experimental setup, i.e., $N$ and $\beta_i$, held fixed, increasing the value of $\eta_{fn}$ increases the target's mean capture time. This is reasonable, in that the larger $\eta_{fn}$ is, the more inaccurate the related sensor is, which results in a longer target search time on average. On the flip side, the impact of $\eta_{fp}$ is not on the target's capture time but on the overall multirobot system's false alarm rate. Here, we define the multirobot system's false alarm rate as the total number of false alarms raised by the system. Fig. 10(a) and (b) shows the trends of CE-PG's false alarm rate with different $\eta_{fp}$ values for the two MuRES test environments while setting $\eta_{fn} = 0$.
From Fig. 10, we can see that as $\eta_{fp}$ increases, the multirobot system's false alarm rate increases. Another observation is that the number of robots does not affect the false alarm rate: although more robots result in a shorter search time, each robot is subject to a certain false alarm probability, so the overall system's false alarm rate does not decrease. In addition, note that in both experiments, we set the maximum value of both $\eta_{fn}$ and $\eta_{fp}$ to 0.5, which means that the sensor's false negative detection probability ($\eta_{fn}$) does not exceed its true positive detection probability ($1 - \eta_{fn}$), and similarly, the sensor's false positive detection probability ($\eta_{fp}$) does not exceed its true negative detection probability ($1 - \eta_{fp}$).
So far, the evaluation of the impacts of sensor characteristics, i.e., $\eta_{fn}$ and $\eta_{fp}$, on CE-PG is stand-alone, without comparisons with baseline algorithms, and it is more convincing to compare the related performance with that of the baselines. Since none of the selected baseline algorithms considers $\eta_{fp}$, we cannot perform the comparative evaluation for $\eta_{fp}$. For $\eta_{fn}$, we set the number of robots to $N = 4$, evaluate the performance of CE-PG and the baseline algorithms for a range of values, i.e., $\eta_{fn} \in \{0.0, 0.1, 0.3, 0.5\}$, and report the comparative results in Fig. 11. In the figure, we can see that CE-PG ranks among the top three algorithms (sometimes MILP or D-MILP performs better) for different $\eta_{fn}$ values in the two environments. The underlying reason is that CE-PG takes $\eta_{fn}$ into account during the PTB derivation process, while most baseline algorithms (except for MILP and D-MILP) do not consider the effect of $\eta_{fn}$ explicitly.
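The roles of $\eta_{fn}$ and $\eta_{fp}$ can be made concrete with a small sensor sketch. The code below is a simplified illustration under stated assumptions, not the sensor model used in the experiments: it assumes independent readings at every visit, and the function names are hypothetical. It shows that a robot co-located with the target needs, on average, $1/(1 - \eta_{fn})$ readings before a detection, which is one way to see why the capture time grows with $\eta_{fn}$.

```python
import random


def observe(target_present, eta_fn, eta_fp, rng):
    """Return True if the sensor reports a detection at the robot's node."""
    if target_present:
        return rng.random() >= eta_fn  # missed with probability eta_fn
    return rng.random() < eta_fp       # false alarm with probability eta_fp


def mean_readings_until_detection(eta_fn, trials=20000, seed=0):
    """Average number of co-located readings needed before the target is detected."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        readings = 1
        while not observe(True, eta_fn, 0.0, rng):
            readings += 1
        total += readings
    return total / trials


for eta in (0.0, 0.1, 0.3, 0.5):
    # The empirical mean should be close to the geometric mean 1 / (1 - eta).
    print(eta, round(mean_readings_until_detection(eta), 2), round(1.0 / (1.0 - eta), 2))
```

At the largest value considered, $\eta_{fn} = 0.5$, a co-located robot needs about two readings on average, whereas with $\eta_{fn} = 0$ a single reading always suffices.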

F. Robustness Evaluation With Random Robot Failures
As we have claimed in Section IV, one of the main characteristics of CE-PG is the DE, which means that during the execution process, the robots do not need to communicate with each other and commit to a predetermined policy for the target search process. During the execution process, the robots do not rely on other robots' information to make decisions, which makes CE-PG naturally robust to random robot failures. In this subsection, we evaluate and compare the robustness of CE-PG with the state of the art by randomly withdrawing a robot (to mimic a malfunctioning robot) from the multirobot system. Fig. 12(a) and (b) shows the comparative robustness results of CE-PG and other MARL algorithms. In the figure, we can see that CE-PG has the best performance in various settings when we withdraw one robot from the robot team during execution. Note that Fig. 12(a) and (b) does not include planning-based algorithms, such as FHPE, FHPE-SA, MILP, and D-MILP, in that planning-based methods inherently assume that complete knowledge of all the functioning robots is available at all times, and they cannot be applied to the case with one or more malfunctioning robots.
A more challenging scenario for the robustness evaluation of the MuRES problem is the multirobot system's performance in the face of multiple malfunctioning robots. In Fig. 13, we evaluate and compare the MARL algorithms' performance when we withdraw multiple robots from the environment during the execution stage (we set the initial number of robots to $N = 5$). In the figure, we can see that CE-PG achieves the best performance in the two MuRES test environments. We conjecture that the underlying reason is that CE-PG endows the robots with a completely DE policy without communication or coordination during execution, while all the other MARL algorithms work on top of a common factorized value function, which changes greatly with the varying number of functioning robots.

VI. EXPERIMENTAL RESULTS AND ANALYSIS
So far, we have conducted simulation comparisons between CE-PG and the state of the art for the MuRES problem in the OFFICE and MUSEUM environments. Next, we deploy CE-PG to a real multirobot system and test its functionality in a self-constructed indoor environment for the moving target search.
The autonomous robot testbed is a DM3008 differential drive robot, as shown in Fig. 14(a), with an embedded single-beam LiDAR (LDS-50C-2) for map construction and obstacle detection. The DM3008 robot already offers simultaneous localization and mapping functionality, as well as an autonomous navigation and obstacle avoidance module, which navigates the robot within the preconstructed map. We integrate the CE-PG algorithm, which essentially assigns the next-step goal position to the robot, into DM3008. The moving target is a C30 differential drive robot, as shown in Fig. 14(b), which selects its next-step goal position randomly. The indoor environment, as shown in Fig. 15(a), mimics one half of the MUSEUM test environment. The mapping result of the indoor environment is shown in Fig. 15(b), with labeled topological grids for the trace representation in Table III. In the experiment, we deploy three DM3008 robots into the environment and let them cooperatively search for the randomly moving target. The robots are initialized at node 1, and the moving target is initialized randomly in the environment. A demonstrative video is uploaded together with the manuscript and is publicly available; here, we present ten sets of the multirobot moving target search results in Table III. In the table, we can see that with three DM3008 robots, we can capture the randomly moving target within the indoor environment in approximately 5.1 time steps on average. Fig. 16 shows a snapshot of the operating scenario of the multirobot search process, where the left half of the figure visualizes the status information from the robot operating system perspective, the three top right subfigures show the DM3008 robots' local vision, and the bottom right subfigure shows the bird's-eye view of the real multirobot search process.

VII. CONCLUSION
This article presented CE-PG for the MuRES problem. CE-PG adopted the CTDE scheme to train and deploy the learning agents into the environments. In addition, during the execution stage, each CE-PG agent used the online Bayesian update procedure to estimate the probabilistic target distribution beliefs and used them together with its position information as the decision-making basis. We evaluated and compared the performance of CE-PG with state-of-the-art multirobot search algorithms in two canonical MuRES test environments, deployed CE-PG to a real multirobot system (three DM3008 differential drive robots), and demonstrated the multirobot search process in a self-constructed indoor environment with satisfying results.
Currently, CE-PG applies to the multirobot search for a nonadversarial moving target problem. In the future, we would like to extend CE-PG to the use case of an adversarial moving target, which acts to avoid being captured by the robots. Moreover, we are also keen on extending CE-PG to a heterogeneous team of robots, which possess different motion and vision capabilities. Currently, CE-PG does not apply to a robot team with heterogeneous motion dynamics, in that the cross-entropy regularization term assumes that all robots possess the same motion capabilities when residing in the same node. Meanwhile, we are also interested in designing a completely decentralized but coordinated multirobot search algorithm, which does not need any centralized computation and coordinates the robot team on the fly with only local information.

APPENDIX A PROOF OF THEOREM 1
Proof: We prove Theorem 1 in its element form as follows:

APPENDIX C PROOF OF THEOREM 3
Proof: We prove Theorem 3 in its element form as follows: