Acquisition of Inducing Policy in Collaborative Robot Navigation Based on Multiagent Deep Reinforcement Learning

To avoid inefficient movement or the freezing problem in crowded environments, we previously proposed a human-aware interactive navigation method that uses inducement, i.e., voice reminders or physical touch. However, the use of inducement depends on many factors, including human attributes, task contents, and environmental contexts. Thus, it is unrealistic to pre-design a set of parameters, such as the coefficients in the cost function, personal space, and velocity, for every situation. To understand and evaluate whether inducement (a voice reminder in this study) is effective and how and when it should be used, we propose to investigate these questions through multiagent deep reinforcement learning, in which the robot voluntarily acquires an inducing policy suited to the situation. Specifically, we evaluate whether a voice reminder can shorten the time to reach the goal by learning when the robot should use it. Results of simulation experiments with four different situations show that the robot could learn inducing policies suited to each situation and that inducement is far more effective in more congested and narrow situations.


I. INTRODUCTION
Path planning for autonomous mobile robots has been widely studied for various applications [1]. Robot navigation in dynamic environments is actively investigated, including velocity obstacle methods [2], social force models [3], and machine learning-based methods [4]. However, these studies focus on passive-avoidance strategies (PAS) that enable the robot to avoid humans passively, so the robot often gets stuck in crowded and dynamic spaces [5]. To move adequately in such environments, the robot must use an active-inducible strategy (AIS) that can voluntarily make a path for it by encouraging an obstructing human to move. Thus, we previously proposed a human-aware interactive navigation method [6] that uses inducement, i.e., voice reminders, e.g., "Let me pass," or physical touch [7], [8]. We then developed a dynamic waypoint navigation (DWN) method [9] as a model-based path planning method and an inducible social force model (iSFM) [10] for efficient proximal crowd navigation. We confirmed that these methods enable the robot to move in crowded and dynamic spaces without getting stuck by using inducement.
The navigation systems that we previously developed adopt a rule-based deterministic approach, so the robot's behavior is largely affected by parameter settings, e.g., coefficients in the cost function, personal space, velocity, and inducement. So far, we have adjusted these parameters to suit the situations through exploratory experiments. For example, the cost of using physical touch is set large enough that the robot hardly uses it [10], but the value does not generalize. In the situation shown in Fig. 1, the use of inducement depends on many factors, including human attributes, e.g., looking at a painting; task contents, e.g., the need to hurry due to an emergency evacuation; and environmental contexts, e.g., a quiet museum, which are difficult for the robot to recognize. Thus, it is unrealistic to pre-define a set of parameters for every situation. To understand and evaluate whether navigation with active inducement is effective and how and when it should be used, we propose to analyze these questions through a multiagent reinforcement learning framework in which the robot voluntarily acquires an inducing policy suited to the situation. Learning systems enable the agent to define its own rules based on the obtained data. To develop a decision-making system for our purpose, multiagent deep reinforcement learning (MDRL) [11] would be effective. Recently, MDRL has been used to learn both the control and strategic aspects of soccer games [12], to learn crowd dynamics and determine efficient paths [13], and to develop an intersection management system for automated vehicles [14]. These studies consider the relationship between agents, but they consider neither how agents could directly interact with one another to reach their goals nor active inducement. As an idea similar to inducement, approaches that learn communication for multiagent cooperation have been proposed [15], [16], [17], [18], [19]. In these approaches, agents share or receive action intentions through a dedicated communication channel; in our approach, agents can directly use inducement as a voluntary action, which enables the robot to acquire a more realistic cooperative inducement policy.
In summary, no learning systems that can evaluate the effectiveness of using inducement exist. We model navigation as a cooperative task that focuses on the convergence of the entire group rather than the individual. We thus modify an MDRL-based navigation method that follows partially observable Markov decision processes [20] by introducing an inducement system. We also measure when and how often inducements are used to examine how the environment affects an agent's desire to use inducement. Our main contributions are:
• To present the first trial to evaluate an inducement strategy in robot navigation through a multiagent deep reinforcement learning framework.
• To evaluate if inducement, i.e., voice reminder, can improve overall efficiency and goal convergence by comparing collective goal convergence times.
• To determine when to use voice reminders by recording their usage frequency and examining inference models.

II. DEVELOPMENT TOOLS AND SETTINGS
In this section, we introduce the tools, frameworks, and network parameters used while training and testing our policies.

A. PRIMARY TOOLS
The primary tools for development are Unity's ML-Agents Toolkit (Release 18), Unity's game engine, and PyTorch (1.10.1). The Unity ML-Agents Toolkit enables Unity to act as an environment for training agents. It includes state-of-the-art algorithms, a Python API for training agents, a gym wrapper, and a built-in inference engine to test trained policies. Unity's engine is used for the simulation environment [21], and PyTorch is the machine learning framework. Unity's scripting API is used to randomly generate agent positions and spawns, and to detect an agent's arrival at its goal and its collisions with other objects.

1) TRAINING ALGORITHM
For the training algorithm and architecture, we used Unity's Multi-Agent Posthumous Credit Assignment (MA-POCA), which is a multiagent trainer that trains a centralized critic for a group of agents [22]. The benefit of using MA-POCA is that it enables group rewards alongside individual rewards. Group rewards are imperative when recreating a cooperative task such as path planning, since they incentivize agents to take selfless actions that prioritize the team's goal over their own individual goals. MA-POCA builds on COunterfactual Multi-Agent Policy Gradients (COMA) [23] but uses self-attention over active agents in the critic network, in place of a fully connected layer with absorbing states, to address the posthumous credit assignment problem. Moreover, MA-POCA accepts a variable number of inputs, and the network can compute a counterfactual baseline using self-attention [24]. MA-POCA uses temporal difference learning [25] to calculate targets for the value function and baseline updates. The above algorithm and architecture follow the framework of centralized training and decentralized execution, which enables policies to use extra information during training.
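For reference, standard forms of the counterfactual advantage used in COMA-style centralized critics and of a TD(λ)-style value target are sketched below; these are the textbook expressions, not necessarily the exact formulation implemented in MA-POCA, which is detailed in [22], [23]:
\[
A^{i}(s,\mathbf{a}) = Q(s,\mathbf{a}) - \sum_{a'^{i}} \pi^{i}\bigl(a'^{i}\mid o^{i}\bigr)\, Q\bigl(s,(\mathbf{a}^{-i},a'^{i})\bigr),
\qquad
G^{\lambda}_{t} = r_{t} + \gamma\Bigl[(1-\lambda)\,V(s_{t+1}) + \lambda\, G^{\lambda}_{t+1}\Bigr],
\]
where \(\mathbf{a}^{-i}\) is the joint action of all agents except agent \(i\), \(o^{i}\) is agent \(i\)'s observation, and \(G^{\lambda}_{t}\) serves as the target for the value and baseline updates.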

2) NETWORK SETTINGS
We used the hyperparameters listed in Table 1. The values were basically chosen from the common configurations for POCA in ML-Agents' training configuration documentation [26], followed by tuning until the policy statistics (value estimates and entropy) were stable. For the network settings, we used 256 hidden units, three layers, and HyperNetworks [27], which condition the policy on goal observations. Alongside the extrinsic rewards explained in the reward function section, we also utilized curiosity [28] for training our agents, as our extrinsic reward is sparse.
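As a concrete illustration, a minimal trainer configuration in the spirit of the ML-Agents documentation might look as follows. It is shown here as a Python dictionary mirroring the YAML structure passed to mlagents-learn; the hyperparameter values other than those stated above are common documentation defaults, not the values in Table 1, and the curiosity strength is an assumption.

```python
# Illustrative sketch of an ML-Agents POCA trainer configuration.
# Only hidden_units, num_layers, goal conditioning, curiosity, and max_steps
# come from the text; the remaining values are assumed documentation defaults.
poca_config = {
    "trainer_type": "poca",
    "hyperparameters": {
        "batch_size": 1024,        # assumption (documentation default)
        "buffer_size": 10240,      # assumption (documentation default)
        "learning_rate": 3.0e-4,
        "beta": 5.0e-3,            # entropy regularization strength
        "epsilon": 0.2,            # policy-update clipping parameter
        "lambd": 0.95,             # TD(lambda) / GAE parameter
        "num_epoch": 3,
    },
    "network_settings": {
        "hidden_units": 256,
        "num_layers": 3,
        "goal_conditioning_type": "hyper",  # HyperNetworks conditioned on goal observations
    },
    "reward_signals": {
        "extrinsic": {"gamma": 0.99, "strength": 1.0},
        "curiosity": {"gamma": 0.99, "strength": 0.02},  # intrinsic reward for the sparse task
    },
    "max_steps": 45_000_000,  # minimum training steps stated in Section II-C
    "time_horizon": 64,       # assumption (documentation default)
}
```

In practice, this block would sit under a named behavior in the YAML file supplied to the mlagents-learn command.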

C. TRAINING AND TESTING
The models were trained on a computer with an NVIDIA RTX 3090 graphics card, an AMD Ryzen 9 5950X 16-core processor, and 128 GB of RAM. All models were trained to convergence in a variant of the scenario with randomly generated agent spawns, agent orientations, and goal spawns for a minimum of 45 million training steps, with each episode lasting 30 seconds unless all agents crash (1st training). As Unity's engine runs at 60 frames per second and each episode is 1800 steps long, this equates to 25,000 episodes of training. After training the agents on the randomly generated variants, we trained the non-inducement- and inducement-based models equally for 3,000 episodes on the set spawn variant (2nd training). This training duration was kept short to avoid overfitting. After training, we ran a final inference test of the policy for 1,000 episodes and evaluated the group convergence time, collision count, and inducement usage across all 1,000 episodes. Training was executed in parallel with 16 environments and a 20× time scale. Initial training, secondary training, and the inference test took approximately 6 hours, 45 minutes, and 12 minutes, respectively.

III. IMPLEMENTATION
This section elaborates on the agent model and action space, the types of observations, the environments used to evaluate the effectiveness of inducement, and how the inducement interaction is rewarded and punished.

A. AGENT MODEL AND ACTION SPACE
1) AGENT MODEL
One of the limitations of combining MA-POCA with group rewards is that all agents in a group must be homogeneous. Thus, all agents are identical and are modeled as a group of agents that try to reach their own goals. All agents, as shown in Fig. 2, are modeled to be human-like in terms of field of vision and velocity. Each agent has a 0.85 m (width) × 0.65 m (depth) × 1.90 m (height) collider and uses Unity's collider and collision classes to check whether the agent is on its goal or colliding with another agent or a wall.

2) AGENT ACTION SPACE
All agents feature a discrete action space with three branches, i.e., linear movement, rotation, and inducement, as listed in Table 2. The agent has five possible actions in the linear movement branch: forward, backward, right, left, and no movement. The forward movement speed was determined based on the average adult human walking speed (1.4 m/s), while the other values were halved, as humans typically walk forward and rotate as necessary. The rotation branch has both clockwise and counterclockwise rotations. Inducement enables agents to explicitly interact with other agents. There are no direct velocity implications behind the use of a voice reminder, but agents are rewarded and punished based on their reactions. To reach the above velocities, the agent selects the respective action repeatedly throughout the 60 frames.

B. OBSERVATIONS
1) RAY OBSERVATION
Each agent's ray sensor covers a horizontal band of vision to replicate human vision and has 15 equally spread 5-m-long rays, as shown in Fig. 2 (a). The sensors send out a set of rays and output values based on the relative distance between the sensor and the tagged object (e.g., the blue agent is tagged as Tag 1).
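As a concrete illustration of this sensor output, the sketch below reproduces the general per-ray encoding used by ML-Agents ray sensors: a one-hot vector over detectable tags, a miss flag, and a normalized hit distance. The tag names and exact element ordering shown here are assumptions, not taken from the paper.

```python
import numpy as np

TAGS = ["agent", "wall", "goal"]  # detectable tag names are assumptions

def encode_ray(hit: bool, hit_tag: str | None, hit_fraction: float) -> np.ndarray:
    """Encode one ray as [tag one-hot ..., miss flag, normalized hit distance]."""
    obs = np.zeros(len(TAGS) + 2, dtype=np.float32)
    if hit and hit_tag in TAGS:
        obs[TAGS.index(hit_tag)] = 1.0      # which tagged object the ray hit
    obs[len(TAGS)] = 0.0 if hit else 1.0    # flag set when the ray hit nothing
    obs[len(TAGS) + 1] = hit_fraction       # hit distance / 5-m ray length, 1.0 on a miss
    return obs

# The full ray observation concatenates 15 such rays spread across the band of vision.
```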

2) VECTOR OBSERVATION
There are five vector observations, which are necessary for the agents to learn how to reach their respective goals, use voice reminders, and react to them. The first two vector observations are the agent's relative distance to its goal ([0, 1]) and the agent's relative angle to its goal ([−1, 1]), as shown in Fig. 2 (b). Unlike other models that give the absolute positions of both the agents and their goals [29], [30], our agents are only provided relative scalar values, to better match conditions in a real environment. Fig. 2 (c) shows the state when an agent has reached its goal.
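A minimal sketch of how these two relative scalar observations could be computed is shown below; the normalization range and sign convention are assumptions, and the actual implementation uses Unity's scripting API.

```python
import math

def goal_observations(agent_xy, agent_heading_rad, goal_xy, max_range_m=20.0):
    """Return (relative distance in [0, 1], relative angle in [-1, 1]).

    max_range_m is an assumed normalization constant for the environment size.
    """
    dx, dy = goal_xy[0] - agent_xy[0], goal_xy[1] - agent_xy[1]
    distance = min(math.hypot(dx, dy) / max_range_m, 1.0)   # normalized goal distance
    angle = math.atan2(dy, dx) - agent_heading_rad           # angle from heading to goal
    angle = (angle + math.pi) % (2 * math.pi) - math.pi      # wrap to [-pi, pi]
    return distance, angle / math.pi                         # scale angle to [-1, 1]
```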

3) INDUCEMENT OBSERVATION
As shown in Figs. 2 (d) and (e), the agents' inducer and inducee states are represented by green and red exclamation marks, respectively. The inducer state indicates that the agent has used a voice reminder and initiated the inducement event, while the inducee state indicates that the agent is a recipient of a voice reminder. Agents are rewarded and punished based on their reaction to an inducement, so the agents must be given this information when they are being induced.
The rationale behind these observations is to replicate human acknowledgment when being addressed. Notably, and as in real life, each agent is oblivious to the goal positions of all the other agents.

C. ENVIRONMENT
Our training and test environments cover four different situations: a baseline (S1), two efficiency situations (S2-1, S2-2), and a convergence situation (S3).

1) BASELINE (S1)
This situation simulates a standard open environment with plenty of space to move around, as shown in Fig. 3 (a).
As most conventional multiagent path planning studies focus on such open environments [29], [30], this situation examines the effects inducement can have in the standard setting.

2) EFFICIENCY (S2)
This situation tests efficiency in two variants. In the first, as shown in Fig. 3 (b), the agent is tasked with taking the most efficient route when small detours are present. In the second, as shown in Fig. 3 (c), the agent is limited to two paths, i.e., an inefficient detour route or the shortest route, which is blocked by the blue agent.

3) CONVERGENCE (S3)
This situation simply tests the agents' ability to overcome a single passage blocked by two agents by encouraging them to move with voice reminders, as shown in Fig. 3 (d).

D. INDUCEMENT INTERACTION
Inducement is modeled as an event that rewards and punishes agents based on how they react over the inducement duration. Inducement begins whenever an agent uses the voice reminder action while there are other agents within a 2.5-m radius to be induced, and the event lasts 2 seconds, as shown in Fig. 4 (a). Conditions to start the inducement include:
• The agent is not already being induced.
• The 8-second voice reminder cooldown has elapsed.
The acceptable range for how often agents can use inducement is highly subjective, but in this study we selected the above cooldown time to evaluate the effectiveness of inducement and its proper use. Agents cannot use voice reminders while they are receiving another agent's inducement, but agents can be induced by multiple agents simultaneously. Once the inducement event begins, the algorithm temporarily logs, before and after the event, the relative distance between the relevant agents and the distance that the inducee moved during inducement. Paired with the relative distances between each agent and its goal, the reward function determines whether the inducer and inducee(s) acted properly. The conditions for a successful voice reminder are:
• The inducer moves at least 1 m closer to its goal, and the inducer's goal is farther than the recipient's.
• At least one other agent was part of the inducement event.
The conditions for a successful response are:
• The relative distance between both agents increases by at least 0.5 m so that the inducer can pass.
• The inducee contributed to the inducement event by moving at least 1 m.
Both agents can be the inducer and inducee when they use voice reminders at the same time, as shown in Fig. 4 (b). This scenario is indicative of real-world interaction. In this case, the agents can obtain the rewards for being both an inducer and an inducee, if all requirements are satisfied. A sketch of these checks is given below.
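The reward-relevant checks described above can be summarized in a short sketch; the thresholds are those stated in the text, while the data structure and function names are illustrative only, not the authors' implementation.

```python
def inducement_outcome(inducer, inducees):
    """Classify an inducement event after its 2-s duration (illustrative sketch).

    Each record is a dict holding distances measured before and after the event.
    """
    # Successful voice reminder: the inducer moved >= 1 m closer to its goal,
    # its goal is farther than each recipient's, and at least one agent took part.
    inducer_ok = (
        len(inducees) >= 1
        and inducer["goal_dist_before"] - inducer["goal_dist_after"] >= 1.0
        and all(inducer["goal_dist_before"] > e["goal_dist_before"] for e in inducees)
    )
    # Successful response: the pairwise gap opened by >= 0.5 m so the inducer can
    # pass, and the inducee itself moved at least 1 m during the event.
    inducee_ok = {
        e["id"]: (e["gap_after"] - e["gap_before"] >= 0.5 and e["moved_dist"] >= 1.0)
        for e in inducees
    }
    return inducer_ok, inducee_ok
```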

E. REWARD FUNCTION
The reward function is set to focus on group convergence over the individual by prioritizing agents with farther goals. The rewards and punishments, except for those related to inducement, were set based on the best practices for stable training [31]: a maximum reward magnitude of no greater than 1.0 and a small per-step penalty to encourage faster convergence. Table 3 lists the relationship between actions and rewards. Agents are rewarded for each step that they remain on their goal, and the reward is linearly distributed across that time. Specifically, if an agent stays on its goal for 180 frames (3 s), it receives a reward of 0.1, and the maximum is 1.0 (30 s). Collisions incur an immediate penalty. Proper usage of voice reminders rewards the inducer, while poor usage punishes it. The response to voice reminders is also rewarded and punished accordingly. The values for inducement usage and responses were scaled to be slightly more impactful than the reward for goal arrival, since inducement asks for a submissive movement away from the inducee's goal. The reward for staying on the goal for 3 s is 0.1, so the reward for a successful reaction to a voice reminder (2 s) is 0.12, compensating the inducee for relocating off its goal. Moreover, we set a group punishment of −0.0055 per agent for every step on which not all agents are on their goals, to ensure timely group convergence; the maximum accumulated group penalty over a full episode is −1.0. Ending the punishment requires that all agents be on their goals, so an agent must learn to avoid crashing, since group convergence would then no longer be possible.
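The following is a compact sketch of this reward schedule; the collision penalty comes from Table 3 and is therefore omitted, and the per-frame scaling is an assumption consistent with the totals stated above.

```python
EPISODE_FRAMES = 1800                  # 30-s episode at 60 fps

GOAL_REWARD_PER_FRAME = 0.1 / 180      # 0.1 per 3 s on the goal, up to 1.0 over 30 s
INDUCEMENT_REWARD = 0.12               # successful voice reminder or successful reaction
GROUP_PENALTY_PER_STEP = -0.0055       # while any agent is off its goal (value from the text)

def step_rewards(on_goal: bool, all_on_goal: bool) -> tuple[float, float]:
    """Per-step individual and group reward components (illustrative sketch)."""
    individual = GOAL_REWARD_PER_FRAME if on_goal else 0.0
    group = 0.0 if all_on_goal else GROUP_PENALTY_PER_STEP
    return individual, group
```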

IV. EXPERIMENTS
The results are based on the collective group of agents, but the evaluation of the effectiveness of inducement focuses primarily on the white agent, as it is presented with the most complex routes, specifically in S2 and S3. The reported collision counts are those that occur over one episode summary (50 episodes). The success rate was calculated by checking whether all agents were on their goals before the end of the episode (30 s).

A. OPEN ENVIRONMENT (S1)
We ran three experiments, ideal, non-inducement, and inducement, as shown in Figs. 5 (a)-(c). In ideal, the agents are trained on the set spawn positions for the entire training time to examine the effects of training on the test environment as well as to find the lower bound for arrival times. As shown in Fig. 5 (a), the agents learned the trends of all the other agents' movements, which led to a spiral movement toward the goal. This movement is ideal but quite unrealistic, as it requires the group of agents to move with such synchronization. Fig. 5 (b) shows a more normal movement, with all agents moving directly toward their goals until they were close to one another. Once the agents were close to one another, several agents waited while others continued to move toward their goals. In Fig. 5 (c), while several agents simply avoided the others owing to the ample space, others used inducement to obtain a better path. Fig. 5 (d) shows the average arrival time and the number of collisions. The results show convergence times of 5.3, 7.4, and 8.0 s for ideal, non-inducement, and inducement, respectively. Collisions occurred only twice in the non-inducement and inducement cases throughout 1,000 episodes, and the success rates were 98.4% for non-inducement and 97.9% for inducement.
In summary, the introduction of inducement was not beneficial to agents in an open environment. In fact, it slightly harmed the overall efficiency of the group by 7.9%.

B. EFFICIENCY ENVIRONMENT-SMALL DETOUR (S2-1)
We tested two experiments, inducement and non-inducement. In non-inducement, we found that the agent took multiple routes despite having the same spawn location and goals. In the first route, which was selected most often (>90%), as shown in Fig. 6 (a), the target agent took the upper path and passed by the other three agents before reaching its goal. In rare cases, as shown in Fig. 6 (b), the agent decided to wait for the blue agent's awareness and take the shorter route. In inducement, Fig. 6 (c) shows that the target agent always took the center route and induced the blue agent once it was in range. The results in Fig. 6 (d) show that inducement was 37% more efficient than non-inducement. In addition, agents with a voice reminder (3.16) had less than half the average collisions compared with non-inducement (8.08). The primary reason for collisions in non-inducement was that four out of the five agents were consistently interacting within a closed space, which led to agents backing up into walls due to their limited field of vision. Overall, the success rates for non-inducement and inducement were 83.5% and 92.2%, respectively.
The main outcome of the experiment was that in cases where there are small detours, the use of inducement could be beneficial but is highly dependent on the degree of the detour.

C. EFFICIENCY ENVIRONMENT-LARGE DETOUR (S2-2)
We ran another test with the upper path closed off to see how the agents would react. We found that both policies defaulted to taking the center route every time. In non-inducement, the agents never explored the inefficient route located at the bottom of Fig. 7 (a); rather, they decided that the best course of action was to 'wait' for the blue agent's awareness or to try to squeeze by. For inducement, as we expected, Fig. 7 (b) shows that the agents took the same route as in S2-1, as the target agent never had to wait for awareness. With the top path closed, the arrival time for non-inducement was 126.2% longer than that of inducement: the inducement agents arrived in 5.5 s on average, while non-inducement took 12.4 s, as shown in Fig. 7 (c). The collisions were similar to S2-1, but the success rate for non-inducement (92.2%) was much closer to that of inducement (92.7%), at the expense of efficiency. This increase in success rate is due to the reduction in agents passing through the highly congested upper path.
In scenarios with few paths and large detours, the integration of inducement had a much more noticeable effect on the average convergence time of agents.

D. CONVERGENCE ENVIRONMENT (S3)
This scenario was designed to mimic a situation in which two unaware people are talking in front of the only passage, although the agents were able to detect the target agent. This mutual awareness alone was insufficient for the model to converge to a policy where all agents could regularly reach their goals. In most episodes of non-inducement, the blue and purple agents would block the path, unaware of the target agent's desire, as shown in Fig. 8 (a). For inducement, Fig. 8 (b) shows that the agent could explicitly interact with the other agents to create a space sufficient to pass. Fig. 8 (c) shows that, with a voice reminder, the agents could collectively reach their goals in 17.4 s. The agents with non-inducement were seldom able to reach all three goals simultaneously; a 30-s arrival time indicates that the agents could not lower their arrival time and the episode timed out. The average collisions were higher for inducement (13.2) than for non-inducement (10.2). As this environment was the most restrictive, agents had little space to move when being induced. The main cause of failed convergence and the larger number of collisions was induced agents backing up into the wall. The success rate for non-inducement was 2.45% (there were extremely rare cases where the agent squeezed through the gap), and that for inducement was 66.2%. The major reason the success rate was lowered is that one of the two induced agents crashed during inducement. An approach to solve this issue would be to model the action of people looking around when being induced.

1) INDUCEMENT USAGE
Along with evaluating the effectiveness of voice reminders in various settings, we also measured how often the voice reminder was used across all four environments. Fig. 9 shows the number of voice reminder usages per episode for the target agent in all environments. The figure shows that inducement was used the least in the open environment (S1), used more as alternative paths became limited, and used the most in the convergence environment (S3). Note that each agent can use inducement up to three times per episode. We confirmed from these results that the effectiveness of inducement differs across environments: the number of voice reminder usages decreases in the order S3, S2-2, S2-1, and S1.

2) CONTRIBUTION AND LIMITATIONS
Here, we summarize the contributions as well as the limitations. This study is a preliminary trial to acquire an inducing policy using a multiagent deep reinforcement learning framework. The contribution is an interpretation of the effectiveness and necessity of using inducement, which can be a large step forward for updating interactive navigation. This study showed that inducement improves the time efficiency of a mobile robot. It also showed that the agents can learn the importance of inducement in restrictive environments over open ones. This is possible because the multiagent reinforcement learning framework clarifies when the agents use inducement and when they accept it. The results can be applied to parameter settings for rule-based navigation methods [10]. Meanwhile, the model we used is limited in how it rewards inducement responses. Agents are rewarded and punished for reacting and not reacting to inducement, respectively. In some situations, reacting to inducement is not suitable for an agent, so we need to define and incentivize a proper reaction. For future improvement, we will focus on heterogeneous agents, since they would enable variable responses and the modeling of different agent types, which would reinforce the need for other inducements and responses.

V. CONCLUSION
To understand and evaluate whether inducement (a voice reminder) is effective and how and when it should be used, we proposed to investigate these questions through multiagent deep reinforcement learning (MDRL), in which the robot voluntarily acquires an inducing policy suited to the situation. We introduced an inducement-based multiagent path planning algorithm that enables agents to use voice reminders to help convey their desires while traversing to their goals. We developed four different situations to evaluate how agents would implicitly and explicitly interact with one another. Experimental results validated our expectation that inducement can greatly improve group efficiency in situations where agents are unaware of another agent's presence. Inducement in open environments could be harmful to the group's efficiency, whereas the efficiency gain from inducement grows as the severity of the detour increases. Finally, with a single blocked passage, inducement was the determining factor for group convergence. The results also showed that the agents were more prone to using voice reminders when there was a lack of open space.
In future work, we plan to extend the types of explicit inducement actions by adding physical touch, to enrich the input information, such as attention to collision risk [32], and to integrate this framework into physical multi-robot systems.