Local Path Planning: Dynamic Window Approach With Q-Learning Considering Congestion Environments for Mobile Robot

In recent years, autonomous mobile robots have significantly increased in prevalence due to their ability to augment and diversify the workforce. One critical aspect of their operation is effective local path planning, which considers dynamic constraints. In this context, the Dynamic Window Approach (DWA) has been widely recognized as a robust local path planning method. DWA produces a set of path candidates derived from a velocity space subject to dynamic constraints. An optimal path is selected from these candidates through an evaluation function guided by fixed weight coefficients. However, fixed weight coefficients are typically designed for a specific environmental context. Consequently, changes in environmental conditions such as congestion levels, road width, and obstacle density could lead the evaluation function to select inefficient paths or even result in collisions. To overcome this challenge, this paper proposes dynamic weight coefficients based on Q-learning for DWA (DQDWA). The proposed method uses a pre-learned Q-table that comprises robot states, environmental conditions, and actions of weight coefficients. DQDWA can use the pre-learned Q-table to dynamically select optimal paths and weight coefficients that better adapt to varying environmental conditions. The performance of DQDWA was validated through extensive simulations and real experiments to confirm its ability to enhance the effectiveness of local path planning.


I. INTRODUCTION
As the world grapples with declining birthrates and an aging population, these demographic shifts are increasingly viewed as serious issues [1], [2]. To mitigate the resultant strain on the workforce, there has been a growing emphasis on implementing autonomous mobile robots in various contexts such as warehouses [3] and factories [4]. These robots need to navigate different environments autonomously. Therefore, the robot requires the integration of a diverse set of technologies, including localization [5], mapping [6], perception [7], and path planning [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Zheng Chen.
This paper primarily explores path planning. Path planning technology is divided into global and local path planning [9]. Global path planning generates a path from the starting point to the destination based on a pre-existing map [10], [11]. However, it fails to account for unknown or unexpected obstacles in real-world environments. Therefore, in dynamic human workspaces, robots should reach their destinations and avoid obstacles autonomously and adaptively [12], [13]. Consequently, the focus has shifted towards local path planning, which factors in the obstacles not accounted for on the pre-established maps.
With this research, we delve into local path planning considering those obstacles not included on pre-built maps [14], [15], [16], [17], [18], [19]. While dynamic obstacles are certainly a consideration [20], [21], [22], [23], this paper focuses on static environments like factories and warehouses. The Dynamic Window Approach (DWA), which accounts for dynamic constraints, has emerged as a prevalent local path planning method [24]. Despite numerous reported improvements to DWA [15], [25], [26], its limitations persist. In particular, DWA's fixed weight coefficients, which determine the optimal path based on factors such as goal position, obstacle distance, and robot velocity, fail to adapt to changes in environmental situations. This can lead to the selection of inefficient paths or even collisions, especially in confined or crowded spaces like factories and warehouses.
To address these issues, research on dynamic weight coefficients for DWA has been carried out. Abubakr et al. and Hong et al. dynamically adjusted the weight coefficients using fuzzy logic to analyze goal positions and obstacles [28], [29]. Chang et al. proposed using Q-learning to dynamically adjust the weight coefficients of DWA [30]. Q-learning is a reinforcement learning method. It does not require prior knowledge of the environment, making it suitable for robot path planning. Additionally, its learning cost is low.
Considering these advantages, this paper focuses on the Q-learning method for adjusting the weight coefficients of DWA. While the conventional method [30] adjusts weight coefficients based on goal information, velocities, and obstacles, it does not account for the spatial area and congestion rates of the environment. Depending on the specific situation, the conventional method can therefore select inefficient paths or even cause collisions. To remedy this issue, this paper proposes a dynamic weight coefficient adjustment approach based on Q-learning for DWA that accounts for environmental situations (DQDWA). DQDWA considers environmental conditions such as goal distance, goal direction, velocity, visible area, and congestion, and dynamically adjusts the weight coefficients of the evaluation function based on these conditions. Extensive simulations and experiments have been carried out to demonstrate the effectiveness and advantages of DQDWA in real-world scenarios.
The main contributions of this paper are threefold:
• This paper proposes DQDWA, which dynamically adjusts the weight coefficients of the evaluation function based on environmental conditions.
• DQDWA incorporates the concept of context-awareness, where weight coefficients are not static but dynamically adjusted according to the area of spaces and congestion levels. This approach enhances the adaptability and performance of autonomous robots in varied situations.
• The effectiveness of DQDWA has been validated through extensive simulations and real-world experiments. The results demonstrate that the proposed method outperforms traditional DWA in terms of efficiency and safety.
This paper is organized into eight sections including the current section. Sections II, III, and IV provide a comprehensive overview of the coordinate system, the Dynamic Window Approach (DWA), and Q-learning, respectively. Section V proposes Dynamic Weight Coefficients based on Q-learning for DWA (DQDWA). Sections VI and VII show the results from our simulations and real-world experiments to highlight the effectiveness and utility of DQDWA. Finally, Section VIII provides conclusions.

II. COORDINATE SYSTEM
Fig. 1 illustrates the coordinate system for the robot utilized in this study. This paper defines two coordinate systems: the local coordinate system LC and the global coordinate system GB. Quantities measured in the global coordinate system are written with the superscript GB. Variables belonging to the local coordinate system do not carry a superscript. The origin of the global coordinate system is situated at the initial position of the robot. The origin of the local coordinate system is positioned at the midpoint between the robot's wheels. As shown in Fig. 1, ( GB x, GB y) and GB θ represent the position and angle of the robot in the global coordinate system, respectively. L rob denotes the radius of the robot.
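The relation between the two coordinate systems can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `local_to_global` is hypothetical, and the local x-axis is assumed to point along the robot heading, which the text does not state explicitly.

```python
import math


def local_to_global(x_local, y_local, robot_x, robot_y, robot_theta):
    """Map a point from the local frame (LC) to the global frame (GB).

    robot_x, robot_y, robot_theta stand for (GB x, GB y) and GB theta.
    Assumption: the local x-axis is aligned with the robot heading.
    """
    cos_t = math.cos(robot_theta)
    sin_t = math.sin(robot_theta)
    # Rotate by the heading, then translate by the robot position.
    x_global = robot_x + cos_t * x_local - sin_t * y_local
    y_global = robot_y + sin_t * x_local + cos_t * y_local
    return x_global, y_global
```

With the robot at (2, 3) and heading π/2, a point one meter ahead of the robot in the local frame lies at approximately (2, 4) in the global frame.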

III. DYNAMIC WINDOW APPROACH (DWA)
A. OVERVIEW OF DWA
The Dynamic Window Approach (DWA) is a commonly used method in local path planning [24]. Initially, the velocity space with dynamic constraints (VSD) is determined based on the robot's current velocities. Subsequently, at each time step, an optimal path is selected from the VSD using an evaluation function. This optimal path selection is dependent on the weight coefficients of the evaluation function. Details about the velocity space and the optimal path selection are further elaborated in Sections III-B and III-C, respectively.

B. VELOCITY SPACE
DWA generates a velocity space with dynamic constraints, denoted as D vsd, using translational and angular velocities as illustrated in Fig. 2 (a). The velocity space D vsd is defined as follows.
\[ D_{vsd} = D_{all} \cap D_{dw} \cap D_{obs} \tag{1} \]
where D all represents the range of maximum and minimum velocities determined by the robot's specifications. D dw, known as the dynamic window, defines the range of velocities that the robot can achieve at the next time step. D obs consists of velocities that enable the robot to stop before colliding with an obstacle.
96734 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
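A minimal sketch of how the three velocity sets could be combined is given below. It is not the authors' implementation: the function names, the tuple layouts of `limits` and `accel`, and the constant-deceleration braking test for D obs are assumptions made for illustration.

```python
def dynamic_window(v_now, w_now, limits, accel, dt):
    """Intersect the specification range (D all) with the velocities
    reachable within one time step dt (D dw).

    limits = (v_min, v_max, w_min, w_max); accel = (a_v, a_w).
    Both layouts are hypothetical.
    """
    v_min, v_max, w_min, w_max = limits
    a_v, a_w = accel
    return (
        max(v_min, v_now - a_v * dt),
        min(v_max, v_now + a_v * dt),
        max(w_min, w_now - a_w * dt),
        min(w_max, w_now + a_w * dt),
    )


def can_stop_in_time(v, clearance, a_v):
    """D obs membership test, assuming constant braking deceleration a_v:
    the stopping distance v^2 / (2 a_v) must not exceed the clearance."""
    return v * v <= 2.0 * a_v * clearance
```

A velocity candidate would belong to D vsd when it lies inside the window returned by `dynamic_window` and also passes `can_stop_in_time` for the clearance along its predicted path.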

C. OPTIMAL PATH
The velocity space D vsd is discretized by equally dividing the range of the translational and angular velocities. This results in pairs of translational and angular velocities within the velocity space D vsd , which serve as the velocity candidates. As shown in Fig. 2 (b), DWA generates predicted paths for each velocity candidate under the assumption of constant velocity motion. Path candidates are evaluated using the following evaluation function J .
where W gol , W vel , and W obs represent the weight coefficients associated with the goal, velocity, and obstacles, respectively. c gol indicates the distance between the predicted robot position and the goal position. c vel corresponds to the current translational velocity. c obs represents the shortest distance from the predicted robot position on the path to the obstacle. The optimal path is then determined by maximizing the evaluation function J . More details on DWA can be found in [27].
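The candidate evaluation described above can be sketched as follows. This is an illustration under assumptions, not the evaluation function of the paper: a plain weighted sum is assumed, the goal distance is negated so that a larger score is better, the clearance is capped at a hypothetical `max_clearance`, and all function names, the prediction horizon, and the time step are chosen only for the example.

```python
import math


def predict_path(x, y, theta, v, w, horizon=2.0, dt=0.1):
    """Roll the pose forward with constant (v, w), as assumed for each candidate."""
    path = []
    for _ in range(int(round(horizon / dt))):
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += w * dt
        path.append((x, y))
    return path


def evaluate(path, v, goal, obstacles, weights, max_clearance=3.0):
    """Weighted sum of goal, velocity, and obstacle terms (higher is better)."""
    w_gol, w_vel, w_obs = weights
    end_x, end_y = path[-1]
    # Assumption: negate the goal distance so that maximizing J approaches the goal.
    c_gol = -math.hypot(goal[0] - end_x, goal[1] - end_y)
    c_vel = v
    # Shortest distance from any predicted position to any obstacle point,
    # capped so that an obstacle-free scene yields a finite value (assumption).
    nearest = min(
        (math.hypot(ox - px, oy - py) for px, py in path for ox, oy in obstacles),
        default=max_clearance,
    )
    c_obs = min(nearest, max_clearance)
    return w_gol * c_gol + w_vel * c_vel + w_obs * c_obs


def select_velocity(pose, candidates, goal, obstacles, weights):
    """Return the (v, w) candidate whose predicted path maximizes the score."""
    return max(
        candidates,
        key=lambda c: evaluate(predict_path(*pose, *c), c[0], goal, obstacles, weights),
    )
```

The sketch makes the role of the weights visible: with a dominant goal weight the straight candidate toward the goal wins, whereas with a dominant obstacle weight a candidate that steers away from a nearby obstacle is preferred.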

IV. Q-LEARNING
To dynamically adjust the weight coefficients of the evaluation function in DWA, this study incorporates Q-learning [31], a type of reinforcement learning method. Fig. 3 outlines the concept of Q-learning, which updates a Q-table that stores the Q-values for each action in each state. The Q-table is an m × n matrix, where m and n correspond to the numbers of states and actions, respectively. The formula to update the Q-value is defined as follows.
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{3} \]
where α and γ represent the learning rate and discount rate, respectively. R(s, a) denotes the reward for the agent, and max Q(s′, a′) denotes the maximum Q-value in the next state s′. The training process for Q-learning involves four steps, as outlined in Fig. 3:
Train1: In the current state s, the agent chooses action a using the ϵ-greedy method with the Q-table.
Train2: The agent receives the next state and reward R from the environment.
Train3: The Q-value in the Q-table is updated using (3).
Train4: Train1-Train3 are repeated until the Q-table converges to within a threshold value.
The ϵ-greedy method [32] is utilized for action selection. With probability ϵ, the action is chosen randomly, while with probability 1 − ϵ, the action with the highest expected reward is chosen. More details about Q-learning can be found in [31].
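The training steps and the ϵ-greedy selection can be sketched in tabular form as follows. The sketch uses the standard Q-learning update; the function names and the list-of-lists Q-table layout are assumptions for illustration and are not taken from the paper.

```python
import random


def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise take the action with the highest Q-value in this state."""
    row = q_table[state]
    if rng.random() < epsilon:
        return rng.randrange(len(row))
    return max(range(len(row)), key=lambda a: row[a])


def update_q(q_table, state, action, reward, next_state, alpha, gamma):
    """One tabular Q-learning update for the visited (state, action) pair."""
    best_next = max(q_table[next_state])  # maximum Q-value in the next state
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
```

For example, with a Q-table `[[0.0, 0.0], [1.0, 2.0]]`, reward 1.0, α = 0.5, and γ = 0.9, updating the entry for state 0 and action 1 with next state 1 gives 0.5 × (1.0 + 0.9 × 2.0) = 1.4.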

V. PROPOSED METHOD (DQDWA)
A. OVERVIEW OF DQDWA
This section proposes dynamic weight coefficients based on Q-learning for DWA considering environmental situations (DQDWA). While the conventional method [30] adjusts weight coefficients based on certain parameters, it does not consider visible area and congestion rate as environmental factors. Therefore, the conventional method may lead to inefficient path selection or even collisions depending on the circumstances. To address this limitation, the proposed method includes visible area and congestion as key factors in defining environmental situations.
where W dis is the weight coefficient for the goal distance and l rg is the distance between the robot and the goal.

2) DEFINITION OF STATE DIMENSION s 2 (GOAL DIRECTION)
s 2 represents the state indicating the angular difference between the robot's direction and the goal's direction. s 2 is defined as follows.
2 otherwise (6) where θ rg is the angle between the robot and the goal.

3) DEFINITION OF STATE DIMENSION s 3 (TRAVELLED DISTANCE)
s 3 is the state related to the distance that the robot will travel from its current position after one second. The travelled distance η is calculated as follows.
s 3 is defined as follows.
where V max is the maximum translational velocity of the robot.

4) DEFINITION OF STATE DIMENSION s 4 (VISIBLE AREA)
s 4 is the state that quantifies the visible area around the robot.
The divided area f i for the state s 4 is defined as follows.
where d i is the i-th distance data measured by the distance sensor, and N is the total number of distance data points.
Distance data is obtained in 2π/N radian increments in a counter-clockwise direction. The total divided area f all is calculated as follows.
s 4 is defined as follows.
where W are is the weight coefficient of the area, and F max is the maximum possible sum of the divided areas.
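One way to quantify the visible area from the distance data is sketched below. The formula is an assumption, since the definition of f i is not reproduced here: each divided area is taken as the triangle spanned by two neighbouring distance readings separated by 2π/N radians. The function names are hypothetical, and the mapping from the area ratio to the discrete state s 4 (including the role of W are) is deliberately left out.

```python
import math


def visible_area(distances):
    """Sum of triangular sectors between neighbouring range readings.

    Assumption: f_i = 0.5 * d_i * d_(i+1) * sin(2*pi/N), with the last
    reading wrapping around to the first. Requires N >= 3.
    """
    n = len(distances)
    step = 2.0 * math.pi / n
    return sum(
        0.5 * distances[i] * distances[(i + 1) % n] * math.sin(step)
        for i in range(n)
    )


def visible_area_ratio(distances, max_range):
    """Visible area relative to the largest possible area, obtained when
    every reading equals the sensor's maximum range (assumed form of F max)."""
    clipped = [min(d, max_range) for d in distances]
    return visible_area(clipped) / visible_area([max_range] * len(distances))
```

With 360 readings of 1 m each, the summed area is close to π square meters, the area of the unit circle, which gives a quick sanity check of the sector approximation.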

5) DEFINITION OF STATE DIMENSION s 5 (CONGESTION)
s 5 is the state associated with congestion. s 5 is defined based on the number of obstacles surrounding the robot.
where n fwd and n bwd are the number of sensor data points within a threshold distance D thr in the front and rear halves of the robot, respectively. n all is defined as follows.
\[ n_{all} = n_{fwd} + n_{bwd} \tag{13} \]
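The counts n fwd, n bwd, and n all can be obtained from one scan as sketched below. The indexing convention is an assumption: reading 0 is taken to point straight ahead, angles increase counter-clockwise in 2π/N steps, and the front half is the set of beams whose direction has a positive forward component. The function name is hypothetical.

```python
import math


def count_near_points(distances, d_thr):
    """Count readings within d_thr in the front and rear halves of the robot.

    Assumption: index 0 points along the robot heading and the scan
    proceeds counter-clockwise in 2*pi/N increments.
    """
    n = len(distances)
    n_fwd = 0
    n_bwd = 0
    for i, d in enumerate(distances):
        if d > d_thr:
            continue  # farther than the threshold: not counted
        angle = 2.0 * math.pi * i / n
        if math.cos(angle) > 0.0:
            n_fwd += 1
        else:
            n_bwd += 1
    return n_fwd, n_bwd, n_fwd + n_bwd
```

For an eight-beam scan `[0.3, 0.3, 2.0, 2.0, 0.3, 2.0, 2.0, 2.0]` and a threshold of 0.5 m, two near points fall in the front half and one in the rear half, so n all is 3.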

C. DEFINITION OF REWARD
To adjust the weight coefficients of the evaluation function with Q-learning, the reward R is defined as follows.
Note that the initial value of R is set to 0. R 1 is a reward related to the result: goal or collision. R 3 is a reward related to the distance from the obstacle.
In Fig. 6 (a)-(c), Env. 1-3 were designed to examine the differences in robot behavior due to crowding within restricted spaces. Fig. 6 (d) shows Env. 4, which was established to investigate robot behavior in a spacious area filled with numerous obstacles. In Fig. 6 (e), Env. 5 was designed to evaluate robot behavior amidst obstacles and humans. All of these environments were designed with real-world scenarios in mind, specifically warehouse and factory settings. The learning process was continued until the Q-table had been updated 30,000 times.

C. SIMULATION ENVIRONMENT
In this simulation, the following two types of simulations were conducted.
• Case S1: Conducted a single simulation in each environment (Env. 1-5).
• Case S2: Performed 30 simulations in unfamiliar environments (Env. 6, 7).
The starting positions in Cases S1 and S2 were set to ( GB x sta , GB y sta ) = (0.0, 0.0). In Case S1, the goal positions of Env. 1-5, denoted as ( GB x gol , GB y gol ), were established as (−1.
Env. 6 was designed with a higher density of obstacles compared to Env. 4. This was intended to test the robot's ability to deal with unforeseen obstacles that were not encountered during the learning phase. Env. 7 was designed to simulate a manufacturing plant. In this environment, the robot had to recognize and avoid not only static obstacles but also dynamic obstacles, such as humans, while navigating toward its goal.

1) CASE S1
Tables 3-4 present the results for Case S1. The abbreviations TL and PD denote the trajectory length and the movement posture displacement, respectively. Table 4 indicates the number of collisions, along with the average time, trajectory length (TL), and posture displacement (PD). Figs. 8-12 depict the trajectories in each environment.
For DWA I-III, while they delivered satisfactory results in some environments, there were instances where the robot collided with obstacles. Moreover, the robot often required a long duration to reach the goal position. The simulation results for DWA I-III varied depending on the environmental situation, as these methods utilize fixed weight coefficients.

TABLE 3. Simulation results in Case S1 (1 time in each environment).
TABLE 4. Simulation results in Case S1 (average in each Env.).

In the case of CDQ, the simulation results for Env. 1-3 were better than those for DWA I-III, since CDQ selects weight coefficients considering the environmental situation. However, the results for Env. 4 and 5 were nearly identical to those for DWA I-III. This is because CDQ does not define environmental situations based on visible space size or obstacle count. Therefore, optimal weight coefficients were not chosen in narrow or crowded spaces.
In DQDWA, the robot successfully reached the goal position in the shortest time and with the smallest TL and PD. This is because DQDWA takes into account both space size and congestion, enabling the selection of optimal weight coefficients tailored to each environment. DQDWA allows for more efficient routing while ensuring safety and preventing the robot from circling in one place.
In Env. 7, the goal position was selected from two points, with ( GB x gol , GB y gol ) being (8.5, 2.0) and (8.5, −2.0).

2) CASE S2
In the case of DWA I, while the success rate was high, the robot took a significantly longer time to reach the goal position. DWA II and DWA III achieved smaller values for time, trajectory length (TL), and posture displacement (PD), but their success rates were comparatively lower. This is because these approaches prioritized high translational velocity and goal distance over obstacle avoidance, leading to more collisions.
CDQ yielded a lower success rate than DQDWA. Additionally, the averages of time, TL, and PD were the largest in Env. 6 and the second largest in Env. 7. This indicates that CDQ did not select efficient paths in environments that were not encountered during the learning phase.
In contrast, DQDWA achieved the highest success rate. Furthermore, it reached the goal position in a time span comparable to DWA II, which prioritizes translational velocity, and with TL and PD as small as DWA III, which prioritizes the goal distance. Therefore, DQDWA selected efficient paths while maintaining safety in unlearned environments.
The effectiveness of the proposed method, DQDWA, was thus confirmed through the simulation results for both Case S1 and Case S2.

A. EXPERIMENT SETUP
The experiment was carried out with ROS and Turtlebot3. Fig. 13 (a) shows an overview of Turtlebot3. Turtlebot3 is equipped with a distance sensor (LDS-01). The distance sensor measured environmental information. Fig. 13 (b)-(c) show the experiment environments. Fig. 13 (d)-(e) show their images. For the experiments, we have two scenarios defined as follows.
• Case E1: This represents a simple environment with four obstacles.
• Case E2: This represents a crowded environment with seven obstacles.
The start position was ( GB x sta , GB y sta ) = (0.0, 0.0), and the goal position was ( GB x gol , GB y gol ) = (4.0, 0.0). The models and parameters used in the experiment were the same as in the simulation: DWA I, DWA II, DWA III, CDQ, and DQDWA. Table 6 presents the experimental results. Figs. 14-15 illustrate the trajectories for each case, and Fig. 16 shows snapshots from a DQDWA run.

B. EXPERIMENT RESULTS
In DWA I-III, collisions sometimes occurred. Even in cases where the goal was reached, these methods resulted in longer times, larger trajectory lengths (TL), and greater posture displacements (PD) compared to DQDWA. Their inability to adjust weight coefficients dynamically led to collisions and the selection of inefficient paths.
CDQ also resulted in a collision in Case E2. Moreover, its time, TL, and PD in Case E1 were larger than those of DWA II and DQDWA. These results suggest that CDQ was not able to select appropriate weight coefficients based on the environmental situations.
Conversely, DQDWA successfully reached the goal position and registered the shortest time, smallest TL, and least PD in both cases. DQDWA was capable of adjusting weight coefficients effectively in real-time. The effectiveness of the proposed method, therefore, was confirmed by the experimental results.

VIII. CONCLUSION
This paper introduced DQDWA, which provides dynamic weight coefficients based on Q-learning for DWA considering environmental situations. We focused on defining the state for Q-learning and included definitions for the area of space, taking into account congested areas. With DQDWA, the robot could select optimal paths by dynamically adjusting the weight coefficients. The effectiveness of the proposed method was validated through simulations and real-world experiments.
In the future, we aim to refine and improve DQDWA as follows.
• Incorporating Moving Obstacles: The current evaluations of DQDWA have been conducted in static environments. Future work will look into accommodating moving obstacles in the learning and experiment environments.
• Experiments in Diverse Environments: Our experiments have been performed with a single type of robot and sensor. We plan to evaluate DQDWA's performance across various environments and using different types of robots and sensors.
• Exploring Alternative Learning Methods: Presently, we utilize Q-learning as the sole learning method to adjust weight coefficients. Future efforts will investigate other learning methods for dynamic adjustment of these coefficients.