Introduction
As the world grapples with declining birthrates and aging populations, these demographic shifts are increasingly viewed as serious issues [1], [2]. To mitigate the resultant strain on the workforce, there has been a growing emphasis on deploying autonomous mobile robots in contexts such as warehouses [3] and factories [4]. These robots need to navigate different environments autonomously and therefore require the integration of a diverse set of technologies, including localization [5], mapping [6], perception [7], and path planning [8].
This paper primarily explores path planning, which is divided into global and local path planning [9]. Global path planning generates a path from the starting point to the destination based on a pre-existing map [10], [11]. However, it fails to account for unknown or unexpected obstacles in real-world environments. Therefore, in dynamic human workspaces, robots should reach their destinations and avoid obstacles autonomously and adaptively [12], [13]. Consequently, the focus has shifted toward local path planning, which accounts for obstacles not represented on pre-established maps.
With this research, we delve into local path planning considering those obstacles not included on pre-built maps [14], [15], [16], [17], [18], [19]. While dynamic obstacles are certainly a consideration [20], [21], [22], [23], this paper focuses on static environments like factories and warehouses. The Dynamic Window Approach (DWA), which accounts for dynamic constraints, has emerged as a prevalent local path planning method [24]. Despite numerous reported improvements to DWA [15], [25], [26], its limitations persist. In particular, DWA’s fixed weight coefficients, which determine the optimal path based on factors such as goal position, obstacle distance, and robot velocity, fail to adapt to changes in environmental situations. This can lead to the selection of inefficient paths or even collisions, especially in confined or crowded spaces like factories and warehouses.
To address these issues, research on dynamic weight coefficients for DWA has been carried out. Abubakr et al. and Hong et al. dynamically adjusted the weight coefficients using fuzzy logic that analyzes goal positions and obstacles [28], [29]. Chang et al. proposed using Q-learning to dynamically adjust the weight coefficients of DWA [30]. Q-learning is a reinforcement learning method that requires no prior knowledge of the environment and has a low learning cost, making it suitable for robot path planning.
Considering these advantages, this paper focuses on the Q-learning method for adjusting the weight coefficients of DWA. While the conventional method [30] adjusts weight coefficients based on goal information, velocities, and obstacles, it does not account for the spatial area and congestion rate of the environment. As a result, it can select inefficient paths or even cause collisions, depending on the situation. To remedy this issue, this paper proposes a dynamic weight coefficient adjustment approach based on Q-learning for DWA that accounts for environmental situations (DQDWA). DQDWA considers environmental conditions such as goal distance, goal direction, velocity, visible area, and congestion, and dynamically adjusts the weight coefficients of the evaluation function based on these conditions. Extensive simulations and experiments have been carried out to demonstrate the effectiveness and advantages of DQDWA in real-world scenarios.
The main contributions of this paper are threefold:
This paper proposes DQDWA, which dynamically adjusts the weight coefficients of the evaluation function based on environmental conditions such as goal distance, goal direction, velocity, visible area, and congestion.
DQDWA incorporates the concept of context-awareness, where weight coefficients are not static but dynamically adjusted according to the area of spaces and congestion levels. This approach enhances the adaptability and performance of autonomous robots in varied situations.
The effectiveness of DQDWA has been validated through extensive simulations and real-world experiments. The results demonstrate that the proposed method outperforms traditional DWA in terms of efficiency and safety.
This paper is organized into eight sections including the current section. Sections II, III, and IV provide a comprehensive overview of the coordinate system, the Dynamic Window Approach (DWA), and Q-learning, respectively. Section V proposes Dynamic Weight Coefficients based on Q-learning for DWA (DQDWA). Sections VI and VII show the results from our simulations and real-world experiments to highlight the effectiveness and utility of DQDWA. Finally, Section VIII provides conclusions.
Coordinate System
Fig. 1 illustrates the coordinate system for the robot utilized in this study. This paper defines two coordinate systems: the local coordinate system and the global coordinate system.
Dynamic Window Approach (DWA)
A. Overview of DWA
The Dynamic Window Approach (DWA) is a commonly used method in local path planning [24]. Initially, the velocity space with dynamic constraints (VSD) is determined based on the robot’s current velocities. Subsequently, at each time step, an optimal path is selected from the VSD using an evaluation function. This optimal path selection is dependent on the weight coefficients of the evaluation function. Details about the velocity space and the optimal path selection are further elaborated in Sections III-B and III-C, respectively.
B. Velocity Space
DWA generates a velocity space with dynamic constraints, denoted as $D^{vsd}$: \begin{equation*} D^{vsd} =D^{all} \cap D^{dw} \cap D^{obs} \tag{1}\end{equation*} where $D^{all}$ is the set of all velocities the robot can attain, $D^{dw}$ is the dynamic window of velocities reachable within one control period under the acceleration constraints, and $D^{obs}$ is the set of velocities that allow the robot to stop before reaching an obstacle.
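For concreteness, the following sketch shows one way such a velocity space could be computed for a single control period. It follows the standard DWA formulation; the configuration fields (velocity and acceleration limits, control period) and the admissibility test based on the distance to the nearest obstacle are assumptions of this illustration, not the paper's exact definitions.

```python
import math
from dataclasses import dataclass

@dataclass
class Limits:
    v_min: float   # minimum translational velocity [m/s]
    v_max: float   # maximum translational velocity [m/s]
    w_max: float   # maximum angular velocity magnitude [rad/s]
    a_v: float     # translational acceleration limit [m/s^2]
    a_w: float     # angular acceleration limit [rad/s^2]
    dt: float      # control period [s]

def velocity_space(v, w, lim, d_obs):
    """Interval form of D^vsd = D^all ∩ D^dw ∩ D^obs from Eq. (1)."""
    # D^obs: velocities that still allow stopping before the nearest obstacle
    v_adm = math.sqrt(2.0 * d_obs * lim.a_v)
    # D^all ∩ D^dw (with D^obs capping the upper translational bound)
    v_lo = max(lim.v_min, v - lim.a_v * lim.dt)
    v_hi = min(lim.v_max, v + lim.a_v * lim.dt, v_adm)
    w_lo = max(-lim.w_max, w - lim.a_w * lim.dt)
    w_hi = min(lim.w_max, w + lim.a_w * lim.dt)
    return (v_lo, v_hi), (w_lo, w_hi)
```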
C. Optimal Path
The velocity space $D^{vsd}$ provides pairs of translational and angular velocities, each of which generates a path candidate.
Path candidates are evaluated using the following evaluation function: \begin{equation*} J = W^{gol} \cdot c^{gol} + W^{vel} \cdot c^{vel} + W^{obs} \cdot c^{obs} \tag{2}\end{equation*} where $c^{gol}$, $c^{vel}$, and $c^{obs}$ evaluate the heading toward the goal position, the translational velocity, and the distance to the nearest obstacle, respectively, and $W^{gol}$, $W^{vel}$, and $W^{obs}$ are the corresponding weight coefficients. The velocity pair that maximizes $J$ is selected as the optimal path.
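As a simple illustration, the path selection of Eq. (2) can be sketched as below. The normalization of the three terms and the exact form of `cost_terms` are not specified here and are assumptions of this sketch.

```python
def select_velocity(candidates, weights, cost_terms):
    """Choose the (v, w) pair that maximizes Eq. (2).

    candidates : iterable of (v, w) pairs sampled from D^vsd
    weights    : dict with keys 'gol', 'vel', 'obs' (the weight coefficients)
    cost_terms : function (v, w) -> (c_gol, c_vel, c_obs), assumed normalized
    """
    best_pair, best_j = None, float('-inf')
    for v, w in candidates:
        c_gol, c_vel, c_obs = cost_terms(v, w)
        j = (weights['gol'] * c_gol
             + weights['vel'] * c_vel
             + weights['obs'] * c_obs)
        if j > best_j:
            best_pair, best_j = (v, w), j
    return best_pair
```

With conventional DWA the `weights` dict stays constant for the whole run; DQDWA instead replaces it at every control step (Section V).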
Q-Learning
To dynamically adjust the weight coefficients of the evaluation function in DWA, this study incorporates Q-learning [31], a type of reinforcement learning method. Fig. 3 outlines the concept of Q-learning, which updates a Q-table that stores the Q-values for each action in each state. The Q-value $Q(s,a)$ of taking action $a$ in state $s$ is updated as \begin{equation*} Q(s,a) = (1-\alpha)Q(s,a) + \alpha \left[R(s,a) + \gamma \max_{a'} Q(s',a')\right] \tag{3}\end{equation*} where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $R(s,a)$ is the reward, and $s'$ is the next state. Training proceeds in the following three steps.
Train1: In the current state $s$, the agent chooses an action $a$ using the $\epsilon$-greedy method with the Q-table.
Train2: The agent receives the next state $s'$ and the reward $R$ from the environment.
Train3: The Q-value in the Q-table is updated using (3).
These three steps are repeated throughout the training process, as sketched below.
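A minimal tabular Q-learning loop consistent with Train1-Train3 and Eq. (3) is sketched here. The environment interface (`reset`/`step`) and the hyper-parameter values are illustrative assumptions, not the settings used in the paper.

```python
import random
import numpy as np

def train_q_table(env, n_states, n_actions, episodes=1000,
                  alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning following Train1-Train3 and the update rule (3)."""
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                      # initial state index
        done = False
        while not done:
            # Train1: epsilon-greedy action selection from the Q-table
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(q[s]))
            # Train2: receive the next state s' and reward R from the environment
            s_next, r, done = env.step(a)
            # Train3: update the Q-value using Eq. (3)
            q[s, a] = (1 - alpha) * q[s, a] + alpha * (r + gamma * np.max(q[s_next]))
            s = s_next
    return q
```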
Proposed Method (DQDWA)
A. Overview of DQDWA
This section proposes Dynamic Weight Coefficients based on Q-learning for DWA considering environmental situations (DQDWA). While the conventional method [30] adjusts weight coefficients based on certain parameters, it does not consider the visible area and congestion rate as environmental factors. Therefore, the conventional method may lead to inefficient path selection or even collisions depending on the circumstances. To address this limitation, the proposed method includes visible area and congestion as key factors in defining environmental situations. Fig. 4 provides an overview of DQDWA, which consists of four steps (a sketch of one control cycle is given after the step list):
Step1: Determine the robot state $s_{1}$-$s_{5}$ based on environmental information measured by the distance sensor. The definitions of $s_{1}$-$s_{5}$ are detailed in Section V-B.
Step2:
The trained Q-table chooses the appropriate combination of weight coefficients for the given state. The definition of the action dimension and the Q-learning process are detailed in Sections IV, V-C, and V-D.
Step3:
The chosen weight coefficients are applied to the evaluation function of DWA. DWA is elaborated on in Section III.
Step4:
The robot moves according to the translational and angular velocity that maximizes the evaluation function.
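Putting the four steps together, one DQDWA control cycle could be sketched as follows. The helper functions `compute_state`, `sample_velocity_space`, and `cost_terms`, as well as `select_velocity`, stand in for the computations described in Sections III and V-B; they are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def dqdwa_step(scan, robot, goal, q_table, actions,
               compute_state, sample_velocity_space, cost_terms, select_velocity):
    """One DQDWA control cycle (Steps 1-4), with placeholder helpers injected."""
    # Step1: determine the state s1-s5 from the distance-sensor data
    s = compute_state(scan, robot, goal)              # -> discrete state index

    # Step2: the trained Q-table selects a weight-coefficient combination
    a = int(np.argmax(q_table[s]))
    w_gol, w_vel, w_obs = actions[a]

    # Step3: apply the chosen weights to the DWA evaluation function
    candidates = sample_velocity_space(robot)         # (v, w) pairs from D^vsd
    v, w = select_velocity(candidates,
                           {'gol': w_gol, 'vel': w_vel, 'obs': w_obs},
                           lambda v_, w_: cost_terms(v_, w_, robot, goal, scan))

    # Step4: command the velocities that maximize the evaluation function
    return v, w
```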
B. Definition of State Dimension
Fig. 5 provides a visual representation of the state dimensions. The state vector $\boldsymbol{s}$ is defined as \begin{equation*} \boldsymbol {s} = \begin{bmatrix} s_{1} & s_{2} & s_{3} & s_{4} & s_{5} \end{bmatrix} ^{\mathsf {T}} \tag{4}\end{equation*} where each element is defined in the following subsections.
1) Definition of State Dimension $s_{1}$ (Goal Distance)
\begin{align*} s_{1} = \begin{cases} \displaystyle 1\quad \text { if}\quad l^{rg} < W^{dis} L^{rob}\\ \displaystyle 2\quad \text { otherwise} \end{cases} \tag{5}\end{align*}
2) Definition of State Dimension $s_{2}$ (Goal Direction)
\begin{align*} s_{2} = \begin{cases} \displaystyle 1\quad \text { if}\quad {\theta }^{rg}\in \big[-\cfrac {\pi }{2},\cfrac {\pi }{2} \big)\\ \displaystyle 2\quad \text { otherwise} \end{cases} \tag{6}\end{align*}
3) Definition of State Dimension $s_{3}$ (Travelled Distance)
\begin{align*} \eta = \begin{cases} \displaystyle \cfrac {2v}{\omega }\quad &\text {if}\quad {|\omega |}>\pi \\ \displaystyle v &\text {else if}\quad \omega = 0\\ \displaystyle \cfrac {2v}{\omega } \sin (\cfrac {\omega }{2})\quad &\text {otherwise} \end{cases} \tag{7}\end{align*}
\begin{align*} s_{3} = \begin{cases} \displaystyle 1\quad \text { if}\quad |\eta | \leq \cfrac {V^{max}}{2}\\ \displaystyle 2\quad \text { otherwise} \end{cases} \tag{8}\end{align*}
4) Definition of State Dimension $s_{4}$ (Visible Area)
\begin{equation*} f_{i} = \cfrac {1}{2} d_{i} d_{i+1}\sin (\cfrac {2\pi }{N}) \tag{9}\end{equation*}
\begin{equation*} f^{all} = \sum _{i=1}^{N} f_{i} \tag{10}\end{equation*}
\begin{align*} s_{4} = \begin{cases} \displaystyle 1 \quad \text { if}\quad f^{all} \leq W^{are}F^{max}\\ \displaystyle 2 \quad \text { otherwise} \end{cases} \tag{11}\end{align*}
5) Definition of State Dimension $s_{5}$ (Congestion)
\begin{align*} s_{5} = \begin{cases} \displaystyle 1 \quad \text { if}\quad n^{fwd} > \cfrac {N}{4}\\ \displaystyle 2 \quad \text { else if}\quad n^{bwd} >\cfrac {N}{4}\\ \displaystyle 3 \quad \text { else if}\quad n^{all} > \cfrac {N}{4} \\ \displaystyle 4 \quad \text { otherwise} \end{cases} \tag{12}\end{align*}
\begin{equation*} n^{all} = n^{fwd} + n^{bwd} \tag{13}\end{equation*}
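The definitions above can be combined into a single function, sketched below under the following interpretation of the symbols (these interpretations are assumptions of the sketch): `l_rg` and `theta_rg` are the distance and direction from the robot to the goal, `w_dis` and `w_are` are the threshold factors $W^{dis}$ and $W^{are}$, `l_rob` is the robot size $L^{rob}$, `v_max` and `f_max` are $V^{max}$ and $F^{max}$, `d` holds the $N$ range readings used in (9)-(10), and `n_fwd`/`n_bwd` count readings detecting nearby obstacles in the forward and backward halves of the scan.

```python
import math
from dataclasses import dataclass

@dataclass
class StateParams:
    w_dis: float    # W^dis: goal-distance threshold factor
    l_rob: float    # L^rob: robot size
    v_max: float    # V^max: maximum translational velocity
    w_are: float    # W^are: visible-area threshold factor
    f_max: float    # F^max: maximum visible area

def state_vector(l_rg, theta_rg, v, w, d, n_fwd, n_bwd, p):
    """State s = [s1, ..., s5] following Eqs. (5)-(13)."""
    N = len(d)

    # s1: goal distance (Eq. 5)
    s1 = 1 if l_rg < p.w_dis * p.l_rob else 2

    # s2: goal direction (Eq. 6)
    s2 = 1 if -math.pi / 2 <= theta_rg < math.pi / 2 else 2

    # s3: travelled distance via eta (Eqs. 7-8)
    if abs(w) > math.pi:
        eta = 2 * v / w
    elif w == 0:
        eta = v
    else:
        eta = (2 * v / w) * math.sin(w / 2)
    s3 = 1 if abs(eta) <= p.v_max / 2 else 2

    # s4: visible area from adjacent range readings (Eqs. 9-11)
    f_all = sum(0.5 * d[i] * d[(i + 1) % N] * math.sin(2 * math.pi / N)
                for i in range(N))
    s4 = 1 if f_all <= p.w_are * p.f_max else 2

    # s5: congestion from obstacle counts (Eqs. 12-13)
    n_all = n_fwd + n_bwd
    if n_fwd > N / 4:
        s5 = 1
    elif n_bwd > N / 4:
        s5 = 2
    elif n_all > N / 4:
        s5 = 3
    else:
        s5 = 4

    return (s1, s2, s3, s4, s5)
```

In a tabular implementation, the $2\times 2\times 2\times 2\times 4 = 64$ possible combinations of these elements would be flattened into a single state index for the Q-table.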
C. Definition of Reward
To adjust the weight coefficients of the evaluation function with Q-learning, the reward $R$ is defined as \begin{equation*} R = R_{1}+R_{2}+R_{3} \tag{14}\end{equation*} where $R_{1}$, $R_{2}$, and $R_{3}$ are given by (15)-(17).
\begin{align*} R_{1} = \begin{cases} \displaystyle 5000 \quad &\text {if reach goal}\\ \displaystyle -200 \quad &\text {else if collide with obstacle}\\ \displaystyle -2 \quad &\text {otherwise} \end{cases} \tag{15}\end{align*}
\begin{align*} R_{2} = \begin{cases} \displaystyle 0 \quad &\text {if reach goal or collide with obstacle}\\ \displaystyle 10 \quad &\text {if get close to goal position}\\ \displaystyle -10 \quad &\text {else if farther from goal position} \end{cases} \tag{16}\end{align*}
\begin{align*} R_{3} = \begin{cases} \displaystyle 0 \quad &\text {if reach goal or collide with obstacle}\\ \displaystyle -5 \quad &\text {if approach obstacle}\\ \displaystyle 5 \quad &\text {else if go away from obstacle} \end{cases} \tag{17}\end{align*}
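A direct transcription of Eqs. (14)-(17) could look as follows; the boolean arguments (whether the robot got closer to the goal or to an obstacle during the last step) are conventions assumed for this sketch.

```python
def reward(reached_goal, collided, closer_to_goal, closer_to_obstacle):
    """Reward R = R1 + R2 + R3 from Eqs. (14)-(17)."""
    # R1: terminal outcomes and a small per-step penalty (Eq. 15)
    if reached_goal:
        r1 = 5000
    elif collided:
        r1 = -200
    else:
        r1 = -2

    # R2: progress toward the goal position (Eq. 16)
    if reached_goal or collided:
        r2 = 0
    elif closer_to_goal:
        r2 = 10
    else:
        r2 = -10

    # R3: change in distance to the nearest obstacle (Eq. 17)
    if reached_goal or collided:
        r3 = 0
    elif closer_to_obstacle:
        r3 = -5
    else:
        r3 = 5

    return r1 + r2 + r3          # Eq. (14)
```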
D. Definition of Action Dimension
The weight coefficients for goal position, velocity, and obstacles are each selected from the set {1, 2, 3}. We omit the combinations {2,2,2} and {3,3,3}, as they merely rescale the evaluation function and are therefore equivalent to {1,1,1}. As a result, 25 unique combinations are obtained, which form the action dimension.
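Under this definition, the 25-element action set can be enumerated directly; a sketch (with an assumed tuple ordering of goal, velocity, and obstacle weights) is:

```python
from itertools import product

# All weight triples (W^gol, W^vel, W^obs) drawn from {1, 2, 3}, excluding the
# two combinations that merely rescale the evaluation function relative to (1, 1, 1).
ACTIONS = [c for c in product((1, 2, 3), repeat=3)
           if c not in {(2, 2, 2), (3, 3, 3)}]
assert len(ACTIONS) == 25
```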
The Q-table comprises the actions and the states of the robot in various environmental situations. By utilizing the learned Q-table, DQDWA selects the optimal path using dynamic weight coefficients considering these environmental situations.
Simulation
A. Simulation Setup
The simulation system was implemented using the Robot Operating System (ROS) and Gazebo. In this simulation, we evaluated five patterns of DWA weight coefficients: DWA I, DWA II, DWA III, Conventional DWA with Q-learning (CDQ) [30], and DQDWA. DWA I-III use constant weight coefficients that are fixed in advance, whereas CDQ and DQDWA adjust their weight coefficients dynamically.
B. Pre-Train of Q-Table
Fig. 6 (a)-(e) depict the environments utilized during the learning process. The environment and the goal position were randomly selected at the start of each trial, as outlined in Table 2.
C. Simulation Environment
The following two types of simulations were conducted.
Case S1: Conducted a single simulation in each environment (Env. 1-5).
Case S2: Performed 30 simulations in unfamiliar environments (Env. 6,7).
Env. 6 was designed with a higher density of obstacles compared to Env. 4. This was intended to test the robot’s ability to deal with unforeseen obstacles that were not encountered during the learning phase. Env. 7 was designed to simulate a manufacturing plant. In this environment, the robot had to recognize and avoid not only static obstacles but also dynamic obstacles, such as humans, while navigating toward its goal.
D. Simulation Results
1) Case S1
Tables 3-4 present the results for Case S1. The abbreviations TL and PD denote the trajectory length and the movement posture displacement, respectively. Table 4 indicates the number of collisions, along with the average time, trajectory length (TL), and posture displacement (PD). Figs. 8–12 depict the trajectories in each environment.
For DWA I-III, while they delivered satisfactory results in some environments, there were instances where the robot collided with obstacles. Moreover, the robot often required a long duration to reach the goal position. The simulation results for DWA I-III varied depending on the environmental situation, as these methods utilize fixed weight coefficients.
In the case of CDQ, the simulation results for Env. 1–3 were better than those for DWA I-III, since CDQ selects weight coefficients considering the environmental situation. However, the results for Env. 4 and 5 were nearly identical to those for DWA I-III. This is because CDQ does not define environmental situations based on visible space size or obstacle count. Therefore, optimal weight coefficients were not chosen in narrow or crowded spaces.
In DQDWA, the robot successfully reached the goal position in the shortest time and with the smallest TL and PD. This is because DQDWA takes into account both space size and congestion, enabling the selection of optimal weight coefficients tailored to each environment. DQDWA allows for more efficient routing while ensuring safety and preventing the robot from circling in one place.
2) Case S2
Table 5 presents the results for Case S2, while Figs. 8–12 (f)-(g) illustrate the corresponding trajectories. In this simulation, the goal position was randomly selected at the start of each trial. For Env. 6, the goal position was selected from four candidate points.
In the case of DWA I, while the success rate was high, the robot took a significantly longer time to reach the goal position. DWA II and DWA III achieved smaller values for time, trajectory length (TL), and posture displacement (PD), but their success rates were comparatively lower. This is because these approaches prioritized high translational velocity and goal distance over obstacle avoidance, leading to more collisions.
CDQ yielded a lower success rate than DQDWA. Additionally, the averages of time, TL, and PD were the largest in Env. 6 and the second largest in Env. 7. This indicates that CDQ did not select efficient paths in environments that were not encountered during the learning phase.
In contrast, DQDWA achieved the highest success rate. Furthermore, it reached the goal position in a time span comparable to DWA II, which prioritizes translational velocity, and with TL and PD as small as DWA III, which prioritizes the goal distance. Therefore, DQDWA selected efficient paths while maintaining safety in unlearned environments.
The effectiveness of the proposed method, DQDWA, was thus confirmed through the simulation results for both Case S1 and Case S2.
Experiment
A. Experiment Setup
The experiment was carried out with ROS and Turtlebot3. Fig. 13 (a) shows an overview of Turtlebot3, which is equipped with a distance sensor (LDS-01) that measures environmental information. Fig. 13 (b)-(c) show the experiment environments, and Fig. 13 (d)-(e) show their images. For the experiments, the following two scenarios were defined.
Case E1: This represents a simple environment with four obstacles.
Case E2: This represents a crowded environment with seven obstacles.
B. Experiment Results
Table 6 presents the experimental results. Figs. 14–15 illustrate the trajectories for each case, and Fig. 16 shows snapshots from the DQDWA run.
In DWA I-III, collisions sometimes occurred. Even in cases where the goal was reached, these methods resulted in longer times, larger trajectory lengths (TL), and greater posture displacements (PD) compared to DQDWA. Their inability to adjust weight coefficients dynamically led to collisions and the selection of inefficient paths.
CDQ also resulted in a collision in Case E2. Moreover, its time, TL, and PD in Case E1 were larger than those of DWA II and DQDWA. These results suggest that CDQ was not able to select appropriate weight coefficients based on the environmental situations.
Conversely, DQDWA successfully reached the goal position and registered the shortest time, smallest TL, and least PD in both cases. DQDWA was capable of adjusting weight coefficients effectively in real-time. The effectiveness of the proposed method, therefore, was confirmed by the experimental results.
Conclusion
This paper introduced DQDWA, a dynamic weight coefficient adjustment approach based on Q-learning for DWA that considers environmental situations. We focused on the state definition for Q-learning, incorporating the visible area of the space and the congestion of the environment. With DQDWA, the robot could select optimal paths through dynamic adjustment of the weight coefficients. The effectiveness of the proposed method was validated through simulations and real-world experiments.
In the future, we aim to refine and improve DQDWA as follows.
Incorporating Moving Obstacles: The current evaluations of DQDWA have been conducted in static environments. Future work will look into accommodating moving obstacles in the learning and experiment environments.
Experiments in Diverse Environments: Our experiments have been performed with a single type of robot and sensor. We plan to evaluate DQDWA’s performance across various environments and using different types of robots and sensors.
Exploring Alternative Learning Methods: Presently, we utilize Q-learning as the sole learning method to adjust weight coefficients. Future efforts will investigate other learning methods for dynamic adjustment of these coefficients.