UAV Path Planning Based on the Average TD3 Algorithm With Prioritized Experience Replay

Path planning is an important component of Unmanned Aerial Vehicle (UAV) missions and a key guarantee of their successful completion. Traditional path planning algorithms have notable limitations in complex dynamic environments. Targeting dynamic environments with complex obstacles, this paper proposes an improved TD3 algorithm that enables a UAV to perform autonomous path planning through online learning and continuous trial and error. The algorithm replaces the experience pool of the TD3 algorithm with prioritized experience replay, so that the agent can distinguish the importance of experience samples, improving the sampling efficiency of the algorithm and reducing training time. An average TD3 is proposed: the average of $Q_1$ and $Q_2$ is taken when updating the target value, which solves the problem of overestimating the $Q$ value while avoiding underestimating it, so that the improved algorithm is more stable and can adapt to various complex obstacle environments. A new reward function is designed so that the UAV receives reward feedback at every step, which addresses the sparse-reward problem in deep reinforcement learning. Experimental results show that this method can train the UAV to reach the target safely and quickly in a multi-obstacle environment. Compared with DDPG, SAC, and the traditional TD3, the proposed algorithm achieves a higher path planning success rate and a lower collision rate, demonstrating better path planning performance.


I. INTRODUCTION
Unmanned aerial vehicles (UAVs) are radio-controlled aircraft operated remotely or through self-contained program control devices. Because of their small size, low cost, ease of use, and other advantages, they are widely used in various fields [1]. UAV path planning plays a vital role in establishing the UAV mission model and serves as a crucial guarantee for the successful completion of the UAV mission. Its purpose is to plan the optimal flight path in a given scenario, considering the path length, terrain environment, threat information, UAV maneuverability constraints, and other related factors [2]. A well-designed path planning algorithm can enable a UAV to accomplish tasks at minimal cost, particularly in complex and dynamic environments. The path planning discussed in this paper is point-to-point planning, characterized by obstacle avoidance and the shortest, smoothest path. In contrast, complete coverage path planning involves determining a path that traverses all points in a given region or spatial range while simultaneously avoiding obstacles [3].
In recent years, scholars have conducted a great deal of research on UAV path planning and proposed a variety of algorithms, each with its own application areas, advantages, and disadvantages. According to the research method, UAV path planning algorithms can be divided into two categories: non-learning algorithms, including classical path planning algorithms and intelligent optimization algorithms, and learning-based algorithms, such as deep reinforcement learning algorithms [4].
Classical path planning algorithms include the Artificial potential field (APF) [5], [6], the Rapidly-exploring random tree (RRT) [7], [8], the A* algorithm [9], [10], the Voronoi diagram (VD) [11], [12], [13], the Probabilistic road map (PRM) [14], and so on. These classical methods succeed mainly because they are easy to implement, and they perform well in path optimization, fast solution generation, and static environments with simple obstacles. However, their time complexity is relatively high, their performance degrades in high-dimensional path planning, and they easily fall into local optima, which may lead to large deviations in the planned path. Intelligent optimization algorithms include the Genetic algorithm (GA) [15], Particle swarm optimization (PSO) [16], [17], [18], Gray wolf optimization (GWO) [19], Differential evolution (DE) [20], etc. These algorithms are simple to implement for UAV path planning, have global search ability, and show good robustness on large-scale optimization problems. However, they may also fall into local optima, their computational complexity is high, their performance depends largely on the choice of parameters, and their convergence is relatively slow.
The above algorithms generate feasible paths within a given environment using search-based and sampling-based methods. However, as environmental complexity and uncertainty increase, their feasibility is greatly reduced. In addition, after the front-end path search, these methods still need back-end trajectory optimization, which leads to high time complexity. In practical applications, they face significant limitations when the UAV must adapt to unfamiliar environments. Enabling real-time, collision-free path planning from start to goal in unknown environments remains a formidable challenge. In such environments, the UAV has no prior knowledge of its surroundings, which may change at any time, so it must be able to perceive, decide, and act, as well as to explore and learn. It is therefore particularly important to design a method that enables a UAV to learn autonomous path planning in an unknown environment. In recent years, Deep reinforcement learning (DRL), with its autonomous learning ability, has successfully addressed the UAV path planning problem in unknown environments. As a decision-making and control method different from traditional machine learning algorithms, DRL enables an agent to adapt to an unknown environment through online learning and continuous trial and error without any guidance signal. Because DRL achieves strong task performance through training, it has attracted the attention of many researchers and has begun to be applied in the field of UAV path planning.
Han et al. [21] proposed an improved Deep Q-network (DQN) that uses priority and exponential sampling methods, enhancing the stability and performance of the sampling algorithm by adjusting the uniform random sampling of UAV flight experience samples. Xie et al. [22] introduced an improved Deep recurrent Q-network (DRQN) that combines reward and Q values through a novel action selection policy to mitigate inaccurate neural network predictions during early-stage training; the improved DRQN exhibits low computational complexity and significantly improved learning efficiency and stability. Runjia et al. [23] proposed a multi-critic delayed Deep deterministic policy gradient (DDPG) method that uses the average estimate of multiple evaluation networks to decrease DDPG's reliance on a single evaluation network; it employs delayed learning to mitigate overestimation and target-network error accumulation, resulting in superior path planning performance compared with traditional DDPG. Hu et al. [24] proposed the REL-DDPG algorithm, a DDPG algorithm based on the concept of relevant experience learning, which shows significant improvements in convergence speed and effectiveness over traditional DDPG. Bohao and Wu [25] proposed an enhanced DDPG that guides the UAV to track targets by designing a new reward function, smooths the UAV trajectory using penalty terms, and approximates the environmental state with a long short-term memory network, improving approximation accuracy and data utilization. Zhang et al. [26] proposed an improved Twin-delayed deep deterministic policy gradient (TD3) algorithm with a twin-stream actor-critic network architecture that extracts environmental features from observations and their variations to handle the stochasticity and dynamics of obstacles; experiments demonstrate good path planning performance in dynamic environments. Lee et al. [27] proposed a new Soft actor-critic (SAC) algorithm called SACHER, which is capable of generating optimal paths for UAVs. Yan et al. [28] designed a dual deep Q-network (D3QN) algorithm based on global situation information, which uses a set of situation maps as input to approximate the Q values of all candidate actions and combines an ε-greedy strategy with heuristic search rules for action selection; experiments show good performance under both static and dynamic task settings. Peng et al. [29] studied a UAV-assisted mobile edge computing network and adopted a DRL framework to handle the size-explosion problem, proposing a Dual deep Q-learning network (DDQN) algorithm to realize UAV path planning; simulation results verify the effectiveness of the scheme.
DRL algorithms have been applied to the autonomous path planning problem of UAVs with improved outcomes. However, existing algorithms such as DQN, DRQN, and DDPG overestimate the Q value, while the TD3 algorithm underestimates it. In practice, path planning with DRL in complex environments also faces problems that need to be solved and optimized, such as long exploration periods, sparse rewards, low sample utilization, and convergence stability. To address these problems, this paper proposes an Improved TD3 (I-TD3) algorithm that gives the UAV better path planning performance in complex and dynamic obstacle environments. The traditional TD3 algorithm takes the minimum of $Q_1$ and $Q_2$ when updating the target value, whereas the algorithm in this paper takes their average, which enhances the stability of the algorithm and enables the UAV to adapt to various obstacle environments. To address low sample utilization, this paper replaces the experience pool of the TD3 algorithm with Priority experience replay (PER), which improves sample utilization and reduces training time. To address the sparse-reward problem of DRL, a new reward function is designed so that every UAV action receives reward feedback.
The main contributions of this paper are as follows: (1) Using OpenAI Gym as the DRL simulation platform, a three-dimensional continuous simulation environment is customized. The established environment can visualize the training process and helps analyze the UAV's behavior. Simulation results show that the proposed algorithm is stable, achieves a high success rate, and can effectively solve the UAV path planning problem in a dynamic environment.
(2) Priority experience replay is used as the experience replay pool of the TD3 algorithm, so that the agent can distinguish the importance of experience samples, which reduces training time, improves sample utilization, and makes experience extraction from the pool more efficient.
(3) This paper proposes an average TD3 algorithm. Unlike the traditional TD3 algorithm, when updating the target value, the proposed algorithm takes the average of $Q_1$ and $Q_2$ instead of their minimum. This solves the problem of overestimating the Q value while avoiding underestimating it, so the improved algorithm is more stable and can adapt to various complex obstacle environments.
(4) We design a new reward function that ensures the UAV receives reward feedback at every step. This not only solves the sparse-feedback problem in DRL but also greatly improves the convergence speed of the algorithm, allowing the UAV to complete its mission efficiently.
The rest of this paper is organized as follows. Section II introduces background knowledge on the Markov decision process (MDP) and DRL. Section III describes the proposed algorithm in detail. Section IV describes the state space, action space, and reward function of UAV path planning, as well as the path planning process based on the proposed algorithm. Section V describes the experimental environment, details, settings, and parameters, and analyzes the simulation results. Section VI concludes the paper and discusses future work.

II. BACKGROUND

A. MARKOV DECISION PROCESS
An MDP is a stochastic process with the Markov property and is a sequential decision model. The model was proposed by Bellman in 1957 to solve problems with uncertain and dynamic characteristics, such as robot navigation and asset portfolio problems. In an MDP, the agent is a machine learning entity that perceives the state of the external environment, makes decisions accordingly, and continually adjusts those decisions by applying actions to the environment and relying on its feedback. The environment in the MDP model covers everything outside the agent; its state changes under the influence of the agent's behavior, and these changes can be fully or partially perceived by the agent. After each decision, the environment provides a corresponding reward to the agent [30], [31], [32]. The MDP is shown in Fig. 1.
An MDP can be expressed as MDP = (S, A, P, R, γ), where S is the set of all possible states and A is the set of all actions that the agent can take. P is the state transition probability function, giving the probability of moving to the next state when an action is taken in a given state. R is the reward function, measuring the reward obtained by the agent for taking an action in a given state. γ ∈ [0, 1] is the discount factor, also called the attenuation factor, which weighs the impact of future rewards on the cumulative reward. As shown in Fig. 1, during learning the agent receives the state $S_t$ from the environment and performs the action $A_t$ according to the learned policy π; the environment then returns a reward $R_t$ to the agent, whose goal is to learn the policy that maximizes the reward obtained from the environment. By repeating this process, the agent updates its policy to maximize the cumulative reward return $G_t$.
The policy function is defined as
$\pi(a \mid s) = P(A_t = a \mid S_t = s)$.
The reward function is defined as
$R(s, a) = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]$.
The cumulative reward value is
$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$.
The state-value function is
$V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]$.

B. DEEP REINFORCEMENT LEARNING
DRL is an algorithm that seamlessly combines Reinforcement learning (RL) and Deep learning (DL). Using the MDP formalism, it characterizes the interaction between the agent and the environment. The primary goal of DRL, as with the MDP, is to determine the optimal policy that allows the agent to achieve its objective in the current environment while maximizing the rewards obtained from executing that policy. Throughout training, DRL lets the agent interact with the environment, make decisions by selecting actions, and receive feedback through a reward mechanism. By continuously exploring and pursuing greater rewards, it ultimately obtains an excellent action selection policy [33]. Taking DDPG as an example, it is a DRL algorithm based on the actor-critic (AC) framework and is used for problems with continuous action spaces. DDPG combines deep neural networks with deterministic policy gradients, enabling the learning of continuous action policies. The DDPG algorithm primarily consists of two networks: the actor network and the critic network. The actor network is a deterministic policy function that takes the state as input and outputs the corresponding action. The critic network is a Q-value function that evaluates the value of the current state and action [34]. The DDPG framework is illustrated in Fig. 2.
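As a concrete illustration of the cumulative reward $G_t$ defined in Section II-A, the short Python sketch below computes discounted returns for a finished episode by backward accumulation; the reward values and discount factor in the example are arbitrary illustrative numbers, not taken from this paper.

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards: G_t = R_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a short episode with per-step rewards
print(discounted_returns([-0.01, -0.01, 5.0], gamma=0.99))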

III. IMPROVED TWIN DELAYED DEEP DETERMINISTIC POLICY GRADIENTS
At present, traditional UAV path planning algorithms have certain limitations. When the UAV is in an unknown environment, the map must be planned globally each time, which makes planning slow and time-consuming and makes it difficult to find a safe path. Therefore, giving the UAV the ability to learn autonomously and adapt to environmental changes during planning is particularly important. DRL algorithms are model-free, learn online, and use an off-policy update scheme, which breaks the limitations of traditional algorithms and enables UAVs to perform autonomous path planning in unknown environments. At the same time, DRL can govern the continuous actions of the UAV, aligning it more closely with the practical requirements of UAV path planning.
To address the UAV path planning problem, this study employs the TD3 algorithm, a DRL approach built upon policy gradients. Its advantage is that it updates with the policy as the target and fits the policy directly during training, which allows output in a continuous action space and, because it does not need to enumerate action values, reduces training time and speeds up convergence. A value-function-based DRL method, although it handles continuous or high-dimensional state spaces well, has a discrete action space; the planned path is not smooth and may require back-end trajectory optimization, which is still a great limitation in UAV path planning. Moreover, when facing stochastic-policy problems, value-function-based methods may change substantially at each update during training and do not converge easily.

A. TWIN DELAYED DEEP DETERMINISTIC POLICY GRADIENTS (TD3)
TD3 is a DRL algorithm that utilizes deterministic policies. It builds upon the DDPG algorithm and introduces three key techniques [35], [36]:
1) Dual networks: two sets of actor-critic networks are used, and the target value is calculated by taking the minimum value from the two critic networks, preventing overestimation.
2) Target policy smoothing: when calculating the target value, noise perturbations are added to the output of the target policy, making training more stable and facilitating convergence.
3) Delayed update: the actor network is updated only after multiple updates of the critic networks. This delayed update reduces error accumulation and makes the training of the actor network more stable and reliable.
The TD3 algorithm consists of 2 actor networks and 4 critic networks. The critic target networks evaluate the sampled states $s_{t+1}$ and actions $\tilde{a}_t$ to output $Q(s_{t+1}, \tilde{a}_t)$. If the target values are updated based on the maximum Q value, a small error is introduced at each update; over many updates these errors accumulate, resulting in overestimation of the values of certain states. To address this issue, the TD3 algorithm uses two critic networks to evaluate the Q values and, during the update, selects the smaller Q value to form the target value.
TD3 uses a delayed update scheme for the network parameters: after every d updates of the critic network parameters, the actor network parameters are updated once. The critic networks are updated by minimizing the mean squared error with respect to the target value
$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s_{t+1}, \tilde{a}_t)$, with $\theta_i \leftarrow \arg\min_{\theta_i} N^{-1} \sum \big(y - Q_{\theta_i}(s_t, a_t)\big)^2$.
The actor network is updated by the deterministic policy gradient
$\nabla_{\phi} J(\phi) = N^{-1} \sum \nabla_{a} Q_{\theta_1}(s_t, a)\big|_{a=\pi_{\phi}(s_t)} \nabla_{\phi} \pi_{\phi}(s_t)$.
The target networks are updated softly as
$\theta_i' \leftarrow \tau \theta_i + (1-\tau)\theta_i'$, $\phi' \leftarrow \tau \phi + (1-\tau)\phi'$.
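For reference, the following PyTorch sketch shows the three standard TD3 update steps just described (clipped double-Q target, target policy smoothing, and delayed actor and target-network updates). The function and tensor names, noise bounds, and hyperparameter values are illustrative assumptions rather than the implementation used in this paper.

import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_targ, critic1, critic2, critic1_targ, critic2_targ,
               actor_opt, critic_opt, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    """One TD3 update: clipped double-Q target, target policy smoothing, delayed actor update."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer

    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_targ(s_next) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q: take the minimum of the two target critics
        q_next = torch.min(critic1_targ(s_next, a_next), critic2_targ(s_next, a_next))
        y = r + gamma * (1.0 - done) * q_next

    # Critic update: regress both critics toward the shared target y
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed actor and target-network updates every `policy_delay` critic updates
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, targ in ((actor, actor_targ), (critic1, critic1_targ), (critic2, critic2_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.data.mul_(1 - tau).add_(tau * p.data)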

B. IMPROVED TD3 (I-TD3)
When sampling from the experience pool for training, TD3 currently uses uniform random sampling, resulting in low learning efficiency. Additionally, to address the issue of overestimated Q values, the TD3 algorithm updates the target value using the minimum of $Q_1$ and $Q_2$, which leads to underestimation of the Q value. This paper enhances the algorithm by introducing priority experience replay into the TD3 experience pool, which enables the agent to distinguish the importance of experience samples, improving learning efficiency and reducing training time. The proposed approach, referred to as average TD3, updates the target value by taking the average of $Q_1$ and $Q_2$, which avoids both underestimation and overestimation of the Q value and improves the stability of the algorithm. Furthermore, the reward function is redesigned to provide feedback at each step of the UAV's action, effectively tackling the problem of reward sparsity in DRL.

1) PRIORITY EXPERIENCE REPLAY (PER)
During DRL training, the input and output data of the network must be stored, which requires an experience replay buffer. When the data is replayed, the agent updates the network parameters based on previously observed experience. Each record has the form $(s_t, a_t, r_t, s_{t+1})$, and in standard replay all data in the pool are sampled uniformly at random during the update. Priority experience replay (PER) instead extracts the most valuable experiences more often. It cannot sample only the most valuable experiences, as this would cause over-fitting; rather, the higher the value of an experience, the greater its probability of being sampled, and even the least valuable experiences are still sampled with some probability. The key idea of the priority experience replay mechanism is to replay very successful or extremely unsuccessful experiences at a higher frequency, since these experience samples have higher learning value [37], [38], [39].
In DRL, the TD-error represents the discrepancy between the current Q value and the target Q value, reflecting how much the agent still needs to learn. The larger the TD-error, the more the experience sample needs to be updated, which accelerates the agent's task completion. Therefore, the TD-error is used to differentiate the importance of experience samples, and it is defined as
$\delta_j = r + \gamma Q_{\theta'}\big(s_{t+1}, \pi_{\phi'}(s_{t+1})\big) - Q_{\theta}(s_t, a_t)$.
The sampling probability of an experience sample is
$P(j) = p_j^{\alpha} \big/ \sum_k p_k^{\alpha}$,
where $p_j$ is a priority index based on the TD-error and α is a priority adjustment parameter. To maintain sample diversity, a random factor is retained when selecting experience samples, so even samples with small TD-error values have some probability of being chosen. When α = 1, the TD-error-based priority is used directly; when α = 0, sampling reduces to the original uniform random sampling. The priority index is based on a ranking approach.
The agent tends to update experience samples with high TD-error, which modifies the original probability distribution and introduces bias into the model; as a result, the neural network training may fail to converge. To mitigate this issue, importance sampling is employed to correct the weight changes:
$w_j = \big(M \cdot P(j)\big)^{-\beta} \big/ \max_i w_i$,
where M is the number of samples in the experience replay pool and the parameter β controls the degree of error correction. Through this process, the data that interact with the environment can be distinguished by the importance of each experience sample, enhancing the learning efficiency of the experience samples.
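A minimal sketch of a prioritized replay buffer implementing the sampling probability $P(j)$ and the importance-sampling weights described above is given below. For brevity it uses proportional priorities with a linear scan rather than a rank-based index or a sum-tree, and all class and parameter names are illustrative assumptions.

import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized experience replay (linear search, no sum-tree)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # priority exponent: 0 -> uniform, 1 -> pure TD-error priority
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New samples receive the current maximum priority so they are replayed at least once
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[: len(self.data)] ** self.alpha
        probs = p / p.sum()                              # P(j) = p_j^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights w_j = (M * P(j))^-beta, normalized by the maximum weight
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Priority p_j = |TD-error| + eps so that no transition has zero sampling probability
        self.priorities[idx] = np.abs(td_errors) + eps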

2) AVERAGE TD3
The TD3 algorithm solves the overestimation problem of DDPG, but it can also underestimate. To solve this problem, this paper proposes an average TD3 algorithm that addresses overestimation while avoiding underestimation.
The overestimation in the DDPG algorithm comes from two sources: bootstrapping and maximization. If the overestimation were uniform, it would not affect the final decision of the agent; if it is non-uniform, the final decision is significantly influenced by it. In practice, the overestimation of the network is usually non-uniform.
When updating the critic network, suppose the data sampled from the experience pool is $(s_t, a_t, r_t, s_{t+1})$. First, the target value y is computed:
$y = r_t + \gamma Q_{\theta_i'}(s_{t+1}, a_{t+1})$.
Owing to network overestimation,
$Q_{\theta_i'}(s_{t+1}, a_{t+1}) \ge Q^{*}(s_{t+1}, a_{t+1})$,
where $Q^{*}(s_{t+1}, a_{t+1})$ denotes the true optimal state-action value of the state-action pair $(s_{t+1}, a_{t+1})$. We then let $Q_{\theta_i}(s_t, a_t)$ approximate y, so that $Q_{\theta_i}(s_t, a_t)$ is overestimated, that is,
$Q_{\theta_i}(s_t, a_t) \approx y \ge r_t + \gamma Q^{*}(s_{t+1}, a_{t+1})$.
Thus, when the critic network is updated, the state-action values become overestimated. To address this problem, the TD3 algorithm selects the minimum of $Q_1$ and $Q_2$ to perform the parameter updates, which removes the overestimation. However, because the minimum is chosen as the target at every update, the Q values may instead be underestimated. Therefore, this paper uses the average of $Q_1$ and $Q_2$ to update the target value:
$y = r_t + \frac{\gamma}{2}\big(Q_{\theta_1'}(s_{t+1}, \tilde{a}_t) + Q_{\theta_2'}(s_{t+1}, \tilde{a}_t)\big)$.
This modification allows the improved algorithm to solve the problem of overestimated Q values while avoiding underestimation.
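The difference between the standard TD3 target and the averaged target proposed here can be written in a few lines. The sketch below (tensor names and the optional done mask are illustrative) shows the only place where I-TD3 changes the target computation:

import torch

def td3_target(r, q1_next, q2_next, gamma=0.99, done=None):
    """Standard TD3: the minimum of the two target critics, which can underestimate Q."""
    mask = 0.0 if done is None else done
    return r + gamma * (1.0 - mask) * torch.min(q1_next, q2_next)

def average_td3_target(r, q1_next, q2_next, gamma=0.99, done=None):
    """Average TD3 (this paper): the mean of the two target critics, balancing over- and underestimation."""
    mask = 0.0 if done is None else done
    return r + gamma * (1.0 - mask) * 0.5 * (q1_next + q2_next)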

IV. UAV PATH PLANNING BASED ON I-TD3 ALGORITHM

A. STATE SPACE
During DRL, the UAV decides which actions to take based on the state information received from the environment, so designing a suitable state space is of utmost importance. The state space should accurately represent the current state of the UAV and provide information about significant environmental elements such as obstacle position and shape and the target position. We therefore define the state space as a combination of sensor-detected environmental information and the UAV's own state.
In this paper, we employ LIDAR for obstacle detection; the environmental observations are depicted in Figs. 3 and 4. Fig. 3 shows the LIDAR's laser distances in the horizontal plane, and Fig. 4 shows the laser distances in the vertical plane. The scanning range covers an angle of π, and adjacent laser beams are separated by a fixed angular interval. If the LIDAR detector does not detect any obstacle within its specified range, the length of the emitted ray is taken as the maximum detectable distance.
The environmental information consists of the distance measured by each laser beam together with an indicator $\xi_i$ for each beam: $\xi_i$ is a one-hot code that equals 1 if the sensor detects an object within its finite detection range and 0 otherwise.
Taking a quadcopter with an X configuration as an example, the state of the UAV can be measured in real time using GPS and gyroscopes: $(x, y, z)$ represents the real-time position of the UAV, $(v_x, v_y, v_z)$ denotes its velocity along the x, y, and z axes, $d_0$ is the straight-line distance between the UAV and the target, and β represents the angle between the direction of $d_0$ and the y-axis.
To expedite completion of the navigation task and improve convergence speed, we replace the absolute position of the UAV with its position relative to the target, and the state space of the UAV is redefined accordingly. In summary, the state combines the environmental information from the LIDAR with the UAV's relative position to the target, its velocity, the distance $d_0$, and the angle β, as illustrated in Fig. 5.
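As an illustration of how such a state vector might be assembled, the sketch below concatenates the laser distances, the hit indicators $\xi_i$, the UAV-to-target relative position, the velocity, $d_0$, and β. The number of beams, the ordering of the components, and the helper name are assumptions for illustration only, not the paper's exact definition.

import numpy as np

def build_state(laser_dists, laser_hits, uav_pos, target_pos, velocity):
    """Assemble one observation: sensor readings plus the UAV's own (relative) state."""
    rel = np.asarray(target_pos) - np.asarray(uav_pos)           # relative position to the target
    d0 = np.linalg.norm(rel)                                      # straight-line distance to the target
    beta = np.arccos(np.clip(rel[1] / (d0 + 1e-8), -1.0, 1.0))    # angle between d_0 and the y-axis
    return np.concatenate([laser_dists, laser_hits, rel, velocity, [d0, beta]]).astype(np.float32)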

B. ACTION SPACE
The propellers of the four-rotor UAV consist of two positive propellers and two negative propellers, symmetrically distributed in the four corners of the UAV frame [40].
During operation, the propellers rotate at high speed to generate a downward airflow, which provides an upward lift force to the UAV. By analyzing the forces acting on a quadcopter in its basic flight attitudes, we can conclude that the UAV achieves various flight attitudes by adjusting the rotational speeds of its four motors [41]. The flight control board can therefore control the UAV's flight attitude and position by altering the lift and torque through different input voltages to the brushless motors, according to external demands. This paper treats the forces along the various directions of the UAV as executable actions, enabling the UAV to take off, land, move forward and backward, and move laterally. At the same time, the UAV's steering is controlled by the rotation angle. The action space is illustrated in Fig. 6.

C. REWARD FUNCTION
The reward function is a crucial component of DRL.
Designing a reasonable reward function not only improves the convergence speed of the training process but also enables the UAV to efficiently and safely accomplish its tasks [42], [43].
The reward function $r(s_t, a_t)$ represents the environmental feedback for taking action $a_t$ in state $s_t$, and it can be used to evaluate the quality of the action taken in the current state. If the reward $r(s_t, a_t)$ is large, the action taken in the current state is beneficial for achieving the task, and the probability of taking that action increases in the next policy update; otherwise, the probability decreases. The reward function in this paper aims to guide the UAV to the target location while ensuring its safety. It consists of the following four components; a sketch combining them is given after the list:
1) To expedite the UAV's navigation mission, a penalty of −0.01 is imposed on it at each step.
2) When $\beta \in [0, \pi/2)$, the UAV is flying toward the target point, so it receives a positive reward. When $\beta = \pi/2$, the UAV is neither approaching nor moving away from the target point, so it receives neither reward nor punishment. When $\beta \in (\pi/2, \pi]$, the UAV is moving away from the target, so it receives a penalty.
3) To make the UAV avoid colliding with obstacles, when the distance between the UAV and the nearest obstacle is less than $d_{safe}$, the UAV suffers a distance penalty. When the UAV collides with an obstacle, it suffers a penalty of −5. When the distance between the UAV and the nearest obstacle is greater than $d_{safe}$, the obstacle poses no threat to the UAV, so no penalty is applied. The value of $d_{safe}$ is 10.
4) To incentivize the UAV to reach the designated target zone quickly, a function measuring how much the distance between the UAV and the target decreases at each step is used. If this value is negative, that is, the UAV moves away from the target, it incurs a penalty of −0.1. Upon reaching the target point, the UAV is rewarded with a value of 5.
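A hedged sketch combining the four components is shown below. The constants quoted in the text (the −0.01 step penalty, the −5 collision penalty, the +5 arrival reward, the −0.1 retreat penalty, and $d_{safe} = 10$) come from the paper, while the scale factors for the heading and obstacle-distance terms are illustrative assumptions.

import numpy as np

D_SAFE = 10.0          # safety distance to the nearest obstacle (from the paper)
HEADING_SCALE = 0.1    # assumed scale for the heading-based reward term
OBSTACLE_SCALE = 0.1   # assumed scale for the obstacle-distance penalty term

def reward(beta, d_obstacle, collided, reached, d_prev, d_curr):
    r = -0.01                                         # 1) per-step penalty to encourage short paths
    r += HEADING_SCALE * np.cos(beta)                 # 2) positive when beta < pi/2, negative when beta > pi/2
    if collided:
        r += -5.0                                     # 3) collision with an obstacle
    elif d_obstacle < D_SAFE:
        r += -OBSTACLE_SCALE * (D_SAFE - d_obstacle)  # 3) graded penalty inside the safety distance
    if reached:
        r += 5.0                                      # 4) reaching the target region
    elif d_curr - d_prev > 0:                         # 4) the UAV moved away from the target
        r += -0.1
    return r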

D. STATE NORMALIZATION
The state space introduced in Section IV-A contains state values with different units and scales, so the values fed into the network must be preprocessed. In this paper, a normalization method is employed to process the state values; in the state space, all state values except $\xi_i$ require preprocessing.
In the normalization, $k_{max}$ and $k_{min}$ denote the upper and lower boundary values of the task scene along the k-axis, respectively, and $v_{max,k}$ denotes the maximum achievable velocity of the UAV in the direction of the k-axis.
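A minimal sketch of this preprocessing step, assuming standard min-max scaling of the position components and scaling of the velocity components by the maximum speed (the paper's exact normalization formula is not reproduced in this excerpt), is as follows:

import numpy as np

def normalize_state(pos, vel, bounds_min, bounds_max, v_max):
    """Scale position components to [0, 1] per axis and velocity components to [-1, 1]."""
    pos = np.asarray(pos, dtype=np.float32)
    vel = np.asarray(vel, dtype=np.float32)
    pos_n = (pos - np.asarray(bounds_min)) / (np.asarray(bounds_max) - np.asarray(bounds_min))
    vel_n = vel / np.asarray(v_max)
    return pos_n, vel_n

# Example with the task-space dimensions used in the experiments (400 x 400 x 100, v_max = 20)
print(normalize_state([200, 100, 50], [10, -5, 2],
                      bounds_min=[0, 0, 0], bounds_max=[400, 400, 100], v_max=[20, 20, 20]))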

E. DESIGN OF UAV PATH PLANNING METHOD BASED ON I-TD3
When the UAV first interacts with the environment, it cannot distinguish between obstacles and targets. It adjusts its policy based on the reward and penalty values received as environmental feedback during exploration, ultimately accomplishing the path planning task. The framework of the path planning algorithm is illustrated in Fig. 7.
When the UAV explores the environment, exploration of the action space is increased by adding Gaussian noise to the actions. The explored experiences are stored as tuples in an experience replay pool. During network training, PER is introduced to prioritize the learning of important experiences, thereby reducing training time. When updating the target values, the average of the two Q values is used, which makes the algorithm more stable. In the end, the UAV is able to autonomously plan paths and successfully complete various tasks in complex environments.
The pseudo-code for the I-TD3 algorithm is as follows:

Algorithm 1: The I-TD3 Algorithm
1: Initialize actor network π_φ and critic networks Q_θ1, Q_θ2 (with their target networks) and the prioritized replay buffer M
2: for each time step t do
3:   Select action a_t ∼ π(s_t) + ϵ, ϵ ∼ N(0, δ); observe reward r_t and new state s_{t+1}
4:   Store (s_t, a_t, r_t, s_{t+1}) in M and set the priority P_t = max_{i<t} P_i
5:   if t > M then
6:     for j = 1 to k do
7:       Sample empirical samples based on P(j)
8:       Update the empirical sample priorities based on the TD-error
9:       Update the critic networks toward the averaged target value
10:      Update φ by the deterministic policy gradient (delayed)
11:    end for
12:  end if
13: end for

V. EXPERIMENTS AND RESULTS

A. EXPERIMENT PLATFORM AND SETTINGS
OpenAI Gym is used as the simulation platform. Gym is a Python library developed and maintained by OpenAI as a toolkit for developing and comparing DRL algorithms. It allows the performance of DRL algorithms to be tested and learned, and it is compatible with numerical computation libraries such as TensorFlow and Torch. Gym provides commonly used DRL environments and also allows for customization. The simulation environment customized on the basis of Gym is depicted in Fig. 8.
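A skeleton of a custom Gym environment of the kind described here is sketched below. The class name, observation layout, and dimensions are illustrative assumptions; the dynamics, LIDAR model, and reward computation are left as placeholders.

import numpy as np
import gym
from gym import spaces

class UAVPathPlanningEnv(gym.Env):
    """Skeleton of a custom 3-D UAV path-planning environment (names and sizes are illustrative)."""

    def __init__(self, n_lasers=16):
        super().__init__()
        # Continuous actions: forces along x/y/z plus a rotation command, each in [-1, 1]
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        # Observation: laser distances, hit indicators, relative position, velocity, d0 and beta
        obs_dim = 2 * n_lasers + 3 + 3 + 2
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)
        self.max_steps = 1000
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        # ... randomly place the UAV in the start area and the target in the goal area ...
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        self.step_count += 1
        # ... propagate UAV and obstacle dynamics, run the LIDAR model, compute the reward ...
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward, done = 0.0, self.step_count >= self.max_steps
        return obs, reward, done, {}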
B. SIMULATION ENVIRONMENT
For training and testing the performance of the I-TD3 algorithm, two experimental environments are established: Environment I and Environment II. The simulation environment we have developed is a rectangular area measuring 400 × 400 × 100. Within this environment, the starting point of the UAV is represented by the blue area, and the destination is denoted by the green area. Random obstacles are scattered throughout the white areas of the environment; these obstacles are positioned in mid-air and are intended to train the UAV's ability to avoid collisions in the Z-axis direction. In our experiments, the UAV can fly at a maximum speed of 20 and each episode is limited to a maximum of 1000 steps. It maintains a flying height ranging from 20 to 100, and the target it aims to reach is a sphere with a diameter of 20. The task of the UAV is to start from the blue area and reach the green area without collision.
Environment I consists of four static environments: E1 is set up with five cylinders of size 15 × 100; E2 sets five cylinders of size 15 × 100 and five cubes of size 30 × 30 × 50; E3 sets five cylinders of size 15 × 100, five cylinders of size 15 × 50, and five cubes of size 50 × 30 × 50; E4 sets five cylinders of size 15 × 100, five cylinders of size 15 × 50, five cubes of size 30 × 30 × 50, and five cubes of size 50 × 30 × 30.
Environment II consists of four dynamic environments: E5 sets five cylinders of size 15 × 100 and five cubes of size 30 × 30 × 50, where half of the obstacles move in the negative direction of the Y-axis at a speed of 10; once an obstacle reaches the boundary of the task space, it reverses direction and repeats the procedure. E6 differs from E5 in that the obstacles move at a speed of 20. E7 sets five cylinders of size 15 × 100 and five cubes of size 30 × 30 × 50, with all obstacles moving at a speed of 10. E8 differs from E7 in that all obstacles move at a speed of 20.

C. TRAINING AND RESULTS
The training of the UAV involves exploring and adjusting its action policy based on environmental feedback to ultimately achieve path planning and obstacle avoidance. At the beginning of each training session, the network parameters are initialized, and the start and end points are randomly generated within their corresponding regions. An episode ends if any of the following conditions occurs: (1) the number of steps reaches 1000, (2) the UAV collides with an obstacle, or (3) the UAV reaches the destination. DDPG, TD3, SAC [44], and the proposed algorithm are used to train the UAV in environment E1. Over 2000 training episodes, the average reward curves of the four algorithms are shown in Fig. 9.
We observe that during the initial training phase the UAV explores the environment randomly, resulting in a very low average reward. As the UAV gathers more data, it begins training the network to update its policy. As the number of training episodes increases, the average reward gradually rises, and the average reward curves of all four algorithms converge at around 200 episodes. Toward the end of training, the average reward curve approaches approximately 0. Compared with the DDPG, TD3, and SAC algorithms, the proposed algorithm converges faster and its convergence process is more stable.
In DRL, parameters refer to the various adjustable variables that directly affect the structure of the deep neural networks, the efficiency of the algorithm, and the training process. The selection and adjustment of these parameters are very important for the performance, convergence speed, stability, and final learning effect of the algorithm. Usually, a series of experiments is conducted to determine the best combination of parameters. In this paper, the experimental parameters were determined through multiple rounds of training experiments, and their values are listed in Table 1. These parameter settings make the algorithm converge rapidly during training, greatly shorten the training time, ensure the stability of the algorithm, and provide good generalization and adaptability to the simulation environment constructed in this paper.

D. TESTING AND RESULTS
After 2000 episodes of training for DDPG, TD3, SAC, and I-TD3, the learned policies are evaluated in two experimental environments: Environment I and Environment II. In Environment I, where all obstacles are stationary, the algorithm's ability to sense the relative motion trend between the UAV and obstacles is evaluated. In Environment II, where all obstacles are dynamic, the UAV's real-time decision-making capability is tested, as the trained policies are fully exploited. No random actions are taken during the testing phase.
The evaluation metrics for assessing algorithm performance are the Average Reward (AR), Loss Rate (LR), Collision Rate (CR), and Success Rate (SR). AR is the average reward value over the entire testing period, indicating the average quality of the test; LR is the percentage of the 2000 test episodes in which the UAV neither reached the target nor collided with an obstacle; CR is the percentage of episodes in which the UAV collided with an obstacle; and SR is the percentage of episodes in which the UAV successfully found the target. The test results of DDPG, TD3, SAC, and I-TD3 in Environment I are shown in Table 2.
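For clarity, the four metrics can be computed from per-episode test outcomes as in the following sketch (the dictionary keys are illustrative assumptions):

def evaluate_metrics(episodes):
    """episodes: list of dicts with keys 'reward', 'reached', 'collided' for each test episode."""
    n = len(episodes)
    ar = sum(e["reward"] for e in episodes) / n             # Average Reward
    sr = sum(e["reached"] for e in episodes) / n * 100      # Success Rate (%)
    cr = sum(e["collided"] for e in episodes) / n * 100     # Collision Rate (%)
    lr = 100 - sr - cr                                       # Loss Rate: neither reached nor collided (%)
    return {"AR": ar, "SR": sr, "CR": cr, "LR": lr}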
We can see that I-TD3 has the highest success rate in all four experimental environments of Environment I, and its advantage becomes more obvious as the complexity of the environment increases. The number of obstacles increases from E1 to E4, and the success rate decreases accordingly; however, the drop in success rate of I-TD3 is the smallest (6%), while the drops of SAC (6.15%), DDPG (8.4%), and TD3 (10.75%) are larger. Due to the limited training episodes and the randomness of the environment, the success rate did not reach 100%, but the results are still good. Finally, compared with the TD3 and SAC algorithms, I-TD3 has a higher success rate but a lower average reward. This is because, in more cases, our algorithm causes the UAV to neither find the target nor collide with obstacles, resulting in very low reward values. The results show that I-TD3 adapts to complex environments better than DDPG, SAC, and TD3. Fig. 10 shows the success rate per 100 episodes in the four static environments of Environment I, and Fig. 11 displays a typical scenario in which the UAV, using the I-TD3 algorithm, successfully reaches the target point in Environment I.
Next, these algorithms are tested in Environment II. The results of DDPG, TD3, SAC, and I-TD3 in the different dynamic environments are shown in Table 3.
As is evident from the table, the success rates of the four algorithms decline as the presence of dynamic obstacles increases. A slight deviation in selecting the optimal action has relatively little impact in a static environment; however, in a dynamic environment, particularly when obstacles move at high speed, it can lead to disastrous consequences. From environment E5 to E8, with the increase of dynamic obstacles, the success rate of DDPG decreased by 7.9%, SAC by 6.35%, and TD3 by 6.5%, while the success rate of I-TD3 decreased by only 3.25%. The table also shows that, in environments E5 to E8, the success rate of the proposed algorithm is much higher than that of DDPG, TD3, and SAC, and its collision rate is the lowest of the four algorithms. This is due to the proposed algorithm's ability to quickly perceive changes in the surrounding environment, enabling the UAV to avoid obstacles and make timely decisions. The results suggest that I-TD3 demonstrates a high degree of adaptability in dynamic environments.

VI. CONCLUSION
This paper proposes a UAV path planning method based on DRL that enables the UAV to complete path planning tasks autonomously in multi-obstacle environments. We introduce priority experience replay as the experience replay pool of the TD3 algorithm, which improves the utilization of sample data. At the same time, the average TD3 algorithm is proposed, which avoids underestimation of the Q value and improves the stability of the algorithm while solving the problem of Q-value overestimation. In addition, we design a new reward function so that the UAV obtains reward feedback at every step, which solves the sparse-reward problem in DRL. We tested the algorithm in a custom simulation environment. The results show that the algorithm can train the UAV to reach the target area safely and quickly in a multi-obstacle environment and has good path planning performance. Compared with DDPG, TD3, and SAC, the proposed algorithm shows better stability and generalization in complex dynamic environments.
Although the experimental results are good, some shortcomings remain. In the simulation experiments, the dynamic obstacles move at a uniform speed, whereas in a real flight environment obstacles do not all move at uniform speeds along prescribed lines; their speeds and routes are random. Therefore, experimental environments closer to reality will be considered to test the performance of I-TD3. In addition, this paper exclusively addresses the path planning problem of a single agent. In future research we will study the path planning problem of multiple agents, which will allow us to further explore the collaborative path planning of UAV clusters and the accomplishment of more intricate tasks.

FIGURE 3. Laser distance on the horizontal plane of the UAV.

FIGURE 4. Laser distance on the vertical plane of the UAV.

FIGURE 5. The state space of the UAV.

FIGURE 6. The action space of the UAV. In Fig. 6, $300 \cdot a_{Forward}$, $300 \cdot a_{Right}$, and $100 \cdot a_{Up}$ respectively represent the forces acting on the UAV in the directions of the Y-axis, X-axis, and Z-axis; they control the UAV's movement in the forward/backward, left/right, and upward/downward directions. $\alpha = \alpha - 2a_{Rotation}$ represents the update of the UAV's heading angle.

FIGURE 7. Framework diagram of UAV path planning based on the I-TD3 algorithm.

FIGURE 9. The convergence curve of average rewards obtained in the training environment E1.

FIGURE 10. The average success rate in Environment I.

FIGURE 11. A typical case of the I-TD3 algorithm successfully reaching the target area in Environment I.

TABLE 2. Test results under Environment I.

TABLE 3. Test results under Environment II.