AUV Dynamic Obstacle Avoidance Method Based on Improved PPO Algorithm

Designing a reasonable obstacle avoidance method for AUV 3D path planning is difficult, and existing obstacle avoidance methods have certain drawbacks. For example, they are only applicable to 2D planar applications and cannot effectively handle dynamic obstacles. To address these problems, we design an obstacle collision prediction model (CPM). Based on the results of the simulation of obstacles’ inertial motion, the safety of the AUV navigation is evaluated to improve the model’s sensitivity to dynamic obstacles. Then, we enhance the learning ability of the sequence sample data by combining it with a long short-term memory (LSTM) network, thus improving the training efficiency and effect of the algorithm. The trained proximal policy optimization (PPO) network can output reasonable actions in order to control the AUV to avoid obstacles, forming an AUV 3D dynamic obstacle avoidance strategy based on the CPM-LSTM-PPO algorithm. The simulation results show that the proposed algorithm has good generalization in uncertain environments. Moreover, it achieves dynamic AUV obstacle avoidance in different three-dimensional unknown environments, providing theoretical and technical support for real path planning.


I. INTRODUCTION
The autonomous underwater vehicle (AUV), which serves as a lightweight underwater observation tool, has gradually become an important piece of equipment for countries to explore marine resources and strengthen the national navy. Among other advantages, AUVs are small, easy to control, and highly intelligent [1], [2]. In a complex and changing marine environment, AUVs' safety obstacle avoidance technology not only supports their navigational and operational functions but is also an important part of their navigation control technology. As various countries have expanded their efforts in ocean exploration, further improvement of AUVs' dynamic obstacle avoidance capability in complex marine environments has become a key avenue for increasing their effectiveness [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Usama Mir.
In AUVs' application scenario, the complex and dense dynamic obstacles of an uncertain environment pose a huge challenge to safe navigation. Traditional obstacle avoidance methods (such as the A* algorithm, artificial potential field method, Voronoi diagram, RRT algorithm, and swarm intelligence algorithms [5], [6], [7], [8], [9]) have been used to avoid obstacles when the environmental information is known. However, because of the dense dynamic obstacles in uncertain environments, AUVs cannot obtain the motion information of dynamic obstacles in advance. Therefore, traditional methods cannot be effectively applied to real-time obstacle avoidance. In addition, the complexity and variability of the uncertain environment impose higher requirements on the speed of the obstacle avoidance algorithm. Traditional methods are overly reliant on environmental dynamic models and AUV models; as a result, the accuracy of the models greatly affects these methods' performance. On the one hand, simple models cannot adequately characterize the complexity of the environment. On the other hand, complex models are too computationally intensive, which not only wastes computational resources but also takes too long to meet the needs of AUVs in uncertain environments. Therefore, it is necessary to design a method that can realize dynamic obstacle avoidance for AUVs in an uncertain environment [10], [11].
With the development of artificial intelligence, more and more advanced intelligent algorithms have been applied in various fields to solve problems that cannot be solved by traditional algorithms. Among intelligent decision-making algorithms, deep reinforcement learning methods stand out for their powerful high-dimensional information perception, understanding, and nonlinear processing capabilities. Wu Yahui et al. proposed an obstacle avoidance model built on a modified artificial potential field method that uses the obstacle information in the environment and the posture and angle information of the robot movement, thus achieving autonomous movement of the robot in unfamiliar scenes [12]. Xiong Juntao et al. addressed the problem that the range repulsion of the artificial potential field method affects shortest-path planning. They proposed a directional penalty obstacle avoidance function, which converts the obstacle range penalty into a single direction penalty. By establishing a virtual robot motion collision model, the direction penalties are given selectively according to the analysis results of the model [13]. Liu Qingjie et al. noted that sparse rewards in the learning process make it difficult to obtain good results; therefore, they improved the reward mechanism by adding real-time rewards and punishments as a supplement, addressing the problems of long learning time and unstable training [14]. Sun Lixiang et al. developed a reward function for reinforcement learning based on human spatial behavior. In this approach, states in which the robot angle changes significantly are punished to meet the requirements of comfortable obstacle avoidance [15]. Mirowski et al. used deep reinforcement learning to make navigation decisions in a grid environment. However, they performed the task in a static obstacle environment, thus failing to verify the algorithm's practicability in an uncertain environment [16]. Qiao et al. used CMAC (cerebellar model arithmetic computer) and SARSA (a temporal-difference reinforcement learning method) to complete automatic obstacle avoidance in unknown environments. However, this method is limited to collision avoidance for a single obstacle [17]. Finally, Zhou Bin et al. used the received signal strength to define the reward value and employed Q-learning to complete AUV path planning and obstacle avoidance. Nevertheless, because the application scenario is simple, they also failed to consider uncertain environments with a large number of dynamic obstacles [18].
In summary, although the above methods have achieved good results in their respective environments, there are still some shortcomings, mainly in the following two areas. First, most algorithms only perform obstacle avoidance or path planning in a static environment and cannot deal with dynamic obstacles; as a result, they are difficult to apply in uncertain environments. Second, due to the environment type for obstacle avoidance and the consideration of model complexity and computational intensity, deep reinforcement learning algorithms can only be applied to the field of two-dimensional plane obstacle avoidance, which is far from a three-dimensional environment. Therefore, these algorithms have certain limitations in guiding real-world applications.
Aiming at the above problems and relying on existing research, this study proposes an AUV dynamic obstacle avoidance method based on proximal policy optimization (PPO) for uncertain environments with dense dynamic obstacles. The specific steps of the study are as follows. First, the obstacle collision prediction model is designed. Based on the results of the simulation of the obstacles' inertial motion, the safety of the AUV navigation is evaluated to improve the model's sensitivity to dynamic obstacles. The introduction of the long short-term memory network transforms the environmental state into a high-dimensional perception situation, strengthening the network's ability to learn time-series obstacle avoidance data. Thus, we propose an AUV dynamic obstacle avoidance method based on a CPM-LSTM-PPO algorithm, which makes full use of the plasticity of the offline training of the neural network and real-time online use. Finally, we design various obstacle avoidance simulation experiments to compare the proposed method with other algorithms in order to verify its effectiveness and superiority.

II. PPO ALGORITHM AND LSTM NETWORK

A. PROXIMAL POLICY OPTIMIZATION ALGORITHM
The proximal policy optimization algorithm [19] is an improved deep reinforcement learning algorithm proposed by OpenAI in 2017. In the same year, DeepMind showed that the agent could explore complex skills without special instructions by training a PPO. This further proved that the PPO algorithm can be better applied to the tasks of continuous control and continuous plotting.
PPO is a new type of policy gradient (PG) algorithm. The main idea of the PG algorithm is to use gradient ascent to update the policy π in order to maximize the expected reward. In the PG algorithm, the objective function for updating the network parameter θ is as follows:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right] \tag{1}$$

Here, $\hat{A}_t$ is the estimate of the advantage function at time step t. The biggest advantage of the PG algorithm is that it can choose actions in a continuous space. Its disadvantage is that it is sensitive to the step size, yet a suitable step size is difficult to choose. To address this shortcoming, PPO first uses the ratio of the action probability π_θ(a|s) under the current policy to the action probability π_θold(a|s) under the previous policy to observe the effect of the agent's action. The ratio of the new and old policies is recorded as

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \tag{2}$$

If the ratio r_t(θ) > 1, the action is more probable under the current policy than under the previous policy; if 0 < r_t(θ) < 1, it is less probable. The objective function can be designed as follows:

$$L(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t\right] \tag{3}$$

Second, to avoid abrupt policy changes during parameter updating, the objective function (formula (3)) must be constrained. The PPO algorithm improves the stability of training agent behavior by constraining policy updates to a small range. There are two kinds of constraints that the PPO algorithm can adopt: limiting the KL divergence or truncation. In practical applications, researchers have found that the truncated method works better. Therefore, the objective function of PPO is optimized as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right] \tag{4}$$

Here, ε is a truncation constant that sets the range of policy updates; it is usually set to 0.1 or 0.2. The clip function is a truncation function that limits the probability ratio r_t(θ) of the new and old policies to the interval [1 − ε, 1 + ε], as shown in Figure 1. The objective function uses the min function to take the smaller of the unclipped term r_t(θ)Â_t and the truncated term.
When the advantage function A is positive, the current action has a positive effect on the optimization goal. Therefore, its probability should be increased, but the ratio is capped at 1 + ε. When A is negative (indicating that the current action is harmful), its probability should be reduced, but the ratio is truncated at 1 − ε.
The core philosophy of the PPO algorithm is to avoid the use of large policy updates in order to solve the problem of difficult step size determination and low data efficiency in the PG algorithm. This greatly reduces the difficulty of debugging by researchers.
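As an illustration of the truncation described above, the following is a minimal NumPy sketch of the clipped objective for a batch of samples. The function names `clipped_surrogate` and `ppo_objective` are hypothetical, and the use of log-probabilities to form the ratio is an assumption made for numerical convenience; it is not taken from the authors' implementation.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def ppo_objective(logp_new, logp_old, advantage, eps=0.2):
    """Batch mean of the clipped term, with the ratio r = exp(logp_new - logp_old)."""
    # Assumption: policies are represented by log-probabilities of the taken actions.
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    return float(np.mean(clipped_surrogate(ratio, advantage, eps)))
```

With ε = 0.2, a ratio of 1.5 and a positive advantage contributes only 1.2 times the advantage, so the policy gains nothing from moving further in that direction within one update.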

B. LONG SHORT-TERM MEMORY NETWORK
Each unit of the LSTM network can be divided into a forget gate f t , input gate i t , and output gate o t (Fig. 2).
Among them, the forget gate uses the sigmoid function to determine how much of the previous output h_{t−1} and cell state C_{t−1} is retained in the current cell state C_t. The calculation formula of the forget gate is as follows:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{5}$$

In formula (5), W_f is the weight matrix; b_f is the bias; x_t is the input of the current network; and [·, ·] represents vector concatenation.
The input gate multiplies the outputs of the sigmoid function and the tanh function to determine how much of the current input x_t is transferred to the cell state C_t. The formula of the input gate is as follows:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \quad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right), \quad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{6}$$

The output gate also uses the outputs of the sigmoid function and the tanh function to determine how much of the cell state C_t is transferred to the current output h_t. The formula of the output gate is as follows:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \quad h_t = o_t \odot \tanh(C_t) \tag{7}$$
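The gate computations described above can be sketched as a single LSTM time step in NumPy. This follows the standard LSTM formulation; the function name `lstm_step` and the dictionary layout of the weights are assumptions for illustration, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    W and b are dicts keyed by 'f', 'i', 'c', 'o' (hypothetical layout);
    each W[k] has shape (hidden, hidden + input) and each b[k] has shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])        # vector concatenation [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new output
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5, so the cell state is simply halved at each step, which is a convenient sanity check.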

III. CPM-LSTM-PPO ALGORITHM
The main goal of this paper is to allow the AUV to find a reasonable path to the target position within the specified number of steps. Here, ''reasonable'' refers to distinguishing obstacles with different directions and speeds so that the AUV's path is closer to reality. In this paper, the PPO algorithm is used for 3D dynamic obstacle avoidance tasks in an unknown environment with multiple obstacles. The training process consists of four stages: initialization, action execution, reward acquisition, and training decision-making (Fig. 3). First, a reasonable environment state and action state are designed. Second, the AUV uses sonar to detect environmental information and collect data. Then, it inputs these data as feature vectors, combined with a reward function, into a neural network for training. Finally, an action is selected according to the exploration strategy, and executing it leads to the next observation. The AUV continuously loops over the last three stages (executing actions, obtaining rewards, and making training decisions) until the training is completed.

A. COLLISION PREDICTION MODEL
The first step is to construct a 3D coordinate system. Take the position of the AUV when the active sailing function is turned on as the origin (0, 0, 0). The heading is the positive direction of the y-axis. The positive direction of the x-axis is in the horizontal direction, perpendicular to the heading direction and pointing to the right. The positive z-axis direction is perpendicular to the heading direction and pointing to the water surface. The second step is to map the detected obstacle recognition frame to the map and update the coordinate information of obstacles and the AUV in real time.
Assume that the velocity v_obs, pitch angle θ_obs, and yaw angle ψ_obs of the obstacle are all fixed within Δt seconds, that the obstacle position measured by the sonar in the previous frame is (x_1, y_1, z_1), and that the obstacle position in the current frame is (x_obs, y_obs, z_obs). The speed of the obstacle is then

$$v_{obs} = \frac{\sqrt{(x_{obs} - x_1)^2 + (y_{obs} - y_1)^2 + (z_{obs} - z_1)^2}}{\Delta t} \tag{8}$$

The yaw angle is

$$\psi_{obs} = \arctan\frac{y_{obs} - y_1}{x_{obs} - x_1} \tag{9}$$

The pitch angle is

$$\theta_{obs} = \arctan\frac{z_{obs} - z_1}{\sqrt{(x_{obs} - x_1)^2 + (y_{obs} - y_1)^2}} \tag{10}$$

These formulas allow the dynamic information of the obstacle to be judged.
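The estimation of obstacle speed, yaw, and pitch from two consecutive sonar frames can be sketched as follows. The function name `estimate_obstacle_motion` is hypothetical, and the use of `math.atan2` (rather than a plain arctangent) is an assumption made so that the heading quadrant is resolved; the angle conventions are chosen so that the per-axis displacements v Δt cos θ cos ψ, v Δt cos θ sin ψ, and v Δt sin θ are recovered.

```python
import math

def estimate_obstacle_motion(prev_pos, curr_pos, dt):
    """Return (speed, yaw, pitch) of an obstacle from two frames taken dt seconds apart."""
    dx, dy, dz = (c - p for c, p in zip(curr_pos, prev_pos))
    speed = math.sqrt(dx * dx + dy * dy + dz * dz) / dt
    yaw = math.atan2(dy, dx)                      # heading in the x-y plane
    pitch = math.atan2(dz, math.hypot(dx, dy))    # elevation above the x-y plane
    return speed, yaw, pitch
```

For example, an obstacle that moves from (0, 0, 0) to (1, 1, 0) in one second has speed √2, yaw π/4, and pitch 0.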
After storing the above information, a three-dimensional map of the absolute coordinates of obstacles, target positions, and the AUV itself is formed.
To build a collision prediction model, the collision distance must be calculated first.
Assume that the position of the AUV in the current frame is (x_auv, y_auv, z_auv) and that the coordinate displacement after completing one navigation step is (Δx_auv, Δy_auv, Δz_auv). That is, the position of the AUV after completing one navigation step is (x_auv + Δx_auv, y_auv + Δy_auv, z_auv + Δz_auv), and the time required for the AUV to complete one step is Δt seconds (Δt is on the order of milliseconds).
The displacement of the obstacle along the x-axis after Δt seconds is Δx_obs = v_obs Δt cos θ_obs cos ψ_obs. The displacement along the y-axis is Δy_obs = v_obs Δt cos θ_obs sin ψ_obs.
The displacement along the z-axis is Δz_obs = v_obs Δt sin θ_obs.
That is, the coordinate of the obstacle after Δt seconds is (x_obs + Δx_obs, y_obs + Δy_obs, z_obs + Δz_obs).
Then, after Δt seconds, the distance dist between the AUV and the obstacle is given by formula (11).
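The one-step distance prediction can be sketched as below: the obstacle displacement is extrapolated from its speed, pitch, and yaw, and the Euclidean distance is taken between the two predicted positions. The function names are hypothetical and introduced only for illustration.

```python
import math

def predict_obstacle_offset(speed, pitch, yaw, dt):
    """Displacement of an obstacle after dt seconds of inertial (constant-velocity) motion."""
    horizontal = speed * dt * math.cos(pitch)
    return (
        horizontal * math.cos(yaw),       # x-axis displacement
        horizontal * math.sin(yaw),       # y-axis displacement
        speed * dt * math.sin(pitch),     # z-axis displacement
    )

def predicted_distance(auv_pos, auv_step, obs_pos, obs_step):
    """Distance between the AUV and the obstacle after both have moved by one step."""
    return math.sqrt(sum(
        (a + da - o - do) ** 2
        for a, da, o, do in zip(auv_pos, auv_step, obs_pos, obs_step)
    ))
```

For instance, an AUV at the origin that moves 0.2 m along +y toward a stationary obstacle at (3, 4.2, 0) ends up 5 m away from it.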
The obstacle distance is scored according to dist, from which the obstacle distance reward R_t is obtained. In this paper, the safe distance is set to 5 meters, the general distance is 3.5 meters, and the dangerous distance is 2 meters. Based on these thresholds, the AUV obstacle distance reward R_t is given by formula (12).
AUV dynamic obstacle avoidance is a continuous process, and the navigation action taken in the current step will greatly affect the next action. Therefore, focusing exclusively on the effect of the current action will often harm overall obstacle avoidance. At the same time, considering the inertia of object motion, both the AUV and dynamic obstacles are unlikely to change their original speed and heading within a few tens of Δt seconds. Therefore, it may be assumed that the AUV repeats the current navigation action over the next several dozen steps, and the influence of inertial motion is extrapolated to calculate the overall AUV obstacle distance reward G_t^m, given by formula (13).
In formula (13), G_t^m is the total obstacle distance reward obtained in m steps, and R_t^n is the obstacle distance reward at the nth step (that is, after nΔt seconds). γ is the attenuation factor, which lies in (0, 1): the nearer in time a predicted reward R_t^n is, the greater its impact on the algorithm, while the accuracy of predictions farther in the future gradually decreases. The addition of γ prevents the collision prediction model from being either too short-sighted or too far-sighted.
Considering the computational performance of the AUV, simulation experiments led us to set m = 30 and γ = 0.95, which gives G_t^30 in formula (14). The collision prediction model in this paper is divided into four levels: A (safe), B (lower collision risk), C (higher collision risk), and D (extremely dangerous). We substitute G_t^30 into the following formula to obtain the AUV's estimated collision rating S_q for this obstacle:

$$S_q = \begin{cases} A, & G_t^{30} > 25.13 \\ B, & 18.85 < G_t^{30} \le 25.13 \\ C, & 6.28 < G_t^{30} \le 18.85 \\ D, & G_t^{30} \le 6.28 \end{cases} \tag{15}$$

Assuming that q obstacles are identified on the same frame of the sonar image, we repeat the above steps for these q obstacles to obtain the collision prediction set S:

$$S = \{S_1, S_2, \ldots, S_q\} \tag{16}$$
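The rating step can be sketched as follows. Two points are assumptions rather than facts from the paper: the discount weights are taken as γ^0, γ^1, ... for steps 1 to m (the exact exponent offset is not recoverable here), and level A is taken to be any value above 25.13, which is the only range the other three levels leave open. The per-step rewards R_t^n are passed in as a list because their values are not reproduced in this text; all function names are hypothetical.

```python
def cumulative_distance_reward(step_rewards, gamma=0.95):
    """Discounted sum of predicted per-step obstacle distance rewards.

    Assumption: the first predicted step has weight gamma**0.
    """
    return sum((gamma ** k) * r for k, r in enumerate(step_rewards))

def collision_rating(g30):
    """Map the 30-step cumulative reward to a collision level A to D."""
    if g30 > 25.13:
        return "A"   # safe (assumed upper band)
    if g30 > 18.85:
        return "B"   # lower collision risk
    if g30 > 6.28:
        return "C"   # higher collision risk
    return "D"       # extremely dangerous

def collision_prediction_set(g_values):
    """Ratings for all obstacles detected in the same sonar frame."""
    return [collision_rating(g) for g in g_values]
```

A call such as `collision_prediction_set([30.0, 20.0, 10.0, 1.0])` yields one rating per obstacle, which together form the set S used in the state.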

B. STATE SPACE AND ACTION SPACE
The environment model of the AUV must consider the target position, boundary information, and obstacle collision prediction model so that the AUV can act reasonably to avoid a collision. A variety of obstacles are set up in this paper's simulation environment and change randomly within a certain range. Because of the huge number of states and actions in the continuous high-dimensional space, it is difficult for the algorithm to converge. Therefore, we discretize information such as obstacles around the AUV into a finite number of states and formulate the state space accordingly. The state space is defined as follows:

$$s_t = (x_{auv},\ y_{auv},\ z_{auv},\ dist_{end},\ step,\ S) \tag{17}$$

In formula (17), (x_auv, y_auv, z_auv) is the position of the AUV in the current frame, and dist_end is the distance between the AUV and the target position. In addition, step is the number of navigation steps taken so far, and S is the collision prediction set.
To speed up the convergence of the network model, the action space consists of six discrete actions: (a_0, a_1, a_2, ..., a_5). Here, a_0, a_1, a_2, a_3, a_4, and a_5 denote a 0.2 m move along the +x-axis, −x-axis, +y-axis, −y-axis, +z-axis, and −z-axis, respectively, where + and − indicate the forward and reverse directions.
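The six discrete actions can be sketched as a lookup table from action index to displacement. The names `STEP`, `ACTIONS`, and `apply_action` are hypothetical and used only for illustration.

```python
STEP = 0.2  # metres moved per action, as stated in the text

# Action index -> (dx, dy, dz) displacement
ACTIONS = {
    0: (STEP, 0.0, 0.0),    # a0: +x
    1: (-STEP, 0.0, 0.0),   # a1: -x
    2: (0.0, STEP, 0.0),    # a2: +y
    3: (0.0, -STEP, 0.0),   # a3: -y
    4: (0.0, 0.0, STEP),    # a4: +z
    5: (0.0, 0.0, -STEP),   # a5: -z
}

def apply_action(position, action):
    """Return the AUV position after taking one discrete action."""
    dx, dy, dz = ACTIONS[action]
    x, y, z = position
    return (x + dx, y + dy, z + dz)
```

Because the heading is the +y direction at the origin, action a_2 corresponds to moving straight ahead from the start position.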

C. DESIGN OF CPM-LSTM-PPO ALGORITHM FRAMEWORK
Since the underwater environment is highly dynamic, high-dimensional, and complex, simply using the fully connected neural network in the PPO algorithm to approximate the policy function and evaluation function is inadequate. The policy network and evaluation network in this paper use the LSTM network framework. First, the LSTM network is introduced to extract features from the high-dimensional environmental situation, output useful perception information, and enhance the learning ability on sequential sample data. Then, the policy function and evaluation function are approximated through a fully connected neural network. Figure 4 shows the framework of the CPM-LSTM-PPO algorithm.
For the policy network part, six nodes are set up in the input layer, corresponding to the six states of s_t. The hidden part consists of an LSTM layer and fully connected layers. The LSTM layer sets up three network units, and the fully connected part is designed as three layers, all of which use tanh as the activation function. The output layer sets a node and uses softmax as the activation function for the simplified discrete action a_t. Figure 5 shows the policy network framework.
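How a softmax output can be turned into one of the six discrete actions is sketched below. This is an interpretation, not the authors' network: it assumes the final layer produces one score (logit) per action, and the function names `softmax` and `select_action` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of action scores."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()            # shift to avoid overflow in exp
    e = np.exp(z)
    return e / e.sum()

def select_action(logits, rng=None, greedy=False):
    """Pick an action index from the softmax distribution.

    greedy=True returns the most probable action (e.g. at deployment);
    otherwise an action is sampled, which provides exploration during training.
    """
    probs = softmax(logits)
    if greedy:
        return int(np.argmax(probs))
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))
```

Sampling from the distribution during training and taking the arg-max afterwards is a common convention for discrete-action policy gradient methods; the paper does not state which variant it uses.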

D. REWARD AND PUNISHMENT FUNCTION
In deep reinforcement learning algorithms, all objectives can be described by maximizing the expected cumulative reward. Therefore, AUVs can learn the correct strategy from feedback signals when interacting with the environment.
The reward and punishment function is the key to determining whether the deep reinforcement learning network model can successfully converge. In this paper, the reward and punishment function R is mainly composed of three parts: the reward and punishment for distance change R_1, the reward and punishment for collision prediction R_2, and the reward and punishment for arrival, out of bounds, and collision occurrence R_3. R_1 means that the AUV receives an appropriate reward if it is closer to the target position after performing a step action and a penalty otherwise. R_2 assigns collision prediction rewards and punishments according to each rating S_q in S. R_3 gives a completion reward when the AUV reaches the target position and a failure penalty if its coordinates exceed the delimited boundary or it collides.

$$dist = \sqrt{(x_{auv} + \Delta x_{auv} - x_{obs} - \Delta x_{obs})^2 + (y_{auv} + \Delta y_{auv} - y_{obs} - \Delta y_{obs})^2 + (z_{auv} + \Delta z_{auv} - z_{obs} - \Delta z_{obs})^2} \tag{11}$$

The reward and punishment function is the sum of the three parts, R = R_1 + R_2 + R_3. In formula (20), predist_end represents the distance between the AUV and the target position before the action is performed.
Appropriate safety rewards and punishments and severe dangerous action penalties through collision prediction allow the algorithm to take safe obstacle avoidance actions.
To prevent situations in which the AUV can never reach the target position, this paper sets a maximum number of steps σ for the map. This value is computed from the map size, where l, w, and h are the length, width, and height of the map, respectively, and λ is a parameter related to the complexity of the map; a larger value should be set for a more complex map. The current episode ends immediately if R ≥ 30,000, R ≤ −10,000, or the step number ≥ σ.
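The three termination conditions can be sketched as a single check. The function name `episode_done` is hypothetical, and the step limit σ is passed in as `max_steps` because its formula is not reproduced here.

```python
def episode_done(reward, step, max_steps,
                 success_reward=30_000, failure_reward=-10_000):
    """Return True when the current episode should end.

    reward    : value of the reward and punishment function R
    step      : number of steps taken so far
    max_steps : step limit sigma for the current map (supplied by the caller)
    """
    reached_goal = reward >= success_reward   # completion reward received
    failed = reward <= failure_reward         # out of bounds or collision
    timed_out = step >= max_steps             # step budget exhausted
    return reached_goal or failed or timed_out
```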

IV. EXPERIMENTS
This paper uses the Python-based physics engine PyBullet to build the simulation environment. The computer configuration for AUV training is as follows: the hardware environment is an Intel i5-7300HQ processor, 16GB memory, and NVIDIA GeForce GTX 1050Ti graphics card. The software environment is Python 3.10.
In this paper, several experiments are designed to verify the algorithm's effectiveness. Experiment 1 is an AUV dynamic obstacle avoidance simulation experiment based on the CPM-LSTM-PPO algorithm. Experiment 2 is a comparison experiment between the algorithm in this paper and other algorithms. Experiment 3 examines a random dynamic obstacle avoidance scene.

A. SIMULATION EXPERIMENT 1) ENVIRONMENT MODEL AND TRAINING PARAMETERS
The basic training simulation environment designed in this paper has certain representativeness (Fig. 6). The length, width, and height of the training environment are 55 m, 18 m, and 14 m, respectively. The red line is the boundary line, the green line is the target position, and the orange line is a segment of the navigation trajectory generated by the AUV every 40 steps. The AUV first traverses three uprights and then five lateral static obstacles. Then, it passes through two dynamic obstacles that move left and right and one that moves up and down. The obstacles follow uniform round-trip linear motion.
In the experiment, the continuous observation vector space is used, and the eigenvectors are applied to represent the observation results of the intelligent agent at each step. Each AUV continuously learns and explores according to the method proposed in this paper. During the training process, R is always followed for certain rewards and punishments. The entire scene is reset at the end of each round or if the AUV goes out of bounds. The total number of training iterations in this experiment is 6,000, and the PPO parameter settings are shown in Table 1.

2) EXPERIMENTAL RESULTS AND ANALYSIS
The average reward obtained every 10 rounds during the training process, as well as the number of steps taken by the AUV to reach the target position each time, was recorded (Figs. 7 and 8). By about 2,000 rounds, the average reward has increased from a negative value to 0, which indicates that the CPM-LSTM-PPO algorithm has gained some obstacle avoidance experience. By the 3,000th round, the average reward for every 10 rounds fluctuates around 15,000. The average reward does not converge above 30,000 because failed attempts lower the 10-round average; in other words, the algorithm's success rate is below 100%. Figure 8 shows that after the AUV first reaches the target position, the number of steps used gradually decreases. After the target position has been reached 1,500 times, the number of steps used stabilizes and continues to fluctuate around 500 steps, indicating that the CPM-LSTM-PPO algorithm tends to converge. Figure 9 shows the path planned by the model trained with the CPM-LSTM-PPO algorithm. The AUV selects the safest position in the middle when crossing the static uprights and lateral columns, indicating that the model produces a smooth path and has learned both to reach the target position and to avoid dynamic obstacles.
In the same experimental environment, this paper compares two versions of the reward and punishment function: first, the complete reward and punishment mechanism, namely R = R_1 + R_2 + R_3; and second, a version without the collision prediction model, that is, R = R_1 + R_3. Figure 10 compares the average rewards per hundred rounds for the two reward and punishment functions. The blue line represents the first case (the CPM-LSTM-PPO algorithm), and the orange line represents the second case. The figure shows that the blue line achieves a better cumulative reward value in fewer iterations; the average reward reaches 15,000 after 3,000 rounds of training. Without the collision prediction model, the AUV needs a longer learning time in early training, and the average reward reaches only 10,000 after 5,000 rounds. The experimental results show that adding a collision prediction model can improve AUV training efficiency and speed up the AUV's exploration of the environment.

B. COMPARATIVE EXPERIMENT
For more complex multi-dynamic obstacle scenarios, this paper performs AUV dynamic obstacle avoidance tasks based on the DQN algorithm, the TRPO algorithm, the LSTM-PPO algorithm, and the proposed CPM-LSTM-PPO algorithm. In particular, we compare the average reward obtained in the same scenario and the number of steps taken to reach the target position.
The multi-dynamic obstacle scene consists of seven cubes that engage in round-trip linear motion with different headings and speeds. Figure 11 shows the average rewards per hundred rounds obtained by the four algorithms in a multi-dynamic obstacle environment. The CPM-LSTM-PPO algorithm fluctuates less in the early training process than the DQN and TRPO algorithms. All three algorithms start to converge around 6,000 rounds, but the LSTM-PPO algorithm gradually declines after achieving a high reward score at 2,000 rounds. The algorithm in this paper uses the memory function of the LSTM neural network to accumulate higher rewards with the help of the collision prediction model. In the later stages of training, its average reward per ten rounds fluctuates around 22,000. The DQN, TRPO, and LSTM-PPO algorithms converge at 5,000, 8,000, and 10,000 rounds, respectively. These results indicate that the CPM-LSTM-PPO algorithm model has high performance, strong stability, and better generalization ability. Figure 12 shows the algorithm's obstacle avoidance process in a multi-dynamic obstacle scene. The AUV maneuvers to avoid the cubic obstacles, always maintains a safe distance from them, and completes the obstacle avoidance task on its way to the target position. The path is smooth, without sharp turns and without many redundant sections. The track of the LSTM-PPO algorithm without collision prediction is similar, but it does not keep enough distance from the obstacles. Figure 13 is the obstacle avoidance diagram of the comparison algorithms in the multi-dynamic obstacle scene. The planned paths of the DQN and TRPO algorithms are the same. Both make only the necessary evasive maneuvers, and the path tends to be a smooth arc, which results in the fewest steps used and the shortest path. However, the downside is that they ignore the need to keep a safe distance from the obstacles, which gives them a lower score. This is also one of the factors explaining why the final average reward of the DQN, TRPO, and LSTM-PPO algorithms is lower than that of the algorithm proposed in this paper. Figure 14 compares the number of steps used in a multi-dynamic obstacle scene. The number of steps used by the CPM-LSTM-PPO algorithm decreases rapidly after the target position has been reached 200 times and converges around 570 steps after 300 times. Meanwhile, the number of steps used by the LSTM-PPO algorithm gradually decreases as the number of successes increases, finally converging around 450 steps after 400 times. Both the DQN and TRPO algorithms converge to around 260 steps after reaching the target position 600 times. Table 2 shows the obstacle avoidance results of each algorithm in a multi-dynamic obstacle scene after 5,000 rounds of training. Although the CPM-LSTM-PPO algorithm uses the most steps on average, its 70.76% success rate is much higher than the 56.66% of the DQN algorithm and the 52.06% of the TRPO algorithm. The CPM-LSTM-PPO algorithm takes more evasive actions to maintain a safe distance from obstacles, using the collision prediction model to improve the AUV's sensitivity to dynamic obstacles. It thus achieves a higher obstacle avoidance success rate.

V. DISCUSSION
Although the obstacles are dynamic in the above two experimental scenarios, their initial position, heading, and speed are all fixed. To give the algorithm good generalizability, it is necessary to randomize the position and motion information of each obstacle in the training environment. This will inevitably lead to longer algorithm training times and will require better obstacle avoidance performance from the algorithm. To test the obstacle avoidance effect of the CPM-LSTM-PPO algorithm in a random environment, we modify the multi-dynamic obstacle scene used in this paper. The initial positions (x, y, z) of the seven cube obstacles will appear randomly among ([−8, 8], [0, 45], [2,10]). The speed of each step is random between [0.02, 0.15], and the heading is uncertain. The obstacles engage in back-and-forth motion after touching the boundary.
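The randomized scene can be sketched with the standard library as follows. Several details are assumptions made for illustration: the heading is drawn uniformly on the unit sphere, the sampling ranges double as the motion boundary, and touching the boundary reverses the whole velocity vector so that the obstacle retraces its line. All names are hypothetical.

```python
import math
import random

# Ranges stated in the text: initial position (x, y, z) and per-step speed
BOUNDS = ((-8.0, 8.0), (0.0, 45.0), (2.0, 10.0))
SPEED_RANGE = (0.02, 0.15)

def random_obstacle(rng):
    """Sample an initial position and a per-step velocity for one cube obstacle."""
    pos = [rng.uniform(lo, hi) for lo, hi in BOUNDS]
    # Assumption: random heading sampled uniformly on the unit sphere.
    while True:
        d = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in d))
        if norm > 1e-9:
            break
    speed = rng.uniform(*SPEED_RANGE)
    vel = [speed * c / norm for c in d]
    return pos, vel

def step_obstacle(pos, vel, bounds=BOUNDS):
    """Advance one step; reverse direction instead of leaving the bounds."""
    nxt = [p + v for p, v in zip(pos, vel)]
    inside = all(lo <= c <= hi for c, (lo, hi) in zip(nxt, bounds))
    if inside:
        return nxt, list(vel)
    return list(pos), [-v for v in vel]   # back-and-forth motion

def make_scene(n_obstacles=7, seed=None):
    """Create the seven-obstacle random scene."""
    rng = random.Random(seed)
    return [random_obstacle(rng) for _ in range(n_obstacles)]
```

Passing a fixed `seed` to `make_scene` reproduces the same layout, which is convenient when comparing algorithms on identical random scenes.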
Transfer learning makes the training of the target task more flexible, efficient, and realistic by applying the experience learned from the source task to the target task. The implementation methods of transfer learning include instance-based, feature-based, model-based, and relation-based methods. In this paper, model-based transfer learning is used: the weights of the CPM-LSTM-PPO model network are initialized with the model parameters pretrained in the multi-dynamic obstacle scenario, replacing the original random initialization, and global fine-tuning is then performed. The rest of the training process is carried out as usual. This can achieve a faster model fit and improve the results. Figures 15 and 16 show the average reward per 10 rounds and the average number of steps taken by the AUV to reach the target position per 10 times, respectively. With the prior knowledge from transfer learning, the model achieves high scores at the beginning of training, and the average reward fluctuates around 15,000. When the algorithm iterates to the 10,000th round, the average reward fluctuation decreases, but the score decreases as well. Figure 16 shows that the average number of steps gradually decreases with the number of times the target position is reached. After the target position has been reached 5,800 times, the average number of steps continues to fluctuate around 490 steps. Figure 17 is a path-planning diagram for a random dynamic obstacle avoidance scenario. It can be seen that the proposed algorithm still has good passing performance and strong generalization ability in this completely random, complex, unknown environment. This suggests that the algorithm could also be used in a real unknown underwater environment.
In the experimental data for 5,000 runs after training, 2,971 runs are successful, giving a success rate of 59.42%. The average number of steps is 503. While the success rate is lower than that of the multi-dynamic obstacle scene, the number of steps used is also lower. In this paper, the proximal policy optimization algorithm is used to control a virtual AUV on the map to explore the obstacle avoidance path (instead of directly controlling a real AUV). This approach decouples the obstacle avoidance method from the AUV's propulsion system. The obstacle avoidance method in this paper is applicable as long as the propulsion system can be controlled to follow the path on the map. This independence from the number of thrusters and the method of propulsion greatly improves the method's generalizability.
However, there are inevitably errors in the actual application process. In particular, this paper assumes that AUVs can avoid obstacles in an ideal environment, which means that they can obtain obstacle information without delay and are not affected by dynamic current and other factors in the underwater environment. Therefore, future research can refer to the following latest work. Zhengru Fang et al. formulated a two-stage joint power control, computational resource allocation, and trajectory scheduling for Internet of Underwater Things (IoUT) networks; the approach considered the turbulent ocean environments in the context of a multi-AUV-aided heterogeneous network for energy-efficient information collection [20]. J Wang et al. proposed an active queue management (AQM) policy for the IoUT node in order to reduce the peak age of information (PAoI), beneficially compressing the packets with a long waiting time [21]. G. Han et al. focused on passive attacks in underwater acoustic sensor networks and proposed an autonomous underwater vehicle (AUV)-aided data-importance-based scheme for protecting location privacy (DIS-PLP) [22].

VI. CONCLUSION
This study builds an obstacle collision prediction model. Based on the results of the simulation of the obstacles' inertial motion, the safety of AUV navigation is evaluated to improve the model's sensitivity to dynamic obstacles. The introduction of the long short-term memory network transforms the environmental state into a high-dimensional perception situation, strengthening the network's ability to learn time-series obstacle avoidance data. Thus, we propose an AUV dynamic obstacle avoidance method based on a CPM-LSTM-PPO algorithm. Using the improved PPO algorithm in an unknown 3D environment with multiple types of obstacles and without any prior knowledge, the AUV can find a better path and complete the obstacle avoidance task after repeated trial and error. This is more in line with actual situations and achieves a high obstacle avoidance success rate.