Communication-Efficient and Collision-Free Motion Planning of Underwater Vehicles via Integral Reinforcement Learning

Motion planning of underwater vehicles is regarded as a promising technique to compensate for the limited flexibility of underwater sensor networks (USNs). Nonetheless, the unique characteristics of the underwater channel and environment make this mission challenging. This article is concerned with a communication-efficient and collision-free motion planning problem for underwater vehicles in fading channels and obstacle environments. We first develop a model-based integral reinforcement learning (IRL) estimator to predict the stochastic signal-to-noise ratio (SNR). With the estimated SNR, an integrated optimization problem for the codesign of communication efficiency and motion planning is constructed, in which the underwater vehicle dynamics, communication capacity, collision avoidance, and position control are all considered. To tackle this problem, a model-free IRL algorithm is designed to drive underwater vehicles to the desired position points while maximizing the communication capacity and avoiding collisions. It is worth mentioning that the proposed motion planning solution considers a realistic underwater communication channel as well as a realistic dynamic model for underwater vehicles. Finally, simulation and experimental results are presented to verify the effectiveness of the proposed approach.


Jing Yan, Senior Member, IEEE, Wenqiang Cao, Xian Yang, Cailian Chen, Member, IEEE, and Xinping Guan, Fellow, IEEE

I. INTRODUCTION
IN ORDER to understand and explore the ocean, many underwater sensor nodes, including multibeam swath bathymeters, sonar arrays, and acoustic Doppler current profilers, have been deployed to form underwater sensor networks (USNs) [1], [2]. The deployment of USNs can increase the space-time coverage of ocean monitoring; however, USNs lack the necessary flexibility and autonomy, and therefore cannot deal with the highly dynamic uncertainties of complex underwater environments. With regard to this, underwater vehicle-assisted USNs have emerged as a promising communication platform for future ocean-observation systems, owing to their high mobility, controllable maneuvering, and on-demand deployment. These appealing advantages have enabled various applications, including intrusion surveillance, data gathering, geographic mapping, petroleum exploration, and transmission of images from remote sites (see [3], [4], [5] and references therein).

Jing Yan, Wenqiang Cao, and Xian Yang are with the Institute of Electrical Engineering, Yanshan University, Qinhuangdao 066004, China (e-mail: jyan@ysu.edu.cn; cwq@stumail.ysu.edu.cn; xyang@ysu.edu.cn).
Digital Object Identifier 10.1109/TNNLS.2022.3226776
In underwater vehicle-assisted USNs, one of the most critical issues is to plan paths for underwater vehicles. For instance, an energy-efficient motion planning strategy was provided in [6] to balance the communication energy consumption and prolong the network lifetime. Yetkin et al. [7] incorporated the environment information into the path planning of underwater vehicles, through which a decision-theoretic-based subsea search algorithm was designed. In [8], an end-to-end data freshness constraint was used to determine the paths of underwater vehicles, with the aim of delivering the collected data to the control center as soon as possible. Following this, a heuristic algorithm was provided in [9] to optimize the paths of underwater vehicles with respect to data quality and underwater coverage efficiency. Wang et al. [10] employed an acoustic camera to capture the position and shape of unknown underwater pipelines. These schemes are well developed; however, they do not take collision avoidance into consideration. In practice, obstacles such as wrecks and plankton inevitably exist in water, and collisions between vehicles may also occur when they work together. These collision constraints have a strong impact on the motion safety and communication channel of underwater vehicles [11]. Therefore, it is necessary to plan a collision-free path for each underwater vehicle.
To resolve the above problem, Song et al. [12] developed a joint flocking and guidance scheme for underwater vehicles evolving in environments with obstacles. In [13], an artificial potential-based motion planning strategy was proposed, where software-defined technology was employed to improve the scalability and controllability. A multilayered motion planning scheme was presented in [14] for underwater navigation, wherein a local motion planner was employed to avoid collision with obstacles. Note that the models of underwater vehicles in [12], [13], and [14] treat each vehicle as a point mass, such that a second-order nonlinear equation can be used to describe the kinematics of underwater vehicles. However, a kinematics model cannot capture the low-level interactions that arise when motion planning is implemented on an actual vehicle. As such, Heshmati-Alamdari et al. [15] jointly considered the kinematics and dynamics models of underwater vehicles, through which a model predictive controller was developed to steer each underwater vehicle along the desired trajectory with collision avoidance. In [16], an adaptive motion controller was provided to achieve finite-time formation control and obstacle avoidance for underwater vehicles. Nevertheless, the above motion planning schemes rely on full or partial knowledge of the underwater vehicle dynamic model. Due to the harsh ocean conditions, it is difficult, if not impossible, to acquire an accurate dynamics model of underwater vehicles. With regard to this, an artificial potential function-based motion planning scheme was developed in [17] to relax the dependence on model parameters of underwater vehicles. Also of relevance, A-star algorithms were designed in [18] and [19] to steer underwater vehicles to the target points with collision avoidance. Although the artificial potential function and A-star algorithms are simple to implement, they easily get stuck at locally optimal values. More recently, Jiang et al. [20] and Kontoudis and Vamvoudakis [21] employed distributed learning algorithms to reduce the dependency on model parameters and achieve global optimization; however, these algorithms were not developed in the context of collision-free motion planning for underwater vehicles. Due to the complex dynamics of underwater vehicles, the issue of how to adopt a learning strategy to design a collision-free motion planning scheme for underwater vehicles without relying on models is largely unexplored.
Apart from that, most of the existing motion planning schemes focus on the control techniques and ignore the influence of the underwater acoustic communication channel. To be specific, they assume that the channel quality is approximated by a deterministic disk model. This assumption is reasonable for terrestrial vehicles; however, it is not valid for underwater vehicles. As pointed out in [22] and [23], underwater acoustic communication suffers from stronger shadowing and multipath fading than terrestrial radio communication. Ignoring the shadowing and multipath fading factors may deteriorate the communication quality during the motion planning process. Therefore, we need to incorporate the underwater communication quality into the motion planning metric, such that a communication-efficient and collision-free motion planning scheme can be developed to improve the communication capacity via the control feedback of underwater vehicles. The above idea is similar to the codesign of estimation and communication for multiagent systems, e.g., [24], [25]. To this end, we notice that some communication-efficient motion planning schemes have been developed for terrestrial vehicles. For instance, a gradient estimation-based motion controller was developed in [26] to optimize the communication chain. In [27] and [28], two codesign frameworks for aerial vehicle motion planning and communication efficiency were constructed. Yan and Mostofi [29] integrated the probabilistic signal-to-noise ratio (SNR) prediction approach into the router planning of robots. Following this, Ali et al. [30] extended the first-order linear kinematic model of vehicles to a second-order one, through which a motion-communication cooptimization solution was designed. Note that the vehicle model in [26], [27], [28], [29], and [30] is reduced to a first-order or second-order kinematic equation, which cannot capture the actual dynamics of underwater vehicles. In [31], the kinematic and dynamics models of robots were incorporated into communication-aware motion planning. Nonetheless, least square estimators are used in the above literature to seek the SNR parameters, and these estimators are easily trapped in local optima. To compensate for these shortcomings, distributed learning offers a feasible solution, since it seeks a globally optimal solution via online learning and iteration [32]. Our previous works [33], [34] employed a learning algorithm to solve the underwater localization problem. Nevertheless, how to develop a learning-based solution that jointly solves collision-free motion planning and globally optimal SNR estimation for underwater vehicles is still an open issue.
This article studies a communication-efficient and collision-free motion planning problem for underwater vehicles in fading channels and obstacle environments. A novel two-stage solution is developed, i.e., sensor nodes predict the SNR parameters in the first stage, and underwater vehicles dynamically adjust their positions in the second stage. In such a solution, underwater vehicles behave as mobile communication relaying nodes whose aim is to improve the communication capacity. The main contributions of this article lie in three aspects. Compared with the least square estimators in [29], [30], and [31], the IRL estimator in this article can avoid local minima. Meanwhile, the shadowing and multipath fading effects are considered in this article, whereas they are ignored for terrestrial vehicles, e.g., [26].

3) Model-Free Learning Algorithm for Collision-Free Motion Planning: With the predicted SNR information, a model-free IRL algorithm is developed to steer underwater vehicles to the desired position points while avoiding collision with obstacles and the other vehicles. Different from the motion planning controllers in [12], [13], and [14], the developed motion planning algorithm not only considers collision avoidance, but also takes into account the dynamics model of underwater vehicles. Compared with the solutions in [15] and [16], it relaxes the dependency on vehicle model parameters.

II. SYSTEM MODEL AND PROBLEM FORMULATION
We would like to transmit the data from the source sensor node to the destination sensor node, as shown in Fig. 1. To this end, the following four types of nodes are provided.
1) Source Sensor Node: The position of the source sensor node is fixed, and it sends the collected data to the destination sensor node through the relays of underwater vehicles.
2) Destination Sensor Node: The position of the destination sensor node is also fixed, and its objective is to indirectly receive the data from the source sensor node.
3) Underwater Vehicle: Underwater vehicles act as mobile communication relaying nodes.
4) Ordinary Sensor Node: Ordinary sensor nodes sense and collect channel measurements for underwater vehicles; they do not undertake the relay task.
On the basis of the above framework, we employ a team of ordinary sensor nodes to predict the stochastic SNR parameters. In view of this, let $p_s = \{p_{s,1}, \ldots, p_{s,N}\}$ be the position set of the ordinary sensor nodes and $p_{s,i} = [x_{s,i}, y_{s,i}, z_{s,i}]^T$ be the position vector of ordinary sensor node $i \in V_s = \{1, \ldots, N\}$, where $x_{s,i}$, $y_{s,i}$, and $z_{s,i}$ are the positions on the X-, Y-, and Z-axis, respectively. Let $E = \{R_0, R_1, \ldots, R_{M+1}\}$ be the end-to-end communication chain, where $R_0$ denotes the source sensor node, $R_{M+1}$ denotes the destination sensor node, and the others are the underwater vehicles. Specifically, $V = \{R_1, \ldots, R_M\}$ represents the set of underwater vehicles, where each underwater vehicle $R_i$ relays data from its single source neighbor $R_{i-1}$ to its single destination neighbor $R_{i+1}$. For underwater vehicle $R_i$, the position vector is defined as $p_i = [x_i, y_i, z_i]^T$, while the position vectors of the source sensor node and destination sensor node are defined as $p_0 = [x_0, y_0, z_0]^T$ and $p_{M+1} = [x_{M+1}, y_{M+1}, z_{M+1}]^T$, respectively. Besides that, $N$ and $M$ are the total numbers of ordinary sensor nodes and underwater vehicles, respectively.
The inertial reference frame (IRF) and body-fixed reference frame (BRF) are jointly utilized to depict the dynamic model of underwater vehicles. The position and orientation vector of underwater vehicle $R_i \in V$ in IRF is defined as $\eta_i = [x_i, y_i, z_i, \psi_i]^T$, where $\psi_i$ is the yaw angle. The linear and angular velocity vector in BRF is $\mathbf{v}_i = [u_i, v_i, w_i, r_i]^T$, where $u_i$, $v_i$, and $w_i$ are the linear velocities in surge, sway, and heave, respectively, and $r_i$ is the angular velocity in yaw. From [35], [36], the dynamic model of underwater vehicle $R_i$ is given by (1), which involves the inertia, Coriolis-centripetal, and damping matrices. In (1), $J_i(\eta_i) \in R^{4 \times 4}$ is the rotation matrix, $g_i(\eta_i) \in R^4$ is the hydrostatic force, and $\tau_i = [\tau_{u_i}, \tau_{v_i}, \tau_{w_i}, \tau_{r_i}]^T$ is the control input, where $\tau_{u_i}$, $\tau_{v_i}$, $\tau_{w_i}$, and $\tau_{r_i}$ are the control forces in surge, sway, heave, and yaw, respectively. By defining the state vector $X_i$, model (1) is rearranged into the form (2).

In order to improve the communication capacity, the link capacity $C_i$ of underwater vehicle $R_i$ is introduced. Referring to [37], the capacity of an end-to-end communication chain is equal to the capacity of its worst link. Accordingly, the link capacity $C_i$ is defined as $C_i = \min\{c_{i-1,i}, c_{i,i+1}\}$, where $c_{i-1,i}$ is the link capacity between source neighbor $R_{i-1}$ and vehicle $R_i$, and $c_{i,i+1}$ is the link capacity between vehicle $R_i$ and destination neighbor $R_{i+1}$. Note that the link capacity is a function of the bandwidth and of the communication quality, i.e., the SNR [37]. The SNR between source neighbor $R_{i-1}$ and vehicle $R_i$, denoted by $\mathrm{SNR}_{dB}(p_{i-1}, p_i)$, is expressed in (6), where the absorption term satisfies $10 \log_{10} \alpha(f) = \frac{0.11 f^2}{1 + f^2} + \frac{44 f^2}{4100 + f^2} + 2.75 \times 10^{-4} f^2 + 0.003$. In addition, $K_{dB}$ denotes the average energy consumption of transmitting 1 bit of data in dB, $n_{PL}$ denotes the spreading coefficient, $l_{i-1,i} = \|p_{i-1} - p_i\|$ denotes the relative distance between source neighbor $R_{i-1}$ and underwater vehicle $R_i$, $N_{0,dB}$ denotes the noise power spectral density in dB, and $\alpha(f)$ is the acoustic absorption at frequency $f$. Moreover, $\sigma_{SH}(p_{i-1}, p_i)$ and $\mu_{MP}(p_{i-1}, p_i)$ represent the location-related stochastic parameters, which reflect the effects of shadowing and multipath fading, respectively.
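As a quick numerical illustration of the absorption term, the following minimal sketch evaluates $10 \log_{10} \alpha(f)$ under the assumption that it follows Thorp's empirical formula (frequency in kHz, result in dB/km). The function name `thorp_absorption_db_per_km` is hypothetical and is not taken from the article.

```python
def thorp_absorption_db_per_km(f_khz: float) -> float:
    """Thorp's empirical absorption 10*log10(alpha(f)) in dB/km, with f in kHz.

    Assumption: the SNR model uses Thorp's formula for the absorption term.
    """
    f2 = f_khz ** 2
    return (0.11 * f2 / (1.0 + f2)
            + 44.0 * f2 / (4100.0 + f2)
            + 2.75e-4 * f2
            + 0.003)


if __name__ == "__main__":
    # Absorption grows quickly with the carrier frequency.
    for f in (1.0, 10.0, 25.0):
        print(f, "kHz ->", round(thorp_absorption_db_per_km(f), 3), "dB/km")
```

The rapid growth of the absorption with frequency is one reason why the distance between relay neighbors matters so much for the achievable SNR.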
Remark 1: Different from the simplified SNR models in [38] and [39], the shadow fading parameter $\sigma_{SH}$ and the multipath parameter $\mu_{MP}$ are both considered in this article, which can well capture the realistic underwater environment. An example of the shadowing and multipath fading is shown in Fig. 2.
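From (6) and the Shannon-Hartley theorem [26], the link capacity $c_{i-1,i}$, which provides the theoretical upper bound, can be obtained as $c_{i-1,i} = B \log_2(1 + \mathrm{SNR})$, where $B$ denotes the communication bandwidth and $\mathrm{SNR}$ is the linear-scale value corresponding to $\mathrm{SNR}_{dB}(p_{i-1}, p_i)$. Similarly, the detailed expression of $c_{i,i+1}$ can also be acquired.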
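The capacity computation can be sketched as follows. This is a minimal illustration, assuming the SNR is supplied in dB and converted to linear scale before the Shannon-Hartley bound is applied; the function names `link_capacity` and `relay_capacity` are hypothetical.

```python
import math


def link_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley upper bound B*log2(1 + SNR), with the SNR given in dB."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def relay_capacity(bandwidth_hz: float, snr_in_db: float, snr_out_db: float) -> float:
    """Capacity C_i of relay R_i: the worse of its incoming and outgoing links."""
    return min(link_capacity(bandwidth_hz, snr_in_db),
               link_capacity(bandwidth_hz, snr_out_db))


if __name__ == "__main__":
    # The weaker (0 dB) link limits the relay capacity.
    print(relay_capacity(1000.0, 0.0, 10.0))
```

Because the relay capacity is a minimum over two links, moving a vehicle toward one neighbor only helps until the other link becomes the bottleneck.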

Assumption 1: The obstacles in the underwater environment can be covered by convex cylinders. The $m$th ($m = 1, 2, \ldots$) obstacle in the environment is denoted as $B_m(O_m, \rho_m)$, where $O_m$ is its center and $\rho_m$ ($\rho_m > 0$) is its radius.
Definition 1 (Obstacle Set): For underwater vehicle $R_i$ at time $t$, the detected obstacle set is defined as the subset of obstacles $B_m(O_m, \rho_m)$ that are detected by $R_i$ at time $t$.
Assumption 2: The obstacles are sparsely distributed in the underwater environment, and hence, the impact of obstacles on the communication channel is ignored. Of note, this assumption has been made in some existing works, e.g., [29], [40].
Assumption 3: The offline map of the target area is assumed to be known through the sonar installed on the underwater vehicle. Meanwhile, the underwater vehicle is equipped with a camera, which is capable of detecting obstacles online within a certain range.
Accordingly, the following two problems are formulated.

Problem 1 (SNR Prediction in Fading Channel): The shadow fading and multipath parameters in the SNR model cannot be obtained in advance. In view of this, we attempt to design a model-based IRL estimator to capture the unknown channel parameters. This problem can be reduced to the estimation of $K_{dB}$, $n_{PL}$, $\sigma_{SH}$, and $\mu_{MP}$ with the limited channel measurements from ordinary sensor nodes $i \in \{1, \ldots, N\}$.
Problem 2 (Collision-Free Motion Planning): It is difficult, if not impossible, to acquire the accurate dynamic model of underwater vehicles. Meanwhile, obstacles increase the difficulty of motion planning for underwater vehicles. In view of this, we aim to employ IRL to develop a model-free and collision-free motion planning algorithm. This problem is reduced to maximizing $C_i$ while guaranteeing $\|p_i - O_m\| > \rho_m$ and $\|p_i - p_j\| > 0$.
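The two constraints of Problem 2 can be checked directly, as in the minimal sketch below. It assumes each obstacle is summarized by a center point and a radius and uses the plain Euclidean distance; the helper name `is_collision_free` and the `min_separation` parameter are hypothetical.

```python
import math


def is_collision_free(p_i, obstacles, neighbors, min_separation=0.0):
    """Check ||p_i - O_m|| > rho_m for all obstacles and ||p_i - p_j|| > min_separation.

    p_i:       vehicle position (x, y, z)
    obstacles: iterable of (center, radius) pairs, center given as (x, y, z)
    neighbors: iterable of neighboring vehicle positions (x, y, z)
    """
    clear_of_obstacles = all(
        math.dist(p_i, center) > radius for center, radius in obstacles
    )
    clear_of_vehicles = all(
        math.dist(p_i, p_j) > min_separation for p_j in neighbors
    )
    return clear_of_obstacles and clear_of_vehicles


if __name__ == "__main__":
    obstacles = [((0.0, 0.0, -8.0), 3.0)]
    print(is_collision_free((5.0, 0.0, -8.0), obstacles, [(10.0, 0.0, -8.0)]))  # outside the obstacle
    print(is_collision_free((2.0, 0.0, -8.0), obstacles, [(10.0, 0.0, -8.0)]))  # inside the obstacle
```

In practice a strictly positive `min_separation` would be chosen so that the vehicle hulls, and not only their centers, stay apart.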

III. MAIN RESULTS
We first design an IRL-based estimator to capture the unknown shadowing and multipath parameters. Along with this, IRL is adopted to develop a model-free and collision-free motion planning algorithm for underwater vehicles. Finally, the theoretical analysis of our solution is presented.

A. IRL-Based Estimator for Online SNR Prediction
Initially, underwater vehicle $R_i \in V$ at location $p_i$ broadcasts an initiator message to its neighboring ordinary sensor nodes. Then, underwater vehicle $R_i$ switches into the listening mode. Any ordinary sensor node $j \in N_i$ senses the SNR of underwater vehicle $R_i$, denoted by $\mathrm{SNR}_{dB}(p_i, p_{s,j})$, where $N_i$ is the set of neighboring ordinary sensor nodes of underwater vehicle $R_i$. After that, ordinary sensor node $j \in N_i$ replies with its position and SNR measurement to underwater vehicle $R_i$. Repeating the above procedure, the messages collected by underwater vehicle $R_i \in V$ can be expressed as $\{p_{s,j}, \mathrm{SNR}_{dB}(p_i, p_{s,j})\}_{j \in N_i}$. For clarity of expression, the ordinary sensor nodes in set $N_i$ are labeled as $1_i, 2_i, \ldots, |N_i|_i$. We stack the above SNR measurements into a vector $Y_{R_i,dB}$. Noting (6), $Y_{R_i,dB}$ can be written in terms of the path-loss parameter vector $\theta_{R_i}$, the shadowing vector $\sigma_{R_i}$, and the multipath fading vector $\mu_{R_i}$. It is worth mentioning that $N_{0,dB}$ and $f$ can be acquired from prior knowledge.
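To make the role of the deterministic path-loss part concrete, the sketch below fits an offset and a slope to collected (distance, SNR) pairs by ordinary least squares. It assumes the known absorption and noise terms have already been removed, so that the measurements follow SNR ≈ K − 10 n log10(d); this is a baseline illustration of the deterministic part only, not the IRL estimator developed in this section, and the name `fit_path_loss` is hypothetical.

```python
import math


def fit_path_loss(distances, snr_db):
    """Least-squares fit of snr_db ~ K - 10*n*log10(d); returns (K, n).

    Assumption: absorption and noise terms were subtracted beforehand.
    """
    x = [-10.0 * math.log10(d) for d in distances]  # regressor for the slope n
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(snr_db) / m
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, snr_db))
    n_pl = sxy / sxx
    k_db = mean_y - n_pl * mean_x
    return k_db, n_pl


if __name__ == "__main__":
    d = [10.0, 20.0, 50.0, 100.0]
    y = [60.0 - 10.0 * 2.0 * math.log10(di) for di in d]  # noiseless synthetic data
    print(fit_path_loss(d, y))
```

With noiseless synthetic data the fit recovers the offset and slope used to generate the measurements; shadowing and multipath fading appear as the residual around this line.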
In (6), $\theta_{R_i}$ is a deterministic vector representing the offset and slope of the path loss, while $\sigma_{R_i}$ and $\mu_{R_i}$ are location-related random shadowing and multipath fading parameter vectors, respectively. We can estimate $\theta_{R_i}$ from the available measurements; however, $\sigma_{R_i}$ and $\mu_{R_i}$ cannot be estimated directly due to their randomness. For that reason, we estimate the statistical characteristics of $\sigma_{R_i}$ and $\mu_{R_i}$, rather than their real-time values. It is assumed that $\sigma_{R_i}$ is a zero-mean Gaussian random vector with an exponential spatial correlation. Similar to the assumption in [29] and [41], $\mu_{R_i}$ is captured by a lognormal distribution without spatial correlation. Accordingly, we employ the spatial correlation to predict the SNR parameters. Then, the covariance matrices of $\sigma_{R_i}$ and $\mu_{R_i}$ can be expressed in terms of $\xi^2_{R_i}$, $\phi_{R_i}$, $\rho^2_{R_i}$, and $d_{j_i, j'_i}$, where $\xi^2_{R_i}$ is the power of shadowing, $\phi_{R_i}$ is the parameter controlling the spatial correlation, $\rho^2_{R_i}$ is the multipath fading power, and $d_{j_i, j'_i}$ is the distance between ordinary sensor nodes $j_i$ and $j'_i$.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
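One common way to realize these statistical assumptions is sketched below: the shadowing covariance decays exponentially with the distance between two sensor nodes, and the uncorrelated multipath term contributes only to the diagonal. The exact functional form $\xi^2 \exp(-d/\phi)$ is an assumption made for illustration, and the function name `fading_covariance` is hypothetical.

```python
import math


def fading_covariance(positions, xi2, phi, rho2):
    """Covariance of shadowing plus multipath fading at the given node positions.

    Assumed model: shadowing entry (a, b) equals xi2 * exp(-d_ab / phi);
    multipath fading is uncorrelated and adds rho2 on the diagonal.
    """
    n = len(positions)
    cov = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            d_ab = math.dist(positions[a], positions[b])
            cov[a][b] = xi2 * math.exp(-d_ab / phi)
            if a == b:
                cov[a][b] += rho2
    return cov


if __name__ == "__main__":
    nodes = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]  # two nodes 5 m apart
    for row in fading_covariance(nodes, xi2=4.0, phi=5.0, rho2=1.0):
        print(row)
```

A larger `phi` makes distant measurements more informative about each other, which is what allows the SNR to be predicted at locations that have not been sampled.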
Since $\sigma_{R_i}$ and $\mu_{R_i}$ are independent, the covariance of $Y_{R_i,dB}$ is the sum of their covariance matrices. Hence, the estimates of $\theta_{R_i}$ and of this covariance can be acquired by maximizing the likelihood function (13). In our previous work [42], a separate design strategy was developed, where $\theta_{R_i}$ was estimated in Phase I, $\xi^2_{R_i}$, $\rho^2_{R_i}$, and $\phi_{R_i}$ were estimated in Phase II, and the iteration was conducted in Phase III. This separate design has high computational complexity, and its estimation accuracy is sensitive to the measurement noise. To overcome these deficiencies, this article jointly estimates $\theta_{R_i}$, $\xi^2_{R_i}$, $\rho^2_{R_i}$, and $\phi_{R_i}$. To this end, we differentiate (13) with respect to $\theta_{R_i}$ and the covariance, which yields (14) and (15). From (14) and (15), the optimal $\theta_{R_i}$ and covariance are acquired by solving (16). Of note, the fading vector is the sum of the multipath and shadow fading vectors; based on this and the definition in (11), its covariance is rearranged accordingly. Based on (14)-(16), we define the parameter vector $\chi_{R_i}$ together with a cost taken over the set of ordinary sensor node pairs $(j_i, j'_i)$. In addition, $\lambda_1 > 0$ and $\lambda_2 > 0$ are the tuning indexes of the shadowing and multipath cost function terms, respectively, and $Q_1$ is a positive definite matrix. Let $\hat{\chi}_{R_i}$ be the estimate of $\chi_{R_i}$, and $u \in R^4$ be the increment input vector of $\hat{\chi}_{R_i}$. Hence, the estimation procedure of $\chi_{R_i}$ is described by (21).

A model-based IRL estimator is developed to seek $\chi^*_{R_i}$, whose basic idea is to minimize the integral temporal difference error [43], [44]. The cost function is defined in (22), where $R_1$ is a positive definite matrix. From (22), the value function $V_1(\hat{\chi}_{R_i}(t))$ for the estimation of $\chi_{R_i}$ is given by (23), and hence, the optimal value of $\chi_{R_i}$ is reached by selecting $u$ such that an optimal update policy is obtained. In the following, the IRL strategy includes two steps, i.e., policy evaluation and policy improvement. In policy evaluation, $V_1(\hat{\chi}_{R_i}(t))$ is evaluated by using (23), given the current update policy. In policy improvement, the update policy is improved until the iteration procedure converges. The above steps are detailed as follows.
1) Initialization: The policy and value function are initialized as $u^{(0)}(0) = 0$ and $V^{(0)}(\hat{\chi}_{R_i}(0)) = 0$.
2) Policy Evaluation: For each iteration $s$, the value function is calculated according to (25).
3) Policy Improvement: An updated control policy $u^{(s+1)}$ is found through the rule (26).
In order to smoothly approximate the value function, a critic network is introduced in (27), where $\phi_1(\hat{\chi}_{R_i}(t))$ is the basis function for weight vector $W_1$.
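The alternation between policy evaluation and policy improvement can be illustrated on a toy problem. The sketch below is not the estimator of this article: it applies integral reinforcement learning to a hypothetical scalar linear system $\dot{x} = a x + b u$ with running cost $q x^2 + r u^2$ and value function $V(x) = p x^2$. The learner only uses the input gain `b` and measured trajectory data, never the drift `a`; all names and numerical values are assumptions chosen for illustration.

```python
def rollout(x0, k, a, b, q, r, horizon, dt=1e-4):
    """Simulate x' = a*x + b*u under u = -k*x; return x(horizon) and the accumulated cost."""
    x, cost = x0, 0.0
    for _ in range(int(round(horizon / dt))):
        u = -k * x
        cost += (q * x * x + r * u * u) * dt  # integral reinforcement
        x += (a * x + b * u) * dt             # forward-Euler plant step
    return x, cost


def irl_policy_iteration(a, b, q, r, k0, x0=1.0, horizon=0.1, iterations=8):
    """Integral RL with V(x) = p*x^2; the plant parameter a is used only inside rollout."""
    k, p = k0, 0.0
    for _ in range(iterations):
        x_end, cost = rollout(x0, k, a, b, q, r, horizon)  # measured data
        # Policy evaluation: p*(x0^2 - x_end^2) equals the cost integrated over [t, t+T].
        p = cost / (x0 * x0 - x_end * x_end)
        # Policy improvement: u = -(b*p/r)*x.
        k = b * p / r
    return p, k


if __name__ == "__main__":
    p, k = irl_policy_iteration(a=1.0, b=1.0, q=1.0, r=1.0, k0=2.0)
    print(p, k)  # p approaches the Riccati solution 1 + sqrt(2)
```

The initial gain `k0` must be admissible, i.e., stabilizing, which mirrors the admissibility requirement on the initial policy in the convergence results of this article.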
Based on (27), (25) is rearranged in terms of $W_1$, and the mismatch between its two sides defines the residual error $e_1$. The purpose of weight updating is to minimize the overall residual error, i.e., the integral of $e_1(\tau)$ over time. Hence, the recursive least square method is used to update the weight $W_1$. Let the integral $\int_t^{t+T} g_1(\hat{\chi}_{R_i}, u^{(s)})\,d\tau$ denote the value of the cost function over a period of time $T$ under control policy $u^{(s)}$. Along with this, a recursive law is adopted to update $W_1^{(s)}$, with a variance matrix $P_s$ adjusting the update speed, where $\kappa_1$ is the forgetting factor in the weight update process.
In addition, $\phi$ is the excitation function, whose role is to drive the weights to convergence. Moreover, bounds on the weight increments such as $\|W_1^{(s)} - W_1^{(s-1)}\|$ serve as the termination conditions for the above two iteration procedures, where the two thresholds are small positive constants.
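A generic recursive least squares step with a forgetting factor is sketched below. It is the textbook form of the update and is offered only as an illustration of the mechanism; the exact law used in this article is not reproduced here. In an IRL setting, a typical choice is to use the difference of basis functions over one integration window as the regressor `phi` and the integrated cost as the target `y`. The function name `rls_update` is hypothetical.

```python
def rls_update(w, P, phi, y, kappa=1.0):
    """One recursive-least-squares step with forgetting factor kappa (0 < kappa <= 1).

    w:   current weight vector (list of floats)
    P:   current symmetric variance matrix (list of lists)
    phi: regressor vector, y: scalar target
    """
    n = len(w)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = kappa + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    err = y - sum(w[i] * phi[i] for i in range(n))  # prediction residual
    w_new = [w[i] + gain[i] * err for i in range(n)]
    # P <- (P - gain * phi^T P) / kappa; phi^T P equals P_phi^T because P is symmetric.
    P_new = [[(P[i][j] - gain[i] * P_phi[j]) / kappa for j in range(n)]
             for i in range(n)]
    return w_new, P_new


if __name__ == "__main__":
    w, P = [0.0, 0.0], [[1e6, 0.0], [0.0, 1e6]]
    true_w = [2.0, -3.0]
    for phi in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]):
        y = sum(t * f for t, f in zip(true_w, phi))
        w, P = rls_update(w, P, phi, y)
    print(w)
```

A forgetting factor below one discounts old samples, which speeds up adaptation when the policy changes between iterations at the price of a noisier estimate.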
When the above iteration procedure ends, the optimal SNR parameters $\theta^*_{R_i}$, $\xi^{*2}_{R_i}$, $\phi^*_{R_i}$, and $\rho^{*2}_{R_i}$ are obtained. Accordingly, the predicted mean of the SNR between source neighbor $R_{i-1}$ and vehicle $R_i$ can be computed from these parameters.

B. Model-Free and Collision-Free Motion Planning Algorithm
The cost function $E_{i,L}$ in (33) is defined for underwater vehicle $R_i$ to maximize its channel capacity $C_i$.
Then, the obstacle avoidance function $E_{i,O}(p_i)$ for underwater vehicle $R_i$ is defined in (34), with the smooth coefficient $f(l_{i,m})$ given in (35), where $l_{i,m} = \rho_m + \rho_{i,m}$, $\rho_{i,m}$ is the minimum distance from underwater vehicle $R_i$ to obstacle $m$, $\rho_{om}$ is the radius of the $m$th obstacle's influence boundary cylinder, $c$ is the steepness of the repulsive function, and $\nu$ is the repulsive range.
Remark 2: In (35), a large $c$ causes a steep shape for obstacle avoidance, while a large $\nu$ causes a wide repulsive range. In view of this, a point $P_A = [l_{left}, \kappa_1]$ that approaches $\rho_{om}$ from the left is selected to capture the steepness, and a point $P_B = [l_{right}, \kappa_2]$ that stays away from $\rho_{om}$ on the right is selected to capture the repulsive range. Of note, $0 < \kappa_2 < \kappa_1 < 1$. Then, $c$ and $\nu$ are selected as $c = \frac{\ln(\log_{\kappa_1} \kappa_2)}{2 \ln l_{right} - \ln l_{left}}$ and $\nu = \frac{l_{left}}{\exp\left(\frac{\ln(-2 \ln \kappa_1)}{2c}\right)}$. An example of the above selection result is shown in Fig. 3(a).
Most of the existing works (e.g., [13], [45]) set $f(l_{i,m}) = 1$ if $l_{i,m} \le \nu$, and $f(l_{i,m}) = 0$ otherwise. This design makes the repulsive potential field discontinuous, which causes the system to dither. To avoid this shortcoming, a smooth coefficient $f(l_{i,m})$ is introduced to eliminate the impact of the avoidance function $E_{i,O}(p_i)$ beyond the safety distance. Clearly, $E_{i,O}(p_i) = 0$ is equivalent to $l_{i,m} = \rho_{om}$, which means that underwater vehicle $R_i$ moves on the influence boundaries of the obstacles. If $l_{i,m} > \nu$, underwater vehicle $R_i$ is regarded as having escaped the influence of the obstacles, and hence, the value of $f(l_{i,m})$ is very small. If $l_{i,m} \le \nu$, underwater vehicle $R_i$ is regarded as having entered the influence of the obstacles, and hence, the value of $f(l_{i,m})$ increases as $l_{i,m}$ decreases. The effect of $f(l_{i,m})$ on the obstacle avoidance function is depicted in Fig. 3(b).
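The difference between a hard switch and a smooth coefficient can be sketched numerically. The example below assumes a smooth coefficient of the form $f(l) = \exp(-(l/\nu)^{2c}/2)$, which is consistent with the closed form of $\nu$ in Remark 2 when the denominator of $c$ is read as $2(\ln l_{right} - \ln l_{left})$; both the functional form and that reading are assumptions, and the code solves the two conditions $f(l_{left}) = \kappa_1$ and $f(l_{right}) = \kappa_2$ directly. All function names are hypothetical.

```python
import math


def fit_smooth_coefficient(l_left, kappa1, l_right, kappa2):
    """Solve f(l_left) = kappa1 and f(l_right) = kappa2 for (c, nu).

    Assumed form: f(l) = exp(-(l / nu)**(2c) / 2), with 0 < kappa2 < kappa1 < 1.
    """
    ratio = math.log(kappa2) / math.log(kappa1)  # equals log base kappa1 of kappa2
    c = math.log(ratio) / (2.0 * (math.log(l_right) - math.log(l_left)))
    nu = l_left / math.exp(math.log(-2.0 * math.log(kappa1)) / (2.0 * c))
    return c, nu


def smooth_coefficient(l, c, nu):
    """Smooth weight: close to 1 near the obstacle, decaying continuously with distance."""
    return math.exp(-0.5 * (l / nu) ** (2.0 * c))


def hard_switch(l, nu):
    """Discontinuous alternative: 1 inside the repulsive range, 0 outside."""
    return 1.0 if l <= nu else 0.0


if __name__ == "__main__":
    c, nu = fit_smooth_coefficient(l_left=2.0, kappa1=0.9, l_right=4.0, kappa2=0.1)
    for l in (1.0, 2.0, 3.0, 4.0, 5.0):
        print(l, round(smooth_coefficient(l, c, nu), 4), hard_switch(l, nu))
```

The smooth coefficient passes through both design points and has no jump at $l = \nu$, which is the property that removes the dithering caused by the hard switch.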
Underwater vehicle $R_i$ also needs to avoid collision with its neighboring vehicles $j \in N^*_i$, where $N^*_i$ is its neighboring set. Similar to (34), the internal collision avoidance function for underwater vehicle $R_i$ is denoted by (36), where $l_{i,j} = \|p_i - p_j\|$ is the relative distance between underwater vehicle $R_i$ and neighboring vehicle $j \in N^*_i$. In addition, $\nu_1$ and $c_1$ are the repulsive range and steepness for the internal collision avoidance, respectively.
With (2), (33), (34), and (36), the total cost function for the motion planning of underwater vehicle $R_i$ is given in (37), where $\hat{p}_i$ and $\hat{l}_{i,m}$ denote the estimates of $p_i$ and $l_{i,m}$, respectively, and $\bar{R}_i$ is a positive definite matrix. Besides that, $\beta_1$, $\beta_2$, and $\beta_3$ are positive constants, whose role is to balance the communication efficiency and collision avoidance.
Based on (37), the value function for control input $\tau_i$ is defined in (38), where $\hat{X}_i$ denotes the estimated value of $X_i$. Hence, one can construct the optimal problem (39). Meanwhile, dynamic model (2) can be rearranged as (40), where $\tau_i^{(s)}$ is the updated policy in the $s$th iteration and $\tau_i$ is an admissible policy for the learning procedure.
Combining (38) with (40), the derivative of the value function in the $s$th iteration can be calculated as (41), where $\bar{R}_i = \mathrm{diag}(r_{i,1}, r_{i,2}, r_{i,3}, r_{i,4})$ is a positive definite matrix, and the gradient $\nabla V_i^{(s)}$ with respect to $X_i$ is treated in a way similar to the one in Section III-A. Hence, the desired policy $\tau_i^{(s)}$ can be obtained by solving (41).
In the following, a model-free policy iteration algorithm is employed to seek the optimal update policy $\tau^*_i$.
1) Initialization: The policy and value function are initialized as $\tau_i^{(0)}(0) = 0$ and $V_i^{(0)}(\hat{X}_i(0)) = 0$, respectively.
2) Policy Evaluation: For each iteration $s$, calculate the value function (42), which is obtained by integrating (41).
3) Policy Improvement: Find an updated control policy $\tau_i^{(s+1)}$ through the rule (43).
In the above process, the analytical forms of $V_i^{(s)}$ and $\tau_i^{(s)}$ are unknown in advance. In order to smoothly approximate the value function $V_i^{(s)}$ and the desired policy $\tau_i^{(s)}$, the critic network (45) and actor network (46) are introduced, where $\phi^i_2$ is the basis function vector for the weight vector $\hat{W}_i$. In addition, $\tau_i^{(s+1)}$ is approximated element-wise, where $\phi^i_{3,\iota}$ is the basis function vector for the weight vector $\hat{W}_{i,\iota}$. Of note, $\hat{W}_{i,\iota}$ is the $\iota$th policy weight for $\iota \in \{1, \ldots, 4\}$.
Noting (45) and (46), one can deduce (47) from (44), in which the difference $\tau_{i,\iota} - \tau^{(s)}_{i,\iota}$ between the admissible policy and the updated policy appears element-wise. Specifically, $\tau_{i,\iota}$ denotes the $\iota$th element of $\tau_i$, and $\tau^{(s)}_{i,\iota}$ denotes the $\iota$th element of $\tau^{(s)}_i$. Let the integral $\int_t^{t+T} \bar{g}_i(p_i, \tau_i^{(s)})\,d\tau$ denote the value of the cost function over a period of time $T$ under the given control policy $\tau_i^{(s)}$. From (47), the weight vector is updated by a recursive iteration procedure, with a variance matrix $P_i$ adjusting the update speed, where $\kappa$ is the forgetting factor. Based on the above iteration procedure, the optimal control input $\tau^*_i$ can be obtained, as depicted by Algorithm 1, where $T_{max,1}$ and $T_{max,2}$ represent the maximum times for SNR prediction and motion planning, respectively.

Algorithm 1 IRL-Based Motion Planning Controller
Input: $\chi_{R_i}(0)$, $X_i(0)$, $\tau_i(0)$
Output: The optimal control policy $\tau^*_i$

Remark 3: In Section III-A, a model-based IRL estimator is adopted, since the kinematic model (21) that is employed can be accurately known by underwater vehicles. By contrast, the dynamic model (2) is adopted in Section III-B. Due to the harsh ocean environment, it is difficult to acquire the accurate dynamics model of underwater vehicles, e.g., $A_i$, $B_i$, and $G_i$. In view of this, a model-free IRL algorithm is developed in Section III-B, even though it is more complicated to implement.

C. Performance Analysis
For SNR prediction, the optimal policy is given in (26), whose convergence is presented as follows.
Theorem 1: Given an initial admissible policy $u^{(0)}(0)$, the policy iteration (26) can make $u^{(s+1)}$ converge to the optimal policy $u^*$, such that $\chi^*_{R_i}$ can also be obtained.
Proof: The proof is given in Appendix A.
For motion planning control, the model-free IRL is adopted to find the optimal control strategy $\tau^*_i$, as provided by (39). With regard to this, the following convergence analysis is presented for the optimal control strategy $\tau^*_i$.
Theorem 2: Given an initial admissible policy $\tau_i^{(0)}(0)$, the policy iterations (42) and (43) can make $\tau_i^{(s+1)}$ converge to the optimal policy $\tau^*_i$.
Proof: The proof is given in Appendix B.
The underwater acoustic communication in this article is divided into the following two parts: 1) data collection of SNR measurements (see Section III-A) and 2) collision avoidance between different vehicles (see Section III-B). Then, the communication complexity is studied by counting the transmitted and received scalars for each node, similar to [34] and [46].
Step 1 (Complexity in Part 1): Recall that underwater vehicle R i ∈ V broadcasts an initiator message to its neighboring ordinary sensor nodes, through which underwater vehicle R i ∈ V receives the replies from ordinary sensor node j ∈ N i , i.e., {p s, j , SNR dB (p i , p s, j )} j ∈N i .Based on this, underwater vehicle R i ∈ V transmits 1 scalars and receives |N i | i scalars during the data collection procedure.Along with this, any ordinary sensor node j ∈ N i receives the initiator message from underwater vehicle R i , and then it replies its position and SNR measurement to underwater vehicle R i .Correspondingly, ordinary sensor node j ∈ N i transmits four scalars and receives one scalars during the data collection procedure.
Step 2 (Complexity in Part 2): During the internal collision avoidance procedure, underwater vehicle $R_i \in V$ transmits its position information to its neighboring vehicles, and meanwhile it receives the position information from neighboring vehicles. Thus, the transmitted and received scalars for underwater vehicle $R_i \in V$ are four and four, respectively. In addition, the ordinary sensor nodes do not perform any communication task in this part, so the transmitted and received scalars for ordinary sensor node $j \in N_i$ are both zero.
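The scalar counts in Steps 1 and 2 can be tabulated with a short sketch. It assumes that each sensor reply carries four scalars (a three-dimensional position plus one SNR value), so that a vehicle receives four scalars per neighboring sensor node, and it takes the four transmitted and four received scalars in Part 2 directly from the text. The function names and the dictionary layout are hypothetical.

```python
def data_collection_scalars(num_neighbors, scalars_per_reply=4):
    """Scalars exchanged in Part 1 (data collection of SNR measurements)."""
    # The vehicle sends one initiator scalar and hears one reply per neighbor
    vehicle = {"tx": 1, "rx": scalars_per_reply * num_neighbors}
    # Each sensor node hears the initiator and replies with position + SNR
    sensor = {"tx": scalars_per_reply, "rx": 1}
    return vehicle, sensor


def collision_avoidance_scalars(scalars_per_message=4):
    """Scalars exchanged in Part 2 (collision avoidance between vehicles)."""
    vehicle = {"tx": scalars_per_message, "rx": scalars_per_message}
    # Ordinary sensor nodes stay silent in this part
    sensor = {"tx": 0, "rx": 0}
    return vehicle, sensor


# Example with a hypothetical neighborhood of five sensor nodes
vehicle_part1, sensor_part1 = data_collection_scalars(num_neighbors=5)
vehicle_part2, sensor_part2 = collision_avoidance_scalars()
```

Under these assumptions the per-node load of a sensor node is constant, while the load of a vehicle in Part 1 grows linearly with the number of neighboring sensor nodes.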
The collision avoidance analysis is presented as follows.
Corollary 1: Given cost function (37) and value function (38), underwater vehicle R i never collides with obstacles or neighbor R j , if the following condition is satisfied, i.e., ḡi pi , τ (0) where

A. Simulation Results
In this section, simulation results are presented to verify the effectiveness of the proposed approach. Specifically, the positions of the source sensor node and the destination sensor node are set as $[-30, -30, -8]^T$ and $[30, 30, -8]^T$, respectively. The initial position and orientation vectors of underwater vehicles $R_1$ and $R_2$ are set as Besides that, $Q_1 = \mathrm{diag}([5, 5, 5, 5])$, $\lambda_1 = 0.5$, $\lambda_2 = 0.01$, and $T = 0.1$. Accordingly, the SNR estimator is adopted, and the estimated path loss, shadowing, and multipath fading parameters are shown in Fig. 4(a)-(c), respectively. On the basis of this, the link capacity in the underwater area is shown in Fig. 4(d). The estimated parameters converge to the true values, which verifies the effectiveness of the proposed SNR estimator.
In [29], the least squares estimator is adopted to estimate the SNR parameters, but it can trap the parameters in a local optimum. To show this phenomenon, we assume that the SNR measurement process is polluted by external noise, where the external noise is set as $5000\cos^2(0.25K_{\mathrm{dB},1} + 5.3) + 2000\cos^2(0.5n_{\mathrm{PL},1} + 0.3)$. Taking Part 3 as an example, the optimization problem can be updated as $\chi_{R_i}^* = \arg\min\{\text{Part 3} + \text{external noise}\}$. With the IRL-based estimator in this article, the predicted link capacity by using 3500 different sampling points is shown in Fig. 4(e). Meanwhile, the cost comparison between the least squares estimator (e.g., [29]) and the IRL-based estimator in this article is provided in Fig. 4(f). We find that the parameter estimates of the IRL-based estimator proposed in this article are closer to the true values than those of the least squares estimator.
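To see why the added noise term can trap a local search, the sketch below evaluates it on a grid and counts its strict local minima along the $K_{\mathrm{dB},1}$ axis. The squared-cosine reading of the noise expression, the search interval [0, 50], the fixed value of $n_{\mathrm{PL},1}$, and the function names are assumptions made for illustration only.

```python
import math


def external_noise(k_db, n_pl):
    """External noise term added to the estimation cost (as read from the text)."""
    return (5000.0 * math.cos(0.25 * k_db + 5.3) ** 2
            + 2000.0 * math.cos(0.5 * n_pl + 0.3) ** 2)


def count_local_minima(f, lo, hi, steps=2000):
    """Count strict interior local minima of f on a uniform grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [f(x) for x in xs]
    return sum(1 for i in range(1, steps)
               if ys[i] < ys[i - 1] and ys[i] < ys[i + 1])


# Slice along K_dB,1 with n_PL,1 held at a hypothetical value of 0
num_minima = count_local_minima(lambda k: external_noise(k, 0.0), 0.0, 50.0)
```

Because the noise term is periodic, the perturbed cost has several basins, and an estimator that only follows the local gradient may settle in whichever basin contains its initial guess.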
2) Motion Planning of a Single Underwater Vehicle: With the predicted SNR information, we consider a simple motion planning scenario, i.e., a single underwater vehicle is deployed to relay the data from the source sensor node to the destination sensor node. Along with this, the value and policy basis functions of underwater vehicle $R_1$ can be expressed as equal and reach stability after $t = 50$ s. Once $C_1$ is maximized, underwater vehicle $R_1$ can hover at (0, 0, −8). Overall, the link capacity is increased by 258.48% over the initial value, i.e., from 17.3079 to 62.0461.
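The reported improvement follows from the usual relative-increase formula, 100 × (final − initial) / initial. A short check with the two capacity values quoted above is sketched here; the function name is hypothetical.

```python
def percent_increase(initial, final):
    """Relative increase of final over initial, in percent."""
    return 100.0 * (final - initial) / initial


# Link capacity before and after motion planning, as quoted in the text
single_vehicle_gain = percent_increase(17.3079, 62.0461)
```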
3) Motion Planning of Multiple Underwater Vehicles: Next, we consider a general motion planning scenario, i.e., two underwater vehicles are deployed to relay the data from the source sensor node to the destination sensor node. To this end, the value and policy basis functions of the underwater vehicles are defined as in Section IV-A2. In addition, $\beta_1 = 100$, $\beta_2 = 1$, and R1 = R2 = diag([0.1, 0.08, 1, 1.2]). Accordingly, the trajectories of underwater vehicles $R_1$ and $R_2$ are shown in Fig. 6(a), and their positions and orientations are presented in Fig. 6(b) and (c). Correspondingly, the optimal policies of underwater vehicles $R_1$ and $R_2$ are shown in Fig. 6(d), and the learned weights are presented in Fig. 6(e)-(h).
Clearly, collision avoidance is also guaranteed, while the learned weights converge to the optimal values. Based on this, the link capacities of the two underwater vehicles and the segmented link capacities $c_{0,1}$, $c_{1,2}$, and $c_{2,3}$ are shown in Fig. 6(i). From Fig. 6(i), we know that the link capacity of the network is gradually increased by the motion planning process, where $c_{0,1}$, $c_{1,2}$, and $c_{2,3}$ become equal after $t = 38$ s and reach stability after $t = 50$ s. Once $C_1$ and $C_2$ are maximized, underwater vehicles $R_1$ and $R_2$ hover at (10, 10, −8) and (−10, −10, −8), respectively. Overall, the link capacity is increased by 426.99%, i.e., from 17.3079 to 91.2101. These results demonstrate the significance and necessity of our communication-efficient motion planning solution.
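The reported hover positions coincide with the points that split the source-destination segment into equal parts. The sketch below computes these points under an idealized assumption that is not taken from the article: the capacity of each segment depends only on its length and decreases with it, so equal segment lengths give equal segment capacities. The real channel also includes shadowing and multipath fading, so this is only a geometric sanity check, and the function name is hypothetical.

```python
def relay_positions(source, destination, num_relays):
    """Points that split the source-destination segment into equal parts."""
    positions = []
    for k in range(1, num_relays + 1):
        frac = k / (num_relays + 1)
        # Linear interpolation between the two end nodes, coordinate-wise
        positions.append(tuple(s + frac * (d - s)
                               for s, d in zip(source, destination)))
    return positions


source = (-30.0, -30.0, -8.0)
destination = (30.0, 30.0, -8.0)
one_relay = relay_positions(source, destination, 1)
two_relays = relay_positions(source, destination, 2)
```

With one relay the sketch returns the midpoint (0, 0, −8), and with two relays it returns the two trisection points (−10, −10, −8) and (10, 10, −8), which match the hover positions reported for the single-vehicle and two-vehicle simulations.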
4) Comparison With the Other Motion Planning Solutions: Note that a Lyapunov guidance vector field (LGVF)-based motion planning algorithm was provided in [26], where the vehicle moves forward along a spiral. Clearly, the spiral motion increases the path length of the vehicle, which may reduce its lifetime. By ignoring the underwater obstacle, the LGVF-based motion planning algorithm is adopted here for comparison, and the trajectories of underwater vehicle $R_1$ under the two algorithms are shown in Fig. 7(a). The positions on surge, sway, and depth are given in Fig. 7(b). The comparison of the path length required by the two algorithms is shown in Fig. 7(c). Meanwhile, the link capacities of the two algorithms are shown in Fig. 7(d). From Fig. 7(a)-(d), we can see that both algorithms improve the link capacity; however, the path length required in this article is shorter than the one in [26], since the spiral motion is not required in this article.
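The effect of a spiral motion on path length can be illustrated by summing the distances between consecutive waypoints of a sampled trajectory. In the sketch below, the helix is a hypothetical stand-in for a spiral trajectory and is not the LGVF trajectory of [26]; the travel distance, radius, and number of turns are arbitrary illustrative values.

```python
import math


def path_length(points):
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def straight_path(length, n=200):
    """Waypoints along a straight line of the given length on the x axis."""
    return [(length * k / n, 0.0, 0.0) for k in range(n + 1)]


def spiral_path(length, radius=2.0, turns=5, n=2000):
    """Helix wound around the same line, starting and ending on it."""
    points = []
    for k in range(n + 1):
        t = k / n
        angle = 2.0 * math.pi * turns * t
        points.append((length * t,
                       radius * math.sin(angle),
                       radius * (1.0 - math.cos(angle))))
    return points


straight_len = path_length(straight_path(30.0))
spiral_len = path_length(spiral_path(30.0))
```

For a helix the length is close to sqrt(L^2 + (2π r n)^2) for travel distance L, radius r, and n turns, so the extra distance grows with both the radius and the number of turns.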
Another important characteristic of our solution is its independence of the system model, i.e., it is not necessary to know the nominal model parameters of the underwater vehicle in advance. This characteristic is of great practical significance to ocean monitoring because it is difficult to obtain the nominal parameters in a harsh underwater environment. For comparison, the model-based learning controller (e.g., [33]) is adopted by the underwater vehicle. Then, the following two cases are considered: 1) the model matrix $M_1$ can be accurately obtained and 2) the model matrix $M_1$ cannot be accurately obtained due to environment noise and model uncertainty. By employing the model-based learning approach, the motion trajectories of underwater vehicle $R_1$ in Case 1 and Case 2 are shown in Fig. 7(e) and (f), respectively. Correspondingly, the link capacities in Case 1 and Case 2 are presented in Fig. 7(g) and (h), respectively. Clearly, the underwater vehicle can achieve the motion planning task when the model information is accurate, and meanwhile, the link capacity is significantly improved. However, when the model information is inaccurate, the motion planning task of the underwater vehicle cannot be well achieved, which can result

B. Experimental Results
This section presents the experimental results. As depicted in Fig. 5(a), the desired relay position for a single underwater vehicle is at the midpoint between the source sensor node and the destination sensor node. Accordingly, three nodes, namely an underwater vehicle, a source sensor node, and a destination sensor node, are considered, as shown in Fig. 8. The working frequency band of the wireless communication system is 21-27 kHz, and it adopts the orthogonal frequency division multiplexing (OFDM) mode. In order to overcome the multipath and Doppler effects, the cyclic prefix is extended and the frequency interval is enlarged. As a result, our communication system has a stable communication rate of 300 b/s and a maximum communication rate of 2000 b/s. Different from the terrestrial environment, the SNR $S_{i-1,i}$ is affected by path loss, shadow fading, multipath fading, and various external underwater disturbances, especially in shallow water near the shore. In the experiment, as the distance changes, the value of $S_{i-1,i}$ is about 0.15, so the communication rate is 300 b/s. For the underwater vehicle, the BlueROV from Blue Robotics is adopted, which features six thrusters, a flight controller, a wireless communication unit, and a Raspberry Pi.
In the following, the underwater vehicle patrols at different positions to relay the data from the source sensor node to the destination sensor node, and its motion trajectory is shown in Fig. 9(a). For clarity, the relative distance between the underwater vehicle and the source (or destination) sensor node is provided in Fig. 9(b). Correspondingly, the end-to-end probability of successful data transmission (EPSDT) is shown in Fig. 9(c), which is defined as the probability of successful data transmission of the worst link. We find that the successful data transmission for each communication link is increased with the increase of relative distance. Meanwhile, the successful data transmission reaches the same value (i.e., 90%) when the relative distances for the two links are the same (i.e., 14.5803 m). Based on the definition of EPSDT, the EPSDT reaches its maximum when the underwater vehicle is at the midpoint between the source sensor node and the destination sensor node. These results are consistent with the simulation results. Similarly, the results for multiple underwater vehicles can also be obtained, and this part is omitted here due to page limitation.
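Since the EPSDT is defined as the success probability of the worst link, it can be computed as a minimum over the per-link values. A minimal sketch follows; the function name is hypothetical, the balanced value of 0.9 is the one reported above, and the unbalanced values are made up for illustration.

```python
def epsdt(link_success_probabilities):
    """End-to-end success probability, taken as that of the worst link."""
    return min(link_success_probabilities)


# Both links at the reported 90% when the relay sits at the midpoint
balanced = epsdt([0.9, 0.9])
# Hypothetical unbalanced case: the weaker link limits the end-to-end value
unbalanced = epsdt([0.95, 0.7])
```

Because the minimum is limited by the weaker link, improving one link at the expense of the other does not raise the EPSDT, which is why a balanced relay position is preferred.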
V. CONCLUSION
This article gives a communication-efficient and collision-free motion planning solution for underwater vehicles. By adopting the model-based IRL, an online SNR estimator is designed to capture the unknown shadowing and multipath parameters, such that the SNR at unvisited positions can be predicted by underwater vehicles. With the predicted channel information, a model-free IRL motion planning algorithm is designed to drive underwater vehicles to the desired position points while maximizing the communication capacity and avoiding collisions. Finally, simulation and experimental results are both presented to verify the effectiveness of the proposed approach.
In the future, we will employ the distributed learning approach to resolve the codesign problem of underwater detection, communication, and control. Verifying the results in the ocean environment is also part of our future work.

APPENDIX A PROOF OF THEOREM 1
Given an initial admissible policy $u^{(0)}(0)$ with the system trajectory of $\dot{\chi}_{R_i} = u^{(s+1)}$, the task of this proof is to prove To this end, we set $u^{(s)}$ as an admissible policy. Based on this, we take the derivative of $V_1^{(s)}(\chi_{R_i})$ along $\dot{\chi}_{R_i} = u^{(s+1)}$, through which one has According to the definition of $V_1^*(\chi_{R_i}, 0)$, the Hamilton-Jacobi-Bellman (HJB) equation becomes Combining (51) with (52), we can get $-g_1(\chi_{R_i}, u^{(s)})$. (53) With (26), we have ((∂ V (s) T R 1 . Based on this, one can rearrange (53) as 1 1 ( χ R i , u (s) ), which means the optimal value of $\chi_{R_i}$ (i.e., $\chi_{R_i}^*$) can also be obtained. That completes the proof.

APPENDIX B PROOF OF THEOREM 2
Given an initial admissible policy $\tau_i^{(0)}(0)$ with system trajectory of $\dot{X}_i = A_i X_i + B_i \tau_i^{(s)} + G_i$, the task of this proof is to prove that the solution to the model-free Bellman equation is the same as that to the model-based Bellman equation.
First, the form of the model-based Bellman equation is established. Differentiating the value function (38), we have the following Bellman equation, i.e., $H(V_i^{(s)}, \tau_i^{(s)}) = \nabla V^{(s)T}$ By the stationarity condition $\partial H(V_i^{(s)}, \tau_i^{(s)})/\partial \tau_i^{(s)} = 0$, we can have the following optimal control, i.e., From (56) and (57), one obtains On the other hand, the form of the model-free Bellman equation can also be established. From (41), we have Substituting (57) into (59), we can have the same Bellman equation as (56). Therefore, the optimal solution $\tau_i^*$ to the model-free Bellman equation is the same as that to the model-based Bellman equation.
Note that the convergence of the solution to the model-based Bellman equation [i.e., (57)] has been proven in Theorem 1. Based on this, one knows that the policy iterations (42) and (43) can make $\tau_i^{(s+1)}$ converge to the optimal policy $\tau_i^*$. That completes the proof.

APPENDIX C PROOF OF COROLLARY 1
The proof includes two parts. First, we prove that $V_i^{(s)}(X_i(t), \tau_i^{(s)})$ is a decreasing function with policy τ (s+1 Taking the derivative of $V_i(X_i(t), \tau_i^{(s+1)})$ along $\dot{X}_i = A_i X_i + B_i \tau_i^{(s+1)} + G_i$, we have According to (40), we have Combining $\tau^{(s+1)}$ with (60) and (61), one can further have Similarly, the equivalence between the solution of IRL and the optimal policy solution $\tau_i^{(s+1)}$ can be proved by Theorem 2. Next, we prove that the collision never occurs between underwater vehicle $R_i$ and obstacles or neighbor $R_j$ if (50) is satisfied. Assume that at $t = t^*$, underwater vehicle $R_i$ collides with obstacles or neighbor $R_j$, then cost function becomes ḡi pi , τ

Manuscript received 1 August 2022; revised 11 November 2022; accepted 28 November 2022. Date of publication 13 December 2022; date of current version 4 June 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62222314, Grant 61973263, Grant 61873345, and Grant 62033011; in part by the Youth Talent Program of Hebei under Grant BJ2020031; in part by the Distinguished Young Foundation of Hebei Province under Grant F2022203001; in part by the Central Guidance Local Foundation of Hebei Province under Grant 226Z3201G; and in part by the Three-Three-Three Foundation of Hebei Province under Grant C20221019. (Corresponding author: Jing Yan.)

Fig. 1. Communication links from source sensor node to destination sensor node through the motion relays of underwater vehicles.

Fig. 2. (a) Only the path loss is considered. (b) Path loss and shadowing are considered. (c) Path loss and multipath fading are considered.

Fig. 5. Simulation results for motion planning of a single vehicle. (a) Trajectory of vehicle $R_1$. (b) Position and orientation. (c) Optimal control policy. (d) Learned weight vector W1. (e) Learned weight vector W1. (f) Link capacity.

Fig. 6. Simulation results for the motion planning of multiple underwater vehicles. (a) Trajectories of two underwater vehicles. (b) Position and orientation of vehicle $R_1$. (c) Position and orientation of vehicle $R_2$. (d) Optimal control policies. (e) Learned weight vector W1. (f) Learned weight vector W1. (g) Learned weight vector W2. (h) Learned weight vector W2. (i) Link capacity for $E = \{R_0, R_1, R_2, R_3\}$.

Fig. 7. Comparison of the IRL-based motion planning solution with the other existing solutions. (a) Comparison with LGVF [26]. (b) Position of $R_1$. (c) Trajectory length of two algorithms. (d) Link capacity. (e) Motion trajectory in Case 1. (f) Motion trajectory in Case 2. (g) Link capacity in Case 1. (h) Link capacity in Case 2.

Fig. 8. Experiment deployment, where an underwater vehicle, a source sensor node, and a destination sensor node are included.

Fig. 9. Experimental results for deployment of underwater vehicle. (a) Motion trajectory of underwater vehicle. (b) Relative distances with source and destination sensor nodes. (c) Probability of successful data transmission for each link.