Integrated Localization and Tracking for AUV With Model Uncertainties via Scalable Sampling-Based Reinforcement Learning Approach

This article studies the joint localization and tracking issue for the autonomous underwater vehicle (AUV), under the constraints of asynchronous time clocks in cyberchannels and model uncertainty in physical channels. More specifically, we develop a reinforcement learning (RL)-based asynchronous localization algorithm to localize the position of the AUV, where the time clock of the AUV is not required to be well synchronized with the real time. Based on the estimated position, a scalable sampling strategy called the multivariate probabilistic collocation method with orthogonal fractional factorial design (M-PCM-OFFD) is employed to evaluate the time-varying uncertain model parameters of the AUV. After that, an RL-based tracking controller is designed to drive the AUV to the desired target point. In addition, performance analyses of the integrated solution are presented. Of note, the advantages of our solution are highlighted as follows: 1) the RL-based localization algorithm can avoid the local optima of traditional least-squares methods; 2) the M-PCM-OFFD-based sampling strategy can address the model uncertainty and reduce the computational cost; and 3) the integrated design of localization and tracking can reduce the communication energy consumption. Finally, simulations and experiments demonstrate that the proposed localization algorithm can effectively eliminate the impact of asynchronous clocks, and more importantly, the integration of M-PCM-OFFD in the RL-based tracking controller can find accurate optimization solutions with limited computational cost.

Autonomous underwater vehicles (AUVs) have attracted considerable attention from academia and engineering (see [1]-[3] and the references therein). For some complicated and dynamic missions, an AUV working alone is not sufficient to achieve rapid response decisions or operations. In view of this, the idea of human-on-the-loop (HOTL) has been put forward [4], in which autonomous unmanned surface vehicles (USVs) and a human operator cooperate to support the decision-making process. Compared with a stand-alone AUV system, the collaborative nature of the HOTL system increases the sources of information in the cyberchannel, which allows the AUV to concentrate on high-level decision making. Inspired by this, an illustration of the HOTL system is depicted in Fig. 1.
For the HOTL system, human operators in the control center determine the expected target point and then send this message to the AUV through the relay of USVs. The most urgent and vital issue for the AUV is to design appropriate tracking controllers, such that it can reach the target point. Although many tracking controllers have been developed for terrestrial vehicles in the past several years [5], [6], they depend on the assumption that the vehicle location can be accurately accessed by the global positioning system (GPS). As we know, GPS is not available for an AUV because radio waves are strongly attenuated in water. As a result, the tracking controllers developed for terrestrial vehicles cannot be generalized to AUVs. In this case, how to develop a localization scheme for the tracking control of an AUV becomes a primary issue to be solved.
In view of the self-localization issue, some researchers employ the inertial measurement unit (IMU) [7] and Doppler velocity log (DVL) [8], [9] to calculate the position of an AUV. It is noted that IMU and DVL suffer from error accumulation due to the lack of a necessary feedback mechanism. To avoid this error accumulation, many time-based localization schemes have been developed. For instance, a consensus estimation-based localization scheme was presented in [10]. In [11], the projection technique was applied to the localization of underwater sensor networks. Misra et al. [12] considered scenarios of sparse network partitioning, through which a Stackelberg game-based localization algorithm was provided. The above localization algorithms depend on the assumption of synchronized clocks, that is, the clocks of the sending node and receiving node need to be consistent. However, GPS is not available underwater, and more importantly, the propagation speed of underwater acoustic signals (about 1500 m/s) is approximately five orders of magnitude slower than that of radio-frequency signals (about 3 × 10^8 m/s), leading to much larger propagation delays. These unique characteristics make it hard to guarantee accurate clock synchronization, i.e., time clocks in the water exhibit an asynchronous characteristic. For this case, some asynchronous underwater localization algorithms have been presented. To be specific, an integrated synchronization and localization solution was developed in [13], while a mobility prediction-based asynchronous localization algorithm was given in [14]. In [15], an integrated localization and tracking framework was constructed, where the least-squares method was employed to estimate the position of the AUV. Also of relevance, two unified frameworks of localization and synchronization were provided in [16] and [17].
It should be noted that the localization schemes in [13]-[17] require first-order linearization to determine the Jacobian matrix, through which least-squares estimators are developed to seek the location of the AUV. However, the computation of the Jacobian matrix can introduce large linearization errors [18]; moreover, least-squares estimators are prone to becoming trapped in local optima. With this issue in mind, we notice that reinforcement learning (RL) [19] offers a feasible solution, since it seeks a globally optimal solution via online learning without requiring the Jacobian matrix. To the best of our knowledge, how to develop an RL-based asynchronous localization algorithm that integrates the tracking control of an AUV has not been well studied.
Under the assumption that the AUV location can be ubiquitously accessed, several tracking controllers have been developed to drive the AUV to the target point. For instance, Li et al. [20] and Shen et al. [21] provided receding horizon-based tracking controllers for AUVs. In [22], the backstepping method for cascaded systems was adopted to develop a tracking controller. In [23], the method of Lagrangian particles was incorporated into the tracking control of AUVs, and a theoretical model for error growth was established. The aforementioned tracking controllers are well designed; however, they ignore the influences of model uncertainties and external disturbances on the AUV. In practice, an AUV usually suffers from model uncertainties and external disturbances due to the rigorous ocean environment. For such a case, Cui et al. [24] employed the extended-state observer to address the uncertainties and disturbances, through which a sliding mode controller was designed to guarantee the tracking performance of the AUV. More recently, RL has attracted the interest of researchers [25], in which the uncertainties and disturbances are incorporated into the optimal control policy by utilizing the rewards from environment information. For example, an adaptive nonlinear tracking controller employing the actor-critic RL strategy was developed in [26]. Also of relevance, a tracking scheme for the depth control of AUVs was proposed in [27]. In [28], the model uncertainty of the AUV was incorporated into an RL-based tracking controller, where two neural networks were given to approximate the control policy and critic function. In addition, an adaptive tracking controller for AUVs subject to model uncertainty was developed in [29], where the uncertainties and disturbances were well compensated by the action neural network. Following this, a sideslip-compensated tracking solution was developed in [30] to reduce the influence of external disturbances.
It is worth mentioning that the model uncertainties and disturbances in the above works are captured by analytical solutions, e.g., extended-state observers and neural networks. However, due to the scale and complexity of large-scale infrastructure systems, analytical solutions to system dynamics are generally unavailable [31]. Alternatively, simulation-based evaluation using complex computerized dynamics may be a more appropriate choice for handling uncertainties and disturbances. Inspired by this, Monte Carlo (MC) methods have been widely employed to evaluate model uncertainties and disturbances, e.g., the risk evaluation of groundwater contamination [32] and remote monitoring with directional antennas [33]. Although the MC method is a feasible and popular solution to evaluate the uncertainties and disturbances, it cannot satisfy real-time requirements due to the large number of simulations required. In order to reduce the number of simulations, our previous work [15] employed the multivariate probabilistic collocation method (M-PCM) to evaluate the time-varying uncertain model parameters, through which an RL-based tracking controller was presented for the AUV. Compared with the MC method, M-PCM can reduce the number of simulations significantly [34]; however, its computational costs grow exponentially with the number of uncertain parameters [35]. On the other hand, the AUV dynamics in [15] were reduced to a linear second-order differential equation, which cannot reflect practical scenarios, because AUV dynamics are usually nonlinear with multiple degrees of freedom [36]. With regard to the aforementioned issues, how to develop a scalable sampling-based RL approach that can jointly solve the ubiquitous localization and tracking control of an AUV remains largely unexplored.
In this article, an integrated localization and tracking solution with consideration of model uncertainty and asynchronous clocks is developed for an AUV. We first formulate the AUV localization as an RL problem, through which an RL-based asynchronous localization algorithm is given to localize the position of the AUV. Based on the estimated location, an RL-based tracking controller that involves the localization process is developed for the AUV, where the uncertain parameters are evaluated by the multivariate probabilistic collocation method with orthogonal fractional factorial design (M-PCM-OFFD). The main contributions of this article lie in three aspects.

1) RL-Based Localization Algorithm That Integrates Tracking Control: We design an RL-based asynchronous localization algorithm to estimate the location of the AUV. Compared with the least-squares estimators [13]-[17], the RL-based asynchronous localization algorithm in this article can effectively avoid local optima. Meanwhile, the integration of localization and tracking control can reduce communication energy consumption, because the data generated by the tracking control can be applied to guide the localization of the AUV.

2) RL-Based Tracking Controller That Involves M-PCM-OFFD: We design an RL and M-PCM-OFFD-based tracking controller to drive the AUV to the target point. Different from the RL-based tracking controllers in [28] and [29], simulation-based evaluation is conducted in this article to address the uncertainties and disturbances. Meanwhile, the M-PCM-OFFD in this article enhances the scalability of uncertainty evaluation as compared with the M-PCM-based solution in [15].

3) Experimental Test Through an Embedded Communication and Control System:
It is worth mentioning that most existing RL-based tracking controllers are verified only through simulation results. From a practical point of view, it is necessary to verify the theoretical results by a real experiment. Inspired by this, we design and implement an embedded system in a water pool to verify our theoretical results.

II. SYSTEM MODELING AND PROBLEM FORMULATION
In order to realize joint localization and tracking, a network architecture including two types of nodes is provided.

1) USVs: USVs are equipped with GPS modules for self-localization, an acoustic system for AUV localization, and WiFi for communication with the control center. The role of the USVs is to relay control commands from the control center and to provide localization service for the AUV.

2) AUV: The AUV is an untethered vehicle whose role is to track the target point. It communicates directly with the USVs, and its time clock is not synchronized with the actual time.

Note that the depth of the AUV can be accurately obtained by a depth unit, so only the horizontal location needs to be estimated. In the body-fixed reference frame, the velocity vector of the AUV is denoted as $v = [u, v, r]^T$, where $u$ and $v$ are the linear velocities on surge and sway, respectively, and $r$ denotes the angular velocity on yaw. In the inertial reference frame, the position and heading vector of the AUV is represented as $\eta = [x, y, \psi]^T$, where $x$ and $y$ are the positions on surge and sway, respectively, and $\psi$ denotes the yaw angle. We study tracking control on surge, sway, and yaw, because the thruster layout does not enable active control on pitch and roll [21]. Referring to [21] and [29], the motion model of the AUV is

$$\dot{\eta} = J(\eta)v, \qquad M\dot{v} + C(v)v + D(v)v + G(\eta) = \tau + A \tag{1}$$

where $M = \mathrm{diag}(M_1, M_2, M_3)$, and $M_1$, $M_2$, and $M_3$ represent the mass terms on surge, sway, and yaw, respectively. $C(v)$ is the Coriolis-centripetal matrix, i.e.,

$$C(v) = \begin{bmatrix} 0 & 0 & -M_2 v \\ 0 & 0 & M_1 u \\ M_2 v & -M_1 u & 0 \end{bmatrix} \tag{2}$$

and $D(v) = \mathrm{diag}(k_u + k_{u|u|}|u|,\; k_v + k_{v|v|}|v|,\; k_r + k_{r|r|}|r|)$ denotes the damping matrix. $k_u$, $k_v$, and $k_r$ are the linear damping scales, and $k_{u|u|}$, $k_{v|v|}$, and $k_{r|r|}$ denote the quadratic damping scales. $G(\eta)$ is the restoring force matrix. $\tau = [\tau_x, \tau_y, \tau_\psi]^T$ denotes the control torque vector, where $\tau_x$, $\tau_y$, and $\tau_\psi$ are the torques on surge, sway, and yaw, respectively. $A = [a_1, a_2, a_3]^T$ denotes the uncertainty and disturbance parameter vector.
Particularly, $a_1$, $a_2$, and $a_3$ follow the independent probability density functions $f_{A_1}(a_1)$, $f_{A_2}(a_2)$, and $f_{A_3}(a_3)$, respectively. $J(\eta)$ is the rotation matrix, i.e.,

$$J(\eta) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{3}$$

The state vector of the AUV can be integrated as $\zeta = [\eta^T, v^T]^T$, and let $\delta \in \mathbb{R}^+$ denote the sampling interval. By using the first-order Taylor expansion, model (1) is rearranged as

$$\zeta(k+1) = \zeta(k) + \delta \begin{bmatrix} J(\eta(k))v(k) \\ M^{-1}\big(\tau(k) + A(k) - C(v(k))v(k) - D(v(k))v(k) - G(\eta(k))\big) \end{bmatrix}. \tag{4}$$

Assumption 1: Similar to [24] and [37], we assume that the nominal matrices $M$, $C$, $D$, and $G$ are available for controller design. This implies that these nominal values can be obtained by computational fluid dynamics computation or experimental analysis. The statistical property of $A$ is known for the controller design, which can be acquired from prior knowledge.
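To make the discretized model (4) concrete, the following Python sketch performs one first-order (Euler) step of the 3-DOF dynamics. It is a minimal illustration only: all parameter values are placeholders, and the restoring term G(η) is assumed negligible on the horizontal plane for a neutrally buoyant vehicle; neither assumption comes from the article.

```python
import numpy as np

def rotation(psi):
    """Rotation matrix J(eta) from the body-fixed to the inertial frame (yaw only)."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def auv_step(eta, v, tau, a, params, delta=0.1):
    """One Euler step of the 3-DOF model: eta = [x, y, psi], v = [u, v, r].

    tau is the control torque vector, a the lumped uncertainty/disturbance
    vector A(k).  G(eta) is assumed zero (illustrative assumption).
    """
    M1, M2, M3 = params["M"]
    M = np.diag([M1, M2, M3])              # mass terms on surge/sway/yaw
    u_, v_, r_ = v
    # Coriolis-centripetal matrix C(v) as in (2)
    C = np.array([[0.0, 0.0, -M2 * v_],
                  [0.0, 0.0,  M1 * u_],
                  [M2 * v_, -M1 * u_, 0.0]])
    k, kq = params["k_lin"], params["k_quad"]
    # Damping matrix D(v): linear plus quadratic terms
    D = np.diag([k[0] + kq[0] * abs(u_),
                 k[1] + kq[1] * abs(v_),
                 k[2] + kq[2] * abs(r_)])
    eta_next = eta + delta * rotation(eta[2]) @ v
    v_next = v + delta * np.linalg.solve(M, tau + a - C @ v - D @ v)
    return eta_next, v_next
```

For example, starting from rest with a pure surge torque, only the surge velocity changes after one step, which matches the decoupled structure of (4) at zero velocity.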
Different from the simplified models in [14] and [38], a time clock model consisting of both offset and skew is given for the AUV, as depicted by Fig. 2. Let $\alpha_{\mathrm{AUV}}$ represent the clock skew between the AUV and the actual clock $t$, $\beta_{\mathrm{AUV}}$ be the clock offset between the AUV and the actual clock $t$, and $T$ be the local clock of the AUV. Thereby, the clock model of the AUV is presented as

$$T = \alpha_{\mathrm{AUV}}\, t + \beta_{\mathrm{AUV}}. \tag{5}$$

With consideration of acoustic energy consumption, a popular acoustic energy model [39] is adopted here. Particularly, the relative transmission distance between two points is denoted as $d$, the bit length of a data packet is $b$, and the bit duration is $T_b$. Accordingly, the energy consumption of transmitting $b$ bits over distance $d$ is given as

$$E_t(b, d) = b\,E_{\mathrm{elec}} + b\,T_b\, P_t(d) \tag{6}$$

and the energy consumption of receiving them is given as

$$E_r(b) = b\,E_{\mathrm{elec}} \tag{7}$$

where $E_{\mathrm{elec}}$ is the energy consumption of processing a one-bit message, and $P_t(d)$ is the transmit power required over distance $d$, which depends on the transmit channel center frequency $f$, the depth $L$, and the absorption coefficient $P(f)$. From (6) and (7), the energy consumption of the AUV in the communication procedure can be calculated as

$$E = E_t(b, d) + E_r(b). \tag{8}$$

With these considerations in mind, the problem formulation of this article can be summarized as follows.
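The clock model (5) and the energy accounting of (6)-(8) can be sketched numerically. The snippet below is illustrative only: the transmit-power term assumes a common cylindrical-spreading form ($2\pi L d$ with absorption $10^{P(f)d/10}$), and every constant (`E_elec`, `T_b`, `P0`) is a placeholder, not a value from [39].

```python
import math

def local_clock(t, alpha=1.02, beta=0.02):
    """Asynchronous clock model (5): local time T = alpha * t + beta."""
    return alpha * t + beta

def comm_energy(b, d, L, Pf, E_elec=5e-8, T_b=1e-4, P0=1e-3):
    """Total communication energy, assuming a cylindrical-spreading
    acoustic model: transmit power grows with distance d through the
    spreading surface 2*pi*L*d (L the depth) and absorption 10**(Pf*d/10);
    reception costs only the per-bit processing energy E_elec.
    """
    P_t = 2 * math.pi * L * d * 10 ** (Pf * d / 10) * P0  # required transmit power
    E_tx = b * E_elec + b * T_b * P_t                     # eq. (6) analogue
    E_rx = b * E_elec                                     # eq. (7) analogue
    return E_tx + E_rx                                    # eq. (8) analogue
```

The key qualitative point survives the placeholder constants: transmission energy grows rapidly with distance, which is why reducing the number of acoustic broadcasts (as the integrated design does) saves energy.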
Problem 1 (Self-Localization of AUV): From the viewpoint of signal processing, AUV position cannot be ubiquitously accessed through GPS. As a result, how to estimate the location of AUV turns into the primary issue to be solved. In view of this, we aim to develop an RL-based localization algorithm to calculate the location of AUV. Hence, this problem can be converted to the estimations of x and y under the considerations of (4), (5), and (8).
Problem 2 (Tracking Control of AUV): From the viewpoint of cybernetics, AUV dynamics often involve time-varying uncertain model parameters. In view of this, we attempt to introduce M-PCM-OFFD into the uncertainty evaluation, through which a scalable sampling-based RL tracking controller is designed. This problem is converted to guaranteeing $x \to x_r$, $y \to y_r$, and $\psi \to \psi_r$ within a limited time, where $x_r$, $y_r$, and $\psi_r$ are the surge position, sway position, and yaw angle of the target point, respectively.

III. RL-BASED LOCALIZATION AND TRACKING
In this section, a core class of RL strategies, approximate dynamic programming (ADP), is employed to solve the optimization problems. We first design an RL-based localization algorithm to estimate the position of the AUV, and then a scalable sampling-based RL tracking controller is given to drive the AUV to the target point. Finally, theoretical analyses are provided. In addition, some variables and parameters are given in Table I.

A. RL-Based Self-Localization of AUV
To save energy consumption in the communication procedure, the time axis in the localization and tracking procedures is split into multiple measurement windows. One measurement window $T_w$ is an integer multiple of $\delta$. Within each measurement window, the procedure of self-localization contains three main steps, i.e., Data Collection, Localization, and Prediction. In step I, i.e., in the first time step of $T_w$, the AUV sends out an initiator message to the USVs, and the timestamps are gathered through a two-way transmission process. Based on the timestamps collected in step I, an RL-based localization algorithm is performed in step II. In step III, i.e., in the remaining time steps of $T_w$, the tracking controller $\tau$ (see Section III-B) is involved to predict the location of the AUV, through which the energy consumption in the communication procedure is saved, since the AUV is not required to broadcast messages in step III. It is noted that the length of the measurement window $T_w$ is very important to the localization performance. If $T_w$ is too large, the localization error accumulates due to the lack of a feedback mechanism. If $T_w$ is too small, the communication energy consumption increases. From a practical point of view, $T_w$ can be selected according to the system environment and application requirements. Now, the detailed design of step I is presented. At timestamp $T_{b,b}$, the AUV transmits the initiator message to the USVs. Then, the AUV switches into the listening mode until it receives the reply from USV 3. On the other hand, the message from the AUV is received by USV $n \in \{1, 2, 3\}$ at timestamp $t_{b,n}$, and then USV $n$ switches into the waiting mode with the aim of receiving messages from the other USVs. At timestamp $t_{m,n}$, USV $n \in \{2, 3\}$ receives the message from USV $m \in \{1, \ldots, n-1\}$. Subsequently, USV $n$ sends a reply message to the AUV at timestamp $t_{n,n}$, and the AUV receives it at timestamp $T_{n,b}$.
After the reply message from USV 3 is received by AUV, the process of step I is finished. To show more clearly, a depiction of step I is given in Fig. 3.
In step II, the following time differences are constructed to eliminate the influence of clock skew and offset, i.e.,

$$t_{b,n} - t_{b,m} = \frac{d_{b,n} - d_{b,m}}{c} + \omega_{n,m}, \qquad t_{m,n} - t_{b,n} - (t_{m,m} - t_{b,m}) = \frac{d_{b,m} + d_{m,n} - d_{b,n}}{c} + \omega_{m,n} \tag{9}$$

where $d_{b,n}/c$, $d_{b,m}/c$, and $d_{m,n}/c$ represent one-way propagation delays. $d_{b,n}$ is the relative distance from the AUV to USV $n$, $d_{b,m}$ is the relative distance from the AUV to USV $m$, $d_{m,n}$ is the relative distance between USV $m$ and USV $n$, and $c$ is the acoustic propagation speed. Particularly, $\omega_{n,m}$ and $\omega_{m,n}$ satisfy the distributions $\omega_{n,m} \sim \mathcal{N}(0, 2\sigma^2)$ and $\omega_{m,n} \sim \mathcal{N}(0, 3\sigma^2)$, since each timestamp noise is assumed to obey a zero-mean Gaussian distribution with variance $\sigma^2$.
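The key property of these time differences is that the AUV's (unknown, asynchronous) transmit time cancels out, leaving only distance information. The following sketch simulates noiseless receive timestamps at the USVs and verifies that the difference of two receive times is independent of when the AUV transmitted; positions and the acoustic speed are illustrative values.

```python
import math
import random

C_SOUND = 1500.0  # acoustic propagation speed (m/s)

def dist(p, q):
    """Planar Euclidean distance."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def timestamps(auv, usvs, t0, sigma=0.0):
    """Receive times t_{b,n} (real clock) of the AUV initiator message.

    t0 is the true (unknown to the USVs) transmit instant; each timestamp
    is corrupted by zero-mean Gaussian noise with std sigma.
    """
    return [t0 + dist(auv, s) / C_SOUND + random.gauss(0.0, sigma)
            for s in usvs]

def tdoa(ts, n, m):
    """Time difference t_{b,n} - t_{b,m}: the unknown t0 cancels, so it
    depends only on (d_{b,n} - d_{b,m}) / c (plus noise)."""
    return ts[n] - ts[m]
```

Running `tdoa` on timestamp sets generated with two very different transmit instants yields identical values in the noiseless case, which is exactly why (9) removes the influence of the AUV's clock offset and skew.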
The localization is then formulated as the optimization problem

$$\min_{\hat{x}, \hat{y}} \; \mathbb{E}\bigg[ \sum_{m<n} \big( \Delta t_{n,m} - h_{n,m}(\hat{x}, \hat{y}) \big)^2 \bigg] \tag{10}$$

where $\Delta t_{n,m}$ denotes the measured time differences in (9) and $h_{n,m}(\hat{x}, \hat{y})$ the corresponding distance-difference terms evaluated at the estimate $(\hat{x}, \hat{y})$.

Remark 1: In (10), the optimization problem is to minimize the localization error, such that an accurate estimate of $(x, y)$ can be obtained. It is worth mentioning that the localization error consists of two parts, corresponding to the two types of time differences in (9). According to this, one can see that the localization error in (10) is a random variable rather than a deterministic constant, since its components are subject to random measurement noises.

The localization optimization (10) can be carried out by an RL-based algorithm. The main idea is to seek the optimal estimation increment via RL-based online learning. Let $X = [x, y]^T$ and $\hat{X} = [\hat{x}, \hat{y}]^T$ denote the actual and estimated position vectors of the AUV, respectively. Then, the $(l+1)$th state iteration can be represented as

$$\hat{X}_{l+1} = \hat{X}_l + u_{1,l} \tag{11}$$

where $u_{1,l}$ is the estimation increment for the $(l+1)$th iteration. Next, an RL-based algorithm is proposed to seek the optimal estimation increment $u_{1,l}$. By using (10) and applying the procedures of RL [19], the value function is defined as

$$V_1(\hat{X}_l) = \sum_{i=l}^{\infty} \gamma_1^{\,i-l}\, r_1(\hat{X}_i, u_{1,i}) \tag{12}$$

where $r_1(\cdot)$ is the one-step localization cost induced by (10), and $\gamma_1 \in (0, 1]$ denotes the localization discount rate. It needs to be emphasized that $\gamma_1$ reflects the importance of the future reward and depends on the sight horizon of the AUV. For instance, $\gamma_1 \to 0$ means that the AUV is shortsighted, concerning only current rewards, while $\gamma_1 = 1$ makes the AUV strive for a long-term high reward. Based on (12), we use value iteration to calculate the optimal estimation increment $u_{1,l}$, as follows.
1) Initialize: The increment and value functions are initialized as $u_{1,l,0} = 0$ and $V_{1,0}(\hat{X}_l) = 0$, respectively.

2) Value Update: Calculate the value using

$$V_{1,j+1}(\hat{X}_l) = r_1(\hat{X}_l, u_{1,l,j}) + \gamma_1 V_{1,j}(\hat{X}_l + u_{1,l,j}). \tag{13}$$

3) Policy Improvement: Use the following rule to determine an improved increment policy:

$$u_{1,l,j+1} = \arg\min_{u_{1,l}} \big\{ r_1(\hat{X}_l, u_{1,l}) + \gamma_1 V_{1,j+1}(\hat{X}_l + u_{1,l}) \big\}. \tag{14}$$

Note that (13) and (14) involve the unknown functions $V_{1,j}$ and $u_{1,l,j}$, which need to be estimated. To this end, the following functions are defined to approximate $V_1$ and $u_{1,l}$:

$$V_1(\hat{X}) = U_v^T \sigma_v(\hat{X}), \qquad u_{1,l}(\hat{X}) = U_u^T \sigma_u(\hat{X}) \tag{15}$$

where $U_v$ and $\sigma_v$ represent the weight vector and basis function for $V_1$, respectively, and $U_u$ and $\sigma_u$ represent the weight vector and basis function for $u_{1,l}$, respectively. Thus, the value update can be derived from (13) and (15) as

$$U_{v,j+1}^T \sigma_v(\hat{X}_l) = r_1(\hat{X}_l, u_{1,l,j}) + \gamma_1 U_{v,j}^T \sigma_v(\hat{X}_l + u_{1,l,j}) \tag{16}$$

where $U_{v,j+1}$ is obtained by a least-squares estimator. Similarly, $U_u$ in (15) can be calculated by gradient descent, i.e.,

$$U_{u,j+1}^{(i+1)} = U_{u,j+1}^{(i)} - \bar{\delta}\, \sigma_u(\hat{X}_l) \Big( \frac{\partial r_1}{\partial u_{1,l}} + \gamma_1 \nabla\sigma_v^T(\hat{X})\, U_{v,j+1} \Big)^T \tag{17}$$

where $\nabla \sigma_v(\hat{X}) = \partial \sigma_v(\hat{X})/\partial \hat{X}$. Besides, $\bar{\delta} > 0$ and $i$ are the tuning parameter and tuning index, respectively. The above steps are iteratively conducted until convergence, i.e., $\|U_{v,j+1} - U_{v,j}\| < \bar{\epsilon}_1$ and $\|U_{u,j+1} - U_{u,j}\| < \bar{\epsilon}_2$ hold simultaneously for small positive constants $\bar{\epsilon}_1$ and $\bar{\epsilon}_2$. Thereby, the optimal estimation increment $u^*_{1,l}$ can be acquired. Substituting $u^*_{1,l}$ into (11), one can obtain the location of the AUV.

Finally, the detailed process of step III is conducted. Of note, the localization and tracking control are jointly related; thus, the data generated by the tracking control can be applied to guide the localization. Inspired by this, the tracking controller $\tau$ can be employed to predict the position. As a result, from (4), the velocity vector of the AUV can be updated through the nominal dynamics, i.e.,

$$\hat{v}(k+1) = \hat{v}(k) + \delta M^{-1}\big(\tau(k) - C(\hat{v}(k))\hat{v}(k) - D(\hat{v}(k))\hat{v}(k) - G(\hat{\eta}(k))\big) \tag{18}$$

and, hence, the position of the AUV within the current measurement window is predicted by integrating (4) forward from the latest RL-based estimate $\hat{X}_{\lfloor k\delta/T_w \rfloor}$, where $\lfloor \cdot \rfloor$ denotes the floor function. For estimating the clock skew $\alpha_{\mathrm{AUV}}$, we construct the linear relationship between $\alpha_{\mathrm{AUV}}$, $\beta_{\mathrm{AUV}}$, and $\hat{X}_l$, denoted as $\bar{A}\bar{C} = \bar{B}$, where $\bar{C} = [\alpha_{\mathrm{AUV}}, \beta_{\mathrm{AUV}}]^T$ and $\bar{B}$ contains the timestamp measurements corrupted by measurement noise. The time clock of the AUV can then be estimated by the least-squares method, i.e., $\hat{\bar{C}} = (\bar{A}^T\bar{A})^{-1}\bar{A}^T\bar{B}$.
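The iterative estimate-plus-increment idea of (11)-(14) can be illustrated with a deliberately simplified sketch. Instead of the article's function-approximation machinery, the stand-in below greedily picks, at each iteration, the increment from a small candidate set that most reduces the time-difference residual, and shrinks the candidate step geometrically (playing the role of the discount-driven refinement). All geometry and tuning values are illustrative assumptions.

```python
import math

C_SOUND = 1500.0  # acoustic speed (m/s)

def residual(Xhat, usvs, meas):
    """Squared error between measured and predicted time differences
    (the role of the one-step cost r_1 induced by (10))."""
    d = [math.hypot(Xhat[0] - s[0], Xhat[1] - s[1]) for s in usvs]
    pred = [(d[n] - d[0]) / C_SOUND for n in range(1, len(usvs))]
    return sum((m - p) ** 2 for m, p in zip(meas, pred))

def rl_localize(X0, usvs, meas, step=8.0, iters=80, gamma=0.95):
    """Greedy stand-in for the value-iteration loop: choose the increment
    u minimizing the residual, apply X <- X + u as in (11), then shrink
    the increment magnitude by gamma."""
    X = list(X0)
    for _ in range(iters):
        candidates = [(dx, dy) for dx in (-step, 0.0, step)
                               for dy in (-step, 0.0, step)]
        X = min(([X[0] + dx, X[1] + dy] for dx, dy in candidates),
                key=lambda Xc: residual(Xc, usvs, meas))
        step *= gamma
    return X
```

With noiseless measurements generated from a known AUV position, this pattern search converges to the true position, illustrating the increment-based estimation loop; it is not a substitute for the basis-function approximation of (15)-(17).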

B. RL-Based Tracking Control of AUV
This section presents an RL-based tracking controller for the AUV. From a feedback point of view, the relationship between tracking and localization is depicted in Fig. 4. Note that the state vector $\zeta$ can be estimated by employing $\hat{X}$. Let $\hat{\zeta}$ denote the estimate of $\zeta$, and let $\zeta_r$ be the state vector of the target point. Based on this, the tracking error is defined as $e(k) = \hat{\zeta}(k) - \zeta_r$. To reduce the tracking error, the following cost function is given to calculate the one-step cost:

$$g(\hat{\zeta}(k), \tau(k)) = e^T(k) Q_1 e(k) + \tau^T(k) Q_2 \tau(k) \tag{19}$$

where $Q_1$ and $Q_2$ are positive-definite matrices.
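A quadratic one-step cost of this kind is straightforward to compute; the sketch below evaluates it for an arbitrary state estimate, target state, and torque (matrix dimensions and values are illustrative).

```python
import numpy as np

def one_step_cost(zeta_hat, zeta_r, tau, Q1, Q2):
    """Quadratic one-step cost g = e^T Q1 e + tau^T Q2 tau,
    with tracking error e = zeta_hat - zeta_r."""
    e = zeta_hat - zeta_r
    return float(e @ Q1 @ e + tau @ Q2 @ tau)
```

The cost is zero exactly when the estimated state coincides with the target and no torque is applied, and positive otherwise, which is what makes it a valid stage cost for the value iteration that follows.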
With consideration of the time-varying uncertain model parameter vector $A$, the total cost is given as

$$J(\hat{\zeta}(k)) = \sum_{i=k}^{\infty} \gamma^{\,i-k}\, \mathbb{E}_{A(i)}\big[ g(\hat{\zeta}(i), \tau(i)) \big] \tag{20}$$

where $\mathbb{E}_{A(k)}(\cdot)$ is the expectation with respect to $A(k)$, and $\gamma \in (0, 1]$ denotes the discount rate. The value function is denoted as $V(\hat{\zeta}(k)) = J(\hat{\zeta}(k))$; then Bellman's equation has the following form:

$$V(\hat{\zeta}(k)) = \mathbb{E}_{A(k)}\big[ g(\hat{\zeta}(k), \tau(k)) \big] + \gamma V(\hat{\zeta}(k+1)). \tag{21}$$

The aim of RL is to find the optimal policy that minimizes the cost, i.e.,

$$V^*(\hat{\zeta}(k)) = \min_{\tau(k)} \Big\{ \mathbb{E}_{A(k)}\big[ g(\hat{\zeta}(k), \tau(k)) \big] + \gamma V^*(\hat{\zeta}(k+1)) \Big\}. \tag{22}$$

Then, according to (21) and (22), the optimal control policy is denoted as

$$\tau^*(k) = \arg\min_{\tau(k)} \Big\{ \mathbb{E}_{A(k)}\big[ g(\hat{\zeta}(k), \tau(k)) \big] + \gamma V^*(\hat{\zeta}(k+1)) \Big\}. \tag{23}$$

Similar to Section III-A, value iteration can be employed to seek the optimal control policy $\tau^*(k)$. Particularly, the value iteration process can be divided into two steps, i.e., value update and policy improvement. Starting with an admissible state $\hat{\zeta}(k)$ and a current control policy $\tau(k)$, $V(\hat{\zeta}(k))$ is solved by using (21) in the value-update step. In the policy-improvement step, the best control policy $\tau^*(k)$ is derived by evaluating the control policy through (23). As a result, the following value iteration is employed to seek $\tau^*(k)$.

1) Initialization: Choose an admissible control policy $\tau_0(k)$, and set the initial cost $V_0(\hat{\zeta}(k)) = 0$.

2) Value Update: Denote the iteration step as $s$; then the value is updated as

$$V_{s+1}(\hat{\zeta}(k)) = \mathbb{E}_{A(k)}\big[ g(\hat{\zeta}(k), \tau_s(k)) \big] + \gamma V_s(\hat{\zeta}(k+1)). \tag{24}$$

3) Policy Improvement: Use the following rule to determine an improved control policy:

$$\tau_{s+1}(k) = \arg\min_{\tau(k)} \Big\{ \mathbb{E}_{A(k)}\big[ g(\hat{\zeta}(k), \tau(k)) \big] + \gamma V_{s+1}(\hat{\zeta}(k+1)) \Big\} \tag{25}$$

which is obtained by setting $\partial V(\hat{\zeta}(k))/\partial \tau(k) = 0$. In order to smoothly approximate $V_s$ and $\tau_s$, the following two neural-network functions are adopted:

$$V(\hat{\zeta}) = W_v^T \Phi_v(\hat{\zeta}), \qquad \tau(\hat{\zeta}) = W_u^T \Phi_u(\hat{\zeta}) \tag{26}$$

where $W_v$ and $\Phi_v$ are the weight vector and basis function for $V$, respectively, and $W_u$ and $\Phi_u$ are the weight vector and basis function for $\tau$, respectively. Thus, the $(s+1)$th iteration of $W_v$ can be derived from (24) and (26), i.e.,

$$W_{v,s+1}^T \Phi_v(\hat{\zeta}(k)) = \mathbb{E}_{A(k)}\big[ g(\hat{\zeta}(k), \tau_s(k)) \big] + \gamma W_{v,s}^T \Phi_v(\hat{\zeta}(k+1)). \tag{27}$$

Note that $W_{v,s+1}$ can be estimated by the least-squares (LS) method. Let $\mathcal{N} = \{ g(\hat{\zeta}^{[1]}, \tau^{[1]}), \hat{\zeta}^{[1]}, \tau^{[1]}, \hat{\zeta}^{[2]}, \ldots, g(\hat{\zeta}^{[N]}, \tau^{[N]}), \hat{\zeta}^{[N]}, \tau^{[N]}, \hat{\zeta}^{[N+1]} \}$ denote the measured data set.
Hence, by using (24) and (27), the residual error is given as

$$e_v(\hat{\zeta}^{[i]}) = W_{v,s+1}^T \Phi_v(\hat{\zeta}^{[i]}) - \mathbb{E}_{A}\big[ g(\hat{\zeta}^{[i]}, \tau^{[i]}) \big] - \gamma W_{v,s}^T \Phi_v(\hat{\zeta}^{[i+1]}). \tag{28}$$

Then, the iteration procedure is transformed to minimize the total residual error

$$\min_{W_{v,s+1}} \sum_{i=1}^{N} e_v^2(\hat{\zeta}^{[i]}) \tag{29}$$

and, hence, from (27)-(29), the LS solution $W_{v,s+1}$ is

$$W_{v,s+1} = \Big( \sum_{i=1}^{N} \Phi_v(\hat{\zeta}^{[i]}) \Phi_v^T(\hat{\zeta}^{[i]}) \Big)^{-1} \sum_{i=1}^{N} \Phi_v(\hat{\zeta}^{[i]}) \Big( \mathbb{E}_{A}\big[ g(\hat{\zeta}^{[i]}, \tau^{[i]}) \big] + \gamma W_{v,s}^T \Phi_v(\hat{\zeta}^{[i+1]}) \Big). \tag{30}$$

Similar to (17), the weight vector $W_u$ can be acquired by gradient descent, i.e.,

$$W_{u}^{(q+1)} = W_{u}^{(q)} - \epsilon\, \Phi_u(\hat{\zeta}) \Big( \frac{\partial \mathbb{E}_A[g]}{\partial \tau} + \gamma \nabla\Phi_v^T(\hat{\zeta})\, W_{v,s+1} \Big)^T \tag{31}$$

where $h\{\cdot\}$ is the function that preserves only the diagonal elements of a matrix, $\nabla \Phi_v(\hat{\zeta}) = \partial \Phi_v(\hat{\zeta})/\partial \hat{\zeta}$, $\epsilon > 0$ is a tuning parameter, and $q$ is the tuning index for the gradient descent. From (30) and (31), the expectations of $g(\hat{\zeta}, \tau)$ with respect to $A$ are required to be calculated. Note that the most popular sampling method is MC simulation, but it needs a large number of simulation points to converge to the mean. In our previous work [15], M-PCM was adopted to estimate the mean; however, M-PCM is not scalable. For this reason, this article employs a scalable sampling method called M-PCM-OFFD [34], [35], [40] to calculate the mean. Particularly, M-PCM-OFFD employs M-PCM to acquire the simulation points, through which OFFD is conducted to reduce the computational costs and improve the scalability. In the following, it is assumed that $p$ independent uncertain parameters $a_1, \ldots, a_p$ follow the probability density functions $f_{A_1}(a_1), \ldots, f_{A_p}(a_p)$, while each parameter has a degree up to 3, i.e., $n_i = 2$ collocation points per parameter. Referring to [34], [35], and [40], the following property of M-PCM-OFFD is presented.
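The least-squares value update of (27)-(30) has a compact matrix form. The sketch below stacks the basis evaluations into regressor matrices and solves one update for the weight vector; it is a minimal sketch of the LS step under the assumption that the expected stage costs have already been evaluated (by M-PCM-OFFD or otherwise), and all matrix shapes are illustrative.

```python
import numpy as np

def ls_value_update(phi, phi_next, g_bar, W_prev, gamma=0.95):
    """One LS value-update step for W_{v,s+1}.

    phi      : (N, m) matrix of basis evaluations Phi_v(zeta^[i])
    phi_next : (N, m) matrix of basis evaluations Phi_v(zeta^[i+1])
    g_bar    : (N,) vector of expected stage costs E_A[g(zeta^[i], tau^[i])]
    W_prev   : (m,) previous weight vector W_{v,s}

    Solves phi @ W_new ~= g_bar + gamma * phi_next @ W_prev
    in the least-squares sense, mirroring (27)-(30).
    """
    target = g_bar + gamma * phi_next @ W_prev
    W_new, *_ = np.linalg.lstsq(phi, target, rcond=None)
    return W_new
```

When the data are exactly consistent with some weight vector, the update recovers it, confirming that the closed-form solution (30) and the stacked least-squares solve agree.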
Notes of Algorithm 1: In lines 2-7, the M-PCM simulation points are obtained according to $f_{A_1}(a_1)$, $f_{A_2}(a_2)$, and $f_{A_3}(a_3)$. In line 8, the M-PCM-OFFD simulation points are obtained by using Property 1. In lines 9-21, the optimal control policy is obtained by referring to (20)-(31).

Remark 2: For clear expression, the following typical example is presented. Consider an original system mapping $Y(a_1, a_2, a_3)$ with three uncertain parameters. By using MC, the output mean can be calculated as $\mathbb{E}(Y(a_1, a_2, a_3)) = 3844.67$, where $2^6 = 64$ simulation points are required. With M-PCM, the reduced-order mapping is presented by $Y'(a_1, a_2, a_3) = 2197a_1 + 470.67a_2 + 1942a_3 - 7552$, whose output mean is also 3844.67 while only eight simulation points are required [see Fig. 5(a)]. In addition, when M-PCM-OFFD is utilized, a subset of the M-PCM points can be adopted to calculate the same output mean, e.g., the subsets in Fig. 5(b) or (c).

For the performance analysis, let $n_\#$ denote the measurement noise matrix, whose mean value is $Q$ and variance is $R$. To be specific, $Q$ and $R$ can be acquired by the Sage-Husa estimator, as presented in [41]. Denoting the log-likelihood function as $\ln \Lambda(X)$ and the ground truth as $X$, the Fisher information matrix can be represented by

$$L(X) = -\mathbb{E}\bigg[ \frac{\partial^2 \ln \Lambda(X)}{\partial X\, \partial X^T} \bigg]$$

and the CRLB is calculated by $\mathrm{CRLB}(X) = L^{-1}(X)$. Defining $R = \mathrm{diag}(R_1, R_2, R_3, R_4, R_5, R_6)$, the localization error of $X$ is accordingly lower bounded by $\mathrm{CRLB}(X)$.
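Remark 2's mapping is not fully recoverable from the text, so the sketch below uses its own illustrative polynomial to show the point-count argument behind collocation: for a mapping of degree at most 3 in each of three uniformly distributed parameters, two Gauss-Legendre points per parameter (2^3 = 8 points in total) reproduce the exact output mean, while plain MC only approaches it with thousands of random samples.

```python
import random

def mapping(a1, a2, a3):
    """Illustrative low-order polynomial surrogate (not the paper's mapping)."""
    return a1 ** 2 + 2.0 * a2 + a3

def mc_mean(n, seed=0):
    """Monte Carlo estimate of E[Y] for a_i ~ U(0, 1)."""
    rng = random.Random(seed)
    return sum(mapping(rng.random(), rng.random(), rng.random())
               for _ in range(n)) / n

# Exact mean: E[a1^2] = 1/3, E[2*a2] = 1, E[a3] = 1/2  ->  11/6
EXACT = 1.0 / 3.0 + 1.0 + 0.5

# Two-point Gauss-Legendre nodes mapped to [0, 1] integrate polynomials of
# degree <= 3 exactly, so 2^3 = 8 collocation points give the exact mean.
NODES = [0.5 - 0.5 / 3 ** 0.5, 0.5 + 0.5 / 3 ** 0.5]
colloc_mean = sum(mapping(x, y, z)
                  for x in NODES for y in NODES for z in NODES) / 8.0
```

Eight deterministic points match the analytical mean to machine precision, whereas 20000 MC samples still carry sampling error; OFFD goes one step further by selecting an orthogonal subset of such collocation points, which this sketch does not reproduce.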
Proof: With respect to Property 1, the expectation $\mathbb{E}_{A(k)}[g(\hat{\zeta}(k), \tau(k))]$ is approximately calculated through the mean output of a system mapping $Y(\hat{\zeta}(k), \tau(k), a(k))$ with M-PCM-OFFD. Each uncertain parameter in (4) has a degree up to 3, i.e., $n_i = 2$ collocation points per parameter. Particularly, there is no term in which more than $\tau_\#$ of the indices $j_1, j_2, j_3$ are nonzero. Then, the mapping is set as $Y(\hat{\zeta}(k), \tau(k), a(k))$. To this end, the contradiction method is applied, and the following cases are considered.

3) Convergence of the RL-Based Tracking Controller:
This section studies the convergence of (24) and (25), whose aim is to prove $V_s \to V^*$ and $\tau_s \to \tau^*$ as $s \to \infty$. Let $\mu_s$ denote an arbitrary control policy, and let $\Lambda_s$ denote the value sequence generated by $\mu_s$ through the value update (24). Note that $V_s$ is minimized by $\tau_s(k)$ with respect to the control input $\tau$. Then, $V_s(\hat{\zeta}(k)) \le \Lambda_s(\hat{\zeta}(k))$ when $V_0(\hat{\zeta}(k)) = \Lambda_0(\hat{\zeta}(k)) = 0$. Accordingly, we have the following corollary.

Corollary 1: Consider system (4) with value function $V_s$.
It is noted that $V_s(\hat{\zeta}(k))$ is a nondecreasing sequence and $V^*(\hat{\zeta}(k)) \le F(\hat{\zeta}(k))$ from Corollary 1. Therefore, $V_s$ and $\tau_s$ converge to the optimums $V^*$ and $\tau^*$. That completes the proof.

Remark 3: Based on Theorems 1 and 2, we can conclude that the ADP framework in this article assures parameter convergence. The reasons are as follows: 1) in our ADP framework, M-PCM-OFFD is employed to approximate the value function, whose output mean reaches the expectation in the presence of uncertainties and disturbances (see Theorem 1) and 2) with the expected value function, policy improvement is conducted in (25) to seek the optimums, where $V_s$ and $\tau_s$ approximately converge to the optimums (see Theorem 2). Therefore, the convergence of parameters can be assured if the conditions in Theorems 1 and 2 are satisfied. Besides, the tuning parameter $\epsilon$ for the gradient descent in (31) also plays an important role in the convergence. Particularly, $\epsilon$ usually starts from a small value and is adjusted until the tradeoff between convergence rate and system stability is balanced. Of note, the selection criterion for $\epsilon$ is beyond the scope of this article and will be investigated in our future work.
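The monotone, bounded convergence of value iteration claimed above can be illustrated on a scalar linear-quadratic surrogate (an assumption for illustration, not the AUV model). With $V_s(x) = p_s x^2$, the value-update and policy-improvement steps collapse into a Riccati-style recursion on the scalar gain $p_s$:

```python
def value_iteration_gains(a=1.2, b=1.0, q=1.0, r=1.0, gamma=0.95, iters=200):
    """Value iteration for the scalar surrogate x' = a*x + b*u with stage
    cost q*x^2 + r*u^2 and discount gamma.  Starting from V_0 = 0, the
    recursion below is the closed form of one value-update plus
    policy-improvement step; it returns the sequence p_0, p_1, ..., p_iters.
    """
    ag, bg = gamma ** 0.5 * a, gamma ** 0.5 * b  # discount absorbed into (a, b)
    ps = [0.0]
    for _ in range(iters):
        p = ps[-1]
        ps.append(q + ag * ag * p - (ag * bg * p) ** 2 / (r + bg * bg * p))
    return ps
```

Starting from $p_0 = 0$, the sequence is nondecreasing and converges to a fixed point, which is exactly the behavior Corollary 1 and the proof above establish for the general value iteration (24)-(25).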
Remark 4: Note that the idea of simultaneous localization and map building (SLAM) is similar to the method proposed in this article. However, our solution is essentially different from SLAM. Specifically, SLAM places a vehicle at an unknown location in an unknown environment, and then employs relative observations to build an environment map while simultaneously utilizing this map to estimate the local location of the vehicle. Thereby, SLAM focuses on map construction and local location estimation. In contrast, our solution estimates the global location of the vehicle and then drives the vehicle to the desired target point.

A. Simulation Studies
Simulations are presented to verify the effectiveness, where ζ r = [75, 85, −5, 0, 0, 0] T . In addition, it is assumed that the depth of AUV is fixed as −5 m. The main parameters used in this simulation are presented in Table II.

1) AUV Localization Under Asynchronous Clock:
We first test the effectiveness of the localization algorithm provided in Section III-A. For the AUV, the actual clock skew $\alpha_{\mathrm{AUV}}$ and clock offset $\beta_{\mathrm{AUV}}$ are set to 1.02 and 0.02, respectively. The actual and localized trajectories of the AUV are depicted in Fig. 6(a). The localization error of the AUV is defined as $e_V = \|X - \hat{X}\|$. The localization errors are depicted in Fig. 6(b), and the estimated clock information is presented in Fig. 6(c). It is clear that the self-localization of the AUV can be realized, while the asynchronous clock information can be well estimated.

Fig. 7. Localization comparison for the asynchronous localization algorithms, e.g., [13], [14], [17], and [38].
With the RL-based approach, the increment of localization (i.e., u_{1,l}) at time step k = 1 is presented in Fig. 6(d). In addition, we denote U_v^m, U_{x,n}^u, and U_{y,n}^u as the weight vectors, where n and m depend on σ_v and σ_u, respectively. Accordingly, the weight vectors are given in Fig. 6(e) and (f). Clearly, these values converge, reflecting the effectiveness of the RL-based localization algorithm.
2) Comparison of Asynchronous Localization Schemes: In [14] and [38], a simplified time clock model T = t + β_AUV is adopted, where α_AUV is not involved, i.e., α_AUV = 1. With this simplification, we apply the algorithms in [14] and [38] to locate AUV. The resulting trajectories of AUV are provided in Fig. 7(a). Meanwhile, the localization errors on the x- and y-axes are presented in Fig. 7(b), while the localization error e_V is shown in Fig. 7(c). It is clear that the localization task can be accomplished in this case. Next, a general scenario, i.e., α_AUV ≠ 1, is considered; in particular, the clock skew is set as α_AUV = 1.00032. The trajectories and localization errors of AUV are shown in Fig. 7(d) and (e), respectively. From Fig. 7(a)-(e), one sees that the clock skew degrades the localization accuracy of the simplified schemes, whereas the RL-based localization algorithm in this article can mitigate the influence of both clock offset and clock skew. Through the above comparisons, it follows that the consideration of clock skew is necessary, and the RL-based localization algorithm can effectively remove the impact of the clock skew.
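Why an unmodeled skew matters can be seen from a small numerical sketch of one-way travel-time ranging under the affine clock model T = α·t + β used in this article. The sound speed, send time, and distance below are illustrative values, not the simulation settings.

```python
# Sketch: how an unmodeled clock skew biases one-way travel-time ranging.
# Values are illustrative; C is a nominal underwater sound speed.

C = 1500.0  # sound speed (m/s)

def local_time(t, alpha, beta):
    """Affine clock model from the article: T = alpha * t + beta."""
    return alpha * t + beta

def one_way_range(true_dist, alpha, beta, t_send=100.0):
    """Receiver timestamps with its skewed clock; the sender clock is ideal.
    The naive estimator wrongly assumes both clocks are synchronized."""
    t_recv = t_send + true_dist / C           # true reception time
    T_recv = local_time(t_recv, alpha, beta)  # skewed local timestamp
    return C * (T_recv - t_send)              # naive range estimate

d = 750.0
# Offset-only model (alpha = 1): constant bias C * beta
err_offset = one_way_range(d, 1.0, 0.02) - d
# Skew included (alpha = 1.00032): bias also grows with the send time
err_skew = one_way_range(d, 1.00032, 0.02) - d
print(err_offset, err_skew)
```

Even the small skew 1.00032 considered in the comparison adds a time-dependent bias on top of the offset term, which is why a protocol that estimates both α_AUV and β_AUV outperforms the offset-only model of [14] and [38].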
As mentioned above, the RL-based joint localization and tracking solution in this article can reduce communication energy consumption. To verify this conclusion, we consider the following two scenarios: 1) T_w = 3δ and 2) T_w = δ. In Scenario 1, AUV does not need to broadcast a localization message at every time step; this scenario coincides with the solution in this article. Meanwhile, Scenario 2 coincides with the solutions in [13] and [17], since AUV is required to broadcast a localization message at each time step. Accordingly, the localization errors on the x- and y-axes for the above scenarios are depicted in Fig. 7(f), while the localization error e_V is shown in Fig. 7(g). Correspondingly, the energy consumptions of the communication procedure are shown in Fig. 7(h). Compared with the solutions in [13] and [17], the localization accuracy of our solution is preserved, and more importantly, the communication energy consumption is significantly reduced. From this comparison, one concludes that the integration of tracking control is meaningful for the localization task.
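The message-count argument behind the two scenarios can be sketched as follows; the mission length and step counts are illustrative, not the simulation settings.

```python
# Sketch of the communication-saving argument: with broadcast period
# T_w = 3*delta the AUV sends one localization message every three control
# steps, versus every step when T_w = delta. Counts are illustrative.

def broadcast_count(total_steps, period_in_steps):
    """Number of localization broadcasts over a mission of total_steps."""
    return total_steps // period_in_steps

steps = 300
msgs_integrated = broadcast_count(steps, 3)  # T_w = 3*delta (this article)
msgs_baseline = broadcast_count(steps, 1)    # T_w = delta ([13], [17])
print(msgs_integrated, msgs_baseline)
```

Since each acoustic broadcast costs roughly the same transmission energy, cutting the message count by a factor of three translates directly into the energy saving shown in Fig. 7(h).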
3) Comparison With the LS Estimator: In [15], the LS method was employed to localize the AUV. Nevertheless, traditional LS-based estimators can easily fall into a local minimum. The following example is presented to support this judgment. Consider a nonconvex least-squares problem with cost L, where x_# ∈ R is the decision variable with an initial estimate of 5, â_1 = 0.15, â_2 = 0.2, and â_3 = 0.25, x*_# is the optimized decision value, and G_i is the measurement whose error satisfies the distribution N(0, 0.05). By employing the LS-based estimator, one obtains x*_# = 5.2942 with L = 12126. Meanwhile, the RL strategy is also adopted, through which one calculates x*_# = 0.12 with L = 0.0593, as shown in Fig. 8. Clearly, the RL-based strategy can obtain the global optimum (i.e., x*_# = 0.12), while LS easily falls into a local optimum (i.e., x*_# = 5.2942).
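The local-minimum phenomenon can be reproduced with a minimal sketch. The cost function below is a hypothetical nonconvex objective with its global minimum placed at x = 0.12 to mirror the example; it is not the article's actual cost L, and the coarse grid search stands in for the RL exploration described in the text.

```python
import math

# Hypothetical nonconvex cost with a global minimum at x = 0.12 and several
# local minima; it stands in for the cost L in the text and is NOT the
# article's actual objective.
def cost(x):
    u = x - 0.12
    return u * u + 20.0 * (1.0 - math.cos(2.0 * u))

def grad_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent (LS-style local search); from x0 = 5 it
    stalls in a local minimum far from the global one."""
    x = x0
    for _ in range(steps):
        g = 2.0 * (x - 0.12) + 40.0 * math.sin(2.0 * (x - 0.12))
        x -= lr * g
    return x

def coarse_global_search(lo=-10.0, hi=10.0, n=2001):
    """Exploration-style search (a stand-in for the RL strategy)."""
    xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return min(xs, key=cost)

x_local = grad_descent(5.0)        # trapped near a local minimum
x_global = coarse_global_search()  # close to the global minimum 0.12
print(x_local, cost(x_local))
print(x_global, cost(x_global))
```

Starting from the initial estimate 5, the purely local update settles in a basin with a large residual cost, while the exploring search finds the global basin, which is the qualitative behavior reported for the LS and RL estimators in Fig. 8.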

4) Tracking Control of AUV With RL and M-PCM-OFFD:
Without loss of generality, the time-varying uncertain model vector A in (4) satisfies prescribed probability density functions. The tracking errors are defined as e_1 = x − x_r, e_2 = y − y_r, and e_3 = ψ − ψ_r. Accordingly, the basis functions for the value function and control policy are represented as [e_1 e_3, e_2 e_3, e_2 e_1, e_1^2, e_2^2, e_1^2 e_3^2, e_2^2 e_3^2]^T and [e_1, e_2, e_3]^T, respectively. With the above design, the tracking trajectories of AUV under the RL-based tracking controller are shown in Fig. 9(a), and the tracking errors are provided in Fig. 9(b). It can be seen that the tracking control objective is achieved, since the tracking errors approximately converge to zero. Meanwhile, the linear and angular velocities of AUV are shown in Fig. 9(c) and (d), respectively. Correspondingly, the optimal control policy is depicted in Fig. 9(e). The weights for the value function and control policy, i.e., W_v and W_u, are provided in Fig. 9(f) and (g), respectively. Besides, the total cost is depicted in Fig. 9(h). From Fig. 9(f)-(h), we see that the weights converge to their optimal values, while the cost reaches a minimum.
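The linear-in-parameters structure above can be sketched directly: the value function is approximated as the inner product of a weight vector W_v with the basis vector listed in the text. The function names and the placeholder weights below are ours, not the article's learned values.

```python
# Sketch of the linear-in-parameters value-function approximation
# V(e) ~ W_v . phi_v(e), with the basis vector listed in the text.
# The weights used here are placeholders, not learned values.

def phi_v(e1, e2, e3):
    """Basis functions for the value function, as given in the text."""
    return [e1 * e3, e2 * e3, e2 * e1,
            e1 ** 2, e2 ** 2,
            (e1 ** 2) * (e3 ** 2), (e2 ** 2) * (e3 ** 2)]

def phi_u(e1, e2, e3):
    """Basis functions for the control policy, as given in the text."""
    return [e1, e2, e3]

def value(Wv, e):
    """Evaluate V(e) as the weighted sum of basis functions."""
    return sum(w * p for w, p in zip(Wv, phi_v(*e)))

Wv = [0.1] * 7  # placeholder weights
print(value(Wv, (1.0, -0.5, 0.2)))
```

With this parameterization, learning reduces to updating the finite weight vectors W_v and W_u, which is why their convergence in Fig. 9(f) and (g) certifies convergence of the value function and policy.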

5) Comparison for the RL-Based Tracking Controller:
One important merit of our tracking controller is the integration of localization. When localization is denied, an IMU and DVL can be employed to estimate the location of AUV. However, the IMU and DVL suffer from error accumulation because of the lack of a feedback mechanism. To test this judgment, the IMU is adopted to provide localization information for the tracking controller. As a result, the tracking errors of AUV are given in Fig. 9(i). Obviously, the tracking performance is much poorer than that in Fig. 9(b). From this comparison, one concludes that the integration of localization is meaningful for the design of the tracking controller.
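The error-accumulation effect of open-loop dead reckoning can be illustrated with a one-dimensional sketch. The bias, noise level, and velocities below are illustrative, not the specifications of the IMU used in the simulation.

```python
import random

# Sketch of why pure dead reckoning (IMU/DVL-style integration) accumulates
# error: without a position fix, each biased velocity sample is integrated,
# so the position error grows with time. Numbers are illustrative.

def dead_reckon(steps, dt=0.1, bias=0.05, seed=0):
    """Integrate a biased, noisy velocity measurement open loop and
    return the resulting position error magnitude."""
    rng = random.Random(seed)
    true_x, est_x = 0.0, 0.0
    for _ in range(steps):
        v = 1.0                                   # true velocity
        v_meas = v + bias + rng.gauss(0.0, 0.02)  # biased, noisy sample
        true_x += v * dt
        est_x += v_meas * dt                      # open-loop integration
    return abs(est_x - true_x)

err_short = dead_reckon(100)   # ~10 s of integration
err_long = dead_reckon(1000)   # ~100 s: roughly 10x larger error
print(err_short, err_long)
```

The bias term dominates and the error grows roughly linearly in time, which matches the degraded tracking observed in Fig. 9(i) when the IMU replaces the feedback-based localization.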
For comparison, the proportional-integral-derivative (PID)-based tracking controller in [42] is adopted to drive AUV to the target point. The corresponding tracking errors of AUV are presented in Fig. 9(j). Similarly, the tracking performance is much poorer than that in Fig. 9(b). This comparison reflects that the RL-based tracking controller can find the optimal policy in an uncertain environment.

6) Equivalence Between M-PCM-OFFD and MC:
As mentioned above, the M-PCM-OFFD strategy adopts a small number of uncertain samples to calculate the output mean. To test this judgment, we implement MC, M-PCM, and M-PCM-OFFD to calculate the means of the three states with respect to A. Since the number of uncertain parameters is 3, the output mean for each parameter is depicted in Fig. 9(k). Besides, the convergence procedure is presented in Fig. 9(l). It is shown that the total number of simulations required by M-PCM (eight) is much smaller than that of MC (about 800), and M-PCM-OFFD (four) further reduces the number of simulations compared with M-PCM. Fig. 9(k) and (l) demonstrate that the M-PCM-OFFD scheme is computationally meaningful and practical for real-time cyber-physical systems.
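The sampling-cost argument can be illustrated in one dimension: for a smooth (here cubic) response, two well-chosen collocation nodes recover the exact output mean that Monte Carlo approximates with hundreds of samples. The response function and the distribution of the uncertain parameter below are illustrative, not the AUV model, and the two-point rule shows the PCM idea rather than the full M-PCM-OFFD design.

```python
import random

# Sketch of the sampling-cost argument behind M-PCM-OFFD: a few collocation
# points versus hundreds of Monte Carlo samples for the output mean.
# f and the parameter distribution are illustrative, not the AUV model.

def f(a):
    """A smooth cubic response to one uncertain parameter a."""
    return 2.0 + 0.5 * a + 0.3 * a ** 2 + 0.1 * a ** 3

def mc_mean(n, seed=1):
    """Monte Carlo mean estimate with n samples, a ~ N(0, 1)."""
    rng = random.Random(seed)
    return sum(f(rng.gauss(0.0, 1.0)) for _ in range(n)) / n

def two_point_mean():
    """2-point collocation for a standard normal parameter: nodes at +/-1
    with equal weights match moments up to degree 3 (PCM-style exactness)."""
    return 0.5 * (f(1.0) + f(-1.0))

exact = 2.0 + 0.3  # E[f] = 2 + 0.5*E[a] + 0.3*E[a^2] + 0.1*E[a^3] = 2.3
print(two_point_mean(), mc_mean(800), exact)
```

With three uncertain parameters, the full-factorial collocation grid gives the eight M-PCM runs reported above, and the OFFD fractional design halves that to four, while MC still needs on the order of 800 runs.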

B. Experimental Studies
Experimental studies are performed in our lab. Due to the limited experimental conditions, we verify the effectiveness of the localization algorithm and tracking controller separately.
For the localization algorithm, an underwater acoustic communication system that includes transducers, a control center, and acoustic modems is constructed, as depicted in Fig. 10(a). In particular, the transducers can be regarded as sensor nodes, with the duties of underwater signal generation and reception. In our experiment, the transducers are provided by Hangzhou Ambrella Automation Technology Company, Ltd. [see Fig. 10(b)], whose frequency and bandwidth are 35 kHz and 4 kHz, respectively. The modem is built on an STM32 processor, which has been widely adopted in embedded systems due to its low cost, high efficiency, and low power consumption. Meanwhile, the overall structure of the acoustic modem is shown in Fig. 10(c).
With the above localization system, the relative distances between any two nodes are measured. Let d_{i,j} denote the relative distance between nodes i and j, where i, j ∈ {1, 2, 3, 4} and i ≠ j. Two cases are considered: 1) assuming the clocks are synchronized, the time difference between any two nodes is adopted to measure the relative distance and 2) without the synchronization assumption, the asynchronous localization protocol presented in Section III-A is adopted to measure the relative distance. The resulting relative distances are presented in Table III. Clearly, the asynchronous localization protocol in this article can remove the influence of clock skew and clock offset, whereas the localization protocols under the synchronization assumption are not suitable for underwater nodes. Based on this, we employ the LS estimator [15] and the RL estimator to seek the position of node 1. Specifically, a periodic noise term 0.05 cos²(0.2x − 0.02) is added to the optimization problem (10), through which the position estimations of node 1 are presented in Fig. 11. It is noticed that the RL-based asynchronous localization algorithm in this article can avoid the local minimum.
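Why asynchronous ranging can work without synchronized clocks is illustrated by a classical two-way (round-trip) exchange, in which an unknown clock offset cancels. This is a simplified sketch with offset only and illustrative values; the article's protocol additionally estimates the skew α, which a round trip alone does not cancel.

```python
# Sketch of how a two-way (round-trip) exchange cancels an unknown clock
# offset in acoustic ranging. Offset-only model; values are illustrative.

C = 1500.0  # nominal sound speed (m/s)

def two_way_range(true_dist, beta, turnaround=0.2):
    """Node i transmits at t1 (its clock); node j replies after a known
    turnaround; node j's clock leads node i's by the offset beta."""
    t1 = 10.0                       # send time, clock i
    t2 = t1 + true_dist / C + beta  # arrival timestamp, clock j
    t3 = t2 + turnaround            # reply timestamp, clock j
    t4 = t3 + true_dist / C - beta  # arrival back, clock i
    rtt = (t4 - t1) - (t3 - t2)     # round trip minus turnaround time
    return C * rtt / 2.0            # offset beta cancels exactly

print(two_way_range(450.0, beta=0.5))
```

Because both timestamps of each clock enter as differences, beta drops out of the round-trip time, mirroring how the Table III distances remain accurate without synchronization.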
For the tracking controller, the detailed experiment setup is presented in Fig. 12(a). Specifically, the hardware system of the experiment has the following components. 1) Control Center: It provides the reference point of AUV and sends messages to AUV. 2) USV (or Relay Node): The USV is equipped with a wireless positioning unit, which adopts an ultrawideband tool to provide localization for AUV. 3) AUV: AUV tracks the reference point with the assistance of the relay nodes, and the RL algorithm is embedded into AUV through STM32 development boards. In addition, a single-axis gyroscope (model GGPM01U) is adopted to measure the yaw angle. The structure of the above experiment is shown in Fig. 12(b). Three reference points are given as [2, 1.5, 0]^T, [3, 4, 0]^T, and [4, 1.5, 0]^T, respectively. The AUV is initialized at an arbitrary position in the area. As discussed in Section II, the uncertainty vector A is related to the internal model uncertainties and the external disturbances. In this context, the model uncertainties in our experiment are induced by the limited prior knowledge of the AUV model, while the external disturbances are induced by water waves. Of note, the water waves in the pool are created by the relative motion of AUV rather than by a wave-making system.
When the initial signal from the control center is received, AUV begins to learn the state and environment information. The learning process is terminated when W_v and W_u have both converged, after which AUV adopts the optimized control policy to track the three target points in turn. With the embedded control and communication system, the AUV tracking trajectories are shown in Fig. 12(c). Moreover, the PID-based tracking controller [42] is also adopted to drive AUV to the target points. The positions and errors on surge, sway, and yaw angle are shown in Fig. 12(d)-(f), respectively. It is noted that the tracking process with the PID-based controller [42] is not smooth, and there is a large deviation when AUV turns toward another target point. In contrast, the tracking performance with the RL-based controller is stable and accurate throughout the process. Thereby, the tracking performance with the RL-based algorithm is better than that with the PID-based tracking controller [42]. Of note, the above results are consistent with the theoretical algorithm in Section III-B. In particular, the video of the PID-based experiment is given in [43], while the video of the RL-based experiment in this article is given in [44]. For a clearer illustration, another video link is presented in [45] to show the implementation and limitations of our experiment. Besides that, the whole procedure for the tracking control experiment is depicted in Fig. 13.
It is noted that the localization and tracking experiments are implemented separately. The reasons for this design can be explained as follows. 1) On the hardware level, the integrated design requires the control and communication units to be assembled together in the cabin; however, this hardware assembly is not allowed in our lab due to the narrow cabin and the bulky communication unit. 2) On the software level, the integrated design requires the communication and control waveforms to share the same frequency band; however, this requirement cannot be satisfied in our lab due to serious echo interference. Nevertheless, we believe our experiments can support part of the results in this article, because the separate implementation is the foundation of the integrated implementation. The integrated implementation of the experiment will be a key direction of future research.
Discussion: It is worth noting that the above experiment is conducted in an indoor pool. For further testing in a real marine environment, the following three points need more attention: 1) the acoustic communication technology needs to be upgraded and embedded in the control unit of AUV; 2) edge computing and cloud computing technologies need to be used alternately to guarantee real-time tracking; and 3) waterproofing and mobility need to be upgraded due to the harsh marine environment.

V. CONCLUSION AND FUTURE WORKS
An RL-based solution that jointly considers AUV localization and tracking was developed. We formulated the AUV localization as an RL problem, through which an asynchronous localization algorithm was designed to estimate the position information of AUV. On the basis of the position estimation, an RL-based tracking controller that incorporates the M-PCM-OFFD strategy was developed to drive AUV to the target point. Simulation and experiment results were both presented to verify the effectiveness. In the future, we will extend our solution to the formation control of AUVs in more practical communication channels. Besides that, how to jointly implement the localization and tracking experiments is also a focus of our future research.