A New Compensation Method of Dynamic Lever-Arm Effect Error for Hypersonic Vehicles

For the Inertial Navigation System (INS) of a hypersonic vehicle, a lever-arm effect error arises when the centroid of the Inertial Measurement Unit (IMU) does not coincide with that of the vehicle, and this error grows further in a dynamic, complex environment. To improve navigation accuracy, the lever-arm effect error must be compensated. Although much research has addressed lever-arm effect compensation, existing methods are almost all based on a fixed lever-arm length and cannot adapt to changes of the lever arm in a dynamic, complex environment. To achieve higher navigation accuracy for a hypersonic vehicle in such an environment, a more complete dynamic lever-arm model is established, and a new compensation method for the dynamic lever-arm effect based on reinforcement learning (RL-DLAC) is proposed. The method adopts the reinforcement-learning framework and obtains the optimal parameter estimate of the dynamic lever-arm model from a continuous action space using the deep deterministic policy gradient (DDPG). Simulation results show that the proposed method estimates the parameters of the dynamic lever-arm model with an error of only 0.31%, compensates the lever-arm effect error in a complex dynamic environment, and improves navigation accuracy.


I. INTRODUCTION
A hypersonic vehicle, as a cutting-edge weapon in the modern combat environment, has attracted the attention of many countries because of its hypersonic speed and high maneuverability [1], [2]. As one of the key technologies of hypersonic vehicles, precise navigation is important to ensure their feasibility and efficiency [3]. The Inertial Navigation System (INS) is generally selected as the main navigation system of a hypersonic vehicle because of its autonomy [4]. The error of its Inertial Measurement Unit (IMU) strongly affects INS accuracy; once the IMU error is calibrated and compensated, navigation accuracy can be effectively improved [5].
During installation, it is generally difficult to make the IMU coincide with the centroid of a hypersonic vehicle. The resulting lever arm between the IMU and the vehicle centroid produces the lever-arm effect error [6]. In a high-speed, highly maneuverable environment, the influence of elastic deformation and centroid shift [7] further amplifies the lever-arm effect error generated at the IMU. Therefore, it is of great significance to study the compensation of the dynamic lever-arm effect error for hypersonic vehicles.
(The associate editor coordinating the review of this manuscript and approving it for publication was Erwu Liu.)
At present, many scholars at home and abroad have studied lever-arm effect error compensation, mainly through mechanical compensation and digital filtering. The mechanical compensation method generally relies on a corresponding angular-motion scheme and a measurement system to estimate the lever arm [8]-[12]. As a more practical on-line compensation method, digital filtering generally models the relationship between the lever-arm parameters and the navigation system, and calibrates the lever arm through filtering [13]-[16]. However, most current lever-arm compensation methods assume a constant lever arm [17]-[20], which cannot adapt to a complex, dynamic flight environment. Current research on dynamic lever-arm compensation generally builds on existing fixed lever-arm methods, treating the dynamic change of the lever arm as random flexible deformation and using an Adaptive Kalman Filter (AKF) to compensate the lever-arm effect error [21]-[23]. However, this research does not consider the other factors that drive lever-arm change, and it lacks a more detailed analysis of the model parameters, so its applicability needs improvement.
Reinforcement learning (RL) has been applied in navigation systems because of its unique ability to learn by interacting with the environment [24]-[27]. In an actual complex dynamic environment, the change of the lever arm can likewise be cast as a problem of interaction with the flight environment. Therefore, targeting lever-arm effect compensation for hypersonic vehicles in a complex dynamic environment, a new dynamic lever-arm effect compensation method based on reinforcement learning (RL-DLAC) is proposed. The method first establishes a more complete elastic-deformation dynamic lever-arm model under high-maneuverability conditions, and then uses the environmental interaction mechanism of RL to adaptively estimate the dynamic lever arm, thereby effectively improving navigation accuracy and adapting to complex lever-arm changes.
The contribution of this study is that, going beyond traditional fixed lever-arm compensation, the change of the dynamic lever-arm model is analyzed for the first time, and a new method is developed for compensating the dynamic lever-arm effect. The proposed method transforms the influence of different factors on the lever-arm change into changes of the dynamic lever-arm model parameters, treated as an unknown term related to the environment. It is thus better suited to compensating the dynamic lever-arm effect in the complex dynamic environment of hypersonic vehicles.
The remainder of this paper is organized as follows. Section 2 presents the dynamic lever-arm error model of a hypersonic vehicle. Section 3 describes the design of RL-DLAC. Section 4 verifies the performance of RL-DLAC in calibrating and compensating the dynamic lever arm for hypersonic vehicles. Finally, conclusions are drawn in Section 5.

II. DYNAMIC LEVER-ARM ERROR MODEL
This section uses the principle of the traditional lever-arm effect to show the limitations of the traditional method, and establishes a more complete dynamic lever-arm error model based on the elastic deformation mechanism of hypersonic vehicles.

A. TRADITIONAL LEVER-ARM ERROR MODEL
Ideally, the IMU should measure the specific force at the vehicle centroid. In an actual system, however, structural constraints mean the centroid of the accelerometer triad on the IMU does not necessarily coincide with the vehicle centroid. Consequently, the accelerometer output contains the lever-arm effect error when the vehicle undergoes angular motion (since the three accelerometers on the IMU are close together, the inner lever-arm error between accelerometers is not considered here). A schematic diagram is shown in Fig. 1.
$O_i$ and $O_b$ are the origins of the inertial and body coordinate systems, respectively, as shown in Fig. 1. $O_A$ is the center of mass of the IMU. $\mathbf{r}_A$ is the lever-arm vector. $\boldsymbol{\omega}_{ib}$ is the angular velocity of the body frame relative to the inertial frame [22]. The accelerometer error caused by the lever-arm effect, written $\nabla_{lever}$, is

$$\nabla_{lever} = \dot{\boldsymbol{\omega}}_{ib}^b \times \mathbf{r}_A + \boldsymbol{\omega}_{ib}^b \times \left(\boldsymbol{\omega}_{ib}^b \times \mathbf{r}_A\right) + 2\boldsymbol{\omega}_{ib}^b \times \dot{\mathbf{r}}_A + \ddot{\mathbf{r}}_A$$

In the traditional compensation method for the lever-arm effect, it is generally assumed that the relative position of the vehicle and the IMU does not change, which is equivalent to $\ddot{\mathbf{r}}_A = \dot{\mathbf{r}}_A = 0$. The lever-arm length is then appended to the state vector of the inertial navigation error equation and estimated by a Kalman filter. This method cannot guarantee navigation accuracy when the lever arm is dynamic. Fig. 2 shows that a change in the lever arm leads to an increase in navigation error.
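As a small numeric illustration of the lever-arm acceleration error for a fixed lever arm (function names and values here are illustrative, not from the paper):

```python
import numpy as np

def lever_arm_error(omega_ib, omega_dot_ib, r_A):
    """Accelerometer error from a fixed lever arm r_A (body frame):
    omega_dot x r_A + omega x (omega x r_A); the rate terms vanish
    when the lever arm does not change."""
    return (np.cross(omega_dot_ib, r_A)
            + np.cross(omega_ib, np.cross(omega_ib, r_A)))

# example: a pure 1 rad/s yaw rate with a 0.5 m lever arm along the body x-axis
err = lever_arm_error(np.array([0.0, 0.0, 1.0]),
                      np.zeros(3),
                      np.array([0.5, 0.0, 0.0]))
# only the centripetal term survives: [-0.5, 0, 0] m/s^2
```

A 0.5 m offset at a modest 1 rad/s turn rate already injects 0.05 g of spurious specific force, which illustrates why the error matters for a maneuvering vehicle.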
Regarding research on lever-arm change, the only existing approach treats the change as random flexible deformation. A changing lever arm means $\ddot{\mathbf{r}}_A \neq 0$ and $\dot{\mathbf{r}}_A \neq 0$. However, the current flexible-deformation model is generally a second-order Markov process; such random deformation has no clear physical meaning and is inaccurate. Therefore, the lever-arm change model needs improvement. Besides, owing to environmental changes, the parameters of an actual dynamic lever-arm model are also time-varying, and at present there is no compensation method for changes in the lever-arm model itself. In fact, the lever-arm model will change with the natural frequency and vibration mode in the actual flight environment of a hypersonic vehicle. Therefore, a compensation method for changes in the dynamic lever-arm model is needed.

B. IMPROVED DYNAMIC LEVER-ARM ERROR MODEL
The structure of a hypersonic vehicle generally has a certain stiffness, but structural deformation and bending vibration occur in a highly dynamic maneuvering environment under external forces such as aerodynamic load. Therefore, the lever-arm length between the IMU and the vehicle centroid can no longer be estimated by filtering as a constant. Nor can it be approximated by a simple second-order Markov process, because the actual lever-arm deformation of the vehicle structure is not a simple random process and the actual external forces must be considered. The derivation of an improved dynamic lever-arm model is described below in detail.
The hypersonic vehicle is approximated as a slender beam whose cross-sectional area changes little along the longitudinal direction, so a simplified beam can replace the vehicle structure. Its transverse vibration in space can be regarded as the motion of a non-uniform elastic beam with free ends. The elastic displacement, flexural stiffness, and mass inertia of the beam along its length are determined by the structural parameters of the vehicle. The elastic deformation of the vehicle is illustrated in Fig. 3. The elastic deflection displacement can be written as

$$w(x, t) = \sum_i W_i(x)\, q_i(t)$$

where $W_i(x)$ represents the relative lateral displacement of each point on the longitudinal axis of the vehicle, called the $i$-th natural mode function, and $q_i(t)$ is the generalized coordinate of elastic vibration, which determines the amplitude of the elastic vibration once the mode function is fixed. The generalized coordinate $q_i(t)$ varies with time and satisfies

$$\ddot{q}_i + 2\zeta_i \omega_i \dot{q}_i + \omega_i^2 q_i = \frac{Q_i}{M_i}$$

where $\omega_i$ is the natural frequency of the $i$-th mode, $\zeta_i$ is its damping coefficient, $Q_i$ is its generalized force, and $M_i$ is its generalized mass. $Q_i$ and $M_i$ satisfy

$$Q_i = \int_0^l f_{yb}(x)\, W_i(x)\, dx, \qquad M_i = \int_0^l m(x)\, W_i^2(x)\, dx$$

where $f_{yb}(x)$ is the normal projection of the external force acting on the vehicle, $m(x)$ is the mass distribution along the longitudinal axis of the vehicle, and $l$ is the equivalent length of the beam. The external force on the vehicle is simplified to gravity, thrust, and aerodynamic force:

$$f_{yb} = F_{gyb} + F_{kyb} + F_{ayb}$$

where $F_{gyb}$, $F_{kyb}$, and $F_{ayb}$ are the projections of gravity, thrust, and aerodynamic force on the $O_b Y_b$-axis, respectively. Accounting for both the elastic deformation and the random flexible deformation of the vehicle's lever arm, the generalized coordinates of the elastic vibration satisfy

$$\ddot{q}_i + 2\zeta_i \omega_i \dot{q}_i + \omega_i^2 q_i = u_{qi} + w_{qi}$$

where $u_{qi}$ is the control quantity of elastic deformation and $w_{qi}$ is the noise of elastic deformation.
Taking the third-order vibration modes as an example, the elastic deformation at the IMU location satisfies

$$\Delta \mathbf{r} = \sum_{i=1}^{3} W_i(x_A)\, q_i(t)$$

According to the geometric relationship, the lever-arm length is

$$\mathbf{r}_A = \mathbf{r}_{A0} + \Delta \mathbf{r}$$

From the above equations, the lever-arm effect error can be decomposed as

$$\nabla_{lever} = \mathbf{u}_l + \mathbf{w}_l$$

where $\mathbf{r}_{A0}$ is the initial lever-arm length (a default initial calibration value, which does not affect the compensation accuracy), $\mathbf{q}$ is the vector of generalized coordinates, and $\mathbf{u}_l$ and $\mathbf{w}_l$ satisfy $\mathbf{u}_l = Y \mathbf{u}_q + \left(\dot{\boldsymbol{\omega}}_{ib}^{b\times} + \boldsymbol{\omega}_{ib}^{b\times} \boldsymbol{\omega}_{ib}^{b\times}\right) \mathbf{r}_{A0}$ and $\mathbf{w}_l = Y \mathbf{w}_q$, respectively.
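As a sketch of how this modal model can be propagated numerically, the deflection at the IMU station is the mode superposition of three damped oscillators. All numbers, names, and the integration scheme below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def step_modes(q, qdot, omega, zeta, Q_over_M, dt):
    """One semi-implicit Euler step of q'' + 2*zeta*w*q' + w^2*q = Q/M,
    vectorized over the modes."""
    qddot = Q_over_M - 2.0 * zeta * omega * qdot - omega**2 * q
    qdot = qdot + dt * qddot
    q = q + dt * qdot
    return q, qdot

# hypothetical numbers: three modes, W_i(x_A) are mode-shape values at the IMU station
omega = np.array([50.0, 100.0, 160.0])   # natural frequencies (rad/s), assumed
zeta = np.full(3, 0.02)                  # modal damping ratios, assumed
W_xA = np.array([0.8, -0.3, 0.1])        # mode shapes at x_A, assumed
q, qdot = np.zeros(3), np.zeros(3)
for _ in range(1000):                    # 1 s at dt = 1 ms under a constant forcing
    q, qdot = step_modes(q, qdot, omega, zeta, np.ones(3), dt=1e-3)

delta_r = W_xA @ q                       # elastic deflection added to r_A0
```

The semi-implicit step keeps the lightly damped oscillators stable at this step size; any standard ODE integrator would serve equally well.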
In view of the extensive research on air-breathing hypersonic vehicles, one of those is taken as the simulation object [28]. The structure and system distribution diagram of the hypersonic vehicle is shown in Fig. 4.
The front and rear bodies of the hypersonic vehicle are equipped with subsystems, and liquid oxygen and hydrogen storage tanks are installed in the middle and both sides.
The natural frequency and mode shape are related to the bending stiffness, mass, and beam-structure mode-shape function of the hypersonic vehicle's material. In high-speed flight, the vehicle's mass decreases with fuel consumption. In addition, under aerodynamic heating of the vehicle's surface, the rising surface temperature reduces the bending stiffness of the material. Therefore, the natural frequencies of the vehicle's vibration modes change. Compensation of the dynamic lever-arm model for this change of natural frequency is described in detail in Section 3.

III. DYNAMIC LEVER-ARM EFFECT COMPENSATION METHOD BASED ON REINFORCEMENT LEARNING
This section introduces the design of a dynamic lever-arm effect compensation method. Reinforcement learning is used to learn the parameters of the dynamic lever-arm model, and the dynamic lever-arm is estimated by filtering to compensate for the lever-arm effect error.

A. PRINCIPLE OF DYNAMIC LEVER-ARM EFFECT COMPENSATION METHOD
According to the improved dynamic lever-arm error model in Section 2, the lever-arm model for hypersonic vehicles is variable and unknown in the actual flight process. Therefore, RL-DLAC is proposed to compensate for the error caused by the dynamic lever-arm effect, so as to obtain higher navigation accuracy. The principle of RL-DLAC is shown in Fig. 5.
$\hat{\omega}_{ib}$ and $\hat{f}^b$ are the gyroscope and accelerometer outputs obtained from the IMU, respectively. After drift correction and dynamic lever-arm effect error compensation, $\omega_{ib}$ and $f^b$ are obtained for the inertial navigation solution. Given the initial state $X_0$ and initial state covariance matrix $P_0$, if a GNSS measurement is available at the current time, the module performs a measurement update to obtain the final error-state estimate $\hat{X}_k$, and the current position estimate $\hat{r}$ is then obtained after inertial navigation correction. Initially, the natural-frequency parameter $\omega_i$ in the lever-arm model is set to a default value $\omega_i^0$ (the initial calibration result or a previous estimate). The later simulation takes the third-order vibration model as an example, with $\omega^0 = [\omega_1^0\ \omega_2^0\ \omega_3^0]^T$. Because GNSS carrier-phase differential positioning is highly accurate but demanding on the environment, the method continuously monitors the GNSS positioning status. It also computes the positioning error between the current integrated-navigation result $\hat{r}$ and $r_G$. If this error exceeds a threshold, the natural frequency of the dynamic lever-arm model has drifted, making the integrated-navigation results inaccurate. The method then collects the IMU and GNSS data closest to the current time and feeds them to the RL module for adaptive learning. Finally, the learned natural frequency is fed back to the dynamic lever-arm effect compensation system for continued filtering and correction.
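The trigger logic described above can be sketched as follows; the threshold value and function names are hypothetical, since the paper does not give them:

```python
import numpy as np

ERROR_THRESHOLD_M = 8.0  # hypothetical trigger level, metres

def needs_relearning(r_hat, r_gnss, threshold=ERROR_THRESHOLD_M):
    """True when the gap between the integrated-navigation fix r_hat and the
    GNSS differential fix r_gnss suggests the natural-frequency estimate has
    drifted and the recent IMU/GNSS window should go to the RL module."""
    return np.linalg.norm(np.asarray(r_hat, dtype=float)
                          - np.asarray(r_gnss, dtype=float)) > threshold
```

In the full system this check would run each filtering cycle, gating the comparatively expensive reinforcement-learning update.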

B. CONSTRUCTION OF REINFORCEMENT LEARNING
The goal of the reinforcement learning module is to learn the natural frequency of the dynamic lever-arm model, which can minimize the position error of the current condition obtained by RL-DLAC. The constituent elements of the reinforcement learning module are shown in Table 1.
The implemented tasks conform to the standard interface of an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(S, A, P, r, \rho_0)$, where $S$ is an infinite set of states, $A$ is a set of actions, $P : S \times A \times S \to \mathbb{R}$ is the transition probability distribution, $r : S \times A \to \mathbb{R}$ is the reward obtained from the environment, and $\rho_0 : S \to \mathbb{R}_{\geq 0}$ is the initial state distribution. These characteristics allow RL-DLAC to be formalized as an MDP in which agent actions cause state transitions that in turn affect the accumulated reward. Capturing the dependencies between actions and their consequences allows machine-learning methods to find good policies over the observed states, from which the learned optimal state can also be obtained.

1) ENVIRONMENT DEFINITION
The most commonly used and representative navigation system in hypersonic vehicles is the INS/GNSS integrated navigation [29]. Therefore, the INS/GNSS integrated navigation system is taken as the environmental element of RL, which will be used to evaluate the current dynamic lever-arm model.
To compensate the lever-arm effect error, the IMU error equation under the influence of the lever-arm effect must be established (installation error and scale-factor error are not considered here):

$$\delta \boldsymbol{\omega}_{ib}^b = \boldsymbol{\varepsilon}_b + \boldsymbol{\varepsilon}_r, \qquad \delta \mathbf{f}^b = \nabla + \nabla_{lever}$$

where $\nabla$ is the accelerometer drift error, and $\boldsymbol{\varepsilon}_b$ and $\boldsymbol{\varepsilon}_r$ are the constant drift and the first-order Markov process of the gyroscope, respectively.
The generalized coordinates of the dynamic lever arm and their rates are appended to the state vector of the integrated navigation system:

$$X = \begin{bmatrix} X_N^T & \boldsymbol{\varepsilon}_b^T & \boldsymbol{\varepsilon}_r^T & \nabla^T & \mathbf{q}^T & \dot{\mathbf{q}}^T \end{bmatrix}^T$$

where $X_N$ contains the nine inertial navigation error states. The state equation and measurement equation of the dynamic lever-arm effect error compensation system are

$$\dot{X}(t) = F(t)X(t) + B(t)U(t) + G(t)W(t)$$
$$Z(t) = H(t)X(t) + V(t)$$
where $F(t)$ is the system matrix, $G(t)$ is the system state-noise coefficient matrix, $W(t)$ is the system noise vector, $H(t)$ is the measurement coefficient matrix, $V(t)$ is the measurement noise vector, and $U(t)$ is the control matrix. In the system matrix $F(t)$, $F_N \in \mathbb{R}^{9 \times 9}$ is the inertial navigation error matrix; the remaining sub-matrices are given in (18)-(23), where $T_g$ and $T_a$ are the correlation times of the gyroscope and accelerometer, respectively.
The error of the dynamic lever-arm effect is mainly reflected in the error equation of speed and position, so the system measurement equation is

$$Z(t) = H(t)X(t) + V(t)$$
RL-DLAC uses a Kalman filter with deterministic control, which can be expressed as

$$\hat{X}_{k|k-1} = \Phi_{k,k-1}\hat{X}_{k-1} + B_{k-1}U_{k-1}$$
$$P_{k|k-1} = \Phi_{k,k-1}P_{k-1}\Phi_{k,k-1}^T + Q_{k-1}$$
$$K_k = P_{k|k-1}H_k^T\left(H_k P_{k|k-1} H_k^T + R_k\right)^{-1}$$
$$\hat{X}_k = \hat{X}_{k|k-1} + K_k\left(Z_k - H_k\hat{X}_{k|k-1}\right)$$
$$P_k = \left(I - K_k H_k\right)P_{k|k-1}$$

where $\hat{X}_k$ is the state estimate at time $k$, $Z_k$ is the measurement at time $k$, $\Phi_{k,k-1}$ is the state transition matrix, $H_k$ is the measurement matrix, $R_k$ is the measurement-noise covariance matrix, $P_k$ is the covariance matrix of the estimated state, and $K_k$ is the filter gain matrix.
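One cycle of this filter can be sketched as follows; the function is a generic textbook Kalman step with a control term, and the one-state usage example at the end is purely illustrative:

```python
import numpy as np

def kf_step(x, P, z, Phi, B, u, H, Q, R):
    """One Kalman-filter cycle with a deterministic control input u."""
    # time update, including the control term B*u
    x_pred = Phi @ x + B @ u
    P_pred = Phi @ P @ Phi.T + Q
    # measurement update
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, K

# tiny one-state illustration with unit measurement z = 1
I = np.eye(1)
x, P, K = kf_step(np.zeros(1), I, z=np.array([1.0]),
                  Phi=I, B=I, u=np.zeros(1), H=I, Q=0.01 * I, R=I)
```

In RL-DLAC the control term carries the deterministic lever-arm forcing $u_l$, while the learned natural frequencies enter through the model matrices.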

2) STATE QUANTITY OF RL
Generally, a set of default calibration values is provided after vehicle production for the corresponding structure type. The initial values of the natural frequencies of elastic deformation are taken as an example, as shown in Table 2. In actual flight, owing to the harsh environment faced by hypersonic vehicles, the natural frequencies change. Therefore, the natural frequency is taken as the state quantity of reinforcement learning.

3) REWARD FUNCTION OF RL
When the environment receives an action that causes a state change, a suitable reward mechanism is needed to evaluate the current action. Since the state quantity of reinforcement learning determines the integrated navigation accuracy, we model the reward as the negative of the positioning error over the $N$ seconds of IMU and GNSS data. The reward is set to the negative of the final localization error, expressed as $r = -\lVert \hat{r} - r_G \rVert$. By maximizing the reward, we obtain the natural frequency of elastic deformation that minimizes the positioning error.
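This reward can be written directly (a sketch; the handling of the N-second window is omitted):

```python
import numpy as np

def reward(r_hat, r_gnss):
    """Negative norm of the positioning error at the end of the data window:
    larger (less negative) reward means a better natural-frequency estimate."""
    return -np.linalg.norm(np.asarray(r_hat, dtype=float)
                           - np.asarray(r_gnss, dtype=float))

# e.g. a 3-4-0 m error vector gives reward -5.0
r = reward([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```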

4) AGENT OF RL
DDPG is an actor-critic, model-free algorithm based on the deterministic policy gradient that operates over continuous action spaces. It includes two neural networks (NNs): a critic NN that evaluates the long-term performance of the designed control at the current time step, and an actor NN that outputs a continuous action for the corresponding state [30]. Moreover, to improve learning efficiency and prevent local optima, an experience replay memory stores historical samples, from which a fixed number of samples is randomly drawn for each training step. In each episode, DDPG executes a certain number of time steps. At each time step, the action $\alpha_i$ is selected by the actor evaluation network of the ACTOR module, based on the cost function that maximizes the reward of taking an accurate action. Although some actor-critic network designs can reduce the computational cost of reinforcement learning [31]-[33], DDPG is still selected as the agent because the actual parameters do not change dramatically within a certain period. DDPG builds an actor network that selects actions through the actor-critic system instead of the traditional greedy algorithm; this approach has been shown to learn good policies for many tasks from low-dimensional observations. The network used in DDPG is a Multi-Layer Perceptron (MLP) built from multiple fully connected (FC) layers. The structures of the actor and critic parts differ because the meanings of their inputs and outputs differ.
The actor evaluation network and the actor target network are three-layer fully connected networks. The input layer has dimension 3 and the hidden layer dimension 30, followed by a ReLU activation function. The output layer size equals the action dimension; applying a tanh activation to the output-layer neurons yields the predicted action. The key parameters of DDPG are listed in Table 3.
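A forward pass with these shapes can be sketched as follows. The layer sizes (3 &rarr; 30 &rarr; action, ReLU then tanh) follow the text, but the weights are random placeholders for the trained actor parameters and the action scale is a hypothetical bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# FC layer weights: input 3 (three natural frequencies), hidden 30, output 3
W1, b1 = 0.1 * rng.standard_normal((30, 3)), np.zeros(30)
W2, b2 = 0.1 * rng.standard_normal((3, 30)), np.zeros(3)
ACTION_SCALE = 5.0   # hypothetical bound on the frequency adjustment (rad/s)

def actor(state):
    h = np.maximum(0.0, W1 @ state + b1)          # FC layer + ReLU
    return ACTION_SCALE * np.tanh(W2 @ h + b2)    # FC layer + tanh, scaled

action = actor(np.array([50.0, 100.0, 160.0]))    # one predicted action
```

The tanh keeps the raw output in (-1, 1), so scaling it maps every action into the admissible adjustment range, which is why DDPG actors conventionally end in tanh.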

IV. SIMULATION ANALYSIS AND DISCUSSION
In this section, we present examples to verify the performance of RL-DLAC in compensating the dynamic lever-arm effect error. Comparing RL-DLAC with the traditional fixed lever-arm compensation method based on the Kalman filter (KF-FLAC) shows that RL-DLAC has clear advantages in compensating the dynamic lever-arm effect.
For the performance evaluation of the proposed RL-DLAC method, two different applications are studied in the simulation analysis, which are as follows.

A. APPLICATION OF FIXED LEVER-ARM EFFECT COMPENSATION
The change in the dynamic lever arm is small when $Q_i$ is small during flight. Moreover, the influence of fuel consumption on the centroid shift is small over a short period, so the change in the vehicle's lever arm can be approximated as zero. The calibration results of RL-DLAC and KF-FLAC for the fixed lever-arm error are compared in Fig. 6, which shows that both methods can calibrate the fixed lever arm, but RL-DLAC achieves higher accuracy.

B. APPLICATION OF DYNAMIC LEVER-ARM EFFECT COMPENSATION
In the actual flight process, the hypersonic vehicle is facing a complex and dynamic environment. As shown in Fig. 7, a dynamic flight trajectory is simulated according to actual hypersonic vehicle flight, which involves various maneuvers such as climbing, pitching, rolling and turning.
The initial position of the hypersonic vehicle was 30.011°N, 120.094°E at an altitude of 30 km. The initial velocities along the three axes of the E-N-U navigation frame were 0 m/s, 1700 m/s, and 0 m/s, respectively. The initial position errors were 5 m, 5 m, and 10 m; the initial velocity errors were 0.2 m/s, 0.2 m/s, and 0.2 m/s; and the initial attitude errors were 1, 1, and 1. The gyro's constant drift and white noise were 0.1°/h and 0.01°/h. The accelerometer's zero bias and white noise were 10⁻³ g and 10⁻⁴ g. The root-mean-square errors (RMSEs) of GNSS position and velocity were 5 m and 0.1 m/s. The sampling periods of INS and GNSS were 0.02 s and 0.1 s. The simulation time was 300 s, and the filtering period of the INS/GNSS integration was 0.1 s. In the highly dynamic environment, the change of the lever arm is shown in Fig. 8.
As discussed in Section II, under aerodynamic heating of the hypersonic vehicle's surface, the rising surface temperature reduces the bending stiffness of the material, so the natural frequencies of the vehicle's vibration modes change. The initial values of the natural frequencies of the vehicle structure were given in Table 2; they change as flight time extends and the flight state varies. To make the simulation realistic, the change of natural frequency is simulated using a set of test data from hypersonic vehicles, as shown in Table 4.
The RL-DLAC is used to identify the dynamic lever-arm, learn the change of its natural frequency, and improve the dynamic lever-arm model in the actual flight process. The natural frequency learning results are shown in Table 5.
It can be found from Table 5 that RL-DLAC has a good natural frequency learning effect, which determines the dynamic lever-arm model. The average error is only 0.31%.
RL-DLAC and KF-FLAC are used to calibrate the dynamic lever arm, as shown in Fig. 9. The figure shows that RL-DLAC yields better calibration results for the dynamic lever arm, while the traditional method is no longer applicable. The reason is that the lever-arm model designed by the traditional method is static and ignores the lever-arm change in the actual flight environment. RL-DLAC considers both the change in the lever arm and the change in the parameters of the lever-arm model. Through reinforcement learning, the parameters of the lever-arm model are determined and a more complete dynamic lever-arm model is established, so the compensation effect is better.
The navigation errors of the two methods for hypersonic vehicles are shown in Fig. 10 and Fig. 11.
Obviously, Figs. 10 and 11 show that navigation accuracy with RL-DLAC compensation of the dynamic lever-arm effect error is improved compared with the traditional KF-FLAC method. This is because the traditional method does not eliminate the lever-arm effect error caused by the dynamic lever arm; moreover, due to error accumulation, the navigation errors of the traditional method diverge further over time.
Compared with the divergence of the traditional method under the dynamic lever-arm effect, the RL-DLAC method proposed in this paper achieves better navigation accuracy. The main reason is that we established a more complete dynamic lever-arm model and learn its parameters through reinforcement learning, ensuring model accuracy during flight and suppressing the divergence of the navigation error.

V. CONCLUSION
In order to improve the navigation accuracy of hypersonic vehicles, a new compensation method for the dynamic lever-arm effect based on reinforcement learning (RL-DLAC), which can adapt to the complex flight environment, is proposed in this paper.
RL-DLAC has two significant features. (A) Focusing on the change of the lever arm in a highly dynamic environment, a more complete error model of the lever-arm effect is established based on elastic deformation theory, so that the lever-arm effect error caused by the changing lever arm can be reduced by compensation. (B) According to the changes of the parameters in the lever-arm effect error model, a reinforcement learning method based on the natural frequency is designed, which realizes adaptive estimation of the parameters of the dynamic lever-arm error model and enhances the robustness of the method. RL-DLAC uses the deep deterministic policy gradient (DDPG) to obtain the best state estimate and natural frequency from continuous actions. The simulation results show that, using the natural frequency of the vehicle structure estimated by the method, high-precision navigation results are obtained through the dynamic lever-arm error model. Compared with traditional methods, it improves the compensation of the dynamic lever-arm effect and suppresses the divergence of the navigation error caused by changes in the lever-arm model.
The application of the intelligent method also brings some limitations, such as the demand for high-performance computing, which is a key challenge for the hardware system. To realize real-time control, a small network structure and a design of the basis functions need to be developed, but selecting initial values that guarantee convergence will then be difficult. DDPG, adopted in this paper, is suitable for practical problems with strong nonlinearity. To reduce the computational cost, this paper has reduced the learned parameters to three; even so, the training inherent in reinforcement learning inevitably makes real-time performance difficult. Fortunately, because the environment influences the model parameters slowly, they do not change drastically within a certain time range, which maintains the high positioning accuracy of the navigation system and offsets the long training time. If the flight time is short in a practical application, data can also be collected first, training performed offline, and the learned parameters then used when flying in the same environment. Future research will focus on improving RL-DLAC; we expect to propose a dynamic lever-arm effect compensation method based on the model-free learning method.

CHAOFAN LIU received the bachelor's degree in detection, guidance, and control from Beihang University, Beijing, China, in 2020, where he is currently pursuing the master's degree in electronic information with the School of Astronautics. His current research interest includes aero-optical effects.
XIANG WEI received the bachelor's degree in detection, guidance, and control from Beihang University, Beijing, China, in 2020. He is currently pursuing the master's degree in electronic information with the School of Astronautics. His current research interest includes dynamic control allocation.