Range-Aware Impact Angle Guidance Law With Deep Reinforcement Meta-Learning

In this article, a new guidance law is proposed for an impact-angle-constrained missile with time-varying velocity against a maneuvering target. The proposed guidance law is based on a model-based deep reinforcement learning (RL) technique in which a deep neural network is trained as the predictive model used in model predictive path integral (MPPI) control. Tube-MPPI, a robust approach that uses an ancillary controller for disturbance rejection, is introduced into the guidance law design to counter the loss of robustness that MPPI suffers when the deep predictive model differs from the actual environment. To further improve performance, meta-learning is used to enable the deep neural dynamics to adapt to environment changes online. With this approach, the model mismatch of the nominal controller is reduced, which improves tube-MPPI performance. Furthermore, a range-aware hyperbolic function is proposed as an adaptive function in the MPPI performance index design; the resulting reduced initial acceleration command and increased terminal velocity benefit guidance performance. Numerical simulations under various conditions demonstrate the effectiveness of the proposed guidance law.


I. INTRODUCTION
Interception at a desired intercept angle helps a missile increase its penetration capability and warhead effectiveness and reduce collateral damage. A modern missile may therefore need to intercept its target not only with a small miss distance but also at a desired intercept angle. Conventional guidance law design methods face elevated difficulty in meeting these new requirements, and deep reinforcement learning is a powerful tool for tackling them.
Interest in applying deep RL to guidance design has been rising, and deep RL has shown great potential there. In contrast to guidance laws designed using traditional control theory [1], deep RL is a data-driven method. Many recent works have used deep RL in guidance law design to enhance performance [2] or to meet requirements that are hard to satisfy with traditional control theory [3], [4]. Many deep RL guidance laws use model-free deep reinforcement learning. However, model-free RL is less sample efficient than model-based methods and thus requires a large number of interactions with the environment. Deep model-based reinforcement learning, which uses a deep neural network as the model, is generally considered data efficient [5] and is welcomed in many real-world control tasks. MPPI is a typical model-based deep reinforcement learning method: it uses a deep neural network as the dynamics model and obtains the optimal control solution to the Hamilton-Jacobi-Bellman (HJB) equation via Monte Carlo sampling of path integrals, thereby solving the optimal control problem, and it is widely used in real-world tasks [6]. Ref. [7] uses MPPI to solve the guidance problem under an impact angle constraint. However, MPPI sometimes suffers from degraded robustness when the deep neural dynamics differ from the real environment. Many works try to robustify MPPI with ensemble models [8], L1 adaptive control [9], and so on. Tube-MPPI has also been proposed; it adds an ancillary controller that keeps the system state in a tube centered at the nominal state computed by MPPI acting as the nominal controller [10]. In this work, the deep neural dynamics mismatch problem in Tube-MPPI is further mitigated by using meta-learning to adapt the deep neural dynamics to environment changes online. Meta-learning gives a deep neural network a learning-to-learn capability and is thus essential for real-world applications of deep reinforcement learning that must adapt to environment changes online. This is critical in guidance problems, since target maneuvers pose a large perturbation to the engagement dynamics. Thus, in this work a meta-learning tube-MPPI method is proposed to tackle the impact angle guidance problem of intercepting a strongly maneuvering target.
An impact-angle-constrained guidance law helps to increase missile capability; however, most existing guidance laws demand high acceleration at the beginning of the flight. The large acceleration command consumes excessive missile energy and can result in control saturation. In [11], a hyperbolic tangent function weighted guidance law is proposed to tackle this problem. However, the value of that function grows exponentially with time, so it penalizes mostly the impact angle and miss distance errors while neglecting acceleration loss. Thus, exploiting the saturation property of the hyperbolic tangent function, a variant is proposed in this article that has an adjustable value during the final phase of guidance. Unlike [11], which uses time as the decision variable, this article uses range-to-go, since it provides more accurate information about the current stage of guidance. The resulting novel range-aware hyperbolic tangent function reduces input saturation in the initial phase of guidance.
The associate editor coordinating the review of this manuscript and approving it for publication was Vivek Kumar Sehgal.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article, we develop a new range-aware meta-learning tube-MPPI guidance law with an impact angle constraint. Compared with existing deep RL guidance laws, the proposed approach is more sample efficient and enforces an impact angle constraint. It also improves the ancillary-controller-robustified MPPI method by reducing model mismatch through meta-learning, and it benefits guidance performance through range-aware adaptive weighting compared with existing error shaping guidance laws. The main contributions of our work are as follows:
1) A meta-learning tube-MPPI control method is proposed. With this approach, tube-MPPI performance is improved because meta-learning model adaptation reduces the model mismatch of the nominal controller.
2) A range-aware hyperbolic function is designed as an adaptive error shaping function in the guidance law performance index design. This method benefits guidance performance through reduced initial acceleration and increased terminal velocity.
3) A new guidance scheme is formulated with the aforementioned techniques for a varying-velocity interceptor intercepting a maneuvering target at a desired terminal impact angle.
This article is organized as follows. Section II reviews existing works on deep RL guidance laws, weighted optimal guidance laws, MPPI and meta-learning. Section III formulates the guidance problem. Section IV details a novel guidance scheme based on model-based RL and meta-learning tube-MPPI. Numerical simulations are conducted in Section V to show the effectiveness of the proposed method. Finally, conclusions are offered in Section VI.

II. RELATED WORK
A. DEEP RL IN GUIDANCE LAW DESIGN
Deep RL has proven successful in many control tasks. With its fast-evolving capability and good performance, a growing number of modern guidance strategies incorporate a deep RL framework to tackle the guidance problem. Both model-free and model-based methods have been used in guidance design. Among model-free methods, RL is used in [2] to design a homing-phase missile guidance law, which gives superior performance compared with a guidance law designed using Lyapunov theory. To tackle challenging environments and unknown, highly variable dynamics, an adaptive guidance law and integrated navigation are proposed in [3] with deep meta-RL. Meta-learning can provide adaptation to unforeseen environment changes through online learning, whereas most traditional adaptive guidance is limited to specific faults [1], [14]. In [4], [12], deep RL is also used to design a novel guidance law that uses only seeker LOS angle and angular rate measurements for mid-course exo-atmospheric interception. In [13], a deep RL based guidance law with a missile attitude loop is proposed using PPO. Our work uses model-based RL techniques, so it has higher sample efficiency than these model-free RL guidance laws, and it also achieves impact-angle-constrained guidance. Model-based RL has also been used in guidance law design. In [7], a novel adaptive intercept angle guidance law with deep meta-RL is proposed for a missile with actuator failures. Our work differs from [7] in its tube-MPPI approach and range-aware hyperbolic functions, which enhance guidance performance.

B. OPTIMAL ERROR SHAPING GUIDANCE LAWS
Over the years, various efforts have been made to design improved optimal guidance laws using different performance indices [15]. A weighted cost function is used to shape the missile trajectory and distribute the acceleration command over the engagement. Time-to-go [16], range-to-go [17] and generalized formulations [18] of the weighted cost are used to alleviate the high initial acceleration and highly curved trajectory problems in impact-angle-constrained guidance laws. Other functions, such as sinusoidal [15] and Gaussian [19] functions, have also been used in designing weighted optimal guidance laws. Ref. [11] employs a hyperbolic tangent function as the weighting in guidance design. However, the value of the hyperbolic function variant in [11] grows exponentially with time. A range-aware hyperbolic tangent function is designed in this work to tackle this problem. Recent work in [20] also uses error shaping to trade off acceleration against the rate of error convergence. Our method is inspired by this tradeoff but differs in its range-aware weighting function, which adapts to the different stages of the engagement and thereby benefits guidance performance.

C. MODEL-BASED RL
Model-based reinforcement learning is welcomed in many real-world control tasks for its high efficiency, since a deep neural network model is learned and then used to solve the control task. MPPI is a typical model-based reinforcement learning method that solves the control problem using the deep system model. MPPI was first used on real hardware for aggressive driving of rally vehicles in [22], and it has been implemented in a wide range of real control tasks, including complex robot manipulation [23] and missile guidance [7]. Many attempts have been made to robustify MPPI. Ref. [8] uses a model ensemble to tackle this problem; however, an ensemble of models normally slows computation and may be too inefficient for real-time systems. In [9], an L1 adaptive control method is combined with MPPI to address this problem and is validated in multirotor racing. The Tube-MPPI of [10] uses a tube-based model predictive framework and robustifies MPPI by adding an ancillary controller that tracks the nominal MPPI controller. Still, a large difference between the deep neural dynamics and the true environment affects the central path and degrades the tracking performance of the ancillary controller. Thus, a meta-learning Tube-MPPI method is proposed in this work to address this problem. With meta-learning, the deep network dynamics can learn changes in the environment online via learning to learn [24]. This is usually done by applying an update rule to the learner [25]. In this work, tube-MPPI performance is improved by reducing model mismatch, with meta-learning constantly adapting the neural dynamics to changes in the environment.

III. PROBLEM FORMULATION
The missile-target engagement dynamics are established for the purpose of guidance law development. Consider a skid-to-turn, roll-stabilized missile. The three-dimensional engagement geometry between missile M and target T in the inertial coordinate frame $O_I X_I Y_I Z_I$ is shown in Fig. 1, where the missile M has velocity $V_M$, with direction defined by $\theta_m$ and $\varphi_m$; the target has velocity $V_T$, with direction defined by $\theta_t$ and $\varphi_t$; the line-of-sight (LOS) angles are denoted by $\theta_L$ and $\varphi_L$; and the relative range is denoted by $R$.
Then the three-dimensional relative kinematics between missile and target can be expressed as follows [27]:
The maneuver dynamics of the target can be expressed as:
where $a_{yt}$ and $a_{zt}$ are the target accelerations. The forces acting on the missile include thrust $T$ and drag $D$, the latter comprising zero-lift drag $D_0$ and induced drag $D_i$. With the missile mass denoted by $m$ and the missile accelerations denoted by $a_{ym}$ and $a_{zm}$, the dynamics of the missile motion can be expressed as follows [28]:
The forces can be expressed as:
where $C_{D0}$, $K$, $A_r$, $e$, $\rho$, $s$ and $Q$ are the zero-lift drag coefficient, induced drag coefficient, aspect ratio, efficiency factor, atmospheric density, reference area and dynamic pressure, respectively.
The objective of the guidance law is to achieve interception of the target by the missile at the desired impact angles $\theta_{LD}$ and $\varphi_{LD}$. The impact angles are defined as the LOS angles, as in [27], [28]. Since nullifying the LOS angular rates $\dot{\theta}_L$ and $\dot{\varphi}_L$ leads to interception, the solution to this problem is to design the missile accelerations to guarantee the following conditions:
$\theta_L \to \theta_{LD}$, $\varphi_L \to \varphi_{LD}$, $\dot{\theta}_L \to 0$, $\dot{\varphi}_L \to 0$.
Thus the problem of designing a guidance law with a desired impact angle reduces to the problem of controlling the LOS angles and angular rates as described above.

IV. DESIGN OF PROPOSED GUIDANCE LAW
In this section, a range-aware impact angle guidance law is proposed using model-based RL and a range-aware hyperbolic tangent function. Model-based RL and meta-learning provide better data efficiency and an online adaptation capability. The meta-learning tube-MPPI approach, which combines online-adapting, sampling-based RL with a disturbance-rejecting ancillary controller, is constructed as the model-based RL approach. A range-aware hyperbolic tangent function is then constructed as the adaptive function used in the performance index design to alleviate the large initial acceleration command and highly curved trajectory problems of impact-angle-constrained guidance laws. The schematic diagram of the proposed guidance scheme with meta-learning tube-MPPI is shown in Fig. 2.

A. META-LEARNING NEURAL NETWORK DYNAMICS MODEL
A deep neural network dynamics model is built as the predictive system dynamics model in model-based RL. Such a neural dynamics model can be learned from observation data of the real system. The neural network dynamics are denoted as $\hat{x}_{t+1} = f_\theta(x_t, u_t)$, where $x_t$ and $u_t$ are the system state and input at time $t$, $\theta$ denotes the weight coefficients of the neural network, and $\hat{x}_{t+1}$ is the predicted system state at time $t+1$. The neural network used is a multi-layer dense network with ReLU activations. This deep neural dynamics model is verified in [7] to have a negligible prediction error, which makes failure of the proposed deep RL controller unlikely.
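As an illustration of the predictive model described above, the following minimal sketch implements the forward pass of a small dense ReLU network that maps a state and input to a predicted next state. The layer sizes, the initialization scale, the state and input dimensions, and all function names are assumptions made for this example only; they are not taken from the article.

```python
import random


def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]


def dense(weights, bias, v):
    """Affine layer: one row of weights per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random layer; the He-style scale is an assumption of this sketch."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_model(state_dim, input_dim, hidden, rng):
    """Two hidden layers followed by a linear output layer."""
    sizes = [state_dim + input_dim, hidden, hidden, state_dim]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict_next_state(theta, x_t, u_t):
    """Evaluate f_theta(x_t, u_t): ReLU on hidden layers, linear output."""
    h = list(x_t) + list(u_t)
    for i, (w, b) in enumerate(theta):
        h = dense(w, b, h)
        if i < len(theta) - 1:
            h = relu(h)
    return h


rng = random.Random(0)
# Hypothetical dimensions: 4 states, 2 inputs, 16 hidden units per layer.
theta = init_model(state_dim=4, input_dim=2, hidden=16, rng=rng)
x_next = predict_next_state(theta, [0.1, -0.2, 0.0, 0.3], [1.0, -1.0])
```

In practice the weights would come from training on normalized data rather than from random initialization; the sketch only shows the structure of the predictor.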
Using meta-learning, the deep neural dynamics can adapt to changes in the environment online, which addresses the changing-environment problem that deep model-based RL faces. The meta-learning approach, adopted from [26], has two phases that make the neural dynamics both optimized on the training dataset and adaptable to the environment online. These two phases, the meta-training step and the online adaptation step, are reviewed in the rest of this section.
In the meta-training step, the optimized deep neural dynamics model parameter $\theta^*$ is trained so that it can be further adapted online. The model is trained on a normalized training dataset by minimizing the mean square error between the prediction and the actual value, with 12 hours of offline training. The data are normalized to help the gradient flow during training. The Adam optimizer [29], a stochastic gradient descent optimization method, is employed to solve this optimization problem.
In the online adaptation phase, the meta-trained model $f_{\theta^*}(x_t, u_t)$ is adapted using the recent experience $\tau_{\varepsilon}(t-M, t-1)$ gained from the environment so that it becomes a more accurate predictor. The adaptation rule is a gradient step that increases the likelihood of the recent experience under the model, i.e., reduces the mean square error between prediction and ground truth:
(14)
where $\alpha$ is the learning rate.
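The online adaptation step can be sketched as a single gradient step on the mean square prediction error over the most recent M transitions, starting from the meta-trained parameters. To stay short, the sketch below replaces the deep network with a linear predictor; the toy environment, the learning rate of 0.1 and all names are assumptions for illustration (the simulations in this article use a learning rate of 0.001 with the deep model).

```python
def predict(W, z):
    """Linear stand-in for f_theta: z is the concatenated [x_t, u_t]."""
    return [sum(w * a for w, a in zip(row, z)) for row in W]


def mse(W, batch):
    """Mean square error between predictions and observed next states."""
    total, count = 0.0, 0
    for z, target in batch:
        pred = predict(W, z)
        total += sum((p - y) ** 2 for p, y in zip(pred, target))
        count += len(target)
    return total / count


def adapt_online(W_star, recent, alpha):
    """One gradient step from the meta-trained weights on recent experience."""
    n_out, n_in = len(W_star), len(W_star[0])
    grad = [[0.0] * n_in for _ in range(n_out)]
    count = len(recent) * n_out
    for z, target in recent:
        pred = predict(W_star, z)
        for i in range(n_out):
            err = 2.0 * (pred[i] - target[i]) / count
            for j in range(n_in):
                grad[i][j] += err * z[j]
    return [[W_star[i][j] - alpha * grad[i][j] for j in range(n_in)]
            for i in range(n_out)]


# Hypothetical mismatch: the environment follows x' = 0.9 x + 0.5 u,
# while the meta-trained model encodes x' = 1.0 x + 0.3 u.
W_star = [[1.0, 0.3]]
recent = [([x, u], [0.9 * x + 0.5 * u])
          for x, u in [(1.0, 0.5), (0.8, -0.2), (0.5, 1.0), (-0.3, 0.7)]]
W_adapted = adapt_online(W_star, recent, alpha=0.1)
```

Under a Gaussian prediction-error assumption with fixed variance, descending the mean square error is the same as ascending the log-likelihood, which is why the two descriptions of the rule coincide.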

B. TUBE-MPPI CONTROLLER
Based on the meta-learning neural dynamics model trained above, a tube-MPPI controller can be built for the guidance problem. Tube-MPPI is a variant of tube-MPC that consists of a nominal controller and an ancillary controller [10]. The nominal controller considers the general costs and generates the nominal state, i.e., the central path, while the ancillary controller keeps the actual system state within a tube centered at the central path. Since this guidance law is mainly concerned with the robustness of tube-MPPI and there are no other state constraints, the bound of this tube is not computed in this work.
Because the nominal system ignores system disturbances, two copies of the nominal controller are run, as in [10], [30]: one from the actual state and the other from the nominal state. The mechanism accepts the MPPI solution computed from the actual state if its cost is lower than the cost of the solution computed from the nominal state plus some threshold. In this way, disturbances that are not catastrophic are fed back to the nominal controller for replanning.
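The acceptance mechanism above reduces to a single comparison. The following minimal sketch shows it; the function name, the plan representation as a list of controls and the numeric values are hypothetical.

```python
def select_nominal_plan(plan_from_actual, cost_from_actual,
                        plan_from_nominal, cost_from_nominal, threshold):
    """Pick which MPPI solution the nominal controller continues with.

    The solution planned from the actual state is accepted when its cost is
    lower than the cost of the solution planned from the nominal state plus
    a threshold; otherwise the nominal plan is kept and the ancillary
    controller is left to reject the disturbance.
    """
    if cost_from_actual < cost_from_nominal + threshold:
        return plan_from_actual, "actual"
    return plan_from_nominal, "nominal"


# Mild disturbance: the actual-state plan costs slightly more, within threshold.
plan, source = select_nominal_plan([0.2, 0.1], 10.5, [0.0, 0.0], 10.0, threshold=1.0)
```

A larger threshold makes the nominal controller replan from the actual state more readily, while a smaller one leaves more of the disturbance to the ancillary controller.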

1) NOMINAL CONTROLLER
The MPPI controller, a sampling-based optimal control method that solves the stochastic HJB equation [22], is used as the nominal controller and is given in Alg. 1.
Algorithm 1 (excerpt): MPPI weighting, control update and shift steps
    for n = 0, ..., N − 1 do
        $w^n_t \leftarrow \frac{1}{\eta} \exp\left(-\frac{1}{\lambda} S(E^n_t)\right)$
    end for
    for t = 0, ..., T − 1 do
        $u_t \mathrel{+}= \sum_{n=1}^{N} w^n_t \, \delta u^n_t$
    end for
    X ← Simulate($x_0$, U)
    PublishSolution(X, U)
    for t = 0, ..., T − 1 do
        $u_{t-1} \leftarrow u_t$
    end for
In MPPI, we consider the optimal control problem of finding a control sequence that minimizes the cost function:
where $c$ and $\varphi$ are the positive definite running and terminal cost functions, respectively, and $P$ denotes the distribution corresponding to the dynamics $F(x, u + \delta u)$, with $\delta u$ a Gaussian noise vector. The noise is essential for the sampling method, which originates from stochastic optimal control, and also serves as a means of exploration. Denote by $V$ the perturbed input to the system, by $h$ the input sequence of the uncontrolled system and by $p$ the open-loop control sequence; the free energy of the dynamic system is then defined as follows:
where $\lambda$ is a positive scalar. According to [31], the cost of the optimal control problem is bounded from below by this free energy. Further derivation using Jensen's inequality gives the optimal distribution of the control sequence:
where $\eta$ is the normalizing factor. The optimal control solution is then obtained by minimizing the gap measured by the Kullback-Leibler (KL) divergence:
After expanding the KL divergence and analyzing the concave result, the optimal control sequence can be derived as follows.
Since the optimal distribution $Q^*$ cannot be sampled directly, importance sampling is used to obtain the sequence:
where the importance sampling weight $w(V)$ is:
The temperature coefficient $\lambda$ of this softmax distribution is designed as follows to normalize the cost function distribution, as in [7]:
Then the MPPI control sequence is updated using $N$ samples with the iterative update law:
Remark 1: The MPPI framework can be viewed as a stochastic optimal control (SOC) approach. Inspired by [32]-[34], its stability is discussed as follows. Denote the corresponding continuous dynamical system, i.e., the guidance dynamics, as:
where $B$ defines the covariance of the system and $w$ is a Brownian disturbance. Denoting the value function by $V$, the continuous-time value function is:
and the stochastic HJB equation is given as:
with the boundary condition $V(x) = \varphi(x)$ and the optimal control expressed as:
According to [21], this control can be computed by applying the Feynman-Kac formula to the transformed Chapman-Kolmogorov equation, and (26) can also be expressed as:
Because importance sampling is an unbiased estimate [37] and the feed-forward neural network used in our work satisfies the universal approximation theorem, the noise adopted from the MPPI control method is zero mean. If the variance of the noise profile is denoted by $\Sigma$, the noise enters the system through the control channel with $B = G\sqrt{\Sigma}$. Thus, after discretization with $\Delta t$ set as the unit time, the control command of (27) can be derived as:
in which the noise term is a Gaussian noise vector. This is equivalent to the solution of the path integral approach in (20), and the resulting control command is the solution to the stochastic HJB equation, which can also be expressed as in (26), as also shown in [38].
If the value function $V$ is chosen as the stochastic control Lyapunov function (SCLF) [35], then according to the proof of Lemma 3.14 in A.1.4 of [39], $V$ is positive definite in the Lyapunov sense, such that $V(0, t) = 0$ and $V(e, t) \ge \mu(|e|)$ for all $t > 0$, with $\mu \in \mathcal{K}$. Then, recalling (25) and (26):
Since, by definition, $c$ is positive definite and $R$ is a positive definite matrix, the value function is a strict SCLF with $\mathcal{L}V(x, t) \le 0$. According to Theorem 5.3 in [36], the corresponding system (24) is stable in probability and the MPPI controller is a stabilizing controller.
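The importance-weighted update of the nominal controller can be sketched on a toy problem as follows. This is a simplified illustration, not the article's implementation: it uses a known scalar system in place of the neural dynamics, omits the control-cost coupling term that appears in the full information-theoretic derivation, and subtracts the minimum cost before exponentiating for numerical stability. The toy dynamics, costs, sample count and all names are assumptions.

```python
import math
import random


def rollout_cost(dynamics, running_cost, terminal_cost, x0, controls):
    """Accumulate the trajectory cost of one control sequence."""
    x, total = x0, 0.0
    for u in controls:
        x = dynamics(x, u)
        total += running_cost(x, u)
    return total + terminal_cost(x)


def mppi_update(dynamics, running_cost, terminal_cost, x0, U,
                n_samples, sigma, lam, rng):
    """One importance-weighted update of the control sequence U."""
    noises, costs = [], []
    for _ in range(n_samples):
        du = [rng.gauss(0.0, sigma) for _ in U]  # Gaussian perturbation
        noises.append(du)
        perturbed = [u + d for u, d in zip(U, du)]
        costs.append(rollout_cost(dynamics, running_cost, terminal_cost,
                                  x0, perturbed))
    beta = min(costs)  # shift costs so the largest weight is exp(0)
    weights = [math.exp(-(c - beta) / lam) for c in costs]
    eta = sum(weights)  # normalizing factor
    weights = [w / eta for w in weights]
    return [u + sum(w * du[t] for w, du in zip(weights, noises))
            for t, u in enumerate(U)]


def toy_dynamics(x, u):
    return x + 0.1 * u


def toy_running_cost(x, u):
    return x * x


def toy_terminal_cost(x):
    return 10.0 * x * x


rng = random.Random(0)
U0 = [0.0] * 10
U1 = mppi_update(toy_dynamics, toy_running_cost, toy_terminal_cost, 1.0, U0,
                 n_samples=200, sigma=1.0, lam=1.0, rng=rng)
cost_before = rollout_cost(toy_dynamics, toy_running_cost, toy_terminal_cost, 1.0, U0)
cost_after = rollout_cost(toy_dynamics, toy_running_cost, toy_terminal_cost, 1.0, U1)
```

In a receding-horizon loop, the first control of the updated sequence would be applied and the remaining controls shifted forward by one step before the next update, as in the shift step of Alg. 1.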

2) ANCILLARY CONTROLLER
The ancillary controller acts as a tracking controller that keeps the actual system state in the tube centered at the central path computed by the nominal controller. This is a standard tracking problem with small initial error and quadratic cost. Among the many existing solutions, a nonlinear MPC using iLQG is selected as the ancillary controller, as in [10], [30]. With finite-time state convergence shown in [41], this widely used control method provides relatively good performance.

C. RANGE-AWARE HYPERBOLIC TANGENT FUNCTION
The hyperbolic tangent function can be expressed as follows:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
and its graph is drawn in Fig. 3. Compared with time, range-to-go provides more accurate information about the current stage of guidance, so the relative range is used as the decision variable of the adaptive function. An adjustable saturated value at the end of guidance is also desired, which makes the saturation property of the hyperbolic tangent function necessary. Based on the above analysis, a variant of the hyperbolic tangent function is designed as:
where $K_{RA}$, $\phi$, $\sigma$ are positive coefficients. If they are selected as 1, 1 and 300, respectively, the function can be plotted. The plot shows that the function value is low when the interceptor is far from the target. The value then increases faster and faster as the range closes and saturates at $K_{RA}$ at the end. The saturated value, initial value and rate of increase can be adjusted using $K_{RA}$, $\phi$, $\sigma$. Thus, using this function, we can adjust the tradeoff between the acceleration command and the error approaching rate, and thereby achieve error shaping in the objective function.
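To make the qualitative behavior described above concrete, the sketch below evaluates one hypothetical range-aware weighting with the same properties: a low value at long range, a rise as the range closes, and saturation at a chosen level at the end. The specific form $K_{RA}\tanh((\sigma/R)^{\phi})$ is an assumption chosen for illustration only and should not be read as the function designed in this article; the parameter names mirror the coefficients in the text.

```python
import math


def range_aware_weight(range_to_go, k_ra=1.0, phi=1.0, sigma=300.0):
    """Hypothetical range-aware weight: low far away, saturating at k_ra.

    Assumed form (illustrative only): k_ra * tanh((sigma / R) ** phi).
    """
    if range_to_go <= 0.0:
        return k_ra  # saturated value at interception
    return k_ra * math.tanh((sigma / range_to_go) ** phi)


# Weight evaluated as the range-to-go (in meters) closes.
ranges = (10000.0, 3000.0, 1000.0, 300.0, 30.0)
weights = [range_aware_weight(r) for r in ranges]
```

With a weight of this shape multiplying the error terms of the cost, errors are penalized lightly early in the flight, which relaxes the initial acceleration demand, and heavily near interception, where accuracy matters most.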
As in [7], based on (13), the reference planning algorithm for the derivative of the LOS angular rates is defined as:
where $K_1 > 0$, $K_2 > 0$, and the planned reference is:
In this way, the adjustable approaching rate makes the error variables change more smoothly. The convergence of the error variable to zero can be proved by selecting the Lyapunov function $V = \frac{1}{2} e_1^2$; then:
Thus the error variable converges to zero and the control objective is satisfied.
As noted, $\dot{e}_1$ is the derivative of the LOS angular rate, which is proportional to the acceleration command. Thus the tradeoff between the error approaching rate and the acceleration command can be adjusted using the proposed range-aware hyperbolic function. The state-dependent cost function of MPPI is then selected as:

V. NUMERICAL SIMULATION
In this section, numerical simulations of the proposed meta-learning tube-MPPI guidance law and two other guidance laws are conducted for comparison. The two guidance laws used for comparison are the meta-MPPI guidance law from [7] and a guidance law constructed using the tube-MPPI of [10]. Monte Carlo simulation is also used to verify the robustness and effectiveness of the proposed guidance law. The interceptor has an acceleration limit of $A_{M\text{-}max} = 200$ m/s² in both directions and intercepts a strongly maneuvering target with:
The initial conditions are listed in Table 1, where Unif denotes a uniform distribution. In the simulation, a realistic interceptor velocity model from [28] is used, where the values of the zero-lift drag coefficient $C_{D0}$, induced drag coefficient $K$, aspect ratio $A_r$, efficiency factor $e$, atmospheric density $\rho$ and reference area $s$ can be found. The interceptor thrust and mass are considered as:
The MPPI controller used as the nominal controller has a horizon of 3 and a control cycle of 5 ms, draws 1000 trajectories, and has its temperature coefficient $\lambda^*$ set to 1. The ancillary controller has a control cycle of 2 ms. The meta-learning neural network dynamics model has two hidden layers with 512 neurons and ReLU activations, and is trained using twenty minutes of data. The online learning rate $\alpha$ for meta-learning is set to 0.001. The integration step size of the environment simulation is adaptive and less than 0.01 ms.
Table 2 shows the simulation results of case 1, where $\theta_{LT}$ and $\varphi_{LT}$ are the terminal LOS angles. The miss distance of the proposed guidance law is smaller than that of the other guidance laws, which indicates better LOS angular rate tracking performance at the end of guidance. The error in the terminal impact angle is also smaller with the proposed guidance law. With the range-aware hyperbolic function, the proposed guidance law has a smaller cumulative control effort, which results in quicker impact and increased terminal velocity. Thus the results demonstrate that the proposed guidance law performs better than the meta-MPPI guidance law and that the proposed meta-learning tube-MPPI method performs better than the tube-MPPI of [10]. In case 1, the scenario setting is as listed in Table 1. The proposed guidance law, the meta-MPPI guidance law and the tube-MPPI guidance law are compared. As the simulation results in Table 2 show, the proposed guidance law achieves a better outcome than the other guidance laws in miss distance, terminal angle error and impact time. Fig. 4 shows the trajectories of the interceptor and target under these guidance laws. All guidance laws drive the interceptor to interception, but the proposed guidance law achieves quicker impact by shaping the trajectory. The LOS angles and angular rates during the interception in this case are shown in Fig. 6.
The LOS angles of the meta-MPPI guidance law diverge at the end of the interception because of fluctuations in the LOS angular rates, which shows that this guidance law has difficulty tracking the desired LOS angles and angular rates. The proposed method achieves better tracking performance with the ancillary controller and tube method. The LOS angles and angular rates of the tube-MPPI guidance law also fluctuate, since the nominal controller diverges greatly under the strong target maneuver disturbance; the proposed method has better tracking ability because meta-learning adapts the model online to environment changes. Better LOS angle and angular rate tracking performance also results in better terminal angle and miss distance, as shown in Table 2. Fig. 7 shows the acceleration profiles of the guidance laws during the engagement; the acceleration command of the meta-MPPI guidance law chatters in the terminal phase. From Figs. 6-8, the proposed guidance law consumes less energy and attains a larger terminal velocity than the meta-MPPI guidance law, owing to the range-aware hyperbolic tangent function used as the weighting function in the guidance law design. Thus the proposed guidance law achieves satisfactory performance with the range-aware hyperbolic tangent function, the tube-MPPI method and meta-learning.
A simulation using a different engagement scenario is conducted in case 2 to further demonstrate the comparative performance of the proposed guidance law. The simulation results are listed in Table 3. The results show that the proposed guidance law achieves better performance than the comparative guidance laws. The tube-MPPI method has a worse result than in case 1. This is partly because, in case 1, the LOS angles happen to fluctuate near the desired LOS angles, as can be seen in Fig. 7. The range-aware hyperbolic tangent function results in a reduced cumulative control effort, which in turn gives a reduced impact time and increased terminal velocity compared with the meta-MPPI guidance law.
The trajectories of the interceptors and target, the LOS angles and angular rates, the acceleration profiles and the interceptor velocities are shown in Figs. 9-12. These figures show that the guidance laws give results similar to those in case 1. From Fig. 11, the control command of the meta-MPPI guidance law fluctuates at the end, causing the LOS angular rate to diverge. The control commands of the guidance laws jump at 3 s, which is caused by the perturbation that thrust burnout introduces into the system. Fig. 11 also shows that the tube-MPPI control command sometimes deviates because of its tracking performance, which causes a slight difference in cumulative control command between the proposed control approach and the tube-MPPI control method. With the range-aware adaptive function, the control command in the initial phase of the interception is reduced, as shown in Fig. 11, where the meta-MPPI command saturates. The reduced initial command results in increased terminal velocity and quicker impact, as in Fig. 12. Thus the proposed guidance law achieves better LOS angle and angular rate tracking performance and consumes less energy, which results in better miss distance, terminal angle error, impact time and terminal velocity.
As the Monte Carlo method is powerful for analyzing effectiveness and robustness, 5000 rollouts are conducted to further verify the performance of the proposed method. The initial conditions are listed in Table 1; the initial LOS angles and missile heading are drawn from a uniform distribution spanning 0.6 rad to cover the operating range of initial conditions. The rate of change of the target acceleration is reduced to give a clearer picture of the results, and the target acceleration has a maximum value of 60 m/s². The Gaussian measurement noise of the LOS angles has zero mean and a standard deviation of 8 mrad, and the standard deviation of the Gaussian noise on the LOS angular rates is set to one percent of the current measurement value. The results are shown in Figs. 13-16. In Fig. 13, all trajectories of the interceptors and targets are translated so that the interception point lies at the origin, which makes them easier to see. The trajectories show that all rollouts achieve a successful hit. The histograms of miss distance, terminal impact angle and impact time are shown in Fig. 14, and the majority of rollouts have a small error. The deviation of the mean is due to the consistent target maneuver. As the LOS angles and angular rates of the interceptors in Fig. 15 show, all LOS angles and angular rates converge to the desired values. Fig. 16 shows the different terminal missile velocities caused by the different engagement conditions.
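The measurement noise model used in the Monte Carlo runs can be sketched as follows: zero-mean Gaussian noise with an 8 mrad standard deviation on the LOS angle, and Gaussian noise on the LOS angular rate whose standard deviation is one percent of the current value. The function name, the nominal angle and rate values and the use of the true rate to scale the noise are assumptions of this sketch.

```python
import random
import statistics


def noisy_los_measurement(los_angle, los_rate, rng,
                          angle_std=0.008, rate_std_fraction=0.01):
    """Corrupt LOS angle (rad) and angular rate (rad/s) with Gaussian noise."""
    angle_meas = los_angle + rng.gauss(0.0, angle_std)
    # Rate noise scales with the magnitude of the current rate value.
    rate_meas = los_rate + rng.gauss(0.0, rate_std_fraction * abs(los_rate))
    return angle_meas, rate_meas


rng = random.Random(0)
# 5000 draws, matching the number of rollouts; nominal values are hypothetical.
samples = [noisy_los_measurement(0.5, 0.02, rng) for _ in range(5000)]
angle_errors = [angle - 0.5 for angle, _ in samples]
angle_error_std = statistics.pstdev(angle_errors)
```

Because the rate noise is proportional to the rate itself, it vanishes as the LOS angular rate is driven to zero, whereas the angle noise stays at the same level throughout the engagement.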

VI. CONCLUSION
In this article, we present a new range-aware impact angle guidance law using a model-based RL technique for a varying-velocity interceptor intercepting a maneuvering target at a desired impact angle. A model-based deep RL method is used in the guidance law design, and a deep neural dynamics model capable of adapting online to environment changes via meta-learning is used as the predictive model. The predictive model is then used in MPPI to solve the optimal control problem via importance sampling of path integrals and to compute the nominal state and control as the central path. An ancillary controller tracks the central path to keep the system states in the tube. The benefit of combining meta-learning and the tube-MPPI method is that the model mismatch of the nominal controller is reduced, which improves overall control performance. This benefit is verified in simulation, which shows that the proposed approach achieves better tracking performance than the tube-MPPI method. Numerical simulation also indicates that the proposed method reduces acceleration in the initial phase and increases terminal velocity. Compared with meta-MPPI guidance, whose acceleration command and LOS angular rate chatter at the end, the proposed guidance law shows more robust disturbance rejection. Monte Carlo simulation results verify the effectiveness and robustness of the proposed guidance law under operating conditions.