A Deep Deterministic Policy Gradient Learning Approach to Missile Autopilot Design

In this paper a Deep Reinforcement Learning algorithm, known as Deep Deterministic Policy Gradient (DDPG), is applied to the problem of designing a missile lateral acceleration control system. To this aim, the autopilot control problem is recast in the Reinforcement Learning framework, where the environment consists of a 2-Degrees-of-Freedom nonlinear model of the missile's longitudinal dynamics, while the agent training procedure is carried out on a linearized version of the model. In particular, we show how to account not only for the stabilization of the longitudinal dynamics, but also for the main performance indexes (settling time, undershoot, steady-state error, etc.) in the DDPG reward function. The effectiveness of the proposed DDPG-based missile autopilot is assessed through extensive numerical simulations, carried out on both the linearized and the fully nonlinear dynamics by considering different flight conditions and uncertainty in the aerodynamic coefficients, and its performance is compared against two model-based control strategies in order to check the capability of the proposed data-driven approach to achieve a prescribed closed-loop response in a completely model-free fashion.


I. INTRODUCTION
An autopilot system for a modern missile must be able to stabilize the missile rotational dynamics and to effectively track the sequence of acceleration commands provided by the navigation and guidance system to follow the desired trajectory. Generally, to achieve these objectives, missile autopilots are designed by exploiting classical model-based approaches, mainly relying on linearization of the nonlinear dynamics and gain scheduling control (see for example [1] and the references therein). However, since the closed-loop performance might be significantly deteriorated by the presence of highly nonlinear terms in the plant dynamics [2], several nonlinear control strategies have been proposed to tackle this issue, ranging from sliding mode approaches [3] to backstepping [4], to nonlinear model predictive control [5] and H∞ techniques [6], [7]. All these solutions are model-based and require accurate knowledge of the plant dynamics. However, this may be a restrictive assumption, as in practice unavoidable uncertainties arise due to unmodeled dynamics, time-varying parameters, or unpredictable environmental effects. When modeling becomes difficult or inaccurate, a data-driven approach to control design might prove advantageous. To this aim, a field of machine learning known as Reinforcement Learning (RL) [8], [9] has recently attracted wide research interest, thanks to its ability to learn an optimal control policy by exploring an unknown environment with the objective of maximizing a numerical reward signal, without any precise description of the plant.
RL algorithms proposed in the literature can be classified according to two different paradigms, namely model-based and model-free, depending on the assumed knowledge of the environment model [10]. Although the approaches that belong to the first class, i.e. model-based RL, have been extensively investigated in real applications (see for example [11] and the references therein), they are generally designed under the restrictive assumption that model information is available to the agent. Therefore, the performance of these approaches relies heavily on the accuracy of the model [12]. These considerations suggest using model-free approaches when such information is not available for the training phase. Indeed, in contrast to model-based methods, model-free RL requires more interactions with the external environment and relies mainly on the observed environment changes and feedback, without the need to deeply understand its inner workings [11]. Therefore, these methodologies do not require an estimate of the Markov Decision Process (MDP) model, and the value or policy function can be evaluated directly by sampling in order to approximate the task solution [10]. Although these features can limit the applicability of such strategies in some real applications, they can be used in all cases where there is no a priori information useful for the training phase and, therefore, can be exploited to address the more challenging case of a completely unknown environment. Moreover, recent developments and remarkable achievements in the fields of image processing [13], face recognition [14] and natural language processing [15] have suggested integrating Deep Learning into the RL framework, leading to the concept of Deep Reinforcement Learning (DRL), which leverages the ability of Deep Neural Networks to serve as universal function approximators to achieve improved control performance [16], [17]. Thanks to DRL, it is possible to deploy RL-based control systems in all those applications where continuous or high-dimensional state and action spaces make traditional RL strategies, such as Q-learning, impractical or insufficient. In particular, Deep Deterministic Policy Gradient (DDPG, [17]) is currently one of the most common approaches in this field of research. RL and DRL have been successfully applied to various control engineering problems, ranging from autonomous vehicles [18]–[21], to energy and electrical systems [22]–[25], robotics [26], [27], IoT security [28], [29] and maritime applications [30], [31]. Surprisingly, despite their significant potential, only a few recent works propose the use of RL techniques as a control strategy for air-vehicle problems; the most representative are perhaps [32] and [33], where the authors exploit a DDPG approach to design the inner-loop attitude controller of a quadrotor and the autopilot of an Unmanned Combat Aerial Vehicle, respectively, and, more recently, [34], where an RL-based missile path-planning algorithm is proposed for head-on interception. In addition, in [12] a DDPG approach is exploited to tune the control gains of a typical fixed-structure three-loop autopilot [7], with the aim of optimizing the missile autopilot performance.
In this perspective, the objective of this work is to investigate the possibility of successfully exploiting high-performance learning tools for the design of a data-driven missile autopilot in a model-free fashion. To this aim, a policy gradient model-free RL approach, specifically the DDPG strategy, is adopted to stabilize the longitudinal dynamics of a missile and to satisfy some performance requirements through the choice of a suitable reward function. DDPG is a relatively simple Policy Gradient (PG) actor-critic algorithm based on the use of deep neural networks, which has been chosen for the purposes of this work due to its sample efficiency and the small number of hyperparameters involved, which makes the tuning procedure more straightforward when compared to more sophisticated RL techniques. Indeed, deep RL algorithms usually have a rather large number of free parameters (the structure of the neural networks, the learning rates, the soft update policy in the case of twin neural networks, as in TD3, and so on) whose effect on the final result is not always obvious or immediately interpretable. In recent years, several DRL algorithms have been proposed in the literature, some of which can improve the characteristics of the agent's training with respect to the DDPG algorithm exploited here; indeed, DDPG is sometimes prone to training instability issues (mostly because it does not implement any explicit bound on the gradient ascent step-size).
For the sake of completeness and to better motivate our work, although our focus is on DDPG, in the following discussion we give the reader an overview of other comparable DRL methods, while further details of the DDPG algorithm are given in Section IV.
In general, PG RL algorithms aim at exploiting some form of gradient ascent to optimize the policy so as to maximize a given objective function, based on the reward obtained at each time step. However, the gradient method does not prescribe a way to choose a safe step-size in the optimization procedure. For this reason, the Trust Region Policy Optimization (TRPO) algorithm was proposed in [35], which limits the Kullback-Leibler divergence between the old and updated policies in order to bound the amplitude of the gradient steps. Proximal Policy Optimization (PPO) [36] is a revised version of TRPO, which exploits a clipping mechanism in order to obtain a Trust Region-like optimization algorithm compatible with classical Stochastic Gradient Descent. It is worth remarking that both PPO and TRPO implicitly call for stochastic policies.
On the other hand, RL research has moved along a parallel path in order to increase the sample efficiency of the training algorithms for agents that employ neural networks (especially in the actor-critic framework). The simplest algorithm belonging to this class of techniques is DDPG, which contains ideas that stem from the Deep Q-Network algorithm, but is naturally suited for continuous action spaces and exploits a replay buffer. In some implementations target networks are also used to improve the algorithm's stability.
Modifications to DDPG have been proposed in the technical literature to improve some aspects of the agent's training procedure; in [37] the TD3 algorithm was proposed, which introduces mechanisms to avoid overestimation and reduce variance, providing better stability properties in some application cases, while a maximum entropy version of DDPG/TD3, named Soft Actor-Critic (SAC), has been introduced in [38].
In this view, our final choice fell on the DDPG algorithm, which tends to be more sample efficient than PPO on one hand, and to have fewer tuning parameters than more sophisticated techniques such as TD3 or SAC on the other. Thus, in this paper, we first recast the missile autopilot design problem into the RL framework, with the primary aim of testing this approach in terms of control performance (settling time, undershoot, etc.), and then we compare the fully data-driven DDPG controller against classical model-based control strategies (such as H∞ [6] and Model Reference Adaptive Control [39]).
Most notably, despite the nonlinearities that affect the process under examination, it was found that, when applying the DDPG approach to the autopilot problem, a deep knowledge of the plant model is not required: a linear model can be effectively used during the training procedure to reduce the computational burden, without degrading the performance of the real closed-loop system, at least close to the considered equilibrium point. Along this line, the agent obtained through the proposed method is then validated on the 2-Degrees-of-Freedom (2-DoF) fully nonlinear model.
The analysis further shows how a careful study and definition of the reward function makes it possible to shape the transient behavior, for example by reducing the undershoot phenomenon. In addition, comparison results in a realistic flight scenario confirm that the excellent capabilities of the proposed RL approach in capturing the underlying unknown nonlinear behavior allow it to provide satisfactory closed-loop performance, comparable to that of state-of-the-art model-based techniques, without the need to run a detailed model of the process in real time or to have a detailed a priori knowledge of the nonlinear dynamics. In addition, simulations at different Mach numbers and with random variations in the aerodynamic coefficients, employing a Monte Carlo approach, are performed in order to provide some meaningful insight into the robustness of the closed loop.
It is finally worth noting that the need for pioneering solutions responding to unmet challenges, as well as to the new opportunities arising from the application of AI techniques to this research field, is confirmed by the autopilot system very recently designed in [40], which leverages a modified TRPO agent trained on a detailed nonlinear model of the plant dynamics. In particular, such a system exploits a transformed acceleration signal as the controlled variable to overcome the inherent non-minimum phase characteristics of the missile dynamics. This approach does not allow the authors to take into account, during the training of the RL agent, the typical undershoot that characterizes the transient response of a missile to a step request in the acceleration. As opposed to [40], the present work instead investigates the capability of a purely data-driven missile autopilot by explicitly considering the main performance indexes (settling time, undershoot, steady-state error, etc.) in the DDPG reward function.
The rest of the paper is organized as follows. Sections II and III describe the control requirements and the missile nonlinear 2-DoF model, respectively, while in Section IV a brief introduction to the DDPG algorithm is provided. The details of the proposed RL approach, in terms of agent structure, reward function engineering and training procedure, are described in Section V, while simulation results are discussed in Section VI, where the performance of the proposed RL agent is compared to that of a self-scheduled H∞ autopilot [6] and of the Adaptive Augmenting Controller presented in [39]. Finally, some concluding remarks are given in Section VII.

II. PROBLEM STATEMENT
This section defines the control requirements that will be taken into account in the design of the proposed autopilot based on a RL control approach.
During the flight, the longitudinal dynamics of a missile can be unstable, depending on the relative location of the center of pressure (i.e. the point where the lifting force is considered to act) and the center of mass, as shown in Fig. 1. In order to stabilize and control the longitudinal dynamics of the missile, a tail fin is introduced. It follows that the controller must generate the tail deflection required to produce the desired normal acceleration, while stabilizing the airframe rotational motion. Moreover, the transient response of the missile to a step request in the normal acceleration is characterized by an initial undershoot, which is reflected by the fact that the associated linearized model is a non-minimum phase one [39]. Ideally, this undershoot should be kept as small as possible; however, as will be shown in what follows, this results in a slower response, hence a trade-off between the bandwidth of the closed-loop system and the maximum undershoot must be sought.
Based on the previous observations, the following qualitative requirements are considered in Section V-B to design the reward function of the proposed DDPG approach, in order to ensure performance similar to that of other solutions available in the literature ([6], [39], [41]); a sketch of how the step-response metrics in requirement 3 can be computed is given after this list:
1) the control system shall ensure the stability of the closed-loop system over the largest possible operating range, defined in terms of the angle of attack α(t) and the Mach number M; a wider range in terms of α(t) is preferable, since typical applications foresee the scheduling of different controllers as a function of M (see [41] as an example);
2) the control system shall take into account the maximum deflection that can be applied to the tail;
3) in tracking a step command in the normal acceleration, the control system shall minimize the following quantities:
a) the rise time at 90% of the final value;
b) the overshoot;
c) the undershoot;
d) the steady-state error.
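As an illustration of requirement 3, the short Python sketch below computes the four step-response metrics from a sampled acceleration response; the synthetic response used in the example and the sign handling are illustrative assumptions, not data from the paper.

```python
import numpy as np

def step_metrics(t, eta, eta_ref):
    """Rise time (at 90% of the final value), overshoot, undershoot and
    steady-state error of a step response eta(t) tracking the constant reference eta_ref."""
    sign = np.sign(eta_ref)
    final = eta[-1]
    reached = eta * sign >= 0.9 * abs(eta_ref)        # samples past 90% of the commanded value
    rise_time = t[np.argmax(reached)] if reached.any() else np.inf
    overshoot = max(0.0, (np.max(eta * sign) - abs(eta_ref)) / abs(eta_ref))
    undershoot = max(0.0, -np.min(eta * sign) / abs(eta_ref))   # initial wrong-way excursion
    ss_error = abs(eta_ref - final) / abs(eta_ref)
    return rise_time, overshoot, undershoot, ss_error

# synthetic non-minimum-phase-like response, for illustration only
t = np.linspace(0.0, 1.5, 151)
eta = 1.0 - 1.6 * np.exp(-6.0 * t) + 0.6 * np.exp(-25.0 * t)
print(step_metrics(t, eta, eta_ref=1.0))
```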

III. LONGITUDINAL MISSILE DYNAMIC MODEL
In order to simulate the missile dynamics and to prove the effectiveness of the proposed autopilot system, the following simplified 2-DoF nonlinear model proposed in the literature [39], [41] is considered, which is capable of describing the longitudinal dynamics of a tail-controlled missile (see Fig. 1) under the following assumption.
Assumption 1 (Fully Decoupled Dynamics): It is assumed that the pitch, yaw and roll channels are decoupled, so coupling phenomena are ignored.
Given Assumption 1, the longitudinal dynamics of the missile are described by model (1). Equations (1c) and (1d) define a second-order linear model of the actuator that links the tail fin deflection command δ_c(t) to the actual deflection δ(t), where ζ is the actuator damping ratio and ω_a is its natural frequency. Part of the system's nonlinearity lies in the definition of the aerodynamic coefficients for the normal force and the pitching moment, C_n and C_m respectively, which are nonlinear functions of the angle of attack, the tail deflection and the Mach number (time dependency is dropped to simplify the notation). Table 1 shows the values of the model parameters.
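As a hedged sketch of how model (1) can be implemented for simulation, the following Python code reproduces the typical structure of the tail-controlled 2-DoF longitudinal benchmark of [39], [41] (rigid-body pitch dynamics plus the second-order actuator (1c)-(1d)); the gains K_ALPHA, K_Q, K_Z, the actuator parameters and the polynomial forms of C_n and C_m used below are placeholders, not the values of Table 1.

```python
import numpy as np

# Placeholder gains and actuator parameters (Table 1 in the paper gives the real values)
K_ALPHA, K_Q, K_Z = 1.0, 1.0, 1.0     # assumed aerodynamic gains
ZETA, OMEGA_A = 0.7, 150.0            # assumed actuator damping ratio and natural frequency [rad/s]

def aero_coeffs(alpha, delta, mach):
    """Assumed polynomial forms of C_n and C_m as functions of alpha, delta and Mach;
    the numerical coefficients below are placeholders for illustration only."""
    c_n = -0.1 * alpha**3 + 0.5 * alpha * abs(alpha) - 1.0 * (2.0 - mach / 3.0) * alpha - 0.5 * delta
    c_m = 0.2 * alpha**3 - 1.0 * alpha * abs(alpha) + 0.3 * (-7.0 + 8.0 * mach / 3.0) * alpha - 1.0 * delta
    return c_n, c_m

def dynamics(x, delta_c, mach):
    """State x = [alpha, q, delta, delta_v]; returns dx/dt and the lateral acceleration eta."""
    alpha, q, delta, delta_v = x
    c_n, c_m = aero_coeffs(alpha, delta, mach)
    d_alpha = K_ALPHA * mach * c_n * np.cos(alpha) + q          # assumed angle-of-attack dynamics
    d_q = K_Q * mach**2 * c_m                                    # assumed pitch-rate dynamics
    d_delta = delta_v                                            # actuator, as in (1c)
    d_delta_v = -OMEGA_A**2 * delta - 2 * ZETA * OMEGA_A * delta_v + OMEGA_A**2 * delta_c  # as in (1d)
    eta = K_Z * mach**2 * c_n                                    # lateral acceleration output (assumed)
    return np.array([d_alpha, d_q, d_delta, d_delta_v]), eta

# example evaluation at rest, with a 5 deg fin command at Mach 3
dx, eta = dynamics(np.zeros(4), delta_c=np.deg2rad(5.0), mach=3.0)
```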

IV. DEEP DETERMINISTIC POLICY GRADIENT
In the RL approach, an agent must learn to interact with an unknown environment in a way that maximizes the expected cumulative value of a given reward function. Usually, the environment is modeled as a Partially Observable MDP (PO-MDP); in particular, at each time instant, the agent receives from the environment an observation and must pick an action a_t based on such observation. In principle, the observation may differ from the system's actual state; however, for simplicity, we identify the observation with the state s_t of the agent's internal representation of the environment, modeled as a PO-MDP. The computed action affects the next state transition of the system, from s_t to s_{t+1}, after which the agent receives a reward r_{t+1} and a new observation. The objective of the training process is to find a policy that maximizes the cumulative reward, defined as R_t = Σ_{k≥0} γ^k r_{t+k+1}, where the discount factor γ is generally close to (but less than) 1. Moreover, the reward is computed over several episodes, each consisting of (up to) N steps. In classical RL tabular methods, discrete action and observation spaces are considered. The name tabular reflects the fact that, in such methods, the agent usually stores a table that associates with each state-action pair the value of the expected cumulative reward R_t (represented by the action-value function Q(s_t, a_t)). If the agent had access to the true value of Q for each state-action pair, the optimal policy would be the choice of the action a that maximizes Q for each state s (greedy policy). As a consequence, the objective of the training boils down to finding a table that accurately represents Q(s, a).
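As a minimal illustration, the snippet below accumulates the discounted rewards of one episode into the cumulative reward R_t defined above; the episode length and the sample values are arbitrary.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward for t = 0, given the rewards r_1, ..., r_N of one episode."""
    ret = 0.0
    for k, r in enumerate(rewards):   # weight each reward by gamma**k
        ret += (gamma ** k) * r
    return ret

print(discounted_return([1.0, 1.0, 0.5, -2.0]))
```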
Tabular methods, however, are limited to discrete action and observation spaces, and become inefficient in the presence of continuous and high-dimensional spaces. To overcome this limitation, several extensions have been proposed in the technical literature, mainly exploiting deep neural networks and their capability of serving as universal function approximators. In the actor-critic framework, two such approximators are employed:
• critic: approximates the action-value function Q(s, a);
• actor: exploits the critic to improve the policy, represented with another approximator µ(s).
In this study, an actor-critic method known as the DDPG algorithm, originally proposed in [17], is considered. DDPG is a model-free, off-policy approach that extends DPG [16] with the exploitation of deep neural networks. A simple representation of the DDPG paradigm is shown in Fig. 2. In DDPG, an actor network µ(s|θ^µ) is used to represent the current policy, while a critic network Q(s, a|θ^Q) is used to approximate the action-value function Q(s, a) (θ^µ and θ^Q indicate the corresponding network parameters). In particular, the critic network is trained so as to minimize the loss function

L(θ^Q) = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))²,   (4)

where y_i = r_{i+1} + γ Q(s_{i+1}, µ(s_{i+1})|θ^Q) (or just y_i = r_{i+1}, if the next state is terminal). The average is usually computed across a mini-batch of N transitions, randomly extracted from a replay buffer B containing a (large) collection of the past transitions (s_t, a_t, r_{t+1}, s_{t+1}). The actor weights are updated in the direction of the critic action-value gradient, applying the chain rule to the expected return J w.r.t. the actor parameters:

∇_{θ^µ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=µ(s_i)} ∇_{θ^µ} µ(s|θ^µ)|_{s=s_i}.   (5)

Since the Q update is prone to divergence, target networks are employed in order to improve the learning stability. A copy of the critic and actor networks, indicated as Q′(s, a|θ^{Q′}) and µ′(s|θ^{µ′}) respectively, is used to evaluate the target values [17]. The critic and actor parameters θ^Q and θ^µ are updated according to (4) and (5), while the target networks are updated either periodically or in a soft fashion, i.e. θ′ ← τθ + (1 − τ)θ′, with τ ≪ 1. Finally, to find a balance between exploration of the state-action space and exploitation of the current policy, a sampled noise process N can be added to the actor policy, µ′(s_t) = µ(s_t|θ^µ) + N, where N is an Ornstein-Uhlenbeck process [17]. The DDPG algorithm steps are listed in Algorithm 1 (see also [17]).
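To make the critic loss (4), the policy-gradient update (5) and the soft target update concrete, the following PyTorch sketch performs one DDPG optimization step on a sampled mini-batch; the network sizes, learning rates and other hyperparameters are illustrative assumptions, not the exact setup used in this work.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 4, 1, 0.99, 0.005

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 100), nn.ReLU(), nn.Linear(100, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor, critic = mlp(obs_dim, act_dim, nn.Tanh()), mlp(obs_dim + act_dim, 1)
actor_t, critic_t = mlp(obs_dim, act_dim, nn.Tanh()), mlp(obs_dim + act_dim, 1)
actor_t.load_state_dict(actor.state_dict()); critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s_next, done):
    """One DDPG update on a mini-batch of transitions (s, a, r, s_next, done)."""
    with torch.no_grad():                                 # targets y_i from the target networks
        y = r + gamma * (1 - done) * critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)    # as in (4)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()                 # ascent along (5)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, target in ((critic, critic_t), (actor, actor_t)):                   # soft target update
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

# example call on a random mini-batch of 32 transitions
B = 32
ddpg_step(torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
          torch.randn(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
```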

V. RL CONTROL SYSTEM FOR MISSILE
In this section, the proposed DDPG control algorithm is introduced, focusing on the control system architecture, neural networks and details concerning the reward function proposed for the training phase.

A. CONTROLLER ARCHITECTURE
Starting from the state variables of the 2-DoF nonlinear missile model in (1), the observation vector for the agent training has been chosen as a combination of these variables and of the acceleration reference value η_ref(t), generated by the guidance and navigation system of the missile. It is worth remarking that, although the actuator angular speed δ_v(t) is a state variable, it was not included among the observations due to the technological difficulty of obtaining a reliable measurement of this quantity. The only control action considered is the missile's tail fin deflection request δ_c(t), and the control sampling time has been set equal to 0.01 s.
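A minimal sketch, under stated assumptions, of how the measured quantities could be normalized into the observation passed to the agent and how the actor output could be rescaled to a fin command: which four quantities enter the observation vector, their normalization ranges and the maximum deflection value used below are assumptions for illustration only.

```python
import numpy as np

DT = 0.01                      # control sampling time [s], as stated in the text
DELTA_MAX = np.deg2rad(30.0)   # assumed maximum tail deflection

def normalize(x, lo, hi):
    """Map a physical quantity from [lo, hi] to [0, 1]."""
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

def build_observation(alpha, q, eta, eta_ref):
    # assumed observation components and ranges
    return np.array([normalize(alpha, -0.35, 0.35),    # angle of attack [rad]
                     normalize(q, -5.0, 5.0),           # pitch rate [rad/s]
                     normalize(eta, -3.0, 3.0),         # lateral acceleration [g]
                     normalize(eta_ref, -3.0, 3.0)])    # acceleration reference [g]

def scale_action(u_tanh):
    """Rescale the actor's tanh output in [-1, 1] to the fin deflection command delta_c."""
    return DELTA_MAX * float(u_tanh)
```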
The structures of the neural networks (see Fig. 2) have been defined through a trial-and-error procedure in terms of the number of hidden layers and neurons, activation functions, etc., considering a trade-off between performance and limited computational capacity available on board. The main results of the analysis that was carried out are summarized in what follows.
The architecture of the critic neural network is shown in Table 2. This neural network has 5 input variables, i.e. the observation and action variables, and a single output variable, representing the critic's estimate of the action-value function. Note that all input variables were normalized so as to take values in the range [0, 1]. Five fully connected layers connect inputs and outputs, each characterized by 100 neurons and a Rectified Linear Unit (ReLU) activation function. In particular, fully connected layers 1 and 2 process in sequence the observation variables, while fully connected layer 3 processes the action variable. Then, the outputs of layers 2 and 3 are summed before passing through fully connected layers 4 and 5.

Algorithm 1 DDPG (adapted from [17])
1. Randomly initialize the critic and actor networks Q(s, a|θ^Q) and µ(s|θ^µ) with weights θ^Q and θ^µ;
2. Initialize the target networks Q′ and µ′ with weights θ^{Q′} ← θ^Q, θ^{µ′} ← θ^µ;
3. Initialize the replay buffer B;
for episode = 1, M do
4. Initialize the random noise process N for the action exploration;
5. Receive the initial observation state s_1;
for t = 1, T do
6. Select the action a_t = µ(s_t|θ^µ) + N_t according to the current policy and the exploration noise;
7. Execute the action a_t and observe the reward r_t and the new state s_{t+1};
8. Store the transition (s_t, a_t, r_t, s_{t+1}) in B;
9. Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from B;
10. Set y_i = r_i + γ Q′(s_{i+1}, µ′(s_{i+1}|θ^{µ′})|θ^{Q′});
11. Update the critic by minimizing the loss function in equation (4);
12. Update the actor policy using the sampled policy gradient in equation (5);
13. Update the target networks: θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}, θ^{µ′} ← τθ^µ + (1 − τ)θ^{µ′};
end for
end for
The architecture of the actor's neural network is shown in Table 3. This network has 4 input variables, i.e. the observation variables. Also in this case, the input variables have been normalized to take values in the range [0, 1]. The only output of the network is the control action. Between the input and output layers there are four fully connected layers, each containing 100 neurons. The first three layers have ReLU activation functions, while the last has a hyperbolic tangent (tanh) activation function, which produces an output in the range [−1, 1]. The output of this layer is then scaled taking into account the maximum allowed actuator deflection δ̄_c.
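The two architectures described in Tables 2 and 3 can be sketched in PyTorch as follows; the layer widths, activations and the observation/action split follow the text, while the scalar output head of the critic and the maximum deflection used for the actor output scaling are assumptions.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): observations through FC1-FC2, action through FC3, summed, then FC4-FC5."""
    def __init__(self, obs_dim=4, act_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 100)
        self.fc2 = nn.Linear(100, 100)
        self.fc3 = nn.Linear(act_dim, 100)
        self.fc4 = nn.Linear(100, 100)
        self.fc5 = nn.Linear(100, 100)
        self.out = nn.Linear(100, 1)   # scalar Q-value head (assumed, not listed in Table 2)
        self.relu = nn.ReLU()

    def forward(self, obs, act):
        o = self.relu(self.fc2(self.relu(self.fc1(obs))))   # observation path
        a = self.relu(self.fc3(act))                          # action path
        x = self.relu(self.fc4(o + a))                        # merge by summation
        x = self.relu(self.fc5(x))
        return self.out(x)

class Actor(nn.Module):
    """mu(s): ReLU hidden layers of 100 neurons, tanh output scaled to the maximum deflection."""
    def __init__(self, obs_dim=4, delta_max=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 1), nn.Tanh(),   # output in [-1, 1]
        )
        self.delta_max = delta_max           # placeholder for the maximum allowed deflection

    def forward(self, obs):
        return self.delta_max * self.net(obs)

q_value = Critic()(torch.rand(8, 4), torch.rand(8, 1))   # example forward passes
command = Actor()(torch.rand(8, 4))
```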

B. REWARD ENGINEERING
Once the structure of the agent has been set, a reward function must be defined, taking into account the requirements discussed in Section II. To attain the desired goals, the reward function (6) has been used, whose terms include −ω_2(t)K_4 e²(t) − (1 − ω_3(t))K_5 δ²(t), where P_fail, P_step, P_win and K_i for i = 1, ..., 5 are positive constants, ω_i(t) for i = 1, 2, 3 are Boolean variables and e(t) = η_ref − η(t) is the tracking error. In particular, once a range for the lateral acceleration has been fixed, P_fail defines a penalty which is applied when η(t) exceeds the prescribed bounds, causing the premature termination of the current episode; otherwise, the bonus P_step is applied. Finally, an additional bonus P_win is given to the agent when the norm of the tracking error is less than a given threshold. The Boolean variables ω_i(t), i = 1, 2, 3 allow the penalties and bonuses defined above to be applied selectively. In particular, ω_1(t) is true if η is inside the desired range, ω_2(t) is true if the step response shows an undershoot, and ω_3(t) is true if the tracking error is less than a given threshold. The control policy is thus rewarded by function (6) when the missile acceleration is steered towards and kept close to the reference, i.e. requirements 3a and 3d are verified, while it is penalized when the missile motion exceeds the prescribed range of lateral acceleration values. The quadratic terms in the missile angular velocity and in the actuator deflection and deflection speed are used to take into account the requirements on the overshoot and the undershoot, and to limit the control power. Indeed, due to the non-minimum phase behavior of the linearized plant, a further error penalty is considered to limit the undershoot. Since some requirements conflict with each other, e.g. rise time and overshoot, the positive constants K_i, i = 1, ..., 5 weight the terms in the reward function so that the resulting control policy is a trade-off solution.
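Since only part of (6) is reproduced above, the following Python sketch implements one plausible combination of the bonus and penalty terms described in the text; the roles of P_fail, P_step, P_win, K_1, ..., K_5 and ω_1, ω_2, ω_3 follow the description, while the exact way the terms are combined, the default weights and the undershoot test are assumptions.

```python
def reward(eta, eta_ref, q, delta, delta_v, eta_bounds, err_threshold,
           P_fail=50.0, P_step=1.0, P_win=5.0,
           K=(1.0, 0.1, 0.01, 10.0, 0.1)):
    """One-step reward: a sketch of (6); the exact combination of terms is an assumption."""
    K1, K2, K3, K4, K5 = K
    e = eta_ref - eta
    w1 = eta_bounds[0] <= eta <= eta_bounds[1]        # lateral acceleration inside the allowed range
    w2 = eta * eta_ref < 0.0                           # response currently undershooting (assumed test)
    w3 = abs(e) < err_threshold                        # tracking error below the threshold

    if not w1:                                        # out of bounds: penalty and episode termination
        return -P_fail, True
    r = P_step + (P_win if w3 else 0.0)               # per-step bonus plus near-reference bonus
    r -= K1 * e**2 + K2 * q**2 + K3 * delta_v**2      # error, body-rate and fin-rate penalties
    r -= K4 * e**2 if w2 else 0.0                     # extra error penalty while undershooting
    r -= K5 * delta**2 if not w3 else 0.0             # control-power penalty, as in the visible fragment
    return r, False

# example call with arbitrary values
print(reward(eta=0.2, eta_ref=1.0, q=0.1, delta=0.05, delta_v=0.3,
             eta_bounds=(-3.0, 3.0), err_threshold=0.05))
```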

C. TRAINING PROCEDURE
According to the reward function (6), the training procedure has been performed on the missile model linearized around the equilibrium point defined by M = 3 and α(t) = 0 [deg], where each episode consists in the simulation of a random maneuver. In this way, the policy has been optimized with respect to a single flight condition; the controller has then been validated over the whole considered operating range, in order to verify requirement 1.
More specifically, each training episode is characterized by a different step command, whose amplitude is chosen randomly within the range [−1, 1] [g], and terminates when the simulation time reaches its maximum value, chosen as T max = 1.5 [s], or when the lateral acceleration exceeds the desired range of values. Table 4 contains the values of the training parameters.
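A minimal sketch of how a training episode could be generated according to the procedure above: a random step command in [−1, 1] [g], a 0.01 s control step, a 1.5 s horizon and early termination when the acceleration leaves the allowed range; the environment step function, the policy interface and the acceleration bounds are placeholders.

```python
import numpy as np

DT, T_MAX = 0.01, 1.5                     # control sampling time and episode horizon [s]
ETA_BOUNDS = (-3.0, 3.0)                  # assumed allowed lateral-acceleration range [g]

def run_episode(env_step, policy, x0, rng=np.random.default_rng()):
    """Simulate one training episode on the (linearized) model.

    env_step(x, u, eta_ref) -> (x_next, eta, r) is a placeholder for the environment
    dynamics plus reward (6); policy(obs) -> u stands for the current actor.
    """
    eta_ref = rng.uniform(-1.0, 1.0)       # random step command in [-1, 1] g
    x, total_reward = x0, 0.0
    for _ in range(int(T_MAX / DT)):
        obs = np.append(x, eta_ref)         # assumed observation layout
        x, eta, r = env_step(x, policy(obs), eta_ref)
        total_reward += r
        if not (ETA_BOUNDS[0] <= eta <= ETA_BOUNDS[1]):   # early termination on out-of-range eta
            break
    return total_reward
```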

VI. SIMULATION RESULTS
In this section, the effectiveness of the proposed controller is characterized through numerical simulations.
The DDPG agent has been trained by implementing the procedure defined in Section V-C. In particular, the constants in the reward function (6) have been chosen so as to obtain a maximum undershoot of less than 50% of the reference value, a maximum rise time of less than 1 [s] and a maximum steady-state error of less than 5% of the reference value.
Section VI-A shows how the proposed data-driven approach is capable of learning the nonlinear behaviour of the missile described by (1) from the limited experience that it gets from the response of the linearized model for specific flight conditions. Moreover, we evaluate the robustness of the control system for different flight conditions and in the presence of uncertainty in the aerodynamic coefficients. A further assessment is carried out in Section VI-B, by comparing the trained DDPG agent with two robust model-based strategies, i.e. the self-scheduled H∞ control and the Adaptive Augmenting Controller (AAC), previously presented in [6] and [39], respectively. This comparison shows the efficiency of the proposed model-free autopilot in guaranteeing proper closed-loop tracking performance while exhibiting, at the same time, a lower undershoot, in accordance with the proposed reward function (6). The comparison among the three controllers is carried out by emulating the effect of the outer-loop guidance law, whose aim is to provide the proper missile acceleration [42], as a time-varying reference signal; this proves the capability of the proposed approach to work in more complex scenarios such as realistic missile guidance systems.

A. CONTROLLER VALIDATION
In this section the closed-loop responses of the linearized and nonlinear models are compared to validate the trained DDPG agent. A maneuver starting from the flight condition considered in the training phase and with three different acceleration requests is considered (see the black trace in Fig. 3). The simulation results reported in Fig. 3 show that the closed-loop responses to the first request of one additional g are similar, hence we can consider the control requirements satisfied in both cases. Furthermore, this simulation also shows that, when the requested acceleration brings the system far from the reference equilibrium of the linearized model, the DDPG agent still exhibits acceptable performance, hence proving its capability of learning a control law that generalizes to the nonlinear behavior, although the training procedure was based on a linear approximation of the missile response.
Moreover, the robustness of the proposed approach has been evaluated by considering 820 different nonlinear simulations, performed for a step command of magnitude η_ref = 1 [g], with different initial angles of attack α(0) ∈ [−10, 10] [deg] and different Mach numbers M ∈ [2, 4]. The control performance has been evaluated by computing the cumulative reward (6) at the end of each simulation, for the same maneuver starting from different initial angles of attack α_0 and at different Mach numbers M over the considered flight envelope. The results in Fig. 4 show that the reward is almost independent of the initial value of α(t), while the impact of M is more evident. However, this degradation can be avoided by considering the Mach number as a scheduling parameter, as mentioned in Section II. It is worth observing that the narrow red band for M between 3 and 3.2 depends on the fact that the regime value differs from the desired value by less than 5%, i.e., ω_3(t) is true in (6).
Furthermore, the robustness of the proposed approach has been evaluated through Monte Carlo simulations performed with the nonlinear model, starting at the initial flight condition α_0 = 0 [deg], M = 3, where an additive uncertainty on the aerodynamic coefficients C_n and C_m has been introduced in the range [−20, 20]% of the corresponding nominal value. Fig. 6 shows the results of 100 runs when a step command of magnitude η_ref = 1 [g] is applied. Despite this significant variation in the model parameters, the approach still guarantees closed-loop stability in all the perturbed scenarios. As for the case of variations in the Mach number, robustness against model uncertainty was not considered during the training phase; therefore, as expected, some slight performance degradation can be observed. However, this degradation could be counteracted by including robustness as a further objective of the training phase.
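The Monte Carlo robustness test described above can be sketched as follows: each of the 100 runs perturbs the nominal aerodynamic coefficients C_n and C_m by a random amount within ±20% of their nominal value (implemented here as a scale factor); the closed-loop simulation routine is a placeholder.

```python
import numpy as np

def monte_carlo_runs(simulate, n_runs=100, uncertainty=0.20, seed=0):
    """Run the closed-loop simulation with randomly perturbed aerodynamic coefficients.

    simulate(cn_scale, cm_scale) -> metrics is a placeholder for a nonlinear closed-loop
    simulation (step command of 1 g, alpha0 = 0 deg, M = 3).
    """
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_runs):
        cn_scale = 1.0 + rng.uniform(-uncertainty, uncertainty)   # +/- 20% on C_n
        cm_scale = 1.0 + rng.uniform(-uncertainty, uncertainty)   # +/- 20% on C_m
        results.append(simulate(cn_scale, cm_scale))
    return results
```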

B. COMPARISON WITH MODEL-BASED METHODOLOGIES
To better discuss the advantages of the proposed DDPG strategy in tracking the missile lateral acceleration, we now compare its closed-loop behaviour with that of two different robust model-based strategies proposed in the literature to solve the same control problem. Specifically, the former has been presented in [6], where the authors developed a robust self-scheduled H∞ control to regulate the lateral acceleration of a missile, while the latter consists of an adaptive control mechanism named Adaptive Augmenting Controller (AAC) [39]. The design procedure for the self-scheduled H∞ controller relies on a Linear Parameter-Varying (LPV) model of the missile, whose state-space representation depends on both α and M. In this case, robustness is achieved by guaranteeing H∞ performance for the LPV plant when α ranges in the interval [−20, 20] [deg] and M ∈ [2, 4].
Similarly, the AAC achieves robustness by designing a baseline state feedback that guarantees robust stability for all the models belonging to the convex hull defined by the linearized models with M = 3 and α equal to {0, 5, 10, 15, 20} [deg]. Moreover, for both model-based controllers, the gains have been tuned so as to obtain a similar undershoot when performing a 1 [g] maneuver.
The simulation results are shown in Figs. 7 and 8, where the closed-loop responses have been compared by performing two different maneuvers. For the comparison in Fig. 7 we have considered a sequence of three step commands. When tracking the first 1 [g] step reference, all the controllers show the same undershoot, while the response of the RL agent is characterized by a slightly shorter settling time. When reference changes larger than 1 [g] are requested, the response of the data-driven controller is always characterized by the smallest undershoot and the smallest control effort when compared with the two model-based controllers. Moreover, in the worst case, the settling time of the RL controller is similar to that of the other two considered approaches.
The further simulations shown in Fig. 8 refer to the response to a reference signal similar to the one computed by a guidance system, as proposed in [42]. Here we want to remark that, although the RL controller was not trained using such a class of reference signals, it shows similar performance when compared to the two model-based controllers. From these results, it is possible to conclude that the proposed DDPG autopilot shows the same robustness against model uncertainties as the two model-based approaches.
This result is achieved without the need for a detailed system model, as required by both the self-scheduled H∞ controller and the AAC. Indeed, the former requires an LPV description of the missile response and the latter more than one linearized model, while the proposed RL agent achieves practically the same performance by exploiting just a single linear model. This further proves the effectiveness of model-free data-driven approaches for the design of robust autopilot systems.

VII. CONCLUSION
The feasibility of a model-free controller for the lateral acceleration of a missile has been investigated in this article. Specifically, exploiting the DDPG approach, an RL agent has been trained on the linearized dynamics of a 2-DoF nonlinear missile model, taking into account the main performance indexes. To assess the effectiveness of the proposed approach, different scenarios have been simulated on the 2-DoF nonlinear model, proving the efficiency of the data-driven approach in stabilizing the rotational dynamics and satisfying the control requirements in the design flight conditions. Furthermore, a robustness analysis has been provided to show the capability of the proposed approach to guarantee closed-loop stability over a wide range of flight conditions and in the presence of model uncertainty. Along this line, future works will involve the improvement of the robustness w.r.t. variations of the Mach number, model uncertainties and measurement noise, by the explicit inclusion of robustness as a further objective during the training phase.

where he has participated in various projects connected to the plasma magnetic control systems. His current research interests include control of nuclear fusion devices, fault detection and identification of discrete event systems modeled with Petri nets, and stability of hybrid systems. He has published more than 200 journal and conference papers on these topics, and has coauthored two monographs titled ''Finite-Time Stability and Control'' and ''Finite-Time Stability: An Input-Output Approach.''

DARIO GIUSEPPE LUI (Member, IEEE) was born in Avellino, in 1990. He received the master's degree (Hons.) in electronics engineering for automation and telecommunications (specialization in automation) and the Ph.D. degree in information technology for engineering from the University of Sannio, Benevento, Italy, in 2015 and 2020, respectively. He is currently a Research Fellow at the University of Naples Federico II. His work focuses on the distributed control of multi-agent systems in the presence of communication impairments, with application to the automotive field and reinforcement learning.
ADRIANO MELE received the M.Sc. degree (Hons.) in automation engineering from the University of Naples Federico II, in 2015, and the joint Ph.D. degree in nuclear fusion science and engineering from the University of Naples Federico II and the University of Padua, in 2019. He has been a Visiting Researcher at the EAST Tokamak, Hefei, China; the Swiss Plasma Center of EPFL, Lausanne, Switzerland; the ITER Remote Experimentation Center, Rokkasho, Japan, and the French Commissariat à l'énergie atomique et aux énergies alternatives, Cadarache, France. He is currently a Researcher at the University of Tuscia, Viterbo. He held graduate and post-graduate courses on information technologies for industrial automation, industrial control systems, and plasma magnetic control in Tokamak reactors. His current research interests include control engineering, in particular with applications to fusion plasmas, including the investigation of data-driven and machine learning methods. He was a recipient of an EUROFusion Engineering Grant dedicated to the development of diagnostic systems for the forthcoming DTT Tokamak.
STEFANIA SANTINI (Member, IEEE) received the M.Sc. degree in electronic engineering and the Ph.D. degree in automatic control from the University of Naples Federico II, Naples, Italy, in 1996 and 1999, respectively. She is currently an Associate Professor of automatic control. She is involved in many projects with industry, including small-and medium-sized enterprises operating in the automotive field. Her research interests include the area of the analysis and control of nonlinear systems with applications to automotive engineering, transportation technologies, computational biology, and energy systems.
GAETANO TARTAGLIONE (Member, IEEE) was born in Caserta, Italy, in 1989. He received the master's degree (Hons.) in aerospace engineering from the University of Naples Federico II, Italy, in 2014, and the Ph.D. degree in information engineering from the University of Naples Parthenope, Italy, in 2018. Currently, he is a Researcher at the Engineering Department, University of Naples Parthenope. His research interests include control of nuclear fusion devices, multi-agent cooperative control, and finite-time stability and stabilization for the class of linear time-varying stochastic systems.