Neural Network Optimal Feedback Control with Guaranteed Local Stability

Recent research shows that supervised learning can be an effective tool for designing near-optimal feedback controllers for high-dimensional nonlinear dynamic systems. But the behavior of neural network controllers is still not well understood. In particular, some neural networks with high test accuracy can fail to even locally stabilize the dynamic system. To address this challenge we propose several novel neural network architectures, which we show guarantee local asymptotic stability while retaining the approximation capacity to learn the optimal feedback policy semi-globally. The proposed architectures are compared against standard neural network feedback controllers through numerical simulations of two high-dimensional nonlinear optimal control problems: stabilization of an unstable Burgers-type partial differential equation, and altitude and course tracking for an unmanned aerial vehicle. The simulations demonstrate that standard neural networks can fail to stabilize the dynamics even when trained well, while the proposed architectures are always at least locally stabilizing and can achieve near-optimal performance.


I. INTRODUCTION
Designing optimal feedback controllers for high-dimensional nonlinear systems remains an outstanding challenge for the control community. Even when the system dynamics are known, to design such controllers one needs to solve a Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE), whose dimension is the same as that of the state space. This leads to the well-known curse of dimensionality, which rules out traditional discretization-based approaches.
Recent work has demonstrated the promise of supervised learning as one potential approach for handling such challenging, high-dimensional problems. The main idea is to generate data by solving many open loop optimal control problems (OCPs) and then fit a model to this data set, thus obtaining an approximate optimal feedback controller. Various specific model design and training approaches have been developed within this framework. Earlier work [1]-[3] uses sparse grid interpolation to approximate the solution of the HJB equation, called the value function, and its gradient, which is used to compute the optimal feedback control. This line of work has been further developed using neural networks (NNs) [4]-[8] and sparse polynomials [9], significantly increasing the maximum feasible problem dimension. Alternatively, one can directly approximate the value gradient [10], [11] or control policy [4], [5], [11]-[14].
There are also several closely-related research directions which can be classified as self-supervised learning methods. The method of successive approximations is a well-studied approach based on iterative updates of a value function model and/or control policy by approximately solving a series of Lyapunov equations [15]-[17]. These methods are equipped with convergence guarantees, but they often depend on specific problem dynamics, a priori access to a semi-globally stabilizing controller, or polynomial model structures whose size grows exponentially with the problem dimension. A second group of methods attempts to solve the HJB PDE in the least-squares sense by minimizing the residual of the PDE and boundary conditions at (randomly sampled) collocation points [18]-[21]. More recently, [22]-[24] have proposed methods to solve the HJB equation along its characteristics without generating data. Generally speaking, such self-supervised approaches avoid the cost of data generation by taking on a harder learning problem.

Figure 1: Simulated trajectory of a fixed wing UAV (29) controlled with an NN feedback controller, compared to the optimal trajectory.
Despite promising developments in the methodology, much less work has been done to study and improve the stability and reliability of these NN controllers. To see why this is needed, consider training a set of NNs to control the fixed wing unmanned aerial vehicle (UAV) studied in Section VI: a surprisingly large fraction of these fail to stabilize the system despite having good approximation accuracy. Figure 1 shows a closed loop simulation with one such controller, where the NN-controlled trajectory closely tracks the optimal (stable) trajectory before suddenly destabilizing and eventually settling at an undesired steady state. Behavior like this underscores the need for better understanding, more rigorous testing, and more reliable algorithms. This problem has been previously recognized by [11], [25], while [13] point out that test accuracy incompletely characterizes NN controller performance, suggesting some practical evaluations of optimality and stability. [26] study linear stability near a desired equilibrium, linear time delay stability, and stability around a nominal trajectory using high order Taylor maps. In terms of algorithm development, [8], [11] propose several NN architectures incorporating a linear quadratic regulator (LQR), which makes training more reliable and improves local stability properties.
The purpose of this paper is twofold: first, to bring attention to stability issues with NN-controlled systems; and second, to propose some NN architectures which can mitigate some of these challenges. These architectures improve on previous work in [8], [11] by guaranteeing, at a minimum, local asymptotic stability (LAS) of the system. This is accomplished by exactly recovering the LQR gain at the origin by construction. We also prove a universal approximation theorem for NNs with such structures, showing that they can approximate the nonlinear optimal feedback law up to arbitrary accuracy, and consequently provide semi-global stability and optimality. This paper is organized as follows. We start by describing the problem setting in Section II. In Section III we describe the novel NN architectures and present theory underpinning their local stability properties and their ability to approximate the full nonlinear optimal feedback policy. In Section IV we outline the supervised learning procedure. It should be noted, however, that the proposed architectures do not have to be trained using supervised learning; in principle they can also be implemented with self-supervised learning. In Section V we apply several practical closed loop stability and optimality tests to demonstrate the advantages of the proposed NN architectures. As a testbed we use the Burgers'-type PDE system (25), which is nonlinear, open loop unstable, and high-dimensional. Then in Section VI we apply the proposed control design methodology to learn optimal feedback controllers for a six degree-of-freedom (6DoF) fixed wing UAV with nonlinear dynamics and aerodynamics, showing that the framework can be applied to practical problems. A summary and directions for future work are given in Section VII. The code used for simulations in this paper will be made publicly available at https://github.com/Tenavi/QRnet.

II. PROBLEM SETTING
We focus on infinite horizon nonlinear OCPs of the form

minimize_{u(·)} J[u(·)] = ∫_0^∞ L(x(t), u(t)) dt, subject to ẋ(t) = f(x(t), u(t)), x(0) = x_0,    (1)

where x : [0, ∞) → R^n is the state, u : [0, ∞) → U ⊂ R^m is the control, f : R^n × U → R^n is a vector field which is continuously differentiable (C^1) in x and u, and (x_f, u_f) ∈ R^n × U is a (possibly unstable) equilibrium of f(·). We consider box control constraints, U = {u ∈ R^m : u_min ≤ u ≤ u_max} for vectors u_max, u_min ∈ R^m, and running costs L(x, u) = q(x) + r(u) for smooth functions q and r. This is a standard running cost for regularization or set-point tracking problems. We make the standard assumptions that u_f is an interior point of U and that the OCP (1) is well-posed, i.e. there exists an optimal control u* : [0, ∞) → U such that J[u*(·)] < ∞. Due to real-time application requirements, we would like to design a control policy in explicit feedback form, u = u*(x), which can be evaluated online given any measurement of x. The mathematical framework for designing such an optimal feedback policy is the HJB equation.
Define the value function V : R^n → R as the optimal cost-to-go of (1) starting at x(0) = x,

V(x) = inf_{u(·)} J[u(·); x].

Under appropriate conditions, the value function is the unique viscosity solution [27] of the steady state HJB PDE,

0 = min_{u ∈ U} H(x, V_x(x), u),    (4)

where

H(x, λ, u) := L(x, u) + λ^T f(x, u)

is the Hamiltonian. If (4) can be solved (in the viscosity sense), then it provides both necessary and sufficient conditions for optimality. Furthermore, the optimal feedback control is then obtained from the Hamiltonian minimization condition,

u*(x) = arg min_{u ∈ U} H(x, V_x(x), u).    (6)

III. ARCHITECTURES FOR OPTIMAL FEEDBACK DESIGN
Our goal is to construct a feedback policy which approximates the optimal control, i.e. u(x) ≈ u * (x). While previous work has clearly demonstrated the potential of deep learning as a means of overcoming the curse of dimensionality in optimal control, NNs are notoriously "black boxes" and their behavior -especially when implemented in the closed loop system -is complex and hard to predict. Notably, even if we can train a highly accurate NN, it can still fail to stabilize the system. Thus there is a clear need for designing NN feedback controllers with built-in stability properties. In prior work, the authors proposed V -QRnet [8] (originally just called QRnet), λ-QRnet, and u-QRnet [11]. These architectures combine an LQR controller with NNs. The LQR terms are good approximations of the optimal control near x f , and improve local stability. Meanwhile, the NNs are intended to capture nonlinearities and thereby learn the nonlinear optimal feedback over a large domain.
However, none of these architectures guarantee LAS, which motivates us to pursue alternative designs. In this paper we introduce four novel NN architectures, λ Jac -QRnet, λ mat -QRnet, u Jac -QRnet, and u mat -QRnet, all of which guarantee LAS of x f while retaining the ability to approximate the nonlinear optimal control semi-globally.
The remainder of this section is organized as follows. In Section A we review λ-QRnet, u-QRnet, and LQR control design. The novel NN architectures are presented in Sections B and C. In Section D we show that these controllers automatically provide LAS, and finally in Section E we prove that they have the capacity to approximate the nonlinear optimal control. Throughout the paper, architectures denoted with a leading V approximate the value function, those with λ approximate the value gradient, and those with u directly approximate the optimal control. V -NN, λ-NN, and u-NN refer to standard feedforward NNs for approximating the value function, value gradient, and optimal control, respectively.

A. λ-QRnet AND u-QRnet
Introduced in preliminary work [11], λ-QRnet and u-QRnet are straightforward linear combinations of LQR with NNs. Let A := [∂f/∂x](x_f, u_f) and B := [∂f/∂u](x_f, u_f). Under the standard conditions that (A, B) is controllable and (A, Q^{1/2}) is observable, LQR gives a quadratic value function approximation,

V(x) ≈ V^LQR(x) = (x − x_f)^T P (x − x_f),    (8)

and a linear state feedback law,

u^LQR(x) = u_f − K(x − x_f), K := R^{−1} B^T P,    (9)

where P ∈ R^{n×n} is a positive definite matrix satisfying the Riccati equation,

A^T P + P A − P B R^{−1} B^T P + Q = 0.    (10)

Sufficiently near x_f, the LQR value function (8) and control (9) are good approximations of the true value function V(·) and optimal control u*(·), respectively. But further away from x_f, the control is suboptimal and in some cases may even fail to stabilize the nonlinear dynamics. For this reason we are interested in combining the local stability and optimality of LQR with NNs to learn the full nonlinear optimal control u*(·) over a semi-global domain. Now we describe λ-QRnet, which approximates the value gradient V_x(·) as

V_x(x) ≈ λ(x; θ) = 2P(x − x_f) + N(x; θ) − N(x_f; θ).    (11)

Here N : R^n × R^p → R^n is an NN with C^1 activation functions and parameters θ ∈ R^p, and the linear component 2P(x − x_f) is the gradient of the LQR value function (8). We then substitute (11) into (6) to obtain an approximate optimal feedback control,

u(x; θ) = arg min_{u ∈ U} H(x, λ(x; θ), u).    (12)

λ-QRnet can be easily implemented when we can solve (6) for an explicit formula for the optimal feedback control in terms of the state and value gradient, as is the case for many problems of interest. Alternatively, we can directly approximate the optimal control with u-QRnet:

u(x; θ) = σ( sat(u^LQR(x)) + N(x; θ) − N(x_f; θ) ),    (13)

where now N : R^n × R^p → R^m. In (13), sat(·) is the saturation function defined for each i = 1, . . . , m as

sat_i(u) = min{ max{u_i, u_min,i}, u_max,i },    (14)

and σ : R^m → U is a generalized logistic function which smoothly saturates the nonlinear control (15).

This article has been accepted for publication in IEEE Open Journal of Control Systems. This is the author's version, which has not been fully edited; content may change prior to final publication.
Here multiplication and division are performed elementwise. The constants c_1, c_2 ∈ R^m are chosen so that σ(·) smoothly imposes the saturation constraints while preserving the unsaturated behavior near u_f, as we visualize in Figure 2. Use of this smooth saturation function makes learning easier since it prevents vanishing gradients when the control becomes saturated. The hard saturation function (14) acting on the LQR control inside the smooth saturation function (15) may appear redundant, but it is actually important for limiting the effect of the linear term away from x_f. This reduces the burden on the NN: it is free to approximate nonlinearities without negating any excess contributions from the linear component.
It is easy to show that subtracting N(x_f; θ) in (11) and (13) makes the goal state x_f a closed loop equilibrium [11]. This is not true for the standard NN controllers V-NN, λ-NN, and u-NN, or for V-QRnet. Including the LQR terms improves local stability due to LQR's large gain and phase margins, but does not exactly recover LQR and so cannot assure LAS without adequate training.
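As a concrete reference for the LQR ingredients used by all of the QRnet architectures, the following sketch computes P from the Riccati equation (10) and the gain K in (9) using SciPy, for a hypothetical two-state linearization (A, B); the paper's systems are of course much larger.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical linearization (A, B) of f about (x_f, u_f), with cost weights Q, R.
A = np.array([[0.0, 1.0], [2.0, -1.0]])   # open loop unstable (eigenvalues 1, -2)
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# Solve the continuous-time algebraic Riccati equation (10):
#   A'P + PA - PBR^{-1}B'P + Q = 0
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)           # LQR gain, K = R^{-1} B' P

# The feedback u(x) = u_f - K (x - x_f) renders x_f locally asymptotically
# stable: the closed loop matrix A - BK is Hurwitz.
eigs = np.linalg.eigvals(A - B @ K)
assert np.all(eigs.real < 0)
```

This P and K are exactly the quantities that appear in the architectures (11)-(20) below.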

B. "JACOBIAN" QRnet ARCHITECTURES
Now we describe λ Jac -QRnet and u Jac -QRnet. These are similar to λ-QRnet and u-QRnet, except that we subtract the Jacobian of the NN components. This ensures that the controllers exactly recover LQR at x f , thus guaranteeing LAS. Furthermore, we will prove that these architectures retain the nonlinear function approximation capacity of standard feedforward NNs, allowing them to approximate the full nonlinear value gradient and optimal control.
First we have λ_Jac-QRnet:

V_x(x) ≈ λ(x; θ) = 2P(x − x_f) + N(x; θ) − N(x_f; θ) − [∂N/∂x](x_f; θ)(x − x_f).    (17)

u_Jac-QRnet has an analogous structure:

u(x; θ) = σ( sat(u^LQR(x)) + N(x; θ) − N(x_f; θ) − [∂N/∂x](x_f; θ)(x − x_f) ).    (18)

These models are slower to train than λ-QRnet (11) and u-QRnet (13), since the Jacobian [∂N/∂x](x_f; θ) must be evaluated during each forward pass. After training, however, we can store the Jacobian matrix in memory so that it does not have to be recomputed online. Therefore online evaluation is just as fast as for λ-QRnet and u-QRnet.
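A minimal numerical sketch of the λ_Jac-QRnet construction (17), using a small hypothetical tanh network with a hand-coded Jacobian (in practice an autodiff framework would supply ∂N/∂x): because the NN correction term vanishes to first order at x_f, the model's value and Jacobian there are exactly 0 and 2P, regardless of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
# Hypothetical trained pieces: a Riccati solution P and a small tanh MLP N(x).
P = np.eye(n)
W1, b1 = rng.normal(size=(16, n)), rng.normal(size=16)
W2, b2 = rng.normal(size=(n, 16)), rng.normal(size=n)

def N(x):                       # feedforward NN, R^n -> R^n
    return W2 @ np.tanh(W1 @ x + b1) + b2

def N_jac(x):                   # analytic Jacobian of N at x (chain rule)
    s = 1.0 - np.tanh(W1 @ x + b1) ** 2
    return W2 @ (s[:, None] * W1)

x_f = np.zeros(n)
J_f = N_jac(x_f)                # computed once, then cached for online use

def lambda_jac_qrnet(x):
    # lambda_Jac-QRnet (17): the NN term has zero value and zero Jacobian at
    # x_f, so the LQR value gradient 2P(x - x_f) is recovered exactly there.
    return 2 * P @ (x - x_f) + N(x) - N(x_f) - J_f @ (x - x_f)

assert np.allclose(lambda_jac_qrnet(x_f), 0.0)
```

Caching `J_f` after training is what makes online evaluation as cheap as for λ-QRnet.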

C. "MATRIX" QRnet ARCHITECTURES
In this section we describe λ_mat-QRnet and u_mat-QRnet. These alternatives to the "Jacobian"-style architectures employ matrix-valued NNs. Thus they avoid the costly Jacobian computations in exchange for having to optimize more NN parameters. These "matrix" QRnets enjoy the same stability and approximation properties as the "Jacobian" QRnets. We have not found a consistent performance advantage of either the "Jacobian" or "matrix" QRnets: their relative learning ability appears to be problem-dependent. First consider λ_mat-QRnet:

V_x(x) ≈ λ(x; θ) = 2P(x − x_f) + [N(x; θ) − N(x_f; θ)](x − x_f).    (19)

Notice that in this case N : R^n × R^p → R^{n×n} is matrix-valued. Next we have u_mat-QRnet:

u(x; θ) = σ( sat(u^LQR(x)) + [N(x; θ) − N(x_f; θ)](x − x_f) ),    (20)

where now N : R^n × R^p → R^{m×n}. A drawback of λ_mat-QRnet (19) is that the number of NN parameters scales with O(Lw^2 + wn^2), where L is the number of layers and w is their width. For high dimensional OCPs, this can make (19) challenging to train as well as deploy on small processors. Meanwhile, the number of NN parameters in (20) scales with O(Lw^2 + wmn). Since we typically have m ≪ n, u_mat-QRnet is often much smaller and hence much faster to train than λ_mat-QRnet.
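The parameter scaling argument can be checked directly. The count below assumes a plain feedforward trunk of L hidden layers of width w followed by one linear output layer producing the flattened matrix; with the Burgers' problem sizes n = 64, m = 2 from Section V, u_mat-QRnet comes out more than an order of magnitude smaller.

```python
# Parameter counts for the "matrix" QRnets, assuming a feedforward trunk of
# L hidden layers of width w and a final linear layer producing the matrix.
def n_params(n_in, widths, n_out):
    sizes = [n_in] + widths + [n_out]
    # each layer: (fan_in * fan_out) weights + fan_out biases
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

n, m, L, w = 64, 2, 5, 32
lam_mat = n_params(n, [w] * L, n * n)   # n x n output: O(Lw^2 + wn^2)
u_mat = n_params(n, [w] * L, m * n)     # m x n output: O(Lw^2 + wmn)

assert u_mat < lam_mat                  # with m << n, u_mat-QRnet is much smaller
print(lam_mat, u_mat)
```

Here the final-layer term dominates: w·n² = 32·4096 parameters for λ_mat-QRnet versus w·m·n = 32·128 for u_mat-QRnet.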

D. LOCAL ASYMPTOTIC STABILITY GUARANTEES
Like λ-QRnet and u-QRnet, the new architectures automatically make the goal state x_f an equilibrium. Moreover, if we linearize the feedback control u(·) at x_f, then we recover the LQR control gain (9). This holds even when the models are poorly trained. This is desirable because LQR locally asymptotically stabilizes x_f, and hence the proposed controllers are locally stabilizing by construction. This property is stated formally in Proposition 1 below. The proof is straightforward but tedious, so we omit it for brevity.

Proposition 1 (Local asymptotic stability): Let u(·; θ) be any of the controllers (17)-(20), with arbitrary parameters θ ∈ R^p. Then x_f is a locally asymptotically stable equilibrium of the closed loop system ẋ = f(x, u(x; θ)).

E. APPROXIMATION CAPACITY
LAS is a critical but bare minimum requirement. To achieve the ultimate goal of semi-global stability and optimality through training, the NN architectures must be able to approximate u*(·) with sufficient accuracy. Unfortunately, we cannot directly use the Taylor series-like forms of (17)-(20) or existing NN universal approximation theorems like [28] to show this is possible. This is because V_x(·) and u*(·) are in general not C^1 everywhere, and because the NN architectures used in this work are not standard.
Nevertheless, for OCPs like (1) we expect V_x(·) and u*(·) to be everywhere continuous and locally C^1 in a neighborhood of x_f. In this case, Theorems 1 and 2 presented below show that NNs of the form (17)-(20) are universal approximators for such functions. To prove Theorems 1 and 2 we first specialize the Stone-Weierstrass approximation theorem [29] to locally C^1 functions, and then apply an NN universal approximation theorem [28]. For clarity of presentation, we simply state the main results here and defer the proofs to the Appendix.
Throughout this section let X ⊂ R^n be compact, let x_f be an interior point of X, and without loss of generality let x_f = 0 and u_f = 0. By C(X; R^d) we denote the space of continuous functions on X taking values in R^d.
Our first main result concerns the approximation capacity of the "Jacobian" QRnet architectures introduced in Section B. As mentioned previously, since we expect the value gradient and optimal control for the OCP (1) to be continuous and locally C 1 , this supports the use of λ Jac -QRnet and u Jac -QRnet as nonlinear function approximators.
Theorem 1 (Jacobian QRnet approximation): Let g ∈ C(X; R^d) be C^1 in a neighborhood of the origin, with g(0) = 0 and [∂g/∂x](0) = 0. Then for all ε > 0, there exists a feedforward NN with bounded, non-constant, C^1 activation functions, N ∈ C^1(X; R^d), such that for all x ∈ X,

‖ g(x) − { N(x; θ) − N(0; θ) − [∂N/∂x](0; θ) x } ‖ < ε.

An analogous approximation theorem can be obtained for the "matrix" QRnet architectures introduced in Section C, λ_mat-QRnet and u_mat-QRnet.
Theorem 2 (Matrix QRnet approximation): Let g ∈ C(X; R^d) be C^1 in a neighborhood of the origin, with g(0) = 0. Then for all ε > 0, there exists a feedforward NN with bounded, non-constant, C^1 activation functions, N ∈ C^1(X; R^{d×n}), such that for all x ∈ X,

‖ g(x) − { N(x; θ) − N(0; θ) } x ‖ < ε.

IV. MODEL TRAINING
In this section we provide an overview of the supervised learning approach we use to train the NN controllers. Note that the proposed NNs do not have to be trained using supervised learning: they can be implemented in conjunction with any learning algorithm as long as an LQR controller can be computed for the system. In this work we focus on the impact of NN architecture rather than learning algorithm, so we restrict the scope to supervised learning. Supervised learning can be broken down into three steps: data generation (Section A), NN optimization (Section B), and finally model evaluation against test data (Section C). In Section V we will illustrate a more rigorous test regimen specifically for control design, by which we compare the proposed controllers with LQR and standard NN controllers.

A. DATA GENERATION
To circumvent the challenge of directly solving the HJB PDE (4), we can find an approximate optimal control by joining the solutions of many open loop OCPs. Each open loop OCP can be solved independently without the use of a spatial grid, thus mitigating the curse of dimensionality. Numerical methods based on this idea are referred to as causality-free [1].
To generate training and testing data sets for supervised learning we solve the open loop OCP (1) for a set of (randomly sampled) initial conditions. Note that in practice we approximate (1) by a finite horizon problem with a large final time. Each open loop optimal trajectory and control provides input-output pairs (x^(i), V_x(x^(i)), u*(x^(i))), where the superscript (i) is the sample index. Aggregating data from all open loop solutions, we obtain a data set

D = { (x^(i), V_x(x^(i)), u*(x^(i))) }_{i=1}^N.

In the following we briefly review common computational methods for solving open loop optimal control problems. For more detailed discussions of data generation approaches for supervised learning, we refer the reader to [30] and references therein.
Algorithms for solving the open loop OCP (1) can be broadly classified as indirect and direct methods [31]. Indirect methods take the "optimize then discretize" approach, computing open loop optimal solutions to (1) by numerically solving the necessary conditions of optimality from Pontryagin's Minimum Principle (PMP). These necessary conditions take the form of a two-point boundary value problem (BVP) in the state x : [0, ∞) → R^n and costate λ : [0, ∞) → R^n, which under some conditions is equivalent to the value gradient along the optimal trajectory. There are mature BVP solvers that can be used for such computations.
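To illustrate the indirect approach, the following sketch solves the PMP two-point BVP for a hypothetical scalar OCP (minimize the integral of x² + u² subject to ẋ = x + u), approximating the infinite horizon by a large final time T and checking λ(0) against the known Riccati solution. The problem and horizon are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Indirect ("optimize then discretize") data generation, sketched on a scalar
# OCP: minimize  integral of x^2 + u^2  subject to  xdot = x + u,  x(0) = x0.
# PMP gives u* = -lam/2 and the two-point BVP
#   xdot = x - lam/2,   lamdot = -2x - lam,   x(0) = x0,   lam(T) = 0.
x0, T = 1.0, 10.0

def ode(t, y):                    # y = [x, lam], shape (2, n_mesh)
    x, lam = y
    return np.vstack((x - lam / 2, -2 * x - lam))

def bc(ya, yb):                   # boundary residuals at t = 0 and t = T
    return np.array([ya[0] - x0, yb[1]])

t = np.linspace(0.0, T, 100)
guess = np.vstack((x0 * np.exp(-t), np.zeros_like(t)))
sol = solve_bvp(ode, bc, t, guess, tol=1e-5)
assert sol.success

# Each solved trajectory yields training pairs (x(t_k), lam(t_k), u*(t_k)).
# For this problem the ARE gives P = 1 + sqrt(2), so lam(0) should be ~2*P*x0.
assert abs(sol.y[1, 0] - 2 * (1 + np.sqrt(2)) * x0) < 1e-2
```

In the supervised learning pipeline, many such BVP solutions (one per sampled initial condition) are aggregated into the data set D.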
Direct methods take the "discretize then optimize" approach, transforming the OCP (1) into a nonlinear programming problem. In contrast to indirect methods, direct methods do not solve the PMP necessary conditions explicitly; costate data can nevertheless be extracted from the solution of the discretized OCP [33], [34]. Note that there is no need to distinguish data from different trajectories, as the value function and optimal feedback control are time-independent. In this paper we employ an indirect method for the Burgers'-type PDE stabilization problem (Section V) and a Radau direct method for the UAV problem (Section VI).

B. SUPERVISED LEARNING
Once a set of training data is available, the next step is training, i.e. data-driven optimization. Denoting the model parameters (i.e. the NN weights and biases) by θ ∈ R^p, the NN is trained by minimizing a mean squared error (MSE) loss function:

min_θ (1/N) Σ_{i=1}^N ‖ y^(i) − u(x^(i); θ) ‖^2,    (22)

where y^(i) denotes the target data (the value function, value gradient, or optimal control, depending on the architecture). As is standard in machine learning, the models learn on data which has been scaled to the range [−1, 1], and the output is accordingly rescaled to the original domain when ultimately used for control.
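The optimization step in (22) can be sketched as follows, for a hypothetical linear model fit by plain gradient descent (the paper uses L-BFGS on NNs, but the loss and data flow are the same):

```python
import numpy as np

# Supervised learning step, sketched: given pairs (x_i, y_i) with y_i = u*(x_i),
# minimize the MSE loss  L(theta) = (1/N) sum_i || y_i - u(x_i; theta) ||^2.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(256, 4))          # states, scaled to [-1, 1]
Y = X @ np.array([[1.0], [-2.0], [0.5], [0.0]])    # hypothetical targets u*(x)

theta = np.zeros((4, 1))                           # linear model u(x) = theta' x
for _ in range(500):                               # plain gradient descent on MSE
    resid = X @ theta - Y
    grad = 2.0 * X.T @ resid / len(X)              # gradient of the MSE loss
    theta -= 0.1 * grad

mse = float(np.mean((X @ theta - Y) ** 2))
assert mse < 1e-6
```

For an NN model, `theta` becomes the stacked weights and biases and the gradient is computed by backpropagation, but the training loop is conceptually identical.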
When training the value gradient models, one can augment the loss function (22) with an additional MSE term to learn the value gradient, and/or a term to minimize the residual of the HJB equation (4). The proposed NNs would also work well in conjunction with active learning methods [7].

C. QUANTIFYING MODEL ACCURACY
To quantify the accuracy of the model we generate a second test data set, D_test, from independently drawn initial conditions. During training, the NN sees only data points from the training set D_train, while D_test is reserved for evaluating approximation accuracy after training. A typical metric is the relative mean ℓ^2 error (24), where N_test denotes the number of test points x^(i) ∈ D_test. A low test error indicates that the NN generalizes well, i.e. it did not overfit the training data. However, even with low test error, there is a chance that the NN could still perform poorly when implemented in the closed loop system, as seen in Figure 1. For this reason we believe that test metrics like (24) are insufficient in the context of control design; we should instead focus on rigorous closed loop stability and performance tests such as those presented in Section V.
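One common form of such a relative mean ℓ² error can be sketched as below; the exact normalization in (24) is a design choice, and the one used here (mean prediction error divided by the scale of the true values) is an illustrative assumption.

```python
import numpy as np

# One common relative mean L2 test error: mean prediction error over the test
# set, normalized by the scale of the true values. (The normalization is an
# illustrative choice, not necessarily the one used in (24).)
def relative_mean_l2_error(y_true, y_pred):
    num = np.mean(np.linalg.norm(y_true - y_pred, axis=1))
    den = np.max(np.linalg.norm(y_true, axis=1))
    return num / den

y_true = np.array([[1.0, 0.0], [0.0, 2.0]])   # e.g. u*(x^(i)) on D_test
y_pred = np.array([[1.1, 0.0], [0.0, 1.9]])   # e.g. u(x^(i); theta)
err = relative_mean_l2_error(y_true, y_pred)
assert 0.0 < err < 0.1
```

A value well below one indicates good generalization on D_test, but, as argued above, says nothing by itself about closed loop stability.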

V. NUMERICAL RESULTS
In this section we compare the proposed controllers to standard feedforward NNs trained to approximate the value function, its gradient, and the optimal control. We also compare to V -QRnet [8], λ-QRnet, and u-QRnet [11]. We present results for three different tests: 1) linear stability near x f (Section B); 2) Monte Carlo (MC) nonlinear stability (Section C); 3) MC optimality analysis (Section D).
Such tests are of course familiar to the control community, but we believe it is worth emphasizing their importance since more rigorous and realistic testing is needed in order to start trusting NN controllers in real-world applications. We also note that these tests are just a starting point: further examples include stabilization time [13], time delay stability [26], and robustness to measurement noise, disturbances, and parameter variations.
The numerical results clearly illustrate that standard NNs are not consistently stable, even when they have good approximation accuracy. Meanwhile, the results confirm that the proposed architectures guarantee LAS while still being able to accurately approximate the nonlinear optimal control throughout the training domain.

A. UNSTABLE BURGERS'-TYPE PDE
To test the NN architectures we revisit the Burgers'-type PDE stabilization OCP from [8], [11]. This is a high-dimensional nonlinear OCP formulated by Chebyshev pseudospectral spatial discretization of an unstable version of Burgers' PDE. Briefly, the discretized dynamics and cost take the form (25), where x : [0, ∞) → R^n represents the PDE state X(t, ξ) collocated at spatial coordinates ξ_j = cos(jπ/n), j = 1, . . . , n, u : [0, ∞) → R^m is the control, D ∈ R^{n×n} is the Chebyshev differentiation matrix, Q ∈ R^{n×n} and R ∈ R^{m×m} are diagonal positive definite matrices, and "∘" denotes element-wise multiplication. The parameters ν, β > 0, α ∈ R^n, and B ∈ R^{n×m} are defined in [8], and we take n = 64 and m = 2. Initial conditions are also selected as in [8].
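For readers unfamiliar with the discretization, the Chebyshev points and differentiation matrix D can be built with the standard recipe; this sketch uses the full point set j = 0, . . . , N, whereas the problem above keeps only the n interior points.

```python
import numpy as np

# Standard Chebyshev collocation construction: points xi_j = cos(j*pi/N) and
# the differentiation matrix D mapping nodal values to derivative values.
def cheb(N):
    j = np.arange(N + 1)
    xi = np.cos(j * np.pi / N)
    c = np.hstack([2.0, np.ones(N - 1), 2.0]) * (-1.0) ** j
    XI = np.tile(xi, (N + 1, 1)).T
    dXI = XI - XI.T + np.eye(N + 1)        # identity avoids dividing by zero
    D = np.outer(c, 1.0 / c) / dXI
    D -= np.diag(D.sum(axis=1))            # diagonal = negative row sums
    return D, xi

D, xi = cheb(16)
# Differentiation is exact for polynomials of degree <= N (spectral accuracy):
assert np.allclose(D @ xi**3, 3 * xi**2, atol=1e-10)
```

In the discretized OCP, applying D to the state vector x approximates the spatial derivative X_ξ at the collocation points.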
We generate data by solving the OCP (1) for randomly sampled initial conditions, using an indirect method. For this problem we reliably obtain high quality data with the SciPy [35] implementation of the two-point BVP solver [36]. To get models with varying approximation accuracy, we generate training data sets with different numbers of trajectories.
Note that because data generation depends on random sampling and (22) is a highly non-convex optimization problem, results can vary for different random seeds. To account for this, for each data set size we conduct ten trials with different randomly generated training trajectories, and evaluate the test error (24) on an independent test data set containing 500 trajectories. V-NN and V-QRnet are trained as described in [8]. For the value gradient networks we do not use the value gradient loss term (23), since for this problem it did not improve results. To be consistent, all NNs have L = 5 hidden layers with w = 32 neurons each and tanh(·) nonlinearities. Optimization of (22) is carried out with L-BFGS [37], which stops when the relative change in the loss is sufficiently small. All models are trained on an NVIDIA RTX 2080Ti GPU. Figure 3 shows training times and test accuracies of the NNs. We see that models which approximate the value gradient, especially λ_Jac-QRnet and λ_mat-QRnet, take the longest to train because of the large number of NN parameters. Despite this, the training time is still very reasonable at under eight minutes. We also find that the new architectures have similar test accuracy statistics to the standard NNs, confirming that they can learn complicated nonlinear functions as suggested by Theorems 1 and 2. For this problem there is no clear performance distinction between the "Jacobian" and "matrix" architectures, or between the λ and u models.

B. LOCAL STABILITY VERIFICATION
As a first step we assess the local stability of each NN-controlled system. Let x̄ ∈ R^n be an equilibrium of the closed loop system ẋ = f(x, u(x)), i.e. f(x̄, u(x̄)) = 0. Note that the NN controllers introduced in Section III always have f(x_f, u(x_f)) = 0, but x_f may not be a closed loop equilibrium for V-NN, λ-NN, u-NN, and V-QRnet. Thus for these controllers we use a root-finding algorithm to locate a closed loop equilibrium x̄ near x_f. The dynamics near x̄ can be approximated by ẋ ≈ A_cl(x − x̄), where

A_cl = [ ∂f/∂x + (∂f/∂u)(∂u/∂x) ](x̄)

is the closed loop Jacobian. Therefore, after synthesizing a feedback controller we can easily verify local stability by checking whether A_cl is Hurwitz. Furthermore, as noted in [26], one benefit of using an NN controller with differentiable activation functions is that the closed loop dynamics are locally C^1. This allows one to use tools from linear systems theory to characterize the local stability of x̄. Figure 4 shows the real part of the most positive eigenvalue of A_cl for each NN. We find that standard NNs must be trained to a high level of test accuracy before they are even locally stable, which necessitates a large data set and long training time. On the other hand, λ-QRnet, u-QRnet, and the new "Jacobian" and "matrix" QRnets all yield LAS even when trained on small data sets. Recall that Proposition 1 guarantees this for the new architectures.
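This local stability audit amounts to a few lines: form A_cl by finite differences at the equilibrium and check that it is Hurwitz. The dynamics and feedback law below are hypothetical stand-ins for f and an NN controller.

```python
import numpy as np

def f(x, u):                                  # hypothetical open loop dynamics
    return np.array([x[1], x[0] + x[0] ** 3 + u[0]])

def u_fb(x):                                  # hypothetical feedback law
    return np.array([-3.0 * x[0] - 3.0 * x[1]])

def closed_loop_jacobian(f, u_fb, x_bar, eps=1e-6):
    # Central finite differences of the closed loop vector field g(x) = f(x, u(x))
    # give A_cl = df/dx + (df/du)(du/dx) evaluated at the equilibrium x_bar.
    g = lambda x: f(x, u_fb(x))
    n = len(x_bar)
    A_cl = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        A_cl[:, i] = (g(x_bar + e) - g(x_bar - e)) / (2 * eps)
    return A_cl

A_cl = closed_loop_jacobian(f, u_fb, np.zeros(2))
assert np.all(np.linalg.eigvals(A_cl).real < 0)   # Hurwitz: locally stable
```

For the NN controllers, the same check is run at the located equilibrium x̄; the quantity plotted in Figure 4 is the largest real part of the eigenvalues of A_cl.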

C. MONTE CARLO STABILITY ANALYSIS
Here and in Section D we conduct Monte Carlo (MC) closed loop simulations. We randomly select N_MC = 100 initial conditions x^(i), placing them at the edge of the training domain, where the NNs may be less accurate and the system harder to control. We stop each simulation when the system reaches a steady state or exceeds a large final time. We call the largest observed final error, max_i ‖x^(i)(t_f) − x_f‖, the worst-case failure. If this is sufficiently small then the closed loop nonlinear system is likely to be semi-globally stable. Figure 5 shows the worst-case failures for each controller. We find that only the most accurate standard NNs stabilize the origin, whereas all controllers from Section III stabilize all the MC trajectories. These empirical results suggest that the proposed architectures not only guarantee LAS, but also make the control design process more reliable, consistently yielding a stabilizing control law even with small data sets and short training times.
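The MC stability test can be sketched as follows, with a hypothetical stable closed loop system standing in for the NN-controlled dynamics; initial conditions are pushed to the edge of a unit box "training domain" and the worst-case final error is recorded.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Monte Carlo stability test, sketched: simulate the closed loop system from
# initial conditions on the edge of the training domain and record the
# worst-case final error  max_i ||x^(i)(t_f) - x_f||.  (Hypothetical dynamics.)
def closed_loop(t, x):
    return np.array([x[1], -2.0 * x[0] - 3.0 * x[1] + 0.1 * x[0] ** 3])

rng = np.random.default_rng(2)
x_f, t_f, n_mc = np.zeros(2), 30.0, 20
finals = []
for _ in range(n_mc):
    x0 = rng.uniform(-1.0, 1.0, size=2)
    x0 /= np.max(np.abs(x0))               # push the sample to the domain edge
    sol = solve_ivp(closed_loop, (0.0, t_f), x0, rtol=1e-8, atol=1e-10)
    finals.append(np.linalg.norm(sol.y[:, -1] - x_f))

worst_case = max(finals)
assert worst_case < 1e-3                   # all MC trajectories stabilized
```

In the paper's experiments the same loop is run with N_MC = 100 and each NN controller in place of the hypothetical `closed_loop` vector field; the worst-case failures are what Figure 5 reports.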

D. MONTE CARLO OPTIMALITY ANALYSIS
In this paper we are interested in both stability and optimality. Optimality of a given controller u(·) can be quantified by the accumulated cost J[u(·); x^(i)]. Figure 6 shows the results of this analysis for the same set of MC simulations conducted in Section C. Among the stabilizing NN controllers there is a clear correlation between higher test accuracy and better performance. All the stabilizing NN controllers follow this trend and perform better than LQR alone. It follows that the proposed architectures improve stability without limiting optimality.

VI. APPLICATION EXAMPLE: FIXED-WING UAV
In this section we illustrate how the proposed control architectures can be used with supervised learning to design an optimal feedback controller for a fixed-wing 6DoF UAV. The controller is designed for stabilization from a wide range of flight conditions, as well as tracking of arbitrary altitude and course commands, a challenging nonlinear OCP.

A. FIXED-WING UAV DYNAMICS
The dynamic model we use is based on the one presented in [38], [39]. We review it here to orient the reader and point out several small differences. The position of the UAV is described in inertial north-east-down coordinates, p := (p_n, p_e, p_d)^T. The velocities in the body x, y, and z directions are denoted V := (u, v, w)^T. The attitude of the UAV, i.e. its rotation from the inertial to the body frame, is described using quaternions q := (q_0, q̄^T)^T, where q_0 is the scalar quaternion and q̄ := (q_1, q_2, q_3)^T is the vector quaternion. The angular velocity of the UAV in the body frame is written as ω := (p, q, r)^T. The full state is then x := (p^T, V^T, q^T, ω^T)^T ∈ R^13. The UAV is controlled with a throttle δ_t ∈ [0, 1], ailerons δ_a ∈ [−δ_a^+, δ_a^+], elevator δ_e ∈ [−δ_e^+, δ_e^+], and rudder δ_r ∈ [−δ_r^+, δ_r^+]. Thus u := (δ_t, δ_a, δ_e, δ_r)^T ∈ U ⊂ R^4.
Modeling the UAV as a rigid body we obtain the dynamic equations [38]

ṗ = R_q^{−1}(V),
V̇ = F/m − ω × V,
q̇ = (1/2) q ⊗ (0, ω^T)^T,
ω̇ = J^{−1}( M − ω × (Jω) ).

Here R_q : R^3 → R^3 is the rotation (computed using the attitude q) from the inertial to the body frame, R_q^{−1}(·) is the inverse rotation from body to inertial frame, and "⊗" denotes the quaternion product. Next, m is the mass of the UAV and J ∈ R^{3×3} is the UAV's inertia matrix. Finally, F = F(x, u) and M = M(x, u) are the external forces and moments acting on the vehicle, expressed in the body frame. These arise as a result of gravity, aerodynamics, and control inputs. In the following presentation we ignore the effects of wind for simplicity.
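The attitude bookkeeping can be sketched numerically as below. Conventions vary between references; this sketch assumes a scalar-first unit-quaternion convention and is an illustration, not a transcription of [38].

```python
import numpy as np

# Quaternion attitude sketch (scalar-first convention, assumed for
# illustration): the rotation matrix built from q = (q0, q1, q2, q3) and the
# kinematics qdot = (1/2) Omega(omega) q, with omega = (p, q, r).
def rotation_matrix(q):
    q0, q1, q2, q3 = q / np.linalg.norm(q)
    return np.array([
        [q0**2 + q1**2 - q2**2 - q3**2, 2*(q1*q2 + q0*q3), 2*(q1*q3 - q0*q2)],
        [2*(q1*q2 - q0*q3), q0**2 - q1**2 + q2**2 - q3**2, 2*(q2*q3 + q0*q1)],
        [2*(q1*q3 + q0*q2), 2*(q2*q3 - q0*q1), q0**2 - q1**2 - q2**2 + q3**2],
    ])

def q_dot(q, omega):
    p_, q_, r_ = omega
    Omega = np.array([[0.0, -p_, -q_, -r_],
                      [p_,  0.0,  r_, -q_],
                      [q_, -r_,  0.0,  p_],
                      [r_,  q_, -p_,  0.0]])
    return 0.5 * Omega @ q        # Omega is skew, so ||q|| is preserved

R = rotation_matrix(np.array([1.0, 0.0, 0.0, 0.0]))
assert np.allclose(R, np.eye(3))              # identity attitude: no rotation
q = np.array([0.9, 0.1, 0.2, 0.3])
q /= np.linalg.norm(q)
assert np.allclose(rotation_matrix(q) @ rotation_matrix(q).T, np.eye(3))
```

Because Omega is skew-symmetric, the kinematics preserve the unit norm of q exactly, which is what keeps `rotation_matrix(q)` a proper rotation during simulation.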
The first force is gravity, which can be expressed in the body frame as F_gravity = R_q((0, 0, mg)^T), where g is the gravitational constant. Next we employ a linear propeller model based on [38],

F_prop = (1/2) ρ S_prop C_prop [(k_motor δ_t)^2 − ‖V‖^2],

where ρ is the air density, S_prop = π R_prop^2 is the propeller disk area with R_prop the propeller blade length, and k_motor and C_prop parameterize thrust efficiency. Finally, the aerodynamic forces F_aero = (F_x, F_y, F_z)^T and moments M = (M_ℓ, M_m, M_n)^T are in general complicated nonlinear functions that must be modeled from experimental data. In this work we use the basic models from [38], with slight nonlinear modifications to the drag and pitching moment models to improve their post-stall realism. The longitudinal forces F_x and F_z are obtained by rotating the lift and drag forces,

F_lift = (1/2) ρ ‖V‖^2 S [C_L(α) + C_Lq c/(2‖V‖) q + C_Lδe δ_e],
F_drag = (1/2) ρ ‖V‖^2 S [C_D(α) + C_Dq c/(2‖V‖) q + C_Dδe δ_e],

from the wind frame into the body frame, where α = tan⁻¹(w/u) is the angle of attack, S is the wing area, c is the mean aerodynamic chord, and C_Lq, C_Lδe, C_Dq, C_Dδe are modeling parameters. As in [38] we model

C_L(α) = (1 − σ_b(α)) (C_L0 + C_Lα α) + σ_b(α) [2 sign(α) sin^2(α) cos(α)],

where C_L0 and C_Lα are modeling parameters and σ_b(α) is a smooth blending function with σ_b(α) ≈ 0 for |α| < α_stall and σ_b(α) ≈ 1 for |α| > α_stall, α_stall being the stall angle of attack; see [38] for details. For the drag model we use a blend of a quadratic model and a post-stall flat plate model [40],

C_D(α) = (1 − σ_b(α)) [C_D0 + (C_L0 + C_Lα α)^2 / (π e b^2/S)] + σ_b(α) [2 sin^2(α)],

where C_D0 is the parasitic drag, b is the wingspan, and e is another modeling parameter. We similarly modify the pitching moment model from [38] to be nonlinear in α, blending the usual linear pre-stall model with a saturating post-stall model, where C_mq, C_mδe, C_m0, C_mα, and C_m∞ are modeling parameters. The remaining lateral aerodynamics, F_y, M_ℓ, and M_n, are functions of ‖V‖, p, r, δ_a, δ_r, and the sideslip angle β = sin⁻¹(v/‖V‖). These models are the same as in [38] and are omitted for brevity. The values of the constants used in this problem are taken from [39], with the exception of C_prop = 0.45, k_motor = 32, α_stall = 20°, and C_m∞ = 0.8. Note that α_stall = 20° is lower than in [38], [39], making the model more realistic and challenging to control.
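The pre-/post-stall blending idea can be sketched as follows. The sigmoid-based form of σ_b and its transition rate M = 50, as well as the values of C_L0 and C_Lα, are illustrative assumptions; the text only requires σ_b(α) ≈ 0 below the stall angle and σ_b(α) ≈ 1 above it.

```python
import numpy as np

def sigma_blend(alpha, alpha_stall=np.deg2rad(20.0), M=50.0):
    """Smooth blending function: ~0 for |alpha| < alpha_stall, ~1 beyond it.
    The sigmoid form and rate M are illustrative, not from the paper."""
    num = 1 + np.exp(-M * (alpha - alpha_stall)) + np.exp(M * (alpha + alpha_stall))
    den = (1 + np.exp(-M * (alpha - alpha_stall))) * (1 + np.exp(M * (alpha + alpha_stall)))
    return num / den

def lift_coefficient(alpha, C_L0=0.28, C_Lalpha=3.45):
    """Blend of a linear pre-stall lift model and a flat-plate post-stall model.
    C_L0 and C_Lalpha are placeholder values."""
    s = sigma_blend(alpha)
    linear = C_L0 + C_Lalpha * alpha                              # pre-stall
    flat_plate = 2.0 * np.sign(alpha) * np.sin(alpha)**2 * np.cos(alpha)  # post-stall
    return (1 - s) * linear + s * flat_plate
```

Below stall the linear term dominates; well past stall (e.g. α ≈ 34°) the flat-plate term takes over, which is what makes the model better behaved in the post-stall regime.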

B. OPTIMAL CONTROL PROBLEM FORMULATION
We aim to design a feedback controller to stabilize the UAV and track any desired altitude h_f = −p_{d,f} and course angle χ_f = tan⁻¹(ṗ_e/ṗ_n). Let (x_f, u_f) be the pair of trim states and controls computed for a desired airspeed V_f. The UAV is in trim if f(x_f, u_f) = 0, except for ṗ_n and ṗ_e. Note that the dynamics are invariant to p, so we can choose an arbitrary trim altitude. The dynamics (excepting ṗ_n and ṗ_e) are also invariant to rotations of the inertial reference frame about the inertial z axis, which allows us to use the same trim attitude q_f to express any desired yaw angle ψ_f. When the vehicle is in trim (and in the absence of wind) the yaw angle ψ equals the course angle χ, so this formulation allows arbitrary course tracking. A suitable running cost for this OCP penalizes the altitude, velocity, attitude, and angular rate errors and the control effort, with weights Q_h, h_ceil > 0, positive definite Q_V, Q_q, Q_ω ∈ R^{3×3}, and positive definite R ∈ R^{4×4}. Notice that the altitude cost is locally quadratic but saturates for |p_d − p_{d,f}| ≥ h_ceil, preventing extreme maneuvers when the commanded altitude changes. We set the desired airspeed to V_f = 20 m/s. Initial conditions are uniformly sampled from a large domain chosen to elicit a wide range of nonlinear dynamics; here ψ_0, θ_0, φ_0 denote the initial yaw, pitch, and roll angles, which are converted to the initial quaternion q_0. Recall that we can set p_{d,f} = 0 and ψ_f = 0 without loss of generality, and thus the initial condition determines the initial altitude and course errors.
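A locally quadratic altitude penalty that saturates beyond h_ceil can be illustrated with a simple smooth function. The exponential form below is a stand-in for the paper's expression (which is not reproduced here), chosen only because it behaves like Q_h·Δh² near zero and levels off near Q_h·h_ceil²; Q_h and h_ceil values are placeholders.

```python
import numpy as np

def altitude_cost(dh, Q_h=1.0, h_ceil=10.0):
    """Locally quadratic altitude penalty that saturates for |dh| >~ h_ceil.
    Illustrative stand-in for the paper's saturating altitude cost."""
    return Q_h * h_ceil**2 * (1.0 - np.exp(-(dh / h_ceil)**2))
```

For small altitude errors the penalty matches the quadratic Q_h·Δh², while very large commanded altitude changes incur a bounded cost, which is what discourages extreme maneuvers.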
For this high-dimensional and highly nonlinear OCP we found that indirect methods were unreliable for generating data. For this reason we generate data with a direct method, the Radau pseudospectral method [33]. To the best of our knowledge this is the first use of pseudospectral methods for supervised learning. To obtain good quality open loop OCP data we use a large number of Radau collocation points and set stringent tolerances for the nonlinear programming solver. Note that direct methods typically provide optimal state and control pairs, but obtaining the costate entails extra computational effort. For this reason, in this section we only show results for u-NN, u-QRnet, u_Jac-QRnet, and u_mat-QRnet, which directly approximate the optimal control and do not need costate data.
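Schematically, the data-generation step pairs each sampled initial condition with the optimal states and controls returned by the direct solver. In the sketch below, `solve_ocp` is a hypothetical placeholder for the Radau pseudospectral solve; only the assembly of supervised-learning pairs (x, u*) is shown.

```python
import numpy as np

def build_dataset(x0_samples, solve_ocp):
    """Assemble supervised-learning pairs (x, u*) from open loop OCP solutions.

    `solve_ocp` is a placeholder for a direct (e.g. Radau pseudospectral)
    solver that returns optimal states xs (N, n_x) and controls us (N, n_u)
    evaluated at the collocation points of one trajectory."""
    X, U = [], []
    for x0 in x0_samples:
        xs, us = solve_ocp(x0)
        X.append(xs)
        U.append(us)
    # stack all trajectories into one flat data set of state/control pairs
    return np.vstack(X), np.vstack(U)
```

Note that because only (state, control) pairs are collected, this pipeline supports the control-approximating architectures directly, with no costate computation.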

C. LEARNING RESULTS
For this problem we generate training data sets with 32, 64, 128, and 256 trajectories each. For each data set we train each type of NN controller with different weight initializations; as before, we conduct ten trials for each data set size. We evaluate the RM² error (24) on an independent data set with 100 trajectories. As in Section V, all NNs have L = 5 hidden layers with w = 32 neurons each and tanh(·) nonlinearities. Because these data sets are too large for full-batch optimization we use Adam [41] with a learning rate of 10⁻³, batch sizes of 256 data points, and 1500 epochs.
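The architecture and optimizer described above can be sketched in plain numpy. Backpropagation is omitted for brevity (in practice a framework such as PyTorch would supply gradients), so `adam_step` takes precomputed gradients; this is an illustrative sketch, not the authors' training code.

```python
import numpy as np

def init_mlp(n_in, n_out, width=32, depth=5, seed=0):
    """MLP with `depth` hidden tanh layers of `width` neurons, as in Section V.
    Parameters are stored as a flat list [W1, b1, W2, b2, ...]."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [width] * depth + [n_out]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        params.append(rng.normal(0.0, np.sqrt(1.0 / fan_in), (fan_in, fan_out)))
        params.append(np.zeros(fan_out))
    return params

def forward(params, x):
    """Evaluate the network; tanh on hidden layers, linear output layer."""
    n_layers = len(params) // 2
    for i in range(n_layers):
        x = x @ params[2 * i] + params[2 * i + 1]
        if i < n_layers - 1:
            x = np.tanh(x)
    return x

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update [41] with bias-corrected moment estimates m, v."""
    t += 1
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g**2
        m_hat = m[i] / (1 - b1**t)
        v_hat = v[i] / (1 - b2**t)
        new_params.append(p - lr * m_hat / (np.sqrt(v_hat) + eps))
    return new_params, m, v, t
```

For the UAV problem the input dimension would be 13 (the full state) and the output dimension 4 (throttle and three control surfaces); on the first step with unit gradients, Adam's bias correction makes the update magnitude equal the learning rate.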

This article has been accepted for publication in IEEE Open Journal of Control Systems. This is the author's version, which has not been fully edited; content may change prior to final publication.

Similar to the results shown in Figure 4, we again find that well-trained u-NNs may fail to even locally stabilize the system. Furthermore, closed loop equilibria under u-NN control are often far from x_f. In the physical system this corresponds to steady state altitude, course, and attitude errors, even when said equilibrium is stable (see Figure 1). Figure 7 shows the worst-case norm for a set of N_MC = 100 closed loop simulations. These simulations demonstrate how challenging the UAV is to control over this large spatial domain. First we notice that LQR is not globally stabilizing for this OCP. Next we observe that most standard u-NNs, even the well-trained ones, do not stabilize x_f. u-QRnet, u_Jac-QRnet, and u_mat-QRnet also have some difficulty with semi-global stabilization, though they clearly do better than u-NN. Notably, these controllers are able to stabilize trajectories that LQR fails to stabilize, even though they are built on top of LQR.
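The Monte Carlo stability test can be sketched as a closed loop simulation that records the largest final error over the sampled initial conditions. The RK4 integrator with a zero-order hold on the control, and the horizon and step-size defaults, are illustrative choices.

```python
import numpy as np

def rk4_step(f, x, u, dt):
    """One RK4 step of x' = f(x, u) with the control u held constant."""
    k1 = f(x, u)
    k2 = f(x + dt / 2 * k1, u)
    k3 = f(x + dt / 2 * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def worst_case_final_error(f, controller, x0_batch, x_f, T=30.0, dt=0.01):
    """Monte Carlo closed loop test: simulate each initial condition under the
    feedback `controller` and report the largest final error ||x(T) - x_f||,
    the kind of statistic plotted in Figure 7."""
    worst = 0.0
    for x0 in x0_batch:
        x = np.array(x0, dtype=float)
        for _ in range(int(T / dt)):
            x = rk4_step(f, x, controller(x), dt)
        worst = max(worst, np.linalg.norm(x - x_f))
    return worst
```

A controller is judged (semi-globally) stabilizing on the sampled domain when this worst-case error is small; for a stable scalar system x' = -x with zero control, the error decays to near zero from every initial condition.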
Finally, Figure 8 shows the average performance of each controller in terms of minimizing the cost functional J[u(·)]. We again see that most NN controllers perform better than LQR on average, indicating that they do learn the optimal policy reasonably well. We also see that the standard u-NNs have slightly higher test accuracy, suggesting that for this OCP the training loss (22) converges faster for the standard architecture than for the modified architectures (i.e. it requires fewer gradient descent steps). Despite this, u-QRnet, u_Jac-QRnet, and u_mat-QRnet perform just as well or better in terms of closed loop stability and optimality. We expect that all methods would improve with larger data sets, more training epochs, and hyperparameter tuning.

D. EXAMPLE CLOSED LOOP SIMULATION
We conclude this section with an illustrative example of an NN-in-the-loop simulation. We conduct the simulation for the same initial condition as in Figure 1, this time with a u_mat-QRnet controller trained on 256 trajectories. A view of the closed loop trajectory is presented in Figure 9, and detailed time series of system states and feedback controls are shown in Figure 10. Notice that the UAV begins off course and pitched down with a large negative pitch rate. For this initial condition, LQR is 36.44% suboptimal while u_mat-QRnet is 0.95% suboptimal. This simulation highlights the potential benefit of using NN optimal feedback controllers to achieve good performance in nonlinear systems.
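Assuming the quoted percentages are defined in the usual way, the suboptimality of a controller is the relative excess of its closed loop cost over the optimal open loop cost:

```python
def suboptimality_pct(J_ctrl, J_opt):
    """Percent suboptimality of a controller relative to the optimal cost J_opt."""
    return 100.0 * (J_ctrl - J_opt) / J_opt
```

For example, a closed loop cost 1.3644 times the optimal cost corresponds to 36.44% suboptimality, matching the LQR figure quoted above.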

VII. CONCLUSION
In this paper we have shown that NN feedback controllers can frequently fail to stabilize a system, even when they are trained to a high degree of accuracy. This occurs frequently enough that it cannot be ignored. One strategy to make NN feedback controllers more viable is through the use of specialized NN architectures. To this end we propose four new model architectures which guarantee (at least) LAS while retaining the approximation capacity necessary to learn the full nonlinear optimal control and provide nonlinear stability on semi-global domains. A summary of the control architectures discussed in this paper is given in Table 1.
In Section V we evaluated the proposed architectures through a series of practical closed loop stability and optimality tests, demonstrating their advantages over standard NNs. Finally in Section VI we illustrated how the proposed architectures might be used with supervised learning to design optimal feedback controllers for challenging, practical systems.
In particular, the architectures that directly approximate the optimal control can be implemented even when it is not possible to solve (6) for the optimal control, and when it is difficult to generate accurate costate data.
To complement the NN architectures presented in this paper, in future work we intend to develop mathematical tools to explain the behavior of NN feedback controllers. In particular, we would like to better understand what causes seemingly-accurate NN models to fail at stabilizing a system, as well as why and to what extent the novel NN architectures improve semi-global system stability. Such theoretical advances will be necessary if supervised learning is to become a reliable and commonly accepted control design method.

APPENDIX: PROOFS OF APPROXIMATION CAPACITY
To prove the approximation capacity theorems, we first specialize the classic Stone-Weierstrass theorem (stated below for reference) to approximation of locally C¹ functions. This result is given as Corollary 1. The proofs of Theorems 1 and 2, in Sections A and B respectively, then apply classical NN approximation theory [28] to the approximating function from Corollary 1 to obtain the desired results.
By C(X) and C(X; R^d) we denote the spaces of continuous functions on X taking values in R and R^d, respectively. These function spaces are algebras: a set of functions A is an algebra if it is closed under (element-wise) addition, multiplication, and scalar multiplication. A subalgebra of A is a subset of A which is itself an algebra.

Theorem 3 (Stone-Weierstrass [29]):
Suppose that A is a subalgebra of C(X) which separates points and vanishes nowhere in X. Then for all f ∈ C(X) and all ε > 0 there exists g ∈ A satisfying max_{x∈X} |f(x) − g(x)| < ε.
Proof: Consider the set A ⊂ C¹(X) of functions g which satisfy [∂g/∂x](0) = 0. We claim that A is an algebra which vanishes nowhere and separates points. It is easy to verify that A is closed under addition, multiplication, and scalar multiplication; hence A is an algebra. A also contains the constant functions, so it vanishes nowhere. Lastly, to see that