Learning Constrained Parametric Differentiable Predictive Control Policies With Guarantees

Ján Drgoňa, Member, IEEE, Aaron Tuor, and Draguna Vrabie, Member, IEEE

Abstract—We present differentiable predictive control (DPC), a method for offline learning of constrained neural control policies for nonlinear dynamical systems with performance guarantees. We show that the sensitivities of the parametric optimal control problem can be used to obtain direct policy gradients. Specifically, we employ automatic differentiation (AD) to efficiently compute the sensitivities of the model predictive control (MPC) objective function and constraints penalties. To guarantee safety upon deployment, we derive probabilistic guarantees on closed-loop stability and constraint satisfaction based on indicator functions and Hoeffding's inequality. We empirically demonstrate that the proposed method can learn neural control policies for various parametric optimal control tasks. In particular, we show that the proposed DPC method can stabilize systems with unstable dynamics, track time-varying references, and satisfy nonlinear state and input constraints. Our DPC method offers practical time savings compared to alternative approaches for fast and memory-efficient controller design. Specifically, DPC does not depend on a supervisory controller, as opposed to approximate MPC based on imitation learning. We demonstrate that, without losing performance, DPC is scalable with greatly reduced demands on memory and computation compared to implicit and explicit MPC, while being more sample efficient than model-free reinforcement learning (RL) algorithms.

Index Terms—Differentiable programming, learning for control, parametric optimal control, policy optimization (PO).

I. INTRODUCTION
DATA-DRIVEN dynamics modeling and control policy learning show promise to "democratize" advanced control to systems with complex and partially characterized dynamics. However, purely data-driven methods typically suffer from poor sampling efficiency, scale poorly with problem size, and exhibit slow convergence to optimal decisions [1]. Of special concern is the lack of guarantees that black-box data-driven controllers will satisfy operational constraints and maintain safe operation.
Model predictive control (MPC) methods can offer optimal performance with systematic constraints handling. However, their solution is based on real-time optimization that might be computationally prohibitive for some applications. Explicit MPC (eMPC) [2] aims to overcome the limits of online optimization by precomputing the control law offline for a given set of admissible parameters. However, obtaining eMPC policies via multiparametric programming (mpP) [3] is computationally expensive, limiting its applicability to small-scale systems. Addressing the limitations of eMPC, several authors have proposed using deep learning to approximate the control policies given example data sets computed by a supervisory MPC [4], [5].

A. Contributions
In this article, we expand our prior work on differentiable predictive control (DPC), a method for offline learning of explicit predictive control policies [6], [7], [8]. DPC is based on differentiable programming [9], where the mathematical operations associated with evaluating the MPC problem's constraints and objective function are represented as a directed acyclic computational graph implemented in a language supporting automatic differentiation (AD). This allows us to obtain direct policy gradients of the MPC problem suitable for learning constrained control policies using stochastic gradient descent (SGD). We report the following contributions.
1) An offline policy optimization (PO) algorithm based on gradients computed by differentiating the MPC problem.
2) Probabilistic closed-loop stability and constraint satisfaction guarantees based on Hoeffding's inequality.
3) Five numerical studies that compare DPC with MPC, approximate MPC (aMPC), and model-free reinforcement learning (RL) algorithms, demonstrating that: a) DPC can learn to stabilize unstable systems; b) DPC can handle parametric nonlinear constraints; c) DPC is more sample efficient than RL algorithms; d) DPC has faster execution time than implicit MPC; and e) DPC scales better compared to eMPC.
4) Open-source code implementation [10].

B. Related Work
1) Learning-Based Model Predictive Control:
In general, learning-based MPC (LBMPC) methods [11], [12] are based on learning the system dynamics model from data and can be considered generalizations of classical adaptive MPC.
To make LBMPC tractable, the performance and safety tasks are decoupled by using reachability analysis [13], [14]. Variations include formulations of robust or stochastic MPC with state-dependent uncertainty for data-driven linear models [15], Gaussian process (GP) models [16], kernel regression models [17], Koopman operator models [18], recurrent neural networks [19], [20], fuzzy neural networks [21], [22], self-organizing radial basis function neural networks [23], or iterative model updates for linear systems with bounded uncertainties and robustness guarantees [24]. For a comprehensive review of LBMPC approaches, we refer the reader to a recent review [25] and references therein. In general, LBMPC methods are based on the online solution of the corresponding optimization problem. In contrast, in the proposed DPC methodology, the neural policy is learned offline and hence represents an explicit solution to the underlying parametric optimal control problem (pOCP).
2) Explicit Model Predictive Control: For a certain class of small-scale MPC problems [26], [27], the solution can be precomputed offline using mpP [3], [28] to obtain a so-called eMPC policy [29], [30]. The benefits of eMPC are typically faster online computations, exact bounds on worst-case execution time, and simple and verifiable policy code, which makes it a suitable approach for embedded applications. However, eMPC suffers from the curse of dimensionality, scaling geometrically with the number of constraints. This severely limits its practical applicability to small-scale systems with short prediction horizons [2], even after applying various complexity reduction methods [31], [32]. The DPC method proposed here presents a scalable alternative for obtaining eMPC policies for linear systems by employing principles of data-driven constrained differentiable programming.
3) Approximate Model Predictive Control: Targeting the scalability issues of eMPC, authors in the control community proposed aMPC [33], [34], [35], [36], [37], whose solution is based on supervised learning of control policies imitating the original MPC. An interesting theoretical connection was made by Karg and Lucia [38], showing that every piecewise affine (PWA) control policy can be exactly represented by a deep ReLU network [39]. This means that the optimality of the neural control policy will depend only on the quality of the training data and the formulation of the learning problem. However, the remaining disadvantage of aMPC is its inherent dependency on the solution of the original MPC problem. This article presents an alternative method for computing scalable explicit control policies for linear systems subject to nonlinear constraints. Our approach is based on differentiable programming and avoids the need for a supervisory MPC controller, as required by aMPC. Moreover, unlike implicit MPC, aMPC methods alone do not provide performance guarantees. This limitation has previously been addressed either by involving optimality checks with backup controllers [40], projections of the control actions onto the feasible set [41], or by providing sampling-based probabilistic performance guarantees [42]. In this work, we adapt the probabilistic performance guarantees introduced in [42] in the context of unsupervised learning of constrained control policies via the proposed DPC method.
4) Safe Learning-Based Control: For imposing stability guarantees in deep learning-based control applications, one could employ data-driven methods based on learning Lyapunov function candidates for stable dynamics models and control policies as part of the neural architecture [8], [43], [44], constructing analytical Lyapunov functions [45], or attraction domain criteria [46]. Others [47] have employed semidefinite programming for safety verification and robustness analysis of feed-forward neural network control policies. Agrawal et al. [48] represented constrained optimization problems as implicit layers in deep neural networks, leading to the introduction of neural network policy architectures designed to handle predefined constraints [49], [50]. Peng et al. [51] proposed a novel proportional-integral Lagrangian method to handle chance constraints in the RL setting. A conceptual idea of backpropagating through a learned system model parametrized via convex neural networks was investigated in [52]. Zhang et al. [53] and Ding et al. [54] studied the convergence of PO with robustness guarantees in the context of constrained model-based RL. Others [55], [56] have recently proposed modifications of RL algorithms that can handle constrained-input systems. The proposed work falls in the category of domain-aware differentiable architectures that can handle both state and input constraints emerging in MPC problem formulations.

5) Differentiable Model Predictive Control:
In recent years, works [50], [57] have proposed differentiable MPC as a safe imitation learning method based on AD of the MPC or LQR optimization problems. In this context, the computed MPC sensitivities (i.e., backward gradients) are used in gradient-based update rules to learn the unknown dynamics models and MPC objective functions of the controlled system from sampled closed-loop data. Orthogonally, Zanon et al. [58] and Gros and Zanon [59] leveraged differentiable MPC as a policy approximation to provide safety guarantees in the context of RL algorithms. This is achieved by replacing the deep neural network policy with an online MPC scheme whose parameters are tuned via RL updates based on the MPC sensitivities obtained via AD. Dong et al. [60] used neural networks to reconstruct the unknown system dynamics and parameterize the critic network in an adaptive dynamic programming (ADP)-based nonlinear MPC scheme. Contrary to the methods mentioned above, the presented DPC method uses backward AD to compute the sensitivities of the MPC problem without the need for learning additional critic networks or supervisory signals.

II. DIFFERENTIABLE PREDICTIVE CONTROL
We present a new perspective on solving pOCPs by leveraging the connection between parametric and differentiable programming. In particular, we introduce a DPC method based on representing the constrained optimal control problem as a differentiable program and using AD to obtain direct policy gradients for gradient-based optimization. The proposed model-based DPC methodology is illustrated in Fig. 1. As a first step, the system's state-space model is obtained from system identification or physics-based modeling. Next, we construct a differentiable closed-loop system model by combining the state-space model and neural control policy in a computational graph. Finally, the neural control policy is optimized via backpropagation of the gradients of the MPC objective function and constraints penalties, formulated as soft constraints, through the unrolled closed-loop system dynamics.

A. Differentiable Predictive Control Problem Formulation
Compared to MPC, instead of directly optimizing for control actions u, the DPC problem optimizes the learnable parameters W of a neural control policy $\pi_W: \mathbb{R}^{n_x+n_\xi} \to \mathbb{R}^{n_u}$, i.e., an explicit mapping from system states x and parameters ξ to control actions u. Beware of an unfortunate collision of terminology between the machine learning, optimization, and control communities here. Unlike the policy parameters W, the problem parameters ξ comprise the reference signals r, state constraint parameters $p_h$, input constraint parameters $p_g$, and system model parameters θ, which together with the initial conditions $x_0$ define the pOCP (1), sketched below. Here, N represents a prediction horizon, and k is a discrete time step. The loss function terms are the parametric MPC objective $\ell: \mathbb{R}^{n_x+n_u+n_r} \to \mathbb{R}$ and the terminal penalty term $p_N: \mathbb{R}^{n_x} \to \mathbb{R}$, included to promote stability of the underlying MPC problem. The parametric state and input constraints are given via (1d) and (1e), respectively. The formulation (1), implemented as a differentiable program, allows us to obtain a data-driven solution of the corresponding parametric programming problem by differentiating the loss function (1a) backwards through the parametrized closed-loop dynamics given by the system model (1b) and neural control policy (1c). This formulation allows one to use SGD and its variants to minimize the loss function (1a) over the initial conditions and problem parameters sampled from distributions $P_{x_0}$ and $P_\xi$, respectively. In the following paragraphs, we elaborate on key components of the proposed DPC problem formulation (1).
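For concreteness, the pOCP (1) referenced throughout this section can be sketched as follows; the exact arrangement of terms in the full formulation may differ slightly from this sketch:

```latex
\begin{aligned}
\min_{W} \quad & \mathbb{E}_{x_0 \sim P_{x_0},\; \xi \sim P_{\xi}}
  \Big[ \sum_{k=0}^{N-1} \ell(x_k, u_k, r_k) + p_N(x_N) \Big]
  && \text{(1a)} \\
\text{s.t.} \quad & x_{k+1} = f_{\theta}(x_k, u_k), \quad k \in \mathbb{N}_0^{N-1}
  && \text{(1b)} \\
& u_k = \pi_W(x_k, \xi_k)
  && \text{(1c)} \\
& h(x_k, p_{h_k}) \le 0
  && \text{(1d)} \\
& g(u_k, p_{g_k}) \le 0
  && \text{(1e)}
\end{aligned}
```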

B. Differentiable Closed-Loop System
The core idea of DPC is the parametrization of the closed-loop system composed of a differentiable system dynamics model and neural control policy, as shown in Fig. 1. One major difference between implicit MPC and DPC (1) is that the MPC solution returns an optimal sequence of control actions $U = \{u_0, \ldots, u_{N-1}\}$, while solving DPC yields optimized weights and biases $W = \{H_1, \ldots, H_L, b_1, \ldots, b_L\}$ parametrizing the explicit neural control policy $\pi_W: \mathbb{R}^{n_x+n_\xi} \to \mathbb{R}^{n_u}$ given by the feedforward map (2) sketched below, where $z_i$ are hidden states, and $H_i$ and $b_i$ represent the weights and biases of the ith layer, respectively. The activation layer $\sigma: \mathbb{R}^{n_z} \to \mathbb{R}^{n_z}$ is given as the element-wise application of a differentiable univariate function $\sigma: \mathbb{R} \to \mathbb{R}$. This article assumes a discrete-time nonlinear dynamical system model (1b). In the nominal case with $z_0 = x_k$, we obtain a full state feedback formulation of the control problem (1), where the system model (1b) and control policy (1c) together define the closed-loop system dynamics (3). Here, the N-step ahead rollout of the system (3) is conceptually equivalent to a single shooting formulation of the MPC problem [26]. Please note that in the extended case, we can consider the features of the policy to be augmented with a vector $\xi_k = \{r_k, p_{h_k}, p_{g_k}, \theta\}$ consisting of reference signals, state and input constraint parameters, as well as system model parameters, thus allowing for generalization across model scenarios, tasks with full reference preview, and dynamic constraints handling capabilities, as in the case of classical implicit MPC.
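A representative form of the neural policy (2) and the resulting closed-loop dynamics (3), consistent with the description above, is given below; the layer indexing is a sketch and may be arranged slightly differently in the original formulation:

```latex
\begin{aligned}
& \pi_W(z_0) = H_L z_{L-1} + b_L, \qquad
  z_i = \sigma(H_i z_{i-1} + b_i), \quad i \in \mathbb{N}_1^{L-1},
  && \text{(2)} \\
& x_{k+1} = f_{\theta}\big(x_k, \pi_W(x_k, \xi_k)\big),
  && \text{(3)}
\end{aligned}
```

with $z_0 = x_k$ in the nominal full state feedback case, or $z_0 = [x_k; \xi_k]$ in the extended parametric case.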

C. Differentiable Loss Function
The proposed DPC method is an offline PO algorithm that requires an estimate of the expected value of the pOCP's objective function (1a) subject to the constraints (1d) and (1e). For this purpose, we construct the Lagrangian of the pOCP (1) using the well-known penalty method. This allows us to compute the DPC loss $L_{\text{DPC}}$ (4), sketched below, as a weighted average of the Lagrangian obtained by sampling m scenarios of the initial conditions $x_0^i \sim P_{x_0}$ and problem parameters $\xi^i \sim P_\xi$. The first term of the differentiable Lagrangian loss function $L_{\text{DPC}}$ represents the main performance metric and can be defined, for instance, as a reference tracking term parametrized by $r_k$. The constraints terms correspond to penalty functions [61], [62] imposed on state and control action constraints parametrized by $p_{h_k}$ and $p_{g_k}$, respectively. Specifically, the parametric constraints penalties (6) are built from $\text{ReLU}: \mathbb{R} \to \mathbb{R}$, the rectified linear unit function, where $n_h$ and $n_g$ denote the total number of state and input constraints.
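A sketch of the sampled DPC loss (4), an example reference tracking objective (5), and the ReLU-based constraint penalties (6) is given below; the precise grouping of the terms and placement of the weighting factors is an assumption:

```latex
\begin{aligned}
L_{\text{DPC}} &= \frac{1}{m} \sum_{i=1}^{m}
  \Big[ \sum_{k=0}^{N-1} \big( \ell(x_k^i, u_k^i, r_k^i)
  + p_x\big(h(x_k^i, p_{h_k}^i)\big)
  + p_u\big(g(u_k^i, p_{g_k}^i)\big) \big) + Q_N\, p_N(x_N^i) \Big]
  && \text{(4)} \\
\ell(x_k, u_k, r_k) &= Q_x \|x_k - r_k\|_2^2 + Q_u \|u_k\|_2^2
  && \text{(5)} \\
p_x\big(h(x_k, p_{h_k})\big) &= Q_h \sum_{j=1}^{n_h} \text{ReLU}\big(h_j(x_k, p_{h_k})\big), \qquad
p_u\big(g(u_k, p_{g_k})\big) = Q_g \sum_{j=1}^{n_g} \text{ReLU}\big(g_j(u_k, p_{g_k})\big)
  && \text{(6)}
\end{aligned}
```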
The relative importance of the individual terms is tuned by the corresponding weighting factors. The terminal penalty term $p_N(x_N)$ in the DPC problem formulation (1) is a commonly used ingredient to guarantee recursive feasibility and closed-loop stability of MPC and can be designed based on a wide range of methods from [63], [64], [65], [66], [67], and [68].
Assumption 1: The system dynamics model (1b), parametric control objective function $\ell(x_k, u_k, r_k)$, state and input constraints penalties $p_x(h(x_k, p_{h_k}))$ and $p_u(g(u_k, p_{g_k}))$, and terminal penalty term $p_N(x_N)$ are differentiable at almost every point in their domain. We assume zero gradients at points where the functions are not differentiable.

D. Differentiable Predictive Control Policy Gradients
A main advantage of having a differentiable closed-loop dynamics model, control objective function, constraints, and terminal penalties is that it allows us to use AD (backpropagation through time [69]) to directly compute the policy gradient. In particular, by representing the problem (1) as a computational graph and leveraging the chain rule, we can directly compute the gradients of the loss function $L_{\text{DPC}}$ w.r.t. the policy parameters W. For simplicity of exposition, consider the case with m = 1 and N = 1, with the corresponding policy gradient (7) obtained by applying the chain rule to the composition of the loss, the system model (1b), and the policy (1c). The advantage of having the gradients (7) is that they allow us to use SGD algorithms, such as AdamW [70], to solve the corresponding pOCP (1). In practice, we can compute the gradient of the DPC problem by using AD frameworks, such as PyTorch [71], as illustrated in the sketch below.
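The following minimal PyTorch sketch illustrates how such policy gradients can be obtained by backpropagating a DPC-style loss through an unrolled closed-loop rollout; the linear system matrices, policy architecture, horizon, and weights are illustrative placeholders rather than values taken from this article.

```python
import torch
import torch.nn as nn

# Minimal sketch of DPC policy-gradient computation via automatic differentiation.
# The system matrices A, B, the policy size, the horizon, and the weights Qx, Qu, Qh
# are illustrative placeholders, not values from this article.
nx, nu, N = 2, 1, 10
A = torch.tensor([[1.0, 1.0], [0.0, 1.0]])   # hypothetical system dynamics
B = torch.tensor([[0.0], [1.0]])

policy = nn.Sequential(nn.Linear(nx, 20), nn.ReLU(), nn.Linear(20, nu))

def dpc_loss(x0, Qx=5.0, Qu=0.5, Qh=10.0, x_max=10.0):
    """Unroll the closed-loop system (3) and accumulate the loss (4) with ReLU penalties (6)."""
    x, loss = x0, 0.0
    for _ in range(N):
        u = policy(x)                      # u_k = pi_W(x_k), cf. (1c)
        x = x @ A.T + u @ B.T              # x_{k+1} = f(x_k, u_k), cf. (1b)
        loss = loss + Qx * (x ** 2).sum(-1).mean() + Qu * (u ** 2).sum(-1).mean()
        loss = loss + Qh * torch.relu(x.abs() - x_max).sum(-1).mean()  # state constraint penalty
    return loss

x0 = 2.0 * torch.rand(512, nx) - 1.0       # sampled initial conditions x_0 ~ P_x0
loss = dpc_loss(x0)
loss.backward()                            # backpropagation through time yields grad_W L_DPC
policy_gradients = [p.grad for p in policy.parameters()]
```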

E. Offline DPC Policy Optimization Algorithm
First, we sample the initial conditions of the problem $x_0 \sim P_{x_0}$.
Next, we sample the problem parameters $\xi \sim P_\xi$ over the N-step prediction window. Optionally, the problem parameters could include system dynamics model parameters θ representing system model uncertainties. The simplest choice of $P_{x_0}$ and $P_\xi$ is the uniform distribution. Alternatively, the distribution of the training data can be obtained from measured trajectories of the controlled system or specified by a domain expert. Combining the individual components, the DPC PO algorithm is summarized in Algorithm 1 and sketched in code below. The system dynamics model and control policy architecture are required to instantiate the computational graph of the DPC problem (1). The policy gradients $\nabla_W L_{\text{DPC}}$ are obtained by differentiating the DPC loss function $L_{\text{DPC}}$ over the distribution of initial state conditions and problem parameters sampled from the given distributions. The computed policy gradients now allow us to perform direct PO via a gradient-based optimizer O. The presented algorithm introduces a generic approach for a data-driven solution of the model-based pOCP (1).
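A compact training-loop sketch of Algorithm 1 is given below, reusing the `policy`, `dpc_loss`, and `nx` definitions from the previous snippet; the batch size, learning rate, and number of epochs are illustrative choices.

```python
import torch

# Sketch of the offline DPC policy optimization loop (Algorithm 1), reusing the
# `policy`, `dpc_loss`, and `nx` definitions from the previous snippet.
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-3)

for epoch in range(1000):
    x0 = 2.0 * torch.rand(512, nx) - 1.0     # sample initial conditions x_0 ~ P_x0
    optimizer.zero_grad()
    loss = dpc_loss(x0)                      # DPC loss L_DPC over the sampled batch (4)
    loss.backward()                          # policy gradient grad_W L_DPC (7)
    optimizer.step()                         # gradient-based policy update via optimizer O
```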

III. PROBABILISTIC STABILITY AND CONSTRAINT SATISFACTION
In this section, we present a probabilistic verification method for closed-loop stability and constraint satisfaction guarantees of the proposed DPC PO Algorithm 1. In particular, we modify the probabilistic verification method developed in [42] in the context of the aMPC method. We first consider the following assumptions.
Assumption 2: There exists a local Lyapunov function V(x), an invariant terminal set $\mathcal{X}_N$, and a feasible terminal control law K(x) such that the conditions (8), sketched below, hold for all $x \in \mathcal{X}_N$.
Remark 1: In practice, the condition of Assumption 2 can be satisfied by designing a stabilizing LQR controller for the system dynamics $x_{k+1} = A x_k + B u_k$ linearized at an operating point $x \in \mathcal{X}_N$. In standard MPC, the terminal control law K(x) is typically combined with the terminal cost function $p_N(x_N)$ to guarantee closed-loop stability by showing that the optimal cost function is a Lyapunov function [63], [64], [65], [66], [67], [68].
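Standard terminal-ingredient conditions of the type referenced in Assumption 2 and condition (8) can be sketched as follows; the exact inequality used in the original statement may include additional terms:

```latex
V\big(f_{\theta}(x, K(x))\big) - V(x) \le -\ell\big(x, K(x)\big), \qquad
f_{\theta}(x, K(x)) \in \mathcal{X}_N, \qquad \forall x \in \mathcal{X}_N.
\qquad \text{(8)}
```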

A. Performance Indicator Functions
Inspired by [42] and [72], our probabilistic validation method is based on rollouts (9) of the closed-loop system (3), sketched below. Here, i is the index of a simulated trajectory, and $\mathbb{X}$ represents a set of sampled initial conditions. For performance assessment, we define stability $I_s(X^i)$ and constraint satisfaction $I_c(X^i)$ indicator functions (10). The proposed indicator functions (10) indicate whether the learned control policy $\pi_W(x_k)$ generates state trajectories that satisfy the given stability conditions (10a) without violating state and input constraints (10b), respectively. For this purpose, we define the risk as the probability μ that both indicators equal one, where μ = 1 implies that stability and constraints are satisfied for 100% of the sampled trajectories (9), and, conversely, μ = 0 implies that all sampled scenarios violate either the constraints or stability.
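A representative form of the rollouts (9), the indicator functions (10), and the risk μ, consistent with the description above, is sketched below; the original definitions may differ in notation:

```latex
\begin{aligned}
& X^i = \{x_0^i, x_1^i, \ldots, x_N^i\}, \qquad
  x_{k+1}^i = f_{\theta}\big(x_k^i, \pi_W(x_k^i)\big), \quad x_0^i \in \mathbb{X},
  && \text{(9)} \\
& I_s(X^i) =
  \begin{cases} 1, & x_N^i \in \mathcal{X}_N \\ 0, & \text{otherwise} \end{cases}
  && \text{(10a)} \\
& I_c(X^i) =
  \begin{cases} 1, & h(x_k^i, p_{h_k}) \le 0 \ \wedge\ g(u_k^i, p_{g_k}) \le 0, \ \forall k \\ 0, & \text{otherwise} \end{cases}
  && \text{(10b)} \\
& \mu = \mathbb{P}\big[ I_s(X^i) = 1 \ \wedge\ I_c(X^i) = 1 \big].
\end{aligned}
```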
Remark 2: The advantage of the proposed DPC problem formulation (1), in comparison with the aMPC scheme [42], is that it allows us to define direct performance indicator functions (10) based on the known constraints and closed-loop system dynamics (3). In [42], by contrast, the indicator functions are based on the approximation error of the control policy trained via imitation learning supervised by the original MPC.

B. Probabilistic Guarantees
We can now use the indicator functions (10) to empirically validate a set of m trajectories $X^i$, $i \in \mathbb{N}_1^m$, with sampled independent and identically distributed (IID) initial conditions $x_0^i \sim P_{x_0}$. For this purpose, we define the empirical risk $\tilde{\mu}$ as the fraction of sampled trajectories for which both indicators (10) equal one. Similar to [42], we use Hoeffding's inequality [73] to estimate the true risk μ from the empirical risk $\tilde{\mu}$. Hoeffding's inequality implies that, with confidence 1 − δ, the probability of the sampled trajectories $X^i$ satisfying the indicator functions (10) is larger than the empirical risk $\tilde{\mu}$ minus a user-defined risk tolerance ε, or more compactly (12). Thus, as given in [42], for a chosen confidence δ and m samples, we obtain a formula to calculate the worst-case empirical risk $\mu_{\text{wc}}$. Now, using the condition (12), we can compute the worst-case bound on the true risk as $\mu \ge \mu_{\text{wc}}$. Hence, a higher empirical risk $\tilde{\mu}$, more samples m, and a lower confidence parameter δ imply a higher worst-case risk value $\mu_{\text{wc}}$, i.e., a tighter bound on the true risk μ. This allows us to define the closed-loop performance validation Algorithm 2 (a computational sketch is given after the proof of Theorem 1 below). Finally, we can formalize the stability guarantees for the DPC PO Algorithm 1 via Theorem 1, inspired by [42].
Theorem 1: Let Assumption 2 hold. Construct the DPC problem (1) with the given system dynamics model. Initialize and run Algorithm 2 with chosen hyperparameters δ, $\mu_{\text{bound}}$, m, $m_{\max}$, and p. If Algorithm 2 terminates with the validation procedure passed, then with confidence 1 − δ the trained control policy satisfies closed-loop stability and constraint satisfaction for sampled initial conditions with admissible risk $\mu_{\text{bound}}$.
Proof: If Algorithm 2 terminates with the validation procedure passed, then (12) must hold, and through Hoeffding's inequality it follows that $\mathbb{P}[I_s(X^i) = 1 \ \wedge\ I_c(X^i) = 1] \ge \mu_{\text{bound}}$ with confidence 1 − δ. In other words, reaching the terminal set $\mathcal{X}_f$ while satisfying state and input constraints is guaranteed for at least a $\mu_{\text{bound}}$ portion of samples with confidence 1 − δ. Thanks to Assumption 2, convergence to the terminal set $\mathcal{X}_f$ implies closed-loop stability (8).
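A small computational sketch of the validation step in Algorithm 2 is given below; it assumes a one-sided Hoeffding bound of the form $\tilde{\mu} - \sqrt{\ln(1/\delta)/(2m)}$, and the constant inside the logarithm may differ from the exact expression used in (12).

```python
import math
import torch

# Sketch of the validation step in Algorithm 2: estimate the empirical risk from m
# sampled closed-loop rollouts and a one-sided Hoeffding bound on the true risk.
# The constant inside the logarithm is an assumption and may differ from Eq. (12).
def empirical_risk(I_s, I_c):
    """Fraction of sampled trajectories satisfying both indicator functions (10)."""
    return (I_s * I_c).float().mean().item()

def worst_case_risk(mu_tilde, m, delta):
    """One-sided Hoeffding bound: true risk >= mu_tilde - sqrt(ln(1/delta) / (2 m))."""
    return mu_tilde - math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# usage with placeholder validation outcomes (all rollouts assumed to pass here)
m = 7000
I_s = torch.ones(m, dtype=torch.int64)       # stability indicators I_s(X^i)
I_c = torch.ones(m, dtype=torch.int64)       # constraint indicators I_c(X^i)
mu_tilde = empirical_risk(I_s, I_c)
mu_wc = worst_case_risk(mu_tilde, m, delta=0.0165)
```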

IV. NUMERICAL CASE STUDIES
This section presents simulation results on five example systems demonstrating the capabilities of the DPC PO algorithm to 1) learn to stabilize unstable linear systems; 2) learn to control systems with multiple inputs and outputs; and 3) satisfy state and input constraints. Furthermore, we demonstrate superior sampling efficiency compared to model-free RL algorithms, faster online execution compared to implicit MPC, and a smaller memory footprint and better scalability compared to eMPC. The presented experiments are implemented in the PyTorch-based toolbox NeuroMANCER [10] and are available on GitHub.¹ All experiments in this section were performed on a laptop with a 2.60-GHz Intel i7-8850H CPU and 16-GB RAM on a 64-bit operating system. All examples using the DPC PO Algorithm 1 are optimized with the AdamW [70] optimizer.

A. Example 1-Stabilizing Unstable Double Integrator
Here, we demonstrate the capabilities of DPC to learn a stabilizing neural feedback policy for an unstable system with known dynamics. We control the unstable double integrator (13). As a benchmark, we synthesize an eMPC policy computed with a classical multiparametric programming solver using the MPT3 toolbox [74] with a prediction horizon of N = 10; the resulting eMPC policy surface is shown on the left in Fig. 2.
1) Training: In the case of DPC, we learn a full state feedback neural policy $u_k = \pi_W(x_k)$ in the closed-loop system (3) via Algorithm 1; its surface is shown on the right in Fig. 2. A side-by-side comparison in Fig. 2 shows almost identical control surfaces of the compared policies. For the DPC policy training, we used the MPC loss function (14) subject to state $x_k \in \mathcal{X}$, input $u_k \in \mathcal{U}$, as well as terminal set constraints $x_N \in \mathcal{X}_f := \{x \mid -0.1 \le x \le 0.1\}$. All constraints are implemented using penalties (6) with weights $Q_h = 10$, $Q_g = 100$, $Q_N = 1$, while for the control objective (14) we consider prediction horizon N = 1 and weights $Q_x = 5$, $Q_u = 0.5$. We train the neural policy (2) $\pi_W(x): \mathbb{R}^2 \to \mathbb{R}$ with 4 layers, 20 hidden states, bias terms, and ReLU activation functions. We use a training set of 3333 normally distributed initial conditions $x_0$ covering the admissible set $\mathcal{X}$.
2) Closed-Loop Performance: Fig. 3 shows the resulting converging trajectories of the closed-loop system controlled by the stabilizing neural policy learned using the DPC formulation with terminal constraints. Finally, on the left-hand side of Fig. 4, we visualize the DPC loss function (4) for the trained policy over the state space to demonstrate that it is indeed a Lyapunov function, while on the right of Fig. 4 we evaluate the contraction constraint (15), showing state space regions with contractive (α < 1, blue) and diverging (α > 1, red) dynamics.
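The contraction condition referenced as (15) can be sketched, for a chosen scaling factor 0 < α < 1 and penalty weight $Q_c$, as follows; the specific norm and penalty arrangement below are representative assumptions rather than the exact expression used in this article:

```latex
\|x_{k+1}\|_2 \le \alpha \|x_k\|_2,
\qquad \text{penalized in } L_{\text{DPC}} \text{ as } \
Q_c \, \text{ReLU}\big( \|x_{k+1}\|_2 - \alpha \|x_k\|_2 \big).
\qquad \text{(15)}
```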

B. Example 2-Reference Tracking of Two Tank System
We consider a canonical nonlinear control problem: a system of two connected tanks controlled by a single pump and a two-way valve. The system is a simplified model of pumped-storage hydroelectricity, a type of hydroelectric energy storage used by electric power systems for load balancing. The nonlinear ordinary differential equation (ODE) model² is given by (16), with system states $h_1$ and $h_2$ representing the liquid levels in tanks 1 and 2, respectively. The control actions are the pump modulation p and valve opening v. The ODE system is parametrized by the inlet and outlet valve coefficients $c_1$ and $c_2$. We solve the continuous-time system model (16) using a 4th-order Runge-Kutta ODE solver, obtaining a discretized system that can be represented in the compact form (17) with system states $x = [h_1, h_2]$ and control actions $u = [p, v]$.
In general, there are two ways to differentiate through the system model solved with the ODE solver (17). The first is to use the backpropagation through time algorithm [69] by unrolling the operations of the ODE solver and using AD to compute its gradients. The second is to use the adjoint method, as in the case of neural ODEs [75]. A differentiable RK4 discretization is sketched below.
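The following sketch illustrates the first option, a differentiable RK4 step over a two-tank model; the right-hand side and the coefficients $c_1$, $c_2$ are illustrative assumptions and are not taken verbatim from this article.

```python
import torch

# Differentiable RK4 discretization sketch for a two-tank model, so that the N-step
# rollout supports backpropagation through time. The right-hand side below is an
# illustrative pumped-storage tank model; the exact equations (16) and coefficients
# c1, c2 are assumptions, not taken verbatim from this article.
def two_tank_ode(x, u, c1=0.08, c2=0.04):
    h1 = x[..., 0].clamp(min=1e-6)                  # tank 1 level, kept positive
    h2 = x[..., 1].clamp(min=1e-6)                  # tank 2 level, kept positive
    p, v = u[..., 0], u[..., 1]                     # pump modulation and valve opening
    dh1 = c1 * (1.0 - v) * p - c2 * torch.sqrt(h1)
    dh2 = c1 * v * p + c2 * torch.sqrt(h1) - c2 * torch.sqrt(h2)
    return torch.stack([dh1, dh2], dim=-1)

def rk4_step(f, x, u, dt=1.0):
    """Single 4th-order Runge-Kutta step: x_{k+1} = f_theta(x_k, u_k) in compact form (17)."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```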

1) Training:
The objective is to control the tank levels to desired time-varying reference values by modulating the pump and valve control actions, subject to state and input constraints. We use the DPC PO Algorithm 1 to learn the explicit neural control policy by solving the parametric nonlinear optimal control problem (18), sketched below. The objective is to minimize the reference tracking error $\|x_k^i - r_k^i\|_2^2$ over the prediction horizon N = 50, weighted by the scalar $Q_x = 5$, including a terminal penalty weighted by $Q_N = 10$. The parametric neural control policy is given by $\pi_W(x_k^i, R^i)$. The neural control policy is optimized over problem parameters sampled from the distributions $P_{x_0}$ and $P_R$ for the state initial conditions and references, respectively. Specifically, we sample 2000 scenarios with uniformly distributed initial conditions and random sequences of desired references over the prediction horizon N = 50. The state (18d) and input (18e) constraints are transformed into penalties with weights $Q_h = 10$ and $Q_g = 10$, respectively.
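A representative form of the parametric problem (18), consistent with the weights and constraints described above, can be sketched as follows; the box-constraint bounds $\underline{x}, \overline{x}, \underline{u}, \overline{u}$ and the exact grouping of terms are assumptions:

```latex
\begin{aligned}
\min_{W} \quad & \frac{1}{m} \sum_{i=1}^{m}
  \Big[ \sum_{k=0}^{N-1} Q_x \|x_k^i - r_k^i\|_2^2 + Q_N \|x_N^i - r_N^i\|_2^2 \Big]
  && \text{(18a)} \\
\text{s.t.} \quad & x_{k+1}^i = f_{\theta}(x_k^i, u_k^i)
  && \text{(18b)} \\
& u_k^i = \pi_W(x_k^i, R^i)
  && \text{(18c)} \\
& \underline{x} \le x_k^i \le \overline{x}
  && \text{(18d)} \\
& \underline{u} \le u_k^i \le \overline{u}
  && \text{(18e)} \\
& x_0^i \sim P_{x_0}, \quad R^i = \{r_0^i, \ldots, r_N^i\} \sim P_R.
\end{aligned}
```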
We train the neural policy $\pi_W(x_k, R): \mathbb{R}^{n_x + N n_r} \to \mathbb{R}^{n_u}$ with 2 layers, 32 hidden states, bias terms, and GELU activation functions, where $n_x = 2$, $n_r = 2$, $n_u = 2$, and N = 50 are the number of states, references, control actions, and prediction horizon steps, respectively.
2) Closed-Loop Performance: In Fig. 5, we show the closed-loop reference tracking control performance using the neural control policy trained offline with the DPC PO Algorithm 1. We demonstrate that the neural DPC policy can reliably control a nonlinear dynamical system with time-varying references while satisfying state and input constraints.

Fig. 5. Closed-loop trajectories of the two tank system controlled by neural policy trained offline using DPC PO Algorithm 1.

C. Example 3-Tracking Control of Quadcopter Model
In this example, we demonstrate the scalability of DPC to larger systems using a linear quadcopter model³ with twelve states $x_k \in \mathbb{R}^{12}$ and four control inputs $u_k \in \mathbb{R}^4$. Our objective is to track a reference with the third state representing the controlled output $y_k \in \mathbb{R}$, and to stabilize the remaining states, while satisfying state and input constraints.

³ https://osqp.org/docs/examples/mpc.html
1) Training: To optimize the policy, we use Algorithm 1 with 30 000 samples of feasible normally distributed initial conditions, split evenly into 10 000 samples each for the training, validation, and test sets. We consider the DPC problem with a sampled quadratic control objective using prediction horizon N = 10 and weights $Q_r = 20$, $Q_x = 5$. The system is subject to state and input box constraints, implemented as penalties (6) with the corresponding constraint penalty weights. To promote the stability of the closed-loop system during training, we consider the contraction constraint (15) with scaling factor α = 0.8 and penalty weight $Q_c = 1$. We train the full state feedback neural policy (2) $\pi_W(x): \mathbb{R}^{12} \to \mathbb{R}^{N \times 4}$ with 2 layers, 100 hidden states, and ReLU activation functions. In the closed-loop simulations, we use a receding horizon control strategy, applying only the first predicted control action.
2) Closed-Loop Performance: Fig. 6 plots the closed-loop control performance of the quadcopter model controlled with the trained full state feedback neural policy. We demonstrate that the trained DPC policy can simultaneously track the desired reference and stabilize the selected states while satisfying state and input constraints. From the computational perspective, in Table I we compare the online computational time associated with the evaluation of the trained DPC policy against the implicit MPC in quadratic form implemented in CVXPY [76] and solved with the OSQP solver [77]. We report on average almost two orders of magnitude speedup in the maximum and mean evaluation time compared to the online optimization solver.

Fig. 6. Closed-loop trajectories of the quadcopter system controlled by DPC policy trained via Algorithm 1 using control objective (19) and constraints penalties (6), including contraction constraint (15).

TABLE I
COMPARISON OF ONLINE COMPUTATIONAL TIME OF THE PROPOSED DPC POLICY AGAINST IMPLICIT MPC SOLVED WITH OSQP

Due to the large state dimensions, the presented problem is not feasible with traditional parametric programming solvers. Thus, by solving this problem, we demonstrate the scalability of the proposed DPC method beyond the limitations of eMPC while providing significant speedups compared to an online state-of-the-art convex optimization solver.

D. Example 4-Parametric Obstacle Avoidance Problem
In this example, we demonstrate handling parametric nonlinear constraints for an obstacle avoidance problem by learning parametric neural policies using the DPC PO Algorithm 1. We consider the double integrator system (20), where $x \in \mathbb{R}^2$ and $u \in \mathbb{R}^2$ are states and inputs that are subject to the box constraints $x_k \in \mathcal{X} := \{x \mid -10 \le x \le 10\}$ and $u_k \in \mathcal{U} := \{u \mid -1 \le u \le 1\}$, respectively. Additionally, we consider parametric nonlinear constraints representing an obstacle in the state space, defined by (21), where $x_{i,k}$ denotes the ith system state and p, b, c, d are scalar-valued parameters defining the volume, shape, and center of the elliptic obstacle.
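The exact algebraic form of the obstacle constraint (21) is parameter-dependent; purely for illustration, one representative elliptic parametrization consistent with the description above could read:

```latex
b\,(x_{1,k} - c)^2 + (x_{2,k} - d)^2 \ \ge\ p^2.
```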
1) Training: Again, we use the DPC PO Algorithm 1 for training the parametric neural policy $\pi_W(x, \xi): \mathbb{R}^7 \to \mathbb{R}^{N \times 2}$ with 4 layers, each having 100 hidden neurons, and ReLU activation functions, where $\xi = [b, c, d, r_N]$. To solve this problem, we consider a DPC objective with the prediction horizon N = 20, where the first term penalizes the terminal state condition with parametric reference $r_N$, the second term penalizes the energy used in control actions, and the third and fourth terms represent control action and state smoothing penalties. For scaling the control objective terms, we use the weighting factor values $Q_r = 1.0$, $Q_{dx} = 1.0$, $Q_{du} = 10.0$, $Q_u = 10.0$, and we use $Q_h = 100.0$ for the state constraint penalties that include the obstacle avoidance constraint (21) and box constraints on states and control actions. For the datasets, we sample 30 000 uniformly distributed initial state conditions $x_0^i$, constraint parameters $p^i$, $b^i$, $c^i$, $d^i$, and terminal state references $r_N^i$, and use one third of the data for the training, validation, and test sets, respectively.
2) Optimal Control Performance: The obstacle constraint (21) renders the overall pOCP (1) nonlinear. Hence, for comparison, we implement the implicit MPC with nonlinear constraints in the CasADi framework [78] using the nonlinear programming solver IPOPT [79]. In Fig. 7, we visualize, for different parametric scenarios, the trajectories computed online using IPOPT and those computed offline using our DPC PO Algorithm 1. We demonstrate that DPC can generate suboptimal trajectories that satisfy a distribution of parametric nonlinear constraints while being more computationally efficient online than the state-of-the-art IPOPT solver. In particular, Table II compares the mean and maximum online computational time associated with the evaluation of the learned DPC policy with implicit nonlinear MPC solved using IPOPT. On average, the explicit neural DPC policy provides almost a 40-times speedup, and a 30-times speedup in the worst case, compared to the IPOPT online solver, without sacrificing control performance on the presented obstacle avoidance problem.

E. Scalability and Performance Guarantees
In this section, we analyze online (evaluation) and offline (training) computational complexity of the proposed DPC PO method.
1) Online Computational Complexity: From the online computational and memory standpoint, we compare the proposed method with implicit and explicit solutions. The implicit MPC problem is solved online via quadratic programming (QP) using MATLAB's Quadprog solver. In theory, the online complexity of the QP problem depends on the number of constraints, which scales polynomially with the number of prediction horizon steps N and decision variables. In particular, the theoretical complexity of the QP problem in the dense form is $O(N^3 n_u^3)$, and in the sparse form $O(N^3 (n_x + n_u)^3)$ [80]. The online procedure of eMPC consists of evaluating the piecewise affine policy. The forward pass of the PWA map is typically performed using binary search with logarithmic complexity $O(\log(n_R))$, where $n_R$ is the number of critical regions of the PWA map, which is upper bounded exponentially by the number of constraints q as $n_R \le 2^q$. In the case of DPC, the online computational complexity depends solely on the number of hidden layers $n_L$ and hidden size $n_z$ of the neural control policy (2). In particular, the evaluation of a single neural network layer consists of a matrix multiplication with worst-case asymptotic run-time $O(n_z^3)$, followed by the element-wise activation function with run-time $O(n_z)$. The overall theoretical complexity of the neural policy evaluation then becomes $O(n_L (n_z^3 + n_z))$, thus being decoupled from the sizes of the optimization problem $n_x$, $n_u$, N. This allows us to design neural policies with a tunable upper bound on worst-case computational time for problems of arbitrary size. We empirically compare the mean and maximum online evaluation of the iMPC, eMPC, and DPC policies to demonstrate this advantage. Furthermore, we compare the memory footprint of the trained neural policy of DPC against the PWA policy of eMPC. The assessment of the memory requirements of the implicit MPC is not straightforward, as it depends on the memory footprint of the chosen optimization solver, dependencies, and storage of the problem parameters. In this case, we estimate an optimistic lower bound by assessing only the memory requirements of the Quadprog solver.⁴ In Table III, we report the scalability analysis in terms of mean and maximum online evaluation time and memory footprint as a function of an increasing prediction horizon N, evaluated on the problem from Example 5. The results show that the learned DPC policy is on average 5 to 10 times faster than the iMPC policy, comparable to eMPC on smaller problems, and orders of magnitude faster than eMPC on larger problems. Compared with eMPC, DPC has a significantly smaller memory footprint across all scales. Both the CPU time and memory footprint of the DPC policies scale linearly with the problem complexity, while eMPC solutions scale exponentially and are practically infeasible for larger-scale problems. These findings are supported by the fact that DNNs with ReLU layers are more memory efficient than lookup tables for parametrizing the PWA functions representing eMPC policies [38].
Remark 3: To account for the increasing problem complexity, we train a two-layer DPC policy with a progressively increasing number of hidden neurons $n_{\text{hidden}} = 10N$. Both DPC and iMPC policies share the same objectives; however, due to the exponential complexity growth in the case of eMPC, we had to implement the move blocking strategy [81], limiting the control horizon $N_c$ as follows: if N = 4 then $N_c = 2$, if N = 6 then $N_c = 1$, while the problem with N = 8 is practically intractable for eMPC even with $N_c = 1$.
Remark 4: The online complexity of DPC depends entirely on the choice of the policy parametrization, i.e., the number of layers and hidden neurons. Therefore, it represents a tunable tradeoff between performance and guaranteed complexity, a design trait desirable for embedded applications with limited computational resources.
2) Offline Computational Complexity: For both DPC and aMPC, we evaluate the offline computational complexity using the problem from Example 2. The resulting scalability analysis of the PO algorithms with an increasing number of training samples is given in Table IV. In particular, we evaluate the total offline computational time as the sum of dataset generation and training time. We demonstrate that the proposed DPC algorithm requires fewer computational resources than aMPC based on imitation learning. The first reason is that DPC dataset generation does not require the solution of the original MPC, as in the case of aMPC. Interestingly, using the same number of epochs, the same policy architecture, and the same learning algorithm hyperparameters, the DPC approach has proven faster in training time than the aMPC supervised learning method. In the presented example, policy training via the DPC method is approximately three times faster than aMPC. In this article, the eMPC problems are solved offline via multiparametric QP (mpQP) using the MPT3 toolbox [74]. In multiparametric QP, the worst case is given by all possible combinations of active constraints, $O(2^q)$, where q is the number of constraints [2], [82]. As a consequence, applications of the eMPC approach are limited to very small problems, as shown in Table III, where the mpQP problem for larger prediction horizons took several hours to compute. Thus, we omit the eMPC approach from the scalability comparison with DPC and aMPC on the larger problem used for the analysis in Table IV.
3) Probabilistic Performance Guarantees: For the policies trained via DPC and aMPC, we calculate the empirical risk $\tilde{\mu}$ (12) using Algorithm 2, evaluated on the problem from Example 2. Given a risk tolerance ε = 0.02 and confidence 0.9835, we generate 7000 closed-loop trajectories by simulating the system (3) with the trained policies on samples of i.i.d. normally distributed initial conditions. Then, we calculate the lower bound on the empirical risks $\tilde{\mu}$ of stability and constraint violations as a function of the number of training samples, and we report the results in Table V. The results for DPC and aMPC policies trained on the same amount of data are almost identical, with DPC performing slightly better with more training samples, and aMPC doing better with fewer training samples.

V. CONCLUSION
In this work, we present a novel direct PO method called DPC. We provide a new perspective on the use of differentiable programming for obtaining offline policy gradients that can lead to solutions of the parametric programming problems arising in eMPC. The presented DPC PO can learn stable neural control policies for linear and nonlinear systems subject to state and input constraints.
The neural control policy can be learned end-to-end without the need for an expert policy to imitate, thus presenting a scalable algorithm for obtaining parametric solutions to constrained optimal control problems.
Furthermore, we provide sampling-based stochastic performance guarantees of the trained DPC policies by employing Hoeffding's inequality.
We demonstrate the method in extensive case studies, evaluating the control performance, data efficiency, and scalability against implicit MPC, eMPC, aMPC, and deep RL algorithms. We show that DPC is faster than implicit MPC, has a smaller memory footprint than eMPC, and requires fewer resources for training than aMPC. Based on the reported results and the user-friendly open-source implementation [10], the proposed DPC has the potential to be deployed in application domains beyond the computational reach of traditional eMPC while providing performance guarantees and sound computational tradeoffs.

Fig. 3. Closed-loop trajectories of system (13) controlled by stabilizing neural feedback policy trained using Algorithm 1 with DPC problem formulation (1), including terminal constraint $x_N \in \mathcal{X}_f$.


Fig. 7. Trajectories in different parametric scenarios of the obstacle avoidance problem (21) computed online using IPOPT and offline using DPC PO Algorithm 1.
Algorithm 1 DPC Offline Policy Optimization
1: input training datasets of initial conditions and problem parameters sampled from distributions $P_{x_0}$ and $P_\xi$
2: input system dynamics model $x_{k+1} = f_\theta(x_k, u_k)$ and an N-step prediction horizon
3: input neural policy architecture $\pi_W(x_k, \xi_k)$
4: input DPC loss $L_{\text{DPC}}$ (4)
5: input optimizer O
6: differentiate DPC loss $L_{\text{DPC}}$ (4) to obtain the policy gradient $\nabla_W L_{\text{DPC}}$ (7)
7: learn policy weights W via optimizer O using gradient $\nabla_W L_{\text{DPC}}$
8: return optimized policy $\pi_W(x_k, \xi_k)$

Algorithm 2 Closed-Loop Stability and Constraint Satisfaction Validation of DPC PO
1: input stability and constraints indicator functions (10)
2: input confidence δ and admissible risk bound $\mu_{\text{bound}}$
3: input sampling parameters m, $m_{\max}$, p
4: sample m IID initial conditions $x_0^i \sim P_{x_0}$, $i \in \mathbb{N}_1^m$
...
13: else repeat from step 4
14: else validation passed, terminate procedure

TABLE II
COMPARISON OF ONLINE COMPUTATIONAL TIME OF THE PROPOSED DPC POLICY AND IMPLICIT MPC SOLVED WITH IPOPT

TABLE III
ONLINE SCALABILITY WITH INCREASING PREDICTION HORIZON N. COMPARISON OF MEAN AND WORST-CASE ONLINE COMPUTATIONAL TIME PER SAMPLE, AND MEMORY FOOTPRINT OF THE PROPOSED DPC POLICY AGAINST IMPLICIT (IMPC) AND EXPLICIT (EMPC) SOLUTIONS, EVALUATED ON THE MODEL FROM EXAMPLE 5

TABLE IV
OFFLINE SCALABILITY WITH THE NUMBER OF TRAINING SAMPLES. COMPARISON OF THE COMBINED DATASET GENERATION AND TRAINING TIME OF THE AMPC AND THE PROPOSED DPC PO ALGORITHMS, EVALUATED ON THE PROBLEM FROM EXAMPLE 2 WITH PREDICTION HORIZON N = 10

TABLE V
PROBABILISTIC PERFORMANCE GUARANTEES OF THE PROPOSED DPC AND AMPC, EVALUATED USING EMPIRICAL RISKS $\tilde{\mu}$ AS A FUNCTION OF TRAINING SAMPLES