TOAST: Trajectory Optimization and Simultaneous Tracking using Shared Neural Network Dynamics

Neural networks have been increasingly employed in Model Predictive Control (MPC) to control nonlinear dynamic systems. However, MPC still suffers from the problem that its achievable update rate is insufficient to cope with model uncertainty and external disturbances. In this paper, we present a novel control scheme that designs an optimal tracking controller using the neural network dynamics of the MPC, allowing it to be applied as a plug-and-play extension to any existing model-based feedforward controller. We also describe how our method handles a neural network containing history information, which does not follow the general form of dynamics. The proposed method is evaluated on classical control benchmarks with external disturbances. We also extend our control framework to an aggressive autonomous driving task with unknown friction. In all experiments, our method outperformed the compared methods by a large margin. Our controller also showed low control chattering levels, demonstrating that our feedback controller does not interfere with the optimal command of MPC.


I. INTRODUCTION
Online trajectory optimization, also known as Model Predictive Control (MPC), has provided promising results in numerous robotic applications. The most popular current methods for controlling high-dimensional nonlinear systems involve gradient-based MPC, which was established in the context of differential dynamic programming [1]–[3]. However, this approach poses the inherent limitation that the algorithm relies on a quadratic approximation of the cost functions and requires smooth dynamics. Furthermore, it is extremely difficult to include state constraints in the optimization process of gradient-based MPC.
As a remedy, sampling-based approaches that do not require linear and quadratic approximations of dynamics and cost functions have been developed. The key advantage of sampling-based MPC is that it is capable of solving nonconvex optimization problems, allowing engineers to easily encode high-level robot behaviors using cost function clipping [4], [5]. With the benefit that the algorithm can directly use the dynamics model without linearization, neural networks, which have proven effective for approximating nonlinear transition models, are increasingly employed in sampling-based MPC [6], [7].
Both of these MPC algorithms (called feedforward controllers) optimize an open-loop control sequence and execute this sequence until the next optimization update. However, the achievable update rate of the MPC is insufficient to cope properly with model uncertainty and external disturbances. This is because the optimization process must incorporate a number of trajectory rollouts along the future time horizon to reach the required performance level. In practice, a simply designed local feedback compensator is often used to correct for trajectory errors on the fastest time scales [8]. The fundamental problem is that such compensators decouple the feedback and feedforward control domains, resulting in decreased efficiency due to the lack of consideration for the feedforward optimization process. As an alternative, the iterative Linear Quadratic Regulator (iLQR) [9], also known as a Sequential Linear Quadratic (SLQ) solver [10], was proposed for use in an online setting to derive the optimal feedforward trajectory and the feedback gains simultaneously by solving a single optimal control problem [2], [11]. However, such schemes have only been considered in gradient-based settings, and they are not scalable to sampling-based algorithms. In contrast to gradient-based MPC, the optimal feedback controller for sampling-based MPC has not been fully addressed by prior work.
One particular strategy to deal with this issue is to apply iLQR on top of the sampling-based MPC for designing an ancillary feedback policy [7], [12]. However, the feedback gain term in iLQR is produced to push the control sequence toward the locally optimal trajectory during the line search optimization procedure. The feedback gain becomes negligible when the sequence converges after recursive backward and forward passes. It is possible to obtain a considerable feedback gain by intentionally reducing the iterations, but it is misleading to refer to this as an ideal optimal control gain. Finally, none of the feedback strategies mentioned above has yet considered a proper extension to neural network dynamics.
In this paper, we propose a novel approach that combines trajectory optimization and optimal tracking control by sharing neural network dynamics, which is applicable to both gradient-based and sampling-based MPC. Since our method reuses the neural network transition models of the system, it can be applied as a plug-and-play extension to existing model-based control tasks. We also describe how our method handles a neural network containing history information, a strategy often employed for capturing time-varying or higher-order effects. We evaluated our algorithm using classical control benchmarks in which crosswinds are modeled as external disturbances. In addition, we demonstrated the feasibility of our idea by performing a challenging autonomous driving task under varying road friction conditions. The proposed method, called Trajectory Optimization and Simultaneous Tracking (TOAST), is illustrated in Fig. 1.

Fig. 1. Overview diagram of our TOAST control framework when applied to an aggressive driving task of an Unmanned Ground Vehicle (UGV). The components related to the feedforward controller and the feedback controller are depicted in blue and red, respectively.

II. METHODOLOGY
In this paper, we consider the problem of designing a generic control framework that includes MPC and a corresponding optimal tracking controller that both employ the same learned dynamics. While the idea is general and applicable to any MPC algorithm, such as iLQG [2] and Differential Dynamic Programming (DDP) [3], we use Smooth Model Predictive Path Integral (SMPPI) control [13] as the feedforward controller throughout this paper.

A. Smooth Model Predictive Path Integral Controller
This section provides a brief overview of the SMPPI approach. Consider a standard discrete-time dynamic system in which we denote the state at time t as x_t ∈ R^n and the action as u_t ∈ R^m. A Model Predictive Path Integral (MPPI) controller, one of the most successful sampling-based approaches of recent years, draws parallel control sequence samples U = {u_0, u_1, ..., u_{T−1}} with a fixed time horizon T [5]. The optimal control trajectory U* is then obtained using the information-theoretic interpretation and importance sampling. The parallel samples are evaluated with the state cost function c(x) using a Graphics Processing Unit (GPU). However, chattering innately occurs in the optimized control sequence due to the nature of sampling-based algorithms. SMPPI was proposed to attenuate the chattering without extrinsic smoothing filters. The main idea is to decouple the control space from the action space: noisy sampling is performed in the control space, and an integral operation is then applied to obtain the action sequences, which smooths out chattering.
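To make this update concrete, the following is a minimal NumPy sketch of an SMPPI-style update as described above: noise is sampled in the control (derivative-of-action) space, rollouts are weighted by path-integral importance sampling, and the smoothed control sequence is integrated into actions. The rollout_cost callable, the temperature lam, the noise scale sigma, and the sample count K are illustrative assumptions; the actual implementation evaluates the rollouts in parallel on a GPU with the learned dynamics.

```python
import numpy as np

def smppi_update(V, U0, rollout_cost, lam=1.0, sigma=0.5, K=1024, dt=0.1):
    """One SMPPI-style update (a sketch).

    V            : (T, m) nominal control (derivative-of-action) sequence
    U0           : (m,) action at the start of the horizon
    rollout_cost : maps a (T, m) action sequence to a scalar cost
    """
    T, m = V.shape
    eps = sigma * np.random.randn(K, T, m)                 # noise on the control space
    costs = np.empty(K)
    for k in range(K):
        U_k = U0 + np.cumsum((V + eps[k]) * dt, axis=0)    # integrate controls -> actions
        costs[k] = rollout_cost(U_k)
    w = np.exp(-(costs - costs.min()) / lam)               # information-theoretic weights
    w /= w.sum()
    V_new = V + np.einsum('k,ktm->tm', w, eps)             # importance-weighted noise average
    U_new = U0 + np.cumsum(V_new * dt, axis=0)             # smooth action sequence
    return V_new, U_new
```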

B. Designing a Simultaneous Optimal Tracking Controller
Online trajectory optimization methods, including SMPPI, have shown effective performance in various dynamic systems. However, there remains a problem: the computational demand is too high for a real-world system to process in the required time frame. In an MPC-controlled system, the same feedforward command must be maintained for a certain amount of processing time. Control errors induced by external disturbances or model uncertainty might be difficult to respond to in such a system.
We propose an optimal feedback controller that minimizes the control errors that occur while the MPC calculates the next feedforward command. This ancillary feedback controller does not interfere with the optimal command of MPC, but rather takes the feedforward command as a reference and performs tracking control that reduces the error at a faster rate. Our method does not need to form a new error function to track the feedforward action, nor does it need to learn separate neural network dynamics. It calculates the optimal feedback gain adaptively by utilizing the same learned dynamics from MPC. Here, we emphasize that the feedback controller discussed in this paper is distinct from the low-level plant controller, which manages actuators based on the control input and is often treated as a black box.
Consider a discrete-time nonlinear system dynamics model:

x_{t+1} = F(x_t, u_t). (1)

An optimal action trajectory found by MPC is given as U* = {u*_0, u*_1, ..., u*_{T−1}}, and the corresponding state trajectory is given as X* = {x*_0, x*_1, ..., x*_{T−1}}. Linearize (1) around the optimal trajectory using a first-order Taylor series expansion:

x*_{t+1} + δx_{t+1} ≈ F(x*_t, u*_t) + A_t δx_t + B_t δu_t, (2)

where A_t = ∇_x F(x*_t, u*_t) and B_t = ∇_u F(x*_t, u*_t). Since x*_{t+1} = F(x*_t, u*_t) by the definition of (1), (2) is equivalent to:

δx_{t+1} ≈ A_t δx_t + B_t δu_t. (3)

Here, we define new variables z_t = x_t − x*_t and ν_t = u_t − u*_t. Then the system dynamics can be expressed as:

z_{t+1} = A_t z_t + B_t ν_t. (4)

Assume a quadratic cost function:

J = Σ_{t=0}^{∞} (z_t^⊤ Q z_t + ν_t^⊤ R ν_t), (5)

where Q ⪰ 0 and R ≻ 0. The problem is then formulated as a standard LQR problem [14]. Here, we use the infinite-horizon objective function for computational efficiency. The resulting errors from this simplification would be sufficiently compensated by the online optimization strategy of MPC. It is worth mentioning that the dynamics near the optimal trajectory can be interpreted as locally linear because the discretized time frame of MPC is short enough. As a result, a linear feedback controller such as LQR can be adopted for this problem.
One should note that the tracking controller runs faster than the MPC. The discretized time period in (4) is denoted by ∆t_1, and the one in (5) is denoted by ∆t_2. Converting the system dynamics (4) to the faster sampling period ∆t_2 results in:

z̃_{t+1} = Ã_t z̃_t + B̃_t ν̃_t,  where Ã_t = I + (∆t_2/∆t_1)(A_t − I) and B̃_t = (∆t_2/∆t_1) B_t, (6)

and where we use the tilde notation (•̃) to denote variables that are discretized with the shorter sampling period. Notice that the formulation of the converted dynamics in (6) is that of a common linear interpolation.
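As a sketch, under the assumption that the linear-interpolation conversion takes the form written in (6) above, the re-discretization can be implemented as:

```python
import numpy as np

def convert_dynamics(A, B, dt1, dt2):
    """Re-discretize (A, B) from the MPC period dt1 to the faster
    feedback period dt2 by linear interpolation, as in (6)."""
    r = dt2 / dt1
    n = A.shape[0]
    A_tilde = np.eye(n) + r * (A - np.eye(n))   # interpolate between I and A
    B_tilde = r * B                             # scale the input map accordingly
    return A_tilde, B_tilde
```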
In the discrete-time linear quadratic control problem of (5) and (6), we can solve for ν̃_t to yield the optimal linear feedback policy using the discrete Riccati equation [15]:

ν̃_t = −K_t z̃_t, (7)

where K_t is the feedback gain:

K_t = (R + B̃_t^⊤ P B̃_t)^{−1} B̃_t^⊤ P Ã_t, (8)

and P is the solution of the discrete algebraic Riccati equation P = Q + Ã_t^⊤ P Ã_t − Ã_t^⊤ P B̃_t (R + B̃_t^⊤ P B̃_t)^{−1} B̃_t^⊤ P Ã_t.

Algorithm 1: TOAST. Given: the learned dynamics, the conversion (6), and the gain computation (8).
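A minimal sketch of computing the gain in (8) by fixed-point iteration on the discrete algebraic Riccati equation follows; scipy.linalg.solve_discrete_are would serve equally well, and the iteration count and tolerance are assumptions:

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=200, tol=1e-9):
    """Solve the infinite-horizon discrete algebraic Riccati equation by
    fixed-point iteration and return the gain K such that nu_t = -K @ z_t.
    A and B are the converted (fast-rate) matrices from (6)."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # gain for current P
        P_next = Q + A.T @ P @ (A - B @ K)                 # Riccati recursion
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K as in (8)
```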

C. Trajectory Optimization and Simultaneous Tracking
In the previous section, we demonstrated that the system dynamics used for trajectory optimization can also be utilized by a trajectory-following feedback controller. We now describe how the two controllers in our proposed control framework, called Trajectory Optimization and Simultaneous Tracking (TOAST), operate harmoniously. In addition, we address practical issues that may arise during implementation.
Expanding (7), we obtain:

ũ_t = ũ*_t − K_t (x̃_t − x̃*_t). (9)

Note that ũ*_t is the optimal feedforward command from MPC. Here, ũ*_t can be replaced with u*_0, as the same command is maintained during the processing time ∆t_1 of MPC. Then, we have:

ũ_t = u*_0 − K_t (x̃_t − x̃*_t). (10)

Since x̃_t can be obtained by measuring directly from the sensors or through state estimation algorithms, such as an extended Kalman filter, it can be updated at a much faster refresh rate than the feedforward commands. Let t_0 be the instant at which the feedforward command u*_0 is received from MPC, and let t be the instant at which the feedback controller receives the state x̃_t. Then, x̃*_t in (10) is synchronously updated via linear interpolation:

x̃*_t = x*_0 + ((t − t_0)/∆t_1)(x*_1 − x*_0). (11)

From the aforementioned derivations, we can derive a generic control scheme that includes an optimal feedforward controller and a locally linear optimal feedback policy that adapts to the former. If there are no external disturbances or sensor noises, (x̃_t − x̃*_t) will remain nearly zero, and the feedforward command u*_0 will take precedence during ∆t_1. Conversely, if the state of the system deviates from the expected trajectory [as given in (11)], the feedback command will start to operate to complement the feedforward command and keep the system on the desired trajectory.
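A sketch of the fast-rate control law (10) with the synchronously interpolated reference (11); the variable names are illustrative:

```python
import numpy as np

def feedback_command(u_star_0, K, x_t, x_star_0, x_star_1, t, t0, dt1):
    """Hold the MPC feedforward command u*_0 and correct it with the LQR
    gain against a reference state interpolated between x*_0 and x*_1."""
    alpha = (t - t0) / dt1                             # fraction of the MPC period elapsed
    x_ref = x_star_0 + alpha * (x_star_1 - x_star_0)   # interpolated reference, (11)
    return u_star_0 - K @ (x_t - x_ref)                # feedforward + feedback, (10)
```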
Since the feedback controller utilizes the same dynamics F as the feedforward controller, our control scheme can be applied very effectively to existing control tasks where the system dynamics models are highly complex and nonlinear. In recent studies of MPC [1], [5] and model-based reinforcement learning [6], for instance, the dynamics models are approximated by neural networks. Our feedback controller can use the learned dynamics F_θ directly to solve optimal tracking problems without additional resources. Note that the reference state at the next time step, x*_1 in (11), cannot be acquired instantly during feedforward trajectory optimization, since sampling-based MPC indirectly optimizes the action trajectory U* rather than the state trajectory X*. Meanwhile, we implement the linearization to obtain the Jacobians of the neural network dynamics with the PyTorch autograd automatic differentiation package [16]. To automatically compute the gradients of F_θ with respect to x_t and u_t, one step of forward propagation through the model (1) must be done first in order to obtain x_{t+1}. We can take advantage of this result to interpolate the current reference state, circumventing additional computation. The overall algorithm of our TOAST framework is shown in Algorithm 1.
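A minimal sketch of this linearization step with torch.autograd; note that evaluating F_theta at the reference point also yields the next reference state, which is reused for the interpolation in (11):

```python
import torch

def linearize(F_theta, x_star, u_star):
    """Jacobians A = dF/dx and B = dF/du of the learned dynamics at the
    reference point, computed with PyTorch automatic differentiation."""
    A = torch.autograd.functional.jacobian(lambda x: F_theta(x, u_star), x_star)
    B = torch.autograd.functional.jacobian(lambda u: F_theta(x_star, u), u_star)
    x_next = F_theta(x_star, u_star)   # reference state at the next step, reused in (11)
    return A, B, x_next
```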

D. Extension to an Aggressive Autonomous Driving Task
Our research goal is to control a UGV driving aggressively under challenging conditions, such as sharp corners and unknown surfaces with high and low friction. The vehicle will slide on low-friction corners, causing the tire lateral forces to enter the highly nonlinear zone. Although longitudinal and lateral accelerations are robust features for estimating the future states of sliding vehicles, such measurements from real-world sensors are highly noisy. One effective solution to this problem is to provide a history of the vehicle's velocity states and control inputs to the neural network [17], [18]. Following our earlier study [13], where the results were also promising, we set the history length to H = 4. Because the history length is relatively short, we employ a fully-connected neural network for computational efficiency. Note that other types of model architectures, such as Recurrent Neural Networks (RNNs), can also be utilized if the Jacobians of the model can be obtained. However, ablation studies on model architecture selection are beyond the scope of this paper.
Then, the approximated model can be expressed in the following form:

x_{t+1} = F_θ(x_t, x_{t−1}, ..., x_{t−H+1}, u_t, u_{t−1}, ..., u_{t−H+1}), (12)

where H denotes the length of the history. Since (12) does not follow the general form of dynamics, direct linearization is not applicable. Substituting this dynamics model into the Taylor linearization (2) yields:

x*_{t+1} + δx_{t+1} ≈ F_θ(x*_t + δx_t, ..., x*_{t−H+1} + δx_{t−H+1}, u*_t + δu_t, ..., u*_{t−H+1} + δu_{t−H+1}), (13)

where we can replace x* + δx with x, and u* + δu with u. The state and control deviations δx and δu before the current time step t were already realized in the past, so they are zero at time t. As a result of causality, the linearization of the dynamics with historical encoding reduces to a current-state-dependent system. In a typical vehicle state estimation setting, the state x in the approximated dynamics F_θ consists of the longitudinal velocity v_x, the lateral velocity v_y, and the yaw rate r. However, the purpose of the tracking controller is not to follow the command and state at the velocity level, but at the position level. We therefore augment the state space with the dynamic state variables x_d = (v_x, v_y, r) and the kinematic state variables x_k such that:

x = [x_k, x_d]^⊤, (14)

where x_k consists of the x-position p_x, the y-position p_y, and the yaw angle θ in the global coordinate frame. The next kinematic state can be updated from the current dynamic state:

p_{x,t+1} = p_{x,t} + (v_x cos θ_t − v_y sin θ_t) ∆t,
p_{y,t+1} = p_{y,t} + (v_x sin θ_t + v_y cos θ_t) ∆t,
θ_{t+1} = θ_t + r ∆t. (15)

The augmented full state is then updated via the neural network F_θ and the explicit kinematic function (15). Linearization can still be done using the automatic differentiation method. The overall control architecture, as applied to an autonomous driving task, is depicted in Fig. 1.
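The augmented update can be sketched as follows; the body-to-global kinematic form of (15) and the tensor layout of the history are assumptions of this sketch:

```python
import torch

def step_augmented(F_theta, x_k, xd_hist, u_hist, dt):
    """Propagate the augmented state (14)-(15): the dynamic state
    (v_x, v_y, r) is advanced by the history-conditioned network F_theta,
    while the kinematic state (p_x, p_y, theta) is integrated explicitly.
    xd_hist (H, 3) and u_hist (H, 2) stack the last H states and actions."""
    z = torch.cat([xd_hist.flatten(), u_hist.flatten()])  # network input of size (|x_d|+|u|)*H
    x_d = xd_hist[-1]
    x_d_next = x_d + F_theta(z)                           # network predicts the residual
    p_x, p_y, theta = x_k
    v_x, v_y, r = x_d
    # explicit kinematic update (15), body frame -> global frame
    p_x = p_x + (v_x * torch.cos(theta) - v_y * torch.sin(theta)) * dt
    p_y = p_y + (v_x * torch.sin(theta) + v_y * torch.cos(theta)) * dt
    theta = theta + r * dt
    return torch.stack([p_x, p_y, theta]), x_d_next
```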

III. EXPERIMENTS
For performance comparison, we implement four feedback configurations on top of the same feedforward controller, SMPPI. These are: 1) no feedback controller (abbreviated as "No Feedback"), 2) a feedback controller with a low hand-tuned gain matrix ("Low Gain"), 3) a feedback controller with a high hand-tuned gain matrix ("High Gain"), and 4) our TOAST controller. Note that "Low Gain" and "High Gain" are set to the minimum and maximum gains, respectively, that TOAST exhibited throughout the experiments. We perform a grid search to find the control parameters showing the most successful performance for each task. The discretized time step for the feedforward controller, ∆t_1, is set to 0.1 s, while that for the feedback controller, ∆t_2, is set to 0.01 s. In the experiments, we use a neural network to approximate the system dynamics.

A. Pendulum
We first evaluate the performance of the compared controllers using classical control benchmarks: Pendulum and Cartpole. To emulate external disturbances, one can consider adding perturbations to the transition dynamics [19] or adding noise to the actions [20]. We use a more intuitive method by modeling a continuously varying crosswind. The crosswind applied to the pole's surface is converted to an imposed torque. This environment is visualized in Fig. 2a.
We designed the neural network dynamics with two fully-connected hidden layers, each with 32 neurons. The model predicts the residual difference between the current state and the next state, i.e., it maps (x_t, u_t) to x_{t+1} − x_t. We designed two distinct tasks: 1) swing up the pole, as in other control studies, and 2) swing down the pole and maintain a static state. The swing-down task is designed to assess the chattering level of the controller when the system is in a stable condition. The running cost function of the swing-up task penalizes deviations of the pole angle and angular velocity from the upright equilibrium. Likewise, the cost function of the swing-down task is formulated to keep the pole pointing downward against external disturbances. While only MPC was activated, a dataset was collected online, starting from 1000 random bootstrap samples, and the model was trained on the dataset every 50 time steps until the pole was balanced successfully in an upright position. After training, the feedback controller was activated for the experiments. For each task, the dynamics were trained with 10 different fixed random seeds, and the controllers were tested at 50 random locations for each of the pre-trained dynamics. In the first half of the 500 tests, random winds were applied, whereas sinusoidal winds were applied in the second half. Each controller was tested under exactly the same conditions except for the feedback controller. We analyzed the state costs for the swing-up task, and the derivative actions for the swing-down task to quantify the degree of chattering (shown in Fig. 3). The tracking errors of "No Feedback" and TOAST during a Pendulum swing-up test are depicted in Fig. 4.
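A sketch of this residual dynamics model in PyTorch; the tanh activation and the two-dimensional state (θ, θ̇) are assumptions, since the text specifies only two hidden layers of 32 neurons:

```python
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """Predicts the residual x_{t+1} - x_t from (x_t, u_t)."""
    def __init__(self, n_state=2, n_action=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state + n_action, 32), nn.Tanh(),
            nn.Linear(32, 32), nn.Tanh(),
            nn.Linear(32, n_state),
        )

    def forward(self, x, u):
        # add the predicted residual back to obtain the next state
        return x + self.net(torch.cat([x, u], dim=-1))
```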

B. Cartpole
The majority of Cartpole's experimental settings follow those of Pendulum. We modify the common Cartpole environment to provide a continuous action space with a range of (−10, 10) N, in which the force is applied to the cart horizontally (shown in Fig. 2b). We use neural network dynamics with the same architecture as for Pendulum, predicting the residual of the state, where the velocity of the cart ẋ can be derived using only the model inputs. The kinematic variables can then be obtained using Euler integration, as described in Section II-D. The running cost function of the swing-up task penalizes the cart position as well as deviations of the pole angle and angular velocity from the upright equilibrium. The cost function of the swing-down task follows the same method as for Pendulum.
For TOAST, the control parameters were set as follows: Q = Diag(1, 1, 10, 10), R = Diag(0.001). The signs of the hand-tuned gains for "Low Gain" and "High Gain" were switched depending on whether the pole was in the upper half-plane or the lower half-plane. This is because the action in Cartpole, unlike in Pendulum, is applied to the cart horizontally.
The experimental results of the four tasks are shown in Fig. 3. The results show that both "High Gain" and TOAST could keep the pole upright against the continuously varying external disturbances. MPC without a feedback controller failed to control the systems in most cases. "Low Gain" performed better than "No Feedback", but still failed to achieve the desired performance level. On the other hand, "Low Gain" and "No Feedback" showed low action chattering levels in the swing-down tasks, as expected. TOAST also demonstrated a much lower chattering level than "High Gain", suggesting that our proposed feedback controller does not interfere with MPC when the tracking errors are negligible.

C. Aggressive Autonomous Driving
We also evaluated the performance of the different controllers in a high-fidelity vehicle simulator, IPG CarMaker, which has been widely used for the precise validation of nonlinear vehicle dynamics [21]–[23]. We built a race track with a length of 1016 m, two moderate curves, and four sharp curves (see Fig. 2c). A Volvo XC90 is used as the control vehicle. Unlike the electric vehicles used in recent learning-based aggressive autonomous driving studies [5], [24], our vehicle uses an automatic transmission for gear-shifting. Therefore, we employ the desired speed (v_des), instead of throttle, as the high-level controller's input and let a low-level plant controller manage the throttle and brake. Thus, the control input u consists of the steering angle δ and the desired speed v_des.
1) Training Neural Network Vehicle Dynamics: Our fully-connected neural network approximates the nonlinear vehicle dynamics. Note that only the dynamic state variables are considered as input to the neural network. We denote |x_d| and |u| as the sizes of the controller's state and action. The history of states and actions, which has a size of (|x_d| + |u|) · H, is forward propagated through the neural network. The output of the network predicts the residual difference between the current state and the next state, which has the size |x_d|. The network was implemented as an MLP with four hidden layers. We applied a hyperbolic tangent (tanh) activation function, and the network was trained with the mean squared error loss function and the Adam optimizer.
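A sketch of the model and one training step; the hidden width of 128 and the learning rate are assumptions, since the text specifies only the input size (|x_d| + |u|) · H, the output size |x_d|, four hidden layers, tanh activations, the MSE loss, and the Adam optimizer:

```python
import torch
import torch.nn as nn

H, n_xd, n_u = 4, 3, 2                            # history length, |x_d|, |u|
model = nn.Sequential(                            # four hidden layers with tanh
    nn.Linear((n_xd + n_u) * H, 128), nn.Tanh(),
    nn.Linear(128, 128), nn.Tanh(),
    nn.Linear(128, 128), nn.Tanh(),
    nn.Linear(128, 128), nn.Tanh(),
    nn.Linear(128, n_xd),                         # residual of the dynamic state
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(batch_hist, batch_residual):
    """One supervised step on (state/action history -> next-state residual)
    pairs from the 10 Hz driving dataset."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch_hist), batch_residual)
    loss.backward()
    optimizer.step()
    return loss.item()
```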
We collected a human-controlled driving dataset, at a data rate of 10 Hz, to train our network by expanding the methods used in our previous work [23]. We found that the dataset should comprise three types of distinct maneuvers in order for the neural network to accurately represent the vehicle dynamics under various friction conditions: i) zig-zag driving at low speeds (20–25 km/h) on the race track, in both clockwise and counter-clockwise directions, with driving in each direction treated as a separate maneuver; ii) high-speed driving on the race track in both directions, trying to maintain 40 km/h as much as possible; and iii) sliding maneuvers at the friction limits on flat ground, in combinations of acceleration and deceleration with various steering angles.
The above five maneuvers were performed with seven different friction coefficients: [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]. These 35 maneuvers were logged for two minutes each, yielding a total of 70 minutes of driving data. We divided the data into 70% for training and 30% for testing after shuffling the data to break temporal correlations. The trained model was saved when it showed the best test error and was then evaluated with the validation dataset. We collected the validation data on the race track in both directions. The default friction coefficient was set to 0.8 and the friction coefficients of the six curves were set to [0.95, 0.85, 0.75, 0.65, 0.55, 0.45], respectively, values that were not included in the training data.
The test and validation errors of our neural network are shown in Table I. The Root Mean Square Error (RMSE) is denoted as E_RMS and the maximum error is denoted as E_max. The results show that our neural network can make accurate one-step predictions under a variety of driving conditions.

2) Experimental Setup: We designed the state-dependent cost function c(x) in MPC as the sum of a track cost, a speed cost, and a slip cost. The track cost has the form Track(x) = (0.9)^t · 10000 · M(p_x, p_y), where M(p_x, p_y) is the two-dimensional cost-map value at the global position (p_x, p_y), and an indicator function I provides an impulse-like penalty; thanks to the sampling-based derivation of SMPPI, such an impulse-like penalty can be used directly in the cost function. The speed cost is a simple quadratic cost for achieving the reference vehicle speed v_ref. The slip cost imposes both soft and hard costs to discourage slip angle along the trajectory, where the slip angle is σ = −arctan(v_y / v_x). A trajectory expected to have a slip angle greater than 0.2 rad (approximately 11.46°) is penalized, since it has the potential to make the vehicle unstable.
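A sketch of this running cost; the soft-slip weight and the exact placement of the indicator penalty are assumptions, while the track term, the quadratic speed term, and the 0.2 rad hard-slip threshold follow the text:

```python
import torch

def running_cost(x, t, cost_map, v_ref, lam_slip=10.0):
    """State-dependent cost c(x) = Track + Speed + Slip (a sketch).
    cost_map is an assumed callable returning the 2-D cost-map value."""
    p_x, p_y, theta, v_x, v_y, r = x
    track = (0.9 ** t) * 10000.0 * cost_map(p_x, p_y)   # decaying track penalty
    speed = (v_x - v_ref) ** 2                          # quadratic speed cost
    sigma = -torch.atan2(v_y, v_x)                      # slip angle
    slip = lam_slip * sigma ** 2 \
         + 10000.0 * (sigma.abs() > 0.2).float()        # soft + impulse-like hard cost
    return track + speed + slip
```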
For SMPPI, the number of parallel samples was 10000 and the number of time steps was 35. For TOAST, the control parameters Q and R were chosen by the grid search described at the beginning of this section.

3) Experimental Results: We evaluated the control performance of the four controllers on the race track. The goal of the controllers was to complete whole laps in a clockwise direction.
The race track was adjusted to be more challenging to drive than it was when the training dataset was collected. The friction coefficients µ were modified to have values of [0.55, 0.60, 0.65, 0.70, 0.75, 0.80] at the corners, respectively, and a default of 1.0 elsewhere. During ten laps around the course, we measured the average lap times on the six corners at a reference speed of 40 km/h. When the vehicle left the track, it was placed at the starting point and started a new lap. All of the methods used the same pre-trained model, and no further training was performed throughout the experiments. The results are shown in Table II.
"No Feedback" and "Low Gain" barely got through Corner #2 and Corner #3, which are the most challenging sections  of the track.They completed the full course with only one and three out of ten laps, respectively.The learned dynamics predicted that the vehicle would slide greatly if it did not slow down due to the low friction surfaces.Therefore, SMPPI generated trajectories that applied brakes first, then steered the vehicle and slowly accelerated it to pass the corners.However, the vehicle failed to follow the planned trajectory and understeering occurred due to model uncertainty.In contrast, "High Gain" and "TOAST" completed entire laps in 100% of the trials.However, the results show that TOAST has a faster lap time than "High Gain" in most cases.This is because if the feedback controller keeps a fixed high gain all the time, it frequently interferes with the control sequence of the MPC and thus loses the optimality of the solution.Our controller computes the optimal feedback gain for each optimal control sequence based on the contextual information encoded in the neural network dynamics.As a result, a low gain can be maintained in situations where feedback compensation is not greatly required.We show the time-varying feedback gains of the compared controllers in Fig. 5.For TOAST, the overall average speed during ten laps was 33.4 km/h and the maximum slip angle was 7.4 • .We visualize the vehicle trajectories on Corner #2 and Corner #3 in Fig. 6.We also analyzed the degree of chattering on control values with "High Gain" and TOAST.It is a well-known fact that rapid changes in the action commands are a burden to the actuators.In the straight sections of the track with high friction coefficients, the vehicle is stable and the requirement for a feedback controller becomes negligible.Therefore, we analyzed the derivative actions during 10 laps excluding the corners.The results are shown in Fig. 7.For "High Gain", the average derivative actions of the steering angle (δ) and the desired speed (v des ) were 1.31, and 0.81, respectively.On the other hand, those of TOAST were 0.96 and 0.46.The results demonstrate that the adaptive manner of our proposed method can alleviate chattering.

IV. CONCLUSION
We presented a novel control scheme that combines online trajectory optimization and optimal tracking control using the same learned dynamics. We described how to convert a single dynamics model so that it can be used by controllers on two different time scales. We also explained how our method handles dynamics models containing history information. The performance of the proposed method was evaluated using two control benchmarks, demonstrating that our ancillary feedback controller is capable of regulating tracking errors without interfering with the optimal feedforward control sequence. We also extended our work to an aggressive autonomous driving task, demonstrating its scalability. In all experiments, our method outperformed the compared methods by a large margin. In the driving task, our controller also showed control chattering levels that were 24–43% lower than those of the controller with high hand-tuned gains.

Fig. 2. We evaluate our algorithm using classical control benchmarks and aggressive autonomous driving. Crosswinds are modeled in (a) Pendulum and (b) Cartpole as external disturbances. (c) A high-fidelity vehicle simulator is used for the autonomous driving experiments. The trajectory of the vehicle driven by our control framework for 10 laps around the track in a counter-clockwise direction is also visualized.

Fig. 3. Experimental results in the four control tasks. The derivative actions are quantified using the L2 norm.

Fig. 4. Visualization of tracking errors during a Pendulum swing-up task. The feedforward controller optimizes a new trajectory every 0.1 s. At that moment, the tracking error is reset to zero (red squares). TOAST effectively regulates the trajectory deviations caused by external disturbances.

Fig. 5. Visualization of the time-varying feedback gains of the different controllers while driving the same section: (a) "Low Gain", (b) "High Gain", and (c) TOAST. For clarity, only the gains corresponding to the steering angle are visualized. The hand-tuned gains of p_x and p_y were rotated from the global to the vehicle coordinate frame. TOAST adaptively changes the gains according to the time-varying dynamic characteristics of the system.

Fig. 6. Vehicle trajectories of the compared controllers. The friction coefficient of Corner #2 (in gray) is 0.60 and that of Corner #3 (in brown) is 0.65.

Fig. 7. Derivative action for each control variable, quantified by the L2 norm.

TABLE II. Average lap times on the six corners for the different control methods. The minimum speed (in km/h) and maximum slip angle (in °) at each corner are also analyzed for our method. The success rates (SR) are also displayed.