Tube-NeRF: Efficient Imitation Learning of Visuomotor Policies From MPC via Tube-Guided Data Augmentation and NeRFs

Imitation learning (IL) can train computationally-efficient sensorimotor policies from a resource-intensive model predictive controller (MPC), but it often requires many samples, leading to long training times or limited robustness. To address these issues, we combine IL with a variant of robust MPC that accounts for process and sensing uncertainties, and we design a data augmentation (DA) strategy that enables efficient learning of vision-based policies. The proposed DA method, named Tube-NeRF, leverages Neural Radiance Fields (NeRFs) to generate novel synthetic images, and uses properties of the robust MPC (the tube) to select relevant views and to efficiently compute the corresponding actions. We tailor our approach to the task of localization and trajectory tracking on a multirotor, by learning a visuomotor policy that generates control actions using images from the onboard camera as the only source of horizontal position information. Numerical evaluations show an 80-fold increase in demonstration efficiency and a 50% reduction in training time over current IL methods. Additionally, our policies successfully transfer to a real multirotor, achieving low tracking errors despite large disturbances, with an onboard inference time of only 1.5 ms.

Andrea Tagliabue and Jonathan P. How

Index Terms—Imitation learning, deep learning for visual perception, aerial systems: perception and autonomy, NeRF.

I. INTRODUCTION
Imitation learning (IL) can generate computationally-efficient sensorimotor neural network (NN) policies [1], [2], [3] for onboard sensing, planning and control on mobile robots. Training data is typically obtained from a compute-heavy model predictive controller (MPC) [4], [5], [6], acting as expert demonstrator. The resulting NN outputs actions from raw sensory inputs, bypassing the computational cost of (i) state estimation, (ii) localization and (iii) the optimization problem in MPC.
However, a key limitation of existing IL methods for training sensorimotor policies (Behavior Cloning (BC) [2], DAgger [3]) is their limited robustness or the overall number of demonstrations that must be collected from the MPC. These issues are rooted in training/deployment data distribution mismatches (covariate shift).

Fig. 1. Tube-NeRF collects a real-world demonstration using output feedback tube MPC, a robust MPC that accounts for process and sensing uncertainties through its tube cross-section Z. It then generates a Neural Radiance Field (NeRF) of the environment from the collected images I_t, and uses the tube's cross-section to guide the selection of synthetic views I⁺_t from the NeRF for data augmentation, while the corresponding actions are obtained via the ancillary controller, an integral component of the tube MPC framework.
Leveraging high-fidelity simulators on powerful computers in combination with domain randomization (DR) avoids demanding data collection on the real robot, but it introduces sim-to-real gaps that are especially challenging when learning visuomotor policies, i.e., policies that directly use raw pixels as input. For this reason, sensorimotor policies trained in simulation often leverage as input easy-to-transfer visual abstractions, such as feature tracks [5], depth maps [11], [12], intermediate layers of a convolutional NN (CNN) [13], or a learned latent space [14]. However, all these abstractions discard information that may instead benefit task performance.
This work introduces Tube-NeRF (Fig. 1), a novel DA framework for efficient, robust visuomotor policy learning from MPC that overcomes the aforementioned limitations in DA. Building on our prior DA strategy [8], which uses robust tube MPC (RTMPC) to efficiently and systematically generate additional training data for motor control policies (actions from full state), Tube-NeRF enables learning of policies that directly use vision as input, relaxing the constraining assumption in [8] that full-state information is available at deployment.
Tube-NeRF first collects robust task demonstrations that account for the effects of process and sensing uncertainties via an output feedback variant of RTMPC [20]. Then, it employs a photorealistic representation of the environment, based on a NeRF, to generate synthetic novel views for DA, using the tube of the controller to guide the selection of the extra novel views and sensory inputs, and using the ancillary controller to efficiently compute the corresponding actions. Additionally, the tube is employed to generate queries from a database of real-world observations, reducing the synthetic-to-real gap when only NeRF images are used for DA. Lastly, we adapt our approach to a multirotor, training a visuomotor policy for robust trajectory tracking and localization using onboard camera images and additional measurements of altitude, orientation, and velocity. The generated policy relies solely on images to obtain information on the robot's horizontal position, which is a challenging task due to (1) its high speed (up to 3.5 m/s), (2) varying altitude, (3) aggressive roll/pitch changes, (4) the sparsity of visual features in our flight space, and (5) the presence of a safety net that moves due to the down-wash of the propellers and that produces semi-transparent visual features above the ground.
Contributions: This Letter is an evolution of our previous conference paper [21], where we demonstrated the capabilities of an output feedback RTMPC-guided DA strategy in simulation. Now, we present a set of algorithmic changes that enabled the first real-world, real-time deployment of the approach. Specifically, different from [21], our new algorithm: (1) Utilizes a NeRF for photo-realistic novel view synthesis for DA, replacing a 3D mesh, and presents a calibration method to integrate the NeRF into the DA framework. (2) Further reduces the sim-to-real gap that, although small, still exists in the synthetic images from the NeRF by modifying the DA strategy to additionally sample available real-world images inside the tube. (3) Accounts for visual changes in the environment by introducing randomizations in image space during training. (4) Includes noisy altitude measurements in the policy's input. (5) Accounts for camera calibration errors by introducing perturbations in the camera extrinsics during DA. (6) Employs a more capable but computationally efficient visuomotor NN architecture for onboard deployment.
Therefore, this work contributes procedures to:
• Efficiently (in demonstrations and training time) learn a sensorimotor policy from MPC using a DA strategy grounded in output feedback RTMPC theory, unlike previous DA methods that rely on handcrafted heuristics.
• Apply our method to tracking and localization on a multirotor via images, altitude, attitude and velocity data.
• Achieve the first-ever real-time deployment of our approach, demonstrating (in more than 30 flights) successful agile trajectory tracking with policies learned from a single demonstration, which use onboard fisheye images to infer the horizontal position of a multirotor despite aggressive 3D motion and a variety of sensing and dynamics disturbances. Our policy has an average inference time of only 1.5 ms onboard a small GPU (Nvidia Jetson TX2) and is deployed at 200 Hz.

II. RELATED WORK

Table I presents state-of-the-art approaches for sensorimotor policy learning from demonstrations (from MPC or humans), focusing on mobile robots. Tube-NeRF is the only method that (i) explicitly accounts for uncertainties, (ii) is efficient to train, and (iii) does not require visual abstractions (iv) nor specialized data collection setups. Related to our research, [19] employs a NeRF for DA from human demonstrations for manipulation, but uses heuristics to select relevant views, without explicitly accounting for uncertainties.
NeRFs. NeRFs [23] enable efficient and photorealistic novel view synthesis by directly optimizing the photometric accuracy of the reconstructed images, in contrast to traditional 3D photogrammetry methods (e.g., for 3D meshes). This provides accurate handling of transparency, reflective materials, and lighting conditions. Ref. [25] queries a NeRF online for estimation, planning, and control on a UAV, but this results in a 1000× higher computation time than our policy.
Output feedback RTMPC. MPC [26] solves a constrained optimization problem that uses a model of the system dynamics to plan actions that satisfy state and actuation constraints. RTMPC assumes that the system is subject to additive, bounded process uncertainty (disturbances, model errors) and employs an auxiliary (ancillary) controller that maintains the system within some known distance (the cross-section of a tube) of the plan [20]. Output feedback RTMPC [20], [27] additionally accounts for the effects of sensing uncertainty (noise, estimation errors) by increasing the cross-section of the tube. Our method uses an output feedback RTMPC for data collection but bypasses its computational cost at deployment by learning a NN policy.

III. PROBLEM FORMULATION
Our goal is to efficiently train a NN visuomotor policy π_θ (student), with parameters θ, that tracks a desired trajectory on a mobile robot (multirotor). π_θ takes as input images, which are needed to extract partial state information (horizontal position, in our evaluation), and other measurements. The trained policy, denoted π_θ*, needs to be robust to uncertainties encountered in the deployment domain T. It is trained using demonstrations from a model-based controller (expert) collected in a source domain S that presents only a subset of the uncertainties in T.
Student policy. The student policy has the form:

$u_t = \pi_\theta(o_t, X^{\mathrm{des}}_t), \quad (1)$

It outputs deterministic, continuous actions u_t to track a desired N+1-step (N > 0) trajectory X^des_t = {x^des_{0|t}, ..., x^des_{N|t}}. o_t = (I_t, o_other,t) are noisy sensor measurements comprised of an image I_t from an onboard camera, and other measurements o_other,t (altitude, attitude, velocity, in our evaluation).
Expert. Process model: The considered robot dynamics are:

$x_{t+1} = A x_t + B u_t + w_t, \quad w_t \in \mathbb{W}_T, \quad (2)$

where x_t ∈ X ⊂ R^{n_x} is the state, and u_t ∈ U ⊂ R^{n_u} the control inputs. The robot is subject to state and input constraints X and U, assumed convex polytopes containing the origin [20]. w_t in (2) captures time-varying additive process uncertainties in T, such as (i) disturbances (wind/payloads for a UAV) and (ii) model changes/errors (linearization errors and poorly known parameters). w_t is unknown, but the polytopic bounded set W_T is assumed known [20]. Sensing model: The expert has access to (i) the measurements o_other,t, and (ii) a vision-based position estimator g_cam that outputs noisy measurements o_pos,xy ∈ R^2 of the horizontal position p_xy,t ∈ R^2 of the robot:

$o_{\mathrm{pos,xy},t} = g_{\mathrm{cam}}(I_t) = p_{\mathrm{xy},t} + v_{\mathrm{cam},t}, \quad (3)$

where v_cam,t is the associated sensing uncertainty. The measurements available to the expert are denoted ō_t ∈ R^{n_o}, and they map to the robot state via:

$\bar{o}_t = C x_t + v_t, \quad v_t \in \mathbb{V}_T, \quad (4)$

where v_t is the additive sensing uncertainty (e.g., noise, biases) in T. V_T is a known bounded set obtained via system identification and/or via prior knowledge of the accuracy of g_cam.
State estimator. We assume the expert uses a state estimator:

$\hat{x}_{t+1} = A \hat{x}_t + B u_t + L(\bar{o}_t - C \hat{x}_t), \quad (5)$

where x̂_t ∈ R^{n_x} is the estimated state, and L ∈ R^{n_x × n_o} is the observer gain, set so that A − LC is Schur stable. The observability index of the system (A, C) is assumed to be 1, meaning that full state information can be retrieved from a single noisy measurement. In this case, the observer plays the critical role of filtering out the effects of noise and sensing uncertainties. Additionally, we assume that the state estimation dynamics and noise sensitivity of the learned policy approximately match those of the observer.
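For concreteness, a minimal numerical sketch of the observer update in (5); the matrices are assumed given, and all names are ours:

```python
import numpy as np

def observer_step(x_hat, u, o_bar, A, B, C, L):
    """One Luenberger observer update, eq. (5):
    x_hat' = A x_hat + B u + L (o_bar - C x_hat)."""
    innovation = o_bar - C @ x_hat   # measurement residual
    return A @ x_hat + B @ u + L @ innovation
```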

IV. METHODOLOGY
Overview. Tube-NeRF collects trajectory tracking demonstrations in the source domain S using an output feedback RTMPC expert combined with a state estimator (5), and IL methods (DAgger or BC). The chosen output feedback RTMPC framework is based on [20], [27], with its objective function modified to track a trajectory (Section IV-A), and is designed according to the priors on process and sensing uncertainties at deployment (T). Then, Tube-NeRF uses properties of the expert to design an efficient DA strategy, the key to overcoming efficiency and robustness challenges in IL (Section IV-B). The framework is then tailored to a multirotor, leveraging a NeRF as part of the proposed DA strategy (Section V).

A. Output Feedback RTMPC for Trajectory Tracking
Output feedback RTMPC for trajectory tracking regulates the system in (2) and (5) along a given reference trajectory X^des_t, while satisfying state and actuation constraints X, U regardless of sensing uncertainties (v_t, (4)) and process noise (w_t, (2)).
Preliminary (set operations): Given the convex polytopes A ⊂ R^n, B ⊂ R^n and a matrix M ∈ R^{m×n}, we define the Minkowski sum A ⊕ B := {a + b | a ∈ A, b ∈ B}, the Pontryagin difference A ⊖ B := {c ∈ R^n | c + b ∈ A, ∀b ∈ B}, and the linear mapping MA := {Ma | a ∈ A}.

Optimization problem, solved at every timestep t:

$$\min_{\bar{X}_t,\,\bar{U}_t} \ \|e_{N|t}\|_P^2 + \sum_{i=0}^{N-1} \left( \|e_{i|t}\|_Q^2 + \|\bar{u}_{i|t}\|_R^2 \right) \quad (6)$$
$$\text{s.t. } \bar{x}_{i+1|t} = A\bar{x}_{i|t} + B\bar{u}_{i|t}, \quad \bar{x}_{i|t} \in \bar{\mathbb{X}}, \quad \bar{u}_{i|t} \in \bar{\mathbb{U}}, \quad \bar{x}_{N|t} \in \bar{\mathbb{X}}_N,$$

where e_{i|t} = x̄_{i|t} − x^des_{i|t} represents the trajectory tracking error, X̄_t = {x̄_{0|t}, ..., x̄_{N|t}} and Ū_t = {ū_{0|t}, ..., ū_{N−1|t}} are safe reference state and action trajectories, and N + 1 is the length of the planning horizon. The positive semi-definite matrices Q (size n_x × n_x) and R (size n_u × n_u) are tuning parameters, ‖e_{N|t}‖²_P is a terminal cost (obtained by solving the infinite-horizon optimal control problem with A, B, Q, R), and x̄_{N|t} ∈ X̄_N is a terminal state constraint.
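As an illustration of (6), a minimal cvxpy sketch of the planning QP; the tightened box bounds x_max, u_max stand in for X̄ and Ū and are assumed precomputed, and pinning the initial nominal state to the estimate is a simplification of the tube formulation:

```python
import numpy as np
import cvxpy as cp

def tube_mpc_plan(A, B, x_hat, X_des, Q, R, P, x_max, u_max, N):
    """Sketch of the planning QP in (6); X_des has shape (n_x, N+1)."""
    nx, nu = B.shape
    xb = cp.Variable((nx, N + 1))   # nominal states  X_bar
    ub = cp.Variable((nu, N))       # nominal inputs  U_bar
    cost = cp.quad_form(xb[:, N] - X_des[:, N], P)   # terminal cost
    cons = [xb[:, 0] == x_hat]                       # simplification
    for i in range(N):
        e = xb[:, i] - X_des[:, i]                   # tracking error
        cost += cp.quad_form(e, Q) + cp.quad_form(ub[:, i], R)
        cons += [xb[:, i + 1] == A @ xb[:, i] + B @ ub[:, i],
                 cp.abs(xb[:, i]) <= x_max,          # tightened state box
                 cp.abs(ub[:, i]) <= u_max]          # tightened input box
    cp.Problem(cp.Minimize(cost), cons).solve()
    return xb.value, ub.value
```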
Ancillary controller. The control input u_t is obtained via:

$u_t = \bar{u}^*_t + K(\hat{x}_t - \bar{x}^*_t), \quad (7)$

where ū*_t := ū*_{0|t} and x̄*_t := x̄*_{0|t}, and K is computed by solving the LQR problem with A, B, Q, R. This controller maintains the system inside a set Z ⊕ x̄*_t (the "cross-section" of a tube centered around x̄*_t), regardless of the uncertainties.

Tube and robust constraints. Process and sensing uncertainties are taken into account by tightening the constraints X, U, obtaining X̄ and Ū in (6). The amount by which X, U are tightened depends on the cross-section of the tube Z, which is computed (see [20]) by considering the closed-loop system formed by the ancillary controller (7), the nominal dynamics in (2), and the observer (5). Sensing uncertainties in V_T and process noise in W_T introduce two sources of error in this system: a) the state estimation error ξ^est_t := x_t − x̂_t, and b) the control error ξ^ctrl_t := x̂_t − x̄*_t. These errors can be combined in a vector ξ_t = [ξ^est_t^T, ξ^ctrl_t^T]^T with dynamics (see [27]):

$\xi_{t+1} = A_\xi \xi_t + \delta_t, \quad \delta_t \in \mathbb{D}. \quad (8)$

By design, A_ξ is Schur stable, and the system is subject to uncertainties from the convex polytope D. Then, it is possible to compute the minimal Robust Positive Invariant (RPI) set S, that is, the smallest set satisfying:

$A_\xi \mathbb{S} \oplus \mathbb{D} \subseteq \mathbb{S}. \quad (9)$

S represents the possible set of state estimation and control errors caused by uncertainties and is used to compute X̄ and Ū. Specifically, the error between the true state x_t and the reference state x̄*_t satisfies:

$x_t - \bar{x}^*_t \in \mathbb{Z} := [I, \; I]\,\mathbb{S}. \quad (10)$

As a consequence, the effects of noise and uncertainties can be taken into account by tightening the constraints by the amount:

$\bar{\mathbb{X}} = \mathbb{X} \ominus \mathbb{Z}, \quad \bar{\mathbb{U}} = \mathbb{U} \ominus K[0, \; I]\,\mathbb{S}. \quad (11)$

Z (the cross-section of the tube) is the set of possible deviations of the true state x_t from the safe reference x̄*_t.

Computing S. While accurately computing the minimal RPI set S for high-dimensional systems can be challenging [27], for simplicity we efficiently obtain S from W_T and V_T via Monte Carlo simulations of (8), uniformly sampling instances of the uncertainties and computing an outer axis-aligned bounding box of the trajectories of ξ. In addition, we treat linearization errors and future changes in the reference trajectory as an additional source of uncertainty, computing the tube based on an increased process uncertainty prior W̄_T, and numerically validating that the resulting expert is robust. While this procedure is approximate, we found it computationally tractable and useful for estimating tubes with an adequate level of conservativeness. Fig. 2 shows an example of this controller for trajectory tracking on a multirotor, highlighting changes to the reference trajectory to respect state constraints under uncertainties.
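A minimal sketch of the Monte Carlo approximation of S described above; the callback sample_d and all magnitudes are placeholders:

```python
import numpy as np

def estimate_rpi_box(A_xi, sample_d, n_traj=1000, T=500, seed=0):
    """Outer axis-aligned bounding box of trajectories of
    xi' = A_xi xi + d, d in D (eq. (8)), approximating the RPI set S.
    sample_d(rng) draws one uniform sample from the polytope D."""
    rng = np.random.default_rng(seed)
    n = A_xi.shape[0]
    lo, hi = np.zeros(n), np.zeros(n)
    for _ in range(n_traj):
        xi = np.zeros(n)
        for _ in range(T):
            xi = A_xi @ xi + sample_d(rng)
            lo = np.minimum(lo, xi)      # grow the box to cover xi
            hi = np.maximum(hi, xi)
    return lo, hi                        # box [lo, hi] outer-approximates S
```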

B. Tube-Guided DA for Visuomotor Learning
IL objective. We denote the expert, composed of the output feedback RTMPC in (6), (7) and the state observer in (5), as π*. The goal is to design an IL and DA strategy to efficiently learn the parameters θ* of the policy (1) by collecting demonstrations from the expert π*. In IL, this objective consists in minimizing the MSE loss:

$\theta^* = \arg\min_{\theta} \; \mathbb{E}_{\tau \sim p(\tau|\pi_\theta, T)} \Big[ \textstyle\sum_{t=0}^{T} \|\pi_\theta(o_t, X^{\mathrm{des}}_t) - u_t\|_2^2 \Big],$

where τ := {(o_t, u_t, X^des_t) | t = 0, ..., T} is a T+1-step (observation, action, reference) trajectory sampled from the distribution p(τ|π_θ, T). Such a distribution represents all the possible trajectories induced by the student policy π_θ in the deployment environment T. As observed in [7], [8], the presence of uncertainties in T makes IL challenging, as demonstrations are usually collected in a training domain (S) under a different set of uncertainties (W_S ⊆ W_T, V_S ⊆ V_T), resulting in a different distribution of training data.
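A minimal sketch of one gradient step on this MSE objective (PyTorch; the batch layout and names are illustrative):

```python
import torch

def il_update(policy, optimizer, batch):
    """One behavior-cloning step: match expert actions u_t given
    observations and reference. The policy here returns actions only;
    an auxiliary state-prediction head (Section V) would add a second
    MSE term to the loss."""
    images, o_other, x_des, u_expert = batch
    u_pred = policy(images, o_other, x_des)
    loss = torch.mean((u_pred - u_expert) ** 2)   # MSE over actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```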
Tube and ancillary controller for DA. To overcome these limitations, we design a DA strategy that compensates for the effects of process and sensing uncertainties encountered in T. We do so by extending our previous approach [8], named Sampling Augmentation (SA), which provided a strategy to efficiently learn a control policy (i.e., π : R^{n_x} → R^{n_u}) robust to process uncertainty (W_T). SA recognized that the tube in RTMPC [28] represents a model of the states that the system may visit when subject to process uncertainties. SA used the tube in [28] to guide the selection of extra states for DA, while the ancillary controller in [28] provided a computationally efficient way to compute the corresponding actions, maintaining the system inside the tube for every possible realization of the process uncertainty.
Tube-NeRF. Our new approach, named Tube-NeRF, employs the output feedback variant of RTMPC presented in Section IV-A. This has two benefits: (i) the controller appropriately introduces extra conservativeness during demonstration collection to account for sensing uncertainties (via the tightened constraints X̄ and Ū in (6)); and (ii) the tube Z in (10) additionally captures the effects of sensing uncertainty, guiding the generation of extra observations for DA. The new data collection and DA procedure is as follows.
1) Demonstration Collection: We collect demonstrations in S using the output feedback RTMPC expert (Section IV-A). Each T+1-step demonstration τ consists of the tuples (o_t, u_t, X^des_t), together with the corresponding estimated states x̂_t and safe references x̄*_t, ū*_t needed for DA.

2) Extra States and Actions for Synthetic Data Generation:
For every timestep t in τ, we generate N_synthetic,t > 0 extra (state, action) pairs (x⁺_{t,j}, u⁺_{t,j}), j = 1, ..., N_synthetic,t (details on how N_synthetic,t is computed are provided in Section IV-B4), by sampling extra states from the tube, x⁺_{t,j} ∈ x̄*_t ⊕ Z, and computing the corresponding control action u⁺_{t,j} using (7):

$u^+_{t,j} = \bar{u}^*_t + K(x^+_{t,j} - \bar{x}^*_t). \quad (12)$

The resulting u⁺_{t,j} is saturated to ensure that u⁺_{t,j} ∈ U.
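A minimal sketch of this augmentation step, approximating the tube cross-section Z as an axis-aligned box of half-widths Z_half (an assumption consistent with our bounding-box computation of S):

```python
import numpy as np

def augment_timestep(x_bar, u_bar, K, Z_half, u_min, u_max, n_samples, rng):
    """Sample extra states uniformly inside x_bar ⊕ Z (box approximation)
    and compute actions with the ancillary controller (12), saturated
    to the input constraints."""
    pairs = []
    for _ in range(n_samples):
        x_plus = x_bar + rng.uniform(-Z_half, Z_half)   # state in the tube
        u_plus = u_bar + K @ (x_plus - x_bar)           # eq. (12)
        pairs.append((x_plus, np.clip(u_plus, u_min, u_max)))
    return pairs
```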

3) Synthetic Observations Generation:
To generate the observations o⁺_{t,j} = (I⁺_{t,j}, o⁺_{other,t,j}) that serve as input to the sensorimotor policy (1) from the selected states x⁺_{t,j}, we employ the observation models (4) available to the expert. In the context of learning a visuomotor policy, we generate synthetic camera images I⁺_{t,j} using an inverse pose estimator ĝ⁻¹_cam, mapping camera poses T_IC to images via I⁺_{t,j} = ĝ⁻¹_cam(T⁺_IC,t,j), where T_IC denotes a homogeneous transformation from a world (inertial) frame I to a camera frame C. ĝ⁻¹_cam is obtained by generating a NeRF of the environment (discussed in more detail in Section V-A1) from the images I_0, ..., I_T in the collected demonstration τ, and by estimating the intrinsics/extrinsics of the camera onboard the robot. The camera poses T⁺_IC,t,j are obtained from the sampled states x⁺_{t,j}, which include the robot's position and orientation. They are computed as T⁺_IC,t,j = T_IB(x⁺_{t,j}) T̂_BC,t,j, where T_IB is the transformation from the robot's body frame B to the inertial frame I, and T̂_BC,t,j are perturbed versions of the nominal camera extrinsics, with perturbations introduced to accommodate uncertainties and errors in the extrinsics. Last, the full observations o⁺_{t,j} are obtained by computing o⁺_other,t,j using (4) and a selection matrix S:

$o^+_{t,j} = (I^+_{t,j},\; o^+_{\mathrm{other},t,j}), \qquad o^+_{\mathrm{other},t,j} = S C x^+_{t,j}. \quad (13)$

4) Tube-Guided Selection of Extra Real Observations: Beyond guiding the generation of extra synthetic data, we employ the tube of the expert to guide the selection of real-world observations from the demonstrations (τ) for DA. This procedure helps account for small imperfections in the NeRF and in the camera-to-robot extrinsic/intrinsic calibrations, further reducing the sim-to-real gap and providing an avenue to "ground" the synthetic images with real-world data. It involves creating a database of the observations o in τ, indexed by the robot's estimated state x̂, and then selecting, at each timestep t, N_real,t observations whose state lies inside the tube (x̂ ∈ x̄*_t ⊕ Z), adhering to the ratio N_real,t / N_samples ≤ ε̄, where 0 < ε̄ ≤ 1 is a user-defined parameter that bounds the ratio of real images to synthetic ones, and N_samples is the desired number of samples (real and synthetic) per timestep. The corresponding action is obtained from the state associated with the image via the ancillary controller (12). The number of synthetic samples to generate, as discussed in Section IV-B3, is then N_synthetic,t = N_samples − N_real,t.
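The two tube-guided operations above can be sketched as follows; the small-angle extrinsics perturbation and the box approximation of Z are our illustrative choices, not the paper's exact parameters:

```python
import numpy as np

def camera_pose(T_IB, T_BC_nominal, rot_noise=0.01, rng=None):
    """Compose T_IC = T_IB(x+) @ T_BC, perturbing the nominal extrinsics
    with a small random rotation (first-order approximation, valid for
    small angles) to emulate calibration error."""
    rng = rng or np.random.default_rng()
    w = rng.normal(0.0, rot_noise, 3)            # small-angle vector
    W = np.array([[0, -w[2], w[1]],
                  [w[2], 0, -w[0]],
                  [-w[1], w[0], 0]])             # skew-symmetric matrix
    T_BC = T_BC_nominal.copy()
    T_BC[:3, :3] = T_BC[:3, :3] @ (np.eye(3) + W)
    return T_IB @ T_BC

def real_obs_in_tube(db_states, x_bar, Z_half):
    """Indices of database observations whose estimated state lies inside
    the tube cross-section x_bar ⊕ Z (box approximation)."""
    return np.where(np.all(np.abs(db_states - x_bar) <= Z_half, axis=1))[0]
```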

5) Robustification to Visual Changes:
To accommodate changes in brightness and environment, we apply several transformations to both real and synthetic images. These include solarization; adjustments of sharpness, brightness, and gamma; additive Gaussian noise; Gaussian blur; and the erasure of patches of pixels using a rectangular mask.
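A minimal sketch of such a pipeline with torchvision; all magnitudes and probabilities are illustrative, not the values used in our experiments:

```python
import torch
from torchvision import transforms as T
from torchvision.transforms import functional as F

# Robustification pipeline (Section IV-B5); operates on float image
# tensors in [0, 1] with shape (C, H, W).
augment = T.Compose([
    T.RandomSolarize(threshold=0.9, p=0.2),
    T.RandomAdjustSharpness(sharpness_factor=2.0, p=0.3),
    T.ColorJitter(brightness=0.3),                        # brightness jitter
    T.Lambda(lambda im: F.adjust_gamma(                   # random gamma
        im, gamma=float(torch.empty(1).uniform_(0.8, 1.2)))),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    T.Lambda(lambda im: (im + 0.02 * torch.randn_like(im)).clamp(0, 1)),
    T.RandomErasing(p=0.5, scale=(0.02, 0.1)),            # erase pixel patches
])
```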

V. APPLICATION TO VISION-BASED FLIGHT
In this section, we provide details on the design of the expert, the student policy, and the NeRF for agile flight.

Task: We apply our framework to learn to track a figure-eight trajectory (lemniscate, velocity up to 3.15 m/s, duration 30 s), denoted T1.

Robot model. The expert uses the hover-linearized model of a multirotor [29], with state x_t (n_x = 8) comprising position, velocity, and roll/pitch, where I is an inertial reference frame and Ī a yaw-fixed frame [29]. The control input u_t (n_u = 3) is the desired roll, pitch, and thrust, and it is executed by a cascaded attitude controller.

Measurements. The multirotor is equipped with a fisheye monocular camera, tilted 45 deg, generating images I_t (size 128 × 96 pixels). In addition, we assume noisy onboard measurements of altitude I_p^m_z ∈ R, velocity I_v^m ∈ R^3, and roll Ī_φ^m, pitch Ī_ϑ^m, and yaw. This is a common setup in aerial robotics, where noisy altitude and velocity can be obtained, for example, via optical flow and a downward-facing lidar, while roll, pitch, and yaw can be computed from an IMU with a magnetometer, using a complementary filter [30].

Student policy. The student policy (1), shown in Fig. 3, takes as input an image I_t from the onboard camera, the reference trajectory X^des_t, and o_other := [I_p^m_z, I_v^m^T, Ī_φ^m, Ī_ϑ^m]^T, and it outputs u_t. A SqueezeNet [31] is used to map I_t into a lower-dimensional feature space; it was selected for its performance at a low computational cost. To promote learning of internal features relevant to estimating the robot's state, the output of the policy is augmented to predict the current state x (or x⁺ for the augmented data), modifying the training loss accordingly. This output is not used at deployment time, but it was found to improve performance.

Output feedback RTMPC and observer: The expert uses the defined robot model for predictions, discretized with T_s = 0.1 s and horizon N = 30 (3.0 s). X encodes safety and position limits, while U captures the maximum actuation capabilities. Process uncertainty in W_T is assumed to be a bounded external force disturbance with magnitude between 10% and 19% of the weight of the robot and random direction, close to the physical limits of the platform, and the tube of the expert is computed assuming W̄_T equal to 20% of the weight of the robot. The state estimator (5) is designed using the measurement model ō_t = x_t + v_t in (4), where we assume v_t ∼ N(0, Σ_v), with 3σ_cam = [0.6, 0.6]^T (units in m) and 3σ_other = [0.4, 0.2, 0.2, 0.2, 0.05, 0.05]^T (units in m for altitude, m/s for velocity, and rad for tilt). These conservative but realistic parameters are based on prior knowledge of the worst-case performance of vision-based estimators in our relatively feature-poor flight space. The observer gain matrix L is computed by assuming fast state estimation dynamics (poles of A − LC at 30.0 rad/s).
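A minimal sketch of the student policy architecture described above (PyTorch/torchvision); dimensions and head sizes are illustrative, not the deployed network:

```python
import torch
import torch.nn as nn
from torchvision.models import squeezenet1_1

class StudentPolicy(nn.Module):
    """SqueezeNet image encoder, an action head, and an auxiliary
    state-prediction head used only at training time."""
    def __init__(self, n_other=6, n_ref=8 * 31, n_u=3, n_x=8):
        super().__init__()
        self.backbone = squeezenet1_1(weights=None).features  # 512-ch output
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.action_head = nn.Sequential(
            nn.Linear(512 + n_other + n_ref, 256), nn.ReLU(),
            nn.Linear(256, n_u))
        self.state_head = nn.Linear(512 + n_other + n_ref, n_x)

    def forward(self, image, o_other, x_des):
        z = self.pool(self.backbone(image)).flatten(1)   # image features
        feat = torch.cat([z, o_other, x_des], dim=-1)
        return self.action_head(feat), self.state_head(feat)  # u_t, aux state
```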
1) Procedure to Generate the NeRF of the Environment: (i) Dataset: A NeRF of the environment, the MIT-Highbay, is generated from about 100 images collected during a single real-world demonstration of the figure-eight trajectory (T1) intended for learning, utilizing full-resolution images (640 × 480 pixels) from the fisheye camera onboard the Qualcomm Snapdragon Flight Pro board of our UAV. (ii) Extrinsics/Intrinsics: The extrinsic and intrinsic parameters of the camera are estimated from the dataset using structure-from-motion (COLMAP [32], RADTAN camera model). (iii) Frame Alignment: The scale and homogeneous transformation aligning the reference frame used by COLMAP with the reference frame used by the robot's state estimator are determined via the trajectory alignment tool EVO [33]. This enables the integration of the NeRF as an image rendering tool in a simulation/DA framework. (iv) NeRF Training: Instant-NGP [23] is utilized to train the NeRF. The scaling of the Axis-Aligned Bounding Box used by Instant-NGP is manually adjusted to ensure that the reconstruction is photorealistic in the largest possible volume. (v) Image rendering for DA: Novel images are rendered using the same camera intrinsics identified by COLMAP. The camera extrinsics, mapping from the robot's IMU to the optical frame, are determined via Kalibr [34], using an ad-hoc dataset. An example of an image from the NeRF is shown in Fig. 1.
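Step (iii) amounts to applying a Sim(3) correction to every COLMAP pose; a minimal sketch, assuming the scale and rigid transform have already been recovered with a tool such as EVO:

```python
import numpy as np

def align_colmap_pose(T_colmap, s, R_align, t_align):
    """Map a COLMAP camera-to-world pose (4x4) into the robot's world
    frame: positions are scaled, rotated, and shifted; orientations
    are rotated (scale leaves rotations unchanged)."""
    T = T_colmap.copy()
    T[:3, 3] = s * (R_align @ T[:3, 3]) + t_align   # camera center
    T[:3, :3] = R_align @ T[:3, :3]                 # camera orientation
    return T
```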

VI. RESULTS

A. Evaluation in Simulation
Here we numerically evaluate the efficiency (training time, number of demonstrations), robustness (average episode length before violating a state constraint, success rate), and performance (expert gap, the relative error between the stage cost of the expert and that of the policy) of Tube-NeRF. Note that the number of demonstrations is a useful metric not only for estimating real-world data collection effort, but also for the number of environment interactions required in simulation which, depending on the simulation environment considered, may be computationally costly (e.g., in fluid-dynamic simulations). We use PyBullet to simulate realistic full nonlinear multirotor dynamics [29], rendering images using the NeRF obtained in Section V-A1; combined with realistic dynamics, the NeRF provides a convenient framework for training and numerical evaluation of policies. The considered task consists of following the figure-eight trajectory T1 (lemniscate, length: 300 steps) used in Section V-A1, starting from x_0 ∼ Uniform(−0.1, 0.1)^8, without violating state constraints. The policies are deployed in two target environments: one with sensing noise affecting the measurements o_other (the noise is Gaussian with parameters as defined in Section V), and one that additionally presents wind disturbances sampled from W_T (also with bounds as defined in Section V).

Fig. 4. Episode length (timesteps before a state constraint violation, up to 300) vs. the number of demonstrations collected from the expert, and vs. the training time (the time required to collect such demonstrations in simulation and to train the policy). Policies trained with Tube-NeRF (TN) achieve full episode length after a single demonstration and require less than half the training time of the best-performing baselines (DR-based methods). Note that the lines of Tube-NeRF-based approaches vs. the number of demonstrations overlap. Shaded areas are 95% confidence intervals. To focus the study on the effects of process uncertainties and sensing noise, we apply neither visual changes to the environment nor the robustification to visual changes (Section IV-B5). Evaluations across 10 seeds, 10 times per seed.

Method and baselines. We apply Tube-NeRF to BC and DAgger, comparing their performance without any DA; Tube-NeRF-N_samples, with N_samples = {50, 100}, denotes the number of observation-action samples generated for every timestep by uniformly sampling states inside the tube. We additionally combine BC and DAgger with DR by applying, during demonstration collection, an external force disturbance sampled from W_T. We set β, the hyperparameter of DAgger controlling the probability of using actions from the expert instead of the learner policy, to β = 1 for the first set of demonstrations, and β = 0 otherwise.

Evaluation details. For every method, we: (i) collect K new demonstrations (K = 1 for Tube-NeRF, K = 10 otherwise) via the output feedback RTMPC expert and the state estimator; (ii) update a student policy using all the demonstrations collected so far; (iii) evaluate the obtained policy in the considered target environments, 10 times each, starting from different initial states; (iv) repeat from (i). Note that in our comparison the environment steps at its highest possible rate (simulation time is faster than wall-clock time), providing an advantage in terms of data collection time to those methods that require collecting a large number of demonstrations (our baselines).

Results. Fig. 4 highlights that all Tube-NeRF methods, combined with either DAgger or BC, achieve complete robustness (full episode length) under combined sensing and process uncertainties after a single demonstration. The baseline approaches require 20-30 demonstrations to achieve a full episode length in the environment without wind, and the best-performing baselines (methods with DR) require about 80 demonstrations to achieve their top episode length in the more challenging environment. Similarly, Tube-NeRF requires less than half the training time of DR-based methods to achieve higher robustness in this more challenging environment, and reducing the number of samples (e.g., Tube-NeRF-50) can further improve the training time. The time to generate the NeRF, not shown in Fig. 4, was approximately 5 minutes (20000 epochs on an RTX 3090 GPU). Even accounting for this time, Tube-NeRF is significantly faster than collecting real-world demonstrations (the 80 demonstrations required by DR correspond to 40 minutes of real-world data, followed by the time to train the policy). In addition, if the NeRF is combined with a simulation of the dynamics of the robot (creating a photo-realistic simulator), our DA strategy still provides benefits in terms of performance and training time. Table II additionally highlights the small gap of Tube-NeRF policies from the expert in terms of tracking performance (expert gap), and shows that increasing the number of samples (e.g., Tube-NeRF-100) benefits performance.

B. Flight Experiments
We now validate the data efficiency of Tube-NeRF highlighted in our numerical analysis by evaluating the obtained policies in real-world experiments. We do so by deploying them on an NVIDIA Jetson TX2 (at up to 200 Hz, via TensorRT) on the MIT-ACL multirotor. The policies take as input the fisheye images generated at 30 or 60 Hz by the onboard camera. The altitude, velocity, and roll/pitch inputs that constitute o_other,t are, for simplicity, obtained from the onboard estimator (a filter fusing IMU data with poses from a motion capture system), corrupted with additive noise (zero-mean Gaussian, with parameters as defined in Section V) in the scenarios denoted as "high noise". We remark that no information on the horizontal position of the UAV is provided to the policy, and horizontal localization must be performed from images. We consider two tasks: tracking the lemniscate trajectory (T1), and tracking a new circular trajectory (denoted T2, velocity up to 2.0 m/s, duration 30 s). No real-world images were collected for T2; this task therefore stress-tests the novel-view synthesis abilities of the approach, using the NeRF and the simulated nonlinear robot dynamics as a simulation framework.

Training. We train one policy for each task, using a single task demonstration collected with DAgger+Tube-NeRF-100 in our NeRF-based simulated environment. During DA, we aim for an equal split between synthetic images (from the NeRF) and real ones (from the database), setting ε̄ = 0.5. Fig. 7 reports the number of real images sampled from the database, highlighting that the tube is useful for guiding the selection of real images, but that synthetic images are a key part of the DA strategy, as T2 presents multiple segments without any real image available.

Performance under uncertainties. Fig. 5 and Table III show the trajectory tracking performance of the learned policy under a variety of real-world uncertainties. These include (i) model errors, such as poorly known drag and thrust-to-voltage mappings; (ii) wind disturbances, applied via a leaf blower; (iii) sensing uncertainties (additive Gaussian noise on the partial state measurements); and (iv) visual uncertainties, produced by attaching a slung load that repeatedly enters the field of view of the camera, as shown in Fig. 6. These results highlight that (a) policies trained after a single demonstration collected in our NeRF-based simulator using Tube-NeRF are robust to a variety of uncertainties, maintaining tracking errors comparable to those of the expert (Table III, Fig. 2) while reaching velocities up to 3.5 m/s, even though the expert localizes using a motion capture system while the policy uses images from the onboard camera to obtain its horizontal position. In addition, (b) our method enables learning of vision-based policies for which no real-world task demonstration has been collected, effectively acting as a simulation framework, as shown by the successful tracking of T2, which relied entirely on synthetic training data for large portions of the trajectory (Fig. 7) and was obtained using a single demonstration in the NeRF-based simulator. Due to the limited robustness achieved in simulation (Table II), we do not deploy the baselines on the real robot.

Efficiency at deployment and latency. Table IV shows that, onboard, the policy requires on average only 1.5 ms to compute a new action from an image, being at least 5.6× faster than a highly optimized (C/C++) implementation of the expert.
Fig. 7. Number of real images sampled from the tube to perform data augmentation using Tube-NeRF-100, as a function of time along the considered trajectory. The considered trajectories are a lemniscate (T1), the same as the one executed for real-world data collection, and a circle (T2), different from the one executed for data collection. The results highlight that (1) the tube can be used to guide the selection of real-world images for data augmentation and (2) the synthetic images from the NeRF are a key component of our data augmentation strategy, as there are multiple segments of the circular trajectory (T2) where no real images are present, yet the sensorimotor policy successfully controls the robot in the real world.

TABLE IV: TIME (MS) REQUIRED TO GENERATE A NEW ACTION
Note that the reported computational cost of the expert is based on the cost of control only (no state estimation); therefore, the actual computational cost reduction provided by the policy is even larger. Online, image capture (independent of our method, nominal light conditions) has a latency of about 15 ms, while image pre-processing and transfer to the TX2 takes less than 2 ms.

Sensitivity to uncertainties in the visual input. We study the closed-loop performance of the policy under different types of uncertainties in the visual input by monitoring the position tracking error (in simulation, tracking T1, under wind) as a function of different types and magnitudes of noise applied to the images rendered by the NeRF and input to the policy. We consider 1) Gaussian noise, representing the presence of high-frequency disturbances/uncertainties such as new visual features in the environment, and 2) Gaussian blur, capturing visual changes that reduce high-frequency features (e.g., weather changes, and/or the effects of using a low-quality NeRF for training). The results in Fig. 8 highlight that 1) the visual input plays a key role in the output of the policy and, while our approach is not specifically designed for environments with rapidly changing visual appearance, they also show 2) the overall robustness of the policy to visual uncertainties. Last, the comparison of the effects of the two types of noise highlights that 3) the policy has lower sensitivity to new high-frequency visual uncertainties, as the position errors grow more slowly when more Gaussian noise is applied (lower PSNR) than for Gaussian blur.
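A minimal sketch of the image perturbations and the PSNR metric used in this sweep; the magnitudes are the sweep variables, not fixed values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def psnr(clean, noisy, peak=1.0):
    """Peak Signal-to-Noise Ratio between two images in [0, peak]."""
    mse = np.mean((clean - noisy) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def perturb(image, sigma_noise=0.0, sigma_blur=0.0, seed=0):
    """Apply the two perturbations studied above to an HxWx3 image:
    additive Gaussian noise and spatial Gaussian blur."""
    rng = np.random.default_rng(seed)
    out = image + sigma_noise * rng.standard_normal(image.shape)
    if sigma_blur > 0:
        out = gaussian_filter(out, sigma=(sigma_blur, sigma_blur, 0))
    return np.clip(out, 0.0, 1.0)
```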

VII. CONCLUSIONS
We presented Tube-NeRF, an efficient IL method for robust end-to-end visuomotor policies that achieve real-world robustness in trajectory tracking from images on a multirotor. Tube-NeRF leveraged output feedback RTMPC to collect demonstrations that account for process and sensing uncertainties. In addition, properties of the controller guided an efficient DA procedure that used a combination of a database of real-world images, a NeRF of the environment, and randomization procedures in image space to obtain novel relevant views. Future work will generalize the approach to dynamic environments by further randomization in image/NeRF space, and use event cameras to ensure performance in poorly-lit environments.

Fig. 2. Output feedback RTMPC generates and tracks a safe reference to satisfy constraints.

Fig. 5. Qualitative evaluation in experiments, highlighting the high velocity and the challenging 3D motion that the student policy can execute under uncertainties. (a) T1, student w/o wind. (b) T1, student w/ wind. (c) T2, student w/o wind. (d) T2, student w/ wind.

Fig. 6. Robustness to visual uncertainties: we show images captured by the onboard camera while using the proposed sensorimotor policy for localization and tracking of T1, with a slung load (tape roll, 0.2 kg) attached to the robot. The slung load (circled) repeatedly enters the field of view of the onboard camera, without however compromising the success of the maneuver. We hypothesize that randomly deleting patches of pixels during training (Section IV-B5) contributes to achieving robustness to this disturbance.

Fig. 8. Position error under wind for T1 in simulation as a function of different types and magnitudes of noise in the visual input. PSNR is the Peak Signal-to-Noise Ratio; lower PSNR denotes a larger amount of noise corrupting the image. The maximum noise level shown corresponds to a 0% success rate. Contour lines in the Comparison plot show the density of the errors.

TABLE II: ROBUSTNESS, PERFORMANCE, AND DEMONSTRATION EFFICIENCY OF IMITATION LEARNING (IL) METHODS FOR VISUOMOTOR POLICY LEARNING

TABLE III: POSITION ROOT MEAN SQUARED (RMS) TRACKING ERRORS