Learning Fast and Precise Pixel-to-Torque Control

In the field, robots often need to operate in unknown and unstructured environments, where accurate sensing and state estimation (SE) become a major challenge. Cameras have been used to great success in mapping and planning in such environments, as well as complex but quasi-static tasks such as grasping, but are rarely integrated into the control loop for unstable systems. Learning pixel-to-torque control promises to allow robots to flexibly handle a wider variety of tasks. Although it does not present additional theoretical obstacles, learning pixel-to-torque control for unstable systems that require precise and high-bandwidth control still poses a significant practical challenge, and best practices have not yet been established. To help drive reproducible research on the practical aspects of learning pixel-to-torque control, we propose a platform that can flexibly represent the entire process, from lab to deployment, for learning pixel-to-torque control on a robot with fast, unstable dynamics: the vision-based Furuta pendulum. The platform can be reproduced with either off-the-shelf or custom-built hardware. We expect that this platform will allow researchers to quickly and systematically test different approaches, as well as reproduce and benchmark case studies from other labs. We also present a first case study on this system using DNNs which, to the best of our knowledge, is the first demonstration of learning pixel-to-torque control on an unstable system with update rates faster than 100 Hz. A video synopsis can be found online at https://youtu.be/S2llScfG-8E, and in the supplementary material.


I. INTRODUCTION
In the field, robots often need to operate in unknown and unstructured environments, where accurate sensing and state estimation (SE) become a major challenge. Cameras have been used to great success in mapping and planning in such environments [1], as well as complex but quasi-static tasks such as grasping [2], but are rarely integrated into the control loop for unstable systems. Learning pixel-to-torque control promises to allow robots to flexibly handle a wider variety of tasks. Although it does not present additional theoretical obstacles, learning pixel-to-torque control for unstable systems that require precise and high-bandwidth control still poses a significant practical challenge, and best practices have not yet been established. Part of the reason is that many of the most auspicious tools, such as deep neural networks (DNN), are opaque: the cause of success on one system is difficult to interpret and generalize.
The machine learning community has alleviated this problem by establishing standard data sets and standardized simulation environments that allow different approaches to be easily benchmarked against each other. This trend is not well established in the robotics community, as there are many more hurdles to reproduce a system in hardware than purely in simulation. To help drive reproducible research on the practical aspects of learning pixel-to-torque control, we propose a platform that can flexibly represent the entire process, from lab to deployment, for learning pixel-to-torque control on a robot with fast, unstable dynamics: the vision-based Furuta pendulum. The platform, shown in Figure 1 and detailed in "Reproducible Platform", can be reproduced with either off-the-shelf or custom-built hardware. We expect that this platform will allow researchers to quickly and systematically test different approaches, as well as reproduce and benchmark case studies from other labs.
We also present a first case study on this system using DNNs which, to the best of our knowledge, is the first demonstration of learning pixel-to-torque control on an unstable system with update rates faster than 100 Hz. A video synopsis can be found online at https://youtu.be/S2llScfG-8E, and in the supplementary material.

II. RELATED WORK
DNNs combined with RL have had tremendous success recently in a variety of robotics applications [2]-[5], though many open challenges remain [6]. One of the most critical challenges in all these approaches is sample efficiency, or rather, sample inefficiency. In most cases, learning is done in simulation only, which adds the challenge of transferring the learned policy from simulation to the actual hardware. To make this transfer successfully, a lot of effort is typically put into modeling and system identification [4], [7], such that the gap between simulation and reality is 'small' in some sense. For certain dynamics, such as turbulent flows or soft matter [8], [9], accurate models are unavailable or prohibitive to simulate. Overcoming the sample-efficiency challenge would not only allow learning to be leveraged on these systems, but also alleviate the reality gap in general: policies can be refined on the real hardware after initial training in a low-fidelity simulation. Indeed, Lee et al. [3] point out that, while they needed a high-fidelity model of the robot dynamics to train a DNN policy, they could successfully make the transfer from simulation to reality using only low-fidelity terrain models.
One of the key concepts they leverage is privileged information: ground-truth data is often available during training, and can be leveraged to substantially reduce training time. Chen et al. [10] coined this term, and use it to improve imitation learning by first training an autonomous driving policy with a birds-eye view of the environment, then using its evaluations as training examples for the final policy, which only has access to a regular car-mounted camera as input. Lee et al. [3] use the same concept to infer terrain properties from a history of proprioceptive data and thus avoid the need for external sensing entirely. In both these studies, learning is done in simulation, and privileged information can be directly accessed from the simulator. Privileged information is also available when learning directly in hardware, especially if it takes place in a controlled lab setting. For example, Srinivasan et al. [11] learn accurate SE for a racing car from only IMU and wheel encoder readings. Training targets are generated with a mixed Kalman filter that has access to two additional velocity sensors, which are very accurate but also expensive. Previously, Levine et al. [12] used this concept to learn to estimate the position of a target object from images, using supervised learning. During this phase, the object is placed in the robot's gripper, so the robot can directly estimate its position relative to the camera through joint-position measurements and forward kinematics.
Levine et al. [12] also report significantly better performance with full end-to-end learning, that is, training a single DNN for both the state estimator and controller. However, separating SE and control also has benefits, such as improved sample efficiency and more targeted development. For example, Srinivasan et al. [11] rely on existing methods for perception, mapping, and control, and focus on learning a convolutional DNN specifically to estimate velocities, which can be challenging for a Kalman filter during aggressive maneuvers. Hoeller et al. [13] implement a full, highly modular learning pipeline, which separately tackles state representation and motion planning. This pipeline is trained to high performance with remarkable sample efficiency, requiring on the order of 70'000 depth images and 17 hours' worth of trajectories, using a mixture of simulated and real-world data.
Despite recent successes in learning vision-based controllers for grasping [2], [14], [15], learning pixel-to-torque control remains elusive, especially for fast, unstable systems. An important difference is that for tasks such as grasping, a conventional low-level controller can be relied on to stabilize the dynamics. Instead, the challenge is to generate appropriate desired kinematics such as grasping positions [2], [15], or kinematic trajectories, often called primitives [14], [15]. In other words, learning is used for planning, rather than for control.
Since torque control is usually required when systems have fast and unstable dynamics, high control bandwidth is often a concern when learning pixel-to-torque control. This need for fast and precise feedback is a key characteristic of the proposed platform, which distinguishes it from more common platforms for research on vision-based learning.
Lambert et al. [16] use DNNs to learn a predictive low-level controller of a hovering quadcopter using onboard sensors as input, and manage to obtain stable hovering for several seconds at a time after training on only a few minutes of data. Despite running a relatively simple DNN architecture on a powerful, offboard GPU, evaluation time is the bottleneck: obtaining sufficiently long prediction horizons requires multiple evaluations of the DNN, which limits their control bandwidth to 50 Hz. This bottleneck is greatly exacerbated when vision is used for feedback, since the high dimensions and complexity of vision typically require larger and more sophisticated DNNs, which are consequently slower to evaluate. For example, Mattner et al. [17] use an auto-encoder architecture to learn pixel-to-torque control for balancing an inverted pendulum with minimal domain knowledge. The entire DNN size is kept manageable in a number of ways, including down-sampling the input image to 40×30 pixels and limiting the output torque to only three values. Nonetheless, the control bandwidth is limited to roughly 10 Hz. Fast evaluation becomes even more challenging for autonomous robots that rely on onboard computation, since more powerful computers add strain to a limited payload and battery supply. To run vision-based control fully onboard a small quadcopter, Kaufmann et al. [18] only learn a DNN for part of the control pipeline, which runs at a lower frequency of 10 Hz. Similar to the grasping examples above, stability is maintained by a conventional controller running at a higher frequency, in this case a minimum-jerk planner. To learn the full pixel-to-torque control pipeline of fast systems, the trade-off between the learned SE's precision and its evaluation speed takes a central role in designing the learning pipeline, as we will explore in our case study.

III. LEARNING PIPELINE
We found the two central criteria for designing the learning pipeline to be sample efficiency, and simultaneously fulfilling the precision and control bandwidth requirements. For the Furuta pendulum used in this study, the minimum control frequency translates to a time budget for the entire vision-based control of roughly 8 ms (see "Reproducible Platform"). We found it essential to split up the learning pipeline into four steps (see Figure 2): in step A, online RL of a control policy using privileged information, in step B, policy analysis and sample collection using privileged information, in step C, offline supervised learning of the SE, and finally in step D, online adaptation learning of the control policy to the SE.
We rely on privileged information in the form of rich and accurate state measurements, which are often available in the lab setting via external sensing such as motion capture. We also rely on a means of automating sample collection with a specified distribution. In the case of the Furuta pendulum, state measurements are readily available from the joint encoders, and samples can be easily gathered using a standard combination of energy-pumping and LQR controllers. An important benefit of automating sample collection is that it makes it possible to quickly and easily collect new data sets. As is always the case, development is an iterative process, and speeding up this process is critical yet seldom discussed in the literature [7].

A. Learning the Control Policy
To focus on a reliable and sample-efficient training process, we train a Proximal Policy Optimization (PPO) [24] RL agent.

Reproducible Platform
To foster research on learning pixel-to-torque control, an ideal platform should represent the entire learning pipeline while also allowing different challenges to be addressed independently. It should be simple to reproduce and allow fast and easy iterations as well as clear benchmarking. In order to push the boundary of learning and vision-based control in highly dynamic settings, we propose to combine the above requirements in a representative setup with high demands for precision and a control bandwidth above 100 Hz.

The Vision-based Furuta pendulum
The Furuta pendulum is a well-studied system that combines important challenges, being nonlinear, unstable, and underactuated [19]. A minimal state space can be described as $x = [\theta, \alpha, \dot{\theta}, \dot{\alpha}]^T$, where θ refers to the arm angle and α to the pendulum angle. The pendulum has a single control input u: the voltage applied to a DC motor actuating the arm joint θ. Access to its built-in encoders serves as a simple proxy for a controlled lab environment with external sensing, without the need for additional expensive equipment such as a motion capture setup. Well-established controllers, such as energy-based swing-up [20] and LQR controllers [21], provide a reliable and rigorous baseline to compare against. Because the required control frequency directly depends on the pendulum's natural frequency, it can be easily increased or decreased by simple modifications to the pendulum length. Any desktop-sized Furuta pendulum can thus be easily modified to have comparable dynamics to the one used in this study, including platforms commonly used in classrooms [22].

Hardware Details
We use the off-the-shelf Quanser Qube Servo 2 Furuta pendulum; as mentioned above, other Furuta pendulums with similar dynamics can be used instead. For our setup, a minimum sample frequency of 80 Hz is required to stabilize the pole with an LQR controller. From experiments, we found a sampling frequency of 120 Hz to be a good compromise and use this frequency for the case study presented. In our experience, the learning pipeline is not very sensitive to changes in system dynamics, but more sensitive to changes in the camera and lighting setup. We use a FLIR BlackFly 3 high-speed camera with a resolution of 0.3 MP and a sample rate of 522 Hz. We found a high frame rate to be important to avoid additional latency in the control cycle. To reduce the effect of ambient light, we mounted the camera and pendulum in a white box. We also use a dimmable light source with a maximal illuminance of 4600 lx/0.5 m, which can both provide a steady illumination and simulate transfer to other lighting conditions in a controlled manner. We ran experiments on a standard desktop computer with an Intel Xeon W-2123 CPU and an Nvidia Quadro P620 GPU, but also used a more powerful desktop to accelerate offline training (reported training time corresponds to the described computer). The hardware interface is based on code from [23] and contains a flexible setup based on OpenAI Gym and PyTorch, facilitating the reproducibility and reusability of all components in the proposed learning pipeline. Our entire setup, including off-the-shelf pendulum, camera, and desktop PC, was purchased for a total cost of less than 10'000 €.

Baseline Controllers & Automated Data Collection
For the swing-up baseline, we use the energy-pumping controller described in [20]. For balancing, we use a standard LQR controller. To automate data generation, these baseline controllers are modified with reference trajectories for the pendulum arm and a sinusoidal signal added to the control input to ensure sufficient coverage of the state space. This is critical because, although the dynamics are invariant to the arm angle θ, the vision-based SE is not. For details on the baseline controllers, choice of gains, and modifications, see Appendix B.

Instructions and Code Repositories
Detailed instructions on how to set up the hardware platform are provided at https://git.rwth-aachen.de/quanser-vision/vision-based-furuta-pendulum. All code needed to reproduce the pixel-to-torque case study presented here is provided at https://git.rwth-aachen.de/quanser-vision/furuta-pixel-to-torque-control.

Fig. 2: [figure not reproduced; recoverable labels: axis "Hours of interaction time"; step "Train state-to-torque control policy with reinforcement learning" (12); step "Analyze policy for convergence criteria, and collect labeled data"]
The agent is trained using privileged information as input. In the case of the Furuta platform, the agent learns to swing up and balance the pole in approximately 12 h of interaction time, which is equivalent to 8 h worth of samples gathered for learning and 4 h for resetting. The entire process is automated and could be run in a single session without any intervention.
To enable the agent to learn this task reliably, it was important to tune the reward function and to adjust hyperparameters based on knowledge about the system. We use a continuous reward function, which accelerates training by providing a reliable, steady increase in the accumulated reward. For the Furuta pendulum, we use a quadratic reward penalizing the angle positions of the pendulum. We train the agent with a small learning rate and a small clipping factor (compare Table I), which also helps to reliably increase the reward over training episodes. Agents with a large learning rate learned the swing-up task more quickly, but were not able to learn to balance the pendulum reliably: they were susceptible to 'fatal forgetting', or sudden large drops in reward. We surmise this is because balancing requires very precise control inputs, and therefore also a smaller learning rate.
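The role of the clipping factor can be illustrated with the standard PPO clipped surrogate objective from [24]. The following is a minimal per-sample sketch, not the training code used in this study; the function name and the numeric values (including both clipping factors) are hypothetical and are not taken from Table I.

```python
def clipped_surrogate(ratio, advantage, clip_eps):
    """Standard PPO clipped surrogate objective for a single sample.

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- estimated advantage of the sampled action
    clip_eps  -- clipping factor (hypothetical values below)
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# For a positive advantage, the objective stops growing once the ratio
# exceeds 1 + clip_eps, so a smaller clipping factor caps the incentive
# to move far away from the previous policy in a single update.
small_clip = clipped_surrogate(ratio=1.5, advantage=2.0, clip_eps=0.1)
large_clip = clipped_surrogate(ratio=1.5, advantage=2.0, clip_eps=0.3)
```

A smaller clipping factor therefore bounds each policy update more tightly, which is consistent with the steadier reward increase reported above.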

B. Policy Analysis and Data Collection
Based on the control policy trained on privileged information, we empirically identify minimum precision requirements by injecting noise into the state until the task can no longer be fulfilled. This threshold is then used as the convergence criterion for training the SE in step III-C. For the Furuta pendulum, we add zero-mean Gaussian noise on the angles, and propagate it via finite differences to the angular velocities. At a sampling frequency of 120 Hz, the agent can tolerate noise with a standard deviation < 1°. We also noticed that this level of precision is only necessary to balance the pendulum near the equilibrium point; the policy is able to swing up the pendulum even with higher noise. Based on this observation, we separately collect data for the swing-up portion of the task, and the balancing portion (see "Reproducible Platform"). The convergence criterion is then only tested on images relevant for balancing, which we heuristically determined as |α| < 10°. As we will see in Section III-C, converging to high precision over the entire state space not only requires more training time, but also a larger DNN.
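The noise injection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, argument layout, and seed are hypothetical, and the velocities are assumed to be recomputed from consecutive noisy angles.

```python
import math
import random


def noisy_state(theta, alpha, prev_noisy_angles, dt, sigma_deg, rng):
    """Perturb the privileged state before passing it to the policy.

    Zero-mean Gaussian noise is added to both angles (in radians), and the
    velocities are recomputed by finite differences of the noisy angles, so
    the angle noise propagates into the velocity entries.
    """
    sigma = math.radians(sigma_deg)
    theta_n = theta + rng.gauss(0.0, sigma)
    alpha_n = alpha + rng.gauss(0.0, sigma)
    theta_dot = (theta_n - prev_noisy_angles[0]) / dt
    alpha_dot = (alpha_n - prev_noisy_angles[1]) / dt
    return (theta_n, alpha_n, theta_dot, alpha_dot)


rng = random.Random(0)   # hypothetical seed, for repeatability only
dt = 1.0 / 120.0         # sampling period at 120 Hz
state = noisy_state(0.0, 0.0, prev_noisy_angles=(0.0, 0.0),
                    dt=dt, sigma_deg=1.0, rng=rng)
```

If the noise on consecutive angle samples is independent with standard deviation σ, the differentiated velocity has a noise standard deviation of √2·σ/Δt, which is roughly 170 °/s for σ = 1° at 120 Hz. This illustrates why even small angle errors matter for balancing.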

C. Learning Precise State Estimation
Precise predictions require an unknown minimum network capacity, which makes it difficult to reduce execution time on limited computational resources. We balance this trade-off with a deliberate choice of the DNN architecture, a biased data set, and data augmentation methods.
To increase precision, we simplify the learning task and train a DNN using standard convolutional layers to estimate only the pose from a single image. Velocities are then computed from a buffer of previously estimated positions and velocities via finite differences and a first-order low pass filter (see Figure 3). This structure reduces the SE's prediction error to roughly a fifth compared to a recurrent neural network architecture similar in size, which we speculate is due to the freed capacity being available for higher accuracy on a simpler task. Alternatively, velocities could be estimated by using a history of images as input, but again this would significantly increase the network size, which we need to reduce as much as possible.
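The velocity computation described above can be sketched as a finite difference followed by a first-order low-pass filter. This is a minimal sketch, assuming an exponential-smoothing form of the filter; the class name and the smoothing coefficient are hypothetical, and the actual filter parameters are not given in the text.

```python
import math


class VelocityEstimator:
    """Finite-difference velocity with a first-order low-pass filter."""

    def __init__(self, dt, smoothing):
        self.dt = dt
        self.smoothing = smoothing   # hypothetical coefficient in [0, 1)
        self.prev_angle = None
        self.prev_velocity = 0.0

    def update(self, angle):
        """Return the filtered velocity for a new angle estimate (radians)."""
        if self.prev_angle is None:
            # No history yet: store the angle and report zero velocity.
            self.prev_angle = angle
            return 0.0
        # Wrap the difference to [-pi, pi] so a jump across +-180 degrees
        # is not mistaken for a very large velocity.
        delta = angle - self.prev_angle
        delta = math.atan2(math.sin(delta), math.cos(delta))
        raw = delta / self.dt
        velocity = (self.smoothing * self.prev_velocity
                    + (1.0 - self.smoothing) * raw)
        self.prev_angle = angle
        self.prev_velocity = velocity
        return velocity


# Usage: a pendulum rotating at a constant 1 rad/s, sampled at 120 Hz.
dt = 1.0 / 120.0
estimator = VelocityEstimator(dt=dt, smoothing=0.5)
velocity = 0.0
for i in range(60):
    velocity = estimator.update(i * dt)
```

A larger smoothing coefficient suppresses more of the differentiation noise but adds lag, which is the same precision-versus-latency trade-off discussed for the DNN itself.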
We also downsample the input image from 540×720 to 220×220 pixels, which allows the DNN depth to be increased; we found this was more important for precision than a higher image resolution. To compensate for the downsampling, we use a very small stride of 1 pixel per step. With a depth of 12 layers, the SE reaches a precision that is able to distinguish individual pixels.
Despite these measures, the limited network size makes it difficult for the DNN to converge to a low error everywhere. Precise state estimates are often not needed throughout the entire state space, and we can evaluate where the SE should be more precise based on the policy analysis conducted in step III-B. For the Furuta pendulum, we bias the training data set to be more densely sampled around the upper equilibrium point. An SE trained on a very biased data set can meet our convergence criteria after just four episodes of training. Due to its reliably low prediction error for small angles (compare Figure 4), the RL agent could also adapt much faster.
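One possible way to obtain such a biased data set from an existing pool of labeled samples is to keep every sample near the upright equilibrium and thin out the rest. This is only an illustration: in this study the swing-up and balancing data are collected separately, and the function names, the keep probability, and the sample format below are hypothetical. The 10° band is borrowed from the balancing criterion in step III-B, with α = 0 at the upright equilibrium.

```python
import math
import random


def keep_probability(alpha, band_deg=10.0, keep_outside=0.2):
    """Probability of keeping a sample with pendulum angle alpha (radians)."""
    if abs(math.degrees(alpha)) < band_deg:
        return 1.0           # always keep samples relevant for balancing
    return keep_outside      # hypothetical thinning factor elsewhere


def bias_dataset(samples, rng):
    """Return a subset that is denser around the upright equilibrium."""
    return [s for s in samples
            if rng.random() < keep_probability(s["alpha"])]


rng = random.Random(0)   # hypothetical seed, for repeatability only
samples = [{"alpha": rng.uniform(-math.pi, math.pi)} for _ in range(1000)]
biased = bias_dataset(samples, rng)
```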
To avoid overfitting to the training data set, and to increase the SE's robustness, we also apply data augmentation methods [25] during training. The input images are randomly zoomed, rotated, shifted, and modulated in brightness (compare Table II). While augmenting training data did not increase the accuracy on the validation data set, it substantially increased the prediction performance while testing on the hardware setup.

D. Policy Transfer
Although the state-to-torque policy does not perform well 'out of the box' with the DNN-based SE, we found that it can be quickly and easily transferred with additional training, without adjusting any learning hyperparameters. With an additional training time of approximately 30 min, the policy adapts to the new input and achieves performance comparable to the policy relying on privileged information. A typical run is shown in Figure 5, where the pendulum reaches the upright position in a single swing.

IV. DISCUSSION
Our case study demonstrates learning fast and precise pixel-to-torque control of an unstable system, with deep neural networks trained exclusively on real-world data. Although this task does not pose any theoretical obstacles, the practical challenges are substantial, and to the best of our knowledge, this is the first demonstration on a system requiring a control bandwidth of 100 Hz or higher. We hope this case study will serve as a starting point for further reproducible and comparable studies.
In addition to the usual challenges of using DNNs in control, such as sample efficiency and robustness, we found that achieving both high bandwidth and precision becomes a critical challenge for fast and unstable systems. This is especially the case when using function approximators such as neural networks. While large, sophisticated DNN architectures have tremendous representation power, these networks are not only slower to train but also slower to execute at runtime. The high bandwidth requirements severely limit the network size, putting speed and precision of the DNN in direct conflict. In fact, we found that making this trade-off was the dominating factor in designing the learning pipeline.
In order to achieve this while also keeping sample requirements reasonable, we resorted to separating the problems of control and state estimation, and enforcing a state representation based on first principles. Deep learning literature often advocates against a clear separation [5], [12] in favor of allowing the learning process to converge to a latent space representation, which may be more parsimonious and task-invariant. However, in our initial attempts using spatial autoencoders [5], we found that a DNN small enough to meet our execution time requirements was far from expressive enough to learn a meaningful representation. On a data set of more than 400'000 samples, the autoencoder could not detect features precise enough for the RL agent to noticeably increase the reward after an interaction time of over 24 hours. Learning control and SE separately allowed us to use sample-efficient algorithms, and more easily leverage domain knowledge and privileged information to systematically reduce the DNN size. Furthermore, for the Furuta system, we can directly compare our learning algorithms against conventional controllers as a clear baseline, which is particularly helpful when debugging unexpected learning outcomes.
One of the advantages we did not initially anticipate is that, by learning the control policy first, we could quantify the robustness of this policy to SE noise. This provides clear convergence criteria for supervised learning of the SE and is particularly helpful as a quantitative measure for balancing the trade-off in precision and bandwidth. To cope with limited representation power, we bias the SE training data to regions that require high precision, and compromise in the regions that do not. With the Furuta pendulum, these regions were simply chosen based on system knowledge, but this is not always straightforward to do for more complex systems. An alternative we find promising is to not only test the robustness of a standard control policy, but deliberately train robust control policies with a curriculum [3], for example with progressively noisier environments. Such a policy would further relax the requirements on the SE, which will be critical for more complex tasks or if even higher bandwidth is required. To this end, an important direction of research is to quantify the robustness of a control policy [26].
In the presented case study, we put minimal effort into making the SE robust to changes in the environment [27], and this is certainly an important topic for further studies. Nonetheless, we found the applied data augmentation methods were crucial to increase the SE's performance on the hardware system. To our surprise, the final policy was quite robust to lighting changes: the LED lamp could be dimmed by 30 % before the policy failed. Instead, we believe that the robustness of the learning pipeline itself is an important aspect that is rarely discussed in the literature. Indeed, we have so far reproduced these results multiple times ourselves, with mixed results. The entire pipeline was first developed with TensorFlow, then re-implemented in PyTorch essentially unchanged; while learning the SE worked out of the box, the RL agent typically converged to much more aggressive policies, and required additional parameter tuning and testing. After the pandemic started, the entire hardware setup was moved to Steffen's apartment; here, we were pleasantly surprised that the same pipeline, without any algorithmic changes or hyperparameter tuning, reproduced our results. However, once the setup was moved back to the lab, it took an unexpected three days of debugging, recollecting training samples, and training from scratch in order to once again reproduce these results. To better understand how to create reliable learning pipelines, reproducible studies, and reproductions of them, are sorely needed. We believe the vision-based Furuta platform we have presented is ideal for such studies. Not only does it capture important challenges for fast and unstable systems, it is simple enough that development iterations can be made quickly, and conventional controllers provide not only a clear baseline to compare against, but also a tool to debug and validate different parts of the learning pipeline. For example, we often determined whether to put more effort into step A or step C by comparing the DNN-based SE coupled with an LQR controller against the DNN-based policy with encoder readings. We believe the required effort to recreate the vision-based Furuta platform is reasonable, and we look forward to studies that reproduce, and improve on, the results we have presented.

Fig. 3: The architecture of the SE. Instead of predicting the angles directly, we train the DNN to predict [cos θ, sin θ, cos α, sin α]^T. We thereby ensure that samples that are very close to each other in the input space are also close in the output space by avoiding jumps for the angles around ±180°. During training, we apply neuron dropouts with a probability of 0.1 after every max-pooling layer and every fully connected layer to prevent overfitting to the training data set.

Fig. 4: Both data sets contain 336'000 images and were used to train the same DNN architecture. While the unbiased data set met the convergence criterion after 48 episodes, the biased data set converged after just four episodes. We also found that the RL agent could adapt much more quickly and more reliably to the SE trained on the biased data set, further reducing the interaction time on the hardware system.

Fig. 5: Swing-up and balance trajectory of the RL agent with the state estimator as input. Despite much larger errors during swing-up, the policy is able to reliably swing up with only one or two swings, which is faster than a typical run using a standard energy-pumping controller.
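The output encoding described in the caption of Fig. 3 can be illustrated with a short sketch. The function names are hypothetical; recovering the angles with `atan2` is an assumption about how the network output would be decoded, and it has the convenient property that the output does not need to be exactly normalized.

```python
import math


def encode(theta, alpha):
    """Training target: [cos θ, sin θ, cos α, sin α] for angles in radians."""
    return [math.cos(theta), math.sin(theta),
            math.cos(alpha), math.sin(alpha)]


def decode(output):
    """Recover (θ, α) in radians from a four-element network output."""
    cos_t, sin_t, cos_a, sin_a = output
    return math.atan2(sin_t, cos_t), math.atan2(sin_a, cos_a)


# Two pendulum angles on either side of the ±180° jump: far apart as raw
# numbers (358° difference), but close together in the encoded space.
just_below = encode(0.0, math.radians(179.0))
just_above = encode(0.0, math.radians(-179.0))
encoded_distance = math.dist(just_below, just_above)

theta_rec, alpha_rec = decode(encode(0.3, -2.0))
```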

B. Baseline controllers and data generation
The parameter values are specific to the Quanser Qube Servo 2 pendulum and may need to be adjusted for other Furuta pendulum systems. The control signal is saturated to stay within motor constraints by simple thresholding.
1) Swing-Up Control: As a baseline, we use the energy-based control law from [20], in which $\mu$ is a tunable gain, $E_0 = mgl$ is the potential energy of the pendulum at the upright equilibrium, with mass $m$ and length $l$, and $E$ is the current total energy. The sign operator in this law applies a bang-bang control signal, which results in better performance.
For data collection, the baseline controller is modified to sweep large areas of the state space. We add a PID controller $k_{\mathrm{PID},\theta}$ on the arm angle $\theta$ to follow a reference trajectory $\theta_{\mathrm{ref}}(t) = A_{\mathrm{data},1} \sin(f_{\mathrm{data},1} t)$:
$$u_{\mathrm{data},1} = u_{\text{swing-up}} + k_{\mathrm{PID},\theta}\,(\theta - \theta_{\mathrm{ref}}(t))$$
The PID parameters $k_p$, $k_i$ and $k_d$ and the trajectory parameters $A_{\mathrm{data},1}$ and $f_{\mathrm{data},1}$ are listed in Table III.
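The modified swing-up input for data collection, together with the saturation by thresholding mentioned at the start of this appendix, can be sketched as follows. This is a simplified sketch: only the proportional part of the PID term is included, the swing-up input is passed in as a given number, and all function names and numeric values are hypothetical rather than taken from Table III.

```python
import math


def saturate(u, u_max):
    """Keep the control signal within [-u_max, u_max] by thresholding."""
    return max(-u_max, min(u_max, u))


def theta_ref(t, amplitude, frequency):
    """Sinusoidal arm reference: amplitude * sin(frequency * t)."""
    return amplitude * math.sin(frequency * t)


def u_data_1(u_swing_up, theta, t, k_p, amplitude, frequency, u_max):
    """Swing-up input plus a proportional arm-tracking term, saturated.

    Follows the form u_swing-up + k (θ − θ_ref(t)) given above, with the
    integral and derivative parts of the PID controller omitted.
    """
    u = u_swing_up + k_p * (theta - theta_ref(t, amplitude, frequency))
    return saturate(u, u_max)


# Hypothetical values for illustration only.
u = u_data_1(u_swing_up=0.5, theta=0.1, t=0.0, k_p=-2.0,
             amplitude=1.0, frequency=0.5, u_max=5.0)
```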
2) Balancing Control: As a baseline, we use a Linear Quadratic Regulator (LQR) controller. To design the feedback gain $K$, we solve the Riccati equation for the linear dynamic model of the pendulum at the upright equilibrium $x_0 = [0, 0, 0, 0]^T$ and $u_0 = 0$. The system matrices of the linear model are provided by Quanser [28]. We found the chosen state weighting $Q$ together with $R = 1$ to be suitable for reliable and stable data generation trajectories.
To generate data around the upper equilibrium point, we perturb the controller with two input signals, where $u_{\mathrm{oscillation}}(t) = A_{\mathrm{data},1} \sin(f_{\mathrm{data},1} t)$ is a fast oscillation that serves to gather more samples in the region |α| < 15°, and $x_{\mathrm{ref}}(t) = [\theta_{\mathrm{ref}}(t), 0, 0, 0]^T$ is a slow oscillation with $\theta_{\mathrm{ref}}(t) = A_{\mathrm{data},2} \sin(f_{\mathrm{data},2} t)$ that serves to cover large areas of θ. All control and signal parameters are listed in Table III.
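One way the two perturbation signals could enter the balancing controller is sketched below. The combined control law is not reproduced in this text, so the form u = −K (x − x_ref(t)) + u_oscillation(t) is an assumption, as are the function name, the gain vector, and all numeric values.

```python
import math


def u_data_2(x, t, K, a_osc, f_osc, a_ref, f_ref):
    """Perturbed state feedback for data collection near the upright point.

    Assumed form: u = -K (x - x_ref(t)) + u_oscillation(t), where
    x = [θ, α, dθ/dt, dα/dt] and x_ref(t) = [θ_ref(t), 0, 0, 0].
    """
    x_ref = [a_ref * math.sin(f_ref * t), 0.0, 0.0, 0.0]   # slow arm sweep
    u_oscillation = a_osc * math.sin(f_osc * t)            # fast input dither
    feedback = -sum(k * (xi - ri) for k, xi, ri in zip(K, x, x_ref))
    return feedback + u_oscillation


# Hypothetical gain and signal parameters for illustration only.
K = [1.0, 2.0, 3.0, 4.0]
u = u_data_2(x=[0.1, 0.0, 0.0, 0.0], t=0.0, K=K,
             a_osc=0.5, f_osc=20.0, a_ref=1.0, f_ref=0.5)
```

In a real loop, the result would still be passed through the saturation described at the start of this appendix before being applied to the motor.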