I. Introduction
A core problem in learning control is to determine optimal feedback controllers for (partially) unknown nonlinear systems from experimental data. Reinforcement learning (RL) [1], [2] is a promising framework for this task, yet it often requires many experiments on the physical system before a suitable controller is found, which limits the applicability of such techniques. Considerable research effort has therefore been devoted to improving the data efficiency of RL, with the aim of learning controllers from as few experiments as possible. Recently, Bayesian optimization (BO) has been proposed as a promising approach to RL in this direction. BO employs a probabilistic description of the latent objective function (typically a Gaussian process (GP)), which allows the next control experiments to be selected in a principled manner, e.g., to maximize information gain [3] or to perform safe exploration [4].
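To make the BO loop concrete, the following is a minimal sketch (not the method of this paper or of [3], [4]): a GP surrogate is fitted to the costs of past control experiments, and the next controller parameters are chosen by optimizing a simple lower-confidence-bound acquisition function. The objective `rollout_cost`, the one-dimensional gain parameterization, and all hyperparameters are illustrative assumptions.

```python
# Minimal BO sketch: GP surrogate over controller parameters + LCB acquisition.
# All problem specifics (rollout_cost, search range, hyperparameters) are assumptions.
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between parameter sets A (n x d) and B (m x d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xs, noise=1e-3):
    """GP posterior mean and standard deviation at candidate points Xs."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(rbf_kernel(Xs, Xs).diagonal() - (v**2).sum(0), 1e-12, None)
    return mean, np.sqrt(var)

def rollout_cost(theta):
    """Placeholder for one experiment on the system: measured closed-loop cost."""
    return float((theta - 0.3) ** 2 + 0.05 * np.random.randn())

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 200)[:, None]   # 1-D controller gain, for illustration
X = rng.uniform(0.0, 1.0, (3, 1))                  # initial experiments
y = np.array([rollout_cost(t) for t in X.ravel()])

for _ in range(10):
    mean, std = gp_posterior(X, y, candidates)
    acq = -(mean - 2.0 * std)                      # lower-confidence bound (cost minimization)
    theta_next = candidates[np.argmax(acq)]        # next experiment to run
    X = np.vstack([X, theta_next])
    y = np.append(y, rollout_cost(theta_next[0]))

print("best gain:", X[np.argmin(y)].ravel(), "cost:", y.min())
```

In this sketch the acquisition function trades off exploiting low predicted cost against exploring uncertain parameter regions; information-gain or safety-aware criteria as in [3], [4] would replace the lower-confidence bound while leaving the surrounding loop unchanged.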