Transferring Online Reinforcement Learning for Electric Motor Control From Simulation to Real-World Experiments

Reinforcement learning (RL) based methods are an emerging approach for the control of power systems such as electric drives. These data-driven techniques do not need an explicit plant model, unlike most common state-of-the-art approaches. Instead, the control policy is continuously improved solely based on measurement feedback, pursuing optimal control performance through learning. While the general feasibility of RL-based drive control algorithms has already been proven in simulation, this work focuses on transferring the methodology to real-world experiments. In the case of electric motor control, a strict real-time requirement, safety constraints, system delays and the limitations of embedded hardware frameworks are hurdles to overcome. Hence, several modifications to the general RL training setup are introduced in order to enable RL in real-world electric drive control problems. In particular, a rapid control prototyping toolchain is introduced that allows fast and flexible testing of arbitrary RL algorithms. This simulation-to-experiment pipeline is considered an important intermediate step towards introducing RL in embedded control for power electronic systems. To highlight the potential of RL-based drive control, extensive experimental investigations addressing the current control of a permanent magnet synchronous motor utilizing a deep deterministic policy gradient algorithm have been conducted. Despite the early state of research in this domain, promising control performance could be achieved.


I. INTRODUCTION
Optimal electric motor control is of prime interest for various applications (e.g., automation and automotive engineering) that depend on high-performance drive systems. State-of-the-art motor control methods like linear quadratic regulators (LQR) [1], [2], model predictive control (MPC) [3], [4] or closed-form tuning of proportional-integral (PI) control [5], [6] require an accurate drive model for their design. While the latter is relatively more robust than MPC or LQR approaches due to its integral feedback, high-performance PI control still requires an exact model representation [7]. On the contrary, severe deviations between the real drive system and the drive model can occur for many reasons, such as production tolerances or operation-dependent changes in system behavior (e.g., temperature, magnetic saturation or wear-and-tear influences). In some cases a model might even be completely unknown, e.g., when a new motor is connected to a power electronic converter (self-commissioning).
In contrast, model-free reinforcement learning (RL) techniques do not require a mathematical motor model at all. RL motor controllers are completely data-driven and learn an optimal control policy directly from the drive's response. Also, secondary and parasitic effects like iron saturation, iron losses, the skin effect, or influences by nonlinear inverter behavior can be learned and directly compensated by RL control without requiring domain expert knowledge in this field. Moreover, many RL algorithms allow background planning [8], i.e., the control inference (evaluating a control policy function) is decoupled from the learning process (a policy update step). Compared to MPC as a planning-at-decision-time approach, this relaxes real-time requirements and allows more implementation flexibility since learning the control policy can be executed asynchronously to the control inference.

A. RELATED WORK
Recent publications on this topic have shown that RL approaches already reach standard control performance in simulation [9], [10]. In particular, [9] provides a basic proof of concept of the methodology in the motor control context, while [10] contributes to the development of an open-source drive system simulation toolbox using the OpenAI Gym standards [11] to test and train RL agents. Such a simulation-based training pipeline can be used to derive RL-based control in an offline fashion, i.e., based on (simplified) motor models. However, deploying an offline-learned RL agent on a real-world drive application leads to the same drawback of limited model accuracy as discussed with the state-of-the-art control approaches. Previous contributions have not investigated the online training of RL-based control using real-world motor drive feedback on a fully experimental basis.
The transfer of RL algorithms from simulation to reality causes several new challenges that have to be faced, as summarized in [12]. In the case of electric motor control, mostly real-time requirements, safety constraints, measurement noise and system delays are of interest. Although an offline, simulation-based pre-training can be utilized in order to speed up the online training on the real physical system [13], the initial control performance after the transfer is non-optimal if the simulation model is not accurately matching the real-world system behavior. As will be discussed in Sec. II, this model mismatch is a prominent problem in drive applications.
Popular RL examples such as AlphaGo [14] or other game-related approaches (e.g., [15]) do not face any real-time requirement. In drive control, however, the typical turnaround time ranges from 10 to 200 μs. Due to this real-time constraint, training carried out directly on the real-time hardware becomes infeasible. Hence, the control policy inference and the learning have to be decoupled and implemented on different time scales; a batched RL training [16] is necessary.
Safety constraints are another crucial point in motor control. For example, electric currents exceeding the limits of the drive might destroy it due to rapid overheating. RL algorithms do not consider constraints inherently. For instance, [17] and [18] face this issue by adding a safety layer correcting actions that violate constraints. Moreover, [19] forces the agent to learn the constraints during training by shaping the reward function, which penalizes policies exceeding the safety bounds.

FIGURE 1. Simplified schematic of the overall control and drive system structure; note that all gray-shaded parts are control-related, while from the RL agent's perspective both the coordinate transforms and the PWM are part of the environment, i.e., they are pre-/post-processing steps outside the RL agent's core software.
Furthermore, electric motor control systems contain multiple inherent forms of delays, e.g., the calculation time of the controller hardware or the modulation scheme of the power electronic converter [20]. These can be modeled as a one-step delay in the application of the agent's actions, as described in Sec. IV-C. Such delays slow down the learning process of RL agents significantly. To tackle a τ_d-step delay before actions take effect, [21] appends the last τ_d applied actions to the observation of the RL agent. Alternatively, [22] uses recurrent neural network agents and a special reward allocation to properly assign reward to past actions.
In summary, the overwhelming majority of investigations in the field of RL are based on simulations without any interaction to real-world physical systems [12]. Addressing and solving issues when transferring RL-based control approaches to real-world applications, specifically for the field of electric drive systems, is therefore an important object of research in order to be able to transfer data-driven control techniques into industrial processes in the long run.

B. CONTRIBUTION
In this work, the transfer from simplified offline simulation-based training to online training and inference on real motor drive systems is presented. A Python-based rapid control prototyping toolchain is developed that allows online training on a remote platform (edge computing) using measurements obtained from an embedded controller (cf. Fig. 1). Thereby, the training process is executed asynchronously in the background. This toolchain allows various RL algorithms to be rapidly tested and validated in the context of electric drive control without the necessity to implement the training process within the embedded software. Hence, only the control inference (policy evaluation) is required to be executed in real time on the embedded controller while the learning step (policy improvement) is decoupled. For demonstration purposes, a batched version of the deep deterministic policy gradient (DDPG) algorithm [23] is extended to learn the current control policy for a permanent magnet synchronous motor (PMSM) that is fed by a B6-bridge power electronic converter. Further innovations of this contribution handle the safety constraints and system delays in the case of RL motor control, including extensions to the baseline DDPG algorithm [23]. In addition, the impact of offline pre-training using a drive model is compared to randomly initialized DDPG agents. The functionality of the presented architecture from Fig. 1 is successfully tested on a laboratory electric drive test bench. The data-driven DDPG-based controller is compared against state-of-the-art linear field-oriented control and model predictive control approaches. The presented rapid control prototyping toolchain is an important intermediate step in order to accelerate data-driven control research in power systems. Although an actor-critic-based RL approach is depicted in Fig. 1, the rapid control prototyping toolchain can be directly applied to value-based RL techniques such as (double) deep Q-networks [24], [25], too. In particular, it allows many potentially interesting RL algorithms to be seamlessly plugged in and tested on a Python basis, avoiding cumbersome embedded software implementations of each and every algorithm. This is especially valuable because many contemporary RL algorithms are publicly available as open-source Python code (e.g., Stable-Baselines3 [26], TF-Agents [27] or Keras-RL [28]) and, thus, can be tested comparatively easily on a real, physical system.

C. PAPER STRUCTURE
First, in Sec. II a basic PMSM-based drive model is presented, explaining the fundamental behavior of the system to be controlled for non-motor-experts. Moreover, the same model is utilized within an MPC and a linear feedback approach for comparison purposes in the later part of the paper. In Sec. III the fundamental background of RL in terms of the DDPG algorithm is summarized, followed by extensions and modifications for batched online learning in Sec. IV. The utilized laboratory test platform is presented in Sec. V, followed by the experimental results in Sec. VI. Finally, a conclusion on the major findings and an outlook on future work in the field are given in Sec. VII.

II. DRIVE SYSTEM MODEL
The drive system (cf. Fig. 1), which is used in this work, consists of a B6-bridge voltage source power electronic converter and a PMSM. In the following, these two main components of the electric drive are briefly modeled.

A. PERMANENT MAGNET SYNCHRONOUS MOTOR
The mathematical model of the three-phase PMSM can be simplified using the rotor-fixed dq-coordinates [29]. Therefore, two transformations are necessary. First, the three phases $x_a$, $x_b$ and $x_c$ are transformed to the stator-fixed components $x_\alpha$, $x_\beta$ and $x_0$ with

$$\begin{bmatrix} x_\alpha \\ x_\beta \\ x_0 \end{bmatrix} = \frac{2}{3} \begin{bmatrix} 1 & -\frac{1}{2} & -\frac{1}{2} \\ 0 & \frac{\sqrt{3}}{2} & -\frac{\sqrt{3}}{2} \\ \frac{1}{2} & \frac{1}{2} & \frac{1}{2} \end{bmatrix} \begin{bmatrix} x_a \\ x_b \\ x_c \end{bmatrix}. \quad (1)$$

Second, a rotational transformation to the rotor-fixed variables $x_d$ and $x_q$ is performed with the electrical rotor angle $\varepsilon$:

$$\begin{bmatrix} x_d \\ x_q \end{bmatrix} = \begin{bmatrix} \cos(\varepsilon) & \sin(\varepsilon) \\ -\sin(\varepsilon) & \cos(\varepsilon) \end{bmatrix} \begin{bmatrix} x_\alpha \\ x_\beta \end{bmatrix}. \quad (2)$$

Both (1) and (2) apply to current, magnetic flux linkage and voltage [30]. The resulting flux linkage ordinary differential equation (ODE) in dq-coordinates is

$$\frac{\mathrm{d}}{\mathrm{d}t} \psi_{dq} = u_{dq} - R_\mathrm{s} i_{dq} - \omega \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \psi_{dq}. \quad (3)$$

Here, $\psi_{dq}$ is the stator flux linkage, $u_{dq}$ is the stator voltage, $R_\mathrm{s}$ is the stator resistance, $i_{dq}$ is the stator current and $\omega$ is the electrical angular fundamental frequency. The resulting electromagnetic torque $T$ produced by the motor is

$$T = \frac{3}{2} p \left( \psi_d i_q - \psi_q i_d \right) \quad (4)$$

with the pole pair number $p$. The ODE for the mechanical angular velocity $\omega_\mathrm{me}$ is defined by

$$J \frac{\mathrm{d}}{\mathrm{d}t} \omega_\mathrm{me} = T - T_\mathrm{l} \quad (5)$$

with the load torque $T_\mathrm{l}$ and the moment of inertia $J$. The relationship of the mechanical ($\varepsilon_\mathrm{me}$, $\omega_\mathrm{me}$) and electrical ($\varepsilon$, $\omega$) quantities is given by $\omega = p \omega_\mathrm{me}$ (6) and $\varepsilon = p \varepsilon_\mathrm{me}$ (7).
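For illustration, the amplitude-invariant Clarke and Park transformations (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch with our own function names, not part of the presented toolchain:

```python
import numpy as np

def clarke(x_abc):
    """Amplitude-invariant Clarke transform, cf. (1): abc -> alpha/beta/0."""
    T = (2.0 / 3.0) * np.array([
        [1.0, -0.5, -0.5],
        [0.0, np.sqrt(3) / 2, -np.sqrt(3) / 2],
        [0.5, 0.5, 0.5],
    ])
    return T @ np.asarray(x_abc)

def park(x_ab0, eps):
    """Park rotation, cf. (2): alpha/beta -> rotor-fixed d/q components
    (the zero component is dropped)."""
    R = np.array([
        [np.cos(eps), np.sin(eps)],
        [-np.sin(eps), np.cos(eps)],
    ])
    return R @ np.asarray(x_ab0)[:2]
```

For a balanced three-phase system the chained transforms yield constant dq quantities, which is exactly why the rotor-fixed frame simplifies the control problem.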
To fully describe the motor behavior, a relationship between the magnetic flux linkage $\psi_{dq}$ and the stator current $i_{dq}$ is required. For many drives, a simple linear equation utilizing absolute inductance values ($L_d$, $L_q$) and a constant permanent magnet flux linkage $\psi_\mathrm{PM}$ can often be used:

$$\psi_{dq} = \begin{bmatrix} L_d & 0 \\ 0 & L_q \end{bmatrix} i_{dq} + \begin{bmatrix} \psi_\mathrm{PM} \\ 0 \end{bmatrix}. \quad (8)$$

Consequently, the electrical ODE simplifies to

$$\frac{\mathrm{d}}{\mathrm{d}t} i_{dq} = \begin{bmatrix} L_d & 0 \\ 0 & L_q \end{bmatrix}^{-1} \left( u_{dq} - R_\mathrm{s} i_{dq} - \omega \begin{bmatrix} 0 & -L_q \\ L_d & 0 \end{bmatrix} i_{dq} - \omega \begin{bmatrix} 0 \\ \psi_\mathrm{PM} \end{bmatrix} \right). \quad (9)$$

In this case, a classical equivalent circuit diagram can be drawn as depicted in Fig. 2. However, for highly utilized PMSM drives, especially in the automotive or aerospace domain, significant (cross-)saturation within the magnetic circuit occurs [5]. As a consequence, the linear magnetic model (8) is not valid and has to be replaced by a nonlinear relation

$$\psi_{dq} = \psi_{dq}(i_{dq}). \quad (10)$$

Instead of the absolute inductances as in (9), operating point-dependent differential inductances have to be taken into account:

$$L_{dq}(i_{dq}) = \begin{bmatrix} \frac{\partial \psi_d}{\partial i_d} & \frac{\partial \psi_d}{\partial i_q} \\ \frac{\partial \psi_q}{\partial i_d} & \frac{\partial \psi_q}{\partial i_q} \end{bmatrix}. \quad (11)$$

The resulting nonlinear current ODE is then

$$\frac{\mathrm{d}}{\mathrm{d}t} i_{dq} = L_{dq}(i_{dq})^{-1} \left( u_{dq} - R_\mathrm{s} i_{dq} - \omega \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \psi_{dq}(i_{dq}) \right). \quad (12)$$

Besides the magnetic (cross-)saturation, other parasitic effects like temperature influences [31], rotor angle-dependent flux harmonics [32], tolerances in mass production [33], iron losses [34], and the skin and proximity effects [35] directly influence the motor behavior, making it a generally nonlinear and uncertain control plant. Since those effects are hard to model in an analytical closed form, we will not go into more detail here but refer to the further technical literature [36].
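As a minimal illustration of the linear model, the current ODE in dq-coordinates can be integrated with a simple forward-Euler scheme. All parameter values below are arbitrary placeholders, not the parameters of the motor under test:

```python
import numpy as np

def current_ode_linear(i_dq, u_dq, omega, R_s=1.0, L_d=1e-3, L_q=1.5e-3, psi_pm=0.1):
    """Right-hand side of the linear PMSM current ODE in dq-coordinates.
    Parameter values are illustrative placeholders only."""
    i_d, i_q = i_dq
    u_d, u_q = u_dq
    di_d = (u_d - R_s * i_d + omega * L_q * i_q) / L_d          # d-axis dynamics
    di_q = (u_q - R_s * i_q - omega * (L_d * i_d + psi_pm)) / L_q  # q-axis incl. back-EMF
    return np.array([di_d, di_q])

def simulate(u_dq, omega, t_s=1e-4, n_steps=1000):
    """Forward-Euler integration from zero initial current with a
    constant voltage input (sampling time t_s well below L/R_s)."""
    i_dq = np.zeros(2)
    for _ in range(n_steps):
        i_dq = i_dq + t_s * current_ode_linear(i_dq, u_dq, omega)
    return i_dq
```

For standstill (omega = 0) the axes decouple and the steady-state current is simply u_dq / R_s, which serves as a quick sanity check of the model.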

B. POWER ELECTRONIC CONVERTER
The circuit diagram of the three-phase two-level converter is given in Fig. 3. A pulse width modulation (PWM) scheme is used for the generation of the switching commands of the transistors [20]. Hence, the control actions are the reference stator voltages $a = u_{dq}^* = [u_d^* \; u_q^*]^\mathrm{T}$, which are continuous variables. The action space is limited by the DC-link voltage $u_\mathrm{DC}$ and by six box constraints in stator-fixed coordinates (the so-called voltage hexagon) [20], which translates to a rotor angle-dependent limitation in dq-coordinates. Due to the discrete-time control implementation and the inverter modulation scheme in combination with regular sampling of the relevant states, there is a delay between the commanded control action and the applied voltage at the motor terminals [5]. This digital control delay is discussed more deeply and compensated from an RL point of view in the upcoming Sec. IV. Besides that, the inverter itself is also a nonlinear actuator since the reference voltage $u_{dq}^*$ from the control is not perfectly transferred to the applied voltage $u_{dq}$ at the motor terminals due to resistive voltage drops, the anti-short-circuit interlocking times as well as non-ideal switching transitions [37], [38]. Hence, the drive system's two main components show nonlinear behavior, which is caused by a variety of parasitic influences with different physical root causes and, therefore, is hard to fully model during the control design phase. This motivates the usage of data-driven control techniques from the field of RL since they do not require a system model but can directly learn from the interaction with the physical system.

III. DEEP DETERMINISTIC POLICY GRADIENT
First, it may be noted that both the state and action space are continuous within the given drive control problem. For this kind of problem space, the DDPG algorithm is a well-known solution candidate from the RL domain and will be utilized for demonstrating the online learning rapid control prototyping toolchain under real-world experimental conditions. Therefore, only a short introduction to RL in general and the DDPG algorithm in particular is given, while more details can be found in [8], [23]. The standard pseudocode of the DDPG algorithm is presented in Alg. 1. Additionally, modifications to the standard DDPG approach that boost the learning process for real-time motor control are presented in Sec. IV.
The classic RL setup consists of an agent and an environment. The environment can be modeled as a Markov decision process (MDP) (cf. [8]). At each discrete time step $k$, the agent receives an observation $o_k = f(s_k)$ from the environment, i.e., measurements, reference signals and further quantities derived from the environment state $s_k$ in a feature engineering sense. The agent calculates an action $a_k$ (i.e., the actuating variable) with its internal policy function $a_k = \pi(o_k)$, and receives a reward $r_{k+1}$ for applying it to the environment. The reward indicates how well the agent is performing, and it is thus comparable to the MPC framework's cost function, albeit the latter is minimized rather than maximized. From a control engineering perspective, the reward signal $r_k$ is an important degree of freedom which requires proper design to ensure that the RL agent is learning a suitable control policy. The agent's goal is to find an optimal policy $a_k = \pi^*(o_k)$ which maps each observation to the action that maximizes the return $g_k$, which is the expected discounted cumulative reward over time:

$$g_k = \sum_{i=k+1}^{\infty} \gamma^{\,i-k-1} \, \mathrm{E}\!\left[ R_i \mid O_{i-1} = o, A_{i-1} = a \right] \quad (15)$$

with the discount factor $\gamma \in [0, 1)$. Here, $\mathrm{E}[\cdot]$ is the expectation of the reward $R_i$ (capital symbols denote random variables in Sec. III) given the random observations $O_i$ and actions $A_i$ with realizations $o$ and $a$. This relation can be rewritten in the form of the Bellman equation [8]:

$$q(o_k, a_k) = \mathrm{E}\!\left[ R_{k+1} + \gamma \, q(O_{k+1}, A_{k+1}) \mid O_k = o_k, A_k = a_k \right], \quad (16)$$

wherein $q$ is the action value function that performs a mapping from the momentary observation $o_k$ and action $a_k$ to the expected return. This equation is of fundamental importance for many RL methods, as it condenses the problem of finding an optimal sequence of actions to the problem of finding the single optimal action at each time step. The DDPG algorithm is an efficient RL algorithm for continuous observation and action spaces that separates action selection and policy evaluation [23]. Action selection is performed by the actor $\pi_\xi: o_k \mapsto a_k$, which is a policy function approximator with parameters $\xi$.
Policy evaluation is carried out with use of the critic $q_\zeta: (o_k, a_k) \mapsto g_k$, which is an action value function approximator with parameters $\zeta$. Both function approximators are usually artificial neural networks (ANN) whose specific topologies (general type of ANN, number of neurons per layer, number of layers, learning rate, etc.) represent an important subset of the agent's hyperparameters. The critic learns to predict the return for observation-action pairs $(o_k, a_k)$ [23]. It can be trained by minimizing the cost function $J_\mathrm{c}$ for given transition experiences $e_k = (o_k, a_k, r_k, o_{k+1}, d_{k+1})$:

$$J_\mathrm{c} = \frac{1}{|B|} \sum_{e_k \in B} \left( \hat{g}_k - q_\zeta(o_k, a_k) \right)^2 \quad \text{with} \quad \hat{g}_k = r_k + \gamma \left( 1 - d_{k+1} \right) q_{\zeta_\mathrm{target}}\!\left( o_{k+1}, \pi_{\xi_\mathrm{target}}(o_{k+1}) \right), \quad (17)$$

wherein $B$ is a batch of collected experiences and $d_k$ is a done signal, which is equal to one ($d_k = 1$) if the observation $o_k$ belongs to a terminal state and zero ($d_k = 0$) otherwise. The structure of the critic's cost function (17) indicates that the critic $q_\zeta$ is trained by supervised learning, under the assumption that the estimated return $\hat{g}_k$ is an accurate approximation.
Training the critic is based on the concept of bootstrapping, meaning that the critic $q_\zeta$ is improved towards $\hat{g}_k$ by using an estimation of the upcoming action value, as proposed by (16). As this method can lead to unwanted parameter oscillations, it has proven to be good practice to dampen the parameter updates within the critic that is used to calculate $\hat{g}_k$. This introduces the concept of the target critic $q_{\zeta_\mathrm{target}}$, whose parameters $\zeta_\mathrm{target}$ track the actual critic parameters $\zeta$ with low-pass filter characteristic

$$\zeta_\mathrm{target} \leftarrow \rho \, \zeta + (1 - \rho) \, \zeta_\mathrm{target} \quad (18)$$

with damping parameter $\rho \in \, ]0, 1]$ [15], [23]. When a satisfactory action value estimation can be assumed, this knowledge can be used to optimize the actor $\pi_\xi$. The task of the actor is simply to choose an action such that $q_\zeta(o_k, \pi_\xi(o_k))$ is maximized. This leads to a very straightforward actor cost function $J_\mathrm{a}$:

$$J_\mathrm{a} = -\frac{1}{|B|} \sum_{e_k \in B} q_\zeta\!\left( o_k, \pi_\xi(o_k) \right). \quad (19)$$

Also for the actor it is often helpful to make use of target parameters $\xi_\mathrm{target}$. Since the actor optimization (19) is based on the assumption that the critic outputs action value estimations that fit the policy, harsh changes to the actor could lead the critic to get out of touch with the updated policy. Inaccurate action value estimation would then disrupt the training process. Thus, it is usually advised to also dampen the actor parameters when training the critic, as denoted in (17).
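The bootstrapped return estimate for the critic cost (17) and the low-pass tracking of the target parameters can be sketched as follows. This is a numpy sketch with placeholder function approximators instead of full ANNs, and the helper names are our own:

```python
import numpy as np

def ddpg_targets(batch, critic_target, actor_target, gamma=0.99):
    """Bootstrapped return estimates g_hat for a batch of transitions,
    as used in the critic cost (17).
    batch: iterable of (o_k, a_k, r_k, o_k1, d_k1) tuples;
    critic_target, actor_target: callables standing in for the target ANNs."""
    targets = []
    for o, a, r, o_next, done in batch:
        a_next = actor_target(o_next)  # target actor proposes the next action
        g = r + (1.0 - done) * gamma * critic_target(o_next, a_next)
        targets.append(g)
    return np.array(targets)

def soft_update(target_params, params, rho=1e-3):
    """Low-pass (Polyak) tracking of the online parameters with damping rho:
    target <- rho * params + (1 - rho) * target."""
    return rho * params + (1.0 - rho) * target_params
```

Note that for a terminal transition (done = 1) the bootstrap term vanishes and the target collapses to the immediate reward.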
It should also be noted that the DDPG agent represents a parametric control algorithm defined by the actor network weights ξ. To enable online-suitable control, only the actor inference π ξ : o k → a k needs to be executed under real-time requirements while the learning of the actor and critic can be executed asynchronously.

REPLAY MEMORY
During training, the actor applies actions to the environment and receives transition experiences $(o_k, a_k, r_k, o_{k+1}, d_{k+1})$. These are stored in a replay memory $D$, which buffers the last experiences the agent has gathered. From this memory, multiple experiences are sampled into a training batch $B$ to train the actor and critic networks. Those samples are independently drawn from the memory and are not necessarily temporally consecutive. Random sampling of multiple experiences from a longer history decorrelates the experiences in one training batch from each other, which results in gradient updates with more general improvement.
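A minimal replay memory along these lines can be written with a bounded deque. This is an illustrative sketch, not the toolchain's implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """Ring buffer storing the most recent transition experiences."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest entries first
        self.buffer = deque(maxlen=capacity)

    def store(self, experience):
        """experience: tuple (o_k, a_k, r_k, o_k1, d_k1)."""
        self.buffer.append(experience)

    def sample(self, batch_size):
        """Draw a decorrelated training batch without replacement."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Sampling without replacement from the whole buffer is what breaks the temporal correlation between the experiences within one batch.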

EXPLORATION NOISE
To avoid getting stuck in a local optimum, exploratory behavior is required. This is implemented by superimposing the actor's action choice $a_k$ with random noise. Here, the discrete-time zero-mean Ornstein-Uhlenbeck (OU) process [39] is used as exploration noise:

$$n_{k+1} = n_k - \theta \, n_k \, \Delta t + \sigma \sqrt{\Delta t} \, \mathcal{N}(0, 1) \quad (20)$$

with the sampling time $\Delta t$, the stiffness $\theta$, the diffusion $\sigma$ and the standard Gaussian noise process $\mathcal{N}(0, 1)$, respectively. The benefit of this process is that it is time-correlated. Non-correlated random noise, as is the case for, e.g., standard Gaussian noise, leads to less exploration throughout the observation space because consecutive actions are more likely to cancel each other out.
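The discrete-time OU exploration noise can be implemented in a few lines. The stiffness and diffusion defaults below are placeholders, not the tuned values of the experiments:

```python
import numpy as np

class OUNoise:
    """Discrete-time zero-mean Ornstein-Uhlenbeck exploration noise:
    n_{k+1} = n_k - theta * n_k * dt + sigma * sqrt(dt) * N(0, 1).
    theta, sigma, dt defaults are illustrative placeholders."""

    def __init__(self, dim, theta=1.0, sigma=0.5, dt=1e-4, rng=None):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.n = np.zeros(dim)
        self.rng = rng or np.random.default_rng()

    def step(self):
        """Advance the process by one sampling interval and return the noise."""
        self.n = (self.n
                  - self.theta * self.n * self.dt
                  + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.n.shape))
        return self.n

    def reset(self):
        """Restart from zero, e.g., at the beginning of a new episode."""
        self.n = np.zeros_like(self.n)
```

Because each sample depends on the previous one, consecutive exploration offsets drift in the same direction instead of canceling out, which is the property motivated above.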
Algorithm 1: DDPG training (cf. [23])
Randomly initialize weights $\zeta$ of $q_\zeta(o, a)$ and $\xi$ of $\pi_\xi(o)$
Initialize target weights accordingly: $\zeta_\mathrm{target} \leftarrow \zeta$, $\xi_\mathrm{target} \leftarrow \xi$
for each training step do
    sample a batch $B$ of experiences from the replay memory $D$
    update $\zeta$ by minimizing $J_\mathrm{c}$ on $B$, (17)
    update $\xi$ by minimizing $J_\mathrm{a}$ on $B$, (19)
    update weights of target networks with low-pass tracking
end for

IV. REINFORCEMENT LEARNING MODIFICATIONS
In this section, the modifications to the standard DDPG training are presented which boosted the real-world training process. Although the modifications are discussed based on the DDPG framework, they are equally useful and applicable for other RL algorithms in the motor control context.

A. BATCHED AND REMOTE REINFORCEMENT LEARNING
RL agents are usually trained after each interaction step [23]. However, this is not possible with the utilized real-time setup, because waiting for a policy update every step would violate the real-time constraints of the controller. In the subsequent experimental tests, the controller is operating at 10 kHz, i.e., all control-related calculations need to be executed in less than 100 μs. This includes the measurement of observations, computation of actions and rewards, safety routines and the application of actions to the converter. Thus, the training and the control inference of the DDPG agent are split up (cf. Fig. 1), since only the actor network is required to be implemented within the embedded controller. As shown in [9], the actor needs only a rather shallow ANN for function approximation. Hence, evaluating the actor ANN under hard real-time requirements is not a problem. The training of the DDPG agent, i.e., improving the actor and critic, is shifted to an asynchronous background task which is not required to be executed in real time. In order to enable a rapid control prototyping pipeline that can evaluate different RL algorithms in the future, this background task is implemented on a flexible remote workstation with a non-real-time operating system. However, without loss of feasibility, the background task could equally be implemented on embedded hardware, for example on system-on-a-chip (SoC) solutions with auxiliary machine learning hardware accelerators.

1) CONTROLLER
All the functions that belong to the environment in the RL setting are located on the controller. This includes the reward calculation, the generation of current references as well as the monitoring of limits with a safety mechanism. Moreover, the controller contains parts of the DDPG agent functionality: the actor network to control the motor and the exploration noise to ensure exploratory behavior.
Up to τ_b consecutive experiences are recorded in a batch as one episode while the RL controller is active. Next, the experiences are sent to the remote DDPG agent. The RL controller receives updated weights and uses them for a new recording.
Another important functionality of the controller is the safety mechanism, which interrupts the RL control by the actor network if motor constraints are violated (e.g., overcurrent or overheating). This terminates an episode of the RL agent. In this case, fewer than τ_b samples are sent to the remote agent. Once the motor has been set into a safe state and the weights have been updated, the RL controller takes over control again.

2) REMOTE DDPG AGENT
The remote DDPG agent contains most of the components of the DDPG agent described in Sec. III: a copy of the actor network, the critic network, the target networks and the replay memory. It receives the batches of experiences. Each experience is then stored in the memory buffer and a training step is performed as in the regular DDPG algorithm until all experiences are processed. Afterwards, the remote DDPG agent sends the updated actor weights back to the embedded controller.
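The remote agent's handling of one incoming episode batch can be sketched as follows. The `memory` and `agent` interfaces are hypothetical stand-ins for the remote agent's internals:

```python
def process_episode_batch(experiences, memory, agent):
    """Remote-side handling of one received episode batch (up to tau_b samples):
    store every experience, perform one training step per sample as in the
    regular DDPG algorithm, and finally return the updated actor weights so
    the embedded controller can start the next recording."""
    for experience in experiences:
        memory.store(experience)
        batch = memory.sample(agent.batch_size)
        agent.train_step(batch)  # critic update (17), actor update (19),
                                 # target-network tracking
    return agent.actor_weights()
```

The key design point is that this loop runs without real-time constraints on the workstation, while the embedded controller only ever swaps in the returned actor weights.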

B. ACTOR AND CRITIC PRE-TRAINING
Training on real motors can be costly and also dangerous in case of limit violations. To speed up the training, pre-trained weights can be used. One option is to initialize the weights with imitation learning [40], i.e., the RL agent is trained to imitate a target controller with supervised learning. Another possibility is to make use of a simulative environment [10]. Because a simulative motor model is only an approximation, a control error arises when a control algorithm is ported unaltered from simulation to reality. Nevertheless, pre-training can provide expedient initial weights for the beginning of the real-world training, which might speed it up significantly. Furthermore, safety-critical constraints can be pre-learned in the simulation, which leads to fewer violations of safety limits during real-world training.

C. DIGITAL CONTROL DELAY COMPENSATION
Usually, RL agents interact with their environment in an alternating execution order. The environment responds with an observation and does not change its state while the agent calculates the next action. This is valid for typical RL problems like computer games or control problems with system dynamics that are much slower than the speed of action calculation.
Here, however, an action is always applied one time step delayed due to the digital control delay in the system. When the agent applies the action a_0 that was calculated based on the last observation o_0, the real environment state has already changed and corresponds to the next observation o_1. Consequently, the effects of action a_k can be observed in o_{k+2} and not instantly in o_{k+1} (cf. Fig. 4). Therefore, the reward r_k and the terminal flag d_{k+1} are independent of the action a_k. Hence, the interaction between agent and environment has to be sketched with a concurrent scheme, see Fig. 4. Ignoring this fact can lead to longer and more unstable training. For example, the agent could see a high instant reward r_k for an action a_k that would actually lead to a low reward r_{k+1}.
The concurrent execution can be mapped to the alternating execution if the environment is modeled with a one-step delay as shown in Fig. 5. To overcome this additional hurdle, two changes have been implemented in the DDPG agent.

1) EXPERIENCE MODIFICATION
The experience e_k definition from (17) is extended to e_k = (o_k, a_k, r_{k+1}, o_{k+2}, d_{k+2}). Having added r_{k+1} and o_{k+2}, an experience contains the reward and the next observation after the action a_k has actually taken effect on the electric motor.
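Assembling such delay-compensated experiences from one recorded episode can be sketched with a small helper. This assumes per-step recordings aligned so that rews[k] and dones[k] are the reward and terminal flag received at step k; the function name is our own:

```python
def delay_compensated_experiences(obs, acts, rews, dones):
    """Build e_k = (o_k, a_k, r_{k+1}, o_{k+2}, d_{k+2}) from aligned
    per-step recordings, accounting for the one-step digital control delay.
    obs[k], acts[k]: observation and action at step k;
    rews[k], dones[k]: reward and terminal flag received at step k."""
    experiences = []
    for k in range(len(obs) - 2):  # the last two steps lack a delayed successor
        experiences.append((obs[k], acts[k], rews[k + 1], obs[k + 2], dones[k + 2]))
    return experiences
```

Shifting the reward and successor observation by one extra step is what restores the causal link between an action and its observed effect.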

2) ACTION FEEDBACK
Furthermore, the action a_{k-1}, which has been applied in the last cycle, will take effect in the next time step. This action is appended to the observation, o_k ← (o_k, a_{k-1}), as proposed by [21]. With this information, the agent is able to estimate how the system will behave in the next time step. For example, after applying actions that lead to steep changes in the electrical current, the agent might better use smaller actions to reduce overshooting the reference in the next time step. The action from time step k has not yet had any effect on the system due to the digital control delay. A simple feedforward network as actor or critic approximation model cannot remember the previously applied action. Therefore, it is fed back into the network's inputs as part of the observation. This allows the agent to comprehend causal relationships again [9].
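The action feedback itself is a one-line observation augmentation, sketched here for completeness:

```python
import numpy as np

def augment_observation(o_k, a_prev):
    """Append the previously applied action to the observation,
    o_k <- (o_k, a_{k-1}), so a memoryless feedforward actor/critic can
    account for the action still 'in flight' due to the control delay."""
    return np.concatenate([np.atleast_1d(o_k), np.atleast_1d(a_prev)])
```

The augmented vector simply widens the input layer of the actor and critic networks by the action dimension.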

D. REWARD FUNCTION AND SAFETY CONSTRAINTS
It must be ensured that the RL controller learns to comply with the safety constraints of the motor [12]. In electric motor control, especially the current constraint is important to avoid overcurrents that could destroy the motor or the feeding inverter including the power supply (e.g., a traction battery). To ensure that a trained RL agent complies with these constraints, a reward shaping approach is used [10]: in case of a limit violation, an additional penalty term r_lim is added to the regular reward. Here, w_1, w_2 ∈ R_{<0} are weighting parameters to balance the regular and the penalty component of the reward function.
The regular part of the reward function (22) represents the motor current control problem of following given reference trajectories i_j^* (e.g., from superimposed control loops). The root function (22) delivers improved early and long-term training performance compared to standard mean-squared-type rewards, which are most common in tracking control problems. In particular, the steady-state control error can be reduced significantly compared to a mean-squared control error reward [10].
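A reward of this structure, a root-type tracking term plus an additive penalty r_lim on limit violations, can be sketched as follows. The normalization by 2·i_lim and the default weights are our own illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def reward(i_dq, i_dq_ref, i_lim, w_1=-1.0, w_2=-10.0):
    """Root-type tracking reward with an additive limit-violation penalty.
    Both weights are negative, so the agent maximizes the reward by
    reducing the tracking error and avoiding overcurrent.
    The error normalization by 2 * i_lim is an illustrative assumption."""
    tracking_error = np.sqrt(np.sum(np.abs(i_dq - i_dq_ref)) / (2 * i_lim))
    r = w_1 * tracking_error
    if np.any(np.abs(i_dq) > i_lim):  # safety constraint violated
        r += w_2                      # penalty term r_lim
    return r
```

The square root flattens the gradient of the reward near the reference less than a squared error does, which is the intuition behind the improved steady-state accuracy reported above.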

V. EXPERIMENTAL TEST SETUP
In this section, the experimental setup is presented. First of all, the workflow from simulation-based investigations via real-time controlled software-in-the-loop (SIL) models to the final test bench training session is described. Afterwards, the specific hardware architecture including the motor, controller and workstation is presented. Finally, important implementation details for the tests are described.

A. WORKFLOW FROM SIMULATION TO THE TEST BENCH
The development of RL motor controllers can be split into three steps as shown in Fig. 6. First, the gym-electric-motor toolbox [10] can be used with the standardized interface from OpenAI Gym [11]. Therewith, many different general-purpose RL agents from several Python libraries can be adapted and tested easily for this use case. Also, different investigations (e.g., on training parameters and network architectures) can be executed in a simple and quick manner. Afterwards, selected RL algorithms and parameter specifications are tested with the presented remote training setup on a real-time controlled SIL model utilizing an embedded rapid control prototyping hardware system. The batched learning under real-time control and the proper transfer from a pure simulation framework to an embedded hardware framework are tested with this setup. Furthermore, the RL agent's weights are pre-trained in the SIL simulation. Finally, the chosen algorithm is trained and tested on the test bench. The training on the workstation as well as the controller can stay the same when exchanging the SIL model for the real motor. Only the actions of the actor are then applied to the laboratory inverter and the observations are received from measurement sensors.

B. HARDWARE SETUP
The nominal parameters of the test bench equipment and the motor used for experimental investigations are given in Table 1, while an illustrative picture of the test bench setup with the utilized (interior) PMSM in the background is shown in Fig. 7. To highlight that the motor under test shows significant (cross-)saturation effects within the normal operation range, the offline-identified flux and differential inductance maps are shown in Fig. 8 and Fig. 9. More information regarding the motor under test, including open-source characterization measurement data highlighting its highly nonlinear behavior, can be obtained from [41]. However, it should be noted that the offline-obtained data from Fig. 8 and Fig. 9 have not been made available to the RL training.
The controller and the SIL simulation models for the pre-training of the agents are built in Simulink and run as automatically generated and exported C-code on a dSPACE MicroLabBox with a real-time kernel in the pre-training sessions. During the test bench investigations, a similar dSPACE DS1006MC system is used, since it is already fully integrated into the laboratory test bench. Both the MicroLabBox and the DS1006MC are commercial off-the-shelf embedded hardware products for rapid control prototyping. Hence, they speed up the development process since the embedded C-code does not have to be written manually. As shown in Fig. 1, only the DDPG actor networks need to be implemented on the embedded hardware, while the actual RL training pipeline is executed remotely. Besides the DDPG actor, the embedded control framework incorporates the usual general measurement processing and safety protocols for motor control applications. A Python-based script controls and automates both the measurement recordings (gathering RL experience samples) and the actor weight updating using the ControlDesk interface, which is part of the dSPACE rapid control prototyping environment.
It should be noted that the specific implementation is adapted to dSPACE hardware and software, however, the general concept of edge computing-based RL with an asynchronous training pipeline decoupled from the embedded actor is generalizable to any hardware setup. The main advantage is that the learning process can be executed on standard computer hardware with a full operating system (e.g., Linux) such that testing and tuning of different RL algorithms is easily possible without the need to transform the actual learning algorithms into the embedded world.
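The asynchronous concept can be illustrated with a generic producer-consumer sketch: the real-time side only evaluates the actor and streams experience batches, while a background learner on the workstation trains and pushes weights back. All names (`train_on_batch`, `push_weights`) are hypothetical placeholders, not the dSPACE/ControlDesk API.

```python
# Minimal sketch of edge computing-based RL with an asynchronous
# training pipeline decoupled from the embedded actor: experience
# batches arrive in a queue, a background thread performs the DDPG
# updates and transfers the new actor weights to the target hardware.
import queue
import threading

experience_q = queue.Queue()

def learner(agent, stop_event):
    """Background training loop running on the workstation."""
    while not stop_event.is_set():
        batch = experience_q.get()      # blocking: wait for recorded samples
        if batch is None:               # sentinel: shut the learner down
            break
        agent.train_on_batch(batch)     # DDPG update (actor + critic)
        agent.push_weights()            # send updated actor weights out
```

Because the learner runs under a full operating system, it can be swapped for any other RL algorithm without touching the embedded code.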

C. SOFTWARE SETUP
The remote DDPG agent on the workstation is a modified version of the DDPG implementation in keras-rl [28]. The DDPG agent offers several parameters, which are summarized in Table 3. The actor and critic are structured as multi-layer perceptron ANNs, with their specific designs summarized in Table 4. Hence, the RL agent learns a direct, state-free (no internal integral action or similar) mapping between observations and actions to maximize its return. The configuration from Table 4 is only an example and is not to be understood as the best possible actor and critic configuration. Further ANN types (e.g., recurrent or convolutional ANNs) as well as the size of the networks (number of neurons and layers) can be investigated by means of a superimposed hyperparameter optimization, as outlined in Sec. VII.
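The direct, state-free observation-to-action mapping of the actor can be illustrated as a plain feed-forward MLP; the layer sizes, activations and observation dimension below are placeholders, not the exact configuration of Table 4.

```python
# Sketch of the actor: a feed-forward MLP mapping normalized
# observations directly to the two action components (u_d*, u_q*).
# Layer sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
sizes = [7, 64, 64, 2]   # observation dim -> hidden layers -> action dim
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def actor(obs):
    h = obs
    for W, b in list(zip(weights, biases))[:-1]:
        h = np.maximum(h @ W + b, 0.0)              # ReLU hidden layers
    # tanh output keeps the actions in the normalized range [-1, 1]
    return np.tanh(h @ weights[-1] + biases[-1])
```

Since the mapping has no internal state, the same forward pass can run unchanged on the embedded hardware once the weights are transferred.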
The applied action a = [u*_d, u*_q]^T is the reference voltage generated for the inverter. The observations shown to the agent consist of the following quantities: while the measured currents and the motor speed are standard measurements, the last applied voltage vector (action a_{k−1}) is also utilized as an observation. This enables the RL agent to learn and compensate for the digital control delay (cf. Sec. IV-C).
Finally, the reference current values are also part of the observation space to allow proper current control. All quantities in the observation space are normalized to a range of [−1, 1] with their limits. In the training, step-like reference changes i*_dq ∈ [−250 A, 250 A] are drawn from a uniform distribution and updated every 100 ms (cf. [9], [10]). The reward function (22) is chosen. As exploratory action noise, an OU process (20) is used with variable parameters θ and σ in the different experimental sessions and a mean of μ = 0.
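A discretized Ornstein-Uhlenbeck process of the kind referenced in (20) can be sketched as follows; the parameter values θ, σ and the step width dt are placeholders, since they were varied between the experimental sessions.

```python
# Zero-mean (mu = 0) Ornstein-Uhlenbeck exploration noise, discretized
# with a simple Euler-Maruyama step. theta, sigma and dt are examples.
import numpy as np

_rng = np.random.default_rng(42)

def ou_step(x, theta=0.15, sigma=0.2, mu=0.0, dt=1e-4):
    """Advance the OU process by one time step."""
    return x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * _rng.standard_normal(x.shape)

noise = np.zeros(2)          # one noise channel per action component
for _ in range(1000):
    noise = ou_step(noise)   # added to the actor output during exploration
```

The mean-reverting term θ(μ − x) pulls the noise back towards zero, so larger θ yields faster noise dynamics while σ scales its magnitude.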

D. PRE-TRAINING OF THE MOTOR CONTROLLER
Three different RL agents are pre-trained with a SIL motor model in order to evaluate whether a simulation-based pre-trained RL controller learns faster during the experimental investigations than randomly initialized actor and critic networks. During the training, the noise is reduced to σ = 0.1 after 100000 steps and to σ = 0.0 after 200000 steps. The pre-trained agents of training sessions A and B use the networks shown in Table 2. The pre-trained agent C has one additional hidden actor layer. For agent B, the motor model inductance values were reduced to a third of those in case A to investigate parameter sensitivity. After the transfer of the pre-trained agent to the real motor, a better performance in the nonlinear regions is expected, because the effective inductances are reduced at higher currents due to magnetic saturation. As shown in Fig. 10, all pre-trained models show a similar mean absolute error (MAE), averaged over 5000 consecutive samples, after 250000 training steps.
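The described noise reduction can be written as a simple step schedule over the training steps; the initial value σ0 is an assumption, only the two reduction points are given above.

```python
# Exploration-noise schedule of the pre-training: sigma is reduced to
# 0.1 after 100000 steps and to 0.0 after 200000 steps. The initial
# value sigma0 is a hypothetical placeholder.
def sigma_schedule(step, sigma0=0.2):
    if step >= 200000:
        return 0.0
    if step >= 100000:
        return 0.1
    return sigma0
```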

VI. EXPERIMENTAL INVESTIGATION
The goal of these experimental investigations is to give a proof of concept for the presented simulation-to-experiment toolchain and to show that RL motor controllers are feasible, not only in simulation, but also in real experiments. Furthermore, the influence of different training parameters is investigated. The effects of pre-trained networks as well as different OU-noise parameters and two different actor network topologies are examined in the experiments. All subsequent experimental training and cross-validation results are obtained at a fixed motor speed of n = 1000 1/min (maintained by a directly coupled speed-controlled load machine).

A. DDPG TRAINING ON A REAL MOTOR
A number of different training scenarios have been evaluated which are listed in Table 2. In case a, the controller is trained only on the real motor for 250000 steps. In the other cases, pre-trained networks from Sec. V-D are used for the initialization of the real motor training. The different noise parameters are also given in the table and were kept constant during the training. In case e, an actor network with one additional hidden layer is used.
The learning process of the RL controller is analyzed via the MAE during training. In Fig. 11, a large difference between the pre-trained cases and case a can be seen in the beginning. This is expected and shows that simulative pre-training can help to speed up the training process on the real system. However, at the end of the training, the MAEs are all in the same range. This shows that complete training on a real motor is possible. The lowest MAE is achieved in case d; it is nearly constant during the last 100000 training steps. The exploration noise used there has faster dynamics compared to cases a, b and c, which seems to lead to better exploration. Also, the variations in the MAE during the training are smaller. It can be inferred that the exploration noise has a large impact on the training. Case e, with noise similar to case d, also performed well during the training with the larger actor network. Prior to these experiments, several tests with the larger network failed, which was not the case with the smaller network. Hence, the smaller network seems to be more robust and sufficient for this RL motor controller. Moreover, it should be noted that the DDPG algorithm requires active exploration to learn, i.e., if the OU noise is deactivated while the DDPG algorithm still actively performs policy updates, there is a high likelihood that the agent diverges. Hence, when the learning converges into steady state, both the exploration noise and the policy updates should be discontinued. In this case, the actor becomes a static policy (mapping observations to actions by the feed-forward ANN from Table 4) until active learning and exploration are re-activated (e.g., if the rewards deteriorate due to control plant changes).
Based on the various experimental training runs, a DDPG setup was selected for the subsequent transient and steady-state tests that delivered the comparatively best performance. The parameter setup used in the following is summarized in Table 3 and Table 4. Since the conceptual presentation of a rapid control prototyping pipeline for data-driven control approaches is the main focus of this contribution, it should be mentioned that no systematic hyperparameter optimization or feature engineering was performed for the DDPG controller, i.e., it can be assumed that the performance of this model-free controller can be further increased within future investigations.

B. COMPARISON TO STATE-OF-THE-ART CONTROLLERS
Finally, the DDPG-based controller is set against a linear field-oriented approach using PI controllers [42] and a continuous-control-set (CCS) MPC [4]. For these two model-based controllers, the discrete-time representation of (13) is utilized, where x_k is the discrete-time sample of a given quantity, t is the time difference between two control cycles (assuming regular sampling) and I is the identity matrix [43]. To allow a fair comparison, the model (24) has been accurately parameterized using offline motor characterization (cf. Fig. 8 and Fig. 9) to cover magnetic (cross-)saturation effects and the inverter nonlinearity [31]. The results of the system identification have been made available to the two model-based controllers via look-up tables. Using (24) for decoupling the dq-axes by a feed-forward compensation of the induced voltage [5], two independent PI controllers for the linear field-oriented approach have been designed using the symmetrical optimum [42], with K_p and K_i being the proportional and integral gains and κ being the bandwidth design parameter. It should be noted that the PI controller is adaptively designed, reacting to parameter changes as denoted in (25). The best PI controller performance was achieved for κ = 3 by experimental pre-testing at the test bench. Suitable anti-reset-windup measures have been included to prevent an integrator runaway if the set voltage is clipped by the inverter limits (14). For the MPC, the discussed control delay is compensated by an additional prediction step of the system states using (24) at the beginning of each control cycle. In order to retrieve a convex optimization problem, the MPC problem is defined in (26), where c is a quadratic cost function with respect to the current reference tracking, f is the system model (24) and h is a linear inequality action space constraint due to the inverter limitation as in (14).
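The anti-reset-windup measure described above can be sketched for a single current-control channel with a clamping scheme: the integrator is halted whenever the set voltage is clipped by the inverter limit. The gains and the limit below are hypothetical values, not the symmetrical-optimum design of [42].

```python
# One PI current-control channel with clamping anti-reset-windup:
# the integral state is only advanced while the output is unsaturated,
# preventing integrator runaway at the inverter voltage limit (14).
def pi_step(error, integral, kp, ki, ts, u_max):
    u = kp * error + ki * integral
    u_sat = max(-u_max, min(u_max, u))   # inverter voltage limit
    if u == u_sat:                       # integrate only when unsaturated
        integral += error * ts
    return u_sat, integral
```

Called once per control cycle with sample time ts, this keeps the integral action responsive when the controller leaves saturation.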
To solve the linearly constrained quadratic program (LCQP) (26), the embedded QP solver from the Matlab MPC toolbox [44], [45] was utilized with N = 1 prediction step.

C. TRANSIENT TESTS
The trained DDPG agent and the model-based controllers are tested based on a sequence of random current reference jumps within the entire left i_d-i_q half-plane. To compare the following results, different performance metrics are used. The basic metric is the average of the absolute current error raised to a power m over the test recordings, with m = {0.5, 1, 2} yielding the mean square root error (MRE), the mean absolute error (MAE) and the mean squared error (MSE), respectively. Note the difference between the root mean square error (RMSE) and the mean square root error (MRE). The motivation for the MRE metric arises from its beneficial scaling effect in the reward function.
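The three benchmark metrics then differ only in the exponent m applied to the absolute error before averaging; the reference and measurement arrays below are synthetic examples.

```python
# MRE, MAE and MSE as the averaged m-th power of the absolute current
# error, with m = 0.5, 1 and 2 respectively.
import numpy as np

def error_metric(i_ref, i_meas, m):
    return np.mean(np.abs(i_ref - i_meas) ** m)

i_ref = np.array([1.0, 0.5, -0.25])   # synthetic reference samples
i_meas = np.array([0.9, 0.5, -0.5])   # synthetic measured samples
mre = error_metric(i_ref, i_meas, 0.5)
mae = error_metric(i_ref, i_meas, 1.0)
mse = error_metric(i_ref, i_meas, 2.0)
```

For errors below 1 (normalized quantities), the square root in the MRE compresses large deviations less aggressively than the MSE amplifies them, which is the scaling effect exploited in the reward function.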
The transient benchmark results are summarized in Table 5 and the test time series profiles are shown in Fig. 12 to Fig. 17.
First of all, it should be noted that the DDPG controller can be operated stably and safely over the entire operating range. The transient performance is satisfactory: the DDPG controller is on par with the MPC and clearly better than the adaptive PI controller.
It may also be noted that all current measurements show a rather high level of noise independently of the specific control approach. This is due to the integrated current measurement sensors within the utilized voltage source inverter (cf. Table 1), whose maximum measurement range substantially exceeds the exemplary test motor's operation range. Nevertheless, unfavorable measurement noise is a typical problem in many industrial applications and a given motor control scheme must be able to handle it. All three control methods compared here are able to do so, whereby the RL controller performs minimally better during the stationary phases than the state-of-the-art approaches.
Additionally, the controller execution time on the embedded hardware was measured during the transient tests. The results are summarized in Table 6. As expected, the PI controller causes the least computational burden, whereby the majority of the execution time is not caused by the actual controller but by auxiliary functions (measurement acquisition and processing, safety monitoring, etc.). Nevertheless, both the MPC and the DDPG only require marginally more computation time. It should also be considered that when using parallel computing devices (FPGA, neural cores, etc.), the calculation time of the DDPG actor can be reduced compared to the implementation shown here on a sequentially working CPU.

D. STEADY-STATE TESTS
To evaluate the steady-state performance of the different control approaches, an evenly distributed grid of reference operating points has been designed as shown in Fig. 18. Each controller has been operated at each reference point for 2 s given a constant speed of n = 1000 1/min enforced by a load machine. For quantitative evaluation of the steady-state controller behavior, the following benchmark quantities are introduced: the SSE is the normalized steady-state error between the reference and the momentary current, and the TDD is the averaged total demand distortion with I_h being the h-th current harmonic and I_n being the nominal RMS motor current. The TDD is chosen as a benchmark quantity instead of the total harmonic distortion (THD), since the fundamental current component may differ depending on the given steady-state control error, which would distort the THD.
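The TDD definition can be sketched from a recorded current window: the RMS of all spectral components except DC and the fundamental is normalized to the nominal RMS current I_n (instead of the fundamental, as the THD would do). The signal below is synthetic and the spectral bookkeeping is a simplified illustration.

```python
# Total demand distortion (TDD) of a current window: harmonic RMS
# content relative to the nominal RMS motor current I_n.
import numpy as np

def tdd(i_samples, fundamental_bin, i_nominal):
    spec = np.fft.rfft(i_samples) / len(i_samples)
    mags = 2.0 * np.abs(spec)                    # single-sided amplitudes
    harm = np.delete(mags, [0, fundamental_bin]) # drop DC and fundamental
    # RMS of the harmonic components over the nominal RMS current
    return np.sqrt(np.sum((harm / np.sqrt(2.0)) ** 2)) / i_nominal

# Synthetic test signal: unit fundamental plus a 10% third harmonic
t = np.arange(64)
i = np.cos(2 * np.pi * t / 64) + 0.1 * np.cos(2 * np.pi * 3 * t / 64)
result = tdd(i, fundamental_bin=1, i_nominal=1.0 / np.sqrt(2.0))
```

With I_n chosen as the RMS of the unit fundamental, the synthetic signal yields a TDD of about 0.1, matching the injected 10% harmonic.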
The results of the steady-state control accuracy are shown in Fig. 19. As expected, the PI controller performs best due to its integral control action being able to compensate for any model deviation. The MPC and DDPG approaches show similar steady-state accuracy, whereby the observed stationary deviations are still tolerable. It should be taken into account that the MPC has a very precise, adaptive time-discrete motor model at its disposal, which was hand-picked and pre-parametrized on the test bench during an offline characterization. In contrast, the RL-based controller is able to achieve comparable steady-state control accuracy within a short period of training time without any model knowledge, solely on the basis of the data-driven learning process. Moreover, it should be mentioned that the stationary accuracy of both methods can be further increased: for the MPC, the coupling with a disturbance observer [46] is possible, while for the DDPG controller a comprehensive hyperparameter optimization and feature engineering as well as the consideration of recurrent neural networks in the actor [47] (in analogy to standard integral controllers) are possible.
In addition, the results regarding the current harmonics by means of the TDD in steady-state operation are depicted in Fig. 20. Here, the DDPG controller outperforms the PI and MPC approaches, which was already indicated by the evaluation of the transient test profiles.

E. RESULT DISCUSSION
In summary, the presented exemplary RL agent's control performance is on par with the established model-based control procedures. While the MPC approach used here is the state-of-the-art solution after more than 20 years of active, extensive research in the field of power electronics, the data-driven RL controller is at a very early research stage, which to the best of the authors' knowledge has now been demonstrated for the first time in a real-world power electronics application. The fact that an RL-based, model-free controller performs on par with MPC without the use of expert knowledge is therefore an important intermediate step in the research of data-driven control methods from the machine learning domain.
The exemplary investigated DDPG agent is only one possible control approach from the field of reinforcement learning and itself leaves much potential for improvement, since no hyperparameter optimization or feature engineering regarding the observation space or the reward signal has been performed for this work yet. It can therefore be assumed that further improvements can be realized in the future by systematic optimization. On the other hand, it should be emphasized once again that the central scientific contribution of this work is the rapid control prototyping simulation-to-experiment toolchain and the associated experimental proof-of-concept as a methodical piece in the puzzle to introduce data-driven approaches in real-world power system control. It is not claimed that the presented RL performance outperforms the MPC approach as the state-of-the-art control solution at the current point of time.

VII. CONCLUSION AND OUTLOOK
In this work, the transfer of RL electric drive control from offline simulation to online real-world learning was successfully presented. An RL-based PMSM current controller was experimentally trained solely using a measurement data stream without any expert or model knowledge. Several modifications to the classical DDPG training algorithm were presented that enabled the successful online training on a laboratory drive test bench. The presented rapid control prototyping pipeline allows fast and flexible testing of RL algorithms since the learning is shifted to an asynchronous background task. Here, the remote workstation is only optional in order to speed up the learning process; the presented training scheme can be implemented on typical SoC embedded hardware, too. The embedded part of the RL agent (the actor) could be implemented on a typical laboratory hardware system without any real-time problems. Due to the continuous development of control electronics with specialized parallel processing units for the matrix algebra of machine learning (FPGA, neural cores, etc.), an implementation in typical industrial applications will also become possible for low-cost applications in the future. Furthermore, it is conceivable to outsource the training process to an edge or cloud computing framework, which opens completely new possibilities in the control and monitoring of power electronic systems.
Moreover, there is much space for future research in this field. The training process needs to be optimized for sample efficiency in order to accelerate the learning process. An adaptive parameter setting of learning rates and noise parameters could be part of an upgraded training process. Hyperparameter optimization and extended feature engineering for the RL agent are also very likely to improve the overall learning and control performance. This particularly includes the possibility to use recurrent ANNs (with corresponding internal states / memory cells) as the actor, enabling the RL agent to learn some kind of integral feedback to minimize the steady-state control error. Also, investigations regarding the safety constraints are important for real-world control engineering applications. Additionally, transferring the RL-based controller framework to other tasks such as torque or speed control for PMSM and other motor types, including finite-control-set approaches, is highly interesting. Finally, using the rapid control prototyping toolchain for investigating the performance of entirely different RL algorithms on an experimental basis is of prime interest.