Finite-Set Direct Torque Control via Edge-Computing-Assisted Safe Reinforcement Learning for a Permanent-Magnet Synchronous Motor

Advances in the field of reinforcement learning (RL)-based drive control allow formulation of holistic optimization goals for the data-driven training phase. The resulting controllers feature efficient drive operation without the necessity of an a priori known plant model but, so far, the corresponding training phase has only rarely been conducted on real-world drive systems due to safety concerns. This contribution targets the challenging problem of self-learning torque control for a permanent-magnet synchronous motor assuming a finite control set, i.e., the direct selection of switching actions instead of a modulator-based setup. In order to allow a secure and effective online training with real-world drive systems, the RL controller is monitored by a safeguarding algorithm that prevents application of unsafe switching actions, e.g., those that would result in overcurrent. The accruing measurement data are handled by an edge-computing pipeline that outsources the RL training from the embedded control hardware. The inference of the utilized artificial neural network in hard real time is realized with a reconfigurable field-programmable gate array architecture. The resulting RL-based algorithm is able to learn a torque control policy in just 10 min, which has been validated in comprehensive real-world experiments.


The authors are with the Department of Power Electronics and Electrical Drives, Paderborn University, 33098 Paderborn, Germany (e-mail: schenke@lea.uni-paderborn.de; haucke-korber@lea.uni-paderborn.de; wallscheid@lea.uni-paderborn.de).
This article has supplementary material provided by the authors and color versions of one or more figures available at https://doi.org/10.1109/TPEL.2023.3303651.
Digital Object Identifier 10.1109/TPEL.2023.3303651

I. INTRODUCTION
Due to their extensive range of applications and high power density, permanent-magnet synchronous motor (PMSM) drives have earned popularity, not only for industrial utilization but also in electric mobility [1]. The wide distribution of these drive systems motivated a sophisticated but also complicated control theory that has been an important research topic ever since. Ranging from linear, field-oriented control schemes [2] over direct torque control (DTC) methods [3] to model predictive control (MPC) approaches [4], model-based design approaches have been established in drive control for more than 40 years.
In the recent past, however, data-driven controller design procedures like reinforcement learning (RL) have become increasingly prominent as contemporary hardware allows handling of large-scale data [5].
In contrast to the established model-based optimal control techniques, which require significant design effort by human experts, RL-based solutions automatically learn how to control an arbitrary drive system without any human intervention. Especially in the face of labor shortage, such data-driven automation will contribute to ensuring high productivity in research and development processes.

A. State of the Art
Recent investigations on RL-based motor control schemes have broadened the range of available controller design procedures and allow a different perspective on the field of optimal drive control.
1) The optimal drive control policy is learned during experiments. Consequently, an a priori drive model (i.e., specific system knowledge) is not needed.
2) Iron losses, magnetic (cross-)saturation, and other parasitic effects are indirectly considered within the control scheme as they affect the measurements and, therefore, the data being used for RL-based control.
3) Multiple objectives can be considered within the same optimal controller on an infinite time horizon.
For the scenario of PMSM current control, the mentioned points have been successfully validated in publications [6], [7], [8]. However, the considered current control task is only an intermediate step when control of mechanical quantities (torque, speed, or position) is targeted. Consequently, an operation strategy for the motor currents (i.e., controlling the motor current such that the targeted mechanical behavior results) can be incorporated into the design of an RL torque controller as well. This approach has been theoretically discussed in [9] under the label deep Q DTC (DQ-DTC). A practical validation of the described concept as well as further methodological improvements are presented within this article.

B. Contribution
Despite the aforementioned benefits of RL-based control, there are still several issues to be investigated to establish safe and fast applicability in the real world. Therefore, this article proposes the following contributions in order to improve the usability of the DQ-DTC.
1) A safeguarding layer that ensures adherence to all operation limits (i.e., current and voltage constraints) at runtime in order to avoid corresponding plant system downtimes. This is particularly important as RL algorithms require (random) exploratory actions during the training process, which could lead to overloading the drive systems.
2) An edge-computing, online-learning pipeline that enables training of the DQ-DTC using test bench measurements.
3) A resource-efficient and online-reconfigurable field-programmable gate array (FPGA) implementation of the necessary artificial neural network (ANN), which enables real-time capable policy inference on the control hardware.
4) A fully automated, fast RL framework delivering an expedient, data-driven torque control policy in just a single-digit number of minutes without the need for any a priori plant model knowledge [10].
Most importantly, comprehensive test bench experiments are performed in order to prove the feasibility of the DQ-DTC in practice. Further, all scripts and programs created within the scope of this investigation are openly published [11]. A schematic of the proposed control scheme is depicted in Fig. 1 with detailed explanations for each individual component in the following sections.

II. DRIVE SYSTEM
The drive system under investigation consists of a three-phase two-level voltage source inverter and a PMSM. The combination of these components is a standard setup that can be found in many industrial and automotive applications [1]. Both are briefly presented in the following.

A. Permanent-Magnet Synchronous Motor (PMSM)
The PMSM is a standard component of modern drive systems. Especially highly utilized PMSMs feature a high torque density and are, therefore, prevalent in applications where lightweight and space-saving motors are required [12]. Characterization of the PMSM can be simplified by utilizing well-known coordinate transformations. Herein, the physical three-phase quantities $x_{\mathrm{abc}}$ are reinterpreted as 2-D, orthogonal coordinates that can either be viewed from the stator-fixed $\alpha\beta$ reference frame, or from the rotor-fixed dq reference frame, which is defined by rotating the stator-fixed quantities $x_{\alpha\beta}$ by the electrical rotor angle $\varepsilon_{\mathrm{el}}$. Rotor-fixed coordinates allow a compact definition of the PMSM's dynamic behavior

$$\frac{\mathrm{d}}{\mathrm{d}t}\psi_{\mathrm{dq}}(t) = u_{\mathrm{dq}}(t) - R_{\mathrm{s}}\, i_{\mathrm{dq}}(t) - p\,\omega_{\mathrm{me}}(t)\, J\, \psi_{\mathrm{dq}}(t) \qquad (2)$$

with magnetic flux linkages $\psi_{\mathrm{dq}}$, stator currents $i_{\mathrm{dq}}$, stator voltages $u_{\mathrm{dq}}$, angular velocity $\omega_{\mathrm{me}} = \frac{1}{p}\frac{\mathrm{d}}{\mathrm{d}t}\varepsilon_{\mathrm{el}}$, and generated drive torque $T$. Further parameters of this ordinary differential equation are the stator resistance $R_{\mathrm{s}}$ and the number of pole pairs $p$. The dependence between magnetic flux linkage and stator current follows a nonlinear but static relation $\psi_{\mathrm{dq}} = \psi_{\mathrm{dq}}(i_{\mathrm{dq}})$, referred to as (3).

The presented motor model incorporates several common variants as special cases, e.g., PMSMs with both interior (IPMSM) and surface-mounted magnets (SPMSM) as well as highly utilized PMSMs (which feature dominant magnetic saturation behavior) and synchronous reluctance motors (SynRM), which all differ concerning the nature of (3). In the following, no system-specific information (e.g., electric parameterization or lookup table) is utilized by the RL-based control algorithm. Merely an estimation of the motor current limitation, which is an upper boundary for the stator current $i_{\mathrm{s}}$, must be available for safety purposes. The presented motor model is used to illustrate the controlled system behavior for the reader's convenience, but is not known to the learning control algorithm in any way.
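The coordinate transformations mentioned above can be illustrated with a minimal sketch. It assumes the amplitude-invariant scaling of the three-phase to $\alpha\beta$ transformation, which may differ from the convention used in the article; the function names `abc_to_alphabeta` and `alphabeta_to_dq` as well as the example values are hypothetical.

```python
import math


def abc_to_alphabeta(x_a, x_b, x_c):
    """Map three-phase quantities to stator-fixed alpha-beta coordinates.

    Assumption: amplitude-invariant scaling (factor 2/3).
    """
    x_alpha = (2.0 / 3.0) * (x_a - 0.5 * x_b - 0.5 * x_c)
    x_beta = (1.0 / math.sqrt(3.0)) * (x_b - x_c)
    return x_alpha, x_beta


def alphabeta_to_dq(x_alpha, x_beta, eps_el):
    """Express stator-fixed quantities in the rotor-fixed dq frame."""
    c, s = math.cos(eps_el), math.sin(eps_el)
    return c * x_alpha + s * x_beta, -s * x_alpha + c * x_beta


# Example: a balanced three-phase current whose space vector is aligned
# with the rotor angle (arbitrary example values).
eps_el = 0.7
amplitude = 5.0
i_abc = [amplitude * math.cos(eps_el - n * 2.0 * math.pi / 3.0) for n in range(3)]
i_d, i_q = alphabeta_to_dq(*abc_to_alphabeta(*i_abc), eps_el)
```

In this example, the oscillating three-phase currents collapse to a constant d-component equal to the amplitude and a vanishing q-component, which is what makes the rotor-fixed description compact.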

B. Two-Level Voltage Source Inverter
The two-level voltage source inverter is another standard component of drive systems. It is supplied by means of a DC-link voltage $u_{\mathrm{DC}}(t)$. In the context of finite-control-set (FCS) applications such as the proposed DQ-DTC, the inverter is operated with respect to its eight distinguishable switching states as listed in Table I. The specific switching commands for the three half bridges $(s_{\mathrm{a}}, s_{\mathrm{b}}, s_{\mathrm{c}})$ determine the voltage that is applied to the motor, which makes it possible to change the electrical and mechanical operating point of the drive system.
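As an illustration of the finite control set, the following sketch enumerates the eight switching states and the resulting stator-fixed voltage vectors. It assumes binary switching commands and the amplitude-invariant $\alpha\beta$ scaling; the function name `switching_state_voltages` and the DC-link value are hypothetical, and the ordering need not match Table I.

```python
import itertools
import math


def switching_state_voltages(u_dc):
    """Return {(s_a, s_b, s_c): (u_alpha, u_beta)} for all eight switching states."""
    voltages = {}
    for s_a, s_b, s_c in itertools.product((0, 1), repeat=3):
        # Each half bridge connects its phase to the upper (1) or lower (0) DC rail.
        u_alpha = (2.0 / 3.0) * u_dc * (s_a - 0.5 * s_b - 0.5 * s_c)
        u_beta = (1.0 / math.sqrt(3.0)) * u_dc * (s_b - s_c)
        voltages[(s_a, s_b, s_c)] = (u_alpha, u_beta)
    return voltages


table = switching_state_voltages(u_dc=400.0)  # example DC-link voltage in V
zero_states = [s for s, u in table.items() if math.hypot(*u) < 1e-9]
```

Under these assumptions, two of the eight states yield the zero voltage vector, while the remaining six yield active vectors of equal magnitude $\frac{2}{3} u_{\mathrm{DC}}$.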
Alternatively, the inverter could also be operated with an intermediate modulator (e.g., space vector modulation), which would enable a dynamically averaged synthesis of the commanded voltages. This makes it possible to command the applied voltage $u_{\mathrm{dq}}$ directly, rendering the system a continuous-control-set (CCS) application. Control schemes of this nature are equally well represented in drive control but are not within the scope of this contribution. Hence, the term DQ-DTC refers exclusively to its FCS implementation within the scope of this article.

III. DEEP Q DIRECT TORQUE CONTROL (DQ-DTC)
This contribution presents an augmented version of the DQ-DTC, which was originally proposed in [9].Before the fundamental concept of this control scheme is reviewed, a short definition of the optimal torque control task is delivered.

A. Optimal Torque Control
The task of optimal torque control of PMSMs can be described by the discrete-time dynamic optimization problem (4). Thereby, the digital sampling index is denoted by $k$. This formulation demands minimizing the stator current $i_{\mathrm{s}}$ (as a proxy for maximizing the drive's efficiency [1]) while sustaining the commanded reference torque $T^{*}$ by means of the applied switching state $a$. Meanwhile, the current limit $i_{\mathrm{lim}}$ must be respected at all times, which implies the assumption that $T^{*}$ is reachable under this condition.
The DQ-DTC approach seeks to solve this optimal control problem in a data-driven fashion, allowing the setup of an optimal controller without the need for a specific motor model, i.e., knowledge about the parameters or lookup tables concerning (2) and (3) is not utilized to design the controller. This distinguishes the data-driven control approach from the conventional procedure of the model-driven control design that is usually employed when configuring, e.g., field-oriented proportional-integral controllers or MPC schemes. Only information that is generally valid is considered, e.g., the structure of (2), the static nature of (3), or the existence of operational constraints, as described in the following.

B. Operating Conditions
As already defined as part of the control task (4), the most important safety constraint for usage of the PMSM is defined by the current limit $i_{\mathrm{lim}}$. Overshooting this limit can lead to a thermal overload of the drive and the feeding inverter and must, therefore, be avoided. Operating directly below $i_{\mathrm{lim}}$ is not instantaneously harmful but should not be sustained for long intervals because of the limited thermal overload capacity of the motor. This motivates the definition of the nominal current $i_{\mathrm{n}} < i_{\mathrm{lim}}$, which is the maximum stator current that can be endured permanently. In most drive applications, the utilized region of operation is bounded by the nominal current. In special cases, overloading may also be tolerated but corresponding applications require careful temperature monitoring, which is not in the scope of this contribution. Hence, it is targeted to operate the PMSM such that $i_{\mathrm{s}} \leq i_{\mathrm{n}}$.
Whereas the current boundaries are operational constraints, the applied voltage and the voltage limitations are subject to the installed inverter and voltage source. As already stated in Table I, the DC-link voltage $u_{\mathrm{DC}}$ is of central importance for the control behavior of the motor. Therefore, operating points can only be sustained if the necessary mean voltage is available, leading to the relation (5). According to (2), this is specifically critical at high speed, because the induced voltage needs to be compensated by the applied voltage and only the remaining surplus can serve as a control reserve. If no such compensation is possible, the motor can become uncontrollable, which is an indirect safety concern and should be avoided.
In addition to the aforementioned safety-related operating conditions, efficiency-related conditions can be formulated that are universally applicable for all subclasses of PMSMs. Principally, motor operation with $i_{\mathrm{d}} > 0$ is not practical and should, therefore, be avoided in order to maximize motor efficiency [12]. FCS schemes, however, feature a larger current ripple than CCS methods and can, therefore, benefit from permitting some leeway $i_{\mathrm{d+}} > 0$ in order to reduce the average current drain.
While operating the drive at commanded torque, the aim is to maximize the efficiency of the motor by reducing the stator current $i_{\mathrm{s}}$, leading to operation with the well-established maximum-torque-per-current (MTPC) characteristic. Torque demands that exceed the described voltage limitations, which usually happens at high speed, necessitate maximum-torque-per-voltage (MTPV) operation. While conventional control schemes exploit plant-specific parameter knowledge to determine operating points with MTPC or MTPV characteristic [13], the DQ-DTC is able to feature these behaviors after a data-driven training phase.

C. Control Design
Similar to conventional optimal control schemes such as MPC, the behavior of the controller is mainly defined by a reward function that is to be maximized. Contrary to conventional optimal control, the DQ-DTC scheme does not employ optimization in real time to solve the optimization problem (4). Instead, the control strategy is learned asynchronously during direct interaction with the plant system to find the state-action value $q$ (6). Herein, $o$ denotes the observation vector, which contains the information about the system state and is additionally augmented with task-specific features that simplify the learning problem (so-called feature engineering), $\mathrm{E}\{\cdot\}$ is the expected value, and $a$ is the switching action that is to be applied. The reward function $r$ reflects the quality of the momentary plant state and must, therefore, incorporate the objectives and boundary conditions of the control problem. The discount factor $\gamma$ ensures convergence of the series and must be defined in the interval $[0, 1[$. A large discount factor translates to consideration of long-term effects when choosing an action, whereas a smaller discount factor causes the control agent to act rather short-sightedly.
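The effect of the discount factor can be illustrated with a small numerical sketch. It evaluates a finite discounted reward sum for a hypothetical reward sequence; the exact indexing of (6) is not reproduced here, and the function name `discounted_return` is an assumption made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by increasing powers of the discount factor."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1[")
    g = 0.0
    # Backward recursion: g_k = r_k + gamma * g_{k+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A reward that only arrives two steps in the future (hypothetical values).
delayed = [0.0, 0.0, 10.0]
far_sighted = discounted_return(delayed, gamma=0.9)
short_sighted = discounted_return(delayed, gamma=0.1)
```

With the large discount factor, the delayed reward still contributes most of its magnitude to the value, whereas the small discount factor almost entirely suppresses it.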
The controller training is finished as soon as $q$ is found, because the optimal action can then be determined easily via

$$a^{*} = \arg\max_{a}\; q(o, a) \qquad (7)$$

meaning that the action that scores best in terms of action value is considered optimal, which is determined by comparing the resulting action values for all possible actions. The most well-known algorithm employed to learn $q$ with an ANN in systems with continuous state and finite action space is the deep Q-network (DQN) [14], [15], which gives rise to the name DQ-DTC. The fundamental cornerstones of the original DQ-DTC are revisited in the following.
To approximate the state-action value by means of an ANN $q_{\theta}$ with network weights $\theta$, a cost function must be formulated to allow training/optimization of $\theta$. First, please note that according to (6), the state-action value approximation must satisfy the Bellman equation [16], denoted (8). This property is exploited in many pertinent RL algorithms because it allows the formulation of a cost function $J_{q}$ for optimization: the left-hand side and the right-hand side should be equivalent, which means their (quadratic) distance is to be minimized, yielding the cost function (9). Herein, $\mathcal{B}$ denotes a minibatch of experiences $E$ (10), each of which contains the relevant information about the state transition that is learned from. The Boolean flag $d$ marks the termination of the control task, which nullifies the future value when, e.g., the control task is halted or a violation of safety constraints triggers an emergency shutdown. As realized by means of the $\max(\cdot)$ operator in the estimation target, this cost function evaluates the momentary action value under the assumption of subsequent optimal control, i.e., the controller learns on the basis of the best achievable expected trajectory instead of the actually observed one. This implementation detail enables off-policy training, meaning that samples can be considered in any order and from any control policy in order to optimize for $\theta$. The parameter update is then simply performed via stochastic gradient descent (or variations thereof, most famously [17]) with learning rate $\beta$.
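The estimation target and the quadratic cost described above can be sketched as follows. This is a minimal illustration under assumptions: the experience tuple is taken to be ordered as (observation, action, reward, next observation, termination flag), and `q_fn` is a hypothetical stand-in for the ANN that returns one value per action.

```python
def td_target(reward, q_next, done, gamma):
    """Estimation target: reward plus discounted best next action value.

    The future value is nullified when the termination flag is set.
    """
    return reward + (0.0 if done else gamma * max(q_next))


def dqn_loss(batch, q_fn, gamma):
    """Mean squared distance between q(o, a) and its target over a minibatch."""
    total = 0.0
    for o, a, r, o_next, done in batch:
        target = td_target(r, q_fn(o_next), done, gamma)
        total += (q_fn(o)[a] - target) ** 2
    return total / len(batch)


def q_fn(o):
    """Toy action-value function with two actions (placeholder for the ANN)."""
    return [1.0 * o, 2.0 * o]


batch = [
    (1.0, 0, 0.5, 2.0, False),  # regular transition
    (1.0, 1, 1.0, 3.0, True),   # terminal transition: no future value
]
loss = dqn_loss(batch, q_fn, gamma=0.9)
```

Because the target uses the maximum over the next action values rather than the action that was actually taken next, the same loss can be evaluated on samples from any control policy.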

TABLE II REWARD DISTRIBUTION FOR THE DQ-DTC IN CONSIDERATION OF SAFEGUARD ACTIVATION
The utilization of $q_{\theta}$ to estimate its own estimation target is labeled a bootstrapping method. In order to reduce the variance of parameter updates, the estimator $q_{\theta}$ and the target $r + \gamma q_{\theta}$ are usually not updated at the same rate. Instead, a set of less frequently or more slowly updated target parameters $\theta_{\mathrm{target}}$ is used to determine the estimation target [15].
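Both variants of target parameter handling mentioned above can be sketched in a few lines. The function names `hard_update` and `soft_update` as well as the parameters `period` and `tau` are hypothetical; which variant and which values the authors use is not stated in this section.

```python
def hard_update(theta_target, theta, step, period):
    """Less frequent update: copy the weights every `period` learning steps."""
    return list(theta) if step % period == 0 else theta_target


def soft_update(theta_target, theta, tau):
    """Slower update: move the target weights a fraction tau towards theta."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(theta_target, theta)]


theta = [1.0, 2.0]          # current network weights (toy values)
theta_target = [0.0, 2.0]   # lagging target weights
theta_target_soft = soft_update(theta_target, theta, tau=0.1)
theta_target_hard = hard_update(theta_target, theta, step=10, period=5)
```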
According to (6), the action value is determined by the immediate and future rewards. Hence, the definition of a proper reward function is of major importance for the performance potential of the resulting DQ-DTC agent. The considered reward function is defined in Table II and depicted in Fig. 2, for which an in-depth motivation and derivation has been delivered in [9]. Please note that the original DQ-DTC approach only considered the cases A, B, C, D, and E. The newly distinguished cases $\mathrm{E_S}$ and $\mathrm{D_S}$ correspond to further employed safety measures, which are discussed in Section IV.
The original definition of reward from [9] handles all aforementioned operation specifications with exception of the voltage boundary, whose consideration is hardly possible without availability of any model.Therefore, a data-driven prediction model is presented in the following section.It enables the control agent to adhere to the voltage boundary more reliably without any additional requirements concerning plant-specific expert knowledge.
Fig. 3. Overview of available safeguarding mechanisms and classification of the DQ-DTC (derived from [18]).

Finally, to enable the DQN to estimate the future development of the reward according to (6), it must be equipped to perform this prediction solely on the basis of $o_k$ and $a_k$. This means that $o_k$ must incorporate all necessary information about the plant state that renders the control agent capable of such a prediction while simultaneously avoiding redundant features. Moreover, it is a well-established best practice to normalize the input of ANNs to the range $[-1, 1]$, which is easily performed when the motor limitations are known. This observation design is closely related to the original proposal in [9]. The newly featured utilization of the observations $u_{\mathrm{dq},k-2}$, $u_{\mathrm{dq},k-3}$, and $u_{\mathrm{DC}}$ stems from the real-world implementation of the DQ-DTC algorithm and is discussed more in-depth in Section V-C. Please note that the measured torque $T$ is considered within the reward function (Table II) but not within the observation vector $o$, which means that a torque sensor is needed only during training of the DQ-DTC, but not thereafter. This is important because usual drive applications are not monitored using a torque sensor, as it introduces further cost, space, and mass demands as well as risk of failure. Availability of such a device is, therefore, not assumed for the utilization of the trained DQ-DTC.
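The input normalization mentioned above amounts to dividing each measured quantity by its known limit. The following sketch assumes a dictionary-based interface; the feature names and the numerical limits are hypothetical and do not reproduce the actual observation vector of the article.

```python
def normalize_observation(raw, limits):
    """Scale each raw feature by its known limit so that it lies in [-1, 1]."""
    return {name: value / limits[name] for name, value in raw.items()}


# Hypothetical measurements and limits (currents in A, voltage in V).
raw = {"i_d": -40.0, "i_q": 120.0, "u_dc": 350.0}
limits = {"i_d": 240.0, "i_q": 240.0, "u_dc": 400.0}
obs = normalize_observation(raw, limits)
```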

IV. SAFEGUARDING
RL algorithms can only learn through proper exploration of the state and action space, meaning that different actions may be applied in a trial-and-error fashion in all regions of the state space. Within the DQN, this is traditionally ensured through utilization of the $\varepsilon$-greedy policy

$$a = \begin{cases} a^{*} & \text{with probability } 1 - \varepsilon \\ \in_{\mathrm{R}} \mathcal{A} & \text{with probability } \varepsilon \end{cases} \qquad (13)$$

wherein $\mathcal{A}$ denotes the available action space, $\in_{\mathrm{R}}$ indicates a random selection from it, and $\varepsilon \in [0, 1]$ is a configurable parameter of the DQN training. This is traditionally conducted without regard to the limitations of the state space, which has no consequences in simulated environments. However, in this article, the deployment of an online-capable RL control agent on a real-world drive system is targeted, which incorporates safety hazards as well as limitations to experiment time.
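The $\varepsilon$-greedy policy (13) can be sketched as follows. The sketch assumes that the random action is drawn uniformly from the action space; the function name `epsilon_greedy` and the example action values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploring action
    # Exploiting action: index of the largest action value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.7, -0.3, 0.2, 0.0, 0.5, -1.0, 0.3]  # one value per switching state
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
explored_action = epsilon_greedy(q_values, epsilon=1.0, rng=random.Random(0))
```

Note that this plain policy selects exploratory actions without any regard to operating limits, which is exactly the gap the safeguard is meant to close.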
To prevent operation in safety-critical states, this implementation employs a safeguarding routine that makes use of online system identification. This setup allows the safeguard to detect and evade the selection of unsafe actions, which keeps the point of operation within the predefined limitations. Primarily, this measure ensures safe operation of the DQ-DTC during the online training phase. Secondarily, it also accelerates the agent's training as fewer (if any) plant shutdowns with consecutive restarts occur. A classification of this safeguarding procedure is visualized in Fig. 3. The depicted categories have been motivated in [18].

A. Data-Driven Online System Identification
The employed safeguard is based on the data-driven recursive least squares (RLS) implementation for synchronous machines as presented and validated in [19]. The assumed system dynamics have the form (14), which motivates an online regression problem with corresponding regressors, measurements, and parameters. Herein, the entries of the parameter vectors $\chi_{\mathrm{d}}$ and $\chi_{\mathrm{q}}$ correspond to the entries of the matrices $\hat{A}$, $B$, and $\hat{e}$, respectively.
Note that this system model is not linked to the physical parameters of the motor but is implemented in a purely data-driven fashion.Determination of the physical plant system parameters is, therefore, not targeted.Additional feature engineering, i.e., extending the regressor and parameter sets, can be optionally conducted if the aforementioned data-driven model structure is not leading to satisfactory results (e.g., using [20]).
The regression problem can then be tackled using the standard RLS algorithm (18). Here, $P$ is a scaled covariance matrix of the parameter estimation $\mathrm{Cov}(\chi, \chi)$, $I$ is the identity matrix, and $\lambda \in\, ]0, 1[$ denotes the forgetting factor, which is a tuning parameter that defines the priority given to the regressors measured in the more distant past. Please note that the update of $\chi$ needs to be performed for both the d-axis and the q-axis. An initial value for the parameter covariance $P_0$ as well as an initial guess for the parameter vectors $\chi_{\mathrm{d},0}$, $\chi_{\mathrm{q},0}$ needs to be specified to run the algorithm. These initial guesses allow incorporation of expert knowledge.
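A textbook form of the RLS recursion with forgetting factor is sketched below for a single axis. It is not taken from the article: the function name `rls_update`, the three-element regressor, and all numerical values are assumptions chosen for illustration, and the exact formulation of (18) may differ.

```python
import random


def rls_update(chi, P, x, y, lam):
    """One recursive least squares step with forgetting factor lam.

    chi: parameter estimate, P: scaled covariance, x: regressor, y: measurement.
    """
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    error = y - sum(x[i] * chi[i] for i in range(n))  # a priori prediction error
    chi_new = [chi[i] + gain[i] * error for i in range(n)]
    xP = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    P_new = [[(P[i][j] - gain[i] * xP[j]) / lam for j in range(n)] for i in range(n)]
    return chi_new, P_new


# Identify a toy linear relation y = chi_true^T x from noise-free samples.
rng = random.Random(0)
chi_true = [0.5, -2.0, 1.0]
chi = [0.0, 0.0, 0.0]                                   # initial guess
P = [[1000.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    y = sum(c * v for c, v in zip(chi_true, x))
    chi, P = rls_update(chi, P, x, y, lam=0.99)
```

A large initial covariance expresses low confidence in the initial guess, so that the first measurements dominate the estimate. In a drive application, the same update would be run once per axis with the respective measurement.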
FCS schemes, including the presented approach, incorporate sufficient system excitation, which usually allows identification with satisfactory accuracy at all times. With these considerations, the parameters $\hat{A}$, $B$, and $\hat{e}$ of the assumed plant model (14) can be considered as known, which makes it possible to predict the outcome of the different applicable switching states on the plant. In MPC, such an identified prediction model could be utilized to determine the optimal possible switching action in consideration of a limited prediction horizon. Especially FCS-MPC suffers from exponentially increasing computational complexity with rising prediction horizon, and hence, it is typically implemented with only narrow foresight.
Note that model-free predictive controllers (e.g., [19], [21], or [22]) also suffer from growing computational effort. Herein, a predictive current model is identified online using the proposed RLS or a comparably adequate method. Commonly, the online identification is limited to the electrical subsystem's behavior, because industrial drive applications do not incorporate a torque sensor by default, which would be required for identifying the mechanical subsystem. Therefore, the capability of model-free predictive approaches is also usually limited to current control (which is an intermediate step when designing a torque controller), whereas the proposed DQ-DTC is a direct torque controller and, hence, does not require manual design of an underlying current controller.
Contrary to predictive controllers, the DQN $q_{\theta}$ also considers the long-term control consequences by means of the Bellman equation, while the time horizon and the computational effort are independent of each other. The proposed solution in this article combines a computationally cheap one-step prediction to avoid unsafe actions with the benefits of a long-term-oriented control policy enabled by RL. Therefore, the prediction is not utilized in search of the optimal action, but only to exclude switching actions that would result in clearly unsafe plant operation. The detection of such actions is presented in the following.

B. Prevention of Unsafe Actions
As already mentioned in Section III-B, the critical operation limitations for the motor are specified by means of the current boundary, which is the most important safety condition, and the voltage boundary, which ensures controllability and, hence, ensures safety indirectly. Utilizing the parameters identified by the RLS, the system (14) can be used to check adherence to both of these boundaries for the upcoming state.
Concerning the current boundary, this can be done easily by verifying that the predicted current $\hat{i}_{\mathrm{dq},k+1}$ adheres to the nominal current of the plant, which is expressed by condition (20). To also ensure compliance with the voltage boundary (5), it must be examined whether the predicted operating point $\hat{i}_{\mathrm{dq},k+1}$ can be sustained with the available DC-link voltage $u_{\mathrm{DC}}$, resulting in the condition (21). Here, the expression on the left-hand side corresponds to the fundamental (average) voltage amplitude that is necessary to maintain the operating point at $\hat{i}_{\mathrm{dq},k+1}$, i.e., steady-state control operation, while assuming that the DC-link voltage will not change drastically from one sampling instant to the next [23].
Naturally, inspection of these conditions must be carried out for all switching states $a \in \mathcal{A}$ that can lead to different follow-up states $\hat{i}^{a}_{\mathrm{dq},k+1}$. Each switching state that violates any of the constraints may not be selected in the momentary time step, neither in case of an exploiting nor in case of an exploring action. Under these circumstances, the safeguard overwrites the agent's policy entirely to ensure safe plant operation. An algorithmic representation of the outlined safeguarding technique is given in Algorithm 1. Herein, the prediction model $\hat{A}$, $B$, $\hat{e}$ is updated in an online fashion, without the requirement (but certainly with the possibility) to insert an expert-based initial model before the training.
Please note that, in order to allow the DQ-DTC agent to learn from mistakes, the naive action $a$ must be stored in the experience tuple $E$ (10) even though only the safeguarded action $a_{\mathrm{S}}$ is applied. If $a_{\mathrm{S}}$ were included in $E$ instead, $E$ would resemble a state transition wherein the potentially unsafe $a$ is masked, consequently preventing the agent from learning to avoid such behavior.
An appropriate safeguard prevents the plant system from being operated in the regions D and E (cf. Table II and Fig. 2), which avoids emergency system shutdowns, especially in the training phase when the control agent has not yet learned a feasible policy. On the other hand, the agent is unable to learn from failures if such failures do not occur. By means of the safeguard, however, prediction results that forecast unsafe switching are available and can be considered within the reward function, adding a few more cases to the reward distribution as listed in Table II and visualized in Fig. 2. The newly added case distinctions $\mathrm{B_S}$, $\mathrm{D_S}$, and $\mathrm{E_S}$ are rewarded in such a way that triggering the safeguard is always better than actually violating the limitations. Otherwise, the agent could be encouraged to actively bypass the safeguard whenever operated in states where the prediction model (14) has problems to accurately forecast such violations.
In terms of published safeguarding routines, the proposed approach is closely related to so-called postposed shielding [24], which is characterized by perpetual determination of safe alternative action(s) $a_{\mathrm{S}}$ and also ensures safeguarding during the application phase (i.e., even after the training phase is finished). Instead of attempting to learn the necessary shielding behavior from constraint violations, the concept of predictive safety certification [25] is combined with recursive feasibility [26] within the assumed system dynamics (14). For the given motor setup, this is done effectively and free of specific system knowledge by exploiting the discussed voltage limitation as controllability condition (5), (21).

Algorithm 1: Safeguarded DQ-DTC.
for given weights $\theta$ of $q_{\theta}(o, a)$
$k \leftarrow 0$
loop
    obtain $o_k$
    update $\hat{A}_k$, $B_k$, and $\hat{e}_k$ using the RLS (18)
    $\mathcal{A}_{\mathrm{S},k} \leftarrow \{\}$
    for all $a \in \mathcal{A}$ do
        predict $\hat{i}^{a}_{\mathrm{dq},k+1}$ according to (14)
        if (20) and (21) are satisfied then
            add $a$ to the set of safe actions
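The action filtering step of Algorithm 1 can be sketched as follows. The sketch rests on several assumptions that are not confirmed by this section: the prediction model is taken to be affine in current and voltage, only a current magnitude check against the nominal current is implemented (the voltage condition is omitted), and the rule for choosing a replacement action is purely illustrative. All function names and numerical values are hypothetical.

```python
import math


def predict_current(A, B, e, i_dq, u_dq):
    """One-step current prediction, assuming i[k+1] = A i[k] + B u[k] + e."""
    return tuple(
        A[r][0] * i_dq[0] + A[r][1] * i_dq[1]
        + B[r][0] * u_dq[0] + B[r][1] * u_dq[1] + e[r]
        for r in range(2)
    )


def safe_action_set(A, B, e, i_dq, action_voltages, i_n):
    """Collect all actions whose predicted current stays within the nominal current."""
    safe = []
    for a, u_dq in enumerate(action_voltages):
        if math.hypot(*predict_current(A, B, e, i_dq, u_dq)) <= i_n:
            safe.append(a)
    return safe


def safeguard(a_naive, q_values, safe):
    """Pass a safe action through; otherwise pick the best-valued safe action.

    Assumption: illustrative fallback rule, and a non-empty safe set.
    """
    if a_naive in safe:
        return a_naive
    return max(safe, key=lambda a: q_values[a])


# Toy model and a reduced action set of three dq voltage vectors.
A = [[0.9, 0.0], [0.0, 0.9]]
B = [[0.1, 0.0], [0.0, 0.1]]
e = [0.0, 0.0]
action_voltages = [(20.0, 0.0), (-20.0, 0.0), (0.0, 0.0)]
q_values = [5.0, 1.0, 2.0]

safe = safe_action_set(A, B, e, i_dq=(9.0, 0.0), action_voltages=action_voltages, i_n=10.0)
a_naive = max(range(len(q_values)), key=lambda a: q_values[a])
a_safe = safeguard(a_naive, q_values, safe)
experience_action = a_naive  # the naive action is what gets stored in E
```

In this toy case, the agent's preferred action would push the predicted current beyond the nominal value and is therefore replaced, while the stored experience still carries the naive action so that the agent can learn to avoid it.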
Finally, the DQ-DTC agent is able to learn optimized operation behavior by means of measured real-world data while the consequences of safety-critical actions are predicted using the data-driven model.Actual terminations of plant operation are, therefore, obsolete and the training can be continued without any time loss or harm to the system.A schematic overview of the employed control structure is presented in Fig. 1.

V. IMPLEMENTATION
The real-time capable implementation of RL-based drive torque control faces some challenges that are discussed in this section. Primarily, this concerns the setup of an asynchronous edge-computing pipeline for the training, the hardware implementation of the DQ-DTC agent under consideration of real-time constraints, and the handling of parasitic effects that arise from the mechanical subsystem.
The first of these is necessary because a complete implementation of the RL learning algorithm as well as its inference on currently available embedded computing hardware is not yet technically possible in hard real time due to the high update frequency required by the FCS drive control. Therefore, the learning algorithms are outsourced to more powerful computing hardware. This is implemented in the present case by means of edge computing, but in principle should also be feasible by means of cloud computing. This distributed approach turns the drive control into an Internet of Things application. Please note that all subsequently depicted time-series plots have been gathered in real-world test bench experiments, and that software-in-the-loop and hardware-in-the-loop simulations are not presented for clarity (cf. the last step of the RL controller deployment pipeline as proposed in [7]).

A. EdgeRL Pipeline
A major part of this contribution features the setup of an asynchronous edge-computing pipeline that is to be utilized during the training phase of the DQ-DTC agent. A schematic overview of the employed structure is presented in Fig. 4. As depicted, a dSPACE MicroLabBox system is used as rapid control prototyping hardware (RCPH). From there, measured experiences $E_k$ can be read in real time with the use of a corresponding Python interface, allowing time-efficient data capturing at the test bench. The collected data are sent to a workstation computer using a TCP/IP-based communication channel to guarantee causal integrity of the accrued data.
The workstation computer runs the RL training algorithm on the basis of the buffered experiences.This means that only the backward pass / gradient descent optimization of the DQN training needs to be considered here.Therefore, the training of the DQ-DTC is outsourced from the RCPH and only the forward pass / inference of the utilized ANN must be performed with real-time capability on the dSPACE system.
The newly computed DQN weights θ are continuously sent back to the test bench computer to allow their appliance within Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.the operated control loop.To maximize throughput on the workstation end of the pipeline, the unpacking of received sample data into the training memory, the backward pass that utilizes the training memory and the dispatch of the network weights are all separated into different processes.This avoids idle time that would otherwise occur, e.g., because the training would be halted until a message is completely sent or read.Finally, also the TCP/IP communication is separated between two channels, allowing both PCs to send and receive data simultaneously, which enables the asynchrony of the tasks.This is exceptionally important in the proposed setup, because the sampling time T s is much smaller than the time that is necessary for one learning step T l , which means that new samples accrue much faster than network weight updates.The complete communication and training code is available at [11].
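The decoupling of unpacking, training, and weight dispatch can be illustrated with a minimal Python sketch. It is not the code published at [11]: threads and in-memory queues stand in for the separate processes and the two TCP/IP channels, a single scalar stands in for the DQN weights θ, and all function names and numbers are hypothetical.

```python
import queue
import random
import threading
import time


def receiver(sample_q, memory, lock, stop):
    """Unpack received samples into the training memory."""
    while not (stop.is_set() and sample_q.empty()):
        try:
            sample = sample_q.get(timeout=0.01)
        except queue.Empty:
            continue
        with lock:
            memory.append(sample)


def trainer(memory, lock, weight_q, n_steps, batch_size):
    """Stand-in for the backward pass; one scalar plays the role of theta."""
    theta, done = 0.0, 0
    while done < n_steps:
        with lock:
            ready = len(memory) >= batch_size
            batch = random.sample(memory, batch_size) if ready else None
        if batch is None:
            time.sleep(0.001)  # wait until enough samples have arrived
            continue
        theta += 0.1 * (sum(batch) / batch_size - theta)  # placeholder update
        done += 1
        weight_q.put(theta)
    weight_q.put(None)  # sentinel: training finished


def sender(weight_q, sent):
    """Dispatch new weights; 'sending' is modeled as appending to a list."""
    while True:
        theta = weight_q.get()
        if theta is None:
            break
        sent.append(theta)


def run_pipeline(n_samples=200, n_steps=20, batch_size=8):
    sample_q, weight_q = queue.Queue(), queue.Queue()
    memory, sent = [], []
    lock, stop = threading.Lock(), threading.Event()
    workers = [
        threading.Thread(target=receiver, args=(sample_q, memory, lock, stop)),
        threading.Thread(target=trainer,
                         args=(memory, lock, weight_q, n_steps, batch_size)),
        threading.Thread(target=sender, args=(weight_q, sent)),
    ]
    for w in workers:
        w.start()
    for k in range(n_samples):  # stand-in for the test bench data stream
        sample_q.put(float(k % 5))
    stop.set()
    for w in workers:
        w.join()
    return len(memory), len(sent)
```

Because the three tasks only meet at the queues and the shared memory, samples keep arriving while a learning step is in progress, which mirrors the situation where T s is much smaller than T l.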

B. DQN Inference
The employed MicroLabBox system by dSPACE permits inference of the DQN (i.e., function evaluation of qθ (o k , a k )) either by means of its CPU or by utilization of the built-in FPGA. As indicated in [9], sensible ANNs for the given task grow to a size where an implementation of the ANN inference on the CPU can no longer yield a result within the given sampling time T s . Here, it is more reasonable to make use of the FPGA's parallelization potential to allow adherence to the real-time constraint. In order to operate the pipeline depicted in Fig. 4, it is furthermore required that the DQN parameters θ can be updated at runtime. A schematic of a corresponding neuron implementation is depicted in Fig. 5. As can be seen, the same hardware is utilized for all consecutive layers, allowing for a resource-efficient FPGA build. With the given structure, a DQ-DTC can be evaluated with a time demand of

TABLE III PARAMETERIZATION OF THE DQN
wherein l is the number of layers, n h is the number of neurons per hidden layer, τ n is the runtime delay each neuron needs to finish its computation, and T FPGA is the base cycle time of the utilized FPGA. This configuration is sufficiently fast to conform to the sampling time constraint as long as the architectures do not become too large. Due to hardware constraints, reconfiguration of all network parameters θ cannot be performed within one sampling period. Since feasible learning rates, and therefore also the parameter changes per update Δθ, are rather small, and because the DQN gradient descent is rather time intensive, updating the DQN with a period longer than the controller clock period T s raises no further concern. Hence, the parameters are updated for one neuron at a time and at a slower update rate of T nu , resulting in a total network update time of

An overview of the utilized network architecture and its computational real-time demand is provided in Table III.
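The neuron-wise parameter update can be sketched as follows. This is an illustration only, assuming that exactly one neuron's weight vector is rewritten per update tick of duration T nu, so that the total update time under this assumption is the number of neurons times T nu. The toy network dimensions and the value of `T_nu` are hypothetical and do not correspond to Table III.

```python
def neuronwise_update(active, target):
    """Overwrite `active` in place, one neuron per update tick.

    Both networks are lists of layers; each layer is a list of
    per-neuron weight vectors. Yields (layer, neuron) after each tick.
    """
    for l, layer in enumerate(target):
        for n, weights in enumerate(layer):
            active[l][n] = list(weights)
            yield (l, n)


# Hypothetical toy network: 3 neurons with 2 inputs, then 2 neurons with 3 inputs.
active = [[[0.0, 0.0] for _ in range(3)], [[0.0, 0.0, 0.0] for _ in range(2)]]
target = [[[0.1 * n, 0.2] for n in range(3)], [[1.0, 2.0, 3.0] for _ in range(2)]]

T_nu = 1e-3  # assumed update period per neuron in seconds

# The controller keeps running between ticks; the network is a mix of old
# and new neurons until the last tick has been executed.
ticks = sum(1 for _ in neuronwise_update(active, target))
total_update_time = ticks * T_nu
```

During the update the controller evaluates a network that is partly old and partly new, which is tolerable only because the parameter change per update is small, as argued above.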

C. Parasitic Effects During Test Bench Training
The real-world behavior of the described motor setup does, naturally, not fit the ideal assumptions that have been featured in [9]. The drive train's tendency toward rotational vibration and the limited bandwidth of common torque sensors complicate the learning task even further and must be dealt with to enable model-independent learning. Moreover, the DC-link voltage u DC cannot be assumed constant.
The latter issue can be dealt with rather easily. A time-variant u DC can be interpreted as a state of the plant system rather than a (constant) parameter, and its consideration within the observation vector o as of (12) enables the control agent to react to corresponding changes. Since low-bandwidth voltage measurement can be realized rather inexpensively, availability of u DC is the usual case in drive systems, and hence, adding it to o complicates neither the hardware nor the software setup.
Contrary to the original (purely simulative) scenario in [9], the mechanical behavior of the application adds further dynamics to the plant. A free body diagram and equivalent scheme of the drive train are depicted in Fig. 6, strongly indicating oscillating behavior that impacts the torque measurement signal, which is used during controller training. Unlike the DC-link voltage, the internal states of the drive train, i.e., angular velocities and torsional angles of each individual component named in Fig. 6(b), are not available and cannot be determined with reasonable effort, rendering them hidden states. Therefore, a different approach is suggested. To enable the DQ-DTC to correctly estimate qθ , earlier samples of the commanded voltage u dq,k−1 , u dq,k−2 , u dq,k−3 are integrated into the observation vector o k as of (12) as so-called lagging features [27]. Due to the correspondence between applied voltage and switching action, o k is in this way enriched to allow the DQ-DTC to find the mapping between past actions, hidden mechanical states, and corresponding action value qθ . Hence, the direct dependence upon the drive train's state is replaced with available signals and the issues of the mechanical subsystem are handled with very manageable effort. The specific necessity of at least three lagging features has been empirically determined and a larger number did not prove beneficial. In fact, the DQ-DTC agent was unable to learn feasible control behavior without this modification.
As in [9], the consideration of at least the latest voltage command u dq,k−1 is also necessary to allow compensation of the digital control delay of one sampling instant. The commanded switching action a k−1 does not show its effect in measurement o k but rather in o k+1 , because the computation of a k−1 on the basis of the measurement o k−1 is not instantaneously available. It is, therefore, applied starting at time k and has, hence, a direct effect on the action value
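A minimal sketch of how lagging features can be appended to an observation is given below. The class name, the zero initialization of the history, and the ordering (newest voltage first) are assumptions made for illustration; the actual composition of o k follows (12).

```python
from collections import deque


class LaggedObservation:
    """Keep the last n_lags commanded dq voltages and append them to o_k."""

    def __init__(self, n_lags=3):
        # History is initialized with zero voltage (assumption).
        self.history = deque([(0.0, 0.0)] * n_lags, maxlen=n_lags)

    def push(self, u_dq):
        """Store the voltage commanded in the current step; drops the oldest."""
        self.history.appendleft(tuple(u_dq))

    def build(self, measurement):
        """Return the measurement followed by u_dq,k-1, u_dq,k-2, u_dq,k-3."""
        obs = list(measurement)
        for u_d, u_q in self.history:
            obs += [u_d, u_q]
        return obs


lags = LaggedObservation()
lags.push((1.0, 0.0))   # older command
lags.push((0.5, 0.5))   # most recent command
obs = lags.build([0.1, 0.2])
```

With a fixed-length `deque`, the oldest voltage sample is discarded automatically, so the observation dimension stays constant regardless of how long the controller has been running.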

VI. EXPERIMENTAL RESULTS
In order to verify the feasibility of the proposed DQ-DTC approach, an experimental investigation is performed on a PMSM test bench system, which is depicted in Fig. 7. The utilized components are listed in Table IV; nameplate data of the PMSM under test and parameterization of the DQ-DTC are specified in Table V. The validation of the DQ-DTC is conducted in three steps. First, the safeguard is tested concerning its functionality. Second, the training phase for the RL agent is analyzed in terms of physical and numerical stability. Finally, the performance of the DQ-DTC is presented in a series of exemplary test scenarios.
while exploration actions are deactivated.

A. Safeguard Functionality
To ensure safety of components and personnel and, secondarily, to avoid training downtimes, it is of high priority to validate the functionality of the safeguarding procedure that has been proposed in Section IV. This investigation is performed for two scenarios: at low speed, where the induced voltage plays no significant role, and at higher speed, where the voltage boundary is of importance. The tests are performed by presenting arbitrary, random switching actions to the safeguard algorithm, which then should be able to filter out the unsafe actions to keep the motor current trajectory i dq within the current and voltage limitations.
Long-term current trajectories for both scenarios are depicted in Fig. 8. Please note that the test bench includes a load machine as well as open-loop controlled DC-link choppers, whose dynamics are not included in the identified model (14) and, therefore, cannot be considered by the safeguard. Still, the current boundary has visibly been respected with no striking violation. Further, the identified voltage boundary is displayed, and it is adhered to rather consistently. A statistical evaluation of the safeguard's prediction uncertainty is delivered in Table VI, which indicates that the minor violations, which are mainly visible in Fig. 8(b), can be attributed to the prediction uncertainty that remained during usage of the RLS from [19]. This result supports the proposed safeguard architecture, which tolerates a certain prediction error in correspondence to the leeway between i n and i lim .
Footnote 10: Speed and angular velocity correspond via ω me = n me · (2π/60) min/s.
Overall, the functionality of the safeguard can consequently be confirmed. In fact, no violation of the current limit occurred, and hence, no emergency shutdown was necessary for the entirety of investigations in this article.
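The random-action test can be mimicked in simulation with a strongly simplified sketch. It does not reproduce the safeguard of Section IV or the identified model (14): the first-order current model, its coefficients `A` and `B`, the values of `U_DC` and `I_LIM`, the evaluation in stator-fixed coordinates, and the fallback rule are all assumptions chosen for illustration.

```python
import cmath
import math
import random

U_DC = 100.0        # assumed DC-link voltage in V
I_LIM = 10.0        # assumed current limit in A
A, B = 0.98, 0.02   # assumed model: i[k+1] = A*i[k] + B*u[k] (complex current)

# Eight switching states of a two-level inverter: six active voltage
# vectors framed by two zero vectors.
VOLTAGES = ([0j]
            + [2 / 3 * U_DC * cmath.exp(1j * n * math.pi / 3) for n in range(6)]
            + [0j])


def predict(i, action):
    """One-step current prediction for a given switching action."""
    return A * i + B * VOLTAGES[action]


def safeguard(i, proposed):
    """Pass the proposed action if its predicted current stays within I_LIM."""
    if abs(predict(i, proposed)) <= I_LIM:
        return proposed
    # Fallback (assumption): action with the smallest predicted current.
    return min(range(8), key=lambda a: abs(predict(i, a)))


def random_action_test(n_steps=5000, seed=0):
    """Feed random actions through the safeguard and track the peak current."""
    rng = random.Random(seed)
    i, i_max, n_blocked = 0j, 0.0, 0
    for _ in range(n_steps):
        proposed = rng.randrange(8)
        applied = safeguard(i, proposed)
        n_blocked += applied != proposed
        i = predict(i, applied)
        i_max = max(i_max, abs(i))
    return i_max, n_blocked


i_max, n_blocked = random_action_test()
```

In this idealized setting the plant equals the prediction model, so the limit can never be exceeded. On the test bench, the unmodeled dynamics mentioned above introduce prediction errors, which is why a margin between i n and i lim is needed.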

B. Training Phase
After asserting the safeguard functionality, the training phase is investigated in detail. To analyze the numerical convergence characteristic of the training phase, ten separate DQ-DTC agents have been trained independently for 10 min each, i.e., with reinitialization of a new set of random network weights for each individual agent. During the training, the torque references and the operated motor speed that is enforced by the load machine are resampled uniformly at random sampling instants as specified in Table VII. The learning rate β and the ε-greedy parameter ε are linearly decreased from the beginning to completion of the training phase (scheduling).
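Linear scheduling of this kind can be sketched in a few lines. The start and end values below are placeholders and are not the values of Table VII; only the 10 min training duration is taken from the text.

```python
def linear_schedule(start, end, t, t_end):
    """Linearly interpolate from `start` (t = 0) to `end` (t >= t_end)."""
    frac = min(max(t / t_end, 0.0), 1.0)
    return start + frac * (end - start)


T_TRAIN = 600.0  # training duration of 10 min in seconds

# Hypothetical start and end values for the learning rate and epsilon.
beta_mid = linear_schedule(1e-3, 1e-4, t=300.0, t_end=T_TRAIN)
eps_mid = linear_schedule(0.5, 0.0, t=300.0, t_end=T_TRAIN)
```

Clamping the fraction to the interval from 0 to 1 keeps both parameters at their final values if the controller continues to run after the scheduled training time has elapsed.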
Several snapshots from one exemplary training phase are depicted in Fig. 9. As can be seen, the reference torque T * and the speed n me vary over the training time. While the early performance is clearly insufficient due to the mostly untrained control agent, the performance at the end of the training phase is considerably better. Please note that the explorative policy
Over the course of each training, the measured reward is recorded, and a statistical evaluation of the learning behavior over all ten agents is depicted in Fig. 10. As visible, the mean reward of the agent set converges reliably, and the corresponding standard deviation σ r decreases as well. Notably, no negative outliers have been observed during the procedure, and despite the randomness of DQN initialization and training routine, the final training performance is quite similar across agents. The peak control performance is consistently reached within less than 10 min of training, i.e., the proposed RL framework learns as quickly as making a pot of coffee (cf. uploaded supplementary video material [10]).
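The statistical evaluation over several agents amounts to a mean and a standard deviation per time step. A minimal sketch is given below, assuming equally long reward recordings for all agents and the population standard deviation; whether the population or the sample estimator underlies Fig. 10 is not stated in the text.

```python
from statistics import mean, pstdev


def learning_statistics(rewards):
    """Per-step mean and standard deviation across agents.

    rewards[a][k] is the reward of agent a at step k; all agents are
    assumed to provide recordings of the same length.
    """
    columns = list(zip(*rewards))
    mu_r = [mean(col) for col in columns]
    sigma_r = [pstdev(col) for col in columns]
    return mu_r, sigma_r


# Tiny made-up example with two agents and two steps.
mu_r, sigma_r = learning_statistics([[0.0, 1.0], [2.0, 1.0]])
```

A shrinking `sigma_r` over the training time indicates that independently initialized agents end up with similar performance, which is the behavior described for the ten trained agents.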

C. Performance Evaluation
Finally, to validate the DQ-DTC in practice, the performance of an exemplary trained agent is assessed in several test bench experiments, which include the following.
1) A torque reference step at constant speed.
2) A speed ramp from negative to positive velocity at constant torque reference.
3) A torque reference ramp from negative to positive torque at constant speed.
4) A small-signal investigation with several torque reference steps at constant speed.
For these tests, the agent with the highest cumulative reward during training in the previous investigation was selected. Moving-average quantities are determined with a window size of 50 ms and are denoted by an overline (T̄, ī).
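A moving average over a time window can be computed as sketched below. The trailing (causal) window and the handling of the first samples are assumptions, and the sampling time must be supplied by the user, since it determines how many samples fall into the 50 ms window.

```python
def moving_average(x, window_s, t_s):
    """Trailing moving average of `x` over a window of `window_s` seconds.

    t_s is the sampling time in seconds. The first samples are averaged
    over the data available so far (assumption).
    """
    n = max(1, round(window_s / t_s))
    out, acc = [], 0.0
    for k, value in enumerate(x):
        acc += value
        if k >= n:
            acc -= x[k - n]  # drop the sample that left the window
        out.append(acc / min(k + 1, n))
    return out


# Toy example: window of two samples.
averaged = moving_average([1.0, 2.0, 3.0, 4.0], window_s=2.0, t_s=1.0)
```

The running sum avoids recomputing the full window at every step, which matters when the window spans many samples.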

1) Torque Reference Step: In the first experiment, a torque reference step is investigated to evaluate the control loop's reaction to transients. The obtained measurement is depicted in Fig. 11. Besides the measured torque T , the calculated drive torque estimation T̂ is also presented, which does not feature the low-pass behavior and the oscillations of the drive train and is, hence, a more suitable basis for investigating the control behavior precisely. The computation of T̂ is based on comprehensively identified motor characteristics, which is not compatible with the premise of crafting a control scheme without a priori knowledge and is, therefore, only used for the given validation but not within any formulation that is used within the DQ-DTC's setup or training.
Unfortunately, the given drive train features a dominant oscillatory behavior with limited bandwidth (as can be inferred from the time lag between T and T̂ ). Since the torque measurement T is utilized within the reward formulation, the parasitic drive train behavior obviously limits the reachable torque tracking precision of the control loop. Moreover, the current and torque ripple that is inherent to FCS approaches leads to consistent excitation of the oscillation such that a true steady state is never reached.
Despite these complications, which are all independent of the DQ-DTC, fast torque tracking within roughly 5 ms can be observed in Fig. 11, and even the time series of applied actions a features the familiar overlapping staircase form.
Interestingly, a rather large i d was observed during the experiment. It can be speculated that this results in a decreased torque ripple, and hence, in less striking drive train oscillations, which would lead to a higher reward r.
2) Speed Ramp: A speed ramp experiment with constant torque reference is depicted in Fig. 12. Over the course of the acceleration, the torque ripple can be seen to be quite significant, which is usual for FCS control schemes (and partly also attributed to the drive train oscillation). The moving average of the torque measurement T̄ , however, clearly shows that the torque reference is met over the whole considered speed range.
3) Torque Reference Ramp: Fig. 13 presents an experiment wherein the DQ-DTC is tracking a ramping reference torque. This scenario reveals no significant shortcomings. Only a small offset error can be measured when speed hits its upper steady state.
4) Small-Signal Behavior: The small-signal behavior of the DQ-DTC is featured in Fig. 14. The momentary torque measurement is omitted for clarity in this case and only the moving average T̄ is shown. In this plot, the control agent can be observed to react with no visible delay to the changing torque reference T * . Again, some of the commanded operating points exhibit a visible torque offset, which can presumably be attributed to their proximity to the current boundary.
Concerning all of the presented experiments, it is observed that i d ≠ 0. Although such behavior seems counterintuitive when dealing with an SPMSM, several possible explanations can be identified.
1) The information that the given motor is an SPMSM has not been used for setting up the RL controller. Hence, it is not initially clear that i d = 0 should be targeted for maximizing efficiency, but it must be learned during the training phase.
2) The training phase might be chosen too short or might be biased in terms of flux weakening operation (provoking i d < 0).
3) The numerical difference in reward might be negligible when moving i d closer to zero, which is a valid concern when dealing with the limited numerical precision of the FPGA implementation.
4) The selection of i d = 0 is only optimal in terms of steady-state efficiency, which is given a lower priority than time-optimal torque tracking by means of Table II. For maximum reward, it could, therefore, be more promising to maintain i d < 0, because this would allow faster transients and, correspondingly, a tracking behavior of higher bandwidth.
For the given training scenario with consecutive torque reference changes of arbitrary magnitude (cf. Fig. 9), the latter mechanism could plausibly be the dominant reason for the observed behavior. If so, the operation with i d < 0 would be exactly in line with the targeted control performance, but this needs further confirmation in future investigations.

A. Conclusion
The general goal of implementing and validating the DQ-DTC agent in a real-world test has been achieved. Although the initial implementation effort is significant in terms of setting up the EdgeRL online learning pipeline and preparing the reconfigurable DQN on an FPGA, the finally resulting controller design / training expense of 10 min (and prospectively, even less) is quite low. The training process was completely automated and did not require any human intervention.
Although the reached control performance is not yet on par with the effectiveness of established optimal controllers such as MPC, the relatively short period for which RL-based control approaches have been investigated in the domain of three-phase drives promises a lot of scope for improvement and further research potential.

B. Outlook
The presented findings and insights yield a broad base for future research. Some ideas and possible directions for further investigations in RL-based power system control are listed in the following.

1) DQ-DTC Research Potential:
The training behavior of the DQ-DTC featured quite fast and reliable learning. The training speed could be further improved with different types of scheduling plans for, e.g., ε or the learning rate β. In terms of the achievable performance, the reduction of the drive train oscillations can be expected to yield better torque precision. Besides physical manipulation of the corresponding drive setup (i.e., replacing the couplings or the torque sensor), pre- or postprocessing approaches to compensate for these oscillations would be of major interest to avoid hardware cost and manual effort. A more accurate safeguard design could be achieved by exchanging or extending the RLS-based architecture with methods that allow consideration of data-driven system identification in terms of locally different system dynamics, e.g., [28].
Concerning the drive losses, the featured reward function only takes into account the motor losses but not the inverter losses. Effective ways to limit the switching frequency would also be important for utilization of the six-step mode, as it has been discussed in [29]. Also, the presented implementation of the safeguard monitors only the momentary current, whereas a safeguarding of the average current (e.g., over the course of one electric rotation) is equally safe but allows utilizing the PMSM's capability much more effectively.
In terms of hardware effort, the proposed approach is quite costly, mainly due to the utilization of FPGA resources. Prospectively, this will not remain an issue because the calculation performance of embedded hardware increases, while its cost decreases with the progress of industrial development. An industrial application of the DQ-DTC is, therefore, not easily affordable in the short term, but further development and gain in knowledge in data-driven motor control is targeted building upon this contribution.
2) Further Scope: Apart from the DQ-DTC, the presented insights can also be applied to different drive systems. Most evidently, it would be of interest to transfer the DQ-DTC, which is an FCS setup, to the CCS, where modulators are used to set the average voltage. Although RL approaches for CCS environments are readily available, the safeguarding procedure would need a major overhaul for such scenarios. The proposed reward function could presumably be retained.
Further, the given functionality should also be conceivable for externally excited synchronous machines, which feature further degrees of freedom concerning their operating strategy, and nonsynchronous motors such as induction drives. For the latter, the full state vector describing the environment's state is not inherently measurable due to missing rotor flux linkage sensors, necessitating data-driven observation methods that could be implemented in an explicit or implicit fashion. Outside the domain of electric drives, power grid applications as well as power electronic devices could also be handled in an RL-based FCS scheme as presented in this article.

Manuscript received 7 February 2023; revised 22 June 2023; accepted 5 August 2023. Date of publication 9 August 2023; date of current version 22 September 2023. This work was supported in part by the German Research Foundation (DFG) under Grant 459524199, and in part by the EU Horizon 2020 project VEDLIoT under Grant 957197. Recommended for publication by Associate Editor R. Kennel. (Corresponding author: Maximilian Schenke.)

Fig. 2. Graphical depiction of the proposed reward function gradient according to Table II (oriented at [9]); please note that regions B S , D S , and E S correspond to safeguard activation and are not actually entered.

Fig. 5. Schematic of the realization of an online-configurable neuron on the FPGA, oriented at [5]. Components are: multiply accumulator (MAC), activation function (f (•)), output register (REG), weight storage (RAM), layer control unit (LCU), and neuron control unit (NCU); n and l are the respective identifiers of the active neuron and layer, i indicates the entries of the neuron's weight vector θ n,l , and y is an (intermediate) result.

Fig. 6. Depiction of the mechanical drive train. (a) Free body diagram of the drive train. (b) Equivalent scheme of the drive train.

Fig. 8. Long-term current trajectory recordings to validate the safeguard's functionality. (a) Operation at low speed |n me | < 50 min⁻¹; only the current boundary is active. (b) Operation at high speed n me = 700 min⁻¹; current and voltage boundaries are active.

Fig. 9. Control performance in an early (left), intermediate (center), and late (right) phase of an exemplary training.

Fig. 10. Mean learning behavior μ r over ten separate DQ-DTC trainings with marked variation range of one standard deviation σ r , moving average filter applied.

Fig. 11. Measurement of a torque reference step at n me = 500 min⁻¹; T̂ denotes the calculated electromagnetic torque for validation purposes.

Fig. 14. Measurement of the small-signal behavior concerning the torque reference at n me = 500 min⁻¹ with additional moving-average signals.

TABLE I CORRESPONDENCE BETWEEN THE CONTROL ACTION, THE SWITCHING COMMANDS AND APPLIED VOLTAGES

TABLE VI STATISTICAL EVALUATION OF THE CURRENT PREDICTION ERROR e d,q WITH MEAN μ AND STANDARD DEVIATION σ