Meta-Reinforcement-Learning-Based Current Control of Permanent Magnet Synchronous Motor Drives for a Wide Range of Power Classes

Data-driven reinforcement-learning-based controller schemes have much potential to aid the design of model-free control algorithms that can be trained without the necessity of plant-specific parameter knowledge. Unfortunately, the corresponding training phase is a time-consuming and possibly money-consuming process, which needs to be repeated whenever application to a new plant system is requested. To reduce the total training time for a large set of heterogeneous plant systems, this article proposes a meta-reinforcement-learning-based approach that is to be utilized for the control of hundreds of different permanent magnet synchronous motor drives ranging from a few watts to hundreds of kilowatts. The so-called context variables carry the meta-information about the set of considered drive systems. Their estimation using a corresponding artificial neural network as context approximator is a core aspect of this article. The context information allows the reinforcement-learning-based control algorithm to automatically adapt itself to individual motor drives without requiring individual plant training. Since the found context variables can also be interpreted as an implicit system identification result, they allow us to determine irregular plant behavior (e.g., faulty drives) as an added bonus of the proposed meta-reinforcement learning scheme. Empirical results during this proof of concept successfully validate the potential of the proposed approach to drastically reduce the total training time and encourage further research.

direct torque controllers [2] to model-predictive control patterns [3], to just name a few cornerstones.
While these established approaches are highly efficient in scenarios where accurately parameterized drive models are available, they tend to lack performance whenever precise system knowledge is not available. Moreover, those model-based controller designs require a lot of human expert knowledge as well as manual tuning effort [4], both of which are not available in abundance (especially in industry). In the recent past, control setups based on reinforcement learning (RL) have been proposed for electric drives [5], [6], [7] and further power electronic systems, in particular DC-DC converters [8], [9], [10], [11]. They address such situations by providing a data-driven, fully automatable control scheme with the following appealing properties.
1) Knowledge about the plant model is not required, as optimal control actions are learned through the direct interaction of the RL algorithm and the plant system [12].
2) Challenging higher order and parasitic effects, e.g., iron losses, magnetic (cross-)saturation, or the inverter nonlinearity, do not need to be characterized beforehand, as their effects are directly learned within the control policy from real-world measurements [6].
3) Different goals can be incorporated into the learning problem, allowing multiobjective control with only one learning phase [7].
These advantages of data-driven control are opposed by a time-consuming learning phase during which the plant system is unavailable for its intended application, and the computationally demanding learning algorithms must be run on costly computer hardware. It is, therefore, of great interest to reduce the training time by finding a more universal approach to RL-based drive control that, instead of being optimized for only one specific drive setup, is able to adapt itself to a large set of different plant systems as a part of its objective. This is the core problem addressed in the following.

A. Contribution
This article, therefore, proposes a meta-RL (MRL) current control algorithm for permanent magnet synchronous motors (PMSMs), which is equipped to adapt to a given motor system [13], [14]. After the learning phase, this capability enables the MRL algorithm to be operated on previously unseen drives of different parameterization without necessitating new learning effort and, hence, introducing several striking advantages.
1) The learning phase is conducted only once and extends to a wide variety of drive systems. This drastically reduces the total training time for a large set of different drives.
2) Control performance is very robust concerning drive variations.
3) The underlying context variable identification scheme, although only implicit (physical motor parameters are not required and also not identified), gives information about pathological system behavior, which could indicate irregular drive conditions such as faults.
In order to advance toward these benefits, this article presents a proof of concept on the preparation, execution, and validation of the MRL for PMSM control.

II. DRIVE SYSTEM MODEL
In the following, a brief overview of the PMSM drive system model is provided for the sake of completeness. It is used as part of the RL-oriented training and testing open-source software gym-electric-motor (GEM) [15]. However, it should be noted that the considered RL-based control algorithms do not have access to the drive model.
The PMSM is a three-phase electric motor with a large variety of application scenarios, ranging from industrial automation to electric traction [16]. In the domain of drive control, three-phase drives are concisely modeled in field-oriented (dq) coordinates. Utilizing this two-phase representation, the dynamic motor behavior can be described by a system of first-order ordinary differential equations (ODEs), referred to as (1). Herein, the direct and quadrature current components are denoted by $i_d$ and $i_q$, respectively. The corresponding voltages are labeled $u_d$ and $u_q$, and the angular velocity is denoted by $\omega_{me}$. The continuous time $t$ will be omitted in the rest of this article wherever possible to shorten notation. A definition of the physical parameters within these equations is given in Table I, wherein the first five parameters describe the motor's physical properties, while the last three define the operation space. Note that the dynamic system model defined by (1) is a simplification of the real-world application that is known as the PMSM's fundamental wave model.
The rotor of a PMSM contains the namesake permanent magnets, which can be located either on its surface (SPMSM) or within its interior (IPMSM). In terms of electrical properties, $L_d = L_q$ is always satisfied for SPMSMs.
The task under investigation is to control the dq current components by actuating a two-level power inverter (B6-bridge) using pulsewidth modulation. The voltage vector $u^*_{dq}$ commanded by the control algorithm is subjected to the limitations of the physical three-phase system, with $u^*_{dq}$ being the commanded and $u_{dq}$ the applied voltages. The operator $T_{abc,dq}$ corresponds to the subsequent application of the Clarke transformation [17] and the Park transformation [18], whereas $T_{dq,abc}$ denotes the associated inverse.
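For reference, the fundamental wave model of a PMSM in dq coordinates is commonly written as shown below. This is the standard textbook form and is only assumed to correspond to (1). The symbols $R_s$ (stator resistance), $L_d$ and $L_q$ (inductances), $\psi_p$ (permanent magnet flux linkage), and $p$ (pole pair number) are the usual motor parameters and are assumed to match the physical properties of Table I, with $\omega_{me}$ interpreted as the mechanical angular velocity.

```latex
\begin{align}
  \frac{\mathrm{d} i_d}{\mathrm{d} t} &= \frac{1}{L_d}\left(u_d - R_s i_d + p\,\omega_{me} L_q i_q\right), \\
  \frac{\mathrm{d} i_q}{\mathrm{d} t} &= \frac{1}{L_q}\left(u_q - R_s i_q - p\,\omega_{me}\left(L_d i_d + \psi_p\right)\right).
\end{align}
```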

III. RL BASICS
Classical RL settings consist of an RL control algorithm that interacts repeatedly with the control plant (referred to as the environment) at each discrete time step $k$. First, it is assumed that all states are measurable and are, hence, directly visible in the system output. Then, the actor of a trained RL controller determines the action signal $a_k$ based on the momentary environment state $s_k$. The applied action leads to the next state $s_{k+1}$, which conforms to the idea of modeling the environment as a Markov decision process (MDP), defined via the tuple $(S, A, P, R, \gamma)$. Here, $S \subseteq \mathbb{R}^n$ and $A \subseteq \mathbb{R}^m$ denote the state and action spaces, respectively. Moreover, $P$ is the transition probability function for the transfer from $s_k$ to the state $s_{k+1}$ via action $a_k$, which represents the mapping $P: S \times S \times A \to [0, 1]$. Furthermore, the action is rewarded with $r_{k+1}$, defined by a reward function $R: S \times A \to \mathbb{R}$. From a control engineering point of view, the reward represents the intended control objectives. Hence, a sensible reward design can improve the learning speed considerably. The discount factor $\gamma \in [0, 1)$ defines how far-sighted the algorithm chooses its control actions.
In an MDP, the transition from state $s_k$ to state $s_{k+1}$ only depends on the momentary state and action:
$$\Pr(s_{k+1} \mid s_k, a_k) = \Pr(s_{k+1} \mid s_0, \ldots, s_k, a_k). \tag{4}$$
When utilizing RL within a control scenario, the general task is to determine the optimal policy that relates each state to an action that maximizes the expected return $\mathrm{E}\{G_k\}$ (5). Herein, $\mathrm{E}\{G_k\}$ denotes the expected value of the return $g_k$, with capital letters denoting random variables, while lowercase letters denote the respective realizations. The optimal return can be characterized by means of the Bellman optimality equation (6), where $q(s_k, a_k)$ denotes the action value function $q: S \times A \to \mathbb{R}$ that links the action $a_k$ in state $s_k$ to the expected return. To find the best applicable action, a policy function $a_k = \pi(s_k)$ with $\pi: S \to A$ needs to be learned that satisfies
$$\max_{a_k} q(s_k, a_k) = q(s_k, \pi^*(s_k)).$$
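For orientation, the textbook forms of the discounted return and of the Bellman optimality equation, as found, e.g., in [12], are given below in the notation of this section. That these expressions coincide term by term with (5) and (6) is an assumption.

```latex
\begin{align}
  g_k &= \sum_{i=0}^{\infty} \gamma^{i}\, r_{k+i+1}, \\
  q^*(s_k, a_k) &= \mathrm{E}\left\{ R_{k+1} + \gamma \max_{a_{k+1}} q^*(S_{k+1}, a_{k+1}) \,\middle|\, s_k, a_k \right\}.
\end{align}
```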
For environments with continuous sets of states and/or actions, function approximators such as artificial neural networks (ANNs) are required for the policy function and the value estimation. A prominent utilization of ANNs in RL can be found in actor-critic methods, which make use of two separate ANNs to estimate the action values and the policy. This is achieved by employing an actor network $\pi_\phi$ and a critic network $\hat{q}_\theta$ with network parameters $\phi$ and $\theta$. Contemporary algorithms from this class are able to learn in an off-policy fashion, which means that they are also capable of learning from actions that do not fit the momentary policy. This allows us to utilize a replay buffer $B$ in which past samples are stored, enabling the consideration of larger batches of experiences to learn in a data-efficient way. The definition of the Bellman equation (6) suggests the corresponding cost function (8) that is used for optimizing the critic network parameters $\theta$. Assuming that the critic delivers a reliable estimate of the action value $q$, its gradient with respect to the action points in the direction of policy improvement and can, hence, be utilized to optimize the actor parameters $\phi$. In order to keep this introduction to the basics of RL concise, further details are omitted at this point. More in-depth summaries of the RL fundamentals can be found, e.g., in [12] and [19].
In electric drive control, off-policy RL-based algorithms have already been implemented successfully to control currents and torque, respectively, but only for individual drive applications [5], [6], [7]. The transfer to a wide set of different drives was not investigated so far and is, therefore, the central challenge addressed in this contribution.

IV. CONTEXT-BASED MRL
In spite of the beneficial traits of RL control that allow independence from plant-specific model knowledge, a main drawback of such methods is the lack of generalizability concerning a learned policy. If presented to another task from the same problem class (e.g., a plant system that is described by the same ODE but is characterized by different parameters), the learned optimal policy π * would not fit the new task and must, therefore, be retrained. In terms of PMSM control, this means that for each motor with different physical properties, the algorithm's training needs to be conducted again, which may still be tolerable for special applications, but is rather aggravating for large-scale production.
The goal of meta-learning is to systematically speed up the training of machine learning (ML) algorithms when presented with new but similar tasks [13]. Concerning RL, this means that the algorithm would be able to adapt to different environments. One way to model this scenario is based on picturing each different motor as a partially observable MDP (POMDP). In a POMDP, the state $s_k$ is not completely measurable and is, hence, not entirely available to the RL algorithm. It is described by the tuple $(S, A, P, R, O, \Omega, \gamma)$, wherein $\Omega$ is the set of observations available to the algorithm and $O$ is the observation function, which is a probability distribution over possible observations given an action and the resulting state.
Different algorithms in MRL have emerged to handle such scenarios. However, most of them necessitate on-policy training and may, therefore, lack sample efficiency [20], [21]. A class of algorithms that has shown promising and sample-efficient results on simulated and real tasks is that of context-based off-policy algorithms [14]. Here, a further ANN is introduced, which generates additional information about the momentary environment. These so-called context variables $z$ are computed by means of the context network $\hat{c}_\xi$ with parameters $\xi$ from a commissioning buffer $B_c^j$, which contains several state transitions of the form $(s_k, a_k, s_{k+1})$ that have been recorded on the $j$th environment.
Furthermore, feature engineering is usually employed to present an enriched observation vector $o = f(s)$ to the control algorithm in order to facilitate the training process. Here, $f(\cdot)$ is the feature function that extracts important information from the state $s$, which is highly problem dependent.
In the corresponding framework, the action value network $\hat{q}_\theta$ is augmented to accept the context $z$ as an additional input. The Bellman optimality equation (6) is, therefore, extended by the context $z$ as an additional argument of the action value. The cost function (8) is altered accordingly and can then be used to also optimize the context network $\hat{c}_\xi$.
Important contributions to this class of MRL algorithms are probabilistic embeddings for actor-critic RL (PEARL) [22] and meta-Q-learning (MQL) [23], which differ concerning their design of context variables. PEARL uses random nonconsecutive transitions from the recent additions to the replay buffer to generate a probabilistic context. MQL uses a recurrent ANN to generate a context from the last buffered state transitions. This contribution has taken inspiration from both of these approaches; a schematic of the targeted MRL structure is depicted in Fig. 1.
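One common way to write such a context-augmented critic objective is sketched below. It is a simplified form under stated assumptions: a deterministic target policy is used, the twin critics and target policy smoothing of TD3 are omitted, and the target network parameters $\theta^-$ and $\phi^-$ are notation introduced only for this sketch. It is not claimed to be identical to the context-based versions of (6) and (8).

```latex
\begin{align}
  z &= \hat{c}_{\xi}\left(B_c^j\right), \\
  y_k &= r_{k+1} + \gamma\, \hat{q}_{\theta^-}\!\left(o_{k+1},\, \pi_{\phi^-}(o_{k+1}, z),\, z\right), \\
  J(\theta, \xi) &= \mathrm{E}_{B}\left\{ \left( y_k - \hat{q}_{\theta}(o_k, a_k, z) \right)^2 \right\}.
\end{align}
```

Since $z$ depends on $\xi$, minimizing $J$ with respect to both $\theta$ and $\xi$ propagates the critic loss into the context network.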
This contribution utilizes twin delayed deep deterministic policy gradient (TD3) [21] as its base RL algorithm, which is a state-of-the-art method for RL problems with continuous state and action spaces. Less recent RL algorithms, such as deep deterministic policy gradient [24] or advantage actor critic [25], are also suited for the current control task [5], [6], but are not discussed within the scope of this article in order to focus on the extension to meta-learning.
In addition to the usual actor and critic networks that are standard for the TD3 structure, a further context ANN is introduced. This context network processes the plant specific observation transitions, which will be investigated in the following section. The RL controller operates on a given plant with random initialization and acts according to the momentary policy π φ , superimposed by a Gaussian exploration noise β e . The resulting transitions and rewards are then saved to the replay buffer B. This rollout procedure is described in Algorithm 1.
After finishing the rollout, the MRL algorithm's networks are updated via standard gradient descent. For this, the update routine of TD3 is applied under the consideration of the added context network. The context network is updated with the same loss as the critic to ensure that the context variables allow a sensible distinction of plant systems with regard to their action values. The update routine is described in Algorithm 2 and makes use of the well-established target networks [26] in order to stabilize the training process.
Algorithm 1: Rollout.
Require: Policy $\pi_\phi$, motor $m_j$, buffer $B$, number of steps $N_{steps}$, number of steps per episode $N_{episode}$, optional: context $z_j$
1: Determine [...]
2-8: [...]
9: Execute $a_k$ on $m_j$
10: Observe $r_{k+1}$ and $o_{k+1}$
11: Store [...]
12: $k \leftarrow k + 1$
13: end while
14: $n \leftarrow n + k$
15: end while
16: return Buffer $B$
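The rollout procedure can be illustrated with a minimal Python sketch. The class `ToyMotor`, its first-order dynamics, its placeholder reward and limit check, and all function and argument names are hypothetical stand-ins chosen for this example; they are neither the GEM interface nor the authors' implementation. Only the overall loop structure (episodes of limited length, exploration noise on the policy output, storing transitions, counting steps) follows the description above.

```python
import random


class ToyMotor:
    """Hypothetical scalar stand-in for a motor environment (not the GEM API)."""

    def __init__(self, gain=0.5, seed=0):
        self.gain = gain
        self.rng = random.Random(seed)
        self.state = 0.0

    def reset(self):
        # Random initialization of the plant state.
        self.state = self.rng.uniform(-0.5, 0.5)
        return self.state

    def step(self, action):
        # First-order lag toward the applied action (placeholder dynamics).
        self.state += self.gain * (action - self.state)
        reward = -abs(self.state)            # placeholder reward
        violated = abs(self.state) > 1.0     # placeholder for a limit check
        return self.state, reward, violated


def rollout(policy, motor, buffer, n_steps, n_episode, context=None,
            noise_std=0.1, rng=None):
    """Collect at least n_steps transitions in episodes of at most n_episode steps."""
    rng = rng or random.Random(0)
    n = 0
    while n < n_steps:
        obs = motor.reset()
        k = 0
        violated = False
        while k < n_episode and not violated:
            # Policy output superimposed by Gaussian exploration noise.
            action = policy(obs, context) + rng.gauss(0.0, noise_std)
            action = max(-1.0, min(1.0, action))
            next_obs, reward, violated = motor.step(action)
            buffer.append((obs, action, reward, next_obs))
            obs = next_obs
            k += 1
        n += k
    return buffer


replay_buffer = rollout(lambda obs, z: -obs, ToyMotor(), [], n_steps=25, n_episode=10)
print(len(replay_buffer))
```

Because the step counter is only compared with `n_steps` between episodes, the last episode is always completed, so the buffer may contain slightly more than `n_steps` transitions.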

V. CONSIDERED MOTOR DRIVE SET
This section presents the preliminary considerations and software tools to prepare a representative and robust dataset of heterogeneous PMSM drives for the MRL training. This is necessary because both the selection of the motors and the design of the context depend on a balanced set of drive systems, whereas a badly prepared training set could decrease the MRL algorithm's ability to generalize.

A. Overview
To train and validate an MRL algorithm for the control of PMSMs, different PMSM parameterizations need to be available. Since, to the best knowledge of the authors, no comprehensive database was publicly available, parameter sets had to be collected first. For this, only sources with complete parameter sets were considered. The necessary parameters comprise not only the electrical characteristics of the PMSM but also its operating limitations, as described in Table I. Different sources were used, such as publicly available catalogs for industrial motors as well as scientific papers. Through this, a total of 566 different motor parameter sets were collected from power classes ranging from a few watts to hundreds of kilowatts. As a supplementary part of this contribution, the corresponding motor database has been made available [27].
Algorithm 2: Update.

B. Normalization and Preprocessing
For the selection of the drives for the training and test sets, the first intuition is to choose a balanced distribution of physical parameters. However, this might not yield the best distribution over the actual dynamic behavior of the motors. Instead, a different representation of a motor can be derived, which describes its dynamic behavior in a more target-oriented way. Transforming the model presented in (1) to a representation with normalized variables $\omega_{me}, u_d, u_q, i_d, i_q \in [-1, 1]$, the ODEs can be reformulated to yield a different representation of the parameter space. With $s_\omega = \omega_{me,max}$, $s_i = 1.5\, I_n$, and $s_u = U_{DC}/2$ as scaling factors of $\omega$, $i$, and $u$, respectively, the equations from (1) can be rewritten in terms of the ODE coefficients $p_1, \ldots, p_7$ listed in Table II [cf. (13) and (14)]. Note that for SPMSMs, $p_1 = p_4$, $p_2 = -p_5$, and $p_3 = p_7$ hold.
To have a balanced distribution of the dynamic motor behavior within training and test, the ODE coefficients from Table II were used for the preparation of the corresponding datasets. To reduce the number of SPMSMs, a downsampling strategy was applied: the first batch of 35 SPMSMs was drawn with the routine described in Algorithm 3, which targets an extensive coverage of the parameter space by maximizing the sum of distances between the points. This selection routine tends to select parameter sets on the edges of the parameter space. Therefore, an additional 40 SPMSMs were selected randomly in order to also get an even coverage of the center, adding up to a utilized set of 75 parameter sets.
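The normalization can be made concrete with a short Python sketch. It assumes the standard fundamental wave model of the PMSM with $\omega_{me}$ as mechanical angular velocity. The assignment of the indices $p_1, \ldots, p_7$ to the individual terms is an assumption as well; it was chosen such that the SPMSM relations stated above are reproduced and is not taken from Table II. The function name and argument names are hypothetical.

```python
def normalized_coefficients(r_s, l_d, l_q, psi_p, p, i_n, u_dc, omega_me_max):
    """Coefficients of the normalized dq current ODEs (assumed index assignment).

    Assumed normalized model:
        d i_d / dt = p1 * u_d + p2 * omega * i_q + p3 * i_d
        d i_q / dt = p4 * u_q + p5 * omega * i_d + p6 * omega + p7 * i_q
    with all variables scaled to [-1, 1].
    """
    s_omega = omega_me_max      # scaling factor of the angular velocity
    s_i = 1.5 * i_n             # scaling factor of the currents
    s_u = u_dc / 2.0            # scaling factor of the voltages

    p1 = s_u / (l_d * s_i)
    p2 = p * s_omega * l_q / l_d
    p3 = -r_s / l_d
    p4 = s_u / (l_q * s_i)
    p5 = -p * s_omega * l_d / l_q
    p6 = -p * s_omega * psi_p / (l_q * s_i)
    p7 = -r_s / l_q
    return (p1, p2, p3, p4, p5, p6, p7)


# Example with made-up parameter values for an SPMSM (l_d == l_q).
coeffs = normalized_coefficients(r_s=1.0, l_d=2e-3, l_q=2e-3, psi_p=0.1,
                                 p=4, i_n=5.0, u_dc=400.0, omega_me_max=300.0)
print(coeffs)
```

Under this assumed assignment, $p_4 = p_1 p_7 / p_3$ and $p_5 = -p_2 (p_7/p_3)^2$ hold for any PMSM, which is in line with the algebraic dependence between several ODE coefficients that is exploited for the IPMSM data.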
To include edge cases of dynamic motor behavior in the training set, the convex hull over the nonequal (i.e., nonredundant) coefficients of these 75 SPMSMs is determined. The SPMSM parameter sets that are identified to be vertices of this hull are added to the training set. Further parameter sets were selected randomly using a uniform distribution until the training set contained 50 different parameter vectors. The remaining 25 parameter sets are attributed to the test set.
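The greedy space-coverage downsampling used for the first batch of SPMSMs (Algorithm 3) can be sketched as follows. The sketch interprets the routine such that the sum of distances to the already selected points is evaluated separately for each candidate and the best candidate is moved in every iteration. The Euclidean distance and all names are assumptions made for this example.

```python
import math
import random


def downsample_for_coverage(points, n_goal, seed=0):
    """Greedily pick n_goal points that maximize the summed distance to the selection."""
    rng = random.Random(seed)
    remaining = list(points)
    # Start with one randomly drawn data point.
    selected = [remaining.pop(rng.randrange(len(remaining)))]
    while len(selected) < n_goal and remaining:
        best_idx, best_sum = 0, -1.0
        for idx, candidate in enumerate(remaining):
            # Sum of distances from this candidate to all selected points.
            dist_sum = sum(math.dist(candidate, s) for s in selected)
            if dist_sum > best_sum:
                best_idx, best_sum = idx, dist_sum
        selected.append(remaining.pop(best_idx))
    return selected


# Example: eleven equidistant points on a line, three of which are kept.
line = [(float(i),) for i in range(11)]
print(downsample_for_coverage(line, 3))
```

As noted in the text, such a criterion favors points on the edges of the parameter space, which is why it is combined with additional randomly drawn parameter sets.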
Owing to the low number of publicly accessible IPMSM parameter sets, additional feasible sets had to be generated synthetically from the acquired ones. These data were created using the synthetic minority oversampling technique (SMOTE), which is a method originally designed to deal with classification tasks [28]. There, unbalanced datasets are problematic because a classifier can achieve a high score on training data even without the ability to correctly classify the minority class if that class is strongly underrepresented. SMOTE generates synthetic samples on the basis of the available data by placing new samples in between close data points of a given class, which is specified in Algorithm 4. Owing to the algebraic dependence between several ODE coefficients, it is not necessary to generate new values for the complete set of coefficients. Instead, it can be exploited that $p_4$ and $p_5$ follow from the remaining coefficients.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

Therefore, SMOTE needs to be applied only to the coefficients $p_1$, $p_2$, $p_3$, $p_6$, and $p_7$. For each synthetically derived set of coefficients, it was verified that it lies within the convex hull of the 14 original parameter sets. To derive the corresponding physical parameters needed for simulation, $R_s = 1\,\Omega$, $p = 4$, and $I_n = 5$ A were assumed as fixed values. For the IPMSMs, the original 14 parameter vectors are directly utilized as training data, while the rest was selected randomly using a uniform distribution. The breakdown of training and testing sets is concisely listed in Table III. Fig. 2 features the distribution of $p_1$ against $p_4$ for training and test sets in an exemplary fashion. Further visualizations as well as all 150 normalized PMSM parameter sets are available at [27] to supplement this article.

Algorithm 3: Downsampling for Space Coverage.
Require: Data points $X_{original}$ from majority class, goal number of data points $N_{goal}$
1: Initialize empty data buffer $X_{new}$
2: Sample random data point $x$ from $X_{original}$
3: Remove $x$ from $X_{original}$ and put it into $X_{new}$
4: $\Sigma_{old} \leftarrow 0$
5: while $|X_{new}| < N_{goal}$ do
6:   $\Sigma_{new} \leftarrow 0$
7:   for each $x_1 \in X_{original}$ do
8:     for each $x_2 \in X_{new}$ do
9:       $\Sigma_{new} \leftarrow \Sigma_{new} + \lVert x_1 - x_2 \rVert$
10:     end for
11:     if $\Sigma_{new} > \Sigma_{old}$ then
12:       $x_{next} \leftarrow x_1$
13:       $\Sigma_{old} \leftarrow \Sigma_{new}$
14:     end if
15:   end for
16:   Remove $x_{next}$ from $X_{original}$ and put it into $X_{new}$
17: end while

Algorithm 4: SMOTE.
Require: Set of parameter vectors $X_{original}$ from underrepresented class, targeted number of data points $N_{goal}$, number of nearest neighbors $K$
1: Find $K$ nearest neighbors for each element $p \in X_{original}$
2: Initialize data buffer $X_{goal} \leftarrow X_{original}$
3: while $|X_{goal}| < N_{goal}$ do
4:   Sample random $p_1 \in X_{original}$
5:   Sample random $p_2$ from $p_1$'s $K$ nearest neighbors
6:   Generate $p_{new} = \alpha p_1 + (1 - \alpha) p_2$ with $\alpha \sim \mathcal{U}(0, 1)$
7:   $X_{goal} \leftarrow \{X_{goal}, p_{new}\}$
8: end while
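The SMOTE procedure of Algorithm 4 translates almost directly into Python. The following sketch uses only the standard library; the Euclidean distance for the neighbor search and all names are assumptions made for this example. The subsequent convex hull check mentioned in the text is not part of the sketch.

```python
import math
import random


def smote(originals, n_goal, k_neighbors, seed=0):
    """Oversample by interpolating between points and their nearest neighbors."""
    rng = random.Random(seed)
    # Step 1: find the K nearest neighbors of every original point.
    neighbors = []
    for i, point in enumerate(originals):
        others = [q for j, q in enumerate(originals) if j != i]
        others.sort(key=lambda q: math.dist(point, q))
        neighbors.append(others[:k_neighbors])
    # Step 2: start from the original data and add interpolated samples.
    result = list(originals)
    while len(result) < n_goal:
        i = rng.randrange(len(originals))
        p_1 = originals[i]
        p_2 = rng.choice(neighbors[i])
        alpha = rng.random()
        result.append(tuple(alpha * a + (1.0 - alpha) * b
                            for a, b in zip(p_1, p_2)))
    return result


# Example: four corner points of the unit square, oversampled to ten points.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
samples = smote(corners, n_goal=10, k_neighbors=2)
print(len(samples))
```

Every synthetic sample is a convex combination of two original points and, therefore, lies on the line segment between them.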

C. Context Design
The context design of this work differs from the ones suggested by PEARL and MQL: because the ODE coefficients $p_1, \ldots, p_7$ can be assumed to be constant for each PMSM, learning static context variables is sufficient to characterize individual environments. To yield a static context, the input of the context network $\hat{c}_\xi$ must be kept static as well for each given environment. Therefore, the transitions used are not sampled from the regular replay buffer $B$ but from a separate prefilled commissioning buffer $B_c^j$ for each motor $m_j$. This buffer contains observed transitions $e_j = (o_{c,k}, a_{c,k}, o_{c,k+1})$.
To ensure a balanced specification of each motor's dynamic behavior within these buffers $B_c^j$, coverage of the entire observation-action space is necessary. Since only a finite set of sampled transitions is computationally feasible, the aim is to cover the observation-action space as well as possible with a limited number of observation transitions. For this, the density-estimation-based state-space coverage acceleration (DESSCA) is used [29]. DESSCA utilizes kernel density estimation to evaluate the coverage of the observation space and suggests a new sample to minimize the difference to a reference coverage density. Here, the reference coverage is a uniform distribution across the possible values of the observations and actions. Algorithm 5 outlines how the buffer $B_c$ is filled for each motor parameter set $j$. The exemplary distribution featured in Fig. 3 showcases that DESSCA is able to determine a balanced coverage of the observation space. The MRL setup as presented in Fig. 1 shows how these commissioning buffers $B_c^j$ are used to generate the context $z_j$ for a given motor $m_j$. This enables the RL algorithm to adapt its control behavior according to the motor.
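The idea of density-based coverage can be illustrated with a strongly simplified sketch. It is not the DESSCA implementation or its interface: instead of optimizing the next sample, it draws random candidates from the normalized space $[-1, 1]^d$ and keeps the one at which an unnormalized Gaussian kernel density of the already collected samples is lowest, which for a uniform reference corresponds to the most underrepresented region among the candidates. Bandwidth, candidate count, and all names are assumptions.

```python
import math
import random


def kernel_density(point, samples, bandwidth):
    """Unnormalized Gaussian kernel density estimate at the given point."""
    if not samples:
        return 0.0
    total = 0.0
    for s in samples:
        total += math.exp(-math.dist(point, s) ** 2 / (2.0 * bandwidth ** 2))
    return total / len(samples)


def fill_uniformly(n_samples, dim, n_candidates=200, bandwidth=0.3, seed=0):
    """Pick points in [-1, 1]^dim one by one, always in the sparsest region found."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        candidates = [tuple(rng.uniform(-1.0, 1.0) for _ in range(dim))
                      for _ in range(n_candidates)]
        # Keep the candidate with the lowest density of existing samples.
        best = min(candidates, key=lambda c: kernel_density(c, samples, bandwidth))
        samples.append(best)
    return samples


# Example: 40 points in a two-dimensional normalized space.
points = fill_uniformly(n_samples=40, dim=2)
print(len(points))
```

In a commissioning setting, each selected point would define an initial observation and an action for which one transition is recorded on the respective motor.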

VI. EMPIRICAL INVESTIGATION
In the following, the training and testing of the MRL algorithm is described in detail, and the obtained results are discussed.

A. Training Routine
During training, the MRL algorithm performs a rollout on a specific motor and stores its observations in the replay buffer $B$. The observation $o$ of the PMSM contains the feature $(i_d^2 + i_q^2)$, which helps the RL-based controllers to recognize when the currents approach the current limit according to (17). The references $i_d^*$ and $i_q^*$ are the targeted currents for the next sampling step. They are generated using a stochastic Wiener process whose increments are sampled from a normal distribution $\mathcal{N}(0, \sigma)$ with zero mean and variance $\sigma$, with $\sigma_{d,k}$ and $\sigma_{q,k}$ changing each episode. The generated references are also subjected to the current constraint. In addition, the condition $i_d^* < 0$ A is respected. By doing so, the current reference values are randomly sampled from the entire feasible dq current half-plane. The reward $r_{k+1}$ depends on the reference of the last observation and the current of the present observation. The reward function takes values in the interval $[-(1 - \gamma), (1 - \gamma)]$, which ensures that the action values always lie between $-1$ and $1$ for numerical reasons [7]. During training, a motor parameter set from the training set is drawn. This parameter set is used for the simulation of the motor within the GEM software toolbox [15]. As function-approximation-based RL comes with the risk of training divergence, learning checkpoints are created to allow access to each RL controller's parameterization at its historical performance peak in hindsight [30], [31].
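A reference generator of this kind can be sketched as a constrained random walk in normalized currents. How the constraints are enforced is not specified above, so the sketch makes assumptions: the d-axis reference is clipped at zero and the reference vector is scaled back onto the current limit circle whenever it leaves it. Note that Python's `random.gauss` expects a standard deviation, so `sigma_d` and `sigma_q` are standard deviations here. All names are hypothetical.

```python
import math
import random


def wiener_references(n_steps, sigma_d, sigma_q, i_max=1.0, seed=0):
    """Random-walk current references restricted to the half disc i_d <= 0, |i| <= i_max."""
    rng = random.Random(seed)
    # Draw a feasible starting point by rejection sampling.
    while True:
        i_d = rng.uniform(-i_max, 0.0)
        i_q = rng.uniform(-i_max, i_max)
        if math.hypot(i_d, i_q) <= i_max:
            break
    references = []
    for _ in range(n_steps):
        # Wiener process: accumulate independent Gaussian increments.
        i_d += rng.gauss(0.0, sigma_d)
        i_q += rng.gauss(0.0, sigma_q)
        # Assumed constraint handling: clip the d-axis, project onto the limit.
        i_d = min(i_d, 0.0)
        norm = math.hypot(i_d, i_q)
        if norm > i_max:
            i_d, i_q = i_d * i_max / norm, i_q * i_max / norm
        references.append((i_d, i_q))
    return references


refs = wiener_references(n_steps=1000, sigma_d=0.05, sigma_q=0.05)
print(refs[:3])
```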
The hyperparameters of the TD3 algorithm, of the MRL setup, and of the training routine in this contribution are depicted in Table IV. A sampling time of T s = 100 µs was chosen. The number of context variables was set to 8, leaving enough degrees of freedom to allow, e.g., identification of the physical parameters or alternatively of the ODE coefficients without bloating the context vector unnecessarily.

B. Test Routine
After completing the training phase, the RL algorithms' performance has to be evaluated. For this, a testing routine has been developed to validate the control performance on a representative set of situations. The test routine is executed for each RL algorithm variant on each available motor. The given motor is initialized with respect to the currents and the speed. The RL controller then needs to handle its control task by following the given reference. To ensure a balanced coverage of the state space, DESSCA is once again employed to generate the corresponding initial states $(i_{d,0}, i_{q,0}, \omega_{me,0}, i^*_{d,0}, i^*_{q,0})$, wherein the currents are again subjected to the current limit and $i_d^* < 0$ holds. The pseudocode of the testing routine is presented in Algorithm 6. Here, a special reference generator was used in which a drift back to the initial $i^*_0$ is added to the Wiener process. The stiffness $\zeta$ tunes the strength of this drift. This approach is meant to provoke oscillation around a reference point. Table V lists the hyperparameters chosen for the test routine. Episodes have been configured to be rather short, such that the RL algorithm's transient control behavior is weighted roughly equally to its steady-state behavior.
TABLE IV: HYPERPARAMETERS AND THEIR VALUES IN THIS ARTICLE
TABLE V: TEST ROUTINE HYPERPARAMETERS AND THEIR VALUES

C. Evaluation of the Control Performance
The training checkpoints allow us to evaluate the peak control quality of the MRL approach in hindsight. In this article, the decision on the peak performance is based on the training rewards: the rewards were averaged using a moving average filter with a window length of $w = 10^5$ and a shift of $s = 10^5$. The checkpointed control configuration at the reward maximum was then selected for evaluation. As a comparison to the MRL approach, a TD3 algorithm without context variables has been trained on all training motors. This experiment is labeled $\mathrm{RL}_{\mathrm{AM}}$ (AM: "all motors"). Moreover, individual TD3 controllers were trained for each motor separately, labeled as $\mathrm{RL}_{\mathrm{SM}}$ (SM: "single motor"). The corresponding hyperparameters were also configured as stated in Table IV, and the controllers have been trained using the rollout and update loop. Each $\mathrm{RL}_{\mathrm{SM}}$ controller's training was executed only once. For MRL and $\mathrm{RL}_{\mathrm{AM}}$, ten trainings were conducted. From these, the RL controller configuration with the highest peak in its learning curve was chosen.
Algorithm 6: Test Routine.
Require: Policy $\pi$, motor $m_j$, initial states sampled with DESSCA $B_{DESSCA}$, number of steps per episode $N_{episode}$, optional: context $z$
1: $r_{sum} \leftarrow 0$
2: for each $b \in B_{DESSCA}$ do
3:   $k \leftarrow 0$
4:   Initialize $m_j$ with $i_{d,0}, i_{q,0}, \omega_{me,0}, i^*_{q,0}, i^*_{d,0}$ from $b$
5:   Receive initial observation $o_0$
6:   while $k < N_{episode}$ and (17) holds do
7:     Receive action from policy: $a_k = \pi(o_k, z_j)$
8:     Execute $a_k$ on $m_j$
9:     Observe $r_{k+1}$ and $o_{k+1}$
10:     $r_{sum} \leftarrow r_{sum} + r_{k+1}$
11:     $k \leftarrow k + 1$
12:   end while
13: end for
14: return $r_{sum} \cdot (N_{episode} \cdot |B_{DESSCA}|)^{-1}$
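The checkpoint selection based on averaged training rewards can be sketched as follows. With a shift equal to the window length, the moving average reduces to the means of consecutive, nonoverlapping blocks. That each block corresponds to exactly one stored checkpoint is an assumption of this example, as are all names.

```python
def windowed_means(rewards, window, shift):
    """Mean reward of each window of the given length, advanced by the given shift."""
    means = []
    start = 0
    while start + window <= len(rewards):
        means.append(sum(rewards[start:start + window]) / window)
        start += shift
    return means


def best_checkpoint(rewards, window, shift):
    """Index and value of the window with the highest mean training reward."""
    means = windowed_means(rewards, window, shift)
    best = max(range(len(means)), key=means.__getitem__)
    return best, means[best]


# Small example with window = shift = 2 instead of 10**5.
training_rewards = [0.0, 0.0, 1.0, 1.0, 0.5, 0.5]
print(best_checkpoint(training_rewards, window=2, shift=2))  # (1, 1.0)
```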
Furthermore, Figs. 4 and 5 show exemplary test episodes that were observed on motors of different power classes. The conducted episodes feature several steps of the reference currents, which relate to corresponding changes of the drive torque. Hence, fast dynamic behavior is an essential characteristic of satisfying control performance, and it can be seen that the $\mathrm{RL}_{\mathrm{SM}}$ and the MRL controllers react similarly fast, whereas the $\mathrm{RL}_{\mathrm{AM}}$ reacts slowly and, concerning $i_q$, even moves away from the reference. Fig. 6 features the behavior of a drive during a speed ramp from negative to positive velocities, which verifies the capability of the $\mathrm{RL}_{\mathrm{SM}}$ and MRL controllers to deal with changing speeds. Both are able to follow the current reference as long as the available voltage allows it, which is (for the presented motor) not the case for speeds close to the maximum speed $\pm\omega_{me,max}$. Fig. 7 shows another excerpt from a test episode that demonstrates the adaptive capabilities of the MRL approach. This motor has a rather strong input sensitivity (comparably high $p_1$, $p_4$) [cf. (13) and (14)], and a low-performing controller may easily violate the current limitation. It has to be noted that motors with the given characteristics were rather underrepresented in the training dataset for the MRL and $\mathrm{RL}_{\mathrm{AM}}$ agents. Hence, the $\mathrm{RL}_{\mathrm{AM}}$ agent's lack of performance and unstable behavior was to be expected. Yet, the MRL agent is able to stabilize the plant system and, although a steady-state error remains, the currents visibly move toward their references.
These experiments highlight the potential of the MRL approach, as it is able to stabilize the plant system even without being comprehensively optimized. Fig. 7 also confirms the findings listed in Table VI, with the MRL's performance being much better than that of the $\mathrm{RL}_{\mathrm{AM}}$ but slightly worse than that of the $\mathrm{RL}_{\mathrm{SM}}$. However, the reference tracking behavior of the MRL controller looks highly promising for a first proof of concept, and an improvement can be expected if optimal hyperparameters for the MRL can be found. Nonetheless, Fig. 7 also highlights the limitations of the MRL's setup. The training routine considers all available motors with identical probability. Therefore, the MRL's control quality is more adapted to motor classes that are overrepresented in the training set. Further extensions are needed to avoid the corresponding overfitting, e.g., by including more sophisticated exploration schemes for selecting a motor environment during training or by an improved minibatch sampling routine. In addition, expanding the training dataset with more motors that have uncommon characteristics would level out the distribution of featured motor dynamics, which should improve the final performance of the MRL agent.
Fig. 8 shows the average training rewards of the last $10^6$ steps and their standard deviation for the MRL and the $\mathrm{RL}_{\mathrm{AM}}$ algorithms. The training rewards seem to be consistently better for the MRL approach compared to the $\mathrm{RL}_{\mathrm{AM}}$, which was expected due to the information advantage implemented through the context. Also, the variance around the mean is smaller. Table VI shows the statistical results of the tests where, again, the MRL controller shows a higher reward than the $\mathrm{RL}_{\mathrm{AM}}$ approach, while both $\mathrm{RL}_{\mathrm{SM}}$ algorithms are performing better.
This is also an expected result, because the 150 individually trained RL SM controllers together contain 136 times more parameters than the MRL algorithm and, therefore, have much more learning capacity. Also note that the MRL controller has only slightly more network weights than the RL AM approach; the additional weights are located within the context network. The complete data of all test results from the given 150 motors as well as all utilized code are available at [27] to supplement this contribution. An overview of the measured training time requirements is given in Table VII; the training was conducted on a high-performance computing cluster [32]. As can be seen, the MRL's training time exceeds the RL SM 's expected training time by a factor of about 12.25. Therefore, the initially larger training effort for the MRL already amortizes once it is applied to 13 different drive motors. Since a nearly countless number of drive variants is used in industrial applications, each of which would require individual training with standard RL techniques (i.e., the training time scales linearly with the number of drive configurations considered), this training effort offset of the MRL algorithm pays off very quickly.
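The amortization argument can be retraced with a short calculation. The following minimal sketch assumes that each individually trained controller costs one unit of training time and that a single MRL training costs the stated factor of about 12.25 units; the function name is hypothetical and not part of the published code.

```python
import math


def break_even_drive_count(training_time_factor):
    """Smallest number of drives N for which one MRL training is strictly
    cheaper than N individual trainings.

    Assumption: individual training time scales linearly with N (one time
    unit per drive), while the MRL is trained once at
    `training_time_factor` units.
    """
    if training_time_factor <= 0:
        raise ValueError("training_time_factor must be positive")
    # The MRL is strictly cheaper when N > training_time_factor.
    return math.floor(training_time_factor) + 1


# Factor of about 12.25 as stated in the text.
n_break_even = break_even_drive_count(12.25)
print(n_break_even)  # 13
```

For every drive beyond this count, the saved effort grows by one further individual training run under the same linear-scaling assumption.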

D. Evaluation of the Training
Also note that this training time analysis is based on a first proof of concept. Further optimization of the MRL, e.g., in terms of its hyperparameters or changes of the training routine, has the potential to decrease the necessary MRL training time, which would make the time-saving potential even more worthwhile. In particular, the update rate of the context network and the size of the context buffer are what dramatically increased the MRL's training time. Neither has yet been investigated in terms of more time-efficient choices, and both are, therefore, an obvious degree of freedom that may enable a (drastic) reduction of training time.

TABLE IX: IMPACT OF MOTOR FAULTS ON THE CREATED CONTEXT VARIABLES FOR AN EXEMPLARY SPMSM AS DEFINED IN TABLE VIII

E. Evaluation and Utilization of the Context Variables
Given the improvements of the MRL algorithm over the RL AM , the information encoded within the context is of major interest. To fully characterize the dynamic behavior of the motor system, it is sufficient to learn either the physical parameters or the normalized ODE coefficients. For this, the evaluated MRL controller's context network was used to generate context variables z 1 -z 8 for each of the 150 motors. These contexts were then analyzed concerning their correlation with the physical parameters and the normalized ODE coefficients. Fig. 9 shows the correlation matrix of the context, which indicates that the context variables correlate much more strongly with the ODE coefficients than with the physical parameters.
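A correlation analysis of this kind can be sketched as follows. The example assumes Pearson correlation and uses purely synthetic data in place of the actual context variables and ODE coefficients; the function name and array layout (one row per motor) are illustrative assumptions and do not reproduce the evaluation code available at [27].

```python
import numpy as np


def cross_correlation(contexts, features):
    """Pearson correlation between each context variable and each feature.

    contexts: array of shape (n_motors, n_context_variables)
    features: array of shape (n_motors, n_features)
    Returns an array of shape (n_context_variables, n_features).
    """
    contexts = np.asarray(contexts, dtype=float)
    features = np.asarray(features, dtype=float)
    n_ctx = contexts.shape[1]
    # Treat columns as variables and compute the full correlation matrix.
    full = np.corrcoef(np.hstack([contexts, features]), rowvar=False)
    # Keep only the block that relates contexts (rows) to features (columns).
    return full[:n_ctx, n_ctx:]


# Synthetic stand-in data (NOT the data of the article): 150 motors and
# 8 context variables; two features depend linearly on z 1 and z 2.
rng = np.random.default_rng(0)
z = rng.normal(size=(150, 8))
synthetic_features = np.column_stack([2.0 * z[:, 0] + 1.0, -z[:, 1]])
corr = cross_correlation(z, synthetic_features)
```

In this synthetic setup, the entries relating z 1 to the first feature and z 2 to the second feature have magnitude one, whereas the remaining entries stay close to zero.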
Finally, the output of the context network has been investigated concerning different error cases that change the physical parameters of a PMSM. Since a parameter change will affect the resulting context vector, it is reasonable to utilize this chain of effects for the detection of irregular drive conditions, including faults. The analyzed error cases and the corresponding context vectors have been investigated for one exemplary motor (specified in Table VIII) and are listed in Table IX. As can be seen, the context variables deviate from their original values for each of the error cases. This deviation is especially severe for the first three experiments. Only for the last experiment is no obvious linear deviation trend from the original context observable. However, it cannot be excluded that even this assumed fault is detectable from the context if nonlinear classifiers are considered for the detection.
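One simple way to exploit such context deviations is a distance threshold on the context vector. The following sketch is only an illustrative assumption: the Euclidean distance measure, the threshold, and all numerical values are hypothetical and are not taken from Table IX, and, as noted above, some faults may require nonlinear classifiers instead.

```python
import numpy as np


def context_deviation(z_nominal, z_observed):
    """Euclidean distance between the nominal and the observed context."""
    z_nominal = np.asarray(z_nominal, dtype=float)
    z_observed = np.asarray(z_observed, dtype=float)
    return float(np.linalg.norm(z_observed - z_nominal))


def is_irregular(z_nominal, z_observed, threshold):
    """Flag a drive whose context deviates by more than `threshold`."""
    return context_deviation(z_nominal, z_observed) > threshold


# Hypothetical values for illustration only.
z_healthy = np.zeros(8)
z_faulty = z_healthy.copy()
z_faulty[0] += 0.3  # assumed shift of one context variable due to a fault

deviation = context_deviation(z_healthy, z_faulty)
flagged = is_irregular(z_healthy, z_faulty, threshold=0.1)
```

A suitable threshold would have to be calibrated on the context scatter of healthy drives, which is not specified here.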

VII. HARDWARE-IN-THE-LOOP (HIL) INVESTIGATION
Real-time capability is often a major concern when ML applications are to be utilized within or in conjunction with time-critical systems. In order to validate the feasibility of the proposed MRL architecture, an exemplary test case is conducted on rapid control prototyping hardware (RCPH). This HIL experiment allows the monitoring of the turnaround time T TA , which is the time that is needed to compute the next control action. Hence, T TA < T s must be fulfilled to guarantee controllability at all times. The corresponding HIL test setup is presented in Fig. 10, with a dSPACE MicroLabBox [33] serving as the RCPH.
As visualized in Fig. 11, the found exemplary control performance during the HIL experiment is comparable with the offline simulations. It can also be seen that T TA is strictly smaller than T s . The corresponding HIL experiment only made use of the RCPH's CPU; the utilization of the built-in field-programmable gate array was not necessary. This verifies the viability of the MRL control scheme, as the resulting control agent turns out to be relatively lightweight concerning its computational burden.
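The real-time condition T TA < T s can be checked on logged turnaround times as sketched below. The sampling time and the logged values are hypothetical placeholders and do not correspond to the measurements of the HIL experiment; the function names are likewise illustrative.

```python
def real_time_margin(turnaround_times, sampling_time):
    """Remaining time budget for the worst-case turnaround time."""
    worst_case = max(turnaround_times)
    return sampling_time - worst_case


def is_real_time_capable(turnaround_times, sampling_time):
    """True if every logged turnaround time is strictly below T s."""
    return real_time_margin(turnaround_times, sampling_time) > 0


# Hypothetical values in seconds (NOT measured data).
T_S = 100e-6
logged_t_ta = [40e-6, 55e-6, 48e-6]

capable = is_real_time_capable(logged_t_ta, T_S)
margin = real_time_margin(logged_t_ta, T_S)
```

Evaluating the worst case rather than the mean reflects that the condition has to hold at every sampling instant.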

VIII. CONCLUSION
This article presented a drive control framework utilizing a context-based off-policy MRL approach. It was applied within the current control scenario on PMSMs from a wide range of different power classes and was shown to perform significantly better than a general RL algorithm without context variables. The overall procedure is promising for the time-efficient design of RL-based controllers, since the required training time is a major drawback of approaches with individual training. Furthermore, an HIL experiment also validated sufficiently quick inference of the presented MRL, despite the comprehensive training routine. The investigation of the learned context showed that the identified context variables correlate with the dynamic behavior of the motors as described by ODE coefficients. Therefore, the secondary goal of implicit motor behavior characterization was achieved, and its potential application to motor fault detection was also outlined with promising perspectives.
For upcoming investigations, closing the gap between MRL and individually trained RL should be prioritized. For that, a comprehensive optimization of the utilized hyperparameter configuration should be considered. In addition, the training routine should be extended to include methods of nonuniform motor and minibatch selection. Moreover, the level of sophistication in the preparation and selection of training data could be elaborated even further, especially in terms of synthetic data generation and in the representation of uncommon motor characteristics. Finally, this contribution assumed PMSMs that are described by static parameters, which is a simplification of the real-world behavior with parasitic effects such as (cross-)saturation or temperature and aging effects. These assumptions rendered a static context sufficient to fully characterize the environment. However, usual real-world applications may show dominant time-varying behavior, and therefore, it would be of interest to recreate the context at runtime to adapt to changes of the drive in an online fashion. Naturally, the utilization of the given setup should also be considered for different power electronic applications.