Sequence Prediction for SiC MOSFET Active Gate Driving With a Recurrent Neural Network

This article develops a recurrent neural network (RNN) with an encoder-decoder structure to predict the driving sequence of SiC MOSFET active gate drivers (AGDs). With a set of switching targets as the input, the predictor generates an optimal active gate driving sequence to improve the switching transient. The development is based on a hybrid platform spanning MATLAB, PyTorch, and LTspice: a high-fidelity switching model is implemented in MATLAB to obtain reliable training data, the sequence predictor is trained with PyTorch, and the predicted sequence is verified on an example Buck circuit in LTspice. In contrast to the state-of-the-art approach, the proposed method avoids an exhaustive search in a large solution space; the sequence length is predicted dynamically along with the driving strength at each step. The AGD sequences generated by the predictor effectively and precisely improve the switching transients, making the proposed sequence predictor an integral and valuable component of active gate driving.


I. INTRODUCTION
Despite the benefits of SiC MOSFETs in building high-efficiency, high-power-density, and high-performance power conversion systems, their high switching speed also causes more significant overshoot, oscillation, and elevated electromagnetic interference (EMI). The active gate driver (AGD) is a remedy that allows adopting SiC devices while addressing the problems they introduce.
Although extensive efforts have been made on AGD circuit implementations [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], a methodology for designing the active driving sequence has yet to be developed. To address the challenge, a search approach is proposed in [9] and improved in [13], where a dynamic resistance sequence with 60 time steps is searched by the particle swarm optimization (PSO) algorithm. Another model-based solution was proposed in [14] to find the optimal resistance sequence for the AGD reported in [8], in which the optimal gate resistance value for oscillation reduction is found by trial and error.
As shown in Fig. 1, the state-of-the-art method initializes a group of candidate AGD sequences with predefined fixed lengths [9], [13]. Then, the AGD sequences are applied to the physical circuit, and switching results of interest are captured. The actual circuit is regarded as an AGD to switching results (SR) model, which a high-fidelity simulation platform can also implement. The PSO algorithm takes the SR model's feedback to update the candidate AGD sequences and iteratively searches for the optimal solution. The fixed sequence length assumption of the state-of-the-art method is of concern.
In this article, an AGD sequence prediction method is developed to address the challenge, as shown in Fig. 1. The total length and individual driving strengths are predicted, and their values are dynamically matched. Instead of searching from candidate AGD sequences, which can be poorly defined, the predictor takes switching targets as input and generates the driving sequences. The predictor uses the recurrent neural network (RNN) to process the sequential data effectively. The proposed method can fine-tune the switching transient parameters to reduce the switching loss or improve the EMI profile.
The rest of the article is organized as follows. The principle of RNN-based sequence prediction is given in Section II, the training data generated from a high-fidelity switching model are discussed in Section III, the neural network structure and the training process are elaborated in Section IV, and the AGD performance verification is given in Section V. Finally, Section VI concludes the article.

II. RECURRENT NEURAL NETWORK-BASED SEQUENCE PREDICTION
The hybrid data-driven workflow for AGD sequence prediction is demonstrated in Fig. 2. The deep-learning neural network developed for the task is at the workflow's core. A predictor that models the long-term dependency within a time series is critical for the prediction problem: for example, to generate a driving sequence leading to a particular switching loss, the predictor must account for the loss generated in the previous steps and then predict the present step. Since tradeoffs always exist among multiple switching targets in one switching transient, understanding long-term dependencies is essential for the predictor. The encoder-decoder RNN (ED-RNN) presented in this article is developed to solve this time-series modeling problem; its structure is elaborated in Section IV.
The training data for the ED-RNN consist of AGD sequences paired with their corresponding switching results, including E_sw, di/dt, dv/dt, overshoot, and oscillations. In this work, a high-fidelity switching model is developed in MATLAB, where large numbers of switching transients can be quickly simulated and quantified. The switching results are treated as the inputs to the ED-RNN, and the AGD sequence prediction is the expected output.
The implementation and training of the ED-RNN are based on PyTorch, a high-performance integrated library for AI applications [27]. Once the neural network is trained, it makes the AGD sequence prediction, referred to as the inferring process. The ED-RNN predictor takes user-defined switching targets as the input and predicts a driving sequence, which is verified in the LTspice platform.
The hybrid workflow enables a cross-verification of the proposed method. The training data generation and the AGD performance verification are conducted on different platforms; hence, the AGD predictor is a generalized approach and not platform dependent.

III. HIGH-FIDELITY SWITCHING MODEL FOR NN TRAINING

A. DATA GENERATION MODEL
The training data can be obtained from a circuit as simple as a double pulse tester (DPT). A DPT circuit is modeled in a MATLAB script for training data generation, as shown in Fig. 3(a). Die models of a SiC MOSFET and a diode are used to eliminate the influence of packaging parasitic inductance. The parasitic driving loop inductance is neglected, while the commutation loop inductance (L_LP) is retained. The dc bus voltage (V_DC) and switching current (I_0) are fixed for the model but can be modified to obtain switching data under different operating conditions. An active gate current (I_g) driver is considered in this work and is assumed to generate discrete currents from 0.1 A to 2.0 A in 20 levels. The time step for I_g is selected based on the switching speed of the target SiC MOSFET. On the one hand, a shorter time step is preferred to fine-tune the switching transient of the device; on the other, it should be long enough for the active I_g to alter I_ds and V_ds at each step. Given that the turn-ON time of the device is typically 30-80 ns [15], a 3 ns time step is a proper choice, giving around ten steps in the driving sequence even for the fastest switching transient. The circuit parameters are listed in Table 1.
Fig. 4 highlights the high-fidelity data generation workflow. Three methods are applied to obtain reliable data from the switching model: the device I-V curve, the capacitance C-V look-up table (LUT), and the ordinary differential equation (ODE) model. Quasi-random AGD sequences are generated, serving as the switching model excitations.

1) DEVICE I-V CURVE
The characteristic I_ch-V_ds curve of the CPM2-1200-0025B is critical for describing the output behavior of the SiC MOSFET. The curve is usually available in the device datasheet, but the V_ds range is limited and insufficient to model the device entirely. This work uses the SPICE model developed by the device manufacturer to extract the extended I-V curve. The I-V curve of the CPM2-1200-0025B obtained from the manufacturer SPICE model at a 25 °C junction temperature is shown in Fig. 3(b).
In the ohmic region, the channel current equation (1), as proposed in [16], is adopted to approximate the I-V curve.
where I_ch is the channel current of the MOSFET, and V_ds, V_gs, and V_TH are the drain-source, gate-source, and threshold voltages, respectively. The β_ohm, g_sat, and k are device-related parameters and are obtained by fitting the device I-V curves using (1). Note that (1) differs from the current equation for Si MOSFETs. The channel current equation for the saturation region is (2), where λ is the channel-length modulation coefficient and β_sat is a device parameter. These parameters are obtained by fitting the I-V curve in the saturation region. The MOSFET I-V characteristic is given as a group of I_ch-V_ds curves under different gate voltages V_gs. For a particular I_ch-V_ds curve, the device parameters (β_ohm, g_sat, k, β_sat, λ) can be obtained by curve fitting using (1) and (2). Therefore, for multiple I-V curves under different V_gs, the device parameters are formulated as functions of V_gs. A boundary voltage V_ct is also formulated such that the MOSFET is in the ohmic region if V_ds < V_ct and in the saturation region if V_ds > V_ct. The curve of V_ct is shown in Fig. 3(b).
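As a rough illustration of this curve-fitting step, the sketch below fits a generic square-law ohmic-region model to synthetic I-V points with scipy.optimize.curve_fit. The exact equation (1) from [16], the real CPM2-1200-0025B data, and the operating point are not reproduced here; the model form and numbers are stand-ins.

```python
import numpy as np
from scipy.optimize import curve_fit

VGS_EFF = 11.0   # assumed Vgs - VTH operating point for one I-V curve

# Generic square-law ohmic-region model used as a stand-in for (1);
# beta and k play the role of the fitted device parameters.
def i_ch_ohmic(vds, beta, k):
    return beta * (VGS_EFF * vds - k * vds**2 / 2.0)

# Synthetic points standing in for the I-V curve extracted from the SPICE model
vds = np.linspace(0.05, 1.0, 30)
rng = np.random.default_rng(0)
i_meas = i_ch_ohmic(vds, 25.0, 0.9) + 0.01 * rng.standard_normal(30)

# Fit the device parameters for this Vgs; repeating the fit for several
# Vgs values yields the parameters as functions of Vgs, as in the article
(beta_fit, k_fit), _ = curve_fit(i_ch_ohmic, vds, i_meas, p0=[10.0, 0.5])
```

Repeating such a fit per gate voltage, and likewise in the saturation region, produces the parameter-versus-V_gs functions described above.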

2) CAPACITANCE C-V LUT
The conventional method for modeling the nonlinear capacitances of MOSFETs, i.e., C_iss, C_rss, and C_oss as shown in Fig. 3(c), is to curve-fit the capacitances to theoretical equations [17], [18]. However, the curve-fitting results cannot precisely match the experimental capacitance values, especially in the low-V_ds range where the capacitances vary remarkably, and the mismatch reduces the fidelity of the data generation. In this work, the DPT circuit dynamics are described by ODEs and solved numerically; under this framework, an LUT-based implementation of the capacitance C-V curves achieves better accuracy. By adopting a small time step to solve V_ds and updating the capacitance values accordingly, the switching model incrementally incorporates the nonlinear capacitances into the transient simulation without using any explicit equation.
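The LUT lookup amounts to a simple interpolation that is re-evaluated at every solver step. The sketch below shows the idea; the C-V values are illustrative placeholders, not CPM2-1200-0025B datasheet data.

```python
import numpy as np

# Hypothetical C-V look-up table for Coss (illustrative values only):
# the capacitance falls steeply at low Vds, which is exactly the region
# where curve-fitted equations tend to mismatch.
vds_lut  = np.array([0.1, 1.0, 10.0, 100.0, 800.0])               # V
coss_lut = np.array([6.0e-9, 3.0e-9, 9.0e-10, 2.0e-10, 1.2e-10])  # F

def coss(vds):
    """LUT lookup per time step: interpolate Coss on a log-voltage grid."""
    return np.interp(np.log10(vds), np.log10(vds_lut), coss_lut)
```

Inside the ODE loop, a call such as coss(vds_now) refreshes the capacitance value each simulation step, so no explicit C(V) equation is ever needed.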

3) ODE MODEL
The DPT circuit is modeled by its differential equations to obtain accurate switching transient data and solved numerically in MATLAB. Unlike behavioral modeling methods such as those in [17], [19], [20], where certain assumptions are made to derive the analytical equations, the numerical solution based on the circuit ODE model requires the fewest assumptions while retaining the nonlinearities, which provides better accuracy.
B. TURN-ON TRANSIENT MODEL
The turn-ON process is divided into four modes named in the format "x-y," where "x" is the state of the MOSFET and "y" is the state of the diode. The process starts from the OFF-ON mode, where the MOSFET is OFF and the diode is freewheeling. The gate current I_g charges C_iss and brings V_gs up from V_EE = -5 V to V_TH. In this mode, no dynamics in the commutation loop are involved, and the gate voltage equation is C_iss dV_gs/dt = I_g, where C_iss = C_gs + C_gd. At the moment the MOSFET turns ON, V_ds = V_DC; hence, the MOSFET is in the saturation region, and the turn-ON process enters the Sat-On mode. The Sat-On mode ends when I_d reaches the load current I_0 as in Fig. 3(a) and the diode turns OFF (Sat-Off). The gate driver loop equation is shown in (4), where the initial values of V_ds and V_od are taken at each simulation time step. The commutation loop voltage and current equations are shown in (5) and (6), with the parameters shown in Fig. 3(a). Here, V_f is the forward voltage drop of the SiC diode CPW41200S020B; its dependence on the forward current is curve-fitted from the datasheet as V_f = 0.03194 I_f + 0.9205.
Generally, the actual switching loss E_sw caused by the MOSFET channel current I_ch is immeasurable; the measured switching loss E_me is calculated from the drain current I_d but underestimates the actual loss [21]. One feature of the proposed modeling approach is that the actual switching loss E_sw and the measured switching loss E_me can be differentiated, as described in (7) and (8). Both E_sw and E_me can be obtained, but only E_sw is used as the training data in this work.
The Sat-Off mode starts when the diode turns OFF and ends when the MOSFET transitions from the saturation region to the ohmic region (Ohm-Off). The commutation loop voltage equation for this mode changes from (5) to (9), where C_T is the parasitic capacitance of the top switch and C_T = C_oss + C_D.
In the Ohm-Off mode, the circuit begins oscillating. As V_gs increases to V_CC, the turn-ON process finishes, and the MOSFET reaches its steady operating point in the ohmic region. The commutation loop current equation (6) and loss equation (7) change to (10) and (11), while (4) and (9) remain the same.
The switching transient is modeled according to the gate driver loop equation, the commutation loop voltage and current equations, and the switching loss equation. The flowchart of the switching model is summarized in Fig. 5. The 0.1 ns simulation time step and the 3 ns AGD time step are example values used in this work and can be changed.
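The per-mode ODE solution can be sketched for the simplest interval, the OFF-ON mode, where a constant gate current charges C_iss and no commutation-loop dynamics are involved. SciPy's solve_ivp stands in for MATLAB's ode45(), and the component values are illustrative, not the Table 1 set.

```python
import numpy as np
from scipy.integrate import solve_ivp

CISS = 6.0e-9       # F, assumed constant over this interval (illustrative)
IG   = 0.8          # A, constant gate current for the OFF-ON mode
VEE, VTH = -5.0, 2.5  # V, gate rail and threshold (VTH is illustrative)

def dvgs_dt(t, vgs):
    return [IG / CISS]            # gate loop ODE: Ciss * dVgs/dt = Ig

# Terminal event: the OFF-ON mode ends when Vgs reaches VTH
def reach_vth(t, vgs):
    return vgs[0] - VTH
reach_vth.terminal = True

sol = solve_ivp(dvgs_dt, (0.0, 200e-9), [VEE], events=reach_vth,
                max_step=0.1e-9)  # 0.1 ns step, as in the article
t_delay = sol.t_events[0][0]      # duration of the OFF-ON mode
# analytically: Ciss * (VTH - VEE) / Ig = 6n * 7.5 / 0.8 = 56.25 ns
```

The later modes add the commutation-loop states (I_d, V_ds) and the LUT capacitance update to the same solver loop; the event mechanism mirrors the mode transitions of Fig. 5.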

C. QUASI-RANDOM AGD SEQUENCE
On the data generation platform, I_g sequences are randomly generated and concurrently applied to the turn-ON transient. In this work, the I_g sequences are generated by the randsample() function, which accepts a probability distribution input. For the turn-ON transient, the quasi-random I_g sequences during the current rising interval are generated with the probability distribution P(di/dt), and those for the voltage falling interval with P(dv/dt). The proposed switching model makes the turn-ON transient a white box; hence, it is convenient to assign unique I_g sequences to the current rising and voltage falling intervals individually.
It should be noted that the quasi-random AGD sequences are generated in parallel with the switching transient simulation rather than being predefined. When the switching transient finishes, the AGD sequence generation also terminates. Therefore, the sequence length is dynamically matched with the duration of the transient.
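A minimal Python analog of this quasi-random generation is sketched below, with NumPy's weighted choice standing in for MATLAB's randsample(); the probability shape and the stop condition are placeholders for the real P(di/dt) and the simulated transient state.

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 discrete gate-current levels, 0.1 A ... 2.0 A
IG_LEVELS = np.round(np.arange(0.1, 2.01, 0.1), 1)

def draw_ig(p):
    """Draw one 3 ns gate-current step from probability distribution p,
    the analog of MATLAB's randsample() with weights."""
    return rng.choice(IG_LEVELS, p=p)

# Example: bias the current-rising interval toward strong driving
# (the shape of P(di/dt) here is an assumption for illustration)
p_didt = np.linspace(1.0, 3.0, 20)
p_didt /= p_didt.sum()

def current_rising(step):
    return step < 5   # placeholder for "Id has not yet reached I0"

# The sequence is built step by step while the transient is simulated,
# so its length matches the transient duration instead of being predefined
seq, step = [], 0
while current_rising(step):
    seq.append(draw_ig(p_didt))
    step += 1
```

In the actual model, the while-condition comes from the simulated circuit state, which is what dynamically terminates the sequence with the transient.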

D. TRAINING SET RESULTS
The switching model is implemented in a MATLAB .m script, where the ODE solver ode45() generates the numerical solutions. An efficient search algorithm is developed for the nonlinear capacitance C-V LUT to run the switching model. The circuit ODEs presented before are transformed into state-space form for the ode45() solver. The simulation updates the coefficients of the ODEs (such as the nonlinear capacitance values) every 0.1 ns; the small time step guarantees the high accuracy of the switching model. The proposed switching model generates approximately 15 000 training samples in about 4-5 h. Three switching results are presented here as an example: E_sw, the switching loss; di/dt, the average current slope on the rising edge; and dv/dt, the average voltage slope on the falling edge. The normalized switching-results distribution with di/dt, dv/dt, and E_sw as the x-, y-, and z-axes is visualized in Fig. 6. On the x-y plane, the switching results cover the entire plane; in other words, most of the possible di/dt, dv/dt, and E_sw combinations are included in the training data, making the training more reliable. The key enabler here is the quasi-random I_g generation method with variable di/dt and dv/dt. It should also be mentioned that the switching results denoted by the same color are generated by AGD sequences with the same probability distribution. The base values used for normalization are E, the switching loss under a constant I_g = 0.1 A, and K_i and K_v, the average current and voltage slopes when a constant I_g = 2.0 A is applied.
In summary, the SiC MOSFET model is developed based on the device I-V characteristic, and the capacitance C-V characteristic is implemented by an LUT from measured data instead of fitting equations. The switching model is based on ODEs, which are more accurate than behavioral models based on analytical equations. The variables di/dt and dv/dt are state variables in the switching model and can be easily solved for. The AGD sequences can be generated individually for the di/dt and dv/dt stages using the white-box switching model. The probability distributions assigned to the sequences can be manually optimized, which is a unique and helpful feature for high-quality training data generation.

IV. GRU-BASED ENCODER-DECODER RECURRENT NEURAL NETWORK
The gated recurrent unit (GRU) network is one of the two main variants of the classical RNN that can process time series with long time dependencies [22]. For this reason, the GRU is adopted here to construct the active gate driving sequence predictor. The sequence predictor also takes advantage of the encoder-decoder network structure successfully applied to machine translation [23] and image autocaptioning [24]. Fig. 7 demonstrates the overall structure of the GRU-based encoder-decoder recurrent neural network (GRU-EDRNN).

A. ENCODER AND DECODER STRUCTURE
The encoder takes the switching results as input and generates a context vector C for the decoder. The encoder consists of two components: 1) batch normalization; and 2) linear layer.
The batch normalization (BN) algorithm [25], [26] whitens the training data within a minibatch. It helps the neural network learn the optimal distribution of the training data, making the training faster and smoother. This work uses a BN layer as the front end to properly normalize the input data.
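The whitening operation can be sketched in a few lines (the learnable scale and shift of a full BN layer are omitted for brevity; the example batch values are illustrative):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Whiten a minibatch feature-wise: zero mean, unit variance per
    feature, as the BN front end does for the switching-target inputs."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

# Switching results arrive on very different scales (e.g. loss vs. slopes),
# which is why whitening them helps the training converge smoothly
batch = np.array([[120.0, 3.1, 25.0],
                  [300.0, 5.0, 40.0],
                  [210.0, 4.2, 31.0]])
whitened = batch_norm(batch)
```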
The linear layer is defined as C = A(Wx + b), where x is the vector of switching results, C is the context vector, and W and b are the learnable weight matrix and bias vector. A is a linear activation function in this work.
The decoder takes the context vector and decodes it into the AGD sequence. The decoder consists of five parts: 1) an embedding layer; 2) GRU layers; 3) dropout; 4) BN and linear layers; and 5) a softmax layer. As mentioned above, at each time step, the active I_g can be chosen from 0.1 to 2.0 A with 20 levels. Two unique tokens, <start> and <end>, are added to the first and last positions to mark the start and end of an AGD sequence, so there are 22 selections for each time step of an AGD sequence. An embedding layer [28] is adopted to encode the AGD sequence.
The context vector C is then used as the initial hidden state h_0 of the first GRU cell in the decoder. The GRU is capable of "forgetting" irrelevant information and "memorizing" long-time dependencies in a sequence due to its reset and update gate mechanism. Three GRU layers are cascaded as shown in Fig. 7, which enhances the long-time dependency modeling capability.
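The reset/update gate mechanism can be sketched with a single NumPy GRU cell using the textbook GRU equations; the dimensions and random weights below are illustrative only, not the trained network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h, W, U, b):
    """One GRU step: the update gate z blends old state and candidate,
    the reset gate r decides how much past state enters the candidate."""
    Wz, Wr, Wn = W; Uz, Ur, Un = U; bz, br, bn = b
    z = sigmoid(x @ Wz + h @ Uz + bz)        # update gate ("forget/keep")
    r = sigmoid(x @ Wr + h @ Ur + br)        # reset gate
    n = np.tanh(x @ Wn + (r * h) @ Un + bn)  # candidate state
    return (1.0 - z) * n + z * h             # new hidden state

# Tiny illustrative dimensions (not the sizes used in the article)
rng = np.random.default_rng(0)
dx, dh = 4, 8
W = [rng.standard_normal((dx, dh)) * 0.1 for _ in range(3)]
U = [rng.standard_normal((dh, dh)) * 0.1 for _ in range(3)]
b = [np.zeros(dh) for _ in range(3)]

h = np.zeros(dh)
for x in rng.standard_normal((5, dx)):       # run 5 time steps
    h = gru_cell(x, h, W, U, b)
```

Because the update gate can stay close to 1, information from early steps can pass through many time steps almost unchanged, which is what gives the cascaded GRU layers their long-time dependency modeling capability.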
Dropout is an effective method to overcome the overfitting problem in a neural network [29], hence, it helps the neural network generalize better.
The AGD sequence prediction is, in essence, a classification problem where, for each time step, there are 22 classes. The neural network treats its output as a probability distribution P̃_I, which indicates the probability of I_g equaling any of the 22 classes. With this probability distribution, the class with the highest probability is chosen as the prediction result. The softmax layer converts the linear layer output λ_t to the probability distribution P̃_I.
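A sketch of this 22-class output stage follows; the token ordering in the vocabulary is an assumption for illustration.

```python
import numpy as np

# 22 classes per step: the two tokens plus 20 Ig levels
# (this ordering is assumed for illustration)
TOKENS = ["<start>", "<end>"] + [f"{0.1 * k:.1f} A" for k in range(1, 21)]

def softmax(logits):
    """Convert the linear-layer output to a probability distribution."""
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()

logits = np.zeros(22)
logits[5] = 4.0                         # example decoder output for one step
p = softmax(logits)                     # ~P_I for this time step
predicted = TOKENS[int(np.argmax(p))]   # highest-probability class wins
```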

B. NEURAL NETWORK TRAINING
Consider a specific data set containing AGD sequence l of length m_l: {I_g,k | k = 1, . . ., m_l} and the corresponding switching-results vector x. Assuming I_g,0 = <start> and I_g,m_l+1 = <end>, the goal of training is for the neural network to learn the prediction I_g,0,...,k-1 → I_g,k, where I_g,0,...,k-1 means sequentially inputting I_g,0, I_g,1, . . ., I_g,k-1 to the GRU-EDRNN.
In practice, at step k, where C(x) and I_g,0,...,k-1 are the inputs to the neural network, the GRU-EDRNN makes a prediction for I_g,k, denoted as Ĩ_g,k. At each step, the neural network output is a probability distribution P̃_I as in (13), and Ĩ_g,k is obtained as Ĩ_g,k = np.argmax(P̃_I), where np.argmax() is a NumPy function that returns the index of the maximum element.
In the same way, I_g,k can be regarded as a probability distribution P_I = [0.0, . . ., 0.0, 1.0, 0.0, . . ., 0.0], where the 1.0 appears at the position corresponding to the class of I_g,k. Therefore, to make the correct prediction Ĩ_g,k = I_g,k, P̃_I should be as close to P_I as possible. Mathematically, the distance between the two probability distributions is measured by the cross-entropy H(P_I, P̃_I) = -Σ_i P_I(i) log P̃_I(i). The training process is therefore an optimization problem over the entire training set that minimizes the total cross-entropy, where W, U, and b are the learnable parameters of the neural network, N is the total number of training sequences, and m_l is the length of AGD sequence l.
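The cross-entropy between the one-hot target and the predicted distribution can be sketched as follows (the class probabilities are illustrative):

```python
import numpy as np

def cross_entropy(p_true, p_pred, eps=1e-12):
    """H(P, P~) = -sum P log P~; for a one-hot P this reduces to
    -log of the predicted probability of the correct class."""
    return -np.sum(p_true * np.log(p_pred + eps))

n_classes = 22
p_true = np.zeros(n_classes)
p_true[7] = 1.0                               # one-hot target class

p_good = np.full(n_classes, 0.01); p_good[7] = 0.79  # confident and correct
p_bad  = np.full(n_classes, 0.01); p_bad[3]  = 0.79  # confident but wrong

ce_good = cross_entropy(p_true, p_good)
ce_bad  = cross_entropy(p_true, p_bad)
```

Minimizing this quantity over every step of every training sequence pushes P̃_I toward the one-hot P_I, which is exactly the training objective described above.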
To summarize, for a sequence l: {<start>, I_g,1, I_g,2, . . ., I_g,m_l} along with its switching results x, the expected prediction is q: {I_g,1, I_g,2, . . ., I_g,m_l, <end>}, and the actual prediction is the corresponding sequence of network outputs q̃: {Ĩ_g,1, Ĩ_g,2, . . ., Ĩ_g,m_l+1}.

C. NEURAL NETWORK INFERRING
Inferring refers to the operation in which the trained neural network is used to make the AGD sequence prediction. The inferring process is demonstrated in Fig. 8. The input to the neural network is the switching targets of interest; in practice, the <start> token is one additional input to start the prediction. Once the first-step prediction is made, it is fed back to the decoder to generate the second-step prediction. The process repeats until the neural network predicts the <end> token, indicating that a complete AGD sequence has been found. The sequence length is thus inferred by the neural network and changes dynamically for different switching targets.
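The autoregressive inferring loop can be sketched as follows, where decoder_step is a hypothetical stand-in for one decoder step of the trained GRU-EDRNN (here it simply replays a fixed toy sequence):

```python
# Toy stand-in for one decoder step: maps the previous token to the next
# one.  In the real network this is a GRU step followed by softmax/argmax.
def decoder_step(token, state):
    order = {"<start>": "1.4 A", "1.4 A": "0.8 A", "0.8 A": "<end>"}
    return order[token], state

def infer_sequence(context, max_len=64):
    """Feed each prediction back in until <end> appears; the sequence
    length is therefore decided by the network, not predefined."""
    token, state, seq = "<start>", context, []
    for _ in range(max_len):          # guard against non-termination
        token, state = decoder_step(token, state)
        if token == "<end>":
            break
        seq.append(token)
    return seq

seq = infer_sequence(context=None)
```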
It should be mentioned that the duration of the training and inferring processes depends on the hardware used. This work utilizes Paperspace (https://www.paperspace.com/), a cloud infrastructure for machine learning applications. Given the neural network's complexity, the size of the training data, and the GPU infrastructure in use, a single training process typically takes around 30 min (with multiple runs required to obtain the final model). Conversely, inferring usually takes only a couple of seconds.

V. AGD PREDICTOR PERFORMANCE VERIFICATION
The verification of the AGD predictor is based on a Buck converter with an active current source gate driver. The GRU-EDRNN predicts AGD sequences to improve the switching transients. Fig. 9 shows the verification circuit in LTspice with an ideal current source as the AGD. The dc bus voltage is 800 V, the turn-ON current is 40 A, and the total commutation loop inductance is 20 nH. The circuit parameters for the verification are the same as those of the switching model from which the training data are generated. An example total baseplate-to-ground coupling capacitance of 200 pF is adopted [30], which is the source of the common-mode noise. Two line impedance stabilization networks (LISNs) are added to DC+ and DC-. The common-mode noise voltage is evaluated by computing V_CM = (V_1 + V_2)/2, as labeled on the LISNs. The conventional gate driver (CGD) with a constant gate resistor is used as the benchmark; its circuit configuration is the same as in Fig. 9, except that a voltage-source CGD with a constant gate resistor R_g replaces the current source AGD in the dashed box.

A. SWITCHING WAVEFORM MODIFICATION BY AGD
This section discusses how the AGD modifies the switching transient, focusing on the tradeoff between the switching loss E_sw and the common-mode noise V_CM. The switching targets for the optimization are E_sw, the switching loss during the turn-ON transient; di/dt, the average current rising slope; and dv/dt, the maximum voltage falling slope. The AGD adjusts the maximum voltage slope to prevent high common-mode noise. Three cases are verified, as listed in Table 2.

1) CASE 1
The current source AGD reduces E_sw and V_CM simultaneously, but di/dt increases as a tradeoff. Table 2 lists the normalized switching results (E_sw, di/dt, dv/dt) that CGD-1 achieves with an R_g of 25 Ω. In Case 1, the switching targets for E_sw and dv/dt are decreased compared with the CGD-1 case, while that for di/dt is increased.

2) CASE 2
For applications emphasizing V_CM reduction while having a higher tolerance on E_sw, the AGD reduces the maximum dv/dt. In this case, the di/dt remains unchanged, and the V_CM is remarkably reduced at the cost of an enlarged switching loss.

3) CASE 3
For applications with E_sw minimization as the dominant target, the AGD optimizes the switching loss without deteriorating the common-mode noise. This is achieved by minimizing E_sw, raising di/dt, and keeping dv/dt unchanged.
The Target values shown in Table 2 are the intended switching results, normalized to the Base Value. They serve as the inputs to the GRU-EDRNN (Fig. 8). The AGD sequence is applied to the Buck circuit (Fig. 9) once it is predicted. The Achieved values denote the E_sw, di/dt, and dv/dt results obtained by active gate driving. It is observed that the Achieved values are accurate approximations of the Target values. The prediction process is a translation from switching targets to an AGD sequence, and the results thus confirm the validity of the GRU-EDRNN.
The switching waveforms obtained from LTspice are compared in Fig. 10(a). In the I_g comparison waveforms, the CGD-1 gate current starts from 0 ns. For the AGD results, the active gate current sequences all start from 31 ns, indicated by the start-of-sequence (SOS) mark. The end-of-sequence (EOS) marks for the three AGD sequences are also labeled in the figure. Here, the AGD sequences are only generated for the primary switching transient corresponding to the Sat-On and Sat-Off modes of the switching model. The turn-ON process between 0 ns and the SOS corresponds to the OFF-ON mode, where no switching dynamics are involved; a constant I_g = 0.8 A is adopted for this interval, and it takes 31 ns to finish the OFF-ON mode. After the EOS, the circuit is in the Ohm-Off mode, where the primary switching transient has finished; a constant I_g = 0.8 A is also adopted for this interval. It can be inferred from the EOS marks that the predicted sequence length changes from case to case, and the lengths of the three AGD sequences match the durations of the individual switching transients. The results demonstrate that the GRU-EDRNN can predict correct AGD sequences according to the switching targets.
In Cases 1 and 2, the AGD reduces the maximum dv/dt to improve the common-mode noise, and the declines in V_CM are evident. As quantified in Table 2, when the maximum dv/dt in Cases 1 and 2 is decreased by 14.2% and 20.5%, the V_CM drops from -138 V to -123 V and -116 V, respectively. In Case 3, the focus is not dv/dt reduction, and the V_CM is unchanged compared with the CGD-1 case. The average di/dt for Cases 1 and 3 is increased by 11.0% and 37.4%, respectively, while for Case 2 it is kept the same as in the CGD-1 case; this is verified by the I_ds waveforms in Fig. 10(a). In Case 3, the V_CM is not improved, but the E_sw is minimized by significantly accelerating the di/dt interval. In Case 2, the V_CM is minimized at the cost of a higher E_sw. Case 1 is the case in which an optimal solution is found that improves both E_sw and V_CM.
It is worth noting that di/dt and dv/dt are controlled individually by the AGD, which is impracticable for CGD-1. Owing to the high time resolution of 3 ns, the sequences predicted for the current rising and voltage falling intervals can fine-tune the switching transient. The AGD is thus an efficacious tool for switching transient improvement, with the sequence predictor as a critical enabler.

B. SWITCHING TRANSIENT IMPROVEMENT: AGD VERSUS CGD
This study optimizes V_CM and the turn-ON current overshoot I_OV. The AGD and CGD are applied to fulfill the targets, and the resulting E_sw values are compared. The AGD with the predicted driving sequence generates a lower loss in both cases while achieving the same improvement. The switching targets input to the GRU-EDRNN are the same as in Section V-A, since the maximum dv/dt affects V_CM and the average di/dt changes I_OV. Table 3 summarizes the verification results.
For V_CM reduction, CGD-1(Slow) with a higher R_g of 29 Ω is utilized to slow down the maximum dv/dt. It is discernible from the V_CM waveforms in Fig. 10(b) and Table 3 that CGD-1(Slow) and AGD-1 achieve similar V_CM values, but AGD-1 achieves a lower E_sw due to its faster current transient speed, as shown in the E_sw and I_ds waveforms. In the I_OV reduction case, CGD-2 with an R_g of 12 Ω accomplishes a low switching loss but introduces serious current overshoot, as seen from the I_ds waveforms in Fig. 10(c). CGD-2(Slow) with a higher R_g of 24 Ω, as a remedy, slows down the di/dt. AGD-4 achieves the same overshoot reduction, but its switching loss is kept low by accelerating the voltage transient, as inferred from the E_sw and V_ds waveforms in Fig. 10(c). The AGD outperforms the CGD due to the augmented freedom in controlling di/dt and dv/dt. The GRU-EDRNN accurately generates AGD sequences for switching transient improvement based on the switching targets.

C. OPTIMALITY OF THE PREDICTION
The global optimality of the predicted sequence cannot be guaranteed, since the loss-minimization problem of training a deep neural network is nonconvex [31], [32]. Therefore, the local optimality is investigated in this section. As an analogy to taking partial derivatives to prove local optimality, a sequence perturbation method is adopted for the investigation.
Case 3 discussed in Section V-A is taken as an example to explain the AGD sequence perturbation method. The predicted sequence is changed manually with small perturbations, and the modified sequence is applied to the verification circuit. The new switching results are then extracted, and the average error between the achieved and target switching results is compared. Table 4 tabulates the Case 3 AGD sequence and three perturbations.
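The comparison can be sketched as below; since the article's exact error definition is not reproduced here, an assumed mean-relative-error form over the three switching results is used, and the numbers are illustrative.

```python
import numpy as np

def avg_error(achieved, target):
    """Assumed form of the average error: mean relative deviation of the
    achieved switching results (Esw, di/dt, dv/dt) from the targets."""
    achieved, target = np.asarray(achieved), np.asarray(target)
    return float(np.mean(np.abs(achieved - target) / np.abs(target)))

# Illustrative normalized values, not the Table 4 data
target    = [0.80, 1.10, 0.85]   # switching targets input to the predictor
predicted = [0.81, 1.12, 0.86]   # results of the predicted sequence
perturbed = [0.90, 1.25, 0.80]   # results after a manual perturbation
```

Under this metric, a sequence that is locally optimal should show a higher average error for every small perturbation than for the original prediction, which is the pattern reported for Perturbations 1-3.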

1) PERTURBATION 1
The position of I_g(3) = 1.6 A is swapped with that of its adjacent elements, which changes the order of the gate driving sequence. The average error increases remarkably when the modified gate driving sequences are applied.

2) PERTURBATION 2
The strength of I_g(3) is perturbed. The average error increases when the driving current deviates from the prediction; the larger the deviation, the higher the average error.

3) PERTURBATION 3
The values of I_g(2) to I_g(4) are modified so that the new sequence injects the same total charge Q_c = Σ_{i=2}^{4} I_g(i) × 3 ns. In this perturbation, the average error is considerably reduced compared with Perturbations 1 and 2. Nevertheless, the original predicted sequence still outperforms the two gate driving sequences of Perturbation 3.
The above results demonstrate that the GRU-EDRNN prediction achieves the lowest average error, while all the perturbed sequences underperform it. The discussion is not exhaustive enough to prove the local optimality of the entire AGD sequence; similar perturbations can be applied to the other I_g predictions. Nevertheless, the AGD sequence predicted by the GRU-EDRNN best achieves the switching target among all the perturbed sequences, and hence the local optimality of the prediction is partially validated.

VI. CONCLUSION
The GRU-EDRNN proposed in this article provides a superior solution to the AGD sequence prediction challenge. Its advantages are: 1) no search process is involved, and the sequence is predicted directly for given switching targets; and 2) the sequence length and the individual driving strengths are both generated by the GRU-EDRNN and are dynamically matched, a unique feature not seen in other approaches. AGD sequence generation is the critical step of active gate driver development. With the proposed solution, the device switching transient parameters can be optimized for switching loss (E_sw) reduction or EMI profile (related to di/dt and dv/dt) improvement.