Transient Simulations of High-Speed Channels Using CNN-LSTM With an Adaptive Successive Halving Algorithm for Automated Hyperparameter Optimizations

Transient simulations of high-speed channels can be very time intensive. Recurrent neural network (RNN) based methods can be used to speed up the process by training an RNN model on a relatively short bit sequence, and then using a multi-step rolling forecast method to predict subsequent bits. However, the performance of the RNN model is highly affected by its hyperparameters. We propose an algorithm named adaptive successive halving automated hyperparameter optimization (ASH-HPO), which combines successive halving, Bayesian optimization (BO), and progressive sampling to tune the hyperparameters of the RNN models. Modifications are proposed to the successive halving and progressive sampling algorithms for better efficiency on time series data. The ASH-HPO algorithm trains on smaller dataset subsets initially, then expands the training dataset progressively and adaptively adds or removes models along the process. In this paper, we use the ASH-HPO algorithm to optimize the hyperparameters of convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and CNN-LSTM networks. We demonstrate the effectiveness of the ASH-HPO algorithm using a PCIe Gen 2 channel, a PCIe Gen 5 channel, and a PAM4 differential channel. We also investigate the effects of several settings and tunable variables of the ASH-HPO algorithm on its convergence speed. As a benchmark, we compare the ASH-HPO algorithm to three state-of-the-art HPO methods: BO, successive halving, and hyperband. The results show that the ASH-HPO algorithm converges faster than the other HPO methods on transient simulation problems.


I. INTRODUCTION
The fast and accurate transient simulation of electrical interconnects consisting of high-speed channels plays an important role in the design and verification of electrical devices and circuits [1]-[3]. With the continuous increase in data throughput and operating frequency, interconnect signal integrity has become an important consideration in meeting the stringent timing and voltage requirements of modern systems. Thus, modeling and simulation for signal integrity have garnered increasing attention over the last decade.
The associate editor coordinating the review of this manuscript and approving it for publication was Wiren Becker.
In the analysis of high-speed systems, an eye diagram contains information that allows the engineers to evaluate key performance metrics for signal integrity such as the amount of intersymbol interference, noise margin, timing jitter, and timing sensitivity. Traditionally, an eye diagram is generated by overlaying the voltage waveform obtained from a transient simulation using a circuit simulator [4]. However, this is a computationally expensive process especially when the transient simulation is performed over long bit sequences on highly non-linear and complex channels. Several methods such as the peak distortion analysis (PDA) [5], statistical eye (StatEye) analysis [6], and edge response analysis [7] are developed to perform fast eye diagram analysis for linear time-invariant (LTI) systems without generating the long voltage waveforms. However, these methods are only applicable to LTI systems while the behavior of high-speed channels can be non-linear due to the presence of non-linear components such as the I/O drivers and receivers.
Recently, more modern transient simulation and eye diagram modeling techniques have been pursued, owing to the advancements in surrogate modeling and machine learning techniques. The main advantage of these modeling techniques is that the model, once developed, can be reused for future computational savings, because prediction with the model is much faster than training it. For example, [8] uses a short transient simulation to train a polynomial chaos surrogate model, which is then used to estimate the jitter and eye diagram of the output signal, while [9] uses Bayesian optimization to perform a worst-case analysis of the eye diagram. In addition, neural network based techniques have also been developed, such as the application of the multilayer perceptron (MLP) neural network to the modeling of the eye diagram [10]-[13], jitter [14], channel equalization [15], and physical parameters such as resistance, conductance, inductance, and capacitance [16]. However, the MLP lacks the ability to capture sequential information. For applications that require it, the recurrent neural network (RNN) is often used instead. The RNN is a type of artificial neural network (ANN) that specializes in modeling time series data, and it has also been used in the field of circuit design, especially in the modeling of non-linear devices such as CMOS receivers [17], power amplifiers [18]-[21], RF mixers [22], rectangular waveguides, and microstrip filters [23]. In the area of transient modeling, [24], [25] apply the long short-term memory (LSTM) network, a type of RNN, to perform fast transient simulations of high-speed channels. The LSTM network is trained on a short initial bit sequence, and it is then used to predict up to hundreds of thousands of bits. It is shown that the LSTM network outperforms a plain RNN in terms of the convergence rate and accuracy when both of them have the same memory length in various voltage waveform modeling tasks.
Although recent results have shown the potential of machine learning methods versus their traditional simulation counterparts, the amount of time spent on designing the networks and tuning their hyperparameters is often overlooked. A hyperparameter is a configuration variable whose value cannot be determined from training and must be set before training. The performance of a neural network depends not only on the quality of the training data, but on its hyperparameter choices as well, and a bad set of hyperparameters can degrade the accuracy of the neural network. The tuning process of these hyperparameters can be time consuming and often relies on expert engineering knowledge and know-how. Thus, automated hyperparameter optimization (HPO) techniques have been developed, which aim to find the optimal hyperparameter choices for the architecture, regularization, and optimization of the neural networks. More information on the field of HPO can be found in [26].
In this research, various types of HPO methods including Bayesian optimization (BO), successive halving, and hyperband are studied. BO is one of the most popular HPO strategies and has been successfully applied to the field of signal integrity modeling [27]. The BO algorithm requires a surrogate model of the objective function, and each evaluation of the objective function usually involves training a neural network and computing its validation loss. Although BO has a good reputation for requiring only a small number of objective function evaluations, each evaluation can still be computationally expensive for large datasets or complex ANN models, since each evaluation involves training a model to completion. This also means that BO will waste some resources on bad configurations, as it relies entirely on the surrogate model to avoid creating bad models in the first place. Bandit-based methods such as successive halving [28] and hyperband [29] avoid this problem by evaluating the configurations using partially trained models. Successive halving is a simple yet powerful HPO method where the initial budget is split evenly across all configurations, and successively, the budget for the half that performs worse is removed while the budget for the other half is doubled until only one configuration is left. Thus, the resources spent on bad configurations are reduced since these configurations are removed in the early iterations. In addition, some researchers have proposed methods to evaluate the configurations based on smaller dataset subsets. For example, [30] proposes a progressive sampling-based BO algorithm which uses only a small subset of data for the objective function evaluation during the initial rounds to drop unpromising configurations, and allocates more resources to the promising configurations with larger datasets.
However, the generation of the dataset subsets for the progressive sampling-based BO algorithm involves a random sampling strategy, which destroys the ordering of time series data. This makes it unsuitable for transient simulation modeling, where it is important to preserve the original sequence of the values.
In this paper, we propose an HPO framework named adaptive successive halving HPO (ASH-HPO) that combines BO and the successive halving algorithm, while adapting the progressive sampling method and modifying it to preserve the original sequence of the time series data. The ASH-HPO algorithm is developed to perform the HPO task efficiently for sequential models used in time series forecasting. Instead of sampling randomly, the progressive sampling is modified to sample along the data sequence starting from the first time step, and to remove part of the training data from the initial time steps when the size of the training data gets too large. We apply the proposed ASH-HPO algorithm to perform HPO for the CNN, LSTM, and CNN-LSTM networks. The CNN-LSTM architecture uses convolutional neural network (CNN) layers for feature extraction and LSTM layers for sequence prediction [31], and it has been reported that CNN-LSTM outperforms LSTM in various time series prediction tasks such as household energy consumption prediction [32], visual recognition and description [33], rainfall prediction [34], and air pollution forecasting [35]. We demonstrate the robustness of the ASH-HPO algorithm using various problems for transient simulation modeling and compare its performance to the BO, successive halving, and hyperband algorithms. Moreover, we also present the use of a multi-step rolling forecast method to reduce the prediction time of our models.

II. HYPERPARAMETER OPTIMIZATION (HPO) METHODS
The HPO process is the process of selecting a set of optimal hyperparameters, H, within the search space, S, for a machine learning algorithm. Grid search is the most basic and traditional HPO method. However, it is only applicable when the number of hyperparameters in S is low, as it suffers from the curse of dimensionality: the number of trials required to fully explore S grows exponentially with the number of hyperparameters. For a search space with seven hyperparameters, each with three possible values, the total required number of trials is $3^7 = 2187$, which is a prohibitively large number. Random search can explore S much more effectively than grid search and usually outperforms the latter because grid search can sometimes allocate too many trials to unimportant hyperparameters [26]. Both methods are completely ignorant of their past trials, which means that their past decisions do not help them make better decisions in the future. This causes overspending of resources on evaluating bad hyperparameters. Thus, they are often replaced with more advanced HPO methods such as BO, successive halving, and hyperband.
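The trial count quoted above can be checked with a short sketch. The search space below (seven hyperparameters named `h0` to `h6`, each with three candidate values) and the helper `random_search` are hypothetical and only illustrate the difference between exhaustive enumeration and a fixed random budget.

```python
import itertools
import random

# Hypothetical search space: seven hyperparameters, three candidate values each.
search_space = {f"h{i}": [0, 1, 2] for i in range(7)}

# Grid search enumerates every combination, so the count grows as 3**7.
grid = list(itertools.product(*search_space.values()))
print(len(grid))  # 2187


def random_search(space, n_trials, seed=0):
    """Draw n_trials configurations independently and uniformly from the space."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_trials)]


# Random search fixes the number of trials regardless of dimensionality.
trials = random_search(search_space, n_trials=20)
print(len(trials))  # 20
```

Neither strategy uses the outcome of earlier trials when choosing the next one, which is the limitation the methods in the following subsections address.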

A. BAYESIAN OPTIMIZATION (BO)
BO is well suited to optimizing expensive objective functions, which makes it a very good candidate for the HPO of deep neural networks. The steps in performing Bayesian optimization on a black box function are: (step 1) initialize the process by sampling the hyperparameter space to generate a small number of observations; (step 2) create a surrogate model by fitting it to the observations generated previously; (step 3) select the next point to sample in S as the point that maximizes the acquisition function, which balances exploitation and exploration in that the acquisition value of a point is high when the predicted value from the surrogate model is high and when the uncertainty about the predicted value is high; (step 4) evaluate the objective function at the newly sampled point and add it to the observations. Steps 2 to 4 are repeated until the goal is met. Tutorials on Bayesian optimization can be found in [36]. In our work, BO is used to generate models in the first iteration of successive halving, and to generate additional models later in randomly selected iterations. Various acquisition functions are investigated, including expected improvement (EI), entropy search, upper confidence bound (UCB), and probability of improvement (PI) [37].
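To make step 3 concrete, the following is a minimal sketch of three of the acquisition functions named above, written in their standard textbook forms for a maximization problem. It assumes the surrogate model has already supplied a predictive mean `mu` and standard deviation `sigma` at a candidate point; the function names and the default `kappa` are illustrative and are not taken from the paper or from any particular library.

```python
import math


def _normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """EI over the best observed value, for a maximization problem."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * _normal_cdf(z) + sigma * _normal_pdf(z)


def probability_of_improvement(mu, sigma, best):
    """Probability that the candidate exceeds the best observed value."""
    if sigma <= 0.0:
        return 1.0 if mu > best else 0.0
    return _normal_cdf((mu - best) / sigma)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Mean plus kappa standard deviations; kappa trades off exploration."""
    return mu + kappa * sigma


# Hypothetical surrogate predictions (mean, std) at three candidate points.
candidates = [(0.80, 0.05), (0.70, 0.30), (0.85, 0.01)]
best_so_far = 0.82
scores = [expected_improvement(m, s, best_so_far) for m, s in candidates]
next_index = scores.index(max(scores))
print(next_index)
```

All three functions grow with both the predicted mean and the predictive uncertainty, which is how the exploitation and exploration trade-off in step 3 arises.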

B. SUCCESSIVE HALVING
Successive halving is conceptually very simple. For a fixed budget, B, and number of initial configurations, n, allocate the budget evenly across every configuration so that every configuration gets B/n resources. The budget refers to the number of epochs in this work, but it can also refer to other types of resources such as the size of the training data, the training time, and the number of features. At the end of each iteration, remove the worse-performing half of the configurations, double the resource of every surviving configuration, and train them again. The process repeats until only one configuration is left. However, it is unclear whether to use a larger or smaller n. When n is small, each configuration is given more resources, which can be problematic when the resources are wasted on bad configurations. When n is large, each configuration is given fewer resources, which can also cause problems when training is terminated before potentially good configurations with slow convergence rates start to reveal themselves. The successive halving algorithm can be tuned to remove more models in every iteration by setting a higher value of the elimination factor, λ, which determines the proportion of configurations eliminated in each iteration. Generally, approximately (λ − 1)/λ of the models are removed in each iteration and the budget of every surviving configuration is increased by a factor of λ.
Despite these weaknesses, successive halving is very suitable in the training of time series data. The time series dataset can be broken down into several dataset subsets and the original sequencing of dataset is preserved by using the first subset in the first successive halving iteration, the second subset in the second iteration, and so on. The data are arranged in such a way that the last surviving configuration learns the full sequence, and other models only need to learn part of the sequence in each iteration of the successive halving process to reduce the training time per epoch. In our work, we apply the successive halving method and propose several modifications so that it is more efficient on time series problems.
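The basic successive halving loop described in this subsection can be sketched as follows. This is the original algorithm, not the modified version proposed later; the `evaluate` callback, the starting budget, and the toy objective are all hypothetical stand-ins for training a model and returning its validation loss.

```python
import math


def successive_halving(configs, evaluate, budget_per_config, lam=2):
    """Keep roughly 1/lam of the configurations per iteration.

    evaluate(config, budget) must return a loss (lower is better).
    """
    survivors = list(configs)
    budget = budget_per_config
    while len(survivors) > 1:
        # Rank the survivors by their loss at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, math.ceil(len(ranked) / lam))
        survivors = ranked[:keep]
        # Each survivor receives lam times more budget in the next round.
        budget *= lam
    return survivors[0]


# Toy objective: the loss shrinks with budget and is smallest near 0.3.
def toy_loss(config, budget):
    return abs(config - 0.3) + 1.0 / budget


configs = [0.1 * i for i in range(8)]
best = successive_halving(configs, toy_loss, budget_per_config=1)
print(best)
```

With eight configurations and λ = 2, the loop evaluates 8, 4, and 2 configurations at budgets 1, 2, and 4 before returning the last survivor.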

C. HYPERBAND
Successive halving is based on an assumption that the best configurations are more likely to be in the top half than the bottom half of the configurations after a small number of training iterations. However, there are some exceptions to that. For example, when the hyperparameter to be optimized is the learning rate, a model with a smaller learning rate may lag behind the rest of the pack when the number of training iterations is small, and still outperform the rest of the configurations when the number of training iterations is large.
In order to combat this problem and the problem of the B/n trade off, many researchers have switched their attention to the hyperband algorithm [29]. Hyperband is a hedging strategy that divides the whole HPO problem into several successive halving problems named brackets, each exploring a unique value of n for a fixed B. The user needs to define two inputs to the hyperband algorithm, R and λ. R is the maximum resource that can be allocated to a single configuration in any iteration, and λ is the elimination factor. In this work, we compare the ASH-HPO algorithm to the hyperband algorithm using three applications.

III. ADAPTIVE SUCCESSIVE HALVING AUTOMATED HYPERPARAMETER OPTIMIZATION (ASH-HPO)
In this section, our proposed ASH-HPO algorithm is explained in detail. Subsection III.A explains the method for converting the time series data to a supervised learning problem. Subsection III.B explains the method for training the neural models. Subsection III.C explains the role of BO in the ASH-HPO algorithm. Subsection III.D explains the proposed modifications to the successive halving algorithm and its integration into the ASH-HPO algorithm. Subsection III.E explains the method for retraining the badly performing models. Subsection III.F explains the method for expanding the search space, S. Subsection III.G introduces penalties to eliminate the slow training models and badly performing models. Finally, subsection III.H summarizes the overall workflow of the ASH-HPO algorithm.

A. PROGRESSIVE SAMPLING FOR TIME SERIES DATA
This subsection describes how the data are processed into dataset subsets for the ASH-HPO algorithm to make it more efficient for time series predictions. The accuracy of the surrogate model is important for any type of surrogate model based HPO framework because a good surrogate model can have a better chance of suggesting a set of good hyperparameters. The more trials we can afford, the more accurate our surrogate model is because it will have more past experience to learn from. However, each trial is usually computationally expensive when it involves training a model from scratch on a large dataset. Thus, a progressive sampling strategy is used so that we start the HPO process with a small training set, and its size is increased slowly as the process continues. However, training the models on a small subset of data can downgrade the quality of the surrogate model. Thus, new models will be generated using the BO algorithm, not only at the beginning when the data size is minimal, but also when the data size becomes larger. This way, there is a chance to correct the mistakes made by the surrogate model at the beginning of the algorithm.
In this work, a sliding window transformation method with a window length, or lookback, $W_{LB}$ is used to turn a time series dataset into a supervised learning problem. Through this process, the original time series data, $D_0$, is normalized to the range $[0, 1]$ and transformed into another dataset named $D$ with length $t_N$. We name the elements in $D$ as $d$, where $d_t$ represents the data of the $t$th time step such that $D = \{d_0, d_1, d_2, \ldots, d_t, \ldots, d_{t_N}\}$. A bracket represents an iteration in the progressive sampling process, and the training and validation samples are different for each bracket. We introduce a few terms here. $D_t(i)$ and $D_v(i)$ are the training and validation data in the $i$th bracket, where $D_t(i), D_v(i) \subset D$. $L_{t0}$ and $L_{v0}$ are the unit sizes of the training and validation samples, $L_t(i)$ is the size of the training samples in the $i$th bracket, and $L_{\max}$ is the maximum allowable size of the training samples, where $L_{\max}$ must be divisible by $L_{t0}$. In this work, the size of $D_v(i)$ is fixed to one unit, $(1 \times L_{v0})$, regardless of the bracket number. The size of $D_t(i)$ is $L_{t0}$ in the first bracket, and it is increased by one unit, $(1 \times L_{t0})$, each time the bracket number increases until it reaches $L_{\max}$. After that, the training samples from the earlier time steps are removed so that $L_t$ does not exceed $L_{\max}$. Fig. 1 illustrates the sliding window process. As can be seen from the figure, $(1 \times L_{t0})$ of earlier time steps are removed from $D_t$ in bracket 4, $(2 \times L_{t0})$ of earlier time steps are removed from $D_t$ in bracket 5, and so on. This figure also visualizes the arrangements of $D_t(i)$ and $D_v(i)$ under the assumption that no new model is created after the first bracket. However, if a new model is created in any other bracket, the training data will always start at the beginning of all time steps even if $L_t(i)$ then exceeds $L_{\max}$.
This is to ensure that every new model can learn from the exact same dataset as the old models without any missing information. This method for arranging $D_t$ and $D_v$ is very similar to that of blocked cross validation. Since all the $D_v$ must be placed after the $D_t$ to prevent data leakage from the future, data arrangement methods that split $D_t$ and $D_v$ randomly, such as the regular k-fold cross validation method, are not suitable for time series data. Fig. 2 shows a flowchart for generating $D_t(i)$ and $D_v(i)$, and the function is named GenerateTrainingAndValidationData(). In this flowchart, a random positive integer between two and six decides in which bracket new models are generated by the BO algorithm, $t_a$ and $t_b$ are the beginning and the end of the training data sequence, where $L_t(i) = (t_b - t_a)$, and $t_c$ and $t_d$ are the beginning and the end of the validation data sequence. There is a special case when the counter is equal to this random integer, where new models are generated through BO and $t_a$ is set to one automatically so that the newly generated models can learn from the beginning of the dataset. We compute $t_a$, $t_b$, $t_c$, and $t_d$ for each bracket from these definitions.
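The data arrangement described in this subsection can be illustrated with a minimal sketch. It is not the authors' exact formulation: the helper names are hypothetical, indices are zero-based rather than one-based, and the validation block is assumed to start immediately where the training block ends, which is consistent with the requirement that validation data come after training data but is not stated explicitly in the text.

```python
def sliding_window(series, lookback):
    """Min-max normalize to [0, 1] and build (input window, next value) pairs."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    scaled = [(x - lo) / span for x in series]
    return [(scaled[t:t + lookback], scaled[t + lookback])
            for t in range(len(scaled) - lookback)]


def bracket_ranges(bracket, l_t0, l_v0, l_max, new_model=False):
    """Return (t_a, t_b, t_c, t_d) for a one-based bracket number.

    Assumption: training data end at bracket * l_t0 and validation data follow.
    """
    t_b = bracket * l_t0
    # Drop the earliest samples once the training block would exceed l_max,
    # unless a new model was just created and must see the full history.
    t_a = 0 if new_model else max(0, t_b - l_max)
    t_c = t_b
    t_d = t_c + l_v0
    return t_a, t_b, t_c, t_d


pairs = sliding_window([0, 5, 10, 15, 20], lookback=2)
print(pairs[0])  # ([0.0, 0.25], 0.5)

for i in range(1, 6):
    print(i, bracket_ranges(i, l_t0=100, l_v0=20, l_max=300))
```

With `l_max` equal to three units, bracket 4 drops one unit of early samples and bracket 5 drops two, matching the behavior described for Fig. 1.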

B. TRAINING THE MODELS
In this subsection, the method used to train the models with a minimal amount of budget is explained. In order to keep the training time of the models roughly consistent across all brackets, the number of epochs, $E_p$, is set to be larger when the size of $D_t(i)$ is small, and smaller when the size of $D_t(i)$ is large. Generally, $E_p$ is computed as in (5), where $E_{p0}$ is a positive integer, len(·) is the length of a list, and $M_i$ is the collection of models in the $i$th bracket. In this work, the total number of models generated in the first bracket, len($M_1$), is $K_0$, and the value of $E_{p0}$ is one. The early stopping technique is implemented to prevent overfitting during training. We try to train every model for at least 10 epochs before applying the early stopping technique. This means that we do not use early stopping when $E_p \leq 10$, and only use early stopping when $E_p > 10$. The patience, $P$, is defined as the number of epochs for which we allow the model to keep training while the validation loss does not improve. $P$ is calculated using min(·), the smallest number in a list, and ceil(·), rounding towards positive infinity. With this, we ensure that we do not terminate the training prematurely when $E_p$ is small, while preventing overfitting and saving the budget when $E_p$ is large. The maximum value of $P$ is limited to six to reduce the overall training time. In addition, the training time per epoch per length of training dataset, $t_p$, of the $k$th model is recorded as in (7), where $t_m$ is the total training time of that model. Fig. 3 shows the flowchart of the training process named TrainModel(). We use the mean squared error (MSE) as the loss function and the Adam optimizer [38] to update the network weights.

C. BAYESIAN OPTIMIZATION IN THE BRACKETS
In this work, BO is used for two purposes: first, to generate models in the first bracket of the proposed ASH-HPO algorithm, and second, to add models in random brackets so that the surrogate model of the later brackets can contribute by generating some models as well. The surrogate model of the later brackets is more reliable than that of the first bracket since it is updated on a larger training dataset. Thus, adding models generated by the surrogate model of the later brackets can help to improve the robustness of the ASH-HPO algorithm by correcting the mistakes made in the first bracket. In this work, we use a Gaussian process regression (GPR) model as the default surrogate model. In the first bracket, we create and train 10 models whose hyperparameter values are generated by random sampling, and we record their hyperparameters, validation losses, and training times in $H$, $E_v$, and $T_p$, respectively, e.g., $t_p(1), t_p(2), \ldots, t_p(10) \in T_p$. We then fit the GPR model, $G$, to the observations made in the first ten models. In other words, we use $M_1$ to $M_{10}$ to initialize the GPR model, and this model is used by the BO. The BO process starts by generating an additional model, with its hyperparameters selected by $G$.
Once the training of that model is complete, $H$, $M$, $E_v$, and $T_p$ are appended with the respective new members, and $G$ is updated with the new observations before it is used to select the hyperparameter values of the next model. This process is repeated until we generate a total of $K_0 - 10$ models through the BO process, which means the total number of models is now $K_0$ since there are 10 initial models. Fig. 4 shows the flowchart for the function that builds the initial models $M_1$ to $M_{10}$, named BuildInitialModels().
A variable counter is assigned zero initially, and increased by one at the end of each bracket. As described previously, BO is performed when the counter reaches the random integer threshold introduced in Subsection III.A, after which the counter is reset to zero. Each subsequent time BO is called after the initial models were created, it generates two fewer models than the previous time, starting from 10 and decreasing to one. Let $K_m$ be the number of models generated during the $m$th time the BO is performed. In addition, we double the $E_p$ calculated using (5) during every BO process after the first one ($m > 1$). This is because the models generated during the earlier stages have been trained with more $E_p$ than the newly generated models; thus, doubling the $E_p$ prevents a premature termination of training. This also makes the comparison between the old and new models fairer. The open source Python package pyGPGO is used to implement Bayesian optimization [39].

D. MODIFIED SUCCESSIVE HALVING
In this work, we propose a modification to the successive halving algorithm where each iteration, or bracket, has two stages: an odd stage, which is a determination stage, and an even stage, which is an elimination stage. In the odd stage, we train every surviving model and record its validation loss. In the even stage, we remove the underperforming models by a factor of $\lambda$ and continue training the surviving models. There is a special case where stage one in bracket one is used for BO to generate models. Also, when the counter reaches its random threshold, the even stage in that bracket is also used to perform BO to generate new models after eliminating $(\lambda - 1)/\lambda$ of the badly performing models so that only $1/\lambda$ of the models survive. Fig. 5 compares the number of surviving models of the modified and original successive halving algorithms across the brackets. It can be seen that the modified successive halving ensures that each model is trained on two different training subsets, from the current bracket and the previous bracket, before being removed, except in the first elimination stage where models are only trained on the dataset of stage 1. This improves the stability of the algorithm at the earlier stages when the training dataset is very small. Fig. 6 shows an example where $L_{\max}$ is $(3 \times L_{t0})$, $K_1 = 16$, $\lambda = 2$, $K_2 = 4$ during the second BO, $K_3 = 2$ during the third BO, and the counter threshold is initially 4. As can be seen, 16 models are generated in stage one in bracket one through the combination of random search and BO. At the end of the even stage in each bracket, or to be exact after eliminating half of the models, the counter is increased by one. Then, in stage eight of bracket four, where the counter reaches the threshold, four additional models are generated. After that, the counter is reset to zero and the threshold is given a new random value of two. The counter reaches the threshold once again in stage 12 of bracket 6, and only two new models are generated.
However, it can be a challenge to decide how to update the surrogate model, $G$. Since we remove the worst performing models in every bracket, the validation losses of the models that are removed in the earlier brackets are not updated anymore. Thus, the $E_v$ of models that are terminated in the earlier brackets no longer accurately represent their actual validation losses. This is because, had the models survived, they would have been trained with larger training datasets and with more epochs. Therefore, it is reasonable to assume that their actual validation losses are much lower than their recorded $E_v$. In order to solve this problem, we employ a method similar to that found in [30], where we use a term named error rate, $E_r$, which is defined as the ratio between the mean of the validation losses of the surviving models in the $j$th stage, $\mu_j$, and that of the previous stage, $\mu_{j-1}$, and can be computed as
$$E_r = \begin{cases} \mu_j / \mu_{j-1}, & \text{if more than four models survive} \\ 1, & \text{otherwise.} \end{cases} \quad (9)$$
Every time a model, $M_k$, is deleted from $M$, we delete its validation loss $E_v(k)$ from $E_v$ as well. In order to keep track of the validation losses of the deleted models, we create a separate set that keeps all the validation losses of both the deleted and the surviving models. At the end of each stage, we update the validation losses of the removed models using $E_r$, where $E_v(k)$ is the validation loss of the $k$th model. As shown in (9), we set $E_r$ to one if the number of surviving models is less than or equal to four because $E_r$ can be unstable when the number of surviving models is too small.
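The error-rate bookkeeping can be sketched as follows. The ratio and the four-model guard follow the description above. The update rule for removed models is an assumption made for illustration only: the sketch multiplies each removed model's stored loss by $E_r$, which is one natural reading of the text but is not confirmed by it. The function names and the dictionary layout are hypothetical.

```python
def error_rate(mean_current, mean_previous, n_survivors, min_survivors=4):
    """Ratio of mean validation losses between consecutive stages.

    Returns 1.0 when too few models survive for the ratio to be stable.
    """
    if n_survivors <= min_survivors:
        return 1.0
    return mean_current / mean_previous


def rescale_removed_losses(all_losses, removed_ids, e_r):
    """ASSUMPTION: removed models' stored losses are scaled by e_r.

    all_losses maps model id -> validation loss for deleted and surviving models.
    """
    return {k: (v * e_r if k in removed_ids else v)
            for k, v in all_losses.items()}


# Hypothetical stage: surviving models improved on average by a factor of two.
e_r = error_rate(mean_current=0.02, mean_previous=0.04, n_survivors=8)
all_losses = {0: 0.10, 1: 0.20, 2: 0.02}
updated = rescale_removed_losses(all_losses, removed_ids={1}, e_r=e_r)
print(e_r, updated)
```

Under this reading, a removed model's stored loss tracks the average improvement of the survivors, so the surrogate model does not treat early eliminations as worse than they would have been with further training.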

E. IMPROVING THE BAD MODELS
In this subsection, the process of identifying and retraining the bad models, and removing those that fail to improve, is explained. We only search for bad models when a special condition, $E_r > 1$, occurs. A value of $E_r < 1$ means that the models have improved on average compared to the previous stage. Conversely, a value of $E_r > 1$ means that the average validation loss of the current stage is worse than that of the previous stage. We define a bad model as a model whose validation loss is greater than $\mu_j$. Once a bad model is found, it is retrained with the function TrainModel(). If the model performance improves after the first attempt, the model is updated. Otherwise, we make up to two more attempts to improve the model. If the model performance does not improve after the third attempt, the model is removed. This process is completed after we iterate through all the bad models. If at least one model is removed, we increase $L_{\max}$ by $L_{t0}$. This is because we have detected at least one model that cannot be improved by training, which can indicate a deficiency of training data. Thus, we try to use larger training datasets in the future brackets. The flowchart for improving the bad models is presented in Fig. 7 and is named ImproveBadModels().
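The retraining logic of this subsection can be sketched as below. This is a simplified illustration, not the flowchart in Fig. 7: models are represented only by an identifier and a validation loss, and `retrain` is a hypothetical callback standing in for TrainModel() that returns the new validation loss.

```python
def improve_bad_models(losses, retrain, e_r, max_attempts=3):
    """Retrain models whose loss exceeds the mean; drop those that never improve.

    losses: dict mapping model id -> validation loss of surviving models.
    retrain: hypothetical callback, retrain(model_id) -> new validation loss.
    Returns (updated losses, list of removed model ids).
    """
    if e_r <= 1.0:
        # Models improved (or held steady) on average; nothing to do.
        return dict(losses), []
    mean_loss = sum(losses.values()) / len(losses)
    updated, removed = dict(losses), []
    for model_id, loss in losses.items():
        if loss <= mean_loss:
            continue  # not a bad model
        for _ in range(max_attempts):
            new_loss = retrain(model_id)
            if new_loss < loss:
                updated[model_id] = new_loss
                break
        else:
            # No attempt improved the model, so it is removed.
            del updated[model_id]
            removed.append(model_id)
    return updated, removed


losses = {"a": 0.1, "b": 0.2, "c": 0.9}
l_max, l_t0 = 300, 100
updated, removed = improve_bad_models(losses, lambda _id: 1.0, e_r=1.3)
if removed:
    l_max += l_t0  # use larger training sets in future brackets
print(updated, removed, l_max)
```

In this toy run only model `c` is above the mean loss, its three retraining attempts all fail, and the maximum training size grows by one unit as a result.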

F. EXPANDING THE SEARCH SPACE
Traditionally, the search space of an HPO process is fixed throughout the process. In this work, we propose a method to expand the search space to make the selection of the initial search space less critical to the final performance of the ASH-HPO algorithm. For an $N$ dimensional search space, $S = \{S_1, S_2, \ldots, S_i, \ldots, S_N\}$, where $S_i$ represents the search range of the $i$th hyperparameter, let $S_l(i)$ and $S_u(i)$ be the lower and upper bounds of $S_i$ respectively, where $\{S_l(i), S_u(i)\} = S_i$. If the $i$th hyperparameter, $H_k(i)$, of the best performing model (we assume that the $k$th model is the best performing model) is close to $S_l(i)$ or $S_u(i)$, we expand $S_i$ in the direction of $S_l(i)$ or $S_u(i)$ by 10%. That is, the lower bound is moved downward if $H_k(i)$ is close to $S_l(i)$, and the upper bound is moved upward if $H_k(i)$ is close to $S_u(i)$. For certain hyperparameters, conditions are also imposed to prevent nonphysical values. For example, the learning rate and the number of hidden neurons must be greater than zero.
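A minimal sketch of the search-space expansion is given below. The 10% expansion follows the text, interpreted here as 10% of the current range width. The closeness criterion is an assumption: the sketch treats the best value as close to a bound when it lies within 10% of the range width from that bound (the `margin` parameter). The `floor` argument is a hypothetical way to express the physical constraints mentioned above.

```python
def expand_range(lower, upper, best_value, margin=0.1, growth=0.1, floor=None):
    """Widen [lower, upper] when the best value sits near one of the bounds.

    ASSUMPTIONS: 'close' means within margin * (upper - lower) of a bound,
    and the bound moves outward by growth * (upper - lower).
    """
    span = upper - lower
    new_lower, new_upper = lower, upper
    if best_value - lower <= margin * span:
        new_lower = lower - growth * span
    if upper - best_value <= margin * span:
        new_upper = upper + growth * span
    if floor is not None:
        # Keep the range physical, e.g. a positive number of hidden neurons.
        new_lower = max(new_lower, floor)
    return new_lower, new_upper


# Best model uses 15 hidden neurons in a hypothetical range of [10, 110].
print(expand_range(10, 110, best_value=15))
# A best value in the middle of the range leaves the bounds unchanged.
print(expand_range(10, 110, best_value=60))
```

Expanding only toward the bound that the best model approaches keeps the search space from growing in directions that have shown no promise.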

G. PENALTIES
In order to guide our surrogate model, we impose penalties on models with unwanted characteristics. First, we penalize models with long training times. Let $t_p(k)$ be the training time per epoch per length of training dataset of the $k$th model, which is calculated as in (7). Usually $t_p(k)$ is in the order of milliseconds. For this penalty, the validation loss of the $k$th model is updated as a function of $t_p(k)$, where $E_v(k)$ is the validation loss of that model before the penalty, and $E_v'(k)$ is the validation loss of that model after the penalty. In order to enhance the impact of this penalty, $E_v'$ is also used to decide the surviving models in every elimination stage so that the slow training models have higher chances of being eliminated. Second, we penalize the bad models and their neighbors. As discussed earlier, we remove $(\lambda - 1)/\lambda$ of the worst performing models at the end of every even stage. Then, we also select the worst 10 percent of the recorded validation losses and append the original indices of those members to a special list named $I_{bad}$. The $k$th model is identified as a neighbor to another model, the $m$th model, when the Hamming distance between them, $D_{ham}(k, m)$, is less than a certain threshold $\tau$ ($\tau = 3$ in this work). The Hamming distance between two models in $S$ with $N$ hyperparameters is computed as
$$D_{ham}(k, m) = \sum_{i=1}^{N} \mathbb{1}\left[H_k(i) \neq H_m(i)\right],$$
where $H_k(i)$ and $H_m(i)$ are the $i$th hyperparameters of the $k$th and $m$th models. If the $k$th model is either a bad model or a neighbor to at least one bad model, we multiply its validation loss after the first penalty, $E_v'(k)$, by a penalty factor, $F_p$ ($F_p = 1.2$ in this work). Otherwise, we keep its validation loss the same. We compute the validation loss after the second penalty, $E_v''(k) \in E_v''$, as follows:
$$E_v''(k) = \begin{cases} F_p \, E_v'(k), & \text{if the } k\text{th model is in } I_{bad} \text{ or is a neighbor of a model in } I_{bad} \\ E_v'(k), & \text{otherwise.} \end{cases}$$
Basically, both penalties modify the optimization problem from optimizing $E_v$, which is the validation loss, to optimizing $E_v''$, which is a function of the validation loss, the training time, and how close the hyperparameters are to the bad configurations.
The first penalty guides G away from slower-training configurations, and the second penalty guides G away from badly performing configurations and other configurations that are very close to them. Table 1 shows how the values of $E_v(k)$ and $t_p(k)$ affect the value of $E'_v(k)$, which is directly related to the objective function to be optimized. It can be seen that the model with lower $E_v(k)$ and lower $t_p(k)$ has higher $E'_v(k)$. Fig. 8 shows the flowchart for implementing BO, named function BayesianOptimization().
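The second penalty described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hamming_distance` and `apply_neighbor_penalty` are hypothetical, each configuration is assumed to be a plain tuple of hyperparameter values, and the losses passed in are assumed to already include the first (training time) penalty, whose expression is not reproduced here.

```python
from typing import List, Sequence


def hamming_distance(h_k: Sequence, h_m: Sequence) -> int:
    """Count the hyperparameters on which two configurations differ."""
    if len(h_k) != len(h_m):
        raise ValueError("configurations must have the same number of hyperparameters")
    return sum(a != b for a, b in zip(h_k, h_m))


def apply_neighbor_penalty(
    configs: Sequence[Sequence],
    losses: Sequence[float],
    bad_indices: Sequence[int],
    tau: int = 3,      # neighbor threshold (tau = 3 in the paper)
    f_p: float = 1.2,  # penalty factor (F_p = 1.2 in the paper)
) -> List[float]:
    """Multiply the loss of every bad model, and of every neighbor of a bad model, by f_p."""
    bad = set(bad_indices)
    penalized = []
    for k, (cfg, loss) in enumerate(zip(configs, losses)):
        is_bad = k in bad
        # A neighbor is any model closer than tau (in Hamming distance) to a bad model.
        is_neighbor = any(
            hamming_distance(cfg, configs[m]) < tau for m in bad if m != k
        )
        penalized.append(loss * f_p if (is_bad or is_neighbor) else loss)
    return penalized


# Hypothetical example with four hyperparameters per model; model 0 is marked bad.
configs = [
    (1e-3, 2, 64, "relu"),
    (1e-3, 2, 64, "tanh"),   # differs from model 0 in one hyperparameter: a neighbor
    (1e-2, 1, 32, "elu"),    # differs in all four: not a neighbor
    (1e-4, 3, 128, "selu"),  # differs in all four: not a neighbor
]
penalized = apply_neighbor_penalty(configs, [4e-4, 2e-4, 1e-4, 3e-4], bad_indices=[0])
```

In this example only models 0 and 1 have their losses scaled by 1.2, so the surrogate model sees the region around the bad configuration as less attractive.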

H. OVERALL WORKFLOW
The remaining functions are summarized here. Fig. 9 shows a flowchart for training all surviving models, named function TrainAllModels(). This function calls function TrainModels() repeatedly to train every model in M. Fig. 10 shows a flowchart for the initialization of the overall workflow, named function Initialize(). This function defines every input to the ASH-HPO algorithm. Finally, the flowchart for the whole process is shown in Fig. 11. The ASH-HPO algorithm terminates under one of two conditions: when the goal is met (terminated and converged) or when the budget is used up (terminated but not converged). Each time the validation loss is updated, the ASH-HPO algorithm compares it to the goal and terminates if the validation loss is lower than the goal. The budget is limited to prevent the algorithm from running indefinitely.
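The two termination conditions can be illustrated with a short loop. This is only a sketch: `run_until_goal_or_budget` and its `step` callback are hypothetical names, the budget is assumed to be a single scalar (for example, seconds of training time), and the comparison uses ≤ to match the goal stated later for the numerical examples.

```python
from typing import Callable, Tuple


def run_until_goal_or_budget(
    step: Callable[[], Tuple[float, float]],
    goal: float,
    budget: float,
) -> Tuple[float, bool]:
    """Repeat `step` until the best validation loss meets `goal` or `budget` is spent.

    `step` performs one unit of work and returns (validation_loss, cost).
    Returns (best_loss, converged).
    """
    best = float("inf")
    spent = 0.0
    while spent < budget:
        loss, cost = step()
        spent += cost
        best = min(best, loss)
        if best <= goal:
            return best, True   # terminated and converged
    return best, False          # terminated but not converged
```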

IV. RNN, LSTM, CNN, AND CNN-LSTM
An RNN is a type of neural network that has an internal memory. Unlike a feedforward network, the RNN has a feedback loop connected to its past decisions, which means that its outputs from the previous time steps are used as inputs for the current time step. This is what allows it to have memory, and this memory is what makes it a strong candidate for dealing with time series data. RNNs can be very challenging to train due to the vanishing or exploding gradient problem [24], [40]. This is because the gradients in deeper layers are calculated as products of derivatives. When the network is trained over many time steps, these products can explode or vanish, depending on whether the individual gradients are greater than or less than one. Thus, RNNs may be unsuitable for long sequences and long-term dependencies.
The LSTM neural network is a type of RNN that was developed to solve the training problems in traditional RNNs. An LSTM cell has three gates: a forget gate, an input gate, and an output gate. The forget gate, $f_t$, decides which information needs attention and which can be ignored. The input gate, $i_t$, decides what information is relevant to update in the current cell state, $c_t$. The output gate, $o_t$, determines the value of the next hidden state, $h_t$. An LSTM network can be controlled to remember or forget the dependency on individual inputs by changing the values of $i_t$, $f_t$, and $o_t$. Further background on RNN and LSTM can be found in [40]-[42].
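For reference, the gate descriptions above correspond to the commonly used LSTM update equations shown below. The weight matrices $W$, $U$, the biases $b$, the input $x_t$, and the candidate state $\tilde{c}_t$ are notation assumed here for illustration; this is the standard textbook formulation and may differ in detail from the variant used in [40]-[42].

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product. Because $c_t$ is updated additively through $f_t$, gradients can flow across many time steps without being repeatedly squashed, which is what mitigates the vanishing gradient problem of the plain RNN.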
The CNN-LSTM is a hybrid network created by combining a CNN and an LSTM network [31]. The CNN-LSTM network consists of one or more one-dimensional convolution layers, followed by a max pooling layer, and the output is then flattened to feed into one or more LSTM layers. Finally, the outputs of these layers are fed into one or more fully-connected/dense layers. The last dense layer is the output layer, and its nodes have linear activation functions. The LSTM cell is the building block of the LSTM layers. A CNN model can be constructed in a similar way to the CNN-LSTM network by connecting its flatten layer directly to the first dense layer, since it has no LSTM layers.

V. NUMERICAL EXAMPLES
In this section, we present three examples to illustrate the strengths of the proposed ASH-HPO algorithm. The first example uses a PCIe Gen 2 channel, the second example uses a PCIe Gen 5 channel, and the third example uses a PAM4 differential channel. The effects of different settings of the algorithm, such as the penalties and the algorithm parameters, are also investigated. As a benchmark, the ASH-HPO algorithm is also compared to the BO, successive halving, and hyperband methods. All experiments in this paper are performed on a computer with an Intel Core i7-10700K processor at 3.8 GHz, 32 GB of RAM, and an NVIDIA GeForce RTX 2080 SUPER.

A. PCIe GEN 2 CHANNEL
The PCIe Gen 2 topology is shown in Fig. 13. $V_{TX}$ is the output voltage of the transmitter (TX), $V_{RX}$ is the input voltage to the receiver (RX), and $V_{TX0}$ is the output voltage of the TX when it is terminated with a 100 Ω resistor. The TX outputs a PRBS bit sequence at a rate of 5 Gbps with 8b/10b encoding. The voltage waveforms are sampled at 10 samples per bit during a transient simulation. The dataset for this example is generated from the voltage waveforms of a million bits, where we use the earlier bits for training and validation, and the remaining bits for testing. We define our problem as using $V_{TX0}$ and $V_{RX}$ from time steps $(t - 1)$ to $(t - W_{LB})$ to predict $Y_{RX}(t) \approx V_{RX}(t)$, where the look back window $W_{LB}$ = 200. A larger value of $W_{LB}$ gives better performance but requires more time for data preprocessing and training; we find 200 to be a suitable value, as it gives good performance without being computationally expensive.
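The problem formulation above amounts to slicing the two waveforms into overlapping windows. The following NumPy sketch shows one way to build such a dataset; the function name `make_windows` and the array layout (samples, window length, 2 channels) are assumptions made for illustration, and the optional `w_la` argument anticipates predicting more than one time step per window.

```python
import numpy as np


def make_windows(v_tx0, v_rx, w_lb, w_la=1):
    """Build supervised samples from two aligned, normalized waveforms.

    Each input holds V_TX0 and V_RX for time steps (t - w_lb) .. (t - 1);
    each target holds V_RX for time steps t .. (t + w_la - 1).
    Returns x with shape (n, w_lb, 2) and y with shape (n, w_la).
    """
    v_tx0 = np.asarray(v_tx0, dtype=float)
    v_rx = np.asarray(v_rx, dtype=float)
    if v_rx.ndim != 1 or v_tx0.shape != v_rx.shape:
        raise ValueError("v_tx0 and v_rx must be 1-D arrays of equal length")
    n = len(v_rx) - w_lb - w_la + 1
    if n <= 0:
        raise ValueError("waveform is too short for the requested windows")
    features = np.stack([v_tx0, v_rx], axis=-1)  # shape (T, 2)
    x = np.stack([features[i:i + w_lb] for i in range(n)])
    y = np.stack([v_rx[i + w_lb:i + w_lb + w_la] for i in range(n)])
    return x, y
```

With $W_{LB}$ = 200 every sample carries 400 input values, which is why a larger look back window increases preprocessing and training time. For long waveforms, a generator or strided view would avoid materializing all windows at once.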
During the prediction process, we use the rolling forecast method, which involves feeding the current prediction back into the window to make a prediction at the next time step. Fig. 14 shows the formulation of this problem and demonstrates the rolling forecast method that predicts a single time step into the future. This method can also be applied to multi-step forecasting problems by defining a look ahead window, $W_{LA}$, as the number of time steps predicted into the future at each step of the rolling forecast. For example, Fig. 15 demonstrates the rolling forecast prediction process for $W_{LA}$ = 4. The single time step rolling forecast method is simply the multi-step rolling forecast method with $W_{LA}$ = 1. The search space S is 12-dimensional and includes the learning rate, number of convolution layers, number of LSTM layers, number of LSTM nodes, dropout rate, filter size, kernel height, pool size, choice of activation function (rectified linear unit (ReLU), hyperbolic tangent (tanh), exponential linear unit (ELU), or scaled exponential linear unit (SELU)), number of dense layers, number of dense nodes, and batch size. $V_{TX0}$ and $V_{RX}$ are normalized before being fed to the neural networks. The goal is a validation loss ≤ 1e-4. All the experiments are repeated 10 times to obtain better estimates of the results.
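A minimal sketch of the multi-step rolling forecast follows. It assumes that $V_{TX0}$ is known for the whole sequence, that the first $W_{LB}$ samples of $V_{RX}$ are available as a warm-up, and that `predict` is any callable (for example, a wrapper around a trained network) mapping a `(w_lb, 2)` window to `w_la` future values. The name `rolling_forecast` and this interface are hypothetical.

```python
import numpy as np


def rolling_forecast(predict, v_tx0, v_rx_init, w_lb, w_la):
    """Predict V_RX for a whole sequence, w_la time steps per call to `predict`."""
    if w_la < 1:
        raise ValueError("w_la must be at least 1")
    v_tx0 = np.asarray(v_tx0, dtype=float)
    v_rx_init = np.asarray(v_rx_init, dtype=float)
    if v_rx_init.shape != (w_lb,):
        raise ValueError("v_rx_init must hold exactly w_lb warm-up samples")
    total = len(v_tx0)
    if total < w_lb:
        raise ValueError("v_tx0 is shorter than the look back window")
    v_rx = np.empty(total)
    v_rx[:w_lb] = v_rx_init
    t = w_lb
    while t < total:
        # The window mixes known V_TX0 values with earlier V_RX predictions.
        window = np.stack([v_tx0[t - w_lb:t], v_rx[t - w_lb:t]], axis=-1)
        step = np.asarray(predict(window), dtype=float).reshape(-1)
        if step.size != w_la:
            raise ValueError("predict must return exactly w_la values")
        n = min(w_la, total - t)  # the last call may overshoot the sequence end
        v_rx[t:t + n] = step[:n]
        t += n
    return v_rx
```

The number of calls to `predict` is the ceiling of (total − `w_lb`) / `w_la`, so a larger look ahead window reduces the number of predictions needed for the whole sequence. Because predictions are fed back as inputs, any prediction error is also fed back.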
We investigate the effects of different algorithm parameters and settings on the performance of the ASH-HPO algorithm. First, we investigate the effects of the penalties using four cases: no penalty, punish slow models only, punish bad models only, and both penalties. We set the number of initial models $K_0$ = 81, the elimination factor λ = 3, $L_{t0} = L_{v0}$ = 2k, acquisition function = EI, and initial $L_{max} = 3 \times L_{t0}$. Table 2 shows the results, and the convergence plots are shown in Fig. 16, where it can be seen that the case "both penalties" converges the fastest and the case "no penalty" converges the slowest. It can be seen from the table that the first penalty can reduce the training time despite using more training data, as the training time per time step is shorter, and the second penalty can reduce the amount of training data required to reach the goal. Combining both penalties gives the best results. Thus, we apply both penalties in the ASH-HPO algorithm for the rest of this paper.
Then, we investigate the effects of the $W_{LA}$ value ($W_{LA}$ = 1, 20, 50, 100) on the ASH-HPO algorithm. We maintain all other algorithm parameters from the case "both penalties" and change only the value of $W_{LA}$. The results are tabulated in Table 3. As the value of $W_{LA}$ increases, the ASH-HPO algorithm requires more time to converge due to the increase in complexity of the models. Increasing the value of $W_{LA}$ also increases the total number of time steps of training data required to reach the goal, which can slow down the convergence even more. However, larger $W_{LA}$ values can greatly improve the prediction time per time step because each prediction covers more time steps ahead, so fewer predictions are needed to predict the whole sequence. For the PCIe Gen 2 example, we use $W_{LA}$ = 20 because it gives the best trade-off between the training and prediction processes.
Next, we investigate the effects of the initial $L_{max}$ value using three cases: $L_{max} = 1 \times L_{t0}$, $L_{max} = 2 \times L_{t0}$, and $L_{max} = 3 \times L_{t0}$. In these cases, we maintain all other algorithm parameters. $L_{max}$ is a variable that decides the maximum length of the training subset in each bracket, and its value is increased when the function ImproveBadModels() detects any bad model that cannot be improved. The results are shown in Fig. 17. The case $L_{max} = 3 \times L_{t0}$ converges slightly faster than the other two and requires the least training data, using only 38k time steps compared to 79k time steps for the case $L_{max} = 1 \times L_{t0}$ and 62k time steps for the case $L_{max} = 2 \times L_{t0}$. The results also show that the convergence rate of the ASH-HPO algorithm is only marginally affected by the choice of initial $L_{max}$, as all three cases have very similar training times. A case with a smaller initial $L_{max}$ trains the models with a smaller data subset but requires more brackets to reach the goal, which increases the amount of data required for convergence.
Then, we compare the performance of the ASH-HPO algorithm when it is used with different types of neural networks. In this work, we investigate the CNN, LSTM, and CNN-LSTM models. Fig. 18 compares the convergence rates of the cases using the CNN, LSTM, and CNN-LSTM networks. We use $W_{LA}$ = 1 for all the cases, since higher values require very high computational time and memory for the LSTM case. We remove the hyperparameters of the LSTM layers from S for the CNN models, and we remove the hyperparameters of the convolutional layers from S for the LSTM models. We maintain the other algorithm parameters. It can be seen that the ASH-HPO algorithm with CNN-LSTM converges much faster than the ASH-HPO algorithm with LSTM. This is because the training time per epoch for the CNN-LSTM models is much shorter than that of the LSTM models. The convergence rate for CNN-LSTM is also slightly faster than that of the CNN. The CNN-LSTM models require only 8.01e-4 second on average to predict each time step. On the other hand, the LSTM models require 1.53e-2 second to predict one time step, which is approximately 19 times slower than the CNN-LSTM models. The prediction time of the CNN is about 6.34e-4 second per time step, which is slightly faster than the CNN-LSTM model. However, the testing loss of the CNN-LSTM model is 1.03e-4, which is lower than that of the CNN, 1.16e-4.
Next, we investigate the effects of different acquisition functions, namely EI, entropy search, UCB, and PI, on the performance of the ASH-HPO algorithm. The results are tabulated in Table 4. It can be seen that the ASH-HPO algorithm using EI outperforms the other acquisition functions, as it converges the fastest, requires the least training data to reach the goal, and has the lowest testing loss.
We also compare the proposed ASH-HPO algorithm to other existing methods. First, we compare the ASH-HPO algorithm to the BO algorithm without the progressive sampling strategy. For the BO method, we use 10 random configurations to initialize the algorithm, and we terminate the BO algorithm when the total number of models exceeds 100. Each model is trained up to $E_p$ = 100 with P = 6. We test the BO algorithm using 10 cases, each with a different training dataset length, $L_t$ = 5k, 10k, 15k, . . . , 50k, and a fixed validation dataset length, $L_v$, of 2k. The progressive sampling method is not applied to any of them. The convergence plots are shown in Fig. 19. We only show $L_t$ ≥ 30k in the figure because all the cases with $L_t$ < 30k fail to reach the goal. The graph shows that the ASH-HPO algorithm converges faster than all of the BO cases.
Next, we compare the ASH-HPO algorithm with the successive halving algorithm. We set the successive halving algorithm to start with 81 initial models, with λ = 3, $E_{p0}$ = 1, and P calculated as in (6). The successive halving algorithm is tested using 10 cases, each with a different $L_t$ value ($L_t$ = 5k, 10k, 15k, . . . , 50k) and an $L_v$ value of 2k. The convergence plots are shown in Fig. 20. We only show $L_t$ ≥ 35k in the figure because all the cases with $L_t$ < 35k fail to reach the goal. The graph shows that the ASH-HPO algorithm converges faster than all of the successive halving cases.
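For readers unfamiliar with the baseline, a generic successive halving loop is sketched below. This is the plain textbook algorithm, not ASH-HPO and not the exact baseline configuration: the name `successive_halving`, the `evaluate(config, epochs)` callback, and the choice to multiply the epoch budget by the elimination factor at every stage are assumptions for illustration.

```python
def successive_halving(configs, evaluate, eta=3, initial_epochs=1):
    """Keep the best 1/eta of the models at every stage until one remains.

    `evaluate(config, epochs)` returns the validation loss of `config`
    after training with the given epoch budget.
    Returns (best_config, schedule), where schedule lists
    (number_of_models, epochs) for every stage.
    """
    survivors = list(configs)
    if not survivors:
        raise ValueError("at least one configuration is required")
    epochs = initial_epochs
    schedule = []
    while True:
        schedule.append((len(survivors), epochs))
        # Rank the surviving models by validation loss at the current budget.
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs))
        if len(ranked) == 1:
            return ranked[0], schedule
        survivors = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta  # the survivors receive a larger training budget
```

Starting from 81 models with an elimination factor of 3, the stages contain 81, 27, 9, 3, and 1 models. Every stage trains on the same fixed training set, so the training length $L_t$ must be chosen by the user beforehand.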
Finally, we compare the ASH-HPO algorithm with the hyperband algorithm. We set the inputs of the hyperband algorithm to $K_0$ = 81 and λ = 3, with P calculated using (6). The hyperband algorithm is tested using 10 cases, each with a different $L_t$ value ($L_t$ = 5k, 10k, 15k, . . . , 50k) and an $L_v$ value of 2k. The convergence plots of the cases that reach the goal are shown in Fig. 21. The graph shows that the ASH-HPO algorithm converges faster than all of the hyperband cases.
In addition to the faster convergence shown in these examples, a main advantage of the ASH-HPO algorithm over the other three algorithms is that it does not require the user to input the value of $L_t$, since it uses a progressive sampling strategy. This is a significant advantage because different $L_t$ values can have a large impact on the convergence speed and testing loss, as can be seen from the examples, and the optimal $L_t$ value cannot be known without performing the actual training. In other words, the ASH-HPO algorithm is able to perform fast and accurate automated training of the neural models while limiting trial and error.
Since the ASH-HPO algorithm uses 38k training samples for training and validation, we use the remaining samples for testing. We use the CNN-LSTM model generated by the ASH-HPO algorithm with both penalties, initial $L_{max} = 3 \times L_{t0}$, acquisition function = EI, $W_{LB}$ = 200, and $W_{LA}$ = 20. Fig. 22 and Fig. 23 show the comparisons between the actual and predicted normalized $V_{RX}$ of the PCIe Gen 2 channel for time steps t = 1 to t = 5k and t = 9900k + 1 to t = 9905k, respectively. No significant accuracy degradation is observed from the first testing data subset to the last. The prediction speed is about 42 microseconds per time step. Fig. 24 compares the eye diagrams constructed using the voltage waveforms generated from a transient circuit simulator and from the CNN-LSTM network; the two are almost identical. The normalized eye heights for the simulated and predicted eye diagrams are both 0.343, while the eye widths for the simulated and predicted eye diagrams are $14.69 \times 10^{-11}$ second and $14.60 \times 10^{-11}$ second, respectively. The hyperparameter search space and the best configurations for the CNN, LSTM, and CNN-LSTM networks for the PCIe Gen 2 example are tabulated in Table 5.
FIGURE 26. True and predicted normalized $V_{RX}$ waveform of the PCIe Gen 5 channel for time steps t = 1 to t = 5k of the testing dataset, testing loss = 6.97e-5.
FIGURE 27. True and predicted normalized $V_{RX}$ waveform of the PCIe Gen 5 channel for time steps t = 9800k + 1 to t = 9805k of the testing dataset, testing loss = 7.13e-5.

B. PCIe GEN 5 CHANNEL
The PCIe Gen 5 topology used in this example is shown in Fig. 25. The TX outputs a PRBS bit sequence at a rate of 32 Gbps. The TX uses a feed-forward equalizer (FFE), whereas the RX uses a decision feedback equalizer (DFE) and a continuous time linear equalizer (CTLE). There are two transmission channels, both with 36 dB losses, and a retimer connects the two. $V_{TX}$ is the output voltage of the TX, $V_{RX}$ is the output voltage of the RX, and $V_{TX0}$ is the output voltage of the TX when it is terminated with a 100 Ω resistor. The voltage waveforms are sampled at 20 samples per bit during a transient simulation. The modeling problem is defined as using $V_{TX0}$ and $V_{RX}$ from time steps $(t - 1)$ to $(t - W_{LB})$ to predict the values of $V_{RX}$ from time steps $t$ to $(t + W_{LA} - 1)$, where the look back window $W_{LB}$ = 800 and the look ahead window $W_{LA}$ = 10. The CNN-LSTM model is generated by the ASH-HPO algorithm with the settings: apply "both penalties", number of initial models = 81, λ = 3, initial $L_{max} = 3 \times L_{t0}$, acquisition function = EI, and a termination goal of validation loss ≤ 7.5e-5. A training dataset with 116k time steps is used, with a training time of 1561 seconds and a validation loss of 5.10e-5. Fig. 26 and Fig. 27 show the comparisons between the actual and predicted normalized $V_{RX}$ of the PCIe Gen 5 channel for time steps t = 1 to t = 5k and t = 9800k + 1 to t = 9805k of the testing dataset, respectively. The prediction speed is about 110 microseconds per time step. No significant accuracy degradation is observed from the first to the last testing set. Fig. 28 compares the eye diagrams generated using the simulated and predicted voltage waveforms. The normalized eye heights for the simulated and predicted eye diagrams are 0.176 and 0.160, respectively, while the eye widths for the simulated and predicted eye diagrams are $2.28 \times 10^{-11}$ second and $2.21 \times 10^{-11}$ second, respectively. Fig. 29 shows a comparison between the ASH-HPO algorithm and other HPO algorithms that were trained using the same 116k time steps, where it can be seen that the ASH-HPO algorithm converges the fastest.
FIGURE 31. True and predicted normalized $V_{RX}$ waveform of the PAM4 differential channel for time steps t = 1 to t = 5k of the testing dataset, testing loss = 1.65e-5.
FIGURE 32. True and predicted normalized $V_{RX}$ waveform of the PAM4 differential channel for time steps t = 400k + 1 to t = 405k of the testing dataset, testing loss = 1.67e-5.

C. PAM4 DIFFERENTIAL CHANNEL
The PAM4 channel topology used in this example is shown in Fig. 30. The transmission channel is a differential microstrip line with line width = 10 mil, line length = 1000 mil, separation between the two lines = 5 mil, substrate thickness = 10 mil, relative dielectric constant = 3.7, relative permeability = 1, conductor conductivity = 5.8e7 S/m, conductor thickness = 1.4 mil, and dielectric loss tangent = 0.002. $V_{TX}$ is the output voltage of the transmitter (TX), $V_{RX}$ is the input voltage to the receiver (RX), and $V_{TX0}$ is the output voltage of the TX when it is terminated with a 100 Ω resistor. The TX outputs a PRBS sequence at a rate of 14 GBd with a rise/fall time of 30 picoseconds. The voltage waveforms are sampled at 10 samples per symbol. The modeling problem is defined in the same way as in the previous two examples. The CNN-LSTM model is generated by the ASH-HPO algorithm with the settings: $W_{LB}$ = 1600, $W_{LA}$ = 5, apply "both penalties", number of initial models = 81, λ = 3, initial $L_{max} = 3 \times L_{t0}$, acquisition function = EI, and a termination goal of validation loss ≤ 2e-5. A total of 72k time steps of training data is used, with a training time of 1500 seconds and a validation loss of 1.64e-5. Fig. 31 and Fig. 32 show the comparisons between the actual and predicted normalized $V_{RX}$ of the PAM4 differential channel for time steps t = 1 to t = 5k and t = 400k + 1 to t = 405k of the testing dataset, respectively. The prediction speed is about 310 microseconds per time step. No significant accuracy degradation is observed from the first to the last testing set. Fig. 33 shows the eye diagrams generated using the simulated and predicted voltage waveforms. Table 6 compares the eye diagram metrics for both eye diagrams. The ASH-HPO algorithm is also compared to other HPO algorithms in this example using the same 72k time steps of training data. Fig. 34 shows that the ASH-HPO algorithm converges faster than the successive halving algorithm, while the BO and hyperband algorithms both fail to reach the goal.

VI. CONCLUSION AND FUTURE WORK
In this paper, we presented the ASH-HPO algorithm for automated hyperparameter optimization of CNN-LSTM models for transient simulations and demonstrated it using three high-speed channels. We showed that the proposed method converges faster than three state-of-the-art HPO methods: the BO, successive halving, and hyperband algorithms. We investigated different aspects of the algorithm and showed that it can be further improved by applying two penalties, which punish slow-training models and bad models together with their neighbors. We also showed that the method converges with a similar speed even with different choices of the initial length of training data. Although the ASH-HPO algorithm has its own parameters, such as $W_{LA}$, the initial $L_{max}$, the choice of acquisition function, $L_{v0}$, and $L_{t0}$, these parameters have a much less significant impact on the convergence rate of the algorithm than the hyperparameters of the neural models.
For future research, the ASH-HPO algorithm can be used in combination with other recent models such as the LSTM with attention mechanism and the transformer model. In addition, the high-speed channel modeling problem can be broken down into several smaller problems, where each problem focuses on the transient modeling of a certain part of the whole channel. Then, whenever that part of the channel is changed during the design process, another neural model can be swapped in or out. In this case, the ASH-HPO algorithm can be scaled up to select the combination of models that can successfully model the whole system. In this work, a rolling forecast method is employed for the training and prediction of the transient waveform. If the sequence to be predicted is indefinitely long, accumulation of errors can become an issue for the accuracy of the predictions. In such a case, the network can be retrained after a certain number of bits, and the presented ASH-HPO algorithm would be well suited to simplify and automate the whole process. As the neural networks are deterministic in nature, only deterministic components are considered in this work. If necessary, a common practice is to add random components in a postprocessor as Gaussian distributions, and the modeling and subsequent hyperparameter optimization of these systems can be investigated as well. Finally, although the examples presented here focus on the transient simulations of high-speed channels, the developed algorithm is not limited to this field. Other applications in time series forecasting, such as stock price prediction, can be pursued as well.