Neuroevolution-based efficient field effect transistor compact device models

Artificial neural networks (ANNs), in particular multilayer perceptrons (MLPs), have proved efficient for designing highly accurate semiconductor device compact models (CMs). Their ability to update their weights and biases through backpropagation makes them well suited to such learning tasks. To improve the learning, an MLP usually requires a large network and thus a large number of model parameters, which significantly increases the simulation time in circuit simulation. Hence, optimizing the network architecture and topology is a tedious yet important task. In this work, we tune the network topology using neuroevolution (NE) to develop semiconductor device CMs. With the input and output layers defined, we allow a genetic algorithm (GA), a gradient-free algorithm, to tune the network architecture, in combination with Adam, a gradient-based backpropagation algorithm, for optimizing the network weights and biases. In addition, we implement MLP models with similar numbers of parameters as baselines for comparison. In most cases, the NE models exhibit a lower root mean square error (RMSE) and require fewer training epochs than the MLP baseline models. For instance, for a patience number of 10 and different numbers of model parameters, the test-set RMSE values using NE and MLP, in units of log(ampere), are 0.1461, 0.0985, 0.1274, 0.0971, 0.0705 and 0.2254, 0.1423, 0.1429, 0.1425, 0.1391, respectively, for the 28nm foundry technology node.


I. INTRODUCTION
The metal oxide semiconductor (MOS) transistor is the most widely used semiconductor device in integrated circuits (ICs) due to its superior electrical properties. Owing to rapid technological scaling, designing a compact model (CM) has become a crucial step in the design cycle for describing new device behaviors. A CM allows newly fabricated devices to be incorporated into circuit simulation before the physical hardware is implemented. In addition, well-designed CMs, especially machine-learning-based CMs, allow users to evaluate device behaviors under different process conditions and the corresponding effects in circuit applications.
Device compact modeling is mostly achieved by following complex physics and equations, which depend strongly on each material layer and its interfaces, traps, and doping profile [1][2][3]. While conventional physical CMs with parameter fitting have prevailed over the past thirty years, the recent trend is to use machine learning to expand and explore the capability of CMs. Many works have recognized machine learning and artificial neural networks (ANNs) as powerful tools for accurate approximation, modeling, and optimization [4][5][6][7][8][9][10][11][12][13]. In most cases, a standard multilayer perceptron (MLP) is used to construct semiconductor device CMs. In this standard scenario, the backpropagation (BP) method [14] is used to update the weights and reduce the overall error of the network. Stochastic gradient descent (SGD) and other gradient-based algorithms [15][16][17] have proved most effective for updating the weights and biases of large-scale deep-neural-network-based supervised learning models.
Most of the previous works on semiconductor device compact models are based on densely connected MLPs [4,6,7]. To our best knowledge, neuroevolution-based architectures for semiconductor CMs have not been reported. One important earlier line of work related to neuroevolution uses physics-inspired network architectures; pure evolutionary optimization, however, tunes the network weights without gradients and hence is generally limited to relatively small networks. On the other hand, hybrid methods, i.e., methods exploiting the complementarity between stochastic gradient descent (SGD) and evolutionary algorithms (EAs), where SGD optimizes the objective function and the gradient-free EA evolves the neural structure [37][38][39][40], make it possible to utilize much larger networks. In this work, we develop such a hybrid method for compact device modeling. In our hybrid algorithm, we use a GA for selection, crossover, and mutation to evolve the network topologies, and the SGD-based algorithm Adam [15] with the error BP algorithm to update the weights. Some of the results in this work can be found in a student thesis [41].

A. DATA PREPARATION AND MEASUREMENT
Various dimensions of MOSFET devices are used to train the machine-learning-based compact model. 28nm NMOS devices with different gate lengths (L): 30 nm, 40 nm, 50 nm, 60 nm, 300 nm, 500 nm, and gate widths (W): 400 nm, 1000 nm, 1500 nm, 3000 nm, were fabricated and obtained from the foundry. In total, 11 devices are used. Id-Vgs curves are measured with a Vgs step of 25 mV at various Vds values with 0.2 V spacing. Vgs is swept from 0 V to 1.5 V, and Vds from 0 V to 1.6 V. For the current-voltage (IV) measurement, an Agilent 4156C semiconductor device parameter analyzer was used. The dataset is arranged according to the device gate length (L), gate width (W), gate voltage (Vg), drain voltage (Vd), and drain current (Id). Our machine-learning-based model predicts the drain current (Id) from the four input parameters: W, L, Vg, Vd. To improve the subthreshold-region modeling, we use the logarithmic transformation [42]: I' = log10(I), (1) where I is the measured current and I' is the transformed current. Absolute values are taken for some data points at Vds=0 where residual negative current values are observed. After the transformation, the output and inputs are all scaled to the range (0, 1) for model training. The whole dataset is split into a train dataset and a test dataset with a 9:1 (train:test) ratio. The test set may not be strictly required, since fitting the semiconductor device IV is a pure interpolation problem [7,43]: the measurement upper and lower bounds in voltage can be determined based on applications, and all predicted data points fall within the bounds. Removing the test set offers the advantage of fitting all device IV data, ensuring better fitting and benefiting the subsequent SPICE simulations; retaining the test set offers the advantage of detecting overfitting. In this work, we retain a test set and calculate the test-set scores.
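As a minimal sketch of this preprocessing (the helper names are illustrative assumptions, not the paper's actual code), the transformation of equation (1), the (0, 1) scaling, and the 9:1 split can be written as:

```python
import numpy as np

def preprocess_currents(I):
    """Log-transform measured drain currents and scale to (0, 1).

    Takes absolute values first to handle the residual negative
    readings near Vds = 0, then applies Eq. (1): I' = log10(I),
    and finally min-max scales to (0, 1).
    """
    I_log = np.log10(np.abs(np.asarray(I, dtype=float)))
    lo, hi = I_log.min(), I_log.max()
    return (I_log - lo) / (hi - lo), (lo, hi)  # keep (lo, hi) to invert later

def train_test_split_9_1(X, y, seed=1):
    """Random 9:1 (train:test) split of the arranged dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = len(X) // 10
    return X[idx[n_test:]], X[idx[:n_test]], y[idx[n_test:]], y[idx[:n_test]]
```

The stored (lo, hi) pair allows a predicted scaled value s to be mapped back to amperes via 10**(lo + s*(hi - lo)).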
We have implemented our models using Python 3.7.10 [44], NumPy 1.20.2 [45], SciPy 1.6.2 [46], TensorFlow 2.3.0 with the Keras API [47], Scikit-Learn 0.24.2 [48], Matlab, and Pandas 1.2.5 [49]. In this work, we consider the multi-layer perceptron (MLP) model as the baseline against which we compare our model. MLP is chosen as the baseline because of its simple architecture and wide usage in semiconductor device CMs.

1) MULTI-LAYER PERCEPTRON (MLP)
The MLP model is the most commonly used fully connected feed-forward network, shown in Fig. 1B. An MLP can have any number of hidden layers between the input and output layers. Each neuron within the hidden layers can have multiple inputs (xi), which are multiplied by the corresponding weights (wi). In addition, a bias is added at every neuron. Finally, the output signal is formed by applying the activation function to the weighted sum, as shown in equation (2). The activation function contributes the nonlinear characteristic of an artificial neural network. Here, sigmoid activation is used for the MLP models. In an MLP, manually optimizing the number of hidden layers and the number of neurons in each layer is a tedious yet important task.
Figure 1. The flowchart describes compact modeling of MOSFET devices using MLP and neuroevolved (NE)-ANN, respectively.
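The per-neuron computation of equation (2) can be sketched as a minimal NumPy forward pass (an illustrative stand-in, not the paper's Keras implementation): each hidden neuron computes sigmoid(sum_i wi*xi + b), and the output layer is linear for regression.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected MLP.

    x: scaled input vector (W, L, Vg, Vd); weights/biases: lists of
    per-layer weight matrices and bias vectors. Each hidden neuron
    computes sigmoid(sum_i w_i * x_i + b), i.e. equation (2).
    """
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(W @ a + b)           # hidden layers: sigmoid activation
    return weights[-1] @ a + biases[-1]  # linear output for regression
```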

2) NE USING GA
In our proposed model, we use a GA to evolve the neural network architecture by itself. Initially, we start with a partially connected feedforward network, in which the neurons in a block of each layer are randomly connected to the neurons of the blocks in the previous layer, as shown in Fig. 1B. As the generations advance, our model evolves within the defined boundaries in terms of the number of hidden layers and the number of blocks in each layer. Here, we introduce a block, instead of a neuron, as the basic unit for constructing the topology-optimized neural networks. Instead of neurons, every hidden layer has multiple blocks, and every block can have one to multiple neurons. Therefore, each hidden layer is composed of several blocks, and there is at least one neuron in each block. To evolve the network using the GA, we tune (i) the network depth, i.e., the number of hidden layers (Nlayers); (ii) the network width, i.e., the number of blocks (Nblocks) in each layer and the number of neurons (Nneurons) in each block; and (iii) the activation function of each block in a hidden layer. The partially connected model then undergoes evolution using the GA and evolves into the best-fitted compact device model. Circuit simulators, in general, require a small compact model size. Based on our tests, <20 neurons per hidden layer and <5 hidden layers lead to more reasonable CPU time in HSPICE. The total number of parameters in a CM should be kept <1000. The use of mixed tanh and sigmoid activation functions in different blocks leads to improved performance compared to a single activation function. Therefore, we allow mixed activations in the neuro-evolved networks. The mixture of activation functions is also used in [7]. The details of the evolution scheme are discussed in the next section.

III. EVOLUTION OF NEURAL NETWORK ARCHITECTURE
GA, in general, follows the concept of propagating the genes of the fittest individuals through successive generations to find the optimal solution. Initially, the GA starts with a population of individuals, each representing a potential solution to a specific problem. These individuals are evaluated over multiple generations using a fitness function and assigned a fitness score: an individual with a better-performing solution receives a higher fitness score, and vice versa. The fittest individuals, selected based on these scores for reproduction, pass their genes to the next generation. This process of passing genetic features, or genes, from parents to offspring occasionally results in better-scoring solutions. The flowchart of the genetic algorithm is shown in Fig. 1C.

A. ENCODING STRATEGY
Genome representation plays an important role in the whole evolution process, since evolutionary operations are performed by manipulating the genomes. A proper encoding scheme makes crossover and mutation easier and more efficient. In a typical genetic algorithm, a potential solution is usually encoded into a fixed-length genome; for example, binary encoding encodes solutions into strings of 1s and 0s [50]. In our work, the optimal numbers of layers and neurons of a neural network are unknown, and thus we use a variable-length genome encoding to represent the neural network. The encoding strategy defines the representation of each individual's genome. In our work, the genome represents the network architecture and can have a variable length in each generation, i.e., a metameric variable-length genome [51]. Each gene in the genome represents a block. Each gene is assigned a set of identification (ID) numbers (m, n), where m is the hidden-layer number and n is the block number, as shown in Fig. 2. In Fig. 2, there are two blocks in the first hidden layer, represented by IDs (1,1) and (1,2). The gene ID not only helps define the network structure but also makes model building with the TensorFlow Keras functional API much easier.
The network architecture, or topology, is represented by the genome. The genome, a list of genes, incorporates the model parameters of each block: the neuron number, the activation function, and the input connections, i.e., connection IDs. The neuron number and activation function describe the number of neurons and the activation function of that particular block, respectively. The connection IDs specify the gene IDs in the previous layer with which that particular block, or gene, establishes connections. For example, as shown in Fig. 2, gene ID (1,1) establishes connections with the input-layer neurons, defined as (-1,1) and (-1,2). Therefore, the connection IDs direct each block to establish connections to blocks of the previous layer. It is worth mentioning that there can be more than one block in a single hidden layer and more than one neuron in a block. Relative to an MLP, the advantage of having different blocks in a hidden layer is that we can assign different activation functions within the same hidden layer, which increases the GA search space. In addition, since one block can contain a single neuron, using blocks as the basic unit in the GA does not diminish the search space or decrease the flexibility of the network topology evolution.
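A minimal sketch of this genome representation, for the example of Fig. 2, might look as follows (the class and field names are illustrative assumptions, not the authors' code):

```python
from dataclasses import dataclass

@dataclass
class Gene:
    """One block of a hidden layer; (layer, block) forms the gene ID (m, n)."""
    layer: int          # m in the ID (m, n); input-layer IDs use layer -1
    block: int          # n in the ID (m, n)
    n_neurons: int      # number of neurons in this block (at least one)
    activation: str     # 'tanh' or 'sigmoid', mixed across blocks
    connections: list   # gene IDs (m-1, n') of blocks in the previous layer

# Example genome loosely following Fig. 2: two blocks in hidden layer 1,
# one block in hidden layer 2 connected to both layer-1 blocks.
genome = [
    Gene(1, 1, 3, 'tanh',    [(-1, 1), (-1, 2)]),
    Gene(1, 2, 2, 'sigmoid', [(-1, 3), (-1, 4)]),
    Gene(2, 1, 4, 'tanh',    [(1, 1), (1, 2)]),
]
```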

B. POPULATION INITIALIZATION
During the initialization process, we first define random network topologies, as discussed earlier. A population of N individuals is initialized randomly within the hyper-parameter constraints. The algorithm for initializing the population is described in Table I.
When the GA starts, i.e., in the first generation, all individuals in the population have random numbers of hidden layers and random blocks within each hidden layer. For subsequent generations, the architecture evolves within the defined hyper-parameters that bound the network architectures, as shown in Table II. Here, we limit the minimum number of hidden layers (Nlayers) to two, since our previous model-fitting experience shows that at least two hidden layers are required for satisfactory fitting accuracy. The lower limits of the number of blocks in each hidden layer (Nblocks) and the number of neurons (Nneurons) in each block are both set to one. Each block in each layer is given a corresponding gene ID, as mentioned above.
Once the network architectures and IDs are defined, the connections are assigned randomly between layers to form a randomly connected network. The rules of connection are as follows:
1. First hidden layer: all neurons of each block in the first hidden layer should connect to at least three inputs.
2. Other hidden layers: blocks only connect to the blocks in the previous layer. Blocks in the same layer are not allowed to connect with each other.
3. Output layer: connects to at least half of the blocks in the last hidden layer.
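A sketch of random initialization under these rules is shown below. The hyper-parameter bounds are illustrative placeholders for Table II, not the paper's actual values, and the output-layer rule (3) is omitted for brevity; each genome is a list of layers, each layer a list of block dictionaries.

```python
import random

# Illustrative hyper-parameter bounds (stand-ins for Table II)
N_LAYERS = (2, 4)    # min/max number of hidden layers
N_BLOCKS = (1, 4)    # min/max blocks per hidden layer
N_NEURONS = (1, 4)   # min/max neurons per block
N_INPUTS = 4         # W, L, Vg, Vd

def random_genome(rng):
    """Build one random individual within the hyper-parameter bounds."""
    genome = []
    for m in range(1, rng.randint(*N_LAYERS) + 1):
        layer = []
        for n in range(1, rng.randint(*N_BLOCKS) + 1):
            if m == 1:
                # Rule 1: first-hidden-layer blocks connect to >= 3 inputs
                k = rng.randint(3, N_INPUTS)
                conns = rng.sample([(-1, i) for i in range(1, N_INPUTS + 1)], k)
            else:
                # Rule 2: connect only to blocks in the previous layer
                prev = [(m - 1, j) for j in range(1, len(genome[-1]) + 1)]
                conns = rng.sample(prev, rng.randint(1, len(prev)))
            layer.append({'id': (m, n),
                          'neurons': rng.randint(*N_NEURONS),
                          'activation': rng.choice(['tanh', 'sigmoid']),
                          'connections': conns})
        genome.append(layer)
    return genome
```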

C. FITNESS EVALUATION
Each individual in a generation is evaluated and given a fitness score as the measure for parent selection. Because model fitting is a regression problem, the mean squared error is used as the fitness score. Each model decoded from a genome is trained on the training set using backpropagation, and the fitness score is estimated on a validation set reserved from the training set. We use early stopping to train our models efficiently: the Keras early-stopping callback monitors the training process, and once the loss on the validation set no longer decreases within a given number of epochs, defined as the patience number, the training terminates. Here we set the patience number to 10 in the GA, with a maximum of 1000 epochs; the training process usually concludes before reaching the maximum epoch, and the mean squared error on the validation set in the last epoch is assigned as the fitness score of the individual. This approach not only prevents overfitting but also saves significant training and evolution time.
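The early-stopping logic that produces the fitness score can be sketched in plain Python (a stand-in for the Keras EarlyStopping callback; here `val_losses` represents the per-epoch validation MSE that backpropagation training would produce):

```python
def train_with_early_stopping(val_losses, patience=10, max_epochs=1000):
    """Stop when the validation loss has not improved for `patience`
    consecutive epochs; return (stop_epoch, last_val_mse).

    The last-epoch validation MSE is assigned as the GA fitness score.
    """
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, wait = loss, 0       # improvement: reset the patience counter
        else:
            wait += 1                  # no improvement this epoch
            if wait >= patience:
                break                  # patience exhausted: terminate training
    return epoch, loss
```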

D. SELECTION METHOD
The selection process identifies the fittest eligible parents from the population to reproduce offspring. There are plenty of selection methods used in genetic algorithms, such as roulette wheel selection, rank selection, and tournament selection [52,53]. For this work, binary tournament selection is chosen for its better convergence rate and time complexity [53]. A binary tournament randomly chooses two individuals from the population, and these two individuals compete with each other in terms of their fitness scores. The one with the better fitness score, namely, a smaller mean squared error, is selected as one of the parents. The other parent is selected by repeating the process. These two parents are added to the mating pool, which contains the pairs of parents for reproducing the next generation. This procedure is illustrated in Fig. 3. The size of the mating pool is set to the population size.
Figure 3. Two parents are selected based on fitness scores using the binary tournament selection method. Spatial crossover and mutation on the parents' genomes produce fit and diverse offspring.
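A minimal sketch of binary tournament selection and mating-pool construction (helper names are illustrative; fitness is validation MSE, so smaller is better):

```python
import random

def binary_tournament(population, fitness, rng):
    """Sample two distinct individuals; the lower-MSE one wins the tournament."""
    i, j = rng.sample(range(len(population)), 2)
    return population[i] if fitness[i] < fitness[j] else population[j]

def build_mating_pool(population, fitness, rng):
    """Collect parent pairs; the mating-pool size equals the population size."""
    return [(binary_tournament(population, fitness, rng),
             binary_tournament(population, fitness, rng))
            for _ in range(len(population))]
```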

E. OFFSPRING REPRODUCTION (CROSSOVER AND MUTATION)
Individual genomes in our population can have variable lengths and are therefore expressed by a metameric variable-length genome [51]. Splitting such genomes at random points can lead to loss of information during crossover, and the offspring may not resemble their parents. To maintain the phenotype space of the network, we use the spatial recombination operator [51,54,55] for crossover. In this method, we consider the structure, or spatial distribution, of the genomes, and the crossover point can only fall at the boundary of layers. This preserves the localized relationships between the child and parent genomes. In simple words, the spatial recombination operator splits the genome randomly between hidden layers instead of between blocks or neurons, therefore keeping all the blocks in a hidden layer intact, as shown in Fig. 3. Two parent genomes create two offspring genomes through crossover. Due to the variable-length genome representation in our problem, the length of an offspring genome may differ from the lengths of the parent genomes. The crossover layers need not match in the two parent genomes, which can lead to mismatched information, such as inconsistent input IDs and repeated genes, as shown in Fig. 3.
In this work, a mutation operation is performed to correct the defects in the genomes after crossover, i.e., inconsistent gene IDs and repeated genes. The mutation operation fixes inconsistent connection IDs by randomly reassigning them from the gene IDs of the previous layer after crossover. Similarly, for repeated genes, the mutation process removes the redundant genes from the genome and corrects the connection IDs by reassigning them, as shown in Fig. 3.
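Representing a genome as a list of layers (each a list of block dictionaries), the layer-boundary crossover and the repair-style mutation can be sketched as follows (an illustrative simplification of the scheme in Fig. 3):

```python
import random

def spatial_crossover(parent_a, parent_b, rng):
    """One-point crossover where the cut falls only between hidden layers,
    keeping every block within a layer intact. The two cut points need not
    match, so offspring lengths can differ from the parents'."""
    ca = rng.randint(1, len(parent_a) - 1)
    cb = rng.randint(1, len(parent_b) - 1)
    return parent_a[:ca] + parent_b[cb:], parent_b[:cb] + parent_a[ca:]

def mutate_repair(genome, rng):
    """Repair defects created by crossover: renumber gene IDs layer by layer
    and reassign inconsistent connection IDs to blocks of the previous layer."""
    for m, layer in enumerate(genome, start=1):
        for n, block in enumerate(layer, start=1):
            block['id'] = (m, n)
            if m > 1:
                prev = [(m - 1, j) for j in range(1, len(genome[m - 2]) + 1)]
                # keep still-valid connections, randomly reassign broken ones
                kept = [c for c in block['connections'] if c in prev]
                block['connections'] = kept or [rng.choice(prev)]
    return genome
```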

F. ELIMINATION
During the elimination process, all individuals in the population, including the parents and the children, are evaluated and ranked according to their fitness scores. Those with better performance survive and have a chance to evolve, while the others are eliminated. This approach prevents the best candidates from being discarded. Here we keep the population size constant by eliminating half of the individuals after crossover. After the evolution process, the neuro-evolution algorithm returns the individual with the best fitness score in the last generation. This fittest individual is then compared to the baseline model in the next section.
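The elimination step can be sketched as rank-and-truncate survival over the combined parent and child populations (illustrative helper names; fitness is validation MSE, so smaller is better):

```python
def eliminate(parents, children, fitness_fn, pop_size):
    """Rank parents and offspring together by fitness and keep the best
    `pop_size` individuals, so the population size stays constant and the
    best candidates are never discarded."""
    combined = parents + children
    combined.sort(key=fitness_fn)      # ascending: smaller MSE ranks first
    return combined[:pop_size]
```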

IV. RESULTS AND DISCUSSION
In Table III, we report five NE runs using patience=10. All neuroevolution cases exceed the performance of the MLP models. Patience=10 is a reasonable value and also the default value in the Scikit-Learn library [48]. Other patience values can be used to run the NE. Instead of re-running the NE with different patience values, we take the model trained at patience=10 and continue the training at higher patience. This practice can lead to non-optimal topologies at the specific patience values, but it gives some insight into the effectiveness of NE and of the models optimized at patience=10. As seen in Table III, the NE model with continued training at patience=50 and 100 still shows lower RMSE values relative to the MLP baselines. It is worth mentioning that for all programs in this work, we uniformly set the random seed to 1 at the beginning of program execution. Therefore, the NE run does not optimize the random seeds of individuals. The training of an individual can potentially start from any seed, and there is no guarantee that selected individuals receive the same random seed in the next evaluation. We adopt this practice because we believe the optimized network topologies should be superior independent of the seed. Since the random seed is not optimized in NE, the MLP baseline model is evaluated using random seed=1 for a fair comparison. In addition, to show the strength of NE, we also rotate the MLP random seed from 1 to 30 and record the lowest RMSE values, labeled RMSEmin in Table III. The RMSEmin is still greater than the RMSE of the NE-optimized individuals in this somewhat unfair comparison, indicating the effectiveness of NE using the GA here.
In this work, the connections from a block in the i-th layer go only to the (i-1)-th layer, preventing connections to layers further back. This scheme reduces the search space of the NE. Indeed, based on our numerical tests, allowing connections beyond the blocks in the (i-1)-th layer leads to difficulty in locating the optimized topologies for the neural networks. The potential reason for this difficulty lies in the overly complex network topologies, which lead to a vast search space and poorer convergence in training. Regarding partially connected networks, instead of dense connections, Xing et al. also demonstrate their advantages [7]. Essentially, if we want to shrink the model size, we need to use partial connections to maximize the critical usage of the neurons and connections. A fully connected structure with the same neuron and hidden-layer numbers encompasses the subsets of partially connected networks but requires more training time and is larger in size. On the other hand, allowing connections only to the previous layer during evolution gives a smaller search space but faster location of the optimized topologies and faster convergence in network training. Fig. 4 shows the fitness statistics of the GA during evolution, including the best individuals and the averages in each generation. Fig. 5 shows the scatter plots of the test dataset with patience=100 and with similar numbers of parameters for NE and MLP, respectively. The RMSE values of the neuroevolution cases are reduced significantly in reference to the MLP baseline models. It is worth mentioning that the effectiveness of NE can diminish as the network size grows. This means that when the parameter number in a network is allowed to grow, the MLP can also be effective, and the relative enhancement of NE over MLP can become insignificant.
This is expected since, for a fixed dataset size, the fitting becomes easier and more manageable as the network size increases in terms of the numbers of neurons and hidden layers. In this work, we count the parameter number in our networks and compare the performance of the neuroevolution networks to baseline MLP models with similar parameter numbers. It should be noted that although the effectiveness of neuroevolution networks is more pronounced at small network sizes and less pronounced at larger ones, topology optimization of the network architecture remains of very high value for compact device modeling. This is because we need a small network for CMs in order to manage the circuit simulation time. With hundreds to thousands of transistors in one circuit, using a large network as the CM for one device is not practically feasible. Therefore, with a limited number of neurons and parameters in one network for CMs, we need to ensure that the network topology is optimized to promote the fitting of device IVs and the accuracy of circuit simulation. Neuroevolution becomes indispensable in this regard.
The genome representation and the crossover scheme are other vital aspects of neuroevolution. A conventional way to run a GA is to encode each individual of the optimization problem as a real-valued or binary array, similar to biological genomes. In this basic method, every parameter is represented by a bit in the genome. In our case, the parameters are the neuron numbers, the activation functions in each block, and the connections between the blocks of each hidden layer. This basic encoding scheme is simple but can be ineffective in some cases. To improve the effectiveness of the GA, we use the more sophisticated genome representation described in Fig. 2 and the previous sections. With this genome representation in hand, we then decide the crossover scheme. When the genome has some data structure, we need to consider how to conduct recombination effectively to improve the individual scores. Arbitrary crossover is not practical because it destroys the genome structure, while the crossover scheme should produce new individuals with lower MSE values. In our case, we want to ensure that the evolution does not lead to overly complex and strange connections and that some basic features of the parent individuals are maintained, so that desired features from the parents are retained. For this purpose, we formulate a one-point crossover scheme where the crossover point falls only between hidden layers; the crossover point cannot be within a block or a layer. In the literature, the most pronounced case where such careful consideration is needed in crossover is the so-called spatial recombination [54,55]. In that case, similar to the consideration here, the sensor array represented by a genome cannot be recombined arbitrarily, in order to maintain the spatial distribution and the similarity between the parents and the offspring.
In standard neuroevolution, the GA is used both to optimize the topology and to tune the network weights and biases. This is similar to reinforcement learning, where the reward along each path keeps increasing during the procedure. In this work, we use neither pure GA nor a reinforcement-learning scheme. Instead, we use a GA and Adam hybrid method. We think a conventional gradient-based training scheme is more effective for tuning the network weights and biases in topology-optimized networks, while using the GA to tune the network architecture, including the connections between neurons or blocks, is effective. Essentially, a gradient-based optimization method cannot adjust the network architecture, since the gradient is difficult to evaluate across different network topologies. Fig. 6 shows the device I-V with fitting, and Fig. 7 draws the optimized network topologies for the cases of 677 and 101 parameters, respectively. In this effort, the neuron number in each block can vary from 1 to 4-10, and thus essentially every single neuron can have its own unique connections.
As for the question of why neuroevolution is effective and needed for semiconductor device CMs, we think the answer lies in the nature of the fitting problem and the requirement of a small network size for semiconductor device CMs. In general, if the mapping between the input and output variables is highly nonlinear or complex, we need to increase the size of the network to fit the relation. It is also possible that the input-output relation is so complex that increasing the network size becomes ineffective. This means a more complex network connection/structure needs to be used, e.g., convolution, recurrence, pooling, dropout, concatenation, etc. A pure dense connection between adjacent layers becomes less effective for fitting a very complex mapping between input and output variables. Therefore, it is not surprising that we achieve better fitting using optimized network topologies, since an MLP, i.e., a simple dense connection, does not encompass the entire search space of network topologies. The enhancement observed in this work can be explained by the fact that we can explore many more topologies while using the same number of parameters. Finally, there are undoubtedly other possible, extended ways to enhance the fitting beyond the scheme used in this work. Specifically, in this effort, we do not allow connections beyond adjacent layers and do not allow recurrent/convolutional connections. While a more complex evolution scheme could further boost the efficiency of NE, our tuning already shows pronounced effectiveness in the comparison between the NE-based architecture and the MLP at similar parameter numbers.

V. CONCLUSION
This work presents an efficient way to tune network topologies to develop an accurate and topology-optimized device CM for the MOS transistor. Using a genetic algorithm (GA) based neuro-evolution (NE) method, we allow the network topology to learn and adapt within the defined bounds of Nlayers, Nblocks, and Nneurons. Moreover, while in most cases GA alone is used in NE to implement reinforcement-learning-like optimization, here we use GA and Adam together in a supervised learning setting to develop CMs for MOSFET devices. During evolution, by limiting the connections of each block in a layer to the previous layer only, we minimize the search space of the GA, resulting in faster location of the optimized topologies and faster convergence in network training. Furthermore, compared to baseline MLP models with similar parameter numbers, our NE models at patience=10 achieve test-set RMSE values of 0.1461, 0.0985, 0.1274, 0.0971, and 0.0705 for five GA runs. We believe topology optimization is an important task for future ML-based CMs to increase the fitting accuracy of a reduced network and minimize circuit simulation time.