Neural Approach for Modeling and Optimizing Si-MOSFET Manufacturing

An optimal semiconductor device design and its process uniformity are critical factors for achieving desired figure-of-merits as well as for reducing the fabrication cost of fixing possible malfunctions in semiconductor manufacturing. The two main tasks in optimal device design for semiconductor manufacturing, i.e., parameter optimization and modeling, have typically been used either to characterize devices by understanding how each parameter affects device performance or to calibrate parameters for SPICE circuit simulation. However, there remain limitations in describing the relationship between all manufacturing parameters and figure-of-merits using the few simple equations that human experts can utilize. Even with the best models currently available, optimal semiconductor device design heavily relies on the experience of human experts and involves time-consuming ad-hoc trials and non-holistic approaches. In this paper, we propose a new approach for accurate data-based electrical modeling of the transistor, the most fundamental unit device of a semiconductor, and for fast optimization of its manufacturing parameters. Instead of the previous analytic approach of finding closed-form equations derived from semiconductor physics, we utilize machine learning techniques and neural networks to find appropriate modeling functions from data pairs of parameters and figure-of-merits. Then, for given desired figure-of-merits, we find optimal manufacturing parameters in a holistic manner by using the learned neural network functions and a fast gradient-based optimization method. Experimental results show that our neural-network-based model directly estimates figure-of-merits with competitive accuracy and that our holistic optimization technique accurately and rapidly adapts the manufacturing parameters to meet desired figure-of-merits.


I. INTRODUCTION
Transistor processing is the most basic and core technology in the semiconductor industry. Thus, developing a transistor that is high-performing and low-power-consuming is of great importance. Typically, finding optimal manufacturing parameters for a high-performing and low-power-consuming transistor requires repeated wafer fabrication with hundreds of unit process steps to measure its electrical performance, followed by a feedback loop to the unit processes of manufacturing. This procedure dramatically increases research and development cost because it can take up to several weeks, and its duration has been further extended by the shrinkage of the technology node. (The associate editor coordinating the review of this manuscript and approving it for publication was Jonghoon Kim.)
To reduce these time-consuming trials, multiple equation-based models have been suggested in analytic or numerical forms, either to physically understand device characteristics [1] or to model I-V and C-V through tens of fitting parameters for SPICE circuit simulation [2], [3]. However, these modeling equations do not explain the relationship between manufacturing parameters and figure-of-merits all at once. There is no simple equation to delineate this relationship and ultimately to meet given desired figure-of-merits by understanding their correlations. In addition, even with those equations, manufacturing parameter optimization heavily relies on the experience of human experts and has been dealt with through ad-hoc trials and non-holistic approaches. For instance, human experts design a transistor by first optimizing DC performance, maximizing the on/off current ratio from the I-V characteristics, and then optimizing AC performance by adjusting capacitance from the C-V characteristics, because of the antagonistic effects of I-V and C-V characteristics on the DC/AC performance of a transistor when its structure changes.

(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
In this paper, we aim to overcome the time-consuming and non-holistic optimization of transistor manufacturing parameters. To this end, we propose a new approach for both data-based precise transistor modeling and fast optimization of transistor manufacturing parameters. Instead of the previous analytic approaches that find closed-form equations derived from semiconductor physics, we utilize artificial intelligence (AI), namely neural networks [4], [5], and machine learning techniques to automatically find appropriate modeling functions from a number of data pairs of input manufacturing parameters and output figure-of-merits. Subsequently, we find optimal manufacturing parameters for given desired figure-of-merits in an automatic and holistic manner via a gradient-based optimization technique and the learned modeling functions.
AI has previously been applied to monitor excursions, i.e., out-of-boundary figure-of-merits in the unit processes of transistor manufacturing [6], but there has been no AI-based study analyzing the relationship between the overall manufacturing procedure and the figure-of-merits of a transistor. Our holistic approach simultaneously optimizes the entire manufacturing procedure in the whole parameter space, whereas conventional sequential optimization of each unit process results in a sub-optimal transistor due to its limited parameter-searching ability. The contributions of this work are as follows:

1) Physics-free modeling and high compatibility: In the semiconductor field, different models are conventionally required depending on the material, scale, structure, and usage of the transistor. Our AI model replaces these multiple models with a single artificial neural network model, which is independent of those physical characteristics and applicable to various transistors given only their input/output data pairs. With regard to variability, our method captures the correlation between manufacturing parameters and device figure-of-merits from data, even for nanoscale devices suffering from physical ambiguity.

2) Fast modeling and optimization: It takes several hours for an excellent human expert to compactly model a given transistor. In contrast, our AI method takes only several tens of seconds to complete the same task. Similarly, our AI method takes only a couple of minutes to find optimal manufacturing parameters for given device figure-of-merits.

3) Fast prediction without accuracy penalty: To speed up TCAD simulation in transistor design, fewer physical parameters and more empirical parameters are necessary, which lowers the accuracy of the simulation result. In contrast, our AI method uses a compact neural network for fast prediction and guarantees high accuracy as the amount of data increases.

4) Holistic and flexible optimization: Our method finds manufacturing parameters that optimize the AC performance of a transistor by simultaneously considering I-V and C-V parameters. In addition, our method can selectively optimize the performance suitable for the transistor's application, which is useful for flexible transistor design in System-On-Chip applications.

In the remainder of this paper, we describe the details of our AI-based method in sec.II, experimentally prove the superiority of our method in sec.III, and conclude this work in sec.IV.

A. TRANSISTOR STRUCTURE AND PARAMETERS
Since the purpose of this work is not to find a new low-scaled device but to optimize a transistor and analyze its variation, we designed a 32-nm node high-K metal gate transistor, a 2-D planar MOSFET that is easier to handle in data acquisition than the 3-D FinFET of the current 7-nm technology node, and simulated its electrical characteristics using the Sentaurus TCAD simulator [7]. The transistor structure was designed using the structure editor with parameters based on the Intel 32-nm node [8]. Values and ranges of the structure parameters are shown in fig.1 and summarized in table 1. All doping profiles were assumed to follow a Gaussian distribution. The junction gradient lengths for the source/drain (L_sdj) and halo implants (L_haloj) indicate the lengths over which the source/drain (N_sd) and halo (N_halo) implant concentrations are reduced by a factor of 10, respectively.

B. NEURAL-NETWORK-BASED MODELING FOR POWER AND DELAY PREDICTION
To avoid the very complex mathematical equations of semiconductor physics and to make a general model for multiple devices, we utilize one of the recently popularized parametric models, namely artificial neural networks (ANN) [4], [5], together with machine learning techniques, to model functions that output the figure-of-merits of a transistor from a given input vector. The unit structure of an ANN is the perceptron [9], which is modeled after the biological function of the neuronal synapse as a linear combination of inputs followed by a non-linear activation function. An ANN with several layers of multiple perceptrons is called a fully connected multi-layer perceptron (MLP) [10], [11]. The advantage of the MLP lies in building a flexible model for a new device by simply changing the number of input and output perceptrons according to the dimensions of the manufacturing parameters and the output figure-of-merits. In this work, we use two 3-layered MLPs (input-hidden-output layers), as shown in fig.2, because this structure is simple enough to control the degree of non-linearity by changing the number of perceptrons in the hidden layer.
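As an illustration, the forward computation of such a 3-layered MLP can be sketched in a few lines of NumPy. The tanh hidden activation and the random placeholder weights are our assumptions for illustration only, not the paper's exact configuration:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a 3-layered MLP (input-hidden-output layers).

    Each perceptron computes a linear combination of its inputs followed
    by a non-linear activation; the output unit is linear, as is usual
    for regression. Shapes: x (n_in,), W1 (n_hidden, n_in), W2 (1, n_hidden).
    """
    h = np.tanh(W1 @ x + b1)        # hidden layer of perceptrons
    return float((W2 @ h + b2)[0])  # linear output perceptron

# Dimensions matching the configuration used later in sec.III-B:
# 19 inputs, 7 hidden nodes, 1 output; weights are random placeholders.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(7, 19)), np.zeros(7)
W2, b2 = rng.normal(size=(1, 7)), np.zeros(1)
y = mlp_forward(rng.normal(size=19), W1, b1, W2, b2)
```

Adapting the model to a different device only requires changing the input/output dimensions, which is the flexibility the paper highlights.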
By using these two MLPs, we model two functions of the transistor, F_power (1) and F_delay (2), which respectively output the log-scaled power log(O_power) and delay log(O_delay) of a device from a given log-scaled input vector log(I_p) of manufacturing parameters p, where power and delay, or the power-delay product (PDP), are figure-of-merits for digital logic applications [12].
Here, we used log values for the input and output of F_y, for y = power or delay, to avoid network training failure caused by large scale differences between input and output values. We follow the data flow and operations presented in fig.3 for MLP training. We use the mean squared error (MSE), L_y (3), between the model output log(O_i^y) and the desired output log(D_i^y) of the i-th training sample (I_p^i, D_i^y) for i = 1..N as the only loss function. Each MLP, for y = power or delay, is trained to minimize this loss function L_y (3) by the backpropagation procedure [13], which updates the parameters (weights) of the MLP from initial random values. Minimizing L_y in log scale is identical to minimizing the log-scaled ratio |log(O_i^y / D_i^y)|, which accounts for the scale-sensitive ratio and makes the log-scaled error symmetric with respect to the predicted and target values.
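The symmetry property of the log-scaled MSE can be checked directly: over-predicting by a factor r costs exactly as much as under-predicting by the same factor. A minimal sketch of the loss of Eq. (3):

```python
import numpy as np

def log_mse_loss(O, D):
    """MSE loss L_y of Eq. (3) between model outputs O and targets D,
    computed on log-scaled values."""
    return float(np.mean((np.log(O) - np.log(D)) ** 2))

# The loss depends only on the ratio O_i / D_i, so over-prediction by a
# factor of 2 is penalized exactly like under-prediction by a factor of 2.
over = log_mse_loss(np.array([2.0]), np.array([1.0]))
under = log_mse_loss(np.array([1.0]), np.array([2.0]))
```

This is why the paper can train on quantities whose raw scales differ by many orders of magnitude without one regime dominating the loss.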
One could suggest using a single MLP that outputs both power and delay from given input manufacturing parameters. Such an approach is also possible and, in fact, more efficient in terms of network size because the two outputs share the same hidden layer. However, in the training procedure, a multi-objective loss function L_m combining the two objectives (3) for power and delay must be defined, and it usually takes the form of a weighted summation (4) of L_power and L_delay with ambiguous weights w_power and w_delay.
Since training a single MLP by minimizing L_m may result in an overly fitted F_y for a larger weight w_y or an under-optimized F_y for a smaller weight w_y, the MLP training process would need to be repeated with varying weights until satisfactory values of L_power and L_delay are reached, in order to determine appropriate weights. In contrast, when using two MLPs for the two outputs as in our approach (fig.2), there is no need to find appropriate weights for multiple loss functions, and each MLP is independently trained with its own loss function L_y (3).

C. AUTOMATIC OPTIMIZATION OF MANUFACTURING PARAMETERS WITHIN CONTROLLABLE RANGE
Once the two models of power and delay are ready, we can not only predict power or delay forward from given manufacturing parameters but also optimize manufacturing parameters backward for better transistor performance. PDP is a commonly used measure of transistor performance and should be as small as possible. Since our neural networks are designed to output log-scaled power and delay (fig.2), the log-scaled PDP can be represented as the summation of the two network outputs (5).
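The identity behind Eq. (5) is elementary but worth making explicit; the numerical values below are taken from table 2 purely for illustration:

```python
import numpy as np

# Eq. (5): since each network outputs a log value, the log-scaled PDP is
# simply the sum of the two outputs: log(PDP) = log(power) + log(delay).
# No extra exp/log round-trip is needed inside the optimization loop.
power, delay = 0.1819e-3, 0.1685e-11   # illustrative values from table 2
log_pdp = np.log(power) + np.log(delay)
pdp = np.exp(log_pdp)
```

Working in log space also keeps the objective well-scaled for gradient descent, since power and delay differ by many orders of magnitude.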
As shown in fig.4, we find the optimal set of manufacturing parameters that minimizes this log-scaled PDP value by utilizing our trained neural networks and the gradient descent method [14], [15]. The gradient descent method operates as follows: as a preliminary step, initial randomized manufacturing parameters I_p^0 and a learning rate γ are set. Then, as the first step, the gradient of the log-scaled PDP with respect to the manufacturing parameters, G(x) (6), is calculated at the initial manufacturing parameters (x = log(I_p^0)). As the second step, the manufacturing parameters are updated in the negative direction of the gradient with the learning rate γ, as in (7) at time stamp t = 0. These two steps are repeated with the updated manufacturing parameters as t increases until the PDP meets a termination condition (8), i.e., a PDP less than a certain threshold T_PDP, a vanishing gradient, or no variation in the manufacturing parameters. However, the manufacturing parameters obtained from this gradient descent method are not guaranteed to lie in the controllable range of real semiconductor manufacturing, because the method may find an optimal solution anywhere on the hypersurface of our trained neural networks, beyond the training data space, without any constraint on the acceptable range of solutions in reality. Setting a controllable range of manufacturing parameters is thus critical in the semiconductor industry, because of both the limitations of manufacturing hardware and the preferred parameter settings. To provide a controllable optimal solution, as shown in fig.4, we use a limiter function [16] g(z_p) (9) and optimize latent variables z_p instead of the manufacturing parameters I_p themselves.
g(z_p) is a differentiable function defined by the maximum and minimum values (I_p^max, I_p^min) of the available range of manufacturing parameters and a sigmoid function sig(z_p) [17] (10).
Because the output range of the sigmoid is from 0 to 1 (sig(z_p) ∈ [0, 1]), g(z_p) approaches log(I_p^max) as z_p goes to +∞ and log(I_p^min) as z_p goes to −∞. Therefore, to find optimal manufacturing parameters in the controllable range [I_p^min, I_p^max], we optimize z_p without any range constraint and calculate I_p = g(z_p).
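A minimal sketch of such a limiter follows. The paper only states the limiting behavior of g(z_p); the exact algebraic form below (a sigmoid-weighted convex combination of the log bounds) is our assumption consistent with those limits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def limiter(z_p, I_min, I_max):
    """Differentiable limiter g(z_p) of Eqs. (9)-(10): an unconstrained
    latent variable z_p is mapped into [log(I_min), log(I_max)].

    g -> log(I_max) as z_p -> +inf and g -> log(I_min) as z_p -> -inf,
    matching the behavior stated in the paper.
    """
    lo, hi = np.log(I_min), np.log(I_max)
    return lo + (hi - lo) * sigmoid(z_p)
```

Since the sigmoid is differentiable everywhere, gradients flow through g(z_p) into z_p, so ordinary unconstrained gradient descent automatically respects the controllable range.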
For numerical efficiency, instead of calculating the complex analytic gradient (6) of the neural network function, we use the numerical gradient [18] Ĝ (11) in our process optimization, which requires two feed-forward operations of (5) with the manufacturing parameters x + τ/2 and x − τ/2 and one division by τ ≈ 0. Accordingly, the update equation for the latent variables (12) is used instead of the one for the manufacturing parameters (7).
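The whole loop (central-difference gradient of Eq. (11), latent update of Eq. (12), limiter of Eqs. (9)-(10)) can be sketched end to end. A toy quadratic stands in for the log-PDP objective, and the specific limiter form is our assumption; the point is that when the unconstrained optimum lies outside the controllable range, the solution saturates at the range boundary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def limiter(z, I_min, I_max):
    # Maps unconstrained z into [log(I_min), log(I_max)] as in Eqs. (9)-(10).
    lo, hi = np.log(I_min), np.log(I_max)
    return lo + (hi - lo) * sigmoid(z)

def numerical_grad(f, z, tau=1e-6):
    """Central-difference gradient (Eq. (11)): two evaluations of f at
    z + tau/2 and z - tau/2 per coordinate, and one division by tau."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = tau / 2.0
        g[i] = (f(z + e) - f(z - e)) / tau
    return g

# Toy stand-in for the log-PDP objective: a quadratic whose unconstrained
# minimum (log-value 5) lies outside the controllable range [1, 10], so
# the optimizer should saturate at the upper bound of the limiter.
I_min, I_max = np.array([1.0]), np.array([10.0])
f = lambda z: float(((limiter(z, I_min, I_max) - 5.0) ** 2).sum())

z, gamma = np.zeros(1), 5.0              # latent variable and learning rate
for _ in range(200):                     # update rule (12): z <- z - gamma * grad
    z = z - gamma * numerical_grad(f, z)
x_opt = float(np.exp(limiter(z, I_min, I_max))[0])
```

Because the limiter saturates, the gradient with respect to z vanishes near the bounds, which is exactly the "vanishing gradient" termination case of condition (8).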

III. EXPERIMENTS

A. EXPERIMENTAL SETUP
For the experimental data set, we gathered 1,000 samples of random manufacturing parameters in the reference range between the Min and Max values specified in table 1 and simulated the corresponding 1,000 samples of I_d-V_g and C_gg-V_g characteristics, as shown in fig.5, using the Sentaurus TCAD simulator. Before the electrical simulation, we obtained the geometrical parameters, such as gate length and equivalent oxide thickness, announced in [8] and [19]. Then, we varied the doping profiles from source/drain to channel to calibrate the subthreshold characteristics between the simulated and Intel-measured data [8]. The simulated I-V characteristics correspond to the 32-nm node HP (High Performance) and LOP (Low Operating Power) targets [19] with regard to the range of the threshold voltage and the on/off current ratio, which are related to electrostatic characteristics such as subthreshold swing and drain-induced barrier lowering. To obtain power (14) and delay (15), the effective current I_eff is extracted from I_H and I_L (13), which are measured at the gate and drain bias points defined in [21], with I_H taken at gate voltage V_g = V_dd. Among these 1,000 samples of {manufacturing parameters, delay, power} pairs, 850 samples were used to train the delay and power models, and the other 150 samples were used to test the delay and power prediction performance of the trained models.
For the TCAD simulation, the drift-diffusion transport model was solved self-consistently with the Poisson and carrier continuity equations. The density-gradient model was applied to account for the quantum confinement of carriers. The multivalley modified local-density approximation was used with six-band and two-band k.p band structures for holes and electrons, respectively. The Canali model [20] for carrier velocity saturation at high electric field, the Masetti model [22] for doping-dependent mobility, and the Lombardi model [23] for mobility degradation at the Si-SiO2 interface were used. Doping-dependent Shockley-Read-Hall [24] and Auger [25] generation-recombination models were used along with the Hurkx band-to-band tunneling model [26] to account for gate-induced drain leakage. The operation voltage was fixed at 1.0 V, and the off-state current was fixed at 1 nA for the averaged device. As pre-processing of the data set, we took the logarithm of each element of the 850 training samples and then normalized them to have (mean, standard deviation) = (0, 1) across samples. This pre-processing, which eliminates bias and deviation differences between data, helps the multi-layered perceptron (MLP) learn a function over a restricted domain and is known to enhance training performance [27]. We applied the same pre-processing to the 150 test samples using the mean and standard deviation of the training samples.
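The pre-processing step described above (log transform, then z-score with training statistics reused for the test set) can be sketched as follows; the sample shapes mirror the paper's 850/150 split, and the synthetic data is purely illustrative:

```python
import numpy as np

def fit_log_standardizer(X_train):
    """Compute per-feature mean/std of the log-transformed training data."""
    L = np.log(X_train)
    return L.mean(axis=0), L.std(axis=0)

def apply_log_standardizer(X, mu, sd):
    # Key point: test samples are normalized with the TRAINING mean/std,
    # never with statistics of their own.
    return (np.log(X) - mu) / sd

rng = np.random.default_rng(1)
X_train = np.exp(rng.normal(2.0, 0.5, size=(850, 19)))  # synthetic positive data
X_test = np.exp(rng.normal(2.0, 0.5, size=(150, 19)))
mu, sd = fit_log_standardizer(X_train)
Z_train = apply_log_standardizer(X_train, mu, sd)
Z_test = apply_log_standardizer(X_test, mu, sd)
```

Reusing the training statistics keeps train and test data in the same coordinate system, so the trained MLP sees test inputs on the scale it was fitted to.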
All experiments were run in the MATLAB console on a laptop computer with an Intel i7-8550U (quad-core, 1.8 GHz) CPU and 16 GByte of RAM.

B. PERFORMANCE EVALUATION OF POWER AND DELAY MODELING
We built two neural network models, one for predicting the delay and the other for the power of the transistor (fig.2), as 3-layered MLPs with 19 nodes in the input layer, 7 nodes in the hidden layer, and 1 node in the output layer, where the number of hidden nodes was experimentally determined as the smallest number that learns the regression between manufacturing parameters and delay or power without overfitting. In detail, the two models were trained, one to minimize the power loss of (3) with y = power, the other to minimize the delay loss of (3) with y = delay. During the training phase, 700 randomly selected samples out of the 850 (termed the training set) were used to update the neural network parameters with the Levenberg-Marquardt (LM) optimizer. The other 150 samples (termed the validation set) were used to find the stopping point of the training iteration based on the generalization performance of the trained networks. In our experiments, we stopped the LM iteration when the loss L_y on the validation set remained constant or increased over the previous 6 consecutive epochs. We repeated this training process for a varying number of hidden nodes and used the smallest number that produced a small training error on the training set and a near-zero difference between the errors on the training and validation sets.
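The validation-based stopping rule described above (stop once the validation loss has not improved for 6 consecutive epochs) can be sketched as a small driver loop. The `step` and `val_loss` callables are abstractions: in the paper, `step` would be one LM epoch and `val_loss` the validation MSE, and the loss trace below is hypothetical:

```python
def train_with_early_stopping(step, val_loss, max_epochs=1000, patience=6):
    """Run training epochs, stopping when the validation loss has not
    improved for `patience` consecutive epochs (the paper's criterion).

    `step()` performs one optimizer epoch; `val_loss()` returns the
    current validation loss. Returns (best epoch, best loss).
    """
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch in range(max_epochs):
        step()
        v = val_loss()
        if v < best:
            best, best_epoch, bad = v, epoch, 0   # improvement: reset patience
        else:
            bad += 1                              # plateau or increase
            if bad >= patience:
                break
    return best_epoch, best

# Hypothetical loss trace: improves for 5 epochs, then plateaus at 0.1.
trace = iter([1.0, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
epoch, loss = train_with_early_stopping(lambda: None, lambda: next(trace))
```

With this trace the loop stops after six non-improving epochs, reporting epoch 4 as the best point, which is exactly the behavior the paper relies on to avoid overfitting.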
Since the performance of a trained network differs depending on its initial random parameter values, we repeated this training procedure with the same training and validation sets but with different initial randomized network parameters, and then took the best network, i.e., the one with the lowest error on the training and validation sets. To find an appropriate number of repetitions for stable learning performance, we measured the average MSEs and their variations for the delay (F_delay) and power (F_power) models over 10 trials of training at each repetition number. As shown in fig.6, the average loss and its variation decrease as the repetition number increases, but no significant improvement results beyond 40 repetitions. Therefore, considering a moderate training time, we decided to use the best model out of 40 trained networks.
For the trained models, we first analyzed the learning ability and generalization performance of our neural models by monitoring the reduction of the loss (3) during training. Figure 7 shows the loss (MSE) reductions on the normalized data sets (sec.III-A) during training. For both the training (blue lines) and validation (green lines) sets, the loss gradually decreased as the training epochs continued until the optimizer stopped (fig.7a and fig.7b). Even for the test set (red lines), which was not seen during training, the same loss as on the validation set was achieved. Figure 8 shows the output regression lines of our two trained neural models. The regression plots of the delay model (fig.8a) and the power model (fig.8b) show almost ideal linear fits with R values very close to 1.0 for all data sets. Based on these results, we conclude that our neural models learned the deterministic equations of delay and power for given manufacturing parameters and that they generalize to predict delay or power for unseen manufacturing parameters within the perturbation range of the training data.
Second, to evaluate the modeling accuracy of our neural approach, we compared the predicted delays and powers of our neural models to the simulated delays and powers (ground truth) for the 150 test samples of sec.III-A. As shown in fig.9, the differences between the predicted and simulated values are very small (left panel of fig.9), less than 5% of the simulated values (right panel of fig.9), and this prediction accuracy is good enough to apply our neural models to delay and power prediction in a real manufacturing process.
Our neural modeling took 80 seconds to train the two neural networks 40 times each. Moreover, compared to a TCAD simulation that takes 3 minutes, our method takes less than 1 ms to perform a forward prediction of delay and power for a given set of manufacturing parameters.

C. PERFORMANCE EVALUATION OF AUTOMATIC MANUFACTURING PARAMETER OPTIMIZATION
In the optimization process represented in fig.4, we fixed our trained neural models of delay and power from sec.III-B and used them as F_delay and F_power, respectively. We then generated random initial values of the manufacturing parameters and updated them by the gradient descent method described in sec.II-C with the numerical gradient (11), the update equation (12), and the objective function (5). The random initial values were generated under a uniform distribution in a reasonable range, i.e., from I_p^min to I_p^max of (9), where I_p^min = 0.95 × the reference value in table 1 (−5%) and I_p^max = 1.05 × the reference value in table 1 (+5%). A small deviation τ = 2 × 10^−8 was used to calculate an accurate numerical gradient (11), and a learning rate γ = 10^3 was used in the update equation (12) for fast convergence. T_PDP = 0, T_G = 10^−5, and T_dx = 10^−5 were used for the termination condition (8).
Since the gradient descent method does not always guarantee a global optimum from an arbitrary initial value, we repeated this optimization process with different initial values and took the best result as the optimal manufacturing parameters. To find the number of repetitions needed for a stable optimal result, we first measured the best PDP value for each repetition number and then calculated the mean and standard deviation over 10 trials. Figure 10a and the second row of table 2 show the measured values, where the predicted optimum and simulated optimum represent the PDP values corresponding to the optimized manufacturing parameters from our neural model and from the Sentaurus TCAD simulator, respectively. We find that the predicted PDP saturates at around 0.3066 × 10^−15 when the repetition number is ≥ 10, and its deviation diminishes to 0.0002 × 10^−18. Even without any repetition (repetition number 1), the resulting PDP value (0.3071 × 10^−15 with a standard deviation of 0.9429 × 10^−18) is in close proximity to the saturated optimal value. Compared to the ground truth (simulated optimum), the predicted optimum has a slightly lower value. This difference is expected, caused by the model prediction error shown in fig.9, but a difference within the ±5% bounds of the simulated optimum (upper +5% and lower −5% bounds of fig.10a) is not significant in a real manufacturing process.
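The random-restart protocol above is straightforward to sketch. Here `fake_run` is a hypothetical stand-in for one gradient-descent run (returning 19 parameters near the reference ±5% range and a PDP near the saturated optimum of table 2 plus run-to-run noise); it is not the paper's actual optimizer:

```python
import numpy as np

def optimize_with_restarts(run_once, n_restarts, seed=0):
    """Repeat a local optimization from different random initial values
    and keep the result with the lowest PDP, as in the paper's protocol."""
    rng = np.random.default_rng(seed)
    best_pdp, best_params = float("inf"), None
    for _ in range(n_restarts):
        params, pdp = run_once(rng)
        if pdp < best_pdp:
            best_pdp, best_params = pdp, params
    return best_params, best_pdp

# Hypothetical single run: parameters drawn in the +/-5% range and a PDP
# near the saturated optimum reported in table 2, with illustrative noise.
def fake_run(rng):
    params = rng.uniform(0.95, 1.05, size=19)
    pdp = 0.3066e-15 + abs(rng.normal(0.0, 5e-19))
    return params, pdp

params, pdp = optimize_with_restarts(fake_run, 40)
```

As the number of restarts grows, the best-of-n PDP concentrates near the true local-optimum floor, which is the saturation behavior reported in fig.10a.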
When the repetition number is 20, the delay (fig.10b) and power (fig.10c) converged to the best predicted optima, 0.1685 × 10^−11 (third row of table 2) and 0.1819 × 10^−3 (fourth row of table 2), respectively, and are stable with very small standard deviations of 0.0005 × 10^−13 and 0.0006 × 10^−5, respectively. Compared to the simulated optima of delay and power, the predicted optima show very small differences, within the ±5% bounds (fig.10b and fig.10c) of the simulated optima. Based on these results, we decided to repeat the optimization process 40 times (20 plus a margin of 20) to achieve the lowest PDP with stable delay and power.
Our optimized PDP and its corresponding delay and power are compared to the values of the 150 test samples (sec.III-A) in fig.11. As shown in fig.11a, our optimized PDP (green line) is lower than the minimum PDP of the test samples. However, the minimum PDP did not result from the minima of both delay and power. Instead, our optimized manufacturing parameters have a relatively high delay (fig.11b) and a relatively low power (fig.11c). This result is partially consistent with the previous optimization technique of human experts, who first minimize power (DC characteristics) while maintaining good short-channel characteristics and then try to minimize delay (AC characteristics) under the low-power condition. However, our neural optimization result differs in that it does not insist on the lowest power and delay; rather, our approach accepts a slightly higher power yet achieves the lowest PDP at a certain delay. Therefore, as a holistic optimization, our neural approach is promising for finding optimal manufacturing parameters for the lowest PDP, whereas the sequential optimization commonly used in real manufacturing, which first sticks to the minimal power and then searches for the minimal delay, does not guarantee the lowest PDP. The benefits of our neural approach are further demonstrated by a quantitative comparison with a human expert and the previously used reference design in sec.III-E. Table 3 presents the average and standard deviation of each optimized manufacturing parameter when the repetition number is 40. Each parameter converged to a certain optimal value (Avg) with a very small deviation (Std), on the order of 10^−4 or less of its corresponding Avg value (Std/Avg). Figure 12 shows the optimized transistor structure for the manufacturing parameters in table 3. Since our optimization method imposes no source/drain symmetry constraint, the resulting optimal transistor has an asymmetric structure, unlike the commonly used symmetric structure [8].
Based on table 2, this asymmetric structure achieved a 17% lower PDP (0.3120 × 10^−15), a 12% higher delay (0.1699 × 10^−11), and a 26% lower power (0.1836 × 10^−3) than the reference values of 0.3749 × 10^−15, 0.1516 × 10^−11, and 0.2473 × 10^−3, respectively, which were obtained by TCAD simulation with the reference structure [8] in table 1.
Although an asymmetric structure is uncommon and beyond present manufacturing capability, this optimization result shows that our neural approach can suggest a theoretical direction for MOSFET structure, as well as optimal parameter values, for better performance beyond the conventional source/drain symmetric structure. Our optimization method can also find a set of optimal source/drain-symmetric manufacturing parameters by applying the optimization technique of sec.II-C with the same parameter assumed for source and drain. This symmetric structure optimization is dealt with in sec.III-E.

D. CORRELATION AND SENSITIVITY ANALYSIS
We additionally analyzed the effect of each manufacturing parameter on the PDP value in the optimization process by using our trained neural models of sec.III-B and the optimized manufacturing parameters of sec.III-C. Each graph in fig.13, fig.14, and fig.15 shows the transitions of the figure-of-merits, i.e., PDP, delay, and power, respectively, for each manufacturing parameter varying from the Min to the Max of table 1, while the other parameters are fixed at the optimal values of table 3. We calculated the difference between the minimum and maximum values of the figure-of-merit as the sensitivity (dy) of a parameter, i.e., how significantly a parameter affects the figure-of-merit. These graphs can be generated in a second from our neural models, without repeated TCAD simulations.
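The one-parameter-at-a-time sweep behind these sensitivity plots can be sketched as follows; the linear surrogate standing in for the trained network is our illustrative assumption (chosen so dy reduces to slope × range):

```python
import numpy as np

def sensitivity(model, p_opt, idx, p_min, p_max, n=50):
    """Sweep parameter `idx` from Min to Max while fixing the other
    parameters at their optimized values; the sensitivity dy is the
    difference between the max and min of the figure-of-merit."""
    y = []
    for v in np.linspace(p_min, p_max, n):
        p = p_opt.copy()
        p[idx] = v           # vary only one parameter at a time
        y.append(model(p))
    y = np.array(y)
    return float(y.max() - y.min())

# Hypothetical surrogate in place of the trained network: a figure-of-merit
# that is linear in each parameter, as fig.13 suggests for the PDP.
coef = np.array([2.0, 0.0, -1.0])
model = lambda p: float(coef @ p)
p_opt = np.array([1.0, 1.0, 1.0])
dy0 = sensitivity(model, p_opt, 0, 0.5, 1.5)  # steep slope: sensitive
dy1 = sensitivity(model, p_opt, 1, 0.5, 1.5)  # zero slope: insensitive
```

Each sweep is just n forward passes of the surrogate model, which is why a full sensitivity chart takes about a second instead of hours of TCAD runs.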
From these graphs, we first find that PDP and each manufacturing parameter have an almost linear correlation, as shown in fig.13. This linear correlation placed the optima (red circles in fig.13) on the bound of the 5% limiter (vertical dashed or dash-dot lines in fig.13) with the lowest PDP. Once we know how the manufacturing parameters are correlated with PDP, we can easily decide whether each parameter should be changed further and in which direction it should be moved. Moreover, the monotonic linear correlation supports the use of a wider limiter in our optimization process, up to the full range, so that each parameter can be moved toward the boundary of the limiter for even further PDP optimization.
Second, we find that source/drain parameter pairs have opposite directional correlations with PDP but very similar sensitivities. A highly doped source region, which reduces parasitic resistance, and a lightly doped drain region, which decreases gate-induced drain leakage and improves short-channel effects (SCEs), are preferred whenever possible [28]-[30]. For example, L_sdj(d) and L_sdj(s) are negatively and positively correlated with PDP, respectively, and these two correlations have similar slopes in fig.13. This supports using an asymmetric structure for a better PDP, because the positive and negative PDP increments of the same source/drain parameter pair compensate for each other.

TABLE 5. Parameters listed in descending order of delay and power sensitivities (dy): parameters (0,0) are insensitive to both delay and power; parameters (±,0) or (0,±) are sensitive only to delay or power, respectively; parameters (±,±) or (±,∓) are sensitive to both delay and power in the same or opposite directions. Each group of parameters is sorted in descending order of delay and power sensitivities.
Third, PDP is insensitive to some manufacturing parameters. For example, PDP is almost constant (|dy| < 3.066 × 10^−18, less than 1% of the optimal predicted PDP of 0.3066 × 10^−15 in table 2) for varying L_cond(s/d), L_haloj(s), N_ch, and T_g in fig.13. This means that not all parameters but only a few significant parameters with steep slopes (high sensitivity) are necessary for achieving an optimal PDP [31], [32]. Considering only the significant parameters in the transistor optimization process is beneficial because finding an optimum in a lower-dimensional search space is faster and easier. Table 4 lists the parameters in descending order of their PDP sensitivities (dy), where L_g, N_sd(s), T_ox, and L_sdj(s) are significant factors for PDP optimization. This analysis is consistent with the trend in the conventional semiconductor industry [33], [34].
We also performed a similar analysis of the effect of manufacturing parameters on delay and power control by monitoring their sensitivities (dy), as shown in fig.14 and fig.15. We categorized the manufacturing parameters into five groups according to their sensitivities (dy) to delay and power: both insensitive (0,0), delay-only sensitive (±,0), power-only sensitive (0,±), both sensitive in the same direction (±,±), and both sensitive in opposite directions (±,∓). Here, we assumed that parameters whose sensitivity (dy) is less than 1% of the optimal predicted delay (0.1685 ×10 −11 in table.2) or power (0.1819 ×10 −3 in table.2) are insensitive. Table 5 shows the parameter groups in descending order of their sensitivities (dy) to delay and power.

FIGURE 13. PDP transition for varying manufacturing parameters around the optimal value with the 5% limiter and asymmetric source/drain structure.

FIGURE 14. Delay transition for varying manufacturing parameters around the optimal value with the 5% limiter and asymmetric source/drain structure.

As we can expect from table 4, L cond(s/d) , L haloj(d) , and T g , which have very low PDP sensitivity, fall into the both-insensitive (0,0) group and have no effect in the optimization process. N ch and L haloj(s) , which have low PDP sensitivity, are categorized into the both-sensitive-but-opposite (±,∓) group; these parameters are therefore useful for controlling the delay and power of the transistor without changing PDP. By using parameters in the delay-only sensitive (±,0) group, such as L sdj(d) , or in the power-only sensitive (0,±) group, such as T ox , we can effectively control the delay or power of the transistor independently.
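The five-group categorization can be expressed as a short rule. The sketch below assumes precomputed per-parameter sensitivity dictionaries and the 1% insensitivity threshold from the text; the function and argument names are illustrative.

```python
def categorize(delay_sens, power_sens, delay_opt, power_opt, tol=0.01):
    """Assign each parameter to one of the five sensitivity groups:
    (0,0), (±,0), (0,±), (±,±), (±,∓). A sensitivity below `tol`
    (1%) of the optimal predicted delay/power counts as zero."""
    groups = {"(0,0)": [], "(±,0)": [], "(0,±)": [], "(±,±)": [], "(±,∓)": []}
    for name in delay_sens:
        d = delay_sens[name] if abs(delay_sens[name]) >= tol * delay_opt else 0.0
        p = power_sens[name] if abs(power_sens[name]) >= tol * power_opt else 0.0
        if d == 0.0 and p == 0.0:
            key = "(0,0)"            # no effect on either figure-of-merit
        elif p == 0.0:
            key = "(±,0)"            # delay-only sensitive
        elif d == 0.0:
            key = "(0,±)"            # power-only sensitive
        elif d * p > 0.0:
            key = "(±,±)"            # both sensitive, same direction
        else:
            key = "(±,∓)"            # both sensitive, opposite directions
        groups[key].append(name)
    return groups
```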
Based on this correlation and sensitivity analysis, we found a possibility to further reduce PDP. To this end, we enlarged the range of the limiter from 5% of the reference manufacturing parameter values to the full range between Min and Max in table 1 and performed the same automatic parameter optimization process mentioned in sec.III-C. Table 6 shows the optimized values of the manufacturing parameters with the full range limiter. The resulting parameters positively correlated with PDP in table 4, i.e., L g , N halo(d) , and N sd(s) , were decreased, and the parameters negatively correlated with PDP in table 4, i.e., N ch , N halo(s) , N sd(d) , T hk , and T ox , were increased, both in the direction of reducing PDP. The other parameters were not changed because they had already reached the edge of the range with the 5% limiter. This full range optimization resulted in about 29% lower PDP (0.2661×10 −15 ) than the TCAD simulated value using the reference structure (0.3749×10 −15 ) as shown in table 7.
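The range-limited parameter optimization can be sketched as projected gradient descent. This is an illustration under stated assumptions: `pdp_fn` stands in for the product of the two trained neural models, and we use numerical finite-difference gradients where the paper's networks supply exact gradients via backpropagation.

```python
import numpy as np

def optimize_params(pdp_fn, x0, x_min, x_max, steps=500, lr=1e-2, eps=1e-4):
    """Gradient descent on the predicted PDP, with each parameter clamped
    to its allowed range (the 'limiter') after every step."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(len(x)):              # numerical gradient of the PDP
            h = np.zeros_like(x); h[i] = eps
            grad[i] = (pdp_fn(x + h) - pdp_fn(x - h)) / (2.0 * eps)
        x -= lr * grad                       # descend toward lower PDP
        x = np.clip(x, x_min, x_max)         # enforce the search-range limiter
    return x
```

Widening `x_min`/`x_max` from the 5% band to the full Min-Max range of table 1 is all that changes between the two experiments.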

E. COMPARISON BETWEEN HUMAN EXPERT AND OUR NEURAL APPROACH
For a fair comparison of optimization performance between our neural optimization method and the human expert, we set the same search ranges of manufacturing parameters, i.e., the 5% limiter (sec.III-C) and the full range limiter (from Min to Max of table 1), and used the conventional source/drain symmetric structure [8] for both methods.

TABLE 7. TCAD simulated values of PDP, delay, and power for the optimized manufacturing parameters: 'Neural' represents the optimization with our neural models and asymmetric source/drain structure, 'Neural sym' with our neural models and symmetric source/drain structure, 'Human expert' with human optimization and symmetric source/drain structure, and 'Reference structure' shows the outputs of TCAD simulation with the reference parameter values [8] in table 1.
In the case of our neural approach, we performed an additional automatic parameter optimization with our trained models (previously mentioned in sec.III-B) as described in sec.III-C, but with a source/drain symmetric constraint, i.e., using a single parameter for each pair of source/drain parameters marked with the symbol (s/d) in table 1.
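The symmetric constraint can be implemented by optimizing a reduced vector of shared variables and expanding it before each model evaluation. A minimal sketch, assuming an index map from model-input slots to shared variables (the map itself is illustrative, not the paper's code):

```python
import numpy as np

def expand_symmetric(x_sym, pair_index):
    """Expand a reduced symmetric parameter vector into the full vector
    the neural models expect, duplicating each shared source/drain value
    into its (s) and (d) slots according to pair_index."""
    return np.array([x_sym[j] for j in pair_index])
```

For example, tying inputs 1 and 2 to one shared variable maps a 3-variable vector onto 4 model inputs; the same idea maps the 12-DOF symmetric case onto the 19 model inputs.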
In the case of the human expert, an experienced human expert optimized the PDP of the conventional MOSFET as follows. Since PDP is the product of delay and power, the gate capacitance C gg should be minimized. The gate length was first tuned to reduce C gg . Then, the equivalent oxide thickness (EOT), directly related to T ox and T hk , was controlled. In this process, it was revealed that the EOT does not affect the SCEs of the device in this work, so a large EOT was set to reduce C gg . Both T g and T sd were decreased to their minimum in order to reduce the fringing capacitance between the gate stack and the S/D regions. L con was also decreased to its minimum in order to reduce the junction capacitance. The doping-related parameters, N ch , N halo , and N sd , were decreased to reduce the overlap and intrinsic channel capacitances. Finally, L sp was set to the middle value due to the trade-off between overlap and fringing capacitances.

Table 7 and fig.16 present and compare the figure-of-merits of the transistors optimized by our neural approaches and by the human expert. In the case of the 5% limiter, both the human expert and our neural method (Neural sym) achieved about 7% less PDP and delay and similar power compared to the reference structure. In the case of the full range limiter, both methods achieved about 20% less PDP and power and slightly less delay than the reference structure.
When comparing results from the human expert and our method (Neural sym), both showed similar delays and powers; the PDPs were also similar, although our method resulted in a slightly higher PDP than the human expert's. Our method has 1 to 2% error in modeling ( fig.9 in sec.III-B) and optimization ( fig.10 in sec.III-C), and, due to the low-complexity linear correlations ( fig.13, fig.14, and fig.15), the human expert could easily find near-optimal manufacturing parameters. This resulted in the human expert achieving a slightly lower PDP than Neural sym, but the difference was not significant, coming to less than 1.5%.
We presented the manufacturing parameters optimized by our neural approach in table 8 and by the human expert in table 9. L cond(s/d) , L sdj(s/d) , L sp(s/d) , and T g have different values between the two methods. These differences result from the different optimization objectives of the two methods: our neural method selects parameter values to obtain a lower PDP as described in sec.II-C, while the human expert minimizes only C gg for a lower PDP according to the PDP equation as described in sec.III-E. Nonetheless, these parameters with different values are not significant in minimizing PDP, because PDP is almost insensitive to them according to the sensitivity analysis in the symmetric source/drain structure (table 10). The other parameters have the same or similar values in both methods. These results demonstrate that our neural approach achieved an optimization result comparable to the equation-based minimization of the human expert, without knowing any equation throughout the entire process.
In the case of the full range limiter, as the search range increased from 5% of the reference parameter values to the full range in table 1, the parameters positively correlated with PDP (table 10), i.e., N ch , N halo(s/d) , and N sd(s/d) , were decreased, and the parameters negatively correlated with PDP (table 10), i.e., T hk and T ox , were increased in both methods in order to obtain the minimal PDP (table 8 and table 9).

2) COMPARISON OF SYMMETRIC/ASYMMETRIC STRUCTURES
We find that our neural method with the asymmetric structure (Neural) achieved a much lower optimized PDP than with the symmetric structure (Neural sym), as shown in table 7 and fig.16. As mentioned in sec.III-C, the two parameters in a source/drain pair are oppositely correlated with PDP, which resulted in a higher PDP for the symmetric structure. In addition, there is a higher chance of finding better parameters for a lower PDP in the asymmetric structure, since it has 19 degrees of freedom (DOF), seven more than the symmetric structure with 12 DOF. Table 11 shows the elapsed time to obtain optimal manufacturing parameters with our neural method and the human expert.

3) COMPARISON IN OPTIMIZATION SPEED
Our neural method consumed 80 seconds for training the two neural models described in sec.III-B and another 80 seconds for automatically optimizing the manufacturing parameters as described in sec.III-C. Once the models are trained, our method requires only 80 seconds for each additional optimization. In contrast, the human expert consumed 3 to 4 hours to obtain optimal manufacturing parameters because of extensive trial-and-error tuning with TCAD simulations. Although neither speed is fully optimized, we can roughly conclude that our neural method (2 to 3 minutes) is about 100 times faster than the human expert (180 to 240 minutes).
Our neural approach requires about a thousand training samples generated by TCAD simulations. However, this data generation is an off-line process that can be done by random sampling, without delicate grid point control, and in parallel over multiple cores within a few simulation times. Because of this, data acquisition time is usually not counted in machine-learning-related works; likewise, several papers on neural approaches for the optimization of devices and circuits do not consider the data acquisition time either [16], [35], [36]. Even when data acquisition time is considered, our neural method requires only a few TCAD simulation times to generate the whole training set through parallel computing, while the human expert needs many sequential simulations because of the bundle of sequential trial-and-error tuning steps. In our experiment, the TCAD simulation for one data sample required 1 minute, and we used only 3 parallel simulations because of license limitations. Although this resulted in 333 minutes (about 5.5 hours) to obtain 1000 training samples, if we use more parallel simulations, the data acquisition time can be reduced down to 1 minute (when 1000 parallel simulations are used), the time of a single data sample.
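The parallel, randomly sampled data generation described above can be sketched as follows. The simulator here is a dummy stand-in (a real setup would invoke the licensed TCAD tool, which is not reproduced here); the function names and sampling ranges are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def run_tcad(params):
    """Stand-in for one TCAD simulation (~1 minute each in the paper);
    here it simply returns the parameters with a dummy figure-of-merit."""
    return params, sum(params)

def generate_dataset(n_samples, n_params, n_workers):
    """Randomly sample parameter vectors and dispatch the simulations
    across n_workers parallel workers. Wall-clock time scales roughly as
    n_samples / n_workers simulation times (1000 samples over 3 workers
    corresponds to the paper's ~333 minutes)."""
    samples = [[random.uniform(0.0, 1.0) for _ in range(n_params)]
               for _ in range(n_samples)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run_tcad, samples))
```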

IV. CONCLUSION
In this paper, we proposed a new framework to automatically optimize the manufacturing parameters of a transistor by using neural networks. As the first step, our method trained two neural networks as the delay and power models of a transistor using a training data set from TCAD simulation. Then, an optimal set of manufacturing parameters that minimizes PDP was found automatically by using the neural network models and the gradient descent method within the desired parameter ranges.
We experimentally demonstrated the superiority and efficiency of our neural approach compared to the previous Intel 32 nm node [8] and the human expert's MOSFET optimization with the symmetric source/drain structure. The figure-of-merits of our method were significantly better than the TCAD simulated values with the reference structure [8], i.e., about 7% less PDP with the 5% limiter and about 20% less PDP with the full range limiter. Additionally, the figure-of-merits of our method were comparable to those of the human expert's optimization obtained via time-consuming trial and error with TCAD simulations, while our method took only a couple of minutes for the same optimization. Moreover, our neural models made it possible to analyze the effect of each manufacturing parameter on the figure-of-merits. Without a huge number of TCAD simulations, our neural models obtained the sensitivities of the parameters to PDP, delay, and power by predicting these figure-of-merits for varying parameters. We then categorized and weighted the manufacturing parameters according to their sensitivities to PDP, delay, and power: for example, L g , T ox , and T hk are highly correlated with PDP, whereas L cond(s/d) and T g are uncorrelated with any of PDP, delay, and power.
Our neural approach is not limited to a specific device or a mature technology but can be extended to general devices and new technologies, as long as a sufficient number of input/output data pairs are given, because machine learning itself does not use any device physics tied to a particular device structure or scale.
Even though we did not perform real device experiments but focused on simulated results, as done in the machine-learning-based circuit design method [16], which also reports only simulated results without implementing a real circuit device, our experience indicates that such simulation results are reasonably accurate and match real devices well.
One might be concerned about gathering training data from a real factory. In a real factory, because of variation in unit manufacturing processes, not all wafers can be examined; only some dies are measured as representative samples. Our work focuses on optimizing within the variation of unit manufacturing processes, with the range varying from Min to Max in table 1; in the same manner, we used 1000 data samples for machine learning, which can also be obtained in a real factory without heavy expense. For output data, it would be burdensome to measure full I ds -V gs and C gg -V gs curves through a gate voltage sweep, but several I ds and C gg values at some specific gate voltages can be obtained in a real factory at no heavy expense. For input data, process parameters can be obtained from the devices at the scribe line. The randomized structural parameters in table 1 are the outcomes of process parameters such as dose, temperature, and gas flow, but the accuracy of machine learning is not affected by the type of inputs. If there are some missing data, we can adopt a generative adversarial network (GAN) as a realistic data generator to complete them.
Our experiments demonstrated that the PDP of the transistor can be reduced to about 29% less than that of the reference structure if the asymmetric source/drain structure is adopted. In addition, PDP can be further improved with an even wider limiter beyond the Min and Max shown in table 1, because the transitions of the figure-of-merits over the full range limiter exhibit monotonic behavior (see the figures in appendix.B and C). Since our neural approach is not limited to a specific figure-of-merit or device scale, we can apply our method to smaller devices for better figure-of-merits and to control each figure-of-merit, such as lowering the relatively high delay of the asymmetric structure in table 7, by defining an appropriate objective function as an alternative to eq.3. Further analysis of the asymmetric structure and the wider limiter is needed for real applications, because they are not compatible with the previous 32 nm node with the symmetric structure, and the asymmetric structure incurs additional fabrication cost. This remains as our future work.

APPENDIX. TRANSITIONS OF FIGURE-OF-MERITS
A. SYMMETRIC STRUCTURE AND 5% LIMITER
See Fig. 17 to 19 here.