Hybrid Model of Mathematical and Neural Network Formulations for Rolling Force and Temperature Prediction in Hot Rolling Processes

Steelmaking requires precise calculation at several steps of the manufacturing process. We focus on the hot rolling process using Steckel mills, nearly the final step in steel coil manufacturing. The rolling process is a type of plastic working in which a slab passes between two rolls and is stretched to reach the target thickness. The exact rolling force must be predetermined to obtain a coil with an accurate thickness after the rolling process. First, we introduced a machine learning model for calculating the rolling force, which can be used in-line in real plants. However, direct calculation of the rolling force can cause stability problems, because the model output directly affects the process. To avoid this problem, we determined a special temperature of the coil by inverse calculation of the classical mechanical model of hot rolling and set it as the model output value. As learning models, deep neural networks (DNN) and gradient boosting-based decision tree models were used. We preprocessed the collected process history data and added artificial features to the model input by creating physical variables used in the classical models. Moreover, to compensate for the black-box nature of DNNs, feature importance was analyzed from the decision tree model, and the utilization and interpretation of each feature in the process are presented. Thus, our method takes advantage of both the classical mathematical model and the deep neural network model.


I. INTRODUCTION
A. BACKGROUND
Steel coils are used in many industries such as shipbuilding, bridge, and building construction. Recently, the demand for high-quality products with accurate and uniform specifications has increased. Briefly, steel coils are manufactured in steel companies through the following processes:
1) Melt iron ore in the furnace
2) Remove impurities from the molten iron in a converter
3) Reshape the refined iron using a casting machine
4) Roll the slab, and finally obtain the steel coil
The associate editor coordinating the review of this manuscript and approving it for publication was Emanuele Crisostomi.
This article focuses on Step 4, particularly the hot rolling process using Steckel mills, and presents an application of machine learning techniques. Once the required target thickness of the coil is specified, a rolling schedule with several passes must be predetermined to reach the target thickness. More precisely, Ginzburg [1] determined that setting the rolling force at each pass of the schedule is an important factor in achieving high precision in the thickness of the final product. Classically, the rolling force was calculated using thermo-mechanical theory [2], [3] together with the deformation resistance (from well-known physical models such as [4]-[7]). In practice, however, a large deviation exists between the calculated force and the actual force, because the above models cannot consider all the factors that affect the rolling process. For example, the temperature drops of the head and tail regions (the ends of the coil) while winding on a drum (see Fig. 2) must be considered to avoid off-gauge problems.

Consequently, we developed two directions for the application of machine learning models to calculate important factors in the hot rolling process. One direction is to calculate the rolling force directly as a model output. The other is to identify representative temperatures that can be used as input values in the physical model. We remark that direct prediction of the rolling force can cause stability problems, because the model output directly affects the process results when an abnormal variable outside the learning range is given as input. By contrast, with the temperature approach, even if the predicted temperature value is in an abnormal range, the rolling force can still be controlled stably by setting a limit range on the temperature-dependent process variables.

B. RELATED WORKS
For each step in the steelmaking process, machine learning based approaches have been applied in recent years. For Step 1, a thermal change in the blast furnace hearth is modeled as a binary classification problem and solved using support vector machines [8], [9]. Jo et al. [10] worked on Step 2, requiring a prediction of the molten iron temperature at the end of the LD converter process. They first made gradient boosting based decision tree models (see [11]- [13]), and then applied an ensemble learning technique [14] by combining decision tree models. For Step 3, a convolutional neural network architecture was used to detect stickers during continuous casting in [15].
In recent years, several works have introduced prediction models based on machine learning and neural networks for the hot rolling process. Lee and Choi [16] constructed on/offline neural network models to determine the rolling force in the hot rolling process. They first calculated the rolling force (RF_m) using a typical rolling-force model, then applied the output of the network model as a multiplicative correction to RF_m. The network model consists of one hidden layer with various numbers of units and can be optimized by the Levenberg-Marquardt algorithm (see [17], [18]). Bagheripoor and Bisadi [19] also used an artificial neural network model to predict the rolling force and rolling torque. Specifically, they considered a heat equation with a volumetric rate of heat generation (see [20]), and set up different process variables such as roll speed, thickness reduction of the slab, temperature, and interaction friction coefficient. Data samples were generated from the finite element method and then used as training/test data. Previous studies have not extended to neural networks with deep layers, which limits the scope of application because numerous variables are used in the real process.
More recently, Wang et al. [21] used deep neural networks combined with a mathematical model to predict the rolling force for a tandem cold rolling process. We adopted a similar approach to their work, but a more difficult problem arises for hot mills operating above the recrystallization temperature.
Indeed, the formula for the rolling force at high temperatures is more complicated than in the cold rolling case. For example, the temperature distribution in the thickness direction must be considered during the hot rolling process. Specifically, the head and tail temperatures of the coil are lower than that of the middle region due to convective and conductive heat transfer.

C. CONTRIBUTIONS
The main contribution of our work comprises the following items, as compared to previous works.
1. We validated the application of machine learning techniques to industrial processes. The complex operation information recorded and collected by the steelmaking plant was used as input data. Our approach can utilize such complex data effectively, which existing classical methods could not achieve. These merits lead to the development of an elaborate model, so our work is valuable from the perspective of the evolution of the steelmaking industry.
2. Our method exploits the synergy between the classical mathematical model and deep neural networks. This approach is more powerful than previous works that used raw process data as model inputs. We first created additional features, physically or statistically significant, from the process history data. By analyzing the importance of the input variables (in predicting the target variables) in a primary model, a selected subset was used in the final model. We also calculated feature importance from the decision tree-based models, enabling us to mitigate the black-box property of neural network-based models. As a result, our method can exploit the existing domain knowledge embedded in the classical models.
3. Our model can be applied in-line in the industrial hot rolling process. After the model has been trained with sufficiently small validation errors, it can evaluate new data with high accuracy. Our model can be regularly updated by utilizing the data that is continuously accumulated by the plant.

II. PROCESS DESCRIPTION
We collected process history data from a reversing hot rolling mill at an overseas branch of POSCO. Fig. 1 shows the configuration of the hot rolling plant. The hot rolling process proceeds through the following steps. Initially, a slab is reheated to the re-crystallization temperature (approximately 1220-1270 °C) in a furnace, and it is rolled to an intermediate-size bar after several passes in a roughing mill (RM). Then, the bar is transferred to a finishing mill (FM, a Steckel mill), where it is rolled to the final target thickness. The FM using a Steckel mill has the following distinctive features. As shown in Fig. 2, a Steckel mill is a single-stand, 4-high reversing mill consisting of two work rolls and two backup rolls. It is similar to a reversing rolling mill except for its two coilers and coiler furnaces, which are located at the entry side and the delivery side of the rolling mill stand. The material is rolled back and forth through the mill until the target thickness is reached.
Two coilers are used to feed the material by pushing and pulling it through the mill. The coiler furnaces can heat the coil and maintain set temperatures. The heated ambient temperature in the coiler furnace provides continuous and additional heat to the material, facilitating thermal consistency and rolling to a thinner thickness. For these reasons, Steckel mills are suitable for rolling high alloy steel and thin materials.
However, in finishing rolling with a Steckel mill, the temperature drop in the head region of the coil caused by contact while winding on the drum is an inevitable problem. Moreover, the temperature drop of the material differs each time it is wound on the drum. This temperature variation of the material causes large fluctuations in the high-temperature deformation resistance.
The precise delivery thickness after rolling can be obtained through accurate prediction of the rolling force and the roll gap, according to the gaugemeter equation (1):

h = s + RF / M,    (1)

where h (mm) is the delivery thickness, RF (kg) is the rolling force, M (kg/mm) is the mill modulus (a unique characteristic of the mill), and s (mm) is the roll gap. The setting value of the roll gap can be calculated from an accurate rolling force prediction, and the rolling force is greatly influenced by the high-temperature deformation resistance of the material. Therefore, the prediction of temperature is very important in accurately calculating the rolling force.
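As a minimal sketch, the gaugemeter relation (1) can be evaluated directly. The numeric values below are illustrative, not plant values:

```python
def delivery_thickness(roll_gap_mm: float, rolling_force_kg: float,
                       mill_modulus_kg_per_mm: float) -> float:
    """Gaugemeter relation (1): h = s + RF / M.

    The mill stand stretches elastically under load, so the actual
    delivery thickness h exceeds the preset roll gap s by RF / M.
    """
    return roll_gap_mm + rolling_force_kg / mill_modulus_kg_per_mm

# Illustrative numbers: a 3.0 mm gap, a 2.0e6 kg rolling force, and a
# 5.0e5 kg/mm mill modulus give h = 3.0 + 4.0 = 7.0 mm.
h = delivery_thickness(3.0, 2.0e6, 5.0e5)
```

Conversely, solving (1) for s gives the roll-gap setting needed once the rolling force has been predicted.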
However, since the temperatures of the head and tail regions of the coil that contact the drum decrease rapidly and nonuniformly, they are difficult to predict accurately. This inaccurate temperature prediction ultimately causes the off-gauge problem, that is, thickness deviation of the product.

A. MECHANICAL APPROACHES
In this section, we introduce the thermo-mechanical approaches used to compute the rolling force in the hot rolling process. One of the commonly used models is the Sims model [2] coupled with the Misaka model [6]. More precisely, the rolling force can be calculated from the following Sims-type model:

RF = b k_p Q_p sqrt(R'(H - h)),    (2)

where b (mm) is the coil width, k_p (kg/mm²) is the deformation resistance, H (mm) and h (mm) are the thicknesses of the coil at entry and delivery, respectively, Q_p is a geometric factor, and R' (mm) is the flattened work-roll radius. Additional variable information is given in Table 1. The deformation resistance k_p is approximated using the Misaka-type equation:

k_p = A exp(B/T) ε^n ε̇^m,    (3)

where A, B, n, and m are tuning parameters that depend on the steel grade of the coil, T is the temperature of the coil, ε is the strain, and ε̇ (s⁻¹) is the strain rate. Table 1 summarizes the other related factors used in the equations for the rolling force and the deformation resistance. The exact formulations of some factors depend heavily on the factory environment or the engineers' choices. Thus, we omit some details, but the attached references give their fundamental formulations.
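The Sims and Misaka computations above can be sketched as follows. The parameter values in the example are illustrative only, and the geometric factor Q_p and the flattened roll radius R' are treated as given inputs rather than computed from their auxiliary formulations:

```python
import math

def misaka_kp(A, B, n, m, T, strain, strain_rate):
    """Misaka-type deformation resistance (3): k_p = A exp(B/T) eps^n eps_dot^m.
    A, B, n, m are steel-grade-dependent tuning parameters; T is the coil
    temperature (kelvin here, for illustration)."""
    return A * math.exp(B / T) * strain**n * strain_rate**m

def sims_force(b, kp, Qp, R_flat, H, h):
    """Sims-type rolling force (2): RF = b k_p Q_p sqrt(R'(H - h)).
    Q_p and R' are taken as precomputed auxiliary factors."""
    return b * kp * Qp * math.sqrt(R_flat * (H - h))

# Illustrative call: a 1250 mm wide coil reduced from 30 mm to 25 mm.
kp = misaka_kp(A=1.0, B=2500.0, n=0.21, m=0.13, T=1300.0,
               strain=0.3, strain_rate=10.0)
force = sims_force(b=1250.0, kp=kp, Qp=1.0, R_flat=400.0, H=30.0, h=25.0)
```

As expected from (2), a larger thickness reduction H - h yields a larger rolling force, all else being equal.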
Using only the traditional methods caused a huge gap between the calculated force and the actual force, because all the relevant factors in the real environment cannot be considered. Therefore, we suggest machine learning based models combined with the traditional models to overcome the existing limitations. We will create additional features from the main components of the above models, and use them to train our model. Additionally, the rolling force calculated by the classical models is used as one of the input features of our model.

B. REPRESENTATIVE TEMPERATURE
To set up the problem related to the temperature of the coil as mentioned above, we define a representative temperature of the coil as the appropriate value for the input T of the deformation resistance formula (3). First, we collected values of the process variables, such as the rolling force and the entry/delivery thickness of the coil, from the process history data and substituted them into (2), yielding the deformation resistance k_p. T can then be calculated by substituting the strain and the strain rate from the data and solving (3) backward. We remark that the representative temperature T is uncorrelated with the surface temperature measured by several sensors in the plant.
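The inverse calculation can be sketched as follows: once k_p has been back-calculated from (2), solving (3) for T gives T = B / ln(k_p / (A ε^n ε̇^m)). The parameter values in the round-trip check below are illustrative, not actual steel-grade constants:

```python
import math

def representative_temperature(kp, A, B, n, m, strain, strain_rate):
    """Invert the Misaka-type equation (3) for T:
    k_p = A exp(B/T) eps^n eps_dot^m  =>  T = B / ln(k_p / (A eps^n eps_dot^m)).
    k_p itself is first back-calculated from the measured rolling force via (2).
    """
    return B / math.log(kp / (A * strain**n * strain_rate**m))

# Round-trip check with made-up constants: forward through (3), then invert.
A, B, n, m = 1.2, 2500.0, 0.21, 0.13
T_true, strain, strain_rate = 1350.0, 0.25, 12.0
kp = A * math.exp(B / T_true) * strain**n * strain_rate**m
T_back = representative_temperature(kp, A, B, n, m, strain, strain_rate)
```

The round trip recovers T exactly (up to floating-point error), which is what makes T a well-defined training target for the learning models.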

C. MACHINE LEARNING MODELS
1) GRADIENT BOOSTED TREE
A decision tree model is a supervised learning model that draws a final conclusion based on conditions on the input variables. At each internal node, one attribute is labeled as one of the input variables, and the joint space of the input variables is split by a decision rule. Thus, each final node, or leaf, corresponds to a disjoint region of the input space partition. We denote the region obtained from the j-th leaf as R_j. By assigning a constant γ_j to the region R_j, we can construct a decision tree regression model expressed as

f(x) = Σ_{j=1}^{J} γ_j 1(x ∈ R_j).    (4)

Gradient boosting is a machine learning technique that builds a single strong model as a sum of several weak prediction models. The weak models are iteratively combined to minimize the difference between the ground truth and the prediction result of the previous step. For our problem, the decision tree model was selected as the weak regression model. Precisely, a boosted tree model is a sum of trees:

F_M(x) = Σ_{m=1}^{M} f_m(x; Θ_m),    (5)

with Θ = {Θ_m = {(R_jm, γ_jm), j = 1, ..., J_m}, m = 1, ..., M}, where J_m is the number of leaves of the m-th tree. For a given data set (x_i, y_i), i = 1, ..., N, a typical supervised learning problem minimizes the objective function

L(F) = Σ_{i=1}^{N} l(ŷ_i, y_i) + Σ_{m=1}^{M} Ω(f_m),    (6)

where l is a differentiable loss function that measures the difference between the prediction ŷ_i and the target y_i, and Ω is a regularization function. Thus, the boosted model update at the n-th step minimizes the objective function

L^(n) = Σ_{i=1}^{N} l(y_i, ŷ_i^(n-1) + f_n(x_i)) + Ω(f_n),    (7)

where ŷ_i^(n-1) is the prediction of the (n-1)-th-step model. Here, a second-order approximation can be used for computational efficiency. Other optimization methods, such as pruning and split-finding algorithms, have also been studied. For more details, see [11].
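A minimal sketch of this boosting scheme, using squared loss and depth-1 trees (stumps) on a single feature, is shown below. With squared loss the negative gradient is simply the residual, so each weak learner fits the residuals of the current model; production implementations such as XGBoost add second-order information, regularization, and more sophisticated split finding:

```python
def fit_stump(xs, rs):
    """Best single split on a 1-D feature minimizing squared error.
    The two sides of the split play the role of the leaf regions R_j,
    and the leaf means play the role of the constants gamma_j."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        if not left or not right:
            continue
        gl, gr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - gl) ** 2 for r in left) + sum((r - gr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, gl, gr)
    _, t, gl, gr = best
    return lambda x, t=t, gl=gl, gr=gr: gl if x <= t else gr

def boost(xs, ys, n_rounds=20, lr=0.5):
    """Gradient boosting with squared loss: each round fits a stump to
    the residuals (the negative gradient) and adds it with step size lr."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * stump(x) for stump in stumps)
```

On a simple step-shaped target, the boosted sum of stumps converges geometrically toward the training values, illustrating the iterative residual-fitting update.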
Since we can evaluate how the splitting at each node (according to its attribute) affects the performance of the trees, the importance of each feature in the final prediction model can be measured. In particular, the rolling process deals with dozens of operation variables. Thus, rather than using all the variables for learning, training models with only some important features can be more efficient in a data-analytic sense.

2) ARTIFICIAL NEURAL NETWORK
Artificial neural networks (ANN), among the most basic models of artificial intelligence, are supervised learning methods that loosely model the biological structure of the brain. A network consists of neurons with thresholds and activation functions: the input signal is multiplied by the weight of each connection and transmitted to the next neuron, and this structure repeats to yield the final output value. More recently, ANNs with several hidden layers, called deep neural networks (DNN), have become popular because of their remarkable success in various machine learning tasks, such as speech recognition [25], reinforcement learning [26], [27], and image classification [28].
Mathematically, a DNN is a nonlinear function consisting of several hidden layers, each with several hidden nodes. Nodes in consecutive layers are connected through an appropriate activation function. An artificial neural network with L layers can be formulated as follows:

z_j^(l+1) = σ_l( Σ_{i=1}^{m_l} w_ji^(l+1) z_i^(l) + b_j^(l+1) ),    (8)

where z_i^(l) is the value of the i-th node of the l-th hidden layer, m_l is the number of nodes in that layer, σ_l is an activation function, and w_ji^(l+1) and b_j^(l+1) are the weights and biases between the l-th layer and the (l+1)-th layer. Mathematical issues in ANN models, such as complexity, approximation properties, and estimation, have also been studied (see, e.g., [29], [30]). A feedforward ANN with one hidden layer can approximate any Borel measurable function, and thus the ANN is considered a universal approximator (see [31], [32]). Subsequently, Hornik et al. [33] proved that multi-layer feedforward networks with a monotone sigmoid activation function can also approximate any measurable function.
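The layer recursion in (8) can be sketched in a few lines of NumPy. The network sizes, weights, and activation choice here are illustrative; the last layer is left linear, as is common for regression outputs:

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Evaluate the layer recursion (8): z^{l+1} = sigma_l(W^{l+1} z^l + b^{l+1}).
    weights[l] maps layer l (size m_l) to layer l+1; the output layer is
    kept linear, as is common for regression targets."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = activation(W @ z + b)
    return weights[-1] @ z + biases[-1]

# A tiny 2-2-1 network with hand-picked weights for illustration:
# output = tanh(x_1) + tanh(x_2).
weights = [np.eye(2), np.array([[1.0, 1.0]])]
biases = [np.zeros(2), np.zeros(1)]
out = forward(np.array([0.0, 0.0]), weights, biases)
```

Because tanh is odd, this particular network maps the zero vector (and any input of the form (a, -a)) to zero, which gives a quick sanity check on the recursion.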
We remark that such approximation results do not by themselves provide suitable weights w_ji^(l+1) and biases b_j^(l+1) in (8). Therefore, we applied back-propagation techniques based on the given training data. In particular, we used the Adam optimization algorithm [34], one of the most successful stochastic gradient-based optimization techniques. Adam updates the parameters using estimates of the first and second moments of the gradients, adaptively scaling the learning rate for each parameter. Specifically, we implemented our neural network models and optimization with the PyTorch library [35]. Derivative computation in PyTorch is based on automatic differentiation (AD), a powerful scientific computing tool. We note that AD is distinct from numerical differentiation (such as finite difference methods) and symbolic differentiation. The basic principle of AD is that even the most complex functions consist of a sequence of elementary operations, so their derivatives can be computed using the chain rule [36]. This allows us to apply gradient-based optimization to our complex and deep model.
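The AD principle can be illustrated with forward-mode dual numbers, where each elementary operation propagates a derivative via the chain rule. (PyTorch's autograd uses reverse mode for efficiency on many-parameter models, but the underlying principle is the same.)

```python
class Dual:
    """Forward-mode automatic differentiation via dual numbers: each value
    carries a pair (v, dv), and every elementary operation propagates the
    derivative by the chain rule."""
    def __init__(self, v, dv=0.0):
        self.v, self.dv = v, dv
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v + o.v, self.dv + o.dv)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.v * o.v, self.dv * o.v + self.v * o.dv)
    __rmul__ = __mul__

def grad(f, x):
    """Derivative of f at x: seed dv = 1 and read off the propagated dv."""
    return f(Dual(x, 1.0)).dv

# d/dx (3x^2 + 2x) = 6x + 2, which is 14 at x = 2.
g = grad(lambda x: 3 * x * x + 2 * x, 2.0)
```

The derivative is exact (no truncation error as in finite differences) and is computed without ever forming a symbolic expression, which is what makes AD suitable for deep models.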

IV. DATA
A. DATA DESCRIPTION
Hot rolling process log data were collected from July 19, 2019 to August 9, 2019 from the hot rolling mill at an overseas branch of POSCO. A Steckel mill was used for the rolling, which performs a maximum of 7 passes for the whole process. Data features were divided into two categories: common features and pass information features. Common features are relevant to all passes of the process. They include information about equipment specifications and the status of the slab, such as the radius of the working rolls, total process time, initial temperature, size, and steel grade of the slab. Pass information features are obtained from each pass of the FM process. Some of these features are values that must be set in advance according to the process manual or at the engineer's discretion, whereas others are values observed as a result of the process. For example, the rolling speed, time, unit tension, coil thickness, and surface temperature at the entry/delivery of each pass are pass information features.
Some features were measured for each region of the coil. In this study, we divided the coil into three regions: head, mid, and tail. The head and tail regions are approximately 10-12 m from each end of the coil and are directly affected when wound on the coiler. Direct contact with the coiler drums tends to complicate the prediction of changes in the internal temperature at the head and tail ends of the coil. The region-specific features include rolling speed, torque, and surface temperatures of the coils.
We note that the raw temperature data were obtained by measuring the surface temperature of the coil. However, since heating and cooling of the coil surface are repeated during rolling, an uneven temperature distribution develops in the thickness direction of the coil. Hence, it is not appropriate to use the raw temperature data directly as inputs to the classical models.
Our dataset includes information on coils of various grades of stainless steel. Typical stainless steel grades can be classified according to their crystal structure arising from their chemical composition (Cr, Ni, Mo, Mn, N, and C). In this study, we focused on a single standard grade of stainless steel as the target. The data for the other steel grades were used for training only. The data corresponding to the target steel grade were split in half: one half was used for training and the other half for testing. The training data were further subdivided into training and validation data. The training data were used to fit the parameters of each model, whereas the validation data were used to monitor training progress. Using validation data is important to prevent model overfitting.
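The split scheme described above can be sketched as follows. The validation fraction, seed, and function name are assumptions for illustration, not values from this study:

```python
import random

def split_dataset(target_coils, other_coils, val_frac=0.2, seed=0):
    """Split scheme sketch: target-grade coils are split in half into
    train/test halves, other grades go entirely into the training pool,
    and the training pool is further subdivided into train/validation."""
    rng = random.Random(seed)
    shuffled = target_coils[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    test = shuffled[:half]
    train_pool = shuffled[half:] + other_coils
    rng.shuffle(train_pool)
    n_val = int(len(train_pool) * val_frac)
    return train_pool[n_val:], train_pool[:n_val], test  # train, val, test
```

Note that the test set is drawn exclusively from the target grade, so the reported accuracies refer to the grade of interest even though other grades contribute training signal.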

B. FEATURE ENGINEERING
Feature engineering is the process of transforming raw data into a form suitable for training predictive models. For example, it is possible to create new artificial features from existing variables or to change the shape of the data. In particular, we focused on improving the accuracy of the model by creating artificial variables based on domain knowledge of the rolling process, so that the learning model can more easily capture the complex relationships between the variables.

1) SCALING
If the units of the data are unspecified or the distribution is biased, scaling is used to improve learning by artificially adjusting the distribution of the data. For the artificial neural network model in our case, min-max scaling was used as follows:

X̃ = (X - X_min) / (X_max - X_min),    (9)

where X is one of the numerical features, and X_min and X_max denote the minimum and maximum values of X, respectively. Finally, the scaled data X̃ is used as the final input for training.
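A minimal implementation of this min-max scaling; in a deployed pipeline, X_min and X_max would be computed on the training data only and reused for validation/test data to avoid leakage:

```python
def min_max_scale(values):
    """Min-max scaling: X~ = (X - X_min) / (X_max - X_min), mapping a
    numerical feature into [0, 1]. Assumes the feature is not constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For example, the column [10, 20, 30] maps to [0, 0.5, 1], so features with very different units (temperatures in hundreds of degrees, thicknesses in millimeters) end up on a comparable scale for the network.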

2) ONE-HOT ENCODING
Categorical or string variables must be transformed into dummy variables. That is, if a variable indicates membership in a class, it needs to be converted into a learnable form such as a number or a vector. For example, we apply one-hot encoding to the pass number and steel grade variables to create dummy variables, because these variables only represent membership in the corresponding class.
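A minimal sketch of one-hot encoding without external libraries (in practice a library routine, such as pandas' `get_dummies`, would typically be used):

```python
def one_hot_encode(values):
    """One-hot encode a categorical column: each category becomes a 0/1
    dummy column, so no spurious ordering is imposed on, e.g., the pass
    number or the steel grade."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows
```

Encoding class labels this way, rather than as plain integers, prevents the model from treating, say, pass 5 as "five times" pass 1.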

3) ARTIFICIAL FEATURES
We created features artificially by applying arithmetic operations to existing features. Training models is more efficient, especially when additional information on the relationship between the target and input variables is available. In this case, thermo-mechanical factors frequently used in the field were used as artificial features, as in Table 1. In addition, features found to be significant by exploratory data analysis (EDA) and correlation analysis were used. Table 2 summarizes the artificial features used to train our models.
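As an illustration of this kind of feature creation, two standard rolling quantities can be derived from the entry/delivery thicknesses. These particular features are examples of the genre, not necessarily the exact entries of Table 2, and the field names are assumed:

```python
import math

def add_artificial_features(sample):
    """Sketch of artificial feature creation from existing pass variables.
    'reduction_ratio' and 'true_strain' are standard rolling quantities
    derived from the entry thickness H and the delivery thickness h."""
    H, h = sample["entry_thickness"], sample["delivery_thickness"]
    out = dict(sample)
    out["reduction_ratio"] = (H - h) / H   # relative thickness reduction
    out["true_strain"] = math.log(H / h)   # logarithmic (true) strain
    return out
```

Features like these hand the model quantities that appear directly in the classical force and deformation-resistance formulas, instead of forcing it to rediscover them from raw thicknesses.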

V. RESULTS
To solve the problem, we used two prediction models: a gradient boosting regression model and an artificial neural network model. In particular, one of the most popular open-source gradient boosting models, the XGBoost regressor (XGB), was selected to predict the rolling force and the representative temperature for each pass in the hot rolling process. Hyperparameters of XGB were elaborately tuned via K-fold cross validation (K = 5). A detailed description of the hyperparameters of each model is presented in Appendix A. Quantitative analyses were performed according to the conditions of the input data features. Specifically, two comparative experiments were designed to determine the following:
- Differences in the behavior of the temperature distribution across the regions of the coils (head/mid/tail).
- Effect of the artificial features on predicting the temperature and rolling force.
To achieve the second goal, the default models are denoted XGB and DNN, and the modified models trained with additional artificial features are denoted XGB-A and DNN-A. We obtained the feature importance ranking from the learning results of the XGB-A model and trained each model using only the top-K features. Each such model is denoted with a K suffix (XGB-K and DNN-K). Here, K was set to one third of the total number of features.
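The top-K selection step can be sketched as follows. The feature names in the example mirror the kind of variables used in the process, but the importance scores are made up for illustration:

```python
def top_k_features(importances, k):
    """Rank features by importance score (e.g. from a trained XGB model's
    importance attribute) and keep only the top k for the reduced models."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]
```

In the XGB-K / DNN-K setup described above, k would be set to one third of the total feature count, and the returned list would define the input columns of the reduced models.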
We evaluated the prediction performance of each model using two measurements: the average relative error (RE) and the ratio of data within 5% relative error among the whole test data (RA). Each result was obtained after 10 repetitions of the train-test splits. Tables 3 and 4 show the relative error analysis for each model and each region of the coils. The best performance result is shown in bold for each case. Figs. 3 and 4 are visual representations of these results. In all cases, the accuracy of the representative temperature prediction is higher than that of the rolling force prediction. Rather than directly predicting the rolling force, it is better to predict the representative temperature with high accuracy and then calculate the rolling force using domain knowledge to obtain a better output. Figs. 6 and 7 show scatter plots of the test data predictions of the DNN-A model.
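The two measurements can be implemented directly:

```python
def relative_error_metrics(y_true, y_pred, tol=0.05):
    """RE: average relative error |y - yhat| / |y| over the test set.
    RA: fraction of test samples whose relative error is within tol
    (5% here), matching the two measures reported in the tables."""
    errs = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    re = sum(errs) / len(errs)
    ra = sum(e <= tol for e in errs) / len(errs)
    return re, ra
```

RE summarizes the average deviation, while RA captures how often a prediction is "good enough" for gauge control, so the two measures can disagree when errors are concentrated in a few outliers.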

A. SIMULATION RESULTS
We found that the model performance deteriorates for the later passes, as shown in Tables 6 and 7. To our knowledge, the frequency of manual intervention by field engineers to fine-tune the material thickness tends to increase toward the end of the process, resulting in noisy data from a data analysis perspective. Such noise hinders model learning.
The DNN model tends to yield a more accurate predictive performance for both the temperature and rolling force than the XGB model. While the DNN model may be superior in terms of predictive accuracy, many of the advantages of the XGB model cannot be ignored. Thus, we recommend the complementary use of these two models. For example, the most important variables can be selected from the XGB model through feature importance analyses, analyzed using field knowledge, and then used as input values for the DNN model.

B. FEATURE IMPORTANCE ANALYSIS
We present an analysis of the features that have been shown to carry high importance in the XGB models for both temperature and rolling force predictions.
- Roll diameter: Since the roll is in direct contact with the coil during rolling, the status of the roll is crucial to the temperature change of the coil. For example, the area of contact with the roll is proportional to the diameter of the roll, which can be a major cause of heat loss due to heat conduction.
-Slab width and length: These features determine the size of the material being processed. The larger the surface area of the material, the greater the thermal change due to friction with air or cooling.
-Total time: This feature is the sum of the rolling time in all passes and the idle time in between. The longer the rolling process takes, the greater the effect on the temperature change of the material.
- Temperature at FM entry: This feature describes the status of the slab when it first enters the FM process.

VI. CONCLUSION AND DISCUSSION
In this study, machine learning models were proposed to solve prediction problems in the hot rolling process of steelmaking. Our models aimed to predict the rolling force and the temperature of the coils for each region of the coil and each rolling pass. A method was proposed to take advantage of both the classical mechanical model and the new predictive model. For example, process variables that enhance the model performance were analyzed separately.
In particular, our models can be used as an in-line application in the industrial steelmaking process. New rolling process conditions can be set by analyzing the results of a model trained on previous process history data (e.g., 6-12 months). Data from the rolling process are continually accumulated and stored in the database, and thus the model can be updated regularly by learning from new data.

APPENDIX A
Table 5 summarizes the list of hyperparameters used in this study. Tables 6 and 7 compare the entire sets of prediction results for the representative temperature and rolling force, respectively. We evaluated the performance of each model using two measurements: the average relative error (RE) and the ratio of data within 5% relative error among the whole test data (RA). Each row corresponds to the results for each pass and each region (head/mid/tail); head, mid, and tail are abbreviated as H, M, and T, respectively. Figs. 8 and 9 summarize the overall DNN results reported in Table 6.