A Multi-Step Anomaly Detection Strategy Based on Robust Distances for the Steel Industry

Steel making industries exhibit extreme working conditions characterized by high temperature, pressure, and production speed as well as intense throughput. Due to high economic and energy investments of the overall production process, an intense and expensive preventive maintenance program is adopted to avoid breakdowns. Steel making process would greatly benefit from a predictive maintenance module able to detect incoming faults from data process analysis. However, due to intense preventive maintenance, available data recording process operations enclose only a few samples of fault events, avoiding the efficient application of classical data driven anomaly detection models. In an attempt to overcome the above mentioned limits, we report the outcome of an industrial research project on data-driven anomaly detection in a steel making production process. The study assesses a fault detection strategy for rotating machines in the hot rolling mill line: we developed an automatic two-step strategy, which combines two statistical methods over the available data set: more precisely, the combination of Re-weighted Minimum Covariance Determinant estimator and Hidden Markov Models helped identify working conditions in a drive reducer of a hot steel rolling mill line and automatically isolate signs of decreasing performance or upcoming failures. The proposed strategy has been validated on real data collected in a steel making plant placed in the South of Italy.


I. INTRODUCTION
The steel making process forges a collection of raw materials, including steel scrap, carbon, and limestone, into steel bars with different diameter sizes. In order to obtain the highest production rates, today's steel industries handle heavier loads and faster velocity than ever before. Indeed, continuous production processes working in these harsh environments characterized by high temperatures, pressures, and production speeds need to flow uninterrupted [1]: unexpected equipment failures are expensive and potentially catastrophic, resulting in severe risk for plant operators, unplanned The associate editor coordinating the review of this manuscript and approving it for publication was Baoping Cai . production downtime, costly replacement of parts. To reduce breakdown impacts on production, steel making companies adopt preventive maintenance strategies substituting equipment long before the end of their useful life [2], [3]. Apart from scheduled maintenance, production line reliability may be increased adopting fully automatic monitoring and fault diagnosis systems. Indeed, a major task in steel making industries is the detection of faults, whereas neither an analytical description of faults and process models nor the collection of typical breakdown patterns exists [4].
Fault detection methods can be roughly classified into two broad categories: model-based methods and data-based methods (also referred to as history-based). The first relies on a mathematical model of the system under analysis; faults are VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ detected by computing the error between the estimated model and the actual values of the process variable. However, this method requires a detailed theoretical model the derivation of which is a complex and time-consuming task, in particular when production lines are equipped with rotating machines. Data-based methods have been largely adopted since they do not require any specific a priori knowledge of the system characteristics; they only rely on the processing of available data measurements and statistical techniques [5], [6]. A fundamental requirement for the effective implementation of data-based method is the availability of relevant, information rich training data. However, in the particular environment of high risk/high intensity industries -such as steel industriesthe availability of a large training data set is in fact impracticable since machine faults are prevented and unlikely to occur. Lack of breakdown samples induces to address further investigation towards the development of efficient methods to detect anomalies in production data, rather than to predict fault events [4].
In an attempt to settle an anomaly detection module in the steel production line, we developed a two-step scheme, relying on complementary data-driven techniques. Specifically, a preliminary anomaly detection phase has been carried out based on Minimum Covariance Determinant (MCD), a simple robust multivariate outlier technique. Secondly, a more complex method is proposed based on Hidden Markov Models (HMM), targeting a more precise identification of the anomalies. Both techniques essentially rely on the fusion of information received from sensors to obtain one reliable measure and a corresponding cut-off level; new data coming from the operating plant are compared with the cut-off level to be classified as anomalies, normal operating conditions, or intermediate operating stages.
The preliminary anomaly detection method of our scheme uses multivariate analyses techniques which exploit the nature of anomalies, i.e. measures aside from main features of the majority of the data, exhibiting their own pattern or even no pattern at all. These measures are also referred to as outliers; in what follows we will refer to anomalies and outliers interchangeably. Various methods for detecting multivariate outliers have been proposed in the literature (see [7], [8] for an overview); out of these, Principal Component Analysis (PCA) and Independent Component Analysis (ICA) have been largely adopted in industrial anomaly detection with some applications to steel making industry [9]- [12]. While intuitive and popular, PCA has been developed under the assumption that the extracted principal components (PCs) have a Gaussian distribution; in practice, PCA is not often applicable, and it has been proved to provide unreliable monitoring results [8]. ICA-based monitoring has been developed to overcome PCA limits and it has been successfully adopted to analyze non-Gaussian data in the steel casting process [8]. However, ICA also showed several drawbacks: computational load, unavailable criteria for sorting independent components, and solutions changing at each computation since ICA is initialized randomly.
Among various available robust techniques for estimation of multivariate location and scatter and outliers detection, we focus on the MCD estimator [13]- [15]. Briefly, the idea behind the MCD is to discard a fraction of most distant data points; the data to be retained are chosen as those whose sample variance-covariance matrix has the lowest determinant. These data are expected to be the closest to each other and not contaminated by outliers. In order to increase efficiency, a re-weighting step is usually performed: observations with squared distances from the MCD fit above a given threshold are trimmed. Finally, anomalies conditions are identified looking for multivariate outliers by the inspection of univariate distances. In particular, data points whose distance is larger than a fixed threshold can be pointed as potential outliers.
Multivariate analysis provides binary information (i.e. normal/anomalous) on production process operating conditions. It is a simple yet useful instrument to build up a fully automated process data classification. However, a complex production process such as rolling mills shows a transition from normality to failure through degradation of process performances over time. Indeed, some changes in process variables (such as vibration, oil temperature, motor electrical absorption) may alter the operating conditions without immediately leading to a breakdown; however, if unreported, the persistent shift over time to a range of intermediate efficiency states will conduct to a failure. Hence, to monitor and predict several latent states of the activity of the production process along time we fit an Hidden Markov Model (HMM) over distances. The HMM provides a dynamic classification of distances and in particular of large distances corresponding to outliers into the latent states. Furthermore, the HMM returns the probability of moving towards degradation stages to the final inferred state denoting failure. HMM represent a well-known statistical tool to monitor different stages of the process and for automatic outliers detection [16]- [18]. Here we mainly rely on HMM to deal with the continuous-in-time nature of the available measurements. Furthermore, we also stress that effective outliers detection rules should be driven from suitable robust methods well suited to handle outliers [19].
We applied the two-step scheme described above to a steel making industry placed in the south of Italy. We focused on anomaly detection in rolling mills of the metal sheet forming processes. This problem has been rather studied in the literature. Machine learning techniques have been more successfully applied to faults affecting product qualities [20]- [23]; when applied to anomaly detection methods, most of the proposed strategies rely on classical machine learning methodologies applied to rotating machines (as in [24], [25]) hence they are strongly limited by the absence of data on breakdown events in the steel making process. To overcome this limit faults are usually simulated, for example, corrupting a vibration measurement with an additive train of pulses [25]. A residual based analysis has been adopted in [4]; authors rely on model identification of main rolling mills components and classify anomalies accordingly. Despite some intriguing results, this methodology suffers from oversimplification of plant activity and it is impracticable in a long chain of rolling mills. Conversely, the two-stage scheme we developed relies on robust measures; in fact, the HMM is identified over the distances of the data points from the robust fit stemming obtained through RMCD estimation, allowing a dynamic strategy that takes into account the time-varying nature of the data and infers several latent states in the production process. Robust distance fuses the information received from sensors displaced on the production process, hence obtaining one reliable measure and the corresponding trade-off, easily implementable and adaptable to all process stages. The procedure has been validated through production data from the working plant.
Among our main contributions we consider: • The identification of a metric fusing sensors' measures in one reliable feature.
• The design of an automatic procedure for anomaly detection in large data sets with strongly unbalanced clusters.
• The extension of the static anomaly detection method to an HMM-based method providing additional information on the evolution of anomaly severity in time.
• The validation of the proposed methodologies to rolling mills of a steel production plant, using data recorded from Pittini, a steel making plant placed in the South of Italy.

II. DEVICE DESCRIPTION AND PROBLEM SETTING A. STEEL MAKING PROCESS
Steel making is a multistage process including steel making, continuous casting, and rolling stages (see Figure 1). The process starts with the melting of raw materials into the electric arc furnace (EAF) at 1600 o C. The molten steel is continuously supplied into the casting lines, in which the liquid steel is ''guided'' through rolls; the output of this stage is a semi-finished product, namely the billet.
Billets are sequentially rolled in multiple rolling mills, i.e., a chain of rotating machines that progressively stretches the bars and reduces their diameters. The mill drive train supplies mechanical energy to the top and bottom driven rollers of the rolling mill stand; it is configured with an electrical motor, a gear drive, and a pinion stand, all connected by couplings; the combination of these elements is also referred to as a cage ( Figure 2).
A steel making plant generally produces several types of bars differing for their diameters (also denoted as profiles); each profile is worked by a subset of the available milling machines. For instance, Pittini plant produces 25 profiles (φ, in mm) each one worked by a specific combination of the 18 cages comprising the milling line. In order to obtain the highest production rates, today's rolling mills handle heavier loads and faster velocities than ever before, leading to high temperatures, pressures, and degradation. The layout of a rolling mill is highly vulnerable to sudden breakdowns: if a fault occurs in one of the operating cages, it may cause injuries to operators, damages to other mechanical equipment, and relevant economical losses. For these reasons, faults are usually prevented by scheduling frequent and costly maintenance interventions which usually comprise the replacement of critical components much earlier than the end of their useful life.

B. AVAILABLE SIGNALS AND MONITORING SYSTEM
The core system to be studied consists of an electric motor and a gear reducer, represented in Figure 3, serving the blooming mill of the cage. The following are cage parameters monitored in Pittini's plant; they represent as well a list of the most frequently recorded data rolling mills: • Three vibration signals measured by three accelerometers placed on the shell of the drive reducer, sampled at 10 kHz.
• Oil pressure in the drive reducer, acquired at 1 Hz.
• Two temperature signals measured by two probes placed in the opposite sides of the reducer, sampled at 1 Hz.
• Current absorbed by the motor driving the mill, with the sampling frequency of 10 Hz.
• Revolution speed of the motor, measured in Round Per Minute (RPM) and acquired at 10 Hz.
• Torque of the motor sampled at 10 Hz. The set of the above mentioned sensors is replicated on 6 of the 18 cages of the plant, i.e., the monitoring system of the rolling stand supervises the activity of the 6 most used rolling machines. The recording system is completely automatic: it acquires and stores on a server the data corresponding to one registration every 10 produced billets, i.e., the data are stored only if the plant is producing. In particular, starting from 5 sec when the billet is loaded to all cages involved in processing the ongoing profile, the acquisition lasts for 15 sec and outputs a data set with a number of rows equal to the number of samples and a number of columns equal to the total number of sensors. The corresponding impact in terms of memory load is about 100 MB/billet.

C. STRATEGY FOR ANOMALY DETECTION
In this section, we describe our two steps methodology for anomaly detection in rolling mills, schematically is represented in Figure 4; the theoretical background supporting each step is detailed in Section III while examples of its application to Pittini's plant are in Section IV. The first preparatory step is the computation of residuals on the acquired measures. This step is peculiar to Pittini's data set, nevertheless, it may apply to several production sights. The supervisory system records process parameters of all worked profiles; according to its production plan, each profile is milled in some cages at peculiar reference values of the production parameters. Indeed, different profiles are worked at different nominal values of temperature, pressure, and line velocity. The first objective of our work is to obtain one reference parameter -fusing information received from sensorsindependent from billets profiles. To this aim a conditional mean has been evaluated: for each profile and for each feature the mean of all recorded data is computed. In order words, a profile-wise centering operation has been performed in order to avoid the effects due to data acquisitions from distinct profiles. With a slight abuse of terminology, we refer to the record of the profile-wise centered data as the data set of residuals; this will be input to subsequent phases. Since the data are supposed to be contaminated by the occurrence of several outliers, we perform a centering operation based on a robust mean, rather than on the classical sample mean. Here, we use the Huber estimate of location [19]. The use of the sample mean is deprecated since it would be attracted by outliers unavoidably and all residuals severely shifted. As a consequence, a large rate of genuine observation could be transformed into anomalous. The second step concerns the robust fitting of multivariate location and scatter over the p−dimensional residuals evaluated using the RMCD estimator. Next, relying on the robust estimate, a static outliers detection test is performed for each data point at an overall confidence level α = 0.01. The testing procedure is based on distances from the robust fit. Finally, an HMM is fitted to the robust distances in order to take into account time dependence among subsequent measurements (i.e. a dynamic outliers detection is performed). One of the inferred HMM states is expected to collect outliers and denote a failure state.
The analysis of robust distances through the RMCD leads to appropriate testing rules that are meant to detect inconsistent data. Although efficient, this kind of strategy is basic since it only deals with the pair of latent conditions anomaly/not anomaly and it is likely to suffer a large number of false alarms (false positives). On the contrary, the transition from a normal operating condition towards faults can be identified fitting an HMM over such distances evaluated on the entire data set. Moreover, the outliers detection rule from RMCD is static: it does not take into account the sequential nature of the data, whereas the analysis on robust distances made by fitting an HMM is dynamic, the evolution of the clustering process being governed by a first-order Markov chain.

A. STATIC ANOMALY DETECTION
Notation: We denote by R and N + the sets of real and positive natural numbers, respectively. Uppercase bold variables X represent matrices, lowercase bold ones x represent vectors, that are supposed to be column, i.e., x = (·, . . . , ·) is a column vector.
Let Y = (y 1 , y 2 , . . . , y n ) be a sample of size n from a p-variate Normal distribution Y ∼ N p (µ, ), where µ is the sample mean, is the sample covariance matrix, and p ∈ N + is the number of features, i.e., y i = (y i1 , y i2 , . . . , y ip ) ∈ R p , µ ∈ R p , and ∈ R p×p , respectively.
Definition 1 (Mahalanobis distance): The Mahalanobis distance (MD) between the ith sample y i = (y i1 , y i2 , . . . , y ip ) and the sample mean µ ∈ R p , with respect to the covariance matrix ∈ R p×p , is A traditional approach to detect outliers from a data set is based on the MD evaluated on the sample meanμ and the sample covariance matrixˆ of the observed data Y. In particular, a sample y is considered anomalous if d(y;μ,ˆ ) is larger than some fixed threshold. Under the assumed pvariate Normal model with known µ and , the squared MD obeys a chi-squared distribution with p degrees of freedom, i.e., d 2 (Y; µ, ) ∼ χ 2 p . A frequent choice [19] consists in using as a threshold the 0.975-or 0.99-level quantile of the χ 2 p distribution. However, it is well known that traditional approaches in computingμ andˆ are not robust to the outliers, indeed they are biased by outliers themselves.
This motivates the exploration of methods to robustly estimate the parameters of the p-variate Normal distribution. The MCD is one of the most popular methods. The method is very popular as it relies on good asymptotic properties [26] and the availability of efficient algorithms for its computation [14]. The idea behind MCD is to discard a fraction of the most distant data points [19], with denoting the trimming level determining the size of the considered subsample. The data to be retained are chosen as the set of sizen := n(1 − ) whose covariance matrix has the lowest determinant (where · indicates the floor function), i.e., the data that are most close to each other are less likely to be outlying.
Given data Y i , the MCD estimator iŝ where z is a binary vector whose components z i equal to zero in case of a trimmed observation, and the consistency factor c(p, ) is used to make the MCD consistent at the p-Normal model, by inflating the covariance matrix based on the selected subset [19]. The MCD can be evaluated according to an iterative algorithm that assures a monotonic decrease of the determinant of the estimated covariance matrices, i.e., at each iteration j = 1, 2, . . . , |ˆ j | ≤ |ˆ j−1 |, where |·| here is the determinant, and the superscript is used to represent the iteration. Therefore, the sequence of determinants is expected to converge in a finite number of iterations. The complete MCD procedure is summarized in Algorithm 1.

Algorithm 1 MCD algorithm
Require:μ 0 ,ˆ 0 be initial estimate based on a subsample of sizen, , n, p Ensure:μ MCD ,ˆ MCD 1:μ MCD ←μ 0 ,ˆ MCD ←ˆ 0 2: repeat 3: for i = 1, 2, . . . , n do 4: end for 6: Sort distances d (1) ≤ d (2) . . . ≤ d (n) in non increasing order and take a subsampleȲ of sizen based on the lowest distances 7: Estimateμ MCD andˆ MCD using (2) 8: until convergence In order to increase the efficiency of the method in the finite horizon, a reweighting step is usually performed, leading to the reweighted MCD (RMCD) method. The reweighting step is performed downstream of Algorithm 1 and works as follows. Given the distances d i,MCD = (y i −μ MCD ) ˆ −1 MCD (y i −μ MCD ), i = 1, 2, . . . , n at convergence, computed as described in Algorithm 1, then the observations y i for which d 2 i,MCD > χ 2 p,q are trimmed, where χ 2 p,q is the q-quantile of the χ 2 p distribution. Let¯ ≤ be the rate of observation trimmed in the reweighting step. Then, VOLUME 9, 2021 the RMCD is obtained as the sample covariance matrix over the non-trimmed set of data inflated by c (¯ , p).
Let (μ RMCD ,ˆ RMCD ) denote the RMCD estimate. Then, an observation y is labeled as anomalous if It is worth noting that the problem of mislabeled instances can be further faced with additional well-known methods. For example, when the number of instances is relatively large, to contain the number of false positives (FPs), we suggest controlling multiplicity by some correction procedure, such as the False Discovery Rate (FDR) [27], as we show in the experiments section.

B. DYNAMIC ANOMALY DETECTION
The anomaly detection strategy described so far can be considered as a static analysis as it identifies only two states in the process, i.e., no failure and failure, thus overlooking any kind of time dependencies. However, real conditions of manufacturing processes exhibit continuous trends over time, starting from healthy states and dropping toward some failure conditions. In light of this dependency of any machinery health's conditions on the time, it would be desirable to implement a dynamic analysis on the data. To this aim, we here provide a general method based on HMM that outputs a set of latent states along with an estimate of their transition probability distribution. An HMM is a framework for representing probability distributions over sequences of observations. It involves two interconnected models: i) the state model, consisting of a discrete-time, discrete-state, first-order Markov chain, with states z t ∈ {1, 2, . . . , k} and transition probability P(z t |z t−1 ), and ii) the observation model, whose underlying dynamics is given by P(y t |z t ), where y t is an observation at time t. Thus, in this model, an observation y t is produced by a stochastic process Z, but the state z t of this process cannot be directly observed, i.e., it is hidden [28]. The stochastic process Z is assumed to satisfy the Markov property, i.e., given z t−1 , the current hidden state z t is independent of all the states prior to t − 1. The corresponding joint distribution of a sequence of states z 1:τ and observations y 1:τ can be factored as P(z 1:τ , y 1:τ ) = P(z 1 )P(y 1 |z 1 ) τ t=2 P(z t |z t−1 )P(y t |z t ) To summarize, an HMM can be defined by the tuple k, n, P, Q, P 0 , where k is the number of possible hidden states, assumed to be discrete, and n is the number of distinct observations. Moreover, P and Q determine the underlying transition models for the states and the observations, respectively, with P i,j = P(z t = i|z t−1 = j) and Q i,j = P(y t = i|z t = j). The last element, P 0 , represents the initial state distribution.

2) (Decoding) Given an observation sequence Y and an
HMM λ = (P, Q), discover the best hidden state sequence z = (z 1 , z 2 , . . . , z n ) . 3) (Learning) Given an observation sequence Y, the set of states k in the HMM and the initial state distribution P 0 , learn the HMM parameters P and Q. Problem 1 aims at computing the likelihood of a particular observation sequence Y. Because the hidden state sequence corresponding to Y is not known, one possibility is to sum over all possible hidden state sequences, i.e., P(Y|λ) = z P(Y|z, λ)P(z|λ). This procedure is extremely computationally inefficient because there are k n possible hidden sequences, thus requiring an exponential time in the number of states. A more efficient solution is called forward algorithm and it is a kind of dynamic programming algorithm that computes and stores in a table the partial probabilities for each state-time pair, and iteratively computes the following probability: where α t (j) = P(Y, z t = j|λ), and b j (y t ) is the state observation likelihood of the observation y t given the current state j. After having initialized the probability α 1 (j) = P 0 j b j (y 1 ), 1 ≤ j ≤ k, the forward step (4) can be recursively applied for all states and observations in order to compute the marginal P(Y|λ) = k i=1 α n (i). An extended version of the forward algorithm can be used for the learning problem. In particular, if we design a backward procedure that computes the backward probability of seeing the observations from time t + 1 to the end, i.e., β t (i) := P(y t+1 , . . . , y n |z t = i, λ), we can solve the problem of training both P and Q. This algorithm is known as the Baum-Welch algorithm [30], and it is a special case of the well-known Expectation-Maximization (EM) algorithm [31]. The recursive backward rule is given by The most probable state sequence z for an observed sequence Y can be obtained exhaustively by deriving all the possible state sequences, generating P(z|Y, λ), and finally obtaining the z for which this probability is maximum. Even if this approach may be feasible for relatively small Y, it becomes computationally expensive as n increases. To efficiently obtain the most probable state path for a given observation sequence, i.e., to solve the decoding problem, we can use the Viterbi algorithm [32] that is a dynamic programming method that efficiently computes the best path probability by substituting the sum operation (over the previous path probabilities) with a maximization. It is worth noting that, for page limit reasons, we do not formally state the three algorithms as they are standard methods used to solve the HMM problems, and we remind to [28], [30], [32] for their formal definitions. Based on the above discussions, we can now cast the HMM problems to our dynamic anomaly detection task. In particular, assuming that the hidden states represent the health conditions of the monitored system, it would be important to estimate the probability to transition from one state to another, given a sequence of observations, i.e., problem 3. Moreover, given an estimate of the model and a new observation sequence, it would be desirable to assign to such sequence a corresponding hidden state sequence, e.g., using the Viterbi algorithm.
In summary, the proposed methodology performs as follows: 1) compute the robust distances d(y i ;μ RMCD ,ˆ RMCD ), i = 1, . . . , n; 2) fit an HMM model using the Baum-Welch algorithm; 3) given a new observation, compute its robust distance from the RMCD estimate d(y;μ RMCD ,ˆ RMCD ) and predict the corresponding hidden state using the Viterbi algorithm.
In all the aforementioned methods, the number k of hidden states is assumed to be known. In real applications it is usually determined according to information criteria, such as the Akaike or the Bayesian information criteria (AIC and BIC) [33].

IV. RESULTS AND DISCUSSIONS
In this Section, we present and discuss the results obtained applying the proposed methodologies on real data collected from the steel making production process described in Section I. In particular, in the following, we show the results obtained from data stored at 1 Hz and at 10 kHz; we do not show the results obtained with 10 Hz data since they did not exhibit any anomalous patterns during the observation time.
Moreover, during the observation period lasting three months, we observed only one significant downtime caused by a fault on the 18th cage; the overall number of billets produced during the observation time is equal to 5634. This unique occurrence is a rare event, taking into account the frequent preventive maintenance interventions scheduled to maximize the reliability of the production process. In light of the above considerations, we use the HMM-based approach to incorporate time dependencies into anomaly detection model and provide an automatic methodology able to estimate an health model based on real data. The application of the proposed methodology on production data shows benefits derived from our approach. The data set adopted in the following analysis has been pre-processed to remove NaN data and anomalous zero values; moreover, as discussed in Section II-C, a robust profile-wise centering step has been performed to obtain a data set independent on the produced profiles.

A. LOW FREQUENCY DATA
In this Subsection, we show the results of the static and dynamic anomaly detection methods applied to the data acquired at 1 Hz, namely two temperatures and one pressure transducer. Figure 5 displays the robust distances along with the detected anomalies for an overall level α = 0.01. The points above the cut-off line are those detected as outliers.
In particular, there are 360 flagged points (outliers) of which 108 with profile φ8. According to the steps highlighted in Section II-C, we first obtained robust residuals by inspection of the different profiles. Then, we applied the RMCD with an initial trimming level of 25%. The reweighting step leads to the deletion of about 17% of the data. Trimmed observations may give an indication of the presence of outliers and the actual level of contamination in the data but the outliers detection process is supposed to be performed separately, based on the current robust fit and formal rules as in (3). Figure 5 displays the robust distances along with the detected anomalies for an overall level α = 0.01. The points above the cut-off line are those detected as outliers.
As far as the dynamic anomaly detection analysis is concerned, we designed an HMM for a Gamma observations distribution and log link, as the chi-squared distribution belongs to the Gamma family, and the logarithmic link allows to estimate a positive mean. We chose the number of states according to the BIC criterion, and we considered a multi-start approach based on fifty random starts in order to make the final maximum likelihood estimate independent from the initialization. As we can see in Table 1, the BIC criterion suggests a model with k = 6 hidden states.
The results from the HMM analysis over the distances from the RMCD fit are given in Figure 6 and Figure 7, which display the distribution of the robust distances by the six inferred states and their classification on a log scale, respectively. The group with the largest modal distance (whose   label is 6) should be interpreted as the fault alert state. It is composed of 18 points, 16 of which are of φ8. On the contrary, the groups with the lowest modal distance can be seen as a normal condition state, whereas the other represents intermediate situations that characterize the transitions from normality to anomaly. Clearly, the HMM analysis of the robust distances led to a more plausible classification and dynamic description of the evolution of the process over time. Table 2 gives the fitted transition probabilities. In particular, we notice that the probabilities to have a fault when the process is in any of the other states are rather small, with a total probability of 0.014. Furthermore, the probability that the process from the fault alert condition of state 6 returns in the previous pre-alert state labeled 2 is 0.635. This finding may suggest the potential occurrence of false alerts. There is also a remarkable persistence in the fault alarm state, as the probability to remain in state 6 is 0.314.
We also ran a classical multivariate Gaussian HMM to the original data, still taking into account profile-wise centering. In this case, the BIC criterion leads to choose k = 8, which is a larger number of states underlying a more complex to interpret data structure (not only from a strict visualizing point of view). From the inspection of conditional means and variance-covariance matrices, it turns out that the failure state, characterized by the more extreme values for the second temperature sensor and the pressure sensor, is composed  of 400 points. In addition, there are at least a couple of alert states. We can state that the HMM evaluated over robust distances returns a more feasible description of the data structure and effective detection of the failure state. On the contrary, the classical HMM suffers from masking effects, since it is not able to detect outliers properly with a lower number of states and even with k = 8 is likely to flag many false positives.

B. HIGH FREQUENCY DATA
The same analysis has been performed on the data generated by the three accelerometers. The data generated for each billet have been summarized with the mean values, that are immediately affected by anomalies in the production process. Figure 8 shows the distribution of robust distances among profiles; the horizontal line gives the cut-off to detect outliers. It is evident that the majority of outliers belong to profiles of φ8: 158 over a total of 381 detected anomalies. In particular, the more extreme points in the first box-plot are those corresponding to the observed fault.   By running the HMM, the BIC criterion identifies five latent states. In this example, the HMM has been fitted after that the squared robust distances have been transformed according to the Wilson-Hilferty formula [34], which allows to transform χ 2 p distributed variate towards normality. Therefore, a Gaussian HMM has been fitted because the Gamma HMM showed severe convergence issues. The BIC criterion selected a model with five latent states. The distribution of distances within each state is displayed in Figure 9, whereas their classification is given in Figure 10. We notice that the fifth cluster includes the most incipient fault samples: 34 over 36 anomalies concern φ8 profile. The output from HMM clearly reduced the risk of false alarms with respect to the former static analysis. The entries in Table 3 give the transition probabilities: the probability of remaining in the fault state is 91.6%. In order to better assess the performance of the proposed technique, we also fitted a Gaussian multivariate HMM to the high frequency profile-wise centered values. The classical procedure returned a similar structure with the same number of latent states and a similar classification into states. The main difference concerns the fitted conditional probability to remain in the fault state that now is lower than its robust counterpart and equal to 0.876. Here, the classical procedure is able to deal with outliers since they are very far from the bulk of the data and characterized by huge values. Nevertheless, the underestimation of the conditional probability to remain in the alarm state could badly affect the prediction of future anomalies.
C. COMBINING DATA ANALYSIS: 1 Hz AND 10 kHz DATA We here consider the combination of the above analyzed data, i.e., temperature, pressure, and vibration signals obtained through robust distances. Let us focus on the results stemming from the HMM analysis over the robust distances. The BIC criterion suggested to select seven hidden states. The seven clusters inferred from the fitted HMM are given in Figure 11. It is quite evident that the state labeled 1 includes very large outliers. In this latent state 39 points have been classified of which 37 are from profile φ8. The classification is displayed in Figure 12. These results are in strong agreement with those stemming from the previous analyses. In this example, fitting the classical HMM model under the assumption of multivariate normality of the data at hand does not lead to select a feasible number of latent states, with k ≥ 10. It is evident that such a result makes any interpretation effort challenging and likely to be not useful. This means that the classical multivariate HMM is not able to cope with outliers, if not at the cost of an extremely complex underlying structure that can undo any prediction effort.

D. PREDICTION AND ALARM DETECTION
In order to investigate and assess the predictive capability of the proposed strategy, we decided to split the data into VOLUME 9, 2021 two groups: the first 2700 observations are used as train data, whereas the remaining 2744 as test data. The static outliers detection based on the reweighted MCD leads to flag about 20% of the test data as outliers. On the contrary, the HMM analysis detects only 49 points, after classifying robust distances into seven latent states: the points are displayed in Figure 13. The method is able to detect the occurred anomalies, despite few false alarms.

V. CONCLUSION
In this work we have proposed a two-step scheme, relying on two different methodologies (static and dynamic), for fault detection for rolling mill in a steel production plant, using real field data. A preliminary fault detection phase has been carried out based on robust MDs and on the RMCD. This method appears to be very effective for the detection of faulty samples, and it results computationally efficient. Secondly, based on the robust distances obtained in the previous phase, a more complex method is then used to confirm the detection of the fault and provide additional information about the probability of transitioning among the latent states, i.e., the HMM-based method can potentially predict the evolution over time of the status of the monitored system. In the results section, we proved the effectiveness of the proposed procedures when applied to real data.
We believe that the proposed methodology is attractive because it is straightforwardly implementable yet sufficiently accurate. Future work will cover the extension of the proposed methods to the frequency domain. Moreover, it will be investigated the application of the proposed anomaly detection method to an automatic plant's supervisor systems such as a digital twin.
KISAN SARDA (Student Member, IEEE) received the bachelor's degree in electronics and telecommunication engineering and the master's degree in electrical engineering with the focus on control systems from the University of Mumbai, India, in 2013 and 2017, respectively. He is currently pursuing the Ph.D. degree in information technologies. His current research interests include deep learning, system identification and control in the manufacturing field, prognostics and predictive maintenance, and Boolean networks.
ANTONIO ACERNESE (Student Member, IEEE) received the bachelor's and master's degrees in electronics engineering in automation and telecommunications from the University of Sannio, Benevento, Italy, in 2015 and 2017, respectively, with a focus on automation, where he is currently pursuing the Ph.D. degree in information technologies engineering. His current research interests include reinforcement learning, optimization and control in the manufacturing field, prognostics and predictive maintenance, prediction, and control methods. LUIGI GLIELMO (Senior Member, IEEE) was born in 1960. He received the Laurea degree in electronic engineering and the Ph.D. degree in automatic control. He taught at the University of Palermo, the University of Naples Federico II, and the University of Sannio at Benevento, where he is currently a Professor of automatic control. He has coauthored more than 160 papers on international archival journals or proceedings of international conferences, co-edited two books, and holds three patents. His research interests include singular perturbation methods, model predictive control methods, automotive controls, deep brain stimulation modeling and control, smart-grid control, and systems biology. He was the General Co-Chair of the 2019 European Control Conference, Naples, Italy. He has been the Chair of the IEEE Control Systems Society Technical Committee on Automotive Controls. He seated in the editorial boards of the IEEE TRANSACTIONS ON AUTOMATIC CONTROL. He also serves as an Associate Editor for the on-line journal IEEE CONTROL SYSTEMS LETTERS.
CARMEN DEL VECCHIO (Member, IEEE) received the Laurea degree in management engineering from the University of Naples Federico II, Italy, in 1999, and the Ph.D. degree in control and computer science from the University of Sannio, Italy, in 2004. Since 2011, she has been an Assistant Professor with the Engineering Department, University of Sannio. Her research interests include optimization and control of interconnected processes through various theoretical instruments, such as decisions Markov chains, probabilistic Boolean networks, machine, and reinforcement learning, and applications to several fields, including manufacturing, smart grids, medicine, and biology. VOLUME 9, 2021