Detecting Anomalous Multivariate Time-Series via Hybrid Machine Learning

This article investigates the use of hybrid machine learning (HML) for the detection of anomalous multivariate time-series (MVTS). Focusing on a specific industrial use-case from geotechnical engineering, where hundreds of MVTS need to be analyzed and classified, has permitted extensive testing of the proposed methods with real measurement data. The novel hybrid anomaly detector combines two means for detection, creating redundancy and reducing the risk of missing defective elements in a safety relevant application. The two parts are: 1) anomaly detection based on approximately 50 physics-motivated key performance indicators (KPIs) and 2) an unsupervised variational autoencoder (VAE) with long short-term memory layers. The KPI captures expert knowledge on the properties of the data that infer the quality of produced elements; these are used as a type of auto-labeling. The goal of the extension using machine learning (ML) is to detect anomalies that the experts may not have foreseen. In contrast to anomaly detection in streaming data, where the goal is to locate an anomaly, each MVTS is complete in itself at the time of evaluation and is categorized as anomalous or nonanomalous. The article compares the performance of different VAE architectures [e.g., long short-term memory (LSTM-VAE) and bidirectional LSTM (BiLSTM-VAE)]. The results of using a genetic algorithm to optimize the hyperparameters of the different architectures are also presented. It is shown that modeling the industrial process as an assemblage of subprocesses yields a better discriminating power and permits the identification of interdependencies between the subprocesses. Interestingly, different autoencoder architectures may be optimal for different subprocesses; here two different architectures are combined to achieve superior performance. Extensive results are presented based on a very large set of real-time measurement data.


I. INTRODUCTION
T HIS article investigates the use of hybrid machine learning (HML) [1] to infer the quality of industrially produced elements, for which final quality control is not possible. In the specific use-case analyzed and presented here, underground foundation columns are being produced; however, since they are subsurface their final quality cannot be inspected directly. Consequently, the quality must be inferred through the analysis of measurement data acquired in real-time from the instrumented rig. This requires the analysis of large sets of multivariate real-time measurement data (MVRTD), emanating from instrumented industrial equipment. This is a pertinent topic since industrial IoT [2] is making ever-increasing volumes of measurement data available for analysis. Furthermore, HML, for example, physics-informed neural networks [3], is an emerging research topic and is considered the best approach to obtaining reliable results in conjunction with the analysis of the measurement data from physical systems [4], [5], [6]. Although this article is focused on a specific use-case, many of the results are relevant in other applications; in particular, the examination of different autoencoder architectures for the unsupervised analysis of multivariate time-series (MVTS) data and the combination with a genetic algorithm to optimize the hyperparameters.
The importance of machine learning (ML) in the field of instrumentation and measurement [4], [7] is increasing with the growing volumes of measurement data recorded in industrial processes. However, as shown in [5], the two communities, i.e., the machine learning and the instrumentation and measurement community, use different terminologies. As a result, we feel it is important to have interdisciplinary publications that bridge both the topics.
As described in [7], ML architectures should be chosen with care when applied in safety-relevant applications, as is the case here. For this reason, an HML approach was chosen, which augments an outlier detection technique based on multiple physics-based metrics, with unsupervised ML. The HML techniques were successfully applied in other fields related to instrumentation and measurement, e.g., in [8] and [9]. Many ML approaches presented in literature use publicly available datasets to perform ablation studies [10] and performance benchmarking of different architectures. A survey on different methods for time-series forecasting [11] on publicly available time-series datasets suggested that hybrid approaches are superior to pure statistical or ML approaches. The same study This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ concluded that more complex models do not necessarily lead to higher accuracy. This is important, since high accuracy, in the field of instrumentation and measurement, is a highly desirable feature [5], [6].
Especially in anomaly detection, benchmarks need to be treated with care, since most suffer from one or all the following issues [12]: the problem is too trivial and can be solved in a few lines of code; the anomaly density is too high; if supervised learning is used, the ground truth is mislabeled and often there are no data available after the anomaly has occurred, and Wu and Keogh [12] called this as the run to failure bias. The study done on the comparison of the classical pure ML methods, pure (deep) ML methods, and the combination of both, HML, suggests that there is no pure ML method that can outperform a hybrid system [13].

A. New Contributions
The initial results of this work were published in [14]; however, since then, significant modifications have been made and improved results obtained. This article presents these extensions and new results.
1) An HML framework that captures a priori knowledge in the form of key performance indicators (KPI), which is used to implement autolabeling, permitting a truly unsupervised learning. Furthermore, the combination of two means for anomaly detection, KPI and ML, is introduced; this leads to a redundant anomaly detection system and a higher likelihood for detecting anomalies. 2) The conceptual framework for handling MVTS measurement data in an object-oriented way is presented; covering all the tasks from data ingestion to outlier detection. The data are handled together with the pertinent metadata, events, and segments. The concept that industrial processes can be described as consisting of subprocesses is discussed. KPIs are defined for the subprocesses separately and this leads to a hierarchical anomaly detection in the subprocesses. Aggregating the results yields a more granular classification of anomalies and reveals possible interactions between the subprocesses.

3) A clear definition and algorithmic implementation for
the outlierness is provided, i.e., a metric for the extent to which a time-series is considered to be an outlier by the KPI-based anomaly detector. 4) The statistical analysis of the performance of autoencoders with LSTM and BiLSTM layers for the subprocesses is presented. 5) The implementation of the autoencoder training has been modified, to significantly reduce the amount of padding required within the mini-batches during training.

B. Organization of the Article
This article is organized as follows: in Section I, an introduction to the topic of HML for anomaly detection in multivariate time-series is given; in Section II, insights into the use-case are presented; Section III discusses the structured data handling for large amounts of industrial real-time data. The first part of the hybrid framework based on KPI is shown in Section IV, whereas the anomaly detection based on ML can be found in Section V, and the results of the comparison of architectures are presented in Section VI. The article is concluded with insights gained and further work in Section VII.

II. USE-CASE AND NATURE OF THE DATA
The vibro-ground improvement techniques are used for almost 100 years to support buildings and infrastructure works; however, there is still no standardized method for quality control for ground improvement techniques [15]. The increasing economic pressure in the construction industry and a higher risk awareness accelerated the change process of traditional, mostly manually performed techniques, toward automated solutions for quality control [16]. The produced foundation columns are subsurface, and therefore, the quality control needs to be based on indirect methods and reliable labeling for supervised ML is not a feasible solution. For the last 30 years monitoring KPIs, which were derived from the recorded data from instrumented production rigs, built the basis for geotechnical and quality evaluations of the process [17]. These data are also utilized in installation reports which to this day are the core of quality control. For each foundation column, an installation report is created, which needs to be manually evaluated by geotechnical experts [15]. However, evaluating hundreds, if not thousands, of installation reports with limited human resources increases the risk that the controls are not performed with the due-diligence required. For this reason, our framework is used to support the geotechnical expert at this task and refining the KPI definitions to cover as many anomalies as possible. The KPI refinement is driven by new process insights, gained by the knowledge discovery process, performed after ML anomaly detection, to increase the interpretability of the anomaly detection framework, which is essential in critical infrastructure and safety-relevant applications [18]. The opportunity costs of false negative are much higher than of false positive detection. Therefore, the two anomaly detection results are combined with a logical or-if an MVTS is flagged by one of the means for anomaly detection as anomalous it is considered as anomalous and undergoes further investigations by geotechnical experts.
In more detail, the use-case presented here is the vibro ground improvement process. The goal is to improve the bearing capacity of the ground so that more stable foundations can be produced for a building. However, this leads to a safety-relevant system, since failure to detect an anomalous foundation column can lead to a local instability of a building. The process consists of four phases: run-in, process preparation is performed; penetration, whereby a rig penetrates the ground with a large vibrator until a predefined bearing capacity is reached; compaction, is the repetitive process of withdrawing the vibrator, introducing gravel into the ground and compacting with the vibrator; and run-out finalizing the process.
A typical building site may have between 250 ≤ m ≤ 1500 such foundation columns. Fusing the location data with machine data enables the georeferenced viewing of the data associated with the foundation columns. This supports  the geotechnical experts in establishing an overview of the geotechnical properties and enables an automatic detection of changing the subsurface properties. An example of such a view is shown in Fig. 1; this site has m = 637 foundation columns. The georeferencing permits the identification of systematic changes in the subsurface ground properties over a site. This is important since reliable detection of anomalies requires the separation of systematic changes from local anomalous behavior.
A series of instrumented rigs work in parallel on a single building site; see Fig. 2, producing the required foundation columns. The instrumented rigs have n = 9 sensor channels used in this process. The sensor data acquired in real-time are sampled at f s = 1 Hz. This yields an MVTS for each foundation column; an example of such data can be seen Fig. 3. Due to changes in the subsurface properties, the time required to produce the foundation may vary significantly; as a result, the time-series have strongly varying lengths. For the site data shown in Fig. 1, there is a median number of time-steps per foundation of t med = 345 and an interquantile range (IQR) of t IQR = 169. A t med of 345 time-steps with 1-Hz sampling frequency corresponds to an average lead time of 345 s. Consequently, the computational methods must be capable of analyzing, the execution of the production process with strongly varying lead time, which leads to recordings of the process with widely varying lengths. To summarize, there are approximately 1000 MVRTD associated with a single building site, each of these time-series having n = 9 sensor Fig. 3. Example of the real-time data acquired from the rig while producing a single foundation column. One such MVTS is acquired per foundation column. The four phases of the process can be seen: run-in (white), penetration (red), compaction (green), and run-out (white). Only the penetration and compaction phases are relevant for the quality of the foundation column. channels and several hundred time-steps recorded per sensor. Furthermore, there are multiple building sites to be monitored.

III. STRUCTURED HANDLING OF DATA
A significant portion of this project is associated with managing such volumes of multivariate measurement data, together with the corresponding metadata, in a systematic manner. There are tens of thousands of MVRTD that need to be handled. Consequently, a systematic data ingestion process has been defined. In this process, each of the MVTD is ingested. In the same step, events and segments are determined by applying a rule-based system. By performing data fusion, the data from computer-aided design (CAD) are merged with machine data for each of the foundations. This provides for the georeferencing of the data and enables the comparison of planned versus executed work. An object class for an MVTS has been defined. It provides a container for the MVRTD and augments this with the respective metadata, events, and segments. In this manner, all data required to segment and process the data from a specific foundation column are contained in a single MVTS object. These objects are saved in binary HDF5 [19] format and made available via a standard interface. This facilitates the exchange of data and fast loading into memory.
In addition, an index has been implemented with one column per foundation element and one row per indexing value, e.g., KPI. This permits the identification of MVTS from either its metadata or the values of the associated KPI or by an indexing value emanating from the ML evaluation of the data.

IV. KEY PERFORMANCE INDICATORS
Given the collection of MVTS, the task now is to individually characterize them by a series of metrics, called KPIs here. The goal of the technical KPI is to capture expert knowledge about the process and physical properties that can be expressed in terms of scalar metrics computed from the real-time machine data. These provide the physics-based side of the hybrid learning process.
The penetration and compaction phases are definitive for the quality of the executed foundation column, whereas the complete process, including run-in and run-out, may be relevant for the efficiency. Consequently, KPIs are defined for and categorized according to: penetration, compaction, and complete element. The categorization is achieved via metadata and not hard coded. In previous work [14], these phases were analyzed together. However, defining KPI in separate categories provides the possibility of a more granular classification of the MVTS according to the subprocesses. For example, the penetration phase may be anomalous, whereas the compaction is not. Currently, there are more than p = 50 different KPIs computed.

A. Exemplary Technical KPI
Here, two new exemplary KPIs 1 are presented. They are physics-based metrics and permit the identification of certain types of anomalies. Furthermore, they can be computed using the map-reduce paradigm [20], opening the door to parallel computation, if required.
The incremental work as a function of time w(t) is computed from the amperage a(t) multiplied with the operating voltage (electrical work) and the product of force f (t) and penetration rate dd(t)/dt (mechanical work). This computation has the same temporal resolution as the MVTS data and corresponds to the map step in map-reduce. Then the total work W , required to perform penetration, is the integral over where by t s and t e correspond to the start and end times for penetration, respectively, and this aggregation corresponds to the reduce step. The total work required to complete penetration W is used as a metric to characterize each MVTS. The second KPI presented L is the ratio of the traversed length to the depth penetrated. Given d(t) the depth as a function of time, L is computed as follows: ( Consider the exemplary MVTS shown in Fig. 4: this corresponds to a straightforward penetration of the ground and yields L = 1. In contrast, consider Fig. 5: as can be seen, there were difficulties penetrating the ground and the vibrator had to be retracted several times and the process restarted. This behavior is considered anomalous; however, it has predictive value, since together with the georeferencing it can be used to determine whether this is a local random anomaly, or if the subsurface properties are changing systematically over a site.

B. Heat-Maps and Outlier Detection via KPI
Currently, more than p = 50 KPI are computed, in three categories, for each foundation column. The KPI can be organized and amperage a(t) are the channels that form the MVTS of a single foundation column. The work channel w(t), underlayed in blue, is a derived variable, computed from the above three channels and has the same temporal resolution as the real-time data from the machine. The integral over w(t) yields the total work required to perform penetration. This dataset corresponds to the almost ideal process of creating a column. The L ratio, indicated in the header, refers to the ratio of traversed distance to depth penetrated.  as a matrix T ( j, k), whereby each row j corresponds to a KPI and each column k to a specific foundation point. Normalizing the KPI matrix T , using min-max normalization, by row provides a uniform scaling for the graphical representation, e.g., in Fig. 6 the KPI heat-map for the compaction phase and in Fig. 7 the heat-map of the KPI relevant for all phases is shown.
In addition, the KPI matrix T is used to detect MVTS that are statistical outliers; this infers the possibility of an anomalous foundation column. The foundation columns are subsurface, the precludes access to the ground-truth, and Extract row vector t q 25 (t) and q 75 (t)) refer to the 25% and 75% percentiles of the vector t), whereas I Q R refers to the interquantile range.  consequently, an inference must be used at this point. The term outlierness is coined here, and it is a relative measure of the degree to which the MVTS can be considered to be an outlier. It is the number of KPI for which the column data are a statistical outlier divided by the total number of KPI being considered. The pseudocode for the computation of the outlierness is shown in Algorithm 1.
V. ML-BASED ANOMALY DETECTION As described in [18], autoencoders are a popular choice for anomaly detection in critical infrastructure. The task of identifying patterns that are not present in data from normal operations is called anomaly detection [18]. Anomaly detection can also be described as a binary classification problem with one class containing the anomalous data and one class containing the nonanomalous samples [21]. Modeling the normal behavior in physical systems is often not a feasible solution, if it is unknown or simply too complex. In ML Fig. 8. Interplay of the components of our hybrid framework. (a) Files are recorded at different sites and mapped to a storage location. From these data, the KPIs are calculated as part of the data-ingestion process. (b) From KPI visualizations-e.g., the heat-map is created and the outlierness-rank of the MVTS-is calculated. Based on the outlierness, the files are separated in a training and validation set. (c) Process of getting the autoencoder anomaly score for the data of the validation set using a trained ML model. By passing the validation set through the autoencoder, the reconstruction error can be calculated. In the end, the two anomaly scores per MVTS for the KPI-based anomaly detection and the autoencoder-based anomaly detection are combined in a bivariate histogram as a starting point for the knowledge discovery process.
approaches for anomaly detection, this normal behavior is learned/inferred from provided data samples [21]. A simplified drawing of the architecture used is shown in Fig. 8.
Most of the lately invented architectures for anomaly detection in time-series data are designed for streaming data for the multivariate [22], [23] or univariate case [24] using sliding window approaches. The industrial use-case presented in this article, however, deals with a high number of MVTS data files with recordings of varying length. Each MVTS is complete at the time of evaluation. In this manner, it is more closely related to boundary value problems, than locating a changing pattern in streaming data.
To apply the autoencoder to the samples of varying length, a novel resampling approach was developed (see Section V-D), which reduces the effects of resampling to a minimum. This approach differs from the previously mentioned, since the goal is not to localize the position of the occurrence of an anomaly, but to detect whether a built column, an MVTS, is anomalous as a whole. In contrast to approaches (e.g., [25]) which apply ML on KPI data, in this framework the KPI data are used as a separate mean for anomaly detection and for the unsupervised training set construction. The learning algorithm, however, is applied on the recorded time-series data. In addition, the available process knowledge should be incorporated into the hybrid anomaly detector to acquire explainable and physically consistent results.

A. Objective-Based Preprocessing
To meet different objectives, different preprocessing steps need to be performed to extract and enhance the special characteristics of the data. The objective can also be dependent on the subprocess monitored. In this section, the preprocessing steps regarding outlier detection in terms of the quality of the foundation columns are shown.
1) Feature Selection: Using the whole set of channels is in most cases not beneficial for the ML performance. A higher number of features used results in a higher number of learnable parameters, increased model complexity, and training time. The goal is to reduce the number of features used while preserving the relevant information and eliminating redundant features [26].
Despite using unsupervised ML, knowledge about the monitored system and the different recorded channels should be used when deriving the subset of channels. In this approach, the channels from which the work performed as a function of time (see Figs. 4 and 5) can be derived were chosen.
Depending on the goal that should be archived with ML, a different subset of channels may be required as input to the ML model. Two possible examples of goals that require a different channel selection are anomaly detection regarding the foundation column quality or anomaly detection regarding the production efficiency of the ground improvement process.
2) Rule-Based Segmentation: Based on the definition of rules and events using expert knowledge, the MVTS are segmented into subprocesses. This is done because separate ML models are trained for the separate phases. The subprocesses are indicated in Fig. 3 with different colors. 3) Data Trimming: The objective of anomaly detection for the case study is to detect anomalies in the sense of quality. The ground improvement process contains process pauses, e.g., logistical difficulties in gravel delivery. These pauses do not have an impact on the effectiveness of the performed process but the process efficiency. To meet our objective, these pauses when no drilling is performed are removed from the MVTS used for ML. After trimming the data by removing the process pauses, in MVTS with anomalous process behavior discontinuities can occur. These discontinuities caused by operating errors and other process anomalies can be detected using appropriate techniques [27], [28]. This is done to use these anomalous cases in a knowledge discovery process. This is based on previous work published in [29]. Schematic representation of an autoencoder with the encoder that encodes the input features into latent variables and the decoder that reconstructs the lower dimensional representation available in the latent space.

B. Architecture Selection
An autoencoder is an ML structure performing two mappings [30]. The goal is to extract latent variables that allow for a minimal amount of distortion in the reconstructed signal [30], [31]. The dimension of the latent space n z is for anomaly detection chosen to be n z < b, b being the dimension of the encoded data [32]. Autoencoders and variational autoencoders (VAEs), which differ in the regularization of their cost function, were successfully used for anomaly detection in time-series in the past [33]. A schematic representation of the ML architecture is shown in Fig. 9.
The assumption behind anomaly detection using VAEs is performing some kind of dimensionality reduction and the assumption that anomalies contain nonrepresentative features which cannot be encoded into the lower dimensional space [34], [35]. As described previously, the three-channel MVTS used for ML is converted into a lower dimensional latent representation. The degree of dimensionality reduction is determined by the dimension of the latent space, the latent dimension n z ; in this work, n z = 1 was chosen to archive a sufficiently high degree of dimensionality reduction.
A recurrent neural network (RNN) was incorporated into both, the encoder and the decoder, to handle the sequential data [36], namely, a long short-term memory (LSTM) [37] layer. Unidirectional LSTM is processing the input data in the forward direction, whereas bidirectional LSTM (BiLSTM) is processing it in a forward and backward direction [38]. Bidirectional models can only be used when the whole data sample that they are applied to is available [36]. This condition is fulfilled in the use-case presented here, since the data analyzed are not streaming data. BiLSTM performs well, especially when the value of time-step t depends on both, prior and past time-steps [39]. This is the case in boundary value problems. The drilling process described in this article can be seen as a (homogeneous) boundary value problem since the drilling starts at the surface and ends at the surface.

C. KPI-Supervised Training
As described in Section V, most of the data, gathered with physical systems, are unlabeled and reliable labeling is hard to archive [21]. Because of these circumstances, an unsupervised anomaly detection approach was chosen. Unsupervised anomaly detection is done under the assumption that the training set ⊂ , whereas is the set of all available samples, contains mostly samples from normal operations but could contain a small number of still unknown anomalies [13], [40]. This training set is constructed using the outlierness calculated with the KPI [13] for this reason, and we call it KPIsupervised training as the anomalous samples are excluded and autolabeled using the outcome of the KPI classifier. This excludes the already known anomalies covered by the KPI framework.
The training set is constructed using the outcome of KPI anomaly detection. The proposed framework does not require human labeling, since it uses the results gained with the KPI classifier to construct a training set only consisting of MVTS which are considered to be nonanomalous. The whole process of preprocessing and training is illustrated in Fig. 10.

D. Length-Preserving Training
The training procedure based on stochastic gradient descent was adjusted to apply on MVTS of different time-series lengths and minimize the effects of resampling to a common length which is required for gradient calculation. During training, the samples are sorted according to their time-series length and only resampled to a common length, downsampled to the shortest length of the samples forming a mini-batch; see Fig. 11. After training the autoencoder in this new way, it can reconstruct samples while preserving their real length. This is desirable since an unusual time-series length, the process length of an industrial process, can be a sign of anomalous process execution. This avoids the necessity to resample the time-series to a common time-series length [41], [42] or to pad the data with zeros to the length of the longest time-series [43], [44], which adds information and has an impact on the learning performance, especially when used with LSTM [45].
The cost function optimized with stochastic gradient descent is a regularized [46] version of the least-squares cost function minimizing the reconstruction error. Being Y k the kth sample of a mini-batch andŶ k the reconstructed MVTS, i.e., the output of the autoencoder the reconstruction error E r of a mini-batch of size m is given as 2 2 A F denotes the Frobenius norm of the matrix A. Fig. 11. Samples forming a mini-batch of the training dataset are downsampled to the shortest time-series length of the mini-batch and the training dataset is not resampled to a common time-series length.
In the VAE, the regularization is based on the Kullback-Leibler divergence 3 between two distributions and given as D KL () [47], [48]. The loss function of a VAE can be written as [47] and [48] with being a prior chosen by the user, often a multivariate Gaussian distribution with zero mean and a diagonal covariance matrix. The second distribution is the learned distribution in the latent space. In the formulation introduced by [48] γ = 1, in this case, the objective of the learning algorithm is minimizing the evidence lower bound (ELBO) [47]. In our approach, we use a so-called β-VAE [49] where the Lagrangian multiplier β is a hyperparameter that is optimized during hyperparameter optimization (HPO), described in Section V-E, to ensure that the balancing between the two terms of the loss function is optimal.

E. Hyperparameter Optimization
The hyperparameters have a significant impact on the performance of the network [50]. The optimal performance of an ML architecture on a dataset is obtained by optimizing the hyperparameters. The available hyperparameters depend on the architecture chosen and their optimal value on the problem and the corresponding data. The search space of the HPO grows exponentially with the number of parameters optimized. This combined with the long training times of ML architectures mostly heuristics or meta-heuristics are used [36], [43].
In this framework, a genetic algorithm is used for the HPO, and the following hyperparameters were optimized. . The genetic algorithm adjusted for this framework is described in more detail in [51] and was executed with the following settings: 15 individuals per generation, threefold cross-validation for the fitness estimation, and a maximum of ten generations. Moreover, a combination of two crossover functions was used to obtain a better exploration of the search space. 4

F. Anomaly Detection Using Autoencoders
The last step of anomaly detection using autoencoders is performing the binary classification task. In this step, the model trained on the subset is applied to the whole available dataset .
When using autoencoders, the output is the reconstructed signal from which the reconstruction error can be derived [33], [44], [52]. From the reconstruction error, the anomaly score is calculated. The anomaly score used here for the sample y with a time-series length t is calculated as follows: with y (c) denoting the measurement taken at time-step c and y (c) the reconstructed measurement at time-step c. Because of the changing time-series length, the anomaly score E a is normalized by the number of time-steps t, which leads to a normalized anomaly score E n To get the classification, a threshold u is needed to represent the boundary between the classes. Since the distribution of reconstruction errors is positive semi-definite, a skewness adjustment needs to be performed [53], [54], [55]. The MVTS with E n < u get assigned to the class of nonanomalous samples and MVTS with E n ≥ u are marked as anomalous.

VI. ARCHITECTURE COMPARISON AND RESULTS
The novel hybrid framework was applied to the case study introduced in Section II. Separate autoencoders were trained for the analyzed subprocesses, compaction, and penetration. The following architectures were compared. 1) VAE-LSTM: VAE with one LSTM layer in the encoder and the decoder. 2) VAE-BiLSTM: VAE with one BiLSTM layer in the encoder and decoder.

A. HPO Results
The HPO was executed four times to cover all the two compared architectures; the two phases and the results are shown in Table I.

B. Architecture Comparison
In this section, different architectures are evaluated phasewise with metrics based on the anomaly score which is derived from the reconstruction error of trained autoencoders. Because of the nature of ML problems, random initialization  of learnable parameters, and an overdetermined solution space, their performance varies when executing them multiple times on the same dataset using the same hyperparameters. These observations were previously described in [51].
The visualization of the anomaly score of the trained architectures is shown in Fig. 12. The performance fluctuation of the training on the same data with the same hyperparameters can be observed on the vertical lines. The horizontal lines indicate that the sample was reconstructed with a similar anomaly score in the majority of the t trained autoencoders.
To get an estimate of the performance of the architectures, r = 125 instances of each ML architecture were trained and then an outlier detection was performed. The goal was to identify the best architecture in terms of reliability, the variance of the outcome should be low, and at the same time the distance between the two groups should be maximized. The obtained results are shown in Table II. 5 The results of the comparison of the ML architectures as shown in Table II are interpreted as follows. 1) The mean of the means of the reconstruction errors of samples classified by the autoencoder architectures as nonanomalous is lower for VAE-LSTMs in both the subprocesses; the corresponding results occur calculating the median of medians.
2) The mean of the means of the anomaly scores of samples classified by the autoencoder architectures as anomalous is higher for BiLSTM in the penetration subprocess. This can also be seen in the box-plot shown in Fig. 13. A higher mean of the distance between nonanomalous and anomalous samples is desirable, since it makes the two groups better distinguishable. Based on the results, the following selection of architectures was made: for the two subprocesses, different architectures   were chosen; for penetration, autoencoders incorporating BiLSTM layers outperformed autoencoders using LSTM layers; for compaction, the architectures using LSTM layers performed significantly better.

C. Subprocesswise Anomaly Detection
The bivariate histogram shown in Fig. 14 indicates that a high anomaly score in one of the two phases does not correlate with a high anomaly score in the other phase. A correlation coefficient r = −0.0628 −0.3 indicates a very weak negative linear relationship [56] between the anomaly scores of the two phases. This is consistent with the experts' institution of this case study; if the penetration phase is anomalous, this does not result in an anomalous compaction subprocess. Bivariate histogram of the anomaly score E n for a trained VAE-BiLSTM and the KPI outlierness for the penetration phase.

D. Results KPI-Supervised Training
As described in Section V-C, the training sets were constructed using the outlierness. However, when analyzing the distribution of the anomaly scores of both the groups, MVTS which are considered, based on the KPI analysis, to be anomalous and those considered to be no-anomalous, the distributions shown in Fig. 15 of the anomaly score were obtained. It can be seen that the outcome is two nearly identical probability density functions. Based on these results the KPI-supervised training needs further refinement to produce meaningful, distinctive results. This could be caused by the KPI covering all the sensor channels or by the fact that the KPI classifier and autoencoder anomaly detection cover other anomalies; see Fig. 16.
However, summed up it can be said that a high reconstruction error of an optimized and trained autoencoder does not lead to a high outlierness and vice versa. This can be seen in Fig. 16, where the best performing architecture for compaction (VAE-BiLSTM) was trained with optimized hyperparameters to obtain the anomaly score and the KPIs covering the penetration phase were used to calculate the outlierness. As already reported in [14], the two methods identify a common set of MVTS having both low anomaly scores and low outlierness. Based on these, two means of anomaly detection two classifications can be derived and combined to cover a wider variety of anomalies; since the application shown is safety-relevant, every additionally detected anomaly is strengthening the process safety.

VII. CONCLUSION
A new framework for the handling of MVTS data recorded in industrial applications was presented. The presented framework standardizes the data handling and makes the HML classifier applicable to a wide variety of use cases. Multiple technical KPIs that capture physical knowledge were developed. Furthermore, additional refinements of (Bi-)LSTM autoencoders were performed to make them better suitable for MVTS data with varying lengths. This permits the construction of ML architectures that preserve the real length of the MVTS after applying ML.
A new approach of unsupervised training set construction using the outcome of the KPI anomaly detection was investigated, which needs further refinement. A comparison of LSTM-VAE and BiLSTM-VAE was performed and led to using different recurrent layers for the present subprocesses of the investigated case study. It can be concluded that training separate ML models for the subprocesses is beneficial since the anomaly scores of the two phases are not correlated and it also gives insights into which of the subprocesses the recorded data are anomalous.
Further work should be done in the direction of unsupervised training set construction and evaluation metrics for unsupervised ML architectures and training functions which are robust against outliers in the training set. The comparison shows that different architectures of different sizes are optimal for the different subprocesses. It should be noted that it is concluded after this study that due to the extensive knowledge discovery process and the derived preprocessing techniques, only a low number of not excluded anomalies is still present in the compaction phase. One indicator for that is the overall low number of detected anomalies and the overall low anomaly score of the subprocess.