Two-Stage Deep Anomaly Detection with Heterogeneous Time Series Data

We introduce a data-driven anomaly detection framework using a manufacturing dataset collected from a factory assembly line. Given heterogeneous time series data consisting of operation cycle signals and sensor signals, we aim at discovering abnormal events. Motivated by our empirical findings that conventional single-stage benchmark approaches may not exhibit satisfactory performance under our challenging circumstances, we propose a two-stage deep anomaly detection (T-DAD) framework in which two different unsupervised learning models are adopted depending on the types of signals. In Stage I, we select anomaly candidates by using a model trained on operation cycle signals; in Stage II, we finally detect abnormal events out of the candidates by using another model, which is suitable for taking advantage of temporal continuity, trained on sensor signals. A distinguishing feature of our framework is that operation cycle signals are exploited first to find likely anomalous points, whereas sensor signals are leveraged afterward to filter out unlikely anomalous points. Our experiments comprehensively demonstrate the superiority of our framework over single-stage benchmark approaches, its model-agnostic property, and its robustness to difficult situations.


I. INTRODUCTION
A. Background
With the rapid development of the fourth industrial revolution (Industry 4.0), which encompasses Industrial Internet of Things (IIoT) as well as smart and sustainable manufacturing systems [1], a vast amount of data are generated every day [2]. Along with the growing importance of Big Data analysis, studies on anomaly detection [3] have attracted increasing attention for modern industry. Anomaly detection enables us to solve a variety of important problems in smart industrial systems, e.g., security protection [4], fault detection [5]-[7], and condition monitoring [8].
On the other hand, artificial intelligence (AI) or machine learning (ML) models have been widely used in discovering anomalies on industrial datasets (see, e.g., [4]-[15] and references therein). Nevertheless, real-world industrial datasets pose the following challenging issues [16]: 1) the quality of data depends heavily on data acquisition environments; 2) observed patterns of anomalies may not be readily predictable; and 3) very few (trusted) labels are available. To tackle such challenges, we should design AI/ML-based anomaly detection models in a data-driven fashion so that they are capable of capturing complex temporal or organic relations and are robust to both label noise and class imbalance.

B. Motivation and Main Contributions
In this paper, we present a novel data-driven unsupervised approach for anomaly detection using a manufacturing dataset collected from a real-world factory assembly line, which accompanies a variety of considerably challenging problems such as severe data imbalance with very scarce labels and intrinsically unpredictable patterns. In particular, given heterogeneous multivariate time series data consisting of operation cycle signals and sensor signals, we are interested in automatically discovering true abnormal events in the factory assembly line, which are defined as anomalies in our context.
Our study is motivated by our empirical findings that conventional single-stage benchmark detection approaches (even sophisticated ones) may not provide satisfactory performance on our real-world manufacturing dataset; this is because existing approaches are not inherently designed to suit our own dataset composed of two different signals (i.e., operation cycle signals and sensor signals), which exhibit significantly different patterns from each other. More precisely, when there are two different signals recorded over a multi-scale time span as in our dataset, existing anomaly detection methods are not adequate for making use of such signals altogether. As a novel and effective solution to resolve this issue, we propose a two-stage deep anomaly detection (T-DAD) framework in which two different unsupervised learning models are adopted depending on the types of signals. The two models are trained without labels since unsupervised learning is more appropriate for our dataset, in which only very few abnormal events are available, as in many studies on such industrial systems [8], [17]. More specifically, in Stage I, we select anomaly candidates by using a model trained on operation cycle signals; in Stage II, we finally detect abnormal events (i.e., the most likely anomalies) among the candidates by using another model, which is suitable for capturing the temporal dependence, trained only on sensor signals. A distinguishing feature of our T-DAD framework is that signal-specific detection tasks are carried out in such a way that, along with the two empirically optimized thresholds, operation cycle signals are exploited first to find likely anomalous points whereas sensor signals are leveraged afterward as a secondary tool to filter out unlikely anomalous points.
In other words, our T-DAD framework fully takes advantage of the characteristics of our heterogeneous time series data in detecting anomalies, which basically differs from prior attempts in the literature.
Through intensive experiments, we demonstrate the superiority of our T-DAD framework over various single-stage benchmark methods using a certain assembly process dataset in terms of the best $F_1$ score, representing the best possible performance of a trained model on a test set given the optimal threshold(s), where the $F_1$ score is the harmonic mean of Precision and Recall. We also show the model-agnostic property of our T-DAD framework by diversifying two appropriately chosen learning methods in Stages I and II and then successfully integrating them into our framework. To validate the robustness of the proposed T-DAD framework to other fringe environments, we evaluate the performance of anomaly detection on other assembly process datasets that have both a balanced number of features in each of the operation cycle and sensor signals and a non-negligible fraction of anomalies; the results display consistent trends, with substantial gains of T-DAD over benchmark methods.
Our methodology sheds crucial insights into a simple implementation of data-driven anomaly detection methods when heterogeneous multivariate signals with unpredictable patterns are given as input and/or very few (or no) labels are available. Various problems in real-world manufacturing domains include but are not limited to fault detection, condition monitoring, manufacturing inspection, functional failure detection, alarm management, and so forth. When such practical problems deal with heterogeneous signals, using our distinctive two-stage strategy can be promising with potential gains. Moreover, as operational benefits in manufacturing domains, automatic discovery of abnormal events with much higher accuracy through our approach would enable cost-effective management in the Industry 4.0 era along with minimal or complementary interventions by the operator with domain knowledge.

C. Organization
The remainder of this paper is organized as follows. In Section II, we summarize studies related to our work. Section III describes our dataset. The overall methodology of T-DAD is presented in Section IV. Experimental results are provided in Section V. In Section VI, we summarize the paper with some concluding remarks. Table I summarizes the notations used in this paper, which will be formally defined in the following sections when we introduce our problem definition and technical details.

II. RELATED WORK
The framework that we propose in this paper is related to two broader fields of research, namely 1) anomaly detection in real-world manufacturing domains and 2) time series anomaly detection. There has been a steady push to solve anomaly detection tasks using real-world manufacturing datasets such as real-time motor fault detection using motor current signals [6], fault detection using wind turbine datasets [7], condition monitoring using a simulated bearing fault dataset [8], manufacturing inspection using tile production data [9], and functional failure detection using sensor data collected from a turbine syngas compressor in global manufacturing plants [5]. Anomaly detection in manufacturing or production lines (e.g., Intel's CPU manufacturing [10] and semiconductor manufacturing [11]) has also received significant attention.

B. Time Series Anomaly Detection
Anomaly detection in a time series using stochastic time series models such as autoregressive moving average (ARMA) [18] and autoregressive integrated moving average (ARIMA) [19] has been actively studied in the past two decades. One-class support vector machine (SVM) was also widely used for time series novelty (i.e., out-of-distribution) detection [20]. Moreover, probabilistic methods such as hidden Markov models [21], Bayesian networks [22], and Poisson models [23] have been presented for time series anomaly detection.
With recent advances in deep learning across a wide range of domains, anomaly detection tasks have also successfully employed various deep neural networks such as recurrent neural network (RNN), convolutional neural network (CNN), autoencoder (AE), variational autoencoder (VAE), generative adversarial network (GAN), and graph neural network (GNN). First, RNN models have been actively developed in time series anomaly detection to capture the temporal dependence in sequential time series data. The strength of using a long short-term memory (LSTM) network for time series anomaly detection was demonstrated in [24] by presenting a nonparametric and dynamic error thresholding approach that does not depend on either scarce labels or prior distributional assumptions. A stochastic RNN for multivariate time series anomaly detection to learn robust representations with a stochastic variable connection and planar normalizing flow was presented in [25]. Second, with regard to CNN-based anomaly detection models, a deep CNN approach was developed in [26] as a time series predictor module. Moreover, an algorithm that combines spectral residual and CNN was designed in [17] to achieve the state-of-the-art performance on univariate time series anomaly detection. Third, AE-aided approaches have also been widely employed by taking into account the reconstruction error as an anomaly score. An AE-based algorithm was proposed in [27] for anomaly detection in high-performance computing (HPC) systems. A robust anomaly detection method integrating AE and robust principal component analysis was presented in [28]. An encoder-decoder architecture-based anomaly detection method using LSTM layers was proposed in [13]. As an end-to-end learning solution, an unsupervised anomaly detection model that utilizes the AE and Gaussian mixture model was proposed in [29]. In [15], a scalable anomaly detection method for multivariate time series was built upon an encoder-decoder architecture with an adversarial training framework inspired by GAN. Fourth, among generative models, VAE or GAN-based time series anomaly detection approaches also received considerable attention. A VAE-based anomaly detection algorithm was designed in [30] along with a solid theoretical explanation for adopting VAE on univariate time series anomaly detection. An approach combining LSTM and VAE was presented in [31] for time series anomaly detection. An unsupervised anomaly detection method was also presented in [32] based on GANs using LSTM networks for each generator and discriminator. Fifth, increasing attention has recently been paid to time series anomaly detection models [33], [34] based on GNNs [35] in order to capture complex relationships between variables in multivariate time series data.

C. Discussions
As shown above, various time series anomaly detection approaches, from traditional methods to current deep learning-based methods, have been extensively developed in the literature. Nonetheless, to the best of our knowledge, there is no prior work on anomaly detection that makes use of the characteristics of heterogeneous time series signals such as those in our real-world manufacturing dataset.

III. DATASET DESCRIPTION
We use a multivariate time series dataset collected from an assembly line of automotive engine valves as one of the large-scale real-world manufacturing datasets. The entire dataset consists of a set of features recorded from December 2, 2019 to December 7, 2019, in which there are 518,400 timestamps (i.e., temporal records). Let us describe the set of features, which are partitioned into the following two types of signals.
• Sensor signals: Since sensor signals are recorded every second regardless of the operation status, they tend to exhibit temporal continuity;
• Operation cycle signals: Due to the fact that operation cycle signals are logged in units of operation cycles, they are less likely to follow temporal continuity. Different from sensor signals, since such operation cycle signals are purely event-driven, they occur irregularly.
We note that sensor signals are evenly spaced in time while operation cycle signals are unevenly spaced in time; this comes from the fact that the operation is inactive when either the planned short downtime or the irregular (unplanned) long downtime occurs. The behavior of heterogeneous features is illustrated in Fig. 1.

IV. METHODOLOGY
In this section, to describe the proposed methodology, we first present the problem definition. Then, we elaborate on our T-DAD framework.

A. Problem Definition
We define our problem as automatically detecting true abnormal events that have occurred in the factory assembly line using the heterogeneous time series manufacturing dataset. Point-wise anomaly detection would be of little practical use, since what matters to the operator is whether a possible abnormal situation has occurred within an acceptable time range [17]. Thus, we aim at conducting range-wise detection of abnormal events by discovering a most likely anomalous point, which is defined as a timestamp that manifests an abnormal pattern and lies near the point at which an abnormal event has indeed occurred.
Problem 1. Given a heterogeneous multivariate time series dataset, range-wise anomaly detection is to find an anomalous point $t$ such that there exists an abnormal event in $[t - \delta, t + \delta]$ for a certain tolerance $\delta > 0$.
Note that the tolerance δ is assumed to be set appropriately depending on the operator's criterion beforehand.

B. Overall Architecture
To accomplish anomaly detection with our heterogeneous multivariate time series data, we propose a novel two-stage detection framework, termed T-DAD, built upon a data-driven approach inspired by non-straightforward empirical findings. Fig. 2 illustrates the overall architecture of the proposed T-DAD framework in which signal-specific detection tasks are carried out after training two different unsupervised learning models, the so-called Stage I and Stage II models, according to two types of signals (i.e., operation cycle signals and sensor signals). To be specific, we start by adopting the Stage I model with multivariate operation cycle signals and the Stage II model with multivariate sensor signals that account for the temporal dynamics, where each model is trained in an unsupervised manner without any labels. Then, we compute anomaly scores, denoted by $a_t$, from the output of each model at each timestamp $t$. Thereafter, we perform two-stage anomaly detection as follows. In Stage I, we select a set of very likely anomaly candidates, denoted by $T_{o,\text{anomaly}}$, at which the resulting anomaly score $a_t^{(o)}$ is higher than a certain threshold $\tau_1 > 0$, i.e.,
$$T_{o,\text{anomaly}} = \{ t \in T_o : a_t^{(o)} > \tau_1 \}, \qquad (1)$$
where $T_o$ is the set of timestamps at which there exist operation cycle signals. In Stage II, in contrast, we detect the most likely anomalous points by filtering out very unlikely anomalous points among the selected candidates. More precisely, we remove the set of points, $T_s$, such that the maximum of the resulting anomaly scores $a_t^{(s)}$, $\forall t \in [t_i - \eta, t_i + \eta]$, is lower than another threshold $\tau_2 > 0$, where $t_i \in T_{o,\text{anomaly}}$ and $\eta > 0$ is a range setting parameter to be empirically found. That is, we finally detect the set of anomalies, $T_{\text{anomaly}}$, as in the following:
$$T_{\text{anomaly}} = T_{o,\text{anomaly}} \setminus T_s, \qquad (2)$$
where
$$T_s = \Big\{ t_i \in T_{o,\text{anomaly}} : \max_{t \in [t_i - \eta,\, t_i + \eta]} a_t^{(s)} < \tau_2 \Big\}. \qquad (3)$$
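The two-stage selection and filtering described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `two_stage_detect`, the dictionary representation of the Stage I scores, and the assumption that sensor scores are stored in a list indexed by the timestamp in seconds are all hypothetical choices made for the example.

```python
def two_stage_detect(score_o, score_s, tau1, tau2, eta):
    """Sketch of the two-stage rule (hypothetical data layout).

    score_o: dict {timestamp: Stage I anomaly score}, defined only on T_o.
    score_s: list of Stage II anomaly scores, indexed by timestamp (seconds).
    Returns (candidates, detected), both as sorted lists of timestamps.
    """
    # Stage I: keep operation-cycle timestamps whose score exceeds tau1.
    candidates = sorted(t for t, a in score_o.items() if a > tau1)

    # Stage II: drop a candidate when the maximum sensor score inside
    # [t - eta, t + eta] stays below tau2.
    detected = []
    for t in candidates:
        lo = max(0, t - eta)
        hi = min(len(score_s) - 1, t + eta)
        window = score_s[lo:hi + 1]
        if window and max(window) >= tau2:
            detected.append(t)
    return candidates, detected


# Toy example: two Stage I candidates, only one supported by sensor scores.
score_o = {10: 3.0, 50: 2.0, 80: 4.0}
score_s = [0.01] * 100
score_s[12] = 0.5
candidates, detected = two_stage_detect(score_o, score_s, 2.55, 0.08, 14)
```

In the toy example, timestamps 10 and 80 pass Stage I, and only timestamp 10 survives Stage II because a high sensor score occurs within 14 seconds of it.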

C. Implementation of the T-DAD Framework
In this subsection, we elaborate on the proposed T-DAD framework, which comprises two models along with signal-specific detection tasks. In our study, we adopt the DAE model [27], [28] in Stage I with operation cycle signals and the LSTM-DAE model [13] in Stage II with sensor signals. Note that, due to the model-agnostic property of our proposed framework, any unsupervised learning model with which the anomaly score is computed at each timestamp can be employed in each stage. Nevertheless, we focus on describing our T-DAD framework built upon the DAE and LSTM-DAE models since they are relatively tractable.
Let us describe the model training process. We first adopt a DAE model employing a feed-forward multi-layer neural network in which the desired output is the input itself, as stated in [27], [28]. Our DAE model contains a multi-layer perceptron (MLP) encoder and an MLP decoder, denoted by $E_{\text{MLP}}(\cdot)$ and $D_{\text{MLP}}(\cdot)$, respectively, where the multivariate operation cycle signal $x_t^{(o)} \in \mathbb{R}^{1 \times d_o}$ is used as input. Here, $d_o$ denotes the number of features in the operation cycle signal. By training this DAE, we are capable of measuring the reconstruction error at each timestamp $t \in T_o$, which is regarded as the anomaly score $a_t^{(o)}$.
In the meantime, to take advantage of temporal continuity, we adopt an LSTM-DAE model [13] along with the multivariate sensor signal $x_t^{(s)} \in \mathbb{R}^{1 \times d_s}$, where $d_s$ denotes the number of features in the sensor signal. This LSTM-DAE contains an encoder and a decoder with LSTM layers, denoted by $E_{\text{LSTM}}(\cdot)$ and $D_{\text{LSTM}}(\cdot)$, respectively.
We apply a sliding window technique with the window size $w > 0$ and the step size $\gamma > 0$ to the sensor signal $x_t^{(s)}$ so as to generate sequences $S_1, S_2, \cdots$, which are used as input of LSTM-DAE, where $\gamma < w$ is assumed and each sequence $S_j$ consists of $w$ consecutive sensor signals. Since a sensor signal $x_t^{(s)}$ at timestamp $t$ is contained in multiple sequences, a reconstruction error at timestamp $t$ is obtained from each sequence $S_j$ that contains it. We then take the maximum of those multiple reconstruction errors calculated at each timestamp $t$,² which is regarded as the anomaly score $a_t^{(s)}$.
Next, we turn to describing our two-stage signal-specific detection tasks. We decide the two thresholds $\tau_1$ and $\tau_2$ shown in (1) and (3), respectively, via exhaustive search in the sense of maximizing the performance (i.e., the $F_1$ score). To explain the rationale of our distinctive two-stage strategy, we give a motivating example as follows. Fig. 3 illustrates the visualization of anomaly scores $a_t^{(o)}$ and $a_t^{(s)}$. From Fig. 3(a), we make the following observations: 1) there is an anomalous point $t_1$ whose anomaly score $a_{t_1}^{(o)}$ is higher than the threshold $\tau_1$; 2) the maximum of anomaly scores $a_t^{(s)}$, $\forall t \in [t_1 - \eta, t_1 + \eta]$, is higher than another threshold $\tau_2$, which implies that the anomalous point $t_1$ will not be removed from the candidates; and 3) since the difference between the timestamp at which an actual abnormal event occurred and the candidate point $t_1$ is within the tolerance $\delta$, it is possible to successfully detect an actual abnormal event in a range-wise manner. On the other hand, from Fig. 3(b), one can see that 1) there is an anomalous point $t_2$ such that $a_{t_2}^{(o)} > \tau_1$; 2) the anomalous point $t_2$ is filtered out due to the fact that $\max_{t \in [t_2 - \eta, t_2 + \eta]} a_t^{(s)} < \tau_2$; and 3) no abnormal event is finally discovered. We summarize the overall procedure of our T-DAD framework in Algorithm 1.
From the above empirical findings, we may conclude that 1) the DAE model based on operation cycle signals, aimed at detecting highly likely anomalous points, cannot solely solve the anomaly detection problem and 2) the LSTM-DAE model based on sensor signals plays a supplementary but crucial role in eliminating very unlikely anomalous points, thus resulting in much higher accuracy.
² Other values such as the minimum and arithmetic mean can also be adopted. However, we have empirically verified that taking the maximum leads to the highest performance.
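The sliding-window step and the maximum-based aggregation of overlapping reconstruction errors can be illustrated as follows. This is a sketch under stated assumptions: the window start indices (0, γ, 2γ, ...) and the helper names `sliding_window_starts` and `aggregate_max` are hypothetical, and the per-window errors are supplied directly instead of being produced by a trained LSTM-DAE.

```python
def sliding_window_starts(n, w, gamma):
    """Start indices of windows of length w taken every gamma records."""
    return list(range(0, n - w + 1, gamma))


def aggregate_max(window_errors, starts, n, w):
    """Combine per-window reconstruction errors into one score per timestamp.

    window_errors[j][k] is the error of record starts[j] + k in window j.
    A timestamp covered by several windows receives the maximum error;
    a timestamp covered by no window keeps the value None.
    """
    score = [None] * n
    for start, errs in zip(starts, window_errors):
        for k in range(w):
            t = start + k
            if score[t] is None or errs[k] > score[t]:
                score[t] = errs[k]
    return score


# Toy example: 6 records, window size 4, step size 2 -> windows at 0 and 2.
starts = sliding_window_starts(6, 4, 2)
errors = [[1, 2, 3, 4], [5, 1, 0, 7]]
scores = aggregate_max(errors, starts, 6, 4)
```

Replacing the comparison with a minimum or a running mean would give the alternative aggregations mentioned in the footnote.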

V. EXPERIMENTAL EVALUATION
In this section, we first describe the set of labels in our manufacturing dataset and data preprocessing techniques. Then, we introduce several performance metrics and five benchmark methods for anomaly detection. We also present experimental settings. As our main contributions, we comprehensively perform experimental evaluations to validate the superiority of our T-DAD framework over single-stage benchmark methods.

A. Label Description
True breakdown events caused by the assembly line are unavailable in our collected dataset; thus, in our experiments, the set of alarm points is used as a ground truth instead. In fact, many industrial plants include a key component for not only efficient and safe operation but also control of plants, the so-called alarm system [36], [37], which alerts an operator about abnormal conditions or malfunctions of the process that might be anomaly candidates. Timely intervention by the operator alongside alarm logs in industrial datasets can be very useful in detecting possibly harmful anomalies (i.e., abnormal events that may escalate to catastrophic levels).
In our dataset, an alarm point indicates the timestamp at which an alarm event arises and is recorded in a binary format. That is, if the value at a timestamp in a sequential time series of binary signals is 1, then it signifies that an alarm has occurred.

B. Data Preprocessing
We first note that our dataset is collected from a total of 39 assembly processes. Among them, we focus primarily on the 2nd assembly process dataset for the analysis since it contains not only a number of features in each of the heterogeneous signals but also a non-negligible fraction of alarm events used as a ground truth. There are 32 alarm points in the 2nd assembly process dataset, which, however, account for only 0.06% of all the timestamps. Additionally, the effectiveness of the proposed T-DAD framework will be empirically validated using other assembly process datasets.
Next, we apply the following three data preprocessing techniques to our selected assembly process dataset for successful analysis: data imputation, feature extraction, and data normalization. Specifically, we perform data imputation in such a way that the missing values in each signal are imputed with those observed at the previous timestamp. After the imputation of missing data, we exclude the features whose values remain unchanged over time, since such static features carry no information for data analysis. Finally, min-max normalization is adopted for each signal so that all the values in each signal are transformed into [0, 1], which enables us to alleviate the impact of different scales among the signals.
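The three preprocessing steps can be sketched as below. This is an illustrative sketch only: the column-wise dictionary layout, the use of `None` for missing values, and the function names are assumptions, and leading missing values (which have no previous observation) are simply left as `None`.

```python
def forward_fill(column):
    """Impute each missing value (None) with the previous observed value."""
    filled, last = [], None
    for v in column:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def preprocess(columns):
    """columns: dict {feature name: list of values, None = missing}.

    Applies forward-fill imputation, drops features that never change,
    and min-max normalizes the remaining features into [0, 1].
    """
    out = {}
    for name, col in columns.items():
        col = forward_fill(col)
        observed = [v for v in col if v is not None]
        if not observed:
            continue  # nothing was ever recorded for this feature
        lo, hi = min(observed), max(observed)
        if hi == lo:
            continue  # static feature: excluded from the analysis
        out[name] = [None if v is None else (v - lo) / (hi - lo) for v in col]
    return out


raw = {
    "temperature": [None, 10.0, None, 20.0, 15.0],
    "mode_flag": [1, 1, None, 1, 1],
}
clean = preprocess(raw)
```

Here the constant `mode_flag` feature is removed, while `temperature` is imputed and rescaled.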

C. Experimental Setup
In this subsection, we first describe the settings of the neural network models used in our proposed T-DAD framework. Unless otherwise stated, as default settings, we run experiments using the DAE and LSTM-DAE models in Stages I and II, respectively, of T-DAD since this combination among the unsupervised learning models under consideration in Section V-E leads to the best performance (which will be specified in Section V-F).
In our experiments, both DAE models are trained using the Adam optimizer [38] with an initial learning rate of $10^{-3}$, where the batch size is set to 32. Loss functions based on the mean square error and the mean absolute error are used for the DAE and LSTM-DAE models, respectively. As for the DAE model, the MLP encoder-decoder has 2 hidden layers with the size of 6 each; the standard tanh and ReLU activation functions are applied alternately in every layer between input and output layers; an $\ell_1$-norm regularizer is adopted in the first layer of the encoder. As for the LSTM-DAE model, the encoder-decoder has 2 hidden LSTM layers with 64 hidden units each; dropout [39] is applied to each layer with probability 0.2; the window size $w$ and the step size $\gamma$ are set to 180 and 60, respectively.
Additionally, in our experiments, the tolerance δ is set to 600 (seconds) since such a tolerance level may be acceptable in the sense of triggering an alert without causing a long delay. The range setting parameter η used in Stage II of T-DAD is set to 14 (seconds) due to the fact that one operation cycle tends to be mostly from 13 to 15 seconds. Note that the parameters δ and η can be set to different values according to the operator's criteria as well as assembly process environments.
We split our time series dataset into training (records measured from 00:00:00 on December 2, 2019 to 00:00:00 on December 3, 2019) and test (remaining records) sets in chronological order. During training, the two DAE models are learned; and the signal-specific detection tasks with threshold selection are conducted on the test set.

D. Performance Metrics
We adopt a modification of the well-known Precision and Recall so that we evaluate the performance of range-wise alarm detection appropriately since operators generally do not care about point-wise metrics, as in [30], [40]. More specifically, in our study, the Precision means the fraction of finally detected anomalous points that find actual alarms within the tolerance $\delta$; and the Recall means the fraction of alarm points that are actually found by finally detected anomalous points within $\delta$. Furthermore, instead of the original $F_1$ score, which is the harmonic mean of Precision and Recall given a particular threshold, we adopt the best $F_1$ score in [15], [25], [30] that indicates the best possible performance of a model on the test set given the globally optimal threshold. In our T-DAD framework, two thresholds are chosen in the sense of maximizing the $F_1$ score via exhaustive search, which would still be valuable in understanding the fundamental limits of unsupervised alarm detection in our manufacturing dataset.
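The range-wise Precision, Recall, and $F_1$ score defined above can be computed as in the following sketch. The function name `range_wise_scores` and the convention of returning 0 when a denominator is empty are assumptions made for the example; timestamps are taken to be integers in seconds.

```python
def range_wise_scores(detected, alarms, delta):
    """Range-wise Precision, Recall, and F1 for a tolerance delta.

    detected: timestamps of finally detected anomalous points.
    alarms:   timestamps of ground-truth alarm points.
    """
    # A detected point counts as correct if some alarm lies within delta.
    correct = sum(any(abs(t - a) <= delta for a in alarms) for t in detected)
    # An alarm counts as found if some detected point lies within delta.
    found = sum(any(abs(t - a) <= delta for t in detected) for a in alarms)

    precision = correct / len(detected) if detected else 0.0
    recall = found / len(alarms) if alarms else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One of two detections is near an alarm; one of two alarms is found.
p, r, f1 = range_wise_scores([100, 5000], [500, 9000], 600)
```

Note that several detections near the same alarm each count toward Precision, whereas the alarm itself is counted once toward Recall.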

E. Benchmark Approaches
We describe five benchmark methods for anomaly detection: DAE, DAGMM, USAD, LSTM-DAE, and LSTM-VAE. They are used not only as backbone models in our two-stage framework but also as state-of-the-art single-stage benchmark approaches for comparison. Among them, USAD utilizes adversarially trained AEs inspired by GANs.

F. Experimental Results
In this subsection, our empirical study is designed to answer the following four key research questions.
• RQ1. How sensitive is the T-DAD framework to the selection of two thresholds $\tau_1$ and $\tau_2$?
• RQ2. How much does the performance of T-DAD improve the detection accuracy over single-stage benchmark approaches?
• RQ3. Which combination of backbone models in T-DAD yields the best performance?
• RQ4. How robust is the T-DAD framework to other assembly process datasets with fewer features?
To answer these questions, we carry out four comprehensive experiments as follows.
1) Sensitivity Analysis with Respect to Thresholds (RQ1): To analyze the parameter sensitivity, we show the performance of the proposed T-DAD framework in Section IV with respect to the $F_1$ score according to the values of thresholds $\tau_1$ and $\tau_2$. From Fig. 4, it is observed that the performance tends to be more sensitive to the values of $\tau_1$ since a small variation in $\tau_1$ leads to a non-negligible difference in the $F_1$ score. One can also see that the best $F_1$ score is found at $(\tau_1, \tau_2) = (2.55, 0.08)$, which will be used in the subsequent experiments unless otherwise specified.
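The exhaustive search over the two thresholds can be sketched as a plain grid search. The sketch below is an assumption-laden illustration: `evaluate` is a hypothetical callable that returns the $F_1$ score for a given threshold pair (for instance, by running the two-stage detection and the range-wise metrics), and the grids are chosen by the user.

```python
from itertools import product


def best_thresholds(evaluate, tau1_grid, tau2_grid):
    """Return (best F1, tau1, tau2) over all pairs in the two grids.

    evaluate(tau1, tau2) must return the F1 score for that pair.
    Ties are resolved in favor of the pair visited first.
    """
    best = (-1.0, None, None)
    for tau1, tau2 in product(tau1_grid, tau2_grid):
        f1 = evaluate(tau1, tau2)
        if f1 > best[0]:
            best = (f1, tau1, tau2)
    return best


# Toy score surface peaking at (2.5, 0.1); a stand-in for a real evaluation.
def toy_f1(tau1, tau2):
    return 1.0 - abs(tau1 - 2.5) - abs(tau2 - 0.1)


result = best_thresholds(toy_f1, [2.0, 2.5, 3.0], [0.05, 0.1])
```

Because the thresholds are tuned against the labels being evaluated, the resulting score is best read as an upper bound on achievable performance, in line with the "best $F_1$" convention used here.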
2) Comparison with Single-Stage Benchmark Approaches (RQ2): Due to the fact that, to the best of our knowledge, anomaly detection exploiting all the heterogeneous signals in our real-world manufacturing dataset has never been studied yet, there is no method that works directly under our setting. For this reason, we demonstrate the superiority of our proposed two-stage framework over single-stage benchmark methods, where the DAE and LSTM-DAE models are used in Stages I and II, respectively, of our T-DAD framework. We take into account the following four cases as feasible single-stage approaches.
We present the performance comparison between the T-DAD framework and five single-stage benchmark methods (including DAE, DAGMM, USAD, LSTM-DAE, and LSTM-VAE) for the above four cases in Table II with respect to the Precision, Recall, and best $F_1$ score (P, R, and $F_1$, respectively, for short in the table). We would like to provide the following insightful observations:
• T-DAD is much superior to all single-stage benchmark approaches in terms of the best $F_1$ score;
• Only using operation cycle signals (i.e., C1) among four cases exhibits the best performance regardless of performance metrics, which reveals that operation cycle signals (rather than sensor signals) play a significant role in offering satisfactory performance;
• While the Recall of T-DAD is identical to that of C1 with DAE, the Precision of T-DAD is remarkably enhanced compared to the second best performer by virtue of judicious utilization of sensor signals in Stage II of T-DAD. Namely, sensor signals serve as a secondary tool in further improving the best $F_1$ score. This implies that our strategy in Stage II indeed leads to the removal of very unlikely anomalous points (i.e., reduction in false positives);
• Performance comparison between C1 and (C3, C4) indicates that the naïve vector concatenation of two signals even degrades the detection accuracy;
• From the comparison among five benchmark models in C1, it is seen that the best performer in terms of the best $F_1$ score is DAE while the second and third best performers are USAD and DAGMM, respectively;
• The performance of the LSTM-DAE and LSTM-VAE models is quite poor, which implies that simply using models with LSTM layers originally designed for learning temporal dependence rather deteriorates the detection accuracy.
TABLE II. PERFORMANCE COMPARISON BETWEEN THE T-DAD FRAMEWORK AND SINGLE-STAGE BENCHMARK METHODS. Columns: single-stage methods (DAE, DAGMM, USAD, LSTM-DAE, and LSTM-VAE, each under cases C1 to C4) and the two-stage method (T-DAD).
3) Model-Agnostic Property (RQ3): To show the model-agnostic property of our T-DAD framework, we diversify underlying models in Stages I and II (i.e., the Stage I and II models) by leveraging five benchmark methods shown in Section V-E. In Stage I where operation cycle signals are used, we choose DAE, DAGMM, and USAD as model candidates since the other two (i.e., LSTM-DAE and LSTM-VAE) do not perform appropriately as stated in Section V-F2). On the other hand, in Stage II where sensor signals are used, we choose all five methods as model candidates.
The performance comparison among the model combinations (·, ·) in (Stage I, Stage II) of T-DAD is presented in Table III with respect to the Precision, Recall, and best $F_1$ score (P, R, and $F_1$ for short in the table). Our findings are as follows. First, experimental results clearly exhibit the model-agnostic property of our T-DAD framework while showing that the adoption of two stages indeed enhances the performance over its counterpart (i.e., single-stage benchmark methods). From Tables II and III, note that the performance of the single-stage approach showing the highest best $F_1$ score among four cases (i.e., C1) is identical to that of using DAE in Stage II of T-DAD regardless of models in Stage I. Second, we observe that RNN-based models such as LSTM-DAE and LSTM-VAE are more appropriate in Stage II than other benchmark methods due to the fact that sensor signals tend to show strong temporal dependence. From Table III, we may conclude that either (DAE, LSTM-DAE) or (DAE, LSTM-VAE) is the best combination of models in our T-DAD framework.

4) Robustness to Fringe Conditions (RQ4):
To validate the robustness of our T-DAD framework, we perform anomaly detection on two other assembly process datasets whose characteristics are appropriate for our experimental study in terms of not only a balanced number of features in each of the heterogeneous signals but also a non-negligible fraction of alarms. Specifically, we note that 32 out of 39 assembly processes contain only very few operation cycle or sensor signals, which means that our framework exploiting heterogeneous time series signals may be unnecessary (i.e., adoption of an existing single-stage detection model rather than multi-stage methods would be satisfactory in such cases). Among the remaining 7 assembly process datasets, those that have fewer than 30 alarms are excluded from our experiments in order to avoid unreliable verification caused by the very scarce labels. In consequence, we choose the following two assembly process datasets:
• The 5th assembly process dataset having 22 features in operation cycle signals and 7 features in sensor signals after feature extraction;
• The 12th assembly process dataset having 11 features in operation cycle signals and 6 features in sensor signals after feature extraction.
To demonstrate the validity of our T-DAD framework on other datasets, we present the performance comparison between T-DAD and three single-stage benchmark approaches (including DAE, DAGMM, and USAD) in Table IV, where the DAE and LSTM-DAE models are used in Stages I and II, respectively, of our T-DAD framework. We do not adopt the other two benchmark methods (i.e., LSTM-DAE and LSTM-VAE) since they do not perform appropriately, as in Section V-F2). The results in the table clearly follow similar trends to those in Table II. On both assembly process datasets, T-DAD shows superior performance over single-stage benchmark approaches, which demonstrates that our T-DAD framework is robust to other assembly process datasets.

VI. CONCLUDING REMARKS
In this paper, we introduced a novel two-stage framework, termed T-DAD, for automatically detecting anomalies using a real-world manufacturing dataset, collected from an assembly line of automotive engine valves, that poses many challenging problems. More precisely, we developed our T-DAD framework in such a way that signal-specific detection tasks are carried out in a data-driven manner after training, in an unsupervised manner, a model in Stage I with operation cycle signals and another model in Stage II with sensor signals. Through comprehensive experiments, we demonstrated that the proposed framework not only substantially outperforms various single-stage benchmark methods but also is robust to fringe situations. Moreover, we found that the gains over single-stage benchmark methods come from our two-stage detection strategy depending on the types of signals, which judiciously exploits sensor signals as a secondary tool in further improving the performance. Potential avenues of future research include how to solve a general setting where we undergo more heterogeneity along with more than three types of signals in highly sophisticated real-world manufacturing. Early detection of anomalies for preventative maintenance also remains for future work.