Out-of-Distribution Data Generation for Fault Detection and Diagnosis in Industrial Systems

The emergence of Industry 4.0 has transformed modern-day factories into high-tech industrial sites through rapid automation and increased access to real-time data. Deep learning approaches possessing superior capabilities for intelligent, data-driven fault diagnosis have become critical in ensuring process safety and reliability in these industrial sites. However, such applications trained exclusively on in-distribution process data face challenges in the wake of previously unseen out-of-distribution (OOD) data in the real world. This paper addresses the challenge of out-of-distribution data detection for deep learning-based fault diagnosis models by generating synthetic data to simulate real-world anomalies not present in the training set. We propose Manifold Guided Sampling (MGS), a data-driven method for generating synthetic OOD samples from the in-distribution data-supporting manifold estimated through a deep generative model. Synthetic data from MGS enhances the model capacity for prediction uncertainty quantification, resulting in safe and reliable models for real-world industrial process monitoring. Furthermore, the MGS algorithm maintains the in-distribution data feature space as a reference point during data generation to ensure the resulting synthetic OOD data is realistic. We analyze the effectiveness of MGS through experiments conducted on the steel plates faults dataset and demonstrate that augmenting training data with synthetic data from MGS enhances the model performance in OOD detection tasks and provides robustness against dataset distributional shifts. The findings underscore the effectiveness of utilizing synthetic MGS-generated OOD data in scenarios where real-world OOD data is limited, enabling better generalization and more reliable fault detection in practical applications.


I. INTRODUCTION
Industry 4.0 (I4.0) has revolutionized modern-day factories through rapid automation and increased access to real-time data from complex industrial processes [1], [2], [3], [4], [5]. Central to the proliferation of industrial process datasets are multitudes of integrated sensors that gather data, resulting in large-scale, high-dimensional, and nonlinear historical process data. The compiled datasets are the in-distribution (ID) data representing some underlying industrial process.
The associate editor coordinating the review of this manuscript and approving it for publication was Guillermo Valencia-Palomo.
Recently, data-driven fault diagnosis (FD) models trained on large-scale industrial process datasets using deep learning (DL) techniques have demonstrated the ability to deliver actionable insights required to cope with the increasing demands around safety, efficiency, and production quality [6], [7], [8]. For DL-based FD models, the underlying assumption is that the training and testing data are independent and identically distributed.
However, during deployments in the real world, gradual changes over time result in data distributional shifts and the emergence of out-of-distribution (OOD) data [9], [10], [11], [12], [13]. In industrial applications, exposure of data-collecting sensors to potentially harsh and variable environmental conditions, physical shock or damage, excess electrical noise, imprecise pre- and post-deployment sensor calibration, sensor drifts, process parameter variations, and changes in working conditions are some of the factors commonly associated with data distributional shifts and the emergence of OOD data [14], [15], [16], [17], [18], [19].
In recent years, DL algorithms have achieved state-of-the-art (SOTA) performance in a broad range of tasks [20], [21], [22], [23], leading to the integration of DL into safety-critical tasks such as biometric identification [24], medical diagnosis [25], and fully autonomous driving [26], [27], [28]. Similar adoptions in manufacturing are increasingly common, such as the uptake of deep neural networks (DNNs) in FD as the preferred process monitoring approach for complex industrial process data [8], [29]. Nonetheless, current SOTA DL models are known to generate inaccurate and overconfident predictions on OOD data, further degrading the performance of DL-based FD systems [30], [31], [32], [33], [34]. Improving capacity for OOD detection is crucial in safeguarding DL-based FD models, especially for safety-related systems where the consequences of wrong predictions can be catastrophic, leading to the total shutdown of entire operations and, in other cases, injuries or loss of life.
Data-driven DL-based FD models require a comprehensive dataset with broad coverage of operating conditions and fault scenarios to model the accurate system behavior during training. Therefore, data quality and availability directly affect the performance of DL-based FD models. Insufficient training data, also known as data scarcity, affects DL-based FD models by restricting the exploration of a comprehensive data feature space, thus limiting the model's ability to learn the most informative and discriminative features required for OOD detection tasks [35]. Data scarcity also relates to the long-tailed distribution problem or imbalanced dataset, common in safety-critical industrial systems where actual fault scenarios are rare and hard to simulate. The long-tailed distribution problem impacts the generalization of DL-based FD models as they tend to perform well on the dominant classes, unlike the less frequent classes [36], [37]. DL-based FD models with poor generalization tend to perform poorly in OOD detection tasks. The data scarcity problem underscores the need for additional training data to improve DL-based FD model generalization and capacity for OOD detection. Furthermore, despite the effectiveness of approaches such as Reverse KL-divergence Prior Networks (RKL-PNs) [34], [38], Aleatoric Epistemic uncertainty DNNs (AE-DNNs) [39], and Out-of-DIstribution detector for Neural networks (ODIN) [40] in OOD detection, all these approaches require access to realistic OOD data during training.
This paper addresses the challenges of training data quality and availability for data-driven DL-based FD systems by generating synthetic OOD data that simulate real-world anomalies not present in the training set. We aim to bridge the We propose MGS, a data-driven method for generating synthetic OOD data based on a deep generative model. Implementation of MGS begins by training a variational autoencoder (VAE) to obtain a learned ID data-supporting manifold of the large-scale, high-dimensional, nonlinear historical process data. From the manifold-related hypotheses on high-dimensional data [41], [42], [43], [44], we observe that samples existing in (i) the classwise intersecting regions on the manifold, (ii) the classwise low-probability regions on the manifold, and/or (iii) regions located outside the learned ID manifold all represent regions from which we can obtain OOD latent variables (see Fig. 1). Therefore, decoding the OOD latent variables obtained from sampling the regions of interest on the manifold generates a combined set of OOD historical process data. MGS facilitates the generation of realistic OOD samples that augment the original ID training dataset. In particular, the availability of synthetic OOD data enables us to train DL-based FD applications using approaches similar to RKL-PNs and AE-DNNs. Throughout the training process, a dual loss function merges two objective functions using a convex combination that optimizes the ID samples on the one hand and OOD samples on the other. The resulting DL-based FD model offers improved capacity to handle complex real-world industrial environments through enhanced performance on tasks such as OOD detection and uncertainty estimation.
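The convex combination mentioned above can be illustrated with a minimal sketch. The weight name `lam` and the function name `dual_loss` are hypothetical; the two loss values stand in for whatever ID and OOD objectives a given training approach defines.

```python
def dual_loss(loss_id: float, loss_ood: float, lam: float = 0.5) -> float:
    """Convex combination of an ID objective and an OOD objective.

    lam is a hypothetical mixing weight in [0, 1]; lam = 1 uses only the
    ID term and lam = 0 uses only the OOD term.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_id + (1.0 - lam) * loss_ood


# Example: weight the ID term by 0.25 and the OOD term by 0.75.
total = dual_loss(loss_id=2.0, loss_ood=4.0, lam=0.25)
```

Because the weights sum to one, the combined value always lies between the two individual losses.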
Based on our approach, we summarise our main contributions as follows:
• We propose MGS, a data-driven method that generates synthetic OOD samples by leveraging an ID data-supporting manifold estimated through a deep generative model.
135062 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
• We improve the quality of synthetic OOD data by learning disentangled OOD latent variables, performing targeted sampling from extremely low probability regions on the manifold, and expanding the range of angle choices for OOD latent variables outside the manifold.
• We demonstrate that incorporating synthetic OOD data from MGS improves the capacity for FD models to detect OOD input data and estimate predictive uncertainty, resulting in reliable FD models for real-world industrial process monitoring.

II. RELATED WORK
A. FAULT DIAGNOSIS METHODS
FD methods fall under four general categories: model-based, knowledge-based, data-driven, and hybrid approaches [45], [46]. Data-driven methods for FD have gained significant popularity and effectiveness in dynamic modern industrial environments, especially with the advancement of machine learning (ML) and artificial intelligence (AI) technologies. This work focuses mainly on data-driven DL-based FD methods that leverage large-scale industrial process datasets to learn patterns and relationships directly from data, making them more adaptable and proficient in detecting and diagnosing faults or anomalies. He and He [47] introduce a method for diagnosing bearing faults using DL. The approach preprocesses sensor signals through a short-time Fourier transform (STFT) and, to detect bearing faults, constructs an optimized deep learning structure called a large memory storage retrieval (LAMSTAR) neural network using the resulting spectrum matrix. The LAMSTAR network uses Self-Organizing Map (SOM) models to process the spectrum matrix and identify subpatterns in input data for bearing fault diagnosis. Results suggest that the LAMSTAR network-based method performs better at 'normal' and relatively low input shaft speeds. Li et al. [48] present an approach for diagnosing motor bearing faults using neural networks and time/frequency-domain vibration analysis. Vibration simulation enables the design of various motor rolling bearing FD strategies. Results show that neural networks can interpret motor-bearing vibration signatures effectively. Jiang et al. [49] propose an improved deep recurrent neural network (DRNN) method, alleviating the need for manual feature extraction and selection for intelligent fault diagnosis. DRNN uses frequency spectrum sequences as inputs to reduce input size and improve robustness, and adopts an adaptive learning rate to enhance training performance. The DL-based FD models in [47], [48], and [49] are application-specific, relying on specialized feature extraction techniques.
Further, Wen et al. [50] propose a new Convolutional Neural Network (CNN) based on LeNet-5 for fault diagnosis. The proposed method converts signals into two-dimensional (2-D) images, allowing for better feature extraction and eliminating the effect of handcrafted features. The technique demonstrates improved prediction accuracy on popular datasets, including the motor bearing, self-priming centrifugal pump, and axial piston hydraulic pump datasets. Xia et al. [51] introduce a CNN-based method that utilizes sensor fusion to diagnose rotating machinery faults. The approach automatically extracts representative features through feature learning, eliminating the need for manual feature selection. The method applies sensor fusion at the data level for enhanced accuracy and reliability for various machinery types and faults with limited prior knowledge. We observe that FD models that depend on robust feature extraction can be application-specific, requiring explicit knowledge of relations between process variables. Restricting OOD data during training to the application domain for application-specific models improves OOD detection results.
Xu et al. [52] propose a method for fault diagnosis using a deep transfer convolutional neural network framework. Time-domain signal data is transformed into images and used as input for a CNN-based LeNet-5 to automatically extract features and classify faults. Several offline CNNs are pretrained to improve real-time performance, and their shallow layers are transferred directly to the online CNN, significantly improving the real-time performance while achieving the desired diagnostic accuracy within a limited training time. Lu et al. [53] propose a DL-based FD model named DAFD to address cross-domain learning problems in FD. DAFD models trained in a particular source domain can be adopted in a different but related target domain. While [52] and [53] both utilize transfer learning, with the latter emphasizing domain adaptation, there is still a need for a dedicated strategy for dealing with OOD data. Our method seeks to address, among others, the problem of OOD detection, where synthetic samples emerge from the original target domain.
Qiao et al. [54] propose an adaptable, time-frequency dual-input model based on a CNN and long short-term memory (LSTM) network (TFWConvLSTM) to address the problem of bearing fault diagnosis under variable loads and different noise interferences. TFWConvLSTM utilizes a time-frequency dual-input structure to enhance feature extraction and adopts a CNN-LSTM structure to capture spatiotemporal characteristics. Additionally, the LSTM gate structure is employed to make full use of temporal features and improve noise immunity. Zhao et al. [55] propose an end-to-end Batch-Normalization-Based LSTM (BN-based LSTM) neural network for fault diagnosis. Unlike traditional methods, BN-based LSTM trains the representation of raw input data and the classifier simultaneously, utilizing the dynamic information of process data. In particular, BN-based LSTM implements batch normalization to reduce the internal covariate shift and accelerate the convergence of the LSTM network. Zhang et al. [56] propose a novel method based on gated recurrent unit neural networks for fault diagnosis of rotating machinery (FDGRU). The approach initially converts the one-dimensional time-series vibration signals into two-dimensional images, followed by applying the temporal information of the time series to a Gated Recurrent Unit (GRU) that learns representative features from the constructed images. A multilayer perceptron (MLP) is finally employed to implement fault recognition. Zhao et al. [57] develop a new DL method, deep residual shrinkage networks (DRSNs), for FD tasks with highly noisy vibration signals. Jia et al. [58] present a DNN-based intelligent method for diagnosing the faults of rotating machinery. The proposed DNN models trained on massive datasets are less dependent on human labor or prior knowledge about signal processing techniques and diagnostic expertise. We observe that the DL-based FD implementations mentioned above are deep networks with Softmax layers as the network output, resulting in overconfident model predictions for both ID and OOD samples.

B. OOD DATA GENERATION METHODS
OOD data generation is an important topic in ML as it is essential for creating safe and reliable models ready for deployment in the real world. In practice, the knowledge of OOD data distribution, a priori (during training), can reduce the tendency of DL-based FD models to make unsafe, false predictions with high confidence [59]. Generating high-quality and realistic data representative of the target distribution is one of the main challenges in OOD data generation. Currently, the main approaches for OOD data generation include data augmentation and deep generative modeling. This work focuses on synthetic data generation through deep generative modeling.

1) DATA AUGMENTATION
Inoue in [60] proposes SamplePairing, a technique for data augmentation that synthesizes a new image sample by overlapping a source image with another randomly picked from the training data through a process of pixel averaging. Zhang et al. [61] propose a simple data-agnostic augmentation routine known as mixup that constructs virtual training samples generated as the linear interpolation of two random samples from the training set and their labels. The mixup approach regularizes the neural network to favor simple linear behavior in between training examples. Further, Tokozume et al. propose Between-Class learning (BC learning) [62], an approach geared toward data augmentation for sound recognition networks. BC learning generates new data samples between class sounds by mixing two sounds belonging to different classes with a random ratio. Krizhevsky et al. [63], in their AlexNet implementation, employ variations of data augmentation techniques such as random cropping, flipping of extracted patches, and altering the intensity of RGB channels. Adaptations of the data augmentation techniques in AlexNet feature in subsequent submissions to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [64]. Nonetheless, most data augmentation techniques by design focus on diversifying ID data to enhance the training set and prevent problems with overfitting and poor generalization.
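As a brief illustration of the mixup routine described above, the following sketch interpolates two samples and their one-hot labels. The function name `mixup` and the value `alpha = 0.4` are illustrative choices; drawing the mixing coefficient from a Beta(α, α) distribution follows the usual description of mixup.

```python
import numpy as np


def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two samples and their one-hot labels."""
    x = lam * np.asarray(x1, dtype=float) + (1.0 - lam) * np.asarray(x2, dtype=float)
    y = lam * np.asarray(y1, dtype=float) + (1.0 - lam) * np.asarray(y2, dtype=float)
    return x, y


rng = np.random.default_rng(0)
alpha = 0.4                      # illustrative hyperparameter
lam = rng.beta(alpha, alpha)     # mixing coefficient in (0, 1)

# Two toy samples from different classes.
x_mix, y_mix = mixup([0.0, 2.0], [1, 0], [2.0, 0.0], [0, 1], lam)
```

The mixed sample stays on the line segment between the two ID samples, which is why such augmentation diversifies ID data rather than producing OOD data.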

2) DEEP GENERATIVE MODELING
Lee et al. in [65] propose a generative adversarial network (GAN) [66] with a modified objective function, allowing the GAN to generate OOD samples in the 'boundary' low-density regions of training distributions. During training, optimization happens jointly for two models where a confident classifier improves the proposed GAN and vice versa as training proceeds. However, the confident classifier is pre-trained on ID and OOD samples, creating an unrealistic scenario where the model has prior knowledge of OOD samples. Vernekar et al. in [67] demonstrate the inability of GANs to generate samples for a simple 3D dataset, suggesting the method will experience difficulties operating in higher dimensions. Further, Vernekar et al. [67] propose a method for generating two separate types of OOD samples from latent encodings derived from the learned manifold of a Conditional VAE (CVAE) [68]. The approach faces scaling-up challenges due to the high computing cost of calculating the Jacobian over the entire dataset and capacity limitations related to the Gaussian distribution in the VAE. Motivated by the idea of relaxing the classic assumption of Gaussian-distributed data, Möller et al. in [69] present Soft Brownian Offset (SBO) sampling, a method to create synthetic OOD samples at the tails of the data distribution by applying transformations on the latent representations of deep generative models such as VAEs. SBO is also applicable to generic low-dimensional feature representations of the ID data. Nonetheless, the approach is limited to OOD sampling from only the low-density regions of the low-dimensional learned manifold.
Our work builds upon the concepts in [67], where we propose using Umbrella Sampling (US) [70] to access latent variables of the OOD data located in the low-density regions of the learned manifold by sampling extremely low-probability areas of the posterior distribution.Further, we utilize a class-based Jacobian, calculated from a limited sample size, resulting in efficiencies in computing cost.

III. PROPOSED MANIFOLD GUIDED SAMPLING METHOD
In this section, we present our proposed method. First, we outline the concepts of the manifold hypothesis that form a basis for our proposed approach. Second, we provide a comprehensive overview of our methodology, including the sampling process and implementation for types 1A, 1B, and 2 OOD data. Finally, we outline the implementation of the MGS algorithm, along with the proposed pseudo-code.

A. MANIFOLDS AND HIGH-DIMENSIONAL DATA
One of the prominent characteristics of modern-day industrial datasets is high-dimensional data, typically compiled at large scale. For high-dimensional data, the number of features is usually large and can easily exceed the number of observations in a dataset. Due to the challenges associated with learning in higher dimensions [71], it is essential to identify low-dimensional subspaces of the data space containing meaningful information. A collection of methodologies for analyzing high-dimensional data based on geometrical and topological approaches support the following hypotheses:
• The manifold hypothesis states that real-world data presented in high-dimensional input space are more likely to concentrate on a much lower-dimensional sub-manifold embedded in the high-dimensional input space [41], [42], [43], [44].
• The manifold hypothesis for classification states that for multi-class data, different classes are likely to concentrate on different disjoint sub-manifolds separated by low-density regions in the input space [41], [72].
The manifold-related hypotheses are essential to many dimension reduction algorithms and other manifold-inspired algorithms [72].

B. METHODOLOGY
Given a high-dimensional ID dataset, we begin by obtaining a transformation into a lower-dimensional data space, retaining the meaningful properties present in the original data. A deep generative model such as the VAE is suitable for this task as it can model a relatively smooth latent space. In practice, the VAE generates a reconstruction of the input $x$, given the latent variable $z$, through a decoder $p_\theta(x \mid z)$ and the encoder $q_\phi(z \mid x)$, representing the variational approximate posterior. From the ID high-dimensional input space, the VAE models a lower-dimensional manifold embedding where the high-density regions correspond to dense areas of the input space. For our implementation, we use the Total Correlation Variational Autoencoder (β-TCVAE) [73], a VAE model that learns disentangled latent representations from the input data. In particular, through β-TCVAEs, we obtain a more interpretable generative model capable of understanding the role of each latent dimension in the data generation process.
Vernekar et al. in [74] propose two categories of OOD samples: Type 1 (1A and 1B): OOD samples on the data manifold and Type 2: OOD samples outside the data manifold.In this work, we adopt similar OOD sample categorizations with adjustments towards feature robustness under the influence of outliers and resource management for improved computational costs.
For Type 1: OOD samples on the data manifold, the low-density regions in the input space corresponding to ID data boundary regions on the manifold represent areas consisting of OOD data. Following the manifold hypotheses (Sec. III-A), we observe that boundary regions on the manifold have densities that gradually decrease with increasing distance from the dense areas.
We begin by obtaining $q_\phi(z \mid x)$ from our trained β-TCVAE model, a uni-modal multivariate Gaussian with a diagonal covariance structure. $q_\phi(z \mid x)$ represents the variational approximate posterior distribution from which we seek to retrieve the outliers representing Type 1A, classwise samples on the intersecting regions of the manifold, and Type 1B, classwise samples on low-probability regions of the manifold (see Fig. 1: Type 1A, 1B OOD). Sampling from low-probability regions of a given classwise cluster distribution $Z^{ID}_k \sim q_\phi(z_k \mid x_k)$ retrieves the local class $k$ outliers existing around the respective cluster region on the manifold.
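To make the notion of low-probability regions of a diagonal Gaussian posterior concrete, the sketch below draws samples from such a posterior and ranks them by log-density. The mean, log-variance, and the 5% cutoff are illustrative assumptions rather than values from the paper, and the encoder outputs are replaced by fixed toy parameters.

```python
import numpy as np


def sample_diag_gaussian(mu, log_var, n_samples, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (diagonal covariance)."""
    mu = np.asarray(mu, dtype=float)
    std = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal((n_samples, mu.size))
    return mu + std * eps


def log_density(z, mu, log_var):
    """Log-density of a diagonal Gaussian, evaluated row by row."""
    z = np.atleast_2d(z)
    var = np.exp(log_var)
    return -0.5 * np.sum(np.log(2.0 * np.pi) + log_var + (z - mu) ** 2 / var, axis=1)


rng = np.random.default_rng(0)
mu_k = np.zeros(4)                     # toy class-k posterior mean (4 latent dims)
log_var_k = np.log(np.full(4, 0.25))   # toy class-k log-variance

z_k = sample_diag_gaussian(mu_k, log_var_k, 1000, rng)
scores = log_density(z_k, mu_k, log_var_k)

# Treat the 5% least likely samples as local outliers of the class-k cluster.
outliers = z_k[scores <= np.quantile(scores, 0.05)]
```

Plain sampling like this rarely reaches the far tails, which motivates a dedicated low-probability sampler.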
To obtain samples in the low-density regions of the learned ID data-supporting manifold, we apply US, an algorithm that performs sampling on extremely low-probability regions of a posterior distribution, accurately down to approximately $15\sigma$ on the credible region. The US algorithm applies temperature stratification, a technique that flattens the distribution by defining various temperatures and biasing window functions, enabling the exploration of wider parameter ranges and low-probability areas of the posterior distribution. Exponential spacing of temperatures ensures equal exchange probabilities between windows.
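The flattening effect of temperature can be sketched for the Gaussian case, where raising the density to the power 1/T is equivalent to scaling the covariance by T. This is only an illustration of tempering with an exponentially spaced ladder, not an implementation of the US algorithm (which additionally uses biasing windows and reweighting); the function names and the values `t_max = 16` and `n_levels = 5` are hypothetical.

```python
import numpy as np


def temperature_ladder(t_max, n_levels):
    """Exponentially (geometrically) spaced temperatures from 1 to t_max."""
    return np.geomspace(1.0, t_max, n_levels)


def sample_tempered_gaussian(mu, cov, temperature, n, rng):
    """Sample from p(z)^(1/T): for a Gaussian this is N(mu, T * cov)."""
    return rng.multivariate_normal(mu, temperature * np.asarray(cov), size=n)


rng = np.random.default_rng(0)
mu = np.zeros(2)
cov = np.eye(2)

temps = temperature_ladder(t_max=16.0, n_levels=5)   # [1, 2, 4, 8, 16]
cold = sample_tempered_gaussian(mu, cov, temps[0], 2000, rng)
hot = sample_tempered_gaussian(mu, cov, temps[-1], 2000, rng)

# Fraction of samples farther than 4 standard deviations from the mean.
far_cold = np.mean(np.linalg.norm(cold, axis=1) > 4.0)
far_hot = np.mean(np.linalg.norm(hot, axis=1) > 4.0)
```

At the highest temperature a large share of samples falls beyond 4σ, whereas the untempered distribution almost never reaches that region.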
From class $k$ encoder mappings $Z^{ID}_k$, we can estimate the mean $\mu^{ID}_k$ and covariance $\Sigma^{ID}_k$ that define the structure of the aggregate class $k$ posterior distribution. In our application, we use the minimum covariance determinant (MCD) method [75], [76], [77], a robust estimator of the mean and covariance matrix aimed at minimizing the influence of outliers. The US algorithm then enables us to sample the low-probability regions of the class $k$ posterior distribution, a multivariate Gaussian with mean $\mu^{ID}_k$ and covariance $\Sigma^{ID}_k$, to obtain our outlier $z^{OOD}_k$. Additionally, we can increase the diversity of generated OOD samples through targeted adjustments of the disentangled latent variables. To this end, we introduce a latent noise vector variable $\epsilon$ with elements in the range $[-2.5, 2.5]$. We apply random transformations to individual elements of latent vector $z$ through vector addition with $\epsilon$ and decode to obtain a more diverse set of OOD data.
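The class-wise sampling and perturbation step can be sketched as follows, under two loudly stated simplifications: the plain sample mean and covariance stand in for the MCD estimator, and a simple rejection rule on the Mahalanobis distance stands in for the US algorithm. The names `min_sigma` and `inflate` are hypothetical parameters introduced only for this illustration.

```python
import numpy as np


def mahalanobis(z, mu, cov):
    """Row-wise Mahalanobis distance of z from N(mu, cov)."""
    diff = z - mu
    inv = np.linalg.inv(cov)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv, diff))


def sample_low_probability(z_id, min_sigma, n_out, rng, inflate=4.0):
    """Return n_out latent points at least min_sigma away from the class cluster.

    Stand-ins: sample mean/covariance instead of MCD, rejection sampling
    from an inflated Gaussian instead of umbrella sampling.
    """
    mu = z_id.mean(axis=0)
    cov = np.cov(z_id, rowvar=False)
    kept = []
    while sum(len(k) for k in kept) < n_out:
        cand = rng.multivariate_normal(mu, inflate**2 * cov, size=4 * n_out)
        kept.append(cand[mahalanobis(cand, mu, cov) >= min_sigma])
    return np.concatenate(kept)[:n_out], mu, cov


def perturb(z, rng, low=-2.5, high=2.5):
    """Add a noise vector with elements drawn uniformly from [low, high]."""
    return z + rng.uniform(low, high, size=z.shape)


rng = np.random.default_rng(0)
z_id_k = rng.standard_normal((200, 4))        # toy class-k latent codes
z_ood_k, mu_k, cov_k = sample_low_probability(z_id_k, 4.0, 50, rng)
z_ood_k_noisy = perturb(z_ood_k, rng)
```

In practice a robust estimator such as scikit-learn's `MinCovDet` could replace the sample statistics; it is left out here to keep the sketch dependent on NumPy only.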
For Type 2: OOD samples outside the data manifold, we observe that samples inhabiting regions of relative proximity yet isolated from the ID data-supporting manifold represent an additional category of OOD data. To this end, we adopt the method proposed by Vernekar et al. in [74], where a sample existing in a direction perpendicular to the tangent space of the sub-manifold at a point $x^{ID}$ corresponds to an OOD sample $x^{\perp}_{OOD}$ that falls outside the manifold (see Fig. 1: Type 2). In particular, consider a VAE that models a lower-dimensional data manifold from the high-dimensional ID data $X^{ID}$ through corresponding latent variables $Z^{ID}$. The encoder $q_\phi : X^{ID} \to Z^{ID}$ and decoder $p_\theta : Z^{ID} \to \hat{X}^{ID}$ functions provide a mapping through which we can recreate input data in the form $\hat{x}^{ID} = p_\theta(q_\phi(x^{ID}))$. As a result, for a given point $x^{ID}$, the tangent space of the manifold is the column space of the Jacobian matrix:

$$J(x^{ID}) = \left. \frac{\partial p_\theta(z)}{\partial z} \right|_{z = q_\phi(x^{ID})} \qquad (1)$$

Notably, the basis vectors of the left null-space of the Jacobian, denoted $\mathrm{null}(J^\top(x^{ID}))$, span the space perpendicular to the sub-manifold at the point $x^{ID}$. The perpendicular vector $v^\perp$ is thus obtained by randomly sampling the set of unit vectors $V^\perp \sim \mathrm{null}(J^\top(x^{ID}))$. However, a primary concern is the computational cost of the Jacobian matrix over the entire dataset. In our implementation, we make the following adjustments: (i) From the manifold hypothesis for classification, we observe that different classes are likely to concentrate along different sub-manifolds. Therefore, we obtain a per-class average Jacobian, including the corresponding left null space, from which we derive the class-specific perpendicular vectors $v^\perp_k$. (ii) Based on the quality of the average Jacobians, we can reduce the number of samples required to achieve reliable results to a limited number of batches $b \in [10, 50]$. Further, we replace the perturbation (vector addition transformation) with a $d \times d$ rotation matrix, where $d$ is the dimension of $\hat{x}^{ID}_k \in \mathbb{R}^d$. We then generate $x^{\perp}_{OOD_k}$ by rotating $\hat{x}^{ID}_k$ by an angle $\gamma$, uniformly sampled from within the range [45 The rotation matrix provides us with a broader spectrum of choices where the greater angle sizes $\gamma$ yield $x^{\perp}_{OOD_k}$ samples more similar to $v^\perp_k$.
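A minimal sketch of the Type 2 construction is given below, assuming a toy decoder in place of the trained β-TCVAE decoder and a finite-difference Jacobian. The rotation is taken in the plane spanned by the reconstruction and the perpendicular vector, which is one plausible reading of the $d \times d$ rotation described above; the angle of 60 degrees is an arbitrary illustrative value, and all function names are hypothetical.

```python
import numpy as np


def numerical_jacobian(decoder, z, h=1e-5):
    """Central-difference Jacobian of the decoder at z, shape (d, m)."""
    z = np.asarray(z, dtype=float)
    cols = []
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = h
        cols.append((decoder(z + step) - decoder(z - step)) / (2.0 * h))
    return np.stack(cols, axis=1)


def left_null_space(jac, tol=1e-8):
    """Orthonormal basis of null(J^T): directions orthogonal to col(J)."""
    u, s, _ = np.linalg.svd(jac, full_matrices=True)
    rank = int(np.sum(s > tol))
    return u[:, rank:]


def sample_perpendicular(jac, rng):
    """Random unit vector in the left null space of the Jacobian."""
    basis = left_null_space(jac)
    v = basis @ rng.standard_normal(basis.shape[1])
    return v / np.linalg.norm(v)


def rotate_towards(x_hat, v_perp, gamma):
    """Rotate x_hat by gamma (radians) in the plane spanned by x_hat and v_perp."""
    u = x_hat / np.linalg.norm(x_hat)
    w = v_perp - (v_perp @ u) * u          # part of v_perp orthogonal to x_hat
    w = w / np.linalg.norm(w)
    rot = (np.eye(u.size)
           + np.sin(gamma) * (np.outer(w, u) - np.outer(u, w))
           + (np.cos(gamma) - 1.0) * (np.outer(u, u) + np.outer(w, w)))
    return rot @ x_hat, rot


rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))
decoder = lambda z: np.tanh(W @ z)         # toy decoder: 2 latent dims -> 5 features

z_batch = rng.standard_normal((10, 2))     # small class-k batch (b = 10)
jac_avg = np.mean([numerical_jacobian(decoder, z) for z in z_batch], axis=0)
v_perp = sample_perpendicular(jac_avg, rng)

x_hat = decoder(z_batch[0])
x_ood, rot = rotate_towards(x_hat, v_perp, np.deg2rad(60.0))
```

Averaging the Jacobian over a small batch mirrors adjustment (ii) above: one SVD per class replaces one per sample. The sketch does not handle the degenerate case where the reconstruction is parallel to the perpendicular vector.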

1) MGS LEARNING ALGORITHM
We outline the proposed algorithmic approach for MGS in Algorithm 1, describing the procedure to obtain synthetic OOD data from a deep generative model. Inputs for Algorithm 1 include (i) a minimum acceptable threshold distance $d^*$ from the ID data, (ii) a perturbation value $\epsilon$ for adjusting the latent variables, and (iii) a batch size $b$ for the decoder Jacobian matrix. The MGS algorithm uses the following four main steps in its implementation. In the first step, we train a β-TCVAE, obtaining the ID data-supporting manifold with disentangled representations of the latent variables, an encoder $q_\phi(z_k \mid x_k)$, and a decoder $p_\theta(x_k \mid z_k)$.
In the second step, we use the MCD approach to obtain the mean $\mu^{ID}_k$ and covariance $\Sigma^{ID}_k$ from the classwise posterior distribution of the latent variables $z^{ID}_k$. Using the US approach, we perform targeted sampling on the low-probability regions of the class $k$ posterior distribution to obtain $z^{OOD_1}_k$ for classwise types 1A, 1B latent variables. For type 2, we obtain the classwise tangent space of the manifold $J(x^{ID}_k)$ averaged over the batch of size $b$ as illustrated in equation 1. The left null-space of the Jacobian, denoted $\mathrm{null}(J^\top(x^{ID}_k))$, gives the classwise latent variable $z^{OOD_2}_k$ for Type 2 OOD data.
In the third step, we compile $z^{OOD_1}_k$ and $z^{OOD_2}_k$ into unified collections of classwise latent variables $z^{OOD}_k$. We then perturb the classwise latent variables $z^{OOD}_k$ in the form $(z^{OOD}_k \times \epsilon)$ to obtain $\tilde{z}^{OOD}_k$.
Finally, in step four, decoding $\tilde{z}^{OOD}_k$ using the decoder $p_\theta(x_k \mid z^{OOD}_k)$ generates the preliminary classwise OOD dataset $\tilde{X}^{OOD}_k$. From the samples that satisfy the minimum acceptable distance $d^*$ from the original input dataset, $d^* \le d(\tilde{x}^{OOD}_k, X_k)$, we uniformly select classwise samples $\tilde{x}^{OOD}_k$ and compile the final OOD dataset $X^{OOD}_k$.
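The distance-based selection in step four might look like the following sketch. The distance $d(\cdot, X_k)$ is assumed here to be the Euclidean distance to the nearest ID sample, which the text does not specify, and the function names are hypothetical.

```python
import numpy as np


def min_distance_to_id(x_ood, x_id):
    """Euclidean distance from each candidate to its nearest ID sample."""
    diff = x_ood[:, None, :] - x_id[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def select_ood(x_ood, x_id, d_star, n_select, rng):
    """Keep candidates with d* <= d(x, X_id), then select uniformly at random."""
    kept = x_ood[min_distance_to_id(x_ood, x_id) >= d_star]
    idx = rng.choice(len(kept), size=min(n_select, len(kept)), replace=False)
    return kept[idx]


rng = np.random.default_rng(0)
x_id = np.array([[0.0, 0.0], [1.0, 0.0]])                  # toy ID samples
x_cand = np.array([[0.0, 0.1], [0.0, 2.0], [3.0, 0.0]])    # toy decoded candidates

dists = min_distance_to_id(x_cand, x_id)
x_final = select_ood(x_cand, x_id, d_star=0.5, n_select=2, rng=rng)
```

The first candidate lies 0.1 from an ID point and is discarded; the other two satisfy the threshold and are retained.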

A. CASE STUDY
We evaluate the effectiveness of our proposed synthetic OOD data generation method, MGS, using the Steel Plates Faults dataset [78]. We compare MGS against other synthetic OOD data generation methods: OOD Detection and Generation using Soft Brownian Offset Sampling (SBO) [69] and OOD Detection in Classifiers via Generation (CGen) [74], as baselines. Finally, we augment the raw ID data with synthetic OOD data and train a DL-based FD application on the real-world industrial task of defects classification under OOD data uncertainty. Fig. 2 illustrates a selection of the results (fault type: stains) from OOD data generation methods applied to the Steel Plates Faults dataset. Comparatively, MGS-generated OOD data intersects the least with ID data.

Algorithm 1: MGS
# Step 1: Fit a β-TCVAE on dataset $D$ to obtain encoder $q_\phi(z_k \mid x_k)$ and decoder $p_\theta(x_k \mid z_k)$.
For types 1A and 1B: (i) Obtain the mean and covariance from the classwise posterior distribution using MCD.
$\tilde{X}^{OOD}_{ik} \leftarrow \tilde{x}^{OOD}_{ik}$
Output: $\tilde{X}^{OOD}$, the set of generated OOD samples.

1) STEEL PLATES FAULTS DATASET
In the steel industry, intelligent fault diagnosis during steel plate production is essential for the timely identification of defects that directly influence the safety and performance of the final product. Notably, fault diagnosis in steel plate production is challenging due to the complex nature of defects owing to the dynamic production process and the quality of raw materials [79]. The steel-plate surface defect inspection system involves capturing video images of the steel plates on the rolling equipment, followed by image processing and analysis, detecting the area of the defect, extracting features from the defect area, and finally, defect classification [80].
The steel plates faults dataset consists of 1941 instances for classifying surface defects in stainless steel plates during industrial production. This is a labeled dataset where each instance is classified into one of seven distinct fault types: Pastry, Z Scratch, K Scratch, Stains, Dirtiness, Bumps, and Other Faults. Each recorded instance consists of 27 attributes representing the geometric shape of the fault and its contour. For this dataset, we apply FD to diagnose the source of the fault from among the seven commonly occurring faults of the steel plates. The target class distribution reveals an imbalanced dataset.

B. EXPERIMENTAL SETUP
For synthetic OOD data generation, we utilize the β-TCVAE, a variant of the variational autoencoder that attempts to learn disentangled representations. The β-TCVAE architecture comprises an encoder with three fully-connected layers (27, 16, and 4 output features) and a decoder with three fully-connected layers (4, 16, and 27 output features). We train the β-TCVAE for 5000 epochs using the Adam optimizer [81] and a base learning rate of 0.1. We partition the data into a train/test split of 70%/30% and use a large batch size of 128. To achieve disentangled representations, we combine the mean squared error reconstruction loss with the special case of the β-TCVAE objective in which the ELBO TC-decomposition uses the weights α = 1 for index-code mutual information (MI), γ = 1 for dimension-wise KL, and β = 10 for total correlation (TC).

During the sampling stage, to obtain types 1A and 1B using the US algorithm, we select a series of higher temperatures $\{T_i\}_{i=0}^{L}$ that flatten the target distribution to allow for the exploration of wider ranges of parameters. In particular, we use the NumPy linspace function [82] to select 24 evenly spaced values over the interval from 1 to 30. For both MGS and CGen, we combine types 1A, 1B, and 2 OOD data in the ratio 70%/30%. Further, we implement the comparison method, SBO, using the hyperparameter setting d * = 0.45, d * = 1, and $\sigma_{\text{SBO}} = 1$.

We utilize a deep feedforward neural network (DFNN) for experiments on the DL-based FD models. The network architecture consists of four fully-connected layers (270, 216, 162, 108, 54, and 13 output features), with each layer followed by a rectified linear unit (ReLU) [83], a batch normalization layer [84], and a dropout layer [85]. We use four approaches to train our models with an aggregate dataset of ID and synthetic OOD data: (i) RKL-PNs [34], [38], (ii) AE-DNNs [39], (iii) exposing an ordinary DNN to OOD data during validation (ODNN-OOD), and (iv) training an ordinary DNN using only ID data (ODNN-ID). For the ordinary DNN, we utilize the softmax cross-entropy loss. We train the classifiers for 1000 epochs using the Adam optimizer [81] and a base learning rate of 0.1.

Finally, to evaluate the robustness of models trained using synthetic OOD data, we infuse noise into the test data to simulate OOD data in the real-world industrial environment. We create three distinct test OOD datasets by introducing randomness through the following three methods: (i) Gaussian noise with a mean of 0 and a standard deviation of 1, (ii) Poisson noise with an influence parameter of 1, and (iii) uniform noise within the range of −1 to 1.
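As an illustration of the sampling and robustness settings above, the following minimal sketch builds the temperature ladder with `np.linspace` and creates the three noise-infused test sets. It assumes the noise is added element-wise to the test features and that the Poisson parameter is the rate λ; the function name `make_noisy_test_sets`, the seed, and the placeholder array `x_test` are hypothetical and not taken from the paper.

```python
import numpy as np

# Temperature ladder: 24 evenly spaced values from 1 to 30 (both ends included).
temperatures = np.linspace(1.0, 30.0, num=24)


def make_noisy_test_sets(x_test, seed=0):
    """Return three noise-infused copies of x_test.

    Assumption: noise is additive and applied independently to every feature.
    """
    rng = np.random.default_rng(seed)
    shape = x_test.shape
    return {
        # (i) Gaussian noise with mean 0 and standard deviation 1
        "gaussian": x_test + rng.normal(loc=0.0, scale=1.0, size=shape),
        # (ii) Poisson noise with parameter 1 (interpreted here as the rate)
        "poisson": x_test + rng.poisson(lam=1.0, size=shape),
        # (iii) Uniform noise drawn from [-1, 1)
        "uniform": x_test + rng.uniform(low=-1.0, high=1.0, size=shape),
    }


# Hypothetical stand-in for a test split with 27 features.
x_test = np.zeros((5, 27))
noisy = make_noisy_test_sets(x_test)
```

Each entry of `noisy` has the same shape as `x_test`, so the three sets can be passed to a trained classifier in place of the clean test split.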

1) EVALUATION METRICS
For the evaluation of models on predictive uncertainty and OOD detection, we choose the following metrics, where the arrow next to each metric indicates which direction is better.

Accuracy (ACC) ↑ measures the model performance as the percentage of correct predictions out of the total predictions made. ACC evaluates the model's generalization performance on a hold-out test set. The higher the accuracy score, the more accurate the model's predictions.

Expected Calibration Error (ECE) ↓ measures the consensus between a classifier's predicted probabilities (confidence) and its empirical accuracy, computed over bins of predictions, where n represents the number of samples and $B_j$ is bin j [59].

135068 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

Brier Score (BS) ↓ measures the accuracy of predicted probabilities. BS is computed as the mean squared error between predicted probabilities and true classes, where p is a vector of predicted probabilities and y is the one-hot encoded ground truth [86].

AUROC-OOD ↑ measures the Area Under the Receiver Operating Characteristic Curve for OOD data by posing the problem as a binary classification with the OOD data considered the positive class.
False Positive Rate (FPR) at N% True Positive Rate (TPR) (FPRN) ↓ measures the probability of a model misclassifying an out-of-domain input as in-domain given that N% of the ID samples are correctly classified [74].
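The scalar metrics above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the evaluation code used in the paper: the ECE uses ten equal-width confidence bins, the Brier score sums squared errors over classes before averaging over samples (one common convention), and `fpr_at_tpr` assumes that a higher score means "more in-distribution". All function names and the bin count are hypothetical choices.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE = sum_j (|B_j| / n) * |acc(B_j) - conf(B_j)| over confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(labels)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.sum() / n * gap
    return ece


def brier_score(probs, labels):
    """Mean over samples of the squared error between p and one-hot y."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def fpr_at_tpr(scores_id, scores_ood, tpr=0.95):
    """Fraction of OOD inputs accepted as ID when `tpr` of ID inputs are accepted."""
    # Threshold chosen so that roughly `tpr` of the ID scores lie at or above it.
    threshold = np.quantile(scores_id, 1.0 - tpr)
    return float(np.mean(scores_ood >= threshold))
```

With `tpr=0.95`, `fpr_at_tpr` corresponds to the FPR95 value reported in the results under the assumptions stated above.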
Confidence Calibration measures the correlation between confidence and correctness of model predictions. For a selected threshold, the metric is provided by the area under the precision-recall curve (AUPRC) [89] as follows:
• Aleatoric Confidence (Alea. Conf.) ↑ is obtained using the maximum class probability $\max_k p_k$ as the threshold and a binary set of labels where 1 corresponds to correct predictions while 0 corresponds to incorrect predictions.
• Epistemic Confidence (Epist. Conf.) ↑ uses the empirical variance of the predicted class $p_k$ as the threshold against a binary set of labels where 1 corresponds to correct predictions while 0 corresponds to incorrect predictions.

OOD Detection measures the models' ability to detect OOD samples. For a selected threshold, the metric is provided by the area under the precision-recall curve (AUPRC) [89] as follows:
• Aleatoric OOD Detection (OOD Alea.) ↑ is obtained using the maximum class probability $\max_k p_k$ as the threshold and a binary set of labels where 1 corresponds to in-domain data while 0 corresponds to out-of-domain data.
• Epistemic OOD Detection (OOD Epist.) ↑ uses the empirical variance of the predicted class $p_k$ as the threshold against a binary set of labels where 1 corresponds to in-domain data while 0 corresponds to out-of-domain data.
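The aleatoric variants of the AUPRC-based metrics above can be sketched as follows. The sketch approximates the AUPRC with the step-wise average precision and does not treat tied scores specially; the function names are hypothetical, and the epistemic variants would substitute a variance-based score for the maximum class probability, which is not shown here.

```python
import numpy as np


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes at least one positive)."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    y = np.asarray(y_true)[order]
    true_pos = np.cumsum(y)
    precision = true_pos / np.arange(1, len(y) + 1)
    # Average the precision at each rank where a positive sample is retrieved.
    return float(np.sum(precision * y) / y.sum())


def aleatoric_confidence(probs, labels):
    """Score = max class probability; positive class = correct predictions."""
    score = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(int)
    return average_precision(correct, score)


def aleatoric_ood_detection(probs_id, probs_ood):
    """Score = max class probability; positive class = in-domain samples."""
    score = np.concatenate([probs_id.max(axis=1), probs_ood.max(axis=1)])
    is_id = np.concatenate([
        np.ones(len(probs_id), dtype=int),
        np.zeros(len(probs_ood), dtype=int),
    ])
    return average_precision(is_id, score)
```

A value of 1.0 means that every positive sample (correct prediction or in-domain input) receives a higher score than every negative sample.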

V. RESULTS AND DISCUSSION
First, we analyze the impact of synthetic OOD data on DL-based FD model performance in OOD data detection tasks. Ordinary classifiers (ODNN-ID) trained using softmax cross-entropy obtain higher model accuracy, ECE, NLL, and Brier scores due to training exclusively on ID data. Nonetheless, incorporating OOD data during validation in the ODNN-OOD classifiers enhances the model performance on OOD detection tasks. MGS-generated synthetic OOD data achieves the best within-classifier scores of 0.82 for AUROC-OOD and 0.97 for FPR95.
Table 2 presents the results from experiments investigating (i) the correlation between confidence and correctness of model predictions and (ii) the models' capacity for OOD detection. Fundamentally, the aleatoric and epistemic confidence metrics (Alea. Conf. and Epist. Conf.) seek to establish the likelihood of correct predictions given high confidence. For the RKL-PN and AE-DNN classifiers, the CGen-generated OOD data achieves the best confidence scores. Nonetheless, the ODNN-OOD classifier using MGS-generated OOD data during validation has the best overall confidence scores at 0.9317 aleatoric and 0.9318 epistemic, indicating model predictions that are more likely to be correct given the increase in confidence. For the OOD detection tasks, investigations reveal that using the MGS-generated OOD data enhances the performance of RKL-PN and AE-DNN classifiers, evidenced by the significant improvements over the other OOD generation approaches. In particular, augmenting the Steel Plates Faults dataset with MGS-generated OOD data for the AE-DNN classifier achieves the best overall scores at 0.9618 OOD aleatoric and 0.9643 OOD epistemic. The CGen-generated OOD data achieves the best scores for the ODNN-OOD classifier, while failure to use any OOD data during training yields the poorest scores of 0.50 for both OOD aleatoric and OOD epistemic, further demonstrating the significance of synthetic OOD data in the training of safety-related FD applications.
Fig. 3 illustrates the predictive entropy density plots of ID and OOD data from AE-DNN trained on the Steel Plates Faults dataset augmented using OOD data from the SBO, CGen, and MGS methods. Augmenting ID data using MGS-generated OOD data yields the best divergence in predictive entropies, with OOD samples predominantly obtaining high entropies while ID samples obtain low entropies. Notably, distinguishing between ID and OOD data is essential for DL-based FD systems deployed in safety-related industrial environments and is, in this case, implementable through a thresholding-based system. The distinction in predictive uncertainties between ID and OOD samples highlights the benefits of using MGS-generated OOD data to enhance the capacity for OOD detection tasks.
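A thresholding-based system of the kind mentioned above could look like the following minimal sketch. It assumes the predictive entropy is normalized to [0, 1] by dividing by log K, where K is the number of classes, and it uses a cut-off of 0.5 purely for illustration; neither the normalization scheme nor the threshold value is specified in the text, and the function names are hypothetical.

```python
import numpy as np


def normalized_entropy(probs, eps=1e-12):
    """Shannon entropy of each predictive distribution, divided by log(K)."""
    n_classes = probs.shape[1]
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return entropy / np.log(n_classes)


def flag_ood(probs, threshold=0.5):
    """Flag a sample as OOD when its normalized entropy exceeds the threshold."""
    return normalized_entropy(probs) > threshold


# A uniform prediction (maximally uncertain) and a one-hot prediction (certain)
# over 13 classes, matching the output size of the classifier described earlier.
uniform_pred = np.full((1, 13), 1.0 / 13)
onehot_pred = np.eye(13)[:1]
flags = flag_ood(np.vstack([uniform_pred, onehot_pred]))
```

In practice the threshold would be tuned on held-out data, for example to reach a target true positive rate on ID samples.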

VI. CONCLUSION
This paper proposes Manifold Guided Sampling (MGS), a data-driven method for generating synthetic out-of-distribution (OOD) data based on deep generative networks. In particular, MGS leverages an in-distribution (ID) data-supporting manifold of large-scale industrial process data and a combination of strategic manifold sampling techniques to create realistic OOD data. Through MGS, we address the challenges of training data quality and availability for data-driven deep learning-based fault diagnosis systems by generating synthetic OOD data that simulate real-world anomalies not present in the training set. We demonstrate the impact of augmenting ID data with synthetic OOD data during model training, with results that suggest the synthetic data improves the model capacity for OOD detection and provides robustness to the effects of distributional shifts.
Furthermore, MGS samples low-probability regions of the manifold and is more efficient in terms of compute resources due to the utilization of smaller batch sizes when generating the tangent space of the manifold.It maintains the in-distribution data feature space as a reference point during data generation and applies a similarity distance constraint to ensure the resulting synthetic data is realistic.Our results show the best distinction between ID and OOD data, which is crucial for systems deployed in safety-related industrial environments.In future work, we aim to investigate the effectiveness of our approach on time-series datasets and high-resolution sensor data such as large-scale multimodal camera-LiDAR datasets.

FIGURE 1. Types of OOD data. Type 1A: classwise samples on the intersecting regions of the manifold, Type 1B: classwise samples on low-probability regions of the manifold, and Type 2: classwise samples located in proximity regions outside the learned ID manifold.

Algorithm 1 MGS Learning Algorithm
Input: minimum distance d* set as an acceptable threshold from the ID data, perturbation value ϵ for adjusting latent variables, batch size b for decoder Jacobian matrix.
Data: $D = \{x_i, y_i\}_{i=1}^{N}$, set of N i.i.d. labeled samples from the training dataset.

FIGURE 3. Predictive entropy density plots for ID and OOD data from AE-DNN trained on the Steel Plates Faults dataset augmented using OOD data from (i) SBO, (ii) CGen, and (iii) MGS methods. Entropy scores are normalized to fall within the range [0, 1]. MGS OOD data improves the AE-DNN capacity to detect OOD data using predictive entropy scores by achieving the best divergence between ID and OOD predictive entropy scores.
Through a learning rate scheduler, the base learning rate of 0.1 is adaptively changed to 0.01 at epoch 75 and 0.001 at epoch 90 during training. For the optimizer tuning, we ultimately settle on Adam with an ϵ value of $10^{-4}$. We partition the data into a train/test split of 70%/30% and use a large batch size of 128 for all experiments on the Steel Plates Faults dataset, an imbalanced dataset, hence increasing the chances of including samples from the minority classes in each batch during training.
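The learning rate schedule described above can be written as a simple piecewise-constant function. This is a sketch assuming a base rate of 0.1 and zero-indexed epochs; the function name `learning_rate` is hypothetical, and the scheduler implementation actually used is not stated in the text.

```python
def learning_rate(epoch, base_lr=0.1):
    """Piecewise-constant schedule: base_lr, then 0.01 from epoch 75, then 0.001 from epoch 90."""
    if epoch < 75:
        return base_lr
    if epoch < 90:
        return 0.01
    return 0.001


# Learning rate for each epoch of a 1000-epoch run (epoch indexing is an assumption).
schedule = [learning_rate(epoch) for epoch in range(1000)]
```

In a training loop, the value returned for the current epoch would be assigned to the optimizer before the epoch starts.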
We also evaluate the robustness of the DL-based FD models against a collection of noise-infused OOD data. We observe that training classifiers using a combination of ID and synthetic OOD data achieves superior model performance in detecting noise-infused OOD data. In particular, augmenting the Steel Plates Faults dataset with MGS-generated OOD data during training enhances the performance of RKL-PN and AE-DNN classifiers, as evidenced by the AUROC-OOD and FPR95 scores. ODNN-OOD classifiers with access to MGS and CGen synthetic OOD data during training outperform ODNN-ID classifiers, with CGen-generated OOD data achieving the best improvement across the ODNN category.

TABLE 1. Accuracy, ECE, NLL, Brier, AUROC-OOD, and FPR95 test set results for RKL-PN, AE-DNN, ODNN-OOD, and ODNN-ID trained on Steel Plates Faults ID and synthetic OOD data generated from SBO, CGen, and MGS methods. Boldface values indicate better results per method.

TABLE 2. ID/OOD aleatoric and epistemic test set results for RKL-PN, AE-DNN, ODNN-OOD, and ODNN-ID trained on Steel Plates Faults ID and synthetic OOD data generated from SBO, CGen, and MGS methods. Boldface values indicate better results per method.