Building a Scalable and Interpretable Bayesian Deep Learning Framework for Quality Control of Free Form Surfaces

Deep learning has demonstrated high accuracy for the 3D object shape error modeling needed to estimate dimensional and geometric quality defects in multi-station assembly systems (MAS). Increasingly, deep learning-driven Root Cause Analysis (RCA) is used for decision-making when planning corrective actions for quality defects. However, given the current absence of scalability-enabling models, training deep learning models for each individual MAS is exceedingly time-consuming, as it requires large amounts of labelled data and multiple computational cycles. Additionally, understanding and interpreting how deep learning produces final predictions while quantifying various uncertainties remains a fundamental challenge. To address these gaps, a novel closed-loop in-process (CLIP) diagnostic framework, underpinned by an algorithm portfolio, is proposed which simultaneously enhances the scalability and interpretability of the current Bayesian deep learning approach, Object Shape Error Response (OSER), to isolate root cause(s) of quality defects in MAS. The OSER-MAS leverages a Bayesian 3D U-Net architecture integrated with Computer-Aided Engineering simulations to estimate root causes. The CLIP diagnostic framework shortens OSER-MAS model training time by developing: (i) closed-loop training to enable faster convergence for a single MAS by leveraging uncertainty estimates of the Bayesian 3D U-Net model; and, (ii) a transfer/continual learning-based scalability model to transmit meta-knowledge from the trained model to a new MAS, resulting in convergence using comparatively fewer training samples.
Additionally, CLIP increases the transparency of quality-related root cause predictions by developing an interpretability model which is based on 3D Gradient-weighted Class Activation Maps (3D Grad-CAMs) and entails: (a) linking elements of the MAS model with functional elements of the U-Net architecture; and, (b) relating features extracted by the architecture with elements of the MAS model and further with the object shape error patterns for root cause(s) that occur in MAS. Benchmarking studies are conducted using six automotive MAS with varying complexities. Results highlight a reduction in training samples of up to 56% with a loss in performance of up to 2.1%.

(near zero-defect manufacturing strategy). This is a very challenging task due to increased product variety and smaller batch sizes with ever-decreasing time-to-market. Currently, numerous modern manufacturing systems applications use multiple robotic assembly stations placed in a production line to improve productivity and product quality. For example, a typical automotive body assembly (Body-in-White (BIW)) consists of hundreds of stamping parts and components processed along 60-85 assembly stations with a production rate of 30-65 vehicles per hour with 3-5 varieties of vehicles being produced simultaneously on a single assembly line. Automatic robotic assembly lines can significantly increase productivity and the variety of products produced on a single line. Nonetheless, the embedded complexity of an automatic production line can lead to failures involving robot, end effector with joining head, and/or fixtures, resulting in diminished product quality, scrap, rework, or production downtimes. As a result, there is an urgent need for diagnostics of modern automatic assembly lines. The diagnostics approaches should have the ability to work with current requirements of modern assembly lines, i.e., have capability for (i) multi-variety products with non-rigid parts and shape variances which are rapidly scaled-up from low to high volume production (this translates to the need for rapid scalability of diagnostics models between products and production volumes); and (ii) automatic recovery from disruptions at minimum cost (translating to need for interpretability of the diagnostic results, which will provide transparency and the causes of their prediction, as they are often used to plan costly corrective action of quality defects).
Therefore, it is crucial to develop diagnostic approaches that are scalable and interpretable to ensure that multi-station assembly systems (MAS) can continue to produce high-quality products in the presence of uncertainties induced by non-ideal parts and operational errors [1], [4]. Recently, Object Shape Error Response (OSER) approaches have been proposed for single-station [5] and multi-station assembly systems (OSER-MAS) [6] that integrate Bayesian deep learning with CAE simulations, hence enabling effective root cause analysis (RCA). The OSER-MAS leverages a Bayesian 3D U-Net encoder-decoder based architecture with multiple output heads. The multiple output heads enable simultaneous estimation of a heterogeneous set of process parameters that can be real-valued or categorical, while quantifying uncertainties. The decoder is leveraged to estimate object shape errors for upstream stations. The OSER-MAS gives superior performance for tasks such as process parameter estimation, object shape error estimation and uncertainty quantification, but lacks capabilities for scalability and interpretability.
This paper proposes a novel closed-loop in-process (CLIP) diagnostic framework which simultaneously enhances the scalability and interpretability of the current OSER-MAS based approach by integrating, within the OSER framework, an algorithm portfolio that includes closed-loop training, transfer learning [7] and continual learning [8] for scalability, and 3D gradient-weighted class activation maps (3D Grad-CAMs) [9] for interpretability. The CLIP diagnostic framework shortens OSER-MAS model training time by developing: (i) closed-loop training to enable faster convergence for a single MAS by leveraging uncertainty estimates of the OSER-MAS model; and, (ii) a scalability model to transfer meta-knowledge from the trained model to a new MAS, so that each new MAS requires comparatively fewer training samples. Additionally, CLIP increases the transparency of the OSER-MAS root cause predictions by developing an interpretability model which entails: (a) linking elements of the MAS model with functional elements of the Bayesian deep learning architecture; and, (b) relating features extracted by the Bayesian deep learning architecture with elements of the MAS model and further with the object shape error patterns for root cause(s) that occur in MAS. Although the CLIP framework is developed and validated considering the previously proposed OSER based approaches, the framework can be leveraged to simultaneously enhance the scalability and interpretability of similar approaches in related domains, such as stamping and machining, where deep learning and CAE simulations are leveraged to relate object shape errors to root causes.

B. RELATED WORK
1) SCALABILITY
Scalability within manufacturing systems has been stated as a set of capabilities to provide transfer of knowledge and ideas from other engineering and management areas [10]. For algorithms that perform RCA of MASs, scalability translates into effectively leveraging the learning for one type of assembly system and transferring it, in the form of features and relationships, to another similar assembly system, which enables learning for the latter using substantially less data and computation. Transfer learning techniques have been applied to fault diagnosis [11]. Digital Twins [12] have also been proposed as a way to enable scalability. Successful applications of transfer learning across multiple domains [7], [13]-[16] have enabled scalability in a sustainable manner that does not require exhaustive training data and computation capabilities. Recent works have also proposed that scalability should be life-long or continual and should not come at the expense of forgetting previous learning when new features or relationships are learnt for new systems [8], [17]-[20]. Hence, to enable scalability for MAS, frameworks and methodologies are needed that integrate transfer/continual learning with existing deep learning approaches for RCA, so that training data and computation times do not become barriers to the application of such models within industrial setups.

2) INTERPRETABILITY
Interpretability has been another major concern for the application of deep learning-based RCA models within MAS, as they do not provide the required context, trust and confidence within root cause estimates. This lack of transparency, when coupled with the costly actions driven by them, results in such models not being adopted at scale. Various methodologies, such as Gradient-weighted class activation maps, that can integrate deep learning estimates with the required transparency have been proposed [21]-[23]. Bayesian deep learning [5] has been proposed to integrate confidence and uncertainty measures with root cause estimates, but there is a need for frameworks within MAS that can provide interpretability while accounting for context and confidence. Such frameworks enable trust in black-box deep learning models and provide different levels of interpretability.

3) ROOT CAUSE ANALYSIS OF ASSEMBLY SYSTEMS
Dimensional and geometric variations are some of the biggest challenges faced by the manufacturing industry. Indeed, two-thirds of quality issues in the automotive and aerospace sectors are caused by dimensional variations [24]. Driven by the development of in-line measurement systems such as coordinate measuring machines (CMM) and 3D scanners, models for RCA of assembly systems have seen considerable development in both industrial applications and academic research. These models [25] can be grouped into two categories: (a) knowledge-based models; and (b) estimation-based data-driven models leveraging statistical and machine learning techniques [26]. For single station systems, Apley and Shi [27] established the deviation transfer model based on process information such as fixture positioning and used least-squares to diagnose fault sources. Chang et al. [28] leveraged a linear model between shape error sources and measurement features followed by parameter estimation and statistical tests to diagnose shape error sources. Yu et al. [29] leveraged influence coefficients based on finite element modeling to relate the sources of variation to the measured shape errors of flexible sheet metal parts. Least-squares estimation was then used to estimate errors in fixture positioning.
For MASs, Agrawal et al. [30] used regression models of sensor data. Zou et al. [31] proposed integrating BIC with LASSO variable selection. Shang et al. [32] proposed a Binary State Space Model (BSSM) for MASs to perform binary diagnosis. Jin et al. [33] proposed state-space and stream-of-variation (SoV) for multi-station shape error propagation of automotive assemblies. Ding et al. [34] extended the SoV method of assembly shape error of rigid parts using state space considering different fixture locating scheme. Using the above approaches as a base, various RCA models for MASs have been proposed. Ding et al. [35], [36] compared different variance estimation techniques and concluded a basis for method selection under different scenarios. Ceglarek and Prakash [37] proposed shape error diagnosis based on enhanced piecewise least squares (EPLS) to detect and isolate collinear dimensional faults caused by fixture variation. Ceglarek and Shi [38], [39] employed pattern matching for diagnosis of fixtures based on principal component analysis. Liu and Hu [40] used designated component analysis for shape error diagnosis of flexible sheet metal parts. Various enhancements have been proposed using the knowledge of MASs [41], [42].
Given the ever-increasing complexity of MASs, increased computation capabilities and developments in machine learning, recently, RCA approaches [26] using machine learning have been proposed to overcome limitations of the above-stated methods such as linear approximations of the MAS. Du et al. [43] utilized artificial neural networks to monitor and identify process variability. Beruvides et al. [44] applied reinforcement learning to perform RCA. Bayesian Networks [45], [46] are seen as an alternative to solve small dataset problems and integrate process data and engineering knowledge. Recently, Sinha et al. developed Object Shape Error Response (OSER) for single-station [47], [5] and Object Shape Error Response for Multi-Station Assembly Systems (OSER-MAS) [6] that aim to integrate Bayesian deep learning elements such as Bayesian 3D Convolutional Neural Networks and Computer-Aided Engineering (CAE) simulations thereby, blending (a) engineering knowledge-techniques with (b) estimation-based data-driven approaches. This satisfies various model capability requirements for RCA of MASs such as (i) high data dimensionality [48]; (ii) non-linearity [49]; (iii) collinearities [50]; (iv) high faults multiplicity [51]; (v) uncertainty quantification [52]; (vi) dual data generation capabilities [12]; (vii) high dimensionality and heterogeneity of process parameters [53]; and, (viii) fault localization [54]. In addition to the aforementioned model capability requirements, RCA techniques must be further developed and enhanced to fulfil two additional key requirements [55] in order to enable implementation and large-scale adoption across different manufacturing environments: (ix) Scalability as automotive multi-station assembly processes include hundreds of stamping parts and components, multiple stations with multiple stages in each station [4] namely, place-clamp-fasten-release (PCFR) to finish the final assembly product. 
The multiple variation sources in the MAS interact and accumulate in a non-additive manner. The final product accuracy and performance depend upon the accumulated performance of individual stations in the system. In MASs, the quality of the components is influenced by (i) incoming non-ideal and deformable parts; (ii) PCFR-to-part interactions as parts move through PCFR stages (shape error induced by fixturing and joining operations); (iii) part-to-part interactions within each station and further magnified between stations; and, (iv) station-to-station interactions due to re-orientation errors between stations (change of fixture locating layout between stations).
(x) Interpretability, as the estimation of root cause(s) requires insights into why such estimates were made by the deep learning model. Such interpretability insights are essential for contextualizing the root cause(s) estimated by the deep learning model. Additionally, root cause (RC) estimates drive costly corrective actions; hence, model interpretability integrated with measures of uncertainty [5] is a crucial requirement for effective and efficient corrective and control actions.
The paper will address the aforementioned requirements as follows: Requirement (ix) by developing a closed-loop training framework that leverages the epistemic uncertainty [5] estimates of the Bayesian 3D U-Net based OSER-MAS model to intelligently sample from the process parameter hyperspace for faster convergence and hence reduce the computation time bottleneck of the CAE simulations. The CLIP diagnostic framework presented in this paper utilizes a high-fidelity CAE simulator of the assembly process called the Variation Response Method (VRM) [56]. Further, to enhance the scalability for high-dimensional MAS and reduce CAE simulation time, uncertainty guided continual learning [8] and transfer learning [7] are integrated within the algorithm portfolio underpinning the CLIP diagnostic framework. This enables the transfer of meta-knowledge from the trained model to a new MAS, so that each new MAS requires comparatively fewer training samples. Theoretically, given that the multi-physics processes of the assembly system are similar within each station, the features extracted by spatial convolutional operations are transferable across different assemblies, which makes transferability within the model essential for scalability. Given that models for different MAS would be trained sequentially, leveraging continual learning approaches reduces catastrophic forgetting. A model with continual learning capabilities can also account for the dynamic nature of the assembly system and hence achieve life-long learning. This is accomplished using uncertainty guided continual learning [8], which leverages the Bayesian neural network parameter uncertainty to assign importance for each task, i.e., a particular assembly case study, thereby enabling continual learning by updating less important parameters at a faster rate.
Requirement (x) by developing an interpretability model which is based on 3D Gradient-weighted Class Activation Maps (3D Grad-CAMs) and entails: (a) linking elements of MAS model with functional elements of the 3D Bayesian U-Net model; and, (b) leveraging 3D Grad-CAMs [9] that provide insights into the regions within the input that the model focuses on to estimate process parameters i.e. root cause(s). These collectively provide the required interpretability for RCA. Additionally, Bayesian Deep Learning enables uncertainty quantification and segregation into aleatoric and epistemic uncertainties, thus, providing a measure of the required confidence while conducting RCA.
The key contribution of the paper is the development of the CLIP diagnostic framework underpinned algorithm portfolio that includes the following models: capabilities of the CLIP diagnostic framework underpinned algorithm portfolio using six different industrial automotive assemblies of varying complexities. The rest of the paper is organized as follows; Section II formulates the object shape error estimation and RCA problem for MASs, Sections III and IV discuss the methods for scalability and interpretability, respectively, Section V presents the industrial case studies and finally, conclusions and future work are summarized in Section VI.

II. PROBLEM FORMULATION
A. MULTI-STATION ASSEMBLY SYSTEMS
As illustrated in Fig. 1 [6], a MAS can be represented as a process tree where different nodes correspond to stages within a single assembly station (Fig. 1a) or to stations within the assembly system (Fig. 1b). A station consists of multiple stages, namely positioning, clamping, fastening and release (PCFR). The input to each station is a set of incoming parts (objects) that need to be assembled. Within the process, object shape errors can be induced in any station by one or multiple variations of the process parameter(s). These errors are further propagated and accumulate in a non-additive manner [34]. Any variation in the process parameter(s) is a source of shape error and thus must first be quantified and then estimated as a root cause of the shape errors. In MAS, these process parameters are classified into three categories: (a) Real-valued parameters of incoming part (object) variation caused by upstream fabrication processes such as stamping, extrusion, etc.; (b) Real-valued process parameters related to the PCFR stages of each assembly station, which represent any deviation from nominal in fixturing/tooling or joining operations; and, (c) Binary joining-based process parameters in the fastening stage that indicate the success of the joint. The value is {1} when the joint is successfully completed or {0} for an unsuccessful joint due to an excessive gap between the objects to be joined or a current failure in the tool. In this paper, Self-Piercing Riveting (SPR) is the considered fastening/joining process. On the other hand, the aforementioned process parameters, which can be real-valued or binary, are represented by $y_{s,\acute{s}}$ for station $s$ and stage $\acute{s}$. The process parameters for station $s$, inclusive of all stages, are represented by $y_s$, while $y$ represents the total set of $h$ process parameters across the entire assembly system. The process parameter vector $y$ consists of $c$ binary parameters and $r$ real-valued parameters denoted by $y_c$ and $y_r$, respectively.
The aim of the proposed 3D U-Net CNN model training is to learn a function $f(\cdot)$ that takes as input the combined object shape error at the end of the system, $x_{N_s,4}$, i.e., after the final stage of the last station ($N_s$), and estimates the process parameters across the entire system and the object shape error for all objects at the end of the previous stations:
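Consistent with the model inputs and outputs described elsewhere in the text, the mapping learned by $f(\cdot)$ can plausibly be written as below. This is a reconstruction under that assumption, with hats denoting estimates (a notational choice made here), and not a verbatim restatement of the original equation.

```latex
\left\{ \hat{y}_r,\; \hat{y}_c,\; \hat{x}_{1,4},\; \ldots,\; \hat{x}_{N_s-1,4} \right\}
  = f\!\left( x_{N_s,4} \right)
```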

B. ROOT CAUSE ANALYSIS
For comprehensive RCA of MASs, the paper proposes three key steps, namely: (i) fault identification; (ii) fault localization; and, (iii) fault isolation. Using the estimates obtained from (2), RCA can be done using the following steps to isolate single or multiple faults that occur in a MAS.

1) FAULT IDENTIFICATION
Fault identification involves identifying which process parameters are potentially at fault. Faults can be identified by comparing the values of process parameters with given standards. All binary process parameters $y_c$ that are {0} are identified as faults, as they represent a failure in the joining process. Real-valued process parameters can be identified as potential faults based on the fault identification strategy.
If a threshold approach is leveraged, each $y_{s,\acute{s}}$ that is greater than the threshold is identified as a fault. On the other hand, if six-sigma fault identification strategies are used, a sample of products $s_p$ is observed and the mean $\mu_{s,\acute{s},p}$ and standard deviation $\sigma_{s,\acute{s},p}$ of each process parameter are calculated. Based on the significance level used, a mean shift or a change in variance (heteroskedasticity) in process parameters can be identified as a fault. The subset of process parameters identified as faults via a threshold or a six-sigma approach is denoted as $y_F$.
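The two identification strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter names, the use of absolute deviation against the threshold, and the z-limit rule standing in for a significance test are all hypothetical choices.

```python
from statistics import mean, stdev


def identify_faults(binary_params, real_params, thresholds):
    """Return the names of process parameters flagged as potential faults.

    binary_params: dict name -> 0/1 joint status (0 means a failed joint)
    real_params:   dict name -> estimated deviation from nominal
    thresholds:    dict name -> allowed absolute deviation (hypothetical values)
    """
    # Every failed joint is a fault by definition.
    faults = [name for name, value in binary_params.items() if value == 0]
    # Real-valued parameters are flagged when they exceed their threshold.
    faults += [name for name, value in real_params.items()
               if abs(value) > thresholds[name]]
    return sorted(faults)


def mean_shift_fault(samples, nominal=0.0, z_limit=3.0):
    """Flag a mean shift when the sample mean lies more than z_limit
    standard errors away from the nominal value (assumed decision rule)."""
    mu = mean(samples)
    sigma = stdev(samples)
    standard_error = sigma / len(samples) ** 0.5
    return abs(mu - nominal) > z_limit * standard_error


if __name__ == "__main__":
    flagged = identify_faults(
        {"spr_1": 1, "spr_2": 0},
        {"clamp_x": 0.2, "pin_y": 1.4},
        {"clamp_x": 0.5, "pin_y": 1.0},
    )
    print(flagged)
    print(mean_shift_fault([0.9, 1.1, 1.0, 1.05, 0.95]))
```

A change in variance could be handled analogously by comparing the sample standard deviation against a reference value.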

2) FAULT LOCALIZATION
Fault localization involves identifying the particular stations within which the shape error of the object(s) (sub-assemblies) becomes significant. The shape error estimates for all objects $o$, $o = 1, \ldots, n$, are compared with the design nominal. If the shape error is beyond the threshold (assembly tolerances) at the end of a station and within the threshold for the previous station, the fault is localized to that particular station for the corresponding object. This is done for each object and thus multiple faults for different objects can also be localized. The subset of localized stations is denoted as $s_F$.

3) FAULT ISOLATION
Fault isolation involves integrating the information from fault identification and fault localization to isolate which process parameters within the potentially identified faults $y_F$ and localized stations $s_F$ are 'malignant' and have a significant impact on the shape error of the final object (product). Among all potentially identified process parameters, those that lie within the localized stations $s_F$ (and corresponding objects $o_F$) are isolated as faults and estimated as RCs.
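Localization and isolation can be sketched together as a set intersection. The example below is a simplified illustration: it assumes one scalar shape error per object and station, a single tolerance, and that incoming parts are within tolerance before the first station. The data layout and names are hypothetical.

```python
def localize_faults(shape_error, tolerance):
    """Return a set of (station, object) pairs where an object first
    leaves tolerance.

    shape_error: dict object -> list of shape error magnitudes at the end
                 of each station, ordered from station 1 onwards.
    """
    localized = set()
    for obj, errors in shape_error.items():
        # Assumption: parts are within tolerance before station 1.
        previous_within = True
        for station, err in enumerate(errors, start=1):
            within = abs(err) <= tolerance
            if not within and previous_within:
                localized.add((station, obj))
            previous_within = within
    return localized


def isolate_root_causes(identified, param_station, localized):
    """Keep only identified parameters that belong to a localized station.

    identified:    iterable of parameter names flagged in identification
    param_station: dict parameter name -> station index
    localized:     output of localize_faults
    """
    stations = {station for station, _ in localized}
    return sorted(p for p in identified if param_station[p] in stations)


if __name__ == "__main__":
    localized = localize_faults(
        {"door_inner": [0.2, 0.3, 1.6], "hinge": [0.1, 0.2, 0.3]},
        tolerance=1.0,
    )
    root_causes = isolate_root_causes(
        ["clamp_x_st2", "pin_y_st3", "spr_2_st3"],
        {"clamp_x_st2": 2, "pin_y_st3": 3, "spr_2_st3": 3},
        localized,
    )
    print(localized, root_causes)
```

In this toy case the station 2 clamp deviation is treated as benign because no object leaves tolerance at station 2.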
Given that manufacturing systems are stochastic, all process parameters always have an inherent level of variation; hence, such a three-step approach enables differentiating faults that are benign (have no significant impact on the product shape error) from faults that are malignant (cause the product shape error to go beyond assembly thresholds). The RCs can be denoted as:
For the estimation of $f(\cdot)$ as shown in (2), Sinha et al. [6] proposed the Object Shape Error Response for Multi-Station Assembly Systems (OSER-MAS) approach. The approach leverages a Bayesian 3D U-Net architecture (Fig. 2) [6] that enables: (i) estimation of a heterogeneous set of process parameters; (ii) estimation of upstream object shape errors; and, (iii) quantification and segregation of uncertainties. The model is trained using a weighted loss function [6] that accounts for all the aforementioned outputs and uncertainty quantification. The architecture consists of four encoder-decoder levels. The end of the encoder consists of a regression head and a classification head, each containing one hidden Dense Flipout layer with 64 nodes and ReLU activation. The number of output nodes in each head equals the number of real-valued and binary process parameters, respectively. These heads enable estimation of a heterogeneous set of process parameters. The end of the decoder estimates the upstream object shape errors. Given the use of Bayesian Flipout layers in the encoder and in the regression and classification heads, the architecture enables quantification and segregation of uncertainties. The encoder-decoder model consists of seven key functional elements, namely: (1a) Object shape error voxelization; (1b) Encoder with down-sampling kernels; (1c) Decoder with up-sampling kernels; (1d) Multiple output heads; (1e) Attention gate; (1f) Bayesian Flipout layers; and, (1g) Residual connections. The interpretability of each element is discussed with respect to the requirements of MASs.
Overall, the model takes as input the voxelized shape error after the final station, $x_{N_s,4} \rightarrow V_{u,v,w,d}$, and gives as output the shape errors after the previous stations and the process parameters, $\{y_r, y_c, x_{1,4}, \ldots, x_{N_s-1,4}\}$. Uncertainties are estimated for each output value. The uncertainty estimates are crucial in driving costly corrective actions. They are segregated into epistemic and aleatoric uncertainties. The aleatoric uncertainty is estimated by assuming that the outputs follow a multivariate normal distribution with a diagonal covariance matrix. The epistemic uncertainty is estimated by assuming each weight $\omega$ in the network follows a normal distribution with parameters $\theta_\omega = (\mu_\omega, \sigma_\omega)$, and then estimating the posterior parameters of the distributions. The paper proposes to leverage these measures of uncertainty to build approaches that enable scalable learning. Specifically, the overall epistemic uncertainty of the estimated process parameters $\sigma(y)$ drives closed-loop sampling from CAE simulators, and the uncertainty of the weights $\sigma_\omega$ drives uncertainty-guided continual learning [8], which aids in transferring knowledge between similar MASs. This enables convergence using substantially fewer training samples while also ensuring that there is no catastrophic forgetting of learning for previous MASs.
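The segregation of uncertainties can be illustrated for a single output value. The text does not spell out the estimator, so the sketch below assumes a common Monte Carlo decomposition: several stochastic forward passes each return a predicted mean and variance, the spread of the means is taken as epistemic uncertainty, and the average predicted variance as aleatoric uncertainty. The function name and inputs are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(mc_means, mc_variances):
    """Split predictive variance for one output into two parts.

    mc_means:     predicted mean from each stochastic forward pass
    mc_variances: predicted (aleatoric) variance from each forward pass
    Returns (epistemic, aleatoric, total) variances.
    """
    # Disagreement between passes reflects uncertainty in the weights.
    epistemic = pvariance(mc_means)
    # Average predicted noise reflects irreducible data uncertainty.
    aleatoric = fmean(mc_variances)
    return epistemic, aleatoric, epistemic + aleatoric


if __name__ == "__main__":
    print(decompose_uncertainty([1.0, 1.2, 0.8, 1.0], [0.05, 0.07, 0.06, 0.06]))
```

Only the epistemic part is expected to shrink as more training samples are added, which is why it is the natural signal for guiding further sampling.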

B. CLOSED-LOOP SAMPLING AND TRAINING
Closed-loop sampling enables the dynamic and adaptive generation of training samples based on the uncertainty and error of the previous training iterations, while ensuring that the sample generation has a degree of randomness to prevent the repeated generation of similar samples (Table 1). This enables faster convergence to the optimal weight and bias distribution parameters of $f(\cdot)$ as shown in (2). Sampling is done using VRM [56] as the CAE simulator, which takes as input a set of process parameters and outputs the object shape errors after the desired stage/station. Initially, Latin Hypercube Sampling (LHS) [57] of the process parameters within the allowable ranges is used as input to the CAE simulation model to generate the test (E) and validation (V) sets.
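For readers unfamiliar with LHS, a minimal pure-Python sketch is given below. It covers real-valued parameters only, and the bounds in the usage example are hypothetical; binary joining parameters would need separate handling, and a production setup would more likely rely on an existing design-of-experiments library.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that each parameter range is split into
    n_samples equal strata and every stratum is hit exactly once.

    bounds: list of (low, high) tuples, one per real-valued parameter.
    Returns a list of n_samples parameter vectors.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniformly random point inside each stratum of the unit interval.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired randomly across parameters.
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    for sample in latin_hypercube(5, [(-2.0, 2.0), (0.0, 1.0)], seed=1):
        print(sample)
```

Each generated vector would then be passed to the CAE simulator to obtain the corresponding object shape errors.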
Each element in X is characterized by the object shape error $x_{s,\acute{s}}$ (1), while each row in Y consists of a vector of process parameters: $y = \{y_r, y_c\}$. The initial training set $T_0$ is also generated using LHS and is used to train the proposed architecture $f(\cdot)$. After training, inference is performed on the validation set (V) to obtain the predictions and their uncertainties. For each sample, the absolute error is calculated and summed across all $h$ process parameters, giving $e_v$, a column vector consisting of the combined error for each sample.
The normalized error ($\tilde{e}_v$) and uncertainties ($\sigma_V$) are weighted to obtain the sampling importance metric $\tau$ for each sample. Samples are then sorted based on this metric, considering that the samples with the highest importance, i.e., the maximum sum of error and uncertainty, contribute more to model convergence than other samples. Based on the sampling metrics and the actual process parameters $Y_V$, the parameters of the sampling distribution are estimated. A Gaussian Mixture Model (GMM) with a pre-specified number of mixtures is considered as the sampling distribution. The sorted set of sampling metrics and process parameters is represented as $(\tau_v, Y_v)$. Given that the GMM has $K$ mixtures, the sorted set is subdivided into $K$ blocks, each block $i_b$ containing $\dim(V)/K$ samples, where $\dim(V)$ represents the total number of samples within the validation set. Each block $i_b$ is used to estimate the distribution parameters for the $i_b$-th mixture. Each mixture is characterized by a multivariate normal distribution, each component of which corresponds to a process parameter. The distribution parameters for the $i_b$-th mixture, consisting of $s$ samples, are estimated as in (11), where $Y_{V_{i_b}}$ represents the subset of samples in block $i_b$; the vector $\varphi$ is normalized to sum to 1 so that its entries $\varphi_{i_b}$ represent the mixture component weights; and $\mu_{i_b}$ and $\Sigma_{i_b}$ represent the mean vector and covariance matrix, respectively, for the $i_b$-th mixture. After estimating the distribution, $T_i$ samples are drawn, evaluated, and then added to the training set to further train the model. This ensures that samples are drawn considering the error and uncertainty while accounting for the fact that the model should not overfit on the validation set.
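The sampling step can be sketched as follows. Several details are assumptions made for illustration: min-max normalization, an equal weighting of error and uncertainty, mixture weights proportional to the summed importance in each block, and a diagonal covariance in place of the full covariance matrix to keep the example dependency-free.

```python
import random
from statistics import fmean, pvariance


def normalize(values):
    """Min-max scale values to [0, 1]; a constant list maps to zeros."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0
    return [(v - low) / span for v in values]


def fit_block_mixture(errors, uncertainties, params, n_mixtures, weight=0.5):
    """Estimate mixture parameters from validation results.

    params: one process parameter vector per validation sample.
    Returns one dict per mixture with keys 'phi', 'mean' and 'var'.
    """
    # Importance metric: weighted sum of normalized error and uncertainty.
    tau = [weight * e + (1.0 - weight) * u
           for e, u in zip(normalize(errors), normalize(uncertainties))]
    # Sort sample indices from most to least important.
    order = sorted(range(len(tau)), key=lambda i: tau[i], reverse=True)
    block_size = len(order) // n_mixtures
    mixture = []
    for b in range(n_mixtures):
        block = order[b * block_size:(b + 1) * block_size]
        columns = list(zip(*(params[i] for i in block)))
        mixture.append({
            "phi": sum(tau[i] for i in block),
            "mean": [fmean(c) for c in columns],
            "var": [pvariance(c) for c in columns],  # diagonal covariance only
        })
    # Normalize mixture weights so that they sum to one.
    total = sum(m["phi"] for m in mixture)
    for m in mixture:
        m["phi"] = m["phi"] / total if total > 0 else 1.0 / n_mixtures
    return mixture


def draw_samples(mixture, n_samples, seed=0):
    """Draw new process parameter vectors from the fitted mixture."""
    rng = random.Random(seed)
    weights = [m["phi"] for m in mixture]
    samples = []
    for _ in range(n_samples):
        m = rng.choices(mixture, weights=weights)[0]
        samples.append([rng.gauss(mu, var ** 0.5)
                        for mu, var in zip(m["mean"], m["var"])])
    return samples
```

Because blocks with higher importance receive larger mixture weights, most new samples fall near regions of high error and uncertainty, while the Gaussian spread keeps them from coinciding with the validation samples.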
It should be noted that although the validation set is used for sampling, the model is never trained on the validation samples; the randomness in the sampling ensures that new samples are drawn from regions where the error and uncertainty are high. After each training iteration, the model is tested on the test set (E) to determine the model performance. The training and sampling are terminated either when (i) the performance on the test set (E) reaches the required threshold $\varepsilon$ or (ii) the maximum number of training iterations $n_m$ is reached. The performance threshold can be decided based on the case study and application, and the maximum number of iterations is decided based on the CAE simulation budget.
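The termination logic can be summarized in a short skeleton. The three callables are hypothetical hooks standing in for model training, evaluation on the test set, and the uncertainty-driven sampling plus CAE simulation; the sketch assumes a performance score where higher is better.

```python
def closed_loop_training(train_step, evaluate, draw_new_samples,
                         initial_set, threshold, max_iterations):
    """Alternate training and sampling until the test score reaches
    `threshold` or `max_iterations` is exhausted.

    train_step(training_set): trains the model on the current set
    evaluate():               returns the score on the test set
    draw_new_samples():       returns newly simulated training samples
    """
    training_set = list(initial_set)
    history = []
    for _ in range(max_iterations):
        train_step(training_set)
        score = evaluate()
        history.append(score)
        if score >= threshold:
            break  # required performance reached
        # Otherwise grow the training set with targeted new samples.
        training_set.extend(draw_new_samples())
    return training_set, history


if __name__ == "__main__":
    state = {"n": 0}

    def toy_train(training_set):
        state["n"] = len(training_set)

    def toy_evaluate():
        # Toy score that improves with the amount of training data.
        return min(1.0, state["n"] / 10)

    final_set, scores = closed_loop_training(
        toy_train, toy_evaluate, lambda: [0, 0], [0] * 4,
        threshold=0.8, max_iterations=10,
    )
    print(len(final_set), scores)
```

For an error-type metric where lower is better, the comparison would simply be reversed.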

C. UNCERTAINTY GUIDED CONTINUAL LEARNING
The closed-loop sampling approach enables faster convergence within one MAS but does not enable the trained $f(\cdot)$ to be used across different MASs. Leveraging the trained function $f(\cdot)$ across different MASs to transfer relevant knowledge (features), and hence enable convergence with comparatively fewer samples, is crucial in enabling scalability. Continual learning methods (also known as sequential/lifelong learning) aim to incrementally learn new tasks without forgetting previous tasks for which they have been trained. In the context of MASs, the scale-up starts from tasks or cases as simple as a coupon or Top-hat assembly and extends to full-scale MASs such as automotive car door or cross member assemblies, which can cumulatively consist of up to 100 assembly stations, each consisting of multiple stages. Each assembly case is treated as a task $T_i$, and continual learning is performed for a total of $T_n$ tasks. Continual learning enables the transfer of the process parameter estimation capabilities of previous assembly cases to more complex assemblies while retaining the essential capabilities required for process parameter estimation of previous assemblies. The key to achieving continual learning is to assign an importance to each neural network weight $\omega$ and to preferentially update the less important weights, such that the model learns the new task without forgetting the previous task [8]. The approach leverages uncertainty guided continual learning because the weight uncertainty $\sigma_\omega$ of the Bayesian 3D U-Net model serves as an implicit measure of importance. Additionally, the ease of interpretation, strong mathematical foundation and good results on various datasets [8] motivate the use of such a learning algorithm.
Given the use of Bayesian neural networks and a normal distribution parametrized by θ ω = (µ ω , σ ω ) for each weight ω within the network, the standard deviation of each weight distribution is leveraged as the metric for importance. To enable continual learning, the learning rate α for each parameter is updated according to the corresponding importance.
The importance of the parameters is set to be inversely proportional to the standard deviation. Weights with a higher standard deviation are therefore less important and can be updated at a higher rate to learn new tasks, while weights with a lower standard deviation are more important and should be updated at a lower rate to prevent catastrophic forgetting (performance loss) on the old tasks.
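The effect of this rule on a gradient step can be illustrated with a short NumPy sketch. It assumes, for illustration only, that the importance is exactly 1/σ and that the base learning rate is divided by the importance, which amounts to multiplying it by σ; the exact adaptation used in the paper may differ. The function names are hypothetical.

```python
import numpy as np


def per_weight_learning_rates(sigma: np.ndarray, base_lr: float) -> np.ndarray:
    """Scale a base learning rate by the inverse of each weight's importance."""
    importance = 1.0 / sigma      # assumption: importance is exactly 1 / sigma
    return base_lr / importance   # equivalent to base_lr * sigma


def sgd_step_on_means(mu, grad_mu, sigma, base_lr):
    """One SGD step on the weight means with uncertainty-scaled learning rates."""
    return mu - per_weight_learning_rates(sigma, base_lr) * grad_mu


# Three weights with identical gradients but different uncertainty.
sigma = np.array([0.01, 0.1, 1.0])
mu = np.zeros(3)
grad = np.ones(3)
new_mu = sgd_step_on_means(mu, grad, sigma, base_lr=0.001)
```

The low-uncertainty (important) weight moves 100 times less than the high-uncertainty weight, which is the mechanism that protects previously learned tasks.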
Based on various empirical studies done by Ebrahimi et al. [8], the learning rate adaptations were determined as given in (14). The overall algorithm, consisting of closed-loop sampling and continual learning to train the model on multiple tasks T i , . . . , T n (different assembly case studies), is shown in Table 2. After each task, the learning rates are updated as shown in (14). The number of output nodes within the model is kept equal to the sum of all process parameters across all case studies. For each assembly case, the nodes for its own process parameters output the estimated values, while the nodes corresponding to process parameters of other assembly cases are set to output a nominal fixed value (generally zero). Overall, continual learning aims to learn the process parameter estimation capabilities of each assembly case study incrementally while minimizing forgetting for previous assembly case studies.
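The shared output layout described above can be sketched as follows. The helper names are hypothetical, and the parameter counts are the real-valued counts of the first three case studies, used here purely as an example.

```python
from typing import Dict, List

import numpy as np


def build_output_slices(params_per_case: List[int]) -> Dict[int, slice]:
    """Assign each case study a contiguous block of output nodes."""
    slices, start = {}, 0
    for case_id, h in enumerate(params_per_case, start=1):
        slices[case_id] = slice(start, start + h)
        start += h
    return slices


def expand_targets(y_case: np.ndarray, case_id: int,
                   slices: Dict[int, slice], total: int,
                   nominal: float = 0.0) -> np.ndarray:
    """Place one case's targets in its block; other nodes get the nominal value."""
    full = np.full((y_case.shape[0], total), nominal, dtype=float)
    full[:, slices[case_id]] = y_case
    return full


h_per_case = [7, 17, 3]                 # example counts for cases (1) to (3)
total_nodes = sum(h_per_case)           # one output node per process parameter
slices = build_output_slices(h_per_case)
y_top_hat = np.ones((2, 17))            # two hypothetical samples for case (2)
targets = expand_targets(y_top_hat, 2, slices, total_nodes)
```

Only the block belonging to the active case carries targets; all other nodes are pinned to the nominal value of zero.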

D. TRANSFER LEARNING
Transfer learning is an effective method for transferring learning (process parameter estimation for MASs) from one task to related tasks (between different assembly case studies) using substantially fewer training samples and hence acts as a key enabler for scalability. The paper leverages transfer learning and continual learning as a combined algorithm portfolio enabling scalability. The choice between them can be made based on training results and deployment performance. Transfer learning is mathematically formalized [13], [14] in terms of a domain D, which consists of features X and a distribution over the feature space P(X). In this case the domain entails process parameter estimation f (.) on a particular assembly case T i with shape error features X and distribution over the feature space P(X). Within domain D, a task T is performed that constitutes learning a conditional distribution P(Y |X) to estimate process parameters Y. The paper aims to 'transfer learn' from the source domain (D s ) corresponding to a particular assembly case to a target domain (D T ), i.e. a similar assembly case, to perform the same task of estimating f (.) (2) while accounting for differences between cases, such that at least one element of the source and target differs. Considering the prior knowledge on the similarity of assembly case studies, it can be estimated that: (i) X s ≈ X T , given that similar features [58], including bends, twists, rotations and translations, need to be extracted from the object shape error data; (ii) P(X s ) ≈ P(X T ), as the distribution over the input features is similarly approximately the same; (iii) Y s ≠ Y T , as the outputs for each study are different given that a different number of process parameters is involved; and, (iv) P(Y s |X s ) ≠ P(Y T |X T ), as the conditional distribution is also significantly changed given the change in the assembly system and the output.
To account for (iii), the final layer of the network is replaced with nodes corresponding to the new set of process parameters; to account for (iv), the approach leverages standard protocols established in transfer learning to achieve transferability. Based on past work on successful applications of transfer learning that used ImageNet data to aid Computer-Aided Detection [15], the fine-tuning transfer learning protocol is leveraged. The network weights are initialized using the weights of a network trained on the previous assembly case study. The whole network is then fine-tuned while keeping the learning rate α c of the convolutional layers in the first two encoder levels ten times lower than the learning rate α F of the rest of the network. The regression and classification heads are replaced and reinitialized. The number of nodes in each head is determined by the number of process parameters for the case study. Overall, transfer learning (Table 3) aims to learn the process parameter estimation capabilities of each assembly case study while 'transferring' knowledge from previous case studies. In doing so, however, the model can 'forget' the estimation capabilities for previous case studies. Fig. 3 summarizes the overall framework for closed-loop sampling and training integrated with continual and transfer learning approaches.
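The fine-tuning protocol can be illustrated by a small helper that assigns a learning rate to each layer. The layer names and the prefix convention are hypothetical and chosen only for this sketch; they are not taken from the actual implementation.

```python
from typing import Dict, List, Tuple


def fine_tuning_learning_rates(
    layer_names: List[str],
    alpha_f: float,
    slow_prefixes: Tuple[str, ...] = ("encoder_level1", "encoder_level2"),
) -> Dict[str, float]:
    """Give the first two encoder levels a learning rate ten times lower."""
    alpha_c = alpha_f / 10.0
    return {
        name: (alpha_c if name.startswith(slow_prefixes) else alpha_f)
        for name in layer_names
    }


def layers_to_reinitialize(layer_names: List[str]) -> List[str]:
    """Output heads are replaced for the new case study, so they start fresh."""
    return [name for name in layer_names if name.endswith("_head")]


# Hypothetical layer names for illustration.
names = [
    "encoder_level1_conv1", "encoder_level2_conv1", "encoder_level3_conv1",
    "decoder_level1_conv1", "regression_head", "classification_head",
]
rates = fine_tuning_learning_rates(names, alpha_f=0.001)
fresh = layers_to_reinitialize(names)
```

Early encoder layers tend to hold generic shape error features, so slowing them down preserves transferred knowledge while later layers and the new heads adapt to the target case.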

IV. METHODS FOR INTERPRETABILITY
A. 3D U-NET ARCHITECTURE INTERPRETABILITY
The implementation of deep learning models within industrial environments requires opening the black box and providing interpretability and causation on why deep learning models can outperform the linear or piece-wise linear approaches traditionally used for RCA of MASs. The paper proposes to do that on two levels:
(1) Providing a link between MAS requirements and functional elements of the Bayesian 3D U-Net architecture of the OSER-MAS approach. This also aims to provide a link between the engineering challenges faced in RCA of MASs and the developments done within the OSER-MAS model to overcome these challenges.
(2) Leveraging 3D Grad-CAMs to interpret the features that are extracted by the architecture and then propagated through various encoder and decoder layers to be interpreted as root causes. To integrate high measures of confidence within the deep learning model estimates, it must be established that the input context x N s ,4 (object shape error) on which the model focuses is directly related to the estimated output y (process parameter or root cause), e.g. if the model estimates 'part variation' as a root cause, it should focus on the 'part variation' rather than other possible root causes such as 'clamping' or 'positioning'. Clearly extracted semantics in convolutional layers integrate a much-needed measure of 'trust' within the root cause estimates.

B. 3D GRADIENT-WEIGHTED CLASS ACTIVATION MAP
3D Gradient-weighted class activation maps (3D Grad-CAMs) aim to visualize the input features that led to a particular output. In the context of MASs, this aims to localize key regions within the input shape error x N s ,4 that led to the estimation of a process parameter y. This is estimated by taking a discriminative gradient of a particular process parameter output y m with respect to the feature maps of a selected convolutional layer within the 3D U-Net architecture. The map for a particular output process parameter y m is denoted L y m grad−CAM and is calculated as a weighted sum of the feature maps, as given in (17), where A f represents the f-th feature map (f = 1, 2, . . . , F) of the selected convolutional layer and a, b, c represent the dimensions of the 3D feature maps. ReLU denotes the rectified linear unit activation function. The weights are calculated by summing the gradients over all elements of each feature map. The L y m grad−CAM is interpolated to match the dimensions of the input voxelized object shape error. The overlay between the voxelized object shape error and the interpolated 3D Grad-CAM provides interpretability information on which features or spatial regions within the shape error the model focused on to estimate the selected process parameter y m of interest. The interpolated 3D Grad-CAM is then transformed to a point-based shape error, and smoothing is done using a median filter for consistency across the mesh. These can be visualized to obtain the regions within the shape error input that the neural network model focuses on to estimate the process parameter. Fig. 3 summarizes the algorithm portfolio including the integration of closed-loop training with the continual/transfer learning-based scalability model and the 3D Grad-CAMs based interpretability model. The effective integration of these models enhances the diagnostic capabilities of the OSER-MAS model and enables scalable and interpretable RCA.
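The computation can be sketched in NumPy, given feature maps and the gradients of y m with respect to them (obtained from an automatic differentiation framework, which is not shown). The sketch follows the standard Grad-CAM formulation, which averages the gradients per feature map; this differs from a plain sum only by a constant factor. Nearest-neighbour repetition stands in for the interpolation step as a simplifying assumption, and the function names are hypothetical.

```python
import numpy as np


def grad_cam_3d(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """3D Grad-CAM from arrays of shape (a, b, c, F).

    feature_maps: activations A_f of the selected convolutional layer.
    gradients:    d y_m / d A_f, same shape as feature_maps.
    """
    weights = gradients.mean(axis=(0, 1, 2))                    # one weight per map
    cam = np.tensordot(feature_maps, weights, axes=([3], [0]))  # weighted sum over f
    return np.maximum(cam, 0.0)                                 # ReLU


def upsample_nearest(cam: np.ndarray, factor: int) -> np.ndarray:
    """Enlarge a 3D map by an integer factor along each axis."""
    for axis in range(3):
        cam = np.repeat(cam, factor, axis=axis)
    return cam


# Toy example: two constant 2x2x2 feature maps with constant gradients.
features = np.stack([np.ones((2, 2, 2)), 2.0 * np.ones((2, 2, 2))], axis=-1)
grads = np.stack([np.ones((2, 2, 2)), 0.5 * np.ones((2, 2, 2))], axis=-1)
cam = grad_cam_3d(features, grads)
cam_full = upsample_nearest(cam, 32)   # match a 64x64x64 voxel input
```

The ReLU keeps only regions that push the selected process parameter estimate upward, so a map built from purely negative gradients is zero everywhere.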

V. CASE STUDIES
A. EXPERIMENTAL SETUP
Verification and validation are done using T n = 6 tasks or assembly systems (Fig. 4, Table 4) with varying complexities ranging from a single-part coupon-level assembly to automotive industrial multi-station assemblies. Each assembly case is considered as a unique task. Continual or transfer learning is done sequentially for all case studies in the order listed below. The case studies include:
(1) Flat Plate (Coupon) Assembly: consists of n = 1 ideal compliant part with a flat 2D geometry. It involves N s = 1 station and four stages (PCFR) and is controlled by h = 7 real-valued y r fixturing and joining based process parameters.
(2) Top-Hat Assembly: consists of n = 2 ideal compliant parts with a simple 3D geometry. It involves N s = 1 station and four stages (PCFR) and is controlled by h = 17 real-valued y r fixturing and joining based process parameters.
(3) Door Halo Reinforcement Panel Assembly: consists of n = 1 ideal compliant part with complex 3D geometry. It involves N s = 1 station and 2 stages (PC) and is controlled by h = 3 real-valued y r fixturing based process parameters.
(4) Door Inner and Hinge Reinforcement Assembly: consists of n = 2 ideal compliant parts with a complex 3D geometry. It involves N s = 1 station and four stages (PCFR) and is controlled by h = 6 real-valued y r fixturing based process parameters.
(5) and (6) Cross Member Assembly: consists of n = 4 non-ideal compliant parts with complex 3D geometry. It involves N s = 3 stations, each with four stages (PCFR). Two sub-cases are considered: case (5), consisting of h = 12 real-valued y r part variation and fixturing based process parameters; and, case (6), consisting of a heterogeneous (real-valued and binary) set of h = 148 process parameters, including 123 real-valued y r part variation, fixturing and joining based process parameters and 25 binary process parameters y c indicating the success of joining (Fig. 5).
For comparison and benchmarking of scalability, all cases are analyzed under T s = 4 training scenarios:
(i) Random Sampling: Involves randomly sampling from the CAE simulator within the allowable ranges for each process parameter. Each case study is trained on a re-initialized network with random weights. This also serves as the baseline for performance expectations.
(ii) Closed-loop Sampling: Involves training the five case studies using Algorithm 1 as shown in Table 1.
(iii) Transfer Learning with Closed-loop sampling: Involves training the five case studies sequentially using Algorithm 3 as shown in Table 3.
(iv) Continual Learning with Closed-loop Sampling: Involves training the five case studies sequentially using Algorithm 2 as shown in Table 2.
Interpretability is verified by considering the cross member assembly (5) to provide links between MASs requirements and architecture functional elements and to obtain 3D Grad-CAMs for key process parameter variations. While obtaining 3D Grad-CAMs the weights and biases of the network are fixed at the mean values (ω = µ ω ).
Before training, all shape errors are pre-processed and voxelized to (u, v, w, d) = (64, 64, 64, 3) voxel grids V 64,64,64,3 . The deviation features d include deviations in all directions for all points (x k ,ỹ k ,z k ). The shape error after the final station (x N s ,4 ) is used as input, while the process parameters y and the upstream station shape errors x 1,4 , . . . ,x N s −1,4 are used as output. The model architecture hyperparameters were selected as proposed in the OSER-MAS approach. Training hyperparameters were optimized for all scenarios. The Adam optimizer [59] is used for training in scenarios (i), (ii) and (iii). Initial learning rates α 0 = [0.1, 0.01, 0.001, 0.0001] were compared for scenarios (i) and (ii). The values α 0 = [0.1, 0.01] gave inferior performance compared to α 0 = [0.001, 0.0001], and α 0 = 0.001 was finally selected because it gives the faster convergence of the latter two values. The same combinations were tested for scenario (iii) under the constraint that α C = α F /10, given the fine-tuning protocol, to ensure that later layers learn at a faster rate than initial layers; α C = 0.0001 and α F = 0.001 gave the best performance. For scenario (iv), initial learning rates for stochastic gradient descent (SGD) were tested, and α 0 = 0.001 gave the best performance. The learning rates were multiplied in each case by the weight uncertainty as described in Table 2. Minibatch sizes of 8, 16 and 32 were tested. Larger batch sizes could not be used given the high GPU memory requirements of 3D CNNs. A minibatch size of 32 gave the best performance, while smaller sizes such as 8 and 16 caused the training process to be unstable. The model is trained for 300 epochs. Group normalization [60] with four groups is used after each convolutional layer. This helps prevent overfitting, accounts for the small minibatch size imposed by GPU memory constraints, and aids in stabilizing the training process.
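One possible form of the voxelization step is sketched below. It is an illustration under stated assumptions, not the preprocessing used in the paper: points are mapped to voxels by min-max normalization of their nominal coordinates, and deviations of points that fall into the same voxel are averaged. The function name is hypothetical.

```python
import numpy as np


def voxelize_deviations(points: np.ndarray, deviations: np.ndarray,
                        resolution: int = 64) -> np.ndarray:
    """Map per-point deviations onto a (res, res, res, 3) voxel grid.

    points:     (K, 3) nominal point coordinates.
    deviations: (K, 3) deviation of each point in x, y and z.
    """
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins
    spans[spans == 0] = 1.0   # flat geometry: avoid division by zero
    idx = ((points - mins) / spans * (resolution - 1)).round().astype(int)
    i, j, k = idx[:, 0], idx[:, 1], idx[:, 2]

    grid = np.zeros((resolution,) * 3 + (3,))
    counts = np.zeros((resolution,) * 3)
    np.add.at(grid, (i, j, k), deviations)   # accumulate deviations per voxel
    np.add.at(counts, (i, j, k), 1)          # number of points per voxel
    occupied = counts > 0
    grid[occupied] /= counts[occupied][:, None]   # average within each voxel
    return grid


# Toy example: two of the three points share a voxel.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
dev = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 4.0, 0.0]])
voxels = voxelize_deviations(pts, dev, resolution=4)
```

`np.add.at` is used instead of plain indexed addition so that several points landing in the same voxel are all accumulated.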
Optimization of the training hyperparameters was done for case (1) (Flat Plate Assembly) to ensure computational feasibility. Given that case studies (1) to (4) consist of a single station, the decoder of the 3D U-Net model is not used, as the upstream shape error for previous stations does not need to be estimated. Case study (5) leverages the decoder to estimate the shape error after upstream stations. The work has been implemented using Python 3.7, TensorFlow-GPU 2.1 [61] and TensorFlow Probability 0.9. A Python library named DLMFG [62] has been developed to validate and replicate the results of the methodology. For this paper, both the data generation and the evaluation of the approaches have been done using VRM. Two Nvidia Tesla V100 32 GB GPUs are used for model training and deployment.

B. DISCUSSION: SCALABILITY
The results for training assembly cases sequentially in the aforementioned scenarios are summarized in Fig. 6. Training in all scenarios is done until convergence. The model is considered converged when the performance metrics on the test set E = {X E , Y E } are better than the threshold. R-Squared (R 2 ) ≥ 0.90 is used as the convergence criterion for y r , and Receiver Operating Characteristic Area Under the Curve (ROC-AUC) ≥ 0.90 is used as the convergence criterion for y c . The initial training set size T 0 is set to 500 samples, and T i = 100 samples are added in each closed-loop iteration based on the estimated GMM parameters. Based on empirical tests, the pre-specified number of mixtures in the GMM is fixed at K = 5.
The results show that for low complexity cases such as Flat Plate (1) and Top-Hat (2), using closed-loop sampling with continual or transfer learning gives only minor reductions in the number of training samples required for convergence. As the complexity of the assembly cases increases, the effect of the pre-trained weights of continual and transfer learning becomes significant, and a reduction of up to 50% in the number of required training samples can be seen for case (6). This validates the need for scalable approaches for training high-dimensional assembly cases while leveraging models pretrained on low-dimensional assembly cases.
The performance of continual learning over all T n tasks is measured by comparing R 2 on the T i th task when learning is performed only up to the T i th task (R 2 T i ,T i ) with R 2 on the same task when learning is performed up to the T n th task (R 2 T i ,T n ). Catastrophic forgetting for each task (CF T i ) is quantified as the difference between the performance in these two situations, CF T i = R 2 T i ,T n − R 2 T i ,T i . A negative value of CF T i means catastrophic forgetting of previous cases, while positive values mean that learning new tasks has improved the performance on previous tasks. Convergence is measured by the number of samples (S T s ) required for training in the given training scenario T s . The improvement in convergence for a training scenario T s is measured as the percentage difference in training samples required for convergence compared to the baseline training scenario, which in this case is (i) random sampling. Table 5 also summarizes the improvement in convergence for the CLIP framework, i.e. the improvement in convergence when using training scenario (iv) Continual Learning with Closed-Loop Sampling. Fig. 7 compares the loss in performance (CF T i ) with the improvement in convergence for scenario (iv). Overall, across the six cases, the proposed CLIP framework provides a 56% improvement in scalability, as quantified by the improvement in convergence, with a loss in performance of only 2.1%, as quantified by catastrophic forgetting.
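Both measures reduce to simple arithmetic, sketched below. The sign convention follows the text (negative values indicate forgetting), the improvement is expressed relative to the baseline sample count, and the numbers in the usage lines are made up for illustration and are not taken from Table 5.

```python
def catastrophic_forgetting(r2_after_all_tasks: float,
                            r2_after_own_task: float) -> float:
    """CF for one task: R2 after learning all tasks minus R2 right after its own."""
    return r2_after_all_tasks - r2_after_own_task


def convergence_improvement(samples_baseline: int, samples_scenario: int) -> float:
    """Percentage reduction in training samples relative to the baseline."""
    return 100.0 * (samples_baseline - samples_scenario) / samples_baseline


# Hypothetical values for illustration only.
cf = catastrophic_forgetting(r2_after_all_tasks=0.93, r2_after_own_task=0.95)
gain = convergence_improvement(samples_baseline=2000, samples_scenario=1000)
```

In this made-up example the task loses about 0.02 in R 2 after later tasks are learned, while the scenario needs half the baseline number of samples.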

C. DISCUSSION: INTERPRETABILITY
The interpretability of the Bayesian 3D U-Net is addressed on two levels:
(1) By linking requirements of the MASs with functional elements of the Bayesian 3D U-Net architecture. These include:
(1a) Object Shape Error Voxelization: Shape error voxelization provides an intermediate 3D data structure linking the mesh obtained from CAE simulation and the point clouds obtained from 3D optical scanners. Voxelization ensures that both these data structures are converted to voxels and are hence compatible with 3D convolution operations, fulfilling Requirement (i) high data dimensionality and Requirement (vi) dual data generation capabilities. The voxels are multi-channel, with each channel corresponding to one component of shape error. The resolution of voxels depends on the required performance. Fig. 8 shows voxelization for one component of shape error at different resolutions. Low-resolution voxels capture global shape error patterns; as the resolution is increased, local shape error patterns are effectively captured. This also increases the discriminative capability required to differentiate between collinear shape error patterns, although at a higher computational cost. Empirical studies have shown that there is no significant increase in performance above (64×64×64) for RCA of MAS. Case (6) Cross Member Assembly (h = 148) is used for a sensitivity study, in which the object shape error reconstruction error and the model performance are compared against voxel granularity. The reconstruction error is less than 1% given (64 × 64 × 64) or more granular voxels. The model performance does not increase over R 2 = 0.96 even with voxels as granular as (96 × 96 × 96). Additionally, with voxel resolutions above (64×64×64) the minibatch size used during training has to be reduced to 16 given GPU memory constraints, which results in unstable model training and hence negatively impacts performance. Table 6 summarises the results of the sensitivity study.
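The memory pressure behind the reduced minibatch size can be made concrete with a rough calculation. The sketch counts only the input tensors and assumes 32-bit floats; intermediate activations of the network, which dominate in practice, grow by the same cubic factor. The function name is hypothetical.

```python
def voxel_batch_megabytes(resolution: int, channels: int = 3,
                          batch_size: int = 32, bytes_per_value: int = 4) -> float:
    """Size in MiB of one minibatch of voxelized inputs (assumes float32)."""
    values = batch_size * channels * resolution ** 3
    return values * bytes_per_value / 2 ** 20


mb_64 = voxel_batch_megabytes(64)   # 64 x 64 x 64 grid, batch of 32
mb_96 = voxel_batch_megabytes(96)   # 96 x 96 x 96 grid, batch of 32
growth = mb_96 / mb_64              # cubic growth factor (96 / 64) ** 3
```

Moving from 64 to 96 voxels per axis multiplies the footprint of every voxel-sized tensor by 3.375, which is consistent with the need to lower the minibatch size at higher resolutions.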
(1b) Encoder with Down Sampling Kernels: As described earlier, the Bayesian 3D U-Net architecture consists of four levels of the encoder and decoder models (Fig. 2). Each level of the encoder consists of a down-sampling kernel (see Down-sampling kernel in Fig. 2). The kernel consists of 3D max pooling, whose output is duplicated to a residual connection and an encoding connection. The residual connection consists of a 3D convolution with a filter size of one and a stride length of one in all three dimensions. The encoding connection consists of two 3D convolutions of filter size three and stride length one with ReLU activation in between. Then, the residual connection and the encoding connection are merged using element-wise addition. Finally, ReLU is applied before duplicating the output into the decoder input and the next level encoder input. Overall, the down-sampling kernels in the encoder, with consecutive 3D convolutions and pooling, are essential for spatial correlation filtering, feature extraction, and non-linear transformations. Consecutive levels of the encoder extract more discriminating features from the high-resolution voxelized shape error input. The discriminative ability of the features increases at each consecutive encoder level, thus enabling accurate estimation of process parameters and hence high root cause isolability. Each level of the encoder is also linked to the corresponding decoder level; this enables the transfer of features related to the part geometry and hence enables accurate estimation of upstream part shape error at the end of the decoder. Fig. 9 highlights the feature extraction capabilities of different levels of the encoder: while lower levels focus on the entire part, higher levels focus on the regions that contain the shape error. This enables fulfilment of requirements: (ii) non-linearity; (iii) collinearities; and, (iv) high faults multiplicity. The 3D Grad-CAMs for encoder levels provide interpretability by visualizing the extracted shape error features.
The transparency provided in the extracted shape error features enables interpretability into why a particular root cause was isolated.
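The data flow through one down-sampling kernel can be traced with a deterministic NumPy sketch. It assumes 2×2×2 pooling and zero padding, and it omits biases, group normalization and the Bayesian weight distributions, so it illustrates the wiring only and is not the actual implementation. All function names are hypothetical.

```python
import numpy as np


def max_pool3d(x: np.ndarray) -> np.ndarray:
    """2x2x2 max pooling on an array of shape (D, H, W, C) with even D, H, W."""
    d, h, w, c = x.shape
    return x.reshape(d // 2, 2, h // 2, 2, w // 2, 2, c).max(axis=(1, 3, 5))


def conv3d_same(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stride-1, zero-padded 3D convolution; kernel shape (k, k, k, C_in, C_out)."""
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(x, ((p, p), (p, p), (p, p), (0, 0)))
    d, h, w, _ = x.shape
    out = np.zeros((d, h, w, kernel.shape[-1]))
    for i in range(k):
        for j in range(k):
            for m in range(k):
                out += padded[i:i + d, j:j + h, m:m + w, :] @ kernel[i, j, m]
    return out


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def down_sampling_kernel(x, k_res, k_enc1, k_enc2):
    """Pool, then merge a 1x1x1 residual path with a two-convolution path."""
    pooled = max_pool3d(x)
    residual = conv3d_same(pooled, k_res)                             # filter size 1
    encoded = conv3d_same(relu(conv3d_same(pooled, k_enc1)), k_enc2)  # filter size 3, twice
    return relu(residual + encoded)                                   # element-wise add, ReLU


# Toy check: with zero encoding kernels only the residual path contributes.
x = np.ones((8, 8, 8, 3))
out = down_sampling_kernel(
    x,
    k_res=np.ones((1, 1, 1, 3, 4)),
    k_enc1=np.zeros((3, 3, 3, 3, 4)),
    k_enc2=np.zeros((3, 3, 3, 4, 4)),
)
```

The toy check also shows why the residual path helps: even when the encoding path contributes nothing, information still flows to the next level.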
(1c) Decoder with Up-sampling Kernels: Each level of the decoder consists of an up-sampling kernel (see Up-sampling kernel in Fig. 2) and provides real-valued segmentation maps that estimate object shape error, i.e., the three components of deviation for each subassembly at the end of all upstream stations. Each level of the decoder has two input sources: the encoder input from the corresponding encoder level and the decoder input from the previous decoder level. The following operations are then performed: (i) up-sampling of the decoder input, which is duplicated and sent to the attention gate and the feature concatenation layer; (ii) the attention gate [63] distils information from the encoder and then generates relevant features that are concatenated with the up-sampling output from (i); and, (iii) this concatenated feature set is duplicated to the residual connection and the decoder connection, and operations similar to those in the encoder layer are performed. The number of channels of the decoder output equals the number of components of shape error multiplied by the number of upstream stations, while the granularity of the output is the same as the input voxel size. Various levels of the decoder aggregate features from the corresponding encoder and previous decoder. This integrates part geometry features (as provided by the encoder) with the shape error features (as provided by the previous decoder), enabling accurate estimation of upstream part shape error. Different levels of the decoder reconstruct shape error within different regions of the part. This enables fulfilment of requirement (viii) Fault Localization. Fig. 10(a) highlights the up-sampling capabilities of different levels of the decoder that enable estimation of object shape error of upstream assemblies. Level 2 reconstructs features of the pocket reinforcement subassembly while levels 3 and 4 reconstruct features of the cross-member reinforcement assembly.
The 3D Grad-CAMs for decoder levels provide interpretability into the stations within which the fault is localized. Fig. 10(b) compares the actual upstream assembly shape errors as estimated using CAE simulation with those estimated by the decoder output.
(1d) Multiple Output Heads: The model consists of two output heads: one head estimates real-valued process parameters y r , as done in a regression setting, while the second head estimates categorical/binary process parameters y c , as done in a multi-label classification setting. This is essential for MASs as they have a large number of process parameters inclusive of (a) real-valued parameters for non-ideal parts and fixturing/tooling (within the Positioning (P) and Clamping (C) stages of the PCFR assembly cycle); and, (b) binary parameters for joining operations (within the Fastening (F) stage of the PCFR assembly cycle). The number of regression output nodes is equal to the number of real-valued parameters, and the number of classification output nodes is equal to the number of binary process parameters. This enables fulfilment of requirement (vii) High Dimensionality and heterogeneity of process parameters.
(1e) Attention Gate: The soft-attention mechanism proposed by Oktay et al. [63] is used between corresponding levels of the encoder and decoder. The attention approach allows the model to be specific to local regions. In the context of shape error estimation of assemblies, this helps the model focus on particular parts/subassemblies in each station. Adding the attention gate increases the accuracy of upstream station shape error estimation, as the model learns where to look within the final assembly to estimate the upstream sub-assemblies [x 1,4 , . . . ,x N s −1,4 ]. The decoder inclusive of attention gates improves performance for requirement (viii) Fault Localization. Fig. 11 shows 3D Grad-CAMs for areas of focus at different functions within the up-sampling kernel.
Attention 3D Grad-CAMs enable interpretability by providing insights into the regions the decoder focuses on when estimating upstream shape errors.
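An additive soft-attention gate of this kind can be sketched in NumPy. The sketch assumes that the encoder features x and the gating features g from the decoder have already been brought to the same spatial resolution, and it writes the 1×1×1 convolutions as per-voxel matrix products. Parameter names and shapes are chosen for illustration only.

```python
import numpy as np


def attention_gate(x, g, w_x, w_g, psi, b=0.0, b_psi=0.0):
    """Additive soft attention over encoder features.

    x:   (D, H, W, Cx) encoder features to be gated.
    g:   (D, H, W, Cg) gating features from the decoder (same resolution assumed).
    w_x: (Cx, Ci), w_g: (Cg, Ci), psi: (Ci, 1) act as 1x1x1 convolutions.
    Returns the gated features and the attention coefficients in (0, 1).
    """
    q = np.maximum(x @ w_x + g @ w_g + b, 0.0)          # ReLU of the joint projection
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi + b_psi)))    # sigmoid, shape (D, H, W, 1)
    return x * alpha, alpha                             # voxel-wise rescaling of x


# Toy check: with all-zero weights every voxel receives a coefficient of 0.5.
x = np.ones((2, 2, 2, 3))
g = np.ones((2, 2, 2, 4))
gated, alpha = attention_gate(
    x, g, w_x=np.zeros((3, 5)), w_g=np.zeros((4, 5)), psi=np.zeros((5, 1))
)
```

Because the coefficients multiply the encoder features voxel by voxel, regions with coefficients near zero are suppressed before concatenation, which is what lets the decoder concentrate on particular subassemblies.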
(1f) Bayesian Flipout Layers: Given the uncertainties in the system and the availability of only a limited dataset, a deterministic estimate of function f (.) as shown in (2) is not feasible. The Flipout [64] layers leveraged in the encoder enable uncertainty quantification. These estimates of uncertainty integrate measures of confidence within isolated RC(s) and hence drive costly corrective actions [6]. This is realized by using Bayes-by-Backprop [65] which integrates backpropagation with variational inference [66] to estimate a posterior distribution q θ (ω) which is parameterized by θ over the neural network weights based on the pre-specified prior p (ω). This enables fulfillment of requirement (v) uncertainty quantification. The uncertainties are key elements of interpretability insights as they integrate a measure of confidence within the root cause estimates.
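The core ingredients of Bayes-by-Backprop can be sketched in NumPy for a single linear layer: reparameterized weight sampling, the closed-form KL term between a Gaussian posterior and a Gaussian prior, and Monte Carlo prediction. The softplus parametrization of σ and the standard normal prior are assumptions made for this illustration, and the variance-reduction trick specific to Flipout is not implemented.

```python
import numpy as np


def softplus(rho: np.ndarray) -> np.ndarray:
    """Map an unconstrained parameter to a positive standard deviation."""
    return np.log1p(np.exp(rho))


def sample_weights(mu, rho, rng):
    """Reparameterization: w = mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + softplus(rho) * rng.standard_normal(mu.shape)


def kl_gaussian(mu, sigma, prior_mu=0.0, prior_sigma=1.0):
    """KL(N(mu, sigma^2) || N(prior_mu, prior_sigma^2)), summed over weights."""
    return np.sum(
        np.log(prior_sigma / sigma)
        + (sigma ** 2 + (mu - prior_mu) ** 2) / (2.0 * prior_sigma ** 2)
        - 0.5
    )


def predictive_mean_std(x, mu, rho, rng, n_samples=100):
    """Monte Carlo estimate of the predictive mean and standard deviation."""
    preds = np.stack([x @ sample_weights(mu, rho, rng) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)


rng = np.random.default_rng(0)
x = np.ones((1, 3))
mu = np.array([[0.5], [1.0], [-0.5]])
# Nearly deterministic weights versus clearly uncertain weights.
mean_sure, std_sure = predictive_mean_std(x, mu, np.full((3, 1), -20.0), rng)
mean_unsure, std_unsure = predictive_mean_std(x, mu, np.zeros((3, 1)), rng)
```

Wider weight posteriors translate directly into a wider spread of predictions, which is the quantity reported as the confidence attached to an estimated root cause.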
(1g) Residual Connections: Given the deep architecture of the model, vanishing gradients can be a major issue; hence residual [67] or skip connections are added within each down-sampling and up-sampling kernel to ensure effective propagation of gradients by providing a skip route. Fig. 12 highlights the 3D Grad-CAMs for various stages of the residual connection. As seen (highlighted by the red rectangle) in level three, the layer before the residual connection has negligible activations due to skipping of the layer, while the gradients become significant after the addition of the residual. The residual connections improve performance for requirements: (ii) non-linearity, (iii) collinearities, and (iv) high faults multiplicity.
(2) Using 3D Grad-CAMs to interpret the working of the architecture for different process parameter variations or root cause(s). The above 3D Grad-CAMs provide a global level of interpretability by linking functional elements of the architecture with requirements of the MAS. The next, local level of interpretability aims to provide transparency into the 3D Grad-CAMs for various levels of the encoder for key root cause scenarios. This links the shape error features extracted by each level of the encoder to the estimation of that particular root cause. Fundamentally, to establish that the architecture is isolating a root cause correctly, the features extracted by various levels of the encoder should correspond to the shape error patterns caused by that root cause. To validate this, the cross member assembly (case (5)) is considered and the working of the architecture is analyzed for five key root cause scenarios:
(2a) Part Variation Root Cause: This is caused by variation in upstream fabrication processes and is estimated as a variation in y m = y 1 = 2 mm. Fig. 13 represents the output of the assembly given that the incoming part (cross-member) n = 3 has part variation. The region marked in red depicts a bend in the part that is unique to a part variation root cause [58]. The 3D Grad-CAMs shown in Fig. 13 highlight that the first encoder level focuses on the entire part, the second encoder level identifies the edges near the bend, and the final encoder levels (three and four) identify the region where the bend has occurred and hence accurately estimate y 1 as a part variation root cause.
(2b) Positioning Root Cause: This is caused by tooling installation and calibration error, or by tooling deterioration due to gradual wearing out of fixture locators, and is estimated as a variation in y m = y 5 = 1 mm. Such errors affect the part placement, including orientation/reorientation and stability. The 3D Grad-CAMs shown in Fig. 14 highlight that the encoder focuses on the entire part that has an error in orientation and estimates y 5 as the magnitude of the error.
(2c) Clamping Root Cause: This is caused by misalignment of the clamp in the y-direction and is estimated as y m = y 11 = 2 mm. Such misalignments cause bending of compliant parts. The 3D Grad-CAMs shown in Fig. 15 highlight that the encoder can focus on the local bend pattern at the location of the clamp and estimate y 11 as the magnitude of the clamp misalignment in the y-direction.
(2d) Joining Root Cause: This is caused by misalignment of the joining tool (SPR) in the y-direction and is estimated as y m = y 12 = 2 mm. Such misalignments lead to a defective joint between the two assemblies. The 3D Grad-CAMs shown in Fig. 16 highlight that the first level of the encoder focuses on the region of the defective joint, while the later levels focus on the subassembly affected by the defective joint, and hence the model estimates y 12 as the magnitude of the joining tool misalignment in the y-direction.
(2e) Part Variation and Clamping Root Causes: This is caused when there is an upstream part variation (y m = y 1 = 2 mm) together with a misalignment of the clamp in the y-direction (y m = y 11 = 2 mm). These lead to multiple simultaneous bends across the assembly. The 3D Grad-CAMs shown in Fig. 17 highlight that various levels of the decoder focus on all affected regions within the sub-assembly to simultaneously estimate multiple root causes. This capability is crucial in ensuring that deep learning models have high RCA capabilities even in scenarios where all process parameters have variation and are potentially at fault (100% fault multiplicity). Such cases of high fault multiplicity cause various shape errors that are collinear (highly similar). The ability of the architecture to simultaneously focus on multiple areas within the multi-channel voxelized input, to localize various bends, twists and other shape error patterns that are potentially overlapping (interacting), and then to relate them to the process parameter(s) causing them makes it a well-suited approach for RCA of high-dimensional MASs with high fault multiplicity using granular 3D data structures such as mesh (CAE) or point clouds from 3D scanners.

VI. CONCLUSION & FUTURE WORK
The paper proposed a novel closed-loop in-process (CLIP) diagnostic framework underpinned by an algorithm portfolio to address the current limitations of scalability and interpretability. Scalability is enabled by leveraging closed-loop training integrated with uncertainty guided continual learning or transfer learning. The approach enables effective transfer of knowledge through invariant features between MASs, thereby achieving quicker convergence with 56% fewer training samples. The overall loss in performance was limited to only 2.1%, as quantified by average catastrophic forgetting (Table 5). Interpretability is enabled by leveraging 3D Grad-CAMs that provide insights into the functioning of key elements within the architecture and also relate features extracted by the architecture to shape error features within MAS. The visual interpretability explanations and uncertainty estimates integrate confidence, hence enabling trust in black-box deep learning models.
Scalability and interpretability are key challenges that must be solved to enable widespread adoption of deep learning methodologies in industrial environments. A key industrial application is RCA of assembly processes for discrete components made of sheet metal parts, as used in the automotive, aerospace and consumer products industries. These applications can directly leverage the CLIP diagnostic framework with the OSER approaches to enable scalable and interpretable root cause analysis, which will be especially beneficial for processes with a larger number of parts and/or assembly stations. The framework can also be used to transfer learning to other types of manufacturing processes, such as stamping, machining and additive manufacturing, provided they can be expressed using the proposed formulation of object shape error estimation for root cause analysis. This would extend transfer and continual learning to other manufacturing processes that can be linked to assembly processes. Interpretability has been a major barrier preventing the adoption and deployment of deep learning models in industry. The interpretability elements proposed in this work aim to remove this barrier by adding context and confidence to the estimates, hence enabling wider adoption. Leveraging such automated and interpretable RCA models provides a transformative framework by ensuring early estimation and elimination of process variations before they become defects, thereby helping to achieve Zero-Defect-Manufacturing and Right-First-Time.
Future work involves addressing the current limitations of the approach, such as detecting dynamic changes in manufacturing systems and quantifying them as concept drifts or covariate shifts. When such changes are detected, the model would be fine-tuned to account for them. This would enable lifelong learning in dynamic manufacturing environments. Further work also involves quantitative modeling of the invariant features shared between different MASs. These invariant features can be linked to first-principle models of the MASs and hence further enhance scalability and interpretability.