Comparative study of classification algorithms for quality assessment of resistance spot welding joints from pre- and post-welding inputs

Resistance spot welding (RSW) is a widespread manufacturing process in the automotive industry. There are different approaches for assessing the quality level of RSW joints. Multi-input-single-output methods, which take as inputs either the intrinsic parameters of the welding process or ultrasonic nondestructive testing variables, are commonly used. This work demonstrates that the combined use of both types of inputs can significantly improve the already competitive approach based exclusively on ultrasonic analyses. The use of stacking of tree ensemble models as classifiers dominates the classification results in terms of accuracy, F-measure and area under the receiver operating characteristic curve metrics. Through variable importance analyses, the results show that although the welding process parameters are less relevant than the ultrasonic testing variables, some of the former provide marginal information not fully captured by the latter.


I. INTRODUCTION
Manufacturers in the automotive industry face an increasingly competitive environment [1]. Resistance spot welding (RSW) is a critical manufacturing technology in this sector [2], [3], since its high speed and adaptability for automation render it suitable for mass production [4]. The number of RSW joints per vehicle is very high (around 5000 according to Xia et al. [5]), and there can be significant variability in the quality of each of them because RSW is a complex process [6], [7]. More precisely, the heat generated by the electrical current has to be substantial enough to promote local melting and the formation of a weld nugget at the faying interface [8], [9], while at the same time the amount of heat is influenced by the electrical conductivity of the materials to be joined and by their surface condition, as well as by the thermal conductivity of the electrodes, which are water-cooled and act as heat sinks [9]. RSW holds a promising optimization potential as a result of the balance that it establishes between cost and performance. In particular, the tendency in the automotive industry is to reduce the number of RSW joints per vehicle, which makes the accuracy of the tools that assist in the quality control of RSW joints [10] more critical: the fewer the RSW joints per vehicle, the stronger the requirements for each of them [6].
In the literature, two main RSW quality-control modelling approaches are found: (i) models that predict the quality of the RSW joints from the welding parameters established prior to or during the welding process [11]-[16], and (ii) models that assess the quality level based on the data of the ultrasonic oscillograms obtained after the welding process [6], [17], [18]. Notably, even though ultrasonic nondestructive testing requires that human operators have a certain degree of training, it is a promising technique for estimating the quality of RSW joints in the automotive industry [19], where it represents a cost-reduction opportunity [20], [21]. Given the competitive nature of this sector, optimization and cost reduction are two of its cornerstones. It is therefore of interest to thoroughly explore the potential and limitations of the different machine learning techniques to assess the quality level of RSW joints from pre-welding inputs (welding parameters), post-welding inputs (ultrasonic oscillograms) or both, as well as the possible synergies between both types of information.
In the present contribution, different classification algorithms are explored and compared. In particular, some of the techniques that combine several classifiers simultaneously, and that are achieving major success in a wide range of scientific fields, have been included in the analyses. The comparison is performed using different metrics. The accuracy (i.e., the classification rate) of the algorithms is analyzed, but also the F-score or F-measure (i.e., the harmonic mean of precision and recall), which combines both the positive predictive value and the sensitivity in a single measure. In addition, since in different industrial environments type I errors (false positives) and type II errors (false negatives) do not have the same impact in terms of safety and monetary costs, the analysis was completed with the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) measure for non-binary classifiers. This latter metric provides an overall comparison of the performance of each classifier over the full range of trade-offs between the two types of errors, and hence between safety and economic costs.
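As an illustration of the three metrics just described, the following minimal sketch computes them on a toy example. It is not the evaluation code of the study: the labels, the scores and the function names are hypothetical, and the multi-class problem is reduced to a one-vs-rest view in which a single quality level ("good") is treated as the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half; this rank-based value equals the area under the
    ROC curve for the chosen positive class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true quality levels and a classifier's score for "good".
y_true = ["good", "good", "good", "stick", "stick", "stick"]
good_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = ["good" if s >= 0.5 else "stick" for s in good_scores]

acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred, positive="good")
area = auc(y_true, good_scores, positive="good")
print(acc, f1, area)
```

Accuracy and F-measure depend on the 0.5 decision threshold chosen here, whereas the AUC summarizes all possible thresholds, which is why it suits settings where the relative cost of the two error types may change.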

A. MATERIALS AND EQUIPMENT
The chemical composition and mechanical properties of the steel sheets welded by RSW are shown in Table 1 and Table 2, respectively. Sheet thickness was 1 mm. The steel sheets were welded in single-phase alternating current (AC) 50 Hz equipment by means of water-cooled truncated cone electrodes with 16 mm body diameter and 5 mm face diameter.
The ultrasonic testing of the RSW joints employed a transducer whose frequency and diameter were, respectively, 20 MHz and 4.5 mm.

B. WELDING PARAMETERS
In this study, a total of 437 joints were obtained by RSW. The welding parameters considered include: (i) the welding current (with values varying between 4 and 8 kA RMS); (ii) the welding time (with values ranging between 4 and 20 cycles); (iii) the electrode force (whose value was kept fixed at 980.7 N); (iv) the electrode material (two types [22]: Class 2 and Class 3 of RWMA Group A); and (v) the treatment applied to the electrode material (with three options [23]-[25]: O61, TH02 and TF00).

C. ULTRASONIC TESTING AND QUALITY LEVELS
Ultrasonic testing was performed in accordance with the pulse-echo method with A-scan technique and, therefore, the obtained ultrasonic oscillogram is a plot of wave amplitude versus time [26]. The location of the reflecting interface determines the echo positions, and the sound attenuation, which depends on the weld nugget microstructure, determines the height of the echoes [6].
The quality level of each RSW joint was assessed from its ultrasonic oscillogram that, in turn, depends on the weld nugget; recall that the weld nugget has melted and solidified and, thus, it has a cast microstructure with coarse and columnar grains that produces higher attenuation on the ultrasonic beam than that of the parent metal. Thereupon, the thickness of the weld nugget influences the ultrasonic oscillogram in the following manner: the greater the thickness, the higher the attenuation. Another parameter of the weld nugget, its diameter, affects the ultrasonic oscillogram as well; more precisely, given that the interface between the two steel sheets causes the reflection of the ultrasonic beam, one-layer echoes will appear between principal echoes if the weld nugget diameter is smaller than the ultrasonic beam width [27].
Considering the effect of the weld nugget on the ultrasonic beam behaviour, four quality levels were established for classifying RSW joints [6], [28]:
• Good weld. The thickness of the weld nugget is large and its diameter is greater than the ultrasonic beam width. Therefore, the span of the sequence of echoes is short due to the high attenuation, and the distance between consecutive echoes is the sum of the thickness of each of the two steel sheets.
• Undersize weld. The diameter of the weld nugget is smaller than the ultrasonic beam width. Thus, the portion of the ultrasonic beam that does not pass through the weld nugget (and whose reflection takes place at the interface between the two steel sheets) causes one-layer echoes between the principal echoes.
• Stick weld. The weld nugget has an adequate diameter but a small thickness. Hence, the distance between consecutive echoes is the sum of the thickness of each of the two steel sheets, and the attenuation is lower than in a good weld, so the span of the sequence of echoes is longer.
• No weld. There is no melted and solidified metal. Therefore, the span of the sequence of echoes is longer than that of an RSW joint with a weld nugget, and the distance between echoes is the thickness of one steel sheet.

D. INPUTS FOR ASSESSING THE QUALITY LEVEL
A total of 14 inputs were considered for assessing the quality level (Fig. 1):
• Four of the five welding parameters: welding current, welding time, electrode material, and treatment applied to the electrode material (the electrode force was not considered as an input because its value was kept fixed for all RSW joints).
• The ten components of the representative vector of each ultrasonic oscillogram, obtained with a program developed by Martín [29]. This 10-component vector uses only the first six echoes of the ultrasonic oscillogram [6], [17]:

E. COMPUTATIONAL METHODS
The experimental design used to assess the influence of the individual and combined use of pre-welding and post-welding inputs as well as the effectiveness of the different classification tools and algorithms in exploiting such information is as follows. The performance of a large number of classifiers, including tree ensembles and stacking models, has been analyzed using stratified nested cross-validation (10 folds). The hyperparameters of the algorithms have been optimized through grid search in the inner loop, and their performance has been measured in the outer loop. In addition, variable importance analyses based on a random forest model were conducted to determine the relative impact of each of the inputs in the overall classification accuracy attained.
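The nested cross-validation scheme described above can be sketched as follows with scikit-learn. This is only an illustration under stated assumptions: the data are synthetic (437 instances, 14 inputs and 4 classes, mimicking the size of the dataset), the random forest grid is hypothetical, and the inner loop uses 3 folds merely to keep the example fast; none of these settings is taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data: 437 joints, 14 inputs, 4 quality levels.
X, y = make_classification(
    n_samples=437, n_features=14, n_informative=8, n_redundant=2,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Inner loop: stratified folds used only to choose hyperparameters (assumed 3 folds).
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
# Outer loop: 10 stratified folds used only to measure performance.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

# Hypothetical grid: number of predictors considered at each split.
param_grid = {"max_features": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid, cv=inner_cv, scoring="accuracy",
)

# Each outer training fold runs its own grid search, so the outer test fold
# never influences the hyperparameters of the model that is scored on it.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean(), outer_scores.std())
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: tuning is repeated from scratch inside every outer fold.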
A brief description of the main classifiers and of the different variable importance analyses conducted is provided below.

1) ADABOOST
Due to its great accuracy on a multitude of very diverse problems [30], [31], boosting is one of the most popular classification techniques. In this paper, two of the most prominent boosting algorithms are used: Adaptive Boosting (AdaBoost) [32], the most popular and well-known boosting algorithm, and the eXtreme Gradient Boosting (XGBoost) algorithm [33], which, as a consequence of its exceptional results, has received a lot of attention since its publication.
Boosting consists in obtaining a strong classifier from the sequential combination of weak base learners. AdaBoost typically uses classification trees as base classifiers (in the present work, two types of base classifiers were used: decision stumps and J48 trees) and has been employed in identifying and classifying weld defects [34]. At each iteration, the training set is reweighted so that higher weights are given to the instances that have been previously misclassified, while the previous classifiers remain unchanged. This distribution of weights is denoted by $D_t$ and changes at each iteration $t$ of the algorithm. Each time AdaBoost trains a tree on the training sample $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ are the regressors and $y_i$ the output variable, it generates a weak hypothesis $h_t$ aimed at minimizing the error of the weak classifier under the distribution $D_t$, given by (1).

$$\varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I\big(h_t(x_i) \neq y_i\big) \tag{1}$$

AdaBoost then estimates an $\alpha_t$ parameter according to (2), which weights the contribution of classifier $t$ to the overall strong classifier.

$$\alpha_t = \log\frac{1-\varepsilon_t}{\varepsilon_t} \tag{2}$$

The next step is updating the weight distribution so as to train the next weak classifier. In the case of multi-class classification, if the M.1 version of the algorithm [32] is used, this weighting is performed according to (3), where $I$ is the indicator function and $Z_t$ is the normalization factor that makes $D_{t+1}$ a distribution.

$$D_{t+1}(i) = \frac{D_t(i)\,\exp\big(\alpha_t\, I(h_t(x_i) \neq y_i)\big)}{Z_t} \tag{3}$$

When the algorithm has built the number of trees selected, it generates the final classification using (4), where $T$ is the total number of trained trees.

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right) \tag{4}$$

XGBoost is an algorithm that also uses boosting, but in this case based on the additive expansion of regression trees. More specifically, it uses gradient-boosted decision trees as a base, but incorporates different mechanisms to exploit memory resources and parallelization, thus reducing computation times [33]. XGBoost gained fame by winning 17 of the 29 machine learning tasks proposed in Kaggle in 2015, a fact that, together with its current high cross-platform portability, has resulted in XGBoost receiving a lot of attention. Although it is an algorithm that reaches its maximum potential in massive datasets (as it makes a more efficient use of computational resources than other algorithms), its application to small and medium-sized datasets is also beginning to receive attention, e.g., in welding processes [35]. From a formal perspective, and given a sample of $n$ data and $m$ features, where again $x_i$ are the regressors and $y_i$ the output variable, the objective function of the algorithm at iteration $t$ is:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$

where $l$ is a differentiable convex loss function that captures the difference between the prediction and the actual data, $\hat{y}_i^{(t-1)}$ is the prediction for the $i$-th sample after $t-1$ iterations, $f_t$ is the $t$-th tree in the additive expansion and $\Omega$ is a regularization function that penalizes the complexity of the regression trees. The loss function is approximated through the second-order Taylor expansion:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \Big[\, l\big(y_i,\, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Big] + \Omega(f_t)$$

where $g_i$ represents the first derivative of the loss for each sample (gradient) and $h_i$ indicates the second derivative for each sample (hessian), both taken with respect to the prediction $\hat{y}_i^{(t-1)}$. In the specific case of multi-class classification used in this work, the loss function is a generalization of the logistic loss function [30].
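Returning to the AdaBoost reweighting step described earlier in this subsection, a single iteration can be sketched as follows. The sketch assumes the common formulation in which the classifier weight is the logarithm of (1 - error) / error and the weights of the misclassified instances are multiplied by the exponential of that value before renormalizing; the function name `adaboost_round` and the toy labels are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step: returns (epsilon, alpha, new_weights)."""
    miss = [int(t != p) for t, p in zip(y_true, y_pred)]
    # Weighted error of the weak classifier under the current distribution.
    epsilon = sum(w * m for w, m in zip(weights, miss))
    if not 0.0 < epsilon < 0.5:
        raise ValueError("the weak learner must have a weighted error in (0, 0.5)")
    # Contribution of this classifier to the final vote.
    alpha = math.log((1.0 - epsilon) / epsilon)
    # Increase the weight of misclassified instances, then renormalize.
    unnormalized = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]
    z = sum(unnormalized)
    return epsilon, alpha, [u / z for u in unnormalized]


# Five toy instances with uniform initial weights; only the last one is misclassified.
weights = [0.2] * 5
y_true = ["good", "good", "stick", "no weld", "undersize"]
y_pred = ["good", "good", "stick", "no weld", "good"]

epsilon, alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(epsilon, alpha, new_weights)
```

With this update the misclassified instances end up carrying half of the total weight, which forces the next weak learner to concentrate on them.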

3) RANDOM FOREST
The random forest algorithm is based on bootstrap aggregation (bagging), that is, it combines the results obtained by multiple classification trees, built deep and unpruned on different bootstrapped samples, as a way to reduce variance and hence improve accuracy. However, it outperforms bagging by considering just a random subset of predictors (random subspace method) at each split, which serves to decorrelate the different trees in the forest [36], [37]. In classification problems, the number of predictors ($m$) typically considered at each split is $m \approx \sqrt{p}$, where $p$ is the total number of predictors. Notably, random forest models are robust to overfitting and to the presence of correlated regressors. In addition, even though bootstrap aggregation methods improve accuracy at the expense of interpretability, they enable the computation of variable importance measures such as the original individual variable importance proposed by Breiman [37], group variable importance [38] and conditional variable importance [39], [40] and, thus, they have been successfully applied to quality assessment in the resistance spot welding process [6], [16].

4) INDIVIDUAL VARIABLE IMPORTANCE
Within the framework of random forests for classification, individual variable importance analysis is generally carried out as follows. Once the model is built, each bootstrapped sample has a tree fitted to it and a corresponding out-of-bag (OOB) sample, which contains approximately one third of the observations in the real dataset [36]. Let us assume that this dataset has M predictors; to determine the relative importance of the mth predictor, its values are randomly permuted in all the OOB samples, which are then run down their corresponding trees. As a result, each OOB observation obtains several class label predictions (as many as the number of OOB samples in which it appears); the majority vote is then taken and compared with the true class label to compute the misclassification rate. The importance of the mth predictor is subsequently calculated as the change in classification accuracy after the permutation with respect to the original case, i.e., the mean decrease in accuracy over all trees [37].
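The core idea of the permutation-based measure can be sketched in a few lines. The sketch is deliberately simplified and is an assumption-laden illustration rather than Breiman's exact procedure: it permutes one column of a single evaluation set for a fixed classifier, instead of working tree by tree on the OOB samples of a forest. The function name, the toy data and the toy classifier are hypothetical.

```python
import numpy as np


def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean drop in accuracy when the values of one column are shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # Shuffling breaks the link between this predictor and the response.
        X_perm[:, column] = rng.permutation(X_perm[:, column])
        drops.append(baseline - np.mean(predict(X_perm) == y))
    return float(np.mean(drops))


# Toy data: the class depends only on the first of two predictors.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)


def toy_predict(data):
    """Hypothetical fitted classifier that uses only the first predictor."""
    return (data[:, 0] > 0).astype(int)


importance_relevant = permutation_importance(toy_predict, X, y, column=0)
importance_irrelevant = permutation_importance(toy_predict, X, y, column=1)
print(importance_relevant, importance_irrelevant)
```

Permuting the predictor the classifier relies on roughly halves the accuracy in this toy setting, whereas permuting the unused predictor leaves it unchanged, so its importance is zero.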

5) CONDITIONAL VARIABLE IMPORTANCE
In [39], the authors pointed out that the above-described variable importance measure showed a bias towards correlated predictors and developed an alternative measure: the conditional variable importance, in which the dependence between a predictor and the outcome is calculated conditionally upon the values of other predictors. In particular, for each tree, they propose to divide (completely bisect) the predictor space into a multidimensional grid in accordance with the partition induced by that tree, and it is within each cell of that grid that the OOB values are conditionally permuted. Finally, the importance of each variable is calculated as the difference in predictive accuracy before and after the permutation, averaged across all trees [40], as in Breiman's measure.
In the present case study, conditional variable importance is of interest since it is particularly appropriate to address the research question of whether an increase in predictive accuracy may be attained as a consequence of considering both the welding parameters and the oscillogram variables together.

6) GROUP VARIABLE IMPORTANCE
Sometimes, rather than the individual importance of each variable, it is the joint importance of different groups of variables that is of interest. For such cases, Gregorutti et al. [38] proposed a group variable importance method that consists of using the same random permutation for each variable in the group under consideration, so that the empirical joint distribution of the group of variables is preserved, but the link between such group of variables, the rest of predictors and the response is effectively broken. In the present contribution, two groups of variables were considered: (i) welding parameters and (ii) oscillogram variables.
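The key point of the group measure, namely that one and the same permutation is applied to every variable of the group, can be sketched as follows. As before, this is a simplified illustration on a single evaluation set with a fixed classifier, not the forest-based procedure of [38]; the column groupings `WELDING_COLS` and `OSCILLOGRAM_COLS`, the toy data and the toy classifier are hypothetical.

```python
import numpy as np


def group_permutation_importance(predict, X, y, columns, n_repeats=20, seed=0):
    """Mean drop in accuracy when a group of columns is shuffled jointly."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    drops = []
    for _ in range(n_repeats):
        # A single row permutation shared by the whole group keeps the joint
        # distribution of the group intact while detaching it from the other
        # predictors and from the response.
        order = rng.permutation(X.shape[0])
        X_perm = X.copy()
        X_perm[:, columns] = X[order][:, columns]
        drops.append(baseline - np.mean(predict(X_perm) == y))
    return float(np.mean(drops))


# Toy data with two hypothetical groups of two predictors each.
data_rng = np.random.default_rng(7)
X = data_rng.normal(size=(400, 4))
WELDING_COLS = [0, 1]
OSCILLOGRAM_COLS = [2, 3]
# In this toy example the class depends only on the second group.
y = (X[:, 2] + X[:, 3] > 0).astype(int)


def toy_predict(data):
    """Hypothetical fitted classifier that uses only the second group."""
    return (data[:, 2] + data[:, 3] > 0).astype(int)


welding_importance = group_permutation_importance(toy_predict, X, y, WELDING_COLS)
oscillogram_importance = group_permutation_importance(toy_predict, X, y, OSCILLOGRAM_COLS)
print(welding_importance, oscillogram_importance)
```

Had each column of the group been shuffled independently, the relationship between the variables inside the group would also have been destroyed, which is precisely what the shared permutation avoids.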

7) ROTATION FOREST
Rotation forest is a tree-based ensemble method that seeks both diversity and accuracy by means of feature extraction. More specifically, rotation forest uses the C4.5 decision tree [41] as base classifier and it is defined by three main parameters: (i) L, the number of trees in the forest; (ii) K, the number of subsets into which the feature set is split; and (iii) p, the proportion of the observations (X) to select. The training set for each classifier is built as follows: first, the feature space is divided into K disjoint subsets; then, for every such subset, a nonempty subset of classes is randomly selected and a bootstrap sample of size 75% of the data count is drawn; subsequently, Principal Component Analysis (PCA) is run on the reduced dataset that includes only the features in the corresponding subset and the bootstrapped observations selected; finally, the PCA coefficients obtained across all subsets are rearranged in a rotation matrix $R_i^a$ so as to match the original feature order, and the classifier is built using $(X R_i^a, Y)$ as the training set, where Y denotes the class labels [42]. Remarkably, for problems with continuous real-valued features, rotation forest was found to be significantly more accurate on average than competing techniques from the following families of algorithms: support vector machines (SVMs), tree-based ensembles and neural networks [43]; it is therefore advisable to consider it among the algorithms with the greatest performance.
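The construction of the rotation matrix for one tree of the ensemble can be sketched with NumPy as follows. This is a minimal illustration of the steps listed above, not the implementation used in the study: PCA is computed through the eigen-decomposition of the covariance matrix, all components are retained, and the function name, its parameters and the random data are hypothetical.

```python
import numpy as np


def rotation_matrix(X, y, n_subsets, sample_fraction=0.75, seed=0):
    """Build the rotation matrix used to train one tree of a rotation forest."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    classes = np.unique(y)
    # Step 1: split the (shuffled) feature indices into K disjoint subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    rotation = np.zeros((n_features, n_features))
    for subset in subsets:
        # Step 2: keep a random nonempty subset of classes, then draw a
        # bootstrap sample with 75% of the remaining observations.
        n_keep = rng.integers(1, len(classes) + 1)
        kept_classes = rng.choice(classes, size=n_keep, replace=False)
        rows = np.flatnonzero(np.isin(y, kept_classes))
        n_boot = max(2, int(sample_fraction * len(rows)))
        boot = rng.choice(rows, size=n_boot, replace=True)
        # Step 3: PCA on the selected rows and features.
        cov = np.atleast_2d(np.cov(X[np.ix_(boot, subset)], rowvar=False))
        _, components = np.linalg.eigh(cov)
        # Step 4: place the loadings at the positions of the original features.
        rotation[np.ix_(subset, subset)] = components
    return rotation


# Hypothetical data with the dimensions of the problem: 14 inputs, 4 classes.
data_rng = np.random.default_rng(3)
X = data_rng.normal(size=(200, 14))
y = data_rng.integers(0, 4, size=200)

R = rotation_matrix(X, y, n_subsets=4)
X_rotated = X @ R  # training inputs for the base tree of this ensemble member
print(R.shape, X_rotated.shape)
```

Because every block holds a complete set of orthonormal principal axes, the resulting matrix is orthogonal: the data are rotated rather than compressed, and diversity comes from the different random splits and samples drawn for each tree.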

8) OTHER CLASSIFIERS
The rest of the classifiers implemented can be listed and succinctly described as follows: (i) Naive Bayes is a probabilistic classifier based on Bayes' theorem that, despite being a simple Bayesian model and assuming that, given the class, each feature is independent of any other, has proven to give good results in many real contexts [44]. (ii) Several implementations and variants of SVM classifiers, which maximize the width of the gap between different classes by mapping the data into a higher-dimensional space where the separation of the classes is simpler (recall that different kernels can be used for the mapping process); more specifically, in the present contribution we used the Least Squares Support Vector Machine classifier (LSVM) [45] with radial basis function (RBF) kernel, and the sequential minimal optimization algorithm for training a support vector classifier [46], [47] with both the RBF kernel and the Pearson VII function-based universal kernel (PUK) [48]. The performance of (iii) a multilayer perceptron neural network [49], (iv) logistic regression [50], and (v) the J48 algorithm, a Java implementation of the C4.5 decision tree [41], was also analyzed. Finally, commonly used baselines based on simple decision rules such as (vi) OneR and (vii) ZeroR were also evaluated, so as to compare the behavior of the more sophisticated classifiers [50] against them.

9) STACKING
Apart from bagging and boosting [51], the two most common ensemble techniques for classification, there is a third methodology for combining classifiers on the same dataset known as stacked generalization or stacking [52]. This approach is giving excellent and, in some cases, close to optimal results [53]-[55], and it has recently started to be successfully applied in manufacturing-related fields [56], [57]. While in the other ensemble techniques the base classifiers are usually of the same type, the idea of stacking is the opposite: it typically consists of combining the strengths of classifiers of different nature, based on dissimilar hypotheses, to generate a more accurate classifier. A two-level structure is then used to decide in which cases to use the predictions of each algorithm. Initially, different base classifiers (the level-0 models) are trained. The predictions of these models constitute the input of another classifier known as the metalearner or level-1 model, which learns a good generalization of the classifiers it combines. The primary goal of the metalearner is to detect the regions of the classification space in which each classifier or set of classifiers is most reliable. Although, in principle, any level-1 generalizer that is relatively global and smooth can be expected to perform well, in practice it is often appropriate to use overfitting-resistant classifiers such as logistic regression or random forests, which are the ones used in this work. In the implementation selected, the metalearner receives as attributes the vector with the probabilities of each class, a strategy that increases the generalization capacity [58].
Since generalizing the results of the metalearner requires that it is not trained on the same instances used at level 0, nested cross-validation was conducted; more specifically, 5-fold internal cross-validation was implemented within the inner loop, while the honest evaluation of the classifier performance was conducted on the outer loop.
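A two-level stacking structure of the kind described in this subsection can be sketched with scikit-learn as follows. The sketch rests on several assumptions that are not taken from the study: the data are synthetic, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, rotation forest is omitted because scikit-learn does not provide it, and a single train/test split replaces the outer cross-validation loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 437 joints, 14 inputs, 4 quality levels.
X, y = make_classification(
    n_samples=437, n_features=14, n_informative=8, n_redundant=2,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Level-0 models: tree ensembles based on bagging and on boosting.
level0 = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Level-1 metalearner: logistic regression fed with the class probabilities
# that the level-0 models produce on internal 5-fold held-out data.
stack = StackingClassifier(
    estimators=level0,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
meta_features = stack.transform(X_test)  # inputs seen by the metalearner
print(test_accuracy, meta_features.shape)
```

The `cv=5` argument plays the role of the internal cross-validation mentioned above: the metalearner is trained on predictions made for instances that the corresponding level-0 models did not see during fitting.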

III. RESULTS
For each performance metric (accuracy, F-measure and AUC) an ANOVA test was conducted to check the null hypothesis of equality of means across the 50 algorithms implemented. In accordance with the results obtained, the null hypothesis can be rejected in all cases at a significance level of 0.001, which indicates that not all algorithms perform equally, with some of them performing significantly better than the baseline classifiers.
A post hoc analysis was then performed for each metric using Duncan's multiple range test at a 0.1 level of significance (a corrected paired Student's t-test [59] was also performed and it gave very similar results at 0.05 significance level). Recall that in Duncan's multiple range test, two classifiers are considered statistically different if their difference exceeds the studentized range statistic. The results are presented in Table 3. Differences between performance metrics that do not share the same letter are considered statistically significant.
These results show several relevant aspects. First, there is a wide range of classifiers that obtain good results, i.e., whose performance is significantly superior to the 50% prediction accuracy of the ZeroR baseline. The fundamental predictive component lies in both the specific classifier selected and in the set of input variables used to train and validate the model. In this latter vein, the use of the welding parameters alone allows us to correctly classify approximately 80% of the instances using tree ensemble algorithms. In particular, Random Forest, XGBoost and Stacking of tree ensembles obtain the best results in terms of accuracy. In accordance with the AUC, these algorithms are also statistically significantly better than the rest of the algorithms on the same dataset.
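The text above mentions a corrected paired Student's t-test without giving its formula. Assuming it refers to the corrected resampled t-test commonly used for cross-validated comparisons, in which the variance of the score differences is inflated by the ratio of test to training set sizes, the statistic can be sketched as follows. The function names and the fold accuracies are hypothetical and do not come from the study.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(scores_a, scores_b, test_train_ratio):
    """Corrected paired t statistic for scores obtained on the same folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    var = variance(diffs)  # sample variance of the paired differences
    if var == 0:
        raise ValueError("the paired differences have zero variance")
    # The usual 1/k factor is increased by n_test / n_train because the
    # training sets of the different folds overlap.
    return mean(diffs) / math.sqrt((1.0 / k + test_train_ratio) * var)


def uncorrected_paired_t(scores_a, scores_b):
    """Standard paired t statistic, shown only for comparison."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / math.sqrt(variance(diffs) / len(diffs))


# Hypothetical accuracies of two classifiers on the same 10 folds.
scores_a = [0.98, 0.95, 0.97, 0.96, 0.99, 0.95, 0.97, 0.98, 0.96, 0.97]
scores_b = [0.96, 0.95, 0.94, 0.96, 0.97, 0.93, 0.96, 0.95, 0.95, 0.95]

# In 10-fold cross-validation each test fold is 1/9 the size of its training set.
t_corrected = corrected_resampled_t(scores_a, scores_b, test_train_ratio=1 / 9)
t_uncorrected = uncorrected_paired_t(scores_a, scores_b)
print(t_corrected, t_uncorrected)
```

The statistic would then be compared against a Student's t distribution with k - 1 degrees of freedom; the correction always shrinks the statistic, which makes the test more conservative than the standard paired version.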
Although these results are interesting per se and allow the quality control of the welding process to be conducted exclusively from the welding parameters, the accuracy attained by a posteriori RSW joint quality analysis techniques, such as an analysis based on ultrasonic testing, is significantly better. More precisely, if only the data from ultrasonic oscillograms are used to determine the quality of the RSW joints, a wide range of algorithms far exceed 90% accuracy; in particular, all algorithms based on trees, as well as several algorithms based on SVMs with different kernels. It should be noted that even simpler classifiers such as Naive Bayes obtain meritorious results. Within such a framework, it is also worth highlighting that stacking algorithms based on tree ensembles already achieve extremely high prediction results.
A fundamental question addressed in the present contribution is whether the combination of the two RSW joint quality determination methods (i.e., the welding parameters and the ultrasonic oscillogram variables) can improve the automatic quality-level classification. For this purpose, the results provided in Fig. 2 are studied. Fig. 2 shows the accuracy results of the algorithms that reach 90% accuracy with any of the regressor sets used (i.e., the welding parameters alone, the oscillogram variables alone or the combination of the two), and compares the prediction obtained in each case. A relevant result is that tree-based algorithms, even basic trees such as J48, are able to combine information from both sources (welding parameters and oscillograms) to improve the prediction by almost 1% systematically (p-value = 0.0012 using a paired t-test). This result is significant because improving classification when many algorithms are already performing above 95% accuracy is challenging. Other classifiers that are not as robust to partially redundant information, such as SVMs, have trouble incorporating this information effectively, and the combination of both regressor sets is even counterproductive in some cases.
An additional aspect to highlight is the excellent performance obtained by the stacking algorithms in the task of assessing the quality level of RSW joints based on either the welding parameters, the ultrasonic oscillogram variables or both. In terms of accuracy and F-measure, the combination of random forest together with XGBoost gives the best results. Still, stacking based on tree ensembles (combining random forest, rotation forest, XGBoost and AdaBoost) using as metalearner either logistic regression or random forest also gives very competitive results. In the case of using the area under the curve (AUC) as the performance measure, the approach that provides the best results uses as base learners the four tree ensembles listed above with either of the two metalearners proposed. It should be noted that the choice of one or the other performance metric in industrial contexts will depend on the variation over time of the non-quality costs. Remarkably, the results obtained show that the combination of stacking with tree ensembles, when both the welding parameters and the variables from ultrasonic nondestructive testing are used as inputs, provides very accurate results for any desired range of specificity and sensitivity.
As regards the variable importance analyses conducted, Fig. 3 shows on the left the results of the individual variable importance proposed by Breiman [37], which were obtained using the randomForestSRC R package [60], and on the right the individual conditional importance of each variable according to Debeer and Strobl [40], which was calculated with the R packages party [39], [61], [62] and permimp [40]. Notably, Breiman's approach provides the importance of each variable within the model, as its random permutation of the values of the regressor variable is supposed to mimic the absence of such variable within the model; for its part, the conditional variable importance approach quantifies the contribution of each predictor conditional on the presence of the rest of the regressors. Even though the conditional approach induces changes in the overall ranking of all the predictors with respect to Breiman's method, the most remarkable change is the different position of the welding process variable welding time; in fact, its higher position in the conditional variable importance analysis may be interpreted in relation to the 1% increase in predictive accuracy that is attained when both the welding process and the oscillogram variables are considered. The results of the group variable importance analysis are shown in Fig. 4, which highlights that the set of most discriminant regressors is by far that of the oscillogram variables.

IV. CONCLUSIONS
In this work, the use of pre-welding inputs (welding parameters) and post-welding inputs (ultrasonic oscillogram variables) to determine the quality level of RSW joints has been analyzed both individually and in combination. In the analyses, a large number of classifiers have been compared using stratified nested cross-validation, i.e., optimizing the hyperparameters of the algorithms through grid search in the inner loop, and measuring the performance in the outer loop. The results obtained show that:
• The analysis of RSW joint quality using post-welding variables is systematically superior to using only the pre-welding inputs. Compared to the baseline (ZeroR), the welding parameters increase the accuracy by an extra 30%, while the analysis based on ultrasonic testing improves the prediction by 45%. In particular, a large set of algorithms are capable of obtaining very accurate results (an accuracy greater than 90%) using ultrasonic testing data.
• The combined use of pre-welding and post-welding inputs allows improving prediction in algorithms that are less sensitive to overfitting and internal correlations between regressors. In particular, algorithms based on trees and tree ensembles allow statistically significant improvements in prediction. Given the good results of many algorithms using only ultrasonic oscillograms, improving the outcomes is particularly challenging; however, since in competitive industries such as the automotive one small improvements make the difference, and in view of the low implementation cost of tracking the welding parameters compared to using ultrasonic oscillograms exclusively, the combination of both approaches may be of interest for the reduction of RSW non-quality costs.
• The variable importance analyses conducted on the basis of random forests confirm that the information for quality evaluation of the oscillograms is superior to that of the welding parameters. However, welding parameters such as welding time and the treatment applied to the electrode material have an influence that is still relevant and not fully captured by the ultrasonic testing, so it is of interest to exploit this information.
• The results show that stacking techniques that effectively combine different classifiers to issue the final prediction yield the best performance for all the prediction metrics analyzed. The joint combination of boosting- and bagging-based tree ensembles obtains the best results for the problem.

Engineering. She has participated as a modeler and data analyst in several Spanish and international research projects that led to publications in renowned multidisciplinary journals.

JOSÉ I. SANTOS has received a B.S. degree in industrial engineering from the Universidad de Valladolid (Spain), an M.S. degree in Information Systems from the Escuela de Organización Industrial (Madrid), an M.S. degree in applied economics from the Universidad Nacional de Educación a Distancia (Spain), and the Ph.D. degree in industrial and civil engineering from the Universidad de Burgos (Spain). He is an Associate Professor in management engineering at the Universidad de Burgos. His main area of expertise is the modelling of complex systems. Most of his research focuses on the application of methods and techniques for the study of complex systems, including agent-based modelling, complex network theory, and machine learning.