Anomaly Detection Using Ensemble Classification and Evidence Theory

Multi-class ensemble classification remains a popular focus of investigation within the research community. The popularization of cloud services has sped up its adoption due to the ease of deploying large-scale machine-learning models. It has also drawn the attention of the industrial sector because of its ability to identify common problems in production. However, there are challenges to forming an ensemble classifier, namely a proper selection and effective training of the pool of classifiers, the definition of a proper architecture for multi-class classification, and uncertainty quantification of the ensemble classifier. The robustness and effectiveness of the ensemble classifier lie in the selection of the pool of classifiers, as well as in the learning process. Hence, the selection and the training procedure of the pool of classifiers play a crucial role. An (ensemble) classifier learns to detect the classes that were used during the supervised training. However, when injecting data with unknown conditions, the trained classifier will attempt to predict one of the classes learned during training. To this end, the uncertainty of the individual and ensemble classifiers can be used to assess the learning capability. We present a novel approach for anomaly detection using ensemble classification and evidence theory. A pool selection strategy is presented to build a solid ensemble classifier. We present an architecture for multi-class ensemble classification and an approach to quantify the uncertainty of the base classifiers and the ensemble classifier. We address the problem of detecting unknown conditions (while feeding out-of-distribution data), presenting a novel approach that monitors the uncertainty of the ensemble classifier using evidence theory. Finally, we use the Tennessee Eastman benchmark to perform experiments that test the ensemble classifier's prediction and anomaly detection capabilities.


I. INTRODUCTION
Ensemble classification (EC) has become a popular subject of applied research in several branches of industry, such as automotive, pharmaceutical, energy, and insurance. The power of EC relies on its ability to discover patterns in data that can shed light on the optimization of a business or the discovery of anomalies in a process. Ensemble learning in supervised classification is a common practice because it permits combining different models to achieve better performance. Bagging and boosting are two commonly used techniques that have proved to be a good fit for fault diagnosis [1] [2]. An alternative way to combine classifiers is through information fusion. To this end, there are different strategies to achieve information fusion [3]; the most common ones involve the use of fuzzy logic [2], Bayesian methods [4], and evidence theory (ET) [5]. We focus on the use of ET to build the ensemble classifier. However, there are essential aspects to consider while doing ensemble classification, such as an architecture for the information fusion, the selection of (the pool of) classifiers, a solid training procedure that guarantees performing classifiers, and an uncertainty quantification of the classifiers that signals their learning capability. The pool selection plays a vital role in the overall performance of the EC. There are different factors to consider while making the pool selection, namely the diversity or heterogeneity of the sources [6], the inclusion of expert classifiers (e.g., classifiers that can detect specific classes) [7], the data nature, and the classifier's mathematical principle. The use of heterogeneous sources improves the results in information fusion, in this specific case through the use of diverse classifiers [8]. The challenge here is to define an effective strategy to measure the heterogeneity, or the diversity, between the classifiers' combinations. An additional point to note is the expertness of a classifier, meaning a classifier that can detect a specific class particularly well. Data nature and the mathematical principle are prior considerations during the selection of the pool, which in most cases involve expert knowledge. A performing EC requires not only an effective pool selection strategy but also a measure of the certainty of the predictions, in other words, of how reliable the predictions can be. A high uncertainty would signal a poor training process and a poorly performing classifier (e.g., a poor generalization). Thus, uncertainty quantification (UQ) can be used to measure the epistemic uncertainty (e.g., associated with a lack of knowledge in the classifier) [9]. Moreover, the UQ can be used to weigh the classifier's performance (after training), which can be useful during information fusion.
The definition of a pool selection strategy and uncertainty quantification pave the way for a performing EC. The architecture of the EC plays a vital role in the combination of classifiers. Important considerations include the number of classes (e.g., binary or multi-class), the ensemble size (e.g., the number of classifiers that form the EC), the selected classifiers (e.g., only NN-based classifiers), and the transformation of the predictions into a common framework that allows the fusion [10]. While training the EC, a data subset is selected that is expected to represent the overall data. Training an EC with a specific portion of the input space assures the learning procedure on that specific portion [6]. The drawback is limiting the EC to a static behavior, which means the classifier will perform predictions based only on the learned classes (e.g., in-distribution data). The challenge here is endowing the ensemble classifier with an anomaly detection capability when feeding unknown conditions (e.g., out-of-distribution data). We propose a novel approach for anomaly detection using ensemble classification and evidence theory (ECET) that provides the guidelines to create performing ensemble classifiers (ECs). Besides, we present an uncertainty quantification methodology to measure the uncertainty of the trained classifiers and the uncertainty of the EC during inference. Finally, we present an architecture for binary and multi-class EC using evidence theory that not only provides a robust classification performance but also tracks the EC uncertainty to detect anomalies while feeding unknown data (out-of-distribution data).
The contributions of this paper are:
• A methodology for ensemble classification using evidence theory (ECET) that applies a fusion at the decision level for the pool of classifiers. The methodology is used for creating binary and multi-class ECs using different ensemble sizes and classifiers.
• A strategy for pool selection that considers the criteria diversity, expert, and pre-cut. This pool selection reduces the number of possible combinations while providing a performing EC.
• An uncertainty quantification methodology that measures the uncertainty after the training process of the pool of classifiers. Besides, it assesses the uncertainty of the EC predictions during model inference.
• An approach for anomaly detection (AD) using ensemble classification and uncertainty quantification. The AD detects anomalies while feeding unknown data (out-of-distribution data) through uncertainty tracking.
This paper is structured as follows: a literature review is presented in Section II. The ensemble classifier's architecture and methodology are presented in Section III. Section IV presents a use case of the proposed methodology applying the Tennessee Eastman benchmark dataset. Finally, the conclusion and future work are summarized in Section V.

II. RELATED WORK
This section presents the state of the art of the main topics addressed in this paper: ensemble classification, pool selection, uncertainty quantification, and anomaly detection. Fig. 1 shows the relationships between the topics of this section.

A. Ensemble Classification
Ensemble classification (EC) is a versatile approach commonly used in the research community because it provides a more robust output and benefits from the heterogeneity of its sources (e.g., classifiers) [6]. Defining a reliable EC architecture requires special considerations in terms of data preparation (e.g., classifiers that require standardized data), ensemble size (e.g., the number of classifiers), classifier type (e.g., shallow classifiers, deep learning models, or hybrids), and the strategy to combine the classifiers. Data preparation is a usual step in a machine learning pipeline and an explicit requirement for some classifiers (e.g., using a standard scaler for NN-based classifiers). The EC ensemble size (or the number of classifiers involved) is usually a trial-and-error parameter, which means that a fixed value cannot be used as a general parameter for all the cases to be learned (e.g., a large number of classifiers in the EC does not guarantee a better outcome). In contrast, selecting a combination strategy presents multiple options for EC [3]. Popular strategies include the use of bagging (e.g., random forest) and boosting (e.g., AdaBoost) [1] [2], Bayesian methods [11], fuzzy logic [12], averaging [6], majority voting [13] [14], and evidence theory (ET) [5] [15]. ET is a preferred approach in the literature because it not only provides a framework to combine different information sources (e.g., predictions of classifiers) in the form of sets of evidence but also considers the uncertainty of the information sources [16]. This last feature allows allocating confidence in the ET predictions. Examples of EC using ET can be found in [16] [17] [18]. Although we stated the benefits of EC, there are essential considerations to be addressed while forming an EC, namely the pool selection, the transformation of the classifier output into a set of evidence, and the architecture, among others.

B. Pool Selection
The pool selection plays an essential role in EC because its strength lies in combining heterogeneous and performing classifiers. For this purpose, different criteria are proposed in the literature using performance [14], expert area or competence sub-region [7], and diversity measurement [6] [19]. A common practice for selecting the pool of (base) classifiers is using performance as a sorting criterion after the training phase because of its simplicity and effectiveness. However, the performance relies on a static value corresponding to specific data conditions (e.g., a local portion of the input space), which is an issue in case of a concept drift [7]. Specifically, this occurs when the training data is only a local representation of the data, which incurs an underperformance when presenting data from a different part of the input space. To this end, Jiao et al. [7] propose a dynamic ensemble selection to address concept drift by training and selecting classifiers per competence sub-region dynamically. On the other hand, it is important to note that considering n base classifiers produces a notable number of possible combinations and thus leads to the dilemma of which combination to choose. The question here is how to reduce the number of possible combinations without compromising the EC's performance and generalization. A prerequisite for forming an EC is the heterogeneity of the information sources, or, put differently, how diverse they are. The diversity between classifiers improves the generalization capability of the EC [6]. To this end, diversity measurements are proposed in the literature to tackle this problem. Jan et al. [19] propose a pairwise diversity measure for an incremental classifier selection to identify which classifiers improve the learning capability. The diversity measure compares two ECs using matrices with misclassified samples and uses a (customized) indicator function that quantifies the diversity. The approach discards classifiers from an EC that improve neither the accuracy nor the diversity.

C. Uncertainty Quantification
The support of a classifier relies not only on its capability to provide a condition prediction while feeding data but also on how reliable this prediction is. For this purpose, uncertainty quantification (UQ) assesses the prediction reliability (e.g., a 90% likelihood provides not only the associated class but also how certain the prediction is). Uncertainty is divided into two categories: aleatoric (e.g., associated with random effects) and epistemic (i.e., model uncertainty associated with a lack of knowledge) [9]. We focus on the quantification of epistemic uncertainty. The epistemic uncertainty is quantified using different strategies, namely entropy [20], variance [21] [22], Bayesian neural networks (BNN) [23], Monte Carlo dropout [24], and evidence theory [16] [21]. An extensive survey of the different methods for UQ is found in [25] [26]. Using ensemble systems is a common approach to quantifying epistemic uncertainty. Dong et al. [21] present a multi-expert uncertainty-aware learning (MUL) approach that compares the variance of multiple parallel dropout networks. The uncertainty is quantified using the difference between the variances of the different networks (experts). Thus, a significant difference in the variance of the experts is an indicator of high uncertainty in the prediction. Wang et al. [22] use uncertainty-driven deep multiple instance learning to optimize the learning process by discarding noisy training samples from the positive bags. To this end, the authors calculate the mean and the standard deviation of the probabilistic ensemble prediction of a bag to determine the uncertainty. Likewise, Dong et al. stated that a larger standard deviation corresponds to higher uncertainty. Huo et al. [16] monitor the uncertainty at the decision level using an ET-based EC. In contrast to [22], which uses the UQ during the training phase, [16] and [21] use the UQ during model inference. A framework that monitors the uncertainty across the complete EC lifecycle, namely from the training phase until inference, could not only improve the EC performance but also add the feature of anomaly detection while performing inference.

D. Anomaly Detection
Anomaly detection has been addressed using different approaches, namely multivariate statistical methods (e.g., Hotelling, Mahalanobis distance) [27] [28], variational autoencoders (VAE) [29] [30], Bayesian neural networks (BNN) [31] [32], and ensemble learning [16] [33], among others. An extensive survey of the different methods for anomaly detection is found in [34]. The use of the Mahalanobis distance and Chi-square distribution is commonly preceded by a feature reduction on the data using principal component analysis (PCA) [27] [35] or canonical variate analysis (CVA) [11]. Sun et al. [27] propose using a confidence region to diagnose heart diseases using a Gaussian mixture model and the Mahalanobis distance. Nguyen et al. [28] use the Mahalanobis distance and Chi-square distribution for IoT node authentication by setting a cut-off value in the Chi-square distribution to identify non-legitimate nodes. Cao et al. [11] propose using clustering, a sub-region CVA, and Hotelling to identify faults. He et al. [29] propose a DL-based framework for diagnosis and fault detection in bearings. The authors implemented a VAE, in which the latent space of the encoder is used together with a novelty threshold (based on the Bhattacharyya distance) to identify unseen faults. Anomaly detection is related to out-of-distribution (OOD) data, in which the test data distribution presents dissimilarities to the training data (in-distribution data) [24] [36]. An (ensemble) classifier should be aware of anomalies, which implies a strategy that considers monitoring the prediction confidence. The prediction confidence is addressed by approaches such as BNN and EC using ET. Wu et al. [31] propose a method using dropout-based Bayesian deep learning to detect unexpected faults of high-speed train bogies. The approach outputs a diagnosis result and an uncertainty indicator of the detected class. Ensemble learning using ET is a popular approach because it not only provides a robust classification but also can detect anomalies by monitoring the uncertainty [16]. The last feature addresses the issue that an EC cannot detect unknown classes (as a result of having classifiers trained with a fixed dataset [6]).
Our approach differs from the current contributions in the literature in that we present a holistic methodology for ensemble classification using evidence theory (ECET) that not only provides a robust classification output but also assesses the likelihood of the output by quantifying the uncertainty. First, we present an uncertainty quantification (UQ) approach using evidence theory that measures not only the uncertainty after the training phase of the (individual) classifiers but also the EC uncertainty during inference. Second, we propose a pool selection strategy that allows the creation of an EC by choosing the pool of (base) classifiers using key criteria such as performance, diversity, and expert area. Third, we propose an EC architecture using evidence theory that provides binary and multi-class ECs with a robust prediction (with respect to the base classifiers) and a degree of likelihood through the use of UQ. Fourth, we propose a novel anomaly detection procedure that performs uncertainty tracking on the EC while performing inference on unknown data (out-of-distribution data). We demonstrate the approach's robustness using the Tennessee Eastman benchmark.

III. ECET: ANOMALY DETECTION USING ENSEMBLE CLASSIFICATION AND EVIDENCE THEORY

This research proposes a data-based model approach to address anomaly detection using Ensemble Classification and Evidence Theory (ECET). The main topics considered in this section are: a theoretical background, uncertainty quantification, the criteria for the selection of classifiers, the ensemble classifier using evidence theory (ECET), and anomaly detection using ECET.
This approach consists of two main blocks: the training process and the inference model for classification and anomaly detection (see Fig. 2).
The overall system is detailed as follows:
• The prior step is the training of the pool of classifiers, which comprises three main activities, namely model training, model validation, and uncertainty quantification. The main goal of this block is to provide a pool of trained models. The inputs to this block are the training data D_Tr and validation data D_Va, a pool of classifiers, and a list of hyperparameters for the classifiers. The main contribution of this block, represented with blue arrows, is the overall uncertainty quantification UQ_Va of each model per trained class.
• The second step is the anomaly detection using ECET. This step is the model deployment, and it consists of three main blocks, namely the pool selection, the ensemble (classifier), and the anomaly detection. The main goals of this block are the classification of known conditions and the detection of anomalies in the data. The inputs to this block are the pool of trained models and the testing data D_Te. The main contributions of this block are a methodology for the ensemble classifier using evidence theory (ECET) and a strategy for anomaly detection using uncertainty quantification UQ_Te.

A. Theoretical Background
This section presents the theoretical basics for the development of the subsequent sections. In the first place, the basics of evidence theory are presented. Secondly, the evidential treatment of supervised classification presents the application of ET for uncertainty quantification, ensemble classification, and anomaly detection.
1) Evidence Theory: Evidence theory (ET) serves as a framework to model epistemic uncertainty. It is important to mention that ET allows different information sources to be combined. The effectiveness of the combination, or fusion, relies upon using heterogeneous sources of information. Formally, a frame of discernment Θ is defined as [37]: Θ = {A, B}, where A and B are focal elements. The power set 2^Θ is represented as 2^Θ = {φ, {A}, {B}, Θ}. A mass function is defined as m: 2^Θ → [0, 1] and fulfills the conditions m(φ) = 0 and Σ_{A⊆Θ} m(A) = 1. The focal elements are mutually exclusive, thus A ∩ B = φ. The Dempster-Shafer Rule of Combination (DSRC) allows two sources of information, specifically two mass functions, to be combined using:

m_DS(A) = (1 / (1 − b_k)) · Σ_{B∩C=A} m_1(B) m_2(C),  A ≠ φ   (1)

where m_1(B) and m_2(C) are the mass functions of each information source, and m_DS(A) is the resulting mass function after the DSET fusion.
The amount of conflicting evidence b_k is represented by:

b_k = Σ_{B∩C=φ} m_1(B) m_2(C)   (2)

It is important to note that the uncertainty represented by the term b_k is split over each combined focal element of m_DS(A). The Yager rule of combination (YRC), like the DSRC, allows two mass functions to be combined using [38]:

m_Y(A) = Σ_{B∩C=A} m_1(B) m_2(C),  A ≠ φ, A ≠ Θ   (3)

where m_Y(A) is the resulting mass function after the Yager fusion. The mass function of the focal element Θ is represented by:

m_Y(Θ) = q(Θ) + q(φ)   (4)

where q(Θ) represents the evidence of the focal element Θ, and q(φ) is the conflicting evidence. Thus, q(φ) is defined as:

q(φ) = Σ_{B∩C=φ} m_1(B) m_2(C)   (5)

In contrast to DSRC, YRC performs a fusion only over the known elements; the intersections of elements that result in φ are considered separately in the term q(φ). The latter means that the conflicting evidence q(φ) is assigned solely to the focal element Θ of the combined mass function m_Y.
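As a concrete illustration, the two rules of combination can be sketched in a few lines of Python, with each mass function represented as a dictionary mapping focal elements (frozensets) to masses. This is a minimal sketch under those representation assumptions, not the authors' implementation; `dsrc_fuse` and `yager_fuse` are hypothetical helper names.

```python
def dsrc_fuse(m1, m2):
    """Dempster-Shafer rule: combine two mass functions, normalizing
    out the conflicting evidence b_k over the surviving focal elements."""
    conflict = 0.0  # b_k: mass assigned to the empty intersection
    combined = {}
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc
    fused = {a: v / (1.0 - conflict) for a, v in combined.items()}
    return fused, conflict

def yager_fuse(m1, m2, theta):
    """Yager rule: the conflicting evidence q(phi) is assigned
    solely to the frame of discernment Theta."""
    conflict = 0.0
    fused = {}
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                fused[inter] = fused.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc
    fused[theta] = fused.get(theta, 0.0) + conflict
    return fused
```

Both functions return mass functions that still sum to one; the difference is only in where the conflicting mass ends up.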
The following equation is used when performing a fusion of more than two information sources:

m_EN = m_1 ⊕ m_2 ⊕ ... ⊕ m_n   (6)

where m_n(Z) is the mass function of the n-th information source, n ∈ N, and ⊕ denotes the chosen rule of combination applied pairwise.
2) Evidential Treatment of Supervised Classification: The prediction of a classifier is represented either as a unique label or as an array with the probabilities of all possible labels. In the case of a unique label, a transformation is required to consider all possible labels in the mass function. Θ includes all the possible cases: Θ = {L_1, L_2}. The elements of the power set for two labels are represented as 2^Θ = {φ, {L_1}, {L_2}, {L_1, L_2}}. The last term corresponds to the uncertainty, since it covers both cases; this term will be referred to as U. Since only one label is active at a time, a strategy is necessary to fill up the mass function. This strategy considers which label is active by assigning a nearly-one value to it and nearly-zero values to the rest of the inactive labels. Cheng et al. [39] applied a sensitivity-to-zero approach, which approximates the zero and one values to a nearly-zero value and a nearly-one value, respectively. This approach enhances information fusion since all the evidence is considered, even when these values are small. The approach is relevant when using DSET with its orthogonal multiplication and was applied to a set of evidence in [40]. In [10], a methodology was presented to transform a prediction into a set of evidence, taking the prediction as a row vector p (e.g., a common scenario for NN-based classifiers with a softmax layer), where p ∈ {L_1, L_2, L_3, ..., L_n} and n ∈ N.
Thus, a prediction p can be represented as a row vector of size 1 × n:

p = [p_1, p_2, ..., p_n]

The prediction row vector p has an associated confidence-weight row vector w_p of size 1 × n:

w_p = [w_1, w_2, ..., w_n]

The row vector p is transformed into an evidence row vector e of size 1 × (n + 1): its first n elements are given by the element-wise (Hadamard) product of the row vectors p and w_p, and its last element defines the uncertainty of the prediction p:

U = 1 − p · w_p

where p · w_p is the dot product between the row vectors p and w_p. Thus, the evidence e can be represented as the row vector:

e = [p_1 w_1, p_2 w_2, ..., p_n w_n, U]
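The transformation of a probability row vector into a set of evidence can be sketched as follows; `prediction_to_evidence` is a hypothetical helper name, and the weight vector stands in for the validation-based confidence weights described above.

```python
import numpy as np

def prediction_to_evidence(p, w):
    """Transform a probability row vector p (1 x n) into an evidence
    row vector e (1 x n+1) using confidence weights w (1 x n).
    The last element carries the uncertainty U = 1 - p . w."""
    weighted = p * w                   # element-wise (Hadamard) product
    uncertainty = 1.0 - float(p @ w)  # mass left unassigned to any label
    return np.append(weighted, uncertainty)
```

By construction the evidence vector sums to one, since the uncertainty term absorbs exactly the mass removed by the weighting.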

B. Ensemble Classifier and Evidence Theory
The ECET provides the guidelines for the creation of performing binary and multi-class ensemble classifiers (ECs). A preliminary step involves the selection of the pool of classifiers (detailed in the following subsection), the transformation of the classifiers' predictions into sets of evidence (detailed in Section III-A2), and the definition of an EC architecture. The robustness of the EC lies in the use of multiple (heterogeneous) classifiers. In this context, heterogeneity is understood as the diversity between classifiers (e.g., different classification principles, or training on different portions of the input space). It is important to note the role that the architecture plays in the performance of the EC, specifically the ensemble size (e.g., the number of classifiers), the pool of classifiers (e.g., only NN-based classifiers), and the use of confidence weights for the fusion (e.g., calculated from the validation data after the training process). We propose a supervised EC architecture using evidence theory. Fig. 3 displays the main components of the architecture, namely the number of classifiers, the transformation of a prediction into a set of evidence, the information fusion, and the transformation of the set of evidence into a prediction.
Formally, the EC consists of N classifiers trained with a training dataset D_Tr with o_Tr observations and f_Tr features, where o_Tr, f_Tr, N ∈ N. The confidence weights of the classifiers are calculated using a performance metric and the validation dataset D_Va with o_Va observations and f_Va features, where o_Va, f_Va ∈ N. These weights are used during the information fusion to weigh the set of evidence of each classifier.
The frame of discernment Θ is described as Θ = {F_1, ..., F_M}, where M is the number of classes or faults, and M ∈ N. Thus, the prediction of a classifier ŷ_i takes values in Θ while feeding the testing dataset D_Te with o_Te observations and f_Te features, where o_Te, f_Te ∈ N. The prediction ŷ_i is then transformed into a set of evidence m_i using Eqs. (7)-(23). The set of evidence m_i is a row vector of size 1 × (M + 1), which includes the overall uncertainty in its last element.
The next step is the combination, or fusion, of the sets of evidence of each classifier; see Algorithm 1. For this purpose, we apply the rules of combination DSRC and YRC using Eqs. (1)-(6) to obtain the ensemble sets of evidence F_Di and F_Yi, respectively. It is important to mention that, while effectuating the information fusion, the ensemble uncertainties U_Di and U_Yi are obtained. The last step consists of transforming the set of evidence of the EC, F_Di, into an ensemble prediction ŷ_EN using an argmax function:

ŷ_EN = argmax(F_Di)   (24)

where ŷ_EN ∈ Θ.
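The fusion loop of the ensemble step can be sketched as follows, assuming each base classifier contributes an evidence row vector with the singleton masses first and the Θ (uncertainty) mass last; `dsrc_pair` and `ensemble_predict` are illustrative names, not the paper's code.

```python
import numpy as np

def dsrc_pair(e1, e2):
    """DSRC for evidence vectors over singletons plus Theta (last entry).
    A singleton keeps mass from agreeing singletons and from Theta;
    everything else is conflict, normalized out as in Eq. (1)."""
    n = len(e1) - 1
    fused = np.zeros(n + 1)
    for i in range(n):
        fused[i] = e1[i] * e2[i] + e1[i] * e2[n] + e1[n] * e2[i]
    fused[n] = e1[n] * e2[n]
    conflict = 1.0 - fused.sum()       # b_k: mass of empty intersections
    return fused / (1.0 - conflict), conflict

def ensemble_predict(evidences):
    """Fuse all base-classifier evidence vectors pairwise, then take the
    argmax over the singleton masses as the ensemble label."""
    fused = evidences[0]
    for e in evidences[1:]:
        fused, _ = dsrc_pair(fused, e)
    return int(np.argmax(fused[:-1])), fused
```

The residual Θ mass of the fused vector can be read off as the ensemble uncertainty, which is what the anomaly detection step later monitors.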

C. Criteria for the Pool Selection of Classifiers
The selection criteria for the classifiers that build the pool for the ensemble classifier play a vital role. There are many considerations when selecting a combination of classifiers for the ensemble. We propose the following criteria: performance, expert area, and diversity. Besides, we use an extra parameter, pre-cut, that affects only the diversity. Fig. 4 shows the ensemble selection procedure.
1) Performance: A common practice for classifier selection is sorting the classifiers according to a performance metric. To this end, we propose using the F1-score as the performance metric because it is not sensitive to unbalanced classes. Besides, the F1-score (F1) combines the metrics precision (PR) and recall (RE) in its calculation. RE is also known as the fault detection rate (FDR). A detailed description of how F1, PR, and RE are calculated is presented in [41].
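For reference, the F1-score can be computed directly from confusion-matrix counts as the harmonic mean of precision and recall; this small sketch uses hypothetical counts, not data from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision (PR) and recall/FDR (RE),
    computed from true positives, false positives, and false negatives."""
    pr = tp / (tp + fp)   # precision
    re = tp / (tp + fn)   # recall, a.k.a. fault detection rate (FDR)
    return 2 * pr * re / (pr + re)
```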
2) Expert Area: The motivation for an expert area lies in the ability of certain classifiers to excel in learning specific classes [7]. This ability is an important feature because an "expert" classifier could improve the ensemble in the specific class where it excels. We propose a softmax-based strategy that identifies the expertness of classifiers. Lastly, the pool is sorted using the expert-area criterion. The softmax function σ of a vector z fulfills σ: R^K → (0, 1)^K, for K ∈ N. Thus, the softmax function is defined as:

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}

where i, j ∈ N.
The expert area of an ensemble classifier, EXP_EN, extends the softmax above to measure the degree of expertness of a classifier Mo_i with regard to a class F_k. Thus, the equation is defined as:

σ_EN(z)_{i,k} = e^{z_i^k} / Σ_{j=1}^{K} e^{z_j^k}

where z_i^k represents the prediction of the i-th classifier for the class k, K is the number of classifiers in the pool, and i, j, k ∈ N.
As stated above, an expert classifier could detect one class with high performance while performing poorly on the rest of the classes. For this reason, the results of the softmax σ_EN are used to select the classifiers with respect to a class. In order to increase the numerical separation between the elements of the k-th class, each element of σ_EN(z) is masked using:

σ_EN_msk(z)_{i,k} = VA_Max if σ_EN(z)_{i,k} ∈ S_{σ_EN(z)_{i,k}}, otherwise VA_Min

where S_{σ_EN(z)_{i,k}} represents the subset of maximum values of σ_EN(z)_{i,k} evaluated for a specific class k, and |S_{σ_EN(z)_{i,k}}| is the cardinality of the subset. VA_Max and VA_Min are the upper and lower values assigned to the i-th element of the k-th class, respectively, where VA_Max > 0, VA_Min ≥ 0, VA_Max, VA_Min ∈ R, and VA_Max ≫ VA_Min. These considerations for VA_Max and VA_Min play a key role in the numerical separation and, therefore, in the assignment of expertness to a classifier. For this reason, a separation of at least a factor of 10 is recommended when applying the expert area to the pool selection (e.g., VA_Max / VA_Min ≈ 10). The next step is summing the masked results per class for each classifier using the following equation:

σ_EN_msk−sm(z)_i = Σ_{k=1}^{N_C} σ_EN_msk(z)_{i,k}

where N_C represents the number of classes. The last step is applying the softmax σ to σ_EN_msk−sm(z):

EXP_EN_i = σ(σ_EN_msk−sm(z))_i

where EXP_EN_i represents the expertness of the i-th classifier.
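The expert-area procedure above can be sketched as follows; the masking step here assumes that, per class, the maximum softmax entry receives VA_Max and all other entries VA_Min, which is one plausible reading of the masking equation, and `expertness` is a hypothetical helper name.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expertness(scores, va_max=10.0, va_min=1.0):
    """scores: (K classifiers x N_C classes) per-class prediction scores.
    Returns EXP_EN, one expertness weight per classifier."""
    sig = softmax(scores, axis=0)            # softmax per class, over classifiers
    best = sig.max(axis=0, keepdims=True)
    masked = np.where(sig == best, va_max, va_min)  # amplify per-class winners
    summed = masked.sum(axis=1)              # sum the masked scores per classifier
    return softmax(summed)                   # final softmax -> expertness
```

With the default values, VA_Max / VA_Min = 10, matching the recommended separation.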
3) Diversity: The strength of ensemble classification derives from the heterogeneous nature of its information sources. For this purpose, it is necessary to have a grounded strategy that considers the diversity between the sources. There are different methods in the literature that address this topic, specifically the diversity measurement between the predictions of two classifiers, namely the disagreement measure, Kohavi-Wolpert variance, generalized diversity, and misclassification diversity [13] [42] [43] [19]. We concentrate on the following criteria: data nature, classification principle, and measured diversity. We apply the first two criteria considering the classifiers' nature according to the available literature. The last criterion has been explored in the literature using different diversity measurement methods.
Data nature considers whether the data is linearly or non-linearly separable. Some classifiers can handle only linearly separable data (e.g., the perceptron, linear support vector machines, and linear regression). Other classifiers, such as support vector machines (SVM), can perform a data transformation in which the non-linear data is projected into a higher-dimensional space using the kernel trick [44]. After this transformation, the data can be linearly separable, at the cost of increased computing power.
The classification (mathematical) principle has an essential role in this approach because each of the considered classifiers handles the input space differently. Common supervised classification principles are based on how the methods identify the regions in the input space (an extensive review of this topic can be found in [44]):
• Maximum likelihood classification: A popular method is the naive Bayes classifier, which uses conditional probability with the assumption that the input features are mutually independent:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) represents the posterior (or the prediction), P(A) is the prior (assigned with expert domain knowledge), P(B|A) is the likelihood, and P(B) is the evidence.
• Spectral measurement space: To this category belong the methods k-nearest neighbors (KNN) and support vector machines (SVM). KNN starts from the premise that an instance of the input space is likely to belong to the class of its K nearest neighboring instances. This method commonly uses the Euclidean distance to determine the membership of the instance. Support vector machines have proved powerful at separating the classes using (high-dimensional) hyperplanes w · x − b = 0 with margins w · x − b = ±1. As mentioned above, one strength of the method relies on the use of the kernel trick to perform a data transformation that makes the data linearly separable.
The authors of [13] propose a diversity measurement for an ensemble classifier. The first step is the definition of a diversity measurement Div(i, j) between two classifiers:

Div(i, j) = N_diff / N

where the subindex i corresponds to the i-th predicted class of the first classifier, the subindex j to the j-th predicted class of the second classifier, N is the number of samples, and N_diff represents the number of misclassifications with respect to the ground truth by the two classifiers. It is important to note that both classifiers must misclassify the sample under test. The second step is the definition of a diversity measurement of the ensemble classifier (EC), which considers all the individual classifiers that form the EC:

DIV_EN = (2 / (K(K − 1))) · Σ_{i=1}^{K−1} Σ_{j=i+1}^{K} Div(i, j)

where i ≠ j, and K is the number of classifiers in the pool. A variation of [13] considers the diversity measurement between two classifiers Div(i, j), where N_diff counts the number of misclassifications when at least one classifier is wrong with respect to the ground truth, also known as the disagreement measure [45].
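A minimal sketch of the pairwise diversity measure and its disagreement-measure variant, assuming Div(i, j) = N_diff / N as described above; `div_pair` is an illustrative name.

```python
import numpy as np

def div_pair(pred_i, pred_j, y_true, both_wrong=True):
    """Pairwise diversity Div(i, j) = N_diff / N.
    both_wrong=True counts samples that BOTH classifiers misclassify
    (the variant in the text); False counts samples where at least one
    classifier is wrong (the disagreement measure)."""
    wrong_i = pred_i != y_true
    wrong_j = pred_j != y_true
    if both_wrong:
        n_diff = np.sum(wrong_i & wrong_j)
    else:
        n_diff = np.sum(wrong_i | wrong_j)
    return n_diff / len(y_true)
```

The ensemble-level measure then averages `div_pair` over all classifier pairs in the pool.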
4) Pool Selection: In this approach, we consider an ensemble classifier of size m, which means that the predictions of m classifiers are combined using information fusion. Considering a pool of N_P classifiers and an ensemble size N_ES, the number of combinations N_C is given by: N_C = C(N_P, N_ES) = N_P! / (N_ES! (N_P − N_ES)!). The parameters under consideration are the expert criteria Exp, the diversity Div, the version of diversity Ver, and the pre-cut PC.
The pre-cut is a parameter that only affects the diversity criterion because it reduces the size of the pool. Having defined the different criteria for the pool selection, it is possible to summarize the ensemble selection strategy as pseudo-code; see Algorithm 2.
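The growth of the number of combinations, and the effect of a pre-cut on it, can be sketched as follows (the classifier names match the pool used later in the paper, but the F1 scores driving the pre-cut are hypothetical placeholders):

```python
from itertools import combinations
from math import comb

pool = ["KNN", "SVM", "NBY", "DTR", "ADB", "ale", "len", "vgg", "mlp", "cmp"]
N_P, N_ES = len(pool), 5

# Number of candidate ensembles N_C = C(N_P, N_ES).
print(comb(N_P, N_ES))        # C(10, 5) = 252

# A pre-cut shrinks the pool (here by dropping the lowest-ranked classifiers
# according to a hypothetical validation F1 score) before enumerating.
f1 = {"KNN": 0.94, "SVM": 0.92, "NBY": 0.85, "DTR": 0.88, "ADB": 0.80,
      "ale": 0.95, "len": 0.90, "vgg": 0.89, "mlp": 0.93, "cmp": 0.91}
pre_cut = 3   # remove the 3 weakest classifiers
reduced = sorted(pool, key=lambda c: f1[c], reverse=True)[:N_P - pre_cut]
candidates = list(combinations(reduced, N_ES))
print(len(candidates))        # C(7, 5) = 21
```

Each remaining candidate is a potential experiment, so even a modest pre-cut cuts the experimental effort by an order of magnitude here.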

D. Uncertainty Quantification
The uncertainty quantification (UQ) provides a glimpse of the learning capability of the (ensemble) classifier. The UQ can be performed during training, validation, and testing. After training, the UQ is used to assess the learning process using validation data (e.g., EC performance). During testing, the UQ can assess anomaly detection while feeding unknown data using an uncertainty tracking strategy. We propose three ways to quantify the uncertainty: using a performance metric and using the rules of combination DSET and Yager.
The UQ using a performance metric UQ_P is represented by: UQ_P = 1 − (1/N) Σ_{i=1}^{N} P_i, where N is the number of trained labels or classes, P_i is the performance of the i-th class, and N, i ∈ ℕ.
The UQ using the rule of combination DSET, UQ_DS, is modeled using b_k from Eq. (2). The UQ using the rule of combination Yager, UQ_Y, is modeled using Eq. (5). The terms b_k and q(φ) are calculated using Eq. (2). We quantify the uncertainty of the individual classifiers (e.g., SVM, alexnet) using UQ_P, UQ_DS, and UQ_Y. It is important to note that the performance metric is only used for individual classifiers because it relies on supervised classification. The latter means that the labels are known after training the classifier, so the performance metric can be applied. UQ_DS and UQ_Y are calculated using a random validation batch. Each batch contains a number M of samples of the validation data D_Va. The classifier is evaluated using a fixed number of iterations i, in which a new random validation batch is used. The predictions of each batch are combined using the rules of combination DSET and Yager to obtain UQ_DS and UQ_Y, respectively. Furthermore, the predictions of each batch are used to calculate UQ_P.
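Since Eqs. (2) and (5) are not reproduced here, the following is a generic sketch of the two rules of combination over singleton hypotheses plus the frame Θ (written "Theta"); the conflict mass q(φ) serves as the uncertainty measure, which Dempster's rule normalizes away and Yager's rule transfers to Θ:

```python
def combine(m1, m2, rule="dempster"):
    """Combine two mass functions (dicts over singleton hypotheses plus the
    frame 'Theta') with Dempster's or Yager's rule. Returns the combined
    masses and the conflict mass q_phi."""
    hypotheses = set(m1) | set(m2)
    combined = {h: 0.0 for h in hypotheses}
    q_phi = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            if a == b:
                combined[a] += ma * mb        # identical focal elements
            elif a == "Theta":
                combined[b] += ma * mb        # Theta ∩ b = b
            elif b == "Theta":
                combined[a] += ma * mb        # a ∩ Theta = a
            else:
                q_phi += ma * mb              # disjoint singletons -> conflict
    if rule == "dempster" and q_phi < 1.0:
        combined = {h: v / (1.0 - q_phi) for h, v in combined.items()}
    elif rule == "yager":
        combined["Theta"] = combined.get("Theta", 0.0) + q_phi
    return combined, q_phi

# Hypothetical masses from two base classifiers for the same sample.
m_knn = {"fault1": 0.7, "fault2": 0.2, "Theta": 0.1}
m_svm = {"fault1": 0.6, "fault2": 0.3, "Theta": 0.1}
fused, conflict = combine(m_knn, m_svm, rule="dempster")
```

High conflict between the classifiers drives q(φ) toward one, which is exactly the signal the uncertainty tracking in Section III-E monitors.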
In the case of ensemble classification, we feed the data to the EC, but the ground truth remains unknown. For this reason, we quantify the uncertainty using the UQ_DS and UQ_Y obtained from the EC.

E. Anomaly Detection using Uncertainty Quantification
The EC can predict the classes that were used during the training process. However, the EC has no learning capability once the training is finished, which means that the EC excels in the classification task only while the data corresponds to known classes. The constraint is that the EC classifies the data of an unknown condition into the known classes that were used during the training phase. We propose an anomaly detection capability for the ECET, which is based on monitoring the uncertainties U_D and U_Y. This feature identifies anomalies while feeding data of unknown conditions; refer to Algorithm 3. The procedure generates a new anomaly prediction ŷ_AN and a new anomaly frame of discernment Θ_AN. In case of discovering a new unknown condition K, the frame of discernment Θ_AN increases by one element: Θ_AN = {θ_1, …, θ_N, A_1, …, A_K}, where N, K ∈ ℕ.
The EC provides the uncertainties U_D and U_Y while feeding a data sample. The procedure checks whether the uncertainties U_D and U_Y exceed the maximum thresholds Tr_D_Mx and Tr_Y_Mx. If so, the anomaly detection predictor ŷ_AN receives a new class A_K, where K ∈ ℕ. In case the uncertainties lie below the thresholds, the anomaly prediction ŷ_AN receives the value of the ensemble prediction ŷ_EN.
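A minimal sketch of this thresholding step (assuming that both uncertainties must exceed their thresholds to flag an anomaly, and using the label -1 for unknown conditions as in the experiment design):

```python
def detect_anomaly(y_en, u_d, u_y, tr_d_mx, tr_y_mx):
    """Return the anomaly prediction: the ensemble label y_en when the
    uncertainties stay within their maximum thresholds, otherwise the
    anomaly label -1 (an unknown condition). Requiring BOTH uncertainties
    to exceed their thresholds is an assumption of this sketch."""
    if u_d > tr_d_mx and u_y > tr_y_mx:
        return -1          # unknown condition: assign a new anomaly class
    return y_en            # known condition: keep the ensemble prediction

print(detect_anomaly(y_en=2, u_d=0.9, u_y=0.8, tr_d_mx=0.5, tr_y_mx=0.5))  # -1
```

Requiring both rules to agree makes the detector conservative: a spike in only one of the two conflict measures is not enough to declare an anomaly.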
Algorithm 3: Anomaly Detection using ECET.

IV. USE CASE: ANOMALY DETECTION USING ECET ON THE TENNESSEE EASTMAN DATASET

This section presents the results of the classification performance and anomaly detection of ECET using the benchmark dataset Tennessee Eastman. For this purpose, we first present a description of the dataset and the considerations taken for the experiment design (e.g., data preparation, the pool of classifiers, and performance metrics). The results subsection provides the experiments' outcomes for the uncertainty quantification, pool selection, classification performance, and anomaly detection. Two final subsections close the section: a comparison with literature and a discussion of the results.

A. Description of the Dataset
Downs and Vogel created a simulation that replicates the process of the Tennessee Eastman chemical plant [46]. The process consists of five principal process units: the reactor, condenser, recycle compressor, vapor-liquid separator, and product stripper. The plant produces two liquid products, G and H, while using four gaseous inputs, A, D, E, and C. Fig. 5 displays the piping and instrumentation diagram (P&ID). The benchmark is a popular dataset in the research community because it provides a challenging use case for supervised classification and clustering approaches. Each testing fault case contains 160 samples of normal operation and 800 samples of the fault case. The dataset has 52 input variables. There are fault cases that are especially challenging for classification (e.g., fault cases 3, 9, 15, and 21). The fault cases have been grouped into three categories: easy (1, 2, 4, 5, 6, 7, 12, 14, and 18), medium (8, 10, 11, 13, 16, 17, 19, and 20), and hard (3, 9, 15, and 21) [48]. For this reason, it is a usual practice to select the hard faults to test the robustness of the approaches. The results of subsections IV-C3, IV-C4, and IV-D use the data of the hard faults in order to show the ECs' performance.

B. Experiment Design
We use the TE dataset to test the ECET capabilities: the classification performance and anomaly detection of the ensemble classifier (EC). We trained ten classifiers in total: five NN-based models and five non-NN-based models (from now on referred to as machine learning or ML-based models). There are three primary ensemble classifiers (ECs), namely NN-based, ML-based, and Hybrid (a combination of the NN-based and ML-based models). The approach is developed using Python 3.7 under the IDE Spyder from Anaconda. The models are defined using the ML frameworks scikit-learn (for ML models) and PyTorch (for NN-based models) [49] [50] [51]. The experiments are performed using a CPU i7-7700 @3.60GHz x 8, 32 GB RAM, a GPU NVIDIA GeForce GTX 1660 SUPER, and an Ubuntu 20.04.3 LTS environment.
1) Dataset preparation: The considerations taken for the experiments are:
• The fault cases are grouped into two main sets: (1,2,6,12) and (3,9,15,21). The first group contains a fraction of the easy faults, whereas the second group contains the hard faults. The data of (0,1,2,6,12) form the easy dataset, and the hard dataset contains the data of (0,3,9,15,21).
• The binary ECs are trained using the normal condition (0) and one of the fault cases of the datasets (e.g., hard dataset (0,3,9,15,21)), whereas the multiclass ECs are trained using all the cases.
• In the case of the anomaly detection experiments, the datasets are reduced to (0,1) and (0,3) for binary ECs, and (0,1,2,6,12) for multiclass ECs. Given the extent of all the possible combinations of data and ECs, these datasets were selected as representative to show the approach's performance. We use all the fault cases as unknown conditions.
The easy and hard datasets result in a training dataset of 52 input variables and 2900 observations (500 samples for the normal condition + 480 samples per fault case × 5 fault cases), and a testing dataset of 52 input variables and 4800 observations (960 samples for the normal condition + 960 samples per fault case × 4 fault cases). It is important to highlight that the first 160 samples of the testing data of a fault case correspond to the normal condition, leaving 800 samples of the faulty condition. In the case of the binary datasets (e.g., (0,1)), the training dataset is composed of 52 input variables and 980 observations (500 samples for the normal condition + 480 samples for the fault case), and the testing dataset of 52 input variables and 1920 observations (960 samples for the normal condition + 960 samples for the fault case).
The training dataset is split in a 70/30 ratio to have the training and validation datasets, respectively.The validation dataset is used to quantify the trained classifier's uncertainty and calculate the confidence weights for each class per classifier.The testing dataset is used to determine the classification performance of individual classifiers and ensemble classifiers.
The testing dataset for the anomaly detection experiments differs from that of the classification experiments in that an unknown fault case is added. As an example, in the case of a binary EC trained with the (training) dataset (0,1), the anomaly detection capability of the EC is tested using the (testing) dataset (0,1,2). For practical purposes, the label of (2) is changed to (-1) because the EC assigns the label (-1) to unknown conditions. The training dataset is scaled to a mean of µ = 0 and a standard deviation of σ = 1. The scaling parameters of the training data are applied to the testing dataset [52].
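The scaling step can be sketched in a few lines; the arrays below are random placeholders with the TE dimensions, not the actual benchmark data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(2900, 52))   # 52 TE input variables
X_test  = rng.normal(loc=5.0, scale=2.0, size=(4800, 52))

# Fit the scaling parameters on the training data only ...
mu    = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ... and apply the same parameters to both sets, so the testing data is
# never used to estimate mu and sigma (no information leakage).
X_train_s = (X_train - mu) / sigma
X_test_s  = (X_test - mu) / sigma
```

Reusing the training statistics on the test set is what reference [52] prescribes; refitting on the test set would leak distributional information into the evaluation.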
2) Pool of Classifiers: The pool of classifiers considers two main groups: NN-based and ML-based models. The ML-based group consists of the classifiers decision tree (DTR), support vector machine (SVM), k-nearest neighbors (KNN), naive Bayes (NBY), and AdaBoost (ADB). The NN-based group consists of the popular models AlexNet (ale), LeNet (len), and VGG (vgg), and is completed with two customized architectures: a multilayer perceptron (mlp) and a deep neural network (cmp). The hyperparameters for the ML-based models are obtained using the grid-search module of the ML framework scikit-learn. The hyperparameters (HP) for the ML models are as follows: DTR (criterion='gini', maximal depth=28), SVM (C=1000, gamma=0.1, kernel='rbf'), KNN (metric='manhattan', n-neighbors=3, weights='distance'), NBY (no HP), and ADB (lr=0.01, number of estimators=50). The architecture and hyperparameters for the NN-based models ale, len, and vgg are documented in detail in [53]. However, the 2D convolutional layers are replaced by 1D convolutional layers since the dataset is 1D. The architecture and hyperparameters for mlp and cmp are detailed in Fig. 6 and Fig. 7, respectively. The hyperparameters of cmp are obtained using the module Optuna. The learning rate is set to lr=0.001, and the number of epochs is set to 20. Experimental results did not show any improvement when increasing the number of epochs.
3) Performance Metrics: We choose the F1 score as the performance metric to monitor the learning capability in each of the experiments. The computation time is also tracked during each of the experiments. The fault detection rate (FDR) is used in section IV-D to compare the approach's results with literature.
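The grid search can be sketched generically as an exhaustive evaluation of every parameter combination, a simplified stand-in for scikit-learn's GridSearchCV (the grid values echo the SVM settings reported above, but the scoring function below is a hypothetical placeholder for cross-validated training):

```python
from itertools import product

# Hypothetical SVM grid around the reported values (C=1000, gamma=0.1, kernel='rbf').
param_grid = {"C": [10, 100, 1000], "gamma": [0.01, 0.1], "kernel": ["rbf"]}

def grid_search(score_fn, grid):
    """Exhaustively evaluate every parameter combination and return the best
    one, mimicking what scikit-learn's GridSearchCV does with a
    cross-validated score."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(keys, values))
        score = score_fn(params)          # stands in for mean CV F1 score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for actual model training and validation.
toy_score = lambda p: -abs(p["C"] - 1000) / 1000 - abs(p["gamma"] - 0.1)
best, _ = grid_search(toy_score, param_grid)
print(best)   # {'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}
```

In the actual pipeline, score_fn would train and cross-validate the classifier on the validation split for each parameter set.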

C. Results
This section presents the results of the experiments performed using ECET, specifically for the uncertainty quantification (individual classifiers and ECs), classification performance (individual classifiers and ECs), and anomaly detection (AD) using ECs.
1) Uncertainty quantification of individual classifiers: We quantify the uncertainty after training the models using UQ_P, UQ_DS, and UQ_Y as detailed in section III-D. The uncertainty quantification (UQ) is represented as plots for the ML-based and NN-based classifiers. For illustration purposes, we present only the plots for the multiclass classifiers using the easy dataset. Figures 8a, 8b, and 8c present the UQ for the ML-based classifiers, namely KNN, SVM, NBY, DTR, and ADB. It is important to note that the desired value for UQ_P, UQ_DS, and UQ_Y is zero, which implies consistent and accurate predictions. In the case of UQ_DS and UQ_Y, a zero value means no conflicting evidence. A random validation batch of 20 samples is used to quantify the uncertainty. These samples are extracted from the validation data. This operation is performed 50 times. The ML-based classifiers show comparable results using UQ_DS and UQ_P, with values tending to zero (with the exception of some samples in the case of UQ_DS that take the value of one). In contrast, UQ_Y has a stable approximate value of one, which implies constant conflicting evidence in the predictions of the random batch.
Figures 8d, 8e, and 8f present the UQ for the NN-based classifiers, namely ale, len, vgg, mlp, and cmp.
2) Pool Selection: We first construct a baseline with the performance of the individual classifiers. In this manner, it is possible to compare the performance of the new ECs. As stated in the experiment design, we use two datasets: easy (0,1,2,6,12) and hard (0,3,9,15,21). In addition, we create binary and multiclass classifiers for the two datasets. For illustration purposes, we present the performance of each (binary and multiclass) classifier (ML-based and NN-based) using the easy dataset. Each table summarizes the performance (F1-score) per fault case, an average F1-score, and the computation time for training the (individual) classifiers of the pool. Likewise, we present only the results of the ECs using the easy dataset.
Table IV shows the performance (F1-score) and computation time for the individual multiclass (ML-based and NN-based) classifiers using the fault cases (0,1,2,6,12). KNN presents the highest average F1-score of the ML classifiers with a value of 0.94. In contrast, ale and mlp present the best average F1-scores of the NN-based classifiers with values of 0.95 and 0.93, respectively. Table V shows the performance and computation time for the individual binary classifiers using the normal condition (0) and one of the fault cases (1,2,6,12). For illustration purposes, the normal condition of the table is the average of the results of the normal condition of all the binary classifiers. The best average F1-score corresponds to NBY and SVM with a value of 0.98 for the ML-based classifiers, whereas ale and cmp have the highest average F1-scores with values of 0.98 and 0.97 for the NN-based models, respectively.
The next step is the creation of the ECs: ML-based, NN-based, and Hybrid. After applying the procedure described in section III-C4, we obtain 38 ECs. As seen in Table VI, the parameters are expert (Exp), diversity (Div), diversity version (Ver), and pre-cut (P-C). It is important to note that we divide the ECs into three groups: ML (M2..M5), DL (D2..D5), and Hybrid (H2-1..H10-1). Each row of the table is an EC (e.g., the EC M2 consists of the ML models KNN and SVM), in which the first letter denotes the nature of the classifiers (e.g., M for ML-based, D for NN-based, and H for Hybrid), the first number is the ensemble size, and the last number is the consecutive number for this EC. The ECs H5-1, H6-1, and H6-4 present the highest average F1-score with a value of 0.97 each. The training time shows mixed results, varying from 8 s for M2 to 1242 s for H9-3. The ECs H5-1, H6-1, and H6-4 have training times of 557 s, 476 s, and 611 s, respectively. The training time is also represented as a percentage using a relative time. For this purpose, we consider the highest training time, that of the EC H9-3 with a duration of 1242 seconds (relative time of 100%), as the reference. In contrast, the EC with the smallest training time, M2, presents a duration of 8 seconds (relative time of 1%).
3) Ensemble Classification Performance: This subsection presents the classification performance of the ECs (ML-based, NN-based, and Hybrid) in the form of tables and plots. The plots represent the performance of selected ECs, which can be visualized through a confusion matrix, a plot of prediction versus ground truth, a plot of the DSET uncertainty, and a plot of the Yager uncertainty of an EC. As in the previous section, the tables summarize the individual F1-score per fault case, an average F1-score, and the relative time for each EC. The inference time is the time required to process all the testing samples, after which the relative time is calculated.
Table VII presents the multiclass ECs (H5-1, H6-1, and H6-4) and the individual multiclass classifiers (KNN, ale, and mlp) with the best performance from Table VI while using the cases (0,1,2,6,12). The ECs H5-1, H6-1, and H6-4 present comparable results with an average F1-score of 0.97, whereas the individual classifiers KNN, ale, and mlp have values of 0.95, 0.94, and 0.94, respectively. It is important to highlight the relative time difference between H6-1 with 38.3% and KNN with 0.5%. Table VIII presents the binary ECs (M2, M5, and H6-2) and the individual binary classifiers (NBY, SVM, and ale) while using the normal condition (0) and the fault cases (1,2,6,12). Similarly to Table VII, the rows represent the performance of a binary EC, and the normal condition is the average over all the binary classifiers of the EC. The ECs M2, M5, and H6-2 present comparable results with average F1-scores of 0.98, 0.99, and 0.99, respectively, whereas the individual classifiers NBY, SVM, and ale have a value of 0.99 each. Moreover, the results of the binary ECs are comparable to those of the multiclass ECs of Table VII. Every binary EC of a fault case has its own corresponding computing time; therefore, it is inadequate for comparison purposes. Table IX presents the multiclass ECs (M5, H4-1, and H4-3) and the individual multiclass classifiers (NBY, SVM, and KNN) with the best performance from Table VI while using the cases (0,3,9,15,21). The ECs M5, H4-1, and H4-3 present comparable results with average F1-scores of 0.22, 0.23, and 0.23, respectively, whereas the individual classifiers NBY, SVM, and KNN have values of 0.5, 0.58, and 0.58, respectively. It is important to highlight the relative time difference between H4-1 with 1.1% and KNN with 0.3%.
Figures 9a, 9b, and 9c show the confusion matrices for the ECs M5, H6-1, and H5-1, respectively. As can be noted, the confusion matrix (CM) with the best results (results mostly on the diagonal, which means that the predictions are equal to the ground truth) corresponds to the EC M5. The EC H6-1 presents a CM with mostly correct predictions, except for the cases (2,12), which present misclassifications. In contrast, the EC H5-1 presents the highest number of misclassifications, which implies the inability of the EC to distinguish between the cases (inadequate training). Figures 9d, 9e, and 9f display the predictions (blue) compared to the ground truth (red) for the ECs M5, H6-1, and H5-1, respectively. The EC M5 shows the cleanest plot, in which the predictions correspond mainly to the ground truth, whereas the EC H6-1 presents good results for the fault cases (1,2,6). In contrast, the EC H5-1 presents a noisy plot, which, similar to its confusion matrix, implies poor performance. Figures 9g, 9h, and 9i present the DSET UQ for the ECs M5, H6-1, and H5-1, respectively. The uncertainty of the EC M5 presents mostly values near zero, implying reduced conflicting evidence. In the case of H6-1, the uncertainty has notable fluctuations for the cases (0,12). The EC H5-1 presents a plot full of fluctuations (with values ranging from zero to one), implying the inability to properly classify the faults, which results in conflicting evidence during the fusion. Figures 9j, 9k, and 9l show the Yager UQ for the ECs M5, H6-1, and H5-1, respectively. The uncertainty of the EC M5 presents a notable amount of fluctuations, which implies the presence of conflicting evidence during the fusion. In the case of H6-1, the uncertainty has notable fluctuations for the cases (0,2,6,12). It is also noticeable that the uncertainty has a constant value under 0.4. The EC H5-1 presents a plot full of fluctuations, implying conflicting evidence during the fusion; the uncertainty has a constant value of 0.7 in most of the cases.
4) Anomaly Detection: This subsection presents the anomaly detection performance of the ECs in the form of tables and plots. Table XI presents a full series of experiments for multiclass ECs trained with the cases (0,1,2,6,12). The ECs H5-2, H6-2, and H7-2 present the highest average F1-score with a value of 0.63 each. The inference time shows mixed results, varying from 250 s for M2 to 31611 s for H9-2, which correspond to relative times of 1% and 100%, respectively. The ECs H5-2, H6-2, and H7-2 have inference times of 29364 s, 31101 s, and 31598 s, corresponding to relative times of 93%, 98%, and 100%, respectively.
It is important to highlight the relative time difference between the ECs, specifically between M5 and H5-2 with values of 1.9% and 92.9%, respectively. Fig. 10 presents the plots of selected ECs: the binary M5 trained with (0,1), the multiclass H5-2 trained with (0,1,2,6,12), and the binary H3-3 trained with (0,3), while injecting an anomaly (fault case 7). Figures 10a, 10b, and 10c show the confusion matrices for the ECs M5, H5-2, and H3-3, respectively. As can be noted, the confusion matrix (CM) with the best results corresponds to the EC M5, and most of the samples of the unknown condition are detected as an anomaly by the EC. The EC H5-2 presents a CM with mostly correct predictions, except for the cases (2,12), which present misclassifications; this EC shows a good anomaly detection capability. In contrast, the EC H3-3 presents the highest number of misclassifications, as well as an inability to detect the unknown condition samples. Figures 10d, 10e, and 10f display the predictions (blue) compared to the ground truth (red) for the ECs M5, H5-2, and H3-3, respectively. The EC M5 shows the most precise plot, in which the predictions correspond mostly to the ground truth. The EC H5-2 presents good results for the fault cases (1,2,6). In contrast, the EC H3-3 presents a noisy plot, which, similar to its confusion matrix, implies poor performance. It is important to remark that, without the anomaly detection tracking, the EC classifies the unknown data into the known cases, which subsequently results in a high-fluctuation area. Figures 10g, 10h, and 10i present the DSET UQ for the ECs M5, H5-2, and H3-3, respectively. The uncertainty of the EC M5 presents mostly values near zero, except for the unknown case (anomaly), which implies increasing conflicting evidence during the fusion. In the case of H5-2, the uncertainty has notable fluctuations for the cases (0,12), especially in the anomaly samples. The EC H3-3 presents a plot full of fluctuations, which implies a notable amount of conflicting evidence during the fusion. Figures 10j, 10k, and 10l show the anomaly detection (AD) tracking for the ECs M5, H5-2, and H3-3, respectively. The anomaly detection of the EC M5 identifies most of the anomalous samples. In the case of H5-2, the anomaly detection has notable fluctuations while injecting the anomalous samples. The EC H3-3 has the lowest performance in terms of anomaly detection, which is reflected in the high amount of fluctuations while injecting the anomaly.

D. Comparison with Literature
This subsection compares the ECET with literature, specifically regarding classification and anomaly detection performance. For this purpose, we selected binary (BIN) and multiclass (MC) ECs, which are compared with other literature approaches using the F1-score and the FDR (depending on the metric selected by the papers' authors).
1) Comparison of classification results: Table XIII displays the classification results of the binary ECs M5, H3-3, and H5-1. The binary ECs are trained using the normal condition and one fault case (e.g., the first row of fault 1 represents the classification results of the binary EC trained with data of the normal condition and fault 1). The experiments address the 21 fault cases, which means that we have 21 binary ECs. The overall classification results are comparable with literature approaches such as support vector machines (SVM) and modified partial least squares (MPLS). The average FDR of the EC H5-1 is 83.57%, while SVM has a score of 81.49% and MPLS has a score of 83.93%. The approaches yield similar results when looking into the detailed fault cases, specifically the easy faults (1, 2, 4, 5, 6, 7, 12, 14, and 18). The rest of the fault cases (the medium and hard faults) show mixed (inferior or superior) results with respect to literature.
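The FDR metric used in these comparisons can be sketched as follows (assuming the usual definition on the TE benchmark: the fraction of truly faulty samples flagged as non-normal):

```python
import numpy as np

def fault_detection_rate(y_true, y_pred, normal_label=0):
    """FDR: fraction of the truly faulty samples that the classifier flags
    as non-normal. Any non-normal prediction counts as a detection."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    faulty = y_true != normal_label
    return np.sum(y_pred[faulty] != normal_label) / np.sum(faulty)

# Toy example: 4 faulty samples, 3 of them detected as non-normal.
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0, 1]
print(fault_detection_rate(y_true, y_pred))   # 0.75
```

Unlike the F1 score, the FDR ignores false alarms on the normal condition, which is why the two metrics can rank the same approaches differently.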
Table XVII presents the anomaly detection results of the selected ECs compared with literature using all the faults of TE and the FDR.

E. Discussion
The uncertainty quantification (UQ) assessed the training capability of the individual classifiers (e.g., SVM with low uncertainty and high performance, versus ale with high uncertainty and low performance). In the case of the EC, the UQ is successfully used to detect anomalies (e.g., high uncertainty reflects the likelihood of an unknown condition).
For the ensemble classification, a pool of ML-based models and NN-based models was trained. Because the ensemble classification relies on the heterogeneity of its sources, a combined expert-diversity strategy was used during the pool selection. The proposed strategy systematically reduces the number of possible combinations while providing diverse ensemble classifiers (ECs). The performance of the ECs was comparable, and in some cases superior, with respect to the individual classifiers. In addition, the experiments consider the impact of varying the ensemble size and of different classifier architectures (e.g., ML-based, NN-based, and Hybrid ECs), as well as binary and multiclass ECs. The ensemble classification uses evidence theory to perform the fusion of the classifiers at the decision level. To this end, the rules of combination DSET and Yager were applied. The first one was mainly used to combine the classifiers, whereas both were used to detect uncertainty changes. Some ECs provided better results, such as the multiclass EC H6-1 (trained with the cases (0,1,2,6,12)) with an average F1-score of 0.97. In the case of the multiclass EC H4-1 (trained with the cases (0,3,9,15,21)), the EC had a poor performance with an average F1-score of 0.23. The poor performance can be attributed to the challenging fault cases and to a (poor) training of the classifiers that form the EC. The training time showed notable differences, ranging from 12 s for the EC H4-1 to 476 s for the EC H6-1, represented by relative times of 3% and 38%, respectively. The EC's constraints lie in the classifiers' static behavior, since the classifiers were trained once with specific data of the input space. The reliability of a model is related directly to how well the training data represents the input space, i.e., how representative the training samples are with respect to the overall data. Another constraint lies in the primary assumption of mutual exclusion between the faults; although exclusive faults can occur in real applications, a combination of faults is a common scenario. The anomaly detection relies on the ensemble classification performance. This is visible while comparing the average F1-scores of the binary EC M5 (trained with the cases (0,1)) and the binary EC H3-3 (trained with the cases (0,3)), which have values of 0.62 and 0.20, respectively. As noted in the classification performance subsection, the ECs trained with the cases (0,3) showed poor classification performance. Consequently, the anomaly detection performance of the ECs trained with the cases (0,3) is also poor. It is important to note that classifiers with poor performance add noise during the fusion. This situation is clear while visualizing the anomaly detection results in the prediction versus ground truth plots.

V. CONCLUSION
We presented a novel approach for binary and multiclass supervised classification and anomaly detection of unknown conditions using ensemble classification and evidence theory (ECET). The uncertainty quantification (UQ) during the training assessed the learning capability of the trained models, whereas the UQ during the inference of the ensemble classifier (EC) tracked the uncertainty changes while feeding testing data. This last feature is used to detect anomalies because a high uncertainty corresponds to a high likelihood of an anomaly, as demonstrated in the results. The pool selection strategy plays a vital role in selecting the classifiers that form the EC. The combination complexity is reduced, and the resulting pool allows the selection of an EC that is optimal with respect to the individual classifiers (e.g., NN-based and ML-based classifiers). We tested different binary and multiclass ensemble classifiers using the Tennessee Eastman benchmark, obtaining favorable results (depending on the EC). Selected ECs present comparable, and in some cases superior, results with respect to literature. We proposed a hybrid strategy based on the DSET and Yager UQ for the detection of anomalies. The strategy was successful because the EC could detect anomalies while feeding unknown data. The latter was also validated by comparing with other literature approaches.
Future research includes using the anomaly detection approach for an automatic update of the ensemble classifiers. Different pool selection strategies will remain in focus, since they proved to have a significant impact on the ensemble selection. Further experiments will include training different classifiers using different feature reduction strategies with the same model.
This might be a trivial question when having N_C = C(5, 3) = 10 combinations, but in the case of a pool size of N_P = 10 classifiers and an ensemble size of N_ES = 5, there are N_C = C(10, 5) = 252 combinations. This situation brings a new challenge, since each combination is a potential experiment that needs to be performed. Considering all the ensemble

Table I :
List of symbols and abbreviations.
F_D: Fusion using DSET rule of combination. F_Y: Fusion using Yager rule of combination.
Algorithm 1: Ensemble Classifier using Evidence Theory.
• Networks of Classifiers or layered classifiers: Decision trees, committees of classifiers, and artificial neural networks or neural-network-based (NN) models belong to this category. Decision trees use a decision-like model in which the attributes are tested using decision nodes, and their outcomes generate new branches. The leaf nodes contain the class labels. The basic unit of an NN-based model is the perceptron, represented as ŷ = f(w^T x + b), where x represents the inputs, w are the weights, ŷ is the predicted output, and b is the bias. The basic NN-based model that can handle non-linearities is the multilayer perceptron, which uses (several) layers of neurons and a non-linear output layer.
• Committees of Classifiers: This method implies training several algorithms of the same type (e.g., decision trees) in parallel and performing a majority voting on the classifier predictions to obtain the class labels. The committees of classifiers include the approaches using boosting (e.g., AdaBoost, XGBoost) and bagging (e.g., Random Forest).

Table III :
TE Dataset fault cases.

Table XII :
Anomaly detection results of selected ensemble binary and multiclass classifiers.

Table XIII :
Classification results comparison of binary ensemble classifiers with respect to literature using FDR.
The results for fault cases 2 and 6 are comparable, showing slight differences. Table XV presents the results of the same ECs and literature approaches of Table XIV but using the metric FDR instead. The overall classification results are comparable with respect

Table XIV :
Multilabel classification comparison of ensemble classifiers with respect to literature using the hard faults of TE and F1-score.

Table XV :
Multilabel classification comparison of ensemble classifiers with respect to literature using the hard faults of TE and FDR.

Table XVI :
Anomaly detection comparison of ensemble classifiers with respect to literature using the hard faults of TE and F1-score.
The multiclass EC H3-4 presents better results with an average FDR of 73.76%, in some cases comparable to literature, specifically to PCA with 76.68%, AE with 76.56%, and AAE with 78.55%. The multiclass EC M3 presents the best average FDR with a score of 87.97%, which is superior to DPCA-DR and modified partial least squares (MOD-PLS) with scores of 83.51% and 83.83%, respectively. The results of M3 are mixed while considering all the fault cases. However, the ability of the EC to identify the hard faults is remarkable. M3 achieved an FDR of 91.88%, 90.75%, 91.25%, and 94.13% for fault cases 3, 9, 15, and 21, respectively. The best literature result for fault case 3 corresponds to AAE with an FDR of 34.88%, whereas the highest result for fault case 9 is also achieved by AAE with an FDR of 33.62%. The best result for fault case 15 corresponds to DPCA-DR with an FDR of 38.5%, whereas the highest result for fault case 21 is achieved by MOD-PLS with an FDR of 72.66%.

Table XVII :
Anomaly detection comparison of ensemble classifiers with respect to literature using all the faults of TE and FDR.