Deep ensemble-based classifier for transfer learning in rotating machinery fault diagnosis

Nowadays, intelligent models can correctly detect faults by analysing signals from rotating machinery. However, most studies are run in controlled environments, and performance in real-world industrial environments is not yet fully validated. Transfer learning is therefore a suitable tool for implementing fault diagnosers, but the topic is still under development and challenges persist. This paper proposes a framework for creating accurate 1D-CNN-based fault classifiers that can be transferred between different rotating machines and working conditions. Multiple Bayesian processes select architecture parameters and hyperparameters that minimize a loss function related to their transferability to other machines and to the same machine under different operating conditions (such as load and engine speed). The resulting model is fitted to heterogeneous fault diagnosis data, yielding a 1D-CNN ensemble that improves on the performance of the unitary model. Subsequently, the transfer learning capability of the ensemble is analyzed on two source datasets using function- and parameter-based transfer, and the results are compared with classical fault diagnosis classifiers. Finally, additional transfer operations on five target domain datasets validate our framework on limited labeled samples and allow interpretation of the ensemble results. The ultimate goal is an ensemble that generalizes fault diagnosis on rotating machinery for easy implementation and updating in industrial settings.


I. INTRODUCTION
Condition Monitoring (CM) in rotating machinery has been instrumental in preventing failures, scheduling maintenance, and estimating life expectancy. Advances in this field have grown as artificial intelligence has been introduced. CM in rotating machinery has traditionally been assessed by measuring different signals (e.g., vibrations and acoustic signals). Classical approaches extract various time, time-frequency, and frequency domain features from those signals. These features are used to create Machine Learning (ML)-based intelligent models to detect anomalies and faulty conditions. Novel approaches introduce Deep Learning (DL) techniques to take advantage of their generalization capabilities and their ability to extract knowledge without explicit feature extraction [1]. Nonetheless, real-world industrial datasets are generally unavailable due to their acquisition cost and commercial sensitivity. The trend is to build controlled environments where failures are induced in the rotating machinery under multiple working conditions. DL models can be designed for CM in such an environment. However, the transition from the controlled setup to the industrial application is not always successful. Moreover, finding the right DL model is a time-consuming task that can only be successfully addressed with appropriate computational resources.
In more mature DL applications, such as image processing and natural language processing, data collection needs have been alleviated with Transfer Learning (TL). TL seeks to combine a source domain that contains rich historical data with a new or target domain that lacks data. TL has a pivotal role in addressing data constraints and model implementation in real-world applications. In CM, TL is commonly approached following three transfer styles [2]: i) transfer between machines aims at conveying knowledge from laboratory-controlled to industrial configurations [3], ii) transfer between working conditions (e.g., speed or load) creates models that maintain performance in scenarios different from those in which they were built [4], and iii) finally, transfer between multiple fault types aims to leverage similarities between their feature indicators, for example, from a rich data source of bearing fault problems to a sparse data target of gear fault problems [5]. More on this topic is presented below.

A. RELATED WORKS
The existing literature on transferring between working conditions is extensive and focuses on varying the rotation speed and load on the machine. For example, [4], [6], [7] showed that classical Neural Networks (NN) and Convolutional Neural Networks (CNNs) trained on the source domain result in a set of parameters and hyperparameters that can be reused for another working condition.
Another prevalent trend is to transform measured signals into images and tune state-of-the-art models from the field of image processing. Fast Fourier transforms (FFTs), spectrograms, and wavelet transforms provide visual representations suitable for fault detection with TL [3], [7], [8]. For instance, [9] transformed time-series into image representations using the frequency-slice wavelet transform (FSWT), and [10] plotted the raw time-series behavior of vibration signals; both tuned well-known image classification architectures for fault diagnosis. This approach is accurate; however, it has disadvantages such as loss of performance in the face of noisy measurements, high model tuning cost, and lack of interpretability, among others. Finally, most image transformations are applied to cyclostationary signals or constant operating conditions; thus, faults under varying operating conditions (referred to as non-cyclostationary) might be misclassified.
An alternative approach is to transfer features or samples from a source domain to a target domain. This augments the target domain data and is achieved by distribution matching using methods such as Transfer Component Analysis (TCA), Joint Distribution Adaptation (JDA), Adversarial Domain Adaptation (ADA), and others [11]-[13]. The raw signal from the source domain can also enrich the target domain data by using the Maximum Mean Discrepancy (MMD), which measures the difference between the source and target distributions according to the DL model built [3], [14]. One of the main disadvantages of these approaches is the cost of training such models; the TL process itself also becomes computationally greedy.
In this work, we are interested in defining DL architectures that are transferable between machines and working conditions for CM. In this regard, ensembles of classifiers have shown versatility for fault classification and TL. Traditionally, the base models of an ensemble are obtained with bootstrap sample versions of the original training data (as in the Random Forest approach); however, for CM, different approaches propose to train the ensemble on different working conditions or to modify the transfer method to maximize the classification performance on a target domain. For instance, the work in [15] presents an ensemble with base models created on different channels and then transferred by tuning the parameters on different working conditions. In that work, an arbitrary CNN was defined using signal images as feature maps, together with a combination method based on adjustable weighted majority voting. The work reports higher performance than traditional methods; however, performance with few samples per class is not evaluated, and the tests were performed on the same machine by varying the working conditions. On the other hand, [16], [17] propose to transfer a 1D-CNN deep ensemble by defining a different MMD kernel for each base model. Similarly, the work in [18] trains a set of CNNs to obtain feature extraction functions, which are coupled with auto-encoder decision functions; bootstrap is used to define the training data subsets for each base model. The resulting experiments show improved performance; however, their transferability is not demonstrated. [19] presents a similar approach to the latter. In the aforementioned works, the ensemble results are mostly combined by voting for the most frequent class. Moreover, they do not consider how to correctly define deep learning architectures suitable for TL operations in fault diagnosis. These last two points are the focus of this paper.

B. CONTRIBUTIONS
The main objective of this work is to find a fault diagnosis classifier that can be easily transferred between different types of rotating machines and to the same machine under different working conditions. To do so, a deep 1D-CNN ensemble is constructed with various data sources to better generalize the fault patterns. The idea is to find a set of invariant features or decision functions to increase performance in TL operations. Our contributions are threefold:
• We propose a framework in which 1D-CNN base models are created independently with different source domains. A Bayesian optimization process is then carefully defined for each base model to find its parameters and hyperparameters. The Bayesian objective function maximizes not only the traditional model performance on a test set but also its performance after tuning its parameters over other data sources and different working conditions.
• We propose an ensemble model built from the results of the previous item. The results of the Bayesian processes are joined by a combination method. Three combination methods were tested, and the best one was selected for further analysis.
• Finally, we provide the elements to validate such an ensemble in different transfer tasks (transfer from one working condition to another, or from one machine to another). Throughout this paper, classical approaches are compared with the 1D-CNN-based result, especially in the presence of few samples per class.
Although the experiments in this paper are developed on datasets from controlled environments, extensive experiments are proposed to demonstrate the transferability of the ensemble model under different conditions, such as working conditions, type of machinery, type of sensor, and number of samples available.
The resulting validated ensemble model, which is publicly available, becomes valuable knowledge for practitioners in the field because it can reduce the effort of collecting large amounts of data to create deep learning solutions. Moreover, the results of the framework, composed of multiple Bayesian processes, also alleviate the effort of finding the best Deep Learning architecture to solve the problem.
The remainder of this paper is organized as follows. Section II presents 1D-CNN and TL foundations. Section III presents the framework of our proposal. Section IV describes the two source and the five target domain datasets. Section V reports the experiments and results of the framework. Finally, Section VI outlines the conclusions of the paper.

A. 1D CONVOLUTIONAL NEURAL NETWORK
1D Convolutional Neural Networks (1D-CNNs) are derived from the ordinary Neural Network with some structural changes. In general, a 1D-CNN is composed of layers with learnable weights or kernels W and biases b in their neurons. In such a layer, a convolution operation is performed as in the following first-layer example:

Z_1 = W_1 ∗ X_1 + b_1,    (1)

where W_1 ∈ IR^(d×k) is a weight vector with k one-dimensional filters of size d, and X_1 ∈ IR^(n×c) is an n-dimensional input with channel depth c. Eq. (1) is then passed through a nonlinear operator g, which generates the layer's activation, i.e., A_1 = g(Z_1). After this operation, it is common, but not mandatory, to apply a Pooling layer, which reduces the dimensions of the activation result. The convolution operations commonly represent an automatic feature extraction function f_fe. In contrast, the final FC layers are viewed as a mapping function f_m, where a Softmax operation computes the probability distribution of the possible classes Cl_i, i.e., P = p(ŷ = Cl_i | X_1). An optimization process should be used to obtain the parameters {W, b} and hyperparameters (number of hidden layers and units, activation function g, learning rate, and others) that minimize a loss function, which in turn measures how well the model fits the data.
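As a concrete illustration, the first-layer convolution and its activation can be sketched in plain NumPy. This is a minimal sketch with toy shapes chosen for the example only; the actual filter sizes, depths, and activation g are selected by the Bayesian search described later.

```python
import numpy as np

def conv1d_layer(x, W, b):
    """One 1D convolutional layer: x is (n, c), W is (d, c, k), b is (k,).

    Returns the pre-activation Z of shape (n - d + 1, k) ('valid' convolution).
    """
    n, c = x.shape
    d, _, k = W.shape
    Z = np.empty((n - d + 1, k))
    for t in range(n - d + 1):
        window = x[t:t + d, :]                 # (d, c) slice of the input
        # contract the (d, c) window against every filter at once
        Z[t] = np.tensordot(window, W, axes=([0, 1], [0, 1])) + b
    return Z

def relu(z):
    """Nonlinear operator g used in this sketch."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 1))       # toy signal: 32 samples, 1 channel
W = rng.normal(size=(5, 1, 4))     # 4 one-dimensional filters of size 5
b = np.zeros(4)
A1 = relu(conv1d_layer(x, W, b))   # layer activation A1 = g(Z1)
print(A1.shape)                    # (28, 4)
```

Stacking several such layers (with pooling in between) forms the feature extraction function f_fe, while dense layers and a Softmax on top form the mapping function f_m.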

B. TRANSFER LEARNING
The goal of TL is to use an existing model trained under the best conditions for one application (source domain) and use it in another similar application (target domain).
In TL, a source domain is defined as D_s = {X_s, Y_s}, with x_i^s a sample of n-dimensional measurements and y_i^s its label. Equivalently, a target domain is defined as D_t = {X_t, Y_t}. The premise in TL is that similarities can be found between decision functions and datasets. To find such similarities, domain adaptation approaches are implemented. In the following, we briefly describe the domain adaptation approaches used in this work.

1) Parameter-based transferring
A parametric model in the source domain is defined by its parameter space Ω_s and hyperparameter space Υ_s. The hypothesis is that the parameter spaces of the target and source domains intersect, i.e., Ω_s ∩ Ω_t ≠ ∅ and Υ_s ∩ Υ_t ≠ ∅. This approach has been extensively studied and is the most attractive approach in TL [20].
In most cases, it is sufficient to find a candidate DL model with X_s. The resulting parameters and hyperparameters are copied into a new model and undergo small deviations from the original space; they are modified using a new cost function and a fine-tuning regime with X_t [3], [4], [6]-[8].
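A minimal sketch of this parameter-based regime, using a hypothetical dictionary-of-weights model rather than the paper's actual implementation: the source weights are copied, the feature-extraction layers are marked frozen, and only a freshly initialised head is left trainable.

```python
import numpy as np
from copy import deepcopy

def transfer_model(source_params, n_classes_target, rng):
    """Parameter-based transfer: reuse the source feature-extractor weights
    and re-initialise only the mapping head for the target task."""
    target = deepcopy(source_params)
    d = target["head_W"].shape[0]              # feature dimension is unchanged
    target["head_W"] = rng.normal(scale=0.01, size=(d, n_classes_target))
    target["head_b"] = np.zeros(n_classes_target)
    target["frozen"] = ["conv_W", "conv_b"]    # layers excluded from fine-tuning
    return target

rng = np.random.default_rng(1)
source = {
    "conv_W": rng.normal(size=(5, 1, 4)),      # learned on the source domain
    "conv_b": np.zeros(4),
    "head_W": rng.normal(size=(4, 10)),        # head for 10 source classes
    "head_b": np.zeros(10),
}
target = transfer_model(source, n_classes_target=6, rng=rng)
print(target["head_W"].shape)  # (4, 6): new head, reused features
```

Fine-tuning then updates only the parameters not listed under "frozen", which is exactly why the number of trainable parameters drops so sharply in the experiments below.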

2) Feature-based transferring
The feature spaces of the source and target are matched. Two strategies exist to perform such matching: symmetric and asymmetric feature transfer. In symmetric feature transfer, both domains are mapped to a common latent feature space where their probability distribution discrepancy is minimized, as proposed by the TCA algorithm [11]. Another example is automatic feature extraction with CNNs, which can produce a feature space in the source domain suitable for the target domain by fine-tuning. In contrast, an asymmetric method transforms one domain into the other (source or target) by minimizing the distance between similar samples, as in JDA [12].
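The MMD mentioned in the related works is a common measure of the distribution discrepancy that these matching strategies minimize. A minimal NumPy sketch with an RBF kernel follows (a biased estimator on toy Gaussian samples, not tied to any dataset in this paper):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    """Squared Maximum Mean Discrepancy (biased estimate) between samples."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean())

rng = np.random.default_rng(2)
same = mmd2(rng.normal(size=(100, 3)), rng.normal(size=(100, 3)))
shifted = mmd2(rng.normal(size=(100, 3)), rng.normal(loc=2.0, size=(100, 3)))
print(same < shifted)  # True: a domain shift increases the discrepancy
```

Domain adaptation methods use exactly this kind of quantity as a training penalty so that the learned feature space brings the two domains closer together.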

C. TL CONTEXTS FOR FAULT DIAGNOSIS
Different scenarios can be set up in rotating machinery due to operational requirements. Such scenarios are commonly reproduced in controlled environments for data collection purposes and are referred to as contexts in this paper. A context consists of different machinery configurations defined by the tuple

C_i = (T_i, F_i, O_i),

where T_i corresponds to the machinery type, F_i is the fault type, and O_i represents the operating condition. A context C_i leads to the definition of one or more sub-source domain datasets S_i.
One can extract a finite number of contexts by finding all the possible combinations between the working conditions, fault types, and machinery. In this work, the possible context compositions are:
• One machine and type of fault operating in different conditions, i.e., Comb({O_1, ..., O_r}), where Comb is a function that creates all possible combinations without repetition, and r is the total number of operating conditions at T_i, i.e., the number of speeds and loads at which the failures were collected.
• Several machines intersected by their working condition.
The contexts are purposely selected to find or highlight important results throughout the evaluation process. For example, in the first element, we would like to ensure that the resulting model works under different operating conditions on the same machinery. In the second case, we combine operating conditions of different machines to improve the generalization capabilities of the same model in the presence of failures in different devices (e.g., gears and bearings).
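In Python terms, the Comb function corresponds to `itertools.combinations`. A small sketch with hypothetical condition labels (the label strings are illustrative, not the datasets' actual identifiers):

```python
from itertools import combinations

def contexts_one_machine(operating_conditions, size=2):
    """Contexts for one machine/fault type: all combinations of its
    operating conditions without repetition (the Comb function)."""
    return list(combinations(operating_conditions, size))

def contexts_shared_conditions(conds_a, conds_b):
    """Contexts formed by two machines intersected on shared conditions."""
    return sorted(set(conds_a) & set(conds_b))

speeds_loads = ["F1-L1", "F1-L2", "F2-L1", "F2-L2"]   # hypothetical labels
print(len(contexts_one_machine(speeds_loads)))         # C(4, 2) = 6
print(contexts_shared_conditions(["F1", "F2", "F3"], ["F2", "F3", "F4"]))
# ['F2', 'F3']
```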

III. FRAMEWORK
Our framework aims at finding transferable 1D-CNN models between different rotating machines and working conditions (load and speed). To do so, we propose to use various source domains X_s to find such transferable models. Figure 2 outlines the steps to achieve this goal. The k source-domain signals are divided into contexts C_i, and each context into sub-source domains S_i using the definition in Section II-C. Subsequently, the vibration signals in S_i are sliced into windows of 16,384 samples (equivalent to 0.32 s) as proposed by [21]. Furthermore, for each context C_i, an optimal 1D-CNN model is obtained through a Bayesian process. During this Bayesian optimization process, one sub-source domain is used to train the 1D-CNN, and the rest act as sub-target domains. The 1D-CNN is evaluated according to its performance on TL operations over those target domains. This Bayesian process is repeated for all source datasets. The best model (M_best) is selected by evaluating its performance on those TL operations using a function (Γ_1 in Figure 2). Moreover, we propose to build an ensemble whose base models share the architecture of M_best, each tuned to one of the available source datasets. The result of the ensemble is given by a combination method Γ_2. The rationale for using ensembles of models is that they might perform better due to the decreased variance of their results. More details about the framework are presented in the following sections.

A. ARCHITECTURE DEFINITION
One of the most well-known tools for finding DL architectures is the Bayesian search. The work in [22] demonstrated that a random selection of the optimization space yields better results than a grid search. In a Bayesian process, the parameters and hyperparameters of the DL architecture (HP ) are defined as the space of an objective function. The objective function aims to minimize the loss function of the DL architecture by modifying the defined space.
In this work, we propose multiple Bayesian processes to find the 1D-CNN model that performs well on the sub-source domains of each context. The training procedure is unfolded as follows.
The first step is to train a standalone model M_r with random initial parameters and hyperparameters, minimizing a loss function L for each sub-source domain h in each context C_i, i.e., S_h^i, ∀ h ∈ |S_i| and ∀ S_i ∈ C_i.
The next step is to evaluate the versatility of the previous models for transfer operations using all available target data, and to vary the models' parameters to minimize the objective loss E of the Bayesian process:

M_k = M_r → S_j^i, ∀ j ≠ h,

where → denotes a transfer operation: M_r was trained with one sub-source domain and transferred to the remaining datasets by tuning only the top and last layers in M_k. The loss E is defined as a weighted sum, with weight w, of the average accuracy (ACC) and the losses in the training and validation phases:

E = w (1 − ACC) + (1 − w)(L_train + L_val).
The result of this phase is a set of 1D-CNN models (size equals the number of contexts defined in Section II-C). The best and ensemble model are defined in the following section.

B. MODEL SELECTION
In Figure 2, two functions are proposed for model selection. Each function is defined as follows.
• Γ_1 selects the best model, denoted M_best, as the one with the minimum Bayesian loss E among contexts.
• Γ_2 combines the outputs of the base models of the ensemble. Three combination methods will be tested in Γ_2: i) Average: the Softmax layer results are averaged, and the maximum value gives the class output. ii) Stacked: the Softmax layer results are stacked in a vector, and a new Softmax layer defines the class output; this is similar to a weighted combination. iii) Meta: the Softmax layer is discarded, the last Dense layer activations are summed, and a new Dense and Softmax layer then define the class output.
The difference between the above approaches lies in the number of parameters adjusted to make the final decision. The Average and Stacked approaches are based on the decision made by each model using its Softmax layers; they have, respectively, no parameters or a minimum number of parameters to adjust. On the other hand, the Meta method constructs a new nonlinear function (with the Dense layer) that combines the results, thus adding more parameters to fit, which are nonetheless few compared to the parameters of the whole architecture. These approaches will be evaluated to find the one that maximizes the final performance.
The architecture of the final ensemble is illustrated in Figure 3. Since all models have the same convolutional layer dimensions, we can stack the results from each source domain. To evaluate the resulting models, we use datasets from different rotating machines, as detailed below.

IV. DATASETS
The GIDTEC 1 research group created the selected datasets in controlled environments. Configurations of rotating machinery, i.e., gearbox, compressor, and others, were assembled, and artificial faults were introduced (mainly in gears and bearings). The machinery was subjected to different working conditions (loads and speeds). We briefly summarize the Experimental Plans (EPs) of the source domain datasets, which were strategically selected to provide a rich set of vibration signals covering two of the main topics of this paper: defective gears and bearings.

1) EP05
Tooth breakage severity failures were emulated in the driver helical gear of a gearbox. Figure 4 shows the experimental setup. A motor drives the gearbox through a coupling, and a frequency inverter controls the motor speed. The output shaft of the gearbox is connected to an electromagnetic torque load via a belt transmission. Four accelerometers measure the vibration signals in vertical (V_1 and V_2) and horizontal (H_1 and H_2) configurations. Three different speeds and three loads were used as working conditions. The severity failures are listed in Table 1.

2) EP02
Figure 5 shows the configuration of this case. Two bearings (B_1 and B_2) are coupled with a shaft as a supporting device. A motor drives the shaft at different speeds, and flywheels can be mounted on the shaft to induce loads when required. Two accelerometers (V_1 and V_2) are placed vertically, one at each bearing. As in the previous case, three speeds and three loads were tested. Table 2 shows the faults induced in the bearing components.

The experimental plans for the target domain have configurations similar to those of EP05 and EP02. Table 3 summarizes the source and target domain datasets; the latter consist of a reciprocating compressor, single-stage and two-stage gearboxes, among others. The deployed sensors differ from machine to machine; therefore, we selected the vibration measurement closest to the fault condition.

V. EXPERIMENTS AND RESULTS
In Section V-A, the source domains are evaluated in our framework to obtain M_best; the model is validated and compared with state-of-the-art models that use classical feature extraction approaches. In Section V-B, the ensemble model is built and tested on the target domain. Finally, Section V-C presents an analysis and explanation of the results.

A. RESULTS ON THE SOURCE DOMAIN
The raw samples were input to our framework, which resulted in the M_best architecture in Table 4. To validate M_best, we perform transfer operations from one speed to another (F_1 → F_2, for example) by tuning the top layer (the first convolutional layer) and adding a new Dense and Softmax layer after layer 10. With this regime, the trainable parameters are reduced from 394,680 to 11,110. The results are compared with VGG16, Resnet50, Random Forests (RF), and XGBoost. The first two, which are image-based classifiers, use spectrograms of the vibration signals as input. RF and XGBoost use hand-crafted features in the time, frequency, and time-frequency domains [23]. TL with RF and XGBoost is not applicable; therefore, these models are trained from scratch in the target domain. Table 5 shows the performance based on the average accuracy of each transfer operation after 30 repetitions. M_best outperforms the other approaches in the top-layer tuning regime. Table 6 depicts the results after transferring from one machine to another (EP02 → EP05 and EP05 → EP02); entries whose accuracy remained below 0.5 after a maximum of 60 epochs are marked with a dash. M_best does not perform well from EP02 to EP05 in the tuning-top regime; however, by tuning all the parameters, the average accuracy is higher than that of the image-based approaches. It is important to mention that tuning all the layers of M_best implies adjusting 394,680 parameters, whereas tuning the top of VGG16 and Resnet50 involves 25,703,178 and 2,108,426 parameters, respectively. In addition, the average tuning time for M_best is less than 50 s against more than 300 s for both VGG16 and Resnet50 on a GeForce RTX 3090 GPU. The above results validate the construction of M_best. It is noteworthy that RF performs fairly well; consequently, it will be retained for the following tests.
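For reference, the kind of hand-crafted features that feed RF and XGBoost can be sketched as follows. This is an illustrative subset only ([23] defines the full feature set); the 51,200 Hz sampling rate is implied by the 16,384-sample, 0.32 s windows used throughout the paper.

```python
import numpy as np

def handcrafted_features(x, fs):
    """A few classical condition-monitoring features: time-domain
    statistics plus a simple frequency-domain descriptor."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "rms": rms,
        "kurtosis": np.mean((x - x.mean()) ** 4) / np.var(x) ** 2,
        "crest_factor": np.max(np.abs(x)) / rms,
        "spectral_centroid": np.sum(freqs * spec) / np.sum(spec),
    }

rng = np.random.default_rng(3)
x = rng.normal(size=16384)              # one window, as in the paper
feats = handcrafted_features(x, fs=51_200)
print(sorted(feats))  # ['crest_factor', 'kurtosis', 'rms', 'spectral_centroid']
```

Each window yields one such feature vector, and the tree-based classifiers are trained directly on these vectors instead of the raw signal.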

B. RESULTS ON THE TARGET DOMAIN
In this case, the vibration signals were also sliced into windows (win) of 16,384 samples. However, for target domains containing fewer than 40 samples per class, a data augmentation process was proposed, which consists of slicing the original raw signals with overlap to produce more samples. For instance, for 3 samples per class, we randomly select 1 vibration signal and take 3 successive samples of size win. In Figure 6, a 50% overlap increases the number of samples to 5. For our experiments, the best trade-off between performance and number of samples was found with a 25% overlap.

The ensemble is built with M_best using EP02, EP05, and EP02+EP05 as source domains. The combination methods are Average (Avg), Stack, and Meta (refer to Section III-B). For realistic TL scenarios, the number of labeled samples is limited to 10, 20, and 40 per class; the rest of the samples are marked as unlabeled (test set). Figure 7 shows the accuracy on the EP03 test set, while Figure 8 shows the results on EP05. The Meta model (in orange) outperforms RF in most cases. It is also notable that the standalone 1D-CNN (M_best, in dark blue) provides the worst accuracy. This is a clear indication that the ensemble approach defined by the Meta model is more beneficial for TL in our application. Therefore, the rest of the experiments focus on the Meta model's performance against the second-best model (RF).

For EP06_2 and EP11, we vary the size of the ensembles and the sensor. In Table 7, the transfer operation is performed on a multi-label classification task due to the multi-fault signals in EP06_2. Meta-5, created from different source domains, yields better results. It suffices to use only one accelerometer placed vertically at either bearing (V_1 or V_2). However, the accuracy is considerably low when using 10 and 20 labeled samples.
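The overlapped slicing used for augmentation can be sketched as follows; win and the overlap fractions follow the text, while the signal length here is hypothetical:

```python
import numpy as np

def slice_windows(signal, win=16384, overlap=0.25):
    """Slice a raw signal into windows of length `win` with fractional
    overlap; overlap=0.0 reproduces the plain non-overlapping windowing."""
    step = int(win * (1.0 - overlap))
    starts = range(0, len(signal) - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

signal = np.arange(3 * 16384, dtype=float)       # 3 windows' worth of signal
print(len(slice_windows(signal, overlap=0.0)))   # 3 samples without overlap
print(len(slice_windows(signal, overlap=0.5)))   # 5, as in the text's example
```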
This is because one of the working conditions yields very low performance, whereas at another speed the accuracy was above 0.8 on average with 10 samples. Therefore, it is more beneficial to use the working condition that maximizes the performance and then transfer to another working condition, as in Table 5. On the other hand, in Table 8, the vibration and acoustic measurements of EP11 were used (denoted V_1 and A_1, respectively). In general, the acoustic signals yield better results using Meta-5 for 10 and 20 labeled samples. Our approach works correctly for acoustic signals and probably for any cyclostationary signal.

C. ANALYSIS
Regarding the results in the source domain, Table 6 indicates that accurate transfer between EP 02 and EP 05 is only possible in one direction (EP 05 → EP 02) by tuning only the top layers. To understand the failed transfer (EP 02 → EP 05), we analyze the automatic features from the CNN layers in the frequency domain. To do so, we compute a modified FFT spectrum, which consists of convolving the intermediate time-series with the FFT of the filters of each layer. The result of each layer is summed to obtain a modified FFT spectrum. More details on this operation can be found in [21].
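A loose sketch of the idea behind the modified spectrum, via the convolution theorem (convolving in time multiplies spectra); a single crude low-pass kernel stands in for the learned CNN filters, and the summation over layers is reduced to a sum over filters:

```python
import numpy as np

def modified_spectrum(x, filters):
    """Sum, over filters, of the signal spectrum weighted by each filter's
    frequency response (kernel zero-padded to the signal length)."""
    X = np.abs(np.fft.rfft(x))
    total = np.zeros_like(X)
    for w in filters:                          # each w is a 1-D kernel
        H = np.abs(np.fft.rfft(w, len(x)))     # zero-padded response
        total += X * H
    return total

rng = np.random.default_rng(4)
x = rng.normal(size=1024)                      # white-noise stand-in signal
spec = modified_spectrum(x, [np.ones(8) / 8.0])   # 8-tap moving average
print(spec.shape)                              # (513,): rfft bins
```

With a low-pass kernel, the high-frequency bins of the modified spectrum are attenuated relative to the low-frequency ones, which mirrors the filtering behavior observed in Figure 10.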
In Figure 10, the modified spectrum of a sample in the transfer operation EP02 → EP05 is compared to the original spectrum. The figures show that the automatic feature extraction filters out the higher frequencies and highlights the frequencies of interest. The average Dynamic Time Warping (DTW) distance between the original FFT and the modified spectrum over 1000 samples is 65.92 for tuning the top layers and 81.38 for tuning all layers, whereas the average DTW between the two modified spectra is only 14.76. The automatic feature extraction of the tuning-top and tuning-all regimes is therefore very similar. This analysis led us to conclude that the difference between the models lies in the mapping function. To study the mapping function of the transferred models, we visualize the output distribution of the last Dense layer of two ensembles and their base models. The ensembles were tuned with 10 labeled samples from the bearing severity dataset EP06_1. The output space of the layer is reduced to a 2D space with t-distributed Stochastic Neighbor Embedding (t-SNE). Figure 11 shows that the contribution of each ensemble helps to separate the faults better and, in consequence, to build an accurate fault classifier.
One important result of this analysis is that it would be possible to implement a single automatic feature extraction stage and create an ensemble of mapping functions only. This new configuration would have given the same results as our adopted approach at a lower computational cost. Finally, the models and framework are publicly available in the GitHub repository 2.

VI. CONCLUSION
In this work, 1D-CNNs were defined with Bayesian optimization processes to find the architecture best suited for TL operations. Initially, different feature extraction and mapping decision functions were obtained from which the best ones were selected for the base models of the ensemble. Next, the base models were trained on different data sources. We tested different combination methods and compared our approach with classical TL methods and stand-alone 1D-CNN. The results show improved performance, especially when few labels are available per class.
An analysis of the ensemble model revealed that the feature extraction remains unchanged from one base model to the other, thus demonstrating that the resulting functions extract invariant features. It is the mapping function that plays a key factor in the data discrimination of the ensemble model.
Finally, the resulting ensemble model represents a useful tool for practitioners in the field. Challenges remain in improving accuracy when labeled samples are limited. In future work, the target domains can be enriched with the source domain, unlabeled data sources, or artificial data from finite-element-based simulations or physical models of rotating machinery. Further efforts are also planned to cope with non-cyclostationary signals.