Challenges and Opportunities of Deep Learning Models for Machinery Fault Detection and Diagnosis: A Review

In the age of Industry 4.0, deep learning has attracted increasing interest for various research applications. In recent years, deep learning models have been extensively implemented in machinery fault detection and diagnosis (FDD) systems. The automated feature learning process of the deep architecture offers great potential to solve problems with traditional fault detection and diagnosis (TFDD) systems. TFDD relies on manual feature selection, which requires prior knowledge of the data and is time intensive. However, the high performance of deep learning comes with challenges and costs. This paper presents a review of deep learning challenges related to machinery fault detection and diagnosis systems. The potential for future work on deep learning implementation in FDD systems is briefly discussed.


I. INTRODUCTION
Safety and reliability are key factors in industrial operations. Rotating machinery is a vital component in many industries, and it is prone to failure due to harsh working conditions and long operational times [1], [2]. Examples of rotating machinery components include gears [3], pumps [4], bearings [5], shafts [6], blades [7], motors [8] and engines [9]. Failures in rotating machinery should be detected as early as possible to prevent critical damage [10] and sudden halts in machine operation. Failures may cause delays in operations and, consequently, tremendous economic loss [11]. For example, petrochemical industries lose around 20 billion dollars per year due to faults in their machine components [12]. According to a report by Duan et al., maintenance accounts for more than 60% of the total cost of aircraft engine components [13]. In the worst case, a machinery component failure may lead to loss of human life. Elasha et al. discussed a case in which the failure of a planet gear in an aircraft caused an accident that cost 16 people their lives [14].
The associate editor coordinating the review of this article and approving it for publication was Kezhi Li.
Fault detection and diagnosis (FDD) is crucial to preventing unexpected breakdowns of machinery and ensuring production efficiency and operational safety. Fault detection and diagnosis systems are categorised into model-based, data/signal-based and knowledge-based approaches [15], [16]. In modern industry, signal-based or data-driven approaches have attracted more attention from researchers [17] because these approaches provide high diagnostic accuracy and do not require empirical estimation of physical parameters [18]. Data-driven approaches have two drawbacks: they depend on good data preparation, and both training and testing samples should be drawn from the same data distribution [19]. It is important to remedy these deficiencies, and thereby increase the efficiency of data-driven approaches, by implementing deep learning models.
With the rapid development of computer systems, artificial intelligence (AI) methods have been broadly used in many applications to assist humans in interpreting data trends. As illustrated in the diagram of AI systems in Fig. 1, deep learning (DL) is a subarea of machine learning. Deep learning has been used in many applications including natural language processing [20], image processing [21], robotics [22], FDD systems [23] and medical applications [24]. The rapid growth of DL implementation in many areas is due to several reasons, including the increase in research on machine learning models and the increase in graphics processing unit (GPU) development [25]. The complex architecture of DL requires fast computation for training. Deep learning models provide an automated feature learning process that can replace the manual feature selection of traditional fault detection and diagnosis (TFDD).
In recent years, the deep learning models most extensively used in FDD systems have been the convolutional neural network (CNN), stacked autoencoder (SAE), restricted Boltzmann machine (RBM), deep belief network (DBN) and deep neural network (DNN) [26]. The SAE, RBM and DNN models are usually called deep neural networks, and are described in [27]-[29]. Every DL model has its own variants, which have been briefly discussed in [30]. Meanwhile, Ravi et al. conducted a comparative study of the advantages and disadvantages of each deep learning model [31].
This review is intended to provide a brief discussion of challenges and future development of DL applications in FDD systems for rotating machinery. It does not describe each DL architecture, as such information has been discussed in detail in [32]-[34]. Meanwhile, the advantages and disadvantages of every DL model used in FDD systems have been discussed in [35].
The remainder of this paper is organised as follows: Section 2 presents a review of the stages of FDD. Section 3 discusses the advantage of DL over shallow learning models. Section 4 describes the challenges of DL in an FDD system. The future development of DL models in FDD systems is discussed in Section 5. Section 6 concludes the paper.

II. FAULT DETECTION AND DIAGNOSIS STAGES
Fault detection and diagnosis can be divided into four stages: fault detection, fault identification, fault severity assessment, and fault growth and remaining useful life prediction. Table 1 presents the available references on each FDD stage for shallow machine learning (SML) and DL models. In FDD systems, data acquisition is important to obtain the signal that reflects the physical condition of the machinery component. In practice, there are several types of monitoring methods, including acoustic emissions (AE), vibration, pressure, oil analysis and thermal analysis [2]. Fig. 2 shows the sensors that can be used in FDD systems, arranged by detection level [36]. Chacon et al. stated that AE detects faults earlier than other sensors [37] and can thus provide early detection when faults are present.

A. FAULT DETECTION (STAGE 1)
Fault detection is the examination of faults present in machinery components. A recent study found that most research on fault detection focuses on incipient faults so that the next stage of the FDD process can be performed. Incipient fault detection is important to prevent catastrophic failure [38]. From approximately 2011 to 2015, incipient fault detection focused on AE because of the high sensitivity of the AE sensor. Ferrando Chacon et al. [39], Hiremath and Reddy [40], and Kilundu et al. [41] are among the authors who used AE as a monitoring method for detecting incipient faults. However, in recent years, researchers including Klaussen and Robbersmyr [38], Seshadrinath et al. [42], Zhao and Jia [43] and Lu et al. [44] have successfully diagnosed incipient faults using vibration analysis with more advanced signal processing techniques. Wei et al. discussed in detail early fault diagnosis in rotating machinery components [45].

B. FAULT IDENTIFICATION (STAGE 2)
Fault identification is the examination of the location and type of faults in machinery components. Fault location identification is usually performed on bearing components; faults can be located in the outer race, inner race or rolling element [46]. Meanwhile, fault type identification is performed on gear components, and types include chipped gear tooth [47], missing gear tooth [47], worn gear [48] and tooth root crack [49]. Most DL models have been extensively used up to Stage 2.

C. FAULT SEVERITY ASSESSMENT (STAGE 3)
Fault severity assessment is the process of determining the size of a fault located in a mechanical component. Kumar et al. have proposed a potential method of estimating fault size in bearing components using the integration of signal processing and a machine learning (ML) model [50]. The proposed system requires a good signal processing method to reveal the vibration peak produced when contact occurs between the rolling element and the surface of a fault.

D. FAULT GROWTH AND REMAINING USEFUL LIFE PREDICTION (STAGE 4)
The remaining useful life (RUL) process is the prediction of the life cycle of a machinery component. Meanwhile, the prediction of fault growth is based on the size of the fault after a certain cycle. The output of RUL analysis is the predicted period/time of the component breakdown. A Stage 4 analysis might be useful in an FDD system if the output RUL prediction is expressed in terms of the relationship between time and size of fault growth. To the best of our knowledge, no current research has done this analysis. Commonly, the features that are most sensitive to fault characteristics are used. Si et al. [51] conducted a review of statistical approaches for determining remaining useful life. Meanwhile, Nguyen et al. provided a detailed discussion of failure prognostics using a deep learning model [52].

III. A COMPARATIVE STUDY OF DEEP LEARNING AND SHALLOW LEARNING MODELS IN FDD SYSTEMS
In recent decades, FDD systems have typically used the support vector machine (SVM), k-nearest neighbour (KNN), decision tree (DT) and artificial neural network (ANN) models. These types of models are sometimes called shallow machine learning (SML) models [11]. The FDD systems based on SML models are called traditional fault detection and diagnosis (TFDD) systems, which are based on statistical data-driven approaches [76]. For a model to perform well in a TFDD system, features must be carefully selected from the massive number available in order to reduce the computational complexity of the SML model and prevent degradation of classification accuracy [77]. In the SML model, the feature selection process is introduced to reduce irrelevant and redundant features [78]. Hui et al. proposed an improved wrapper method to select the best features for an SVM model, which improved its computational efficiency [79].
Moreover, SML models have another deficiency in terms of training size. Beyond a certain training size, model performance stagnates, as shown in Fig. 3. In contrast, DL models are highly effective when using large training samples [80]. Cao et al. conducted a comprehensive study on training data size in which the authors showed that the diagnosis performance of a DL model increased as the size of the training data set increased [81].
Traditional fault detection and diagnosis systems require five important steps, as shown in Fig. 4: data acquisition, data processing, feature extraction, feature selection/dimensional reduction and feature classification [82]. Finding suitable methods for each step is crucial and requires a trial-and-error process. Each application of a traditional FDD system uses a unique set of methods. For example, Xu et al. performed the TFDD process on bearing components by using a modified distance discriminant technique, fuzzy ARTMAP, a correlation measure, a Bayesian belief method and a selective ensemble of multiple classifiers [83]. Meanwhile, Cerrada et al. conducted gearbox fault diagnosis using the following combination of methods: correlation-based feature selection, entropy-based feature selection using random forest (RF), the Mamdani-type fuzzy model, hierarchical clustering and membership degree estimation using a fuzzy approach [84].
Deep learning models do not require a combination of methods for each TFDD step; the process is done using the multiple hidden layers of the deep learning architecture [33]. However, under certain conditions, DL models require a signal/data processing step if the signal/data is too noisy. For example, Liu et al. used variational mode decomposition (VMD) to process the vibration signal obtained from a planetary gearbox and fed the processed signal into a CNN model [85]. Shao et al. used a dual-tree complex wavelet packet to refine the fault characteristics of the vibration signal and fed the data into an adaptive deep belief network (DBN) [86]. In addition, Guan et al. used empirical mode decomposition (EMD) to extract the fault signal that carries more information regarding the physical condition of the machine [87]. The signal reconstructed by EMD is transformed into the frequency domain, and the DBN model is used for the fault diagnosis process. However, these studies did not compare the performance of the DL model on the raw time-domain signal with its performance on the processed signal. Hence, the performance of the deep learning model with and without signal processing cannot be discussed.

A. CHALLENGES OF DEEP LEARNING MODELS IN FDD SYSTEMS
The challenges of a DL model are related to its architecture and training process. Even though there is extensive published literature on DL implementations in FDD systems, [89]. However, several challenges must still be overcome for the DL model to perform effectively in FDD analysis.

B. THE TRAINING PROCESS OF A DEEP LEARNING MODEL
The training process is a crucial part of a DL model. The model is first trained by being given a set of example data, and the parameters (biases and weights) are fine-tuned using a backpropagation algorithm. However, three factors should be taken into consideration before and during the training process: the characteristics of the input data, the size of the dataset and the architecture of the DL model. These are discussed in detail as follows.

1) CHARACTERISTICS OF INPUT DATA
Division of the continuous signal into several segments is a common process in FDD systems. The segmented signal is distributed into training and testing samples for the analysis. There are two types of segmentation processes that can be performed on a continuous signal, as shown in Fig. 5: segmentation without overlapping and segmentation with overlapping. Overlapping is a data enhancement or data augmentation process that increases the number of data samples for better generalisation of DL models [90]. Determining the correct segmentation length is a crucial part of the analysis in order to preserve the important features in each segmented signal. For example, Duong and Kim segmented the continuous signal into lengths of 500 and 1000 data points for bearing fault analysis. Further details regarding signal segmentation can be found in the research of Jing et al., who conducted a comprehensive study on the segment length of a continuous vibration signal [91]. Meanwhile, Chen et al. analysed segment lengths from 64 to 1024 in increments of 64 using the DBN model. The result showed that diagnosis performance increased with the increase in segment length [92]. Several authors discussed the relation between the number of data points and segment length by proposing the mathematical equations shown in Table 2. It should be noted that knowledge of the data is required to select the length of segmentation. Regarding overlap, Liu et al. used 50% overlap on the continuous vibration signal [93], and Ma et al. overlapped the signal by 0.1 seconds out of the total length of 0.5 seconds [94].
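The two segmentation types can be illustrated with a minimal Python sketch. The function name `segment_signal`, the handling of trailing samples (they are dropped) and the toy signal are assumptions made for illustration, not taken from any of the reviewed works.

```python
def segment_signal(signal, segment_length, overlap=0.0):
    """Split a 1-D sequence into fixed-length segments.

    overlap is the fraction of each segment shared with the next one
    (0.0 = no overlap, 0.5 = 50% overlap). Trailing samples that do not
    fill a whole segment are dropped.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Distance between consecutive segment starts; at least one sample.
    step = max(1, int(round(segment_length * (1.0 - overlap))))
    last_start = len(signal) - segment_length
    return [list(signal[s:s + segment_length])
            for s in range(0, last_start + 1, step)]


# Toy signal of 2000 points, segmented into lengths of 500.
signal = list(range(2000))
no_overlap = segment_signal(signal, 500)                 # 4 segments
half_overlap = segment_signal(signal, 500, overlap=0.5)  # 7 segments
```

With the same 2000-point signal, 50% overlap yields 7 segments instead of 4, which is the sample-count increase that motivates overlapping as a form of data augmentation.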
Other challenges of the segmentation process arise when dealing with low-speed and incipient fault signals. In those conditions, the interaction of the fault and the rolling element is at a low energy level, so the fault peak amplitude occurs rarely in the signal, which in turn makes the information difficult to extract [98]. An example of a low-speed signal is shown in Fig. 6. It can be seen that as the speed increases, the fault peak amplitude becomes more frequent and obvious. Another example of a signal from low-speed operation can be found in the work of Kim et al. [99]. Figs. 7(a) and 7(b) represent the sample signals of large and incipient faults, respectively. In Fig. 7(a), the fault peak amplitude is present in the signal at regular intervals, which provides similar information in each signal segment. However, the peaks from low-speed operation (Fig. 6(a)) and the incipient fault (Fig. 7(b)) are not frequent, and each segmented signal might contain different information, which would decrease the quality of the diagnosis analysis.
Each segmented signal can be fed into a DL model in several forms, as listed in Table 3. Input characteristics for DL models are essential for accurate diagnosis. Saufi et al. conducted a study in which each type of input data produced a different diagnosis performance [71]. The authors used a stacked sparse autoencoder (SSAE) model to analyse multiple types of input, demonstrating the versatility of a DL model over an SML model. They concluded that the selection of input types is a crucial part of a better FDD system. In recent years, researchers have tended to use raw time-domain and time-frequency image data because they do not require much effort in manual data preparation. For example, Wu et al. used a 1D time-domain signal with a CNN model and showed that their proposed diagnosis system outperformed a shallow learning model with statistical features [101]. Extraction of statistical features from each signal domain is time intensive and requires expertise to perform the analysis. In addition, it is difficult to determine the features that should be used, because the features vary according to application [102]. The total number of statistical features from all domains can add up to more than 100; Cerrada et al. extracted 359 statistical features from all domains to diagnose gear components [103], and Chen et al. extracted 256 statistical features for FDD analysis on gearbox components using the CNN model [104]. The details of the statistical feature formulas can be found in [105]-[107]. Xiang et al. used a Teager computed order (TCO) spectrum with a stacking autoencoder to perform fault diagnosis of bearing components under fluctuating speed and variable load [108]. The authors extracted 42 statistical features from three domains and fed them into an SML model. The results showed that the proposed DL model outperformed the SML model. During the experiment, Xiang et al. used two bearings with similar geometry and fault characteristics (one for training data and another for testing data) in order to simulate an actual industrial fault diagnosis.
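To give a sense of what manual statistical feature extraction involves, the following minimal sketch computes a handful of time-domain features that are widely used in vibration analysis. The choice of features and the function name `time_domain_features` are assumptions for illustration; the feature sets actually used in the cited studies are defined in their respective references.

```python
import math


def time_domain_features(segment):
    """Compute a few common time-domain statistics of one signal segment."""
    n = len(segment)
    mean = sum(segment) / n
    rms = math.sqrt(sum(x * x for x in segment) / n)
    peak = max(abs(x) for x in segment)
    variance = sum((x - mean) ** 2 for x in segment) / n
    fourth_moment = sum((x - mean) ** 4 for x in segment) / n
    return {
        "mean": mean,
        "rms": rms,
        "peak": peak,
        # Ratio of peak to RMS; sensitive to impulsive fault signatures.
        "crest_factor": peak / rms if rms > 0 else 0.0,
        # Normalised fourth moment; grows as the signal becomes more impulsive.
        "kurtosis": fourth_moment / variance ** 2 if variance > 0 else 0.0,
    }


features = time_domain_features([1.0, -1.0, 1.0, -1.0])
```

Repeating such calculations over the time, frequency and time-frequency domains is how feature counts in the hundreds arise, which is the manual effort that raw-signal DL inputs avoid.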

2) SIZE OF DATASET
A large dataset is essential for a DL model, especially for a mechanical signal. However, until now the size of training data has depended on user expertise, and there are no guidelines to ensure that the data is adequate for the training process. Most recent studies used large datasets to train their models. For example, Shao et al. used approximately 6,000 to 9,000 training samples to train a CNN model [112]. Guo et al. sampled 32,256 data points to achieve 99% diagnosis accuracy [113]. The authors used Pythagorean spatial pyramid pooling CNN to increase the robustness of the model when using data from variable rotating speed conditions. Meanwhile, Jian et al. achieved a satisfactory diagnosis result using 5,000 training samples for bearing components [90]. Zhong et al. fed a generative adversarial network 3,000 training samples for each air handling unit fault condition [114]. The current challenge facing DL models is to handle more testing samples than training samples. Currently, most DL models have been trained with more than 50% more training samples than testing samples. The capability of DL models to perform FDD with a small number of samples has not been determined. The difficulty of obtaining enough data samples in industrial environments to train a deep learning model has been discussed in [115]. In addition, large datasets are rarely available in industrial applications [116]. Meng et al. addressed the difficulty of obtaining large training samples by using Type 2 data processing, as shown in Fig. 5 [117].
In addition to the number of samples, the size of the input dimension should be considered during the analysis. Common sizes of input features used for DL models are listed in Table 4. According to analysis by Shao et al., the training time of a deep learning model significantly increases with an increase in input dimensions [118]. It is worth mentioning that the segment length sometimes equals the input dimension, as several DL models are directly fed with the segmented signal in the raw time domain.

3) THE ARCHITECTURE OF DEEP LEARNING
In general, ML architecture contains parameters and hyperparameters. Bias and weight are common parameters in the ML model that are usually randomly initialised and fine-tuned using the backpropagation algorithm. However, hyperparameters must be pre-set before an ML model can perform a training process. As they are based on deep architecture, DL models have more hyperparameter settings than SML models. Manual selection of hyperparameters is a difficult task, and guessing the values of the optimal hyperparameters is time intensive [128]. Darwish et al. concluded that it is difficult to determine an accurate and efficient DL architecture in a reasonable time [129]. A list of specific and general DL hyperparameters is presented in Table 5.
Among the DL hyperparameters, the activation function should be manually set by the user because it is difficult to optimise algorithmically. Activation function selection requires user experience and knowledge. The variety of activation functions available for deep learning models in bearing fault diagnosis has been discussed in [109]. Deep learning and SML models generally use sigmoid functions for activation. However, according to Dahl et al., rectified linear units (ReLU) achieved the same training error as sigmoid functions with a quicker training process [130]. Zhao and Kang diagnosed planetary gearbox conditions by using a ReLU activation function in a CNN model [131]. The authors conducted the data collection process for 56 seconds for each fault condition. A long duration of data collection is useful to increase the training and testing datasets. Ma et al. performed a fault diagnosis on bearing components by proposing a concatenated rectified linear unit (CReLU) in a CNN model [132]. The authors concluded that the proposed model is small, light and fast and can achieve satisfactory diagnosis performance. Meanwhile, Shao et al. conducted a comparative study on several types of activation functions in a deep autoencoder model [133]. The authors conducted the analysis using the Case Western Reserve University (CWRU) dataset with 12 bearing conditions, where the highest diagnosis accuracy was 97.18%. With 400 input dimensions, the authors were able to diagnose the dataset.
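The activation functions named above can be written out in a few lines. The sketch below uses the commonly cited definitions (CReLU as the concatenation of ReLU applied to the input and to its negation); whether the cited works use exactly these forms is an assumption. The gradient helpers illustrate one usual explanation for the faster training of ReLU: the sigmoid gradient vanishes for large inputs, whereas the ReLU gradient stays at 1 for positive inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def crelu(values):
    """Concatenated ReLU: ReLU of the inputs followed by ReLU of the
    negated inputs, which doubles the output dimension."""
    return [relu(v) for v in values] + [relu(-v) for v in values]


def sigmoid_grad(x):
    """Derivative of the sigmoid; approaches zero as |x| grows."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU; constant 1 for positive inputs."""
    return 1.0 if x > 0 else 0.0


crelu_out = crelu([1.0, -2.0])
```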
Another hyperparameter that should be manually set is the backpropagation algorithm. In recent years, ADAgrad, RMSprop, stochastic gradient descent (SGD), Nesterov, ADAdelta and Adam have been popular backpropagation algorithms for deep learning models [134]. Jian et al. conducted a comparative study of ADAdelta, Adam and RMSprop for their proposed adaptive one-dimensional CNN model [90]. For motor bearing fault diagnosis, the authors achieved higher diagnosis performance with RMSprop. Liu et al. implemented RMSprop on a recurrent neural network-based autoencoder for fault diagnosis of rolling bearings [93]. Gong et al. conducted a comparative study of SGD, ADAgrad, ADAdelta, RMSprop and Adam using a rotating machinery dataset [135]. The result showed that Adam produced higher accuracy than the other backpropagation algorithms. However, the algorithms differed only slightly in terms of classification accuracy, training time and test time. Lai et al. performed a comprehensive analysis of the backpropagation algorithm and found a slight difference between RMSprop and Adam in diagnosis performance [136]. Pan et al. diagnosed bearing conditions accurately with an implementation of the Adam algorithm in the CNN model [137]. Tang et al. used Nesterov momentum for a DBN model that outperformed the standard DBN model using a gearbox dataset [138].
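As an illustration of what such an algorithm does at each step, the following sketch applies the standard Adam update rule to a scalar toy problem. The function name `adam_minimise`, the default values and the quadratic objective are assumptions chosen for illustration; real DL frameworks apply the same update element-wise to every weight and bias.

```python
import math


def adam_minimise(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimise a scalar function with Adam, given its gradient function."""
    x, m, v = float(x0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g       # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g   # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)          # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
x_final = adam_minimise(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The step size is scaled by the running estimate of the squared gradient, which is the adaptive behaviour that Adam shares with RMSprop; the two differ mainly in the momentum term `m` and the bias correction.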
New variants of Adam and AMSgrad were proposed by Luo et al. to overcome the poor generalisation of ADAgrad, RMSprop and Adam [139]. The new variants are called ADAbound and AMSbound. If these new backpropagation algorithms provide significant improvements, especially for complex deep architectures, they could present a challenge to the current algorithms. Implementing these algorithms in DL models for FDD systems may help to increase their performance.
The hyperparameter that rarely gets attention is the number of hidden layers, even though the hidden layer size is part of the definition of a DL model. According to analysis by Ren et al., a deep network must consist of an input layer and an output layer separated by two or more hidden layers [140]. The specific hyperparameters of each deep learning model need to be set in each hidden layer. Thus, selecting the number of hidden layers is a crucial step that affects the selection of other hyperparameters. Wang et al. stated that the proper selection of hidden layers is important to avoid computational complexity [141]. Heo and Lee performed an analysis in which the diagnosis performance increased with an increase in the number of hidden layers [142]. However, an analysis by Saufi et al. showed that diagnosis performance varies according to the number of hidden layers [128], which is similar to the result achieved by Sohaib and Kim [68] and Jing et al. [91].
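The link between the number of hidden layers and computational cost can be made concrete by counting trainable parameters in a fully connected network. In the sketch below, the input size of 400 and the 12 output classes echo figures quoted earlier in this section, while the hidden layer widths are purely hypothetical.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, from input to output.
    Each pair of adjacent layers contributes n_in * n_out weights
    plus n_out biases.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Two hidden layers (200 and 100 units) between 400 inputs and 12 outputs.
two_hidden = count_parameters([400, 200, 100, 12])

# Adding a third hidden layer of 200 units increases the parameter count.
three_hidden = count_parameters([400, 200, 200, 100, 12])
```

Every additional layer also brings its own layer-specific hyperparameters, which is why the depth choice affects the selection of the others.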
During the training process, the DL model can be in a condition of good fit, underfit or overfit, as illustrated in Fig. 8. The model experiences an underfit condition when both the training and testing accuracy are low because the model is unable to learn the pattern of the input features due to a small number of training data points. According to Saurabh, underfitting occurs when the number of hidden neurons is low compared to the complexity of the input data [144].
In the overfit condition, the model achieves high training accuracy and low testing accuracy. Too many training cycles (epochs) may cause the model to memorise the features instead of generalising the feature patterns. In terms of FDD analysis, noisy data may contribute to an overfitting problem wherein the model captures the noise along with the important features. Srivastava et al. stated that dropout and L1/L2 regularisation can reduce the overfitting problem [145]. Shao et al. validated their proposed model performance by presenting training error and classification rates to show that the model did not experience an overfitting problem [146]. In addition, the authors indicated that overfitting problems occur when the number of hidden neurons increases. Similarly, Sheela and Deepa stated that random selection of the number of hidden neurons may cause an ML model to underfit or overfit [147].
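The three training conditions can be expressed as a simple rule on training and testing accuracy, as in the minimal sketch below. The thresholds `low` and `gap` are arbitrary assumptions for illustration, not values from the reviewed literature. The second helper shows early stopping, a commonly used safeguard against training for too many epochs; it is included as a related illustration and is not a method attributed to the cited works.

```python
def fit_condition(train_acc, test_acc, low=0.8, gap=0.1):
    """Label a trained model as underfit, overfit or good fit.

    low: accuracy below which both scores count as poor (assumed).
    gap: train-minus-test difference treated as overfitting (assumed).
    """
    if train_acc < low and test_acc < low:
        return "underfit"
    if train_acc - test_acc > gap:
        return "overfit"
    return "good fit"


def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the lowest validation loss, stopping the scan
    once the loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


stop_at = early_stopping_epoch([0.9, 0.6, 0.5, 0.55, 0.6, 0.7, 0.4])
```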
As mentioned in the previous section, several authors applied signal processing methods to noisy input data to remove unwanted noise. The training conditions are closely related to DL hyperparameters. Widodo et al. emphasised the importance of hyperparameter selection to reduce overfitting problems [54]. He and He mentioned that deep-learning-based signal processing has been developed for bearing fault diagnosis [97].

C. IMPLEMENTATION OF REAL MACHINERY SYSTEMS
Most of the DL applications available in published articles use experimental datasets for FDD analysis. Few studies have been found that use a real machinery system. Experimental and real machinery datasets differ in quality. An experimental dataset is collected under controlled conditions with a less complex system and less environmental disturbance. A real machinery system, on the other hand, is a complex structure, and the collected data contain information from components unrelated to the component of interest. Xu et al. have questioned how high-quality raw data can be obtained in a cost-effective way [148]. In FDD systems, data quality depends on the types of sensors, the data acquisition setup, the surrounding environment and the duration of the data collection. Zhang et al. emphasised that sensor problems result in poor data quality, and they conducted a thorough review of how to handle low-quality data using DL models [149]. High data quality is essential to ensure satisfactory performance of a DL model.
Wang et al. stated that most studies they reviewed on wind turbine gearboxes aimed to reveal the fault-related signal more clearly by filtering the background noise and the unrelated vibration components [150]. Xu et al. suggested that under practical conditions, the data cleaning procedure is important to ensure good performance of a deep learning model [151]. The accuracy of a deep learning model is reduced if the input signal/data contains noise [18]. Teng et al. and Saidi et al. performed fault diagnosis on real wind turbine gearboxes [61], [152], and it is noted that satisfactory diagnosis performance is difficult to achieve in real applications. Sadoughi et al. diagnosed bearing conditions in hay balers, and their proposed model achieved diagnosis performance of less than 95% [66]. Shao et al. diagnosed locomotive bearing conditions using a deep autoencoder and achieved diagnosis performance of up to 87.8% [118]. However, the performance reached 94.05% for the experimental dataset [118]. Xie et al. diagnosed bearing conditions on an experimental test rig and achieved diagnosis performance of between 98% and 100% for every bearing condition [153]. Yang et al. achieved approximately 99.57% to 100% accuracy when diagnosing bearing conditions from the CWRU dataset.
Liu et al. diagnosed a natural gas reciprocating compressor component using advanced signal processing consisting of linear mode decomposition and a stack denoising autoencoder [154]. The authors achieved diagnosis performance of 92.72% under the condition of low signal-to-noise ratio. Peng et al. diagnosed wheelset bearings in a high-speed train using a deeper 1D convolutional neural network; the model achieved between 89.7% and 99.9% diagnosis performance with different noise conditions in the dataset [155]. The authors used data augmentation to increase the sample size to 329,752 samples. Fault diagnosis on real machinery systems will be a great challenge in future years as current systems become larger and more complex with a high potential to break down on any given day [156].
Moreover, every ML model, including deep learning, requires labelled data, which are usually difficult to obtain in real industries [157]. In order to solve this problem, Li et al. proposed data augmentation methods that can create additional artificial samples during the deep learning training process [157]. The authors demonstrated five types of data augmentation: Gaussian noise, masking noise, amplitude shifting, time stretching and signal translation. Han et al. used an adversarial learning framework with the CNN model in order to deal with small labelled samples [127]. They analysed training datasets of different sizes and found that the diagnosis performance increased as the number of training data samples increased.
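The five augmentation types listed above can be sketched for a 1D signal as follows. These are generic versions written for illustration: the function names, the circular shift used for translation and the linear interpolation used for time stretching are assumptions, and the exact formulations in the cited work may differ.

```python
import random


def add_gaussian_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def mask_noise(x, fraction, rng):
    """Set a randomly chosen fraction of the samples to zero."""
    k = int(len(x) * fraction)
    masked = set(rng.sample(range(len(x)), k))
    return [0.0 if i in masked else v for i, v in enumerate(x)]


def scale_amplitude(x, factor):
    """Multiply every sample by a constant factor."""
    return [v * factor for v in x]


def translate(x, shift):
    """Circularly shift the signal to the right by `shift` samples."""
    shift %= len(x)
    return x[-shift:] + x[:-shift] if shift else list(x)


def time_stretch(x, factor):
    """Resample to round(len(x) * factor) points by linear interpolation."""
    n_out = max(2, int(round(len(x) * factor)))
    out = []
    for i in range(n_out):
        pos = i * (len(x) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append(x[lo] * (1.0 - frac) + x[hi] * frac)
    return out


rng = random.Random(0)  # fixed seed so the example is repeatable
segment = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
noisy = add_gaussian_noise(segment, 0.05, rng)
masked = mask_noise(segment, 0.5, rng)
scaled = scale_amplitude(segment, 2.0)
shifted = translate(segment, 1)
stretched = time_stretch([0.0, 1.0, 2.0, 3.0], 2.0)
```

Each function returns a new list of samples that keeps the label of the original segment, so one labelled segment can yield several training samples.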
Furthermore, imbalanced datasets commonly occur in industrial environments, where the datasets for normal and fault conditions do not have similar sample sizes [158]. Chen et al. proposed graph-based rebalance semi-supervised learning (GRSSL) in order to deal with an imbalanced, unlabelled dataset [159]. The authors used the CWRU dataset to validate their proposed model. Zhang et al. briefly discussed the conditions of imbalanced datasets, on which they used a synthetic oversampling technique called weighted minority oversampling. The authors used a deep learning model to effectively learn the input features for accurate fault diagnosis on an intelligent maintenance system (IMS) bearing dataset [160]. Chen et al. analysed imbalanced SCADA data for fault detection in wind turbines using a CNN model [161]. The proposed model was able to separate the data from different classes effectively using principal component analysis. Meanwhile, Jia et al. diagnosed bearing conditions with imbalanced fault classification using a normalised CNN model with a neuron activation maximisation (NAM) algorithm [162]. According to the authors' result, diagnosis performance decreases when the percentage of dataset imbalance increases.
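The basic idea of rebalancing by oversampling can be shown with plain random oversampling, which duplicates minority-class samples until every class matches the largest one. This is a deliberately simpler stand-in, not the weighted minority oversampling or GRSSL methods cited above; the function name and toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, rng):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy imbalanced dataset: 8 normal samples and 2 fault samples.
samples = [[float(i)] for i in range(10)]
labels = ["normal"] * 8 + ["fault"] * 2
balanced_x, balanced_y = random_oversample(samples, labels, random.Random(0))
class_counts = Counter(balanced_y)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the test set.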

IV. DEVELOPMENT AND MODIFICATION OF DEEP LEARNING MODELS IN FDD SYSTEMS
The versatility of deep learning models allows for better FDD systems compared to SML models [32]. In this section, potential avenues for the development and modification of DL models in FDD systems are discussed. Several articles from outside the field of machinery fault diagnosis are also cited as recommendations and references for future studies.

A. COMBINATION OF DEEP ARCHITECTURE WITH SML MODEL
Deep learning contains a number of hidden layers that are used for feature extraction. The last layer of a DL model is the classification layer, which is used to classify the extracted features. The architecture of a DL model is illustrated in Fig. 10. Softmax regression is typically used as a classification layer in deep learning models [163], [164], but KNN, DT, SVM, ELM and ANN can also be used. The integration of a sparse autoencoder and an SVM model was proposed by Ju et al., who used various UCI datasets to test their proposed model [165]. SVM is a well-known model that can handle data with low sample sizes and high-dimensional features [166]. This integration may provide a great opportunity for dealing with small data sample sizes for machinery components. Meanwhile, Xu et al. combined CNN with random forest (RF) ensemble learning [151]. They used the CNN model to extract the low-level features and input the features into the RF for classification. Li et al. diagnosed gearbox conditions using two RBM layers for the feature extraction process and RF as an output classifier [167]. The authors observed only a slight difference in diagnosis performance when the output classifier was replaced with an SVM or KNN model. Li et al. combined DBN and random forest to classify spacecraft electrical signals; this method performed well in terms of classification accuracy and computational efficiency [168]. The data used by the authors contained 1,000 features, with 56% of the data used for training and 44% for testing. This approach is capable of reducing the computational load and enhancing the classifier's performance. Moreover, Monteiro et al. used SVM to improve the decision level of a CNN model, which produced a great improvement in training time and diagnosis accuracy [169]. Cheng et al. successfully diagnosed wind turbine gearboxes based on the current signal using autoencoder and SVM models, achieving an overall performance of 89.3% [170].
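During the analysis, the authors used the Hilbert transform for amplitude demodulation. Haidong et al. reached approximately 95% diagnosis accuracy for 12 classes of bearing conditions from the CWRU dataset by combining a deep autoencoder with an extreme learning machine (ELM) [5]. Similarly, Mao et al. proposed a combination of autoencoder and ELM in order to achieve satisfactory diagnosis performance with acceptable processing time [171]. The authors discussed 71 statistical features from the time domain, bispectrum, EMD and wavelet packet decomposition, where each input feature is sensitive to certain types of bearing conditions. Li et al. proposed a deep ELM model in which the sparsity and neighbourhood are integrated into the deep model and used to analyse a bearing fault dataset. The proposed model achieved 97% accuracy for experimental data [172]. The experiment was conducted on a Spectra Quest rotor experimental platform and the data was collected at a 12 kHz sampling rate. The proposed diagnosis system outperformed other DL models including a stacked denoising autoencoder and a CNN. Meanwhile, Wang et al. used the CNN model to learn the data features automatically from the CWRU dataset raw signal [173]. The features were fed into a hidden Markov model in order to classify the signals based on bearing conditions. The authors achieved approximately 98% diagnosis accuracy for 12 classes of bearing conditions, which is slightly higher than that of Haidong et al.'s model. Instead of being used as a classifier in a DL model, the SML model can be used for data processing along with deep architecture. For example, Hu et al. used a multi-grained scanning forest ensemble at each DBM hidden layer in order to perform accurate diagnosis on industrial machinery systems [174]. Hu et al. conducted a comprehensive study of the relationship between the number of classes and the diagnosis accuracy, noting that performance decreases as the number of classes increases. In addition, Liu et al. used RNN to analyse multiple time sequence data and a denoising autoencoder to analyse the CWRU dataset [93]. The authors added noise to the dataset to create signal-to-noise ratios (SNRs) ranging from 1% to 10%. The performance of the proposed model increased from 96.98% to 99.75% as the SNR increased, and the analysis indicated that the noise contained in the signal might reduce the deep learning model performance. Li et al. proposed using deep stacking least squares SVM (LS-SVM) with a one-against-all strategy to extract features from the CWRU bearing signal [175]. They found that the classification accuracy increased with the number of stacking layer modules.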
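To make the idea of replacing the softmax layer with an SML classifier concrete, the sketch below implements a basic ELM output stage: a random, untrained hidden layer followed by output weights obtained in closed form with the pseudoinverse. It is not the model of Haidong et al. [5] or Mao et al. [171]. The class name `ELMClassifier`, the hidden layer size and the synthetic three-class feature matrix, which stands in for features already extracted by a deep model, are assumptions.

```python
import numpy as np

class ELMClassifier:
    """Basic extreme learning machine: random hidden layer, least-squares output weights."""

    def __init__(self, n_hidden=40, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        # Hidden weights and biases are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        targets = np.eye(n_classes)[y]                     # one-hot targets
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# Synthetic stand-in for deep features: three well-separated classes in 4-D.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0], [-3.0, 3.0, -3.0, 3.0]])
y = np.repeat(np.arange(3), 60)
features = centres[y] + rng.normal(scale=0.5, size=(180, 4))

model = ELMClassifier(n_hidden=40).fit(features, y)
train_accuracy = float(np.mean(model.predict(features) == y))
```

Since only the output weights are solved for, training reduces to a single pseudoinverse, which is consistent with the short processing times reported for autoencoder and ELM combinations.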

B. HYPERPARAMETER OPTIMISATION
Hyperparameter selection is an important process, as the performance of a deep learning model is highly affected by the hyperparameters [176]. Optimisation of hyperparameters for DL can be automated using several methods, as listed in Table 6. Optimisation algorithms are computationally intensive, with the computational load depending on the number of hyperparameters being optimised. Even with automated hyperparameter optimisation, human intervention is still needed to decide on the search space for the hyperparameter boundaries [130]. Until now, the optimisation process has been limited to numerical hyperparameters.
Wahab et al. conducted a detailed analysis of metaheuristic optimisation algorithms including particle swarm optimisation, differential evolution (DE), genetic algorithm (GA), artificial ant colony, artificial bee colony, glowworm swarm optimisation and cuckoo search optimisation [177]. They used a benchmark function to analyse these algorithms. Meanwhile, nature-inspired algorithms have been briefly reviewed by Fister Jr. et al., who listed all available optimisation algorithms along with references [178]. Beheshti and Mariyam conducted a comprehensive review of metaheuristic algorithms [179]. The detailed analyses by these authors provide a great deal of information on the implementation of optimisation algorithms in DL models for FDD systems.
New optimisation methods have been proposed by Mirjalili, including grey wolf optimiser, salp swarm optimisation, grasshopper optimisation, ant lion optimisation and whale optimisation [180], [181]. It is worth mentioning that the methods proposed by Mirjalili are more user-friendly because they require fewer parameter settings, so they can be easily used for DL hyperparameter selection.
Several studies have demonstrated the potential of automated hyperparameter selection in DL models for FDD systems. For example, Saufi et al. compared the performance of random search, grid search, Bayesian optimisation, GA and DE and showed that all methods have the ability to optimise DL hyperparameters [128]. However, DE produced slightly higher performance compared to the other optimisation algorithms. Haidong et al. used an artificial fish swarm algorithm and particle swarm optimisation to optimise deep autoencoder and deep belief network hyperparameters, respectively [118], [182].
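The sketch below shows how a DE search over two numerical hyperparameters could be organised, using the common DE/rand/1/bin scheme. It is not the setup of Saufi et al. [128]. The objective `surrogate_validation_error` is a hypothetical quadratic stand-in for training a DL model and measuring its validation error, and the population size, generation count, scale factor `F`, crossover rate `CR` and search bounds are assumptions.

```python
import numpy as np

def differential_evolution(objective, bounds, rng, pop_size=20, n_gen=50, F=0.8, CR=0.9):
    """Minimise `objective` inside box `bounds` with the DE/rand/1/bin scheme."""
    bounds = np.asarray(bounds, dtype=float)
    low, high = bounds[:, 0], bounds[:, 1]
    dim = len(bounds)
    pop = rng.uniform(low, high, size=(pop_size, dim))
    cost = np.array([objective(p) for p in pop])
    for _ in range(n_gen):
        for i in range(pop_size):
            # Mutation: combine three distinct members other than the target.
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, size=3, replace=False)]
            mutant = np.clip(a + F * (b - c), low, high)
            # Binomial crossover, forcing at least one gene from the mutant.
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # Greedy selection: keep the trial only if it is no worse.
            trial_cost = objective(trial)
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = int(np.argmin(cost))
    return pop[best], float(cost[best])

def surrogate_validation_error(h):
    """Hypothetical objective with its minimum at log10(lr) = -3 and 128 hidden units."""
    log_lr, n_hidden = h
    return (log_lr + 3.0) ** 2 + ((n_hidden - 128.0) / 64.0) ** 2

rng = np.random.default_rng(0)
search_space = [(-5.0, -1.0), (16.0, 512.0)]   # log10 learning rate, hidden units
best_h, best_err = differential_evolution(surrogate_validation_error, search_space, rng)
```

In practice each objective evaluation would require training a model, so the product of population size and generation count determines the computational cost, and the search bounds still have to be chosen by the user.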

C. IMPLEMENTATION OF REMAINING USEFUL LIFE AND PROGNOSIS ANALYSIS
Most prognoses and predictions of RUL rely heavily on statistical features. Ren et al. extracted three time domain features and one frequency domain feature to estimate the remaining useful life of bearing components [140]. Twenty features related to pressure, temperature, fan and core speed, fuel flow and coolant bleed have been used to predict the possible remaining operational life of aircraft gas turbines [74]. Zhang et al. used a deep learning-based long short-term memory (LSTM) approach to estimate the RUL of aircraft engines. Their model reached a lowest prediction root mean square error (RMSE) of 16.12 [189]. Haidong et al. used a continuous deep belief network (DBN) with locally linear embedding (LLE) to estimate the remaining useful life of bearing components [190]. The authors used the IMS dataset to test their proposed RUL system, and the model achieved the lowest prediction error of 0.7801%. The authors also proposed a comprehensive index feature, and the DBN using this feature outperformed the DBN using six other statistical features. Ren et al. fed multiple features from time, frequency and time-frequency domains into a deep autoencoder and deep neural network, and the model produced a mean square error (MSE) of 0.2 [191]. Similarly, Wang et al. used features from time, frequency and time-frequency domains to predict the useful life of bearing components using a two-dimensional CNN model [192]. The proposed CNN model outperformed other models in terms of prediction time and error rate. Meanwhile, Li et al. predicted the degradation process of a turbofan engine using deep CNN with raw signal datasets [193]. According to the analysis, the prediction error rate decreases with an increase in convolutional layers. Zhu et al. diagnosed bearing components using a wavelet transformation representation and multiscale CNN [194]. Li et al. carried out a prediction analysis using short-time Fourier transform images and a CNN model and showed that RUL estimation can use time-frequency images as input data [195]. Deutsch and He predicted the life of bearings and gear components using a DBN-feedforward neural network model [196]. Ellefsen et al. proposed a combination of RBM and LSTM for detecting turbofan engine degradation [197]. The authors used a GA to tune the hyperparameters of the proposed model. Al-Dulaimi et al. proposed a hybrid model of CNN and LSTM in order to estimate the remaining useful life of turbofan engines [198]. In addition, Zhang et al. used a deep recurrent neural network to analyse a similar dataset of turbofan engines [199]. The authors used temperature, pressure and speed sensors during the analysis, which demonstrated that under real-world conditions, multiple types of sensors should be taken into consideration.
Root mean square error (RMSE), mean absolute percentage error (MAPE) and mean absolute error (MAE) are commonly used to examine the performance of deep learning models for RUL analysis and are generally reported together [52]. In addition, it has been shown that RUL models can use time-frequency images and raw time domain signals. However, use of these types of input data for RUL is still at an early phase and requires more study in future years.
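The three error measures can be computed as in the following sketch. The function names and the four true and predicted RUL values are hypothetical and serve only to show the arithmetic; MAPE is undefined when a true value is zero, which matters near the end of life.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; y_true must not contain zeros."""
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

rul_true = np.array([100.0, 80.0, 60.0, 40.0])   # hypothetical true RUL (cycles)
rul_pred = np.array([90.0, 85.0, 60.0, 50.0])    # hypothetical predicted RUL (cycles)

scores = {
    "RMSE": rmse(rul_true, rul_pred),   # sqrt((100 + 25 + 0 + 100) / 4) = 7.5
    "MAE": mae(rul_true, rul_pred),     # (10 + 5 + 0 + 10) / 4 = 6.25
    "MAPE": mape(rul_true, rul_pred),   # 100 * (0.1 + 0.0625 + 0 + 0.25) / 4 = 10.3125
}
```

RMSE penalises large errors more heavily than MAE, while MAPE weights errors relative to the true value, which is one reason the three are usually reported together.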

D. INTEGRATION OF MULTIPLE SENSOR TYPES
A great deal of information is required to examine the condition of large machinery such as wind turbine components. Vibration analysis is the usual method used in FDD systems and has been proven to be effective, as reported by Hui et al. [200]. Most of the articles referenced in this review used vibration analysis to test the proposed models. However, under certain circumstances, a single type of sensor may not provide enough information regarding the condition of the machine. According to Sarkar et al., the time series signal obtained from a single sensor may not carry sufficient information regarding the evolving fault [201]. They also successfully diagnosed aircraft gas turbine engines by combining information from multiple sensors such as pressure, temperature and speed. Wang et al. stated that there are 41 parameters from various sensors that can be used for monitoring wind turbine conditions [202]. Li et al. performed a data fusion process by combining vibration and acoustic emission (AE) data for gearbox fault diagnosis [167]. They found that the combined data (AE and vibration) produced higher diagnosis performance compared to the analysis without data fusion. In addition, Jing et al. used multiple types of sensor (accelerometer, microphone, current sensor and optical encoder) with a deep convolutional neural network to diagnose planetary gearbox condition, combining all of the sensor data as the input to their proposed DL model [203]. Multi-type sensor data fusion is among the future opportunities for development, since DL models have the ability to analyse high-dimensional features. Numerous sensors of different sizes may contribute to a multidimensional data space [204], which would give DL a great advantage over SML.
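One simple form of data-level fusion is to standardise each sensor signal and stack time-aligned segments as input channels, as sketched below. This is not the fusion scheme of Li et al. [167] or Jing et al. [203]. The sketch assumes that all sensors are synchronised and sampled at the same rate; the sensor names, signal lengths and segment length are hypothetical.

```python
import numpy as np

def fuse_sensors(signals, segment_len):
    """Standardise each sensor signal and stack aligned segments as channels.

    Returns an array of shape (n_segments, n_sensors, segment_len).
    """
    n_seg = min(len(s) for s in signals.values()) // segment_len
    channels = []
    for name in sorted(signals):                       # fixed channel order
        s = np.asarray(signals[name], dtype=float)[: n_seg * segment_len]
        s = (s - s.mean()) / (s.std() + 1e-12)         # zero mean, unit variance
        channels.append(s.reshape(n_seg, segment_len))
    return np.stack(channels, axis=1)

rng = np.random.default_rng(0)
signals = {
    "vibration": rng.normal(scale=2.0, size=5000),     # synthetic stand-ins
    "acoustic": rng.normal(scale=0.1, size=5000),
    "current": rng.normal(loc=5.0, size=5200),
}
fused = fuse_sensors(signals, segment_len=1024)
```

Standardising each channel prevents a sensor with a large numerical range from dominating the others when the fused array is fed to a DL model.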

E. FAULT VISUALISATION ANALYSIS
A DL model contains two processes: feature learning, in which the model extracts the features automatically from the input data; and feature classification using a fully connected layer [81]. Integration of feature learning with visual feature representation may provide a fault visualisation method, as shown in Fig. 11. There are several types of feature visualisation including t-distributed stochastic neighbour embedding (t-SNE), principal component analysis (PCA), locally-linear embedding (LLE), linear discriminant analysis, locality preserving projection and isometric projection [205]. The most common method used to analyse machinery data features is the t-SNE model, which was developed by Van Der Maaten and Hinton [206]. However, there is no proof of the effectiveness of t-SNE over other methods, which provides a great opportunity for future study.
Yang et al. used features from the frequency domain and reduced the dimensions of the extracted features before inputting them into a DNN model for bearing fault diagnosis [207]. Then, the features processed by the DNN were visualised using the PCA model. The authors showed that feature clustering can be automated. An example of their results is shown in Fig. 12, where a clear boundary can be seen for each class condition.
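A projection onto the first three principal components, of the kind used for this type of visualisation, can be obtained from a singular value decomposition as sketched below. This is not the code of Yang et al. [207]. The feature matrix is synthetic, standing in for features taken from a hidden layer, and the class offsets and dimensions are assumptions.

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project features onto their leading principal components via SVD."""
    centred = features - features.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)          # explained variance ratio
    scores = centred @ vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic stand-in for learned features: three classes in a 10-D feature space.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
offsets = rng.normal(scale=4.0, size=(3, 10))
features = offsets[labels] + rng.normal(size=(150, 10))

scores, explained = pca_project(features)
```

The three columns of `scores` can be drawn as a 3-D scatter plot coloured by `labels`; well-separated clusters indicate that the learned features discriminate between the classes.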
Moreover, in this paper, a simple analysis is carried out using time-frequency image input data and a stacked sparse autoencoder (SSAE), with a t-SNE model used to analyse the CWRU dataset. The SSAE and t-SNE models are both based on the default architecture in the MATLAB module. The raw time domain signal, with a segment length of 800 and no overlap, is transformed into a time-frequency image and fed directly to the SSAE model. Fig. 13(a) consists of the t-SNE visualisation from each SSAE hidden layer with the given target values. The visualisation contains four classes, and the feature representation improves at each SSAE hidden layer. Meanwhile, Fig. 13(b) represents the analysis without the target value, and the features are visualised in a single colour. One limitation of current diagnosis systems is that they require target values for better analysis. However, in Fig. 13(b), it can be seen that even without a target value, the bearing conditions can be grouped into several clusters. Classes 2 and 3 might be misinterpreted during the analysis since they overlap. The Fig. 13(b) visualisation demonstrates that fault visualisation can provide initial information during the FDD analysis. However, this analysis is insufficient, since the SSAE and t-SNE models are run using the default setup. If all the aforementioned challenges are met, fault visualisation methods can be improved, and clear separation between each fault class can be produced.
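The preprocessing step described above, in which the raw signal is cut into non-overlapping segments of length 800 and each segment is converted into a time-frequency image, can be sketched in Python as follows. The analysis in this paper was run in MATLAB and the time-frequency transform is not specified here, so the short-time Fourier transform magnitude, the window length, the hop size and the synthetic test signal are all assumptions.

```python
import numpy as np

def segment_signal(x, segment_len=800):
    """Split a 1-D signal into non-overlapping segments, dropping the remainder."""
    n_seg = len(x) // segment_len
    return x[: n_seg * segment_len].reshape(n_seg, segment_len)

def spectrogram(segment, win_len=64, hop=16):
    """STFT magnitude with a Hann window; returns (freq_bins, time_frames)."""
    window = np.hanning(win_len)
    starts = range(0, len(segment) - win_len + 1, hop)
    frames = np.stack([segment[s:s + win_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T

rng = np.random.default_rng(0)
t = np.arange(12_000) / 12_000.0                      # 1 s at an assumed 12 kHz rate
raw = np.sin(2 * np.pi * 1_000.0 * t) + 0.1 * rng.normal(size=t.size)

segments = segment_signal(raw)                        # 15 segments of 800 samples
images = np.stack([spectrogram(s) for s in segments]) # one image per segment
```

Each image can then be flattened or passed as a 2-D array to the feature learning model, depending on the input format that model expects.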

F. THE AVAILABLE ONLINE MACHINERY DATASETS
The vast majority of publications regarding FDD systems use the CWRU dataset. However, there are several datasets that can be obtained from online databases. Table 7 lists these datasets and the advantages and disadvantages of each. The CWRU dataset can be used as a starting point to determine the performance of the proposed model. Unlike other datasets, the CWRU database consists of four operating conditions with three fault severities. A DL model can be tested with multi-fault conditions: up to 12 bearing conditions for each operating condition. Mao et al. achieved satisfactory diagnosis performance and a clear separation between each fault condition using feature visualisation [208]. Zhang et al. provided a comprehensive review of the performance of every deep learning model tested on the CWRU dataset [209]. On the other hand, the data provided by the machinery fault database (MaFaulDa) is useful for further analysis since it includes more than one type of sensor. The data is collected from microphone and vibration sensors. Meanwhile, the IMS bearing datasets can be used for fault identification and RUL analysis, since the data was collected during an endurance test. Unlike in other bearing datasets, the faults in the IMS bearing components developed naturally during the run-to-failure test. Saufi et al. used the CWRU, IMS and MaFaulDa datasets to examine the performance of their proposed model [128]. The authors achieved more than 98% diagnosis accuracy for all datasets. Guo et al. performed a transfer learning process using CWRU as the source domain and the IMS dataset as the target domain [210]. RUL analysis can also be done using the IEEE PHM 2012 dataset, the experiment for which was run using the PRONOSTIA platform. The datasets from Paderborn University consist of three types of bearing conditions (normal, outer race faults and inner race faults) with artificial and realistic damage conditions. Using the Paderborn datasets, DL models can be tested with several types of bearing damage from the stator current signal, which cannot be obtained from other bearing datasets. Zhu et al. achieved 97.15% and 82.05% average accuracy using the CWRU and Paderborn University datasets, respectively [211]. The authors determined that time-frequency analysis provides higher diagnosis accuracy compared to frequency spectrum and time domain segment data. However, Chen et al. achieved 94.5% for the Paderborn dataset using a deep inception net with an atrous convolution neural network [115]. It should be noted that the MaFaulDa and Paderborn datasets have received little attention in the development of DL models.
There are two datasets for gear fault conditions available online: high-speed turbine and IEEE PHM data challenge. The high-speed turbine dataset includes normal and fault conditions of gear sets. The experiment that generated this dataset was conducted using the actual gearbox of a wind turbine system. Furthermore, the IEEE PHM 2009 datasets provide more fault conditions in the gearbox system. The experimental setup and dataset details have been discussed in [101] and [212]. The turbine blade dataset from UNSW contains three types of conditions: normal blade, normal blade with an air jet and blade fault. The turbine experimental rig contains a 19-blade arrangement. Most of the gear and blade datasets have not been extensively used for DL models, which provides a good opportunity for further development. Finally, all the data listed in Table 7 was collected at high rotation speeds; consequently, DL models cannot be examined under low-speed conditions using these datasets.
ZAIR ASRAR BIN AHMAD received the Ph.D. degree from the Faculty of Mechanical Engineering, Otto-von-Guericke-University Magdeburg, Magdeburg. He is currently a Senior Lecturer with the School of Mechanical Engineering, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Malaysia. His current research interests include civil engineering, acoustic engineering, and structural engineering.
MOHD SALMAN LEONG has more than 35 years professional engineering consulting experience, and is acknowledged by the industry and government agencies as the leading authority in acoustics, noise, and vibration in the country. He has been involved in many of the mega-projects and high impact consulting and investigation projects in oil and gas, power generation, infrastructure and construction industries. He is currently a Professor and a Principal Consultant, the Founding Director of the Institute of Noise and Vibration, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia.
MENG HEE LIM has more than 13 years professional consulting experience in acoustics and vibration, with specialization in machinery diagnostics. He has extensive experience in power generation gas and steam turbines, blade failures diagnostics and experimental modal analysis. He is currently a Senior Lecturer and a Consultant with the Institute of Noise and Vibration, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia.

FIGURE 1. Relationship of artificial intelligence methods.

FIGURE 3. Performance of deep learning model and shallow learning model [88].

FIGURE 4. The difference between deep learning and shallow learning models in fault detection and diagnosis systems.
they require prior knowledge regarding their architecture. Currently, several programming modules such as MATLAB, R and Python have been used for the development of DL. Due to different styles of code and training processes, diagnosis performance might be different for each programming module. For the last few decades, the deep architecture of the DL model has been difficult to train. The greedy layer-wise pre-training steps proposed by Hinton et al. have reduced the difficulty of the training process

FIGURE 7. The difference of bearing signal characteristics based on fault size; a) Vibration signal with severe fault, b) Vibration signal with incipient fault [45].

FIGURE 8. Training conditions of deep learning models.

FIGURE 9. Relationship of train and test error during training process.

Fig. 9 represents the relationship between prediction error and training cycle of the DL model. Dong et al. briefly explained the nature of overfit and underfit as they occur in ML models [143]. The ideal condition is a good fit, where the model reaches high training and testing accuracy.

FIGURE 10. Architecture of a DL model for FDD systems.
proposed a deep ELM model in which the sparsity and neighbourhood are integrated into the deep model and used to analyse a bearing fault dataset. The proposed model achieved 97% accuracy for experimental data [172]. The experiment was conducted on a Spectra Quest rotor experimental platform and the data was collected at a 12 kHz sampling rate. The proposed diagnosis system outperformed other DL models including a stacked denoising autoencoder and a CNN. Meanwhile, Wang et al. used the CNN model to learn the data features automatically from the CWRU dataset raw signal [173]. The features were fed into a hidden Markov model in order to classify the signals based on bearing conditions. The authors achieved approximately 98% diagnosis accuracy for 12 classes of bearing conditions, which is slightly higher compared to Haidong et al.'s model. Instead of being used as a classifier in a DL model, the SML model can be used for data processing along with deep architecture. For example, Hu et al. used a multi-grained scanning forest ensemble at each DBM hidden layer in order to perform accurate diagnosis on industrial machinery systems [174]. Hu et al. conducted a comprehensive study of the relationship between the number of classes and the diagnosis accuracy, noting that performance decreases as number of classes increases. In addition, Liu et al. used RNN to analyse multiple time sequence data and a denoising autoencoder to analyse the CWRU dataset

VOLUME 7, 2019
Zhang et al. used a deep learning-based long short-term memory (LSTM) approach to estimate the RUL of aircraft engines. Their model reached the lowest prediction root mean square error (RMSE) at 16.12 [189]. Haidong et al. used a continuous deep belief network (DBN) with locally linear embedding (LLE) to estimate the remaining useful life of bearing components

FIGURE 12. First three principal components of the features from CWRU bearing data [207].

FIGURE 13. Feature learning and t-SNE analysis: a) Analysis with known target data; b) Analysis with unknown target data.

Fig. 13(a) consists of the t-SNE visualisation from each SSAE hidden layer with the given target values. The visualisation contains four classes, and the feature representation improves at each SSAE hidden layer. Meanwhile, Fig. 13(b) represents the analysis without the target value, and the feature is visualised in a single colour. One limitation of current diagnosis systems is that they require target values for better analysis. However, in Fig. 13(b), it can be seen that even without a target value, the bearing conditions can be classified into several groups. Classes 2 and 3 might be misinterpreted during the analysis since both overlap. The Fig. 13(b) visualisation demonstrates that fault visualisation can provide initial information during the FDD analysis. However, this analysis is insufficient, since the SSAE and t-SNE models are run using the default setup. If all aforementioned challenges have been met, fault visualisation methods can be improved, and clear separation between each fault class can be produced.

TABLE 1. List of available research at each FDD stage.

TABLE 4. List of input dimensions used on deep learning models.

TABLE 5. List of hyperparameters in deep learning models.

TABLE 6. Potential algorithms for deep learning hyperparameter selection.

TABLE 7. List of available online machinery datasets.