Bearing Fault Detection and Diagnosis Using Case Western Reserve University Dataset With Deep Learning Approaches: A Review

A smart factory is a highly digitized and connected production facility that relies on smart manufacturing, and artificial intelligence is its core technology. Machine learning and deep learning algorithms have produced fruitful results in many fields, such as image processing, speech recognition, fault detection, object detection, and medical sciences. As the use of smart machinery grows, faults in machinery equipment are expected to increase, and the use of deep learning algorithms for machinery fault detection and diagnosis has grown accordingly. Many studies implementing deep learning algorithms have been published using both open-source and closed-source datasets. Of the many publicly available datasets, the Case Western Reserve University (CWRU) bearing dataset has been widely used to detect and diagnose machinery bearing faults and is accepted as a standard reference for validating models. This paper summarizes recent works that use the CWRU bearing dataset for machinery fault detection and diagnosis with deep learning algorithms. We review the published works and present their working algorithms, results, and other necessary details. We believe this paper can help future researchers start their work on machinery fault detection and diagnosis using the CWRU dataset.


I. INTRODUCTION
Nowadays, electric machines are used ubiquitously in manufacturing applications. With the rapid growth of science and technology and the development of modern industries, machinery equipment is operated on a daily basis in almost all applications, sometimes under unfavorable conditions, humidity, and excessive loads. This results in motor breakdowns, which lead to huge maintenance expenses, reduced production levels, severe monetary losses, and a potential risk of loss of lives. Rotating machines and induction engines play a crucial role in industrial systems. These rotating machines are composed of numerous elements, such as the stator, rotor, shaft, and bearings [1]. Rolling element bearings, commonly known simply as bearings, are core yet vulnerable components of the machinery; their health condition, i.e., cracks or faults at different places when operated under varying loads, directly affects the performance, efficiency, stability, and lifespan of the machines [2]. The rolling element bearing (REB) consists of four components: the inner race, the outer race, the balls, and the cage. Fig. 1 shows the experimental platform of the CWRU bearing test rig for the ball bearing system [3], [4], the bearing components, and its cross-sectional view.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
Numerous studies [5], [6] on induction engine failures reveal that bearing faults are the most common fault category, accounting for one-third of all defects. The failure of REBs is one of the most common causes of machine breakdown, resulting in severe loss of safety and property and, in some cases, the crash of the machine or loss of lives [7]. For these reasons, fault detection and diagnosis of REBs have become an essential part of development and engineering research. Condition monitoring and fault detection mechanisms for REBs are expected to provide information about the real working state of the machinery at each moment without stopping the production line. Moreover, for a proper understanding of the processes associated with bearing faults, mechanical vibration signals, which can be used to detect, locate, and distinguish different types of faults, are considered one of the most useful and productive sources of information [1], [8].
FIGURE 1. (a) An experimental platform of the CWRU bearing test rig for the ball bearing system [3], [4], the components of the REB, and (b) its cross-sectional view.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In bearing fault detection, sensors placed at different locations on the machine send vibration signals to a data acquisition system for further processing [9]. Fig. 2 (a) shows the vibration data collection process for the CWRU dataset. The performance of fault detection methods depends not only on the quality of the collected vibration signals but also on the effectiveness of the applied signal processing and feature extraction techniques [1]. Traditionally, maintenance of this type of machinery was performed after a machine fault had occurred. Such an after-the-fact (posterior) maintenance approach generally leads to unexpected machine breakdowns, resulting in financial losses and casualties [10]. Thus, it is important to monitor the bearing health condition while the machine is in its working state. For detecting and diagnosing faults in machinery and REBs, many signal processing approaches, machine learning (ML)-based approaches, and deep learning (DL)-based approaches have been proposed and implemented.
Conventional signal processing methods analyze the vibration signal in the time domain, the frequency domain, or the time-frequency domain. Methods such as the fast Fourier transform (FFT) [11], wavelet transformation (WT) [12], empirical mode decomposition (EMD) [13], ensemble empirical mode decomposition (EEMD) [14], empirical wavelet transform (EWT), wavelet packet transform (WPT) [15], variational mode decomposition (VMD), stochastic resonance, and sparse decomposition have been suggested and implemented for fault signal analysis and classification of the collected vibration data. Each of these approaches belongs to one of the three analysis domains.
Time-domain analysis, which usually relies on scalar indices to determine the bearing condition, is the most straightforward technique for detecting and diagnosing faults in bearings [16]. In this analysis, the temporal vibration signal is studied and the bearing condition is estimated from its values. Commonly used indicators include the peak value, peak-to-peak value, root-mean-square (RMS), and crest factor [17]; skewness and kurtosis [18]; spectral kurtosis [19]; and the impulse factor, shape factor, and clearance factor [20]. Frequency-domain, or spectral, analysis is the most broadly used method for fault diagnosis in REBs. Here, the time-domain vibration signals are converted into discrete frequency components: the discrete Fourier transform (DFT), computed efficiently with the FFT algorithm, is used to analyze the raw vibration signal in the frequency domain. The advantage of frequency-domain over time-domain analysis is that specific frequency components of interest can be detected easily [16]. Bearing vibration signals, however, combine periodic components, which are cyclic and time-invariant, with non-stationary components, which has led to the development of time-frequency analysis approaches. Time-frequency techniques have the advantage that they can handle both stationary and non-stationary vibration signals. Many time-frequency strategies, such as the short-time Fourier transform (STFT) [21], the Wigner-Ville distribution [22], and the wavelet transform [12], have been used in machinery fault detection and diagnosis (MFDD).
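To make the indicators above concrete, the following minimal sketch computes the scalar time-domain features and the dominant spectral peak of a single vibration segment. It is an illustration only, not code from any reviewed work: the function names, the synthetic 100 Hz test tone, and the use of non-excess (Pearson) kurtosis are assumptions, and only the 12 kHz sampling rate is borrowed from the CWRU recordings.

```python
import numpy as np

def time_domain_features(x):
    """Scalar condition indicators for a 1-D vibration segment."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    peak = abs_x.max()
    rms = np.sqrt(np.mean(x ** 2))
    centered = x - x.mean()
    std = x.std()
    return {
        "peak": peak,
        "peak_to_peak": x.max() - x.min(),
        "rms": rms,
        "crest_factor": peak / rms,
        "skewness": np.mean(centered ** 3) / std ** 3,
        # Non-excess (Pearson) kurtosis: 3.0 for a Gaussian signal.
        "kurtosis": np.mean(centered ** 4) / std ** 4,
        "impulse_factor": peak / abs_x.mean(),
        "shape_factor": rms / abs_x.mean(),
        "clearance_factor": peak / np.mean(np.sqrt(abs_x)) ** 2,
    }

def dominant_frequency(x, fs):
    """Frequency (Hz) of the largest non-DC peak in the one-sided FFT magnitude."""
    x = np.asarray(x, dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return freqs[np.argmax(spectrum[1:]) + 1]

# Synthetic stand-in for a vibration record: one second of a 100 Hz tone
# sampled at 12 kHz (hypothetical signal, not CWRU data).
fs = 12_000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 100.0 * t)
features = time_domain_features(signal)
peak_hz = dominant_frequency(signal, fs)
```

For a pure sine the crest factor is close to the square root of 2 and the kurtosis is close to 1.5; impulsive bearing faults typically push both indicators upward, which is why they are popular as simple health indices.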
Even though reasonably high fault diagnosis accuracies have been reported, signal processing methods have several disadvantages. Temporal analysis cannot determine which component of a machine is defective. In FFT analysis, the frequency peak of a bearing fault is often not strong enough to be distinguished. Similarly, cepstrum analysis produces many unwanted large peaks near the zero point, which makes the output difficult to interpret. Moreover, it uses several FFTs and IFFTs on each frame, which can be computationally expensive. Envelope analysis requires advance knowledge of the resonance frequency and filtering band, which demands some experience. The wavelet transform, on the other hand, raises problems such as the difficulty of selecting a suitable mother wavelet, the decomposition level, and the frequency band that contains the information necessary for fault detection [1]. Furthermore, most of these methods need different features for different types of vibration data. The conventional methods depend entirely on the values at the fault characteristic frequencies to determine the existence of a bearing fault [2]. Features such as the mean, median, kurtosis, peak-to-peak value, minimum, maximum, standard deviation, absolute mean, skewness, and RMS are general characteristics of the vibration signal that help describe the condition of the bearing. This illustrates how troublesome and crucial it is to choose the right features to characterize the specific signals used for classification. Thus, manually selected or handcrafted features may not best describe the motor bearing data, and they may not provide a fundamental solution for detection and diagnosis. Put simply, the question of which feature extraction is optimal for a particular signal remains unanswered to date [23], [24].
There may occur some exceptional patterns or relationships unseen in the data themselves that can expose a bearing fault, and these distinctive features can be almost impossible for human beings to identify at the initial stage. Thus, many scholars have applied several machine learning procedures for the detection of bearing faults from those vibration data.
Machine learning is a branch of artificial intelligence that is used to teach machines how to handle data more efficiently. Its working principle can be summarized in three steps: it takes some data, finds patterns in the data, and predicts new patterns from the data. ML algorithms are used extensively in almost every sector, such as speech and text recognition, industrial applications, medical diagnosis, social networking sites, and finance [25]. In MFDD, the use of ML algorithms for developing a knowledge-based system has become useful for the early diagnosis of defects, which prevents catastrophic failure and reduces operating costs [26]. In bearing fault detection, some of the broadly practiced ML algorithms are artificial neural networks (ANN) [27], principal component analysis (PCA) [28], support vector machines (SVM) [29], [30], k-nearest neighbors (k-NN) [31], and singular value decomposition (SVD) [32]. These algorithms analyze the data, learn from them, and then apply what they have learned to make smart decisions concerning the occurrence of bearing faults. These machine learning approaches have yielded satisfactory results in this field.
The use of deep learning has increased rapidly in recent years. Deep learning is a subfield of machine learning that can distinguish both higher-level and lower-level categories with higher accuracy. It works especially well with massive amounts of data. Deep learning techniques tend to provide better efficiency and accuracy than conventional ML because they solve the problem end-to-end, whereas ML techniques first need the problem to be broken down into different parts, whose results are then combined [33]. Fig. 3 shows the working algorithm of both ML and DL, and their performance with the amount of data [34]. Given these significant advantages, DL-based methods have been proposed and are being increasingly used for MFDD as well. Fig. 2 (b) shows the general flow diagram of the implementation of different techniques and their respective performance for bearing fault detection using vibration data.
The methodology of this review is as follows: we survey the research works that implement deep learning-based approaches for bearing fault detection and diagnosis using the CWRU dataset, more specifically the drive-end (DE) defect data and the normal-baseline data. Except for [143], all the papers summarized here use DE (12k or 48k) faulty data and normal-baseline data as healthy bearing data in their deep learning models, whereas [143] uses all the records (drive-end, normal-baseline, and fan-end data) of the CWRU bearing dataset. This paper presents a detailed overview of the recent works and popular DL-based approaches for MFDD. The outline of this paper is as follows: Section I introduces machinery faults, conventional signal processing, and the ML/DL approaches used in bearing fault detection. Section II contains a brief introduction to the publicly available datasets used in bearing fault detection and diagnosis. Section III is the central part of this paper and describes the popular DL algorithms used in machinery fault detection and diagnosis. In this section, we present a detailed overview of the most recent works using DL-based methods such as auto-encoders (AE), convolutional neural networks (CNN), deep belief networks (DBN), generative adversarial networks (GAN), recurrent neural networks (RNN) and long short-term memory (LSTM), and reinforcement learning (RL). Section IV covers some of the works that use transfer learning and domain adaptation-based DL methods. Since this paper mainly focuses on works using the CWRU dataset, Section V points out some of the weaknesses of the CWRU dataset. Section VI lists some of the limitations and challenges of using DL models for MFDD, and Section VII presents some recommendations from the authors. Finally, the paper is concluded in Section VIII.

II. DATASET
Data is the fundamental unit and the foundation of all ML and DL architectures. Deep networks and DL algorithms are influential ML structures that work best when trained on vast amounts of data [35]. With a small training dataset, sample variation is limited and network efficiency decreases. So, the quantity and the frequency of data availability play a vital role in DL applications. An adequate amount of training data in each class leads to better prediction of that class. Generally, the more data, the better the accuracy. Data structures with different labels can be learned better from abundant variations, and the model learns to recognize the features that are invariant to such variations [36]. Usually, a dataset can be categorized into one of two groups depending on its complexity: a simple dataset or a complex dataset. A simple dataset, which is also called a good dataset in general terminology, is much easier to use and tends to allow effective and easy data manipulation and calculation for meaningful statistical analysis [37]. A well-labeled and balanced dataset that does not contain aberrant data or missing values is also considered a good one.
On the other hand, a complex dataset can be defined as a big dataset with large volume, wide variety, high velocity, and high veracity [38]. Real-world data are mostly highly skewed, and learning from such datasets is a challenging task for standard classification algorithms. When presented with complex imbalanced datasets, these algorithms fail to represent the distributive characteristics of the data accurately and consequently provide unfavorable accuracies across the classes of the data. Although it is quite challenging to deal with complex or imbalanced data, imbalance is characteristic of many real-world applications such as medical diagnosis, fraud detection, machinery fault detection, and pricing catastrophe [39], [40]. The problem of imbalanced datasets has been approached from two main directions. The first approach is to preprocess the data by under-sampling the majority instances or over-sampling the minority instances. The second is a cost-sensitive classification approach, in which a given learner is modified to incorporate a different penalty for each of the considered groups of samples [41].
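The two directions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from [41]: `random_oversample` duplicates minority samples at random until every class matches the largest one, and `inverse_frequency_weights` derives per-class penalties that a cost-sensitive learner could use. Both function names and the toy labels are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Resample every class (with replacement) up to the size of the largest class."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing number of samples from this class at random.
        extra = idx[rng.integers(0, idx.size, size=target - count)]
        parts.append(np.concatenate([idx, extra]))
    order = np.concatenate(parts)
    return X[order], y[order]

def inverse_frequency_weights(y):
    """Class weights proportional to 1 / class frequency (rarer class, larger penalty)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    weights = counts.sum() / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy imbalanced set: 8 "healthy" samples (label 0) and 2 "faulty" samples (label 1).
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_toy, y_toy)
class_weights = inverse_frequency_weights(y_toy)
```

Over-sampling changes the data while leaving the learner untouched, whereas class weights leave the data untouched and change the loss; which is preferable depends on the model and on how severe the imbalance is.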
Similarly, data obtained from multiple sources are often messy or follow different internal logics or structures, which increases their complexity. Another factor that makes data complex is size: generally, the bigger the data, the more complicated it is [42]. Traditional techniques cannot identify all the features of imbalanced data because they assume that the data is balanced, so their predictions are biased toward the class with the relatively higher number of samples. A better approach would be to develop a more generic organizing principle that can accommodate all possible types, rather than individual approaches that deal with specific types one by one [41].
Bearings are the most widely studied subject in machinery fault detection and anomaly detection. One reason may be the readily available public datasets [43]. Some of the accessible open-source bearing datasets used for bearing fault detection and diagnosis are as follows:

A. CWRU DATASET
The CWRU dataset is a popular, open-source, and easily accessible dataset. The generated dataset is recorded and available on the CWRU website, which provides access to the bearing data for normal and faulty bearings. In this database, the data were collected for normal bearings, single-point drive-end (DE) defects, and fan-end (FE) defects. The CWRU bearing dataset serves as the standard reference [23] and the fundamental dataset [2] to authenticate the performance of different ML and DL algorithms.
The bearing test rig arrangement used to obtain the CWRU dataset is shown in Fig. 1. It consists of a 2 hp Reliance electric induction motor, a torque transducer, a dynamometer, and control electronics, which are not shown in the figure. The test bearings support the motor shaft. Torque is applied to the shaft through the dynamometer and the electronic control system. The faults were seeded on the rolling elements (balls), the inner race (IR), and the outer race (OR) of the REBs, and each faulty bearing was reinstalled on the test rig. Electro-discharge machining was used to introduce single-point faults to the test bearings with fault diameters of 7 mils, 14 mils, 21 mils, 28 mils, and 40 mils, where one mil is equal to 0.001 inches. SKF bearings were used for the 7, 14, and 21 mil diameter faults, whereas NTN equivalent bearings were used for the 28 mil and 40 mil faults. The depth of the fault was 0.011 inches for all the bearings except for the inner-race fault of diameter 0.028 inches, the outer-race fault of diameter 0.040 inches, and the ball fault of diameter 0.028 inches. The depth of the fault is 0.050 inches for both the inner-race fault of diameter 0.028 inches and the outer-race fault of diameter 0.040 inches. For the ball fault of 0.028 inches diameter, the depth of the fault is 0.150 inches [4].
Acceleration data was measured at locations near to and remote from the motor bearings, using multiple sensors placed at different positions. Accelerometers, which were attached to the housing with magnetic bases and placed at the 12 o'clock position at both the DE and FE of the motor, were used for collecting vibration data. Additionally, for some experiments, an accelerometer was attached to the motor supporting base plate. The data was collected using a 16-channel DAT recorder and processed in a MATLAB environment, and all the data files were stored in MATLAB (.mat) format. Each file contains one or more of the recorded DE, FE, and base plate acceleration (BA) data. Sampling frequencies of 12 kHz and 48 kHz were used for data collection. For the drive-end bearing experiments, data was collected at 12k and 48k samples per second. Fan-end data was collected at 12k samples per second. For the normal baseline, the data collection rate was 48k samples per second. Vibration data was recorded for motor loads of 0 to 3 horsepower, with motor speeds of 1720 to 1797 rpm, after the faulted bearings were reinstalled into the test motor. The dataset consists of data files for different torque loads applied by the dynamometer. However, 'load' is virtually meaningless in bearing fault detection and diagnosis because there is no mechanism to convert the torque into a radial load borne by the bearings. The main consequence of the motor load is on the shaft speed, which declines by approximately 4% in the maximum load (3 hp) case, and this would have minimal effect on the diagnosability of the datasets [4], [23].
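As a quick check of the speed figures quoted above, and assuming that 1797 rpm corresponds to the unloaded motor and 1720 rpm to the 3 hp case (the text gives only the range), the relative speed drop and the corresponding shaft rotation frequencies work out as:

```latex
\frac{1797 - 1720}{1797} = \frac{77}{1797} \approx 0.043 \;(\approx 4.3\%), \qquad
f_r^{\,0\,\mathrm{hp}} = \frac{1797}{60} \approx 29.95\ \mathrm{Hz}, \qquad
f_r^{\,3\,\mathrm{hp}} = \frac{1720}{60} \approx 28.67\ \mathrm{Hz}
```

Because every characteristic bearing frequency scales linearly with the shaft frequency, a load change shifts the fault-related spectral lines by the same small percentage.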

The dataset consists of 161 records, which are grouped into four classes: 48k normal-baseline, 48k drive-end fault, 12k drive-end fault, and 12k fan-end fault. Each group, in turn, consists of data for ball (B) faults, inner-race faults, and outer-race faults. According to the fault location relative to the load zone, the outer-race faults are further classified into three categories: 'centered' (fault at the 6 o'clock position), 'orthogonal' (3 o'clock), and 'opposite' (12 o'clock).
"Defective bearings produce vibration equal to the rotational speed of each component bearing frequencies. They relate notably to the rotation of the balls, the cage, and the passage of the balls on the inner and outer races" [1]. The bearing fault frequencies associated with a defective inner race, outer race, cage, and ball are as follows:

$$\mathrm{BPFI} = \frac{n f_r}{2}\left(1 + \frac{d}{D}\cos\alpha\right)$$

$$\mathrm{BPFO} = \frac{n f_r}{2}\left(1 - \frac{d}{D}\cos\alpha\right)$$

$$\mathrm{FTF} = \frac{f_r}{2}\left(1 - \frac{d}{D}\cos\alpha\right)$$

$$\mathrm{BSF} = \frac{D f_r}{2d}\left[1 - \left(\frac{d}{D}\cos\alpha\right)^{2}\right]$$

where BPFI is the inner-race ball pass frequency, BPFO is the outer-race ball pass frequency, FTF is the fundamental train frequency (cage speed), and BSF is the ball (roller) spin frequency. Furthermore, $f_r$ is the shaft speed, $n$ is the number of rolling elements, $d$ is the rolling element diameter, $D$ is the bearing pitch diameter, and $\alpha$ is the angle of the load from the radial plane. Fig. 1 (b) shows the cross-sectional view of the REB, where the parameters $D$, $d$, and $\alpha$ can be visualized. Furthermore, Table 1 shows the information of the bearings used in the CWRU dataset, whereas Table 2 tabulates the different bearing fault frequencies. These frequencies are helpful when detecting faults through signal processing methods.
The CWRU bearing dataset is also long, varied, and complex. Each data file contains data of a different length, which is not an integer multiple of 2. Table 3 shows the data length information for the 4 classes of the CWRU dataset.
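The characteristic frequencies named above (BPFI, BPFO, FTF, BSF) can be evaluated numerically with the short sketch below. It assumes the standard kinematic expressions for a bearing with a stationary outer race and a rotating inner race: BPFI = (n f_r / 2)(1 + (d/D) cos α), BPFO = (n f_r / 2)(1 - (d/D) cos α), FTF = (f_r / 2)(1 - (d/D) cos α), and BSF = (D f_r / 2d)(1 - ((d/D) cos α)^2). The geometry passed in the example (9 rolling elements, d = 0.3126 in, D = 1.537 in, zero contact angle) is an illustrative assumption and should be replaced with the values from the bearing data sheet or Table 1; only the 1797 rpm shaft speed is taken from the text.

```python
import math

def bearing_fault_frequencies(f_r, n, d, D, alpha=0.0):
    """Characteristic fault frequencies (same unit as f_r) of a rolling element bearing.

    f_r   : shaft rotation frequency
    n     : number of rolling elements
    d     : rolling element diameter
    D     : pitch diameter (same length unit as d)
    alpha : load (contact) angle from the radial plane, in radians
    """
    ratio = (d / D) * math.cos(alpha)
    return {
        "BPFI": 0.5 * n * f_r * (1.0 + ratio),             # inner-race ball pass frequency
        "BPFO": 0.5 * n * f_r * (1.0 - ratio),             # outer-race ball pass frequency
        "FTF": 0.5 * f_r * (1.0 - ratio),                  # fundamental train (cage) frequency
        "BSF": (D / (2.0 * d)) * f_r * (1.0 - ratio ** 2),  # ball spin frequency
    }

# Shaft speed from the text: 1797 rpm. Geometry values are assumed for illustration.
shaft_hz = 1797 / 60.0
freqs = bearing_fault_frequencies(shaft_hz, n=9, d=0.3126, D=1.537)
```

Two identities are useful sanity checks on any implementation: BPFI + BPFO equals n times the shaft frequency, and BPFO equals n times FTF. Note also that some tabulations report the ball defect frequency as twice the BSF defined here, because a ball defect strikes both races once per ball revolution.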
As mentioned earlier, the CWRU bearing dataset is widely used and is taken as the standard reference for validating many ML and DL algorithms. The dataset does not contain masking sources, which makes it easier to use [22]. A review of the research employing DL algorithms with this dataset is presented in Section III.

B. PADERBORN UNIVERSITY DATASET
This dataset [44] is also for bearing fault diagnosis and is provided by the KAT datacenter in Paderborn University. The essential components of the test rig are a drive motor, a torque measurement shaft, a test module and a load motor. Fig. 4 (b) shows the mechanical setup of the test rig for the Paderborn University dataset. The Paderborn University bearing dataset consists of high-resolution vibration data, which are collected from experiments performed on six healthy bearings and 26 damaged bearing sets. Out of the 26 damaged bearing sets, 12 were artificially damaged, and 14 were damaged using accelerated life tests to simulate real damage [45]. It provides the basis for the development, validation, and training of the diagnostic algorithms for condition monitoring (CM) of rolling bearings. The test rig was operated under different operating conditions to ensure the robustness of CM methods at different operating conditions [44]. In total, experiments with 32 different bearing damages in ball bearings of type 6203 were performed. This dataset consists of the synchronously measured motor currents and vibration signals, which enable more accurate testing and implementation of the ML algorithms in practical applications, where the real defects are generated through aging and the gradual loss of lubrication [2]. References [45]- [48] use this dataset in their research.
FIGURE 4. Overview of (a) the PRONOSTIA experimental platform [53], (b) the mechanical setup of the test rig for the Paderborn University dataset [44], and (c) the experimental setup for the IMS dataset [62].

C. FEMTO DATASET
The FEMTO dataset [49], [50] is provided by the FEMTO-ST Institute, Besancon, France. It contains real experiments on accelerated bearing life tests, which were generated using an experimental platform called PRONOSTIA. PRONOSTIA is an experimentation platform dedicated to testing and validating bearing fault detection, diagnostic, and prognostic approaches [51]. The main objective of PRONOSTIA is to provide real data related to the accelerated degradation of bearings under constant and/or variable operating conditions, which are controlled online. The experimental platform, which is shown in Fig. 4 (a), makes it possible to degrade a bearing in only a few hours, so a significant number of experiments can be conducted within a week. The PRONOSTIA testbed is composed of three main parts: a rotating part, a degradation generation part, and a measurement part. The operating conditions are characterized by two sensors: a rotating speed sensor and a force sensor. In the PRONOSTIA platform, the bearing's health is monitored online by gathering two types of signals: temperature signals and vibration signals, the latter with the help of horizontal and vertical accelerometers. The vibration signals were recorded every 10 seconds with a sampling frequency of 25.6 kHz, which allows the whole frequency spectrum of the bearing to be captured throughout its degradation. The vibration signals of the degraded bearings are then compared to a nominal vibration signal of a non-degraded (nominal) bearing. Finally, the monitoring data provided by the sensors can be processed further in order to extract relevant features and continuously assess the health condition of the bearing. In total, 17 run-to-failure datasets under three different operating conditions were included in the FEMTO dataset, but the specific failure mode of the failed bearing in each test is not declared [52], [53]. Some of the publications using this dataset are [54]- [56].

D. MFPT DATASET
Another dataset for fault detection and diagnosis of REBs is the MFPT dataset [57], which is provided by the Society for Machinery Failure Prevention Technology. The goal of the Condition Based Maintenance (CBM) fault database is to provide various datasets of known good and faulted conditions for both bearings and gears. A bearing fault dataset has been provided to facilitate research into bearing analysis. The dataset comprises data from a bearing test rig (nominal bearing data, an outer race fault at various loads, and an inner race fault at various loads) and three real-world faults [57]. The three real-world example files include an intermediate shaft bearing from a wind turbine, an oil pump shaft bearing from a wind turbine, and a real-world planet bearing fault. The data made available by the MFPT uses a NICE bearing. The defect seeding process is not described. Three measurements with a load of 1201 N on the bearing are provided for the baseline condition and an outer race fault, as well as seven measurements for both the outer and inner race faults over a range of 0-1334 N bearing load [58]. This dataset is freely distributed with example processing code in the hope that researchers and CBM practitioners will improve upon the techniques and that, consequently, sophisticated CBM systems will be developed. The studies that use this dataset for rolling bearing fault detection are [58]- [61].

E. IMS DATASET
This bearing dataset [62], [50] is provided by the Center for Intelligent Maintenance Systems (IMS), University of Cincinnati, and can be downloaded from the Prognostics data repository, which is hosted by NASA. The Prognostics data repository is a collection of datasets that have been donated by various universities, agencies, or companies. The data repository focuses exclusively on prognostic datasets, i.e., datasets that can be used for the development of prognostic algorithms [50]. For the collection of the IMS bearing dataset, four test bearings were mounted on one shaft, which was driven by an AC motor coupled via rub belts. The experimental setup for this dataset is shown in Fig. 4 (c). The rotational speed was kept constant at 2000 rpm [43]. This database consists of three different datasets. In set one, two high-precision accelerometers were installed on each bearing, whereas in sets two and three, only one accelerometer was used per bearing. Each dataset is organized into individual files, each containing a 1-second vibration signal snapshot recorded at specific intervals. Each file consists of 20,480 points with the sampling rate set at 20 kHz. The file name indicates when the data was collected [62].
Some of the research works which employ this dataset are [63], [64].
Among all the datasets mentioned above, the Case Western Reserve University (CWRU) dataset is the most widely used [23] for the detection and diagnosis of faults in machinery bearings; thus, it serves as the fundamental dataset to validate the performance of ML and DL algorithms [23]. This paper focuses on the works carried out for bearing fault detection using the CWRU dataset with DL-based approaches. The following section describes the DL-based approaches used for bearing fault detection and diagnosis.

III. DEEP LEARNING-BASED APPROACHES FOR BEARING FAULT DIAGNOSIS
Deep learning is a branch of ML that is based on ANNs and inspired by the functionality of human brain cells called neurons. In deep learning, it is not necessary to program everything explicitly. It allows computational models, which are composed of multiple processing layers, to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection, anomaly detection, and many other domains. DL discovers complicated structure in large datasets by using the back-propagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer [65]. After analyzing some simple features in the early layers, the network sends this information to the following layer, which combines it into something more complex and passes it on to the next layer. This process continues until the final layer, where the classification or output is obtained. Deep convolutional nets have brought about breakthroughs in processing images, video, speech, and audio. When working with both structured and unstructured data, DL networks perform automatic feature extraction without human involvement [66], [67].
Deep learning approaches can further be classified into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. A supervised learning algorithm trains a model with data that are correctly classified or labeled. The model learns to map the input variables to the output variables. The goal is to approximate the mapping function so well that when new data is introduced to the model, it can predict the output variables accurately. Logistic regression, classification, and back-propagation neural networks are some examples of supervised learning. However, most real-world data are unlabeled. Labeling each datum is a complicated and expensive process that needs expert supervision. Unsupervised learning addresses this case by working directly with unlabeled data. Such an algorithm can learn and organize information without being provided error signals to evaluate the potential solution [68]. Clustering, dimensionality reduction, association, anomaly detection, etc., use this type of learning.
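The idea of learning an input-to-output mapping from labeled data can be illustrated with logistic regression, one of the supervised methods mentioned above. The sketch below is a toy example under stated assumptions: the single input feature, the four labeled points, the learning rate, and the function names are all hypothetical and chosen only so that the example is easy to follow.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the cross-entropy loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of class 1
        error = p - y                            # loss gradient with respect to the logits
        w -= lr * (X.T @ error) / y.size
        b -= lr * error.mean()
    return w, b

def predict(X, w, b):
    """Assign class 1 when the predicted probability exceeds 0.5 (logit above zero)."""
    return (np.asarray(X, dtype=float) @ w + b > 0.0).astype(int)

# Toy labeled data: one feature, label 0 for negative values and 1 for positive values.
X_train = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_train = np.array([0, 0, 1, 1])
w, b = train_logistic_regression(X_train, y_train)
train_pred = predict(X_train, w, b)
new_pred = predict([[3.0], [-3.0]], w, b)
```

The labels supply the error signal (`p - y`) that drives every update; this error signal is exactly what is missing in the unsupervised setting.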
In almost all real-world applications, the data is sparsely labeled. Semi-supervised learning lies in between supervised and unsupervised learning in which only the small set of data has corresponding labels. Data provided is with a mixture of labeled and unlabeled data. The semi-supervised learning algorithm is the one in which one can use both supervised learning techniques to make a best-guess prediction for the unlabeled data with the back-propagation algorithm and unsupervised learning technique to discover and learn the structure in the input variables [69], [70]. This algorithm overcomes the drawbacks of supervised and unsupervised learning and is being implemented these days widely. Another deep learning algorithm is reinforcement learning (RL), in which the algorithm is presented with examples that lack labels. This algorithm, which is the combination of ANNs and RL architecture, allows software-defined agents to learn the finest actions possible to achieve their goals. In this type of learning, the algorithm must make decisions, and these decisions relate to a consequence. It deals with delayed rewards and trains the structures to learn from the consequences of their own decision [71]. It is analogous to trial and error in human learning.
Again, common ML systems conventionally address isolated tasks. Transfer learning tries to change this by developing methods to transfer knowledge learned in one or more source tasks and use it to improve learning in a related target task. Transfer learning is a machine learning with an additional source of information apart from the standard training data: knowledge from one or more related tasks [72]. The ML and DL algorithms work efficiently when the data is massive. However, having a vast amount of data is not always favorable in real-world applications. So, to overcome this issue, transfer learning approaches were proposed. It solves the problem of insufficient training data [73].
In this algorithm, the information flows in one direction only, from source task to the target task. Among all the transfer learning approaches, domain adaptation is one of the most popular methods [2]. Domain adaptation deals with scenarios in which a model trained on a source distribution is used in the context of a different target distribution. In general, domain adaptation uses labeled data in one or more source domains to solve new tasks in a target domain [74].
Because of the above mentioned and other numerous advantages of DL-based algorithms, they have been used in multiple studies. Image recognition, pattern recognition, object detection, text recognition, speech recognition, fault detection, and anomaly detection are some of the fields in which DL algorithms have performed well. In MFDD, both the supervised and unsupervised methods have been implemented. The DL models like auto-encoders, convolutional neural networks, deep belief networks, generative adversarial networks, recurrent neural network and long short-term memory, and reinforcement learning have shown their excellent performance with higher accuracy and reliability.
The following section describes the deep learning algorithms that have been applied for the task.

A. AUTO-ENCODERS AND MODIFIED AUTO-ENCODERS
An auto-encoder is a widely adopted unsupervised learning model that is trained to copy its input to its output. Internally, it has a hidden layer $h$, which defines a code used to represent the input. The network can be regarded as containing two parts: an encoder function $h = f(x)$ and a decoder that produces a reconstruction $r = g(h)$ [67]. Fig. 5 shows the common structure of an auto-encoder, where $x = \{x_1, x_2, \ldots, x_n\}$ denotes the input data vector, $h = \{h_1, h_2, \ldots, h_m\}$ represents the vector in the hidden layer, and $r = \{r_1, r_2, \ldots, r_n\}$ is the output of the output layer, i.e., the reconstructed data vector. Here, $W_1$ and $W_2$ are the weight matrices that connect the adjacent layers. The network takes the mean-square error between the original input and the output as its loss function, so that the output imitates the input. After training, the decoder part is removed and only the encoder part is kept. Hence, the output of the encoder is the feature representation that can be fed to next-stage classifiers.
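The encoder, decoder, and loss described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: a sigmoid encoder and a linear decoder without bias terms, randomly initialized weights, and no training loop. The sizes n = 6 and m = 3 and all function names are hypothetical and are not taken from any reviewed work.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def encode(W1, x):
    """Encoder h = f(x): linear map followed by a sigmoid (assumed form)."""
    return [sigmoid(a) for a in matvec(W1, x)]


def decode(W2, h):
    """Decoder r = g(h): linear reconstruction of the input (assumed form)."""
    return matvec(W2, h)


def mse(x, r):
    """Mean-square error between the input x and its reconstruction r."""
    return sum((a - b) ** 2 for a, b in zip(x, r)) / len(x)


rng = random.Random(0)
n, m = 6, 3  # input size and hidden (code) size, chosen for illustration
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(m)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(n)]

x = [0.1, 0.4, -0.2, 0.3, 0.0, -0.1]
h = encode(W1, x)   # feature representation kept after training
r = decode(W2, h)   # reconstruction, used only while training
loss = mse(x, r)    # quantity minimized by back-propagation
```

After training, only `encode` would be kept, and `h` would serve as the input to the next-stage classifier.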
Many studies have been carried out for the bearing fault detection and diagnosis with an auto-encoder and modified auto-encoders. F. Jia et al. did the pioneering work in [75], where they utilized a 5-layer auto-encoder based deep neural network (DNN) and pre-trained it layer by layer with the unsupervised technique and further fine-tuned the DNN with a back-propagation algorithm for classification of the fast Fourier transformed faults and normal bearing data. The merit of this architecture is that it can adaptively obtain fault characteristics from the measured signals for many diagnosis problems, and the technique is good at establishing the non-linear mapping relationship between the different health conditions of machinery and the corresponding measured signals. The number of classes used was 10 and, the classification accuracy for the CWRU bearing data ranges from 99.68 to 99.95 percent, which is much better than the accuracy obtained by using the BPNN-based method. They implemented the PCA method to verify the ability of the proposed method.
Similarly, in [76], the authors have used a stacked autoencoder to construct a DNN. Their algorithm involves two steps of first training the auto-encoder with the input set to get the feature vectors (say h) and then removing the output layer of the auto-encoder, which was trained before, and setting the feature vectors obtained, h, as the input set for the next auto-encoder. So, they stacked the auto-encoder for the classification of the faulty bearing data. In the beginning, they implemented several data preprocessing procedures where they divided the data from various groups into segments of 600 points with 80% overlap, and then FFT was applied on every section to get the amplitude, which was so small that it needed to be multiplied by a certain coefficient. The number of neurons in hidden layers was 800 and 400, respectively, and the number of fault classes was 6. They also used the PCA method to reduce the data dimensions for visualizing.
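The preprocessing described for [76] can be illustrated as follows. This is a hedged sketch, not the authors' code: the segment length (600 points) and the overlap (80%) come from the text, while the naive DFT, the one-sided spectrum, the synthetic test signal, and all function names are assumptions. The scaling coefficient mentioned in the text is not specified there and is therefore left out.

```python
import cmath
import math


def segment(signal, length=600, overlap=0.8):
    """Cut a 1-D signal into fixed-length segments with fractional overlap."""
    step = int(round(length * (1.0 - overlap)))  # 120 points for 600 / 80%
    last_start = len(signal) - length
    return [signal[s:s + length] for s in range(0, last_start + 1, step)]


def amplitude_spectrum(seg):
    """One-sided amplitude spectrum from a naive O(N^2) DFT."""
    n = len(seg)
    spectrum = []
    for k in range(n // 2):
        acc = sum(seg[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) / n)
    return spectrum


# Synthetic stand-in for a vibration record: 30 cycles per 600 points.
signal = [math.sin(2.0 * math.pi * 30 * t / 600) for t in range(1200)]
segments = segment(signal)
spectrum = amplitude_spectrum(segments[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

In practice an FFT routine would replace the naive DFT; the quadratic version is used here only to keep the sketch free of external dependencies.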
Some other works which employ auto-encoder in bearing fault diagnosis using the CWRU dataset are listed and described in Table 4. The references [77], [78] are some of the works carried out on MFDD with an auto-encoder using a dataset other than the CWRU dataset.

B. CONVOLUTIONAL NEURAL NETWORKS
Since LeCun first proposed them for image processing [86], CNNs have massively improved the performance in computer vision, object detection, natural language processing, and speech recognition, helped by the increase in GPU performance and memory. No doubt, CNNs are the most representative model of deep learning.
Thus, the use of CNNs has proliferated within computer vision. After the success of AlexNet [87], the era of deep 2-D CNNs began, and they replaced the traditional classification and recognition approaches within a short period. The 1-D CNN is a modified version of the 2-D CNN in which forward propagation (FP) and back-propagation (BP) are simple array operations rather than matrix operations, which makes it more efficient for applications dealing with 1-D signals. Also, 1-D CNNs with a relatively shallow structure are able to learn challenging tasks involving 1-D signals, and they are well suited for real-time applications [24]. Fig. 6 and Fig. 7 show the basic 2-D and 1-D CNN structures, respectively. The use of CNNs in bearing fault detection has been practiced in recent years. Some of the published works for detecting and diagnosing machinery bearing faults with a CNN architecture using the CWRU dataset are described briefly in the following section.
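To make the remark about simple array operations concrete, the forward pass of one 1-D convolutional layer can be written with plain loops over arrays. The sketch below is illustrative only; the kernel values, the pooling size, and the function names are assumptions, and, as in most CNN libraries, the "convolution" is implemented as cross-correlation.

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Slide the kernel over the signal without padding ('valid' mode)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing remainder is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
kernel = [1.0, 0.0, -1.0]          # hypothetical edge-like filter

feature_map = conv1d_valid(x, kernel)
activated = relu(feature_map)
pooled = max_pool1d(activated)
```

A 2-D layer performs the same multiply-and-sum, but over a two-dimensional window, which is why the 1-D variant is cheaper for raw vibration signals.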
A hierarchical adaptive deep convolutional neural network (ADCNN) is applied on the CWRU dataset in [88], which has two hierarchically organized components: a fault determination layer and a fault size evaluation layer. Here, the learning rate is changed dynamically so that a better trade-off is maintained between the training speed and accuracy. The ADCNN model in the first layer is based on the classical LeNet5 models proposed by LeCun. The number of layers used is 9, and the batch size is 100. The number of classes used is 4, among which one is the 'healthy' bearing class, and the other three are for fault bearings. The overall accuracy is 97.7%. SoftMax is used in the classification layer.
In [89], a CNN-based approach with multiple sensor fusion is proposed for fault diagnosis of rotating machinery in which both spatial and temporal raw data from the drive-end and fan-end are stacked by converting 1-D vibration data into 2-D input matrix and sent to CNN as input. Representative features were learned directly from raw signals by training the CNN model, where no hand-crafted features were needed. Out of the total samples, 70% were used for training, 15% for validation, and 15% for testing. The average accuracy with two sensors was found to be 99.41%, whereas that with only one sensor is 98.35%. Minibatch stochastic gradient descent and dropout were utilized in the training process with increased efficiency and prevention of overfitting when the size of available data was small. The number of classes used was 9, and SoftMax is used in the classification layer.
An intelligent rotating machinery fault diagnosis method based on DL using data augmentation is proposed in [90]. The authors use two augmentation methods, sample-based and dataset-based, and five augmentation techniques in general: additional Gaussian noise, masking noise, signal translation, amplitude shifting, and time stretching. Raw vibration signals are used directly as input without any preprocessing, and a CNN structure with residual learning is applied to train the network. A 1-D convolutional layer is first used with $F_N$ filter kernels of window length $F_L$. The raw data are processed for feature extraction initially and transformed into multiple feature maps. Next, residual blocks are stacked to learn high-level abstract representations. In each block, two convolutional layers are adopted for residual learning. For the convolutional layers in the proposed network, the zero-padding operation is implemented to keep the feature map dimension from changing. They also use the batch normalization technique after each convolutional layer to accelerate the training process and dropout to reduce overfitting. ReLU is used as the activation function, whereas SoftMax is used as the classifier. The number of labels used is 10. The best accuracy obtained on the CWRU dataset is 99.91% for 400 training samples.
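The five augmentation techniques named for [90] can be illustrated on a raw 1-D signal as below. The text does not give their exact parameterization, so every detail here is an assumption: circular shifting for signal translation, multiplicative scaling for amplitude shifting, linear interpolation with zero padding for time stretching, and all function names and parameter values.

```python
import random


def add_gaussian_noise(x, sigma, rng):
    """Additional Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def mask_noise(x, fraction, rng):
    """Masking noise: set a random fraction of the points to zero."""
    count = int(round(fraction * len(x)))
    masked = set(rng.sample(range(len(x)), count))
    return [0.0 if i in masked else v for i, v in enumerate(x)]


def translate(x, shift):
    """Signal translation, assumed here to be a circular shift."""
    shift %= len(x)
    return x[-shift:] + x[:-shift]


def amplitude_shift(x, scale):
    """Amplitude shifting, assumed here to be a multiplicative scaling."""
    return [scale * v for v in x]


def time_stretch(x, rate):
    """Resample by linear interpolation; rate > 1 stretches the signal.
    The output keeps the input length (cropped or zero-padded)."""
    n = len(x)
    out = []
    for i in range(n):
        pos = i / rate
        if pos > n - 1:
            out.append(0.0)  # past the end of the original signal
            continue
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(x[lo] * (1.0 - frac) + x[hi] * frac)
    return out


rng = random.Random(0)
sample = [1.0] * 10
augmented = [
    add_gaussian_noise(sample, 0.05, rng),
    mask_noise(sample, 0.3, rng),
    translate(sample, 3),
    amplitude_shift(sample, 0.8),
    time_stretch(sample, 1.2),
]
```

Each function returns a new signal of the same length as its input, so the augmented samples can be mixed into the training set without changing the network input size.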
Some other works employing CNN for MFDD using the CWRU bearing dataset are tabulated in Table 5. Again, [24], [91]-[93], and [94] are some of the published works which use CNN architecture for MFDD using a dataset other than the CWRU bearing dataset.

C. DEEP BELIEF NETWORKS
Deep Belief Networks (DBNs), first proposed in [100], are network architectures consisting of many intermediate layers of Restricted Boltzmann Machines (RBMs), in which each RBM layer communicates with the earlier and the following layers. There is no intra-layer communication. The final layer is for the classification. In contrast to other models, each layer in a deep belief network learns the entire input. The deep architecture of the DBN addresses the issue of Multi-Layer Perceptrons (MLPs) getting stuck at local optima by representing multiple features of input patterns hierarchically. Furthermore, it also benefits from optimizing the weights at each layer. This problem-solving approach involves making the optimal choice at each layer in the sequence, with the aim of ultimately reaching a global optimum. These greedy learning algorithms start from the bottommost layer and move up, fine-tuning the generative weights. Here, the learning takes place on a layer-by-layer basis, training the layers of the DBN one at a time. Consequently, each layer obtains a different type of data. Except for the first and last layers, each layer in a DBN has a dual role, i.e., it serves as the hidden layer to the earlier nodes and as the input or visible layer to the following nodes. It can be called a network built of single-layer networks. These networks can mitigate problems such as slow learning and overfitting in deep learning. An architecture of DBN is shown in Fig. 8. Besides being used widely in recognizing, clustering, and generating images, video sequences, and motion-capture data, they have also been used in detecting and diagnosing rotary machinery faults. Some of the published works in MFDD using the CWRU dataset are described below.
A rolling bearing fault diagnosis using adaptive DBN with a dual-tree complex wavelet packet (DTCWPT) is proposed in [101]. Here, an adaptive DBN is constructed to improve the convergence rate and identification accuracy with multiple stacked adaptive restricted Boltzmann machines (RBMs). The authors used three-layer DTCWPT to preprocess the raw vibration signals to improve fault features. Three experiments were conducted. In experiment 1, first, DTCWPT, EMD, EEMD, and WPT are used to preprocess the raw vibration signal, respectively. Then, nine feature parameters are, respectively, extracted from each frequency-band signal of DTCWPT and WPT and effective IMFs of EMD and EEMD. Finally, all the normalized feature parameters are directly fed into the adaptive DBN. The decomposition level of DTCWPT is set as 3, and the architecture of the adaptive DBN is 72-400-250-100-16. The initial learning rate and momentum are selected as 0.03 and 0.86, respectively. The increasing factor, decreasing factor, and threshold are 1.25, 0.6, and 0.2, respectively. The pre-training of each adaptive RBM is completed using 200 iterations. For DBN and EEMD, the ensemble number is chosen as 100, and the added noise amplitude is 0.005 times the standard deviation of the analyzed signal. The first 10 IMFs containing almost all valid information are selected, and the DBN architecture is 90-500-300-100-16. In experiment 2, DTCWPT is applied to preprocess the raw vibration signal, and then, the nine features are extracted from each frequency-band signal of DTCWPT. Finally, without any manual feature selection, all the normalized feature parameters are directly fed into the adaptive DBN, BP, general regression neural network (GRNN), and SVM, respectively. The testing accuracy using the proposed method is 94.37%. In experiment 3, sixteen sensitive features are manually selected from the original feature set, including eight kurtosis values and eight wavelet packet energy values. Then, the selected 16-dimensional features are fed into the adaptive DBN, BP, GRNN, and SVM, respectively. The architecture of the adaptive DBN is 16-80-45-25-16.
Similarly, [102] combines DBN and Dempster-Shafer (DS) evidence theory for bearing fault diagnosis. Here, the researchers installed multiple sensors, where the features of the signals from each sensor are extracted using a DBN and classified with SoftMax. Then, the predicted results from the SoftMax classifiers are fused by DS evidence theory to generate the final prediction of the bearing health status. In the experiments, vibration signals from the DE and FE sensors, which cover four bearing health conditions, are considered. To obtain enough data for training and testing the model, each vibration data file is split into segments of equal length. They used the FFT algorithm as the preprocessor for the raw time-domain data. For feature learning and feature classification, two DBNs are designed to learn fault features from the bearing signals. Since the signals from both sensors are vibration signals, both DBN models are developed with the same configuration. The training processes of the DBNs are carried out step by step. After the feature learning process, two feature sets, one each from the FE and DE sensors, are obtained. Two SoftMax classifiers are employed to classify these two feature sets. The classification results of the two SoftMax classifiers are combined to achieve a better result: DS evidence theory is exploited to fuse the decisions and generate the final prediction. The proposed method achieved 100% accuracy. Also, the proposed system does not need to be retrained when the working condition (load) changes. They also evaluated the model performance under different noisy conditions, in which the worst-case accuracy was 98.5%.
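The decision-fusion step can be illustrated with Dempster's rule of combination. The sketch assumes that each SoftMax output is treated as a mass function that assigns belief only to the four single health conditions; under that assumption the rule reduces to a normalized element-wise product. Whether [102] builds its mass functions exactly this way is not stated here, and the probability values and names below are hypothetical.

```python
def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions defined on singleton classes.

    With masses on singletons only, the combined mass of class i is
    m1[i] * m2[i] / (1 - K), where K is the conflicting mass.
    """
    joint = [a * b for a, b in zip(m1, m2)]
    agreement = sum(joint)            # equals 1 - K
    if agreement == 0.0:
        raise ValueError("the two sources are in total conflict")
    return [j / agreement for j in joint]


# Hypothetical SoftMax outputs for four health conditions.
p_drive_end = [0.6, 0.2, 0.1, 0.1]
p_fan_end = [0.5, 0.3, 0.1, 0.1]

fused = dempster_combine(p_drive_end, p_fan_end)
predicted_class = fused.index(max(fused))
```

When both sensors favor the same class, the fused belief in that class is higher than either individual belief, which is the effect the fusion is meant to exploit.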
Some of the other remarkable works regarding MFD which employ DBN using the CWRU dataset are listed in Table 6, and the references of the works using other datasets: [103]- [108].

D. GENERATIVE ADVERSARIAL NETWORKS
Generative adversarial networks (GANs) are deep neural network structures comprised of two networks competing against each other. They belong to the family of generative models, which means they can generate new content, typically from a random input. They were first introduced in [114] by Ian Goodfellow and other researchers. Here, the discriminator, D, and the generator, G, play the following two-player minimax game [114] with value function V(D, G):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
For the provided training dataset, the generator intends to create samples that have the same probability distribution as the real data or training dataset. The discriminator is a conventional binary classifier and is primarily responsible for two tasks. In the first one, it determines whether the input comes from the real data distribution or from the generator. In the second task, the discriminator guides the generator through the back-propagated gradient to produce a more realistic sample, which is the only way for the generator to improve its performance in generating fake samples. During this two-player minimax game, the generator takes random noise as input and outputs a fake sample, and it tries to maximize the probability that the discriminator assigns this sample to the original training set. During the training period, the discriminator takes a sample of the training set as input for half of the time and the fake sample, $G_{sample}$, generated by the generator as input for the other half.
The discriminator, in this process, is trained to maximize the distance between the two groups, i.e., to differentiate between the actual sample from the training set and the fake sample generated by the generator. Thus, the generator should be capable of making the generated probability distribution and the actual data distribution as near as possible so that the discriminator cannot decide which sample is real and which one is fake. Hence, in this adversarial process, the generator's capability to learn the actual data distribution becomes stronger and stronger, and the discriminator's feature learning and discriminative capacity also become stronger and stronger. Eventually, the training will reach a Nash equilibrium, where the discriminator is unable to differentiate between the two distributions, i.e., D(x) = 1/2. Since this equilibrium is very tough to find, many research papers [115], [116] address this issue [114], [117]. Fig. 9 shows an overview of the GAN architecture. Like in every other field, the use of GANs is increasing in bearing fault detection as well. Several works have been published using GANs for machinery bearing fault detection and classification using the CWRU dataset. Some of them are described briefly in the following section.
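As a small numerical illustration of the two-player game, the value function of [114], the expectation of log D(x) over real samples plus the expectation of log(1 - D(G(z))) over generated samples, can be estimated from discriminator outputs. The output values below are made up for illustration; at the equilibrium point D(x) = 1/2 the value equals -log 4.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for samples drawn from the training data
    d_fake: D(G(z)) for samples produced by the generator
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that still separates real from fake samples well.
confident = gan_value(d_real=[0.90, 0.95], d_fake=[0.10, 0.05])

# A discriminator that can no longer tell the two apart: D = 1/2 everywhere.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator tries to push this value up, as in `confident`, while the generator tries to push it down toward `fooled`.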
Being inspired by the Wasserstein distance of optimal transport, Cheng et al. proposed a Wasserstein distance-based deep transfer learning method [13]. The authors created a deep transfer learning (DTL) algorithm that performs learning in the target domain by leveraging information from an appropriate source domain: it learns domain feature representations that minimize the distribution discrepancy between the source and target domains through adversarial training. A CNN-based feature extractor generates those representations. They applied the Wasserstein-1 distance to measure the loss of the domain discriminator because the Wasserstein-1 distance corresponds to the optimal transport plan with the lowest transport cost. In their research, they inspected four types of bearing condition data acquired by CWRU. Simple data preprocessing techniques are applied to the bearing datasets: dividing the samples equally between the source domain and the target domain, implementing FFT to compute the frequency-domain power spectrum of every sample, and clipping the left side of the power spectrum calculated by FFT as the input of WD-DTL. They proposed three transfer scenarios, including two unsupervised scenarios and one supervised scenario. The unsupervised transfer scenarios are for transfer between motor speeds (US-Speed) and transfer between datasets at two sensor locations (US-Location), whereas the supervised transfer scenario is for transfer between datasets at two sensor locations (S-Location). The proposed WD-DTL model achieved the best transfer accuracies with a 95.75% average score.
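For intuition about the Wasserstein-1 distance, the special case of two 1-D empirical distributions with the same number of samples has a closed form: sort both sample sets and average the absolute differences. The sketch below covers only this special case with made-up feature values; the reviewed method measures the distance through a domain discriminator on learned features, which is not reproduced here.

```python
def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two equally sized 1-D samples.

    In one dimension the optimal transport plan matches the sorted
    samples in order, so the cost is the mean absolute difference.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical scalar features from a source and a target domain.
source = [0.0, 1.0, 2.0]
target_shifted = [3.0, 1.0, 2.0]   # same shape, moved by +1 overall

same = wasserstein1_1d(source, source)
shifted = wasserstein1_1d(source, target_shifted)
```

A feature extractor that makes this distance small for source and target features produces representations that are hard to tell apart across domains.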
In [118], the authors proposed an unsupervised fault diagnosis of rolling bearings using a DNN based on GANs, in which they designed a categorical adversarial auto-encoder (CatAAE). By adding a classifier on the latent layer of the adversarial AE, they introduced a new model named CatAAE for unsupervised clustering. In this work, an auto-encoder is trained through an adversarial training process, which imposes a prior distribution on the latent coding space. The raw data of rolling bearings is in the time domain. To keep the consistency of vibration signals under the same faulty condition, multiple features, including time-domain and frequency-domain features, are extracted through FFT and statistical methods to describe the health status completely. Next, a classifier is trained to differentiate the prior distribution from the fake distribution by balancing the mutual information between instances and their predicted categorical class distribution. Although the encoder can extract features from the input data, the proposed method still relies on computed time-domain and frequency-domain features of the raw data. The number of labels used in this study is 10, and different clustering indicators were obtained when the model was run under different SNRs and across varying loads. Mixed time-frequency features are employed in the procedure to get better robustness under different environments. For the original signal, the clustering indicator was found to be 98.36, and for 0 dB SNR, it was found to be 94.35. Across the different loads, the average clustering indicator was found to be 90.15. SoftMax was used in the last layer as the classifier.
In [119], F. Zhou et al. addressed the problem of misclassification when the data is unbalanced and proposed a fault diagnosis algorithm based on a global optimization GAN, in which they combine the robust feature extraction capabilities of DNNs with the data generation capabilities of GANs. In this paper, the authors designed a new generator and discriminator of GAN to generate more discriminant fault samples using a scheme of global optimization. The generator is designed to generate the fault features extracted from a few fault samples via an auto-encoder instead of the fault data samples themselves. The training of the generator is guided by the fault feature and the fault diagnosis error instead of the statistical coincidence of traditional GAN. The discriminator is designed to filter out the unqualified generated samples, in the sense that qualified samples are helpful for more accurate fault diagnosis. The adversarial training mechanism is arranged by alternately optimizing the generator and the 2-level discriminator. Here, the network parameters of the generator, G, are fixed, and the authenticity discriminator, D2, is optimized. Then, a DNN-based fault diagnosis discriminator, D1, is established by using the original unbalanced dataset. The DNN fault diagnosis model can be updated by the generated samples to get a smaller fault diagnosis error. Finally, the generator, G, is optimized to make the generated samples more qualified for fault diagnosis. The global optimization is accomplished once the discriminators D1 and D2 and the generator G reach a Nash equilibrium, which increases the model's efficiency. For preprocessing the data, a sliding window is used. The diagnostic accuracy of this method for the three kinds of faults (inner-race, roller, and outer-race) with a 10:1 unbalance ratio is 94.58%, 96.85%, and 93.28%, respectively, which is far higher than that obtained using SAE.
Some of the other works which implement GAN for MFDD using the CWRU dataset are as tabulated in Table 7.
Moreover, some of the references for the published article using other datasets than the CWRU dataset are [117], [125], [126].

E. RECURRENT NEURAL NETWORKS
Recurrent neural networks, abbreviated as RNNs, are a class of neural networks that permit earlier outputs to be processed as inputs while having hidden states. They remember the past, and their decisions are influenced by what they have learned from the previous state. Thus, an RNN processes the input data recurrently [127]. Since the RNN suffers from the typical vanishing or exploding gradient issue caused by its nature, LSTM networks were introduced to solve the problem. LSTM, an abbreviation of Long Short-Term Memory, is widely used these days. LSTM networks are built from cells, and each cell takes as input the previous state and the current input. Not only in natural language processing, text recognition, or speech recognition, but RNNs are also widely used in machinery fault detection and diagnosis. Fig. 10 shows the basic architecture of the LSTM network.
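A single LSTM cell update can be sketched with scalar states to show how the previous state and the current input are combined. This follows the commonly used gate formulation (forget, input, and output gates plus a candidate value); the scalar simplification, the parameter names, and the parameter values are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM cell update for scalar input, hidden state, and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g        # keep part of the old memory, add new content
    h = o * math.tanh(c)          # expose part of the memory as the new state
    return h, c


# Hypothetical parameters; a trained network would learn these values.
params = {
    "wf": 0.5, "uf": 0.1, "bf": 0.0,
    "wi": 0.4, "ui": 0.2, "bi": 0.0,
    "wo": 0.3, "uo": 0.1, "bo": 0.0,
    "wg": 0.8, "ug": 0.3, "bg": 0.0,
}

h, c = 0.0, 0.0
states = []
for x in [0.2, -0.1, 0.4, 0.0]:   # a short stand-in for a vibration sequence
    h, c = lstm_step(x, h, c, params)
    states.append(h)
```

Because the cell state `c` is updated additively and scaled by the forget gate, gradients can flow across many time steps more easily than in a plain RNN.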
In [128], an improved bearing fault diagnosis method based on the combined unified structure of 1-D CNN and LSTM was proposed, whose input is the raw sampling signal without any pre-processing or traditional feature extraction. Here, the entire architecture is composed of a 1-D CNN layer, a max-pooling layer, an LSTM layer, and a SoftMax layer as the last-layer classifier. First, a part of the raw bearing data is used as the training dataset for the model, and when the number of iterations reaches a specific value, the training ends. Then, the rest of the signal data is input to the trained model as the testing dataset to verify the effectiveness of the model. The best testing accuracy reaches 99.6%.
Fault diagnosis of rolling bearings with RNN-based auto-encoders is proposed in [129], in which a Gated Recurrent Unit (GRU) based denoising auto-encoder forecasts multiple vibration values of the REB for the following period from those of the preceding period. These GRU-based non-linear predictive denoising auto-encoders (GRU-NP-DAEs) are trained, one for each fault pattern, to have a strong generalization ability. The trained GRU-NP-DAEs receive the multiple-input data, and the fault diagnosis result is determined by the GRU-NP-DAE that produces the minimum reconstruction error between the delayed input and the model output. The concept of classification accuracy is adopted to evaluate the feasibility of the proposed method for health condition detection and fault type classification. During the supervised learning process, the data destruction method is selected, and the length loss method is proposed to enhance the robustness of the models. The classification accuracy of GRU-NP-DAE is not lower than 96%, even when the SNR decreases to 1 dB.
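The decision rule in [129], one predictive model per fault pattern with the pattern whose model gives the smallest error selected, can be sketched as follows. The two one-line predictors stand in for trained GRU-NP-DAEs and are purely hypothetical, as are the pattern names, the window length, and the test signal.

```python
def prediction_error(model, signal, window=3):
    """Mean squared error when predicting each point from the preceding window."""
    errors = [
        (model(signal[t - window:t]) - signal[t]) ** 2
        for t in range(window, len(signal))
    ]
    return sum(errors) / len(errors)


def diagnose(models, signal, window=3):
    """Return the pattern whose model reproduces the signal best."""
    errors = {name: prediction_error(m, signal, window) for name, m in models.items()}
    return min(errors, key=errors.get), errors


# Stand-ins for one trained predictor per health condition (hypothetical).
models = {
    "pattern_a": lambda w: w[-1],    # expects a slowly varying signal
    "pattern_b": lambda w: -w[-1],   # expects a sign-alternating signal
}

signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
label, errors = diagnose(models, signal)
```

The same structure applies when the stand-in functions are replaced by trained sequence models: each model is only good at predicting the pattern it was trained on, so the smallest error identifies the pattern.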
Some other works on MFDD employing RNN or LSTM using the CWRU bearing dataset are tabulated in Table 8.
Moreover, [132]- [135] are some of the research carried out for MFD using a dataset other than the CWRU dataset.

F. REINFORCEMENT LEARNING
Reinforcement learning is an area of machine learning that deals with delayed reward and trains systems to learn from the consequences of their own choices. It resolves a kind of problem where decision making is sequential and the goal is long-term. Employed by machines and software to determine the best possible path or action to take in a situation, reinforcement learning is all about taking favorable actions to maximize the reward in that situation. With the help of simple reward feedback, known as the reinforcement signal, the machine learns which action is the best to perform. Reinforcement learning and supervised learning differ in that, in supervised learning, the training data come with the answer key, so the model is trained with the correct answers, whereas in reinforcement learning there is no answer, and the reinforcement agent chooses what to do to complete the given task. In the absence of a training dataset, it is bound to learn from its own experience [136], [137]. An overview of RL is shown in Fig. 11. From teaching machines to play Atari games to applications in the manufacturing industry, reinforcement learning is widely used these days. It is also being applied in machinery bearing fault detection and diagnosis.
In [138], Wang et al. proposed a neural network architecture automatic search method based on reinforcement learning for fault diagnosis of rolling bearings. The framework of the proposed method contains two components: a controller model and child models. The controller is a recurrent neural network and generates a series of actions; each action specifies a design choice to construct the child models for fault diagnosis. Then, the controller parameters are updated using the policy gradient method of reinforcement learning by maximizing the accuracy of the child models. They used the CWRU dataset for the experimental verification in which they used 12K sampling frequency data. There are 12 different working states in which one was the healthy state, and the remaining 11 were fault states. The vibration data measured at 1750 rpm in each working state were divided into 400 samples, where each sample contains 300 data points. The first 350 samples are used to construct training samples, and the remaining 50 are test samples.
Other important details are shown in Table 9.
The researchers used the RL system with kurtosis as an index in [139] for MFD. The dataset used for carrying out this research was other than the CWRU dataset.

IV. DEEP TRANSFER LEARNING AND DOMAIN ADAPTATION METHOD
The ML and DL algorithms work efficiently when the data is massive. However, a massive amount of data is not always available in real-world applications. So, to overcome this issue, transfer learning approaches were proposed. Transfer learning is an essential tool in machine learning, which solves the problem of insufficient training data [73]. Cross-domain adaptation, one of the most popular methods among the transfer learning approaches, applies to situations in which a model trained on a source distribution is used in the context of a different but related target distribution. In general, domain adaptation uses labeled data in one or more source domains to solve new tasks in a target domain [74]. By discovering the domain-invariant features, domain adaptation creates a knowledge adaptation from the source domain to the target domain, which mitigates the distribution discrepancy between the two domains [140]. Fig. 12 shows an overview of the transfer learning method. Recently, combinations of deep transfer learning with domain adaptation have been widely practiced. Some of the deep transfer learning and cross-domain adaptation strategies used in REB fault detection and diagnosis are briefly summarized in this section.
In [141], the authors proposed a novel cross-domain fault diagnosis method based on a deep GAN, in which fake samples are artificially generated for domain adaptation. Their model provided reliable cross-domain diagnosis results when the testing data are not available for training. The authors used the maximum mean discrepancy (MMD) to measure the distribution discrepancy. They designed a two-stage model: the first stage aims at artificially generating different classes of fault data in the target domain, and the cross-domain classifier is trained in the second stage. Three modules are adopted in the first stage, i.e., a feature extractor, generators, and a primary classifier. First, the feature extractor processes the data sample, which is the frequency-domain signal of the collected vibration data, to obtain a high-level representation of the input. Next, the primary classifier is used for diagnosis, which ensures the extraction of discriminative features under source supervision. The generators are used for data generation, with each generator trained explicitly to generate data of one fault class. After training in Stage 1, the parameters of the feature extractor are fixed and used to obtain high-level representations of the input data. The cross-domain classifier module takes the extracted representations as inputs and outputs the final diagnosis results. In the implementation, zero-padding is used to keep the feature map dimension from changing, and dropout with a rate of 0.5 is used to avoid overfitting. The sigmoid and rectified linear unit activation functions are used in the network.
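The MMD that [141] uses to measure the distribution discrepancy can be illustrated with a small NumPy sketch. The Gaussian kernel, the bandwidth `sigma`, the biased estimator, and the synthetic feature arrays are all assumptions made for illustration; the kernel and estimator used by the authors are not stated here.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=2.0):
    """Pairwise Gaussian kernel between rows of a (n, d) and b (m, d)."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def mmd2(source, target, sigma=2.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

# Synthetic feature vectors standing in for source/target representations.
rng = np.random.default_rng(0)
src = rng.standard_normal((200, 8))
tgt_same = rng.standard_normal((200, 8))           # same distribution
tgt_shifted = rng.standard_normal((200, 8)) + 2.0  # shifted distribution

mmd_same = mmd2(src, tgt_same)
mmd_shifted = mmd2(src, tgt_shifted)
```

In this toy setting the estimate is larger for the shifted target than for the target drawn from the same distribution, which is the behavior a discrepancy measure for domain adaptation is expected to show.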
In [142], the authors addressed the requirement for a large amount of training data in bearing fault detection and created a model that works efficiently with limited training data. They proposed a few-shot learning approach based on a siamese neural network built from a deep CNN with wide first-layer kernels (WDCNN). The model works by exploiting sample pairs of the same or different categories and measuring the distance between the feature vectors output by the two WDCNN twins to decide whether the inputs are similar or dissimilar. With a small training set of 90 samples, the method achieved an accuracy of 92.56%. They used the raw data directly without any preprocessing.
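The pair-based training idea described for [142] can be sketched in a few lines. This is only an illustration of how same/different pairs and a distance between twin outputs might be formed: the helper names, the L1 distance, and the toy labels are assumptions, and the WDCNN feature extractor itself is not reproduced.

```python
import itertools
import numpy as np

def make_pairs(labels):
    """Return (i, j, same) for every unordered pair of sample indices."""
    return [(i, j, int(labels[i] == labels[j]))
            for i, j in itertools.combinations(range(len(labels)), 2)]

def pair_distance(feat_a, feat_b):
    """L1 distance between two feature vectors (assumed metric)."""
    return float(np.sum(np.abs(feat_a - feat_b)))

# Toy labels for five samples from three hypothetical fault classes.
labels = [0, 0, 1, 1, 2]
pairs = make_pairs(labels)
n_same = sum(same for _, _, same in pairs)

# Toy feature vectors standing in for the outputs of the two twins.
dist = pair_distance(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 1.0]))
```

Pairing is what makes a small training set go further: 90 samples already yield 90 × 89 / 2 = 4005 unordered pairs.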
Similarly, [143] presents an automatic classification method based on deep learning in which faulty signals are clustered without human knowledge. First, the time-domain signals are converted to the frequency domain, and feature extraction and dimensionality reduction are performed. A sample dataset is then built in which each sample is given a random label, and it is used to train a DNN to obtain an initial classification. The classification results are assessed by testing sub-signals extracted from the raw data, and the sample labels are modified according to the evaluation result. The modified dataset is used to train the DNN a second time. After iterating the DNN training and testing, samples with characteristic faults are clustered into various classes. They used the PCA technique to validate their model.
In [144], the authors proposed a deep learning-based domain adaptation method for MFDD in which adversarial training is introduced for marginal domain fusion, and unsupervised parallel data are explored to achieve conditional distribution alignment with respect to different machine health conditions. Here, FFT is first applied to the temporal signals to obtain the frequency-domain information, which is then fed into the network as input. The source-domain supervised data are explored to extract discriminative features for different machine health conditions. The marginal data distributions of the source and target domains are drawn into the same subspace to generalize the fault diagnostic knowledge in the presence of a significant distribution discrepancy. The additional unsupervised parallel data are utilized to align the domain representation structures; sensors collect the supervised training data and the unsupervised testing data at different places on the machine. Four categories of input data are considered, i.e., labeled source-domain data, unlabeled target-domain testing data, and additional unlabeled parallel data in both the source and target domains. The number of classes for the CWRU bearing data was 10, and SoftMax was used in the classification layer. The authors claim that higher accuracies can be obtained when the unsupervised parallel data cover a more extensive range of classes. Generally, accuracies over 80% can be achieved if the parallel data contain more than 3 classes. When only a few parallel data are available, accuracies higher than 50% can still be obtained.
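The FFT preprocessing step mentioned for [144] (temporal signal to frequency-domain input) can be sketched with NumPy. The segment length, the amplitude normalization, and the synthetic 100 Hz test tone are assumptions for illustration; only the 12 kHz sampling rate comes from the CWRU data discussed in this review.

```python
import numpy as np

def fft_magnitude(segment):
    """One-sided magnitude spectrum of a real-valued 1-D segment."""
    spectrum = np.fft.rfft(segment)
    return np.abs(spectrum) / segment.size

fs = 12_000                               # sampling rate in Hz
t = np.arange(1200) / fs                  # 0.1 s segment (assumed length)
segment = np.sin(2.0 * np.pi * 100.0 * t) # synthetic tone, not real data

features = fft_magnitude(segment)
freqs = np.fft.rfftfreq(segment.size, d=1.0 / fs)
peak_hz = freqs[np.argmax(features)]
```

A 1200-point segment gives 601 one-sided frequency bins spaced 10 Hz apart, so the synthetic tone falls exactly on a bin and `peak_hz` recovers its frequency.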
Similarly, a transfer CNN (TCNN) based on ResNet-50, with a depth of 51 convolutional layers, is proposed for fault diagnosis of bearings in [48]. TCNN applies ResNet-50, trained on ImageNet, as a feature extractor. First, a signal-to-image method was developed to convert time-domain fault signals into RGB images as the input of ResNet-50. Then, a new structure of TCNN(ResNet-50) was designed. Finally, the proposed model was tested on three datasets: the bearing damage dataset provided by the KAT datacenter, the motor bearing dataset provided by CWRU, and a self-priming centrifugal pump dataset. The number of classes used for the CWRU dataset was 10, and SoftMax was used in the classification layer. ReLU was used as the activation function, and BN was adopted after each convolution layer. The prediction accuracy of TCNN(ResNet-50) on the CWRU dataset was found to be 99.99%. The authors compared their model with VGG-16, VGG-19, and Inception-v3, and it outperformed all of them.
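The idea of turning a time-domain segment into an RGB image for an ImageNet-pretrained network can be sketched as below. The exact conversion used in [48] is not described in this review, so everything here is an assumption for illustration: the function name `signal_to_rgb`, the 64 × 64 size, the min-max scaling to 0-255, and the replication of one gray channel into three.

```python
import numpy as np

def signal_to_rgb(segment, side=64):
    """Reshape a 1-D segment into a (side, side, 3) uint8 image.

    Assumed scheme: take the first side*side points, min-max scale them
    to 0..255, and copy the gray image into three identical channels.
    """
    n_pixels = side * side
    if segment.size < n_pixels:
        raise ValueError(f"need at least {n_pixels} points")
    x = segment[:n_pixels].astype(np.float64)
    lo, hi = x.min(), x.max()
    scaled = np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)
    gray = np.round(scaled * 255.0).astype(np.uint8).reshape(side, side)
    return np.stack([gray, gray, gray], axis=-1)

# Synthetic segment standing in for a vibration record.
rng = np.random.default_rng(0)
image = signal_to_rgb(rng.standard_normal(5000))
```

The resulting array has the channel-last layout expected by most image pipelines; any resizing to the network's input resolution would be a separate step.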
Other works that apply transfer learning or domain adaptation strategies to the CWRU bearing dataset for machinery fault detection are tabulated in Table 10. Some works employing DL-based transfer learning and domain-adaptation approaches on datasets other than the CWRU dataset are [145]-[147].

V. WEAKNESSES OF CWRU BEARING DATASET
The CWRU bearing dataset is long, varied, and complex. The data is collected from multiple sensors placed at different locations. The data files have different lengths, which are not integer multiples of 2. Most of the records exhibit non-classical fault features; relatively few show the classical characteristics of the specified bearing fault type. Many of the records also exhibit non-stationary characteristics. Not all the faults can be recognized correctly: the data records range from very easily diagnosable to not diagnosable [23]. The variance is high because the bands of measurements in the same file are dissimilar. Additionally, not all the frequency components are regular; some are occasionally large or small [143]. Consequently, using preprocessors such as FFT or other signal processing techniques for feature extraction can be difficult. The main challenge lies in feature selection: which features to choose and which to ignore.
Some records are also heavily affected by the data acquisition process, i.e., some records are corrupted with patches of electrical noise, some DE and FE measurements are identical except for a scaling factor, and some records appear to be clipped in sections. Moreover, the bearing test rig assembly appeared to affect the detection and diagnosis results more than the fault itself, with an indication of mechanical looseness detected in many of the records [23]. Also, the CWRU dataset consists of data recorded under fixed speed and load conditions. A significant difference can be seen within the same bearing fault class when the bearing is operated under varying loads or speeds.
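One of the acquisition issues noted above, DE and FE measurements that are identical except for a scaling factor, can be screened for before training. The following is a minimal sketch under stated assumptions: the function name `is_scaled_copy`, the least-squares fit of a single scalar, the tolerance, and the synthetic channels are illustrative and are not taken from [23].

```python
import numpy as np

def is_scaled_copy(a, b, rel_tol=1e-6):
    """True if b is (numerically) a * scale for one scalar scale."""
    norm_a2 = float(np.dot(a, a))
    norm_b = float(np.linalg.norm(b))
    if norm_a2 == 0.0 or norm_b == 0.0:
        return False  # a constant-zero channel is treated as no match
    scale = float(np.dot(a, b)) / norm_a2   # least-squares scalar fit
    residual = float(np.linalg.norm(b - scale * a))
    return residual <= rel_tol * norm_b

# Synthetic channels standing in for DE and FE measurements.
rng = np.random.default_rng(0)
de = rng.standard_normal(1000)
fe_scaled = 0.37 * de                    # duplicate up to a scale factor
fe_independent = rng.standard_normal(1000)

flag_scaled = is_scaled_copy(de, fe_scaled)
flag_independent = is_scaled_copy(de, fe_independent)
```

A record flagged this way carries no independent information in its second channel, so treating the two channels as separate inputs would be misleading.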

VI. CHALLENGES USING DL MODELS FOR MFDD
The mechanical vibration signal is one of the most essential and abundant sources of information for a proper understanding of the phenomena related to bearing effect. Moreover, because of its numerous advantages, the use of DL algorithms for bearing fault detection has been practiced widely. However, DL algorithms have some limitations. The challenges of a DL model are related to its architecture and training process. Even though there is extensive published literature on DL implementations in FDD systems, these implementations require prior knowledge regarding their architecture [153]. That is why many industrial applications do not generally prefer this black-box tool. Some of the challenges of using DL techniques for machinery fault detection are listed below:
• Deep learning models perform well when fed with enormous amounts of data. It is therefore a great challenge to collect the vast amounts of faulty and healthy machinery data that DL algorithms need to work effectively in machinery fault detection and diagnosis.
• Training DL models is computationally very expensive. The most sophisticated models take weeks to train on high-performance GPU machines. A large amount of memory is also required to train the model [154].
• DL works poorly when the dataset is unbalanced. The model becomes biased toward the classes with more data samples and tends to misclassify the under-represented ones. Furthermore, obtaining a balanced bearing dataset is itself a significant challenge. Other approaches should therefore be used to train DL models when the data is unbalanced.
• The working conditions of mechanical systems are very complicated and keep changing according to production requirements. It is therefore quite unrealistic to collect and label enough training samples for DL algorithms to work on bearing fault detection and diagnosis.
• Industrial systems, especially critical ones, are not allowed to run into faulty states because of the consequences. In addition, most electro-mechanical failures occur slowly and follow a degradation path, so the degradation of a system toward failure might take months or even years. Both factors make it challenging to collect the related datasets.
• Vibration signals need expensive sensors. Moreover, to obtain accurate data, the sensors must be mounted tightly on the machine, for which experts may be needed [9]. Otherwise, the data obtained may not be of good quality, which will result in poor performance of the DL models.
• DL architectures like auto-encoders require a pretraining stage and become ineffective if errors are present in the first layers. Such errors may cause the network to learn to reconstruct the average of the training data. CNNs, on the other hand, rely heavily on labeled data and may require many layers to find the entire hierarchy. DBNs do not account for the 2-D structure of an input image, which may significantly affect their performance and applicability when 2-D images are used as input. The steps towards further optimization of the network based on the maximum likelihood training approximation are also unclear in DBNs [155]. Furthermore, finding the Nash equilibrium in GAN training is itself a challenging process. RNNs, on the other hand, encounter the problem of vanishing/exploding gradients [2]. Moreover, RL needs constant supervision of the subject, and it also needs to know where actions lead in order to evaluate actions and make choices.
• Much deeper networks may face difficulties such as exploding or vanishing gradients and degradation in the training process, and accuracy drops when the depth of the network exceeds a certain limit [156]. Network optimization is therefore another challenge with deep network architectures.
• The data quality may not always be good. Sometimes the data is of poor quality and redundant too. Although DL models cope well with noisy data, they still struggle to learn from poor-quality data, which remains a challenge for DL algorithms in MFDD [38].
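For the unbalanced-data challenge listed above, one commonly used option is to weight each class inversely to its frequency in the loss function. The sketch below is an assumption-laden illustration rather than a method attributed to any reviewed work: the function name `inverse_frequency_weights`, the weighting formula, and the toy label counts are all chosen for the example.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each class by n_total / (n_classes * n_class_samples)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy label set: 90 healthy samples (class 0) and 10 faulty (class 1).
labels = np.array([0] * 90 + [1] * 10)
weights = inverse_frequency_weights(labels)
```

With this toy split, the minority class receives a weight of 5.0 and the majority class a weight of about 0.56, so errors on the rare fault class count more during training.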

VII. RECOMMENDATIONS
For the successful implementation of DL algorithms in machinery fault detection and diagnosis using the CWRU bearing dataset, the authors make the following suggestions:
• Before developing any model or algorithm, we highly recommend that future researchers study and analyze the CWRU dataset and the benchmark study thoroughly.
• CWRU is a varied and complex database. A well-designed classification method and algorithm is needed to classify such a varied and complex dataset. For a DL model to work efficiently, there should be enough data, so one can apply data augmentation techniques before training the model. Accessible data augmentation techniques such as GAN, added Gaussian noise, masking noise, signal translation, amplitude shifting, time-stretching, overlapping, and so on can be used.
• The CWRU dataset is also used as a validation dataset for the accuracy of a model. When data from sources other than the CWRU dataset is used, the accuracy may drop because of the influence of noise or variation in motor speed. Proper measures should be taken to account for this.
• Much of the research has been carried out on balanced datasets, which are comparatively easier to use than imbalanced ones. Imbalanced datasets, however, are characteristic of real-world applications. There are many approaches for dealing with imbalanced data, and the authors recommend that readers study these methods thoroughly. Traditional techniques simply could not identify all the features of imbalanced data. A better approach would be to develop a more generic organizing principle that can accommodate all possible types, rather than individual approaches that deal with specific types one by one.
• Condition monitoring with vibration signals is itself a challenging task. Vibration signals need expensive sensors. Moreover, to obtain accurate data, the sensors must be mounted tightly on the machine, for which experts may be needed [9]. So, if researchers are interested in working with their own data, a proper procedure must be applied when collecting the vibration data.
• Faults in bearings often manifest themselves at high frequencies [23]. If the researchers are interested in generating the fault samples, the use of a high sampling rate is recommended.
• The authors recommend that bearing fault researchers also follow a systematic and comprehensive approach to documentation, which will be helpful for future researchers.
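Several of the augmentation techniques named in the recommendations above (added Gaussian noise, masking noise, signal translation, amplitude shifting, and overlapping) can be sketched with NumPy. The function names, default parameters, and specific readings are assumptions: translation is implemented as a circular shift, amplitude shifting as a constant offset, and the signal is synthetic.

```python
import numpy as np

def add_gaussian_noise(x, rng, std=0.05):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return x + rng.normal(0.0, std, size=x.shape)

def mask_noise(x, rng, fraction=0.1):
    """Set a random fraction of the points to zero."""
    out = x.copy()
    n_mask = int(round(fraction * x.size))
    idx = rng.choice(x.size, size=n_mask, replace=False)
    out[idx] = 0.0
    return out

def translate(x, shift):
    """Circularly shift the signal by `shift` points (assumed reading)."""
    return np.roll(x, shift)

def shift_amplitude(x, offset):
    """Add a constant offset to the signal (assumed reading)."""
    return x + offset

def overlapping_windows(signal, length, stride):
    """Slice a record into windows that overlap when stride < length."""
    if signal.size < length:
        raise ValueError("signal shorter than window length")
    n = (signal.size - length) // stride + 1
    return np.stack([signal[i * stride:i * stride + length]
                     for i in range(n)])

# Synthetic record standing in for a vibration signal.
rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
noisy = add_gaussian_noise(signal, rng)
masked = mask_noise(signal, rng)
shifted = translate(signal, 5)
windows = overlapping_windows(signal, length=300, stride=100)
```

In this example, a stride of 100 with a window length of 300 turns one 1000-point record into 8 samples, whereas non-overlapping slicing would give only 3.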

VIII. CONCLUSION
In the age of Industry 4.0, deep learning algorithms have attracted increasing attention for several research applications. Recently, DL models have been broadly employed in machinery fault detection and diagnosis systems. With continued rapid advances in computer technology, DL models will continue to be robust and attractive tools in machinery fault detection and diagnosis systems. This paper has attempted to summarize and review the recent works on MFDD that apply deep learning algorithms to the CWRU dataset. We tried our best to cover in detail the recent works that use the DE defect bearing data of the CWRU dataset with DL algorithms. The challenges in dealing with vibration data and in employing DL-based models, together with the recommendations from the authors, are listed, which we believe will help future researchers.