Health Indicator for Low-Speed Axial Bearings Using Variational Autoencoders

This paper proposes a method for calculating a health indicator (HI) for low-speed axial rolling element bearing (REB) health assessment. The method uses the latent representation obtained by variational inference with Variational Autoencoders (VAEs), each trained on one speed reference in the dataset. Versatility is added by conditioning on the speed, which extends the VAE to a conditional VAE (CVAE) and incorporates all speeds in a single model. Within the framework, the coefficients of autoregressive (AR) models are used as features. The dimensionality reduction inherent in the proposed method lowers the need for expert knowledge to design good condition indicators. Moreover, the methodology allows the probability of false alarm (PFA) to be set when new data points are encoded to the latent variable space using the trained model. The effectiveness of the proposed method is validated on two datasets: one from a workshop test of an offshore drilling machine and one from an in-house test rig for axial bearings. In both datasets, the HI exceeds the warning and alarm levels set with a PFA of 10^-6, and the method is most effective at lower shaft speeds.


I. INTRODUCTION
Rolling element bearings (REBs) are widely used in heavy industrial machinery such as offshore drilling machines, wind turbines, and paper mills. A defect in such bearings might result in a catastrophic failure in the industrial system. Therefore, condition monitoring (CM) for REBs is important to avoid unplanned downtime and production loss in heavy industry. The majority of bearing condition monitoring techniques focus on detecting the presence and development of localized damage in bearing raceways or rolling elements [1]-[3]. CM of low-speed machinery, with a shaft speed below 10 Hz [4], is more challenging. The energy associated with faults is then smaller, resulting in a low signal-to-noise ratio (SNR). This requires more sensitive sensors and development of advanced signal processing methods to extract fault signatures. Operating conditions tend to be less stationary at lower speeds [4], thus resampling to angular domain is necessary in low-speed applications [5], [6]. Health conditions of large bearings at low speed are usually observed via acoustic emission or vibration measurements [5], [7]-[11]. Cyclostationary methods [12]-[14], wavelet denoising and filtering [15], [16], and empirical mode decomposition (EMD) [17] have all been successful in low-speed bearing fault detection. Data-driven fault diagnosis methods based on machine learning have also been intensively developed in recent years [18]. Fault classifiers based on decision trees (DT) [19], [20], support vector machines (SVM) [21], [22], k-nearest neighbors (k-NN) [23], [24], convolutional neural networks (CNN) [25]-[27] and deep belief networks (DBN) [28], [29] have been applied to bearing fault detection. All of the mentioned machine learning based methods require historical failure data for training, which is hard to obtain in industry.

The associate editor coordinating the review of this manuscript and approving it for publication was Hong-Mei Zhang.
In addition, the authors could not identify previous research dealing with faults on axial bearings, where a characteristic fault frequency might not exist or is inconsistent in spectra. This work aims to develop an anomaly detection method without using historical failure data. Tapered axial roller bearings, e.g. in drilling machines from the offshore industry, have relative sliding motion in the rib-roller contact area. Low speed makes this area particularly susceptible to wear. In [30], wear on the roller ends was observed in a tapered axial bearing, as shown in Fig. 1.
However, no characteristic frequency component associated with defects, i.e. roller frequency, was observed on the axial bearing during tests. This suggests that diagnosis methods based on detection of defect characteristic frequencies alone are ineffective in detecting wear in large and slow axial bearings. Identification of this defect on axial bearings currently relies on offline monitoring methods such as lubricant analysis and visual inspection, combined with precautionary maintenance actions [25], [30]. This practice requires interruptions of production and may allow failure to progress inconspicuously between inspections. Therefore, development of online, non-intrusive monitoring methods is very important to facilitate condition based maintenance (CBM) for large axial bearings in heavy industry. Since data from a healthy state is easier to obtain than data from a damaged state, a procedure that determines whether the observed bearing is normal, based on prior knowledge of the healthy behavior of the machine, would avoid the need for failure data.
References [31], [32] proposed a method for setting health thresholds based on healthy operating characteristics, which allows the probability of false alarm (P_FA) to be controlled. A whitening transformation was applied to a set of correlated condition indicators (CIs) with Gaussian or Rayleigh distributions. These CIs were then used to calculate a health indicator (HI) with a known probability density function (PDF) and cumulative distribution function (CDF). The HI is normalized by the inverse CDF evaluated at (1 − P_FA), and optionally multiplied by a warning factor w < 1. In this case, let HI_0 denote an observation from a healthy machine. The probability of observing HI_0 above the warning factor is then equal to P_FA, as shown in (1). The consequences of failures and false alarms must be considered when setting the P_FA threshold. The number of inferences to be made must also be considered, since multiple testing increases the risk of false positives simply by chance [33].
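The display for (1) can be sketched from the description above. The symbol CI_0 (the scalar statistic computed from the whitened CIs of a healthy machine) and the use of F for its CDF are notation assumed here for illustration, not taken from [31]:

```latex
\mathrm{HI}_0 = \frac{w \, \mathrm{CI}_0}{F^{-1}(1 - P_{FA})},
\qquad
P(\mathrm{HI}_0 > w)
  = P\!\left( \mathrm{CI}_0 > F^{-1}(1 - P_{FA}) \right)
  = P_{FA}
\tag{1}
```

Under this reading, the normalization maps the (1 − P_FA) quantile of the healthy statistic onto the level w, so a healthy observation crosses that level with probability P_FA by construction.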
However, the method in [31] is only effective if the CIs are well selected and have known probability distributions. The overall goal of conventional approaches is to perform health assessment with a statistical foundation, based on a potentially large set of observed variables. Due to the curse of dimensionality, also known as Hughes' phenomenon [34], a single indicator might "drown" in a high-dimensional feature space, which reduces the accuracy of the model. Thus, a method is required that reduces the dimensionality of the features while retaining most of the information. To implement the HI threshold setting, the features are also required to be independent variables. Principal Component Analysis (PCA) can transform a set of variables to linearly uncorrelated features with decreasing contribution to the variance, but it does not account for nonlinear dependencies. Machine learning (ML) algorithms can be an alternative solution since they can capture complex dependencies among the observed variables. Autoencoders have been used successfully for dimensionality reduction in fault detection and classification of rotating machinery [35]-[38], but they lack a probabilistic latent representation. Generative models are capable of estimating complicated PDFs of given data, and can generate new samples that follow the same distribution as the training data. In [39], it was shown that sequential training of restricted Boltzmann machines could discover hidden dependencies between observed variables and a sparse representation. However, training such networks typically requires an additional statistical method, e.g. Markov Chain Monte Carlo (MCMC), which adds computational burden.
To achieve dimensionality reduction and reduce computational burden, this work uses a combination of a Variational Autoencoder (VAE) [40] and a Generative Adversarial Network (GAN) [41], which is similar to the Adversarial Autoencoder (AAE) [42]. The VAE performs inference of variational parameters using neural networks in an encoder-decoder structure by minimizing the reconstruction error and the Kullback-Leibler divergence (D_KL) [43] between an encoded sample and a standard Gaussian distribution, which is equivalent to maximizing the evidence lower bound (ELBO). This objective can be optimized with gradient descent algorithms through the "reparameterization trick" [40]. These generative models allow a distribution to be imposed on the latent variables. In [42], the latent distribution in AAEs appears to follow the target distribution more closely, which is desirable for the purpose of a HI. However, the adversarial training in GANs and AAEs is often unstable [44]. This problem was also observed in experiments with AAEs while developing the proposed method. VAEs have been used in ball bearing fault classification by using the latent variables for each data point as input to a classifier [45]. The proposed approach instead utilizes the aggregated distribution of healthy conditions in the latent space of a VAE to calculate a HI for new observations.
The remainder of the paper is organized as follows: In section II, the network architecture, training procedure and HI calculation are described. Section III details the data acquisition and pre-processing. Results from two different datasets are presented in section IV. Section V provides conclusions and discussions.

II. METHODOLOGY
This section presents the approach for calculating a bearing health indicator, utilizing the latent variables in a VAE. The calculation of a HI limits the selection of CIs to those following known distributions as described in [31] for Gaussian and Rayleigh distributions. It also requires the user to pre-select suitable CIs based on domain knowledge. The proposed method performs unsupervised dimensionality reduction from a set of input features, while simultaneously imposing a Gaussian distribution on the latent variables. This section provides a review of the network components, the loss functions and training algorithm. The model was implemented in Python using TensorFlow r1.12 [46].

A. NETWORK ARCHITECTURE AND LOSSES
The network architecture is shown in Fig. 2. An encoder (red) and a decoder (green) are connected by the latent representation (yellow). Let x be the feature input vector and z be the latent variable vector. In this work, the coefficients of an autoregressive model are used as features. The encoder consists of a fully connected layer of size 1024 with weights, biases, an exponential linear unit (ELU) activation function, and 50 % dropout. The encoder output consists of two vectors, containing the parameters of the latent representation for each data point. Let J be the dimension of the latent space. The latent variables are constrained to have a Gaussian distribution with diagonal covariance matrix, so the encoder outputs a vector containing the means, µ, and a vector containing the log of the variances, log(σ^2), each of length J. Note that these parameters are for the individual data points, not the aggregated latent distribution q(z). Utilizing the reparameterization trick from [40], samples from a white noise vector are used to obtain a random sample z from the latent representation while still allowing gradients to flow through the network. The decoder has the same architecture as the encoder, with a fully connected hidden layer, ELU activation, and 50 % dropout. The desired output is a reconstruction of the input, as in a normal autoencoder. Combining these parts of the network results in the VAE. Originally, the VAE was developed as a generative model for producing reconstructions similar to the input by sampling from a given prior distribution p(z). The connection between data and p(z) is in general not known and must be approximated. Let the training data distribution be x ∼ p_d(x), and the VAE output x̂ ∼ p(x). Further, q_φ(z|x) and p_θ(x|z) are the encoding and decoding distributions of the encoder and decoder networks. The subscripts φ and θ denote the encoder and decoder variables (weights and biases), respectively.
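The reparameterization step described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than TensorFlow, and the function name `reparameterize` is hypothetical; it is meant to show the arithmetic, not the implementation used in this work.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    mu and log_var are the two encoder outputs for one data point,
    each of length J. The randomness enters only through eps, so z is a
    deterministic, differentiable function of mu and log_var.
    """
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # log(sigma^2) -> sigma
        eps = rng.gauss(0.0, 1.0)    # white-noise sample
        z.append(m + sigma * eps)
    return z

random.seed(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0])  # one latent sample with J = 2
```

Because the noise is sampled outside the network, gradients with respect to µ and log(σ^2) are well defined in an automatic differentiation framework.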
Thus, the aggregated posterior distribution of the latent variable, z ∼ q(z), is defined as in (2). To be utilized in the HI calculation, q(z) must approximate the desired prior p(z).
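For reference, the aggregated posterior is conventionally defined, following [42], by marginalizing the encoding distribution over the data distribution. Assuming that convention applies to (2), the expression would read:

```latex
q(z) = \int_{x} q_{\phi}(z \mid x) \, p_d(x) \, \mathrm{d}x \tag{2}
```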
To ensure that the latent representation contains useful information about the input data, the encoder and decoder are trained to minimize the reconstruction loss function L_R, as in (3). L_R is the mean square error between each feature x_{i,j} and its reconstruction x̂_{i,j} over a minibatch, x_M, of size M. The number of features per datapoint is denoted N.
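A sketch of (3) that is consistent with this description, with the squared error summed over the N features of a datapoint and averaged over the M datapoints in the minibatch, would be:

```latex
L_R = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( x_{i,j} - \hat{x}_{i,j} \right)^2 \tag{3}
```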
This encourages similar input data to cluster in latent space, while dissimilar data are separated. Note that the square error is summed over a datapoint and averaged over the minibatch. This gives more weight to reconstruction error, which helps avoid mode collapse, i.e. the latent vector converging to a Gaussian that does not carry information. While reducing L_R provides a good reconstruction, the aggregated latent distribution will not by itself take a Gaussian form. To make the latent distribution approximate the desired prior, the KL divergence is introduced as a regularization on the encoder variables φ. Given the assumption of a diagonal covariance matrix and a Gaussian prior, the KL divergence for a data point can be calculated in closed form. The combined KL loss over a minibatch, L_KL, is then calculated as in (4).
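For a diagonal Gaussian posterior and a standard Gaussian prior, the closed form given in [40] leads to the following sketch of (4). The averaging over the minibatch is an assumption made here, chosen to match the later description of this loss as the mean KL divergence of each datapoint:

```latex
L_{KL} = \frac{1}{M} \sum_{i=1}^{M} \left[ -\frac{1}{2} \sum_{j=1}^{J}
  \left( 1 + \log \sigma_{i,j}^2 - \mu_{i,j}^2 - \sigma_{i,j}^2 \right) \right] \tag{4}
```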
The objective function to be minimized, L_VAE, is the sum of L_R and L_KL, as given in (5).
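The three loss terms can be sketched in plain Python for a minibatch stored as nested lists. This is an illustration under stated assumptions: the function names are hypothetical, the KL term is averaged over the minibatch, and the actual implementation in this work used TensorFlow.

```python
import math

def reconstruction_loss(x_batch, x_hat_batch):
    """Squared error summed over features, averaged over the minibatch."""
    total = 0.0
    for x, x_hat in zip(x_batch, x_hat_batch):
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(x_batch)

def kl_loss(mu_batch, log_var_batch):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), averaged over the minibatch."""
    total = 0.0
    for mu, log_var in zip(mu_batch, log_var_batch):
        total += -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv)
                            for m, lv in zip(mu, log_var))
    return total / len(mu_batch)

def vae_loss(x_batch, x_hat_batch, mu_batch, log_var_batch):
    """Total objective: reconstruction term plus KL regularizer."""
    return (reconstruction_loss(x_batch, x_hat_batch)
            + kl_loss(mu_batch, log_var_batch))
```

A posterior that already equals the prior (µ = 0, log σ^2 = 0) gives a KL term of zero, so only the reconstruction error contributes in that case.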
Pseudo-code for the training procedure is given in Algorithm 1.

Algorithm 1 Training Algorithm
    φ, θ ←                              ▷ Initialize parameters
    repeat
        Shuffle training dataset
        repeat
            x_M ←                       ▷ Get minibatch from the training dataset
            g ← ∇L_VAE(φ, θ; x_M, ε)    ▷ Calculate gradients
            θ, φ ←                      ▷ Update encoder/decoder parameters
        until Epoch is completed
    until Total number of epochs is completed
    return φ, θ

Training was repeated 5 times with different random seeds for weight initialization and shuffling. Hyperparameters used in the experiments are given in Table 1. Model weight updates are performed using the Adam [47] optimizer with cosine decay of the learning rate, as this has been shown to improve Adam performance [48]. The initial learning rate was set to 10^-4, which decays to 10^-6 over the training epochs. Experiments with higher learning rates resulted in unstable convergence or divergence of the training loss. A dropout rate of 50 % was implemented to reduce overfitting [49]. The ELU activation function used in the encoder and decoder hidden layers has been shown to outperform other activation functions both in CNNs and autoencoders [50]. As suggested in [40], the minibatch size is set to 100. Early stopping was not implemented, but could speed up the training process, as the training loss converged well before the final epoch.
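The learning rate schedule can be sketched as a half-cosine between the two stated endpoints. The exact form of the schedule is not specified in the text, so the function below, including its name and the per-step granularity, is an assumption for illustration.

```python
import math

def cosine_decay_lr(step, total_steps, lr_start=1e-4, lr_end=1e-6):
    """Learning rate after `step` of `total_steps` updates, following a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return lr_end + (lr_start - lr_end) * cosine

# Example: learning rate at the start, midpoint and end of training.
schedule = [cosine_decay_lr(s, 1000) for s in (0, 500, 1000)]
```

The decay is slow at the beginning and end of training and fastest at the midpoint, where the rate is the average of the two endpoints.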
The proposed settings allowed the latent variable distribution to converge to a Gaussian distribution while avoiding mode collapse, which suggests that the network configuration is suitable for this particular application. Further experiments to optimize layer size and latent dimension have not been performed. For another dataset with a different AR input size, it may be necessary to change the hidden layer size if the latent distribution does not converge to a Gaussian distribution or suffers from mode collapse.
Reconstruction loss constrains the network to ensure that useful information is captured when forming Gaussian latent distributions. Limiting the latent dimension forces the network to infer the underlying Gaussian variables that best describe the observed data, at the cost of overall reconstruction performance. Improved reconstruction of input data could be achieved by increasing the latent dimension as well as adding more hidden layers and increasing their size. It is important to note that optimal reconstruction in itself is not the main purpose of the network. Increasing the number of latent variables could, however, reduce the HI sensitivity to faults, as the HI is calculated as the norm of the latent vector.

B. CONDITIONAL VARIATIONAL AUTOENCODER
With the described approach, a separate VAE must be trained for each speed. For machines with multiple operating conditions, this is impractical. Therefore, a conditional VAE (CVAE) is trained for each dataset. CVAEs utilize the same network structure and loss function L_VAE as VAEs, but can be conditioned on additional information, such as speed. For each datapoint, the speed information is a categorical variable, one-hot encoded into a conditioning vector c. For example, the speed of 100 rpm in dataset 1 is encoded to c_100 = [0, 1, 0, 0, 0], while the speed of 60 rpm in dataset 2 is encoded to c_60 = [0, 1]. As the AR model order differs between speeds, x is zero-padded to the largest model order p. VAE training datasets consist of data from a single speed, while the CVAE uses data from all speeds. Except for these differences, VAEs and CVAEs follow an identical training procedure.
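The conditioning vector and the zero-padding can be sketched as follows. The helper names are hypothetical, and how c is combined with x inside the network is not shown because it is not detailed in the text.

```python
def one_hot_speed(speed_rpm, speeds):
    """One-hot encode a speed given the ordered list of speeds in the dataset."""
    if speed_rpm not in speeds:
        raise ValueError(f"unknown speed: {speed_rpm}")
    return [1 if s == speed_rpm else 0 for s in speeds]

def zero_pad(features, max_order):
    """Right-pad an AR coefficient vector with zeros up to the largest model order."""
    if len(features) > max_order:
        raise ValueError("feature vector longer than max_order")
    return list(features) + [0.0] * (max_order - len(features))

DS1_SPEEDS = [50, 100, 150, 200, 250]  # rpm, as listed for dataset 1
DS2_SPEEDS = [30, 60]                  # rpm, as listed for dataset 2

c_100 = one_hot_speed(100, DS1_SPEEDS)  # [0, 1, 0, 0, 0]
c_60 = one_hot_speed(60, DS2_SPEEDS)    # [0, 1]
x_padded = zero_pad([0.4, -0.1, 0.05], max_order=5)  # illustrative coefficients
```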

C. HEALTH INDICATOR
A methodology for threshold setting given CIs with Rayleigh or Gaussian distributions is proposed in [31]. In this work, a Gaussian distribution is chosen for the latent variables. To verify that q(z) approximates the standard normal distribution N(0, I), the Kullback-Leibler divergence (D_KL) was calculated for the aggregated posterior as given in (6).
Here, Σ is the covariance matrix of z, µ is a vector containing the mean values of z, and J is the number of latent variables.
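With these symbols, the standard closed form for the KL divergence between a Gaussian with mean µ and covariance Σ and the standard normal N(0, I) is presumably the expression intended in (6). It treats the aggregated posterior as Gaussian with the sample mean and covariance of z, which is an assumption of this sketch:

```latex
D_{KL}\!\left( q(z) \,\|\, \mathcal{N}(0, I) \right)
  = \frac{1}{2} \left( \operatorname{tr}(\Sigma) + \mu^{\top}\mu - J - \ln \det \Sigma \right) \tag{6}
```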
The norm of J independent standard Gaussian variables follows a χ distribution with ν = J degrees of freedom. Let F(·) denote the CDF of this χ distribution. The HI is the norm of the latent vector, normalized with a factor that is a function of the P_FA, as shown in (7).
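A minimal sketch of the HI calculation is given below, assuming that (7) has the form HI = ||z|| / F^{-1}(1 − P_FA) with no warning factor applied, so that the alarm level is HI = 1. The χ CDF is evaluated with a series for the regularized lower incomplete gamma function and inverted by bisection, so that only the standard library is needed; a statistics library would normally be used instead. All function names are hypothetical.

```python
import math

def chi_cdf(r, dof):
    """CDF of the chi distribution, F(r) = P(dof/2, r^2/2) (regularized lower gamma)."""
    if r <= 0.0:
        return 0.0
    a, x = dof / 2.0, r * r / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-17 * total:       # power series for the incomplete gamma
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi_ppf(p, dof, tol=1e-12):
    """Inverse CDF of the chi distribution by bisection."""
    lo, hi = 0.0, 1.0
    while chi_cdf(hi, dof) < p:       # expand the bracket until it contains p
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def health_indicator(z, p_fa=1e-6):
    """Norm of the latent vector, scaled so that HI = 1 at the (1 - p_fa) quantile."""
    norm = math.sqrt(sum(v * v for v in z))
    return norm / chi_ppf(1.0 - p_fa, len(z))
```

Under these assumptions, a latent vector drawn from N(0, I) exceeds HI = 1 with probability P_FA, and a warning level such as 0.75 is simply a lower threshold on the same scale.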

III. EXPERIMENTAL SETUP
The proposed algorithm is tested on data from two experiments: vibration data from a workshop test of an offshore drilling machine, and acoustic emission (AE) data from an in-house test rig for axial bearings. A further description of the experimental setup is given in the following sections.

A. DATASET 1: OFFSHORE DRILLING MACHINE
Dataset 1 (DS1) was collected from an offshore drilling machine taken out of operation for maintenance, as described in [30]. A schematic drawing of the setup is shown in Fig. 3. Data is collected from an accelerometer mounted in the axial location. Data was sampled at 102.4 kHz and decimated to 81.92 kHz. The axial bearing showed signs of roller end wear, as shown in Fig. 1. Data was first recorded using a healthy bearing, denoted damage level 0 (DL 0). Reassembling the machine with a slightly damaged bearing then resulted in a change of the vibration characteristics and a reduction in root mean square (RMS) [30]. Distinguishing this change from any fault-induced change is not possible. Thus, the slightly damaged condition is selected as the baseline condition (DL 1) for training data. Additional damage in the form of indentations from a carbide tip tool was applied to one of the roller ends, producing data at DL 2. For data at DL 3, the bearing was further damaged and also run under poor lubricating conditions. Data was recorded at 50, 100, 150, 200 and 250 rpm. At 50 rpm, only data from DL 1 and DL 3 was recorded. The machine was running unloaded, subject only to its own weight. A quantitative measurement of damage is not available, but a degradation resulting in a measurable change is expected. However, previous analysis of the vibration signal was not successful in detecting any clear indication of the damage [30]. Damage to a roller was expected to cause amplitude modulation at the roller frequency, but as shown in Fig. 4, no peak was observed at either one or two times the roller frequency in the envelope spectrum.
Segments corresponding to approximately 1 revolution are used for calculating the features. To increase the number of data points, an overlap of 75 % is applied. The autocorrelation function (ACF) is examined on a healthy dataset to determine whether the signal is stationary. If the ACF decays quickly, the signal is considered stationary [51]; otherwise the signal is considered non-stationary. The ACF of a vibration signal acquired at 50 rpm is shown in Fig. 5 a). The ACF is slowly decreasing and has a cyclic trend, and the signal is therefore considered non-stationary. To mitigate trends and cyclic signal components, the signal is differentiated once. Effectively, the jerk (m/s^3) is calculated with this differentiation, and low-frequency components from shaft and gearbox are mitigated, while high-frequency components are enhanced. The resulting ACF after differentiation is shown in Fig. 5 b), showing that the ACF now decreases fast and only varies randomly after 100 lags. Given this result, the vibration signal acquired on this test rig is differentiated once to make the signal more stationary.
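The segmentation and the stationarity check can be sketched as follows. At 50 rpm and 81.92 kHz, one revolution lasts 1.2 s, i.e. 98 304 samples, and a 75 % overlap gives a step of 24 576 samples. The function names and the synthetic test signal are assumptions for illustration; a first-order difference is used as the discrete counterpart of differentiation.

```python
def segment(signal, length, overlap=0.75):
    """Split a signal into fixed-length segments with fractional overlap."""
    step = max(1, int(round(length * (1.0 - overlap))))
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, step)]

def difference(signal):
    """First-order difference, s[i+1] - s[i]."""
    return [b - a for a, b in zip(signal, signal[1:])]

def acf(signal, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    c0 = sum(v * v for v in centered)
    return [sum(centered[i] * centered[i + k] for i in range(n - k)) / c0
            for k in range(max_lag + 1)]

samples_per_rev = int(81920 * 60 / 50)   # 98304 samples per revolution at 50 rpm

# Synthetic example: a trend plus a fast alternating component.
raw = [i + (-1) ** i for i in range(200)]
acf_raw = acf(raw, 5)                    # decays slowly because of the trend
acf_diff = acf(difference(raw), 5)       # trend removed by differencing
```

In the synthetic example, the trend keeps the ACF of the raw signal close to 1 at small lags, while the differenced signal no longer shows this slow decay.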

B. DATASET 2: AXIAL ROLLER BEARING TEST RIG
Dataset 2 (DS2) consists of AE data from an in-house test rig, shown in Fig. 6. The test bearing was of type 29230 M from manufacturer ISB, subject to an axial load of 50 kN. Data was recorded at 30 and 60 rpm, in that order. AE data was collected at 1 MHz sampling rate for 10 seconds. Data was then split into constant length segments of 50 000 samples.
To emulate the distributed abrasive wear shown in Fig. 1, the rollers were removed, and roller ends were ground with sandpaper of grit size from ISO/FEPA grit grade P400 (finest), P320, P220 and P80 (coarsest), as shown in Fig. 7. "Heavy" and "Very Heavy" refer to relative degrees of damage using the same sandpaper grade.
The ACFs of the acoustic emission dataset before and after differentiation are shown in Fig. 8 a) and b), respectively. The ACF of the raw signal in Fig. 8 a) decreases rapidly, and differentiating the signal has little effect on the ACF, as observed in Fig. 8 b). Therefore, the acoustic emission signal is considered stationary and requires no differentiation.

C. FEATURE EXTRACTION AND PREPROCESSING
The input x to the autoencoder network is a feature vector calculated from the vibration and AE data. In previous work, vibration energy did not increase significantly when the damage level on an axial bearing was escalated [30], and the energy at specific characteristic frequencies did not increase either. However, degradation of the bearing condition is expected to change the frequency content of the associated signal. Features that are sensitive to changes in the measured signal are therefore required as input to the autoencoder. An autoregressive (AR) model of order p predicts the next signal sample as a linear combination of the p previous samples, assuming that the signal s is stationary.
Thus, changes in the AR model parameters should reflect that the vibration signal has changed. The AR coefficients may have arbitrary distributions, which makes it challenging to quantify a change. It is therefore easier to threshold in latent space, where the distribution of healthy latent vectors approximates a Gaussian distribution.
The AR model is given in (8),

$$ s_i = \sum_{j=1}^{p} a_j s_{i-j} + \nu_i \tag{8} $$

where s_i is the signal at the i'th time step, ν_i is the model residual and a_j is the j'th model parameter. The Yule-Walker equations [52], [53] are solved for an input signal s to obtain the AR model parameters. The order p is determined by calculating the partial autocorrelation function (PACF) [54] for an increasing number of lags. The model order p of a time series with N samples is considered sufficient where the PACF at lag p is zero at a 5 % significance level [51], as given in (9).

$$ \left| \mathrm{PACF}(p) \right| < \frac{1.96}{\sqrt{N}} \tag{9} $$

The smallest lag p that results in a PACF below the 5 % significance level is determined for each healthy segment.
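The feature extraction can be sketched with the Levinson-Durbin recursion, which solves the Yule-Walker equations and yields the PACF as a by-product. This is one common way to do it and is an assumption here; the text does not state which solver was used. The function names are hypothetical, and the 1.96/sqrt(N) bound is the usual large-sample 5 % limit.

```python
import math

def autocovariance(signal, max_lag):
    """Biased sample autocovariance for lags 0..max_lag."""
    n = len(signal)
    mean = sum(signal) / n
    c = [v - mean for v in signal]
    return [sum(c[i] * c[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for autocovariances r[0..order].

    Returns (a, pacf), where a[j-1] is the AR coefficient a_j and
    pacf[k-1] is the partial autocorrelation at lag k.
    """
    a, pacf = [], []
    err = r[0]                                    # prediction error variance
    for k in range(1, order + 1):
        acc = r[k] - sum(a[j] * r[k - 1 - j] for j in range(k - 1))
        kappa = acc / err                         # reflection coefficient = PACF(k)
        a = [a[j] - kappa * a[k - 2 - j] for j in range(k - 1)] + [kappa]
        pacf.append(kappa)
        err *= 1.0 - kappa * kappa
    return a, pacf

def select_order(signal, max_order):
    """Smallest lag whose PACF magnitude is below the 5 % bound 1.96 / sqrt(N)."""
    r = autocovariance(signal, max_order)
    _, pacf = levinson_durbin(r, max_order)
    bound = 1.96 / math.sqrt(len(signal))
    for lag, value in enumerate(pacf, start=1):
        if abs(value) < bound:
            return lag
    return max_order

# Autocovariances of an ideal AR(1) process with a_1 = 0.5.
coeffs, pacf = levinson_durbin([1.0, 0.5, 0.25], 2)
```

For the ideal AR(1) autocovariances in the example, the recursion recovers a_1 = 0.5 and a zero PACF at lag 2, which is the behavior the order selection rule relies on.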
As an example, the PACF of a differentiated vibration signal acquired at 50 rpm using test rig 1 is shown in Fig. 9. At lag 29, the PACF is beneath the 5% significance level. This procedure is repeated for all signal segments, and statistics between all segments within each speed range are calculated afterwards and shown in Table 2. As seen in the table, the mean value is selected as model order p. Standard deviation (STD) and median are also given for each dataset and speed.
Outliers were removed from the training data if any AR coefficient differed from its mean value by more than five standard deviations. All input data was afterwards standardized using the mean and standard deviation of the remaining training data. Baseline data (DL0) was shuffled and split into training (50 %), validation (25 %) and test (25 %) subsets. The remaining DLs were used for testing only. Table 3 shows sample rate, number of samples in the raw data, and size of the datasets at each DL.

IV. RESULTS
This section presents the results of the experiments, evaluating the calculated HI using both VAE and CVAE. The validity of the required assumptions of a Gaussian-distributed latent variable is also discussed. The presented results are the aggregate of the 5 models trained with different random initialization.

A. HEALTH INDICATOR EVALUATION
In the first dataset, DS1, an increase in HI with damage level is observed at all speeds. The alarm level (HI = 1) is calculated with P_FA = 10^-6. Boxplots of the calculated HI from VAEs and a CVAE are shown in Fig. 10. Whiskers are set to the 2.5th and 97.5th percentiles. In the following discussion, the median (orange line inside boxes) is considered as the HI value. In dataset 1, the HI at DL2 exceeds the warning level of 0.75 at all speeds except 150 rpm for the VAE (HI = 0.63) and 200 rpm for the CVAE (HI = 0.69). Data for DL2 was not recorded at 50 rpm. At DL3, the HI exceeds the alarm value of 1 at all speeds. Results from the VAE and CVAE differ more as the damage level increases, but the overall results are well aligned, with an increase in HI with damage level at all speeds.
The HI calculated for dataset 2 with VAE and CVAE is shown in Fig. 11. At 30 rpm, the HI is above the alarm level from DL2. However, there is no monotonic increase in HI with damage level. Still, this result should be considered a clear indication of anomalous behaviour. The HI for 60 rpm follows a similar trend, but the HI values are lower, exceeding the warning level in DL3-5 only. As in DS1, the HI values calculated using the standard VAE and the CVAE are very similar. Compared to dataset 1, the HI is less consistent with increasing damage level, and the differences between speeds are larger. The inconsistency between damage levels may be caused by removing the bearing to apply the damage. This procedure introduces differences in the mechanical assembly that may affect the results. Also, the damage was applied manually, which gives room for more variation between damage levels. Finally, data for increasing speeds were recorded consecutively. The seeded damage may therefore have been smoothed over time during acquisition. This is a possible explanation for the differences between 30 rpm and 60 rpm. If the smoothing effect differs with damage severity, this will also contribute to the HI inconsistency between damage levels. Further, higher speed may generate high-energy frequency components, which dominate the AR coefficients but are not associated with the bearing damage.

B. MODEL PROPERTIES
A summary of final training, validation and test losses for the VAE is shown in Table 4, including the median values for the 5 models. The ability of the latent representation to carry useful information is measured by the reconstruction loss L_R. Examining L_R in Table 4 reveals that the value is correlated with the model order p, which is expected from the square error summation per datapoint in (3). The reconstructed AR coefficients for DS1-100 are shown in Fig. 12. This speed has the lowest number of features (p = 12) in the dataset, and also the lowest reconstruction loss. Still, the reconstructions of coefficients 8 and 9 are skewed. It is likely that further tuning of hyperparameters such as hidden layer size, number of hidden layers and latent dimension can improve reconstruction, but a systematic parameter search was not performed due to the associated computational cost of training.
The statistical properties of the HI assume that q(z) approximates a multivariate standard Gaussian distribution p(z) ∼ N(0, I). The Gaussian latent space is imposed by L_KL, which takes values between 1.838 (DS1-250) and 2.106 (DS1-50) in the test dataset. The loss values are more stable than L_R, as the latent dimension J is constant.
However, L_KL describes the mean KL divergence of each datapoint rather than the aggregated distribution q(z). Therefore, the KL divergence D_KL between the aggregated distribution (after sampling) and p(z) is calculated as in (6). The value is bounded by D_KL ≥ 0, and a value of zero means that q(z) and p(z) are identical distributions. In the test datasets, D_KL takes values between 0.007 (DS2-30) and 0.112 (DS2-30). Fig. 13 shows histograms of each dimension of z for DS1-100, which has D_KL = 0.05. A qualitative evaluation confirms that it approximates a Gaussian distribution. Table 5 lists L_R, L_KL and D_KL from the networks trained as CVAEs, where all speeds in the dataset are used simultaneously in training. L_R in the test datasets is higher than the average for the separate speeds in Table 4. This is reasonable, as the same number of neurons in the network must learn to reconstruct data from 5 and 2 speeds in DS1 and DS2, respectively, instead of just one. However, the values for L_KL and D_KL are similar to those of the VAE. This indicates that the assumption of a Gaussian latent space is valid for the CVAE as well.

V. CONCLUSION
This paper proposes a method for unsupervised learning of a Health Indicator (HI), aiming to detect defects in large, slow-rotating axial bearings, by performing variational inference using a variational autoencoder (VAE) and a conditional variational autoencoder (CVAE). Within the framework, coefficients from autoregressive (AR) models were used as features for both vibration and acoustic emission (AE) measurements. The proposed method is shown to be effective on both types of measurement. The ability to use vibration measurements instead of AE data allows the proposed method to be cost-effective. In contrast, the previous work on dataset 1 was not able to reveal any degradation of the bearing using vibration measurements. The HI calculated from AE data in dataset 2 was less consistent with the applied damage. However, the experimental design may have had an impact on the calculated HI, in particular at 60 rpm. In both datasets, the proposed method was able to uncover and quantify a significant change in machine operation through the HI. The possibility to calibrate the HI to a desired level of Probability of False Alarm (PFA) allows the alarm setting to adapt to the criticality of the equipment.
Challenges of detecting defects on axial, large bearings at low speeds were discussed in this study. The effectiveness of the proposed method for axial bearing fault detection at low speeds is validated by data from 2 test rigs. As the proposed method does not rely on detection of fault frequencies, changes in machine operation can be detected regardless of failure mode and fault location. In future studies, the methodology can be extended to include other types of feature input, such as time series data. The effect of the network hyperparameters on reconstruction error, latent variable distribution and HI sensitivity should be investigated along with evaluating generalization performance on other applications. The HI is capable of capturing changes in the condition of the axial bearing, so a logical next step is to incorporate it in prognostics and remaining useful life estimation.
MARTIN HEMMER received the B.Sc. degree in mechanical engineering from the Oslo University College, in 2012, and the M.Sc. degree in mechatronics from the University of Agder, in 2014, where he is currently pursuing the Ph.D. degree in mechatronics, as a part of the SFI Offshore Mechatronics project. His project deals with condition monitoring and condition-based maintenance in offshore applications, focusing on large, as well as axial rolling element bearings rotating at low speed. His research interests include the areas of machine learning, signal processing and condition monitoring, and condition-based maintenance of rotating machinery.

TOR I. WAAG received the M.Sc. and Ph.D. degrees in signal processing for laser light scattering from the Norwegian University of Science and Technology (NTNU), Trondheim. His background is in technical physics from NTNU. He is currently a Senior Scientist at the NORCE Norwegian Research Center. His work has been concentrated on the entire chain from sensor data via signal processing to decision support, mainly for the offshore industry in Norway. He is a member of the Society of Petroleum Engineers (SPE) and of the Norwegian Academy of Technological Sciences (NTVA). His recent activity within the SFI Offshore Mechatronics at the University of Agder has been focused on condition-based maintenance, mainly studying big, slow rotating bearings using vibration measurements and acoustic emission.