Remaining Useful Life Estimation Using Long Short-Term Memory Neural Networks and Deep Fusion

Estimation of Remaining Useful Life (RUL) is a crucial task in Prognostics and Health Management (PHM) for condition-based maintenance of machinery. To transmit and store sensor data for archiving and long-term analysis, data compression techniques are regularly used to reduce the bandwidth, energy and storage requirements of modern remote PHM systems. In these systems, the question arises of how compressed sensor data affects RUL estimation algorithms. A main drawback of conventional statistical modeling approaches is that they require expert prior knowledge and a significant number of assumptions. Alternative regression-based approaches and deep neural networks are known to have issues when modeling long-term dependencies in sequential data. Recently, Long Short-Term Memory (LSTM) neural networks have been proposed to overcome these issues, and in this paper we create an LSTM network and data fusion approach that can estimate the RUL from compressed (distorted) data. The experimental results indicate that the proposed method estimates RUL reliably, with narrower error bands than other state-of-the-art approaches. Moreover, the proposed method is able to predict RUL from both the raw and compressed datasets with comparable accuracy.


I. INTRODUCTION
Prognostics and engineering maintenance of complex machinery, such as aero-engines, gearboxes or power generation systems, are crucial and expensive tasks. Failures in machine structural integrity can be a significant contributor to increased running costs, unplanned outages and even catastrophic events in machines. With increasing system complexity, conventional scheduled preventive maintenance processes are becoming less capable of meeting the increasing industrial efficiency demands. Condition-based monitoring using intelligent prognostic and health management (PHM) technologies is capable of reducing risks and maintenance costs and increasing the reliability, availability and efficiency of machines. The estimation of remaining useful life (RUL) in machines is essential in PHM systems to improve the efficiency of maintenance schedules, avoid engineering failures and implement cost savings.
The associate editor coordinating the review of this manuscript and approving it for publication was Nagarajan Raghavan.
In general, the major task of RUL estimation is to predict the time left before the machinery loses its operational ability, based on the historical time-series sensor data obtained by the condition monitoring system [1], [2]. The existing methods of RUL estimation can be categorized into four groups: physics model-based approaches, statistical-based approaches, machine learning based approaches and hybrid approaches. To facilitate these approaches, continuous streams of raw data are routinely captured by the condition monitoring systems with high precision requirements. The storage of high precision floating point data is costly, and a high bandwidth data throughput is required to transfer the data from the sensor to the central monitoring station. This is currently addressed by applying algorithms that produce low bandwidth condition indicators, discarding the raw data in the process. However, understanding failures and estimating RUL from transient events in retrospect become more difficult if the reduced sampling frequency means that valuable information has been lost. An alternative approach is to apply efficient data compression algorithms [3] that can adapt their acquisition conditions and compression ratio to the machine state without loss of critical information.
This paper proposes a deep learning end-to-end Long Short-Term Memory network with data fusion (LSTM-Fusion) method for RUL estimation in prognostics. Raw sensor measurements with normalization are directly used as inputs to the neural network. This means that no prior expertise in prognostics, physics model assumptions or signal processing is required, which facilitates the industrial application of the proposed method. By utilizing variable time window sizes, data normalization, LSTM and fusion techniques, it is possible to capture the latent hierarchical structure in the time-series sequence by encoding the temporal dependencies at different timescales using a novel fusion update mechanism. The proposed method is expected to obtain more accurate and more reliable (lower error-band) RUL estimation not only at the end of the machine life but also during the healthy status of the machine. Comprehensive analysis of the proposed approach and comparisons with existing methods are presented in this study.
The main contributions of this paper are as follows:
• A novel RUL estimation method is proposed that utilizes LSTM-Fusion techniques. Experimental results demonstrate the superiority of the proposed network after comparison with seven state-of-the-art methods using the same datasets.
• An in-depth evaluation explores how the number of layers of the proposed neural network influences the performance of the proposed method. The performance is analyzed from both a reliability and a computational complexity point of view.
• A thorough evaluation is performed of how the performance of the proposed LSTM-Fusion method is influenced by compressed/distorted sensor data, using the publicly available C-MAPSS dataset. Experimental results indicate the optimum compression/distortion levels which can be tolerated by the proposed method.
This paper starts with an overview of the related work in RUL estimation methods in Section II, along with a brief introduction to Long Short-Term Memory (LSTM). Section III presents the proposed LSTM-Fusion network architecture in detail. Section IV presents descriptions of the experimental dataset, while Section V presents the effectiveness and superiority of the proposed method through comparisons with other methods and under different testing conditions. Finally, Section VI concludes this study and suggests potential future work.

II. RELATED WORK
In the past ten years, several papers and standards have reviewed RUL prediction approaches and categorized them into four groups according to their techniques and methodologies: physics model-based approaches, statistical-based approaches, hybrid approaches and machine learning based approaches [2], [4]-[6]. In the following subsections, we review state-of-the-art RUL estimation approaches based on conventional physics/statistical model approaches and machine learning based approaches.

A. PHYSICS/STATISTICAL MODEL-BASED APPROACHES
Physical and statistical model-based approaches utilize empirical physical models and mathematical modeling approaches of failure mechanisms to interpret machine damage and degradation processes [5], [7].
Physics model-based approaches are correlated to material characteristics and stress levels, which are described by experiments, finite element analysis or empirical models. The well-known physics model of fatigue crack growth, the Paris-Erdogan (PE) law, was first proposed in [8] for RUL estimation of machinery. Over the last few years, many different versions of PE models have been developed and utilized in the field of prognostics and RUL prediction of machinery [9]-[13]. Other physics models for RUL estimation of machinery include the Forman crack growth law [14], which has been utilized to estimate the RUL of cracked rotor shafts in [15]. Moreover, particle filtering has been employed in conjunction with the Norton law to characterize the creep evolution of turbines and estimate their RUL [16], [17]. To obtain accurate estimation of RUL, the physics models are developed with prior knowledge of the failure mechanisms and empirical estimation of model parameters. The drawback of physics model-based approaches is that it is difficult to understand and model the physics of damage for complex mechanical systems, and parameters such as material inconsistencies, imperfect boundary conditions, loads and uncertainty are generally disregarded.
Statistical data-driven approaches work by building statistical models based on available past observed data in a probabilistic way [18]. The RUL estimation models are constructed by using the available data to fit a probabilistic model without relying on any physics or engineering assumptions. Statistical model-based approaches are effective in describing the uncertainty of the degradation process and RUL estimation. Therefore, statistical data-driven approaches have become the most popular among the four categories (physics, statistical, hybrid and machine learning approaches).
Particularly effective are statistical data-driven approaches for RUL estimation based on modeling condition monitoring (CM) data in conjunction with estimating the distance between the CM data and predefined threshold levels of the model. Previous papers have reviewed statistical model-based approaches systematically [2], [18], [19].
Regression-based methods are commonly used for RUL estimation due to their simplicity. Autoregressive models assume that the future state value of a machine depends linearly on its previous observations and on a stochastic term [6]. In [20], the autoregressive model is combined with a particle filter algorithm to predict the RUL of machinery. An autoregressive integrated moving average and polynomial regression method is employed in [21] for RUL estimation in the wavelet domain. The limitation of the regression-based method is that the model mainly depends on the trend of historical observations, which could lead to an unreliable RUL estimation over time. A random coefficient regression method is proposed for RUL estimation in [22], which characterizes the stochastic variability of degradation processes by adding random coefficients. That method used a nonlinear mixed-effects model to characterize the machinery degradation processes and estimated the RUL probability density functions by utilizing Monte-Carlo simulation. Other variants of the random coefficient models combined with a Bayesian filtering algorithm for machinery RUL estimation have been proposed in [23]-[25]. The random coefficient regression models can derive a probability density function for the machinery RUL estimate. However, the models assume a Gaussian distribution for the random coefficients, which achieves accurate results only in some special cases, and they cannot model the temporal variability of RUL estimation [26].
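For reference, the linear dependence assumed by the autoregressive models in the preceding paragraph can be written out explicitly. The block below is the textbook AR(p) form, not an equation taken from [6] or [20]; the symbols $p$ (model order), $\phi_i$ (coefficients), $c$ (constant) and $\varepsilon_t$ (zero-mean stochastic term) are notation introduced here for illustration.

```latex
% Textbook AR(p) model (illustrative notation, not from the cited works)
x_t = c + \sum_{i=1}^{p} \phi_i \, x_{t-i} + \varepsilon_t
```

Because the prediction is driven entirely by the last $p$ observations, extrapolating it many steps ahead simply propagates the recent trend, which is the limitation noted above.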
Wiener processes have been proposed for machinery degradation process modeling, and are appropriate when the degradation process varies bi-directionally with Gaussian noise over time [27]. When a Wiener process is used to characterize the machinery degradation, the distribution of the first passage time (FPT) can be formulated as an inverse Gaussian distribution [27]. In [28], a time-varying degradation drift of Wiener process models is transformed into a constant degradation drift by a time-scale transformation method, where the RUL estimation can be represented by the FPT at which the process crosses certain thresholds. The advantage of Wiener process models is that they can represent the temporal variability of the degradation processes. However, the drawback of these previous works is that Wiener processes are a type of regression model based on the assumption that the future state depends only on the current state and is independent of the past states, i.e., the Markov property. To address this drawback, Si et al. in [29], [30] employed a particle filter to smooth out the non-Gaussian degradation state and random-effect parameters from previous measurements. The experimental results show that this approach can improve the model fitting and the accuracy of the RUL estimation.
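The link between a Wiener degradation model and the inverse Gaussian FPT distribution can be made concrete with the standard linear-drift case. This is a sketch under stated assumptions: a constant drift $\lambda > 0$, diffusion coefficient $\sigma_B$, initial state $x_0$ and failure threshold $w > x_0$, all of which are symbols introduced here rather than the notation of [27] or [28].

```latex
% Linear-drift Wiener degradation model (illustrative notation)
X(t) = x_0 + \lambda t + \sigma_B B(t), \qquad
T = \inf\{\, t : X(t) \ge w \,\}

% First passage time density (inverse Gaussian), valid for \lambda > 0
f_T(t) = \frac{w - x_0}{\sqrt{2\pi \sigma_B^2 t^3}}
         \exp\!\left( -\frac{(w - x_0 - \lambda t)^2}{2 \sigma_B^2 t} \right)
```

Here $B(t)$ is standard Brownian motion, and the mean of $T$ is $(w - x_0)/\lambda$, i.e., the remaining distance to the threshold divided by the drift.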
Gamma process models work under the assumption that machine degradation is monotonic and evolves in only one direction. In this case the increments of the degradation processes are modelled with independent random variables with a gamma distribution [31]. In [32], stochastic filtering with the Gibbs algorithm is proposed to find the hidden degradation state of the gamma process. The RUL can be estimated based on the gamma process probability distribution with a predefined failure threshold. The shortcoming of the Gamma process models is that they are also restricted by the Markov property assumption mentioned for Wiener processes.
The underlying assumption of Markov models is that the machine degradation process evolves in a finite state space and obeys the Markov property. Kharoufeh et al. proposed a hybrid RUL estimation method based on Markov models and stochastic failure models to compute the RUL distributions numerically along with their moments. The number of states in the Markovian model was estimated by a K-means clustering analysis, which implies that there must be sufficient data on failures and degradation to estimate the model parameters [33]. The drawback of the Markov models is that they cannot model the hidden health states of the degradation processes. To overcome this, the Hidden Markov Model (HMM) has been employed for machinery prognostics [34]-[36]. Furthermore, the hidden semi-Markov model was employed to improve the flexibility of HMMs for representing complicated state transition processes in machinery degradation for prognostics and RUL estimation [37]-[39]. It is worth noting that Markov model-based methods suffer from the Markov property limitation, so they may only approximate the true process.
Covariate-based hazard models have been employed for lifetime modeling to describe the degradation processes caused by one or more factors in the condition monitoring (CM) information. The Proportional Hazards Model (PHM) was proposed by Cox in [40], and has been widely used to relate the system's CM variables and external factors to the failure of a system. The PHM assumes that the hazard rate of a system is composed of two multiplicative factors, a baseline hazard function and a function of covariates. The PHM with the Weibull distribution was employed in conjunction with the condition monitoring process for RUL estimation of machinery in [41]. Tran et al. proposed a hybrid system including an autoregressive moving average model, PHM and a support vector machine to assess the machine health degradation and predict the RUL of the machine [42]. The limitation of the PHM is that it requires sufficient failure event data and associated CM information simultaneously. However, these two types of data may not always be available in practice.
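The multiplicative structure of the PHM described above is usually written in the following form. This is the standard Cox formulation with an exponential covariate function; the symbols $h_0(t)$ (baseline hazard), $z$ (covariate vector, e.g., CM variables) and $\beta$ (regression coefficients) are notation assumed here, and the Weibull baseline is shown only as the example mentioned for [41], with illustrative shape $\kappa$ and scale $\eta$.

```latex
% Cox proportional hazards model (illustrative notation)
h(t \mid z) = h_0(t) \, \exp\!\left( \beta^{\top} z \right)

% Example: Weibull baseline hazard with shape \kappa and scale \eta
h_0(t) = \frac{\kappa}{\eta} \left( \frac{t}{\eta} \right)^{\kappa - 1}
```

Estimating $\beta$ requires covariate histories paired with observed failure times, which is why the model needs failure event data and CM information at the same time.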

B. MACHINE LEARNING BASED APPROACHES
In recent years, machine learning techniques, especially deep neural network-based approaches, have been developed to model highly nonlinear, complex and multi-dimensional systems without prior knowledge of the system and machinery physical behaviors [43]. Machine learning approaches attempt to learn the machine degradation characteristics from raw sensor data using optimization techniques, instead of building physics and statistical models.
Artificial Neural Networks (ANNs) and Multi-layer Perceptrons (MLPs) are employed for modeling the RUL of bearings in [44], [45]. The proposed neural network takes multiple CM measurement values at the present and past observations as the inputs and the RUL as the output. Zhang et al. proposed a multi-objective deep belief network (DBN) ensemble method for RUL estimation [46]. An evolutionary algorithm is integrated with the conventional DBN training method to evolve multiple DBNs, which are combined to establish an ensemble model for NASA's turbofan engine RUL estimation [47]. Bektas et al. proposed a neural network based method that learns a similarity model, fed by data normalisation and filtering methods applied to the operational trajectories of complex systems, for RUL estimation [48].
In the last few years, Deep Neural Networks (DNNs) have emerged as a highly effective neural network architecture for object recognition and classification tasks [49], speech recognition [50] and other applications such as image quality assessment [51]. All of these tasks demonstrate that DNNs can significantly outperform conventional methods. DNNs utilize a stack of multiple layers of nonlinear processing units for feature extraction and transformation. The "deep" in "deep neural network" refers to the number of layers through which the data is transformed. Each successive layer uses the output from the previous layer as input to fully capture the representative information from raw input data [52]. The performance improvement may benefit from high-level abstractions of input data which can be modeled with the help of the complex deeper structures, leading to more efficient feature extraction compared to shallower networks.
To investigate the capability of DNNs for prognostics and RUL estimation for machinery, Babu et al. proposed a Convolutional Neural Network (CNN) to estimate the RUL of a system based on normalized variate time series signals from sensors [53]. Two convolutional layers, two average pooling layers and one fully connected layer are adopted in the proposed DNN architecture. The experimental results indicate that the proposed CNN-based method can outperform three other regression methods, namely the Multi-Layer Perceptron (MLP), Support Vector Regression (SVR) and Relevance Vector Regression (RVR), on publicly available datasets. Li et al. proposed a deep CNN architecture for RUL estimation that uses the raw collected data with a time window approach for sample preparation in order to obtain better feature extraction by the CNN [54]. However, since the collected CM signals have complex multimodalities and high heterogeneity, the CNN-based methods have difficulties extracting heterogeneous features representing the variation from running until the point of machine failure. In a further work, Li et al. used the short-time Fourier transform for time-frequency transformation and applied multi-scale feature extraction with a CNN to enhance the network learning ability [55]. Li et al. in [56] proposed a cross-domain fault diagnosis method for rolling element bearings, where a deep generative model is employed to generate artificial fault signals in the target domain based on healthy signals. The generated artificial fault signals are then used for domain adaptation. da Costa et al. [57] proposed an LSTM-based domain adversarial neural network approach to perform unsupervised domain adaptation for RUL prediction, where a gradient reversal layer is employed to perform adversarial learning during training and induce a domain-invariant representation.
Recurrent Neural Networks (RNNs) were proposed for sequence learning, and have been employed in speech recognition, machine translation, natural language processing and other areas, where they have achieved state-of-the-art performance [58]. In particular, unlike conventional multi-layer perceptron networks that can only learn the representation from input data to target vectors, RNNs are not limited to discrete internal states and they allow for continuous, distributed sequence representations. This means that RNNs have the capability to learn the latent representation from the entire history of previous inputs to target vectors by building connections between units that form a directed cycle. This allows a memory of previous states to be kept in the network's internal state. Malhi et al. proposed an RNN-based learning approach for long-term prognostics of machine health status monitoring [59]. The Continuous Wavelet Transform (CWT) is employed for preprocessing of the vibration signals from a rolling bearing. Gugulothu et al. [60] proposed an RNN-based method for RUL estimation, which embedded time series data using RNNs as an encoder and drew a health index curve with it. The RUL is predicted by comparing the latter curve with the normal health index curve.
RNNs can be trained via backpropagation through time for supervised tasks with sequential input data and target outputs. However, due to exploding and vanishing gradient problems during backpropagation, training a deep RNN is difficult and the network may not capture the long-term dependencies in the sequential signal [61], [62]. Long Short-Term Memory (LSTM) is a type of RNN which was designed to prevent backpropagated errors from causing exploding and vanishing gradient issues in RNNs [63]. LSTMs employ input gates, forget gates and output gates, where forget gates were introduced to avoid long-term dependency problems. Because LSTM networks can capture long-period temporal dependencies and nonlinear dynamics in sequential signals, LSTMs have achieved great success in speech recognition and machine translation applications [64].
The characteristics of the LSTM make it a natural choice for machinery RUL estimation due to the considerable time lag between inputs and their corresponding outputs. A simple LSTM is employed in [65] and a bidirectional LSTM is proposed in [66] for RUL estimation. A hybrid deep learning model combining CNN and LSTM is demonstrated for machine health monitoring in [67], where a CNN is employed for local feature extraction and a bi-directional LSTM [68] is built on the CNN outputs for temporal information encoding and representation learning. Al-Dulaimi et al. [69] and Li et al. [70] proposed similar approaches in which, instead of the series connection proposed in [67], the CNN and LSTM networks are combined in parallel in a hybrid network architecture for RUL estimation. In particular, a directed acyclic graph network architecture is employed in the hybrid design of [70]. The outputs from the LSTM and CNN networks are summed element-wise and fed into a second LSTM network, then forwarded to a fully connected layer for RUL prediction. However, these hybrid network architectures can require a significantly large number of network parameters to tune and result in a large model, which may be difficult to use in practical applications.
Malhotra et al. proposed an LSTM-based encoder-decoder structure in [71], which is utilized to transform a multivariate raw input sequence to a fixed-length vector; the vector is then used by the decoder to produce the target sequence. The model learns to capture the behavior of a machine by learning to reconstruct multivariate time series corresponding to normal behavior in an unsupervised manner. Then, the reconstruction error is used as a health index to estimate the degradation, and in turn estimate the RUL of the machine.
In this paper, an LSTM-Fusion network for RUL estimation is proposed. Furthermore, industrial CM systems use multiple sensors and generate continuous streams of raw signals for machine health monitoring. Signal compression techniques have increasingly significant roles in modern CM systems. The effects of compression on RUL estimation have not been previously investigated, although they could have negative effects on the accuracy and robustness of the machinery degradation modeling and RUL prediction. To bridge this gap, we investigate the performance of the proposed LSTM-Fusion network using compressed/distorted sensor data with different compression levels.

III. PROPOSED LSTM-FUSION NETWORK FOR RUL ESTIMATION
In this section, the proposed LSTM-Fusion network for RUL estimation is presented, including an overview of the standard LSTM network and the proposed hierarchical multiscale fusion approach for better capturing the temporal dependencies in the time-series sequence signal. The implementation details of our proposed method are presented in Section III-B.

A. LSTM NETWORK
LSTM, as a special RNN structure, has proven stable and powerful for modeling long-range dependencies in sequence modeling [62], [63]. The major novelty of LSTM is the memory 'cell', $c_t$, which essentially acts as an accumulator of the state information in the sequence signal. The memory cell is accessed, written and cleared by several self-parameterized controlling gates. The information is accumulated in the cell if the 'input gate', $i_t$, is activated when a new input arrives. The past cell status $c_{t-1}$ can be "forgotten" in this process if the 'forget gate', $f_t$, is on. Whether the latest cell output $c_t$ is propagated to the final state $h_t$ is further controlled by the 'output gate', $o_t$. To avoid the vanishing and exploding gradient problem of the traditional RNN model, the memory cell and gates are designed to control information flow so that the gradient is trapped in the cell and prevented from vanishing too quickly (also known as constant error carousels) [62], [63]. In this paper, we follow the formulation of the Fully Connected LSTM (FC-LSTM) as in [72]. The key equations defining the hidden layer function, H, are listed in (1), where σ is the logistic sigmoid activation function, '•' denotes the Hadamard product, and the model parameters, including $W \in \mathbb{R}^{d \times k}$ and $b \in \mathbb{R}^{d}$, are shared by all time steps and learned during model training. Here d is the number of sensors at each time step and k is the dimension of the hidden vectors, h, which is a hyperparameter. Multiple LSTMs can be stacked and temporally concatenated to form more complex structures. The output at the terminal time step is passed to a linear regression layer with weights $W_r \in \mathbb{R}^{k \times z}$ to predict the output, where z is the dimension of the output. The Euclidean loss is employed as the model loss function in this task. In the remainder of the paper, a shorthand notation is used to denote the recurrent LSTM operation. The following section describes the proposed LSTM-Fusion network in detail.
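The gate mechanics described in Section III-A can be illustrated with a single LSTM time step. The following is a minimal NumPy sketch of a common LSTM variant without peephole connections; it is not necessarily the exact FC-LSTM formulation of [72], which may include additional terms. The function name `lstm_step`, the stacked parameter layout (`W`, `U`, `b` holding the input, forget, output and candidate blocks in that order) and the shapes are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (no peephole connections; illustrative layout).

    x_t:    (d,)    sensor readings at time t
    h_prev: (k,)    previous hidden state h_{t-1}
    c_prev: (k,)    previous cell state c_{t-1}
    W:      (4k, d) input weights, stacked as [input, forget, output, candidate]
    U:      (4k, k) recurrent weights, same stacking
    b:      (4k,)   biases, same stacking
    """
    k = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i_t = sigmoid(z[0:k])            # input gate
    f_t = sigmoid(z[k:2 * k])        # forget gate
    o_t = sigmoid(z[2 * k:3 * k])    # output gate
    g_t = np.tanh(z[3 * k:4 * k])    # candidate cell update
    c_t = f_t * c_prev + i_t * g_t   # Hadamard products accumulate state
    h_t = o_t * np.tanh(c_t)         # gated exposure of the cell state
    return h_t, c_t
```

Iterating `lstm_step` over a window of sensor readings and keeping the last `h_t` yields the terminal hidden state that a regression layer would consume.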

B. PROPOSED LSTM-FUSION NETWORK
To estimate the RUL of machinery, an algorithm must learn to predict the future given only a partial temporal context. This makes RUL estimation challenging and also differentiates it from fault detection of machinery using discriminative classifiers. The proposed network models the RUL estimation with a recurrent architecture which unfolds through time. This allows us to train a single model that learns to handle partial temporal context of varying time steps. Furthermore, machinery RUL estimation is challenging because the contextual information can come from multiple sensors with different data modalities. In such applications the way information from different sensors is fused is critical to the application's final performance. In this section, an end-to-end deep learning architecture which jointly learns to model and fuse information from different sensory signals with variable time window size is described. Figure 2 shows the overview of the proposed RUL estimation method.
Specifically, for training an RUL estimation model, we observe the time-series sequences and the labeled remaining useful life, $\{(x_1, x_2, \ldots, x_t), y\}$. The target is to train a network which predicts the RUL given partial temporal observation samples of the time-series sequence, $\{(x_1, x_2, \ldots, x_t) \mid t \le T\}$, in a sequence-to-target prediction manner, where T denotes the variable time window size. In this experiment, T ranges over [10, 100] with a step size of 20. Given training examples $\{(x_1, x_2, \ldots, x_t)_j, y_j\}_{j=1}^{N}$, where j denotes the jth time window size in N, we train the LSTM neural network to map the sequence of sensor observations, $(x_1, x_2, \ldots, x_t)$, to the sequence of labeled RUL values, $(y_1, y_2, \ldots, y_t)$, such that $y_t = y, \forall t$. Trained in this manner, the neural network attempts to map all sequences of partial observations $(x_1, x_2, \ldots, x_t)_j, \forall t \le T$, to the RUL label $y_t$. In this way, our model explicitly learns the characteristics of the latent RUL representation, $(h_t^T, c_t^T)$, with variable lengths of the observation samples. In the next step we describe how the proposed LSTM deep fusion method fuses the latent representations.
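The variable time window sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' preprocessing code: the window sizes are taken to be 10, 30, 50, 70 and 90 (one reading of "[10, 100] with a step size of 20"), each window is labeled with the RUL at its last cycle, and the function name `make_windows` and the array layout are hypothetical.

```python
import numpy as np

def make_windows(sequence, rul, window_sizes=(10, 30, 50, 70, 90)):
    """Cut one engine's run into sliding windows of several sizes.

    sequence: (n_cycles, n_sensors) array of sensor readings
    rul:      (n_cycles,) array with the RUL label of every cycle
    Returns a dict mapping window size T to (X, y), where X has shape
    (n_samples, T, n_sensors) and y holds the RUL at each window's last cycle.
    """
    samples = {}
    n = sequence.shape[0]
    for T in window_sizes:
        if n < T:
            continue  # run is too short to fill a window of this size
        X = np.stack([sequence[end - T:end] for end in range(T, n + 1)])
        y = np.asarray(rul[T - 1:n])
        samples[T] = (X, y)
    return samples
```

Each window size produces its own sample set, so that one sub-network per window size can be trained on inputs of a fixed length.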
The proposed LSTM-Fusion architecture must fuse multiple learned latent representations from training samples, $S_i = \{(x_1, \ldots, x_t)\}_j$. An obvious way to fuse sensory data is to concatenate the time-series signals as input to the network. However, we found that this simple concatenation performs poorly. The proposed method instead adds and trains an additional sensory fusion layer which combines the high-level representations of the sensor data. In particular, the proposed architecture first passes the sensor signals $S_i$ with variable window sizes independently through separate LSTM sub-networks, defined as in (4). The high-level representations from the LSTM sub-networks, $\{(h_{x_1}, \ldots, h_{x_t})\}_j$, are then concatenated at each time step t and passed through a fully connected fusion layer, given by (5), which fuses the high-level representations output by the sub-networks. The proposed LSTM-Fusion architecture is shown in Figure 3. The output representation of the fusion layer, $F_t$, is then passed to the softmax layer for RUL estimation by (6). In the LSTM-Fusion equations, $W_*$ and $b_*$ are the learned model parameters. The main purpose of the training process is to optimize the learned parameters, including weights and biases, such that the loss function, L, is minimized. The loss function is computed from $RUL^{t}_{True}$ and $RUL^{t}_{Est}$, which denote the labeled RUL and the estimated RUL at time t, respectively. In particular, $RUL^{t}_{True}$ is calculated as the difference between the current time t and the end of useful life. The "RMSprop" optimizer is utilized in the training process, and dropout regularization is employed to avoid model overfitting. Early stopping is applied in the training process when the learned model no longer shows performance improvements on the validation dataset.
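The fusion step described above (concatenate the sub-network representations, then apply a fully connected layer) can be sketched in a few lines. This is a hedged illustration, not the paper's equation (5): the tanh activation, the function name `fuse` and the parameter names `W_f` and `b_f` are assumptions, and the output layer that maps the fused representation to an RUL value is deliberately left out.

```python
import numpy as np

def fuse(hidden_states, W_f, b_f):
    """Fuse the hidden states of several LSTM sub-networks.

    hidden_states: list of 1-D arrays, one per sub-network (per window size)
    W_f:           (m, total_k) fusion weights, total_k = sum of state sizes
    b_f:           (m,) fusion bias
    Returns the fused representation F_t of shape (m,).
    The tanh nonlinearity is an assumption made for this sketch.
    """
    concat = np.concatenate(hidden_states)  # stack sub-network outputs
    return np.tanh(W_f @ concat + b_f)      # fully connected fusion layer
```

Fusing after the sub-networks, rather than concatenating the raw windows at the input, lets each sub-network first summarize its own timescale before the representations are combined.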
The raw signals from multiple sensors are in different units and magnitudes, so the corresponding data do not contribute equally to the neural network. Therefore, data normalization is necessary to convert all of the sensor signals to a normalized space. The Z-score standardization method [73] is employed to solve this problem. The Z-score measures the distance of a feature point from the mean in units of the standard deviation, the so-called standardization of data.
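A minimal sketch of per-sensor Z-score standardization follows. The statistics are fitted on the training data and reused for other data, which is a common practice assumed here rather than a detail stated in the text; the function names `zscore_fit` and `zscore_apply` and the handling of constant sensors (standard deviation replaced by 1) are likewise illustrative choices.

```python
import numpy as np

def zscore_fit(train):
    """Compute per-sensor mean and standard deviation.

    train: (n_samples, n_sensors) array of raw training readings
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant sensors
    return mean, std

def zscore_apply(data, mean, std):
    """Standardize data with previously fitted statistics."""
    return (data - mean) / std
```

After this transform every non-constant sensor has zero mean and unit variance on the training data, so no sensor dominates the network input purely because of its units.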

C. PERFORMANCE EVALUATION
To compare this work with state-of-the-art methods, we employ two objective performance metrics, the Scoring function and the Root Mean Square Error (RMSE) [47], [54], which are described in detail in the next paragraph.
The Scoring function is employed extensively for evaluating the performance of RUL estimation methods. In this metric, S is the estimated score, N is the number of engines in the test dataset and h is the difference between the predicted RUL and the ground truth. The advantage of the scoring function is that it penalizes late predictions more than early predictions. This is beneficial because late RUL predictions are more dangerous than early ones: they could delay the maintenance operations needed by the machine. However, the drawback is that a single outlier estimate (such as a very late prediction) can dominate the overall performance score. Therefore, a single evaluation score cannot fully represent the performance of the algorithm. Additionally, the Root Mean Square Error (RMSE) of the estimated RUL with respect to the ground truth is utilized to evaluate the performance of the algorithm. The RMSE gives equal weight to early and late predictions of the RUL and therefore provides a better overall evaluation of the performance of the RUL estimation algorithm. It is obtained as the square root of the mean of the squared errors h over the N test engines.
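Both metrics can be sketched in a few lines. The scoring function below is the asymmetric exponential score widely used with the C-MAPSS benchmark, with constants 13 for early and 10 for late predictions; it is assumed, not confirmed by the text above, that this is the form intended here. The function names `score` and `rmse` are chosen for this example.

```python
import math

def score(pred, true):
    """Asymmetric C-MAPSS-style score; lower is better.

    h = predicted RUL - true RUL. Early predictions (h < 0) are
    penalized with exp(-h/13) - 1, late ones with exp(h/10) - 1.
    """
    total = 0.0
    for p, t in zip(pred, true):
        h = p - t
        if h < 0:
            total += math.exp(-h / 13.0) - 1.0
        else:
            total += math.exp(h / 10.0) - 1.0
    return total

def rmse(pred, true):
    """Root mean square error between predicted and true RUL."""
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / n)
```

For the same absolute error, a late prediction receives a larger penalty than an early one under `score`, while `rmse` treats both identically, which is why the two metrics are reported together.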

IV. C-MAPSS DATASET
The NASA C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) dataset has been employed to evaluate the performance of the proposed method. The C-MAPSS dataset contains 4 sub-datasets, as listed in Table 1. These datasets are widely used benchmark data and contain simulated data produced using a model-based turbofan engine degradation simulation program, C-MAPSS, developed by NASA [47]. The datasets include sensor data with different numbers of operating conditions and fault conditions, as shown in Table 1. Each of these datasets has been further divided into training and testing subsets. In each dataset, the engine number, operational cycle number and operating settings are listed alongside the sampled data from the 21 sensors. More detail on the data structure of the datasets can be found in [47]. The engine operates normally at the beginning of each time series and develops a fault at some unknown point in time. In the training set, the fault grows until a system failure. In contrast, the data in the test set are provided up to some time prior to the system failure. The objective is to estimate the number of remaining operational cycles before the system failure based on the provided data in the test dataset.
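Because each training trajectory runs until failure, a RUL label can be derived for every cycle as the number of cycles remaining until the last recorded cycle of that engine. The sketch below assumes whitespace-separated rows whose first two fields are the engine number and the cycle number; the sample lines and their remaining fields are hypothetical placeholders, not real C-MAPSS records.

```python
def parse_row(line):
    """Return (engine_id, cycle) from one whitespace-separated record."""
    fields = line.split()
    return int(fields[0]), int(fields[1])


def rul_labels(rows):
    """Label each (engine_id, cycle) pair with the cycles left until failure."""
    last_cycle = {}
    for engine_id, cycle in rows:
        last_cycle[engine_id] = max(cycle, last_cycle.get(engine_id, cycle))
    return [last_cycle[engine_id] - cycle for engine_id, cycle in rows]


# Hypothetical run-to-failure records: engine 1 fails after cycle 3,
# engine 2 after cycle 2. Trailing fields stand in for settings and sensors.
sample_lines = [
    "1 1 0.0 0.0",
    "1 2 0.0 0.0",
    "1 3 0.0 0.0",
    "2 1 0.0 0.0",
    "2 2 0.0 0.0",
]
rows = [parse_row(line) for line in sample_lines]
labels = rul_labels(rows)
```

This labeling applies only to the training subset; for the test subset the trajectories are truncated, so the true RUL has to come from the ground truth supplied with the dataset.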

V. EXPERIMENTAL RESULTS
In this section, we perform extensive experiments to evaluate the proposed method. Firstly, the proposed LSTM-Fusion method is compared with seven state-of-the-art methods on the raw C-MAPSS dataset, including Multi-layer Perceptron (MLP) [74], Support Vector Regression (SVR) [75], Relevance Vector Regression (RVR) [76], a CNN-based regression method [53], an LSTM-based RNN [77] and a deep convolutional neural network based method [54]. Secondly, we evaluate our proposed method with multiple LSTM layers, ranging from 1 to 6 layers, to show how increasing the number of LSTM layers makes it possible to extract better high-level representations. This experiment is also designed to explore the computational complexity versus the performance of the proposed LSTM-Fusion architecture. Thirdly, the proposed LSTM-Fusion model is evaluated on distorted sensor data with different compression levels.

A. EXPERIMENT 1: PERFORMANCE OF THE PROPOSED METHOD
This experiment is designed to perform a comprehensive evaluation of RUL estimation using seven different methods.
The results are listed in Table 2. We report the experimental results with the same settings on the four sub-datasets. A structure of two stacked LSTM layers has been employed in this experiment, with 100 and 50 nodes in the first and second LSTM layers respectively. We performed 10-fold cross validation on the training and validation datasets to estimate the performance of the learned model, and the best model has been applied to the testing dataset. The train-test evaluations have been shuffled 10 times and we report the averaged results. Table 2 shows the results for RMSE and Score. We compare the proposed LSTM-Fusion results with seven other state-of-the-art methods. The experimental results indicate that the proposed LSTM-Fusion outperforms all other approaches under the RMSE metric, and obtains Score values comparable to those of the state-of-the-art methods. The experimental results also show that our method slightly outperforms the method proposed in [70]. This suggests that our core idea of fusing extracted features with variable time window sizes has merit for capturing the local and global characteristics of the signal for RUL prediction.
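The 10-fold cross validation mentioned in this experiment can be sketched as a simple index splitter. The paper does not state the unit at which the data are split, so splitting by engine is an assumption here, as are the number of engines and the function name `kfold_splits`.

```python
import random


def kfold_splits(unit_ids, k=10, seed=0):
    """Yield (training_ids, validation_ids) pairs for k-fold cross validation."""
    ids = list(unit_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled ids into k folds of (nearly) equal size.
    folds = [ids[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [u for j, fold in enumerate(folds) if j != i for u in fold]
        yield training, validation


# Hypothetical: 100 training engines, split by engine so that cycles from
# one engine never appear in both the training and the validation fold.
engine_ids = range(1, 101)
splits = list(kfold_splits(engine_ids, k=10, seed=0))
```

Repeating the procedure with different values of `seed` gives one way to realize the repeated shuffling described above, with the reported metric taken as the average over repetitions.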

B. EXPERIMENT 2: INFLUENCE OF THE NUMBER OF THE LSTM LAYERS
As shown in the literature, deep stacked LSTMs often obtain better prediction accuracy than shallower models. However, simply stacking more LSTM layers works only up to a certain number of layers, beyond which the network becomes slow and difficult to train, likely due to the exploding and vanishing gradient problem [78]. This experiment is designed to illustrate the relationship among the performance, the computational complexity and the number of layers in our proposed deep LSTM-Fusion architecture for RUL estimation. In Figures 8 and 9, the light orange region shows the standard deviation of the prediction errors.
As shown in Figures 8 and 9, the experimental results indicate that stacking more LSTM layers can achieve better prediction accuracy (smaller standard deviation values in the light orange region); however, diminishing returns can be observed when more than 4 LSTM layers are stacked. Specifically, Figure 4 shows the relationship between the averaged standard deviation of the RUL prediction error and the computational complexity. In Figure 4, the blue bars show that the accuracy of the RUL prediction cannot be improved further when the number of LSTM layers is higher than 4, and the computational complexity increases linearly with the number of layers stacked in the proposed architecture.
In summary, the experimental results indicate that the optimal number of LSTM layers for the proposed architecture is 4, and this number achieves a good balance between the RUL prediction accuracy and running time. It is difficult to obtain benefits by stacking more than 4 layers in the proposed method.

C. EXPERIMENT 3: INFLUENCE OF THE COMPRESSION/DISTORTION TO THE PROPOSED METHOD
This experiment is designed to evaluate the reliability and robustness of the proposed LSTM-Fusion architecture in a modern practical scenario that deploys lossy data compression: the bandwidth and storage requirements are reduced by using the state-of-the-art vibration signal encoder proposed in [3] to compress the raw vibration signal. Seven distortion levels are created by compressing the vibration signal with different bitdepth representations, from 4-bitdepth (high distortion, coarse accuracy) to 16-bitdepth (low distortion, high accuracy). All of the compression parameters and conditions have been set as originally presented in [3].
Figure 5 shows the compression ratio (CR) results and the corresponding signal-to-noise ratio (SNR) values for the four evaluated datasets. As shown in the figure, the highest bitdepth representation (16-bitdepth) obtains better precision in the vibration signal and thus achieves approximately 80 dB reconstruction accuracy, at the cost of a lower compression ratio. On the other hand, the 4-bitdepth compressor achieves less than 30 dB reconstruction quality, but compresses the data significantly, achieving compression ratios of up to 1607:1 and 1721:1 on the FD001 and FD003 datasets respectively.
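The trade-off between bitdepth and reconstruction quality can be illustrated with a minimal sketch. It does not reproduce the encoder of [3]; a plain uniform scalar quantizer is used as a stand-in, applied to a synthetic signal, so the resulting SNR values are not comparable to those in Figure 5. The function names are illustrative.

```python
import math


def quantize(signal, bitdepth):
    """Uniformly quantize `signal` to 2**bitdepth levels over its own range."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return list(signal)
    step = (hi - lo) / (2 ** bitdepth - 1)
    return [lo + round((x - lo) / step) * step for x in signal]


def snr_db(original, reconstructed):
    """Signal-to-noise ratio of a reconstruction, in decibels."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


# Synthetic stand-in for a sensor signal (not C-MAPSS data).
signal = [math.sin(0.05 * n) + 0.3 * math.sin(0.31 * n) for n in range(2000)]

# Seven distortion levels, from 4-bitdepth to 16-bitdepth in steps of 2.
snr_by_bitdepth = {b: snr_db(signal, quantize(signal, b)) for b in range(4, 17, 2)}
```

For a uniform quantizer the SNR grows with the bitdepth, which mirrors the qualitative trend described above; the compression ratio additionally depends on the entropy coding stage of the actual encoder, which this sketch omits.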
In order to evaluate the influence of compression/distortion on the proposed method, the averaged prediction errors (RMSE) over the last 50 cycles of the test datasets have been calculated for the different bitdepth representations, as shown in Figure 6. In general, the averaged prediction error decreases as the bitdepth increases. The prediction error is reduced significantly when the bitdepth increases from 4-bitdepth to 10-bitdepth. However, the prediction error is largely unaffected when the bitdepth is higher than 10. This means that the proposed method can model the health status and estimate the RUL of the machine accurately from 10-bitdepth compressed/distorted sensor data. We can also observe that the compressibility of the datasets is not constant. Accurate RUL estimation at 10-bitdepth is obtained on distorted/compressed data with approximately 60.24 dB signal-to-noise ratio (SNR) and a 14.83:1 compression ratio on the FD002 and FD004 datasets. For the FD001 and FD003 datasets, the proposed method can estimate the RUL accurately at 44.45 dB and 46.12 dB SNR with 340.4:1 and 247.9:1 compression ratios respectively. This may be caused by the higher complexity of the data in FD002 and FD004 (see Table 1), which contain six operating conditions, while only one operating condition is present in FD001 and FD003. In addition, Figure 7 illustrates the absolute difference between the RUL estimates obtained with compressed and raw data. A similar effect to the previous result can be observed in Figure 7, where the difference between the estimated RUL on compressed and raw data is reduced as the precision of the compression process increases. Due to the larger number of operating conditions in the FD002 and FD004 datasets, the 16-bitdepth result in Figure 7 shows a higher difference on these two datasets than on the others. The experimental results indicate that the proposed approach is able to robustly predict the RUL under challenging situations in which data has been dramatically compressed/distorted due to limited bandwidth and storage requirements in the condition monitoring system.

VI. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a deep neural network approach for machinery health status modeling to perform remaining useful life (RUL) estimation. The proposed LSTM-Fusion neural network architecture addresses open challenges in the area of condition-based maintenance and RUL estimation for rotational machines. The contributions of this paper can be summarized in the following four points:
• Firstly, the proposed LSTM-Fusion architecture can fuse multi-sensor data with variable time window sizes, which allows the neural network to capture the short-range (local) and long-range (global) characteristics of the data from multiple sensors. The experimental results show that the proposed LSTM-Fusion architecture achieves the best RUL estimation performance and outperforms seven other state-of-the-art methods.
• Secondly, Experiment 2 has been designed to explore the optimum number of layers of the proposed LSTM-Fusion architecture. The experimental results indicate that stacking more LSTM layers can achieve better RUL prediction accuracy; however, the computational complexity increases linearly. Moreover, diminishing returns in the RUL estimation performance can be observed when the number of layers is higher than 4.
• Thirdly, the proposed method is capable of achieving robust health status modeling and RUL estimation on both raw and compressed datasets. In particular, the experimental results presented in Section V-C indicate that the proposed method can accurately estimate the RUL with distorted/compressed data which has up to a 340.4:1 compression ratio with 44.45 dB signal-to-noise ratio on the FD001 dataset. This enables remote prognostic and condition monitoring applications to be deployed with lower bandwidth and storage requirements.
• Fourthly, the proposed novel end-to-end deep LSTM-Fusion architecture for RUL estimation does not require prior knowledge of condition-based health monitoring and does not rely on any empirical assumptions, such as the "piece-wise linear RUL target function" required in [53], [77].

FIGURE 4. Averaged standard deviation of RUL prediction error (blue bar), Normalized computational complexity (red bar) and error to runtime ratio (green bar).

FIGURE 5. The rate-distortion performance (Compression Ratio (CR) vs. Signal to Noise Ratio (SNR)) of four datasets. The distortion levels have been generated by different bitdepth representations of the dataset, e.g. higher bitdepth representation achieves better SNR with lower CR, but lower bitdepth (4-bitdepth) achieves higher CR with lower SNR.

FIGURE 6. The average prediction error with variable bitdepth compression (solid lines) and uncompressed raw data (dashed lines) of four datasets respectively.

FIGURE 7. The absolute difference of the RUL estimation on compressed data and raw data.

Figures 8 and 9 show sample results for different numbers of stacked layers in our proposed architecture, ranging from 1 to 6. The x-axis is the actual RUL of the test engine and the y-axis is the predicted RUL. The green line indicates the ground truth RUL values for test engine ID 49 of the example test dataset FD001. The blue curve indicates the averaged RUL prediction values of the proposed LSTM-Fusion architecture. The light blue region indicates the standard deviation of all RUL predictions at each time stamp. The orange curve indicates the difference between the predicted RUL values (blue curve) and the ground truth (green line).

FIGURE 8. Performance evaluations of stacking LSTMs from 1-3 layers with proposed LSTM-Fusion architecture, including averaged RUL prediction (blue line), averaged prediction error (yellow line) and ground truth (green line). Transparent regions are the corresponding error bands.

FIGURE 9. Performance evaluations of stacking LSTMs from 4-6 layers with proposed LSTM-Fusion architecture, including averaged RUL prediction (blue line), averaged prediction error (yellow line) and ground truth (green line). Transparent regions are the corresponding error bands.
Dropout is also applied after each LSTM layer to control overfitting (rate = 0.2). The final layer is a dense output layer with a single unit and linear activation for RUL prediction. This architecture can be extended to handle additional time series signals from multiple sensors.
Here, $\{x_{*}\}_j$ denotes the sensor signals with variable window size $T$, and the above operations are performed for $t \in [1, T]$. In particular, for the neural network in our proposed LSTM-Fusion architecture we use multiple LSTMs of size 100 with sigmoid gate activation and tanh activation for the hidden representation. The fully connected fusion layer uses tanh activation and outputs a 100-dimensional vector.

TABLE 2. Performance comparisons of the proposed method and state-of-the-art methods on the C-MAPSS dataset.