A Distributional Perspective on Remaining Useful Life Prediction With Deep Learning and Quantile Regression

With the rapid development of information and sensor technology, the data-driven remaining useful lifetime (RUL) prediction methods have been acquired a successful development. Nowadays, the data-driven RUL methods are focused on estimating the RUL value. However, it is more important to quantify the uncertainty associated with the RUL value. This is because increasingly complex industrial systems would arise various sources of uncertainty. This article proposes a novel distributional RUL prediction method, which aims at quantifying the RUL uncertainty by identifying the confidence interval with the cumulative distribution function (CDF). The proposed learning method has been built based on quantile regression and implemented from a distributional perspective under the deep neural network framework. The results of the run-to-failure degradation experiments of rolling bearing demonstrate the effectiveness and good performance of the proposed method compared to other state-of-the-art methods. The visualization results obtained by $t{\text{-}}SNE$ technology have been investigated to further verify the effectiveness and generalization ability of the proposed method.


I. INTRODUCTION
P ROGNOSTICS and health management (PHM) techniques play a vital role in the condition-based maintenance of large industrial equipment, which could prevent unexpected failure and reduce downtime to achieve the purposes of saving the maintenance cost, maximizing the working time, safety, and reliability [1], [2]. The remaining useful lifetime (RUL) prediction is generally known as estimating the time before the machine completely fails and is used to support PHM in producing reasonable maintenance plans and strategies [3], [4], [5], [6]. Therefore, it is imperative that the RUL prediction method provides a precise estimation. However, the prediction of future conditions and precise RUL are significantly challenging from the massive data obtained from the operating systems due to the increasing complexity of industrial equipment.
As one of the advanced RUL methods, the model-based methods have proved their effectiveness in many studies [7], [8], [9]. They rely on accurate physical laws and knowledge of the degradation process for the machine system. However, it is very difficult to build an accurate physical degradation model in many industrial applications. The performance, robustness, and generalization of the model-based prediction model would be significantly decreased because of the more and more complexity of the modern industry [2], [4].
Meanwhile, the current real-world industrial system has been turned into a data-rich environment with the development of the Internet of Things (IoT). The data-driven methods of RUL prediction have been widely investigated and applied based on massive monitoring data, for instance, vibration, current, temperature, etc. As one of the powerful data-driven approaches, deep learning has been proposed to develop RUL prediction algorithm in recent years and achieve great results, for instance, deep neural network (DNN) [10], deep belief network (DBN) [11], [12], convolutional neural network (CNN) [13], [14], [15], [16], long short-term memory (LSTM) [17], [18], [19], [20], and their combinations [21], [22]. Moreover, some special deep models, such as the deep adversarial neural networks [23] and normalizing flow-embedded sequence-tosequence model [24], have been developed for estimating RUL. Differently from the model-based approaches, the degradation models within these data-driven methods are independent of the prior knowledge and they could be learned from the available data. Therefore, they are easier to use and capable of handling the large industrial machine prediction problem, whose degradation model is too complex to establish.
Nevertheless, these methods focus on estimating the RUL value without considering the various types of uncertainty inherent in the degradation process. These uncertainties need to be carefully considered in order to arrive at reliable and accurate predictions of the RUL values. This is always the case for RUL prediction in industrial applications [25]. Therefore, quantifying the uncertainty of RUL would be as important as estimating the RUL value. Moreover, predicting RUL with confidence interval could support human-expert making maintenance decisions more comprehensively. Recently, there are limited data-driven approaches considering the uncertainty in the RUL prediction process. Wang et al. [26] proposed to use variational inference for quantifying the uncertainty of RUL prediction after training the recurrent CNN. Zhao et al. [27] built a probabilistic RUL prediction framework and used the probability density function (PDF) as the quantified uncertainty. Pang et al. [28] proposed a Bayesian inference model for updating the posterior distributions of model parameters and calculating the confidence interval for uncertainty.
However, the existing approaches cannot directly quantify the uncertainty of RUL, and they still need to implement further processing in order to identify the uncertainty interval. In traditional data-driven RUL learning algorithms, the primary focus is to train the model based on the RUL value as the label, but it discards the label information of the RUL uncertainty. The main motivation of this article is to introduce a new data-driven method that takes full advantage of utilizing RUL uncertain information to support the learning procedure. Inspired by the idea that there are more benefits to learning an approximate distribution rather than an approximate value [29], we propose a new distributional remaining useful life prediction method with a DNN and quantile regression. The proposed method could directly output the cumulative distribution function (CDF) and calculate the confidence interval for estimating the uncertainty of RUL. The main contributions are summarized as follows.
1) The distributional learning method has been proposed for distributional RUL prediction and implemented by quantile regression optimization. The quantile regression loss has been deduced following the theory of the Wasserstein metric, which is designed to calculate the divergence of inverse CDF between parameterized and target distribution. 2) The proposed distributional RUL prediction method has been implemented by using the deep learning framework with the learning method from a distributional perspective. With the support of the quantile distribution and Dirac delta function, it could directly quantify uncertainty and calculate the confidence interval.
3) The novel quantile Huber loss (QH-loss) function has been designed and utilized to optimize the proposed deep model of distributional RUL prediction by the stochastic gradient descent (SGD) method. It combines the advantages of quantile regression loss and Huber loss. The comparison experiment demonstrates that QH-loss is outperforming the typical MSE and mean absolute error (MAE). 4) The effectiveness and performance of the proposed method have been verified using the run-to-failure degradation experiment of rolling bearings under different working conditions. The visualization results demonstrate the high generalization ability for the proposed method with different feature mapping.

II. THEORETICAL FRAMEWORK OF PROPOSED METHOD
First, we define the new distributional RUL prediction problem. To solve this problem from the data-driven perspective, we deduce the learning method from a distributional perspective according to quantile regression optimization. The Wasserstein metric is the theoretical basis for constructing the quantile regression.

A. DEFINITION OF DISTRIBUTIONAL RUL PREDICTION
For the classical RUL prediction, the theoretical formula is where Y denotes the ground-truth RUL value, X denotes the observation, and f denotes the prediction function with the parameter θ . The output of (1) is the RUL value. However, we expect to output the quantified uncertainty in this work. Therefore, we redefine a new distributional RUL prediction, which could directly output RUL uncertainty as a distribution function. The specific definition is to replace the RUL value Y with a certain distribution of Z whose expectation is the value Y This equation also defines that such distribution Z could be characterized by the conditional distribution function F with parameter θ .

B. WASSERSTEIN METRIC
The Wasserstein metric is used as the evaluation method for two different distributions, which has the mathematic property of continuous and differentiable almost everywhere [30], [31], [32]. It is the basic theory for deducing the quantile regression loss function. Müller [33] took it as the metric of L p on inverse CDF (inverse CDF). Therefore, the Wasserstein metric W p can be defined as where for a random variable Y, the inverse CDF F −1 Y of q is expressed as follows: where F Y (y) = Pr(Y ≤ y) is the CDF of random variable Y.

C. LEARNING FROM DISTRIBUTIONAL PERSPECTIVE
The proposed method is going to predict the quantiles of the target distribution, where q i = 1/N, for i = 1, . . . , N. So, it is called a quantile distribution Z Q , which has a fixed N. The discrete values derived from its CDF are τ 1 . . . , τ N , where τ i = (i/N) for i = 1, . . . , N, these also represent the cumulative probabilities with a certain distribution. Then, the quantile distribution Z θ ∈ Z Q projects the observation x to a probability distribution supported by the parameterized model θ i (x), which is defined as where δ z is a Dirac delta function at z ∈ R. This reformulation allows us to learn the distributional prediction model by using the Wasserstein metric and implement it by quantile regression [34].

1) QUANTILE PROJECTION
The distributional learning is projected to a parameterized quantile distribution optimization, it quantifies the projection of a random distribution Z ∈ Z into Z Q , which is expressed as arg min Assume that Y is the bounded target distribution and U is a quantile distribution based on the Dirac delta function, shown in (5), with the support z 1 , . . . , z N . Accordingly, the Wasserstein metric is When Quantile regression or conditional quantile regression could approximate the quantile function of distribution or conditional distributions, which is an effective method to solve the distributional learning problem [35]. The parameterized distributional model would be trained by minimizing the quantile regression loss, which is defined as . This objective function is convex and asymmetric, which could control overestimation errors with weight coefficient τ and underestimation errors with 1 − τ during the training procedure [34].

III. IMPLEMENTATION OF PROPOSED METHOD WITH DEEP LEARNING FRAMEWORK
In this section, we implement the distributional RUL prediction method according to the theoretical framework of distributional learning which is proposed in Section II. The overall architecture of our proposed distributional RUL prediction model is shown in Fig. 1. The data-driven RUL prediction framework is generally divided into three steps after acquiring the data: 1) constructing the labels based on the health indicators; 2) training the prediction model of RUL with the labeled data; and 3) testing the performance of the optimized model by using the unseen data.
In the first step, the root mean square (RMS) feature is selected as the health indicator, and then the first prediction time (FPT) is determined. The labeling for the normal period is constant, and the linearly decreasing function is built for labeling the degradation period. The training step uses the data from the degradation period of vibration and their labels. The DNN is used as the estimating function to directly output the quantile distribution. The new QH-loss has been proposed to optimize the parameter of the DNN. After the training process, the total new unseen test data, including the normal and degradation period of vibration data are directly put into the deep model. As the outputs are the quantile distributions, we can directly obtain the RUL confidence interval for every time period.

A. FPT DETERMINATION AND RUL LABELING METHOD
In order to construct the labels of bearing data, the FPT should be first determined. In this work, we follow the simple and effective way proposed in [16] to calculate the FPT. The mean μ and standard deviation δ are calculated from the early normal period of the RMS of each vibration sample. The FPT is confirmed when the feature RMS value f t − μ successive outside the 3δ, which can be expressed as follows: where i = 0, 1, 2 and only if all these sequence features f t+i satisfied (9), the current time t would be set as FPT.
As shown in Fig. 2(a) and (b), the FTP is determined to be 74 based on the proposed method for the original vibration signal of the overall period of the lifecycle.
Once the FPT has been decided, the ground-truth RUL label for the overall life cycle could be shaped in a segmental linear function, which is shown in Fig. 2(c). It consists of two periods: 1) the normal period is a constant and 2) the degradation period is a linearly decreasing function, which represents the life percentage of a machine. As one of the RUL labeling methods, it is a simple and effective way and has been proposed in many studies [13], [16], [36], [37]. For this kind of labeling method, only the labeled degradation date is used for training the prediction model. It should be noticed that the FPT could affect the performance of the prediction model. Therefore, comparing experiment has been carried out in Section IV-D for demonstrating the effects.

B. PROPOSED NETWORK STRUCTURE
After obtaining the ground true RUL labels of training data, the DNN has been proposed as the foundation of the distributional RUL prediction model. As shown in Fig. 1, the labeled data is used for training the proposed model, then the model would be directly tested by the unseen data. There are two parts in the network structure, including featuring mapping with parameter θ FM and predictor with parameter θ PD . The feature mapping is designed to extract the essential features of original sequence data. These features are learned during the training process and presented in high-level representation in each layer. The basic block is the convolution layer [38] or its advanced variants [39], [40]. The performance and effects of different kinds of basic blocks have been compared and analyzed in Section IV-E. Meanwhile, the predictor is designed to output the RUL distribution directly, and it consists of two linear layers and the ReLU activation function. Different from the typical RUL structure based on a neural network taking one node in the final layer, we propose to take N nodes in the final layer,

C. LEARNING PROCEDURE
During the training procedure, the SGD is used to optimize the parameters θ FM and θ PD in the proposed DNN structure.
In this work, a new QH-loss is proposed and built based on the quantile regression loss of (8) and typical Huber loss [41]. The Huber loss is defined as then the QH-loss has been designed as the combination of the Huber loss and quantile regression loss, that is However, the ground-truth RUL label y is a value, but a distribution function should be the target distribution for calculating the QH-loss. As a result, we design the CDF of target distribution F Z (z|y) based on y by using the Gaussian distribution function, which is Finally, QH-loss L QH for the proposed distributional RUL prediction network can be calculated by where M denotes the feature mapping, P denotes the predictor, and then P(M(τ i |x)) denotes the output of quantile distribution at every quantileτ i . The learning procedure of our distributional RUL prediction model has been summarized in Algorithm .

D. MEASURE INDICATORS
In this article, the prediction performance of the RUL model is quantitatively evaluated using the MAE, RMS error (RMSE), and R 2 score, which are where Q is the number of testing samples,ŷ i denotes the predicted RUL value, y i denotes the true value of RUL (label), andȳ denotes the mean of all the true RUL value.

A. DATA SET DESCRIPTION
The data set was acquired from the testbed built by the Xi'an Jiaotong University (XJTU), which is designed for the degradation experiment of rolling bearing [42]. As shown in Fig. 3, the testbed is composed of a controller for motor speed and loader force and a test bench body with motor, spindle, support, experimental bearing, and loader. The test bearing is installed on the outside of the test bench and the loader is directly forced on the outer race of the bearing. The vibration signal across the whole life cycle of the test bearing is collected by two accelerometers (PCB 352C33) which are located on the house of the bearing at 12 and 9 o'clock positions. The experiments have been conducted under three different conditions, the bearing type is LDK UER204 ball bearing. The detailed pieces of information of all experiments and their actual lifetimes are summarized in Table 1. The vibration data of each run-to-failure bearing experiment has been recorded every minute, the sampling frequency is 25.6 kHz and the length of each sample is equal to 32 768 (approximate 1.28 s). Three bearings of the overall life cycle vibration signal for three different conditions are shown in Fig. 4.

B. DATA PREPROCESSING 1) NORMALIZATION
As shown in Table 1, the experimental actual lifetimes indicate that there are significant variations even under the same working conditions. To reduce a certain amount of difference, we first use the z − score method to normalize the overall  life cycle vibration signal for each bearing experiment, the normalization equation is defined as follows: wherex t i and x t i are the normalized vibration signal and original vibration signal at time t of the ith set of experiment, respectively; μ i and δ i are the mean value and standard deviation of i th set of experiment, respectively. Furthermore, it is obvious that we could not directly use the actual lifetime values as the target labels because the huge gap between the actual lifetimes varies from 42 to 2496 min. Therefore, we normalize the actual lifetime at every collection point in the range of [1, 0] by the following equation: where y t i denotes the normalized ground-truth RUL value; and ActLife i is the actual lifetime of i th set of experiment. In this work, we use the normalized y t i as the label for training the proposed model. Then, the predicted RUL value PredRUL t i can be calculated by whereŷ t i is the predicted normalized RUL value from the deep model.

2) FPT DETERMINATION
Based on the proposed FPT determination method, we calculate the FPT for each bearing of three different conditions. The RUL is calculated by using the actual lifetime minus FPT. The results for all experiment scenarios are shown in Table 2, and the unit is minutes. Moreover, the average time of normal and degradation period for three conditions has been computed, the results are displayed in Fig. 5, which indicate the average lifetime of the experimental bearing will be significantly influenced by the normal period by the varying force and speed.

C. IMPLEMENTATION DETAILS
The proposed DNN structure is shown in Table 3, which consists of feature mapping and predictor. In the predictor, there are two layers, the first one is the linear layers with 128  nodes following the ReLU function and another linear layer with N τ nodes is the direct output of the quantile distribution of the Dirac delta function. The CNN-base is selected as the feature mapping, which has five convolution blocks proposed in [43]. This structure is used in the following experiment section for demonstrating the proposed distributional RUL prediction. Other kinds of feature mappings are included in Tables 7-9 of the Appendix and the state-of-the-art methods have been tested and analyzed in Section IV-E. In the present work, each sample has 2560 points drawn from the original vibration signal at each minute, which has been considered as the input of the proposed model. The sampling method of the vibration signal is following [44] to resegment each original vibration with certain overlapping. Meanwhile, only the labeled date of the degradation period is used for training, but the whole period of unseen data would be tested and analyzed.
All the experiments are tested on a computer with one Nvidia GeForce GTX 2060 GPU, one Intel Core i7-10750 H CPU of 2.60 GHz, and 16 GB of memory. The default values of hyperparameters are given in Table 4.

D. DISTRIBUTIONAL RUL PREDICTION USING PROPOSED METHOD
In order to demonstrate the performance of the proposed method, three experiment scenarios have been decided to test the distribution RUL prediction model first. As shown in Table 5, Bearing-1 of each condition is selected as the unseen data for testing, while the rest of the Bearings 2, 3, and 4 are set to the training data.

1) RESULTS AND ANALYSIS
The results of prediction RUL distribution are shown in Fig. 6, and the quantified indicators of three scenarios can be found in Table 6. In Fig. 6, the color of each time step is determined by the output of the proposed model. The blue represents the normal condition of the bearing, while the red means the bearing is close to the end of its lifetime. The visualization results of distributional RUL indicate that the proposed method can predict not only the RUL value but also the uncertainty at each time step with high performance. All vibration data in the normal period are predicted with a high RUL confidence interval, especially for the second and third scenarios, which is shown in Fig. 6(b) and (c). For the degradation period of the unseen testing data, the learned deep model can directly identify the decreasing process without any support from data and its label within the testing scenarios. The prediction of the first scenarios shown in Fig. 6(a) has more fluctuations, the reason is that it was running in the harshest working condition. The prediction are more stable in Fig. 6(b) and (c). From the selected CDF prediction at the degradation process, it can be noticed that the confidence interval will reduce with the RUL value close to zero. Based on the results, it is obvious that the optimized model can predict the normal period accurately, the essential rule of degradation period has been learned from the other three training data sets, and the uncertainty reduces as the RUL decrease. The trained deep model did not see any data from the testing scenarios during the training process, but it still perform very well. This result demonstrates that the proposed method can learn the essential function under the same condition, which proves its generalization capability. Fig. 7 shows the QH-loss and the three measure indicators (MAE, RMSE, R 2 ) during the training procedure.  It indicates that the new QH-loss can effectively converge and the plausibility of using this loss as the evaluation criterion.

2) COMPARISON WITH DIFFERENT LOSS FUNCTIONS
In order to further analyze the proposed QH-loss, the typical MSE and MAE loss and QH-loss with different N τ are selected for comparison. When using MSE, MAE, and QHloss (1), only one linear node is set to the last layer, while the QH-loss (33) and (97) mean that 33 nodes and 97 nodes are in the last layer, respectively. We use R 2 as the comparison index because it is a normalized index only reflecting the performance but would not vary with the number of the actual lifetime. The results are shown in Fig. 8, which demonstrate that the QH-loss performs significantly better than MSE and MAE under the same structure, and more N τ can not only predict the distributional RUL but also lead to better performance.

3) COMPARISON WITH DIFFERENT FPTS
The proposed method could be obviously affected by the FPT. The reason is that we only use the degradation period data to optimize the deep model and the FTP determines the start point of the degradation period. Therefore, the data of three unseen test scenarios are selected as examples for studying the impact of FPT. The current FPT for this situation is following Table 2. Based on the current FPT, we move the FPT proportionally between the normal period and the degradation period, then retrain the model and calculate the R 2 for the test data. The results are shown in Fig. 9(b) and (c), which indicate that the proposed FPT determination is reasonable and the performance of the proposed model will drop dramatically with moving away from the current FPT. Meanwhile, the FPT shows no major effect on the performance during the normal period of the first scenarios, shown in Fig. 9(a). However, the FPT is still sensitive during the degradation process.

E. COMPARISON WITH OTHER DATA-DRIVEN METHODS AND DIFFERENT FEATURE MAPPING STRUCTURES
In this section, more experiments have been conducted for comparison and analysis. While one bearing is selected for testing, the rest three bearings' data under the same condition are used for training.
For comparison with the state-of-the-art data-driven RUL prediction methods, TFR-CNN [14] combines the wavelet transform (WT) processing the degradation data with multiscale CNN and MEF-CNN [16] combining short-time Fourier transform (STFT) with multiscale CNN, have been selected. Meanwhile, following the typical data-driven RUL prediction research [13], we also select DNN and LSTM as the benchmark method for a fair comparison. Comparing the results shown in Table 6 between the proposed model of the CNN-base structure and the benchmark methods, it is obvious that our proposed method is significantly superior according to the three quantified indicators.
To further study the advantages and disadvantages of the proposed method, other three different feature mapping structures have been selected compared with the CNN-base structure. The detailed structures of CNN-3, ResNet-5, and DenseNet-5 are summarized in Tables 7-9, respectively. CNN-3 has three convolution basics, two blocks less than CNN-base. ResNet-5 and DenseNet-5 have five blocks similar to CNN-base, but they are following the structure of ResNet [39] and DenseNet [40]. The quantified results are shown in Table 6, which illustrates the proposed method need feature mapping to have enough complexity of highlevel representation so that the essential features could be learned during the training process and the obtained model can perform outstandingly. The results demonstrate that CNN-base has obtained good performance. Compared with ResNet-5 and DenseNet-5, it has a more simple structure and fewer parameters. Therefore, the training speed will be much quicker using the same computer hardware.

F. FEATURE VISUALIZATION ANALYSIS
For further investigating the effectiveness of our proposed method and explaining the performance of various feature mappings on the run-to-failure bearing experiments, the features of the linear layer before the output layer are used for visualization. We use the t-SNE [45] technique to compress these high-dimensional features into 2-D visual images.
The test scenario of bearing-1 of condition-2 has been chosen as the supporting example for investigation. The visualization results are shown in Fig. 10, the color bar is changing from 0 (red) to 100 (blue), which represents the normalized ground-truth RUL label y t i but it has been adjusted from [0,1] to [0,100] by multiplied 100 and rounded to integer in order to better view. The various feature visualization results of training data, shown in Fig. 10(a), (c), (e), and (g), demonstrate that the models have been trained well enough because the labeled features are changing from 100 to 0 regularly and there is no obvious misgrading. The high generalization ability of the proposed method has been proved by the results of the unseen test scenario, shown in Fig. 10(b), (d), (f), and (h). Although the learned model never sees the test data during the training process, it can still recognize the various degree of degradation clearly. These visualization results confirm the quantified results in Table 6, verifying the outstanding effectiveness and generalization ability of the proposed method. The visualization figures of training show the rule of gradual change, which indicates that the deep model is capable of learning the essential features, otherwise, the feature will be mixed together irregularly. This proves the effectiveness of the proposed method. The test visualization figures demonstrate a similar rule to the training figures, which verifies that the learned model has good generalization ability and can effectively predict unseen testing scenarios.

V. CONCLUSION AND DISCUSSION
In this article, a distributional RUL prediction method has been proposed, which is capable of directly estimating the RUL uncertainty using the DNN framework and quantile regression loss of distributional learning. The proposed approach has been verified in the real  the novel QH-loss can quickly converge and lead the optimization to obtain the best model. 2) In comparison with the state-of-the-art data-driven RUL prediction methods, the proposed model shows better performance and generalization capabilities. 3) In order to obtain better performance, more convolution blocks or the advanced ResNet block and DenseNet block should be used to construct the DNN framework. 4) The feature visualization can prove that the model has been trained well. These results also verify the outstanding performance and generalization ability of the proposed method for the unseen test scenario. The performance of the proposed method has been verified by a series of experiments and comparisons. However, there are still certain future directions that can be further investigated to promote and expand the current research, which is summarized as follows.
1) The FPT has been proved as one of the impact factors for the performance of the proposed method. However, the current work only uses a simple and effective way, and it can possibly be improved by more advanced FPT determination and RUL labeling method in the future to calculate a more reasonable FPT for the runto-failure bearing experiment. 2) The impact of different kinds of feature mapping structures has been studied. However, the constructed way of building the optimal structure should be further investigated when the data information is unknown in advance. 3) Currently, our proposed method has been tested in the unseen test scenario, but the working condition of training and test data is still consistent. To further promote the generalization ability, we will extend it to different working conditions.

APPENDIX
The structures of CNN-3, ResNet-5, and DenseNet-5 that have been designed for making a comparison with different feature mapping structures are shown in Tables 7-9, respectively. The comparison results and analysis are summarized in Section IV-E.