Remaining Useful Life Interval Prediction for Complex System Based on BiGRU Optimized by Log-Norm

The task of remaining useful life (RUL) uncertainty management is the major challenge in solving the failure of the complex mechanical system. Primary research methods use statistical models or stochastic processes to fit the distribution of historical degradation data. However, it is difficult to accurately capture the degradation information of monitoring big data through statistics in practice. In this paper, the prediction interval (PI) obtained by the proposed feature attention-log-norm bidirectional gated recurrent unit (FA-LBiGRU) model is adopted to quantify the prediction uncertainty of RUL. Initially, the critical feature vectors are extracted from multi-dimensional, nonlinear, and large-scale sensor signals using the feature attention mechanism. Additionally, the BiGRU network is used to model and learn the time-varying characteristics of the attention-weighted features from the forward and backward directions, and the network parameters are trained by the maximum log-likelihood loss function. Ultimately, the probability density function based on the lognormal distribution is calculated to measure the uncertainty of the equipment RUL. The effectiveness of the proposed method is verified through the well-known benchmark data set of the turbofan engines provided by NASA. The experimental results show that the proposed methods can obtain higher point prediction accuracy for the complex system compared with state-of-the-art approaches and high-quality PIs satisfying real-time requirements.


I. INTRODUCTION
The nonlinear coupling between systems and the uncertainty in the operation process increases the possibility of complex mechanical systems degradation leading to failure, as the structure and function of complex mechanical equipment grow into increasingly abundant ones. As one of the most challenging technologies to address these difficulties, the prediction of Remaining Useful Life (RUL) is an essential basis for evaluating the reliability of mechanical systems [1], ensuring the safe and reliable operation of significant equipment [2], reducing the massive cost of scheduled The associate editor coordinating the review of this manuscript and approving it for publication was Fu-Kwun Wang . maintenance [3] and provide technical support for predictive maintenance [4].
RUL prediction forecasts the remaining time for equipment to decline to the failure threshold by analyzing mechanical equipment's historical performance degradation process and dynamically anticipating future changes in equipment health state. Three prediction methods for prognostics are mainly: model-based, data-driven, and hybrid methods [5]. The model-based and hybrid methods can meticulously depict the system degradation process and establish a precise mathematical model derived from known mechanics principles to examine the equipment performance degradation process [6]. However, the sizeable mechanical system represented by the turbine engine [7] has a complex mechanical VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ structure, poor task environment, and diverse failure modes, leading to an extremely high cost to establish an accurate physical failure model and, thus, a challenge to deploy in practical application. Compared with the above methods, the data-driven methods can sufficiently dig up historical monitoring sensor data during machinery operation, constructing a RUL prediction model from derived degradation features. Under such circumstances, data-driven methods have become a research hotspot with the advantage that they can refrain from prior knowledge requirements and the modelling process of the complex mechanism [8], [9], [10]. Data-driven prediction models are mainly divided into Machine Learning (ML) and statistical models. The ML methods represented by Deep Learning (DL) realize end-toend RUL prediction by establishing the mapping relationship between monitoring data and failure time. Recurrent Neural Network (RNN) under DL, particularly Long-Short Term Memory (LSTM), presented by Mamo and Wang [11], and Gated Recurrent Unit (GRU), presented by He et al. [12], is suitable for processing time series via remembering historical information in the monitoring data. Also, they can solve the problems such as unobvious long-term features due to traditional shallow learning and gradient disappearance. Zhang et al. [13] investigated lithium-ion battery RUL prediction based on LSTM to solve the problem that existing techniques cannot effectively learn the long-term dependencies among capacity degradations. Reference [14] introduced a transfer learning method based on bidirectional GRU (BiGRU) to adapt RUL estimation of bearings under different working conditions, which efficiently recognized different running states of bearings and obtained corresponding training labels. In [15], LSTM is utilized for prognostics with the degradation sequence of equipment, which takes the operation state indicator and concatenated equipment characteristics as input. In short, the DL methods can thoroughly grasp the information contained by time series to more accurately predict RUL without the exact physical model and pre-knowledge. Nevertheless, these methods can only provide point-wise estimation rather than interval estimation of RUL, which is challenging to quantify the probability distribution of uncertainty in the RUL prediction process and increase the risk of subsequent maintenance decisions in health management.
In contrast, the statistical method utilizes a statistical model or random process to distribute the fitting of historical equipment failure data. The probability density function (PDF) of RUL is constructed using parameter estimation, and then the prediction intervals (PIs) of RUL are achieved by carrying out statistical analysis of real-time sensor data. Typically, to take nonlinearity and uncertainty of RUL prediction into account, representative univariance statistical methods, e.g., wiener and gamma degradation process, are developed via parameters updating of probability distribution [16], [17], [18]. Nonetheless, with the rapid development of sensors and information technology, various networking sensors are usually used to grasp multi-source data to describe the operating status of machinery comprehensively. Performance degradation indicators that characterize system health status are often not unique, showing the features of multiple degradation variables. Hence, to further constructs a 1-D health index (HI), the method of transforming multidimensional data into one-dimensional data has been widely studied. Lyu et al. [19] and Li et al. [20] proposed data fusion methods via automatically selecting and combining multiple sensor signals to better characterize the degradation process. In [21], to address the drawbacks of sixteen original features without better trendability, Principal Component Analysis (PCA) is used for dimensionality reduction. Although information fusion methods can be implemented in the traditional univariate degradation modelling, it is tough to describe the coupling relationship of multidimensional data. Also, the degradation information of monitoring data is prone to be utilized insufficiently.
Therefore, the fusion RUL prediction methods based on ML and statistical models have been widely discussed by global scholars. For instance, the combination of a deep belief network and particle filter can forecast the uncertainty of RUL [22]. In [23], an improved echo state Gaussian process, which is able to provide an estimation of prediction uncertainty, is developed as a novel data-driven approach for RUL prediction and PIs construction. Moreover, Peng et al. proposed a Bayesian deep-learning-based (BDL-based) method for health prognostics with uncertainty quantification in [24]. Hu et al. [25] proposed a RUL prediction model based on the deep belief network for feature extraction and diffusion process for constructing PIs. Liao et al. [26] obtained the prediction uncertainty based on a classical bootstrapping approach with LSTM. Although the above methods can combine the advantages of both ML and statistical methods, it still exists following challenges that impede the accuracy and efficiency of RUL interval prognosis in mechanical systems: 1) The above studies in the literature can only determine the optimal index of multidimensional degradation characteristics through the index filtration technology for the random degradation model, i.e. monotonicity, trendability, and prognosability [27], in which the risk of missing critical information in conversion from nonlinear to linear is exposed. 2) The deep network feature extraction is independent of the stochastic model of constructing the degradation process, thus former is hard to match the latter. Consequently, this article, aiming to concomitantly settle these two challenges, proposes a novel end-to-end approach, named feature attention-log-norm bidirectional gated recurrent unit (FA-LBiGRU) model, which is presented to quantify the predictive uncertainty of mechanical systems. The main contributions of this paper are summarized as follows: 1) The feature attention (FA) mechanism is utilized to withdraw key feature vectors from multidimensional, nonlinear, and large-scale sensor signals in order to realize the dynamic change of input features.
2) A log-norm bidirectional gated recurrent unit (LBi-GRU) is developed for a fusion of ML and statistical models to construct PIs while predicting RUL. More concretely, using the BiGRU network to study the timevarying characteristics of attention-weighted features from forwarding and backward directions. Moreover, due to its flexibility, a novel maximum log-likelihood loss function based on the lognormal distribution (LN) is framed to train the network parameters. The remainder of the paper is organized as follows. Section II presents the theoretical background of the proposed methodology. In Section III, the proposed framework and its operation process are revealed. A case study on the C-MAPSS dataset is performed in Section IV, which is utilized to demonstrate the validity and predominance of the proposed approach. Finally, the work is summarized and concluding remarks are drawn in Section V.

A. FEATURN ATTENTION MECHANISM
In order to more comprehensively evaluate health status and performance degradation trend, monitoring data with multiple features often require comprehensive consideration for mechanical system working in complex environments. Further, these features contain diverse contributions to RUL predictions. Hence, FA mechanism, whose specific structure is shown in Fig. 1, is applied to select the key features of the original input data, which can effectively reflect the correlation between features and RUL prediction, and improve the feature extraction ability of the prediction model. In Fig. 1, T represents the time length of the time series, and F represents the features length. Firstly, the mean value of each time sample segmented by sliding window is obtained through Global Average Pooling (GAP) of the feature dimension, which is compressed from T × F to 1 × F, and the calculation process can be expressed as where, y f is the vector obtained through GAP, g(·) represents the global average pooling operation of feature dimension, X f is data at all time under f th feature, x f t is the data at t th time of the f th feature.
Then, the attention weights corresponding to each feature are quantified through the two fully connected layers, and the weights are normalized by sigmoid activation function. Its calculation function is where, W 1 , b 1 and W 2 , b 2 are the weights and offsets of two layers respectively, δ is the ReLU activation function, σ is the sigmoid activation function. Finally, normalized attention weight α f multiplied by the corresponding input data x to enhance or weaken the f th eigenvector, and a new attention weighted input x is obtained as follows:

B. BIDIRECTIONAL GATED RECURRENT UNIT
The main way to predict the RUL is to learn the spatiotemporal information of performance degradation from the sensor data of mechanical system. LSTM can make full use of time correlation of data to process time series data. However, its complex internal structure leads to long training time, which is not conducive to practical application deployment [28]. GRU optimizes LSTM by combining forgetting gates and input gates to reduce the number of gates and improve the training efficiency of the network while ensuring memory ability. Memory unit is the core of GRU network, which is composed of update gate, reset gate and candidate hidden layer. Its network topology is shown in Fig. 2. It can be seen from Fig. 2 that h t−1 , h t and h t+1 are the hidden states at time t−1, t and t+1 respectively, representing the memory information of the input sequence. x t−1 , x t and x t+1 are the input of memory units at moments t − 1, t and t+1 respectively. In addition, the arrow indicates the direction of information flow. The specific calculation process of each GRU memory unit is where, W r and b r , W z and b z , W n and b n are the weight matrix and bias vector of reset gate r t , update gate z t and candidate hidden layer n t , respectively. Furthermore, r t is responsible for controlling the memory that needs to be retained in the transmission process of the unit. When r t approaches 1, n t mainly contains the information composed of the current input x t and the hidden state h t−1 of the previous unit input. When r t approaches 0, then n t contains mainly x t . Inversely, z t controls the degree of selective abandonment of h t−1 . When z t approaches 1, the memory unit mainly outputs h t−1 of the previous moment. When z t approaches 0, the memory unit mainly outputs n t , that is, h t at the current moment.
In summary, GRU gets two key gating states through the hidden state h t−1 of the previous unit and the input x t of the current moment. The output vectors of r t and z t determine which information can be used as the final output of GRU. In order to deepen the level of feature extraction of the original time series and further improve the accuracy of the model output, two independent GRUs with different directions are superimposed together to form a BiGRU network. The specific structure is shown in Fig. 3. The output h f of the forward hidden layer from time 1 to time t is calculated by inputting x to the forward layer. The output h b of the reverse hidden layer from time 0 to time t is calculated by inputting x to the reverse layer. Lastly, the output results of the forward layer and the reverse layer are input to the full connection layer to obtain the final output h. The calculation process is where, f (·) is the mapping function of the full connection layer.

C. LOGNORMAL DISTRIBUTION
The reliability of mechanical equipment is usually measured by probability index, and its life is a non-negative variable, which is difficult to fit its trend through normal distribution. However, LN is one of the commonly used life distributions in the field of reliability, which can convert the life data with large difference in values into standard normal distribution after taking logarithm. Suppose that a random variable Y = ln(X ) obeys LN with mean µ and variance σ , then the random variable X satisfies the LN distribution with parameter (µ, σ 2 ). The formulas for calculating the statistical characteristics are shown in Table 1. The distribution of statistical eigenvalues of LN is shown in Fig. 4-a, where the PDF of mode value is the largest. Furthermore, LN shows negatively skewness owing to the sample is mainly concentrated on the left side of the mean value. Therefore, LN is suitable for describing the probability distribution of RUL in the degradation process, since the predicted RUL should be less than the real RUL to ensure the safety requirements of the equipment. Fig. 4-b and Fig. 4-c show the influence of different µ and σ on PDF. The smaller µ is, the more positive the skewness of LN is, while the smaller σ is, the more LN tends to be normal distributed. Due to equipment safety requirements, not timely maintenance will cause disastrous consequences, so early prediction is preferable to delayed prediction. However, premature maintenance is sometimes costly. So, the trend of model prediction requires the choice of µ and σ in a reasonable range. The parameter µ of PI should be close to the actual value, so the prediction accuracy is higher. While the smaller σ represents a narrower PI, so the narrow prediction interval can better measure uncertainty. To sum up, LN can be changed into different shapes only according to two parameters, which is a more flexible distribution and is conducive to describing the life samples with roughly symmetric or right skewness.

III. PROPOSED METHODOLOGY
The proposed method consists of offline modelling and online application phase, which are described in Fig. 5. First, the offline modelling phase, in which a FA-LBiGRU model is trained on run-to-failure data composed of measurements of healthy and degraded states of the single mechanical system, is applied to establish the mapping between historical condition monitoring (CM) data and RUL distribution. On account of multidimensional historical time series are the main carrier of degradation information, N systems in the same operating environment are assumed to be monitored by F sensors for each system. Then, CM data whose lifetime is divided into T i by sliding window is measured for the system i, i = 1, 2, · · · , N , which can be represented by In addition, RUL labels corresponding to X i can be expressed as Y i = [y 1 , y 2 , · · · , y T i ], before integration of run-to-failure data to historical CM data set During the training period, FA-LBiGRU model takes D to find the optimal parameter θ i to maximize likelihood function of RUL, where depicts RUL distribution most accurately. If the RUL distribution follows LN, θ t i contains two parameters of (µ t i , σ t i ), where µ t i is the mean value and σ t i is the standard deviation derived from modelling stage of the system i at time t.
The construction result of system's PIs is obtained in the second phase, the online application phase, after handing field CM data into the trained FA-LBiGRU model, which translates the sequence of sensor signals into a couple of new practical predictive values. For effective adaptation of proposed model's data structure, the CM data and the homologous labels need a preprocessing step corresponding to the first half of the first phase, whose details are described in Section A. The framework of the proposed FA-LBiGRU model for construction of mechanical system's PIs is then demonstrated in Section B. VOLUME 10, 2022 The construction of feature sets are extracted of various types of CM data, e.g., vibration, pressure, temperature and flow, obtained by diverse sensors installed on mechanical system. However, not all of these data are sensitive to the degradation process, or can provide sufficient profitable information for the establishment of PIs. Hence, to further identify a data subset that recognizes essential roles from independent ones and thus raises the precision of prediction results, three standard feature selection metrics in [27] are inherited and exploited to initially pick useful and representative sensors in this paper. In practical engineering cases, the nonlinear correlation among features is common, yet not considered via these selection metrics. Therefore, FA mechanism in Section B is proposed to overcome this drawback.

2) DATA NORMALIZATION (DN)
Since the multi-dimensional monitoring data have different dimensions, normalization preprocessing must be carried out before the model construction to eliminate the influence of feature consistency. Among various popular normalization techniques, this paper chooses the minimum-maximum normalization method to unify the data to the range of 3) RUL LABEL RECTIFICATION Due to the system being in good condition at the beginning of the operation, the degradation degree is small. At the same time, its performance will decline sharply with the extension of time at the end of service. Suppose the label of monitoring data before the rapid deterioration of engine is set to the total operating cycle minus the number of current operational cycles. In that case, the lag of RUL prediction results will increase. Therefore, it can be considered that the RUL remains unchanged before the engine starts to degrade rapidly, that is, to set a threshold for the RUL label of the training set to make it a piecewise linear function. The research [29] shows that it is better to rectify the critical value of RUL label mutation in the training set to the 130th operation cycle, and the RUL label setting results are shown in Fig. 6.

4) SLIDING WINDOW PRETREATMENT (SWP)
In order to use the limited time series to dig deep into the degradation law, and transform the time series into a threedimensional input format that the BiGRU network is good at processing, the sliding time window segmentation method is used to process the normalized data. In this way, the time correlation between adjacent sequences can be fully retained, and the number of training samples can be increased so that the model has higher robustness and generalization ability. For the given time series x T i F of system i, where F and T i are the feature and the time dimension, the sliding window with window width S slides along its time direction. By superimposing the time series intercepted by each sliding step length to the third dimension, the tensor with (T − S, S, N ) dimension is formed, and the calculation process is where X 1:T −S is the converted 3D tensor, x i:i+S is a sequence of length S from i th time; ⊕ represents a connection of the segmented data in a third dimension, thus forming a threedimensional tensor.

B. FRAMEWORK OF THE PROPOSED MODEL
Instead of predicting the single-objective RUL value y, the FA-LBiGRU model can predict the LN parameters (µ t i , σ t i ) that construct PIs of system i at time t. Exceptionally, by constructing a novel maximum likelihood loss function, LN subject to a couple of parameters (µ t i , σ t i ) can maximize the probability when RUL t i = y. Fig. 7 shows the framework of the proposed innovation model in this manuscript. It consists of the following layers:

1) INPUT LAYER
It can input the historical CM data set to further operation, which is preprocessed and integrated as a 3D tensor with shape number of samples (S), the width of a time window (T ), and number of features (F).
For a matrix sample X = (x 1 , x 2 , · · · , x F ) = (x 1 , x 2 , · · · , x T ) T , the expansion can be expressed as where Fig. 7 is  the S th n sample of the input data set segmented by the sliding window.

2) FEATURE WEIGHTING LAYER
It is responsible for extracting the attention weight e of input matrix samples X, so as to obtain the attention weighted features X . Concretely, the correlation between input data and target modelling parameters is automatically extracted by FA mechanism. Through continuous iteration, the attention weights of each feature at all times are optimized, and then the critical features with enormous contributions to RUL prediction are gradually extracted. According to (1) and (2), attention weight can be expressed as Note that e = [e 1 , e 2 , · · · , e F ] ∈ R F×1 . Therefore, the attention weighted features are where X can record as [x F 1 , x F 2 , · · · , x F t , · · · , x F T ] T given F is the dimension of attention-weighted feature, and T is the width of time window.

3) FEATURE LEARNING LAYER
For the system RUL prediction task, the input sensor data feature is a continuous-time series with solid correlation, which requires the network to have a particular ''memory function.'' The RUL of the current device is predicted by learning the difference before and after degradation. Feature weighted layer can give different weights to multidimensional features, but it is challenging to learn sequence information with obvious time correlation. Therefore, the BiGRU network is built after the weighted feature layer to form the feature learning layer so that the prediction model can capture the temporal dependence of data.
Since the length of time series is T , and there are F input features, a BiGRU layer is composed of T memory units in series, and the input dimension of each GRU unit is F. The feature sequence is input to the forward and reverse layers at different times, as shown in Fig. 5, to get the forward output h ft and backward output h bt at time t, t ∈ [1, T ]. Finally, the output h t of the unit is calculated by (5), and the total output H of BiGRU can be denoted as [h 1 , h 2 , · · · , h t , · · · , h T ].

4) OUTPUT LAYER
A novel layer called lognormal layer, which forms an output layer together with Linear, is defined to seek RUL distribution parameters for specific features. The output of the feature learning layer is used as the input of the output layer, and then the PDF parameter prediction values y * t of the system RUL distribution are calculated, which construct PIs at time t, given by where, w o and b o are the weight matrix and bias vector of the Linear. However, the practical construction of PIs requires a slight positive standard deviation, while the ordinary activation function cannot meet this requirement. Hence, new standard VOLUME 10, 2022 deviation σ t can be obtained as follows: Then note that the total output Y * of the output layer can be recorded as Y * = µ 1 , µ 2 , · · · , µ t , · · · , µ T σ 1 , σ 2 , · · · , σ t , · · · , σ T T (13)

5) LOSS FUNCTION
So far, the FA-BiGRU model without log-norm distribution has now been built to construct RUL PIs, determined by sets of parameters (µ t i , σ t i ) for system i at time t, i ∈ [1, N ], t ∈ [1, T i ]. When training, the model whose name modifies to FA-LBiGRU updates weights and biases with LN participation, which will be altered to get parameters' sets (µ t i , σ t i ) that maximize the likelihood function. In other words, according to the probability density formula in Table 1, all the probability values from the start to the end of the monitoring are recorded so that the two parameters tend to the actual RUL distribution, given as . Whereas, in the actual training process, when the parameters are randomized away from their target values, the likelihood function is relatively constant, and the network update will be minor. Therefore, the logarithmic likelihood function is often used to accelerate the convergence of the model to the optimal value and can calculate the sum of single independent probability factors, which is helpful to the reverse derivation of the loss function. Besides, deep learning methods usually minimize the loss function, so the loss function is defined as

IV. EXPERIMENTAL VERIFICATION
In this section, a well-known prognostic benchmarking problem, i.e. the turbofan engine degradation problem, will be settled for RUL prediction and uncertainty quantification of systems using developed prognostic approaches. Section A presents the benchmarking dataset selected in this article and briefly introduces the pretreatment process. Then performance metrics for point-wise and interval prediction are demonstrated in section B. Section C is dedicated to analyzing the model hyperparameter setting process and results. Visualization and comparative analysis of prognostic results for complex system are described in section D.

A. DATASET DESCRIPTION AND PREPROCESSING
The engine degradation dataset C-MAPSS for prognostic is published by [30], which consists of four subsets (named FD001 ∼ FD004), and uses 21 sensors as typical features to characterize engine performance. Each subset comprises of a training set, a test set, and a RUL label set. The training set contains all the data for the whole life cycle of the engine, whereas the test set only contains the data from the previous period of the engine's life cycle. The RUL label is the RUL of the final monitoring time of the engine corresponding to the test set. Moreover, each set (see Table 2 for specific information) contains a different number of engines, each with varying degrees of initial wear, so the sequence length of each engine monitoring data is also different. In order to deal with complex and massive datasets effectively, a preprocessing method is adopted, which is mainly divided into four steps corresponding to A in section III. In particular, this section mainly describes the process of selecting sensor data, while data normalization, sliding window pretreatment and RUL label rectification need not be introduced because of their broad applicability. Due to space constraints, only the distribution of four sensor data features with different trends in the FD001 training set is shown in Fig. 8. It can be seen that the trend of T2 remains unchanged, indicating that the monitoring data of the No2 sensor have no significant changes in the whole life cycle. Therefore, these sensors with no change are difficult to provide any degradation information that contributes to RUL prediction and should be eliminated. In addition, Table 3 shows trendability, monotonicity and prognosability parameters to compare whether remaining candidate sensors are helpful for systembased prognosis. The most suitable sensors whose parameter is close to ''1'' can be determined by giving a sum function of the three parameters to compare, which can provide helpful degradation information. Finally, after comprehensively measuring all parameters and their sum in Table 3, ten sensors are selected as the input feature data set in the manuscript: No. 2, 3, 4, 7, 11, 12, 15, 17, 20 and 21, respectively.

B. PERFORMANCE EVALUATION METRICS
Performance evaluation metrics to quantify the performance of prediction model is divided into two aspects: point-wise prediction and uncertainty quantification. Among them, the point-wise prediction evaluation metrics is used to evaluate  the accuracy and posteriority of the model for deterministic point prediction on the test set, and the uncertainty quantification evaluation metrics is used to measure the quality and convergence rate of the PIs.

1) POINT-WISE PREDICTION EVALUATION METRICS
Given m engines, where y i is the actual RUL value of the i th engine, whileŷ i is its predicted value, the prognostic model's performance can be quantified by the following metrics.

a: ROOT MEAN SQUARE ERROR (RMSE)
RMSE is used to measure the deviation between the predicted value and the actual value, with its calculation function as Due to the high safety requirements of the engine, the cost caused by the catastrophic consequences of not timely maintenance is much larger than that of excessive maintenance, so earlier predictions are more desirable than delayed predictions. SF, as an authoritative official indicator in the 2008 PHM data challenge, imposes higher penalties for RUL delayed forecasts, and its calculation formula follows as where a 1 and a 2 are parameters controlling the degree of punishment of SF for prediction lag to achieve asymmetric prediction preference, which equals 13 and 10 respectively.

2) UNCERTAINTY QUANTIFICATION EVALUATION METRICS OF COMPLEX SYSTEM a: PREDICT INTERVAL COVERAGE PROBABILITY (PICP)
PICP, which can be used to evaluate the reliability of PIs, reflects the probability that the actual RUL value y i falls into the predicted fluctuation range [L α (ŷ i ), U α (ŷ i )] of the predicted valueŷ i .
where L α (ŷ i ) and U α (ŷ i ) are the upper and lower boundaries of the PI ofŷ i at the nominal confidence level (CL) (1 − α), respectively. The larger the PICP is, the more real values fall into the PI, and the more reliable the prediction results of the model are. In addition, PICP should exceed the given confidence as much as possible. Otherwise, it will be judged invalid in the actual maintenance decision.

b: PREDICT INTERVAL NORMALIZED AVERAGE WIDTH (PINAW)
The pursuit of PICP alone will lead to too broad a PI, and it is difficult to obtain the uncertainty information of RUL VOLUME 10, 2022 prediction accurately. Therefore, it is necessary to consider the interval width to improve the decision value. PINAW represents the average width of the PI, which can be used to assess the clarity of the PI. (19) where y max and y min are the maximum and minimum of the real RUL, respectively. When PICP is constant, the smaller PINAW corresponds to the narrower PI, the better the prediction effect of the model.

C. MODEL HYPER-PARAMETER SETTING
In order to accurately reflect the error between the predicted value and the actual value of different hyper-parameter models in the test set, the commonly used mean square error (MSE) is used as the evaluation metrics to adjust the hyper-parameter.
where y i andŷ i are the actual and predicted values of the i th input sample, respectively; m is the number of samples. This model is implemented by the PyTorch framework, and the main hyper-parameters involved are: network structure, learning rate, batch size, epoch, optimizer, dropout rate, etc. The model parameters have a great influence on the performance of the model, especially the time window width S and batch size. Therefore, the trial-and-error method is used to minimize the model's prediction error to obtain the optimal super-parameter combination, as shown in Table 4. Concretely, taking the above two parameters as an example, Fig. 9 shows the specific adjustment.
MScore and MMSE are the mean values of 30 test results in Fig. 9, respectively, and Score is the SF introduced in B. Considering that the time window width should not exceed the minimum length of engine monitoring in the dataset, its test range is [5,30], and the test range of batch size is [8,64]. In addition, more useful degradation information that may continuously enhance model's prognostic performance is accumulated due to time window width increase. Finally, comparing the lowest MScore and MMSE with the best performance, a time window width of 30 cycles and batch size of 32 are selected as the most reasonable parameters of the model, and other hyper-parameters are adjusted according to this method.

D. PROGNOSTIC RESULTS OF COMPLEX SYSTEM 1) VISUALIZATION ANALYSIS OF RUL PREDICTION AND UNCERTAINTY ESTIMATION
Taking the FD001 dataset as an example, the preprocessed high-dimensional time series was loaded into the FA-LBiGRU model for the point prediction of RUL, and the median prediction results of the test set are given in Fig. 10-a. Furthermore, all the engines in the FD001 test set  are reordered from large to small according to the real RUL values, and the results are shown in Fig. 10 It clearly can be seen that the actual value and the median value of predicted RUL value are very close in Fig 10-a, indicating the effectiveness of the point-wise prediction of the proposed method. Additionally, when the RUL value is considerable, the engine is healthy, indicating that the system operates well. Still, the monitoring data are insufficient to obtain sufficient degradation information. Therefore, as shown in Fig. 10-b, when the engine starts running, the RUL prediction error is large, and the fluctuation is relatively violent and shows noticeable hysteresis. However, after a long time of operation, the predicted RUL converges around the real RUL, and the prediction performance is significantly enhanced. Thus, it can be concluded that the more sufficient the history information of the engine, the more pronounced the performance degradation information, and the smaller the model's prediction error.
Next, Fig. 11 and Fig. 12 respectively present the RUL prognostics and interval construction results of engine #34 and engine #76 in the test subset of FD001. Fig. 11-a and Fig. 12-a show the predicted PDFs and RULs at the last period of all monitoring times respectively. It can be observed that the median value of predicted RUL (characterized by the red star point in Fig. 11-a and Fig. 12-a is close to the actual  value (characterized by the black hollow circle in Fig. 11-a and Fig. 12-a) at the end of the operation (155-175 monitoring times). Subsequently, the peak value of PDF at each time point increases with monitoring time, demonstrating that the point-wise prediction results of RUL are convincing in developing maintenance strategies. Fig. 11-b and Fig. 12-b respectively illustrate the RUL prognostics and interval construction results throughout VOLUME 10, 2022  lifetime observations of engine #34 and engine #76 in the test set FD001. For lifetime-cycle observations, as the degradation process is not noticeable compared to the end-oflife observations, the predicted RUL distribution in the early phase of the observation period is unsatisfied: ground actual RUL may exceed the 95 % RUL prediction CL, e.g. about 55 cycles in Figure 12-b. However, as shown in Fig. 11-b and Fig. 12-b, the prediction results are improved along the engine's lifetime: the widths of the 95%, 90% and 80% CL intervals become narrower as the monitored system runs, which means that the uncertainty of prediction decrease gradually with the number of collected measurements increasing. It also can be seen that the predicted RUL is getting closer to the ground actual one as the monitored system runs, which indicates that the performance degradation of the engine gradually accumulates with the increase of the operation cycle, making the model capture the degradation information in the time series, which effectively improves the prediction accuracy in the later stage of operation. In fact, at the end of the engine #34 observation period (cycle 173), its actual RUL (7 cycles) falls within a narrow enough 80% CL RUL interval [4.0, 7.3] (Fig. 11-c). Moreover, at the end of the engine #76 observation period (cycle 175), its actual RUL (10 cycles) falls within a pretty narrow interval [9.7, 15.8], representing the RUL PI of 80% CL (Fig. 12-c).

2) COMPARATIVE ANALYSIS WITH STATE-OF-THE-ART METHODS
To compare the last prediction performance of this method for engine RUL point prediction and interval prediction, the metrics in Section B are used to objectively evaluate the accuracy and generalization ability of different models on the test set. Table 5 compares the training effects of different recurrent neural networks, using the FD001 dataset in a single mode and the FD004 dataset in multiple modes. Notably, the feature extraction effects of different attention mechanisms, the self-attention (SA) and the FA mechanism, are compared. Detailed settings for comparison methods without citations are shown in Table 6. Furthermore, the point-wise prediction accuracy of the IESGP and LSTMBS, which can also be used for interval prediction, is also compared. In addition, observing the results in Table 5, owing to the asymmetric property of the LN, there exists a subtle disparity when using the mode, mean or median values for the evaluation of the point-wise prediction accuracy metrics. Among them, the mean provides better results for the RMSE than the other two accuracy metrics. Nevertheless, comprehensively considering the disaster of lag prediction, the median with a smaller SF value is selected as the result of the point prediction.
Compared with the other state-of-the-art methods according to Table 5, the RMSE and SF of proposed method are minimal, denoting that the proposed FA-LBiGRU in this paper has the highest prediction accuracy. For the FD001 dataset, RMSE = 13.13, where is reduced by an average of 11% compared with other methods, and SF = 238, where is reduced by an average of 33%. For the FD004 dataset, RMSE = 20.29, where is reduced by an average of 22% compared with other methods, and SF = 3395, where is reduced by an average of 37%. Further analysis shows that although GRU training is slightly less effective than LSTM, its training time is reduced by about 11%, which is conducive to real-time prediction deployment. Secondly, BiGRU can capture bidirectional long-time-dependent features, resulting in better prediction accuracy. Moreover, the FA mechanism can score the importance of different features rather than the weighted normalization of the SA mechanism, which further improves the model's prediction accuracy. In addition, although the average time-consuming of the proposed method in the offline training phase is 349s, the online test phase only takes 1.3s, which is only a 0.13s difference compared with the average time-consuming by other methods applied to FD001. On the other hand, the point prediction accuracy of the proposed method is higher, which can compensate for the time loss caused by the model's complexity. In conclusion, the longitudinal comparison of the point prediction results shows that the proposed method has higher prediction accuracy and better generalization ability while meeting the realtime requirements.
To further verify the advantages of the proposed method for interval prediction, IESGP [23], the BDL-based [24], and LSTMBS [26] are constructed as comparison models. The average interval predictions are obtained by 10 trials on the FD001 dataset, as shown in Figure 13. Compared to other methods, the proposed model provides the tightest width of 95% CL RUL prediction interval (PINAW = 0.3) with a good enough score of the prediction interval coverage percentage (PICP = 0.95). It can be seen that IESGP gets the best PICP, but its mean prediction interval width is more than a half of the target RUL's range (PINAW =0.540), indicating that it is slightly weak when facing high-dimensional features. Besides, although the PICP of LSTMBS has the same value as our model, the average PINAW decreases by 19.8%, 13.9%, and 16.9% at the CL of 95%, 90%, and 80%, respectively. In addition, the repeated test error of the proposed method is relatively low, indicating that the proposed method can quantify the uncertainty of performance degradation more accurately, and stably and effectively improve the prediction accuracy. The accuracy of point prediction is also better than the LSTMBS model, indicating that the proposed method has a more powerful feature extraction ability and data processing power.
To sum up, it is found that the proposed method can extract more accurate performance degradation trend characteristics from high-dimensional and sparse big data and effectively measure the uncertainty in the process of system degradation to achieve accurate prediction of RUL. The prediction interval has the advantages of high reliability and accuracy, which provides strong support for subsequent operation and maintenance personnel to make maintenance decisions.

V. CONCLUSION
In this work, a novel end-to-end methodology for RUL estimation and uncertainty quantification of a system was presented. The multidimensional original monitoring data is preprocessed into a data set for model input by data preprocessing technology. Additionally, the degradation rule of data sets is learned to predict the RUL distribution of complex system by FA-LBiGRU model. The famous benchmark data set C-MAPSS is used to verify the effectiveness and advantages of this method, and the following conclusions are yielded: 1) Sufficient historical data of systems contain obvious performance degradation information, which reduces the error between point prediction and interval prediction of the model. It is conducive to health management measures such as subsequent operation planning and maintenance decision-making. 2) FA mechanism makes end-to-end prognostics models focus more on essential features, and dynamic weighting of the importance of multidimensional features can be achieved without manual intervention. LBiGRU can deeply mine the degradation information before and after time series and measure the uncertainty of equipment degradation process based on LN distribution, effectively improving the accuracy and reliability of remaining life point-wise prognostics and interval prognostics of mechanical system. 3) Compared with several state-of-the-art approaches, the FA-LBiGRU fusion model is used to process the monitoring data of multi-dimensional mechanical systems with complex operating conditions and variable fault modes, which can generate the RUL PIs taking reliability and accuracy into account.