A Hybrid Temporal Data Mining Method for Intelligent Train Braking Systems

As big data mining technology penetrates various fields, cross-domain topics driven by predictive data analysis have become important entry points for solving traditional problems. Because of the complex changes in pressure sensor readings and the interaction between differently grouped trains during braking, mechanism modeling is difficult, the data are strongly time-dependent, and the data distribution is not stable. Given the development trend toward long-grouped heavy-duty trains, using temporal data mining on small train groups for predictive braking analysis would be innovative progress for the entire train braking field. This paper combines recent techniques such as machine learning, transfer learning and lifelong learning to construct the first predictive analysis research framework in the field of train braking systems. Based on the principle of the train braking process and temporal data collected from an intelligent experiment platform, a baseline is first built to solve fixed-grouped and multi-grouped temporal prediction problems. Then a predictive algorithm for model verification and update based on lifelong learning is established to update model parameters automatically over time. Finally, relying on parameter transfer in transfer learning, a multi-grouped temporal data prediction analysis is performed. By comparing the training results of the “pre-trained” model on the general domain, the “tuned” model on both the general domain and the target domain, and the “target only” model on the target domain, the multi-domain tuning results show the applicable scope and transfer conditions of each model. In summary, this work can contribute to intelligently upgrading the semi-physical intelligent test platform for long-grouped heavy-duty trains.


I. INTRODUCTION
WITH the deepening of freight logistics cooperation around the world, systematic and efficient research testing and onsite experiments have become core steps in the development of all heavy-duty transportation technologies. Considering the development and testing costs of new brakes, research on intelligent test platforms for train braking systems, driven by the rapid upgrading of computer technology, has advanced greatly in recent years. Almost all research on the simulation of heavy-duty train braking systems concentrates on analysis and calculation methods from fluid mechanics, on which the modeling and predictive simulation of air braking systems are then based. Classic studies include the simulation prediction of the train braking system characteristics of ABDW series valves in the United States [1] [2], performance prediction and parameter research based on the vacuum brake system in India [3], and a calculation method from Poland for solving the gas flow equation of the braking system considering actual thermal effects [4].
From 2000 to 2004, a research route that made outstanding contributions came from Wei Wei et al. at Dalian Railway Institute in China. This research combined gas flow theory with the principle of the 120-type braking valve to establish train braking performance analysis and simulation software that could predict any group of up to 150 trains [5][6] [7]. In 2007, it further realized the calculation of real-time dynamic properties of the braking system and used the system to predict the braking characteristics of a 20,000-ton combined train [8]. As of 2012, the simulation system based on the basic theory of gas flow could already be combined with the air brake system for joint simulation, which had a significant impact on the simulation research of long and heavy trains [9]. Another analysis method relies on the mature fluid simulation software AMESim, whose graphical modeling gives it good versatility. In 2015, a calculation model of jetting gas extinguishing agent from a gas extinguisher vessel was constructed with AMESim, and the jet performance of the gas extinguishing agent was simulated using the two-phase flow model [10].
Although the above research routes have achieved some results, they are currently unable to adapt to the future replacement of braking valves and to ultra-long groups of heavy-duty trains. On the one hand, there are errors between airflow-theory-based modeling and actual circuit tests, and because of the long research and development cycle, researchers are required to master the physical and mechanical structure of air control valves. On the other hand, this kind of system simulation platform lacks versatility and portability, so it cannot adapt to rapidly updated braking system hardware. Therefore, researchers in the field of train braking systems are paying attention to another kind of intelligent test technology: hardware-in-the-loop simulation (HILS), or semi-physical technology, which has been rapidly developed and applied in various fields of industry since the 1960s.
However, the HILS platforms in use around the world come from large companies that already hold a certain international monopoly in their application fields. For most researchers, whether implementing rapid control prototyping or semi-physical simulation, platforms such as the dSPACE real-time simulation system developed jointly by the German company dSPACE and the American company MathWorks, RT-LAB from Opal-RT Technologies in Canada, and the real-time LabVIEW-RT from National Instruments in the US have become the first choice. These HILS technologies inevitably suffer from extremely high cost, poor versatility, and poor portability. Large companies have increased the generalization performance of their simulation systems because of their business breadth, so these platforms are difficult to use in some professional fields, especially the semi-physical simulation analysis of train braking systems, which involves safety performance issues.

A. INTELLIGENT EXPERIMENT PLATFORM OF TRAIN BRAKING SYSTEM
In the field of rail transportation, research based on the concept of HILS has developed to a certain extent and has produced corresponding results. The Chinese Academy of Railway Sciences established the main circuit model of the auxiliary converter of the CRH3 EMU in MATLAB/Simulink [11]. In 2017, a semi-physical simulation test rig for urban EMU network control was tested, which could simulate actual train operation [12]. In 2019, a study was dedicated to the development and validation of a HILS test bench for virtual real-time testing of GAZ Group light commercial vehicles equipped with Electronic Stability Control (ESC) systems [13]. Although the application of modern real-time simulation technologies based on HILS makes it possible to decrease the number of onsite tests, most of this work focuses on high-speed EMUs.
To fill the gap in HILS research and development for heavy-duty train braking systems, a research project conducted jointly by Tongji University and China CRRC Qiqihar aims to solve this cutting-edge problem. By 2017, this research had finished modelling freight train brakes in the various stages of the braking process based on the fluid balance equation [14], and a semi-physical intelligent experiment platform for the freight train braking system was initially established on the basis of this model [15]. By 2019, a more detailed design and implementation of the simulation modelling and interface parameters provided an important basis for the control part of the entire semi-physical braking intelligent test system [16]. At the same time, a scheme was proposed for obtaining the braking performance of large-grouped trains through intelligent control of small-grouped trains [17]. In that work, a neural network algorithm is used to predict the data online, and the train is corrected with the predicted value. Figure 1 illustrates the topology and the on-site platform of the whole semi-physical intelligent system.
However, this semi-physical system still has some drawbacks. On the one hand, the modeling process based on fluid mechanics is neither scalable nor versatile, and it is limited by the error between the numerical and the actual solution. On the other hand, although a Back Propagation (BP) neural network has been used for performance prediction, it only performs predictive fitting of one abstract parameter, the throttling coefficient. Test data from a small grouped train are not enough to ensure that this prediction method generalizes to long and large groups. Moreover, the test data of the train braking process are not independent: across different groups, they are inseparably related to the position of the sensor and to time. A classic neural network therefore cannot yield a good predictive model. Thus, it is necessary to combine diverse methods to analyze the large temporal data collected from this intelligent system and, in return, to help improve the control strategy.

B. RELATED RESEARCHES AND TEMPORAL PREDICTION PROBLEMS
At present, the development of temporal data mining technology in the industrial field is still in its infancy, and the research methods for industrial sensor time-series prediction can be divided mainly into two categories. One is based on classic statistical models [18], such as the moving average model, the exponential smoothing model, the Autoregressive Integrated Moving Average (ARIMA) model and the state space model. Because statistical models rely heavily on assumptions such as stationarity, they are not always suitable for the data. The other is the prediction model based on machine learning, such as K-Nearest Neighbor (KNN), Support Vector Machines (SVM) [19], the BP neural network [20] [21] and the Deep Neural Network (DNN). Among them, KNN, SVM and the BP neural network have simple structures and stable performance, but their prediction accuracy is limited.
With the advent of cloud computing and big data, improved computing power and the substantial increase in training data provide support for deep learning [22] and for deep networks represented by the Recurrent Neural Network (RNN) [23]. Owing to the strong versatility and high prediction accuracy of such networks, time-series prediction has gradually become a popular research direction. In practical applications, the patterns in sensor data are mostly related to long-distance time-series data, and the exploding or vanishing gradients of an ordinary RNN as the recurrence progresses mean that the model only learns short-term dependence [24]. To solve this problem, the Long Short-Term Memory neural network (LSTM) was introduced [25]. The long-term and short-term memory unit of the LSTM can control the speed at which information accumulates, and it shows superior ability in predicting time-series data with long-distance dependence, which is also the theoretical basis of the core network constructed in this study. By 2021, several improved LSTM-based models for industrial applications had been presented. A multi-output sequential learning model was proposed for heating and cooling load prediction [26]. The DB-Net combines a dilated convolutional neural network (DCNN) with a bidirectional long short-term memory (BiLSTM) network to predict power consumption in integrated local energy systems [27]. The AB-Net combines an autoencoder (AE) with a BiLSTM for Renewable Energy (RE) generation forecasting [28]. A comparative analysis of a variety of deep features with several sequential learning models was presented to select the optimized hybrid architecture for energy consumption prediction [29].
Temporal data mining in industry builds on the prediction and analysis of industrial sensor time series and additionally introduces spatial location. At present, the latest research on industrial temporal data basically focuses on operation monitoring or equipment failure detection and is rarely used for modelling and updating a working process. This is why most industrial applications only mention time series rather than temporal data. To apply data mining technology to the train braking process in this study, we must consider sensors located in different numbers of trains, in different groups, and at different positions in the same train. Research from the JD Financial City Computing Business Department presented at the 2018 IJCAI Conference provides the basic solution to this problem [30]. It is oriented to urban computing and realizes temporal sequence prediction for geographic sensors based on a neural network with a multi-layer attention mechanism. Recently, a multisource adaptation diagnosis network (MADN) method was proposed to transfer the diagnostic knowledge existing in multiple sources to the target [31].
Relying on the above research background and the latest progress, this work focuses on the current industry bottlenecks of train braking systems and the application of temporal data mining technology in this field, mainly to solve the following four problems:
1) How can braking data be collected from the intelligent train braking system with multiple sensors, and how can a temporal data set suitable for temporal prediction analysis be constructed?
2) How can a single fixed train group be accurately predicted and analysed based on the temporal data of the train braking system, and how can the predictive model be output?
3) How can a lifelong learning-oriented predictive model be designed for train braking so that, as the data set is updated, the model is continuously updated and its stability verified?
4) How can the transfer training of the model be carried out under variable train group conditions, that is, for multiple groups, and what is the suitable application range of transfer learning?
The data set built on the experimental platform solves Problem 1), and a basic model based on an LSTM network [32] and a more complicated model [33] have been established to solve Problem 2). The simulation results show that the extended model predicts the subsequent time series data with high accuracy and in agreement with the experimental results, which means that the model has good predictive performance in long-grouped train braking problems. With these preliminaries, instead of improving the modelling process, this work focuses on solving Problems 3) and 4).
VOLUME 4, 2016

III. MODEL UPDATE AND VERIFICATION OF TEMPORAL PREDICTION BASED ON LIFELONG LEARNING
A. NOTATIONS AND PROBLEM STATEMENT
Formal Problem I: Multi-sensor time series prediction in a single train
In fixed-grouped situations, given $n$ time series collected from input sensors, we use $x_t = (x_t^1, x_t^2, \cdots, x_t^n)^\top \in \mathbb{R}^n$ to denote the vector of $n$ exogenous input series at time $t$. Thus, given the previous values of the target series $(y_1, y_2, \cdots, y_{T-1})$ with $y_t \in \mathbb{R}$, as well as the present and past values of the $n$ exogenous series $(x_1, x_2, \cdots, x_T)$ with $x_t \in \mathbb{R}^n$, the predictive model aims to learn a nonlinear mapping $F(\cdot)$ to the current value of the target series:
$$\hat{y}_T = F(y_1, \cdots, y_{T-1}, x_1, \cdots, x_T) \tag{1}$$
Formal Problem II: Multi-sensor temporal prediction in long-grouped trains
In fixed-grouped situations, suppose there are $G_g$ trains, each of which generates $G_l$ kinds of time series to construct an individual temporal data set. Among them, one kind of time series is specified as the target series for making predictions, while the others are features. Given a time window of length $T$, we use $Y = (y^1, y^2, \ldots, y^{G_g}) \in \mathbb{R}^{G_g \times T}$ to denote the readings of all target series during the past $T$ hours, where $y^i \in \mathbb{R}^T$ belongs to the $i$-th train. We use $X^i = (x^{i,1}, x^{i,2}, \ldots, x^{i,G_l}) \in \mathbb{R}^{G_l \times T}$ to represent the inner features of train $i$, where $x^{i,k} \in \mathbb{R}^T$ denotes the time series collected from the $k$-th sensor in this train. Hence, $x_t^i = (x_t^{i,1}, x_t^{i,2}, \ldots, x_t^{i,G_l})^\top \in \mathbb{R}^{G_l}$ represents all temporal data of the $i$-th train at time $t$. The prediction problem can therefore be stated as predicting the temporal data of the $i$-th train after time $\tau$, given the data of all sensors of each train. The predictive model aims to learn a nonlinear mapping $F(\cdot)$ such that:
$$\hat{y}^i_{T+\tau} = F(Y, X^1, X^2, \ldots, X^{G_g}) \tag{2}$$
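As a rough illustration of Formal Problem I, the following Python sketch turns raw sensor logs into training samples, each pairing the inputs $(x_1, \cdots, x_T)$ and the history $(y_1, \cdots, y_{T-1})$ with the target $y_T$. The function name `make_windows`, the sliding-window stride of one step, and the toy values are assumptions made for illustration; they are not taken from the experiment platform.

```python
def make_windows(x, y, T):
    """Build (inputs, history, target) samples for Formal Problem I.

    x: list of length L; each item is a list of n exogenous readings x_t.
    y: list of length L holding the target series y_t.
    Each sample pairs (x_1..x_T, y_1..y_{T-1}) with the target y_T.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    samples = []
    for start in range(len(y) - T + 1):
        x_window = x[start:start + T]          # x_1 .. x_T
        y_history = y[start:start + T - 1]     # y_1 .. y_{T-1}
        target = y[start + T - 1]              # y_T, the value to predict
        samples.append((x_window, y_history, target))
    return samples


# Toy data: 6 time steps and n = 2 sensors (values are made up).
x = [[0.1 * t, 0.2 * t] for t in range(6)]
y = [float(t) for t in range(6)]
windows = make_windows(x, y, T=4)
```

With a window length of 4 over 6 time steps, the sketch yields three overlapping samples; a real pipeline would use the window length chosen for the experiments.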

B. BASELINE FOR TRAIN BRAKING PREDICTION
To build a baseline for further research, the network in [32][33] is improved and generalized: a two-layer LSTM-based model with an added input attention layer is established, whose structure is shown in Figure 2. The input sequence of the $k$-th external feature, that is, the sensor input feature of the previous trains, is given as $x^k = (x_1^k, x_2^k, \cdots, x_T^k)^\top \in \mathbb{R}^T$. The input attention mechanism can be constructed through a deterministic attention model, namely a multilayer perceptron. The method is to use formula (3) and formula (4) to encode the previous hidden state $h_{t-1}$ and cell state $C_{t-1}$ into the first layer of LSTM cell units.
The weights in (3), including $U_e \in \mathbb{R}^{T \times T}$, are all trainable parameters. For conciseness, the bias term in (3) is omitted. $\alpha_t^k$ is the attention weight that measures the importance of the input sequence of the $k$-th external feature at time $t$. The Softmax function is applied to $e_t^k$ to ensure that all attention weights sum to 1.
A classic LSTM cell unit is composed of three gates: the forget gate $f_t$ is given by formula (5), the input gate $i_t$ by formulas (6) to (8), and the output gate $o_t$ by formulas (9) to (10).
The input attention mechanism is a feed-forward network that can be trained jointly with the other RNN-derived cells. With these attention weights, formula (11) automatically extracts the features of the external time series. Then the state of the hidden layer at time $t$ can be updated together with the cell state by formula (12).
Here, $f_1$ is an LSTM cell unit operating according to (5) to (10), except that $x_t$ is replaced with the attention-weighted input $\tilde{x}_t$. Through this input attention mechanism, the encoder can selectively focus on certain external input feature sequences instead of treating all input features equally.
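The input attention step can be sketched in plain Python as follows. Because formulas (3), (4) and (11) are not reproduced in this text, the sketch assumes the commonly used form $e_t^k = v_e^\top \tanh(W_e [h_{t-1}; C_{t-1}] + U_e x^k)$, a Softmax over $k$, and $\tilde{x}_t = (\alpha_t^1 x_t^1, \cdots, \alpha_t^n x_t^n)$. The names `v_e` and `W_e`, their shapes, and all numeric values are assumptions; only $U_e \in \mathbb{R}^{T \times T}$ is stated in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def input_attention(series, h_prev, c_prev, W_e, U_e, v_e, t):
    """Return attention weights alpha_t and the weighted input x_tilde_t.

    series: n feature sequences x^k, each of length T.
    h_prev, c_prev: previous hidden and cell state (length m each).
    W_e (T x 2m), U_e (T x T), v_e (length T): assumed trainable weights.
    """
    state = list(h_prev) + list(c_prev)       # concatenation [h_{t-1}; C_{t-1}]
    state_term = matvec(W_e, state)           # identical for every feature k
    scores = []
    for x_k in series:
        feat_term = matvec(U_e, x_k)
        hidden = [math.tanh(a + b) for a, b in zip(state_term, feat_term)]
        scores.append(sum(v * h for v, h in zip(v_e, hidden)))
    alpha = softmax(scores)                   # weights sum to 1
    x_tilde = [a * x_k[t] for a, x_k in zip(alpha, series)]
    return alpha, x_tilde


# Toy example: n = 2 features, T = 3 steps, m = 2 hidden units (made-up values).
series = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
h_prev, c_prev = [0.0, 0.0], [0.0, 0.0]
W_e = [[0.1] * 4 for _ in range(3)]
U_e = [[0.2 if i == j else 0.0 for j in range(3)] for i in range(3)]
v_e = [1.0, 1.0, 1.0]
alpha, x_tilde = input_attention(series, h_prev, c_prev, W_e, U_e, v_e, t=2)
```

In a trained model the weights would be learned jointly with the LSTM cells rather than fixed by hand as they are here.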

C. LIFELONG LEARNING BASED ON DATA SET UPDATE
Lifelong machine learning [34][35] [36] (or lifelong learning, LML) imitates the learning process and ability of humans. Since the things around us are closely related and interconnected, this way of learning is very natural. Lifelong learning has three basic elements: retention of learned knowledge, selective transfer of previous knowledge when learning new tasks, and systematic methods to ensure the effectiveness of knowledge retention and transfer. Through these three elements, lifelong machine learning has demonstrated a powerful model update effect and self-learning ability. Since lifelong machine learning is continuous learning, its core, in contrast to tuning a single model, is to update the learned model continuously while overcoming "catastrophic forgetting".
Since all data sets are collected from the same system, the stability of the data source and the application to the same formal problem ensure that this is not a lifelong learning scenario that easily leads to catastrophic forgetting. In the actual problem, the input is the data set $DS$ collected in a certain time period together with the currently learned model $M$, and the output is the updated prediction model $M'$. If the data set collected in one experiment is $DS1$, there is a function $F(\cdot)$ such that
$$M = F(DS1) \tag{13}$$
Then, keeping the test conditions unchanged and collecting another data set $DS2$, there is another function $F'(\cdot)$ such that
$$M' = F'(DS2) \tag{14}$$
If $F(\cdot) = F'(\cdot)$, the model remains unchanged; if $F(\cdot) \neq F'(\cdot)$, the model needs to be updated according to designed judgement conditions. If it needs to be updated, let $M = M'$.
Based on this basic mathematical problem, the key lies in determining from the new data set whether the model needs to be updated. Normally, this judgement is mostly based on the loss function. Considering the continuous accumulation of experimental data in the future, a lifelong learning-oriented prediction method is constructed here based on both the similarity measurement of the data and the loss function. Since this research focuses on the long-term regularity between trains and is limited by experimental conditions, directly updating on a data stream is not suitable, so the data set is updated as a whole. The lifelong learning prediction framework based on data set updating is shown in Figure 3. The model update is divided into two steps. One is the similarity measurement, which evaluates the degree of change between two data sets. The other is the prediction loss deviation, in which a loss rate function is designed to determine whether to keep, discard or update the model.
Firstly, since the data in this study are all numerical, the similarity is measured with a similarity coefficient, calculated using the Pearson Correlation Coefficient. However, considering that data sets obtained under the same test conditions at different times have a certain degree of similarity, the absolute value of the standard Pearson Correlation Coefficient is taken so that the sign of the correlation is ignored. The calculation formula is given in (15).
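The similarity coefficient described here can be illustrated with a minimal Python sketch of the absolute Pearson correlation between two equally long numeric series. How the coefficient is aggregated over all sensor columns of two data sets is not specified in the text, so the sketch covers a single pair of series only; the function name `abs_pearson` and the handling of constant series are assumptions.

```python
import math


def abs_pearson(a, b):
    """Absolute Pearson correlation between two equally long numeric series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return abs(cov / math.sqrt(var_a * var_b))


# A falling series is as "similar" to a rising one as an identical rising series,
# because the sign of the correlation is discarded.
rising = [1.0, 2.0, 3.0]
falling = [6.0, 4.0, 2.0]
similarity = abs_pearson(rising, falling)
```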
A similarity threshold $SIM$ is set, with a value range of (0, 1); when the calculated similarity is less than the threshold, the prediction loss deviation is calculated and evaluated. The predicted loss deviation mainly serves to judge whether the knowledge base is updated or discarded, and the evaluation method of formula (16) is usually adopted.
In this model, $L$ is the loss function and $\Phi(\theta)$ is the penalty function. Since the train braking process is a continuous air pressure prediction problem, the Mean Square Error (MSE) or the Mean Absolute Percentage Error (MAPE), shown in equations (17) and (18), can be used directly for evaluation.
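For reference, the two evaluation metrics in their standard textbook form can be written as the short Python sketch below. The pressure values are made up, and expressing MAPE in percent is an assumption about the convention used in (18).

```python
def mse(y_true, y_pred):
    """Mean Square Error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent.

    Assumes that no measured value is zero, otherwise the ratio is undefined.
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(y_true)


# Made-up air pressure readings and predictions.
measured = [500.0, 400.0]
predicted = [490.0, 420.0]
mse_value = mse(measured, predicted)
mape_value = mape(measured, predicted)
```

MAPE is scale-free, which makes it convenient for comparing sensors with different pressure ranges, whereas MSE penalizes large deviations more strongly.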
A percentage of predicted loss $L_p$ is defined here to determine whether the knowledge base is updated or discarded. If the prediction loss of the current model $M$ is $L_M$, and the loss obtained by training on the new data set is $L_{M'}$, then $L_p$ is computed from $L_M$ and $L_{M'}$. Two thresholds, $L_{low}$ and $L_{high}$, are set: when $L_p < L_{low}$, the original model is kept; when $L_p > L_{high}$, the model is discarded; and when $L_{low} < L_p < L_{high}$, the model is updated.

IV. MULTI-GROUPED TEMPORAL PREDICTION BASED ON TRANSFER LEARNING
A. NOTATIONS AND PROBLEM STATEMENT
The focus here is mainly on the third type, the model-based transfer method, also called parameter-based transfer learning, which finds the parameter information shared between the source domain and the target domain in order to realize the transfer. The assumption required by this method is that the data in the source domain and the target domain can share some model parameters.
Representative work mainly includes [38] [39]. An algorithm known as TransEMDT (Transfer learning EMbedded Decision Tree) integrates a decision tree and the k-means clustering algorithm for personalized activity-recognition model adaptation [38]. A new dimensionality reduction method was proposed to find a latent space that minimizes the distance between the distributions of the data in different domains [39]. By 2021, it had been shown that analyzing related operating parameters and designing structure adjustment strategies for an MLP can help knowledge transfer among domains for predicting the energy consumption of industrial robots [40].
An investigation of existing work shows that most current model-based transfer learning methods are combined with deep neural networks. By taking advantage of both deep learning and optimal two-sample matching, a unified deep adaptation framework for jointly learning a transferable representation and classifier was proposed to enable scalable domain adaptation [41]. The joint adaptation network (JAN), which learns a transfer network by aligning the joint distributions of multiple domain-specific layers across domains based on a joint maximum mean discrepancy (JMMD) criterion, is presented in [42]. A new CNN architecture in [43] exploits unlabeled and sparsely labeled target domain data. These methods modify existing neural network structures, add a domain adaptation layer to the network, and then conduct joint training. Therefore, they can also be regarded as a combination of model-based and feature-based methods.
A standard transfer learning process based on the baseline in Section III.B can be realized by parameter transfer and model fine-tuning. The training of the model on the data set grouped by $G'$ can be assisted by the model learned on the data set grouped by $G$ through parameter transfer between the two groups. However, it should be noted that the two domains involved in transfer learning do not have a direct and continuous relationship with each other, which is different from the concept of model verification and update for lifelong learning in Section III.C.
We now formalize the temporal prediction problem for multiple groups and set the time series length as $t$. $G$ is the group variable. If there are $m$ kinds of groups, then $G \in \{G_1, G_2, \cdots, G_m\}$. Let $N$ be the train number variable within a group; there are $n$ train numbers in total with $n \leq m$, and $N \in \{N_1, N_2, \cdots, N_n\}$. Let $P$ be the sensor variable within a train of the group, with $p$ sensors in total, $P \in \{P_1, P_2, \cdots, P_p\}$.
$d^G_{N,P} \in \mathbb{R}^t$ denotes the data sequence of the $P$-th sensor collected by the $N$-th train when the group variable takes the value $G$. Since the number of sensors is fixed during actual collection, the notation can be simplified by omitting the variable $P$; then $d^G_N \in \mathbb{R}^{p \times t}$ denotes the temporal data sequence collected by the $N$-th train when the group variable takes the value $G$. All train data sequences $d^G_N$ under this grouping form $d^G \in \mathbb{R}^{l \times p \times t}$. Furthermore, all the groupings together constitute the cross-grouping temporal data set $D^G \in \mathbb{R}^{m \times l \times p \times t}$. According to the definition of a domain in transfer learning, a data set $D^G$ of this nature is a domain. Such a data domain can be used arbitrarily as the source domain $D_s$ or the target domain $D_t$. One further definition is needed: a domain formed as the union of the source domain $D_s$ and the target domain $D_t$ is called a general domain, denoted by $D_{Gen}$.

B. TEMPORAL DATA MINING BASED ON PARAMETER TRANSFER
According to the formal problem in Section IV.A, the entire model/parameter-based transfer method for braking prediction is designed. The goal is to clarify how much transfer learning helps the training process and the performance of the training result. The overall comparison design is divided into two steps.
Step 1: Build a model on the source data set $D_s$ (the source domain), and then transfer the weight parameters obtained after training as the initialization parameters of the model to be trained on the target data set $D_t$ (the target domain). This is standard transfer learning.
Step 2: Combine the data from the two sources (source domain $D_s$ and target domain $D_t$) into a general domain $D_{Gen}$, and pre-train the model built on the source data set $D_s$ on the general domain data. The resulting transfer model of the general domain is defined as the "Pre-tuned Model". This model is then tuned on the target domain $D_t$, and the resulting multi-domain transfer model is defined as the "Tuned Model".
The performance of the two models obtained in this way is compared with the performance of the "Target-only" Model trained only on the target data $D_t$. Figure 4 shows the comparative design of the entire model/parameter-based transfer method for this problem. In cross-train transfer training, the different target domains are the data domains of different trains in the same group, for example transferring the prediction parameters of the 1st train to the 8th train in the 8-grouped set. In cross-group transfer training, the different target domains are the data domains of the same train in different groups, for example transferring the prediction parameters of the 1st train in the 8-grouped set to the 1st train in the 20-grouped set. For a specific problem, loss functions are calculated to evaluate which transfer model is optimal.
Table 1 lists all variables related to the train braking system in the process of multi-sensor data collection. These variables are the basis for the construction of the temporal prediction experiments. According to the analysis of the train braking principle and the braking conditions in Table 1, there are a total of 10 values for the working condition variable $S$. The specific input and output characteristics are shown in Table 2. In the actual test process, working conditions may be mixed. For the analysis of data with mixed working conditions, $S$ is used only to explain the working condition and does not participate in the prediction process. Table 3 lists part of the data sets collected by the multiple sensors of the train braking system.
The data sets gather all the characteristics of the data collected by the intelligent test platform, so they are also the main objects of mining and analysis in this research. All data dimensions and volumes in the characteristic parameters come from the original raw data. Most of the dimensions here refer to the number of sensor groups. Since there are 5 air pressure sensors on a train, the data dimension is generally the number of groups × 5. The specific prediction problem also involves the relevant variables in Table 1.

B. PERFORMANCE OF IMPROVED BASELINE FOR EXPERIMENTS
As described in Section III, the improved baseline is a two-layer LSTM-based model with an added input attention layer. To compare its results with the model in [33], both are trained with the same hyper-parameters, whose settings follow [33]. Table 4 shows the comparative results, where the Learning Rate is initialized as 0.002 with the Adam algorithm, Dropout = 0.5, Time Step (T) = 50, Iteration = 1000, Batch Size = 128, Unit = 32, and Epoch = 10. Compared with the model in [33], the improved model performs better on the evaluation metrics MSE and MAPE. MAPE is improved by 77.87%, and the training time does not increase much.

C. EXPERIMENT I: MODEL UPDATE AND VERIFICATION
1) Model Update Tuning for Formal Problem I
The lifelong learning prediction framework based on the data set update has two main parameters, the similarity coefficient and the percentage of prediction loss, whose choice affects the stability of the entire prediction. When optimizing similarity for industrial big data, thresholds that are too high or too low for these two parameters will affect the update result of the model. For the predictive analysis based on the temporal data sets DS1 (Single Train) and DS2 (5-grouped), the better-performing baseline with an attention mechanism is used. The data sets of different working conditions are divided separately, and the similarity threshold for multi-sensor prediction in a single train is tested experimentally within the lifelong learning framework.
We first fix $L_{low} = 0.4$ and $L_{high} = 1$ (abbreviated as 0.4/1), and set the similarity threshold to 0.3, 0.5, and 0.7 in turn. A flag parameter $FLAG$ displays the update status of the model: when $FLAG = 1$, the model is updated; when $FLAG = 0$, the model is kept. As shown in Figure 5, the test results show that when the similarity threshold $SIM = 0.5$, the update of the model is relatively stable. When $SIM = 0.3$, the update frequency of the algorithm is low, and when $SIM = 0.7$, it is too high. Therefore, $SIM = 0.5$ is selected.
After the similarity threshold is determined, further experiments are carried out on the thresholds of the predicted loss percentage, which are set to 0.4/1, 0.4/0.9, 0.4/0.8, 0.3/0.8, 0.3/0.9 and 0.3/1 in turn. Here the flag parameter $FLAG$ has three values: when $FLAG = 2$, the model is discarded; when $FLAG = 1$, the model is updated; when $FLAG = 0$, the model is kept. As shown in the experimental results of Figure 6, when the predicted loss thresholds are 0.3/0.8, the overall update frequency is relatively stable. Based on these experimental results, with $SIM = 0.5$, $L_{low} = 0.3$ and $L_{high} = 0.8$, the lifelong learning model update framework combined with the multi-sensor model in a single train can be considered relatively fixed.
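The update decision explored in this experiment can be summarized by the hedged Python sketch below, which returns the $FLAG$ value from a similarity score and two losses. The exact definition of $L_p$ is not reproduced in this text, so `loss_percentage` uses a hypothetical relative gap bounded by 1; the function names and the treatment of values exactly on a threshold are likewise assumptions. The default thresholds are the values selected above.

```python
KEEP, UPDATE, DISCARD = 0, 1, 2  # FLAG values used in Experiment I


def loss_percentage(loss_old, loss_new):
    """Hypothetical L_p: relative gap between the two losses, in [0, 1]."""
    largest = max(loss_old, loss_new)
    if largest == 0:
        return 0.0
    return abs(loss_new - loss_old) / largest


def update_flag(similarity, loss_old, loss_new, sim=0.5, l_low=0.3, l_high=0.8):
    """Decide whether to keep, update or discard the current model.

    The loss deviation is only evaluated when the data sets are dissimilar,
    that is, when the similarity falls below the threshold SIM.
    """
    if similarity >= sim:
        return KEEP
    l_p = loss_percentage(loss_old, loss_new)
    if l_p < l_low:
        return KEEP
    if l_p > l_high:
        return DISCARD
    return UPDATE  # assumption: values exactly on a threshold count as update


# Made-up similarity and loss values for four consecutive data set updates.
flags = [
    update_flag(0.9, 1.0, 5.0),   # similar data sets
    update_flag(0.2, 1.0, 1.1),   # dissimilar, small loss deviation
    update_flag(0.2, 1.0, 2.0),   # dissimilar, moderate loss deviation
    update_flag(0.2, 1.0, 10.0),  # dissimilar, large loss deviation
]
```

Raising `sim` lets more data set pairs fall below the threshold, which matches the observation that a higher similarity threshold leads to more frequent updates.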

2) Model Verification For Formal Problem II
Although the prediction network with the input attention mechanism can be used to optimize the prediction for both Formal Problem I and Formal Problem II, the test data actually collected across long trains is smaller than the data collected in a single train. Here we use DS4 (10-grouped) and DS5 (15-grouped) to verify the baseline model. As shown in Table 5, the baseline is relatively stable, and its MAPE and training time are smaller, indicating that the model is more robust under the same test conditions.
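For reference, MAPE (mean absolute percentage error) is the mean of |actual - predicted| / |actual|, expressed in percent. The following is a minimal sketch of that standard definition; the function name, the handling of zero-valued measurements, and the sample readings are assumptions made for illustration and are not taken from the experiments.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Pairs whose actual value is zero are skipped, because the relative
    error is undefined there (an assumption of this sketch).
    """
    terms = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(terms) / len(terms)


# Made-up sensor readings and model predictions, for illustration only.
measured = [500.0, 480.0, 450.0, 400.0]
forecast = [510.0, 470.0, 455.0, 390.0]
print(f"MAPE = {mape(measured, forecast):.2f}%")
```

A lower MAPE means the predicted sequence tracks the measured sequence more closely in relative terms, which makes it comparable across sensors with different pressure ranges.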

D. EXPERIMENT II: STANDARD TRANSFER OF CROSS-TRAIN TRANSFER LEARNING
During the transfer training test, three data sets were mainly analyzed: DS3 (8-grouped), DS6 (20-grouped I: all working conditions mixed except braking), and DS7 (20-grouped II: all working conditions mixed except emergency braking). Note that although both DS6 and DS7 are 20-grouped temporal data sets, they were collected under two completely different test conditions. The two sets of temporal data show different temporal characteristics, and the working conditions of the mixed trains also differ, so a comparative study is valuable. However, both come from the same sensor acquisition system, so they meet the general conditions for transfer learning. The following types of independent tests are carried out on the basis of a learned model. This experimental design serves two goals: to test the impact of transfer learning on prediction performance when data is scarce, and to study how parameter transfer can be made effective for the cross-train transfer training problem.

E. EXPERIMENT III: MULTI-DOMAIN TUNING OF CROSS-GROUP TRANSFER LEARNING
Based on the model/parameter transfer method design, further multi-domain experiments are carried out. For the data sets DS1, DS6 and DS7, which span differently grouped trains, a multi-domain tuning comparison experiment is carried out by constructing a general domain. The "Pre-tuned" model, "Tuned" model and "Target only" model are trained separately. In general, the Tuned model performs well in cross-group transfer training, and its loss remains relatively stable under various test conditions. The model performs especially well on data sets with similar transfer conditions, as shown in Table 7, Figure 7 and Figure 8.
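The three-way comparison can be illustrated with a toy example. The sketch below is not the LSTM model used in the experiments: it substitutes a one-variable linear regression trained by gradient descent, and all data, function names and settings are synthetic assumptions. It only shows the structure of the comparison: a "Pre-tuned" model trained on the general domain alone, a "Tuned" model that starts from those parameters and continues training on the target domain, and a "Target only" model trained from scratch on the target domain.

```python
def train(w, b, data, lr=0.1, epochs=20):
    """Fit y ~ w*x + b by full-batch gradient descent, starting from (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def mse(params, data):
    """Mean squared error of the linear model `params` = (w, b) on `data`."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Synthetic stand-ins: a data-rich general domain and a data-poor target domain.
general = [(i / 20, 2.0 * (i / 20) + 1.0) for i in range(21)]
target_train = [(x, 2.2 * x + 0.8) for x in (0.1, 0.5, 0.9)]
target_test = [(x, 2.2 * x + 0.8) for x in (0.0, 0.3, 0.7, 1.0)]

pre_tuned = train(0.0, 0.0, general, epochs=2000)  # general domain only
tuned = train(*pre_tuned, target_train)            # parameters transferred, then tuned
target_only = train(0.0, 0.0, target_train)        # no transfer, same training budget

for name, params in [("Pre-tuned", pre_tuned), ("Tuned", tuned),
                     ("Target only", target_only)]:
    print(f"{name:12s} target-domain MSE = {mse(params, target_test):.4f}")
```

In this toy setting the two domains are deliberately close, so starting from the transferred parameters gives the tuned model a lower target-domain error than either alternative within the same short training budget. When the domains differ more, that advantage can shrink or disappear, which is why the comparison is run per data set.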

F. RESULTS ANALYSIS
To consolidate the best-fitting experimental results, we summarize all models and quantitative analysis results as a methodology that lays the foundation for future work. Firstly, for the multi-sensor time series prediction problem in long-grouped trains, the two-layer LSTM-based model with an added input attention layer shown in Figure 2 can serve as a baseline, with the learning rate initialized to 0.002 using the Adam algorithm, Dropout = 0.5, Time Step (T) = 50, Iteration = 1000, Batch Size = 128, Unit = 32, and Epoch = 10. Secondly, the framework in Figure 3 can be applied to make full use of a steady stream of experimental data, with the three core parameters set to SIM = 0.5, L_low = 0.3 and L_high = 0.8. Last but not least, as shown in Figure 4, parameter-based transfer learning can improve training performance. The multi-domain tuning of the "Tuned" model is more robust in cross-group transfer training, and the "Pre-tuned" model performs better on the head-end sensor.
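The reported baseline settings can be gathered into a single configuration object. The sketch below is an illustration only: the class and field names are hypothetical, and the sliding-window helper assumes that the time step T is used in the common way, where each training sample consists of T consecutive readings followed by the next reading as the prediction target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    """Hyperparameters reported for the baseline (field names are hypothetical)."""
    learning_rate: float = 0.002  # initial value, used with Adam
    dropout: float = 0.5
    time_step: int = 50           # T
    iterations: int = 1000
    batch_size: int = 128
    units: int = 32
    epochs: int = 10
    lstm_layers: int = 2


def make_windows(series, time_step):
    """Pair each run of `time_step` readings with the reading that follows it."""
    return [(series[i:i + time_step], series[i + time_step])
            for i in range(len(series) - time_step)]


cfg = BaselineConfig()
readings = [float(i) for i in range(60)]  # made-up single-sensor series
samples = make_windows(readings, cfg.time_step)
print(len(samples), "samples, each with", len(samples[0][0]), "input readings")
```

Keeping the values in one immutable object makes it harder for a later experiment to change, say, the time step in one place while the data preparation still uses the old value.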

VI. CONCLUSION
This work is a further study based on an intelligent experiment platform for train braking systems, which can provide a large amount of temporal data. We design a lifelong learning-oriented predictive model for the field of train braking, built on continuous model updating and stability verification. Due to the limitations of experimental conditions, we were unable to obtain enough data to carry out the model updating test between long trains. However, this work has clarified the feasibility of this research route. Further study can be carried out in the following two aspects: 1) lifelong learning prediction based on data set updating between long trains; 2) transfer learning prediction for lifelong learning.