Bidirectional LSTM-Based Soft Sensor for Rotor Displacement Trajectory Estimation

Constant rotor system monitoring enables timely control and maintenance actions that decrease the likelihood of severe malfunctions and end product quality deficits. Soft sensors represent a promising branch of solutions enhancing rotor system monitoring. A soft sensor can substitute for a malfunctioning physical sensor and provide estimates of a quantity that is difficult to measure. This research demonstrates a soft sensor based on bidirectional long short-term memory (LSTM), and a training procedure for rotor system monitoring at high sampling frequency and varied operating conditions. This study adopts a large rotor and bearing vibration dataset. The soft sensor accurately estimates lateral displacement trajectories of the rotor from the bearing reaction forces over a large range of constant rotating speeds and constant support stiffnesses. The mean absolute error (MAE) of the LSTM-based soft sensor is 0.0063 mm over the test trajectories in the complete operating condition space. The soft sensor performance is shown to decrease significantly, to an MAE of 0.0442 mm, if the training dataset is limited in the rotating speed range.


I. INTRODUCTION
Measuring rotor vibration poses a common concern across numerous industries since excessive rotor vibration reduces machine lifetime and end product quality. For example, in paper production the paper thickness, gloss and basis weight are sensitive to vibration; thus the vibration tolerances are defined in micrometers. Furthermore, the rotor system components such as gears and bearings may wear down and fail before their expected end of lifetime due to rotor vibration. In order to control the vibration and to prevent such quality deficits and malfunctions, the machine vibration needs to be constantly monitored. Unfortunately, these vital vibration measurements may occasionally be prevented by a number of reasons. For example, machine geometry, processed materials or a hazardous environment might prevent the measurement of vibration. Furthermore, the financial burden and the high likelihood of sensor malfunctions associated with large sensor fleets might hinder rotor system monitoring in some large factories such as paper or steel mills.

The associate editor coordinating the review of this manuscript and approving it for publication was Li Minn Ang.
Soft sensors appear to be a promising solution for the aforementioned problems related to rotor vibration monitoring. Soft sensors, or synonymously virtual sensors, are computational models estimating a physical quantity from other related information, such as secondary measurements. Soft sensors are often categorised either as model-based or data-driven soft sensors [1]. Model-based soft sensors are designed on prior knowledge of the system. Some well-known techniques for designing model-based soft sensors include Kalman filters [2], extended Kalman filters [3] and state observers [4]. Soft sensors based on these techniques have been shown to estimate, for example, torsional vibration and rotating speed of a thruster [5], eccentricity of steel mill rotors [2] and bioprocess states [6]. However, designing an accurate rotor system model for a soft sensor may be a difficult task due to highly non-linear interactions between the system components. These difficulties can be solved with some simplifying assumptions such as linearisation of the system dynamics.
For example, a Kalman filter for estimating lateral displacements of a drill collar employed a lumped model where the drill components were joined by linear torsional springs [7]. The disadvantage of these simplifying assumptions is that they may lead to a reality gap between the model and the real rotor system, which might hinder the model-based soft sensor performance on real data.
Data-driven soft sensors are fitted on historical data measured from the system [1]. They require little prior knowledge and no simplifying assumptions of the underlying system dynamics. Traditional data-driven soft sensors are based on techniques such as self-organising maps (SOM) [8], partial least squares (PLS) [9], multilayer perceptrons (MLP) [10] and support vector machines (SVM) [11]. Typically, the disadvantage of these machine learning based traditional soft sensors is that they require complex feature extraction techniques and manual feature selection. Although SVM-based solutions estimating rotor displacement directly from the control coils of a hybrid magnetic bearing have been developed recently [12], [13], there are still few data-driven soft sensors that have been shown to generalise over a large range of operating conditions. For example, varied rotating speed or support structure stiffness are such operating conditions for rotor systems. Most likely, the learning capacity of these traditional data-driven soft sensors limits their applicability to problems requiring accuracy over a large range of operating conditions.

Deep learning has been found to outperform other machine learning models in numerous benchmark tasks such as image classification [14], natural language processing [15] and reinforcement learning [16]. Unsurprisingly, these methods have also attracted the attention of soft sensor researchers. Numerous studies have indicated the applicability of different deep learning based soft sensors to many application areas [17]. For example, deep learning based soft sensors were adopted in monitoring industrial processes such as penicillin fermentation [18], polyethylene production [19] and hydrocracking [20]. However, deep learning based soft sensors for rotor systems have received less attention [17].
Although a few deep learning based soft sensors that estimate deformation of an air pre-heater have been proposed based on stacked autoencoder (SAE) [21], [22] and deep belief network [23], deep learning based soft sensors for rotor vibration are relatively scarce. Furthermore, most soft sensor applications function on data with low sampling rates, such as one sample per hour or day. Such soft sensors would likely be unsuitable for monitoring rotor vibration, such as rotor displacement trajectories, at the typical sampling rates of multiple times per second. A soft sensor for rotor vibration needs to learn dynamic changes in the system at high frequency instead of relatively static non-linear relations between numerous process variables.
This study demonstrates a data-driven soft sensor based on long short-term memory (LSTM) and a training procedure for rotor displacement trajectories. LSTM [24] is a recurrent neural network (RNN) capable of learning relations in sequential data over long time lags. LSTM can learn such relations due to the inbuilt memory that it learns to control at each time instant with gating mechanisms. This property has proven useful in previously developed soft sensors for a chemical process [25] and rotor speed [26]. This study employs the LSTM memory and gating mechanisms to learn the relation between the industrially available bearing vibration signals and rotor lateral displacement trajectories over a range of operating conditions. LSTM can encode this non-linear dynamic relation into its memory as latent hidden states with implicit state transition and observer functions parameterised by the gating functions. The rotor displacement values are accurately estimated at each time instant from the corresponding bearing vibration values at 2 kHz sampling frequency. The applicability of the LSTM-based soft sensor is demonstrated with a large and recently published rotor vibration dataset [27] including real rotor displacement trajectories and corresponding bearing vibration data under varied operating conditions. These conditions include rotating speed and horizontal support stiffness of the rotor.
The soft sensor performance is examined in three case studies. In the first one, the model is compared against other regression algorithms without in-built memory and shown to be superior in estimation accuracy. The second case study demonstrates LSTM-based soft sensor performance over a large range of operating conditions. The third case study highlights the importance of adopting the entire operating condition space in the training procedure. The results imply that the proposed LSTM soft sensor is accurate and applicable to rotor (lateral displacement) monitoring. Moreover, such a soft sensor can be useful, for example, in paper production, where the vibrational displacement is both directly linked to the end product quality and difficult to measure due to the ongoing production process.
The contributions of this paper are the following:
• An LSTM-based soft sensor for rotor displacement trajectory estimation under different operating conditions is demonstrated.
• A soft sensor training procedure for varied operating conditions is developed and shown crucial for accuracy.
• An LSTM-based soft sensor is shown to be superior in trajectory estimation against models without memory.

The paper has been structured as follows. Section II reviews LSTM architecture and relevant structural variations. Section III explains the required development and training procedure for a soft sensor based on LSTM. The experimental results of the LSTM-based soft sensor are shown in Section IV and discussed in Section V. Finally, the paper is summarised in Section VI.

A. RNNs
RNNs are known for their capability to learn temporal patterns from sequential data such as time series data. RNNs learn these temporal patterns by processing each time instant separately and by communicating information of previously detected patterns forward along the sequential process, and thus they are described as recurrent. This sequential process is visualised as generalised computation graphs in Fig. 1. On the left, a folded RNN receives input $x_t$ and a hidden state $h_{t-1}$ from the previous time instant. For each time instant, the RNN also computes an output value $y_t$. In addition, these computations are expressed as an unfolded graph on the right in Fig. 1. The computations related to the hidden state are shown in (1), where $a$ is an arbitrary non-linear activation function, $W$ is the set of weights optimised for processing the inputs $x_t$, $V$ is the set of weights optimised for processing the signals $h_{t-1}$ from the previous time step, and $b_i$ is an additional optimisable bias term. Finally, the computations related to the output $y_t$ are shown in (2), where $b$ is an arbitrary non-linear activation function, $U$ is the set of weights used to process the hidden state $h_t$, and $b_o$ is an additional optimisable bias term.

VOLUME 9, 2021
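Equations (1) and (2) did not survive in this copy of the text. Based only on the symbol definitions in the paragraph above, they presumably take the following standard form; the exact typesetting in the original may differ.

```latex
\begin{align}
h_t &= a\left(W x_t + V h_{t-1} + b_i\right) \tag{1} \\
y_t &= b\left(U h_t + b_o\right) \tag{2}
\end{align}
```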

B. LSTM STRUCTURE
Long short-term memory (LSTM) [24] is a recurrent neural network structure that can learn to perceive and memorize patterns over thousands of timesteps [28]. LSTM has been shown to perform accurately in many problem domains including sequential data [28]. Examples of these problem domains are speech recognition [29], translation [30], and rotor system fault diagnosis [31]. The exceptional temporal learning capabilities of LSTM build on the gating mechanism that allows a sophisticated control over the communication between the sequential computations. The communication between the sequential computations relies on two signals commonly known as the cell state $c_t$ and hidden state $h_t$. The gating mechanism consisting of four gates can learn the sequential patterns in the input signals $x_t$. The gates are optimised to either change or not to change the values in the communication signals $c_t$ and $h_t$. With timely changes to these recurrent signals, the signals may be passed forward through many sequential computations with little or no change. This property efficiently decreases the vanishing and exploding gradient problems related to the gradient based optimisation of RNNs.
The four LSTM gates process inputs at each time instant with a shared number of parameters. The computations related to the forget gate $f_t$, input gate $i_t$, output gate $o_t$ and update gate $\tilde{c}_t$ are shown respectively in (3), (4), (5) and (6). In these equations, the sets of weights $W_f$, $W_i$, $W_o$ and $W_c$ multiply the concatenated input of the current time instant $x_t$ and the previous hidden state $h_{t-1}$. Additionally, bias terms $b_f$, $b_i$, $b_o$ and $b_c$ may be added to the gate values. Finally, a non-linear activation function is applied to all computed values. The non-linear activation function $\sigma$ in (3), (4), (5) is commonly the sigmoid function, which scales all the gate values between 0 and 1. The non-linear tanh activation function in (6) is the commonly known hyperbolic tangent, which scales the candidate values $\tilde{c}_t$ to the range between −1 and 1. Fig. 2 shows these gates as the green blocks with the corresponding activation labeled on top of each block. The orange circles and ellipse correspond to point-wise operations, where × is multiplication, + is summation and tanh is the hyperbolic tangent. These point-wise operations are shown in (7) and (8), where the previous cell state $c_{t-1}$ is first updated, and then the hidden state $h_t$ is computed with the output gate $o_t$ and the updated cell state $c_t$. The updated cell state $c_t$ and the hidden state $h_t$ are then passed forward to the next computation step $t + 1$. A copy of the updated hidden state $h_t$ moved upwards in the LSTM computation graph in Fig. 2 is sometimes adopted as the output of the cell at the current timestep $t$.
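Equations (3) to (8) are also missing from this copy. The description above matches the standard LSTM formulation, which can be written as follows. The concatenation order $[h_{t-1}, x_t]$ and the symbol $\odot$ for point-wise multiplication are notational assumptions here, not taken from the original.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \tag{3} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \tag{4} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \tag{5} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \tag{6} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{7} \\
h_t &= o_t \odot \tanh\left(c_t\right) \tag{8}
\end{align}
```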

C. TYPICAL VARIATIONS TO LSTM NETWORK STRUCTURES
The learning capabilities of LSTM can be enhanced by increasing the number of adjacent LSTM cells at each time instant. The number of adjacent cells in an LSTM network can be increased by stacking LSTM cells and by introducing a bidirectional path [32]. A stacked LSTM network can learn more non-linear relations between input and output sequences, since the network computes the output $h_t$ for each time instant with multiple cells. The stacked cells consider the output $h_t$ from the previous cell as their input $x_t$. A bidirectional LSTM has a broader perception on the sequence that it processes than a uni-directional LSTM. The bidirectional LSTM network processes the input sequence in two directions, i.e. forward and backward. A stacked and bidirectional LSTM network architecture consisting of two layers is shown in Fig. 3.

D. RNNs AS IMPLICIT STATE SPACE MODEL
Model-based soft sensors typically rely on state space models. State space models are based on physical equations of the system dynamics. With these physical equations, a relation between states and system inputs can be constructed. An example of such state space models is a linear, time-invariant discrete time system shown in (9). The state space model consists of the system equation and the output equation. In the system equation, the new state $x(k + 1)$ at time step $k + 1$ is computed with the state transition matrix $A$, current state $x(k)$, input matrix $B$ and input values $u(k)$. In the output equation, the system output $y(k)$ of the current time step is computed with the output matrix $C$, current state $x(k)$, feedthrough matrix $D$ and system input values $u(k)$.
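Equation (9) is missing from this copy. From the matrices and signals named above, the linear, time-invariant discrete time state space model reads as follows (system equation first, output equation second).

```latex
\begin{equation}
\begin{aligned}
x(k+1) &= A\,x(k) + B\,u(k) \\
y(k) &= C\,x(k) + D\,u(k)
\end{aligned}
\tag{9}
\end{equation}
```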
RNNs can learn to function similarly to state space models from the input-output perspective. After the model optimisation, RNNs can compute the time series inputs deterministically. That is, the hidden state and the output values are computed for each time step based on the order and the values in the input time series data. More specifically, an RNN as shown in Fig. 1 computes an output $y_t$ and a hidden state $h_t$ from the input $x_t$ and the previous hidden state $h_{t-1}$. These values correspond to the state space model output $y(k)$, system state $x(k)$, input $u(k)$ and previous system state $x(k - 1)$ respectively. Furthermore, the weights $W$, $V$ and $U$ can be considered to parameterise the state space model matrices $A$, $B$, $C$ and $D$. However, showing this correspondence is difficult since inferencing RNN computations and the hidden state $h_t$ is hindered by the stochasticity of the weight optimisation algorithm. That is, each RNN likely has a unique set of weights after every optimisation run. Due to the difficulty of inference and similarities between these two approaches, RNNs can be considered as implicit state space models. For some soft sensor applications requiring output values for every timestep, for example trajectory estimation, employing RNNs as implicit state space models may be the preferable choice. Many RNN-based architectures employing non-deterministic time series input processing, such as encoder-decoder structures or attention mechanisms, may hinder the soft sensor performance due to excessive model complexity.

III. LSTM-BASED SOFT SENSOR FOR TIME SERIES DATA
This section explains the optimisation and training procedure of LSTM-based soft sensor for rotor vibration. The procedure is covered in three subsections. In addition, Fig. 4 shows the whole procedure. In Fig. 4, sample-rate conversion, scattered data division in the operating condition space and time window division relate to preprocessing the rotor vibration data. These preprocessing techniques are described in Section III-A. Hyperparameter optimisation concerns the time window division parameters, the model architecture parameters and model optimisation parameters. Section III-B explains the model architecture selection. Section III-C visits the model optimisation for large operating condition space and model evaluation.

A. PREPROCESSING TIME SERIES DATA FOR A DATA-DRIVEN SOFT SENSOR
Neural networks can learn directly from raw data without complex feature extraction techniques. However, there are a number of preprocessing methods for time series data that can enhance RNN-based soft sensor training convergence and test performance. Such data preprocessing techniques are sample-rate conversion, time window division and scattered division to training and test data. These methods increase the consistency between the input variable distributions without losing a significant amount of information.
Sample-rate conversion increases the consistency of the training data distribution. Sample-rate conversion is useful if the sampling rate varies between the training time series samples. The sample rate varies in the time series data sampled from the rotor system if the sampling is triggered with a rotary encoder and the rotating speed changes. Sample-rate conversion downsamples the data with higher sample rates to the lowest sample rate in the dataset. However, downsampling requires caution, because the downsampled time series signals should still cover the input and output trajectories with high resolution. In this work, all time series samples were converted to the 2 kHz sample rate.
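As a rough illustration of sample-rate conversion, the sketch below resamples a uniformly sampled signal to a common target rate by linear interpolation. The paper does not state which resampling method was used, so the interpolation scheme and the function names `encoder_sample_rate` and `resample_linear` are assumptions made for this example. A practical implementation would normally apply an anti-aliasing low-pass filter before reducing the rate.

```python
def encoder_sample_rate(rotating_speed_hz, samples_per_rev=1024):
    """Sample rate (Hz) of encoder-triggered sampling at a constant rotating speed."""
    return rotating_speed_hz * samples_per_rev


def resample_linear(signal, rate_in, rate_out):
    """Resample a uniformly sampled signal to rate_out by linear interpolation.

    Illustrative only: no anti-aliasing filter is applied before downsampling.
    """
    if rate_in <= 0 or rate_out <= 0:
        raise ValueError("sample rates must be positive")
    if len(signal) < 2:
        return list(signal)
    duration = (len(signal) - 1) / rate_in      # time of the last input sample
    n_out = int(duration * rate_out) + 1        # output samples within that span
    resampled = []
    for j in range(n_out):
        pos = j * rate_in / rate_out            # fractional index into the input
        i = min(int(pos), len(signal) - 2)      # left neighbour, clamped at the end
        frac = pos - i
        resampled.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return resampled
```

With 1024 samples per revolution, a sample recorded at 4 Hz rotating speed has a native rate of 4096 Hz, so `resample_linear(signal, encoder_sample_rate(4.0), 2000.0)` would bring it to the 2 kHz rate used in this work.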
Time window division serves two purposes: increasing the amount of training data, and introducing more stochasticity in the training. The amount of training data can be increased by sub-sampling overlapping time windows from each time series data sample consisting of multiple time steps. The number of time windows extracted from a time series data sample depends on the number of the overlapping time steps between the time windows, the length of time windows and the length of the sample. Fig. 5 illustrates time window division of a generic time series data sample consisting of vibration data. In this work, the training dataset consisted of such sub-sampled time windows from multiple time series samples. These time windows were then drawn from the training dataset in a random order during training. Randomly drawing the training time windows ensures that the angular position of the rotor at the first and at the final time step changes uniformly. This way, RNN optimisation was subjected to less bias. Furthermore, shorter time windows may avert vanishing and exploding gradient problems related to RNN optimisation.
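The time window division described above can be sketched as follows. The function names, the `seed` argument and the choice to drop a trailing partial window are assumptions made for this example, not details from the paper.

```python
import random


def split_into_windows(sample, window_length, overlap):
    """Return overlapping windows of window_length time steps from one sample.

    Consecutive windows share `overlap` time steps; a trailing partial window
    is dropped.
    """
    if not 0 <= overlap < window_length:
        raise ValueError("overlap must satisfy 0 <= overlap < window_length")
    stride = window_length - overlap
    last_start = len(sample) - window_length
    return [sample[start:start + window_length]
            for start in range(0, last_start + 1, stride)]


def shuffled_training_windows(samples, window_length, overlap, seed=0):
    """Collect windows from all samples and return them in random order."""
    windows = []
    for sample in samples:
        windows.extend(split_into_windows(sample, window_length, overlap))
    random.Random(seed).shuffle(windows)
    return windows
```

For a sample of N time steps, window length L and overlap O, this yields floor((N − L) / (L − O)) + 1 windows when N ≥ L, which is the dependence on overlap, window length and sample length described in the text.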
Training a data-driven soft sensor for a dynamical system, such as a rotor system, requires a large number of trajectories sampled at varied operational conditions. For example, the rotor rotating speed should vary in the training dataset between the typically used minimum and maximum operating speeds. If the training data includes trajectories from a sparse distribution of operating conditions, the model performance is likely unsatisfactory under some typical operating conditions.
In this research, the training and test sets were divided by randomly selecting time series samples from a grid of operating conditions. Fig. 6 shows a simplified training and test data division along a rotating speed grid. In Fig. 6, the red and the black lines are the time series samples, such as in Fig. 5 and 9, which would be included in the training set and the test set respectively. The advantage of this division technique is that the training data likely covers the operating condition space with satisfactory resolution. That is, the model learns to estimate trajectories under many operating conditions, such as different rotating speeds, which leads to better performance under all operating conditions near the conditions in the training dataset.
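A minimal sketch of such a scattered division is given below. Each time series sample is identified here by its operating condition, and a random fraction of the conditions is assigned to the training set. The function name and the `seed` argument are hypothetical.

```python
import random


def scattered_split(conditions, train_fraction, seed=0):
    """Randomly assign operating conditions to disjoint training and test sets.

    conditions: sequence of hashable, sortable identifiers, for example
    rotating speeds or (speed, stiffness) tuples.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(conditions)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    train = sorted(shuffled[:n_train])
    test = sorted(shuffled[n_train:])
    return train, test
```

For instance, a rotating speed grid from 4 Hz to 18 Hz at 0.05 Hz steps can be built as `[round(4.0 + 0.05 * i, 2) for i in range(281)]` and split with `train_fraction=0.25`, so that the training conditions are scattered across the whole speed range rather than concentrated in one part of it.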

B. LSTM NETWORK ARCHITECTURE
A suitable LSTM architecture for estimating the system output $y_t$ at each time instant $t$ has enough computational capacity, memory and nonlinear representational capability. These attributes relate to the length of the cell state and the number of layers. By increasing the amount of values the cell state $c_t$ can hold to the next time step $t + 1$, the memory capacity increases. The number of parameters in each gate $f_t$, $i_t$, $o_t$ and $\tilde{c}_t$ is linearly dependent on the number of values held in the cell state. Therefore, by increasing the memory capacity, the LSTM computational capacity also increases. A large cell state may be crucial, if many input variables in $x_t$ show interdependence over multiple time steps. The nonlinear representational capability of LSTM network can be increased by stacking multiple LSTM cells and adding a bidirectional path to the network. After stacking LSTM cells, the superimposed cells each introduce another set of gates and memory to the function between the input variables $x_t$ and the output $y_t$. Bidirectionality further increases the nonlinearity between the input variables $x_t$ and the output $y_t$. The bidirectional paths of LSTM compute the output based on previous values from both directions in the sequence. Bidirectional LSTMs may increase the performance in computing output trajectories at the expense of introducing some time lag to the soft sensor outputs. The time lag is caused by the path processing input variables backwards in time, since it needs some inputs from the future to compute output $y_t$ for the current time instant. This might hinder using bidirectional LSTMs in some real-time tasks. Fig. 7 presents the LSTM-based soft sensor architecture used in this research. The architecture consists of n superimposed LSTM layers and a fully-connected layer (FCL). The bidirectional paths, as shown in Fig. 3, are not explicitly visible, but considered to be included in each LSTM block in the figure.
The first LSTM layer processes input signals one time instant at a time, i.e. the values of $x_t$ correspond to the values sampled by the sensors at time instant $t$. The following hidden LSTM layers compute more hierarchical representations of the system state. The final hidden LSTM layer passes the hierarchical system state representation to the FCL, which computes the output values $y_t$ corresponding to the current time instant. A suitable number of LSTM layers is specific to the soft sensor application, and needs to be searched. In this research, random search was employed to find a suitable number of layers simultaneously with other hyperparameters, such as the size of the cell $c_t$.
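To make the link between cell state size, stacking, bidirectionality and model size concrete, the sketch below counts trainable parameters for a stacked bidirectional LSTM followed by an FCL. It assumes the gate formulation of Section II-B with one weight matrix and one bias vector per gate, and it assumes that the forward and backward outputs are concatenated before the next layer. Deep learning frameworks may count biases differently, and the layer sizes used in the test values are placeholders, not the values of Table 1.

```python
def lstm_layer_parameters(input_size, cell_size, bidirectional=True):
    """Parameters of one LSTM layer with four gates.

    Each gate has a (cell_size x (input_size + cell_size)) weight matrix
    acting on the concatenated input and previous hidden state, plus a bias
    vector of length cell_size.
    """
    per_gate = cell_size * (input_size + cell_size) + cell_size
    per_direction = 4 * per_gate
    return per_direction * (2 if bidirectional else 1)


def soft_sensor_parameters(n_inputs, n_outputs, cell_size, n_layers,
                           bidirectional=True):
    """Total parameters of n stacked LSTM layers followed by a linear FCL."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = n_inputs
    for _ in range(n_layers):
        total += lstm_layer_parameters(layer_input, cell_size, bidirectional)
        # Assumed: both directions are concatenated as input to the next layer.
        layer_input = directions * cell_size
    total += layer_input * n_outputs + n_outputs  # FCL weights and biases
    return total
```

With the six force inputs and two displacement outputs used in this study, `soft_sensor_parameters(6, 2, cell_size, n_layers)` gives a quick way to compare candidate architectures during a hyperparameter search.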

C. LSTM-BASED SOFT SENSOR TRAINING AND EVALUATION
In this study, the data-driven soft sensor was trained with supervised learning. That is, the estimated displacement trajectories $Y$ consisting of the displacement values $y_{t,k}$ at each time instant $t$ were compared to the measured trajectories $Y_{\text{true}}$. For each corresponding trajectory $Y$ and $Y_{\text{true}}$, an error term was computed with mean squared error (MSE). The error term is expressed in (10). In (10), $k$ denotes rotor displacement in horizontal and vertical directions. The parameters of the LSTM-based soft sensor were updated based on these error terms with the Adam optimiser [33].
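Equation (10) is missing from this copy. A standard MSE over a trajectory of $T$ time instants and $K = 2$ displacement directions, consistent with the symbols above, would be as follows. The symbols $T$, $K$ and $y^{\text{true}}_{t,k}$, as well as the normalisation by $TK$, are assumptions.

```latex
\begin{equation}
\mathrm{MSE}\left(Y, Y_{\text{true}}\right) =
\frac{1}{T K} \sum_{t=1}^{T} \sum_{k=1}^{K}
\left( y_{t,k} - y^{\text{true}}_{t,k} \right)^{2}
\tag{10}
\end{equation}
```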
Two different weight saving schemes were designed for the model optimisation. If many operating conditions were included in the training dataset, the model weights were saved every 500 batches, if the running MSE was smaller than the previously smallest running MSE. 500 batches likely included trajectories with largely varying operating conditions, and thus a validation dataset was not considered necessary. If only one operating condition was included in the dataset, a validation trajectory was excluded from the training trajectory. The weights were saved every time the validation MSE decreased below the previous best validation MSE. The set of weights saved with the lowest error were utilised in testing. The test performance of the model was evaluated with the mean absolute error (MAE), since it is more interpretable than MSE. The formula for this test performance metric is shown in (11).
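The first weight saving scheme and the test metric can be sketched as follows. Equation (11) is missing from this copy, so the MAE is written here in its usual form, $\frac{1}{TK}\sum_{t}\sum_{k} |y_{t,k} - y^{\text{true}}_{t,k}|$. The "running MSE" is assumed to be the mean batch loss over the most recent interval of 500 batches. That assumption and the function names are not taken from the paper.

```python
def mean_absolute_error(estimated, measured):
    """MAE over all time instants and displacement directions.

    Both arguments are sequences of per-time-instant value sequences,
    e.g. [[horizontal, vertical], ...].
    """
    errors = [abs(y - y_true)
              for est_t, true_t in zip(estimated, measured)
              for y, y_true in zip(est_t, true_t)]
    return sum(errors) / len(errors)


def checkpoint_steps(batch_losses, interval=500):
    """Return (batch_index, running_mse) for every point where weights are saved.

    Every `interval` batches, the mean loss of that interval is compared with
    the best value so far; the weights would be saved only on improvement.
    """
    best = float("inf")
    saved = []
    block = []
    for index, loss in enumerate(batch_losses, start=1):
        block.append(loss)
        if index % interval == 0:
            running = sum(block) / len(block)
            block = []
            if running < best:
                best = running
                saved.append((index, running))
    return saved
```

In a training loop, the weights stored at the last entry returned by `checkpoint_steps` correspond to the set with the lowest error, which is the set used in testing.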

IV. VALIDATION OF LSTM-BASED SOFT SENSOR
The vibration of a rotor system is sensitive to the operating conditions, such as the rotating speed and the support stiffness. Rotor systems vibrate excessively when excited at their natural frequency. The rotor vibration also significantly varies between different rotor systems, even in a case of two different realizations of a single rotor system design.
The uncertainties tend to grow with the rotor system size. For example, the foundation stiffness of a rotor changes the natural frequencies of the whole system. This section investigates the performance of an LSTM-based soft sensor under two scenarios: varying rotor speed and horizontal foundation stiffness.

A. DATA DESCRIPTION
The dataset consists of multiple time series samples each including 100 revolutions of the rotor at constant rotating speed and at constant foundation stiffness. The rotating speeds in the dataset range from 4 Hz to 18 Hz at 0.05 Hz steps. For each constant rotating speed, the dataset also includes multiple time series samples measured at different horizontal foundation stiffnesses. A horizontal stiffness controller was designed for acquiring this dataset. More details of the controller and its effect on the rotor system are demonstrated in [34]. This work considers merely the resulting horizontal foundation stiffness in the range between 2.04 MN/m and 18.32 MN/m. For convenience, appendix A shows some Overall, a time series sample in the dataset consists of eight signals, which were acquired with encoder-triggered phase-locked sampling. Therefore, 1024 samples were acquired from every sensor during each revolution. All time series samples in the dataset have been resampled to a 2 kHz sample rate for this study. The data in each time series sample consists of two signals of center-point displacement (horizontal and vertical) of the rotor middle cross-section, three drive-end bearing housing vibration signals (forces) and three tending-end bearing housing vibration signals (forces). Fig. 8 shows examples of such signals. The forces measured at the drive-end bearing housing are on the left, and on the right are the corresponding center-point displacements. Tending-end bearing housing vibration signals appear similar to the corresponding drive-end signals in the time domain. There are two vertical force sensor signals per bearing housing due to the measurement setup. More detailed information regarding the dataset can be found in [27].

B. EXPERIMENTAL SETUP
The LSTM-based soft sensor for rotor vibration was validated in three case studies, which all followed a similar scheme. In each case study, the model computed the trajectory of the rotor middle cross-section center-point displacement in horizontal and vertical directions. These displacement trajectories correspond to the lateral vibration of the rotor. The inputs for the model were the horizontal and vertical vibration signals measured with three force sensors at both bearing housings of the rotor. The parameters for the LSTM-based soft sensor and model optimisation are detailed in Table 1. These parameters were found with random search. More details regarding the random search procedure are found in appendix B.

1) CASE STUDY I: LSTM-BASED SOFT SENSOR PERFORMANCE UNDER FEW OPERATING CONDITIONS
This case study compares the LSTM-based soft sensor to other regression algorithms without a memory and with a limited perception over the input trajectories. The baseline models are linear regression (LR) and three formulations of  support vector machines (SVM). The model for LR consists of six weights per output trajectory (horizontal and vertical rotor center-point displacements). The weights for LR were optimised with the ordinary least squares method. All SVM-based models rely on the LIBSVM library implementations of -support vector regression ( -SVR) [35]. The three ( -SVR) models have different kernels. The kernel types were linear (LIN), radial basis function (RBF) and polynomial (POLY). The parameters and C for the SVR-models were grid searched. The grid search is explained in detail in appendix C.
This case study consisted of five independent tests. In each test, the training and test data shared constant rotating speed and constant horizontal foundation stiffness. Fig. 9 shows the training and test data division in all tests. The first 33 % was utilised in model optimisation and the last 50 % was employed in model evaluation. The middle 16 % was applied to LSTM validation. The first 33 % of every time series sample covered 33 full revolutions. This share of the sample holds enough data to omit the effect of random factors on the model optimisation. Similarly 50 % of the time series sample, corresponding to 50 rotor revolutions, holds enough data to omit the random factor effects on the model test performance. The validation data was considered necessary for LSTM optimisation alone, since other models were optimised deterministically.

Fig. 9. The white area in this operating condition space was not measured in order to avoid excessive rotor vibration at the rotor critical speed.
The input signals for all the models were the six bearing reaction force signals. The outputs of all the models were the horizontal and vertical rotor center-point displacement signals. The test error was computed as the mean absolute error (MAE) between the measured and estimated rotor displacement signals. Table 2 details the test performance of the LSTM-based soft sensor and the baseline models. The LSTM-based soft sensor outperformed all the baseline models in terms of the MAE in these five tests.

2) CASE STUDY II: LSTM-BASED SOFT SENSOR PERFORMANCE UNDER MANY OPERATING CONDITIONS
Rotor systems are typically operated under more varied operating conditions than in the case study with few operating conditions in Section IV-B1. Although a soft sensor capable of estimating trajectories at many rotating speeds could utilise an ensemble of models, a more desirable solution would require only a single model. Furthermore, more dynamic properties affecting the rotor system vibration exist. Neural networks offer an advantage for soft sensor development due to their capacity to learn from large data. Therefore, in this case study we explored the LSTM-based soft sensor performance over a large range of operating conditions. The operating conditions included constant rotating speeds between 4 Hz and 18 Hz at 0.05 Hz steps and a range of horizontal foundation stiffnesses.

The whole dataset of over 5000 time series samples, each such as the one shown in Fig. 8, was divided into training and test sets. The samples in the training set were chosen randomly based on the operating condition, and excluded from the test set. The resulting data division is shown in Fig. 10. The red blocks are the time series samples in the training set, which correspond to 25 % of the samples. The black blocks are the time series samples in the test set, which correspond to 75 % of the samples.
The LSTM-based soft sensor performance was evaluated with one 5-second window from every time series sample in the test set. The mean absolute error over these time windows was 0.0063 mm. Fig. 11 shows examples of 5-second windows of estimated and measured displacement trajectories in the horizontal and vertical directions and the corresponding discrete Fourier transforms (DFTs) at a rotating speed of 10.5 Hz and a horizontal foundation stiffness of 18.32 MN/m. Fig. 12 shows the DFTs of the 5-second windows of the measured horizontal displacement trajectories in the test dataset. Fig. 13 shows the corresponding DFTs of the estimated trajectories. Fig. 14 shows the differences between the corresponding DFTs. Fig. 15, 16 and 17 show the respective DFTs and their differences in the vertical direction. The difference between the DFTs of the estimated and measured displacement trajectories is consistently less than 0.05 mm in the entire operating condition space, as Fig. 14 and 17 show.
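The DFT comparison above can be sketched as follows. The sampling rate of 1000 Hz, the synthetic 10 Hz sine signals and their amplitudes are assumptions for illustration only; the amplitude scaling shown is the usual single-sided convention and may differ from the one used for the figures.

```python
import numpy as np

def dft_amplitudes(signal, sample_rate):
    """Single-sided DFT amplitude spectrum of a real-valued signal."""
    n = len(signal)
    amplitudes = np.abs(np.fft.rfft(signal)) / n
    amplitudes[1:] *= 2.0          # fold negative frequencies onto positive
    if n % 2 == 0:
        amplitudes[-1] /= 2.0      # the Nyquist bin has no mirrored partner
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    return freqs, amplitudes

SAMPLE_RATE = 1000.0                         # Hz, hypothetical
t = np.arange(int(5 * SAMPLE_RATE)) / SAMPLE_RATE   # one 5-second window
measured = 0.10 * np.sin(2 * np.pi * 10.0 * t)      # mm, synthetic
estimated = 0.09 * np.sin(2 * np.pi * 10.0 * t)     # mm, synthetic

freqs, measured_amp = dft_amplitudes(measured, SAMPLE_RATE)
_, estimated_amp = dft_amplitudes(estimated, SAMPLE_RATE)
diff = np.abs(measured_amp - estimated_amp)
peak = int(np.argmax(diff))
print(f"largest difference {diff[peak]:.3f} mm at {freqs[peak]:.1f} Hz")
```

A 5-second window gives a frequency resolution of 0.2 Hz, so the synthetic 10 Hz component falls exactly on a DFT bin and no spectral leakage appears in this example.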

3) CASE STUDY III: LSTM-BASED SOFT SENSOR PERFORMANCE UNDER LIMITED OPERATING CONDITIONS
This case study highlights the importance of training the LSTM-based soft sensor with trajectories drawn from the whole typical operating condition space. The operating condition space was constrained in the training set, as shown in Fig. 18. The model was first trained with time series samples from the rotating speed range between 4 Hz and 9 Hz. Then, the model performance was evaluated with time series samples from the rotating speed range between 9 Hz and 18 Hz. Following the procedure of the case study with many operating conditions in Section IV-B2, the test time windows were 5-second trajectories from all test time series samples. The mean absolute error between the estimated and measured displacement trajectories was 0.0442 mm. Fig. 19 and 20 show the DFTs of these measured and estimated displacement trajectories in the horizontal direction, respectively. Fig. 21 shows the difference between these DFTs. Fig. 22, 23 and 24 show the corresponding DFTs and their difference in the vertical direction. The differences between the DFT amplitudes are much larger at the rotating speeds that are farther from the rotating speed range utilised in training. These differences increase up to 0.3 mm in the horizontal and vertical directions, as Fig. 21 and 24 show.
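The speed-limited division can be expressed as a simple mask over the rotating speed grid, as in the sketch below. Assigning the boundary speed of exactly 9 Hz to the test set is an assumption; the text does not say on which side of the split it falls.

```python
import numpy as np

# Rotating speed grid from the text: 4 Hz to 18 Hz at 0.05 Hz steps.
speeds = np.round(4.0 + 0.05 * np.arange(281), 2)

train_mask = speeds < 9.0    # training: lower speed range only
test_mask = ~train_mask      # test: 9 Hz and above (boundary is an assumption)

print(int(train_mask.sum()), "training speeds,", int(test_mask.sum()), "test speeds")
```

Unlike the scattered division, every test speed here lies outside the training range, so the model has to extrapolate rather than interpolate.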

V. DISCUSSION
Based on the results in the case study with few operating conditions in Section IV-B1, LSTM is a better model for estimating rotor displacement trajectories from bearing housing vibration than linear regression and ε-SVRs. The results were consistent over five tests with significantly different rotating speeds and horizontal support stiffnesses. The better performance is likely due to the recurrent way of processing the input trajectories. That is, the model keeps the system state implicitly in memory between the time steps, which is useful in determining the new system state and the output.
The results in the case study with many operating conditions in Section IV-B2 reveal the potential of the LSTM-based soft sensor. The model learned to estimate trajectories over a large range of operating conditions with a large amount of training data uniformly distributed in the operating condition space. The preprocessing techniques for building a large training set of the time series samples were shown to be applicable. By splitting the trajectories into shorter time windows and by dividing the data into training and test sets in a scattered manner, the model achieved a low test MAE of 0.0063 mm. Moreover, Fig. 14 and 17 show that the differences between the DFT amplitudes of the estimated and the measured displacement trajectories were small over all test operating conditions. The differences in the horizontal and vertical directions were consistently under 0.05 mm. This indicates that the LSTM-based soft sensor generalised over the complete operating condition space with the proposed training procedure. Furthermore, merely 25 % of the time series samples were required for training.
FIGURE 21. Collection of the differences between the DFTs of the corresponding measured and estimated horizontal displacement trajectories in Fig. 19 and 20.
The results in the case study with limited operating conditions in Section IV-B3 show the limitations of the LSTM-based soft sensor. The model performance decreased drastically in the part of the operating condition space that was excluded from the training data. The decrease in performance can be observed by comparing Fig. 14 and 17 to Fig. 21 and 24. The differences between the DFT amplitudes of the measured and the estimated trajectories increase to close to 0.3 mm in the horizontal and vertical directions at the higher rotating speeds. These differences are roughly six times higher than the corresponding differences of under 0.05 mm in the case study with many operating conditions in Section IV-B2. It is evident from the figures in these case studies that the model underestimates the scale of the amplitudes at the higher rotating speeds if the training data includes only lower rotating speeds. These results confirm that the proposed training procedure is crucial for LSTM-based soft sensor generalisation over a large range of operating conditions.

VI. CONCLUSION
A training procedure was developed for a bidirectional LSTM-based soft sensor for rotor displacement trajectory estimation at high sample rates. The first case study compared the LSTM-based soft sensor trained with time windows to other regression algorithms without a memory and showed that the LSTM-based soft sensor was superior in estimation accuracy based on the MAE in five different tests. The second case study showed that the LSTM-based soft sensor trained with trajectories from randomly chosen operating conditions achieved a high accuracy over a large range of operating conditions. The MAE between the estimated and measured test trajectories was 0.0063 mm. Moreover, the differences between the DFT amplitudes of the estimated and measured test trajectories were below 0.05 mm over the complete operating condition space. The importance of an adequate training data distribution for model reliability was shown in the third case study. The differences between the DFT amplitudes of the estimated and measured test trajectories were up to 0.3 mm when the training data was limited in the operating condition space. Furthermore, limiting the operating condition space in the training data resulted in an MAE of 0.0442 mm between the estimated and measured displacement trajectories. The findings related to LSTM-based soft sensor generalisation over varied operating conditions are significant for further soft sensor research aiming for reliable industrial applications of rotor displacement trajectory estimation.
Fig. 25. This implies that the horizontal foundation stiffness has an effect on the natural frequencies of the system.

APPENDIX B RANDOM SEARCH FOR LSTM-BASED SOFT SENSOR AND OPTIMISATION PARAMETERS
The motivation was to find a set of hyperparameters related to the model architecture and to the model optimisation such that the model could learn to estimate the rotor center-point displacement trajectories from the bearing housing force trajectories. Table 3 shows these hyperparameters and the respective search ranges for random sampling. Altogether 27 different sets of hyperparameters were sampled and compared against each other in tests conducted with small datasets. Each small dataset consisted of one time series sample, including 100 rounds of rotation. Thirty such time series samples were selected as the basis of comparison between the hyperparameter sets. Each of the 30 time series samples was randomly drawn from the rotating speed range between 10 Hz and 16 Hz and from the whole horizontal stiffness range. That is, each hyperparameter set was trained and tested 30 times, once on every small dataset. The 30 small datasets were the same for all 27 hyperparameter set tests.
Each hyperparameter set was tested with the following scheme. First, a small dataset was divided into time windows. The time windows were randomly divided into training and test window sets including 66 % and 33 % of the windows, respectively. The model was then optimised with the training window set to compute rotor displacement trajectories from bearing housing reaction forces. The trained model was then evaluated on the test windows by computing the average of the mean squared errors over all test windows. This test was repeated on each of the 30 small datasets. The set of hyperparameters that resulted in the lowest average test error over the 30 tests is shown in Table 1.
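The random search scheme above can be outlined as in the sketch below. The search space (`hidden_units`, `learning_rate`) and the `toy_evaluate` function are hypothetical stand-ins: the actual hyperparameters and ranges are those of Table 3, and a real `evaluate` callable would window one small dataset, train the LSTM on 66 % of the windows and return the mean squared error on the remaining windows.

```python
import random
import statistics

def random_search(search_space, evaluate, datasets, n_sets=27, seed=0):
    """Sample hyperparameter sets and rank them by mean test error."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sets):
        # Draw one value per hyperparameter from its search range.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        # Evaluate the same set once on every small dataset.
        errors = [evaluate(params, dataset) for dataset in datasets]
        results.append((statistics.mean(errors), params))
    best_error, best_params = min(results, key=lambda item: item[0])
    return best_error, best_params, results

# Hypothetical search space and a toy error function (assumptions).
search_space = {"hidden_units": [16, 32, 64, 128], "learning_rate": [1e-2, 1e-3, 1e-4]}

def toy_evaluate(params, dataset_offset):
    return (params["hidden_units"] - 64) ** 2 * 1e-6 + params["learning_rate"] + dataset_offset

datasets = [0.001 * i for i in range(30)]  # stand-ins for the 30 small datasets
best_error, best_params, results = random_search(search_space, toy_evaluate, datasets)
print(best_params, round(best_error, 6))
```

Using the same 30 datasets for every hyperparameter set keeps the comparison fair, since differences in the average error then come from the hyperparameters rather than from the data.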

APPENDIX C SVR GRID SEARCH
The parameters C and ε have an impact on the SVR performance. Therefore, they were grid searched for each ε-SVR model. The grid for both parameters consisted of the following values: 0.1, 0.05, 0.01, 0.005 and 0.001. All combinations of these values were tested with all ε-SVR models. The grid search was conducted on the time series sample shown in Fig. 25. The first 33 % of the sample was utilised in fitting the models and the final 50 % of the sample was utilised in evaluating the model performance. The parameters resulting in the highest test performance were used in all the tests in Section IV-B1. The searched parameters are listed in Table 4.
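The grid search over the two parameters can be sketched as below. The grid values are those listed above; the `toy_error` function is a made-up stand-in for fitting an SVR on the first 33 % of the sample and computing its test error on the final 50 %, and the name `eps` for the second parameter is an assumption.

```python
import itertools

GRID = [0.1, 0.05, 0.01, 0.005, 0.001]  # values used for both parameters

def grid_search(evaluate, grid=GRID):
    """Evaluate every (c, eps) combination and return the one with lowest error."""
    scores = {(c, eps): evaluate(c, eps)
              for c, eps in itertools.product(grid, repeat=2)}
    best = min(scores, key=scores.get)
    return best, scores

# Toy error surface (assumption); a real evaluate() would fit and test an SVR.
def toy_error(c, eps):
    return abs(c - 0.05) + abs(eps - 0.005)

best, scores = grid_search(toy_error)
print(len(scores), "combinations tested, best:", best)
```

With five values per parameter the search covers 25 combinations for each SVR model, which is small enough to test exhaustively.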

ACKNOWLEDGMENT
The calculations presented above were performed using computer resources within the Aalto University School of Science "Science-IT" project.