Kernel Density Estimation Based Gaussian and Non-Gaussian Random Vibration Data Induction for High-Speed Train Equipment

Because general statistics tolerance is not applicable to the induction of non-Gaussian vibration data and the methods for converting non-Gaussian data into Gaussian data are not always effective and can increase the estimation error, a novel kernel density estimation method in which induction is carried out on power spectral density data for the measured vibration of high-speed trains is proposed in this paper. First, data belonging to the same population of power spectral density are merged into the same feature sample. Then, the probability density function of all power spectral density values at the first frequency point is calculated through the kernel density estimation method, and the upper-limit estimate of all power spectral density values under the set quantile is obtained. This process is repeated, and the upper limit values of the power spectral density values at all frequency points can be obtained to convert the measured acceleration data to the acceleration power spectral density spectrum of the vibration test. Engineering examples are used to verify the proposed method. For the same Gaussian power spectral density data, the relative error between the root mean squares of the power spectral density spectrum obtained from induction by the kernel density estimation method and the statistics tolerance is 0.155% ~1.55%; for the non-Gaussian power spectral density data, the acceleration power spectral density spectrum of the non-Gaussian vibration can be obtained with the induction by the kernel density estimation method. The proposed kernel density estimation method satisfies the induction requirements for the measured Gaussian and non-Gaussian vibration data of high-speed trains with two different distributions, and its induction results have very good universality and estimation accuracy.


I. INTRODUCTION
When the operating speed of a high-speed train reaches 300 to 380 km/h, the vibration environment experienced by its on-board equipment also becomes harsher, and increasingly more fatigue failures, such as cracks in the components, the deformation and dropping of parts, and the loosening of fasteners, are triggered by the vibrations. Thus, the faults caused by vibration fatigue are an important factor affecting the safe operation of high-speed trains, and they pose a The associate editor coordinating the review of this manuscript and approving it for publication was Hassen Ouakad . difficult problem that urgently needs to be solved to increase the service life of on-board equipment.
To ensure the fatigue durability and vibration environment adaptability of high-speed train equipment, manufacturers in various countries generally complete vibration testing for rail transit equipment according to IEC61373 [1]. The IEC61373 standard was formulated in 2010. At the time, only data from some European railways were used; the corresponding testing conditions were formulated according to the Gaussian data processing method, and the effect of non-Gaussian vibration on the onboard equipment was not taken into account. To cope with the continuous development of vibration testing technology, IEC/TC9 MT 61373 convened a working group conference in December 2019 to clarify that the operating vibration data of trains in various countries will be collected under the premise of fully taking into account non-Gaussian data to reevaluate the harshness grades of vibration testing in IEC61373 [2], [3]. In addition, in the IEC 60068-2-64, which was published in 2019, guidelines for the non-Gaussian vibration test, including three methods for generating non-Gaussian random vibrations-amplitude adjustment, phase modification, and uneven phase-were added, but specific testing conditions were not given [4]. Therefore, non-Gaussian vibration has become a new direction for the development of vibration testing technology [5]- [8]. At the same time, scientific non-Gaussian vibration testing conditions still need to be formulated to ensure that this testing method is normative and operable. However, if vibration testing conditions have to be formulated according to the measured environmental data, an appropriate vibration data induction method, which can determine the success or failure of the vibration testing results, is crucial [2], [3], [10], [11].
Commonly used vibration data induction methods include the extremum enveloping method and the statistical tolerance method; these methods solve the induction problem of data that conform to a Gaussian distribution and are widely applied in engineering practice [9]- [14]. However, in the extremum envelope method, enveloping is carried out according to the maximum order of all data in an induction state without taking into account the data sample size and data distribution characteristics; therefore, the induced testing conditions are often overly harsh. The statistical tolerance method is a leap of vibration data induction from the extremum envelope method to a statistical method, but this method is only applicable to the processing of Gaussian vibration data and cannot satisfy the induction of non-Gaussian data [10], [11]. To solve the problem of the induction of measured non-Gaussian vibration data, the Johnson or Bootstrap method is often used. These two methods can convert non-Gaussian data into data that approximately obey a normal distribution, but their transformation process is not always effective; at the same time, more intermediary conversions will also increase the estimation error [15]- [18].
With the kernel estimation method [19]- [22], there is no need to assume in advance that the data obey a certain specific standard parameter distribution, and this method can directly fit the Probability Density Function (PDF) of the data sample. Due to this unique advantage, this method has attained wide application for Gaussian or non-Gaussian data processing in many fields, such as engineering, medicine, finance, and the Internet [23]- [27]. However, according to the literature, there is still no reporting on the application of the kernel estimation method in vibration data induction. In this paper, the fact that the kernel density estimation method does not need to assume the sample distribution in advance is utilized, and an induction method for the Power Spectral Density (PDF) of the measured vibration based on density estimation is proposed to thereby obtain the upper-limit estimate of the PDF data under a set confidence level, which satisfies the induction requirements for the measured Gaussian and non-Gaussian vibration data of high-speed trains. Through engineering examples, the induction method is verified to have very good universality and estimation accuracy.

II. PROBLEM DESCRIPTION
The measured vibration data of high-speed trains are timedomain acceleration data in which abnormal data such as idling-speed operation and transient impact while hooking up are included. To this end, time-domain data should be tested before data induction to reject measurement data that cannot be effectively modified, and then the measured time-domain data are tested for stationarity, periodicity, ergodicity, and normality. Then, the commonly used Cooley-Tukey method [28] is used to obtain the PDF of the time-domain acceleration data.
In the GJB/Z 126 [10], MIL-STD-810G [12], and NASAhdbk-7005 [13] standards, it is suggested that the statistical tolerance method be used to carry out induction on the vibration data that conform to a stationary Gaussian distribution, and the induction method is given.
Set all channel measured acceleration data PDF as G k (i, j)(i = 1, 2, · · · , L 1 ; j = 1, 2, · · · , M 1 ; k = 1, 2, · · · , N 1 ), where L 1 is the number of channels, M 1 is the sample capacity, and N 1 is the number of spectral lines; set the PDF of the same data channel within the scope of the frequency band being analyzed to obey an χ 2 distribution. Calculate the root mean square value RMS(i, j) of G k (i, j); then when the statistical degree of freedom is greater than 45, RMS(i, j) approximately obeys normal distribution.
The statistical tolerance induction method mainly includes the following two steps [10], [11]:

2) HYPOTHESIS TESTING
Calculate the statistics F n (i, m) and t n (i, m) for the mean value X i and the variance S 2 i : If the PDF values of channels i and m belong to the same population, then F n (i, m) obeys the F distribution at the VOLUME 8, 2020 degree of freedom (M 1 − 1, M 1 − 1), and t n (i, m) obeys the central t distribution at the degree of freedom 2(M 1 − 1).
Under a given confidence level (1 − α), if (3) holds, then channels i and m are merged into the same population.
Merge the PSD belonging to the same population to form the feature sample G k (p, q)(p = 1, 2, · · · , P 1 ; q = 1, 2, · · · , Q p ), where P 1 is the number of feature samples, and Q P is the feature sample capacity.

B. STEP 2: CALCULATION OF THE UPPER TOLERANCE LIMIT 1) PARAMETER ESTIMATION
Carry out the transformation on the obtained feature sample according to (4) to obtain the sample x k (p, q) that approximately obeys a normal distribution: Estimate the mean value X k (p) and the variance S 2 k (p) of the sample x k (p, q):

2) CALCULATION OF THE UPPER TOLERANCE LIMIT COEFFICIENT
Under a given confidence level (1 − α) and quantile β, calculate the upper tolerance limit coefficient as follows:

3) ESTIMATION OF THE UPPER TOLERANCE LIMIT OF THE PSD
The upper tolerance limit estimation for the pth feature sample is The aforementioned statistical tolerance method is applicable to the processing of Gaussian vibration data but cannot be used for the induction of non-Gaussian data. To solve the induction problem of measured non-Gaussian vibration data, the Johnson or Bootstrap method is used [11], [12], [21]. Before Step 2 above, it is necessary to first convert non-Gaussian data into data that approximately obey a normal distribution by using the Johnson or Bootstrap method.
Step 2 is then carried out. However, the method is complex as many iterations are needed to find the optimal conversion function, resulting in that the overall conversion efficiency is low and the calculation cost is large. Because the conversion will increase the estimation error and is not always effective, new methods need to be studied and proposed to realize the induction of non-Gaussian data. This is the motivation for proposing the kernel density estimation induction method to carry out induction on Gaussian and non-Gaussian data in this paper.

III. NEW DATA INDUCTION METHOD BASED ON KERNEL DENSITY ESTIMATION
Generally, when carrying out induction on data, it is necessary to know that the data obey a certain specific standard parameter distribution. For Gaussian data, there are standard induction methods, but the induction of non-Gaussian data is a greater challenge. Because the kernel density estimation method does not require the inducted data to obey a certain specific standard parameter distribution, this method has the unique advantage of directly fitting the probability density function. In this section, the application of the kernel density estimation method to the induction of Gaussian and non-Gaussian data is explored.

A. PROCEDURES OF THE NEW INDUCTION METHOD
Set the acceleration PSD for the jth test of the ith channel in the vibration test as where L is the number of test channels; M is the number of samples for a certain test channel, which is the number of PSD values for a certain channel; and N is the number of spectral lines. Therefore, G k (i, j) represents the PSD value at the kth frequency point of the jth portion of the PSD for the ith test channel and the data induction method is as follows: 1) Carry out hypothesis testing on the data of all channels (a total of L) and merge the PSD samples belonging to the same population as the same feature sample to thereby obtain P merged feature samples. The data induction flow is shown in Figure 1.

B. FORMATION OF THE FEATURE SAMPLE BY MERGING CHANNELS
Before carrying out kernel density estimation, the PSD samples belonging to the same population should be merged into the same feature sample. When the data approximately obey a Gaussian distribution, the sample parameter testing method can be used to carry out hypothesis testing. When they do not obey a Gaussian distribution, then nonparametric statistics are used to determine if the distribution functions of the two populations are equal. There are many nonparametric test methods including the runs test, the signed rank sum test, and the Kolmogorov-Smirnov (K-S) test [29], [30]. For the K-S test, a distance between the empirical distribution functions of two samples is defined first by the K-S statistic and then the null distribution of this statistic is calculated under the null hypothesis that the samples come from the same distribution [29]. Among these test methods, the runs test and the K-S test are relatively complex, and the signed rank sum test is relatively simple and efficient in data utilization. Therefore, the signed rank sum test method is used in this study. Let there be PSD samples of the two channels i and j. The sample capacities of the two channels are both n. Calculate and obtain the RMS value of each PSD as RMS 1 (x), RMS 2 (y)(x, y = 1, 2, · · · , n), and then carry out a sign rank sum test on the two RMS samples, RMS i , RMS j ; if the test result accepts H 0 , then the PSD data of the i and j channels are deemed to be from the same population and are merged into the same feature sample.

C. KERNEL DENSITY ESTIMATION 1) BASIC CONCEPT OF KERNEL DENSITY ESTIMATION
Set the distribution function of variable X as F(x). Its probability density function is Because one estimate of the distribution function is an empirical distribution function, its definition is In (9), I is an indicator function.
Substituting (8) into (9), the estimation of the probability density function is obtained: From (10), one can obtain Equation (11) obeys the binomial distribution B(n, In (11), when h → 0, the mean value E [f n (x)] and the variance D [2nhf n (x)] are respectively given as When h → 0 and nh → ∞, then Based on (13) and (14), the estimator f n (x) is the consistent estimation of f (x), and the variance tends to 0; thus, f n (x) is an estimation of f (x).

2) KERNEL FUNCTION SELECTION
Commonly used kernel functions include the uniform kernel function, the triangular kernel function, Epanechnikov kernel functions, and the Gaussian kernel function. The evaluation of these kernel functions can be achieved by comparing their PDFs of the same PSD data, which are shown in Figure 2. This figure shows that different kernel functions have little influence on kernel density estimation, which is consistent with the existing study that the selection of different kernel functions has little effect on the estimation results [20]. However, a slight difference exists. It can be seen from Figure 2 that, Gaussian kernel and Epanechnikov kernel functions are smoothest for PDF, a triangle kernel function is the second, and uniform kernel function is the worst. For this reason, the most commonly used Gaussian kernel function is selected in this study. The Gaussian kernel function k(u) can be expressed as According to the definition of the probability density function, +∞ −∞ k(u)du = 1. When selecting the kernel function shown in (16) From (17), the kernel estimation f n (x) of f (x) is the probability density function.
As the proposed kernel density estimation method is suitable for continuous random variables, it can be used for all variables satisfying continuous distribution. Moreover, if a random variable satisfies the single peak unbiased distribution, its PDF calculated by kernel density estimation will be more accurate.

3) BANDWIDTH SELECTION
Bandwidth selection will have a greater effect on the results. If the bandwidth is too large, then the density curve is too smooth, and the results are not accurate enough; if the bandwidth is too small, then the shape of the curve is complicated. Therefore, selecting a reasonable bandwidth is very important.
If the Gaussian kernel function shown in (16) is selected, according to the rule of thumb [20], the optimal bandwidth can be obtained as In (18),σ x is the standard deviation of the sample, andR is the interquartile range.

4) QUANTILE SELECTION
Set the probability density function of variable X as f (x).The distribution function is F(x); if x β is satisfied, then In this case, x β is the quantile that satisfies P(X ≤ x β ) = β, and x β is the upper-limit estimate of the S PSDs under set quantile β.
In summary, in using the kernel density estimation method, there is no need to assume that the data being inducted obeys a certain specific standard parameter distribution. Therefore, the kernel density estimation method is applicable to the induction of non-Gaussian and Gaussian data.

IV. ENGINEERING EXAMPLES
The PSD of the measured time-domain acceleration data of high-speed trains is calculated according to the Cooley-Tukey method, and the sign rank sum testing method is applied to merge all PSD values into P same feature samples. Then, the PSD data of a certain channel is extracted from the same feature samples to carry out induction.
Let the PSD data extracted from the ith channel be recorded as G k (i, j) (i is the channel number, j = 1, 2 . . . .51, and k is the number of spectral lines). The ith channel has a total of 51 portions of the PSD spectra, and the frequencies of all PSD values are from 2 to 349.6Hz. The frequency resolution is 0.4Hz, and each portion of the PSD is dispersed into 8,740 frequency points.   at the same coordinates as shown in Figure 4. The fifty-one PSDs were extracted from 51 sets of measured vibration data of channel No. 2 by the Cooley-Tukey method. These PSDs will be used to induce the measured spectrum. Now, arbitrarily select data at the frequency point f = 2.8Hz(see Table 1) as an example to induce the PSD upper limit x β of the frequency point. The Anderson-Darling normality test is used to carry out a normality test on the data in Table 1 and the corresponding quantile-quantile plot is shown in Figure 5. The middle straight line of Figure 5 is an indicator of the normal distribution. If the data well obey the normal distribution, the calculated statistics plotted in Figure 5 will be closely distributed along the straight line. The upper and lower sidelines of Figure 5 represent the lower and upper limits of the 95% confidence levels for each percentile, respectively. According to the normality test, p = 0.595 (> 0.05) is obtained, which means that the data approximately obey a Gaussian distribution (see Figure 5).

2) INDUCTION BY THE STATISTICAL TOLERANCE METHOD
Because the data conform to a Gaussian distribution, the statistical tolerance method in Section II can be used. First, the mean value m = 0.00004288 and standard deviation St = 0.000005244 of the data in Table 1 are calculated. Then, the upper tolerance limit coefficients of the acceptance test and the qualification test are calculated to be F 90/50 = 1.3087 and F 90/90 = 1.6600, respectively. Here, F 90/50 is the tolerance upper limit coefficient with  0.9 quantile and 50% confidence level, and F 90/90 is the tolerance upper limit coefficient with 0.9 quantile and 90% confidence level. As a result, the calculated upper tolerance limits are X 90/50 = 5.03196684237743e-05 and X 90/90 = 5.15838893824322e-05, respectively. Here, X 90/90 is the upper limit of PSDs induced by the statistical tolerance method with 0.9 quantile and 90% confidence level, and X 90/50 is the upper limit of PSDs induced by the statistical tolerance method with 0.9 quantile and 50% confidence level. The PSDs with F 90/50 and F 90/50 obtained by induction are shown in Figure 6.

3) INDUCTION BY THE KERNEL DENSITY ESTIMATION METHOD
According to the kernel density estimation method in Section III, first calculate the PDF of the data in Table 1 using (19), as shown in Figure 7. Then, according to (20) and under the condition of the selected VOLUME 8, 2020   Table 1 Probability density functions of the data. quantile β = 0.9, the PSD upper limit value X 90 = 5.03196684237743e-05 under β = 0.9 can be calculated from (16). The induced PSD is shown in Figure 8, which is similar to that obtained by using the statistical tolerance method as shown in Figure 6. The measured vibration data from Channel No. 17 in this example are shown in Figure 9. A total of 51 PSDs are placed at the same coordinates as shown in Figure 10. The fifty-one PSDs were extracted from 51 sets of measured vibration data of channel No. 17 by the Cooley-Tukey method.    Table 3. Now, arbitrarily select data at the frequency point f = 2.8Hz (see Table 2) as an example to induce the PSD upper limit x β of the frequency point. The Anderson-Darling normality test is used to carry out a normality test on the data in Table 2 and the corresponding quantile-quantile plot is shown as Figure 11. As the calculated statistics plotted in Figure 11 do not distribute closely along the straight line, the data do not obey the normal distribution. Moreover, p < 0.05 is obtained according to the normality test, which indeed means that the data do not obey a Gaussian distribution (see Figure 11).

2) INDUCTION BY THE KERNEL DENSITY ESTIMATION METHOD
As the data do not conform to a Gaussian distribution, the statistical tolerance method in Section II cannot be used. According to the proposed kernel density estimation method, first, calculate the PDF of the data in Table 1 through (19) as shown in Figure 12. Then, according to (20) and under the condition of the selected quantile β = 0.9, the PSD upper limit value X 90 = 5.49122571451624e-04 is calculated.
Here, X 90 is the upper limit of PSD induced by the proposed kernel density estimation method. The induced PSD is shown in Figure 13.  Table 3.

C. DISCUSSION OF THE INDUCTION RESULTS
First, the accuracy of the induced PSD for Gaussian data by using the proposed kernel density estimation method is discussed. For Gaussian data, the PSD obtained by using the proposed kernel density estimation method shown in Figure 8 looks similar to that obtained by using the statistical tolerance method shown in Figure 6. The upper limit values of the PSDs shown in Figures 6 and 8 for the same Gaussian data are  obtained first by using the statistical tolerance method and the kernel density estimation method respectively, and then the RMS values of the PSDs are calculated, as shown in Table 3. Based on Table 3, the relative errors of X 90 with X 90/50 and X 90/90 are 0.155% and 1.55%, respectively. The results show that the PSD obtained by using the proposed method for Gaussian vibration data is very close to that obtained by using the statistical tolerance method, demonstrating that the induction accuracy of the proposed method is high for Gaussian vibration data.
Second, the effectiveness of the proposed kernel density estimation method on the induction of non-Gaussian data is discussed. For the induction of non-Gaussian vibration data, the existing Johnson or Bootstrap method is complex and inefficient because many iterations are needed to find the optimal conversion function. However, the proposed kernel density estimation method can be used directly for the induction of non-Gaussian vibration data. According to the proposed kernel density estimation method, the induced PSD for the non-Gaussian vibration signal is shown in Figure 13. To demonstrate the reasonability of the proposed induction method, the induced PSD shown in Figure 13 with kurtosis value 7 is inputted to a closed-loop vibration test system that consists of a VR9500 non-Gaussian vibration controller made by the Vibration Research Corporation (USA). The random vibration data can then be obtained from the closed-loop vibration test system as shown in Figure 14. It can be seen from Figure 14 that a non-Gaussian random vibration that is related to the inputted PSD and the kurtosis value is replicated.
According to the above discussion, several results can be obtained as follows. The proposed kernel density estimation method can be used directly for the induction of both Gaussian and non-Gaussian random vibration data satisfying continuous distribution, avoiding the normality tests on the measured data, and decreasing the calculation cost. Especially for the induction of non-Gaussian random vibration data, it is not necessary to first convert non-Gaussian data into data that approximately obey a normal distribution. Thus, the complexity of data induction and the error of the induced PSD can decrease greatly because many iterations are not needed to find the optimal conversion function. As a result, the proposed method can be widely used for the induction of Gaussian and non-Gaussian random vibration data.

V. CONCLUSION
A novel method for carrying out induction on the power spectral density data by applying the kernel density estimation is proposed for the measured vibration of high-speed trains. First, the power spectral density data belonging to the same population are merged into the same feature sample, and the probability density function of all power spectral density data at the first frequency point is calculated through the kernel density estimation method. Then, the upper-limit estimates of all power spectral density values under the set quantile are obtained, and the upper limit of the power spectral density values at all frequency points can be calculated by repeating this process to convert the measured acceleration vibration data to the acceleration power spectral density spectrum of the vibration test.
Finally, two engineering examples are used to verify the proposed data induction method for Gaussian and non-Gaussian data. For the same Gaussian distribution data, the relative error between the root means square values of the power spectral density spectrum obtained from the induction using the kernel density estimation method, and the ordinary statistics tolerance is 0.155% ∼ 1.55%. For the non-Gaussian distribution data, the kernel density estimation method is used to carry out the induction and an acceleration power spectral density spectrum is obtained for a non-Gaussian vibration test. The proposed method satisfies the induction requirements for the measured Gaussian and non-Gaussian vibration power spectral density data of high-speed trains, which have different distributions, and its induction results have very good universality and estimation accuracy.