Optimized KLD-Based Fault Detection Method for Complex System Under Multi-Operation Conditions

With the continued development of industrial technology, fault detection has been widely applied to complex systems. To address two problems that arise when Kullback-Leibler Divergence (KLD) is used for fault detection, namely multiple operation conditions and the choice of data length, this paper proposes a fault detection method for complex systems based on optimized KLD under multi-operation conditions. First, based on the historical data of the system, the complex operation conditions are divided into several simple, mutually exclusive operation conditions, and the division criteria are established. Second, the optimal data length is determined quickly by autocorrelation analysis for each operation condition. Third, in the training stage, the KLD values of all training data are calculated against a benchmark consisting of the first optimal-length data of each operation condition, and the maximum value is taken as the fault detection threshold. Then, in the test stage, the operation condition of the incoming data is identified, and the corresponding KLD value is calculated and compared with the corresponding threshold to determine whether a fault has occurred. Finally, the method is applied to the suspension system of a maglev train and compared with fault detection methods based on Euclidean distance and Mahalanobis distance. The results show that the proposed method has a low false alarm rate and high sensitivity.


I. INTRODUCTION
Nowadays, to provide more functions and meet the needs of daily life and production, systems are becoming structurally more complex and more highly automated, and their parts are becoming more tightly coupled. As a result, a small fault in one place often triggers a chain reaction that leads to catastrophic damage to the whole system and even to its surrounding environment. Such a fault not only causes huge economic losses but also endangers personal safety. Therefore, it is of great significance to study fault detection technology for complex systems.
At present, there is a large body of literature on fault detection, in which Kullback-Leibler Divergence (KLD) is a commonly used tool. Aiming at the problem of initial sensor fault detection and isolation, the Homi Bhabha National Institute proposed a fault detection index and fault signature based on the extended Kalman filter and designed fault decision statistics using KLD [1]. To improve the accuracy of the quantitative evaluation of fault detection, a KLD-based method for designing the permissible area of measurement errors was proposed [2]. Based on the difference between online estimated and offline reference density functions, Bounoua et al. proposed a principal component analysis method based on KLD [3]. A method was proposed that extracts Transformed Components (TCs) online by Principal Component Analysis (PCA) and estimates the time-varying characteristics of the most sensitive component by Kernel Density Estimation (KDE); it can be easily integrated with the data storage units of a Rooftop Mounted Photovoltaic (RMPV) system [4]. Shiva et al. used KLD to quantify certain characteristics of the current waveform as a criterion for high-impedance faults (HIF), and thus proposed a time-domain HIF detection algorithm based on monitoring the substation current waveform [5]. Chen et al. proposed a method based on KLD and independent component analysis (ICA) to detect initial faults in the electrical traction systems of high-speed trains; the method is more sensitive to initial faults than ICA-only fault detection [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Nishant Unnikrishnan.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A bearing fault detection method based on Bayesian robust new hidden Markov modeling (BRNHMM) was proposed by analyzing the acoustic emission signal; it assesses the divergence of the probability function of the BRNHMM through KLD [7]. Chen proposed an improved KLD fault detection method, which was applied to the early fault detection of an electric drive system [8]. Joelle et al. developed an optimal threshold method based on improved KLD and a multi-sensor fusion strategy, which can exclude erroneous data [9]. A KLD-based detection algorithm for sensor precision degradation faults was proposed, which uses KLD to quantify the dissimilarity between the probability densities of each reference score and the actual one within the PCA framework [10]. Delpha et al. proposed a data-driven method for fault detection, isolation, and estimation, in which PCA is used to extract features and reduce the dimension of the data, and KLD is then used to detect the occurrence of a fault [11]. A novel control performance monitoring method based on KLD was proposed and applied to a multi-input-multi-output (MIMO) control system [12]. Hamadouche et al. proposed a modified KLD detection algorithm based on non-parametric approximation, which was applied to fault detection in large-scale, highly coupled industrial systems in a noisy environment [13]. A data-driven method using statistical features was proposed, which uses KLD as a nonparametric fault indicator and evaluates the severity of the fault through the characteristics of small cracks [14]. Harrou proposed a KLD-based method for detecting initial anomalies in highly correlated multivariate data [15]. A KLD-based analysis model was proposed that can be used to estimate the initial fault scale in a multivariable process [16]. Xie proposed a KLD-based method to detect the initial faults of complex dynamic systems [17]. Ferracuti et al. presented a method based on the analysis of motor current characteristics for the fault detection and diagnosis of induction motors, in which KLD is used to measure the difference between two probability distributions and serves as an index for automatic defect identification [18].
Aiming at the problem of fault detection and diagnosis of asynchronous motors, a method based on KDE and KLD was proposed [19]. These studies have developed new fault detection methods using KLD and have achieved good results in solving practical problems in their respective fields. However, two problems remain when KLD is used for fault detection as in the above literature: first, the accuracy of fault detection is reduced by the complex working conditions of the system; second, the data length used in the calculation of KLD affects the results. Moreover, no existing work explores the impact of data length on the detection results. If these two problems are ignored and the above methods are applied directly, the likely outcome is a low fault detection rate and a high false alarm rate.
To address these two problems, this paper proposes a fault detection method based on optimized KLD for complex systems under multi-operation conditions. On the basis of prior knowledge and historical data, the complex working conditions of the system are divided into several simple working conditions, and the partition rules for the operation conditions are established. To address the problem of data length, the optimal length is determined by autocorrelation analysis for each operation condition. In the training stage, the first optimal-length data of each working condition is used as the reference for calculating the KLD values of the training data, and the maximum value under each operation condition is taken as the fault threshold. In the test stage, the working condition of the current data is identified according to the partition rules, the corresponding KLD value is calculated, and the occurrence of a fault is judged against the corresponding fault threshold. Finally, the method is applied to the suspension system of a maglev train. The experimental results show that the proposed method has a low false alarm rate and high sensitivity in fault detection.
The remainder of this paper is structured as follows. Section II details the handling of complex operation conditions, shows how to rapidly determine the optimal data length through autocorrelation analysis, and presents the steps of the optimized KLD algorithm. Section III presents the results of applying the proposed method to fault detection in the maglev suspension system and compares it with other methods. Section IV concludes the paper.

A. CLASSIFICATION OF MULTI-OPERATION CONDITIONS
During actual running, the operation conditions of a system become complicated because of interacting internal, environmental, human, and operational factors. There are generally two ways to handle multi-operation conditions. The first is to establish a single general fault detection model. The second is to decompose the multi-operation conditions into several simple, mutually exclusive operation conditions and to establish a fault detection model for each of them. The difficulty of the first method lies in finding a general model that can cope with complex operation conditions and the large data discrepancies they cause. The crux of the second method is that the researchers must have accumulated a certain amount of prior knowledge of the system operation data. Because of the complex operation conditions and the data discrepancies among them, the first method is difficult to apply and could result in a low fault detection rate and a high false alarm rate. Consequently, on the basis of the existing historical data and accumulated prior knowledge, this paper adopts the second method to handle multi-operation conditions.
In addition, considering the demand for real-time detection in practical applications, this paper partitions the operation conditions and switches between them in real time according to the partition rules.
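As an illustration of rule-based partitioning and switching, the following minimal Python sketch assigns one sample to exactly one operation condition. The condition names and the speed-based predicates are hypothetical placeholders and are not the division criteria of this paper; `classify_condition` is likewise an illustrative helper, not part of the proposed method.

```python
from typing import Callable, Mapping, Optional

Sample = Mapping[str, float]
Rule = Callable[[Sample], bool]


def classify_condition(sample: Sample, rules: Mapping[str, Rule]) -> Optional[str]:
    """Return the name of the single rule that matches, or None if no rule matches.

    Raises ValueError when several rules match, because the partition
    is required to be mutually exclusive.
    """
    matches = [name for name, rule in rules.items() if rule(sample)]
    if len(matches) > 1:
        raise ValueError(f"rules are not mutually exclusive: {matches}")
    return matches[0] if matches else None


# Hypothetical rules for illustration only (not the paper's criteria).
example_rules = {
    "standstill": lambda s: s["speed"] == 0.0,
    "running": lambda s: s["speed"] > 0.0,
}
```

Once a sample has been classified, the detector can look up the calculation length and threshold stored for that condition, which is what makes switching in real time inexpensive.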

B. TRADITIONAL KLD ALGORITHM
Complex systems are always designed to meet certain task requirements. Therefore, the healthy operation data of a system fluctuates within the range set by those requirements and obeys a certain distribution, such as a Gaussian, Poisson, or mixed distribution. When a fault occurs, however, the operating data changes abnormally, which changes the distribution of the data. Therefore, there is always a significant difference between the distribution of fault data and that of healthy data. On this basis, the fault detection problem can be simplified to raising an alarm when the probabilistic properties of the data distribution change. In problems of this kind, test data whose distribution differs from that of the original data (usually the training data) can be regarded as fault data. In actual detection, however, noise and disturbances are present in the system, so fault detection requires a certain degree of robustness. An effective fault detection scheme must therefore correctly distinguish disturbances from faults and achieve a compromise between detection sensitivity and detection robustness.
When quantifying the difference between distributions, fitting a model to the actual data may not yield an accurate result. Therefore, KLD is used as an effective measure of the difference between two probability distributions defined on the same parameter space [20], [21]. The KLD of a discrete random variable is calculated as
$$D_{KL}(P \,\|\, Q) = \sum_{t=1}^{i} P(t) \log \frac{P(t)}{Q(t)} \tag{1}$$
where $i$ is the number of samples, and $P(t)$ and $Q(t)$ are the probability distributions of the discrete variables. A lower value of $D_{KL}(P \,\|\, Q)$ indicates a higher similarity between $P$ and $Q$. The probability distributions $P(t)$ and $Q(t)$ are given by (2), where $p_t$ is the t-th sample in the data vector $p$, $q_t$ is the t-th sample in the data vector $q$, and $n_0$ is the selected calculation length.
The traditional KLD algorithm is insensitive to parameter changes and does not require smoothing of the probability density, so KLD-based fault detection algorithms have been used extensively.
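A minimal Python sketch of the discrete KLD between two data vectors of equal length is given below. It assumes that each data vector is turned into a probability distribution by dividing every sample by the sum over the window, and that all samples are strictly positive; both are assumptions of this sketch, and the function names `to_distribution` and `kld` are illustrative.

```python
import math
from typing import Sequence


def to_distribution(samples: Sequence[float]) -> list[float]:
    """Scale strictly positive samples so that they sum to 1 (assumed normalization)."""
    if any(s <= 0 for s in samples):
        raise ValueError("all samples must be strictly positive")
    total = sum(samples)
    return [s / total for s in samples]


def kld(p_samples: Sequence[float], q_samples: Sequence[float]) -> float:
    """Discrete KLD: sum over t of P(t) * log(P(t) / Q(t)), natural logarithm."""
    if len(p_samples) != len(q_samples):
        raise ValueError("both data vectors must have the same length")
    p = to_distribution(p_samples)
    q = to_distribution(q_samples)
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q))
```

Identical vectors give a KLD of zero, and the value grows as the two normalized vectors diverge. Note that KLD is not symmetric, so the order of the arguments matters.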

C. OPTIMIZED KLD ALGORITHM
1) AUTOCORRELATION ANALYSIS
The choice of the data length $n_0$ in (2) has a great influence on both the computational cost and the results of the traditional KLD algorithm. From an information perspective, if the data length is too small, the information contained in the data is incomplete, which increases the uncertainty of the results. If the data length is too large, the information is more complete and the uncertainty is reduced, but the amount of calculation grows and consumes more computing resources. In addition, most previous studies rely on prior knowledge when choosing the calculation length and give no clear selection criteria, so the choice is arbitrary. To address this, a method based on autocorrelation analysis is proposed to determine the calculation length quickly. The specific steps are as follows:
Step 1: Suppose that the complex system has N variables. According to prior knowledge and historical data, the complex operation conditions are divided into T simple operation conditions, and the switching rules between operation conditions are established.
Step 2: Taking one operation condition as an example, α data sections under this operation condition are extracted, and the autocorrelation length of each variable is calculated from these α sections.
According to (3), the vector $C_{(i \times j)}$ is obtained by calculating the autocorrelation of the vector $X_{(i \times j)}$, which corresponds to the j-th variable in the i-th data section.
Then the positions at which each $C_{(i \times j)}$ takes the value 0.5 in its left and right halves are located and denoted $L_{ij,f}$ and $L_{ij,r}$, respectively. The autocorrelation length $L_{ij}$ of the j-th variable in the i-th data section is computed from these two positions. Finally, the autocorrelation length of the j-th variable is obtained from the values $L_{ij}$ of its α sections.
Step 3: The autocorrelation lengths of the individual variables obtained in Step 2 are used to calculate the autocorrelation length L of the system. In this way, the optimal length L of each operation condition is determined and taken as the calculation length of KLD under that operation condition.
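The half-height measurement in Step 2 can be sketched in Python as follows. The sketch assumes a mean-removed, two-sided autocorrelation normalized so that its peak equals 1, and takes the autocorrelation length of one data section as the width of the region around the peak that stays at or above 0.5. These choices and the function names are assumptions of this illustration rather than the exact formulas of the paper; how the per-section values are combined across sections and variables is left to the caller.

```python
from typing import Sequence


def normalized_autocorrelation(x: Sequence[float]) -> list[float]:
    """Two-sided autocorrelation of the mean-removed series, scaled to a peak of 1.

    The result has length 2 * len(x) - 1 and is symmetric about its centre.
    """
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    energy = sum(v * v for v in d)
    if energy == 0:
        raise ValueError("a constant series has no defined normalized autocorrelation")
    # Non-negative lags 0 .. n-1; negative lags mirror them.
    half = [sum(d[t] * d[t + lag] for t in range(n - lag)) / energy for lag in range(n)]
    return half[:0:-1] + half


def half_height_width(c: Sequence[float], level: float = 0.5) -> int:
    """Width (in samples) of the contiguous region around the peak with c >= level."""
    peak = len(c) // 2
    left = peak
    while left > 0 and c[left - 1] >= level:
        left -= 1
    right = peak
    while right < len(c) - 1 and c[right + 1] >= level:
        right += 1
    return right - left


def autocorrelation_length(x: Sequence[float]) -> int:
    """Autocorrelation length of one data section of one variable (assumed definition)."""
    return half_height_width(normalized_autocorrelation(x))
```

The direct summation used here costs O(n^2) per section, which is acceptable for a short illustration; for segments with thousands of points an FFT-based autocorrelation would be the usual choice.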

2) OPTIMIZED ALGORITHM
In this paper, an optimized KLD algorithm based on autocorrelation analysis is adopted for fault detection of complex systems. Using the optimal calculation length L, the KLD between the training data set and the initial training data under each working condition is calculated, and fault detection is then carried out. The scheme is shown in Fig. 1.
The specific steps are as follows:
Step 1: On the basis of prior knowledge, the working conditions of the complex system are divided, and the training data sections of each working condition are extracted.
Step 2: According to the extracted data sections under the various working conditions, the optimal length L for KLD calculation under each working condition is determined through (3)-(7).
Step 3: The KLD of each of the j-th (j = 1, ..., n) variables in the training set is calculated by (8), where $E_j(L)$ is the probability distribution of the j-th variable of the data to be tested and is given by (9), with τ = 1, 2, ..., M (M is the length of data section e) and $e_{jl}$ the l-th data point of the j-th variable. In essence, $E_j(L)$ is obtained by extracting L data points to be tested through a moving window of length L and calculating their probability distribution. The reference distribution $F_j$ is given by (10), where $f_{jk}$ is the k-th data point of the first section of the j-th variable of the training set.
Step 4: Determine the fault detection threshold. From the values of $D_{KL}(E_j \,\|\, F_j)$ obtained in Step 3, the maximum value is set as the fault detection threshold:
$$TD_j = \max D_{KL}(E_j \,\|\, F_j) \tag{11}$$
where $TD_j$ is the fault detection threshold of the j-th variable. The threshold of each variable under each operation condition is determined through (11).
Step 5: Identify the working condition of the test data and switch to the corresponding working condition model. According to (8), the KLD between the test data and the first L training data points of the first segment of the training set is calculated for each variable. The result is compared with the fault detection threshold of the corresponding variable under this operation condition to obtain the running state. As soon as the KLD of any one variable in the test data exceeds its fault detection threshold, the system is considered faulty and an alarm is raised.
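Steps 3 to 5 can be sketched for a single variable and a single operation condition as follows. The sketch assumes that each window is normalized by its own sum before the KLD is computed and that all samples are strictly positive; these assumptions and the names `window_klds`, `train_threshold`, and `detect` are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def kld(p_samples: Sequence[float], q_samples: Sequence[float]) -> float:
    """Discrete KLD of two sum-normalized, strictly positive data vectors."""
    p_total, q_total = sum(p_samples), sum(q_samples)
    return sum(
        (p / p_total) * math.log((p / p_total) / (q / q_total))
        for p, q in zip(p_samples, q_samples)
    )


def window_klds(segment: Sequence[float], reference: Sequence[float]) -> list[float]:
    """KLD of every moving window of len(reference) samples against the reference."""
    length = len(reference)
    return [
        kld(segment[start:start + length], reference)
        for start in range(len(segment) - length + 1)
    ]


def train_threshold(training_segments: Sequence[Sequence[float]], length: int):
    """Reference = first `length` samples of the first segment; threshold = max training KLD."""
    reference = list(training_segments[0][:length])
    threshold = max(
        value for segment in training_segments for value in window_klds(segment, reference)
    )
    return reference, threshold


def detect(test_segment: Sequence[float], reference: Sequence[float], threshold: float) -> list[bool]:
    """Flag each moving window of the test segment whose KLD exceeds the threshold."""
    return [value > threshold for value in window_klds(test_segment, reference)]
```

For a system with several variables, the same procedure would be repeated per variable and per operation condition, and an alarm raised as soon as any variable produces a flagged window.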

III. EXPERIMENTAL RESULTS AND ANALYSIS
A. THE SOURCE OF DATA
The data used in this paper is the historical operation data of a maglev train on a maglev line. The suspension system is one of the most important subsystems of the maglev train. Its main function is to keep the gap between the train and the track at a fixed value so that the train levitates stably. If the suspension system fails, the train cannot levitate normally, which easily leads to accidents. Therefore, given the importance of the suspension system to the safe and stable operation of the maglev train, the method in this paper is applied to fault detection of the maglev train suspension system.
Each carriage of the maglev train has five bogies, each bogie has two suspension modules, and each suspension module has two suspension units. Therefore, the suspension system of a carriage is composed of 20 independently controlled single-point suspension control systems (referred to as suspension points). Each suspension point consists of a suspension control box, a suspension sensor, and an electromagnet. Fig. 2 shows the internal composition of the suspension system: the data of all suspension points in each carriage are collected by a monitoring suspension unit (MSU) with a sampling frequency of 0.1 Hz and transmitted to the train control and management system (TCMS) through the multifunction vehicle bus (MVB). Fig. 3 shows one day of historical operation data of a node, including current, voltage, speed, and gap data. The figure shows that the train runs under multiple operation conditions and that the data differ greatly between operation conditions, so it is difficult to establish a general fault detection model. Therefore, in this paper the operation conditions of the suspension system are divided into several mutually exclusive operation conditions; the specific operation conditions and the division criteria are listed in Table 1.
The suspension controller analyzes the gap between the bottom of the train and the track, controls the current in the suspension electromagnet through a pulse width modulation (PWM) wave, and thereby adjusts the gap to keep it at a fixed value. Therefore, gap data plays an important role in the operation of the suspension system, and it is feasible to assess the operating state of the suspension system by analyzing the gap value.
The data used in this paper is the historical gap data of one suspension point of a maglev train, which contains a section of overload fault data. Fig. 4 shows the distribution histograms of two sections of healthy gap data. As can be seen from the figure, the healthy data follow nearly the same distribution, with few differences between the two sections. Fig. 5 shows the distribution histograms of healthy gap data and fault gap data. There are obvious differences between the distribution of the fault data and that of the healthy data. Therefore, the operating state of the system can be judged by calculating the KLD between data segments.
In this paper, the operation condition of driving between stations is taken as an example. According to the division criteria in Table 1, 56 sections of gap data under this condition are extracted from the historical data. Following a ratio of roughly 3:7, the first 16 segments are selected as the training data, which is mainly used to determine the optimal length L and the fault detection threshold TD, and the last 40 segments are selected as the test set. The KLD between the last 40 segments and the first L training data points of the first segment is then calculated by (8). Finally, the calculated KLD is compared with the fault detection threshold to judge the current operating state of the system.

B. EFFECT OF DIFFERENT DATA LENGTH
For the training data of the first 16 segments, the optimal length L is determined by (3)-(7). Fig. 6 shows the autocorrelation curve of the first data segment. The curve is symmetric about x = 5569. As shown in Fig. 7, after normalization by (4) the autocorrelation values lie between 0 and 1. The optimal length L is determined as 4574 through (5), which is shorter than every data segment; in particular, it is 2169 points shorter than the longest segment (6743 points). Therefore, it reduces the amount of calculation to a certain extent. Fig. 8 shows the KLD curve of the training data when the calculation length is the optimal one.
To verify the effectiveness of the proposed method, 2000, 3000, and 4000 are also selected as calculation lengths, and the corresponding KLD curves are shown in Figs. 9, 10, and 11, respectively. The volatility of the KLD curves decreases as the length increases. Since the variance is an index of this volatility, the variance of the KLD curve is calculated for data lengths taken at intervals of 100 from 2000 to 4500 and at intervals of 10 from 4510 to 4550, in order to further understand the influence of the calculation length on the calculated KLD. It can be seen from Fig. 12 that the variance decreases rapidly when the calculation length is between 2500 and 4500, and that it is approximately a horizontal straight line when the calculation length is greater than 4500. This shows that the variance of KLD decreases as the calculation length increases; that is, the larger the calculation length, the smaller the fluctuation range of KLD. Once the length reaches a certain value, the fluctuation range of KLD changes little.
The reason is that the smaller the calculation length, the less information the data contains, the greater the uncertainty of the KLD calculated on the training set, and hence the greater the fluctuation of the curve, that is, the greater the variance. Once the calculation length reaches a certain value, the information needed to calculate KLD is sufficient. If the calculation length continues to increase, the additional information does not help to reduce the variance but does increase the amount of calculation. To meet the calculation requirements without extra computation, the optimal calculation length should be about 4500. In this paper, $L_m = 4574$ (close to 4500) is determined by autocorrelation analysis. This value meets the calculation requirements without much extra computation, and it is obtained without enumerating candidate lengths, which greatly reduces the workload.
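The enumeration that autocorrelation analysis avoids can be sketched as follows: for each candidate calculation length, the KLD curve of the training segments is computed against the first window of the first segment, and its variance is recorded. The sum normalization inside `kld`, the use of the population variance, and the function names are assumptions of this illustration; any data passed in must be strictly positive.

```python
import math
import random
import statistics
from typing import Iterable, Sequence


def kld(p_samples: Sequence[float], q_samples: Sequence[float]) -> float:
    """Discrete KLD of two sum-normalized, strictly positive data vectors."""
    p_total, q_total = sum(p_samples), sum(q_samples)
    return sum(
        (p / p_total) * math.log((p / p_total) / (q / q_total))
        for p, q in zip(p_samples, q_samples)
    )


def kld_curve(segments: Sequence[Sequence[float]], length: int) -> list[float]:
    """KLD of every moving window of `length` samples against the first window of segment 0."""
    reference = segments[0][:length]
    return [
        kld(segment[start:start + length], reference)
        for segment in segments
        for start in range(len(segment) - length + 1)
    ]


def variance_by_length(segments: Sequence[Sequence[float]], lengths: Iterable[int]) -> dict[int, float]:
    """Population variance of the KLD curve for each candidate calculation length."""
    return {length: statistics.pvariance(kld_curve(segments, length)) for length in lengths}


# Synthetic stand-in data (not the maglev gap data): noisy values around 8.0.
random.seed(0)
synthetic = [[8.0 + random.gauss(0.0, 0.1) for _ in range(200)] for _ in range(4)]
variances = variance_by_length(synthetic, [20, 50, 100])
```

Each candidate length requires a full pass over all training windows, which is why scanning a dense grid of lengths is costly compared with a single autocorrelation analysis.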

C. ANALYSIS OF EXPERIMENTAL RESULTS AND COMPARISON OF METHODS
According to (8)-(10), the KLD between each segment of the training set and the first optimal-length data of the first segment is calculated. Fig. 13 shows the KLD curve of the training set, from which the fault detection threshold is $TD = 1.048 \times 10^{-3}$. Using the optimal calculation length determined in Section III-B, the KLD between the data of the last 40 test segments and the first optimal-length training data of the first segment is calculated. The KLD curve of the test data is shown in Fig. 14. Most of the data, about 99.69% of the calculation results of the test set, lie below the fault detection threshold. Only a small part, about 0.31%, lies above the threshold, which shows that KLD can realize fault detection. The actual fault data is located in section 23, accounts for 2.73% of the test set, and spans the calculation interval 5643-7217. In the KLD curve, the fault threshold is first exceeded at the 5643rd calculation point (the calculated fault onset), which is consistent with the actual fault location. At the 7203rd calculation point (the calculated fault end), the KLD exceeds the threshold again, 14 calculation points ahead of the actual fault end time. This shows that the proposed method can realize fault detection quickly.
Fig. 15 shows the KLD of the 23rd data segment. When the fault occurs, the KLD quickly exceeds the fault detection threshold. During the fault period, the KLD first decreases rapidly, falls below the threshold, and then gradually stabilizes. This is because the gap fluctuates strongly over the first 2000 sampling points after the system fails, as shown in Fig. 16, but settles to a new steady value between the 2000th and the 5000th sampling points. Fig. 17 shows the distribution histogram of the fault data from the 2000th to the 5000th sampling point together with that of the first optimal-length training data of the first segment. Although the two data segments are distributed over quite different values, the shapes of their distributions differ little. Therefore, the KLD of this part of the fault data is small, even lower than the threshold, and the system mistakenly identifies the data as healthy. This is also the reason for the low fault detection rate of the proposed method. After the 5000th sampling point, the gap data changes greatly and the KLD increases continuously; after the 6000th sampling point in particular, the KLD increases sharply and exceeds the fault detection threshold. This is because the gap returns to its normal value after the fault is repaired and the train restarts. The probabilistic properties of the data change again, so the KLD exceeds the threshold for the second time.
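The two threshold crossings described above correspond to separate runs of consecutive calculation points whose KLD lies above the threshold. A minimal Python sketch for locating such runs in a KLD curve is given below; `exceedance_runs` is an illustrative helper and the indices it returns are zero-based.

```python
from typing import Sequence


def exceedance_runs(values: Sequence[float], threshold: float) -> list[tuple[int, int]]:
    """Return (first_index, last_index) for every run of consecutive values above threshold."""
    runs: list[tuple[int, int]] = []
    start = None
    for index, value in enumerate(values):
        if value > threshold and start is None:
            start = index                      # a new run begins
        elif value <= threshold and start is not None:
            runs.append((start, index - 1))    # the current run has ended
            start = None
    if start is not None:                      # a run that reaches the end of the curve
        runs.append((start, len(values) - 1))
    return runs
```

With the values reported above, the lead of the calculated fault end over the actual fault end is 7217 - 7203 = 14 calculation points.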
In summary, although the fault detection rate of the proposed method is low, the method detects the onset and the end of a fault quickly once the system fails, because KLD directly reflects changes in the probabilistic properties of the data. In addition, the false alarm rate is low.
To verify the effectiveness of the proposed fault detection algorithm, it is compared with fault detection methods based on Euclidean distance and on Mahalanobis distance. Table 2 shows the statistical results of the method based on Euclidean distance, the method based on Mahalanobis distance, and the method proposed in this paper [22]. With the method based on Euclidean distance, healthy samples account for 100% of the total samples and no fault samples are found, which is inconsistent with the actual sample distribution. This shows that the method based on Euclidean distance cannot realize fault detection. The method based on Mahalanobis distance can detect faults, and its fault proportion is 0.74%, which is 0.43 percentage points higher than that of the KLD method. This is because the method based on Mahalanobis distance has a high false alarm rate of 0.45%, so some healthy samples are detected as faults. Compared with the other two methods, the proposed method achieves better results in terms of fault detection sensitivity and false alarm rate.
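For reference, the two rates used in such a comparison can be computed from per-sample alarm flags and ground-truth labels as sketched below. The definitions used here (fault detection rate as the share of faulty samples that are flagged, false alarm rate as the share of healthy samples that are flagged) are the common ones and are assumed rather than taken from the paper; `detection_rates` is an illustrative helper.

```python
from typing import Sequence


def detection_rates(flags: Sequence[bool], labels: Sequence[bool]) -> tuple[float, float]:
    """Return (fault_detection_rate, false_alarm_rate).

    flags[i]  is True when the detector raised an alarm for sample i.
    labels[i] is True when sample i is actually faulty.
    A rate is NaN when its denominator is zero.
    """
    if len(flags) != len(labels):
        raise ValueError("flags and labels must have the same length")
    faulty = sum(1 for label in labels if label)
    healthy = len(labels) - faulty
    detected = sum(1 for flag, label in zip(flags, labels) if flag and label)
    false_alarms = sum(1 for flag, label in zip(flags, labels) if flag and not label)
    detection_rate = detected / faulty if faulty else float("nan")
    false_alarm_rate = false_alarms / healthy if healthy else float("nan")
    return detection_rate, false_alarm_rate
```

Reporting both rates together makes the trade-off visible: a detector that flags more samples raises its detection rate but, as with the Mahalanobis-distance method above, may also raise its false alarm rate.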

IV. CONCLUSION
To address the problem of multi-operation conditions in complex systems and the determination of the calculation length in KLD-based fault detection, this paper proposes a fault detection method for complex systems based on a multi-operation-condition model and optimized KLD. First, based on prior knowledge and the historical data of the complex system, the operation conditions of the system are divided into several simple, mutually exclusive operation conditions, and the operation data under each operation condition is extracted according to the established partition rules. Second, a method based on autocorrelation analysis is proposed to determine the optimal length quickly, and the optimal length of each operation condition is determined. Third, in the training stage, the KLD of the training data is calculated with respect to the training data of the first section of each operation condition, and the maximum value is selected as the fault detection threshold. In the test stage, after the operation condition of the test data is identified, the corresponding KLD value is calculated and compared with the corresponding threshold to determine whether a fault has occurred. Finally, the method is applied to fault detection for the suspension system and compared with two other fault detection methods. The results show that the proposed fault detection method has a low false alarm rate and high sensitivity. In addition, the effectiveness of the proposed method is verified by analyzing the KLD under different calculation lengths.