Fault Diagnosis of Oil-immersed Power Transformer Based on Difference-mutation Brain Storm Optimized CatBoost Model

To address the problem of low accuracy in power transformer fault diagnosis, this study proposed a transformer fault diagnosis method based on the DBSO-CatBoost model. Building on data feature extraction, this method adopted the DBSO (Difference-mutation Brain Storm Optimization) algorithm to optimize the CatBoost model and diagnose faults. First, for data preprocessing, the ratio method was introduced to add features to the original data, the SHAP (Shapley Additive Explanations) method was applied for feature extraction, and the KPCA (Kernel Principal Component Analysis) algorithm was employed to reduce the dimension of the data. Subsequently, the preprocessed data were input into the CatBoost model for training, and the DBSO algorithm was adopted to optimize the parameters of the CatBoost model to yield the optimal model. Lastly, the DBSO-CatBoost model was exploited to diagnose transformer faults and output the fault type. As indicated by the example results, the accuracy of transformer fault diagnosis based on the DBSO-CatBoost model reached 93.71%, 3.958% higher than that of the CatBoost model and significantly exceeding that of some common models. Furthermore, compared with other preprocessing methods, the data preprocessing method proposed in this study significantly improved the accuracy of fault diagnosis.


I. INTRODUCTION
The transformer is vital equipment in the power system, achieving voltage transformation, power distribution and power transmission. Its safe and reliable operation is correlated with the safety and power supply quality of the whole power grid. Accordingly, accurate diagnosis of transformer faults is critical to maintaining the safe operation of the power grid and ensuring the quality of power supply [1]- [5].
The causes and types of power transformer faults are difficult to detect directly. Currently, Dissolved Gas Analysis (DGA) is the most common fault diagnosis method. When the power transformer overheats or discharges, its insulating oil emits gases that dissolve in the oil. By analyzing the dissolved gases, the DGA method can determine the operating condition of the transformer. Conventional DGA methods include the three-ratio method, the Rogers ratio method and the non-coding ratio method [6]- [9]. The mentioned methods exploit the relative content of dissolved gases to determine the fault type, and the calculations are simple. However, the classification of data close to the coding thresholds is relatively poor, and 'missing code' or 'super code' phenomena are common [10]- [12].
Over the past few years, as artificial intelligence has been leaping forward, several intelligent algorithms combined with the DGA method have been applied to the fault diagnosis of power transformers. On the whole, the mentioned intelligent algorithms fall into non-ensemble learning and ensemble learning. Non-ensemble learning algorithms include the BP neural network, the support vector machine, the extreme learning machine and others, each of which exhibits certain advantages while leaving some problems unsolved [13]- [17]. Zhang et al. combined an optimized BP neural network with the DGA method to increase the accuracy of transformer fault detection to a certain extent, but defects remain (e.g., slow training speed and difficult parameter determination) [18]. Huang Tongxiang et al. used a support vector machine for transformer fault diagnosis; it exhibited strong generalization ability, whereas its accuracy is not high when there are many fault types or information is missing [19]. Du Wenxia et al. used an extreme learning machine for transformer fault diagnosis, which exhibited the advantages of fast learning speed and high generalization performance; in the diagnosis process, however, hidden-layer neurons are prone to redundancy, and the classification accuracy declines [20].
The ensemble learning algorithm integrates multiple learners and exhibits higher learning performance. The Gradient Boosting Decision Tree (GBDT) is a branch of ensemble learning that reduces the total error by decreasing the bias, requires less parameter tuning and offers better robustness. GBDT is extensively adopted in transportation, medical, financial and other fields, whereas it has rarely been applied to power system fault diagnosis. Liao Weihan et al. built an oil-immersed transformer fault diagnosis model based on GBDT, and Li Hejian et al. investigated an oil-immersed transformer fault diagnosis method based on extreme gradient boosting. As demonstrated by the comparative experiments in these two studies, the accuracy of transformer fault diagnosis based on GBDT can be higher than that of non-ensemble learning algorithms [21], [22].
CatBoost is a machine learning library based on the GBDT framework, proposed by Yandex in 2017. Compared with XGBoost, LightGBM and other GBDT algorithms, CatBoost has been improved in numerous ways. It addresses the problem of gradient bias in the iteration through the ordered principle, the Ordered Boosting algorithm and a greedy strategy. In addition, it reduces the possibility of over-fitting, increases the execution speed, improves the robustness of the model and further increases the prediction accuracy. On the whole, the performance of CatBoost is determined by an appropriate hyper-parameter set [23]- [26]. At present, hyper-parameter optimization of ensemble learning models largely adopts the grid search method, which must traverse the parameter set. Given the considerable number of parameters, the efficiency is low, and dimension explosion may even be triggered. Thus, an optimization algorithm should be applied for hyper-parameter optimization [27]- [29].
The Brain Storm Optimization (BSO) algorithm simulates the process of human creative thinking to tackle problems, and it exhibits strong global and local search ability [30]- [33].
The brainstorm optimization algorithm and its optimized variants have exhibited prominent performance in numerous fields (e.g., medical image registration, image segmentation, engine parameter prediction, data feature selection and multi-objective optimization [34]- [38]). Many scholars have improved the brainstorm optimization algorithm to form various variants, as an attempt to improve the performance of the algorithm [39]- [42]. Zhu et al. proposed using the k-medians algorithm for clustering to avoid the weaknesses attributed to outliers in k-means clustering while increasing the algorithm speed [43]. Pourpanah et al. extended BSO to an adaptive algorithm based on multiple groups, thereby improving the mutation effect of BSO, whereas the effect on multi-parameter optimization was insignificant [44]. In this study, the difference-mutation Brain Storm Optimization (DBSO) algorithm, which replaces the Gaussian mutation of the BSO algorithm with difference mutation, was adopted; it improves the convergence rate and is especially suitable for the hyper-parameter optimization of ensemble learning models [45].
As chromatographic technology has been advancing over the past few years, the detection of gas composition and concentration has become rapid and accurate [46], [47]. Accordingly, in this study, chromatographic technology served as the key measurement technology for transformer fault diagnosis. It was employed to analyze the transformer oil for each fault type, and the relevant data were acquired. A series of preprocessing steps was performed on the data, and the data features were extracted and normalized. A variety of fault identification models were then built and compared on the processed data [6]- [12].
This study proposed a transformer fault diagnosis method based on DBSO-CatBoost. First, the dissolved gas data in the transformer insulating oil were preprocessed by feature extraction, dimension reduction and normalization. Subsequently, the CatBoost model optimized by the DBSO algorithm was built. Next, the processed data were used to train and test the DBSO-CatBoost model. Lastly, the running state of the transformer was determined, and the power transformer faults were accurately diagnosed. This study built various classification and recognition models, compared them, and finally developed a classification model more suitable for transformer fault diagnosis. In the end, the whole study is summarized.

II. ALGORITHM PRINCIPLES

A. CATBOOST MODEL
CatBoost is a machine learning library supporting categorical variables, which complies with the GBDT algorithm framework. It effectively mitigates the prediction shift problem of the original GBDT, while exhibiting the advantages of fewer parameters, high accuracy and good robustness [23], [24].

1) GBDT ALGORITHM
Ensemble learning builds multiple machine learners, trains them to form multiple weak learners, and combines the weak learners via a combination strategy to form a strong learner. Boosting (Fig. 1) is a framework algorithm of ensemble learning; its basic idea is to exploit basic weak classifiers to obtain a strong learner through linear weighting and iterative training.
The GBDT algorithm is an ensemble learning algorithm based on the Boosting framework, combining the gradient boosting algorithm and the decision tree. The model is an additive model, the learning algorithm is the forward stagewise algorithm, and the basis function is the CART tree.
The concrete steps of the GBDT algorithm are elucidated below.

Step 1. Initialize the weak learner:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \tag{1}$$

where $L(y, c)$ denotes the loss function, $y_i$ represents the prediction target of the $i$th sample, and $c$ is the constant that minimizes the loss (the sample mean under the least-squares loss).

Step 2. Calculate the negative gradient of the current loss function as the sample residuals:

$$r_{im} = -\left[ \frac{\partial L\left(y_i, f(x_i)\right)}{\partial f(x_i)} \right]_{f = f_{m-1}}, \quad i = 1, 2, \ldots, n \tag{2}$$

Step 3. With $(x_i, r_{im})$ as the training set of the next tree, fit a CART regression tree and obtain its leaf node regions $R_{jm}$, $j = 1, 2, \ldots, J$, where $J$ represents the number of leaf nodes of the regression tree.

Step 4. Calculate the loss-minimizing output for each leaf node $j$:

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\left(y_i, f_{m-1}(x_i) + \gamma\right) \tag{3}$$

where $\gamma$ denotes the parameter of the respective leaf node.

Step 5. Update the strong learner:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I\left(x \in R_{jm}\right) \tag{4}$$

where $I(x \in R_{jm})$ indicates whether $x$ falls into the $j$th leaf region of the $m$th regression tree.

Step 6. After $M$ iterations, combine the weak learners to form the strong learner:

$$f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm} I\left(x \in R_{jm}\right) \tag{5}$$
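To make the iteration concrete, the following is a minimal runnable sketch of Steps 1-6 for the squared loss, where fitting a regression tree to the residuals already yields the leaf outputs that minimize Eq. (3); the tree depth, learning rate and toy data are illustrative choices, not values from this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=50, lr=0.1):
    f0 = float(y.mean())                  # Step 1: constant minimizing squared loss
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        r = y - pred                      # Step 2: negative gradient of (1/2)(y - f)^2
        tree = DecisionTreeRegressor(max_depth=3).fit(X, r)  # Steps 3-4: leaf means = gamma
        pred = pred + lr * tree.predict(X)                   # Step 5: update strong learner
        trees.append(tree)
    return f0, trees

def gbdt_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)        # Step 6: additive model

# toy usage on a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
f0, trees = gbdt_fit(X, y)
print(np.mean((gbdt_predict(f0, trees, X) - y) ** 2))        # training MSE
```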
2) CATBOOST ALGORITHM
The prediction model in the GBDT algorithm is determined by the target variables of the training samples, and an over-fitting problem arises from the biased pointwise gradient estimation. The CatBoost algorithm is an improvement on the GBDT framework that can effectively address the mentioned problems [48]. Compared with other GBDT algorithms (e.g., XGBoost and LightGBM), CatBoost has been optimized in numerous aspects. First, CatBoost adopts the 'ordered principle' to avoid the conditional shift issue inherent in the iteration of the GBDT algorithm, while making it possible to exploit the whole data set for training and learning. Second, CatBoost transforms the conventional gradient boosting algorithm into the Ordered Boosting algorithm, thereby solving the otherwise inevitable problem of gradient bias in the iteration, improving the generalization ability, reducing the possibility of over-fitting and enhancing the robustness of the model. Lastly, CatBoost builds combinations of categorical features through a greedy strategy and takes these combinations as additional features, which makes it easier for the model to capture high-order dependencies and further improves the prediction accuracy. Furthermore, CatBoost selects the oblivious decision tree as the base predictor, thereby reducing the possibility of over-fitting and increasing the execution speed of the model [25]- [29].
Set the dataset to

$$D = \{(X_i, Y_i)\}, \quad i = 1, 2, \ldots, n \tag{6}$$

where $n$ is the number of sample groups, $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m})$ is the $m$-dimensional feature vector of the $i$th sample, and $Y_i$ denotes the label value. The main methods of the CatBoost algorithm are as follows. Multiple random permutations $\sigma$ of the samples are generated for learning; under each feature $k$, the preceding samples of the same category are found, and the categorical feature conversion value is calculated:

$$\hat{x}_{\sigma_p, k} = \frac{\sum_{j=1}^{p-1} \varphi\left(x_{\sigma_j, k} = x_{\sigma_p, k}\right) Y_{\sigma_j} + \alpha P}{\sum_{j=1}^{p-1} \varphi\left(x_{\sigma_j, k} = x_{\sigma_p, k}\right) + \alpha} \tag{7}$$

where $\varphi$ denotes the indicator function, which is 1 when the condition in parentheses is satisfied and 0 otherwise, $P$ is a prior value, and $\alpha$ is the prior weight.
For each sample $X_i$ in the training set, a model is obtained by training on the other samples without $X_i$. The combination of categorical features is built in accordance with the greedy strategy, and the tree structure is selected. The Ordered Boosting algorithm is adopted to calculate the gradient of $X_i$, the gradient is employed to train the weak learner, and the final model is developed by weighting.
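As an illustration of Eq. (7), the following is a hedged sketch of ordered target statistics for a single categorical feature; the prior value, prior weight and the use of a single random permutation are simplifications of CatBoost's internal procedure, not the library's exact implementation.

```python
import numpy as np

def ordered_target_statistic(x_cat, y, prior=0.5, alpha=1.0, seed=0):
    """Eq. (7)-style encoding: each sample only sees the targets of
    same-category samples that precede it in a random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x_cat))
    sums, counts = {}, {}                 # running target sum / count per category
    encoded = np.empty(len(x_cat))
    for pos in perm:
        c = x_cat[pos]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[pos] = (s + alpha * prior) / (n + alpha)  # smoothed conversion value
        sums[c] = s + y[pos]              # only now does this sample become visible
        counts[c] = n + 1
    return encoded

# toy usage
x = np.array(["a", "b", "a", "a", "b"])
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_statistic(x, y))
```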

B. DBSO ALGORITHM

1) BSO ALGORITHM
The Brain Storm Optimization (BSO) algorithm is an intelligent algorithm proposed by Professor Shi Yuhui in 2011 that simulates the group behavior of humans in creative problem solving. It exploits the clustering idea to search for local optima, while obtaining the global optimum by comparing the local optima [49]. The mutation mechanism increases population diversity and prevents the algorithm from falling into local optima, which makes BSO suitable for solving multi-peak, high-dimensional function problems.
The BSO algorithm mainly comprises the steps below: ① Initialize the population. ② Evaluate and cluster the individuals. ③ Select the cluster centers. ④ Generate new individuals through mutation and update the population. ⑤ If the maximum number of iterations is reached, output the optimal individual; otherwise, return to step ②.
The main parts of the BSO algorithm are clustering and mutation [50].
BSO employs the K-means clustering algorithm to cluster individuals into k categories according to the distance between individuals, taking the individual with the best fitness value in each cluster as the cluster center. To prevent falling into local optima, a mutated individual generated with a certain probability replaces one of the cluster centers.
BSO mutation covers four major ways: (1) adding random disturbance to a random cluster center, i.e., the best individual of that cluster, to generate a new individual; (2) randomly selecting an individual in a random cluster and adding random perturbation to generate a new individual; (3) randomly fusing two cluster centers and adding random perturbation to generate a new individual; (4) randomly fusing two random individuals from two clusters and adding random disturbance to generate a new individual. A compact sketch of the whole procedure follows.
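This is a minimal sketch of steps ①-⑤ under the stated clustering and mutation scheme; the probabilities, the Gaussian step size and the greedy replacement rule are illustrative simplifications (for instance, fusion is shown for cluster centers only), not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def bso_minimize(f, dim, bounds, n=30, k=3, iters=100,
                 p_replace=0.2, p_one=0.8, p_center=0.4, seed=0):
    """Minimal BSO sketch: cluster, pick bases, perturb, keep improvements."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, (n, dim))                      # step 1: init population
    for _ in range(iters):
        fit = np.array([f(x) for x in pop])                  # step 2: evaluate
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(pop)
        centers = [pop[labels == c][np.argmin(fit[labels == c])]
                   for c in range(k)]                        # step 3: best-per-cluster
        if rng.random() < p_replace:                         # random center replacement
            centers[rng.integers(k)] = rng.uniform(lo, hi, dim)
        for i in range(n):                                   # step 4: mutate and update
            if rng.random() < p_one:                         # base from one cluster
                c = rng.integers(k)
                base = centers[c] if rng.random() < p_center \
                       else pop[rng.choice(np.where(labels == c)[0])]
            else:                                            # base fuses two centers
                c1, c2 = rng.choice(k, size=2, replace=False)
                r = rng.random()
                base = r * centers[c1] + (1 - r) * centers[c2]
            cand = np.clip(base + rng.normal(0, 0.1 * (hi - lo), dim), lo, hi)
            if f(cand) < fit[i]:                             # greedy update
                pop[i] = cand
    fit = np.array([f(x) for x in pop])
    return pop[np.argmin(fit)]                               # step 5: best individual

# usage: minimize the sphere function
print(bso_minimize(lambda x: float(np.sum(x ** 2)), dim=2, bounds=(-5.0, 5.0)))
```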

2) DBSO ALGORITHM (DIFFERENCE-MUTATION BRAIN STORM OPTIMIZATION)
For several complex optimization problems, the BSO algorithm exhibits slow convergence or premature convergence. To improve the optimization performance, this study adopted the DBSO algorithm to optimize the parameters of the CatBoost model.
The DBSO algorithm has the same overall structure as the classical BSO algorithm, except that difference mutation is applied instead of the Gaussian mutation in the fourth step.
The classical BSO algorithm applies Gaussian mutation, and the new individual generation equations are expressed as:

$$X_{\text{new}}^{d} = X_{\text{selected}}^{d} + \xi \cdot N\left(\mu, \sigma^{2}\right) \tag{8}$$

$$\xi = \operatorname{logsig}\left(\frac{0.5\,T - t}{k}\right) \cdot R(0, 1) \tag{9}$$

where $X_{\text{selected}}^{d}$ denotes the $d$th dimension of the selected individual, $N(\mu, \sigma^{2})$ is a Gaussian random value, $T$ and $t$ respectively represent the maximum number of iterations and the current number of iterations, $k$ adjusts the slope of the logsig() transfer function, and $R(0, 1)$ is a random value between 0 and 1.
This mutation meets the search requirements at the early stage, whereas the Gaussian mutation coefficient tends to become fixed at later stages, so it cannot capture the search characteristics well [45]. Thus, the DBSO algorithm adopts difference mutation.
In human brainstorming, everyone's ideas differ significantly at the early stage, and the differences between existing ideas should be considered when creating new ideas. Accordingly, the DBSO algorithm determines the mutation step by difference mutation. The specific operation is defined as:

$$X_{\text{new}}^{d} = X_{\text{selected}}^{d} + R(0, 1) \cdot \left(X_{a}^{d} - X_{b}^{d}\right) \tag{10}$$

where $X_{a}^{d}$ and $X_{b}^{d}$ express two different individuals selected from the current population. According to Eq. (10), compared with Gaussian mutation, the calculation amount of the above difference mutation is significantly reduced. Moreover, since the mutation step is adaptively adjusted according to the dispersion degree of individuals in the population, it can share information more effectively and improve the search efficiency. Thus, compared with the BSO algorithm, the DBSO algorithm can better balance local search and global search and improve the algorithm performance.
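The two operators can be contrasted in a few lines. This sketch assumes $\mu = 0$ and $\sigma = 1$ in Eq. (8) and is illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mutation(x_sel, t, T, k=20.0):
    """Classical BSO step, Eqs. (8)-(9): xi = logsig((0.5*T - t)/k) * rand."""
    xi = 1.0 / (1.0 + np.exp(-(0.5 * T - t) / k)) * rng.random()
    return x_sel + xi * rng.normal(0.0, 1.0, size=x_sel.shape)

def difference_mutation(x_sel, pop):
    """DBSO step, Eq. (10): the step size follows the scaled difference of two
    distinct individuals, so it tracks the dispersion of the population."""
    i, j = rng.choice(len(pop), size=2, replace=False)
    return x_sel + rng.random() * (pop[i] - pop[j])

# toy usage
pop = rng.uniform(-5, 5, size=(20, 4))
print(gaussian_mutation(pop[0], t=10, T=100))
print(difference_mutation(pop[0], pop))
```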

III. TRANSFORMER FAULT DIAGNOSIS MODEL BASED ON DBSO-CATBOOST
This study adopted the CatBoost model to diagnose transformer faults. With some parameters of the CatBoost model left at their default values, over-fitting or under-fitting would occur, and manual adjustment would make finding the optimal values time-consuming. Accordingly, the DBSO optimization algorithm was adopted to optimize the parameters of the CatBoost model and improve the performance of the diagnosis model. For transformer fault diagnosis, this study built a DBSO-CatBoost model (Fig. 2). The transformer fault diagnosis model based on DBSO-CatBoost primarily comprises data preprocessing, DBSO optimization and fault diagnosis. Data preprocessing mainly covers feature extraction, dimension reduction and normalization of the collected DGA sample data, as well as sequence division. The DBSO optimization part exploits the DBSO algorithm to optimize several parameters of the CatBoost model to obtain the optimal parameters. The model training and testing part trains and tests the CatBoost model, outputs the transformer fault types and assesses the model.

A. DATA ACQUISITION
The data in this study were provided by a power grid in the northwest of the State Grid Corporation of China, and H₂, CH₄, C₂H₆, C₂H₄ and C₂H₂ were selected as the attributes for transformer fault diagnosis, covering 381 groups of fault data. A three-dimensional view of the data is shown in Fig. 3.

Fig. 3. 3D view of DGA data
According to Fig. 3, no single feature of the DGA data, however large its variation, can accurately determine the fault type of a transformer, and there were coupling relationships between the feature attributes of the data, so it was necessary to extract features from the data [51], [52].

B. DATA PREPROCESSING

1) FEATURE EXTRACTION
According to GB/T 7252-2016, Guidelines for the Analysis and Judgment of Dissolved Gases in Transformer Oil, the gas production rate of transformer insulating oil is correlated with the fault type of the transformer, i.e., the fault type is correlated with the ratios of the respective gas concentrations. Thus, ratios among the input attributes are related to the characteristics of the transformer fault types. The common three-ratio method and non-coding ratio method can each roughly determine a fault type independently [6]- [9], so the characteristic variables generated by taking ratios among the input attributes exert a decoupling effect on the transformer fault diagnosis data.
However, the feature dimensions generated by the three-ratio and non-coding ratio methods alone cannot completely decouple the data. To achieve a better decoupling effect, this study chose to traverse the ratios of the data attributes. The selected data feature variables were mainly composed of the component concentrations and their traversed ratios, i.e., ratios between groups formed from the five-dimensional DGA attributes. Using an enumeration algorithm, 145 new feature variables were obtained by traversing all permutations and combinations of the four groups, and adding the original 5-dimensional feature variables yielded 150-dimensional data feature variables.
Since some values in the collected DGA data were zero, the feature attributes added by the ratio method involved division by zero, which generated abnormal data.
On the whole, the processing methods for abnormal data comprise elimination by the Pauta (3σ) criterion and fixed-value filling. The DGA data were excessively scattered, and the differences in magnitude were significant; filtering with the Pauta criterion would eliminate most of the data, so this method was not suitable for the DGA data. Here, the fixed-value filling method was used to process the abnormal data, as sketched below.
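A hedged sketch of the ratio-feature construction with fixed-value filling follows. The exact traversal that yields 145 new features is not fully specified in the text, so this version, which traverses disjoint one- and two-gas groups, only illustrates the idea; the fill value and column naming are likewise illustrative.

```python
import numpy as np
import pandas as pd
from itertools import combinations

GASES = ["H2", "CH4", "C2H6", "C2H4", "C2H2"]

def add_ratio_features(df, fill_value=0.0):
    """Augment the 5 gas concentrations with traversed ratio features;
    divisions by zero are replaced by a fixed value (fixed-value filling)."""
    cols = {}
    for r_num in (1, 2):                          # numerator group size
        for r_den in (1, 2):                      # denominator group size
            for num in combinations(GASES, r_num):
                for den in combinations(GASES, r_den):
                    if set(num) & set(den):
                        continue                  # keep groups disjoint
                    name = "+".join(num) + "/" + "+".join(den)
                    ratio = df[list(num)].sum(axis=1) / df[list(den)].sum(axis=1)
                    # inf comes from x/0, NaN from 0/0; both are filled
                    cols[name] = ratio.replace([np.inf, -np.inf],
                                               fill_value).fillna(fill_value)
    return pd.concat([df[GASES], pd.DataFrame(cols, index=df.index)], axis=1)

# toy usage
df = pd.DataFrame(np.random.default_rng(0).uniform(0, 50, (4, 5)), columns=GASES)
print(add_ratio_features(df).shape)
```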
Each of the 150 data feature variables contributed differently to the samples, and adding some variables sometimes increased the complexity of the model while reducing its accuracy. Accordingly, the Shapley Additive Explanations (SHAP) method was used for feature extraction. The SHAP method builds an additive explanatory model whose core idea is to calculate the marginal contribution of each feature to the model output and then explain the black-box model at the global and local levels. All features are regarded as 'contributors'. For each prediction sample, the model produces a predicted value, and the SHAP value is the value assigned to each feature in that sample [53].
The SHAP values of each feature were calculated for the 150-dimensional feature variables, and a feature density scatter (beeswarm) plot was made. Each row in the beeswarm graph represents a feature, and the abscissa is the SHAP value. Each point represents a sample, and its color represents the relative magnitude of the feature value: the redder the point, the larger the value; the bluer the point, the smaller the value. The ordinates in Fig. 4 are sorted in descending order of the mean absolute SHAP value, and the first 20 features for the medium-temperature overheating category are shown as a beeswarm diagram (Fig. 4). Fig. 4 shows that the mean absolute SHAP value of C₂H₂ was the largest, i.e., C₂H₂ had the greatest impact on the classification of samples. In addition, H₂, CH₄, C₂H₆ and C₂H₄ were also very important for sample classification [54].
The beeswarm graph only visualizes the SHAP values of all samples for one category, which does not represent the interpretability of the overall model. For the multi-classification situation in this study, the mean of the per-class mean absolute SHAP values was taken to obtain the overall mean absolute SHAP value, and a histogram of feature influence on the samples was drawn [55].
In the histogram, each column represents a feature. The overall mean absolute SHAP value of C₂H₂ was the largest, so it had the greatest impact on data classification. According to the curve in Fig. 5, the cumulative mean absolute SHAP value of the top 60 features accounted for nearly 90%, so these 60 features were taken as the attributes of the data.
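The screening step can be sketched with the shap library as follows. The stand-in data, the CatBoost settings and the handling of the per-class SHAP arrays are assumptions (the return shape of shap_values varies across shap versions), while the top-60 cut follows Fig. 5; CatBoost's built-in get_feature_importance with type="ShapValues" would be an alternative route.

```python
import numpy as np
import shap
from catboost import CatBoostClassifier

# stand-in data: 200 samples x 150 ratio features, 7 transformer states
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 150))
y = rng.integers(0, 7, 200)

model = CatBoostClassifier(loss_function="MultiClass", iterations=100,
                           verbose=False).fit(X, y)
explainer = shap.TreeExplainer(model)
sv = np.abs(np.asarray(explainer.shap_values(X)))   # per-class SHAP magnitudes

# average over every axis except the feature axis (length 150),
# which works whether classes come first or last in the returned array
feat_axis = sv.shape.index(X.shape[1])
scores = sv.mean(axis=tuple(ax for ax in range(sv.ndim) if ax != feat_axis))

top60 = np.argsort(scores)[::-1][:60]               # keep the 60 strongest features
X_selected = X[:, top60]
print(X_selected.shape)
```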
2) DATA DIMENSION REDUCTION
Principal component analysis (PCA) maps the original variables to a new variable space in which a few new variables can replace the original ones while retaining as much of their information content as possible. The new variables are orthogonal to each other, eliminating the collinearity of the original variables.
Kernel principal component analysis (KPCA) achieves a nonlinear mapping by projecting the original data into a higher-dimensional space and then employing principal component analysis to reduce the dimension linearly in that space [56]- [58].
PCA, PLS and KPCA were adopted to reduce the dimension of the data, and the results are shown in Fig. 6. The cumulative contribution increases with the dimension and stops increasing after reaching 100%. The cumulative contribution of KPCA was obviously higher than that of the other dimension reduction algorithms. When the dimension was 7, the cumulative contribution rate of KPCA was 99.9%, while those of PCA and PLS did not reach 90%. As the dimension increased further, the cumulative contribution rate of KPCA increased only slightly, while the training time of the model increased with the dimension.
According to Fig. 6, KPCA was significantly better than the other algorithms, so this study used the KPCA algorithm to reduce the data to 7 dimensions.
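A minimal sketch of the KPCA reduction follows. The RBF kernel and gamma value are assumptions, since the text does not state the kernel, and the eigenvalue-based curve mirrors the cumulative contribution of Fig. 6 (the eigenvalues_ attribute requires scikit-learn 1.0 or later).

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# stand-in for the 60 SHAP-selected features
rng = np.random.default_rng(0)
X60 = rng.uniform(-1, 1, (200, 60))

# fit with all components to draw a cumulative-contribution curve (cf. Fig. 6)
full = KernelPCA(n_components=None, kernel="rbf", gamma=0.1).fit(X60)
ratios = full.eigenvalues_ / full.eigenvalues_.sum()
print(np.cumsum(ratios)[:10])           # cumulative contribution by dimension

# final reduction to 7 dimensions, as chosen in the text
X7 = KernelPCA(n_components=7, kernel="rbf", gamma=0.1).fit_transform(X60)
print(X7.shape)
```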

3) DATA NORMALIZATION
The DGA data varied greatly in magnitude, affecting the processing speed of the model, so the data required normalization [61]. In this study, the interval-value method was used to normalize the data, scaling the data proportionally to a specific interval to avoid interactions between values. Here, the extreme-value method was selected for the linear transformation:

$$X_i' = 2 \times \frac{X_i - X_{\min}}{X_{\max} - X_{\min}} - 1 \tag{11}$$

where $X_i'$ $(i = 1, 2, \ldots, n)$ denotes the normalized data, with mapping interval $[-1, 1]$; $X_i$ represents the original data; $X_{\max}$ denotes the maximum value in the data sample; and $X_{\min}$ expresses the minimum value in the data sample. The normalized data after dimension reduction could then be input to train and test the model.
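Eq. (11) amounts to a column-wise min-max mapping; a minimal sketch, assuming no constant columns, is given below.

```python
import numpy as np

def scale_to_pm1(X):
    """Column-wise linear mapping of Eq. (11): scales each feature to [-1, 1].
    Assumes no column is constant (otherwise the denominator is zero)."""
    xmin = X.min(axis=0)
    xmax = X.max(axis=0)
    return 2.0 * (X - xmin) / (xmax - xmin) - 1.0

# toy usage
X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
print(scale_to_pm1(X))
```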

C. FAULT STATE CODING AND SEQUENCE DIVISION
The output of the diagnosis model is the fault type of the transformer. According to GB/T 7252-2016, Guidelines for the Analysis and Judgment of Dissolved Gases in Transformer Oil, this study took low-temperature overheating, medium-temperature overheating, high-temperature overheating, partial discharge, low-energy discharge, high-energy discharge and normal operation as the output categories of transformer fault diagnosis. The training set, validation set and test set were divided at a ratio of 3:1:1. The fault state codes and the numbers of their corresponding samples are shown in Table I.
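A sketch of the 3:1:1 division with scikit-learn follows; the stratification and random seed are assumptions, and the toy arrays stand in for the preprocessed 7-dimensional features and the seven state codes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in for the 381 preprocessed samples and their state codes 0-6
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (381, 7))
y = rng.integers(0, 7, 381)

# 3:1:1 split: hold out 1/5 for testing, then 1/4 of the remainder for validation
X_rem, X_te, y_rem, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rem, y_rem, test_size=0.25,
                                            stratify=y_rem, random_state=0)
print(len(X_tr), len(X_val), len(X_te))   # approximately 228 : 76 : 77
```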

D. COMPARISON OF MULTI-MODEL DIAGNOSIS RESULTS
For the preprocessed data, six models, including the extreme learning machine (ELM), support vector machine (SVM), GRNN, random forest (RF), XGBoost and CatBoost, were used for fault diagnosis to test their performance on transformer fault diagnosis. The diagnosis results are shown in Fig. 7. According to Fig. 7, the overall accuracy of CatBoost was the highest among the six models, and SVM was the highest among the single learners; the accuracy of the ensemble learning algorithms was higher than that of the single learners. The specific accuracy of each model for each fault type is shown in Table II. According to Table II, the overall accuracy of the six models from low to high was GRNN, ELM, SVM, random forest, XGBoost, CatBoost. The overall accuracy of the CatBoost algorithm was the best under empirical parameters, but the training time of ensemble learning algorithms is long; if the grid search traversal method were used for parameter tuning, the time required would be too long and the tuning range relatively limited. The classification performance of the single learners was poorer: compared with the single learner models, the ensemble learning models exhibited higher fault diagnosis accuracy for oil-immersed transformers.

E. COMPARISON OF PARAMETER OPTIMIZATION ALGORITHMS OF CATBOOST MODEL
The performance of the CatBoost model was better than that of the other models. The training set of the CatBoost classification model was analyzed, and the data processed by the ratio method combined with KPCA were used as input features. The diagnosis results of the CatBoost model on the training and test sets are presented in Fig. 8, where the CatBoost model uses default parameters. According to Fig. 8, the CatBoost classification results were over-fitted, so the parameters of the CatBoost model needed to be optimized.
If the parameters of the CatBoost model were adjusted manually, it would not only take a long time but also be difficult to find the global optimum. If the grid search method were used, the time required would be too long and the adjustment range limited. Accordingly, an optimization algorithm was used to tune the parameters of the CatBoost model.
CatBoost is trained by gradient boosting: in each iteration, a new learner is produced by minimizing the regularized objective function. If the regularization parameter L2_leaf_reg is too large or too small, the model will under-fit or over-fit. If the learning rate parameter learning_rate is too small, gradient descent is too slow; if too large, it may overshoot the optimum and oscillate. If the iteration number parameter iterations is too small, the model under-fits and its solving ability is insufficient; if too large, the model over-fits and its generalization ability declines. In addition, the random strength parameter random_strength is used in scoring tree splits, and an improper value affects the learning and classification ability of the model [29]. Thus, this study selected an optimization algorithm to optimize the above four parameters of the CatBoost model to improve the performance of the diagnosis model.
Common parameter optimization algorithms include particle swarm optimization (PSO) and the sparrow search algorithm (SSA). In this study, DBSO, BSO, PSO and SSA were used to optimize the four hyper-parameters of the CatBoost model, and the results were compared [62]- [64]. The fitness evaluation shared by these optimizers is sketched below.
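All four optimizers need only a shared fitness function. The following sketch returns the validation error rate of a CatBoost model built from one candidate parameter vector; the four argument names are real CatBoostClassifier parameters, while the vector layout and loss setting are assumptions.

```python
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

def fitness(params, X_tr, y_tr, X_val, y_val):
    """Validation error rate for one candidate vector
    [iterations, learning_rate, L2_leaf_reg, random_strength]."""
    iters, lr, l2, strength = params
    model = CatBoostClassifier(
        iterations=int(iters),
        learning_rate=float(lr),
        l2_leaf_reg=float(l2),
        random_strength=float(strength),
        loss_function="MultiClass",
        verbose=False,
    )
    model.fit(X_tr, y_tr)
    err = 1.0 - accuracy_score(y_val, model.predict(X_val).ravel())
    return err  # the optimizer (DBSO/BSO/PSO/SSA) minimizes this value
```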
The fitness function curve was drawn with the classification error rate on the validation set as the fitness value. The fitness curve of each optimization algorithm is shown in Fig. 9. According to Fig. 9, the DBSO algorithm reached the optimal result first, requiring 11 iterations to achieve the optimal fitness; its optimal fitness value, 2.132%, was the same as that of the SSA and BSO algorithms. The final fitness value of the PSO algorithm was the largest, i.e., its optimization effect was the worst.
The CatBoost models optimized by the four algorithms were used for fault diagnosis, and the results are shown in Fig. 10. According to Fig. 10, the test set accuracy of the DBSO-CatBoost, BSO-CatBoost and SSA-CatBoost models was the same, and higher than that of PSO-CatBoost.
In summary, although the accuracy of the DBSO-CatBoost model was the same as that of the other two models, it found the optimal point faster, so its optimization effect was the best.

F. CASE DATA ANALYSIS
The 381 groups of collected data were used to build the model; some sample data are listed in Table III. Features were constructed from these data by the ratio method, followed by feature screening, KPCA dimensionality reduction and normalization; finally, the DBSO-CatBoost algorithm was applied for prediction. The results are listed in Table IV. According to Tables III and IV, the proposed model achieves better accuracy than the traditional three-ratio method. For the samples presented in Table IV, the CatBoost model and the DBSO-CatBoost model were used to analyze the confidence of the samples [65], [66].
The confidence of the CatBoost model is listed in Table V, and that of the DBSO-CatBoost model in Table VI. According to Tables V and VI, the confidence of the DBSO-CatBoost model in the correct classification of samples is higher than that of the CatBoost model, so the classification method proposed here is effective.

1) DIAGNOSTIC RESULTS UNDER DIFFERENT PRETREATMENT METHODS
In this study, the ratio method was used to process the data, and then a dimension reduction algorithm was used to reduce the dimension of the data. Four data sets were formed with four different processing methods: the original five-dimensional data, the data reduced by the ratio method combined with KPCA, the data reduced by the ratio method combined with PCA, and the data reduced by the ratio method combined with PLS. The DBSO-CatBoost model was used to classify the four data sets, and the classification results on the test set are shown in Fig. 11. According to Fig. 11, when the data were reduced to seven dimensions, the ratio method combined with PLS gave the worst classification effect, and the ratio method combined with KPCA gave the best. With the DBSO-CatBoost model, the accuracy on the data reduced by the ratio method combined with KPCA was 3.950%, 10.526% and 5.263% higher than that on the ratio method combined with PCA, the ratio method combined with PLS, and the original five-dimensional data, respectively. Accordingly, the classification effect of the data processed by the ratio method and the KPCA dimension reduction algorithm was better than that of the original data.
In addition, when a classification algorithm is applied, the precision, recall and F1-score of the model are three main indicators for judging its classification effect [67].
The precision is determined by:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{12}$$

The recall rate is determined by:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{13}$$

F1-Score, also known as the balanced F-score, is calculated by:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{14}$$

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
Taking the normal operation category as an example: a true positive is a sample correctly predicted as normal operation; a false positive is a sample wrongly predicted as normal operation; and a false negative is a sample whose true state is normal operation but whose prediction is wrong.
Macro-F1, i.e., the macro-average method, is obtained by substituting the precision and recall of each transformer state into Eq. (14) and averaging the resulting seven F1-Scores, as sketched below.
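A short sketch of Eqs. (12)-(14) and the macro average with scikit-learn follows; the toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# toy predictions over the 7 transformer states (codes 0-6)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, 80)
y_pred = np.where(rng.random(80) < 0.9, y_true, rng.integers(0, 7, 80))

# Eqs. (12)-(14) per class, then Macro-F1 as the unweighted mean of the 7 F1-Scores
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                              labels=np.arange(7), zero_division=0)
print(np.round(f1, 3))
print("Macro-F1:", f1.mean())
```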
The precision, recall and F1-Score values of KPCA-DBSO-CatBoost in Fig. 11 were calculated. The detailed prediction results of KPCA-DBSO-CatBoost are shown in Table VII, and the precision, recall and F1-Score obtained from Table VII and Eqs. (12)-(14) are detailed in Table VIII. Combining Eqs. (12)-(14) with Tables VII and VIII, the F1-Score is 93.42%, and the Macro-F1 value, obtained by summing the F1-Score values of each class and dividing by 7, is 92.63%. The Macro-F1 value of the Original-DBSO-CatBoost model in Fig. 11, calculated by the same method, is less than 90%. This shows that the KPCA-DBSO-CatBoost transformer fault diagnosis classification method is effective.

2) COMPARISON OF DIAGNOSTIC RESULTS OF DIFFERENT MODELS
The DBSO optimization algorithm was employed to optimize the ELM, SVM, GRNN, Random Forest, XGBoost and CatBoost classifiers. After the data were processed by the ratio method and KPCA, the optimal classification model was built.
The optimization algorithm was adopted to optimize: the initial weights and thresholds of the ELM model; the penalty factor C and kernel function parameter g of the SVM model; the smoothing factor of the GRNN model; the number of decision trees and the number of split features of the Random Forest [68]; and the number of regression trees k, learning rate η, maximum tree depth (max_depth), regularization coefficient λ, min_child_weight and minimum splitting gradient descent δ of the XGBoost model. For the CatBoost model, the regularization coefficient L2_leaf_reg, the random strength random_strength used in scoring tree splits, the iteration number iterations and the learning rate learning_rate were optimized. The population size was set to 20, and the number of iterations was set to 100. The classification diagnosis results of each model are presented in Fig. 12, and the detailed per-class results are listed in Table IX.

TABLE IX DETAILED CLASSIFICATION RESULTS OF EACH OPTIMIZED MODEL (columns follow the model order listed above)

| Fault type | ELM | SVM | GRNN | RF | XGBoost | CatBoost |
| --- | --- | --- | --- | --- | --- | --- |
| low temperature overheating | 6/9 | 6/9 | 7/9 | 7/9 | 7/9 | 7/9 |
| middle temperature overheating | 7/10 | 9/10 | 8/10 | 9/10 | 9/10 | 9/10 |
| high temperature overheating | 8/10 | 9/10 | 9/10 | 9/10 | 9/10 | 9/10 |
| partial discharge | 7/9 | 9/9 | 8/9 | 9/9 | 9/9 | 9/9 |
| low energy discharge | 4/9 | 6/9 | 7/9 | 7/9 | 7/9 | 9/9 |

high energy discharge 15

Analysis of Fig. 12 and Table IX indicates that the DBSO-CatBoost model has the best classification effect, and that the classification effect of each model is significantly improved after optimization by the optimization algorithm.
To verify the performance of the model, the PSO-RF model established using non-coding ratio features was compared with the model proposed in this paper [22]. The experimental results are shown in Fig. 13.