Compound Fault Diagnosis Method of Modular Multilevel Converter Based on Improved Capsule Network

When a submodule (SM) compound fault occurs in the modular multilevel converter (MMC), the time-domain waveform characteristics of the output current and internal circulating current of the MMC are not obvious, especially for high-level MMCs. To address this issue, a compound fault diagnosis method based on an improved capsule network (CapsNet) is proposed in this paper. First, the front end of the network adopts a feature extraction structure that combines a one-dimensional convolutional neural network (1DCNN) with a long short-term memory (LSTM) network. The raw MMC three-phase output currents and three-phase circulating currents are used as the fault detection signals. The back end employs a primary capsule layer and a digit capsule layer, and transfers feature vectors through a dynamic routing algorithm. This feature extraction structure combines the light weight of the 1DCNN with the sequence sensitivity of the LSTM; while ensuring that the information is fully extracted, the computational cost of the model is greatly reduced. Furthermore, an overlapping sampling method is used to construct and expand the sample sets. Compared with three other common deep learning methods, the fault diagnosis performance under different level numbers and under level changes is analyzed. The experimental results demonstrate that the proposed method has excellent cross-domain learning ability and high fault recognition accuracy.


I. INTRODUCTION
As a new type of voltage source converter topology, the modular multilevel converter (MMC) has many advantages, such as a modular structure, easy expansion, high output waveform quality, low operating losses and a common DC bus. It has been increasingly widely used in medium- and high-voltage direct current transmission, renewable energy grid connection, high-voltage electric drives and other applications. The MMC is formed by cascading a large number of submodules (SMs). Each SM uses insulated gate bipolar transistors (IGBTs) and diodes as commutation devices. Compared with diodes, IGBTs have lower over-voltage and over-current tolerance and are more prone to failure. If the open-circuit failure of a single SM is not detected in time and the converter keeps operating with the fault for a long period, the heat and losses of the switching devices in that SM will increase sharply, which is very likely to cause other healthy SMs to fail [1], [2]. When two or more SMs suffer open-circuit failures, this fault form is called a compound fault in this paper. Compound faults are very common in the operation of such devices [3], so it is necessary to diagnose compound faults of the MMC's SMs [4]-[6].
In recent years, the literature on MMC SM fault diagnosis has been increasing. Fault diagnosis methods for multilevel converter switching devices are mainly divided into three types: signal processing methods, mathematical model methods and data-driven methods. Signal-processing-based methods are mainly used in systems for which it is difficult to establish a mathematical model, but whose input and output signals are highly correlated with internal faults and easy to monitor [7]. In these methods, the collected original signal is transformed from the time domain to the frequency domain by spectrum analysis, and the spectral characteristics of the fault waveform are extracted to realize fault diagnosis. However, it is difficult to complete the fault diagnosis task relying only on spectral characteristics and manual identification, so such methods are usually combined with data-driven methods. Model-based fault diagnosis methods require an accurate mathematical model built on knowledge of the mechanism of the diagnosed system [8]-[11]. In practice, multilevel converter systems are nonlinear, high-order and strongly coupled, so it is often difficult to establish an accurate mathematical model. In the past few years, data-driven fault diagnosis methods have therefore gradually become a research hotspot in the field of fault diagnosis [12]-[15].
Deep learning technology, represented by the convolutional neural network (CNN), has its own unique network structure and training method. The network directly takes the original data as the input signal and directly outputs the fault diagnosis result. The CNN is thus an end-to-end fault diagnosis method, which reduces the dependence on expert experience, does not require manual feature extraction, and is increasingly widely used in the field of fault diagnosis [16]-[20]. Fault diagnosis models based on one-dimensional convolutional neural networks (1DCNN) and two-dimensional convolutional neural networks (2DCNN) have been proposed successively [21]; they offer higher accuracy, better generalization ability and an end-to-end diagnosis mode compared with traditional methods such as the SVM and the multi-layer perceptron. Although CNNs have achieved good results in the field of fault diagnosis, the pooling layers in a CNN discard a large amount of spatial feature information, resulting in inadequate extraction of detailed information, which limits diagnosis accuracy. The capsule network (CapsNet) is a neural network with a new architecture proposed to overcome this shortcoming of CNNs [22], [23]. In a CapsNet, each neuron is a vector, which reduces the loss of detailed features, so that more detail can be extracted from the input data, resulting in strong discriminative ability [24]-[31].
Inspired by the above discussion, this paper presents an improved CapsNet diagnosis method for MMC SM compound faults, called the fault diagnosis feature extraction promotion capsule network (FD-EP-CN). The main contributions are highlighted as follows: 1) An improved CapsNet feature extraction structure replaces the original single convolution layer, extracting important features from the original time series data and discriminating key information; 2) The effects of the batch size, optimizer and number of iterations on the diagnosis performance of the model are analyzed, and the main parameters of the model are set accordingly; 3) The fault diagnosis performance of the proposed method and three other methods is analyzed under different level numbers and under level changes. The experimental results verify that the FD-EP-CN model has the best fault diagnosis performance.

A. THE TOPOLOGY OF MMC
The circuit topology of the MMC is shown in Fig. 1, in which Fig. 1(a) is the main circuit topology of the three-phase MMC and Fig. 1(b) is the half-bridge SM topology. In the figure, SM represents the submodule, VT represents the switching device, and D refers to the diode. The MMC is composed of three phases and six bridge arms; the upper and lower bridge arms of each phase form a phase unit. Each bridge arm contains a bridge arm reactor and N series-connected submodules. The main functions of the bridge arm reactor are to suppress the internal circulating current between the bridge arms and to limit the current rise rate when the converter fails.
By controlling the number of SMs on the upper and lower bridge arms of each phase, MMC outputs a multilevel waveform fitting sine wave. In addition, the number of SMs in the bridge arm of the series converter can be adjusted to meet the requirements of different power and voltage levels, which is convenient to realize capacity expansion and redundancy design, shorten the project construction cycle and save cost.
Due to their low loss and low cost, most MMC-HVDC projects adopt the half-bridge SM structure. Each submodule consists of two IGBTs and two anti-parallel diodes connected in parallel with a floating capacitor.

B. THE WORKING PRINCIPLE OF MMC
The MMC can be divided into three working states and six working modes [32], [33]. Assuming that each bridge arm of the MMC contains N series SMs, each phase unit contains 2N submodules. During steady-state operation, N + 1 levels can be output at the AC side of the MMC by controlling the number of SMs inserted in the upper and lower bridge arms. In order to keep the DC-side voltage of the MMC constant, the total number of conducting SMs of each phase unit shall satisfy

N U_c = U_dc (1)

where U_dc is the DC-side bus voltage of the MMC, N is the number of series SMs per bridge arm, and U_c is the capacitor voltage of each SM.
Taking phase a as an example (j = a, b, c), the upper and lower bridge arm currents can be written as

i_jp = i_dc/3 + i_j/2 + i_diff,j
i_jn = i_dc/3 - i_j/2 + i_diff,j (2)

where i_jp and i_jn are the upper and lower arm currents, i_dc is the DC-side current, and i_j is the output current of the MMC. From (2), the circulating current equation is written as

i_diff,j = (i_jp + i_jn)/2 - i_dc/3 (3)

where i_diff,j is the internal circulating current of phase j. Therefore, the output voltage of the MMC can be adjusted by flexibly controlling the number of inserted SMs in the upper and lower bridge arms of each phase unit.
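As a quick numerical sanity check (not from the paper), the arm-current decomposition in (2) and the circulating-current definition in (3) can be verified with hypothetical waveform values:

```python
import numpy as np

# Hypothetical waveforms: phase output current i_j, DC-side current i_dc,
# and a 2nd-harmonic circulating component i_diff (all values illustrative).
t = np.linspace(0.0, 0.02, 200)               # one 50 Hz fundamental period
i_j = 100.0 * np.sin(2 * np.pi * 50 * t)      # phase output current
i_dc = 60.0                                   # DC-side current
i_diff = 5.0 * np.sin(2 * np.pi * 100 * t)    # circulating current component

# Arm currents per (2):
i_p = i_dc / 3 + i_j / 2 + i_diff             # upper arm current i_jp
i_n = i_dc / 3 - i_j / 2 + i_diff             # lower arm current i_jn

# Recovering the circulating current per (3):
i_diff_rec = (i_p + i_n) / 2 - i_dc / 3
assert np.allclose(i_diff_rec, i_diff)
# The output current is the difference of the arm currents:
assert np.allclose(i_p - i_n, i_j)
```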

A. FEATURE EXTRACTION STRUCTURE COMBINING 1DCNN AND LSTM
A 1DCNN can process each input sequence segment separately, but it is not sensitive to the order of the time series. One remedy is to stack convolution layers and pooling layers so that the upper layers of the network obtain a larger receptive field. Although this method can recognize longer input sequence fragments, it still does not introduce genuine sequence sensitivity. Therefore, this paper proposes a feature extraction structure combining 1DCNN with LSTM, in which the 1DCNN serves as a data preprocessing stage in front of the LSTM. The 1DCNN converts long input sequences into shorter sequences composed of high-level features, and these feature sequences are then taken as the input of the LSTM. It is worth noting that an LSTM has a high computational cost when directly processing long sequences, while a 1DCNN has a low computational cost. Therefore, this improved feature extraction structure combines the speed and light weight of the 1DCNN with the order sensitivity of the LSTM.

1) One dimensional convolutional neural network
CNNs were first applied in the field of image recognition and achieved good results. A CNN is composed of several convolution layers and pooling layers. Unlike the fully connected layer, which extracts global patterns from the input feature map, the convolution layer learns local patterns. This important property gives CNNs three main characteristics: local connectivity, weight sharing and hierarchical representation. Therefore, a CNN greatly reduces the number of network parameters, reduces the risk of model overfitting, and improves computational efficiency.
Two-dimensional convolutional neural networks (2DCNN) appear in much of the literature, such as in machine vision, while 1DCNNs are less common. Similar in structure and properties to a 2DCNN, a 1DCNN is composed of several one-dimensional convolution layers and one-dimensional pooling layers, and is mainly applied to the analysis of time series data. A 1DCNN extracts local one-dimensional subsequences from sequence data; the one-dimensional convolution layer recognizes local patterns in the sequence, which makes the network translation invariant. The one-dimensional pooling operation also extracts a one-dimensional subsequence from the input sequence and outputs its maximum value. It reduces the length of the one-dimensional input sequence and is a sub-sampling process.
The working principle of one-dimensional convolution neural network is shown in Fig. 2. The size of one-dimensional convolution window is 3, and each output time step is obtained by using a small segment of input sequence in the time dimension.
Assuming that the one-dimensional signal X^(l-1) is a feature map output by layer l-1, the one-dimensional convolution of layer l can be expressed as

X_j^l = f( Σ_{i∈M_j} X_i^(l-1) * w_ij^l + b_j^l ) (4)

where X_j^l is the j-th output feature map of layer l, M_j is the j-th convolution region, X_i^(l-1) is an element of that region, w_ij^l is the corresponding convolution kernel, b_j^l is the bias of the j-th feature map of layer l, and f(·) is the activation function of the convolution layer.
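A minimal sketch (not the paper's implementation) of the single-channel one-dimensional convolution described above, using a hypothetical size-3 kernel and ReLU as f(·):

```python
import numpy as np

def conv1d_valid(x, w, b):
    """Single-channel 'valid' 1-D convolution followed by ReLU,
    mirroring X_j^l = f(sum over the region of X_i^{l-1} * w_ij^l + b_j^l)."""
    k = len(w)
    out = np.array([np.dot(x[i:i + k], w) + b for i in range(len(x) - k + 1)])
    return np.maximum(out, 0.0)  # f(.) = ReLU

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])   # hypothetical input sequence
w = np.array([0.5, -0.25, 1.0])            # kernel of size 3, as in Fig. 2
y = conv1d_valid(x, w, b=0.1)
# y has len(x) - k + 1 = 3 output time steps, one per window position
```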

2) Long short term memory network
Recurrent neural networks (RNN) are usually used to process time series data with causality or order sensitivity. The LSTM is a variant of the RNN that largely solves the gradient explosion and vanishing gradient problems of the simple RNN. Because the LSTM adds a pathway that carries information across many time steps, it can save information for later use and thus avoid the vanishing gradient problem. An LSTM neuron is mainly composed of three gates: the input gate, the forget gate and the output gate. The network structure of the LSTM is shown in Fig. 3.
In the figure, i_t, f_t and o_t are the input gate, forget gate and output gate respectively. The forget gate f_t controls how much information of the internal state C_(t-1) at the previous time should be forgotten:

f_t = σ(W_f [h_(t-1), x_t] + b_f) (5)

The input gate i_t controls how much information of the candidate state at the current time should be saved:

i_t = σ(W_i [h_(t-1), x_t] + b_i) (6)

The output gate o_t controls how much information of the internal state C_t at the current time is output to the external state h_t:

o_t = σ(W_o [h_(t-1), x_t] + b_o),  h_t = o_t ⊙ tanh(C_t) (7)

where σ(·) is the logistic function, whose output lies in the (0, 1) interval, x_t is the input at the current time, and h_(t-1) is the external state at the previous time.
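The gate equations and state updates above can be sketched as a single numpy time step; the weight layout (one stacked matrix W applied to the concatenation [h_(t-1), x_t]) and all numerical values are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step with gates f_t, i_t, o_t as in Fig. 3.
    W maps [h_{t-1}, x_t] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = len(h_prev)
    f_t = sigmoid(z[0*n:1*n])          # forget gate: what to drop from c_{t-1}
    i_t = sigmoid(z[1*n:2*n])          # input gate: what to write
    o_t = sigmoid(z[2*n:3*n])          # output gate: what to expose
    g_t = np.tanh(z[3*n:4*n])          # candidate state
    c_t = f_t * c_prev + i_t * g_t     # internal state update
    h_t = o_t * np.tanh(c_t)           # external state h_t = o_t * tanh(C_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_h, n_x = 4, 3                        # hypothetical hidden/input sizes
W = rng.standard_normal((4 * n_h, n_h + n_x)) * 0.1
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.standard_normal(n_x), h, c, W, b)
```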

B. CAPSULE NETWORK
The CapsNet is a new neural network structure first proposed by Hinton in 2017, which aims to solve the problem that CNNs lose the spatial position information of objects during image recognition. A CapsNet consists of nested neural layers; it uses convolution layers but removes max pooling. A capsule is a carrier containing multiple neurons, and its input and output are vectors. The length of each output vector represents the estimated probability that an object exists, and its direction encodes the pose parameters of the object. The capsule network is mainly composed of a primary capsule layer and a digit capsule layer. The operation inside a capsule can be divided into three steps. The first step is the matrix transformation:

û_(j|i) = W_ij u_i (8)

where u_i is the output of a vector neuron in the lower layer, i.e., a low-level feature, W_ij is the pose matrix between the low-level and high-level features, and û_(j|i) is the prediction vector of the high-level feature inferred from the low-level feature.
The second step is input weighting. The prediction vectors are weighted and summed to obtain the vector

s_j = Σ_i c_ij û_(j|i) (9)

where c_ij is the coupling coefficient determined by the dynamic routing algorithm.
The third step is nonlinear compression. A vector nonlinear activation function, the squash (compression) function, transforms s_j into the output vector of the digit capsule layer:

v_j = (||s_j||² / (1 + ||s_j||²)) · (s_j / ||s_j||) (10)

By adjusting the coupling coefficients, the input neurons automatically select the best path to the next layer of capsules. The coupling coefficients are obtained from the routing logits b_ij by a softmax:

c_ij = exp(b_ij) / Σ_k exp(b_ik) (11)

and the logits are iteratively updated according to the agreement between the prediction vectors and the outputs:

b_ij ← b_ij + û_(j|i) · v_j (12)

The calculation process of dynamic routing is shown in Fig. 4. In forward propagation, b_ij is initialized to zero; the coupling coefficients are calculated by (11), s_j is computed by (9) and squashed by (10) to obtain v_j, and then (12) updates b_ij. This process is repeated for the specified number of routing iterations, finally yielding a set of optimal coupling coefficients.
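The three capsule steps and the routing loop can be condensed into a short numpy sketch; the capsule counts and vector dimensions below are hypothetical, chosen only to exercise the algorithm:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """v = (|s|^2 / (1 + |s|^2)) * s / |s| -- keeps direction, bounds length < 1."""
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (num_lower, num_upper, dim) prediction vectors u_{j|i}."""
    n_i, n_j, _ = u_hat.shape
    b = np.zeros((n_i, n_j))                                   # logits, init 0
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # softmax over j
        s = np.einsum('ij,ijd->jd', c, u_hat)                  # weighted sum s_j
        v = squash(s)                                          # output capsules v_j
        b = b + np.einsum('ijd,jd->ij', u_hat, v)              # agreement update
    return v

# Hypothetical sizes: 6 lower capsules, 4 upper capsules, 8-dim vectors.
u_hat = np.random.default_rng(1).standard_normal((6, 4, 8))
v = dynamic_routing(u_hat, n_iter=3)
lengths = np.linalg.norm(v, axis=-1)   # capsule lengths = existence probabilities
assert v.shape == (4, 8) and np.all(lengths < 1.0)
```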

IV. COMPOUND FAULT DIAGNOSIS BASED ON IMPROVED CAPSULE NETWORK

A. MODEL STRUCTURE
In the traditional CapsNet, the feature extraction structure is a single convolution layer, whose feature extraction ability is insufficient. In order to obtain more useful information from the original current signals of the MMC, this paper improves the feature extraction structure of the CapsNet. The fusion of 1DCNN and LSTM is used as the feature extraction unit of the CapsNet, constructing a more comprehensive and rich feature extractor with prominent spatial features; it is then combined with the primary capsule layer and digit capsule layer to form the improved CapsNet structure. This network is referred to as the fault diagnosis feature extraction promotion capsule network (FD-EP-CN); its structure is shown in Fig. 5.
The front end of the network employs the feature extraction structure of 1DCNN combined with LSTM to directly receive the original MMC three-phase output currents and circulating currents as fault detection data, ensuring that the information is fully extracted while greatly reducing the computational cost of the model. The back end of the network adopts the primary capsule layer and digit capsule layer structure, realizes the transmission of feature vectors through the dynamic routing algorithm, and uses the ReLU function as the activation function.

B. LOSS FUNCTION
In the training process, the weights of each layer in the model and the weight parameters in the CapsNet are iteratively updated by the back-propagation (BP) algorithm, which minimizes a loss function measuring the distance between the predicted and true values of the model output. The output probabilities of a CapsNet do not necessarily sum to 1; the CapsNet can identify multiple objects at the same time and allows multiple classes to coexist. Therefore, the traditional cross-entropy loss function cannot be used directly. The loss function in this paper consists of two parts, combining the margin loss and the reconstruction loss to calculate the total loss. The margin loss function is expressed as

L_k = T_k max(0, m⁺ - ||v_k||)² + λ(1 - T_k) max(0, ||v_k|| - m⁻)²

where k is the class index and T_k is the class indicator function: T_k = 1 when the current sample belongs to class k, otherwise T_k = 0. m⁺ is the upper bound, used to penalize false positives, i.e., a class is predicted to exist but actually does not. m⁻ is the lower bound, used to penalize false negatives, i.e., a class is predicted not to exist but actually does. The upper and lower bounds are set to m⁺ = 0.9 and m⁻ = 0.1 respectively. λ is a proportion coefficient used to adjust the relative weight of the two terms and is set to 0.5. The total margin loss equals the sum of the losses over all classes for each sample.
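A minimal sketch of the margin loss with the stated constants (m⁺ = 0.9, m⁻ = 0.1, λ = 0.5); the capsule lengths used below are hypothetical values, not measured outputs:

```python
import numpy as np

def margin_loss(v_len, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """L_k = T_k max(0, m+ - |v_k|)^2 + lam (1 - T_k) max(0, |v_k| - m-)^2.
    v_len: capsule output lengths |v_k|; T: one-hot class indicators T_k."""
    pos = T * np.maximum(0.0, m_pos - v_len) ** 2          # false-positive term
    neg = lam * (1.0 - T) * np.maximum(0.0, v_len - m_neg) ** 2  # false-negative term
    return np.sum(pos + neg)

# Hypothetical 4-class example: true class 0.
v_len = np.array([0.95, 0.05, 0.30, 0.08])
T = np.array([1.0, 0.0, 0.0, 0.0])
loss = margin_loss(v_len, T)
# Only the 0.30 wrong-class length exceeds m- = 0.1 and is penalized.
# The overall loss would add alpha * reconstruction loss, alpha = 0.005:
#   total = loss + 0.005 * np.sum((x_recon - x) ** 2)
```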
The reconstruction loss is calculated by attaching a three-layer fully connected network after the capsule layer to obtain output data with the same dimension as the original input, and taking the sum of squared differences between this final output and the initial input as the loss value. The calculation process is shown in Fig. 6. The overall loss is the margin loss plus α times the reconstruction loss, where α = 0.005, so the margin loss is dominant.

C. FAULT DIAGNOSIS SCHEME
Considering the large differences among the three-phase output currents and circulating currents of the MMC, if the original features are not standardized, the accuracy will be reduced or the loss function will not converge during training. Therefore, before the original data are input into the model, they are normalized to the [0, 1] interval by the deviation standardization (min-max) method:

x' = (x - min(x)) / (max(x) - min(x))

where max(x) and min(x) are the maximum and minimum values of the feature, and x and x' represent the feature before and after normalization. The flow chart of the proposed fault diagnosis algorithm is shown in Fig. 7, and the constructed data sets are listed in Table 2. Each data set contains 15 fault states and one normal state, 16 types in total.
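The deviation (min-max) standardization can be sketched as follows; the current samples are hypothetical:

```python
import numpy as np

def minmax_normalize(x, eps=1e-12):
    """Deviation standardization: x' = (x - min(x)) / (max(x) - min(x)),
    mapping a feature channel into the [0, 1] interval."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)

i_a = np.array([-120.0, -30.0, 0.0, 45.0, 120.0])  # hypothetical phase current
x = minmax_normalize(i_a)
# minimum maps to 0, maximum maps to (almost exactly) 1
```

In practice each of the six current channels would be normalized independently, since their amplitude ranges differ.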

B. ANALYSIS OF COMPOUND FAULT FEATURES FOR MMC SUBMODULES
The open-circuit fault is set at 1.045 s. When a single submodule open-circuit fault occurs in the MMC, the three-phase AC output currents and internal circulating currents of the converter change. For example, when a submodule open-circuit fault occurs on a bridge arm of phase a, the resulting asymmetric operation of the bridge arms causes large fluctuations of the DC-side current and of the circulating current of the fault phase, while the other two phases continue to run in the non-fault state, and all the increased circulating current components of the fault-phase bridge arm flow to the DC side [34].

1) Fault characteristic analysis of output current
When the MMC operates normally, the three-phase output currents are symmetrical, the phase difference between the phase currents is 120 degrees, and there is no DC bias. In order to illustrate the change of the current waveforms after different types of compound faults, SM open-circuit faults are set at different times. Here, three compound fault types are taken as examples: the upper and lower bridge arms of phase a, the upper bridge arms of phases a and b, and the upper bridge arms of phases a and c. Fig. 9(a), Fig. 10(a) and Fig. 11(a) show the output current waveforms of the 11-level MMC after the three different types of compound faults. It can be seen from the time-domain diagrams that after a compound SM fault, the three-phase output currents are seriously distorted and no longer symmetrically distributed. For example, when an SM of a bridge arm has an open-circuit fault, the amplitude of that phase's output current decreases; the output currents of the non-fault phases also decrease, but by less than that of the fault phase. Fig. 12 and Fig. 13 show the time-domain current waveforms after compound faults of the 31-level and 61-level MMC respectively. The change of the output current waveforms is clearly much less obvious, indicating that the time-domain waveform characteristics become weaker and weaker as the number of MMC levels increases.

2) Fault characteristic analysis of internal circulating current
The three-phase internal circulating current waveforms after the different compound faults of the 11-level MMC are shown in Fig. 9(b), Fig. 10(b) and Fig. 11(b) respectively. It can be seen from the figures that the circulating currents of both the fault phase and the non-fault phases increase, but the increase of the fault phase is significantly greater than that of the non-fault phases [35]. Similarly, with increasing MMC level number, the time-domain waveform characteristics of the circulating currents after a compound fault become weaker and weaker, as shown in Fig. 12(b) and Fig. 13(b). To sum up, as the number of SMs per bridge arm increases, the difference between the time-domain waveform characteristics of the output currents and circulating currents under compound faults becomes smaller and smaller. Consequently, a more accurate and sensitive fault detection method is necessary.

1) Structure and parameter setting of FD-EP-CN model
The network feature extraction part employs the structure of two convolution layers and a pooling layer, followed by an LSTM layer. In order to increase the size of the receptive field of the network and fully extract the information from the original data, the convolution kernel of the first convolution layer is large and the size is set to 64 × 1. The convolution kernel size of the second convolution layer is set to 6 × 1. The window size of the pooling layer is set to 2 × 1.
The purpose is to reduce the number of network parameters without losing too much information and to prevent overfitting. The function of the convolution and pooling layers is to convert the input sequence into a shorter sequence composed of high-level features, so as to facilitate the subsequent recurrent layer. In order to enhance the order sensitivity of the network to the extracted data, an LSTM layer is added. The second part of the network builds the capsule unit. The output of the capsule layer is 16 vectors of dimension 16, where the first 16 is the number of classes and the second 16 is the dimension of each vector. The number of iterations of the dynamic routing algorithm in the capsule layer is set to 3.

2) Convergence analysis of FD-EP-CN model
After repeated tuning, the convergence process of the FD-EP-CN model over the iterations is shown in Fig. 14, which shows the change of model loss and accuracy with the number of iterations. It can be seen that the losses of the training set and the validation set decrease as the number of iterations increases. When the number of iterations reaches 100, the loss function of the FD-EP-CN model becomes stable, which means that the model has converged.

3) Effect of batch size on Model
Due to the large number of samples, if all data were used in each iteration, the training speed of the model would be greatly reduced. In order to speed up training, the loss function is calculated on only a small part of the training samples at a time, which is called a batch. Thanks to matrix operations, optimizing the parameters of the neural network on a batch of training samples at a time is not much slower than on a single sample, and using one batch at a time greatly reduces the number of iterations required for convergence. In order to analyze the impact of the batch size on the network model, the batch size is set to 16, 32, 64, 128 and 256 respectively; the simulations are run one by one and the model results are counted, as shown in Table 3.
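Mini-batch iteration as described above can be sketched as a simple shuffled batch generator; the sample count, channel count and batch size below are illustrative assumptions, not the paper's data set sizes:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Shuffle once per epoch, then yield successive mini-batches."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

rng = np.random.default_rng(42)
X = np.zeros((1000, 6, 512))   # hypothetical: 1000 samples, 6 current channels
y = np.zeros(1000, dtype=int)
batches = list(iterate_minibatches(X, y, batch_size=64, rng=rng))
# 1000 samples / batch size 64 -> 15 full batches plus one partial batch of 40
```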

4) Effects of different optimizers on the model
In order to analyze the impact of different optimizers on model performance, the SGD, Adagrad, Adam, RMSprop, Adamax and Nadam optimization algorithms are used as the optimizer to test the model on the validation and test sets. The number of iterations is set to 100, and the test results are shown in Fig. 15. The results show that when the Adam and Adamax algorithms are used as the optimizer of the FD-EP-CN model, the diagnosis accuracy is the highest, while the test accuracies corresponding to the SGD and Adadelta algorithms are relatively low, and these optimizers converge with difficulty during training; they are therefore not suitable for this model. The test accuracies corresponding to the remaining optimization algorithms are at an intermediate level.

D. COMPARISON WITH EXISTING FAULT DIAGNOSIS METHODS
In order to verify the effectiveness of the method proposed in this paper, it is compared with three other deep learning methods: CNN, CNN + capsule and CNN + LSTM. The structural parameters of four models are shown in Table 4.
The table lists the layer types and output shapes of each layer of the four network models; the structure and parameters of the CNN part are the same in all four methods. The models are tested and analyzed on the MMC data sets with different level numbers and on the data sets with varying levels respectively. In addition, because the number of training samples has a great impact on the performance of a deep learning model, the accuracy of each model is tested with varying numbers of training samples. Each method is verified by five independent repeated trials to reduce the impact of randomness. In order to evaluate the diagnostic performance under different MMC levels with different numbers of training samples, 31-level and 61-level MMC data sets with the same number of samples as the 11-level MMC are built, as shown in Table 5.

1) Performance analysis of fault diagnosis under different levels
As shown in Fig. 16, curves with error bars represent the mean and standard deviation of the test accuracy of the four models as the number of training samples changes on the three data sets. In the figure, the horizontal axis is the number of training samples, the vertical axis is the test accuracy, and A → A indicates that the data used for training and testing the model both come from data set A. From Fig. 16, the fault diagnosis accuracy of each deep learning model increases with the number of training samples. When the number of training samples reaches 1600, all four methods can learn enough feature information from the original data to obtain good diagnosis results. When the number of samples is small, such as 100, 300 or 500 samples, method 4 (CNN+LSTM+capsule) achieves higher test accuracy than the other three methods; it also converges quickly. For example, in Fig. 16(a), with 500 samples the diagnostic accuracy of method 4 has already reached 92.8%, method 2 (CNN+capsule) and method 3 (CNN+LSTM) reach 84.6% and 82.8% respectively, while method 1 (CNN) is the lowest at only 71.8%. This indicates that adding an LSTM or capsule layer to the traditional convolutional structure, and especially the improved feature extraction structure proposed in this paper, can greatly improve fault recognition ability.
In terms of final accuracy on the three data sets, the four methods all achieve their highest accuracy on data set A. This is because, as the MMC level number increases, the fault characteristics of the output currents and circulating currents change more and more slightly, which demands a higher feature extraction ability from the model. Comparing the fault diagnosis performance of the four methods under different levels comprehensively, method 4 has a higher fault recognition rate and better stability.
In addition, taking method 4 as an example, the fault classification confusion matrix under different MMC levels is analyzed. The number of training samples is 1600. For the condition C → C, taking the average of five trials, the confusion matrix of the diagnostic accuracy of the different failure modes is drawn as shown in Fig. 17.

2) Performance analysis of fault diagnosis under level changing
In practical applications of the MMC, the training data and test data may come from MMCs with different level numbers. The working conditions of MMCs with different levels differ, such as the input voltage and load. When the level number of the MMC system changes, the instantaneous values of the output current and circulating current waveforms are different.
Deep learning is an adaptive feature extraction approach that automatically learns the latent features of each category from a large number of training samples. When the number of MMC levels changes, the features shared by samples of the same fault type diminish, which degrades the diagnosis performance of many deep-learning-based fault diagnosis systems. Sample data collected under the different working conditions of different MMC levels therefore affects the diagnosis accuracy of the model, and training a deep learning model under such conditions requires more sample data to adapt to the change. In real application scenarios, however, the number of known fault samples is usually very small, so establishing a small-sample fault identification model is of great significance. To verify the fault diagnosis performance of the model when the level number changes, the training set and the test set are drawn from MMC systems with different numbers of levels. For example, data from the 11-level MMC system is used to train the model, and data from the 31-level MMC system is used to test it. The diagnostic performance of the four fault diagnosis models as a function of the number of samples is shown in Fig. 18, where AB → C indicates that data sets A and B are used to train the model and data set C is used to test it.
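The AB → C style of experiment amounts to a simple data-pooling step: the samples of the training-domain data sets are merged, and the third domain is held out entirely for testing. The sketch below is illustrative only; the data set contents, fault labels and level assignments are hypothetical, not taken from the experiments.

```python
def make_cross_domain_split(datasets, train_keys, test_key):
    """Pool the training-domain samples; hold out the test domain."""
    train = [sample for key in train_keys for sample in datasets[key]]
    test = list(datasets[test_key])
    return train, test

# Toy samples: (signal window, fault label). The level assignments
# (A: 11-level, B: 21-level, C: 31-level) are assumptions for illustration.
datasets = {
    "A": [([0.1, 0.2], "F1"), ([0.3, 0.1], "F2")],
    "B": [([0.2, 0.2], "F1"), ([0.4, 0.0], "F2")],
    "C": [([0.5, 0.3], "F1")],
}

train, test = make_cross_domain_split(datasets, ["A", "B"], "C")
print(len(train), len(test))  # 4 1
```

Because the test domain never contributes to training, any accuracy measured on it reflects cross-domain generalisation rather than memorisation.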
As can be seen from Fig. 18, the overall recognition accuracy under the level-change condition is lower than in Fig. 16, mainly because the test condition has never been seen by the model during training; the gap is widest when the number of samples is small. Taking Fig. 18(a) as an example, as the number of samples increases from 100 to 1300, the test accuracy of all four methods rises, reaching final accuracies of 92.1%, 94.2%, 95.4% and 98.6%, respectively.
Under the data set AB → C condition, with 1300 samples in the training set, the confusion matrix of the diagnostic accuracy of method 4 for the different failure modes is shown in Fig. 19. It should be noted that the test accuracy under the AB → C condition is the lowest of the tested conditions. The reason is that test data set C comes from the MMC with the highest number of levels, so its fault features are the least pronounced and the hardest for the model to extract; the fault characteristics of data set C therefore differ markedly from those of data sets A and B, which lowers the fault identification accuracy. Even so, the test accuracy of method 4 still reaches 96.8%. The above analysis shows that method 4 has excellent cross-domain learning ability and fault recognition accuracy, and in particular the best fault diagnosis performance under small-sample conditions.
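The per-class accuracies reported in confusion matrices such as Figs. 17 and 19 can be computed by counting (true, predicted) label pairs and normalising each row by its total. A minimal stdlib sketch, with hypothetical fault-mode labels:

```python
from collections import defaultdict

def confusion_counts(true_pred_pairs):
    """Count occurrences of each (true, predicted) label pair."""
    counts = defaultdict(int)
    for t, p in true_pred_pairs:
        counts[(t, p)] += 1
    return counts

def per_class_accuracy(labels, counts):
    """Diagonal fraction of each row: correct / total for each true class."""
    acc = {}
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        acc[t] = counts[(t, t)] / row_total if row_total else 0.0
    return acc

labels = ["F1", "F2", "F3"]  # hypothetical fault modes
pairs = [("F1", "F1"), ("F1", "F1"), ("F1", "F2"),
         ("F2", "F2"), ("F3", "F3")]
acc = per_class_accuracy(labels, confusion_counts(pairs))
print(acc["F1"])  # 2 of 3 F1 samples correct -> 0.666...
```

Row normalisation makes the diagonal entries directly comparable across fault modes even when the test set is class-imbalanced.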

VI. CONCLUSION
This paper proposes a compound fault diagnosis method for MMC based on an improved CapsNet, referred to as FD-FEP-CN. To address the problem that the SM compound fault characteristics of MMC are not obvious, the feature extraction structure of the CapsNet is improved: the combination of 1DCNN and LSTM serves as the feature extraction unit of the CapsNet, yielding a more comprehensive and richer representation with prominent spatial features, and this unit is then combined with the main capsule layer and the digital capsule layer, between which feature vectors are transferred by the dynamic routing algorithm. In addition, the overlapping sampling method is employed to construct and expand the sample set.
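The overlapping sampling mentioned above amounts to sliding a fixed-length window over the recorded current waveforms with a stride smaller than the window length, so that consecutive samples share data points and the sample set is enlarged. A minimal sketch with a toy signal (the window and stride values are illustrative, not the paper's settings):

```python
def overlapping_samples(signal, window, stride):
    """Slide a window of length `window` with step `stride`; choosing
    stride < window makes consecutive samples overlap, expanding the set."""
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, stride)]

# A 10-point toy waveform; real windows would span the raw three-phase
# output currents and circulating currents of the MMC.
signal = list(range(10))
samples = overlapping_samples(signal, window=4, stride=2)
print(len(samples))  # 4 windows: [0..3], [2..5], [4..7], [6..9]
```

With stride equal to the window length this reduces to ordinary non-overlapping segmentation; halving the stride roughly doubles the number of samples from the same recording, which is what makes the method useful in small-sample settings.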
The experiments analyze the effects of batch size, optimizer and number of iterations on the diagnosis performance of the CapsNet. The proposed method is also compared with three other methods, CNN, CNN + CapsNet and CNN + LSTM, and their fault diagnosis performance under different MMC levels and under level changes is analyzed. The experimental results confirm that the proposed method (FD-FEP-CN) achieves good diagnosis accuracy on the high-level MMC data set, demonstrating strong feature extraction and nonlinear mapping ability. Under MMC level changes, FD-FEP-CN shows excellent cross-domain learning ability and fault recognition accuracy, especially with small samples. In addition, the method uses only six current sensors, which greatly reduces the detection cost compared with other MMC detection methods.