Faults in Modular Multilevel Cascade Converters - Part II: Fault Tolerance, Fault Detection and Diagnosis, and System Reconfiguration

Modular multilevel cascade converters (MMCCs) are considered a promising power electronics topology in industry. Their scalability makes it possible to reach (very or ultra) high voltage levels with low harmonic content and high efficiency, which makes MMCCs an ideal solution for high-power applications such as electrical drives, solid-state transformers, and high-voltage direct-current (HVdc) transmission systems. However, the high levels of thermal, electrical, and mechanical stress on the power electronics devices and the large number of components (e.g., capacitors or semiconductors) make MMCCs prone to faults. Fault detection and diagnosis (FDD) in combination with fault isolation and system reconfiguration techniques, based on cell redundancy, can increase the reliability, availability, and safety of MMCCs, which is crucial for their utilization in critical energy applications. This second part of the article comprehensively surveys 1) fault tolerance and FDD (e.g., expert system-, model-, hardware-, and data-based FDD methods) and 2) system reconfiguration strategies (e.g., cold- or hot-redundant) for MMCCs. Finally, the state of the art, challenges, and future research trends and opportunities toward reliable MMCC-based systems are discussed.

One major drawback of MMCCs is the high number of power semiconductors and capacitors required for their operation, which potentially undermines their reliability [22], [23]. Faults in capacitors and power semiconductors account for more than 50% of the failures in power converters and are therefore critical for reliable operation [24], [25]. The probability of a failure in an MMCC depends on the reliability of its cells and its subsystems (i.e., clusters, legs, and subconverters) [26], [27]. Additionally, the number of cells, the topology of the converter, its operation mode, the structure of the cells, and their maintenance also affect the reliability of the converter. The lifetime of the MMCC can be extended by condition monitoring and predictive maintenance in combination with a fault-tolerant control strategy and component or subsystem hardware redundancy. These actions may even triple the mean time to failure (MTTF) of the converter [28], [29].
Several fault-tolerant control strategies have been proposed in the literature to increase the reliability of MMCCs [30], [31], [32], [33]. These strategies are based on active fault-tolerant control systems (AFTCSs), in which the control system adapts its structure, set points (references), or limit values and enables or disables subsystems (i.e., system reconfiguration) in order to keep the converter in operation even during a fault. AFTCSs combine fault detection and diagnosis (FDD) methods, which detect and localize the fault, with a subsequent system reconfiguration to achieve fault tolerance [34].
A variety of FDD methods have been developed and discussed in the literature; they can be classified into qualitative and quantitative methods [35]. These techniques represent the most promising methods to improve the reliability of MMCCs [36], [37]. Qualitative methods, e.g., expert systems, use previous knowledge of distinctive features of a fault (fault symptoms), such as signal deviations or the appearance of auxiliary signals, in combination with fault trees or rule-based decision algorithms to identify the type of fault and its location [38]. Quantitative methods are 1) model-based approaches using mathematical models, 2) hardware-based approaches incorporating additional hardware, sensors, or components, or 3) data-driven methods relying on statistical analysis of large datasets.
Hardware-based FDD methods can detect and localize faulty cells by incorporating additional hardware or sensors used exclusively for fault detection purposes. The additional hardware makes it possible to measure and/or extract decision variables that describe fault symptoms. The measurements are compared to prespecified thresholds, reference or limit values, and/or other measurements of the same quantity recorded by other sensors in order to obtain residuals, which are used to detect and/or localize the fault(s) [43], [44].
Data-driven FDD methods have also been proposed for fault detection in MMCCs and do not rely on mathematical models. These methods use statistical multivariate analysis [45], [46] or signal processing tools, e.g., fast Fourier or wavelet transformations [47]. The most recent approaches incorporate machine learning techniques with neural networks to model the behavior of the converter and to adapt these models to external changes [48], [49], [50].
For a list of definitions of the terminology used, see Appendix A.
Once a fault in an MMCC has been detected and localized using a particular FDD method, a hardware and control reconfiguration of the MMCC is necessary to, e.g., disable the faulty cell and ensure uninterrupted operation. For this purpose, the converter must comprise one or more redundant cells to be capable of operating at nominal power even under fault conditions [51]. Alternatively, a reduced number of cells may be used if a derating of the rated power is permitted [52]. Hence, special mechanisms of hardware redundancy [40], [53] and software redundancy [32], [54] must be incorporated into the topology and control algorithms of MMCCs.
This article provides a comprehensive review and discussion of the state of the art and a future perspective on the reliability of MMCCs. The discussion focuses on the reported FDD methods used for MMCCs and the most promising reconfiguration strategies to achieve fault tolerance. This work provides a guideline for future research in the field of reliability and fault tolerance of MMCCs stimulating novel and interdisciplinary research activities on FDD, reconfiguration, fault-tolerant control strategies, and converter topologies.
The rest of this article is organized as follows: Section II details the concepts of fault tolerance, fault detection, fault diagnosis, and system reconfiguration. In Section III, rule-, model-, hardware-, and data-based methods for MMCCs are presented and their strengths and drawbacks are discussed. Section IV reviews available reconfiguration strategies for MMCCs. Finally, Section V concludes this second part of the overview article.

II. CONCEPTS OF FAULT TOLERANCE AND FAULT DETECTION AND DIAGNOSIS

A. FAULT TOLERANCE CONCEPT
The fault tolerance capability of power converters is provided by fault-tolerant control systems (FTCSs), which adjust the control system and enable or disable components or subsystems to maintain an acceptable performance in the event of internal or external faults that would lead to system failure without proper countermeasures [34]. Two types of FTCSs are usually defined in the literature: 1) passive fault-tolerant control systems (PFTCSs) and 2) active fault-tolerant control systems (AFTCSs) [34]. PFTCSs are robust against a set of presumably known faults which are already considered during control system design. However, PFTCSs provide only limited fault tolerance capabilities if unknown faults occur. In contrast, AFTCSs incorporate adjustment and reconfiguration capabilities that automatically react to system faults by adjusting the control system and by reconfiguring the system to maintain a stable and acceptable performance in the presence of arbitrary faults.
An AFTCS is illustrated in Fig. 1. It consists of 1) an FDD system, 2) a reconfiguration system, 3) a reconfigurable control system (including adjustable set points and controllers), and 4) reconfigurable subsystems. The FDD system continuously monitors the system. When it detects and localizes a fault in a subsystem (e.g., an MMCC cell), this information is sent to the reconfiguration system, which immediately acts on the control system by adjusting set points (e.g., cell capacitor voltage references) and/or controllers (including parameter retuning). If necessary, it disables the subsystem(s) affected by the fault (e.g., broken cells are bypassed) and enables redundant subsystem(s) of the converter (if available; e.g., redundant cells are inserted).
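The reaction of the reconfiguration system to an FDD result can be illustrated with a minimal sketch. The `Cluster` class, the `reconfigure` function, and the equal sharing of the cluster voltage among the inserted cells are assumptions made for illustration only; they are not taken from any of the surveyed methods.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster: inserted cells plus bypassed redundant spares."""
    v_cluster_ref: float                          # voltage shared by the inserted cells
    active: list                                  # ids of inserted cells
    spares: list = field(default_factory=list)    # ids of healthy redundant cells
    bypassed: list = field(default_factory=list)  # ids of disabled faulty cells

    def cell_voltage_ref(self) -> float:
        """Capacitor voltage set point per inserted cell (assumed equal sharing)."""
        return self.v_cluster_ref / len(self.active)


def reconfigure(cluster: Cluster, faulty_cell: int) -> float:
    """React to an FDD result: bypass the faulty cell, insert a spare if one
    is available, and return the adjusted cell voltage set point."""
    cluster.active.remove(faulty_cell)
    cluster.bypassed.append(faulty_cell)
    if cluster.spares:
        cluster.active.append(cluster.spares.pop(0))
    if not cluster.active:
        raise RuntimeError("no healthy cells left in the cluster")
    return cluster.cell_voltage_ref()


cluster = Cluster(v_cluster_ref=4000.0, active=[0, 1, 2, 3], spares=[4])
ref_after_first = reconfigure(cluster, faulty_cell=2)   # spare 4 replaces cell 2
ref_after_second = reconfigure(cluster, faulty_cell=0)  # no spare left: 3 cells share 4000 V
```

In this sketch, the first fault is absorbed by the spare cell and the set point is unchanged, whereas the second fault forces a higher set point for the remaining cells. Whether such an increase is admissible depends on the cell ratings, which are not modeled here.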

B. FAULT DETECTION AND DIAGNOSIS
FDD methods utilize system monitoring to detect anomalous operations (fault detection) and evaluate their causes (fault diagnosis) [55]. The FDD process is illustrated in Fig. 2. As an essential part of AFTCSs, FDD must 1) ensure a timely detection of a fault (i.e., fault detection); 2) localize the source of the fault (i.e., fault localization); 3) identify the (possibly time-varying) fault characteristics (i.e., fault identification); and 4) evaluate the severity of the fault and its impact(s) on the overall system behavior (i.e., fault evaluation). Fault diagnosis [i.e., 2)-4) above] is performed to find a suitable system reconfiguration strategy [56]. FDD strategies, techniques, or algorithms use analytic and/or heuristic methods to generate fault reconfiguration signals to achieve fault tolerance [57].
A classification of available FDD methods is shown in Fig. 3. Two main approaches are commonly used, namely 1) qualitative methods and 2) quantitative methods [58], [59].
Qualitative methods are based on previous knowledge of the system. Detection and/or diagnosis methods reuse knowledge from previous faults to support decision-making during diagnosis. Expert systems and fault-tree analysis are widely used for FDD in industry [60]. For MMCCs, traditional expert systems are the most commonly used methods [61], [62]. The knowledge can be obtained from data (measurements), models (simulations), or a combination of both [60]. The decision-making is based on a set of comparison rules (if <condition> then <conclusion>) [63] or on a causal reasoning process, in which 1) one rule represents a unique fault type or 2) a combination of rules reproduces all viable diagnosis candidates (fuzzy logic) [59], [64].
Quantitative methods use statistical and/or dynamical models to relate system variables to estimates and to describe their behavior [64]. These methods are classified into 1) model-based, 2) data-based, and 3) hardware-based methods [35].
The reaction time of an FDD method is critical for implementing countermeasures, such as reconfiguration and maintenance [65], and for minimizing performance degradation and preventing dangerous operation of converters [66]. There is no standard for the reaction time of an FDD method, since different fault modes cause different failure mechanisms and differ in how difficult they are to detect. However, it can be said that an FDD method only fails if the system is no longer able to continue operating.
Generally, an open-circuit fault (OCF) is less time critical in MMCCs than a short-circuit fault (SCF) [67]. Reaction times reported in the literature range from as little as 5 μs [68], using extra hardware, up to 150 ms [61], using software only. The ms range can be acceptable for OCFs, but an SCF must be detected within the μs range (commercial IGBTs have an SCF withstand capability of around 10 μs [69]).
Fig. 3 presents a summary of the classification of FDD methods and highlights the methods that have been implemented in MMCCs. In addition, Table 1 presents a comprehensive summary of the highlighted FDD methods reported for MMCCs. Following the classification of Fig. 3, Table 1 is grouped into expert system-based, model-based, hardware-based, and data-based FDD methods. Clearly, most methods have not yet been comprehensively tested or are not applicable to every fault condition, and further research is required. Hardware-based methods are the fastest at detecting and identifying a fault. Expert systems are simple to implement but are prone to false fault detection and are difficult to implement when multiple or complex faults are present. Model-based and data-driven methods are promising; however, several challenges regarding the real-time computational burden must be overcome before they can become the preferred alternative. According to the evidence, FDD for MMCCs can be considered an emerging and stimulating research topic in which several areas of expertise converge.

A. EXPERT SYSTEM-BASED FDD METHODS
An expert system is a computer-based decision-making system that uses deep knowledge and rule-based decision-making to take control actions over a system. Expert system-based FDD methods for MMCCs rely on deep knowledge of the system behavior, which is used to define suitable thresholds for admissible cell voltages, circulating currents, and/or input/output currents. Threshold violations of these variables are considered fault symptoms. Their waveforms and transients are also used to identify and localize the fault. In [38], an expert system for an MMCC in DSCC configuration with 21 levels is presented. The method is able to detect and distinguish semiconductor OCFs, voltage and current sensor OCFs (measurements close to 0), and grid faults. The inference engine of the FDD method includes 1) a monitoring unit, which detects deviations of the measured circulating currents and capacitor voltages from their references, and 2) a decision-making unit (DMU), which identifies the faulty cell. Depending on predefined thresholds for circulating current and cell voltages, different indication bits are set. Afterward, the DMU identifies the fault based on the set indication bits. Several cases are studied; e.g., when a deviation in the circulating current persists for more than 100 ms, an OCF in a switch of a cell is detected and the faulty leg is identified. Fault diagnosis is applied to both clusters of the faulty leg by comparing the measured cell capacitor voltages with their references. The sign and magnitude of the capacitor voltage residual make it possible to identify the faulty cell. The method is successfully validated by a hardware-in-the-loop simulation. In addition, an experimental validation is performed for a single-phase MMCC. The computation time of the proposed FDD method varies between 22 and 43 ms, depending on the type of fault.
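The two-stage structure described above (a monitoring unit followed by a decision-making unit) can be sketched as follows. This is a simplified illustration under stated assumptions, not the rule base of [38]: the function names, the threshold values, the 10 ms sampling period, and the "largest residual" selection rule are all hypothetical.

```python
def monitoring_unit(i_circ, i_ref, threshold, n_hold):
    """Return the sample index at which the circulating-current deviation has
    exceeded `threshold` for `n_hold` consecutive samples, else None."""
    count = 0
    for k, i_meas in enumerate(i_circ):
        count = count + 1 if abs(i_meas - i_ref) > threshold else 0
        if count >= n_hold:
            return k
    return None


def decision_making_unit(v_caps, v_ref, threshold):
    """Return (cell index, residual) of the largest capacitor-voltage residual,
    or None if no residual exceeds `threshold`."""
    residuals = [v - v_ref for v in v_caps]
    idx = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return (idx, residuals[idx]) if abs(residuals[idx]) > threshold else None


# Assumed 10 ms sampling, so 10 consecutive samples correspond to a 100 ms window.
i_circ = [0.0] * 5 + [3.0] * 20
detected_at = monitoring_unit(i_circ, i_ref=0.0, threshold=1.0, n_hold=10)
faulty = decision_making_unit([1000.0, 1005.0, 1180.0, 998.0],
                              v_ref=1000.0, threshold=50.0)
```

The persistence counter is what trades detection speed for robustness: a longer window suppresses false alarms during transients but delays the indication bit.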
In [62], a method based on the monitoring of the cell output voltages in relation to the cell switching states is presented. The method can detect OCFs and SCFs of IGBTs in one phase of a DSCC-MMCC. The cell output voltage is compared to the minimum possible capacitor voltage under healthy conditions. Based on the behavior of the healthy cell (for details, see [21]), a set of rules is designed that takes the switching state and the cell output voltages into account to detect and diagnose the faulty cell. To implement this method without adding extra sensors, the voltage sensors are moved to the output of the cell and the capacitor voltages are estimated. The presented method is validated by simulations for a single-phase DSCC-MMCC with two cells per cluster, only for the rated load condition. The results show that the faults can be detected and diagnosed in about 60 μs. The same method is used in [70], but it is applied in a distributed control architecture in which cell operation and FDD work independently. The reported experimental assessment uses three cells per cluster for one load condition only. The FDD takes 3 ms, which is fast enough to start a reconfiguration process. A similar approach for a TSBC configuration using a full-bridge model is presented in [71].
In conclusion, expert system-based FDD methods use deep knowledge and rule-based decision-making to detect and diagnose a fault. They are a promising approach, as several methodologies can be combined and real-time implementations are available (including fuzzy inference systems) [72]. However, the acquisition of fault knowledge can be tedious and requires a nontrivial translation into computer algorithms. Moreover, not all fault types and locations can be detected; e.g., multiple faults occurring at the same time can be difficult and laborious to describe in the FDD rule sets. Hence, the implementation of effective expert systems can be time-consuming and costly, in particular for large-scale systems (such as MMCCs with hundreds of cells). The considered expert systems also might not achieve the desired accuracy, especially for faults with similar symptoms.

B. MODEL-BASED FDD METHODS
Model-based methods compare the output of a mathematical model of the system to measured quantities of the system to generate a residual vector which acts as a fault indicator [73], [74]. The model-based FDD process is shown in Fig. 4. The residual analysis must uniquely identify a fault and must also be robust to avoid false fault detection under disturbances, noise, or transients [57]. Residual evaluation algorithms and threshold selection must find a tradeoff between detection speed and detection reliability. For most reported model-based methods, the residual evaluation is a simple threshold function or value that is evaluated using simple rules, but statistical or machine learning classifiers are also feasible [58], [75], [76].
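As an illustration of the simple threshold-based residual evaluation mentioned above, a generic formulation can be written as follows. The symbols $r$, $\hat{y}$, $J_i$, and $T_d$ are notation assumed here for illustration and are not taken from the cited works.

```latex
% Residual vector: measured output minus model output
r(t) = y(t) - \hat{y}(t)

% A fault is declared for the i-th residual if it exceeds its threshold J_i
% during the whole persistence window of length T_d
\lvert r_i(\tau) \rvert > J_i \quad \text{for all } \tau \in [t - T_d,\, t]
```

A small $J_i$ and a short $T_d$ shorten the detection time but increase the risk of false alarms under noise or transients, whereas large values have the opposite effect. This is the tradeoff between detection speed and detection reliability referred to above.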
Model-based FDD methods are widely used for power converters, as their system behavior and dynamics can be modeled with high accuracy using linear and nonlinear ordinary differential equations. State estimators based on Luenberger, extended state, or sliding mode observers and Kalman filters are the most common methods reported for FDD in MMCC cells.
Luenberger observers are state estimators specially designed for linear models, which drive the observer error $e_x := x - \hat{x}$ asymptotically to zero. In [40] and [77], a Luenberger observer is proposed to estimate the sigma and delta currents of a single-phase DSCC-MMCC to detect IGBT OCFs. The fault detection method compares the estimated state vector $\hat{x} = \hat{y} = (\hat{i}_a^{\Sigma}, \hat{i}_a^{\Delta})$ to measurements. The obtained residual vector of the sigma currents is then compared to preset threshold values, and if one or more residuals exceed their respective thresholds for more than 2 ms, a fault is detected. To locate the faulty cluster and finally the faulty switch, a set of rules based on cell operation is constructed, similar to expert system-based FDD methods (recall the monitoring conditions in [21, Appendix B]). In order to localize the faulty cell, the capacitor voltages of the detected faulty cluster are compared to their threshold values, and if one cell capacitor voltage is higher than its threshold for a certain time, the faulty cell is found. The method is validated by measurements using a single-phase MMCC prototype with six cells per cluster. The control system and the FDD method are implemented in an FPGA. Robustness against modeling uncertainties has not been addressed and is open to further research.
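The principle of observer-based fault detection can be sketched with a discrete-time Luenberger observer for a scalar toy plant. The plant parameters `a` and `b`, the observer gain, the additive fault term, and the persistence window of three samples are assumptions chosen for illustration; the sketch does not reproduce the sigma/delta current model used in [40] and [77].

```python
def run_observer(n_steps, fault_step, a=0.9, b=0.1, gain=0.5, u=1.0,
                 fault=-0.5, threshold=0.3, n_hold=3):
    """Simulate x[k+1] = a*x[k] + b*u + f[k] together with the observer
    x_hat[k+1] = a*x_hat[k] + b*u + gain*(y[k] - x_hat[k]), where y = x.
    Return the step at which |y - x_hat| has exceeded `threshold` for
    `n_hold` consecutive samples, or None if no fault is detected."""
    x = x_hat = 0.0
    count = 0
    for k in range(n_steps):
        residual = x - x_hat                      # output residual (y = x)
        count = count + 1 if abs(residual) > threshold else 0
        if count >= n_hold:
            return k
        f = fault if k >= fault_step else 0.0     # unmodeled additive fault
        x_next = a * x + b * u + f                # plant update
        x_hat = a * x_hat + b * u + gain * residual  # observer update
        x = x_next
    return None


detected = run_observer(n_steps=100, fault_step=50)
healthy = run_observer(n_steps=100, fault_step=1000)
```

In the healthy case the observer error follows $e[k+1] = (a - \text{gain})\,e[k]$ and stays at zero. After the fault, the error converges to a nonzero value, so the residual exceeds the threshold and the fault is flagged once the persistence window has elapsed.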
Extended state observers (ESOs) are used to estimate all states of a system, including extended states. ESOs are an extension of the traditional Luenberger observer, in which only the unmeasured states are estimated [78].
An FDD method using an ESO applied to a single-phase MMCC in DSCC configuration is reported in [79]. Using the dynamic model of one leg of the converter, the system states ($x_1$) and the extended states ($x_2$) are defined as the cluster currents and the voltage drop across the cluster inductor, respectively. If $x_1$ converges, then the extended state $x_2$ also converges. Moreover, the faulty cluster can be detected if the extended state and its calculated value differ. Once the faulty cluster is identified, the fault diagnosis method assumes a fault in a random switch of the affected cluster and starts to compare the extended state with its calculated value in order to localize the faulty switch. A similar approach is reported in [42]; however, the states and extended states are obtained from the sigma currents model. The method is validated by simulations only, and the results are obtained solely for a single load condition. The computation times are presumably high due to the large number of calculations required by the random switch selection.
Unlike Luenberger observers, a sliding mode observer (SMO) generates a sliding motion for the observer error $e_y = y - \hat{y}$ by utilizing a nonlinear feedback with a switching term, which forces the observation error to converge to zero in finite time and makes the observer very robust in the presence of unknown disturbances or model uncertainties. Implementations of SMOs for FDD in MMCCs are presented in [55], [80], and [81] for a single-phase MMCC in DSCC configuration. An individual fault function $f_i$ is added for each ($i$th) cell as an external input to the system, which becomes very large if an OCF occurs. The proposed SMOs are fed by the measured circulating currents and cell capacitor voltages. If the estimated and measured states differ for more than a certain time, the fault is detected. The fault diagnosis method iteratively modifies the observer dynamics by increasing the respective fault function $f_i$ of the $i$th cell until the failed cell is detected. The FDD method was validated by simulations and experiments for an MMCC with four cells per cluster. The diagnosis times are rather long, with 100 ms under full load conditions and 139 ms under light load conditions. The method cannot detect simultaneous OCFs, and it is not clear whether it works for large-scale MMCCs with a large number of cells. The article presents a robustness analysis to demonstrate functionality under parameter uncertainties and external disturbances. The same method is improved in [41] by first detecting the faulty leg using the circulating currents and then identifying the faulty switch by modifying the SMO dynamics. This improvement allows fault diagnosis in 50 ms. A similar approach is proposed in [82] for a single-phase DSBC. Two independent SMOs are implemented to estimate the circulating and output currents. The method includes a "fault tolerance" step, which compensates the cell voltages of the non-faulty cells by adjusting the capacitor voltage reference values to obtain a smooth transition after the OCF is diagnosed. Simulations show a detection time of less than 10 ms. In [83], a capacitor voltage model is used to implement an SMO for each cell of a single-phase DSCC (see the MMCC model in [21]). It can detect and diagnose OCFs and SCFs using preset thresholds. OCFs or SCFs are detected and the faulty cell is identified within 5 ms. The results are validated on an experimental setup with five cells per cluster for one load condition only. Unlike most of the other reported methods, this method can detect both OCFs and SCFs very quickly; however, it requires a large number of SMOs, which makes its implementation in a centralized controller difficult.
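The switching feedback of an SMO can be illustrated with a scalar toy example. The sketch below uses the textbook idea of low-pass filtering the switching term (the so-called equivalent output injection) to reconstruct an additive fault; it is not the iterative fault-function scheme of [55], [80], and [81]. The plant $\dot{x} = -a x + u + f$, the gain `rho`, and the filter constant `tau` are assumptions made for illustration.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def smo_fault_estimate(n_steps, fault_step, fault=2.0, a=10.0, u=1.0,
                       rho=5.0, dt=1e-4, tau=5e-3):
    """Euler simulation of the plant x' = -a*x + u + f and of the observer
    x_hat' = -a*x_hat + u + rho*sign(y - x_hat), with y = x.
    Return the low-pass-filtered switching term, which approximates f once
    the observer error is in sliding mode (requires |f| < rho)."""
    x = x_hat = 0.0
    z = 0.0                                   # filtered switching term
    for k in range(n_steps):
        f = fault if k >= fault_step else 0.0
        nu = rho * sign(x - x_hat)            # discontinuous output injection
        x, x_hat = (x + dt * (-a * x + u + f),
                    x_hat + dt * (-a * x_hat + u + nu))
        z += dt / tau * (nu - z)              # first-order low-pass filter
    return z


f_est_faulty = smo_fault_estimate(n_steps=3000, fault_step=1000)
f_est_healthy = smo_fault_estimate(n_steps=1000, fault_step=5000)
```

With the assumed values, the filtered switching term settles close to the injected fault magnitude of 2, with a small ripple caused by the chattering of the discrete-time implementation, and it stays at zero for the healthy plant. A threshold on this signal could then serve as the fault indicator.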
Kalman filters are also used as observer-based state estimators, which are able to estimate the system states in the presence of unknown disturbances, uncertainties, and measurement noise. In [39], a fault detection method using Kalman filters is reported. It was implemented for a DSCC configuration and can detect OCFs in cells. The method uses a model of the converter delta currents (for model details, see [21]). The experimental setup for its validation was composed of four cells per cluster, and OCFs are detected in less than approximately 15 ms. SCFs or capacitor faults can be neither detected nor diagnosed. A similar approach is found in [61], but it uses the average of n measurements of the dc capacitor voltages. The method is experimentally validated for a DSBC configuration. The method can successfully identify an OCF in 70 to 150 ms, depending on the operating conditions. However, it only detects OCFs in the lower switch of the chopper cell and cannot detect SCFs or cell capacitor faults.
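A Kalman filter provides not only a state estimate but also an innovation covariance, which can be used to scale the detection threshold. The following sketch applies a normalized-innovation test to a scalar random-walk model. The model, the noise variances `q` and `r`, and the limit of 9 (roughly a three-sigma bound) are illustrative assumptions and do not correspond to the delta current model of [39].

```python
def kalman_fault_step(measurements, x0, p0=1.0, q=1e-4, r=1e-2, nis_limit=9.0):
    """Scalar Kalman filter for the random-walk model x[k+1] = x[k] + w,
    z[k] = x[k] + v. Return the first step at which the normalized
    innovation squared exceeds `nis_limit`, or None."""
    x_hat, p = x0, p0
    for k, z in enumerate(measurements):
        p = p + q                      # prediction of the error variance
        innovation = z - x_hat         # measurement residual
        s = p + r                      # innovation variance
        if innovation ** 2 / s > nis_limit:
            return k                   # residual is inconsistent with the model
        gain = p / s
        x_hat += gain * innovation     # measurement update
        p = (1.0 - gain) * p
    return None


step_faulty = kalman_fault_step([1.0] * 20 + [3.0] * 5, x0=1.0)
step_healthy = kalman_fault_step([1.0] * 25, x0=1.0)
```

Because the test is normalized by the innovation variance, the effective threshold tightens as the filter converges. In this sketch the abrupt change at step 20 is flagged immediately.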
In conclusion, model-based methods, such as SMOs or Kalman filters, 1) are easy to implement and can provide 2) accurate and 3) robust estimation. However, for large-scale systems with high dimensionality or severely nonlinear behavior, the real process may not be correctly represented and, because of the high computational load, the estimation may differ from the real values even under normal conditions [84]. For MMCCs, model-based state estimation methods are widely used, and they differ mainly in robustness and estimation speed due to linear or nonlinear modeling and/or the consideration of noise and disturbances. From a performance point of view, different capabilities of FDD methods, such as the impact of sensor and actuator faults on the estimation performance and the robustness of different observers, are analyzed in [34]. According to its conclusions, there are no noticeable differences between observers and Kalman filters. Extended approaches (e.g., extended observers or extended Kalman filters (EKFs)) usually outperform the others. In particular, EKFs achieve the best performance [34]. Significant research effort is still needed to improve the detection speed and detection reliability of residual evaluation algorithms for MMCC-based systems.

C. HARDWARE-BASED FDD METHODS
Hardware-based methods are suitable for critical processes or systems, where mathematical models are not available. These methods exploit the redundancy of critical hardware components. The inputs to the redundant hardware are duplicated and their outputs are measured (by redundant sensors) and compared in order to generate a residual. For normal conditions, the residuals are close to zero, otherwise, the faulty hardware component can be detected [85]. In addition to the hardware redundancy, a voting algorithm (usually based on decision-making techniques, such as Bayesian networks, distributed systems, fault trees, or expert systems) must be implemented to analyze the residual, detect the fault, and invoke proper reconfiguration [86], [87], [88]. Hardware-based methods allow for fast and reliable FDD, resulting in an attractive alternative for MMCCs [44], [68], [89], [90], [91].
In [89], additional hardware is included to measure the voltage drop over the cluster inductance in order to detect OCFs in a DSCC (see Fig. 5). Under normal conditions, the sum of the upper and lower cluster voltages is close to the dc voltage $v_p + v_n$ and, therefore, the sum of the inductance voltages is close to zero and contains only second-order harmonics. When an OCF affects one or more cells, the sum of the cluster voltages is modified and the circulating currents increase. An OCF is detected if the rectified and filtered measurement exceeds the expected voltage. This hardware-based FDD method was validated by simulations and experiments on a single-phase DSCC with ten cells per cluster, including one spare cell. The control system is implemented on a DSP, whereas the FDD algorithm is implemented on an FPGA. The fault detection time is very short (22 μs), which leaves time to apply additional FDD methods for further fault diagnosis.
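The "rectify, filter, and compare" step can be emulated in software as a rough sketch. In [89] this function is realized with dedicated hardware; the discrete first-order filter, the signal amplitudes, the 100 Hz ripple, the 0.1 ms sampling period, and the threshold below are hypothetical values chosen for illustration.

```python
import math


def detect_ocf(v_l_sum, threshold, alpha=0.1):
    """Rectify the summed inductor voltage, smooth it with a first-order
    low-pass filter, and return the first sample index at which the
    filtered value exceeds `threshold` (None if it never does)."""
    y = 0.0
    for k, v in enumerate(v_l_sum):
        y += alpha * (abs(v) - y)      # rectification and filtering
        if y > threshold:
            return k
    return None


def inductor_voltage_sum(n, fault_step=None, offset=40.0):
    """Assumed test signal: a 5 V, 100 Hz ripple sampled every 0.1 ms,
    plus a 40 V offset once the (hypothetical) OCF has occurred."""
    return [5.0 * math.sin(2 * math.pi * 100.0 * k * 1e-4)
            + (offset if fault_step is not None and k >= fault_step else 0.0)
            for k in range(n)]


k_healthy = detect_ocf(inductor_voltage_sum(1000), threshold=15.0)
k_faulty = detect_ocf(inductor_voltage_sum(1000, fault_step=300), threshold=15.0)
```

The filter coefficient `alpha` sets the tradeoff between ripple rejection and detection delay. In this sketch the fault is flagged a few samples after it occurs.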
In [68], an additional voltage sensor is used to monitor the cell output voltage and detect OCFs in CC switches (see Fig. 6). The measured cell output voltage, the measured capacitor voltage and the switching state are used to check instantaneous output voltage and detect an OCF. The method was validated for an MMCC in DSCC configuration and exhibits the smallest detection time of around 5 μs.
In [91], another hardware-based FDD method is applied to a single-phase DSCC. It replaces the individual capacitor voltage sensors with one single voltage sensor that measures the output of a group of cells; this measurement is then used to control the converter and to develop the FDD algorithm. First, the method detects an OCF in the group of cells; it then localizes the faulty cell by analyzing which cells were not inserted during the last two control periods. The method was successfully validated by simulations. Detection time, robustness, or functionality under different operating conditions were not discussed.
In [44], three voltage sensors are implemented for a DSCC with $n = m + r$ cells, where $m$ "normal" cells and $r$ redundant cells are considered. The sensors measure 1) the full cluster output voltage, excluding the cluster inductor ("supervisory cluster sensor"), 2) the output voltage of the group of $m$ cells, and 3) the output voltage of the $r$ redundant cells ("supervisory set sensors"). To detect OCFs, the supervisory set sensors are compared to an "expected voltage" of the cluster, which is the cluster voltage obtained from the model equations of the converter. Likewise, SCFs are detected by comparing the measurements to a theoretical voltage, which is calculated as the number of activated cells multiplied by the average value of the cell output voltages within the cluster. The measured cluster voltages are also compared to the reference values, and the resulting residuals are matched against preset thresholds obtained from the analysis presented in [21]. This makes it possible to distinguish between lower switch and upper switch OCFs, OCFs in a capacitor voltage sensor, and SCFs in switches. The method was validated with an experimental setup composed of an MMCC with eight cells per cluster (including one spare redundant cell per cluster). The detection time is not reported, but the overall diagnosis time for all kinds of faults is reported to be around 5 ms.
In [90], a voltage sensor to measure the upper switch voltage of a CC is proposed. This measured voltage is then normalized by the cell capacitor voltage reference. Under normal conditions, this ratio is approximately 1 if the upper switch is open (gate signal equal to 0) and approximately 0 if the upper switch is closed (gate signal equal to 1). Implementing an AND operation between the gate signal and the normalized voltage of the upper switch of the CC allows fault identification. The method works successfully; however, the cell capacitor voltage must be observed in order to control the DSCC with only one voltage sensor per cell.
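The AND-based check described above can be written as a short sketch. The function name and the comparison level of 0.5 used to convert the normalized voltage into a logic value are assumptions for illustration; they are not specified in the text above.

```python
def upper_switch_ocf(gate, v_switch, v_cap_ref, level=0.5):
    """Flag an open-circuit fault in the upper switch: the gate commands the
    switch to close (gate = 1) while the normalized switch voltage still
    indicates a blocking switch (ratio close to 1)."""
    normalized = v_switch / v_cap_ref   # about 1 if open, about 0 if closed
    blocking = normalized > level       # assumed logic level for the AND
    return bool(gate) and blocking


healthy_on = upper_switch_ocf(gate=1, v_switch=5.0, v_cap_ref=1000.0)
healthy_off = upper_switch_ocf(gate=0, v_switch=990.0, v_cap_ref=1000.0)
faulty_on = upper_switch_ocf(gate=1, v_switch=990.0, v_cap_ref=1000.0)
```

Only the combination "commanded closed but still blocking" is flagged. Both healthy combinations yield a logic 0.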
In conclusion, hardware-based FDD strategies typically use device redundancy and/or additional (measurement) hardware to quickly compare different measured values with computed estimates and/or references. Such strategies are easy to implement and can achieve accurate and very fast results. However, the main drawbacks for MMCCs are increased costs, space requirements, and weight. Some contributions [44], [90], [91] take advantage of alternative control schemes of MMCCs, which use cluster voltage measurements instead of individual floating cell capacitor voltages [92]. In this case, fault detection can be achieved directly using cluster voltage sensors (without additional sensors); however, this represents a special case, as high-performance control of MMCCs typically uses voltage measurements in each of its cells.

D. DATA-DRIVEN/BASED METHODS
Data-based methods rely on the availability, analysis, and interpretation of historical data collected and monitored during the fault-free operation [93]. As these methods do not use models, their performance is not affected by modeling and parameter uncertainties, structural and parameter variations, and nonlinearity mismatches of the system. Therefore, they are attractive for time-varying or highly nonlinear systems, whose models entail a high uncertainty or computational burden [73].
The classic stages of data-driven FDD are illustrated in Fig. 7: 1) preprocessing (data filtering and adaptation); 2) exploratory data analysis (extraction of relevant variables and data reduction); and 3) model selection or classification functions (construction, training, and validation).
Data-based methods used for FDD in MMCCs can be classified into three groups: 1) signal processing, 2) statistical analysis, and 3) machine learning.

1) SIGNAL PROCESSING-BASED METHODS
Signal processing methods analyze the frequency spectrum of the converter measurements using 1) Fourier or 2) wavelet transforms. Most of these methods are used in the preprocessing stage (feature extraction) of the data, which are later used in a method based on machine learning.
a) Fast Fourier transform (FFT): An OCF in one cell produces unbalanced cluster voltages, generating a dc component in the cluster current, which can be detected with FFTs. Cui et al. [47] proposed a simple FFT of the output currents to detect OCFs in DSCCs, presenting simulations on a real-time simulation platform (RT-LAB). If the dc component is negative/positive, the fault is found in the negative/positive cluster, but the faulty cell or semiconductor cannot be identified. Jiao et al. [94] proposed an FFT approach to process the output voltages of an MMCC in DSCC configuration to generate a harmonic content dataset as input to a support vector machine (SVM; see Sect. III-D3).
b) Wavelet analysis: Wavelets contain information on both the frequency and the time domain, so they have been proposed to preprocess the collected measurements for use in machine learning methods (ANN and SVM; see Sect. III-D3). Lin et al. [95] proposed the use of wavelets to extract frequency characteristics and deliver them directly to an artificial neural network (ANN), reducing the size of the input layer of the network and improving the convergence speed. Wang et al. [37] also used wavelets for feature extraction, but for use in an SVM.
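The dc-component criterion of the FFT-based approach can be sketched as follows. A direct DFT evaluation of a single bin stands in for the FFT to keep the example free of external libraries; the signal amplitude, the dc offset, and the detection threshold are assumptions for illustration and are not taken from [47].

```python
import cmath
import math


def dft_bin(samples, k):
    """Return the k-th DFT coefficient of `samples`, normalized by the length."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(samples)) / n


def locate_cluster(current, dc_threshold):
    """Classify according to the sign of the dc component (bin 0)."""
    dc = dft_bin(current, 0).real
    if abs(dc) <= dc_threshold:
        return "healthy"
    return "positive cluster" if dc > 0 else "negative cluster"


n = 200  # four fundamental periods in the analysis window
healthy_current = [10.0 * math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
faulty_current = [x - 1.5 for x in healthy_current]  # assumed dc offset after an OCF

healthy_result = locate_cluster(healthy_current, dc_threshold=0.5)
faulty_result = locate_cluster(faulty_current, dc_threshold=0.5)
fundamental_amplitude = 2 * abs(dft_bin(healthy_current, 4))
```

The window should cover an integer number of fundamental periods. Otherwise, spectral leakage of the fundamental into bin 0 could be mistaken for a dc component.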

2) MULTIVARIATE STATISTICS-BASED METHODS
Methods based on statistics are broadly used in industry [96] because they are simple to implement, do not need a mathematical model, and require little knowledge of the process and faults. The historical data are used to extract fault detection and analysis patterns to identify similar fault conditions [96], [97]. The first step is the extraction of quantitative characteristics, where the statistical representation of the normal behavior of the converter is obtained through statistical variables [97]. When a fault occurs, the measurements are fed to a statistical system that evaluates the divergence from the values expected during normal operation, using one of the following five methods.
a) Correlation coefficient methods: They measure the strength of the linear relationship between variables [98], [99]. A correlation coefficient takes values between -1 and 1, where 1 indicates a strong positive relation, -1 indicates a strong negative relation, and 0 indicates no linear relation at all. The authors in [100] and [67] proposed to detect and identify OCFs in a DSCC through Pearson's correlation coefficient (the most popular), measuring the cluster voltages and comparing them to each other. The difference between the measured and estimated values, i.e., v x y , is used to generate a normalized cluster voltage error ε x y (1), which remains close to 0 for a healthy converter and reaches 1 or -1 when an OCF occurs in the upper or lower switch, respectively. The correlation coefficient is compared to a detection threshold to identify the fault. This threshold is a free parameter: making it too small can cause false positives, whereas making it too large can increase detection times.
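The general idea of a correlation-based identification can be sketched as follows. This is a simplified illustration, not the specific estimator of [100] and [67]: it assumes that one voltage waveform per cell is available, and the function name, threshold, and synthetic waveforms are hypothetical.

```python
import numpy as np


def low_correlation_cells(voltages, threshold):
    """Indices of waveforms whose mean Pearson correlation
    with all other waveforms is below the threshold."""
    r = np.corrcoef(np.asarray(voltages, dtype=float))  # one row per cell
    n = r.shape[0]
    mean_others = (r.sum(axis=1) - 1.0) / (n - 1)  # exclude self-correlation
    return np.flatnonzero(mean_others < threshold).tolist()


# Synthetic waveforms (assumed): three similar ripples and one deviating cell.
k = np.arange(200)
ripple = np.sin(2 * np.pi * k / 100)
deviating = np.cos(2 * np.pi * k / 100)
cells = [
    100.0 + 2.0 * ripple,
    100.0 + 2.1 * ripple,
    100.0 + 1.9 * ripple,
    100.0 + 2.0 * deviating,
]
print(low_correlation_cells(cells, threshold=0.5))
```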
The failed cell will have a much lower correlation with the others, while the healthy cells will have a correlation close to one with each other. The method therefore fails when more than one failure occurs at the same time, as two faulty cells could be strongly correlated and no conclusion about their state can be drawn. Experimental assessment is performed with a single-phase converter, obtaining detection and identification times of around 1.5 and 10.4 ms, respectively. The proposed correlation calculation reduces the computational burden, increasing the controller running time only from 9440 to 10 520 μs when implementing the method.
b) Least square methods: To detect a fault in a capacitor of a DSCC, Jo et al. [101] proposed the injection of a circulating current, which, in combination with the states of the switches and the cluster current, is used to estimate the current through the capacitor. Then, a recursive least squares algorithm converges to a capacitance estimate, starting from the differential equation of a capacitor. The simulated and experimental results have an estimation error of 1.3%. The method is also able to identify the change of the capacitance during steady-state operation, which is useful for predictive maintenance or state-of-health estimation.
c) Principal component analysis (PCA) methods: PCA is a statistical technique for reducing the dimensionality of a dataset, eliminating redundant information. PCA can be seen as a linear projection of the state variables onto a lower dimension, using a covariance matrix to represent the correlation between these variables. Houchati et al. [45] proposed a PCA-based method to detect OCFs in DSCCs, using the measurements of the capacitor voltages and the circulating and output currents under normal operating conditions. These datasets are normalized and a covariance matrix is constructed.
Then, the eigenvalues and eigenvectors are calculated and ordered according to the size of the eigenvalue associated with each vector. The dimensionality reduction is carried out by discarding the lowest eigenvalues and their respective vectors. Finally, two data dispersion measures are calculated (Hotelling's T² and Q) using the original data and eigenvalues, and dispersion thresholds are estimated. A fault is detected when one of the calculated dispersions (T² or Q) exceeds its predefined threshold. Simulation results show successful and faster detection in comparison with an SMO (less than 12 ms). In [94], PCA is used to reduce the set of variables of the input data for an SVM (see Section III-D3).
d) Single ring theorem (SRT) method: The SRT postulates that the eigenvalues of a sufficiently large matrix whose elements are independent and identically distributed (IID) are located in the complex plane, forming a ring with a certain structure. This structure is supported on a single ring, centered at the origin, whose radius is disrupted when a fault appears in one or more cells. Fig. 8 illustrates the eigenvalue distribution under normal and faulty conditions. In [46], an SRT method for the detection and identification of OCFs in DSCCs is proposed. The SRT is used only for fault detection; statistical outliers in the measurements are then searched for to identify the fault. Explained in a simplified way, the capacitor voltages are ordered in quartiles, and a fault is detected if the distance between the first and third quartile is greater than a threshold (1.5 times) in comparison with normal conditions. The method uses a moving average window of the last N measurements of each capacitor voltage to obtain its dc value. The values are normalized, added to uniformly distributed noise, and then grouped into a matrix X. Then, using matrix and elementwise operations, a new matrix Z is derived from X.
Then, if the converter is healthy, all the elements of X are independent and identically distributed (IID), and the eigenvalue λ of Z (cell capacitor voltages) has a probability distribution function f(λ) given in [46]. Therefore, in a healthy converter, all the eigenvalues λ scatter within an area limited by an inner circle with radius (1 − N/T ) 2 and an outer circle with a radius of 1, as N and T approach infinity. Conversely, a fault is present when the average radius of the ring crosses the threshold. The fault identification is done by comparing the entries of matrix Z (cell capacitor voltages) with the threshold obtained for normal conditions. The provided simulation and experimental results show that the detection and identification of single and multiple faults is feasible in less than 40 ms.
e) Simple statistical methods: Zhou et al. [36] proposed a method for detecting and identifying failures in a DSCC. The method is based on the variance of each cell voltage, using the last N measurements and an updated calculation of the variance. Under normal operation, the variances of the cell voltages of the same cluster are very similar. When the upper switch of a cell fails, the variance of its voltage decreases because the faulty cell has one fewer discharging period than a normal cell.
On the other hand, if the lower switch fails, the variance of the cell voltage increases because the faulty cell has one more charging period than a normal cell. Therefore, a fault is detected if the variance of a cell crosses one of the thresholds for a specified number of measurements, and the fault is attributed to the lower or upper switch depending on which threshold was violated. Experimental results show the identification of single and multiple failures within 12 to 68 ms. The computational burden is not discussed but is presumably very low. This method merges both stages (detection and identification) into one.
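The variance test described above can be sketched as follows. This is a minimal illustration under stated assumptions: the window length, both thresholds, the persistence count, and the synthetic ripple amplitudes are hypothetical, and the variance is recomputed for each window instead of being updated recursively as in [36].

```python
import numpy as np


def classify_switch_fault(voltage, window, low_thr, high_thr, persistence):
    """Return "upper", "lower", or None from sliding-window variances.

    A variance below low_thr for `persistence` consecutive windows is
    attributed to the upper switch, one above high_thr to the lower switch.
    """
    v = np.asarray(voltage, dtype=float)
    low_count = high_count = 0
    for end in range(window, v.size + 1):
        variance = np.var(v[end - window:end])  # last `window` samples
        low_count = low_count + 1 if variance < low_thr else 0
        high_count = high_count + 1 if variance > high_thr else 0
        if low_count >= persistence:
            return "upper"
        if high_count >= persistence:
            return "lower"
    return None


# Synthetic cell voltages (assumed): 100 V mean with different ripple levels.
k = np.arange(400)
ripple = np.sin(2 * np.pi * k / 100)
normal = 100.0 + 2.0 * ripple
reduced = 100.0 + 0.5 * ripple
increased = 100.0 + 4.0 * ripple

for signal in (normal, reduced, increased):
    print(classify_switch_fault(signal, 100, 1.0, 4.0, 5))
```

In practice, the two thresholds would be derived from the variances of the other cells in the same cluster, since these are very similar under normal operation.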

3) MACHINE LEARNING-BASED METHODS
Machine learning (ML) methods use large amounts of data to perform specified tasks, such as fault detection and identification. These methods are not explicitly programmed; instead, they use the output and input data from the power converter to learn how to identify and classify faults. The learning process starts with the input of large amounts of data from the converter, which can include variables from simulations or experiments during operation under normal or faulty conditions. Most of the proposed techniques use supervised learning, so the greatest computational load falls on their training. The ML methods reported for fault detection in MMCCs are a) artificial neural networks (ANNs), b) SVMs, and c) K-means clustering.
a) Artificial neural network (ANN) methods: The ANNs applied to fault detection in MMCCs can be divided into two groups: 1) multilayer perceptrons (MLPs) and 2) convolutional neural networks (CNNs). Both have a high computational demand, which restricts their implementation to a few applications with relatively small networks.
Multilayer perceptron (MLP) methods combine neural networks with other techniques to reduce the network size and the computational burden in real-time operation.
Ke et al. [102] proposed an MLP to identify OCFs in two stages. The first stage detects the faulty cell by comparing its capacitor voltage with those of the healthy cells. The second stage uses an MLP to determine the failed switch from the capacitor voltage waveform pattern of the defective cell. The experimental results present a fault detection and identification time of 0.5 s, but the MLP cannot differentiate between a failure in the lower switch and a failure in both switches.
Lin et al. [95] also proposed a two-stage method. The first stage applies wavelet analysis to the MMCC output voltage only, and the second stage uses an ANN to detect OCFs. When a fault occurs, the output voltage changes quickly, showing different, overlapping frequency components. Each component contains a wealth of fault information, so the wavelet transform is used to extract the band energy information as a fault feature vector, helping to analyze the characteristics of the signal in the time-frequency domain [103]. Then, a back-propagation ANN determines the number of defective cells and the cluster in which they are located, but it is not able to identify which cells are defective. The method is validated in simulations.
Abdelsalam et al. [48] proposed a model-based hybrid method, where the model is an ANN. An adaptive linear neuron (ADALINE), a particular type of single-layer MLP, is used to estimate the voltage of the cell capacitors. The network uses the switching signals of the cluster cells and the output voltage as inputs. A special algorithm is used to train the network, where the values of the weights of the nodes are constantly updated. When the prediction error converges to zero, due to the structure of the network and the input signals, the weights of the nodes converge to the capacitor voltages. Faults are detected when two conditions are met: 1) the estimated capacitor voltage remains sufficiently far from its nominal value for a certain time, and 2) the error between the estimated value and the measured value grows faster than a defined threshold. This technique achieves the lowest latency times among data-driven methods (≤ 5 ms) and detects OCFs and SCFs. However, the technique identifies only the faulty cells, not the exact switch. Moreover, only simulation results are presented.
Convolutional neural networks (CNNs) calculate the value of a node from an operation on a subset of nodes from the previous layer (Fig. 9). A CNN is composed of three main stages: 1) convolution layer (filtering stage), 2) subsampling or pooling (reduction of data cardinality and parameters), and 3) fully connected layer, where the layers are unwrapped and stacked in a 1-D vector. Unlike MLP methods, CNN methods usually do not require a feature extraction stage, since they directly use the measured raw data. This can be an advantage in power electronics because it eliminates a computationally demanding stage, allowing real-time monitoring and low time delay for fault detection and identification [104]. However, dedicated feature extraction stages use mature techniques and could contribute information that the network is not necessarily capable of identifying from the raw data.
Kiranyaz et al. [50] proposed a 1-D CNN (1D-CNN) in a DSCC to detect and identify OCFs in a single stage (feature extraction and classification), using the voltages of each cell and the circulating and output currents. The 1D-CNN is trained for eight fault cases (one per switch) and one normal case. This is one of the most conclusive and complete proposals in ML, presenting a large number of experiments and data, and an analysis under multiple metrics, such as the detection rate and the false positive rate. The experiments show a detection rate of 98.9% and an identification rate of 99.7%. Furthermore, as it does not require complex data processing and uses mostly linear operations on 1-D vectors, low latency times are obtained (less than 100 ms).
Qu et al. [49] proposed a CNN method to detect and identify OCFs using measurements of all the cell voltages. These voltages are normalized and grouped into a 2-D vector, which is used as the CNN input to perform the detection and identification. The method was validated for 72 simulated faults with a computation time of 20 ms, identifying the defective semiconductor with a success rate of 98.16%. Additionally, the failure can still be identified in 80.34% of the cases when there is 30% noise in the signal.
b) Support vector machine methods: SVMs are a supervised classification ML technique trained on a set of labeled data classified into different categories. The SVM builds a hyperplane from a dataset in R^n by separating the data in n vectors. Its objective is to find or approximate the boundaries between the characteristics of the data that divide the n categories.
Jiao et al. [94] proposed a hybrid method that integrates an SVM with signal processing techniques and statistical analysis. It detects and identifies OCFs in MMCCs with half-bridge cells based only on the output voltages. The data are preprocessed using an FFT, keeping only the ten main frequency components. Then, a PCA selects the three frequencies that contribute most to the variance of the data. PCA is a well-known statistics-based technique used in ML to reduce the dimensionality of data without great loss of information. The features obtained from signal processing (FFT) and statistical analysis (PCA) are then used as input to the SVM. The results show that the method detects and identifies the failed cluster and the position of the failed switch within the cell with 97% precision, but it cannot identify the individual faulty cell. Detection and identification take on average 11.01 s for a 31-level converter, while 28.42 s are needed for a 101-level converter.
Wang et al. [37] proposed another hybrid method, which identifies SCFs using only the output voltages of a modular serial/parallel multilevel converter (MMSPC), where each cell is composed of a capacitor and two full-bridges in parallel. First, a wavelet analysis is applied as feature extraction for the SVM, which also uses the switching states as input. The method manages to detect faults and to identify the faulty cell, but it fails to identify the faulty semiconductor. The simulation results show a correct detection for some test cases under ideal conditions.
c) K-means methods: K-means is a clustering technique that identifies data clusters, like an SVM, but with unknown categories of the data [105]. This method iteratively identifies the centroids of the clusters that best separate the data, using a predetermined number of clusters.
Yang et al. [106] proposed a K-means method to detect and identify OCFs in a DSCC, using data vectors corresponding to the samples of a fundamental period of each cell. The number of centroids is set equal to the number of clusters, and the K-means algorithm finds the positions of these centroids. During the converter operation, the distances between the data and the centroids are measured, and if the vector of a cell exceeds the distance threshold for all centroids, a fault is detected. Then, the algorithm is executed again, generating new centroids, which are associated with the faulty cells. The simulated results show a fault detection time of 40 ms, but there is no information about the fault identification time.
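The online distance check of such a K-means scheme can be sketched as follows. The centroids are assumed to have been found beforehand by a K-means run on fault-free data; the function name, the threshold, and the four-sample data vectors are hypothetical and only serve as an illustration.

```python
import numpy as np


def cells_far_from_all_centroids(cell_vectors, centroids, dist_threshold):
    """Indices of cells whose data vector exceeds the distance
    threshold with respect to every centroid."""
    x = np.asarray(cell_vectors, dtype=float)   # shape (cells, samples)
    c = np.asarray(centroids, dtype=float)      # shape (k, samples)
    # Euclidean distance of every cell vector to every centroid: (cells, k)
    dist = np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2)
    return np.flatnonzero(np.all(dist > dist_threshold, axis=1)).tolist()


# Assumed example: two centroids and three cell vectors of four samples each.
centroids = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
cell_vectors = [
    [1.0, 1.0, 1.1, 1.0],   # close to the first centroid
    [2.0, 2.0, 1.9, 2.0],   # close to the second centroid
    [3.5, 3.5, 3.5, 3.5],   # far from both centroids
]
print(cells_far_from_all_centroids(cell_vectors, centroids, 1.0))
```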

IV. SYSTEM RECONFIGURATION IN MMCCS
The AFTCS guarantees continuous operation even after a fault has occurred. The process involves an FDD and a fault reconfiguration stage. After a fault is detected, localized, identified, and evaluated by the FDD strategy, a fault reconfiguration strategy is implemented based on hardware and/or software redundancies as defined in Appendix A-D [41], [107]. While hardware redundancy uses more hardware (i.e., cells for MMCCs) than required, software redundancy modifies control systems and (slightly) increases processor utilization. Both methods can be applied separately or in combination. According to [53] and [40], hardware redundancy is divided into cold-redundant (or cold-reserve) mode and hot-redundant (or spinning-reserve or active redundancy) mode. Fig. 10 shows a cluster with m cells and r redundant cells. Each cell is equipped with a bypass switch with low conduction losses [108]. As shown in Fig. 10(a), in cold-redundant mode, all r redundant cells are bypassed during normal operation and their capacitors are not charged. Therefore, when a fault occurs in one cell, a redundant cell is inserted in its place after a certain start-up time. During this time, the cell capacitor is charged, leading to high inrush currents that affect transient operation and converter performance [109]. The cold-redundant mode is preferred in high-power applications, such as HVdc systems, as it avoids additional switching and conduction losses in the redundant cells [108], [110]. In contrast, in hot-redundant mode [see Fig. 10(b)], the r redundant cells are also active during normal operation in order 1) to keep their capacitors charged and 2) to prevent high circulating currents after a fault event. It is commonly utilized in low-/medium-power MMCC applications with a small number of cells [110].
Before a fault occurs, the hot-redundant mode can operate either in voltage-sharing or fixed-voltage mode [109], [111]. In voltage-sharing mode, all cells are permanently included during normal conditions, which leads to a higher number of available voltage levels, lower mean cell capacitor voltage, less component voltage stress, and lower THD [33], [111]. After a fault occurrence, the faulty cell is bypassed and the voltage reference has to be adapted at least in the faulty cluster in order to maintain symmetry. In fixed-voltage mode, the number of voltage levels is equal to the number of levels in cold-redundant mode, but all cell capacitors are being used. This is achieved by cycling the active cells with a low frequency, e.g., at 0 ≤ t < T_p: the m cells CC x y,1 to CC x y,m are active; at T_p ≤ t < 2T_p: the m cells CC x y,2 to CC x y,m+1 are active; and so on. As a consequence, the switching frequency of each cell is reduced, while the effective converter switching frequency is unaffected and cell capacitor voltage ripples are kept low [112]. Fig. 11 summarizes the reported fault reconfiguration methods, divided into redundancy on hardware and software level.
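The cyclic activation of cells in fixed-voltage mode can be illustrated with a short Python sketch. The function name is hypothetical, and the wrap-around to the first cell after the last one is an assumption about how the rotation continues ("and so on"); [112] may implement the rotation differently.

```python
def active_cells(t, t_p, m, r):
    """Return the 1-based indices of the m active cells (out of m + r)
    at time t, when the active set is shifted by one cell every t_p."""
    if t < 0 or t_p <= 0:
        raise ValueError("t must be non-negative and t_p positive")
    total = m + r
    shift = int(t // t_p) % total  # number of completed rotation steps
    return [(shift + k) % total + 1 for k in range(m)]


# Assumed example: m = 4 active cells, r = 2 redundant cells, T_p = 1.0 s.
for t in (0.0, 1.5, 3.0):
    print(t, active_cells(t, 1.0, 4, 2))
```

Over one full rotation of m + r steps, every cell is active for m steps, which is what distributes the switching effort evenly across all cells.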
Important power converter manufacturers, such as ABB, Siemens, Mitsubishi Electric, General Electric, and Hyosung Heavy Industries, have incorporated system reconfiguration capabilities into their new MMCC models. For instance, ABB has registered two patents for cold-redundant reconfiguration strategies applied to DSCCs [113], [114]. Moreover, Siemens has applied hot-redundant reconfiguration strategies in its new medium-voltage air-cooled drives Perfect Harmony GH180 [115], [116]. Other manufacturers have also listed patents on hot-redundant reconfiguration strategies for their next-generation MMCCs [117], [118], [119], [120]. This highlights the importance and constant progress of this topic.

A. COLD-REDUNDANT MODE RECONFIGURATION STRATEGIES
In [121], a cold-redundant approach for a DSCC with nearest-level control (NLC; see [122], [123]) in an HVdc application is presented. The converter consists of 28 CCs per cluster plus 2 redundant cells that run in fixed-voltage hot-redundant mode and an additional spare cell that operates in cold-redundant configuration, being bypassed during normal operation. After a fault occurrence (type not specified), the faulty cell is automatically detected and bypassed. Afterward, either a redundant cell or a spare cell is inserted. The DSCC can handle two faults before a spare cell is inserted. According to [121], the spare cells are a simple and cost-effective solution since they can be installed in a joint format instead of preparing many redundant cells in each arm. However, if a spare cell is inserted, the affected arm is unbalanced and it takes up to 200 ms until the spare cell capacitor reaches the nominal voltage. During this time, an unavoidable current transient with a maximum peak of ≈ 6.5 p.u. occurs in the affected phase and the cell voltages in the affected leg oscillate with ≈ 0.1 p.u.
During normal operation, the redundant cell is bypassed by a vacuum contactor. As soon as a fault occurs, the faulty cell is detected and bypassed, while the redundant cell is inserted simultaneously [124], [125].
In [108], a transition control method is proposed for a single-phase DSCC to overcome the large voltage and current transients that arise when a faulty cell is bypassed and a redundant cell is inserted. The method uses the measured redundant cell voltages as feed-forward compensation for the other cells when generating their voltage references. The presented approach is experimentally validated on a test bench consisting of five cells and one redundant cell per cluster. After a fault occurs in the lower cluster, the faulty cell is bypassed and the redundant cell is inserted immediately. The redundant cell capacitor is charged within ≈ 90 ms. During this transition, the phase currents and output currents are slightly distorted. Also, the circulating current increases to ≈ 1.33 p.u.

B. HOT-REDUNDANT MODE RECONFIGURATION STRATEGIES
The hot-redundant mode is the most reported reconfiguration method for MMCCs. The fault handling can be implemented using three different strategies: Strategy 1, Strategy 2, or Strategy 3 [110] (see Fig. 11). Assuming that n_f ≤ r cells of one cluster fail, all three strategies bypass the n_f faulty cells within the cluster. However, Strategy 1 also bypasses n_f cells in all the other clusters of the converter, Strategy 2 bypasses n_f cells within the other clusters of the same phase, and Strategy 3 does not bypass any other cells besides the n_f faulty cells. The latter enables maximum utilization of all healthy cells but requires more sophisticated (software) adaptations to ensure proper functionality (e.g., concerning modulation index and reference angle).
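The difference between the three strategies can be made concrete with a small Python sketch that counts the bypassed cells per cluster. The cluster labels (phase, "upper"/"lower"), the function name, and the three-phase DSCC example with six clusters are assumptions made for illustration.

```python
def bypass_plan(strategy, faulty_cluster, n_f, clusters):
    """Map every cluster to the number of cells bypassed after n_f cells
    of `faulty_cluster` have failed, for Strategy 1, 2, or 3."""
    if strategy not in (1, 2, 3):
        raise ValueError("strategy must be 1, 2, or 3")
    faulty_phase = faulty_cluster[0]
    plan = {}
    for cluster in clusters:
        if cluster == faulty_cluster or strategy == 1:
            plan[cluster] = n_f      # faulty cluster, or all clusters
        elif strategy == 2 and cluster[0] == faulty_phase:
            plan[cluster] = n_f      # other cluster(s) of the same phase
        else:
            plan[cluster] = 0        # healthy cells remain in use
    return plan


# Assumed three-phase DSCC: one upper and one lower cluster per phase.
CLUSTERS = [(p, s) for p in ("a", "b", "c") for s in ("upper", "lower")]

for strategy in (1, 2, 3):
    plan = bypass_plan(strategy, ("a", "upper"), 1, CLUSTERS)
    print(strategy, sum(plan.values()))
```

For a single faulty cell in this example, the total number of bypassed cells is six, two, and one for Strategy 1, 2, and 3, respectively, which reflects the increasing utilization of healthy cells.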
In [126], a hot-redundant modulation adaptation is presented for a single-phase DSCC which employs Strategy 2. It modifies a conventional level-shifted pulsewidth modulation (LS-PWM; see [127], [128], [129], [130]) using a self-generated staircase signal, which is modified according to the number of active cells, so that only one carrier signal is needed. The method is experimentally validated on a test bench with four cells per cluster. The DSCC works in voltage-sharing hot-redundant mode under normal conditions. After a fault, the faulty cell and a healthy cell in the complementary cluster (Strategy 2) are bypassed. Thus, the number of levels of the staircase signal, and hence of converter voltage levels, decreases. Also, the voltage stress per cell is increased. This overstress per cell decreases as the number of cells in the converter increases.
Two similar approaches, which utilize fixed-voltage hot-redundant mode with Strategy 3 and phase-shifted pulsewidth modulation (PS-PWM; see [127], [128], [129], [130]), are presented for a single-phase DSCC in [51] and three-phase DSCC in and [53]. A maximum of m cells, out of m + r cells, per cluster must operate at each sampling instant, and an even power distribution over all m + r cells is desirable. The methods propose the use of m carrier signals, which are rotated through the cells by a so-called rotating sliding choice box at fundamental frequency [51]. This box consists of m + r sectors and each of them maps one carrier. When a fault occurs, both approaches permanently exclude the sector that bypasses the faulty cell. In [53], simulation results validate fault detection and fault isolation, excluding the corresponding sector from the rotating process, within 400 μs. In [51], experimental validation with a higher rotation frequency is implemented, leading to lower cell voltage deviation within a cluster and smaller current transients during a fault, but increasing switching losses and introducing mismatch pulses caused by switching dead times during sector transition.
To overcome the drawback of undesired switching while utilizing fixed-voltage hot-redundant mode (Strategy 3), [131] proposes a modified PS-PWM for DSCCs. This method uses m + r carriers for the m + r cells. As soon as a fault occurs and is detected, the offset and amplitude of the carriers are adapted according to the remaining number of redundant cells, avoiding rotation. The method was successfully validated through experimental assessment; however, the capacitor ripple is increased to 1.3 p.u. Further research and comparisons are needed considering (control) performance and THD induced by the modulation scheme.
Similarly, a fixed-voltage hot-redundant mode (Strategy 3) is achieved by a modified modulation in [40]. In contrast to [131], the cluster reference signal is modified instead of the carrier signals. The cluster reference signal is restricted by the factor m/(m + r − n_f), where n_f is the number of faulty cells in the corresponding cluster. After a fault occurrence, the faulty cell is bypassed as soon as it is localized. As n_f grows, the restriction on the reference signal is relaxed, leading to a larger utilization of the remaining cells in the cluster. The proposed method is validated experimentally on a single-phase DSCC. It can be seen that the cell voltage ripples and sigma current increase significantly. The proposed method performs properly only if the number of faults is known; therefore, a reliable fault detection method is crucial.
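A short numerical example may help to see how this restriction factor evolves with the number of faulty cells. The function name and the values m = 4 and r = 2 are assumptions chosen for illustration and do not correspond to the test bench in [40].

```python
def reference_restriction(m, r, n_f):
    """Restriction factor m / (m + r - n_f) applied to the cluster
    reference signal when n_f of the m + r cells are faulty."""
    if not 0 <= n_f <= r:
        raise ValueError("n_f must lie between 0 and r")
    return m / (m + r - n_f)


# Assumed cluster with m = 4 cells and r = 2 redundant cells.
for n_f in range(3):
    print(n_f, round(reference_restriction(4, 2, n_f), 3))
```

With these values, the factor grows from about 0.667 without faults to 0.8 after one fault and to 1.0 once both redundant cells are used up, i.e., the remaining cells are utilized more heavily after each fault.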
The NLC is also proposed in [132] as an alternative to avoid complex multiple carrier coordination (PS-PWM) or control modification in fixed-voltage hot-redundant mode (Strategy 3). The method is validated through simulation; however, high circulating currents of 7.5 p.u. arise when the method is activated 100 ms after the fault occurrence. High-frequency voltage sorting algorithms are not considered.
In [30], voltage-sharing hot-redundant mode (Strategy 3) for a DSCC with PS-PWM is presented. During normal operation, all cells are involved and provide at least m + r + 1 voltage levels, and the cell voltage reference is equal to v_DC/(m + r). As soon as a fault occurs, the faulty cell is bypassed, the cell voltage reference in the affected phase increases to v_DC/(m + r − 1), the number of carriers is reduced by 1, and the carrier phase angle, carrier frequency, and cluster voltage references are adapted by the factor (m + r)/(m + r − 1) only in the faulty cluster. If additional cells fail, the controller incorporates a neutral-shift method that modifies the cluster voltage references by adapting the modulation indices and phase angles, enabling the converter to keep operating even if more than r cells fail. The operation is limited by the maximum rated cell capacitor voltage. The method is successfully validated through experimental results.
In [54] and [32], adaptive energy-balancing control approaches for voltage-sharing hot-redundant mode are presented. The method continuously adapts the cell voltage references in the faulty cluster to achieve the same overall cluster energy as in the healthy clusters, even under asymmetrical operation. The adaptive reference voltage is obtained by mathematical derivation.
In [133], a mitigation algorithm is proposed for eliminating the fundamental-frequency component of the circulating currents that is produced by the asymmetric operation of a voltage-sharing hot-redundant mode after a fault. Simulation and experimental results validate the approach. When a fault occurs, the faulty cell is bypassed and the emerging fundamental component of the circulating current is suppressed by the controller. The ac output current and the cell capacitor voltage ripple remain unaffected although the energies of the faulty cluster and healthy cluster differ. However, no detailed information is provided on whether an overall energy balancing control is implemented.

C. RECONFIGURATION STRATEGIES WITHOUT REDUNDANT CELLS
If no redundant cells are available, e.g., due to cost or space restrictions, only software redundancy is feasible. To guarantee full functionality in case of a fault, all cells must be oversized since voltage stress will increase during faults. Otherwise, the system must be derated during post-fault operation. Simple fault-tolerant controllers can be prepared to be reconfigured against a group of known faults, while sophisticated fault-tolerant controllers adapt their operation in case of arbitrary faults.
In [134], a sophisticated fault-tolerant controller for a DSCC with PD-PWM is proposed. The method is separated into three steps. The first step includes FDD and bypasses the n_f faulty cells and a healthy cell in the complementary cluster. Afterward, the cell capacitor voltage reference is increased by the factor m/(m − n_f). Finally, the cluster voltage reference in the faulty phase is limited by the factor (m − n_f)/m. The simulation results support the approach; however, large oscillations of up to 20 % occur immediately after a fault and decay slowly within 4 s. The reduced utilization is disadvantageous since a healthy cell must be bypassed. Moreover, the THD of the ac output voltage increases slightly, e.g., from 10.8 % to 13.5 % if two cells out of ten fail.
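The two scaling factors can be evaluated for the case of two faulty cells out of ten mentioned above. The function name is hypothetical, and the sketch only reproduces the arithmetic of the factors, not the controller of [134].

```python
def post_fault_factors(m, n_f):
    """Return (cell voltage reference gain, cluster reference limit)
    for a cluster of m cells with n_f faulty cells and no redundancy."""
    if not 0 <= n_f < m:
        raise ValueError("n_f must satisfy 0 <= n_f < m")
    cell_gain = m / (m - n_f)        # increase of the cell capacitor voltage reference
    cluster_limit = (m - n_f) / m    # limitation of the cluster voltage reference
    return cell_gain, cluster_limit


# Two faulty cells out of ten, as in the THD example.
cell_gain, cluster_limit = post_fault_factors(10, 2)
print(cell_gain, cluster_limit)
```

For m = 10 and n_f = 2, the cell capacitor voltage reference rises by a factor of 1.25 while the cluster voltage reference is limited to 0.8 of its pre-fault value, which quantifies the oversizing or derating discussed at the beginning of this section.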
In [135], a modified and fault-tolerant space vector modulation (SVM) strategy based on symmetrical phase voltages for circulating current suppression is proposed and analyzed for MMCCs. It provides a balanced line-to-line voltage even if a faulty cell is bypassed. In order to keep the voltages balanced, the number of cells in the upper and lower cluster must be equal and, therefore, a healthy cell in the complementary cluster also needs to be bypassed. The modified SVM strategy reshapes the space vector diagram after a fault event, from which the modified duty cycles for each voltage reference can be estimated. The approach is successfully validated through experimental results, and the converter can operate after a fault; however, a larger THD of the ac output voltages is generated.
In [136], [137], and [138], the injection of a fundamental-frequency zero-sequence component, i.e., the neutral-shifting method, is proposed to avoid higher cell voltage references, reducing the cell stress. The faulty cluster is clamped to its maximum voltage (i.e., v a ref = where v DC is considered the maximum voltage during normal operation). The same fundamental-frequency offset voltage is injected in the remaining phases, leading to zero-sequence injection. The method is validated for a DSCC by simulations and, in [138], also by experimental results. The methods lead to balanced line-to-line voltages and currents although the cluster voltages are unbalanced. The cell capacitor voltages remain unaffected (voltage ripples are negligible).

D. RECONFIGURATION STRATEGIES FOR FAULTS ON CLUSTER LEVEL
Besides the previously discussed faults on cell level (such as OCFs or SCFs in switches or passive components), there exist a few publications which investigate the impact of cluster faults on MMCCs in DSCC [139] or Triple Star Bridge Cell (TSBC) configuration [140], [141]. A cluster fault can be caused, e.g., by multiple cell faults within one cluster, a mechanical connection failure or a control unit failure [141].
In [139], several serious cluster faults are discussed for DSCCs. The proposed fault-tolerant control method is capable of handling single cluster faults under light loading conditions. The method can 1) maintain the desired balanced three-phase ac output voltages, 2) achieve energy-balancing by keeping the dc power of each leg close to 0 W, and 3) suppress a fundamental ac current component in the dc-terminal currents. These three constraints guarantee a stable operation of the converter. However, the absence of one cluster reduces the maximum output power to 28.9 % of the rated power due to higher leg currents, increased modulation indices and larger cell voltage fluctuation. To overcome this drawback, the cell capacitances have to be doubled and redundant cells should be inserted in the remaining operating branches.
A cluster fault in TSBCs has less impact on the nominal power reduction, since nine branches are involved during normal operation. A fault-tolerant control strategy is presented in [140] which reconfigures the converter to a Hexverter in case of a cluster fault. The proposed controller uses the four controllable circulating currents of the TSBC to regulate three branch currents to 0 A in order to form a Hexverter.
The main drawback of Hexverters is that they require larger branch currents and large common-mode voltages when a reactive power difference between both ac terminals exists. Also, two healthy branches must be bypassed, since only six branches are required for a Hexverter. Therefore, additional redundant or oversized cells are required.
In [141], a nonlinear optimization with three linear constraints is used to determine eight adapted cluster current references when one cluster fails in a TSBC configuration. The formulated objective function is used to limit the branch current magnitude and to obtain an optimized branch current configuration. The postfault operation leads to slightly increased cluster current and cell voltage ripples. Compared to a postfault Hexverter reconfiguration, it shows better performance, with peak-to-peak cluster currents reduced by 17 % and the cell capacitor voltage ripples reduced by 40 %.

V. CONCLUSION
This article provides a comprehensive study and discussion of FDD methods and system reconfiguration of MMCCs.
As its major contribution, it summarizes, compares, and describes the most relevant FDD methods applied to MMCCs in the literature, which are grouped as expert-system-based, model-based, hardware-based, and data-based methods. Table 1 collects this information in compact form, highlighting the time required for fault detection and diagnosis, how the fault mode is detected, and which methods are used for the detection.
Expert systems have been implemented in MMCCs as FDD methods. They can detect a fault rapidly, but they are effective only for a group of predefined faults. Also, in more complex cases, it can be difficult to translate human understanding and decisions into a robust algorithm.
The reviewed hardware-based methods are reliable and can identify a fault very fast; however, they are more expensive, especially for MMCCs with hundreds of cells, and complicate the hardware wiring and structure. Furthermore, the FDD process is based on simple threshold comparisons and therefore yields only little information about the fault mode.
Model-based and data-driven methods are the most widely used. The most commonly used quantities for FDD are the circulating currents and capacitor voltages due to their intrinsic impact on the distorted MMCC behavior in case of faults (e.g., OCFs or SCFs in cells). While model-based methods are (very) accurate in predicting a fault, the MMCC models typically have a large number of states, and such complex models may not be executable in real time.
Data-driven methods are an attractive technique to analyze data and rapidly obtain information about a fault. They might be faster than model-based FDD, in particular for large systems, but still require several milliseconds to detect a fault. Nevertheless, data-driven methods are an interesting and probably more flexible tool to approach FDD from a wider and more complex perspective. Also, most data-driven methods can identify a fault and its type, but they fail to locate the exact faulty switch or cell. The implementation of efficient data-driven methods in real-time applications represents the major challenge in the current state of the art for FDD of MMCCs.
In the opinion of the authors, combinations of qualitative and quantitative methods, each acting on a different time scale, seem to be the future direction for fast and reliable FDD. At the moment, none of the reported FDD methods can comprehensively solve the FDD problem on its own. Moreover, such a combined approach will enable real-time aging calculations to take preventive actions before a fault occurs, which is a topic rarely addressed in the literature.
This article also provides a state-of-the-art discussion of hardware and software reconfiguration strategies. Dominant system reconfiguration strategies for MMCCs were discussed, highlighting cold- and hot-redundant mode reconfiguration as the most promising techniques; reconfiguration without redundant cells and reconfiguration for faults on cluster level were also addressed. Most approaches can be traced back to hardware redundancy, and they differ only slightly in the modulation technique and in how the transition from normal to postfault operation is handled and implemented. This transition is still heavily under research: most linear controllers fail to properly regulate transient voltages or currents from prefault to postfault operation. Nonlinear controllers seem to be more suitable for this task, but their study in the literature is still at an early stage. The considered application is crucial for the choice of the reconfiguration method, which depends on the number of cells per cluster, the admissible system costs, and the specified volume. While a cold-redundant mode seems superior in applications with hundreds of cells due to its simplicity and easy implementation, a hot-redundant mode seems preferable in applications with a low number of cells.
This article has covered a very complex and very recent research area. Future challenges include the analysis of additional MMCC configurations; the inclusion of SCFs and capacitor aging/faults in FDD methods; the identification of multiple faults; the integration of different FDD methods for detecting faults on different time scales; robustness analysis of the methods against parameter or modeling uncertainties, for different load conditions, and in the presence of noise, disturbances, communication delays, and grid frequency variations; and the modification of FDD algorithms for real-time implementation. Furthermore, as a complement to FDD methods, condition monitoring and predictive maintenance are foreseen as interesting techniques to be applied to MMCCs, where similar methods can be used. Overcoming these challenges is essential to evaluate the feasibility and applicability of these methods in large-scale industrial MMCCs.

APPENDIX A TERMINOLOGY
Herein, the relevant terminology used throughout the article is collected and defined.

A. FAULT ISOLATION
Fault isolation, also called fault localization, is performed after a fault is detected. As part of FDD, it involves actions which allow a particular fault to be distinguished and isolated from others [142].

B. FAULT TOLERANCE
Fault tolerance is a property of a system which allows the system to continue its operation in the event of a fault in one or more of its components or subsystems [24].

C. MTTF/MTBF
Mean time to failure (MTTF) describes the expected time until a nonrepairable component or system fails [24]. Mean time between failures (MTBF) is the estimated time between two failures. This average time excludes the time required for repair and any time additionally induced by downing events, such as inspections or preventive maintenance. It only takes into account the time period when a system is available and operating.
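To illustrate how cell redundancy can affect the MTTF, the following minimal sketch evaluates the textbook MTTF of a k-out-of-n group. It rests on strong simplifying assumptions that are not taken from the cited references: all cells are identical and fail independently with a constant failure rate, all n cells are in operation (hot redundancy), and no repair takes place. The function name and the numerical values are hypothetical.

```python
def mttf_k_out_of_n(k, n, failure_rate):
    """MTTF of a group that works as long as at least k of n units work.

    Assumes identical, independent units with a constant failure rate
    (exponential lifetimes), all units active, and no repair:
    MTTF = (1 / failure_rate) * sum_{i=k}^{n} 1 / i.
    """
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    if failure_rate <= 0.0:
        raise ValueError("failure_rate must be positive")
    return sum(1.0 / (i * failure_rate) for i in range(k, n + 1))


if __name__ == "__main__":
    rate = 1.0e-6  # hypothetical cell failure rate in failures per hour
    without_redundancy = mttf_k_out_of_n(10, 10, rate)
    with_two_spares = mttf_k_out_of_n(10, 12, rate)
    print("MTTF ratio with two redundant cells:", with_two_spares / without_redundancy)
```

Under these assumptions, a cluster without redundancy behaves like a series system, and every additional redundant cell adds one term to the sum.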

D. REDUNDANCY
Redundancy is the provision of additional (functional) capabilities of software and/or hardware which are not necessary in a fault-free environment [143]. In [107], redundancy is categorized into hardware and software redundancy.
Hardware redundancy refers to the use of (additional) hardware components to provide fault tolerance and full functionality after a fault occurs. It can be either static or dynamic. For static hardware redundancy, redundant components are used to mask the effect of faulty components; e.g., several (hardware) modules perform the identical task at the same time, so that, if one of the modules fails, its impact is masked by the other healthy modules. For dynamic hardware redundancy, the redundant hardware components are activated after successful FDD. This action is called system reconfiguration.
Software redundancy enables a system to be robust against predefined faults. System degradation is expected if an unknown fault occurs; usually, no additional hardware redundancy is required. According to [107], software redundancy can be separated into passive and active software redundancy. While passive redundancy is achieved simply by robust controller design (no FDD is required at all), active strategies are implemented in addition to FDD.

E. ERROR
An error represents an invalid state of the system. The error itself is the manifestation of a fault at component, subsystem, or system level. An error is the symptom of a fault, meaning that a fault may result in one or multiple errors [144].

F. RELIABILITY
Reliability is an indicator which gives the probability that a component, subsystem, or system performs as specified for a certain period of time and under defined conditions [145].

G. SYSTEM RECONFIGURATION
After fault diagnosis, in particular after the fault has been localized or isolated, system reconfiguration is the fault-tolerant approach which switches off the faulty component(s) in order to guarantee further operation and functionality of the system using the remaining healthy or redundant hardware components only [146].