Faults in Modular Multilevel Cascade Converters—Part I: Reliability, Failure Mechanisms, and Fault Impact Analysis

Modular multilevel cascade converters (MMCCs) are considered a promising power electronics topology in industry. Their scalability makes it possible to reach (ultra/very) high voltage levels with low harmonic content and high efficiency, which makes MMCCs an ideal solution for high-power applications such as electrical drives, solid-state transformers, and high-voltage direct-current (HVdc) transmission systems. However, the high levels of thermal, electrical, and mechanical stress on the power electronics devices and the large number of components (e.g., capacitors or semiconductors) make MMCCs prone to faults, which reduces their reliability. This first part of the article presents a comprehensive overview of the reliability of MMCCs, failure mechanisms, and fault impact analysis in MMCCs, including failure rates and fault modes. It also presents a set of tables that collect the information needed to detect and identify faults in MMCCs.


I. INTRODUCTION
Modular multilevel cascade converters (MMCCs) are part of the family of multilevel converters [1]. MMCCs are suitable for medium- and high-power applications such as energy generation systems, HVdc transmission systems, high-voltage power factor compensation, and large-scale alternating current (ac) drives. Due to their intrinsic scalability and modularity, MMCCs exhibit very low harmonic voltage distortion and are capable of working at (very/ultra) high voltage and power levels with high efficiency. For instance, in HVdc applications, MMCCs reach voltage/power levels of 400 kV/1000 MW [2]. Due to the series connection of power modules, the power rating of MMCCs is only limited by the admissible currents in the power electronic devices; however, the exploitation of the MMCC's current scalability, based on parallel connections of power modules, is currently under research [3]. The modularity and scalability of MMCCs make them an attractive choice for high-power applications that require bidirectional ac/dc, dc/ac, or even direct ac/ac power conversion. At present, MMCCs are changing the industry, and their market penetration is increasing, in particular for high voltage/power applications [4].
MMCC topologies have been classified by Akagi [5] based on their internal connection, the number of input and output ports, and the cell types used. A comprehensive overview of all feasible MMCC configurations and their resulting converter topologies can be found in [5] and [6]. Reasonable choices for the cell topology are either chopper cells (CCs) for ac/dc or dc/ac energy conversion, or bridge cells (BCs) for direct ac/ac conversion; however, further cell topologies that require more electrical components have been investigated and compared with each other [7]. The input and output voltages of MMCCs are distributed among all cells of the converter, which makes it possible to reach (almost) arbitrarily high voltage levels [8].
Two topologies, or configurations, of MMCCs have been extensively studied and reported in recent years: the double-star chopper-cell (DSCC) [9], [10], [11], [12] and the triple-star bridge-cell (TSBC) configuration [13], [14], [15], [16]. These topologies are the focus of this article; however, the conclusions regarding reliability and failure mechanisms can be extrapolated to well-established MMCCs, such as the cascaded H-bridge (CHB) converter, or to newer configurations based on floating capacitors, such as the single-star bridge-cell (SSBC) configuration for reactive power compensation or the Hexverter. The DSCC configuration possesses five terminals and is used for ac/dc and dc/ac conversion. Its CCs allow for low power losses. From a hardware and control point of view, the DSCC configuration has a lower complexity than the TSBC configuration, but it does not allow for bipolar operation and is not capable of blocking dc faults [17]. The DSCC configuration is predominantly used for high-power ac drives [18], [19], [20], grid applications (e.g., in HVdc [10], [21], [22]), or back-to-back converter topologies [11]. The TSBC configuration possesses six terminals and is used for direct three-phase to three-phase ac/ac energy conversion. The use of BCs allows for moderate power losses. The complexity of this topology is higher than that of the DSCC configuration, but it enables bipolar operation [17]. The TSBC topology is mainly used in high-power ac drive applications [23], [24], [25]; particularly in cement or steel mill drives [6], wind energy conversion systems [26], railway power conditioning [27], grid-connected hybrid generation units (combination of solar and wind energy sources) [28], or applications incorporating battery energy storage systems (BESS) [29].
Depending on the application, the required power range, and the environmental conditions, MMCCs have different reliability requirements [30]. For instance, MMCCs used in aircraft applications [31] must meet very strict safety and reliability standards, since it is very difficult or even impossible to perform maintenance or to repair components during flight [32]. Moreover, a fault in this kind of application could be fatal to passengers. Reliability probabilities of MMCCs are based on the empirical failure rates of each individual component and subsystem over a specified time. Based on the MMCC topology and its components, a reliability model can be obtained that combines fault probability with fault propagation analysis to predict the mean time between failures (MTBF) [33].
Several experimental datasets, surveys, and studies have concluded that the components with the highest failure rates in a power converter are the controllable power semiconductors (switches) and capacitors [30]. They contribute more than 50% of the faults in power electronics systems [34]. The main causes of faults in these components are their exposure to thermal, electrical, and mechanical stress, e.g., due to long-term and/or frequent transient overload conditions or harsh environmental conditions, such as extreme temperatures, moisture, mechanical vibration, chemical erosion, radiation, and lightning [35]. A study on reliability, conducted by the "Zentralverband Elektrotechnik (ZVEI)" for automotive electrical/electronic modules, shows that 55% of the faults in power devices are due to exposure to extreme temperatures caused by steady-state overloading or transient over-current/over-voltage [36]. Thus, poor cooling designs, poor maintenance, and manufacturing inaccuracies impact reliability, leading to 1) short-circuit faults (SCFs); 2) open-circuit faults (OCFs); and 3) premature aging of switches and capacitors in the converter. (For a list of definitions of the terminology used, see Appendix A.)
This article provides a comprehensive review and discussion of the state of the art and a future perspective on the reliability of MMCCs. The discussion focuses on the analysis of the dominant factors that usually lead to open- and short-circuit faults in the cells and on the impact of these fault modes on the converter operation. This work provides a guideline for future research in the field of reliability and fault tolerance of MMCCs, stimulating novel and interdisciplinary research activities.
The remainder of this article is organized as follows. Section II briefly summarizes the modeling and operation principles of the two most commonly used modular multilevel cascade converters. Section III presents the failure rate and reliability calculation for MMCCs. Section IV analyzes the fault modes and failure mechanisms that typically occur in the most critical components of MMCCs. Section V summarizes the impact of a single failure on the overall operation of the MMCC. Finally, Section VI concludes this article.

II. MODULAR MULTILEVEL CASCADE CONVERTERS (MMCCs)
Fig. 1 illustrates the DSCC and TSBC configurations. Each cluster is formed by m cells, chopper cells (CCs) in the DSCC or bridge cells (BCs) in the TSBC, connected in series to achieve m + 1 voltage levels for DSCC configurations and 2m + 1 levels for TSBC configurations. Classical modulation techniques such as nearest level control (NLC), level-shifted (LS-) PWM, or phase-shifted (PS-) PWM are modified in [37], [38], and [39], respectively, to obtain 2m + 1 output voltage levels in DSCCs. In high-power (e.g., HVdc) applications, more than 200 cells are used in a single cluster [40], [41], which increases the probability of faults in a cell or cluster but also increases the redundancy of the converter, so that the MMCC can be reconfigured after a fault, improving its availability.
Fig. 2 shows the model of a bridge cell (BC), typically used in the TSBC. The term $\mathrm{BC}^{x}_{y,k}$ represents the kth cell, with k ∈ {1, . . . , m}, of the cluster connected to ac port 1, x ∈ {a, b, c}, and ac port 2, y ∈ {u, v, w}. For a DSCC, the notation is the same, i.e., $\mathrm{CC}^{x}_{y,k}$, but with y ∈ {p, n}. The current flowing into the cell capacitor is given by (1), and the cell output voltage $v^{x}_{y,k}$ is obtained similarly. In Appendix B, all possible current paths and cell output voltages of a fault-free CC and BC are listed and illustrated. This information is very useful for controlling the cell balancing and for identifying abnormal current paths, which makes it possible to identify a failed cell. Equations (2) and (3) represent the complete dynamic models of the DSCC and TSBC, respectively. These models can be used when MIMO or nonlinear controllers are implemented [42], which is particularly important when fault identification or converter reconfiguration is needed. However, standard decoupled SISO controllers are typically implemented in the literature, and a decoupling procedure is required [43], [44] to obtain a decoupled model in the αβ reference frame. A decoupled model is simple to control; however, it requires the assumption of balanced parameters and balanced operation. When these assumptions are not valid, a coupled model is preferred instead.
The major challenge of MMCCs is to keep their floating capacitors balanced at the same reference voltage level, i.e., $v^{x}_{c,y,k} \approx v_{c,\mathrm{ref}}$ ∀k ∈ {1, . . . , m}. To achieve this balance, the power flow of each cluster must be controlled. Six or nine independent power flow terms can be obtained for the DSCC or TSBC, respectively [45], [46]. These power terms are carefully regulated to control the input and output currents as well as the internal circulating currents, which are used to balance the voltages among all floating capacitors. First, the total energy stored in the converter is controlled through the average cell capacitor voltage. Once the converter reaches its required total energy, this energy must be distributed and balanced among clusters (intercluster control) and cells (intracluster control) (see [47], [48] for DSCC, and [47] for TSBC). Sorting algorithms are commonly implemented for intracluster control [49], [50].
Alternatively, to avoid the high computational burden of sorting, the typical carrier-based level-shifted or phase-shifted pulse width modulation schemes (i.e., LS-PWM or PS-PWM) can be modified to naturally balance the voltages among the cells [49], [50], [51], [52], [53], [54], [55].
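The sorting-based intracluster balancing mentioned above can be illustrated with a minimal sketch. It is not the specific algorithm of [49] or [50]; it only shows the common idea of such schemes under stated assumptions: the modulator has already requested `n_insert` cells, a positive cluster current is assumed to charge every inserted capacitor, and the function name and sign convention are hypothetical.

```python
from typing import List, Sequence


def select_cells(cap_voltages: Sequence[float], n_insert: int,
                 cluster_current: float) -> List[int]:
    """Return the indices of the cells to insert, in ascending index order.

    Assumption: a positive cluster current charges inserted capacitors,
    a negative one discharges them.
    """
    m = len(cap_voltages)
    if not 0 <= n_insert <= m:
        raise ValueError("n_insert must lie between 0 and the number of cells")
    charging = cluster_current >= 0.0
    # Charging: insert the cells with the lowest capacitor voltages first.
    # Discharging: insert the cells with the highest capacitor voltages first.
    order = sorted(range(m), key=lambda k: cap_voltages[k], reverse=not charging)
    return sorted(order[:n_insert])


if __name__ == "__main__":
    voltages = [1010.0, 990.0, 1005.0, 980.0]  # illustrative values only
    print(select_cells(voltages, 2, +50.0))    # cells with the lowest voltages
    print(select_cells(voltages, 2, -50.0))    # cells with the highest voltages
```

Sorting all m voltages in every control period is what causes the computational burden mentioned in the text, which is why modified carrier-based schemes are attractive for clusters with many cells.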

III. RELIABILITY OF MMCCS
Reliability assessment has become one of the most relevant topics in the design and operation of power electronics systems [30]. Most malfunctions in power electronics systems result from neglecting reliability considerations during the design phase [56]. Reliability modeling is a fundamental tool to 1) schedule maintenance intervals; 2) repair and/or replace faulty system components [57]; and 3) provide predictions to support decision making, e.g., on investments [58]. From the design point of view, reliability models are important to design reliable power electronics systems [59] and to evaluate, in advance, the impact of faults and failures on the control system and the operating conditions of the converter during its lifetime [60], [61]. Reliability and/or availability models make it possible to analyze and define different reliability metrics such as the "mean time to failure" (MTTF) and the "mean time between failures" (MTBF) [35], [56]. MTTF and MTBF vary depending on the operating conditions and stress levels of the system [62]. Most important for these metrics are the failure rate λ(t) and the reliability function R(t). Note that MTBF(t) = 1/λ(t).

A. FAILURE RATE AND RELIABILITY FUNCTION
1) CONSTANT FAILURE RATE MODEL
Failure rate models of an electrical/electronic component, subsystem, or system are usually obtained from the statistical results of hundreds of stress tests performed under a wide range of stress conditions. This approach is known as the constant failure rate model. The failure rate approximation and the reliability for a time interval i are usually obtained using the Weibull density function (WDF), with the Weibull slope $\beta_i$ and the parameters $\eta_i$ and $\gamma_i$, which define the shape, scale (dispersion), and location (starting time, i.e., $t \geq \gamma_i$) of the failure rate, respectively [63]. The parameters of the WDF are obtained from a guide or handbook (e.g., the military handbook MIL-HDBK-217F [64] or the European FIDES Guide 2009 [65]). The statistical failure rate over the lifetime of the analyzed system or component is represented by a "bathtub curve," shown in Fig. 3 [66], which is divided into three time intervals i = 1, 2, 3, defined as follows.
1) The infant mortality period is characterized by a high initial failure rate ($\beta_1 < 1$) and results from poor design, the use of substandard components, or a lack of control during the manufacturing process [67]. The shape of this time period is usually estimated by exposing the system to accelerated stress conditions to predict infant mortality failures [68].
2) The second interval is called the useful life period. It is essentially characterized by a constant failure rate ($\beta_2 = 1$) and dominated by unusual failures. The system and its components are affected by stress levels that exceed the nominal values, and this kind of failure can appear even if the equipment has been subjected to regular maintenance [69].
3) The third period is called the wear-out period ($\beta_3 > 1$), during which the components degrade rapidly, e.g., due to aging, wear, accumulated corrosion, micro fractures, and deformation [70]. This period is characterized by the remaining useful life (RUL), which can be predicted using analytical models, empirical knowledge, and hybrid-based methodologies [71].
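The three regions of the bathtub curve can be reproduced numerically with a short sketch. It assumes the standard three-parameter Weibull forms, failure rate λ(t) = (β/η)((t − γ)/η)^(β − 1) and reliability R(t) = exp(−((t − γ)/η)^β) for t > γ, which may be written differently from the article's own expressions. The function names and all parameter values are illustrative assumptions, not data from the article.

```python
import math


def weibull_failure_rate(t: float, beta: float, eta: float, gamma: float = 0.0) -> float:
    """Failure (hazard) rate of the three-parameter Weibull model for t > gamma."""
    if t <= gamma:
        raise ValueError("the failure rate is only evaluated for t > gamma")
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1.0)


def weibull_reliability(t: float, beta: float, eta: float, gamma: float = 0.0) -> float:
    """Reliability R(t); no failures are assumed before the location time gamma."""
    if t <= gamma:
        return 1.0
    return math.exp(-(((t - gamma) / eta) ** beta))


if __name__ == "__main__":
    eta = 10.0  # hypothetical scale parameter, arbitrary time unit
    for label, beta in (("infant mortality", 0.5), ("useful life", 1.0), ("wear-out", 3.0)):
        rates = [weibull_failure_rate(t, beta, eta) for t in (1.0, 5.0, 9.0)]
        print(label, [round(rate, 4) for rate in rates])
```

For β = 1 the failure rate reduces to the constant 1/η, so the relation MTBF = 1/λ gives MTBF = η in that interval; β < 1 yields a decreasing rate and β > 1 an increasing one.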
The reliability function for an MMCC is usually constructed bottom-up, i.e., from component level (IGBT and capacitor) to subsystem level (gate drive and power cell) and then to system level (power cluster and the MMCC) [72]. This approach has been successfully applied to MMCCs using a combinatorial model and a Poisson process to obtain reliability and cost analyses of redundancy [73], to derive a reliability function that includes periodic preventive maintenance and redundant cells [74], and to evaluate redundancies using hybrid cell topologies, which use more semiconductors than strictly required to increase reliability [75]. Also, statistical techniques are used: 1) Latin hypercube sampling (LHS) is used to obtain statistical approximations of the parameter values of β, γ, and η; and 2) copula multivariate density/distribution probability functions are proposed to find correlation coefficients that model the statistical dependencies between the cells [76], [77], [78]. Unfortunately, only a small amount of long-term usage data is available from commissioned MMCCs, which limits confidence in their statistical failure rates. Also, constant failure rate models do not consider temperature cycling, combined environments, and manufacturer variations, which are very important for an accurate evaluation of the reliability of power electronic systems.

2) PHYSICS-OF-FAILURE (POF) MODEL
Physics-of-failure (PoF) is considered by industry and academia as the future trend for assessing reliability in power electronics systems [79]. Instead of approaching the reliability assessment based on purely empirical results, as in the constant failure rate approach, the PoF concept uses deterministic science, i.e., material science, chemistry, physics, and statistics, to develop a degradation model of a component, with the aim of predicting the root cause of failures and deriving a reliability function R(t) [80]. Failures can be classified as overstress (single destructive stress) or wear-out (cumulative stress) failures.
A comprehensive approach to the reliability of power electronics devices complements a PoF-based degradation model with the following methods.
1) Design for reliability (DFR): incorporate accumulated field experience and design tools into each development phase of a power electronic product (concept, design, validation, production, and release) and simulate the degradation of the system based on a mission profile technique [81], which incorporates a large parametric variation (temperature, input/output power, load changes, etc.) under the most representative operating and environmental conditions.
2) Verification and monitoring methods: use accelerated tests, including thermal cycling and overstress tests, to validate the degradation models and characterize the component lifespan in a short period of time [80]. Also, diagnosis (determining the location and cause of a fault), prognosis (forecasting the RUL of a component or system), and health monitoring (real-time estimation of the state of health of the system) are important techniques to enhance reliability, implement health management, and improve degradation models in power electronics [82], [83].
This approach is gaining traction in the power electronics field and has recently been applied to MMCCs for overall reliability prediction [84], lifetime estimation considering thermal and power cycling in offshore wind power HVdc systems [85], thermal performance of IGBT modules in MMCC configuration [86], and condition health monitoring [87].

B. MMCC RELIABILITY CALCULATION
1) POWER CELLS
Fig. 4 shows the main components of a chopper cell: controller (CON), gate drivers ($G_{1/2}$), switches ($S_{1/2}$), diodes ($D_{1/2}$), and capacitor (C) [88]. The reliability $R_{\mathrm{CC}}$ of the chopper cell is then obtained as the product of the reliability functions of the considered components (each function as in (5)) [89]. The reliability $R_{\mathrm{BC}}$ of a bridge cell is obtained in a similar manner.

2) POWER CLUSTERS
A cluster is composed of m + r cells, where m is the minimum number of cells required for the converter to operate at nominal ratings and r is the number of redundant cells. This is illustrated in the reliability diagram of Fig. 5. Thus, the cluster fails when r + 1 cells fail simultaneously. Assuming that the converter has m identical cells in each cluster, with identical cell reliability $R_{\mathrm{CELL}} \in \{R_{\mathrm{CC}}, R_{\mathrm{BC}}\}$, the reliability function of the ith cluster without redundancy is simply $R_{\mathrm{CLUSTER},i} = R_{\mathrm{CELL}}^{m}$. When r additional redundant cells are introduced in the ith cluster, its reliability increases as given in (9), where $\binom{r}{k} := \frac{r!}{k!(r-k)!}$ represents the combination of available (operational) cells [89]. Clearly, $\binom{r}{k}$ increases when more redundant cells are included, and a large r leads to a more reliable system. Redundant cells should be implemented in combination with appropriate fault-tolerant control systems [90]. Redundant cells can be designed to be identical to the other cells of the converter, but other cell topologies or designs are also feasible. Each design has a different reliability function, which must be considered when calculating the overall cluster reliability [91].
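The effect of redundant cells on the cluster reliability can be explored with a small sketch. It assumes a series model for the chopper cell (every listed component is required) and the textbook k-out-of-n model for the cluster, in which the cluster works as long as at least m of its m + r identical, statistically independent, simultaneously active cells work. This standard form may be written differently from the article's expression, and the function names and numerical values are hypothetical.

```python
from math import comb


def chopper_cell_reliability(r_con: float, r_gate: float, r_switch: float,
                             r_diode: float, r_cap: float) -> float:
    """Series model: one controller, two gate drivers, two switches,
    two diodes, and one capacitor must all work."""
    return r_con * r_gate**2 * r_switch**2 * r_diode**2 * r_cap


def cluster_reliability(r_cell: float, m: int, r: int = 0) -> float:
    """Probability that at least m of the m + r identical cells are operational."""
    n = m + r
    return sum(comb(n, k) * r_cell**k * (1.0 - r_cell) ** (n - k)
               for k in range(m, n + 1))


if __name__ == "__main__":
    # Illustrative component reliabilities at one instant, not measured data.
    r_cell = chopper_cell_reliability(0.999, 0.998, 0.995, 0.997, 0.996)
    for redundant in (0, 1, 2):
        print(redundant, round(cluster_reliability(r_cell, m=20, r=redundant), 6))
```

With r = 0 the sum collapses to a single term, the product of the m cell reliabilities, which matches the series behavior described above for a cluster without redundancy.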

3) POWER MMCCS
Finally, the reliability of the whole MMCC with c clusters can be computed as the product of the cluster reliabilities, $\prod_{i=1}^{c} R_{\mathrm{CLUSTER},i}$, as in (10), where, e.g., c = 6 and c = 9 for the DSCC and TSBC configurations, respectively. Assuming that all clusters have the same reliability (i.e., $R_{\mathrm{CLUSTER},i} = R_{\mathrm{CLUSTER}}$) and using the Weibull distributions (5) in (9), (10) becomes the simplified probability function of an MMCC given in (11). In (11), it is assumed that all cells are statistically independent and that the reliability functions of components of the same type are identical. If the components are not identical, e.g., due to manufacturing tolerances or unequal aging, a different reliability function must be obtained for each term of (10). Unequal aging among the components is caused by internal or external faults. These faults have an impact on all cells of one cluster, which leads to a statistical dependency of the reliability functions of the cells. This is studied in more detail in [92]. Note that cluster redundancy has not been reported in the literature and is not included in (10). However, some authors propose converter reconfiguration strategies that adapt the converter when one cluster fails and maintain operation [93], [94]. This can be considered a cluster redundancy approach; however, more research is needed to accurately obtain its reliability function.
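As a rough illustration of how the converter-level reliability depends on the number of clusters, the following sketch evaluates a series connection of c identical clusters over time. It is a simplification under stated assumptions: each cell is lumped into a single two-parameter Weibull reliability function instead of a product over its components, the cluster follows a k-out-of-n model with independent cells, and all names and parameter values are hypothetical.

```python
import math
from math import comb


def cell_reliability(t: float, beta: float, eta: float) -> float:
    """Lumped Weibull reliability of one cell (assumption: location gamma = 0)."""
    return math.exp(-((t / eta) ** beta))


def cluster_reliability(r_cell: float, m: int, r: int) -> float:
    """At least m of the m + r identical, independent cells must work."""
    n = m + r
    return sum(comb(n, k) * r_cell**k * (1.0 - r_cell) ** (n - k)
               for k in range(m, n + 1))


def mmcc_reliability(t: float, c: int, m: int, r: int,
                     beta: float, eta: float) -> float:
    """Series connection of c identical clusters (no cluster redundancy)."""
    return cluster_reliability(cell_reliability(t, beta, eta), m, r) ** c


if __name__ == "__main__":
    # c = 6 for the DSCC and c = 9 for the TSBC; other values are illustrative.
    for name, c in (("DSCC", 6), ("TSBC", 9)):
        for r in (0, 2):
            value = mmcc_reliability(t=1.0, c=c, m=10, r=r, beta=1.0, eta=200.0)
            print(name, "r =", r, round(value, 4))
```

Because the clusters are in series, a converter with more clusters needs a higher per-cluster reliability to reach the same overall figure, which is one reason redundant cells matter more as c grows.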
The propagation of aging effects can be minimized by predictive maintenance and/or condition monitoring of the main components, adjusting the maintenance intervals in accordance with the varying failure rates of the individual components [74]. By adding redundant components or cells in combination with fault-tolerant control strategies, the reliability of the MMCC can be substantially increased [74], [92]. In [95], an interesting method for estimating the lifetime based on the operating conditions of a converter, neglecting electro-mechanical information, has been proposed. The method uses nonlinear data from the reliability analysis and calculates the remaining lifetime using an artificial neural network (ANN). The method has been proposed for a simple electrical power converter but could be extended to MMCCs, which is still an open research field.

IV. FAILURE MECHANISMS IN MMCCs
Failure mechanisms in power electronics components are subdivided into the following major categories: thermal shocks, dielectric breakdowns, and environmental or mechanical impacts [96]. These failure mechanisms can lead to partial, intermittent, and catastrophic failures.
Thermal shocks occur when excessive temperature gradients lead to different expansion magnitudes in different but closely attached parts of the device [35]. In power semiconductors, conduction and switching losses are the dominant losses [97], while losses in the equivalent series resistance (ESR) and core losses are dominant in capacitors and inductors, respectively [98].
Dielectric breakdowns lead to a rapid reduction of the dielectric resistance due to the voltage applied across the dielectric [99]. In solid dielectrics, the faults are produced by displacement currents, which move charges within the dielectric and change its isolation distribution [100], and also by cracking due to thermal accumulation. Gas or gel insulators are less prone to dielectric breakdown. Moreover, partial discharge due to environmental factors such as atmospheric humidity, contamination of the dielectrics, and corrosive environments [101] may induce additional stress on the dielectrics.
Environmental conditions significantly affect the reliability of power electronics components. Examples are: 1) accumulated dust on the (printed) circuit boards can reduce their dielectric strength; 2) moisture may increase the conductivity of permeable insulators; 3) too low humidity may dry out materials or make them prone to electrostatic charging; 4) environmental temperature changes can lead to condensation on the circuit boards or insulators; and 5) humidity in saline or chemical environments can produce corrosion or electrolysis in metal components [102]. Mechanical stresses, such as bending of the circuit boards or vibrations in general, may produce cracks in the solder joints, partial discharges, or carbonized paths, which may lead to intermittent or total failures of the connections [103]. Fig. 6 summarizes two studies conducted in 2007 [104] and 2011 [30], which agree that MOSFETs/IGBTs, capacitors, and gate drives are the most important sources of faults in power converters. IGBTs and capacitors account for more than 50% of the faults in power converters.
MMCCs are also susceptible to such faults, since they use a large number of these electrical components. The exact number of components grows rapidly with the number of cells and depends on the chosen cell topology; e.g., in the Trans Bay Cable Project, more than 1200 cells, and thus more than 2400 IGBTs and 1200 cell capacitor banks, are deployed [41]. The cells could be built from different types of semiconductor switches, such as MOSFETs or IGBTs, but in high-power modular multilevel topologies IGBTs are the preferred devices [105] and are used in several applications, including the HVDC Light by ABB and the HVdc Plus by Siemens [106], [107]. Due to the required power ratings of the semiconductor devices, there are mainly two packages available on the market, the wire-bond and the press-pack modules, of which the former has been widely adopted in industry so far [108]. However, the press-pack has also been introduced and adopted in MMCCs [40], [106].
The cell capacitors are either metallized polypropylene film capacitors (MPPF-Caps) or aluminum electrolytic capacitors (Al-Caps). Nowadays, MPPF-Caps are more widely adopted in industrial MMCC cells due to their longer lifespan and self-healing capability [109], [110], while Al-Caps are mainly used in prototypes and demonstrators [111]. However, according to [112] and [113], Al-Caps are still advantageous in some MMCC applications where the design objectives are high energy density and low cost.
Thus, the fault modes and failure mechanisms of IGBTs and capacitors are explored in detail in the following subsections.

1) FAULT MODES AND FAILURE MECHANISMS IN IGBTs
IGBTs come in two different packaging styles: the wire-bond IGBT module and the press-pack IGBT module. Both technologies are used in industrial applications. The selection is based on the requirements and specifications of the individual application, such as rated voltage, power density, and/or cost [114].
A classical wire-bond IGBT module is shown in Fig. 7. It comprises a copper (Cu) base plate, a direct copper bond (DCB) substrate, package terminal(s), and silicon (Si) chip(s) (e.g., with IGBTs and/or diodes), all connected through aluminium wire-bonds [115]. The DCB substrate usually consists of two layers of copper with a ceramic substrate in between. It is soldered to the base plate and the silicon chip(s) and electrically isolates them from each other. Moreover, it allows the heat to be dissipated more easily via the cooling system [see Fig. 7(b)]. An encapsulation of silicone gel is used for isolation, which inhibits humidity and prevents partial discharges and contamination of the internal circuit. Although this type of module is very popular and widely used, its wire-bond connections and solder joints typically suffer from lift-offs or malfunctions when the module is exposed to (cyclic) thermal stress. The press-pack IGBT module was introduced to eliminate wire-bond lift-offs [114]. This packaging has been widely used for integrated gate-commutated thyristors (IGCTs), gate turn-OFF thyristors (GTOs), injection-enhanced gate transistors (IEGTs), and classical diodes [116]. The packaging of the press-pack technology is illustrated in Fig. 8. The press-pack IGBT module consists of two Cu electrodes (collector and emitter), which enclose several layers of metals used in semiconductor applications, such as silver (Ag) and molybdenum (Mo), with different electrical and thermal properties, and the metalloid silicon (Si), which contains the semiconductor chip. The Cu base plates at the top and bottom work as thermal paths for heat dissipation. The Mo layers distribute the clamping forces and provide isolation [114]. A slim and flexible Si shim supports the Si/Mo packaging mechanically as it compensates for deformations of the whole structure [117].
Regardless of the chosen packaging technology, IGBTs can exhibit three fault modes: 1) open-circuit faults (OCFs); 2) short-circuit faults (SCFs); or 3) premature aging (PA). While OCFs are more likely to occur in wire-bond IGBT modules, SCFs are more likely in press-pack IGBTs. However, the kind of fault always depends on the fault energy. The key characteristics of both technologies are summarized in Table 1.
OCFs disable the turning-ON capability of a power module. They do not immediately lead to a failure of the converter as usually the converter can maintain its operation with reduced performance [118]. However, OCFs may lead to premature aging or secondary failures in other components or subsystems of the converter. This fault mode is a consequence of internal or external disconnections [119].
SCFs eliminate the capability of the module to be turned OFF. A switch with an SCF within a cell might lead to a short-circuit current and the destruction of the affected MMCC cell when the complementary switch of the same leg is closed and the cell capacitor is charged. Five different failure mechanisms during turn-ON and turn-OFF periods are usually responsible for SCF modes: 1) high-voltage breakdown; 2) device latch-up; 3) secondary breakdown; 4) energy shocks; and 5) absence of IGBT gate drive signals. Nowadays, device latch-up is quite uncommon in newer modules with a trench-gate structure [119].
All possible failure mechanisms, resulting from internal or external failure sources and leading to the three fault modes OCF, SCF, or PA, are collected in Table 2.

2) FAULT MODES AND FAILURE MECHANISMS IN CELL CAPACITORS
Power capacitors are the second most critical component in power electronics systems [136]. As the central part of each power cell, the capacitors have a significant impact on the MMCC operation when they fail. During the design of an MMCC cell, the power capacitors are selected as a compromise among high power density, low cost, and high reliability [111], [137]. Large capacitance values lead to small voltage and current ripples and, therefore, to small circulating currents and reduced power losses in the MMCC [137], [138], but the power density decreases and the cost increases [139]. The total energy stored in the MMCC capacitors as a function of the nominal power, namely the virtual inertia, is also an important design factor, as it enables voltage ride-through operation [140]. The properties of the dielectric material and the construction topology are the most important factors influencing the performance and reliability of the capacitors [141]. The main differences between both capacitor types are collected in Table 3.
Fig. 9 shows the layer structure of polarized power Al capacitors, composed of foil anode and cathode (electrodes), isolating paper, and a positive and a negative terminal. Due to the extreme purity of Al, the surfaces of the foils are treated with an etching process to create a microscopically roughened surface and obtain the maximum feasible capacitance value [142]. Also, the Al foil is exposed to a thermo-chemical process to produce a very thin layer of Al oxide ($\mathrm{Al_2O_3}$), which acts as the dielectric of the capacitor [143]. The dielectric thickness is chosen to meet the rated voltage specification. Anode and cathode Al foils are then joined through an isolating paper, which ensures the required dielectric strength between the anode and cathode foils, and filled with a liquid electrolyte to provide the positive and negative terminals of the capacitor [144].
Power MPPF-Caps have no polarity (unlike Al-Caps) and come in two different packaging types [see Fig. 10(a)]: the "wound type," where the layers are wound around a hollow mandrel, and the "stacked type," where the layers are stacked on top of each other. Regardless of the packaging, each of these layers is made up of a polypropylene (dielectric) film, a sprayed-on metallized surface [typically Al or zinc (Zn)], and one clear edge, used to create the capacitive structure [145]. The thermoplastic dielectric lacks polarity and will not change its orientation under electrical field stress. Thanks to this feature, the dielectric has a low loss factor and a low conductivity, making MPPF-Caps relatively independent of temperature and frequency changes [146]. A metallic contact layer ("metal schoopage") is sprayed on each side of the layers to obtain both terminals of the capacitor [see Fig. 10(b)]. The top and bottom of MPPF-Caps are covered by protective films.
In terms of reliability, MPPF-Caps possess a self-healing capability. Once an electrical breakdown event occurs, the metallized film around the breakdown area is oxidized or vaporized. Once the MPPF-Cap has healed, it continues operating, but it is slightly affected by aging, loss of capacitance, and an increased ESR [147]. Due to their self-healing capability, MPPF-Caps are more resistant to over-voltages than Al-Caps. MPPF-Caps also have a lower ESR than Al-Caps, which reduces the capacitance required to achieve the same voltage ripple and reduces the power losses caused by current ripples. This makes them better suited for converters that operate under faults or suffer from higher current ripples [141].
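The sizing trade-offs discussed above can be made concrete with a small numerical sketch. It is illustrative only: the cell count, capacitance, cell voltage, rated power, ESR values, and ripple current are hypothetical and not taken from the cited references. The sketch uses the standard relations E = ½CV² for the stored energy and P = I_rms²·ESR for the ripple-current loss.

```python
def stored_energy_j(capacitance_f, voltage_v):
    """Energy stored in one cell capacitor: E = 0.5 * C * V^2 (joules)."""
    return 0.5 * capacitance_f * voltage_v ** 2


def energy_to_power_ratio_ms(n_cells, capacitance_f, voltage_v, rated_power_w):
    """Total stored energy divided by rated power, in milliseconds (numerically equal to kJ/MW)."""
    total_energy_j = n_cells * stored_energy_j(capacitance_f, voltage_v)
    return 1e3 * total_energy_j / rated_power_w


def esr_loss_w(ripple_current_rms_a, esr_ohm):
    """Ohmic loss caused by the ripple current in the ESR: P = I_rms^2 * ESR (watts)."""
    return ripple_current_rms_a ** 2 * esr_ohm


# Hypothetical converter: 1200 cells, 10 mF and 2 kV per cell, 800 MW rated power.
ratio_ms = energy_to_power_ratio_ms(1200, 10e-3, 2000.0, 800e6)

# Hypothetical ESR values compared for the same 100 A (rms) ripple current.
loss_low_esr_w = esr_loss_w(100.0, 2e-3)
loss_high_esr_w = esr_loss_w(100.0, 20e-3)
```

With these assumed numbers, the converter stores 30 kJ per MW of rated power, and the capacitor with the tenfold lower ESR dissipates a tenfold lower ripple loss.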
MPPF-Caps and Al-Caps can fail due to internal or external factors, which lead to failure mechanisms such as design defects, material degradation, over-temperature, over-voltage, over-current, and (thermo-)mechanical stress. The resulting fault modes can be divided into three categories: 1) OCFs and 2) SCFs, both resulting from one or several failure mechanisms, and 3) wear-out faults (WOFs) due to degradation or aging of the capacitors over a long(er) time horizon. A WOF causes abnormal capacitor operation, such as decreased capacitance, increased ESR and losses, increased leakage current, or degraded dielectric strength. All three fault modes will eventually lead to MMCC failures. Table 4 summarizes the main electrical and environmental factors that trigger failure mechanisms and lead to the different fault modes.

V. FAULT IMPACT ANALYSIS OF MMCCS
A fault in a power semiconductor or capacitor of an MMCC can lead to internal failures, malfunction, or potential destruction, and may even cause severe damage to the connected load. This section shows how a fault propagates within the MMCC, what impact it might have on MMCC operation, and how to detect and identify fault symptoms.

TABLE 5. Key Information to Detect an OCF in a CC and Its Possible Location (Faulty Switch)

1) Open-circuit faults in CCs and BCs:
When an OCF occurs in one or both switches of a CC, six fault scenarios can be identified, which lead to twelve cell current paths, as summarized in Fig. 11. The scenarios are derived from the selection of the faulty switch(es) ($S_1$, $S_2$, or $S_1$ and $S_2$) and the gate signals, while the current paths are obtained from the cluster current direction. Comparing the current paths under faulty and normal conditions, Fig. 11(a) and Fig. 14(a), respectively, shows that for a positive current $i_x^y > 0$ the current paths and output voltage are identical. Hence, it is impossible to detect abnormal operation in this case. However, for a negative current $i_x^y < 0$, the cell output voltage differs from that of the healthy cell, the converter behavior is affected, and fault detection is feasible. Table 9 in Appendix B summarizes these observations for each possible fault scenario. For the BC, the OCF leads to 16 fault scenarios with up to 32 cell current paths; all of them are illustrated in Fig. 12. As for CCs, some fault scenarios for BCs do not allow the faulty switch to be identified. For instance, comparing the current paths under faulty and normal conditions, Figs. 12(a) and 15(a), respectively, reveals that for a positive current $i_x^y > 0$ the current paths and output voltage are identical for faulty and healthy operating conditions. However, for a negative current $i_x^y < 0$, the current paths of the faulty cells differ from those of the normal cell, which changes the cell output voltages and the converter behavior and thereby enables fault detection. Table 12 in Appendix B summarizes these observations for each of the 16 fault scenarios for OCFs in BCs.
When an MMCC operates in steady state, the capacitors of each CC and BC are equally charged and discharged over a cycle. However, if an OCF occurs, Tables 9 and 12 reveal that the capacitors are predominantly charged, which leads to a permanent increase of the capacitor voltages over time. This represents a typical symptom of an OCF. By knowing the output voltage of a cell, the cluster current, and the gate signals, it is possible to detect an OCF in a CC and in a BC, as summarized in Tables 5 and 6, respectively. These tables are crucial for implementing OCF detection algorithms for MMCCs.

TABLE 7. Delta and Sigma Currents in a DSCC Affected by an OCF in One of Its Clusters ($i^{\Delta}_{x,\mathrm{ref}}$ and $i^{\Sigma}_{x,\mathrm{ref}}$ Represent the RMS Values of the Delta and Sigma Currents Under Normal Conditions)

2) Short-circuit faults in CCs and BCs:
When an SCF occurs in one or both switches of a CC, six different fault scenarios and twelve cell current paths can be identified, as illustrated in Fig. 16. Four of the six fault scenarios lead to a capacitor short circuit, which must be considered a destructive fault for semiconductors and capacitors if the SCF is not cleared rapidly. For these four fault scenarios, the capacitor voltage decreases and the capacitor current increases rapidly. The slope of either variable can therefore serve as an indicator to detect one of these faults [155]. Fault detection circuits integrated into the gate drivers are typically used to clear the fault promptly [156]. Note that measuring external cluster currents does not provide enough information to detect an SCF in a cell. Two SCF scenarios can be distinguished which do not lead to capacitor short circuits [see, e.g., Fig. 16(a) and (e)]. However, for those fault scenarios, the cell current paths are identical to the current paths of the healthy cell, which makes it impossible to detect these types of SCFs.
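The slope-based indicator for capacitor short circuits mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of [155]: the function name, the sampling interval, and the threshold are hypothetical, and a practical implementation would typically reside in the gate driver hardware rather than in software.

```python
def detect_capacitor_scf(v_cap_samples, dt_s, dvdt_threshold_v_per_s):
    """Return the index of the first sample at which the capacitor voltage
    falls faster than the threshold, or None if no such sample exists."""
    for k in range(1, len(v_cap_samples)):
        slope = (v_cap_samples[k] - v_cap_samples[k - 1]) / dt_s
        if slope < -dvdt_threshold_v_per_s:
            return k
    return None


# Hypothetical data sampled every 1 us: normal ripple, then a collapse of the voltage.
faulty = [2000.0, 1999.0, 1998.0, 1500.0, 900.0]
healthy = [2000.0, 1999.0, 1998.0, 1997.0]

fault_index = detect_capacitor_scf(faulty, 1e-6, 1e8)
no_fault = detect_capacitor_scf(healthy, 1e-6, 1e8)
```

The threshold has to lie well above the steepest slope produced by normal ripple; otherwise the indicator would trigger during healthy operation.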
Similarly, Fig. 17 shows the 16 fault scenarios and 32 current paths for SCFs in BCs. Table 12 in Appendix B summarizes the current paths for each case. The eight fault scenarios which lead to a capacitor short circuit are also included and highlighted (in green). In general, a faulty switch in an MMCC cell can cause unreliable operation or malfunction of the entire system, increase the total harmonic distortion of the cluster currents, or even stop the entire converter because the over-current or over-voltage protection is triggered. The TSBC and DSCC exhibit different patterns in their cluster currents under faults, which can be used to detect and identify the type of fault. A detailed study of the impact of submodule faults on MMCC performance can be found in [157], [158], and [159].
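Returning to OCFs in a CC, the detection idea of comparing the measured cell output voltage with the value expected from the gate signal and the cluster current direction can be sketched as follows. The switch naming and the sign convention are assumptions made for illustration and may differ from Fig. 11 and Table 5: S1 is taken as the switch that inserts the capacitor, S2 as the complementary switch that bypasses it, and a positive cluster current charges the capacitor when the cell is inserted. Under this convention, an open S1 is only visible for negative current while S1 is gated on, and an open S2 is only visible for positive current while S2 is gated on.

```python
def expected_output_voltage(s1_on, v_cap):
    """Healthy chopper cell: capacitor voltage when inserted, zero when bypassed."""
    return v_cap if s1_on else 0.0


def detect_ocf_chopper_cell(s1_on, i_cluster, v_out, v_cap, tol=0.1):
    """Classify one measurement as "S1", "S2", or None (no OCF visible).

    Assumes complementary gating (S2 is on whenever S1 is off) and the
    hypothetical sign convention described in the text above.
    """
    mismatch = abs(v_out - expected_output_voltage(s1_on, v_cap)) > tol * v_cap
    if not mismatch:
        return None
    if s1_on and i_cluster < 0:
        return "S1"  # current was diverted to the lower diode, output stays at zero
    if (not s1_on) and i_cluster > 0:
        return "S2"  # current was diverted to the upper diode, capacitor is inserted
    return None  # mismatch not explained by a single OCF under this convention


# Hypothetical measurements with a 2 kV cell capacitor voltage.
open_s1 = detect_ocf_chopper_cell(True, -100.0, 0.0, 2000.0)
open_s2 = detect_ocf_chopper_cell(False, 100.0, 2000.0, 2000.0)
hidden = detect_ocf_chopper_cell(True, 100.0, 2000.0, 2000.0)
```

A single sample is used here for brevity; a practical detector would likely require the mismatch to persist over several samples to avoid false alarms caused by dead time and measurement noise.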

B. IMPACT OF CAPACITOR FAULTS ON CELL OPERATION
The impact of aging on the performance of the capacitor is usually associated with the loss of dielectric properties, a decrease of capacitance, and an increase of the ESR [112] and power losses, which can cause a destructive short-circuit [150] (see Table 4). These last two effects force the cell to increase the commutation frequency of the IGBTs to maintain the voltage ripple within suitable boundaries and also to achieve voltage balancing among the cluster cells [160].
Generally, the cells of an MMCC use a parallel connection of several capacitors to increase the capacitance and improve the current distribution. Therefore, an OCF in one cell capacitor does not lead to a dc-link OCF, making this kind of fault less common. However, if a dc-link OCF occurs and is not detected quickly, it interrupts the inductive cluster current, causing a high di/dt and, therefore, over-voltages in the faulty cell that eventually destroy the affected IGBTs. To avoid this, the output of the cell must be short-circuited to provide a permanent path for the cluster current [161].
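The effect of losing a single capacitor of a parallel bank can be estimated with a simple relation. As a sketch, assume (hypothetically) n identical capacitors of capacitance C_0 per cell, ideal current sharing, and an unchanged charge swing ΔQ per cycle:

```latex
C = n\,C_0, \qquad C' = (n-1)\,C_0, \qquad
\frac{\Delta v'}{\Delta v} = \frac{\Delta Q / C'}{\Delta Q / C} = \frac{n}{n-1}
```

For n = 4, for example, the voltage ripple grows by a factor of 4/3 (about 33%), and the ripple current carried by each remaining capacitor increases by the same factor.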

C. IMPACT OF IGBT FAULTS ON MMCC OPERATION
Appendix B shows the voltage deviations in a CC and BC when an OCF in an IGBT occurs. Measuring only the output voltage of the cell is not sufficient to detect the kind and location of the fault. However, OCFs cause unequal voltage contributions and energy imbalances among the cells within a cluster and, consequently, among the clusters of the converter. For instance, for a DSCC configuration, these imbalances produce negative sequences [162] and inject second- and third-order harmonic components [163] at the ac port. Such imbalances are directly reflected in the sigma currents $i_x^{\Sigma} := \tfrac{1}{2}(i_x^p + i_x^n)$ and the delta currents $i_x^{\Delta} := i_x^p - i_x^n$ of the decoupled model (recall Section II; see [44]). Their current dynamics, (12) and (13), are obtained from (2) when the associated assumptions hold and only the $\Sigma\Delta$-transformation is applied. Table 7 summarizes these observations to ease detection of the faulty switch and the faulty cluster in a DSCC configuration affected by an OCF.
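The sigma and delta currents defined above are straightforward to compute from the measured upper and lower cluster currents of one phase. The following sketch does so and compares their RMS values with reference values from normal operation, in the spirit of Table 7. The waveforms, the reference values, and the tolerance are hypothetical, and the helper names are not taken from the article.

```python
import math


def sigma_delta(i_p, i_n):
    """Sigma and delta currents of one phase: 0.5 * (i_p + i_n) and i_p - i_n."""
    return 0.5 * (i_p + i_n), i_p - i_n


def rms(samples):
    """Root-mean-square value of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def deviates(samples, rms_ref, tol):
    """True if the RMS value differs from the reference by more than tol (relative)."""
    return abs(rms(samples) - rms_ref) > tol * rms_ref


# Hypothetical healthy waveforms over one fundamental period.
N, I_DC, I_AC = 1000, 90.0, 100.0
theta = [2.0 * math.pi * k / N for k in range(N)]
i_p = [I_DC / 3.0 + 0.5 * I_AC * math.sin(t) for t in theta]
i_n = [I_DC / 3.0 - 0.5 * I_AC * math.sin(t) for t in theta]

pairs = [sigma_delta(p, n) for p, n in zip(i_p, i_n)]
i_sigma = [s for s, _ in pairs]
i_delta = [d for _, d in pairs]

sigma_alarm = deviates(i_sigma, I_DC / 3.0, 0.05)
delta_alarm = deviates(i_delta, I_AC / math.sqrt(2.0), 0.05)
```

For the healthy waveforms assumed here, neither alarm is raised; a persistent deviation of one of the two RMS values from its reference would be the symptom to investigate further.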
According to [157], the following anomalies can be detected in MMCC clusters during a fault: 1) new frequency components in the sigma currents; 2) unbalanced power losses in the converter; 3) increased ripples in the cluster currents; and 4) an increased current total harmonic distortion (THD) at the input port of the converter, caused by the non-commutation of the faulty switch(es) in a cell, which reduces the modulation index range and, consequently, the output voltage range.
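The THD mentioned in item 4) can be evaluated from sampled current data. The sketch below is a minimal illustration using a direct discrete Fourier sum over exactly one fundamental period; the test waveform with second- and third-order harmonics and the harmonic limit of 40 are assumptions chosen for the example.

```python
import math


def harmonic_amplitude(samples, h):
    """Amplitude of the h-th harmonic; samples must span exactly one fundamental period."""
    n = len(samples)
    re = sum(x * math.cos(2.0 * math.pi * h * k / n) for k, x in enumerate(samples))
    im = sum(x * math.sin(2.0 * math.pi * h * k / n) for k, x in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n


def thd(samples, max_harmonic=40):
    """Total harmonic distortion relative to the fundamental (harmonics 2..max_harmonic)."""
    fundamental = harmonic_amplitude(samples, 1)
    distortion = math.sqrt(
        sum(harmonic_amplitude(samples, h) ** 2 for h in range(2, max_harmonic + 1))
    )
    return distortion / fundamental


# Hypothetical current: 100 A fundamental plus 5 A second and 3 A third harmonic.
N = 1000
current = [
    100.0 * math.sin(2.0 * math.pi * k / N)
    + 5.0 * math.sin(2.0 * math.pi * 2 * k / N)
    + 3.0 * math.sin(2.0 * math.pi * 3 * k / N)
    for k in range(N)
]
current_thd = thd(current)
```

For this assumed waveform, the result is sqrt(5² + 3²)/100, i.e., a THD of about 5.8%.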
To avoid the consequences of premature aging or catastrophic faults in the MMCC cells, the MMCC must be equipped with: 1) fault-tolerant control strategies; 2) fault detection and diagnosis methods, which monitor the MMCC behavior; and 3) reconfiguration methods, which allow the system (hardware and/or software) to be reconfigured in the case of faults. These topics are studied in the second part of this article.

VI. CONCLUSION
This article presents a comprehensive study and discussion of the reliability of MMCCs, an analysis of the dominant factors that usually lead to open- and short-circuit faults in the MMCC cells, and the impact of these fault modes on the converter operation.
Besides collecting most of the publications on this topic, this article also provides: 1) a dynamic model for the DSCC and TSBC converters, required to unify the discussion on reliability found in the reported publications; 2) an explicit reliability function for the calculation of the MMCC reliability; 3) a summary of fault modes and failure mechanisms of IGBTs and power capacitors (the main components of an MMCC); and 4) a set of tables which collect all current paths, capacitor voltages, and cell output voltages under OCFs or SCFs in CCs and BCs, and a set of tables that use this information to easily detect and identify faults in TSBC and DSBC MMCCs. It also provides a summary of the relevant terminology used in this field, which eases the readability of the article.
MMCC mathematical models are well established and studied for normal operation. However, more research is needed to adapt and/or extend these models for use under faulty conditions. Models of the free-wheeling diodes of converter cells, of bypass switches for disabling faulty cells, and more sophisticated models of capacitors and inductors are key to obtaining more information during or prior to a fault. This is very important for enabling fast and reliable fault detection and diagnosis (FDD) and robust converter reconfiguration.
An explicit reliability function to calculate the reliability of an MMCC has been presented in (11). However, considerable effort is still required to calculate the reliability functions of each critical element of the MMCC under all real operating conditions. State-of-the-art solutions use reliability function parameters provided by manufacturers. However, that information might not match the actual stress conditions of the real application. In the authors' opinion, the reliability functions of each critical component should be computed online to achieve online condition monitoring and predictive maintenance. This will lead to a real-time converter "state-of-health," improving converter reliability. This aspect is an interesting topic for further investigations.
A summary of the most important failure sources, fault modes, and failure mechanisms of different types of IGBTs and power capacitors was introduced in Tables 1-4. This collected information is very valuable for designing a converter. However, to the best knowledge of the authors, this information and such models have not yet been introduced into reliability functions of the converter to obtain qualitative information about the impact of using different technologies.
As an important contribution, this article presents a set of tables which summarize the current paths and output voltages of MMCC cells under all faults (see Tables 8-11). Based on these tables, key information can be obtained to identify a fault and its type (see Tables 5-7). These tables are used in most of the reported publications for fault detection and diagnosis in MMCCs. They represent the state of the art for fault identification. Although these tables are very useful for fault identification, they still fail when several faults occur at the same time. Also, there are some conditions under which the fault cannot be detected, and most approaches consider only a single switch failure at a time or neglect the fault analysis of other converter components, which can lead to a false interpretation of a fault.
The impact of a fault in one cell of the converter on the MMCC operation is still an incipient topic; only a few case studies and failure analyses have been reported. Considerable work is still needed in this direction to obtain robust information which associates anomalous voltages/currents of the MMCC with a specific type of fault. To comprehensively approach a fault in an MMCC, a fault detection and diagnosis process must be accomplished, which can involve complex quantitative or qualitative methodologies. Hardware and software reconfiguration strategies must also be taken into account to improve the reliability of MMCCs. These topics are investigated in the second part of this article.

APPENDIX A TERMINOLOGY
In this appendix, the relevant terminology used throughout the article is collected and defined.

A. RELIABILITY
Reliability is an indicator which gives the probability for a component, subsystem or system, to perform as specified for a certain period of time and under defined conditions [164].

B. AVAILABILITY
Availability means that the system is able to provide the desired or specified functionality or to perform the service when asked to do so. It is defined as the probability that a system is operational at a given point in time. An indirect availability measure is the mean time to repair or the mean time to recover (both MTTR) required to repair or restore system functionality after a failure has occurred [165]. Note that "mean time to recover" includes "mean time to repair" and the mean time for fault detection and diagnosis.

C. MTTF/MTBF
Mean time to failure (MTTF) describes the expected time until a non-repairable component or system fails [35]. Mean time between failures (MTBF) is the estimated time between two failures. This average excludes the time required for repair and any time induced by downing events such as inspections or preventive maintenance; it only takes into account the time period during which a system is available and operating.
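The relation between the quantities defined in A to C can be illustrated with a short sketch. It assumes a constant failure rate (exponential reliability model), which is a common textbook simplification and not necessarily the reliability function used in this article; the failure rate and the repair time are hypothetical values.

```python
import math


def reliability(t_h, failure_rate_per_h):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda (assumption)."""
    return math.exp(-failure_rate_per_h * t_h)


def mttf_h(failure_rate_per_h):
    """For a constant failure rate, MTTF = 1 / lambda."""
    return 1.0 / failure_rate_per_h


def steady_state_availability(mtbf_h, mttr_h):
    """A = MTBF / (MTBF + MTTR), with MTBF counting operating time only."""
    return mtbf_h / (mtbf_h + mttr_h)


# Hypothetical component: 1e-5 failures per hour, 24 h mean time to recover.
failure_rate = 1e-5
r_one_year = reliability(8760.0, failure_rate)
mttf = mttf_h(failure_rate)
availability = steady_state_availability(mttf, 24.0)
```

With these assumed values, the probability of surviving one year of continuous operation is roughly 0.92, and the steady-state availability is about 0.9998.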

D. REMAINING USEFUL LIFE (RUL)
RUL is the life span or time interval for which a device, component, subsystem, or system operates before it is required to be replaced [71].

E. FAULT
In the standard ISO 10303-226, a fault is defined as an abnormal condition or defect of the system at component, subsystem, or system level. A fault may: 1) produce an unexpected deviation from the normal condition of one or more features or functionalities; 2) degrade the performance; or 3) lead to undesirable behavior [166] at component, subsystem, or system level. From a control point of view, faults can affect actuators, system (sub)components, control system (sub)components, and/or sensors [167].

F. FAULT MODE
For a given, desired or required functionality, a fault mode or fault event is one of the possible states of a fault [168]. It directly affects a failure mechanism.

G. FAILURE
A failure is the result of at least one fault and its manifested errors. It occurs when an actually running system deviates from its specified or desired behavior leading to a deteriorated performance (at least one specification is not met anymore) or an intolerable malfunction at component, subsystem or system level [165]. In Fig. 13, a basic scheme of the possible interactions between faults, errors and failures is shown.

H. FAILURE MECHANISM
A failure mechanism is a physical process, state, or condition which leads to a failure [168]. It is a direct cause of a fault mode in, e.g., material or (software) code. The term failure mechanism replaces the failure mode concept, which was deprecated in [168].

I. THERMAL RUN-AWAY
Thermal run-away describes a process that is accelerated by increased temperature. During a thermal run-away of an electrical material, the resistance decreases in the affected (heated) regions that exceed their critical temperature. The decreased resistance leads to higher currents through the overheated regions, which in turn causes more heating and leads to a further temperature increase and resistance decrease [169].

J. DIAGNOSIS
The action of determining the cause of an error in location and nature [83].

K. PROGNOSIS
The process of predicting an object system's RUL by predicting the progression of a fault given the current degree of degradation, the load history, and the anticipated future operational and environmental conditions to estimate the time at which the object system will no longer perform its intended function within the desired specifications [83].

L. HEALTH MONITORING
The function of estimating a system's health state, including measurement of state variables and identifying if the states of these variables indicate an off-nominal condition [83].

M. HEALTH MANAGEMENT
The process of decision-making and implementation of actions based on the estimate of the state of health derived from health monitoring and expected future use of the system [83].

APPENDIX B TABLES SUMMARIZING THE IMPACTS OF OCFS AND SCFS ON CELL OPERATION
Note that, for simplicity, the bypassing diodes $D_1$ to $D_4$ are not considered in the modeling. Extended models which also consider diodes can be found in [170] and [171].