Novel Distributed Control Platform and Algorithm for a Modular Multilevel Matrix Converter

The modular multilevel matrix converter (M3C) is an attractive topology for low-speed drives and doubly fed induction generator applications. Its modularity and scalability make it well suited for high-power medium-voltage systems. One of the main challenges in the design and implementation of an M3C is the control platform, which has to handle a large number of submodules and measured values. This article introduces a new control platform, including a high-speed communication network between the distributed control units. The control platform and algorithms are implemented on a 15 kvar M3C test bench with 108 full-bridge submodules.


I. INTRODUCTION
A. Importance of the Modular Multilevel Converter Family (MMCF)

The importance of the MMCF has drastically increased in recent years. Modular multilevel converters (M2Cs) have been extensively investigated due to their promising characteristics for high-power systems from several MVA up to several hundred MVA [1], [2], [3], [4]. So far, the most attractive topologies with industrial applications are the M2C for ac-dc conversion in HVDC applications as well as medium-voltage drives [5], and the single-star bridge cells (SSBC) converter [6] for STATCOMs. The modular multilevel matrix converter (M3C) for three-phase ac-ac conversion [7] is also an attractive topology for certain applications, although its market relevance is distinctly lower. However, the introduction of new products is expected in the coming years [8]. In general, the MMCF features substantial advantages such as reduced filter size and cost, low power semiconductor losses, high availability due to redundancy on the submodule level, and simple voltage scalability. Power converters for ac-ac conversion play an important role in the transformation of the electrical grid toward sustainability. Within the MMCF, a back-to-back configuration of the M2C (M2C-B2B) or the M3C can be used to achieve three-phase ac-ac conversion. Compared to an M2C-B2B, the M3C is attractive for low output frequencies, as stated in [4], [9], and [10]. Interesting applications for the M3C are low-speed, high-torque electrical drives, as applied in mills, extruders, kilns, and conveyors [4], [11], [12], [13]. Furthermore, the M3C has been investigated for use in transmission lines with reduced fundamental frequency [4], [11].
The M3C has also been proposed for the rotor circuit of medium-voltage, high-power doubly fed induction generators (DFIGs) [4], [14], [15]. Since the rotor frequencies of the DFIG are low compared to the grid frequency when operating with an appropriate slip, the M3C is a valid choice. Wind energy conversion systems and pumped storage power plants are typical applications of this configuration. Kienast et al. [8] introduce the M3C as the converter for a 300-MVA DFIG that is part of a flywheel energy storage system to support the grid.
For a DFIG application, grid voltage dips in particular affect the converter design because of the high voltages induced in the rotor circuit [16], [17], [18]. If these operating points are not considered in the converter design, they could cause converter damage. The common use of a crowbar to protect the converter aggravates the effect of the voltage dip on the grid, since the DFIG then acts as an inductive load [19]. Kammerer et al. [14] point out that the overload capability of the M3C [13] can be exploited during a voltage dip, so that the crowbar in the rotor circuit can be avoided. In this case, the converter and DFIG are furthermore able to support the grid during fault events.

B. Requirements for a Control Platform for M3Cs in Research Applications
In general, the MMCF is especially advantageous in high-power and high-voltage applications, which results in a high number of submodules (typically several hundred or more). For converter control, the submodule dc-link voltages, the arm currents, the input and output currents, and further values must be measured and communicated to the control platform [20], [21], [22]. In return, the control platform sends the switching vectors to the submodules. Therefore, bidirectional communication between the submodules and the control platform is required. In state-of-the-art converters, this task is usually solved by field-programmable gate arrays (FPGAs). Unfortunately, the number of pins on available FPGAs is limited. Since the number of pins correlates with the number of submodules in the converter, the output voltage, and therefore the converter power, is restricted by the FPGA used. Furthermore, complex control schemes are an important characteristic of the M3C and M2C. High computing power is required to meet the demand for high control frequencies, which in turn are needed for robust control even in transient operating conditions such as low-voltage ride through during grid faults.
Research and laboratory prototypes play an important role in the development of the M3C and M2C. New modulation, control, and protection schemes are implemented and tested on laboratory prototypes to achieve technical breakthroughs. In addition, when the individual submodule capacitor voltages are known, new condition diagnosis methods can be applied [23]. The implementation and experimental verification of new schemes can be substantially simplified and accelerated if many measured values and control variables can be visualized. For these reasons, a high data throughput from the CPU to the FPGA, as well as from the control platform to a data visualization and tracing (DVT) tool, is advantageous.
Section I-C describes the extent to which these requirements are met by existing control systems, whereas Section I-D summarizes the main contributions of the proposed control system. The platform substantially improves important characteristics such as computational power, control frequency (by a factor of 6.6), and data visualization rate and resolution (8333 double-precision values at 3 kHz) compared to the state of the art. The findings can also be transferred to other multilevel converters, e.g., the M2C.
C. State of the Art

1) Centralized Control Platform: A centralized control platform for the M3C is presented in [12]. All control algorithms are implemented on one CPU. In order to meet the I/O requirements, several FPGAs are used [see Fig. 1(a)]. The interface between the FPGAs and the CPU is critical for a high data throughput, and the maximum data rate is primarily defined by their physical distance. Accordingly, a high level of integration between FPGA and CPU, as is the case with systems on a chip (SoCs), is necessary to enable low latency and high data rates. When multiple FPGAs are connected to a CPU via a shared bus system, the data rate between the FPGAs and the CPU is limited, as the latency increases significantly [12]. This implies that the inner arm balancing has to be performed in the FPGA, which drastically complicates the development and testing of new inner arm balancing algorithms. Also, the visualization and tracing of all capacitor voltages is impossible.
2) Distributed Control Platform: A distributed control platform is characterized by a distribution of CPUs and control tasks. Yao et al. [24] describe a distributed algorithm for the M3C with a system controller and nine local controllers [see Fig. 1(b)]. While the system controller processes the measurements of the voltages and currents outside the converter, the local controllers measure the corresponding arm currents and capacitor voltages. The local controllers only handle low-level control tasks such as modulation and current control. The communication between the controllers is realized with a CAN bus.
The implementation in [24] does not take full advantage of the computing power enabled by the distributed approach, since a large share of the algorithm is implemented on the system controller. In addition, the CAN bus (1 MBit/s) limits the control dynamics substantially, since only one local controller can communicate with the system controller at a time. Therefore, the system control algorithm is executed with a frequency of only 444 Hz, which may lead to stabilization problems in transient operating points such as grid faults. The work in [24] also points out that a CAN bus is not applicable if the transmission length exceeds 40 m, which rules out high-power and high-voltage applications due to the large converter dimensions.

D. Contributions of This Article
In order to meet the requirements of high computing performance, a high number of I/Os, and the visualization, processing, and recording of all measurements and control values, while using state-of-the-art computing components, the authors propose a distributed control platform based on several Xilinx Zynq systems on chip (Zynq-SoCs) working in parallel [see Fig. 1(c)]. Zynq-SoCs are characterized by the integration of powerful CPUs and a high-performance FPGA on one chip. The use of Zynq-SoCs enables a high data throughput (27.5 GBit/s [25]) between FPGA and CPU, which makes it possible, e.g., to transmit all individual capacitor voltages. Therefore, inner arm balancing techniques for the M3C can be implemented in the CPU instead of the FPGA, which gives a great degree of freedom in developing new algorithms. Since the number of pins on available Zynq-SoCs is limited, the parallelization of several Zynq-SoCs is necessary. In order to maximize the total computing power of the control platform, the control tasks are distributed over all Zynq-SoCs, targeting an equal sharing of the workload.
Another important difference compared to [24] is the use of a small form-factor pluggable (SFP)-based, high-speed, low-cost, and galvanically isolated communication network (5 GBit/s) between the CPUs instead of the CAN bus (1 MBit/s). The network enables the following: 1) full use of the installed computing power; 2) simultaneous communication between the controllers; and 3) transmission lengths of more than 40 m (needed in high-power applications). The higher computing power allows control frequencies well above 3 kHz while processing complex control algorithms such as model predictive or flatness-based control.
A common disadvantage of distributed control algorithms is the additional dead time caused by the data transfer between the control units, which is typically performed after the control cycle. Enabled by the fast communication network, the authors propose a control scheme that completely eliminates this dead time by exchanging data between the CPUs within the control cycle.
Both the high sampling frequency (>3 kHz) and the reduced dead time (reduced by nine control cycles) increase the robustness of the control compared to [24]. To validate the distributed control structure and algorithms, the authors use a 15 kvar M3C test bench. The converter consists of 108 full-bridge submodules, and a total of 129 analog measurements are used for converter control. The numbers of submodules and measurements are substantially increased compared to the literature [4], which results in high requirements for the control electronics. In terms of submodule count, this test bench is comparable to a high-power, medium-voltage application. Thus, the results of this article are also relevant for high-power industrial applications.
The rest of this article is organized as follows. Section II describes the mathematical model for the M3C control as a basis for the description of the control structure and algorithm in Section III. The implementation of the distributed control and the hardware design are considered in Sections IV and V, respectively. An overview of the test bench and the experimental results are provided in Sections VI and VII, respectively. Finally, Section VIII concludes the article.

II. MATHEMATICAL MODEL OF THE M3C

A. Modular Multilevel Matrix Converter

The M3C (see Fig. 2) connects each of the three input phases x to each of the three output phases y by one converter arm, consisting of N series-connected full-bridge submodules and an arm inductor L. Each submodule z of arm xy can apply the positive or negative capacitor voltage ±u_{C,xy,z} to its output terminals; bypassing the capacitor generates the third voltage level of zero.
The arm voltage is the sum of the submodule output voltages synthesized by the full bridges

u_{xy} = \sum_{z=1}^{N} u_{xy,z}   (1)

and the capacitor voltages of the cells add up to the arm capacitor voltage

u_{C,xy} = \sum_{z=1}^{N} u_{C,xy,z}.   (2)

The common mode voltage between the star point N_G of the grid and the star point N_L of the load is referred to as u_0. The controlled voltage sources in Fig. 2 represent the series connection of the submodules.

B. Transformation of Arm Quantities
Several papers have used the threefold transformation of the arm quantities of the M3C, which was introduced in [26]. Applying Kirchhoff's voltage law to the nine voltage loops M_xy (see Fig. 2) yields one mesh equation per arm

u_{G,x} = L \frac{d i_{xy}}{dt} + u_{xy} + u_{L,y} + u_0, \quad x, y \in \{1, 2, 3\}   (3)

which can be arranged compactly in 3 x 3 matrix form. Applying the threefold transformation from [26] to (3) results in four independent space vectors and a common mode component. These are assigned to the input side (vertical direction), the output side (horizontal direction), and the inner quantities (diagonal 1 and diagonal 2). The diagonal components do not interfere with the input or output quantities; they are used for balancing the arm capacitor voltages or for compensating large energy fluctuations [26]. The transformation yields vertical, horizontal, and common mode equations in αβ0 coordinates, where the indices α, β, and 0 denote the components of the well-known Clarke transformation [27].

The inverse transformation, also called output transformation, calculates the arm quantities from the four space vectors and the common mode component. For the arm voltages, this results in the composition

u_{xy} = u_{V,xy} + u_{H,xy} + u_{D1,xy} + u_{D2,xy} + u_0   (8)

where u_{V,xy}, u_{H,xy}, u_{D1,xy}, and u_{D2,xy} denote the contributions of the vertical, horizontal, and diagonal space vectors to arm xy. Applied to the arm currents, the inverse transformation results in

i_{xy} = \frac{1}{3} i_{G,x} + \frac{1}{3} i_{L,y} + i_{D,xy}   (9)

where i_{D,xy} collects the diagonal (circulating) components. Equations (8) and (9) show that the arm quantities are a superposition of the input, output, diagonal, and common mode quantities.
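As an illustration of the threefold transformation, the following sketch applies a Clarke transformation to both the input index x and the output index y of a 3 x 3 matrix of arm quantities. The amplitude-invariant Clarke scaling is an assumption for illustration; the normalization used in [26] may differ.

```python
import numpy as np

# Clarke transformation matrix (amplitude-invariant scaling -- an
# illustrative assumption; the normalization in [26] may differ).
C = (2.0 / 3.0) * np.array([
    [1.0, -0.5, -0.5],
    [0.0, np.sqrt(3.0) / 2.0, -np.sqrt(3.0) / 2.0],
    [0.5, 0.5, 0.5],
])

def threefold(X):
    """Transform a 3x3 arm-quantity matrix X (rows: input phases x,
    columns: output phases y) into (alpha, beta, 0) x (alpha, beta, 0)
    coordinates by applying the Clarke transformation to both indices."""
    return C @ X @ C.T

# Example: decompose a matrix of arm currents
I_arm = np.random.rand(3, 3)
T = threefold(I_arm)
i_V = T[0:2, 2]    # vertical space vector (input side)
i_H = T[2, 0:2]    # horizontal space vector (output side)
i_D = T[0:2, 0:2]  # diagonal 1 and diagonal 2 components
i_00 = T[2, 2]     # common mode component

# A quantity that depends only on the input phase maps purely to the
# vertical components: for X = u_G @ ones(3).T, C @ X @ C.T is nonzero
# only in column 2 (rows alpha and beta).
```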

C. Transformed Arm Model
The arm powers p_{xy} = u_{xy} i_{xy} (10) of the M3C are also transformed into four space vectors, resulting in the vertical p_{V,αβ}, horizontal p_{H,αβ}, diagonal 1 p_{D1,αβ}, and diagonal 2 p_{D2,αβ} power components. The common mode component p_{0,0} equals the active power and indicates charging or discharging of the converter. Considering (8)-(10), the arm power contains several combinations of the input frequency, the output frequency, and the frequencies of the diagonal and common mode components. The same procedure can be applied to the arm capacitor voltages, giving the four space vectors u_{C,V,αβ}, u_{C,H,αβ}, u_{C,D1,αβ}, u_{C,D2,αβ} and the common mode component u_{C,0,0}. Taking the vertical component as an example, (11) shows the impact of the arm power on the capacitor voltage [13]

\frac{d\, u_{C,V,\alpha\beta}}{dt} \approx \frac{N}{C\,\bar{u}_{C}}\, p_{V,\alpha\beta}   (11)

where C is the submodule capacitance and \bar{u}_{C} the dc operating point of the arm capacitor voltage. Equation (11) shows that the frequency components of the arm powers appear in the capacitor voltages as well. It follows that each capacitor voltage combines a dc part ū_{C,V,αβ} and an ac part ũ_{C,V,αβ}.
A nonzero dc component of the four space vectors indicates an asymmetry of the arm capacitor voltages. In order to operate the converter in a balanced state, the dc components of the arm capacitor space vectors must be controlled to zero [20].

The common mode capacitor voltage u_{C,0,0} reflects the total energy stored in the capacitors of the M3C and therefore serves as an indicator of the total converter energy.
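The relation between u_{C,0,0} and the stored energy can be sketched as follows, assuming balanced cell voltages within each arm; the proportionality constant depends on the Clarke normalization and is therefore only indicative.

```latex
% Total stored energy, assuming all N cells of arm xy share the arm
% capacitor voltage u_{C,xy} equally (balanced operation):
W_{\mathrm{tot}}
  = \sum_{x,y=1}^{3}\sum_{z=1}^{N} \tfrac{C}{2}\,u_{C,xy,z}^{2}
  \approx \sum_{x,y=1}^{3} \frac{C}{2N}\,u_{C,xy}^{2}
% The common mode component is proportional to the sum of all nine arm
% capacitor voltages, so in balanced operation with u_{C,xy} = \bar{u}_C:
u_{C,0,0} \propto \sum_{x,y=1}^{3} u_{C,xy}
\quad\Rightarrow\quad
W_{\mathrm{tot}} \approx \frac{9\,C}{2N}\,\bar{u}_{C}^{2}.
```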

III. CONTROL ALGORITHM
The authors implemented the control algorithm introduced in [28]. This concept consists of six parts, which are listed in Table I and presented in the following sections. Fig. 3 gives a comprehensive overview of the control structure and its functional blocks.

TABLE I
CONTROL TASKS OF AN M3C
A) Total energy control
B) Input current control
C) Balancing control
D) Output voltage control
E) Modulation
F) Supplementary control

A. Total Energy Control
The total energy control consists of one proportional-integral (PI) controller, which controls the common mode component of the arm capacitor voltage to track the desired value u*_{C,0,0}. The output of this controller is the active power set point of the grid p*_G, which influences the stored energy of the converter.
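A minimal sketch of this controller is given below; the class interface, gains, and numerical values are illustrative assumptions, not the parameters of the test bench.

```python
class PI:
    """Discrete proportional-integral controller (forward Euler integration)."""

    def __init__(self, kp: float, ki: float, ts: float):
        self.kp, self.ki, self.ts = kp, ki, ts
        self.integral = 0.0

    def update(self, error: float) -> float:
        self.integral += self.ki * error * self.ts
        return self.kp * error + self.integral

ts = 1.0 / 3000.0                          # control period at 3 kHz
energy_ctrl = PI(kp=0.5, ki=50.0, ts=ts)   # illustrative gains

# Each control cycle, the tracking error of the common mode arm capacitor
# voltage yields the grid active power set point p_G*.
u_C00_ref, u_C00_meas = 1080.0, 1062.0     # V (example values)
p_G_ref = energy_ctrl.update(u_C00_ref - u_C00_meas)
```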

B. Input Current Control
The input current control operates in a rotating reference frame synchronized with the grid voltage (dq coordinates). Due to this synchronization, the d component of the grid current i_{G,d} carries the active power, whereas the q component i_{G,q} determines the reactive power [29].
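For illustration, the following sketch transforms the measured grid currents into the dq frame. The grid angle theta is assumed to come from the PLL mentioned in Section IV-D, and the amplitude-invariant scaling is an assumption.

```python
import math

def abc_to_dq(i_a: float, i_b: float, i_c: float, theta: float):
    """Amplitude-invariant Park transformation of three phase currents;
    theta is the grid voltage angle provided by the PLL."""
    two_thirds = 2.0 / 3.0
    d = two_thirds * (i_a * math.cos(theta)
                      + i_b * math.cos(theta - 2.0 * math.pi / 3.0)
                      + i_c * math.cos(theta + 2.0 * math.pi / 3.0))
    q = -two_thirds * (i_a * math.sin(theta)
                       + i_b * math.sin(theta - 2.0 * math.pi / 3.0)
                       + i_c * math.sin(theta + 2.0 * math.pi / 3.0))
    return d, q

# i_G_d is then regulated toward the active power set point from the total
# energy control, while i_G_q is set according to the desired reactive
# power (e.g., by two PI controllers as sketched in Section III-A).
```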

C. Balancing Control
The balancing control consists of four PI controllers, the calculation of the dc components of the arm capacitor space vectors ū_{C,V,αβ}, ū_{C,H,αβ}, ū_{C,D1,αβ}, and ū_{C,D2,αβ}, and the calculation of the required diagonal currents and common mode voltage [see (11)]. A nonzero dc component of an arm capacitor space vector indicates an unbalanced state. The dc components are calculated using a low-pass filter.
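The dc extraction can be sketched as a first-order low-pass filter; the cutoff frequency below is an illustrative assumption and must lie well below the lowest ac frequency component of the arm power (see Section II-C).

```python
import math

class LowPass:
    """First-order low-pass filter (exponential smoothing) for extracting
    the dc component of an arm capacitor space vector component."""

    def __init__(self, f_cut: float, ts: float):
        self.alpha = 1.0 - math.exp(-2.0 * math.pi * f_cut * ts)
        self.y = 0.0

    def update(self, x: float) -> float:
        self.y += self.alpha * (x - self.y)
        return self.y

lp = LowPass(f_cut=2.0, ts=1.0 / 3000.0)  # 2 Hz cutoff at 3 kHz control frequency
# each cycle: u_C_V_alpha_dc = lp.update(u_C_V_alpha)
```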

D. Output Voltage Control
The output voltage control depends on the connected load. In the case of a three-phase induction machine, field-oriented control and direct torque control are valid choices [29].

E. Modulation
There are two challenges when modulating the M3C: 1) synthesizing the desired arm voltages; and 2) balancing the capacitor voltages within each arm. To solve both, a modulator based on a sorting algorithm is used [1], [28], [30], [31].
The modulator selects the cells to be turned ON/OFF during the sampling period. In addition, the algorithm chooses one cell to perform a pulsewidth modulation (PWM) pulse pattern in order to meet the desired arm voltage u*_{xy} [see Fig. 4(b)]. The inserted submodules are selected based on the individual capacitor voltages within the arm.
A positive/negative arm power charges/discharges the inserted submodules. To keep the capacitor voltages balanced, the modulation scheme sorts the capacitor voltages in ascending/descending order [see Fig. 4(a)]: if the arm power is positive/negative, the submodule with the lowest/highest capacitor voltage is inserted first in order to charge/discharge it.
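The selection logic can be sketched as follows. The function interface and the simplified handling of negative reference voltages are illustrative assumptions; the actual modulator runs in the FPGA and additionally generates the PWM timing.

```python
def modulate(u_arm_ref: float, p_arm: float, u_caps: list):
    """Sorting-based cell selection for one arm.

    Returns the indices of fully inserted cells and, for one additional
    cell, the PWM duty cycle needed to reach the reference voltage.
    Sign handling of the full bridges is simplified for illustration.
    """
    # charge the lowest-voltage cells when the arm power is positive,
    # discharge the highest-voltage cells when it is negative
    order = sorted(range(len(u_caps)), key=lambda z: u_caps[z],
                   reverse=(p_arm < 0.0))
    inserted, remaining = [], abs(u_arm_ref)
    for z in order:
        if remaining >= u_caps[z]:
            inserted.append(z)                # insert the full capacitor voltage
            remaining -= u_caps[z]
        else:
            return inserted, (z, remaining / u_caps[z])  # PWM cell and duty cycle
    return inserted, None

cells = [91.2, 89.7, 90.4, 90.1]              # V, example capacitor voltages
print(modulate(250.0, p_arm=1.0, u_caps=cells))
# -> ([1, 3], (2, 0.776...)): cells 1 and 3 fully inserted, cell 2 on PWM
```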

F. Supplementary Control
The supplementary control operates the contactors that connect converter parts to, or disconnect them from, the grid and the load. It is especially important during converter start-up and shutdown and for the protection schemes.

IV. DISTRIBUTED CONTROL

A. Distributed Control Algorithm
Section III shows that the control algorithm of the M3C cannot be reduced to a fully distributed control in which every arm or PC is controlled based only on its own quantities. The control algorithm requires access to the relevant electrical quantities of all arms, so sensor values measured on different units must be communicated. In conventional distributed control algorithms, this data transfer results in dead times, which reduce the dynamics of the control. Sections IV-B-IV-D present a way to split the control algorithm such that the transferred data are minimized. Section IV-E introduces a way to eliminate the dead time of certain data transfers in order to increase the dynamic performance of the distributed control algorithm.

B. Control Structure
The distributed control platform consists of four identical control units. Three of them communicate with the three PCs (subconverters comprising three arms each; see Section VI) and are assigned as secondary control units 1-3 (SCU 1-3). Additionally, one primary control unit (PCU) connects the three SCUs (see Fig. 5).

C. Measuring and Output Responsibility
For the distributed control approach, the control algorithm described in Section III is divided among the four control units. Fig. 5 shows the measurements, the outputs of the control units, and the necessary communication buses. Each SCU measures the arm currents and the capacitor voltages of one PC. The PCU measures the voltages and currents outside of the converter. In total, nine arm currents, three grid voltages and currents, three output voltages and currents, and 9 · N capacitor voltages add up to 129 measurements for an M3C with N = 12 (9 + 6 + 6 + 108 = 129). The SCUs communicate the switching states to the submodules via a bidirectional fiber optic interface; the response to this protocol contains the actual capacitor voltages. The PCU also communicates the switching signals to the corresponding contactors K.

D. Control Task Assignment
When distributing the algorithm to the different control units, the following objectives should be pursued.
1) Maintaining the dynamic performance of the control algorithm. 2) Minimizing the data communication between the control units, as it results in dead time or restricts the available computation time (see Section V-B). 3) Achieving a balanced distribution of the control tasks to maximize the utilization of the computational capacity. To reduce the data communication, tasks relying solely on measurements and calculation results of one control unit should be executed locally on that control unit. As every SCU is linked to one PC, the sorting algorithm and the modulation can run on the corresponding SCU without any data communication. Thus, the modulation and sorting tasks are evenly divided among the three SCUs.
In order to meet objective 3), the remaining algorithm is distributed among the four control units, which requires data communication between them. Usually, data transfer occurs at the end of a control cycle; thus, the other control units receive the data delayed by one control cycle [24], [32]. This is known as the dead time effect. When the time constant of a specific part of the control algorithm is distinctly larger than the control cycle, the dynamic performance of that part remains unaffected by the dead time.
Accordingly, it is advisable to divide the algorithm into parts of large and small time constants and to distribute these parts to different computational cores.
The balancing control (blue dashed in Fig. 3) and the supplementary control can be identified as the parts of the algorithm with large time constants. The balancing control has a large time constant due to the time constant of its control loops [12] and the low-pass filter used to compute the dc components of the arm capacitor voltages (see Section III-C). The slow response of the contactors causes the large time constant of the supplementary control. As a result, these control parts are assigned to SCU 1 and SCU 3.
The time constants of the other control parts are substantially smaller. Additionally, an adequate response to transient events, e.g., a voltage dip in the grid, necessitates dynamic input current control and output voltage control. Hence, the influence of the dead time cannot be neglected. The input current control depends on the results of the total energy control, the phase-locked loop (PLL), and the set point calculation (see Fig. 3). Consequently, these calculations take place on the same control unit to avoid a negative impact of the dead time. As the PCU acquires the sensor data (grid voltage, input current) needed for these control parts, they are executed on the PCU to prevent unnecessary sensor data transmission.
The output voltage control operates independently of the other control parts. Therefore, it can be executed on a separate control unit (SCU 2) in order to meet objective 3). The outputs of all controllers are eventually consolidated by the output transformation. The PCU performs the output transformation, since all SCUs can communicate with it directly (see Fig. 5). The resulting dead time would negatively affect the dynamic performance of the output voltage control; therefore, the authors propose an algorithm that eliminates this dead time, as described in Section IV-E.

As the capacitor voltages are measured on the corresponding SCU but are required on other control units for the balancing control, they must be communicated. To limit the data traffic, only the sum of the capacitor voltages of each arm is transmitted. This is sufficient for the control and substantially reduces the data volume. Section IV-E also describes how the dead time resulting from the transmission of the capacitor voltages is eliminated.

Fig. 3 shows the assignment of the control tasks as follows: 1) red dashed lines for the PCU; 2) yellow solid lines for SCU 1; 3) green dash-dotted lines for SCU 2; and 4) blue dotted lines for SCU 3. Fig. 5 also summarizes the control tasks calculated on each control unit and gives an overview of the measuring and output responsibilities for the test bench presented in Section VI.

E. Dead Time Compensation
As outlined in the previous section, the distribution of the control algorithm generates dead time, which impacts the dynamic performance and stability of the algorithm. In [24], a CAN bus is used for the data communication, allowing only one control unit to transmit at a time. Yao et al. [24] run the nine local controllers at a fixed control frequency (4 kHz). After each control cycle T_S, only one of the nine controllers transmits its data to the system controller, which generates the set points for the local controllers. The scheme is depicted in Fig. 6(a). Therefore, the system controller executes at a reduced control frequency of 4 kHz/9 ≈ 444 Hz, resulting in a dead time of nine control cycles. This significantly lowers the control dynamics and thus the stability of the control algorithm.
When using parallel buses [see Figs. 5 and 6(b)] to communicate between the controllers, all controllers can transmit data after each control cycle. This reduces the dead time significantly compared to [24]. However, the remaining dead time can influence the dynamic performance as outlined in Section IV-D.
In order to avoid this problem, the authors propose a new approach in which measurements and certain intermediate results are communicated before they are processed on another control unit [see Fig. 6(b)]. This leads to the sequence presented in Fig. 7. At the beginning of the sampling period, all control units read their sensor values and calculate the arm capacitor voltages according to (2). Then, all SCUs transfer these data to the PCU. Additionally, the output currents are transferred to SCU 2, where the output current control takes place.
After a successful data transfer, the control units perform the output voltage control, balancing control, supplementary control, grid current control, and total energy control. Immediately after the output voltage calculation, SCU 2 transfers its results to the PCU, where they are required for the output transformation. The resulting arm voltage references are sent to the SCUs as input for the modulation scheme. When the switching states and times of all submodules have been calculated, the SCUs communicate them to the corresponding modulators, which generate the switching commands S_PC1-3 for the switching devices in the PCs. The bidirectional protocol communicates the switching commands to the submodules.
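The resulting cycle can be summarized as the following schedule; the task-to-unit mapping follows Section IV-D and Fig. 7, while the representation itself is only an illustrative sketch.

```python
# One control cycle of the proposed dead-time-free scheme (cf. Fig. 7).
CYCLE = [
    ("all units",      "sample sensors, compute arm capacitor voltage sums (2)"),
    ("SCU 1-3 -> PCU", "DMA transfer: capacitor voltage sums and measurements"),
    ("PCU -> SCU 2",   "DMA transfer: output currents"),
    ("SCU 2",          "output voltage control"),
    ("SCU 1 / SCU 3",  "balancing control / supplementary control"),
    ("PCU",            "PLL, total energy control, input current control"),
    ("SCU 2 -> PCU",   "DMA transfer: output voltage references"),
    ("PCU",            "output transformation"),
    ("PCU -> SCU 1-3", "DMA transfer: arm voltage references"),
    ("SCU 1-3",        "sorting algorithm, modulation, switching commands"),
]

for unit, task in CYCLE:
    print(f"{unit:>15}: {task}")
```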
This procedure eliminates the additional dead times for the output voltage control that are typical of state-of-the-art distributed control strategies such as [24]. It should be noted that the fast communication network presented in Section V-B is a prerequisite for this scheme.
As depicted in Fig. 6, all control units, and therefore all parts of the algorithm, are executed with the same control frequency. Hence, the dead time for the input and output current control is reduced by nine control cycles in comparison to [24], which increases the robustness of the control system significantly.

A. Hardware
The control platform consists of four modular control units that are connected via fiber optics, which is explained in detail in this section. The modular control units are designed based on a Xilinx Zynq 7015. The hardware consists of a main control unit, which can be expanded by fiber optical and analog ports via d-sub connectors. Therefore, one control unit is able to handle 88 digital IOs, 16 analog inputs, and 16 analog outputs.
With several control units working in parallel, the I/O requirements of a large-scale M3C can be met. Using one larger FPGA (e.g., a Zynq UltraScale+ device) instead of several smaller ones would be an alternative approach but would likely come at a higher cost. The complete system is depicted in Fig. 8.
The Zynq-SoC architecture is appealing for high-speed control systems, as it combines an FPGA and two ARM Cortex processors in one package (for the 7015 series). In the proposed design, one CPU runs FreeRTOS as the operating system for the real-time processes. The other CPU runs Linux for asynchronous processes, such as the Ethernet communication for data visualization, modification, and tracing with the LabAnalyser software [33]. Furthermore, Linux allows the use of a variety of available drivers for all kinds of hardware, e.g., temperature sensors.
The CPUs work in asymmetric multiprocessing mode, where only the Linux core uses the L2 cache. The asynchronous communication between Linux and FreeRTOS is realized by shared-memory FIFOs. Both cores are connected to the FPGA by an internal Advanced eXtensible Interface (AXI) bus, which allows high data throughput and low latency [25]. By using four control units, the effective data throughput is multiplied by four. It is therefore possible to communicate large amounts of data between the FPGA and the processor.

Another benefit of these Zynq-SoCs is the integrated multigigabit transceivers (GTs), which are a core element of the distributed control. The 7015 contains four transceivers, whereas other Zynq variants feature up to 16. The transceivers allow the use of standard network fiber-SFP modules to communicate the data between the control units, which ensures resilience to electromagnetic interference and allows cable lengths > 500 m. Therefore, the multi-GTs and fiber-SFP modules are used to connect the control units.

For all analog-to-digital conversions, sigma-delta converters are used; the notches of the corresponding SINC3 filters match the switching frequency. The SINC3 filters, the modulator, a direct memory access (DMA) controller, the real-time process monitoring, the safety features, and the interprocessor communication are implemented in the FPGA. Table II lists the data transferred between the real-time process and the FPGA on one SCU. Due to the AXI bus, the communication of these 380 B takes only 5.7 μs.

B. Communication Network Between Controllers
To communicate between the control units, RAM FIFOs are implemented using the Xilinx AXI Chip2Chip and Aurora 8B/10B IP cores. It should be mentioned that the serialization of the data causes a delay of 106 clock cycles (1.06 μs) for every bus read and write request at 15 m cable length. This delay was measured using the Xilinx Integrated Logic Analyzer (see Fig. 9). Therefore, burst transfers are necessary to achieve a high throughput. For the selected burst size of 256, the chosen FPGA clock of 100 MHz, and the 32-bit system, a theoretical throughput of

\frac{32\,\mathrm{bit} \cdot 256}{1.06\,\mu\mathrm{s} + 256 \cdot 10\,\mathrm{ns}} \approx 2.3\,\mathrm{GBit/s}

is achieved; the complete time interval for one burst transfer is therefore 3.62 μs. As this transfer is handled by a DMA controller, the transfer time can be used for calculations on the CPU. Furthermore, 128 FPGA clock cycles are necessary to parameterize the DMA controller.
The minimum time to transfer 256 × 4 B from one control unit to another is therefore

1.06 μs + 256 · 10 ns + 128 · 10 ns = 4.9 μs.   (14)
As can be seen in Fig. 7, three data transfers whose content is required for the subsequent calculations are performed within one control cycle. During these transfers, only calculations based exclusively on local data can proceed (e.g., the sorting algorithm). Considering (14), the three transfers add up to 3 · 4.9 μs = 14.7 μs. Fig. 10 shows the remaining processing time for a given data transfer time. The higher the control frequency, the larger the share of the cycle time consumed by the data transfer alone. Therefore, longer data transfer times limit the maximum control frequency as well as the complexity of the control algorithm.

To minimize the access time of the CPU, the RAM FIFO is realized in the on-chip memory (OCM) of the Zynq, a static random-access memory. As this memory is tightly coupled to the processor cores through the snoop control unit, the processors can access it very quickly compared to the L2 cache. The complete data path is shown in Fig. 11.
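The timing budget of (14) and its share of the control cycle can be reproduced with the following sketch; the evaluated control frequencies are illustrative.

```python
# Timing budget per (14): serialization delay, burst transfer, and DMA setup.
T_SER   = 1.06e-6        # serialization delay: 106 cycles at 100 MHz, 15 m cable
T_BURST = 256 * 10e-9    # 256-word burst, 10 ns per 32-bit word
T_DMA   = 128 * 10e-9    # 128 clock cycles to parameterize the DMA controller

t_transfer = T_SER + T_BURST + T_DMA
print(f"one transfer: {t_transfer * 1e6:.2f} us")          # 4.90 us

N_TRANSFERS = 3                                            # per control cycle (Fig. 7)
for f_ctrl in (3e3, 8e3, 16e3):
    share = N_TRANSFERS * t_transfer * f_ctrl
    print(f"{f_ctrl / 1e3:4.0f} kHz control frequency: "
          f"{share:6.1%} of the cycle spent on data transfer")
```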

C. Synchronization
As every control unit has its own clock source, the control cycles have to be synchronized. In principle, there are several synchronization options; two of them are described in the following. The first method uses the four high-priority channel interrupts of the Chip2Chip IP core: on a value change of the interrupt inputs, the output of the corresponding Chip2Chip core toggles, with a latency of less than 1 μs. Generating this signal on the PCU at the beginning of each control cycle would be sufficient to achieve the necessary synchronization.
However, a problem arises if sigma-delta modulators are used on the different control units, because the decimation must be performed on a fixed window. As the sigma-delta clock is 20 MHz, the attainable accuracy of about 1 μs of the interrupts is not sufficient to derive the decimator clock.
In that case, the sigma-delta conversion has to run asynchronously to the control cycle, which results in a varying dead time.
The second method uses a common clock signal for all distributed control units. Here, the PCU generates a 20-MHz clock, which is distributed to all SCUs via fiber optics. In the FPGAs, a clocking wizard IP core creates the internal 100-MHz clock from this 20-MHz input clock. This way, all control units feature the same internal clock, and the values of the analog-to-digital converters (ADCs) are synchronous to the control cycle. Furthermore, all real-time processes start with a jitter of less than 10 ns, and every FPGA monitors its corresponding real-time CPU. The first option remains an attractive choice for industrial applications, as low-cost, standard network SFP fiber optics can be used.
With the first option, the varying delay of the sensor values could be minimized, for example, by a SINC3 filter that oversamples the signal, followed by a SINC1 filter that compensates the oversampling. Assuming a control frequency of 3 kHz, the SINC3 decimator could operate at 27 kHz. A moving average (SINC1) filter over the last nine SINC3 values then creates the desired notch at 3 kHz and reduces the dead time, which would only vary by about one-ninth of a cycle period.
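This filter chain can be sketched as follows; the bitstream source, rates, and normalization are illustrative assumptions.

```python
from collections import deque

def sinc3(bitstream, osr: int):
    """Third-order CIC (SINC3) decimator: three cascaded integrators at the
    modulator clock, decimation by osr, three cascaded differentiators."""
    i1 = i2 = i3 = 0
    d1 = d2 = d3 = 0
    out = []
    for n, bit in enumerate(bitstream):
        i1 += bit; i2 += i1; i3 += i2          # integrator stages
        if (n + 1) % osr == 0:                 # decimate
            y1 = i3 - d1; d1 = i3              # differentiator stages
            y2 = y1 - d2; d2 = y1
            y3 = y2 - d3; d3 = y2
            out.append(y3 / osr**3)            # normalize the CIC gain
    return out

def sinc1(samples, taps: int = 9):
    """Moving average over the last `taps` SINC3 outputs; with a 27 kHz
    SINC3 rate and taps = 9, this restores the notch at 3 kHz."""
    window, out = deque(maxlen=taps), []
    for s in samples:
        window.append(s)
        out.append(sum(window) / len(window))
    return out

# Example: a constant 75% bitstream decimated with an (illustrative) osr
stream = [1, 1, 1, 0] * 2048
print(sinc1(sinc3(stream, osr=64))[-1])   # converges toward 0.75
```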
Nonetheless, the second option has been chosen for the current design, as it is more accurate and adds further layers of process monitoring. To synchronize the read and write requests of the controllers, the already mentioned Chip2Chip interrupts are used. Considering Fig. 7, two interrupts are necessary to implement a synchronous data communication. One interrupt from the SCU to the PCU signals that the DMA transfer is finished; it is executed when the DMA transfer "data transfer to primary" is finished and a second time when "read data from primary" is completed. The second interrupt is executed by the PCU after the "output transformation" to signal the SCUs that the DMA transfer shall be initialized.
One further interrupt is used to signal that the control units are ready. All SCUs send this interrupt to the PCU. Then, the PCU sends the ready signal to all SCUs and the control cycle starts.
The investigation of the presented distributed control shows that the complete communication, consisting of 1) reading the sensor data, writing the modulator data, and reading and setting all necessary digital inputs/outputs (see Table II); 2) two DMA transfers with 128 × 4 B per SCU; and 3) two additional DMA transfers of the motor control SCU, takes about 41-48 μs (see Fig. 12). The glitches visible in the plot result from DMA transfers from Linux to the Ethernet PHY, which temporarily block the AXI bus of the processor system. This effect adds a latency of about 7 μs.

D. Communication to Submodules
An asynchronous bidirectional protocol is used for the communication with the submodules. The telegram to a submodule contains the switching state of the full bridge as well as a checksum. The telegram from the submodule contains the current state of the H-bridge, status information, the measured capacitor voltage, and a checksum.
Furthermore, a fast transition to the blocking state of the full bridges can be achieved by applying a constant level on the bus.
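A possible encoding of these telegrams is sketched below; the field layout, field widths, and the XOR checksum are assumptions for illustration, as the article does not specify the exact frame format.

```python
import struct

def pack_downlink(switch_state: int) -> bytes:
    """Telegram to a submodule: switching state of the full bridge plus
    a checksum (here: XOR over the payload bytes, an assumed scheme)."""
    payload = struct.pack("<B", switch_state & 0x03)   # 2-bit state in one byte
    checksum = 0
    for b in payload:
        checksum ^= b
    return payload + bytes([checksum])

def unpack_uplink(frame: bytes):
    """Telegram from a submodule: H-bridge state, status information, and
    the measured capacitor voltage, protected by the same checksum."""
    payload, checksum = frame[:-1], frame[-1]
    calc = 0
    for b in payload:
        calc ^= b
    if calc != checksum:
        raise ValueError("telegram checksum mismatch")
    state = payload[0]                                  # current H-bridge state
    status = payload[1]                                 # status information
    (u_cap_raw,) = struct.unpack("<H", payload[2:4])    # capacitor voltage (raw ADC value)
    return state, status, u_cap_raw

# Example round trip with an assumed 5-byte uplink frame
frame = bytes([1, 0x80, 0x34, 0x12]) + bytes([1 ^ 0x80 ^ 0x34 ^ 0x12])
print(unpack_uplink(frame))   # -> (1, 128, 0x1234)
```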

E. Tracer Framework
The control algorithm is generated using MATLAB/Simulink and the Embedded Coder. Data tracing and visualization are important aspects of an experimental setup; therefore, the control units are equipped with a data tracing and modification framework. All data that are traced with scopes in the corresponding Simulink model are also traced in the real-time process. The data are transferred via shared memory to Linux and from this CPU core via Ethernet to the LabAnalyser [33]. This scheme enables a real-time data transfer of up to 400 MBit/s from each control unit, or 1.6 GBit/s for the complete distributed control platform. Therefore, all sensor values and states of the converter control can be monitored simultaneously, which is especially beneficial for rapid prototyping.

VI. TEST BENCH
To verify the correct operation of the distributed control platform, a 15 kvar M3C with N = 12 series-connected submodules per arm is considered. Fig. 13 (left) presents an overview of the laboratory setup. Each PC, consisting of 36 submodules and three arm inductors, is located in its own cabinet together with one SCU (see Fig. 13, right).
An additional cabinet contains the necessary contactors, the measurements outside of the converter, as well as the PCU. Table III lists the basic parameters of the converter.
A grid-forming converter, a Cinergia GE 15+, serves as the interface between the M3C and the grid. It emulates a three-phase 400 V/50 Hz grid and enables the investigation of the M3C under specific grid conditions. The output of the M3C is connected to a passive resistive-inductive load, a simple equivalent circuit sufficient to prove the function of the converter and the control platform. The parameters of the load are summarized in Table IV. Section VII-B shows the results for a voltage step with the connected passive load. The displayed electrical measurements are the sigma-delta-filtered values used by the control units.

VII. EXPERIMENTAL RESULTS

A. Arm Quantities
Fig. 14 displays the arm voltage for an output voltage of 850 V at 15 Hz, while the grid voltage is 400 V at 50 Hz. No load is connected to the output of the converter. The arm voltage contains both the input and the output frequency. The multilevel operation results in a 25-level arm voltage (2N + 1 = 25 levels for N = 12 full-bridge submodules). This experiment verifies the functionality of the modulation scheme as well as the communication between the control units.

B. Load Step
As mentioned in Section I, the M3C is especially advantageous for DFIG applications. An output frequency of 15 Hz corresponds to a slip of 30% in a 50-Hz two-pole DFIG configuration, which is typical for such an application. Fig. 15 shows the experimental results of a voltage step of 360 V at an output frequency of 15 Hz at t = 0 s. Before the load step, a reactive grid current of 2 A is present.
At t = 0 s, the reactive grid current is reduced to 0 A. Fig. 15(a) and (b) show the output currents and the dq components of the grid current. The d component of the grid current rises, since the active power increases when the load is connected to the converter. For clarity, only three of the nine arm currents are shown in Fig. 15(c). The arm currents contain different frequency components, which can be identified as the fundamentals of the output and input frequencies. The arm capacitor voltages are shown in Fig. 15(d). Since all arm capacitor voltages are controlled to 1080 V, the distributed energy control operates correctly.

C. CPU Utilization
At a sampling and control frequency of 3 kHz, the CPU utilization of the control units is about 20%. Thus, the control units have sufficient reserves for more complex or extensive control tasks and for an increased sampling and control frequency.

VIII. CONCLUSION
Both the control algorithms and the control platform are among the main challenges for the implementation of modular multilevel converters. This article introduced a novel distributed control platform for the M3C. To prove the functionality of the control platform and to investigate its performance, a 15 kvar M3C test bench with 108 submodules was developed; its submodule count exceeds that reported in the literature [4] by a factor of 2.4. Distributed control platforms commonly have to sacrifice performance because of communication delays between the distributed control units. To reduce the negative influence of these delays, a fast communication network based on optical SFP modules was realized. The control platform consists of four modular control units: one PCU and three SCUs. All four control units are able to carry out high-level control tasks, thereby maximizing the computing power of the distributed control system. A data throughput of 1.6 GBit/s enables access to all sensor values and control states in real time, which is advantageous for rapid prototyping and condition diagnosis. The experimental results verify the correct operation of the converter and the control platform.