Hybrid Parallel-in-Time-and-Space Transient Stability Simulation of Large-Scale AC/DC Grids

The increasing complexity of modern AC/DC power systems poses a significant challenge to the fast solution of large-scale transient stability simulation problems. This paper proposes a hybrid parallel-in-time-and-space (PiT+PiS) transient simulation on a CPU-GPU platform to thoroughly exploit parallelism from both the temporal and spatial perspectives, thereby fully utilizing parallel processing hardware. The respective electromechanical and electromagnetic characteristics of the AC and DC grids demand a combination of transient stability (TS) simulation and electromagnetic transient (EMT) simulation to reflect both system-level and equipment-level transients. In the co-simulation, the TS simulation is performed on GPUs, while the Parareal parallel-in-time (PiT) scheduling and the EMT simulation are conducted on CPUs. The heterogeneous CPU-GPU scheme can therefore utilize asynchronous computing features to offset the data transfer latency between different processors. Higher scalability and extensibility than a GPU-only dynamic-parallelism design are achieved by utilizing concurrent GPU streams for coarse-grid and fine-grid computation. A synthetic AC/DC grid based on the IEEE 118-bus and CIGRÉ DCS2 systems shows good accuracy compared with the commercial TSAT software, and a speedup of 165 is achieved for a case with 48 IEEE 118-bus systems and 192 detail-modeled 201-level MMCs. Furthermore, the proposed method is also applicable to a multi-GPU implementation, where it demonstrates decent efficacy.

MODERN power systems are increasingly complex due to the continuous integration of power electronic facilities such as high voltage direct current (HVDC) links into transmission and distribution networks. Many HVDC projects are constructed or planned worldwide to integrate more clean energy from wind farms and PV stations, such as the Changji-Guquan UHVDC project in China [1], the DolWin5 project in Europe [2], and the TransWest project in the USA [3]. Using a typical step size of a few milliseconds, transient stability (TS) analysis plays an important role in the planning, design, and operation of a modern power grid from a system point of view. It is, however, a positive-sequence-based analytical method that naturally cannot capture the complicated electromagnetic transient (EMT) details of power electronic devices at the microsecond level or below. The TS-EMT co-simulation methodology, which properly captures both system-level and equipment-level power system phenomena, is therefore favored in hybrid AC/DC grid studies. The resulting rise in computational burden, however, may extend the simulation duration considerably, especially since future AC/DC power systems are expected to expand dramatically as more components are incorporated.
To handle the increasing scale and complexities, new acceleration techniques for TS and EMT simulation programs are desired. TS acceleration methods based on parallel processing algorithms for multi-core CPUs and many-core GPUs have shown decent efficiency and have been well investigated in AC power grid studies [4]-[7], and heterogeneous CPU-GPU computing architectures for AC/DC grid TS-EMT co-simulation have recently been proposed [8]-[10], while the thread concurrency of these methods is dominantly contributed by parallel-in-space (PiS) strategies. Parallel-in-time (PiT) solutions were also investigated from a different perspective of parallel processing [11]. The very early versions of PiT for solving TS problems were mainly based on Jacobi decomposition [12], [13], which aimed to solve multiple consecutive steps iteratively so that parallelism was achieved by assigning a single step to each parallel thread. In the 1990s, most results were obtained from virtual parallel machines because of the scarcity of multi-core CPUs, which limited further exploration in this area. The Parareal algorithm has emerged in many research areas [14], [15] where it exhibited efficacy [16]. It was introduced to solve TS simulation problems by decomposing the initial value problem into many sub-intervals [17], and has better efficiency than its predecessors. These works mainly focused on PiT algorithms, and a comprehensive exploitation of parallelism that also considers PiS methods simultaneously is yet to be carried out. For example, a four-times speedup was obtained in computing the IEEE 39-bus system [17] with 470 cores, and even when a huge number of cores was used to solve a large-scale power system, the parallel efficiency was below 20% [18], which is still not as satisfactory as PiS methods.
This paper proposes a hybrid parallel-in-time-and-space (PiT+PiS) AC/DC TS-EMT co-simulation method which thoroughly exploits parallelism to expedite the simulation of modern power systems on many-core GPUs. It demonstrates: (1) High level of parallelism: PiT+PiS can achieve a higher speedup and efficiency than PiS-only or PiT-only methods; (2) Efficient CPU-GPU communication implementation: asynchronous unified CUDA memory is utilized to offset CPU-GPU communication overheads; (3) High scalability: an asynchronous multi-stream design handles subsystems and schedules the workload on multiple GPUs/CPUs; (4) Multi-fold hybrid co-simulation: by applying the PiT+PiS method to TS-EMT co-simulation on CPU and GPU, the impacts of the AC/DC system can be analyzed in a fast and accurate way. This paper is organized as follows: Section II introduces the multi-mass synchronous machine model, the MMC model, and the theoretical speedup analysis of the PiT and PiT+PiS methods; Section III introduces the implementation of the hybrid CPU-GPU PiT+PiS algorithm for AC/DC co-simulation; Section IV presents the case studies and performance comparison; Section V concludes the paper.

A. Multi-Mass Torsional Shaft Generator Model
The generator equations comprise three components: the synchronous machine, the mechanical multi-mass torsional shaft, and the control system, which together form a 17th-order model.
As shown in Fig. 1, the synchronous machine includes two windings on the d-axis and two damping windings on the q-axis, which is classified as Model 2.2 in IEEE Std 1110-2019 [19]. The machine model can be expressed by the flux-linkage differential equations [20]:

$$\dot{\Psi}_{fd} = \omega_R \left(e_{fd} - R_{fd}\, i_{fd}\right), \quad \dot{\Psi}_{1d} = -\omega_R R_{1d}\, i_{1d}, \quad \dot{\Psi}_{1q} = -\omega_R R_{1q}\, i_{1q}, \quad \dot{\Psi}_{2q} = -\omega_R R_{2q}\, i_{2q}, \qquad (1)$$

where Ψ_fd and e_fd are the flux linkage and field voltage of the field winding, R_fd is the field winding resistance; Ψ_1d, Ψ_1q, Ψ_2q are the flux linkages of the direct- and quadrature-axis damper windings; R_1d, R_1q, and R_2q are the resistances of the direct- and quadrature-axis damper windings; and ω_R is the rated rotating speed. All quantities use the per-unit system defined in [20] except the time t, since the time-domain simulation uses seconds as the time unit. The currents in the dq frame are coupled with the external power system equations because the stator currents must be obtained. To simplify the computation, an iterative method is used to handle the coupling [6].
The mechanical shaft, on the other hand, provides the basic rotor equations of the machine. Since the synchronous generator is connected to the turbine, the four-mass turbine shaft model is used, as shown in Fig. 2, where δ_n is the relative rotor angle of each mass and, specifically, δ_1 is the generator rotor angle. The stiffness coefficient between two neighboring masses is represented by the parameter K, e.g., K_{4,5} represents the coefficient between Mass 5 and Mass 4. T_n, D_n, and H_n refer to the torque, damping factor, and inertia constant of each mass, respectively, as given by (terms involving a nonexistent neighboring mass are omitted):

$$2H_n\, \Delta\dot{\omega}_n(t) = T_n - D_n\, \Delta\omega_n(t) - K_{n-1,n}\big(\delta_n(t) - \delta_{n-1}(t)\big) - K_{n,n+1}\big(\delta_n(t) - \delta_{n+1}(t)\big), \qquad (2)$$
$$\dot{\delta}_n(t) = \omega_R \cdot \Delta\omega_n(t), \quad n = 2, 3, 4, 5. \qquad (3)$$

The control system includes the PSS, excitation, and AVR, which corresponds to the ST1A model in [21]:

$$\dot{v}_1 = \frac{v_t - v_1}{\tau_R}, \qquad \dot{v}_2 = K_{stab}\, \Delta\dot{\omega}_1 - \frac{v_2}{\tau_\omega}, \qquad \dot{v}_3 = \frac{\tau_1}{\tau_2}\, \dot{v}_2 + \frac{v_2 - v_3}{\tau_2}, \qquad (4)$$

where τ_R, τ_ω, τ_1, τ_2, and K_stab are constants, v_t is the terminal voltage, and v_1, v_2, v_3 are state variables. The generator equations and the power system network equations constitute the differential-algebraic equations (DAEs) of transient stability simulation, which can be expressed as

$$\dot{x} = f(x, u), \qquad 0 = g(x, u), \qquad (5)$$

where x is the 17-dimensional generator state vector, u = {i_1d, i_1q, i_2q, e_fd, i_fd, v_dq} is the vector of system inputs, f is the vector function of equations (1)-(4), and g represents the algebraic equations from which the input vector u is solved.
In addition, the stator equations [20], [22] are necessary to solve the interface voltages and currents along with the external grid, making the simulation a DAE problem. The generator's variables are in the d- and q-axes, so the Park transformation is involved in solving the DAE [23], which makes it a nonlinear problem. As shown in Fig. 2, a partitioned iterative method [6] is used to solve the DAE, which decouples the generators from the main circuits.
With the implicit Trapezoidal rule, an individual generator has the following discretized equation:

$$x_{n+1} = x_n + \frac{h}{2}\left[f(x_n, u_n) + f(x_{n+1}, u_{n+1})\right], \qquad (6)$$

where n indicates the number of discrete steps and h is the time-step. Since solving such an implicit equation requires f_{n+1} at x_{n+1} and u_{n+1}, implicit methods require the Jacobian matrix J = ∂f/∂x. The generator equations can be expressed in the state-space form

$$\dot{x} = A x + B u, \qquad (7)$$

where A and B are the coefficient matrices of equations (1)-(4). Then the equation to solve x_{n+1} can be expressed as

$$\left(E - \frac{h}{2}A\right) x_{n+1} = \left(E + \frac{h}{2}A\right) x_n + \frac{h}{2} B \left(u_n + u_{n+1}\right), \qquad (8)$$

where E is the identity matrix and (E − (h/2)A) is the Jacobian matrix. If f_{n+1} is nonlinear, A is no longer constant, so Newton-Raphson iteration is required. Otherwise, it can be solved without local iteration [24].
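To make the discretized solve concrete, the following is a minimal plain-C++ sketch of one implicit Trapezoidal step per (6)-(8), assuming constant A and B and a small dense state vector; the naive Gaussian-elimination solver and the function names are illustrative, not the authors' implementation.

```cpp
// One implicit-Trapezoidal step of x' = A x + B u:
// (E - h/2 A) x_{n+1} = (E + h/2 A) x_n + h/2 B (u_n + u_{n+1}).
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;

// Solve M y = r by naive Gaussian elimination (sketch only, no pivoting safeguards).
static Vec solveDense(Mat M, Vec r) {
    const int n = static_cast<int>(r.size());
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            double fct = M[i][k] / M[k][k];
            for (int j = k; j < n; ++j) M[i][j] -= fct * M[k][j];
            r[i] -= fct * r[k];
        }
    Vec y(n);
    for (int i = n - 1; i >= 0; --i) {
        double s = r[i];
        for (int j = i + 1; j < n; ++j) s -= M[i][j] * y[j];
        y[i] = s / M[i][i];
    }
    return y;
}

// Returns x_{n+1} for one generator; with constant A the Jacobian (E - h/2 A)
// is constant, so no local Newton-Raphson iteration is needed.
Vec trapezoidalStep(const Mat& A, const Mat& B, const Vec& xn,
                    const Vec& un, const Vec& un1, double h) {
    const int n = static_cast<int>(xn.size()), m = static_cast<int>(un.size());
    Mat lhs(n, Vec(n, 0.0));
    Vec rhs(n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double eij = (i == j) ? 1.0 : 0.0;
            lhs[i][j] = eij - 0.5 * h * A[i][j];
            rhs[i]   += (eij + 0.5 * h * A[i][j]) * xn[j];
        }
        for (int k = 0; k < m; ++k) rhs[i] += 0.5 * h * B[i][k] * (un[k] + un1[k]);
    }
    return solveDense(lhs, rhs);
}
```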

B. MMC Model
To model the DC power systems, an equivalent-circuit-based MMC model for EMT simulation is adopted in this work. For the concurrent implementation of the half-bridge submodules (HBSMs), an artificial delay is assumed between the submodules and the main circuit to decouple each submodule and significantly reduce the number of circuit nodes while details are still preserved, as shown in Fig. 4(a). The ON and OFF states of the IGBTs and diodes are represented by low- and high-impedance branches, respectively.
Each submodule is solved independently, which indicates that the MMC computation is transformed into solving a number of 2×2 matrix equations for v_sm and v_c of each SM. At an arbitrary time instant n, the two unknown nodes, represented by the capacitor voltage v_c and the submodule port voltage v_sm, can be solved from the following discretized nodal equations:

$$\begin{bmatrix} g_{sw1}+g_{sw2} & -g_{sw1} \\ -g_{sw1} & g_{sw1}+G_{ceq} \end{bmatrix}\begin{bmatrix} v_{sm} \\ v_{c} \end{bmatrix} = \begin{bmatrix} i_{sm} \\ i_{ceq} \end{bmatrix}, \qquad (9)$$

where G_ceq and i_ceq denote the capacitor equivalent admittance and its companion current, g_sw1 and g_sw2 denote the equivalent admittances of the two IGBT/diode pairs, i_sm is the submodule port current, and i_c is the capacitor current. The conductances g_sw1 and g_sw2 are determined by

$$g_{sw1} = v_g\, G_{on} + (1 - v_g)\, G_{off}, \qquad g_{sw2} = (1 - v_g)\, G_{on} + v_g\, G_{off}, \qquad (10)$$

where the gate signal v_g has been converted into a binary value in the simulation, and G_on and G_off are the turn-on and turn-off conductances. The SM circuit can also use more detailed device-level IGBT models instead of ideal two-state switches.
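As an illustration of the per-SM computation, the sketch below solves one HBSM with the two-state switch model; the node ordering, sign conventions, and Trapezoidal companion update (G_ceq = 2C/Δt) are assumptions for this example rather than the paper's exact formulation.

```cpp
// Solve one half-bridge submodule (2x2 nodal system) per time step.
struct HBSM {
    double vSM = 0.0, vC = 0.0, iC = 0.0;   // port voltage, capacitor voltage/current
    double iCeq = 0.0;                      // capacitor companion (history) current
};

void solveSubmodule(HBSM& sm, double iPort, int vGate,
                    double Gceq, double Gon, double Goff) {
    // Complementary two-state switch conductances selected by the binary gate.
    double gSw1 = vGate ? Gon : Goff;       // switch in series with the capacitor
    double gSw2 = vGate ? Goff : Gon;       // bypass switch across the SM terminals

    // [ g1+g2   -g1   ] [vSM]   [iPort]
    // [ -g1   g1+Gceq ] [vC ] = [iCeq ]   -> solved by Cramer's rule.
    double a = gSw1 + gSw2, b = -gSw1;
    double c = -gSw1,       d = gSw1 + Gceq;
    double det = a * d - b * c;
    sm.vSM = ( d * iPort - b * sm.iCeq) / det;
    sm.vC  = (-c * iPort + a * sm.iCeq) / det;

    // Capacitor current and Trapezoidal history update for the next step.
    sm.iC   = Gceq * sm.vC - sm.iCeq;
    sm.iCeq = Gceq * sm.vC + sm.iC;         // i.e. 2*Gceq*vC - previous history
}
```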
Since the equivalent-circuit MMC model preserves individual submodules, equipment-level transients such as switching harmonics and capacitor energy balancing can be studied in the EMT simulation [25]. Since the HBSM topologies under different switching states are known from the submodule port current and control signals, the inverted SM admittance matrices are cached to speed up the computation. After solving v_sm of each SM, all SMs in an arm can be lumped into a voltage source, which can be further simplified to a reduced Norton equivalent circuit by merging the arm-choke inductor, as shown in Fig. 4(b). The nearest level modulation (NLM) is used as the MMC internal controller to maintain stable submodule capacitor voltages. The number of inserted submodules n_ins is determined by

$$n_{ins} = \mathrm{round}\left(n_{sm} \cdot v_{ref}\right), \qquad (11)$$

where n_sm is the number of SMs in one arm and v_ref is the reference voltage in per unit. After n_ins is obtained, the lower-level voltage balancing controller uses it to generate gate signals for each SM; the voltage balancing is based on a widely used sorting strategy [26]. The MMC main circuit under nodal analysis is shown in Fig. 4(c). Compared with the submodules, it has a minimum of five nodes even after converting each arm to its Norton equivalent circuit form, so an instant solution like that of the submodules is not readily available, which causes an artificial delay between the main circuit and the submodules. The KLU method [27] is adopted to solve the main circuit matrix equation.
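The NLM and sorting-based balancing step described above can be sketched as follows; the per-unit convention for v_ref and the insertion rule (a charging arm current favors the lowest-voltage SMs) are assumed conventions for this example.

```cpp
// Nearest level modulation plus sorting-based capacitor-voltage balancing.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

std::vector<int> nlmAndBalance(const std::vector<double>& vCap,  // per-SM capacitor voltages
                               double vRef,                      // per-unit arm reference
                               double iArm) {                    // arm current
    const int nSM = static_cast<int>(vCap.size());
    int nIns = static_cast<int>(std::lround(vRef * nSM));        // Eq. (11): nearest level
    nIns = std::max(0, std::min(nSM, nIns));

    // Rank SMs by capacitor voltage: charging current inserts the lowest-voltage
    // SMs first, discharging current the highest-voltage ones.
    std::vector<int> order(nSM);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return (iArm > 0.0) ? vCap[a] < vCap[b] : vCap[a] > vCap[b];
    });

    std::vector<int> gate(nSM, 0);                               // 1 = inserted, 0 = bypassed
    for (int k = 0; k < nIns; ++k) gate[order[k]] = 1;
    return gate;
}
```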

C. Parareal Algorithm
The Parareal algorithm decomposes a large time interval into many small sub-intervals and solves the corresponding differential equations in parallel; it is therefore considered an iterative multiple-shooting approach [28]. As the differential equations present an initial value problem (IVP), initialization is mandatory for the solution process, and a fast serial predictor is used to provide the initial conditions, which divides the problem into a serial coarse grid and a parallel fine grid.
The Parareal algorithm for an overall N intervals can be expressed by the following nonlinear system of equations:

$$U_j - F\left(U_{j-1}\right) = 0, \quad j = 1, 2, \ldots, N, \qquad (12)$$

where U_0 is a vector containing the known initial values and F is the fine operator that propagates the solution over one sub-interval. By applying Newton's method, the following iterative equation for each U_j is obtained:

$$U_j^{k+1} = F\big(U_{j-1}^{k}\big) + \frac{\partial F}{\partial U}\Big|_{U_{j-1}^{k}}\big(U_{j-1}^{k+1} - U_{j-1}^{k}\big), \qquad (13)$$

where ∂F/∂U is the derivative of the operator function F, written in a discrete form as:

$$\frac{\partial F}{\partial U}\Big|_{U_{j-1}^{k}}\big(U_{j-1}^{k+1} - U_{j-1}^{k}\big) \approx F\big(U_{j-1}^{k+1}\big) - F\big(U_{j-1}^{k}\big). \qquad (14)$$

If F itself is used to obtain this derivative term, the scheme is identical to the normal sequential solution with F. However, the Quasi-Newton method can be used to approximate the Jacobian, which can naturally be done by an operator G with a larger time-step, so that the derivative can be generated to make the PiT computing feasible:

$$\frac{\partial F}{\partial U}\Big|_{U_{j-1}^{k}}\big(U_{j-1}^{k+1} - U_{j-1}^{k}\big) \approx G\big(U_{j-1}^{k+1}\big) - G\big(U_{j-1}^{k}\big). \qquad (15)$$

The discrete sparse time points processed by G at {T_1, T_2, ..., T_N} constitute the coarse grid; the discrete time points processed by F in [T_1, T_N] constitute the fine grid. Since the serial G produces the approximations for each sub-interval [T_{j-1}, T_j], F can then be executed on all sub-intervals in parallel. G can also be a faster integration method such as the forward or backward Euler method, while F is a more time-consuming method such as the Trapezoidal or Runge-Kutta method. In this work, the Trapezoidal integration method is used for both operators, which gives

$$F:\; x_{n+1} = x_n + \frac{h_F}{2}\left[f(x_n, u_n) + f(x_{n+1}, u_{n+1})\right], \qquad (16)$$
$$G:\; x_{n+1} = x_n + \frac{h_G}{2}\left[f(x_n, u_n) + f(x_{n+1}, u_{n+1})\right], \qquad (17)$$

where f refers to the aforementioned equations (1)-(4), and h_F and h_G are the fine-grid and coarse-grid time-steps, respectively. The only difference between F and G is the size of the time-step; both use the same Trapezoidal method. Substituting ∂F/∂U with the approximation (15), the following equation is obtained to conduct the predict-correct iteration between the coarse and fine grids:

$$U_j^{k+1} = F\big(U_{j-1}^{k}\big) + G\big(U_{j-1}^{k+1}\big) - G\big(U_{j-1}^{k}\big), \qquad (18)$$

which was proven to be a Quasi-Newton method in [29]. As shown in Fig. 5, the Parareal algorithm has four major progressions: (a) Initialization: the initial guess is generated by the coarse operator; (b) Fine-Grid Parallel Operation: the fine-grid workers produce the fine-grid solutions concurrently from the coarse-grid initial values; (c) Solution Refinement: the coarse-grid operator produces new predictions and corrects the solutions at T_j using (18); (d) Finalization: if the error is smaller than the tolerance, the fine-grid workers generate the final converged solutions; otherwise, the algorithm returns to step (b) and continues the Parareal iterations.
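The four progressions map onto a straightforward loop. The following serial C++ sketch implements the predict-correct iteration of (18) for one time window; the propagators F and G are user-supplied solvers (here assumed to be the same Trapezoidal integrator with fine and coarse steps), and in the proposed method the inner fine-grid loop is what runs concurrently on GPU streams.

```cpp
// Parareal over N sub-intervals of width dT starting at T0.
#include <cmath>
#include <functional>
#include <vector>

using State = std::vector<double>;
using Propagator = std::function<State(const State&, double /*t0*/, double /*t1*/)>;

std::vector<State> parareal(const State& U0, double T0, double dT, int N,
                            Propagator F, Propagator G, int maxIter, double tol) {
    std::vector<State> U(N + 1, U0), Gprev(N + 1, U0);

    // (a) Initialization: serial coarse sweep provides the initial guess.
    for (int j = 1; j <= N; ++j)
        U[j] = Gprev[j] = G(U[j - 1], T0 + (j - 1) * dT, T0 + j * dT);

    for (int k = 0; k < maxIter; ++k) {
        // (b) Fine-grid operation: independent per sub-interval (parallel in the paper).
        std::vector<State> Ufine(N + 1);
        for (int j = 1; j <= N; ++j)
            Ufine[j] = F(U[j - 1], T0 + (j - 1) * dT, T0 + j * dT);

        // (c) Refinement: serial coarse sweep plus correction U_j = F + G_new - G_old.
        double err = 0.0;
        for (int j = 1; j <= N; ++j) {
            State Gnew = G(U[j - 1], T0 + (j - 1) * dT, T0 + j * dT);
            State Uold = U[j];
            for (std::size_t i = 0; i < U0.size(); ++i) {
                U[j][i] = Ufine[j][i] + Gnew[i] - Gprev[j][i];
                err = std::max(err, std::fabs(U[j][i] - Uold[i]));
            }
            Gprev[j] = Gnew;
        }
        // (d) Finalization: stop once the window has converged.
        if (err < tol) break;
    }
    return U;
}
```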

D. Theoretical Speedup Analysis
Assuming a system with a fixed workload of n · w, where n is the total number of fine-grid time-steps of the simulation and w is the single-step execution time of the DAE solution, and assuming the same integration method for the fine grid and the coarse grid, according to Amdahl's law [30] the speedup of the PiT algorithm can be expressed as

$$S_{pit} = \frac{n \cdot w}{(I+1) \cdot p \cdot w + I \cdot m \cdot w} = \frac{m \cdot p}{(I+1) \cdot p + I \cdot m}, \qquad (19)$$

where S_pit is the speedup of Parareal, m is the number of fine-grid steps per sub-interval, I is the iteration number, p is the number of PiT processors, and n = mp. The number of parallel processors is related to the sizes of the time-steps and time-windows, which means more processors add difficulties to convergence, resulting in degraded speedup; the theoretical speedup limit is bounded by min{m/(I+1), p/I}, which creates a tradeoff between convergence and parallelism [31].
The parallel efficiency E_pit of the Parareal algorithm is

$$E_{pit} = \frac{S_{pit}}{p} = \frac{m}{(I+1) \cdot p + I \cdot m} < \frac{1}{I}, \qquad (20)$$

which indicates that the maximum parallel efficiency is smaller than 50% due to the fact that I ≥ 2.
In practice, m must be a large number to compensate for the overheads caused by the coarse grid and synchronizations. When m/(I+1) ≫ p/I, the theoretical speedup upper limit can be seen as p/I, but this limit is still significantly constrained by convergence. An alternative is to use a limited p (10 in this work) and a small time-window (10 ms × 10 = 0.1 s in this work).
The PiT+PiS method retains spatial parallelism in each solution step of the PiT algorithm to improve the maximum parallel efficiency achieved by either method alone. The speedup of PiT+PiS is given by

$$S_{pit+pis}(p_1, p_2) = S_{pis}(p_1) \cdot \frac{p_2 \cdot m \cdot w_{pis}}{(I+1) \cdot p_2 \cdot w_{pis} + I \cdot m \cdot w_{pis}}, \qquad (21)$$

where w_pis is the single-step execution time with spatial parallel methods, p_1 is the number of parallel processors for PiS, and p_2 is the number of parallel processors for PiT. Equation (21) indicates that, compared to the PiT-only or PiS-only method, the PiT+PiS method can utilize more parallel processors to solve a problem of fixed size. The parallel efficiency of PiT+PiS is expressed by

$$E_{pit+pis}(p_1, p_2) = \frac{S_{pit+pis}(p_1, p_2)}{p_1 \cdot p_2} = E_{pis}(p_1) \cdot E_{pit}(p_2), \qquad (22)$$

where E is the parallel efficiency of each parallel method. In practice, I(p_2) < I(p_1 p_2) is very common, and therefore E_pit+pis(p_1, p_2) > E_pit(p_1 p_2). It is also possible to achieve E_pit+pis(p_1, p_2) > E_pis(p_1 p_2) when the PiS speedup saturates at p_1 threads.
For illustration, the PiS speedup S_pis(p_1) is assumed to saturate at 8 threads, which indicates that the system can utilize at most 8 threads for spatial parallel computing. The PiT method is assumed to have m = 100, p = 10, and I(p) = 2 + int(p/10), which means that the iteration number increases by 1 for every 10 additional threads. Substituting these parameters with p_1 = {1, 2, ..., 200} into (19)-(22) gives the results shown in Fig. 6. Clearly, PiT+PiS achieves better performance than the PiT-only and PiS-only methods. Also, the PiT+PiS method requires more threads, so it is well suited to GPUs with thousands of CUDA cores.
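This numerical illustration can be reproduced with a few lines of code. The sketch below evaluates (19)-(22) under the stated parameters; modeling the 8-thread PiS saturation with an Amdahl-law curve (7/8 parallel fraction) is an assumption of this example, not the paper's exact PiS model.

```cpp
// Evaluate the theoretical PiT+PiS speedup and efficiency model.
#include <cstdio>

double speedupPiT(double m, double p, double I) {
    return (m * p) / ((I + 1.0) * p + I * m);              // Eq. (19)
}

int main() {
    const double m = 100.0, p2 = 10.0;
    const double I = 2.0 + static_cast<int>(p2 / 10.0);    // iteration-count model from the text
    for (int p1 : {1, 2, 4, 8, 16, 32}) {
        double Spis = 1.0 / (0.125 + 0.875 / p1);          // assumed PiS curve saturating near 8x
        double S    = Spis * speedupPiT(m, p2, I);         // Eq. (21), w_pis cancels
        double E    = S / (p1 * p2);                       // Eq. (22)
        std::printf("p1=%3d  total threads=%4d  S=%7.2f  E=%5.1f%%\n",
                    p1, p1 * static_cast<int>(p2), S, 100.0 * E);
    }
    return 0;
}
```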

III. AC/DC GRID PARALLEL-IN-TIME-AND-SPACE CO-SIMULATION
To establish an AC/DC PiT+PiS co-simulation program, a software hierarchy shown in Fig. 7 is proposed. Generally, the AC TS system solver and HVDC EMT system solver are developed independently and connected together via an interface.

A. GPU PiT+PiS Programming
The thousands of CUDA cores of an NVIDIA GPU are grouped into dozens of streaming multiprocessors (SMs), which are responsible for scheduling instructions, as shown in Fig. 8. The clock frequency of GPU cores is much lower than that of most prevalent CPUs, and the instruction cycle is accordingly longer. Since the GPU is designed for large throughput and massively parallel computation, the TS system must be parallelized to achieve performance comparable to its CPU counterpart. Therefore, the TS simulation implementation on the GPU in this work becomes a unified PiT+PiS scheme.
The pure-GPU computing architecture has been utilized in [6], [7]. However, a pure-GPU Parareal algorithm requires dynamic parallelism and involves inefficient, complex stream synchronizations. For example, the parent coarse-grid kernel under dynamic parallelism locks GPU resources when launching child grids, which severely limits scalability; it is also very slow to perform complex condition checks and loops on GPUs. To avoid these degradations, the multi-stream CPU-GPU hybrid PiT+PiS scheme shown in Fig. 9(a) is proposed, which not only utilizes more resources of a single GPU but also enables a multi-GPU architecture. The general pseudocode is attached in Appendix B.
The streams are pre-defined in the initialization stage, and the kernel launching is performed on the CPU. In Fig. 9(a), the coarse-grid stream is labeled as the G stream and the fine-grid streams are labeled as F streams. For the xth coarse-grid step, its prediction is used by the (x+1)th fine-grid kernel function, so the xth coarse-grid and xth fine-grid kernels can be launched at the same time. To maximize concurrency, the refinement of the state variables U is integrated into the loop so it can overlap with fine-grid kernel executions. Since U is used in the current iteration to launch the (x+1)th fine-grid kernel function, the coarse-grid operation must be synchronized before the (x+1)th loop iteration begins, while the fine-grid F streams are fully concurrent with each other and with the G stream. After all coarse operations are finished, the algorithm performs a device-wide synchronization to obtain all fine-grid results, which are required in the next iteration. Due to the multi-stream architecture, multi-GPU execution becomes easier since the streams can be assigned to a single GPU or to multiple GPUs. The memory migration problem is handled implicitly by CUDA unified memory. The GPU multi-stream execution graph of the proposed algorithm implementation is shown in Fig. 9(b).
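A minimal CUDA sketch of this scheduling pattern is given below; the kernel names, arguments, and per-call stream creation are placeholders (in the actual design the streams are pre-defined during initialization), and the coarse kernel is assumed to fold in the refinement of U.

```cpp
// One Parareal iteration scheduled with a coarse-grid G stream and per-sub-interval F streams.
#include <cuda_runtime.h>
#include <vector>

__global__ void coarseStepKernel(double* U, int step)                   { /* G: one coarse step + refinement */ }
__global__ void fineSweepKernel (double* U, double* Uf, int subInterval) { /* F: m fine steps of one sub-interval */ }

void pararealIterationOnGpu(double* dU, double* dUfine, int N, dim3 grid, dim3 block) {
    cudaStream_t gStream;
    cudaStreamCreate(&gStream);
    std::vector<cudaStream_t> fStreams(N);
    std::vector<cudaEvent_t>  coarseDone(N);
    for (int j = 0; j < N; ++j) {
        cudaStreamCreate(&fStreams[j]);
        cudaEventCreateWithFlags(&coarseDone[j], cudaEventDisableTiming);
    }

    // Sub-interval 0 uses the known initial value, so its fine sweep can start immediately.
    fineSweepKernel<<<grid, block, 0, fStreams[0]>>>(dU, dUfine, 0);

    for (int x = 0; x < N; ++x) {
        // Coarse prediction/refinement for step x on the G stream.
        coarseStepKernel<<<grid, block, 0, gStream>>>(dU, x);
        cudaEventRecord(coarseDone[x], gStream);

        // Fine sweep x+1 waits only on coarse step x; it overlaps later coarse steps
        // and all other fine sweeps.
        if (x + 1 < N) {
            cudaStreamWaitEvent(fStreams[x + 1], coarseDone[x], 0);
            fineSweepKernel<<<grid, block, 0, fStreams[x + 1]>>>(dU, dUfine, x + 1);
        }
    }
    // Device-wide synchronization before the next Parareal iteration.
    cudaDeviceSynchronize();

    for (int j = 0; j < N; ++j) {
        cudaStreamDestroy(fStreams[j]);
        cudaEventDestroy(coarseDone[j]);
    }
    cudaStreamDestroy(gStream);
}
```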
The memory resource allocation and management are also moved to the CPU, which is much easier with the Thrust library [32]. Thrust is a GPU-accelerated library of C++ parallel algorithms and data structures; it provides host_vector and device_vector classes as drop-in replacements for the C++ std::vector class, which can easily allocate, resize, and transfer memory data between CPUs and NVIDIA GPUs; it also provides a set of C++ standard-library-style functions to perform parallel operations on both CPU threads and NVIDIA CUDA cores.
In this work, vector classes are used to manage the memory of the PiT+PiS program, and other operations are performed by passing vector device data pointers to the hand-written CUDA global functions, which execute the simulation steps in Fig. 3. A host vector object can be transferred to a GPU device vector object with a simple a = b; statement; moreover, the vector classes can take an allocator template argument, which allows the vector data to be allocated in unified/managed CUDA memory, pinned memory, or device global memory without explicit CUDA function calls. The hierarchy of the TS system solver is shown in Fig. 7(a).
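For example, a minimal sketch of such Thrust-based memory management might look as follows; the kernel and sizes are placeholders, and thrust::universal_vector (a unified-memory vector) requires a reasonably recent Thrust/CUDA toolkit.

```cpp
// Thrust-managed buffers passed to a hand-written CUDA kernel.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/universal_vector.h>

__global__ void fineGridKernel(double* x, int n) { /* one simulation step (placeholder) */ }

void allocateAndLaunch(int nStates) {
    thrust::host_vector<double>      hX(nStates, 0.0);   // CPU-side initial conditions
    thrust::device_vector<double>    dX = hX;            // 'a = b;' copies host -> device
    thrust::universal_vector<double> uIf(nStates, 0.0);  // unified memory, visible to CPU and GPU

    // Raw pointers are what the hand-written __global__ functions receive.
    double* pX = thrust::raw_pointer_cast(dX.data());
    fineGridKernel<<<(nStates + 255) / 256, 256>>>(pX, nStates);
    cudaDeviceSynchronize();

    hX = dX;                                             // device -> host copy back
}
```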

B. CPU-Based PiS EMT HVDC Simulation
For the EMT solver in Fig. 7(b), static components are time-invariant power system components such as resistors, while transient components are time-variant components such as capacitors, inductors, power sources, and MMCs. Due to the asynchronous nature of the CPU-GPU hybrid execution and the multi-core parallel computing capability of the CPU, the threads running the EMT simulation can be concurrent with the TS problem solution on the GPU. The granularity of EMT parallel computing is designed to be system-level, which means one thread is responsible for one or more HVDC systems. This can be achieved by an OpenMP task or a simple parallel for loop construct. As shown in Fig. 7(d), the decoupling and connections between EMT systems are based on the propagation delay of traveling-wave transmission line models (TLMs) such as the Bergeron line model. All TLMs have a sending end and a receiving end, which are connected via a message bus implemented in ZeroMQ req/rep mode; the sending ends are clients that send data requests to the receiving ends, while the receiving ends are servers that accept requests and reply with their data. The req/rep mode of ZeroMQ is synchronous and thread-safe.
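A minimal sketch of this CPU-side arrangement is shown below; the EmtSubsystem interface, the message layout, and the endpoint address are assumptions for illustration, while the OpenMP pragma and the ZeroMQ REQ socket calls follow the standard APIs.

```cpp
// System-level EMT parallelism: one OpenMP thread per HVDC subsystem,
// TLM sending ends exchanging data with receiving ends over ZeroMQ REQ/REP.
#include <omp.h>
#include <zmq.h>
#include <vector>

struct EmtSubsystem {                                   // placeholder EMT subsystem
    std::vector<double> tlmState{0.0, 0.0};
    void solveStep(double dt) { /* advance one ~50-us EMT step */ }
    std::vector<double> tlmSendValues() const { return tlmState; }
    void tlmReceive(const std::vector<double>& v) { tlmState = v; }
};

void emtWindow(std::vector<EmtSubsystem>& systems, void* zmqCtx,
               double dt, int stepsPerWindow) {
    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < static_cast<int>(systems.size()); ++s) {
        void* req = zmq_socket(zmqCtx, ZMQ_REQ);        // one socket per thread
        zmq_connect(req, "tcp://127.0.0.1:5555");       // assumed receiving-end endpoint

        for (int k = 0; k < stepsPerWindow; ++k) {
            systems[s].solveStep(dt);

            // Sending end acts as client: push TLM values, get the remote values back.
            std::vector<double> tx = systems[s].tlmSendValues();
            zmq_send(req, tx.data(), tx.size() * sizeof(double), 0);
            std::vector<double> rx(tx.size());
            zmq_recv(req, rx.data(), rx.size() * sizeof(double), 0);
            systems[s].tlmReceive(rx);
        }
        zmq_close(req);
    }
}
```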

C. AC/DC Co-Simulation Interface
The AC/DC co-simulation method is based on [8] with modifications for PiT purposes, as shown in Fig. 7(c) and Fig. 9(c). The general idea is that the EMT system is represented as a power source in the TS solution, while the RMS values of the bus voltages in the AC system are transformed into time-domain instantaneous three-phase voltage sources in the EMT simulation. The time-step of the EMT simulation is around 50 μs, while the time-step for TS is 100 μs-10 ms. Since the TS system only produces voltages and power flows in the frequency domain, the data exchange interval can be larger than the EMT time-step. However, when using Parareal, the communication between the TS and EMT systems brings new challenges. The TS PiT simulation, consisting of a coarse grid and a fine grid, requires information from the EMT side. If the time-window is small enough, such as 1 ms or 10 ms, the data can be exchanged per window without any iterations; if the time-window is large, the EMT simulation has to be restarted during each PiT iteration. The CPU-GPU asynchronous computing architecture enables the EMT simulation to be executed on the CPU while the GPU is solving the large-scale TS problem, so the EMT simulation can be considered as obtained for free: virtually no additional execution time is added to the original TS solution process.
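The TS-to-EMT direction of this interface is essentially a phasor-to-waveform conversion. The following sketch samples a positive-sequence three-phase source from an RMS bus-voltage phasor over one exchange window; the phase ordering and the coarse-step sampling interval are assumptions of this example.

```cpp
// Convert an RMS voltage phasor (magnitude, angle) into instantaneous
// three-phase samples over one exchange window.
#include <cmath>
#include <vector>

struct ThreePhaseSample { double va, vb, vc; };

std::vector<ThreePhaseSample> phasorToWaveform(double vRms, double thetaRad,
                                               double f0, double tStart,
                                               double dtCoarse, int nSamples) {
    const double PI = 3.14159265358979323846;
    const double w  = 2.0 * PI * f0;
    const double vm = std::sqrt(2.0) * vRms;               // peak phase voltage
    std::vector<ThreePhaseSample> out(nSamples);
    for (int k = 0; k < nSamples; ++k) {
        double t = tStart + k * dtCoarse;                   // sampled at the coarse-grid step
        out[k].va = vm * std::cos(w * t + thetaRad);
        out[k].vb = vm * std::cos(w * t + thetaRad - 2.0 * PI / 3.0);
        out[k].vc = vm * std::cos(w * t + thetaRad + 2.0 * PI / 3.0);
    }
    return out;
}
```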
To avoid frequent data copies and synchronizations, the waveforms of the interfacing variables are exchanged once per time window; each waveform contains values sampled at the interval of the coarse-grid time-step, which is usually 1-10 ms. This method can be considered a combination of Parareal and the waveform relaxation method applied to AC/DC TS-EMT co-simulation. The two systems exchange information at each TS Parareal time window, as shown in Fig. 9. For most steady-state cases, the voltages and power flows change little, so the iteration converges quickly. For sharp changes such as short-circuit faults, the performance only degrades within a few time-windows, but the accuracy can be guaranteed since Parareal eventually falls back to the sequential execution.
The interface data are stored in CUDA unified memory vectors [33]. Unified memory provides a single virtual memory address space for CPUs and GPUs. Data migration can be triggered by page faults and is natively supported by the page migration engine inside NVIDIA GPUs newer than the Maxwell architecture. The memory can be accessed by the CPU and GPUs simultaneously, which is suitable for sharing data between multiple GPUs. It also enables asynchronous memory copies with the cudaMemPrefetchAsync function [34], so that pinned memory is not required for this task. This saves a lot of complicated memory management work for CPU-GPU and multi-GPU communication, and the code can be written in the same way as a normal multi-threaded program on the CPU.
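A minimal CUDA sketch of this unified-memory exchange is given below; the buffer contents, sizes, and kernel are placeholders, while cudaMallocManaged, cudaMemPrefetchAsync, and cudaCpuDeviceId are the standard CUDA runtime facilities referenced above.

```cpp
// Unified-memory interface buffer shared by the CPU (EMT side) and GPU (TS side).
#include <cuda_runtime.h>

__global__ void applyInterfaceKernel(double* ifBuf, int n) { /* consume EMT data (placeholder) */ }

void exchangeWindow(int nInterface, cudaStream_t stream) {
    double* ifBuf = nullptr;
    cudaMallocManaged(&ifBuf, nInterface * sizeof(double));   // single address for CPU and GPU

    // CPU (EMT side) fills the buffer with the per-window waveform samples...
    for (int i = 0; i < nInterface; ++i) ifBuf[i] = 0.0;      // placeholder values

    // ...then the buffer is prefetched to the GPU before the TS kernels need it,
    // overlapping the migration instead of relying purely on page faults.
    int dev = 0;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(ifBuf, nInterface * sizeof(double), dev, stream);
    applyInterfaceKernel<<<(nInterface + 255) / 256, 256, 0, stream>>>(ifBuf, nInterface);

    // Prefetch back to the CPU so the next EMT window reads it without page faults.
    cudaMemPrefetchAsync(ifBuf, nInterface * sizeof(double), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);
    cudaFree(ifBuf);
}
```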

IV. DYNAMIC RESULTS AND PERFORMANCE EVALUATION
The performance is evaluated based on the test cases shown in Fig. 10. Four IEEE 118-bus systems and a modified CIGRÉ DCS2 MTDC system form the Scale x1 base test system, which is used for producing the results of Case 1 and Case 2. The AC/DC system is then expanded vertically and horizontally, as shown in Fig. 10(a), to evaluate the scalability and computational efficiency of the hybrid CPU-GPU PiT+PiS simulation method. The four AC power grids in the Scale x1 system are labeled Net-1, Net-2, Net-3, and Net-4 in Fig. 10. Case 1: this case verifies the results against the commercial DSATools TSAT. The Parareal results shown in Fig. 11 match the TSAT results very well, and the relative error is smaller than 1%. After the fault, the generator transients last for about two seconds (10-12 s) and then return to normal once the fault is cleared. Fig. 11(a)-(c) shows the rotor angles, terminal voltages, and frequencies of the generators, respectively. Bus 24, which is the generator bus closest to the fault location, has the largest voltage (−0.3 p.u.) and frequency (−0.3 Hz) deviations. Fig. 11(d) shows the voltages of non-generator buses, where Bus 23 has the lowest bus voltage, which is nonzero because the fault resistance is 1 Ω rather than 0.
Case 2: this case presents the results of the PiT+PiS AC/DC co-simulation of an overload and recovery scenario. The MTDC system is used to support the AC power systems when an overload occurs in Net-1; Cm-B2 works under DC voltage control mode to maintain constant DC voltages, while the other MMCs work under power control mode. This scenario has three stages: (1) a 600 MW load is added to Bus 118 of Net-1 at 4 s, which causes drops in voltages and frequencies as shown in Fig. 12(a) and (b), respectively; (2) at 10 s the MMC Cm-E1 is ordered to draw 600 MW from the HVDC buses as shown in Fig. 12(c), so the voltages and frequencies of Net-1 start to recover. On the other side, Cm-B2 provides 648 MW of real power to maintain the DC voltages and has to draw real power from Net-2, so Net-2's voltages and frequencies start to decline as shown in Fig. 12(d) and (e), respectively; (3) 620 MW of real power is injected at 10.5 s into Bus 118 of Net-2 so that Net-2 can maintain stability. Fig. 13 shows the performance comparison between the CPU serial AC/DC co-simulation, the CPU-GPU PiT+PiS AC/DC co-simulation, and the GPU PiS co-simulation with a single NVIDIA Tesla V100. The execution time of the CPU serial program for large-scale cases is too long, so only the speedup against the serial program is presented in the plot. The speedup and execution time increase almost linearly for both the PiT+PiS and the traditional PiS parallel computing methods. When the system scale is 12x, GPU PiS only achieves a 98.7x speedup compared to the sequential program; meanwhile, the GPU PiT+PiS method achieves a 165.6x speedup, which is 1.67x faster than GPU PiS. Fig. 14 shows the performance of the PiT+PiS method with two NVIDIA Tesla V100 GPUs. When the system scale is 1x and 2x, the speedup is obvious; in particular, at a scale of 2x, which has 8 IEEE 118-bus systems, the parallel efficiency is near 100%. However, when the system scale is larger, the multi-GPU speedup declines to near 1.0x, which indicates that the GPU concurrency stops increasing in the multi-GPU situation, even though the single-GPU implementation shows linear speedup growth. This is due to the size of the GPU kernels, since the streams in Fig. 9 are not guaranteed to be concurrent, and large kernels amplify the load imbalance between GPU-1 and GPU-2. Because the serial coarse-grid prediction is critical to performance and should not be delayed or disrupted, all fine-grid kernels were launched on GPU-2 while GPU-1 only handles coarse-grid tasks. As the problem size grows, the coarse-grid workload becomes much smaller than the fine-grid workload, so the multi-GPU results become closer to single-GPU execution. A more advanced solution is to launch some of the fine-grid kernels on GPU-1 when it is idle so that the computationally intensive fine-grid tasks can make use of multiple GPUs.

V. CONCLUSION
A hybrid CPU-GPU parallel-in-time-and-space transient stability simulation method is proposed based on the Parareal algorithm. The Parareal algorithm is implemented on the GPU along with the traditional PiS algorithm to achieve maximum parallelism. The CPU-GPU design performs the PiT scheduling and launches GPU kernel functions into streams from the CPU, which brings better scalability and extensibility than a GPU-only design. Meanwhile, the CPU can perform the EMT simulation asynchronously while the GPU is running the transient stability simulation, which integrates naturally with the proposed AC/DC co-simulation scheme and brings better performance and parallel efficiency. The study cases show good results in both accuracy and computational performance. The speedup of the PiT+PiS method over the PiS method is around 2x, and it reaches 165.6x compared to the sequential CPU program for a large-scale system. The method can utilize multiple GPUs and achieves near-maximum parallel efficiency at a system scale of 2x. Further investigations and benchmarks are planned to find the bottleneck and optimize the PiT+PiS algorithm on multiple GPUs.
The proposed hybrid PiT+PiS method shows good potential for the solution of large-scale AC/DC power system transient stability simulation problems.