A Configurable Hierarchical Architecture for Parallel Dynamic Contingency Analysis on GPUs

Dynamic contingency analysis (DCA) for modern power systems is fundamental to helping researchers and operators anticipate potential issues, prepare operational plans, and improve system stability. However, as system sizes and the number of contingency scenarios continue to grow, achieving adequate computational performance is hampered by slow algorithms and limited computing resources. This research accelerates the intensive computations of massive DCAs by implementing a two-level hierarchical computing architecture on graphics processing units (GPUs). The performance of the designed method is examined on four test systems of different sizes and compared with a CPU-based parallel approach. The results show up to 2.8x speedup using one GPU and 4.2x speedup using two GPUs, with further acceleration observed as more GPUs are configured. This demonstrates that the proposed architecture can significantly enhance the overall computational performance of massive DCAs while maintaining strong scaling capability under various resource configurations.


I. INTRODUCTION
Dynamic contingency analysis (DCA) is a critical function of Energy Management Systems (EMS) for identifying potential physical limitations and system stability constraints in diverse scenarios involving power system component failures [1]. Static contingency analysis (CA) considers only the system's steady-state behavior in a given snapshot under different contingencies. DCA, by contrast, seeks to assess the ability of a grid to withstand cascading component failures or contingencies within a specified duration (a few seconds) [2]; it is paramount for the reliable operation of modern power systems, underpinning system stability and economic prosperity [3]. In massive DCAs, each grid configuration with component outages is evaluated as a scenario, whose feasibility is determined by solving a power system dynamic simulation. If any collapse is detected, preventive and corrective operations should be implemented immediately based on the DCA results [4]. Completing these tasks entails complex computations on system topology and dynamics. The time-consuming simulations result in a long wait before operators can assess the situation and make a timely decision. This latency hinders timely operations in power grids, where quick reactions are needed to adjust plans and strategies. A computationally efficient solution to massive DCAs is therefore of great significance.
High-performance computing (HPC) leveraging advanced computing architectures holds great promise to overcome this computational barrier and meet the need for fast responses in online operations. Many existing approaches to parallelizing and accelerating power system DCA simulation run on central processing unit (CPU)-based multi-core computers or clusters. For example, [5], [6], and [7] achieve desirable speedups of a single dynamic contingency case using 64 cores. A two-level dynamic security assessment under uncertainty is introduced in [8]. Additionally, the dynamic contingency analysis tool (DCAT) developed in [9] bridges an outer-inner module to perform multiple contingencies, where the actual completion time depends heavily on the host resources. In recent years, as the graphics processing unit (GPU) has come to dominate the HPC area due to its exceptional computational capability, many intractable problems in science and engineering that traditionally demanded a high-end supercomputer are now being solved on cost-effective GPUs instead [10], [11], [12], [13], [14]. In the power system domain, [15] proposes a novel method for transient stability simulation on multiple GPUs, [4] designs a hybrid computing architecture for real-time CA, and [16] uses a two-layered parallel static security assessment for large-scale power grids based on GPUs. Nonetheless, an advanced GPU-based implementation for parallel DCA is still missing.
This paper presents a configurable, hierarchical computing architecture for accelerating massive DCAs in parallel on single or multiple GPUs, the first such GPU-based design. The higher level dispatches the available worker processes to the tasks at once. At the lower level, each single case, a GPU-optimized dynamic simulation, is further programmed to use the resources provisioned on its attached process for data-level parallelism. Executions on one or more GPUs can be configured to host the DCA runs with enhanced flexibility and extended scalability. The main contributions of the paper are:
• An improved single-DCA algorithm and GPU-based implementation using high-level computing libraries.
• The design for parallelizing massive DCAs through GPU resource partitioning on a hierarchical architecture.
• Configurable and broadly adaptable design of the proposed architecture with minimized hardware network communication.
The rest of this paper is organized as follows: 1) an overview of DCA's core algorithm and relevant GPU computing techniques in Section II; 2) the architecture design and code implementation for data-level and task-level parallelism on GPUs in Section III; 3) case studies, including performance comparisons on CPU and GPU using different test systems and a scalability analysis for extendable hardware configurations, in Section IV; and 4) conclusions and proposed future enhancements in Section V.

II. BACKGROUND
With increasing renewable penetration and more frequent extreme weather events, the number of credible contingencies in power systems that should be analyzed in detail is growing rapidly, imposing a substantial computational burden on real-time operations. This section reviews the general DCA mathematical model and parallel computing technologies on the GPU, respectively.

A. SINGLE DYNAMIC CONTINGENCY ANALYSIS
The core computation in a single power system DCA is a dynamic simulation that generally consists of nodal admittance matrix (full Y matrix) manipulations and multiple time-step simulations. It requires a computationally intensive time-domain solution of numerous differential and algebraic equations (DAEs) over a short period of time (e.g., 30 seconds), as shown in (1), where the vector x represents dynamic state variables such as generator rotor angles and speeds, and the vector u represents algebraic variables such as the network bus voltage magnitudes and phase angles, or the real and imaginary parts of the bus voltages [17]. Given a power system with N buses and M generators, the algebraic equations in (1) can be represented by (2), where the vector of current injections at each bus, I_N, is the product of the full Y matrix Y_NN and the vector of bus voltages V_N. To reduce the matrix complexity and achieve the best matrix-operation performance for a power system with classical generator models and constant-impedance loads, Y_NN can be reduced to contain only generator internal buses, yielding Y_MM. Following [18] and [19], (3) gives the reduced nodal admittance matrix (reduced Y matrix), where Y_MM stores the resistance and reactance of the generators, Y_MN holds the links between generator internal buses and terminal buses, Y_NN contains the constant load impedances and generator transient impedances, and Y_NM is the transpose of Y_MN. The algebraic equations can thus be rewritten as (4), where the current injections I_M reduce to the product of Y_MM and the vector of generator internal voltages V_M.
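Equations (1) through (4) are not reproduced in this text. Based on the surrounding definitions, a standard formulation consistent with the description would be the following sketch, where the reduced matrix is written as Ŷ_MM to distinguish it from the generator impedance matrix Y_MM on the right-hand side of the Kron reduction (this notational split is an assumption; the paper overloads the symbol Y_MM):

```latex
\dot{x} = f(x, u), \qquad 0 = g(x, u) \tag{1}
```
```latex
I_N = Y_{NN} V_N \tag{2}
```
```latex
\hat{Y}_{MM} = Y_{MM} - Y_{MN}\, Y_{NN}^{-1}\, Y_{NM} \tag{3}
```
```latex
I_M = \hat{Y}_{MM} V_M \tag{4}
```

Equation (3) is the classical Kron (network) reduction, eliminating all non-generator buses so that only the M generator internal buses remain.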
The equations of motion for an individual generator a in the complex system can be represented by (5) for a classical generator model.
Here H_a is the inertia constant, ω_a is the rotor speed, ω_sa is the synchronous speed, P_ma and P_ea are the mechanical power input and the active power at the air gap, D_a is the damping coefficient, and θ_a is the angular position of the rotor in electrical radians with respect to the synchronously rotating reference [6].
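Equation (5) is not reproduced here; the standard classical-model swing equations consistent with the symbol definitions above would be (a sketch; the paper's exact per-unit scaling may differ):

```latex
\frac{d\theta_a}{dt} = \omega_a - \omega_{sa}, \qquad
\frac{d\omega_a}{dt} = \frac{\omega_{sa}}{2H_a}\Big[ P_{ma} - P_{ea} - D_a\,(\omega_a - \omega_{sa}) \Big] \tag{5}
```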
To solve the DAEs, the differential equations in (1) are first discretized into algebraic equations, which are then lumped with the original algebraic equations. The Modified Euler (ME) method [20] is typically used to solve these equations explicitly at each time step, as shown in (6). A contingency scenario perturbs the system balance before (pre-), during (on-), and after (post-) the transient disturbance or fault is applied, and the Y_MM generated for each phase is then used in the numerical integration function.
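Equation (6) is not reproduced in this text; the standard Modified Euler predictor-corrector scheme consistent with the description would be (a sketch; step index k and time step Δt are notation introduced here):

```latex
% Predictor (forward Euler step)
\tilde{x}_{k+1} = x_k + \Delta t\, f(x_k, u_k)
```
```latex
% Corrector (trapezoidal average of the two slope estimates)
x_{k+1} = x_k + \frac{\Delta t}{2}\Big[ f(x_k, u_k) + f(\tilde{x}_{k+1}, u_{k+1}) \Big] \tag{6}
```

Note that each ME step requires two evaluations of f, a point the later comparison with the Adams-Bashforth method relies on.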
DCA addresses each contingency scenario by deploying one such dynamic simulation and evaluating the system's dynamic trajectory. Consequently, both the parallel execution of multiple DCA cases and the speedup of each single dynamic simulation, which dominates one DCA run, lead to a notable acceleration of the overall computational performance of a massive DCA study.

B. PARALLEL COMPUTING ON GPU
1) DATA-LEVEL PARALLELISM
Modern GPUs have a highly data-parallel architecture composed of many cores sharing the same control unit (e.g., 5,120 cores in a Tesla V100 GPU [21]). The primary computing unit of an NVIDIA GPU is called a thread. All indexed threads grouped in a block perform Single Instruction, Multiple Data (SIMD) execution, since each thread, in charge of one array element, executes the identical formula in parallel. The GPU memory (registers, shared memory, cache, and global memory) is accessible to the threads for performance optimization purposes. A grid, containing multiple blocks of threads, is the largest defined structure in GPU programming. These resource definitions, the memory layout, and the related computations are controlled by writing a kernel through CUDA [22], a computing platform that allows software to offload intensive CPU work to the GPU. In this paper, since the GPU is favorable for data-level parallel implementations, the single-DCA algorithm, i.e., the dynamic simulation, is optimized and accelerated by coding in CUDA Python [23], where the program takes advantage of the Python ecosystem, the CUDA driver and runtime APIs, and extensive GPU data structure and parallel computing libraries (e.g., CuPy [24], Numba [25], etc.).
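As a minimal illustration of the SIMD mapping described above (a plain-Python sketch, not the paper's code): each thread handles one array element, and its global index is derived from the block and thread indices exactly as CUDA computes `blockIdx.x * blockDim.x + threadIdx.x`. The helper names here are invented for illustration.

```python
# Emulate a 1-D grid of 1-D blocks: every "thread" handles one element.
# On a real GPU all threads run concurrently; here we loop sequentially.

def launch_1d_grid(kernel, n_elements, block_dim=256):
    """Call `kernel(tid)` once per element, mimicking a CUDA launch."""
    grid_dim = (n_elements + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            tid = block_idx * block_dim + thread_idx      # global thread id
            if tid < n_elements:                          # bounds guard
                kernel(tid)

# Example: the identical formula applied to every element.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4

def add_kernel(tid):
    out[tid] = a[tid] + b[tid]   # one thread, one element (SIMD style)

launch_1d_grid(add_kernel, 4)
```

In a real Numba kernel, `launch_1d_grid` is replaced by the launch configuration `kernel[grid_dim, block_dim](...)` and the index arithmetic by `numba.cuda.grid(1)`.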

2) TASK-LEVEL PARALLELISM
Currently, large CPU-based distributed systems or clusters dominate event-driven computing, while GPU usage is still devoted mostly to data-level parallelism, as discussed above. Thus, the concurrency of independent tasks has traditionally not been a priority in GPU design. Recent NVIDIA GPUs (Tesla V100 or A100 [26]) built with the Volta architecture or later allow applications from multiple processes to overlap, achieving resource partitioning and fast task-level parallelism. In this research, the configurable implementation is done through the CUDA-aware Message Passing Interface (MPI) [27], a communication standard in parallel computing, which is responsible for launching processes and designating them to different dynamic contingencies. Moreover, the Multi-Process Service (MPS), which can benefit performance when a single application process underutilizes the GPU's computing capacity, enables each MPI job to have its own memory space while cooperating with the others without interference on the same GPU [28].
[Algorithm 1: Original Y_NN and Y_MM formations. Input: system data (β_1, β_2), fault condition (nb, fb, Bb, rx). Output: pre-, on-, and post-fault matrices.]

III. PARALLEL IMPLEMENTATIONS
This section presents a two-level computing architecture designed to boost DCA computations on single or multiple GPUs through both task- and data-level parallelism, with optimized algorithms and a communication control strategy.

A. OPTIMIZATIONS OF SINGLE DCA ALGORITHM
According to Section II.A, the reduced Y matrix must be recalculated once a fault occurs during the dynamic simulation. As shown in Algorithm 1, the standard code version implemented in PST [29], which we use as a baseline reference, reruns every step of the Y_NN and Y_MM formation in the Y_reduce() function. It treats each Y_MM as a function of two types of variables: 1) β_1 and β_2, which are constant and represent the two collections of power system input; and 2) fb (fault far bus), nb (fault near bus), Bb (shunt susceptance), and rx (reactance), which govern the fault condition applied in a scenario. Since the variables changed when applying a fault do not modify the system parameters, Algorithm 1 can be optimized to mitigate the workload by recycling and reusing intermediate results from earlier computations.
For solving the set of DAEs, the ME method in (6) is replaced with the Adams-Bashforth (AB) method in (7). Instead of predicting and then updating the state variables in each iteration, AB requires only a one-time approximation using the values from the current and previous steps, which theoretically provides a two times speedup, regardless of hardware constraints, due to one fewer operation per step.
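Equation (7) is not reproduced in this text; the standard two-step Adams-Bashforth formula consistent with the description (one function evaluation per step, using the current and previous values) would be:

```latex
x_{k+1} = x_k + \frac{\Delta t}{2}\Big[\, 3 f(x_k, u_k) - f(x_{k-1}, u_{k-1}) \,\Big] \tag{7}
```

Since f(x_{k-1}, u_{k-1}) was already evaluated in the previous step and can be cached, each AB step evaluates f once, versus twice for the ME predictor-corrector, which is the source of the theoretical 2x speedup.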

B. ACCELERATION OF SINGLE DCA
The acceleration of a single DCA can be achieved by leveraging GPU resources. In the Y_NN and Y_MM formations, the GPU outperforms the CPU on large matrix operations, and the data structures written through CuPy fully support the required array functionality. For example, the significant computations in (3) are completed with calls to a GPU-based matrix-matrix dot product [30] and LU decomposition [31], [32] for sparse linear system solving. This implementation automatically parallelizes the program for speedup without modifying the original algorithm. The DAE-solving function, where the algebraic and state variables of the generators in (4) and (5) are independent of each other, contains vectorized and element-wise computations at each time step; it is therefore well suited for data-level parallelism on the GPU. These operations are merged into a single kernel (sim()) through Numba to avoid the overhead of repeated kernel launches. As shown in Listing 1, the network-equation solving and AB approximation fit into a 1-D grid of 1-D blocks. To perform SIMD, every indexed thread (tid) is responsible for computing the dynamics of one generator. By looping over the columns of Y_MM and V_M, the current injection I_M is computed. The approximated next-step angular position (θ(steps + 1)[tid]) is then obtained from the results of the current and last two time steps.
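Listing 1 is not reproduced here; the NumPy emulation below sketches the per-thread logic just described, with each "thread" tid handling one generator: accumulating its current injection from a row of Y_MM and advancing the rotor angle with a two-step Adams-Bashforth update. Names such as theta and omega, and the exact state update, are illustrative assumptions rather than the paper's code (the real kernel is compiled with Numba for the GPU).

```python
import numpy as np

def sim_step(Y_MM, V_M, theta, omega, omega_prev, dt):
    """Emulate one time step of the sim() kernel on the CPU.

    Each iteration of the tid loop mirrors one GPU thread computing
    the dynamics of a single (classical-model) generator.
    """
    M = len(V_M)
    I_M = np.zeros(M, dtype=complex)
    theta_next = np.empty(M)
    for tid in range(M):                  # one "thread" per generator
        # Current injection: dot product of row tid of Y_MM with V_M,
        # accumulated column by column as in the kernel's inner loop.
        acc = 0j
        for j in range(M):
            acc += Y_MM[tid, j] * V_M[j]
        I_M[tid] = acc
        # Two-step Adams-Bashforth update of the rotor angle
        # (illustrative; the paper's update may differ in detail).
        theta_next[tid] = theta[tid] + dt * (1.5 * omega[tid]
                                             - 0.5 * omega_prev[tid])
    return I_M, theta_next
```

On the GPU, the outer tid loop disappears: each thread owns one tid, so all generators are updated concurrently within the single fused kernel.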

C. ACCELERATION OF MASSIVE DCAs USING MULTIPROCESSING
The concept of multiprocessing refers to task-level parallelism, which spawns multiple processes and allocates tasks among them. The design in this paper takes full advantage of multiprocessing because massive DCA runs are likewise independent of each other. We use MPI to initialize a group of processes and distribute the tasks, and MPS to allow function executions from different processes to overlap on the device, achieving GPU partitioning with higher resource utilization and shorter elapsed time. MPS ensures the simultaneous scheduling of work submitted by individual processes and stays out of the critical execution path; the communication between each process and the MPS server is enclosed within the CUDA binaries. Fig. 1 shows the overall implementation methodology on GPUs, with a detailed illustration of the critical executions on a single GPU. The dynamic simulation program for analyzing a single contingency scenario is elaborated in Section II; according to Algorithm 2, it can be separated into three components (Y_init(), Y_red(), and sim()). Suppose there are x scenarios to be solved in total and a machine has n available processes for one GPU (x ≫ n); then x/n iterations are needed to complete all cases. In the beginning, the main process P_0 launches the Y_init() function to parse the system input data and contingency settings, and the program executes the steps in Algorithm 2 on the GPU to obtain the full Y matrix and the reduced Y matrix for the pre-fault phase. The task-level parallelism starts with the broadcast of this information to all processes (P_0, P_1, ..., P_{n-1}). In this study, the MPI application can submit GPU-based work directly to the device once MPS is activated. The resource provisioning strategy equally partitions the GPU's available threads among the processes (an active thread percentage of 100%/n per process).
Each process then locally generates and takes its assigned scenarios.
The architecture is configurable since it also supports multi-GPU concurrency, with controls for mapping DCA jobs over many processes on a GPU cluster. Each independent device is set to run an equal number of scenarios in parallel, and no information transfer between devices is needed. Furthermore, additional performance can be gained by aggregating computing resources from other machines: the input control evenly separates and dispatches the DCA cases to every GPU device on a scalable computing platform. One machine can act as an independent system, allowing only internal communications between the processes within each attached GPU. In other words, each GPU initializes the pre-fault condition for its assigned DCA scenarios to provide the requirements for its local processes. No action is taken to control or share global information from one host to the others, although such interactions could be fulfilled through a conventional higher-layer MPI global communicator if needed. This design reduces the communication overhead in the network and improves overall performance.
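The two-level dispatch described above can be sketched as follows. This is a simplified plain-Python illustration of the partitioning logic only (the function and variable names are invented here; the real implementation uses CUDA-aware MPI for process launch and MPS for the per-process thread provisioning):

```python
def partition_scenarios(x, n_gpus, n_procs_per_gpu):
    """Split x DCA scenarios evenly across GPUs, then across processes.

    Returns a plan {gpu_id: {rank: [scenario ids]}} plus the MPS
    active-thread percentage provisioned to each process (100%/n,
    as described in the text).
    """
    scenarios = list(range(x))
    per_gpu = (x + n_gpus - 1) // n_gpus        # even split over devices
    plan = {}
    for g in range(n_gpus):
        chunk = scenarios[g * per_gpu:(g + 1) * per_gpu]
        # Round-robin the device's chunk over its local MPI ranks;
        # each batch of n concurrent cases is one "iteration".
        plan[g] = {r: chunk[r::n_procs_per_gpu]
                   for r in range(n_procs_per_gpu)}
    thread_pct = 100.0 / n_procs_per_gpu        # per-process GPU share
    return plan, thread_pct

# Example matching the paper's setting: 1,024 cases, 2 GPUs, 16 processes.
plan, pct = partition_scenarios(x=1024, n_gpus=2, n_procs_per_gpu=16)
```

Because each GPU's plan is independent, no inter-device communication is required, which mirrors the communication-minimizing design above.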
In summary, the proposed GPU-based architecture has the following novel merits:
• Hierarchical task- and data-level parallelism to enable massive DCA studies with fast dynamic simulations.
• High portability and reconfiguration capability to run on a single GPU or multiple GPUs if applicable.
• Minimal required communication and overhead incurred between CPU and GPU due to maximized deployment of all simulations on GPUs.

IV. RESULTS AND ANALYSIS
In this section, the speedup of the proposed parallel DCA architecture using different power system cases is presented. We analyze the performance scalability based on test results.

A. TEST SYSTEMS AND HARDWARE CONFIGURATION
Three real power systems (IEEE39b, IEEE145b, and Polish3120b) of different scales are selected to evaluate computational performance. An artificially created synthetic system (ATF900b) is also used to represent an increased density of system topology. Table 1 summarizes the differences between these systems. For each DCA case, the tripping scenarios (faults) are applied during a 30-second dynamic simulation period and cleared after 0.1 seconds; the step width for the approximation is 0.005 seconds. All programs are tested on Clemson University's Palmetto Cluster [33], a Linux-based supercomputing environment comprising 2,115 compute nodes (totaling 32,600 CPU cores), including 639 nodes equipped with 2 Tesla V100 GPUs and 34 nodes each equipped with 2 Tesla A100 GPUs.

B. SINGLE DCA ON ONE GPU
In parallel computing, the performance of a program is bounded by its non-parallelizable portion and influenced by the computation-to-communication ratio, i.e., computation intensity versus communication overhead. CUDA initialization and system data transfer between host and device are unavoidable costs for our single DCA run. We examined our program on a computing node with 8 Intel Xeon Gold 6148 @ 2.40 GHz cores, 32 GB RAM, and 1 Tesla V100 16 GB GPU, a typical configuration for performance testing. For benchmarking, we compare our results with two other studies, [34] and [35], where the GPU is utilized as an accelerator to boost specific portions of a dynamic simulation based on Algorithm 1. For example, [34] uses low-level CUDA programming to handle LU decomposition on the GPU.
Reference [35] implements portable compiler directives in Fortran code to offload element search and assignments for dense matrix manipulations. The testing results for each power system are shown in Table 2. Our implementation is the fastest of the three GPU-based studies, mainly for three reasons: 1) the proposed dynamic simulation algorithm in Algorithm 2 significantly reduces the workload of the Y_NN and Y_MM computations; 2) the high-level data structures and all-GPU parallel implementations discussed in Section III.B improve overall computational performance; and 3) although the systems vary in size, all test cases fit on the GPU and can exploit its exceptional computing resources.

C. MULTIPLE DCAs ON A SINGLE COMPUTING NODE
To accelerate DCAs on GPU with the proposed multiprocessing approach, we request a maximum of 56 CPU cores, 372 GB of memory, and 2 Tesla V100 GPUs on a single computing node. There are 1,024 scenarios to be solved for each test system. As the number of computing processes increases, more cases run in parallel, and the computational time is expected to decrease until a specific threshold is reached. Fig. 2 shows a profiling snapshot of the sim() Numba kernel with the process number set to 8. According to the timeline, kernel executions overlap heavily within a GPU, exhibiting the desired concurrency and high throughput. The Volta MPS server supports up to 48 concurrent contexts per device if the allocated GPU memory stays within the limit, but the real allocation depends on problem size and task intensity: in our case, 22 tasks fit per GPU for the smallest IEEE39b system but only 16 for the largest ATF900b system. The performance of our GPU-based hierarchical architecture is compared against a pure CPU-based approach, whose main difference is that the CPU simulator runs the individual contingency workflow of Algorithm 1 and achieves data-level parallelism through CPU-based multi-threading techniques.

1) SCALABILITY WITH ONE GPU
The overall scalability of the single-GPU and CPU-based multiprocessing implementations for the four test systems is plotted in Fig. 3. As expected, offloading all DCA simulations to a GPU improves performance over the CPU-based approach: the computational time is reduced across all test systems regardless of the number of processes. All test systems reach faster or equivalent execution with far fewer processes and less memory than the CPU approach. For instance, using 56 CPU cores, the parallel CPU-based simulator solves the 1,024 Polish3120b scenarios within 1.1 min, whereas a faster execution (0.8 min) is achieved on one GPU with only four processes. Moreover, at 16 processes, the GPU program completes all tasks in only 0.4 min, 4x faster than the CPU. The considerable performance improvement comes from the optimized algorithm, the highly efficient GPU-based implementation, and the GPU's superior processing capability for data parallelism.
2) SCALABILITY WITH TWO GPUs
Fig. 4 demonstrates the successful implementation of multiprocessing on a GPU cluster. As illustrated, two Tesla V100 GPUs reside on the computing node; each GPU has its own MPS server started, and four processes with unique IDs run in parallel to solve four independent DCA scenarios. Compared to a single GPU, since the maximum number of allowed processes is doubled, half of the task load is distributed to the added device and the computational performance is improved linearly. Table 3 benchmarks the peak performance of the GPU-based simulator using one GPU (16 processes) and two GPUs (32 processes) against the CPU-based parallel simulator with different configurations (1, 16, 32, and 56 processes). The Polish3120b system exhibits the best speedup over the CPU-based parallel implementation, with a boost of 2.8x on one GPU and 4.2x on two GPUs.

D. MULTIPLE DCAs ON DIFFERENT COMPUTING NODES ACROSS A DISTRIBUTED SYSTEM
We assume that the more resources are available, the more tasks can run concurrently to accelerate the computations. In a further test, we solve the 1,024 contingencies using distributed computing facilities, with the hardware configuration of each node kept the same as in the single-node test. Table 4 presents the solution time for the ATF900b power system using 16 processes on each GPU across different numbers of nodes. By requesting four nodes and eight GPUs on the cluster, a speedup of 6.4x is recorded relative to one GPU on a single computing node. The scalability is nearly linear because there is no network communication between the nodes; each device handles only the local interactions between its processes. These results confirm the extendable and configurable nature of our approach, as described in Section III. Hence, the GPU multiprocessing method can further improve the performance of massive DCA simulations when more computing power is available.

V. CONCLUSION AND FUTURE WORK
Traditional power system CA and DCA research relies heavily on high-end CPU-based HPC facilities. This paper presents a first-of-its-kind two-level hierarchical parallel architecture to accelerate massive DCAs using GPU devices. The dynamic simulation for solving an individual DCA is optimized on the GPU to achieve data-level parallelism, and the multiprocessing design makes multiple tasks run concurrently within each device. Compared to the CPU-based parallel solution, which requires far more computing resources, the GPU-based architecture achieves better results when solving 1,024 scenarios in all test systems. We conclude that the approach possesses the robustness and extensibility to enhance the overall computational performance of massive DCAs under various GPU configurations.
In the future, an integrated application will be built with: 1) fine-tuned CUDA dynamic simulators to handle more complex systems and models for a single DCA; 2) real-time data visualization leveraging the GPU's graphics interoperability features; and 3) a generalized GPU-based multiprocessing architecture to accommodate other related research topics in the power system domain.

ACKNOWLEDGMENT
Clemson University is acknowledged for the generous allotment of computing time on Palmetto Cluster.