Spatial-Temporal Deep Learning for Hosting Capacity Analysis in Distribution Grids

The widespread use of distributed energy resources (DERs) raises significant challenges for power system design, planning, and operation, leading to wide adoption of tools for hosting capacity analysis (HCA). Traditional HCA methods conduct extensive power flow analysis. Due to the computational burden, these time-consuming methods fail to provide online hosting capacity (HC) in large distribution systems. To solve the problem, we first propose a deep learning-based problem formulation for HCA, which conducts offline training and determines HC in real time. The learning model, long short-term memory (LSTM), leverages historical time-series data to capture periodic patterns in distribution systems. However, directly applying LSTMs suffers from low accuracy because spatial information, such as feeder topology, is ignored, although it is critical in nodal HCA. Therefore, we modify the forget gate function to dual forget gates to capture the spatial correlation within the grid. Such a design turns the LSTM into the Spatial-Temporal LSTM (ST-LSTM). Moreover, as voltage violations are the most vital constraints in HCA, we design a voltage sensitivity gate to increase accuracy further. The results of LSTMs and ST-LSTMs on feeders, such as the IEEE 34- and 123-bus feeders and utility feeders, validate our designs.


I. INTRODUCTION
Nowadays, a growing number of renewable energy-based distributed energy resources (DERs), e.g., photovoltaics (PVs), have been deployed in the low-voltage distribution grid. The widespread use of DERs brings many advantages, including voltage profile control, line loss reduction, and cost decrease. Meanwhile, challenges appear when the distribution grid is turned into an active grid, because DERs inevitably change load shapes, voltages, fault current profiles, etc., when the penetration level is substantial. For example, grid operators now face over-voltages at solar-rich feeders instead of the low-voltage issues of the past. Specifically, solar energy often peaks at valley load for communities with many roof-top PVs. The net load then turns negative and introduces reverse power flow and over-voltages at the lateral end, causing the malfunction of voltage regulators and protection coordination. Without proper awareness of the potential capacity to accommodate PVs, grid planners and operators find it challenging to handle capacity planning and take appropriate control actions, which motivates hosting capacity analysis (HCA) to assess the distribution grid for further operation [1]. Specifically, HCA determines the value of the hosting capacity (HC), defined as the maximum active power that can be injected by DERs at a bus in an existing distribution grid without causing technical problems or requiring changes to power system facilities. [2], [3], [4], [5] provide systematic and comprehensive introductions to the research, development, evaluation, and enhancement of HC. Traditional HCA methods typically conduct power flow analysis for a baseline feeder model and modified scenarios with different load profiles, DER penetration, and other uncertainties caused by the environment, power equipment, and human activities [6].
Then, these methods implement different techniques to check the violation conditions of operation constraints and thus quantify the HC [7], [8].
Related works [2], [3], [4], [5] classify current HCA methods into the deterministic (worst-case), stochastic, streamlined, and iterative Integration Capacity Analysis (ICA) methods. The deterministic methods [9], [10], [11] and the stochastic methods [12], [13], [14], [15], [16], [17] focus on specific scenario(s) without considering time correlation. A deterministic method obtains an optimal solution on a single scenario, usually the worst-case scenario(s), while a stochastic method uses probabilistic techniques, e.g., Monte Carlo simulation, to model the uncertainties in the power system. However, they cannot capture the relation between the system variables over time, which is important in HCA [18], [19], [20], [21]. Differently, using historical time-series data, the streamlined method [22] and the iterative ICA method [23], [24] provide insight into how hosting capacity changes over time and the ability to derive a corresponding hosting capacity pattern over time. To determine the HC, the streamlined method applies a set of equations and algorithms to evaluate power system criteria at each node, whereas the ICA method iteratively increases the DERs at each node until system violations occur. Though they can consider time-related scenarios and increase the calculation accuracy of the time-varying power system model, each separate step is a complex calculation, e.g., iterative optimal functions. Therefore, the calculation burden still limits traditional HCA for real-time tasks.
Since we aim to calculate HC in real time with high accuracy, the previous HCA methods cannot meet the requirements. Instead, we propose a machine learning-based problem formulation [25]. This formulation uses historical time-series data to conduct offline training and obtains the HC value based on real-time system conditions. Specifically, we model HCA as a supervised learning problem that uses data of power system features and operating conditions as the input vector and HC data as the target label. The mapping from historical data on different input features to the HC is highly nonlinear, and deep learning is a promising method to deal with the non-linearity. In order to capture the periodic patterns of the power system, e.g., hourly, daily, and yearly patterns, we use the recurrent neural network (RNN) as the basic learning framework [26], [27], [28]. In such a model, RNNs need some past context to predict the current output, but in practice, they can hardly capture relationships in a long sequence. One of their variants, the long short-term memory (LSTM), performs much better at long-term learning [29]. While the LSTM can improve our deep learning framework, its direct application to HCA still has some challenges.
Though the basic LSTM framework can capture the time-varying impacts in HCA, the impact of spatial information cannot be embedded directly. Such information includes the locations of nodes, current DERs, and potential DERs in the feeder. Ignoring these spatial relationships and simply relying on deep learning will limit the accuracy of our analysis [30], [31], [32]. Because of the flexibility of the deep neural network (DNN) and our motivation to consider both spatial and temporal correlation for HCA, we extend the basic LSTM to the Spatial-Temporal LSTM (ST-LSTM) [33], [34], [35], [36], [37]. We make two major contributions to achieve this goal. First, we modify the structure of LSTM cells. The most crucial LSTM design for capturing temporal correlation is the gate function. Reference [38] proposed the forget gate for the first time, and [29] emphasized that the forget gate is the most critical gate among all the gates. Therefore, we modify the forget gate to dual forget gates, which allows the model to transfer temporal and spatial memory in parallel. Second, to make the model perform well in our HC determination work, we design a sensitivity gate to use the voltage sensitivity data. The voltage sensitivity data is particularly relevant for HCA because it relates to the voltage violation constraint, one of the most critical limits in HCA [39]. Therefore, we design the voltage sensitivity gate to improve the accuracy further. This gate works with the input gate, which is also an essential gate in the LSTM cell [40], to filter the new inputs.
We validate our new model using the IEEE 34-bus feeder, IEEE 123-bus feeder, and an Arizona generic utility feeder. The ST-LSTM has a much better performance for these feeders than the temporal or spatial LSTM models. Moreover, the designed sensitivity gate using the voltage sensitivity data in our LSTM model can improve the accuracy.
The remainder of this paper is organized as follows. Section II proposes the problem formulation. Section III illustrates the design of the dual forget gates and the sensitivity gate. Section IV provides the numerical validation using different models. Conclusion and discussions are in Section V.

II. PROBLEM FORMULATION
As our target is to provide real-time hosting capacity for each bus, we do not focus on one single scenario. Different from the static snapshot hosting capacity, we first formulate a data-driven machine learning problem model for dynamic hosting capacity (DHC) analysis, which will help distribution system operators enhance hosting capacity determination and facilitate optimal control and dispatch of DERs in real time. Specifically, this can be achieved by integrating this function as a module in the data platform. The HC values allow the other modules in the data platform, specifically the module on coordinated control of DERs, to determine which regions/buses are operating near their HC limit and how to control the DERs at and around these regions to mitigate HC violations such as overvoltage. This can be achieved by a cloud-based platform called the end-to-end solar energy optimization platform (e-SEOP), which receives these data and performs multiple analytics, including the dynamic HCA described in this paper and real-time control of DERs through the network of edge intelligent devices (EIDs) [41].

A. Per-Bus Hosting Capacity Analysis
The per-bus hosting capacity analysis determines the hosting capacity value for each bus, i.e., the per-bus HC values. We use a two-bus example with current injections at both buses of (p1, p2) = (0, 2) to introduce the per-bus HC values. To determine the HC value for bus 1, we gradually increase the injection at bus 1 and keep all the other settings the same, including the injection at bus 2. If (p1, p2) = (1, 2) is the maximum value without causing problems, the HC value of bus 1 is 1 − 0 = 1. To determine the HC value for bus 2, we also keep all the other settings the same and only change the injection at bus 2. If (p1, p2) = (0, 3.5) is the maximum value without causing problems, the HC value of bus 2 is 3.5 − 2 = 1.5. Therefore, the HC values for both buses are (hc1, hc2) = (1, 1.5). In this case, we have the per-bus hosting capacity under this scenario setting. Then, we can use the same metric for other scenarios and finally obtain time-series HC results for these time-series scenarios.
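The per-bus procedure above can be sketched as a simple incremental search. This is only an illustration: `violates` is a hypothetical stand-in for a full power flow run that checks the operating constraints, and the step size and cap are arbitrary.

```python
def per_bus_hc(p_base, violates, step=0.5, p_max=10.0):
    """Per-bus hosting capacity: raise the injection at one bus while all
    other injections stay at their base values, and record the largest
    feasible increase before `violates` reports a constraint violation."""
    hc = []
    for s in range(len(p_base)):
        p = list(p_base)                 # all other settings stay the same
        while p[s] + step <= p_max:
            p[s] += step
            if violates(p):              # hypothetical constraint check
                p[s] -= step
                break
        hc.append(p[s] - p_base[s])
    return hc

# two-bus example from the text: violations above (1, 3.5)
print(per_bus_hc([0, 2], lambda p: p[0] > 1 or p[1] > 3.5))  # [1.0, 1.5]
```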
The per-bus hosting capacity values for each bus within a feeder are drastically different based on the impedances, short circuit ratios, and loads at the transformers. To determine per-bus HC values, traditional methods need to solve optimal functions or do iterative HCA to check the system violations for each bus respectively. Unlike these time-consuming methods, our learning-based HCA is to find a mapping rule between the power flow data and the hosting capacity data.
With the offline trained model, the online prediction process can compute the per-bus HC values in real time.

B. Problem Formulation
With power flow data and per-bus hosting capacity data, we formulate our hosting capacity analysis (HCA) as a regression problem. Our deep learning-based model has offline training and online prediction. In training, the model uses the historical time-series data from system setups. Specifically, the power flow data from power flow analysis is the training input, including the voltage magnitudes, voltage angles, load profiles, and PV profiles. Moreover, the per-bus HC data is the desired output, which is also generated from simulation. Since our training requires data from simulation, the detailed feeder settings for power flow analysis are also essential. Our learning model can capture the HC-related information among these data. Subsequently, we can apply the well-trained learning model to calculate the HC value according to the new inputs of the system in real time.
Our deep learning-based HCA does not calculate the per-bus HC values independently. Instead, we use two data sequences for better HC determination, considering the power system's temporal and spatial correlations. Specifically, the temporal sequence, which is the historical time-series data, allows the model to learn the periodic pattern of the power system and obtain the HC changes over time [21]. The spatial sequence encodes the topology of the distribution network. The per-bus HC values of two adjacent buses could be different because of different load profiles, PV profiles, etc. Also, the constraints when determining the hosting capacity are highly related to the spatial relationships of buses. For example, a new PV may introduce reverse current or an overvoltage violation at an upstream bus. Therefore, the spatial sequences provide a physical model embedding to improve the accuracy [42].
To capture the spatial configuration and temporal dynamics, and to determine the HC with high accuracy, both spatial and temporal information need to follow well-formed orders. Our proposed ST-LSTM model has two dimensions, the temporal and spatial sequences. The temporal sequence can be derived from the time-series models. However, for the spatial information, it is difficult to put all the buses of a complex distribution grid into a single sequence, and a random traversal of the network will result in wrong spatial correlation. Thus, to convert the network into spatial sequences, we propose traversing the network with paths. This method divides the distribution grid into different paths, from the slack bus or three-phase main trunk to each lateral end. These paths to the lateral ends cover all the feeder buses, and we call this the longest paths method. Due to the fewer buses in one path, the dimensionality of the input data also decreases, which reduces the computation burden. In the following, we formally define the problem of learning-based hosting capacity analysis.
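The path enumeration behind the longest paths method can be sketched as a depth-first traversal from the slack bus to every lateral end on a radial feeder model. The `children` adjacency map and bus labels here are illustrative, not the paper's data structures.

```python
def feeder_paths(children, root=0):
    """Enumerate root-to-lateral-end paths by depth-first traversal.
    `children` maps each bus to its downstream buses (radial feeder
    assumption); every bus appears on at least one returned path."""
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:                 # lateral end reached: path is complete
            paths.append(path)
        for k in kids:
            stack.append(path + [k])
    return paths

# toy radial feeder: slack bus 0 branches to buses 1 and 2; 1 feeds 3
print(feeder_paths({0: [1, 2], 1: [3]}))
```

On this toy grid the two returned paths, (0, 1, 3) and (0, 2), together cover all four buses, matching the coverage property the text relies on.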

C. Mathematical Problem Definition
Mathematically, in the offline training, we define the input tensor X as the values of all the time-series power flow data from time 0 to time T. Moreover, n is the number of input features of each bus, and m is the number of buses. X ∈ R^{T×n×m} is a tensor, and x_{s,i}(t) is the value of the i-th power system feature of bus s at time-series step t. X(t) ∈ R^{n×m} denotes the feature values of all the buses at time t. Similarly, the output H ∈ R^{T×1×m} is a tensor representing the calculated HC values. H(t) denotes a slice of H at time t, and h_s(t) is the HC value of bus s.
With the historical data, we want to learn a regression model f : X → H to predict the future H(τ) based on X(τ). This mapping is achieved by updating the parameters of our neural network model. In training, we decompose the input tensor X ∈ R^{T×n×m} into slices. Each slice X(t) ∈ R^{n×m} represents the input data at time-series step t. In our recurrent Spatial-Temporal LSTM model, at each time-series step, the input is one slice of the data tensor.
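The tensor shapes above can be illustrated with a toy example; the sizes T, n, and m here are arbitrary placeholders, not the paper's feeder dimensions.

```python
import numpy as np

# toy sizes (assumed for illustration): T time steps, n features, m buses
T, n, m = 24, 4, 6
X = np.random.rand(T, n, m)   # time-series power flow feature tensor
H = np.random.rand(T, 1, m)   # per-bus HC target tensor

# at each time-series step, one slice X(t) of shape (n, m) is the model input
X_t = X[0]
assert X_t.shape == (n, m) and H[0].shape == (1, m)
```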
The problem of using the deep learning-based model to calculate the HC value is defined as follows.
The offline training process: given the historical pairs {(X(t), H(t))}, t = 0, . . . , T, find the model parameters θ that minimize the training loss, θ* = arg min_θ Σ_t ||f_θ(X(t)) − H(t)||². The online prediction process: given a new system observation X(τ), output the per-bus HC values Ĥ(τ) = f_θ*(X(τ)).

III. SPATIAL-TEMPORAL LONG SHORT-TERM MEMORY
With the formulated problem, we now illustrate the proposed deep learning method, the Spatial-Temporal LSTM, for data-driven HCA. The basic LSTM model is first introduced to consider temporal correlation. Since the basic LSTM model cannot consider spatial and temporal correlations simultaneously in the power system, we modify the forget gate in the LSTM cell to dual forget gates to process the temporal and spatial information in parallel. Moreover, another sensitivity gate is designed to boost the HCA accuracy with a critical factor: the voltage violation constraint in HCA.

A. Illustration of Basic LSTM
To consider the correlation over time in learning, we use the Recurrent Neural Network (RNN), which is designed for sequential information. An RNN is a series of identical feedforward neural networks, one unit for each time step, known as 'RNN cells' [43], [44]. The input of an RNN cell has two parts. One is the output from the last cell, the hidden state h_{t−1}, and the other is the input vector for this cell, x_t. In this way, an RNN can use its internal state (memory) to process sequences of inputs. By transferring the information through these repeating cells, the network can be trained recurrently. Each RNN cell has an output h_t as

h_t = φ(W_h h_{t−1} + W_x x_t + b_h),   (1)

where φ(·) is the activation function to extract non-linearity. Though RNNs are designed to process sequential data, they have drawbacks for HCA. One is that RNNs cannot transfer information correctly between two cells with relatively long distances. To solve this problem, LSTMs improve basic RNNs via gate functions, which are capable of learning short-term and long-term dependencies [45]. Such a gate design also avoids gradient exploding and vanishing, another significant drawback of RNNs [45]. Therefore, LSTMs have become an effective model for sequential prediction problems.
Different from the basic recurrent unit, each LSTM unit, shown in Fig. 1(a), uses the cell state c_t to memorize the long-term information and the hidden state h_t to capture the short-term information, whereas RNNs only use the hidden state h_t. To control the information stream in the model, three designed gates recurrently update c_t and h_t, which is the most important design of the LSTM. The functions of a single cell are

i_t = σ(W_i [h_{t−1}, x_t] + b_i),   (2)
f_t = σ(W_f [h_{t−1}, x_t] + b_f),   (3)
o_t = σ(W_o [h_{t−1}, x_t] + b_o),   (4)
g_t = tanh(W_c [h_{t−1}, x_t] + b_c),   (5)

where i_t, f_t, and o_t represent the input, forget, and output gates of the t-th cell, respectively. In order to handle long-term dependencies in data, the input gate works together with the candidate g_t to decide what information should be input, and the forget gate decides what information should be forgotten. Additionally, the output gate decides what information should be output. These three gates use the sigmoid function σ(·) as the activation function. This nonlinear function maps the input values between 0 and 1, where 1 means keeping the information and 0 means abandoning it. Each of the gates and the candidate is a small neural network to learn a functional mapping. h_{t−1} is the hidden output vector of the last cell, and x_t represents the input vector.
Specifically, x_t is the power flow data in our problem formulation. Besides, W_i, W_f, W_o, and W_c are the weight matrices, and b_i, b_f, b_o, and b_c are the corresponding biases that parameterize the functional gates and candidate.
Since we have the output of all the gates, the next step is to use them to update the cell state c_t and the hidden state h_t. The c_t updating function is

c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t,   (6)

where ⊙ is the element-wise (Hadamard) product. The former term is the previous cell state c_{t−1} controlled by the forget gate f_t, which decides how much long-term memory should be forgotten. The latter term is the mapped new input candidate controlled by the input gate, which decides how much new input should be added. The two terms work together to update the cell state c_t. Subsequently, h_t is denoted as

h_t = o_t ⊙ tanh(c_t),   (7)

where h_t is the hidden state that needs to be transferred to the next cell. In this way, the LSTM model can follow a sequence to conduct training. The output h_t of each cell is the result we want to obtain, which is the hosting capacity value in our problem.
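The gate and state updates described above can be sketched in NumPy. This follows the standard LSTM convention of applying each gate to the concatenated vector [h_{t−1}, x_t]; the dictionary-of-weights layout is our illustrative choice, not code from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One basic LSTM cell update. W and b hold the per-gate weight
    matrices and biases keyed 'i', 'f', 'o', 'c' (input, forget, output
    gates and candidate); each acts on the concatenation [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W['i'] @ z + b['i'])   # input gate: how much new input to add
    f = sigmoid(W['f'] @ z + b['f'])   # forget gate: how much memory to keep
    o = sigmoid(W['o'] @ z + b['o'])   # output gate: how much state to emit
    g = np.tanh(W['c'] @ z + b['c'])   # candidate values
    c = f * c_prev + i * g             # element-wise cell state update
    h = o * np.tanh(c)                 # hidden state passed to the next cell
    return h, c
```

With all-zero parameters every gate evaluates to 0.5 and the candidate to 0, so the cell state simply halves, which is a convenient sanity check on the wiring.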

B. Temporal and Spatial Sequence in HCA
A basic LSTM is effective when processing temporal sequential data, e.g., the historical time-series power flow data. Thus, this model can learn the HC changes over time with respect to different system conditions like loads, inverter setpoints, etc. However, a power system is a complex network where buses have mutual effects on each other. We expect the model to be coupled or coordinated because we are looking for violations not only at the bus under consideration but anywhere in the feeder due to the power injection at this bus. An overvoltage violation may occur at an upstream bus even though the bus where the power is injected does not see overvoltage. Therefore, the connectivity relationship is crucial in HCA, and ignoring the spatial correlation will cause low accuracy.
Since an LSTM is trained as a simple chain where one cell can only receive information from the last cell, this recurrent network calculates both c_t and h_t according to one sequential data series. Thus, the LSTM network is unable to process two data sequences of different dimensions simultaneously. In order to consider both temporal and spatial correlation for HC, we embed the topology information of the power system by modifying the LSTM into the Spatial-Temporal LSTM. The spatial sequences we use are the paths in the network, each from the feeder head to a lateral end.
In Fig. 2 and Fig. 3, we compare the structures and information flows of the basic LSTM and the ST-LSTM to show how cells transfer c_t and h_t in the ST-LSTM model. The cell design that embeds temporal and spatial correlation is detailed in the next subsection.

C. Design of the Dual Forget Gates
Specifically, to process temporal and spatial sequences simultaneously, we reconstruct the structure of the cells as shown in Fig. 2. Though the forget gate is critical in the basic LSTM, a single gate cannot process information from two dimensions separately. Therefore, we modify this gate to dual forget gates, which decide what information the cell should memorize from these two dimensions. Consequently, one new cell can receive the cell states and hidden states from the last temporal cell and the last spatial cell separately. Mathematically, we design the dual forget gates and modify the other functional gates from (2), (3), (4), and (5) accordingly as follows,

f^T_{s,t} = σ(W^T_f [h_{s,t−1}, x_{s,t}] + b^T_f),   (8)
f^S_{s,t} = σ(W^S_f [h_{s−1,t}, x_{s,t}] + b^S_f),   (9)
i_{s,t} = σ(W_i [h_{s,t−1}, h_{s−1,t}, x_{s,t}] + b_i),   (10)
o_{s,t} = σ(W_o [h_{s,t−1}, h_{s−1,t}, x_{s,t}] + b_o),   (11)
g_{s,t} = tanh(W_c [h_{s,t−1}, h_{s−1,t}, x_{s,t}] + b_c),   (12)

where f^T_{s,t} and f^S_{s,t} are the dual forget gates. f^T_{s,t} represents the temporal forget gate based on temporal information, while f^S_{s,t} is the spatial forget gate based on spatial information. Besides, h_{s,t−1} is the hidden state from the last time step t − 1, while h_{s−1,t} is the hidden state from the last space step s − 1. The designed dual forget gates are highlighted in Fig. 1(b).
To decide which temporal information needs to be memorized, we multiply the cell state c_{s,t−1} element-wise by the temporal forget gate f^T_{s,t}. Similarly, the element-wise multiplication of the cell state c_{s−1,t} and the spatial forget gate f^S_{s,t} determines which spatial information to keep. Thus, the new cell state is

c_{s,t} = f^T_{s,t} ⊙ c_{s,t−1} + f^S_{s,t} ⊙ c_{s−1,t} + i_{s,t} ⊙ g_{s,t}.   (13)

The design of the dual forget gates causes the cell state c_{s,t} and the hidden state h_{s,t} of this cell to be impacted by both the temporal forget gate and the spatial forget gate. In this way, the ST-LSTM model can capture the temporal and spatial correlation in parallel.
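One ST-LSTM cell update with dual forget gates can be sketched as follows. This is a minimal reading of the design in the text, where the temporal forget gate consumes the previous-time state and the spatial forget gate consumes the previous-path-step state; the exact parameterization (which hidden states feed each gate) is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def st_lstm_step(x, h_t_prev, h_s_prev, c_t_prev, c_s_prev, W, b):
    """ST-LSTM cell sketch: f_T forgets along time from c_{s,t-1},
    f_S forgets along the path from c_{s-1,t}, and the new cell state
    blends both memories with the gated candidate."""
    z = np.concatenate([h_t_prev, h_s_prev, x])
    f_T = sigmoid(W['fT'] @ z + b['fT'])   # temporal forget gate
    f_S = sigmoid(W['fS'] @ z + b['fS'])   # spatial forget gate
    i = sigmoid(W['i'] @ z + b['i'])       # input gate
    o = sigmoid(W['o'] @ z + b['o'])       # output gate
    g = np.tanh(W['c'] @ z + b['c'])       # candidate
    c = f_T * c_t_prev + f_S * c_s_prev + i * g   # dual-memory cell state
    h = o * np.tanh(c)                     # hidden state (HC-related output)
    return h, c
```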

D. Design of the Sensitivity Gate
The previous input features come from power flow analysis and lack specific HC-related information. However, such information cannot be added directly because it covers information over different locations. To improve the performance of the ST-LSTM, we design a new gate, namely the sensitivity gate, to decide which information should be input. The sensitivity gate, collaborating with the input gate, uses the voltage sensitivity information to control the input when adding new information to the cell state.
During hosting capacity analysis, we notice that the voltage violation constraint plays a crucial role in HCA, which is similarly concluded by [5]. This constraint checks for overvoltage at the other buses after adding PV generators at one selected bus. Specifically, [39] illustrates that voltage rise at load busbars is a severely limiting factor when installing DERs. Based on this, we use the voltage sensitivity data as the input of the sensitivity gate. The voltage sensitivity data is the voltage change at the other buses when adding a unit amount of generation at one bus.
The voltage sensitivity is considered over buses, where the locations of nodes lead to different impacts on the other buses. To embed the location-based impact, we use the average sensitivity data. We define the average voltage sensitivity of bus s as the average voltage change of the other buses on the path it belongs to when adding a generator at bus s. Our simulations found that the maximum bus voltage change on one path always happens at the lateral end. Additionally, the voltage change of the neighbor nodes has a higher error than the average bus voltage change. Thus, we use the average bus voltage change to reflect the impact brought by each individual bus on the entire path.
Moreover, we observe that, when adding a generator at bus s, the voltage change of the buses on the same path is higher than that of the buses on other paths, which shows the independence of paths. Thus, this weaker interaction between two paths supports our design of the longest paths method.
Mathematically, the sensitivity gate function is

v_{s,t} = σ(W_v V_{s,t} + b_v),   (14)

where V_{s,t} is the vector of voltage sensitivity data, which is mapped by the weight matrix W_v. In this gate, we also use the sigmoid function as the activation function, which extracts the non-linear impact of voltage sensitivity on HC. The designed sensitivity gate is highlighted in Fig. 1(b). Consequently, the cell state function is updated from (13) to

c_{s,t} = f^T_{s,t} ⊙ c_{s,t−1} + f^S_{s,t} ⊙ c_{s−1,t} + i_{s,t} ⊙ v_{s,t} ⊙ g_{s,t}.   (15)

Since we do not modify the output gate, the function of h_{s,t} remains

h_{s,t} = o_{s,t} ⊙ tanh(c_{s,t}).   (16)

In summary, to capture the temporal and spatial correlation, the designed dual forget gates allow the ST-LSTM model to process information in these two dimensions in parallel. To improve the model's accuracy, we embed the physical information by designing the sensitivity gate, which considers the voltage violation constraint in HCA. Furthermore, to achieve the goal of online prediction, we keep using new time-series data to update the model.
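The joint filtering role of the sensitivity gate and the input gate can be sketched as below; the parameter names `W_v` and `b_v` and the element-wise combination are illustrative assumptions consistent with the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_input(i_gate, g_cand, V_s, W_v, b_v):
    """Sensitivity gate sketch: the voltage sensitivity vector V_s is
    mapped through a sigmoid, then filters the candidate jointly with
    the input gate before it enters the cell state."""
    v_gate = sigmoid(W_v @ V_s + b_v)   # sensitivity gate activation
    return i_gate * v_gate * g_cand     # doubly filtered new input term
```

With zero weights the sensitivity gate is 0.5 everywhere, so the new input term is exactly half of the input-gated candidate, a simple check that the gate composes multiplicatively.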

IV. NUMERICAL RESULTS
The proposed Spatial-Temporal LSTM model for hosting capacity analysis is validated extensively on various test cases. This section illustrates the results on test distribution grids, including the IEEE 123-bus feeder and a utility high penetration feeder. This utility feeder, shown in Fig. 5, is an actual 12.47 kV, 9 km-long Arizona utility feeder with 3.8 MW of residential roof-top PV installed, leading to a penetration level of more than 200% (3.8 MW/1.6 MW) as compared to the feeder total gross load during peak PV production hours [46]. To simplify the calculation, we assume that all equipped PVs at residential customers are integrated into the secondary side of each distribution transformer. The simplified model of this feeder contains 2100 bus nodes and 371 distribution transformers [47].
We validate the performance of the designed dual forget gates and the sensitivity gate via comparison with baseline models, which are the temporal sequence LSTM and the spatial sequence LSTM. These comparison results support our design.

A. Dataset Preparation
To prepare training and testing data for validation, we use CYME to generate time-series data by conducting power flow analysis and hosting capacity analysis. CYME is power engineering simulation software by EATON that can be used to analyze power systems [48]. In CYME, the Load Flow module provides power flow analysis, and the Integration Capacity Analysis module calculates the HC. Specifically, we use the Integration Capacity Analysis module in the CYME software to calculate the hosting capacity as part of the training data set [49], [50]. This module can set different constraints, including voltage violation constraints, thermal loading constraints, etc. Then, the software progressively adds PV and runs power flow simulations until violations of one or more operation standards appear. Our learning-based model is to find a mapping rule from the power flow data to the HC values for each bus. Since these different constraints already limit the simulated HC datasets we use in our model, our learning-based model can naturally consider the constraints. We have time-series models of the IEEE 123-bus feeder and the Arizona utility high penetration feeder in CYME.
We use the result of the Load Flow analysis as our input vector. Specifically, we use the voltage magnitudes, voltage angles, load profiles, and PV profiles. The h_t of each cell is the HC value. We use the min-max normalization method to preprocess the input vector. For each feature in the input vector, we convert the minimum and maximum values to 0 and 1, respectively; all other values are converted to a decimal number between 0 and 1. For the HC data, we directly use a selected constant to scale the values. This linear scaling method allows us to easily map the output back to the expected HC value.
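The preprocessing described above can be sketched as follows; the per-feature min-max mapping and the constant HC scaling mirror the text, while the function names are our own.

```python
import numpy as np

def min_max(x):
    """Min-max normalization per feature (column): the minimum maps to 0,
    the maximum to 1, and all other values fall in between."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def scale_hc(hc, k):
    """Linear scaling of HC targets by a selected constant k; predictions
    are recovered by multiplying back by k."""
    return hc / k

features = np.array([[0.0, 2.0],
                     [1.0, 4.0]])
print(min_max(features))        # columns mapped onto [0, 1]
```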
To use the voltage sensitivity data on one selected path, first, we add a generator at one bus. Then, we calculate the average voltage change at the other buses on this path. This average voltage change vector is the input of the sensitivity gate in the ST-LSTM cell. In this way, the sensitivity gate can capture the overall impact of each bus on this path.
B. Experiment Setups
1) Baseline Methods: As shown in Fig. 2, the basic LSTM model can only flow one type of sequential information, either the temporal or the spatial sequence. Thus, we use the basic LSTM as the baseline model to test the performance using either temporal or spatial sequences.
For the temporal sequence LSTM, we have time-series models. In each training step, we input the power flow data from the power flow analysis of all the time-series steps at one bus and obtain as output the HC data vector for all the time-series steps at this bus. For the spatial sequence LSTM, we use different paths in the feeder, all from the slack bus to the lateral ends. Then, we collect the data of one path at a time and feed it into the spatial sequence LSTM model; in this way, we obtain the HC data vector of all the buses at this time. For our ST-LSTM model, we use a similar setting to the spatial sequence LSTM. The difference is that we use the outputs from the previous time step as part of the input at each step. In the IEEE 123-bus feeder, we have 12 time-series models, 8 of which are the training set and 4 of which are the testing set. The utility feeder has 24 time-series models, of which 16 are training sets and 8 are testing sets.
2) Proposed ST-LSTM (Longest Paths Method for Spatial Correlation): For the IEEE example feeders, we show the result of the IEEE 123-bus feeder. The longest path in the IEEE 123-bus feeder is the red path shown in Fig. 6. This path contains 24 buses and two critical regulators. Node 67 on this path is one of the most critical nodes in HCA. Besides the longest path, different paths compose the whole network. Because the IEEE 123-bus feeder does not have long lateral branches and we can add PV generators on the three-phase main trunk, we have paths from the slack bus to each lateral end. Since we can obtain the HC values of all the buses on a trained path, theoretically, we can find a finite number of paths to cover the whole feeder. In this feeder, we have 39 paths and calculate the HC values of all buses in the feeder. The computed results may be slightly different for buses on overlapping paths. To reduce this risk, we use a boosted method that takes the average of the computed values from the different paths. Compared to the IEEE feeder, the utility feeder has fewer overlapping paths, and the paths are from the three-phase main trunk to each lateral end. We divide the whole network into 5 different zones and use a depth-first search (DFS) based script to find about 6 paths in each zone. In this way, we can obtain the HC values of all the buses.
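The boosted averaging over overlapping paths can be sketched as below; the per-path prediction dictionaries are an illustrative data layout, not the paper's implementation.

```python
from collections import defaultdict

def average_over_paths(path_predictions):
    """Combine per-path HC predictions: for a bus covered by several
    overlapping paths, take the mean of the HC values predicted on each
    path that contains it. `path_predictions` is a list of {bus: hc}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for preds in path_predictions:
        for bus, hc in preds.items():
            sums[bus] += hc
            counts[bus] += 1
    return {bus: sums[bus] / counts[bus] for bus in sums}

# bus 2 lies on two overlapping paths, so its two predictions are averaged
print(average_over_paths([{1: 2.0, 2: 4.0}, {2: 6.0}]))  # {1: 2.0, 2: 5.0}
```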
3) Proposed ST-LSTM (Online Update for Better Performance): The proposed ST-LSTM is a regression model that learns a mapping rule from historical power flow data to HC values; as new power flow data arrive, the trained model can be updated online rather than retrained from scratch, maintaining its performance over time.
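A minimal sketch of such an online update, using a plain linear regression model as a stand-in for the ST-LSTM weights and a few gradient steps on newly arrived samples (the function, learning rate, and step count are our own assumptions):

```python
import numpy as np

def online_update(W, x_new, y_new, lr=0.1, steps=200):
    """Fine-tune an already-trained regression weight matrix W with
    newly arrived (power flow, HC) samples instead of retraining
    from scratch -- a stand-in for updating ST-LSTM weights online."""
    for _ in range(steps):
        pred = x_new @ W
        grad = x_new.T @ (pred - y_new) / len(x_new)  # MSE gradient
        W = W - lr * grad
    return W

# New measurements arrive: the update pulls W toward the new mapping.
x_new = np.array([[1.0], [2.0]])
y_new = np.array([[2.0], [4.0]])   # true relation: y = 2 x
W = online_update(np.zeros((1, 1)), x_new, y_new)
```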

4) Performance Evaluation Metrics:
We use two criteria to evaluate the performance of the models: the mean square error (MSE) and the percentage accuracy. The model uses the MSE to conduct back-propagation, while the percentage accuracy gives an intuitive comparison between the simulated HC and the calculated HC. Specifically, the simulated HC is the simulation result from CYME, and the calculated HC is the output of the deep learning models.
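The paper does not spell out the exact formulas; a plausible implementation of the two criteria, assuming the percentage accuracy is defined as 100% minus the mean absolute percentage error, could look like:

```python
import numpy as np

def mse(hc_sim, hc_calc):
    """Mean square error between the simulated HC (from CYME)
    and the HC calculated by the learning model."""
    hc_sim, hc_calc = np.asarray(hc_sim), np.asarray(hc_calc)
    return float(np.mean((hc_sim - hc_calc) ** 2))

def percentage_accuracy(hc_sim, hc_calc):
    """Percentage accuracy: 100% minus the mean absolute
    percentage error over all buses/time steps (assumed form)."""
    hc_sim, hc_calc = np.asarray(hc_sim), np.asarray(hc_calc)
    mape = np.mean(np.abs(hc_sim - hc_calc) / np.abs(hc_sim))
    return float(100.0 * (1.0 - mape))
```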

C. Accuracy of ST-LSTM With Multi-Dimension Information
To validate the performance of our ST-LSTM with the designed dual forget gates, this section shows the results on the longest path of the IEEE 123-bus feeder using three models: the temporal sequence LSTM, the spatial sequence LSTM, and the ST-LSTM.
As shown in Table I, the temporal sequence LSTM performs worst: HC is highly correlated with spatial information, which the temporal sequence LSTM lacks. Compared with the spatial sequence LSTM, the ST-LSTM model achieves both a lower MSE and a lower percentage error. Fig. 7 compares these two models at one time step, time-series 11.

Remark 1:
For the ST-LSTM model, we need a method to build a spatial sequence and capture the structural relationship for simulation. In computer science, graph neural networks (GNNs) are a popular way to embed topology information [51]. Therefore, we implement a GNN in the ST-LSTM to compare with the proposed longest paths method for spatial correlation. Specifically, we add a graph convolutional network (GCN) layer, a representative type of GNN, before the input of the ST-LSTM cell [51]. This GCN layer is built on the adjacency matrix and maps the original input power flow data to its output. However, this direct model combination did not improve the learning performance in the IEEE 123-bus feeder. In the simulation, we use the order of bus numbers as the spatial sequence. The results of the basic ST-LSTM model and the GCN-based ST-LSTM model are shown in Table II; neither model includes the proposed sensitivity gate. Thus, we propose the longest paths method to divide the feeder network into paths for better learning performance.
[Table III caption: Comparison between the spatial sequence LSTM model and the ST-LSTM model on other paths of the IEEE 123-bus feeder.]
[Table IV caption: Comparison of percentage error between the spatial LSTM and ST-LSTM in the utility feeder.]
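A GCN layer of the kind referenced here, with the standard symmetric normalization and self-loops, can be sketched as follows; this is a generic illustration built from the adjacency matrix, not the authors' implementation:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph convolution applied to node features X before they
    enter the ST-LSTM cell: add self-loops, symmetrically normalize
    the adjacency matrix, then project with weights W and a ReLU."""
    A_hat = A + np.eye(A.shape[0])                 # self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Two buses connected by one line, 3 input features, 2 output features.
out = gcn_layer(np.ones((2, 3)), np.array([[0.0, 1.0], [1.0, 0.0]]),
                np.ones((3, 2)))
```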

D. Accuracy of the ST-LSTM on Different Paths
To prove that the high accuracy of the ST-LSTM is not an accidental experimental result, this section shows the results of other paths in the IEEE 123-bus feeder using the spatial sequence LSTM and the ST-LSTM models.
We choose another two paths in the IEEE 123-bus feeder. In Fig. 6, the yellow and blue paths represent Path 2 and Path 3 in the IEEE 123-bus feeder, respectively.
The numerical comparison between the spatial sequence LSTM and the ST-LSTM on Path 2 and Path 3 is shown in Table III. Compared with the spatial sequence LSTM, the ST-LSTM model has a significantly smaller percentage error on both paths. In other words, the ST-LSTM model increases the accuracy of the HC calculation on different paths in the IEEE 123-bus feeder.

E. Practicality of the ST-LSTM
This section will show the accuracy of the ST-LSTM in a utility feeder, which will further verify the practicality of the ST-LSTM in HCA.
The network we used is a generic Arizona utility feeder with high PV penetration. We divide this utility feeder into different zones, primarily based on the location of the buses and where they connect to the three-phase main trunk. In each zone, we again use a script to find all the paths, and in this way obtain the HC value for all the buses. The comparison is shown in Table IV; the accuracy of the ST-LSTM is again higher than that of the basic LSTM. The detailed results for one path at time-series 16, using the spatial sequence LSTM and the ST-LSTM models, are shown in Fig. 8(a) and Fig. 8(b), respectively.

F. Accuracy of the ST-LSTM With the Sensitivity Gate
Having verified the performance of the ST-LSTM with the designed dual forget gates, this section shows the accuracy of the ST-LSTM with and without the sensitivity gate, which further validates the design of the sensitivity gate.
In our review and simulation, we found that the voltage sensitivity data correlate with the HC. To exploit this correlation, we use the voltage sensitivity data as the input of the sensitivity gate. The comparison between the ST-LSTM without and with the sensitivity gate is shown in Table V, and the detailed results for one path at time-series 16, using the ST-LSTM without and with the sensitivity gate, are shown in Fig. 8(b) and Fig. 8(c), respectively. Notice that the sensitivity gate significantly decreases the percentage error.

This paper develops a deep learning-based method, the Spatial-Temporal LSTM (ST-LSTM), to calculate the hosting capacity (HC) of distribution grids. There are three main contributions. First, the paper proposes a deep learning-based method that calculates the HC in real time. Second, it modifies the basic LSTM into the ST-LSTM by replacing the forget gate, which is exceptionally significant in the basic LSTM model, with dual forget gates that build a correlation between the temporal and spatial sequences. Third, it further increases the accuracy of the deep learning-based HC calculation with the designed sensitivity gate, which uses the results of voltage sensitivity analysis. Comparing the basic LSTM and the ST-LSTM on IEEE example feeders confirms that the ST-LSTM considerably increases both the average and the per-bus HC accuracy. To further validate the ST-LSTM deep learning framework, we test its performance on a generic Arizona utility feeder with high PV penetration, which also reinforces the superiority of the new designs.
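One ST-LSTM step combining the dual forget gates and the voltage sensitivity gate described above might look like the following sketch; the weight names, shapes, and exact placement of the sensitivity gate in the cell update are our own reading of the design, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def st_lstm_cell(x, s, h_spat, c_spat, h_temp, c_temp, P):
    """One ST-LSTM step: dual forget gates weigh the spatial memory
    (previous bus on the path) and the temporal memory (same bus,
    previous time), and a sensitivity gate injects the voltage
    sensitivity data s. P holds illustrative weight matrices."""
    z = np.concatenate([x, h_spat, h_temp])
    f_s = sigmoid(P["Wfs"] @ z + P["bfs"])   # spatial forget gate
    f_t = sigmoid(P["Wft"] @ z + P["bft"])   # temporal forget gate
    i   = sigmoid(P["Wi"]  @ z + P["bi"])    # input gate
    o   = sigmoid(P["Wo"]  @ z + P["bo"])    # output gate
    g   = np.tanh(P["Wg"]  @ z + P["bg"])    # candidate memory
    v   = sigmoid(P["Wv"]  @ s + P["bv"])    # voltage sensitivity gate
    c   = f_s * c_spat + f_t * c_temp + i * g * v
    h   = o * np.tanh(c)
    return h, c

# Shape check with random weights: 4 power flow features, 2 voltage
# sensitivity features, hidden size 3 (all dimensions hypothetical).
rng = np.random.default_rng(1)
n_x, n_s, n_h = 4, 2, 3
nz = n_x + 2 * n_h
P = {k: rng.standard_normal((n_h, nz)) for k in ["Wfs", "Wft", "Wi", "Wo", "Wg"]}
P["Wv"] = rng.standard_normal((n_h, n_s))
for k in ["bfs", "bft", "bi", "bo", "bg", "bv"]:
    P[k] = np.zeros(n_h)
h, c = st_lstm_cell(rng.standard_normal(n_x), rng.standard_normal(n_s),
                    np.zeros(n_h), np.zeros(n_h), np.zeros(n_h),
                    np.zeros(n_h), P)
```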
In our future work, we plan to theoretically investigate how voltage sensitivity data improve the performance of deep learning-based HCA. Furthermore, we will explore more efficient ways to embed spatial information; for example, a tree-structured LSTM network or a combination of a GNN and an LSTM network may be feasible directions to extend the ST-LSTM design.