An Automated Approach for Screening Residential PV Applications Using a Random Forest Model

The rapid growth of residential solar photovoltaic (PV) applications is a challenge for distribution utilities as they work to maintain grid standards and minimize customer interconnection times. A ''screening process'' is typically used by utilities to approve customer interconnection requests. While conventional ''fast-track screening'' methods (e.g., limiting PV capacity to 15% of transformer capacity) can be performed quickly, they are too restrictive for new PV interconnections. On the other hand, detailed studies often require power flow modeling and increase customer interconnection times. This work uses a random forest (RF) model to screen residential solar applications without the need for power flow analysis. The proposed RF model takes commonly available PV application information and network data as inputs, such as application size and solar penetration. The correlation and importance of these RF inputs are investigated so that utilities have flexible implementation options. Further advantages of this data-driven approach are transparency, i.e., utilities can show how different inputs affect a pass/fail decision, and a quantified probability associated with each screening decision. Case studies show how a utility would use the proposed approach and benchmark it against conventional screening methods. The proposed approach was found to be more accurate than the conventional fast-track screens. It was also found to be faster than detailed power flow studies and nearly as accurate.


I. INTRODUCTION
VOLUME 10, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Typical utility interconnection process [5].

The rapid deployment of residential solar photovoltaics (PV) poses challenges for distribution utilities. During the interconnection process, it is important for engineers to maintain grid standards, minimize customer wait times, and assess customer applications accurately. These objectives are often complicated by poor primary and secondary network models and overly simplistic ''fast-track'' screening heuristics. Fig. 1 shows a typical utility interconnection process for distributed PV. In this process, customers first submit ''solar applications'' (i.e., interconnection requests) to utilities when they want to install PV at their premises, and the utility then decides whether the request is approved. Typically, applications include the solar capacity and other technical details to help the utility determine if the installation could cause a violation on the distribution network. According to the Federal Energy Regulatory Commission (FERC) Small Generator Interconnection Procedures [1], applications should go through a fast-track screen first. One common fast-track screen is the ''15% rule,'' which sends customers to ''Supplemental Review'' if the aggregate PV capacity is more than 15% of the peak load [2]; however, as residential solar penetrations increase and fewer customers pass the screen, the 15% rule can become overly restrictive. The authors of [3] proposed to update the 15% rule
method based on distributed generation type and location-specific screening criteria, which can be used to identify zones where higher penetrations are acceptable, e.g., downstream from one service transformer. Although this location-specific method performs screens quickly, it still yields conservative results [4]. After an impact study is deemed necessary to evaluate an interconnection request (see Fig. 1), a common utility approach is to run power flow on the primary distribution network. Due to a general absence of high-quality secondary models, customers are modeled with a direct connection to the primary or to the service transformers, and a fixed voltage drop is applied to account for secondary network impacts. These methods run the risk of incorrectly passing or failing applications.
Several power flow-based methods have emerged to inform utility interconnections. Reference [6] defined local hosting capacities for different feeder zones based on characteristics, such as distance of PV to substation and voltage regulator set points. Reference [7] investigated the impact of PV location and phase connection type on PV hosting capacity and proposed to incorporate these impacts in fast-track screening. Reference [8] developed streamlined distribution system hosting capacity to inform utilities in screening new interconnection requests as well as to estimate how much and where PV can be located such that grid updates are minimized. Reference [9] provided a scalable open-source tool to estimate the hosting capacity of PV in distribution systems. Reference [10] reviewed current distributed energy resource interconnection practices and emerging solutions for incorporating more PV, such as system upgrades and non-wire alternatives (NWA, e.g., inverter control, solar-plus-storage).
In the absence of high-quality models, several data-driven methods have been proposed to improve distribution utility models. These methods use advanced metering infrastructure (AMI) to correct customer-to-transformer mapping and phase identification [11], and to predict secondary topologies [12], [13], but they are not robust to poor AMI coverage. By using decision trees and logistic regression to predict secondary network models, [14] demonstrated both robustness and scalability. Research into data-driven hosting capacity has also explored the possibility of circumventing network models using machine learning [15], [16], [17]. However, these data-driven hosting capacity methods cannot be used directly by utilities to screen customer applications. This is because the purpose of hosting capacity analysis is to find how much PV a distribution system can accommodate. Despite the ubiquity of hosting capacity maps, utilities follow different processes for evaluating residential solar applications as described in Fig. 1. They do this because applications can fail due to their size, location, and other attributes even when hosting capacity exists on the distribution network.
This paper proposes a data-driven screening approach for residential solar applications using a random forest (RF) model. Once trained, the approach does not require power flow analysis, which bridges the gap between conventional fast-track screening approaches and detailed impact studies.
To build the RF model, a utility would first need to generate training data sets by performing detailed grid impact studies [9], [18]. Note that this only requires a subset of feeders that are representative of the utility's feeders, and it can circumvent the need for validated circuit and secondary models for all customers in the service territory. Next, feature engineering is done to investigate both the correlation and importance of features. Last, the trained screening model can be applied to a larger set of feeders in the same utility territory by assuming that their grid characteristics are relatively consistent. The main properties of the proposed model are summarized as follows:
• The RF screening model can account for primary and secondary distribution network conditions.
• The RF screening model is fast and does not require large computing resources.
• The RF screening model uses commonly available inputs (such as application size, distance to substation) and can allow engineers to choose inputs based on available data.
• The RF screening model allows utilities to choose their own risk criteria based on a probabilistic assessment of voltage violation likelihood.
• The RF screening model is transparent. Engineers can see how each input feature leads to an assessment of the customer application's likelihood of causing a voltage violation.
The main contributions of this paper are summarized as follows:
• We propose an automated approach for screening residential PV applications based on an RF model. The approach was designed for utility engineers in need of a screening tool that is more accurate than traditional screening methods, and faster and less resource intensive than power flow based approaches.
• We quantitatively compare the proposed model with other screening approaches, and we demonstrate how a utility would use the proposed approach in different scenarios.

A. FEEDERS UNDER STUDY
The RF model was trained and tested on 12 northeastern feeders with a nominal voltage of 13.2 kV that were fit with secondary models produced by [14]. The feeders were selected from a larger set of feeder models based on several quality assurance criteria: the selected feeders served mostly residential customers, their hosting capacity results were realistic, and there was a close match between AMI and SCADA data. Additionally, the mean absolute error between customer service point and AMI voltages was less than 2 volts. Fig. 2 shows key metrics that describe the feeders' length, customer count, regulator count, and capacitor count. The top two sub-figures show that the feeder with index 8 has the longest distance from its feeder head to its furthest component but the fewest customers. The bottom two sub-figures summarize the installed and pending capacity of PV measured in kVA. The metrics for the 12 chosen feeders were largely representative of a larger set of 400 feeders from the utility.

B. GENERATING TRAINING AND TESTING DATA
Time-series hosting capacity analysis results were used to develop the RF training and testing data sets. Solar was sized so that solar energy generation matched load over one year. The hosting capacity analysis simulates many scenarios with increasing deployment of residential solar from 0% to 150% penetration (i.e., solar capacity/feeder peak load) until it was not possible to add more solar without causing a voltage violation (ANSI 5% deviation from nominal). More details about the time-series hosting capacity analysis can be found in [19]. Note that the violation metrics and thresholds are easily extensible to include other metrics [18]. The 95th percentile voltage was used to determine if the customer interconnection would cause a violation. Several features that are predictors of a voltage violation were collected, such as feeder primary nominal voltage and customer application size (kW). Detailed features will be introduced in the next section. Additionally, the hosting capacity analysis was performed under several NWA scenarios, including smart inverters with a fixed 98% power factor, the IEEE Category B volt-var control curve, and volt-watt control curves. The volt-var and volt-watt curves followed IEEE Std 1547-2018 [20]. A solar-plus-residential-battery NWA scenario was also analyzed, which used the following time-based control program:
• 10 pm to 4 am - discharge at 1/3 maximum discharge rate until the battery is fully discharged
• 5 am to 4 pm - charge the battery from gross solar production
• 5 pm to 9 pm - self-consumption; discharge the battery to offset positive net load, otherwise charge from net solar production.
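The labeling step described above can be sketched as follows. The 95th percentile criterion and the ANSI 5% limits come from the text; the time-series values and function name are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

ANSI_LIMITS = (0.95, 1.05)  # per-unit, 5% deviation from nominal

def label_application(sim_voltages_pu: np.ndarray) -> int:
    """Label a simulated customer interconnection: 1 = pass, 0 = fail.

    Following the paper, the 95th percentile of the customer's simulated
    time-series voltage decides whether the application causes a violation.
    """
    v95 = np.percentile(sim_voltages_pu, 95)
    lo, hi = ANSI_LIMITS
    return int(lo <= v95 <= hi)

# Hypothetical hourly voltage profiles for two candidate applications
ok = 1.0 + 0.02 * np.sin(np.linspace(0, 8 * np.pi, 8760))    # stays within limits
bad = 1.04 + 0.03 * np.sin(np.linspace(0, 8 * np.pi, 8760))  # 95th pct > 1.05

print(label_application(ok), label_application(bad))  # → 1 0
```

The resulting 0/1 labels, paired with the collected features, form one row of the training set per simulated application.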

III. RANDOM FOREST MODEL
Random forest is an ensemble method for classification relying on the ''wisdom of crowds,'' where a committee of decision trees casts votes for the predicted class [21]. The key objective of an RF is to build a large collection of de-correlated trees, find the probabilistic classification results of the individual decision trees, and then average those probabilistic results. In this case, we use the RF model to classify (screen) whether a customer adopting PV is likely to introduce a voltage violation, where each tree can be seen as a decision process that looks at different features and provides its individual decision. The RF then takes this collection of decisions to estimate the probability of a violation and distills it down to a pass/fail screen result. One advantage of using an RF here is that probabilistic screening decisions can be provided. In this section, we start with the feature selection process, and then implement the RF using the Python Scikit-Learn library [22].
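The probabilistic screening idea can be sketched with Scikit-Learn as follows. The two features, the synthetic data, and the decision rule used to generate labels are illustrative stand-ins for the paper's feature set and training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: two illustrative features per application
# (e.g., secondary solar penetration, electrical distance); label 1 = pass.
n = 1000
X = rng.uniform(0, 1, size=(n, 2))
# Hypothetical rule: far-away, high-penetration customers tend to fail
y = (X[:, 0] + X[:, 1] < 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# predict_proba averages per-tree class probabilities, so every screening
# decision comes with a quantified pass probability.
apps = np.array([[0.1, 0.2],   # low penetration, close to the transformer
                 [0.9, 0.8]])  # high penetration, far away
pass_prob = clf.predict_proba(apps)[:, 1]
screen = (pass_prob > 0.5).astype(int)  # distill to a pass/fail screen
print(screen)  # → [1 0]
```

The 0.5 threshold in the last step is the default conversion from probability to decision; Section III-B discusses shifting it to express a utility's risk preference.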

A. FEATURE SELECTION
Feature selection relies on domain knowledge and statistical processing. Here, we collect features that describe a feeder's overall conditions as well as the customer's electrical properties; both are good indicators for predicting operational violations. For example, feeder primary solar penetration describes the ratio of total installed capacity (kVA) to peak load at the feeder level; secondary solar penetration describes the same ratio but in the low voltage system (downstream of the service transformer where the customer is located). Distance measures are also important in predicting operational violations, such as the customer electrical distance (R, X), actual distance (km), and customer degree, which describes the number of poles between the customer and the corresponding service transformer. The full list of features inspected here is included in Fig. 3; the correlation of features will be discussed in the subsequent subsections.

1) IMPORTANCE MEASURE
Feature importance scores measure the ability of an independent variable (i.e., a feature) in a model to predict the dependent variable. They can help in understanding the data, interpreting the model, and improving the model. Feature importance scores can be estimated using different methods. We use the permutation importance method, which randomly shuffles a single predictor variable and observes the resulting decrease in a model score [23]. This procedure breaks the relationship between the feature and the target variable; thus, the drop in the model score indicates how much the model depends on the feature [22]. Reference [24] shows the advantages of permutation methods in terms of reliable results and computational efficiency.
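Scikit-Learn provides this method directly; a minimal sketch follows. The data are synthetic, with one informative column and two noise columns standing in for the paper's features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))        # column 0 is informative, columns 1-2 are noise
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and record the score drop;
# a large drop means the model depends heavily on that feature.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)  # column 0 should dominate the noise columns
```

Scoring on held-out data, as above, avoids rewarding features the model merely memorized during training.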

2) CORRELATED FEATURES
Feature importance measures might be inaccurate if the features under study are highly correlated [24]. The correlation matrix in Fig. 3 shows how the features under study are correlated; for example, electrical distances R and X are highly correlated. In Fig. 4, the permutation importance method is used without accounting for feature correlation. The ''primary nominal voltage'' and ''regulator'' features, for instance, are typically associated with affecting hosting capacity, but they are correlated and incorrectly show a feature importance worse than random. One option would be to remove correlated variables, but this approach would remove flexibility from the screening model; for utilities with limited data, the option to choose between screening inputs can be valuable. To overcome this problem, correlated features are first grouped, and the grouped sets of correlated features are used in the importance analysis. The idea of using grouped features is similar to redundancy analysis, where some features can be eliminated if they can be represented by other (i.e., highly correlated) features. Thus, utilities can choose one or more features within each correlated group based on data accessibility without compromising the classification results. Here, the importance scores for grouped features are calculated based on [24], where a group of features is treated as a ''meta-feature'' and features within each ''meta-feature'' are permuted together. A ''meta-feature'' has a higher importance score when its permutation causes a larger reduction in model accuracy. Note that the correlated features only affect the feature importance measures; they do not affect the accuracy of the RF model. The calculated grouped features and importance scores are shown in Fig. 5, in comparison with the ungrouped features in Fig. 4. One can observe that the ungrouped method ranks the primary nominal voltage and regulators as worse than random.
This is because primary nominal voltage and the number of regulators are correlated with the number of capacitors in the models in our training set. The RF could, therefore, attribute importance to any of these features without loss of model accuracy. Grouping the correlated features into meta-features circumvents this misattribution of importance, leads to importance scores more consistent with engineering judgement, and provides flexibility to utility users.
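The grouped (''meta-feature'') permutation described above can be sketched with a small helper that permutes all columns of a group with the same row shuffle, mirroring the idea from [24]. The data and the helper's name are illustrative assumptions: two perfectly correlated informative columns stand in for a correlated feature pair such as primary nominal voltage and regulator count.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def grouped_permutation_importance(model, X, y, groups, n_repeats=10, seed=0):
    """Importance of each feature group = mean accuracy drop when all
    columns in the group are permuted together (same row shuffle)."""
    r = np.random.default_rng(seed)
    base = model.score(X, y)
    scores = []
    for cols in groups:
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = r.permutation(len(X))
            Xp[:, cols] = Xp[perm][:, cols]  # permute the whole group as one unit
            drops.append(base - model.score(Xp, y))
        scores.append(np.mean(drops))
    return np.array(scores)

# Two perfectly correlated informative columns (0, 1) plus one noise column (2)
n = 2000
x0 = rng.normal(size=n)
X = np.column_stack([x0, x0, rng.normal(size=n)])
y = (x0 > 0).astype(int)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Grouping the correlated pair reveals its joint importance
imp = grouped_permutation_importance(clf, X, y, groups=[[0, 1], [2]])
print(imp)  # the [0, 1] group should clearly dominate the noise group
```

Permuting either correlated column alone would leave its duplicate intact and understate the pair's importance, which is exactly the misattribution the grouping avoids.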

B. RANDOM FOREST TRAINING AND TUNING OPTIONS FOR SCREENING RESIDENTIAL PV
All the features in Fig. 5 are used to build the RF in Python except customer simulated voltage, which is used to label the screening decision (pass or fail, encoded as 1 or 0). This labeling is easily extensible to other criteria, e.g., thermal constraints on grid assets. The detailed steps and algorithms for building a general RF can be found in [23]. Next, we find the ''optimal'' set of RF model parameters that best fit the given data, i.e., a set of model parameters that will yield the fewest errors in both training and testing. This process, called hyperparameter tuning, is performed heuristically in two steps: 1) random search and 2) grid search. The random search step narrows the hyperparameters down to a smaller search space for the finer grid search step. A subset of the RF hyperparameters is listed in Table 1 [22]. Fig. 6 shows the grid search for parameter tuning. The upper left corner is equivalent to a decision tree model with n_estimators equal to 1. The figure also shows that the bottom right corner has a darker color than the upper left corner, which means that overall the algorithm prefers a large number of ''weaker'' learners over fewer strong decision tree learners. Based on the grid search, the optimal parameters can be chosen. The results on the test data set are shown in Fig. 7a without a preference for error type. When desired, a preference against false positives (e.g., to decrease the number of customers that are incorrectly passed) can be introduced by changing the pass threshold (used for converting a predicted probability to a binary decision) for the RF model, as shown in Fig. 7b. Note that the RF probability results come from averaging the probability estimates of the individual trees, as opposed to the common misconception of simply averaging each tree's binary decision [23]. In addition, an RF with default parameters in Scikit-learn has a training computational complexity of O(mn log n), where m is the number of trees and n is the number of samples (https://scikit-learn.org/stable/modules/ensemble.html#parameters). Therefore, the training time for the screening model is normally acceptable.

Fig. 8 shows the probability of 2000+ customers from the selected 12 feeders passing a screening study, as predicted by the RF. The figure shows that our model (with the designed training data) is balanced enough to predict the full range of pass and fail probabilities: the right side shows customers with a high probability of passing, and the left side shows customers with a low probability of passing. Consider instead the case where the majority of customers are concentrated in the middle, around a pass probability of 0.5. In that case, the RF results would be trivial and would never predict a pass or fail with high certainty.

A. EXAMPLES OF SCREENING PV APPLICATIONS
Table 2 demonstrates how the RF screening model could be used with transparency by utility engineers. For a given customer application, the engineer inputs customer application size, voltage level, inverter settings, and other available information. The screening model outputs the probability that the customer will cause a voltage violation and fail the screen. In the table, the customer's pass probability increases by 55% when volt/watt settings are used by customer 1 and by only 10% for customer 2.
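The two-step hyperparameter search and the adjustable pass threshold described in Section III-B can be sketched as follows. The data set, the search ranges, and the 0.7 threshold are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the screening training set
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Step 1: random search over a wide hyperparameter space
wide = {"n_estimators": randint(10, 300), "max_depth": randint(2, 20)}
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), wide,
                          n_iter=10, cv=3, random_state=0).fit(X, y)

# Step 2: finer grid search around the random-search winner
n_best = rand.best_params_["n_estimators"]
d_best = rand.best_params_["max_depth"]
grid = {"n_estimators": [max(10, n_best - 20), n_best, n_best + 20],
        "max_depth": [max(2, d_best - 2), d_best, d_best + 2]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3).fit(X, y)

# Adjustable pass threshold: raising it above the default 0.5 trades more
# incorrect fails for fewer incorrect passes (cf. Fig. 7b).
pass_prob = search.best_estimator_.predict_proba(X)[:, 1]
strict_screen = (pass_prob > 0.7).astype(int)
print(search.best_params_, strict_screen.mean())
```

Cross-validation inside both searches keeps the chosen parameters honest with respect to unseen applications rather than the training set alone.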

B. COMPARISON WITH THE CONVENTIONAL METHODS
This subsection compares the proposed RF screening model with other conventional screening methods, including:
• Power flow, with a full circuit model including primary and secondary distribution systems. This is the most accurate method. A failed application is triggered when the customer's simulated voltage exceeds 1.05 p.u. or falls below 0.95 p.u.
• Proposed RF screening, a failed application is predicted by the model. In practice, a pass is predicted when the pass probability is greater than 50%.
• DTran voltage + 2% rise, a heuristic method that only requires the voltage at the distribution transformer. A failed application is triggered when the distribution transformer voltage plus a 2% rise exceeds 1.05 p.u.
• 15% rule DTran, similar to the 15% rule [2] but applied to each distribution transformer. A failed application is triggered if a distribution transformer serves 15% or more solar capacity relative to the transformer capacity.
• 15% rule, a failed application is triggered if the feeder contains 15% or more solar, see [2].
Fig. 9 shows that the proposed RF screening approach performs better than the three conventional methods and is very close to the power flow method in terms of customer application pass rate.
Note that the power flow results in this paper are based on time-series simulations [19], so the choice of voltage criterion (maximum, 95th percentile, or median voltage) can also have an impact on the pass/fail decisions.
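The conventional screens in this comparison reduce to simple threshold checks; a sketch follows. The thresholds come from the descriptions above, while the function names and input values are hypothetical, and the transformer rating is assumed as the base for the per-transformer 15% rule.

```python
def screen_15_percent_feeder(feeder_solar_kva, feeder_peak_load_kva):
    """15% rule at the feeder level: fail if aggregate solar >= 15% of peak load."""
    return feeder_solar_kva < 0.15 * feeder_peak_load_kva

def screen_15_percent_dtran(dtran_solar_kva, dtran_rating_kva):
    """15% rule applied per distribution transformer (rating assumed as the base)."""
    return dtran_solar_kva < 0.15 * dtran_rating_kva

def screen_dtran_voltage_rise(dtran_voltage_pu, rise=0.02, limit=1.05):
    """Heuristic: fail if transformer voltage plus an assumed 2% rise exceeds 1.05 p.u."""
    return dtran_voltage_pu + rise <= limit

# Hypothetical application: 50 kVA of solar on a feeder with 500 kVA peak load,
# 8 kVA on a 50 kVA transformer currently sitting at 1.02 p.u.
print(screen_15_percent_feeder(50, 500),   # 50 < 75   → pass
      screen_15_percent_dtran(8, 50),      # 8 >= 7.5  → fail
      screen_dtran_voltage_rise(1.02))     # 1.04 <= 1.05 → pass
```

The example shows why these screens can disagree with each other and with power flow: each looks at a single quantity, whereas the RF model combines many of them.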

V. DISCUSSION
The proposed data-driven screening approach has several advantages over conventional methods. It was found to be more accurate than the conventional fast-track screens, and faster than detailed power flow studies while nearly as accurate. The work in [9], for example, shows that time-series hosting capacity analysis is computationally intensive and can take hours to produce results. In contrast, the proposed model, once trained, can produce results in a few seconds. Today, many utilities also lack detailed power flow models that include secondaries, so the option to use the proposed model could bring high-accuracy results at low cost.
The data-driven screening approach can also complement the conventional fast-track screens and detailed power flow studies. Utilities can triangulate results among the methods. They can also use the data-driven screening approach for preliminary what-if analysis (e.g., changing the voltage class of a feeder) that can inform detailed power flow studies.
Implementation of a data-driven screening approach requires comprehensive and systematic curation of training data. Ideal training data would come from feeder models that include secondary networks and are validated with AMI. Additionally, the feeder models would have a broader range of key features, including but not limited to the features used in this paper. Here, we found that some design parameters (e.g., the number of regulators) did not improve the interconnection pass rates as expected, and we theorize that this is due to confounding variables. For example, the presence of capacitors and regulators on a feeder may significantly help the interconnection pass rate, but if they are found on older feeders with other design constraints, data-driven techniques may inadvertently associate them with poor interconnection pass rates.
These study limitations may have impacted the efficacy of NWAs in this study. Although we generally found NWAs to improve the probability of an application passing, the improvement in pass probability was not very large because we used default parameters from the IEEE standard [20]. Further work is needed to examine optimal inverter settings (e.g., control curves) and to improve convergence on circuit models with high penetrations of volt-var and volt-watt inverters.
Traditional scenario-based interconnection studies and hosting capacity analysis have limitations in providing training data. For example, a customer interconnection will depend not only on their inverter settings but also on the inverter settings of customers sharing the feeder. In this paper, we assumed that all customers shared common inverter settings. In reality, customers may have a combination of legacy inverters without any controls and smart inverter controls with various set points. Interconnection studies and hosting capacity analysis cannot test all permutations of inverter settings, so more practical scenarios might be needed for training.

VI. CONCLUSION
This paper developed an automated approach for screening residential PV applications using an RF model. The model takes commonly available PV application information as inputs (e.g., distance to substation) and can provide quantified probabilistic screening results with low computational and engineering effort. Training data with correct labels are crucial for the model to yield trustworthy results. Overall, the outcome of this work can facilitate decision making in screening PV applications and can also be used in conjunction with traditional screening methods (e.g., power flow analysis or the 15% rule).
Future work includes improving the training dataset to include more diverse and generalizable customer properties and scenarios, and perhaps incorporating field data coming from existing connected PV.