Deep Federated Learning-Based Privacy-Preserving Wind Power Forecasting

Given the growing installed capacity, wind energy will exert a profound impact on the flexibility of modern energy systems. Wind power forecasting is a practical solution for dealing with the associated variations and uncertainties, balancing supply and demand, and improving the reliability of the system. To achieve more accurate and generalizable forecast models, comprehensive data sets, supplied by multiple wind farms owing to their spatio-temporal dependencies, are required. However, data aggregation/collaboration across many wind farms scattered around a country is difficult, if not impossible, due to complex administrative processes, industry competition, and data privacy and security concerns. This article offers federated learning-based wind energy forecasting as a novel decentralized collaborative modeling method capable of training a single model on data from many wind farms without jeopardizing the privacy or security of data. To this end, rather than sending private data across sites, local model parameters are securely transmitted. A comparison between the proposed private distributed model and non-private centralized and fully private localized models indicates the high performance of the proposed federated learning-based wind power forecasting with 87.96% accuracy. Benefiting from the smoothing effect, the higher generalizability of the proposed model, with 83.63% accuracy, is also substantiated in comparison to localized and centralized approaches while the privacy of the underlying data is preserved.


I. INTRODUCTION
Wind energy is now a prominent renewable energy source and an essential alternative energy solution for energy development due to rising energy demand, dwindling global resource reserves, and environmental protection concerns [1]. The utilization of wind energy has increased dramatically in recent years, thereby exerting a profound impact on the flexibility of modern energy systems. According to [2], the total capacity of all erected wind turbines globally reached 837 GW by the end of 2021, as indicated in Fig. 1.
The associate editor coordinating the review of this manuscript and approving it for publication was Xianzhi Wang.
Despite all of the benefits associated with wind energy, various issues such as variability, internal instability, and uncertainty limit its high penetration into energy systems [3]. To overcome these difficulties and mitigate existing uncertainties, accurate forecasting of wind power has been offered as a dependable and low-cost solution [4].
The proposed forecasting techniques, based on methodology, are categorized into four broad groups: physical, statistical, intelligent, and hybrid models [5]. The physical models focus on numerical weather forecasting and use various meteorological data collected from observation systems to forecast wind speed. Although useful for long-term prediction horizons, physical models require additional factors, such as geographic and geomorphic conditions, temperature, and pressure. Additionally, these techniques necessitate the use of many measuring sensors, which are not necessarily economical [6]. Statistical and intelligent models use past observations to extract time-varying relationships in time series [7]. Various statistical models for wind speed forecasting have been introduced, including the Kalman filter [8], Box-Jenkins models (AR, ARIMA models, etc.) [9], and Particle Swarm Optimization [10]. While statistical techniques perform well when estimating basic time series, they are incapable of handling nonlinear data and perform poorly when processing datasets with complex behavior. To address these issues, intelligent models are adopted because of their great capacity for learning volatility and nonlinearity. Data mining methods such as artificial neural networks (ANNs) [11], machine learning models [12], and deep learning [13] are used to create those intelligent models. These models have been widely utilized in recent years for a range of energy and power system applications and have consistently outperformed other models for wind forecasting applications [14].
It has long been observed that the combined (relative) variability of multiple wind generators (or solar generators) installed in a wider area is less than the variability experienced by a single system [15]. Additionally, intelligent models are susceptible to overfitting [16], limiting their capacity to generalize when deployed to new datasets. Uncorrelated locations exhibit a smoothing effect that can reduce the variability associated with wind turbines and, therefore, improve the accuracy and generalizability of deterministic forecasts [17], [18]. Typically, suggested frameworks assume that all data records from smart meters are transmitted over broadband networks to a centralized computing infrastructure for model training. Nonetheless, this assumption creates privacy issues, since data profiles disclose a wealth of sensitive data, such as the connection of wind turbines and control centers, the wind farm network, and the turbine itself. Sending such sensitive data across networks exposes it to hostile interception and exploitation. Thus, the primary drawback of both conventional and intelligent methods used in previous forecasting models is the need for centralized data. The centralized data is very sensitive since it may readily be utilized to infer critical/private information or conduct cybersecurity attacks [19], [20]. For example, reference [21] examined the effect of data integrity attacks on the physical system of a wind farm. As such, companies are becoming more worried that their information is being utilized (or worse, misused) without their knowledge or consent. Under this landscape, collecting and exchanging data across various energy companies becomes more difficult, if not unfeasible, while the value of collaboration over data exchange is not immediately apparent.
The preservation of data privacy in centralized databases has been the subject of numerous studies in recent years. For example, methods for securing multi-client decision trees with vertically partitioned data were presented in [22] and [23]. Following their work, Vaidya and Clifton developed secure association mining methods [24], Naive Bayes classifier [25], and secure k-means [26]. Private Support Vector Machine methods have been developed for both vertically and horizontally partitioned data [27]. Secure methods for multigroup linear regression and classification were suggested in reference [28]. Using homomorphic encryption, the authors of [29] devised a privacy-preserving linear regression technique for horizontally partitioned data. Aono et al. [30] pioneered the use of homomorphic encryption to secure logistic regression. Shokri and Shmatikov [31] suggested training neural networks with updated parameters for horizontally partitioned data. With recent advancements in deep learning, privacy-preserving neural network inference has garnered considerable academic attention [32], [33].
Despite the efforts made in previous literature, the privacy issues associated with forecasting systems with multiple clients have remained a persistent challenge. To solve such a challenge while expanding the amount and diversity of data sets, the machine learning community has suggested Federated Learning (FL) [34]. FL is a decentralized collaborative approach to machine learning in which each device contributes to the training of a central model without providing any data. As shown in Fig. 2, the server first initializes the model randomly or using publicly accessible data. The model is then sent to a randomly chosen group of devices (clients) for local training using their data. Each client sends its updated model weights to the server, where they are averaged and used to update the global model. This procedure continues until the global model converges. FL-based frameworks have been proposed for other applications such as traffic flow forecasting [35], load forecasting [36], renewable scenario generation [37], behind-the-meter solar generation disaggregation [38], and solar irradiation forecasting [39]. Despite the importance of developing accurate yet generalizable forecasting algorithms based on data from multiple parties while maintaining data privacy, to our knowledge, there is no relevant study in the existing literature that addresses this issue explicitly in wind power generation applications.
Based on the above discussion, it can be seen that prior studies have relied mainly on the idea of providing a more accurate model for forecasting by combining different machine learning and deep learning algorithms. Most suggested techniques have significant flaws since little emphasis is placed on preserving the privacy of data associated with wind farms and meteorological stations. In addition, the proposed methods can only forecast for the specific areas whose data were used for training and cannot generalize to adjacent areas. Furthermore, combining machine learning algorithms with numerical weather predictions and terrain-specific conditions, while it can increase the accuracy of the forecasting, adds to the complexity of the models and requires much higher computational time. This paper proposes the use of FL to train a privacy-preserving wind power forecasting model. We use Long Short-Term Memory (LSTM), a deep recurrent neural network for predicting time series, which makes use of historical measurements of the wind speed to anticipate future values of wind generation. Our contributions in this study are as follows:
• For the first time, an FL-based wind power forecasting scheme is proposed to offer a secure method of protecting data privacy by training the forecasting models locally and avoiding the exchange of raw data across various wind farms. The proposed scheme enables forecasting to benefit from the improved performance provided by global model aggregation in the absence of data exchange (Section II).
• A comparison between the proposed private distributed model, non-private centralized, and fully private localized models is conducted indicating the high accuracy of the proposed federated learning-based wind power forecasting using real-world datasets (Section III).
• Enjoying the smoothing effect, the higher generalizability of the proposed model is also substantiated while the privacy of the underlying data is preserved (Section III).
Lastly, Section IV concludes the paper and elaborates on some future research directions.

II. FEDERATED LEARNING
Modern energy systems have recognized the enormous potential of artificial intelligence as a result of the emergence and advancement of industry 4.0, and have begun to anticipate more complex, creative algorithms in a variety of applications, including forecasting. However, except for a few industries, others have only restricted and/or poor-quality data, thereby limiting the full potential of artificial intelligence. Data privacy and security, on the other hand, have recently become a global concern. Federated learning is a novel modeling technique that allows a single model to be trained on data from many sources without jeopardizing data privacy or security. It can unleash the full potential of artificial intelligence with promising applications where data is decentralized, typically unbalanced, and not identically distributed.
As previously stated, many factors make it difficult to gather the large amounts of data required to train joint machine learning models. Thus, it is logical to explore methods for developing machine learning architectures that do not rely on accumulating all data in a single storage place for model training. A possible approach is to build a model at each location where a data source is situated, and then to allow those locations to communicate their individual models in order to reach a consensus on a global model. To ensure client data security and privacy, the communication mechanism is carefully designed to prevent any site from accessing the private data of another site. At the same time, the model is constructed as though the data sources were merged. Therefore, rather than sending data across sites, model parameters are securely transmitted, ensuring that third parties cannot infer the contents of another party's data.
FL aims to train a model on decentralized data $D_1, \ldots, D_m$ that is usually imbalanced and not identically distributed. A centralized approach is to assemble all data as $D = D_1 \cup \ldots \cup D_m$ to train a model ($M_{cen}$). However, to preserve data privacy, federated learning considers collaborative training of a model ($M_{fed}$), such that its accuracy ($A_{fed}$) satisfies
$$|A_{fed} - A_{cen}| < \delta$$
where $\delta$ is a non-negative real number and $A_{cen}$ is the accuracy of $M_{cen}$. This inequality conveys the intuition that the joint model resulting from FL performs roughly the same as a model trained with all data sources combined. Because FL data providers do not disclose their data to a centralized server or to other clients, we allow the FL system to perform somewhat worse than a centrally trained model. This extra security and privacy assurance is worth much more than the loss in accuracy for many applications, including wind power forecasting. For wind power forecasting, we suggest that the FL system utilize a central coordinating server, which is used to create the joint model. The FL architecture may alternatively be built in a peer-to-peer fashion, eliminating the need for a coordinator; however, this will result in increased computational load. Fig. 3 illustrates the proposed FL coordinator scheme in a wind power forecasting system. The coordinator in this scenario is a central aggregation server (parameter server), which distributes an initial model to the local data owners 1 to m (clients or participants). Each data owner trains a local model with its own dataset and sends the updated model weights to the aggregation server. The aggregation server then combines the model updates received from the clients (e.g., through federated averaging [40]) and sends the result back. This procedure is repeated until the convergence criterion is satisfied or the maximum number of iterations is reached.
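The coordinator loop described above can be sketched in a few lines. This is a minimal illustration of federated averaging under stated assumptions, not the paper's implementation: the local model is a one-feature linear regressor rather than an LSTM, and the names `client_training`, `federated_averaging`, `fraction`, and all default values are hypothetical.

```python
import random

def client_training(w_global, data, rng, epochs=5, batch_size=4, lr=0.1):
    """Local minibatch SGD on one client's private data.

    Only the updated weights and the sample count are returned;
    the raw (x, y) pairs never leave the client.
    """
    w, b = w_global
    data = list(data)  # local copy so shuffling does not reorder the caller's list
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error for y_hat = w * x + b
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return (w, b), len(data)

def federated_averaging(clients, rounds=30, fraction=0.5, seed=0):
    """Server loop: sample clients, collect local weights, average them."""
    rng = random.Random(seed)
    w_global = (0.0, 0.0)  # assumed zero initialization
    n_selected = max(1, int(fraction * len(clients)))
    for _ in range(rounds):
        selected = rng.sample(clients, n_selected)
        updates = [client_training(w_global, data, rng) for data in selected]
        total = sum(n for _, n in updates)
        # Average the local weights, weighted by each client's sample count
        w_global = (
            sum(w * n for (w, _), n in updates) / total,
            sum(b * n for (_, b), n in updates) / total,
        )
    return w_global

if __name__ == "__main__":
    gen = random.Random(1)
    # Synthetic clients drawn from y = 2x + 1 plus small noise (illustrative only)
    clients = [
        [(x, 2.0 * x + 1.0 + gen.gauss(0, 0.05))
         for x in (gen.random() for _ in range(20))]
        for _ in range(4)
    ]
    print(federated_averaging(clients))
```

The `fraction` argument plays the role of a participation ratio: lowering it means fewer clients contribute in each round.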
Under this architecture, the original data of the individual providers never leaves the possession of the local data owners. This method not only protects user privacy and data security but also eliminates the communication cost associated with raw data transmission. To avoid data leakage, communication between the coordinating server and clients may be encrypted (e.g., utilizing homomorphic encryption [41]). This is a horizontal federated learning technique in which several users with the same feature space but different samples train a model jointly on a server. Algorithm 1 details the step-by-step procedure of the proposed FL wind power forecasting. To begin, a small set of randomly chosen participants, referred to as a mini-batch, computes model parameters locally, encrypts them, and sends them to the server. The server then performs secure aggregation without jeopardizing any participant's privacy and returns the aggregated parameters to the participants. Aggregation is a well-known approach that is based on stochastic gradient descent and is used in many different applications [42]. Finally, participants use the decrypted parameters to update their own models. This procedure is repeated until the loss function converges, at which point the training phase is terminated.
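To make the idea of aggregating without exposing individual updates concrete, the sketch below uses pairwise additive masking. This is a different and much simpler technique than the homomorphic encryption mentioned above, and it is shown only as an illustration: the function names are hypothetical, and a single shared random generator stands in for the pairwise secrets that clients would agree on in a real protocol.

```python
import random

def masked_updates(updates, seed=0):
    """Add pairwise masks that cancel in the sum.

    Each pair of clients (i, j) shares a random mask; client i adds it and
    client j subtracts it, so each masked vector differs from the raw update
    but the sum over all clients is unchanged.
    """
    rng = random.Random(seed)  # stand-in for secrets shared between client pairs
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] += mask[k]
                masked[j][k] -= mask[k]
    return masked

def aggregate(masked):
    """Server side: average the masked vectors coordinate by coordinate."""
    n = len(masked)
    return [sum(column) / n for column in zip(*masked)]

if __name__ == "__main__":
    raw = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # illustrative local parameters
    print(aggregate(masked_updates(raw)))
```

The server only ever handles the masked vectors, yet the average it computes matches the average of the raw updates up to floating-point rounding.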

Algorithm 1: Federated Averaging
Define minibatch size B, number of clients m, number of epochs E, learning rate ξ, and global model $w_g$.
[Client i] ClientTraining(i, $w_g^t$):

As a major shortcoming in dealing with sequential data, traditional neural networks suffer from a lack of memory to reflect temporal dependencies. Recurrent neural networks (RNNs) are proposed to address this issue by allowing information to persist through a recursive network. As illustrated in Fig. 4, the network acts as a memory, allowing information to be passed from one step of the network to the next. However, RNNs are not suitable for tackling long-term dependencies due to their short-term memory, which originates from the vanishing gradient problem: the gradient shrinks as it backpropagates through time [43]. LSTM networks are deep RNNs that enable learning long-term dependencies through the cell state [44]. They are able to regulate the flow of information, i.e., remove or add information to the cell state, through internal mechanisms called gates. As depicted in Fig. 5, an LSTM consists of three consecutive gates: the forget, input, and output gates. First, the forget gate decides whether the information coming from the previous time step is to be remembered or is irrelevant and can be discarded from the cell state. Next, the input gate decides what new information should be added to the cell state. Finally, the output gate decides what parts of the updated cell state should be passed from the current time step to the next.
Each of these gates has its own computational relationships and functions. The computation of each variable at time t uses the following notation: σ is the logistic sigmoid function, and $f_t$, $i_t$, $o_t$, $c_t$, and $a_t$ denote the forget gate, input gate, output gate, memory cell, and hidden vector, respectively. $W_{l*} = \{W_{lf}, W_{li}, W_{la}, W_{lo}\}$ and $W_{m*} = \{W_{mf}, W_{mi}, W_{ma}, W_{mo}\}$ represent the trainable weights of the respective gates, while $b_f$, $b_i$, $b_o$, and $b_a$ are the biases. Lastly, the operator * denotes the Hadamard product.
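For reference, the gate computations in their widely used form can be written with the notation above. This is a sketch under assumptions rather than a reproduction of the paper's equations: $x_t$ (the input at time t) and $\tilde{c}_t$ (the candidate cell state) are symbols introduced here, $W_{l*}$ is assumed to act on the input, and $W_{m*}$ is assumed to act on the previous hidden vector $a_{t-1}$.

```latex
\begin{aligned}
f_t &= \sigma\left(W_{lf}\, x_t + W_{mf}\, a_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_{li}\, x_t + W_{mi}\, a_{t-1} + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_{la}\, x_t + W_{ma}\, a_{t-1} + b_a\right) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
o_t &= \sigma\left(W_{lo}\, x_t + W_{mo}\, a_{t-1} + b_o\right) \\
a_t &= o_t * \tanh\left(c_t\right)
\end{aligned}
```

Read top to bottom, the forget gate scales the previous cell state, the input gate scales the candidate update, and the output gate selects which part of the new cell state becomes the hidden vector.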

B. PERFORMANCE METRICS
To assess the effectiveness of the proposed models, we use various performance indices with respect to accuracy.
The following paragraphs introduce those performance indices.

1) MEAN ABSOLUTE ERROR (MAE)
MAE, which evaluates the mean absolute difference between predictions and observations, is expressed in (8) as
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \tag{8}$$
where N is the number of samples, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value. It is worth mentioning that because the absolute value is not differentiable at zero, most ML algorithms that rely on gradient descent have difficulty using MAE as a training objective. For this reason, other performance metrics are also considered.

2) ROOT MEAN SQUARE ERROR (RMSE)
RMSE, as expressed in (9), penalizes large errors more heavily by taking the root of the mean squared distance between predictions and observations:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{9}$$
To make the RMSE metric more meaningful when it is used in renewable energy source (RES) models, the normalized RMSE (nRMSE) is often used, as given in (10):
$$\mathrm{nRMSE} = \frac{\mathrm{RMSE}}{P_{inst}} \tag{10}$$
where $P_{inst}$ is the installed capacity of the wind power plant, which is 1 MW in the selected wind farms.
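The three metrics can be computed directly from paired observations and predictions. The sketch below is illustrative: the function names and sample values are made up, and `nrmse` assumes plain division of RMSE by the installed capacity, with no percentage scaling.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute difference between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root of the mean squared difference; large errors weigh more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def nrmse(y_true, y_pred, p_inst=1.0):
    """RMSE divided by installed capacity (assumed form; 1 MW by default)."""
    return rmse(y_true, y_pred) / p_inst

if __name__ == "__main__":
    observed = [0.2, 0.5, 0.9]   # made-up power values in MW
    predicted = [0.1, 0.5, 1.0]
    print(mae(observed, predicted), rmse(observed, predicted), nrmse(observed, predicted))
```

With an installed capacity of 1 MW, nRMSE and RMSE coincide numerically; the normalization only matters when plants of different sizes are compared.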

III. SIMULATION RESULTS
This section evaluates the FL forecasting method's performance using real-world datasets, and the findings are compared to centralized models operating under non-shared data situations. Additionally, the influence of the participation ratio on FL accuracy is examined using ten clients. The results indicate that the proposed FL scheme delivers competitive performance while ensuring data privacy.

A. DATA ANALYTICS
Geographically, Iran is located in a mountainous region with great potential for wind power generation. As illustrated in Table 1, nine different wind farms scattered around the country are considered here as: Abadan, Chabahar, Kahrizak, Khaf, Zahedan, Mahshahr, Neyshabur, Nikouyeh, Sonqor, and Tabriz. Datasets with a 10-min sampling measured at the height of 40 m were collected from these wind farms, whose statistical information is provided in Table 2. Moreover, Fig. 6 depicts the Weibull distribution of these wind farms for data measured at the height of 40 m. As can be seen, multiple wind farms with different profiles can provide representative data for the country due to their spatio-temporal dependencies. VOLUME 11, 2023

B. PREPROCESSING
Data preprocessing is a critical component of machine learning, as it prepares data for knowledge discovery by cleaning, integrating, reducing, transforming, and discretizing it. Data cleaning tries to fill in missing values, smooth out noise, discover and eliminate outliers, and resolve data inconsistencies. Data integration attempts to resolve issues such as entity identification, tuple duplication, data value conflict, and redundancy and correlation. By compressing data and lowering its dimensionality and numerosity, data reduction aims to produce a reduced representation of the data. Through data normalization, aggregation, and generalization, data transformation assists in the translation of data into a suitable format, whereas data discretization replaces raw data values with ranges. The preprocessing stage in this article is as follows: missing values are replaced with the median using SimpleImputer, outliers are detected and removed using the Z-score metric, duplicates are deleted, and the data is normalized using MinMaxScaler.
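The four preprocessing steps can be sketched as a single function. SimpleImputer and MinMaxScaler are presumably the scikit-learn classes of those names; to stay dependency-free, the sketch below reimplements the same operations with the Python standard library. The record layout (timestamp, value), the function name, and the Z-score threshold of 3 are assumptions, not details from the paper.

```python
import statistics

def preprocess(records, z_threshold=3.0):
    """Clean a list of (timestamp, value) records; value may be None."""
    # 1) Median imputation of missing values
    observed = [v for _, v in records if v is not None]
    median = statistics.median(observed)
    filled = [(t, median if v is None else v) for t, v in records]

    # 2) Z-score outlier removal (population standard deviation)
    values = [v for _, v in filled]
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    kept = [(t, v) for t, v in filled
            if std == 0 or abs(v - mean) / std <= z_threshold]

    # 3) Drop exact duplicate records while preserving order
    unique = list(dict.fromkeys(kept))

    # 4) Min-max scaling of the values to [0, 1]
    lo = min(v for _, v in unique)
    hi = max(v for _, v in unique)
    span = (hi - lo) or 1.0  # avoid division by zero for constant series
    return [(t, (v - lo) / span) for t, v in unique]

if __name__ == "__main__":
    # Made-up wind speed samples: one gap, one spike, one repeated record
    data = [(i, 4.0 + 0.1 * i) for i in range(20)]
    data += [(20, None), (21, 100.0), data[5]]
    print(preprocess(data))
```

In a federated setting each client would run such a step locally, so the scaling bounds are computed per client unless they are agreed on beforehand.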

C. CASE STUDY SCENARIOS
We compare centralized learning, localized learning, and federated learning cases to evaluate the efficacy of applying FL to wind power forecasting. Table 3 summarizes these various scenarios.
To begin, we created a centralized, non-distributed learning method that is most often used in situations when data privacy is not a significant issue during training. This case consolidates individual wind farm information and conducts training in a single place. Also, centralized training establishes a baseline for the capabilities of a single, collaborative forecasting model in a non-private environment. We train models for 35 epochs with early stopping depending on the lowest error obtained on the validation set.
The second scenario is an entirely private localized learning environment in which each dataset is trained independently, and the training process is isolated from every other wind farm. This technique results in forecasting models that are specifically customized to each wind farm and cannot profit from the information contained in other wind farms' data. In accordance with the centralized learning method, training was performed for a maximum of 35 epochs. It is essential to emphasize that in localized situations, individual datasets are private and unobservable to other data owners.
Next, we offer an FL-based approach with the same objective as centralized learning: to train a single, joint model that generalizes well enough to give accurate predictions for all individual wind farms. However, FL provides advantages that exceed a localized learning environment as FL has higher generalizability. Unlike centralized learning, the training data from each wind farm is not pooled in the FL. The training data, on the other hand, is kept private by each local client.
The only way to determine how effectively a model generalizes to new situations is to test it on unseen data. In this regard, we hold out client data to assess the generalizability of the algorithms (centralized, localized, and federated) when they are exposed to completely new data. We use one client for testing the model and the other clients for training the models. After the machine learning model has been trained and validated, the held-out subset is used to give a final estimate of its performance. Using client data as a held-out subset allows us to build generalizable models that are applicable to future data collection, rather than only to the data used to train the model.
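The client-level hold-out described above amounts to partitioning the clients, not the samples. A minimal sketch follows, with hypothetical client names and placeholder data; `split_held_out` is an illustrative helper, not a function from the paper.

```python
def split_held_out(client_data, held_out):
    """Split a {client_name: dataset} mapping into training and held-out parts.

    Held-out clients contribute no samples to training; they are used only
    to estimate how well a trained model generalizes to unseen sites.
    """
    held_out = set(held_out)
    unknown = held_out - set(client_data)
    if unknown:
        raise KeyError(f"unknown clients: {sorted(unknown)}")
    train = {name: data for name, data in client_data.items() if name not in held_out}
    test = {name: data for name, data in client_data.items() if name in held_out}
    return train, test

if __name__ == "__main__":
    # Placeholder datasets keyed by hypothetical client names
    clients = {f"client_{i}": [float(i)] for i in range(1, 10)}
    train, test = split_held_out(clients, ["client_1"])
    print(sorted(train), sorted(test))
```

Splitting by client rather than by row keeps every sample of a held-out site out of training, which is what makes the test a measure of generalization to a new location.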

D. FORECASTING RESULTS OF DIFFERENT SCENARIOS
To evaluate the performance of the representative approaches, we report the $R^2$, RMSE, MSE, and MAE metrics obtained on the test set for each of the 9 wind farms for different case studies. The average performance indices are also reported over all clients (in FL and distributed methods) or over validation sets (in centralized approaches). The performance results of Case 1, Case 2, and Case 3 are detailed in Table 4, Table 5, and Table 6, respectively.
The centralized approach provides access to all databases gathered from the various clients. As a result, model accuracy is anticipated to be higher in comparison to alternative approaches that utilize far less data. Using the root mean square error as an example, the centralized method performs 24.97 percent better than the localized approach and 5.35 percent better than FL. This demonstrates that centralized models are capable of effective predictions, although with a high data requirement and a trade-off in privacy. While centralized models may allow for the learning of collective behaviors, they also put the privacy of the energy facilities at risk since data must be collected in a single place.
The localized learning method involves training a model for each client separately, utilizing just the data that is accessible to that particular wind farm. The high value of the $R^2$ score along with lower values of MAE, MSE, and RMSE indicate a reasonable performance for the localized model. This means that the LSTM architecture is adequate to learn complex generation profiles that are specific to each client. Also, the localized model maintains privacy as there is no sharing of data between clients. However, this approach lacks generalizability because the training samples are limited to one specific location, and new and unseen data might result in poor performance of the localized models.
The FL method uses iterative communication between a supermodel and each client for each round of training. A subset of clients is selected, each of which trains on its own local data separately for a limited number of epochs. As a result, a pool of local models is created that can be used to further update the supermodel. The selected clients update/co-train the supermodel by sending the parameters associated with the local models. Because of such a training procedure, the federated model can preserve privacy in contrast to the centralized model. Additionally, FL outperforms the localized model by 7.42%, as measured by the $R^2$ score. The average values of RMSE, MSE, and MAE are also 2.67%, 6.9%, and 32.37% lower than those of the localized model, respectively.
For scenarios involving a held-out subset (Case 4, Case 5, and Case 6), we perform the experiments by holding out the worst-performing clients under the localized approach, i.e., those with the highest errors on the validation set. As such, Client 1, Client 3, Client 5, and Client 8 are not involved in the training phase, and we only use them to assess the generalizability of the representative models during the testing phase. Table 7, Table 8, and Table 9 show the details of the obtained performances over the held-out clients as well as the average metric scores for different scenarios.
As shown, when the models are exposed to previously unobserved data, their overall performance degrades. Nonetheless, the centralized method outperforms the localized approach by a significant margin (e.g., 19.57% higher $R^2$ and 21% lower RMSE), although at the expense of a large amount of data and a reduction in privacy. This time, however, FL shows higher performance compared to the localized and centralized approaches. For example, the $R^2$ is 8.96% and 23.24% higher than those of the centralized and localized methods, respectively. Additionally, the RMSE is 14.33% and 46.08% lower than those of the centralized and localized methods, respectively. The projected values for a 5-hour forecasting horizon of Client 1 are displayed in Fig. 7 to help comprehend the capabilities of federated learning in comparison to centralized and localized techniques.

E. COMPARISON STUDY
This section compares the suggested strategy to several state-of-the-art machine learning methods. The purpose of this study is to get a better understanding of the advantages and limits of the proposed model in contrast to powerful and widely used techniques such as support vector machine (SVM), random forest (RF), and multi-layer perceptron (MLP). There are many machine learning models, each with its own characteristics and uses. These models were chosen as representative of the most popular and effective supervised learning techniques. These algorithms provide precise, consistent, and interpretable prediction models. Nonetheless, the proposed method is applicable to the remaining machine learning models. This section begins with an overview of the typical ML algorithms. The results are then presented and discussed. Table 10 displays the performance of the recommended forecasting models for a six-hour-ahead forecasting horizon with varied evaluation criteria for various data sets (Client 1, Client 3, Client 5, Client 8, and average performance).
As expected, diverse algorithms display a variety of traits and performance characteristics. While certain algorithms, such as RF, perform better for some clients, they may be surpassed by other models in other contexts and on average. In contrast, MLP did well in all circumstances. Both MLP and LSTM are capable of forecasting wind power rather well, with LSTM beating MLP in simulations on average. Nevertheless, the suggested federated strategy has shown consistent, high-level performance across all stations, as indicated by the mean result. The worse performance of the ML algorithms (in comparison to the suggested models) is attributable to their inability to account for the nonstationarity and variability of wind profile data. Although ML models are capable of learning from data, they are unable to capture the time-dependent characteristics of the wind series. LSTM, on the other hand, may fit the datasets better since it maintains temporal relationships. Using the MAPE as an example, the suggested method performs 3.4% better than MLP, 9.4% better than RF, and 6.7% better than SVM on average.

IV. CONCLUSION
Collective wind energy forecasting is a difficult task, given the privacy concerns surrounding wind farm data. Here, we proposed a privacy-preserving wind power forecasting system by federating the training of machine learning models between several wind farms. To our understanding, this is one of the first studies that examine federated learning in the context of learning-based wind energy prediction. By using a federated learning method, we may substantially decrease the overall communication between clients and the central server, as the transmission of raw data from clients to the server is no longer required. Because the server does not gather data from individual wind farms, data privacy is preserved. Federated learning outperforms localized models in our trials and performs rather well when compared to centralized approaches. When exposed to unseen data, federated learning shows higher generalizability compared to its counterparts.