Machine Learning-Based Digital Twin for Predictive Modeling in Wind Turbines

Wind turbines are one of the primary sources of renewable energy and offer a sustainable and efficient energy solution that releases no carbon emissions during operation. Monitoring wind farms and predicting their power generation is a complex problem because wind speed is unpredictable, which limits the ability of the management team to plan energy consumption effectively. Our proposed model addresses this challenge with a 5G-Next Generation-Radio Access Network (5G-NG-RAN) assisted cloud-based digital twins framework that virtually monitors wind turbines and forms a predictive model to forecast wind speed and predict the generated power. The developed model is built on the Microsoft Azure digital twins infrastructure as a 5-dimensional digital twins platform. The predictive modeling is based on a deep learning approach, a temporal convolution network (TCN), followed by non-parametric k-nearest neighbor (kNN) regression. It has two components. First, it processes the univariate time series of wind data to predict wind speed. Second, it estimates the power generation for each quarter of the year over horizons ranging from one week to a whole month (i.e., medium-term prediction). To evaluate the framework, experiments are performed on a publicly available dataset of onshore wind turbines. The obtained results confirm the applicability of the proposed framework. Furthermore, a comparative analysis with existing classical prediction models shows that our approach obtains better results. The model can assist the management team in monitoring wind farms remotely and in estimating power generation in advance.


I. INTRODUCTION
Nowadays, wind farms are a common sight in the United Kingdom, where they generate clean energy and contribute to achieving the net-zero emission goal by 2050 [1]. One of the primary sources of harmful emissions is the generation of electricity from fossil fuels (i.e., coal and gas). Wind is a plentiful source of environment-friendly energy: wind turbines use the power of the wind to rotate generators and produce electricity. Wind turbine performance varies with the season as well as the geographic location [2], [3]. Therefore, the same wind turbine performs differently in different months and at different locations, which makes it challenging for the energy management team to handle uncertainties. To monitor performance, modern wind turbines are equipped with a supervisory control and data acquisition (SCADA) unit [4]. It offers a cost-effective, data-driven way to store the data streams for further analysis. The selected data dimensions from the SCADA unit are presented in Fig. 1.

The associate editor coordinating the review of this manuscript and approving it for publication was Mounim A. El Yacoubi.

[Figure: An illustration of 5G network-assisted architecture for digital twin to forecast the generated power by wind turbines.]
In Fig. 1, the vertical axis shows the power generation in kilowatts (kW), and the horizontal axis shows the wind speed in meters per second (m/s). The figure provides the theoretical and active power curves with the cut-in ($V_c$), rated output ($V_t$), and cut-out ($V_s$) regions. Below $V_c$, the wind is too weak to generate energy. Between $V_c$ and $V_t$, the generated power grows rapidly with wind speed. At $V_t$, the wind turbine reaches its rated output, and at $V_s$ power generation stops due to high winds. The power curves present the non-linear relationship between the generated power and the wind speed incident at the height of the rotor hub [5]. The dynamical behavior can be expressed as:

$$P_w = \frac{1}{2} \rho A v^3 \quad (1)$$

where $P_w$ is the power associated with wind speed $v$, $\rho$ refers to the air density, and $A$ is the surface area swept by the turbine rotor.
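The cubic dependence of wind power on wind speed and the three operating regions of the power curve can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard kinetic-power expression $P_w = \frac{1}{2}\rho A v^3$; the function names, the default air density, and the threshold arguments are hypothetical and are not taken from the turbine studied here. The value returned is the power carried by the wind through the rotor disc, not the electrical output of the turbine.

```python
import math

AIR_DENSITY = 1.225  # kg/m^3; assumed standard sea-level value, not from the paper


def wind_power_kw(v: float, rotor_radius_m: float, rho: float = AIR_DENSITY) -> float:
    """Kinetic power of the wind passing through the rotor disc, in kW."""
    area = math.pi * rotor_radius_m ** 2  # swept area A in m^2
    return 0.5 * rho * area * v ** 3 / 1000.0


def operating_region(v: float, v_cut_in: float, v_rated: float, v_cut_out: float) -> str:
    """Classify a wind speed against the cut-in (V_c), rated (V_t), and cut-out (V_s) speeds."""
    if v < v_cut_in:
        return "below cut-in"
    if v < v_rated:
        return "ramp-up"
    if v < v_cut_out:
        return "rated output"
    return "cut-out"
```

Because of the cubic term, doubling the wind speed multiplies the available power by eight, which is why small forecasting errors in wind speed translate into large errors in predicted power.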
Data-driven approaches and curve-fitting techniques have been developed for performance prediction [6], condition monitoring [7], and fault diagnostics [8] of wind farms. However, current challenges include accurate forecasting of power generation and limited real-time data exploration across the whole wind farm. Existing forecasting approaches are based on statistical [9] and machine learning models [10], [11]. Yang et al. [12] introduced a data preprocessing technique to estimate the data distribution over the quantile of years. They proposed a statistical approach based on bi-directional Markov chain interpolation for theoretical power calculation, which helps to obtain realistic results. Yun et al. [2] developed a statistical framework based on a multiple spline regression model for power curve modeling of wind turbines. Their approach describes the complex nonlinear relationship between wind speed and wind power using different basis functions and different numbers of knots inside the multiple spline regression. Khosravi et al. [11] applied machine learning algorithms to predict the wind speed for the Osorio wind farm in the south of Brazil. They applied neural networks, support vector regression, and fuzzy inference systems optimized with computational intelligence-based algorithms, and reported that the neural network-based model outperformed the other models considered in the study. Furthermore, they concluded that wind speed has a direct influence on the generated power. In our previous work [6], we also established the relationship between wind speed and generated power, followed by prediction with a deep learning model.
Recent advances in machine learning, especially deep learning models, have been a breakthrough for solving complex problems. In temporal sequence modeling, deep learning approaches can consider a long history of the input data, which leads to more accurate results than classical approaches. In addition, the large amount of data available from SCADA units makes such models more robust in learning and generalizing for prediction. In continuation of our research on wind energy, the developed model treats wind speed as univariate time-series data to forecast each week and the whole month. This wind speed forecast is then used to predict power generation.
In terms of networked systems, digital twins are considered an emerging technology that enables a wide range of applications such as manufacturing, 5G and beyond networks, intelligent transportation systems, climate change, and smart cities [13]. In [14], a digital twin-based intelligent cooperation framework for UAV swarms, integrated with a machine learning algorithm, is investigated to tackle the problem of real-time control of the swarm's behaviours. Digital twins can additionally evolve their context-awareness capabilities to identify cybersecurity issues in real time, which can be effectively applied to smart grid deployments [15]. More importantly, digital twins technology is a powerful tool to address problems of joint communication and computation task offloading in mobile edge computing (MEC) [16] to enable various mission-critical applications in the industrial Internet of things [17], [18]. Therefore, designing digital twins-based solutions significantly contributes to the development of both academia and industry in the digital era.

VOLUME 10, 2022
Real-time data exploration and a feedback loop to the wind farms are possible through digital twin technology, which provides next-generation computer-oriented solutions. It can create a digital copy of a wind farm connected with the physical wind turbines, where supervisory control and data streams are accessible for analysis and prediction. Olatunji et al. [19] briefly introduced digital twin technology for wind turbine fault diagnosis and condition monitoring. They highlighted how digital twins technology takes the wind industry to the next level with enhanced accessibility and availability. Similarly, Kishnamoorthi et al. [20] developed a digital twin-based model to predict the remaining life of offshore fixed and floating wind turbines as a predictive maintenance strategy. The model is based on operational data from SCADA units and a physics-based approach to predict the remaining life of wind turbines.
Many researchers have designed digital twin models for predictive maintenance and have contributed significantly to improvements in wind energy technology. At present, however, none of the existing digital twins supports both power generation prediction and real-time monitoring. Therefore, this study proposes a novel framework that processes the temporal data stream to forecast the wind speed and predict the generated energy. It provides virtual access to the wind turbines around the clock without visiting the physical wind farms, while its feedback loop enables two-way communication with the wind turbines for monitoring purposes. The contribution of our work is outlined as follows:
• Our framework is built over a 5G-Next Generation-Radio Access Network (5G-NG-RAN) assisted cloud-based digital twin model for understanding and analyzing wind farms. It is a cost-effective solution because digital twins modeling is possible with pay-as-you-go cloud services.
• The designed machine learning pipeline has two novel components: first, the forecasting of wind speed based on an advanced temporal convolutional network (TCN); second, the processing of the wind forecast to predict the power generation for a month, including each quarter (i.e., medium-term analysis).
The rest of this paper is organized as follows. Section II presents the details of the proposed framework with its methods and procedures. Section III provides the results, discussion, and performance of the framework. Finally, Section IV concludes our findings.

II. NETWORK MODEL, METHODS AND PROCEDURES
Network-assisted prediction draws on edge infrastructure to set up a collaborative system between the physical wind farm and its digital twin. We specifically consider the Next Generation Radio Access Network (NG-RAN) as the core architecture, shown in Fig. 2. The NG-RAN divides the gNB into a Central Unit (CU) and a Distributed Unit (DU). The gNB-CU forms the core of the network, handling 5G functions such as the Access and Mobility Management Function (AMF), the User Plane Function (UPF), and all the associated Security Functions (SF) [21]. The gNB-DU forms the edge part of the network; it interacts with the wind farm through gateways and relays information via the gNB-CU to the private cloud setup, where a virtual (digital twin) wind farm performs predictions. The advantage of using 5G-NG-RAN is that it allows better integration of cloud services, because wind farms are geographically isolated from data centers, and it can ensure better coverage with lower latency, enabling more real-time services. It also reduces the cost of deployment by reducing the number of near-farm data centers. Moreover, better resource utilization is attainable when a large amount of data from the wind turbines must be shared between the physical and the digital systems.

Furthermore, a basic framework for the interaction between the wind farm and the virtual farm without the network infrastructure, illustrated in Fig. 3, helps to understand the actual workflow. It connects a real wind farm to its digital twin, the ''virtual wind farm''. Each wind turbine in a wind farm has a supervisory control and data acquisition unit that provides data for monitoring. The captured data logs are linked to virtual wind turbines through digital twins modeling. Our framework processes the data logs to perform predictive modeling and can report the expected generation of electric power from the wind farm in the coming days.

A. DIGITAL TWINS MODELING
The supervisory control and data acquisition (SCADA) unit of each wind turbine is modeled as a digital twin using the Microsoft Azure platform. Azure offers a platform as a service (PaaS) for modeling digital twins and provides digital monitoring as a next-generation computer-oriented solution. Furthermore, the cloud-based infrastructure provides a cost-effective solution. The 5D modeling approach [22] is followed to model the wind farms, as depicted in Eq. 2:

$$DT = (PE, VR, DC, CS, Ss) \quad (2)$$

where DT is the digital twins wind farm, PE denotes the physical entities, VR the virtual representation, DC the data curation, CS the communication scheme, and Ss the services. The details of each dimension are presented as follows:

1) PHYSICAL ENTITIES (PE)
The wind turbines consist of various physical entities, including mechanical devices, monitoring sensors, SCADA units, and activity processes. The PE can be categorized into unit level, system level, and system-of-systems level [23]. The SCADA unit, considered here at the PE level, is an essential tool for collecting the data needed to monitor the behavior of wind turbines. It enables the analyst to access the historical and real-time data of wind turbines for further analysis.

2) VIRTUAL REPRESENTATION (VR)
An entity in the virtual environment represents each SCADA unit of a wind turbine to construct the wind farm. The VR can build a connection with the wind turbines and replicate their behavior virtually. Wind farms can be connected even when they are physically apart using a knowledge graph, as shown in Fig. 4. In Fig. 4, two wind farms with five and two wind turbine SCADA units, respectively, are connected using a knowledge graph. Similarly, each wind farm has an interface that connects the SCADA units of its wind turbines. Relationships and associations are defined using the digital twins definition language (DTDL) to join the SCADA units. The DTDL is a JSON-LD-based language, and a snapshot of it is presented in appendix Fig. 13. It also provides a mechanism to connect sensors and link the real-time readings for further analysis.
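To make the role of DTDL concrete, the following sketch builds two DTDL (version 2) interface definitions as Python dictionaries and serializes them to JSON. It is an illustration only: the model identifiers (`dtmi:example:...`), the property names, and the relationship name `hasScadaUnit` are hypothetical and are not the models used in this work; the properties simply mirror the SCADA quantities discussed in this paper.

```python
import json

# Hypothetical model identifiers; a real deployment chooses its own DTMI names.
SCADA_MODEL_ID = "dtmi:example:windfarm:ScadaUnit;1"
FARM_MODEL_ID = "dtmi:example:windfarm:WindFarm;1"

# Interface for one SCADA unit, exposing the monitored quantities as properties.
scada_unit_model = {
    "@context": "dtmi:dtdl:context;2",
    "@id": SCADA_MODEL_ID,
    "@type": "Interface",
    "displayName": "SCADA unit",
    "contents": [
        {"@type": "Property", "name": "windSpeed", "schema": "double"},
        {"@type": "Property", "name": "windDirection", "schema": "double"},
        {"@type": "Property", "name": "activePower", "schema": "double"},
        {"@type": "Property", "name": "theoreticalPower", "schema": "double"},
    ],
}

# Interface for a wind farm, linked to its SCADA units through a relationship.
wind_farm_model = {
    "@context": "dtmi:dtdl:context;2",
    "@id": FARM_MODEL_ID,
    "@type": "Interface",
    "displayName": "Wind farm",
    "contents": [
        {"@type": "Relationship", "name": "hasScadaUnit", "target": SCADA_MODEL_ID},
    ],
}

models_json = json.dumps([wind_farm_model, scada_unit_model], indent=2)
print(models_json)
```

The relationship entry is what turns individual twins into a knowledge graph: each wind farm twin can point to any number of SCADA unit twins, regardless of where the physical farms are located.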

3) DATA CURATION (DC)
The DC is the central part of the digital twins. The temporal data streams of the SCADA units are curated for monitoring, which provides real-time access to the PE. The following output presents the wind speed, wind direction, generated power, and theoretical power of the wind turbines. Furthermore, it also gives metadata information.
The connected SCADA unit presents its real-time values in the digital twin explorer of Microsoft Azure. The explorer shows the digital twin's ID, etag, real-time values of the SCADA unit, and metadata about the last update of the specific unit. It also offers powerful structured query language support for deeper analysis of the constructed knowledge graph. A service model could be deployed over it to make informed decisions.

4) COMMUNICATION SCHEME (CS)
Digital twins are dynamically connected with PE units using a representational state transfer application programming interface (REST API). The CS also connects the DC for real-time communication, which enables the digital twins model to communicate in real time. It is also responsible for the data flow from the PE-SCADA units to the VR-SCADA units.

5) SERVICES (Ss)
The Ss is an essential part that provides an adapter to communicate with other PE, model services, data analysts, etc. It supports customized services that can be built outside of the model. The proposed framework is based on cloud computing infrastructure, which already has standard protocols for service delivery.

B. PREDICTIVE MODELING
The predictive modeling component processes two sources of information. First, it models wind speed as a univariate time series using a deep learning model based on a temporal convolutional network (TCN). The TCN is a modern deep neural architecture that has achieved better results than its sequence modeling counterparts while being more efficient in computation time [24]. It combines the power of dilated convolutions and residual blocks. Second, the forecasting results are further processed to predict the power generation of the wind turbines using k-nearest neighbor regression. The input to the TCN is the wind speed data stream, defined as:

$$WS = \left( ws(t_1), ws(t_2), \ldots, ws(t_n) \right)^T$$

where $WS$ is the wind speed, $t_{i+1} = t_i + \Delta t$, and $ws(t)$ represents the value of wind speed at any time instance $t$. The basic assumption is that $p(ws_{t+1} \mid ws(t_1), ws(t_2), ws(t_3), \ldots, ws(t_n))$ does not depend on future timestamps, which is realistic for wind speed forecasting. A function $f$ is defined as:

$$ws(t_{n+1}), \ldots, ws(t_{n+m}) = f\left( ws(t_1), ws(t_2), \ldots, ws(t_n) \right)$$

where $ws(t_{n+m})$ denotes the forecast for the month or any quarter inside it. The function $f$ consists of dilated convolution layers arranged in residual blocks. In the dilated convolution operation $C$, $k$ is the size of the filter being learned, and the index $h - d \cdot i$ selects past values of the wind speed sequence. It allows the network to operate on a coarser scale than a normal convolution while remaining efficient, as shown in Fig. 6.
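The dilated causal convolution at the heart of a TCN can be illustrated with a short pure-Python sketch. It assumes the standard formulation from the TCN literature [24], in which the output at position $h$ is a weighted sum of the inputs at positions $h - d \cdot i$ for $i = 0, \ldots, k-1$, with positions before the start of the series treated as zero. The function names and the zero-padding choice are assumptions made for illustration and do not reproduce the trained network used here.

```python
def dilated_causal_conv(series, weights, dilation):
    """Compute out[h] = sum_i weights[i] * series[h - dilation * i].

    Only current and past samples are read, so the output at h never
    depends on future values. Positions before the start count as zero.
    """
    out = []
    for h in range(len(series)):
        total = 0.0
        for i, w in enumerate(weights):
            idx = h - dilation * i
            if idx >= 0:
                total += w * series[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """History length seen by a stack with one dilated convolution per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With a filter of size 2 and dilations 1, 2, and 4, three stacked layers already cover 8 past time steps, which is how dilation lets the network reach far into the wind speed history with few layers.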
The following expression computes the layer of TCN:

A mean squared error (MSE) loss function is used for training to converge the temporal convolutional network. It is calculated as:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples. The output is passed to a k-nearest neighbor (kNN) regression model [25] to regress the value of energy. kNN regression is a supervised non-parametric technique that computes the distance between a test point and the training points in the feature space. The goal is to predict the power generation as a linear combination of its k nearest neighbors under a distance metric. Here k is a hyper-parameter that sets the number of neighbors considered for the prediction of GP. We found the optimal number of neighbors (i.e., k = 9) using grid search; the corresponding graph is discussed in Section III. The input to the model is the predicted wind speed WS, from which the generated power GP is predicted. It can be defined as a function that maps WS to GP. This enables the prediction of generated power from the wind speed forecast. Consequently, it can assist the team in making informed decisions based on the predicted energy.
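The kNN regression step described above can be sketched as follows. This is a minimal version under stated assumptions: the only feature is the wind speed, the distance is the absolute difference, and the k neighbors are averaged with equal weights (the simplest linear combination). The function name and argument names are hypothetical.

```python
def knn_regress(train_ws, train_gp, query_ws, k=9):
    """Predict generated power as the mean power of the k closest wind speeds.

    train_ws and train_gp are equally long sequences of historical wind
    speeds and the power generated at those speeds.
    """
    if not 1 <= k <= len(train_ws):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort the (wind speed, power) pairs by distance to the query and keep k.
    nearest = sorted(zip(train_ws, train_gp), key=lambda p: abs(p[0] - query_ws))[:k]
    return sum(gp for _, gp in nearest) / k
```

In the pipeline described here, `query_ws` would be a wind speed forecast produced by the TCN, so the quality of the power estimate depends directly on the quality of that forecast.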

C. DATASET
A publicly available onshore wind farm dataset [26] is used for our experiment. The wind farm is located in Yalova, in the northwestern region of Turkey, and has been in operation since 2016. About the regional information, Yalova is located at the lat- The data was collected and stored by the SCADA unit. The SCADA units can store, retrieve, and export the data for a variety of stakeholders. The SCADA system operator is responsible for validating the incoming data. The wind turbine and data collection information is presented in Table 1.
The active power generation of a wind turbine has a strong correlation with wind speed. The active power is widely employed for monitoring wind turbine performance and power generation [27], [28]. We considered the wind speed as a temporal data series to predict the intensity of the wind in the coming days. Based on wind prediction we further forecast the ''active power'' generation. The obtained results are explained in the following section.

III. RESULTS
This section explains and presents the obtained result, followed by a discussion.

A. WIND FORECASTING
We first explored the data. The dataset contains missing values at a few time intervals, which may indicate maintenance of the wind turbines or other causes. We replace each missing value with the observation from the previous timestamp, a necessary step for processing the data as a time series. We split the dataset into four quarters (Q1, Q2, Q3, and Q4) and make predictions for one week, two weeks, three weeks, and the entire month in each quarter. The training model parameters are presented in Table 2. The following two performance metrics are considered to measure the performance of the model.
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

where MAE is the mean absolute error, RMSE is the root mean squared error, $y$ is the actual value from the test set, and $\hat{y}$ is the predicted value from the trained model. The MAE measures the average magnitude of the errors by taking their absolute values, which reflects the accuracy of the prediction. The RMSE measures the forecasting error by taking the difference between the prediction and the actual value, squaring it, averaging, and then taking the square root. MAE and RMSE can be used together to characterize the model errors. The RMSE is never smaller than the MAE because it gives relatively high weight to large errors. If both measures have the same value, all errors have the same magnitude. The first quarter prediction is presented in Fig. 7, which shows the results of the TCN model. The predictions track the actual wind speed closely but do not match it exactly. This difference is measured in terms of the performance metrics and reported in Table 3. The second quarter comprises April, May, and June; the predictions for 7, 14, 21, and 30 days are shown in Fig. 8. Table 4 confirms the low error rate of our model prediction. Similarly, the Q3 and Q4 results are presented in Figs. 9 and 10, with the performance measures in Tables 5 and 6. The obtained results show that the TCN model learns the wind speed pattern well, and the performance measures confirm it.
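The preprocessing and evaluation steps of this subsection can be sketched in plain Python. The sketch assumes missing readings are marked as `None` and uses the standard definitions of MAE and RMSE; the function names are illustrative. With pandas, the same fill is available as `Series.ffill()`.

```python
import math


def forward_fill(values):
    """Replace each None with the most recent earlier observation."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last  # stays None if there is no earlier observation
        else:
            last = v
        filled.append(v)
    return filled


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

When every prediction is off by the same amount, `mae` and `rmse` return the same number; as soon as the errors differ in size, `rmse` becomes the larger of the two, so the gap between them indicates how uneven the errors are.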

B. POWER GENERATION PREDICTION
The power generation prediction is based on a kNN regression model. The optimal value of k is found by grid search, as shown in Fig. 12. The x-axis shows the value of k, i.e., the number of neighbors, and the y-axis shows the mean absolute error.
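The grid search over k can be sketched as a loop that scores each candidate by its mean absolute error. The validation scheme is an assumption: this sketch uses a simple hold-out set of (wind speed, power) pairs, whereas the exact procedure behind Fig. 12 is not specified here. Function names are hypothetical, and neighbors are averaged with equal weights.

```python
def knn_predict(train, query, k):
    """Mean power of the k training pairs whose wind speed is closest to query."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return sum(gp for _, gp in nearest) / k


def grid_search_k(train, validation, candidates):
    """Return (best_k, scores), where scores maps each k to its validation MAE.

    train and validation are lists of (wind_speed, generated_power) pairs.
    """
    scores = {}
    for k in candidates:
        errors = [abs(gp - knn_predict(train, ws, k)) for ws, gp in validation]
        scores[k] = sum(errors) / len(errors)
    best_k = min(scores, key=scores.get)
    return best_k, scores
```

Plotting `scores` against the candidate values of k gives a curve of the kind described for Fig. 12, with the chosen k at its minimum.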
The forecasting of wind speed over the period of a month is used to predict power generation. The results of kNN regression are presented in Fig. 11.
The obtained results are close to the actual power generation. The developed model predicts not only the wind speed but also the power generation. It can help the management team know the generated power in advance and plan storage in smart grids. Our objective is to design and develop an advanced pipeline for power generation prediction that an energy management team can use to make timely decisions.

C. COMPARATIVE ANALYSIS
To validate the performance of our model, we compared the results with three state-of-the-art machine learning models. First, the decision tree is a successful supervised learning approach to predictive modeling. It constructs the tree sequentially by calculating the entropy of the attributes; its theoretical background is rooted in information theory. During the learning phase, the decision tree splits the nodes by adjusting the numerical parameter of a threshold function [29]. Second, the random forest is a powerful machine learning algorithm based on the ensemble learning paradigm. Several trees, known as base learners, are constructed over the training dataset, and a simple voting mechanism combines their individual results. This reduces the variance of the model, which has shown high accuracy and robustness in many application domains [30]. Third, support vector regression (SVR) is based on statistical learning theory and has been successfully applied to finance, electricity price forecasting, and power consumption [31]. The basic idea of SVR is to transform the input data points into a higher-dimensional feature space through a function and separate the feature space with maximum margin. The mean absolute error is reported as the performance measure in Table 7. Table 7 shows a significant improvement in each quarter, which confirms the applicability of our model to power generation prediction.

IV. CONCLUSION
Wind farms contribute to the generation of clean and affordable energy in support of sustainable solutions. Condition monitoring of wind turbines and power generation prediction play an essential role in helping the management team make informed decisions and in supporting the domestic supply chain. We developed a 5G-NG-RAN assisted cloud-based digital twins framework to monitor the SCADA units of wind turbines. The digital twins enable real-time monitoring of wind farms without visiting them physically. The 5G-NG-RAN assisted cloud provides low-latency services that support the digital twin in real time for predictive modeling in wind turbines. Furthermore, a machine learning pipeline built on the temporal convolutional network and kNN regression forecasts wind speed and power generation. The empirical evaluation on the publicly available dataset confirms its applicability in real wind farm scenarios.
Our future plan is to extend this digital twins framework for offshore wind farms monitoring and prediction to support sustainable energy solutions.