Correlation Between Wind Turbine Failures and Environmental Conditions: A Machine Learning Approach

Correlation between wind turbine failures and environmental conditions: a machine learning-based approach for optimizing predictive maintenance.

Abstract:

Wind energy has emerged as a vital renewable source, competing with conventional energy due to its clean and inexhaustible nature. However, the global mass production of wind turbines often disregards the unique environmental conditions of installation sites, leading to performance and reliability challenges. This study applies machine learning methodologies to analyze the correlation between wind turbine failures and local environmental conditions. The research leverages Rough Set Theory to transform instances of undesirable turbine shutdowns—especially those influenced by incomplete tropicalization processes—into actionable decision rules. The findings provide practical insights applicable to wind farms worldwide, enabling optimized maintenance strategies and precise adjustments to protection parameters. These improvements contribute to reducing failure rates, enhancing energy conversion efficiency, and promoting the sustainable expansion of wind energy across diverse geographic and climatic contexts.

Published in: IEEE Access ( Volume: 13)

Page(s): 50043 - 50058

Date of Publication: 14 March 2025

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2025.3551241

SECTION I.

Introduction

Wind energy is one of the fastest-growing renewable energy sources worldwide. In Brazil it accounted for 12.8% of the national energy matrix in 2022 and is expected to reach 15% by 2026, having met 24% of the total demand of the National Interconnected System (SIN) on October 15, 2022 [1]. Wind energy is clean, inexhaustible, and does not contribute to greenhouse emissions or environmental pollution. Wind turbines, responsible for converting wind kinetic energy into electricity, are the core technology behind this sustainable energy source [2].

The Brazilian regions most favorable for wind power generation are the coastal areas of the Northeast (such as Ceará, Rio Grande do Norte, Paraíba, and Bahia) and the South (Rio Grande do Sul and Santa Catarina). These locations exhibit strong and constant wind patterns: while the Northeast experiences stable, trade wind-driven conditions, the South faces more variability due to frequent cold fronts [3].

Wind turbines are commercially produced on a global scale, often without specific adaptations to the unique environmental conditions of their installation sites [4]. Consequently, there is an opportunity for in-depth engineering studies based on real operational data to identify correlations between wind turbine failures and local environmental and geographical conditions. Understanding these relationships enables the development of tailored maintenance strategies and the fine-tuning of protection settings, ultimately improving operational reliability and turbine performance. This, in turn, leads to reduced failure rates, higher energy conversion efficiency, and increased financial returns for wind farm operators [4].

It is important to note that environmental factors such as temperature and humidity play a role in shaping wind characteristics. Temperature variations generate pressure differentials and convection currents, influencing wind speed and turbulence. Likewise, humidity affects air density and contributes to weather systems that impact wind patterns. Thus, by analyzing wind-related variables such as speed, direction, and turbulence, we inherently account for the indirect influence of temperature and humidity. This approach ensures that our findings reflect broader climatic interactions affecting wind turbine performance, even without explicitly including these additional environmental parameters.

This study hypothesizes that certain types of wind turbine failures are strongly correlated with specific environmental conditions and that applying machine learning techniques can identify predictive patterns, enabling improved maintenance strategies and reduced downtime. By leveraging Rough Set Theory, we aim to uncover actionable decision rules that facilitate proactive adjustments to turbine operation, ultimately increasing efficiency and reliability in diverse environmental contexts.

SECTION II.

State of the Art

The analysis of wind turbine failures and their relationship with environmental conditions matters to the wind energy sector because it clarifies the main causes of failure and supports preventive measures that increase the reliability and efficiency of these devices. This section discusses relevant studies addressing the relationship between wind turbine failures and environmental conditions in different locations.

For many centuries, humanity has utilized the motive force of winds for the production of work. However, the conversion of wind energy into electrical energy only gained commercially significant scale in the 2000s, as demonstrated in Figure 1, and due to national and international efforts towards the decarbonization of the energy matrix, it is highly likely that this trend will continue to grow in the coming decades.

FIGURE 1.

Evolution of installed capacity (MW) [5].

In 2023, Brazil maintained sixth position in the world ranking of cumulative wind capacity compiled in the World Wind Energy Association (WWEA) Annual Report 2023 [6], as shown in TABLE 1.

TABLE 1 Global Wind Energy Capacity Rankings by Country as of 2023

The growth of the wind power matrix in Brazil and worldwide has triggered numerous studies related to the performance of wind turbines, especially because such equipment often operates in locations where the environmental conditions differ significantly from those assumed in the original design. Silva et al. [4] comment that wind turbine projects are initially introduced through prototypes to target markets. For this reason, the “tropicalization” of projects, which are generally conceived in Europe and the USA, becomes necessary to meet local requirements. Consequently, the operation and maintenance parameters need to be constantly adjusted throughout the lifetime of the wind turbines, incorporating historical results.

There is no single, simple solution to the complexities involved in operating and maintaining wind turbines. However, Data Science offers tools that enable the construction of a comprehensive view of all variables in the context and how they interrelate to produce respective effects. Various authors discuss the use of data, applied statistics, and artificial intelligence techniques to understand problems associated with the operation and maintenance of wind turbines. Sankineni et al. [7] explores the application of data science to determine historical operational trends of wind turbines, reasons for energy production variability among sites, and future expectations, aiding in the development of an action plan. The study highlights how wind energy production is influenced by several factors, such as wind speed, wind gusts, and turbine defects. It also discusses how wind speed impacts turbine hardware, which directly affects performance and results in significant turbine downtime. The paper examines the use of various Machine Learning (ML) techniques to compare their effectiveness in preventing future issues in wind turbines, aiming to provide immediate assistance and minimize machine downtime.

With an emphasis on generation forecasting, Kokila and Isakki [8] discusses the monitoring of conditions and failure detection in wind turbines using power curves, rotor curves, and blade pitch curves, as explained in data mining algorithms. Based on high-frequency vibration data, the author proposes a fault diagnosis and troubleshooting strategy using “envelope curve” analysis. The Condition Monitoring System (CMS) technique can be used to identify and resolve issues in the gearbox and rotor of the turbine. Using temperature measurements, it is possible to detect failures in wind turbine generators through a neural network model, which identifies failures based on components such as generator oil leakage, blade cracks, low wind speed, and generator cooling temperature. The Random Forest technique is also used to analyze failures in the generator brushes. Based on data mining and statistical methods, the author advocates for an alternative approach to assess performance and monitor the conditions of wind turbines, in which sensors must be installed at specific locations on the turbine, monitored through the SCADA system, enabling early warnings of failures.

Sambana et al. [9] emphasize that the safety and proper operation of wind turbines depend on continuous condition monitoring, which can be achieved through ML techniques. Their study focuses on anomaly detection methods and Support Vector Machines (SVM), using real data to identify potential failures, and concludes that the proposed scheme allows fault detection to be carried out more reliably and quickly. Another example of using data for better exploitation of wind energy resources is discussed by Singh and Rizwan [10], who present a forecasting tool based on time series data sets to estimate wind energy. They also advocate an exploratory approach to visualizing the data obtained from the SCADA system, in which the data sets are shown in polar coordinate systems and pair diagrams to better understand the relationship between wind and energy production. Using input features such as theoretical power, actual produced power, wind direction, wind speed, month, and hour, the power generated by the wind turbine is predicted with ML algorithms.

Focusing on maintenance, Bilendo et al. [11] present a method to identify anomalous behaviors that may lead to failures in wind turbines, and validate its efficacy with real data from operating wind turbines. One limitation noted by the authors is the “cost of processing time,” which is expected because the SCADA database is not labeled, so its records must be collected and labeled before they can be processed.

SCADA data related to the temperature of the generator and gearbox oil of wind turbines are used by Qian et al. [12] to identify failures in the gearbox and generator winding of the turbines. The results showed that the differences between the actual output signals and the signals predicted by the model are caused by gearbox failure and generator winding failure. Thus, the proposed method can provide early warnings of imminent failures in components. The author presents the ELM (Extreme Learning Machine) algorithm used in the monitoring of conditions of wind turbines.

Karadavi et al. [13] use data from the SCADA system of a wind farm located in southeast Turkey to predict alarms associated with failure conditions. The proposed model consists of four main stages: data acquisition from SCADA, data analysis and preprocessing, feature selection, and classification. The authors compare Support Vector Classification (SVC) and Decision Trees for predicting failures in wind turbines, also considering data under normal conditions (no failures), and show that the decision tree method provides more accurate results than SVC. They emphasize that the F1 Score is a crucial metric for unbalanced classes. Based on performance results (AUC-ROC, Confusion Matrix, and F1 Score), Decision Trees were selected as the appropriate classifier for this system. One relevant aspect of this study is the use of SMOTE, a technique for handling unbalanced classes in datasets. When a dataset presents a significant discrepancy between the number of instances in a majority class and a minority class, ML models may become biased towards the majority class. SMOTE addresses this imbalance by artificially generating new instances of the minority class, created by interpolating between the features of an existing sample and those of its nearest neighbors. Applying SMOTE enlarges the minority classes, giving the ML model more examples to learn from and improving its generalization ability. This reduces the bias towards the majority class and can improve overall performance when dealing with unbalanced classes.
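The interpolation step described above can be illustrated with a minimal sketch. This is not the implementation used in [13]; the function name `smote_sketch`, the sample values, and the choice of `k` are hypothetical and serve only to show how a synthetic minority sample is placed on the segment between an existing sample and one of its nearest neighbors.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class samples: (wind speed in m/s, turbulence index)
minority = [(11.2, 0.18), (12.0, 0.21), (11.6, 0.25), (12.4, 0.19)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real samples, it always falls inside the range already covered by the minority class; the method adds density, not new information.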

Yan et al. [14] proposes a fault diagnosis method based on Principal Component Analysis (PCA) and a Support Vector Machine (SVM) model to address the problems of high dimensionality and large sample size of wind turbine failure data. Initially, PCA is used to extract low-dimensional failure characteristics from high-dimensional failure data, aiming to eliminate correlation among features. Subsequently, these low-dimensional failure characteristics are used as inputs to train SVM classifiers. Fault diagnosis is then performed through the classification of these characteristics. Simulation results showed that diagnostic accuracy could reach 100%, indicating that this method can diagnose various types of failures quickly and effectively. Feature extraction through PCA can effectively reduce noise and dimensionality. The author demonstrates that analyzing two components can contain 98.5% of the failure information in the wind turbine data, effectively reducing the correlation among characteristics. Analyzing the simulation results of wind turbine failure data, this method can accurately identify different failure states and normal states of a wind turbine. Compared with the Back Propagation Neural Network (BP) model, the GSA-SVM model can achieve 100% diagnostic accuracy with only 200 training samples, whereas the BP Neural Network model requires more than 3,000 training samples to reach 100% accuracy. This ensures that the SVM model can better solve the classification problem with few samples, allowing for rapid and effective identification of wind turbine failure types.

Afrasiabi et al. [15] propose a deep neural network (DNN) framework for fault diagnosis. The proposal uses a single classification method divided into two blocks: the first uses a Generative Adversarial Network (GAN) to generate artificial signals that enhance feature extraction efficiency and enable the training of an ML fault detector with a low number of samples; the second block is a Temporal Convolutional Neural Network (TCNN) classifier capable of learning temporal characteristics and accurately detecting abnormalities. The proposed method is tested with real data from three 3 MW turbines in Ireland, and its results are compared with Support Vector Machine (SVM) and Feedforward Neural Network (FFNN) methods, demonstrating that the proposed method is a suitable alternative for fault classification in wind turbines.

Mazidi et al. [16] analyze factors affecting the active power of a wind turbine, such as gearbox temperature, pitch angle, rotor speed, and wind speed, using data over a 12-month period and employing four classification techniques: Kohonen Maps, Multilayer Perceptron, Decision Trees, and Rough Sets. These techniques were used to classify a wind turbine dataset with four variables and one target variable, and the study demonstrates that each has advantages and disadvantages. In the data preparation stage, almost all techniques performed similarly; however, Kohonen Maps offer the best visualization features through various types of graphs. Regarding learning and classification speed, the Multilayer Perceptron was slower than the others. Decision Trees show greater interpretability of their results.

The Rough Sets methodology, introduced by Pawlak et al. [17], is a data mining approach designed to discover relationships and patterns in complex datasets. It is based on the theory of Rough Sets, enabling the identification of association rules between variables and the categorization of instances based on imprecise or uncertain information. The application of this methodology in analyzing wind turbine data reveals correlations between wind conditions and observed protection mechanisms, even amidst uncertainties or data noise.
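As a minimal sketch of the underlying idea, the lower and upper approximations of a target concept can be computed from a small decision table. The attribute names, the discretized values, and the rows below are hypothetical and are not taken from the wind farm data; the function only illustrates the standard definitions.

```python
from collections import defaultdict

def approximations(table, condition_attrs, decision_attr, target):
    """Return the indiscernibility classes forming the lower and upper
    approximations of the set of rows whose decision equals `target`."""
    # Rows with identical condition values are indiscernible from each other
    classes = defaultdict(list)
    for row in table:
        key = tuple(row[a] for a in condition_attrs)
        classes[key].append(row)

    lower, upper = [], []
    for key, rows in classes.items():
        decisions = {r[decision_attr] for r in rows}
        if decisions == {target}:
            lower.append(key)   # class lies entirely inside the target set
        if target in decisions:
            upper.append(key)   # class overlaps the target set
    return lower, upper

# Hypothetical discretized records
table = [
    {"speed": "high", "turbulence": "high", "event": "shutdown"},
    {"speed": "high", "turbulence": "high", "event": "shutdown"},
    {"speed": "high", "turbulence": "low", "event": "shutdown"},
    {"speed": "high", "turbulence": "low", "event": "normal"},
    {"speed": "low", "turbulence": "low", "event": "normal"},
]
lower, upper = approximations(table, ["speed", "turbulence"], "event", "shutdown")
```

Classes in the lower approximation yield certain rules (here, high speed with high turbulence always coincides with a shutdown), while classes that appear only in the upper approximation yield possible rules, which is how the method tolerates inconsistent or noisy records.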

This method has been widely used in various fields, showing notable advantages and practical applicability. In Biomedical Engineering, Chaudhuri and Mitra [18] utilized the methodology to assist in the diagnosis of heart diseases using electrocardiogram waveforms. Jagielska et al. [19] conducted a comprehensive comparison of classification techniques, including Neural Networks, Fuzzy Systems, Genetic Algorithms, and Rough Sets. Across various datasets, the Rough Sets methodology demonstrated superior performance in scenarios such as credit approval. In the Human Resources sector, Chen and Chien [20] used Rough Sets to select employees in a semiconductor company, considering attributes such as education and experience to identify potential talents.

In the field of electrical energy, the Rough Sets methodology plays a significant role in studies of protection and control systems. Han et al. [21] combined Rough Sets with Neural Networks to classify faults in power systems with high accuracy. Additional implementations included predictive maintenance of induction motors [22], load estimates in electrical systems [23], classification of consumer demand curves [24], and anomaly detection in electric grid information systems [25]. The methodology has also proven applicable in operational decision-making [26] and power flow control [27], demonstrating its broad utility and effectiveness across diverse applications.

In recent years, new machine learning models have been developed to address challenges related to classification and uncertainty analysis in data. One emerging approach is granular-ball computing (GBC), which has shown great promise in various applications. Xie et al. [28] introduces an improved method for generating granular balls (GBG++), characterized by its efficiency, robustness, and scalability. The GBG++ method incorporates an attention mechanism to enhance stability and efficiency while reducing reliance on traditional algorithms such as k-means. Additionally, the study proposes a granular-ball-based classifier (GBkNN++), which evaluates the quality of granular balls to improve classification accuracy. Experimental results indicate that GBG++ outperforms various granular-ball-based classifiers and traditional machine learning methods across 24 benchmark datasets.

Integrating such techniques into the analysis of wind turbine failures could provide a new perspective for handling uncertainties and enhancing the efficiency of predictive analysis.

The integration of data analysis techniques, applied statistics, and artificial intelligence emerges as a pathway toward more efficient operation and maintenance processes, contributing to the maximization of wind energy production and the reduction of costs associated with failures and downtimes. These studies provide a solid foundation for future research and developments in the field of wind energy, highlighting the importance of data-based approaches to tackle operational and maintenance challenges.

This work stands out due to its detailed and specific approach in analyzing the correlations between failures in wind turbines and the environmental conditions of their locations, which is essential for developing effective operation and maintenance strategies. Common to other works, it involves the application of data science techniques to real operational data of wind turbines, aiming to obtain diagnostics, understand patterns of failures, and identify correlations between variables. This approach not only enhances the understanding of wind turbine performance but also aids in predictive maintenance and operational planning.

SECTION III.

Development

This section is organized according to the CRISP-DM methodology [29], a structured model for conducting data science projects that helps ensure analyses are robust, replicable, and applicable to practical problems. The process is divided into six phases, which are detailed in the following subsections: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

A. Business Understanding

The Praias de Parajuru Wind Power Plant consists of 19 wind turbines, with a total installed capacity of 28.8 MW. The equipment is manufactured by the German company Vensys, model VS77, featuring multipole synchronous generators with permanent magnets and a unit capacity of 1,540 kVA, a nominal output voltage of 690 V, operating at a nominal wind speed of 11 m/s, and installed at a height of 85 m. The plant is located in the municipality of Beberibe, Ceará State, 102 km east of the State capital Fortaleza. The step-up substation is located within the wind park itself, with two step-up transformers of 20 MVA and 15 MVA (34.5 kV/69 kV), making up the Parajuru A and Parajuru B busbars, through which the wind turbines are connected to the Enel Distribution Network [30].

As defined in the manufacturer’s manual of the wind turbines of these parks [31], the main faults of the VENSYS 77/62b wind turbine model include:

Global Error: General errors affecting the overall operation of the turbine.
Acceleration Error: Problems related to the acceleration of the nacelle.
Cabinet Cooling Error: Failures associated with the cabinet cooling system.
Converter Error: Issues detected in the converter, including monitoring, IGBT temperature, and others.
Hydraulic Error: Failures in the hydraulic system.
Yaw Mechanism Error: Problems related to the yaw mechanism, including lubrication.
Generator Error: Failures associated with the generator.
Transformer Error: Issues detected in the transformer.
Grid Error: Includes failures related to voltage, current, frequency, active and reactive power of the electrical grid.
Fuse Error: Problems associated with fuses.
UPS Error: Failures in the Uninterrupted Power Supply (UPS) system.
Tower Cooling Error: Problems in the tower cooling system.
Profibus Error: Failures associated with the profibus system.
Wind Measurement Error: Problems related to wind measurement.
Navigation Light Error: Failures in navigation lights.
Startup Error: Problems during startup.
Shutdown Error: Failures during shutdown.
CPU System Error: Problems related to the system CPU.
PitchW System Error: Failures associated with the PitchW system, including motor temperature, capacitor voltage, and others.

Given that wind turbine designs do not take into account the specific characteristics of the locations where they will be installed, a significant gap arises between the theoretical and actual operational performance of these devices. The inherent variations in local wind conditions, such as intensity, direction, and turbulence, can directly affect the efficiency and safety of wind turbines.

These variations can sometimes exceed predefined operating parameters, potentially leading to unplanned shutdowns, reduced energy generation efficiency, and in extreme cases, damage to the equipment. Therefore, it becomes imperative to adopt an analytical and empirical approach, grounded in detailed analysis of operational data and environmental conditions.

By studying generation data together with environmental variables, it is possible to identify patterns and anomalies that indicate the need for fine-tuning in protection and control systems. Analysis can reveal relevant insights on how specific wind conditions interact with wind turbines, allowing for the implementation of customized corrective measures that optimize equipment efficiency and resilience.

Ultimately, a data-driven approach not only improves energy efficiency but also extends the lifespan of wind turbines, reduces maintenance costs, and increases the reliability of energy supply. Thus, the integration of data analysis becomes a strategy with the potential to mitigate the challenges associated with mismatches in protection and control, ensuring that wind turbines operate in an optimized, safe, and efficient manner, aligned with the unique environmental dynamics of their specific locations.

B. Data Understanding

The initial phase of this study involved the detailed data collection from the Praias de Parajuru wind farm. The data, obtained from the SCADA system of the Cemig Operation Center, provide relevant information about energy production and operational failures over a continuous 12-month period from April 1, 2022, to March 31, 2023.

The original datasets, with a granularity of one sample every four seconds, were integrated into 10-minute intervals to make the analysis manageable and aligned with standard practices in wind turbine engineering. This resulted in a total of 52,560 samples (365 days × 144 intervals per day) for each variable of each wind turbine at the wind farm. For the Praias de Parajuru Wind Farm, there is a total of 197 variables, including individual measurements from the 19 wind turbines and collective data from the park’s anemometric tower.
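The aggregation step can be sketched as follows. The source does not state which statistics were computed per interval, so the use of the mean and the population standard deviation here is an assumption, as are the function name `aggregate_10min` and the synthetic samples.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev

def aggregate_10min(samples):
    """Group (timestamp, value) samples into 10-minute windows and return
    {window_start: (mean, standard deviation)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Floor the timestamp to the start of its 10-minute window
        start = ts.replace(minute=ts.minute - ts.minute % 10,
                           second=0, microsecond=0)
        buckets[start].append(value)
    return {w: (mean(v), pstdev(v)) for w, v in sorted(buckets.items())}

# 20 minutes of synthetic wind speed samples, one every 4 seconds
t0 = datetime(2022, 4, 1)
samples = [(t0 + timedelta(seconds=4 * i), 8.0 + (i % 5) * 0.1)
           for i in range(300)]
windows = aggregate_10min(samples)

# One year of 10-minute windows: 365 days x 24 hours x 6 windows per hour
windows_per_year = 365 * 24 * 6
```

Each 10-minute window condenses 150 raw samples into a single record, and `windows_per_year` reproduces the 52,560 samples per variable reported for the 12-month period.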

An initial exploratory analysis reveals the diversity of telemetered data encompassing variables such as wind direction and speed, nacelle direction, generation limitation, turbine rotation, blade pitch angles, among others. Collective variables, such as wind direction and speed at specific altitudes, atmospheric pressure, ambient temperature, and relative humidity, provide additional information about the prevailing environmental conditions.

The failure records, totaling approximately 6,100 occurrences for Praias de Parajuru, complement the set of information necessary for the continuation of this work. Each record is meticulously detailed, including the date and time of the occurrence and reinstatement, the associated wind turbine, the identified cause, a detailed description of the occurrence, the action taken to mitigate the failure, and additional observations.

The causes of the failures were categorized based on the nature and origin of the occurrences, facilitating a more in-depth analysis of trends, patterns, and anomalies associated with specific components of the wind turbines, such as the pitch system, yaw, gearbox oil level, among others, as cited in Vensys [31].

With a clear understanding of the structure, nature, and complexity of the data at hand, the study progresses to the subsequent phases of data preparation. The richness of the collected data will not only facilitate robust analysis but also support the identification and mitigation of protection and control mismatches, thus optimizing the operational efficiency and energy production of the wind farms under study.

C. Data Preparation

Given the unique and individual nature of each wind turbine, a decision was made to create a specific dataset for each one, resulting in a total of 19 distinct datasets. This strategy is based on the premise that each wind turbine, equipped with its own settings, components, and operational circumstances, represents a particular individual within the context of the wind farm. Grouping all wind turbines into a single dataset could inadvertently mask the critical nuances and individual variations that each one exhibits. The individuality of each turbine is marked by differences in operational settings, component wear, efficiency of control and protection systems, and its own geographical position relative to other turbines, among other factors. Therefore, the analysis of individual datasets becomes necessary to capture and understand these subtle, yet significant differences that directly influence the performance and efficiency of each unit.

During the data preparation phase, the main focus was on cleaning and filtering the records. It was identified that a significant portion of the records from April to July 2022 were compromised, with wind measurements being zeroed or null. These records were removed from the datasets. The reason for this exclusion is that such records do not add value to subsequent analysis and could introduce noise or distortions into the study. TABLE 2 demonstrates the total impact in terms of record reduction in the analysis context.
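A minimal sketch of this filtering step is shown below. The record layout, the field name `wind_speed`, and the sample values are hypothetical; the sketch only illustrates dropping records whose wind measurement is missing or zero and reporting the share removed.

```python
import math

def is_valid_wind(value):
    """A wind measurement is kept only if it is present, numeric and non-zero."""
    return value is not None and not math.isnan(value) and value != 0

def drop_invalid_wind(records, wind_key="wind_speed"):
    """Return (kept records, number removed, percentage removed)."""
    kept = [r for r in records if is_valid_wind(r.get(wind_key))]
    removed = len(records) - len(kept)
    pct_removed = 100.0 * removed / len(records) if records else 0.0
    return kept, removed, pct_removed

# Hypothetical 10-minute records for one turbine
records = [
    {"time": "2022-04-01 00:00", "wind_speed": 7.4},
    {"time": "2022-04-01 00:10", "wind_speed": 0.0},
    {"time": "2022-04-01 00:20", "wind_speed": None},
    {"time": "2022-04-01 00:30", "wind_speed": float("nan")},
    {"time": "2022-04-01 00:40", "wind_speed": 9.1},
]
kept, removed, pct_removed = drop_invalid_wind(records)
```

Applied per turbine, the three returned values correspond to the kind of quantities summarized in TABLE 2: records remaining, records removed, and the percentage of data cleaned.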

TABLE 2 Data Cleaning Impact Summary

TABLE 2 summarizes the data cleaning carried out to refine the quality of the datasets for each wind turbine at the Praias de Parajuru Wind Farm. The columns indicate the number of null records removed, the percentage of data cleaned relative to the original dataset, the total records remaining after cleaning, and the number of operational interventions recorded. The varying percentages of data removed across turbines highlight the distinct operational and environmental challenges faced by each turbine, emphasizing the importance of customized data handling approaches.

Figure 2 shows that the number of anomalies in each category varies significantly across wind turbines. For instance, Turbine 1 predominantly experienced “Error Converter” incidents, whereas Turbine 11 had a predominance of “Error Acceleration” incidents. In the case of Turbine 5, the number of incidents is almost negligible, making it challenging to train an ML model effectively. This variability highlights the need for tailored approaches in monitoring and diagnosing turbine-specific issues, further emphasizing the importance of understanding each turbine’s operational nuances for effective anomaly detection and system optimization.

FIGURE 2.

Number of Shutdowns, by category, per wind turbine.

When training an ML model for anomaly detection, it is essential to have a sufficient number of examples of each type of anomaly so that the model can learn the patterns associated with these abnormal conditions. The amount of data required can vary significantly depending on the complexity of the anomaly type and the variation in the data.

Types of anomalies with only one or two records (as seen in various categories for most wind turbines) may not be sufficient to effectively train a model, even when resampling techniques (such as SMOTE) are applied, because the model may not learn to adequately differentiate these anomalies from normal variations in the data. Ideally, one would want dozens or even hundreds of examples for each anomaly category, especially if there is significant variation within that category. For this reason, subsequent analyses consider only those turbines that present anomalies with 20 or more occurrences in the available dataset, as shown in TABLE 3.
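The selection rule can be sketched as a simple count-and-filter over the failure log. The log entries and counts below are invented for illustration and do not reproduce the values of Figure 2 or TABLE 3; only the threshold of 20 occurrences comes from the text.

```python
from collections import Counter

MIN_OCCURRENCES = 20  # threshold stated in the text

def frequent_categories(failure_log, min_count=MIN_OCCURRENCES):
    """Count failures per (turbine, category) pair and keep the pairs
    that reach the minimum number of occurrences."""
    counts = Counter((turbine, category) for turbine, category in failure_log)
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical failure log: (turbine number, failure category)
failure_log = (
    [(1, "Error Converter")] * 25
    + [(1, "Error Pitch")] * 2
    + [(11, "Error Acceleration")] * 20
    + [(5, "Error Grid")] * 1
)
selected = frequent_categories(failure_log)
```

Pairs that fall below the threshold simply drop out of the result, which corresponds to the blank cells of a table organized by turbine and category.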

TABLE 3 Summary of Anomalies With 20 or More Occurrences by Wind Turbine and Category

Furthermore, occurrences whose causes were clearly identifiable as not associated with wind dynamics, such as problems involving telecommunications infrastructure, supervision and control (Error Profibus), and issues related to the electrical grid of the wind farm (Error Grid), were also removed from the analysis. Consequently, the categories that form the object of this study are those associated with acceleration protection (Error Acceleration), power converter (Error Converter), and the pitch system (Error Pitch).

Despite the significant reduction in sample sizes, it is understood that the number of remaining records is still substantial and does not limit the continuity and quality of the studies, as well as the achievement of objectives. This strategic focus ensures that the ML model training is based on robust and relevant datasets, enhancing the likelihood of developing an effective anomaly detection system.

TABLE 3 presents a structured overview of anomalies that have occurred at least 20 times, sorted by wind turbine and category. The blank spaces indicate that the number of anomalies in that category did not reach the threshold of 20 incidents, highlighting a more focused distribution of significant anomalies across the turbines. This targeted data approach ensures that the analysis concentrates on more frequent and potentially critical issues, facilitating more precise machine learning modeling and better understanding of prevalent issues in specific turbines.

Using the date and time of the measurements and the records of protection activation, each event in the wind turbines was then associated with the corresponding wind measurements, composing datasets similar to that shown in TABLE 4. The table illustrates how each event or error trigger in the wind turbines is linked to the wind conditions at the time of occurrence. The columns provide wind direction and speed, along with their respective standard deviations, and a turbulence index that indicates the stability of the wind at the time of the event. Such detailed data allows for a more nuanced analysis of the conditions leading up to each error, aiding in the detection and prediction of potential faults based on specific wind patterns.

TABLE 4 Association of Protection Activations With Wind-Related Variables
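The timestamp-based association behind TABLE 4 can be sketched as below. The text does not state the matching rule, so this example assumes each event is paired with the latest measurement taken at or before it, within a 10-minute window. The function name `attach_wind`, the field names, and the sample values are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def attach_wind(events, measurements, max_age=timedelta(minutes=10)):
    """Pair each event with the latest measurement taken at or before it.

    events: list of (timestamp, error_category)
    measurements: list of (timestamp, wind_record), sorted by timestamp
    Events with no measurement within max_age are paired with None.
    """
    times = [t for t, _ in measurements]
    linked = []
    for when, category in events:
        i = bisect_right(times, when) - 1  # index of last measurement <= when
        if i >= 0 and when - times[i] <= max_age:
            linked.append((when, category, measurements[i][1]))
        else:
            linked.append((when, category, None))
    return linked


# Hypothetical data for illustration only.
measurements = [
    (datetime(2020, 1, 1, 12, 0),
     {"speed": 7.2, "speed_std": 0.6, "direction": 95.0, "direction_std": 5.0}),
    (datetime(2020, 1, 1, 12, 10),
     {"speed": 7.9, "speed_std": 0.9, "direction": 101.0, "direction_std": 6.5}),
]
events = [
    (datetime(2020, 1, 1, 12, 7), "Error Pitch"),
    (datetime(2020, 1, 1, 13, 0), "Error Converter"),
]

linked = attach_wind(events, measurements)
for row in linked:
    print(row)
```

Returning `None` for unmatched events keeps gaps in the measurement series visible instead of silently pairing an event with stale wind data.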

D. Modeling

In the modeling phase, a routine was developed to automate the analysis and modeling of data from each wind turbine at the Praias de Parajuru wind farm individually. The code is structured to iterate over each turbine, load its specific data, prepare the data, and apply a series of ML models to identify patterns and anomalies.

The choice of ML methods in the modeling stage considers that the problem at hand is the detection of anomalies in complex systems [32]. These systems are often characterized by imbalanced data, where failures, though critical, are rare events compared to normal operation. The complexity is further amplified by the high dimensionality of the data, with a multitude of monitored variables, each contributing to the complexity of the modeling environment.

According to Alla and Adari [33], an anomaly is a value or outcome that deviates from what is expected, but the exact classification can vary. There are three types of anomalies: point-based, context-based, and pattern-based. Point-based anomalies are values that, while not necessarily outliers, are atypical, such as an abnormal result in a blood test. Context-based anomalies are normal values that become anomalous in certain contexts, like a spike in sales in an unexpected month. Pattern-based anomalies are deviations from historical trends, such as an atypical amount of rainfall for a specific month. Anomaly detection utilizes advanced algorithms to identify anomalous data or patterns and is related to outlier detection, novelty detection, and noise removal.

The need for real-time anomaly detection and the variety of anomaly types present additional challenges, requiring a careful approach in the selection and optimization of models. Feature engineering becomes a central pillar, ensuring that the models are fed with transformed and selected data to maximize effectiveness in identifying the subtle and complex patterns associated with specific failures. Model interpretability is also a critical consideration, ensuring that the insights obtained can be recognized and utilized to produce effective strategies for predictive maintenance and operational optimization.

Therefore, modeling in this context is not just a technical matter but also a strategic one, integrating domain knowledge, system complexity, and advanced analytical capabilities to ensure operational integrity and efficiency of the wind turbines.

Given these considerations, specific analytical models were selected for classification, each with its own characteristics and areas of applicability. Their use is justified by the inherent complexity of the data, which involves significant variability in environmental conditions and operational differences among turbines. This variability requires models capable of capturing non-linear relationships and handling high-dimensional datasets effectively. Moreover, the presence of class imbalance in failure data necessitates robust algorithms like Gradient Boosting and Random Forest, which perform well in such scenarios by maintaining accuracy across minority classes. Additionally, interpretability is essential in the context of predictive maintenance, where actionable insights are needed for operational decision-making. Models such as Random Forest and Support Vector Machine offer clear interpretative advantages, while the integration of Rough Sets methodology provides transparent decision rules, enhancing the overall understanding of the factors contributing to turbine failures. This integrated modeling approach thus improves the detection of relevant patterns, supports efficient maintenance strategies, and contributes to the reliability and performance optimization of wind energy systems. The selected models are:

Random Forest (RF) [32], [33] is an ensemble learning method renowned for its ability to construct multiple decision trees, providing robustness and accuracy in both classification and regression tasks. It is notably resistant to overfitting and can manage a large number of input variables. Ideal for complex, non-linear datasets, RF adapts well to nuances and variations in data, making it highly suitable for predicting diverse types of anomalies in wind turbine operations.
Gradient Boosting (GB) [6], [34] builds trees sequentially, with each tree correcting errors from the previous one. It excels in regression, classification, and ranking tasks and is known for its computational efficiency. GB has been used to compare performance against LSTM neural networks by analyzing deviations between measured data and model predictions, aiding in the detection of abnormal turbine behavior.
Support Vector Machine (SVM) [37] is versatile, capable of performing linear and non-linear classifications, regression, and outlier detection. It is particularly effective with high-dimensionality feature spaces. Suitable for classification problems where classes are not linearly separable, SVM is used for high complexity datasets common in wind turbine diagnostics.
K-Nearest Neighbors (KNN) [7] is an instance-based learning model that approximates the target function locally and defers computation until a prediction is requested. It is intuitive and straightforward. Effective in scenarios where proximity or similarity between data points significantly indicates classification, KNN is used for diagnosing operational patterns in turbines.
Multilayer Perceptron (MLP) [38] is a type of feedforward artificial neural network that includes at least three layers of nodes and is instrumental in performing complex learning tasks. Capable of capturing complex non-linear relationships and interactions between variables, MLP is widely used for complex classification problems in turbine data.
Logistic Regression (LR) [39] is a simple yet powerful statistical model that is easy to implement and interpret. It is used to understand the relationship between independent and dependent variables. Useful in binary and multiclass classification tasks, LR provides insights into the likelihood of specific anomalies occurring under given conditions.
Naive Bayes (NB) [37], like logistic regression, is a statistical model primarily used in binary and multiclass classification tasks. It is known for its simplicity and computational efficiency and is particularly effective for classification problems involving categorical variables.

The selection of these models was driven by their complementary strengths in addressing the specific data characteristics of wind turbine operations. Given the high-dimensional nature of environmental and operational data, models like Random Forest and SVM were chosen for their robustness and ability to handle complex feature spaces effectively. The imbalanced distribution of failure events necessitated the use of Gradient Boosting, which excels in scenarios where minority class detection is critical. K-Nearest Neighbors was included due to its simplicity and effectiveness in identifying local patterns, particularly useful for anomaly detection in isolated conditions. Logistic Regression and Naive Bayes were selected for their interpretability and efficiency, providing strong baselines and quick insights into classification performance. Finally, the Multilayer Perceptron was chosen for its capacity to capture non-linear relationships, essential when modeling the dynamic and interdependent factors affecting turbine performance. This diversified model selection ensures a comprehensive analysis capable of uncovering both broad trends and subtle anomalies within the data.

1) Considerations on Employed Models

In the development of this study, understanding the different machine learning (ML) models employed and the key variables that are adjusted to optimize their performance is crucial. TABLE 5 provides a summary of the models used in the analysis. For each model, the main characteristics, the hyperparameters optimized via GridSearchCV, and the respective bibliographic references are presented.

TABLE 5 helps in setting up the models and aids in understanding how each hyperparameter impacts the learning process and the overall effectiveness of the model in handling complex datasets, such as those encountered in monitoring and diagnosing wind turbine operations.

TABLE 5 Summary of ML Models and Hyperparameters to be Optimized

2) Considerations on Data Preprocessing

The ML models were implemented with scikit-learn [40], an open-source ML library that supports supervised and unsupervised learning. Its GridSearchCV utility was used to optimize the hyperparameters, seeking the configuration that maximizes the accuracy and efficiency of each model. This method systematically explores the combinations of parameters in a predefined grid and evaluates each one by five-fold cross-validation (five rounds of training and testing on different partitions of the data) to determine which configuration gives the best performance.

Each model was integrated into a workflow (pipeline) that includes data normalization via StandardScaler [40]. StandardScaler is a widely used data preprocessing technique in ML and statistics for standardizing a dataset. Standardization transforms each feature so that it has a mean ( $\mu$ ) of zero and a standard deviation ( $\sigma$ ) of one. This is accomplished by subtracting the feature's mean from each observation and dividing the result by the feature's standard deviation, so that features measured on different scales contribute comparably to the model.
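Written out, the standardization described above takes the following form. The indices $i$ (observation) and $j$ (feature) and the sample size $n$ are notation introduced here for illustration. The standard deviation is the population form (division by $n$), which is the convention scikit-learn's StandardScaler uses.

```latex
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad
\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_{ij} - \mu_j \right)^2}
```

For example, a wind speed of 9 m/s in a feature with $\mu_j = 7$ m/s and $\sigma_j = 2$ m/s maps to $z = (9 - 7)/2 = 1$.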

The main advantage of this technique is that it improves the convergence and efficiency of machine learning algorithms, especially those sensitive to feature scales, such as SVM and neural networks. Moreover, it facilitates the interpretation of coefficients in linear models. Its main disadvantage is sensitivity to outliers, as the mean and standard deviation can be significantly affected by extreme values. Standardization also does not change the shape of a distribution, so it does not make non-normal data normal and is less suitable for strongly skewed features. For the case under study, other normalization methods were also considered, specifically RobustScaler and MinMaxScaler, both from scikit-learn [40]. In preliminary tests, the three methods demonstrated quite similar performance, with StandardScaler slightly outperforming the other two.

Additionally, Principal Component Analysis (PCA) was applied. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The first principal component has the highest variance and accounts for as much of the variability in the data as possible. Each subsequent component is orthogonal to the previous ones and accounts for as much of the remaining variability as possible. Keeping only the leading components simplifies the data and speeds up the computation without significant loss of information. In this study, PCA was configured to retain the principal components that together represent 95% of the total variance of the data, reducing dimensionality while retaining most of the original information.
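A minimal sketch of how scaling, PCA, and grid search can be chained in scikit-learn is shown below. It is not the authors' code. The data are synthetic, the KNN parameter grid is hypothetical (the grids actually used are summarized in TABLE 5), and scoring the search by balanced accuracy is an assumption made here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one turbine's dataset: 5 features, imbalanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)  # 1 = hypothetical "shutdown" class

pipeline = Pipeline([
    ("scale", StandardScaler()),      # zero mean, unit variance per feature
    ("pca", PCA(n_components=0.95)),  # keep components explaining 95% of the variance
    ("clf", KNeighborsClassifier()),
])

# Hypothetical grid, for illustration only.
param_grid = {
    "clf__n_neighbors": [3, 5, 7],
    "clf__weights": ["uniform", "distance"],
}

# cv=5 gives five-fold cross-validation; the scoring choice is an assumption.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="balanced_accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Placing the scaler and PCA inside the pipeline means they are refitted on the training part of every fold, so no information from the held-out fold leaks into preprocessing.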

3) Application of Rough Sets Methodology

For the implementation of the Rough Sets approach in analyzing wind turbine data, an essential step involves transforming quantitative data into categories based on specific domain knowledge. Wind speed data were divided into four distinct categories [41], reflecting different wind speed ranges, as follows:

Class A: Wind Speed $\le 5.5$ m/s (light wind conditions)
Class B: 5.5 m/s < Wind Speed $\le 12$ m/s (moderate wind conditions)
Class C: 12 m/s < Wind Speed $\le 20.5$ m/s (strong wind conditions)
Class D: 20.5 m/s < Wind Speed $\le 25$ m/s (very strong wind conditions)

The standard deviation of wind speed is categorized to reflect wind speed variation, divided into low, medium, and high variation:

Low variation: Deviation $\le 0.5$ m/s
Medium variation: 0.5 m/s < Deviation <1 m/s
High variation: Deviation $\ge 1$ m/s

The turbulence index, here defined as the ratio of the standard deviation of the wind speed samples to the mean wind speed, is classified into three distinct levels following [42], representing low, medium, and high turbulence conditions, which assists in assessing the operational conditions of the wind turbine:

Low turbulence: Index <0.08
Medium turbulence: $0.08\le$ Index <0.12
High turbulence: Index $\ge 0.12$

The wind direction is categorized into eight principal directions, according to the standard compass rose, each representing a specific degree interval. This categorization is important for identifying the predominant wind direction and its variations:

N: $0\le$ Direction <45 or Direction =360
NE: $45\le$ Direction <90
E: $90\le$ Direction <135
SE: $135\le$ Direction <180
S: $180\le$ Direction <225
SW: $225\le$ Direction <270
W: $270\le$ Direction <315
NW: $315\le$ Direction <360

Similarly, the standard deviation of wind direction is divided into three categories to represent the angular variation of wind direction, which is crucial for understanding the stability of wind conditions:

Low angular variation: Deviation <4°
Medium angular variation: $4^{\circ } \le$ Deviation <7°
High angular variation: Deviation $\ge 7^{\circ }$

These transformations enable the effective application of the Rough Sets method by providing a clear basis for analyzing and comparing the different operational states and environmental conditions of the wind turbines, thereby facilitating the identification of relevant patterns for predictive maintenance and performance optimization.
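The five categorizations listed above can be sketched as plain functions. The thresholds are those given in the text, while the function names, the record field names, and the handling of out-of-range values are assumptions made for this example. The turbulence index is computed here as standard deviation divided by mean wind speed, the usual turbulence intensity definition, which is what thresholds of 0.08 and 0.12 imply.

```python
def speed_class(v):
    """Wind speed class A-D for v in m/s; None outside 0-25 m/s (assumption)."""
    if v < 0 or v > 25:
        return None
    if v <= 5.5:
        return "A"
    if v <= 12:
        return "B"
    if v <= 20.5:
        return "C"
    return "D"


def speed_variation(std):
    """Category of the wind speed standard deviation (m/s)."""
    if std <= 0.5:
        return "low"
    if std < 1:
        return "medium"
    return "high"


def turbulence_level(speed, std):
    """Category of the turbulence index std / speed; None if speed is not positive."""
    if speed <= 0:
        return None
    index = std / speed
    if index < 0.08:
        return "low"
    if index < 0.12:
        return "medium"
    return "high"


def direction_sector(deg):
    """One of eight 45-degree sectors starting at 0 degrees; 360 maps to N."""
    sectors = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return sectors[int((deg % 360) // 45)]


def direction_variation(std_deg):
    """Category of the wind direction standard deviation (degrees)."""
    if std_deg < 4:
        return "low"
    if std_deg < 7:
        return "medium"
    return "high"


def discretize(record):
    """Turn one numeric wind record into the categorical attributes of a decision table."""
    return {
        "speed_class": speed_class(record["speed"]),
        "speed_variation": speed_variation(record["speed_std"]),
        "turbulence": turbulence_level(record["speed"], record["speed_std"]),
        "direction": direction_sector(record["direction"]),
        "direction_variation": direction_variation(record["direction_std"]),
    }


# Hypothetical record for illustration.
row = discretize({"speed": 7.2, "speed_std": 0.6, "direction": 95.0, "direction_std": 5.0})
print(row)
```

Each categorical row produced this way, paired with the recorded error type, corresponds to one line of a decision table.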

The application of the Rough Sets method to the data in question was supported by the ROSETTA software [43], a tool specifically developed for pattern recognition and data mining within the framework of Rough Set Theory. Unlike most machine learning methods, rule-based models are easily readable and can, therefore, be used to identify and understand patterns in the data and produce predictions. The rules generated by ROSETTA take the format “IF premise_1 AND premise_2 AND …AND premise_n THEN conclusion_1 OR conclusion_2 OR …OR conclusion_m”, where n is the number of premises, limited to the number of variables; m is the number of conclusions, limited to the number of existing classes.

In the context of this work, two outputs from this application are particularly relevant: the rules and the performance measures. Each rule produced by the Rough Sets method is accompanied by a set of parameters that can be used to qualify the rule under certain aspects:

LHS support: Number of objects in the training set that align with the IF part of the rule;
RHS support: Number of objects in the training set that align with both the IF part and the THEN part of the rule at the same time (LHS and RHS support will be the same unless the THEN part contains multiple decisions);
RHS Accuracy: RHS support divided by LHS support (Accuracy will be 1.0 unless the THEN part contains multiple decisions);
LHS Coverage: LHS support divided by the number of objects in the training set;
RHS Coverage: RHS support divided by the number of objects of the decision class listed in the THEN part of the rule;
LHS Length: Number of attributes in the IF part of the rule;
RHS Length: Number of decisions in the THEN part of the rule.
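The rule parameters listed above can be computed directly from a decision table. The sketch below is not ROSETTA's implementation: it is a simplified illustration restricted to rules with a single decision in the THEN part, and the toy table, attribute names, and function name `rule_metrics` are hypothetical.

```python
def rule_metrics(table, premises, conclusion, decision_attr="error"):
    """Quality measures for the rule: IF premises THEN decision_attr == conclusion.

    table: list of dicts, one per object in the training set
    premises: dict mapping attribute -> required value (the IF part)
    conclusion: a single decision value (the THEN part)
    """
    lhs = [r for r in table if all(r[a] == v for a, v in premises.items())]
    rhs = [r for r in lhs if r[decision_attr] == conclusion]
    in_class = [r for r in table if r[decision_attr] == conclusion]
    return {
        "lhs_support": len(lhs),
        "rhs_support": len(rhs),
        "rhs_accuracy": len(rhs) / len(lhs) if lhs else 0.0,
        "lhs_coverage": len(lhs) / len(table),
        "rhs_coverage": len(rhs) / len(in_class) if in_class else 0.0,
        "lhs_length": len(premises),
    }


# Hypothetical decision table with two condition attributes.
table = [
    {"speed_class": "A", "turbulence": "low", "error": "Pitch"},
    {"speed_class": "A", "turbulence": "low", "error": "Pitch"},
    {"speed_class": "A", "turbulence": "low", "error": "Converter"},
    {"speed_class": "B", "turbulence": "medium", "error": "Acceleration"},
    {"speed_class": "C", "turbulence": "high", "error": "Pitch"},
]

metrics = rule_metrics(table, {"speed_class": "A", "turbulence": "low"}, "Pitch")
print(metrics)
```

In this toy table the IF part matches three objects, two of which are Pitch errors, so the rule's accuracy is 2/3 and it covers two of the three Pitch events.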

The decision table shown in TABLE 6 is derived from the transformed dataset, now structured according to the Rough Sets methodology. This format facilitates the application of decision rules based on categorically defined conditions. Each row represents a specific case with the categorization of wind direction, wind speed class, and other relevant attributes, paired with the type of error triggered. This structured approach allows for a clearer analysis of how different conditions lead to specific operational issues. The decision rules derived from Rough Set Theory are obtained by analyzing recurring patterns in historical turbine failure data. The methodology involves categorizing environmental conditions (such as wind speed, turbulence index, and direction) into discrete classes, allowing for a systematic comparison between different operating scenarios. By identifying combinations of these variables that frequently precede specific failure modes, the method extracts rules in the format ‘IF wind condition X AND turbulence Y THEN failure Z.’ These rules serve as an interpretable framework for predicting and preventing failures, offering valuable insights for turbine operation adjustments.

TABLE 6 Decision Table, Formed From the Transformation of Original Data

E. Evaluation

For machine learning (ML) models based on classification, a variety of performance metrics can be utilized to assess the effectiveness of the employed models. These include accuracy, balanced accuracy, precision, sensitivity (or recall), F1 Score, among others. Given the characteristics of the problem presented, as detailed earlier, the decision was made to choose Balanced Accuracy as the metric due to its ability to provide a balanced evaluation in scenarios with unequal class distributions.
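To illustrate why balanced accuracy suits imbalanced data, the sketch below compares it with plain accuracy for a classifier that never predicts the rare class. Balanced accuracy is taken here as the mean of sensitivity and specificity for one class against all others. The labels and counts are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred, positive):
    """Mean of sensitivity and specificity for `positive` versus every other label.

    Assumes y_true contains at least one positive and one non-positive example.
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced sample: 5 pitch errors among 100 records.
y_true = ["Pitch"] * 5 + ["Normal"] * 95
always_normal = ["Normal"] * 100  # a classifier that never predicts the rare class

print(accuracy(y_true, always_normal))
print(balanced_accuracy(y_true, always_normal, "Pitch"))
```

The trivial classifier looks strong under plain accuracy because of the dominant class, yet its balanced accuracy is no better than chance, which is the behavior the metric is chosen to expose.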

Additionally, the “one-vs-all” strategy was adopted to optimize the performance of the employed machine learning models. This technique allows for the individual treatment of each class against all others, facilitating the management of imbalances between categories. Figure 3 presents a series of box plot graphs demonstrating the performance of the ML algorithms used in predicting shutdowns in wind turbines, for each type of error associated, considering the 19 wind turbines of the Praias de Parajuru Wind Farm.

FIGURE 3.

Box plots demonstrating the performance of the ML algorithms used in predicting shutdowns in wind turbines, for each associated error type, considering the 19 wind turbines of the Praias de Parajuru Wind Farm.

From Figure 3 it can be observed that:

Converter Error: The K Nearest Neighbours (KNN) model shows the highest median balanced accuracy, indicating generally superior performance in predicting this category of failure. The Multi-layer Perceptron (MLP) model also displays a good median accuracy but with greater variation, suggesting slightly less consistency.
Acceleration Error: The Gradient Boosting (GB), Multi-layer Perceptron (MLP), Naive Bayes (NB), and Support Vector Machine (SVM) models showed similar performances, with relatively low variability.
Pitch System Error: In this category, the K Nearest Neighbours (KNN) model performed the best among the considered models, although its median is only slightly above 0.5, which does not represent a good result in practice.

In summary, the K Nearest Neighbours (KNN) and Multi-layer Perceptron (MLP) models prove to be more efficient from the perspective of balanced accuracy for the type of problem presented. However, it is worth noting that the choice of the best model may depend on other factors, such as model complexity, training time, and interpretability, which are not directly addressed by balanced accuracy.

It is also necessary to discuss the existence of results where Balanced Accuracy resulted in 0.5. In a binary classification scenario, this means that the model is making correct predictions at a level equivalent to that of a random classifier, that is, it does not have a discriminative capacity between classes better than chance. The suboptimal performance of ML models in predicting certain types of failures in wind turbines can be attributed to the complex nature of the failures, which are not always directly correlated with the operational variables used, such as wind direction, wind speed, and their respective variations (variables provided as input to the models). For example, internal mechanical failures, such as defects in bearings or alignment problems in the drive train, may be more related to the maintenance history and the age of the equipment than to immediate meteorological or operational variables. Similarly, electrical failures may be more connected to fluctuations in the electrical grid or defects in components that are not immediately evident from standard operational data. The absence of input variables that directly capture the condition and wear of critical components, or even aspects such as the quality of installation and maintenance, may limit the models’ ability to accurately predict such failures. Therefore, it is important to recognize that the inclusion of maintenance data, history of previous failures, and equipment condition analyses may be necessary to improve the performance of predictive models in scenarios of wind turbine condition monitoring when this is the goal of the analysis.

A threshold for Balanced Accuracy needs to be established above which it can be inferred that the input variables adequately explain the anomalies observed in wind turbines. At the authors’ discretion, results above 0.70 are considered reasonable for predicting both classes; above 0.80, the model is deemed effective in classifying instances in both classes; and results above 0.90 are considered excellent.

TABLE 7 therefore highlights the events that were adequately predicted by the ML models considered in this work. Pragmatically, it reveals that despite the differences between the ML models applied to various types of errors in wind turbines, the variations in balanced accuracy are relatively small. This suggests that, in practice, the efficacy of the different models is quite similar, indicating a certain uniformity in the performance of these models in varied contexts. This uniformity may indicate that the ML algorithms are capturing the same underlying trends in the data, regardless of their specific approaches. This is good news in terms of flexibility and robustness, as it suggests that engineers and data scientists can choose from a range of models without significantly compromising accuracy.

TABLE 7 Categories of Anomalies That Achieved Adequate Balanced Accuracy Results, With the Respective ML Models

The patterns and trends identified reflect the complexity and operational peculiarities of wind turbines. These patterns, highlighted by the precision and balance of performance metrics, provide a basis for the next phase of the investigation, where insights gained will be deepened through the Rough Sets method.

This transition is essential, as it allows the findings derived from machine learning models to be integrated and interpreted in a broader context of operational analysis and decision-making. Rough Sets, with their ability to handle uncertainties and imprecisions, enable a deeper exploration of the critical conditions that lead to undesirable shutdowns and failures of wind turbines. Thus, the relationship between the evaluation of the models and the implementation of the Rough Sets method is not just methodological but also conceptual, promoting a holistic approach to understanding and mitigating the challenges faced in the operation of wind farms.

F. Deployment

The application of the Rough Sets model to the data processed by machine learning models reveals significant patterns through association rules. These rules highlight specific wind conditions that are frequently associated with particular failures in wind turbines.

For the purposes of this work, greater attention should be given to the rules whose events are considered undesirable from the perspective of wind turbine maintenance experts, as shown in TABLE 8. Specifically, operational interruptions associated with ‘Pitch’ and ‘Converter’ functions are not justifiable under light wind conditions, which are characterized by low speeds, minimal variations, and reduced turbulence. Similarly, stops due to ‘Acceleration’ are seen as inappropriate in scenarios of medium or low wind intensity, accompanied by equally medium to low variations and turbulences.

TABLE 8 Comparative Analysis of Wind Conditions Associated With Undesirable Shutdowns in Different Categories of Wind Turbine Errors, With Emphasis on Turbines 09, 10, and 11, Which Showed Higher Values in LHS Support and LHS Coverage, Indicating Greater Recurrence and Significance for the Identified Failures

The root causes of turbine failures can be broadly classified into three categories: structural, environmental, and operational. Structural issues arise from design limitations, material fatigue, and assembly imperfections, as seen in turbine #11, where tower deformation increased susceptibility to vibrations. Environmental factors, such as wind turbulence and rapid directional changes, contribute to unexpected loads on the nacelle and rotor, potentially triggering unnecessary protection mechanisms. Operational factors include component wear, sensor inaccuracies, and control system misconfigurations, as observed in turbine #10, where defective encoders led to pitch system malfunctions. Understanding these underlying causes allows for targeted corrective actions, improving both turbine reliability and maintenance efficiency.

The practical implications of these decision rules extend directly to wind farm maintenance and operation strategies. By implementing these rules, operators can fine-tune protection thresholds to minimize unnecessary shutdowns while maintaining safety. For example, if a specific turbulence level is identified as a recurrent trigger for nacelle acceleration errors, a temporary power reduction strategy can be adopted to mitigate the issue without requiring a full shutdown. These insights support data-driven maintenance strategies, reducing downtime and enhancing turbine reliability.

These specific conditions point to a substantial opportunity to investigate the internal causes that promote such undesirable interruptions. A detailed analysis followed by appropriate interventions in the protection and control systems of the wind turbines can mitigate these failures. Consequently, it is anticipated that such actions may result in a significant increase in the availability of these assets. This increase in availability not only improves the operational efficiency of wind turbines but also enhances the reliability of the wind energy system as a whole.

In the context of implementing corrective measures directed at wind turbines 9, 10, and 11, which showed higher frequency and significance of failures, several specific strategies were adopted.

For wind turbine #11, it was confirmed that a deformation in one of the sections during the tower assembly increased the equipment’s susceptibility to vibration issues. In response, a new logic for nacelle acceleration protection was implemented. This logic aims to mitigate failures related to nacelle acceleration, although the wind turbine may continue to experience a significant number of failures due to the structural problem in one of its sections. Additionally, a structural study is underway to assess the condition of the tower, focusing on integrity and the implementation of possible reinforcements to increase its rigidity.

Wind turbine #09, which did not present specific known causes for the failures, also had changes made to its nacelle acceleration protection logic. The new logic reduces the instantaneous power for one minute when excessive vibration is detected, aiming to attenuate the acceleration to values below 0.03 g. This action is intended to reduce the oscillation of the structure and prevent the wind turbine from shutting down. This same logic was extended to all wind turbines in the fleet, seeking a collective benefit.
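The power-reduction logic can be pictured with the simplified sketch below. It is not the controller deployed at the wind farm. The one-minute duration and the 0.03 g value come from the text, but using 0.03 g as the trigger level, the 20% reduction, the 1000 kW rating, and the one-second sampling step are all assumptions made for illustration.

```python
ACCEL_LIMIT_G = 0.03   # value quoted in the text; used here as the trigger (assumption)
DERATE_SECONDS = 60    # "one minute" in the text
DERATE_FACTOR = 0.8    # hypothetical: the text does not state the reduction amount


def power_setpoints(accel_g, rated_kw, step_s=1):
    """Return one power setpoint per nacelle acceleration sample (in g)."""
    remaining = 0
    setpoints = []
    for a in accel_g:
        if a > ACCEL_LIMIT_G:
            remaining = DERATE_SECONDS  # start or restart the one-minute window
        if remaining > 0:
            setpoints.append(rated_kw * DERATE_FACTOR)
            remaining -= step_s
        else:
            setpoints.append(rated_kw)
    return setpoints


# Hypothetical trace sampled once per second: one vibration spike, then calm.
trace = [0.01, 0.04] + [0.01] * 61
setpoints = power_setpoints(trace, rated_kw=1000)
print(setpoints[0], setpoints[1], setpoints[60], setpoints[61])
```

The design intent is that the turbine keeps producing at reduced power while the vibration decays, instead of tripping the acceleration protection and shutting down.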

For wind turbine #10, two defects were reported that occurred intermittently. Additionally, two faulty encoders were discovered and replaced immediately. After these interventions, it was observed that there were no more pitch protection actuations, and the failures continue to be monitored to ensure that the problems have been effectively resolved.

SECTION IV.

Conclusion

This study presented a detailed analysis of the correlations between wind turbine failures and environmental conditions. Using data science techniques, it was possible to identify significant patterns that contribute to optimizing the maintenance and operational efficiency of wind turbines. The results demonstrate the feasibility and importance of applying advanced data analyses in the wind energy sector. Moreover, this study paves the way for future research and practical applications aimed at continuous improvement in the operation and maintenance of wind turbines.

The application of Data Science techniques in the operation and maintenance of wind turbines represents the state of the art in the field of renewable energies. This approach is crucial due to the accelerated growth of renewable energy sources, which generate volumes of data beyond human processing capacity. Efficient analysis of these data is essential to promote sustainable development aligned with ESG (Environmental, Social, and Governance) principles. This emerging field is constantly evolving, and continuous learning in this area is fundamental to advancing sustainability and energy efficiency.

During development, it was found that the machine learning models used revealed important patterns from the provided data, contributing to optimizing the maintenance and operational efficiency of some wind turbines. Some anomalies detected could not be fully explained based on the available information, suggesting the need for additional data or other analytical approaches for a deeper understanding of these phenomena. This limitation highlights the complexity of wind turbine systems, where operational and structural factors, in addition to environmental conditions, may influence failure behavior. Future research should consider incorporating complementary data, such as electrical variables (current, voltage, power quality) and detailed maintenance records, to provide a more comprehensive understanding of failure dynamics. However, this perception corroborates the objective of this work, which is the identification and characterization of the influence of environmental variables (especially related to wind) on the occurrence of failures and protection actuations in wind turbines.

The application of the Rough Sets method allowed for a deeper analytical view of the events that occurred under wind conditions that supposedly should not have triggered protection actuations. Since the prediction methods produced satisfactory results in terms of balanced accuracy, the combination with the Rough Sets method brought even greater robustness to the diagnosis of failures and the consequent guidance for adjustments and corrections, aiming to increase the reliability of wind turbine operation and maintenance.

Furthermore, the application of Explainable Artificial Intelligence (XAI) techniques is recommended to enhance the interpretability of machine learning models, enabling a clearer identification of the environmental variables that most influence the observed failures. These future initiatives aim to complement the current findings, promoting a more holistic analysis of wind turbine behavior under various operational contexts.

In conclusion, this work has made a significant advance in the application of data science to optimize the operation and maintenance of wind turbines, contributing not only to the increase of operational results of the ventures but also establishing a path for future innovations in the use of ML methodologies in the field of renewable energies.

ACKNOWLEDGMENT

The authors would like to thank CAPES for the financial support and the opportunity to conduct this research.
