Operation and Maintenance Decision Support System for Photovoltaic Systems

Operation and maintenance (O&M) and monitoring strategies are important for safeguarding optimum photovoltaic (PV) performance while also minimizing downtimes due to faults. An O&M decision support system (DSS) was developed in this work for providing recommendations of actionable decisions to resolve fault and performance loss events. The proposed DSS operates entirely on raw field measurements and incorporates technical asset and financial management features. Historical measurements from a large-scale PV system installed in Greece were used for the benchmarking procedure. The results demonstrated the financial benefits of performing mitigation actions in case of near zero power production incidents. Stochastic simulations that consider component malfunctions and failures exhibited a net economic gain of approximately 4.17 €/kW/year when performing O&M actions. For an electricity price of 59.98 €/MWh, a minimum of 8.4% energy loss per year is required for offsetting the annualized O&M cost value of 7.45 €/kW/year calculated by the SunSpec/National Renewable Energy Laboratory (NREL) PV O&M Cost Model.


I. INTRODUCTION
Solar photovoltaic (PV) technology is set to become the dominant source of electricity generation worldwide and a key foundation of future power systems [1]. In this domain, scaling up of cost-effective electricity from solar technologies is crucial for the decarbonization and transformation of the electricity sector. A key enabling factor for the future uptake and enhancement of the PV technological value chain is the reduction of the levelized cost of electricity (LCoE).
The associate editor coordinating the review of this manuscript and approving it for publication was Zhiwei Gao.
One way of achieving this is by improving the lifetime performance and optimizing monitoring and operation and maintenance (O&M) strategies [2]. In this context, the key technical solutions that support high plant performance rely on intelligent data analytic methods that provide real-time monitoring and automated diagnostics. To this end, monitoring systems enhanced with automated data-driven features (such as remote failure detection, fault prediction and maintenance strategies) can assist in improving reliability and safeguarding optimal PV performance by intelligently resolving power reduction issues [2].
VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 1. Evolution of data-driven approaches towards prognostic and more advanced analytics for reducing operational costs. Figure obtained from Wood Mackenzie [3].
Traditional solar monitoring and O&M approaches include the implementation of descriptive analytics and diagnostics, while state-of-the-art methods focus on prognostics (i.e., predictive and prescriptive analytics) [3]. During this era, automated and digital solutions are deployed to improve the PV performance and transform the maintenance services. Current research activities are thus beginning to focus on utilizing more advanced and complex analytics and procedures to optimize the O&M activities (see Fig. 1) [4].
Since solar plants continuously generate large amounts of data, data-driven methods (that extract value from data, see Fig. 2) are becoming more valuable for day-to-day monitoring, O&M management, and reporting [5]. Interpreting the raw data efficiently can provide meaningful information about different failures/losses and their root causes. Following such insights, PV plant owners can take decisions and perform maintenance actions. Optimized O&M actions are valuable for ensuring quality of operation (e.g., maximizing the plant output power and minimizing loss of energy generation) and substantially increasing the PV reliability, which in turn improves revenues and hence reduces the LCoE.
A recent industry benchmark study [6] demonstrated that the average recoverable energy of a PV plant is 5.27% (equivalent to a potential recoverable income of 10,000 $/MW/year) when performing optimized O&M strategies. The percentage value of 5.27% represents the average energy that could be recovered for a typical PV system (of 16.1 MWp) if the detected underperformance incidents were resolved, and not the total amount of energy loss per year. The same study indicated that the global PV industry could be losing $14.5 billion each year by 2024 if not executing an O&M strategy [6].
strategies are periodically planned according to a specific maintenance plan. In some cases, such as in soiling mitigation, preventive maintenance can be put in place proactively, i.e., even before the system is in operation, by, for example, including anti-soiling coatings or optimizing the PV module or system design to minimize the accumulation of dust [8].
Condition-based maintenance involves the extraction of real-time information from the monitored data and forecasts to schedule and optimize maintenance activities (e.g., schedule cleaning events and snow removal, detect potential failures at an early stage or before occurrence). These maintenance activities are mainly affected by environmental (e.g., soiling, snow) and extreme weather conditions (e.g., hurricanes and tornadoes). If analysts were able to predict the soiling and snow losses and their seasonality from widely available environmental data (e.g., particulate matter and rainfall), it would be possible to estimate the loss and optimize the O&M schedule in advance [9], [10].
With regard to soiling mitigation, it is commonly performed in a corrective rather than predictive manner, based on the values of the monitored losses. The simplest methodology for taking an O&M decision and determining the cleaning event was proposed by Cristaldi et al. [11], where the cleaning was performed once the financial loss due to soiling was higher than the cleaning cost. To support the O&M teams in the optimization procedure of cleaning schedules, numerous economic models (based on the LCoE, Net Present Value, etc.) have been proposed in the literature [11]-[13] that consider several input parameters such as the cleaning cost, the soiling rate, the plant size, etc.
Corrective maintenance involves actions and/or techniques taken to correct/repair failures, malfunctions, or damages detected by remote monitoring or during regular inspections [2]. Such actions are unscheduled maintenance and are required to repair the detected issues (i.e., replace failed components) and restore the PV system back to normal operation. Corrective maintenance actions take place after a failure is detected, which requires the application of fault diagnosis tools (i.e., algorithms for detecting and classifying faults). In such cases, two durations are considered [2]: the response time, which is the sum of the acknowledgment and intervention times (i.e., the time to detect and acknowledge the fault, inform the technicians, find suitable replacement equipment and finally the time for a service technician or a subcontractor to reach the plant), and the resolution (or repair) time (i.e., the time on site to resolve the incident, starting from the moment of reaching the PV plant).
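The loss-versus-cost cleaning rule described above can be sketched in a few lines. This is a minimal illustration only: it assumes the "financial loss due to soiling" is accumulated day by day since the last cleaning, and the function and parameter names are hypothetical rather than taken from [11].

```python
def first_cleaning_day(daily_soiling_loss_eur, cleaning_cost_eur):
    """Return the 0-based day on which the accumulated soiling revenue loss
    first exceeds the cleaning cost, or None if it never does.

    Assumption: losses are accumulated since the last cleaning event.
    """
    accumulated = 0.0
    for day, loss in enumerate(daily_soiling_loss_eur):
        accumulated += loss
        if accumulated > cleaning_cost_eur:
            return day
    return None


# Hypothetical daily losses (EUR) against a hypothetical 50 EUR cleaning cost.
trigger_day = first_cleaning_day([10.0, 20.0, 30.0, 40.0], 50.0)
```

More elaborate economic models (LCoE- or NPV-based) replace the simple comparison with a full cash-flow evaluation, but the triggering logic has the same shape.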
Corrective (or repair) actions must be prioritized and scheduled by an analytical method that evaluates the criticality, economic and technical impact of incidents. In this domain, four main methods have been proposed in the literature to minimize the potential loss of energy generation: a) Failure Mode, Effects and Criticality Analysis (FMECA) [14], [15], b) Multi Criteria Decision Analysis (MCDA) [16], c) Reliability, Availability, and Maintainability (RAM) analysis [16] and d) Cost Priority Number (CPN) [17]. When the PV plant (or part of it) needs to be taken offline for the execution of corrective actions, night time or low irradiation hours are considered to be the best practice for minimizing the energy loss [2].
For detecting faults in PV systems, several techniques have been proposed in the literature [18]-[22]. In general, fault diagnostic methods for PV systems are based on visual inspections, image processing and data analytic (including signal processing) techniques [18]. Failure diagnosis based on data analytic methods is becoming increasingly popular since such methods can offer real-time health-state monitoring. These failure diagnosis methods operate on the recorded weather and operational data (such as power, voltage and current) [19] and therefore do not require any additional hardware installation or labor cost. They allow remote, automated and real-time monitoring, providing insights and recommendations in case of PV underperformance issues. In principle, comparative, statistical and/or data-driven (e.g., artificial intelligence) analyses are performed, yielding useful information on the health-state and operation of the system. A number of failures including inverter shutdown, mismatch faults (partial shading), open- and short-circuit faults, line-to-line faults, string disconnections and bypass diode faults were reliably identified by such data analytic methods [20]-[22]. It is worth noting here that there is no consensus with respect to the most common technical issues that affect the PV plant power production. For example, according to [6] (a study conducted in 2021 using utility-scale solar plants data with a total capacity of 1.2 GWp), most occurrences (∼80%) are PV module related problems. Similarly, a 2017 study [17] (conducted using data from PV plants of around 442 MWp nominal capacity) found that 63% of the detected failure cases were due to PV module failures. On the other hand, other studies [14], [15], [23], [24] find inverters to be the most vulnerable components in a PV plant. However, this might be due to inverter-level (or even AC-only) monitoring, which can only "see" inverter failures and is essentially blind at the PV module or string levels.
Even though a lot of work has been performed in the area of fault detection [18]-[22], technical risks quantification [14]-[17], [24], maintenance strategies [2], [7], [25] and commercialization of diagnostic tools for PV power plants [6], [26], [27], very few published articles [16], [19], [28], [29] are concerned with the development and description of automated monitoring systems that allow plant owners/operators to maximize energy production, reduce operational costs and improve reliability. In this research field, the PV industry is still facing challenges in demonstrating the effectiveness of decision support system (DSS) platforms capable of generating specific action recommendations to optimize O&M activities. Studies from other disciplines (e.g., agriculture) and other research fields in the power sector (e.g., wind) have already demonstrated the importance of DSS structures for strategic maintenance planning [27].
Recently, Herz et al. [16] published a report based on a cost-benefit analysis to derive the best mitigation strategy from a technical and financial perspective. This approach utilized the CPN methodology for calculating the cost of individual entries in the ticketing system of a PV plant and then prioritizing decisions and providing results for risk managing strategies. The benchmarking results using three case studies (inverter failures, a plant with PV modules affected by potential induced degradation and soiled PV modules) demonstrated the need for an automated and time-efficient solution for extracting key parameters from maintenance tickets and the lack of a standardized methodology for failure categorization. For a 10 MW PV plant affected by potential induced degradation (PID), the analysis revealed that the project's 20-year financial profit was 48% below expectations. It was concluded that mitigation options such as PID-boxes and/or replacing very low performing PV modules should be preferred over "no actions". Regarding cleaning routines for PV systems in desert regions, the results showed that if no cleaning (natural cleaning) was performed for the 10 MWp PV plant in Abu Dhabi (exhibiting a soiling loss rate of 0.3%/day with two significant precipitation events over a year), soiling losses reached up to 30% per year and resulted in an annual yield loss of $2,614,000. In case of monthly cleaning events, the soiling losses were reduced to 4% (resulting in reduced yield losses of $377,000 with an annual cost of $497,000 for cleaning services). The best economic cleaning measure was achieved when performing "triggered cleaning" at a soiling loss of 5%. In case of triggered cleanings, the soiling losses were reduced to 2.3% (resulting in reduced yield losses of $212,000 with an annual cleaning cost of $200,000).
Similarly, machine learning approaches for fault detection along with the CPN approach were integrated into a digital asset management system to prioritize maintenance activities in PV systems [28]. This work is ongoing (the DSS was not fully developed) and the results demonstrated the effectiveness of the proposed fault detection algorithm for detecting faulty PV operation. Likewise, the work presented in [29] was carried out to better understand the CPN methodology and to evaluate its applicability in routine operations of a large O&M operator. The results demonstrated a CPN value of up to 0.487 €/kWp/year when the inverter was off.
In our previous work [19], a DSS for corrective maintenance in large-scale PV systems was developed. The results demonstrated the financial benefits of performing corrective actions in case of critical failures. Reduced response and resolution times of corrective actions improved the PV power production of the test PV plant by 1.65% over a 30-month period. The obtained results showed that even for 1% energy yield improvement, the implementation of an automated DSS was recommended for PV plants with capacities greater than 250 kWp.
To address the aforementioned shortcomings, this paper extends upon the work presented in [19], by integrating additional functionalities for failure/loss categorization and by considering all three levels of maintenance, enabling preventive, predictive and corrective capabilities. The complete procedure for the development of the DSS (along with an extensive description of the incorporated data analytic functionalities) is thus provided in this work, in an attempt to minimize the cost and energy impact of underperformance incidents in PV systems. The DSS operates entirely on acquired raw field measurements and utilizes an automated data-driven diagnostic architecture for maximizing the PV energy yield. The proposed DSS incorporates technical asset and financial management features for remote and real-time failure detection and provides suggestions for maintenance actions to resolve the PV underperformance issues. Examples of the DSS functionalities are presented along with an economic assessment of mitigation measures. In this context, the impact of O&M actions in terms of recoverable energy and revenue was assessed to determine whether it is beneficial to perform (or not) the maintenance actions suggested by the DSS. The proposed DSS was benchmarked using historical inverter level data from a PV power plant installed in Larissa, Greece. To enable a stronger verification and benchmarking of the DSS architecture, approximately 6 years of data from the test PV power plant were used.

A. DECISION SUPPORT SYSTEM ARCHITECTURE
The DSS structure, illustrated in Fig. 3, operates on time series of meteorological and electrical measurements. The DSS automatically cleans and analyzes the PV operational and weather data to calculate meaningful metrics, extract information on health-state condition, produce insightful results, and recommend specific actions to improve system performance.
Data Quality Routines (DQRs) are initially applied for filtering out invalid data, while a power predictive model is used for simulating the PV performance in the absence of fault/loss conditions. Failure Detection Algorithms (FDAs) are then used for detecting failures, while Trend-based Loss Routines (TLRs) are used for detecting performance losses. Energy loss estimation and criticality evaluation is then performed by using the FMECA analytical method. Finally, suggestions for maintenance actions are generated by the Maintenance Strategies Routines (MSRs). More details are given in the following subsections.

B. DATA QUALITY ROUTINES (DQRs)
The DQRs process comprises 8 sequential steps: initial data statistics (Step 1), consistency examination (Step 2a), identification of technical availability (Step 2b, additional checks added to Step 2 for failure detection [19]), data filtering (Step 3), detection of invalid (missing and erroneous) data (Step 4), determination of missing data mechanism and rate (Step 5), treatment of invalid values and dataset reconstruction (Step 6), data aggregation (Step 7) and final data statistics (Step 8) [30]. Detailed information about the data cleansing methodology can be found in Livera et al. [30]. The goals of the DQRs process are: a) to identify and remove invalid data points of the train set before simulating the PV plant performance, b) to derive information about system health-state condition and c) to provide insights about possible data and technical (performance) issues (e.g., communication loss problems, data storage and synchronization issues, sensors' faulty operation, PV system outages/downtimes, interruptions for maintenance reasons, grid failures, etc.).

C. PV SYSTEM SIMULATION MODEL
A machine learning (ML) predictive model was used to predict the DC power of the test PV system by leveraging the eXtreme Gradient Boosting (XGBoost) algorithm [31]. The ML model was selected due to its high prediction accuracy even when trained on low fractions of on-site data and minimal features [31].
The ML model was trained based on a 10:90% train and test set approach. The train set (i.e., data covering a period of approximately 6 months) was used for the model's training process and contained fault-free data. The test set (the rest of the dataset, 63 months of data) contained both faulted and fault-free periods and was used for assessing the model's performance. The goodness of the model's fit was evaluated using the coefficient of determination (R²) and the mean absolute percentage error (MAPE) [32]. The predictive model was then used as a reference model in the FDAs procedure.
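For readers who want to reproduce the two goodness-of-fit metrics, the following sketch uses their textbook definitions. It is an assumption that these match the exact variants used in [32] (for instance, some PV studies normalize the error by nominal power instead of the measured value); the function names are illustrative.

```python
def r_squared(measured, predicted):
    """Coefficient of determination between measured and predicted power."""
    mean_meas = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_meas) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def mape_percent(measured, predicted):
    """Mean absolute percentage error, in %.

    Assumption: points with zero measured power are skipped to avoid
    division by zero.
    """
    pairs = [(m, p) for m, p in zip(measured, predicted) if m != 0]
    return 100.0 * sum(abs((m - p) / m) for m, p in pairs) / len(pairs)


# Hypothetical DC power samples (kW).
measured_kw = [100.0, 200.0, 300.0]
predicted_kw = [110.0, 190.0, 300.0]
r2 = r_squared(measured_kw, predicted_kw)        # 0.99 for these samples
mape = mape_percent(measured_kw, predicted_kw)   # 5.0 % for these samples
```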

D. FAILURE DETECTION ALGORITHMS (FDAs)
During fault conditions, the DC power production of the system is either near zero (0-15% of predicted power) due to inverter shutdown failures, maintenance events, grid failures, ground faults, etc. or reduced (15-80% of predicted power) due to bypass diode and short-circuit faults, partial shading, etc. [33].
In this context, a comparative algorithm (that compares the measured against the predicted DC power production for each data point) is used for fault detection. Fault operation is detected when the absolute error (AE), defined as the absolute difference between the predicted and measured power, exceeds a predefined threshold level (TL). The TL was calculated by multiplying the power of the array at Standard Test Conditions (STC) by the combined yield uncertainty of the model, which was calculated by deriving the partial derivatives of the model's inputs [34].
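The comparative check and the two power bands mentioned above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the combined uncertainty is passed in as a plain fraction, the band edges (15% and 80% of predicted power) are taken from the description of fault conditions, and the "unclassified" label is an assumption for points that exceed the TL but fall outside both bands.

```python
def threshold_level(p_stc_kw, combined_uncertainty):
    """TL = array power at STC multiplied by the model's combined uncertainty."""
    return p_stc_kw * combined_uncertainty


def classify_point(p_meas_kw, p_pred_kw, tl_kw):
    """Label one data point by comparing measured and predicted DC power."""
    absolute_error = abs(p_pred_kw - p_meas_kw)
    if absolute_error <= tl_kw:
        return "normal"
    if p_pred_kw <= 0:
        return "unclassified"  # assumption: no meaningful ratio available
    ratio = p_meas_kw / p_pred_kw
    if ratio <= 0.15:
        return "near_zero"     # e.g., inverter shutdown, grid or ground fault
    if ratio <= 0.80:
        return "reduced"       # e.g., bypass diode fault, partial shading
    return "unclassified"      # assumption: AE > TL but outside both bands


# Hypothetical 100 kW array with a 5% combined uncertainty.
tl = threshold_level(100.0, 0.05)
labels = [classify_point(m, 80.0, tl) for m in (2.0, 50.0, 78.0)]
```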
Finally, labels (indicating normal or fault operation) were inserted to the dataset under study by utilizing the maintenance log of the test PV plant. The accuracy metric, defined as the ratio of the number of correct predictions (True Positive + True Negative) to the number of total predictions (True Positive + True Negative + False Positive + False Negative), was then used to assess the performance of the FDAs [35].
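The accuracy definition above translates directly into code. The usage line plugs in the confusion-matrix counts reported for the test plant in Section III; the function name is illustrative.

```python
def detection_accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts reported for the test PV plant: 1776 TP, 719 TN, 28 FP, 69 FN.
acc = detection_accuracy(tp=1776, tn=719, fp=28, fn=69)
```

With these counts the ratio is 2495/2592, i.e., roughly 96.3%, in line with the reported 96.25%.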

E. TREND-BASED PERFORMANCE LOSS ROUTINES (TLRs)
Incidents causing gradual or seasonal power loss are referred to as ''trend-based'' performance losses (e.g., degradation, snow, and soiling). Such losses can reduce the PV system power production by up to 20% [33]; in some specific cases (i.e., heavy snowfall or sandstorm), this set range can be exceeded. Trend-based incidents can result in either reversible or irreversible (permanent) performance loss based on the caused damage [36]. Most of the irreversible losses can be classified as material/component degradation of the PV cell/module and balance of system.
Soiling and degradation were detected by leveraging the RdTools open-source Python library [37], which is accepted and used by the industry for evaluating the reliability of the system. This library has the capability of evaluating the PV production to obtain rates of performance degradation and soiling loss. RdTools incorporates the Year-on-Year (YoY) [38] method for estimating the performance loss rate (PLR) (in %/year) and the stochastic rate and recovery (SRR) [39] method for estimating the soiling losses and detecting cleaning events. Snow losses were detected by post-processing PV performance parameters (i.e., performance ratio values) along with the site's weather conditions (i.e., snowfall and ambient temperature measurements) [40], [41].
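The idea behind the year-on-year estimate can be illustrated with a deliberately simplified sketch. This is not the RdTools implementation (which also handles normalization, filtering, aggregation and confidence intervals); it only shows the core step of comparing each daily normalized value with the value 365 days later and taking the median of the resulting rates. Names are hypothetical.

```python
from datetime import date, timedelta
from statistics import median


def yoy_rate_percent_per_year(daily_normalized):
    """Median year-on-year change (%/year) of a {date: normalized value} mapping.

    Simplified sketch: pairs each day with the day exactly 365 days later.
    Returns None if no such pair exists.
    """
    rates = []
    for day, value in daily_normalized.items():
        later = daily_normalized.get(day + timedelta(days=365))
        if later is not None and value > 0:
            rates.append(100.0 * (later / value - 1.0))
    return median(rates) if rates else None


# Synthetic series losing 1% per year, starting 1 April 2013.
start = date(2013, 4, 1)
synthetic = {start + timedelta(days=i): 0.99 ** (i / 365.0) for i in range(800)}
plr = yoy_rate_percent_per_year(synthetic)  # close to -1.0 %/year
```

Using the median makes the estimate robust to outliers and short-lived faults, which is one reason the YoY approach is popular for field data.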

F. FAILURES AND PERFORMANCE LOSS CATEGORISATION AND CRITICALITY
An anomaly detection algorithm was utilized along with a change-point (CP) model for distinguishing failures (e.g., near zero fault occurrences) from reversible (e.g., soiling and snow) and irreversible (e.g., degradation) performance loss mechanisms. In particular, the Seasonal Hybrid Extreme Studentized Deviates (S-H-ESD) was employed to detect data anomalies, indicating failure occurrences in time series data [19], [42]. The S-H-ESD algorithm detects both global and local anomalies by applying Seasonal and Trend decomposition using Loess [43] and robust statistics (i.e., statistical test hypothesis, median based estimation, piecewise approximation) together with Extreme Studentized Deviates (ESD).
In parallel, the Facebook Prophet (FBP) algorithm was used to identify the number and location(s) of change-point(s) in time series data by capturing linear and complex trends as well as abrupt profile changes [44], [45]. The FBP has the capability of differentiating reversible from irreversible mechanisms, extracting the soiling losses and estimating both the PLR and degradation rate (R_D) of PV systems [41]. This can be achieved by rating the detected changes and adjusting the flexibility of the algorithm (changepoint_prior_scale hyperparameter) to capture either changes due to performance loss factors, soiling loss or only degradation rate changes (by avoiding the influence of outliers/faults and temporary effects) [41]. More details about the FBP model calibration procedure and its usage for categorization of incidents are available in Livera et al. [41]. System energy losses were then broken down into 6 main categories (near zero power production incidents, reduced power production incidents due to faults, faulty/defective equipment, soiling, snow, degradation) to provide insights on the fault root causes.
Subsequently, when an underperformance issue was detected, the energy loss during that time period was estimated using the area under the curve (AUC) given by:
$$E_{loss} = \int_{t_1}^{t_2} \left( P_{X\_pred}(t) - P_{X\_meas}(t) \right) dt \qquad (1)$$
where X indicates the DC (or AC) power production, P_{X_pred} and P_{X_meas} are the predicted and measured power of the PV system, respectively, and t_1 and t_2 delimit the underperformance period. The incident's criticality can then be assessed using the FMECA [15], MCDA [16], RAM [16] or CPN [17]. However, a comparative analysis is needed to derive the most robust analytical method for assessing incidents' criticality and optimizing the field O&M interventions (part of future work). The selected method can then be incorporated into the DSS to highlight the most critical failures for optimizing the MSRs. For demonstration purposes of the DSS functionalities, the FMECA approach is used in this work for determining the fault criticality (i.e., non-critical, medium and critical [19]). The complete procedure for deriving the fault criticality is described in Livera et al. [19], while the detailed methodology for the FMECA approach is provided in [14], [15].
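For sampled data, the area between the predicted and measured power curves can be approximated numerically. The sketch below uses the trapezoidal rule on 15-minute averages; the clipping of negative differences to zero (so that over-production does not offset losses) and the function name are assumptions made for illustration.

```python
def energy_loss_kwh(p_pred_kw, p_meas_kw, step_hours=0.25):
    """Trapezoidal area between predicted and measured power, in kWh.

    Assumptions: equally spaced samples (default 15 min) and negative
    differences clipped to zero.
    """
    diff = [max(p - m, 0.0) for p, m in zip(p_pred_kw, p_meas_kw)]
    return sum((a + b) / 2.0 * step_hours for a, b in zip(diff, diff[1:]))


# Hypothetical one-hour outage: predicted 100 kW, measured 0 kW (5 samples).
loss = energy_loss_kwh([100.0] * 5, [0.0] * 5)  # 100 kW over 1 h = 100 kWh
```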

G. MAINTENANCE STRATEGIES ROUTINES (MSRs)
In case of PV underperformance, the DSS recommends specific actions (e.g., cleaning of PV modules, repair of faulty equipment, replacement of PV modules, etc.) to be performed by the O&M personnel to mitigate the effect of failures and performance losses. The results indicating fault/loss events are used as inputs to a recommendation engine responsible for transforming underperformance events into actions to be conducted by the O&M team, thus optimizing field operations. The recommendation engine uses the criticality value derived from the FMECA analytical method along with a text format input to generate mitigation actions and schedule/prioritize the O&M activities.
To study the cost-benefits of the recommended actions, a cost analysis was carried out. Initially, the energy loss (kWh) estimated by the AUC was translated into economic (or revenue) loss (€). The energy that could be recovered (also translated into revenue recovery) by performing O&M actions (e.g., corrective actions) was thus estimated by considering variable response time and fixed resolution time. The resolution time was specified according to [17], which states the time to fix each fault type, while the response time was defined as stated in the contract made between the PV plant owner and the O&M company [2]. In our case study, the O&M contractor was obliged to react to alarms indicating faulty PV operation within a certain period of time (e.g., within 4 daytime hours when the entire PV plant is off, up to 24 hours for more than 30% power loss and finally within 36 hours for 0-30% power loss) 7 days a week [2].
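The contractual response tiers and the energy-to-revenue conversion can be expressed compactly. In this sketch the tiers follow the example values quoted above, and the default electricity price of 59.98 €/MWh is the one stated in the abstract; the function names and the handling of the exact 30% boundary are assumptions.

```python
def contractual_response_hours(power_loss_fraction, plant_off=False):
    """Maximum response time (hours) for an alarm, per the example contract."""
    if plant_off:
        return 4   # entire plant off: within 4 daytime hours
    if power_loss_fraction > 0.30:
        return 24  # more than 30% power loss
    return 36      # 0-30% power loss


def revenue_loss_eur(energy_loss_kwh, price_eur_per_mwh=59.98):
    """Translate an energy loss (kWh) into a revenue loss (EUR)."""
    return energy_loss_kwh * price_eur_per_mwh / 1000.0


# Hypothetical incident: 50% power loss, 1000 kWh lost before resolution.
deadline_h = contractual_response_hours(0.5)
lost_revenue = revenue_loss_eur(1000.0)
```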
To evaluate the economic impact of O&M actions, the Net Present Value (NPV) was used. The NPV evaluates the profitability of an investment (i.e., compares the revenues and costs over the project lifetime) and it is given by (2) [13], where t is time, T is the total number of years of operation, C is the installation cost (€/kW), p is the average electricity price (€/kWh), PL is the performance loss profile, E_t is the lifetime energy multiplied by the degradation rate (R_D), O&M_t is the yearly O&M costs (€/kW/year), n_R is the number of yearly repair visits, R_W is the specific repair cost for the whole PV site (€/kW) and r is the discount rate (%/year). A positive NPV indicates a profitable investment.

H. BENCHMARKING
The proposed DSS was benchmarked using historical field measurements from a PV power plant in Larissa, Greece (Köppen-Geiger-Photovoltaic climate classification DH; Temperate with high irradiation) [46]. The performance of the PV system and the prevailing meteorological conditions have been recorded according to the requirements set by the IEC 61724-1 [47] since 2013. The field data are stored with the use of a measurement monitoring platform. The monitoring platform stores data at a resolution of 1 second and accumulation steps of 15-minute averages. The meteorological measurements include the in-plane irradiance (G_I) measured with a pyranometer (Kipp & Zonen CM21-CV), ambient temperature (T_amb), module back-surface temperature (T_mod), inverter temperature (T_inv), wind speed (W_s) and direction (W_a). The electrical measurements include the inverter DC current, voltage and power and AC output power (P_out). Additional yields and performance metrics such as the array and system performance ratio (PR) were also calculated [48]. Lastly, weather data that were unavailable at the power plant (i.e., snowfall and rainfall measurements) were sourced from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) [49].
The outdoor field measurements, as well as the calculated performance metrics, were used to create a PV dataset of 15-minute average measurements from the four grid-connected inverters from April 01, 2013 to December 31, 2018. Over the evaluation period, different types of faults and performance loss events (e.g., plant was down due to grid failures, scheduled maintenance, ground faults, low plant power production and/or low PR due to snowfall, equipment malfunctions, soiling, etc.) occurred. Information about the outage periods, failure types, loss mechanisms and corrective actions was kept in a maintenance log (which was used for validating the proposed architecture). It is worth noting here that the issues reported in the log and their classification are highly dependent on the monitoring system level (e.g., module, string or inverter level). In addition, the test PV plant is continuously monitored by an O&M solar company; thus, the detected underperformance incidents (grid failures were excluded as mitigation measures cannot be taken) were resolved based on the agreed response time (as stated in the contract made between the PV plant owner and the O&M company [2]). Furthermore, maintenance of the plant was pre-scheduled twice a year (during winter and summer months).

I. TEST SCENARIOS
Since the test PV plant is currently monitored by an O&M solar company (see subsection II. H. Benchmarking) and hence the ''actual'' loss of energy generation could not be estimated (only the lost energy during the period starting from the acknowledgement time until the resolution time could be estimated), test scenarios were generated to assess the economic impact of fault events (that occur during the lifetime operation of a PV system). In particular, stochastic simulations derived from the Photovoltaic Reliability and Performance Model (PV-RPM) developed by Sandia National Laboratories (SNL) [50] and the National Renewable Energy Laboratory (NREL) [51] were performed. The PV-RPM allows users to develop and run simulations, where PV performance and costs are impacted from components that can fail stochastically [24]. Costs associated to false alarms were not included in the analysis.
In parallel, it was assumed that the performance of the test PV plant can be improved from 1% to 5.27% [6] by complying with the recommendations of the DSS (e.g., performing corrective actions and resolving the detected critical fault incidents), thus minimizing downtimes.
Finally, to examine for which cases performing corrective actions is beneficial, an economic analysis with varied input parameters (i.e., energy yield, agreed electricity price and number of yearly repair visits) was performed.

III. DECISION SUPPORT SYSTEM (DSS) APPLICATION ON REAL FIELD DATA
A. DATA QUALITY ROUTINES (DQRs)
The DQRs were initially applied to the field measurements to estimate the technical availability (uptime) of the PV system [2]. The uptime is a PV plant key performance indicator (KPI). It is defined as the time during which the plant is operating over the total possible time it is able to operate, without taking any exclusion factors into account [2]. The technical availability for the four inverters was calculated to be higher than 98.52%, indicating a well-maintained PV system. Uptimes reported in the literature ranged from 95.5% to 99.5% [52], while a best practice is a guaranteed availability of >98% over a year [2].
The whole plant (or individual inverters) was (were) down for approximately 303 hours (equivalent of 0.14%) over the evaluation period, reflecting the high performance of the services provided by the O&M contractor (aiming to resolve ''quickly'' the underperformance incidents and minimize downtimes) [2].
The data quality methodology was then used to include daylight measurements only (irradiance values between 20 W/m² and 1300 W/m²) and to filter out invalid measurements (by applying the boxplot rule method [45]) before simulating the PV performance [30]. Over the evaluation period, the DQRs methodology detected 5.28% invalid data points (e.g., erroneous and missing values), indicating a continuously monitored PV plant with a high-quality data acquisition system (or a system with a high monitoring health-state grade) [53].
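The two filtering steps can be illustrated with standard-library Python. This is a sketch under stated assumptions: the irradiance bounds are treated as inclusive, the boxplot rule is taken in its common form (values beyond 1.5 times the interquartile range from the quartiles are flagged), and the record layout and function names are hypothetical.

```python
from statistics import quantiles


def daylight_only(records, g_min=20.0, g_max=1300.0):
    """Keep records whose in-plane irradiance G_I (W/m^2) lies in [g_min, g_max]."""
    return [r for r in records if g_min <= r["G_I"] <= g_max]


def boxplot_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the boxplot rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [not (lower <= v <= upper) for v in values]


# Hypothetical records: one night-time point and one spurious power reading.
records = [{"G_I": 5.0, "P_dc": 0.0}, {"G_I": 500.0, "P_dc": 40.0},
           {"G_I": 900.0, "P_dc": 75.0}]
daylight = daylight_only(records)
mask = boxplot_outlier_mask([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
```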

B. PV SYSTEM SIMULATION MODEL
The ML model was used to predict the DC power production of the test PV system. The DC power predictions of the ML model for inverter-level data resulted in an R² of 0.99 and a MAPE of 5.38% over the test period, demonstrating its suitability as a reference model embedded in the FDAs. Although lower MAPE values (e.g., 2.05%) have been reported in the literature for predicting the DC power [32], [54], the validation process in such cases was performed either under normal operating conditions or on simulated data. In this work, the prediction model's performance was assessed using a test set that contained both normal and actual fault conditions, and hence higher MAPE values were obtained.
An example of the measured and predicted DC power of inverter 1 is shown in Fig. 4 for a week in May. The predicted DC power exhibited similar behavior to the measured DC power during clear-sky, moderate and overcast days.
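For reference, the two accuracy metrics quoted in this subsection can be computed as in the sketch below. The textbook definitions are assumed, since the exact normalization used for the MAPE is not stated here; the function names and sample values are hypothetical.

```python
def r_squared(measured, predicted):
    """Coefficient of determination between measured and predicted power."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

def mape(measured, predicted):
    """Mean absolute percentage error (%).
    Zero measurements are skipped to avoid division by zero (an assumption)."""
    pairs = [(m, p) for m, p in zip(measured, predicted) if m != 0]
    return 100.0 * sum(abs((m - p) / m) for m, p in pairs) / len(pairs)

# Hypothetical DC power samples (kW).
measured_kw = [100.0, 200.0]
predicted_kw = [90.0, 220.0]
example_mape = mape(measured_kw, predicted_kw)  # both errors are 10% of the measurement
```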

C. FAILURE DETECTION ALGORITHMS (FDAs)
Over the test period, the FDAs detected 98 failure occurrences (or 2592 fault data points), including faulty/defective equipment events, since the AE exceeded the set TL. The FDAs achieved 96.25% detection accuracy: of the 2592 detected faulty data points, 1776, 719, 69 and 28 were classified as True Positive, True Negative, False Negative and False Positive data points, respectively. A literature search revealed fault detection accuracies ranging from 93.09% up to 99.15% [55]-[58]. The algorithms that achieved high detection accuracies (>98%) were validated using labelled datasets and emulated fault conditions. In this work, fault emulation was not possible. In addition, the labeling procedure was performed using the maintenance log of the test PV plant, which reported only fault issues at the inverter level (e.g., near zero power production incidents due to inverter failures, ground faults, grid failures, etc.). This explains the lower detection accuracy provided by the FDAs.
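The reported detection accuracy follows from the four classification counts given above. A minimal check, assuming the standard definition of accuracy (the function name is hypothetical):

```python
def detection_accuracy(tp, tn, fp, fn):
    """Share of correctly classified data points: (TP + TN) / all points."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts quoted in the text.
tp, tn, fn, fp = 1776, 719, 69, 28
total_points = tp + tn + fn + fp
accuracy = detection_accuracy(tp, tn, fp, fn)
```

The four counts sum to the 2592 data points, and the ratio evaluates to roughly 0.9626, in line with the 96.25% stated above.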
An example of a fault occurrence detected by the FDAs is shown in Fig. 5 (see July 27th and 28th). Information extracted from the maintenance log of the test PV plant indicated a failure event (inverter 1 was down due to a ground fault) which started on the 27th of July at 16:15 and was resolved on the 28th of July at 10:00.

D. TREND-BASED PERFORMANCE LOSS ROUTINES (TLRs)
The RdTools library was used for evaluating the reliability of the PV system. The PLR of the test PV system was estimated by applying the YoY method on the daily DC performance data (i.e., ratio of the measured to the predicted power production). Over the evaluation period, an annual PLR of −0.90%/year (with a confidence interval of −1.03 to −0.76%/year) was obtained. The obtained PLR is in line with recently published reports and papers [59], [60], which reported PLR values in the range of −0.5 to −1%/year with a median PLR value of −0.63%/year.
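The idea behind the YoY method can be illustrated with a short sketch: each daily value is compared with the value one year later, and the PLR is taken as the median of the resulting rates. This is a simplified stand-in, not the RdTools implementation (which works on timestamped series and also reports a confidence interval); the function name, the fixed 365-day lag and the synthetic series are assumptions.

```python
from statistics import median

def yoy_plr(daily_ratio, days_per_year=365):
    """Median year-on-year change (%/year) of a daily performance series.
    Assumes one value per day with no gaps (a simplification)."""
    changes = [
        100.0 * (later - earlier) / earlier
        for earlier, later in zip(daily_ratio, daily_ratio[days_per_year:])
        if earlier > 0
    ]
    return median(changes)

# Synthetic three-year series losing exactly 0.9% per year (hypothetical data).
series = [0.991 ** (d / 365) for d in range(3 * 365)]
plr = yoy_plr(series)
```

Because the median is used, isolated outages or outliers in the daily series have limited influence on the estimated rate.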
In parallel, the SRR model [61] was used for soiling loss extraction and detection of cleaning events. The model was fed with daily DC performance data (i.e., measured and predicted power production data), which were calculated using the filters listed in [62] and considered only the central hours of the day [63]. The cleaning events were identified from the positive shifts in the DC performance profile, and any dry period of at least 14 days was fitted using the Theil-Sen regression [64]. The SRR generates 1000 potential soiling profiles for each inverter through a Monte Carlo simulation. From these, a single soiling profile per inverter can be extracted from the median value of each day [65].
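The Theil-Sen fit applied to each dry period estimates the soiling rate as the median of all pairwise slopes, which makes it robust to occasional outliers. A minimal sketch of that estimator is given below; it is not the SRR code, and the function name and the 14-day sample are hypothetical.

```python
from itertools import combinations
from statistics import median

def theil_sen_slope(x, y):
    """Theil-Sen estimator: median of the slopes between all point pairs."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if x2 != x1
    ]
    return median(slopes)

# Hypothetical 14-day dry period: performance drops 0.1% per day,
# with one corrupted reading on day 5.
days = list(range(14))
ratio = [1.0 - 0.001 * d for d in days]
ratio[5] = 0.90
soiling_rate_per_day = theil_sen_slope(days, ratio)
```

Despite the corrupted reading, the estimated slope stays at the underlying rate of −0.001 per day, whereas an ordinary least squares fit would be pulled towards the outlier.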
Over the evaluation period, 34 cleaning events were detected by the model. In addition, the inverters experienced low soiling losses, with averages of 0.9% to 1.4% for the period between April 2013 and December 2018 (see Fig. 6). Information extracted from the maintenance log indicated that the PV modules were cleaned twice a year by the O&M company. This explains the low soiling losses obtained when compared to the higher values reported in the literature (typical soiling losses between 4% and 7%, with a range from 2% to 25%) [40], [66].

E. FAILURES AND PERFORMANCE LOSS CATEGORISATION
The weekly PR time series was initially constructed (using the recorded measurements) and examined for failures and performance losses. The weekly PR time series of inverter 1, depicted in Fig. 7, shows the seasonal profile of the test subsystem, with higher PR values in the winter and lower in the summer. By applying the S-H-ESD algorithm on the constructed time series, five data anomalies were detected (circled in purple in Fig. 7). Post-processing of the site's weather conditions showed that the detected low PR values in January 2015 and January 2017 were due to snowfall, while the low PR value in June 2018 was due to inverter shutdown and grid failures.
[Figure caption fragment: ... [39]. Red lines: median soiling profile, as in [65]. Blue vertical bars: daily rainfall intensities, downloaded from MERRA-2 [49].]
The FBP algorithm was then used to assess the overall plant health-state and to estimate both the PLR and the R_D, by adjusting its flexibility hyperparameter. For PLR estimation, the FBP flexibility was set to 2.5 to capture a signal with all performance losses and then re-adjusted to 0.04 to capture only degradation changes [41]. By applying the ordinary least squares (OLS) method on the FBP extracted linear trend (see Fig. 7), a linear PLR and R_D of −0.99%/year and −0.49%/year were obtained, respectively. Even though PV degradation contributes to the PLR, the majority of the exhibited performance losses was found to be due to reversible and temporary phenomena (circled in purple in Fig. 7). The obtained R_D is in line with the degradation rates reported in the literature for silicon PV module technologies. A recent study (conducted in 2021) found degradation rates ranging from −0.01 to −0.47%/year for fault-free PV plants [67], while a previous comprehensive review (published in 2013) found that the median and average R_D for c-Si are −0.5%/year and −0.7%/year, respectively [33].
Over the evaluation period, the test PV system produced 16,344 MWh, while the FDAs and TLRs detected 138 incidents, accounting for 298.16 MWh (1.82%) of lost energy. The estimated energy loss of 1.82% represents the amount of energy lost during the period starting from the acknowledgement time until the resolution time. An example of the energy loss estimation using the AUC for an inverter failure is depicted in Fig. 8. The failure incident (which started on the 23rd of October at 14:30 and was resolved on the 24th of October at 12:30) resulted in 899 kWh of energy being lost. The losses breakdown analysis is summarized in Table 1. Among the detected underperformance incidents, 52.94% of the lost energy was due to near zero power production incidents, 11.77% was due to reduced power production incidents caused by faults (e.g., bypass diode failures, partial shading, short-circuit faults, etc.), 0.47% was due to faulty/defective equipment, 24.37% was due to performance losses (soiling, snow and degradation), and other incidents and errors accounted for 10.45%.
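An AUC-based energy loss estimate of the kind shown in Fig. 8 can be sketched as the area between the predicted and measured power curves over the incident window. The trapezoidal rule, the clipping of negative differences to zero, the function name and the sample values below are assumptions made for illustration only.

```python
def energy_loss_kwh(hours, predicted_kw, measured_kw):
    """Trapezoidal area between predicted and measured power (kWh).
    Negative differences are clipped to zero (an assumption)."""
    deficit = [max(p - m, 0.0) for p, m in zip(predicted_kw, measured_kw)]
    return sum(
        0.5 * (deficit[i] + deficit[i + 1]) * (hours[i + 1] - hours[i])
        for i in range(len(hours) - 1)
    )

# Hypothetical incident: the inverter produces nothing for the two middle samples.
hours = [0.0, 1.0, 2.0, 3.0]
predicted_kw = [100.0, 200.0, 200.0, 100.0]
measured_kw = [100.0, 0.0, 0.0, 100.0]
lost_kwh = energy_loss_kwh(hours, predicted_kw, measured_kw)

# Share of lost energy over the evaluation period, from the figures in the text.
loss_share_pct = 100.0 * 298.16 / 16344.0
```

With the totals quoted above, the loss share evaluates to about 1.82%.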

F. MAINTENANCE STRATEGIES ROUTINES (MSRs)
The results of the energy loss and incidents breakdown analyses are used as inputs to a recommendation engine responsible for transforming underperformance events into actions to be conducted by the O&M team, thus optimizing field operations. The engine uses a criticality value (derived from the FMECA approach) along with a text format input to generate mitigation actions.
For the detected near zero power production incidents (e.g., inverter shutdown failure), which account for approximately 53% of the total lost energy, the DSS recommendation is to perform immediate corrective actions. Such incidents were the most severe in terms of lost energy (see Table 1) and they are categorized as critical incidents by the FMECA approach [15]. Thus, a reduced response time maximizes the PV plant energy production (and hence the financial revenue) that could be recovered. Even in the case that the detected events were resolved within 4 daytime hours by the O&M company, an amount of 187,838 kWh (or equivalently €11,267 based on an electricity price of 59.98 €/MWh) was lost over the evaluation period.
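The conversion from lost energy to lost revenue is a direct multiplication by the electricity price. A short check of the figure quoted above (the function name is hypothetical):

```python
def lost_revenue_eur(lost_energy_kwh, price_eur_per_mwh):
    """Lost revenue in EUR for a given energy loss and electricity price."""
    return lost_energy_kwh / 1000.0 * price_eur_per_mwh

# Values quoted in the text.
revenue_loss = lost_revenue_eur(187_838, 59.98)
```

Rounded to the nearest euro, this reproduces the stated €11,267.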
Trend-based performance losses (soiling and snow) were the second most severe fault category, accounting for approximately 25% of the total lost energy, and were categorized as medium criticality incidents [19]. Therefore, the scheduling of cleaning events was considered. In the absence of information on the maximum extent of soiling (i.e., the losses in conditions of no mitigation), a cleaning optimization was conducted on the available time series to evaluate the profitability of cleaning events. For a price of 59.98 €/MWh (price for PV installations between 1-10 MW installed in Greece) [68], the extracted soiling losses correspond to lost revenues in the range of 0.9 to 1.3 €/kW/year. With one additional cleaning per year, the losses could be reduced by up to 11% (relative) [13]. However, markets similar to Greece have cleaning costs ranging from a minimum of 0.09 €/m²/cleaning (Spain) to a maximum of 0.19 €/m²/cleaning (Italy) [8]. The results of the soiling analysis showed that, for the given PV site, each cleaning can cost between 0.6 and 1.3 €/kW, making any additional soiling mitigation (more than two yearly cleanings) not economically justifiable for this specific PV plant. Thus, the DSS recommendation would be to postpone the cleaning event to the next periodic maintenance. The same applies to the rest of the incidents, which are to be addressed during the next planned maintenance.
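The cleaning recommendation can be made concrete with a simple comparison of the recoverable revenue against the cost of one additional cleaning. The sketch below uses the ranges quoted above and deliberately takes the most favorable combination (highest soiling loss, largest reduction, cheapest cleaning); the function name and the one-year comparison horizon are assumptions.

```python
def extra_cleaning_net_gain(soiling_loss_eur_per_kw_year,
                            relative_reduction,
                            cleaning_cost_eur_per_kw):
    """Yearly net gain (EUR/kW) of one additional cleaning per year."""
    recovered = soiling_loss_eur_per_kw_year * relative_reduction
    return recovered - cleaning_cost_eur_per_kw

# Most favorable case from the quoted ranges:
# 1.3 EUR/kW/year lost, 11% relative reduction, 0.6 EUR/kW per cleaning.
best_case = extra_cleaning_net_gain(1.3, 0.11, 0.6)
```

Even in this best case the recovered revenue (about 0.14 €/kW/year) stays well below the cleaning cost, so the net gain is negative, which is consistent with the recommendation to postpone additional cleanings.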
Finally, since the obtained R_D for the test PV system is approximately −0.5%/year (based on a price of 59.98 €/MWh, this corresponds to lost revenues of 0.45 €/kW at the end of 2018), the DSS recommendation is not to replace the PV modules (assuming a PV module cost of 0.57 €/Wp). Concepts of recyclability (i.e., replacing a PV module with a used one) should be addressed in future work.

IV. TEST SCENARIOS
To examine the impact a DSS would have on the performance and revenues of a power plant over its lifetime, three test scenarios were simulated. Note that the application of the DSS on field data is restricted to the years of operation; however, it was applied to show the performance of the proposed DSS.

A. STOCHASTIC SIMULATIONS
Initially, stochastic simulations were performed to assess the lifetime impacts of faults in PV systems. Test realizations were thus simulated using the System Advisor Model's (SAM's) PV-RPM model [50]. The specifications of the test PV plant (i.e., module and inverter peak capacities, the number of modules per string, the number of strings, the number of inverters and the module soiling factor) were used along with the parameters listed in Table 4.
For the given electricity price of 59.98 €/MWh, a minimum of 8.4% energy loss per year is required for offsetting the annualized O&M cost value. For higher electricity prices (e.g., 100 €/MWh), the benefit of performing corrective actions increases (for installations with capacities higher than 100 kWp), and in this case offsets the cost of the O&M when the annual energy loss is greater than 5%. In addition, higher electricity prices can incentivize corrective actions, as for the same maintenance cost, each kWh of recovered energy would return a higher profit.
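The break-even energy loss can be approximated by dividing the annualized O&M cost by the yearly revenue per kW. The sketch below is an assumption about how such a threshold may be derived, not the authors' stated procedure; it uses the annualized O&M cost of 7.45 €/kW/year and the annual energy yield of 1490 kWh/kWp/year given in the Appendix, and the function name is hypothetical.

```python
def break_even_loss_pct(om_cost_eur_per_kw_year,
                        price_eur_per_mwh,
                        yield_kwh_per_kw_year):
    """Yearly energy loss (%) whose value equals the annualized O&M cost."""
    revenue_eur_per_kw_year = yield_kwh_per_kw_year / 1000.0 * price_eur_per_mwh
    return 100.0 * om_cost_eur_per_kw_year / revenue_eur_per_kw_year

# O&M cost and energy yield taken from the Appendix; prices from the text.
at_agreed_price = break_even_loss_pct(7.45, 59.98, 1490)
at_higher_price = break_even_loss_pct(7.45, 100.0, 1490)
```

Under these assumptions the threshold is about 8.3% at 59.98 €/MWh and 5.0% at 100 €/MWh, close to the 8.4% and 5% quoted above; the small difference at the lower price may stem from inputs not stated in this section.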

B. ECONOMIC EVALUATION OF CORRECTIVE ACTIONS
An economic evaluation was performed for examining the impact of corrective actions. It was assumed that the DSS could improve the performance of the test PV plant (by 1% to 5.27% [6]) by resolving the detected failure incidents and performing corrective actions. In addition, the maintenance activities could be optimized (e.g., performing such actions during nighttime or low irradiation hours, optimizing the number of yearly repair visits, minimizing waiting time at the solar plant, etc.) by using an analytical method (e.g., FMECA, CPN, etc.) [14], [17] or additional hardware components with complex software, thus reducing the O&M costs by 20% [26]. Based on an electricity price of 59.98 €/MWh [68], the results (summarized in Table 2) demonstrated total potential savings of up to €58,588 (6,200 €/MW/year).

C. ECONOMIC ANALYSIS WITH VARIED INPUT PARAMETERS FOR DETERMINING SCENARIOS IN WHICH PERFORMING CORRECTIVE ACTIONS IS BENEFICIAL FOR PV PLANT OWNERS
An economic analysis with varied input parameters was then performed to examine for which scenarios performing corrective actions is beneficial. For the test analysis, the following three input parameters were varied: the energy yield (i.e., low yield of 700 kWh/kWp and high yield of 1600 kWh/kWp), the agreed electricity price (i.e., low price of 50 €/MWh, medium price of 100 €/MWh and high price of 200 €/MWh) and the number of yearly repair visits (i.e., 1, 2, 5 and 10). Thus, the investigated scenarios were: 1. Low energy yield and low electricity price, 2. Low energy yield and medium electricity price, 3. Low energy yield and high electricity price, 4. High energy yield and low electricity price, 5. High energy yield and medium electricity price, 6. High energy yield and high electricity price. The values of the remaining input parameters for the economic analysis were derived from the literature (see Table 3). The NPV for the six different scenarios is depicted in Fig. 9, assuming 1, 2, 5 and 10 yearly repair visits.
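The structure of this analysis can be sketched by enumerating the six yield/price scenarios and discounting a yearly cash flow for each number of repair visits. The yields, prices and visit counts below come from the text, whereas the capital cost, the cost per visit and the reuse of the 6.4% discount rate from the Appendix are hypothetical placeholders (they are not the Table 3 values), so the resulting figures are illustrative only and will not reproduce Fig. 9.

```python
from itertools import product

yields_kwh_per_kwp = [700, 1600]     # low, high (from the text)
prices_eur_per_mwh = [50, 100, 200]  # low, medium, high (from the text)
repair_visits = [1, 2, 5, 10]        # yearly repair visits (from the text)

# product() varies the price fastest, matching the order of scenarios 1 to 6.
scenarios = list(product(yields_kwh_per_kwp, prices_eur_per_mwh))

def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs in year 0 (undiscounted)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))

# Hypothetical placeholders, NOT the inputs used in the paper.
CAPEX_EUR_PER_KWP = 800.0
VISIT_COST_EUR_PER_KWP = 5.0
DISCOUNT_RATE = 0.064  # Appendix value for the O&M cost model, reused as an assumption
LIFETIME_YEARS = 25

def scenario_npv(yield_kwh_per_kwp, price_eur_per_mwh, visits):
    """NPV (EUR/kWp) of yearly revenue minus visit costs over the lifetime."""
    revenue = yield_kwh_per_kwp / 1000.0 * price_eur_per_mwh
    yearly = revenue - visits * VISIT_COST_EUR_PER_KWP
    return npv(DISCOUNT_RATE, [-CAPEX_EUR_PER_KWP] + [yearly] * LIFETIME_YEARS)

# NPV keyed by (scenario number, yearly repair visits).
results = {
    (i + 1, v): scenario_npv(y, p, v)
    for i, (y, p) in enumerate(scenarios)
    for v in repair_visits
}
```

Whatever the placeholder values, two qualitative features hold by construction: the NPV rises with yield and price, and each additional yearly visit lowers it.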
As expected, the results showed that both the annual energy yield and the electricity price values can influence the NPV. Scenarios 3, 5 and 6 resulted in a positive NPV (for any number of yearly repair visits), indicating a profitable PV project investment. From those test scenarios, scenario 6 yielded the highest NPV of 2,511 €/kWp (when scheduling a repair visit once a year) over the 25-year lifetime. On the contrary, scenarios 1, 2 and 4 resulted in a negative NPV for any number of yearly repair visits. It can also be seen (from Fig. 9) that the low electricity price scenarios resulted in a negative NPV, while the high electricity price scenarios resulted in a positive NPV. Finally, the increasing number of annual repair visits had a negative effect on the NPV. Therefore, the decision for performing corrective actions is location- and case-dependent. It depends mainly on the agreed electricity price, the received irradiation and corresponding energy yield, the signed O&M contract (i.e., agreed O&M schedule, guaranteed response time, etc.), the installed capacity of the PV plant, the labor cost, the PV module cleaning cost, etc.

V. CONCLUSION
Monitoring and O&M strategies are important for maximizing the energy production of installed PV systems while also minimizing downtimes. To minimize the energy loss and increase the plant's revenue, a DSS architecture was developed in this work. The DSS incorporates data-driven functionalities to detect PV underperformance issues and proposes mitigation actions in cases of fault and performance loss events. By considering financial metrics, the PV plant operator/owner can decide whether or not to perform the maintenance actions.
The proposed DSS operates entirely on the recorded field measurements, and it was benchmarked experimentally on a PV plant installed in Greece. Its data analytic functionalities clean and analyze the PV system's available data and provide insightful and meaningful information about the system health-state condition. The recommendation engine then considers the results of the failure and performance loss diagnostic algorithms along with economic metrics and generates recommendations with specific O&M actions.
The obtained results demonstrated the effectiveness of the proposed system for detecting faults in PV systems and categorizing the detected incidents into reversible and irreversible mechanisms. Over the evaluation period, the test system produced 16,344 MWh, while the lost energy accounted for 298.16 MWh (1.82%). However, the estimated 1.82% loss of energy generation does not represent the "actual" lost energy, but the energy lost during the period starting from the acknowledgement time until the resolution time.
Moreover, the benchmarking results revealed the financial benefits of performing corrective actions in cases of near zero power production incidents. Regarding the mitigation of reversible loss mechanisms, the soiling analysis showed that, for the given PV site, additional cleaning events were not economically worthwhile. Furthermore, the PV plant experienced a degradation rate of −0.49%/year, suggesting that PV module replacement (with new ones) is not economically viable.
When performing O&M actions, the stochastic simulations (that consider component malfunctions and failures) exhibited a net economic gain of approximately 4.17 €/kW/year. For an electricity price of 59.98 €/MWh, a minimum of 8.4% energy loss per year is required for offsetting the annualized estimated O&M cost value of 7.45 €/kW/year.
Finally, the economic analysis revealed that O&M strategies are location- and case-dependent and should be coupled with a financial and a reliability model to grow profit margins for the PV plant owners. The NPV analysis showed that both the energy yield and the agreed electricity price can influence the profitability of a PV project.
The limitations of the proposed DSS were outlined in Livera et al. [41]. To overcome these limitations, further research is needed on the calibration of the CP model's flexibility. Likewise, to optimize PV field operations and improve the system's performance, a failure criticality assessment tool along with prognostics is needed. Therefore, future work will focus on incorporating predictive maintenance strategies (e.g., deriving long-term trends and predicting future fault/loss incidents), prescriptive analytics and prioritization functions based on incident criticality and cost-based metrics (i.e., FMECA or CPN methods). Such features will help the O&M teams to schedule and prioritize the field maintenance strategies more efficiently and cost-effectively.

APPENDIX
Test realizations were simulated using the System Advisor Model's (SAM's) PV-RPM model [50]. The user input parameters are summarized in Table 4.
The PV O&M Cost Model [70], released by the National Renewable Energy Laboratory (NREL) and SunSpec, was used to calculate the annual O&M costs for the test PV plant. Apart from the O&M costs, the model also estimates the cash flow (by anticipating scheduled and corrective maintenance tasks), the net present value (NPV) and reserve account for each year over the evaluation period [70].
To estimate the PV O&M costs, the model requires as inputs the technical specifications of the PV plant (e.g., the plant size, module power and efficiency, number of modules, inverters and transformers, number of strings, modules per string, etc.) and the annual energy yield (1490 kWh/kWp/year [69]). The PV O&M model also considered the specific repair cost for the PV site, Greece's labor rates (indicated by the O&M company), the materials warranty period (e.g., a 10-year warranty period for the inverters and a 20-year performance warranty period for the PV modules), a discount rate of 6.4%/year and an inflation rate of 2%/year.
For the implementation of the model, the administrative and preventive maintenance actions were predefined (at a scheduled interval of twice per year) and the action costs were escalated according to the inflation rate of the year in which they occur. On the contrary, corrective actions were scheduled based on a failure distribution curve (Weibull distribution). The costs of corrective maintenance actions were calculated by multiplying the probability of a failure to occur in a given year with the component's replacement cost. The user input parameters of the PV O&M Cost Model are summarized in Table 5.
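The corrective cost rule described above (failure probability in a given year multiplied by the component's replacement cost) can be sketched with a two-parameter Weibull distribution. This is an illustration rather than the PV O&M Cost Model itself: interpreting the yearly failure probability as F(t) − F(t − 1), as well as the shape, scale and replacement cost values, are assumptions.

```python
import math

def weibull_cdf(t, shape, scale):
    """Probability that a component has failed by time t (years)."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

def expected_corrective_cost(year, shape, scale, replacement_cost):
    """Failure probability within the given year times the replacement cost.
    The yearly probability is taken as F(year) - F(year - 1) (an assumption)."""
    p_fail = weibull_cdf(year, shape, scale) - weibull_cdf(year - 1, shape, scale)
    return p_fail * replacement_cost

# Hypothetical component: shape 2.0, scale 15 years, 1000 EUR replacement cost.
SHAPE, SCALE, COST = 2.0, 15.0, 1000.0
yearly_costs = [expected_corrective_cost(y, SHAPE, SCALE, COST) for y in range(1, 26)]
```

With a shape parameter above 1, the yearly failure probability grows during the early years of operation, so the expected corrective cost starts low and increases as the component ages.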
The obtained results of the cost model implementation demonstrated annual costs varying from €4,588 in Year 1 to €90,847 in Year 25 (see Fig. 10), with an annualized value of 7.45 €/kW/year over its lifetime. The estimated value of 7.45 €/kW/year is lower than the latest PV O&M cost value of €11.05/kW/year (including inverter replacement costs) reported by NREL in 2018 for fixed tilt utility-scale plants [73]. However, as stated in the same report [73], the PV O&M costs continue to show a decline for utility-scale plants. As shown in Fig. 10, the annual cash flow is higher in the final years as it includes administrative, preventive and corrective maintenance costs, which depend on the year that the action takes place and on the equipment's warranties. The warranties affect whether a failure will result in labor or hardware costs, or both, depending on whether the year falls within the warranty period.
The amount to keep in reserve (that will be sufficient to cover unplanned repair costs for each year of the analysis period) varies from less than €4,713 early in the analysis period (Year 1) to a maximum of €207,593 in Year 21, based on a desired probability of 92%. The reserve account amount is higher in the final years than in the early years (see Fig. 10). It includes parts inventory, and thus the costs vary from year to year and increase at different rates over time (as the system ages, due to inflation, increasing failure rates and expiring warranties), as modeled by heuristic failure distributions (e.g., Weibull distribution) based on actuarial data of the services. The values were converted from US dollars, considering a conversion factor of 0.85 $/€.