Data-Driven I-V Feature Extraction for Photovoltaic Modules

Abstract: In research on photovoltaic (PV) device degradation, current-voltage (I-V) datasets carry a large amount of information in addition to the maximum power point. Performance parameters such as short-circuit current, open-circuit voltage, shunt resistance, series resistance, and fill factor are essential for diagnosing the performance and degradation of solar cells and modules. To enable the scaling of I-V studies to millions of I-V curves, we have developed a data-driven method to extract I-V curve parameters and distributed this method as an open-source package in R. In contrast with the traditional practice of fitting the diode equation to I-V curves individually, which requires solving a transcendental equation, this data-driven method can be applied to large volumes of I-V data in a short time. Our data-driven feature extraction technique is tested on I-V curves generated with the single-diode model and applied to I-V curves with different data point densities collected from three different sources. This method has a high repeatability for extracting I-V features, without requiring knowledge of the device or expected parameters to be input by the researcher. We also demonstrate how this method can be applied to large datasets and how it accommodates nonstandard I-V curves, including those showing artifacts of connection problems or shading where bypass diode activation produces multiple "steps." These features together make the data-driven I-V feature extraction method well suited to evaluating time-series I-V data and analyzing power degradation mechanisms in PV modules through cross comparisons of the extracted parameters.


I. INTRODUCTION
Current-voltage (I-V) curve parameters are the most commonly used measurements to evaluate the performance and degradation of photovoltaic (PV) cells and modules. These performance features include the maximum power point (P_mp), short-circuit current (I_sc), open-circuit voltage (V_oc), shunt resistance (R_sh), series resistance (R_s), and fill factor (FF). The reduction of P_mp represents power degradation of a PV module or cell [1]. Other I-V features, meanwhile, point to specific mechanisms of module or cell performance and degradation [2]-[4]. Fitting the diode model to a single I-V curve, based on the theory and physics of solar cell operation, is the traditional way to obtain these I-V features.
One method of fitting the I-V curve with the diode model is to use the Lambert W function to obtain an explicit analytical solution [5]-[9]. Iterative numerical methods are an alternative, but they are time-consuming and require manual setting or prior knowledge of the approximate initial fitting parameters for each I-V curve [10]-[14]. When analyzing a large number of I-V curves, for example, millions of I-V curves acquired from commercial PV power plants with time-series I-V scanning tools, fitting to the diode model becomes computationally and/or labor intensive.
There have been several studies in the literature considering time series of I-V curves and their features [15]-[19], with studies of only inverter-obtained data such as I_sc, V_oc, FF, and P_mp being even more common. Our group has recently employed network structural equation modeling (netSEM) for PV module degradation studies [1], [20], [21]. netSEM evaluates datasets of stressors, mechanisms, and responses as time series to identify and quantify relevant mathematical models linking these variables. For outdoor studies of PV modules, environmental exposure stressors, such as irradiance, temperature, and humidity, are modeled with responses such as power and wet insulation resistance, and mechanistic predictors of degradation, such as I-V features, to reveal active degradation pathways. Here, we propose a data-driven I-V feature extraction method to increase the efficiency and repeatability of I-V time-series data stream analysis. This is based on linear regression methods applied to different regions of the I-V curve [14], [22]-[25]. In this paper, we scale the linear regression approach to I-V curve fitting to accurately and efficiently process millions of I-V curves from a diverse variety of sources.
Data-driven regression approaches, such as the one presented here, are intrinsically sensitive to the specific nature of the data used and its noise [14], [26]. At the same time, the diode model is not always adequate for describing the operation of PV modules (involving multiple cells, bypass diodes, etc.) and degraded PV devices [27], [28]. For example, multiple "steps" observed in a PV module I-V curve indicate mismatch between cells and/or irradiance in different areas of the PV array or module under test. This can arise from partial shading of the PV array or from degradation and damage of PV cells in the string, which causes bypass diodes to activate [29], and the resulting I-V curve does not conform to the diode model. Therefore, many studies regard these curves as erroneous and discard them. Furthermore, I-V data must exhibit low noise for accurate use of the diode model: 1% noise in an I-V curve leads to approximately 20% relative error in the value of R_s extracted from the fitted diode model [8].
In this paper, we describe this data-driven I-V feature extraction method and algorithm for time-series I-V studies of PV modules. Statistical methods such as simple linear regression and smoothing splines are used [30]. A simulation study is conducted to evaluate the performance of the I-V feature algorithm on diode-model-generated I-V curves with various levels of noise. Datasets from different sources, as well as a time series of I-V curves, are used to demonstrate how the proposed method performs on real-world data and how it can be applied in practice.

II. METHODS

A. Data-Driven I-V Feature Extraction Method
The data-driven I-V feature extraction method uses linear regressions and basic computational practices on various regions of the I-V curve to obtain values for several I-V features as follows. I_sc is defined as the current at zero voltage (the y-intercept of the I-V curve), while V_oc is the voltage at zero current (the x-intercept). R_sh is calculated as the negative inverse slope of the I-V curve near V = 0, and R_s is the negative inverse slope of the I-V curve near V_oc. P_mp is the maximum product of current and voltage on the I-V curve. FF is defined as the ratio of the maximum power from the solar cell to the product of V_oc and I_sc and measures the "squareness" of the solar cell's I-V curve. I-V curve parameters, as defined in this method, are illustrated in Fig. 1.
In most PV module I-V curves, observation points are evenly spaced in voltage. However, when approaching V_oc, the current decreases exponentially, resulting in few points in the pseudolinear region close to V_oc, which may introduce bias when estimating V_oc and R_s. Additionally, some I-V tracers acquire more data points near P_mp for more accurate determination of the ideal operating point and fewer points in the pseudolinear regions near I_sc and V_oc. Thus, we use a smoothing spline on each raw I-V curve to generate an equivalent I-V curve with 500 points with equal spacing in voltage, giving enough data points to estimate these features with low statistical uncertainty. The smoothing spline involves interpolation and nonparametric regression. Let {V_i, I_i : i = 1, 2, ..., n} denote a set of n observations and f(v) be a function that fits the observed data. The smoothing spline is the function f that minimizes

$$\sum_{i=1}^{n} \left( I_i - f(V_i) \right)^2 + \lambda \int f''(v)^2 \, dv$$

where λ is a nonnegative tuning parameter that controls the roughness of the smoothing spline [30]. We use the stats::smooth.spline function in R to perform the spline [31].
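To make the resampling step concrete, the following Python sketch places a raw curve on an evenly spaced voltage grid. It is not the R package's code: plain linear interpolation is swapped in for the smoothing spline (so it does not suppress noise), the function name `resample_uniform` is hypothetical, and the input voltages are assumed to be distinct.

```python
from bisect import bisect_right


def resample_uniform(voltage, current, n_points=500):
    """Resample an I-V curve onto n_points evenly spaced voltages.

    Linear interpolation is used as a simple stand-in for a smoothing
    spline; unlike a spline with a roughness penalty, it keeps the noise.
    Assumes all voltages are distinct.
    """
    pairs = sorted(zip(voltage, current))          # order by voltage
    v = [p[0] for p in pairs]
    i = [p[1] for p in pairs]
    step = (v[-1] - v[0]) / (n_points - 1)
    grid = [v[0] + k * step for k in range(n_points)]
    resampled = []
    for x in grid:
        # index j of the segment [v[j], v[j+1]] that contains x
        j = min(max(bisect_right(v, x) - 1, 0), len(v) - 2)
        t = (x - v[j]) / (v[j + 1] - v[j])
        resampled.append(i[j] + t * (i[j + 1] - i[j]))
    return grid, resampled
```

A 40-point tracer curve passed through `resample_uniform(v, i)` comes back as 500 evenly spaced points, which is the form the later regression steps assume.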
The data-driven I-V feature extraction method applies the above definitions of the six parameters (I_sc, R_sh, V_oc, R_s, P_mp, and FF) to automatically calculate their values, as illustrated in Fig. 2.
Because not all I-V curves have a single step like the one shown in Fig. 1, we use segmented regression to find the number and locations of change points. Segmented regression can identify change points in a curve and is used here to determine the voltage at which a change point occurs [32], [33]. However, not all change points indicate the appearance of steps. The change points between steps are those where the slope on the left is "steeper" than the slope on the right, with the slope on the left being negative. In addition, the difference between the absolute values of the slopes on the left and right sides of the point should be sufficiently large. Thus, we denote β_1 as the slope to the left of the change point, β_2 as the slope to the right of the change point, and a as a parameter decided by the noise of the I-V curve (for larger noise, set a larger a). We build the following criterion to identify the change points that indicate multiple steps:

$$\beta_1 < 0 \quad \text{and} \quad |\beta_1| - |\beta_2| > a \tag{1}$$

We then extract I-V features on each step of the I-V curve. Fig. 3 shows the extracted I-V features for an example I-V curve with two steps from the Fraunhofer-ISE dataset. In this example, we find that the voltage of the change point between steps is 26.62 V. Based on this point, the original I-V curve is divided into two single-step I-V curves, and we extract I-V features for each step.
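The screening rule described in this paragraph can be sketched in a few lines of Python. This is an illustration only: it assumes the candidate change points and the slopes on either side have already been obtained (for example, from a segmented regression), and the function name and the threshold values in the example are hypothetical.

```python
def is_step_change_point(beta1, beta2, a):
    """Return True if a change point separates two steps of an I-V curve.

    beta1: slope to the left of the change point (A/V)
    beta2: slope to the right of the change point (A/V)
    a:     noise-dependent threshold; larger noise calls for a larger a
    """
    # Left slope must be negative and steeper than the right slope by more than a.
    return beta1 < 0 and (abs(beta1) - abs(beta2)) > a


# Transition between two steps: steep drop on the left, flat plateau on the right.
print(is_step_change_point(-2.0, -0.01, a=0.5))   # True
# Ordinary knee near the maximum power point: flat on the left, steep on the right.
print(is_step_change_point(-0.01, -3.0, a=0.5))   # False
```

The second case shows why the sign and ordering conditions matter: the knee of a healthy one-step curve is also a change point, but it is not a boundary between steps.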
To determine I_sc and R_sh, a linear regression is performed on the 500-point splined I-V curve. The linear regression model is as follows:

$$Y = \alpha + \beta X + \varepsilon \tag{2}$$

where X is the independent variable, Y is the dependent variable, α and β are coefficients, and ε is the error term.
The regression is performed on a moving window of five consecutive points, with current being the dependent variable and voltage being the independent variable, and the slope for each five-point window is stored. A five-point window (i.e., 1% of the splined data length) is used because only a very small number of consecutive observations approximate a straight line; this length therefore gives the most accurate estimate of the slope, as well as of the change in slope along the I-V curve.
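As a minimal sketch of the moving-window regression just described (in Python rather than the package's R, with hypothetical function names), each window's slope is the ordinary least-squares slope of current on voltage:

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def window_slopes(voltage, current, width=5):
    """Slope of current vs. voltage for every window of `width` consecutive points."""
    return [
        ols_slope(voltage[s:s + width], current[s:s + width])
        for s in range(len(voltage) - width + 1)
    ]
```

On a 500-point curve this yields 496 slopes, one per window position; the later steps look at how quickly these slopes change from one window to the next.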
We expect that the slope coefficients for the low-voltage part of the I-V curve do not change much between windows. For the part of the I-V curve that passes through the maximum power point, the slope coefficient changes sharply, and we use the rapid change in slope of the five-point moving window to identify some of the I-V features, such as change points between steps [27]. A typical pattern of slope change for a standard one-step I-V curve can be seen in Fig. 4. As shown in the figure, the change rate of the five-point slope remains relatively stable from zero voltage until the curve approaches the maximum power point, as this corresponds to the linear part of the I-V curve near I_sc. Thus, we set a critical value for the change rate of the five-point slope to find this linear region and determine the corresponding consecutive current and voltage points that have a change rate in slope smaller than this critical value. With the selected data points from the linear region of the I-V curve near I_sc, the linear regression model in (2) is used to find the slope and intercept. Based on our definition of I_sc and R_sh (see Fig. 1), I_sc is estimated by the intercept of the fitted line, and R_sh is estimated by the negative inverse of its slope. Note that the number of selected data points for the fit of I_sc and R_sh is typically 70-75 on the 500-point splined I-V curve. Some I-V curves, especially from outdoor systems, exhibit a rapid change in slope approaching 0 V, which makes R_sh and I_sc determination challenging. As shown in Fig. 5(a), a nonlinear region near I_sc, due to a poor module connection or mis-recording by the I-V curve tracing system, should be removed from this I-V curve. Therefore, in our proposed algorithm, only consecutive data points with a low change of slope are used to determine I_sc and R_sh, thereby excluding curvature in the low-voltage region, as shown in Fig. 5(b). Then, using the selected linear data points, we correct the nonlinear region, as shown in Fig. 5(c). This method can automatically find the appropriate current and voltage values that define the linear part of I-V curves at low voltages.
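The selection and fit described above can be sketched as follows. This Python example is a simplified illustration, not the package's implementation: the "change rate" is assumed to be the relative change between consecutive window slopes, the search is restricted to voltages below the maximum power point, the longest qualifying run is used, and the function names and the default critical value are hypothetical.

```python
def ols_fit(x, y):
    """Ordinary least squares of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_isc_rsh(voltage, current, critical=1e-3, width=5):
    """Estimate I_sc and R_sh from the linear region at low voltage.

    Returns (i_sc, r_sh, number_of_points_used).
    """
    # Only consider windows that lie entirely below the maximum power point.
    power = [v * i for v, i in zip(voltage, current)]
    k_mp = power.index(max(power))
    n_windows = max(k_mp - width + 1, 0)
    slopes = [ols_fit(voltage[s:s + width], current[s:s + width])[1]
              for s in range(n_windows)]

    # Longest run of consecutive windows whose slope changes by less than `critical`.
    best_start, best_len, start, run = 0, 0, 0, 0
    for k in range(len(slopes) - 1):
        prev = abs(slopes[k])
        rate = abs(slopes[k + 1] - slopes[k]) / prev if prev > 0 else float("inf")
        if rate < critical:
            if run == 0:
                start = k
            run += 1
            if run > best_len:
                best_start, best_len = start, run
        else:
            run = 0
    if best_len == 0:
        raise ValueError("no linear region found below the maximum power point")

    # Points covered by the selected windows, then one fit over all of them.
    lo, hi = best_start, best_start + best_len + width
    intercept, slope = ols_fit(voltage[lo:hi], current[lo:hi])
    return intercept, -1.0 / slope, hi - lo
```

Because the longest stable run is used rather than the run that starts at 0 V, a curved segment at the very start of the sweep (as in the connection-problem example) is left out of the fit automatically.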
V_oc and R_s are similarly calculated from the linear part of the I-V curve at voltages higher than that of the maximum power point. Here, we consider a regression model with voltage as the dependent variable and current as the independent variable. Let the change rate of current be the difference between the currents of two consecutive data points divided by the current at the lower voltage. We set a critical value and select the consecutive data points whose change rate of current is larger than the critical value. According to the definitions of V_oc and R_s, V_oc is estimated by the intercept of the fitted line, where I = 0, and R_s is estimated by the magnitude of its slope. Note that the number of selected data points for the fit of V_oc and R_s is typically 50-55 on the 500-point splined I-V curve.
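A hedged Python sketch of this step follows. It assumes the curve is ordered by increasing voltage and ends near V_oc, takes the trailing run of points whose relative current change exceeds the critical value, and regresses voltage on current. The function names and the default critical value of 0.05 are hypothetical, not values from the package.

```python
def ols_fit(x, y):
    """Ordinary least squares of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_voc_rs(voltage, current, critical=0.05):
    """Estimate V_oc and R_s from the trailing pseudolinear region.

    Returns (v_oc, r_s, number_of_points_used).
    """
    n = len(voltage)
    start = n - 1
    # Walk back from the last point while the relative change in current stays large.
    for k in range(n - 2, -1, -1):
        base = abs(current[k])
        rate = abs(current[k + 1] - current[k]) / base if base > 0 else float("inf")
        if rate <= critical:
            break
        start = k
    if n - start < 3:
        raise ValueError("too few points selected near V_oc")
    # Voltage is the dependent variable: V = intercept + slope * I.
    intercept, slope = ols_fit(current[start:], voltage[start:])
    return intercept, -slope, n - start
```

With voltage as the dependent variable, the intercept is the voltage at zero current, and the fitted slope is negative, so its sign is flipped to report a positive resistance.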
Finally, P_mp is calculated by finding the maximum product of current and voltage among the 500 (I, V) data point pairs of the splined curve, without any further fitting. FF is calculated as follows:

$$FF = \frac{P_{mp}}{V_{oc} \, I_{sc}} \tag{3}$$

In addition, the repeatabilities are calculated for the I-V features using a resampling method. For each iteration, we randomly select 90% of the data points and apply the extraction method to obtain I-V feature results. We repeat this 10 000 times to obtain 10 000 values of each extracted I-V feature. The standard deviation (SD) of all 10 000 iterations is calculated for each I-V feature, and the repeatability is defined as 100% − the SD%.
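The P_mp, FF, and repeatability calculations can be illustrated with the Python sketch below. Two points are assumptions rather than statements from the text: SD% is taken to be the standard deviation expressed as a percentage of the mean, and the extractor is passed in as a function so that any single feature can be tested. All names are hypothetical.

```python
import random
import statistics


def max_power(voltage, current):
    """P_mp as the largest product of current and voltage on the curve."""
    return max(v * i for v, i in zip(voltage, current))


def fill_factor(p_mp, v_oc, i_sc):
    """FF = P_mp / (V_oc * I_sc)."""
    return p_mp / (v_oc * i_sc)


def repeatability(voltage, current, extract, n_iter=10000, frac=0.9, seed=0):
    """Repeatability (%) of one feature under random 90% subsampling.

    extract: function (voltage, current) -> feature value
    Assumes SD% means 100 * SD / |mean| of the resampled feature values.
    """
    rng = random.Random(seed)
    n = len(voltage)
    k = int(round(frac * n))
    values = []
    for _ in range(n_iter):
        idx = sorted(rng.sample(range(n), k))      # keep the points in voltage order
        values.append(extract([voltage[j] for j in idx],
                              [current[j] for j in idx]))
    sd_percent = 100.0 * statistics.stdev(values) / abs(statistics.mean(values))
    return 100.0 - sd_percent
```

For example, `repeatability(v, i, max_power, n_iter=200)` gives a quick estimate for P_mp; the full 10 000 iterations only tighten the estimate of the SD.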
The data-driven I-V feature extraction method and functions are available as a free open-source R package, easily downloaded from the Comprehensive R Archive Network [34].

B. I-V Measurement and Datasets
To validate the data-driven I-V feature extraction method, we first use data simulated with the single-diode model to compare the given and extracted I-V features. We then demonstrate the method on the following three time-series I-V curve datasets from different sources, each with a different number of data points (or observations) in the I-V curve.
The Fraunhofer-ISE dataset [35], [36] consists of I-V time series from eight PV modules across three different locations, with two modules on Mount Zugspitze (in Germany, abbreviated as UFS, in the ET climate zone), three modules in Gran Canaria (in Spain, abbreviated as GC, in the BWh climate zone), and three modules in the Negev Desert (in Israel, abbreviated as NEG, in the BSh climate zone). Climate zones are classified by the Köppen-Geiger climate zone system [37]. BWh represents a hot desert climate, BSh is a hot semiarid climate, and ET is a polar climate [38]-[40]. Depending on the module, we have data for three to six years of outdoor exposure, with power readings taken every 2-3 min and an I-V curve measured every 10 min. The UFS data start from 2012, while data for GC and NEG start from 2010. There are a total of 2.2 million I-V curves, and each single I-V curve has 40-42 data points. These I-V curves were acquired using an ESL Solar 500 tracer made by ET Instrumente [41] in ambient conditions with varying irradiance and temperature.
The second dataset of I-V curves is from the SDLE SunFarm, a 1-acre outdoor test facility on the Case Western Reserve University campus in Cleveland, OH, USA. The facility has 122 individual PV power plants with microinverters and 32 PV modules connected to a DayStar Multitracer for acquisition of I-V and P_mp time series, with power readings taken every 1 min and an I-V curve measured every 10 min [42]. This dataset has I-V curves from a standard multicrystalline silicon aluminum back-surface field (Al-BSF) module and a passivated emitter and rear cell (PERC) monocrystalline silicon module, with nameplate ratings of 279 and 315 W, respectively. The I-V curves in this dataset have 180-200 data points and are acquired using a DayStar Multitracer [43] in ambient conditions with varying irradiance and temperature. In this paper, we randomly select one I-V curve from this dataset to demonstrate our method.
The third dataset of I-V curves includes three different brands of monosilicon Al-BSF modules, with ratings of 285, 280, and 285 W, undergoing an accelerated indoor sequential exposure test consisting of 500 h of damp heat exposure followed by 1000 cycles of dynamic mechanical loading (DH + DML sequential test), which is done stepwise to a total exposure of 4000 h of damp heat [44], [45]. In this dataset, each of the I-V curves consists of 3600-3800 data points. These I-V curves were acquired using a SPIRE 4600SLP flash tester [46] at standard test conditions (STC) (1 sun and 25 °C). In this paper, we randomly select one I-V curve from this dataset to demonstrate our method.

III. RESULTS
In this section, we conduct a simulation study to validate the data-driven I-V feature extraction method on I-V curves generated with the single-diode model. We then apply the method to the real-world I-V curves described above, acquired by different I-V scanning equipment, which produces different numbers of data points for each I-V curve.

A. I-V Curve Simulation Study
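The single-diode model assumes that the dark current can be described by a single exponential dependence modified by the diode ideality factor n [47]. The current-voltage relationship is given by

$$I = I_{ph} - I_0 \left[ \exp\!\left( \frac{V + I R_s}{n V_{th}} \right) - 1 \right] - \frac{V + I R_s}{R_{sh}} \tag{4}$$

where V and I are the terminal voltage in volts and current in amperes, I_ph (≈ I_sc) is the photogenerated current, I_0 is the diode reverse saturation current, and V_th is the thermal voltage. It is well known that (4) is an implicit transcendental equation, which in general cannot be solved explicitly for I and V using common elementary functions [48]. Exact explicit analytical solutions for I and V can instead be expressed using the Lambert W function, which is defined as the solution to the equation W(x) exp[W(x)] = x [6], [7], as follows:

$$I = \frac{R_{sh} (I_{ph} + I_0) - V}{R_s + R_{sh}} - \frac{n V_{th}}{R_s} \, W(\theta_I) \tag{5}$$

and

$$V = (I_{ph} + I_0 - I) R_{sh} - I R_s - n V_{th} \, W(\theta_V) \tag{6}$$

where

$$\theta_I = \frac{R_s R_{sh} I_0}{n V_{th} (R_s + R_{sh})} \exp\!\left[ \frac{R_{sh} \left( R_s (I_{ph} + I_0) + V \right)}{n V_{th} (R_s + R_{sh})} \right], \qquad \theta_V = \frac{I_0 R_{sh}}{n V_{th}} \exp\!\left[ \frac{(I_{ph} + I_0 - I) R_{sh}}{n V_{th}} \right]$$

and W represents the Lambert W function.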
To illustrate the robustness of our data-driven I-V feature extraction method, we generate an I-V curve with 1000 observations (data points) based on the single-diode model and then use the algorithm to calculate I-V parameters including I_sc, V_oc, R_sh, and R_s. Let N_c be the number of cells, which is included in V_th in (4), and let T_emp denote the temperature. Setting N_c = 60, T_emp = 25 °C, V_oc = 40.20 V, n = 1.5, I_sc = 8 A, R_sh = 600 Ω, and R_s = 0.48 Ω, an I-V curve is generated from (5) using a sequence of 1000 points in V from 0 to V_oc.
To the I values, we add random noise that follows a normal distribution with zero mean and the different SDs listed in Table I. Here, we generate I-V curves with noise levels between 0 and 0.02 A, which are typical noise levels for real-world I-V curves.
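The simulation setup can be reproduced approximately with the Python sketch below, which evaluates the explicit Lambert W solution for the current of the single-diode model and adds Gaussian noise. Several details are assumptions rather than statements from the text: I_ph is set equal to I_sc, I_0 is chosen so that the model current is zero at V = V_oc, the Lambert W function is computed with a simple Newton iteration, and all function names are hypothetical.

```python
import math
import random

K_B = 1.380649e-23       # Boltzmann constant, J/K
Q_E = 1.602176634e-19    # elementary charge, C


def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch W(x) for x >= 0, by Newton iteration on w*exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)                 # starting guess at or above the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def diode_current(v, i_ph, i_0, r_s, r_sh, a):
    """Explicit single-diode current at voltage v, with a = n * N_c * kT/q."""
    theta = (r_s * r_sh * i_0 / (a * (r_s + r_sh))
             * math.exp(r_sh * (r_s * (i_ph + i_0) + v) / (a * (r_s + r_sh))))
    return (r_sh * (i_ph + i_0) - v) / (r_s + r_sh) - (a / r_s) * lambert_w(theta)


def simulate_curve(n_points=1000, noise_sd=0.0, seed=0):
    """Generate (voltage, current) lists from 0 to V_oc with optional Gaussian noise."""
    n, n_cells, temp_c = 1.5, 60, 25.0
    v_oc, i_sc, r_sh, r_s = 40.20, 8.0, 600.0, 0.48
    a = n * n_cells * K_B * (temp_c + 273.15) / Q_E
    i_ph = i_sc                                          # assumption: I_ph = I_sc
    i_0 = (i_ph - v_oc / r_sh) / math.expm1(v_oc / a)    # assumption: I = 0 at V_oc
    rng = random.Random(seed)
    volts = [v_oc * k / (n_points - 1) for k in range(n_points)]
    amps = [diode_current(v, i_ph, i_0, r_s, r_sh, a) + rng.gauss(0.0, noise_sd)
            for v in volts]
    return volts, amps
```

Calling `simulate_curve(noise_sd=0.01)` gives one noisy realization; changing `seed` gives another with the same underlying curve.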
Table I shows the percent error and repeatability of four extracted I-V parameters for I-V curves generated using the diode model with the different levels of noise. We observe that the percent errors for I_sc and V_oc are low, which indicates that the data-driven I-V feature extraction method performs very well in feature estimation for I_sc and V_oc. For R_sh, the percent error increases with the noise level, with accuracy decreasing significantly at 0.015 A of noise on the simulated curve. For R_s, the extracted values are higher than the set values, as has been demonstrated previously [26], and the percent error is more than 61%. However, all extracted values are calculated with repeatability greater than 99.9%. Therefore, the data-driven I-V feature extraction method is a robust, practical, and easily implemented parameter extraction procedure for I-V curves.

B. Real-World I-V Curve Examples
1) Time-Series I-V Curves From the Fraunhofer-ISE Outdoor Dataset: We apply the data-driven feature extraction method to the dataset from Fraunhofer-ISE consisting of over 2 200 000 I-V curves. This dataset does not include nighttime values and was not filtered for this analysis. Table II shows the average percent difference between the extracted I-V features and the reported values for each module. Note that there are no reported values for R_sh in GC2 and NEG3. The percent difference is generally small for I_sc, V_oc, P_mp, and FF, with only two modules showing a large difference.
2) Single I-V Curves From Various Sources: Fig. 6 shows examples of splined (red) and original (black) I-V curves from three real-world datasets, each measured with unique equipment and having a different number of data points and inherent noise. The I-V features extracted from these three curves are given along with the accompanying reported values in Table III.
The I-V curve in Fig. 6(a) was selected from the Fraunhofer-ISE dataset, from module GC1. The temperature and irradiance at the time of measurement were 23.6 °C and 205.3 W/m², respectively. The I-V curve in Fig. 6(b) from the SDLE SunFarm was recorded for a 60-cell PERC module at 563.27 W/m² irradiance and 45.37 °C temperature. The I-V curve in Fig. 6(c) was taken on a SPIRE 4600SLP flash tester at STC for a commercial module that had undergone damp heat + dynamic mechanical loading indoor accelerated testing. For the three I-V curves shown in Fig. 6, the I-V feature values extracted with our proposed method agree with the reported values, with greater than 99.9% repeatability in all cases.

IV. DISCUSSION
A data-driven I-V feature extraction method to extract the solar cell I-V feature parameters has been developed. Previous studies typically used the diode model, or a combination of the diode model and statistical methods, to extract I-V features, but they did not consider the occurrence of multiple steps in I-V curves [27] or the curvature that appears in I-V curves because of measurement inaccuracies as the I-V curve tracer sweeps the voltage. Our proposed data-driven I-V feature extraction method corrects for the curvature issue and extracts the I-V features using a computationally efficient data-driven algorithm, which enables analysis of the massive numbers of I-V curves that are acquired as time-series datasets.

A. Accuracy and Repeatability of Extracted I-V Features
To illustrate the accuracy of our proposed data-driven I-V feature extraction method, we conducted a simulation study using diode-model-generated curves. The repeatabilities of all extracted I-V features are greater than 99.9%, as shown in Table I, which indicates that the data-driven I-V feature extraction method is robust in feature estimation. In the simulation study, the extracted I_sc, R_sh, and V_oc are very accurate compared with the values set in the single-diode model for curve generation. Note that the extracted R_sh becomes inaccurate when the noise reaches 0.015 A. Meanwhile, the value of the extracted R_s is approximately 62% higher than the set value. The percent deviation of the extracted R_s from its true value changes with the ideality factor of the diode model, indicating that the slope near V_oc is highly dependent on the cell recombination rates.
However, because the data-driven feature extraction method is highly repeatable using our automated algorithm, values of R_s produced this way are intercomparable. Care should nevertheless be taken in interpreting the absolute value of the extracted R_s, as it is an amalgamation of the actual R_s and recombination influences.

B. Robustness of Parameter Extraction From Different I-V Curve Sources
The time-series I-V data from the Fraunhofer-ISE dataset showed good agreement between extracted and reported values for most I-V features across 2 200 000 I-V curves. Extracted R_sh and R_s exhibited systematic differences from reported values, as expected based on prior studies comparing linear methods for determining these values, as discussed earlier.
Certain modules had a high percent difference for other parameters. One reason for a particular module's high percent difference may be atypically shaped I-V curves that are not adequately handled by traditional feature extraction methods. For example, a module with a poor electrical connection and an I-V curve like that shown in Fig. 5 would have consistently larger reported I_sc and lower reported R_sh than those obtained with our algorithm. We suspect this is the case for modules GC3 and UFS1. The difference in P_mp may be the result of splining, as the reported result uses the measured data points (40-42 data points) in I-V curves, while we use 500 data points, which are closer to the underlying I-V curves of the module.
The quality of I-V curve data has a strong influence on extracting the I-V features. In the Fraunhofer-ISE dataset, I-V curves have only 40-42 data points each, leading to inaccuracy in estimating R_sh, R_s, and P_mp on the original curves. We use a smoothing spline function to fit these data and generate 500 data points from this curve in order to make the result more accurate and repeatable. For the I-V curve data from the DH + DML Indoor Accelerated Test, which has 3600-3800 data points, we still generate 500 data points and find that the repeatability is 99.98%. Therefore, 500 data points are sufficient for accurately extracting I-V features from a range of data sources. In addition, I-V curves with different numbers of observations make it hard to set uniform criteria in the function (i.e., the critical value used to find the linear region), which would be problematic when dealing with a large number of I-V curves. By splining a curve to 500 data points, we can use uniform criteria and thus apply our algorithm more broadly. The repeatability for I_sc, V_oc, R_s, R_sh, P_mp, and FF is greater than 99.9% for all curves tested here; therefore, the technique is robust for highly varied sources of I-V curves.
Many I-V tracer systems also report values of the I-V features, but without disclosing the algorithms used to determine these parameters. This is the case for the Spire, DayStar Multitracer, and ESL systems used as the data acquisition sources in this paper. As a result, the meaning and accuracy of the feature parameters reported by this equipment are inherently obscured. By using a common analytical package based on open-source algorithms and code, one is able to analyze I-V curves from diverse instruments and arrive at I-V feature parameters with a common basis. This is an example of the strong scientific advantages of open-source software, code, and algorithms [49], [50].

C. Computational Efficiency of I-V Feature Extraction
For the Fraunhofer-ISE dataset, there are a total of 2.2 million I-V curves [35], [36]. The computation of the extracted I-V features took approximately 3 h on a single compute node of a High Performance Computing server managed with the Simple Linux Utility for Resource Management. The node has an Intel(R) Xeon(R) CPU E5-24500 @ 2.10-GHz processor, 24-GB memory, and 12 CPU cores × 2.69 GHz.
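As a rough worked figure, the reported totals translate into an average throughput as follows. This is simple arithmetic on the approximate numbers quoted above (2.2 million curves, about 3 h), not a measured benchmark, and it is an average for the node as a whole rather than a per-core figure.

```python
n_curves = 2_200_000          # total I-V curves in the Fraunhofer-ISE dataset
wall_time_hours = 3.0         # approximate reported run time

curves_per_second = n_curves / (wall_time_hours * 3600.0)
ms_per_curve = 1000.0 / curves_per_second

print(f"{curves_per_second:.0f} curves/s, {ms_per_curve:.1f} ms per curve")
```

This works out to roughly 200 curves per second, or about 5 ms per curve on average.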

V. CONCLUSION
In this paper, we have developed a data-driven I-V feature extraction method to extract features from I-V curves and calculate the repeatability of each I-V feature. Three different datasets have been used to demonstrate how this method can be applied in practice. Moreover, we have conducted a simulation study to illustrate the accuracy and reproducibility of the extracted I-V features by generating I-V curves from the single-diode model. Our proposed method performs very well in I-V feature estimation for I_sc, R_sh, and V_oc, while the estimation of R_s shows predictable error. All values are estimated with very high repeatability. Therefore, the data-driven I-V feature extraction method is an accurate, robust, and fast parameter extraction procedure for characterizing large volumes of PV module I-V data.

Fig. 1. Standard one-step I-V curve and five I-V features: I_sc, R_sh, P_mp, V_oc, and R_s.

Fig. 2. Detailed procedure of the data-driven method to calculate I-V features.

Fig. 3. Extracted I-V features for an example I-V curve with two steps.

Fig. 4. Change rate of the slope for regressions performed with a moving window of five points based on a standard I-V curve.

Fig. 5. (a) I-V curve showing curvature in the low-voltage region of the curve due to a module connection problem. (b) Change of slope has great variation in the front, remains stable for the linear part near I_sc, and then increases sharply. (c) Using the data-driven method, we are able to correct the I-V curve with curvature.

Fig. 6. (a) I-V curve with 41 data points from the Fraunhofer-ISE outdoor dataset. (b) I-V curve with 186 data points from the CWRU-SDLE SunFarm. (c) I-V curve with 3694 data points from the DH + DML Indoor Accelerated Test. The corresponding 500-point smoothing-spline I-V curves are shown in red.

TABLE I
PERCENT ERROR BETWEEN I-V FEATURE EXTRACTED RESULTS AND I-V FEATURE SET VALUES BASED ON DIFFERENT NOISE LEVELS
The repeatabilities are listed in parentheses.

TABLE II
AVERAGE PERCENT DIFFERENCE OF I-V FEATURE EXTRACTED RESULTS FROM FRAUNHOFER-ISE LABORATORY REPORTED VALUES FOR OVER 2 200 000 I-V CURVES BY MODULE

TABLE III
I-V FEATURE EXTRACTED RESULT WITH REPEATABILITIES FOR THREE SINGLE I-V CURVES FROM DIFFERENT SOURCES AND THE I-V TRACER EQUIPMENT REPORTED VALUES