Modeling Error Evaluation of Ground Observed Vegetation Parameters

To verify large-scale vegetation parameter measurements, the average value of sampling points from small-scale data is typically used. However, this method undermines the validity of the data because of the difference in scale or an inappropriate number of sampling points. A robust, universal error assessment method for ground measurements of vegetation parameters is, therefore, needed. Herein, we simulated vegetation scenarios and measurements by employing a normal distribution function and the Lindeberg–Levi theorem to deduce the characteristics of the error distribution. We found that the small- and large-scale error variations were similar among the theoretically deduced leaf area index (LAI) measurements. In addition, LAI was consistently normally distributed regardless of whether a systematic error or an accidental error was applied. The difference between observed and theoretical errors was highest in the low-density scenario (7.6% at < 3% interval) and lowest in the high-density scenario (5.5% at < 3% interval), while the average ratio between deviation and theoretical error of each scenario was 2.64% (low density), 2.07% (medium density), and 2.29% (high density). Furthermore, the relative difference between the theoretical and empirical errors was highest in the high-density scenario (20.0% at < 1% interval) and lowest in the low-density scenario (14.9% at < 1% interval). These data show the strength of a universal error assessment method, and we recommend that existing large-scale data of the study region be used to build a theoretical error distribution. Such prior work in conjunction with the models outlined in this article could reduce measurement costs and improve the efficiency of conducting ground measurements.

[...] challenges faced by modern plant biology [1]. Measurements of different vegetation parameters can help us understand genetic characteristics [2]. The utility and importance of terrestrial vegetation (including crop) parameters, such as leaf area index (LAI) and fractional vegetation cover (FVC), have increased in recent years [3]-[6]. There are two universally recognized methods for measuring these parameters: 1) remote sensing inversions [7]-[9] and 2) ground observations [10], [11]. Remote sensing inversion directly measures vegetation parameters at large scales (a few meters to hundreds of kilometers). However, due to the technical limitations of remote sensing (relatively narrow spatial and temporal resolution, and the uncertainty of methodology), it is often necessary to cross-validate these data with ground observations [12]-[14]. Meanwhile, as the extensive collection of phenotypic data remains onerous, there is often a focus on traits that are easy or inexpensive to measure, while more costly or difficult-to-score phenotypes are studied in only a few individuals [15], [16]. This approach is bound to create uncertainty about the true value of each vegetation parameter. In contrast to remote sensing, no universal methods exist for assessing the error of ground observations. Survey costs and land accessibility limit the extent to which ground observations can be made, so the data are normally extrapolated from the parameters calculated for small areas [17]-[19]. For these reasons, whole regions are rarely or never measured in their entirety, which means that sampling errors always exist.
Vegetation parameters contain two main error components [20]: a systematic error (SE) that varies with the attributes of different instruments or protocols, and the accidental error made by the surveyors [21]. While probability theories, such as likelihood theory [22] and Bayes theory [23], have been used to improve existing models, the chosen error assessment must be based on appropriate specificities for each model [24]. Analytical precision can be measured by analyzing replicates or in combination with sampling precision using a balanced design of sampling and analytical duplicates. However, there are no general methods for estimating sampling bias [25]. Besides, all the previous methods for error assessment were conducted after the measurement of vegetation parameters, so no expectation of the error distribution was available prior to field work [26], [27]. To address these issues, we took LAI as representative of vegetation parameters. The aims of this study are 1) to create an equation that can be used to estimate the error distribution for ground observations and 2) to test the deduced equation on a virtual scenario using the ground observations of LAI and compare the empirical and theoretical error distributions.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. DEDUCTION OF GROUND-BASED ERROR ASSESSMENT
In this section, we go through the deduction of the error distribution from the normal distribution theorem and the Lindeberg-Levi theorem (the latter is also known as the central limit theorem for independent and identically distributed variables). To the best of our knowledge, this is the first time that LAI measurements have been deconstructed (true LAI, SE, and accidental error) and combined with the Lindeberg-Levi equation to obtain theoretical distribution values. This method enables error distributions to be calculated, which is important for evaluating the accuracy of vegetation parameter measurements.
The Lindeberg-Levi theorem states that when independent samples are drawn from a variable with mean μ and standard deviation σ, the sample mean tends to be normally distributed as long as the sampling size (n) is adequately large [28], [29]. This is expressed as

$$\frac{1}{n}\sum_{i=1}^{n}\xi_i \sim N\left(\mu, \frac{\sigma^2}{n}\right) \tag{1}$$

where ξ_i is each measurement, n is the number of total sampling points, and N is a normal distribution. The LAI for each sampling point is expressed as

$$X = x + m + \varepsilon \tag{2}$$

where X is the LAI measurement for each sampling point, x is the true value of each data point (without any errors), m is the SE, and ε is the accidental error for each sampling point (irrespective of the gross error). The presence of the two errors leads to a certain degree of deviation from the value x. When ground measurements of LAI are sampled over large areas, it is likely that a variety of instruments and methods are used, which may induce SEs. We may calculate the average value of the LAI measurements with the following equation:

$$\bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j = \frac{1}{n}\left(\sum_{j=1}^{n} x_j + \sum_{i=1}^{q} f_i m_i + \sum_{j=1}^{n} \varepsilon_j\right) \tag{3}$$

where q is the number of different types of SEs, f_i is the number of sampling points under each SE, and m_i is the value of each SE. According to the statistical principle of accidental error, the mean of the errors converges to zero as the sample size increases [30]. Thus, regardless of vegetation attributes, (3) may be rewritten as follows:

$$\bar{X} - \bar{x} = \frac{1}{n}\sum_{i=1}^{q} p_i m_i \tag{4}$$

where \bar{x} is the mean of the true values and p_i is the number of sampling points for each SE. Equation (4) suggests that the difference between the true value x and the actual measured value X is equal to a weighted average of the SEs (the accidental error ε is eliminated during the averaging of the values).
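The averaging argument above can be checked numerically. The following is a minimal sketch, assuming an additive error model (measurement = true value + SE + accidental error), a single constant true LAI, and hypothetical SE groups and accidental-error spread; none of these numbers come from the study. It only illustrates that the accidental errors average out while the SEs leave a weighted offset.

```python
import random
import statistics

def simulate_mean_offset(true_lai, se_groups, eps_sd, seed=0):
    """Return (mean measurement - true LAI, weighted average of the SEs).

    se_groups: list of (number_of_points, systematic_error) pairs (hypothetical).
    eps_sd: standard deviation of the zero-mean accidental error (assumed normal).
    """
    rng = random.Random(seed)
    measurements = []
    for count, se in se_groups:
        for _ in range(count):
            # Assumed additive model: true value + SE + accidental error
            measurements.append(true_lai + se + rng.gauss(0.0, eps_sd))
    n = len(measurements)
    weighted_se = sum(count * se for count, se in se_groups) / n
    return statistics.fmean(measurements) - true_lai, weighted_se

offset, weighted_se = simulate_mean_offset(
    true_lai=5.09,  # illustrative value only
    se_groups=[(2000, 0.10), (3000, -0.05), (5000, 0.02)],
    eps_sd=0.5,
)
print(f"observed offset: {offset:.4f}, weighted SE: {weighted_se:.4f}")
```

With many sampling points, the observed offset of the mean should sit close to the weighted SE, because the accidental term shrinks with the square root of the number of points.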
Based on the above-mentioned equations, we use the normal distribution theorem and the Lindeberg-Levi theorem to deduce (5), in which the result follows a normal distribution that depends on the total sampling number (n). The mean is the true LAI minus the weighted average of the SEs, and the variance is σ²/n.
As this deduction was based on the vegetation density and research area, the distribution and forest type will not have an impact on the results; rather, the equation should be used as an error assessment tool. Equation (5) should be used to assess the distribution of the sampling error as well as the probability of different error intervals (deviation from the true LAI) based on the variety of SEs and the variation of LAI across the whole study region. If the variations of LAI at different scales are similar, we assume that the variation is the same for the whole region. In the simulated model given in the following, we test whether this assumption holds.
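To illustrate how a normal error distribution translates into probabilities for different error intervals, the sketch below computes the chance that the averaged measurement falls within a relative tolerance of the true LAI. It assumes the averaged measurement is normal with standard deviation σ/√n; the function names, the `bias` parameter (a signed stand-in for the weighted SE term), and all numeric inputs are hypothetical and chosen only for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(true_lai, sigma, n, rel_tol, bias=0.0):
    """P(|mean measurement - true LAI| < rel_tol * true LAI).

    Assumes the mean measurement is normal with mean true_lai + bias
    (bias is a signed, hypothetical weighted-SE term) and standard
    deviation sigma / sqrt(n).
    """
    se_mean = sigma / math.sqrt(n)
    half_width = rel_tol * true_lai
    upper = (half_width - bias) / se_mean
    lower = (-half_width - bias) / se_mean
    return normal_cdf(upper) - normal_cdf(lower)

# Illustrative inputs only: sigma and n are not taken from the study.
for tol in (0.01, 0.03, 0.05):
    p = prob_within(true_lai=5.09, sigma=2.0, n=100, rel_tol=tol)
    print(f"P(error < {tol:.0%}) = {p:.3f}")
```

Under these assumptions, a larger n or a smaller σ raises the probability for every interval, while a nonzero bias lowers it.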

III. THEORETICAL VALIDATION OF ERROR ASSESSMENT
To validate (5), we used the MATLAB software to create three scenarios with low-, medium-, and high-density vegetation levels (representing LAIs of 2.54, 5.09, and 8.09, respectively). Building a single tree requires three inputs: trunk position, trunk height, and branch length. In previous studies, researchers found that the position and height of trees in homogeneous forests tend to follow the normal and Poisson distributions, respectively [31], [32]. Based on these assumptions, we created two groups of random numbers representing tree height and tree position. Then, for each group, three numbers were randomly selected to create a low-density scenario. Meanwhile, five and eight numbers were selected to create the medium- and high-density scenarios, respectively. Branches (left/right) were created using (6) and (7). The tree trunk was set perpendicular to the ground at an angle of α = π/2, and the relationship between the angles of the upper branches (α_{n+1}) and the lower branches (α_n) is described by (6), where L is left and R is right. The relationship between the length of the upper branches (LOB_{n+1}) and the lower branches (LOB_n) is described by (7). Relevant parameters [33] were used to create the three scenarios (low-, medium-, and high-density levels) using MATLAB, and each scenario consists of pixels (see Fig. 1) of either vegetation or no vegetation. The true LAI for each scenario was calculated as the ratio of the number of vegetation pixels to the number of horizontal pixels across the breadth of the scenario.
In each scenario, the observed value of LAI was calculated by randomly selecting vertical lines: the number of vegetation pixels contained in each line was summed and divided by the number of lines. Computer simulations were used to calculate the error distribution for different error intervals to verify the validity of the theoretical deduction.
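The sampling procedure described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical binary scene in which each pixel is vegetation with a fixed probability; it does not reproduce the tree-building model of the study, and the function names and scene dimensions are invented for the example.

```python
import random

def make_scene(width, height, cover_prob, seed=0):
    """Build a hypothetical binary scene: 1 = vegetation pixel, 0 = none."""
    rng = random.Random(seed)
    return [[1 if rng.random() < cover_prob else 0 for _ in range(width)]
            for _ in range(height)]

def true_lai(scene):
    """Vegetation pixels divided by the number of horizontal pixels (scene width)."""
    width = len(scene[0])
    return sum(sum(row) for row in scene) / width

def observed_lai(scene, n_lines, rng):
    """Average vegetation-pixel count over randomly chosen vertical lines."""
    width = len(scene[0])
    columns = [rng.randrange(width) for _ in range(n_lines)]
    total = sum(row[c] for c in columns for row in scene)
    return total / n_lines

scene = make_scene(width=500, height=40, cover_prob=0.12)
rng = random.Random(1)
print("true LAI:", round(true_lai(scene), 3))
for n_lines in (10, 50, 250):
    print(n_lines, "lines ->", round(observed_lai(scene, n_lines, rng), 3))
```

Repeating `observed_lai` many times for a fixed number of lines would give an empirical error distribution that can be compared with a theoretical one.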
The variation of the observed LAI was stable at certain pixel scales, with the lowest variability across sampling sizes at the medium scales (see Fig. 2), suggesting that our assumption (small- and large-scale error variations are similar) held. Based on the estimated LAI variation in the whole study region, (5) was used to calculate the error distribution of LAI measurements.
The error distribution of LAI was calculated using a normally distributed sequence to represent the accidental error, together with five types of SE, with the number of SEs ascending from 50 to 130 (see Table I). Using (5), we found that the measured average LAI was consistently normally distributed regardless of whether an SE or an accidental error was applied (see Fig. 3). In addition, the differences between the observed and theoretical errors were highest in the low-density scenario (7.6% difference when the error interval was <3%) and lowest in the high-density scenario (5.5% difference when the error interval was <3%) (see Fig. 4). The average ratio between the deviation and theoretical errors of each scenario was 2.64% (low density), 2.07% (medium density), and 2.29% (high density). The average percentage of empirical error located in one interval is, therefore, in the region of about 2% of the theoretical value.
Furthermore, the relative difference between theoretical and empirical error (when error interval was <1%) was highest in the high-density scenario (20.1% with four kinds of SE) and lowest in the low-density scenario (0.17% with one kind of SE). The average deviation from the mean between the theoretical and empirical errors in each scenario was 5.9% (low density), 4.1% (medium density), and 4.9% (high density) (see Fig. 5).

IV. DISCUSSION
In the field, different research plots often have different vegetation conditions (such as the mean LAI and the deviation of LAI in different subplots). However, it is hard to obtain the real SE of measurements in each location, because SE is not only caused by different instruments but also influenced by the measuring behavior of each surveyor. For this reason, we used real data to validate the error distributions that may have been created during fieldwork using different sampling sizes, excluding SE.
The data were gathered from an old secondary forest [...]

[Figure caption (partial): the abscissa shows the ratio between the deviation and the theoretical LAI; for example, <1% means that the measurement error is less than 1%. The y-axis is the percentage of sampling points in one error interval relative to the total number of points. The gray area indicates the difference between the empirical error and the theoretical error (each dashed line). Every measurement scenario was simulated 1000 times.]

We can see in Fig. 6 that as the sampling quantity increases, the measured LAI moves closer to the mean LAI of the plot (which we assumed to be the true value). Meanwhile, for the plot with the highest deviation of LAI, a higher number of samples is needed to reduce the deviation. For example, three sampling points in BOB-03 provided a result close to the mean value, while the deviation of 13 sampling points in KOG-05 was still high (about 1). The LAI in BOB-03 converged to the mean faster than in NXV-02 and KOG-05. In Fig. 6(b), the measurements in BOB-03 are more concentrated around the mean LAI no matter how many sampling points we used. This suggests that the sampling quantity needed in one research plot to reach a specific error requirement should be decided by the deviation among subplots rather than by the average LAI of the whole plot. This is particularly important when working with different habitat types, such as the forest plot in BOB-03, which has a more homogeneous canopy.

Errors occur in all geographical measurements regardless of what instruments and methods are implemented [34], [35]. In addition, both the number of sampling points and the sampling location will have an impact on the measurement output. Data collected at the small scale are often used to verify large-scale data or to examine the relationship between the statistical precision of sample estimates and plot size [36]-[38].
While these methods have been used as error assessments, it is now clear that measurements at different scales are likely to undermine the validity of the data. For example, based on (4) and (5), when the true value of LAI in a study area is 8.09, there is a 50% chance that the deviation will be greater than 5% if the sampling size is 50. However, if the sampling size is 450, there is a high chance that the deviation will be less than 5%. Therefore, better understanding and control of the error during sampling will benefit the analysis of relationships between the measured value and the true value of different vegetation parameters.
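The sample-size comparison above can be related through a normal model. The sketch below assumes zero SE and back-solves a hypothetical LAI standard deviation from the 50% statement at a sampling size of 50 (the value actually used in the study is not given here); under those assumptions it evaluates the probability that the deviation stays below 5% for a given sampling size. The function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def prob_dev_below(true_lai, sigma, n, rel_tol):
    """P(|mean - true LAI| < rel_tol * true LAI) for an unbiased normal mean."""
    z = rel_tol * true_lai * n ** 0.5 / sigma
    return 2.0 * STD_NORMAL.cdf(z) - 1.0

def sigma_from_probability(true_lai, n, rel_tol, prob):
    """Back-solve sigma so that prob_dev_below(...) equals prob."""
    z = STD_NORMAL.inv_cdf((1.0 + prob) / 2.0)
    return rel_tol * true_lai * n ** 0.5 / z

# Assumption: a 50% chance of exceeding 5% at n = 50 fixes sigma (zero SE).
sigma = sigma_from_probability(true_lai=8.09, n=50, rel_tol=0.05, prob=0.5)
for n in (50, 450):
    p = prob_dev_below(8.09, sigma, n, 0.05)
    print(f"n = {n}: P(deviation < 5%) = {p:.3f}")
```

Because the standard deviation of the mean scales with 1/√n, raising the sampling size from 50 to 450 triples the standardized tolerance, which under these assumptions lifts the probability from 0.5 to about 0.96.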

V. CONCLUSION
In this article, we have demonstrated the reliability and applicability of error assessment in LAI ground observations, where any deviation of the error distribution could be due to either the number of sampling points or the process of averaging variation across different scales. Moreover, this method incorporates both the SE and the accidental error into the evaluation system, making the results more reliable.
To deal with the errors that occur in the field, we should not only focus on improving the accuracy of instruments [39], [40] and the authenticity of models [41], [42] but also pay attention to the distribution of errors. Overall, our error assessment method has two advantages over other models: 1) error assessment is conducted before measurement, which gives surveyors an expectation of the error distribution and 2) the method is adaptable and flexible because the theoretical error distribution for each study area is calculated from its vegetation distribution.
For future ground-observed measurements, we recommend that the average variation of LAI at the large scale (e.g., using a drone or satellite imagery) be calculated or that the variation of previous vegetation parameter observations in the area be acquired. These data may thereafter be used to build a theoretical error distribution. Prior knowledge is key to producing more accurate estimates, and researchers are encouraged to use an appropriate number of sampling points to reasonably meet the error requirement. Such prior work in conjunction with the models outlined in this article could reduce the measurement costs and improve the efficiency of conducting ground measurements.