Parallel Computing for Obtaining Regional Scale Rice Growth Conditions Based on WOFOST and Satellite Images

It is very important to obtain continuous regional crop parameters efficiently in the agricultural field. However, remote sensing data can provide spatial-continuous / temporal-disperse crop information while crop growth model can provide temporal-continuous / spatial-disperse crop information. Therefore, the assimilation between crop growth model and remote sensing data is an efficient way for obtaining continuous vegetation growth information. This study aims to present a parallel method based on graphic processing unit (GPU) to improve the efficiency of the assimilation between RS data and crop growth model to estimate rice growth parameters. Remote sensing data, Landsat and HJ-1 images, were collected and the World Food Studies (WOFOST) crop growth model which has a strong flexibility was employed. To acquire continuous regional crop parameters, particle swarm optimization (PSO) data assimilation method was used to combine remote sensing images and WOFOST and this process is accompanied by a parallel method based on the Compute Unified Device Architecture (CUDA) platform of NVIDIA GPU. With these methods, we obtained daily rice growth parameters of Zhuzhou City, Hunan, China and compared the efficiency and precision of parallel method and non-parallel method. Results showed that the parallel program has a remarkable speedup (reaching 240 times) compared with the non-parallel program with a similar accuracy. This study indicated that the parallel implementation based on GPU was successful in improving the efficiency of the assimilation between RS data and the WOFOST model.


I. INTRODUCTION
Data assimilation, which is of momentous theoretical and application values, has been successfully applied to numerous fields, such as atmosphere [1]- [3], weather [4]- [6], agriculture [7]- [9], ocean [10]- [12], land surface [13]- [15] and hydrology [16]- [18]. Assimilating remote sensing (RS) information into a crop model can provide continuous regional plant growth information and has been applied to a wide range of agricultural fields in crop growth assessment, The associate editor coordinating the review of this manuscript and approving it for publication was Wenming Cao . agricultural environmental control and farmland management decision [19]- [23].
The assimilation between RS data and crop growth model regards RS information (continuous in spatial scale and discrete in time scale) as the measured variables. In addition, the model's simulating process is optimized with an assimilation algorithm to obtain vegetation information successive in spatial and temporal scales [24]. On the basis of various RS data, crop growth models and assimilation algorithms, researchers have conducted numerous studies on the different applications of assimilating RS information into crop models [20], [22], [25]- [27]. Huang [28] conducted VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ an experiment on regional winter wheat yield forecasting with moderate-resolution imaging spectroradiometer-leaf area index (MODIS-LAI) products on the basis of assimilation between the World Food Studies (WOFOST) model [29]- [32] and ensemble Kalman algorithm method [33]- [36]. Their results indicated a remarkable improvement compared with statistical yield, which provides a reliable approach for predicting regional crop yield. Jin et al. [37] developed an improved assimilation method with Landsat 8 images on the basis of the WOFOST model and particle swarm optimization (PSO) algorithm [38]- [42] for the efficient assessment of heavy metal stress levels in rice. Previous studies on the assimilation of the crop growth model and RS data often focus on its precision but not on its efficiency. However, when the study scale expands, current methods for assimilation cannot satisfy the needs for rapidly obtaining continuous accurate crop growth information in a large scale. The first need involves the tremendous amounts of data in the assimilation. RS images reflect vast earth surface information and their data volume is often large. With the improvement of the image's spatial and temporal resolution and the extension of the study area, the number of pixels to assimilate will increase quickly and the calculation volume will be enormous. The second need includes computationally expensive algorithms. The crop growth model quantificationally simulates the complex growth processes, such as the crop development process, CO2 assimilation, respiration and leaf area simulation [43]- [47]. In addition, the assimilation algorithms generally require considerable iterations to optimize the model's simulating process and must conduct a series of calculations in each iteration [48].
In recent years, graphic processing unit (GPU)-based highperformance computing technology has been increasingly applied in large-scale computing scenarios, such as mathematical calculations, image processing, computational biology and chemistry and fluid dynamics simulation [49]- [55]. Early works on GPU-based processing for RS data have been conducted by numerous researchers [56]- [62]. Wei and Huang [63] explored a GPU-based implementation of the extended Kalman filter. The Compute Unified Device Architecture (CUDA) on the Nvidia GeForce GTX 590 GPU was used. The speedup reached 1386x and the parallel implementation of extended Kalman filter will serve as good reference on real large-scale applications. Blattner and Yang [64] utilized the CUDA programming framework in numerical weather prediction by the data assimilation method, local ensemble transformed Kalman filter algorithm. Results show that an improvement of 72.1× speedup and provide attractive evidence for applying CUDA GPUs to high demanding scientific computation realms. Zheng et al. [65] first use GPU and CUDA technology on RRTM (Rapid Radiative transfer module) long-wave radiation module of GRAPES (Global and Regional Assimilation and Prediction System) model for parallel processing. The results show that the parallel computing algorithm is correct, stable and efficient for operational implementation of GRAPES in near future. The assimilation calculations for each rice pixel on the RS image are not only consistent but also mutually independent; thus, the parallel implementation is feasible.
In this study, we develop a highly efficient GPU-based parallel program for the assimilation of the WOFOST model and RS images with the PSO algorithm on the CUDA platform. We input the most time-consuming work (each rice pixel's assimilation with the WOFOST model) to a specific thread on the GPU and concurrently execute all threads. The execution time and assimilating precision of serial and parallel programs are compared. With the parallel assimilating results, we simulate the spatial distribution of three important rice growth parameters, LAI, root weight (WRT) and panicle weight (WSO), on a large scale.  Figure 1). This area has a subtropical monsoon humid climate with adequate illumination, annual average temperature of 15.5-25 • C and annual precipitation of 1250-1500 mm, which is suitable for crop growth. Zhuzhou City is an important food base and its grain output is higher than those of other regions in Hunan Province. Fast and accurate monitoring of crop growth information in this region is important.

B. DATA PREPARATION
The main data of this work contain two parts: RS data for deriving measured assimilating observations and meteorological data for external input data of the WOFOST model. In this study, Landsat 7 (Enhanced Thematic Mapper Plus (ETM+) sensor) and HJ-1A/B (charge-coupled device (CCD)) satellite imagery were jointly used to generate measured LAI (MLAI) series [66]. Table 1 shows the parameters of these images. Radiometric calibration and atmospheric correction were conducted for these two types of images. The radiometric calibration for CCD images was based on the absolute radiation calibration coefficient of HJ-1A/B, released by the China Resources Satellite Application Center. In addition, the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes model was adopted for the atmospheric correction of all images. The corrected rootmean-square error was less than 0.5 pixels. All images were downloaded from the United States Geological Survey (http://glovis.usgs.gov/) and China Centre for Resources Satellite Data and Application (http://www.cresda.com). The meteorological data included the daily maximum air temperature (TMAX), daily minimum air temperature (TMIN) and daily solar radiation (AVRAD) from June 1 to September 30 in 2015, 2016 and 2017. They are obtained from China Meteorological Data Sharing Service System. As the spatial scale of the study area is not too big, according to the law of Geographic Similarity, it is reasonable to use the data from one meteorological station (Zhuzhou Station) to represent the meteorological factors of the whole study area.
In this work, we select LAI as the assimilative observation because it is an important index that reflects the growth status of plant population and a primary output result of the WOFOST model. The measured rice LAI series are calculated with RS images. In this study, we extract rice region in the study area with a random forest algorithm [67]- [70] and then calculate rice LAI through its NDVI obtained from RS images. Previous studies have shown that the exponential empirical model between LAI and NDVI in this area is practicable and most widely used [71]. The empirical model was established by using the local measured LAI data and NDVI conversion index in Zhuzhou City. And its R2 is 0.84 and the MSE is 0.06, which is acceptable in the actual data assimilation process. Their transformational relationship is expressed in the following equation: Then, we can obtain the MLAI series. We extract a total of 401296 rice pixels with a valid LAI in the study area. Figure 2 shows the overall technical roadmap of this work. The parallel implementation of the assimilation between RS images and the WOFOST model with the PSO algorithm mainly includes two parts: retrieving MLAI from RS images in the Environment for Visualising Images (ENVI) on the CPU and assimilating all rice pixels' MLAI with the simulative LAI (SLAI) series calculated from the WOFOST model synchronously on the GPU. The lower part of Figure 2 presents each rice pixel's assimilating process, i.e. each thread's computational task. Each thread's output result is an optimised leaf senescence index, which is an input parameter of the WOFOST model.

A. ASSIMILATION BETWEEN THE WOFOST MODEL AND RS DATA
Each thread's computational task includes each rice pixel's assimilation between SLAI calculated from the WOFOST model and MLAI retrieved from RS data, which is the most time-consuming in the assimilating process. The methods for obtaining MLAI from satellite images are previously discussed. The methods for obtaining SLAI from the WOFOST model and assimilation are presented as follows.
The assimilation between the WOFOST model and RS data involves obtaining the model's optimal input parameters for each rice pixel on the RS image through the assimilation algorithm. The input parameters to be optimised must satisfy two conditions: (1) difficult to obtain in a regional scale and (2) closely related to the output results of WOFOST. Considering these principles, we selected SPAN as the input parameter to be optimised. SPAN (day) is the leaf senescence index, which refers to the life cycle of leaves at 35 • C. Generally, SPAN ranges from 17 to 50 [72]. As the exact VOLUME 8, 2020 value of SPAN for each pixel is difficult to determine due to the individual differences of rice [73], we select a random value in its range as the initial input of the WOFOST model. The meteorological data and initial randomised SPAN are the input data of the WOFOST model, and the output result is an interannual LAI series at the time step of 1 day. Then, we can extract the SLAI at the acquisition time of images. In this study, the meteorological data from one weather station is used to represent the entire study area, because the area of this study is relatively small. The crop data in the WOFOST model includes: specific leaf area, net photosynthetic rate, dry matter distribution coefficient, etc. Since the research object of this article is rice, the rice varieties in this area are relatively uniform, so uniform crop parameters are used in this study area.
However, a certain error occurs between SLAI and MLAI. The PSO algorithm will reduce this error by adjusting SPAN through considerable iterations. In the PSO algorithm, the error is calculated as follows: where N is the number of RS images in a year; LAI m and LAI s denote the MLAI and SLAI, respectively; and C represents the degree of deviation between LAIm and LAIs. When C reaches its minimal mean, the SLAI series are closest to the actual LAI, and the corresponding SPAN is optimised. Finally, each rice pixel's daily growth parameters are calculated with the rectified WOFOST model (the optimum SPAN as its input parameter). Then, we can continuously analyse the rice growth information on a regional scale.

B. NVIDIA GPU ARCHITECTURE
We have utilized the Pascal architecture-based GPU (Tesla P40) with CUDA to realize the parallel implementation of the assimilating process. A brief introduction of the Pascal architecture and CUDA program model is presented as follows.

1) PASCAL ARCHITECTURE
The GPU architecture is constructed by using extensible arrays with a streaming multiprocessor (SM). The hardware is parallelized on the GPU by copying the SM. The Pascal architecture produced by NVIDIA in 2016 is a quick, efficient and high-performance computing architecture. Similar to previous Tesla-class GPUs, the Pascal architecture is composed of an array of graphics processing clusters (GPCs), texture processing clusters (TPCs), SMs and memory controllers. A full Pascal architecture consists of six GPCs, 60 Pascal SMs, 20 TPCs (each with two SMs) and eight 512-bit memory controllers. Each GPC has 10 SMs. Each SM has 64 CUDA cores and four texture units. The total number of cores loaded in NVIDIA's Tesla P40 GPU is 3840. Each memory controller is attached to 51 KB of L2 cache [74].

2) CUDA PROGRAMMING MODEL
CUDA is a universal parallel computing platform and programming model and is an extension of the C programming language.
In the CUDA programming model, the CPU is the host terminal, whereas the GPU is the device terminal. Developers can stipulate computation tasks to be parallelized into a kernel function, which is invoked on the host terminal and executed on the device terminal. Before the CPU calls a kernel function, developers have to allocate a memory on the GPU for data sets used in kernels and copy them from the CPU to the GPU. After a kernel finishes, users have to copy the output results from the device to the host [75].
The organization of threads on the GPU is the key for accelerating large-scale data processing. To manage the threads on the GPU further, CUDA has introduced an abstract concept, namely, thread hierarchy. Figure 3 depicts the two levels of thread hierarchy (thread grids and thread blocks [75]. When a kernel is activated, all generating threads form one grid. One grid consists of multiple thread blocks, and one thread block includes a group of threads. The organization of threads involves grouping all threads in one grid. We can artificially set the number of threads in one block and the number of thread blocks in one grid for our program to obtain optimal performance [74].

C. PARALLEL IMPLEMENTATION APPROACHES FOR THE ASSIMILATION
In the parallelisation of assimilating RS information into the WOFOST model, NVIDIA Tesla P40 based on Pascal architecture GPU and Xeon E5-2620 CPU is used for the hardware environment. Tesla series is specially designed for large-scale parallel computing.  Ubuntu Linux system, and the CUDA codes (.cu) are written in C++ language and compiled with NVCC.

1) THREAD ORGANISATION METHODS ON GPU
MLAI retrieved from satellite images in a year is the main data set and is saved in a 2D array. We assign all rice pixels' LAI values of one image into a 1D vector from top to bottom, from left to right, as one column of MLAI. Moreover, the rice pixels' locations in a 2D image are saved, which will be used in the subsequent simulation of the spatial distribution of rice growth parameters. Thus, the length of MLAI equals the total number of rice pixels in our study area, whereas its width equals the number of RS images obtained in 1 year. Each row represents the MLAI series of one rice pixel. We extract a total of 401296 rice pixels with valid LAI values in the study area, and the length of MLAI is greater than the width. Hence, although the data set is a 2D array, we organize all threads in a 1D manner, and the assimilation for each rice pixel is allocated to a thread ( Figure 4). In the kernel function, we can use the thread index to extract the MLAI series of each thread's corresponding rice pixel. As we organized all threads in a 1D manner, the relationship between the index (idx) of one rice pixel's MLAI series and the thread identifiers is as follows: where threadIdx and blockIdx are the indexes of thread and block, respectively, and blockDim indicates the block size, which is equal to the number of threads in one block.

2) PARALLEL PROGRAM DESIGN
In this work, PSO_WOFOST is set as the kernel function, which indicates that all threads allocated on the GPU will execute the same PSO_WOFOST function with different input data. The prototype is presented as follows: __global__ void PSO_WOFOST (double * MLAI, double * TMIN, double * TMAX, double * AVRAD, long rand, double * gbest) PSO_WOFOST is responsible for each rice pixel's assimilating process with the PSO algorithm, i.e. the calculation of optimal SPAN. The main steps of PSO_WOFOST are as follows: . Thread organization in a 1D manner. Each row under MLAI stores each rice pixel's MLAI series. N is equal to the total number of rice pixels in our study area. Each rice pixel's assimilating process is allocated to one specific thread on the GPU.
Step 1: Obtain the index of one rice pixel's MLAI series with thread identifiers and extract the rice pixel's MLAI series; Step 2: Randomly initialize SPAN in its range; Step 3: Input randomized SPAN and meteorological data into the WOFOST model to calculate SLAI; Step 4: Calculate the cost function value (C) with SLAI and MLAI through Eq. (2).
Step 5: Assess whether C reaches the minimum. If yes, output the corresponding SPAN to the gbest in accordance with the index in Step 1; if no, adjust SPAN and return to Step 3. Before kernel function is activated, the data to be transferred from the host terminal to the device terminal include MLAI, TMIN, TMAX and AVRAD. MLAI stores all rice pixels' LAI in a year and TMIN, TMAX and AVRAD store the meteorological data. After the kernel function is finished, all rice pixels' optimal SPAN must be transferred from the GPU to the CPU. gbest stores all rice pixels' SPAN. Transferring such a large data between the GPU and CPU is timeconsuming. Thus, to reduce the transferring time, we allocate a pinned memory for the data to be transferred directly instead of a pageable memory on the host terminal. This process will reduce the time of copying data from pageable memory to pinned memory (the GPU can only access the host pinned memory) [74]. We can use the following codes to allocate pinned memory for MLAI. //Allocate pinned memory, nbytes is the size of MLAI cudaMallocHost ((double * * ) &MLAI, nbytes); As the methods of memory allocation for variables in the parallel program, the local variables declaring inside the kernel function are located in registers. Given the size of these five datasets are relatively larger and each thread (a pixel's assimilation process) needs to obtain corresponding data from MLAI, TMIN, TMAX AVRAD and assign assimilation result to gbest, we allocated global memory for them in our parallel program. VOLUME 8, 2020

IV. RESULTS
This section may be divided by subheadings. It should provide a concise and precise description of the experimental results, their interpretation as well as the experimental conclusions that can be drawn.

A. PRECISION ASSESSMENT OF ASSIMILATION IMPLEMENTED ON GPU
We regarded the discrepancy (C) between each rice pixel's MLAI retrieved from satellite imagery and SLAI calculated with the WOFOST model as the standard for measuring the assimilation effect, which is calculated in Eq. (2). We evaluated the assimilation precision in three cases: a) SLAIe, calculated from the WOFOST model that sets experiential SPAN (50) as the input parameter; b) SLAIc, calculated from the WOFOST model that sets the optimized SPAN of the assimilation implemented on the CPU as the input parameter; and c) SLAIg, calculated from the WOFOST model that sets the optimized SPAN of the assimilation implemented on the GPU as the input parameter. However, the comparison of assimilation precision for all rice pixels in our study area is hardly possible to conduct. Thus, we selected six sites within the scope of the study area to compare their assimilation precision. As previously described, we extracted a total of 401296 rice pixels in each RS image and organized them in one dimension. Similarly, the assimilating results, namely, gbest (optimized SPAN for each rice pixel), were stored in one dimension. We divided the gbest equidistantly into six parts and extracted one optimized SPAN in each part. Six optimized SPANs and meteorological data are inputted into the WOFOST model to calculate the SLAIc and SLAIg of the six sites. The results of comparing these six sites' precision represent the assimilation precision of the entire study area. Figure 5 shows the locations of six sites and comparison results of the assimilation precision.
First, SLAI calculated from the WOFOST model that sets the assimilating result (optimized SPAN) as the input parameter is closer to the actual LAI than SLAI calculated from the WOFOST model that sets the experiential SPAN as the input parameter, whether the assimilation is implemented on the GPU or CPU. However, the precision of the assimilation implemented on the GPU is slightly inferior to that on the CPU. The reason is that the GPU is often used for parallel tasks with simple control logic and calculations and poorly handles programs that require high-precision results, whereas the CPU performs better when executing extreme complex programs. Nevertheless, numerous complex floating-point calculations and logical operations in the assimilating process are presented.

B. EVALUATIONS OF GPU PERFORMANCE IN COMPUTATIONAL TIME GAINS
The serial program includes individually executing all rice pixels' assimilation on the CPU, which is the same with the parallel program. However, when the data size (number of rice pixels) is greater than 10000, the memory space on the CPU will be insufficient for storing the variables in the assimilating process. Therefore, we divided all the datasets (401296 pixels) into 41 batches. Each batch size is 10000 and the size of the last one is 1296. When the data size is over 10000, the execution time of different size data sets is the computation time for all batches within the corresponding data set in Table 3. Table 3 compares the computation times of serial and parallel programs in handling different data sizes (number of rice pixels). Distinctly, the parallel program executed by the GPU for the assimilation considerably outperforms the serial program executed by the CPU. The major reason for the remarkable acceleration is the massive parallelism implemented by a large number of different threads on the GPU that synchronously executes the assimilation processes. However, when the data size is less than 100 pixels, the execution time of the serial program is less than that of the parallel program. This phenomenon results from the frequent memory transmission between the GPU and CPU, which can weaken the parallel program's performance. Table 3 shows that the speedup increases with the increase of input data size. The maximum Speedup in this work reaches near 240 times.

C. SPATIAL DISTRIBUTION OF RICE GROWTH PARAMETERS IN A LARGE SCALE
Each rice pixel's optimized SPAN can be used as the input parameter of the WOFOST model to simulate rice LAI daily. WOFOST can also simulate important growth parameters other than LAI, such as WRT and WSO, which can commendably reflect the rice growth conditions. All rice pixels' LAI, WRT and WSO values are reorganized in accordance with the previously saved locations of all rice pixels in the 2D image. As shown in Figure 6, we simulated the spatial distribution of rice LAI, WRT and WSO within the entire study area at several key time nodes during the rice growth season. Figure

V. DISCUSSION
We mainly improved the efficiency of the assimilation between the WOFOST model and satellite imagery by parallel implementation based on the GPU with CUDA platform. We inputted the most time-consuming work (all rice pixels' assimilation) into thousands of GPU threads and concurrently start all threads to reduce the execution time. We compared the precision and execution times of the assimilation implemented on the CPU and GPU. Although the assimilation precision performance of the GPU is slightly lower than that of the CPU, the gains in the computational time aspect, such as the huge acceleration (240 times) from the GPU's parallel implementation, are sufficiently excellent for application. With the highly efficient parallel assimilating results, we simulated the spatial distribution of rice growth conditions in a large scale.
A comparison with other GPU-based image processing works shows differences in two main aspects. The first aspect involves the manners of thread organization on the GPU. General works for image processing must assign specific computational tasks for all pixels on the image, which are inherently planar, so the 2D thread organization is further suitable. Assimilation for all pixels on the RS image was not required to obtain the rice growth conditions in our study area but not for all rice pixels. We extracted all rice pixels and input rice LAI in one RS image into a 1D vector. Hence, we adopted a 1D manner to organize all threads on the GPU. Second, the model calculations of each thread in this work are considerably more complex than those in other image processing works. As discussed in Section 3, the kernel function PSO_WOFOST consists of numerous intermediate variables, floating-point calculations, iterations and judgment statements, which prevent the speedup of the assimilating progress. We simplified the PSO_WOFOST as much as possible, for instance, removing the unnecessary variables and calculations in the WOFOST model. However, the execution of WOFOST and the PSO algorithm is complex and timeconsuming. The optimization and acceleration of each rice pixel's assimilating algorithm still require further work.
The parallel implementation of assimilation can be optimized in numerous aspects in the future. For example, on the basis of different usages of variables, the most applicable memory space is allocated on the GPU to store variables, such as constant, shared and local memories, to improve the parallel program's performance. In addition, CUDA provides a series of accelerated libraries to improve the parallel program's performance, for example, cuBLAS contains all interfaces of functions in the linear algebra BLAS library and cuRAND includes the methods of quickly generating random numbers. The utilization of CUDA -accelerated libraries will greatly improve the parallel program's performance.
Actually, this article does not solve the upscaling problem of the WOFOST model applicated in large-scale. On the  one hand, when the study area continues to expand, it is completely insufficient to use the data from one meteorological station to represent the entire area due to the spatial heterogeneity. The possible methods for upscaling to a larger area include using grid meteorological data as the model's inputs, conducting spatial interpolation using the data from multiple weather stations. For the issue of meteorological data upscaling, we will continue to study in the future. On the other hand, when the study area continues to expand, due to regional differences in crop growth, the crop parameters in the WOFOST model also need to be adjusted. How to adjust us will also be one important research direction in the future.

VI. CONCLUSION
A highly efficient parallel method based on the GPU with the CUDA platform for the assimilation between the RS data set and WOFOST model by using the PSO algorithm was developed. Each rice pixel's assimilating process between MLAI and SLAI through PSO, which is the most time-consuming work, was allocated to one specific thread on the GPU for the parallel implementation of all rice pixels' assimilation tasks in the entire study area. With the assimilating results, we simulated the crop growth parameters quickly to analyze rice growth conditions in a large scale. The remarkable progress of this work is based on ensuring the precision of assimilating results within an acceptable margin of error and dramatically improving the performance of the assimilation process. The execution times of the parallel and serial programs when handling different sizes of rice pixels are compared. Results show that the parallel implementation based on the GPU has considerably improved the speed of the assimilation between the RS data and WOFOST model and the savings scale with the increase in data size. The speedup reaches almost 240 times in this work. With the highly efficient assimilation results, we simulated rice growth parameters quickly and obtained the spatial distribution of rice growth conditions in a large scale.