Paddy Rice Mapping Using a Dual-Path Spatio-Temporal Network Based on Annual Time-Series Sentinel-2 Images

Paddy rice is one of the main foods of the global population. To guarantee paddy rice acreage is essential to ensure food security. Currently, techniques for large-area paddy field mapping rely mainly on complex rule-based machine learning algorithms. But it is difficult for them to achieve an optimal balance between discriminability and robustness. In this article, we proposed a novel deep learning-based approach for large-scale paddy rice mapping, termed dual-path interactive network (DPIN). An annual time-series Sentinel-2 remote sensing images are used as data source. Taking several areas of interest over the middle and lower Yangtze River plain of China as experimental fields, our model achieves an F1-score of 91.22% on the test dataset, which is 1.09% higher than the existing state-of-the-art predictive model, and its inference speed is 1.18 times faster than it. DPIN-Lite is a lightweight variant of DPIN, and while keeping a competitive mapping accuracy, its inference speed is 1.91 times faster than the compared method (with the best score except for DPIN and DPIN-Lite).


I. INTRODUCTION
Paddy rice is a staple food for more than half of the global population, and ensuring its production is a great guarantee of global food security and environmental sustainability [1], [2], [3]. According to the statistics from the Food and Agriculture Organization of the United States [4], Asia is the main source of paddy rice worldwide, accounting for about 90% of the global production during the years between 2013 to 2020 [5]. Among them, China alone accounts for 28%. In recent decades, the demand for rice has dramatically increased with population growth [6] and accelerated urbanization. Unfortunately, human activities like high emissions [7] have raised the threat to food supply [8], [9], and in recent years global pandemics and regional warfare greatly increase the The associate editor coordinating the review of this manuscript and approving it for publication was Diego Oliva . risk of food crisis. As for China, due to lower economic benefits from growing staple crops like rice, the non-food phenomenon of cultivated lands is increasingly severe. So, it is highly necessary to accurately extract the paddy rice planting area on a greater scale.
Various methods have been proposed for paddy rice mapping, and remote sensing-based technology has proven to be the most effective [10], [11], [12], [13]. Compared to SAR (Synthetic Aperture Radar) images [14], [15], optical remote sensing images are susceptible to tillage because of their relatively high spectral resolution (from visible to near-infrared, and to short wave infrared) and spatial resolution [16]. Therefore, optical images are still the main data source for paddy rice identification and mapping. Common candidates include Moderate Resolution Imaging Spectroradiometer (MODIS), Landsat-7/8 [17] and Sentinel-2 [18], [19], Planet series, GaoFen-2 and WorldView-3, whose spatial resolutions ranging from 250 m to about 0.5 m. To facilitate the requirements of large area cover, and low cost while holding accuracy, the Sentinel-2 imagery with a spatial resolution of 10m and a revisit interval of 5 days was used in the present study.
Paddy rice mapping algorithms can be divided into two categories: phenology-based and spectral learning-based methods, and they can be used in combination. Phenologybased methods use the growth condition indexes of crops to extract their phonologic information and then identify them according to experts' rules or empirical thresholds. The vegetation/water indexes, like Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Normalized Difference Water Index (NDWI), etc., are often used [20], [21], [22]. Using them can measure the growing period, such as Start of the Season (SOS), End of the Season (EOS) and Length of the Season (LOS) [23], [24], [25], which are useful for distinguishing paddy rice from other crops or ground features. More often, the transplanting period of rice is monitored because the characteristics of plant, soil and water are blended during this time and paddy rice fields show biggest differences from other objects in the indexes [26], [27]. However, prior planting information are needed, and cloud occlusion would greatly affect in the key period. In addition, studies have proved that since the images in a single phenological period cannot fully reflect the characteristics of the whole rice growth cycle, using images of multiple phenological periods can achieve better results [26], [27].
Spectral learning-based methods can be further divided into methods based on spectral matching, methods based on machine learning, and methods based on deep learning techniques. Spectral matching methods identify rice by measuring the similarity with spectral features or secondorder features (NDVI, etc.) [28], [29]. This requires the preextracted rice features to be accurate enough, which is hard especially when there is more than one rice ripe pattern. The machine learning methods are often used to learn the spectral characteristic, and then make the prediction. Support vector machines (SVM) [16], random forests (RF) [26], [30] and decision trees (DT) [31] have made good contributions in the field of rice mapping. However, due to their limited learning ability, machine learning methods are usually performed on indexes extracted by phenology methods rather than the complete spectrum channels, so the mapping accuracy is restricted.
Deep learning (DL) models have stronger learning capabilities and can learn expressions from the full spectrum channels. In recent years, some studies have applied DL techniques to paddy rice mapping, including classical convolutional networks [23], [32] and time-based models (like LSTM) [33]. However, convolutional networks lack constraints on temporal sequential relationships, and timebased models ignore spatial dependence. For the paddy rice mapping task, both the temporal variation and spatial texture information are critical, thus, spatio-temporal models are preferred.
Spatio-temporal models have been practiced in other fields and demonstrated their powerful capabilities. One classic model is called ConvLSTM [34], [35], which was first used for predicting the weather, and then for traffic, behavior recognition, anomaly monitoring, and a series of spatiotemporal problems. Later, two improved versions called U-ConvLSTM [36] and U-BiConvLSTM [37] were proposed to improve the training and inference speed, at the cost of the ability to capture temporal features. In [38], the spatiotemporal model structure was explored with the time-series Sentinel-2 images for the crop segmentation task, and finally, a model named UTAE with the highest recognition accuracy was proposed. It is one of the currently recognized spatiotemporal models that can learn the features of remotely sensed time-series images. Its structure is very enlightening for our work, but still has some problems: Insufficient fusion of spatial and temporal features, large memory consumption and slow inference speed, etc. Therefore, in this paper, we aim to mapping paddy rice in high precision using time series Sentinel-2 images, and contributions are: (1) Designing a better spatio-temporal DL model, which would enhance the fusion of spatial and temporal features, reduce memory consumption and improve model inference speed; (2) Proposing a paddy rice mapping scheme including data selection, data preprocessing and model training. The paper is organized as follows: Section 2 introduces the study area and methods; section 3 presents the mapping results and relevant analysis. Section 4 discusses the limitation and some explorations of our work; Section 5 is the concluding chapter.

A. STUDY AREA
The middle and lower Yangtze River plain is one of the most prominent rice-producing areas in China. Its total area is about 200,000 km2, and spans seven provinces, including Hubei, Hunan, Jiangxi, Anhui, Jiangsu, Zhejiang and Shanghai. The plain is somewhat swampy, made up of many lakes and rivers, making it suitable for rice growing and freshwater fish, and it is therefore known as the ''land of fish and rice.'' Controlled by subtropical monsoon climate, the plain is rich in rainfall, with an average annual precipitation of 1000∼1500 mm. According to rough estimates, the rainfall days account for about one-third of a year, and are concentrated between April and June. Sometimes influenced by typhoons, there is also plenty of rainy days from July to September.
Five areas of interest (AOIs), denoted as AH1, AH2, AH3, JX1 and ZJ1 (see Fig.1), respectively, were chosen to act as our experimental fields. However, due to confidentiality agreements, we are not able to add more location details. According to the ''Dataset of Rice Spatial Distribution in Nine Provinces and One City in Southeast China,'' the paddy rice in these AOIs is ripe once a year, and some are in rotation with other dry crops such as wheat, maize, etc. According to the transplanting time, paddy rice can be divided into three types: early-season rice, mid-season rice and late-season rice. The early-season rice is usually transplanted in late March to early April and harvested in mid-to-late July. The midseason rice is usually transplanted from early April to late May and harvested in mid-to-late September. The late-season rice is usually transplanted in mid-to-late June and harvested in early-to-mid October. The paddy rice grown in the rotation is mainly mid or late-season rice.

B. DATASET
Multi-temporal Sentinel 2-SR images falling within the AOIs were collected based on Google Earth Engine (GEE). The time span of these images ranges between January and December 2019, and their cloud coverage is kept ≤ 9%. Three atmospheric spectral channels (1, 9 and 10) of the Sentinel-2 satellite were routinely removed from the data while retaining the other 10 channels. Finally, these images were resampled at a spatial resolution of 10m per pixel. The ground truth image of paddy rice was manually annotated by agricultural image interpretation experts based on google earth images and field investigations.
Beyond resampling, a series of image pre-processing operations were performed, which are described in detail in Section 2.3. And then, all the collected images were cropped into small patches using a sliding window with a stride of 128 non-overlapping pixels. After that, the patches (or say samples) without paddy fields were removed from the dataset, and the remaining 6280 samples were used for subsequent model training and prediction. Based on the Python platform, each sample in the dataset was stored In our experiment, 80% of the samples were used for training and the remaining 20% for validation. Finally, a five-fold validation method was used to obtain the paddy-field map over the entire sample area.

III. MATHOLOGY
A. WORKFLOW Fig. 2 displays the workflow of the proposed method for paddy-field mapping, which consists of three procedures/modules: image pre-processing, construction of the DPIN model, model training and evaluation. First, the preprocessing module consists of cloud removal, per-month image integration, and zero-padding for absent time-series images. Second, the DPIN model is composed of two submodules: a dual-path interactive encoder (DPIE) and an all-level decoder (ALD). Third, for model training: considering lightweight parameters and sufficient training samples, we trained our DPIN model from scratch; for model evaluation: a five-fold validation strategy is applied to evaluate the accuracy and reliability of the proposed method.

B. IMAGE PRE-PROCESSING
The cloud removal operation is a two-step procedure. The first one is the image-level removal. As mentioned, only the images having cloud coverage ≤9% were selected to constitute our training datasets. However, a small cloudcoverage threshold may result in absence of some timeseries images. In [40] and [41], the adopted threshold is ≤70%. However, the accuracy loss caused by these cloudcontaminated input images is very serious. To balance the cloud coverage and data deficiency, the empirical value 9% was used. The second one is pixel-level cloud removal. Based on GEE platform, we use the QA60 band, which embedded in Sentinel-2 data flagged the Opaque Clouds pixels (Bit 10) and Cirrus Clouds pixels (Bit 11) [42], to conduct cloud detection and removal. However, the QA60 band is only sensitive to thick clouds, and pixels contaminated by thin clouds still remain.
In theory, the satellites in the Sentinel-2 constellation could provide a revisit time of 5 days in cloud-free conditions. In practice, the data availability in several months (such as June, July and October) can be scarce due to cloud cover. One way to address this issue is to conduct data integration. For example, supposing there are 1 to 6 image(s) available in one month, we take the medians of the pixel values of these images for each pixel to generate an integrated image. By doing this, we minimize the impact of cloud coverthe pixel values contaminated by clouds can be dismissed as outliers during this averaging process. Finally, we can obtain 12 time-series images from January to December by conducting a per-month data-integration operation within the year of 2019.
Even so, there are still some images unavailable, and our solution is twofold as follows: (1) substitute the missing images in 2019 for the counterpart images from the 2018 or 2020 image database; (2) when there are no substitutable images, we conduct zero-padding for the missing data. This operation ensures that the input time series are of the same length. In practice, the operation (1) can deal with most of the described problems, and the operation (2) is not a common case, so it has limited impact on the model performance. Meanwhile, the high-quality rice mapping results we got prove that DPIN has some ability to resist data missing.

C. MODEL CONSTRUCTION
The spatio-temporal models could extract the growth features of paddy rice from the time-series images, and the encoder's consumption of memory is related mostly to the length of the time-series. Fig. 3 (a) is such an example, in which a temporal feature aggregator (TFA) module is constructed to integrate the multi-temporal feature representations derived from the encoder, and then, the resulting representations are sent to the decoder to generate the rice distribution map. By contrast, to reduce the memory cost, Fig. 3(b) is designed to use the stacked time-series images as input to the spatial networks, however, it has a difficulty in producing high-quality rice mapping results because the time-sequential context information is not further exploited during the encoding process. To solve these problems, we designed a dual-path spatio-temporal model for paddy rice mapping, termed DPIN, and its core components include a dual-pathinteractive encoder (DPIE) and an all-level decoder (ALD), as shown in Fig. 3(c).
DPIE adopts an (N-stage) dual-path structure connected by the interactive modules P and H t (t ∈ [0, T ]). First, the input time-series images are transformed into a four-dimensional tensor X with the shape of [T, C, H, W],where T denotes the length of the time sequence, and in our case T=12, representing the observation data come from 12 months in one year; C denotes the number of channels, and in our case C=10; H and W represent the height and width of each input image. And then, the DPIE module encodes X through two paths: (1) the ''separate'' path: the time-sequential images are encoded by using a spatial convolutional network in a parallel manner, which involves creating a series of L-Blocks; (2) the ''stacked'' path: all the input images are first concatenated and then encoded by a standard convolutional network to integrate the multi-temporal information. The feature representations derived from these two paths will be sent to and integrated by the interactive modules at different stages. The interactive module P plays the role of a bridge between the two paths: at each stage, the feature representation derived from the ''separate'' path is sent to and processed by the interactive module, and then integrated with the corresponding representation derived from the ''stacked'' path. In a similar fashion, H t (t ∈ [0, T ]) facilitates the communication from the ''stacked'' path to the ''separate'' path. Note that: in Fig.4, in order to save memory space, we stipulate that the dimension of the L-Block along the ''separate'' path must be even smaller than that of the Block along the ''stacked'' path.
The architectures of Block and L-Block are shown in Tab. 1 and Tab. 2, respectively. The Block module consists of five convolutional layers from top to bottom, which are: a 7 × 7 depth-wise convolution layer, a layer normalization layer (see [43]), a GELU activation layer, a 1 × 1 convolution layer and the other layer normalization layer. Such an architecture is borrowed from the ConvNeXt model [44], and on this basis we developed the L-Block module. To avoid memory overhead, L-Block has two lightweight structures: as shown in Fig. 4, from stage 1 to stage N (N=4), it is designed to have a 2 × 2 convolution layer and a layer normalization layer; while at stage 0, it consists of a 1 × 1 convolution layer, a layer normalization layer and a Block module. The input of this Block module is a stacked tensor with size [T×C, H, W], and the output of it is reshaped to fit the style of a normal tensor: [T, C, H, W]. By doing this, we could maintain the inherent time-series context relations and add spatiotemporal information onto the ''stacked'' path.
The ''separate'' path works in the following way: For each time point t, the encoder G l at level l takes as input the feature representations of the previous layer g l−1 t , and outputs a representation map x equals 1 when l is 1. The same assignment rule is applicable to the values of and w l in the decoder and the layers along the ''stacked'' path. The g l t is computed as follows: The ''stacked'' path works in the following way: First, the orange box in Fig. 4 indicates the concatenating operation of X (size: [T ×c, h, w]). And then, the encoder E l at level l takes as input the feature representations of the previous layer e l−1 , and outputs a feature representation e l of size (T * c l ) × h l × w l .Therein, e l is computed as follows: To facilitate the communication between the temporal and spatial representations, we add several interactive modules in between the ''separate'' path and the ''stacked'' path, which FIGURE 5. Structure of all-level decoder, which contains two paths: path I (marked with red) and path II (marked with blue).The architecture of PPM is after [45].
is noted as P and H t (t ∈ [0, T ]) in Fig. 4, and they are actually all 1 × 1 convolution layer. As a lightweight TFA, F takes the stacked g l t as input (t ∈ [0, T ]), and outputs the spatiotemporally fused representations based on the timeseries representations. These fused representations are then added or concatenated with e l−1 to get e l . H t is used to deliver the global spatio-temporal information from the ''stacked'' path to the ''separate'' path: it takes e l−1 as input, and outputs a feature representation (H t (e l−1 ) ) with the same size as g l t , and then the multi-temporal results of g l−1 t + H t (e l−1 ) will be put to the L-block for further processing. Herein, the H t (·) is used to reduce the feature dimension of e l−1 . Given the above, the following formulas describe how the DPEI works: where [·] T t=0 denotes the stacked stack time-series features from t 0 to t T , means the concatenation operation, X denotes the time-sequential images, and Conv(·) is a convolution module including convolution layer, a normalization layer and a GELU layer.
In accordance with the encoder, our decoder also has two paths. As shown in Fig. 5, as low-level features could preserve the edge details, we created a parallel path (path II, the blue line) to concatenate the feature representations derived from DPIE at different scales. However, due to the shallow network projection, low-level features usually lack of global spatial information which is helpful for high-accuracy classification. Inasmuch, we added a PPM (Pyramid Pooling Module, see reference [45] for details) module at the end of the encoder to extract the deepest spatiotemporal representations. After that, we created another path (path I, the red line) to fuse the low-level features and the feature representations e l derived from the encoder, and the newly generated representations which contain more global and essential spatiotemporal information would be sent to path II for further processing.
This idea comes from [46]. Note that: PPM is incorporated in DPIE because it can adaptively capture the spatiotemporal information at different scales by using four pooling windows of different sizes, thus facilitating the extraction of deeper and meaningful paddy rice semantic features.
Adhere to the path II there are N -1 stages (i.e., D l , l = 1 . . . N ), and each of them consists of a series of convolutional layers, GELU layers and normalization layers. The output of D l is a feature representation d l with size (T × c l ) × h l × w l , where d l is computed in the following way: where U denotes the up-sampling operation. And PPM, as shown in Fig. 5b, is used to extract multi-scale spatial features. The pooling scale of PPM is set as (1, 2, 3 & 6), and its output dimension is empirically set as 256. Given the above, the path II can be formulized as: whereŷ is the predicted probability, F represents the output layer composed of a 1 × 1 convolution layer with output channels of 2 (paddy rice or non-paddy rice) and a softmax function.
The loss function used in this paper is cross entropy, formulated as: where L symbols the loss, y denotes the ground truth, andŷ denotes the predicted probabilities output from F.

A. IMPLEMENTATION DETAILS
Details of experiments in this paper are presented here, including the working environment, model setting and training tricks. All codes were written in python 3.8 under the Pytorch 1.10.1 framework, and experiments were conducted on a single NVIDIA Tesla V100 with 32 GB of GPU memory. During the training process, the batch size was set to 14, and the initial learning rate was set to 0.007. A start warm strategy was used in the first 4 epochs, with the total training epochs set to 100. Meanwhile, a learning rate decay strategy named cosine decay and two regularization operationsstochastic depth [51] with a drop rate of 0.1 and weight decay with the param set as 1e-4, were adopted. In data augmentation strategy, random vertical and horizontal flips with a probability of 50%, and a random crop with a size of 96, were applied. An AdamW optimizer with default parameters was used for model training and a mix-precision training scheme was adopted. Unlike the general neural networks with 4 stages, the stage number N of DPIE was set to 5 in the experiment, and the first stage is dedicated to retaining low-level information. In each stage, there is an L-Block and several Blocks. The Block numbers were set to [1, 3, 3, 27, 3] from stage 0 to stage 4, and their corresponding out-channels were set to [96,96,192,384,768], respectively. The out-channels of the L-blocks were set to [4,8,16,32,64] for each time point. The initial weights of Blocks from stage 1 to stage 5 were loaded from a pre-trained ConvNeXt-small model. Because the components in stage 0 and L-Blocks were extra designed, they have no pre-trained parameters. The out-channels of the decoder were set to 256 in all stages. This paper gave two model versions, a standard DPIN and a lighter one named DPIN-Lite. The numbers of Blocks in DPIN-Lite were set to [1, 1, 1, 3, 1] for lower memory cost and higher inference speed.

B. MAPPING RESULTS
In this paper, four metrics were used to evaluate the effectiveness and accuracy of the proposed method, which are Precision, Recall, F1-score and IoU (Intersection over Union). The formulas are given as follows: where TP, TN, FP and FN denote the number of true positives, true negatives, false positives and false negatives. Normally, F1-score and IoU are more concerned by researchers because they take both precision and recall into account and reflect the comprehensive model performance. We evaluated DPIN in all five sample areas and computed the overall values. As shown in Tab. 3, the overall Precision, Recall, F1-score and IoU of DPIN are 90.41%, 92.04%, 83.85% and 91.22%, respectively. DPIN acquired good performances in all sample areas and proved its great ability in the rice mapping tasks. The best performance was acquired in JX1 with an F1-score of 95.64% and IoU of 91.65%, and the worst F1-score of 89.97% was got in ZJ1. Fig. 6 shows the prediction results of two representative areas in these two areas. Their distributions of ground truth and prediction results are basically identical from a regional perspective, as the purple and blue color areas show in the Figure. Unlike the large-area rectangular plots in the plain of ZJ1, the paddy rice fields of JX1 are distributed along the valleys and slopes and are fragmentized and irregular. Even  so, the best prediction results were still acquired in this area, which is probably due to its simple planting pattern (single rice without rotation). While in most other areas, the rotation patterns vary greatly between different plots and crop types cannot be ascertained, which increases the uncertainties of model learning.

C. PERFORMANCE COMPARISON
DPIN and DPIN-Lite were compared with eight models under the same environment and parameter settings, which include four spatial models and four spatio-temporal models. The spatial models are DeepLab v3+ (with the backbone as MobileNet v2 [50]), Segformer, Mlp-Seg and ConvNeXt, whose structures contain three main components of deep learning: convolution (Deeplab v3+, ConvNeXt), self-attention (Segformer) and fully connected perceptron (Mlp-Seg). The four spatio-temporal models are ConvLSTM, U-ConvLSTM, U-BiConvLSTM and UTAE. All model performances were listed in Tab. 4, and it can be seen that DPIN got the best scores, with an F1-score of 1.09% and IoU of 1.71% higher than the third-best model (UTAE), respectively. DPIN-Lite has fewer parameters and higher inference speed than DPIN and got the second-best performance.
Among four spatial models, Deeplab v3+ had the worst performance because its backbone is a lightweight network.  Table 4.
The other three are constructed by self-attention modules, fully connected perceptron and convolutional blocks, respectively. Results reveal that convolutional structure is more advantageous in paddy rice mapping. However, spatial models generally perform worse than spatio-temporal models because the formers couldn't fully explore and utilize the temporal context relationships.
As a classical spatio-temporal model, the Conv-LSTM model got a not bad F1-score of 89.56%. U-ConvLSTM is constructed by simply adding the LSTM structure as the temporal feature aggregator (TFA) into a U-Net structure. Such modification did not bring better performance because most parameters are still learning spatial features. U-BiConvLSTM and UTAE improved the results (F1-score of 89.84% and 90.13%) by enhancing the capability of TFA. However, the spatial and temporal features still cannot freely interact. DPIN overcomes this problem by creating a dual-path interactive structure and getting the best result.
The performance differences of these models are not only significant in the quantitative assessment but also visually noticeable in the amplifying images as shown in Fig. 7. As indicated in the red rectangle boxes, DPIN captured much finer details compared to other models. To be specific, DPIN produced fewer false positives and negatives (e.g., line 3 & 4 in Fig. 7), returned relatively accurate shapes/boundaries (e.g., line 1, 2 & 4 in Fig. 7) and retained hard-identified plots (e.g., line 2 in Fig. 7).

D. MEMORY COST AND INFERENCE SPEED COMPARISON
The memory cost and inference speed are two main aspects that constrain the spatio-temporal model, and Tab. 5 compares their performances. The memory cost was recorded during the training process, consisting of two parts: the parameters of models themselves, and intermediate variables generated during runtime, termed Activations. The inference speed is measured by Frame-Per-Second (FPS), referring to the throughput of the model per second. The higher the FPS, the faster the model executes.
From Tab. 5 we can see that the main difference in memory cost exists in Activations. U-ConvLSTM and U-BiConvLSTM design relatively simple TFA to save memory and lead to limited performance. Complicated TFA improves the capability of capturing spatio-temporal information at the cost of large memory consumption like UTAE does. DPIN and DPIN-Lite make the appropriate trade-off to get the best performance. And by applying a dualpath interactive branch as a TFA structure instead of time-cost LSTM and self-attention blocks, the execution speed of DPIN and DPIN-Lite are greatly improved. Especially, the speed of DPIN-Lite is about 3 times of the UTAE model (best scores except for DPIN and DPIN-Lite).

E. OBLATION STUDY
To verify the effectiveness of the dual-path interactive structure, a series of comparison models were designed, including 1) DPIN-A: using only one-side interaction from the ''separate'' path to the ''stacked'' path; 2) DPIN-B: removal of the ''separate'' path; 3) DPIN-C: removal of the  Through the experiment we can confirm that: First, the dual-path interactive structure could improve the recognition effect. In DPIE, the ''separate'' path extracts spatial features from sequences by frames, the sequential order retains the temporal context relationship, and the ''stacked'' path mainly extracts the global spatio-temporal representations. The interaction of the two paths helps the integration of temporal and spatial features. Second, two parallel paths perform better than a single path, because the ''separate'' path could capture temporal context relationships. Last, the addition of stage 0 without a down-sampling operation improves the results. This demonstrates the importance of the All-level decoder, and with a layer of non-dimensionally sub-sampled features, the fullest spatial detail can be reserved. Some detailed comparisons were visualized in Fig. 8. As the model structure of DPIN is continuously chopped, paddy rice identification is becoming less effective, and the recognition ability for difficult fields gradually decreases. To be specific, the false positives and negatives increased (e.g., lines 1, 3 & 4 in Fig. 7), and the capability of retaining accurate shapes and boundaries decreased (e.g., line 2 in Fig. 7).

V. DISCUSSION
The superiority of DPIN has been demonstrated above. In this section, limitations and a few methodological attempts of the experiment are discussed. Some prediction errors are shown in Fig. 9, and to analyze the causes and corresponding ground features, NDVI curves of them were plotted. As can be seen, most common errors occur along the plot boundaries. They are mainly caused by the mixed pixels, whose spectral curves consist of several different ground features. Mixed pixels usually affect the classification accuracy of small and linear ground features when using relatively low-resolution images. In Fig. 9(a), the boundary of a road was mis-predicted as its adjacent paddy rice. The NDVI curve of road is flat, the rice curve peaks in August, while the curve trends of their boundary pixels fall in between road and rice, making it difficult to classify. Fig. 9(b) shows a case that the ground truth labels are wrong while our model predicted correctly. Due to the carelessness or other reasons, one block with flat NDVI curve were labeled as paddy rice. It demonstrates that manual labels cannot be 100% right and a well-trained model is able to correct some errors. Fig. 9(c) shows a case where the labels are correct, but the model predicted wrong. This plot has two peaks in the NDVI curves, which is very few in our dataset and then hard to be predicted. It can be deduced that double-season rice is planted here. In Fig. 9(d), the curve of false positive plot has no distinct peaks and valleys, and the crop type and planting pattern are unclear. These ground features belong to the complicated targets of our model.
In this experiment a whole year of 12 images were used, and is it reasonable, or can just using the images in the period of rice growth get a better result? Fig. 10 releases the NDVI curves of five areas (5000 random samples per area). The black curve represents the mean NDVI of paddy rice, and the gray area represents the standard deviation. It can be seen that  NDVI values in all five areas reach the vertex around August, which means that paddy rice grew to its peak at this time. Also, there is a small peak in April in the curves of AH1, AH2, AH3 and ZJ1, and it's probably the growth peak of the rotated crops. From the curves we can roughly deduce that the single-season paddy rice starts its period from May or June and end in November or December. Thus, time-series images from May to December can cover the whole growth period of paddy rice in our sampling area.
Then a set of comparison experiments were conducted: Rice mapping with a 12-months images (a whole year), eightmonths images (May to December), and one-month image (August). Their results are recorded in Tab. 7. We can see that the model performance decreased when using eightmonths of images instead of a whole year. This is somewhat counter-intuitive -images outside the paddy rice growth period have no relevant features of paddy rice but adding them into input image series improved the mapping performance. However, it could be explainable from another perspective -images from other months provide some plot-related information: if rotated with other crops, they could reinforce the property of cultivated land; if not rotated, they could provide clear boundaries. Additionally, using a-whole-year time-series images can facilitate the application of proposed model in a large area in spite of the spatial differences of crop phenology and rotation mode. When using the image in August alone, the model performance is worst because the features of other phenological phase are missing.
Trying to improve the mapping results. several more attempts have been explored in the model designing and training processes. The first is adopting stricter data augmentation strategy, including RandAugment [51] and Mixup [52] augmentation. But on the contrary it compromises the mapping accuracy. We speculate that this is because the strict data augmentation operation leads to more significant variation in the sample distribution, and the forced restriction of ''label invariance'' between the augmented and original samples hurts the model performance. Another attempt is the use of pre-trained technique. DPIN was pre-trained on other datasets and then fine-tuned on the paddy rice-mapping task. The pre-trained data were USDA-NASS Cropland Data Layer (CDL) products. But probably because the growth pattern of U.S.' crops is quite different from that of China, the pretraining did not work as we wished. Note that large-scale pretraining still can be expected to play an important role in the crop mapping tasks in the future.
A few directions deserve further study: One is the tradeoff of missing and noisy data, both could decrease the model performance. In this experiment, images with cloud percentage more than 9% were removed. That is to avoid that large pixels were contaminated by clouds, because the pixelbased cloud removal algorithm can hardly detect all clouds, especially thin clouds. If the threshold was set more strictly, more months would have no data. Although our model can learn on incomplete time series images, the performance of the model is expected to be further improved if a more effective balance between cloud presence and image loss can be achieved. Another research direction is fusion of multisource data. for the cloudy and rainy areas, a fusion input of Sentinel-1 and Sentinel-2 images could be a better choice. In short, DPIN is already a good try in time series images, and further studies may need to focus more on data mining and application on a larger and more complex crop scene.

VI. CONCLUSION
Accurate, large-scale mapping methods for paddy rice fields have long been required by governments and agricultural departments. In this paper, a novel strategy is developed and implemented by using time series Sentinel-2 SR images acquired from the GEE platform. Five sample areas of interest over the middle and lower Yangtze River plain were chosen for method validation. In the strategy, an improved spatiotemporal model DPIN and its lighter version DPIN-Lite, were proposed. Both of them are constructed by a dualpath interactive encoder to enhance the fusion of spatial and temporal features, and an all-level decoder to retain all scale feature maps. Compared with peer models, DPIN yields the best results with an overall F1-score of 91.22%, and has a significant advantage in the inference speed, reaching up to 56.21 FPS when the input image size is 128 × 128. DPIN beats the next-best peer model (UTAE) by 1.09% in F1score and 118% in inference speed. And DPIN-Lite further improves the inference speed to 291% of the UTAE, with only a 0.14% F1-score decrease from DPIN. Our experiment proves that using a full year of 12 images can get better results. And using full-year images makes our method easily transferred to other areas with no prior crop phenological information. In future, our model will be trained and tested in the larger area with more rotation patterns. National Science & Technology Infrastructure of China (http://www.geodata.cn) and also would like to thank the Google Earth Engine Platform for Data Acquisition and Preprocessing.
HUI WANG was born in Leping, China, in August 1995. He received the bachelor's and master's degrees in surveying and mapping engineering from the China University of Geosciences, Beijing, China, in 2017 and 2021, respectively. Currently, he is a Junior Researcher with the Nanhu Laboratory. His research interests include intelligent interpretation of remote sensing imagery, design of deep learning network architecture, and application of machine learning model. RONGHAO WEI received the bachelor's degree from Wuhan University, Beijing, China, in 2005, and the master's degree from the Institute of Geodesy and Geophysics, Chinese Academy of Sciences, Beijing, in 2008. From 2008 to 2020, he worked as an Engineer with the Zhejiang Surveying Institute of Estuary and Coast. He is currently an Engineer with the Zhejiang Institute of Hydraulics and Estuary. His current research interests include remote sensing, image processing, shipboard LiDAR, and sonar data integration. VOLUME 10, 2022