Multitemporal Relearning With Convolutional LSTM Models for Land Use Classification

In this article, we present a novel hybrid framework, which integrates spatial–temporal semantic segmentation with postclassification relearning, for multitemporal land use and land cover (LULC) classification based on very high resolution (VHR) satellite imagery. To efficiently obtain optimal multitemporal LULC classification maps, the hybrid framework utilizes a spatial–temporal semantic segmentation model to harness temporal dependency for extracting high-level spatial–temporal features. In addition, the principle of postclassification relearning is adopted to efficiently optimize model output. Thereby, the initial outcome of a semantic segmentation model is provided to a subsequent model via an extended input space to guide the learning of discriminative feature representations in an end-to-end fashion. Last, object-based voting is coupled with postclassification relearning for coping with the high intraclass and low interclass variances. The framework was tested with two different postclassification relearning strategies (i.e., pixel-based relearning and object-based relearning) and three convolutional neural network models, i.e., UNet, a simple Convolutional LSTM, and a UNet Convolutional-LSTM. The experiments were conducted on two datasets with LULC labels that contain rich semantic information and variant building morphologic features (e.g., informal settlements). Each dataset contains four time steps from WorldView-2 and Quickbird imagery. The experimental results unambiguously underline that the proposed framework is efficient in terms of classifying complex LULC maps with multitemporal VHR images.


I. INTRODUCTION
Over recent years, the amount of very high resolution (VHR) multitemporal satellite imagery has significantly increased, and such imagery has become commercially available [1]. VHR imagery provides opportunities for extracting many details including various land use types and building morphologies. One potential application domain is land use analysis [2]-[4]. The World Bank [5] estimates that three billion people will live in substandard housing by 2030. By 2050, the UN [6] projects that two thirds of the world's population, around 7 billion people, will live in urban areas. The increasingly available multitemporal satellite imagery can be beneficial for examining urban development over time and can help set policies to limit urban sprawl, among others. Efficient methods for classifying land use and land cover (LULC) from multitemporal VHR imagery are therefore timely and worthy of further exploration. Finding suitable and effective approaches for multitemporal LULC classification based on VHR remote sensing imagery remains challenging. First, the significantly improved spatial resolution of VHR imagery leads to high intraclass variation and low interclass variation among LULC classes [7], [8]. This issue inevitably decreases the separability between different LULC classes, especially for land use categories that contain much semantic information. Moreover, most of the existing temporal methods have not fully exploited the temporal sequential features in multitemporal data because of their limitations regarding automation and flexibility [9]. It can be argued that the temporal dependency embedded in the consecutive time steps of time-series data contains the features of transition patterns, i.e., transition rules of LULC changes. It has been widely recognized that deep learning (DL) methods can extract rules that represent the relationship between the distributions of input and output.
Convolutional neural networks (CNN) are designed for processing spatial features, whereas recurrent neural networks (RNN) excel at analyzing temporal relationships. These advanced developments in the DL field shed light on taking advantage of the temporal dependency features for improving classification accuracy.
In the past, different strategies were followed to improve the accuracy of LULC classification for VHR remote sensing imagery.

A. Convolutional Neural Networks for LULC Classification
DL algorithms, particularly CNNs, have gained great success and are deployed in the remote sensing community [10]. This is because CNNs excel at effectively encoding discriminating features based on spectral and spatial information [11]. Such abilities enable CNN models to achieve remarkable accuracy in image classification tasks [12], [13]. The applications of CNNs in LULC can be mainly grouped into two categories: scene-based classification and pixel-based classification [14]. Scene-based classification, also termed patch-based classification, refers to the categorization of images into a set of LULC classes based on the main content of each image [15]. Many efforts have been made to deploy CNNs for scene-based classification of remote sensing imagery. A typical example of scene-based classification starts with sampling numerous patches from a relatively large image for model training; an LULC map is then generated by the trained model through classifying the scene category of every sampled patch [16]. Sharma et al. [16] proposed a patch-based CNN framework that is designed for medium-resolution remote sensing data. Liu and Shi [17] deployed a scene-based CNN model for local climate mapping. It can be argued that scene-based CNNs have several limitations in LULC classification. First, a suitable size of input patch is difficult to define, especially when the sizes of ground targets are highly variable [18]. Moreover, the patches sampled from the same large image are processed independently, which means the context information is neglected in the process of classification. Furthermore, scene-based methods are usually adopted for recognizing large objects, whereas pixel-based methods are more suitable for detecting fine details [19].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Pixel-based classification refers to techniques which assign a class label to each pixel of an image. Semantic segmentation techniques, as one type of pixel-based classification, assign labels to each pixel according to contextual information. Over recent years, fully convolutional networks (FCNs) [20] have received much attention for their outstanding performance in semantic segmentation tasks. Compared with conventional pixel-based classifiers that partition pixels solely based on the specific spectral information, FCNs perform pixel-level labeling by using multiple fully convolutional layers to extract high-level contextual features embedded in images. Although FCNs usually do not exhibit heavy salt-and-pepper noise, one of the constraints of FCNs is the blurring of class boundaries caused by multilevel abstractions [18], [21].
It has been widely recognized that the performance of a CNN model can heavily depend on the quality and amount of training data. Unlike ordinary RGB images, for which open datasets containing enormous amounts of training data exist, large-scale annotated multispectral remote sensing data are scarce [22]. Therefore, considerable efforts have been made in developing CNN models that are suitable for remote sensing imagery. One of the strategies is to exploit existing trained CNN models through transfer learning strategies [23] and fine-tuning mechanisms [11] for model adaptation. However, this direction requires closely resembling classification tasks and model structures; for example, parameters trained on ordinary datasets are very challenging to transfer to multispectral remote sensing datasets. Alternatively, another effective direction is to develop CNN-based frameworks that are tailored for remote sensing data, e.g., by integrating object-based segments [1] and developing effective postclassification methods [18], [24].

B. Integration of Object-Based Image Analysis and Convolutional Neural Networks
Object-based image analysis (OBIA) has been widely applied for mitigating the high intraclass and low interclass variabilities in VHR imagery [25]. Object-based segmentation refers to the process of partitioning images into small objects based on homogeneity attributes of the image. One of the primary constituting aspects of such techniques is to model meaningful real-world objects (e.g., with a segmentation algorithm) before further processing. Those allow for a diversified characterization of spectral values, consideration of geometry-related properties of objects, and also encoding of additional spatial information such as relationships of (topological) neighborhood and spatial hierarchy [26]. Given the outstanding performance of DL in tasks of scene-based classification and semantic segmentation, attempts of integrating OBIA with DL models have been explored. For example, X. Zhang et al. [3] extracted spectral, spatial, and texture features of image segments to train a deep neural network for land-cover classification.
In terms of the integration of OBIA with CNNs, some research has focused on deploying OBIA as a method of preprocessing inputs for training scene-based DL models. However, object-based segments cannot be directly used for CNN training. This is because object-based segments have irregular shapes and various sizes, whereas CNN models demand input units of square shape and uniform size [18]. To deal with this issue, [1] proposed to use the minimum bounding box of object-based segments to sample training inputs for scene-based classification. However, the implementation of OBIA as a preprocessing method still could not address the aforementioned limitation of scene-based CNNs in terms of deploying contextual information for classification.
To deal with VHR imagery, besides clustering pixels into objects as processing units, another strategy is to adopt effective postclassification methods [27]. Object-based voting (OBV) has been extensively acknowledged as an effective OBIA-related classification postprocessing (CPP) method, which refers to a refinement of classified labels according to the boundaries of objects to improve the classification accuracy [28]. In general, CPP methods can achieve considerable accuracy improvement efficiently and concisely. However, compared with preprocessing methods, much less attention has been paid to CPP methods [28]. Therefore, the potential of deploying OBV-based CPP methods for refining the output of a CNN is worthy of further exploration. In this manner, Liu et al. [18] proposed a simple but efficient framework, which integrates OBV with DL. Thereby, the outputs of a CNN are aggregated to object-based segments with a majority voting strategy. The hybrid method brought significant improvements to the classification accuracy of CNN outputs.
The process of object-based segmentation is conducted in an iterative bottom-up manner that starts from merging pixels into objects [29]. The sizes and geometries of objects are determined by three parameters, which are "scale," "shape," and "compactness." The "scale" parameter determines the sensitivity for the object fusion [29]; a greater scale value results in a larger segmentation area. The "shape" parameter defines the influence of color on the segmentation process, and the "compactness" parameter defines the smoothness and compactness of object boundaries. Among the three parameters of object-based segmentation, it is worth noting that the "scale" parameter has the most significant effect on the segmentation of objects, because it determines whether objects are segmented into appropriate sizes.
For instance, large scales can result in under-segmentation, whereas small scales can lead to over-segmentation. Consequently, Liu et al. [18] also claim that the selection of different scales can have significant effects on the results of refinements.

C. Postclassification Relearning for Land Use and Land Cover Classification
Although the aforementioned OBV-based CPP methods can enhance raw classification accuracy, they do not account for improving the separability between classes [28]. Regarding this issue, relearning-based CPP methods have demonstrated immense potential in terms of improving postclassification accuracy through enhancing the separability of the original classifier [28]. The basic idea of postclassification relearning is to deploy the initial output of the model for calculation of additional features. Then with such additional features, the performance of the model can be enhanced with extended input space after training for a second time [24].
Over recent years, the advantages of using relearning-based CPP have attracted attention. Geiß and Taubenböck [24] proposed an object-based relearning (OBR) framework, which improved the classification accuracy by retraining a model with a triplet of hierarchical OBV features generated from its preliminary outcomes. Also, Han et al. [30] proposed an edge-preservation multiclassifier relearning framework, which includes iterative relearning procedures based on landscape metrics to enhance the separability of LULC classes. Furthermore, Shi et al. [31] developed an active relearning framework that can improve the classification results with lower labeling costs. More recently, Lei et al. [32] developed an object-oriented classification method which iteratively integrates classification results. The experimental results achieved promising accuracy even with a limited number of samples. The above experiments all conducted the relearning processes in an iterative manner, in order to harness the additional features that can be updated after each iteration. Such a process can be carried on until the optimization hits a plateau.

D. Multitemporal Land Use and Land Cover Classification
Multitemporal LULC classification is an active field in the remote sensing community. Over recent years, as multitemporal remote sensing imagery has become increasingly available, more opportunities have emerged for the utilization of temporal dependency to improve LULC classification [33]. Temporal sequential LULC data can significantly facilitate change detection and growth prediction. Moreover, the temporal dependency embedded in multitemporal data can be utilized to enhance the classifier performance. Vuolo et al. [34] tested the effect of adding temporal information as additional features for crop classification. The classification accuracy showed considerable improvement after the utilization of temporal information. However, in practice, it is still challenging to find an effective method for multitemporal LULC classification. Conventional methods of temporal feature extraction have many limitations, including time-consuming manual feature engineering and predefined rules that lack flexibility [9].
In the domain of DL, RNN models, including long short-term memory (LSTM) and gated recurrent units, are designed for processing sequential temporal data. Subsequently, to harness the power of RNNs for processing spatial-temporal sequential data, efforts have been made to integrate CNNs and RNNs. A Convolutional LSTM (ConvLSTM) network was proposed for anticipating future precipitation [31]. Given that multitemporal land use data is inherently spatial-temporal sequential data, recurrent convolutional structures have been applied for LULC classification and prediction. In this manner, the framework of ConvLSTM has been deployed in the field of remote sensing. Mou et al. [35] integrated CNN and LSTM layers to form a recurrent CNN for detecting land cover changes. Moreover, Rußwurm and Körner [36] proposed an encoder structure with recurrent convolutional layers for land cover classification. This structure can utilize the temporal interdependency embedded in the input data. As a result, they improved classification accuracy and alleviated the preprocessing work regarding minor missing information (e.g., cloud filling).
Although the ConvLSTM model is designed for processing spatial-temporal data, the structure of a simple ConvLSTM model does not have advantages for the task of pixel-based classification. To cope with this issue, attempts have been made to combine ConvLSTM with FCNs. Milletari et al. [37] proposed a coarse-to-fine context memory framework for medical image segmentation, which uses ResNet as the encoder and ConvLSTM layers as the decoder. Each ResNet block in the encoder is time-distributed with its counterpart ConvLSTM layer in the decoder. Such a structure enables feature interpretation based on coarse-to-fine context information and allows higher classification accuracy of image data than a simple UNet. Moreover, Azad et al. [38] developed a ConvLSTM U-Net model for medical image segmentation, which employed bidirectional ConvLSTM layers in the decoder part of a U-Net. This framework presented better segmentation performance due to its ability to abstract more discriminative features. As for the applications of semantic segmentation tasks for remote sensing imagery, Gallego et al. [39] used ConvLSTM units as the first layer in autoencoders for segmenting oil spills from side-looking airborne radar imagery. Similarly, Teimouri et al. [40] adopted ConvLSTM layers at the last stage of an FCN for classifying various crop types from SAR data.
From a unifying perspective, it can be argued that coupling OBV with relearning methods can reasonably optimize classification boundaries and improve class separability. However, to the best of our knowledge, approaches which internalize those processing principles in a beneficial way for multitemporal models are currently absent. Consequently, in this article, we aim to uniquely examine the benefits of temporal dependency in a deep relearning context. For this purpose, we propose a hybrid framework for efficient multitemporal LULC classification of VHR remote sensing imagery. The framework adopts a recurrent convolutional structure as LULC classifier, which is integrated with postclassification relearning for model improvement. At the relearning stage, the consecutive temporal outcomes of a ConvLSTM-based model were utilized to extend the original input space in the temporal dimension. Then the classifier was retrained using the extended input space to improve its class separability. The relearning process was iteratively conducted to achieve an optimal result. The various combinations of relearning strategies, including OBV-based relearning and pixel-based relearning, were tested and evaluated in an exhaustive manner. The proposed framework was examined with complex classification categories with relatively few labeled pixels in order to show that this framework has potential to be applied for a wide range of multitemporal LULC tasks.
The rest of the article is organized as follows: Section II introduces each component of the proposed framework. The experiment datasets and setup are described in Section III. Then we report the results of experiments in Section IV and finally Section V concludes this article.

II. PROPOSED METHODOLOGY
An overview of the proposed methods is provided in Fig. 1. They build upon an advanced UNet-ConvLSTM model for multitemporal LULC classification (Section II-A). Subsequent to that, the outputs are processed with an OBV method (Section II-B) to both eventually enhance the classification output and establish a further input for an iterative relearning strategy (Section II-C).
A. Convolutional Neural Network Models
1) UNet Model for Semantic Segmentation: It has been widely recognized that FCNs can achieve robust performance in the tasks of LULC classification. UNet, a semantic segmentation model built upon conventional FCNs, was first introduced by Ronneberger et al. [41] for biomedical image segmentation. The name "UNet" comes from its distinctive u-shaped encoder-decoder architecture, which presents the down-sampling and up-sampling process. To be more specific, the down-sampling part of the network acts as the encoder, which extracts features through convolutional layers and downscales data by max pooling layers at multiple scales. Then the up-sampling part of the network functions as the decoder, which expands the down-sampled data at each scale to match its counterpart layer of the encoder. The encoder and the decoder form a more or less symmetric structure. This structure of UNet facilitates feature extraction at multiple scales, which enables the classifier to consider both global context and local high-level features.
Another distinctive feature of UNet comprises the skip connections between encoder and decoder, which concatenate the low-level, fine-grained feature maps in the encoder with the high-level, coarse feature maps at the corresponding scale of the decoder [42]. Such skip connections between the encoder and decoder can effectively restore the fine-grained features in the model output.
UNet-based models have exhibited promising performance when applied to tasks of pixelwise image segmentation, including high spatial resolution aerial photos and remote sensing images [43]. Compared with conventional CNNs that require a large amount of training data, UNet excels at achieving precise segmentations with far fewer training images [41] and less training time. However, in the process of fusing the features extracted by the encoder and decoder, the low-level features introduce noise into the high-level features and therefore result in blurred segment boundaries [44]. Furthermore, although UNet has robust performance in semantic segmentation tasks, a two-dimensional (2-D) UNet does not consider temporal dependency in the data.
In this article, we deploy UNet as a benchmark model to compare the effects of relearning strategies on a single-temporal model with multitemporal models.
2) ConvLSTM Models for Spatial-Temporal Data: ConvLSTM, first proposed by X. Shi et al. [45], is developed based on the structure of the fully connected LSTM (FC-LSTM). Similar to FC-LSTMs, ConvLSTM structures contain hidden states $H_1, \ldots, H_t$ and cell states $C_1, \ldots, C_t$. The hidden states can be regarded as cell outputs. The cell states function as the memory of the layer, the information in which can be selectively updated or discarded. In a ConvLSTM model, convolutional structures are deployed to replace the fully connected layers for input-to-state and state-to-state transitions. In this manner, both temporal consistency and spatial correlation are taken into consideration in a ConvLSTM model.
Multiple ConvLSTM layers can be stacked together to form more complex structures. The input space of a ConvLSTM layer is $X \in \mathbb{R}^{t \times h \times w \times c}$, in which $t$, $h$, $w$, and $c$, respectively, refer to time steps, height, width, and channels. A simple ConvLSTM model consists of multiple ConvLSTM layers, and a ConvLSTM layer contains several ConvLSTM cells. The number of ConvLSTM cells in a ConvLSTM layer corresponds to the number of time steps of the dataset. Each cell takes $X_t \in \mathbb{R}^{h \times w \times c}$ as input, and generates its hidden state $H_t \in \mathbb{R}^{h \times w \times c}$ and cell state $C_t \in \mathbb{R}^{h \times w \times c}$ at time step $t$. The cell output $H_t$ and state $C_t$ are controlled by three gates (i.e., forget gate $f_t$, input gate $i_t$, and output gate $o_t$), which have the same spatial dimensions (i.e., $\mathbb{R}^{h \times w \times c}$). The three gates are computed by corresponding learnable weights and biases with activation functions [Fig. 1(d)]. The key equations of a typical ConvLSTM layer follow [45], where $W$ and $b$ refer to weight and bias, $\sigma$ and $\tanh$ refer to the sigmoid and tanh activation functions, "$*$" denotes the convolution operator, and "$\circ$" represents the Hadamard product.
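For reference, the gate updates described above can be written in the following commonly used form. This is a sketch of the standard ConvLSTM formulation, shown here without the optional peephole terms that appear in the original formulation of [45]; whether the variant used in this article includes those terms is an assumption, and the weight subscripts ($W_{xi}$, $W_{hi}$, and so on) are illustrative naming.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{align}
```

The forget gate $f_t$ scales the previous memory $C_{t-1}$, the input gate $i_t$ scales the new candidate content, and the output gate $o_t$ determines how much of the updated memory is exposed as the hidden state $H_t$.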
3) Integrating UNet and ConvLSTM for Multitemporal LULC Classification: In essence, consecutive time steps of LULC maps are spatial-temporal sequences. As such, multitemporal LULC classification is not only an image segmentation problem, but also contains issues related to the encoding of spatial-temporal relationships (i.e., urban transition rules).
As discussed in Section I-D, the integration of ConvLSTM units with FCNs can facilitate the extracted spatial-temporal features for image segmentation. Therefore, to develop efficient methods for multitemporal LULC classification, this article incorporates ConvLSTM layers into a UNet-like encoder-decoder structure to form an architecture named as UNet-ConvLSTM.
The encoder part of the UNet-ConvLSTM follows a contracting path, which adopts convolutional layers for retrieving spatial features and max pooling layers for down-sampling and obtaining global context information. The decoder part of the model follows an expanding path, which stacks multiple ConvLSTM layers for extracting temporal relationships. On top of the encoder-decoder structure, the corresponding encoder and decoder blocks are concatenated by the skip connections (Fig. 1), which aim to integrate low-level spatial features with high-level spatial-temporal features. Furthermore, because the datasets contain only four time steps, which can be regarded as a short-term prediction problem for LSTM models, it is not necessary to implement a complex ConvLSTM structure. To reduce the complexity of the integration of UNet and ConvLSTM, the depth of the encoder-decoder structure of UNet-ConvLSTM was decreased to two down-sampling scales.

B. Object-Based Voting (OBV)
OBV has been tested and proven to be an effective CPP method for CNNs [18]. Therefore, coupling OBV with relearning is very likely to bring improvements to initial model outputs. In this article, a combination of OBV and relearning is adopted as one of the main relearning strategy categories (Fig. 1). The process of conducting OBV-based relearning is described in Section II-C. Furthermore, OBV is also used as a simple CPP method at the last step for benchmarking. The detailed process of conducting OBV can be described as follows: The pixels $x$ in the image $I$ are partitioned into objects according to a scale parameter $s$. As shown in Fig. 2, an optimal value of $s$ can be selected through exhaustive tests of a set of values $S = \{\ldots, s-1, s, s+1, \ldots\}$. In general, multiscale segmentation can be conducted following a constraint in which $f(O_j^s)$ refers to the frequency of each label inside an object $O_j^s$. After the OBV postprocessing, the most frequent label is assigned to all the pixels inside the object $O_j^s$. $F_j^s$ denotes the updated pixel labels at level $s$. After comparisons of $F_j^s$ with the ground truth $F_j^{gt}$, an optimal scale can be determined.
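The voting step described above can be illustrated with a minimal sketch. It assumes that a segmentation at a given scale is already available as an integer map of object identifiers (the segmentation algorithm itself is not reproduced here); the function name `object_based_voting` and the toy arrays are hypothetical and serve only as an illustration.

```python
import numpy as np

def object_based_voting(labels, segments):
    """Assign to every pixel the most frequent class label of its object.

    labels   : 2-D integer array of pixel-wise class predictions.
    segments : 2-D integer array of the same shape with one id per object
               (assumed to come from a separate segmentation step).
    """
    labels = np.asarray(labels)
    segments = np.asarray(segments)
    voted = np.empty_like(labels)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        values, counts = np.unique(labels[mask], return_counts=True)
        # Majority vote; ties fall to the smallest label value.
        voted[mask] = values[np.argmax(counts)]
    return voted

# Toy example: two objects (left and right half) with a few noisy pixels.
labels = np.array([[0, 0, 1, 1],
                   [0, 2, 1, 3],
                   [0, 0, 1, 3]])
segments = np.array([[1, 1, 2, 2],
                     [1, 1, 2, 2],
                     [1, 1, 2, 2]])
voted = object_based_voting(labels, segments)
```

In this toy case the isolated labels 2 and 3 are overwritten by the majority label of their object, which is the smoothing effect that OBV is meant to provide.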

C. Postclassification Relearning
A simple postclassification relearning has three main consecutive steps: 1) a pixel-based supervised model is trained with multispectral features $F_s$ as initial input; 2) the overlap-tile strategy [41] is deployed to obtain a seamless classification map from the trained model, and the resulting two-dimensional classification map can be regarded as an additional feature $F_p$; 3) $F_s$ and $F_p$ are then stacked together at the beginning of each relearning phase to form an extended input space $F_s$ for the next phase of relearning. The purpose of this concatenation operation is to generate new features that are likely to be beneficial for improving the discriminative capability of the model. As such, the process from step 1 to step 3 can be carried out iteratively to seek an optimal result. Two main types of relearning strategies were tested in this article: pixel-based relearning [Fig. 1(a)] and OBV-based relearning [Fig. 1(b)]. The former refers to using the output of the initially trained model, a pixel-based classification map, as a relearned feature $F_p$. Then $F_p$ is concatenated with the initial multispectral input features $F_s$ to form a new input for the next iteration of relearning. The latter adds an extra step of applying OBV on $F_p$ to generate a postprocessed output $F_{obv}$, which is then concatenated with $F_s$ for OBV-based relearning. The extra OBV step is added because the OBV-processed results can yield improvements in classification accuracy, which means that $F_{obv}$ tends to have higher classification accuracy than $F_p$. As such, the extended input space $F_s$ generated with $F_{obv}$ is likely to provide the next relearning phase with better guiding information for LULC classification.
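The three steps above can be sketched as a short loop. This is an illustration under several assumptions rather than the implementation used in the experiments: the classification map is appended as a single extra channel (the exact encoding is not specified), `fit_predict` is a hypothetical callable that trains a model on the given features and returns a class map, and `postprocess` is an optional hook where an OBV step could be plugged in for OBV-based relearning.

```python
import numpy as np

def extend_input(spectral, class_map):
    """Append a classification map to the spectral bands as one extra channel."""
    extra = class_map[..., np.newaxis].astype(spectral.dtype)
    return np.concatenate([spectral, extra], axis=-1)

def relearn(spectral, fit_predict, n_iterations=3, postprocess=None):
    """Initial training followed by n_iterations relearning phases."""
    class_map = fit_predict(spectral)              # step 1 and 2: initial map F_p
    for _ in range(n_iterations):
        # Pixel-based relearning uses F_p directly; OBV-based relearning
        # would pass an OBV function as `postprocess` to obtain F_obv.
        guide = postprocess(class_map) if postprocess is not None else class_map
        class_map = fit_predict(extend_input(spectral, guide))  # step 3
    return class_map

def toy_fit_predict(features):
    """Hypothetical stand-in for training and inference (thresholds the mean)."""
    return (features.mean(axis=-1) > 0.5).astype(np.int64)

rng = np.random.default_rng(0)
spectral = rng.random((8, 8, 4)).astype(np.float32)   # h x w x 4 bands
extended = extend_input(spectral, toy_fit_predict(spectral))
final_map = relearn(spectral, toy_fit_predict, n_iterations=3)
```

Because the concatenation is along the last axis, the same helper also applies to inputs with a leading time dimension, as long as the classification maps carry that dimension too.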

A. Study Area
Our study area includes a suburban area situated at the city border of Dongguan and Shenzhen, Pearl River Delta, southern China. Shenzhen and Dongguan have experienced rapid urbanization over the last four decades. During such expansion, a significant number of rural villages have been merged into the urban area, resulting in the prevalence of urban villages. Due to the complex socioeconomic development in China, these urban villages have unique building morphologies compared with informal settlements elsewhere [46]. More specifically, the informal settlements in the Pearl River Delta mostly have their original rural settlements as old cores [47]. These old cores have been gradually encompassed by highly dense mid-rise informal settlements (so-called "handshake buildings") that were redeveloped on the plots of original rural settlements. The advent of DL provides new opportunities for the remote sensing community to map informal settlements [23], [48]. In this article, we adjust the LULC categories according to the local context to include "informal settlements" and the related "rural settlements."

B. Datasets
Two datasets were deployed for a quantitative evaluation of the models. Both datasets have a spatial coverage of 90 km². Each dataset contains four time steps with an interval of approximately 5 years (Fig. 3). Due to the limited availability of temporal sequences, the datasets are from two sensor sources: WorldView-2 and QuickBird. The WorldView-2 data covers the years 2012 and 2018, and the QuickBird data covers the years 2002 and 2007. In terms of the spatial resolution, the data from both sources have a spatial resolution of 0.5 m with an image size of 4096 × 4096 pixels. Regarding the spectral resolution, QuickBird data contains four spectral bands, which are red, blue, green, and NIR. Although WorldView-2 data provides eight spectral bands, only four spectral bands (red, blue, green, and NIR1) are used to match the four corresponding bands of QuickBird data. All the ground truth labels were manually created with reference to ancillary cadastral maps.
Satellite images were cropped into image tiles of 128 × 128 pixels with an overlap of 32 pixels for the purpose of increasing the amount of training data. The total number of cropped patches was 7056, 80% of which were randomly sampled for training and 20% of which were used for validation. To ensure a fair evaluation of all the experimental approaches, a subset of the validation dataset was selected as the testing dataset for evaluation purposes. It should be noted that the subset consists only of those areas of the validation dataset that do not overlap with the training dataset, i.e., areas which are strictly spatially disjoint [49]. Consequently, the testing dataset consists of data that has not been used for training at all.
The VHR satellite images enable the observation of various building morphologies and different vegetation types, especially the differentiation between original rural villages and follow-up informal settlements in the study area. In total, 10 LULC classes were defined in this article, including: "rural settlements," "informal settlements," "formal settlements," "bare soil," "grassland," "farmland," "trees and bushes," "water," "roads," and "other impervious surface" (Fig. 4). It should be noted that the accuracy of LULC classification can be largely subject to the settings of LULC classes [10]; larger numbers of LULC classes or semantic categories could result in a decrease of classification accuracy. Compared with other commonly adopted LULC categories, the inclusion of "informal settlements" and "rural settlements" significantly increases the complexity of the classification task due to their unique urban fabric and the need to incorporate large amounts of semantic information.
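The reported patch count can be reproduced with a short calculation. The sketch below assumes a sliding window with a stride of 128 − 32 = 96 pixels, that windows extending beyond the image border are discarded, and that all four time steps of a dataset are tiled identically; these details are assumptions, and the helper name `tiles_per_side` is hypothetical.

```python
def tiles_per_side(image_size, tile_size, overlap):
    """Number of full tiles along one image axis for a sliding window."""
    stride = tile_size - overlap
    return (image_size - tile_size) // stride + 1

per_side = tiles_per_side(image_size=4096, tile_size=128, overlap=32)
per_image = per_side ** 2          # tiles per single-date image
total = per_image * 4              # four time steps per dataset

n_train = round(total * 0.8)       # 80% sampled for training
n_val = total - n_train            # remaining 20% for validation
```

Under these assumptions each image yields 42 × 42 = 1764 tiles, and four time steps give 7056 tiles, which agrees with the total stated above.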

C. Experimental Setup
Three convolutional models were built in this article: a simple UNet model, a simple ConvLSTM model, and a UNet-structured ConvLSTM model (Fig. 5). The simple UNet model serves as a baseline against which the performance of the two ConvLSTM-based models is compared. The three models were trained on an Nvidia GeForce RTX 2080 GPU using the Keras framework (TensorFlow backend). The parameters of the three models were kept consistent: each model uses a batch size of six and "Adam" as the optimizer. The loss function was categorical cross-entropy, and the initial learning rate was set to 10^-4, decreasing by a factor of 0.1 when the validation loss stagnates for more than three epochs.
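The learning-rate rule described above can be illustrated with a small framework-free simulation. This is a minimal sketch, not the training code used in the experiments: it assumes that "stagnates for more than three epochs" means the validation loss has failed to improve on its best value for four consecutive epochs, and that the counter resets after each reduction. In Keras, comparable behavior is commonly obtained with the `ReduceLROnPlateau` callback; whether that callback was used here is not stated.

```python
def plateau_schedule(val_losses, initial_lr=1e-4, factor=0.1, patience=3):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the validation loss has not
    improved for more than `patience` consecutive epochs (assumed reading).
    """
    lr = initial_lr
    best = float("inf")
    stagnant = 0
    rates = []
    for loss in val_losses:
        if loss < best:
            best = loss
            stagnant = 0
        else:
            stagnant += 1
            if stagnant > patience:
                lr *= factor
                stagnant = 0  # assumed: start counting again after a reduction
        rates.append(lr)
    return rates


# Hypothetical validation losses: improvement stops after the second epoch.
rates = plateau_schedule([1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8])
```

With these made-up losses, the rate stays at 10^-4 for the first five epochs and drops to 10^-5 at the sixth, i.e., at the fourth epoch without improvement.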
The three models were tested with two main relearning strategies: pixel-based relearning and OBV-based relearning. Each relearning strategy was run for three iterations to test the extent to which it can improve the initial result. For each relearning strategy, OBV was also applied to the relearned map after the three iterations of relearning; in other words, OBV was deployed as the last step to provide a comparison with the iteratively relearned results.
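The iterative structure of relearning can be sketched as follows. The sketch assumes that the previous classification map is one-hot encoded and appended to the spectral bands of each pixel; the exact encoding of the extended input space is not specified in this section, so this is only one plausible choice. `train_and_predict` is a hypothetical callable that stands in for training a model on the extended input and returning its predicted class map.

```python
NUM_CLASSES = 10  # the ten LULC classes used in the main experiments


def one_hot(label, num_classes=NUM_CLASSES):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def extend_input(image, class_map):
    """Append the encoded previous prediction to every pixel's band values.

    image:     H x W x bands nested lists
    class_map: H x W integer labels from the previous iteration
    """
    return [
        [pixel + one_hot(label) for pixel, label in zip(img_row, map_row)]
        for img_row, map_row in zip(image, class_map)
    ]


def relearn(image, initial_map, train_and_predict, iterations=3):
    """Feed each iteration's output back into the next iteration's input."""
    current = initial_map
    for _ in range(iterations):
        current = train_and_predict(extend_input(image, current))
    return current
```

For pixel-based relearning, `initial_map` would be the raw model prediction; for OBV-based relearning, the map would be passed through OBV before being fed back.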
In the process of segmenting an image into object-based segments, three parameters were determined: "scale," "shape," and "compactness." Since the scale parameter plays the most significant role regarding the effect of CPP, we kept "shape" and "compactness" constant and tested a wide range of "scale" values. "Shape" and "compactness" were both set to 0.3 for all the approaches. The scale parameter was varied in linear steps of 5 over S ∈ {10, 15, 20, 25, . . . , 115, 120}. Among the scale parameters in S, we selected the one that yielded the largest improvement over the initial model prediction and applied it to all the OBV-related approaches of each dataset.
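The voting step and the scale search can be sketched in a few lines. The sketch assumes that OBV assigns each segment the majority label of the pixels it contains, which is the usual reading of object-based voting. The segmentation itself (governed by "scale," "shape," and "compactness") is taken as given in the form of a segment-id map, and `oa_by_scale` is a hypothetical dictionary of accuracies measured after OBV at each scale.

```python
from collections import Counter

# Scale parameters S = {10, 15, ..., 120} as listed in the text.
SCALES = list(range(10, 125, 5))


def object_based_voting(labels, segments):
    """Replace every pixel label by the majority label of its segment.

    labels:   H x W per-pixel class labels
    segments: H x W segment ids from an external segmentation
    Ties are resolved in favor of the label encountered first.
    """
    votes = {}
    for label_row, segment_row in zip(labels, segments):
        for label, segment in zip(label_row, segment_row):
            votes.setdefault(segment, Counter())[label] += 1
    winner = {seg: count.most_common(1)[0][0] for seg, count in votes.items()}
    return [[winner[seg] for seg in segment_row] for segment_row in segments]


def best_scale(oa_by_scale):
    """Pick the scale whose post-OBV accuracy is highest."""
    return max(oa_by_scale, key=oa_by_scale.get)
```

In this toy form, isolated mislabeled pixels inside a segment are overwritten by the dominant class, which is how OBV suppresses salt-and-pepper noise; an overly large segment can equally overwrite a small correct object.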
Regarding the evaluation of classification results, the performance of all the approaches was assessed by the overall accuracy (OA) and the Kappa coefficient. Perclass accuracy was also assessed to examine the separability between classes, especially the effect of the relearning approaches on the classification accuracy of thematic classes.
Fig. 5. Flowchart of the experiment setup. Multitemporal VHR images are first input into a CNN-based model for initial training to produce (1) initial pixel-based classification maps, which are (2) postprocessed by OBV for benchmarking against the relearning approaches; alternatively, the initial pixel-based classification maps are aggregated with VHR images for (3) pixel relearning; after three iterations of pixel relearning, the PR maps are processed by OBV to generate (4) PR-OBV; following a similar workflow, (5) OBVR maps are generated after three iterations of aggregating OBV maps with VHR images. Finally, (6) OBVR-OBV maps are produced by postprocessing OBVR maps with OBV.
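For reference, the evaluation measures named in this subsection (OA, Kappa, and perclass accuracy) can be computed from a confusion matrix as sketched below. The sketch assumes that rows hold the reference classes and columns the predicted classes, and that perclass accuracy is the fraction of reference pixels of a class that are labeled correctly; the text does not state which convention was used.

```python
def overall_accuracy(cm):
    """OA = correctly classified pixels / all pixels."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Kappa = (p_o - p_e) / (1 - p_e), with p_e the chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[r][i] for r in range(n)) for i in range(n)]
    p_e = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


def per_class_accuracy(cm):
    """Fraction of each reference class that was predicted correctly."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


# Made-up two-class example, not data from the experiments.
cm = [[40, 10],
      [5, 45]]
print(overall_accuracy(cm), cohen_kappa(cm), per_class_accuracy(cm))
```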

IV. RESULTS AND DISCUSSION
In this section, we evaluate the overall performance of the two relearning strategies applied to the three models. Moreover, we compare the pixel accuracies of each LULC class for the different relearning methods. Visual observations were conducted to evaluate the temporal correlations between the classifications of the individual time steps.

A. Analysis of Segmentation Scale
The analysis of segmentation scale was conducted for each model in the two datasets. The optimal segmentation scale for each dataset varies from model to model. In dataset I, the optimal segmentation scales for the UNet model, the simple ConvLSTM model, and the UNet-ConvLSTM were 20, 85, and 70, respectively. In dataset II, the corresponding scales were 55, 100, and 60. Although the optimal segmentation scales of a model differ between the two datasets, the accuracy effects of different scales on the same model share similar patterns. In both datasets, the UNets required relatively small segmentation scales to obtain an optimal improvement, whereas the ConvLSTMs demanded larger scales to achieve more accuracy improvement in the OBV operation. As can be observed in Fig. 6, the accuracy effects were optimal when the scale was around 60 to 80; smaller or larger scale parameters yielded less accuracy improvement.

B. Overall Accuracies and κ Statistics
The initial training of the three models followed by the two relearning strategies generated results for nine approaches. OBV was then adopted as a last-step CPP method for each of them to generate benchmarks for the two relearning strategies. In all, 18 different approaches were evaluated and compared for each dataset.
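The count of 18 approaches follows from crossing three models, three training stages (initial training, pixel relearning, and OBV relearning), and the presence or absence of OBV as a last-step CPP. The enumeration below makes this explicit; the name strings are illustrative and do not necessarily match the abbreviations used in Table I.

```python
from itertools import product

MODELS = ("UNet", "ConvLSTM", "UNet-ConvLSTM")
STAGES = ("initial", "PR", "OBVR")  # initial training, pixel relearning, OBV relearning


def list_approaches():
    """Enumerate every model/stage combination, without and with last-step OBV."""
    names = []
    for model, stage, with_obv in product(MODELS, STAGES, (False, True)):
        name = f"{model}-{stage}"
        if with_obv:
            name += "-OBV"
        names.append(name)
    return names


approaches = list_approaches()
print(len(approaches))  # 18
```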
In general, as shown in Fig. 7, UNet-ConvLSTM with relearning strategies achieved the highest accuracy in both datasets: 79.1% in dataset I and 84.4% in dataset II. UNet with relearning strategies was slightly less accurate; its best accuracies in the two datasets were 79% and 77.4%, respectively. In contrast, the classification results achieved by the simple ConvLSTM were much lower than those of the other two models.
Before using OBV as a last-step CPP, both pixel-based relearning and OBV-relearning improved model performance. The accuracy of the initial training of all three models increased significantly after applying the two relearning strategies. In dataset I, the OBV-relearning approaches showed higher kappa and OA values than the initial training and the pixel-based relearning approaches for all three models. More specifically, UNet-ConvLSTM with OBV-relearning achieved the best performance in terms of kappa and OA, which were 75.2% and 79.2%, respectively; these increased by 9% and 5.8% compared with the initial training outcomes. However, for UNet-ConvLSTM in dataset II, the OBV-relearning result achieved the same accuracy as the pixel-relearning result, and both were the best scores among all the approaches in dataset II.
Furthermore, it was also observed that the two relearning strategies have a stronger effect on the UNet-ConvLSTM than on the UNet. In dataset I, the OA of UNet-ConvLSTM increased from 73.4% to 79.2%, and the OA of UNet rose from 71.8% to 76.9%. The improvements were also significant in dataset II: the OA of UNet-ConvLSTM climbed from 73.5% to 79.3%, and that of UNet increased from 76.2% to 77.4%. It is also interesting to observe that, although in dataset II the UNet had a better initial prediction (76.2%) than the UNet-ConvLSTM (73.5%), the stronger effect of the relearning strategies on the latter allowed it to surpass the relearned prediction of the former.
TABLE I. PERCLASS ACCURACY OF 18 DIFFERENT APPROACHES FOR DATASET I AND DATASET II
When using OBV as a last-step CPP, the improvements were most significant when it was applied to the results of the initial training and of pixel-based relearning. After applying OBV as a last-step CPP on UNet-ConvLSTM, the accuracy of the pixel-relearned classification improved considerably more than that of the OBV-relearned classification. In dataset II, the OA of the pixel-relearned UNet-ConvLSTM improved from 79.3% to 84.3%, whereas the OA of the OBV-relearned UNet-ConvLSTM improved only from 79.3% to 81.3%. The same effect could be seen in dataset I. This is very likely because OBV had already taken effect during the process of OBV-relearning; its effect therefore became less significant when OBV was used again as a last-step CPP on the OBV-relearned result.
However, it is also evident from the results of both datasets that using OBV as a last-step CPP did not improve the classification results of all the approaches; a few test results showed a decrease in classification accuracy. For instance, the OA of the OBV-relearned UNet-ConvLSTM in dataset I dropped by 0.2% after applying OBV as CPP, and the OA of the pixel-relearned UNet in dataset II decreased by 0.3%.
In addition, the effect of the number of classes on the accuracy of UNet-ConvLSTM PR-OBV was studied (Fig. 8). Besides the 10-class LULC categorization deployed in the main experiments, a 4-class and a 7-class LULC categorization were adopted for comparison. The 4-class categorization consists of 1) vegetation, 2) urban extent, 3) soil, and 4) water, and the 7-class categorization contains 1) farmland and grassland, 2) formal settlements, 3) informal settlements (including rural settlements), 4) other impervious surface (including roads), 5) soil, 6) trees, and 7) water. As can be seen from Fig. 8, in both datasets, a smaller number of classes to be distinguished yields a higher level of accuracy.
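The 7-class categorization can be expressed as a lookup table over the ten original classes, as sketched below. The merges follow the groupings listed above; mapping "bare soil" to "soil" and "trees and bushes" to "trees" is an assumed one-to-one correspondence. The 4-class mapping is omitted because the text does not specify how every original class is assigned to the four groups.

```python
# Mapping from the ten original LULC classes to the seven merged classes.
TO_SEVEN_CLASSES = {
    "rural settlements": "informal settlements",
    "informal settlements": "informal settlements",
    "formal settlements": "formal settlements",
    "bare soil": "soil",                      # assumed correspondence
    "grassland": "farmland and grassland",
    "farmland": "farmland and grassland",
    "trees and bushes": "trees",              # assumed correspondence
    "water": "water",
    "roads": "other impervious surface",
    "other impervious surface": "other impervious surface",
}


def remap(class_map, mapping=TO_SEVEN_CLASSES):
    """Relabel an H x W map of class names with the merged categories."""
    return [[mapping[label] for label in row] for row in class_map]


merged = remap([["roads", "grassland"], ["rural settlements", "water"]])
```

Applying the same remapping to both the predictions and the ground truth allows the coarser categorizations to be scored without retraining, although whether the comparison in Fig. 8 was produced this way or by retraining is not stated.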

C. Perclass Accuracies
As shown in Table I, in both datasets, all the top scores for each LULC class were achieved by UNet-ConvLSTM approaches. More specifically, in dataset I, the best scores were spread across the various relearning strategies of UNet-ConvLSTM. The OBV-relearned UNet-ConvLSTM (UC-OBVR) achieved more best scores than the others. It can also be seen that the performance of the pixel-relearned UNet-ConvLSTM (UC-PR) was only slightly better than that of the variant with OBV added as CPP (UC-PR-OBV). In dataset II, in contrast, most of the best scores were achieved by UC-PR-OBV.
The results also show that the simple ConvLSTM approaches failed to recognize "informal settlements" and performed poorly in classifying "rural settlements" and "other impervious surface." This poor performance contributed to the lowest OA among the three models.
The class-based accuracy indicates that thematic classes have lower class separability than others. For example, "other impervious surface," which has almost the same spectral values as "roads," received the lowest accuracy in all three models because the separability of this class depends mainly on semantic information. Among all the LULC classes, the relearning strategies brought the most significant accuracy improvements for "grassland" and "impervious surface." The pixel accuracy of "grassland" was improved from 56.5% to 74.5% by the pixel-relearned UNet-ConvLSTM with OBV as CPP (UC-PR-OBV) in dataset I and from 48.9% to 77.2% by UC-PR-OBV in dataset II. Meanwhile, the accuracy of "impervious surface" was improved from 49% to 71.1% by UC-PR-OBV in dataset II.

D. Visual Observation
In Fig. 9, the year 2012 classification maps of the 18 approaches for dataset I are compared. It is evident that all the simple ConvLSTM approaches performed poorly in recognizing "informal settlements," whereas the UNet and UNet-ConvLSTM approaches showed satisfactory results. This is very likely because a ConvLSTM structure that simply consists of a stack of ConvLSTM layers lacks the strength of encoder-decoder structures in extracting high-level spatial context features. Similarly, it can be argued that the encoder-decoder structure of the UNet-ConvLSTM-based approaches compensates for the weakness of a ConvLSTM model in extracting high-level representations. Accordingly, the classification maps of the UNet-ConvLSTM-based approaches achieved good performance in segmenting thematic classes.
Comparing the UNet-ConvLSTM and UNet approaches, the result of the initial training of UNet-ConvLSTM exhibited noise in the classification labels, while the result of the initial training of UNet showed blurred boundaries. After three iterations of relearning, the OBV-relearned and pixel-relearned results showed similar visual representations. Both the salt-and-pepper effect and the blurred boundaries were considerably alleviated, especially in the results of UNet-ConvLSTM. However, not all the notable changes after relearning improved the accuracy of the classification maps. For instance, a larger OBV scale merged small segments into big parcels; although this was helpful for mitigating salt-and-pepper effects, it resulted in wrongly classified pixels (e.g., the initial OBV result of UNet-ConvLSTM mistakenly classified parts of roads as trees and bushes).
As mentioned in the previous sections, the approaches of pixel relearning with OBV as CPP (PR-OBV) achieved the best scores in both OA and perclass accuracy in dataset II. In Fig. 10, the temporal sequences obtained by the PR-OBV-based approaches in dataset II are compared with the corresponding ground truth. First, temporal consistency in the LULC changes is obvious in the ground truth maps; see Fig. 10(a). Furthermore, the results show that UNet-ConvLSTM-PR-OBV successfully captured this temporal consistency; see Fig. 10(c). In particular, the "informal settlements" in the classification maps of UNet-ConvLSTM grow in a consistent manner from 2002 to 2018. However, since the UNet approaches did not take temporal relationships into account, the four time steps of the classification maps generated by the UNet-based approaches show inconsistencies in the changes of many LULC classes. For instance, patches of "informal settlements" obtained by UNet changed into "rural settlements" and then changed back to "informal settlements" [Fig. 10(d)]. In contrast, the UNet-ConvLSTM approaches, which are designed for spatial-temporal segmentation tasks, performed better in reflecting temporal consistency. The capability of reflecting temporal consistency in classification results is not only useful for improving classification accuracy but can also be valuable for change detection and trend analysis. This clear benefit of the UNet-ConvLSTM relearning approaches should not be ignored.
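The kind of inconsistency described above, where a pixel leaves a class and later returns to it, can be flagged automatically. The sketch below is an illustrative check and not a measure used in this article: it assumes that any return to a previously held label counts as a reversal, although real land use can legitimately revert, so flagged pixels are candidates for inspection and not confirmed errors.

```python
def has_reversal(sequence):
    """True if a label is abandoned and later resumed (A -> B -> ... -> A)."""
    visited = set()
    previous = None
    for label in sequence:
        if label == previous:
            continue  # still inside the same run of identical labels
        if label in visited:
            return True
        visited.add(label)
        previous = label
    return False


def reversal_rate(maps):
    """Fraction of pixels whose label sequence over time contains a reversal.

    maps: list of H x W label maps, ordered by acquisition time
    """
    height, width = len(maps[0]), len(maps[0][0])
    flagged = sum(
        has_reversal([m[r][c] for m in maps])
        for r in range(height)
        for c in range(width)
    )
    return flagged / (height * width)


# Toy 1 x 2 sequence over four time steps (made-up labels).
maps = [
    [["informal settlements", "farmland"]],
    [["rural settlements", "farmland"]],
    [["rural settlements", "informal settlements"]],
    [["informal settlements", "informal settlements"]],
]
print(reversal_rate(maps))  # 0.5
```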

E. Time Consumption of Relearning
The time consumption of pixel-based relearning and OBV-based relearning for the three models (i.e., UNet, ConvLSTM, and UNet-ConvLSTM) is compared in Fig. 11. As mentioned, all the training operations were carried out on an Nvidia GeForce RTX 2080 GPU using the Keras framework (TensorFlow backend). In addition to the initial training of the three models, the relearning operations were conducted for three iterations in the experiments. Since operating OBV is a manual process that does not involve model training, its time consumption is not included in this comparison. In total, six relearning approaches were compared, with a breakdown of the time consumed in each phase.
It can be observed that the UNet-based approaches tend to demand less training time to complete three iterations of relearning (less than 3 h), while the UNet-ConvLSTM-based approaches required almost double the training time (less than 6 h). This is very likely because the recurrent structure of ConvLSTM-based models contains more learnable parameters and therefore demands more processing time. It should be noted that, for the UNet-ConvLSTM-based methods in both datasets, the pixel-based relearning approach is more efficient than OBV-based relearning. By and large, the extra training time of the proposed method appears to be a reasonable tradeoff given the higher OA achieved.

V. CONCLUSION
The main purpose of this article is to explore the extent to which a hybrid multitemporal relearning method can improve the accuracy of LULC classification. Three CNN models with two main relearning strategies, pixel-based relearning and OBV relearning, were tested and analyzed. OBV was also deployed after each relearning approach as a last-step CPP for benchmarking. The classification tasks were set to be complex and challenging in order to test a wide range of capabilities of the different relearning approaches. The main findings are as follows.
1) UNet-ConvLSTM outperforms UNet and the simple ConvLSTM by achieving higher classification accuracy, as well as by reflecting temporal consistency in multitemporal LULC classification. This is because a UNet-ConvLSTM takes advantage of encoder-decoder structures and exploits the temporal dependency embedded in the multitemporal data.
2) Both pixel-representation relearning and OBR can improve classification accuracy. When conducting OBR, selecting the most effective segmentation scale is crucial; scales that are too large can fail to preserve small objects in the classification maps. It should be noted that, although OBV relearning could be more effective in alleviating salt-and-pepper noise, OBV scale selection is a time-consuming process. Since PR-OBV achieved similar, or even better, accuracy in this article, integrating a multitemporal segmentation model with PR-OBV proves to be an efficient method.
3) The performance of pixel-relearning and OBV-relearning can vary from model to model. Based on the test results in this article, these two relearning strategies can achieve similar accuracy improvements. However, adding OBV as a last-step CPP can boost the accuracy of pixel-relearning approaches to a larger extent than that of the OBV-relearning approaches.
4) The optimal OA of the classification maps achieved in this article is admittedly not very high (e.g., over 90%). This is mostly due to the complex thematic LULC class categories in this study. Compared with formal settlements, the difference between the building morphologies of informal settlements and rural settlements remains very subtle. Furthermore, the informal settlements in the study area have tiny plot sizes and are mostly scattered within rural settlements. These factors inevitably limited the achievable classification accuracy; however, these features are not unique to this case study but reflect the reality of many rapidly urbanizing cities around the world.
5) During the experiments, we observed that the accuracy of the classification results increased with the number of labeled samples (i.e., the number of training patches). Moreover, considering that certain thematic classes generally have lower class separability than others and that training sets can be imbalanced, data augmentation techniques could be applied specifically to certain thematic classes to enhance the discriminative power of the classifiers. Consequently, further improvements in accuracy can be expected when integrating larger training sets and tailored augmentation procedures.
In summary, this article shows that a combination of spatial-temporal models and suitable relearning strategies can produce very promising LULC classification maps from VHR satellite imagery, even for highly complex classification tasks with limited training data.
Assuming a longer temporal sequence could better reflect features of temporal dependency, further research could focus on testing and developing relearning methods on very long temporal sequential data for optimizing multitemporal LULC classification.