A Novel Multi-Training Method for Time-Series Urban Green Cover Recognition From Multitemporal Remote Sensing Images

Urban green space plays a crucial role in the construction of ecological cities and livable environments. While multitemporal remote sensing images provide strong support for urban green cover monitoring, they often suffer from data shifting, where the data distribution varies from phase to phase. Designing a general multitemporal framework to extract urban green cover is challenging, mainly because of time-consuming data labeling and inconsistent predictions. To address this, we propose multi-training, a novel method for land cover classification on multitemporal remote sensing images. Multi-training is a two-stage method that independently trains a classifier on each phase in the training stage and then combines the information from all the classifiers in the communication stage. As a semi-supervised learning method, multi-training adopts a new rule to obtain the confidence of unlabeled samples' predictions, which reduces the dependence on labeled data and increases the consistency of the results between phases. The experimental results show that multi-training outperforms self-training, co-training, tri-training, and super-training in both accuracy and consistency on multitemporal remote sensing image datasets. Furthermore, we have analyzed the key parameters of our method and conclude that the number and the combination of phases dominate the prediction results.


I. INTRODUCTION
VEGETATION, as an important part of the urban ecological landscape, plays a key role in air filtration, microclimate regulation, noise reduction, and water quality amelioration [1]. Therefore, inventorying the temporal and spatial information of urban green cover is beneficial for decision making in natural landscape management and planning [2]. With the rapid development of remote sensing techniques, satellite data have become one of the most important data sources for extracting urban green cover, which brings new impetus to the monitoring of urban vegetation. Multitemporal remote sensing images provide more time-series information about land cover than a single-temporal image and thus supply strong support for land classification and landscape change detection [3]. However, the classification of remote sensing images under multitemporal conditions mainly faces the following problems. 1) It is time-consuming and labor-intensive to prepare a large number of labeled samples for training models with supervised learning methods.
2) The labeled samples in one phase cannot be directly applied to another phase because the spectral information of vegetation is always influenced by phenology, weather, and atmospheric conditions [4].
3) The classification results in each phase always differ from each other, which cannot guarantee consistency in time series. To solve these three problems of land cover classification based on multitemporal remote sensing images, the existing techniques can be divided into three categories: deep learning, transfer learning, and semi-supervised learning (SSL).
Some researchers have applied deep learning methods to the land cover classification of multitemporal images for their ability to automatically explore the best representation of the data's features. For example, recurrent neural networks (RNNs) were constructed to capture the relationships in time-series images [5], [6]. Long short-term memory was used to deal with long time-series images and to address the problems of gradient vanishing and gradient explosion [7], [8]. One-dimensional convolutional neural networks (1DCNN) [9] and three-dimensional convolutional neural networks (3DCNN) [10] were designed to classify crops from multitemporal remote sensing images. In 1DCNN, crops were classified with the characteristics of the time dimension. In 3DCNN, the time direction was the third dimension besides the image length and width. In addition, several hybrid models were proposed. For example, Rußwurm and Körner [11] added two-dimensional convolutional layers as spatial feature extractors and connected recurrent cells in a bidirectional way to reduce temporal biases toward later images. Interdonato et al. [12] combined CNN and RNN into one end-to-end architecture to learn diverse spectral-spatial-temporal feature representations. Although deep learning methods have achieved great performance, collecting a large number of labeled samples for training carries a huge time cost.
Transfer learning is another mainstream approach to these problems. Its purpose is to apply the classifier trained in the source domain to the target domain and make full use of the knowledge learned in the source domain [13]. In recent years, transfer learning methods have been developed for multitemporal remote sensing image classification and land cover classification [14], [15], [16]. However, these methods work well only if the source domain can be fully trained, i.e., the training data in the source domain are adequate. Another problem is that if too many training phases are involved, it can be difficult for the model to perform well on all phases.
SSL is a class of approaches that utilize the relationships and structure among unlabeled samples to improve the model's performance [17]. In most cases, the unlabeled samples are numerous enough to contain sufficient information, and SSL exploits them to improve the model's performance [18]. SSL methods can be categorized into four classes based on their assumptions: 1) generative methods [19]; 2) low-density separation methods, such as transductive support vector machines [20]; 3) graph-based methods [21]; and 4) divergence-based methods, such as the cooperative training (co-training) method [22]. These SSL methods have achieved remarkable results in machine learning tasks, and many researchers have applied them to classify land cover and land use on remote sensing images [23], [24], [25], [26]. In addition, several researchers modified SSL methods to fit the classification problem on multitemporal remote sensing images.
Co-training was originally a semi-supervised method designed for two independent and sufficient views; it trained two classifiers separately, which chose unlabeled samples for each other [22]. Further studies showed that the assumption of two conditionally independent views is not necessary [27]. The key for co-training approaches to succeed is that there exists a large difference between the classifiers [28]. By using output smearing [30] and fine-tuning networks to create diversity among three convolutional neural networks, tri-net [29] was able to classify multiple images, and it used multiple dropouts to ensure the accuracy of prediction. However, the computational cost of this method is huge because it trains three deep neural networks and uses multiple dropouts. Co-training was also extended to identify and extract snow cover from two phases of remote sensing images, where it worked well [31]. It independently trained two classifiers on the unchanged area of each phase and updated into the training set the unlabeled samples whose prediction was greater than the threshold. Before training, this co-training method adopted a fixed rule, the Chi-Square distance [32], to compute the data difference between phases and distinguish the changed and unchanged areas, but it could not handle multitemporal data directly because the data distribution shifts seriously.
We propose a novel multi-training method for land cover classification of multitemporal remote sensing images based on a small number of training samples. In previous works, when the number of labeled samples is limited, the updated samples with pseudolabels given by the model always contain much noise, which in turn can decrease the model's performance [33], [34]. Our motivation is to fully use the multitemporal information to eliminate this noise and update pseudolabels with high confidence. Because of the influence of phenology in multitemporal remote sensing images, vegetation changes greatly over a long period, which makes green cover recognition difficult. We focus on both the accuracy and the consistency of vegetation recognition between different phases when evaluating the performance of the methods. We also divide the other land covers into subclasses to decrease misclassification between vegetation and nonvegetation.
The main contributions of this study are as follows. 1) We propose a method for utilizing information from multitemporal remote sensing images and classifying land covers based on a very small training set, even when each class has only one labeled sample. 2) We explore a new mode in this method to solve the classification problem of multitemporal remote sensing data and convert the data's differences into the power of the model's classification ability. 3) We adopt a dynamic way to view different phases' data from a global perspective and make progress in both accuracy and consistency compared with previous works.

The rest of this article is organized as follows. The proposed method is illustrated in Section II. The study area and the data are introduced in Section III. In Section IV, the experimental setup and related parameters are described. The results and comparative analysis are presented in Section V. Section VI presents the discussion. Finally, Section VII concludes this article.

II. METHODOLOGY
In this section, we first introduce how we design the algorithm for multitemporal problems. Then, we demonstrate the most important part of the proposed method: how to choose the unlabeled samples and update them into the training set. Additionally, we explain why we select the random forest (RF) as the base classifier. At the end of this section, we present the proposed method in pseudocode and explain some training details.

A. Multitemporal Extensions
We propose the multi-training method for the land cover classification problem in multiple phases. The key problem of classifying land cover on multitemporal images is how to make effective use of abundant information to improve the performance of the model. We are motivated by the fact that the joint information from all phases can maximally capture the relationship of land cover and enhance the consistency during the period, and we aim to improve the final classification result by combining the predictions on each phase. To do so, we randomly select training samples to initialize the base classifier on each phase and update the unlabeled samples whose confidence value is beyond the threshold. Then, given the updated training set, we iteratively optimize the classifiers until the training process terminates. The framework of multi-training is intuitively shown in Fig. 1.
We follow the assumption made by Zhu et al. [31] that changed areas commonly account for a small part of the image during a period. Based on this assumption, the initial classifiers can be trained independently on their own training data without worrying about the temporal inconsistency of the selected samples. Then, after each round of training, all the classifiers communicate with each other in two steps: evaluating and updating. During the evaluating step, all the classifiers' predictions are combined to compute the joint confidence value of the pixels at the same position under multiple phases. At the updating step, the samples with high joint confidence values will be added to the training set of each classifier. By sharing information between each phase, classifiers learn from each other in the communication stage, increasing their robustness and consistency. After finite iterations, the accuracy and consistency of the predictions given by multiple classifiers can be gradually improved. The distribution's difference in multitemporal remote sensing data can guarantee the heterogeneity of these classifiers, which can be regarded as a random data augmentation.
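The train/communicate loop described above can be sketched as follows. This is a minimal structural sketch, not the authors' implementation; the callables `train`, `predict_proba`, `combine`, and `update` are hypothetical stand-ins for the per-phase classifier training, probability prediction, joint-confidence evaluation, and training-set updating steps detailed later in this section.

```python
def multi_train(phases, n_rounds, train, predict_proba, combine, update):
    """phases: one dict per phase with keys 'Xl' (labeled features),
    'yl' (labels), and 'X' (all pixels of that phase)."""
    # Training stage: one independent classifier per phase.
    models = [train(p['Xl'], p['yl']) for p in phases]
    for _ in range(n_rounds):
        # Communication stage, step 1 (evaluating): combine the per-phase
        # probability maps into one joint confidence map.
        probs = [predict_proba(m, p['X']) for m, p in zip(models, phases)]
        conf = combine(probs)
        # Step 2 (updating): every phase adds confident pixels to its own
        # training set, then all classifiers are retrained.
        for p in phases:
            update(p, conf)
        models = [train(p['Xl'], p['yl']) for p in phases]
    return models
```

Each classifier stays tied to its own phase's data; only the confidence map is shared, which is what lets the phases "communicate" without exchanging raw labels.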
Multi-training trains one classifier for each phase. In the prediction process, the test samples of each phase are predicted by the classifier trained on the corresponding phase. In the accuracy evaluation, the prediction accuracy under each phase is obtained, and the overall evaluation for multiple phases is the mean of the per-phase accuracies.

B. Selection of Unlabeled Samples for Training
Data shifting, in which the data distribution varies from phase to phase, is an inevitable problem when processing multitemporal data, as shown in Fig. 2. This problem has several causes. 1) The influence of solar radiation and atmospheric conditions at different phases [35], together with changes in the position and height of the satellite, leads to differences in the data. 2) Phenological changes in vegetation result in changes in the spectra under multiple phases [36]. 3) The land cover is changed by human activities over a long period, which causes the data's feature space in the same area to shift seriously. Because of data shifting, directly applying the labeled data from one phase to another will lead to low classification accuracy and inconsistent results.
To make full use of the multiphase information and jointly enlarge the training set of each phase, we need to determine which samples' classes are unchanged so that the rich information of these samples over multiple phases can be used in training to improve classification accuracy. In the previous study [31], the changed and unchanged areas were divided into two classes by the unsupervised Kittler-Illingworth thresholding algorithm before running the co-training algorithm, and the updating process then took place in the unchanged areas. However, setting a fixed measurement over multiple phases leads to a dilemma: it is difficult to judge whether a pixel's class changes based on a fixed criterion, because the distributions of data in different phases deviate greatly.
There is another problem: when the number of labeled samples is limited, the labeled samples updated by SSL always contain much noise that may mislead the model's classification. Therefore, to overcome this difficulty, the model should update only samples with high confidence in order to filter out the noise.
In the proposed method, instead of explicitly measuring the distance between different phases' data as in the previous method [31], we implicitly calculate a joint confidence value based on the classifiers' predicted probabilities on their own phases to find the unchanged area, as shown in Fig. 3. The joint confidence value over multiple phases not only helps the model determine a sample's class and change information but also eliminates the noise introduced when updating the pseudolabeled samples. The calculation of the joint confidence proceeds as follows. First, we train a classifier independently on each phase. After training, each classifier gives a probability prediction map of the whole image on its own phase. At each pixel of the probability map, there is a probability vector whose size equals the number of land cover classes, and each value represents how likely the pixel belongs to that class. Then, we combine all the probability results to obtain the final confidence value over the whole episode

C(f_1, f_2, \ldots, f_k) = \frac{1}{Z}\,\frac{\bigl(\prod_{i=1}^{k} P(y_i \mid x_i, \theta_i)\bigr)^{2/k}}{\sum_{i=1}^{k} 1/P(y_i \mid x_i, \theta_i)}    (1)

where f_1, f_2, \ldots, f_k denote the models; P(y_i | x_i, \theta_i) denotes the probability map on the ith phase; and C(f_1, f_2, \ldots, f_k) denotes the joint confidence map. The idea of (1) comes from the harmonic mean: the denominator is the sum of reciprocals that defines the harmonic mean, which makes (1) sensitive to small values and ensures that a large joint confidence can only come from inputs that are all large. The 2/k power of the numerator balances the dimension across different numbers of phases, and Z restricts the summation of the output confidences to 1.
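The harmonic-mean-style joint confidence described above can be sketched in NumPy. This is an illustrative reading of the rule, not the authors' code: the exponent `2/k`, the reciprocal-sum denominator, and the per-pixel normalizer `Z` follow the textual description and should be treated as assumptions.

```python
import numpy as np

def joint_confidence(probs, eps=1e-12):
    """Joint confidence over k phases (sketch of the harmonic-mean rule).

    probs: array of shape (k, n_pixels, n_classes) holding each phase's
    predicted class probabilities.  Returns (n_pixels, n_classes),
    normalised so each pixel's confidences sum to 1 (the role of Z).
    """
    probs = np.asarray(probs, dtype=float)
    k = probs.shape[0]
    num = np.prod(probs, axis=0) ** (2.0 / k)      # (prod_i p_i)^(2/k)
    den = np.sum(1.0 / (probs + eps), axis=0)      # sum_i 1/p_i (harmonic term)
    c = num / den
    return c / (c.sum(axis=-1, keepdims=True) + eps)
```

Because of the reciprocal sum, a single low per-phase probability crushes the joint confidence, so only pixels that all classifiers agree on with high probability survive the threshold.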
We combine the class prediction information of samples over multiple phases into a confidence map for the whole period by (1). When computing the joint confidence, there are three conditions. 1) If the probabilities given by all classifiers are high, the resulting joint confidence value will also be high, which implies that the pixel's class has not changed and it belongs to this class throughout the period with high probability. 2) If some classifiers' probabilities are high but others' are low, the class of the sample has probably changed. 3) If most classifiers' probabilities are low, the joint confidence value must be low, suggesting that the sample does not belong to the class. Compared with voting methods [37], this mapping function provides a potential way to judge the pixels' class and change condition, and the final results depend on the global joint confidence value. The changing map, in which each pixel holds the maximum joint confidence given by (1), is shown in Fig. 4.
After each round of predictions, the joint confidence map is obtained by the above method, and the training set is then updated. At the updating step, we calculate the average confidence value of each class and set it as the threshold. Then, to ensure randomness and keep the training process generalized, we randomly select a fixed number of pixels whose confidence values are higher than the threshold as new training samples. This process is repeated until the final accuracy converges.
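The updating step for a single class can be sketched as follows (a minimal sketch; the function name and the use of NumPy's `Generator` are ours, not the authors' implementation):

```python
import numpy as np

def select_updates(conf, n_u, rng=None):
    """Updating step for one class.

    conf: (n_pixels,) joint confidence of the unlabeled pixels for this
    class.  The threshold is the class's mean confidence; n_u pixels above
    the threshold are chosen at random to keep the update stochastic.
    """
    rng = np.random.default_rng(rng)
    th = conf.mean()                           # per-class updating threshold
    candidates = np.flatnonzero(conf > th)     # pixels above the threshold
    n = min(n_u, candidates.size)              # guard against small pools
    return rng.choice(candidates, size=n, replace=False)
```

The random draw (rather than taking the top-n_u pixels) is what the text calls randomized updating: it prevents the training set from collapsing onto a few over-confident regions.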

C. Base Classifier
RF is a widely used model in the machine learning community that ensembles many decision tree classifiers to predict the final result [38]. RF adopts random selection of data and features to construct many independent decision trees and combines their results by majority voting. RF can fully mine the data information and achieve a good classification result even with a limited number of samples. Therefore, we choose RF as the base classifier to test whether the proposed algorithm can still improve the performance over the period under the condition that the base classifiers have made full use of the information of each phase.
It is worth noting that other classifiers that can give the predicted class probabilities of samples (e.g., 1DCNN [9], the naive Bayesian model [39], and the support vector machine [40]) can also serve as the base classifier. In the proposed method, the RF classifier is adopted as the base classifier, but it is not the only feasible choice; under the framework of the proposed method, the most suitable classifier can be chosen according to the specific problem setting. Furthermore, we performed a small set of experiments replacing RF with 1DCNN to verify this statement.
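The only interface the framework needs from the base classifier is per-class probability output. A toy scikit-learn illustration (synthetic data, not the paper's imagery; the feature count of 9 assumes the 4 bands plus 5 indices used later):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# RF's predict_proba reports, for each pixel, the fraction of trees voting
# for each class -- exactly the probability vector the joint-confidence
# rule consumes.  Any model exposing predict_proba could be swapped in.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 9))          # 9 features: 4 bands + 5 indices
y = (X[:, 0] > 0).astype(int)         # synthetic binary labels

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
proba = rf.predict_proba(X)           # shape: (n_pixels, n_classes)
```

Swapping RF for a 1DCNN only requires that the network's softmax output be exposed through the same probability interface.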

D. Theoretical Analysis of Multi-Training
The main motivation of the proposed multi-training method is similar to that of ensemble learning, though it differs slightly from the general idea of ensemble learning. Ensemble learning combines all the classifiers' results on the data and then improves the final result by voting. We use the idea of ensemble learning to provide higher confidence for the updating of unlabeled samples within the SSL framework. The data's similarity between different phases gives the base classifier of each phase more reference data for prediction, while the data's divergence provides the model with more disagreement. In the process of determining the samples to update, the discriminant equation of the algorithm adopts the idea of the harmonic mean so that the final joint confidence considers the prediction results of all classifiers. However, we only use the confidence value as an updating standard, rather than directly assigning labels to the unlabeled data: the training set is updated only after all the samples have completed the joint confidence computation. Since this updating process considers the image information of the whole period, the prediction of each pixel is more robust than in ensemble learning. Therefore, the spatial and temporal consistency of the classification results is ensured by the proposed method.

E. Process of Multi-Training
The multi-training process is presented in Algorithm 1 in pseudocode. Some details that affect the final results are listed below.
1) To test the accuracy and stability of the proposed method, we need to randomly select the samples many times. The training set and the test set should be split before this repeated sampling, so that the training data can only be sampled from the pre-cut training pool; the model never has access to the test set in advance. 2) To compare the performance of different methods, we fix the random seeds to guarantee that the different methods share the same training set. 3) When updating unlabeled samples into the training set, the samples whose confidence is higher than the threshold are randomly selected; randomized updating makes the training process more generalized.
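Details 1) and 2) can be sketched as code (an illustrative sketch; the function name and the 20% pool fraction, matching the 1:4 split described later, are our assumptions):

```python
import numpy as np

def split_then_sample(indices, n_l, train_frac=0.2, seed=0):
    """Cut the train pool / test set once, then draw the tiny labeled set
    only from the pre-cut pool.  A fixed seed makes every compared method
    see the same labeled samples."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(indices)
    n_pool = int(train_frac * len(indices))
    pool, test = perm[:n_pool], perm[n_pool:]   # test set is never sampled
    labeled = rng.choice(pool, size=n_l, replace=False)
    return labeled, pool, test
```

Repeated experiments re-call the sampling step with fresh seeds, but always against the same pre-cut pool, so no test pixel ever leaks into training.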

III. STUDY AREA AND DATA

A. Study Area
The study area is in Nanjing, the capital of Jiangsu Province, China. Nanjing is situated in the lower reaches of the Yangtze River Plain. Located in the subtropical monsoon climate zone, its vegetation differs greatly between seasons, which provides powerful support for multitemporal training. As a national ecological garden city, Nanjing is the most forested city in Jiangsu Province and is thus an ideal study area for recognition of urban green cover. We therefore chose Nanjing as the study area and carried out the study in the city center, which includes vegetation in parks, on mountains, and along streets.

B. Data
This study used Sentinel-2 multispectral instrument (MSI) images (obtained from Google Earth Engine) of the Xuanwu Lake area in Nanjing, as shown in Fig. 5. Sentinel-2 is a wide-swath multispectral imaging mission supporting land monitoring studies, including the monitoring of vegetation, soil, and urban areas, as well as the observation of inland waterways and coastal areas. Each image contains 772 × 653 pixels, and each pixel has four spectral bands with a spatial resolution of 10 m: blue (0.46-0.52 μm), green (0.54-0.58 μm), red (0.65-0.68 μm), and near infrared (0.78-0.90 μm). We selected eight images from February 11, 2017 to January 22, 2019, roughly one image every three months, as shown in Table I. The images selected from different months were used to verify the proposed method's ability to overcome the identification challenge caused by differences in vegetation phenology, and the images chosen from different years were used to test its performance for annual vegetation information updating.
To augment the existing data and classify green cover efficiently, we adopted some commonly used radiometric indices based on their availability. The nine selected indices are shown in Table II. We sampled from the data and calculated the correlation between the features, as shown in Fig. 6. For any two indices with a large correlation coefficient, only one was kept to remove redundancy. The five retained indices (NDVI, MSAVI, MTVI, SAVI, and VARI) and the original four spectral bands were combined to construct the features of the images. The classification scheme of the study area includes six classes: tree, low vegetation, water, road, building, and shadow. For more refined vegetation extraction, we divided the vegetation into finer classes, referring to the ISPRS Vaihingen dataset. A relatively rough classification scheme was used for the other categories, in which shadow represents a part of the earth's surface that is blocked from sunlight by buildings. Note that the classes other than vegetation were merged in the final results.
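The correlation-based de-redundancy step can be sketched as a greedy filter (an illustrative sketch: the paper only says "large correlation coefficients", so the 0.9 cutoff and the keep-the-first rule are assumptions):

```python
import numpy as np

def drop_correlated(features, names, threshold=0.9):
    """Greedy de-redundancy over radiometric indices.

    features: (n_samples, n_features); names: matching feature names.
    Keeps a feature only if its |Pearson r| with every already-kept
    feature is at most the threshold.
    """
    corr = np.abs(np.corrcoef(features, rowvar=False))
    keep = []
    for j in range(features.shape[1]):
        if all(corr[j, i] <= threshold for i in keep):
            keep.append(j)
    return [names[j] for j in keep]
```

Applied to the nine candidate indices, a filter like this would retain an uncorrelated subset such as the five indices kept in the study.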
In each phase, we manually sampled the whole image, obtaining 1000 labeled pixels for each category. The labeled samples differed between phases. We further divided the labeled samples into a training set and a test set at a 1:4 ratio, as shown in Table III. More precisely, the training set here should be called a training pool: to test the efficiency of the proposed method, in each repeated experiment we independently sampled only a very small, fixed-size subset of it to form the training set. The unlabeled sample set used in the experiments consists of all pixels of the images, excluding the labeled part.

IV. EXPERIMENTAL SETUP

A. Validation Metric
In our experiments, the F1-score was selected as the validation metric because it requires both high precision and high recall. It is defined as

\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

where TP (true positive) is the number of pixels that are classified correctly; FP (false positive) is the number of negatives misidentified as positives; and FN (false negative) is the number of positives misidentified as negatives.
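As a worked check of the metric, the F1-score is the harmonic mean of precision and recall computed from the pixel counts defined above:

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 from pixel counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 8 correct positives, 2 false alarms, 2 misses
# -> precision 0.8, recall 0.8, F1 0.8
```

Because the harmonic mean is dragged down by its smaller argument, a classifier cannot score well by trading recall for precision or vice versa, which is why the F1-score suits the imbalanced green-cover classes here.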

B. Experimental Setup
We designed five experiments to evaluate the performance of the proposed method, the updating process of the unlabeled data, and the effect of the numbers of labeled and unlabeled samples; the parameter settings are listed in Tables IV and V.
1) Comparison of Different Methods: The first experiment was designed to compare the performance of different methods. We tested the proposed method against four other representative methods, i.e., self-training [50], co-training [31], tri-training [53], and super-training. Self-training is an SSL method applied to each phase independently [50]; it updates high-confidence samples into the training set in each training round. In the experiment, the classifiers trained by self-training predicted the probabilities of the unlabeled samples, and the mean of these probabilities was taken as the threshold for sample selection. A fixed number of samples above the threshold were randomly selected to join the sample set. In the co-training experiment, the threshold was obtained in the same way as for self-training. If the probability of an unlabeled pixel was greater than its respective threshold in both phases, the pixel was added to the candidate set. Then, a fixed number of samples were randomly selected from the candidate set to join the sample set. Tri-training is the semi-supervised paradigm used by tri-net. In the tri-training experiment, three models were used: if two models predicted a sample consistently, the sample was given to the third model for training, with the two models' prediction used as its pseudolabel. Super-training is a supervised learning method with five times as many labeled samples as the other semi-supervised methods; in each iteration, super-training selects samples from the labeled sample set, unlike the semi-supervised methods, which select unlabeled samples. For fairness, these methods also used RF as the classifier, and the parameter settings were the same as those of multi-training.
These methods were compared with the proposed method in terms of training mechanisms and labeled samples.
Self-training was trained on a single phase, and its predictions were also made on a single phase; its overall evaluation is the average accuracy over the phases. Co-training was conducted on two phases. In the case of multiple phases, the co-training accuracy of a given phase was the average of the accuracies of the models trained on the combinations of this phase with every other phase. For example, with four phases T1, T2, T3, and T4, the accuracy under phase T1 is the mean accuracy of the three models trained on T1&T2, T1&T3, and T1&T4. The overall evaluation of co-training is the average accuracy over the phases. Tri-training was conducted on three phases; similar to co-training, the accuracy of a given phase was the average of the accuracies of the models trained on the combinations of this phase with two other phases.
2) Effect of N_l and N_u: In the second experiment, we compared the effects of different numbers of labeled and unlabeled samples on the model's accuracy to find the optimal parameters of the proposed method. Since the optimal N_l is related to N_u [32], we simultaneously changed N_l and N_u for each class before training to analyze the parameters' sensitivity. In this experiment, N_l ranges from 1 to 19 with step 2, and N_u varies from 5 to 50 with step 5. We recorded the average F1-score and SDs for each parameter setting. We also compared the co-training algorithm's performance under the same parameter settings.
3) Consistency of Multitemporal Classification Results: Classification of multitemporal remote sensing images always faces the difficulty that the results are spatially and temporally inconsistent. Therefore, classification consistency is a crucial criterion for testing multitemporal land cover classification algorithms. In this experiment, we modified the original prediction difference of the classifiers (PDC) from the previous work [31] to obtain a new consistency measurement in the multitemporal setting:

\text{PDC} = \frac{1}{n} \sum_{j=1}^{n} I\bigl(\phi(f_1(x_j^1), f_2(x_j^2), \ldots, f_k(x_j^k)) < k\bigr)

where f_i(x_j^i) denotes the predicted label of classifier i on pixel j; ϕ(·) returns the maximum number of inputs whose values are the same; and I(·) denotes the indicator function that returns 1 if the input is true and 0 otherwise. We compared the PDC of multi-training and co-training for parameter settings in which N_l ranged from 1 to 9 with step 2, N_u ranged from 5 to 50 with step 5, and N_t was fixed at 4.
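The multitemporal consistency measure can be sketched as follows. This is our reading of the modified PDC: a pixel counts as inconsistent whenever the k per-phase classifiers are not unanimous (i.e., ϕ(·) < k), so a lower PDC means more consistent predictions.

```python
import numpy as np

def pdc(preds):
    """Fraction of pixels on which the k per-phase classifiers disagree.

    preds: (k, n_pixels) array of predicted labels, one row per classifier.
    """
    preds = np.asarray(preds)
    # phi(...) == k exactly when all rows match the first row.
    unanimous = (preds == preds[0]).all(axis=0)
    return 1.0 - unanimous.mean()
```

Note that unchanged pixels are the common case under the method's assumption, so a well-behaved multitemporal classifier should drive this fraction down even when individual phase accuracies are similar.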

4) 1DCNN as the Classifier:
To further verify the applicability of the proposed semi-supervised method, we replaced the classifier from RF with 1DCNN [9] in an additional experiment. The experimental conditions were the same as in Experiment 1 except for the classifier. Again, the accuracies of the proposed method, self-training, co-training, tri-training, and super-training were compared. In this experiment, due to the small sample size, we designed a lightweight 1DCNN consisting of three convolutional layers, three ReLU layers, and one fully connected layer. To increase the dimensionality of the input data, we used the five indices in addition to the four spectral bands, as in the previous experiments. Two scenes of ISPRS Vaihingen images were used for pretraining. To enable the 1DCNN to converge during the initial training, five samples per class were used instead of one sample per class as in the previous experiments.

5) Effect of Phases:
To explore the best conditions for applying the proposed method, we designed an experiment to evaluate the influence of N_t and the phase combinations on the improvements in classification accuracy and stability achieved by the proposed method. The number of phases is a key factor influencing the experimental results because the proposed method is designed for multitemporal remote sensing images. N_t was varied from 3 to 8 in this experiment to explore the trends of the F1-score and SDs, and we selected every combination of four phases from all the images as inputs to the multi-training algorithm to test the accuracy under different phase combinations.

A. Comparison of Different Methods' Classification Results
To visualize the results of vegetation recognition and analyze them qualitatively, we depicted the classification results under four phases and highlighted the area of vegetation recognition in Fig. 7.
Visually, it is clear that the vegetation extracted by the proposed method is more detailed and accurate, with a clear distinction between the spatial locations of trees and low vegetation. Although training is performed per pixel, the final results remain highly consistent in areas with wide vegetation coverage. In contrast, the vegetation areas extracted by the self-training method on a single phase are relatively fragmented. In the urban blocks, most of the vegetation distributed along the roads was recognized by the multi-training algorithm. We can therefore conclude that the proposed method extracts vegetation with great consistency over a long period. The results in Fig. 7 also reflect the influence of seasonal changes on vegetation: in mountainous areas, for example, the recognized vegetation is sparser in winter than in summer. The proposed method can even obtain classification results similar to those of the supervised training algorithm, which is trained on a large-scale labeled training set. To further illustrate the efficiency of the multi-training algorithm, we quantitatively compared the performance of the proposed method with the aforementioned classification methods on the multitemporal dataset. The average F1-scores over the four phases for the six classes, together with the average over all classes, are shown in Table VI. In the comparison experiments, N_l was set to 1 and N_u to 5, and the random seed was fixed so that the different methods shared the same training set. Table VI indicates that the multi-training algorithm significantly improves the recognition accuracy of tree, low vegetation, and shadow, from 0.762 to 0.832, from 0.652 to 0.728, and from 0.698 to 0.801, respectively.
A possible explanation is that the distribution of vegetation and shadow differs considerably across phases because of phenological influence and seasonal alternation. The classifiers can therefore capture more additional information about these classes than about other land covers from the different views, which increases the confidence in determining the true class of the unlabeled samples. With this abundant information, the model can add samples with more accurate pseudolabels to the training set during updating. The classification accuracy of water improves most with the co-training method, from 0.782 to 0.839, a relative increase of 7.29%. This is because the co-training algorithm is mainly aimed at co-updating between two phases, so it performs well on water areas, which change little over the whole period. In general, identifying roads and buildings is quite difficult, especially when training samples are scarce; with a large training set, the supervised learning method therefore greatly improves the classification accuracy of these two classes. The results in Table VI show that the prediction accuracy of the multi-training algorithm improves markedly for all land cover classes and that its average F1-score improvement outperforms the other compared methods. This demonstrates the broad applicability of the multi-training method, which can thus be used for land cover recognition when labeled samples are limited.

B. Influence of N l and N u
Fig. 8 shows the effects of different values of N_l on the average F1-score and SDs of the final results and on the accuracy improvements for vegetation and for all classes. Fig. 8(a) and (e) shows the trend of the final classification accuracy as N_l in the initial training set changes. As expected, both the proposed method and co-training achieve higher final classification accuracy as N_l increases; the improvement is especially pronounced when N_l is increased from 1 to 5 per class. In all cases, the average F1-scores of the proposed method are higher than those of the co-training method, which shows that the proposed method improves the accuracy of land cover classification on multitemporal data. In Fig. 8(b) and (f), as N_l increases, the SDs of the classification accuracy decrease gradually, and the SDs of the proposed method are lower than those of co-training. This is because more labeled samples help the model classify better and obtain more stable results; through the communication step between all classifiers, the proposed method can learn the differences between phases and stabilize quickly. Fig. 8(c), (d), (g), and (h) shows that the average F1-scores and SDs of the improvement between the initial and final accuracy decrease as N_l increases. A likely reason is that when the training set contains enough labeled samples, the model already learns the initial dataset very well, so its capacity for further gains is restricted and it cannot learn much additional information in the subsequent updating process. The large gap between N_l = 1 and N_l = 5 in these four curves strongly supports this explanation. In general, the proposed method shows significant improvements in accuracy and stability over co-training.
The improvements in the average F1-scores and SDs for different N_l and N_u are shown in Fig. 9. For a small N_l, fewer unlabeled samples yield a higher F1-score and lower SDs; however, when N_l is large enough, the pattern reverses, and more unlabeled samples are needed to train a model with both high accuracy and stability. These results corroborate the findings of previous work [31]. Therefore, a small N_u is suggested when the training set includes very few labeled samples, while a large N_u is suitable when sufficient labeled samples are available.

C. Consistency of Multitemporal Classification Results
To further illustrate the consistency of the proposed method, we compared the PDC of multi-training and co-training under four phases. In each pair of this comparison experiment, we fixed the number of labeled samples and varied the number of unlabeled samples. Fig. 10 shows that the PDC of co-training is clearly higher than that of multi-training and exhibits no tendency as N_u increases. This is because, in multi-training, the classifiers communicate with each other to select high-confidence unlabeled samples for a new training set, which provides more opportunity to maintain consistency between phases. The multi-training curves in Fig. 10 show a clear pattern: the PDC values decrease significantly as N_u increases. A possible reason is that a larger N_u places more high-confidence samples in the training set, which further enhances the prediction consistency of all the classifiers in the next iteration. The resulting low PDC values, obtained with a large number of unlabeled samples per class during training, provide instructive guidance on parameter settings for future applications.

D. Classification Accuracy With 1DCNN as Classifier
As shown in Table VII, the initial accuracy of each method improves when a 1DCNN trained on five labeled samples per class is used as the classifier. Multi-training shows a significant increase in final accuracy after SSL, and its improvement is markedly higher than that of the other semi-supervised methods, which further proves the effectiveness and applicability of multi-training. In addition, the accuracy improvement for the building class is large, up to 18.03%, showing that the ability of the 1DCNN to extract complex features helps multi-training perform better.

A. Effect of Phases
In Fig. 11(a), as N_t increases, the improvement in the average F1-score of vegetation increases obviously, from 0.02 at N_t = 3 to 0.05 at N_t = 8; whether this trend continues for larger N_t requires further research. Increasing N_t makes the model not only more accurate but also more robust, especially from 3 to 7, which is beneficial for determining the unchanged area and selecting unlabeled samples when updating the training set. However, if N_t is too large, i.e., the images are acquired over a long period, the SDs of the model's accuracy become large even though the F1-score is still increasing, as shown in the region where N_t ≥ 7 in Fig. 11(a). A possible explanation is that the joint confidence value is no longer credible when N_t is too large, so the updating process is no longer stable.
To explore the effect of phase combinations, we calculated the difference in data distribution across all phases. Because the proposed method is designed for multitemporal training, computing the difference between only two phases cannot truly represent the difference among the phases in a combination. Therefore, we calculate the difference over the whole combination by aggregating the pairwise differences: Diff(T_1, T_2, ..., T_k) = Σ_{i=1}^{k-1} Σ_{j=i+1}^{k} Σ_f | x̄_i^{(f)} - x̄_j^{(f)} |, where T_i denotes the ith phase, x_i^{(f)} denotes the fth band feature of the data in the ith phase, and the bar denotes the mean over all pixels.
The result calculated by the equation above can represent the difference among multiple phases and satisfies the setting of the problem. We chose the five-phase combinations with the largest distribution difference and the five-phase combinations with the smallest distribution difference as the inputs of the proposed method. Our goal is to test under which conditions above the proposed method can obtain better performance on the land cover classification of multitemporal remote sensing images.
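The combination-difference computation and the selection of the largest- and smallest-difference combinations can be sketched as follows. The exact form of Diff is our assumption (pairwise sums of absolute differences of per-band feature means); only the pairwise-aggregation idea is stated in the text.

```python
import itertools
import numpy as np

def phase_diff(phases):
    """Sketch of the combination-level distribution difference
    Diff(T_1, ..., T_k): absolute differences of per-band feature
    means, summed over all phase pairs in the combination.

    phases: list of arrays, each of shape (n_pixels, n_bands).
    """
    means = [p.mean(axis=0) for p in phases]  # per-band means of each phase
    return sum(
        np.abs(means[i] - means[j]).sum()
        for i, j in itertools.combinations(range(len(means)), 2)
    )

def rank_combinations(all_phases, k=4, top=5):
    """Enumerate every k-phase combination and return the `top`
    combinations with the largest and smallest Diff values."""
    scored = sorted(
        itertools.combinations(range(len(all_phases)), k),
        key=lambda idx: phase_diff([all_phases[i] for i in idx]),
    )
    return scored[-top:], scored[:top]  # (largest, smallest)
```

With `k = 4` and `top = 5` this reproduces the experimental setup described above: the five four-phase combinations with the largest and the five with the smallest distribution difference.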
As shown in Fig. 11(b), there is a high correlation between the difference in data distribution and the improvement in vegetation recognition accuracy. A large difference within the phase combination helps the model obtain a great improvement in accuracy: the difference between phases increases the divergence of the classifiers, so that co-training can be carried out more effectively [27], and in the communication stage, larger differences allow the model to learn more information. During iterative training, the classifiers become more and more similar, leading to the failure of co-training and even a decline in accuracy [52]; if the difference between phases is small, this problem becomes more serious. Indeed, Fig. 11(b) shows that the accuracy improvement of the model trained on phase combinations with small differences is small or even negative.

B. Limitations and Future Improvements
In this study, we propose a novel semi-supervised method for time-series urban green cover recognition that exploits the differences between the images of each phase, avoiding tedious sample labeling and alleviating the problems caused by data shifting. However, in our experiments, we used RF as the classifier and performed feature selection and redundancy removal, which slightly increased the workload. In other application scenarios, deep learning models that automatically extract features can be used as classifiers, making the process simpler.
Additionally, even though a safe method is used to assign pseudolabels to the high-confidence samples, there is no guarantee that all pseudolabels are correct. Incorrect pseudolabels may amplify errors during iteration, leading to a decrease in classification accuracy. An appropriate noise-tolerant learning method may be a solution to this.
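As one illustrative direction, not part of the proposed method, a small-loss selection rule of the kind used in noise-tolerant learning could filter pseudolabeled samples before they are added to the training set. The sketch below assumes that noisy pseudolabels tend to incur larger loss under the current model; the `keep_ratio` threshold is a hypothetical parameter.

```python
import numpy as np

def small_loss_filter(losses, keep_ratio=0.8):
    """Sketch of small-loss selection for pseudolabeled samples:
    keep only the indices of the samples whose current loss is
    lowest, discarding the presumably noisier remainder.

    losses: per-sample loss of the pseudolabeled candidates.
    Returns the indices of the retained samples.
    """
    losses = np.asarray(losses)
    n_keep = max(1, int(len(losses) * keep_ratio))
    return np.argsort(losses)[:n_keep]
```

Such a filter could be applied after the communication stage, so that only low-loss (likely correct) pseudolabels enter the next training iteration.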

VII. CONCLUSION
In this study, a multi-training method for multitemporal remote sensing images is designed to recognize urban green cover. The study demonstrates that the proposed method achieves higher accuracy and consistency in vegetation recognition than the previous methods, and its advantage is greatest when only a few labeled samples are available. Furthermore, the difference between the observed spectral distributions of multitemporal data is exploited in a mutual learning way, which provides a new strategy for making use of multitemporal images. For better model accuracy, N_u and N_l should be matched in size. In addition, large differences between multitemporal data improve the model's classification performance. Although this study mainly focuses on the recognition of urban vegetation, the proposed method can easily be migrated and applied to land cover classification and the updating of land cover maps in time series.