A Novel Multitemporal Deep Fusion Network (MDFN) for Short-Term Multitemporal HR Images Classiﬁcation

—High-resolution (HR) satellite images, due to the tech-nicalconstraintsonspectralandspatialresolutions,usuallycontain onlyseveralbroadspectralbandsbutwithaveryhighspatialresolution.Thisprovidesrichspatialdetailsoftheobjectsonthe Earthsurface,whiletheirspectraldiscriminationisrelativelylow.Recently,theincreaseofthesatelliterevisittimesmadeitpossibleto acquiremorefrequentdatacoverageforﬁnerclassiﬁcation.Inthisarticle,weproposedanovelmultitemporaldeepfusionnetwork (MDFN)forshort-termmultitemporalHRimagesclassiﬁcation.Speciﬁcally,atwo-branchstructureofMDFNisdesigned,which includesalongshort-termmemory(LSTM)andaconvolutionalneuralnetwork(CNN).TheLSTMbranchismainlyusedtolearn thejointexpressionofdifferenttemporal-spectralfeatures.FortheCNNbranch,thethree-dimensional(3-D)convolutionisﬁrstly appliedalongthetemporalandspectraldimensionstojointlylearnthetemporal-spatialandspectral-spatialinformation,respectively, andthenthe2-Dconvolutionisperformedalongthespatialdi-mensiontofurtherextractthespatialcontextinformation.Finally, featuresgeneratedfromthetwodifferentbranchesarefusedtoobtainthediscriminativehigh-levelsemanticinformationforclas-siﬁcation.ExperimentalresultscarriedontworealmultitemporalHRremotesensingdatasetsdemonstratethattheproposedMDFN providesbetterclassiﬁcationperformanceoverthestate-of-the-artmethods,anditalsoshowsthepotentialitytouseshort-term multitemporalHRimagesformoreaccuratelanduse/landcovermapping.


I. INTRODUCTION
W ITH the rapid development of the earth observation (EO) technology, high resolution or even very-high-resolution (VHR) satellites [e.g., Pleiades, Gaofen (GF), and Worldview series] have been launched with a revisit time about 1 to 5 days (see Table I). Such high temporal resolution allows the adequate analysis of monotemporal or multitemporal images within a given period (e.g., short-term, medium-term, or long-term) [1], [2]. The increasing availability of HR remote sensing images offer great potential and opportunities for the applications, such as land use/land cover (LULC) mapping [3]- [5], forest monitoring [6], disaster evaluation [7] and urban management [8], etc.
Unlike hyperspectral (HS) images that provide a high spectral resolution, HR images usually have a relatively coarse spectral resolution but rich spatial details. Thus, it is quite difficult to accurately identify the rough spectral difference of different objects, especially for subcategories corresponding to the same major-category (e.g., grass, farmland, and trees that belong to the vegetation category) [9]. In addition, the phenomenon of misclassification is severe as the fact that these optical images are susceptible to clouds, shadows, illumination, and atmospheric reflection conditions, etc. In the latest multispectral HR satellite missions, the revisit frequency has improved leading to data with a high temporal resolution. Therefore, there is a very high potential in using short-term multitemporal information for HR image classification. How to effectively combine spatiotemporal-spectral information provided by such HR images becomes an open challenging question [4], [10], [11].
In the past decades, many spectral-spatial methods have been proposed for HR image classification by taking advantages of its rich context information. For example, popular methods developed based on the direct spectral-spatial feature extraction (e.g., Gabor filtering [12], edge-preserving filtering [5], [13], extended morphological profiles [14], and extended attribute profiles [15]), the multiple kernel learning [16], [17], the Markov random fields [18], [19] and the superpixel processing [20] achieve successful results in various applications in the literature. Despite their effectiveness in combining spectral-spatial features, these traditional machine learning methods usually generate many misclassification errors especially in edge details This   and objects homogeneity due to limited feature representation and generalization abilities [21].
Recently, deep learning (DL) based methods have demonstrated their outstanding performance for remote sensing image classification [22]- [24]. Among them, the convolutional neural network (CNN) and the recurrent neural network (RNN) have been widely used. By taking into account the characteristics of remote sensing images, researchers have designed several spectral-spatial or shallow-deep feature fusion networks, such as the fast dense spectral-spatial convolution network (FDSSC) [25], the spectral-spatial unified network (SSUN) [26], and the spectral-spatial residual network (SSRN) [27], etc. With the robust capability of automatic feature learning, DL-based methods can adaptively exploit more complex and effective high-level features to enhance the classification performance [28].
Despite the success of the aforementioned classification methods, they were all designed for monotemporal remote sensing images. In reality, optical HR images are frequently affected by complex acquisition situations such as clouds contamination, illumination conditions and image bad strips, which leads to inaccurate mapping of ground materials. On the other hand, the limited spectral information in HR images leads to a difficulty with separating certain LULC classes with similar spectral signatures. Even for the common DL-based methods, most of them usually rely on high-level features in the last layer for classification. For example, information for the fully connected layer in the CNN has often undergone many down-sampling operations, which may not be suitable to accurately describe small objects details. The lack of a dominant spatial information with coarse spectral information would greatly reduce the applicability of multispectral HR image data.
Considering the above problems in monotemporal multispectral HR image classification, a joint classification of multitemporal HR images expands the spectral representation towards the temporal domain, increasing the possibility for more accurate LULC classification. Moreover, for typical ground objects (e.g., buildings, roads, trees, and water), they are usually stable within a certain period, which also makes the short-term multitemporal HR images useful.
In the literature, there are existing works on the classification of multitemporal HR images. As an example, in [29], multitemporal texture features (pseudocross variogram) were directly combined with the original bands for multitemporal images classification. In [30], some bands and some temporal normalized difference vegetation index were extracted from Sentinel-2 multispectral images time series for land and crop classification using the random forest (RF) classifier. In [31], a manifold alignment framework was proposed to leverage prior knowledge while exploiting spectral similarities in the underlying manifolds of two multitemporal HS images. Similarly, [32] introduced a novel semisupervised kernel manifold alignment (KEMA) method, which successfully applied KEMA to multitemporal and multisource VHR classification tasks. In [33], a two-stage classifier was proposed for classifying cloud-and snow-contaminated multitemporal images. The proposed classifier mainly combined multitemporal information to improve the missing data and then performed classification. In addition, some researchers also proposed DL-based methods for multitemporal images classification. For example, a temporal-attention CNN and gated recurrent unit (CNN-GRU) approach was proposed in [34] to distinguish subtle crop differences. A DL-based architecture namely twin neural networks for sentinel data (TWINNS) was proposed in [11] to boost the LULC classification task using radar and optical satellite images.
Despite the many existing studies on multitemporal classification methods there are still several problems and challenges.
1) Most existing approaches unitized traditional medium resolution images with a relatively long revisit time. There are few works on the short-term multitemporal HR images. The complementarity between spatio-temporal and spectral dependencies had not been fully considered in these methods, which will lead to the presence of many misclassifications. 2) Most studies still relied on traditional hand-crafted features, which are not able to extract the high-level discriminative feature representation. In addition, there were many misclassification errors because the fact that they did not further explore the invariant temporal-spectral features to suppress the abrupt or abnormal changes on each monotemporal image. To overcome these drawbacks, in this article, a novel framework named multitemporal deep fusion network (MDFN) is proposed for dealing with the short-term multitemporal HR images classification, where long short-term memory (LSTM) and CNN branches are combined to extract and fuse rich spatiotemporal-spectral features. The main contributions of this article are summarized as follows.
1) By integrating LSTM and CNN branches, the unified spatio-temporal-spectral features are extracted and fused at different layers, and in this way the invariant temporal-spectral features combined with multiple shortterm temporal images can be greatly benefited. Furthermore, rich and complex high-level information is generated for the classification by fusing features at different layers without the pooling or other down-sampling operations.
2) The three-dimensional (3-D) convolutions toward the spectral dimension and temporal dimension are designed to learn the rich spectral-spatial and temporal-spatial correlation information, and then reduce the misclassification errors caused by clouds, shadows, illumination, and atmospheric reflection conditions on each monotemporal image.
3) The 2-D convolutions based on concatenating two types of 3-D convolution features are designed to capture the spatial context information. This guarantees a strong descriptive capability for the invariant spatio-temporal-spectral features that contribute to the improved classification. The rest of this article is organized as follows. Section II introduces the related techniques including LSTM and CNN. Section III provides a detailed description of the proposed MDFN framework. Section IV presents the experimental results obtained on two real multitemporal HR image datasets. Finally, Section V draws the conclusion and discusses future research.

A. Long Short-Term Memory
LSTM is a special structure of RNN and it is capable to learn long-term dependencies and deal with the gradient vanishing or the exploding problems present in the traditional RNN [35], [36]. It works well to solve a large variety of problems in the temporal or spectral domains, and is successfully used for HS classification. The neurons of the LSTM not only receive information from other neurons, but also receive their own information with feedback loops. Therefore, the LSTM with "memory" makes it more suitable for time series analysis compared with other networks such as CNN.
The key concepts of the LSTM are the cell state and the gating mechanism. In particular, the data transmission and processing in LSTM are realized by three key gate units: the forget gate f t , the input gate i t , and the output gate o t , which are used for implementing information protection and control [36]: 1) Forget Gate f t : It is mainly used to control the amount of information that needs to be forgotten respect to the previous moment. Its mathematical model can be formulated as follows where W f and b f represent the weight and bias of f t , respectively, x t is the candidate state at current time t, and σ is the nonlinear activation function, such as Tanh, Sigmoid, and ReLU.
2) Input Gate i t : It is used to control the amount of information that needs to be saved in x t . The specific formulas are defined as where W i and b i represent the weight and bias of i t , respectively, C t is the new candidate value generated by the tanh function, and W C and b C represent its weight and bias, respectively.
3) Output Gate o t : It represents the initial output, and h t is the final output which is obtained by multiplying the C t compressed to the [−1, 1] state with o t . These mathematical formulas of o t and h t can be defined as where C t is the new cell state information at time t, which can be considered as an intermediate state saved by LSTM, W o and b o represent the weight and bias of o t , respectively.

B. Convolutional Neural Network
The CNN is a popular type of neural network, which uses at least one convolution layer in the framework [37], [38]. In general, the CNN mainly consists of four components, i.e., the convolution layer, the pooling layer, the batch normalization layer and the fully connected layer. According to the dimension of the convolution layer, CNNs can be divided into 1-D-CNN, 2-D-CNN and 3-D-CNN.
1) 1-D-CNN: it is usually designed to use the pixel vector along the radiometric dimension to extract deep features, which can be conceptually considered as a spectral-based classification approach [39], [40].
2) 2-D-CNN: the convolution kernels of a 2-D-CNN move along the height and width directions of an image, and extract multilevel features (e.g., shallow, medium, and deep features) from specified local neighborhoods. The shallow layers usually extract some low-level features like edges and textures, while the deep layers generate more abstract and discriminative high-level features. For a 2-D convolution layer, the convolution value Y of the given pixel (i, j) on the kth channel in the lth convolution layer can be obtained as where b is the bias of the convolutional layer, f is the nonlinear activation function, (i, j) represents the coordinate of a given pixel, and X l−1 represents the set of input sensors (and is also the set of output in the l-1th layer [41]).
2) 3-D-CNN: different from the 2-D-CNN that extracts image features from the spatial direction, a 3-D-CNN extends the 2-D convolution into a 3-D convolution along the channel direction. It performs convolution along the height, width, and channel dimensions of the input image. In case of multiframe videos and multitemporal images, the 3-D convolution is capable to learn time-series dependencies. For a 3-D convolution layer, the output value Y at the pixel location (i, j, r) on the kth channel in the lth convolution layer can be generated as

III. PROPOSED MDFN FRAMEWORK
The technical framework of the proposed MDFN is shown in Fig. 1. Its two-branch structure includes: the LSTM branch that is used for modeling the spectral-temporal dependencies of short-term multitemporal HR images; and the CNN branch with CONV3_T, CONV3_S and CONV2 modules. It is used for learning the rich spatio-temporal-spectral information at different levels. Note that in order to avoid the down-sampling effect that impacts on the classification of ground objects, there is no pooling layer in the CNN branch.

A. LSTM Branch
Based on the characteristics of the multitemporal images, the LSTM branch is designed to model the temporal correlation of the multitemporal spectral information. This branch makes full use of the spectral-temporal dependencies of the short-term multitemporal HR images, which is essentially conducive to a high-precision identification of different ground objects. Fig. 2 shows the structure of the LSTM branch in the proposed MDFN. By considering the number of HR images, three LSTM layers are set to extract the discriminative and invariant temporal-spectral features. In addition, the number of filter kernels is set to 128, and the effective ReLU is selected as its nonlinear activation function.

B. CNN Branch
To reduce the spectral inconsistence due to abnormal changes caused by complex acquisition situations, such as clouds contamination, illumination conditions and striping noise occurred in the monotemporal images, the CNN branch is designed to jointly use 3-D-2-D convolution layers. Specifically, 3-D convolution layers include two paralleled modules, i.e., the 3-D convolution along the multitemporal direction (denoted as CONV3_T) and the 3-D convolution along the spectrum direction (denoted as CONV3_S). CONV3_S is effective to capture the invariant information of the same object in the spatio-temporal-spectral dimension, and CONV3_T can learn the discriminative information between different or similar objects in the spatio-spectral-temporal dimension. This results in a more comprehensive and reasonable feature representation in the multitemporal classification domain. Finally, results of the CONV3_T and CONV3_S modules are imported to the 2-D convolution module (denoted as CONV2) to further learn the spatial context information based on different receptive fields. 1) CONV3_T: as shown in Fig. 3, the CONV3_T module includes two 3-D convolution layers: 64 3 × 3 × n and 64 1×1×n convolution kernels to capture the discriminative temporal-spatial information at two spatial scales (i.e., 3×3 and 1×1). Let the raw multispectral HR images be I mul ∈ R r×c×n×p , where r×c denotes the size of these images, n represents the number of channels for a monotemporal image, and p is the number of the phases for the multitemporal images. It is worth noting that the input of CONV3_T is resampled toI CONV3_T ∈ R r×c×n×p . With the 3-D convolution along the spectral-temporal direction, it is beneficial to learn the discriminative temporal-spatial information. 2) CONV3_S: similar to the CONV3_T, the CONV3_S module also includes two 3-D convolution layers: 64 3×3×p and 64 1×1×p convolution kernels (see Fig. 3). However, the input form of CONV3_S is resampled to I CONV3_S ∈ R r×c×p×n . The main difference between CONV3_T and CONV3_S is that the former directly stacks each monotemporal image according to the time sequence, and the latter stacks the multitemporal data based on the spectral sequence. Therefore, the defined CONV3_S module performs the 3-D convolution along the temporal-spectral direction. It is useful to model the stable spectral-spatial features on the time series, and then reduce the spectral variations or even abrupt temporal changes caused by the influence of clouds, shadows, and other external environmental conditions. 3) CONV2: As illustrated in Fig. 3, the CONV2 module is performed based on the concatenated of 3-D convolution features to further enhance the local spatial context information extraction at different scales (i.e., 3×3 and 1×1). This further improves the strong descriptive capability for the invariant spatio-temporal-spectral features that contribute to the final classification.

C. Multilevel Feature Fusion and Classification
In this step, the final spatio-temporal-spectral features generated by the LSTM and the CNN branches are fused together by the concatenation layer. The stacked multilevel features are then imported into to the fully connected layer to obtain the higherlevel semantic information. Finally, the discriminative features are input into the Softmax classifier to realize an end-to-end automatic multitemporal HR images classification.

A. Datasets Descriptions
In this article, two real HR multitemporal datasets are used to evaluate the performance of the proposed and reference methods.
1) Ponte Arche (PA): The first dataset is built on three Plan-etScope images, which were collected over the PA area, in the Autonomous Province of Trento in Italy. PlanetScope satellite constellation consists of more than 130 small satellites called Doves. These Dove satellites launched in groups greatly improved the revisit times. So PlanetScope can provide once daily images of every location on Earth. Three short-term multitemporal images characterized by four spectral bands (blue, green, red and near-infrared) were acquired over the PA area on April 2, 17, 22, and 2018, respectively. This dataset has a size of 304×361 pixels with a ground resolution of 3 meters. False color composite images for the three dates and the ground reference (GR) map are shown in Fig. 4. There exist five LULC classes in the analyzed area (i.e., buildings, roads, trees, farmland and soil). From three monotemporal images, it can be seen that there are various radiometrical changes due to illumination variations, topographical relief, and measurement noise conditions (e.g., see the forest and farmland areas in Fig. 4).

2) Shanghai (SH):
The second dataset is made up of four images acquired over the city center of Houkou district, Shanghai, China, by the GF-1/6 satellite sensors. The acquisition times for the four images are April 8, 12, 15 and 18, 2019, respectively. It is worth noting that GF-1 and GF-6 are the high-resolution optical EO satellites of China National Space Administration (CNSA) with a four-day repetition cycle. Based on the acquired raw images, several image preprocessing (e.g., orthorectification, pansharpening, and image registration) were carried out. These images contain four spectral bands (i.e., blue, green, red and near-infrared) with a spatial resolution of 2 m after the pansharpening operation. The size of the final selected image subset for the experiment is 400×400 pixels. Fig. 5 presents the four dates false color composite images and the GR map. There are five land-cover classes (i.e., buildings, roads, trees, grass, and water) in different colors are shown in Fig. 5(e). In addition, the quality of monotemporal images changes at different times, especially it can see clearly that clouds contamination and shadows in T2 image [see Fig. 5 Fig. 5(a)-(d)]. Note that the GR maps in the two datasets were made according to a careful image interpretation. Details of the reference samples for each class in the two datasets are given in Table II.

B. Experimental Setup and Parameter Settings
To demonstrate the effectiveness of the proposed MDFN approach, five reference methods were implemented and compared in the experiments including two popular classifiers, i.e., support vector machine (SVM) and RF, and three state-of-the-art classification networks: FDSSC [25]; SSUN [26]; and SSRN [27]. The FDSSC, SSUN and SSRN are three excellent state-of-the-art DL-based frameworks. Specifically, the FDSSC framework uses different 3-D convolutional kernel sizes to extract spectral and spatial features separately, and the 3-D densely-connected structures was used for deep learning of features, leading to extremely accurate classification. The SSUN structure also includes the LSTM and CNN branches that is similar to the proposed MDFN. But it is designed for monotemporal HS image classification, where LSTM is used to learn the spectral group information and CNN for extracting the spatial context information. The SSRN is designed with 3-D residual blocks and 3-D convolutional layers, potentially more suitable to learn the spatio-temporal-spectral information.
Parameter setting and evaluation were carried out in the experiment, in order to analyze the classification performance of the proposed MDFN and the compared methods. For the SVM classifier, the radial basis function was selected as kernel function. For the RF classifier, the number of decision trees was set to 500. For different networks used in this article, detailed parameter settings are given in Table III. In general, a larger window size (w) will help to increase the classification accuracy.    However, this also increases the possibility of the oversmoothing and time consumption. Therefore, considering the tradeoff between the accuracy and the computational cost, we set the size of the input block as four for the FDSSC, SSRN and MDFN methods, and eight for the SSUN due to the fact that its network includes three average pooling layers. For the batch size and epoch, they are set according to the literature work and multiple trials in our experiments, which are provided as given in Table III. In addition, considering the Adam optimizer can solve sparse gradients on noisy problems, so the Adam with a learning rate of 0.0001 was selected as the optimization algorithm of the proposed MDFN framework. In addition, in order to make consistence with the multitemporal input in the proposed MDFN method, we stacked p-times of the monotemporal image as the input of the proposed MDFN.
The training samples of two datasets are set as 1% randomly, and the rest (99%) are used for test. Quantitative experiments are performed with ten runs in order to eliminate the errors caused by random samplings. Four indices are utilized to evaluate the performance of the considered classification methods, including the class accuracy (CA), the overall accuracy (OA), the kappa coefficient (κ), and the computing cost (T).
All experiments were carried out on a computer with Intel (R) Core (TM) i7-6700 CPU, 3.41 GHz, NVIDIA Quadro K620, RAM 32.0 GB.
C. Classification Performance 1) Results of the PA Dataset: Fig. 6 illustrates the quantitative comparison between the monotemporal and multitemporal images classification accuracies obtained by different methods after ten runs. From the fluctuations of monotemporal (T1-T3) and multitemporal (TM) curves, one can see that the OA of the multitemporal images classification is not only higher than the one of monotemporal image classification, but also relatively less affected by random sampling, exhibiting a more stable performance. By comparing multitemporal results obtained by different methods (see the red curves in Fig. 6), the proposed MDFN approach with the highest OA values and lowest fluctuating deviations exhibits the best performance. These results demonstrate that the proposed MDFN is robust under the random sampling condition, but also that it has a high descriptive capability of stable spatio-temporal-spectral features for short-term multitemporal HR images. Table IV gives the quantitative assessment of the proposed MDFN and the reference methods based on the multitemporal images classification. The proposed MDFN approach produces the highest OA value with the lowest standard deviation value (i.e., 0.05). In particular, the MDFN (OA = 99.50%) achieves roughly 1% and 2% of OA increase than the three advanced DLbased methods FDSSC (98.72%), SSRN (98.69%) and SSUN (97.65%), respectively. Traditional machine learning methods SVM and RF classifiers present the low computing cost as they do not implement multitemporal feature fusion operations, but their accuracies are lower than the proposed MDFN. Fig. 7 visualizes the best classification maps obtained by different methods on each monotemporal image and multitemporal image. In order to better illustrate the classification performance at local scales, the subsets highlighted in red rectangles in Fig. 7 are further compared in Fig. 8. It is obvious that the qualitative results between different methods are in line with the quantitative results provided in Fig. 6 and Table IV. Among all results, the monotemporal image with only four broad spectral bands is quite prone to confuse between buildings and soil, and between trees and farmland (see in rows 1-3 in Fig. 7, and in columns T1-T3 in Fig. 8). After fusing three monotemporal images, the misclassification problems in buildings and trees are greatly improved (see in row 4 in Fig. 7, and in column T4 in Fig. 8). This demonstrates that the multitemporal information integration can significantly improve the separability of similar ground objects (e.g., trees and farmland). Furthermore, compared with the reference methods, the proposed MDFN results in the most accurate, regular and smooth classification map with OA = 99.57% (see the whole classification map in Fig. 7d-6 and the subset results in the row MDFN in Fig. 8). To sum up, it verifies that the proposed MDFN is more robust owing to the strong spatio-temporal-spectral complementary ability even under limited training samples. The good generalization capability and effective feature representation enables the MDFN to offer an outstanding performance for short-term multitempotal HR images classification.
2) Results on the SH Dataset: Fig. 9 illustrates the classification accuracies obtained on four monotemporal and multitemporal images by different methods. One can see that in all methods, the multitemporal classification is clearly superior to the ones using monotemporal images according to the higher OA values.
Monotemporal classification results showed different performances due to the variety in their data quality. The SVM, RF and MDFN methods have relatively smaller fluctuations affected by the random sampling compared with other DL-based methods. The proposed MDFN method presents the best multitemporal classification performance with the highest OA values and a better stability. In addition, one can observe that MDFN is also effective in monotemporal classification compared with the reference methods.
The quantitative assessment results of different methods for multitemporal images classification are given in Table V. Compared with the five reference methods, the proposed MDFN achieved the highest classification accuracy (i.e., OA = 96.30%) with a small standard deviation value (i.e., 0.55%). The SSRN Fig. 11. Classification maps obtained by different methods at local subsets on the SH dataset. Columns 1-4 represent three monotemporal and the multitemporal classification results of the subset highlighted in Fig. 10; column 5 shows the GR maps of the local subset; and rows 1-6 represent the local classification maps obtained by SVM, RF, FDSSC, SSUN, SSRN, and MDFN methods, respectively. method achieved the highest OA (95.55%) among five reference methods. However, its computing cost is about three times higher than MDFN. The performance of three DL-based methods is reduced since the limited number of training samples were considered in this case, so it is very difficult to meet the network training requirements. This also demonstrates that the proposed MDFN approach effectively integrates the multitemporal information, while shows a strong robustness even in a small-sample case. Fig. 10 visualizes the best classification maps obtained by different methods on each monotemporal and multitemporal image. In order to better illustrate the classification performance at local scales, subsets highlighted in white rectangles in Fig. 10 are further compared in Fig. 11. In consistent with the above quantitative analysis results, from the qualitative analysis, it can be seen that the joint classification of the multitemporal HR images can effectively reduce the miss and false alarms in some typical ground objects. For example, the classification maps of column 5 in Fig. 11 (i.e., TM) present less misclassification between trees (in dark green) and grass (in bright green) than in the first columns (T1-T4). Especially, classification errors in T2 image affected by clouds and shadows (see Fig. 10 column 2) are eliminated, thus they are correctly classified in TM results (see Fig. 10 column 6). FDSSC resulted in good classification results as shown in Fig. 10e-3 and Fig. 11 row FDSSC, comparing with the other reference methods. However, it fluctuates greatly and its average accuracy with the highest standard deviation value (2.99%) under ten runs of random sampling (see Fig. 9 and Table V). Therefore, the proposed MDFN with the higher robustness and excellent classification performance is more suitable for multitemporal HR images classification.

V. CONCLUSION
In this article, a novel multitemporal deep fusion and classification network MDFN has been developed based on short-term multitemporal HR images. The two branches of LSTM and CNN are designed in MDFN to learn the discriminative information from spectral, spatial and temporal dimensions. Experimental results obtained on two real multitemporal HR datasets validated the effectiveness of the proposed MDFN. Compared with traditional and state-of-the-art methods, MDFN exhibits three main advantages: the discriminative spatio-temporal-spectral feature extraction and fusion; high model stability under fewshot learning; and high descriptive capability for the temporalinvariant ground objects. With a robust end-to-end learning process, the proposed MDFN not only efficiently learns the multitemporal information, but also improves the spectral stability and the discrimination of typical ground objects. It can be also applicable to the short-term multiple UAV/airborne image classification.