A Dual-Branch Deep Learning Architecture for Multisensor and Multitemporal Remote Sensing Semantic Segmentation

Multisensor data analysis allows exploiting heterogeneous data regularly acquired by the many available remote sensing (RS) systems. Machine- and deep-learning methods use the information of heterogeneous sources to improve the results obtained by using single-source data. However, the state-of-the-art methods analyze either the multiscale information of multisensor multiresolution images or the time component of image time series. We propose a supervised deep-learning classification method that jointly performs a multiscale and multitemporal analysis of RS multitemporal images acquired by different sensors. The proposed method processes very-high-resolution (VHR) images using a residual network with a wide receptive field that handles geometrical details and multitemporal high-resolution (HR) image using a 3-D convolutional neural network that analyzes both the spatial and temporal information. The multiscale and multitemporal features are processed together in a decoder to retrieve a land-cover map. We tested the proposed method on two multisensor and multitemporal datasets. One is composed of VHR orthophotos and Sentinel-2 multitemporal images for pasture classification, and another is composed of VHR orthophotos and Sentinel-1 multitemporal images. Results proved the effectiveness of the proposed classification method.


I. INTRODUCTION
T HE analysis of remote sensing (RS) images with very high spatial resolution (VHR) allows improving the land-use classification [1] and urban monitoring [2], [3], [4] performance. However, VHR images provide many geometrical details that are often poorly handled by standard RS methods. Many methodologies model the high spatial resolution by using parcel-based strategies that analyze portions of VHR images and apply an object-based classification [5], [6]. These methods extract regions from the images using segmentation techniques, but they Manuscript  may penalize the classification performance due to segmentation errors. The exploitation of deep learning (DL) methods allows accurate classification of VHR RS images. DL models, such as convolutional neural networks (CNNs), automatically learn features during the training phase that analyze the spatial information of the input data. The hierarchical structure of these models allows analyzing the input image in a multiresolution way and increases the receptive field of the model. The receptive field determines the DL model analysis area [7], [8]. The larger it is, the larger the area of the image that the model analyzes. The increase of the receptive fields and the automatic learning of spatial features allow more effective extraction of the geometrical information in VHR images than other non-DL methods. Thus, the use of DL-based methods leads to an increase in classification and land-cover mapping performance [9], [10].
Sensors providing VHR images usually acquire a limited number of spectral bands and have a low acquisition frequency due to the relatively small ground-projected field of view (GFOV). Recent VHR sensors reduce the revisit time by acquiring images with multiple acquisition angles. However, when tasks have to deal with image time series analysis, small variability of the image viewing angle within the time series is usually required to limit undesired radiometric differences that do not correspond to changes on the ground [11]. On the contrary, sensors with a lower spatial resolution acquire more spectral bands by keeping similar viewing angles throughout time, have a higher acquisition frequency, and are more easily accessible than VHR images. The latter aspect depends on data-providing policies and not technical reasons. Hence, it is easier to obtain long image time series acquired by sensors with either medium or high spatial resolution rather than VHR images.
Multiresolution DL models process input images with different spatial and spectral resolutions to improve the scene analysis. These methods merge the high spatial resolution of VHR images with the rich spectral information of images with relatively high spectral resolution [12] to increase the classification accuracy. Other approaches extract and analyze multiscale features through the use of pretrained CNN [6], or atrous convolutional layers with various dilation rates [13]. The latter exploits a strategy similar to [14] to extract multiscale features using atrous convolutional layers with heterogeneous receptive fields that are concatenated to perform a multiscale This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ analysis. However, the use of single-date RS images does not provide temporal information that helps the discrimination of time-dependent classes. Thus, processing of the RS image time series is required to account for the time component. Many methods exploit random forest (RF) [15], [16] or support vector machine (SVM) [17] to model the temporal information through an image time series. In [18], the authors exploit a time-weighted dynamic time warping (DTW) to select the images within two image time series that better characterize specific classes, and therefore, perform a land-cover and land-use classification. However, these methods are pixel-based and do not consider spatial information. Hence, their performance drops when classes show strong spatial-context correlations, such as in urban areas. These methods exploit handcrafted features to analyze the time component of an image time series. This can lead to an information loss in the time domain. In the last years, recurrent neural networks (RNNs) and long short-term memory (LSTM) architectures were used to model the time information of image time series [19], [20], [21], and learn automatic temporal features during the training. In [22], the authors exploit a deep RNN to process and extract features from an image time series that are used to retrieve land-cover maps. An alternative way to process the temporal information consists in applying CNNs to the temporal domain [23], [24]. 1-D-CNNs can be also used to analyze temporal profiles extracted for each pixel [24]. However, these methods learn and extract temporal features but analyze the spatial information using only handcrafted spatial features.
Classification methods usually process data acquired by a single sensor. However, some applications require information acquired by multiple sensors with heterogeneous characteristics. Thus, methodologies for analyzing multisensor images with different spatial, spectral, and temporal resolutions should be considered. Some methods merge the temporal information of image time series acquired by heterogeneous sensors by extracting temporal features using RNN-based models and selecting the most informative ones through an attention mechanism [25], [26]. Spatiotemporal features are extracted and combined to segment image time series with homogeneous spatial resolutions using an attention mechanism to learn the most informative features to discriminate the time-based classes [27]. These methods do not properly model the spatial information of the images so they achieve suboptimal results. Some methods combine the spatial context features extracted from CNNs analyzing high-resolution (HR) images with the temporal one processing image time series with RNNs [28] or separately process the temporal and spatial information using 3-D-CNNs and 2-D-CNNs, respectively [29]. In [30], the authors process the spatial information with the temporal one through the use of CNNs to extract spatial features from multisensor data and convolutional RNN (ConvRNN) to process both spatial and temporal information. These methods achieve good results but handle homogeneous spatial resolution only. However, the discrimination of some classes requires a multiresolution analysis of heterogeneous sensors to generalize and improve the classification and perform a multiscale analysis [31]. To this purpose, some methods exploit a deep neural network with two branches [32], [33]: one analyzes the spatial information of VHR images, while the other the spectral information of multispectral data. However, these methods do not consider temporal information. In [34], the authors perform segmentation of multisensor image time series representing crop fields using models based on U-Net [35] exploiting a convolutional LSTM (ConvLSTM) or a 3-D-CNN to jointly analyze the spatial and temporal information. The authors in [36] extend the latter by exploiting bidirectional ConvLSTM to analyze the temporal information in both directions.
Summarizing state-of-the-art (SoA) methods analyze multisensor data single-date multiresolution RS image or multitemporal data with similar spatial resolution. Therefore, in this article, we propose a DL method for supervised classification that analyzes multisensor data with heterogeneous properties in spatial, spectral, and temporal resolutions. The proposed method allows analyzing single-date and multitemporal images acquired by different sensors with heterogeneous spatial, spectral, and temporal resolutions using a DL model having two input branches. The first one is a deep residual network (ResNet) [37] having a wide receptive field that properly models VHR geometrical details, while the other processes the spatial and temporal information of multitemporal images acquired by a sensor with a lower spatial resolution than the previous input data using a 3-D-CNN. The spatial and temporal output features of the two branches are merged and processed by a ResNet-based decoder to obtain the land-cover map. In this way, the classification considers both the multiresolution and multitemporal information and retrieves accurate results.
This article has the following outline. Section II describes the methodology. Section III presents the experimental settings and the results. Finally, Section IV concludes this article.

II. MULTIRESOLUTION AND MULTITEMPORAL CLASSIFICATION
The proposed method aims to obtain a classification map Y by analyzing a VHR image X VHR ∈ R w VHR ×h VHR ×b VHR acquired at time t VHR and an image time series X TS = {X 1 , . . ., X i , . . ., X I } composed by I images X i ∈ R w TS ×h TS ×b TS (i = 1, . . ., I) acquired between t 1 and t I by a sensor with different spatial, spectral, and temporal resolutions with respect to X VHR . X VHR and X TS represent the same geographical area. Assuming that no semantic changes occurs in the analyzed scene (e.g., new buildings and missing objects), X VHR can be acquired at any time with respect to the interval [t 1 , t I ]. We assume the availability of labeled training samples Y that provide class information about the geographical area represented in X VHR and X TS . A labeled training dataset X is retrieved by sampling N patches corresponding of X VHR and X TS according to Y .
VHR images provide many geometrical details but limited spectral and temporal information. Thus, the modeling of timedependant classes is challenging. On the contrary, HR image time series have a better tradeoff in terms of spectral and temporal resolutions allowing the modeling of time-dependant classes, but lower spatial resolution than VHR images. Hence, they have fewer geometrical details, and the modeling of classes characterized by small-size elements is less accurate. The proposed architecture combines the spatial context information of VHR images with the spectral and temporal information provided by the HR image time series to accurately model classes varying over time and characterized by small-size elements. It relies on a deep-learning (DL) model based on ResNet [37] that allows dealing with inputs having different spatial resolutions and different characteristics due to the heterogeneous acquisition sensors. The DL model analyzes them in two branches (see Fig. 1). We define the two branches to obtain the same number of feature maps with the same spatial size. Thus, the output features of the two branches can be easily concatenated and have a balanced contribution to the final classification. The decoder processes the concatenated features to obtain the classification map.

A. VHR Image Analysis With a the Residual Neural Network
The first branch analyzes X VHR , which has the highest spatial resolution and many geometrical details to be processed. We analyze the spatial information of X VHR using an encoder f (X VHR ) composed of multiple residual blocks. As in [37], each residual block exploits the bottleneck design to reduce the training time and increase the depth of the model by controlling the number of model parameters. Each block is composed of three 2-D convolutional layers with a kernel size of 1 × 1, k × k, and 1 × 1, where k is the kernel size of the middle convolutional layer in the residual block (see Fig. 2). The first and third layers reduce and increase the dimensions, whereas the second one analyzes the spatial information. The depth of the encoder depends on the spatial resolution of X VHR . The higher the X VHR spatial resolution, the deeper the encoder. When the spatial resolution of the input image is very high, the sensor captures many geometrical details, and neighboring pixels show a strong spatial correlation [38], [39]. To effectively process the spatial context information of the image and produce informative feature maps, the model requires a large receptive field. Thus, we use multiple stride convolutional layers to compress the spatial information of X VHR . The output of the encoder is defined by

B. Image Time-Series Analysis With 3-D Convolutional Layers
The processing of X TS with size w TS × h TS × I × b TS , where I is the number of images in the time series and b TS is the spectral band number, is challenging since it provides both spatial and temporal information. 2-D convolutional layers effectively analyze the spatial information of each image in the time series but do not model the temporal relationship between them. 2-D convolutional layers sum the contribution of each image in the time series, losing the temporal information. To process both the spatial and temporal information of X TS , we use 3-D convolutional layers [40] with 3-D kernels. Thus, we process X TS with a 3-D-CNN) with L layers (see Fig. 3). Each 3-D convolutional layer compresses the spatial and temporal information to increase the receptive field of the model in both the dimensions and obtain informative features. Since the spatial resolution of X TS is lower than the X VHR one, we need a smaller receptive field in the spatial dimensions. On the other side, the more images are in the time series, more layers are needed to model the temporal relationship. Therefore, the number of 3-D convolutional layers is a tradeoff that depends on the spatial resolution and the number of images in the time series. We use stride 3-D convolutional layers that compress the spatial and temporal information and enlarge the receptive field in all dimensions. Let us assume, for simplicity, that the spatial size of the image time series is w TS = h TS , the LS feature map size is w LS = h LS , the padding operation is applied to avoid dimension reduction due to the convolution operation, and all the division terms of the following equations are divisible. The spatial dimension of the 3-D-CNN output (i.e., w LS ) is defined as a function of the relationship between the spatial size of the input image time series and the spatial stride of the 3-D convolutional layers where s w is the stride value of the spatial dimensions in the 3-D convolutional layer. We can retrieve the number of layers L that are needed to achieve the spatial dimension w LS by exploiting (2).
Since in the proposed method the data time dimension is reduced to 1 to obtain feature maps providing both spatial and temporal information, we can use an equation similar to (2) to retrieve the number of layers L that are needed to fully process the temporal information where s t is the stride value of the temporal dimension in the 3-D-convolutional layers. Thus, the total number of 3-D convolutional layers for the joint analysis of spatial and temporal information is given by The output of the encoder is defined by where g(.) is the function of the 3-D-CNN that provides F feature maps of size w LS × h LS . The spatial resolution of Z TS ∈ R w LS ×h LS ×F has to be harmonized with the one of Z VHR to merge the processed data of the two sensors.

C. Merge of the Data in the Two Branches and Classification
We design the two branches of the DL model to obtain from each of them F -dimensional LS feature maps (i.e., Z VHR and Z TS ) with the same spatial dimensions w LS × h LS that are merged to process the information derived from the two sensors together. In this way, the output of the two branches provides a balanced contribution to the final classification, and the information of one sensor does not dominate the other. We concatenate Z VHR and Z TS to process them through a decoder and obtain the land-cover map. After the concatenation, we ob- We process Z ∈ R w LS ×w LS ×2F using residual blocks composed of atrous convolutional layers to further enlarge the receptive field of the model without compressing the spatial information of Z [41]. We then process Z through a decoder composed of residual blocks using deconvolutional layers to decompress the spatial information of the features and increase their spatial dimensions to achieve w VHR × h VHR . A 2-D convolutional layer with C kernels, where C is equal to the number of classes, of 1 × 1 ends the decoder and retrieves the land-cover map Y ∈ R w VHR ×h VHR ×C using a softmax activation function. The output of the DL model is defined by where d(.) defines the decoder function. We use a cross-entropy loss function to train the whole DL model as it proved its effectiveness in many SoA classification methods [42], [43], [44], [45]. We add to the loss function a weight contribution to regularize the model weights during the training and improve the classification performance. The final loss function is given by where λ is a constant that controls the weight regularization and W 2 2 represents the L2 normalization of the model weights W .

III. EXPERIMENTAL SETTINGS AND RESULTS
In this section, we introduce the datasets used to test the proposed method, present the experimental setup, and discuss the results.

A. Description of Dataset
To test the proposed method, we used two datasets composed of images acquired over the Trentino region in Italy. Both datasets exploit a VHR orthophoto acquired by aircraft. The first dataset combined the orthophoto with an image time series acquired by Sentinel-1, whereas the second dataset uses multitemporal Sentinel-2 images together with the orthophoto.
1) Dataset Combining VHR Orthophotos and Sentinel-1 Image Time Series: We tested the proposed method by analyzing the temporal information using a synthetic-aperture-radar image time series. For this dataset, we combined one VHR orthophoto with an image time series acquired by Sentinel-1 [see Fig. 5(c)]. The airborne orthophoto was acquired in September 2017 and was organized in 470 tiles. They have a spatial resolution of 20 cm/pixel that was down-scaled to 1 m/pixel to make the computation lighter. These tiles were acquired in the red, green, blue, and near-infrared spectral bands [see Fig. 5(a)]. We exploited a DTM to infer the slope degrees using [46]. The Sentinel-1 image time series is VV polarized and has a spatial resolution of 10 m/pixel. It was acquired from January  Table I), which may lead to some classification errors due to the training bias. We sampled the tiles and the image time series according to the reference map and obtained a dataset composed of 271 155 patches with a spatial size of 120 × 120 for the VHR orthophoto and 12 × 12 for the image time series. The dataset was split into a training set with 219 791 patches, a validation set with 24 352 patches, and a test set with 27 012 patches.
2) Dataset Combining VHR Orthophotos and Multitemporal Sentinel-2 Images: This dataset was created for a project aiming to classify the quality of Trentino pasture areas (see Fig. 4). It included five classes: three classes are about the quality of the pasture, and two are about the presence of other kinds of vegetation. The five classes can be divided into the pasture ones and no pasture ones. The pasture ones composed by the following .
2) Level 20: 80% of grass; shrub and rock presence until the 20%. 3) Level 50: 50% of grass; shrub and rock presence until the 50%. And the no pasture ones include the following. 1) Forest. 2) Impervious areas: Lands that cannot be used for pasture because of the steep slope or the strong shrub and rock presence. Note that the five classes are similar and all related to the grass or the presence of vegetation, so their discrimination is challenging both in the spectral and temporal domains. As we can observe in Table II, the proportion between the five classes is slightly unbalanced. This can lead to a bias training that may cause classification errors.
The dataset was composed of a VHR airborne orthophoto with the same spectral bands and acquisition characteristics of the previous dataset and multitemporal HR satellite images acquired by Sentinel-2. The VHR orthophoto provides the  spatial context information but not the spectral and temporal one to model the pasture phenological behavior over time. On the contrary, Sentinel-2 images provide lower spatial context information (they have a spatial resolution of 10 m/pixel) to discriminate the mentioned classes accurately since they are highly fragmented. Instead, they bring the temporal information since the time series includes images acquired from June 2018 to September 2018, which is the most significant period for the phenological characterization of alpine pastures. We exploit ten spectral bands of the Sentinel-2 images by excluding the one at 60 m/pixel that provide atmospheric information only. For each acquisition month, we chose the atmospherically corrected Sentinel-2 images with the least cloud-cover percentage. It is worth noting that even if the airborne data and the multitemporal images were acquired almost one year apart, this was not a problem from the application point of view since no relevant changes were expected in pastures during this timeframe. We randomly sampled only the areas where the classes had no cloud coverage on any of the four dates of the Sentinel-2 images. We sampled patches with spatial dimensions of 120 × 120 for the VHR orthophotos and 12 × 12 for the Sentinel-2 multitemporal images. These patch sizes guaranteed that, in each sample, both kinds of data represented the same geographical area. We obtained a dataset composed of 94 048 patches split into training with 76 245 patches, validation with 8 403 patches, and test with 9 400 patches.

B. Design of Experiments
We set up the first branch f (X VHR ) of the deep learning (DL) model a ResNet [37] with ten residual blocks having a central convolutional layer with k = 3 and as last layer a 2-D convolutional layer with a kernel size of 3 × 3. To have the same number of feature maps of f (X VHR ), g(X TS ) was composed by two 3-D convolutional layers with kernel sizes 3 × 3 × 3 and 3 × 3 × 2 and strides 1 × 1 × 2 and 2 × 2 × 2. The decoder was composed of seven residual blocks with k = 3 preceded by an upsampling layer that decompresses the spatial information. The last 2-D convolutional layer had a 1 × 1 kernel size and a filter number equal to the number of classes (see Table III). It exploited a softmax activation function to retrieve the land-cover map. Every convolutional layer (2-D and 3-D) was followed by a rectified linear unit activation function and batch normalization.
We trained the proposed DL model in a supervised way for E = 300 using the Adam optimizer [47] with a learning rate equal to 10 −4 . Due to hardware constraints, we used a batch size equal to 80. We set the weight decay λ = 5 · 10 −5 , which is a value used in the SoA to improve the generalization capability of the model [48]. We tested the effectiveness of the proposed method using the two datasets in three experiments.
1) Experiment 1-We compared the proposed method with SoA methods. Both the SoA and proposed method were trained using the hyperparameters mentioned previously. 2) Experiment 2-We observed the contribution provided by the temporal information for land-cover mapping by comparing the results of the proposed DL model with the ones of a model analyzing only a VHR image. This model was composed of the same layers of f (.) and d(.), but it did not include g(.) to process the temporal information and the concatenation step. Thus, we compared the outcome of the ResNet processing only the VHR images with the proposed method processing both VHR images and multitemporal data. 3) Experiment 3-We observed the contribution of the spatial information provided by the VHR orthophotos by comparing the proposed-method results with the outcomes of a model composed only of g(.) and d (.). Hence, this model processed only the temporal information of multitemporal images. For Experiments 2 and 3, we used the same hyperparameters of Experiment 1. We evaluated the performance of the land-cover mapping methods by considering the average overall accuracy (OA) [(true positives+true negatives)/number of labeled pixels], F1-score, and kappa coefficient (κ). In Experiments 2 and 3, we provided the confusion matrix and the F1 score per class.

C. Experiment 1: SoA Comparison
We compared the proposed method results with the SoA using the datasets defined in Section III-A. The comparison with SoA techniques was performed against the following: 1) a DL model analyzing both spatial and temporal information using convolutional and recurrent layers (TWINNS) [30], and two models analyzing multiresolution RS images; 2) one using a CNN and stacked autoencoder (DMIL) [32]; 3) the other one using two 2-D-CNNs (MrFusion) [33]. Since 2) analyzes only images with a homogeneous spatial resolution, we upsampled the Sentinel-2 images to have the spatial dimensions equal to the orthophotos.    Table IV) (e.g., improvement of 5.77% with respect to TWINNS). The quantitative results proved that the joint analysis of the spatial context and temporal information performed by 3-D convolutional layers is more efficient in modeling the information provided by image time series and improved the classification performance with respect to SoA methods splitting the spatial context and temporal information analysis, such as DMIL (i.e., improvement of 2.7% with 10.7 M of parameters less). Although MrFusion obtained comparable results with respect to the proposed method, it was more computationally demanding since it required almost six times more training parameters (i.e., 18.4 M instead of 3.2 M). In most cases, the proposed method obtained better classification results than the SoA methods with fewer training parameters (e.g., 3.2 M). Thus, it was more effective than the SoA methods in the analysis of multimodal multiresolution image time series. The qualitative results confirmed the quantitative ones. The discrimination between the pasture and forest areas was the most common classification error [e.g., Fig. 6(b), the bottom part of Fig. 7(d), the center of Fig. 8(b)]. However, the proposed method alleviated this problem by jointly analyzing the spatial context and temporal information [e.g., Fig. 6(e), the bottom part of Fig. 7(e), the center of Fig. 8(e)], although it overestimated water bodies.
2) SoA Comparison Using the Dataset VHR Orthophotos and Multitemporal Sentinel-2 Images: Using the dataset composed of VHR orthophotos and Sentinel-2 image time series, the proposed method achieved better classification performance than the SoA ones (see Table V). MrFusion [33] and TWINNS [30] were the SoA methods with the best F1 score (i.e., 80.92% and 80.69%, respectively). TWINNS analyzed the spatial and temporal information using a convolutional RNN with an attention mechanism, whereas MrFusion used a 2-D-CNN. The proposed method slightly improved the F1 score with respect to TWINNS and MrFusion. This proved how the joint analysis of spatial and temporal information of 3-D convolutional layers used in the proposed method improved the classification performance with Fig. 6. Comparison between (a) reference maps of the first Sentinel-1 dataset area and the classification maps retrieved using (b) TWINNS model [30], (c) DMIL model [32], (d) MrFusion model [33], and (e) proposed method using Sentinel-1 image time series (yellow is Fruit trees, red is Artificial areas, green is Pastures, dark green is Forest, blue is Water bodies, gray is Impervious areas, and white is masked pixels). respect to 2-D-CNN used in MrFusion, which processed only the spatial information. The joint analysis of the spatial and temporal resolution performed by the proposed method using the 3-D-CNN improved the F1-score of 3.82% with respect to DMIL [32] that processed only the temporal information using an SAE. The proposed method was less computationally demanding (i.e., 3.0 M parameters) than the SoA ones (i.e., 17.9 M for DMIL, 18.4 M for MrFusion, and 13.2 M for TWINNS) and it achieved the best classification performance with the lowest number of parameters (see Table V).
The qualitative results confirmed the quantitative ones. The proposed method better discriminated the difference between pasture and nonpasture classes than the SoA ones (e.g., the right side of the classification maps in Fig. 9). The inaccurate classification between Level 50, Forest, and Impervious areas was the most common error [e.g., right part of Fig. 9(c) and (d)] that was alleviated by the proposed method [see Fig. 9(e)]. We can observe how the analysis of both spatial and temporal information performed by 3-D convolutional layers performed by the proposed method achieved better results [e.g., the upper part of Fig. 10(e) and the bottom left part of Fig. 11(e)] in the three pasture classes discrimination than the 2-D convolutional layers that learned only spatial features and not temporal ones [see Figs. 10(d) and 11(d)].

D. Experiment 2: Effectiveness of the Temporal Information
We compared the results obtained by processing a single-date VHR image and using the proposed method to jointly process a VHR orthophoto and multitemporal data (i.e., Sentinel-1 image time series or multitemporal Sentinel-2 image) from the two datasets to observe if the inclusion of the temporal component in the spatial analysis improves the classification accuracy with respect to the use of single-date VHR image. We proved that the temporal information analysis performed by the proposed method improved the model classification performance with respect to the method processing only single-date VHR images in both datasets (see Table VI). The Sentinel-1 temporal information increased the OA of 4.86%, whereas the addition of multitemporal Sentinel-2 images improved the pasture classification OA of 1.81%. The temporal information analysis of the Sentinel-2 images allowed to better model the phenological behavior of the classes over time. The inclusion of the Sentinel-1 temporal information in the classification analysis improved the method capability in the minority class discrimination, such as Fig. 7. Comparison between (a) reference maps of the second Sentinel-1 dataset area and the classification maps retrieved using (b) the TWINNS model [30], (c) DMIL model [32], (d) MrFusion model [33], and (e) proposed method using Sentinel-1 image time series (Yellow is Fruit trees, red is Artificial areas, green is Pastures, dark green is Forest, blue is Water bodies, gray is Impervious areas, and white is masked pixels).
fruit trees, and further increased the classification performance with the majority classes [see Fig. 12(a) and (c)]. The proposed method tended to misclassify minority classes (i.e., artificial areas and water bodies) [see Fig. 12(c)] since few labeled samples belonging to these classes were in the training set leading to overfitting them. The F1 score of each class increased by adding the Sentinel-2 temporal information. We can observe that the proposed method classified better pasture and no pasture class than the one using only single-date VHR images [see Fig. 13(a)]. However, the proposed method underestimated the class Level 20 and overestimated Levels 0 and 50 [see Fig. 13(c)] due to their similarity. The qualitative results (see Figs. 14 and 15) confirmed the quantitative ones and showed the classification performance improvement by adding temporal to the spatial context analysis. In the first dataset, the Sentinel-1 temporal information analysis allowed us to characterize better the impervious areas and water bodies [e.g., left side of Fig. 15(b) and (c)]. In the second dataset, we can observe that the Sentinel-2 temporal information allowed us to discriminate better similar classes, such as Pasture Level 50 and Forest [e.g., the right side of Fig. 15(b) and (c)]. It also improved the classification of similar classes, such as Level 0, 20, and 50 [e.g., the bottom side of Fig. 15(b) and (c)].

E. Experiment 3: Effectiveness of Multiscale Information
We compared the results obtained by analyzing only Sentinel-1 image time series or multitemporal Sentinel-2 images with the proposed method ones using the datasets defined in Section III-A to prove the improvement obtained by processing multiresolution images. The use of multiresolution images sharply improved the classification performance with respect to multitemporal data analysis in both datasets (see Table VII). Sentinel-1 image time series and multitemporal Sentinel-2 images did not provide many geometrical details due to their spatial resolution. Thus, the Fig. 8. Comparison between (a) reference maps of the third Sentinel-1 dataset area and the classification maps retrieved using (b) TWINNS model [30], (c) DMIL model [32], (d) MrFusion model [33], and (e) proposed method using Sentinel-1 image time series (yellow is Fruit trees, red is Artificial areas, green is Pastures, dark green is Forest, blue is Water bodies, gray is Impervious areas, and white is masked pixels). Fig. 9. Comparison between (a) reference maps of the first Sentinel-2 dataset area, and the classification maps retrieved using (b) TWINNS model [30], (c) DMIL model [32], (d) MrFusion model [33], and (e) proposed method (yellow is Level 0, red is Level 20, blue is Level 50, dark green is Forest, gray is Impervious areas, and white is masked pixels). Fig. 10. Comparison between (a) reference maps of the second NDVI dataset area, and the classification maps retrieved using (b) TWINNS model [30], (c) DMIL model [32], (d) MrFusion model [33], and (e) proposed method (yellow is Level 0, red is Level 20, blue is Level 50, dark green is Forest, gray is Impervious areas, and white is masked pixels). Fig. 11. Comparison between (a) reference maps of the third NDVI dataset area, and the classification maps retrieved using (b) TWINNS model [30], (c) DMIL model [32], (d) MrFusion model [33], and (e) proposed method (yellow is Level 0, red is Level 20, blue is Level 50, dark green is Forest, gray is Impervious areas, and white is masked pixels).   Comparison between (a) reference maps of the first dataset, and the classification maps retrieved by (b) processing only VHR single-date orthophotos, and the proposed method processing VHR orthophotos and (c) Sentinel-1 image time series (yellow is Fruit trees, red is Artificial areas, green is Pastures, dark green is Forest, blue is Water bodies, gray is Impervious areas, and white is masked pixels). Fig. 15. Comparison between (a) reference maps of the second dataset, and the classification maps retrieved by (b) processing only VHR single-date orthophotos, and (c) proposed method processing VHR orthophotos and multitemporal Sentinel-2 images (yellow is Level 0, red is Level 20, blue is Level 50, dark green is Forest, gray is Impervious areas, and white is masked pixels). Fig. 16. Comparison between (a) reference maps of the first dataset, and the classification maps retrieved by (b) processing only Sentinel-1 image time series, and (c) proposed method processing VHR orthophotos and Sentinel-1 image time series (yellow is Fruit trees, red is Artificial areas, green is Pastures, dark green is Forest, blue is Water bodies, gray is Impervious areas, and white is masked pixels). Fig. 17. Comparison between (a) reference maps of the second dataset, and the classification maps retrieved by (b) processing only multi-temporal Sentinel-2 images, and (c) proposed method processing VHR orthophotos and multitemporal Sentinel-2 images (yellow is Level 0, red is Level 20, blue is Level 50, dark green is Forest, gray is Impervious areas, and white is masked pixels). classification accuracy decreased (i.e., the OA of the first dataset is 73.79%, while the OA of the second one is 72.98%). The introduction of VHR images in the classification analysis added the spatial information unavailable in the multitemporal data and improved the discrimination capability of the classification method of 16.65% and 8.27% for the first and second datasets, respectively. It is possible to observe that F1 score of each class in both datasets improved with the VHR orthophoto inclusion in the classification process (see Figs. 12 and 13). The first dataset classes with fewer training samples and requiring high spatial resolution for the modeling were misclassified [see Fig. 12(b)]. The use of VHR images improved the discrimination of these classes using the same number of training samples. This proved the relevance of the multiresolution analysis.
As we can observe in the qualitative results, the multiscale classification analysis better improved the discrimination between Pastures, Forests, Water Bodies, and Impervious areas, in the first dataset [e.g., Fig. 16(b) and (c)]. These classes are the most fragmented ones and are challenging to discriminate accurately due to the spatial resolution of Sentinel-1 images. In the second dataset, the confusion between Forest and Level 50 was one of the most common classification errors [e.g., the right side of Fig. 17(b)]. This was due to the slight difference in the tree density between Level 50 and Forest, which is difficult to observe in Sentinel-2 images. Class Level 50 presented many trees in some areas and can be easily confused with the Forest class. The poor spatial resolution also caused problems in the discrimination of similar pasture classes (i.e., Level 0, Level 20, and Level 50) given their similarity [e.g., the middle-left part of Fig. 17(b)]. The multiscale analysis performed by the proposed method alleviated these problems [e.g., the middle-left part of Fig. 17(c)].

IV. CONCLUSION
In this article, we proposed a supervised DL classification method that processes multisensor and multiresolution RS image time series to combine the spatial context information derived by VHR images and the temporal information obtained from the HR image time series and retrieve a land-cover map. We tested our method using two datasets composed of one VHR orthophoto, Sentinel-2, or Sentinel-1 image time series acquired over the Trentino region in Italy. In the tests, we observed the improvement provided by the multiscale analysis in the correct discrimination of complex classes characterized by small objects. We proved that the temporal information analysis improved the modeling of classes with behavior varying over the year that cannot be represented by a single-date image. The proposed method improved the modeling of the temporal information with respect to the SoA ones and achieved more accurate classification performance using a less computationally demanding model. This improvement is due to the fusion of multiscale features extracted by multisensor multiresolution images and temporal features retrieved by analyzing the image time series with 3-D convolutional layers. These layers proved to better model spatial and temporal information than 2-D convolutional and SAE layers exploited in SoA classification methods.
In future activities, we plan to test the proposed method on other kinds of classes using images acquired by different sensors. Moreover, we want to exploit convolutional LSTM layers instead of 3-D convolutional ones to observe if they optimize the modeling of spatial and temporal information.